In my previous post, I introduced the Spotify Charts dataset. To do a quick “Last Week on Westworld”-style recap:
- Each day, Spotify shares the 200 most-streamed songs of the day globally and in 60 countries.
- These top 200 lists are archived back to 2017.
- This list provides streaming count numbers for each song that gets 1000 or more streams.
- Stream counts are made available only for the 200 most-streamed songs, so this Spotify Charts data represents just a portion of total Spotify streaming.
- When comparing the streaming totals of the 60 countries in the dataset, the United States, Mexico, and Brazil had the most total streams in the 102-day period from 1/1/2020 to 4/11/2020.
- The total streams were normalized by population and the highest-total streaming nations were overtaken by smaller countries when comparing streams per capita. Nordic and Oceanic countries in particular had high per capita streaming totals.
Now onto the new episode…
When I was looking at the streams per capita, I couldn’t stop thinking that the list also looked a lot like the Quality-of-Life Index (QLI) rankings; everywhere you look, it seems people are talking about how great it is to live in Scandinavia. You could chalk this up as proof that music is a key to happiness, but I was thinking that it makes sense that the same countries that are consistently viewed as the best places to live would also stream the most music.
The QLI is based on several factors, including personal wealth, political freedom, and job security. Because the primary requirements to stream music on Spotify are an electronic device capable of the app (computer or smartphone) and a high-speed internet connection, it certainly makes sense that access to these two things would be related to streaming numbers.
Pulling from the World Bank World Development Indicators, we can glean a great deal of statistics on a country, from Gross Domestic Product to total alcohol consumption per capita (it’s Moldova, in case you were wondering). Since access to internet is so crucial to being able to stream, I figured I’d start by taking a look at our 60 Spotify Charts countries and see what their internet access looked like.
Spotify Charts and Internet Access
The above histogram shows the breakdown of internet access for a population by country. For example, the right-most bar represents the bin of 90-100% of population with internet, and it shows that there are 8 such countries in our dataset. Each remaining bin represents another 10% range. In total, 53 of the 60 countries in the dataset offer at least half of their population access to internet.
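The binning behind a histogram like this can be sketched with NumPy. This is only an illustration: the percentages below are made-up placeholder values, not the World Bank figures used in this post, and the variable names are my own.

```python
import numpy as np

# Hypothetical internet-access percentages (NOT the real World Bank values).
internet_pct = np.array(
    [34.5, 48.0, 55.2, 61.0, 67.3, 72.8, 74.1, 81.0, 84.4, 91.2, 95.6]
)

# Ten bins, each covering a 10% range: 0-10, 10-20, ..., 90-100.
bin_edges = np.arange(0, 101, 10)
counts, _ = np.histogram(internet_pct, bins=bin_edges)

# Number of countries where at least half the population has internet access.
at_least_half = int((internet_pct >= 50).sum())

for low, high, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{int(low)}-{int(high)}%: {int(n)} countries")
print("Countries with at least 50% access:", at_least_half)
```

With the real data, the last bin would hold the 8 countries in the 90-100% range, and the count of countries at or above 50% would come out to 53.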
In last week’s post, I used the Python package Seaborn to generate my plots. A friend recommended that I check out Bokeh and its interactive plot generators. This package is absolutely incredible! All of the following plots are interactive, so you can hover over data points to display information on the point and even click the point to isolate it. I had to move the whole website over to a host that could handle the custom HTML, so hope y’all enjoy!
First, I wanted to see if there was any sort of relationship between a population’s internet access and their wealth. The plot below shows the relationship between the percentage of a country’s population with internet access and the nation’s gross national income (GNI) per capita. This plot only includes the countries given in our Spotify Charts dataset.
We can see that as GNI increases, so too does the percent of the population with internet access. Since a population’s internet access can’t exceed 100%, the curve tapers off as the country nears full access. This behavior can be expressed with a logarithmic function, and a trendline can be fit to this data using the sklearn package.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import r2_score
# Take the natural log of GNI so the model is linear in ln(x)
transformer = FunctionTransformer(np.log, validate=True)
X = df['GNI'].values.reshape(-1, 1)
y = df['PercentOfPopulationWithInternet'].values.reshape(-1, 1)
X_log = transformer.fit_transform(X)
# Fit y = b + m * ln(x)
regressor = LinearRegression()
regressor.fit(X_log, y)
y_reg = regressor.predict(X_log)
b = regressor.intercept_
m = regressor.coef_
# Coefficient of determination for the fitted trendline
r2 = r2_score(y, y_reg)
By transforming the x axis to a log scale, we can then use sklearn’s linear regression function to find the intercept and coefficient of the trendline. This trendline takes the form,
y(x) = b + m * ln(x)
The coefficient of determination is represented with the R^2 value. The R^2 for this logarithmic trendline is 0.816, which suggests a pretty good fit. Below we can see the same data plotted with the trendline we just calculated.
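For anyone curious what that R^2 number actually measures, here is a rough sketch that fits the same kind of logarithmic trendline and computes R^2 by hand as 1 - SS_res / SS_tot. I'm using NumPy's polyfit here instead of sklearn just to keep it short, and the data points are invented for illustration; they are not the values from my dataset.

```python
import numpy as np

# Hypothetical (GNI, internet %) pairs for illustration only.
gni = np.array([1700.0, 5000.0, 12000.0, 25000.0, 45000.0, 62000.0])
internet = np.array([30.0, 52.0, 66.0, 80.0, 86.0, 95.0])

# Fit internet = b + m * ln(gni), i.e. a straight line in ln(gni).
# polyfit returns coefficients from highest degree to lowest: [m, b].
m, b = np.polyfit(np.log(gni), internet, deg=1)

# Trendline predictions at the observed GNI values.
predicted = b + m * np.log(gni)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((internet - predicted) ** 2)
ss_tot = np.sum((internet - internet.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"m = {m:.2f}, b = {b:.2f}, R^2 = {r2:.3f}")
```

An R^2 near 1 means the trendline accounts for most of the variation in the data; an R^2 near 0 means it does little better than just guessing the average.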
Streaming Numbers and QLI
Wow – that was a rabbit hole. Now that we’ve seen that there is a relationship between a nation’s wealth and its population’s access to internet, we can get to the original point of this post: seeing if there is any sort of relationship between a nation’s streaming and its QLI.
Unfortunately, one of the first things we can see is that not all of the countries are included in the QLI rankings. Of the 60 Spotify Charts countries, nine are not given a QLI ranking (I guess The Economist forgot that Latin America existed):
- Nicaragua
- Honduras
- El Salvador
- Dominican Republic
- Bolivia
- Vietnam
- Paraguay
- Costa Rica
- Guatemala
When we plot the streams per capita vs QLI, we don’t really see a very clear trend. The data can be best fit with a polynomial curve. If we go back to our friend sklearn, we can run the regression using,
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
X = df['QualityOfLife'].values.reshape(-1, 1)
y = df['StreamsPerCapita'].values.reshape(-1, 1)
# Expand QLI into polynomial terms (1, x, x^2 for degree=2)
polynomial_feature = PolynomialFeatures(degree=2)
X_poly = polynomial_feature.fit_transform(X)
# A linear regression on the polynomial terms gives the polynomial fit
regressor = LinearRegression()
regressor.fit(X_poly, y)
y_reg = regressor.predict(X_poly)
b = regressor.intercept_
m = regressor.coef_
r2 = r2_score(y, y_reg)
If we play with the “degree” term in the above code, we can see that no matter how high we make the degree of our curve, we’re going to get a crappy fit. The R^2 for a quadratic (degree=2) trendline is 0.350, while the R^2 for a cubic (degree=3) trendline is 0.352 – not worth it! There just isn’t that close a relationship between streams per capita and QLI. I was wrong!
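Playing with the degree can be done in a loop rather than by hand. The sketch below uses randomly generated stand-in data (not the QLI or Spotify values), so the R^2 numbers it prints will not match the ones quoted above; the point is only to show the pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (NOT the QLI or Spotify values from this post).
rng = np.random.default_rng(0)
X = rng.uniform(4.0, 8.5, size=50).reshape(-1, 1)
y = 5.0 * X.ravel() + rng.normal(0.0, 15.0, size=50)

# Fit a polynomial trendline for each degree and record its R^2.
r2_by_degree = {}
for degree in (1, 2, 3, 4):
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    r2_by_degree[degree] = r2_score(y, model.predict(X_poly))

for degree, r2 in r2_by_degree.items():
    print(f"degree={degree}: R^2 = {r2:.3f}")
```

One thing to keep in mind: when R^2 is measured on the same data used for the fit, adding a higher-degree term can never make it go down. So a bump as small as 0.350 to 0.352 is a sign that the extra term isn’t capturing any real structure.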
Streaming Numbers and Wealth
Although QLI and streams per capita didn’t turn out to be a very good fit, there is one more metric we can check. As mentioned above, one of the factors QLI is based on is personal wealth, which can be expressed with GNI. We can plot GNI versus streams per capita and see if there’s any discernible relationship. An added benefit of looking at this relationship with GNI is that all of the Spotify Charts countries in our dataset are included, unlike in the QLI.
In this chart we plot GNI versus streams per capita. We can see a scatter of data points, loosely following a linear trend. There is a cluster of points with GNI less than $20,000 and per capita streams less than 40. All in all, there are 35 such countries, over half of our dataset.
This graph also shows the wide gap between countries’ streaming habits (and their wealth). Norway is at the far upper right of our chart, representing the second-highest GNI and highest per capita streaming, with nearly $62,000 of income and 95 streams per person. India falls on the other end of the spectrum, with a GNI of less than $1,700 and about 0.6 streams per capita.
To get an idea of how closely these parameters follow a linear relationship, we can again use the sklearn package to fit a linear trendline to our data.
from sklearn.linear_model import LinearRegression
X = df2['GNI'].values.reshape(-1, 1)
y = df2['StreamsPerCapita'].values.reshape(-1, 1)
# Fit the linear trendline y = m*x + b
regressor = LinearRegression()
regressor.fit(X, y)
b = regressor.intercept_
m = regressor.coef_
# score() returns R^2 for the fitted model, so there is no need to refit
r2 = regressor.score(X, y)
With the linear regression complete, we can find the slope (m) and intercept (b) of the line and superimpose this trendline onto our scatterplot from earlier. Since this is a linear relationship, our trendline takes the form,
y = m*x + b
In the above graph, we can see that our linear trendline doesn’t do that great a job of capturing the relationship between GNI and streaming per capita. The coefficient of determination is 0.483, not very good.
Although this trendline (or “model”) isn’t a great predictor, it does provide us a new perspective through which we can view our data. If the red trendline represents the path that most closely expresses the dataset, then the larger the distance from the trendline to a point (A.K.A. the “residual”), the more the point deviates from the model.
When looking at these residuals, we see that the model very accurately predicts some countries’ streaming numbers. For example, Portugal, Italy, and Singapore all fall very close to the trendline and have a residual of nearly zero.
Many countries don’t land very close to this line and have larger residuals. Again, we can’t use this trendline to confidently predict the streaming totals of a country based on its GNI, but we can see which countries deviate most from the model’s predictions.
In the residual plot below, the predicted streams per capita were subtracted from the actual streams per capita.
One way to think about this plot is that any country above the red trendline is streaming more than the model predicted, and any country below is streaming less.
With this in mind, we can see our biggest streaming overachiever, Chile. When considered in this context, the country’s streaming prowess – first realized last week – is even more marked, with 53 more streams per capita than predicted. Similarly, the biggest underachiever is Switzerland, streaming 36 fewer songs than the model anticipates. The United States is among the underachievers, falling almost 24 streams short of the trendline.
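The residual calculation itself is only a couple of lines. Here is a minimal sketch, assuming a plain linear fit; the country labels and numbers are hypothetical placeholders rather than values from my dataset, and I’m using NumPy’s polyfit in place of sklearn to keep it compact.

```python
import numpy as np

# Hypothetical countries and values for illustration only (not the post's data).
countries = ["A", "B", "C", "D", "E"]
gni = np.array([5000.0, 15000.0, 30000.0, 45000.0, 60000.0])
streams_per_capita = np.array([12.0, 40.0, 35.0, 50.0, 90.0])

# Fit the linear trendline y = m*x + b.
m, b = np.polyfit(gni, streams_per_capita, deg=1)

# Residual = actual streams per capita - predicted streams per capita.
predicted = m * gni + b
residuals = streams_per_capita - predicted

# Positive residual: streaming more than the model predicts; negative: less.
overachiever = countries[int(np.argmax(residuals))]
underachiever = countries[int(np.argmin(residuals))]

for name, res in zip(countries, residuals):
    print(f"{name}: residual = {res:+.1f}")
print("Biggest overachiever:", overachiever)
print("Biggest underachiever:", underachiever)
```

For a least-squares line with an intercept, the residuals always sum to zero, so the overachievers and underachievers balance each other out by construction.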
Conclusion
What does all this mean? Maybe nothing: our R^2 value isn’t very good for the relationship between QLI and streams per capita, which kind of pokes a hole in my original hypothesis. We can say that wealthier countries tend to have higher streaming totals than poorer countries. This may be attributed to a number of factors. However, given the correlation between GNI and access to internet, one of these reasons could be that fewer people are streaming in poorer countries because they don’t have the necessary devices or internet access.
Another thing that became even clearer is just how far ahead of the pack Chile is in their streaming numbers; I mean, they are dominating. When comparing their GNI-adjusted residual, they are outstreaming their closest competitor (and Spotify’s homecourt), Sweden, by 10 streams per capita. One factor that could be playing into this is their above-the-curve internet access. 82% of the population has access to internet, putting it in the same range as wealthier countries like France (81%), Germany (84%), and Japan (85%).
But to attribute this all to internet access doesn’t seem entirely fair; Chileans just seem to be listening to a lot of music on Spotify. Peruse your favorite artist’s homepage. Chances are, Santiago will show up on the “Where people listen” list. This isn’t just true for Spanish-speaking urbano and pop stars like Bad Bunny and Rauw Alejandro (Justin Bieber, if he could actually speak Spanish); even Cannibal Corpse is getting some of the Santiago shine.
When you’re done marveling at the beautiful segmental concrete bridge in their picture or reading some of the critical acclaim in their bio, take a look at what cities listen to Cannibal Corpse the most. Santiago outstreams Mexico City, Chicago, and many other metropoles with larger populations. The same holds true for Bad Bunny, Rauw Alejandro, and dozens of other top artists. Spotify Charts just wouldn’t be the same without Chile.
In short, although we couldn’t find a clear predictor of what makes one country stream more than another, we were able to use regression modeling to provide new insight into assessing how countries are streaming relative to one another, considering factors besides just population. With these more nuanced lenses, investigations can be focused on these successful streaming areas and lessons can be learned. To Spotify execs: figure out what you’re doing in Chile and replicate it in other countries, because it’s clearly working.
You should do a time-based approach for this. I would think the stream count would be much higher than predicted for a country when a highly anticipated new album drops. Would be cool to line up any large swings with certain albums that came out around that time.