Spotify, the most popular music streaming service today, is a constant companion for me. Not only does Spotify give me access to all my favorite songs wherever I am (work, home, in the car), it has also introduced me to artists that I would never have listened to before, in genres that I had never experienced. You could say that Spotify has really allowed me to expand my musical tastes.
On the technical side, Spotify provides an interesting look into their listening data. Not just the popularity of tracks, but also features of the tracks they have in their library.
The question we’ll be looking at is: Can we predict a track’s popularity from key features about the song?.
The data comes from Kaggle1 and was last retreived April 2019.
spotify <- read_csv("SpotifyAudioFeaturesApril2019.csv")
There are 130663 tracks in the dataset, each representing a different song. The popularity ratings range from 0-100, with the top 10 most popular songs being:
spotify %>% top_n(10,popularity) %>% select(artist_name, track_name, popularity) %>% arrange(desc(popularity))
## # A tibble: 13 x 3
## artist_name track_name popularity
## <chr> <chr> <dbl>
## 1 Daddy Yankee Con Calma 100
## 2 Post Malone Wow. 98
## 3 Jonas Brothers Sucker 98
## 4 Billie Eilish bad guy 98
## 5 Post Malone Sunflower - Spider-Man: Into the Spider-Verse 98
## 6 Ariana Grande break up with your girlfriend, i'm bored 97
## 7 Ariana Grande 7 rings 96
## 8 Sam Smith Dancing With A Stranger (with Normani) 96
## 9 Halsey Without Me 96
## 10 Marshmello Happier 96
## 11 Ava Max Sweet but Psycho 96
## 12 Lady Gaga Shallow 96
## 13 Pedro Capó Calma - Remix 96
The distribution of popularity seems skewed negatively, with few songs reaching the highest numbers:
ggplot(spotify,aes(y=popularity)) + geom_boxplot() +
labs(title="Distribution of Song Popularity",
subtitle="Spotify, April 2019")
The dataset gives us several variables that we can use as potential predictors in a linear regression. Let’s see if we can find a few that seem to correlate well with the popularity:
cor(spotify[,-1:-3])
## acousticness danceability duration_ms energy
## acousticness 1.00000000 -0.36046193 0.0334261930 -0.71006657
## danceability -0.36046193 1.00000000 -0.1267811312 0.28619581
## duration_ms 0.03342619 -0.12678113 1.0000000000 -0.01988486
## energy -0.71006657 0.28619581 -0.0198848640 1.00000000
## instrumentalness 0.27268471 -0.30511231 0.0291243918 -0.30130753
## key -0.01998735 0.02112342 -0.0018797030 0.03984267
## liveness -0.10054515 -0.13737651 -0.0036606332 0.20944774
## loudness -0.60336645 0.43155440 -0.0185947302 0.76669747
## mode 0.06717127 -0.05791183 0.0103208525 -0.06926311
## speechiness -0.11923149 0.24819202 -0.1019552197 0.10507757
## tempo -0.21632796 0.08179074 -0.0096571687 0.22992988
## time_signature -0.16531911 0.20632785 0.0210066739 0.16503012
## valence -0.17702336 0.46146765 -0.1418368926 0.31476839
## popularity -0.11651999 0.13108558 -0.0008010133 0.12250614
## instrumentalness key liveness loudness
## acousticness 0.27268471 -0.019987346 -0.100545151 -0.60336645
## danceability -0.30511231 0.021123423 -0.137376506 0.43155440
## duration_ms 0.02912439 -0.001879703 -0.003660633 -0.01859473
## energy -0.30130753 0.039842673 0.209447743 0.76669747
## instrumentalness 1.00000000 -0.025072393 -0.058389622 -0.50851922
## key -0.02507239 1.000000000 0.009191355 0.02810077
## liveness -0.05838962 0.009191355 1.000000000 0.06216759
## loudness -0.50851922 0.028100770 0.062167594 1.00000000
## mode -0.00221131 -0.176237900 -0.001324836 -0.03608064
## speechiness -0.21735871 0.010353850 0.106801421 0.07445622
## tempo -0.08689433 0.005463959 -0.009125933 0.22306701
## time_signature -0.08422338 0.008877675 -0.018307024 0.17967938
## valence -0.24686882 0.043347523 -0.007800342 0.31988077
## popularity -0.21644701 0.002681513 -0.031174152 0.24408756
## mode speechiness tempo time_signature
## acousticness 0.0671712653 -0.1192314897 -0.2163279628 -0.165319110
## danceability -0.0579118329 0.2481920170 0.0817907432 0.206327855
## duration_ms 0.0103208525 -0.1019552197 -0.0096571687 0.021006674
## energy -0.0692631146 0.1050775674 0.2299298846 0.165030125
## instrumentalness -0.0022113098 -0.2173587142 -0.0868943260 -0.084223383
## key -0.1762378997 0.0103538497 0.0054639590 0.008877675
## liveness -0.0013248364 0.1068014212 -0.0091259328 -0.018307024
## loudness -0.0360806388 0.0744562224 0.2230670134 0.179679384
## mode 1.0000000000 -0.0535544493 -0.0002488567 -0.036244213
## speechiness -0.0535544493 1.0000000000 0.0548271079 0.053706516
## tempo -0.0002488567 0.0548271079 1.0000000000 0.083758714
## time_signature -0.0362442132 0.0537065159 0.0837587140 1.000000000
## valence 0.0110817073 0.1215523213 0.1048569713 0.069161843
## popularity -0.0090698371 -0.0002138378 0.0370752117 0.064939422
## valence popularity
## acousticness -0.177023355 -0.1165199935
## danceability 0.461467645 0.1310855768
## duration_ms -0.141836893 -0.0008010133
## energy 0.314768394 0.1225061432
## instrumentalness -0.246868816 -0.2164470114
## key 0.043347523 0.0026815127
## liveness -0.007800342 -0.0311741517
## loudness 0.319880771 0.2440875551
## mode 0.011081707 -0.0090698371
## speechiness 0.121552321 -0.0002138378
## tempo 0.104856971 0.0370752117
## time_signature 0.069161843 0.0649394223
## valence 1.000000000 0.0143026691
## popularity 0.014302669 1.0000000000
Some of the variables most correlated to Popularity seem to be: acousticness, danceability, energy, instrumentalness, and loudness.
Also, looking at key:
spotify %>% group_by(key) %>% select(key, popularity) %>%
ggplot(aes(x=as.factor(key),y=popularity,fill=as.factor(key))) +
geom_boxplot() + theme(legend.position = "none") +
labs(title="Popularity of Song Keys", subtitle = "by Pitch Classes",
x="Key/Pitch Class", y="Popularity")
There seems to be little differentiation between the keys and popularity. However, one key does seem to have a larger number of popular (> 70) songs in it:
spotify %>% filter(popularity > 70) %>% group_by(key) %>% summarize(count = n()) %>% arrange(desc(count))
## # A tibble: 12 x 2
## key count
## <dbl> <int>
## 1 1 262
## 2 0 188
## 3 7 166
## 4 11 166
## 5 6 161
## 6 2 143
## 7 8 138
## 8 5 129
## 9 10 128
## 10 9 124
## 11 4 116
## 12 3 37
Key 1 (which applies to c-Sharp and d-Flat) has more popular songs than other keys. That could be a potential predictor variable, so we will encode a new binary variable we can use in our model.
spotify$isKey1 <- as.integer(spotify$key == 1)
We’ll start with these variables and see if we can build a model that is at least somewhat useful for predicting a song’s popularity on Spotify.
First we try all the identified variables. We’ll also add an interaction term for loudness and energy, as they seem like they may have a relationship.
m1 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + isKey1 + (energy * loudness), data=spotify)
summary(m1)
##
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy +
## instrumentalness + loudness + isKey1 + (energy * loudness),
## data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.405 -15.113 -1.894 12.766 81.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.22300 0.44190 81.971 < 2e-16 ***
## acousticness -0.79112 0.22154 -3.571 0.000356 ***
## danceability 2.30756 0.31538 7.317 2.56e-13 ***
## energy -2.96071 0.43944 -6.737 1.62e-11 ***
## instrumentalness -5.44163 0.17175 -31.684 < 2e-16 ***
## loudness 0.61738 0.01650 37.419 < 2e-16 ***
## isKey1 0.17596 0.16291 1.080 0.280091
## energy:loudness 0.92506 0.03015 30.680 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.86 on 130655 degrees of freedom
## Multiple R-squared: 0.08501, Adjusted R-squared: 0.08496
## F-statistic: 1734 on 7 and 130655 DF, p-value: < 2.2e-16
Looking at the model summary, we see that all of our predictor variables are significant with the exception of isKey1 the encoded variable we added. The F-statistic shows that our model does have some significance over the mean. Our Adjusted R-squared, however is a paltry 0.0849589.
Since the isKey1 variable isn’t significant, we should remove it from the model. We could also put another interaction variable in the model between acousticness and instrumentalness and see if the model improves.
m2 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + (energy * loudness) + (acousticness * instrumentalness), data=spotify)
summary(m2)
##
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy +
## instrumentalness + loudness + (energy * loudness) + (acousticness *
## instrumentalness), data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.553 -14.884 -1.867 12.762 76.220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.78322 0.44644 86.872 <2e-16 ***
## acousticness -4.78146 0.25011 -19.117 <2e-16 ***
## danceability 2.61564 0.31389 8.333 <2e-16 ***
## energy -5.52234 0.44425 -12.431 <2e-16 ***
## instrumentalness -10.89381 0.23642 -46.079 <2e-16 ***
## loudness 0.75877 0.01696 44.733 <2e-16 ***
## energy:loudness 0.64124 0.03120 20.550 <2e-16 ***
## acousticness:instrumentalness 13.52540 0.40495 33.400 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.78 on 130655 degrees of freedom
## Multiple R-squared: 0.09275, Adjusted R-squared: 0.0927
## F-statistic: 1908 on 7 and 130655 DF, p-value: < 2.2e-16
Here we see the Adjusted R-Squared jumped a little to 0.0926974. Our new interaction term seems significant as well. Perhaps another one is called for? Let’s see if there is value to an interaction variable composed of energy and danceability.
m3 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + (energy * loudness) + (acousticness * instrumentalness) + (energy * danceability), data=spotify)
summary(m3)
##
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy +
## instrumentalness + loudness + (energy * loudness) + (acousticness *
## instrumentalness) + (energy * danceability), data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.281 -14.840 -1.867 12.728 75.889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.49640 0.58935 60.230 <2e-16 ***
## acousticness -4.65081 0.25051 -18.565 <2e-16 ***
## danceability 7.87817 0.69152 11.392 <2e-16 ***
## energy 0.69415 0.85272 0.814 0.416
## instrumentalness -10.89881 0.23635 -46.112 <2e-16 ***
## loudness 0.71081 0.01786 39.791 <2e-16 ***
## energy:loudness 0.75571 0.03395 22.258 <2e-16 ***
## acousticness:instrumentalness 13.65674 0.40513 33.709 <2e-16 ***
## danceability:energy -9.70001 1.13584 -8.540 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.77 on 130654 degrees of freedom
## Multiple R-squared: 0.09325, Adjusted R-squared: 0.0932
## F-statistic: 1680 on 8 and 130654 DF, p-value: < 2.2e-16
Interestingly the new interaction term is significant in this model, but the energy variable itself ceased to be. R will not allow us to drop the main variable and keep it’s intereaction variable, so we’ll keep the model without the new interaction (m2).
So, how good is our model? Judging by the Adjusted R-squared, not too great. But perhaps it is somewhat useful. Let’s look at the residuals:
am2 <- augment(m2)
am2 %>% ggplot(aes(x=.fitted,y=.resid)) +
geom_point(alpha=0.1) + geom_hline(yintercept=0) +
labs(title="Residual Plot")
am2 %>% ggplot(aes(sample=.std.resid)) +
geom_qq() + geom_qq_line() +
labs(title="QQ Plot")
The residual plot is interesting and a definite pattern is there. The QQ plot shows that our residuals aren’t exactly normal throughout the range of samples.
Given the low Adjusted R-squared, and unusual patterns in the residuals, the models we’ve created seem like they are unsuitable for predicting a song’s popularity on Spotify; sorry music executives!