Introduction

Spotify, the most popular music streaming service today, is a constant companion for me. Not only does Spotify give me access to all my favorite songs wherever I am (work, home, in the car), it has also introduced me to artists that I would never have listened to before, in genres that I had never experienced. You could say that Spotify has really allowed me to expand my musical tastes.

On the technical side, Spotify provides an interesting look into their listening data. Not just the popularity of tracks, but also features of the tracks they have in their library.

The question we’ll be looking at is: Can we predict a track’s popularity from key features about the song?.

The Data

The data comes from Kaggle1 and was last retreived April 2019.

spotify <- read_csv("SpotifyAudioFeaturesApril2019.csv")

There are 130663 tracks in the dataset, each representing a different song. The popularity ratings range from 0-100, with the top 10 most popular songs being:

spotify %>% top_n(10,popularity) %>% select(artist_name, track_name, popularity) %>% arrange(desc(popularity))
## # A tibble: 13 x 3
##    artist_name    track_name                                    popularity
##    <chr>          <chr>                                              <dbl>
##  1 Daddy Yankee   Con Calma                                            100
##  2 Post Malone    Wow.                                                  98
##  3 Jonas Brothers Sucker                                                98
##  4 Billie Eilish  bad guy                                               98
##  5 Post Malone    Sunflower - Spider-Man: Into the Spider-Verse         98
##  6 Ariana Grande  break up with your girlfriend, i'm bored              97
##  7 Ariana Grande  7 rings                                               96
##  8 Sam Smith      Dancing With A Stranger (with Normani)                96
##  9 Halsey         Without Me                                            96
## 10 Marshmello     Happier                                               96
## 11 Ava Max        Sweet but Psycho                                      96
## 12 Lady Gaga      Shallow                                               96
## 13 Pedro Capó     Calma - Remix                                         96

The distribution of popularity seems skewed negatively, with few songs reaching the highest numbers:

ggplot(spotify,aes(y=popularity)) + geom_boxplot() +
  labs(title="Distribution of Song Popularity",
       subtitle="Spotify, April 2019")

The dataset gives us several variables that we can use as potential predictors in a linear regression. Let’s see if we can find a few that seem to correlate well with the popularity:

cor(spotify[,-1:-3])
##                  acousticness danceability   duration_ms      energy
## acousticness       1.00000000  -0.36046193  0.0334261930 -0.71006657
## danceability      -0.36046193   1.00000000 -0.1267811312  0.28619581
## duration_ms        0.03342619  -0.12678113  1.0000000000 -0.01988486
## energy            -0.71006657   0.28619581 -0.0198848640  1.00000000
## instrumentalness   0.27268471  -0.30511231  0.0291243918 -0.30130753
## key               -0.01998735   0.02112342 -0.0018797030  0.03984267
## liveness          -0.10054515  -0.13737651 -0.0036606332  0.20944774
## loudness          -0.60336645   0.43155440 -0.0185947302  0.76669747
## mode               0.06717127  -0.05791183  0.0103208525 -0.06926311
## speechiness       -0.11923149   0.24819202 -0.1019552197  0.10507757
## tempo             -0.21632796   0.08179074 -0.0096571687  0.22992988
## time_signature    -0.16531911   0.20632785  0.0210066739  0.16503012
## valence           -0.17702336   0.46146765 -0.1418368926  0.31476839
## popularity        -0.11651999   0.13108558 -0.0008010133  0.12250614
##                  instrumentalness          key     liveness    loudness
## acousticness           0.27268471 -0.019987346 -0.100545151 -0.60336645
## danceability          -0.30511231  0.021123423 -0.137376506  0.43155440
## duration_ms            0.02912439 -0.001879703 -0.003660633 -0.01859473
## energy                -0.30130753  0.039842673  0.209447743  0.76669747
## instrumentalness       1.00000000 -0.025072393 -0.058389622 -0.50851922
## key                   -0.02507239  1.000000000  0.009191355  0.02810077
## liveness              -0.05838962  0.009191355  1.000000000  0.06216759
## loudness              -0.50851922  0.028100770  0.062167594  1.00000000
## mode                  -0.00221131 -0.176237900 -0.001324836 -0.03608064
## speechiness           -0.21735871  0.010353850  0.106801421  0.07445622
## tempo                 -0.08689433  0.005463959 -0.009125933  0.22306701
## time_signature        -0.08422338  0.008877675 -0.018307024  0.17967938
## valence               -0.24686882  0.043347523 -0.007800342  0.31988077
## popularity            -0.21644701  0.002681513 -0.031174152  0.24408756
##                           mode   speechiness         tempo time_signature
## acousticness      0.0671712653 -0.1192314897 -0.2163279628   -0.165319110
## danceability     -0.0579118329  0.2481920170  0.0817907432    0.206327855
## duration_ms       0.0103208525 -0.1019552197 -0.0096571687    0.021006674
## energy           -0.0692631146  0.1050775674  0.2299298846    0.165030125
## instrumentalness -0.0022113098 -0.2173587142 -0.0868943260   -0.084223383
## key              -0.1762378997  0.0103538497  0.0054639590    0.008877675
## liveness         -0.0013248364  0.1068014212 -0.0091259328   -0.018307024
## loudness         -0.0360806388  0.0744562224  0.2230670134    0.179679384
## mode              1.0000000000 -0.0535544493 -0.0002488567   -0.036244213
## speechiness      -0.0535544493  1.0000000000  0.0548271079    0.053706516
## tempo            -0.0002488567  0.0548271079  1.0000000000    0.083758714
## time_signature   -0.0362442132  0.0537065159  0.0837587140    1.000000000
## valence           0.0110817073  0.1215523213  0.1048569713    0.069161843
## popularity       -0.0090698371 -0.0002138378  0.0370752117    0.064939422
##                       valence    popularity
## acousticness     -0.177023355 -0.1165199935
## danceability      0.461467645  0.1310855768
## duration_ms      -0.141836893 -0.0008010133
## energy            0.314768394  0.1225061432
## instrumentalness -0.246868816 -0.2164470114
## key               0.043347523  0.0026815127
## liveness         -0.007800342 -0.0311741517
## loudness          0.319880771  0.2440875551
## mode              0.011081707 -0.0090698371
## speechiness       0.121552321 -0.0002138378
## tempo             0.104856971  0.0370752117
## time_signature    0.069161843  0.0649394223
## valence           1.000000000  0.0143026691
## popularity        0.014302669  1.0000000000

Some of the variables most correlated to Popularity seem to be: acousticness, danceability, energy, instrumentalness, and loudness.

Also, looking at key:

spotify %>% group_by(key) %>% select(key, popularity) %>%
  ggplot(aes(x=as.factor(key),y=popularity,fill=as.factor(key))) +
  geom_boxplot() + theme(legend.position = "none") +
  labs(title="Popularity of Song Keys", subtitle = "by Pitch Classes",
       x="Key/Pitch Class", y="Popularity")

There seems to be little differentiation between the keys and popularity. However, one key does seem to have a larger number of popular (> 70) songs in it:

spotify %>% filter(popularity > 70) %>% group_by(key) %>% summarize(count = n()) %>% arrange(desc(count))
## # A tibble: 12 x 2
##      key count
##    <dbl> <int>
##  1     1   262
##  2     0   188
##  3     7   166
##  4    11   166
##  5     6   161
##  6     2   143
##  7     8   138
##  8     5   129
##  9    10   128
## 10     9   124
## 11     4   116
## 12     3    37

Key 1 (which applies to c-Sharp and d-Flat) has more popular songs than other keys. That could be a potential predictor variable, so we will encode a new binary variable we can use in our model.

spotify$isKey1 <- as.integer(spotify$key == 1)

We’ll start with these variables and see if we can build a model that is at least somewhat useful for predicting a song’s popularity on Spotify.

Modeling

First we try all the identified variables. We’ll also add an interaction term for loudness and energy, as they seem like they may have a relationship.

m1 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + isKey1 + (energy * loudness), data=spotify)

summary(m1)
## 
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy + 
##     instrumentalness + loudness + isKey1 + (energy * loudness), 
##     data = spotify)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.405 -15.113  -1.894  12.766  81.486 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      36.22300    0.44190  81.971  < 2e-16 ***
## acousticness     -0.79112    0.22154  -3.571 0.000356 ***
## danceability      2.30756    0.31538   7.317 2.56e-13 ***
## energy           -2.96071    0.43944  -6.737 1.62e-11 ***
## instrumentalness -5.44163    0.17175 -31.684  < 2e-16 ***
## loudness          0.61738    0.01650  37.419  < 2e-16 ***
## isKey1            0.17596    0.16291   1.080 0.280091    
## energy:loudness   0.92506    0.03015  30.680  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.86 on 130655 degrees of freedom
## Multiple R-squared:  0.08501,    Adjusted R-squared:  0.08496 
## F-statistic:  1734 on 7 and 130655 DF,  p-value: < 2.2e-16

Looking at the model summary, we see that all of our predictor variables are significant with the exception of isKey1 the encoded variable we added. The F-statistic shows that our model does have some significance over the mean. Our Adjusted R-squared, however is a paltry 0.0849589.

Since the isKey1 variable isn’t significant, we should remove it from the model. We could also put another interaction variable in the model between acousticness and instrumentalness and see if the model improves.

m2 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + (energy * loudness) + (acousticness * instrumentalness), data=spotify)

summary(m2)
## 
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy + 
##     instrumentalness + loudness + (energy * loudness) + (acousticness * 
##     instrumentalness), data = spotify)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.553 -14.884  -1.867  12.762  76.220 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    38.78322    0.44644  86.872   <2e-16 ***
## acousticness                   -4.78146    0.25011 -19.117   <2e-16 ***
## danceability                    2.61564    0.31389   8.333   <2e-16 ***
## energy                         -5.52234    0.44425 -12.431   <2e-16 ***
## instrumentalness              -10.89381    0.23642 -46.079   <2e-16 ***
## loudness                        0.75877    0.01696  44.733   <2e-16 ***
## energy:loudness                 0.64124    0.03120  20.550   <2e-16 ***
## acousticness:instrumentalness  13.52540    0.40495  33.400   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.78 on 130655 degrees of freedom
## Multiple R-squared:  0.09275,    Adjusted R-squared:  0.0927 
## F-statistic:  1908 on 7 and 130655 DF,  p-value: < 2.2e-16

Here we see the Adjusted R-Squared jumped a little to 0.0926974. Our new interaction term seems significant as well. Perhaps another one is called for? Let’s see if there is value to an interaction variable composed of energy and danceability.

m3 <- lm(popularity ~ acousticness + danceability + energy + instrumentalness + loudness + (energy * loudness) + (acousticness * instrumentalness) + (energy * danceability), data=spotify)

summary(m3)
## 
## Call:
## lm(formula = popularity ~ acousticness + danceability + energy + 
##     instrumentalness + loudness + (energy * loudness) + (acousticness * 
##     instrumentalness) + (energy * danceability), data = spotify)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.281 -14.840  -1.867  12.728  75.889 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    35.49640    0.58935  60.230   <2e-16 ***
## acousticness                   -4.65081    0.25051 -18.565   <2e-16 ***
## danceability                    7.87817    0.69152  11.392   <2e-16 ***
## energy                          0.69415    0.85272   0.814    0.416    
## instrumentalness              -10.89881    0.23635 -46.112   <2e-16 ***
## loudness                        0.71081    0.01786  39.791   <2e-16 ***
## energy:loudness                 0.75571    0.03395  22.258   <2e-16 ***
## acousticness:instrumentalness  13.65674    0.40513  33.709   <2e-16 ***
## danceability:energy            -9.70001    1.13584  -8.540   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.77 on 130654 degrees of freedom
## Multiple R-squared:  0.09325,    Adjusted R-squared:  0.0932 
## F-statistic:  1680 on 8 and 130654 DF,  p-value: < 2.2e-16

Interestingly the new interaction term is significant in this model, but the energy variable itself ceased to be. R will not allow us to drop the main variable and keep it’s intereaction variable, so we’ll keep the model without the new interaction (m2).

Analysis

So, how good is our model? Judging by the Adjusted R-squared, not too great. But perhaps it is somewhat useful. Let’s look at the residuals:

am2 <- augment(m2)

am2 %>% ggplot(aes(x=.fitted,y=.resid)) +
  geom_point(alpha=0.1) + geom_hline(yintercept=0) +
  labs(title="Residual Plot")

am2 %>% ggplot(aes(sample=.std.resid)) +
  geom_qq() + geom_qq_line() +
  labs(title="QQ Plot")

The residual plot is interesting and a definite pattern is there. The QQ plot shows that our residuals aren’t exactly normal throughout the range of samples.

Conclusion

Given the low Adjusted R-squared, and unusual patterns in the residuals, the models we’ve created seem like they are unsuitable for predicting a song’s popularity on Spotify; sorry music executives!


  1. https://www.kaggle.com/tomigelo/spotify-audio-features#SpotifyAudioFeaturesApril2019.csv