Mid-term review for this project was provided by Eunsun Yook.
Streaming music services have been able to collect a wealth of data on song characteristics, stream counts, and user interaction. Spotify uses this data to create playlists curated to a user's taste based on their listening habits. Given how thoroughly Spotify quantifies its songs, is it possible to find correlations between a song's popularity and the other quantifiable fields Spotify makes available?
With the Spotify dataset retrieved via the `spotifyr` package, we will look at 12 variables from almost 33,000 songs, examine how they interact with each other, and test whether they are significant in determining popularity. Because genres of music can have very different characteristics, it may make more sense to gauge popularity against each genre's representative statistics rather than overall.
We will first compare variables to determine which influence the popularity of a song. We will then fit a multiple linear regression to the data to predict what characteristics are likely to produce a popular song.
A predictive model of popularity can help songwriters craft a song based on data-driven methods. Users will also benefit by being able to discover songs based on their preferred characteristics, and see whether they prefer more or less popular songs.
This analysis will make use of the following packages:

library(tidyverse)   # includes ggplot2, dplyr, and purrr
library(scales)
library(table1)
library(htmltools)
library(cowplot)
library(rlang)
library(Hmisc)
library(gridExtra)
library(ggcorrplot)
The data used in this analysis is available as part of the `tidytuesdayR` package, and can also be downloaded directly from the tidytuesday GitHub repository.
The codebook is also available on the tidytuesday GitHub page.
songs <- read.csv('spotify_songs.csv', stringsAsFactors = FALSE)
The data set, which was originally created on 2020-01-21, consists of 32,833 records, each with 23 columns. Each record represents a single song, and the columns represent various aspects of each song. The codebook includes the variable definitions.
Because the columns are named sensibly, we won’t need to re-name any of them.
In examining the structure of the data set, we can see that `track_album_release_date` should be re-formatted as a date, and that `playlist_genre` and `playlist_subgenre` should be made into factors.
str(songs)
songs$track_album_release_date <- as.Date(songs$track_album_release_date)
songs$playlist_genre <- as.factor(songs$playlist_genre)
songs$playlist_subgenre <- as.factor(songs$playlist_subgenre)
Is there any missing data?
nrow(songs[complete.cases(songs), ])
## [1] 30942
colSums(is.na(songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 1886 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
The NA values in this data set exist in the `track_name`, `track_artist`, `track_album_name`, and `track_album_release_date` columns. The much larger count in `track_album_release_date` appears to come from records whose release date is stored as a year only, which `as.Date()` cannot parse and therefore coerces to NA. Given the nature of the columns with missing values, it does not make sense for us to make any imputations.
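If we wanted to keep those rows, one alternative (a sketch, assuming the unparseable entries really are year-only or year-month strings, which we have not verified here) would be lubridate's truncated date parsing, applied to the raw character column before any `as.Date()` coercion:

# re-read the raw file so the release dates are still character strings
raw <- read.csv('spotify_songs.csv', stringsAsFactors = FALSE)
# truncated = 2 lets "2012" and "2012-06" parse as 2012-01-01 / 2012-06-01
raw$track_album_release_date <- lubridate::ymd(raw$track_album_release_date,
                                               truncated = 2)
sum(is.na(raw$track_album_release_date))  # should be far lower than 1886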
How many rows still have missing data?
na_rows <- nrow(songs) - nrow(songs[complete.cases(songs), ])
pct_na_rows <- na_rows / nrow(songs)
Only 1891 rows are still missing data at this point. This is only about 6% of the total number of rows, so we’ll just drop those rows from the data set.
songs <- na.omit(songs)
Next we’ll look at summaries of each of the numeric values to be sure they make sense, and check for any outliers.
songs %>%
summarise_if(is.numeric, mean)
## track_popularity danceability energy key loudness mode
## 1 42.75609 0.6572603 0.6988573 5.368011 -6.639354 0.5609528
## speechiness acousticness instrumentalness liveness valence tempo
## 1 0.1082296 0.1759632 0.08695591 0.1899777 0.5050244 120.9423
## duration_ms
## 1 223946.6
There do not appear to be any outliers among the numeric variables, and all values are within the stated ranges (e.g., values such as `danceability` and `energy` are measured on a scale of 0.0 to 1.0). `mode` is a binary variable with 0 and 1 as the only possible values. The longest song has a duration of 517,810 ms, which is about 8.63 minutes and seems entirely reasonable.
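Rather than judging from the means alone, we can also confirm the minima and maxima directly (a quick check, assuming dplyr >= 1.0 for `across()`):

# min and max of every numeric column, to verify the documented ranges
songs %>%
  summarise(across(where(is.numeric), list(min = min, max = max)))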
Finally, there are several columns we won’t need for our analysis, so we’ll drop them.
unused_cols <- c('track_id', 'track_album_id', 'track_album_name',
'playlist_name', 'playlist_id', 'playlist_subgenre')
songs <- songs[ , -which(names(songs) %in% c(unused_cols))]
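As an aside, the base-R idiom above has a subtle footgun: if none of `unused_cols` matched, `which()` would return `integer(0)`, and `songs[ , integer(0)]` silently selects zero columns. An equivalent but safer one-liner uses dplyr's `select()` with `all_of()`, which fails loudly on unknown names:

# equivalent drop; errors (rather than silently emptying the data frame)
# if any name in unused_cols is absent
songs <- songs %>% select(-all_of(unused_cols))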
The data is now clean, with 30942 observations of 17 variables. Here’s what it looks like:
head(songs)
## track_name track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2 Memories - Dillon Francis Remix Maroon 5
## 3 All the Time - Don Diablo Remix Zara Larsson
## 4 Call You Mine - Keanu Silva Remix The Chainsmokers
## 5 Someone You Loved - Future Humans Remix Lewis Capaldi
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity track_album_release_date playlist_genre danceability energy
## 1 66 2019-06-14 pop 0.748 0.916
## 2 67 2019-12-13 pop 0.726 0.815
## 3 70 2019-07-05 pop 0.675 0.931
## 4 60 2019-07-19 pop 0.718 0.930
## 5 69 2019-03-05 pop 0.650 0.833
## 6 67 2019-07-11 pop 0.675 0.919
## key loudness mode speechiness acousticness instrumentalness liveness valence
## 1 6 -2.634 1 0.0583 0.1020 0.00e+00 0.0653 0.518
## 2 11 -4.969 1 0.0373 0.0724 4.21e-03 0.3570 0.693
## 3 1 -3.432 0 0.0742 0.0794 2.33e-05 0.1100 0.613
## 4 7 -3.778 1 0.1020 0.0287 9.43e-06 0.2040 0.277
## 5 1 -4.672 1 0.0359 0.0803 0.00e+00 0.0833 0.725
## 6 8 -5.385 1 0.1270 0.0799 0.00e+00 0.1430 0.585
## tempo duration_ms
## 1 122.036 194754
## 2 99.972 162600
## 3 124.008 176616
## 4 121.956 169093
## 5 123.976 189052
## 6 124.982 163049
The variables we are concerned with are `track_album_release_date`, `track_artist`, and `playlist_genre`, along with the quantifying variables `track_popularity`, `danceability`, `energy`, `key`, `loudness`, `mode`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, and `valence`.
- `track_artist`: There are 10,316 distinct artists in this data set.
- `playlist_genre`: Each song is classified into one of 6 distinct genres.

What is the makeup of songs in our data set, in terms of genre?
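As a quick check (not part of the original write-up), the counts above can be verified directly:

n_distinct(songs$track_artist)   # 10,316 distinct artists
n_distinct(songs$playlist_genre) # 6 distinct genres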
Here are summaries of the numeric columns of interest, broken down by playlist_genre:
table1::table1(~ track_popularity + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness +liveness + valence | playlist_genre, data = songs)
songs %>% ggplot(aes(x = playlist_genre, fill = playlist_genre)) +
geom_bar() +
ggtitle("Number of Songs Within Each Genre") +
scale_fill_discrete(name = "Genre") +
scale_x_discrete(name = NULL) +
scale_y_continuous(expand = c(0, 0), name = "Count of \n Songs") +
theme(axis.title.y = element_text(angle = 0, vjust = .5)) +
theme(panel.grid.major.x = element_blank()) +
coord_cartesian(ylim = c(0, 7000))
As we can see, the makeup of songs in our dataset is fairly even across genres, with EDM having the most songs and rock the fewest.
How do the genres rank in terms of popularity?
popularity_means <- songs %>%
group_by(playlist_genre) %>%
dplyr::summarize(mean_popularity = mean(track_popularity))
popularity_means %>%
ggplot(aes(x = playlist_genre,
y = mean_popularity,
fill = playlist_genre)) +
geom_bar(stat = "identity") +
ggtitle("Average Popularity of Each Genre") +
scale_fill_discrete(name = "Genre") +
scale_x_discrete(name = NULL) +
scale_y_continuous(expand = c(0,0), name = "Average \n Popularity") +
theme(axis.title.y = element_text(angle = 0, vjust = .5)) +
theme(panel.grid.major.x = element_blank()) +
coord_cartesian(ylim = c(0, 55))
Who are the most and least popular artists with at least five tracks in the dataset?
Most popular:
# most popular
songs %>%
group_by(track_artist) %>%
filter(n() >= 5) %>%
dplyr::summarize(mean_popularity = mean(track_popularity)) %>%
arrange(desc(mean_popularity)) %>%
slice_head(n = 10)
## # A tibble: 10 x 2
## track_artist mean_popularity
## <chr> <dbl>
## 1 Trevor Daniel 97
## 2 Y2K 91
## 3 Don Toliver 90.7
## 4 Roddy Ricch 88.2
## 5 DaBaby 87.9
## 6 Kina 85.4
## 7 JACKBOYS 85.2
## 8 YNW Melly 84.6
## 9 Tainy 84
## 10 Lewis Capaldi 83.7
Least popular:
# least popular
songs %>%
group_by(track_artist) %>%
filter(n() >= 5) %>%
dplyr::summarize(mean_popularity = mean(track_popularity)) %>%
arrange(mean_popularity) %>%
slice_head(n = 10)
## # A tibble: 10 x 2
## track_artist mean_popularity
## <chr> <dbl>
## 1 Ballin Entertainment 0
## 2 CASIOPEA 0
## 3 Hanybal 0
## 4 Jamie Jones 0
## 5 LemKuuja 0
## 6 Lisa McClendon 0
## 7 New World Sound 0
## 8 T-SQUARE 0
## 9 Green Tea 0.5
## 10 Adventure Club 0.6
Let’s look at the makeup of our dataset in terms of time. What dates are covered?
Oldest track:
min(songs$track_album_release_date)
## [1] "1957-01-01"
Newest track:
max(songs$track_album_release_date)
## [1] "2020-01-29"
Next we look at the popularity of songs by year, using color to denote genre.
songs %>% ggplot(aes(x = track_album_release_date, y = track_popularity, color = playlist_genre)) +
geom_point(alpha = .5) +
ggtitle("Song Popularity by Year") +
scale_y_continuous(name = "Popularity") +
labs(color = "Genre", x = "Year") +
theme(axis.title.y = element_text(angle = 0, vjust = .5))
This plot is interesting, as it shows the timeline of when specific genres start to become dominant.
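One way to make those per-genre timelines easier to compare (a sketch, not part of the original analysis) is to facet by genre and overlay a smoother:

songs %>%
  ggplot(aes(x = track_album_release_date, y = track_popularity)) +
  geom_point(alpha = .2) +
  geom_smooth() +  # default smoother (a GAM at this sample size)
  facet_wrap(~ playlist_genre) +
  labs(x = "Year", y = "Popularity")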
We will now compare genre characteristics in boxplots.
myplots <-
map(names(songs %>% select(where(is.numeric)) %>% select(-mode)),
function(colName) {
songs %>%
ggplot(aes(x = playlist_genre,
y = !! sym(colName),
fill = playlist_genre)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(title = capitalize(colName), x = "", y = "")
})
gridExtra::grid.arrange(grobs = myplots[c(1:4)])
gridExtra::grid.arrange(grobs = myplots[c(5:8)])
gridExtra::grid.arrange(grobs = myplots[c(9:12)])
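As an aside, since ggplot2 3.0 the `!! sym(colName)` pattern can be replaced with the `.data` pronoun, which drops the explicit rlang dependency. A sketch of the equivalent loop:

# equivalent loop using the .data pronoun instead of rlang's sym()/!!
myplots <-
  map(names(songs %>% select(where(is.numeric)) %>% select(-mode)),
      function(colName) {
        songs %>%
          ggplot(aes(x = playlist_genre,
                     y = .data[[colName]],
                     fill = playlist_genre)) +
          geom_boxplot() +
          theme(legend.position = "none") +
          labs(title = capitalize(colName), x = "", y = "")
      })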
We now compare the variables in pairs to determine whether any influence one another, and whether they are significant to the overall model. No new variables will need to be created.
corr_songs <- cor(songs %>% select(where(is.numeric)) %>% select(-mode))
ggcorrplot(corr_songs, method = "circle", type = "lower")
We see few strong correlations. The most prominent are `energy` with `loudness`, which are strongly positively correlated, and `acousticness` with both `loudness` and `energy`, which are negatively correlated; both relationships make intuitive sense.
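To put numbers on those relationships (a quick check, not in the original write-up), we can flatten the correlation matrix and sort by absolute value:

# flatten the correlation matrix; each pair appears twice because the
# matrix is symmetric
corr_long <- as.data.frame(as.table(corr_songs))
corr_long <- corr_long[corr_long$Var1 != corr_long$Var2, ]
head(corr_long[order(-abs(corr_long$Freq)), ], 6)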
Next, we will generate our explanatory multiple linear regression model by using forward selection to determine which variables will be used as covariates.
# forward selection base model
add1(lm(track_popularity ~ 1, data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# instrumentalness has the highest significant F value, so add it first
add1(lm(track_popularity ~ instrumentalness, data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add energy
add1(lm(track_popularity ~ instrumentalness + energy, data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add loudness
add1(lm(track_popularity ~ instrumentalness + energy + loudness,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add valence
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add liveness
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add acousticness
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add danceability
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness + danceability,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add tempo
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness + danceability + tempo,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add speechiness
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness + danceability + tempo + speechiness,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# key is not found to be significant
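As an aside, base R's `step()` automates this kind of forward selection, though it ranks candidates by AIC rather than F-tests, so the chosen model could differ slightly at the margin (a sketch):

# forward selection via AIC instead of manual add1() calls
null_model <- lm(track_popularity ~ 1, data = songs)
step(null_model,
     scope = ~ danceability + energy + key + loudness + speechiness +
       acousticness + instrumentalness + liveness + valence + tempo,
     direction = "forward")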
Based on our forward selection process, our model will use `instrumentalness`, `energy`, `loudness`, `valence`, `liveness`, `acousticness`, `danceability`, `tempo`, and `speechiness` as covariates for the response variable `track_popularity`. `key` was not found to be significant, so it will not be included in the model.
model1 <- lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness + danceability + tempo + speechiness,
data = songs)
summary(model1)
##
## Call:
## lm(formula = track_popularity ~ instrumentalness + energy + loudness +
## valence + liveness + acousticness + danceability + tempo +
## speechiness, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.369 -17.262 3.129 18.874 78.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.600763 1.637877 42.494 < 2e-16 ***
## instrumentalness -12.463786 0.644196 -19.348 < 2e-16 ***
## energy -31.884870 1.240935 -25.694 < 2e-16 ***
## loudness 1.784146 0.066932 26.656 < 2e-16 ***
## valence 3.198616 0.662528 4.828 1.39e-06 ***
## liveness -4.208941 0.913680 -4.607 4.11e-06 ***
## acousticness 4.687185 0.752145 6.232 4.67e-10 ***
## danceability 6.755458 1.083990 6.232 4.66e-10 ***
## tempo 0.023060 0.005275 4.372 1.24e-05 ***
## speechiness -4.636935 1.388245 -3.340 0.000838 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.13 on 30932 degrees of freedom
## Multiple R-squared: 0.06471, Adjusted R-squared: 0.06443
## F-statistic: 237.8 on 9 and 30932 DF, p-value: < 2.2e-16
Our model only accounts for about 6% of the variation in `track_popularity`. Let's take a look at the QQ plot and histogram of our residuals to see if any improvements can be made.
par(mfrow = c(1,2))
qqnorm(model1$residuals, main = "Q-Q Plot of Model 1 Residuals")
qqline(model1$residuals)
hist(model1$residuals, main = "Histogram of Model 1 Residuals")
par(mfrow = c(1,1))
The QQ plot of the residuals shows a pattern that indicates non-normality in our residuals, and the histogram reveals light tails. These results suggest that transforming our response variable might be in order, so let's try that. We'll create five alternative models using the following transformations on `track_popularity`:
# square root transformation
songs$sqrt_track_popularity <- sqrt(songs$track_popularity)
model2 <- lm(sqrt_track_popularity ~ instrumentalness + energy + loudness +
valence + liveness + acousticness + danceability + tempo +
speechiness,
data = songs)
# log transformation
songs$log_track_popularity <- log(songs$track_popularity)
model3 <- lm(log_track_popularity ~ instrumentalness + energy + loudness +
valence + liveness + acousticness + danceability + tempo +
speechiness,
data = songs[songs$log_track_popularity > -Inf,])
# reciprocal square root transformation
songs$recip_sqrt_track_popularity <- songs$track_popularity ^ (-.5)
model4 <- lm(recip_sqrt_track_popularity ~ instrumentalness + energy + loudness +
valence + liveness + acousticness + danceability + tempo +
speechiness,
data = songs[songs$log_track_popularity > -Inf,])
# reciprocal transformation
songs$recip_track_popularity <- songs$track_popularity ^ (-1)
model5 <- lm(recip_track_popularity ~ instrumentalness + energy + loudness +
valence + liveness + acousticness + danceability + tempo +
speechiness,
data = songs[songs$recip_sqrt_track_popularity < Inf,])
# arcsin transformation; popularity is first rescaled to [0, 1], since
# asin() is only defined on [-1, 1] (without the rescaling, every track
# with popularity above 1 would come back as NaN and be dropped)
songs$asin_track_popularity <- asin(sqrt(songs$track_popularity / 100))
model6 <- lm(asin_track_popularity ~ instrumentalness + energy + loudness +
               valence + liveness + acousticness + danceability + tempo +
               speechiness,
             data = songs[!(is.nan(songs$asin_track_popularity)),])
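A more systematic way to choose a transformation (a sketch, not run as part of this report) is MASS's Box-Cox procedure, which profiles the likelihood over a family of power transformations. The response must be strictly positive, so we shift the zero-popularity tracks by one:

# MASS is called with :: so it does not mask dplyr::select()
bc <- MASS::boxcox(lm(track_popularity + 1 ~ instrumentalness + energy +
                        loudness + valence + liveness + acousticness +
                        danceability + tempo + speechiness,
                      data = songs),
                   plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda maximizing the profile log-likelihood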
Unfortunately, none of our transformations had a beneficial effect on our residuals, as can be seen in the QQ plots of our alternate models.
par(mfrow = c(2, 3))
qqnorm(model2$residuals, main = "Square root")
qqline(model2$residuals)
qqnorm(model3$residuals, main = "Log")
qqline(model3$residuals)
qqnorm(model4$residuals, main = "Reciprocal square root")
qqline(model4$residuals)
qqnorm(model5$residuals, main = "Reciprocal")
qqline(model5$residuals)
qqnorm(model6$residuals, main = "Arcsin")
qqline(model6$residuals)
par(mfrow = c(1,1))
We can also tell by looking at our models’ adjusted R-squared values that our original model performs better than any of our alternative models.
(df <- data.frame(Model = c(1:6),
Transformation = c("None",
"Square root",
"Log",
"Reciprocal square root",
"Reciprocal",
"Arcsin"),
AdjRsqr = c(summary(model1)$adj.r.squared,
summary(model2)$adj.r.squared,
summary(model3)$adj.r.squared,
summary(model4)$adj.r.squared,
summary(model5)$adj.r.squared,
summary(model6)$adj.r.squared)))
## Model Transformation AdjRsqr
## 1 1 None 0.064434292
## 2 2 Square root 0.050388224
## 3 3 Log 0.024922753
## 4 4 Reciprocal square root 0.009597271
## 5 5 Reciprocal 0.003216709
## 6 6 Arcsin 0.020659481
In this report, we attempted to build a predictive model for the popularity of songs in the Spotify dataset using multiple linear regression, with covariates chosen via forward selection. Our Spotify dataset covers a wide variety of music spanning more than six decades (1957 to 2020), and the genres are evenly represented. Each genre of music has varying characteristics, but we were not able to derive a model that predicts popularity well, as evidenced by our model's adjusted R-squared of only about 0.06. This shows the unpredictability of a hit song, and how difficult it is to write a song with the intent for it to be popular. Art is subjective, and at times data cannot account for the fickleness of human taste.

However, over the course of exploring the data we did notice an interesting insight regarding genres and popularity: new genres appear to generate popular songs. We believe this insight could provide an interesting avenue for further exploration of the relationship between genre and popularity, perhaps by constructing a model that takes release date into account.
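As a rough starting point for that follow-up (a sketch only; not fitted or evaluated in this report), one could add genre and release year to the covariate set:

# hypothetical follow-up model incorporating genre and release year
songs$release_year <- as.numeric(format(songs$track_album_release_date, "%Y"))
model_genre <- lm(track_popularity ~ playlist_genre + release_year +
                    instrumentalness + energy + loudness + valence +
                    liveness + acousticness + danceability + tempo +
                    speechiness,
                  data = songs)
summary(model_genre)$adj.r.squared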