Mid-term review for this project was provided by Eunsun Yook.
Streaming music services have been able to collect a wealth of data on song characteristics, stream counts, and user interaction. Spotify uses this data to create playlists curated to a user's taste based on their listening habits. Given how thoroughly Spotify quantifies its songs, is it possible to find correlations between a song's popularity and the other quantifiable fields Spotify makes available?
With the Spotify dataset retrieved via the `spotifyr` package, we will look at 12 variables from almost 33,000 songs, examine how they interact with each other, and test whether they are significant in determining popularity. Because genres of music can have very different characteristics, it may make more sense to gauge popularity against each genre's representative statistics rather than overall.
We will first compare variables to determine which influence the popularity of a song. We will then fit a multiple linear regression to the data to predict what characteristics are likely to produce a popular song.
A predictive model of popularity can help songwriters craft a song based on data-driven methods. Users will also benefit by being able to discover songs based on their preferred characteristics, and see whether they prefer more or less popular songs.
This analysis will make use of the following packages:

library(tidyverse)   # includes ggplot2, dplyr, and purrr
library(scales)
library(table1)
library(htmltools)
library(cowplot)
library(rlang)
library(Hmisc)
library(gridExtra)
library(ggcorrplot)
The data used in this analysis is available as part of the `tidytuesdayR` package, and can also be downloaded directly from the tidytuesday GitHub repository.
The codebook is also available on the tidytuesday GitHub page.
songs <- read.csv('spotify_songs.csv', stringsAsFactors = FALSE)
The data set, which was originally created on 2020-01-21, consists of 32,833 records, each with 23 columns. Each record represents a single song, and the columns represent various aspects of each song. The codebook includes the variable definitions.
Because the columns are named sensibly, we won’t need to re-name any of them.
In examining the structure of the data set, we can see that `track_album_release_date` should be re-formatted as a date, and that `playlist_genre` and `playlist_subgenre` should be made into factors.
str(songs)
songs$track_album_release_date <- as.Date(songs$track_album_release_date)
songs$playlist_genre <- as.factor(songs$playlist_genre)
songs$playlist_subgenre <- as.factor(songs$playlist_subgenre)
Is there any missing data?
nrow(songs[complete.cases(songs), ])
## [1] 30942
colSums(is.na(songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 1886 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
The NA values in this data set exist in the `track_name`, `track_artist`, `track_album_name`, and `track_album_release_date` columns. The much larger count in `track_album_release_date` appears to come from records whose release date is stored as a year only, which `as.Date()` cannot parse and therefore coerces to NA. Given the nature of the columns with missing values, it does not make sense for us to make any imputations.
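If we wanted to keep those rows, one alternative (a sketch, assuming the unparseable entries really are year-only or year-month strings, which we have not verified here) would be lubridate's truncated date parsing, applied to the raw character column before any `as.Date()` coercion:

# re-read the raw file so the release dates are still character strings
raw <- read.csv('spotify_songs.csv', stringsAsFactors = FALSE)
# truncated = 2 lets "2012" and "2012-06" parse as 2012-01-01 / 2012-06-01
raw$track_album_release_date <- lubridate::ymd(raw$track_album_release_date,
                                               truncated = 2)
sum(is.na(raw$track_album_release_date))  # should be far lower than 1886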
How many rows still have missing data?
na_rows <- nrow(songs) - nrow(songs[complete.cases(songs), ])
pct_na_rows <- na_rows / nrow(songs)
Only 1891 rows are still missing data at this point. This is only about 6% of the total number of rows, so we’ll just drop those rows from the data set.
songs <- na.omit(songs)
Next we’ll look at summaries of each of the numeric values to be sure they make sense, and check for any outliers.
songs %>%
summarise_if(is.numeric, mean)
## track_popularity danceability energy key loudness mode
## 1 42.75609 0.6572603 0.6988573 5.368011 -6.639354 0.5609528
## speechiness acousticness instrumentalness liveness valence tempo
## 1 0.1082296 0.1759632 0.08695591 0.1899777 0.5050244 120.9423
## duration_ms
## 1 223946.6
There do not appear to be any outliers among the numeric variables, and all values are within the stated ranges (e.g., values such as `danceability` and `energy` are measured on a scale of 0.0 to 1.0). `mode` is a binary variable with 0 and 1 as the only possible values. The longest song has a duration of 517,810 ms, which is about 8.63 minutes and seems entirely reasonable.
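Rather than judging from the means alone, we can also confirm the minima and maxima directly (a quick check, assuming dplyr >= 1.0 for `across()`):

# min and max of every numeric column, to verify the documented ranges
songs %>%
  summarise(across(where(is.numeric), list(min = min, max = max)))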
Finally, there are several columns we won’t need for our analysis, so we’ll drop them.
unused_cols <- c('track_id', 'track_album_id', 'track_album_name',
'playlist_name', 'playlist_id', 'playlist_subgenre')
songs <- songs[ , -which(names(songs) %in% c(unused_cols))]
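As an aside, the base-R idiom above has a subtle footgun: if none of `unused_cols` matched, `which()` would return `integer(0)`, and `songs[ , integer(0)]` silently selects zero columns. An equivalent but safer one-liner uses dplyr's `select()` with `all_of()`, which fails loudly on unknown names:

# equivalent drop; errors (rather than silently emptying the data frame)
# if any name in unused_cols is absent
songs <- songs %>% select(-all_of(unused_cols))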
The data is now clean, with 30942 observations of 17 variables. Here’s what it looks like:
head(songs)
## track_name track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2 Memories - Dillon Francis Remix Maroon 5
## 3 All the Time - Don Diablo Remix Zara Larsson
## 4 Call You Mine - Keanu Silva Remix The Chainsmokers
## 5 Someone You Loved - Future Humans Remix Lewis Capaldi
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity track_album_release_date playlist_genre danceability energy
## 1 66 2019-06-14 pop 0.748 0.916
## 2 67 2019-12-13 pop 0.726 0.815
## 3 70 2019-07-05 pop 0.675 0.931
## 4 60 2019-07-19 pop 0.718 0.930
## 5 69 2019-03-05 pop 0.650 0.833
## 6 67 2019-07-11 pop 0.675 0.919
## key loudness mode speechiness acousticness instrumentalness liveness valence
## 1 6 -2.634 1 0.0583 0.1020 0.00e+00 0.0653 0.518
## 2 11 -4.969 1 0.0373 0.0724 4.21e-03 0.3570 0.693
## 3 1 -3.432 0 0.0742 0.0794 2.33e-05 0.1100 0.613
## 4 7 -3.778 1 0.1020 0.0287 9.43e-06 0.2040 0.277
## 5 1 -4.672 1 0.0359 0.0803 0.00e+00 0.0833 0.725
## 6 8 -5.385 1 0.1270 0.0799 0.00e+00 0.1430 0.585
## tempo duration_ms
## 1 122.036 194754
## 2 99.972 162600
## 3 124.008 176616
## 4 121.956 169093
## 5 123.976 189052
## 6 124.982 163049
The variables we are concerned with are `track_album_release_date`, `track_artist`, and `playlist_genre`, along with the quantifying variables `track_popularity`, `danceability`, `energy`, `key`, `loudness`, `mode`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, and `valence`.
- `track_artist`: There are 10,316 distinct artists in this data set.
- `playlist_genre`: Each song is classified into one of 6 distinct genres.

What is the makeup of songs in our data set, in terms of genre?
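As a quick check (not part of the original write-up), the counts above can be verified directly:

n_distinct(songs$track_artist)   # 10,316 distinct artists
n_distinct(songs$playlist_genre) # 6 distinct genres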
Here are summaries of the numeric columns of interest, broken down by playlist_genre:
table1::table1(~ track_popularity + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness +liveness + valence | playlist_genre, data = songs)
songs %>% ggplot(aes(x = playlist_genre, fill = playlist_genre)) +
geom_bar() +
ggtitle("Number of Songs Within Each Genre") +
scale_fill_discrete(name = "Genre") +
scale_x_discrete(name = NULL) +
scale_y_continuous(expand = c(0, 0), name = "Count of \n Songs") +
theme(axis.title.y = element_text(angle = 0, vjust = .5)) +
theme(panel.grid.major.x = element_blank()) +
coord_cartesian(ylim = c(0, 7000))
As we can see, the makeup of songs in our dataset is fairly even across genres, with EDM having the most songs and rock the fewest.
How do the genres rank in terms of popularity?
popularity_means <- songs %>%
group_by(playlist_genre) %>%
dplyr::summarize(mean_popularity = mean(track_popularity))
popularity_means %>%
ggplot(aes(x = playlist_genre,
y = mean_popularity,
fill = playlist_genre)) +
geom_bar(stat = "identity") +
ggtitle("Average Popularity of Each Genre") +
scale_fill_discrete(name = "Genre") +
scale_x_discrete(name = NULL) +
scale_y_continuous(expand = c(0,0), name = "Average \n Popularity") +
theme(axis.title.y = element_text(angle = 0, vjust = .5)) +
theme(panel.grid.major.x = element_blank()) +
coord_cartesian(ylim = c(0, 55))
Who are the most and least popular artists with at least five tracks in the dataset?
Most popular:
# most popular
songs %>%
group_by(track_artist) %>%
filter(n() >= 5) %>%
dplyr::summarize(mean_popularity = mean(track_popularity)) %>%
arrange(desc(mean_popularity)) %>%
slice_head(n = 10)
## # A tibble: 10 x 2
## track_artist mean_popularity
## <chr> <dbl>
## 1 Trevor Daniel 97
## 2 Y2K 91
## 3 Don Toliver 90.7
## 4 Roddy Ricch 88.2
## 5 DaBaby 87.9
## 6 Kina 85.4
## 7 JACKBOYS 85.2
## 8 YNW Melly 84.6
## 9 Tainy 84
## 10 Lewis Capaldi 83.7
Least popular:
# least popular
songs %>%
group_by(track_artist) %>%
filter(n() >= 5) %>%
dplyr::summarize(mean_popularity = mean(track_popularity)) %>%
arrange(mean_popularity) %>%
slice_head(n = 10)
## # A tibble: 10 x 2
## track_artist mean_popularity
## <chr> <dbl>
## 1 Ballin Entertainment 0
## 2 CASIOPEA 0
## 3 Hanybal 0
## 4 Jamie Jones 0
## 5 LemKuuja 0
## 6 Lisa McClendon 0
## 7 New World Sound 0
## 8 T-SQUARE 0
## 9 Green Tea 0.5
## 10 Adventure Club 0.6
Let’s look at the makeup of our dataset in terms of time. What dates are covered?
Oldest track:
min(songs$track_album_release_date)
## [1] "1957-01-01"
Newest track:
max(songs$track_album_release_date)
## [1] "2020-01-29"
Next we look at the popularity of songs by year, using color to denote genre.
songs %>% ggplot(aes(x = track_album_release_date, y = track_popularity, color = playlist_genre)) +
geom_point(alpha = .5) +
ggtitle("Song Popularity by Year") +
scale_y_continuous(name = "Popularity") +
labs(color = "Genre", x = "Year") +
theme(axis.title.y = element_text(angle = 0, vjust = .5))
This plot is interesting, as it shows the timeline of when specific genres start to become dominant.
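One way to make those per-genre timelines easier to compare (a sketch, not part of the original analysis) is to facet by genre and overlay a smoother:

songs %>%
  ggplot(aes(x = track_album_release_date, y = track_popularity)) +
  geom_point(alpha = .2) +
  geom_smooth() +  # default smoother (a GAM at this sample size)
  facet_wrap(~ playlist_genre) +
  labs(x = "Year", y = "Popularity")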
We will now compare genre characteristics in boxplots.
myplots <-
map(names(songs %>% select(where(is.numeric)) %>% select(-mode)),
function(colName) {
songs %>%
ggplot(aes(x = playlist_genre,
y = !! sym(colName),
fill = playlist_genre)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(title = capitalize(colName), x = "", y = "")
})
gridExtra::grid.arrange(grobs = myplots[c(1:4)])
gridExtra::grid.arrange(grobs = myplots[c(5:8)])
gridExtra::grid.arrange(grobs = myplots[c(9:12)])
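As an aside, since ggplot2 3.0 the `!! sym(colName)` pattern can be replaced with the `.data` pronoun, which drops the explicit rlang dependency. A sketch of the equivalent loop:

# equivalent loop using the .data pronoun instead of rlang's sym()/!!
myplots <-
  map(names(songs %>% select(where(is.numeric)) %>% select(-mode)),
      function(colName) {
        songs %>%
          ggplot(aes(x = playlist_genre,
                     y = .data[[colName]],
                     fill = playlist_genre)) +
          geom_boxplot() +
          theme(legend.position = "none") +
          labs(title = capitalize(colName), x = "", y = "")
      })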
We now compare the variables in pairs to determine whether any influence one another, and whether they are significant to the overall model. No new variables will need to be created.
corr_songs <- cor(songs %>% select(where(is.numeric)) %>% select(-mode))
ggcorrplot(corr_songs, method = "circle", type = "lower")
We see few strong correlations. The most prominent are `energy` with `loudness`, which are strongly positively correlated, and `acousticness` with both `loudness` and `energy`, which are negatively correlated; both relationships make intuitive sense.
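To put numbers on those relationships (a quick check, not in the original write-up), we can flatten the correlation matrix and sort by absolute value:

# flatten the correlation matrix; each pair appears twice because the
# matrix is symmetric
corr_long <- as.data.frame(as.table(corr_songs))
corr_long <- corr_long[corr_long$Var1 != corr_long$Var2, ]
head(corr_long[order(-abs(corr_long$Freq)), ], 6)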
Next, we will generate our explanatory multiple linear regression model by using forward selection to determine which variables will be used as covariates.
# forward selection base model
add1(lm(track_popularity ~ 1, data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# instrumentalness has the highest significant F value, so add it first
add1(lm(track_popularity ~ instrumentalness, data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add energy
add1(lm(track_popularity ~ instrumentalness + energy, data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add loudness
add1(lm(track_popularity ~ instrumentalness + energy + loudness,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add valence
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add liveness
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add acousticness
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add danceability
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness + danceability,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add tempo
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness + danceability + tempo,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# add speechiness
add1(lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness + danceability + tempo + speechiness,
data = songs),
track_popularity ~ danceability + energy + key + loudness
+ speechiness + acousticness + instrumentalness + liveness
+ valence + tempo,
test = "F")
# key is not found to be significant
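As an aside, base R's `step()` automates this kind of forward selection, though it ranks candidates by AIC rather than F-tests, so the chosen model could differ slightly at the margin (a sketch):

# forward selection via AIC instead of manual add1() calls
null_model <- lm(track_popularity ~ 1, data = songs)
step(null_model,
     scope = ~ danceability + energy + key + loudness + speechiness +
       acousticness + instrumentalness + liveness + valence + tempo,
     direction = "forward")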
Based on our forward selection process, our model will use `instrumentalness`, `energy`, `loudness`, `valence`, `liveness`, `acousticness`, `danceability`, `tempo`, and `speechiness` as covariates for the response variable `track_popularity`. `key` was not found to be significant, so it will not be included in the model.
model1 <- lm(track_popularity ~ instrumentalness + energy + loudness + valence
+ liveness + acousticness + danceability + tempo + speechiness,
data = songs)
summary(model1)
##
## Call:
## lm(formula = track_popularity ~ instrumentalness + energy + loudness +
## valence + liveness + acousticness + danceability + tempo +
## speechiness, data = songs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.369 -17.262 3.129 18.874 78.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.600763 1.637877 42.494 < 2e-16 ***
## instrumentalness -12.463786 0.644196 -19.348 < 2e-16 ***
## energy -31.884870 1.240935 -25.694 < 2e-16 ***
## loudness 1.784146 0.066932 26.656 < 2e-16 ***
## valence 3.198616 0.662528 4.828 1.39e-06 ***
## liveness -4.208941 0.913680 -4.607 4.11e-06 ***
## acousticness 4.687185 0.752145 6.232 4.67e-10 ***
## danceability 6.755458 1.083990 6.232 4.66e-10 ***
## tempo 0.023060 0.005275 4.372 1.24e-05 ***
## speechiness -4.636935 1.388245 -3.340 0.000838 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.13 on 30932 degrees of freedom
## Multiple R-squared: 0.06471, Adjusted R-squared: 0.06443
## F-statistic: 237.8 on 9 and 30932 DF, p-value: < 2.2e-16
Our model only accounts for about 6% of the variation in `track_popularity`. Let's take a look at the QQ plot and histogram of our residuals to see if any improvements can be made.
par(mfrow = c(1,2))
qqnorm(model1$residuals, main = "Q-Q Plot of Model 1 Residuals")
qqline(model1$residuals)
hist(model1$residuals, main = "Histogram of Model 1 Residuals")
par(mfrow = c(1,1))
The QQ plot of the residuals shows a pattern that indicates non-normality in our residuals, and the histogram reveals light tails. These results suggest that transforming our response variable might be in order, so let's try that. We'll create five alternative models using the following transformations on `track_popularity`:
# square root transformation
songs$sqrt_track_popularity <- sqrt(songs$track_popularity)
model2 <- lm(sqrt_track_popularity ~ instrumentalness + energy + loudness +
valence + liveness + acousticness + danceability + tempo +
speechiness,
data = songs)
# log transformation
songs$log_track_popularity <- log(songs$track_popularity)
model3 <- lm(log_track_popularity ~ instrumentalness + energy + loudness +
valence + liveness + acousticness + danceability + tempo +
speechiness,
data = songs[songs$log_track_popularity > -Inf,])
# reciprocal square root transformation
songs$recip_sqrt_track_popularity <- songs$track_popularity ^ (-.5)
model4 <- lm(recip_sqrt_track_popularity ~ instrumentalness + energy + loudness +
valence + liveness + acousticness + danceability + tempo +
speechiness,
data = songs[songs$log_track_popularity > -Inf,])
# reciprocal transformation
songs$recip_track_popularity <- songs$track_popularity ^ (-1)
model5 <- lm(recip_track_popularity ~ instrumentalness + energy + loudness +
valence + liveness + acousticness + danceability + tempo +
speechiness,
data = songs[songs$recip_sqrt_track_popularity < Inf,])
# arcsin transformation; popularity is first rescaled to [0, 1], since
# asin() is only defined on [-1, 1] (without the rescaling, every track
# with popularity above 1 would come back as NaN and be dropped)
songs$asin_track_popularity <- asin(sqrt(songs$track_popularity / 100))
model6 <- lm(asin_track_popularity ~ instrumentalness + energy + loudness +
               valence + liveness + acousticness + danceability + tempo +
               speechiness,
             data = songs[!(is.nan(songs$asin_track_popularity)),])
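A more systematic way to choose a transformation (a sketch, not run as part of this report) is MASS's Box-Cox procedure, which profiles the likelihood over a family of power transformations. The response must be strictly positive, so we shift the zero-popularity tracks by one:

# MASS is called with :: so it does not mask dplyr::select()
bc <- MASS::boxcox(lm(track_popularity + 1 ~ instrumentalness + energy +
                        loudness + valence + liveness + acousticness +
                        danceability + tempo + speechiness,
                      data = songs),
                   plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda maximizing the profile log-likelihood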
Unfortunately, none of our transformations had a beneficial effect on our residuals, as can be seen in the QQ plots of our alternate models.
par(mfrow = c(2, 3))
qqnorm(model2$residuals, main = "Square root")
qqline(model2$residuals)
qqnorm(model3$residuals, main = "Log")
qqline(model3$residuals)
qqnorm(model4$residuals, main = "Reciprocal square root")
qqline(model4$residuals)
qqnorm(model5$residuals, main = "Reciprocal")
qqline(model5$residuals)
qqnorm(model6$residuals, main = "Arcsin")
qqline(model6$residuals)
par(mfrow = c(1,1))
We can also tell by looking at our models’ adjusted R-squared values that our original model performs better than any of our alternative models.
(df <- data.frame(Model = c(1:6),
Transformation = c("None",
"Square root",
"Log",
"Reciprocal square root",
"Reciprocal",
"Arcsin"),
AdjRsqr = c(summary(model1)$adj.r.squared,
summary(model2)$adj.r.squared,
summary(model3)$adj.r.squared,
summary(model4)$adj.r.squared,
summary(model5)$adj.r.squared,
summary(model6)$adj.r.squared)))
## Model Transformation AdjRsqr
## 1 1 None 0.064434292
## 2 2 Square root 0.050388224
## 3 3 Log 0.024922753
## 4 4 Reciprocal square root 0.009597271
## 5 5 Reciprocal 0.003216709
## 6 6 Arcsin 0.020659481
In this report, we attempted to build a predictive model for the popularity of songs in the Spotify dataset using multiple linear regression, with covariates chosen via forward selection. Our Spotify dataset covers a wide variety of music spanning more than six decades (1957 to 2020), and the genres are evenly represented. Each genre of music has varying characteristics, but we were not able to derive a model that predicts popularity well, as evidenced by our model's adjusted R-squared of only about 0.06. This shows the unpredictability of a hit song, and how difficult it is to write a song with the intent for it to be popular. Art is subjective, and at times data cannot account for the fickleness of human taste.

However, over the course of exploring the data we did notice an interesting insight regarding genres and popularity: new genres appear to generate popular songs. We believe this insight could provide an interesting avenue for further exploration of the relationship between genre and popularity, perhaps by constructing a model that takes release date into account.
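As a rough starting point for that follow-up (a sketch only; not fitted or evaluated in this report), one could add genre and release year to the covariate set:

# hypothetical follow-up model incorporating genre and release year
songs$release_year <- as.numeric(format(songs$track_album_release_date, "%Y"))
model_genre <- lm(track_popularity ~ playlist_genre + release_year +
                    instrumentalness + energy + loudness + valence +
                    liveness + acousticness + danceability + tempo +
                    speechiness,
                  data = songs)
summary(model_genre)$adj.r.squared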