The Recording Industry Association of America reports that 75% of music industry revenue comes from streaming. Spotify has 286 million active users and is widely ranked among the top music streaming services. Because Spotify is so central to the music industry, many artists rely on it for income. However, Spotify pays only $0.003 to $0.005 per stream, so an artist earns roughly $3 to $5 per 1,000 streams.
To help artists earn money on Spotify, this analysis investigates the popularity metric. A high popularity index means the algorithm is more likely to recommend a song to new listeners and to place it in algorithmically curated playlists. The more exposure a song gets, the more streams it garners.
This analysis aims to answer the following questions:
1. What features make a song popular?
2. How well can popularity be predicted?
Before I run the analysis, I am loading all necessary packages. The packages I am using are for data cleaning, data visualization, and machine learning.
#Loading packages
# Package names
packages <- c("tidyverse", "ggcorrplot", "randomForest", "here", "knitr", "tree", "gbm",
"e1071", "randomForestSRC", "caret", "ROCR", "car",
"rpart", "readr", "DescTools", "MASS", "rstatix", "viridis",
"rattle", "pdp","rmarkdown", "scales")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Packages loading
invisible(lapply(packages, library, character.only = TRUE))
Opening the CSV file and loading it into R.
spotify = read.csv("data.csv", header = TRUE)
Checking the structure of the data to make sure the data types of the variables are correct. Some variables need to be transformed into another type.
str(spotify)
## 'data.frame': 174389 obs. of 19 variables:
## $ acousticness : num 0.991 0.643 0.993 0.000173 0.295 0.996 0.992 0.996 0.996 0.00682 ...
## $ artists : chr "['Mamie Smith']" "[\"Screamin' Jay Hawkins\"]" "['Mamie Smith']" "['Oscar Velazquez']" ...
## $ danceability : num 0.598 0.852 0.647 0.73 0.704 0.424 0.782 0.474 0.469 0.571 ...
## $ duration_ms : int 168333 150200 163827 422087 165224 198627 195200 186173 146840 476304 ...
## $ energy : num 0.224 0.517 0.186 0.798 0.707 0.245 0.0573 0.239 0.238 0.753 ...
## $ explicit : int 0 0 0 0 1 0 0 0 0 0 ...
## $ id : chr "0cS0A1fUEUd1EW3FcF8AEI" "0hbkKFIJm7Z05H8Zl9w30f" "11m7laMUgmOKqI3oYzuhne" "19Lc5SfJJ5O1oaxY0fpwfh" ...
## $ instrumentalness: num 5.22e-04 2.64e-02 1.76e-05 8.01e-01 2.46e-04 7.99e-01 1.61e-06 1.86e-01 9.60e-01 8.73e-01 ...
## $ key : int 5 5 0 2 10 5 5 9 8 8 ...
## $ liveness : num 0.379 0.0809 0.519 0.128 0.402 0.235 0.176 0.195 0.149 0.092 ...
## $ loudness : num -12.63 -7.26 -12.1 -7.31 -6.04 ...
## $ mode : int 0 0 1 1 0 1 1 1 1 1 ...
## $ name : chr "Keep A Song In Your Soul" "I Put A Spell On You" "Golfing Papa" "True House Music - Xavier Santos & Carlos Gomix Remix" ...
## $ popularity : int 12 7 4 17 2 9 5 0 0 0 ...
## $ release_date : chr "1920" "1920-01-05" "1920" "1920-01-01" ...
## $ speechiness : num 0.0936 0.0534 0.174 0.0425 0.0768 0.0397 0.0592 0.0289 0.0741 0.0446 ...
## $ tempo : num 150 86.9 97.6 128 122.1 ...
## $ valence : num 0.634 0.95 0.689 0.0422 0.299 0.477 0.487 0.366 0.621 0.119 ...
## $ year : int 1920 1920 1920 1920 1920 1920 1920 1920 1920 1920 ...
Explicit, mode, and key are categorical variables; however, R read them in as integers. To make sure the analysis handles them correctly, we must convert explicit, mode, and key to factors.
spotify$explicit[spotify$explicit == 1] = "Explicit"
spotify$explicit[spotify$explicit == 0] = "Not explicit"
spotify$explicit = as.factor(spotify$explicit)
spotify$mode[spotify$mode == 1] = "Major"
spotify$mode[spotify$mode == 0] = "Minor"
spotify$mode = as.factor(spotify$mode)
spotify$key = as.factor(spotify$key)
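The same recoding can be done in one step per variable with factor()'s levels and labels arguments. A minimal alternative sketch (to run instead of, not in addition to, the block above; note that the resulting level order may differ from as.factor()'s alphabetical ordering):
# Equivalent one-step recoding of the 0/1 integer codes
spotify$explicit = factor(spotify$explicit, levels = c(0, 1),
                          labels = c("Not explicit", "Explicit"))
spotify$mode = factor(spotify$mode, levels = c(0, 1),
                      labels = c("Minor", "Major"))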
Double checking the structure to make sure the changes were correctly implemented.
str(spotify)
## 'data.frame': 174389 obs. of 19 variables:
## $ acousticness : num 0.991 0.643 0.993 0.000173 0.295 0.996 0.992 0.996 0.996 0.00682 ...
## $ artists : chr "['Mamie Smith']" "[\"Screamin' Jay Hawkins\"]" "['Mamie Smith']" "['Oscar Velazquez']" ...
## $ danceability : num 0.598 0.852 0.647 0.73 0.704 0.424 0.782 0.474 0.469 0.571 ...
## $ duration_ms : int 168333 150200 163827 422087 165224 198627 195200 186173 146840 476304 ...
## $ energy : num 0.224 0.517 0.186 0.798 0.707 0.245 0.0573 0.239 0.238 0.753 ...
## $ explicit : Factor w/ 2 levels "Explicit","Not explicit": 2 2 2 2 1 2 2 2 2 2 ...
## $ id : chr "0cS0A1fUEUd1EW3FcF8AEI" "0hbkKFIJm7Z05H8Zl9w30f" "11m7laMUgmOKqI3oYzuhne" "19Lc5SfJJ5O1oaxY0fpwfh" ...
## $ instrumentalness: num 5.22e-04 2.64e-02 1.76e-05 8.01e-01 2.46e-04 7.99e-01 1.61e-06 1.86e-01 9.60e-01 8.73e-01 ...
## $ key : Factor w/ 12 levels "0","1","2","3",..: 6 6 1 3 11 6 6 10 9 9 ...
## $ liveness : num 0.379 0.0809 0.519 0.128 0.402 0.235 0.176 0.195 0.149 0.092 ...
## $ loudness : num -12.63 -7.26 -12.1 -7.31 -6.04 ...
## $ mode : Factor w/ 2 levels "Major","Minor": 2 2 1 1 2 1 1 1 1 1 ...
## $ name : chr "Keep A Song In Your Soul" "I Put A Spell On You" "Golfing Papa" "True House Music - Xavier Santos & Carlos Gomix Remix" ...
## $ popularity : int 12 7 4 17 2 9 5 0 0 0 ...
## $ release_date : chr "1920" "1920-01-05" "1920" "1920-01-01" ...
## $ speechiness : num 0.0936 0.0534 0.174 0.0425 0.0768 0.0397 0.0592 0.0289 0.0741 0.0446 ...
## $ tempo : num 150 86.9 97.6 128 122.1 ...
## $ valence : num 0.634 0.95 0.689 0.0422 0.299 0.477 0.487 0.366 0.621 0.119 ...
## $ year : int 1920 1920 1920 1920 1920 1920 1920 1920 1920 1920 ...
There are no missing values in this data set.
sum(is.na(spotify))
## [1] 0
The duration variable is in milliseconds, which is too fine-grained a unit for this analysis. Minutes are a more natural metric for song length, so milliseconds will be converted to minutes.
spotify =
spotify %>%
dplyr::mutate(duration_min = duration_ms/60000)
For this analysis, I want to focus on songs from the past five years, so the results reflect music that is relevant to the current music scene. I’m creating a new data set that only contains music from 2016-2020.
spotify.past.5 = spotify %>%
  filter(year %in% 2016:2020)
Most songs are written in the keys of G, C, and C#. The key of D# is used the least.
#changing the factor levels of Key so they show up correctly on the visualization
spotify.past.5$key = factor(spotify.past.5$key, levels = c("7", "0", "1", "9",
"2", "11", "5", "4",
"6", "10", "8", "3"))
ggplot(spotify.past.5, aes(x = forcats::fct_infreq(key))) +
geom_bar(aes(fill = key), stat = "count", position = "dodge") +
labs(title= "Count of Key of Songs from 2016-2020", y = "Count", x = "Key") +
scale_y_continuous(label = scales::comma) +
scale_x_discrete(breaks = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11), labels = c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")) +
scale_fill_manual(values = c("#440154", "#482173", "#433e85", "#38588c", "#2d708e",
"#25858e", "#1e9b8a", "#2ab07f", "#52c569", "#86d549",
"#c2df23", "#fde725")) +
theme(legend.position="none")
Most songs are in a major key.
ggplot(spotify.past.5, aes(x = mode)) +
geom_bar(aes(fill = mode), stat = "count", position = "dodge") +
labs(title= "Count of Mode of Songs from 2016-2020", y = "Count", x = "Mode") +
scale_y_continuous(label = scales::comma) +
scale_x_discrete(breaks = c("Major", "Minor"), labels = c("Major", "Minor")) +
scale_fill_manual(values = c("#440154", "#fde725")) +
theme(legend.position="none")
Most songs are not explicit.
#changing the factor levels of Explicit so they show up correctly on the visualization
spotify.past.5$explicit = factor(spotify.past.5$explicit, levels = c("Not explicit","Explicit"))
ggplot(spotify.past.5, aes(x = explicit)) +
geom_bar(aes(fill = explicit), stat = "count", position = "dodge") +
labs(title= "Count of Explicit and Non Explicit Songs from 2016-2020", y = "Count", x = "Explicit") +
scale_y_continuous(label = scales::comma) +
scale_x_discrete(breaks = c("Explicit", "Not explicit"), labels = c("Explicit", "Not Explicit")) +
scale_fill_manual(values = c("#440154", "#fde725")) +
theme(legend.position="none")
Examining the correlations of the numeric variables.
Acousticness and energy have a strong negative correlation (r = -0.68), meaning that as acousticness increases, energy decreases. Acousticness and loudness have a moderate negative correlation (r = -0.52), meaning that as acousticness increases, loudness decreases.
cordata3 <- spotify.past.5 %>%
dplyr::select(acousticness, danceability, energy, duration_ms, instrumentalness, valence, tempo, liveness, loudness, speechiness)
cormatrix3 <- cor(cordata3)
round(cormatrix3, 2)
ggcorrplot(cormatrix3, hc.order = TRUE,outline.color = "white", lab = TRUE, colors = c("#52c569", "white", "#fde725"), lab_size = 2.5) +
labs(title="Correlation of Numeric Variables")
In the histograms, we can observe that acousticness, duration, instrumentalness, liveness, popularity, and speechiness are right-skewed. Energy, loudness, and danceability are left-skewed.
spotify.past.5%>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#3399CC") +
labs(title= "Distribution of Numeric Variables")
The distribution of popularity is nearly identical for major and minor songs.
ggplot(spotify.past.5, aes(y = popularity, x = mode, fill = mode)) +
geom_violin() +
labs(title= "Distribution of Mode of Songs from 2016-2020", y = "Popularity", x = "Mode") +
scale_fill_manual(values = c("#440154", "#fde725")) +
theme(legend.position="none")
Explicit songs are concentrated in the higher popularity range, while non-explicit songs are concentrated in the lower popularity range.
ggplot(spotify.past.5, aes(y = popularity, x = explicit, fill = explicit)) +
geom_violin() +
labs(title= "Distribution of Explicit of Songs from 2016-2020", y = "Popularity", x = "Explicit") +
scale_fill_manual(values = c("#440154", "#fde725")) +
theme(legend.position="none")
The keys have very similar popularity distributions.
ggplot(spotify.past.5, aes(y = popularity, x = key, fill = key)) +
geom_violin() +
labs(title= "Distribution of Key of Songs from 2016-2020", y = "Popularity", x = "Key") +
scale_x_discrete(breaks = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11), labels = c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")) +
scale_fill_manual(values = c("#440154", "#482173", "#433e85", "#38588c", "#2d708e",
"#25858e", "#1e9b8a", "#2ab07f", "#52c569", "#86d549",
"#c2df23", "#fde725")) +
theme(legend.position="none")
To decide which categorical variables should be kept in the popularity model, I am using ANOVA to test whether the categorical variables (explicit, mode, key) have a significant effect on the continuous variable (popularity). The null hypothesis of ANOVA is that there is no difference between group means; the alternative hypothesis is that at least one mean differs. If the p-value of the ANOVA is less than .05, we reject the null hypothesis and keep the variable in our model.
The results show a significant difference in mean popularity between explicit and non-explicit songs (p < .05).
aov.res = aov(popularity~explicit, data = spotify.past.5)
summary(aov.res)
## Df Sum Sq Mean Sq F value Pr(>F)
## explicit 1 2471147 2471147 3057 <2e-16 ***
## Residuals 13840 11188838 808
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One of the assumptions of ANOVA is equal variances, so I’m using Levene’s Test to determine whether that assumption has been violated. The null hypothesis of Levene’s Test is that the variances are equal across groups; the alternative hypothesis is that they are not. If the p-value of Levene’s Test is greater than .05, we fail to reject the null hypothesis and the equal-variance assumption of ANOVA is met.
The results of Levene’s Test indicate that the equal-variance assumption holds (p > .05).
leveneTest(aov.res)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.5674 0.05894 .
## 13840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results from this ANOVA show no significant difference in mean popularity between major and minor songs (p > .05). Mode won’t be used in the model.
aov.res.2 = aov(popularity~mode, data = spotify.past.5)
summary(aov.res.2)
## Df Sum Sq Mean Sq F value Pr(>F)
## mode 1 3073 3073.2 3.114 0.0776 .
## Residuals 13840 13656911 986.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value from Levene’s Test is greater than .05, so the equal-variance assumption has been met.
leveneTest(aov.res.2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.3104 0.2523
## 13840
Results from this ANOVA indicate a significant difference in mean popularity across keys (p < .05).
aov.res.3 = aov(popularity~key, data = spotify.past.5)
summary(aov.res.3)
## Df Sum Sq Mean Sq F value Pr(>F)
## key 11 138555 12596 12.88 <2e-16 ***
## Residuals 13830 13521429 978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene’s Test indicates that the equal-variance assumption has been violated (p < .05). A Welch’s ANOVA, which does not assume homogeneity of variance, will be used instead.
leveneTest(aov.res.3)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 11 11.98 < 2.2e-16 ***
## 13830
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Welch’s ANOVA shows a significant difference in mean popularity across keys (p < .05). Key will be kept in the model.
oneway.test(popularity ~ key, data = spotify.past.5, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: popularity and key
## F = 12.898, num df = 11.0, denom df = 4543.6, p-value < 2.2e-16
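Welch’s ANOVA only indicates that at least one pair of keys differs in mean popularity. A Games-Howell post-hoc test, available in the already-loaded rstatix package, compares all pairs without assuming equal variances; a minimal sketch:
# Pairwise key comparisons without the equal-variance assumption;
# pairs with p.adj < .05 are the ones driving the overall difference
rstatix::games_howell_test(spotify.past.5, popularity ~ key)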
In order to have all the data on the same scale, I will apply min-max normalization, which rescales each variable to the range 0 to 1: the difference between each value and the minimum is divided by the range, x' = (x - min(x)) / (max(x) - min(x)).
# Columns 1, 3, 5, 8, 10, 11, 16:18, 20 are the numeric predictors:
# acousticness, danceability, energy, instrumentalness, liveness,
# loudness, speechiness, tempo, valence, duration_min
norm = preProcess(spotify.past.5[, c(1, 3, 5, 8, 10, 11, 16:18, 20)], method = c("range"))
transformed = predict(norm, spotify.past.5[, c(1, 3, 5, 8, 10, 11, 16:18, 20)])
transformed$popularity = spotify.past.5$popularity
transformed$explicit = spotify.past.5$explicit
transformed$key = spotify.past.5$key
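As a sanity check, the formula can be applied by hand to a single column; the result should match what preProcess() produced:
# Manual min-max normalization of acousticness; all.equal() should return TRUE
x = spotify.past.5$acousticness
manual = (x - min(x)) / (max(x) - min(x))
all.equal(manual, transformed$acousticness)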
Splitting the data into an 80% training set and a 20% test set.
set.seed(1)
idx.train.past.5 = createDataPartition(y=transformed$popularity, p =.8, list =F)
train.past.5 = transformed[idx.train.past.5,]
test.past.5 = transformed[-idx.train.past.5,]
To predict popularity, a random forest will be used. This method was chosen because it handles outliers and skewed data well and tends to predict accurately. A random forest aggregates many independent decision trees into a final prediction. Bagging and feature randomness create largely uncorrelated trees, which is what makes the algorithm powerful: uncorrelated trees protect the ensemble from any individual tree’s errors, so the aggregate prediction is more accurate than a single decision tree’s.
set.seed(1)
rf.model = randomForest(popularity ~
tempo +
valence +
acousticness +
duration_min +
instrumentalness +
liveness +
speechiness +
energy +
loudness +
danceability +
key +
explicit,
data = train.past.5, importance = TRUE)
print(rf.model)
##
## Call:
## randomForest(formula = popularity ~ tempo + valence + acousticness + duration_min + instrumentalness + liveness + speechiness + energy + loudness + danceability + key + explicit, data = train.past.5, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 435.5278
## % Var explained: 55.82
The variable importance plot shows that explicit, instrumentalness, loudness, duration, and energy are important to the model.
varImpPlot(rf.model)
Explicit songs are predicted to be more popular.
partialPlot(rf.model, train.past.5, explicit, popularity)
The more instrumental a song is, the lower its predicted popularity.
partialPlot(rf.model, train.past.5, instrumentalness, popularity)
Shorter songs are predicted to be more popular.
partialPlot(rf.model, train.past.5, duration_min, popularity)
Louder songs are predicted to be more popular.
partialPlot(rf.model, train.past.5, loudness, popularity)
Songs with low to moderate energy are predicted to be more popular.
partialPlot(rf.model, train.past.5, energy, popularity)
Making predictions on the test data and evaluating the model.
The random forest accounts for 58% of the variation in the test data, with an MSE of 428.4 and an RMSE of 20.7.
y_hat = predict(rf.model, test.past.5)
mse.rf = mean((y_hat - test.past.5$popularity)^2)
rf.rmse = caret::RMSE(y_hat,test.past.5$popularity)
rf.r2 = R2(y_hat, test.past.5$popularity)
rf_performance_metrics =
data.frame(R_squared = round(rf.r2,2),
MSE = round(mse.rf,2),
RMSE = round(rf.rmse,2))
kable(rf_performance_metrics,
col.names = c("R-squared", "MSE", "RMSE"))
| R-squared | MSE | RMSE |
|---|---|---|
| 0.58 | 428.4 | 20.7 |
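For reference, these metrics come from simple formulas: MSE is the mean squared prediction error, RMSE is its square root, and caret’s R2() defaults to the squared correlation between predictions and observations. Computed from their definitions, they should agree with the table above:
# The same test-set metrics computed from their definitions
obs = test.past.5$popularity
mse.manual = mean((y_hat - obs)^2)   # MSE
rmse.manual = sqrt(mse.manual)       # RMSE = sqrt(MSE)
r2.manual = cor(y_hat, obs)^2        # caret's default "corr" R-squared
c(R2 = r2.manual, MSE = mse.manual, RMSE = rmse.manual)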
Now we will test whether an XGBoost model can outperform the random forest. In gradient boosting, trees are built sequentially, each one fit to improve on the errors of the previous trees. This differs from the random forest, which aggregates many independent decision trees rather than learning from earlier trees’ errors.
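To make the “learn from previous errors” idea concrete, here is a toy sketch (not the tuned xgbTree fit below) that boosts rpart trees on a single predictor: each round fits a tree to the current residuals and adds a shrunken version of its prediction to the ensemble.
# Toy gradient boosting for squared error on one predictor (loudness)
boost.sketch = function(df, n_trees = 50, eta = 0.3) {
  pred = rep(mean(df$popularity), nrow(df))        # start from the mean
  for (i in seq_len(n_trees)) {
    df$res = df$popularity - pred                  # residuals of the ensemble so far
    fit = rpart::rpart(res ~ loudness, data = df)  # fit a tree to the residuals
    pred = pred + eta * predict(fit, df)           # shrink and add its prediction
  }
  pred
}
Calling boost.sketch(train.past.5) returns the boosted predictions for the training rows; the xgbTree model below adds subsampling, column sampling, and regularization on top of this core loop.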
set.seed(1)
xgb.model = train( popularity ~
tempo +
valence +
acousticness +
duration_min +
instrumentalness +
liveness +
speechiness +
energy +
loudness +
danceability +
key +
explicit,
data = train.past.5,
method = "xgbTree",
metric = 'Rsquared')
print(xgb.model$finalModel)
## ##### xgb.Booster
## raw: 179.3 Kb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, objective = "reg:squarederror")
## params (as set within xgb.train):
## eta = "0.3", max_depth = "3", gamma = "0", colsample_bytree = "0.8", min_child_weight = "1", subsample = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 22
## niter: 150
## nfeatures : 22
## xNames : tempo valence acousticness duration_min instrumentalness liveness speechiness energy loudness danceability key0 key1 key9 key2 key11 key5 key4 key6 key10 key8 key3 explicitExplicit
## problemType : Regression
## tuneValue :
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 54 150 3 0.3 0 0.8 1 1
## obsLevels : NA
## param :
## list()
The XGBoost model’s most important variables are similar to those of the random forest: explicit and instrumentalness are the most impactful, followed by duration, loudness, and energy.
imp = varImp(xgb.model)
plot(imp)
Making predictions on the test data and evaluating the model.
The XGBoost model accounts for 51% of the variation in the test data, with an MSE of 488.93 and an RMSE of 22.11.
y_hat = predict(xgb.model, test.past.5)
mse.xbg = mean((y_hat - test.past.5$popularity)^2)
xgb.rmse = caret::RMSE(y_hat,test.past.5$popularity)
xgb.r2 = R2(y_hat, test.past.5$popularity)
xgb_performance_metrics =
data.frame(R_squared = round(xgb.r2,2),
MSE = round(mse.xbg,2),
RMSE = round(xgb.rmse,2))
kable(xgb_performance_metrics,
col.names = c("R-squared", "MSE", "RMSE"))
| R-squared | MSE | RMSE |
|---|---|---|
| 0.51 | 488.93 | 22.11 |
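Stacking the two metric tables makes the comparison explicit:
# Side-by-side test-set comparison of the two models
comparison = rbind(cbind(Model = "Random forest", rf_performance_metrics),
                   cbind(Model = "XGBoost", xgb_performance_metrics))
kable(comparison, col.names = c("Model", "R-squared", "MSE", "RMSE"))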
The random forest outperformed the XGBoost model: it accounted for more of the variation in the data and had a lower MSE and RMSE.
According to the random forest variable importance plots, explicit, instrumentalness, loudness, duration, and energy are important to the model. The partial dependence plots examined the relationship between those variables and popularity. Based on the partial dependence plots, explicit songs, louder songs, shorter songs, and low to moderate energy songs are predicted to be more popular.
With this in mind, artists aiming for a higher popularity score should consider incorporating those characteristics into their songs. It is important to note that music trends are always changing, and as they change, so could the impact of these variables. It is therefore recommended to monitor these variables over time to determine whether they remain relevant.