Music is a near universal language that most people enjoy in one way or another. I would like to know what makes a song popular and use linear regression with various predictors to build a model that allows me to predict a song’s popularity. I’ll start with two datasets, merge them, tidy and transform where appropriate, and choose a regression model using backward elimination. Before I run the backward elimination step function I will run some exploratory data analysis to see if all variables are viable or in need of transformation. Using the optimal predictors I will evaluate the model based on the adjusted R-squared values and plot the residuals vs fitted values to ensure assumptions of linearity and homoscedasticity hold up.
library(tidyverse)
library(RCurl)
low_pop_url <- getURL("https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Final%20Project/Data/low_popularity_spotify_data.csv")
low_pop <- read.csv(text = low_pop_url)
high_pop_url <- getURL("https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Final%20Project/Data/high_popularity_spotify_data.csv")
high_pop <- read.csv(text = high_pop_url)
# Combine low and high popular music datasets
common_column_names <- intersect(names(low_pop),names(high_pop))
common <- intersect(low_pop,high_pop) # This shows the duplicate rows from both df's
popular_music <- merge(low_pop,high_pop, by = common_column_names,all = TRUE) #merge into one df
popular_music <- popular_music |> #de-dupe by track name and select relevant columns
group_by(track_artist) |>
distinct(track_name,.keep_all = TRUE) |>
ungroup() |>
select(track_popularity,speechiness,danceability,track_artist,duration_ms,energy,playlist_genre,playlist_subgenre,track_name,mode,instrumentalness,valence,key,tempo,loudness,acousticness,liveness) |>
mutate(key = factor(key)) # coerce to factor
The low-pop
dataframe has 3,145 rows and 29 columns,
while the high-pop
dataframe has 1,686 rows and 29 columns.
When merging, the new dataframe popular_music
has 4,788
rows and 29 columns, a 43 observation deficit. Using the
intersect
function shows the observations that appear in
the orginal dataframes and yields 43. This means my new dataframe
popular_music
is already de-duplicated by rows. However,
looking at track_name
shows that the same song may appear
if it is in a different playlist. So, we will de-duplicate by
track_name
after grouping by artist; this results in 322
observations deleted. After selecting only those columns that I will use
in the analysis, the final dataset contains 29 variables and 4,466
observations.
Lets take a look at the distributions of the response and predictor variables.
# track popularity
ggplot(popular_music, aes(x= track_popularity)) +
geom_histogram(binwidth = 5)
# speechiness
ggplot(popular_music, aes(x= speechiness)) + # needs a log transforamation
geom_histogram(bins = 40,na.rm = TRUE)
# log of speechiness
ggplot(popular_music, aes(x= log(speechiness))) + # roughly normal
geom_histogram(bins = 40,na.rm = TRUE)
# danceability
ggplot(popular_music, aes(x= danceability)) + # roughly normal
geom_histogram(bins = 40,na.rm = TRUE)
# duration_ms
ggplot(popular_music, aes(x= duration_ms)) + # nearly normal
geom_histogram(bins = 40,na.rm = TRUE)
# energy
ggplot(popular_music, aes(x= energy)) + # roughly normal
geom_histogram(bins = 35,na.rm = TRUE)
# instrumentalness
ggplot(popular_music, aes(x= instrumentalness)) + #not at all normal
geom_histogram(bins = 40,na.rm = TRUE)
# valence
ggplot(popular_music, aes(x= valence)) + # roughly normal
geom_histogram(bins = 40,na.rm = TRUE)
# tempo
ggplot(popular_music, aes(x= tempo)) +
geom_histogram(bins = 30,na.rm = TRUE) # nearly normal
# loudness
ggplot(popular_music, aes(x= loudness)) + # roughly normal
geom_histogram(bins = 40,na.rm = TRUE)
# acousticness
ggplot(popular_music, aes(x= acousticness)) + # not normal at all
geom_histogram(bins = 40,na.rm = TRUE)
# liveness
ggplot(popular_music, aes(x= liveness)) + # roughly normal
geom_histogram(bins = 40,na.rm = TRUE)
Most of variables were nearly normal to roughly normal with the
exceptions of speechiness
, instrumentallness
and acousticness
. After a log transformation on
speechiness
the distribution is closer to roughly normal. I
will omit instrumentallness
and acousticness
since they are not at all normal and instead represent probability of
the dichotomous values of 0 and 1. This pseudo dichotomous variable
status is also found in liveness, so we will omit this predictor as
well.
Now we want to know which linear regression model will be best to use with which predictors. We will use a backward step function to choose the best fit model.
# Fit the initial full model
initial_model <- lm(track_popularity ~ log(speechiness) + danceability + duration_ms + energy + valence + key + tempo + loudness, data = popular_music)
# Perform backward elimination
backward_model <- step(initial_model, direction = "backward")
## Start: AIC=26351.27
## track_popularity ~ log(speechiness) + danceability + duration_ms +
## energy + valence + key + tempo + loudness
##
## Df Sum of Sq RSS AIC
## - key 11 5528.8 1624424 26344
## - duration_ms 1 4.2 1618900 26349
## - valence 1 594.0 1619489 26351
## <none> 1618895 26351
## - log(speechiness) 1 820.7 1619716 26352
## - tempo 1 889.8 1619785 26352
## - danceability 1 1398.0 1620293 26353
## - energy 1 2072.4 1620968 26355
## - loudness 1 8593.3 1627489 26373
##
## Step: AIC=26344.5
## track_popularity ~ log(speechiness) + danceability + duration_ms +
## energy + valence + tempo + loudness
##
## Df Sum of Sq RSS AIC
## - duration_ms 1 0.8 1624425 26342
## - log(speechiness) 1 522.9 1624947 26344
## <none> 1624424 26344
## - valence 1 749.9 1625174 26345
## - tempo 1 947.5 1625372 26345
## - danceability 1 1771.9 1626196 26347
## - energy 1 2216.5 1626641 26349
## - loudness 1 8377.2 1632801 26366
##
## Step: AIC=26342.5
## track_popularity ~ log(speechiness) + danceability + energy +
## valence + tempo + loudness
##
## Df Sum of Sq RSS AIC
## - log(speechiness) 1 529.6 1624955 26342
## <none> 1624425 26342
## - valence 1 751.5 1625176 26343
## - tempo 1 947.6 1625373 26343
## - danceability 1 1805.5 1626230 26346
## - energy 1 2246.8 1626672 26347
## - loudness 1 8405.8 1632831 26364
##
## Step: AIC=26341.95
## track_popularity ~ danceability + energy + valence + tempo +
## loudness
##
## Df Sum of Sq RSS AIC
## <none> 1624955 26342
## - valence 1 790.6 1625745 26342
## - tempo 1 861.4 1625816 26342
## - danceability 1 1474.8 1626429 26344
## - energy 1 2133.3 1627088 26346
## - loudness 1 8525.6 1633480 26363
# Summary of the selected model
summary(backward_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + valence +
## tempo + loudness, data = popular_music)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.648 -13.591 1.509 15.980 46.096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.30952 2.59749 19.368 < 2e-16 ***
## danceability 4.04065 2.00857 2.012 0.0443 *
## energy 4.85833 2.00798 2.420 0.0156 *
## valence -2.05925 1.39806 -1.473 0.1408
## tempo 0.01567 0.01019 1.537 0.1243
## loudness 0.35488 0.07337 4.837 1.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.09 on 4459 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.04094, Adjusted R-squared: 0.03986
## F-statistic: 38.07 on 5 and 4459 DF, p-value: < 2.2e-16
The model chosen doesn’t seem very good as it only explains about 4%
of the variance in the response variable per the adjusted R-squared
value. The model also uses 2 predictors, valence
and
tempo
, that do not seem to be significant in predicting our
response variable. There may also be issues will colinearity since
danceability
is defined as a score describing how suitable
a track is for dancing based on tempo, rhythm stability, beat strength
and overall regularity. Energy
is a measure of intensity
based on how fast, loud, and noisy a track is. Loudness
and
tempo
are how loud and fast a track can be, which means
there is overlap and possible colinearity between these variables and
danceability
and energy
.
mod <- lm(data = popular_music, formula = track_popularity ~ danceability + energy + valence + tempo + loudness)
pred <- predict(mod)
pred <- append(pred,54.27)
# Another view of response variable
x <- seq(from = 1, to = 100, along.with = popular_music$track_popularity)
ggplot(popular_music, aes(x = x, y = track_popularity)) +
geom_point() +
geom_smooth(method = "glm", formula = y ~ poly(x,1))
# track_popularity vs predicted values
ggplot(popular_music, aes(x = track_popularity, y = pred)) +
geom_point()
# residuals vs fitted values
ggplot(data = mod,aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
Looking at our plot of the response variable over the span of it’s values on the x-axis we see an almost stepwise function that would be difficult to fit with a linear model. To get a real sense if a linear model would be a best fit we look at the plot of residuals vs fitted values. The plot shows clumping of data points on the right while fanning out to the left. This is indicative of violating assumptions of linearity and homoscedasticity in the residuals. The conclusion is that a linear model is not a good fit.
After using backward elimination to choose our best fit model for
predicting track_popularity
we interpreted the results of
the summary of the regression model. The adjusted R-squared valued
showed that the model only explained 4% of the variance in
track_popularity
which seemed like the model was not better
than then just taking the average to guess the
track_popularity
. For good measure we also plotted out the
residuals vs fitted values to measure the assumptions of linearity and
homoscedasticity in the residuals. The plot pointed to a violation of
both of these assumptions which points to the conclusion that a linear
model is not the best fit for predicting
track_popularity
.
References
Ameh, S. (2025, January 7). Spotify Music dataset. Kaggle. https://www.kaggle.com/datasets/solomonameh/spotify-music-dataset