Introduction and Background
The dataset used here originally contained over 5,000 songs that
were available on the music-streaming service Spotify over the course of
2024. These tracks span a wide array of genres and
styles. The dataset is publicly available on the platform
Kaggle and is cited in the References section below. Although the
data originally contained 29 variables, only 11 play a particularly
relevant role in the analysis I wish to do here. A breakdown of those
variables, their meanings and data types is below:
Name | Meaning | Data_Type
--- | --- | ---
track_popularity | song’s rating on the 0 - 100 metric | double
popular | how popular the track is (more on that later) | character
track_name | name of the song | character
track_artist | artist’s stage name | character
playlist_genre | genre that Spotify classifies the song in | character
playlist_subgenre | subgenre that Spotify classifies the song in | character
tempo | the speed of a track, measured in beats per minute (BPM) | double
danceability | a score describing how suitable a track is for dancing based on tempo, rhythm stability, beat strength and overall regularity (0 to 1 scale) | double
speechiness | the presence of audible words (0 to 1 scale) | double
instrumentalness | likelihood that the track contains instrumentals (0 to 1 scale) | double
duration_mins | length of track in minutes | double
I chose to disregard many of the other variables in the dataset because
they pertained either to unnecessary technical information (such as the
algorithmic identifier for the particular song or its album), or to
extremely niche musical topics that the average listener or introductory
critic would most likely not pick up on (such as mode and key).
Objective for my Analysis
Using this dataset, I wish to quantitatively examine what musical
qualities and characteristics are associated with a song (or “track”) becoming
extremely successful (hence popular) on Spotify. As of October of 2025,
Spotify has a market cap of over 139 billion dollars and close to 700
million users. This means that for any aspiring musician, “making it
big” on Spotify would serve as a monumental and life-altering
milestone.
Spotify has a metric known as the popularity index which serves to
numerically represent the popularity of tracks on its platform. While
the exact formula behind the score is kept confidential, critics in the
music industry and those with strong technical backgrounds have
theorized that the primary factors which influence a track’s score are
the following: its total number of streams, its recent number of
streams, its number of downloads and saves, and its month-to-month
growth in each of these metrics.
This popularity index operates on a scale ranging from 0 to 100, with
only the most well-established stars reaching scores of 90 and
higher. That being said, any track with a rating of 75 or higher is
largely considered a “top track” by industry experts, so that is the
interpretation I will apply in my analysis. I created a binary
categorical variable “popular”, which takes the value “yes” for tracks
with a popularity index rating of 75 or higher, and “no” for tracks with
a lower rating. I will perform logistic regression to assess which, if
any, of the factors mentioned in my variable breakdown above are
associated with a track being deemed popular.
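The thresholding step described above can be sketched as follows. This is only a minimal illustration on a small made-up data frame (`toy_tracks` and its values are hypothetical and not taken from the Kaggle dataset); the report's own preprocessing code is not shown, so the exact call used may differ.

```r
# Hypothetical toy data; the real dataset stores the score in track_popularity
toy_tracks = data.frame(
  track_name = c("A", "B", "C", "D"),
  track_popularity = c(12, 75, 74, 93)
)

# Label a track "yes" when its popularity index is 75 or higher, "no" otherwise
toy_tracks$popular = ifelse(toy_tracks$track_popularity >= 75, "yes", "no")

print(toy_tracks)
table(toy_tracks$popular)
```

Note that the cutoff is inclusive, so a track scoring exactly 75 is labeled “yes”.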
Simple Logistic Regression Model
Checking for Pairwise Relationships
Before jumping into model creation, I created a correlation matrix
for the quantitative predictor variables I am going to use in my
analysis. Predictors that are very highly correlated (|r| >= 0.8)
carry nearly the same information (multicollinearity), which makes
coefficient estimates unstable and can lead to inaccurate conclusions,
especially when performing multiple regression.
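As a rough sketch of how such a screen could be automated, the snippet below flags every pair of variables whose absolute correlation reaches the 0.8 cutoff. It runs on simulated data (`sim_vars` and its columns `u`, `v`, `w` are hypothetical stand-ins, not the report's `Quantitative_Variables`).

```r
set.seed(1)
# Simulated stand-in for the quantitative predictors (values are made up)
n = 200
base = rnorm(n)
sim_vars = data.frame(
  u = base,
  v = base + rnorm(n, sd = 0.1),  # nearly a copy of u, so highly correlated
  w = rnorm(n)                    # generated independently of u and v
)

cor_mat = cor(sim_vars)
# Keep each pair once (upper triangle) and flag |r| >= 0.8
high = which(abs(cor_mat) >= 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
flagged = data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
print(flagged)
```

Using the absolute value matters here: a strong negative correlation is just as problematic for multiple regression as a strong positive one.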
cor_matrix = cor(Quantitative_Variables)
# No high correlations between the variables
kable(cor_matrix, caption = "Correlation Matrix")
Correlation Matrix
Variable | tempo | danceability | speechiness | instrumentalness | duration_mins
--- | --- | --- | --- | --- | ---
tempo | 1.0000000 | 0.0338179 | 0.0576956 | -0.1518469 | 0.0397516
danceability | 0.0338179 | 1.0000000 | 0.2840300 | -0.3205070 | -0.1374377
speechiness | 0.0576956 | 0.2840300 | 1.0000000 | -0.1903130 | -0.1050720
instrumentalness | -0.1518469 | -0.3205070 | -0.1903130 | 1.0000000 | -0.2218689
duration_mins | 0.0397516 | -0.1374377 | -0.1050720 | -0.2218689 | 1.0000000
After creating a correlation matrix, we can see that none of the
quantitative predictor variables that we are examining have a high
correlation with one another. Practically speaking, this is not very
surprising as even “similar” songs by the same artist or within the same
genre can have wildly different characteristics when it comes to tempo
or anything else we are looking at.
Picking Variable for Simple Model
Below I created histograms to represent the distributions of all five
quantitative variables I am going to use in this analysis. The purpose
is to determine whether any variable transformation
(typically a log or square root transformation) is
necessary before using the variable in our logistic model.
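A numeric check can complement the visual one. The sketch below, run on simulated data rather than the Spotify variables, computes a simple moment-based skewness before and after a square root transformation. The helper `skewness` is defined here for illustration (base R does not ship one), and the exponential sample is only an assumed stand-in for a right-skewed variable.

```r
set.seed(42)
# Simulated right-skewed variable (made-up values, not the real data)
skewed = rexp(1000, rate = 5)

# Simple moment-based sample skewness; roughly 0 for a symmetric distribution
skewness = function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw = skewness(skewed)
skew_sqrt = skewness(sqrt(skewed))
c(raw = skew_raw, sqrt = skew_sqrt)
```

A transformation is worth considering when it brings the skewness noticeably closer to 0, as the square root does for this simulated sample.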
# One density-scaled histogram with a smoothed density overlay per predictor
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1)) # 2 x 3 grid; mar controls margins
plot_vars = c(tempo = "Tempo", danceability = "Danceability", speechiness = "Speechiness",
instrumentalness = "Instrumentalness", duration_mins = "Length (mins)")
for (v in names(plot_vars)) {
hist(Work_Data[[v]], probability = TRUE, main = paste(plot_vars[[v]], "Distribution"),
xlab = plot_vars[[v]], col = "azure1", border = "lightseagreen")
lines(density(Work_Data[[v]], adjust = 2), col = "blue")
}

We can see from the histograms that there is an extreme skew in
speechiness, instrumentalness and length. While there are slight skews
in tempo and danceability, they are far more moderate. Since
tempo appears to be the least skewed of the variables, and its
calculation is the most straightforward (a simple measure of beats per
minute), I will use tempo to perform my simple logistic regression.
Creating the Model
Using built-in R functions, I created a simple binary logistic
regression model in which the tempo of a song is used to predict
whether or not that track is popular (popularity score of 75 or
higher). In the calculations, the base level, or 0, represents
tracks whose popularity is marked as “no”, while the
success level, or 1, represents tracks whose popularity is marked as
“yes”.
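The 0/1 coding described above can be checked with a small sketch. The labels below are hypothetical; the point is that when the response passed to `glm()` with a binomial family is a factor, the first level denotes failure and the remaining level success, so “no” needs to be the first level.

```r
# Hypothetical labels standing in for the popular column
popular_chr = c("no", "yes", "no", "no", "yes")

# Set the level order explicitly so that "no" is the base (failure) level
popular_fac = factor(popular_chr, levels = c("no", "yes"))

levels(popular_fac)           # the first level is the base level
as.integer(popular_fac) - 1   # the implied 0/1 coding
```

Setting `levels` explicitly avoids relying on alphabetical ordering, which happens to give the same result for “no” and “yes” but would not for other labels.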
Log_Model = glm(formula = popular ~ tempo, family = binomial(link = "logit"),
data = Work_Data)
Log_Model_Coef_Summary = summary(Log_Model)$coef # estimates, standard errors, z values and p-values
Confidence_Interval = confint(Log_Model) # 95% confidence intervals of betas
Summary_Stats = cbind(Log_Model_Coef_Summary, Confidence_Interval.95=Confidence_Interval) # append the intervals to the coefficient table
kable(Summary_Stats,caption = "Simple Logistic Regression Model Summary")
Simple Logistic Regression Model Summary
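Term | Estimate | Std. Error | z value | p-value | 2.5 % | 97.5 %
--- | --- | --- | --- | --- | --- | ---
(Intercept) | -2.2643674 | 0.2126810 | -10.646777 | 0.000000 | -2.6844577 | -1.8504312
tempo | 0.0060146 | 0.0017127 | 3.511866 | 0.000445 | 0.0026561 | 0.0093727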
The summary of our model above leads us to believe that a track’s
tempo does have a positive association with the likelihood that a track
is popular on Spotify; however, that association is very small in
practical effect. Despite a statistically significant p-value, the
estimated slope \(\beta_1\) is only about 0.006 on the log-odds scale:
each additional beat per minute raises the log-odds of a track being
deemed popular by roughly 0.006, with a 95% confidence interval of
[0.003, 0.009].
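To make the scale of the coefficients concrete, the fitted model can be written out with the estimates from the summary table, where \(p\) denotes the probability that a track is popular:

\[
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{tempo} \approx -2.264 + 0.0060\,\text{tempo}
\]

As a worked example (100 and 120 BPM are arbitrary illustrative tempos, not values singled out in the analysis):

\[
\text{tempo} = 100:\quad \log\frac{p}{1-p} \approx -1.663, \qquad p = \frac{1}{1+e^{1.663}} \approx 0.159
\]

\[
\text{tempo} = 120:\quad \log\frac{p}{1-p} \approx -1.543, \qquad p = \frac{1}{1+e^{1.543}} \approx 0.176
\]

Under this model, a 20 BPM increase therefore moves the predicted probability by less than two percentage points.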
Odds Ratio Summary
While the direct interpretation of a logistic regression model’s
summary statistics can be helpful, it is sometimes more practical to
examine our model’s odds ratio. The odds of an event are the probability
of success divided by the probability of failure. For example, odds
of 2 would mean that whatever instance we are considering a
“success” occurs twice as often as it fails to occur. The odds ratio for
a predictor is the factor by which those odds are multiplied when the
predictor increases by one unit.
# Odds ratio
Log_Model_Coef_Summary = summary(Log_Model)$coef
Odds_Ratio = exp(coef(Log_Model))
out.stats = cbind(Log_Model_Coef_Summary, odds.ratio = Odds_Ratio)
kable(out.stats,caption = "Regression Model Coefficients W/ Odds Ratios")
Regression Model Coefficients W/ Odds Ratios
Term | Estimate | Std. Error | z value | p-value | odds.ratio
--- | --- | --- | --- | --- | ---
(Intercept) | -2.2643674 | 0.2126810 | -10.646777 | 0.000000 | 0.1038957
tempo | 0.0060146 | 0.0017127 | 3.511866 | 0.000445 | 1.0060327
The odds ratio between a track’s tempo and its popularity status is
~ 1.006. Since tempo is measured in beats per minute (BPM), this means
that for every additional beat per minute that a song is recorded in,
its odds of having a popularity score of 75 or higher increase by
about 0.6%. (The exponentiated intercept, ~ 0.104, is not an odds ratio;
it is the model’s estimated odds of popularity at a tempo of 0 BPM.)
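The distinction between a change in odds and a change in probability can be illustrated with a short sketch that uses the coefficient estimates reported in the table above. The tempos of 100 and 110 BPM are arbitrary example values chosen for illustration, and `odds` is a small helper defined here.

```r
# Coefficient estimates copied from the model summary table
b0 = -2.2643674
b1 = 0.0060146

# Odds ratios for a 1 BPM and a 10 BPM increase in tempo
or_1 = exp(b1)
or_10 = exp(10 * b1)

# Predicted probabilities at two illustrative tempos
p_100 = plogis(b0 + b1 * 100)
p_110 = plogis(b0 + b1 * 110)

# Helper: convert a probability to odds
odds = function(p) p / (1 - p)

# The ratio of the odds at 110 and 100 BPM reproduces or_10, while the
# probability itself changes by a different (smaller) amount
c(or_1 = or_1, or_10 = or_10,
  odds_ratio_check = odds(p_110) / odds(p_100),
  prob_change = p_110 - p_100)
```

The odds ratio is constant across tempos, whereas the change in probability depends on where along the curve the comparison is made.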
Goodness-of-fit Measures
Below I calculated the goodness-of-fit (GOF) measures for the
logistic regression model. Generally speaking, GOF measures such as the
deviance and AIC are most useful for comparing candidate models fitted
to the same data. In this instance, we have only
created one model so far, so the measures are not of practical use at
the moment.
Residual_Deviance = Log_Model$deviance
Null_Residual_Deviance = Log_Model$null.deviance
aic = Log_Model$aic
Goodness_Of_Fit = cbind(Deviance.residual = Residual_Deviance, Null.Deviance.Residual = Null_Residual_Deviance,
AIC = aic)
pander(Goodness_Of_Fit)
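One comparison that is available even with a single model is against the intercept-only (null) model, using the same `deviance` and `null.deviance` components extracted above. The sketch below goes slightly beyond what the report computes: it runs on simulated data (`tempo_sim`, `popular_sim` and the generating coefficients are made up) and derives a likelihood-ratio test and McFadden's pseudo R-squared.

```r
set.seed(7)
# Simulated data standing in for Work_Data (values are made up)
n = 500
tempo_sim = runif(n, 60, 200)
popular_sim = rbinom(n, 1, plogis(-2.26 + 0.006 * tempo_sim))
fit = glm(popular_sim ~ tempo_sim, family = binomial)

# Likelihood-ratio statistic: drop in deviance from the intercept-only model
lr_stat = fit$null.deviance - fit$deviance
lr_df = fit$df.null - fit$df.residual
lr_pval = pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared; for a 0/1 response the deviance equals
# -2 * log-likelihood, so this is 1 - logLik(model) / logLik(null)
mcfadden_r2 = 1 - fit$deviance / fit$null.deviance

c(LR = lr_stat, df = lr_df, p.value = lr_pval, McFadden.R2 = mcfadden_r2)
```

A small pseudo R-squared alongside a significant likelihood-ratio test would be consistent with a predictor that is statistically detectable but explains little.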
Success Probability Curve
Below I created a visual of the model’s S curve (also called
success probability curve or logistic curve). The S curve helps us see
the change in probability of an event happening (a track being popular
or not on Spotify), in relation to a change in the value of a predictor
variable (tempo in BPM in this instance).
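# Evaluate the fitted model over the observed range of tempo
tempo_range = range(Work_Data$tempo)
x = seq(tempo_range[1], tempo_range[2], length.out = 200)
beta.x = coef(Log_Model)[1] + coef(Log_Model)[2]*x # linear predictor (log-odds)
success.prob = exp(beta.x)/(1+exp(beta.x)) # probability that popular is "yes"
failure.prob = 1/(1+exp(beta.x)) # probability that popular is "no"
ylimit = max(success.prob, failure.prob)
# Derivative of the success probability with respect to tempo
beta1 = coef(Log_Model)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
# Plot the success probability curve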
par(mfrow = c(1,1))
plot(x, success.prob, type = "l", lwd = 2, col = "black",
main = "The Probability of a Track Being \n Considered Popular on Spotify",
ylim=c(0, 1.1*ylimit),
xlab = "Tempo (BPM)",
ylab = "probability",
axes = FALSE,
col.main = "black",
cex.main = 1.1)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)

# Plot the rate of change of the success probability
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "black",
main = "The Rate of Change in Probability of a Track Being \n Considered Popular on Spotify",
xlab = "Tempo (BPM)",
ylab = "Rate of Change",
ylim=c(0,1.1*y.rate),
axes = FALSE,
col.main = "black",
cex.main = 1.1
)
axis(1, pos = 0)
axis(2)

The S curve provides visual confirmation of the conclusions we were
able to draw from analyzing our coefficient summary, confidence
intervals and odds ratio: there is a positive association
between a track’s tempo and its likelihood of being a popular track on
Spotify, but that association is minimal in
practical effect.
We can see from our second plot, which shows the rate of change in that
probability as a function of tempo, that
between 50 and 200 beats per minute the rate of change only
rises from about 0.0006 to 0.0012 per additional beat per
minute.
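The quantity plotted as the rate of change follows from differentiating the logistic curve with respect to tempo \(x\). Writing \(p(x)\) for the success probability:

\[
p(x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}},
\qquad
\frac{dp}{dx} = \beta_1\,\frac{e^{\beta_0+\beta_1 x}}{\left(1+e^{\beta_0+\beta_1 x}\right)^2} = \beta_1\,p(x)\,\bigl(1-p(x)\bigr)
\]

Because \(p(1-p)\) is largest at \(p = 0.5\), the rate can never exceed \(\beta_1/4\). With the estimates from the summary table:

\[
\max_x \frac{dp}{dx} = \frac{\beta_1}{4} \approx \frac{0.0060}{4} \approx 0.0015,
\qquad
\text{attained at } x = -\frac{\beta_0}{\beta_1} \approx \frac{2.264}{0.0060} \approx 376 \text{ BPM}
\]

That tempo lies far above the 50 to 200 BPM range discussed above, which is why the rate of change keeps rising across that range without reaching its ceiling.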
Conclusion
To conclude, we can say that although there is a statistically
significant and positive relationship between a track’s tempo and the
probability that it is popular on Spotify, that association is of such a
small magnitude that it is practically a non-factor.
Therefore, it would be overly simplistic for music labels
or agents to aggressively push their artists towards a more up-tempo
production, as the return on such a decision would likely not be worth
the cost of souring the relationship with the artist at hand.
