1 Introduction and Background

The dataset analyzed here was originally composed of over 5,000 songs that were available on the music-streaming service Spotify over the course of 2024. These tracks span a wide array of genres and production styles. The dataset is publicly available on Kaggle and is listed in the References section below. Although the data originally contained 29 variables, only 11 play a relevant role in the analysis I wish to do here. A breakdown of those variables, their meanings and their data types is below:

| Name | Meaning | Data_Type |
|---|---|---|
| track_popularity | song’s rating on the 0 - 100 popularity index | double |
| popular | whether the track counts as popular (more on that later) | character |
| track_name | name of the song | character |
| track_artist | artist’s stage name | character |
| playlist_genre | genre that Spotify classifies the song in | character |
| playlist_subgenre | subgenre that Spotify classifies the song in | character |
| tempo | speed of the track, measured in beats per minute (BPM) | double |
| danceability | score describing how suitable a track is for dancing, based on tempo, rhythm stability, beat strength and overall regularity (0 to 1 scale) | double |
| speechiness | presence of spoken words in the track (0 to 1 scale) | double |
| instrumentalness | likelihood that the track contains no vocals (0 to 1 scale) | double |
| duration_mins | length of the track in minutes | double |

Many of the other variables in the dataset I chose to disregard as they either pertained to unnecessary technical information (such as the algorithmic identifier for the particular song or its album), or extremely niche musical topics that the average listener or introductory critic would most likely not pick up on (such as mode and key).

1.1 Objective for my Analysis

Using this dataset, I wish to quantitatively examine which musical qualities and characteristics are associated with a song (or “track”) becoming extremely successful, and hence popular, on Spotify. As of October of 2025, Spotify has a market cap of over 139 billion dollars and close to 700 million users. This means that for any aspiring musician, “making it big” on Spotify would serve as a monumental and life-altering milestone.

Spotify has a metric known as the popularity index, which numerically represents the popularity of tracks on its platform. While the exact formula behind the score is kept confidential, critics in the music industry and those with strong technical backgrounds have theorized that the primary factors influencing a track’s score are the following: its total number of streams, its recent number of streams, its number of downloads and saves, and its month-to-month growth in all three of those metrics.

This popularity index operates on a scale ranging from 0 to 100, with only the most well-established stars reaching scores of 90 and higher. That being said, any track with a rating of 75 or higher is largely considered a “top track” by industry experts, so that is the interpretation I will apply in my analysis. I created a binary categorical variable, “popular”, which takes the value “yes” for tracks with a popularity index rating of 75 or higher and “no” for tracks with a lower rating. I will perform logistic regression to assess which, if any, of the factors in my variable breakdown above are associated with a track being deemed popular.
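The thresholding step described above can be sketched as follows. This is a minimal illustration on a made-up three-row data frame (toy_tracks is a hypothetical name, not the real data); the column names track_popularity and popular follow the variable breakdown. Converting the labels to a factor is my own assumption about how the base level could be made explicit, not necessarily how the original variable was stored.

```r
# Toy stand-in for the real data; the values are invented for illustration
toy_tracks = data.frame(
  track_name = c("Song A", "Song B", "Song C"),
  track_popularity = c(91, 75, 42)
)

# "yes" when the popularity index is 75 or higher, otherwise "no"
toy_tracks$popular = ifelse(toy_tracks$track_popularity >= 75, "yes", "no")

# Assumption: storing the labels as a factor with "no" first makes "no" the
# base level (0) and "yes" the modeled level (1) when passed to glm()
toy_tracks$popular = factor(toy_tracks$popular, levels = c("no", "yes"))

table(toy_tracks$popular)
```

Note that a track sitting exactly at 75 falls on the “yes” side of the cutoff.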

2 Simple Logistic Regression Model

2.1 Checking for Pairwise Relationships

Before jumping into model creation, I created a correlation matrix for the quantitative predictor variables I am going to use in my analysis. Variables with very high pairwise correlations (|r| >= 0.8) carry nearly the same information as one another, which can lead to unstable coefficient estimates and inaccurate conclusions, especially when performing multiple regression.

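library(knitr)   # provides kable()

# Pairwise Pearson correlations among the five quantitative predictors
cor_matrix = cor(Quantitative_Variables)
  # No high correlations between the variables

kable(cor_matrix, caption = "Correlation Matrix")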
Correlation Matrix

| | tempo | danceability | speechiness | instrumentalness | duration_mins |
|---|---|---|---|---|---|
| tempo | 1.0000000 | 0.0338179 | 0.0576956 | -0.1518469 | 0.0397516 |
| danceability | 0.0338179 | 1.0000000 | 0.2840300 | -0.3205070 | -0.1374377 |
| speechiness | 0.0576956 | 0.2840300 | 1.0000000 | -0.1903130 | -0.1050720 |
| instrumentalness | -0.1518469 | -0.3205070 | -0.1903130 | 1.0000000 | -0.2218689 |
| duration_mins | 0.0397516 | -0.1374377 | -0.1050720 | -0.2218689 | 1.0000000 |

The correlation matrix shows that none of the quantitative predictor variables we are examining are highly correlated with one another. The strongest pairwise correlation is between danceability and instrumentalness (r = -0.32), well short of the 0.8 threshold. Practically speaking, this is not very surprising, as even “similar” songs by the same artist or within the same genre can have wildly different characteristics when it comes to tempo or anything else we are looking at.
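For a matrix with more predictors, scanning by eye becomes error-prone, so the same check could be automated. The sketch below is only an illustration on simulated data (sim, x1, x2 and x3 are invented names and values, not the Spotify variables); it lists every pair whose absolute correlation meets the 0.8 cutoff.

```r
# Simulated predictors standing in for the real data (values are invented)
set.seed(1)
n = 200
x1 = rnorm(n)
sim = data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),   # built to be almost a copy of x1
  x3 = rnorm(n)                   # unrelated to the others
)

cor_sim = cor(sim)

# Look only at the upper triangle so each pair is reported once,
# and compare the absolute value of r against the 0.8 cutoff
high = which(abs(cor_sim) >= 0.8 & upper.tri(cor_sim), arr.ind = TRUE)

flagged = data.frame(
  var1 = rownames(cor_sim)[high[, "row"]],
  var2 = colnames(cor_sim)[high[, "col"]],
  r    = cor_sim[high]
)
flagged
```

Applied to the correlation matrix above, a check of this kind would return no rows, since no pair comes close to the cutoff.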

2.2 Picking Variable for Simple Model

Below I created histograms to represent the distributions of all five quantitative variables I am going to use in this analysis. The purpose is to determine whether any variable transformation (typically a log or square root transformation) is necessary before using a variable in our logistic model.

par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))  # 2 x 3 panel grid; mar controls margins

# Variables to plot, paired with their axis labels
plot_vars = c(tempo = "Tempo", danceability = "Danceability",
              speechiness = "Speechiness", instrumentalness = "Instrumentalness",
              duration_mins = "Length (mins)")

for (v in names(plot_vars)) {
  # Density-scaled histogram with a smoothed kernel density overlay
  hist(Work_Data[[v]], probability = TRUE,
       main = paste(plot_vars[[v]], "Distribution"), xlab = plot_vars[[v]],
       col = "azure1", border = "lightseagreen")
  lines(density(Work_Data[[v]], adjust = 2), col = "blue")
}

We can see from the histograms that there is an extreme skew in speechiness, instrumentalness and length. While there are slight skews in tempo and danceability, they are far more moderate. Since tempo appears to be the least skewed of the variables, and its calculation is the most straightforward (a simple measure of beats per minute), I will use tempo to perform my simple logistic regression.
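The visual impression of skew could also be backed by a number. The sketch below computes a moment-based sample skewness; sample_skewness is a hypothetical helper written for this illustration (not a base R function), and the two vectors are simulated, not drawn from the Spotify data.

```r
# Moment-based sample skewness: mean cubed deviation over (variance)^(3/2).
# Values near 0 suggest symmetry; larger magnitudes suggest stronger skew.
sample_skewness = function(x) {
  x = x[!is.na(x)]
  m = mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

# Simulated examples (not the Spotify data)
set.seed(42)
symmetric_x = rnorm(5000)   # roughly symmetric
skewed_x    = rexp(5000)    # strongly right-skewed

c(symmetric = sample_skewness(symmetric_x),
  skewed    = sample_skewness(skewed_x))
```

Assuming Work_Data holds the five predictors under the names used above, the same helper could be applied column by column with sapply() to rank the variables from least to most skewed.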

2.3 Creating the Model

Using built-in R functions, I created a simple binary logistic regression model in which the tempo of a song is used to predict whether or not that track is popular (a popularity score of 75 or higher). In the calculations, the base level, coded 0, corresponds to tracks whose popularity is marked “no”, while the modeled level, coded 1, corresponds to tracks whose popularity is marked “yes”.

Log_Model = glm(formula = popular ~ tempo, family = binomial(link = "logit"), 
    data = Work_Data)

Log_Model_Coef_Summary = summary(Log_Model)$coef       # estimates, standard errors, z values, p values
Confidence_Interval = confint(Log_Model)               # 95% profile-likelihood confidence intervals of betas
Summary_Stats = cbind(Log_Model_Coef_Summary, Confidence_Interval)   # combine into a single table
kable(Summary_Stats, caption = "Simple Logistic Regression Model Summary")
Simple Logistic Regression Model Summary

| | Estimate | Std. Error | z value | Pr(>|z|) | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| (Intercept) | -2.2643674 | 0.2126810 | -10.646777 | 0.000000 | -2.6844577 | -1.8504312 |
| tempo | 0.0060146 | 0.0017127 | 3.511866 | 0.000445 | 0.0026561 | 0.0093727 |

The summary of our model above suggests that a track’s tempo has a positive association with the likelihood that the track is popular on Spotify, but that association is very small in practical effect. Although the p-value is statistically significant (p ≈ 0.0004), the estimate of \(\beta_1\) is only about 0.006 on the log-odds scale: each additional beat per minute raises the log-odds of a track being deemed popular by roughly 0.006, with a 95% confidence interval of [0.003, 0.009].
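Because the coefficients live on the log-odds scale, it can help to convert them into predicted probabilities at a few tempos. The sketch below plugs the estimates from the summary table into the inverse logit; the three tempos are arbitrary values chosen for illustration, and the result is only as precise as the rounded coefficients.

```r
# Coefficients copied from the model summary table
b0 = -2.2643674   # intercept
b1 = 0.0060146    # tempo slope, on the log-odds scale

tempos = c(60, 120, 180)       # illustrative tempos in BPM
log_odds = b0 + b1 * tempos    # linear predictor
prob = plogis(log_odds)        # inverse logit: exp(z) / (1 + exp(z))

data.frame(tempo = tempos, log_odds = log_odds, prob = round(prob, 3))
```

At 120 BPM, for instance, the fitted probability of a track being popular works out to roughly 0.18.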

2.4 Odds Ratio Summary

While the direct interpretation of a logistic regression model’s summary statistics can be helpful, it is sometimes more practical to examine our model’s odds ratio. The odds of an event are the probability of success divided by the probability of failure. For example, odds of 2 would mean that whatever instance we are considering a “success” occurs twice as often as it fails to occur. An odds ratio compares two such odds: here, the odds of a track being popular after a one-unit increase in the predictor, divided by the odds before that increase. It is obtained by exponentiating the model coefficient.

# Odds ratio
Log_Model_Coef_Summary = summary(Log_Model)$coef
Odds_Ratio = exp(coef(Log_Model))
out.stats = cbind(Log_Model_Coef_Summary, odds.ratio = Odds_Ratio)                 
kable(out.stats,caption = "Regression Model Coefficients W/ Odds Ratios")
Regression Model Coefficients W/ Odds Ratios

| | Estimate | Std. Error | z value | Pr(>|z|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -2.2643674 | 0.2126810 | -10.646777 | 0.000000 | 0.1038957 |
| tempo | 0.0060146 | 0.0017127 | 3.511866 | 0.000445 | 1.0060327 |

The odds ratio associated with a track’s tempo is ~ 1.006. Since tempo is measured in beats per minute (BPM), this means that each additional beat per minute multiplies the odds of a track having a popularity score of 75 or higher by about 1.006, an increase of roughly 0.6% in the odds (not in the probability itself).
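A one-BPM change is a very small step musically, so it may be more informative to express the odds ratio for a larger change. The sketch below reuses the tempo coefficient and confidence limits from the summary table; the 10 BPM step is an arbitrary choice made for illustration.

```r
b1 = 0.0060146                 # tempo coefficient from the summary table
ci = c(0.0026561, 0.0093727)   # 95% confidence limits for b1

or_1bpm  = exp(b1)             # odds ratio for +1 BPM
or_10bpm = exp(10 * b1)        # odds ratio for +10 BPM
or_ci    = exp(ci)             # confidence interval on the odds-ratio scale

round(c(or_1bpm = or_1bpm, or_10bpm = or_10bpm), 4)
round(or_ci, 4)
```

Under this model, a 10 BPM increase multiplies the odds of being popular by about 1.06, which is still a modest change.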

2.5 Goodness-of-fit measures

Below I calculated the goodness-of-fit (GOF) measures for the logistic regression model. Generally speaking, it is beneficial to compute a model’s GOF measures because they allow the relative fit of multiple candidate models to be compared. In this instance, we have only created one model so far, so the measures are of limited practical use at the moment.

Residual_Deviance = Log_Model$deviance              # deviance of the fitted model
Null_Residual_Deviance = Log_Model$null.deviance    # deviance of the intercept-only model
aic = Log_Model$aic
Goodness_Of_Fit = cbind(Residual.Deviance = Residual_Deviance, Null.Deviance = Null_Residual_Deviance,
      AIC = aic)
pander(Goodness_Of_Fit)

| Residual.Deviance | Null.Deviance | AIC |
|---|---|---|
| 2451 | 2464 | 2455 |
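Even with a single model, the two deviances can be compared with each other. The sketch below runs a likelihood-ratio (drop-in-deviance) test of the tempo model against the intercept-only model. It uses the rounded values printed above rather than the exact model object, so the statistic is approximate, and the AIC check assumes ungrouped binary responses.

```r
# Deviances as reported (rounded) in the goodness-of-fit output
null_deviance     = 2464
residual_deviance = 2451

# Drop in deviance from adding tempo, on 1 degree of freedom
lr_stat = null_deviance - residual_deviance
p_value = pchisq(lr_stat, df = 1, lower.tail = FALSE)

# For ungrouped binary data, AIC = residual deviance + 2 * (number of coefficients)
aic_check = residual_deviance + 2 * 2

c(lr_stat = lr_stat, p_value = p_value, aic_check = aic_check)
```

The small p-value agrees with the z test on the tempo coefficient, and the reconstructed AIC matches the reported value of 2455.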

2.6 Success Probability Curve

Below I created a visual of the model’s S curve (also called the success probability curve or logistic curve). The S curve helps us see how the probability of an event happening (a track being popular on Spotify) changes in relation to a change in the value of a predictor variable (tempo in BPM in this instance).

# Grid of tempo values spanning the observed range
tempo_range = range(Work_Data$tempo)
x = seq(tempo_range[1], tempo_range[2], length = 200)

# Linear predictor and the implied success / failure probabilities
beta.x = coef(Log_Model)[1] + coef(Log_Model)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)

# Rate of change of the success probability with respect to tempo
beta1 = coef(Log_Model)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2

# Plot 1: success probability curve
par(mfrow = c(1,1))
plot(x, success.prob, type = "l", lwd = 2, col = "black",
     main = "The Probability of a Track Being \n Considered Popular on Spotify", 
     ylim=c(0, 1.1*ylimit),
     xlab = "Tempo (BPM)",
     ylab = "probability",
     axes = FALSE,
     col.main = "black",
     cex.main = 1.1)
axis(1, pos = 0)
axis(2)

# Plot 2: rate of change of the success probability
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "black",
     main = "The Rate of Change in Probability of a Track Being \n Considered Popular on Spotify", 
     xlab = "Tempo (BPM)",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "black",
     cex.main = 1.1
     )
axis(1, pos = 0)
axis(2)

The S curve provides visual confirmation of the conclusions we were able to draw from analyzing our coefficient summary, confidence intervals and odds ratio. That is, there is a positive association between a track’s tempo and its likelihood of being a popular track on Spotify, but that association is very small in practical effect.

Our second plot shows the rate of change in that probability as a function of a track’s tempo. From 50 to 200 beats per minute, the rate of change stays positive but rises only from about 0.0006 to about 0.0012 per additional beat per minute.
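The rate-of-change curve follows from differentiating the logistic function with respect to tempo \(x\). As a rough check, the values below use the rounded coefficients from the model summary, so they are approximate:

\[
p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}, \qquad
\frac{dp}{dx} = \beta_1 \, \frac{e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2} = \beta_1 \, p(x)\,\bigl(1 - p(x)\bigr)
\]

\[
x = 50:\ p \approx 0.123,\ \frac{dp}{dx} \approx 0.0060 \times 0.123 \times 0.877 \approx 0.00065; \qquad
x = 200:\ p \approx 0.257,\ \frac{dp}{dx} \approx 0.0060 \times 0.257 \times 0.743 \approx 0.00115
\]

Because the factor \(p(1 - p)\) can never exceed 0.25, the rate of change under this model is capped at \(\beta_1 / 4 \approx 0.0015\) per BPM regardless of tempo.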

3 Conclusion

To conclude, although there is a statistically significant and positive relationship between a track’s tempo and the probability that it is popular on Spotify, that association is of such a small magnitude that it is close to a non-factor in a practical sense. Therefore, it would be overly simplistic for music labels or agents to aggressively push their artists towards more up-tempo production, as the return on such a decision would likely not be worth the cost of souring the relationship with the artist at hand.

