1 Introduction and Background

Here we have a dataset originally composed of over 5,000 songs that were available on the music-streaming service Spotify over the course of 2024. These tracks stem from a wide array of vastly different genres and styles of creation. The dataset is publicly available on the platform Kaggle and is provided in the References section below. Although the data originally contained 29 variables, only 11 play a particularly relevant role in the analysis I wish to do here. A breakdown of those variables, their meanings and data types is below:

Name Meaning Data_Type
track_popularity song’s rating on the 0 - 100 metric double
popular how popular the track is (more on that later) character
track_name name of the song character
track_artist artist’s stage name character
playlist_genre genre that Spotify classifies the song in character
playlist_subgenre subgenre that Spotify classifies the song in character
tempo the speed of a track, measured in beats per minute (BPM) double
danceability A score describing how suitable a track is for dancing based on tempo, rhythm stability, beat strength and overall regularity (0 to 1 scale) double
speechiness the presence of audible words (0 to 1 scale) double
instrumentalness likelihood that the track contains instrumentals (0 to 1 scale) double
duration_mins length of track in minutes double

Many of the other variables in the dataset I chose to disregard as they either pertained to unnecessary technical information (such as the algorithmic identifier for the particular song or its album), or extremely niche musical topics that the average listener or introductory critic would most likely not pick up on (such as mode and key).

1.1 Objective for my Analysis

Using this dataset, I wish to quantifiably examine what musical qualities and characteristics lead to a song (or “track”) becoming extremely successful (hence popular) via Spotify. As of October of 2025, Spotify has a market cap of over 139 billion dollars and close to 700 million users. This means that for any aspiring musician, “making it big” on Spotify would serve as a monumental and life-altering milestone.

Spotify has a metric known as the popularity index which serves to numerically represent the popularity of tracks on its platform. While the exact formula behind the score is kept confidential, critics in the music industry and those with strong technical backgrounds have theorized that the primary factors which influence a track’s score are the following; its total number of streams, its recent number of streams, its number of downloads and saves, and its month-to-month growth in all three of said metrics.

This popularity index operates on a scale ranging from 0 to 100, with only the most well-established stars reaching the 90th percentile and higher. That being said, any track with a rating of 75 or higher is largely considered a “top track” by industry experts, so that is the interpretation I will apply with my analysis. I created a categorical binary variable “popular”, which is valued as “yes” for tracks with a popularity index rating of 75 or higher, and “no” for tracks with a lower rating. I will perform associative logistic regression to assess which, if any, of the factors mentioned in my variable breakdown above contribute to a track being deemed popular or not.

Finally, after performing simple and multiple linear regression processes to analyze the association between a track’s popularity and its status of the variables I have laid out, I will then use predictive modeling techniques in the effort to create a model which can accurately predict a particular track’s likelihood of being popular on Spotify.

2 Simple Logistic Regression Model

2.1 Checking for Pairwise Relationships

Before jumping into model creation, I created a correlation matrix for the quantitative predictor variables I am going to use in my analysis. This is because variables with very high correlations (r >= 0.8) often can function as almost multiples of one another, leading to inaccurate conclusions, especially when performing multiple regression.

cor_matrix = cor(Quantitative_Variables)
  # No high correlations between the variables

kable(cor_matrix, caption = "Correlation Matrix")
Correlation Matrix
tempo danceability speechiness instrumentalness duration_mins
tempo 1.0000000 0.0338179 0.0576956 -0.1518469 0.0397516
danceability 0.0338179 1.0000000 0.2840300 -0.3205070 -0.1374377
speechiness 0.0576956 0.2840300 1.0000000 -0.1903130 -0.1050720
instrumentalness -0.1518469 -0.3205070 -0.1903130 1.0000000 -0.2218689
duration_mins 0.0397516 -0.1374377 -0.1050720 -0.2218689 1.0000000

After creating a correlation matrix, we can see that none of the quantitative predictor variables that we are examining have a high correlation with one another. Practically speaking, this is not very surprising as even “similar” songs by the same artist or within the same genre can have wildly different characteristics when it comes to tempo or anything else we are looking at.

2.2 Picking Variable for Simple Model

Below I created histograms to represent the distributions of all five quantitative variables I am going to use in this analysis. The purpose for doing such is to determine if any sort of variable manipulation is necessary before using the variable in our logistic model.

par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))  # mar controls margins 
  
  
ylimit = max(density(Work_Data$tempo)$y)
hist(Work_Data$tempo, probability = TRUE, main = "Tempo Distribution", xlab="Tempo", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$tempo, adjust=2), col="blue") 

ylimit = max(density(Work_Data$danceability)$y)
hist(Work_Data$danceability, probability = TRUE, main = "Danceability Distribution", xlab="Danceability", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$danceability, adjust=2), col="blue") 
  
ylimit = max(density(Work_Data$speechiness)$y)
hist(Work_Data$speechiness, probability = TRUE, main = "Speechiness Distribution", xlab="Speechiness", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$speechiness, adjust=2), col="blue") 
  
  
ylimit = max(density(Work_Data$instrumentalness)$y)
hist(Work_Data$instrumentalness, probability = TRUE, main = "Instrumentalness Distribution", xlab="Instrumentalness", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$instrumentalness, adjust=2), col="blue") 
  
  
ylimit = max(density(Work_Data$duration_mins)$y)
hist(Work_Data$duration_mins, probability = TRUE, main = "Length (mins) Distribution", xlab="Length (mins)", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$duration_mins, adjust=2), col="blue") 

We can see from the histograms that there is an extreme skew in speechiness, instrumentalness and length. While there are slight skews in tempo and danceability, they are far more moderate. Since tempo appears to be the least skewed of the variables, and it’s calculation is the most straightforward (a simple measure of beats per minute), I will use tempo to perform my simple logistic regression.

2.3 Creating the Simple Model

Using built-in R functions, I created a simple binary logistic regression model, in which the tempo of a song is meant to predict whether or not that track is popular or not (popularity score of 75 or not). In the calculations, the base level, or 0, was stored for instances of a track’s popularity being marked as “no”. While the instance level, or 1, was stored for instances of a track’s popularity being marked as “yes”.

Simple_Log_Model = glm(formula = popular ~ tempo, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Simple_Log_Model))

Simple_Log_Model_Coef_Summary = summary(Simple_Log_Model)$coef       # output stats of coefficients
Simple_Confidence_Interval = confint(Simple_Log_Model)                     # confidence intervals of betas
Simple_Summary_Stats = cbind(Simple_Log_Model_Coef_Summary, Simple_Confidence_Interval.95=Simple_Confidence_Interval)   # rounding off decimals
kable(Simple_Summary_Stats,caption = "Simple Logistic Regression Model Summary")  
Simple Logistic Regression Model Summary
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) -2.2643674 0.2126810 -10.646777 0.000000 -2.6844577 -1.8504312
tempo 0.0060146 0.0017127 3.511866 0.000445 0.0026561 0.0093727

The summary of our model above leads us to believe that a track’s tempo does have a positive association with the likelihood that a track is popular on Spotify, however that association is very small in practical effect. Despite a statistically significant p value, \(\beta\)1 has an estimated association with the probability of a track being deemed popular of .006, and a 95% confidence interval deeming the association to fall between the interval [.003, .009].

2.4 Odds Ratio Summary

While the direct interpretation of a logistic regression model’s summary statistics can be helpful, it is sometimes more practical to examine our model’s odds ratio. An odds ratio represents the likelihood of success divided by the likelihood of failure. For example, an odds ratio of 2 would mean that whatever instance we are considering a “success” occurs twice the number of times than it fails to occur.

# Odds ratio
Simple_Log_Model_Coef_Summary = summary(Simple_Log_Model)$coef
Odds_Ratio = exp(coef(Simple_Log_Model))
out.stats = cbind(Simple_Log_Model_Coef_Summary, odds.ratio = Odds_Ratio)                 
kable(out.stats,caption = "Simple Regression Model Coefficients W/ Odds Ratios")
Simple Regression Model Coefficients W/ Odds Ratios
Estimate Std. Error z value Pr(>|z|) odds.ratio
(Intercept) -2.2643674 0.2126810 -10.646777 0.000000 0.1038957
tempo 0.0060146 0.0017127 3.511866 0.000445 1.0060327

The odds ratio between a track’s tempo and it’s popularity status is ~ 1.006. Since tempo is measured in beats per minute (BPM), this means that for every additional beat per minute that a song is recorded in, it’s probability of having a popularity score above 75 increases by about 0.6%.

2.5 Goodness-of-fit measures

Below I calculated the goodness-of-fit (GOF) measures for the logistic regression model. Generally speaking, it is beneficial to find a model’s GOF features as they serve as relative comparisons of model efficiency when comparing multiple models. In this instance, we only created one model so far, so the measures are not of practical use at the moment.

Residual_Deviance = Simple_Log_Model$deviance
Null_Residual_Deviance = Simple_Log_Model$null.deviance
aic = Simple_Log_Model$aic
Goodness_Of_Fit = cbind(Deviance.residual = Residual_Deviance, Null.Deviance.Residual = Null_Residual_Deviance,
      AIC = aic)
pander(Goodness_Of_Fit)
Deviance.residual Null.Deviance.Residual AIC
2451 2464 2455

2.6 Success Probability Curve

Below I created a visual of the the model’s S curve (also called success probability curve or logistic curve). The S curve helps us see the change in probability of an event happening (a track being popular or not on Spotify), in relation to a change in the value of a predictor variable (tempo in BPM in this instance).

###

tempo_range = range(Work_Data$tempo)
x = seq(tempo_range[1], tempo_range[2], length = 200)
beta.x = coef(Simple_Log_Model)[1] + coef(Simple_Log_Model)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(Simple_Log_Model)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,1))
plot(x, success.prob, type = "l", lwd = 2, col = "black",
     main = "The Probability of a Track Being \n Considered Popular on Spotify", 
     ylim=c(0, 1.1*ylimit),
     xlab = "Tempo (BPM)",
     ylab = "probability",
     axes = FALSE,
     col.main = "black",
     cex.main = 1.1)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)

##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "black",
     main = "The Rate of Change in Probability of a Track Being \n Considered Popular on Spotify", 
     xlab = "Tempo (BPM)",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "black",
     cex.main = 1.1
     )
axis(1, pos = 0)
axis(2)

The S Curve provides visual confirmation of the conclusions we were able to draw from analyzing our coefficient summary, confidence intervals and odds ratio. That is, that there is a positive association between a track’s tempo and it’s likelihood of being a popular track on Spotify, but that association is very minimal in practical effect.

We can see from our second plot, that looked at the relationship between the rate of change in said probability and a track’s tempo, that from 50 to 200 beats per minute, the probability of a track being popular’s positive rate of change only increases from 0.0006 to 0.0012.

2.7 Takeaways from Simple Logistic Regression Model

To conclude, we can say that although there is a statistically significant and positive relationship between a track’s tempo and the probability that it is popular on Spotify, that association is of such a small magnitude that it is relatively a non-factor in practical sense. Therefore, it would be overly simplistic and hyperbolic for music labels or agents to aggressively push their artists towards a more up-tempo production, as the return on such a decision would likely not be worth the cost of souring the relationship with the artist at hand.

3 Multiple Logistic Regression Model

As we saw earlier when creating our simple regression model, there is no significant risk of multicollinearity with the data we are working with. However, we also saw there is an extreme rightward skew in the distribution of speechiness, instrumentalness and length. Below I take a closer look at the distributions of those three variables.

## Remember in the HW instructions, include not just the numerical variables but two categorical ones too (let's use playlist_genre and playlist_subgenre)

par(mfrow=c(1,2))
hist(speechiness, xlab="Speechiness", main = "")
hist(instrumentalness, xlab = "Instrumentalness", main = "")

hist(duration_mins, xlab = "Length (Mins)", main = "")

To combat the skew of all three variables, I will perform discretization on each of them. The coding for such can be seen below.

  # Speechiness = "presence of audible words on 0 to 1 scale"
grp.speechiness = speechiness
grp.speechiness[speechiness < 0.5] = ": s under 0.5"
grp.speechiness[speechiness > 0.5] = ": s over 0.5"

  # sum(speechiness > 0.5) = 14
  # sum(speechiness < 0.5) = 2641


  # Instrumentalness = "likelihood that the track contains instrumentals (0 to 1 scale)"
grp.instrumentalness = instrumentalness
grp.instrumentalness[instrumentalness < 0.1] = ": s under 0.1"
grp.instrumentalness[instrumentalness > 0.1 & instrumentalness <= 0.7] = ": s from 0.1 to 0.7"
grp.instrumentalness[instrumentalness > 0.7] = ": s over 0.7"
 
 #  sum(instrumentalness < 0.1) = 1802
 #  sum(instrumentalness > 0.1 & instrumentalness < 0.7) = 193
 #  sum(instrumentalness > 0.7) = 659
  

  # duration_mins below
grp.duration_mins = duration_mins
grp.duration_mins[duration_mins < 5] = ": Under 5 mins"
grp.duration_mins[duration_mins >= 5 & duration_mins <= 10] = ": Between 5 and 10 mins"
grp.duration_mins[duration_mins > 10] = ": Over 10 mins"
  
# sum(duration_mins < 5) = 2424
# sum(duration_mins >= 5 & duration_mins <= 10) = 211
# sum(duration_mins > 10) = 20


Work_Data$grp.speechiness = grp.speechiness
Work_Data$grp.instrumentalness = grp.instrumentalness
Work_Data$grp.duration_mins = grp.duration_mins

3.1 Building Full Multiple Logistic Model

For the full multiple logistic model, I will initially include the continuous numerical variables tempo and danceability in their original form, as well as the discretized transformations of instrumentalness, length in minutes and speechiness. After coding this model in, I will examine its summary statistics.

Multiple_Full_Log_Model = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Multiple_Full_Log_Model))

Multiple_Full_Log_Model_Coef_Summary = summary(Multiple_Full_Log_Model)$coef
Multiple_Full_Confidence_Interval = confint(Multiple_Full_Log_Model)  

Multiple_Full_Summary_Stats = cbind(Multiple_Full_Log_Model_Coef_Summary, Multiple_Full_Confidence_Interval.95=Multiple_Full_Confidence_Interval)   # rounding off decimals

kable(Multiple_Full_Summary_Stats,caption = "Full Multiple Logistic Regression Model Summary") 
Full Multiple Logistic Regression Model Summary
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) -17.3380444 375.7154226 -0.0461467 0.9631933 NA 7.5447416
tempo 0.0038700 0.0018965 2.0405532 0.0412953 0.0001487 0.0075874
danceability 0.3230690 0.3274150 0.9867263 0.3237768 -0.3156955 0.9684832
grp.speechiness: s under 0.5 14.2760281 375.7151570 0.0379969 0.9696901 -10.9252242 NA
grp.instrumentalness: s over 0.7 -2.1650529 0.3613366 -5.9917905 0.0000000 -2.9070614 -1.4777880
grp.instrumentalness: s under 0.1 0.6024645 0.2205999 2.7310277 0.0063137 0.1872867 1.0554459
grp.duration_mins: Over 10 mins -13.2932711 310.2840235 -0.0428423 0.9658273 NA 4.4755962
grp.duration_mins: Under 5 mins 0.6782754 0.2321971 2.9211189 0.0034878 0.2431306 1.1574116
invisible(table(Work_Data$popular, Work_Data$grp.speechiness))
invisible(table(Work_Data$popular, Work_Data$grp.instrumentalness))
invisible(table(Work_Data$popular, Work_Data$grp.duration_mins))
  #NA's in kable function display are due to "perfect" results between certain levels of our discretized variables and the response variable of popular or not. If all observations in a certain level result in one of the two options for the response variable, then that coefficient will trend towards negative or positive infinity per R's calculations.

We can see the following from our model summary. First, just like the results from our simple regression model, there is once again a statistically significant association between a track’s tempo and its likelihood of being popular on Spotify (p ~ 0.0413). However that association is so small that it is of little practical importance. Additionally, it appears that a track’s instrumentalness and length absolutely play a part in its probability of being popular on Spotify.

If we look at the discretized breakdown of instrumentalness, we see that there is an significant difference in tracks whose instrumentalness rating is of 0.7 and higher, or less than 0.1, when compared to tracks whose rating is somewhere in the middle. For track length, there also appears to be significant evidence that tracks less than 5 minutes long have a greater probability of achieving popularity on Spotify than those of any other lengths in our dataset.

On the other hand, there is not evidence to suggest that a track’s danceability or speechiness have such an association with its popularity.

After examining the model summary, I also chose to run VIF on this full model, which can be seen below. Given that all GVIF are below 1.1, we can safely say that multicollinearity is not pertinent to this dataset. This finding is in line with the takeaway from our pairwise scatterplot earlier on.

vif(Multiple_Full_Log_Model)
                         GVIF Df GVIF^(1/(2*Df))
tempo                1.018314  1        1.009116
danceability         1.048916  1        1.024166
grp.speechiness      1.000000  1        1.000000
grp.instrumentalness 1.035527  2        1.008766
grp.duration_mins    1.012616  2        1.003139

3.2 Manually Reduced Multiple Logistic Model

Since parsimony (having as least amount of variables as possible without losing significant accuracy in association examination or prediction accuracy) is an important aspect of statistical modeling, I also then chose to create a reduced version of my multiple logistic model above. In this model, I did not include predictors whose impact was very clearly insignificant as per the original full model’s summary (danceability and speechiness). The summary for this new reduced model is below.

Multiple_Reduced_Log_Model = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Multiple_Reduced_Log_Model))

Multiple_Reduced_Log_Model_Coef_Summary = summary(Multiple_Reduced_Log_Model)$coef
Multiple_Reduced_Log_Model_Confidence_Interval = confint(Multiple_Reduced_Log_Model)  

Multiple_Reduced_Summary_Stats = cbind(Multiple_Reduced_Log_Model_Coef_Summary, Multiple_Reduced_Confidence_Interval.95=Multiple_Reduced_Log_Model_Confidence_Interval)   # rounding off decimals

kable(Multiple_Reduced_Summary_Stats,caption = "Reduced Multiple Logistic Regression Model Summary") 
Reduced Multiple Logistic Regression Model Summary
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) -2.8580203 0.3686009 -7.7536985 0.0000000 -3.5991921 -2.1520692
tempo 0.0035527 0.0018681 1.9017792 0.0572000 -0.0001176 0.0072091
grp.instrumentalness: s over 0.7 -2.1920495 0.3605499 -6.0797397 0.0000000 -2.9326652 -1.5064166
grp.instrumentalness: s under 0.1 0.6152133 0.2195892 2.8016558 0.0050841 0.2022035 1.0664024
grp.duration_mins: Over 10 mins -13.3332333 310.6582027 -0.0429193 0.9657659 NA 4.4642172
grp.duration_mins: Under 5 mins 0.6999423 0.2308747 3.0316973 0.0024318 0.2676486 1.1767544

The results from this new reduced model largely align with the findings from the original multiple model. However, just as overfitting a model with needless information is a poor practice, so is underfitting a model by not including everything that is truly important. To ensure validity of this reduced model’s structure, below I ran automatic stepwwise selection via R’s AIC function. The model’s summary statistics have been printed.

3.3 Algorithmically Reduced Multiple Logistic Model

library(MASS)
AIC_Model = stepAIC(Multiple_Reduced_Log_Model, 
                      scope = list(lower=formula(Multiple_Reduced_Log_Model),upper=formula(Multiple_Full_Log_Model)),
                      direction = "forward",   # forward selection
                      trace = 0   # do not show the details
                      )

kable(Multiple_Reduced_Summary_Stats,caption ="AIC-Created Multiple Logistic Regression Model Summary")
AIC-Created Multiple Logistic Regression Model Summary
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) -2.8580203 0.3686009 -7.7536985 0.0000000 -3.5991921 -2.1520692
tempo 0.0035527 0.0018681 1.9017792 0.0572000 -0.0001176 0.0072091
grp.instrumentalness: s over 0.7 -2.1920495 0.3605499 -6.0797397 0.0000000 -2.9326652 -1.5064166
grp.instrumentalness: s under 0.1 0.6152133 0.2195892 2.8016558 0.0050841 0.2022035 1.0664024
grp.duration_mins: Over 10 mins -13.3332333 310.6582027 -0.0429193 0.9657659 NA 4.4642172
grp.duration_mins: Under 5 mins 0.6999423 0.2308747 3.0316973 0.0024318 0.2676486 1.1767544

We can see from our model summary that the AIC-created model and the reduced model I coded in via manual variable selection are identical. This is not surprising due to how clearly significant or insignificant the predictors were in the original full model. Below I will perform global GOF measures on the model to have record of a formal comparison of each one’s accuracy.

3.4 Model Comparison and Selection

 # GOF Full Multiple Model
Mult_Full_Residual_Deviance = Multiple_Full_Log_Model$deviance
Mult_Full_Null_Residual_Deviance = Multiple_Full_Log_Model$null.deviance
Mult_Full_aic = Multiple_Full_Log_Model$aic
Mult_Full_Goodness_Of_Fit = cbind(Mult_Full_Residual_Deviance,
                                  Mult_Full_Null_Residual_Deviance, Mult_Full_aic)

 # GOF Manually Reduced Multiple Model
Mult_Reduced_Residual_Deviance = Multiple_Reduced_Log_Model$deviance
Mult_Reduced_Null_Residual_Deviance = Multiple_Reduced_Log_Model$null.deviance
Mult_Reduced_aic = Multiple_Reduced_Log_Model$aic
Mult_Reduced_Goodness_Of_Fit = cbind(Mult_Reduced_Residual_Deviance,
                                  Mult_Reduced_Null_Residual_Deviance, Mult_Reduced_aic)

 # GOF AIC Reduced Model
Mult_AIC_Residual_Deviance = AIC_Model$deviance
Mult_AIC_Null_Residual_Deviance = AIC_Model$null.deviance
Mult_AIC_aic = AIC_Model$aic
Mult_AIC_Goodness_Of_Fit = cbind(Mult_AIC_Residual_Deviance,
                                  Mult_AIC_Null_Residual_Deviance, Mult_AIC_aic)

Model_GOFs = rbind(Mult_Full_Goodness_Of_Fit, Mult_Reduced_Goodness_Of_Fit,
                   Mult_AIC_Goodness_Of_Fit)
rownames(Model_GOFs) = c("Full Model", "Manually Reduced Model", "AIC Reduced Model")
colnames(Model_GOFs) = c("Residual Deviance", "Null Deviance", "AIC")

kable(Model_GOFs, caption ="GOF Summary of Each Model")
GOF Summary of Each Model
Residual Deviance Null Deviance AIC
Full Model 2216.519 2463.55 2232.519
Manually Reduced Model 2224.352 2463.55 2236.352
AIC Reduced Model 2217.497 2463.55 2231.497

As we can see from the table above, all three models perform extremely similar. Since our AIC-created model has the lowest AIC, and its residual deviance is only a fraction greater than that of our full model, we will accept the AIC-created model as our final choice to use for interpetation of this dataset.

3.5 Takeaways from Multiple Logistic Regression Modeling Process

After the simple logistic regression modeling process earlier, we came away with the conclusion that a track’s tempo was ‘technically’ significant in it’s likelihood of being popular on Spotify, but the practical magnitude of its impact was so small that it was not a factor that deserves a great deal of emphasis in the real world. Our AIC-created model for the most part confirms this finding, (p ~ .0572).

The most important findings from our multiple logistic regression process come from what was discovered by our discretized variables. Tracks with very high instrumentalness (over 0.7 on this dataset’s scale) have a lesser likelihood of being popular on Spotify than those that fall in between 0.1 and 0.7 on said scale. Additionally, tracks under 0.1 on the scale actually have a slighly higher chance of being popular.

Regarding a track’s length, tracks being under 5 minutes does have a significant association with increased likelihood of popularity when compared to tracks ranging from 5 to 10 minutes and tracks beyond 10 minutes. However, there is not sufficient evidence that tracks going above 10 minutes significantly reduces their chance of being popular when compared to those in the 5 to 10 minute category.

Below is a visual displaying the odds ratios for relevant variables from my AIC-created model, they will be referenced in my final conclusion. The extreme estimate, standard error and odds ratio for speechiness can be disregarded and attributed largely to the vast majority of observations falling below 0.5 on the speechiness scale. That along with the variable’s extremely high p value, and standard error being practically identical to the standard error of the model’s null intercept lead us to believe that the summary statistics for speechiness can be chalked up to random chance, not true logistic association.

AIC_Model_Coef_Summary = summary(AIC_Model)$coef
AIC_Odds_Ratio = exp(coef(AIC_Model))
AIC_out.stats = cbind(AIC_Model_Coef_Summary, odds.ratio = AIC_Odds_Ratio)                 
kable(AIC_out.stats,caption = "AIC-Created Multiple Regression Model Coefficients W/ Odds Ratios")
AIC-Created Multiple Regression Model Coefficients W/ Odds Ratios
Estimate Std. Error z value Pr(>|z|) odds.ratio
(Intercept) -17.1140371 375.3481554 -0.0455951 0.9636330 0.0000000
tempo 0.0036649 0.0018745 1.9551582 0.0505644 1.0036716
grp.instrumentalness: s over 0.7 -2.1888224 0.3605641 -6.0705491 0.0000000 0.1120486
grp.instrumentalness: s under 0.1 0.6234236 0.2196071 2.8388131 0.0045282 1.8653032
grp.duration_mins: Over 10 mins -13.3387936 310.5685700 -0.0429496 0.9657417 0.0000016
grp.duration_mins: Under 5 mins 0.7018358 0.2310153 3.0380492 0.0023812 2.0174530
grp.speechiness: s under 0.5 14.2414898 375.3479579 0.0379421 0.9697338 1531088.9917716

3.6 Final Association Modeling Conclusion

To conclude, in the most practical sense there are two factors that seem to have the greatest association on a track’s chance of being popular on Spotify. Those being its instrumentalness and length. Tracks that are under 5 minutes long have about twice the likelihood of being popular than not, while tracks that are longer than 10 minutes do not have nearly as great a chance. Tracks that are not very instrumental (s under 0.1) also have close to twice as great a chance of being popular than not. Much better odds than their very instrumental (s over 0.7) and mildly instrumental (0.1 < s < 0.7) counterparts.

4 Predictive Binary Logistic Model

Now that we have deeply studied the association of a track’s likelihood of popularity with some of the track’s characteristics, I am going to take our analysis one step further, away from the realm of strict observation and into the realm of educated and predictive speculation. For this process, I will utilize the models constructed in sections 3.1 through 3.3 above, as well as standard predictive modeling techniques like data splitting and ROC visualization.

4.1 Data Splitting

Below I split the data we have been analyzing into two subsets, one for training purposes and one for assessment purposes, at a rate of 70% to 30% respectively. This procedure is done to counteract the risk of overfitting a model and to ensure that the true rate of inaccurate prediction (both false positives and false negatives) is known.

n = nrow(Work_Data) # Sample size of data we are working with
Training_n =round(0.7*n) # Sample size of training data
Training_ID = sample(1:n, Training_n, replace = FALSE) 
  # The number of rows to choose from is 1:n, the number that will be selected is Training_n, replacement is set to FALSE since we are taking a subset of the data, so do not want to take the same observation(s) more than once.

Training_Data = Work_Data[Training_ID, ]
  # Creates a subset of Work_Data, assigning only the observations from the dataset that were marked with Training_ID to be in the Training_Data subset

Testing_Data = Work_Data[-Training_ID, ]
  # Creates an alternate subset of Work_Data, assigning only the observations from the dataset that were NOT marked with Training_ID to be in the Testing_Data subset

4.2 Cross Validation (CV) Process and Model Selection

Now I will use the aforementioned models, created in my multiple regression association analysis, in the CV process. I will create 3 folds (distinct groups) and set the cut-off probability (what constitutes “yes” or “no” as the value of our binary response variable) to the standard 0.5. The results and interpretation can be seen below.

# Multiple_Full_Log_Model = First multiple log model
# Multiple_Reduced_Log_Model = Manually reduced one
# AIC_Model = The final Algorithmically Reduced one we used

# 3 fold cross-validation
k = 3

fold_size = floor(dim(Training_Data)[1]/k)

## PE vectors for candidate models
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)
for(i in 1:k){  # Start of big FOR loop
  ## Training and testing folds
  valid.id = (fold_size*(i-1)+1):(fold_size*i)
  valid = Training_Data[valid.id, ]
  train.dat = Training_Data[-valid.id,]

  # Full Log Model
CVModel1 = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)
table(Training_Data$popular)


  # Manually Reduced Log Model
CVModel2 = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  
  # AIC Reduced (Final) Log Model
  CVModel3 = stepAIC(CVModel1, 
          scope = list(lower=formula(CVModel2),upper=formula(CVModel1)),
                      direction = "forward",   # forward selection
                      trace = 0 )

   ##  predicted probabilities of each candidate model
   pred01 = predict(CVModel1, newdata = valid, type="response")
   pred02 = predict(CVModel2, newdata = valid, type="response")
   pred03 = predict(CVModel3, newdata = valid, type="response")
   
   pre.outcome01 = ifelse(as.vector(pred01) > 0.5, "yes", "no")
   pre.outcome02 = ifelse(as.vector(pred02) > 0.5, "yes", "no")
   pre.outcome03 = ifelse(as.vector(pred03) > 0.5, "yes", "no")
      # Have our cut-off probability set to 0.5
        # HAVE to set responses as "yes" and "no", earlier had them set as "pos" and "neg" to mimic the case study code, which is why outcome was illogical
   
   PE1[i] = 1-sum(pre.outcome01 == valid$popular )/length(pred01)
   PE2[i] = 1-sum(pre.outcome02 == valid$popular )/length(pred02)
   PE3[i] = 1-sum(pre.outcome02 == valid$popular)/length(pred03)
   # valid$popular = “From the validation subset valid, take the column named popular.”
   
  } # End of big FOR loop
 
 # Display the output of the loop
Avg.Prediction_Error = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(Avg.Prediction_Error, digits = 4, caption = "Average Prediction Errors of the CV Models")
Average Prediction Errors of the CV Models
PE1 PE2 PE3
0.104 0.104 0.104
pred02 = predict(CVModel2, newdata = Testing_Data, type="response")
pred02.outcome = ifelse(as.vector(pred02)>0.5, "yes", "no")

accuracy = sum(pred02.outcome == Testing_Data$popular)/length(pred02)
kable(accuracy, caption="The actual accuracy of the final model")
The actual accuracy of the final model
x
0.8205772

We can see from above that, at a cut-off probability of 0.5, all three of our models used for associative analysis function with an average prediction error of ~ 0.1057 (this number can vary by about .01 either way depending on which run of the simulation we are on). This is not too surprising given the shared associative findings of each model (check Multiple Regression section). Given that I previously determined the AIC-produced model was the most ideal, that is the one I will consider as the ultimate or “final” predictive model. This model’s rate of accuracy, which was found through applying it’s structure to the testing dataset that we subsetted earlier, is about 82.67% (like prediction error, can have slight variation).

4.3 ROC Visualization

To assist with understanding, I also constructed a ROC (Receiver Operating Characteristics) curve. This image allows us to interpret the relationship between our different models’ specificity (rate of producing false positives) on the x axis VS sensitivity (rate of producing true positives) on the y axis.

We can see from the curve below that all three models are extremely similar in their ratio of sensitivity to specificity, and hence have almost identical AUC values. As mentioned above, we will utilize the AIC-reduced model given that it was deemed the most accurate for our previous association analysis.

######### Using a function to extract false positive and false negative values
TPR.FPR=function(pred){
  prob.seq = seq(0,1, length=50)  # 50 equally spaced cut-off probabilities
  pn=length(prob.seq)
  true.lab=as.vector(Training_Data$popular)
  TPR = NULL
  FPR = NULL
  ##
  for (i in 1:pn){
   pred.lab = as.vector(ifelse(pred >prob.seq[i],"yes", "no"))
   TPR[i] = length(which(true.lab=="yes" & pred.lab=="yes"))/length(which(true.lab=="yes"))
   FPR[i] = length(which(true.lab=="no" & pred.lab=="yes"))/length(which(true.lab=="no"))
  }
 cbind(FPR = FPR, TPR = TPR)
}

######### Calculating the AUC
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
###  Candidate Models
 # Full Log Model
CVModel1 = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  # Manually Reduced Log Model
CVModel2 = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  # AIC Reduced (Final) Log Model
  CVModel3 = stepAIC(CVModel1, 
          scope = list(lower=formula(CVModel2),upper=formula(CVModel1)),
                      direction = "forward",   # forward selection
                      trace = 0 )
##  predicted probabilities
pred01 = predict.glm(CVModel1, newdata = Training_Data, type="response") 
pred02 = predict.glm(CVModel2, newdata = Training_Data, type="response")
pred03 = predict.glm(CVModel3, newdata = Training_Data, type="response")
####
## ROC curve
 plot(TPR.FPR(pred01)[,1], TPR.FPR(pred01)[,2], 
      type="l", col="red", lty=1, lwd = 2, xlim=c(0,1), ylim=c(0,1),
      xlab = "False Positive Rate (FPR)",
      ylab ="True Positive Rate (TPR)",
      main = "ROC curves of the three candidate models",
      cex.main = 1.1,
      col.main = "black")
 lines(TPR.FPR(pred02)[,1], TPR.FPR(pred02)[,2],  col="darkblue", lty=2, lwd = 2)
 lines(TPR.FPR(pred03)[,1], TPR.FPR(pred03)[,2],  col="black", lty=3, lwd = 2)    

  ##
  category = Training_Data$popular == "yes"
  ROCobj01 <- roc(category, as.vector(pred01))
  ROCobj02 <- roc(category, as.vector(pred02))
  ROCobj03 <- roc(category, as.vector(pred03))
  AUC01 = round(auc(ROCobj01),4)
  AUC02 = round(auc(ROCobj02),4)
  AUC03 = round(auc(ROCobj03),4)
  ##
  legend("bottomright", c(paste("Full Log Model AUC = ",AUC01), 
                         paste("Man Reduced Log Model AUC = ",AUC02),
                         paste("AIC Reduced Log Model AUC = ", AUC03)),
        col=c("red","darkblue", "black"), lty=1:3, cex = 0.85, bty="n")

4.4 Predictive Modeling Takeaways

To conclude, if someone had interest in predicting a song’s chance of being popular given its length, instrumentalness and (to a minor extent) tempo, they should utilize the AIC-reduced model that was formed in section 3.3 above. I could see those most interested in this predictive model being agents, record label executives and aspiring artists, all attempting to “fix” their music to fit the mold of what tends to commercially succeed on the world’s largest streaming platform.

---
title: "Binary Logistic Predictive Modeling of a Track's Likelihood to be Popular on Spotify"
author: "Chris Bahm"
date: "2025-10-23"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: true
    number_sections: true
    code_folding: hide
    code_download: true
    theme: lumen
    highlight: tango
  pdf_document:
    toc: true
    toc_depth: 4
    fig_caption: true
    number_sections: true
  word_document:
    toc: true
    toc_depth: 4
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.

if (!require(tidyverse)) {library(tidyvserse)} 

if (!require(GGally)) {library(GGally)} 

if (!require(kableExtra)) {library(kableExtra)} 

if (!require(ggplot2)) {library(ggplot2)} 

if (!require(car)) {library(car)} 

if (!require(dplyr)) {library(dplyr)} 

if (!require(pander)) {library(pander)} 

if (!require(car)) {library(car)} 

if (!require("scales")) {
install.packages("scales")                                        
library("scales") 
}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE,
	comment = NA,
	results = TRUE
)
   
```

# Introduction and Background

Here we have a dataset originally composed of over 5,000 songs that were available on the music-streaming service Spotify over the course of 2024. These tracks stem from a wide array of vastly different genres and styles of creation. The dataset is publicly available on the platform Kaggle and is provided in the References section below. Although the data originally contained 29 variables, only 11 play a particularly relevant role in the analysis I wish to do here. A breakdown of those variables, their meanings and data types is below:

```{r Variable Table, echo=FALSE}
library(knitr)

Var_Table = data.frame(
  Name = c("track_popularity", "popular", "track_name", "track_artist", "playlist_genre", "playlist_subgenre", "tempo", "danceability","speechiness", "instrumentalness", "duration_mins"),
  
  Meaning = c("song's rating on the 0 - 100 metric", "how popular the track is (more on that later)",
              "name of the song",
              "artist's stage name", 
              "genre that Spotify classifies the song in", 
              "subgenre that Spotify classifies the song in",  
              "the speed of a track, measured in beats per minute (BPM)", 
              "A score describing how suitable a track is for dancing based on tempo, rhythm stability, beat strength and overall regularity (0 to 1 scale)", 
              "the presence of audible words (0 to 1 scale)", 
              "likelihood that the track contains instrumentals (0 to 1 scale) ", 
              "length of track in minutes"),
  
  Data_Type = c("double", "character", "character", "character", "character", "character", "double", "double", "double", "double", "double")
)

kable(Var_Table) %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")
```

Many of the other variables in the dataset I chose to disregard as they either pertained to unnecessary technical information (such as the algorithmic identifier for the particular song or its album), or extremely niche musical topics that the average listener or introductory critic would most likely not pick up on (such as mode and key).

## Objective for my Analysis
Using this dataset, I wish to quantifiably examine what musical qualities and characteristics lead to a song (or "track") becoming extremely successful (hence popular) via Spotify. As of October of 2025, Spotify has a market cap of over 139 billion dollars and close to 700 million users. This means that for any aspiring musician, "making it big" on Spotify would serve as a monumental and life-altering milestone. 

Spotify has a metric known as the popularity index which serves to numerically represent the popularity of tracks on its platform. While the exact formula behind the score is kept confidential, critics in the music industry and those with strong technical backgrounds have theorized that the primary factors which influence a track's score are the following; its total number of streams, its recent number of streams, its number of downloads and saves, and its month-to-month growth in all three of said metrics.

This popularity index operates on a scale ranging from 0 to 100, with only the most well-established stars reaching the 90th percentile and higher. That being said, any track with a rating of 75 or higher is largely considered a "top track" by industry experts, so that is the interpretation I will apply with my analysis. I created a categorical binary variable "popular", which is valued as "yes" for tracks with a popularity index rating of 75 or higher, and "no" for tracks with a lower rating. I will perform associative logistic regression to assess which, if any, of the factors mentioned in my variable breakdown above contribute to a track being deemed popular or not. 

Finally, after performing simple and multiple linear regression processes to analyze the association between a track's popularity and its status of the variables I have laid out, I will then use predictive modeling techniques in the effort to create a model which can accurately predict a particular track's likelihood of being popular on Spotify.

```{r Data Read in and Cleaning, include=FALSE}
options(scipen = 999) # Out of scientific notation, personal preference

Data_Url_A = "https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/high_popularity_spotify_data.csv"
Above_68 = read_delim(Data_Url_A, delim = ",", show_col_types = FALSE)

Data_Url_B = "https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/low_popularity_spotify_data.csv"
Below_68 = read_delim(Data_Url_B, delim = ",", show_col_types = FALSE)


### Line below was used for initial visual purposes 
#Above_68 = Above_68[, c("track_name", "track_album_name", "track_artist", "playlist_name", "playlist_genre", "playlist_subgenre", "track_popularity", "energy", "tempo", "danceability", "loudness", "liveness", "valence", "speechiness", "instrumentalness", "key", "duration_ms", "time_signature", "track_href", "mode", "uri", "type", "track_album_release_date", "analysis_url", "id", "track_album_id", "playlist_id", "track_id", "acousticness")]


Data = rbind(Above_68, Below_68)
  Data$duration_mins = Data$duration_ms/60000
Data$popular = ifelse(Data$track_popularity > 75, "yes", "no")

Data = dplyr::select(Data, track_popularity, popular, track_name, track_album_name, track_artist, playlist_name, playlist_genre, playlist_subgenre, energy, tempo, danceability, loudness, liveness, valence, speechiness, instrumentalness, key, duration_mins)

unique(Data$playlist_genre)

# Breakdown of Variables Kept
   # Identifying Variables:
       # track_name
       # track_album_name
       # track_artist
   # Categorical Info:
       # playlist_name
       # playlist_genre
       # playlist_subgenre
       # **** popular **** binary response variable self-created
   # Quantitative Info:
       # track_popularity
       # energy
       # tempo
       # danceability
       # loudness
       # liveness
       # valence
       # speechiness
       # instrumentalness
       # key
       # duration_ms
   
    # 17 selected above from original data set, plus popular
   
  # Variables we are not including in reduced data set are largely technical tracking variables such as id and analysis_url, or very niche musical metrics that the average listener would be unaware of such as mode.


#### Checking for Duplicates:

Duplicate_Songs = Data[duplicated(Data$track_name) | duplicated(Data$track_name, fromLast = TRUE), ]
Duplicate_Songs_Reduced = dplyr::select(Duplicate_Songs, track_name, track_artist, playlist_name, playlist_genre, playlist_subgenre)

Duplicate_Song_Counts = as.data.frame(table(Duplicate_Songs$track_name))   
  # 837 instances of the same song being listed more than once

names(Duplicate_Song_Counts) = c("track_name", "count")
  # 378 songs listed more than once

sum(Duplicate_Song_Counts$count)   # Sum function confirms there are 837 total repeats

  
# Going to reduce main dataset to only one observation per song, used base R function to keep first occurence of each song

#### New_Data = no duplicates
New_Data = Data[!duplicated(Data$track_name), ]

# Going to eliminate genres that are not prevalent in North America and Europe, as my analysis will apply largely to audience in Western world

unique(New_Data$playlist_genre)

New_Data$West = ifelse(
  New_Data$playlist_genre %in% c("brazilian", "arabic", "indian", "mandopop",
                                 "cantopop", "k-pop", "turkish", "korean", "j-pop", "world", "latin", "electronic"),
  "no",
  "yes"
)

table(New_Data$West)

#### New_Data2 = Confined to Western music genres
New_Data2 = New_Data[New_Data$West!= "no", ]
colnames(New_Data2)

#### Work_Data = Reduced number of variables to perform analysis on, missing value checks performed
Work_Data = dplyr::select(New_Data2, track_popularity, popular, track_name, track_artist, playlist_genre, playlist_subgenre, tempo, danceability, speechiness, instrumentalness, duration_mins)

Work_Data = sort_by(Work_Data, Work_Data$track_popularity, decreasing=TRUE)

sum(is.na(Work_Data)) # 5 missing values in dataset

Missing_Rows = Work_Data[!complete.cases(Work_Data), ]
 # All 5 missing values come from one observation, the song is "Make It" by Berhanio. I will just remove this song from the dataset using filter function below.

Work_Data = filter(Work_Data, track_name != "Make It" )

  track_popularity = Work_Data$track_popularity 
  popular = Work_Data$popular
  track_name = Work_Data$track_name
  track_artist = Work_Data$track_artist
  playlist_genre = Work_Data$playlist_genre
  playlist_subgenre = Work_Data$playlist_subgenre
  tempo = Work_Data$tempo
  danceability = Work_Data$danceability
  speechiness = Work_Data$speechiness
  instrumentalness = Work_Data$instrumentalness
  duration_mins = Work_Data$duration_mins
  
Quantitative_Variables = cbind(tempo, danceability, speechiness, instrumentalness, duration_mins)

Categorical_Variables = cbind(playlist_genre, playlist_subgenre)
 
```

# Simple Logistic Regression Model

## Checking for Pairwise Relationships
Before jumping into model creation, I created a correlation matrix for the quantitative predictor variables I am going to use in my analysis. This is because variables with very high correlations (r >= 0.8) often can function as almost multiples of one another, leading to inaccurate conclusions, especially when performing multiple regression.
```{r}
cor_matrix = cor(Quantitative_Variables)
  # No high correlations between the variables

kable(cor_matrix, caption = "Correlation Matrix")
```

After creating a correlation matrix, we can see that none of the quantitative predictor variables that we are examining have a high correlation with one another. Practically speaking, this is not very surprising as even "similar" songs by the same artist or within the same genre can have wildly different characteristics when it comes to tempo or anything else we are looking at.


## Picking Variable for Simple Model
Below I created histograms to represent the distributions of all five quantitative variables I am going to use in this analysis. The purpose for doing such is to determine if any sort of variable manipulation is necessary before using the variable in our logistic model.
```{r}
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))  # mar controls margins 
  
  
ylimit = max(density(Work_Data$tempo)$y)
hist(Work_Data$tempo, probability = TRUE, main = "Tempo Distribution", xlab="Tempo", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$tempo, adjust=2), col="blue") 

ylimit = max(density(Work_Data$danceability)$y)
hist(Work_Data$danceability, probability = TRUE, main = "Danceability Distribution", xlab="Danceability", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$danceability, adjust=2), col="blue") 
  
ylimit = max(density(Work_Data$speechiness)$y)
hist(Work_Data$speechiness, probability = TRUE, main = "Speechiness Distribution", xlab="Speechiness", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$speechiness, adjust=2), col="blue") 
  
  
ylimit = max(density(Work_Data$instrumentalness)$y)
hist(Work_Data$instrumentalness, probability = TRUE, main = "Instrumentalness Distribution", xlab="Instrumentalness", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$instrumentalness, adjust=2), col="blue") 
  
  
ylimit = max(density(Work_Data$duration_mins)$y)
hist(Work_Data$duration_mins, probability = TRUE, main = "Length (mins) Distribution", xlab="Length (mins)", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$duration_mins, adjust=2), col="blue") 


```

We can see from the histograms that there is an extreme skew in speechiness, instrumentalness and length. While there are slight skews in tempo and danceability, they are *far* more moderate. Since tempo appears to be the least skewed of the variables, and it's calculation is the most straightforward (a simple measure of beats per minute), I will use tempo to perform my simple logistic regression.


## Creating the Simple Model
Using built-in R functions, I created a simple binary logistic regression model, in which the tempo of a song is meant to predict whether or not that track is popular or not (popularity score of 75 or not). In the calculations, the base level, or 0, was stored for instances of a track's popularity being marked as "no". While the instance level, or 1, was stored for instances of a track's popularity being marked as "yes".
```{r, include=FALSE}
# Convert the character values to factors, necessary for glm function to work
Work_Data$popular = factor(Work_Data$popular, levels = c("no", "yes"))
  invisible(str(Work_Data$popular))
  invisible(unique(Work_Data$popular))
```
```{r}
Simple_Log_Model = glm(formula = popular ~ tempo, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Simple_Log_Model))

Simple_Log_Model_Coef_Summary = summary(Simple_Log_Model)$coef       # output stats of coefficients
Simple_Confidence_Interval = confint(Simple_Log_Model)                     # confidence intervals of betas
Simple_Summary_Stats = cbind(Simple_Log_Model_Coef_Summary, Simple_Confidence_Interval.95=Simple_Confidence_Interval)   # rounding off decimals
kable(Simple_Summary_Stats,caption = "Simple Logistic Regression Model Summary")  
```
The summary of our model above leads us to believe that a track's tempo does have a positive association with the likelihood that a track is popular on Spotify, however that association is very small in practical effect. Despite a statistically significant p value, $\beta$~1~ has an estimated association with the probability of a track being deemed popular of .006, and a 95% confidence interval deeming the association to fall between the interval [.003, .009].


## Odds Ratio Summary
While the direct interpretation of a logistic regression model's summary statistics can be helpful, it is sometimes more practical to examine our model's odds ratio. An odds ratio represents the likelihood of success divided by the likelihood of failure. For example, an odds ratio of 2 would mean that whatever instance we are considering a "success" occurs twice the number of times than it fails to occur.
```{r}
# Odds ratio
Simple_Log_Model_Coef_Summary = summary(Simple_Log_Model)$coef
Odds_Ratio = exp(coef(Simple_Log_Model))
out.stats = cbind(Simple_Log_Model_Coef_Summary, odds.ratio = Odds_Ratio)                 
kable(out.stats,caption = "Simple Regression Model Coefficients W/ Odds Ratios")
```
The odds ratio between a track's tempo and it's popularity status is ~ 1.006. Since tempo is measured in beats per minute (BPM), this means that for every additional beat per minute that a song is recorded in, it's probability of having a popularity score above 75 increases by about 0.6%.

## Goodness-of-fit measures
Below I calculated the goodness-of-fit (GOF) measures for the logistic regression model. Generally speaking, it is beneficial to find a model's GOF features as they serve as relative comparisons of model efficiency when comparing multiple models. In this instance, we only created one model so far, so the measures are not of practical use at the moment.
```{r}
Residual_Deviance = Simple_Log_Model$deviance
Null_Residual_Deviance = Simple_Log_Model$null.deviance
aic = Simple_Log_Model$aic
Goodness_Of_Fit = cbind(Deviance.residual = Residual_Deviance, Null.Deviance.Residual = Null_Residual_Deviance,
      AIC = aic)
pander(Goodness_Of_Fit)

```

## Success Probability Curve
Below I created a visual of the the model's S curve (also called success probability curve or logistic curve). The S curve helps us see the change in probability of an event happening (a track being popular or not on Spotify), in relation to a change in the value of a predictor variable (tempo in BPM in this instance).
```{r}
###

tempo_range = range(Work_Data$tempo)
x = seq(tempo_range[1], tempo_range[2], length = 200)
beta.x = coef(Simple_Log_Model)[1] + coef(Simple_Log_Model)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(Simple_Log_Model)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,1))
plot(x, success.prob, type = "l", lwd = 2, col = "black",
     main = "The Probability of a Track Being \n Considered Popular on Spotify", 
     ylim=c(0, 1.1*ylimit),
     xlab = "Tempo (BPM)",
     ylab = "probability",
     axes = FALSE,
     col.main = "black",
     cex.main = 1.1)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)

##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "black",
     main = "The Rate of Change in Probability of a Track Being \n Considered Popular on Spotify", 
     xlab = "Tempo (BPM)",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "black",
     cex.main = 1.1
     )
axis(1, pos = 0)
axis(2)
```

The S Curve provides visual confirmation of the conclusions we were able to draw from analyzing our coefficient summary, confidence intervals and odds ratio. That is, that there is a positive association between a track's tempo and it's likelihood of being a popular track on Spotify, **but** that association is very minimal in practical effect.

We can see from our second plot, that looked at the relationship between the rate of change in said probability and a track's tempo, that from 50 to 200 beats per minute, the probability of a track being popular's positive rate of change only increases from 0.0006 to 0.0012.

## Takeaways from Simple Logistic Regression Model
To conclude, we can say that although there is a statistically significant and positive relationship between a track's tempo and the probability that it is popular on Spotify, that association is of such a small magnitude that it is relatively a non-factor in practical sense. Therefore, it would be overly simplistic and hyperbolic for music labels or agents to aggressively push their artists towards a more up-tempo production, as the return on such a decision would likely not be worth the cost of souring the relationship with the artist at hand. 

# Multiple Logistic Regression Model
As we saw earlier when creating our simple regression model, there is no significant risk of multicollinearity with the data we are working with. However, we also saw there is an extreme rightward skew in the distribution of speechiness, instrumentalness and length. Below I take a closer look at the distributions of those three variables.
```{r}
## Remember in the HW instructions, include not just the numerical variables but two categorical ones too (let's use playlist_genre and playlist_subgenre)

par(mfrow=c(1,2))
hist(speechiness, xlab="Speechiness", main = "")
hist(instrumentalness, xlab = "Instrumentalness", main = "")
hist(duration_mins, xlab = "Length (Mins)", main = "")
```

To combat the skew of all three variables, I will perform discretization on each of them. The coding for such can be seen below.
```{r}
  # Speechiness = "presence of audible words on 0 to 1 scale"
grp.speechiness = speechiness
grp.speechiness[speechiness < 0.5] = ": s under 0.5"
grp.speechiness[speechiness > 0.5] = ": s over 0.5"

  # sum(speechiness > 0.5) = 14
  # sum(speechiness < 0.5) = 2641


  # Instrumentalness = "likelihood that the track contains instrumentals (0 to 1 scale)"
grp.instrumentalness = instrumentalness
grp.instrumentalness[instrumentalness < 0.1] = ": s under 0.1"
grp.instrumentalness[instrumentalness > 0.1 & instrumentalness <= 0.7] = ": s from 0.1 to 0.7"
grp.instrumentalness[instrumentalness > 0.7] = ": s over 0.7"
 
 #  sum(instrumentalness < 0.1) = 1802
 #  sum(instrumentalness > 0.1 & instrumentalness < 0.7) = 193
 #  sum(instrumentalness > 0.7) = 659
  

  # duration_mins below
grp.duration_mins = duration_mins
grp.duration_mins[duration_mins < 5] = ": Under 5 mins"
grp.duration_mins[duration_mins >= 5 & duration_mins <= 10] = ": Between 5 and 10 mins"
grp.duration_mins[duration_mins > 10] = ": Over 10 mins"
  
# sum(duration_mins < 5) = 2424
# sum(duration_mins >= 5 & duration_mins <= 10) = 211
# sum(duration_mins > 10) = 20


Work_Data$grp.speechiness = grp.speechiness
Work_Data$grp.instrumentalness = grp.instrumentalness
Work_Data$grp.duration_mins = grp.duration_mins
```


## Building Full Multiple Logistic Model
For the full multiple logistic model, I will initially include the continuous numerical variables tempo and danceability in their original form, as well as the discretized transformations of instrumentalness, length in minutes and speechiness. After coding this model in, I will examine its summary statistics.

```{r}
Multiple_Full_Log_Model = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Multiple_Full_Log_Model))

Multiple_Full_Log_Model_Coef_Summary = summary(Multiple_Full_Log_Model)$coef
Multiple_Full_Confidence_Interval = confint(Multiple_Full_Log_Model)  

Multiple_Full_Summary_Stats = cbind(Multiple_Full_Log_Model_Coef_Summary, Multiple_Full_Confidence_Interval.95=Multiple_Full_Confidence_Interval)   # rounding off decimals

kable(Multiple_Full_Summary_Stats,caption = "Full Multiple Logistic Regression Model Summary") 

invisible(table(Work_Data$popular, Work_Data$grp.speechiness))
invisible(table(Work_Data$popular, Work_Data$grp.instrumentalness))
invisible(table(Work_Data$popular, Work_Data$grp.duration_mins))
  #NA's in kable function display are due to "perfect" results between certain levels of our discretized variables and the response variable of popular or not. If all observations in a certain level result in one of the two options for the response variable, then that coefficient will trend towards negative or positive infinity per R's calculations.
```

We can see the following from our model summary. First, just like the results from our simple regression model, there is once again a statistically significant association between a track's tempo and its likelihood of being popular on Spotify (p ~ 0.0413). However that association is so small that it is of little *practical* importance. Additionally, it appears that a track's instrumentalness and length absolutely play a part in its probability of being popular on Spotify. 

If we look at the discretized breakdown of instrumentalness, we see that there is an significant difference in tracks whose instrumentalness rating is of 0.7 and higher, or less than 0.1, when compared to tracks whose rating is somewhere in the middle. For track length, there also appears to be significant evidence that tracks less than 5 minutes long have a greater probability of achieving popularity on Spotify than those of any other lengths in our dataset.

On the other hand, there is not evidence to suggest that a track's danceability or speechiness have such an association with its popularity.

After examining the model summary, I also chose to run VIF on this full model, which can be seen below. Given that all GVIF are below 1.1, we can safely say that multicollinearity is not pertinent to this dataset. This finding is in line with the takeaway from our pairwise scatterplot earlier on.
```{r}
vif(Multiple_Full_Log_Model)
```

## Manually Reduced Multiple Logistic Model
Since parsimony (having as least amount of variables as possible without losing significant accuracy in association examination or prediction accuracy) is an important aspect of statistical modeling, I also then chose to create a reduced version of my multiple logistic model above. In this model, I did not include predictors whose impact was very clearly insignificant as per the original full model's summary (danceability and speechiness). The summary for this new reduced model is below.

```{r}
Multiple_Reduced_Log_Model = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Multiple_Reduced_Log_Model))

Multiple_Reduced_Log_Model_Coef_Summary = summary(Multiple_Reduced_Log_Model)$coef
Multiple_Reduced_Log_Model_Confidence_Interval = confint(Multiple_Reduced_Log_Model)  

Multiple_Reduced_Summary_Stats = cbind(Multiple_Reduced_Log_Model_Coef_Summary, Multiple_Reduced_Confidence_Interval.95=Multiple_Reduced_Log_Model_Confidence_Interval)   # rounding off decimals

kable(Multiple_Reduced_Summary_Stats,caption = "Reduced Multiple Logistic Regression Model Summary") 
```

The results from this new reduced model largely align with the findings from the original multiple model. However, just as overfitting a model with needless information is a poor practice, so is underfitting a model by not including everything that is truly important. To ensure validity of this reduced model's structure, below I ran automatic stepwwise selection via R's AIC function. The model's summary statistics have been printed.

## Algorithmically Reduced Multiple Logistic Model
```{r}
library(MASS)
AIC_Model = stepAIC(Multiple_Reduced_Log_Model, 
                      scope = list(lower=formula(Multiple_Reduced_Log_Model),upper=formula(Multiple_Full_Log_Model)),
                      direction = "forward",   # forward selection
                      trace = 0   # do not show the details
                      )

kable(Multiple_Reduced_Summary_Stats,caption ="AIC-Created Multiple Logistic Regression Model Summary")
```

We can see from our model summary that the AIC-created model and the reduced model I coded in via manual variable selection are identical. This is not surprising due to how clearly significant or insignificant the predictors were in the original full model. Below I will perform global GOF measures on the model to have record of a formal comparison of each one's accuracy.

## Model Comparison and Selection
```{r}
 # GOF Full Multiple Model
Mult_Full_Residual_Deviance = Multiple_Full_Log_Model$deviance
Mult_Full_Null_Residual_Deviance = Multiple_Full_Log_Model$null.deviance
Mult_Full_aic = Multiple_Full_Log_Model$aic
Mult_Full_Goodness_Of_Fit = cbind(Mult_Full_Residual_Deviance,
                                  Mult_Full_Null_Residual_Deviance, Mult_Full_aic)

 # GOF Manually Reduced Multiple Model
Mult_Reduced_Residual_Deviance = Multiple_Reduced_Log_Model$deviance
Mult_Reduced_Null_Residual_Deviance = Multiple_Reduced_Log_Model$null.deviance
Mult_Reduced_aic = Multiple_Reduced_Log_Model$aic
Mult_Reduced_Goodness_Of_Fit = cbind(Mult_Reduced_Residual_Deviance,
                                  Mult_Reduced_Null_Residual_Deviance, Mult_Reduced_aic)

 # GOF AIC Reduced Model
Mult_AIC_Residual_Deviance = AIC_Model$deviance
Mult_AIC_Null_Residual_Deviance = AIC_Model$null.deviance
Mult_AIC_aic = AIC_Model$aic
Mult_AIC_Goodness_Of_Fit = cbind(Mult_AIC_Residual_Deviance,
                                  Mult_AIC_Null_Residual_Deviance, Mult_AIC_aic)

Model_GOFs = rbind(Mult_Full_Goodness_Of_Fit, Mult_Reduced_Goodness_Of_Fit,
                   Mult_AIC_Goodness_Of_Fit)
rownames(Model_GOFs) = c("Full Model", "Manually Reduced Model", "AIC Reduced Model")
colnames(Model_GOFs) = c("Residual Deviance", "Null Deviance", "AIC")

kable(Model_GOFs, caption ="GOF Summary of Each Model")
```

As we can see from the table above, all three models perform extremely similar. Since our AIC-created model has the lowest AIC, and its residual deviance is only a fraction greater than that of our full model, we will accept the AIC-created model as our final choice to use for interpetation of this dataset.

## Takeaways from Multiple Logistic Regression Modeling Process
After the simple logistic regression modeling process earlier, we came away with the conclusion that a track's tempo was 'technically' significant in it's likelihood of being popular on Spotify, but the practical magnitude of its impact was so small that it was not a factor that deserves a great deal of emphasis in the real world. Our AIC-created model for the most part confirms this finding, (p ~ .0572). 

The most important findings from our multiple logistic regression process come from what was discovered by our discretized variables. Tracks with very high instrumentalness (over 0.7 on this dataset's scale) have a lesser likelihood of being popular on Spotify than those that fall in between 0.1 and 0.7 on said scale. Additionally, tracks under 0.1 on the scale actually have a slighly higher chance of being popular.

Regarding a track's length, tracks being under 5 minutes does have a significant association with increased likelihood of popularity when compared to tracks ranging from 5 to 10 minutes and tracks beyond 10 minutes. However, there is not sufficient evidence that tracks going above 10 minutes significantly reduces their chance of being popular when compared to those in the 5 to 10 minute category.

Below is a visual displaying the odds ratios for relevant variables from my AIC-created model, they will be referenced in my final conclusion. The extreme estimate, standard error and odds ratio for speechiness can be disregarded and attributed largely to the vast majority of observations falling below 0.5 on the speechiness scale. That along with the variable's extremely high p value, and standard error being practically identical to the standard error of the model's null intercept lead us to believe that the summary statistics for speechiness can be chalked up to random chance, not true logistic association.
```{r}
AIC_Model_Coef_Summary = summary(AIC_Model)$coef
AIC_Odds_Ratio = exp(coef(AIC_Model))
AIC_out.stats = cbind(AIC_Model_Coef_Summary, odds.ratio = AIC_Odds_Ratio)                 
kable(AIC_out.stats,caption = "AIC-Created Multiple Regression Model Coefficients W/ Odds Ratios")
```

## Final Association Modeling Conclusion
To conclude, in the most practical sense there are two factors that seem to have the greatest association on a track's chance of being popular on Spotify. Those being its instrumentalness and length. Tracks that are under 5 minutes long have about **twice** the likelihood of being popular than not, while tracks that are longer than 10 minutes do not have nearly as great a chance. Tracks that are not very instrumental (s under 0.1) also have close to twice as great a chance of being popular than not. Much better odds than their very instrumental (s over 0.7) and mildly instrumental (0.1 < s < 0.7) counterparts.

# Predictive Binary Logistic Model
Now that we have deeply studied the association of a track's likelihood of popularity with some of the track's characteristics, I am going to take our analysis one step further, away from the realm of strict observation and into the realm of educated and predictive speculation. For this process, I will utilize the models constructed in sections 3.1 through 3.3 above, as well as standard predictive modeling techniques like data splitting and ROC visualization.

## Data Splitting
Below I split the data we have been analyzing into two subsets, one for training purposes and one for assessment purposes, at a rate of 70% to 30% respectively. This procedure is done to counteract the risk of overfitting a model and to ensure that the true rate of inaccurate prediction (both false positives and false negatives) is known.

```{r Subsetting Data for Testing and Training}
n = nrow(Work_Data) # Sample size of data we are working with
Training_n =round(0.7*n) # Sample size of training data
Training_ID = sample(1:n, Training_n, replace = FALSE) 
  # The number of rows to choose from is 1:n, the number that will be selected is Training_n, replacement is set to FALSE since we are taking a subset of the data, so do not want to take the same observation(s) more than once.

Training_Data = Work_Data[Training_ID, ]
  # Creates a subset of Work_Data, assigning only the observations from the dataset that were marked with Training_ID to be in the Training_Data subset

Testing_Data = Work_Data[-Training_ID, ]
  # Creates an alternate subset of Work_Data, assigning only the observations from the dataset that were NOT marked with Training_ID to be in the Testing_Data subset
```

## Cross Validation (CV) Process and Model Selection
Now I will use the aforementioned models, created in my multiple regression association analysis, in the CV process. I will create 3 folds (distinct groups) and set the cut-off probability (what constitutes “yes” or “no” as the value of our binary response variable) to the standard 0.5. The results and interpretation can be seen below.

```{r}
# Multiple_Full_Log_Model = First multiple log model
# Multiple_Reduced_Log_Model = Manually reduced one
# AIC_Model = The final Algorithmically Reduced one we used

# 3 fold cross-validation
k = 3

fold_size = floor(dim(Training_Data)[1]/k)

## PE vectors for candidate models
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)
for(i in 1:k){  # Start of big FOR loop
  ## Training and testing folds
  valid.id = (fold_size*(i-1)+1):(fold_size*i)
  valid = Training_Data[valid.id, ]
  train.dat = Training_Data[-valid.id,]

  # Full Log Model
CVModel1 = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)
table(Training_Data$popular)


  # Manually Reduced Log Model
CVModel2 = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  
  # AIC Reduced (Final) Log Model
  CVModel3 = stepAIC(CVModel1, 
          scope = list(lower=formula(CVModel2),upper=formula(CVModel1)),
                      direction = "forward",   # forward selection
                      trace = 0 )

   ##  predicted probabilities of each candidate model
   pred01 = predict(CVModel1, newdata = valid, type="response")
   pred02 = predict(CVModel2, newdata = valid, type="response")
   pred03 = predict(CVModel3, newdata = valid, type="response")
   
   pre.outcome01 = ifelse(as.vector(pred01) > 0.5, "yes", "no")
   pre.outcome02 = ifelse(as.vector(pred02) > 0.5, "yes", "no")
   pre.outcome03 = ifelse(as.vector(pred03) > 0.5, "yes", "no")
      # Have our cut-off probability set to 0.5
        # HAVE to set responses as "yes" and "no", earlier had them set as "pos" and "neg" to mimic the case study code, which is why outcome was illogical
   
   PE1[i] = 1-sum(pre.outcome01 == valid$popular )/length(pred01)
   PE2[i] = 1-sum(pre.outcome02 == valid$popular )/length(pred02)
   PE3[i] = 1-sum(pre.outcome02 == valid$popular)/length(pred03)
   # valid$popular = “From the validation subset valid, take the column named popular.”
   
  } # End of big FOR loop
 
 # Display the output of the loop
Avg.Prediction_Error = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(Avg.Prediction_Error, digits = 4, caption = "Average Prediction Errors of the CV Models")

pred02 = predict(CVModel2, newdata = Testing_Data, type="response")
pred02.outcome = ifelse(as.vector(pred02)>0.5, "yes", "no")

accuracy = sum(pred02.outcome == Testing_Data$popular)/length(pred02)
kable(accuracy, caption="The actual accuracy of the final model")
```

We can see from above that, at a cut-off probability of 0.5, all three of our models used for associative analysis function with an average prediction error of ~ 0.1057 (this number can vary by about .01 either way depending on which run of the simulation we are on). This is not too surprising given the shared associative findings of each model (check Multiple Regression section). Given that I previously determined the *AIC-produced model* was the most ideal, that is the one I will consider as the ultimate or "final" predictive model. *This* model's rate of accuracy, which was found through applying it's structure to the testing dataset that we subsetted earlier, is about 82.67% (like prediction error, can have slight variation).

## ROC Visualization
To assist with understanding, I also constructed a ROC (Receiver Operating Characteristics) curve. This image allows us to interpret the relationship between our different models' specificity (rate of producing false positives) on the x axis VS sensitivity (rate of producing true positives) on the y axis. 

We can see from the curve below that all three models are extremely similar in their ratio of sensitivity to specificity, and hence have almost identical AUC values. As mentioned above, we will utilize the AIC-reduced model given that it was deemed the most accurate for our previous association analysis.
```{r}
######### Using a function to extract false positive and false negative values
TPR.FPR=function(pred){
  prob.seq = seq(0,1, length=50)  # 50 equally spaced cut-off probabilities
  pn=length(prob.seq)
  true.lab=as.vector(Training_Data$popular)
  TPR = NULL
  FPR = NULL
  ##
  for (i in 1:pn){
   pred.lab = as.vector(ifelse(pred >prob.seq[i],"yes", "no"))
   TPR[i] = length(which(true.lab=="yes" & pred.lab=="yes"))/length(which(true.lab=="yes"))
   FPR[i] = length(which(true.lab=="no" & pred.lab=="yes"))/length(which(true.lab=="no"))
  }
 cbind(FPR = FPR, TPR = TPR)
}

######### Calculating the AUC
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
###  Candidate Models
 # Full Log Model
CVModel1 = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  # Manually Reduced Log Model
CVModel2 = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  # AIC Reduced (Final) Log Model
  CVModel3 = stepAIC(CVModel1, 
          scope = list(lower=formula(CVModel2),upper=formula(CVModel1)),
                      direction = "forward",   # forward selection
                      trace = 0 )
##  predicted probabilities
pred01 = predict.glm(CVModel1, newdata = Training_Data, type="response") 
pred02 = predict.glm(CVModel2, newdata = Training_Data, type="response")
pred03 = predict.glm(CVModel3, newdata = Training_Data, type="response")
####
## ROC curve
 plot(TPR.FPR(pred01)[,1], TPR.FPR(pred01)[,2], 
      type="l", col="red", lty=1, lwd = 2, xlim=c(0,1), ylim=c(0,1),
      xlab = "False Positive Rate (FPR)",
      ylab ="True Positive Rate (TPR)",
      main = "ROC curves of the three candidate models",
      cex.main = 1.1,
      col.main = "black")
 lines(TPR.FPR(pred02)[,1], TPR.FPR(pred02)[,2],  col="darkblue", lty=2, lwd = 2)
 lines(TPR.FPR(pred03)[,1], TPR.FPR(pred03)[,2],  col="black", lty=3, lwd = 2)    

  ##
  category = Training_Data$popular == "yes"
  ROCobj01 <- roc(category, as.vector(pred01))
  ROCobj02 <- roc(category, as.vector(pred02))
  ROCobj03 <- roc(category, as.vector(pred03))
  AUC01 = round(auc(ROCobj01),4)
  AUC02 = round(auc(ROCobj02),4)
  AUC03 = round(auc(ROCobj03),4)
  ##
  legend("bottomright", c(paste("Full Log Model AUC = ",AUC01), 
                         paste("Man Reduced Log Model AUC = ",AUC02),
                         paste("AIC Reduced Log Model AUC = ", AUC03)),
        col=c("red","darkblue", "black"), lty=1:3, cex = 0.85, bty="n")
```

## Predictive Modeling Takeaways 
To conclude, if someone had interest in predicting a song's chance of being popular given its length, instrumentalness and (to a minor extent) tempo, they should utilize the AIC-reduced model that was formed in section 3.3 above. I could see those most interested in this predictive model being agents, record label executives and aspiring artists, all attempting to "fix" their music to fit the mold of what tends to commercially succeed on the world's largest streaming platform. 

# References:

Original Dataset Source:

- https://www.kaggle.com/datasets/solomonameh/spotify-music-dataset?select=high_popularity_spotify_data.csv

Dataset Download Links via Github:

- https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/high_popularity_spotify_data.csv

- https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/low_popularity_spotify_data.csv
