Introduction and
Background
Here we have a dataset originally composed of over 5,000 songs that
were available on the music-streaming service Spotify over the course of
2024. These tracks stem from a wide array of vastly different genres and
styles of creation. The dataset is publicly available on the platform
Kaggle and is provided in the References section below. Although the
data originally contained 29 variables, only 11 play a particularly
relevant role in the analysis I wish to do here. A breakdown of those
variables, their meanings and data types is below:
|
Name
|
Meaning
|
Data_Type
|
|
track_popularity
|
song’s rating on the 0 - 100 metric
|
double
|
|
popular
|
how popular the track is (more on that later)
|
character
|
|
track_name
|
name of the song
|
character
|
|
track_artist
|
artist’s stage name
|
character
|
|
playlist_genre
|
genre that Spotify classifies the song in
|
character
|
|
playlist_subgenre
|
subgenre that Spotify classifies the song in
|
character
|
|
tempo
|
the speed of a track, measured in beats per minute (BPM)
|
double
|
|
danceability
|
A score describing how suitable a track is for dancing based on tempo,
rhythm stability, beat strength and overall regularity (0 to 1 scale)
|
double
|
|
speechiness
|
the presence of audible words (0 to 1 scale)
|
double
|
|
instrumentalness
|
likelihood that the track contains instrumentals (0 to 1 scale)
|
double
|
|
duration_mins
|
length of track in minutes
|
double
|
Many of the other variables in the dataset I chose to disregard as
they either pertained to unnecessary technical information (such as the
algorithmic identifier for the particular song or its album), or
extremely niche musical topics that the average listener or introductory
critic would most likely not pick up on (such as mode and key).
Objective for my
Analysis
Using this dataset, I wish to quantifiably examine what musical
qualities and characteristics lead to a song (or “track”) becoming
extremely successful (hence popular) via Spotify. As of October of 2025,
Spotify has a market cap of over 139 billion dollars and close to 700
million users. This means that for any aspiring musician, “making it
big” on Spotify would serve as a monumental and life-altering
milestone.
Spotify has a metric known as the popularity index which serves to
numerically represent the popularity of tracks on its platform. While
the exact formula behind the score is kept confidential, critics in the
music industry and those with strong technical backgrounds have
theorized that the primary factors which influence a track’s score are
the following; its total number of streams, its recent number of
streams, its number of downloads and saves, and its month-to-month
growth in all three of said metrics.
This popularity index operates on a scale ranging from 0 to 100, with
only the most well-established stars reaching the 90th percentile and
higher. That being said, any track with a rating of 75 or higher is
largely considered a “top track” by industry experts, so that is the
interpretation I will apply with my analysis. I created a categorical
binary variable “popular”, which is valued as “yes” for tracks with a
popularity index rating of 75 or higher, and “no” for tracks with a
lower rating. I will perform associative logistic regression to assess
which, if any, of the factors mentioned in my variable breakdown above
contribute to a track being deemed popular or not.
Finally, after performing simple and multiple linear regression
processes to analyze the association between a track’s popularity and
its status of the variables I have laid out, I will then use predictive
modeling techniques in the effort to create a model which can accurately
predict a particular track’s likelihood of being popular on Spotify.
Simple Logistic
Regression Model
Checking for Pairwise
Relationships
Before jumping into model creation, I created a correlation matrix
for the quantitative predictor variables I am going to use in my
analysis. This is because variables with very high correlations (r >=
0.8) often can function as almost multiples of one another, leading to
inaccurate conclusions, especially when performing multiple
regression.
cor_matrix = cor(Quantitative_Variables)
# No high correlations between the variables
kable(cor_matrix, caption = "Correlation Matrix")
Correlation Matrix
| tempo |
1.0000000 |
0.0338179 |
0.0576956 |
-0.1518469 |
0.0397516 |
| danceability |
0.0338179 |
1.0000000 |
0.2840300 |
-0.3205070 |
-0.1374377 |
| speechiness |
0.0576956 |
0.2840300 |
1.0000000 |
-0.1903130 |
-0.1050720 |
| instrumentalness |
-0.1518469 |
-0.3205070 |
-0.1903130 |
1.0000000 |
-0.2218689 |
| duration_mins |
0.0397516 |
-0.1374377 |
-0.1050720 |
-0.2218689 |
1.0000000 |
After creating a correlation matrix, we can see that none of the
quantitative predictor variables that we are examining have a high
correlation with one another. Practically speaking, this is not very
surprising as even “similar” songs by the same artist or within the same
genre can have wildly different characteristics when it comes to tempo
or anything else we are looking at.
Picking Variable for
Simple Model
Below I created histograms to represent the distributions of all five
quantitative variables I am going to use in this analysis. The purpose
for doing such is to determine if any sort of variable manipulation is
necessary before using the variable in our logistic model.
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1)) # mar controls margins
ylimit = max(density(Work_Data$tempo)$y)
hist(Work_Data$tempo, probability = TRUE, main = "Tempo Distribution", xlab="Tempo",
col = "azure1", border="lightseagreen")
lines(density(Work_Data$tempo, adjust=2), col="blue")
ylimit = max(density(Work_Data$danceability)$y)
hist(Work_Data$danceability, probability = TRUE, main = "Danceability Distribution", xlab="Danceability",
col = "azure1", border="lightseagreen")
lines(density(Work_Data$danceability, adjust=2), col="blue")
ylimit = max(density(Work_Data$speechiness)$y)
hist(Work_Data$speechiness, probability = TRUE, main = "Speechiness Distribution", xlab="Speechiness",
col = "azure1", border="lightseagreen")
lines(density(Work_Data$speechiness, adjust=2), col="blue")
ylimit = max(density(Work_Data$instrumentalness)$y)
hist(Work_Data$instrumentalness, probability = TRUE, main = "Instrumentalness Distribution", xlab="Instrumentalness",
col = "azure1", border="lightseagreen")
lines(density(Work_Data$instrumentalness, adjust=2), col="blue")
ylimit = max(density(Work_Data$duration_mins)$y)
hist(Work_Data$duration_mins, probability = TRUE, main = "Length (mins) Distribution", xlab="Length (mins)",
col = "azure1", border="lightseagreen")
lines(density(Work_Data$duration_mins, adjust=2), col="blue")

We can see from the histograms that there is an extreme skew in
speechiness, instrumentalness and length. While there are slight skews
in tempo and danceability, they are far more moderate. Since
tempo appears to be the least skewed of the variables, and it’s
calculation is the most straightforward (a simple measure of beats per
minute), I will use tempo to perform my simple logistic regression.
Creating the Simple
Model
Using built-in R functions, I created a simple binary logistic
regression model, in which the tempo of a song is meant to predict
whether or not that track is popular or not (popularity score of 75 or
not). In the calculations, the base level, or 0, was stored for
instances of a track’s popularity being marked as “no”. While the
instance level, or 1, was stored for instances of a track’s popularity
being marked as “yes”.
Simple_Log_Model = glm(formula = popular ~ tempo, family = binomial(link = "logit"),
data = Work_Data)
invisible(summary(Simple_Log_Model))
Simple_Log_Model_Coef_Summary = summary(Simple_Log_Model)$coef # output stats of coefficients
Simple_Confidence_Interval = confint(Simple_Log_Model) # confidence intervals of betas
Simple_Summary_Stats = cbind(Simple_Log_Model_Coef_Summary, Simple_Confidence_Interval.95=Simple_Confidence_Interval) # rounding off decimals
kable(Simple_Summary_Stats,caption = "Simple Logistic Regression Model Summary")
Simple Logistic Regression Model Summary
| (Intercept) |
-2.2643674 |
0.2126810 |
-10.646777 |
0.000000 |
-2.6844577 |
-1.8504312 |
| tempo |
0.0060146 |
0.0017127 |
3.511866 |
0.000445 |
0.0026561 |
0.0093727 |
The summary of our model above leads us to believe that a track’s
tempo does have a positive association with the likelihood that a track
is popular on Spotify, however that association is very small in
practical effect. Despite a statistically significant p value, \(\beta\)1 has an estimated
association with the probability of a track being deemed popular of
.006, and a 95% confidence interval deeming the association to fall
between the interval [.003, .009].
Odds Ratio
Summary
While the direct interpretation of a logistic regression model’s
summary statistics can be helpful, it is sometimes more practical to
examine our model’s odds ratio. An odds ratio represents the likelihood
of success divided by the likelihood of failure. For example, an odds
ratio of 2 would mean that whatever instance we are considering a
“success” occurs twice the number of times than it fails to occur.
# Odds ratio
Simple_Log_Model_Coef_Summary = summary(Simple_Log_Model)$coef
Odds_Ratio = exp(coef(Simple_Log_Model))
out.stats = cbind(Simple_Log_Model_Coef_Summary, odds.ratio = Odds_Ratio)
kable(out.stats,caption = "Simple Regression Model Coefficients W/ Odds Ratios")
Simple Regression Model Coefficients W/ Odds Ratios
| (Intercept) |
-2.2643674 |
0.2126810 |
-10.646777 |
0.000000 |
0.1038957 |
| tempo |
0.0060146 |
0.0017127 |
3.511866 |
0.000445 |
1.0060327 |
The odds ratio between a track’s tempo and it’s popularity status is
~ 1.006. Since tempo is measured in beats per minute (BPM), this means
that for every additional beat per minute that a song is recorded in,
it’s probability of having a popularity score above 75 increases by
about 0.6%.
Goodness-of-fit
measures
Below I calculated the goodness-of-fit (GOF) measures for the
logistic regression model. Generally speaking, it is beneficial to find
a model’s GOF features as they serve as relative comparisons of model
efficiency when comparing multiple models. In this instance, we only
created one model so far, so the measures are not of practical use at
the moment.
Residual_Deviance = Simple_Log_Model$deviance
Null_Residual_Deviance = Simple_Log_Model$null.deviance
aic = Simple_Log_Model$aic
Goodness_Of_Fit = cbind(Deviance.residual = Residual_Deviance, Null.Deviance.Residual = Null_Residual_Deviance,
AIC = aic)
pander(Goodness_Of_Fit)
Success Probability
Curve
Below I created a visual of the the model’s S curve (also called
success probability curve or logistic curve). The S curve helps us see
the change in probability of an event happening (a track being popular
or not on Spotify), in relation to a change in the value of a predictor
variable (tempo in BPM in this instance).
###
tempo_range = range(Work_Data$tempo)
x = seq(tempo_range[1], tempo_range[2], length = 200)
beta.x = coef(Simple_Log_Model)[1] + coef(Simple_Log_Model)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(Simple_Log_Model)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,1))
plot(x, success.prob, type = "l", lwd = 2, col = "black",
main = "The Probability of a Track Being \n Considered Popular on Spotify",
ylim=c(0, 1.1*ylimit),
xlab = "Tempo (BPM)",
ylab = "probability",
axes = FALSE,
col.main = "black",
cex.main = 1.1)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)

##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "black",
main = "The Rate of Change in Probability of a Track Being \n Considered Popular on Spotify",
xlab = "Tempo (BPM)",
ylab = "Rate of Change",
ylim=c(0,1.1*y.rate),
axes = FALSE,
col.main = "black",
cex.main = 1.1
)
axis(1, pos = 0)
axis(2)

The S Curve provides visual confirmation of the conclusions we were
able to draw from analyzing our coefficient summary, confidence
intervals and odds ratio. That is, that there is a positive association
between a track’s tempo and it’s likelihood of being a popular track on
Spotify, but that association is very minimal in
practical effect.
We can see from our second plot, that looked at the relationship
between the rate of change in said probability and a track’s tempo, that
from 50 to 200 beats per minute, the probability of a track being
popular’s positive rate of change only increases from 0.0006 to
0.0012.
Takeaways from Simple
Logistic Regression Model
To conclude, we can say that although there is a statistically
significant and positive relationship between a track’s tempo and the
probability that it is popular on Spotify, that association is of such a
small magnitude that it is relatively a non-factor in practical sense.
Therefore, it would be overly simplistic and hyperbolic for music labels
or agents to aggressively push their artists towards a more up-tempo
production, as the return on such a decision would likely not be worth
the cost of souring the relationship with the artist at hand.
Multiple Logistic
Regression Model
As we saw earlier when creating our simple regression model, there is
no significant risk of multicollinearity with the data we are working
with. However, we also saw there is an extreme rightward skew in the
distribution of speechiness, instrumentalness and length. Below I take a
closer look at the distributions of those three variables.
## Remember in the HW instructions, include not just the numerical variables but two categorical ones too (let's use playlist_genre and playlist_subgenre)
par(mfrow=c(1,2))
hist(speechiness, xlab="Speechiness", main = "")
hist(instrumentalness, xlab = "Instrumentalness", main = "")

hist(duration_mins, xlab = "Length (Mins)", main = "")

To combat the skew of all three variables, I will perform
discretization on each of them. The coding for such can be seen
below.
# Speechiness = "presence of audible words on 0 to 1 scale"
grp.speechiness = speechiness
grp.speechiness[speechiness < 0.5] = ": s under 0.5"
grp.speechiness[speechiness > 0.5] = ": s over 0.5"
# sum(speechiness > 0.5) = 14
# sum(speechiness < 0.5) = 2641
# Instrumentalness = "likelihood that the track contains instrumentals (0 to 1 scale)"
grp.instrumentalness = instrumentalness
grp.instrumentalness[instrumentalness < 0.1] = ": s under 0.1"
grp.instrumentalness[instrumentalness > 0.1 & instrumentalness <= 0.7] = ": s from 0.1 to 0.7"
grp.instrumentalness[instrumentalness > 0.7] = ": s over 0.7"
# sum(instrumentalness < 0.1) = 1802
# sum(instrumentalness > 0.1 & instrumentalness < 0.7) = 193
# sum(instrumentalness > 0.7) = 659
# duration_mins below
grp.duration_mins = duration_mins
grp.duration_mins[duration_mins < 5] = ": Under 5 mins"
grp.duration_mins[duration_mins >= 5 & duration_mins <= 10] = ": Between 5 and 10 mins"
grp.duration_mins[duration_mins > 10] = ": Over 10 mins"
# sum(duration_mins < 5) = 2424
# sum(duration_mins >= 5 & duration_mins <= 10) = 211
# sum(duration_mins > 10) = 20
Work_Data$grp.speechiness = grp.speechiness
Work_Data$grp.instrumentalness = grp.instrumentalness
Work_Data$grp.duration_mins = grp.duration_mins
Building Full
Multiple Logistic Model
For the full multiple logistic model, I will initially include the
continuous numerical variables tempo and danceability in their original
form, as well as the discretized transformations of instrumentalness,
length in minutes and speechiness. After coding this model in, I will
examine its summary statistics.
Multiple_Full_Log_Model = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"),
data = Work_Data)
invisible(summary(Multiple_Full_Log_Model))
Multiple_Full_Log_Model_Coef_Summary = summary(Multiple_Full_Log_Model)$coef
Multiple_Full_Confidence_Interval = confint(Multiple_Full_Log_Model)
Multiple_Full_Summary_Stats = cbind(Multiple_Full_Log_Model_Coef_Summary, Multiple_Full_Confidence_Interval.95=Multiple_Full_Confidence_Interval) # rounding off decimals
kable(Multiple_Full_Summary_Stats,caption = "Full Multiple Logistic Regression Model Summary")
Full Multiple Logistic Regression Model Summary
| (Intercept) |
-17.3380444 |
375.7154226 |
-0.0461467 |
0.9631933 |
NA |
7.5447416 |
| tempo |
0.0038700 |
0.0018965 |
2.0405532 |
0.0412953 |
0.0001487 |
0.0075874 |
| danceability |
0.3230690 |
0.3274150 |
0.9867263 |
0.3237768 |
-0.3156955 |
0.9684832 |
| grp.speechiness: s under 0.5 |
14.2760281 |
375.7151570 |
0.0379969 |
0.9696901 |
-10.9252242 |
NA |
| grp.instrumentalness: s over 0.7 |
-2.1650529 |
0.3613366 |
-5.9917905 |
0.0000000 |
-2.9070614 |
-1.4777880 |
| grp.instrumentalness: s under 0.1 |
0.6024645 |
0.2205999 |
2.7310277 |
0.0063137 |
0.1872867 |
1.0554459 |
| grp.duration_mins: Over 10 mins |
-13.2932711 |
310.2840235 |
-0.0428423 |
0.9658273 |
NA |
4.4755962 |
| grp.duration_mins: Under 5 mins |
0.6782754 |
0.2321971 |
2.9211189 |
0.0034878 |
0.2431306 |
1.1574116 |
invisible(table(Work_Data$popular, Work_Data$grp.speechiness))
invisible(table(Work_Data$popular, Work_Data$grp.instrumentalness))
invisible(table(Work_Data$popular, Work_Data$grp.duration_mins))
#NA's in kable function display are due to "perfect" results between certain levels of our discretized variables and the response variable of popular or not. If all observations in a certain level result in one of the two options for the response variable, then that coefficient will trend towards negative or positive infinity per R's calculations.
We can see the following from our model summary. First, just like the
results from our simple regression model, there is once again a
statistically significant association between a track’s tempo and its
likelihood of being popular on Spotify (p ~ 0.0413). However that
association is so small that it is of little practical
importance. Additionally, it appears that a track’s instrumentalness and
length absolutely play a part in its probability of being popular on
Spotify.
If we look at the discretized breakdown of instrumentalness, we see
that there is an significant difference in tracks whose instrumentalness
rating is of 0.7 and higher, or less than 0.1, when compared to tracks
whose rating is somewhere in the middle. For track length, there also
appears to be significant evidence that tracks less than 5 minutes long
have a greater probability of achieving popularity on Spotify than those
of any other lengths in our dataset.
On the other hand, there is not evidence to suggest that a track’s
danceability or speechiness have such an association with its
popularity.
After examining the model summary, I also chose to run VIF on this
full model, which can be seen below. Given that all GVIF are below 1.1,
we can safely say that multicollinearity is not pertinent to this
dataset. This finding is in line with the takeaway from our pairwise
scatterplot earlier on.
vif(Multiple_Full_Log_Model)
GVIF Df GVIF^(1/(2*Df))
tempo 1.018314 1 1.009116
danceability 1.048916 1 1.024166
grp.speechiness 1.000000 1 1.000000
grp.instrumentalness 1.035527 2 1.008766
grp.duration_mins 1.012616 2 1.003139
Manually Reduced
Multiple Logistic Model
Since parsimony (having as least amount of variables as possible
without losing significant accuracy in association examination or
prediction accuracy) is an important aspect of statistical modeling, I
also then chose to create a reduced version of my multiple logistic
model above. In this model, I did not include predictors whose impact
was very clearly insignificant as per the original full model’s summary
(danceability and speechiness). The summary for this new reduced model
is below.
Multiple_Reduced_Log_Model = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"),
data = Work_Data)
invisible(summary(Multiple_Reduced_Log_Model))
Multiple_Reduced_Log_Model_Coef_Summary = summary(Multiple_Reduced_Log_Model)$coef
Multiple_Reduced_Log_Model_Confidence_Interval = confint(Multiple_Reduced_Log_Model)
Multiple_Reduced_Summary_Stats = cbind(Multiple_Reduced_Log_Model_Coef_Summary, Multiple_Reduced_Confidence_Interval.95=Multiple_Reduced_Log_Model_Confidence_Interval) # rounding off decimals
kable(Multiple_Reduced_Summary_Stats,caption = "Reduced Multiple Logistic Regression Model Summary")
Reduced Multiple Logistic Regression Model Summary
| (Intercept) |
-2.8580203 |
0.3686009 |
-7.7536985 |
0.0000000 |
-3.5991921 |
-2.1520692 |
| tempo |
0.0035527 |
0.0018681 |
1.9017792 |
0.0572000 |
-0.0001176 |
0.0072091 |
| grp.instrumentalness: s over 0.7 |
-2.1920495 |
0.3605499 |
-6.0797397 |
0.0000000 |
-2.9326652 |
-1.5064166 |
| grp.instrumentalness: s under 0.1 |
0.6152133 |
0.2195892 |
2.8016558 |
0.0050841 |
0.2022035 |
1.0664024 |
| grp.duration_mins: Over 10 mins |
-13.3332333 |
310.6582027 |
-0.0429193 |
0.9657659 |
NA |
4.4642172 |
| grp.duration_mins: Under 5 mins |
0.6999423 |
0.2308747 |
3.0316973 |
0.0024318 |
0.2676486 |
1.1767544 |
The results from this new reduced model largely align with the
findings from the original multiple model. However, just as overfitting
a model with needless information is a poor practice, so is underfitting
a model by not including everything that is truly important. To ensure
validity of this reduced model’s structure, below I ran automatic
stepwwise selection via R’s AIC function. The model’s summary statistics
have been printed.
Algorithmically
Reduced Multiple Logistic Model
library(MASS)
AIC_Model = stepAIC(Multiple_Reduced_Log_Model,
scope = list(lower=formula(Multiple_Reduced_Log_Model),upper=formula(Multiple_Full_Log_Model)),
direction = "forward", # forward selection
trace = 0 # do not show the details
)
kable(Multiple_Reduced_Summary_Stats,caption ="AIC-Created Multiple Logistic Regression Model Summary")
AIC-Created Multiple Logistic Regression Model
Summary
| (Intercept) |
-2.8580203 |
0.3686009 |
-7.7536985 |
0.0000000 |
-3.5991921 |
-2.1520692 |
| tempo |
0.0035527 |
0.0018681 |
1.9017792 |
0.0572000 |
-0.0001176 |
0.0072091 |
| grp.instrumentalness: s over 0.7 |
-2.1920495 |
0.3605499 |
-6.0797397 |
0.0000000 |
-2.9326652 |
-1.5064166 |
| grp.instrumentalness: s under 0.1 |
0.6152133 |
0.2195892 |
2.8016558 |
0.0050841 |
0.2022035 |
1.0664024 |
| grp.duration_mins: Over 10 mins |
-13.3332333 |
310.6582027 |
-0.0429193 |
0.9657659 |
NA |
4.4642172 |
| grp.duration_mins: Under 5 mins |
0.6999423 |
0.2308747 |
3.0316973 |
0.0024318 |
0.2676486 |
1.1767544 |
We can see from our model summary that the AIC-created model and the
reduced model I coded in via manual variable selection are identical.
This is not surprising due to how clearly significant or insignificant
the predictors were in the original full model. Below I will perform
global GOF measures on the model to have record of a formal comparison
of each one’s accuracy.
Model Comparison and
Selection
# GOF Full Multiple Model
Mult_Full_Residual_Deviance = Multiple_Full_Log_Model$deviance
Mult_Full_Null_Residual_Deviance = Multiple_Full_Log_Model$null.deviance
Mult_Full_aic = Multiple_Full_Log_Model$aic
Mult_Full_Goodness_Of_Fit = cbind(Mult_Full_Residual_Deviance,
Mult_Full_Null_Residual_Deviance, Mult_Full_aic)
# GOF Manually Reduced Multiple Model
Mult_Reduced_Residual_Deviance = Multiple_Reduced_Log_Model$deviance
Mult_Reduced_Null_Residual_Deviance = Multiple_Reduced_Log_Model$null.deviance
Mult_Reduced_aic = Multiple_Reduced_Log_Model$aic
Mult_Reduced_Goodness_Of_Fit = cbind(Mult_Reduced_Residual_Deviance,
Mult_Reduced_Null_Residual_Deviance, Mult_Reduced_aic)
# GOF AIC Reduced Model
Mult_AIC_Residual_Deviance = AIC_Model$deviance
Mult_AIC_Null_Residual_Deviance = AIC_Model$null.deviance
Mult_AIC_aic = AIC_Model$aic
Mult_AIC_Goodness_Of_Fit = cbind(Mult_AIC_Residual_Deviance,
Mult_AIC_Null_Residual_Deviance, Mult_AIC_aic)
Model_GOFs = rbind(Mult_Full_Goodness_Of_Fit, Mult_Reduced_Goodness_Of_Fit,
Mult_AIC_Goodness_Of_Fit)
rownames(Model_GOFs) = c("Full Model", "Manually Reduced Model", "AIC Reduced Model")
colnames(Model_GOFs) = c("Residual Deviance", "Null Deviance", "AIC")
kable(Model_GOFs, caption ="GOF Summary of Each Model")
GOF Summary of Each Model
| Full Model |
2216.519 |
2463.55 |
2232.519 |
| Manually Reduced Model |
2224.352 |
2463.55 |
2236.352 |
| AIC Reduced Model |
2217.497 |
2463.55 |
2231.497 |
As we can see from the table above, all three models perform
extremely similar. Since our AIC-created model has the lowest AIC, and
its residual deviance is only a fraction greater than that of our full
model, we will accept the AIC-created model as our final choice to use
for interpetation of this dataset.
Takeaways from
Multiple Logistic Regression Modeling Process
After the simple logistic regression modeling process earlier, we
came away with the conclusion that a track’s tempo was ‘technically’
significant in it’s likelihood of being popular on Spotify, but the
practical magnitude of its impact was so small that it was not a factor
that deserves a great deal of emphasis in the real world. Our
AIC-created model for the most part confirms this finding, (p ~
.0572).
The most important findings from our multiple logistic regression
process come from what was discovered by our discretized variables.
Tracks with very high instrumentalness (over 0.7 on this dataset’s
scale) have a lesser likelihood of being popular on Spotify than those
that fall in between 0.1 and 0.7 on said scale. Additionally, tracks
under 0.1 on the scale actually have a slighly higher chance of being
popular.
Regarding a track’s length, tracks being under 5 minutes does have a
significant association with increased likelihood of popularity when
compared to tracks ranging from 5 to 10 minutes and tracks beyond 10
minutes. However, there is not sufficient evidence that tracks going
above 10 minutes significantly reduces their chance of being popular
when compared to those in the 5 to 10 minute category.
Below is a visual displaying the odds ratios for relevant variables
from my AIC-created model, they will be referenced in my final
conclusion. The extreme estimate, standard error and odds ratio for
speechiness can be disregarded and attributed largely to the vast
majority of observations falling below 0.5 on the speechiness scale.
That along with the variable’s extremely high p value, and standard
error being practically identical to the standard error of the model’s
null intercept lead us to believe that the summary statistics for
speechiness can be chalked up to random chance, not true logistic
association.
AIC_Model_Coef_Summary = summary(AIC_Model)$coef
AIC_Odds_Ratio = exp(coef(AIC_Model))
AIC_out.stats = cbind(AIC_Model_Coef_Summary, odds.ratio = AIC_Odds_Ratio)
kable(AIC_out.stats,caption = "AIC-Created Multiple Regression Model Coefficients W/ Odds Ratios")
AIC-Created Multiple Regression Model Coefficients W/ Odds
Ratios
| (Intercept) |
-17.1140371 |
375.3481554 |
-0.0455951 |
0.9636330 |
0.0000000 |
| tempo |
0.0036649 |
0.0018745 |
1.9551582 |
0.0505644 |
1.0036716 |
| grp.instrumentalness: s over 0.7 |
-2.1888224 |
0.3605641 |
-6.0705491 |
0.0000000 |
0.1120486 |
| grp.instrumentalness: s under 0.1 |
0.6234236 |
0.2196071 |
2.8388131 |
0.0045282 |
1.8653032 |
| grp.duration_mins: Over 10 mins |
-13.3387936 |
310.5685700 |
-0.0429496 |
0.9657417 |
0.0000016 |
| grp.duration_mins: Under 5 mins |
0.7018358 |
0.2310153 |
3.0380492 |
0.0023812 |
2.0174530 |
| grp.speechiness: s under 0.5 |
14.2414898 |
375.3479579 |
0.0379421 |
0.9697338 |
1531088.9917716 |
Final Association
Modeling Conclusion
To conclude, in the most practical sense there are two factors that
seem to have the greatest association on a track’s chance of being
popular on Spotify. Those being its instrumentalness and length. Tracks
that are under 5 minutes long have about twice the
likelihood of being popular than not, while tracks that are longer than
10 minutes do not have nearly as great a chance. Tracks that are not
very instrumental (s under 0.1) also have close to twice as great a
chance of being popular than not. Much better odds than their very
instrumental (s over 0.7) and mildly instrumental (0.1 < s < 0.7)
counterparts.
Predictive Binary
Logistic Model
Now that we have deeply studied the association of a track’s
likelihood of popularity with some of the track’s characteristics, I am
going to take our analysis one step further, away from the realm of
strict observation and into the realm of educated and predictive
speculation. For this process, I will utilize the models constructed in
sections 3.1 through 3.3 above, as well as standard predictive modeling
techniques like data splitting and ROC visualization.
Data Splitting
Below I split the data we have been analyzing into two subsets, one
for training purposes and one for assessment purposes, at a rate of 70%
to 30% respectively. This procedure is done to counteract the risk of
overfitting a model and to ensure that the true rate of inaccurate
prediction (both false positives and false negatives) is known.
n = nrow(Work_Data) # Sample size of data we are working with
Training_n =round(0.7*n) # Sample size of training data
Training_ID = sample(1:n, Training_n, replace = FALSE)
# The number of rows to choose from is 1:n, the number that will be selected is Training_n, replacement is set to FALSE since we are taking a subset of the data, so do not want to take the same observation(s) more than once.
Training_Data = Work_Data[Training_ID, ]
# Creates a subset of Work_Data, assigning only the observations from the dataset that were marked with Training_ID to be in the Training_Data subset
Testing_Data = Work_Data[-Training_ID, ]
# Creates an alternate subset of Work_Data, assigning only the observations from the dataset that were NOT marked with Training_ID to be in the Testing_Data subset
Cross Validation (CV)
Process and Model Selection
Now I will use the aforementioned models, created in my multiple
regression association analysis, in the CV process. I will create 3
folds (distinct groups) and set the cut-off probability (what
constitutes “yes” or “no” as the value of our binary response variable)
to the standard 0.5. The results and interpretation can be seen
below.
# Multiple_Full_Log_Model = First multiple log model
# Multiple_Reduced_Log_Model = Manually reduced one
# AIC_Model = The final Algorithmically Reduced one we used
# 3 fold cross-validation
k = 3
fold_size = floor(dim(Training_Data)[1]/k)
## PE vectors for candidate models
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)
for(i in 1:k){ # Start of big FOR loop
## Training and testing folds
valid.id = (fold_size*(i-1)+1):(fold_size*i)
valid = Training_Data[valid.id, ]
train.dat = Training_Data[-valid.id,]
# Full Log Model
CVModel1 = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"),
data = train.dat)
table(Training_Data$popular)
# Manually Reduced Log Model
CVModel2 = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"),
data = train.dat)
# AIC Reduced (Final) Log Model
CVModel3 = stepAIC(CVModel1,
scope = list(lower=formula(CVModel2),upper=formula(CVModel1)),
direction = "forward", # forward selection
trace = 0 )
## predicted probabilities of each candidate model
pred01 = predict(CVModel1, newdata = valid, type="response")
pred02 = predict(CVModel2, newdata = valid, type="response")
pred03 = predict(CVModel3, newdata = valid, type="response")
pre.outcome01 = ifelse(as.vector(pred01) > 0.5, "yes", "no")
pre.outcome02 = ifelse(as.vector(pred02) > 0.5, "yes", "no")
pre.outcome03 = ifelse(as.vector(pred03) > 0.5, "yes", "no")
# Have our cut-off probability set to 0.5
# HAVE to set responses as "yes" and "no", earlier had them set as "pos" and "neg" to mimic the case study code, which is why outcome was illogical
PE1[i] = 1-sum(pre.outcome01 == valid$popular )/length(pred01)
PE2[i] = 1-sum(pre.outcome02 == valid$popular )/length(pred02)
PE3[i] = 1-sum(pre.outcome02 == valid$popular)/length(pred03)
# valid$popular = “From the validation subset valid, take the column named popular.”
} # End of big FOR loop
# Display the output of the loop
Avg.Prediction_Error = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(Avg.Prediction_Error, digits = 4, caption = "Average Prediction Errors of the CV Models")
Average Prediction Errors of the CV Models
| 0.104 |
0.104 |
0.104 |
pred02 = predict(CVModel2, newdata = Testing_Data, type="response")
pred02.outcome = ifelse(as.vector(pred02)>0.5, "yes", "no")
accuracy = sum(pred02.outcome == Testing_Data$popular)/length(pred02)
kable(accuracy, caption="The actual accuracy of the final model")
The actual accuracy of the final model
| 0.8205772 |
We can see from above that, at a cut-off probability of 0.5, all
three of our models used for associative analysis function with an
average prediction error of ~ 0.1057 (this number can vary by about .01
either way depending on which run of the simulation we are on). This is
not too surprising given the shared associative findings of each model
(check Multiple Regression section). Given that I previously determined
the AIC-produced model was the most ideal, that is the one I
will consider as the ultimate or “final” predictive model. This
model’s rate of accuracy, which was found through applying it’s
structure to the testing dataset that we subsetted earlier, is about
82.67% (like prediction error, can have slight variation).
ROC
Visualization
To assist with understanding, I also constructed a ROC (Receiver
Operating Characteristics) curve. This image allows us to interpret the
relationship between our different models’ specificity (rate of
producing false positives) on the x axis VS sensitivity (rate of
producing true positives) on the y axis.
We can see from the curve below that all three models are extremely
similar in their ratio of sensitivity to specificity, and hence have
almost identical AUC values. As mentioned above, we will utilize the
AIC-reduced model given that it was deemed the most accurate for our
previous association analysis.
######### Using a function to extract false positive and false negative values
TPR.FPR=function(pred){
prob.seq = seq(0,1, length=50) # 50 equally spaced cut-off probabilities
pn=length(prob.seq)
true.lab=as.vector(Training_Data$popular)
TPR = NULL
FPR = NULL
##
for (i in 1:pn){
pred.lab = as.vector(ifelse(pred >prob.seq[i],"yes", "no"))
TPR[i] = length(which(true.lab=="yes" & pred.lab=="yes"))/length(which(true.lab=="yes"))
FPR[i] = length(which(true.lab=="no" & pred.lab=="yes"))/length(which(true.lab=="no"))
}
cbind(FPR = FPR, TPR = TPR)
}
######### Calculating the AUC
if (!require("pROC")) {
install.packages("pROC")
library(pROC)
}
### Candidate Models
# Full Log Model
CVModel1 = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"),
data = train.dat)
# Manually Reduced Log Model
CVModel2 = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"),
data = train.dat)
# AIC Reduced (Final) Log Model
CVModel3 = stepAIC(CVModel1,
scope = list(lower=formula(CVModel2),upper=formula(CVModel1)),
direction = "forward", # forward selection
trace = 0 )
## predicted probabilities
pred01 = predict.glm(CVModel1, newdata = Training_Data, type="response")
pred02 = predict.glm(CVModel2, newdata = Training_Data, type="response")
pred03 = predict.glm(CVModel3, newdata = Training_Data, type="response")
####
## ROC curve
plot(TPR.FPR(pred01)[,1], TPR.FPR(pred01)[,2],
type="l", col="red", lty=1, lwd = 2, xlim=c(0,1), ylim=c(0,1),
xlab = "False Positive Rate (FPR)",
ylab ="True Positive Rate (TPR)",
main = "ROC curves of the three candidate models",
cex.main = 1.1,
col.main = "black")
lines(TPR.FPR(pred02)[,1], TPR.FPR(pred02)[,2], col="darkblue", lty=2, lwd = 2)
lines(TPR.FPR(pred03)[,1], TPR.FPR(pred03)[,2], col="black", lty=3, lwd = 2)
##
category = Training_Data$popular == "yes"
ROCobj01 <- roc(category, as.vector(pred01))
ROCobj02 <- roc(category, as.vector(pred02))
ROCobj03 <- roc(category, as.vector(pred03))
AUC01 = round(auc(ROCobj01),4)
AUC02 = round(auc(ROCobj02),4)
AUC03 = round(auc(ROCobj03),4)
##
legend("bottomright", c(paste("Full Log Model AUC = ",AUC01),
paste("Man Reduced Log Model AUC = ",AUC02),
paste("AIC Reduced Log Model AUC = ", AUC03)),
col=c("red","darkblue", "black"), lty=1:3, cex = 0.85, bty="n")

Predictive Modeling
Takeaways
To conclude, if someone had interest in predicting a song’s chance of
being popular given its length, instrumentalness and (to a minor extent)
tempo, they should utilize the AIC-reduced model that was formed in
section 3.3 above. I could see those most interested in this predictive
model being agents, record label executives and aspiring artists, all
attempting to “fix” their music to fit the mold of what tends to
commercially succeed on the world’s largest streaming platform.
---
title: "Binary Logistic Predictive Modeling of a Track's Likelihood to be Popular on Spotify"
author: "Chris Bahm"
date: "2025-10-23"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: true
    number_sections: true
    code_folding: hide
    code_download: true
    theme: lumen
    highlight: tango
  pdf_document:
    toc: true
    toc_depth: 4
    fig_caption: true
    number_sections: true
  word_document:
    toc: true
    toc_depth: 4
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.

if (!require(tidyverse)) {library(tidyvserse)} 

if (!require(GGally)) {library(GGally)} 

if (!require(kableExtra)) {library(kableExtra)} 

if (!require(ggplot2)) {library(ggplot2)} 

if (!require(car)) {library(car)} 

if (!require(dplyr)) {library(dplyr)} 

if (!require(pander)) {library(pander)} 

if (!require(car)) {library(car)} 

if (!require("scales")) {
install.packages("scales")                                        
library("scales") 
}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE,
	comment = NA,
	results = TRUE
)
   
```

# Introduction and Background

Here we have a dataset originally composed of over 5,000 songs that were available on the music-streaming service Spotify over the course of 2024. These tracks stem from a wide array of vastly different genres and styles of creation. The dataset is publicly available on the platform Kaggle and is provided in the References section below. Although the data originally contained 29 variables, only 11 play a particularly relevant role in the analysis I wish to do here. A breakdown of those variables, their meanings and data types is below:

```{r Variable Table, echo=FALSE}
library(knitr)

Var_Table = data.frame(
  Name = c("track_popularity", "popular", "track_name", "track_artist", "playlist_genre", "playlist_subgenre", "tempo", "danceability","speechiness", "instrumentalness", "duration_mins"),
  
  Meaning = c("song's rating on the 0 - 100 metric", "how popular the track is (more on that later)",
              "name of the song",
              "artist's stage name", 
              "genre that Spotify classifies the song in", 
              "subgenre that Spotify classifies the song in",  
              "the speed of a track, measured in beats per minute (BPM)", 
              "A score describing how suitable a track is for dancing based on tempo, rhythm stability, beat strength and overall regularity (0 to 1 scale)", 
              "the presence of audible words (0 to 1 scale)", 
              "likelihood that the track contains instrumentals (0 to 1 scale) ", 
              "length of track in minutes"),
  
  Data_Type = c("double", "character", "character", "character", "character", "character", "double", "double", "double", "double", "double")
)

kable(Var_Table) %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")
```

Many of the other variables in the dataset I chose to disregard as they either pertained to unnecessary technical information (such as the algorithmic identifier for the particular song or its album), or extremely niche musical topics that the average listener or introductory critic would most likely not pick up on (such as mode and key).

## Objective for my Analysis
Using this dataset, I wish to quantifiably examine what musical qualities and characteristics lead to a song (or "track") becoming extremely successful (hence popular) via Spotify. As of October of 2025, Spotify has a market cap of over 139 billion dollars and close to 700 million users. This means that for any aspiring musician, "making it big" on Spotify would serve as a monumental and life-altering milestone. 

Spotify has a metric known as the popularity index which serves to numerically represent the popularity of tracks on its platform. While the exact formula behind the score is kept confidential, critics in the music industry and those with strong technical backgrounds have theorized that the primary factors which influence a track's score are the following; its total number of streams, its recent number of streams, its number of downloads and saves, and its month-to-month growth in all three of said metrics.

This popularity index operates on a scale ranging from 0 to 100, with only the most well-established stars reaching the 90th percentile and higher. That being said, any track with a rating of 75 or higher is largely considered a "top track" by industry experts, so that is the interpretation I will apply with my analysis. I created a categorical binary variable "popular", which is valued as "yes" for tracks with a popularity index rating of 75 or higher, and "no" for tracks with a lower rating. I will perform associative logistic regression to assess which, if any, of the factors mentioned in my variable breakdown above contribute to a track being deemed popular or not. 

Finally, after performing simple and multiple linear regression processes to analyze the association between a track's popularity and its status of the variables I have laid out, I will then use predictive modeling techniques in the effort to create a model which can accurately predict a particular track's likelihood of being popular on Spotify.

```{r Data Read in and Cleaning, include=FALSE}
options(scipen = 999) # Out of scientific notation, personal preference

Data_Url_A = "https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/high_popularity_spotify_data.csv"
Above_68 = read_delim(Data_Url_A, delim = ",", show_col_types = FALSE)

Data_Url_B = "https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/low_popularity_spotify_data.csv"
Below_68 = read_delim(Data_Url_B, delim = ",", show_col_types = FALSE)


### Line below was used for initial visual purposes 
#Above_68 = Above_68[, c("track_name", "track_album_name", "track_artist", "playlist_name", "playlist_genre", "playlist_subgenre", "track_popularity", "energy", "tempo", "danceability", "loudness", "liveness", "valence", "speechiness", "instrumentalness", "key", "duration_ms", "time_signature", "track_href", "mode", "uri", "type", "track_album_release_date", "analysis_url", "id", "track_album_id", "playlist_id", "track_id", "acousticness")]


Data = rbind(Above_68, Below_68)
  Data$duration_mins = Data$duration_ms/60000
Data$popular = ifelse(Data$track_popularity > 75, "yes", "no")

Data = dplyr::select(Data, track_popularity, popular, track_name, track_album_name, track_artist, playlist_name, playlist_genre, playlist_subgenre, energy, tempo, danceability, loudness, liveness, valence, speechiness, instrumentalness, key, duration_mins)

unique(Data$playlist_genre)

# Breakdown of Variables Kept
   # Identifying Variables:
       # track_name
       # track_album_name
       # track_artist
   # Categorical Info:
       # playlist_name
       # playlist_genre
       # playlist_subgenre
       # **** popular **** binary response variable self-created
   # Quantitative Info:
       # track_popularity
       # energy
       # tempo
       # danceability
       # loudness
       # liveness
       # valence
       # speechiness
       # instrumentalness
       # key
       # duration_ms
   
    # 17 selected above from original data set, plus popular
   
  # Variables we are not including in reduced data set are largely technical tracking variables such as id and analysis_url, or very niche musical metrics that the average listener would be unaware of such as mode.


#### Checking for Duplicates:

Duplicate_Songs = Data[duplicated(Data$track_name) | duplicated(Data$track_name, fromLast = TRUE), ]
Duplicate_Songs_Reduced = dplyr::select(Duplicate_Songs, track_name, track_artist, playlist_name, playlist_genre, playlist_subgenre)

Duplicate_Song_Counts = as.data.frame(table(Duplicate_Songs$track_name))   
  # 837 instances of the same song being listed more than once

names(Duplicate_Song_Counts) = c("track_name", "count")
  # 378 songs listed more than once

sum(Duplicate_Song_Counts$count)   # Sum function confirms there are 837 total repeats

  
# Going to reduce main dataset to only one observation per song, used base R function to keep first occurence of each song

#### New_Data = no duplicates
New_Data = Data[!duplicated(Data$track_name), ]

# Going to eliminate genres that are not prevalent in North America and Europe, as my analysis will apply largely to audience in Western world

unique(New_Data$playlist_genre)

New_Data$West = ifelse(
  New_Data$playlist_genre %in% c("brazilian", "arabic", "indian", "mandopop",
                                 "cantopop", "k-pop", "turkish", "korean", "j-pop", "world", "latin", "electronic"),
  "no",
  "yes"
)

table(New_Data$West)

#### New_Data2 = Confined to Western music genres
New_Data2 = New_Data[New_Data$West!= "no", ]
colnames(New_Data2)

#### Work_Data = Reduced number of variables to perform analysis on, missing value checks performed
Work_Data = dplyr::select(New_Data2, track_popularity, popular, track_name, track_artist, playlist_genre, playlist_subgenre, tempo, danceability, speechiness, instrumentalness, duration_mins)

Work_Data = sort_by(Work_Data, Work_Data$track_popularity, decreasing=TRUE)

sum(is.na(Work_Data)) # 5 missing values in dataset

Missing_Rows = Work_Data[!complete.cases(Work_Data), ]
 # All 5 missing values come from one observation, the song is "Make It" by Berhanio. I will just remove this song from the dataset using filter function below.

Work_Data = filter(Work_Data, track_name != "Make It" )

  track_popularity = Work_Data$track_popularity 
  popular = Work_Data$popular
  track_name = Work_Data$track_name
  track_artist = Work_Data$track_artist
  playlist_genre = Work_Data$playlist_genre
  playlist_subgenre = Work_Data$playlist_subgenre
  tempo = Work_Data$tempo
  danceability = Work_Data$danceability
  speechiness = Work_Data$speechiness
  instrumentalness = Work_Data$instrumentalness
  duration_mins = Work_Data$duration_mins
  
Quantitative_Variables = cbind(tempo, danceability, speechiness, instrumentalness, duration_mins)

Categorical_Variables = cbind(playlist_genre, playlist_subgenre)
 
```

# Simple Logistic Regression Model

## Checking for Pairwise Relationships
Before jumping into model creation, I created a correlation matrix for the quantitative predictor variables I am going to use in my analysis. This is because variables with very high correlations (r >= 0.8) often can function as almost multiples of one another, leading to inaccurate conclusions, especially when performing multiple regression.
```{r}
cor_matrix = cor(Quantitative_Variables)
  # No high correlations between the variables

kable(cor_matrix, caption = "Correlation Matrix")
```

After creating a correlation matrix, we can see that none of the quantitative predictor variables that we are examining have a high correlation with one another. Practically speaking, this is not very surprising as even "similar" songs by the same artist or within the same genre can have wildly different characteristics when it comes to tempo or anything else we are looking at.


## Picking Variable for Simple Model
Below I created histograms to represent the distributions of all five quantitative variables I am going to use in this analysis. The purpose for doing such is to determine if any sort of variable manipulation is necessary before using the variable in our logistic model.
```{r}
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))  # mar controls margins 
  
  
ylimit = max(density(Work_Data$tempo)$y)
hist(Work_Data$tempo, probability = TRUE, main = "Tempo Distribution", xlab="Tempo", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$tempo, adjust=2), col="blue") 

ylimit = max(density(Work_Data$danceability)$y)
hist(Work_Data$danceability, probability = TRUE, main = "Danceability Distribution", xlab="Danceability", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$danceability, adjust=2), col="blue") 
  
ylimit = max(density(Work_Data$speechiness)$y)
hist(Work_Data$speechiness, probability = TRUE, main = "Speechiness Distribution", xlab="Speechiness", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$speechiness, adjust=2), col="blue") 
  
  
ylimit = max(density(Work_Data$instrumentalness)$y)
hist(Work_Data$instrumentalness, probability = TRUE, main = "Instrumentalness Distribution", xlab="Instrumentalness", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$instrumentalness, adjust=2), col="blue") 
  
  
ylimit = max(density(Work_Data$duration_mins)$y)
hist(Work_Data$duration_mins, probability = TRUE, main = "Length (mins) Distribution", xlab="Length (mins)", 
       col = "azure1", border="lightseagreen")
  lines(density(Work_Data$duration_mins, adjust=2), col="blue") 


```

We can see from the histograms that there is an extreme skew in speechiness, instrumentalness and length. While there are slight skews in tempo and danceability, they are *far* more moderate. Since tempo appears to be the least skewed of the variables, and it's calculation is the most straightforward (a simple measure of beats per minute), I will use tempo to perform my simple logistic regression.


## Creating the Simple Model
Using built-in R functions, I created a simple binary logistic regression model, in which the tempo of a song is meant to predict whether or not that track is popular or not (popularity score of 75 or not). In the calculations, the base level, or 0, was stored for instances of a track's popularity being marked as "no". While the instance level, or 1, was stored for instances of a track's popularity being marked as "yes".
```{r, include=FALSE}
# Convert the character values to factors, necessary for glm function to work
Work_Data$popular = factor(Work_Data$popular, levels = c("no", "yes"))
  invisible(str(Work_Data$popular))
  invisible(unique(Work_Data$popular))
```
```{r}
Simple_Log_Model = glm(formula = popular ~ tempo, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Simple_Log_Model))

Simple_Log_Model_Coef_Summary = summary(Simple_Log_Model)$coef       # output stats of coefficients
Simple_Confidence_Interval = confint(Simple_Log_Model)                     # confidence intervals of betas
Simple_Summary_Stats = cbind(Simple_Log_Model_Coef_Summary, Simple_Confidence_Interval.95=Simple_Confidence_Interval)   # rounding off decimals
kable(Simple_Summary_Stats,caption = "Simple Logistic Regression Model Summary")  
```
The summary of our model above leads us to believe that a track's tempo does have a positive association with the likelihood that a track is popular on Spotify, however that association is very small in practical effect. Despite a statistically significant p value, $\beta$~1~ has an estimated association with the probability of a track being deemed popular of .006, and a 95% confidence interval deeming the association to fall between the interval [.003, .009].


## Odds Ratio Summary
While the direct interpretation of a logistic regression model's summary statistics can be helpful, it is sometimes more practical to examine our model's odds ratio. An odds ratio represents the likelihood of success divided by the likelihood of failure. For example, an odds ratio of 2 would mean that whatever instance we are considering a "success" occurs twice the number of times than it fails to occur.
```{r}
# Odds ratio
Simple_Log_Model_Coef_Summary = summary(Simple_Log_Model)$coef
Odds_Ratio = exp(coef(Simple_Log_Model))
out.stats = cbind(Simple_Log_Model_Coef_Summary, odds.ratio = Odds_Ratio)                 
kable(out.stats,caption = "Simple Regression Model Coefficients W/ Odds Ratios")
```
The odds ratio between a track's tempo and it's popularity status is ~ 1.006. Since tempo is measured in beats per minute (BPM), this means that for every additional beat per minute that a song is recorded in, it's probability of having a popularity score above 75 increases by about 0.6%.

## Goodness-of-fit measures
Below I calculated the goodness-of-fit (GOF) measures for the logistic regression model. Generally speaking, it is beneficial to find a model's GOF features as they serve as relative comparisons of model efficiency when comparing multiple models. In this instance, we only created one model so far, so the measures are not of practical use at the moment.
```{r}
Residual_Deviance = Simple_Log_Model$deviance
Null_Residual_Deviance = Simple_Log_Model$null.deviance
aic = Simple_Log_Model$aic
Goodness_Of_Fit = cbind(Deviance.residual = Residual_Deviance, Null.Deviance.Residual = Null_Residual_Deviance,
      AIC = aic)
pander(Goodness_Of_Fit)

```

## Success Probability Curve
Below I created a visual of the the model's S curve (also called success probability curve or logistic curve). The S curve helps us see the change in probability of an event happening (a track being popular or not on Spotify), in relation to a change in the value of a predictor variable (tempo in BPM in this instance).
```{r}
###

tempo_range = range(Work_Data$tempo)
x = seq(tempo_range[1], tempo_range[2], length = 200)
beta.x = coef(Simple_Log_Model)[1] + coef(Simple_Log_Model)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(Simple_Log_Model)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,1))
plot(x, success.prob, type = "l", lwd = 2, col = "black",
     main = "The Probability of a Track Being \n Considered Popular on Spotify", 
     ylim=c(0, 1.1*ylimit),
     xlab = "Tempo (BPM)",
     ylab = "probability",
     axes = FALSE,
     col.main = "black",
     cex.main = 1.1)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)

##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "black",
     main = "The Rate of Change in Probability of a Track Being \n Considered Popular on Spotify", 
     xlab = "Tempo (BPM)",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "black",
     cex.main = 1.1
     )
axis(1, pos = 0)
axis(2)
```

The S Curve provides visual confirmation of the conclusions we were able to draw from analyzing our coefficient summary, confidence intervals and odds ratio. That is, that there is a positive association between a track's tempo and it's likelihood of being a popular track on Spotify, **but** that association is very minimal in practical effect.

We can see from our second plot, that looked at the relationship between the rate of change in said probability and a track's tempo, that from 50 to 200 beats per minute, the probability of a track being popular's positive rate of change only increases from 0.0006 to 0.0012.

## Takeaways from Simple Logistic Regression Model
To conclude, we can say that although there is a statistically significant and positive relationship between a track's tempo and the probability that it is popular on Spotify, that association is of such a small magnitude that it is relatively a non-factor in practical sense. Therefore, it would be overly simplistic and hyperbolic for music labels or agents to aggressively push their artists towards a more up-tempo production, as the return on such a decision would likely not be worth the cost of souring the relationship with the artist at hand. 

# Multiple Logistic Regression Model
As we saw earlier when creating our simple regression model, there is no significant risk of multicollinearity with the data we are working with. However, we also saw there is an extreme rightward skew in the distribution of speechiness, instrumentalness and length. Below I take a closer look at the distributions of those three variables.
```{r}
## Remember in the HW instructions, include not just the numerical variables but two categorical ones too (let's use playlist_genre and playlist_subgenre)

par(mfrow=c(1,2))
hist(speechiness, xlab="Speechiness", main = "")
hist(instrumentalness, xlab = "Instrumentalness", main = "")
hist(duration_mins, xlab = "Length (Mins)", main = "")
```

To combat the skew of all three variables, I will perform discretization on each of them. The coding for such can be seen below.
```{r}
  # Speechiness = "presence of audible words on 0 to 1 scale"
grp.speechiness = speechiness
grp.speechiness[speechiness < 0.5] = ": s under 0.5"
grp.speechiness[speechiness > 0.5] = ": s over 0.5"

  # sum(speechiness > 0.5) = 14
  # sum(speechiness < 0.5) = 2641


  # Instrumentalness = "likelihood that the track contains instrumentals (0 to 1 scale)"
grp.instrumentalness = instrumentalness
grp.instrumentalness[instrumentalness < 0.1] = ": s under 0.1"
grp.instrumentalness[instrumentalness > 0.1 & instrumentalness <= 0.7] = ": s from 0.1 to 0.7"
grp.instrumentalness[instrumentalness > 0.7] = ": s over 0.7"
 
 #  sum(instrumentalness < 0.1) = 1802
 #  sum(instrumentalness > 0.1 & instrumentalness < 0.7) = 193
 #  sum(instrumentalness > 0.7) = 659
  

  # duration_mins below
grp.duration_mins = duration_mins
grp.duration_mins[duration_mins < 5] = ": Under 5 mins"
grp.duration_mins[duration_mins >= 5 & duration_mins <= 10] = ": Between 5 and 10 mins"
grp.duration_mins[duration_mins > 10] = ": Over 10 mins"
  
# sum(duration_mins < 5) = 2424
# sum(duration_mins >= 5 & duration_mins <= 10) = 211
# sum(duration_mins > 10) = 20


Work_Data$grp.speechiness = grp.speechiness
Work_Data$grp.instrumentalness = grp.instrumentalness
Work_Data$grp.duration_mins = grp.duration_mins
```


## Building Full Multiple Logistic Model
For the full multiple logistic model, I will initially include the continuous numerical variables tempo and danceability in their original form, as well as the discretized transformations of instrumentalness, length in minutes and speechiness. After coding this model in, I will examine its summary statistics.

```{r}
Multiple_Full_Log_Model = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Multiple_Full_Log_Model))

Multiple_Full_Log_Model_Coef_Summary = summary(Multiple_Full_Log_Model)$coef
Multiple_Full_Confidence_Interval = confint(Multiple_Full_Log_Model)  

Multiple_Full_Summary_Stats = cbind(Multiple_Full_Log_Model_Coef_Summary, Multiple_Full_Confidence_Interval.95=Multiple_Full_Confidence_Interval)   # rounding off decimals

kable(Multiple_Full_Summary_Stats,caption = "Full Multiple Logistic Regression Model Summary") 

invisible(table(Work_Data$popular, Work_Data$grp.speechiness))
invisible(table(Work_Data$popular, Work_Data$grp.instrumentalness))
invisible(table(Work_Data$popular, Work_Data$grp.duration_mins))
  #NA's in kable function display are due to "perfect" results between certain levels of our discretized variables and the response variable of popular or not. If all observations in a certain level result in one of the two options for the response variable, then that coefficient will trend towards negative or positive infinity per R's calculations.
```

We can see the following from our model summary. First, just like the results from our simple regression model, there is once again a statistically significant association between a track's tempo and its likelihood of being popular on Spotify (p ~ 0.0413). However that association is so small that it is of little *practical* importance. Additionally, it appears that a track's instrumentalness and length absolutely play a part in its probability of being popular on Spotify. 

If we look at the discretized breakdown of instrumentalness, we see that there is an significant difference in tracks whose instrumentalness rating is of 0.7 and higher, or less than 0.1, when compared to tracks whose rating is somewhere in the middle. For track length, there also appears to be significant evidence that tracks less than 5 minutes long have a greater probability of achieving popularity on Spotify than those of any other lengths in our dataset.

On the other hand, there is not evidence to suggest that a track's danceability or speechiness have such an association with its popularity.

After examining the model summary, I also chose to run VIF on this full model, which can be seen below. Given that all GVIF are below 1.1, we can safely say that multicollinearity is not pertinent to this dataset. This finding is in line with the takeaway from our pairwise scatterplot earlier on.
```{r}
vif(Multiple_Full_Log_Model)
```

## Manually Reduced Multiple Logistic Model
Since parsimony (having as least amount of variables as possible without losing significant accuracy in association examination or prediction accuracy) is an important aspect of statistical modeling, I also then chose to create a reduced version of my multiple logistic model above. In this model, I did not include predictors whose impact was very clearly insignificant as per the original full model's summary (danceability and speechiness). The summary for this new reduced model is below.

```{r}
Multiple_Reduced_Log_Model = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = Work_Data)

invisible(summary(Multiple_Reduced_Log_Model))

Multiple_Reduced_Log_Model_Coef_Summary = summary(Multiple_Reduced_Log_Model)$coef
Multiple_Reduced_Log_Model_Confidence_Interval = confint(Multiple_Reduced_Log_Model)  

Multiple_Reduced_Summary_Stats = cbind(Multiple_Reduced_Log_Model_Coef_Summary, Multiple_Reduced_Confidence_Interval.95=Multiple_Reduced_Log_Model_Confidence_Interval)   # rounding off decimals

kable(Multiple_Reduced_Summary_Stats,caption = "Reduced Multiple Logistic Regression Model Summary") 
```

The results from this new reduced model largely align with the findings from the original multiple model. However, just as overfitting a model with needless information is a poor practice, so is underfitting a model by not including everything that is truly important. To ensure validity of this reduced model's structure, below I ran automatic stepwwise selection via R's AIC function. The model's summary statistics have been printed.

## Algorithmically Reduced Multiple Logistic Model
```{r}
library(MASS)
AIC_Model = stepAIC(Multiple_Reduced_Log_Model, 
                      scope = list(lower=formula(Multiple_Reduced_Log_Model),upper=formula(Multiple_Full_Log_Model)),
                      direction = "forward",   # forward selection
                      trace = 0   # do not show the details
                      )

kable(Multiple_Reduced_Summary_Stats,caption ="AIC-Created Multiple Logistic Regression Model Summary")
```

We can see from our model summary that the AIC-created model and the reduced model I coded in via manual variable selection are identical. This is not surprising due to how clearly significant or insignificant the predictors were in the original full model. Below I will perform global GOF measures on the model to have record of a formal comparison of each one's accuracy.

## Model Comparison and Selection
```{r}
 # GOF Full Multiple Model
Mult_Full_Residual_Deviance = Multiple_Full_Log_Model$deviance
Mult_Full_Null_Residual_Deviance = Multiple_Full_Log_Model$null.deviance
Mult_Full_aic = Multiple_Full_Log_Model$aic
Mult_Full_Goodness_Of_Fit = cbind(Mult_Full_Residual_Deviance,
                                  Mult_Full_Null_Residual_Deviance, Mult_Full_aic)

 # GOF Manually Reduced Multiple Model
Mult_Reduced_Residual_Deviance = Multiple_Reduced_Log_Model$deviance
Mult_Reduced_Null_Residual_Deviance = Multiple_Reduced_Log_Model$null.deviance
Mult_Reduced_aic = Multiple_Reduced_Log_Model$aic
Mult_Reduced_Goodness_Of_Fit = cbind(Mult_Reduced_Residual_Deviance,
                                  Mult_Reduced_Null_Residual_Deviance, Mult_Reduced_aic)

 # GOF AIC Reduced Model
Mult_AIC_Residual_Deviance = AIC_Model$deviance
Mult_AIC_Null_Residual_Deviance = AIC_Model$null.deviance
Mult_AIC_aic = AIC_Model$aic
Mult_AIC_Goodness_Of_Fit = cbind(Mult_AIC_Residual_Deviance,
                                  Mult_AIC_Null_Residual_Deviance, Mult_AIC_aic)

Model_GOFs = rbind(Mult_Full_Goodness_Of_Fit, Mult_Reduced_Goodness_Of_Fit,
                   Mult_AIC_Goodness_Of_Fit)
rownames(Model_GOFs) = c("Full Model", "Manually Reduced Model", "AIC Reduced Model")
colnames(Model_GOFs) = c("Residual Deviance", "Null Deviance", "AIC")

kable(Model_GOFs, caption ="GOF Summary of Each Model")
```

As we can see from the table above, all three models perform extremely similar. Since our AIC-created model has the lowest AIC, and its residual deviance is only a fraction greater than that of our full model, we will accept the AIC-created model as our final choice to use for interpetation of this dataset.

## Takeaways from Multiple Logistic Regression Modeling Process
After the simple logistic regression modeling process earlier, we came away with the conclusion that a track's tempo was 'technically' significant in it's likelihood of being popular on Spotify, but the practical magnitude of its impact was so small that it was not a factor that deserves a great deal of emphasis in the real world. Our AIC-created model for the most part confirms this finding, (p ~ .0572). 

The most important findings from our multiple logistic regression process come from what was discovered by our discretized variables. Tracks with very high instrumentalness (over 0.7 on this dataset's scale) have a lesser likelihood of being popular on Spotify than those that fall in between 0.1 and 0.7 on said scale. Additionally, tracks under 0.1 on the scale actually have a slighly higher chance of being popular.

Regarding a track's length, tracks being under 5 minutes does have a significant association with increased likelihood of popularity when compared to tracks ranging from 5 to 10 minutes and tracks beyond 10 minutes. However, there is not sufficient evidence that tracks going above 10 minutes significantly reduces their chance of being popular when compared to those in the 5 to 10 minute category.

Below is a visual displaying the odds ratios for relevant variables from my AIC-created model, they will be referenced in my final conclusion. The extreme estimate, standard error and odds ratio for speechiness can be disregarded and attributed largely to the vast majority of observations falling below 0.5 on the speechiness scale. That along with the variable's extremely high p value, and standard error being practically identical to the standard error of the model's null intercept lead us to believe that the summary statistics for speechiness can be chalked up to random chance, not true logistic association.
```{r}
AIC_Model_Coef_Summary = summary(AIC_Model)$coef
AIC_Odds_Ratio = exp(coef(AIC_Model))
AIC_out.stats = cbind(AIC_Model_Coef_Summary, odds.ratio = AIC_Odds_Ratio)                 
kable(AIC_out.stats,caption = "AIC-Created Multiple Regression Model Coefficients W/ Odds Ratios")
```

## Final Association Modeling Conclusion
To conclude, in the most practical sense there are two factors that seem to have the greatest association on a track's chance of being popular on Spotify. Those being its instrumentalness and length. Tracks that are under 5 minutes long have about **twice** the likelihood of being popular than not, while tracks that are longer than 10 minutes do not have nearly as great a chance. Tracks that are not very instrumental (s under 0.1) also have close to twice as great a chance of being popular than not. Much better odds than their very instrumental (s over 0.7) and mildly instrumental (0.1 < s < 0.7) counterparts.

# Predictive Binary Logistic Model
Now that we have deeply studied the association of a track's likelihood of popularity with some of the track's characteristics, I am going to take our analysis one step further, away from the realm of strict observation and into the realm of educated and predictive speculation. For this process, I will utilize the models constructed in sections 3.1 through 3.3 above, as well as standard predictive modeling techniques like data splitting and ROC visualization.

## Data Splitting
Below I split the data we have been analyzing into two subsets, one for training purposes and one for assessment purposes, at a rate of 70% to 30% respectively. This procedure is done to counteract the risk of overfitting a model and to ensure that the true rate of inaccurate prediction (both false positives and false negatives) is known.

```{r Subsetting Data for Testing and Training}
n = nrow(Work_Data) # Sample size of data we are working with
Training_n =round(0.7*n) # Sample size of training data
Training_ID = sample(1:n, Training_n, replace = FALSE) 
  # The number of rows to choose from is 1:n, the number that will be selected is Training_n, replacement is set to FALSE since we are taking a subset of the data, so do not want to take the same observation(s) more than once.

Training_Data = Work_Data[Training_ID, ]
  # Creates a subset of Work_Data, assigning only the observations from the dataset that were marked with Training_ID to be in the Training_Data subset

Testing_Data = Work_Data[-Training_ID, ]
  # Creates an alternate subset of Work_Data, assigning only the observations from the dataset that were NOT marked with Training_ID to be in the Testing_Data subset
```

## Cross Validation (CV) Process and Model Selection
Now I will use the aforementioned models, created in my multiple regression association analysis, in the CV process. I will create 3 folds (distinct groups) and set the cut-off probability (what constitutes “yes” or “no” as the value of our binary response variable) to the standard 0.5. The results and interpretation can be seen below.

```{r}
# Multiple_Full_Log_Model = First multiple log model
# Multiple_Reduced_Log_Model = Manually reduced one
# AIC_Model = The final Algorithmically Reduced one we used

# 3 fold cross-validation
k = 3

fold_size = floor(dim(Training_Data)[1]/k)

## PE vectors for candidate models
PE1 = rep(0,5)
PE2 = rep(0,5)
PE3 = rep(0,5)
for(i in 1:k){  # Start of big FOR loop
  ## Training and testing folds
  valid.id = (fold_size*(i-1)+1):(fold_size*i)
  valid = Training_Data[valid.id, ]
  train.dat = Training_Data[-valid.id,]

  # Full Log Model
CVModel1 = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)
table(Training_Data$popular)


  # Manually Reduced Log Model
CVModel2 = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  
  # AIC Reduced (Final) Log Model
  CVModel3 = stepAIC(CVModel1, 
          scope = list(lower=formula(CVModel2),upper=formula(CVModel1)),
                      direction = "forward",   # forward selection
                      trace = 0 )

   ##  predicted probabilities of each candidate model
   pred01 = predict(CVModel1, newdata = valid, type="response")
   pred02 = predict(CVModel2, newdata = valid, type="response")
   pred03 = predict(CVModel3, newdata = valid, type="response")
   
   pre.outcome01 = ifelse(as.vector(pred01) > 0.5, "yes", "no")
   pre.outcome02 = ifelse(as.vector(pred02) > 0.5, "yes", "no")
   pre.outcome03 = ifelse(as.vector(pred03) > 0.5, "yes", "no")
      # Have our cut-off probability set to 0.5
        # HAVE to set responses as "yes" and "no", earlier had them set as "pos" and "neg" to mimic the case study code, which is why outcome was illogical
   
   PE1[i] = 1-sum(pre.outcome01 == valid$popular )/length(pred01)
   PE2[i] = 1-sum(pre.outcome02 == valid$popular )/length(pred02)
   PE3[i] = 1-sum(pre.outcome02 == valid$popular)/length(pred03)
   # valid$popular = “From the validation subset valid, take the column named popular.”
   
  } # End of big FOR loop
 
 # Display the output of the loop
Avg.Prediction_Error = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(Avg.Prediction_Error, digits = 4, caption = "Average Prediction Errors of the CV Models")

pred02 = predict(CVModel2, newdata = Testing_Data, type="response")
pred02.outcome = ifelse(as.vector(pred02)>0.5, "yes", "no")

accuracy = sum(pred02.outcome == Testing_Data$popular)/length(pred02)
kable(accuracy, caption="The actual accuracy of the final model")
```

We can see from above that, at a cut-off probability of 0.5, all three of our models used for associative analysis function with an average prediction error of ~ 0.1057 (this number can vary by about .01 either way depending on which run of the simulation we are on). This is not too surprising given the shared associative findings of each model (check Multiple Regression section). Given that I previously determined the *AIC-produced model* was the most ideal, that is the one I will consider as the ultimate or "final" predictive model. *This* model's rate of accuracy, which was found through applying it's structure to the testing dataset that we subsetted earlier, is about 82.67% (like prediction error, can have slight variation).

## ROC Visualization
To assist with understanding, I also constructed a ROC (Receiver Operating Characteristics) curve. This image allows us to interpret the relationship between our different models' specificity (rate of producing false positives) on the x axis VS sensitivity (rate of producing true positives) on the y axis. 

We can see from the curve below that all three models are extremely similar in their ratio of sensitivity to specificity, and hence have almost identical AUC values. As mentioned above, we will utilize the AIC-reduced model given that it was deemed the most accurate for our previous association analysis.
```{r}
######### Using a function to extract false positive and false negative values
TPR.FPR=function(pred){
  prob.seq = seq(0,1, length=50)  # 50 equally spaced cut-off probabilities
  pn=length(prob.seq)
  true.lab=as.vector(Training_Data$popular)
  TPR = NULL
  FPR = NULL
  ##
  for (i in 1:pn){
   pred.lab = as.vector(ifelse(pred >prob.seq[i],"yes", "no"))
   TPR[i] = length(which(true.lab=="yes" & pred.lab=="yes"))/length(which(true.lab=="yes"))
   FPR[i] = length(which(true.lab=="no" & pred.lab=="yes"))/length(which(true.lab=="no"))
  }
 cbind(FPR = FPR, TPR = TPR)
}

######### Calculating the AUC
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
###  Candidate Models
 # Full Log Model
CVModel1 = glm(formula = popular ~ tempo + danceability + grp.speechiness + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  # Manually Reduced Log Model
CVModel2 = glm(formula = popular ~ tempo + grp.instrumentalness + grp.duration_mins, family = binomial(link = "logit"), 
    data = train.dat)

  # AIC Reduced (Final) Log Model
  CVModel3 = stepAIC(CVModel1, 
          scope = list(lower=formula(CVModel2),upper=formula(CVModel1)),
                      direction = "forward",   # forward selection
                      trace = 0 )
##  predicted probabilities
pred01 = predict.glm(CVModel1, newdata = Training_Data, type="response") 
pred02 = predict.glm(CVModel2, newdata = Training_Data, type="response")
pred03 = predict.glm(CVModel3, newdata = Training_Data, type="response")
####
## ROC curve
 plot(TPR.FPR(pred01)[,1], TPR.FPR(pred01)[,2], 
      type="l", col="red", lty=1, lwd = 2, xlim=c(0,1), ylim=c(0,1),
      xlab = "False Positive Rate (FPR)",
      ylab ="True Positive Rate (TPR)",
      main = "ROC curves of the three candidate models",
      cex.main = 1.1,
      col.main = "black")
 lines(TPR.FPR(pred02)[,1], TPR.FPR(pred02)[,2],  col="darkblue", lty=2, lwd = 2)
 lines(TPR.FPR(pred03)[,1], TPR.FPR(pred03)[,2],  col="black", lty=3, lwd = 2)    

  ##
  category = Training_Data$popular == "yes"
  ROCobj01 <- roc(category, as.vector(pred01))
  ROCobj02 <- roc(category, as.vector(pred02))
  ROCobj03 <- roc(category, as.vector(pred03))
  AUC01 = round(auc(ROCobj01),4)
  AUC02 = round(auc(ROCobj02),4)
  AUC03 = round(auc(ROCobj03),4)
  ##
  legend("bottomright", c(paste("Full Log Model AUC = ",AUC01), 
                         paste("Man Reduced Log Model AUC = ",AUC02),
                         paste("AIC Reduced Log Model AUC = ", AUC03)),
        col=c("red","darkblue", "black"), lty=1:3, cex = 0.85, bty="n")
```

## Predictive Modeling Takeaways 
To conclude, if someone had interest in predicting a song's chance of being popular given its length, instrumentalness and (to a minor extent) tempo, they should utilize the AIC-reduced model that was formed in section 3.3 above. I could see those most interested in this predictive model being agents, record label executives and aspiring artists, all attempting to "fix" their music to fit the mold of what tends to commercially succeed on the world's largest streaming platform. 

# References:

Original Dataset Source:

- https://www.kaggle.com/datasets/solomonameh/spotify-music-dataset?select=high_popularity_spotify_data.csv

Dataset Download Links via Github:

- https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/high_popularity_spotify_data.csv

- https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/low_popularity_spotify_data.csv
