library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(broom)
## Warning: package 'broom' was built under R version 4.3.3
spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")
spotify_songs <- spotify_songs %>%
mutate(top_hit = ifelse(track_popularity > 75, 1, 0))
table(spotify_songs$top_hit)
##
## 0 1
## 30172 2661
# Save the modified dataset to use in further analysis
write.csv(spotify_songs, "spotify_songs_with_top_hit.csv", row.names = FALSE)
# Display the first few rows of the modified dataset
head(spotify_songs)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms top_hit
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754 0
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600 0
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616 0
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093 0
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052 0
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049 0
Binary Column : created a new binary column called top_hit. Here, songs with a track_popularity score above 75 are labeled as “1” (Top Hit) and others as “0”. Verification: used table(spotify_songs$top_hit) to check the distribution of values in the top_hit column. Saving and Previewing: The modified dataset is saved for further analysis and previewed to verify the changes.
To interpret the coefficients of a logistic regression model predicting top_hit (a binary outcome) in the spotify_songs dataset, I’ve selected four explanatory variables: danceability, energy, speechiness, and valence.
# Load necessary libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
# Logistic regression model predicting `top_hit` using 4 explanatory variables
model <- glm(top_hit ~ danceability + energy + speechiness + valence,
data = spotify_songs, family = binomial)
# Summarize the model to get coefficients
summary(model)
##
## Call:
## glm(formula = top_hit ~ danceability + energy + speechiness +
## valence, family = binomial, data = spotify_songs)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.68269 0.12888 -20.816 <2e-16 ***
## danceability 1.57222 0.15912 9.881 <2e-16 ***
## energy -1.29109 0.11233 -11.494 <2e-16 ***
## speechiness 0.04316 0.19827 0.218 0.828
## valence 0.14258 0.09441 1.510 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18473 on 32832 degrees of freedom
## Residual deviance: 18190 on 32828 degrees of freedom
## AIC: 18200
##
## Number of Fisher Scoring iterations: 5
# Extract the model coefficients for interpretation
tidy_model <- tidy(model)
print(tidy_model)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.68 0.129 -20.8 3.09e-96
## 2 danceability 1.57 0.159 9.88 5.04e-23
## 3 energy -1.29 0.112 -11.5 1.41e-30
## 4 speechiness 0.0432 0.198 0.218 8.28e- 1
## 5 valence 0.143 0.0944 1.51 1.31e- 1
Each coefficient in a logistic regression model represents the change in the log-odds of the outcome variable (top_hit in this case) for a one-unit increase in the predictor variable, holding other variables constant. Here’s how to interpret each coefficient based on hypothetical output:
Each coefficient helps assess which factors (such as danceability, energy, etc.) contribute more significantly to a song’s success, as measured by top_hit status. By understanding these relationships, you can draw insights into the characteristics associated with popular songs.
# Extract coefficients and standard errors
coef <- coef(model)
se <- summary(model)$coefficients[, "Std. Error"]
# Degrees of freedom for the model
df <- df.residual(model)
# Calculate the critical t-value for a 95% confidence level
t_value <- qt(0.975, df)
# Calculate confidence intervals for all coefficients
ci_lower <- coef - t_value * se
ci_upper <- coef + t_value * se
# Store confidence intervals in a data frame
confidence_intervals <- data.frame(ci_lower, ci_upper)
rownames(confidence_intervals) <- names(coef)
print("95% Confidence Intervals for Model Coefficients:")
## [1] "95% Confidence Intervals for Model Coefficients:"
print(confidence_intervals)
## ci_lower ci_upper
## (Intercept) -2.9352910 -2.4300897
## danceability 1.2603452 1.8840988
## energy -1.5112555 -1.0709262
## speechiness -0.3454571 0.4317718
## valence -0.0424668 0.3276242
Interpretation of one coefficient (‘danceability’): Each coefficient
represents the change in the log-odds of top_hit
for a
one-unit increase in the explanatory variable, holding others constant.
A positive interval implies that higher values of the variable are
associated with higher odds of a top hit.
The logistic regression model predicts top_hit using danceability, energy, speechiness, and valence.
Confidence Interval Calculation:
calculated a 95% confidence interval for each coefficient using the formula: Estimate ± t-value × Std. Error The confidence_intervals data frame contains the lower and upper bounds for each coefficient.
Interpreting the Confidence Intervals: If a CI does not contain zero, it suggests that the corresponding predictor is statistically significant.
For example, a positive CI for danceability indicates that higher danceability is associated with an increased likelihood of a song being a top hit, while a negative CI would suggest the opposite.
above code provides a comprehensive view of the model’s coefficients and their confidence intervals, which you can interpret to assess the significance and impact of each predictor.
library(ggplot2)
library(dplyr)
# Function to create plot data for each predictor, holding others constant at their means
plot_probability <- function(variable, data, model) {
# Filter out missing values
data_clean <- data %>%
filter(!is.na(danceability), !is.na(energy), !is.na(speechiness), !is.na(valence))
# Create a data frame with the variable of interest varying, others held at mean
plot_data <- expand.grid(
danceability = if (variable == "danceability") seq(min(data_clean$danceability, na.rm = TRUE), max(data_clean$danceability, na.rm = TRUE), length.out = 100) else mean(data_clean$danceability, na.rm = TRUE),
energy = if (variable == "energy") seq(min(data_clean$energy, na.rm = TRUE), max(data_clean$energy, na.rm = TRUE), length.out = 100) else mean(data_clean$energy, na.rm = TRUE),
speechiness = if (variable == "speechiness") seq(min(data_clean$speechiness, na.rm = TRUE), max(data_clean$speechiness, na.rm = TRUE), length.out = 100) else mean(data_clean$speechiness, na.rm = TRUE),
valence = if (variable == "valence") seq(min(data_clean$valence, na.rm = TRUE), max(data_clean$valence, na.rm = TRUE), length.out = 100) else mean(data_clean$valence, na.rm = TRUE)
)
# Predict probabilities
plot_data$predicted_prob <- predict(model, newdata = plot_data, type = "response")
# Plot predicted probabilities
ggplot(plot_data, aes_string(x = variable, y = "predicted_prob")) +
geom_line(color = "blue") +
labs(
title = paste("Effect of", variable, "on Probability of Top Hit"),
x = variable,
y = "Predicted Probability of Top Hit"
) +
theme_minimal()
}
# Create and display plots for each predictor
plot_danceability <- plot_probability("danceability", spotify_songs, model)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot_energy <- plot_probability("energy", spotify_songs, model)
plot_speechiness <- plot_probability("speechiness", spotify_songs, model)
plot_valence <- plot_probability("valence", spotify_songs, model)
# Display the plots
plot_danceability
plot_energy
plot_speechiness
plot_valence
plot_probability Function: This function takes the name of a predictor (e.g., “danceability”), the data (spotify_songs), and the logistic regression model (model). It varies the specified predictor across its range while holding the others constant at their mean values. It then predicts the probability of top_hit for each value of the predictor. Plots: For each predictor (e.g., danceability, energy, etc.), the plot shows the predicted probability of being a “Top Hit” as the predictor varies, helping to understand the effect of each predictor on the target variable.
library(ggplot2)
# Prepare confidence intervals data for ggplot
confidence_intervals$Coefficient <- rownames(confidence_intervals) # Add coefficient names as a column
confidence_intervals$Width <- confidence_intervals$ci_upper - confidence_intervals$ci_lower # Calculate width
# Plotting the confidence intervals
ggplot(confidence_intervals, aes(x = Coefficient, y = Width)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2, color = "black") +
labs(
title = "95% Confidence Intervals for Model Coefficients",
x = "Coefficient",
y = "Width of Confidence Interval"
) +
coord_flip() + # Flip axes for horizontal bars
theme_minimal()
This visualization code creates a bar chart with error bars to represent the 95% confidence intervals for each coefficient in a logistic regression model. First, it calculates the width of each interval by subtracting the lower limit (ci_lower) from the upper limit (ci_upper). The bars themselves represent these widths, showing the degree of certainty or precision for each coefficient estimate. Each bar’s height corresponds to the interval’s width, with narrower bars indicating more precise estimates.
Error bars are added on each bar to show the full range of the confidence interval, extending from the lower to the upper bound. This allows us to see the range within which the true coefficient value likely falls, with 95% confidence, for each predictor. Finally, the coord_flip() function is used to switch the x and y axes, making the coefficient names easier to read by presenting the bars horizontally. This plot quickly communicates the uncertainty in each coefficient’s estimate, helping in model interpretation and assessing which predictors are statistically significant.
The logistic regression model here aims to understand the relationship between the probability of a song being a “Top Hit” (popularity score > 75) and specific audio features: danceability, energy, speechiness, and valence. The model’s output, particularly the coefficients, indicates how each feature influences the likelihood of a song being classified as a “Top Hit.” 1. Danceability: If the coefficient for danceability is positive and statistically significant, it suggests that higher danceability may increase the likelihood of a track becoming a “Top Hit.” Energy: A significant positive coefficient for energy would imply that more energetic tracks tend to have a higher chance of popularity. 2. Speechiness: Speechiness may have a different effect depending on genre and track type. If significant, it could indicate the influence of spoken-word elements on popularity. 3. Valence: The valence coefficient, if positive, suggests that songs with a happier or more positive tone may have better chances of achieving “Top Hit” status.
The code calculates 95% confidence intervals for the coefficients of the logistic regression model. Confidence intervals provide a range within which we expect the true value of each coefficient to lie, with a 95% level of certainty. This range is important because it allows us to interpret the reliability and significance of each predictor’s impact on the probability of a song being classified as a “Top Hit.”
Key insights:
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(broom)
# Load the dataset
spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")
# Create a binary column for top_hit based on track_popularity
spotify_songs <- spotify_songs %>%
mutate(top_hit = ifelse(track_popularity > 75, 1, 0))
# Logistic regression model
model <- glm(top_hit ~ danceability + energy,
data = spotify_songs, family = binomial)
# Summarize the model
summary(model)
##
## Call:
## glm(formula = top_hit ~ danceability + energy, family = binomial,
## data = spotify_songs)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6865 0.1288 -20.85 <2e-16 ***
## danceability 1.6579 0.1470 11.28 <2e-16 ***
## energy -1.2544 0.1094 -11.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18473 on 32832 degrees of freedom
## Residual deviance: 18192 on 32830 degrees of freedom
## AIC: 18198
##
## Number of Fisher Scoring iterations: 5
# Calculate Odds Ratios and Confidence Intervals
odds_ratios <- exp(coef(model)) # Exponentiate coefficients to get odds ratios
ci <- exp(confint(model)) # Confidence intervals for odds ratios
## Waiting for profiling to be done...
odds_ratios_df <- data.frame(
Predictor = names(odds_ratios),
OddsRatio = odds_ratios,
CI_Lower = ci[, 1],
CI_Upper = ci[, 2]
)
# Print odds ratios and confidence intervals
print("Odds Ratios and 95% Confidence Intervals:")
## [1] "Odds Ratios and 95% Confidence Intervals:"
print(odds_ratios_df)
## Predictor OddsRatio CI_Lower CI_Upper
## (Intercept) (Intercept) 0.06811861 0.05282436 0.08753002
## danceability danceability 5.24808084 3.93911351 7.00815483
## energy energy 0.28524358 0.23032065 0.35360628
# Visualization of Odds Ratios with Confidence Intervals
ggplot(odds_ratios_df, aes(x = Predictor, y = OddsRatio)) +
geom_point(size = 4, color = "blue") +
geom_errorbar(aes(ymin = CI_Lower, ymax = CI_Upper), width = 0.2) +
scale_y_log10() +
labs(
title = "Odds Ratios with 95% Confidence Intervals",
x = "Predictors",
y = "Odds Ratio (log scale)"
) +
theme_minimal()
# Plot Predicted Probabilities for Significant Predictors
plot_probability <- function(variable, data, model) {
data_clean <- data %>% filter(!is.na(danceability), !is.na(energy))
plot_data <- expand.grid(
danceability = if (variable == "danceability")
seq(min(data_clean$danceability, na.rm = TRUE),
max(data_clean$danceability, na.rm = TRUE), length.out = 100)
else mean(data_clean$danceability, na.rm = TRUE),
energy = if (variable == "energy")
seq(min(data_clean$energy, na.rm = TRUE),
max(data_clean$energy, na.rm = TRUE), length.out = 100)
else mean(data_clean$energy, na.rm = TRUE)
)
plot_data$predicted_prob <- predict(model, newdata = plot_data, type = "response")
ggplot(plot_data, aes_string(x = variable, y = "predicted_prob")) +
geom_line(color = "blue") +
labs(
title = paste("Effect of", variable, "on Probability of Top Hit"),
x = variable,
y = "Predicted Probability"
) +
theme_minimal()
}
# Create and display plots for danceability and energy
plot_danceability <- plot_probability("danceability", spotify_songs, model)
plot_energy <- plot_probability("energy", spotify_songs, model)
# Display the plots
print(plot_danceability)
print(plot_energy)
Model Summary:
Coefficients: Intercept (−2.6865): This represents the log-odds of a song being a “Top Hit” when both danceability and energy are zero. This value itself isn’t directly interpretable because it assumes unrealistic zero values for predictors.
Danceability (1.6579): - Log-odds interpretation: For every unit increase in danceability, the log-odds of a song being a “Top Hit” increase by 1.6579, holding other variables constant. - Odds ratio interpretation: The odds of being a “Top Hit” increase by e^1.6579 ≈5.25 or a 425% increase in the odds for every unit increase in danceability.
Energy (−1.2544): - Log-odds interpretation: For every unit increase in energy, the log-odds of a song being a “Top Hit” decrease by 1.2544, holding other variables constant. Odds ratio interpretation: The odds of being a “Top Hit” decrease by𝑒^−1.2544 ≈ 0.285, or a 71.5% decrease in the odds for every unit increase in energy.
Significance of Predictors:
Both danceability (𝑝< 2𝑒^−16) and energy (p<2e^−16) are highly significant predictors of “Top Hit” status.
Danceability: - Odds Ratio: 5.25, meaning a 425% increase in the odds of being a “Top Hit” with a one-unit increase in danceability. - Confidence Interval: [3.939, 7.008]. This interval does not include 1, confirming the predictor’s significance.
Energy: Odds Ratio: 0.285, meaning a 71.5% decrease in the odds of being a “Top Hit” with a one-unit increase in energy. Confidence Interval: [0.230, 0.354]. This interval does not include 1, confirming the predictor’s significance.
Visualization: The plot of odds ratios shows the relative importance of predictors: Danceability has a strong positive impact on the likelihood of a “Top Hit.” Energy has a significant negative impact.
Danceability: The probability of a song being a “Top Hit” increases steadily with higher danceability. The curve is exponential, as logistic regression outputs probabilities derived from log-odds.
Energy: The probability of a song being a “Top Hit” decreases sharply with higher energy levels. This negative relationship suggests that overly energetic tracks may be less likely to become hits.
Interpretation of Plots: These plots provide a clear visual representation of how each predictor affects the probability of a song achieving “Top Hit” status.
They also show the non-linear nature of logistic regression, where probabilities asymptotically approach 0 or 1 but never exceed these bounds.