Load necessary libraries

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(broom)

## Warning: package 'broom' was built under R version 4.3.3

Load the spotify_songs dataset

spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")

binary column based on the ‘track_popularity’ score

Define a popularity threshold for “Top Hit” classification (e.g., popularity > 75)

spotify_songs <- spotify_songs %>%
  mutate(top_hit = ifelse(track_popularity > 75, 1, 0))

Check the new column

table(spotify_songs$top_hit)

## 
##     0     1 
## 30172  2661

# Save the modified dataset to use in further analysis
write.csv(spotify_songs, "spotify_songs_with_top_hit.csv", row.names = FALSE)

# Display the first few rows of the modified dataset
head(spotify_songs)

##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms top_hit
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754       0
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600       0
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616       0
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093       0
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052       0
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049       0

Explanation

Binary Column : created a new binary column called top_hit. Here, songs with a track_popularity score above 75 are labeled as “1” (Top Hit) and others as “0”. Verification: used table(spotify_songs$top_hit) to check the distribution of values in the top_hit column. Saving and Previewing: The modified dataset is saved for further analysis and previewed to verify the changes.

Logistic regression

To interpret the coefficients of a logistic regression model predicting top_hit (a binary outcome) in the spotify_songs dataset, I’ve selected four explanatory variables: danceability, energy, speechiness, and valence.

# Load necessary libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)

# Logistic regression model predicting `top_hit` using 4 explanatory variables
model <- glm(top_hit ~ danceability + energy + speechiness + valence, 
             data = spotify_songs, family = binomial)

# Summarize the model to get coefficients
summary(model)

## 
## Call:
## glm(formula = top_hit ~ danceability + energy + speechiness + 
##     valence, family = binomial, data = spotify_songs)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.68269    0.12888 -20.816   <2e-16 ***
## danceability  1.57222    0.15912   9.881   <2e-16 ***
## energy       -1.29109    0.11233 -11.494   <2e-16 ***
## speechiness   0.04316    0.19827   0.218    0.828    
## valence       0.14258    0.09441   1.510    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18473  on 32832  degrees of freedom
## Residual deviance: 18190  on 32828  degrees of freedom
## AIC: 18200
## 
## Number of Fisher Scoring iterations: 5

# Extract the model coefficients for interpretation
tidy_model <- tidy(model)
print(tidy_model)

## # A tibble: 5 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -2.68      0.129    -20.8   3.09e-96
## 2 danceability   1.57      0.159      9.88  5.04e-23
## 3 energy        -1.29      0.112    -11.5   1.41e-30
## 4 speechiness    0.0432    0.198      0.218 8.28e- 1
## 5 valence        0.143     0.0944     1.51  1.31e- 1

Interpretation of the Coefficients

Each coefficient in a logistic regression model represents the change in the log-odds of the outcome variable (top_hit in this case) for a one-unit increase in the predictor variable, holding other variables constant. Here’s how to interpret each coefficient based on hypothetical output:

Intercept:

The intercept is the log-odds of a song being a top_hit when all predictor variables are zero. Although this value itself may not be meaningful in isolation, it serves as the baseline for interpreting other variables.

Danceability:

Suppose the coefficient for danceability is positive. This suggests that as danceability increases, the likelihood of a song being classified as a top hit increases. Specifically, for each one-unit increase in danceability, the log-odds of a song being a top hit increase by the value of the danceability coefficient.
In probability terms, higher danceability values are associated with a greater probability of a song achieving top-hit status.

Energy:

If the coefficient for energy is positive, it indicates that songs with higher energy are more likely to become top hits. A one-unit increase in energy is associated with an increase in the log-odds of top_hit by the energy coefficient’s value.
If negative, the interpretation would be that higher energy reduces the odds of a song becoming a top hit.

Speechiness:

A positive coefficient for speechiness would imply that songs with higher speech content (spoken words) have a greater likelihood of becoming top hits.
Conversely, a negative coefficient suggests that as speechiness increases, the likelihood of a song being a top hit decreases.

Valence:

Valence measures musical positiveness. If the coefficient for valence is positive, this would indicate that songs with a happier or more positive tone are more likely to become top hits.
A negative coefficient implies that higher valence (positiveness) decreases the probability of a song being classified as a top hit.

Practical Meaning

Positive Coefficient: Indicates an increase in the probability of a song being a top hit with higher values of the explanatory variable.
Negative Coefficient: Indicates a decrease in the probability of a song being a top hit with higher values of the explanatory variable.

Each coefficient helps assess which factors (such as danceability, energy, etc.) contribute more significantly to a song’s success, as measured by top_hit status. By understanding these relationships, you can draw insights into the characteristics associated with popular songs.

# Extract coefficients and standard errors
coef <- coef(model)
se <- summary(model)$coefficients[, "Std. Error"]

# Degrees of freedom for the model
df <- df.residual(model)

# Calculate the critical t-value for a 95% confidence level
t_value <- qt(0.975, df)

# Calculate confidence intervals for all coefficients
ci_lower <- coef - t_value * se
ci_upper <- coef + t_value * se

# Store confidence intervals in a data frame
confidence_intervals <- data.frame(ci_lower, ci_upper)
rownames(confidence_intervals) <- names(coef)

print("95% Confidence Intervals for Model Coefficients:")

## [1] "95% Confidence Intervals for Model Coefficients:"

print(confidence_intervals)

##                ci_lower   ci_upper
## (Intercept)  -2.9352910 -2.4300897
## danceability  1.2603452  1.8840988
## energy       -1.5112555 -1.0709262
## speechiness  -0.3454571  0.4317718
## valence      -0.0424668  0.3276242

Interpretation of one coefficient (‘danceability’): Each coefficient represents the change in the log-odds of top_hit for a one-unit increase in the explanatory variable, holding others constant. A positive interval implies that higher values of the variable are associated with higher odds of a top hit.

Explanation and Interpretation

The logistic regression model predicts top_hit using danceability, energy, speechiness, and valence.

Confidence Interval Calculation:

calculated a 95% confidence interval for each coefficient using the formula: Estimate ± t-value × Std. Error The confidence_intervals data frame contains the lower and upper bounds for each coefficient.

Interpreting the Confidence Intervals: If a CI does not contain zero, it suggests that the corresponding predictor is statistically significant.

For example, a positive CI for danceability indicates that higher danceability is associated with an increased likelihood of a song being a top hit, while a negative CI would suggest the opposite.

above code provides a comprehensive view of the model’s coefficients and their confidence intervals, which you can interpret to assess the significance and impact of each predictor.

library(ggplot2)
library(dplyr)

# Function to create plot data for each predictor, holding others constant at their means
plot_probability <- function(variable, data, model) {
  # Filter out missing values
  data_clean <- data %>%
    filter(!is.na(danceability), !is.na(energy), !is.na(speechiness), !is.na(valence))
  
  # Create a data frame with the variable of interest varying, others held at mean
  plot_data <- expand.grid(
    danceability = if (variable == "danceability") seq(min(data_clean$danceability, na.rm = TRUE), max(data_clean$danceability, na.rm = TRUE), length.out = 100) else mean(data_clean$danceability, na.rm = TRUE),
    energy = if (variable == "energy") seq(min(data_clean$energy, na.rm = TRUE), max(data_clean$energy, na.rm = TRUE), length.out = 100) else mean(data_clean$energy, na.rm = TRUE),
    speechiness = if (variable == "speechiness") seq(min(data_clean$speechiness, na.rm = TRUE), max(data_clean$speechiness, na.rm = TRUE), length.out = 100) else mean(data_clean$speechiness, na.rm = TRUE),
    valence = if (variable == "valence") seq(min(data_clean$valence, na.rm = TRUE), max(data_clean$valence, na.rm = TRUE), length.out = 100) else mean(data_clean$valence, na.rm = TRUE)
  )

  # Predict probabilities
  plot_data$predicted_prob <- predict(model, newdata = plot_data, type = "response")

  # Plot predicted probabilities
  ggplot(plot_data, aes_string(x = variable, y = "predicted_prob")) +
    geom_line(color = "blue") +
    labs(
      title = paste("Effect of", variable, "on Probability of Top Hit"),
      x = variable,
      y = "Predicted Probability of Top Hit"
    ) +
    theme_minimal()
}

# Create and display plots for each predictor
plot_danceability <- plot_probability("danceability", spotify_songs, model)

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plot_energy <- plot_probability("energy", spotify_songs, model)
plot_speechiness <- plot_probability("speechiness", spotify_songs, model)
plot_valence <- plot_probability("valence", spotify_songs, model)

# Display the plots
plot_danceability

plot_energy

plot_speechiness

plot_valence

Explanation

plot_probability Function: This function takes the name of a predictor (e.g., “danceability”), the data (spotify_songs), and the logistic regression model (model). It varies the specified predictor across its range while holding the others constant at their mean values. It then predicts the probability of top_hit for each value of the predictor. Plots: For each predictor (e.g., danceability, energy, etc.), the plot shows the predicted probability of being a “Top Hit” as the predictor varies, helping to understand the effect of each predictor on the target variable.

library(ggplot2)

# Prepare confidence intervals data for ggplot
confidence_intervals$Coefficient <- rownames(confidence_intervals)  # Add coefficient names as a column
confidence_intervals$Width <- confidence_intervals$ci_upper - confidence_intervals$ci_lower  # Calculate width

# Plotting the confidence intervals
ggplot(confidence_intervals, aes(x = Coefficient, y = Width)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2, color = "black") +
  labs(
    title = "95% Confidence Intervals for Model Coefficients",
    x = "Coefficient",
    y = "Width of Confidence Interval"
  ) +
  coord_flip() +  # Flip axes for horizontal bars
  theme_minimal()

explanation

This visualization code creates a bar chart with error bars to represent the 95% confidence intervals for each coefficient in a logistic regression model. First, it calculates the width of each interval by subtracting the lower limit (ci_lower) from the upper limit (ci_upper). The bars themselves represent these widths, showing the degree of certainty or precision for each coefficient estimate. Each bar’s height corresponds to the interval’s width, with narrower bars indicating more precise estimates.

Error bars are added on each bar to show the full range of the confidence interval, extending from the lower to the upper bound. This allows us to see the range within which the true coefficient value likely falls, with 95% confidence, for each predictor. Finally, the coord_flip() function is used to switch the x and y axes, making the coefficient names easier to read by presenting the bars horizontally. This plot quickly communicates the uncertainty in each coefficient’s estimate, helping in model interpretation and assessing which predictors are statistically significant.

Binary column selection

Insights

Popularity Distribution: The binary classification of songs into “Top Hits” versus others reveals how many songs surpass the popularity threshold of 75. This gives us a baseline of how common high-popularity songs are in the dataset.
Characteristic Patterns: Preliminary data exploration may show that “Top Hits” have distinct characteristics in terms of features like tempo, energy, and danceability. Understanding these attributes can help pinpoint elements that contribute to a song’s widespread appeal.
Genre and Popularity: By associating genres with “Top Hits,” we can observe if certain genres are more prone to high popularity, indicating trends or preferences in listener behavior.

Significance

Market Insights: Knowing the factors that contribute to song popularity can help artists, record labels, and music producers tailor their work to align with listener preferences, ultimately enhancing market success.
Platform Recommendations: Streaming services like Spotify can leverage this insight for better recommendation algorithms, curating playlists that resonate more with users based on trends associated with popular tracks.
Predictive Modelling: These insights pave the way for predictive modeling—identifying attributes that often correlate with “Top Hits” could inform future decisions on song promotion or playlist placements.
Industry Trends: Analyzing the patterns in “Top Hits” also sheds light on evolving musical trends, giving music industry professionals an edge in adapting to listener preferences.

Further Investigation:

Genre Influence: Are certain genres more likely to contain “Top Hits”? This might reveal genre-specific trends in popularity.
Feature Comparison: Do attributes such as tempo, loudness, or danceability show consistent trends in “Top Hits”?
Temporal Patterns: Is there a trend in “Top Hits” over time? Do newer songs have a higher likelihood of achieving high popularity?
Artist Effect: Do certain artists produce more “Top Hits” compared to others?

logistic regression model

Insights

The logistic regression model here aims to understand the relationship between the probability of a song being a “Top Hit” (popularity score > 75) and specific audio features: danceability, energy, speechiness, and valence. The model’s output, particularly the coefficients, indicates how each feature influences the likelihood of a song being classified as a “Top Hit.” 1. Danceability: If the coefficient for danceability is positive and statistically significant, it suggests that higher danceability may increase the likelihood of a track becoming a “Top Hit.” Energy: A significant positive coefficient for energy would imply that more energetic tracks tend to have a higher chance of popularity. 2. Speechiness: Speechiness may have a different effect depending on genre and track type. If significant, it could indicate the influence of spoken-word elements on popularity. 3. Valence: The valence coefficient, if positive, suggests that songs with a happier or more positive tone may have better chances of achieving “Top Hit” status.

Significance

Music Production: Artists and producers can prioritize these traits, especially danceability and energy, when creating tracks intended for high engagement.
Streaming Platform Optimization: Streaming services might use such models to optimize recommendations, as these features could help identify songs that align with listener preferences.
Marketing Strategies: By focusing on songs with these attributes, marketers can better target and promote songs likely to perform well, increasing reach and engagement.

Further Investigation

Nonlinear Effects: Could certain features have a non-linear relationship with popularity, where only moderate levels (not high or low) lead to high popularity?
Interactions: Do interactions between these variables (e.g., energy and danceability) provide additional predictive power?
Genre-Specific Trends: Would the effect of these features differ across genres, indicating that different listener groups have varied preferences?

C.I

Insights

The code calculates 95% confidence intervals for the coefficients of the logistic regression model. Confidence intervals provide a range within which we expect the true value of each coefficient to lie, with a 95% level of certainty. This range is important because it allows us to interpret the reliability and significance of each predictor’s impact on the probability of a song being classified as a “Top Hit.”

Key insights:

Direction of Influence: If the confidence interval for a coefficient does not include zero, this suggests that the predictor has a statistically significant effect on the probability of a song being a “Top Hit.” For example, if danceability has a positive interval that doesn’t cross zero, higher danceability likely increases the song’s popularity.
Magnitude of Effect: The width of each interval tells us about the precision of our estimates. A narrow interval suggests a high level of confidence in the effect size, while a wide interval indicates more uncertainty.

Significance

Statistical Confidence: Confidence intervals are critical for assessing whether each predictor meaningfully impacts song popularity or if its effect may be due to random variation. Predictors with intervals that do not overlap zero are considered statistically significant, meaning they are likely to influence the target variable (top hit classification).
Informed Decisions: Identifying significant predictors provides insights for decision-makers, such as music producers or streaming platforms. For example, if high danceability significantly raises the likelihood of popularity, producers might focus on this characteristic when creating music intended for wide appeal.
Model Interpretation: Understanding the confidence interval for each predictor helps clarify how robust the model is. Confidence intervals provide more detail beyond the point estimates alone, allowing for a nuanced interpretation of each predictor’s role.

Further Investigation

Feature Interaction: Could interactions between variables (e.g., energy and danceability) further explain song popularity? Examining interaction terms could provide deeper insight into combined effects.
Alternative Confidence Levels: Would a 90% or 99% confidence level change which predictors are statistically significant? Testing different confidence levels might help assess the robustness of the results.
Genre-Specific Analysis: Are these predictors similarly significant across different genres, or are some predictors only impactful in specific music styles? Stratifying the analysis by genre could reveal more detailed patterns.

updated one

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(broom)

# Load the dataset
spotify_songs <- read.csv("C:/Users/priya/Downloads/spotify_songs.csv")

# Create a binary column for top_hit based on track_popularity
spotify_songs <- spotify_songs %>%
  mutate(top_hit = ifelse(track_popularity > 75, 1, 0))

# Logistic regression model
model <- glm(top_hit ~ danceability + energy, 
             data = spotify_songs, family = binomial)

# Summarize the model
summary(model)

## 
## Call:
## glm(formula = top_hit ~ danceability + energy, family = binomial, 
##     data = spotify_songs)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.6865     0.1288  -20.85   <2e-16 ***
## danceability   1.6579     0.1470   11.28   <2e-16 ***
## energy        -1.2544     0.1094  -11.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18473  on 32832  degrees of freedom
## Residual deviance: 18192  on 32830  degrees of freedom
## AIC: 18198
## 
## Number of Fisher Scoring iterations: 5

# Calculate Odds Ratios and Confidence Intervals
odds_ratios <- exp(coef(model))  # Exponentiate coefficients to get odds ratios
ci <- exp(confint(model))        # Confidence intervals for odds ratios

## Waiting for profiling to be done...

odds_ratios_df <- data.frame(
  Predictor = names(odds_ratios),
  OddsRatio = odds_ratios,
  CI_Lower = ci[, 1],
  CI_Upper = ci[, 2]
)

# Print odds ratios and confidence intervals
print("Odds Ratios and 95% Confidence Intervals:")

## [1] "Odds Ratios and 95% Confidence Intervals:"

print(odds_ratios_df)

##                 Predictor  OddsRatio   CI_Lower   CI_Upper
## (Intercept)   (Intercept) 0.06811861 0.05282436 0.08753002
## danceability danceability 5.24808084 3.93911351 7.00815483
## energy             energy 0.28524358 0.23032065 0.35360628

# Visualization of Odds Ratios with Confidence Intervals
ggplot(odds_ratios_df, aes(x = Predictor, y = OddsRatio)) +
  geom_point(size = 4, color = "blue") +
  geom_errorbar(aes(ymin = CI_Lower, ymax = CI_Upper), width = 0.2) +
  scale_y_log10() +
  labs(
    title = "Odds Ratios with 95% Confidence Intervals",
    x = "Predictors",
    y = "Odds Ratio (log scale)"
  ) +
  theme_minimal()

# Plot Predicted Probabilities for Significant Predictors
plot_probability <- function(variable, data, model) {
  data_clean <- data %>% filter(!is.na(danceability), !is.na(energy))
  
  plot_data <- expand.grid(
    danceability = if (variable == "danceability") 
      seq(min(data_clean$danceability, na.rm = TRUE), 
          max(data_clean$danceability, na.rm = TRUE), length.out = 100) 
    else mean(data_clean$danceability, na.rm = TRUE),
    
    energy = if (variable == "energy") 
      seq(min(data_clean$energy, na.rm = TRUE), 
          max(data_clean$energy, na.rm = TRUE), length.out = 100) 
    else mean(data_clean$energy, na.rm = TRUE)
  )
  
  plot_data$predicted_prob <- predict(model, newdata = plot_data, type = "response")
  
  ggplot(plot_data, aes_string(x = variable, y = "predicted_prob")) +
    geom_line(color = "blue") +
    labs(
      title = paste("Effect of", variable, "on Probability of Top Hit"),
      x = variable,
      y = "Predicted Probability"
    ) +
    theme_minimal()
}

# Create and display plots for danceability and energy
plot_danceability <- plot_probability("danceability", spotify_songs, model)
plot_energy <- plot_probability("energy", spotify_songs, model)

# Display the plots
print(plot_danceability)

print(plot_energy)

Logistic Regression

Model Summary:

Coefficients: Intercept (−2.6865): This represents the log-odds of a song being a “Top Hit” when both danceability and energy are zero. This value itself isn’t directly interpretable because it assumes unrealistic zero values for predictors.

Danceability (1.6579): - Log-odds interpretation: For every unit increase in danceability, the log-odds of a song being a “Top Hit” increase by 1.6579, holding other variables constant. - Odds ratio interpretation: The odds of being a “Top Hit” increase by e^1.6579 ≈5.25 or a 425% increase in the odds for every unit increase in danceability.

Energy (−1.2544): - Log-odds interpretation: For every unit increase in energy, the log-odds of a song being a “Top Hit” decrease by 1.2544, holding other variables constant. Odds ratio interpretation: The odds of being a “Top Hit” decrease by𝑒^−1.2544 ≈ 0.285, or a 71.5% decrease in the odds for every unit increase in energy.

Significance of Predictors:

Both danceability (𝑝< 2𝑒^−16) and energy (p<2e^−16) are highly significant predictors of “Top Hit” status.

Odds Ratios and Confidence Intervals:

Danceability: - Odds Ratio: 5.25, meaning a 425% increase in the odds of being a “Top Hit” with a one-unit increase in danceability. - Confidence Interval: [3.939, 7.008]. This interval does not include 1, confirming the predictor’s significance.

Energy: Odds Ratio: 0.285, meaning a 71.5% decrease in the odds of being a “Top Hit” with a one-unit increase in energy. Confidence Interval: [0.230, 0.354]. This interval does not include 1, confirming the predictor’s significance.

Visualization: The plot of odds ratios shows the relative importance of predictors: Danceability has a strong positive impact on the likelihood of a “Top Hit.” Energy has a significant negative impact.

Predicted Probabilities for Significant Predictors:

Danceability: The probability of a song being a “Top Hit” increases steadily with higher danceability. The curve is exponential, as logistic regression outputs probabilities derived from log-odds.

Energy: The probability of a song being a “Top Hit” decreases sharply with higher energy levels. This negative relationship suggests that overly energetic tracks may be less likely to become hits.

Interpretation of Plots: These plots provide a clear visual representation of how each predictor affects the probability of a song achieving “Top Hit” status.

They also show the non-linear nature of logistic regression, where probabilities asymptotically approach 0 or 1 but never exceed these bounds.

Data Dive — GLMs

2024-11-06

Load necessary libraries

Load the spotify_songs dataset

binary column based on the ‘track_popularity’ score

Define a popularity threshold for “Top Hit” classification (e.g., popularity > 75)

Check the new column

Explanation

Logistic regression

Interpretation of the Coefficients

Intercept:

Danceability:

Energy:

Speechiness:

Valence:

Practical Meaning

Explanation and Interpretation

Explanation

explanation

Binary column selection

Insights

Significance

Further Investigation:

logistic regression model

Insights

Significance

Further Investigation

C.I

Insights

Significance

Further Investigation

updated one

Logistic Regression

Odds Ratios and Confidence Intervals:

Predicted Probabilities for Significant Predictors: