1. Introduction and Research Question

This analysis explores the audio features of songs from the Billboard Top 100 to understand what makes a song a “Super-Hit.” The music industry often relies on intuition but can data reveal any underlying statistical patterns in song characteristics which actually make a song super hit?

Research Question: Can we identify specific measurable audio features that distinguish a “Super-Hit” from a regular song?

Why only these variables? Previous academic studies (e.g., Herremans & Chew, 2017) and data science projects have consistently found that features like danceability, energy, and valence are strong indicators of a song’s popularity and chart success. This analysis builds on that foundation by modeling how these and other audio features correlate with a song’s hit potential.

About the Dataset: The data combines song information from the Billboard Top 100 charts with audio feature data from Spotify. Each row represents a song and includes attributes like its chart position, year and a set of audio features calculated by Spotify’s algorithms.

2. Setup and Data Loading

First, we will load the necessary R packages for data manipulation, visualization and modeling.

library(tidyverse)      # Data manipulation & plotting
library(janitor)        # Clean column names
library(skimr)          # Quick data summary
library(ggthemes)       # FiveThirtyEight theme for plots
library(broom)          # Clean regression outputs
library(rsample)        # Train/test split

Next, we load the dataset and use clean_names() to standardize(consistent format) column names for easier use.

songs <- read.csv("billboard_top_100_final.csv") %>% 
  clean_names()

A quick glimpse at the data structure helps us understand the variables we are working with.

glimpse(songs)
## Rows: 1,798
## Columns: 20
## $ artist           <chr> "Daniel Powter", "Sean Paul", "Nelly Furtado Featurin…
## $ song             <chr> "Bad Day", "Temperature", "Promiscuous", "You're Beau…
## $ year             <int> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006,…
## $ album_id         <chr> "4zhigAhPwqp43XVHBiVeQI", "32Bu3ETQhR1PFCj3ndDlYf", "…
## $ album            <chr> "Daniel Powter", "The Trinity", "Promiscuous (In the …
## $ release_date     <chr> "2005-02-22", "2005-09-26", "2013-03-09", "2005-08-08…
## $ danceability     <dbl> 0.599, 0.951, 0.803, 0.675, 0.797, 0.706, 0.835, 0.78…
## $ energy           <dbl> 0.785, 0.600, 0.574, 0.479, 0.533, 0.800, 0.741, 0.79…
## $ key              <int> 3, 0, 1, 0, 8, 5, 8, 8, 7, 1, 2, 4, 8, 8, 2, 1, 6, 1,…
## $ loudness         <dbl> -4.013, -4.675, -13.556, -9.870, -9.554, -6.333, -1.6…
## $ mode             <int> 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,…
## $ speechiness      <dbl> 0.0309, 0.0685, 0.0530, 0.0278, 0.0690, 0.0399, 0.033…
## $ acousticness     <dbl> 0.448000, 0.106000, 0.004740, 0.633000, 0.016000, 0.0…
## $ instrumentalness <dbl> 3.36e-03, 0.00e+00, 2.50e-01, 1.76e-05, 1.68e-01, 0.0…
## $ liveness         <dbl> 0.1510, 0.0712, 0.0990, 0.0880, 0.1550, 0.0822, 0.082…
## $ valence          <dbl> 0.520, 0.822, 0.766, 0.454, 0.688, 0.629, 0.612, 0.83…
## $ tempo            <dbl> 140.046, 125.040, 114.306, 81.998, 99.989, 100.011, 1…
## $ duration_ms      <int> 233640, 218573, 243493, 209493, 234013, 259333, 18206…
## $ time_signature   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ genres           <chr> "['canadian pop', 'neo mellow', 'pop rock']", "['danc…

3. Defining the Target Variable: “Super-Hit”

To perform a classification analysis, we need a target variable. I’ve created a binary category called hit_class. Based on the idea that highly danceable, energetic and positive songs are often the biggest hits. a “Super-Hit” is defined as any song where the sum of danceability, energy, and valence is greater than 2. All other songs are classified as “Regular Hits.” further more converting it as factor for better handling of categorical data

songs <- songs %>%
  mutate(
    hit_class = ifelse(danceability + energy + valence > 2, "Super-Hit", "Regular Hit"),
    hit_class = factor(hit_class, levels = c("Regular Hit", "Super-Hit"))
  )

# Let's see the distribution of our new variable
table(songs$hit_class)
## 
## Regular Hit   Super-Hit 
##        1130         668

4. Exploratory Data Analysis (EDA)

Visualizing Audio Feature Distributions

To see how the audio features differ between “Regular Hits” and “Super-Hits,” we can plot their distributions. The plots below show that “Super-Hits” tend to have higher average values for danceability, energy, and valence which aligns with our initial hypothesis.

features <- c("danceability", "energy", "loudness", "speechiness",
              "acousticness", "duration_ms", "liveness", "valence", "tempo")

songs_long <- songs %>%
  pivot_longer(cols = all_of(features), names_to = "feature", values_to = "value")

ggplot(songs_long, aes(x = value, fill = hit_class)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~feature, scales = "free") +
  labs(
    title = "Super-Hits vs Regular Hits: Audio Feature Distributions",
    subtitle = "Density plots show the distribution of each audio feature for the two hit classes.",
    fill = "Hit Class"
  ) +
  theme_fivethirtyeight() +
  scale_fill_manual(values = c("Regular Hit" = "#30a2da", "Super-Hit" = "#fc4f30"))

Statistical Significance Testing

Are the visual differences statistically significant? We will run a t-test for each feature to compare the means between the two groups. A p-value less than 0.05 suggests a significant difference.

t_test_results <- map_df(features, function(feat) {
  t_out <- t.test(songs[[feat]] ~ songs$hit_class)
  tibble(
    feature = feat,
    mean_superhit = mean(songs %>% filter(hit_class == "Super-Hit") %>% pull(.data[[feat]]), na.rm = TRUE),
    mean_regular = mean(songs %>% filter(hit_class == "Regular Hit") %>% pull(.data[[feat]]), na.rm = TRUE),
    p_value = t_out$p.value
  )
}) %>%
  mutate(
    significance = ifelse(p_value < 0.05, "Significant", "Not Significant")
  )

print(t_test_results)
## # A tibble: 9 × 5
##   feature      mean_superhit mean_regular   p_value significance   
##   <chr>                <dbl>        <dbl>     <dbl> <chr>          
## 1 danceability         0.733       0.627  4.66e- 66 Significant    
## 2 energy               0.765       0.608  8.74e-110 Significant    
## 3 loudness            -5.13       -6.56   1.94e- 46 Significant    
## 4 speechiness          0.100       0.0998 9.18e-  1 Not Significant
## 5 acousticness         0.103       0.197  1.18e- 27 Significant    
## 6 duration_ms     215413.     222458.     2.76e-  4 Significant    
## 7 liveness             0.188       0.166  1.29e-  3 Significant    
## 8 valence              0.729       0.389  4.88e-318 Significant    
## 9 tempo              122.        122.     7.30e-  1 Not Significant

The results confirm that features like danceability, energy, loudness, valence and acousticness show statistically significant differences between our two classes.

5. Predictive Modeling: Logistic Regression

Now, let’s build a model to predict whether a song will be a “Super-Hit” based on its audio features. A logistic regression model is a great choice for this binary classification task.

Train/Test Split

First, we split the data into a training set (70%) and testing set (30%) to ensure our model’s performance is evaluated on unseen data.

set.seed(123) # for reproducibility
split <- initial_split(songs, prop = 0.7, strata = hit_class)
train_data <- training(split)
test_data  <- testing(split)

Building and Summarizing the Model

We’ll use the features that appeared most significant in our EDA.

model <- glm(hit_class ~ danceability + energy + loudness + valence + tempo,
             data = train_data, family = binomial)

summary(model)
## 
## Call:
## glm(formula = hit_class ~ danceability + energy + loudness + 
##     valence + tempo, family = binomial, data = train_data)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.240e+04  1.897e+05  -0.065    0.948
## danceability  6.091e+03  9.367e+04   0.065    0.948
## energy        6.234e+03  9.512e+04   0.066    0.948
## loudness     -1.122e+01  1.268e+03  -0.009    0.993
## valence       6.149e+03  9.650e+04   0.064    0.949
## tempo         7.006e-02  1.762e+01   0.004    0.997
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.6596e+03  on 1257  degrees of freedom
## Residual deviance: 1.7547e-05  on 1252  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25

Interpreting Model Coefficients with Odds Ratios

To make the model’s coefficients more interpretable, we convert them to odds ratios. An odds ratio greater than 1 indicates that an increase in that feature increases the odds of a song being a “Super-Hit.”

coef_df <- tidy(model, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(term != "(Intercept)")

ggplot(coef_df, aes(x = reorder(term, estimate), y = estimate)) +
  geom_point(size = 3, color = "#fc4f30") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  geom_hline(yintercept = 1, linetype = "dashed") +
  coord_flip() +
  labs(
    title = "Odds Ratios for Super-Hit Prediction",
    subtitle = "Features with odds ratios > 1 increase the likelihood of being a Super-Hit",
    y = "Odds Ratio", x = "Feature"
  ) +
  theme_fivethirtyeight()

As expected danceability, energy, and valence are strong positive predictors.

Model Evaluation

Finally, we evaluate the model’s performance on the test data using a confusion matrix and calculating its accuracy.

test_data <- test_data %>%
  mutate(
    pred_prob = predict(model, newdata = test_data, type = "response"),
    pred_class = ifelse(pred_prob > 0.5, "Super-Hit", "Regular Hit")
  )

# Confusion Matrix
cat("Confusion Matrix:\n")
## Confusion Matrix:
table(Predicted = test_data$pred_class, Actual = test_data$hit_class)
##              Actual
## Predicted     Regular Hit Super-Hit
##   Regular Hit         339         3
##   Super-Hit             0       198
# Accuracy
accuracy <- mean(test_data$pred_class == test_data$hit_class)
cat("\nModel Accuracy:", round(accuracy, 3), "\n")
## 
## Model Accuracy: 0.994

The model performs well, correctly classifying a high percentage of songs on the unseen test data.

6. Trend Analysis Over Time

Have the characteristics of hit songs changed over the years? Let’s plot the average danceability, energy and valence for top songs by year.

songs %>%
  group_by(year) %>%
  summarise(
    avg_dance = mean(danceability, na.rm = TRUE),
    avg_energy = mean(energy, na.rm = TRUE),
    avg_valence = mean(valence, na.rm = TRUE)
  ) %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = avg_dance, color = "Danceability")) +
  geom_line(aes(y = avg_energy, color = "Energy")) +
  geom_line(aes(y = avg_valence, color = "Valence")) +
  labs(
    title = "Trends in Hit Song Audio Features Over Time",
    subtitle = "Modern pop music appears more danceable and energetic.",
    y = "Average Feature Value (0-1)", x = "Year", color = "Feature"
  ) +
  theme_fivethirtyeight()

The plot shows a clear upward trend in the average danceability and energy of hit songs over the last several decades.

7. Conclusion

This analysis confirms that audio features provide significant insight into what makes a song a commercial success.

  • Key Differentiators: Super-Hits are statistically more likely to have higher danceability, energy, and valence.
  • Predictive Power: A logistic regression model can predict a song’s “Super-Hit” status with high accuracy based on these features.
  • Historical Trends: Hit songs have become progressively more danceable and energetic over time, reflecting changes in music production and consumer tastes, likely influenced by social media trends (e.g., TikTok dances).

In conclusion, while musical taste is subjective, the data shows that a quantifiable formula for a hit song—one that is upbeat, positive, and easy to dance to—is a powerful force in the music industry.