This analysis explores the audio features of songs from the Billboard Top 100 to understand what makes a song a “Super-Hit.” The music industry often relies on intuition but can data reveal any underlying statistical patterns in song characteristics which actually make a song super hit?
Research Question: Can we identify specific measurable audio features that distinguish a “Super-Hit” from a regular song?
Why only these variables? Previous academic studies (e.g., Herremans & Chew, 2017) and data science projects have consistently found that features like danceability, energy, and valence are strong indicators of a song’s popularity and chart success. This analysis builds on that foundation by modeling how these and other audio features correlate with a song’s hit potential.
About the Dataset: The data combines song information from the Billboard Top 100 charts with audio feature data from Spotify. Each row represents a song and includes attributes like its chart position, year and a set of audio features calculated by Spotify’s algorithms.
First, we will load the necessary R packages for data manipulation, visualization and modeling.
library(tidyverse) # Data manipulation & plotting
library(janitor) # Clean column names
library(skimr) # Quick data summary
library(ggthemes) # FiveThirtyEight theme for plots
library(broom) # Clean regression outputs
library(rsample) # Train/test split
Next, we load the dataset and use clean_names() to
standardize(consistent format) column names for easier use.
songs <- read.csv("billboard_top_100_final.csv") %>%
clean_names()
A quick glimpse at the data structure helps us understand the variables we are working with.
glimpse(songs)
## Rows: 1,798
## Columns: 20
## $ artist <chr> "Daniel Powter", "Sean Paul", "Nelly Furtado Featurin…
## $ song <chr> "Bad Day", "Temperature", "Promiscuous", "You're Beau…
## $ year <int> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006,…
## $ album_id <chr> "4zhigAhPwqp43XVHBiVeQI", "32Bu3ETQhR1PFCj3ndDlYf", "…
## $ album <chr> "Daniel Powter", "The Trinity", "Promiscuous (In the …
## $ release_date <chr> "2005-02-22", "2005-09-26", "2013-03-09", "2005-08-08…
## $ danceability <dbl> 0.599, 0.951, 0.803, 0.675, 0.797, 0.706, 0.835, 0.78…
## $ energy <dbl> 0.785, 0.600, 0.574, 0.479, 0.533, 0.800, 0.741, 0.79…
## $ key <int> 3, 0, 1, 0, 8, 5, 8, 8, 7, 1, 2, 4, 8, 8, 2, 1, 6, 1,…
## $ loudness <dbl> -4.013, -4.675, -13.556, -9.870, -9.554, -6.333, -1.6…
## $ mode <int> 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,…
## $ speechiness <dbl> 0.0309, 0.0685, 0.0530, 0.0278, 0.0690, 0.0399, 0.033…
## $ acousticness <dbl> 0.448000, 0.106000, 0.004740, 0.633000, 0.016000, 0.0…
## $ instrumentalness <dbl> 3.36e-03, 0.00e+00, 2.50e-01, 1.76e-05, 1.68e-01, 0.0…
## $ liveness <dbl> 0.1510, 0.0712, 0.0990, 0.0880, 0.1550, 0.0822, 0.082…
## $ valence <dbl> 0.520, 0.822, 0.766, 0.454, 0.688, 0.629, 0.612, 0.83…
## $ tempo <dbl> 140.046, 125.040, 114.306, 81.998, 99.989, 100.011, 1…
## $ duration_ms <int> 233640, 218573, 243493, 209493, 234013, 259333, 18206…
## $ time_signature <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ genres <chr> "['canadian pop', 'neo mellow', 'pop rock']", "['danc…
To perform a classification analysis, we need a target variable. I’ve
created a binary category called hit_class. Based on the
idea that highly danceable, energetic and positive songs are often the
biggest hits. a “Super-Hit” is defined as any song where the sum of
danceability, energy, and valence
is greater than 2. All other songs are classified as “Regular Hits.”
further more converting it as factor for better handling of categorical
data
songs <- songs %>%
mutate(
hit_class = ifelse(danceability + energy + valence > 2, "Super-Hit", "Regular Hit"),
hit_class = factor(hit_class, levels = c("Regular Hit", "Super-Hit"))
)
# Let's see the distribution of our new variable
table(songs$hit_class)
##
## Regular Hit Super-Hit
## 1130 668
To see how the audio features differ between “Regular Hits” and “Super-Hits,” we can plot their distributions. The plots below show that “Super-Hits” tend to have higher average values for danceability, energy, and valence which aligns with our initial hypothesis.
features <- c("danceability", "energy", "loudness", "speechiness",
"acousticness", "duration_ms", "liveness", "valence", "tempo")
songs_long <- songs %>%
pivot_longer(cols = all_of(features), names_to = "feature", values_to = "value")
ggplot(songs_long, aes(x = value, fill = hit_class)) +
geom_density(alpha = 0.4) +
facet_wrap(~feature, scales = "free") +
labs(
title = "Super-Hits vs Regular Hits: Audio Feature Distributions",
subtitle = "Density plots show the distribution of each audio feature for the two hit classes.",
fill = "Hit Class"
) +
theme_fivethirtyeight() +
scale_fill_manual(values = c("Regular Hit" = "#30a2da", "Super-Hit" = "#fc4f30"))
Are the visual differences statistically significant? We will run a t-test for each feature to compare the means between the two groups. A p-value less than 0.05 suggests a significant difference.
t_test_results <- map_df(features, function(feat) {
t_out <- t.test(songs[[feat]] ~ songs$hit_class)
tibble(
feature = feat,
mean_superhit = mean(songs %>% filter(hit_class == "Super-Hit") %>% pull(.data[[feat]]), na.rm = TRUE),
mean_regular = mean(songs %>% filter(hit_class == "Regular Hit") %>% pull(.data[[feat]]), na.rm = TRUE),
p_value = t_out$p.value
)
}) %>%
mutate(
significance = ifelse(p_value < 0.05, "Significant", "Not Significant")
)
print(t_test_results)
## # A tibble: 9 × 5
## feature mean_superhit mean_regular p_value significance
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 danceability 0.733 0.627 4.66e- 66 Significant
## 2 energy 0.765 0.608 8.74e-110 Significant
## 3 loudness -5.13 -6.56 1.94e- 46 Significant
## 4 speechiness 0.100 0.0998 9.18e- 1 Not Significant
## 5 acousticness 0.103 0.197 1.18e- 27 Significant
## 6 duration_ms 215413. 222458. 2.76e- 4 Significant
## 7 liveness 0.188 0.166 1.29e- 3 Significant
## 8 valence 0.729 0.389 4.88e-318 Significant
## 9 tempo 122. 122. 7.30e- 1 Not Significant
The results confirm that features like danceability, energy, loudness, valence and acousticness show statistically significant differences between our two classes.
Now, let’s build a model to predict whether a song will be a “Super-Hit” based on its audio features. A logistic regression model is a great choice for this binary classification task.
First, we split the data into a training set (70%) and testing set (30%) to ensure our model’s performance is evaluated on unseen data.
set.seed(123) # for reproducibility
split <- initial_split(songs, prop = 0.7, strata = hit_class)
train_data <- training(split)
test_data <- testing(split)
We’ll use the features that appeared most significant in our EDA.
model <- glm(hit_class ~ danceability + energy + loudness + valence + tempo,
data = train_data, family = binomial)
summary(model)
##
## Call:
## glm(formula = hit_class ~ danceability + energy + loudness +
## valence + tempo, family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.240e+04 1.897e+05 -0.065 0.948
## danceability 6.091e+03 9.367e+04 0.065 0.948
## energy 6.234e+03 9.512e+04 0.066 0.948
## loudness -1.122e+01 1.268e+03 -0.009 0.993
## valence 6.149e+03 9.650e+04 0.064 0.949
## tempo 7.006e-02 1.762e+01 0.004 0.997
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.6596e+03 on 1257 degrees of freedom
## Residual deviance: 1.7547e-05 on 1252 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
To make the model’s coefficients more interpretable, we convert them to odds ratios. An odds ratio greater than 1 indicates that an increase in that feature increases the odds of a song being a “Super-Hit.”
coef_df <- tidy(model, exponentiate = TRUE, conf.int = TRUE) %>%
filter(term != "(Intercept)")
ggplot(coef_df, aes(x = reorder(term, estimate), y = estimate)) +
geom_point(size = 3, color = "#fc4f30") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
geom_hline(yintercept = 1, linetype = "dashed") +
coord_flip() +
labs(
title = "Odds Ratios for Super-Hit Prediction",
subtitle = "Features with odds ratios > 1 increase the likelihood of being a Super-Hit",
y = "Odds Ratio", x = "Feature"
) +
theme_fivethirtyeight()
As expected danceability, energy, and valence are
strong positive predictors.
Finally, we evaluate the model’s performance on the test data using a confusion matrix and calculating its accuracy.
test_data <- test_data %>%
mutate(
pred_prob = predict(model, newdata = test_data, type = "response"),
pred_class = ifelse(pred_prob > 0.5, "Super-Hit", "Regular Hit")
)
# Confusion Matrix
cat("Confusion Matrix:\n")
## Confusion Matrix:
table(Predicted = test_data$pred_class, Actual = test_data$hit_class)
## Actual
## Predicted Regular Hit Super-Hit
## Regular Hit 339 3
## Super-Hit 0 198
# Accuracy
accuracy <- mean(test_data$pred_class == test_data$hit_class)
cat("\nModel Accuracy:", round(accuracy, 3), "\n")
##
## Model Accuracy: 0.994
The model performs well, correctly classifying a high percentage of songs on the unseen test data.
Have the characteristics of hit songs changed over the years? Let’s
plot the average danceability, energy and
valence for top songs by year.
songs %>%
group_by(year) %>%
summarise(
avg_dance = mean(danceability, na.rm = TRUE),
avg_energy = mean(energy, na.rm = TRUE),
avg_valence = mean(valence, na.rm = TRUE)
) %>%
ggplot(aes(x = year)) +
geom_line(aes(y = avg_dance, color = "Danceability")) +
geom_line(aes(y = avg_energy, color = "Energy")) +
geom_line(aes(y = avg_valence, color = "Valence")) +
labs(
title = "Trends in Hit Song Audio Features Over Time",
subtitle = "Modern pop music appears more danceable and energetic.",
y = "Average Feature Value (0-1)", x = "Year", color = "Feature"
) +
theme_fivethirtyeight()
The plot shows a clear upward trend in the average danceability and
energy of hit songs over the last several decades.
This analysis confirms that audio features provide significant insight into what makes a song a commercial success.
In conclusion, while musical taste is subjective, the data shows that a quantifiable formula for a hit song—one that is upbeat, positive, and easy to dance to—is a powerful force in the music industry.