1 Background & Problem

  • Project Goal: Identify which audio characteristics influence the popularity of tracks on Spotify.
  • Dataset: 114,000 songs from the Spotify Tracks dataset on Kaggle.
  • Approach: Conduct exploratory data analysis (EDA) and build a basic linear regression model using two features: danceability and energy.

2 Data Loading & Setup

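suppressPackageStartupMessages({
  library(readxl)      # read_excel()
  library(tidyverse)   # dplyr, ggplot2, etc.
  library(caret)       # machine-learning helpers (createDataPartition)
  library(GGally)      # ggplot2 extensions such as ggpairs()
  library(conflicted)  # turn silent name masking into explicit choices
})

# Use the dplyr versions of functions that are also defined in stats
conflict_prefer("filter", "dplyr")
conflict_prefer("lag",    "dplyr")

# The sheet has an unnamed first column, which readxl renames to `...1`
dataset <- read_excel("~/Downloads/dataset.xlsm")
## New names:
## • `` -> `...1`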

3 Data Cleaning & Wrangling

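# Keep only the numeric audio features: drop the unnamed column and the
# text identifiers, convert logical flags to 0/1, then drop incomplete rows
df <- dataset %>%
  select(-any_of(c("...1", "track_id", "artists",
                   "album_name", "track_name", "track_genre"))) %>%
  mutate(across(where(is.logical), as.numeric)) %>%
  drop_na()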

glimpse(df)
## Rows: 114,000
## Columns: 15
## $ popularity       <dbl> 73, 55, 57, 71, 82, 58, 74, 80, 74, 56, 74, 69, 52, 6…
## $ duration_ms      <dbl> 230666, 149610, 210826, 201933, 198853, 214240, 22940…
## $ explicit         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ danceability     <dbl> 0.676, 0.420, 0.438, 0.266, 0.618, 0.688, 0.407, 0.70…
## $ energy           <dbl> 0.4610, 0.1660, 0.3590, 0.0596, 0.4430, 0.4810, 0.147…
## $ key              <dbl> 1, 1, 0, 0, 2, 6, 2, 11, 0, 1, 8, 4, 7, 3, 2, 4, 2, 1…
## $ loudness         <dbl> -6.746, -17.235, -9.734, -18.515, -9.681, -8.807, -8.…
## $ mode             <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
## $ speechiness      <dbl> 0.1430, 0.0763, 0.0557, 0.0363, 0.0526, 0.1050, 0.035…
## $ acousticness     <dbl> 0.0322, 0.9240, 0.2100, 0.9050, 0.4690, 0.2890, 0.857…
## $ instrumentalness <dbl> 1.01e-06, 5.56e-06, 0.00e+00, 7.07e-05, 0.00e+00, 0.0…
## $ liveness         <dbl> 0.3580, 0.1010, 0.1170, 0.1320, 0.0829, 0.1890, 0.091…
## $ valence          <dbl> 0.7150, 0.2670, 0.1200, 0.1430, 0.1670, 0.6660, 0.076…
## $ tempo            <dbl> 87.917, 77.489, 76.332, 181.740, 119.949, 98.017, 141…
## $ time_signature   <dbl> 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4,…

Track Popularity Distribution

ggplot(df, aes(popularity)) +
  geom_histogram(binwidth = 3, fill = "steelblue") +
  labs(title = "Distribution of Track Popularity",
       x = "Popularity (0 – 100)", y = "Count")

Popularity vs. Explicit Content
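
One way to draw this comparison is a boxplot of popularity for each value of the explicit flag. The sketch below assumes `df` from the cleaning step, with `explicit` coded 0/1. The stand-in data frame and the labels "Clean" and "Explicit" are illustrative assumptions, not part of the original analysis.

```r
library(ggplot2)

# Stand-in data so the sketch runs on its own; skipped when a `df` with the
# needed columns already exists. (get0() is used because base R also has a
# function called df.)
if (!all(c("popularity", "explicit") %in% names(get0("df")))) {
  set.seed(1)
  df <- data.frame(
    popularity = round(runif(200, 0, 100)),
    explicit   = rbinom(200, 1, 0.1)
  )
}

# Treat the 0/1 flag as a category so each group gets its own box
p_explicit <- ggplot(df, aes(x = factor(explicit, levels = c(0, 1),
                                        labels = c("Clean", "Explicit")),
                             y = popularity)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(title = "Popularity vs. Explicit Content",
       x = NULL, y = "Popularity (0 – 100)")
p_explicit
```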

Energy Distribution
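
A histogram in the same style as the popularity plot is one plausible way to show this distribution. The sketch assumes `df` from the cleaning step. The bin width of 0.05 and the simulated stand-in values are assumptions chosen for illustration.

```r
library(ggplot2)

# Stand-in data so the sketch runs on its own; skipped when a `df` with an
# energy column already exists.
if (!("energy" %in% names(get0("df")))) {
  set.seed(1)
  df <- data.frame(energy = rbeta(200, 2, 1.5))  # simulated values in [0, 1]
}

# Energy is bounded in [0, 1], so bins start at 0 and are 0.05 wide
p_energy <- ggplot(df, aes(energy)) +
  geom_histogram(binwidth = 0.05, boundary = 0, fill = "steelblue") +
  labs(title = "Distribution of Energy", x = "Energy (0 – 1)", y = "Count")
p_energy
```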

Data Splitting for Modeling

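set.seed(123)  # make the split reproducible

# 70/30 train/test split; for a numeric outcome createDataPartition()
# samples within quantile groups of popularity
idx   <- createDataPartition(df$popularity, p = 0.7, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]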
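
The evaluation code later uses an object named `model`. A minimal sketch of the fitting step follows, assuming a plain `lm()` call on the training set with the two features named in the approach. The simulated stand-in for `train` is an assumption so the block runs on its own, and its numbers will not match the reported results.

```r
# Stand-in training data; skipped when `train` from the split is available.
# (get0() is used because caret also exports a function called train.)
if (!all(c("popularity", "danceability", "energy") %in% names(get0("train")))) {
  set.seed(123)
  train <- data.frame(
    popularity   = round(runif(300, 0, 100)),
    danceability = runif(300),
    energy       = runif(300)
  )
}

# Ordinary least squares: popularity explained by danceability and energy
model <- lm(popularity ~ danceability + energy, data = train)

# Coefficient table: estimate, standard error, t value, p-value
fit <- summary(model)
round(fit$coefficients, 3)

# Overall fit statistics
round(c(r2 = fit$r.squared, adj_r2 = fit$adj.r.squared, sigma = fit$sigma), 3)
```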

Linear Model Output

Term           Estimate   SE      t-value   p
(Intercept)    30.501     0.319   95.716    0.000
danceability    4.900     0.460   10.643    0.000
energy         -0.055     0.317   -0.175    0.861
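
To see what these coefficients imply, they can be plugged into the regression equation by hand. The sketch below uses the rounded estimates from the table and the first track shown in the `glimpse()` output. The helper `predict_popularity()` is a hypothetical name introduced here for illustration.

```r
# Coefficients copied from the table above (rounded to three decimals)
b <- c(intercept = 30.501, danceability = 4.900, energy = -0.055)

# Hypothetical helper: evaluate the fitted equation for given feature values
predict_popularity <- function(danceability, energy) {
  unname(b["intercept"] + b["danceability"] * danceability + b["energy"] * energy)
}

# First track in the glimpse() output: danceability 0.676, energy 0.461,
# observed popularity 73
predict_popularity(0.676, 0.461)

# Both features are bounded in [0, 1], so predictions are confined to a
# narrow band between these two extremes
predict_popularity(0, 1)  # lowest possible prediction
predict_popularity(1, 0)  # highest possible prediction
```

The first track is predicted at roughly 33.8 against an observed popularity of 73, and no combination of the two features can move the prediction outside the range of about 30.4 to 35.4.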

Model Performance Summary

R²       Adj R²   σ (resid SE)   F-test p
0.001    0.001    22.304         0

Test-set RMSE: 22.26

Final Evaluation Metrics

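pred <- predict(model, test)                    # predictions on the held-out 30%
rmse <- sqrt(mean((pred - test$popularity)^2))  # root mean squared error
r2   <- cor(pred, test$popularity)^2            # squared correlation, predicted vs. actual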

cat("RMSE:", round(rmse, 2), "\n")
## RMSE: 22.26
cat("R-squared:", round(r2, 2), "\n")
## R-squared: 0
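
An RMSE is easier to judge next to a baseline. The sketch below is an optional extension, not part of the original analysis. It compares the model's RMSE with that of always predicting the training-set mean, and computes R-squared as 1 - SSE/SST, which, unlike a squared correlation, can go negative when the model is worse than the baseline. The helper `regression_metrics()` and the toy vectors are hypothetical.

```r
# Hypothetical helper: test-set metrics relative to a mean-only baseline
regression_metrics <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)   # model error
  sst <- sum((actual - train_mean)^2)  # baseline error
  c(rmse          = sqrt(mean((actual - predicted)^2)),
    baseline_rmse = sqrt(mean((actual - train_mean)^2)),
    r2            = 1 - sse / sst)
}

# Toy values to show the call; not taken from the Spotify data
toy_actual <- c(10, 20, 30, 40)
toy_pred   <- c(12, 18, 33, 37)
m <- regression_metrics(toy_actual, toy_pred, train_mean = 25)
round(m, 3)
```

With the objects defined earlier, the equivalent call would be `regression_metrics(test$popularity, pred, mean(train$popularity))`.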

Conclusion & Takeaways

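  • Danceability has a small but statistically significant positive association with popularity (estimate 4.90, p < 0.001).
  • Energy did not significantly predict popularity in our model (estimate -0.055, p = 0.861).
  • Together the two features explain only about 0.1% of the variance in popularity (adjusted R² = 0.001, test-set RMSE 22.26), so the model has almost no predictive power.
  • Spotify popularity is likely driven by external factors not captured here, such as artist fame, marketing, and playlist placement.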
  • This project helped apply real-world EDA, modeling, and communication skills using R.