Spotify Track Popularity – A Basic Analysis

1 Background & Problem

Project Goal: Identify which audio characteristics influence the popularity of tracks on Spotify.
Dataset: 114,000 songs from the Spotify Tracks dataset on Kaggle.
Approach: Conduct exploratory data analysis (EDA) and build a basic linear regression model using two features: danceability and energy.

2 Data Loading & Setup

library(readxl)
dataset <- read_excel("~/Downloads/dataset.xlsm")

## New names:
## • `` -> `...1`

suppressPackageStartupMessages({
  library(tidyverse)   # dplyr, ggplot2, tidyr, etc.
  library(caret)       # createDataPartition() and other modeling helpers
  library(GGally)      # ggplot2 extensions such as pairwise plots
  library(conflicted)  # makes masked-function conflicts explicit errors
})

conflict_prefer("filter", "dplyr")
conflict_prefer("lag",    "dplyr")

3 Data Cleaning & Wrangling

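# Drop the row-index column and the text identifier columns.
df <- dataset %>%
  select(-any_of(c("...1", "track_id", "artists",
                   "album_name", "track_name", "track_genre"))) %>%
  mutate(across(where(is.logical), as.numeric)) %>%  # logical columns -> 0/1
  drop_na()                                          # remove rows with missing values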

glimpse(df)

## Rows: 114,000
## Columns: 15
## $ popularity       <dbl> 73, 55, 57, 71, 82, 58, 74, 80, 74, 56, 74, 69, 52, 6…
## $ duration_ms      <dbl> 230666, 149610, 210826, 201933, 198853, 214240, 22940…
## $ explicit         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ danceability     <dbl> 0.676, 0.420, 0.438, 0.266, 0.618, 0.688, 0.407, 0.70…
## $ energy           <dbl> 0.4610, 0.1660, 0.3590, 0.0596, 0.4430, 0.4810, 0.147…
## $ key              <dbl> 1, 1, 0, 0, 2, 6, 2, 11, 0, 1, 8, 4, 7, 3, 2, 4, 2, 1…
## $ loudness         <dbl> -6.746, -17.235, -9.734, -18.515, -9.681, -8.807, -8.…
## $ mode             <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
## $ speechiness      <dbl> 0.1430, 0.0763, 0.0557, 0.0363, 0.0526, 0.1050, 0.035…
## $ acousticness     <dbl> 0.0322, 0.9240, 0.2100, 0.9050, 0.4690, 0.2890, 0.857…
## $ instrumentalness <dbl> 1.01e-06, 5.56e-06, 0.00e+00, 7.07e-05, 0.00e+00, 0.0…
## $ liveness         <dbl> 0.3580, 0.1010, 0.1170, 0.1320, 0.0829, 0.1890, 0.091…
## $ valence          <dbl> 0.7150, 0.2670, 0.1200, 0.1430, 0.1670, 0.6660, 0.076…
## $ tempo            <dbl> 87.917, 77.489, 76.332, 181.740, 119.949, 98.017, 141…
## $ time_signature   <dbl> 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4,…

Track Popularity Distribution

ggplot(df, aes(popularity)) +
  geom_histogram(binwidth = 3, fill = "steelblue") +
  labs(title = "Distribution of Track Popularity",
       x = "Popularity (0 – 100)",
       y = "Count")

Popularity vs. Explicit Content
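
The figure for this section is not reproduced here. A minimal sketch of how such a comparison could be drawn, assuming a boxplot of popularity grouped by the 0/1 `explicit` flag; the helper name `plot_popularity_by_explicit` and the axis labels are hypothetical, and the original figure may have used a different chart type.

```r
library(ggplot2)

# Hypothetical helper: boxplot of popularity for non-explicit vs. explicit tracks.
# Expects a data frame with numeric columns `popularity` and `explicit` (0/1).
plot_popularity_by_explicit <- function(data) {
  ggplot(data, aes(x = factor(explicit, levels = c(0, 1), labels = c("No", "Yes")),
                   y = popularity)) +
    geom_boxplot(fill = "steelblue") +
    labs(title = "Popularity vs. Explicit Content",
         x = "Explicit",
         y = "Popularity (0 – 100)")
}
```

In this report it would be called as `plot_popularity_by_explicit(df)`.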

Energy Distribution
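
The figure for this section is not reproduced here either. A sketch of one way to draw it, assuming a histogram in the same style as the popularity plot; the helper name `plot_energy_distribution` and the default bin width of 0.05 are assumptions, not the original settings.

```r
library(ggplot2)

# Hypothetical helper: histogram of the `energy` feature, which ranges from 0 to 1.
plot_energy_distribution <- function(data, binwidth = 0.05) {
  ggplot(data, aes(x = energy)) +
    geom_histogram(binwidth = binwidth, boundary = 0, fill = "steelblue") +
    labs(title = "Distribution of Energy",
         x = "Energy (0 – 1)",
         y = "Count")
}
```

In this report it would be called as `plot_energy_distribution(df)`.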

Data Splitting for Modeling

library(caret)
set.seed(123)
idx  <- createDataPartition(df$popularity, p = 0.7, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]

Linear Model Output
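
The chunk that fits the model is not shown. Judging from the terms in the table, it was presumably an ordinary least-squares fit of popularity on danceability and energy using the training split. A minimal sketch under that assumption; the helper names `fit_popularity_model` and `coef_table` are hypothetical.

```r
# Hypothetical helper: fit the two-feature linear model on the training data.
fit_popularity_model <- function(train_data) {
  lm(popularity ~ danceability + energy, data = train_data)
}

# Hypothetical helper: coefficient table with columns
# Estimate, Std. Error, t value, Pr(>|t|), rounded to 3 decimals.
coef_table <- function(model) {
  round(coef(summary(model)), 3)
}
```

With the split above, `model <- fit_popularity_model(train)` would create the `model` object used later for prediction, and `coef_table(model)` would give a table of the kind shown here.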

Term	Estimate	Std. Error	t-value	p-value
(Intercept)	30.501	0.319	95.716	< 0.001
danceability	4.900	0.460	10.643	< 0.001
energy	-0.055	0.317	-0.175	0.861

Model Performance Summary

R²	Adj R²	Residual SE (σ)	F-test p-value
0.001	0.001	22.304	< 0.001
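
These statistics can all be read off the `summary()` of a fitted `lm` object. A sketch of how they might be collected, with the hypothetical helper name `model_fit_stats`; the F-test p-value is not stored directly, so it is computed from the F statistic and its degrees of freedom.

```r
# Hypothetical helper: collect R², adjusted R², residual SE, and the F-test p-value.
model_fit_stats <- function(model) {
  s <- summary(model)
  f <- s$fstatistic  # named vector: value, numdf, dendf
  c(
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    sigma         = s$sigma,
    f_p_value     = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
  )
}
```

A very small p-value rounds to 0 when printed with a few decimals; base R's `format.pval()` with its `eps` argument is one way to report it as falling below a threshold instead.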

Test-set RMSE: 22.26

Final Evaluation Metrics

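# `model` is the linear model fitted on the training split.
pred   <- predict(model, newdata = test)          # predictions for the held-out tracks
rmse   <- sqrt(mean((pred - test$popularity)^2))  # root mean squared error
r2     <- cor(pred, test$popularity)^2            # squared correlation, predicted vs. actual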

cat("RMSE:", round(rmse, 2), "\n")

## RMSE: 22.26

cat("R-squared:", round(r2, 2), "\n")

## R-squared: 0
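
To put the RMSE in context, it can help to compare it with a baseline that ignores the features entirely and predicts the training-set mean for every test track. A small sketch of that comparison; the helper names `rmse_of` and `baseline_rmse` are hypothetical and this check is not part of the original analysis.

```r
# Root mean squared error between actual and predicted values.
rmse_of <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical helper: RMSE of a mean-only baseline, using the training mean
# of `popularity` as the prediction for every row of the test data.
baseline_rmse <- function(train_data, test_data) {
  rmse_of(test_data$popularity, mean(train_data$popularity))
}
```

If `baseline_rmse(train, test)` comes out close to the model's RMSE, the two features add little beyond predicting the average.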

Conclusion & Takeaways

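Danceability has a statistically significant positive association with popularity (estimate 4.900, p < 0.001), but the effect is small in practice: the model explains only about 0.1% of the variance in popularity (R² = 0.001).
Energy did not significantly predict popularity in our model (estimate -0.055, p = 0.861).
With a test-set RMSE of 22.26 on a 0 to 100 scale, the two-feature model has little predictive value.
Spotify popularity is likely driven by external factors not captured here, such as artist fame, marketing, and playlist placement.
This project was an exercise in applying EDA, modeling, and communication skills in R.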