Spotify Music Data Analysis

Exploring Popularity, Genre Trends & Audio Features

Author

Aditya Suresh

Published

April 17, 2026

Introduction

About the Dataset

The Ultimate Spotify Tracks DB (Kaggle) is a collection of music track metadata sourced from the Spotify API. As loaded, it contains 232,725 rows and 18 columns, combining audio features and a popularity score for tracks spanning multiple genres and artists.

Key variables include:

  • popularity – Track popularity score (0–100, Spotify-assigned)
  • genre – Musical genre category
  • artist_name – Name of the performing artist
  • track_name – Title of the track
  • energy – Perceptual measure of intensity (0.0–1.0)
  • danceability – How suitable a track is for dancing (0.0–1.0)
  • acousticness, valence, tempo, loudness, and more
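
A quick range check can confirm that the columns match the scales listed above. The sketch below runs on a small invented data frame; `toy`, `limits`, and the helper `check_ranges` are illustrative names that do not appear in the report. With the real data, `toy` would be replaced by `spotify`.

```r
# Toy stand-in for the real data; values are invented for illustration.
toy <- data.frame(
  popularity   = c(0, 35, 72, 100),
  energy       = c(0.10, 0.55, 0.91, 0.73),
  danceability = c(0.39, 0.59, 0.66, 0.24)
)

# Hypothetical helper: TRUE per column if every non-missing value lies in [lo, hi].
check_ranges <- function(df, limits) {
  vapply(names(limits), function(col) {
    x <- df[[col]]
    all(x >= limits[[col]][1] & x <= limits[[col]][2], na.rm = TRUE)
  }, logical(1))
}

# Documented scales: popularity 0-100, audio features 0.0-1.0.
limits <- list(
  popularity   = c(0, 100),
  energy       = c(0, 1),
  danceability = c(0, 1)
)

range_ok <- check_ranges(toy, limits)
print(range_ok)
```

A `FALSE` for any column would point to values outside the documented scale and is worth investigating before plotting.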

Objectives

This report aims to:

  1. Understand the distribution of track popularity across the dataset
  2. Compare popularity across genres to identify top-performing categories
  3. Explore the relationship between energy and danceability
  4. Use faceted plots to reveal patterns across multiple genres

Setup & Library Loading

# Install required packages (run once if not already installed)
# install.packages(c("ggplot2", "dplyr", "tidyr", "readr", "scales", "ggthemes"))

library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(scales)
library(ggthemes)

Data Loading & Preprocessing

# file.choose() opens an interactive file picker; swap in a fixed path for non-interactive rendering
spotify <- read.csv(file.choose(), stringsAsFactors = FALSE)
colnames(spotify)
 [1] "genre"            "artist_name"      "track_name"       "track_id"        
 [5] "popularity"       "acousticness"     "danceability"     "duration_ms"     
 [9] "energy"           "instrumentalness" "key"              "liveness"        
[13] "loudness"         "mode"             "speechiness"      "tempo"           
[17] "time_signature"   "valence"         
# Fix column names
if ("track_popularity" %in% colnames(spotify)) {
  spotify <- spotify %>% rename(popularity = track_popularity)
}

if ("playlist_genre" %in% colnames(spotify)) {
  spotify <- spotify %>% rename(genre = playlist_genre)
}

# Inspect
colnames(spotify)
 [1] "genre"            "artist_name"      "track_name"       "track_id"        
 [5] "popularity"       "acousticness"     "danceability"     "duration_ms"     
 [9] "energy"           "instrumentalness" "key"              "liveness"        
[13] "loudness"         "mode"             "speechiness"      "tempo"           
[17] "time_signature"   "valence"         
str(spotify)
'data.frame':   232725 obs. of  18 variables:
 $ genre           : chr  "Movie" "Movie" "Movie" "Movie" ...
 $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
 $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
 $ track_id        : chr  "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
 $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
 $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
 $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
 $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
 $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
 $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
 $ key             : chr  "C#" "F#" "C" "C#" ...
 $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
 $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
 $ mode            : chr  "Major" "Minor" "Minor" "Major" ...
 $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
 $ tempo           : num  167 174 99.5 171.8 140.6 ...
 $ time_signature  : chr  "4/4" "4/4" "5/4" "4/4" ...
 $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
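
The conditional renames above can also be expressed as a single lookup, which may be easier to extend. This is a sketch under assumptions rather than the report's method: it relies on dplyr's `rename()` accepting a named character vector wrapped in `any_of()`, and the data frame `toy` and the vector `lookup` are invented for illustration.

```r
library(dplyr)

# Toy frame using the alternative column names; values are invented.
toy <- data.frame(
  track_popularity = c(10, 80),
  playlist_genre   = c("pop", "rock"),
  energy           = c(0.4, 0.9)
)

# Format is new_name = "old_name". any_of() skips entries whose old name is absent.
lookup <- c(
  popularity  = "track_popularity",
  genre       = "playlist_genre",
  artist_name = "artists"          # not present in toy, so it is ignored
)

toy_renamed <- toy %>% rename(any_of(lookup))
print(colnames(toy_renamed))
```

Because absent columns are skipped silently, the same lookup can be applied to files that already use the standard names.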

# ── Data Cleaning ──────────────────────────────────────────────────

# 1. Remove duplicate rows
spotify <- spotify %>% distinct()

# 2. Drop rows with missing values in key columns
spotify <- spotify %>%
  filter(!is.na(popularity), !is.na(energy), !is.na(danceability))

# 3. Rename columns to standardised names (adapt if yours differ)
#    Common alternatives: genre_name → genre, artists → artist_name
if ("genre_name" %in% colnames(spotify)) {
  spotify <- spotify %>% rename(genre = genre_name)
}
if ("artists" %in% colnames(spotify)) {
  spotify <- spotify %>% rename(artist_name = artists)
}

# 4. Convert popularity to numeric (safety)
spotify$popularity <- as.numeric(spotify$popularity)

# 5. Add a popularity category for grouping
spotify <- spotify %>%
  mutate(popularity_group = case_when(
    popularity >= 70 ~ "High (70–100)",
    popularity >= 40 ~ "Medium (40–69)",
    TRUE             ~ "Low (0–39)"
  ))

# 6. Identify the top 10 genres by track count
top10_genres <- spotify %>%
  count(genre, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(genre)

# 7. Filter dataset to top 10 genres
spotify_top <- spotify %>%
  filter(genre %in% top10_genres) %>%
  mutate(genre = factor(genre, levels = top10_genres))

cat("Dataset:", nrow(spotify_top), "rows across top 10 genres\n")
cat("Genres included:", paste(top10_genres, collapse = ", "), "\n")
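
The banding and top-genre steps are easiest to verify on a handful of rows. The following sketch uses an invented data frame (`toy`, `top_genres`, and `toy_top` are illustrative names) and keeps the two most frequent genres as a stand-in for the top 10.

```r
library(dplyr)

# Invented rows chosen to sit on and around the band boundaries.
toy <- data.frame(
  genre      = c("Pop", "Pop", "Pop", "Jazz", "Jazz", "Folk"),
  popularity = c(85, 40, 39, 70, 12, 69)
)

# case_when() uses the first condition that matches, so the tests are
# ordered from the highest threshold down.
toy <- toy %>%
  mutate(popularity_group = case_when(
    popularity >= 70 ~ "High (70–100)",
    popularity >= 40 ~ "Medium (40–69)",
    TRUE             ~ "Low (0–39)"
  ))

# Keep only the two most frequent genres (stand-in for the top 10).
top_genres <- toy %>%
  count(genre, sort = TRUE) %>%
  slice_head(n = 2) %>%
  pull(genre)
toy_top <- toy %>% filter(genre %in% top_genres)

print(toy$popularity_group)
print(top_genres)
```

Scores of exactly 70 and 40 fall into the higher band because the comparisons use `>=`.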


Visualisation 1: Histogram of Track Popularity


::: {.cell}

```{.r .cell-code}
ggplot(spotify, aes(x = popularity)) +
  geom_histogram(
    binwidth  = 5,
    fill      = "#1DB954",   # Spotify green
    color     = "white",
    alpha     = 0.85
  ) +
  geom_vline(
    aes(xintercept = mean(popularity, na.rm = TRUE)),
    colour    = "#E94560",
    linetype  = "dashed",
    linewidth = 0.8
  ) +
  annotate(
    "text",
    x     = mean(spotify$popularity, na.rm = TRUE) + 3,
    y     = Inf,
    label = paste0("Mean = ", round(mean(spotify$popularity, na.rm = TRUE), 1)),
    vjust = 1.5,
    hjust = 0,
    color = "#E94560",
    size  = 3.5
  ) +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Distribution of Spotify Track Popularity",
    subtitle = "Each bar represents a 5-point popularity band (0–100 scale)",
    x        = "Popularity Score",
    y        = "Number of Tracks",
    caption  = "Source: Kaggle – Ultimate Spotify Tracks DB"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey50"),
    panel.grid.minor = element_blank()
  )
```

Distribution of track popularity scores across the Spotify dataset. The histogram reveals whether most tracks are obscure, moderately popular, or viral hits.

:::

Interpretation

The histogram reveals that track popularity follows a right-skewed distribution. A large proportion of tracks cluster near zero, indicating that most songs on Spotify receive very little listener engagement. The long tail toward higher scores (70–100) represents genuinely popular or viral tracks — a much rarer occurrence. The dashed red line marks the mean popularity, which lies in the low-to-medium range, confirming that the average track is relatively obscure. This is consistent with the “long tail” economics of music streaming, where a small number of tracks dominate listener attention.
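
The skew described here can be checked numerically: in a right-skewed distribution the mean sits above the median. A minimal sketch on invented scores follows; the vector `pop_scores` is hypothetical, and with the real data it would be `spotify$popularity`.

```r
# Invented popularity scores with most values near zero and a few high ones.
pop_scores <- c(0, 0, 1, 2, 3, 5, 8, 12, 20, 35, 50, 72, 90)

skew_summary <- c(
  mean        = mean(pop_scores),
  median      = median(pop_scores),
  share_lt_40 = mean(pop_scores < 40),   # proportion of tracks scoring below 40
  share_ge_70 = mean(pop_scores >= 70)   # proportion of tracks scoring 70 or more
)
print(round(skew_summary, 2))
```

A mean well above the median, together with a large share of scores below 40, is the numeric counterpart of the long upper tail in the histogram.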


Visualisation 2: Box Plot — Popularity by Genre

# Identify the top 10 genres by track count
top10_genres <- spotify %>%
  count(genre, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(genre)

# Keep only tracks from those genres
spotify_top <- spotify %>%
  filter(genre %in% top10_genres)
# Compute median popularity per genre for ordering
genre_medians <- spotify_top %>%
  group_by(genre) %>%
  summarise(med = median(popularity, na.rm = TRUE)) %>%
  arrange(desc(med))

spotify_top <- spotify_top %>%
  mutate(genre = factor(genre, levels = genre_medians$genre))

ggplot(spotify_top, aes(x = genre, y = popularity, fill = genre)) +
  geom_boxplot(
    outlier.shape  = 21,
    outlier.size   = 1.2,
    outlier.alpha  = 0.4,
    width          = 0.6,
    color          = "grey30"
  ) +
  scale_fill_brewer(palette = "Set3") +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  labs(
    title    = "Track Popularity by Genre (Top 10 Genres)",
    subtitle = "Genres ordered by median popularity — higher medians indicate stronger overall performance",
    x        = "Genre",
    y        = "Popularity Score (0–100)",
    caption  = "Source: Kaggle – Ultimate Spotify Tracks DB"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "grey50"),
    legend.position = "none",
    axis.text.x  = element_text(angle = 35, hjust = 1, size = 10),
    panel.grid.minor = element_blank()
  )

Box plots comparing popularity score distributions across the top 10 genres. The box shows the interquartile range; the line inside is the median; dots are outliers.

Interpretation

The box plots show considerable variation in popularity across genres. Genres are ordered left to right by descending median, so those on the left have the highest typical scores; they also tend to have more compact distributions, suggesting more consistently well-received tracks. Wide boxes indicate high variability: some tracks are hits, others are not. Genres with many outlier dots above the upper whisker contain a handful of breakout tracks that sit well above the genre's typical range. This chart is useful for understanding which genres are commercially dominant on Spotify versus which are niche communities.
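
The quantities read off the box plot (median, box width, outlier dots) can also be tabulated directly. The sketch below uses two invented genres; `toy` and `box_stats` are illustrative names, and the outlier rule is the usual 1.5 × IQR convention that `geom_boxplot()` applies by default.

```r
library(dplyr)

# Invented scores: one compact genre, one with a single breakout track.
toy <- data.frame(
  genre      = rep(c("Pop", "Comedy"), each = 8),
  popularity = c(60, 62, 64, 66, 68, 70, 72, 74,
                 5, 10, 15, 18, 20, 22, 25, 90)
)

box_stats <- toy %>%
  group_by(genre) %>%
  summarise(
    median_pop = median(popularity),
    iqr_pop    = IQR(popularity),   # height of the box
    # Values above Q3 + 1.5 * IQR are the dots drawn above the upper whisker.
    n_upper_outliers = sum(
      popularity > quantile(popularity, 0.75) + 1.5 * IQR(popularity)
    )
  ) %>%
  arrange(desc(median_pop))

print(box_stats)
```

Sorting by `median_pop` reproduces the left-to-right ordering used on the x-axis.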


Visualisation 3: Scatter Plot — Energy vs Danceability

# Create popularity groups
spotify <- spotify %>%
  mutate(
    popularity_group = case_when(
      popularity >= 80 ~ "High",
      popularity >= 50 ~ "Medium",
      TRUE             ~ "Low"
    )
  )

# Sample data for better visualization
set.seed(42)
spotify_sample <- spotify %>% slice_sample(n = 600)

# Scatter plot
ggplot(spotify_sample, aes(x = energy, y = danceability, color = popularity_group)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Energy vs Danceability",
    x     = "Energy",
    y     = "Danceability",
    color = "Popularity Group"
  ) +
  theme_minimal()

Interpretation

The scatter plot suggests a moderate positive relationship between energy and danceability: tracks with higher energy tend to be more danceable, though there is substantial scatter. High-popularity tracks appear more concentrated in the mid-to-high range of both axes, suggesting that popular tracks are neither too calm nor too intense, while low-popularity tracks are spread more uniformly. The association looks non-linear: danceability rises with energy up to a point, then levels off at extreme intensities. Because the plot draws no fitted trend line, this levelling-off is a visual impression from a 600-track sample rather than a fitted result. It is nonetheless consistent with the idea that very aggressive or very mellow tracks are less universally appealing.
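
One way to put numbers and a curve behind this reading is to compute correlation coefficients and overlay a LOESS smoother. The sketch below is an illustration on simulated data, not a result from the Spotify file: `toy`, `r_pearson`, `r_spearman`, and `p` are invented names, and the simulated relationship is an assumption built into the example.

```r
library(ggplot2)

set.seed(42)
# Simulated data with a built-in positive, noisy relationship, clipped to [0, 1].
energy       <- runif(300)
danceability <- pmin(pmax(0.35 + 0.3 * energy + rnorm(300, sd = 0.12), 0), 1)
toy <- data.frame(energy, danceability)

# Pearson measures linear association; Spearman only assumes monotonicity.
r_pearson  <- cor(toy$energy, toy$danceability)
r_spearman <- cor(toy$energy, toy$danceability, method = "spearman")

# Overlay a LOESS curve to see whether the trend bends or flattens.
p <- ggplot(toy, aes(x = energy, y = danceability)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, color = "grey20") +
  theme_minimal()
print(p)
```

On the real sample, a clear gap between the Pearson and Spearman values, or a visibly bending curve, would support the non-linear reading.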


Visualisation 4: Faceted Plots by Genre

# Faceted plot by genre (clean version)

# Use only top genres to avoid overload
top_genres <- spotify %>%
  count(genre, sort = TRUE) %>%
  slice_head(n = 6) %>%
  pull(genre)

spotify_facet <- spotify %>%
  filter(genre %in% top_genres)

# Plot
ggplot(spotify_facet, aes(x = energy, y = danceability)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  facet_wrap(~ genre) +
  labs(
    title = "Energy vs Danceability Across Top Genres",
    x = "Energy",
    y = "Danceability"
  ) +
  theme_minimal()

Interpretation

The faceted view exposes genre-specific differences in the energy–danceability relationship. Some panels show a steep positive slope (high energy → high danceability), the pattern one would expect of dance-oriented styles such as electronic music. Other panels show flatter or even negative trends, reflecting styles in which energy intensity does not translate to rhythmic danceability. The panels do not encode popularity (all points share one colour), so they describe how audio features relate within each genre rather than which tracks audiences reward. Faceting is essential here because pooling all genres would mask these distinct patterns.
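
The per-panel slopes can be summarised in a small table instead of being judged by eye. The sketch below uses two simulated genres with deliberately different slopes; `toy`, `by_genre`, and the genre labels are invented, and the slope is the ordinary least-squares slope computed as covariance over variance.

```r
library(dplyr)

set.seed(1)
n <- 200
# Simulated genres: "Genre A" has a steep slope, "Genre B" is flat.
toy <- bind_rows(
  data.frame(genre = "Genre A", energy = runif(n)) %>%
    mutate(danceability = 0.3 + 0.5 * energy + rnorm(n, sd = 0.08)),
  data.frame(genre = "Genre B", energy = runif(n)) %>%
    mutate(danceability = 0.5 + rnorm(n, sd = 0.08))
)

by_genre <- toy %>%
  group_by(genre) %>%
  summarise(
    r     = cor(energy, danceability),
    # OLS slope of danceability on energy: cov(x, y) / var(x).
    slope = cov(energy, danceability) / var(energy),
    .groups = "drop"
  )

print(by_genre)
```

Applied to `spotify_facet`, the same summary would give one correlation and one slope per panel, which makes the genre differences directly comparable.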


Summary Statistics

# Summary table per genre
summary_tbl <- spotify_top %>%
  group_by(genre) %>%
  summarise(
    Tracks        = n(),
    Avg_Popularity = round(mean(popularity, na.rm = TRUE), 1),
    Median_Pop    = round(median(popularity, na.rm = TRUE), 1),
    Avg_Energy    = round(mean(energy, na.rm = TRUE), 2),
    Avg_Dance     = round(mean(danceability, na.rm = TRUE), 2)
  ) %>%
  arrange(desc(Avg_Popularity))

knitr::kable(
  summary_tbl,
  col.names = c("Genre", "Tracks", "Avg Popularity", "Median Pop",
                "Avg Energy", "Avg Danceability"),
  caption   = "Summary statistics for the top 10 genres"
)
Summary statistics for the top 10 genres

Genre             Tracks  Avg Popularity  Median Pop  Avg Energy  Avg Danceability
Pop                 9386            66.6          66        0.64              0.64
Rock                9272            59.6          59        0.68              0.54
Hip-Hop             9295            58.4          57        0.64              0.72
Children’s Music    9353            54.7          54        0.71              0.54
Indie               9543            54.7          54        0.58              0.57
Folk                9299            49.9          49        0.49              0.53
Jazz                9441            40.8          40        0.47              0.59
Electronic          9377            38.1          37        0.74              0.62
Soundtrack          9646            34.0          33        0.22              0.27
Comedy              9681            21.3          20        0.68              0.56

Final Insights & Conclusion

Key Findings

  1. Popularity is rare. The majority of Spotify tracks score below 40 on the popularity scale. Only a small elite of tracks achieve scores above 70, reflecting the highly competitive and winner-takes-all nature of music streaming.

  2. Genre matters. Popularity varies substantially by genre. Among the top 10 genres, commercially mainstream ones (Pop, Rock, Hip-Hop; medians 66, 59, 57) clearly outperform niche ones (Jazz, Soundtrack, Comedy; medians 40, 33, 20). However, niche genres occasionally produce standout hits visible as high outliers.

  3. Energy and danceability are positively correlated. Tracks that feel more intense also tend to feel more danceable — but this relationship is not universal. The correlation is strongest in dance-oriented genres and weakest in acoustic or classical styles.

  4. Genre shapes audio feature relationships. Faceted analysis reveals that the energy–danceability link operates very differently across genres. A one-size-fits-all model would be misleading.

Practical Implications

  • Artists & producers can use these insights to target specific popularity outcomes by aligning their genre and audio feature mix with patterns seen in high-scoring tracks.
  • Playlist curators can leverage genre popularity baselines to balance mainstream appeal with discovery content.
  • Streaming platforms can use similar models for recommendation systems and content acquisition strategies.

Report compiled using R (ggplot2, dplyr, tidyr) | Data: Ultimate Spotify Tracks DB — Kaggle