For this project, I’m exploring a dataset of songs from Spotify to see how different musical features relate to a song’s popularity. The dataset includes both numbers (like tempo and popularity score) and categories (like genre and song name), making it great for visual analysis.
I’ll focus on two main questions: Do more danceable or faster songs tend to be more popular? And do certain genres consistently perform better than others? To answer these, I’ll look at the relationships between danceability, tempo, popularity, and genre.
Source: Spotify.com
Load the Dataset
library(tidyverse)
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl (1): explicit
# A tibble: 59 × 5
   genre                         avg_popularity avg_danceability avg_tempo count
   <chr>                                  <dbl>            <dbl>     <dbl> <int>
 1 pop, rock, Folk/Acoustic                79              0.509     108.      2
 2 Folk/Acoustic, pop                      78              0.563     112.      2
 3 rock, pop, metal, Dance/Elec…           76              0.507     105.      1
 4 hip hop, rock, pop                      75              0.526      90.1     2
 5 easy listening                          72              0.417     158.      1
 6 hip hop, latin, Dance/Electr…           72              0.767     172.      1
 7 metal                                   72              0.473     106.      9
 8 rock, Folk/Acoustic, pop                71              0.556      80.5     1
 9 rock, metal                             70.9            0.537     126.     38
10 World/Traditional, Folk/Acou…           69              0.418      82.8     1
# ℹ 49 more rows
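The code that produced the summary table above is not shown. A minimal sketch of how a table with these columns could be built is given below. The helper name `summarise_by_genre`, the toy rows, and the file name `songs.csv` are assumptions made for illustration; only the column names come from the output above.

```r
library(dplyr)

# Hypothetical helper: one row per genre label with feature averages and a
# song count, sorted from highest to lowest average popularity.
summarise_by_genre <- function(songs) {
  songs |>
    group_by(genre) |>
    summarize(
      avg_popularity   = mean(popularity, na.rm = TRUE),
      avg_danceability = mean(danceability, na.rm = TRUE),
      avg_tempo        = mean(tempo, na.rm = TRUE),
      count            = n(),
      .groups = "drop"
    ) |>
    arrange(desc(avg_popularity))
}

# Toy rows, invented purely to show the shape of the result (not Spotify data).
toy_songs <- data.frame(
  genre        = c("pop", "pop", "rock"),
  popularity   = c(80, 60, 50),
  danceability = c(0.8, 0.6, 0.5),
  tempo        = c(120, 100, 140)
)

toy_summary <- summarise_by_genre(toy_songs)
print(toy_summary)

# With the real data, assuming it is saved as "songs.csv" (hypothetical name):
# songs <- readr::read_csv("songs.csv")
# songs_summary <- summarise_by_genre(songs)
```

Because `genre` is grouped as a raw string, a label such as "pop, rock, Folk/Acoustic" counts as its own genre, which is why the table has 59 rows and many groups contain only one or two songs.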
Data Visualisation
How danceability and tempo relate to popularity across genres.
library(ggplot2)

ggplot(songs_summary, aes(x = avg_danceability, y = avg_tempo, color = genre, size = avg_popularity)) +
  geom_point(alpha = 0.8) +
  labs(
    title = "Average Danceability vs. Tempo by Genre",
    subtitle = "Point size shows average popularity; color shows genre",
    x = "Average Danceability",
    y = "Average Tempo (BPM)",
    color = "Genre",
    size = "Avg Popularity",
    caption = "Source: Spotify dataset via Spotify Web API"
  ) +
  scale_color_viridis_d(option = "plasma") +
  theme_minimal()
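The scatter plot gives a visual impression of the first question (whether more danceable or faster songs tend to be more popular). A quick numeric check could complement it. The sketch below computes Pearson correlations with base R; the helper name `popularity_correlations` and the toy rows are assumptions for illustration, and the real call would use the song-level data rather than the toy frame.

```r
# Toy rows, invented for illustration only (not Spotify data).
toy_songs <- data.frame(
  popularity   = c(40, 50, 60, 70, 80),
  danceability = c(0.40, 0.50, 0.55, 0.70, 0.80),
  tempo        = c(150, 95, 128, 110, 100)
)

# Hypothetical helper: Pearson correlation of each feature with popularity,
# ignoring rows where either value is missing.
popularity_correlations <- function(songs, features = c("danceability", "tempo")) {
  sapply(features, function(f) {
    cor(songs[[f]], songs$popularity, use = "complete.obs")
  })
}

toy_cor <- popularity_correlations(toy_songs)
print(round(toy_cor, 2))

# With the real data (assuming a data frame named `songs`):
# popularity_correlations(songs)
```

A correlation near zero would suggest that the feature on its own says little about popularity, while values closer to 1 or -1 would indicate a stronger linear relationship. It does not show causation in either case.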
Average Popularity by Genre
library(ggplot2)

genre_popularity <- songs |>
  filter(
    !is.na(genre),
    !is.na(popularity),
    popularity <= 100
  ) |>
  group_by(genre) |>
  summarize(
    avg_popularity = mean(popularity, na.rm = TRUE),
    count = n()
  ) |>
  filter(count >= 10) |>  # Keep only genres with enough data
  arrange(desc(avg_popularity))

ggplot(genre_popularity, aes(x = reorder(genre, avg_popularity), y = avg_popularity, fill = genre)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +  # Flip for better readability
  labs(
    title = "Average Popularity by Genre",
    x = "Genre",
    y = "Average Popularity",
    caption = "Source: Spotify dataset via Spotify Web API"
  ) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal()
Conclusion
Data Cleaning
To prepare the dataset, I removed rows with missing values in genre or popularity, and I filtered out any songs with a popularity score above 100, since Spotify's popularity score ranges from 0 to 100. I then grouped the data by genre and used summarize() to calculate the average popularity for each one. I kept only genres with at least 10 songs so that each average rests on a reasonable number of observations.
What the Visualization Shows
The bar plot shows the average popularity of songs across different genres, which makes it easy to compare which genres tend to perform better on Spotify. For example, I noticed that genres like Dance and Pop Rap had some of the highest average popularity scores, while genres like Classical and Folk were lower. This suggests that upbeat or modern genres tend to be more popular with listeners on the platform.
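One caveat when reading the genre comparison: the genre column can hold several comma-separated labels per song (for example "pop, rock, Folk/Acoustic"), so grouping on the raw string treats every combination as a separate genre. A possible refinement, sketched below, splits the labels so that each song counts toward every genre it lists. The helper name `popularity_by_single_genre` and the toy rows are assumptions for illustration; this is not the method used for the bar plot above.

```r
library(dplyr)
library(tidyr)

# Toy rows, invented for illustration only (not Spotify data).
toy_songs <- data.frame(
  song       = c("a", "b", "c"),
  genre      = c("pop, rock", "pop", "rock, metal"),
  popularity = c(80, 60, 40)
)

# Hypothetical helper: split multi-genre labels into one row per genre,
# then average popularity per individual genre.
popularity_by_single_genre <- function(songs, min_count = 1) {
  songs |>
    filter(!is.na(genre), !is.na(popularity)) |>
    separate_longer_delim(genre, delim = ", ") |>
    group_by(genre) |>
    summarize(
      avg_popularity = mean(popularity),
      count          = n(),
      .groups = "drop"
    ) |>
    filter(count >= min_count) |>
    arrange(desc(avg_popularity))
}

toy_by_genre <- popularity_by_single_genre(toy_songs)
print(toy_by_genre)
```

With this approach a song tagged with three genres contributes to three averages, so the per-genre counts add up to more than the number of songs. Whether that is desirable depends on how the genre labels are meant to be interpreted.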