With the growth of social media, more listeners experience fear of missing out (FOMO); reportedly 69% of people in the U.S. have experienced it, and this can push them toward music that is already popular. Although the creative process is inherently subjective, is there a “formula” for a hit song? In this project, we aim to provide an analytical report that both describes market trends and highlights specific opportunities within the existing data set. The goal is to help creators align their sound with market trends while keeping their unique voice, so that they can make more informed, data-driven decisions about what to promote.
Although our data set covers releases from 1957 to 2020, the rise of social media, particularly since 2010, has reshaped listening behavior through phenomena such as FOMO. By comparing song trends before and after the social media boom, we examine how popularity-driven consumption relates to musical characteristics and what that implies for creators in a socially amplified market.
We use the Spotify songs data set obtained from GitHub. We first identify the variables relevant to our problem statement, then clean and reshape the data so that it supports the analysis we are after. We visualize the results with ggplot2 (histograms, scatter plots, and correlation heatmaps), focusing on audio features that listeners respond to, such as danceability, liveness, and energy.
Regression analysis allows us to explore the relationship between each variable and a song’s popularity, helping identify which features have the strongest impact.
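As a minimal sketch of this regression step, the following fits a linear model of popularity on a few audio features and ranks them by standardized coefficient. It runs on simulated data: the `toy` data frame and the effect sizes built into it are assumptions made only for illustration, so the numbers say nothing about the real Spotify data. Base R is used so the sketch runs without the packages loaded below.

```r
# Simulated stand-in for the Spotify data (assumption: illustrative only).
set.seed(42)
n <- 500
toy <- data.frame(
  danceability = runif(n, 0, 1),
  energy       = runif(n, 0, 1),
  loudness     = rnorm(n, mean = -7, sd = 3)
)
# Popularity is constructed to depend mostly on loudness, plus noise.
toy$track_popularity <- 50 + 2 * toy$loudness + 5 * toy$danceability +
  rnorm(n, sd = 10)

# Standardize the predictors so coefficient magnitudes are comparable.
predictors <- c("danceability", "energy", "loudness")
toy_scaled <- toy
toy_scaled[predictors] <- lapply(toy[predictors], function(x) as.numeric(scale(x)))

fit <- lm(track_popularity ~ danceability + energy + loudness, data = toy_scaled)

# Rank features by the absolute size of their standardized coefficient.
ranked <- sort(abs(coef(fit)[predictors]), decreasing = TRUE)
print(round(ranked, 2))
```

With the real data, the same call could be pointed at the cleaned data frame instead of `toy`; the ranking is only as trustworthy as the linear model's fit.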
Artists and producers can make strategic choices to increase the reach of their music, while talent scouts can identify artists with high commercial potential. Even music lovers can discover hidden gems that have long been overlooked in their playlists.
library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(knitr)
library(kableExtra)
library(hexbin)
library(corrplot)
library(purrr)
library(broom)
library(gridExtra)
library(grid)
More packages can be added in the future.
Import the data set into RStudio:
spotify <- read.csv("C:/Users/samc8/OneDrive - Xavier University/Data Wrangling/Week 4/spotify_songs (2).csv")
View structure of the data set:
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
View summary statistics of the data set:
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.719
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
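The ID columns and the playlist subgenre are removed to streamline the data set, and the remaining columns are renamed (speechiness becomes speech_ratio and valence becomes positivity).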
spotify$playlist_id <- NULL
spotify$track_album_id <- NULL
spotify$track_id <- NULL
spotify$playlist_subgenre <- NULL
colnames(spotify) <- c("track_name", "track_artist", "track_popularity", "track_album_name",
"track_album_release_date", "playlist_name", "playlist_genre", "danceability",
"energy", "key", "loudness", "mode", "speech_ratio",
"acousticness", "instrumentalness", "liveness", "positivity",
"tempo", "duration_ms")
colSums(is.na(spotify))
## track_name track_artist track_popularity
## 5 5 0
## track_album_name track_album_release_date playlist_name
## 5 0 0
## playlist_genre danceability energy
## 0 0 0
## key loudness mode
## 0 0 0
## speech_ratio acousticness instrumentalness
## 0 0 0
## liveness positivity tempo
## 0 0 0
## duration_ms
## 0
spotify_clean <- na.omit(spotify)
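The positional `colnames()` assignment used for the renaming depends on the column order staying exactly as expected. A possible alternative, sketched here in base R on a made-up three-column frame, renames by name instead; `rename_map` is a hypothetical lookup introduced only for this example.

```r
# Toy frame carrying the two original column names that the report renames.
df <- data.frame(track_name = "a", speechiness = 0.05, valence = 0.6)

# Hypothetical lookup: old name -> new name.
rename_map <- c(speechiness = "speech_ratio", valence = "positivity")

# Rename only the columns found in the lookup; all others keep their names.
hits <- names(df) %in% names(rename_map)
names(df)[hits] <- unname(rename_map[names(df)[hits]])

print(names(df))
```

Renaming by name leaves the result unchanged if a column is later added or dropped upstream.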
We convert the album release date into a proper Date format. Year-only and year-month values are padded to the first day of that year or month before parsing.
spotify_clean <- spotify_clean %>%
mutate(
track_album_release_date = as.character(track_album_release_date),
track_album_release_date = case_when(
grepl("^\\d{4}$", track_album_release_date) ~ paste0(track_album_release_date, "-01-01"),
grepl("^\\d{4}-\\d{2}$", track_album_release_date) ~ paste0(track_album_release_date, "-01"),
TRUE ~ track_album_release_date
),
track_album_release_date = case_when(
grepl("^\\d{4}-\\d{2}-\\d{2}$", track_album_release_date) ~ as.Date(track_album_release_date, format = "%Y-%m-%d"),
grepl("^\\d{1,2}/\\d{1,2}/\\d{4}$", track_album_release_date) ~ as.Date(track_album_release_date, format = "%m/%d/%Y"),
TRUE ~ NA_Date_
)
)
spotify_clean <- spotify_clean %>%
  mutate(
    release_year = lubridate::year(track_album_release_date),
    duration_min = duration_ms / 60000
  )
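To make the date handling easier to follow, here is a small base R sketch of the same two-step idea (pad partial dates, then parse each supported layout) applied to a few hand-written strings. The input values are made up for illustration and are not taken from the data set.

```r
# Example strings in the layouts the conversion handles (made-up values).
raw <- c("2019-06-14", "1975", "2012-03", "7/4/2001", "unknown")

# Step 1: pad year-only and year-month strings to a full YYYY-MM-DD string.
padded <- ifelse(grepl("^\\d{4}$", raw), paste0(raw, "-01-01"),
          ifelse(grepl("^\\d{4}-\\d{2}$", raw), paste0(raw, "-01"), raw))

# Step 2: parse each supported layout; anything else stays NA.
parsed <- rep(as.Date(NA), length(raw))
iso <- grepl("^\\d{4}-\\d{2}-\\d{2}$", padded)
us  <- grepl("^\\d{1,2}/\\d{1,2}/\\d{4}$", padded)
parsed[iso] <- as.Date(padded[iso], format = "%Y-%m-%d")
parsed[us]  <- as.Date(padded[us],  format = "%m/%d/%Y")

print(data.frame(raw, parsed))
```

Note that padding places every year-only release on January 1, which is harmless for a year-level comparison but would bias any month-level analysis.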
boxplot(spotify_clean$duration_min,
main = "Boxplot of Song Duration (min)",
ylab = "Duration (minutes)")
To avoid excluding valid songs with unusually long or short durations, we apply an asymmetric threshold: 4 × IQR above the third quartile and 2 × IQR below the first quartile. This approach broadens the acceptable range while still filtering extreme values, helping preserve meaningful variation in the data set without misclassifying legitimate entries as outliers.
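Q1 <- quantile(spotify_clean$duration_min, 0.25, na.rm = TRUE)
Q3 <- quantile(spotify_clean$duration_min, 0.75, na.rm = TRUE)
# Named iqr_duration so the variable does not mask the built-in stats::IQR()
iqr_duration <- Q3 - Q1
upper_bound <- Q3 + 4 * iqr_duration
lower_bound <- Q1 - 2 * iqr_duration
# Keep only songs whose duration lies inside the asymmetric bounds
spotify_clean_2 <- spotify_clean[
  spotify_clean$duration_min >= lower_bound & spotify_clean$duration_min <= upper_bound, ]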
boxplot(spotify_clean_2$duration_min,
main = "Boxplot of Song Duration (min, no outliers)",
ylab = "Duration (minutes)")
length(spotify_clean_2$duration_min)
## [1] 32801
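Together, the two cleaning steps removed 32 of the original 32,833 rows: the rows with missing track metadata dropped by na.omit(), and the songs outside the duration bounds, which were treated as outliers.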
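The asymmetric rule can be wrapped in a small function so the multipliers are easy to vary. This is a sketch only: `asymmetric_bounds` is a hypothetical helper, and the durations below are made-up values rather than songs from the data set.

```r
# Hypothetical helper: asymmetric IQR fences around the quartiles.
asymmetric_bounds <- function(x, k_low = 2, k_high = 4) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k_low * spread, upper = q[2] + k_high * spread)
}

# Made-up durations in minutes: typical songs plus two extreme values.
durations <- c(2.5, 3.0, 3.2, 3.5, 3.6, 3.8, 4.0, 4.2, 4.5, 0.1, 12.0)
bounds <- asymmetric_bounds(durations)
kept <- durations[durations >= bounds["lower"] & durations <= bounds["upper"]]

print(bounds)
print(length(durations) - length(kept))  # number of values dropped
```

Rerunning with different `k_low` and `k_high` values is a quick way to check how sensitive the number of removed songs is to the chosen multipliers.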
spotify_clean_2 <- spotify_clean_2 %>%
mutate(release_period = case_when(
is.na(track_album_release_date) ~ "NA",
year(track_album_release_date) < 2010 ~ "Before 2010",
TRUE ~ "After 2010"
))
kableExtra::scroll_box(
kableExtra::kable_paper(
kableExtra::kbl(head(spotify_clean_2, 10))
),
width = "700px",
height = "300px"
)
| track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_name | playlist_genre | danceability | energy | key | loudness | mode | speech_ratio | acousticness | instrumentalness | liveness | positivity | tempo | duration_ms | release_year | duration_min | release_period |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 | 2019 | 3.245900 | After 2010 |
| Memories - Dillon Francis Remix | Maroon 5 | 67 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 | 2019 | 2.710000 | After 2010 |
| All the Time - Don Diablo Remix | Zara Larsson | 70 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 | 2019 | 2.943600 | After 2010 |
| Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 | 2019 | 2.818217 | After 2010 |
| Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 | 2019 | 3.150867 | After 2010 |
| Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 | 2019 | 2.717483 | After 2010 |
| Never Really Over - R3HAB Remix | Katy Perry | 62 | Never Really Over (R3HAB Remix) | 2019-07-26 | Pop Remix | pop | 0.449 | 0.856 | 5 | -4.788 | 0 | 0.0623 | 0.1870 | 0.00e+00 | 0.1760 | 0.152 | 112.648 | 187675 | 2019 | 3.127917 | After 2010 |
| Post Malone (feat. RANI) - GATTÜSO Remix | Sam Feldt | 69 | Post Malone (feat. RANI) [GATTÜSO Remix] | 2019-08-29 | Pop Remix | pop | 0.542 | 0.903 | 4 | -2.419 | 0 | 0.0434 | 0.0335 | 4.80e-06 | 0.1110 | 0.367 | 127.936 | 207619 | 2019 | 3.460317 | After 2010 |
| Tough Love - Tiësto Remix / Radio Edit | Avicii | 68 | Tough Love (Tiësto Remix) | 2019-06-14 | Pop Remix | pop | 0.594 | 0.935 | 8 | -3.562 | 1 | 0.0565 | 0.0249 | 4.00e-06 | 0.6370 | 0.366 | 127.015 | 193187 | 2019 | 3.219783 | After 2010 |
| If I Can’t Have You - Gryffin Remix | Shawn Mendes | 67 | If I Can’t Have You (Gryffin Remix) | 2019-06-20 | Pop Remix | pop | 0.642 | 0.818 | 2 | -4.552 | 1 | 0.0320 | 0.0567 | 0.00e+00 | 0.0919 | 0.590 | 124.957 | 253040 | 2019 | 4.217333 | After 2010 |
summary(spotify_clean_2[,10:19])
## key loudness mode speech_ratio
## Min. : 0.000 Min. :-46.448 Min. :0.0000 Min. :0.0224
## 1st Qu.: 2.000 1st Qu.: -8.167 1st Qu.:0.0000 1st Qu.:0.0410
## Median : 6.000 Median : -6.164 Median :1.0000 Median :0.0625
## Mean : 5.375 Mean : -6.715 Mean :0.5656 Mean :0.1070
## 3rd Qu.: 9.000 3rd Qu.: -4.644 3rd Qu.:1.0000 3rd Qu.:0.1320
## Max. :11.000 Max. : 1.275 Max. :1.0000 Max. :0.9180
## acousticness instrumentalness liveness positivity
## Min. :0.0000014 Min. :0.000000 Min. :0.00936 Min. :0.00001
## 1st Qu.:0.0151000 1st Qu.:0.000000 1st Qu.:0.09270 1st Qu.:0.33100
## Median :0.0803000 Median :0.000016 Median :0.12700 Median :0.51200
## Mean :0.1751027 Mean :0.084489 Mean :0.19015 Mean :0.51058
## 3rd Qu.:0.2540000 3rd Qu.:0.004810 3rd Qu.:0.24800 3rd Qu.:0.69300
## Max. :0.9920000 Max. :0.994000 Max. :0.99600 Max. :0.99100
## tempo duration_ms
## Min. : 35.48 Min. : 57373
## 1st Qu.: 99.96 1st Qu.:187867
## Median :121.98 Median :216033
## Mean :120.89 Mean :225877
## 3rd Qu.:133.92 3rd Qu.:253585
## Max. :239.44 Max. :515960
table(spotify_clean_2$release_year)
##
## 1957 1958 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
## 2 1 4 1 2 5 9 12 19 41 23 56 82 70 74 104
## 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
## 80 106 133 100 130 84 97 87 94 118 140 144 121 183 193 128
## 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## 171 209 186 224 237 219 250 252 283 278 250 312 259 353 385 506
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
## 445 470 619 472 615 603 783 956 1524 1778 2127 2426 3301 9080 785
ggplot(spotify_clean_2, aes(x = track_popularity)) +
geom_histogram(binwidth = 5, fill = "#2ECC71", color = "white") +
labs(
title = "Track Popularity Distribution (0-100)",
x = "Track Popularity",
y = "Number of Songs",
caption = "Binwidth = 5 | Source: spotify_clean_2"
) +
theme_minimal()
Values near 0 are often excluded because they tend to represent songs with little to no listener engagement, which may be unreleased, inactive, or algorithmically suppressed. Including them can distort trend analysis and obscure meaningful patterns among actively consumed tracks.
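# Histogram data with one row per song. spotify_long is only created further below
# and holds one row per song-feature pair, so it would count every song several times.
# Assumption: songs with popularity below 20 are dropped, the cutoff used in the later analyses.
popularity_hist_data <- spotify_clean_2 %>%
  filter(track_popularity >= 20)
ggplot(popularity_hist_data, aes(x = track_popularity)) +
  geom_histogram(binwidth = 5, fill = "#1DB954", color = "white", alpha = 0.8) +
  geom_vline(aes(xintercept = mean(track_popularity)),
             color = "red", linetype = "dashed", linewidth = 1) +
  # y = Inf with vjust pins the label to the top of the panel
  annotate("text", x = mean(popularity_hist_data$track_popularity) + 10, y = Inf, vjust = 2,
           label = paste("Mean =", round(mean(popularity_hist_data$track_popularity), 1)),
           color = "red") +
  labs(
    title = "Song Track Popularity Distribution",
    subtitle = "Most songs have low to medium popularity, with few viral hits",
    x = "Track Popularity Score",
    y = "Number of Songs",
    caption = "Source: Spotify Songs Dataset"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))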
### Creating a new dataframe: spotify_popular
spotify_popular <- spotify_clean_2 %>%
filter(track_popularity >= 20 & track_popularity <= 80)
# Filter data to include songs with popularity >= 20
spotify_filtered <- spotify_clean_2 %>%
filter(track_popularity >= 20)
# Correlation for songs BEFORE 2010 (popularity >= 20)
cor_before <- spotify_filtered %>%
filter(release_year < 2010) %>%
select(track_popularity, danceability, energy, loudness,
acousticness, instrumentalness, liveness, positivity,
tempo, duration_min, speech_ratio) %>%
cor(use = "complete.obs")
# Correlation for songs AFTER 2010 (popularity >= 20)
cor_after <- spotify_filtered %>%
filter(release_year >= 2010) %>%
select(track_popularity, danceability, energy, loudness,
acousticness, instrumentalness, liveness, positivity,
tempo, duration_min, speech_ratio) %>%
cor(use = "complete.obs")
# Set up side-by-side plots
par(mfrow = c(1, 2))
# Plot Before 2010
corrplot(cor_before,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.6,
col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
title = "Before 2010 (Popularity >= 20)",
mar = c(0,0,2,0),
tl.cex = 0.8)
# Plot After 2010
corrplot(cor_after,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.6,
col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
title = "After 2010 (Popularity >= 20)",
mar = c(0,0,2,0),
tl.cex = 0.8)
# Reset plot layout
par(mfrow = c(1, 1))
# Display sample size information
cat("Sample sizes for correlation analysis (Popularity >= 20):\n")
## Sample sizes for correlation analysis (Popularity >= 20):
cat("Before 2010:", nrow(spotify_filtered %>% filter(release_year < 2010)), "songs\n")
## Before 2010: 6262 songs
cat("After 2010:", nrow(spotify_filtered %>% filter(release_year >= 2010)), "songs\n")
## After 2010: 19341 songs
The heatmaps show no standout audio feature that clearly drives popularity in either period. This is consistent with musical success depending largely on external factors, such as marketing, artist reputation, playlist placement, and timing, rather than on sound characteristics alone, although the data set cannot test those factors directly.
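Reading individual cells off two heatmaps is error-prone, so one option is to extract only the correlations with popularity and tabulate them side by side. The sketch below does this in base R on simulated data; the `toy` data frame, its effect sizes, and the `popularity_cors` helper are assumptions for illustration and do not reproduce the report's values.

```r
# Simulated stand-in for the filtered data (assumption: illustrative only).
set.seed(1)
n <- 400
toy <- data.frame(
  release_year = sample(1990:2020, n, replace = TRUE),
  loudness     = rnorm(n, -7, 3),
  energy       = runif(n),
  danceability = runif(n)
)
toy$track_popularity <- 50 + 1.5 * toy$loudness + rnorm(n, sd = 12)

# Hypothetical helper: correlation of each feature with popularity.
popularity_cors <- function(df, features) {
  sapply(features, function(f) cor(df[[f]], df$track_popularity,
                                   use = "complete.obs"))
}

features <- c("loudness", "energy", "danceability")
comparison <- data.frame(
  feature = features,
  before  = popularity_cors(toy[toy$release_year <  2010, ], features),
  after   = popularity_cors(toy[toy$release_year >= 2010, ], features),
  row.names = NULL
)
comparison$change <- comparison$after - comparison$before

# Sort by the strongest absolute correlation in either period.
strongest <- pmax(abs(comparison$before), abs(comparison$after))
comparison <- comparison[order(strongest, decreasing = TRUE), ]
print(comparison, digits = 2)
```

A table like this makes it easier to see at a glance whether any feature's correlation with popularity is large in either period or shifts between them.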
# Redefining spotify_popular with a narrower band (popularity 50-80) for the scatter plots
spotify_popular <- spotify_clean_2 %>%
filter(track_popularity >= 50 & track_popularity <= 80)
# Creating the spotify_long variable
spotify_long <- spotify_popular %>%
pivot_longer(cols = c(danceability, energy, loudness, speech_ratio,
acousticness, instrumentalness, liveness,
positivity, duration_min),
names_to = "feature",
values_to = "value")
# Visualization
ggplot(spotify_long, aes(x = value, y = track_popularity)) +
geom_point(aes(color = release_period), alpha = 0.3, size = 1) +
geom_smooth(
data = filter(spotify_long, release_period == "Before 2010"),
method = "lm", se = FALSE, color = "black", linewidth = 0.8
) +
geom_smooth(
data = filter(spotify_long, release_period == "After 2010"),
method = "lm", se = FALSE, color = "#1DB954", linewidth = 0.8
) +
facet_wrap(~ feature, scales = "free_x", ncol = 3) +
labs(
title = "How Audio Features Relate to Popularity",
subtitle = "Each panel shows trends before and after 2010",
x = "Feature Value",
y = "Popularity Score",
color = "Release Period",
caption = "Source: Spotify Songs Dataset"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
strip.text = element_text(face = "bold", size = 10),
legend.position = "bottom"
)
Loudness stands out as a measurable, interpretable feature whose relationship with popularity varies by period and genre, making it a strong candidate for deeper analysis.
# Filter data to include songs with popularity >= 20
analysis_data_updated <- spotify_clean_2 %>%
filter(track_popularity >= 20)
# Function: Analyze loudness correlation for each genre
analyze_loudness_correlation <- function(data, genre_name, period_label, year_threshold = 2010) {
# Filter and prepare data
genre_data <- data %>%
filter(playlist_genre == genre_name) %>%
select(track_popularity, loudness)
# Check sample size adequacy
n_obs <- nrow(genre_data)
MIN_SAMPLE_SIZE <- 30
if (n_obs < MIN_SAMPLE_SIZE) {
warning(paste0("Insufficient data for ", genre_name, " - ", period_label,
" (n=", n_obs, ", required: ", MIN_SAMPLE_SIZE, ")"))
return(NULL)
}
# Compute correlations
cor_value <- cor(genre_data$track_popularity, genre_data$loudness,
method = "spearman", use = "pairwise.complete.obs")
cor_pearson <- cor(genre_data$track_popularity, genre_data$loudness,
method = "pearson", use = "pairwise.complete.obs")
return(list(
correlation_spearman = cor_value,
correlation_pearson = cor_pearson,
sample_size = n_obs,
genre = genre_name,
period = period_label,
data = genre_data
))
}
# Function: Create scatter plot with correlation
plot_loudness_correlation <- function(cor_result, main_title = NULL) {
if (is.null(cor_result)) return(invisible(NULL))
# Create scatter plot
p <- ggplot(cor_result$data, aes(x = loudness, y = track_popularity)) +
geom_hex(bins = 40, alpha = 0.8) +
geom_smooth(method = "lm", color = "#E74C3C", linewidth = 1.5, se = TRUE, alpha = 0.2) +
geom_smooth(method = "loess", color = "#3498DB", linewidth = 1.2, linetype = "dashed", se = FALSE) +
scale_fill_gradient(low = "#FFFFBF", high = "#1A9850", name = "Song\nDensity") +
labs(
title = ifelse(!is.null(main_title), main_title,
paste0(cor_result$genre, " - ", cor_result$period)),
subtitle = sprintf("Spearman r = %.3f | Pearson r = %.3f | n = %d",
cor_result$correlation_spearman,
cor_result$correlation_pearson,
cor_result$sample_size),
x = "Loudness (dB)",
y = "Track Popularity"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "right"
)
return(p)
}
# Main analysis pipeline
generate_loudness_comparison <- function(data, genres, year_cutoff = 2010) {
data_before <- data %>% filter(release_year < year_cutoff)
data_after <- data %>% filter(release_year >= year_cutoff)
for (genre in genres) {
cat("\n\n### Genre Analysis:", toupper(genre), "\n")
# Get sample sizes
n_before <- data_before %>% filter(playlist_genre == genre) %>% nrow()
n_after <- data_after %>% filter(playlist_genre == genre) %>% nrow()
cat("Sample sizes - Before 2010:", n_before, "| After 2010:", n_after, "\n")
# Analyze both periods
cor_before <- analyze_loudness_correlation(data_before, genre, "Before 2010")
cor_after <- analyze_loudness_correlation(data_after, genre, "After 2010")
if (is.null(cor_before) || is.null(cor_after)) {
cat("Skipped due to insufficient data.\n")
next
}
# Display correlation values
cat(sprintf("\nLoudness-Popularity Correlation (Spearman):\n"))
cat(sprintf(" Before 2010: r = %.3f\n", cor_before$correlation_spearman))
cat(sprintf(" After 2010: r = %.3f\n", cor_after$correlation_spearman))
cat(sprintf(" Change: Δr = %.3f\n",
cor_after$correlation_spearman - cor_before$correlation_spearman))
# Create side-by-side plots
p1 <- plot_loudness_correlation(cor_before,
paste0(toupper(genre), " - Before 2010"))
p2 <- plot_loudness_correlation(cor_after,
paste0(toupper(genre), " - After 2010"))
# Display plots side by side
grid.arrange(p1, p2, ncol = 2,
top = textGrob(paste0("Loudness vs Popularity: ", toupper(genre)),
gp = gpar(fontsize = 16, fontface = "bold")))
cat("\n", strrep("-", 80), "\n")
}
}
# Execute analysis with filtered data
library(gridExtra)
library(grid)
genres <- unique(analysis_data_updated$playlist_genre)
generate_loudness_comparison(analysis_data_updated, genres)
##
##
## ### Genre Analysis: POP
## Sample sizes - Before 2010: 500 | After 2010: 4034
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.107
## After 2010: r = 0.169
## Change: Δr = 0.062
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: RAP
## Sample sizes - Before 2010: 1128 | After 2010: 3522
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.139
## After 2010: r = 0.077
## Change: Δr = -0.063
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: ROCK
## Sample sizes - Before 2010: 2584 | After 2010: 1170
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.097
## After 2010: r = 0.116
## Change: Δr = 0.018
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: LATIN
## Sample sizes - Before 2010: 603 | After 2010: 3603
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.210
## After 2010: r = 0.252
## Change: Δr = 0.042
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: R&B
## Sample sizes - Before 2010: 1368 | After 2010: 2728
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.112
## After 2010: r = 0.128
## Change: Δr = 0.016
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: EDM
## Sample sizes - Before 2010: 79 | After 2010: 4284
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.150
## After 2010: r = 0.064
## Change: Δr = -0.086
##
## --------------------------------------------------------------------------------
cat("\n\nDATA SUMMARY (Popularity >= 20):\n")
##
##
## DATA SUMMARY (Popularity >= 20):
cat("Total songs analyzed:", nrow(analysis_data_updated), "\n")
## Total songs analyzed: 25603
cat("Original dataset size:", nrow(spotify_clean_2), "\n")
## Original dataset size: 32801
cat("Songs excluded (popularity < 20):", nrow(spotify_clean_2) - nrow(analysis_data_updated), "\n")
## Songs excluded (popularity < 20): 7198
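The Δr values printed above are small, and it is not obvious whether they differ from zero by more than sampling noise. One rough check is a Fisher z-test for two independent correlations, sketched below with inputs copied from the printed output. Two caveats: `compare_correlations` is a hypothetical helper, and the Fisher transform is designed for Pearson correlations, so applying it to Spearman values is only an approximation.

```r
# Hypothetical helper: two-sided Fisher z-test for two independent correlations.
compare_correlations <- function(r1, n1, r2, n2) {
  z_diff <- atanh(r2) - atanh(r1)               # Fisher transform of each r
  se     <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # standard error of the difference
  z      <- z_diff / se
  c(z = z, p_value = 2 * pnorm(-abs(z)))
}

# Inputs copied from the printed output (Spearman values and sample sizes).
edm <- compare_correlations(r1 = 0.150, n1 = 79,  r2 = 0.064, n2 = 4284)
pop <- compare_correlations(r1 = 0.107, n1 = 500, r2 = 0.169, n2 = 4034)
print(round(rbind(edm, pop), 3))
```

Small pre-2010 samples, such as the 79 EDM songs, make the standard error large, so a change in r of this size is hard to distinguish from noise.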
Pop songs tend to be most popular when loudness falls between –6 dB and –4 dB, suggesting a production sweet spot. The relationship is mildly non-linear, with popularity peaking around that range.
For rap songs after 2010, popularity tends to peak when loudness is around –5 dB to –3 dB, though the correlation remains weak. The trend suggests a mild preference for louder production, but not a strong linear relationship.
For rock songs, popular tracks tend to center around –8 dB to –6 dB in loudness, both before and after 2010, indicating a consistent production preference across time.
For Latin songs, both before and after 2010, popular tracks tend to cluster around –6 dB to –4 dB in loudness, indicating a consistent preference for moderately loud production.
For R&B songs, popular tracks tend to concentrate around –9 dB to –6 dB in loudness, both before and after 2010, reflecting a steady production preference over time.
For EDM tracks, popular songs after 2010 tend to center around –5 dB to –3 dB in loudness, reflecting a shift toward more intense, high-energy production compared to earlier years.
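The loudness ranges above are read off the density plots. A way to quantify such a "sweet spot" is to bin loudness and compare the median popularity per bin, as in the following base R sketch. The data are simulated with a peak near –5 dB built in (an assumption for illustration), so the output demonstrates the method rather than a finding.

```r
# Simulated stand-in (assumption): popularity peaks near -5 dB by construction.
set.seed(7)
n <- 2000
loudness <- runif(n, -14, 0)
track_popularity <- 60 - 1.5 * (loudness + 5)^2 + rnorm(n, sd = 8)

# Cut loudness into 2 dB bins and summarize popularity within each bin.
bins <- cut(loudness, breaks = seq(-14, 0, by = 2), include.lowest = TRUE)
medians <- tapply(track_popularity, bins, median)
bin_summary <- data.frame(
  loudness_bin      = names(medians),
  median_popularity = as.vector(medians),
  n_songs           = as.vector(table(bins))
)
print(bin_summary, digits = 3)

# The bin with the highest median is the candidate "sweet spot".
best_bin <- bin_summary$loudness_bin[which.max(bin_summary$median_popularity)]
print(best_bin)
```

On the real data this would be run per genre and period, and bins with few songs should be interpreted with caution.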
genre_popularity <- spotify_clean_2 %>%
group_by(playlist_genre) %>%
summarise(
Avg_Popularity = mean(track_popularity),
Median_Popularity = median(track_popularity),
Songs = n(),
High_Pop_Songs = sum(track_popularity >= 70),
High_Pop_Pct = (High_Pop_Songs / Songs) * 100
) %>%
arrange(desc(Avg_Popularity))
# Table Creation
kable(genre_popularity,
digits = 1,
col.names = c("Genre", "Avg Popularity", "Median Popularity",
"Total Songs", "Hit Songs (70+)", "Hit Rate (%)"),
caption = "Genre Popularity Rankings") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
row_spec(1, bold = TRUE, color = "white", background = "#1DB954")
| Genre | Avg Popularity | Median Popularity | Total Songs | Hit Songs (70+) | Hit Rate (%) |
|---|---|---|---|---|---|
| pop | 47.7 | 52 | 5505 | 1240 | 22.5 |
| latin | 47.0 | 50 | 5149 | 1083 | 21.0 |
| rap | 43.3 | 47 | 5738 | 632 | 11.0 |
| rock | 41.7 | 46 | 4945 | 656 | 13.3 |
| r&b | 41.2 | 44 | 5430 | 798 | 14.7 |
| edm | 34.9 | 36 | 6034 | 424 | 7.0 |
Across genres and time periods, popular songs cluster within genre-specific loudness ranges: roughly –6 dB to –4 dB for pop and Latin, –5 dB to –3 dB for rap and EDM after 2010, and slightly softer for rock (–8 dB to –6 dB) and R&B (–9 dB to –6 dB). The correlations are weak (Spearman values between 0.06 and 0.25), so this pattern suggests, rather than establishes, a genre-specific “sweet spot” in production loudness. Other audio features show no standout relationship with popularity, which is consistent with external factors such as marketing, artist reputation, and playlist placement playing a larger role in musical success.
Causation vs. Correlation: This analysis identifies relationships but cannot prove that specific features cause popularity.
Missing Context: External factors such as artist fame and social media presence are not captured in the data set.
Temporal Bias: The data set over-represents recent music (9,080 of the 32,801 songs were released in 2019), which may reflect the recency bias of streaming platforms.
Future work could extend this analysis in several directions. First, predictive modeling: build machine learning models to predict song popularity from audio features, and test classification models such as logistic regression to predict hit vs. non-hit outcomes. Second, lyrics and time: use natural language processing (NLP) to analyze how lyrics affect popularity, and use historical data to study how a song's popularity changes over time (decay curves). Third, external metadata: incorporate artist type (solo vs. group), label tier (major vs. indie), or song language. Fourth, diagnostics: visualize residuals to detect patterns that simple correlations might miss, and explore interaction effects (e.g., loudness × genre, energy × danceability) to uncover compound influences. Finally, presentation: annotate plots with genre-specific loudness thresholds to highlight production sweet spots and guide interpretation.
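As a starting point for the hit vs. non-hit classification and the loudness × genre interaction mentioned above, here is a minimal logistic regression sketch in base R. It uses simulated data: the `toy` data frame, the two genres, and the slopes are assumptions for illustration. With the real data, the outcome could be defined as `track_popularity >= 70`, the hit cutoff used in the genre table.

```r
# Simulated stand-in (assumption): hit probability rises with loudness,
# more steeply for "pop" than for "rock".
set.seed(123)
n <- 1000
toy <- data.frame(
  playlist_genre = factor(sample(c("pop", "rock"), n, replace = TRUE)),
  loudness       = rnorm(n, mean = -7, sd = 3)
)
slope <- ifelse(toy$playlist_genre == "pop", 0.4, 0.1)
toy$is_hit <- rbinom(n, size = 1, prob = plogis(-1 + slope * (toy$loudness + 7)))

# Logistic regression with a loudness-by-genre interaction term.
fit <- glm(is_hit ~ loudness * playlist_genre, data = toy, family = binomial)
print(round(coef(summary(fit)), 3))

# Predicted hit probability at -5 dB for each genre.
new_songs <- data.frame(loudness = -5,
                        playlist_genre = factor(c("pop", "rock")))
new_songs$p_hit <- predict(fit, newdata = new_songs, type = "response")
print(new_songs)
```

The interaction coefficient estimates how much the loudness slope differs between genres on the log-odds scale; a held-out test set would be needed before treating such a model as predictive.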