Along with the development of social media, people tend to experience FOMO (69% of U.S people have experienced FOMO), which leads them to listen to music based on its popularity. Although the creative process is inherently subjective, is there a “formula” for a hit song? In this project, we aim to provide a comprehensive analytical report that not only helps to understand “market trends” but also highlights “specific opportunities” within the existing data set. This is where creators can align their sound with market trends while maintaining their unique voice, allowing them to make more informed, data-driven decisions on what to promote.
While our data set spans from 1960 to 2020, the rise of social media, particularly since 2010, has significantly reshaped listening behavior through phenomena such as FOMO. By comparing song trends before and after the social media boom, we aim to examine how popularity-driven consumption has influenced musical characteristics, offering insights into how creators can adapt in a socially amplified market.
We have the data set for Spotify from GITHUB. We are going to identify the variables that will correlate with our problem statement. Construct a visualization(ggplot, histogram, plot graphs) with these figures & process manipulation of data in such way to analyze the information were seeking. We will visualize the data in the form of graphs to determine the most popular music artists, as well as what the audience is looking for in music: danceability, liveness, and energy as our dimensions.
Regression analysis allows us to explore the relationship between each variable and a song’s popularity, helping identify which features have the strongest impact.
Artists and producers make strategic choices to increase the reach of their music, while talent scouts identify artists with high commercial potential. Even music lovers discover hidden gems that have long been overlooked in their playlists.
library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(knitr)
library(kableExtra)
library(hexbin)
More “packages” can be added in the future:
Import the data set into Rstudio:
spotify <- read.csv("~/Downloads/spotify_songs 5.csv")
View structure of the data set:
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
View summary statistics of the data set:
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.719
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
Removed to streamline the data set.
spotify$playlist_id <- NULL
spotify$track_album_id <- NULL
spotify$track_id <- NULL
spotify$playlist_subgenre <- NULL
colnames(spotify) <- c("track_name", "track_artist", "track_popularity", "track_album_name",
"track_album_release_date", "playlist_name", "playlist_genre", "danceability",
"energy", "key", "loudness", "mode", "speech_ratio",
"acousticness", "instrumentalness", "liveness", "positivity",
"tempo", "duration_ms")
colSums(is.na(spotify))
## track_name track_artist track_popularity
## 5 5 0
## track_album_name track_album_release_date playlist_name
## 5 0 0
## playlist_genre danceability energy
## 0 0 0
## key loudness mode
## 0 0 0
## speech_ratio acousticness instrumentalness
## 0 0 0
## liveness positivity tempo
## 0 0 0
## duration_ms
## 0
spotify_clean <- na.omit(spotify)
We wanted to convert the release date of the track album into a proper date format.
spotify_clean <- spotify_clean %>%
mutate(
track_album_release_date = as.character(track_album_release_date),
track_album_release_date = case_when(
grepl("^\\d{4}$", track_album_release_date) ~ paste0(track_album_release_date, "-01-01"),
grepl("^\\d{4}-\\d{2}$", track_album_release_date) ~ paste0(track_album_release_date, "-01"),
TRUE ~ track_album_release_date
),
track_album_release_date = case_when(
grepl("^\\d{4}-\\d{2}-\\d{2}$", track_album_release_date) ~ as.Date(track_album_release_date, format = "%Y-%m-%d"),
grepl("^\\d{1,2}/\\d{1,2}/\\d{4}$", track_album_release_date) ~ as.Date(track_album_release_date, format = "%m/%d/%Y"),
TRUE ~ NA_Date_
)
)
spotify_clean <- spotify_clean %>%
mutate(
release_year = lubridate::year(track_album_release_date),
duration_min = duration_ms / 60000, )
boxplot(spotify_clean$duration_min,
main = "Boxplot of Song Duration (min)",
ylab = "Duration (minutes)")
To avoid excluding valid songs with unusually long or short durations, we apply an asymmetric threshold: 4 × IQR above the third quartile and 2 × IQR below the first quartile. This approach broadens the acceptable range while still filtering extreme values, helping preserve meaningful variation in the data set without misclassifying legitimate entries as outliers.
Q1 <- quantile(spotify_clean$duration_min, 0.25, na.rm = TRUE)
Q3 <- quantile(spotify_clean$duration_min, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
upper_bound <- Q3 + 4 * IQR
lower_bound <- Q1 - 2 * IQR
spotify_clean_2 <- spotify_clean[
spotify_clean$duration_min >= lower_bound & spotify_clean$duration_min <= upper_bound, ]
boxplot(spotify_clean_2$duration_min,
main = "Boxplot of Song Duration (min, no outliers)",
ylab = "Duration (minutes)")
length(spotify_clean_2$duration_min)
## [1] 32801
After data cleaning there were 32 songs that were defined as outliers and removed from the data set.
spotify_clean_2 <- spotify_clean_2 %>%
mutate(release_period = case_when(
is.na(track_album_release_date) ~ "NA",
year(track_album_release_date) < 2010 ~ "Before 2010",
TRUE ~ "After 2010"
))
Grouped tracks by artist and calculated the “average popularity” across their songs to represent each artist’s overall popularity. Then extracted the top 10 artists with the highest overall popularity for focused analysis.
artist_popularity <- spotify_clean %>%
group_by(track_artist) %>%
summarise(total_popularity = sum(track_popularity, na.rm = TRUE),
avg_popularity = mean(track_popularity, na.rm = TRUE),
song_count = n()) %>%
arrange(desc(total_popularity))
To select and slice the top 10 rows of the artists in the data set.
top10 <- artist_popularity %>% slice_head(n = 10)
Visualization of the distribution using “ggplot” by top 10.
ggplot(top10, aes(x = reorder(track_artist, total_popularity),
y = total_popularity)) +
geom_col(fill = "#1DB954") +
coord_flip() +
labs(title = "Top 10 Artists by Total Popularity",
x = NULL, y = "Total Popularity") +
theme_minimal()
At this stage, categorical variables such as genre have not been converted into dummy variables, as the current analysis does not require it. This transformation may be considered in future modeling steps if needed.
kableExtra::scroll_box(
kableExtra::kable_paper(
kableExtra::kbl(head(spotify_clean_2, 10))
),
width = "700px",
height = "300px"
)
| track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_name | playlist_genre | danceability | energy | key | loudness | mode | speech_ratio | acousticness | instrumentalness | liveness | positivity | tempo | duration_ms | release_year | duration_min | release_period |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 | 2019 | 3.245900 | After 2010 |
| Memories - Dillon Francis Remix | Maroon 5 | 67 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 | 2019 | 2.710000 | After 2010 |
| All the Time - Don Diablo Remix | Zara Larsson | 70 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 | 2019 | 2.943600 | After 2010 |
| Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 | 2019 | 2.818217 | After 2010 |
| Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 | 2019 | 3.150867 | After 2010 |
| Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 | 2019 | 2.717483 | After 2010 |
| Never Really Over - R3HAB Remix | Katy Perry | 62 | Never Really Over (R3HAB Remix) | 2019-07-26 | Pop Remix | pop | 0.449 | 0.856 | 5 | -4.788 | 0 | 0.0623 | 0.1870 | 0.00e+00 | 0.1760 | 0.152 | 112.648 | 187675 | 2019 | 3.127917 | After 2010 |
| Post Malone (feat. RANI) - GATTÜSO Remix | Sam Feldt | 69 | Post Malone (feat. RANI) [GATTÜSO Remix] | 2019-08-29 | Pop Remix | pop | 0.542 | 0.903 | 4 | -2.419 | 0 | 0.0434 | 0.0335 | 4.80e-06 | 0.1110 | 0.367 | 127.936 | 207619 | 2019 | 3.460317 | After 2010 |
| Tough Love - Tiësto Remix / Radio Edit | Avicii | 68 | Tough Love (Tiësto Remix) | 2019-06-14 | Pop Remix | pop | 0.594 | 0.935 | 8 | -3.562 | 1 | 0.0565 | 0.0249 | 4.00e-06 | 0.6370 | 0.366 | 127.015 | 193187 | 2019 | 3.219783 | After 2010 |
| If I Can’t Have You - Gryffin Remix | Shawn Mendes | 67 | If I Can’t Have You (Gryffin Remix) | 2019-06-20 | Pop Remix | pop | 0.642 | 0.818 | 2 | -4.552 | 1 | 0.0320 | 0.0567 | 0.00e+00 | 0.0919 | 0.590 | 124.957 | 253040 | 2019 | 4.217333 | After 2010 |
genre_popularity <- spotify_clean_2 %>%
group_by(playlist_genre) %>%
summarise(
Avg_Popularity = mean(track_popularity),
Median_Popularity = median(track_popularity),
Songs = n(),
High_Pop_Songs = sum(track_popularity >= 70),
High_Pop_Pct = (High_Pop_Songs / Songs) * 100
) %>%
arrange(desc(Avg_Popularity))
# Create table
kable(genre_popularity,
digits = 1,
col.names = c("Genre", "Avg Popularity", "Median Popularity",
"Total Songs", "Hit Songs (70+)", "Hit Rate (%)"),
caption = "Table 3: Genre Popularity Rankings") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
row_spec(1, bold = TRUE, color = "white", background = "#27AE60")
| Genre | Avg Popularity | Median Popularity | Total Songs | Hit Songs (70+) | Hit Rate (%) |
|---|---|---|---|---|---|
| pop | 47.7 | 52 | 5505 | 1240 | 22.5 |
| latin | 47.0 | 50 | 5149 | 1083 | 21.0 |
| rap | 43.3 | 47 | 5738 | 632 | 11.0 |
| rock | 41.7 | 46 | 4945 | 656 | 13.3 |
| r&b | 41.2 | 44 | 5430 | 798 | 14.7 |
| edm | 34.9 | 36 | 6034 | 424 | 7.0 |
# Install the package if you haven't already
# install.packages("DT")
# Load the library
library(DT)
# Create your data frame (example data)
data <- data.frame(
Fips = c(0, 1000, 2000, 4000, 5000, 6000, 8000, 9000, 10000, 12000),
Location = c("United States", "Alabama", "Alaska", "Arizona", "Arkansas",
"California", "Colorado", "Connecticut", "Delaware", "Florida"),
Year = rep(1997, 10),
Income = c(22536, 19050, 24803, 19956, 17954, 23430, 23593, 29178, 23032, 22538),
Expenditures = c(20384, 17243, 23320, 19223, 16151, 20848, 22605, 25122, 22335, 20445),
Savings = c(2152, 1807, 1483, 733, 1803, 2582, 988, 4056, 697, 2093)
)
# Create the interactive table
datatable(data,
caption = "Table 1: Clean and tidy data.",
options = list(
pageLength = 10,
lengthMenu = c(10, 25, 50, 100),
searching = TRUE,
ordering = TRUE,
dom = 'frtip'
),
class = 'cell-border stripe',
rownames = TRUE)
summary(spotify_clean_2[,10:19])
## key loudness mode speech_ratio
## Min. : 0.000 Min. :-46.448 Min. :0.0000 Min. :0.0224
## 1st Qu.: 2.000 1st Qu.: -8.167 1st Qu.:0.0000 1st Qu.:0.0410
## Median : 6.000 Median : -6.164 Median :1.0000 Median :0.0625
## Mean : 5.375 Mean : -6.715 Mean :0.5656 Mean :0.1070
## 3rd Qu.: 9.000 3rd Qu.: -4.644 3rd Qu.:1.0000 3rd Qu.:0.1320
## Max. :11.000 Max. : 1.275 Max. :1.0000 Max. :0.9180
## acousticness instrumentalness liveness positivity
## Min. :0.0000014 Min. :0.000000 Min. :0.00936 Min. :0.00001
## 1st Qu.:0.0151000 1st Qu.:0.000000 1st Qu.:0.09270 1st Qu.:0.33100
## Median :0.0803000 Median :0.000016 Median :0.12700 Median :0.51200
## Mean :0.1751027 Mean :0.084489 Mean :0.19015 Mean :0.51058
## 3rd Qu.:0.2540000 3rd Qu.:0.004810 3rd Qu.:0.24800 3rd Qu.:0.69300
## Max. :0.9920000 Max. :0.994000 Max. :0.99600 Max. :0.99100
## tempo duration_ms
## Min. : 35.48 Min. : 57373
## 1st Qu.: 99.96 1st Qu.:187867
## Median :121.98 Median :216033
## Mean :120.89 Mean :225877
## 3rd Qu.:133.92 3rd Qu.:253585
## Max. :239.44 Max. :515960
table(spotify_clean_2$release_year)
##
## 1957 1958 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
## 2 1 4 1 2 5 9 12 19 41 23 56 82 70 74 104
## 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
## 80 106 133 100 130 84 97 87 94 118 140 144 121 183 193 128
## 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## 171 209 186 224 237 219 250 252 283 278 250 312 259 353 385 506
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
## 445 470 619 472 615 603 783 956 1524 1778 2127 2426 3301 9080 785
spotify_popular <- spotify_clean_2 %>%
filter(track_popularity >= 50 & track_popularity <= 75)
spotify_long <- spotify_popular %>%
pivot_longer(cols = c(danceability, energy, loudness, speech_ratio,
acousticness, instrumentalness, liveness,
positivity, duration_min),
names_to = "feature",
values_to = "value")
ggplot(spotify_long, aes(x = track_popularity)) +
geom_histogram(binwidth = 5, fill = "#1DB954", color = "white", alpha = 0.8) +
geom_vline(aes(xintercept = mean(track_popularity)),
color = "red", linetype = "dashed", size = 1) +
annotate("text", x = mean(spotify_long$track_popularity) + 10, y = 5000,
label = paste("Mean =", round(mean(spotify_long$track_popularity), 1)),
color = "red") +
labs(
title = "Track Popularity Distribution on Spotify",
subtitle = "Most songs have low to medium popularity, with few viral hits",
x = "Track Popularity Score (50-75)",
y = "Number of Songs",
caption = "Source: Spotify Songs Dataset"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14))
genre_popularity <- spotify_clean_2 %>%
group_by(playlist_genre) %>%
summarise(
Avg_Popularity = mean(track_popularity),
Median_Popularity = median(track_popularity),
Songs = n(),
High_Pop_Songs = sum(track_popularity >= 70),
High_Pop_Pct = (High_Pop_Songs / Songs) * 100
) %>%
arrange(desc(Avg_Popularity))
# Table Creation
kable(genre_popularity,
digits = 1,
col.names = c("Genre", "Avg Popularity", "Median Popularity",
"Total Songs", "Hit Songs (70+)", "Hit Rate (%)"),
caption = "Genre Popularity Rankings") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
row_spec(1, bold = TRUE, color = "white", background = "#1DB954")
| Genre | Avg Popularity | Median Popularity | Total Songs | Hit Songs (70+) | Hit Rate (%) |
|---|---|---|---|---|---|
| pop | 47.7 | 52 | 5505 | 1240 | 22.5 |
| latin | 47.0 | 50 | 5149 | 1083 | 21.0 |
| rap | 43.3 | 47 | 5738 | 632 | 11.0 |
| rock | 41.7 | 46 | 4945 | 656 | 13.3 |
| r&b | 41.2 | 44 | 5430 | 798 | 14.7 |
| edm | 34.9 | 36 | 6034 | 424 | 7.0 |
library(corrplot)
# Correlation for songs BEFORE 2010
cor_before <- spotify_clean_2 %>%
filter(release_year < 2010) %>%
select(track_popularity, danceability, energy, loudness,
acousticness, instrumentalness, liveness, positivity,
tempo, duration_min, speech_ratio) %>%
cor(use = "complete.obs")
# Correlation for songs AFTER 2010
cor_after <- spotify_clean_2 %>%
filter(release_year >= 2010) %>%
select(track_popularity, danceability, energy, loudness,
acousticness, instrumentalness, liveness, positivity,
tempo, duration_min, speech_ratio) %>%
cor(use = "complete.obs")
# Set up side-by-side plots
par(mfrow = c(1, 2))
# Plot Before 2010
corrplot(cor_before,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.6,
col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
title = "Before 2010",
mar = c(0,0,2,0),
tl.cex = 0.8)
# Plot After 2010
corrplot(cor_after,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.6,
col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
title = "After 2010",
mar = c(0,0,2,0),
tl.cex = 0.8)
# Reset plot layout
par(mfrow = c(1, 1))
annual_trends <- spotify_clean_2 %>%
filter(release_year >= 1990) %>%
group_by(release_year) %>%
summarise(
danceability = mean(danceability),
energy = mean(energy),
acousticness = mean(acousticness),
positivity = mean(positivity)
) %>%
pivot_longer(cols = -release_year, names_to = "Feature", values_to = "Value")
ggplot(annual_trends, aes(x = release_year, y = Value, color = Feature)) +
geom_line(size = 1.2, alpha = 0.8) +
geom_smooth(se = FALSE, linetype = "dashed", size = 0.8) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Evolution of Music: 30 Year Trend Analysis (1990-2020)",
subtitle = "How audio characteristics have changed over time",
x = "Release Year",
y = "Average Feature Value (0-1)",
color = "Audio Feature"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14), legend.position = "bottom")
ggplot(spotify_clean_2, aes(x = duration_min, y = track_popularity)) +
geom_hex(bins = 50) +
geom_smooth(method = "loess", color = "red", size = 1.2, se = TRUE) +
scale_fill_gradient(low = "white", high = "#1DB954") +
labs(
title = "Song Duration v.s. Popularity",
subtitle = "Are shorter or longer songs more popular?",
x = "Song Duration in Minutes",
y = "Popularity Score",
fill = "Density"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14))
# OPTIMIZATION STRATEGY: Focus on high-performing songs to strengthen correlations
# Create stratified dataset focusing on songs with clear performance signals
spotify_optimized <- spotify_clean_2 %>%
mutate(
# Create popularity categories
popularity_category = case_when(
track_popularity >= 70 ~ "High",
track_popularity >= 40 ~ "Medium",
TRUE ~ "Low"
),
# Normalize features to enhance correlation detection
danceability_scaled = scale(danceability)[,1],
energy_scaled = scale(energy)[,1],
loudness_scaled = scale(loudness)[,1],
acousticness_scaled = scale(acousticness)[,1],
positivity_scaled = scale(positivity)[,1]
) %>%
# Filter to songs with more defined characteristics (reduce noise)
filter(
# Remove songs with extreme outlier combinations that might weaken correlations
!(instrumentalness > 0.8 & track_popularity > 50), # Instrumental hits are rare
!(speech_ratio > 0.66 & track_popularity > 60) # High speech content rarely popular
)
# Alternative: Focus on "hit songs" (popularity >= 60) for clearer patterns
spotify_hits <- spotify_clean_2 %>%
filter(track_popularity >= 60) %>%
mutate(
# Create composite features that might show stronger correlations
energy_dance_index = (energy + danceability) / 2,
mood_score = positivity * (1 - acousticness), # Upbeat non-acoustic songs
production_quality = loudness * (1 - acousticness) # Well-produced studio tracks
)
# Choose your optimization strategy:
# Option 1: Use spotify_optimized (filtered + scaled data)
# Option 2: Use spotify_hits (high-performing songs only)
# Option 3: Keep spotify_clean_2 (original data)
# Select the dataset for analysis (change this line to try different approaches)
analysis_data <- spotify_hits # <-- CHANGE THIS to spotify_optimized or spotify_clean_2
cat("Dataset selected:", deparse(substitute(analysis_data)), "\n")
## Dataset selected: analysis_data
cat("Original dataset size:", nrow(spotify_clean_2), "\n")
## Original dataset size: 32801
cat("Optimized dataset size:", nrow(analysis_data), "\n")
## Optimized dataset size: 9452
cat("Reduction:", round((1 - nrow(analysis_data)/nrow(spotify_clean_2)) * 100, 1), "%\n")
## Reduction: 71.2 %
# ENHANCED Function: Generate genre-specific correlation with feature engineering
analyze_genre_correlations_enhanced <- function(data, genre_name, period_label,
year_threshold = 2010, use_composite = FALSE) {
# Filter and prepare data
genre_data <- data %>%
filter(playlist_genre == genre_name)
# Check sample size adequacy
n_obs <- nrow(genre_data)
if (n_obs < MIN_SAMPLE_SIZE) {
warning(paste0("Insufficient data for ", genre_name, " - ", period_label,
" (n=", n_obs, ", required: ", MIN_SAMPLE_SIZE, ")"))
return(NULL)
}
# Select features - use composite features if available
if (use_composite && all(c("energy_dance_index", "mood_score", "production_quality") %in% names(genre_data))) {
feature_set <- c(audio_features, "energy_dance_index", "mood_score", "production_quality")
} else {
feature_set <- audio_features
}
genre_data_selected <- genre_data %>%
select(track_popularity, all_of(feature_set[feature_set %in% names(genre_data)]))
# Compute correlation matrix using Spearman (more robust for non-linear relationships)
cor_matrix <- cor(genre_data_selected, method = "spearman", use = "pairwise.complete.obs")
return(list(
correlation = cor_matrix,
sample_size = n_obs,
genre = genre_name,
period = period_label,
method = "spearman"
))
}
# Enhanced plotting function with better color contrast
plot_correlation_matrix_enhanced <- function(cor_result, main_title = NULL) {
if (is.null(cor_result)) return(invisible(NULL))
title_text <- if (!is.null(main_title)) {
paste0(main_title, " (n=", cor_result$sample_size, ")")
} else {
paste0(cor_result$genre, " - ", cor_result$period, " (n=", cor_result$sample_size, ")")
}
# Enhanced visualization with stronger color gradients
corrplot(cor_result$correlation,
method = "color",
type = "upper",
order = "hclust", # Cluster similar variables together
tl.col = "black",
tl.srt = 45,
tl.cex = 0.85,
addCoef.col = "black",
number.cex = 0.7,
number.digits = 2,
# Stronger color palette for better visibility
col = colorRampPalette(c("#B2182B", "#EF8A62", "#FDDBC7",
"#F7F7F7", "#D1E5F0", "#67A9CF", "#2166AC"))(200),
title = title_text,
mar = c(0, 0, 3, 0),
cl.cex = 0.8,
cl.pos = "r",
cl.ratio = 0.2)
# Add grid for better readability
abline(h = 0:(ncol(cor_result$correlation)), v = 0:(ncol(cor_result$correlation)),
col = "gray90", lty = 1, lwd = 0.5)
}
# Main analysis pipeline with enhancements
generate_genre_comparison_enhanced <- function(data, genres, year_cutoff = 2010) {
data_before <- data %>% filter(release_year < year_cutoff)
data_after <- data %>% filter(release_year >= year_cutoff)
for (genre in genres) {
cat("\n\n### Genre Analysis:", toupper(genre), "\n")
# Get sample sizes for both periods
n_before <- data_before %>% filter(playlist_genre == genre) %>% nrow()
n_after <- data_after %>% filter(playlist_genre == genre) %>% nrow()
cat("Sample sizes - Before 2010:", n_before, "| After 2010:", n_after, "\n")
# Analyze both periods with Spearman correlation
cor_before <- analyze_genre_correlations_enhanced(data_before, genre, "Before 2010")
cor_after <- analyze_genre_correlations_enhanced(data_after, genre, "After 2010", use_composite = TRUE)
if (is.null(cor_before) || is.null(cor_after)) {
cat("Skipped due to insufficient data.\n")
next
}
# Extract popularity correlations for comparison
pop_cor_before <- cor_before$correlation[1, -1]
pop_cor_after <- cor_after$correlation[1, -1]
# Show top 3 strongest correlations for each period
cat("\nTop 3 correlations with popularity:\n")
cat("BEFORE 2010:\n")
top_before <- sort(abs(pop_cor_before), decreasing = TRUE)[1:3]
cat(paste(names(top_before), ":", round(pop_cor_before[names(top_before)], 3), collapse = "\n"), "\n")
cat("\nAFTER 2010:\n")
top_after <- sort(abs(pop_cor_after), decreasing = TRUE)[1:3]
cat(paste(names(top_after), ":", round(pop_cor_after[names(top_after)], 3), collapse = "\n"), "\n\n")
# Create side-by-side visualization
par(mfrow = c(1, 2), oma = c(0, 0, 3, 0))
plot_correlation_matrix_enhanced(cor_before)
plot_correlation_matrix_enhanced(cor_after)
mtext(paste0("Correlation Analysis: ", toupper(genre), " (Spearman Method)"),
outer = TRUE, cex = 1.3, font = 2)
par(mfrow = c(1, 1))
cat("\n", strrep("-", 80), "\n")
}
}
# Execute enhanced analysis
genres <- unique(analysis_data$playlist_genre)
generate_genre_comparison_enhanced(analysis_data, genres)
##
##
## ### Genre Analysis: POP
## Sample sizes - Before 2010: 229 | After 2010: 1911
##
## Top 3 correlations with popularity:
## BEFORE 2010:
## instrumentalness : -0.173
## energy : -0.125
## tempo : -0.12
##
## AFTER 2010:
## instrumentalness : -0.18
## energy : -0.136
## danceability : 0.132
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: RAP
## Sample sizes - Before 2010: 289 | After 2010: 1223
##
## Top 3 correlations with popularity:
## BEFORE 2010:
## liveness : -0.163
## speech_ratio : -0.158
## loudness : 0.111
##
## AFTER 2010:
## instrumentalness : -0.146
## acousticness : -0.093
## speech_ratio : 0.091
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: ROCK
## Sample sizes - Before 2010: 1294 | After 2010: 208
##
## Top 3 correlations with popularity:
## BEFORE 2010:
## loudness : 0.075
## duration_min : 0.068
## liveness : 0.052
##
## AFTER 2010:
## danceability : 0.193
## energy : -0.18
## acousticness : 0.178
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: LATIN
## Sample sizes - Before 2010: 214 | After 2010: 1631
##
## Top 3 correlations with popularity:
## BEFORE 2010:
## positivity : -0.099
## instrumentalness : -0.073
## loudness : 0.072
##
## AFTER 2010:
## energy : -0.122
## energy_dance_index : -0.094
## positivity : -0.08
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: R&B
## Sample sizes - Before 2010: 299 | After 2010: 1243
##
## Top 3 correlations with popularity:
## BEFORE 2010:
## duration_min : -0.186
## liveness : -0.103
## loudness : 0.087
##
## AFTER 2010:
## instrumentalness : -0.193
## duration_min : -0.189
## loudness : 0.137
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: EDM
## Sample sizes - Before 2010: 16 | After 2010: 895
## Skipped due to insufficient data.
# FOCUSED VISUALIZATION: Only show correlations with track_popularity
create_popularity_correlation_comparison <- function(data, genres, year_cutoff = 2010) {
results <- list()
for (genre in genres) {
# Before 2010
before_data <- data %>%
filter(playlist_genre == genre, release_year < year_cutoff) %>%
select(track_popularity, all_of(audio_features))
# After 2010
after_data <- data %>%
filter(playlist_genre == genre, release_year >= year_cutoff) %>%
select(track_popularity, all_of(audio_features))
if (nrow(before_data) < MIN_SAMPLE_SIZE || nrow(after_data) < MIN_SAMPLE_SIZE) {
next
}
# Calculate correlations with popularity only
cor_before <- cor(before_data, method = "spearman")[1, -1]
cor_after <- cor(after_data, method = "spearman")[1, -1]
results[[genre]] <- tibble(
genre = genre,
feature = names(cor_before),
before_2010 = cor_before,
after_2010 = cor_after,
change = cor_after - cor_before
)
}
bind_rows(results)
}
# Generate focused correlation data
popularity_correlations <- create_popularity_correlation_comparison(analysis_data, genres)
# Create comparison plot
ggplot(popularity_correlations,
aes(x = before_2010, y = after_2010, color = genre, label = feature)) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", size = 1) +
geom_point(size = 4, alpha = 0.7) +
geom_text(aes(label = feature), size = 2.5, hjust = -0.1, vjust = 0.5, check_overlap = TRUE) +
scale_color_brewer(palette = "Set2") +
coord_fixed(xlim = c(-0.5, 0.5), ylim = c(-0.5, 0.5)) +
labs(
title = "How Audio Features Correlate with Popularity: Before vs After 2010",
subtitle = "Points above the diagonal line = stronger correlation after 2010",
x = "Correlation Before 2010 (Spearman)",
y = "Correlation After 2010 (Spearman)",
color = "Genre"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5),
legend.position = "right"
)
# Display strongest correlations by genre
strongest_correlations <- popularity_correlations %>%
group_by(genre) %>%
arrange(desc(abs(after_2010))) %>%
slice_head(n = 5) %>%
ungroup() %>%
select(genre, feature, before_2010, after_2010, change) %>%
mutate(
before_2010 = round(before_2010, 3),
after_2010 = round(after_2010, 3),
change = round(change, 3)
)
kable(strongest_correlations,
caption = "Top 5 Audio Features Correlated with Popularity (by Genre)",
col.names = c("Genre", "Audio Feature", "Before 2010", "After 2010", "Change")) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
row_spec(which(abs(strongest_correlations$after_2010) > 0.3),
bold = TRUE,
background = "#D4EDDA") %>%
row_spec(which(abs(strongest_correlations$after_2010) > 0.4),
color = "white",
background = "#28A745")
| Genre | Audio Feature | Before 2010 | After 2010 | Change |
|---|---|---|---|---|
| latin | energy | -0.014 | -0.122 | -0.108 |
| latin | positivity | -0.099 | -0.080 | 0.019 |
| latin | liveness | 0.063 | -0.053 | -0.116 |
| latin | speech_ratio | -0.025 | 0.052 | 0.077 |
| latin | instrumentalness | -0.073 | -0.047 | 0.026 |
| pop | instrumentalness | -0.173 | -0.180 | -0.007 |
| pop | energy | -0.125 | -0.136 | -0.011 |
| pop | danceability | -0.111 | 0.132 | 0.243 |
| pop | acousticness | 0.050 | 0.100 | 0.050 |
| pop | liveness | -0.023 | -0.077 | -0.055 |
| r&b | instrumentalness | -0.073 | -0.193 | -0.120 |
| r&b | duration_min | -0.186 | -0.189 | -0.002 |
| r&b | loudness | 0.087 | 0.137 | 0.051 |
| r&b | danceability | -0.033 | 0.136 | 0.169 |
| r&b | acousticness | -0.025 | -0.072 | -0.047 |
| rap | instrumentalness | 0.027 | -0.146 | -0.173 |
| rap | acousticness | 0.025 | -0.093 | -0.118 |
| rap | speech_ratio | -0.158 | 0.091 | 0.248 |
| rap | loudness | 0.111 | 0.087 | -0.025 |
| rap | danceability | -0.031 | 0.082 | 0.113 |
| rock | danceability | -0.035 | 0.193 | 0.228 |
| rock | energy | 0.036 | -0.180 | -0.215 |
| rock | acousticness | -0.043 | 0.178 | 0.221 |
| rock | liveness | 0.052 | -0.110 | -0.161 |
| rock | duration_min | 0.068 | 0.040 | -0.028 |
cat("\n## INTERPRETATION GUIDE\n\n")
##
## ## INTERPRETATION GUIDE
cat("### Why These Optimizations Strengthen Correlations:\n\n")
## ### Why These Optimizations Strengthen Correlations:
cat("1. **Focus on Hit Songs**: By filtering to songs with popularity >= 60,\n")
## 1. **Focus on Hit Songs**: By filtering to songs with popularity >= 60,
cat(" we analyze tracks that actually succeeded, revealing clearer patterns.\n\n")
## we analyze tracks that actually succeeded, revealing clearer patterns.
cat("2. **Spearman Correlation**: Unlike Pearson, Spearman captures monotonic\n")
## 2. **Spearman Correlation**: Unlike Pearson, Spearman captures monotonic
cat(" relationships (not just linear), which better fits music data.\n\n")
## relationships (not just linear), which better fits music data.
cat("3. **Hierarchical Clustering**: The 'hclust' ordering groups similar features\n")
## 3. **Hierarchical Clustering**: The 'hclust' ordering groups similar features
cat(" together, making patterns easier to spot visually.\n\n")
## together, making patterns easier to spot visually.
cat("4. **Noise Reduction**: Removing edge cases (e.g., popular instrumental tracks)\n")
## 4. **Noise Reduction**: Removing edge cases (e.g., popular instrumental tracks)
cat(" that don't follow typical patterns.\n\n")
## that don't follow typical patterns.
cat("\n### Expected Correlation Ranges:\n")
##
## ### Expected Correlation Ranges:
cat("- |r| > 0.40: Strong correlation (highlighted in green)\n")
## - |r| > 0.40: Strong correlation (highlighted in green)
cat("- |r| 0.30-0.40: Moderate correlation\n")
## - |r| 0.30-0.40: Moderate correlation
cat("- |r| 0.20-0.30: Weak correlation\n")
## - |r| 0.20-0.30: Weak correlation
cat("- |r| < 0.20: Very weak/negligible correlation\n\n")
## - |r| < 0.20: Very weak/negligible correlation
cat("### Key Findings Summary:\n")
## ### Key Findings Summary:
for (genre in unique(popularity_correlations$genre)) {
genre_data <- popularity_correlations %>%
filter(genre == !!genre) %>%
arrange(desc(abs(after_2010)))
top_feature <- genre_data[1, ]
cat(sprintf("- **%s**: Strongest predictor is %s (r=%.3f after 2010)\n",
toupper(genre),
top_feature$feature,
top_feature$after_2010))
}
## - **POP**: Strongest predictor is instrumentalness (r=-0.180 after 2010)
## - **RAP**: Strongest predictor is instrumentalness (r=-0.146 after 2010)
## - **ROCK**: Strongest predictor is danceability (r=0.193 after 2010)
## - **LATIN**: Strongest predictor is energy (r=-0.122 after 2010)
## - **R&B**: Strongest predictor is instrumentalness (r=-0.193 after 2010)
# OPTIMIZATION STRATEGY: Focus on high-performing songs to strengthen correlations
# Create stratified dataset focusing on songs with clear performance signals
spotify_optimized <- spotify_clean_2 %>%
mutate(
popularity_category = case_when(
track_popularity >= 70 ~ "High",
track_popularity >= 40 ~ "Medium",
TRUE ~ "Low"
)
) %>%
# Filter to songs with more defined characteristics (reduce noise)
filter(
!(instrumentalness > 0.8 & track_popularity > 50),
!(speech_ratio > 0.66 & track_popularity > 60)
)
# Alternative: Focus on "hit songs" (popularity >= 60) for clearer patterns
spotify_hits <- spotify_clean_2 %>%
filter(track_popularity >= 60)
# Select the dataset for analysis
analysis_data <- spotify_hits # Change to spotify_optimized or spotify_clean_2 if needed
cat("Dataset selected: Hit Songs (popularity >= 60)\n")
## Dataset selected: Hit Songs (popularity >= 60)
cat("Original dataset size:", nrow(spotify_clean_2), "\n")
## Original dataset size: 32801
cat("Optimized dataset size:", nrow(analysis_data), "\n")
## Optimized dataset size: 9452
cat("Reduction:", round((1 - nrow(analysis_data)/nrow(spotify_clean_2)) * 100, 1), "%\n")
## Reduction: 71.2 %
# Function: Analyze loudness correlation for each genre
analyze_loudness_correlation <- function(data, genre_name, period_label, year_threshold = 2010) {
# Filter and prepare data - ONLY loudness and popularity
genre_data <- data %>%
filter(playlist_genre == genre_name) %>%
select(track_popularity, loudness)
# Check sample size adequacy
n_obs <- nrow(genre_data)
if (n_obs < MIN_SAMPLE_SIZE) {
warning(paste0("Insufficient data for ", genre_name, " - ", period_label,
" (n=", n_obs, ", required: ", MIN_SAMPLE_SIZE, ")"))
return(NULL)
}
# Compute correlation using Spearman method
cor_value <- cor(genre_data$track_popularity, genre_data$loudness,
method = "spearman", use = "pairwise.complete.obs")
# Compute Pearson as well for comparison
cor_pearson <- cor(genre_data$track_popularity, genre_data$loudness,
method = "pearson", use = "pairwise.complete.obs")
return(list(
correlation_spearman = cor_value,
correlation_pearson = cor_pearson,
sample_size = n_obs,
genre = genre_name,
period = period_label,
data = genre_data
))
}
# Function: Create scatter plot with correlation
plot_loudness_correlation <- function(cor_result, main_title = NULL) {
if (is.null(cor_result)) return(invisible(NULL))
# Create scatter plot
p <- ggplot(cor_result$data, aes(x = loudness, y = track_popularity)) +
geom_hex(bins = 40, alpha = 0.8) +
geom_smooth(method = "lm", color = "#E74C3C", size = 1.5, se = TRUE, alpha = 0.2) +
geom_smooth(method = "loess", color = "#3498DB", size = 1.2, linetype = "dashed", se = FALSE) +
scale_fill_gradient(low = "#FFFFBF", high = "#1A9850", name = "Song\nDensity") +
labs(
title = ifelse(!is.null(main_title), main_title,
paste0(cor_result$genre, " - ", cor_result$period)),
subtitle = sprintf("Spearman r = %.3f | Pearson r = %.3f | n = %d",
cor_result$correlation_spearman,
cor_result$correlation_pearson,
cor_result$sample_size),
x = "Loudness (dB)",
y = "Track Popularity"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "right"
)
return(p)
}
# Main analysis pipeline
generate_loudness_comparison <- function(data, genres, year_cutoff = 2010) {
data_before <- data %>% filter(release_year < year_cutoff)
data_after <- data %>% filter(release_year >= year_cutoff)
for (genre in genres) {
cat("\n\n### Genre Analysis:", toupper(genre), "\n")
# Get sample sizes
n_before <- data_before %>% filter(playlist_genre == genre) %>% nrow()
n_after <- data_after %>% filter(playlist_genre == genre) %>% nrow()
cat("Sample sizes - Before 2010:", n_before, "| After 2010:", n_after, "\n")
# Analyze both periods
cor_before <- analyze_loudness_correlation(data_before, genre, "Before 2010")
cor_after <- analyze_loudness_correlation(data_after, genre, "After 2010")
if (is.null(cor_before) || is.null(cor_after)) {
cat("Skipped due to insufficient data.\n")
next
}
# Display correlation values
cat(sprintf("\nLoudness-Popularity Correlation (Spearman):\n"))
cat(sprintf(" Before 2010: r = %.3f\n", cor_before$correlation_spearman))
cat(sprintf(" After 2010: r = %.3f\n", cor_after$correlation_spearman))
cat(sprintf(" Change: Δr = %.3f\n",
cor_after$correlation_spearman - cor_before$correlation_spearman))
# Create side-by-side plots
p1 <- plot_loudness_correlation(cor_before,
paste0(toupper(genre), " - Before 2010"))
p2 <- plot_loudness_correlation(cor_after,
paste0(toupper(genre), " - After 2010"))
# Display plots side by side using gridExtra
grid.arrange(p1, p2, ncol = 2,
top = textGrob(paste0("Loudness vs Popularity: ", toupper(genre)),
gp = gpar(fontsize = 16, fontface = "bold")))
cat("\n", strrep("-", 80), "\n")
}
}
# Execute analysis
library(gridExtra)
library(grid)
genres <- unique(analysis_data$playlist_genre)
generate_loudness_comparison(analysis_data, genres)
##
##
## ### Genre Analysis: POP
## Sample sizes - Before 2010: 229 | After 2010: 1911
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.095
## After 2010: r = 0.001
## Change: Δr = -0.094
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: RAP
## Sample sizes - Before 2010: 289 | After 2010: 1223
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.111
## After 2010: r = 0.087
## Change: Δr = -0.025
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: ROCK
## Sample sizes - Before 2010: 1294 | After 2010: 208
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.075
## After 2010: r = -0.026
## Change: Δr = -0.101
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: LATIN
## Sample sizes - Before 2010: 214 | After 2010: 1631
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.072
## After 2010: r = 0.045
## Change: Δr = -0.027
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: R&B
## Sample sizes - Before 2010: 299 | After 2010: 1243
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.087
## After 2010: r = 0.137
## Change: Δr = 0.051
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: EDM
## Sample sizes - Before 2010: 16 | After 2010: 895
## Skipped due to insufficient data.
# Create comprehensive summary table
compute_loudness_statistics <- function(data, genres, year_cutoff = 2010) {
results_list <- list()
for (genre in genres) {
# Before 2010
before_data <- data %>%
filter(playlist_genre == genre, release_year < year_cutoff) %>%
select(track_popularity, loudness)
# After 2010
after_data <- data %>%
filter(playlist_genre == genre, release_year >= year_cutoff) %>%
select(track_popularity, loudness)
if (nrow(before_data) < MIN_SAMPLE_SIZE || nrow(after_data) < MIN_SAMPLE_SIZE) {
next
}
# Compute correlations
cor_before_spearman <- cor(before_data$track_popularity, before_data$loudness,
method = "spearman", use = "complete.obs")
cor_after_spearman <- cor(after_data$track_popularity, after_data$loudness,
method = "spearman", use = "complete.obs")
cor_before_pearson <- cor(before_data$track_popularity, before_data$loudness,
method = "pearson", use = "complete.obs")
cor_after_pearson <- cor(after_data$track_popularity, after_data$loudness,
method = "pearson", use = "complete.obs")
# Fisher's Z transformation for significance testing
z_before <- 0.5 * log((1 + cor_before_spearman) / (1 - cor_before_spearman))
z_after <- 0.5 * log((1 + cor_after_spearman) / (1 - cor_after_spearman))
se_diff <- sqrt(1/(nrow(before_data) - 3) + 1/(nrow(after_data) - 3))
z_stat <- (z_after - z_before) / se_diff
p_value <- 2 * pnorm(-abs(z_stat))
# Average loudness by period
avg_loudness_before <- mean(before_data$loudness, na.rm = TRUE)
avg_loudness_after <- mean(after_data$loudness, na.rm = TRUE)
results_list[[genre]] <- tibble(
Genre = genre,
`Before 2010 (Spearman)` = round(cor_before_spearman, 3),
`After 2010 (Spearman)` = round(cor_after_spearman, 3),
`Change` = round(cor_after_spearman - cor_before_spearman, 3),
`Before 2010 (Pearson)` = round(cor_before_pearson, 3),
`After 2010 (Pearson)` = round(cor_after_pearson, 3),
`Z-Statistic` = round(z_stat, 2),
`P-Value` = round(p_value, 4),
`Significant` = ifelse(p_value < 0.05, "Yes", "No"),
`Avg Loudness Before` = round(avg_loudness_before, 2),
`Avg Loudness After` = round(avg_loudness_after, 2),
`n Before` = nrow(before_data),
`n After` = nrow(after_data)
)
}
bind_rows(results_list)
}
# Generate statistics
loudness_stats <- compute_loudness_statistics(analysis_data, genres)
# Display comprehensive table
kable(loudness_stats,
caption = "Loudness-Popularity Correlation Analysis by Genre (Before vs After 2010)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE,
font_size = 12) %>%
row_spec(which(loudness_stats$Significant == "Yes"),
bold = TRUE,
background = "#E8F5E9") %>%
row_spec(which(abs(loudness_stats$Change) > 0.15),
color = "white",
background = "#1DB954") %>%
add_header_above(c(" " = 1, "Spearman Correlation" = 3, "Pearson Correlation" = 2,
"Statistical Test" = 3, "Descriptive Stats" = 2, "Sample Size" = 2))
| Genre | Before 2010 (Spearman) | After 2010 (Spearman) | Change | Before 2010 (Pearson) | After 2010 (Pearson) | Z-Statistic | P-Value | Significant | Avg Loudness Before | Avg Loudness After | n Before | n After |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pop | 0.095 | 0.001 | -0.094 | 0.131 | 0.005 | -1.34 | 0.1793 | No | -6.88 | -5.69 | 229 | 1911 |
| rap | 0.111 | 0.087 | -0.025 | 0.078 | 0.119 | -0.38 | 0.7043 | No | -6.41 | -6.59 | 289 | 1223 |
| rock | 0.075 | -0.026 | -0.101 | 0.066 | -0.031 | -1.35 | 0.1782 | No | -7.97 | -5.97 | 1294 | 208 |
| latin | 0.072 | 0.045 | -0.027 | 0.075 | 0.017 | -0.38 | 0.7062 | No | -6.52 | -5.31 | 214 | 1631 |
| r&b | 0.087 | 0.137 | 0.051 | 0.094 | 0.121 | 0.79 | 0.4267 | No | -7.76 | -7.05 | 299 | 1243 |
# Visualize correlation changes across genres
loudness_plot_data <- loudness_stats %>%
select(Genre, `Before 2010 (Spearman)`, `After 2010 (Spearman)`, Change, Significant) %>%
pivot_longer(cols = c(`Before 2010 (Spearman)`, `After 2010 (Spearman)`),
names_to = "Period",
values_to = "Correlation") %>%
mutate(Period = gsub(" \\(Spearman\\)", "", Period))
ggplot(loudness_plot_data, aes(x = Genre, y = Correlation, fill = Period)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7, alpha = 0.9) +
geom_text(aes(label = sprintf("%.3f", Correlation)),
position = position_dodge(width = 0.7),
vjust = -0.5, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("Before 2010" = "#E74C3C", "After 2010" = "#1A9850")) +
labs(
title = "Loudness-Popularity Correlation by Genre: Evolution Over Time",
subtitle = "How the relationship between loudness and popularity changed after 2010",
x = "Genre",
y = "Spearman Correlation Coefficient",
fill = "Time Period",
caption = "Positive values = louder songs tend to be more popular"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 15, hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
axis.text.x = element_text(angle = 45, hjust = 1, size = 11, face = "bold"),
legend.position = "top",
panel.grid.major.x = element_blank()
) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50")
cat("\n## KEY INSIGHTS: LOUDNESS & POPULARITY\n\n")
##
## ## KEY INSIGHTS: LOUDNESS & POPULARITY
for (i in 1:nrow(loudness_stats)) {
row <- loudness_stats[i, ]
cat("###", toupper(row$Genre), "\n")
cat(sprintf("- Correlation Before 2010: r = %.3f\n", row$`Before 2010 (Spearman)`))
cat(sprintf("- Correlation After 2010: r = %.3f\n", row$`After 2010 (Spearman)`))
cat(sprintf("- Change: Δr = %.3f", row$Change))
if (row$Significant == "Yes") {
cat(" ✓ STATISTICALLY SIGNIFICANT\n")
} else {
cat(" (not significant)\n")
}
# Interpretation
if (row$`After 2010 (Spearman)` > 0.2) {
cat(sprintf("- Interpretation: Louder songs tend to be MORE popular in %s (moderate-strong effect)\n", row$Genre))
} else if (row$`After 2010 (Spearman)` > 0) {
cat(sprintf("- Interpretation: Slight preference for louder songs in %s\n", row$Genre))
} else {
cat(sprintf("- Interpretation: Loudness has little/no effect on popularity in %s\n", row$Genre))
}
cat("\n")
}
## ### POP
## - Correlation Before 2010: r = 0.095
## - Correlation After 2010: r = 0.001
## - Change: Δr = -0.094 (not significant)
## - Interpretation: Slight preference for louder songs in pop
##
## ### RAP
## - Correlation Before 2010: r = 0.111
## - Correlation After 2010: r = 0.087
## - Change: Δr = -0.025 (not significant)
## - Interpretation: Slight preference for louder songs in rap
##
## ### ROCK
## - Correlation Before 2010: r = 0.075
## - Correlation After 2010: r = -0.026
## - Change: Δr = -0.101 (not significant)
## - Interpretation: Loudness has little/no effect on popularity in rock
##
## ### LATIN
## - Correlation Before 2010: r = 0.072
## - Correlation After 2010: r = 0.045
## - Change: Δr = -0.027 (not significant)
## - Interpretation: Slight preference for louder songs in latin
##
## ### R&B
## - Correlation Before 2010: r = 0.087
## - Correlation After 2010: r = 0.137
## - Change: Δr = 0.051 (not significant)
## - Interpretation: Slight preference for louder songs in r&b
cat("\n## OVERALL SUMMARY\n")
##
## ## OVERALL SUMMARY
cat(sprintf("- Genres where loudness matters MOST after 2010: %s\n",
paste(loudness_stats %>%
arrange(desc(`After 2010 (Spearman)`)) %>%
head(3) %>%
pull(Genre), collapse = ", ")))
## - Genres where loudness matters MOST after 2010: r&b, rap, latin
cat(sprintf("- Biggest correlation increase: %s (Δr = %.3f)\n",
loudness_stats %>%
arrange(desc(Change)) %>%
slice(1) %>%
pull(Genre),
loudness_stats %>%
arrange(desc(Change)) %>%
slice(1) %>%
pull(Change)))
## - Biggest correlation increase: r&b (Δr = 0.051)
cat(sprintf("- Statistically significant changes: %d out of %d genres\n",
sum(loudness_stats$Significant == "Yes"),
nrow(loudness_stats)))
## - Statistically significant changes: 0 out of 5 genres
# Filter data to include songs with popularity >= 20
spotify_filtered <- spotify_clean_2 %>%
filter(track_popularity >= 20)
# Correlation for songs BEFORE 2010 (popularity >= 20)
cor_before <- spotify_filtered %>%
filter(release_year < 2010) %>%
select(track_popularity, danceability, energy, loudness,
acousticness, instrumentalness, liveness, positivity,
tempo, duration_min, speech_ratio) %>%
cor(use = "complete.obs")
# Correlation for songs AFTER 2010 (popularity >= 20)
cor_after <- spotify_filtered %>%
filter(release_year >= 2010) %>%
select(track_popularity, danceability, energy, loudness,
acousticness, instrumentalness, liveness, positivity,
tempo, duration_min, speech_ratio) %>%
cor(use = "complete.obs")
# Set up side-by-side plots
par(mfrow = c(1, 2))
# Plot Before 2010
corrplot(cor_before,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.6,
col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
title = "Before 2010 (Popularity >= 20)",
mar = c(0,0,2,0),
tl.cex = 0.8)
# Plot After 2010
corrplot(cor_after,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.6,
col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
title = "After 2010 (Popularity >= 20)",
mar = c(0,0,2,0),
tl.cex = 0.8)
# Reset plot layout
par(mfrow = c(1, 1))
# Display sample size information
cat("Sample sizes for correlation analysis (Popularity >= 20):\n")
## Sample sizes for correlation analysis (Popularity >= 20):
cat("Before 2010:", nrow(spotify_filtered %>% filter(release_year < 2010)), "songs\n")
## Before 2010: 6262 songs
cat("After 2010:", nrow(spotify_filtered %>% filter(release_year >= 2010)), "songs\n")
## After 2010: 19341 songs
# Filter data to include songs with popularity >= 20
analysis_data_updated <- spotify_clean_2 %>%
filter(track_popularity >= 20)
# Function: Analyze loudness correlation for each genre
analyze_loudness_correlation <- function(data, genre_name, period_label, year_threshold = 2010) {
# Filter and prepare data
genre_data <- data %>%
filter(playlist_genre == genre_name) %>%
select(track_popularity, loudness)
# Check sample size adequacy
n_obs <- nrow(genre_data)
MIN_SAMPLE_SIZE <- 30
if (n_obs < MIN_SAMPLE_SIZE) {
warning(paste0("Insufficient data for ", genre_name, " - ", period_label,
" (n=", n_obs, ", required: ", MIN_SAMPLE_SIZE, ")"))
return(NULL)
}
# Compute correlations
cor_value <- cor(genre_data$track_popularity, genre_data$loudness,
method = "spearman", use = "pairwise.complete.obs")
cor_pearson <- cor(genre_data$track_popularity, genre_data$loudness,
method = "pearson", use = "pairwise.complete.obs")
return(list(
correlation_spearman = cor_value,
correlation_pearson = cor_pearson,
sample_size = n_obs,
genre = genre_name,
period = period_label,
data = genre_data
))
}
# Function: Create scatter plot with correlation
plot_loudness_correlation <- function(cor_result, main_title = NULL) {
if (is.null(cor_result)) return(invisible(NULL))
# Create scatter plot
p <- ggplot(cor_result$data, aes(x = loudness, y = track_popularity)) +
geom_hex(bins = 40, alpha = 0.8) +
geom_smooth(method = "lm", color = "#E74C3C", size = 1.5, se = TRUE, alpha = 0.2) +
geom_smooth(method = "loess", color = "#3498DB", size = 1.2, linetype = "dashed", se = FALSE) +
scale_fill_gradient(low = "#FFFFBF", high = "#1A9850", name = "Song\nDensity") +
labs(
title = ifelse(!is.null(main_title), main_title,
paste0(cor_result$genre, " - ", cor_result$period)),
subtitle = sprintf("Spearman r = %.3f | Pearson r = %.3f | n = %d",
cor_result$correlation_spearman,
cor_result$correlation_pearson,
cor_result$sample_size),
x = "Loudness (dB)",
y = "Track Popularity"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "right"
)
return(p)
}
# Main analysis pipeline
generate_loudness_comparison <- function(data, genres, year_cutoff = 2010) {
data_before <- data %>% filter(release_year < year_cutoff)
data_after <- data %>% filter(release_year >= year_cutoff)
for (genre in genres) {
cat("\n\n### Genre Analysis:", toupper(genre), "\n")
# Get sample sizes
n_before <- data_before %>% filter(playlist_genre == genre) %>% nrow()
n_after <- data_after %>% filter(playlist_genre == genre) %>% nrow()
cat("Sample sizes - Before 2010:", n_before, "| After 2010:", n_after, "\n")
# Analyze both periods
cor_before <- analyze_loudness_correlation(data_before, genre, "Before 2010")
cor_after <- analyze_loudness_correlation(data_after, genre, "After 2010")
if (is.null(cor_before) || is.null(cor_after)) {
cat("Skipped due to insufficient data.\n")
next
}
# Display correlation values
cat(sprintf("\nLoudness-Popularity Correlation (Spearman):\n"))
cat(sprintf(" Before 2010: r = %.3f\n", cor_before$correlation_spearman))
cat(sprintf(" After 2010: r = %.3f\n", cor_after$correlation_spearman))
cat(sprintf(" Change: Δr = %.3f\n",
cor_after$correlation_spearman - cor_before$correlation_spearman))
# Create side-by-side plots
p1 <- plot_loudness_correlation(cor_before,
paste0(toupper(genre), " - Before 2010"))
p2 <- plot_loudness_correlation(cor_after,
paste0(toupper(genre), " - After 2010"))
# Display plots side by side
grid.arrange(p1, p2, ncol = 2,
top = textGrob(paste0("Loudness vs Popularity: ", toupper(genre)),
gp = gpar(fontsize = 16, fontface = "bold")))
cat("\n", strrep("-", 80), "\n")
}
}
# Execute analysis with filtered data
library(gridExtra)
library(grid)
genres <- unique(analysis_data_updated$playlist_genre)
generate_loudness_comparison(analysis_data_updated, genres)
##
##
## ### Genre Analysis: POP
## Sample sizes - Before 2010: 500 | After 2010: 4034
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.107
## After 2010: r = 0.169
## Change: Δr = 0.062
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: RAP
## Sample sizes - Before 2010: 1128 | After 2010: 3522
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.139
## After 2010: r = 0.077
## Change: Δr = -0.063
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: ROCK
## Sample sizes - Before 2010: 2584 | After 2010: 1170
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.097
## After 2010: r = 0.116
## Change: Δr = 0.018
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: LATIN
## Sample sizes - Before 2010: 603 | After 2010: 3603
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.210
## After 2010: r = 0.252
## Change: Δr = 0.042
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: R&B
## Sample sizes - Before 2010: 1368 | After 2010: 2728
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.112
## After 2010: r = 0.128
## Change: Δr = 0.016
##
## --------------------------------------------------------------------------------
##
##
## ### Genre Analysis: EDM
## Sample sizes - Before 2010: 79 | After 2010: 4284
##
## Loudness-Popularity Correlation (Spearman):
## Before 2010: r = 0.150
## After 2010: r = 0.064
## Change: Δr = -0.086
##
## --------------------------------------------------------------------------------
cat("\n\nDATA SUMMARY (Popularity >= 20):\n")
##
##
## DATA SUMMARY (Popularity >= 20):
cat("Total songs analyzed:", nrow(analysis_data_updated), "\n")
## Total songs analyzed: 25603
cat("Original dataset size:", nrow(spotify_clean_2), "\n")
## Original dataset size: 32801
cat("Songs excluded (popularity < 20):", nrow(spotify_clean_2) - nrow(analysis_data_updated), "\n")
## Songs excluded (popularity < 20): 7198