Is there a Formula for a Hit Song?

1. Introduction

1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?

As social media has grown, many people experience FOMO (fear of missing out); 69% of people in the U.S. report having experienced it. This can lead them to choose music based on its popularity. Although the creative process is inherently subjective, is there a “formula” for a hit song? In this project, we aim to provide an analytical report that both describes “market trends” and highlights “specific opportunities” within the existing data set. Creators can use these findings to align their sound with market trends while keeping their unique voice, and to make more informed, data-driven decisions about what to promote.

Our data set covers releases from 1957 to 2020. The rise of social media, particularly since 2010, has reshaped listening behavior through phenomena such as FOMO. By comparing song trends before and after the social media boom, we aim to examine how popularity-driven consumption has influenced musical characteristics and how creators can adapt in a socially amplified market.

1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed).

We use the Spotify data set from the TidyTuesday repository on GitHub. We will identify the variables most relevant to our problem statement, manipulate the data into a form suited to the questions we are asking, and visualize it with ggplot2 (histograms, scatter plots, and correlation plots). The visualizations focus on the attributes listeners look for in music, with danceability, liveness, and energy as our main dimensions.

1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.

Regression analysis allows us to explore the relationship between each variable and a song’s popularity, helping identify which features have the strongest impact.

Audio features ↔︎ popularity
Genre ↔︎ popularity
Artists ↔︎ popularity
Release timing (seasonal effect) ↔︎ popularity
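
As a minimal sketch of this approach, the block below fits a linear model on a small simulated data frame. The `toy` data, its values, and the relationship used to generate the response are made up for illustration; only the column names mirror the Spotify data set.

```r
# Simulated stand-in for the Spotify data (all values are made up).
set.seed(42)
n <- 200
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  loudness     = runif(n, min = -20, max = 0)
)

# Assumed relationship, used only to generate a toy response.
toy$track_popularity <- 40 + 10 * toy$danceability +
  0.5 * toy$loudness + rnorm(n, sd = 5)

# Regress popularity on the audio features.
fit <- lm(track_popularity ~ danceability + energy + loudness, data = toy)

# Coefficient table: estimate, standard error, t value, p value.
coef_table <- coef(summary(fit))
print(round(coef_table, 3))
```

With the real data, `broom::tidy(fit)` would return a comparable table as a data frame, which is easier to pass to a table or plotting function.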

1.4 Explain how your analysis will help the consumer of your analysis.

Artists and producers can make strategic choices to increase the reach of their music, while talent scouts can identify artists with high commercial potential. Even music lovers can discover hidden gems that have long been overlooked in their playlists.

2. Packages Required

2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.

library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(knitr)
library(kableExtra)
library(hexbin)
library(corrplot)
library(purrr)
library(broom)
library(gridExtra)
library(grid)

2.2 Messages and warnings resulting from loading the package are suppressed.
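
One way to do this, sketched under the assumption that the report is knitted from R Markdown, is to wrap the `library()` calls in `suppressPackageStartupMessages()`; setting the chunk options `message = FALSE` and `warning = FALSE` has a similar effect. The package vector below is a placeholder that uses base packages so the sketch runs anywhere.

```r
# Placeholder list; substitute the packages loaded in section 2.1.
pkgs <- c("stats", "utils")

# Attach each package while hiding its startup messages.
loaded <- suppressPackageStartupMessages(
  vapply(pkgs, require, logical(1), character.only = TRUE, quietly = TRUE)
)

# TRUE for every package that was attached successfully.
print(loaded)
```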

2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).

More packages may be added as the analysis develops:

library(tidyverse) - A comprehensive toolkit for data science workflows, including data import, cleaning, transformation, visualization, and integration.
library(dplyr) - Used for data manipulation.
library(tidyr) - Reshaping and organizing data.
library(ggplot2) - Create beautiful, flexible plots.
library(lubridate) - Work with dates and times.
library(knitr) - Dynamic report generation.
library(kableExtra) - Enhanced table styling.
library(hexbin) - Visualize density in large-scale scatter plots.
library(corrplot) - Visually display correlations between numeric variables to identify patterns.
library(purrr) - Apply functions to lists and nested data.
library(broom) - Convert statistical model outputs into data frames.
library(gridExtra) - Arrange multiple plots into a single comparison.
library(grid) - Create graphical layouts through low-level functions for arranging visual elements.

3. Data Preparation

3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.

We use the Spotify data set “spotify_songs.csv” from the course material, which is also available from the TidyTuesday repository on GitHub.
- Spotify Data Set

3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).

Origin: Part of the TidyTuesday weekly data project for practicing R skills.
Purpose: Designed to help users learn data wrangling and visualization using ggplot2, dplyr, tidyr, and other tidyverse tools.
Community: Created by members of the R4DS Online Learning Community, inspired by the “R for Data Science” textbook.
Source: Data collected from Spotify via the spotifyr package.
Date Created: January 21, 2020.
Authors: Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff.
Size: 32,833 records and 23 variables.
Content: Includes track metadata (e.g., artist, album, genre) and musical features (e.g., danceability, energy, valence).
Missing Values: Recorded as NA; no imputation was applied.
Use Case: Ideal for exploratory analysis, genre comparison, and building visualizations.

3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.

Import the data set into RStudio:

spotify <- read.csv("C:/Users/samc8/OneDrive - Xavier University/Data Wrangling/Week 4/spotify_songs (2).csv")

View structure of the data set:

str(spotify)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

View summary statistics of the data set:

summary(spotify)

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.719  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

3.3.2 Delete Unused Variables:

playlist_subgenre: Contains 24 distinct sub-genres that introduce noise and fragmentation. Removed to avoid overfitting or misleading groupings in genre-based analysis.
playlist_id: Unique identifier with no analytical value. Dropped to reduce dimensionality and avoid clutter.
track_album_id / track_id: Technical identifiers used for database referencing, not meaningful for visualization or modeling. Removed to streamline the data set.

spotify$playlist_id <- NULL
spotify$track_album_id <- NULL
spotify$track_id <- NULL
spotify$playlist_subgenre <- NULL

3.3.3 Renaming Variables

colnames(spotify) <- c("track_name", "track_artist", "track_popularity", "track_album_name",
                       "track_album_release_date", "playlist_name", "playlist_genre", "danceability",
                       "energy", "key", "loudness", "mode", "speech_ratio",
                       "acousticness", "instrumentalness", "liveness", "positivity",
                       "tempo", "duration_ms")

Renamed “speechiness” to speech_ratio to clarify that the variable reflects the proportion of spoken content in a track.
Renamed “valence” to positivity to make the emotional tone more intuitive and easier to interpret.
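
Assigning all column names by position works only while the column order stays fixed. A sketch of a name-based alternative in base R, shown on a hypothetical two-column `toy` frame:

```r
toy <- data.frame(speechiness = c(0.05, 0.10), valence = c(0.5, 0.7))

# Map old names to new names explicitly, independent of column order.
rename_map <- c(speechiness = "speech_ratio", valence = "positivity")
hits <- names(toy) %in% names(rename_map)
names(toy)[hits] <- unname(rename_map[names(toy)[hits]])

print(names(toy))
```

With dplyr, `rename(spotify, speech_ratio = speechiness, positivity = valence)` expresses the same mapping in one call.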

3.3.4 Issues with missing values

colSums(is.na(spotify))

##               track_name             track_artist         track_popularity 
##                        5                        5                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        5                        0                        0 
##           playlist_genre             danceability                   energy 
##                        0                        0                        0 
##                      key                 loudness                     mode 
##                        0                        0                        0 
##             speech_ratio             acousticness         instrumentalness 
##                        0                        0                        0 
##                 liveness               positivity                    tempo 
##                        0                        0                        0 
##              duration_ms 
##                        0

Find Missing Values:

There are 15 missing values:
- 5 missing values from track_name
- 5 missing values from track_artist
- 5 missing values from track_album_name

Remove Missing Values:

spotify_clean <- na.omit(spotify)
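
Because `na.omit()` drops whole rows, the number of rows removed can be smaller than the number of missing cells when several missing values share a row. A minimal sketch on a hypothetical toy frame:

```r
toy <- data.frame(
  track_name   = c("A", NA, "C", NA),
  track_artist = c("X", NA, "Z", "W"),
  popularity   = c(10, 20, 30, 40)
)

missing_cells <- sum(is.na(toy))            # total NA cells
incomplete    <- sum(!complete.cases(toy))  # rows with at least one NA
toy_clean     <- na.omit(toy)

cat("Missing cells:", missing_cells, "\n")
cat("Rows dropped: ", nrow(toy) - nrow(toy_clean), "\n")
```

Running the same two counts on `spotify` before and after `na.omit()` would show how many of the removed rows are due to missing values rather than later filtering.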

3.3.5 - Conversion of date format.

We convert the album release date to a proper Date type. Some entries contain only a year or only a year and month, so these are padded to the first day of the period before parsing.

spotify_clean <- spotify_clean %>%
  mutate(
    track_album_release_date = as.character(track_album_release_date),
    track_album_release_date = case_when(
      grepl("^\\d{4}$", track_album_release_date) ~ paste0(track_album_release_date, "-01-01"),
      grepl("^\\d{4}-\\d{2}$", track_album_release_date) ~ paste0(track_album_release_date, "-01"),
      TRUE ~ track_album_release_date
    ),
    track_album_release_date = case_when(
      grepl("^\\d{4}-\\d{2}-\\d{2}$", track_album_release_date) ~ as.Date(track_album_release_date, format = "%Y-%m-%d"),
      grepl("^\\d{1,2}/\\d{1,2}/\\d{4}$", track_album_release_date) ~ as.Date(track_album_release_date, format = "%m/%d/%Y"),
      TRUE ~ NA_Date_
    )
  )
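
To illustrate the padding logic on its own, here is a base R sketch applied to three hypothetical date strings. The helper name `pad_partial_date` is an assumption for this example, and it only covers the year-only and year-month cases.

```r
pad_partial_date <- function(x) {
  x <- as.character(x)
  # Year only: assume January 1.
  x <- ifelse(grepl("^\\d{4}$", x), paste0(x, "-01-01"), x)
  # Year and month: assume the first day of the month.
  x <- ifelse(grepl("^\\d{4}-\\d{2}$", x), paste0(x, "-01"), x)
  as.Date(x, format = "%Y-%m-%d")
}

parsed <- pad_partial_date(c("1975", "1998-07", "2019-06-14"))
print(parsed)
```

Padding to the first day keeps the year intact, which is all the later `release_year` variable needs, but it does mean month and day values for these rows are placeholders.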

3.3.6 - Creation of new dimension.

We create two new variables:

release_year - Created the “release_year” column to enable year-based analysis of track trends, allowing for easier aggregation and comparison over time.
duration_min - Since song duration was originally stored in milliseconds, we created a new variable “duration_min” to express it in minutes, making comparisons and visualizations more intuitive.

spotify_clean <- spotify_clean %>%
  mutate(
    release_year = lubridate::year(track_album_release_date),
    duration_min = duration_ms / 60000
  )

3.3.7.1 - Finding Outliers.

boxplot(spotify_clean$duration_min,
        main = "Boxplot of Song Duration (min)",
        ylab = "Duration (minutes)")

3.3.7.2 - Removal of Outliers.

To avoid excluding valid songs with unusually long or short durations, we apply an asymmetric threshold: 4 × IQR above the third quartile and 2 × IQR below the first quartile. This approach broadens the acceptable range while still filtering extreme values, helping preserve meaningful variation in the data set without misclassifying legitimate entries as outliers.

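# Quartiles of song duration
Q1 <- quantile(spotify_clean$duration_min, 0.25, na.rm = TRUE)
Q3 <- quantile(spotify_clean$duration_min, 0.75, na.rm = TRUE)
# Named iqr_value so the base function IQR() is not masked
iqr_value <- Q3 - Q1
# Asymmetric bounds: 4 x IQR above Q3, 2 x IQR below Q1
upper_bound <- Q3 + 4 * iqr_value
lower_bound <- Q1 - 2 * iqr_value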

spotify_clean_2 <- spotify_clean[
  spotify_clean$duration_min >= lower_bound & spotify_clean$duration_min <= upper_bound, ]

boxplot(spotify_clean_2$duration_min,
        main = "Boxplot of Song Duration (min, no outliers)",
        ylab = "Duration (minutes)")

length(spotify_clean_2$duration_min)

## [1] 32801

After cleaning, 32 rows were removed in total: the rows with missing values dropped by na.omit() plus the songs whose duration fell outside the outlier bounds.

Original data set: 32833 observations.
Cleaned data set: 32801 observations.

Description:

The interquartile range (IQR), also known as the midspread, measures the spread of the middle 50% of the data. Using it to define outlier bounds lets us assess variability in song duration and flag extreme values relative to the distribution of the whole data set.
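
A small worked example of the asymmetric bounds described above, using made-up durations in minutes:

```r
# Hypothetical durations; the last value is deliberately extreme.
durations <- c(2.5, 3.0, 3.2, 3.5, 3.8, 4.0, 4.5, 12.0)

q1 <- unname(quantile(durations, 0.25))
q3 <- unname(quantile(durations, 0.75))
iqr_value <- q3 - q1

# Wider allowance above Q3 than below Q1.
lower_bound <- q1 - 2 * iqr_value
upper_bound <- q3 + 4 * iqr_value

kept <- durations[durations >= lower_bound & durations <= upper_bound]
cat("Bounds:", lower_bound, "to", upper_bound, "\n")
cat("Kept", length(kept), "of", length(durations), "values\n")
```

In this toy vector only the 12-minute value falls outside the bounds, so seven of the eight values are kept.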

Create release_period variable:

spotify_clean_2 <- spotify_clean_2 %>%
  mutate(release_period = case_when(
    is.na(track_album_release_date) ~ "NA",
    year(track_album_release_date) < 2010 ~ "Before 2010",
    TRUE ~ "After 2010"
  ))

3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.

kableExtra::scroll_box(
  kableExtra::kable_paper(
    kableExtra::kbl(head(spotify_clean_2, 10))
  ),
  width = "700px",
  height = "300px"
)

track_name	track_artist	track_popularity	track_album_name	track_album_release_date	playlist_name	playlist_genre	danceability	energy	key	loudness	mode	speech_ratio	acousticness	instrumentalness	liveness	positivity	tempo	duration_ms	release_year	duration_min	release_period
I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754	2019	3.245900	After 2010
Memories - Dillon Francis Remix	Maroon 5	67	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600	2019	2.710000	After 2010
All the Time - Don Diablo Remix	Zara Larsson	70	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616	2019	2.943600	After 2010
Call You Mine - Keanu Silva Remix	The Chainsmokers	60	Call You Mine - The Remixes	2019-07-19	Pop Remix	pop	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093	2019	2.818217	After 2010
Someone You Loved - Future Humans Remix	Lewis Capaldi	69	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	pop	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052	2019	3.150867	After 2010
Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	Pop Remix	pop	0.675	0.919	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049	2019	2.717483	After 2010
Never Really Over - R3HAB Remix	Katy Perry	62	Never Really Over (R3HAB Remix)	2019-07-26	Pop Remix	pop	0.449	0.856	5	-4.788	0	0.0623	0.1870	0.00e+00	0.1760	0.152	112.648	187675	2019	3.127917	After 2010
Post Malone (feat. RANI) - GATTÜSO Remix	Sam Feldt	69	Post Malone (feat. RANI) [GATTÜSO Remix]	2019-08-29	Pop Remix	pop	0.542	0.903	4	-2.419	0	0.0434	0.0335	4.80e-06	0.1110	0.367	127.936	207619	2019	3.460317	After 2010
Tough Love - Tiësto Remix / Radio Edit	Avicii	68	Tough Love (Tiësto Remix)	2019-06-14	Pop Remix	pop	0.594	0.935	8	-3.562	1	0.0565	0.0249	4.00e-06	0.6370	0.366	127.015	193187	2019	3.219783	After 2010
If I Can’t Have You - Gryffin Remix	Shawn Mendes	67	If I Can’t Have You (Gryffin Remix)	2019-06-20	Pop Remix	pop	0.642	0.818	2	-4.552	1	0.0320	0.0567	0.00e+00	0.0919	0.590	124.957	253040	2019	4.217333	After 2010

3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

summary(spotify_clean_2[,10:19])

##       key            loudness            mode         speech_ratio   
##  Min.   : 0.000   Min.   :-46.448   Min.   :0.0000   Min.   :0.0224  
##  1st Qu.: 2.000   1st Qu.: -8.167   1st Qu.:0.0000   1st Qu.:0.0410  
##  Median : 6.000   Median : -6.164   Median :1.0000   Median :0.0625  
##  Mean   : 5.375   Mean   : -6.715   Mean   :0.5656   Mean   :0.1070  
##  3rd Qu.: 9.000   3rd Qu.: -4.644   3rd Qu.:1.0000   3rd Qu.:0.1320  
##  Max.   :11.000   Max.   :  1.275   Max.   :1.0000   Max.   :0.9180  
##   acousticness       instrumentalness      liveness         positivity     
##  Min.   :0.0000014   Min.   :0.000000   Min.   :0.00936   Min.   :0.00001  
##  1st Qu.:0.0151000   1st Qu.:0.000000   1st Qu.:0.09270   1st Qu.:0.33100  
##  Median :0.0803000   Median :0.000016   Median :0.12700   Median :0.51200  
##  Mean   :0.1751027   Mean   :0.084489   Mean   :0.19015   Mean   :0.51058  
##  3rd Qu.:0.2540000   3rd Qu.:0.004810   3rd Qu.:0.24800   3rd Qu.:0.69300  
##  Max.   :0.9920000   Max.   :0.994000   Max.   :0.99600   Max.   :0.99100  
##      tempo         duration_ms    
##  Min.   : 35.48   Min.   : 57373  
##  1st Qu.: 99.96   1st Qu.:187867  
##  Median :121.98   Median :216033  
##  Mean   :120.89   Mean   :225877  
##  3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :239.44   Max.   :515960

table(spotify_clean_2$release_year)

## 
## 1957 1958 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 
##    2    1    4    1    2    5    9   12   19   41   23   56   82   70   74  104 
## 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 
##   80  106  133  100  130   84   97   87   94  118  140  144  121  183  193  128 
## 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 
##  171  209  186  224  237  219  250  252  283  278  250  312  259  353  385  506 
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 
##  445  470  619  472  615  603  783  956 1524 1778 2127 2426 3301 9080  785
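
One possible way to condense this information into a single table is sketched below on simulated data. The `toy` frame and its values are made up; the column names follow the cleaned data set.

```r
set.seed(1)
toy <- data.frame(
  loudness     = runif(50, -20, 0),
  tempo        = runif(50, 60, 180),
  duration_min = runif(50, 2, 6)
)

# One row per variable, one column per statistic.
summary_table <- data.frame(
  variable = names(toy),
  min      = sapply(toy, min),
  median   = sapply(toy, median),
  mean     = sapply(toy, mean),
  max      = sapply(toy, max),
  row.names = NULL
)
print(summary_table, digits = 3)
```

The resulting data frame could then be passed to `knitr::kable()` for display in the report.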

4. Exploratory Data Analysis (EDA)

Track Popularity Distribution on Spotify:

ggplot(spotify_clean_2, aes(x = track_popularity)) +
  geom_histogram(binwidth = 5, fill = "#2ECC71", color = "white") +
  labs(
    title = "Track Popularity Distribution (0-100)",
    x = "Track Popularity",
    y = "Number of Songs",
    caption = "Binwidth = 5 | Source: spotify_clean_2"
  ) +
  theme_minimal()

We exclude values near 0 because they represent songs with little to no listener engagement, likely unreleased, inactive, or algorithmically suppressed tracks. Including them can distort trend analysis and obscure meaningful patterns among actively consumed tracks.

### Creating a new dataframe: spotify_popular

# Keep one row per song, restricted to popularity scores between 20 and 80
spotify_popular <- spotify_clean_2 %>%
  filter(track_popularity >= 20 & track_popularity <= 80)

# Plot from spotify_popular (one row per song) so bar heights are song counts;
# spotify_long is created later and repeats each song once per feature
ggplot(spotify_popular, aes(x = track_popularity)) +
  geom_histogram(binwidth = 5, fill = "#1DB954", color = "white", alpha = 0.8) +
  geom_vline(aes(xintercept = mean(track_popularity)), 
             color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(spotify_popular$track_popularity) + 10, y = Inf, vjust = 2,
           label = paste("Mean =", round(mean(spotify_popular$track_popularity), 1)), 
           color = "red") +
  labs(
    title = "Song Track Popularity Distribution",
    subtitle = "Most songs have low to medium popularity, with few viral hits",
    x = "Track Popularity Score",
    y = "Number of Songs",
    caption = "Source: Spotify Songs Dataset"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

Before and After 2010:

# Filter data to include songs with popularity >= 20
spotify_filtered <- spotify_clean_2 %>%
  filter(track_popularity >= 20)

# Correlation for songs BEFORE 2010 (popularity >= 20)
cor_before <- spotify_filtered %>%
  filter(release_year < 2010) %>%
  select(track_popularity, danceability, energy, loudness, 
         acousticness, instrumentalness, liveness, positivity, 
         tempo, duration_min, speech_ratio) %>%
  cor(use = "complete.obs")

# Correlation for songs AFTER 2010 (popularity >= 20)
cor_after <- spotify_filtered %>%
  filter(release_year >= 2010) %>%
  select(track_popularity, danceability, energy, loudness, 
         acousticness, instrumentalness, liveness, positivity, 
         tempo, duration_min, speech_ratio) %>%
  cor(use = "complete.obs")

# Set up side-by-side plots
par(mfrow = c(1, 2))

# Plot Before 2010
corrplot(cor_before, 
         method = "color", 
         type = "upper",
         tl.col = "black", 
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.6,
         col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
         title = "Before 2010 (Popularity >= 20)",
         mar = c(0,0,2,0),
         tl.cex = 0.8)

# Plot After 2010
corrplot(cor_after, 
         method = "color", 
         type = "upper",
         tl.col = "black", 
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.6,
         col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
         title = "After 2010 (Popularity >= 20)",
         mar = c(0,0,2,0),
         tl.cex = 0.8)

# Reset plot layout
par(mfrow = c(1, 1))

# Display sample size information
cat("Sample sizes for correlation analysis (Popularity >= 20):\n")

## Sample sizes for correlation analysis (Popularity >= 20):

cat("Before 2010:", nrow(spotify_filtered %>% filter(release_year < 2010)), "songs\n")

## Before 2010: 6262 songs

cat("After 2010:", nrow(spotify_filtered %>% filter(release_year >= 2010)), "songs\n")

## After 2010: 19341 songs

The heatmaps show no standout audio feature that clearly drives popularity in either period. This suggests that musical success likely depends on external factors, such as marketing, artist reputation, playlist placement, and timing, rather than on sound characteristics alone.
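
Since only the popularity row of each matrix matters for this comparison, a sketch like the one below can put the two periods side by side. The data are simulated and the helper `popularity_cor` is a hypothetical name; with the real data the same function could be applied to the two filtered subsets.

```r
set.seed(7)
# Simulated stand-in; values are random, so correlations will be near zero.
make_toy <- function(n) {
  data.frame(
    track_popularity = runif(n, 20, 100),
    danceability     = runif(n),
    energy           = runif(n),
    loudness         = runif(n, -20, 0)
  )
}
before <- make_toy(100)
after  <- make_toy(300)

# Correlation of every feature with popularity (popularity itself dropped).
popularity_cor <- function(df) {
  m <- cor(df, use = "complete.obs")
  m["track_popularity", colnames(m) != "track_popularity"]
}

comparison <- data.frame(
  feature = setdiff(names(before), "track_popularity"),
  before  = popularity_cor(before),
  after   = popularity_cor(after),
  row.names = NULL
)
comparison$change <- comparison$after - comparison$before
print(comparison, digits = 2)
```

A table of this shape makes the size of each before/after shift explicit instead of leaving it to be read off two color scales.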

Transforming the spotify_popular dataset from wide format to long format:

spotify_long <- spotify_popular %>%
  pivot_longer(cols = c(danceability, energy, loudness, speech_ratio,
                        acousticness, instrumentalness, liveness,
                        positivity, duration_min),
               names_to = "feature",
               values_to = "value")

The long format stores all feature values in a single column, which makes it possible to loop through features for modeling or analysis.
It is also a cleaner structure for comparative plots, such as faceted panels showing how each feature relates to popularity.
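
For instance, once the features share a single column, a per-feature statistic can be computed with one split-and-apply step. The sketch below builds a small long-format frame in base R from made-up values; `toy` and `toy_long` are hypothetical names.

```r
toy <- data.frame(
  track_popularity = c(30, 45, 60, 75, 90),
  danceability     = c(0.40, 0.50, 0.55, 0.70, 0.80),
  energy           = c(0.90, 0.70, 0.60, 0.40, 0.30)
)
features <- c("danceability", "energy")

# Stack the feature columns into feature/value pairs (long format).
toy_long <- data.frame(
  track_popularity = rep(toy$track_popularity, times = length(features)),
  feature          = rep(features, each = nrow(toy)),
  value            = unlist(toy[features], use.names = FALSE)
)

# One correlation with popularity per feature.
cor_by_feature <- sapply(
  split(toy_long, toy_long$feature),
  function(d) cor(d$value, d$track_popularity)
)
print(round(cor_by_feature, 2))
```

Note that each song appears once per feature in the long frame, so its row count is a multiple of the number of songs and should not be read as a song count.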

How Audio Features Relate to Popularity:

# Release_Period Variable

spotify_clean_2 <- spotify_clean_2 %>%
  mutate(release_period = case_when(
    is.na(track_album_release_date) ~ "NA",
    year(track_album_release_date) < 2010 ~ "Before 2010",
    TRUE ~ "After 2010"
  ))

# Creating the spotify_popular variable

spotify_popular <- spotify_clean_2 %>%
  filter(track_popularity >= 50 & track_popularity <= 80)

# Creating the spotify_long variable

spotify_long <- spotify_popular %>%
  pivot_longer(cols = c(danceability, energy, loudness, speech_ratio,
                        acousticness, instrumentalness, liveness,
                        positivity, duration_min),
               names_to = "feature",
               values_to = "value")

# Visualization

ggplot(spotify_long, aes(x = value, y = track_popularity)) +
  geom_point(aes(color = release_period), alpha = 0.3, size = 1) +
  geom_smooth(
    data = filter(spotify_long, release_period == "Before 2010"),
    method = "lm", se = FALSE, color = "black", linewidth = 0.8
  ) +
  geom_smooth(
    data = filter(spotify_long, release_period == "After 2010"),
    method = "lm", se = FALSE, color = "#1DB954", linewidth = 0.8
  ) +
  facet_wrap(~ feature, scales = "free_x", ncol = 3) +
  labs(
    title = "How Audio Features Relate to Popularity",
    subtitle = "Each panel shows trends before and after 2010",
    x = "Feature Value",
    y = "Popularity Score",
    color = "Release Period",
    caption = "Source: Spotify Songs Dataset"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(face = "bold", size = 10),
    legend.position = "bottom"
  )

Loudness stands out as a measurable, interpretable feature with clear temporal and genre-based patterns, making it a strong candidate for deeper analysis.
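
The genre-level analysis reports both Spearman and Pearson coefficients. As a brief illustration of why the two can differ, the sketch below uses made-up values with a monotone but non-linear relationship. The call to `cor.test()` is added here as one way to attach a p-value and is not part of the original analysis.

```r
# Made-up values: popularity rises with loudness, but not linearly.
loudness   <- c(-20, -15, -12, -10, -8, -6, -5, -4, -3, -2)
popularity <- c(21, 22, 24, 27, 31, 40, 52, 66, 81, 95)

pearson  <- cor(loudness, popularity, method = "pearson")
spearman <- cor(loudness, popularity, method = "spearman")

# Spearman works on ranks, so a strictly increasing relationship gives 1.
cat(sprintf("Pearson r = %.3f | Spearman rho = %.3f\n", pearson, spearman))

# Hypothesis test for the rank correlation.
test_result <- cor.test(loudness, popularity, method = "spearman")
print(test_result$p.value)
```

When the two coefficients diverge noticeably, the relationship is probably monotone but curved, which is one reason to overlay a loess curve alongside the linear fit.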

# Filter data to include songs with popularity >= 20
analysis_data_updated <- spotify_clean_2 %>%
  filter(track_popularity >= 20)

# Function: Analyze loudness correlation for each genre
analyze_loudness_correlation <- function(data, genre_name, period_label, year_threshold = 2010) {
  
  # Filter and prepare data
  genre_data <- data %>%
    filter(playlist_genre == genre_name) %>%
    select(track_popularity, loudness)
  
  # Check sample size adequacy
  n_obs <- nrow(genre_data)
  MIN_SAMPLE_SIZE <- 30
  if (n_obs < MIN_SAMPLE_SIZE) {
    warning(paste0("Insufficient data for ", genre_name, " - ", period_label, 
                   " (n=", n_obs, ", required: ", MIN_SAMPLE_SIZE, ")"))
    return(NULL)
  }
  
  # Compute correlations
  cor_value <- cor(genre_data$track_popularity, genre_data$loudness, 
                   method = "spearman", use = "pairwise.complete.obs")
  cor_pearson <- cor(genre_data$track_popularity, genre_data$loudness, 
                     method = "pearson", use = "pairwise.complete.obs")
  
  return(list(
    correlation_spearman = cor_value,
    correlation_pearson = cor_pearson,
    sample_size = n_obs,
    genre = genre_name,
    period = period_label,
    data = genre_data
  ))
}

# Function: Create scatter plot with correlation
plot_loudness_correlation <- function(cor_result, main_title = NULL) {
  
  if (is.null(cor_result)) return(invisible(NULL))
  
  # Create scatter plot
  p <- ggplot(cor_result$data, aes(x = loudness, y = track_popularity)) +
    geom_hex(bins = 40, alpha = 0.8) +
    geom_smooth(method = "lm", color = "#E74C3C", linewidth = 1.5, se = TRUE, alpha = 0.2) +
    geom_smooth(method = "loess", color = "#3498DB", linewidth = 1.2, linetype = "dashed", se = FALSE) +
    scale_fill_gradient(low = "#FFFFBF", high = "#1A9850", name = "Song\nDensity") +
    labs(
      title = ifelse(!is.null(main_title), main_title, 
                     paste0(cor_result$genre, " - ", cor_result$period)),
      subtitle = sprintf("Spearman r = %.3f | Pearson r = %.3f | n = %d",
                        cor_result$correlation_spearman,
                        cor_result$correlation_pearson,
                        cor_result$sample_size),
      x = "Loudness (dB)",
      y = "Track Popularity"
    ) +
    theme_minimal(base_size = 11) +
    theme(
      plot.title = element_text(face = "bold", size = 13),
      plot.subtitle = element_text(size = 10, color = "gray40"),
      legend.position = "right"
    )
  
  return(p)
}

# Main analysis pipeline
generate_loudness_comparison <- function(data, genres, year_cutoff = 2010) {
  
  data_before <- data %>% filter(release_year < year_cutoff)
  data_after <- data %>% filter(release_year >= year_cutoff)
  
  for (genre in genres) {
    
    cat("\n\n### Genre Analysis:", toupper(genre), "\n")
    
    # Get sample sizes
    n_before <- data_before %>% filter(playlist_genre == genre) %>% nrow()
    n_after <- data_after %>% filter(playlist_genre == genre) %>% nrow()
    
    cat("Sample sizes - Before 2010:", n_before, "| After 2010:", n_after, "\n")
    
    # Analyze both periods
    cor_before <- analyze_loudness_correlation(data_before, genre, "Before 2010")
    cor_after <- analyze_loudness_correlation(data_after, genre, "After 2010")
    
    if (is.null(cor_before) || is.null(cor_after)) {
      cat("Skipped due to insufficient data.\n")
      next
    }
    
    # Display correlation values
    cat("\nLoudness-Popularity Correlation (Spearman):\n")
    cat(sprintf("  Before %s: r = %.3f\n", year_cutoff, cor_before$correlation_spearman))
    cat(sprintf("  After %s:  r = %.3f\n", year_cutoff, cor_after$correlation_spearman))
    cat(sprintf("  Change:      Δr = %.3f\n", 
                cor_after$correlation_spearman - cor_before$correlation_spearman))
    
    # Create side-by-side plots
    p1 <- plot_loudness_correlation(cor_before, 
                                    paste0(toupper(genre), " - Before ", year_cutoff))
    p2 <- plot_loudness_correlation(cor_after, 
                                    paste0(toupper(genre), " - After ", year_cutoff))
    
    # Display plots side by side
    grid.arrange(p1, p2, ncol = 2, 
                 top = textGrob(paste0("Loudness vs Popularity: ", toupper(genre)),
                               gp = gpar(fontsize = 16, fontface = "bold")))
    
    cat("\n", strrep("-", 80), "\n")
  }
}

# Execute analysis with filtered data
library(gridExtra)
library(grid)

genres <- unique(analysis_data_updated$playlist_genre)
generate_loudness_comparison(analysis_data_updated, genres)

## 
## 
## ### Genre Analysis: POP 
## Sample sizes - Before 2010: 500 | After 2010: 4034 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.107
##   After 2010:  r = 0.169
##   Change:      Δr = 0.062

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: RAP 
## Sample sizes - Before 2010: 1128 | After 2010: 3522 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.139
##   After 2010:  r = 0.077
##   Change:      Δr = -0.063

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: ROCK 
## Sample sizes - Before 2010: 2584 | After 2010: 1170 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.097
##   After 2010:  r = 0.116
##   Change:      Δr = 0.018

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: LATIN 
## Sample sizes - Before 2010: 603 | After 2010: 3603 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.210
##   After 2010:  r = 0.252
##   Change:      Δr = 0.042

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: R&B 
## Sample sizes - Before 2010: 1368 | After 2010: 2728 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.112
##   After 2010:  r = 0.128
##   Change:      Δr = 0.016

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: EDM 
## Sample sizes - Before 2010: 79 | After 2010: 4284 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.150
##   After 2010:  r = 0.064
##   Change:      Δr = -0.086

## 
##  --------------------------------------------------------------------------------
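The before/after changes printed above are small, so it is worth asking whether they exceed sampling noise. The following is a minimal sketch, not part of the original pipeline, that compares two independent correlations with Fisher's z-transformation. The helper name `compare_correlations` and the `periods` data frame are hypothetical; the coefficients and sample sizes are copied from the rounded output above. The standard error used here is the one derived for Pearson correlations, so for Spearman coefficients the p-values should be read as rough approximations.

```r
# Approximate z-test for the difference between two independent correlations
# (Fisher's z-transformation). Hypothetical helper for illustration only.
compare_correlations <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                          # Fisher transform of each coefficient
  z2 <- atanh(r2)
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # Pearson-based SE (approximate for Spearman)
  z  <- (z2 - z1) / se
  list(z = z, p_value = 2 * pnorm(-abs(z)))  # two-sided p-value
}

# Rounded values copied from the printed output for each genre
periods <- data.frame(
  genre    = c("pop", "rap", "rock", "latin", "r&b", "edm"),
  r_before = c(0.107, 0.139, 0.097, 0.210, 0.112, 0.150),
  n_before = c(500, 1128, 2584, 603, 1368, 79),
  r_after  = c(0.169, 0.077, 0.116, 0.252, 0.128, 0.064),
  n_after  = c(4034, 3522, 1170, 3603, 2728, 4284)
)

res <- with(periods, compare_correlations(r_before, n_before, r_after, n_after))
periods$z       <- round(res$z, 2)
periods$p_value <- round(res$p_value, 3)
print(periods[, c("genre", "z", "p_value")])
```

A large p-value for a genre would mean the shift in correlation across the 2010 cutoff cannot be distinguished from noise at these sample sizes, which matters most for EDM, where only 79 tracks predate 2010.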

cat("\n\nDATA SUMMARY (Popularity >= 20):\n")

## 
## 
## DATA SUMMARY (Popularity >= 20):

cat("Total songs analyzed:", nrow(analysis_data_updated), "\n")

## Total songs analyzed: 25603

cat("Original dataset size:", nrow(spotify_clean_2), "\n")

## Original dataset size: 32801

cat("Songs excluded (popularity < 20):", nrow(spotify_clean_2) - nrow(analysis_data_updated), "\n")

## Songs excluded (popularity < 20): 7198

For pop, the most popular tracks cluster where loudness falls between –6 dB and –4 dB, which suggests a production sweet spot. The relationship is mildly non-linear: popularity peaks around that range rather than rising steadily with loudness, which fits the weak Spearman correlations (0.107 before 2010, 0.169 after).

For rap songs after 2010, popularity tends to peak at around –5 dB to –3 dB, but the correlation is weak (Spearman r = 0.077, down from 0.139 before 2010). The trend suggests a mild preference for louder production rather than a strong linear relationship.

For rock, popular tracks center around –8 dB to –6 dB both before and after 2010, indicating a production preference that has stayed consistent over time.

For Latin, popular tracks cluster around –6 dB to –4 dB both before and after 2010, indicating a consistent preference for moderately loud production. Latin also shows the strongest loudness-popularity correlation of any genre (Spearman r = 0.210 before 2010, 0.252 after), although it is still modest.

For R&B, popular tracks concentrate around –9 dB to –6 dB both before and after 2010, reflecting a steady production preference over time.

For EDM, popular tracks after 2010 center around –5 dB to –3 dB, which points to more intense, high-energy production than in earlier years. This comparison is tentative, because only 79 pre-2010 EDM tracks are in the sample and the post-2010 correlation is weak (Spearman r = 0.064).
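The sweet-spot ranges described above are read from the plots. One way to check them numerically is to bin loudness into fixed bands and compare median popularity per band. The sketch below uses base R on synthetic data so that it runs on its own. The `tracks` data frame, its built-in peak near –5 dB, and the 2 dB band width are assumptions for illustration; in the report, the input would be one genre's rows of `analysis_data_updated`, using its `loudness` and `track_popularity` columns.

```r
set.seed(42)

# Synthetic stand-in for one genre's tracks (illustrative only).
# Popularity is simulated to peak near -5 dB, with noise, clipped to 20-100.
tracks <- data.frame(loudness = runif(2000, min = -14, max = -2))
tracks$track_popularity <- pmin(100, pmax(20,
  70 - 2 * (tracks$loudness + 5)^2 + rnorm(2000, sd = 12)))

# Bin loudness into 2 dB bands: (-14,-12], (-12,-10], ..., (-4,-2]
tracks$loudness_band <- cut(tracks$loudness, breaks = seq(-14, -2, by = 2))

# Median popularity and track count per band
band_median <- tapply(tracks$track_popularity, tracks$loudness_band, median)
band_n      <- table(tracks$loudness_band)

band_summary <- data.frame(
  loudness_band     = names(band_median),
  median_popularity = as.numeric(band_median),
  n_tracks          = as.integer(band_n)
)
print(band_summary)

# Band with the highest median popularity
best_band <- band_summary$loudness_band[which.max(band_summary$median_popularity)]
cat("Highest median popularity in band:", best_band, "\n")
```

Reporting `n_tracks` alongside each median helps avoid over-reading bands with few songs, which is a risk at the quiet and loud extremes.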

Genre Popularity Rankings:

genre_popularity <- spotify_clean_2 %>%
  group_by(playlist_genre) %>%
  summarise(
    Avg_Popularity = mean(track_popularity),
    Median_Popularity = median(track_popularity),
    Songs = n(),
    High_Pop_Songs = sum(track_popularity >= 70),
    High_Pop_Pct = (High_Pop_Songs / Songs) * 100
  ) %>%
  arrange(desc(Avg_Popularity))

# Table Creation
kable(genre_popularity, 
      digits = 1,
      col.names = c("Genre", "Avg Popularity", "Median Popularity", 
                    "Total Songs", "Hit Songs (70+)", "Hit Rate (%)"),
      caption = "Genre Popularity Rankings") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  row_spec(1, bold = TRUE, color = "white", background = "#1DB954")

Genre Popularity Rankings
Genre	Avg Popularity	Median Popularity	Total Songs	Hit Songs (70+)	Hit Rate (%)
pop	47.7	52	5505	1240	22.5
latin	47.0	50	5149	1083	21.0
rap	43.3	47	5738	632	11.0
rock	41.7	46	4945	656	13.3
r&b	41.2	44	5430	798	14.7
edm	34.9	36	6034	424	7.0
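To judge whether the hit-rate differences in the table are larger than chance variation, the counts can be fed into a test of equal proportions. This is a sketch using base R's `prop.test`, with the totals and hit counts copied from the table above. The `genre_counts` data frame and its column names are introduced here for illustration only.

```r
# Counts copied from the Genre Popularity Rankings table
genre_counts <- data.frame(
  genre = c("pop", "latin", "rap", "rock", "r&b", "edm"),
  total = c(5505, 5149, 5738, 4945, 5430, 6034),
  hits  = c(1240, 1083, 632, 656, 798, 424)
)
genre_counts$hit_rate <- genre_counts$hits / genre_counts$total

# 95% confidence interval for each genre's hit rate
ci <- t(mapply(function(h, n) prop.test(h, n)$conf.int,
               genre_counts$hits, genre_counts$total))
genre_counts$ci_low  <- ci[, 1]
genre_counts$ci_high <- ci[, 2]

# Test of the null hypothesis that all six genres share one hit rate
overall <- prop.test(genre_counts$hits, genre_counts$total)

print(transform(genre_counts,
                hit_rate = round(100 * hit_rate, 1),
                ci_low   = round(100 * ci_low, 1),
                ci_high  = round(100 * ci_high, 1)))
print(overall$p.value)
```

Non-overlapping intervals between two genres indicate that their hit rates differ by more than sampling noise, although the test says nothing about why they differ.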

6. Summary

Across genres and time periods, popular songs cluster within specific loudness ranges: typically –6 dB to –4 dB for pop and Latin, slightly louder (–5 dB to –3 dB) for rap and EDM after 2010, and softer for R&B and rock (roughly –9 dB to –6 dB). The correlations are weak (Spearman r between about 0.06 and 0.25), so this pattern suggests a genre-specific “sweet spot” in production loudness rather than a rule that louder is always better. Other audio features show no standout relationship with popularity, indicating that external factors such as marketing, artist reputation, and playlist placement likely play a larger role in driving musical success.

Limitations of our research:

Causation vs. Correlation: This analysis identifies relationships but cannot prove that specific features cause popularity.

Missing Context: The dataset contains little information on external factors such as artist fame and social media presence.

Temporal Bias: The dataset may over-represent recent music because of recency bias on streaming platforms.

In the future:

Future work could extend this analysis in several directions:

Predictive modeling: build machine learning models that predict song popularity from audio features, and test classification models such as logistic regression to predict hit vs. non-hit outcomes.

Lyrics and time: use natural language processing (NLP) to analyze how lyrics affect popularity, and use historical data to study how a song's popularity changes over time (decay curves).

External metadata: incorporate variables such as artist type (solo vs. group), label tier (major vs. indie), and song language.

Interaction effects: explore combinations such as loudness × genre and energy × danceability to uncover compound influences.

Diagnostics and visualization: plot residuals to detect patterns that simple correlations might miss, and annotate plots with genre-specific loudness thresholds to highlight production sweet spots and guide interpretation.
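As a starting point for the logistic regression and interaction ideas, the sketch below fits a hit vs. non-hit model with a loudness × genre interaction. It runs on synthetic data, and the `songs` data frame, the simulated slopes, and the `is_hit` column are assumptions made for illustration. With the real data, `is_hit` could be defined as `track_popularity >= 70`, matching the hit threshold used in the rankings table.

```r
set.seed(1)
n <- 3000

# Synthetic stand-in for the Spotify data (illustrative only)
songs <- data.frame(
  playlist_genre = factor(sample(c("pop", "rock", "edm"), n, replace = TRUE)),
  loudness       = runif(n, min = -14, max = -2)
)

# Simulated truth: loudness raises the odds of a hit more for pop than for
# the other genres. These slopes are invented and are not estimates.
slope    <- c(pop = 0.30, rock = 0.05, edm = 0.10)[as.character(songs$playlist_genre)]
log_odds <- -1 + slope * (songs$loudness + 8)
songs$is_hit <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression with a loudness x genre interaction
fit <- glm(is_hit ~ loudness * playlist_genre, data = songs, family = binomial)
print(round(summary(fit)$coefficients, 3))

# Likelihood-ratio test: does the interaction improve on main effects alone?
fit_main <- glm(is_hit ~ loudness + playlist_genre, data = songs, family = binomial)
print(anova(fit_main, fit, test = "Chisq"))
```

The interaction coefficients estimate how the loudness slope differs from that of the reference genre (the first factor level, `edm` here), which is the kind of genre-specific effect that the per-genre correlations only hint at.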