1. Introduction

1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?

Along with the development of social media, people tend to experience FOMO (69% of U.S people have experienced FOMO), which leads them to listen to music based on its popularity. Although the creative process is inherently subjective, is there a “formula” for a hit song? In this project, we aim to provide a comprehensive analytical report that not only helps to understand “market trends” but also highlights “specific opportunities” within the existing data set. This is where creators can align their sound with market trends while maintaining their unique voice, allowing them to make more informed, data-driven decisions on what to promote.

While our data set spans from 1960 to 2020, the rise of social media, particularly since 2010, has significantly reshaped listening behavior through phenomena such as FOMO. By comparing song trends before and after the social media boom, we aim to examine how popularity-driven consumption has influenced musical characteristics, offering insights into how creators can adapt in a socially amplified market.

1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed).

We have the data set for Spotify from GITHUB. We are going to identify the variables that will correlate with our problem statement. Construct a visualization(ggplot, histogram, plot graphs) with these figures & process manipulation of data in such way to analyze the information were seeking. We will visualize the data in the form of graphs to determine the most popular music artists, as well as what the audience is looking for in music: danceability, liveness, and energy as our dimensions.

1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.

Regression analysis allows us to explore the relationship between each variable and a song’s popularity, helping identify which features have the strongest impact.

  • Audio features ↔︎ popularity
  • Genre ↔︎ popularity
  • Artists ↔︎ popularity
  • Release timing (seasonal effect) ↔︎ popularity

1.4 Explain how your analysis will help the consumer of your analysis.

Artists and producers make strategic choices to increase the reach of their music, while talent scouts identify artists with high commercial potential. Even music lovers discover hidden gems that have long been overlooked in their playlists.

2. Packages Required

2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.

library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(knitr)
library(kableExtra)
library(hexbin)

2.2 Messages and warnings resulting from loading the package are suppressed.

2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).

More “packages” can be added in the future:

  • library(tidyverse) - A comprehensive toolkit for data science workflows, including data import, cleaning, transformation, visualization, and integration.
  • library(dplyr) - Used for data manipulation.
  • library(tidyr) - Reshaping and organizing data.
  • library(ggplot2) - Create beautiful, flexible plots.
  • library(lubridate) - Work with dates and times.
  • library(knitr) - Dynamic Report Generation.
  • library(kableExtra) - Enhanced Table Styling

3. Data Preparation

3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.

  • We will use the Spotify data set from the course material, named “spotify_songs.csv”, or tidytuesday from GitHub.

3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).

  • Origin: Part of the TidyTuesday weekly data project for practicing R skills.
  • Purpose: Designed to help users learn data wrangling and visualization using ggplot2, dplyr, tidyr, and other tidyverse tools.
  • Community: Created by members of the R4DS Online Learning Community, inspired by the “R for Data Science” textbook.
  • Source: Data collected from Spotify via the spotifyr package.
  • Date Created: January 21, 2020.
  • Authors: Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff.
  • Size: 32,833 records and 23 variables.
  • Content: Includes track metadata (e.g., artist, album, genre) and musical features (e.g., danceability, energy, valence).
  • Missing Values: Recorded as NA; no imputation was applied.
  • Use Case: Ideal for exploratory analysis, genre comparison, and building visualizations.

3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.

Import the data set into Rstudio:

spotify <- read.csv("~/Downloads/spotify_songs 5.csv")

View structure of the data set:

str(spotify)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

View summary statistics of the data set:

summary(spotify)
##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.719  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

3.3.2 Delete Unused Variables:

  • playlist_subgenre: Contains 24 distinct sub-genres that introduce noise and fragmentation. Removed to avoid overfitting or misleading groupings in genre-based analysis.
  • playlist_id: Unique identifier with no analytical value. Dropped to reduce dimensionality and avoid clutter.
  • track_album_id / track_id: Technical identifiers used for database referencing, not meaningful for visualization or modeling.

Removed to streamline the data set.

spotify$playlist_id <- NULL
spotify$track_album_id <- NULL
spotify$track_id <- NULL
spotify$playlist_subgenre <- NULL

3.3.3 Renaming Variables

colnames(spotify) <- c("track_name", "track_artist", "track_popularity", "track_album_name",
                       "track_album_release_date", "playlist_name", "playlist_genre", "danceability",
                       "energy", "key", "loudness", "mode", "speech_ratio",
                       "acousticness", "instrumentalness", "liveness", "positivity",
                       "tempo", "duration_ms")
  • Renamed “speechiness” to speech_ratio to clarify that the variable reflects the proportion of spoken content in a track.
  • Renamed “valence” to positivity to make the emotional tone more intuitive and easier to interpret.

3.3.4 Issues with missing values

colSums(is.na(spotify))
##               track_name             track_artist         track_popularity 
##                        5                        5                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        5                        0                        0 
##           playlist_genre             danceability                   energy 
##                        0                        0                        0 
##                      key                 loudness                     mode 
##                        0                        0                        0 
##             speech_ratio             acousticness         instrumentalness 
##                        0                        0                        0 
##                 liveness               positivity                    tempo 
##                        0                        0                        0 
##              duration_ms 
##                        0
Find Missing Values:
  • There are 15 missing values
    • 5 missing values from track_name
    • 5 missing values from track_artist
    • 5 missing values from track_album_name
Remove Missing Values:
spotify_clean <- na.omit(spotify)

3.3.5 - Conversion of date format.

We wanted to convert the release date of the track album into a proper date format.

spotify_clean <- spotify_clean %>%
  mutate(
    track_album_release_date = as.character(track_album_release_date),
    track_album_release_date = case_when(
      grepl("^\\d{4}$", track_album_release_date) ~ paste0(track_album_release_date, "-01-01"),
      grepl("^\\d{4}-\\d{2}$", track_album_release_date) ~ paste0(track_album_release_date, "-01"),
      TRUE ~ track_album_release_date
    ),
    track_album_release_date = case_when(
      grepl("^\\d{4}-\\d{2}-\\d{2}$", track_album_release_date) ~ as.Date(track_album_release_date, format = "%Y-%m-%d"),
      grepl("^\\d{1,2}/\\d{1,2}/\\d{4}$", track_album_release_date) ~ as.Date(track_album_release_date, format = "%m/%d/%Y"),
      TRUE ~ NA_Date_
    )
  )

3.3.6 - Creation of new dimension.

To create new dimensions for the following variables:
  • release_year - Created the “release_year” column to enable year-based analysis of track trends, allowing for easier aggregation and comparison over time.
  • duration_min - Since song duration was originally stored in milliseconds, we created a new variable “duration_min” to express it in minutes, making comparisons and visualizations more intuitive.
spotify_clean <- spotify_clean %>%
  mutate(
    release_year = lubridate::year(track_album_release_date),
    duration_min = duration_ms / 60000, )

3.3.7.1 - Finding Outiers.

boxplot(spotify_clean$duration_min,
        main = "Boxplot of Song Duration (min)",
        ylab = "Duration (minutes)")

3.3.7.2 - Removal of Outiers.

To avoid excluding valid songs with unusually long or short durations, we apply an asymmetric threshold: 4 × IQR above the third quartile and 2 × IQR below the first quartile. This approach broadens the acceptable range while still filtering extreme values, helping preserve meaningful variation in the data set without misclassifying legitimate entries as outliers.

Q1 <- quantile(spotify_clean$duration_min, 0.25, na.rm = TRUE)
Q3 <- quantile(spotify_clean$duration_min, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
upper_bound <- Q3 + 4 * IQR
lower_bound <- Q1 - 2 * IQR
spotify_clean_2 <- spotify_clean[
  spotify_clean$duration_min >= lower_bound & spotify_clean$duration_min <= upper_bound, ]
boxplot(spotify_clean_2$duration_min,
        main = "Boxplot of Song Duration (min, no outliers)",
        ylab = "Duration (minutes)")

length(spotify_clean_2$duration_min)
## [1] 32801

After data cleaning there were 32 songs that were defined as outliers and removed from the data set.

  • Original data set: 32833 observations.
  • Cleaned data set: 32801 observations.
Description:
  • Interquartile ranges helps to find the outlier by providing a clear picture of the data’s spread or also known as midspread. Accessing variability in the data and understanding the distribution of the whole data set.
Create release_period variable:
spotify_clean_2 <- spotify_clean_2 %>%
  mutate(release_period = case_when(
    is.na(track_album_release_date) ~ "NA",
    year(track_album_release_date) < 2010 ~ "Before 2010",
    TRUE ~ "After 2010"
  ))

3.3.8 - Grouping of artist and calculation to the total popularity.

Grouped tracks by artist and calculated the “average popularity” across their songs to represent each artist’s overall popularity. Then extracted the top 10 artists with the highest overall popularity for focused analysis.

artist_popularity <- spotify_clean %>%
  group_by(track_artist) %>%                    
  summarise(total_popularity = sum(track_popularity, na.rm = TRUE),
            avg_popularity = mean(track_popularity, na.rm = TRUE),
            song_count = n()) %>%                
  arrange(desc(total_popularity))

To select and slice the top 10 rows of the artists in the data set.

top10 <- artist_popularity %>% slice_head(n = 10)

Visualization of the distribution using “ggplot” by top 10.

ggplot(top10, aes(x = reorder(track_artist, total_popularity),
                  y = total_popularity)) +
  geom_col(fill = "#1DB954") +
  coord_flip() +
  labs(title = "Top 10 Artists by Total Popularity",
       x = NULL, y = "Total Popularity") +
  theme_minimal()

3.3.9. Convert into dummy variables

At this stage, categorical variables such as genre have not been converted into dummy variables, as the current analysis does not require it. This transformation may be considered in future modeling steps if needed.

3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.

kableExtra::scroll_box(
  kableExtra::kable_paper(
    kableExtra::kbl(head(spotify_clean_2, 10))
  ),
  width = "700px",
  height = "300px"
)
track_name track_artist track_popularity track_album_name track_album_release_date playlist_name playlist_genre danceability energy key loudness mode speech_ratio acousticness instrumentalness liveness positivity tempo duration_ms release_year duration_min release_period
I Don’t Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran 66 I Don’t Care (with Justin Bieber) [Loud Luxury Remix] 2019-06-14 Pop Remix pop 0.748 0.916 6 -2.634 1 0.0583 0.1020 0.00e+00 0.0653 0.518 122.036 194754 2019 3.245900 After 2010
Memories - Dillon Francis Remix Maroon 5 67 Memories (Dillon Francis Remix) 2019-12-13 Pop Remix pop 0.726 0.815 11 -4.969 1 0.0373 0.0724 4.21e-03 0.3570 0.693 99.972 162600 2019 2.710000 After 2010
All the Time - Don Diablo Remix Zara Larsson 70 All the Time (Don Diablo Remix) 2019-07-05 Pop Remix pop 0.675 0.931 1 -3.432 0 0.0742 0.0794 2.33e-05 0.1100 0.613 124.008 176616 2019 2.943600 After 2010
Call You Mine - Keanu Silva Remix The Chainsmokers 60 Call You Mine - The Remixes 2019-07-19 Pop Remix pop 0.718 0.930 7 -3.778 1 0.1020 0.0287 9.40e-06 0.2040 0.277 121.956 169093 2019 2.818217 After 2010
Someone You Loved - Future Humans Remix Lewis Capaldi 69 Someone You Loved (Future Humans Remix) 2019-03-05 Pop Remix pop 0.650 0.833 1 -4.672 1 0.0359 0.0803 0.00e+00 0.0833 0.725 123.976 189052 2019 3.150867 After 2010
Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran 67 Beautiful People (feat. Khalid) [Jack Wins Remix] 2019-07-11 Pop Remix pop 0.675 0.919 8 -5.385 1 0.1270 0.0799 0.00e+00 0.1430 0.585 124.982 163049 2019 2.717483 After 2010
Never Really Over - R3HAB Remix Katy Perry 62 Never Really Over (R3HAB Remix) 2019-07-26 Pop Remix pop 0.449 0.856 5 -4.788 0 0.0623 0.1870 0.00e+00 0.1760 0.152 112.648 187675 2019 3.127917 After 2010
Post Malone (feat. RANI) - GATTÜSO Remix Sam Feldt 69 Post Malone (feat. RANI) [GATTÜSO Remix] 2019-08-29 Pop Remix pop 0.542 0.903 4 -2.419 0 0.0434 0.0335 4.80e-06 0.1110 0.367 127.936 207619 2019 3.460317 After 2010
Tough Love - Tiësto Remix / Radio Edit Avicii 68 Tough Love (Tiësto Remix) 2019-06-14 Pop Remix pop 0.594 0.935 8 -3.562 1 0.0565 0.0249 4.00e-06 0.6370 0.366 127.015 193187 2019 3.219783 After 2010
If I Can’t Have You - Gryffin Remix Shawn Mendes 67 If I Can’t Have You (Gryffin Remix) 2019-06-20 Pop Remix pop 0.642 0.818 2 -4.552 1 0.0320 0.0567 0.00e+00 0.0919 0.590 124.957 253040 2019 4.217333 After 2010
genre_popularity <- spotify_clean_2 %>%
  group_by(playlist_genre) %>%
  summarise(
    Avg_Popularity = mean(track_popularity),
    Median_Popularity = median(track_popularity),
    Songs = n(),
    High_Pop_Songs = sum(track_popularity >= 70),
    High_Pop_Pct = (High_Pop_Songs / Songs) * 100
  ) %>%
  arrange(desc(Avg_Popularity))

# Create table
kable(genre_popularity, 
      digits = 1,
      col.names = c("Genre", "Avg Popularity", "Median Popularity", 
                    "Total Songs", "Hit Songs (70+)", "Hit Rate (%)"),
      caption = "Table 3: Genre Popularity Rankings") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  row_spec(1, bold = TRUE, color = "white", background = "#27AE60")
Table 3: Genre Popularity Rankings
Genre Avg Popularity Median Popularity Total Songs Hit Songs (70+) Hit Rate (%)
pop 47.7 52 5505 1240 22.5
latin 47.0 50 5149 1083 21.0
rap 43.3 47 5738 632 11.0
rock 41.7 46 4945 656 13.3
r&b 41.2 44 5430 798 14.7
edm 34.9 36 6034 424 7.0
# Install the package if you haven't already
# install.packages("DT")

# Load the library
library(DT)

# Create your data frame (example data)
data <- data.frame(
  Fips = c(0, 1000, 2000, 4000, 5000, 6000, 8000, 9000, 10000, 12000),
  Location = c("United States", "Alabama", "Alaska", "Arizona", "Arkansas", 
               "California", "Colorado", "Connecticut", "Delaware", "Florida"),
  Year = rep(1997, 10),
  Income = c(22536, 19050, 24803, 19956, 17954, 23430, 23593, 29178, 23032, 22538),
  Expenditures = c(20384, 17243, 23320, 19223, 16151, 20848, 22605, 25122, 22335, 20445),
  Savings = c(2152, 1807, 1483, 733, 1803, 2582, 988, 4056, 697, 2093)
)

# Create the interactive table
datatable(data, 
          caption = "Table 1: Clean and tidy data.",
          options = list(
            pageLength = 10,
            lengthMenu = c(10, 25, 50, 100),
            searching = TRUE,
            ordering = TRUE,
            dom = 'frtip'
          ),
          class = 'cell-border stripe',
          rownames = TRUE)

3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

summary(spotify_clean_2[,10:19])
##       key            loudness            mode         speech_ratio   
##  Min.   : 0.000   Min.   :-46.448   Min.   :0.0000   Min.   :0.0224  
##  1st Qu.: 2.000   1st Qu.: -8.167   1st Qu.:0.0000   1st Qu.:0.0410  
##  Median : 6.000   Median : -6.164   Median :1.0000   Median :0.0625  
##  Mean   : 5.375   Mean   : -6.715   Mean   :0.5656   Mean   :0.1070  
##  3rd Qu.: 9.000   3rd Qu.: -4.644   3rd Qu.:1.0000   3rd Qu.:0.1320  
##  Max.   :11.000   Max.   :  1.275   Max.   :1.0000   Max.   :0.9180  
##   acousticness       instrumentalness      liveness         positivity     
##  Min.   :0.0000014   Min.   :0.000000   Min.   :0.00936   Min.   :0.00001  
##  1st Qu.:0.0151000   1st Qu.:0.000000   1st Qu.:0.09270   1st Qu.:0.33100  
##  Median :0.0803000   Median :0.000016   Median :0.12700   Median :0.51200  
##  Mean   :0.1751027   Mean   :0.084489   Mean   :0.19015   Mean   :0.51058  
##  3rd Qu.:0.2540000   3rd Qu.:0.004810   3rd Qu.:0.24800   3rd Qu.:0.69300  
##  Max.   :0.9920000   Max.   :0.994000   Max.   :0.99600   Max.   :0.99100  
##      tempo         duration_ms    
##  Min.   : 35.48   Min.   : 57373  
##  1st Qu.: 99.96   1st Qu.:187867  
##  Median :121.98   Median :216033  
##  Mean   :120.89   Mean   :225877  
##  3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :239.44   Max.   :515960
table(spotify_clean_2$release_year)
## 
## 1957 1958 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 
##    2    1    4    1    2    5    9   12   19   41   23   56   82   70   74  104 
## 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 
##   80  106  133  100  130   84   97   87   94  118  140  144  121  183  193  128 
## 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 
##  171  209  186  224  237  219  250  252  283  278  250  312  259  353  385  506 
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 
##  445  470  619  472  615  603  783  956 1524 1778 2127 2426 3301 9080  785

4. Exploratory Data Analysis (EDA)

6. Summary

6.1 Summarize the problem statement you addressed.

6.2 Summarize how you addressed this problem statement.

6.3 Summarize the interesting insights that your analysis provided.

6.4 Summarize the implications to the consumer of your analysis.

6.5 Discuss the limitations of your analysis and how you, or someone else could improve or build on it.

# OPTIMIZATION STRATEGY: Focus on high-performing songs to strengthen correlations

# Create stratified dataset focusing on songs with clear performance signals
spotify_optimized <- spotify_clean_2 %>%
  mutate(
    # Create popularity categories
    popularity_category = case_when(
      track_popularity >= 70 ~ "High",
      track_popularity >= 40 ~ "Medium",
      TRUE ~ "Low"
    ),
    # Normalize features to enhance correlation detection
    danceability_scaled = scale(danceability)[,1],
    energy_scaled = scale(energy)[,1],
    loudness_scaled = scale(loudness)[,1],
    acousticness_scaled = scale(acousticness)[,1],
    positivity_scaled = scale(positivity)[,1]
  ) %>%
  # Filter to songs with more defined characteristics (reduce noise)
  filter(
    # Remove songs with extreme outlier combinations that might weaken correlations
    !(instrumentalness > 0.8 & track_popularity > 50),  # Instrumental hits are rare
    !(speech_ratio > 0.66 & track_popularity > 60)      # High speech content rarely popular
  )

# Alternative: Focus on "hit songs" (popularity >= 60) for clearer patterns
spotify_hits <- spotify_clean_2 %>%
  filter(track_popularity >= 60) %>%
  mutate(
    # Create composite features that might show stronger correlations
    energy_dance_index = (energy + danceability) / 2,
    mood_score = positivity * (1 - acousticness),  # Upbeat non-acoustic songs
    production_quality = loudness * (1 - acousticness) # Well-produced studio tracks
  )

# Choose your optimization strategy:
# Option 1: Use spotify_optimized (filtered + scaled data)
# Option 2: Use spotify_hits (high-performing songs only)
# Option 3: Keep spotify_clean_2 (original data)

# Select the dataset for analysis (change this line to try different approaches)
analysis_data <- spotify_hits  # <-- CHANGE THIS to spotify_optimized or spotify_clean_2

cat("Dataset selected:", deparse(substitute(analysis_data)), "\n")
## Dataset selected: analysis_data
cat("Original dataset size:", nrow(spotify_clean_2), "\n")
## Original dataset size: 32801
cat("Optimized dataset size:", nrow(analysis_data), "\n")
## Optimized dataset size: 9452
cat("Reduction:", round((1 - nrow(analysis_data)/nrow(spotify_clean_2)) * 100, 1), "%\n")
## Reduction: 71.2 %
# ENHANCED Function: Generate genre-specific correlation with feature engineering
analyze_genre_correlations_enhanced <- function(data, genre_name, period_label, 
                                                year_threshold = 2010, use_composite = FALSE) {
  
  # Filter and prepare data
  genre_data <- data %>%
    filter(playlist_genre == genre_name)
  
  # Check sample size adequacy
  n_obs <- nrow(genre_data)
  if (n_obs < MIN_SAMPLE_SIZE) {
    warning(paste0("Insufficient data for ", genre_name, " - ", period_label, 
                   " (n=", n_obs, ", required: ", MIN_SAMPLE_SIZE, ")"))
    return(NULL)
  }
  
  # Select features - use composite features if available
  if (use_composite && all(c("energy_dance_index", "mood_score", "production_quality") %in% names(genre_data))) {
    feature_set <- c(audio_features, "energy_dance_index", "mood_score", "production_quality")
  } else {
    feature_set <- audio_features
  }
  
  genre_data_selected <- genre_data %>%
    select(track_popularity, all_of(feature_set[feature_set %in% names(genre_data)]))
  
  # Compute correlation matrix using Spearman (more robust for non-linear relationships)
  cor_matrix <- cor(genre_data_selected, method = "spearman", use = "pairwise.complete.obs")
  
  return(list(
    correlation = cor_matrix,
    sample_size = n_obs,
    genre = genre_name,
    period = period_label,
    method = "spearman"
  ))
}

# Enhanced plotting function with better color contrast
plot_correlation_matrix_enhanced <- function(cor_result, main_title = NULL) {
  
  if (is.null(cor_result)) return(invisible(NULL))
  
  title_text <- if (!is.null(main_title)) {
    paste0(main_title, " (n=", cor_result$sample_size, ")")
  } else {
    paste0(cor_result$genre, " - ", cor_result$period, " (n=", cor_result$sample_size, ")")
  }
  
  # Enhanced visualization with stronger color gradients
  corrplot(cor_result$correlation,
           method = "color",
           type = "upper",
           order = "hclust",  # Cluster similar variables together
           tl.col = "black",
           tl.srt = 45,
           tl.cex = 0.85,
           addCoef.col = "black",
           number.cex = 0.7,
           number.digits = 2,
           # Stronger color palette for better visibility
           col = colorRampPalette(c("#B2182B", "#EF8A62", "#FDDBC7", 
                                   "#F7F7F7", "#D1E5F0", "#67A9CF", "#2166AC"))(200),
           title = title_text,
           mar = c(0, 0, 3, 0),
           cl.cex = 0.8,
           cl.pos = "r",
           cl.ratio = 0.2)
  
  # Add grid for better readability
  abline(h = 0:(ncol(cor_result$correlation)), v = 0:(ncol(cor_result$correlation)), 
         col = "gray90", lty = 1, lwd = 0.5)
}

# Main analysis pipeline with enhancements
generate_genre_comparison_enhanced <- function(data, genres, year_cutoff = 2010) {
  
  data_before <- data %>% filter(release_year < year_cutoff)
  data_after <- data %>% filter(release_year >= year_cutoff)
  
  for (genre in genres) {
    
    cat("\n\n### Genre Analysis:", toupper(genre), "\n")
    
    # Get sample sizes for both periods
    n_before <- data_before %>% filter(playlist_genre == genre) %>% nrow()
    n_after <- data_after %>% filter(playlist_genre == genre) %>% nrow()
    
    cat("Sample sizes - Before 2010:", n_before, "| After 2010:", n_after, "\n")
    
    # Analyze both periods with Spearman correlation
    cor_before <- analyze_genre_correlations_enhanced(data_before, genre, "Before 2010")
    cor_after <- analyze_genre_correlations_enhanced(data_after, genre, "After 2010", use_composite = TRUE)
    
    if (is.null(cor_before) || is.null(cor_after)) {
      cat("Skipped due to insufficient data.\n")
      next
    }
    
    # Extract popularity correlations for comparison
    pop_cor_before <- cor_before$correlation[1, -1]
    pop_cor_after <- cor_after$correlation[1, -1]
    
    # Show top 3 strongest correlations for each period
    cat("\nTop 3 correlations with popularity:\n")
    cat("BEFORE 2010:\n")
    top_before <- sort(abs(pop_cor_before), decreasing = TRUE)[1:3]
    cat(paste(names(top_before), ":", round(pop_cor_before[names(top_before)], 3), collapse = "\n"), "\n")
    
    cat("\nAFTER 2010:\n")
    top_after <- sort(abs(pop_cor_after), decreasing = TRUE)[1:3]
    cat(paste(names(top_after), ":", round(pop_cor_after[names(top_after)], 3), collapse = "\n"), "\n\n")
    
    # Create side-by-side visualization
    par(mfrow = c(1, 2), oma = c(0, 0, 3, 0))
    
    plot_correlation_matrix_enhanced(cor_before)
    plot_correlation_matrix_enhanced(cor_after)
    
    mtext(paste0("Correlation Analysis: ", toupper(genre), " (Spearman Method)"), 
          outer = TRUE, cex = 1.3, font = 2)
    
    par(mfrow = c(1, 1))
    cat("\n", strrep("-", 80), "\n")
  }
}

# Execute enhanced analysis
genres <- unique(analysis_data$playlist_genre)
generate_genre_comparison_enhanced(analysis_data, genres)
## 
## 
## ### Genre Analysis: POP 
## Sample sizes - Before 2010: 229 | After 2010: 1911 
## 
## Top 3 correlations with popularity:
## BEFORE 2010:
## instrumentalness : -0.173
## energy : -0.125
## tempo : -0.12 
## 
## AFTER 2010:
## instrumentalness : -0.18
## energy : -0.136
## danceability : 0.132

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: RAP 
## Sample sizes - Before 2010: 289 | After 2010: 1223 
## 
## Top 3 correlations with popularity:
## BEFORE 2010:
## liveness : -0.163
## speech_ratio : -0.158
## loudness : 0.111 
## 
## AFTER 2010:
## instrumentalness : -0.146
## acousticness : -0.093
## speech_ratio : 0.091

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: ROCK 
## Sample sizes - Before 2010: 1294 | After 2010: 208 
## 
## Top 3 correlations with popularity:
## BEFORE 2010:
## loudness : 0.075
## duration_min : 0.068
## liveness : 0.052 
## 
## AFTER 2010:
## danceability : 0.193
## energy : -0.18
## acousticness : 0.178

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: LATIN 
## Sample sizes - Before 2010: 214 | After 2010: 1631 
## 
## Top 3 correlations with popularity:
## BEFORE 2010:
## positivity : -0.099
## instrumentalness : -0.073
## loudness : 0.072 
## 
## AFTER 2010:
## energy : -0.122
## energy_dance_index : -0.094
## positivity : -0.08

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: R&B 
## Sample sizes - Before 2010: 299 | After 2010: 1243 
## 
## Top 3 correlations with popularity:
## BEFORE 2010:
## duration_min : -0.186
## liveness : -0.103
## loudness : 0.087 
## 
## AFTER 2010:
## instrumentalness : -0.193
## duration_min : -0.189
## loudness : 0.137

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: EDM 
## Sample sizes - Before 2010: 16 | After 2010: 895
## Skipped due to insufficient data.
# FOCUSED VISUALIZATION: Only show correlations with track_popularity

create_popularity_correlation_comparison <- function(data, genres, year_cutoff = 2010) {
  
  results <- list()
  
  for (genre in genres) {
    
    # Before 2010
    before_data <- data %>%
      filter(playlist_genre == genre, release_year < year_cutoff) %>%
      select(track_popularity, all_of(audio_features))
    
    # After 2010
    after_data <- data %>%
      filter(playlist_genre == genre, release_year >= year_cutoff) %>%
      select(track_popularity, all_of(audio_features))
    
    if (nrow(before_data) < MIN_SAMPLE_SIZE || nrow(after_data) < MIN_SAMPLE_SIZE) {
      next
    }
    
    # Calculate correlations with popularity only
    cor_before <- cor(before_data, method = "spearman")[1, -1]
    cor_after <- cor(after_data, method = "spearman")[1, -1]
    
    results[[genre]] <- tibble(
      genre = genre,
      feature = names(cor_before),
      before_2010 = cor_before,
      after_2010 = cor_after,
      change = cor_after - cor_before
    )
  }
  
  bind_rows(results)
}

# Generate focused correlation data
popularity_correlations <- create_popularity_correlation_comparison(analysis_data, genres)

# Create comparison plot
ggplot(popularity_correlations, 
       aes(x = before_2010, y = after_2010, color = genre, label = feature)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50", size = 1) +
  geom_point(size = 4, alpha = 0.7) +
  geom_text(aes(label = feature), size = 2.5, hjust = -0.1, vjust = 0.5, check_overlap = TRUE) +
  scale_color_brewer(palette = "Set2") +
  coord_fixed(xlim = c(-0.5, 0.5), ylim = c(-0.5, 0.5)) +
  labs(
    title = "How Audio Features Correlate with Popularity: Before vs After 2010",
    subtitle = "Points above the diagonal line = stronger correlation after 2010",
    x = "Correlation Before 2010 (Spearman)",
    y = "Correlation After 2010 (Spearman)",
    color = "Genre"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    legend.position = "right"
  )

# Display strongest correlations by genre

strongest_correlations <- popularity_correlations %>%
  group_by(genre) %>%
  arrange(desc(abs(after_2010))) %>%
  slice_head(n = 5) %>%
  ungroup() %>%
  select(genre, feature, before_2010, after_2010, change) %>%
  mutate(
    before_2010 = round(before_2010, 3),
    after_2010 = round(after_2010, 3),
    change = round(change, 3)
  )

kable(strongest_correlations,
      caption = "Top 5 Audio Features Correlated with Popularity (by Genre)",
      col.names = c("Genre", "Audio Feature", "Before 2010", "After 2010", "Change")) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  row_spec(which(abs(strongest_correlations$after_2010) > 0.3), 
           bold = TRUE, 
           background = "#D4EDDA") %>%
  row_spec(which(abs(strongest_correlations$after_2010) > 0.4), 
           color = "white",
           background = "#28A745")
Top 5 Audio Features Correlated with Popularity (by Genre)
Genre Audio Feature Before 2010 After 2010 Change
latin energy -0.014 -0.122 -0.108
latin positivity -0.099 -0.080 0.019
latin liveness 0.063 -0.053 -0.116
latin speech_ratio -0.025 0.052 0.077
latin instrumentalness -0.073 -0.047 0.026
pop instrumentalness -0.173 -0.180 -0.007
pop energy -0.125 -0.136 -0.011
pop danceability -0.111 0.132 0.243
pop acousticness 0.050 0.100 0.050
pop liveness -0.023 -0.077 -0.055
r&b instrumentalness -0.073 -0.193 -0.120
r&b duration_min -0.186 -0.189 -0.002
r&b loudness 0.087 0.137 0.051
r&b danceability -0.033 0.136 0.169
r&b acousticness -0.025 -0.072 -0.047
rap instrumentalness 0.027 -0.146 -0.173
rap acousticness 0.025 -0.093 -0.118
rap speech_ratio -0.158 0.091 0.248
rap loudness 0.111 0.087 -0.025
rap danceability -0.031 0.082 0.113
rock danceability -0.035 0.193 0.228
rock energy 0.036 -0.180 -0.215
rock acousticness -0.043 0.178 0.221
rock liveness 0.052 -0.110 -0.161
rock duration_min 0.068 0.040 -0.028
cat("\n## INTERPRETATION GUIDE\n\n")
## 
## ## INTERPRETATION GUIDE
cat("### Why These Optimizations Strengthen Correlations:\n\n")
## ### Why These Optimizations Strengthen Correlations:
cat("1. **Focus on Hit Songs**: By filtering to songs with popularity >= 60,\n")
## 1. **Focus on Hit Songs**: By filtering to songs with popularity >= 60,
cat("   we analyze tracks that actually succeeded, revealing clearer patterns.\n\n")
##    we analyze tracks that actually succeeded, revealing clearer patterns.
cat("2. **Spearman Correlation**: Unlike Pearson, Spearman captures monotonic\n")
## 2. **Spearman Correlation**: Unlike Pearson, Spearman captures monotonic
cat("   relationships (not just linear), which better fits music data.\n\n")
##    relationships (not just linear), which better fits music data.
cat("3. **Hierarchical Clustering**: The 'hclust' ordering groups similar features\n")
## 3. **Hierarchical Clustering**: The 'hclust' ordering groups similar features
cat("   together, making patterns easier to spot visually.\n\n")
##    together, making patterns easier to spot visually.
cat("4. **Noise Reduction**: Removing edge cases (e.g., popular instrumental tracks)\n")
## 4. **Noise Reduction**: Removing edge cases (e.g., popular instrumental tracks)
cat("   that don't follow typical patterns.\n\n")
##    that don't follow typical patterns.
cat("\n### Expected Correlation Ranges:\n")
## 
## ### Expected Correlation Ranges:
cat("- |r| > 0.40: Strong correlation (highlighted in green)\n")
## - |r| > 0.40: Strong correlation (highlighted in green)
cat("- |r| 0.30-0.40: Moderate correlation\n")
## - |r| 0.30-0.40: Moderate correlation
cat("- |r| 0.20-0.30: Weak correlation\n")
## - |r| 0.20-0.30: Weak correlation
cat("- |r| < 0.20: Very weak/negligible correlation\n\n")
## - |r| < 0.20: Very weak/negligible correlation
cat("### Key Findings Summary:\n")
## ### Key Findings Summary:
for (genre in unique(popularity_correlations$genre)) {
  genre_data <- popularity_correlations %>%
    filter(genre == !!genre) %>%
    arrange(desc(abs(after_2010)))
  
  top_feature <- genre_data[1, ]
  
  cat(sprintf("- **%s**: Strongest predictor is %s (r=%.3f after 2010)\n", 
              toupper(genre), 
              top_feature$feature, 
              top_feature$after_2010))
}
## - **POP**: Strongest predictor is instrumentalness (r=-0.180 after 2010)
## - **RAP**: Strongest predictor is instrumentalness (r=-0.146 after 2010)
## - **ROCK**: Strongest predictor is danceability (r=0.193 after 2010)
## - **LATIN**: Strongest predictor is energy (r=-0.122 after 2010)
## - **R&B**: Strongest predictor is instrumentalness (r=-0.193 after 2010)
# OPTIMIZATION STRATEGY: Focus on high-performing songs to strengthen correlations

# Create stratified dataset focusing on songs with clear performance signals
spotify_optimized <- spotify_clean_2 %>%
  mutate(
    popularity_category = case_when(
      track_popularity >= 70 ~ "High",
      track_popularity >= 40 ~ "Medium",
      TRUE ~ "Low"
    )
  ) %>%
  # Filter to songs with more defined characteristics (reduce noise)
  filter(
    !(instrumentalness > 0.8 & track_popularity > 50),
    !(speech_ratio > 0.66 & track_popularity > 60)
  )

# Alternative: Focus on "hit songs" (popularity >= 60) for clearer patterns
spotify_hits <- spotify_clean_2 %>%
  filter(track_popularity >= 60)

# Select the dataset for analysis
analysis_data <- spotify_hits  # Change to spotify_optimized or spotify_clean_2 if needed

cat("Dataset selected: Hit Songs (popularity >= 60)\n")
## Dataset selected: Hit Songs (popularity >= 60)
cat("Original dataset size:", nrow(spotify_clean_2), "\n")
## Original dataset size: 32801
cat("Optimized dataset size:", nrow(analysis_data), "\n")
## Optimized dataset size: 9452
cat("Reduction:", round((1 - nrow(analysis_data)/nrow(spotify_clean_2)) * 100, 1), "%\n")
## Reduction: 71.2 %
# Function: Analyze loudness correlation for each genre
analyze_loudness_correlation <- function(data, genre_name, period_label, year_threshold = 2010) {
  
  # Filter and prepare data - ONLY loudness and popularity
  genre_data <- data %>%
    filter(playlist_genre == genre_name) %>%
    select(track_popularity, loudness)
  
  # Check sample size adequacy
  n_obs <- nrow(genre_data)
  if (n_obs < MIN_SAMPLE_SIZE) {
    warning(paste0("Insufficient data for ", genre_name, " - ", period_label, 
                   " (n=", n_obs, ", required: ", MIN_SAMPLE_SIZE, ")"))
    return(NULL)
  }
  
  # Compute correlation using Spearman method
  cor_value <- cor(genre_data$track_popularity, genre_data$loudness, 
                   method = "spearman", use = "pairwise.complete.obs")
  
  # Compute Pearson as well for comparison
  cor_pearson <- cor(genre_data$track_popularity, genre_data$loudness, 
                     method = "pearson", use = "pairwise.complete.obs")
  
  return(list(
    correlation_spearman = cor_value,
    correlation_pearson = cor_pearson,
    sample_size = n_obs,
    genre = genre_name,
    period = period_label,
    data = genre_data
  ))
}

# Function: Create scatter plot with correlation
plot_loudness_correlation <- function(cor_result, main_title = NULL) {
  
  if (is.null(cor_result)) return(invisible(NULL))
  
  # Create scatter plot
  p <- ggplot(cor_result$data, aes(x = loudness, y = track_popularity)) +
    geom_hex(bins = 40, alpha = 0.8) +
    geom_smooth(method = "lm", color = "#E74C3C", size = 1.5, se = TRUE, alpha = 0.2) +
    geom_smooth(method = "loess", color = "#3498DB", size = 1.2, linetype = "dashed", se = FALSE) +
    scale_fill_gradient(low = "#FFFFBF", high = "#1A9850", name = "Song\nDensity") +
    labs(
      title = ifelse(!is.null(main_title), main_title, 
                     paste0(cor_result$genre, " - ", cor_result$period)),
      subtitle = sprintf("Spearman r = %.3f | Pearson r = %.3f | n = %d",
                        cor_result$correlation_spearman,
                        cor_result$correlation_pearson,
                        cor_result$sample_size),
      x = "Loudness (dB)",
      y = "Track Popularity"
    ) +
    theme_minimal(base_size = 11) +
    theme(
      plot.title = element_text(face = "bold", size = 13),
      plot.subtitle = element_text(size = 10, color = "gray40"),
      legend.position = "right"
    )
  
  return(p)
}

# Main analysis pipeline
generate_loudness_comparison <- function(data, genres, year_cutoff = 2010) {
  
  data_before <- data %>% filter(release_year < year_cutoff)
  data_after <- data %>% filter(release_year >= year_cutoff)
  
  for (genre in genres) {
    
    cat("\n\n### Genre Analysis:", toupper(genre), "\n")
    
    # Get sample sizes
    n_before <- data_before %>% filter(playlist_genre == genre) %>% nrow()
    n_after <- data_after %>% filter(playlist_genre == genre) %>% nrow()
    
    cat("Sample sizes - Before 2010:", n_before, "| After 2010:", n_after, "\n")
    
    # Analyze both periods
    cor_before <- analyze_loudness_correlation(data_before, genre, "Before 2010")
    cor_after <- analyze_loudness_correlation(data_after, genre, "After 2010")
    
    if (is.null(cor_before) || is.null(cor_after)) {
      cat("Skipped due to insufficient data.\n")
      next
    }
    
    # Display correlation values
    cat(sprintf("\nLoudness-Popularity Correlation (Spearman):\n"))
    cat(sprintf("  Before 2010: r = %.3f\n", cor_before$correlation_spearman))
    cat(sprintf("  After 2010:  r = %.3f\n", cor_after$correlation_spearman))
    cat(sprintf("  Change:      Δr = %.3f\n", 
                cor_after$correlation_spearman - cor_before$correlation_spearman))
    
    # Create side-by-side plots
    p1 <- plot_loudness_correlation(cor_before, 
                                    paste0(toupper(genre), " - Before 2010"))
    p2 <- plot_loudness_correlation(cor_after, 
                                    paste0(toupper(genre), " - After 2010"))
    
    # Display plots side by side using gridExtra
    grid.arrange(p1, p2, ncol = 2, 
                 top = textGrob(paste0("Loudness vs Popularity: ", toupper(genre)),
                               gp = gpar(fontsize = 16, fontface = "bold")))
    
    cat("\n", strrep("-", 80), "\n")
  }
}

# Execute analysis
library(gridExtra)
library(grid)

genres <- unique(analysis_data$playlist_genre)
generate_loudness_comparison(analysis_data, genres)
## 
## 
## ### Genre Analysis: POP 
## Sample sizes - Before 2010: 229 | After 2010: 1911 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.095
##   After 2010:  r = 0.001
##   Change:      Δr = -0.094

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: RAP 
## Sample sizes - Before 2010: 289 | After 2010: 1223 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.111
##   After 2010:  r = 0.087
##   Change:      Δr = -0.025

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: ROCK 
## Sample sizes - Before 2010: 1294 | After 2010: 208 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.075
##   After 2010:  r = -0.026
##   Change:      Δr = -0.101

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: LATIN 
## Sample sizes - Before 2010: 214 | After 2010: 1631 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.072
##   After 2010:  r = 0.045
##   Change:      Δr = -0.027

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: R&B 
## Sample sizes - Before 2010: 299 | After 2010: 1243 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.087
##   After 2010:  r = 0.137
##   Change:      Δr = 0.051

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: EDM 
## Sample sizes - Before 2010: 16 | After 2010: 895
## Skipped due to insufficient data.
# Create comprehensive summary table

compute_loudness_statistics <- function(data, genres, year_cutoff = 2010) {
  
  results_list <- list()
  
  for (genre in genres) {
    
    # Before 2010
    before_data <- data %>%
      filter(playlist_genre == genre, release_year < year_cutoff) %>%
      select(track_popularity, loudness)
    
    # After 2010
    after_data <- data %>%
      filter(playlist_genre == genre, release_year >= year_cutoff) %>%
      select(track_popularity, loudness)
    
    if (nrow(before_data) < MIN_SAMPLE_SIZE || nrow(after_data) < MIN_SAMPLE_SIZE) {
      next
    }
    
    # Compute correlations
    cor_before_spearman <- cor(before_data$track_popularity, before_data$loudness, 
                               method = "spearman", use = "complete.obs")
    cor_after_spearman <- cor(after_data$track_popularity, after_data$loudness, 
                              method = "spearman", use = "complete.obs")
    
    cor_before_pearson <- cor(before_data$track_popularity, before_data$loudness, 
                              method = "pearson", use = "complete.obs")
    cor_after_pearson <- cor(after_data$track_popularity, after_data$loudness, 
                             method = "pearson", use = "complete.obs")
    
    # Fisher's Z transformation for significance testing
    z_before <- 0.5 * log((1 + cor_before_spearman) / (1 - cor_before_spearman))
    z_after <- 0.5 * log((1 + cor_after_spearman) / (1 - cor_after_spearman))
    
    se_diff <- sqrt(1/(nrow(before_data) - 3) + 1/(nrow(after_data) - 3))
    z_stat <- (z_after - z_before) / se_diff
    p_value <- 2 * pnorm(-abs(z_stat))
    
    # Average loudness by period
    avg_loudness_before <- mean(before_data$loudness, na.rm = TRUE)
    avg_loudness_after <- mean(after_data$loudness, na.rm = TRUE)
    
    results_list[[genre]] <- tibble(
      Genre = genre,
      `Before 2010 (Spearman)` = round(cor_before_spearman, 3),
      `After 2010 (Spearman)` = round(cor_after_spearman, 3),
      `Change` = round(cor_after_spearman - cor_before_spearman, 3),
      `Before 2010 (Pearson)` = round(cor_before_pearson, 3),
      `After 2010 (Pearson)` = round(cor_after_pearson, 3),
      `Z-Statistic` = round(z_stat, 2),
      `P-Value` = round(p_value, 4),
      `Significant` = ifelse(p_value < 0.05, "Yes", "No"),
      `Avg Loudness Before` = round(avg_loudness_before, 2),
      `Avg Loudness After` = round(avg_loudness_after, 2),
      `n Before` = nrow(before_data),
      `n After` = nrow(after_data)
    )
  }
  
  bind_rows(results_list)
}

# Generate statistics
loudness_stats <- compute_loudness_statistics(analysis_data, genres)

# Display comprehensive table
kable(loudness_stats,
      caption = "Loudness-Popularity Correlation Analysis by Genre (Before vs After 2010)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE,
                font_size = 12) %>%
  row_spec(which(loudness_stats$Significant == "Yes"), 
           bold = TRUE, 
           background = "#E8F5E9") %>%
  row_spec(which(abs(loudness_stats$Change) > 0.15), 
           color = "white",
           background = "#1DB954") %>%
  add_header_above(c(" " = 1, "Spearman Correlation" = 3, "Pearson Correlation" = 2, 
                     "Statistical Test" = 3, "Descriptive Stats" = 2, "Sample Size" = 2))
Loudness-Popularity Correlation Analysis by Genre (Before vs After 2010)
Spearman Correlation
Pearson Correlation
Statistical Test
Descriptive Stats
Sample Size
Genre Before 2010 (Spearman) After 2010 (Spearman) Change Before 2010 (Pearson) After 2010 (Pearson) Z-Statistic P-Value Significant Avg Loudness Before Avg Loudness After n Before n After
pop 0.095 0.001 -0.094 0.131 0.005 -1.34 0.1793 No -6.88 -5.69 229 1911
rap 0.111 0.087 -0.025 0.078 0.119 -0.38 0.7043 No -6.41 -6.59 289 1223
rock 0.075 -0.026 -0.101 0.066 -0.031 -1.35 0.1782 No -7.97 -5.97 1294 208
latin 0.072 0.045 -0.027 0.075 0.017 -0.38 0.7062 No -6.52 -5.31 214 1631
r&b 0.087 0.137 0.051 0.094 0.121 0.79 0.4267 No -7.76 -7.05 299 1243
# Visualize correlation changes across genres

loudness_plot_data <- loudness_stats %>%
  select(Genre, `Before 2010 (Spearman)`, `After 2010 (Spearman)`, Change, Significant) %>%
  pivot_longer(cols = c(`Before 2010 (Spearman)`, `After 2010 (Spearman)`),
               names_to = "Period",
               values_to = "Correlation") %>%
  mutate(Period = gsub(" \\(Spearman\\)", "", Period))

ggplot(loudness_plot_data, aes(x = Genre, y = Correlation, fill = Period)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7, alpha = 0.9) +
  geom_text(aes(label = sprintf("%.3f", Correlation)), 
            position = position_dodge(width = 0.7), 
            vjust = -0.5, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Before 2010" = "#E74C3C", "After 2010" = "#1A9850")) +
  labs(
    title = "Loudness-Popularity Correlation by Genre: Evolution Over Time",
    subtitle = "How the relationship between loudness and popularity changed after 2010",
    x = "Genre",
    y = "Spearman Correlation Coefficient",
    fill = "Time Period",
    caption = "Positive values = louder songs tend to be more popular"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 15, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 11, face = "bold"),
    legend.position = "top",
    panel.grid.major.x = element_blank()
  ) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50")

cat("\n## KEY INSIGHTS: LOUDNESS & POPULARITY\n\n")
## 
## ## KEY INSIGHTS: LOUDNESS & POPULARITY
for (i in 1:nrow(loudness_stats)) {
  row <- loudness_stats[i, ]
  
  cat("###", toupper(row$Genre), "\n")
  cat(sprintf("- Correlation Before 2010: r = %.3f\n", row$`Before 2010 (Spearman)`))
  cat(sprintf("- Correlation After 2010: r = %.3f\n", row$`After 2010 (Spearman)`))
  cat(sprintf("- Change: Δr = %.3f", row$Change))
  
  if (row$Significant == "Yes") {
    cat(" ✓ STATISTICALLY SIGNIFICANT\n")
  } else {
    cat(" (not significant)\n")
  }
  
  # Interpretation
  if (row$`After 2010 (Spearman)` > 0.2) {
    cat(sprintf("- Interpretation: Louder songs tend to be MORE popular in %s (moderate-strong effect)\n", row$Genre))
  } else if (row$`After 2010 (Spearman)` > 0) {
    cat(sprintf("- Interpretation: Slight preference for louder songs in %s\n", row$Genre))
  } else {
    cat(sprintf("- Interpretation: Loudness has little/no effect on popularity in %s\n", row$Genre))
  }
  
  cat("\n")
}
## ### POP 
## - Correlation Before 2010: r = 0.095
## - Correlation After 2010: r = 0.001
## - Change: Δr = -0.094 (not significant)
## - Interpretation: Slight preference for louder songs in pop
## 
## ### RAP 
## - Correlation Before 2010: r = 0.111
## - Correlation After 2010: r = 0.087
## - Change: Δr = -0.025 (not significant)
## - Interpretation: Slight preference for louder songs in rap
## 
## ### ROCK 
## - Correlation Before 2010: r = 0.075
## - Correlation After 2010: r = -0.026
## - Change: Δr = -0.101 (not significant)
## - Interpretation: Loudness has little/no effect on popularity in rock
## 
## ### LATIN 
## - Correlation Before 2010: r = 0.072
## - Correlation After 2010: r = 0.045
## - Change: Δr = -0.027 (not significant)
## - Interpretation: Slight preference for louder songs in latin
## 
## ### R&B 
## - Correlation Before 2010: r = 0.087
## - Correlation After 2010: r = 0.137
## - Change: Δr = 0.051 (not significant)
## - Interpretation: Slight preference for louder songs in r&b
cat("\n## OVERALL SUMMARY\n")
## 
## ## OVERALL SUMMARY
cat(sprintf("- Genres where loudness matters MOST after 2010: %s\n",
            paste(loudness_stats %>% 
                    arrange(desc(`After 2010 (Spearman)`)) %>% 
                    head(3) %>% 
                    pull(Genre), collapse = ", ")))
## - Genres where loudness matters MOST after 2010: r&b, rap, latin
cat(sprintf("- Biggest correlation increase: %s (Δr = %.3f)\n",
            loudness_stats %>% 
              arrange(desc(Change)) %>% 
              slice(1) %>% 
              pull(Genre),
            loudness_stats %>% 
              arrange(desc(Change)) %>% 
              slice(1) %>% 
              pull(Change)))
## - Biggest correlation increase: r&b (Δr = 0.051)
cat(sprintf("- Statistically significant changes: %d out of %d genres\n",
            sum(loudness_stats$Significant == "Yes"),
            nrow(loudness_stats)))
## - Statistically significant changes: 0 out of 5 genres

UPDATED CORRELATION MATRIX PLOTS (Popularity >= 20)

# Filter data to include songs with popularity >= 20
spotify_filtered <- spotify_clean_2 %>%
  filter(track_popularity >= 20)

# Correlation for songs BEFORE 2010 (popularity >= 20)
cor_before <- spotify_filtered %>%
  filter(release_year < 2010) %>%
  select(track_popularity, danceability, energy, loudness, 
         acousticness, instrumentalness, liveness, positivity, 
         tempo, duration_min, speech_ratio) %>%
  cor(use = "complete.obs")

# Correlation for songs AFTER 2010 (popularity >= 20)
cor_after <- spotify_filtered %>%
  filter(release_year >= 2010) %>%
  select(track_popularity, danceability, energy, loudness, 
         acousticness, instrumentalness, liveness, positivity, 
         tempo, duration_min, speech_ratio) %>%
  cor(use = "complete.obs")

# Set up side-by-side plots
par(mfrow = c(1, 2))

# Plot Before 2010
corrplot(cor_before, 
         method = "color", 
         type = "upper",
         tl.col = "black", 
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.6,
         col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
         title = "Before 2010 (Popularity >= 20)",
         mar = c(0,0,2,0),
         tl.cex = 0.8)

# Plot After 2010
corrplot(cor_after, 
         method = "color", 
         type = "upper",
         tl.col = "black", 
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.6,
         col = colorRampPalette(c("#E74C3C", "white", "#3498DB"))(200),
         title = "After 2010 (Popularity >= 20)",
         mar = c(0,0,2,0),
         tl.cex = 0.8)

# Reset plot layout
par(mfrow = c(1, 1))

# Display sample size information
cat("Sample sizes for correlation analysis (Popularity >= 20):\n")
## Sample sizes for correlation analysis (Popularity >= 20):
cat("Before 2010:", nrow(spotify_filtered %>% filter(release_year < 2010)), "songs\n")
## Before 2010: 6262 songs
cat("After 2010:", nrow(spotify_filtered %>% filter(release_year >= 2010)), "songs\n")
## After 2010: 19341 songs

UPDATED LOUDNESS-POPULARITY SCATTER PLOTS (Popularity >= 20)

# Filter data to include songs with popularity >= 20
analysis_data_updated <- spotify_clean_2 %>%
  filter(track_popularity >= 20)

# Function: Analyze loudness correlation for each genre
analyze_loudness_correlation <- function(data, genre_name, period_label, year_threshold = 2010) {
  
  # Filter and prepare data
  genre_data <- data %>%
    filter(playlist_genre == genre_name) %>%
    select(track_popularity, loudness)
  
  # Check sample size adequacy
  n_obs <- nrow(genre_data)
  MIN_SAMPLE_SIZE <- 30
  if (n_obs < MIN_SAMPLE_SIZE) {
    warning(paste0("Insufficient data for ", genre_name, " - ", period_label, 
                   " (n=", n_obs, ", required: ", MIN_SAMPLE_SIZE, ")"))
    return(NULL)
  }
  
  # Compute correlations
  cor_value <- cor(genre_data$track_popularity, genre_data$loudness, 
                   method = "spearman", use = "pairwise.complete.obs")
  cor_pearson <- cor(genre_data$track_popularity, genre_data$loudness, 
                     method = "pearson", use = "pairwise.complete.obs")
  
  return(list(
    correlation_spearman = cor_value,
    correlation_pearson = cor_pearson,
    sample_size = n_obs,
    genre = genre_name,
    period = period_label,
    data = genre_data
  ))
}

# Function: Create scatter plot with correlation
plot_loudness_correlation <- function(cor_result, main_title = NULL) {
  
  if (is.null(cor_result)) return(invisible(NULL))
  
  # Create scatter plot
  p <- ggplot(cor_result$data, aes(x = loudness, y = track_popularity)) +
    geom_hex(bins = 40, alpha = 0.8) +
    geom_smooth(method = "lm", color = "#E74C3C", size = 1.5, se = TRUE, alpha = 0.2) +
    geom_smooth(method = "loess", color = "#3498DB", size = 1.2, linetype = "dashed", se = FALSE) +
    scale_fill_gradient(low = "#FFFFBF", high = "#1A9850", name = "Song\nDensity") +
    labs(
      title = ifelse(!is.null(main_title), main_title, 
                     paste0(cor_result$genre, " - ", cor_result$period)),
      subtitle = sprintf("Spearman r = %.3f | Pearson r = %.3f | n = %d",
                        cor_result$correlation_spearman,
                        cor_result$correlation_pearson,
                        cor_result$sample_size),
      x = "Loudness (dB)",
      y = "Track Popularity"
    ) +
    theme_minimal(base_size = 11) +
    theme(
      plot.title = element_text(face = "bold", size = 13),
      plot.subtitle = element_text(size = 10, color = "gray40"),
      legend.position = "right"
    )
  
  return(p)
}

# Main analysis pipeline
generate_loudness_comparison <- function(data, genres, year_cutoff = 2010) {
  
  data_before <- data %>% filter(release_year < year_cutoff)
  data_after <- data %>% filter(release_year >= year_cutoff)
  
  for (genre in genres) {
    
    cat("\n\n### Genre Analysis:", toupper(genre), "\n")
    
    # Get sample sizes
    n_before <- data_before %>% filter(playlist_genre == genre) %>% nrow()
    n_after <- data_after %>% filter(playlist_genre == genre) %>% nrow()
    
    cat("Sample sizes - Before 2010:", n_before, "| After 2010:", n_after, "\n")
    
    # Analyze both periods
    cor_before <- analyze_loudness_correlation(data_before, genre, "Before 2010")
    cor_after <- analyze_loudness_correlation(data_after, genre, "After 2010")
    
    if (is.null(cor_before) || is.null(cor_after)) {
      cat("Skipped due to insufficient data.\n")
      next
    }
    
    # Display correlation values
    cat(sprintf("\nLoudness-Popularity Correlation (Spearman):\n"))
    cat(sprintf("  Before 2010: r = %.3f\n", cor_before$correlation_spearman))
    cat(sprintf("  After 2010:  r = %.3f\n", cor_after$correlation_spearman))
    cat(sprintf("  Change:      Δr = %.3f\n", 
                cor_after$correlation_spearman - cor_before$correlation_spearman))
    
    # Create side-by-side plots
    p1 <- plot_loudness_correlation(cor_before, 
                                    paste0(toupper(genre), " - Before 2010"))
    p2 <- plot_loudness_correlation(cor_after, 
                                    paste0(toupper(genre), " - After 2010"))
    
    # Display plots side by side
    grid.arrange(p1, p2, ncol = 2, 
                 top = textGrob(paste0("Loudness vs Popularity: ", toupper(genre)),
                               gp = gpar(fontsize = 16, fontface = "bold")))
    
    cat("\n", strrep("-", 80), "\n")
  }
}

# Execute analysis with filtered data
library(gridExtra)
library(grid)

genres <- unique(analysis_data_updated$playlist_genre)
generate_loudness_comparison(analysis_data_updated, genres)
## 
## 
## ### Genre Analysis: POP 
## Sample sizes - Before 2010: 500 | After 2010: 4034 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.107
##   After 2010:  r = 0.169
##   Change:      Δr = 0.062

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: RAP 
## Sample sizes - Before 2010: 1128 | After 2010: 3522 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.139
##   After 2010:  r = 0.077
##   Change:      Δr = -0.063

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: ROCK 
## Sample sizes - Before 2010: 2584 | After 2010: 1170 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.097
##   After 2010:  r = 0.116
##   Change:      Δr = 0.018

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: LATIN 
## Sample sizes - Before 2010: 603 | After 2010: 3603 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.210
##   After 2010:  r = 0.252
##   Change:      Δr = 0.042

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: R&B 
## Sample sizes - Before 2010: 1368 | After 2010: 2728 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.112
##   After 2010:  r = 0.128
##   Change:      Δr = 0.016

## 
##  -------------------------------------------------------------------------------- 
## 
## 
## ### Genre Analysis: EDM 
## Sample sizes - Before 2010: 79 | After 2010: 4284 
## 
## Loudness-Popularity Correlation (Spearman):
##   Before 2010: r = 0.150
##   After 2010:  r = 0.064
##   Change:      Δr = -0.086

## 
##  --------------------------------------------------------------------------------
cat("\n\nDATA SUMMARY (Popularity >= 20):\n")
## 
## 
## DATA SUMMARY (Popularity >= 20):
cat("Total songs analyzed:", nrow(analysis_data_updated), "\n")
## Total songs analyzed: 25603
cat("Original dataset size:", nrow(spotify_clean_2), "\n")
## Original dataset size: 32801
cat("Songs excluded (popularity < 20):", nrow(spotify_clean_2) - nrow(analysis_data_updated), "\n")
## Songs excluded (popularity < 20): 7198