Exploratory Data Analysis and Regression

Synopsis

Problem Statement: I have been captivated by music throughout my entire life. I am an avid listener, participant, and enjoy the art of songwriting and jamming out with friends. I chose to work with this Spotify data set I pulled from GitHub, because I am curious what there is to be learned from a set of songs, their qualities, artists, and popularity outcomes. In particular, I have often wondered what qualities make a song popular and how industry professionals might be able to predict a “hit” song.

Solution Approach: I chose to explore this data, following data cleaning, by visualizing and using univariate/multivariate statistical methods to answer different questions that arose from my natural curiosity as I got to know the data set and then make an attempt to fit the data to a regression model.

Insights:

Pop, Rap, and Latin are currently the most popular genre types on Spotify in descending order.
The genres which appear to be experiencing the most significant increase of attention from the year 2000 to 2020 were rap, R&B, and to a lesser extent with Latin music. EDM (Electronic Dance Music) reached its pinnacle in 2005 and appears to be making a gradual resurgence to former heights.
Every major genre was more likely to have its songs produced in the major key. Rock has the largest proportion with nearly 70% of songs being major.
Top Artists by Decade (Most Hit Songs):
- 1950’s - Elvis Presley
- 1960’s - The Beatles
- 1970’s - Fleetwood Mac and Queen
- 1980’s - AC/DC, Luis Miguel, and Michael Jackson
- 1990’s - Red Hot Chili Peppers
- 2000’s - Coldplay
- 2010’s - Billie Eilish
- 2020’s - Eminem (So far as of March 2020)
The distribution of popularity scores is unbalanced due to the fact the most songs never seemed to make it off the ground. The bottom 1% and top 25% capture 8% of the popularity scores from the data.
There were two notable independent variables that have an issue with co-linearity. There was a strong positive correlation between loudness and energy (Pearson Correlation - 0.68) and a strong negative correlation between energy and acousticness (Pearson Correlation - -0.55). Loudness & acousticness and danceability & valence also shared a fairly strong correlation.
Energy, instrumentalness, speechiness, and danceability were the strongest predictors of track popularity according to our regression models while duration, tempo, and key, held close to no predictive power.

Packages Required

Here is a comprehensive list of all the packages necessary for purpose of data cleaning, visualization, and analysis.

library(tibble)
library(xtable)
library(DT)
library(knitr)
library(tm)
library(ggplot2)
library(dplyr)
library(plotly)
library(readr)
library(caret)
library(magrittr)
library(reshape2)
library(Hmisc)
library(tidyverse) 
library(modelr)     
library(broom) 
library(pscl)
library(grid)
library(gridExtra)

Data Preparation

All of the procedures in this section are for the purpose of preparing for data analysis.

Data Import

Spotify Songs data set from GitHub The data set of spotify songs contains 23 variables and 32,833 songs from 1957-2020. There are 10,693 artists and 6 main genres with sub-categories for each.

This dataset was downloaded from Github

More information about the data here.

Spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
class(Spotify_songs)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

colnames(Spotify_songs)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

dim(Spotify_songs)

## [1] 32833    23

Data Cleaning

We’ll start cleaning the data by scanning for any missing values within the data set.

colSums(is.na(Spotify_songs))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

There are only 5 observations with missing values. These songs still contain useful information and are distinguishable by their id, so we will simply leave them in the data set.

print(Spotify_songs[rowSums(is.na(Spotify_songs)) > 0,])

## # A tibble: 5 x 23
##   track_id track_name track_artist track_popularity track_album_id
##   <chr>    <chr>      <chr>                   <dbl> <chr>         
## 1 69gRFGO~ <NA>       <NA>                        0 717UG2du6utFe~
## 2 5cjecvX~ <NA>       <NA>                        0 3luHJEPw434tv~
## 3 5TTzhRS~ <NA>       <NA>                        0 3luHJEPw434tv~
## 4 3VKFip3~ <NA>       <NA>                        0 717UG2du6utFe~
## 5 69gRFGO~ <NA>       <NA>                        0 717UG2du6utFe~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

We will now continue forward by removing any duplicates.

The dimensions following the removal of duplicates by id and name are different. This tells us that there are songs within the data set that contain the same name but have a different id. We want to remove these duplicates, but we want to keep our songs titled ‘NA’.

Dimensions after removing track_id duplicates:

print(dim(Spotify_songs[!duplicated(Spotify_songs$track_id),]))

## [1] 28356    23

Dimensions after removing track_name duplicate:

print(dim(Spotify_songs[!duplicated(Spotify_songs$track_name),]))

## [1] 23450    23

Again, we want to remove all of our name duplicates, but we want to keep our songs titled ‘NA’ in track_name.

Spotify_songs_nonamedups = Spotify_songs[!duplicated(Spotify_songs$track_name),]
Spotify_songs_NA = Spotify_songs[rowSums(is.na(Spotify_songs)) > 0,]

Spotify_songs = merge(Spotify_songs_nonamedups, Spotify_songs_NA, all= TRUE, sort = FALSE)

After all our cleaning has been done, I’ll check the rows and columns one more time.

There are 23454 unique songs.

dim(Spotify_songs)

## [1] 23454    23

Data Preview

In this dataset, the rows represent a unique song, and the columns give us information about the songs.

library(DT)
datatable(head(Spotify_songs,25))

Data Description

Here, I will provide a table with the variable names, types, and descriptions.

Var_type <- lapply(Spotify_songs, class)
var_description <- c("Song unique ID", "Song Name", "Song Artist", "Song Popularity (0-100) where higher is better", "Album unique ID", "Song album name", "Date when album released", "Nammme of playlist", " Playlist ID", "Playlist genre", "Playlist subgenre", "Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.", "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.", "The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.", "The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.", "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.", "    Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.", "A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.", "Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.", "    Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.", "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).", "The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.", "  Duration of song in milliseconds")
Variable_names <- colnames(Spotify_songs)
data_description <- as_data_frame(cbind(Variable_names,Var_type, var_description))
colnames(data_description) <- c("Variable Name", "Data Type", "Variable Description")
library(knitr)
kable(data_description)

Variable Name	Data Type	Variable Description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	numeric	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Nammme of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	numeric	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	numeric	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	numeric	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C<U+266F>/D<U+266D>, 2 = D, and so on. If no key was detected, the value is -1.
loudness	numeric	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
mode	numeric	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	numeric	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	numeric	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	numeric	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	numeric	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	numeric	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	numeric	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	numeric	Duration of song in milliseconds

Exploratory Data Analysis

Genre Popularity

First, I wanted to explore the popularity of different genres. I found that Pop, Rap, Latin, Rock, R&B, and EDM were the genres from most to least popular.

Genre_popularity <- aggregate(subset(Spotify_songs, select=c(track_popularity)), list(Spotify_songs$playlist_genre), mean)
attach(Genre_popularity)
Genre_popularity <- Genre_popularity[order(-track_popularity),]
Genre_popularity$track_popularity <- round(Genre_popularity$track_popularity, 1)

ggplot(Genre_popularity, aes(x=reorder(Group.1, -track_popularity), y=track_popularity, color=Group.1)) +
  geom_bar(stat="identity", fill="white")+
  geom_text(aes(label=track_popularity), vjust=1.6, color="black", size=3.5) +
  ggtitle("Mean Popularity by Genre")+
  xlab("Genre")+ 
  ylab("Popularity") + 
  theme(legend.title = element_blank())

Genre Popularity by Release Date

I then wanted to visualize how the popularity of each genre changes with respect to its release date. The genres which appear to be experiencing the most significant increase of attention from the year 2000 to 2020 were rap, R&B, and, to a lesser extent, Latin music. Notably, EDM music reached its pinnacle in 2005 and appears to be making a return.

Spotify_songs %>%
  mutate(Year =  as.Date(substr(Spotify_songs$track_album_release_date, 1, 4), '%Y')) %>%
  filter(track_album_release_date >"2000 -1-1") %>%
  select(playlist_genre, Year, track_popularity) %>%
  group_by(playlist_genre, Year) %>%
  summarise(Mean = mean(track_popularity), .groups="drop") %>%
  ggplot(aes(x=Year, y=Mean, color=playlist_genre)) + 
  geom_line(aes(group=playlist_genre), size=.5) + 
  facet_wrap(~playlist_genre)

  theme(axis.text.x=element_text(angle=45, hjust=1))+
  ggtitle("Genre Popularity over time")+
  xlab("Time (Years)")+ 
  ylab("Popularity") + 
  labs(color='Genres')

Major and Minor Key Proportion

The musician in me was curious to see how often the major and minor keys were used for each genre. I was expecting to see the major key more often, but I was surprised to see that it captured the majority for every single genre. Pop and rock especially are very likely to be produced in the major key.

Spotify_songs %>% 
  group_by(playlist_genre, mode) %>% 
  summarise(count = n()) %>%
  mutate(prop = round(count/sum(count), digits = 2)) %>%
  ggplot(aes(x=playlist_genre, y=prop, fill=mode)) + geom_bar(stat="identity") + geom_label(aes(label = prop), position = position_stack(vjust = 0.5)) +
  ggtitle("Proportion of mode for each genre")+
  xlab("Genres")+ 
  ylab("Proportion") + labs(color='Genres') +
  labs(fill="Mode")

Artists - Most Hits by Decade

Here, I wanted to find out the number of hit songs released by each artist per decade. I’ve enjoyed music from the 1950’s and on so it was interesting for me to see what the most popular artists from each decade are according to this data set. I’m happy to see The Beatles and Queen performing well in their respective decades!

Top_hit_artists_by_decade <- Spotify_songs %>% 
  select(track_album_release_date, track_artist, track_id, track_popularity) %>%
  mutate(Decade = (as.numeric(substr(Spotify_songs$track_album_release_date, 1,4))) - (as.numeric(substr(Spotify_songs$track_album_release_date, 1,4)) %% 10)) %>%
  filter(track_popularity >= 70) %>%
  group_by(Decade,track_artist) %>%
  summarise(Number_of_Hits = n()) %>%
  group_by(Decade) %>%
  top_n(1, Number_of_Hits)
datatable(Top_hit_artists_by_decade)

Popularity Scores

I decided to start exploring the popularity scores by plotting a histogram. Upon doing this, I have notice a large portion of the songs are not popular. There are very few songs with a popularity greater than 75. The top 25% captures 8% of our observations, but the bottom 1% captures 8%, so the popularity scores are quite unbalanced on the low end. There is a mean popularity score of about 39.

(Spotify_songs %>% filter(track_popularity > 75) %>% summarise(Popularity_over_75_ratio = n())) / nrow(Spotify_songs)

ggplot(Spotify_songs, aes(x=track_popularity)) + geom_histogram(aes(y=..density..), color="black", fill="white") + 
  geom_density(alpha=.2, fill='#FF6666') +
  geom_vline(aes(xintercept=mean(track_popularity)), color='blue', linetype='dashed', size=1) +
  geom_vline(aes(xintercept=75), color='black', size=1) +
  ggtitle("Popularity Scores - Histogram") +
  xlab("Track Popularity") +
  ylab("Density")

Variable Correlation

Here is a list of all the correlations between our independent and dependent numeric variables sorted by significance.

Spotify_numeric <- subset(Spotify_songs, select = -c(track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, playlist_subgenre, mode, key))
     
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
    )
} 
      
cormat <- rcorr(as.matrix(Spotify_numeric), type= 'pearson')
flattened_cormat <- flattenCorrMatrix(cormat$r, cormat$P)
flattened_cormat$cor <- round(flattened_cormat$cor, 2)
flattened_cormat$p <- round(flattened_cormat$p, 3)
flattened_cormat <- flattened_cormat[order(flattened_cormat$p, flattened_cormat$cor),]

datatable(flattened_cormat)

Correlation Heat Map

I wanted to look at a correlation heat map of this data to visualize any large correlations between the independent variables and track popularity. Upon inspection, it does not appear that this is the case, but there are subtle correlations that are statistically significant. Notably, there does not appear to be many strong co-linearity issues among independent variables except between loudness-energy and loudness-acousticness.

melted_cormat <- melt(cormat$r)

ggplot(data = melted_cormat, aes(Var2, Var1, fill=value)) + 
  geom_tile(color = "white") + 
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  coord_fixed()

Regression Analysis

Linear Regression - Attempt 1

Building the Model

I’m going to start by modeling this data with a linear regression to try to predict the popularity of songs. I will use the standard procedure of splitting the data set into a training set and a test set with a 60% to 40% split, respectively. The R-squared from this regression tells us this is not a very effective model because it sits around 0.062. This suggests that the model does not do a good job of explaining the variance in track popularity.

set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(Spotify_songs), replace = T, prob = c(0.6,0.4))
train <- Spotify_songs[sample, ]
test <- Spotify_songs[!sample, ]

model <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train)

summary(model)

## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + key + 
##     loudness + mode + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.207 -16.374   2.824  17.756  63.221 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.572e+01  2.355e+00  27.907  < 2e-16 ***
## danceability      4.853e+00  1.480e+00   3.278 0.001047 ** 
## energy           -2.311e+01  1.687e+00 -13.703  < 2e-16 ***
## key               6.175e-03  5.371e-02   0.115 0.908469    
## loudness          1.217e+00  9.054e-02  13.440  < 2e-16 ***
## mode              1.462e+00  3.928e-01   3.722 0.000198 ***
## speechiness      -6.257e+00  1.883e+00  -3.323 0.000892 ***
## acousticness      4.977e+00  1.024e+00   4.863 1.17e-06 ***
## instrumentalness -9.189e+00  8.549e-01 -10.748  < 2e-16 ***
## liveness         -4.346e+00  1.260e+00  -3.450 0.000563 ***
## valence           2.677e+00  9.163e-01   2.922 0.003488 ** 
## tempo             3.074e-02  7.253e-03   4.238 2.27e-05 ***
## duration_ms      -4.124e-05  3.179e-06 -12.971  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.66 on 14090 degrees of freedom
## Multiple R-squared:  0.06297,    Adjusted R-squared:  0.06217 
## F-statistic: 78.91 on 12 and 14090 DF,  p-value: < 2.2e-16

Visual Assessment

When visualizing the performance of our model. It is clear that our model fit is far from optimal based on the spread between the residuals and fitted values. Also, error between our prediction and actual values as indicated by the second chart shows our predictions are not very accurate.

rf_1 <- ggplot(model, aes(.fitted, .resid)) +
  geom_ref_line(h = 0) +
  geom_point() +
  geom_smooth(se = FALSE)
  ggtitle("Residuals vs Fitted")

rf_1

plot(predict(model), train$track_popularity,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1)

From the Q-Q plot, we can see that the residuals are not normally distributed. This is evident by the snaking line. The predictions of popularity for values in the quantiles around the mean are fit well to the line, but the extreme values have significant residuals.

model_results <- augment(model, train)

model_results <- model_results %>%
  mutate(Model = "Model 1")

qqnorm(model_results$.resid); qqline(model_results$.resid)

Coefficient Assessment

Based on estimate values, the attributes which most contribute to the prediction are:

Energy
Instrumentalness
Speechiness
Acousticness
Danceability

Based on estimate values, the attributes which least contribute to the prediction are:

Key
Tempo
Duration

The coefficients indicate the significance of incremental change for each variable in its prediction of track popularity. The negative values will cause a prediction of lower track popularity and positive values predict higher track popularity.

coefficients <- summary(model)$coefficients[2:13,1]
coefficients <- data.frame(round(coefficients, digits=2))
colnames(coefficients) <- "Coefficients" 
Percentage = 100*round((abs(coefficients$Coefficients)/sum(abs(coefficients$Coefficients))),2)
coefficients = cbind(coefficients, Percentage)
coefficients <- coefficients[order(-coefficients$Percentage),,drop=FALSE]
coefficients$Percentage <- paste(as.character(coefficients$Percentage),"%")

datatable(coefficients)

Linear Regression Attempt 2

Building the Model

From earlier we noticed that our track popularity was unbalanced. The majority of tracks are unpopular. To try to improve the linear regression model, I will continue on by using under-sampling. This will cause the regression to balance the classes and give more value to the popular songs we care about.

I’m going to approach this task by taking all the samples from popular tracks and then randomly sampling an equal amount of observations from the unpopular category. Since the cutoff for popular and unpopular songs is arbitrarily chosen, I will choose multiple cutoffs and try linear regression on each.

Looking at the r squared and significance values for each model, I believe our best option would be to choose our cutoff for popular/not popular at around 70. Any higher and we start to lose too much data, and the independent variables start losing their significance. At 70 track popularity, we have already sliced off about 84% of our dataset. Our r-squared is slightly better (0.11 at 75 cutoff) on our under-sampled regressions compared with 0.63, so this suggests our newer model better explains the variance of track popularity, but this is still a poor r-squared.

library(ggplot2)
Popularity_cutoffs <- list(45, 50, 60, 70)
model_r.squared <- list(0.062)
model_pvalue <- list(0)

for(x in 1:length(Popularity_cutoffs)){
  
  Popular <- Spotify_songs[which(Spotify_songs$track_popularity >= Popularity_cutoffs[x]),]
  
  Unpopular <- Spotify_songs[which(Spotify_songs$track_popularity < Popularity_cutoffs[x]),]
  
  Unpopular_sampled <- Unpopular[sample(nrow(Unpopular), NROW(Popular)),]
  
  Undersampled <- rbind(Popular, Unpopular_sampled)
  
  set.seed(123)
  sample <- sample(c(TRUE, FALSE), nrow(Undersampled), replace = T, prob = c(0.6,0.4))
  train <- Undersampled[sample, ]
  test <- Undersampled[!sample, ]

  model <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train)
  
  model_r.squared[x+1] = summary(model)$r.squared
  model_pvalue[x+1] = glance(model)$p.value
}

Cutoffs <- list('Orginal model', 'Cutoff - 45', 'Cutoff - 50', 'Cutoff - 60', 'Cutoff - 70')
model_r.squared <- unlist(model_r.squared, recursive=TRUE)
model_pvalue <- unlist(model_pvalue, recursive=TRUE)

Reg_summary <- cbind(data.frame(model_r.squared), data.frame(model_pvalue))
rownames(Reg_summary) <- Cutoffs

sum(Spotify_songs$track_popularity>70)/sum(Spotify_songs$track_popularity<=100)
kable(Reg_summary)

print(summary(model))

Visual Assessment

Our model is still performing poorly. Our residuals vs. fitted plot tells us there is still a poor linear relationship between the independent variables and track popularity. Due to these concerns, we may need to use more data to create an effective linear regression or use a different model to predict track popularity

rf_2 <- ggplot(model, aes(.fitted, .resid)) +
  geom_ref_line(h = 0) +
  geom_point() +
  geom_smooth(se = FALSE)
  ggtitle("Residuals vs Fitted")

grid.arrange(rf_1, rf_2, ncol=2)

Our normal Q-Q plot tells us we still have concerns with the distribution of our residuals, because we have large deviation on the extremes.

model_results <- augment(model, train)

model_results <- model_results %>%
  mutate(Model = "Model 1")

qqnorm(model_results$.resid); qqline(model_results$.resid)

Coefficient Assessment

Based on estimate values, the attributes which most contribute to the prediction are:

energy
Instrumentalness
Speechiness
Valence

Based on estimate values, the attributes which least contribute to the prediction are:

Key
Mode
Tempo
Duration

coefficients <- summary(model)$coefficients[2:13,1]
coefficients <- data.frame(round(coefficients, digits=2))
colnames(coefficients) <- "Coefficients" 
Percentage = 100*round((abs(coefficients$Coefficients)/sum(abs(coefficients$Coefficients))),2)
coefficients = cbind(coefficients, Percentage)
coefficients <- coefficients[order(-coefficients$Percentage),,drop=FALSE]
coefficients$Percentage <- paste(as.character(coefficients$Percentage),"%")

datatable(coefficients)

Logistic Regression

Building the Model

Since our Multiple Linear Regression model did not turn out as well as hope in its ability to predict track popularity, I want to give logistic regression an attempt in trying to predict popular/not popular as a binary variable. I am not expecting an optimal solution, because I have begun to expect the optimal model is not a linear, though, It is possible that this model could turn out to be more useful in its predictive power.

Spotify_songs$Popular <- ifelse(Spotify_songs$track_popularity>80,1,0)

Popular <- Spotify_songs[which(Spotify_songs$track_popularity >= 80),]
  
  Unpopular <- Spotify_songs[which(Spotify_songs$track_popularity < 50),]
  
  Unpopular_sampled <- Unpopular[sample(nrow(Unpopular), NROW(Popular)),]
  
  Undersampled <- rbind(Popular, Unpopular_sampled)
  
  set.seed(123)
  sample <- sample(c(TRUE, FALSE), nrow(Undersampled), replace = T, prob = c(0.6,0.4))
  train <- Undersampled[sample, ]
  test <- Undersampled[!sample, ]


model <- glm(Popular ~ danceability + energy + key + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, family = "binomial", data = train)
tidy(model)

## # A tibble: 12 x 5
##    term                estimate  std.error statistic   p.value
##    <chr>                  <dbl>      <dbl>     <dbl>     <dbl>
##  1 (Intercept)       0.864      1.05           0.824 0.410    
##  2 danceability      1.74       0.865          2.01  0.0445   
##  3 energy           -2.68       0.710         -3.78  0.000156 
##  4 key              -0.0187     0.0271        -0.691 0.489    
##  5 mode             -0.0695     0.201         -0.346 0.729    
##  6 speechiness      -0.533      0.935         -0.570 0.569    
##  7 acousticness     -0.0929     0.468         -0.198 0.843    
##  8 instrumentalness -5.07       1.31          -3.87  0.000111 
##  9 liveness         -0.256      0.709         -0.362 0.717    
## 10 valence           0.0537     0.511          0.105 0.916    
## 11 tempo             0.0141     0.00362        3.90  0.0000979
## 12 duration_ms      -0.00000880 0.00000220    -3.99  0.0000648

Model Evaluation

The varImp function gives us the most influential independent variables for predicting the dependent.

imp <- as.data.frame(caret::varImp(model))
imp <- data.frame(overall = imp$Overall,
           names   = rownames(imp))
imp[order(imp$overall,decreasing = T),]

##      overall            names
## 11 3.9945946      duration_ms
## 10 3.8958297            tempo
## 7  3.8662017 instrumentalness
## 2  3.7809327           energy
## 1  2.0094717     danceability
## 3  0.6913609              key
## 5  0.5695189      speechiness
## 8  0.3618950         liveness
## 4  0.3459827             mode
## 6  0.1984020     acousticness
## 9  0.1052666          valence

McFadden’s R^2 is for representing the variance captured by the predictors. A score of 0.121 is quite low, though, it is not abysmal.

pscl::pR2(model)["McFadden"]

## fitting null model for pseudo-r2

##  McFadden 
## 0.1208498

Validation of Predicted Values

Without under-sampling, the sensitivity was very poor. The classifier would always classify songs as unpopular due to the imbalance in our data set. I chose the cutoff for the binary classification at 80 popularity to balance our sensitivity vs. specificity. When the cutoff is higher, our model will too often classify songs as popular. In other words, we will have good sensitivity but poor specificity. As the cutoff lowers, there are more Type 1 errors but less Type 2 error, and we can achieve a balance between the sensitivity and specificity of our classifier.

test.predicted.m3 <- predict(model, newdata = test, type = "response")
table(test$Popular, test.predicted.m3 > 0.5) %>% prop.table() %>% round(3)

##    
##     FALSE  TRUE
##   0 0.476 0.127
##   1 0.194 0.203

Proportion of popular songs predicted (Sensitivity) -> 67/131 = 51.1%

Proportion of unpopular songs predicted (Specificity) -> 1 - 42/199 = 78.9%

At our current popularity level we were able to correctly predict 51.1% of the popular songs and 78.9% of the unpopular songs.

table(test$Popular, test.predicted.m3 > 0.5)

##    
##     FALSE TRUE
##   0   157   42
##   1    64   67

Our ROC allows us to visualize the trade-off between correct and incorrect predictions. The visualization validates our point in the previous section that our model struggles to balance sensitivity vs. specificity. An ideal model would stick more closely to the y-axis.

library(ROCR)

par(mfrow=c(1, 2))

prediction(test.predicted.m3, test$Popular) %>%
  performance(measure = "tpr", x.measure = "fpr") %>%
  plot()

An AUC value of 0.72 indicates that our model has some separation capacity between classes, but again, it is again rather poor.

prediction(test.predicted.m3, test$Popular) %>%
  performance(measure = "auc") %>%
  .@y.values

## [[1]]
## [1] 0.7198588

coefficients <- summary(model)$coefficients[2:12,1]
coefficients <- data.frame(round(coefficients, digits=2))
colnames(coefficients) <- "Coefficients" 
Percentage = 100*round((abs(coefficients$Coefficients)/sum(abs(coefficients$Coefficients))),2)
coefficients = cbind(coefficients, Percentage)
coefficients <- coefficients[order(-coefficients$Percentage),,drop=FALSE]
coefficients$Percentage <- paste(as.character(coefficients$Percentage),"%")

datatable(coefficients)

Summary

Insights:

Pop, Rap, and Latin are currently the most popular genre types on Spotify in descending order.
The genres which appear to be experiencing the most significant increase of attention from the year 2000 to 2020 were rap, R&B, and to a lesser extent with Latin music. EDM (Electronic Dance Music) reached its pinnacle in 2005 and appears to be making a gradual resurgence to former heights.
Every major genre was more likely to have its songs produced in the major key. Rock has the largest proportion with nearly 70% of songs being major.
Top Artists by Decade (Most Hit Songs):
- 1950’s - Elvis Presley
- 1960’s - The Beatles
- 1970’s - Fleetwood Mac and Queen
- 1980’s - AC/DC, Luis Miguel, and Michael Jackson
- 1990’s - Red Hot Chili Peppers
- 2000’s - Coldplay
- 2010’s - Billie Eilish
- 2020’s - Eminem (So far as of March 2020)
The distribution of popularity scores is unbalanced due to the fact the most songs never seemed to make it off the ground. The bottom 1% and top 25% capture 8% of the popularity scores from the data.
There were two notable independent variables that have an issue with co-linearity. There was a strong positive correlation between loudness and energy (Pearson Correlation - 0.68) and a strong negative correlation between energy and acousticness (Pearson Correlation - -0.55). Loudness & acousticness and danceability & valence also shared a fairly strong correlation.
Energy, instrumentalness, speechiness, and danceability were the strongest predictors of track popularity according to our regression models while duration, tempo, and key, held close to no predictive power.

Linear Regression

Our linear regression was not very effective in predicting popularity scores. This was due to a few compounding factors. Our popularity scores were highly skewed, and when using under-sampling to account for this fact, much of the data was lost. Less data can mean less predictive power even if the data is better balanced. Also, there were some issues in the linearity of dependence as indicated by our Q-Q plot. This makes it difficult to consistently predict accurately across the spectrum of popularity in a linear model. Linear regression could possibly be improved by including genre as an independent variable in the classifier or discovering different variables which have linearity of predictive power across the spectrum.

Logistic Regression

The expectation for this model was not that it would perform excellently, because we noticed issues with linearity across the independent variables in the linear regression. I thought it could still provide more insight into how the independent variables related to popularity. Altogether, I do not think the classifier is useless even even though it may be less accurate in finding the next hit song than a producer due to experience. Obviously, if there is a significant penalty for having the current levels of sensitivity and specificity in our predictions then the model would not be a good option. The model could be useful as a tool to pick out potential hit songs in the case where a successful and experienced producer is not available or as additional assistance to a busy producer as sorting through and listening to every new song that hits the market is an arduous task. It would be ideal to collect more data and determine if a different model may be more appropriate.

Reflection and Future Work

This project was exciting, because I got to explore and apply many new packages in R through statistical analysis and data visualization. I was happy with the way the EDA worked out, though the regression models were not highly successful in their predictions. This was likely due to the non-linearity of the relationship between independent variables and the popularity. The coefficients that represented the increase in popularity per unit increase of the independent variable were not constant across the spectrum of population. Using methods of non-linear classification like kernel SVM, KNN, or Random Forest models could prove to be more effective in their predictions and would be a natural step forward from the current stage of the analysis.

The scope of what could be explored with this data set is not nearly exhausted. There is still more to be explored in regards to EDA, variable/model selection, and optimization. In particular, identifying popularity of songs and analyzing past and current musical attributes could have widespread implications especially as our country’s demographics shift and a greater variety of cultural musicality begins to become more mainstream. At the very least, it could serve as a useful tool for producers or production companies by utilizing artificial intelligence in the selection of hit songs. I will move forward with this project after learning more about non-linear classification and modeling in my coursework.

Spotify Project

Exploratory Data Analysis and Regression

Synopsis

Packages Required

Data Preparation

Data Import

Data Cleaning

Data Preview

Data Description

Exploratory Data Analysis

Genre Popularity

Genre Popularity by Release Date

Major and Minor Key Proportion

Artists - Most Hits by Decade

Popularity Scores

Variable Correlation

Correlation Heat Map

Regression Analysis

Linear Regression - Attempt 1

Building the Model

Visual Assessment

Coefficient Assessment

Linear Regression Attempt 2

Building the Model

Visual Assessment

Coefficient Assessment

Logistic Regression

Building the Model

Model Evaluation

Validation of Predicted Values

Summary