Spotify Analysis

Introduction

#![](/Users/makayla/Library/Mobile #Documents/com~apple~CloudDocs/STA -Rstudio/MS -  Data #Minning/spotify-logo-1920x1080-2.jpeg){width=40%}

The Spotify Dataset comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata arounds songs from Spotify’s API. Kaylin Pavlik had a recent blogpost using the audio features to explore and classify songs. She used the spotifyr package to collect about 5000 songs from 6 main categories.
The data shows general metadata around songs from Spotify’s API. It shows the song’s popularity and other parameters such as acousticness, danceability, energy, speechiness, valence, key and genres and sybgenres…
With this analysis we are interested in how track popularity is getting influenced by the other varaibles.
The plan is to analyze relationship between popularity and different features of the song to predict future popularity of a song. We plan on performing Data Preparation, EDA and Modelling using models such as *linear regression, knn, logistic regression, SVM and Tree Models and compare which one performs better and see which one predicts better if a song is going to be popular or not.
This is mainly beneficial to market spotify customers and improve their experience while using spotify. Also for Spotify, they will be able to provide more accurate predictions of a new song’s potential popularity even before its release.

Packages Required

library(tidyverse) #It assists with data import, tidying, manipulation, and data visualization.
library(ggplot2) # package for producing statistical, or data, graphics
library(kknn) # to perform k-nearest neighbor classification
library(corrplot) # graphical display of a correlation matrix, confidence interval
library(readr) # provides a fast way to read rectangular data
library(rpart) #implements the classification and regression tree algorithm (CART)
library(rpart.plot) #An Enhanced Plotting Package for rpart

Tidyverse:assists with data import, tidying, manipulation, and data visualization. ggplot2: package for producing statistical, or data, graphics. kknn: performs k-nearest neighbor classification. corrplot: graphical display of a correlation matrix, confidence interval. readr: provides a fast way to read rectangular data. rpart: implements the classification and regression tree algorithm (CART) rpart.plot: An Enhanced Plotting Package for rpart.

Data Preparation

Data Loading

library(readr)
spotify <- read.csv("spotify.csv")

Data Examination

We are going to examine our data in prder to see how we can best clean it for further analysis.

### Checking dimension of Data
dim(spotify)

## [1] 32833    23

The original dataset contains 32833 rows and 23 columns. ##### First 5 rows

# show first 5 rows
head(spotify)

##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049

column names

#### Checking column name
names(spotify)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

Data Cleaning

Now, we will look at the structure, summary, data types, duplcates and missing values to then, clean the data. ##### Structure of data

### Checking structure of Data
str(spotify)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

Observations: Shows structure of data. We can see that track_id , track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, playlist_subgenre are character variables. On the other side, track_popularity, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_ms are numeric. We need to change track_album_release_date to date variable. We will also change playlist_genre to factor, for future plotting.

Modifying Data Types

# Modifying Data Types
spotify$track_album_release_date<- as.Date(spotify$track_album_release_date)
spotify$playlist_genre<-as.factor(spotify$playlist_genre)

Summary Statistics

#summary statistics
summary(spotify)

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##                                                                           
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Min.   :1957-01-01      
##  Class :character   Class :character   1st Qu.:2010-12-04      
##  Mode  :character   Mode  :character   Median :2017-01-27      
##                                        Mean   :2012-09-09      
##                                        3rd Qu.:2019-05-16      
##                                        Max.   :2020-01-29      
##                                        NA's   :1886            
##  playlist_name      playlist_id        playlist_genre playlist_subgenre 
##  Length:32833       Length:32833       edm  :6043     Length:32833      
##  Class :character   Class :character   latin:5155     Class :character  
##  Mode  :character   Mode  :character   pop  :5507     Mode  :character  
##                                        r&b  :5431                       
##                                        rap  :5746                       
##                                        rock :4951                       
##                                                                         
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##                                                                        
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##                                                                        
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810  
##

Observations: This function displays Min, Q1, Median, Mean, Q3 and Max of each varibale. We can already see that there are probably some outliers and that some variables have too big Max (duration_ms has a Max of 517810; tempo has a Max of 239.44). We will do some truncation, winsorization or standardization, to see how it affects the model.

Tables

#lets look at some tables for categorical variables
table(spotify$playlist_genre)

## 
##   edm latin   pop   r&b   rap  rock 
##  6043  5155  5507  5431  5746  4951

table(spotify$playlist_subgenre)

## 
##                album rock                  big room              classic rock 
##                      1065                      1206                      1296 
##                 dance pop             electro house                electropop 
##                      1298                      1511                      1408 
##              gangster rap                 hard rock                   hip hop 
##                      1458                      1485                      1322 
##                   hip pop           indie poptimism             latin hip hop 
##                      1256                      1672                      1656 
##                 latin pop                  neo soul            new jack swing 
##                      1262                      1637                      1133 
##            permanent wave                   pop edm             post-teen pop 
##                      1105                      1517                      1129 
## progressive electro house                 reggaeton          southern hip hop 
##                      1809                       949                      1675 
##                      trap                  tropical        urban contemporary 
##                      1291                      1288                      1405

table(spotify$key)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11 
## 3454 4010 2827  913 2201 2680 2670 3352 2430 3027 2273 2996

table(spotify$mode)

## 
##     0     1 
## 14259 18574

Observations: We can see that mode is a binary variable. Playlist_genre and playlist_subgenre are categorical variables with the genre of music. Each category does not vary a lot in observations, which is good, because we it will be easier to interpret their predictions. There are no outliers either.

Duplicates

#Looking for duplicates
dups_id <- sum(duplicated(spotify$track_id))
print(dups_id)

## [1] 4477

We can see that a lot of songs have been duplicated in this dataset. They have the same track_id. Therefore we will remove them, for further analysis.

spotify_dups = spotify[duplicated(spotify$track_id),]
spotify = spotify[!duplicated(spotify$track_id),]

We removed the duplicate songs to another dataset called spotify_dups and removed duplicates from current dataset.

Missing Values

#looking for missing values
sum(is.na(spotify))

## [1] 1693

There are 12 missing values in this dataset. However, we will not remove them, since they might still be important for the analysis.

Data Visualization

Visual histogram exploration to understand the data set.

We wil plot some variables to observe its skewiness and distribution, and interpret them.

histograms <- names(spotify)[c(4,12:23)]

songs1 <- spotify %>% 
            select(c(histograms)) %>%
            pivot_longer(cols = histograms)

## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(histograms)
## 
##   # Now:
##   data %>% select(all_of(histograms))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

songs1 %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name, ncol = 5, scales = 'free') +
  labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
  theme_void()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Observations: -Duration and Valence are normally distributed - Danceability, Enery and Loudness is left-skewed - Acousticness, Liveness and Speechiness is right-skewed - Key and mode are categorical variables - in popularity the max is around 50-60, and it is normally distributed. There are a lot of zero values, possible missing values. Very few songs have a popularity of above 90.

Boxplot to plot different genres vs popularity

ggplot(spotify,aes(x = playlist_genre, y = track_popularity, fill = playlist_genre)) +geom_boxplot()

ggplot(spotify,aes(x = playlist_subgenre, y = track_popularity, fill = playlist_subgenre)) +geom_boxplot()

Observations: - We can see that pop is the most popular followed by latin and rock. - From the subgenre playlist post-teen pop is the most popular followed by dance pop and permament wave.

Visualitazion scatterplots based on genre and popularity

Now we will plot scatterplots to compare the different parameters to popularity and see which genre has the best popularity based on the different variables and see on popular songs, which parameter best predicts popularity.

# new dataset with some variables
feature_names <- names(spotify)[c(12:13,15,17:19,23)]

# new dataset only displying variables with 500 most popular songs
songs <- spotify %>% 
  arrange(desc(track_popularity)) %>%
  head(n = 500) %>%
  pivot_longer(cols = feature_names)

## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(feature_names)
## 
##   # Now:
##   data %>% select(all_of(feature_names))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#plotting 
songs %>%
  ggplot(aes(x = name, y = value)) +
  geom_jitter(aes(color = playlist_genre)) +
  facet_wrap(~name, ncol = 4, scales = 'free') +
  labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
  theme(axis.text.y = element_blank())

Observation: - These represent the 500 most popular songs - Higher danceability, energy and loudness usually means more popularity - Lower speechiness and accousticness and instrumentalness means more popular - An average to low duration means more popularity - From this visualizations, since most of the variables are numerical, we could consider linear regression, Decision Trees and SVM to be the best models. After converting track_popularity into dummy variable we could also use other models such as logistic regression.

Visualization Boxplots

Outliers

ggplot(data = songs1) +
  geom_boxplot(aes(y = value)) + 
  facet_wrap(~name, nrow = 4, scales = "free") +
  coord_flip() +
  ggtitle("Outlier analysis", subtitle = "For different song attributes") +
  theme_void()

Observations:

From the boxplots we can observe that a lot of variables (danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, duration) have outliers. Removing them would influence the analysis a lot.

Datset copy without oultiers

spotify_copy <- spotify

Truncation

Now, we will truncate energy speechiness, acousticness, instrumentalness and liveness.

# truncation danceability, energy, speechiness, acousticness, instrumentalness and liveness
spotify_copy$danceability[spotify_copy$danceability <= 0.28] <- 0.28

spotify_copy$energy[spotify_copy$energy <= 0.2] <- 0.2

spotify_copy$speechiness[spotify_copy$speechiness >= 0.27] <- 0.27

spotify_copy$acousticness[spotify_copy$acousticness >= 0.6] <- 0.6

spotify_copy$instrumentalness[spotify_copy$instrumentalness >= 0.015] <- 0.015

spotify_copy$liveness[spotify_copy$liveness >= 0.4] <- 0.4

Winsorization

Now, we will winsorize loudness, tempo and duration.

# winsorization loudness
# Calculate the 5th and 95th percentiles for 'loudness'
lower_bound_loudness <- quantile(spotify_copy$loudness, 0.05, na.rm = TRUE)
upper_bound_loudness <- quantile(spotify_copy$loudness, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$loudness[spotify_copy$loudness < lower_bound_loudness] <- lower_bound_loudness
spotify_copy$loudness[spotify_copy$loudness > upper_bound_loudness] <- upper_bound_loudness

# winsorization tempo
# Calculate the 5th and 95th percentiles for 'tempo'
lower_bound_tempo <- quantile(spotify_copy$tempo, 0.05, na.rm = TRUE)
upper_bound_tempo <- quantile(spotify_copy$tempo, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$tempo[spotify_copy$tempo < lower_bound_tempo] <- lower_bound_tempo
spotify_copy$tempo[spotify_copy$tempo > upper_bound_tempo] <- upper_bound_tempo

#winsorize duration
# Calculate the 5th and 95th percentiles for 'duration'
lower_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.05, na.rm = TRUE)
upper_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$duration_ms[spotify_copy$duration_ms < lower_bound_duration_ms] <- lower_bound_duration_ms
spotify_copy$duration_ms[spotify_copy$duration_ms > upper_bound_duration_ms] <- upper_bound_duration_ms

Now we have removed all the outliers from those variables. The data is cleaned. We also have dealt with missing values, duplicates, and data types. Let’s create a table to see how all the cleaned data looks like.

knitr::kable(head(spotify[, 1:23]), "simple")

track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
6f807x0ima9a1j3VPbc7VN	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616
75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093
1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052
7fvUMiyapMsRRxr07cU8Ef	Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	2yiy9cd2QktrNvWC2EUi0k	Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.919	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049

Exploratory Data Anaylsis

Observations: - we first looked at the descriptive statistics, which showed us the mean, median, minimum, maximum, and quartiles of each variable. - We saw the duplicates and deleted them - No missing values - Histograms helped us look a the distribution of variables, and since a lot pf them where numeric we concluded that a Linear regression might be a good model to predict track_popularity. - Scatterplots helped us see which parameter might influence more our dependent variable. These were: instrumentalness, speechiness and loudness, since the data was more dense. - bosplots against popularity helped us identify most popular genres: pop, latin and rock. - Popular subgenres were: post-teen pop, dance pop and permanent wave.

To create new summary information, we plan on narrowing our data down to the variables we really plan on exploring to help us gain insights to our predictions.

Correlation Matrix

First, we will select the variables that we will use for the correlation matrix. We will convert playlist_genre to categorical in order to be able to see the correlation.

matrix <- spotify %>%  select(track_popularity, danceability, energy, key, loudness, mode, speechiness,acousticness, instrumentalness, liveness, valence, tempo, duration_ms, playlist_genre)

matrix$playlist_genre<-recode(matrix$playlist_genre, 'pop'=0, 'r&b'=1, 'rock'=2, 'latin'=3, 'edm'=4, 'rap'=5)

cor_matrix <- cor(matrix)
print(cor_matrix)

##                  track_popularity danceability      energy           key
## track_popularity      1.000000000  0.046596507 -0.10362202 -0.0081442988
## danceability          0.046596507  1.000000000 -0.08143602  0.0070203133
## energy               -0.103622023 -0.081436016  1.00000000  0.0128766164
## key                  -0.008144299  0.007020313  0.01287662  1.0000000000
## loudness              0.037285154  0.015297214  0.68213774 -0.0005315332
## mode                  0.016275225 -0.055211662 -0.00431074 -0.1760953351
## speechiness           0.005205509  0.183453077 -0.02900840  0.0231150971
## acousticness          0.091725358 -0.028878237 -0.54588619  0.0041975410
## instrumentalness     -0.124431070 -0.002266695  0.02381792  0.0073927120
## liveness             -0.052773029 -0.127002362  0.16370922  0.0021365502
## valence               0.022581329  0.333728627  0.14971046  0.0216180479
## tempo                 0.004446180 -0.184575167  0.15154451 -0.0103554421
## duration_ms          -0.139681783 -0.087560915  0.01758400  0.0188095544
## playlist_genre       -0.073434088  0.190073891  0.08828930  0.0100195621
##                       loudness          mode  speechiness acousticness
## track_popularity  0.0372851540  0.0162752246  0.005205509  0.091725358
## danceability      0.0152972137 -0.0552116618  0.183453077 -0.028878237
## energy            0.6821377425 -0.0043107400 -0.029008396 -0.545886191
## key              -0.0005315332 -0.1760953351  0.023115097  0.004197541
## loudness          1.0000000000 -0.0176877795  0.012981015 -0.371602357
## mode             -0.0176877795  1.0000000000 -0.059650521  0.006760044
## speechiness       0.0129810147 -0.0596505208  1.000000000  0.024945425
## acousticness     -0.3716023568  0.0067600443  0.024945425  1.000000000
## instrumentalness -0.1543090413 -0.0057660749 -0.107959332 -0.003100502
## liveness          0.0819270381 -0.0002748805  0.059342894 -0.074539622
## valence           0.0495101626 -0.0030881898  0.064700922 -0.018998564
## tempo             0.0967061101  0.0167097090  0.032728605 -0.114335210
## duration_ms      -0.1045846289  0.0128129792 -0.098435680 -0.094143218
## playlist_genre    0.0715914257 -0.0505853182  0.298193374 -0.076573192
##                  instrumentalness      liveness     valence        tempo
## track_popularity     -0.124431070 -0.0527730291  0.02258133  0.004446180
## danceability         -0.002266695 -0.1270023619  0.33372863 -0.184575167
## energy                0.023817924  0.1637092244  0.14971046  0.151544510
## key                   0.007392712  0.0021365502  0.02161805 -0.010355442
## loudness             -0.154309041  0.0819270381  0.04951016  0.096706110
## mode                 -0.005766075 -0.0002748805 -0.00308819  0.016709709
## speechiness          -0.107959332  0.0593428937  0.06470092  0.032728605
## acousticness         -0.003100502 -0.0745396224 -0.01899856 -0.114335210
## instrumentalness      1.000000000 -0.0085048515 -0.17416318  0.021484050
## liveness             -0.008504851  1.0000000000 -0.01992507  0.022026642
## valence              -0.174163182 -0.0199250679  1.00000000 -0.025135288
## tempo                 0.021484050  0.0220266420 -0.02513529  1.000000000
## duration_ms           0.059077441  0.0075705280 -0.03332403 -0.001695701
## playlist_genre        0.147880944  0.0508644474 -0.06999690  0.050109733
##                   duration_ms playlist_genre
## track_popularity -0.139681783    -0.07343409
## danceability     -0.087560915     0.19007389
## energy            0.017584005     0.08828930
## key               0.018809554     0.01001956
## loudness         -0.104584629     0.07159143
## mode              0.012812979    -0.05058532
## speechiness      -0.098435680     0.29819337
## acousticness     -0.094143218    -0.07657319
## instrumentalness  0.059077441     0.14788094
## liveness          0.007570528     0.05086445
## valence          -0.033324030    -0.06999690
## tempo            -0.001695701     0.05010973
## duration_ms       1.000000000    -0.07577263
## playlist_genre   -0.075772625     1.00000000

corrplot(cor(cor_matrix), method="color", type="upper", order="hclust")

The overall ranking of the variables that mostly correlate with track_popularity are the following: 1. Duration (-0.1396) 2. Instrumentallness (-0.1244) 3. Energy (-0.1036) 4. Acousticness (0.0917) 5. playlist_genre (-0.0734)

Comparing to the scatterplots one would think that instrumentallness, speechiness and loudness would have the strongest correlations since they have most points out of the most popular songs in the smallest interval.

If we look at other variables we can see that there exist a high positive correlation between energy & loudness with 0.682. There is a negative correlation exists between acousticness & energy with -0.545.

Modelling

Linear Regression model with training and test data

We want to predict track popularity and we want to see which variables best predicts the dependent variable. Therefore we will first build a linear regression model to see how it predicts track_popularity.

Data splitting

# Set the seed for reproducibility
set.seed(2023)

# Randomly sample row indices for the training set split in 70% and 30%
train_indices <- sample(1:NROW(spotify),NROW(spotify)*0.70)

# Create the training set
train_data <- spotify[train_indices, ] #everything before comma is row selector and after comma is column selector

# Create the testing set
test_data <- spotify[-train_indices, ]

Linear Regression on Training Set

# Train the linear regression model, comparing popularity to rest of numeric parameters
lm_model <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data)
summary(lm_model)

## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + key + 
##     loudness + mode + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.757 -17.360   2.935  18.119  60.604 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.684e+01  2.045e+00  32.682  < 2e-16 ***
## danceability      4.170e+00  1.288e+00   3.237  0.00121 ** 
## energy           -2.318e+01  1.461e+00 -15.864  < 2e-16 ***
## key               1.398e-02  4.595e-02   0.304  0.76088    
## loudness          1.132e+00  7.802e-02  14.512  < 2e-16 ***
## mode              7.590e-01  3.357e-01   2.261  0.02378 *  
## speechiness      -7.397e+00  1.657e+00  -4.464 8.09e-06 ***
## acousticness      4.934e+00  8.895e-01   5.547 2.95e-08 ***
## instrumentalness -9.426e+00  7.437e-01 -12.676  < 2e-16 ***
## liveness         -4.420e+00  1.077e+00  -4.104 4.08e-05 ***
## valence           1.997e+00  7.861e-01   2.540  0.01110 *  
## tempo             2.808e-02  6.261e-03   4.485 7.33e-06 ***
## duration_ms      -4.277e-05  2.726e-06 -15.689  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.03 on 19836 degrees of freedom
## Multiple R-squared:  0.06003,    Adjusted R-squared:  0.05946 
## F-statistic: 105.6 on 12 and 19836 DF,  p-value: < 2.2e-16

The model suggests that all the variables are significant except key, since it is the only variable with a p-value higher than 0.5. This model has an R^2 of 0.06003 which means that only 6% of the of the data’s variability can be explained by the regression model, which indicated that this is not a good model.
We can say that when instrumentalness, speechiness and duration_ms are increased by 1 unit, track_popularity is affected the most with coefficients of -9.426, -7.397 and -4.277. However, energy and loudness did not affect track_popularity as much as we thought.

Manually check the results

we can compile the actual and predicted values and view the first 5 records

# Create a data frame to compare actual and predicted values
comparison_df <- data.frame(Actual =  train_data$track_popularity, lm_predicted =lm_model$fitted.values)
head(comparison_df)

##       Actual lm_predicted
## 25119      0     33.94611
## 11393     24     40.51391
## 29550     39     41.34777
## 2064      63     38.75230
## 8416      40     35.35537
## 32344     30     28.37513

In-sample (training) MSE

We will look at the In-Sample MSE to evaluate the model.

lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))

## [1] "Training MSE for Linear Model: 529.94"

We got a MSE of 529.94, which is high for this model. A value close to 0 would be perfect.

Out-of-sample (testing) MSE

Now we will look at the Out-Of-Sample MSE to evaluate the model.

# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Cal
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))

## [1] "Testing MSE for Linear Model: 527.51"

The MSE for the testing set is lower, with a value of 527.51.

Now, we will train the linear model, comparing it to other categorical values to see how they predict popularity.

# Train the linear regression model, comparing popularity to other character parameters
lm_model_ch <- lm(track_popularity ~ playlist_genre + playlist_subgenre, data = train_data)

summary(lm_model_ch)

## 
## Call:
## lm(formula = track_popularity ~ playlist_genre + playlist_subgenre, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.418 -16.339   2.905  16.931  60.098 
## 
## Coefficients: (5 not defined because of singularities)
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                 23.9894     0.6853  35.005  < 2e-16
## playlist_genrelatin                         18.0795     1.0338  17.489  < 2e-16
## playlist_genrepop                           31.4286     1.0654  29.501  < 2e-16
## playlist_genrer&b                           22.2155     1.0313  21.541  < 2e-16
## playlist_genrerap                           24.4461     1.0227  23.902  < 2e-16
## playlist_genrerock                          13.5668     1.0741  12.631  < 2e-16
## playlist_subgenrebig room                    4.4078     1.0705   4.117 3.85e-05
## playlist_subgenreclassic rock               -0.2592     1.1543  -0.225 0.822309
## playlist_subgenredance pop                  -3.0791     1.1009  -2.797 0.005164
## playlist_subgenreelectro house               9.7948     0.9814   9.981  < 2e-16
## playlist_subgenreelectropop                -16.3501     1.0989 -14.878  < 2e-16
## playlist_subgenregangster rap              -15.9361     1.0535 -15.127  < 2e-16
## playlist_subgenrehard rock                  -4.2979     1.1230  -3.827 0.000130
## playlist_subgenrehip hop                     4.6597     1.0591   4.400 1.09e-05
## playlist_subgenrehip pop                    -1.8091     1.2060  -1.500 0.133599
## playlist_subgenreindie poptimism           -14.2180     1.0637 -13.367  < 2e-16
## playlist_subgenrelatin hip hop              -9.5706     1.0786  -8.873  < 2e-16
## playlist_subgenrelatin pop                   4.3781     1.1083   3.950 7.83e-05
## playlist_subgenreneo soul                  -16.3026     1.0344 -15.760  < 2e-16
## playlist_subgenrenew jack swing            -19.9120     1.1214 -17.757  < 2e-16
## playlist_subgenrepermanent wave             14.2161     1.1811  12.036  < 2e-16
## playlist_subgenrepop edm                    12.5587     1.0861  11.563  < 2e-16
## playlist_subgenrepost-teen pop                   NA         NA      NA       NA
## playlist_subgenreprogressive electro house       NA         NA      NA       NA
## playlist_subgenrereggaeton                   4.2245     1.2535   3.370 0.000752
## playlist_subgenresouthern hip hop          -13.6192     1.0064 -13.533  < 2e-16
## playlist_subgenretrap                            NA         NA      NA       NA
## playlist_subgenretropical                        NA         NA      NA       NA
## playlist_subgenreurban contemporary              NA         NA      NA       NA
##                                               
## (Intercept)                                ***
## playlist_genrelatin                        ***
## playlist_genrepop                          ***
## playlist_genrer&b                          ***
## playlist_genrerap                          ***
## playlist_genrerock                         ***
## playlist_subgenrebig room                  ***
## playlist_subgenreclassic rock                 
## playlist_subgenredance pop                 ** 
## playlist_subgenreelectro house             ***
## playlist_subgenreelectropop                ***
## playlist_subgenregangster rap              ***
## playlist_subgenrehard rock                 ***
## playlist_subgenrehip hop                   ***
## playlist_subgenrehip pop                      
## playlist_subgenreindie poptimism           ***
## playlist_subgenrelatin hip hop             ***
## playlist_subgenrelatin pop                 ***
## playlist_subgenreneo soul                  ***
## playlist_subgenrenew jack swing            ***
## playlist_subgenrepermanent wave            ***
## playlist_subgenrepop edm                   ***
## playlist_subgenrepost-teen pop                
## playlist_subgenreprogressive electro house    
## playlist_subgenrereggaeton                 ***
## playlist_subgenresouthern hip hop          ***
## playlist_subgenretrap                         
## playlist_subgenretropical                     
## playlist_subgenreurban contemporary           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.07 on 19825 degrees of freedom
## Multiple R-squared:  0.1372, Adjusted R-squared:  0.1362 
## F-statistic: 137.1 on 23 and 19825 DF,  p-value: < 2.2e-16

The R^2 is bigger meaning that 13.62% of the variance can be explained by this model. It is still too low. From the coefficients pop, has the highest one (31.42), then rap and the r&b. Looking back at the plots, it looked like pop, latin and rock were the most populars. From the subgenre it seems like playlist_subgenrehip pop and playlist_subgenreclassic rock were the only ones non-significant but when we see at the boxplots above, it looks like they are not the ones with less popularity. On the sibgernes, new jack swing had the highest coefficients, followed by electropop and neo soul.

# Predict on training data (character)
lm_train_pred_ch <- mean((lm_model_ch$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_train_pred_ch, 2)))

## [1] "Training MSE for Linear Model: 486.43"

# Predict on testing data (character)
lm_test_pred_ch <- predict(lm_model_ch, newdata = test_data)

## Warning in predict.lm(lm_model_ch, newdata = test_data): prediction from a
## rank-deficient fit may be misleading

# Cal
lm_mse_test_ch <- mean((lm_test_pred_ch - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test_ch, 2)))

## [1] "Testing MSE for Linear Model: 485.8"

The MSE of the model in the categorical variables is still very high with training of 486.43 and testing MSE of 485.8.

Now we will try adding an interaction to see how it would fit the model.

lm_model_int <- lm(track_popularity ~ danceability + loudness + danceability*loudness, data = spotify)
summary(lm_model_int)

## 
## Call:
## lm(formula = track_popularity ~ danceability + loudness + danceability * 
##     loudness, data = spotify)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.912 -18.249   3.069  18.643  60.571 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            28.4207     1.4221  19.985  < 2e-16 ***
## danceability           20.2423     2.1859   9.261  < 2e-16 ***
## loudness               -0.8238     0.1768  -4.660 3.18e-06 ***
## danceability:loudness   1.7837     0.2743   6.502 8.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.64 on 28352 degrees of freedom
## Multiple R-squared:  0.004993,   Adjusted R-squared:  0.004887 
## F-statistic: 47.42 on 3 and 28352 DF,  p-value: < 2.2e-16

As we saw before, almost every variables is significant towards track popularity, so it is very easy to find interactions. We performed an interaction which was significant. However the R^2 is too low. Now it is 0.5%. The new coefficient to account for this interaction is 1.784 indicating that if there was an increase of 1 in the danceability variable that 1.784 would be multiplied by the value of loudnessvariable in the regression equation.

Predictions on new model

# Predict on training data (character)
lm_train_pred_int <- mean((lm_model_int$fitted.values - train_data$track_popularity)^2)

## Warning in lm_model_int$fitted.values - train_data$track_popularity: longer
## object length is not a multiple of shorter object length

print(paste("Training MSE for Linear Model:", round(lm_train_pred_int, 2)))

## [1] "Training MSE for Linear Model: 567.09"

# Predict on testing data
lm_test_pred_int <- predict(lm_model_int, newdata = test_data)
# Cal
lm_mse_test <- mean((lm_test_pred_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))

## [1] "Testing MSE for Linear Model: 554.33"

Model with no interaction

lm_model_no_int <- lm(track_popularity ~ danceability + loudness , data = spotify)
summary(lm_model_no_int)

## 
## Call:
## lm(formula = track_popularity ~ danceability + loudness, data = spotify)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.803 -18.306   2.989  18.645  59.569 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.38628    0.72256  50.357  < 2e-16 ***
## danceability  7.48487    0.96398   7.765 8.47e-15 ***
## loudness      0.28557    0.04629   6.170 6.93e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.66 on 28353 degrees of freedom
## Multiple R-squared:  0.003509,   Adjusted R-squared:  0.003439 
## F-statistic: 49.92 on 2 and 28353 DF,  p-value: < 2.2e-16

If we perform the same model with no interaction the R^2 is 0.035%. This means that 0.035% of variance can be explained by this model, indicating that it is not a good model.

In-Sample MSE

# Predict on training data (character)
lm_train_pred_no_int <- mean((lm_model_no_int$fitted.values - train_data$track_popularity)^2)

## Warning in lm_model_no_int$fitted.values - train_data$track_popularity: longer
## object length is not a multiple of shorter object length

print(paste("Training MSE for Linear Model:", round(lm_train_pred_no_int, 2)))

## [1] "Training MSE for Linear Model: 566.13"

Out-Of-Sample MSE

# Predict on testing data
lm_test_pred_no_int <- predict(lm_model_no_int, newdata = test_data)
# Cal
lm_mse_test <- mean((lm_test_pred_no_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))

## [1] "Testing MSE for Linear Model: 555.23"

LM Observations:

From the model with numeric variables, we could see that all variables except key were statistically significant on the 0.05 level. Which means that a variation in them would influence the outcome in track_popularity. The In-Sample-MSE was 529.94 and the Out-Of-Sample MSE was 527.51. Then we performed a lm with categorical variables such as genre and subgenre. We can see that all variables besides playlist_subgenrelatin hip hop are statistically significant at the 0.05 level. The In-Sample-MSE was 486.43 and the Out-Of-Sample MSE was 485.8. Then we did a linear model doing an interaction with danceability and loudness. The MSE for the interaction was 567.09 and 566.13. On he model with no interaction it was 554.33 on the training set and 555.23 on the testing set respectively. All this indicating that the model that performed the best was the testing model one with the categorical variables genre and sub_genre indicating that those variables do influence in track_popuarity and that it is a more trustworthy model. While we thought from the plots that pop, latin and rock were the most popular genres, this model indicated that it was pop, rap and r&b. With the parameters our model also indicated that loudness and energy might not affect our model as much as we thought. Also, we can see that the model with interaction performed better than the model with no interaction.

KNN model when k=5

spotify_knn_model <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 5)

In-Sample MSE

# Predict on training data
knn_train_pred <- fitted.values(spotify_knn_model)

# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))

## [1] "In-Sample MSE for KNN:  194.903869196156"

Out-Of-Sample MSE

# Predict on testing data
knn_model_test <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 5)

knn_test_pred <- fitted.values(knn_model_test)

# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))

## [1] "Out-of-Sample MSE for KNN:  687.024025645934"

KNN model when k=20

spotify_knn_model_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 20)

In-Sample MSE

# Predict on training data
knn_train_pred_20 <- fitted.values(spotify_knn_model_20)

# Calculate in-sample MSE manually
knn_train_mse_20 <- mean((train_data$track_popularity - knn_train_pred_20)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse_20))

## [1] "In-Sample MSE for KNN:  389.061411631355"

Out-Of-Sample MSE

# Predict on testing data
knn_model_test_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 20)

knn_test_pred_20 <- fitted.values(knn_model_test_20)

# Calculate out-of-sample MSE manually
knn_test_mse_20 <- mean((test_data$track_popularity - knn_test_pred_20)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse_20))

## [1] "Out-of-Sample MSE for KNN:  556.981777608275"

Interpreting the models

We tried the linear regression and the KNN model to see which model would be best for predicting track popularity using the mean square error to see which one performs better.We found that the MSE for the linear regression model was 529.94 for In-Sample testing and the MSE was 527.51 for Out-of-sample testing with numeric values. We also found that the KNN model has a MSE 194.903 for In-sample test, and a MSE of 687.024 for the Out-of Sample test. When comparing these two models, the KNN model performed better on the training set than the regression model and worse on the testing set. We then performed another KNN model but with k=20, the MSE for training was 389.06 and for testing 556.981., having a btter result in the Out-of Sample Test.

We did not use all of the variables in the data set, but we used all of our continuous variables.We decided to go this route because we mostly wanted to explore track popularity and the relationship it has with the continuous variables in our data set.

Theoretically, the model that would fit the data the best is the linear regression model because it is easier to interpret, you can see the coefficients and how each variable is changed by one unit increase in Y. You also have to meet the four assumptions (linearity, independence,normality, and homoscedasticity). We can run a diagnostics plot to see if our model meets these assumptions.

# diagnostic plot
par(mfrow = c(2, 2))
plot(lm_model)

From the diagnostic plot, we see that the model does not meet the four assumptions, which leads us to the conclusion that the linear regression model is not the best fit for this data.

However, teh regression model meets the foru assumoptions, lets us interpret the coefficients, and has a lower OUT-OF Sample MSE.

Standardized vs Unstandarized train knn model

Since the mse is not quite where we want it to be, we are going to do some feature scaling to see if there is any impact on knn model. The variables that will be removed are key, mode and accoustiness.

#Creating a standardized train knn model 
train_new.x <- train_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]
k<- 5
stand_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5)

# Creating a unstandardized knn model
knn_unstand <- kknn(track_popularity ~  danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5, scale = FALSE)

#finding the MSE for our standardized model 
#first we have to find the prediction of our standardized knn model 
pred_stand <- fitted(stand_knn)
train_new.y <- train_new.x$track_popularity 

#Standardized MSE  
stand_mse <- mean((train_new.y - pred_stand)^2)

#Unstandardized MSE
pred_undstand <- fitted(knn_unstand)
unstand_mse <- mean((train_new.y - pred_undstand)^2)

#output
print(paste("Training MSE of standardized data :", stand_mse))

## [1] "Training MSE of standardized data : 206.7465001884"

print(paste("Training MSE of unstandardized data:", unstand_mse))

## [1] "Training MSE of unstandardized data: 217.941700903636"

Interpreting the model

The MSE of the standardized data is 206.7465 and the MSE for the unstandarized data is 217.9417 , so there is a very small difference between the two. Our original train data had a MSE of 194.903, which is also a small difference compared to our feature scaling models. This leads us to the conclusion that feature scaling could make a effective difference in overall model, but it might not be needed.

Now lets test the test data with the same variables removed.

Standardized vs Unstandarized test knn model

#Creating a standardized test knn model 
test_new.x <- test_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]

stand_test_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5)

# Creating a unstandardized knn model
knn_test_unstand <- kknn(track_popularity ~  danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5, scale = FALSE)

#finding the MSE for our standardized model 
#first we have to find the prediction of our standardized knn model 
pred_stand_test <- fitted(stand_test_knn)
test_new.y <- test_new.x$track_popularity 

#Standardized MSE  
test_stand_mse <- mean((test_new.y - pred_stand_test)^2)

#Unstandardized MSE
pred_undstand_test <- fitted(knn_test_unstand)
unstand_mse_test <- mean((test_new.y - pred_undstand_test)^2)

#output
print(paste("Testing MSE of standardized data :", test_stand_mse))

## [1] "Testing MSE of standardized data : 201.450244331452"

print(paste("Test MSE of unstandardized data:", unstand_mse_test))

## [1] "Test MSE of unstandardized data: 213.310812353227"

Interpreting the model

We did a new KNN with new variables and K=5 for training and tsting set with unstandardized and standardized data. The MSE of the standardized data for training set is 206.747 and the MSE for the unstandradized data for training set is 217.942 having a better result with standardized data. For the testing set the standardized model was also better, with a result of 201.45 and unstandrdized was 213.31 which is not a big difference compared to each other, but a big difference compared to out test MSE (687.024) without feature scaling which leads us to the conclusion that feature scaling has a positive impact on our model.

From the feature scaling, we can conclude that the knn model is still the best model compared to the linear regression model for our data.

Logistic regresion

Lets create put track-popularity as dummy variable and insert a 1, when the popuLarity is above 70 and a 0 when it is below 70. Meaning that 1 is popular and 0 is not popular. track_populaity will be again or dependent variale.

spotify_copy <- spotify
spotify_copy$track_popularity <- ifelse(spotify_copy$track_popularity >=70, 1 , 0)

Data Splitting

set.seed(2023)
index <- sample(1:nrow(spotify_copy),nrow(spotify_copy)*0.80)
spotify_copy.train = spotify_copy[index,]
spotify_copy.test = spotify_copy[-index,]

Logistic Regrssion model with all numeric variables

spotify_copy.glm0<- glm(track_popularity~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, family=binomial, data=spotify_copy.train)
summary(spotify_copy.glm0)

## 
## Call:
## glm(formula = track_popularity ~ danceability + energy + key + 
##     loudness + mode + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms, family = binomial, 
##     data = spotify_copy.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1331  -0.5000  -0.4123  -0.2795   4.2276  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.066e+00  3.085e-01   3.455 0.000551 ***
## danceability      3.319e-01  1.902e-01   1.745 0.081027 .  
## energy           -3.095e+00  2.169e-01 -14.269  < 2e-16 ***
## key              -9.644e-03  6.523e-03  -1.478 0.139311    
## loudness          2.076e-01  1.310e-02  15.843  < 2e-16 ***
## mode              1.286e-01  4.805e-02   2.676 0.007446 ** 
## speechiness      -7.592e-01  2.401e-01  -3.162 0.001568 ** 
## acousticness      1.473e-01  1.228e-01   1.199 0.230523    
## instrumentalness -3.071e+00  2.814e-01 -10.916  < 2e-16 ***
## liveness         -4.376e-01  1.710e-01  -2.558 0.010518 *  
## valence           4.779e-01  1.159e-01   4.125 3.71e-05 ***
## tempo             2.652e-03  8.573e-04   3.093 0.001980 ** 
## duration_ms      -1.883e-06  4.619e-07  -4.076 4.57e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14035  on 22683  degrees of freedom
## Residual deviance: 13193  on 22671  degrees of freedom
## AIC: 13219
## 
## Number of Fisher Scoring iterations: 7

Observation: As we can see from this model, energy, loudness, mode, speechiness, instrumentalness, liveness, valence, tempo and duration_ms are significant in predicting whether a song is going to be popular or not. Unlike linear regression, in this model key has the highest coefficient with -9.644, followed by speechiness and liveness.

Prediction

In-sample prediction

pred.glm0.train <- predict(spotify_copy.glm0,type="response")
hist(pred.glm0.train)

Observation: These tables illustrate the impact of choosing different cut-off probability. Choosing a large cut-off probability will result many cases being predicted as 1, and choosing a small cut-off probability will result in few cases being predicted as 1.

#ROC curve 
library(ROCR)
pred <- prediction(pred.glm0.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

Obsservations: -The ROC curse predicts the false positive rate and the true positive rate. You will want to have these values closer to 1.

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.6855791

Observation: We got the AUC (Area under the Curve). Again we would like to have a number closer to 1, meaning that it would be perfectly predicted. However, it is 0.6856. It is still good.

Out-of-sample prediction (more important)

pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")

ROC Curve for Out-Of-Sample

pred <- prediction(pred.glm0.test, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.6610358

Observations: -The AUC is even lower, with a score of 0.661.

In order to get the matrix for True Positive and True Negative Predictions, we will determine the optimal cut-off Probability using Grid Search Method. We will first use a weight of 5 and 1.

# define a cost function with input "obs" being observed response 
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
    weight1 = 5   # define the weight for "true=1 but pred=0" (FN)
    weight0 = 1    # define the weight for "true=0 but pred=1" (FP)
    c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
    c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
    cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
    return(cost) # you have to return to a value when you write R functions
} # end of the function

Next, we will define a sequence of probability (you need to search the optimal p-cut from this sequence)

# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)

Then, we will to calculate the cost for each probability in the sequence p.seq.

# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0.train, pcut = p.seq[i])  
} # end of the loop

#cbind(p.seq, cost)

Last, we will draw a plot with cost against p.seq, and find the p-cut that gives us the minimum cost.

# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, cost)

# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)

## [1] 0.16

Observation: With the weight of 5 and 1, we got o.15 as optimal cut-off-probability

Use the optimal cut-off probability for in-sample data

# step 1. get binary classification
class.glm0.train.opt<- (pred.glm0.train>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.train$track_popularity, class.glm0.train.opt, dnn = c("True", "Predicted"))

##     Predicted
## True     0     1
##    0 18732  1843
##    1  1636   473

Observation: With weight of 5 and 1, we got 16845 TN, 824 TP, 3422 FP and 1593 FN. These predictions seem very good. It means for example that 824 were predicted a popular when they were popular.

Obtain MR, FPR, FNR and cost rate.

MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.opt)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.opt==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.opt==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.opt, pcut = optimal.pcut.glm0)  
print(paste0("MR:",MR))

## [1] "MR:0.153368012696174"

print(paste0("FPR:",FPR))

## [1] "FPR:0.0895747266099635"

print(paste0("FNR:",FNR))

## [1] "FNR:0.775723091512565"

print(paste0("cost:",cost))

## [1] "cost:0.441853288661612"

We got 0.1534 for MR, which is pretty accuarte for such a complicated data. We also got 0.0896 fro FPR, 0.7757 for FNR and 0.4419 for cost.

Now, lets do the same with the Out-Of-Sample Data which is more important and calculate MR.

# step 0. obtain predicated values on testing data (we actually did this already, but let's do it again) 
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")

# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.opt <- (pred.glm0.test>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.test$track_popularity, pred.glm0.test.opt, dnn = c("True", "Predicted"))

##     Predicted
## True    0    1
##    0 4698  448
##    1  416  110

Obs: With weight of 5 and 1, (optimal cut-off being 0.15) we got 4214 TN, 186 TP, 876 FP and 396 FN. These predictions seem very good. It means for example that 186 were predicted a popular when they were popular.

MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.opt)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.opt==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.opt==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.opt, pcut = optimal.pcut.glm0)  
print(paste0("MR:",MR))

## [1] "MR:0.152327221438646"

print(paste0("FPR:",FPR))

## [1] "FPR:0.0870579090555771"

print(paste0("FNR:",FNR))

## [1] "FNR:0.790874524714829"

print(paste0("cost:",cost))

## [1] "cost:0.445698166431594"

Out of sample MR is still very good, with a score of 0.1523. We also got 0.08701 for FPR, 0.79087 for FNR and 0.4457 for cost. All of the values were better for testing set except of FNR.

Logistic Regression model 2 (with categorical variables)

spotify_glm0_cat<- glm(track_popularity~ playlist_genre + playlist_subgenre, family=binomial, data=spotify_copy.train)
summary(spotify_glm0_cat)

## 
## Call:
## glm(formula = track_popularity ~ playlist_genre + playlist_subgenre, 
##     family = binomial, data = spotify_copy.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9468  -0.5258  -0.3472  -0.1463   3.3090  
## 
## Coefficients: (5 not defined because of singularities)
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                                 -5.4706     0.4474 -12.229  < 2e-16
## playlist_genrelatin                          2.1066     0.4832   4.360 1.30e-05
## playlist_genrepop                            4.9005     0.4532  10.813  < 2e-16
## playlist_genrer&b                            3.5616     0.4578   7.781 7.21e-15
## playlist_genrerap                            3.4258     0.4587   7.469 8.10e-14
## playlist_genrerock                           3.1048     0.4647   6.681 2.38e-11
## playlist_subgenrebig room                    0.1464     0.6715   0.218 0.827462
## playlist_subgenreclassic rock               -0.1167     0.1790  -0.652 0.514502
## playlist_subgenredance pop                  -0.6406     0.1037  -6.178 6.50e-10
## playlist_subgenreelectro house               0.9380     0.5332   1.759 0.078579
## playlist_subgenreelectropop                 -1.1165     0.1131  -9.868  < 2e-16
## playlist_subgenregangster rap               -1.1283     0.1873  -6.025 1.69e-09
## playlist_subgenrehard rock                  -0.7950     0.2064  -3.853 0.000117
## playlist_subgenrehip hop                     0.4123     0.1312   3.142 0.001676
## playlist_subgenrehip pop                     0.1436     0.1470   0.977 0.328782
## playlist_subgenreindie poptimism            -2.2086     0.1422 -15.535  < 2e-16
## playlist_subgenrelatin hip hop               0.9102     0.2183   4.169 3.06e-05
## playlist_subgenrelatin pop                   1.9038     0.2023   9.410  < 2e-16
## playlist_subgenreneo soul                   -1.8874     0.2208  -8.549  < 2e-16
## playlist_subgenrenew jack swing             -3.2198     0.4588  -7.018 2.25e-12
## playlist_subgenrepermanent wave              0.7548     0.1584   4.766 1.88e-06
## playlist_subgenrepop edm                     2.7406     0.4715   5.812 6.16e-09
## playlist_subgenrepost-teen pop                   NA         NA      NA       NA
## playlist_subgenreprogressive electro house       NA         NA      NA       NA
## playlist_subgenrereggaeton                   1.5118     0.2209   6.844 7.71e-12
## playlist_subgenresouthern hip hop           -1.0521     0.1711  -6.149 7.79e-10
## playlist_subgenretrap                            NA         NA      NA       NA
## playlist_subgenretropical                        NA         NA      NA       NA
## playlist_subgenreurban contemporary              NA         NA      NA       NA
##                                               
## (Intercept)                                ***
## playlist_genrelatin                        ***
## playlist_genrepop                          ***
## playlist_genrer&b                          ***
## playlist_genrerap                          ***
## playlist_genrerock                         ***
## playlist_subgenrebig room                     
## playlist_subgenreclassic rock                 
## playlist_subgenredance pop                 ***
## playlist_subgenreelectro house             .  
## playlist_subgenreelectropop                ***
## playlist_subgenregangster rap              ***
## playlist_subgenrehard rock                 ***
## playlist_subgenrehip hop                   ** 
## playlist_subgenrehip pop                      
## playlist_subgenreindie poptimism           ***
## playlist_subgenrelatin hip hop             ***
## playlist_subgenrelatin pop                 ***
## playlist_subgenreneo soul                  ***
## playlist_subgenrenew jack swing            ***
## playlist_subgenrepermanent wave            ***
## playlist_subgenrepop edm                   ***
## playlist_subgenrepost-teen pop                
## playlist_subgenreprogressive electro house    
## playlist_subgenrereggaeton                 ***
## playlist_subgenresouthern hip hop          ***
## playlist_subgenretrap                         
## playlist_subgenretropical                     
## playlist_subgenreurban contemporary           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14035  on 22683  degrees of freedom
## Residual deviance: 12271  on 22660  degrees of freedom
## AIC: 12319
## 
## Number of Fisher Scoring iterations: 7

Observation: All of the variables are significant at the 0.05 level, except playlist_subgenrebig room, playlist_subgenreclassic rock, playlist_subgenrepost-teen pop, playlist_subgenreprogressive electro house, playlist_subgenretrap, playlist_subgenretropical and playlist_subgenreurban contemporary. Now under genre, the highest change in one unit goes to pop, followed by r&b and rap. For the subgenre, the highest coefficients were on southern hip hop, followed by reggaeton and dance pop. On lm the biggest changes were in other variables (new jack swing, electropop and neo soul).

Prediction

Let’s start with in-sample prediction for logistic model 2

pred.glm0_cat.train <- predict(spotify_glm0_cat,type="response")


#ROC curve 
library(ROCR)
pred <- prediction(pred.glm0_cat.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7593949

Out-of-sample prediction for model 2 (more important)

pred.glm0.test_cat<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

pred <- prediction(pred.glm0.test_cat, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7539842

Observation: For the training set we get an AUC of 0.7594 and for the testing set we get 0.75398.

Find optimal p-cut value for new model

# define a cost function with input "obs" being observed response 
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
    weight1 = 5   # define the weight for "true=1 but pred=0" (FN)
    weight0 = 1    # define the weight for "true=0 but pred=1" (FP)
    c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
    c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
    cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
    return(cost) # you have to return to a value when you write R functions
} # end of the function

# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01) 

# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0_cat.train, pcut = p.seq[i])  
} # end of the loop

#cbind(p.seq, cost)

# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)

## [1] 0.17 0.18

The optimal pcut is either 0.17 or 0.18. Now we will get MR, FPR, FNR and cost to compare it to each other.

Get scores for LGM 2 and training set

# step 1. get binary classification
class.glm0.train.cat<- (pred.glm0_cat.train>optimal.pcut.glm0)*1

# get values
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.cat)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.cat==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.cat==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.cat, pcut = optimal.pcut.glm0)  
print(paste0("MR:",MR))

## [1] "MR:0.151340151648739"

print(paste0("FPR:",FPR))

## [1] "FPR:0.0982260024301337"

print(paste0("FNR:",FNR))

## [1] "FNR:0.669511616880038"

print(paste0("cost:",cost))

## [1] "cost:0.400326221125022"

Get scores for LGM 2 and testing set

# step 0. obtain predicated values on testing data (we actually did this already, but let's do it again) 
pred.glm0.test<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.cat <- (pred.glm0.test>optimal.pcut.glm0)*1
#Get values

MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.cat)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.cat==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.cat==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.cat, pcut = optimal.pcut.glm0)  
print(paste0("MR:",MR))

## [1] "MR:0.157087447108604"

print(paste0("FPR:",FPR))

## [1] "FPR:0.104741546832491"

print(paste0("FNR:",FNR))

## [1] "FNR:0.669201520912547"

print(paste0("cost:",cost))

## [1] "cost:0.405324400564175"

Conclusion logistic regression

Observation: For this new model we got a better AUC on the training set with 0.7593 than on the testing set (0.7539). Both values where better on Model 2 than on Model 1 (0.6856 and 0.6610 respectively). We then calculated MR, FPR, FNR and cost on both models as well, in training and testing sets. For the first model we got MR= 0.1534; FPR= 0.0986; FNR=0.7757 and cost=0.4419on the training set.We also got MR= 0.1523; FPR= 0.087; FNR=0.7908 and cost=0.4457** on the training set. The testing set performed better than the training set. For the second model we got MR= 0.1513; FPR= 0.098; FNR=0.7757 and cost=0.4003 on the training set.We also got MR= 0.1571; FPR= 0.1047; FNR=0.6692 and cost=0.4053 on the training set. The training set performed better than the testing set in model 2. Model 1 performed better on the testing set than Model 2. And alos the opposite.

Support Vector Model (SVM)

We are first going to make a new data set without the categorical variables, then make our track_popularity variable 2 levels.

spotify_new = subset(spotify, select = -c(track_album_release_date, track_id, track_name, track_artist, track_album_name, track_album_id , track_album_name, playlist_genre, playlist_name, playlist_subgenre, playlist_id) )

spotify_new$track_popularity <- ifelse(spotify_new$track_popularity >=70, 1 , 0)

Data Splitting

#splitting the data 
set.seed(2023)
index <- sample(1:nrow(spotify_new),nrow(spotify_copy)*0.80)
spotify_svm_train <-  spotify_new[index,]
spotify_svm_test <- spotify_new[-index,]

Weight Cost

#fitting the data w/o weight class cost 
library(e1071)

spotify_new$track_popularity<- as.factor(spotify_new$track_popularity)

str(spotify_new)

## 'data.frame':    28356 obs. of  13 variables:
##  $ track_popularity: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
##  $ danceability    : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy          : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key             : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness        : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode            : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness     : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness    : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness: num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness        : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence         : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo           : num  122 100 124 122 124 ...
##  $ duration_ms     : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

#svm model 
spotify_svm = svm({{as.factor(track_popularity)}} ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = spotify_svm_train, kernel = 'linear')

#summary
summary(spotify_svm)

## 
## Call:
## svm(formula = {
##     {
##         as.factor(track_popularity)
##     }
## } ~ danceability + energy + key + loudness + mode + speechiness + 
##     acousticness + instrumentalness + liveness + valence + tempo + 
##     duration_ms, data = spotify_svm_train, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  4832
## 
##  ( 2723 2109 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

# predictions on the train data
pred_spotify_train = predict(spotify_svm, spotify_svm_train)

# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_svm_train$track_popularity ,
                      pred = pred_spotify_train)
Cmatrix_train

##     pred
## true     0     1
##    0 20575     0
##    1  2109     0

#Mis-classification rate 
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)

## [1] 0.09297302

The confusion matrix of the predicted values with the training data shows the number of true negatives as 22435, false positives as 0 , false negatives as 3831 and true positives as 0.

The mis-classification error rate is 0.145854

# Predictions on the train data
pred_spotify_test <- predict(spotify_svm, spotify_svm_test)

# Confusion matrix to evaluate the model on train data
Cmatrix_test = table(true = spotify_svm_test$track_popularity,
                     pred = pred_spotify_test)
Cmatrix_test

##     pred
## true    0    1
##    0 5146    0
##    1  526    0

#Mis-classification rate 
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)

## [1] 0.09273625

The confusion matrix of the predicted values with the testing data shows the number of true negatives as 22435, false positives as 0 , false negatives as 3831 and true positives as 0.

The mis-classification error rate is 0.145854

Now we are going to fit a SVM model with weight cost using the train data

spotify_svm_asymmetric = svm(as.factor(track_popularity) ~  danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear', class.weights = c("0" = 1, "1" = 2))
#Summary
summary(spotify_svm_asymmetric)

## 
## Call:
## svm(formula = as.factor(track_popularity) ~ danceability + energy + 
##     loudness + speechiness + mode + key + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms, data = spotify_svm_train, 
##     kernel = "linear", class.weights = c(`0` = 1, `1` = 2))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  7076
## 
##  ( 4967 2109 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

# predictions for train data
pred_spotify_train <- predict(spotify_svm_asymmetric, spotify_svm_train)
#Confusion matrix for train data 
Cmatrix_train_21 <- table(true =spotify_svm_train$track_popularity, 
                           pred = pred_spotify_train)
print(Cmatrix_train_21)

##     pred
## true     0     1
##    0 20575     0
##    1  2109     0

#Mis-classification rate
1 - sum(diag(Cmatrix_train_21))/sum(Cmatrix_train_21)

## [1] 0.09297302

The confusion matrix of the predicted values with the training data shows the number of true negatives as 22435, false positives as 0 , false negatives as 3831 and true positives as 0.

The mis-classification error rate is 0.145854

Now we are going to fit a SVM model with cost using the train data

#testing predicted probability
pred_spotify_test <- predict(spotify_svm_asymmetric, spotify_svm_test)

#Confusion matrix 
Cmatrix_test_22 <-  table( true = spotify_svm_test$track_popularity, pred = pred_spotify_test)

print(Cmatrix_test_22)

##     pred
## true    0    1
##    0 5146    0
##    1  526    0

#Mis-classification rate
MR <- 1 - sum(diag(Cmatrix_test_22))/sum(Cmatrix_test_22)
MR

## [1] 0.09273625

The confusion matrix of the predicted values with the testing data shows the number of true negatives as 5564, false positives as 0 , false negatives as 1003 and true positives as 0.

The mis-classification error rate is 0.1527334

Now we are going to obtain the ROC and AUC for our SVM model. We will then use our output to compare it to the ROC and AUC we obtained in our logistic model.

#ROC and AUC using the train data 

spotify_svm_prob = svm(as.factor(track_popularity) ~  danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms,
                      data = spotify_svm_train, kernel = 'linear',
                      probability = TRUE) 
pred_prob_train = predict(spotify_svm_prob,
                          newdata = spotify_svm_train,
                          probability = TRUE) 

pred_prob_train = attr(pred_prob_train, "probabilities")[, 2] 

#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.5866711

The AUC is 0.5971307

#refitting the model on testing
pred_prob_test = predict(spotify_svm_prob,
                         newdata = spotify_svm_test,
                         probability = TRUE)

pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]

#ROC
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity )
perf <- performance(pred, "tpr", "fpr")


#AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.5698786

The AUC is 0.5909064

Insights

When comparing the ROC and AUC for our SVM model to our AUC and ROC for our logistic regression model, we found that the logistic regression model performed better because it had a higher AUC for both the train and test data.

Classification Tree

We will create a copy of spotify with a new data name and delete some variables for the purpose of this analysis.

spotify_dec = subset(spotify, select = -c(track_album_release_date, track_id, track_name, track_artist, track_album_name, track_album_id ,playlist_genre, playlist_subgenre, track_album_name, playlist_name, playlist_id) )

spotify_dec$track_popularity <- ifelse(spotify_dec$track_popularity >=70, 1 , 0)
str(spotify_dec)

## 'data.frame':    28356 obs. of  13 variables:
##  $ track_popularity: num  0 0 1 0 0 0 0 0 0 0 ...
##  $ danceability    : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy          : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key             : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness        : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode            : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness     : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness    : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness: num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness        : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence         : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo           : num  122 100 124 122 124 ...
##  $ duration_ms     : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

Data Spliting

Changing data types

spotify_dec$mode<- as.factor(spotify_dec$mode)
spotify_dec$key<- as.factor(spotify_dec$key)

str(spotify_dec)

## 'data.frame':    28356 obs. of  13 variables:
##  $ track_popularity: num  0 0 1 0 0 0 0 0 0 0 ...
##  $ danceability    : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy          : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key             : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
##  $ loudness        : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode            : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
##  $ speechiness     : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness    : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness: num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness        : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence         : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo           : num  122 100 124 122 124 ...
##  $ duration_ms     : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

set.seed(2023)
index <- sample(1:nrow(spotify_dec),nrow(spotify_dec)*0.60)
spotify_dec.train = spotify_dec[index,]
spotify_dec.test = spotify_dec[-index,]

Fit Classification Tree with all variables

library(rpart)
library(rpart.plot)
# fit the model
fit_tree_dec <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train)
summary(fit_tree_dec)

## Call:
## rpart(formula = as.factor(track_popularity) ~ danceability + 
##     energy + speechiness + loudness + instrumentalness + mode + 
##     key + acousticness + liveness + valence + tempo, data = spotify_dec.train)
##   n= 17013 
## 
##   CP nsplit rel error xerror xstd
## 1  0      0         1      0    0
## 
## Node number 1: 17013 observations
##   predicted class=0  expected loss=0.09351672  P(node) =1
##     class counts: 15422  1591
##    probabilities: 0.906 0.094

#rpart.plot(fit_tree, yesno=2, extra=4)

Obtain the predicted values for training data and confusion matrix.

# Make predictions on the train data
pred_credit_train <- predict(fit_tree_dec, spotify_dec.train, type="class")

# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_dec.train$track_popularity,
                      pred = pred_credit_train)
Cmatrix_train

##     pred
## true     0     1
##    0 15422     0
##    1  1591     0

1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)

## [1] 0.09351672

Observation: From this model we get a very small missclassification rate of 0.0935, but from the Confusion Matrix of the training set we can see that none of the variables are predicted as 1.

Obtain the predicted values for testing data and confusion matrix.

# Make predictions on the train data
pred_credit_test <- predict(fit_tree_dec, spotify_dec.test, type="class")

# Confusion matrix to evaluate the model on train data
Cmatrix_test = table(true = spotify_dec.test$track_popularity,
                     pred = pred_credit_test)
Cmatrix_test

##     pred
## true     0     1
##    0 10299     0
##    1  1044     0

1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)

## [1] 0.09203914

Observation:

Now that we have done th same on the testing set, we can see that MR is even smaller with 0.092 but still none of the variables are predicted as 1. Therefore we will use asymmetric test, to produce different weights and see if this improves the model.

PLotting tree

# We need to define a cost matrix first, don't change 0 there
cost_matrix <- matrix(c(0, 1,  # cost of 1 for FP
                        6, 0),  # cost of 6 for FN
                      byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train, parms = list(loss = cost_matrix))

rpart.plot(fit_tree_asym,extra=4, yesno=2)

Since we now do have some variables on the ‘1’ side of the data, we can visualize the tree in a plot. We can predict that from the variables energy is the one that best predicts a popular song. The model explains that energy value smaller 0.82 would predict 1, and loudness bigger that -5.1 would also predict 1.

#get predictions for training
pred_spotify_train <- predict(fit_tree_asym, spotify_dec.train,
                             type = "class")
#C matrix for training
Cmatrix_train_asy <- table( true = spotify_dec.train$track_popularity, pred = pred_spotify_train)
Cmatrix_train_asy

##     pred
## true     0     1
##    0 13157  2265
##    1  1054   537

1 - sum(diag(Cmatrix_train_asy))/sum(Cmatrix_train_asy)

## [1] 0.1950861

Observation: Now that we have redistributed the weights we finally can get some values for 1 (being popular) , with 537 being TP, 2265 FP, 13157 TN and 1054 FN on the training set. However, now we have a bigger MR with 0.1951. Lets get assymmetric predictions for the testing set.

#get predictions for testing
pred_credit_test <- predict(fit_tree_asym, spotify_dec.test, type = "class")
#C matrix for testing with more weights on "1"
Cmatrix_test_weight = table( true = spotify_dec.test$track_popularity, pred = pred_credit_test)
Cmatrix_test_weight

##     pred
## true    0    1
##    0 8820 1479
##    1  707  337

1 - sum(diag(Cmatrix_test_weight))/sum(Cmatrix_test_weight)

## [1] 0.192718

Observation: Now for the testing set we getthe following values: with 337 being TP, 1479 FP, 8820 TN and 707 FN on the training set. The MR is now a bit lower with a value of 0.1927. Now, we will obtain the AUC for the training and testing data.

# obtain predicted probability
pred_prob_train = predict(fit_tree_dec, spotify_dec.train, type = "prob")

# This is necessary again, as predict() for tree model return two values, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_prob_train = pred_prob_train[,"1"] 

library(ROCR)
pred <- prediction(pred_prob_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.5

# obtain predicted probability
pred_prob_test = predict(fit_tree_dec, spotify_dec.test, type = "prob")
# This is necessary again, as predict() for tree model return two values, one for 0 and one for 1.
pred_prob_test = pred_prob_test[,"1"] #replace "1" with the actual category if reponse variable is a factor
#ROC
pred <- prediction(pred_prob_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.5

Observation: When we get the AUC for both the training and testing datasets for the model with no weight we get the same value of 0.5 and the ROC curve is a diagonal linear line. This is due to the fact that none of the variables are predicted as 1. Now, lets get the AUC-ROC for the model with weights.

AUC-ROC model with weights for training

# obtain predicted probability
pred_asym_train = predict(fit_tree_asym, spotify_dec.train, type = "prob")

# This is necessary again, as predict() for tree model return two values, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_asym_train = pred_asym_train[,"1"] 

library(ROCR)
pred <- prediction(pred_asym_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.6613876

AUC-ROC model with weights for testing

# obtain predicted probability
pred_asym_test = predict(fit_tree_asym, spotify_dec.test, type = "prob")

# This is necessary again, as predict() for tree model return two values, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_asym_test = pred_asym_test[,"1"] 

library(ROCR)
pred <- prediction(pred_asym_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.6450638

Observation: Now that we have done AUC-ROC on the weighted models, we get a curve and AUC of 0.6614 fr the training set and 0.6451 for the testing set, being a better model that the one with no weights. Overall, although MR is lower on the model with no weights, it does not mean that is better because none of the observations are classified as popular. Although the model with weights has a bigger MR, the model does better classification that model 1. AUC is better for the model with weights, than with no weights.

Conclusion: While exploring which variable is the best predictor for ‘track_popularity’ we found that ‘energy’ and ‘loudness’ were our best predictors. We tested our variables using different prediction models (Logistic regression, KNN, SVM and Tree Model) to see which one performed better. We found that the Tree model performed best due to its high AUC score and the fact that out of all of the models, the Tree model had the best confusion matrix.

Spotify Analysis

2023-10-05

Introduction

Packages Required

Data Preparation

Data Loading

Data Examination

column names

Data Cleaning

Modifying Data Types

Summary Statistics

Tables

Duplicates

Missing Values

Data Visualization

Visual histogram exploration to understand the data set.

Boxplot to plot different genres vs popularity

Visualitazion scatterplots based on genre and popularity

Visualization Boxplots

Outliers

Datset copy without oultiers

Truncation

Winsorization

Exploratory Data Anaylsis

Correlation Matrix

Modelling

Linear Regression model with training and test data

Data splitting

Linear Regression on Training Set

Manually check the results

In-sample (training) MSE

Out-of-sample (testing) MSE

Predictions on new model

Model with no interaction

In-Sample MSE

Out-Of-Sample MSE

LM Observations:

KNN model when k=5

In-Sample MSE

Out-Of-Sample MSE

KNN model when k=20

In-Sample MSE

Out-Of-Sample MSE

Interpreting the models

Standardized vs Unstandarized train knn model

Interpreting the model

Standardized vs Unstandarized test knn model

Interpreting the model

Logistic regresion

Data Splitting

Logistic Regrssion model with all numeric variables

Prediction

In-sample prediction

Out-of-sample prediction (more important)

ROC Curve for Out-Of-Sample

Use the optimal cut-off probability for in-sample data

Obtain MR, FPR, FNR and cost rate.

Logistic Regression model 2 (with categorical variables)

Prediction

Let’s start with in-sample prediction for logistic model 2

Out-of-sample prediction for model 2 (more important)

Find optimal p-cut value for new model

Get scores for LGM 2 and training set

Get scores for LGM 2 and testing set

Conclusion logistic regression

Support Vector Model (SVM)

Data Splitting

Weight Cost

Insights

Classification Tree

Data Spliting

Changing data types

Fit Classification Tree with all variables

PLotting tree

AUC-ROC model with weights for training

AUC-ROC model with weights for testing