The Spotify dataset comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata about songs from Spotify's API. Kaylin Pavlik wrote a recent blog post using the audio features to explore and classify songs; she used the spotifyr package to collect about 5,000 songs from 6 main categories.
The data contains general metadata about songs from Spotify's API: each track's popularity along with parameters such as acousticness, danceability, energy, speechiness, valence and key, plus genres and subgenres.
With this analysis we are interested in how track popularity is influenced by the other variables.
We want to answer the question: What characteristics of a song determine its popularity? Which genres and subgenres usually predict the most popular songs?
The plan is to analyze the relationship between popularity and different features of a song so that we can predict a song's future popularity with data modelling. We plan on performing data preparation, EDA and modelling with models such as linear regression, KNN, logistic regression, SVM and tree models, comparing which one performs better and which one best predicts whether a song is going to be popular or not.
This is beneficial for marketing to Spotify customers and improving their experience while using Spotify. Spotify will be able to provide more accurate predictions of a new song's potential popularity even before its release, and the consumer will be able to identify which factors influence the popularity of a song on Spotify.
library(tidyverse)
library(ggplot2)
library(kknn)
library(corrplot)
library(readr)
library(rpart)
library(rpart.plot)
tidyverse: assists with data import, tidying, manipulation, and visualization.
ggplot2: a package for producing statistical, or data, graphics.
kknn: performs k-nearest neighbor classification and regression.
corrplot: graphical display of a correlation matrix and confidence intervals.
readr: provides a fast way to read rectangular data.
rpart: implements the classification and regression tree (CART) algorithm.
rpart.plot: an enhanced plotting package for rpart.
spotify <- read.csv("/Users/evabeyebach/Desktop/spotify.csv")
We are going to examine our data in order to see how we can best clean it for further analysis.
### Checking dimension of Data
dim(spotify)
## [1] 32833 23
The original dataset contains 32833 rows and 23 columns.
# show the first few rows
head(spotify)
#### Checking column names
names(spotify)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
Now, we will look at the structure, summary statistics, data types, duplicates and missing values, and then clean the data.
### Checking structure of Data
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Observations:
str() shows the structure of the data. We can see that track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre and playlist_subgenre are character variables, while track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_ms are numeric. We need to change track_album_release_date to a date variable. We will also change playlist_genre to a factor for future plotting.
# Modifying data types
spotify$track_album_release_date <- as.Date(spotify$track_album_release_date)
spotify$playlist_genre <- as.factor(spotify$playlist_genre)
#summary statistics
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
##
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Min. :1957-01-01
## Class :character Class :character 1st Qu.:2010-12-04
## Mode :character Mode :character Median :2017-01-27
## Mean :2012-09-09
## 3rd Qu.:2019-05-16
## Max. :2020-01-29
## NA's :1886
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 edm :6043 Length:32833
## Class :character Class :character latin:5155 Class :character
## Mode :character Mode :character pop :5507 Mode :character
## r&b :5431
## rap :5746
## rock :4951
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
##
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
##
Observations:
duration_ms has a max of 517810 ms (about 8.6 minutes) and tempo has a max of 239.44 BPM. Since the dependent variable track_popularity is numeric and continuous, candidate models are a linear regression model or a regression tree.
# let's look at some tables for the categorical variables
table(spotify$playlist_genre)
##
## edm latin pop r&b rap rock
## 6043 5155 5507 5431 5746 4951
table(spotify$playlist_subgenre)
##
## album rock big room classic rock
## 1065 1206 1296
## dance pop electro house electropop
## 1298 1511 1408
## gangster rap hard rock hip hop
## 1458 1485 1322
## hip pop indie poptimism latin hip hop
## 1256 1672 1656
## latin pop neo soul new jack swing
## 1262 1637 1133
## permanent wave pop edm post-teen pop
## 1105 1517 1129
## progressive electro house reggaeton southern hip hop
## 1809 949 1675
## trap tropical urban contemporary
## 1291 1288 1405
table(spotify$key)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 3454 4010 2827 913 2201 2680 2670 3352 2430 3027 2273 2996
table(spotify$mode)
##
## 0 1
## 14259 18574
Observations:
mode is a binary variable. playlist_genre and playlist_subgenre are categorical variables describing the genre of the music.
# looking for duplicates
dups_id <- sum(duplicated(spotify$track_id))
print(dups_id)
## [1] 4477
We can see that a lot of songs are duplicated in this dataset; they share the same track_id. Therefore, we will remove them before further analysis.
spotify_dups <- spotify[duplicated(spotify$track_id), ]
spotify <- spotify[!duplicated(spotify$track_id), ]
We copied the duplicate rows to a separate dataset called spotify_dups and removed them from the working dataset.
#looking for missing values
sum(is.na(spotify))
## [1] 1693
There are 1693 missing values in this dataset. However, we will not remove them, since the affected rows might still be important for the analysis.
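As a quick check (a minimal sketch using base R), we can see which columns the missing values live in:
# count missing values per column and keep only columns that have any
na_counts <- colSums(is.na(spotify))
na_counts[na_counts > 0]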
After performing data preparation we found the following:
Different ways we could look at the data to answer our questions are histograms, to learn about the distribution of variables; boxplots, to find outliers and see which genres are most popular; scatterplots, to interpret the variables against track_popularity; and a correlation matrix, to find the correlation between variables.
We plan on splitting the data into training and testing sets and fitting different models to see which variables best predict the popularity of a song. We might choose different variables to see if a model performs better and improves its R^2. We are also going to compare each model's AUC or MR/MSE against the others.
We will not use track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name nor playlist_id; we only want to focus on the sound parameters and genres. We will plot some variables to observe their skewness and distribution, and interpret them.
histograms <- names(spotify)[c(4, 12:23)]
songs1 <- spotify %>%
  select(all_of(histograms)) %>%
  pivot_longer(cols = all_of(histograms))
songs1 %>%
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~name, ncol = 5, scales = 'free') +
labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
theme_void()
Observations:
Duration and valence look roughly normally distributed. Danceability, energy and loudness are left-skewed. Acousticness, liveness and speechiness are right-skewed. Key and mode are categorical variables. track_popularity peaks around 50-60 and is roughly normally distributed; there are a lot of zero values, possibly missing values, and very few songs have a popularity above 90.
ggplot(spotify, aes(x = playlist_genre, y = track_popularity, fill = playlist_genre)) + geom_boxplot()
ggplot(spotify, aes(x = playlist_subgenre, y = track_popularity, fill = playlist_subgenre)) + geom_boxplot()
Observations:
pop is the most popular genre, followed by latin and rock. post-teen pop is the most popular subgenre, followed by dance pop and permanent wave.
# new dataset with some variables
feature_names <- names(spotify)[c(12:13, 15, 17:19, 23)]
# new dataset displaying only these variables for the 500 most popular songs
songs <- spotify %>%
  arrange(desc(track_popularity)) %>%
  head(n = 500) %>%
  pivot_longer(cols = all_of(feature_names))
#plotting
songs %>%
ggplot(aes(x = name, y = value)) +
geom_jitter(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 4, scales = 'free') +
labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
theme(axis.text.y = element_blank())
Observation:
Among the most popular songs, higher danceability, energy and loudness usually mean more popularity; lower speechiness, acousticness and instrumentalness appear more often; and shorter duration tends to go with more popularity. If we turn track_popularity into a dummy variable, we could also use other models such as logistic regression, SVM and a regression tree. We will plot boxplots of every continuous variable to identify outliers.
ggplot(data = songs1) +
geom_boxplot(aes(y = value)) +
facet_wrap(~name, nrow = 4, scales = "free") +
coord_flip() +
ggtitle("Outlier analysis", subtitle = "For different song attributes") +
theme_void()
Observations:
From the boxplots we can observe that a lot of variables (danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, duration) have outliers. Removing them would influence the analysis a lot, so we will create a copy of the dataset and treat the outliers there.
spotify_copy <- spotify
Now, we will truncate danceability, energy, speechiness, acousticness, instrumentalness and liveness.
# truncation danceability, energy, speechiness, acousticness, instrumentalness and liveness
spotify_copy$danceability[spotify_copy$danceability <= 0.28] <- 0.28
spotify_copy$energy[spotify_copy$energy <= 0.2] <- 0.2
spotify_copy$speechiness[spotify_copy$speechiness >= 0.27] <- 0.27
spotify_copy$acousticness[spotify_copy$acousticness >= 0.6] <- 0.6
spotify_copy$instrumentalness[spotify_copy$instrumentalness >= 0.015] <- 0.015
spotify_copy$liveness[spotify_copy$liveness >= 0.4] <- 0.4
Now, we will winsorize loudness, tempo and duration.
# winsorization loudness
# Calculate the 5th and 95th percentiles for 'loudness'
lower_bound_loudness <- quantile(spotify_copy$loudness, 0.05, na.rm = TRUE)
upper_bound_loudness <- quantile(spotify_copy$loudness, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$loudness[spotify_copy$loudness < lower_bound_loudness] <- lower_bound_loudness
spotify_copy$loudness[spotify_copy$loudness > upper_bound_loudness] <- upper_bound_loudness
# winsorization tempo
# Calculate the 5th and 95th percentiles for 'tempo'
lower_bound_tempo <- quantile(spotify_copy$tempo, 0.05, na.rm = TRUE)
upper_bound_tempo <- quantile(spotify_copy$tempo, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$tempo[spotify_copy$tempo < lower_bound_tempo] <- lower_bound_tempo
spotify_copy$tempo[spotify_copy$tempo > upper_bound_tempo] <- upper_bound_tempo
#winsorize duration
# Calculate the 5th and 95th percentiles for 'duration'
lower_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.05, na.rm = TRUE)
upper_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$duration_ms[spotify_copy$duration_ms < lower_bound_duration_ms] <- lower_bound_duration_ms
spotify_copy$duration_ms[spotify_copy$duration_ms > upper_bound_duration_ms] <- upper_bound_duration_ms
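The three winsorization blocks above repeat the same pattern. A small helper function (a sketch of an equivalent refactor, not what we ran above) makes the intent clearer and less error-prone:
# clamp a numeric vector to its lower/upper quantiles (default 5th/95th)
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, bounds[1]), bounds[2])
}
# equivalent to the three blocks above:
# spotify_copy$loudness    <- winsorize(spotify_copy$loudness)
# spotify_copy$tempo       <- winsorize(spotify_copy$tempo)
# spotify_copy$duration_ms <- winsorize(spotify_copy$duration_ms)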
knitr::kable(head(spotify[, 1:23]), "simple")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
Observations:
Histograms helped us look at the distribution of variables, and since a lot of them were numeric we concluded that linear regression and a regression tree might be good models to predict track_popularity.
Scatterplots helped us see which parameters might influence our dependent variable the most: instrumentalness, speechiness and loudness, since the data was densest there.
Boxplots against popularity helped us identify the most popular genres: pop, latin and rock. Popular subgenres were post-teen pop, dance pop and permanent wave.
To create new summary information, we plan on narrowing our data down to the variables we really want to explore, to help us gain insights into our predictions.
We still need to compute a correlation matrix and perform modelling to see which variables best predict track_popularity and whether they agree with our plots.
First, we will select the variables for the correlation matrix. We will recode playlist_genre to numeric codes so that it can be included in the correlation.
matrix <- spotify %>% select(track_popularity, danceability, energy, key, loudness, mode, speechiness,acousticness, instrumentalness, liveness, valence, tempo, duration_ms, playlist_genre)
matrix$playlist_genre<-recode(matrix$playlist_genre, 'pop'=0, 'r&b'=1, 'rock'=2, 'latin'=3, 'edm'=4, 'rap'=5)
cor_matrix <- cor(matrix)
print(cor_matrix)
## track_popularity danceability energy key
## track_popularity 1.000000000 0.046596507 -0.10362202 -0.0081442988
## danceability 0.046596507 1.000000000 -0.08143602 0.0070203133
## energy -0.103622023 -0.081436016 1.00000000 0.0128766164
## key -0.008144299 0.007020313 0.01287662 1.0000000000
## loudness 0.037285154 0.015297214 0.68213774 -0.0005315332
## mode 0.016275225 -0.055211662 -0.00431074 -0.1760953351
## speechiness 0.005205509 0.183453077 -0.02900840 0.0231150971
## acousticness 0.091725358 -0.028878237 -0.54588619 0.0041975410
## instrumentalness -0.124431070 -0.002266695 0.02381792 0.0073927120
## liveness -0.052773029 -0.127002362 0.16370922 0.0021365502
## valence 0.022581329 0.333728627 0.14971046 0.0216180479
## tempo 0.004446180 -0.184575167 0.15154451 -0.0103554421
## duration_ms -0.139681783 -0.087560915 0.01758400 0.0188095544
## playlist_genre -0.073434088 0.190073891 0.08828930 0.0100195621
## loudness mode speechiness acousticness
## track_popularity 0.0372851540 0.0162752246 0.005205509 0.091725358
## danceability 0.0152972137 -0.0552116618 0.183453077 -0.028878237
## energy 0.6821377425 -0.0043107400 -0.029008396 -0.545886191
## key -0.0005315332 -0.1760953351 0.023115097 0.004197541
## loudness 1.0000000000 -0.0176877795 0.012981015 -0.371602357
## mode -0.0176877795 1.0000000000 -0.059650521 0.006760044
## speechiness 0.0129810147 -0.0596505208 1.000000000 0.024945425
## acousticness -0.3716023568 0.0067600443 0.024945425 1.000000000
## instrumentalness -0.1543090413 -0.0057660749 -0.107959332 -0.003100502
## liveness 0.0819270381 -0.0002748805 0.059342894 -0.074539622
## valence 0.0495101626 -0.0030881898 0.064700922 -0.018998564
## tempo 0.0967061101 0.0167097090 0.032728605 -0.114335210
## duration_ms -0.1045846289 0.0128129792 -0.098435680 -0.094143218
## playlist_genre 0.0715914257 -0.0505853182 0.298193374 -0.076573192
## instrumentalness liveness valence tempo
## track_popularity -0.124431070 -0.0527730291 0.02258133 0.004446180
## danceability -0.002266695 -0.1270023619 0.33372863 -0.184575167
## energy 0.023817924 0.1637092244 0.14971046 0.151544510
## key 0.007392712 0.0021365502 0.02161805 -0.010355442
## loudness -0.154309041 0.0819270381 0.04951016 0.096706110
## mode -0.005766075 -0.0002748805 -0.00308819 0.016709709
## speechiness -0.107959332 0.0593428937 0.06470092 0.032728605
## acousticness -0.003100502 -0.0745396224 -0.01899856 -0.114335210
## instrumentalness 1.000000000 -0.0085048515 -0.17416318 0.021484050
## liveness -0.008504851 1.0000000000 -0.01992507 0.022026642
## valence -0.174163182 -0.0199250679 1.00000000 -0.025135288
## tempo 0.021484050 0.0220266420 -0.02513529 1.000000000
## duration_ms 0.059077441 0.0075705280 -0.03332403 -0.001695701
## playlist_genre 0.147880944 0.0508644474 -0.06999690 0.050109733
## duration_ms playlist_genre
## track_popularity -0.139681783 -0.07343409
## danceability -0.087560915 0.19007389
## energy 0.017584005 0.08828930
## key 0.018809554 0.01001956
## loudness -0.104584629 0.07159143
## mode 0.012812979 -0.05058532
## speechiness -0.098435680 0.29819337
## acousticness -0.094143218 -0.07657319
## instrumentalness 0.059077441 0.14788094
## liveness 0.007570528 0.05086445
## valence -0.033324030 -0.06999690
## tempo -0.001695701 0.05010973
## duration_ms 1.000000000 -0.07577263
## playlist_genre -0.075772625 1.00000000
corrplot(cor_matrix, method="color", type="upper", order="hclust")
The overall ranking of the variables that correlate most strongly with track_popularity is the following:
duration_ms (-0.1396)
instrumentalness (-0.1244)
energy (-0.1036)
acousticness (0.0917)
playlist_genre (-0.0734)
Compared to the scatterplots, one would have expected instrumentalness, speechiness and loudness to have the strongest correlations, since they had the most of the most-popular songs concentrated in the smallest interval. In this correlation matrix, however, duration_ms, instrumentalness and energy influence track_popularity the most.
Our dependent variable is track_popularity, and we want to see which variables best predict it. Therefore, we will first build a linear regression model to see how well it predicts track_popularity.
# Set the seed for reproducibility
set.seed(2023)
# Randomly sample row indices for the training set split in 70% and 30%
train_indices <- sample(1:NROW(spotify),NROW(spotify)*0.70)
# Create the training set
train_data <- spotify[train_indices, ] #everything before comma is row selector and after comma is column selector
# Create the testing set
test_data <- spotify[-train_indices, ]
# Train the linear regression model, comparing popularity to rest of numeric parameters
lm_model <- lm(track_popularity ~ danceability + energy + mode + key + acousticness + loudness + speechiness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + mode +
## key + acousticness + loudness + speechiness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.757 -17.360 2.935 18.119 60.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.684e+01 2.045e+00 32.682 < 2e-16 ***
## danceability 4.170e+00 1.288e+00 3.237 0.00121 **
## energy -2.318e+01 1.461e+00 -15.864 < 2e-16 ***
## mode 7.590e-01 3.357e-01 2.261 0.02378 *
## key 1.398e-02 4.595e-02 0.304 0.76088
## acousticness 4.934e+00 8.895e-01 5.547 2.95e-08 ***
## loudness 1.132e+00 7.802e-02 14.512 < 2e-16 ***
## speechiness -7.397e+00 1.657e+00 -4.464 8.09e-06 ***
## instrumentalness -9.426e+00 7.437e-01 -12.676 < 2e-16 ***
## liveness -4.420e+00 1.077e+00 -4.104 4.08e-05 ***
## valence 1.997e+00 7.861e-01 2.540 0.01110 *
## tempo 2.808e-02 6.261e-03 4.485 7.33e-06 ***
## duration_ms -4.277e-05 2.726e-06 -15.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.03 on 19836 degrees of freedom
## Multiple R-squared: 0.06003, Adjusted R-squared: 0.05946
## F-statistic: 105.6 on 12 and 19836 DF, p-value: < 2.2e-16
The model suggests that all the variables are significant except key, since it is the only variable with a p-value higher than 0.05. This model has an R^2 of 0.06003, which means that only 6% of the data's variability can be explained by the regression model, indicating that this is not a good model.
Among the 0-1 audio features, instrumentalness, speechiness and liveness have large negative coefficients (-9.426, -7.397 and -4.420): increasing each by one unit lowers predicted track_popularity by that amount. duration_ms also matters, although its coefficient (-4.277e-05) is per millisecond, so its effect accumulates over a song's length. However, energy and loudness did not affect track_popularity as we expected.
We will look at the In-Sample MSE to evaluate the model.
lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 529.94"
We got an MSE of 529.94, which is high for this model; a value close to 0 would be perfect.
Now we will look at the Out-Of-Sample MSE to evaluate the model.
# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate the out-of-sample MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 527.51"
The MSE for the testing set is lower, with a value of 527.51.
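Since we will compute training and testing MSE for every model below, a small helper (a sketch; the chunks in this report compute the MSE inline) would keep the comparisons consistent:
# mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)
# for example, for the linear model above:
# mse(train_data$track_popularity, lm_model$fitted.values)                  # in-sample
# mse(test_data$track_popularity, predict(lm_model, newdata = test_data))   # out-of-sample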
Now, we will train the linear model, comparing it to other categorical values to see how they predict popularity.
# Train the linear regression model, comparing popularity to other character parameters
lm_model_ch <- lm(track_popularity ~ playlist_genre + playlist_subgenre, data = train_data)
summary(lm_model_ch)
##
## Call:
## lm(formula = track_popularity ~ playlist_genre + playlist_subgenre,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.418 -16.339 2.905 16.931 60.098
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9894 0.6853 35.005 < 2e-16
## playlist_genrelatin 18.0795 1.0338 17.489 < 2e-16
## playlist_genrepop 31.4286 1.0654 29.501 < 2e-16
## playlist_genrer&b 22.2155 1.0313 21.541 < 2e-16
## playlist_genrerap 24.4461 1.0227 23.902 < 2e-16
## playlist_genrerock 13.5668 1.0741 12.631 < 2e-16
## playlist_subgenrebig room 4.4078 1.0705 4.117 3.85e-05
## playlist_subgenreclassic rock -0.2592 1.1543 -0.225 0.822309
## playlist_subgenredance pop -3.0791 1.1009 -2.797 0.005164
## playlist_subgenreelectro house 9.7948 0.9814 9.981 < 2e-16
## playlist_subgenreelectropop -16.3501 1.0989 -14.878 < 2e-16
## playlist_subgenregangster rap -15.9361 1.0535 -15.127 < 2e-16
## playlist_subgenrehard rock -4.2979 1.1230 -3.827 0.000130
## playlist_subgenrehip hop 4.6597 1.0591 4.400 1.09e-05
## playlist_subgenrehip pop -1.8091 1.2060 -1.500 0.133599
## playlist_subgenreindie poptimism -14.2180 1.0637 -13.367 < 2e-16
## playlist_subgenrelatin hip hop -9.5706 1.0786 -8.873 < 2e-16
## playlist_subgenrelatin pop 4.3781 1.1083 3.950 7.83e-05
## playlist_subgenreneo soul -16.3026 1.0344 -15.760 < 2e-16
## playlist_subgenrenew jack swing -19.9120 1.1214 -17.757 < 2e-16
## playlist_subgenrepermanent wave 14.2161 1.1811 12.036 < 2e-16
## playlist_subgenrepop edm 12.5587 1.0861 11.563 < 2e-16
## playlist_subgenrepost-teen pop NA NA NA NA
## playlist_subgenreprogressive electro house NA NA NA NA
## playlist_subgenrereggaeton 4.2245 1.2535 3.370 0.000752
## playlist_subgenresouthern hip hop -13.6192 1.0064 -13.533 < 2e-16
## playlist_subgenretrap NA NA NA NA
## playlist_subgenretropical NA NA NA NA
## playlist_subgenreurban contemporary NA NA NA NA
##
## (Intercept) ***
## playlist_genrelatin ***
## playlist_genrepop ***
## playlist_genrer&b ***
## playlist_genrerap ***
## playlist_genrerock ***
## playlist_subgenrebig room ***
## playlist_subgenreclassic rock
## playlist_subgenredance pop **
## playlist_subgenreelectro house ***
## playlist_subgenreelectropop ***
## playlist_subgenregangster rap ***
## playlist_subgenrehard rock ***
## playlist_subgenrehip hop ***
## playlist_subgenrehip pop
## playlist_subgenreindie poptimism ***
## playlist_subgenrelatin hip hop ***
## playlist_subgenrelatin pop ***
## playlist_subgenreneo soul ***
## playlist_subgenrenew jack swing ***
## playlist_subgenrepermanent wave ***
## playlist_subgenrepop edm ***
## playlist_subgenrepost-teen pop
## playlist_subgenreprogressive electro house
## playlist_subgenrereggaeton ***
## playlist_subgenresouthern hip hop ***
## playlist_subgenretrap
## playlist_subgenretropical
## playlist_subgenreurban contemporary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.07 on 19825 degrees of freedom
## Multiple R-squared: 0.1372, Adjusted R-squared: 0.1362
## F-statistic: 137.1 on 23 and 19825 DF, p-value: < 2.2e-16
The R^2 is bigger, meaning that 13.62% of the variance can be explained by this model. It is still low. Among the genre coefficients, pop has the highest one (31.43), followed by rap and r&b. Looking back at the plots, it looked like pop, latin and rock were the most popular. Among the subgenres, playlist_subgenrehip pop and playlist_subgenreclassic rock were the only non-significant ones. new jack swing had the largest negative coefficient (-19.91), followed by electropop and neo soul (both around -16.3).
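A note on reading these coefficients: lm() dummy-codes a factor against its first level, which is why no edm coefficient appears above; every genre estimate is a difference from the edm baseline. A quick sketch to check (and change) the baseline:
# the first factor level is the implicit baseline for the dummies
levels(spotify$playlist_genre)   # "edm" comes first alphabetically
# to compare against a different baseline, relevel before fitting:
# train_data$playlist_genre <- relevel(train_data$playlist_genre, ref = "pop")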
# Training MSE for the categorical model
lm_train_pred_ch <- mean((lm_model_ch$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_train_pred_ch, 2)))
## [1] "Training MSE for Linear Model: 486.43"
# Predict on testing data (character)
lm_test_pred_ch <- predict(lm_model_ch, newdata = test_data)
## Warning in predict.lm(lm_model_ch, newdata = test_data): prediction from a
## rank-deficient fit may be misleading
# Calculate the out-of-sample MSE
lm_mse_test_ch <- mean((lm_test_pred_ch - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test_ch, 2)))
## [1] "Testing MSE for Linear Model: 485.8"
The MSE of the model with the categorical variables is still very high, with a training MSE of 486.43 and a testing MSE of 485.8.
Now we will try adding an interaction to see how it fits the model. We will use the variables that had the biggest coefficients before, to see if we can improve the R^2 with an interaction.
lm_model_int <- lm(track_popularity ~ instrumentalness + speechiness + instrumentalness*speechiness, data = spotify)
summary(lm_model_int)
##
## Call:
## lm(formula = track_popularity ~ instrumentalness + speechiness +
## instrumentalness * speechiness, data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.770 -17.723 3.269 18.519 64.505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.8548 0.2187 186.804 < 2e-16 ***
## instrumentalness -15.2125 0.8901 -17.091 < 2e-16 ***
## speechiness -3.3088 1.4190 -2.332 0.019718 *
## instrumentalness:speechiness 30.0642 8.0613 3.729 0.000192 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.51 on 28352 degrees of freedom
## Multiple R-squared: 0.01603, Adjusted R-squared: 0.01593
## F-statistic: 154 on 3 and 28352 DF, p-value: < 2.2e-16
As we saw before, almost every variable is significant for track popularity, so it is easy to find significant interactions. We performed an interaction which was significant; however, the R^2 is still far too low, now 1.6%. The interaction coefficient is 30.064, meaning that the effect of instrumentalness on track_popularity changes by 30.064 times the value of speechiness (and vice versa).
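To make that concrete, the marginal effect of instrumentalness implied by this model is -15.21 + 30.06 × speechiness; a short sketch evaluates it at a few speechiness values:
# marginal effect of instrumentalness at several speechiness levels,
# using the coefficients from lm_model_int above
b <- coef(lm_model_int)
speech <- c(0.05, 0.25, 0.50)
b["instrumentalness"] + b["instrumentalness:speechiness"] * speech
For speechiness above roughly 0.5, the implied effect of instrumentalness flips sign.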
# In-sample MSE; note that lm_model_int was fit on the full spotify data,
# so its fitted values do not line up with train_data (hence the warning below)
lm_train_pred_int <- mean((lm_model_int$fitted.values - train_data$track_popularity)^2)
## Warning in lm_model_int$fitted.values - train_data$track_popularity: longer
## object length is not a multiple of shorter object length
print(paste("Training MSE for Linear Model:", round(lm_train_pred_int, 2)))
## [1] "Training MSE for Linear Model: 572.16"
# Predict on testing data
lm_test_pred_int <- predict(lm_model_int, newdata = test_data)
# Calculate the out-of-sample MSE
lm_mse_test <- mean((lm_test_pred_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 549.2"
lm_model_no_int <- lm(track_popularity ~ instrumentalness + speechiness , data = spotify)
summary(lm_model_no_int)
##
## Call:
## lm(formula = track_popularity ~ instrumentalness + speechiness,
## data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.701 -17.639 3.357 18.561 64.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.7014 0.2149 189.438 <2e-16 ***
## instrumentalness -12.7742 0.6041 -21.145 <2e-16 ***
## speechiness -1.9240 1.3698 -1.405 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.52 on 28353 degrees of freedom
## Multiple R-squared: 0.01555, Adjusted R-squared: 0.01548
## F-statistic: 223.9 on 2 and 28353 DF, p-value: < 2.2e-16
If we perform the same model with no interaction, the R^2 is 1.55%. This means that only 1.55% of the variance can be explained by this model, indicating that it is not a good model. It is slightly worse than the model with the interaction.
# In-sample MSE; as above, lm_model_no_int was fit on the full spotify data,
# so the vector lengths differ (hence the warning below)
lm_train_pred_no_int <- mean((lm_model_no_int$fitted.values - train_data$track_popularity)^2)
## Warning in lm_model_no_int$fitted.values - train_data$track_popularity: longer
## object length is not a multiple of shorter object length
print(paste("Training MSE for Linear Model:", round(lm_train_pred_no_int, 2)))
## [1] "Training MSE for Linear Model: 572.03"
# Predict on testing data
lm_test_pred_no_int <- predict(lm_model_no_int, newdata = test_data)
# Calculate the out-of-sample MSE
lm_mse_test <- mean((lm_test_pred_no_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 549.59"
# create a matrix with 2 columns and 4 rows
data <- matrix(c('529.94', '527.51', '486.43', '485.8', '572.16', '549.2', '572.03', '549.59'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(data) = c('Training MSE','Testing MSE')
rownames(data) <- c('LM1','LM2','LM3','LM4')
# assign to table
lm_MSE=as.table(data)
# display
lm_MSE
## Training MSE Testing MSE
## LM1 529.94 527.51
## LM2 486.43 485.8
## LM3 572.16 549.2
## LM4 572.03 549.59
From the model with numeric variables, we could see that all variables except key were statistically significant at the 0.05 level, which means that a variation in them would influence track_popularity. Then we performed a linear model with categorical variables, genre and subgenre; there, all variables besides playlist_subgenrehip pop and playlist_subgenreclassic rock are statistically significant at the 0.05 level. From the table we can see that the model that performed the best was the one with the categorical variables (though its MSE is still high).
While we thought from the plots that pop, latin and rock were the most popular genres, this model indicated that it was pop, rap and r&b. Our model also indicated that loudness and energy might not affect popularity in the way we thought. The best R^2 was the model with genres and subgenres (13.62%), but that model is still not very good.
Next, we fit a KNN regression with k = 5 on the same continuous predictors, evaluating it first in-sample.
spotify_knn_model <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 5)
# Predict on training data
knn_train_pred <- fitted.values(spotify_knn_model)
# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))
## [1] "In-Sample MSE for KNN: 194.903869196156"
# Predict on testing data
knn_model_test <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 5)
knn_test_pred <- fitted.values(knn_model_test)
# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))
## [1] "Out-of-Sample MSE for KNN: 687.024025645934"
spotify_knn_model_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 20)
# Predict on training data
knn_train_pred_20 <- fitted.values(spotify_knn_model_20)
# Calculate in-sample MSE manually
knn_train_mse_20 <- mean((train_data$track_popularity - knn_train_pred_20)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse_20))
## [1] "In-Sample MSE for KNN: 389.061411631355"
# Predict on testing data
knn_model_test_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 20)
knn_test_pred_20 <- fitted.values(knn_model_test_20)
# Calculate out-of-sample MSE manually
knn_test_mse_20 <- mean((test_data$track_popularity - knn_test_pred_20)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse_20))
## [1] "Out-of-Sample MSE for KNN: 556.981777608275"
spotify_knn_model_500 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 500)
# Predict on training data
knn_train_pred_500 <- fitted.values(spotify_knn_model_500)
# Calculate in-sample MSE manually
knn_train_mse_500 <- mean((train_data$track_popularity - knn_train_pred_500)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse_500))
## [1] "In-Sample MSE for KNN: 517.746609308476"
# Predict on testing data
knn_model_test_500 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 500)
knn_test_pred_500 <- fitted.values(knn_model_test_500)
# Calculate out-of-sample MSE manually
knn_test_mse_500 <- mean((test_data$track_popularity - knn_test_pred_500)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse_500))
## [1] "Out-of-Sample MSE for KNN: 524.140072017361"
# create a matrix with 2 columns and 3 rows
knn <- matrix(c('194.90', '687.02', '389.06', '556.981', '517.75', '524.14'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(knn) = c('Training MSE','Testing MSE')
rownames(knn) <- c('knn5', 'knn20', 'knn500' )
# assign to table
knn_MSE=as.table(knn)
# display
knn_MSE
## Training MSE Testing MSE
## knn5 194.90 687.02
## knn20 389.06 556.981
## knn500 517.75 524.14
lm_MSE
## Training MSE Testing MSE
## LM1 529.94 527.51
## LM2 486.43 485.8
## LM3 572.16 549.2
## LM4 572.03 549.59
We tried linear regression and the KNN model with different k's to see which model would be best for predicting track popularity, using the mean squared error to compare them. When comparing the LM to the KNN models, we saw that the model with the best training MSE was knn5, but the MSE on the testing set was too high on all three KNN models (687.02, 556.98 and 524.14). We did not use all of the variables in the dataset, but we used all of our continuous variables. We decided to go this route because we mostly wanted to explore track popularity and the relationship it has with the continuous variables in our dataset.
Since the MSE is not quite where we want it to be, we are going to do some feature scaling to see if there is any impact on a KNN model with fewer variables. The variables that will be removed are key, mode and acousticness.
#Creating a standardized train knn model
train_new.x <- train_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]
stand_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5)
# Creating an unstandardized knn model (scale = FALSE)
knn_unstand <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5, scale = FALSE)
#finding the MSE for our standardized model
#first we have to find the prediction of our standardized knn model
pred_stand <- fitted(stand_knn)
train_new.y <- train_new.x$track_popularity
#Standardized MSE
stand_mse <- mean((train_new.y - pred_stand)^2)
#Unstandardized MSE
pred_undstand <- fitted(knn_unstand)
unstand_mse <- mean((train_new.y - pred_undstand)^2)
#output
print(paste("Training MSE of standardized data :", stand_mse))
## [1] "Training MSE of standardized data : 206.7465001884"
print(paste("Training MSE of unstandardized data:", unstand_mse))
## [1] "Training MSE of unstandardized data: 217.941700903636"
For this model we took out key, mode and acousticness. The MSE of the standardized data is 206.7465 and the MSE of the unstandardized data is 217.9417, so there is a very small difference between the two. Our original training model had an MSE of 194.90, which is also a small difference compared to our feature-scaling models. This leads us to the conclusion that feature scaling can make a difference in the overall model, but it might not be needed here.
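kknn's scale argument standardizes the predictors internally; the sketch below shows the idea with an explicit z-score transformation (how kknn applies the scaling internally may differ slightly):
# standardize the predictors manually, leaving the response untouched
predictors <- setdiff(names(train_new.x), "track_popularity")
train_scaled <- train_new.x
train_scaled[predictors] <- scale(train_new.x[predictors])
summary(train_scaled$danceability) # mean ~0, sd ~1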
Now let's evaluate on the test data with the same variables removed.
# Creating a standardized knn model on the test split (note: the test split serves as both train and test here, mirroring the training chunk above)
test_new.x <- test_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]
stand_test_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5)
# Creating an unstandardized knn model (scale = FALSE)
knn_test_unstand <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5, scale = FALSE)
#finding the MSE for our standardized model
#first we have to find the prediction of our standardized knn model
pred_stand_test <- fitted(stand_test_knn)
test_new.y <- test_new.x$track_popularity
#Standardized MSE
test_stand_mse <- mean((test_new.y - pred_stand_test)^2)
#Unstandardized MSE
pred_undstand_test <- fitted(knn_test_unstand)
unstand_mse_test <- mean((test_new.y - pred_undstand_test)^2)
#output
print(paste("Testing MSE of standardized data :", test_stand_mse))
## [1] "Testing MSE of standardized data : 201.450244331452"
print(paste("Test MSE of unstandardized data:", unstand_mse_test))
## [1] "Test MSE of unstandardized data: 213.310812353227"
# create a matrix with 2 columns and 5 rows
knn <- matrix(c('194.90', '687.02', '389.06', '556.981', '517.75', '524.14', '217.94', '213.31', '206.75', '201.45'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(knn) = c('Training MSE','Testing MSE')
rownames(knn) <- c('knn5', 'knn20','knn500', 'knn_un', 'knn_stand' )
# assign to table
knn_MSE=as.table(knn)
# display
knn_MSE
## Training MSE Testing MSE
## knn5 194.90 687.02
## knn20 389.06 556.981
## knn500 517.75 524.14
## knn_un 217.94 213.31
## knn_stand 206.75 201.45
lm_MSE
## Training MSE Testing MSE
## LM1 529.94 527.51
## LM2 486.43 485.8
## LM3 572.16 549.2
## LM4 572.03 549.59
We fit a new KNN with fewer variables and k = 5 on the training and testing splits, with both unstandardized and standardized data. The MSE improved in both cases, though we were using different variables; the smallest testing MSE went down to 201.45 with the standardized model, against 213.31 for the same model without standardization. Feature scaling helps somewhat, but what we can really conclude is that the model improved a lot when we dropped some variables.
# diagnostic plot
par(mfrow = c(2, 2))
plot(lm_model)
##### Interpreting Models
Theoretically, the model that would fit the data best is the linear regression model because it is easier to interpret: you can see the coefficients and how Y changes for a one-unit increase in each variable. You also have to meet the four assumptions (linearity, independence, normality, and homoscedasticity); we can run a diagnostic plot to see if our model meets them.
From the diagnostic plot, we see that the model does not meet the four assumptions, which leads us to the conclusion that the linear regression model is not the best fit for this data. It also has a very low R^2 (also when we tried different variable sets) and a very high MSE.
KNN performed better in practice since the MSE was lower, but only at small k (k = 5). We ran the same model with k = 50 and the same variables and the MSE went up to about 490. Therefore, although the R^2 was very small and the MSE very high, we still think that the LM is a better model, since it is also more interpretable.
Let's create track_popularity as a dummy variable: a 1 when the popularity is at least 70 and a 0 when it is below 70, meaning that 1 is popular and 0 is not popular. track_popularity will again be our dependent variable.
spotify_copy <- spotify # start again from a fresh copy of the deduplicated data
spotify_copy$track_popularity <- ifelse(spotify_copy$track_popularity >=70, 1 , 0)
set.seed(2023)
index <- sample(1:nrow(spotify_copy),nrow(spotify_copy)*0.80)
spotify_copy.train = spotify_copy[index,]
spotify_copy.test = spotify_copy[-index,]
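Before fitting, it is worth checking how imbalanced the new dummy response is (a quick sketch); this imbalance is what motivates the weighted cost function used later:
# class balance of the popularity dummy in the training split
table(spotify_copy.train$track_popularity)
round(prop.table(table(spotify_copy.train$track_popularity)), 3)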
spotify_copy.glm0<- glm(track_popularity~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, family=binomial, data=spotify_copy.train)
summary(spotify_copy.glm0)
##
## Call:
## glm(formula = track_popularity ~ danceability + energy + key +
## loudness + mode + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, family = binomial,
## data = spotify_copy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1331 -0.5000 -0.4123 -0.2795 4.2276
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.066e+00 3.085e-01 3.455 0.000551 ***
## danceability 3.319e-01 1.902e-01 1.745 0.081027 .
## energy -3.095e+00 2.169e-01 -14.269 < 2e-16 ***
## key -9.644e-03 6.523e-03 -1.478 0.139311
## loudness 2.076e-01 1.310e-02 15.843 < 2e-16 ***
## mode 1.286e-01 4.805e-02 2.676 0.007446 **
## speechiness -7.592e-01 2.401e-01 -3.162 0.001568 **
## acousticness 1.473e-01 1.228e-01 1.199 0.230523
## instrumentalness -3.071e+00 2.814e-01 -10.916 < 2e-16 ***
## liveness -4.376e-01 1.710e-01 -2.558 0.010518 *
## valence 4.779e-01 1.159e-01 4.125 3.71e-05 ***
## tempo 2.652e-03 8.573e-04 3.093 0.001980 **
## duration_ms -1.883e-06 4.619e-07 -4.076 4.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14035 on 22683 degrees of freedom
## Residual deviance: 13193 on 22671 degrees of freedom
## AIC: 13219
##
## Number of Fisher Scoring iterations: 7
Observation: As we can see from this model, energy, loudness, mode, speechiness, instrumentalness, liveness, valence, tempo and duration_ms are significant in predicting whether a song is going to be popular or not. As in the linear regression, energy has the largest coefficient in magnitude (-3.095), followed closely by instrumentalness (-3.071) and then speechiness (-0.759); key (estimate -9.644e-03) is small and not significant.
pred.glm0.train <- predict(spotify_copy.glm0,type="response")
hist(pred.glm0.train)
Observation: This histogram illustrates the impact of choosing different cut-off probabilities. Choosing a small cut-off probability will result in many cases being predicted as 1, and choosing a large cut-off probability will result in few cases being predicted as 1.
#ROC curve
library(ROCR)
pred <- prediction(pred.glm0.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6855791
Observation:
We got the AUC (area under the curve). We would like a number closer to 1, which would mean perfect prediction; ours is 0.6856, which is not too bad.
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")
pred <- prediction(pred.glm0.test, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6610358
Observations:
In order to get the confusion matrix of true positive and true negative predictions, we will determine the optimal cut-off probability using a grid search. We will use weights of 5 (for false negatives) and 1 (for false positives).
# define a cost function with input "obs" being the observed response,
# "pred.p" being the predicted probability, and "pcut" being the threshold
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # return the weighted misclassification cost
} # end of the function
Next, we will define a sequence of probabilities (we will search for the optimal p-cut over this sequence).
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
Then, we will calculate the cost for each probability in the sequence p.seq.
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0.train, pcut = p.seq[i])
} # end of the loop
#cbind(p.seq, cost)
Last, we will draw a plot with cost against p.seq, and find the p-cut that gives us the minimum cost.
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, cost)
# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)
## [1] 0.16
Observation:
With weights of 5 and 1, we got 0.16 as the optimal cut-off probability.
# step 1. get binary classification
class.glm0.train.opt<- (pred.glm0.train>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.train$track_popularity, class.glm0.train.opt, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 18732 1843
## 1 1636 473
Observation:
With weights of 5 and 1, the confusion matrix shows 18732 TN, 473 TP, 1843 FP and 1636 FN. It means, for example, that 473 songs were predicted as popular and actually were popular.
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.opt)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.opt==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.opt==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.opt, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.153368012696174"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0895747266099635"
print(paste0("FNR:",FNR))
## [1] "FNR:0.775723091512565"
print(paste0("cost:",cost))
## [1] "cost:0.441853288661612"
We got 0.1534 for MR, which is fairly accurate for such complicated data. We also got 0.0896 for FPR, 0.7757 for FNR and 0.4419 for cost.
Now, let's do the same with the out-of-sample data, which is more important, and calculate the MR.
# step 0. obtain predicated values on testing data (we actually did this already, but let's do it again)
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")
# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.opt <- (pred.glm0.test>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.test$track_popularity, pred.glm0.test.opt, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 4698 448
## 1 416 110
Observation: With weights of 5 and 1 (optimal cut-off 0.16), the confusion matrix shows 4698 TN, 110 TP, 448 FP and 416 FN. It means, for example, that 110 songs were predicted as popular and actually were popular.
MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.opt)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.opt==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.opt==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.opt, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.152327221438646"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0870579090555771"
print(paste0("FNR:",FNR))
## [1] "FNR:0.790874524714829"
print(paste0("cost:",cost))
## [1] "cost:0.445698166431594"
The out-of-sample MR is still very good, at 0.1523. We also got 0.0871 for FPR, 0.7909 for FNR and 0.4457 for cost. All of the values were better for the testing set except the FNR.
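The MR/FPR/FNR computation above is repeated verbatim for the training and testing splits; a small helper (a sketch) computes all three from observed labels and 0/1 predictions:
# misclassification rate, false positive rate and false negative rate
class_metrics <- function(obs, pred) {
  c(MR  = mean(obs != pred),
    FPR = sum(obs == 0 & pred == 1) / sum(obs == 0),
    FNR = sum(obs == 1 & pred == 0) / sum(obs == 1))
}
# e.g. class_metrics(spotify_copy.test$track_popularity, pred.glm0.test.opt)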
spotify_glm0_cat<- glm(track_popularity~ playlist_genre + playlist_subgenre, family=binomial, data=spotify_copy.train)
summary(spotify_glm0_cat)
##
## Call:
## glm(formula = track_popularity ~ playlist_genre + playlist_subgenre,
## family = binomial, data = spotify_copy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9468 -0.5258 -0.3472 -0.1463 3.3090
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.4706 0.4474 -12.229 < 2e-16
## playlist_genrelatin 2.1066 0.4832 4.360 1.30e-05
## playlist_genrepop 4.9005 0.4532 10.813 < 2e-16
## playlist_genrer&b 3.5616 0.4578 7.781 7.21e-15
## playlist_genrerap 3.4258 0.4587 7.469 8.10e-14
## playlist_genrerock 3.1048 0.4647 6.681 2.38e-11
## playlist_subgenrebig room 0.1464 0.6715 0.218 0.827462
## playlist_subgenreclassic rock -0.1167 0.1790 -0.652 0.514502
## playlist_subgenredance pop -0.6406 0.1037 -6.178 6.50e-10
## playlist_subgenreelectro house 0.9380 0.5332 1.759 0.078579
## playlist_subgenreelectropop -1.1165 0.1131 -9.868 < 2e-16
## playlist_subgenregangster rap -1.1283 0.1873 -6.025 1.69e-09
## playlist_subgenrehard rock -0.7950 0.2064 -3.853 0.000117
## playlist_subgenrehip hop 0.4123 0.1312 3.142 0.001676
## playlist_subgenrehip pop 0.1436 0.1470 0.977 0.328782
## playlist_subgenreindie poptimism -2.2086 0.1422 -15.535 < 2e-16
## playlist_subgenrelatin hip hop 0.9102 0.2183 4.169 3.06e-05
## playlist_subgenrelatin pop 1.9038 0.2023 9.410 < 2e-16
## playlist_subgenreneo soul -1.8874 0.2208 -8.549 < 2e-16
## playlist_subgenrenew jack swing -3.2198 0.4588 -7.018 2.25e-12
## playlist_subgenrepermanent wave 0.7548 0.1584 4.766 1.88e-06
## playlist_subgenrepop edm 2.7406 0.4715 5.812 6.16e-09
## playlist_subgenrepost-teen pop NA NA NA NA
## playlist_subgenreprogressive electro house NA NA NA NA
## playlist_subgenrereggaeton 1.5118 0.2209 6.844 7.71e-12
## playlist_subgenresouthern hip hop -1.0521 0.1711 -6.149 7.79e-10
## playlist_subgenretrap NA NA NA NA
## playlist_subgenretropical NA NA NA NA
## playlist_subgenreurban contemporary NA NA NA NA
##
## (Intercept) ***
## playlist_genrelatin ***
## playlist_genrepop ***
## playlist_genrer&b ***
## playlist_genrerap ***
## playlist_genrerock ***
## playlist_subgenrebig room
## playlist_subgenreclassic rock
## playlist_subgenredance pop ***
## playlist_subgenreelectro house .
## playlist_subgenreelectropop ***
## playlist_subgenregangster rap ***
## playlist_subgenrehard rock ***
## playlist_subgenrehip hop **
## playlist_subgenrehip pop
## playlist_subgenreindie poptimism ***
## playlist_subgenrelatin hip hop ***
## playlist_subgenrelatin pop ***
## playlist_subgenreneo soul ***
## playlist_subgenrenew jack swing ***
## playlist_subgenrepermanent wave ***
## playlist_subgenrepop edm ***
## playlist_subgenrepost-teen pop
## playlist_subgenreprogressive electro house
## playlist_subgenrereggaeton ***
## playlist_subgenresouthern hip hop ***
## playlist_subgenretrap
## playlist_subgenretropical
## playlist_subgenreurban contemporary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14035 on 22683 degrees of freedom
## Residual deviance: 12271 on 22660 degrees of freedom
## AIC: 12319
##
## Number of Fisher Scoring iterations: 7
Observation:
All of the variables are significant at the 0.05 level except playlist_subgenrebig room, playlist_subgenreclassic rock, playlist_subgenreelectro house and playlist_subgenrehip pop. The coefficients for post-teen pop, progressive electro house, trap, tropical and urban contemporary are not defined at all, because those subgenres are collinear with their genres (the singularities reported above).
Among the genres, the largest coefficient, i.e. the biggest change in log-odds per unit, belongs to pop, followed by r&b and rap. Among the subgenres, the largest positive coefficients are pop edm, latin pop and reggaeton, while the largest negative ones are new jack swing, indie poptimism and neo soul. In the earlier linear model the biggest effects were on new jack swing, electropop and neo soul.
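Since these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. A quick sketch:
# coefficients are log-odds; exp() converts them to odds ratios
# e.g. exp(4.9005) for playlist_genrepop is about 134, i.e. the odds of being
# popular are roughly 134 times the baseline genre's odds, holding subgenre fixed
round(exp(coef(spotify_glm0_cat)), 3)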
pred.glm0_cat.train <- predict(spotify_glm0_cat,type="response")
#ROC curve
library(ROCR)
pred <- prediction(pred.glm0_cat.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7593949
pred.glm0.test_cat<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred <- prediction(pred.glm0.test_cat, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7539842
plot(perf, colorize=TRUE)
Observation: For the training set we get an AUC of 0.7594 and for the testing set 0.7540.
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return a value when you write R functions
} # end of the function
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0_cat.train, pcut = p.seq[i])
} # end of the loop
#cbind(p.seq, cost)
# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)
## [1] 0.17 0.18
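To see how flat the cost curve is around these values, we can plot the cost against the candidate cut-offs:
# visualize the weighted cost over all candidate cut-offs
plot(p.seq, cost, type = "l", xlab = "p-cut", ylab = "weighted cost")
abline(v = optimal.pcut.glm0, lty = 2)  # mark the optimal cut-off(s)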
The optimal pcut is tied between 0.17 and 0.18. Now we will get MR, FPR, FNR and cost to compare the models.
# step 1. get binary classification; the cost is tied, so use the first optimal cut-off
optimal.pcut.glm0 <- optimal.pcut.glm0[1]
class.glm0.train.cat<- (pred.glm0_cat.train>optimal.pcut.glm0)*1
# get values
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.cat)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.cat==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.cat==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.cat, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.151340151648739"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0982260024301337"
print(paste0("FNR:",FNR))
## [1] "FNR:0.669511616880038"
print(paste0("cost:",cost))
## [1] "cost:0.400326221125022"
# step 0. obtain predicted values on testing data (we actually did this already, but let's do it again)
pred.glm0.test<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.cat <- (pred.glm0.test>optimal.pcut.glm0)*1
#Get values
MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.cat)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.cat==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.cat==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.cat, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.157087447108604"
print(paste0("FPR:",FPR))
## [1] "FPR:0.104741546832491"
print(paste0("FNR:",FNR))
## [1] "FNR:0.669201520912547"
print(paste0("cost:",cost))
## [1] "cost:0.405324400564175"
# create matrices with 2 columns and 5 rows
lg_num <- matrix(c('0.6856', '0.6610', '0.1534', '0.1523', '0.0896', '0.087', '0.7757', '0.7908', '0.4419', '0.4457'), ncol=2, byrow=TRUE)
lg_cat <- matrix(c('0.7593', '0.7539', '0.1513', '0.1571', '0.098', '0.1047', '0.6695', '0.6692', '0.4003', '0.4053'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(lg_num) = c('Training_num','Testing_num')
rownames(lg_num) <- c('AUC', 'MR', 'FPR', 'FNR', 'Cost' )
colnames(lg_cat) = c('Training_cat','Testing_cat')
rownames(lg_cat) <- c('AUC', 'MR', 'FPR', 'FNR', 'Cost' )
# assign to table
lg1_results=as.table(lg_num)
lg2_results=as.table(lg_cat)
# display
#lg1_results
#lg2_results
library(knitr)
kable(lg1_results)
| Training_num | Testing_num | |
|---|---|---|
| AUC | 0.6856 | 0.6610 |
| MR | 0.1534 | 0.1523 |
| FPR | 0.0896 | 0.087 |
| FNR | 0.7757 | 0.7908 |
| Cost | 0.4419 | 0.4457 |
kable(lg2_results)
| Training_cat | Testing_cat | |
|---|---|---|
| AUC | 0.7593 | 0.7539 |
| MR | 0.1513 | 0.1571 |
| FPR | 0.098 | 0.1047 |
| FNR | 0.6695 | 0.6692 |
| Cost | 0.4003 | 0.4053 |
Observations: We fit two logistic regression models to predict track_popularity: one with the numerical features and one with the categorical features. AUC was better for the categorical model on both the training and testing data. MR on testing was slightly better in the first model, while MR on training was slightly better in the second. Overall, the model with categorical variables predicted better than the one with numerical variables.
Something we like about this model is that the accuracy is not too bad and the MR is relatively small in all models. We prefer a lower FP rate, meaning fewer songs are predicted as popular when they are not. However, the FN rate is very high, meaning many songs are predicted as not popular when they actually are.
We will first make a new data set without the categorical variables, then convert our track_popularity variable into two levels.
spotify_new = subset(spotify, select = -c(track_album_release_date, track_id, track_name, track_artist, track_album_id, track_album_name, playlist_genre, playlist_name, playlist_subgenre, playlist_id) )
spotify_new$track_popularity <- ifelse(spotify_new$track_popularity >=70, 1 , 0)
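Before fitting, it is worth checking how imbalanced the binary response is, since that drives the class-weight choices below:
# proportion of popular (1) vs non-popular (0) songs; popular songs are a small minority
prop.table(table(spotify_new$track_popularity))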
#splitting the data
set.seed(2023)
index <- sample(1:nrow(spotify_new),nrow(spotify_new)*0.80)
spotify_svm_train <- spotify_new[index,]
spotify_svm_test <- spotify_new[-index,]
#fitting the data without class-weighted cost
library(e1071)
# note: the train/test split above keeps the numeric 0/1 response;
# the svm() formulas below apply as.factor() themselves
spotify_new$track_popularity<- as.factor(spotify_new$track_popularity)
str(spotify_new)
## 'data.frame': 28356 obs. of 13 variables:
## $ track_popularity: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
#svm model
spotify_svm = svm(as.factor(track_popularity) ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = spotify_svm_train, kernel = 'linear')
# predictions on the train data
pred_spotify_train = predict(spotify_svm, spotify_svm_train)
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_svm_train$track_popularity ,
pred = pred_spotify_train)
Cmatrix_train
## pred
## true 0 1
## 0 20575 0
## 1 2109 0
#Mis-classification rate
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.09297302
# Predictions on the test data
pred_spotify_test <- predict(spotify_svm, spotify_svm_test)
# Confusion matrix to evaluate the model on test data
Cmatrix_test = table(true = spotify_svm_test$track_popularity,
pred = pred_spotify_test)
Cmatrix_test
## pred
## true 0 1
## 0 5146 0
## 1 526 0
#Mis-classification rate
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.09273625
Observations: The confusion matrix for the training data shows 20575 true negatives, 0 false positives, 2109 false negatives and 0 true positives. The misclassification rate is 0.0930.
The confusion matrix for the testing data shows 5146 true negatives, 0 false positives, 526 false negatives and 0 true positives. The misclassification rate is 0.0927.
As we can see, this model is not working well, since no songs are predicted as popular. Therefore we will try different kernels and class weights to see whether we can get a better model.
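One way to choose the class weight is simply to try a few values and watch how the training confusion matrix changes. A sketch of this idea (train_f, fit_w and pred_w are our own names, and refitting an SVM for each weight is slow on a dataset this size):
# make a copy with a factor response so svm() treats this as classification
train_f <- spotify_svm_train
train_f$track_popularity <- as.factor(train_f$track_popularity)
for (w in c(3, 5, 7)) {
  fit_w <- svm(track_popularity ~ ., data = train_f, kernel = 'linear',
               class.weights = c("0" = 1, "1" = w))
  pred_w <- predict(fit_w, train_f)
  cat("weight", w, "training MR:", round(mean(pred_w != train_f$track_popularity), 4), "\n")
}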
Now we are going to fit an SVM model with class weights using the training data.
spotify_svm_asymmetric = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear', class.weights = c("0" = 1, "1" = 7))
# predictions for train data
pred_spotify_train <- predict(spotify_svm_asymmetric, spotify_svm_train)
#Confusion matrix for train data
Cmatrix_train_21 <- table(true =spotify_svm_train$track_popularity,
pred = pred_spotify_train)
print(Cmatrix_train_21)
## pred
## true 0 1
## 0 16364 4211
## 1 1240 869
#Mis-classification rate
1 - sum(diag(Cmatrix_train_21))/sum(Cmatrix_train_21)
## [1] 0.2403015
The confusion matrix of the predicted values on the training data shows 16364 TN, 4211 FP, 1240 FN and 869 TP.
The misclassification rate is 0.2403. This is the best model we could get with a linear kernel, but the MR is very high at 0.24.
Now we are going to evaluate the weighted SVM model on the testing data.
#predictions on the test data
pred_spotify_test <- predict(spotify_svm_asymmetric, spotify_svm_test)
#Confusion matrix
Cmatrix_test_22 <- table( true = spotify_svm_test$track_popularity, pred = pred_spotify_test)
print(Cmatrix_test_22)
## pred
## true 0 1
## 0 4051 1095
## 1 320 206
#Mis-classification rate
MR <- 1 - sum(diag(Cmatrix_test_22))/sum(Cmatrix_test_22)
MR
## [1] 0.2494711
The confusion matrix of the predicted values on the testing data shows 4051 TN, 1095 FP, 320 FN and 206 TP.
The MR is 0.2495, still very high.
Let's fit another model with a polynomial kernel and compare the two.
spotify_svm_asy_poly = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'polynomial', class.weights = c("0" = 1, "1" = 5))
# predictions for train data
pred_spotify_train_poly <- predict(spotify_svm_asy_poly, spotify_svm_train)
#Confusion matrix for train data
Cmatrix_train_23 <- table(true =spotify_svm_train$track_popularity,
pred = pred_spotify_train_poly)
print(Cmatrix_train_23)
## pred
## true 0 1
## 0 20085 490
## 1 1903 206
#Mis-classification rate
1 - sum(diag(Cmatrix_train_23))/sum(Cmatrix_train_23)
## [1] 0.1054929
#predictions on the test data
pred_spotify_test_asy <- predict(spotify_svm_asy_poly, spotify_svm_test)
#Confusion matrix
Cmatrix_test_24 <- table( true = spotify_svm_test$track_popularity, pred = pred_spotify_test_asy)
print(Cmatrix_test_24)
## pred
## true 0 1
## 0 5017 129
## 1 493 33
#Mis-classification rate
MR <- 1 - sum(diag(Cmatrix_test_24))/sum(Cmatrix_test_24)
MR
## [1] 0.1096615
Observations: The polynomial SVM with class weights is the best SVM model for this data set so far. We obtained some popular-song predictions (although not many), and its MR of 0.11 is lower than in the previous model. We also needed a smaller weight on class 1 than in the linear model.
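The kernel degree and cost could also be tuned more systematically with e1071's tune(), which cross-validates over a parameter grid. A sketch, reusing the factor-response copy train_f from the earlier weight sketch; the grid below is our own choice, and this is slow on the full training set:
# sketch: cross-validated grid search over polynomial degree and cost
tune_out <- tune(svm, track_popularity ~ ., data = train_f,
                 kernel = 'polynomial', class.weights = c("0" = 1, "1" = 5),
                 ranges = list(degree = 2:4, cost = c(0.1, 1, 10)))
summary(tune_out)  # reports the best parameter combination by CV error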
Now we are going to obtain the ROC and AUC for our SVM model. We will then use our output to compare it to the ROC and AUC we obtained in our logistic model.
#ROC and AUC using the train data
spotify_svm_prob = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear',
probability = TRUE)
pred_prob_train = predict(spotify_svm_prob,
newdata = spotify_svm_train,
probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, 2]
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5866711
The AUC is 0.5867.
#predictions on the testing data
pred_prob_test = predict(spotify_svm_prob,
newdata = spotify_svm_test,
probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]
#ROC
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity )
perf <- performance(pred, "tpr", "fpr")
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5698786
The AUC is 0.5699.
#refit the model with probability enabled
spotify_svm_asymmetric = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear', class.weights = c("0" = 1, "1" = 7),probability = TRUE)
#get predictions for training
pred_prob_train <- predict(spotify_svm_asymmetric, spotify_svm_train, probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, 2]
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6800004
#get predictions for testing
pred_prob_test <- predict(spotify_svm_asymmetric, spotify_svm_test, probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6587955
Observations: From our second model we got a better AUC, almost 10% higher, with scores of 0.68 on training and 0.66 on testing.
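To see the gap between in-sample and out-of-sample performance at a glance, the two ROC curves can be overlaid on one plot (a sketch; ROCR's plot method accepts add = TRUE):
# overlay training and testing ROC curves for the weighted linear SVM
perf_train <- performance(prediction(pred_prob_train, spotify_svm_train$track_popularity), "tpr", "fpr")
perf_test <- performance(prediction(pred_prob_test, spotify_svm_test$track_popularity), "tpr", "fpr")
plot(perf_train, col = "blue")
plot(perf_test, col = "red", add = TRUE)  # overlay the testing curve
legend("bottomright", legend = c("training", "testing"), col = c("blue", "red"), lty = 1)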
#refit the polynomial model with probability enabled
spotify_svm_asymmetric_POL = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'polynomial', class.weights = c("0" = 1, "1" = 5),probability = TRUE)
#get predictions for training
pred_prob_train <- predict(spotify_svm_asymmetric_POL, spotify_svm_train, probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, 2]
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7154458
#get predictions for testing
pred_prob_test <- predict(spotify_svm_asymmetric_POL, spotify_svm_test, probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6612042
From our last model we get the best training AUC (0.7154) but a somewhat lower testing AUC (0.6612).
# create matrix with 2 columns and 6 rows (values taken from the outputs above)
svm_results <- matrix(c('0.5867', '0.5699', '0.6800', '0.6588', '0.7154', '0.6612', '0.0930', '0.0927', '0.2403', '0.2495', '0.1055', '0.1097'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(svm_results) = c('Training_svm','Testing_svm')
rownames(svm_results) <- c('AUC1','AUC2', 'AUC3', 'MR1','MR2', 'MR3' )
# assign to table
svm=as.table(svm_results)
# display
kable(svm)
| Training_svm | Testing_svm | |
|---|---|---|
| AUC1 | 0.5867 | 0.5699 |
| AUC2 | 0.6800 | 0.6588 |
| AUC3 | 0.7154 | 0.6612 |
| MR1 | 0.0930 | 0.0927 |
| MR2 | 0.2403 | 0.2495 |
| MR3 | 0.1055 | 0.1097 |
Observations: We fit three different SVM models. The first, with no class weights and a linear kernel, did not predict a single popular song. The second was a linear model with class weights (a weight of 7 on class 1), and the third a polynomial model with a weight of 5 on class 1. The unweighted model has the lowest MR only because it never predicts the positive class; among the weighted models, the polynomial one has the lower MR (0.1097 on testing) and the best testing AUC (0.6612), and its confusion matrix makes sense. Overall, the third model performed best, and it also required a smaller class weight.
We will create a copy of the spotify data under a new name for this analysis.
spotify_dec <- spotify
spotify_dec$track_popularity <- ifelse(spotify_dec$track_popularity >=70, 1 , 0)
spotify_dec$mode<- as.factor(spotify_dec$mode)
spotify_dec$key<- as.factor(spotify_dec$key)
set.seed(2023)
index <- sample(1:nrow(spotify_dec),nrow(spotify_dec)*0.60)
spotify_dec.train = spotify_dec[index,]
spotify_dec.test = spotify_dec[-index,]
# fit the model
fit_tree_dec <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train)
summary(fit_tree_dec)
## Call:
## rpart(formula = as.factor(track_popularity) ~ danceability +
## energy + speechiness + loudness + instrumentalness + mode +
## key + acousticness + liveness + valence + tempo, data = spotify_dec.train)
## n= 17013
##
## CP nsplit rel error xerror xstd
## 1 0 0 1 0 0
##
## Node number 1: 17013 observations
## predicted class=0 expected loss=0.09351672 P(node) =1
## class counts: 15422 1591
## probabilities: 0.906 0.094
#rpart.plot(fit_tree_dec, yesno=2, extra=4)
Obtain the predicted values for training data and confusion matrix.
# Make predictions on the train data
pred_tree_train <- predict(fit_tree_dec, spotify_dec.train, type="class")
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_dec.train$track_popularity,
pred = pred_tree_train)
Cmatrix_train
## pred
## true 0 1
## 0 15422 0
## 1 1591 0
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.09351672
Observation: This model gives a very small misclassification rate of 0.0935, but the confusion matrix for the training set shows that no songs are predicted as 1.
Obtain the predicted values for testing data and confusion matrix.
# Make predictions on the test data
pred_tree_test <- predict(fit_tree_dec, spotify_dec.test, type="class")
# Confusion matrix to evaluate the model on test data
Cmatrix_test = table(true = spotify_dec.test$track_popularity,
pred = pred_tree_test)
Cmatrix_test
## pred
## true 0 1
## 0 10299 0
## 1 1044 0
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.09203914
Observation:
Now that we have done the same on the testing set, we can see that the MR is even smaller at 0.092, but still no songs are predicted as 1. This model is not good. Therefore we will use asymmetric costs, giving different weights to the two error types, to see if this improves the model.
# We need to define a cost matrix first; keep the zeros on the diagonal
cost_matrix <- matrix(c(0, 1, # cost of 1 for FP
6, 0), # cost of 6 for FN
byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train, parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)
Since some observations now fall on the '1' side, we can visualize the tree in a plot. From the plot, energy is the variable that best predicts a popular song: according to the model, an energy value below 0.82 predicts 1, and a loudness above -5.1 also predicts 1.
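Which variables actually drive the splits can also be checked directly, since rpart stores an overall importance score for each variable on the fitted object:
# overall variable importance recorded by rpart for the weighted tree
round(fit_tree_asym$variable.importance, 2)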
#get predictions for training
pred_spotify_train <- predict(fit_tree_asym, spotify_dec.train,
type = "class")
#C matrix for training
Cmatrix_train_asy <- table( true = spotify_dec.train$track_popularity, pred = pred_spotify_train)
Cmatrix_train_asy
## pred
## true 0 1
## 0 13157 2265
## 1 1054 537
1 - sum(diag(Cmatrix_train_asy))/sum(Cmatrix_train_asy)
## [1] 0.1950861
Observation: Now that we have redistributed the weights, we finally get some predictions of 1 (popular): 537 TP, 2265 FP, 13157 TN and 1054 FN on the training set. However, we now have a bigger MR of 0.1951. Let's get the asymmetric predictions for the testing set.
#get predictions for testing
pred_tree_test <- predict(fit_tree_asym, spotify_dec.test, type = "class")
#C matrix for testing with more weight on "1"
Cmatrix_test_weight = table( true = spotify_dec.test$track_popularity, pred = pred_tree_test)
Cmatrix_test_weight
## pred
## true 0 1
## 0 8820 1479
## 1 707 337
1 - sum(diag(Cmatrix_test_weight))/sum(Cmatrix_test_weight)
## [1] 0.192718
Observation: For the testing set we get the following values: 337 TP, 1479 FP, 8820 TN and 707 FN. The MR is now slightly lower, at 0.1927. One weak point is that the number of songs correctly predicted as popular is small compared to the other cells. Now, we will obtain the AUC for the training and testing data.
# obtain predicted probability
pred_prob_train = predict(fit_tree_dec, spotify_dec.train, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_prob_train = pred_prob_train[,"1"]
library(ROCR)
pred <- prediction(pred_prob_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5
# obtain predicted probability
pred_prob_test = predict(fit_tree_dec, spotify_dec.test, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
pred_prob_test = pred_prob_test[,"1"] # replace "1" with the actual category if the response variable is a factor
#ROC
pred <- prediction(pred_prob_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5
Observation:
When we compute the AUC for both the training and testing sets for the unweighted model, we get the same value of 0.5, and the ROC curve is a diagonal line. This is because no songs are predicted as 1.
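We can confirm this directly: with no splits, the root node assigns the same probability to every song, so ranking by predicted probability is uninformative and the ROC curve collapses to the diagonal.
# the unweighted tree has no splits, so every song gets one identical probability
length(unique(pred_prob_train))  # expect 1
Now, let's get the AUC and ROC for the model with weights.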
# obtain predicted probability
pred_asym_train = predict(fit_tree_asym, spotify_dec.train, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_asym_train = pred_asym_train[,"1"]
library(ROCR)
pred <- prediction(pred_asym_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6613876
# obtain predicted probability
pred_asym_test = predict(fit_tree_asym, spotify_dec.test, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_asym_test = pred_asym_test[,"1"]
library(ROCR)
pred <- prediction(pred_asym_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6450638
Observations: The AUC for the weighted tree model is still not very good (0.6614 on training, 0.6451 on testing), even though class 1 is heavily weighted.
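Before settling on this tree, one could also inspect rpart's complexity-parameter table and prune at the cp value with the smallest cross-validated error; this is the standard rpart pruning workflow, sketched here (best_cp and fit_tree_pruned are our own names):
# cross-validated error by complexity parameter, then prune at the best cp
printcp(fit_tree_asym)
best_cp <- fit_tree_asym$cptable[which.min(fit_tree_asym$cptable[, "xerror"]), "CP"]
fit_tree_pruned <- prune(fit_tree_asym, cp = best_cp)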
# create matrix with 2 columns and 4 rows
tree_results <- matrix(c('0.5', '0.5', '0.6614', '0.6451', '0.0935', '0.0920', '0.1951', '0.1927'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(tree_results) = c('Training_tree','Testing_tree')
rownames(tree_results) <- c('AUC1', 'AUC2', 'MR1', 'MR2' )
# assign to table
tree=as.table(tree_results)
kable(knn_MSE)
| Training MSE | Testing MSE | |
|---|---|---|
| knn5 | 194.90 | 687.02 |
| knn20 | 389.06 | 556.981 |
| knn500 | 517.75 | 524.14 |
| knn_un | 217.942 | 206.747 |
| knn_stand | 213.31 | 201.45 |
kable(lm_MSE)
| Training MSE | Testing MSE | |
|---|---|---|
| LM1 | 529.4 | 527.51 |
| LM2 | 486.43 | 485.8 |
| LM3 | 572.16 | 549.2 |
| LM4 | 572.03 | 549.59 |
kable(tree)
| Training_tree | Testing_tree | |
|---|---|---|
| AUC1 | 0.5 | 0.5 |
| AUC2 | 0.6614 | 0.6451 |
| MR1 | 0.0935 | 0.0920 |
| MR2 | 0.1951 | 0.1927 |
kable(lg1_results)
| Training_num | Testing_num | |
|---|---|---|
| AUC | 0.6856 | 0.6610 |
| MR | 0.1534 | 0.1523 |
| FPR | 0.0896 | 0.087 |
| FNR | 0.7757 | 0.7908 |
| Cost | 0.4419 | 0.4457 |
kable(lg2_results)
| Training_cat | Testing_cat | |
|---|---|---|
| AUC | 0.7593 | 0.7539 |
| MR | 0.1513 | 0.1571 |
| FPR | 0.098 | 0.1047 |
| FNR | 0.6695 | 0.6692 |
| Cost | 0.4003 | 0.4053 |
kable(svm)
| Training_svm | Testing_svm | |
|---|---|---|
| AUC1 | 0.587 | 0.5909 |
| AUC2 | 0.6800 | 0.658 |
| AUC3 | 0.715 | 0.66 |
| MR1 | 0.0929 | 0.0928 |
| MR2 | 0.0930 | 0.0927 |
| MR3 | 0.105 | 0.1096 |
Observation: Now that we have computed the AUC-ROC for the weighted tree model, we get a best AUC of 0.6614 for the training set and 0.6451 for the testing set, so the weighted model is the better one. Although the MR is lower for the unweighted model, that does not make it better, because none of the observations are classified as popular. The weighted model has a bigger MR, but it does better classification than model 1. Overall, among the classification trees, the model with weights fits the data best.
After cleaning our data, we did EDA to predict which variables and genres would best indicate whether a song is popular or not. From the visualizations and the models we saw that:

- instrumentalness, speechiness and liveness are the variables that best predict track_popularity; loudness and energy were not the strong predictors one would expect.
- pop was the genre that predicted popularity the most, and dance pop was the best predictor among subgenres.

From Data Modeling we concluded the following:

- key was non-significant, along with 2 variables in the second model. Out-of-sample MSE was best in the model with categorical values (485.8) when comparing LM to KNN, and it improved when adding an interaction; but the R^2 was still too low for it to be a good model. With KNN we got a lower MSE (though with fewer variables), and that model still does not give good estimates or coefficients.
- When comparing Trees, SVM and logistic regression, the third SVM model had the best training AUC (0.715), but its class weights are skewed heavily toward 1 and the model is not easy to interpret visually. Trees had lower AUC scores than SVM and logistic regression. Logistic regression was the best model for this dataset because its AUC was almost as high as the SVM's (0.6856 and 0.6610) with only a slightly higher MR; moreover, we did not have to assign any class weights, and its coefficients and estimates are interpretable.
Our prediction was that the linear model and the Classification Tree would be the best models. However, Logistic Regression performed better than the Tree, and we did not have to put in any weights. The linear model's MSE was too high and its R^2 too low for it to be a good model. Therefore, Logistic Regression is the best model for this dataset.