The Spotify dataset comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. Kaylin Pavlik had a recent blogpost using the audio features to explore and classify songs. She used the spotifyr package to collect about 5000 songs from 6 main categories.
The data contains general metadata about songs from Spotify’s API: each song’s popularity plus parameters such as acousticness, danceability, energy, speechiness, valence, key, and its genre and subgenre.
With this analysis we are interested in how track popularity is influenced by the other variables.
The plan is to analyze the relationship between popularity and the different features of a song in order to predict a song’s future popularity. We plan on performing data preparation, EDA and modelling, using models such as *linear regression, KNN, logistic regression, SVM and tree models*, and comparing which one performs better at predicting whether a song is going to be popular or not.
This is mainly beneficial for marketing to Spotify customers and improving their experience while using Spotify. Spotify itself would also be able to provide more accurate predictions of a new song’s potential popularity even before its release.
library(tidyverse) #It assists with data import, tidying, manipulation, and data visualization.
library(ggplot2) # package for producing statistical, or data, graphics
library(kknn) # to perform k-nearest neighbor classification
library(corrplot) # graphical display of a correlation matrix, confidence interval
library(readr) # provides a fast way to read rectangular data
library(rpart) #implements the classification and regression tree algorithm (CART)
library(rpart.plot) #An Enhanced Plotting Package for rpart
spotify <- read.csv("spotify.csv")
We are going to examine our data in order to see how we can best clean it for further analysis.
### Checking dimension of Data
dim(spotify)
## [1] 32833 23
The original dataset contains 32833 rows and 23 columns.
##### First rows
# show the first six rows
head(spotify)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
#### Checking column names
names(spotify)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
Now, we will look at the structure, summary, data types, duplicates and missing values, and then clean the data.
##### Structure of data
### Checking structure of Data
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Observations: This shows the structure of the data. We can see that track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre and playlist_subgenre are character variables. On the other hand, track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_ms are numeric. We need to change track_album_release_date to a Date variable. We will also change playlist_genre to a factor, for future plotting.
# Modifying Data Types
spotify$track_album_release_date<- as.Date(spotify$track_album_release_date)
spotify$playlist_genre<-as.factor(spotify$playlist_genre)
#summary statistics
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
##
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Min. :1957-01-01
## Class :character Class :character 1st Qu.:2010-12-04
## Mode :character Mode :character Median :2017-01-27
## Mean :2012-09-09
## 3rd Qu.:2019-05-16
## Max. :2020-01-29
## NA's :1886
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 edm :6043 Length:32833
## Class :character Class :character latin:5155 Class :character
## Mode :character Mode :character pop :5507 Mode :character
## r&b :5431
## rap :5746
## rock :4951
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
##
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
##
Observations: This function displays the Min, Q1, Median, Mean, Q3 and Max of each variable. We can already see that there are probably some outliers and that some variables have very large maxima (duration_ms has a Max of 517810; tempo has a Max of 239.44). We will do some truncation, winsorization or standardization, to see how it affects the model.
# let's look at some tables for categorical variables
table(spotify$playlist_genre)
##
## edm latin pop r&b rap rock
## 6043 5155 5507 5431 5746 4951
table(spotify$playlist_subgenre)
##
## album rock big room classic rock
## 1065 1206 1296
## dance pop electro house electropop
## 1298 1511 1408
## gangster rap hard rock hip hop
## 1458 1485 1322
## hip pop indie poptimism latin hip hop
## 1256 1672 1656
## latin pop neo soul new jack swing
## 1262 1637 1133
## permanent wave pop edm post-teen pop
## 1105 1517 1129
## progressive electro house reggaeton southern hip hop
## 1809 949 1675
## trap tropical urban contemporary
## 1291 1288 1405
table(spotify$key)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 3454 4010 2827 913 2201 2680 2670 3352 2430 3027 2273 2996
table(spotify$mode)
##
## 0 1
## 14259 18574
Observations: We can see that mode is a binary variable. playlist_genre and playlist_subgenre are categorical variables describing the genre of music. The number of observations does not vary a lot across categories, which is good, because it will make their predictions easier to interpret; no category is drastically under-represented.
#Looking for duplicates
dups_id <- sum(duplicated(spotify$track_id))
print(dups_id)
## [1] 4477
We can see that a lot of songs are duplicated in this dataset: they share the same track_id. Therefore we will remove them before further analysis.
spotify_dups = spotify[duplicated(spotify$track_id),]
spotify = spotify[!duplicated(spotify$track_id),]
We moved the duplicated rows to a separate dataset called spotify_dups and removed them from the working dataset.
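For reference, the same deduplication can be written with dplyr, which reads a little more clearly (a sketch; spotify_nodup is an illustrative name, not used below):
# keep only the first occurrence of each track_id (equivalent to the base-R subsetting above)
spotify_nodup <- spotify %>%
distinct(track_id, .keep_all = TRUE)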
#looking for missing values
sum(is.na(spotify))
## [1] 1693
There are 1693 missing values in this dataset, most of them in track_album_release_date (release dates that could not be parsed as full dates became NA). However, we will not remove these rows, since they might still be important for the analysis.
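As a quick check, we can tally the missing values per column to confirm where they are concentrated:
# count missing values per column
colSums(is.na(spotify))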
We will plot some variables to observe their skewness and distribution, and interpret them.
histograms <- names(spotify)[c(4,12:23)]
songs1 <- spotify %>%
select(all_of(histograms)) %>%
pivot_longer(cols = all_of(histograms))
songs1 %>%
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~name, ncol = 5, scales = 'free') +
labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
theme_void()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Observations:
- Duration and valence are roughly normally distributed.
- Danceability, energy and loudness are left-skewed.
- Acousticness, liveness and speechiness are right-skewed.
- Key and mode are categorical variables.
- track_popularity peaks around 50-60 and is roughly normally distributed. There are a lot of zero values, possibly missing values. Very few songs have a popularity above 90.
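To quantify the spike at zero noted above, a quick check (a sketch; the counts depend on the deduplicated data):
# number and share of tracks with a popularity of exactly zero
sum(spotify$track_popularity == 0)
mean(spotify$track_popularity == 0)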
ggplot(spotify,aes(x = playlist_genre, y = track_popularity, fill = playlist_genre)) +geom_boxplot()
ggplot(spotify,aes(x = playlist_subgenre, y = track_popularity, fill = playlist_subgenre)) +geom_boxplot()
Observations:
- We can see that pop is the most popular genre, followed by latin and rock.
- Among the subgenre playlists, post-teen pop is the most popular, followed by dance pop and permanent wave.
Now we will plot scatterplots comparing the different parameters to popularity, to see which genre is most popular on each variable and, among popular songs, which parameter best predicts popularity.
# new dataset with some variables
feature_names <- names(spotify)[c(12:13,15,17:19,23)]
# new dataset displaying only these variables for the 500 most popular songs
songs <- spotify %>%
arrange(desc(track_popularity)) %>%
head(n = 500) %>%
pivot_longer(cols = all_of(feature_names))
#plotting
songs %>%
ggplot(aes(x = name, y = value)) +
geom_jitter(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 4, scales = 'free') +
labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
theme(axis.text.y = element_blank())
Observation:
- These represent the 500 most popular songs.
- Higher danceability, energy and loudness usually mean more popularity.
- Lower speechiness, acousticness and instrumentalness mean more popularity.
- An average to low duration means more popularity.
- From these visualizations, since most of the variables are numerical, we could consider linear regression, decision trees and SVM to be the best models. After converting track_popularity into a dummy variable we could also use other models such as logistic regression.
ggplot(data = songs1) +
geom_boxplot(aes(y = value)) +
facet_wrap(~name, nrow = 4, scales = "free") +
coord_flip() +
ggtitle("Outlier analysis", subtitle = "For different song attributes") +
theme_void()
Observations:
From the boxplots we can observe that a lot of variables (danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, duration) have outliers. Removing them outright would influence the analysis a lot.
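To back this up with numbers, we can count outliers per variable with the usual 1.5*IQR rule (a sketch; the rule is a convention, not part of the analysis above):
# count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each audio feature
count_outliers <- function(x) {
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(spotify[, c(12:13, 15, 17:21, 23)], count_outliers)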
spotify_copy <- spotify
Now, we will truncate danceability, energy, speechiness, acousticness, instrumentalness and liveness.
# truncation danceability, energy, speechiness, acousticness, instrumentalness and liveness
spotify_copy$danceability[spotify_copy$danceability <= 0.28] <- 0.28
spotify_copy$energy[spotify_copy$energy <= 0.2] <- 0.2
spotify_copy$speechiness[spotify_copy$speechiness >= 0.27] <- 0.27
spotify_copy$acousticness[spotify_copy$acousticness >= 0.6] <- 0.6
spotify_copy$instrumentalness[spotify_copy$instrumentalness >= 0.015] <- 0.015
spotify_copy$liveness[spotify_copy$liveness >= 0.4] <- 0.4
Now, we will winsorize loudness, tempo and duration.
# winsorization loudness
# Calculate the 5th and 95th percentiles for 'loudness'
lower_bound_loudness <- quantile(spotify_copy$loudness, 0.05, na.rm = TRUE)
upper_bound_loudness <- quantile(spotify_copy$loudness, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$loudness[spotify_copy$loudness < lower_bound_loudness] <- lower_bound_loudness
spotify_copy$loudness[spotify_copy$loudness > upper_bound_loudness] <- upper_bound_loudness
# winsorization tempo
# Calculate the 5th and 95th percentiles for 'tempo'
lower_bound_tempo <- quantile(spotify_copy$tempo, 0.05, na.rm = TRUE)
upper_bound_tempo <- quantile(spotify_copy$tempo, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$tempo[spotify_copy$tempo < lower_bound_tempo] <- lower_bound_tempo
spotify_copy$tempo[spotify_copy$tempo > upper_bound_tempo] <- upper_bound_tempo
#winsorize duration
# Calculate the 5th and 95th percentiles for 'duration'
lower_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.05, na.rm = TRUE)
upper_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$duration_ms[spotify_copy$duration_ms < lower_bound_duration_ms] <- lower_bound_duration_ms
spotify_copy$duration_ms[spotify_copy$duration_ms > upper_bound_duration_ms] <- upper_bound_duration_ms
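The three winsorization blocks above repeat the same pattern; a small helper function would make this reusable (a sketch with the same 5th/95th percentile bounds):
# clamp a vector to its 5th and 95th percentiles
winsorize <- function(x, lower = 0.05, upper = 0.95) {
bounds <- quantile(x, c(lower, upper), na.rm = TRUE)
pmin(pmax(x, bounds[1]), bounds[2])
}
# e.g. spotify_copy$tempo <- winsorize(spotify_copy$tempo)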
Now we have dealt with the outliers in those variables. Note that the truncation and winsorization were applied to spotify_copy; the table below shows the first rows of spotify. We have also handled missing values, duplicates, and data types. Let's create a table to see what the data looks like.
knitr::kable(head(spotify[, 1:23]), "simple")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
Observations:
- We first looked at the descriptive statistics, which showed us the mean, median, minimum, maximum, and quartiles of each variable.
- We found duplicates and deleted them.
- We identified missing values but kept them, since they may still carry information.
- Histograms helped us look at the distribution of variables, and since a lot of them were numeric we concluded that a linear regression might be a good model to predict track_popularity.
- Scatterplots helped us see which parameters might influence our dependent variable the most. These were instrumentalness, speechiness and loudness, where the data was most dense.
- Boxplots against popularity helped us identify the most popular genres: pop, latin and rock. Popular subgenres were post-teen pop, dance pop and permanent wave.
First, we will select the variables that we will use for the correlation matrix. We will recode playlist_genre to numeric codes so that it can be included in the correlation matrix.
matrix <- spotify %>% select(track_popularity, danceability, energy, key, loudness, mode, speechiness,acousticness, instrumentalness, liveness, valence, tempo, duration_ms, playlist_genre)
matrix$playlist_genre<-recode(matrix$playlist_genre, 'pop'=0, 'r&b'=1, 'rock'=2, 'latin'=3, 'edm'=4, 'rap'=5)
cor_matrix <- cor(matrix)
print(cor_matrix)
## track_popularity danceability energy key
## track_popularity 1.000000000 0.046596507 -0.10362202 -0.0081442988
## danceability 0.046596507 1.000000000 -0.08143602 0.0070203133
## energy -0.103622023 -0.081436016 1.00000000 0.0128766164
## key -0.008144299 0.007020313 0.01287662 1.0000000000
## loudness 0.037285154 0.015297214 0.68213774 -0.0005315332
## mode 0.016275225 -0.055211662 -0.00431074 -0.1760953351
## speechiness 0.005205509 0.183453077 -0.02900840 0.0231150971
## acousticness 0.091725358 -0.028878237 -0.54588619 0.0041975410
## instrumentalness -0.124431070 -0.002266695 0.02381792 0.0073927120
## liveness -0.052773029 -0.127002362 0.16370922 0.0021365502
## valence 0.022581329 0.333728627 0.14971046 0.0216180479
## tempo 0.004446180 -0.184575167 0.15154451 -0.0103554421
## duration_ms -0.139681783 -0.087560915 0.01758400 0.0188095544
## playlist_genre -0.073434088 0.190073891 0.08828930 0.0100195621
## loudness mode speechiness acousticness
## track_popularity 0.0372851540 0.0162752246 0.005205509 0.091725358
## danceability 0.0152972137 -0.0552116618 0.183453077 -0.028878237
## energy 0.6821377425 -0.0043107400 -0.029008396 -0.545886191
## key -0.0005315332 -0.1760953351 0.023115097 0.004197541
## loudness 1.0000000000 -0.0176877795 0.012981015 -0.371602357
## mode -0.0176877795 1.0000000000 -0.059650521 0.006760044
## speechiness 0.0129810147 -0.0596505208 1.000000000 0.024945425
## acousticness -0.3716023568 0.0067600443 0.024945425 1.000000000
## instrumentalness -0.1543090413 -0.0057660749 -0.107959332 -0.003100502
## liveness 0.0819270381 -0.0002748805 0.059342894 -0.074539622
## valence 0.0495101626 -0.0030881898 0.064700922 -0.018998564
## tempo 0.0967061101 0.0167097090 0.032728605 -0.114335210
## duration_ms -0.1045846289 0.0128129792 -0.098435680 -0.094143218
## playlist_genre 0.0715914257 -0.0505853182 0.298193374 -0.076573192
## instrumentalness liveness valence tempo
## track_popularity -0.124431070 -0.0527730291 0.02258133 0.004446180
## danceability -0.002266695 -0.1270023619 0.33372863 -0.184575167
## energy 0.023817924 0.1637092244 0.14971046 0.151544510
## key 0.007392712 0.0021365502 0.02161805 -0.010355442
## loudness -0.154309041 0.0819270381 0.04951016 0.096706110
## mode -0.005766075 -0.0002748805 -0.00308819 0.016709709
## speechiness -0.107959332 0.0593428937 0.06470092 0.032728605
## acousticness -0.003100502 -0.0745396224 -0.01899856 -0.114335210
## instrumentalness 1.000000000 -0.0085048515 -0.17416318 0.021484050
## liveness -0.008504851 1.0000000000 -0.01992507 0.022026642
## valence -0.174163182 -0.0199250679 1.00000000 -0.025135288
## tempo 0.021484050 0.0220266420 -0.02513529 1.000000000
## duration_ms 0.059077441 0.0075705280 -0.03332403 -0.001695701
## playlist_genre 0.147880944 0.0508644474 -0.06999690 0.050109733
## duration_ms playlist_genre
## track_popularity -0.139681783 -0.07343409
## danceability -0.087560915 0.19007389
## energy 0.017584005 0.08828930
## key 0.018809554 0.01001956
## loudness -0.104584629 0.07159143
## mode 0.012812979 -0.05058532
## speechiness -0.098435680 0.29819337
## acousticness -0.094143218 -0.07657319
## instrumentalness 0.059077441 0.14788094
## liveness 0.007570528 0.05086445
## valence -0.033324030 -0.06999690
## tempo -0.001695701 0.05010973
## duration_ms 1.000000000 -0.07577263
## playlist_genre -0.075772625 1.00000000
corrplot(cor_matrix, method="color", type="upper", order="hclust")
The variables most correlated with track_popularity are, in order: 1. duration_ms (-0.1396); 2. instrumentalness (-0.1244); 3. energy (-0.1036); 4. acousticness (0.0917); 5. playlist_genre (-0.0734). From the scatterplots we had expected instrumentalness, speechiness and loudness to show the strongest correlations, since they had the most points from the most popular songs within the smallest interval. Among the other variables, there is a high positive correlation between energy and loudness (0.682), and a negative correlation between acousticness and energy (-0.546).
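The ranking above can also be read off programmatically from the correlation matrix:
# correlations with track_popularity, sorted by absolute value
pop_cor <- cor_matrix[, "track_popularity"]
sort(abs(pop_cor[names(pop_cor) != "track_popularity"]), decreasing = TRUE)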
Our dependent variable is track_popularity, and we want to see which variables best predict it. Therefore we will first build a linear regression model to see how well it predicts track_popularity.
# Set the seed for reproducibility
set.seed(2023)
# Randomly sample row indices for a 70/30 train/test split
train_indices <- sample(1:NROW(spotify),NROW(spotify)*0.70)
# Create the training set
train_data <- spotify[train_indices, ] #everything before comma is row selector and after comma is column selector
# Create the testing set
test_data <- spotify[-train_indices, ]
# Train the linear regression model, comparing popularity to rest of numeric parameters
lm_model <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + key +
## loudness + mode + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.757 -17.360 2.935 18.119 60.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.684e+01 2.045e+00 32.682 < 2e-16 ***
## danceability 4.170e+00 1.288e+00 3.237 0.00121 **
## energy -2.318e+01 1.461e+00 -15.864 < 2e-16 ***
## key 1.398e-02 4.595e-02 0.304 0.76088
## loudness 1.132e+00 7.802e-02 14.512 < 2e-16 ***
## mode 7.590e-01 3.357e-01 2.261 0.02378 *
## speechiness -7.397e+00 1.657e+00 -4.464 8.09e-06 ***
## acousticness 4.934e+00 8.895e-01 5.547 2.95e-08 ***
## instrumentalness -9.426e+00 7.437e-01 -12.676 < 2e-16 ***
## liveness -4.420e+00 1.077e+00 -4.104 4.08e-05 ***
## valence 1.997e+00 7.861e-01 2.540 0.01110 *
## tempo 2.808e-02 6.261e-03 4.485 7.33e-06 ***
## duration_ms -4.277e-05 2.726e-06 -15.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.03 on 19836 degrees of freedom
## Multiple R-squared: 0.06003, Adjusted R-squared: 0.05946
## F-statistic: 105.6 on 12 and 19836 DF, p-value: < 2.2e-16
The model suggests that all the variables are significant except key, since it is the only variable with a p-value higher than 0.05. This model has an R^2 of 0.06003, which means that only 6% of the data’s variability can be explained by the regression model, indicating that this is not a good model.
Among the coefficients, energy (-23.18), instrumentalness (-9.43) and speechiness (-7.40) have the largest effects per one-unit increase. duration_ms has a tiny coefficient (-4.28e-05), but since it is measured in milliseconds the effect still adds up over the length of a track. loudness, in contrast, has a modest positive coefficient (1.13), smaller than we expected from the plots.
We can compile the actual and predicted values and view the first few records.
# Create a data frame to compare actual and predicted values
comparison_df <- data.frame(Actual = train_data$track_popularity, lm_predicted =lm_model$fitted.values)
head(comparison_df)
## Actual lm_predicted
## 25119 0 33.94611
## 11393 24 40.51391
## 29550 39 41.34777
## 2064 63 38.75230
## 8416 40 35.35537
## 32344 30 28.37513
We will look at the In-Sample MSE to evaluate the model.
lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 529.94"
We got an MSE of 529.94, which is high for this model; a value close to 0 would be ideal.
Now we will look at the Out-Of-Sample MSE to evaluate the model.
# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate out-of-sample MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 527.51"
The MSE for the testing set is slightly lower, at 527.51.
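Since squared errors are hard to interpret, we can also report the RMSE, which is on the same 0-100 scale as track_popularity (roughly 23 points here):
# root mean squared error, in popularity points
print(paste("Training RMSE:", round(sqrt(lm_mse_train), 2)))
print(paste("Testing RMSE:", round(sqrt(lm_mse_test), 2)))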
Now, we will fit a linear model with the categorical variables to see how they predict popularity.
# Train the linear regression model, comparing popularity to other character parameters
lm_model_ch <- lm(track_popularity ~ playlist_genre + playlist_subgenre, data = train_data)
summary(lm_model_ch)
##
## Call:
## lm(formula = track_popularity ~ playlist_genre + playlist_subgenre,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.418 -16.339 2.905 16.931 60.098
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9894 0.6853 35.005 < 2e-16
## playlist_genrelatin 18.0795 1.0338 17.489 < 2e-16
## playlist_genrepop 31.4286 1.0654 29.501 < 2e-16
## playlist_genrer&b 22.2155 1.0313 21.541 < 2e-16
## playlist_genrerap 24.4461 1.0227 23.902 < 2e-16
## playlist_genrerock 13.5668 1.0741 12.631 < 2e-16
## playlist_subgenrebig room 4.4078 1.0705 4.117 3.85e-05
## playlist_subgenreclassic rock -0.2592 1.1543 -0.225 0.822309
## playlist_subgenredance pop -3.0791 1.1009 -2.797 0.005164
## playlist_subgenreelectro house 9.7948 0.9814 9.981 < 2e-16
## playlist_subgenreelectropop -16.3501 1.0989 -14.878 < 2e-16
## playlist_subgenregangster rap -15.9361 1.0535 -15.127 < 2e-16
## playlist_subgenrehard rock -4.2979 1.1230 -3.827 0.000130
## playlist_subgenrehip hop 4.6597 1.0591 4.400 1.09e-05
## playlist_subgenrehip pop -1.8091 1.2060 -1.500 0.133599
## playlist_subgenreindie poptimism -14.2180 1.0637 -13.367 < 2e-16
## playlist_subgenrelatin hip hop -9.5706 1.0786 -8.873 < 2e-16
## playlist_subgenrelatin pop 4.3781 1.1083 3.950 7.83e-05
## playlist_subgenreneo soul -16.3026 1.0344 -15.760 < 2e-16
## playlist_subgenrenew jack swing -19.9120 1.1214 -17.757 < 2e-16
## playlist_subgenrepermanent wave 14.2161 1.1811 12.036 < 2e-16
## playlist_subgenrepop edm 12.5587 1.0861 11.563 < 2e-16
## playlist_subgenrepost-teen pop NA NA NA NA
## playlist_subgenreprogressive electro house NA NA NA NA
## playlist_subgenrereggaeton 4.2245 1.2535 3.370 0.000752
## playlist_subgenresouthern hip hop -13.6192 1.0064 -13.533 < 2e-16
## playlist_subgenretrap NA NA NA NA
## playlist_subgenretropical NA NA NA NA
## playlist_subgenreurban contemporary NA NA NA NA
##
## (Intercept) ***
## playlist_genrelatin ***
## playlist_genrepop ***
## playlist_genrer&b ***
## playlist_genrerap ***
## playlist_genrerock ***
## playlist_subgenrebig room ***
## playlist_subgenreclassic rock
## playlist_subgenredance pop **
## playlist_subgenreelectro house ***
## playlist_subgenreelectropop ***
## playlist_subgenregangster rap ***
## playlist_subgenrehard rock ***
## playlist_subgenrehip hop ***
## playlist_subgenrehip pop
## playlist_subgenreindie poptimism ***
## playlist_subgenrelatin hip hop ***
## playlist_subgenrelatin pop ***
## playlist_subgenreneo soul ***
## playlist_subgenrenew jack swing ***
## playlist_subgenrepermanent wave ***
## playlist_subgenrepop edm ***
## playlist_subgenrepost-teen pop
## playlist_subgenreprogressive electro house
## playlist_subgenrereggaeton ***
## playlist_subgenresouthern hip hop ***
## playlist_subgenretrap
## playlist_subgenretropical
## playlist_subgenreurban contemporary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.07 on 19825 degrees of freedom
## Multiple R-squared: 0.1372, Adjusted R-squared: 0.1362
## F-statistic: 137.1 on 23 and 19825 DF, p-value: < 2.2e-16
The R^2 is larger, meaning that about 13.6% of the variance can be explained by this model. It is still quite low. Among the genre coefficients, pop has the highest one (31.43), followed by rap (24.45) and r&b (22.22). Looking back at the plots, it looked like pop, latin and rock were the most popular. Among the subgenres, playlist_subgenrehip pop and playlist_subgenreclassic rock were the only estimated coefficients that were not significant (five further subgenre levels are NA because of singularities), yet in the boxplots above they are not the least popular subgenres. The most negative subgenre coefficients were new jack swing, followed by electropop and neo soul.
# In-sample MSE (categorical model)
lm_train_pred_ch <- mean((lm_model_ch$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_train_pred_ch, 2)))
## [1] "Training MSE for Linear Model: 486.43"
# Predict on testing data (categorical model)
lm_test_pred_ch <- predict(lm_model_ch, newdata = test_data)
## Warning in predict.lm(lm_model_ch, newdata = test_data): prediction from a
## rank-deficient fit may be misleading
# Calculate out-of-sample MSE
lm_mse_test_ch <- mean((lm_test_pred_ch - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test_ch, 2)))
## [1] "Testing MSE for Linear Model: 485.8"
The MSE of the model with the categorical variables is still very high, with a training MSE of 486.43 and a testing MSE of 485.8.
Now we will try adding an interaction term to see how it affects the fit.
lm_model_int <- lm(track_popularity ~ danceability + loudness + danceability*loudness, data = spotify)
summary(lm_model_int)
##
## Call:
## lm(formula = track_popularity ~ danceability + loudness + danceability *
## loudness, data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.912 -18.249 3.069 18.643 60.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.4207 1.4221 19.985 < 2e-16 ***
## danceability 20.2423 2.1859 9.261 < 2e-16 ***
## loudness -0.8238 0.1768 -4.660 3.18e-06 ***
## danceability:loudness 1.7837 0.2743 6.502 8.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.64 on 28352 degrees of freedom
## Multiple R-squared: 0.004993, Adjusted R-squared: 0.004887
## F-statistic: 47.42 on 3 and 28352 DF, p-value: < 2.2e-16
As we saw before, almost every variable is significantly associated with track popularity, so it is easy to find significant interactions. The interaction we added is indeed significant, but the R^2 is still far too low, now about 0.5%. The interaction coefficient of 1.784 means that for each one-unit increase in danceability, the slope of loudness in the regression equation increases by 1.784.
# In-sample MSE (the interaction model was fit on the full dataset, so compare against spotify)
lm_train_pred_int <- mean((lm_model_int$fitted.values - spotify$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_train_pred_int, 2)))
## [1] "Training MSE for Linear Model: 567.09"
# Predict on testing data
lm_test_pred_int <- predict(lm_model_int, newdata = test_data)
# Calculate out-of-sample MSE
lm_mse_test <- mean((lm_test_pred_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 554.33"
lm_model_no_int <- lm(track_popularity ~ danceability + loudness , data = spotify)
summary(lm_model_no_int)
##
## Call:
## lm(formula = track_popularity ~ danceability + loudness, data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.803 -18.306 2.989 18.645 59.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.38628 0.72256 50.357 < 2e-16 ***
## danceability 7.48487 0.96398 7.765 8.47e-15 ***
## loudness 0.28557 0.04629 6.170 6.93e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.66 on 28353 degrees of freedom
## Multiple R-squared: 0.003509, Adjusted R-squared: 0.003439
## F-statistic: 49.92 on 2 and 28353 DF, p-value: < 2.2e-16
If we fit the same model with no interaction, the R^2 is 0.0035. This means that only about 0.35% of the variance can be explained by this model, indicating that it is not a good model.
# In-sample MSE (this model was also fit on the full dataset)
lm_train_pred_no_int <- mean((lm_model_no_int$fitted.values - spotify$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_train_pred_no_int, 2)))
## [1] "Training MSE for Linear Model: 566.13"
# Predict on testing data
lm_test_pred_no_int <- predict(lm_model_no_int, newdata = test_data)
# Calculate out-of-sample MSE
lm_mse_test <- mean((lm_test_pred_no_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 555.23"
From the model with numeric variables, we could see that all variables except key were statistically significant at the 0.05 level, which means that a variation in them would influence track_popularity. The in-sample MSE was 529.94 and the out-of-sample MSE was 527.51. Then we fitted a linear model with the categorical variables genre and subgenre; all estimated coefficients besides playlist_subgenreclassic rock and playlist_subgenrehip pop were statistically significant at the 0.05 level. The in-sample MSE was 486.43 and the out-of-sample MSE was 485.8. Then we fitted a linear model with an interaction between danceability and loudness: the interaction model had an MSE of 567.09 on the training set and 554.33 on the testing set, while the model without the interaction had 566.13 and 555.23 respectively. All this indicates that the model that performed best was the one with the categorical variables genre and subgenre, suggesting that those variables do influence track_popularity and that it is a more trustworthy model.
While we thought from the plots that pop, latin and rock were the most popular genres, this model indicated that it was pop, rap and r&b. The numeric model also weighted the audio parameters somewhat differently than we expected from the plots. Finally, the model with the interaction performed slightly better on the testing set than the model without it.
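A small summary table, assembled from the MSEs reported above, makes the comparison easier to see at a glance:
# collect the reported training and testing MSEs in one data frame
mse_summary <- data.frame(
model = c("lm, numeric", "lm, categorical", "lm, interaction", "lm, no interaction"),
train_mse = c(529.94, 486.43, 567.09, 566.13),
test_mse = c(527.51, 485.80, 554.33, 555.23)
)
mse_summary[order(mse_summary$test_mse), ]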
spotify_knn_model <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 5)
# Predict on training data
knn_train_pred <- fitted.values(spotify_knn_model)
# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))
## [1] "In-Sample MSE for KNN: 194.903869196156"
# Predict on testing data
knn_model_test <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 5)
knn_test_pred <- fitted.values(knn_model_test)
# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))
## [1] "Out-of-Sample MSE for KNN: 687.024025645934"
spotify_knn_model_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 20)
# Predict on training data
knn_train_pred_20 <- fitted.values(spotify_knn_model_20)
# Calculate in-sample MSE manually
knn_train_mse_20 <- mean((train_data$track_popularity - knn_train_pred_20)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse_20))
## [1] "In-Sample MSE for KNN: 389.061411631355"
# Predict on testing data
knn_model_test_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 20)
knn_test_pred_20 <- fitted.values(knn_model_test_20)
# Calculate out-of-sample MSE manually
knn_test_mse_20 <- mean((test_data$track_popularity - knn_test_pred_20)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse_20))
## [1] "Out-of-Sample MSE for KNN: 556.981777608275"
We tried the linear regression and the KNN model to see which would be best for predicting track popularity, using the mean squared error to compare them. The MSE for the linear regression model with numeric variables was 529.94 in-sample and 527.51 out-of-sample. The KNN model with k = 5 had an MSE of 194.90 in-sample and 687.02 out-of-sample. Comparing the two, the KNN model performed better than the regression model on the training set and worse on the testing set, a sign of overfitting. We then fitted another KNN model with k = 20; its MSE was 389.06 for training and 556.98 for testing, a better out-of-sample result.
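Rather than trying k = 5 and k = 20 by hand, the kknn package can cross-validate over a range of k (a sketch; kmax and the kernel are choices we did not tune above):
# leave-one-out cross-validation over k = 1..30
knn_cv <- train.kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data, kmax = 30, kernel = "rectangular")
knn_cv$best.parameters # the k that minimizes the cross-validated error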
We did not use all of the variables in the data set, but we did use all of our continuous variables. We decided to go this route because we mostly wanted to explore track popularity and the relationship it has with the continuous variables in our data set.
Theoretically, the model that would fit the data best is the linear regression model because it is easier to interpret: you can see the coefficients and how the response changes with a one-unit increase in each variable. However, it also has to meet four assumptions (linearity, independence, normality, and homoscedasticity). We can run a diagnostics plot to see if our model meets them.
# diagnostic plot
par(mfrow = c(2, 2))
plot(lm_model)
From the diagnostic plot, we see that the model does not meet the four assumptions, which leads us to the conclusion that the linear regression model is not the best fit for this data.
Still, the regression model lets us interpret the coefficients and had a lower out-of-sample MSE than the KNN models above.
Since the MSE is not quite where we want it to be, we are going to do some feature scaling to see whether it has any impact on the KNN model. The variables removed from the feature set are key, mode, acousticness and tempo.
#Creating a standardized train knn model
train_new.x <- train_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]
k<- 5
stand_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5)
# Creating an unstandardized knn model
knn_unstand <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5, scale = FALSE)
#finding the MSE for our standardized model
#first we have to find the prediction of our standardized knn model
pred_stand <- fitted(stand_knn)
train_new.y <- train_new.x$track_popularity
#Standardized MSE
stand_mse <- mean((train_new.y - pred_stand)^2)
#Unstandardized MSE
pred_undstand <- fitted(knn_unstand)
unstand_mse <- mean((train_new.y - pred_undstand)^2)
#output
print(paste("Training MSE of standardized data :", stand_mse))
## [1] "Training MSE of standardized data : 206.7465001884"
print(paste("Training MSE of unstandardized data:", unstand_mse))
## [1] "Training MSE of unstandardized data: 217.941700903636"
The training MSE of the standardized data is 206.75 and the MSE of the unstandardized data is 217.94, so there is a fairly small difference between the two. Our original k = 5 model on the full feature set had a training MSE of 194.90, also a small difference compared to the feature-scaled models. This suggests that feature scaling helps a little here, but is not decisive on the training data.
Now let's evaluate on the testing data with the same variables removed.
#Creating a standardized test knn model
test_new.x <- test_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]
stand_test_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5)
# Creating an unstandardized knn model
knn_test_unstand <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5, scale = FALSE)
#finding the MSE for our standardized model
#first we have to find the prediction of our standardized knn model
pred_stand_test <- fitted(stand_test_knn)
test_new.y <- test_new.x$track_popularity
#Standardized MSE
test_stand_mse <- mean((test_new.y - pred_stand_test)^2)
#Unstandardized MSE
pred_undstand_test <- fitted(knn_test_unstand)
unstand_mse_test <- mean((test_new.y - pred_undstand_test)^2)
#output
print(paste("Testing MSE of standardized data :", test_stand_mse))
## [1] "Testing MSE of standardized data : 201.450244331452"
print(paste("Test MSE of unstandardized data:", unstand_mse_test))
## [1] "Test MSE of unstandardized data: 213.310812353227"
We fitted new KNN models with the reduced variable set and k = 5 on the training and testing sets, with unstandardized and standardized data. On the training set the standardized model had an MSE of 206.75 versus 217.94 unstandardized, so standardizing helps. On the testing set the standardized model was also better, 201.45 versus 213.31. Note, however, that the testing-set models above were trained and evaluated on the same testing split, so these are in-sample numbers; they are not directly comparable to our earlier out-of-sample MSE of 687.02. With that caveat, feature scaling appears to have a positive impact on our model.
From the feature scaling exercise, we conclude that the KNN model remains competitive with the linear regression model on our data.
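Before settling on that conclusion, a fairer out-of-sample check would train the reduced, standardized model on the training split and evaluate it on the testing split (a sketch; this result is not shown above):
# train on train_new.x, evaluate on test_new.x
stand_knn_oos <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = test_new.x, k = 5)
oos_mse <- mean((test_new.x$track_popularity - fitted(stand_knn_oos))^2)
print(paste("Out-of-sample MSE, standardized reduced model:", round(oos_mse, 2)))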
Let's recode track_popularity as a dummy variable: 1 when the popularity is 70 or above, and 0 when it is below 70; that is, 1 means popular and 0 means not popular. track_popularity will again be our dependent variable. (Note that the assignment below overwrites the earlier spotify_copy, so the logistic models use the un-truncated data.)
spotify_copy <- spotify
spotify_copy$track_popularity <- ifelse(spotify_copy$track_popularity >=70, 1 , 0)
set.seed(2023)
index <- sample(1:nrow(spotify_copy),nrow(spotify_copy)*0.80)
spotify_copy.train = spotify_copy[index,]
spotify_copy.test = spotify_copy[-index,]
spotify_copy.glm0<- glm(track_popularity~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, family=binomial, data=spotify_copy.train)
summary(spotify_copy.glm0)
##
## Call:
## glm(formula = track_popularity ~ danceability + energy + key +
## loudness + mode + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, family = binomial,
## data = spotify_copy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1331 -0.5000 -0.4123 -0.2795 4.2276
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.066e+00 3.085e-01 3.455 0.000551 ***
## danceability 3.319e-01 1.902e-01 1.745 0.081027 .
## energy -3.095e+00 2.169e-01 -14.269 < 2e-16 ***
## key -9.644e-03 6.523e-03 -1.478 0.139311
## loudness 2.076e-01 1.310e-02 15.843 < 2e-16 ***
## mode 1.286e-01 4.805e-02 2.676 0.007446 **
## speechiness -7.592e-01 2.401e-01 -3.162 0.001568 **
## acousticness 1.473e-01 1.228e-01 1.199 0.230523
## instrumentalness -3.071e+00 2.814e-01 -10.916 < 2e-16 ***
## liveness -4.376e-01 1.710e-01 -2.558 0.010518 *
## valence 4.779e-01 1.159e-01 4.125 3.71e-05 ***
## tempo 2.652e-03 8.573e-04 3.093 0.001980 **
## duration_ms -1.883e-06 4.619e-07 -4.076 4.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14035 on 22683 degrees of freedom
## Residual deviance: 13193 on 22671 degrees of freedom
## AIC: 13219
##
## Number of Fisher Scoring iterations: 7
Observation: As we can see from this model, energy, loudness, mode, speechiness, instrumentalness, liveness, valence, tempo and duration_ms are significant in predicting whether a song is going to be popular or not. As in the linear regression, key is not significant, and its coefficient (-9.64e-03) is tiny. The largest coefficient magnitudes belong to energy (-3.10) and instrumentalness (-3.07), followed by speechiness (-0.76).
pred.glm0.train <- predict(spotify_copy.glm0,type="response")
hist(pred.glm0.train)
Observation: This histogram shows the distribution of the predicted probabilities and illustrates the impact of choosing different cut-off probabilities. Choosing a small cut-off probability will result in many cases being predicted as 1, and choosing a large cut-off probability will result in few cases being predicted as 1.
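A quick tally shows how the number of tracks predicted as popular shrinks as the cut-off rises:
# predicted positives at a few candidate cut-offs
sapply(c(0.1, 0.2, 0.3, 0.5), function(p) sum(pred.glm0.train >= p))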
#ROC curve
library(ROCR)
pred <- prediction(pred.glm0.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
Observations: The ROC curve plots the true positive rate against the false positive rate across cut-offs. We want the curve to bow toward the top-left corner, with a true positive rate close to 1 at a low false positive rate.
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6855791
Observation: We got the AUC (Area Under the Curve). Again, we would like a number closer to 1, which would mean perfect prediction. Here it is 0.6856, which is still reasonable.
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")
pred <- prediction(pred.glm0.test, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6610358
Observations: On the testing set the AUC is a bit lower, with a score of 0.661.
In order to get the confusion matrix of true positive and true negative predictions, we will determine the optimal cut-off probability using a grid search. We will first use weights of 5 (for false negatives) and 1 (for false positives).
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # return the weighted misclassification cost
} # end of the function
Next, we will define a sequence of probabilities (we will search for the optimal cut-off over this sequence).
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
Then, we will calculate the cost for each probability in the sequence p.seq.
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0.train, pcut = p.seq[i])
} # end of the loop
#cbind(p.seq, cost)
Last, we will plot the cost against p.seq and find the cut-off that gives the minimum cost.
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, cost)
# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)
## [1] 0.16
Observation: With weights of 5 and 1, we got 0.16 as the optimal cut-off probability.
# step 1. get binary classification
class.glm0.train.opt<- (pred.glm0.train>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.train$track_popularity, class.glm0.train.opt, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 18732 1843
## 1 1636 473
Observation: With weights of 5 and 1, we got 18732 true negatives, 473 true positives, 1843 false positives and 1636 false negatives. That means, for example, that 473 songs were predicted as popular and actually were popular, though many popular songs were still missed.
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.opt)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.opt==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.opt==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.opt, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.153368012696174"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0895747266099635"
print(paste0("FNR:",FNR))
## [1] "FNR:0.775723091512565"
print(paste0("cost:",cost))
## [1] "cost:0.441853288661612"
We got an MR of 0.1534, which is reasonably accurate for such noisy data. We also got 0.0896 for FPR, 0.7757 for FNR and 0.4419 for the cost.
Now, let's do the same with the out-of-sample data, which is more important, and calculate the MR.
# step 0. obtain predicted values on testing data (we actually did this already, but let's do it again)
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")
# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.opt <- (pred.glm0.test>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.test$track_popularity, pred.glm0.test.opt, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 4698 448
## 1 416 110
Obs: With weights of 5 and 1 (optimal cut-off 0.16), we got 4698 true negatives, 110 true positives, 448 false positives and 416 false negatives. That means, for example, that 110 songs were predicted as popular and actually were popular.
MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.opt)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.opt==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.opt==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.opt, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.152327221438646"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0870579090555771"
print(paste0("FNR:",FNR))
## [1] "FNR:0.790874524714829"
print(paste0("cost:",cost))
## [1] "cost:0.445698166431594"
The out-of-sample MR is still very good, with a score of 0.1523. We also got 0.0871 for FPR, 0.7909 for FNR and 0.4457 for the cost. MR and FPR were slightly better on the testing set, while FNR and cost were slightly worse.
spotify_glm0_cat<- glm(track_popularity~ playlist_genre + playlist_subgenre, family=binomial, data=spotify_copy.train)
summary(spotify_glm0_cat)
##
## Call:
## glm(formula = track_popularity ~ playlist_genre + playlist_subgenre,
## family = binomial, data = spotify_copy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9468 -0.5258 -0.3472 -0.1463 3.3090
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.4706 0.4474 -12.229 < 2e-16
## playlist_genrelatin 2.1066 0.4832 4.360 1.30e-05
## playlist_genrepop 4.9005 0.4532 10.813 < 2e-16
## playlist_genrer&b 3.5616 0.4578 7.781 7.21e-15
## playlist_genrerap 3.4258 0.4587 7.469 8.10e-14
## playlist_genrerock 3.1048 0.4647 6.681 2.38e-11
## playlist_subgenrebig room 0.1464 0.6715 0.218 0.827462
## playlist_subgenreclassic rock -0.1167 0.1790 -0.652 0.514502
## playlist_subgenredance pop -0.6406 0.1037 -6.178 6.50e-10
## playlist_subgenreelectro house 0.9380 0.5332 1.759 0.078579
## playlist_subgenreelectropop -1.1165 0.1131 -9.868 < 2e-16
## playlist_subgenregangster rap -1.1283 0.1873 -6.025 1.69e-09
## playlist_subgenrehard rock -0.7950 0.2064 -3.853 0.000117
## playlist_subgenrehip hop 0.4123 0.1312 3.142 0.001676
## playlist_subgenrehip pop 0.1436 0.1470 0.977 0.328782
## playlist_subgenreindie poptimism -2.2086 0.1422 -15.535 < 2e-16
## playlist_subgenrelatin hip hop 0.9102 0.2183 4.169 3.06e-05
## playlist_subgenrelatin pop 1.9038 0.2023 9.410 < 2e-16
## playlist_subgenreneo soul -1.8874 0.2208 -8.549 < 2e-16
## playlist_subgenrenew jack swing -3.2198 0.4588 -7.018 2.25e-12
## playlist_subgenrepermanent wave 0.7548 0.1584 4.766 1.88e-06
## playlist_subgenrepop edm 2.7406 0.4715 5.812 6.16e-09
## playlist_subgenrepost-teen pop NA NA NA NA
## playlist_subgenreprogressive electro house NA NA NA NA
## playlist_subgenrereggaeton 1.5118 0.2209 6.844 7.71e-12
## playlist_subgenresouthern hip hop -1.0521 0.1711 -6.149 7.79e-10
## playlist_subgenretrap NA NA NA NA
## playlist_subgenretropical NA NA NA NA
## playlist_subgenreurban contemporary NA NA NA NA
##
## (Intercept) ***
## playlist_genrelatin ***
## playlist_genrepop ***
## playlist_genrer&b ***
## playlist_genrerap ***
## playlist_genrerock ***
## playlist_subgenrebig room
## playlist_subgenreclassic rock
## playlist_subgenredance pop ***
## playlist_subgenreelectro house .
## playlist_subgenreelectropop ***
## playlist_subgenregangster rap ***
## playlist_subgenrehard rock ***
## playlist_subgenrehip hop **
## playlist_subgenrehip pop
## playlist_subgenreindie poptimism ***
## playlist_subgenrelatin hip hop ***
## playlist_subgenrelatin pop ***
## playlist_subgenreneo soul ***
## playlist_subgenrenew jack swing ***
## playlist_subgenrepermanent wave ***
## playlist_subgenrepop edm ***
## playlist_subgenrepost-teen pop
## playlist_subgenreprogressive electro house
## playlist_subgenrereggaeton ***
## playlist_subgenresouthern hip hop ***
## playlist_subgenretrap
## playlist_subgenretropical
## playlist_subgenreurban contemporary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14035 on 22683 degrees of freedom
## Residual deviance: 12271 on 22660 degrees of freedom
## AIC: 12319
##
## Number of Fisher Scoring iterations: 7
Observation: All of the genre coefficients are significant at the 0.05 level. Among the subgenres, big room, classic rock, electro house and hip pop are not significant, and five levels (post-teen pop, progressive electro house, trap, tropical and urban contemporary) have no estimates at all because of singularities: they are collinear with the genre dummies. Under genre, the largest coefficient belongs to pop, followed by r&b and rap. For the subgenres, the largest positive coefficients are pop edm, latin pop and reggaeton, while the most negative are new jack swing, indie poptimism and neo soul. In the earlier linear model the biggest changes were in other variables (new jack swing, electropop and neo soul).
pred.glm0_cat.train <- predict(spotify_glm0_cat,type="response")
#ROC curve
library(ROCR)
pred <- prediction(pred.glm0_cat.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7593949
pred.glm0.test_cat<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred <- prediction(pred.glm0.test_cat, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7539842
Observation: For the training set we get an AUC of 0.7594 and for the testing set we get 0.75398.
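We computed the AUC without plotting the curve itself; a quick sketch using the `perf` object from the testing-set block just above, with the chance diagonal for reference:
# plot the testing-set ROC curve; the dashed diagonal is a random classifier
plot(perf, colorize = TRUE)
abline(0, 1, lty = 2)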
# define a cost function with input "obs" being observed response
# and "pred.p" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return a value when you write R functions
} # end of the function
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0_cat.train, pcut = p.seq[i])
} # end of the loop
#cbind(p.seq, cost)
# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)
## [1] 0.17 0.18
The optimal pcut is either 0.17 or 0.18; both give the same minimum cost. Since a length-two cut-off vector would be recycled in the vectorized comparisons that follow, it is safer to keep a single value, as sketched below. Now we will get MR, FPR, FNR and cost and compare the models.
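A sketch of the safer pattern (`pcut.single` is our own name): which.min() returns only the first index of the minimum, and plotting the cost curve shows how flat the region around the optimum is.
# keep a single cut-off: which.min() returns the first index of the minimum
pcut.single <- p.seq[which.min(cost)]
# visualize the cost against the cut-off to see the flat region near the minimum
plot(p.seq, cost, type = "l", xlab = "probability cut-off", ylab = "weighted cost")
abline(v = pcut.single, lty = 2)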
# step 1. get binary classification
class.glm0.train.cat<- (pred.glm0_cat.train>optimal.pcut.glm0)*1
# get values
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.cat)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.cat==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.cat==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.cat, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.151340151648739"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0982260024301337"
print(paste0("FNR:",FNR))
## [1] "FNR:0.669511616880038"
print(paste0("cost:",cost))
## [1] "cost:0.400326221125022"
# step 0. obtain predicted values on testing data (we actually did this already, but let's do it again)
pred.glm0.test<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.cat <- (pred.glm0.test>optimal.pcut.glm0)*1
#Get values
MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.cat)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.cat==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.cat==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.cat, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.157087447108604"
print(paste0("FPR:",FPR))
## [1] "FPR:0.104741546832491"
print(paste0("FNR:",FNR))
## [1] "FNR:0.669201520912547"
print(paste0("cost:",cost))
## [1] "cost:0.405324400564175"
Observation: For this new model we got a better AUC on the training set (0.7594) than on the testing set (0.7540). Both values were better for Model 2 than for Model 1 (0.6856 and 0.6610 respectively). We then calculated MR, FPR, FNR and cost for both models on the training and testing sets. For the first model we got MR = 0.1534, FPR = 0.0896, FNR = 0.7757 and cost = 0.4419 on the training set, and MR = 0.1523, FPR = 0.0871, FNR = 0.7909 and cost = 0.4457 on the testing set, so Model 1 did slightly better out of sample. For the second model we got MR = 0.1513, FPR = 0.0982, FNR = 0.6695 and cost = 0.4003 on the training set, and MR = 0.1571, FPR = 0.1047, FNR = 0.6692 and cost = 0.4053 on the testing set, so for Model 2 the training set performed better. Comparing the two, Model 1 has the lower testing MR, but Model 2 has the clearly lower FNR and cost, as the table below summarizes.
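To make the comparison easier to read, the reported metrics can be collected in one data frame (all values transcribed from the output above):
# side-by-side summary of the two logistic models (numbers from the runs above)
logit_compare <- data.frame(
  model = c("Model 1 train", "Model 1 test", "Model 2 train", "Model 2 test"),
  MR    = c(0.1534, 0.1523, 0.1513, 0.1571),
  FPR   = c(0.0896, 0.0871, 0.0982, 0.1047),
  FNR   = c(0.7757, 0.7909, 0.6695, 0.6692),
  cost  = c(0.4419, 0.4457, 0.4003, 0.4053))
logit_compare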
We are first going to make a new data set without the categorical variables, then recode our track_popularity variable into two levels.
spotify_new = subset(spotify, select = -c(track_album_release_date, track_id, track_name, track_artist, track_album_name, track_album_id, playlist_genre, playlist_name, playlist_subgenre, playlist_id))
spotify_new$track_popularity <- ifelse(spotify_new$track_popularity >=70, 1 , 0)
#splitting the data
set.seed(2023)
index <- sample(1:nrow(spotify_new), nrow(spotify_new)*0.80)
spotify_svm_train <- spotify_new[index,]
spotify_svm_test <- spotify_new[-index,]
# fitting an SVM without class weights
library(e1071)
# note: the split above copied the numeric response, so we still call as.factor() in the formula
spotify_new$track_popularity <- as.factor(spotify_new$track_popularity)
str(spotify_new)
## 'data.frame': 28356 obs. of 13 variables:
## $ track_popularity: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
#svm model
spotify_svm = svm(as.factor(track_popularity) ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = spotify_svm_train, kernel = 'linear')
#summary
summary(spotify_svm)
##
## Call:
## svm(formula = as.factor(track_popularity) ~ danceability + energy + key +
##     loudness + mode + speechiness + acousticness + instrumentalness +
##     liveness + valence + tempo + duration_ms, data = spotify_svm_train,
##     kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 4832
##
## ( 2723 2109 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
# predictions on the train data
pred_spotify_train = predict(spotify_svm, spotify_svm_train)
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_svm_train$track_popularity ,
pred = pred_spotify_train)
Cmatrix_train
## pred
## true 0 1
## 0 20575 0
## 1 2109 0
#Mis-classification rate
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.09297302
The confusion matrix of the predicted values on the training data shows 20575 true negatives, 0 false positives, 2109 false negatives and 0 true positives: the model never predicts a track as popular.
The misclassification rate is 0.0930, which is simply the share of popular tracks in the training set.
# Predictions on the test data
pred_spotify_test <- predict(spotify_svm, spotify_svm_test)
# Confusion matrix to evaluate the model on test data
Cmatrix_test = table(true = spotify_svm_test$track_popularity,
pred = pred_spotify_test)
Cmatrix_test
## pred
## true 0 1
## 0 5146 0
## 1 526 0
#Mis-classification rate
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.09273625
The confusion matrix of the predicted values on the testing data shows 5146 true negatives, 0 false positives, 526 false negatives and 0 true positives.
The misclassification rate is 0.0927; again, every track is predicted as not popular.
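An all-negative prediction is unsurprising on data this imbalanced; a quick check of the class proportions makes the point concrete:
# about 91% of training tracks are "not popular", so always predicting 0
# already yields the misclassification rates seen above
prop.table(table(spotify_svm_train$track_popularity))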
Now we are going to fit an SVM model with asymmetric class weights using the training data.
spotify_svm_asymmetric = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear', class.weights = c("0" = 1, "1" = 2))
#Summary
summary(spotify_svm_asymmetric)
##
## Call:
## svm(formula = as.factor(track_popularity) ~ danceability + energy +
## loudness + speechiness + mode + key + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = spotify_svm_train,
## kernel = "linear", class.weights = c(`0` = 1, `1` = 2))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 7076
##
## ( 4967 2109 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
# predictions for train data
pred_spotify_train <- predict(spotify_svm_asymmetric, spotify_svm_train)
#Confusion matrix for train data
Cmatrix_train_21 <- table(true =spotify_svm_train$track_popularity,
pred = pred_spotify_train)
print(Cmatrix_train_21)
## pred
## true 0 1
## 0 20575 0
## 1 2109 0
#Mis-classification rate
1 - sum(diag(Cmatrix_train_21))/sum(Cmatrix_train_21)
## [1] 0.09297302
Even with class weights of 1 and 2, the training confusion matrix is unchanged: 20575 true negatives, 0 false positives, 2109 false negatives and 0 true positives.
The misclassification rate is still 0.0930.
Now we evaluate the weighted SVM model on the testing data.
# predictions on the testing data
pred_spotify_test <- predict(spotify_svm_asymmetric, spotify_svm_test)
#Confusion matrix
Cmatrix_test_22 <- table( true = spotify_svm_test$track_popularity, pred = pred_spotify_test)
print(Cmatrix_test_22)
## pred
## true 0 1
## 0 5146 0
## 1 526 0
#Mis-classification rate
MR <- 1 - sum(diag(Cmatrix_test_22))/sum(Cmatrix_test_22)
MR
## [1] 0.09273625
The confusion matrix of the predicted values on the testing data shows 5146 true negatives, 0 false positives, 526 false negatives and 0 true positives.
The misclassification rate is 0.0927; the 1:2 weighting did not change a single prediction, as the sketch below explores.
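Since a 1:2 weighting changed nothing, a heavier penalty on the rare class would be the natural next step. A sketch only: the 1:10 ratio (roughly the inverse class frequency) is our own untuned choice, and `spotify_svm_heavy` a hypothetical name.
# hypothetical heavier weight on the rare "popular" class;
# ~9% positives suggests a weight near 10 for class "1"
spotify_svm_heavy <- svm(as.factor(track_popularity) ~ danceability + energy + loudness +
                           speechiness + mode + key + acousticness + instrumentalness +
                           liveness + valence + tempo + duration_ms,
                         data = spotify_svm_train, kernel = "linear",
                         class.weights = c("0" = 1, "1" = 10))
table(true = spotify_svm_train$track_popularity,
      pred = predict(spotify_svm_heavy, spotify_svm_train))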
Now we are going to obtain the ROC and AUC for our SVM model. We will then use our output to compare it to the ROC and AUC we obtained in our logistic model.
#ROC and AUC using the train data
spotify_svm_prob = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms,
data = spotify_svm_train, kernel = 'linear',
probability = TRUE)
pred_prob_train = predict(spotify_svm_prob,
newdata = spotify_svm_train,
probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, "1"] # index by class name; column order is not guaranteed
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5866711
The training AUC is 0.5867.
# predictions from the probability model on the testing data
pred_prob_test = predict(spotify_svm_prob,
newdata = spotify_svm_test,
probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, "1"] # index by class name; column order is not guaranteed
#ROC
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity )
perf <- performance(pred, "tpr", "fpr")
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5698786
The testing AUC is 0.5699.
When comparing the ROC and AUC for our SVM model to our AUC and ROC for our logistic regression model, we found that the logistic regression model performed better because it had a higher AUC for both the train and test data.
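For a visual version of this comparison, the two test-set ROC curves can be overlaid. A sketch that reuses the prediction vectors computed above (each model is evaluated on its own test split, matching the AUC comparison in the text):
# overlay the test-set ROC curves of the categorical logistic model and the SVM
perf_glm <- performance(prediction(pred.glm0.test_cat, spotify_copy.test$track_popularity),
                        "tpr", "fpr")
perf_svm <- performance(prediction(pred_prob_test, spotify_svm_test$track_popularity),
                        "tpr", "fpr")
plot(perf_glm, col = "blue")
plot(perf_svm, col = "red", add = TRUE)
legend("bottomright", c("logistic (genre model)", "SVM (linear)"),
       col = c("blue", "red"), lty = 1)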
We will create a copy of the spotify data under a new name and delete some variables for the purpose of this analysis.
spotify_dec = subset(spotify, select = -c(track_album_release_date, track_id, track_name, track_artist, track_album_name, track_album_id, playlist_genre, playlist_subgenre, playlist_name, playlist_id))
spotify_dec$track_popularity <- ifelse(spotify_dec$track_popularity >=70, 1 , 0)
str(spotify_dec)
## 'data.frame': 28356 obs. of 13 variables:
## $ track_popularity: num 0 0 1 0 0 0 0 0 0 0 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
spotify_dec$mode<- as.factor(spotify_dec$mode)
spotify_dec$key<- as.factor(spotify_dec$key)
str(spotify_dec)
## 'data.frame': 28356 obs. of 13 variables:
## $ track_popularity: num 0 0 1 0 0 0 0 0 0 0 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
set.seed(2023)
index <- sample(1:nrow(spotify_dec),nrow(spotify_dec)*0.60)
spotify_dec.train = spotify_dec[index,]
spotify_dec.test = spotify_dec[-index,]
library(rpart)
library(rpart.plot)
# fit the model
fit_tree_dec <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train)
summary(fit_tree_dec)
## Call:
## rpart(formula = as.factor(track_popularity) ~ danceability +
## energy + speechiness + loudness + instrumentalness + mode +
## key + acousticness + liveness + valence + tempo, data = spotify_dec.train)
## n= 17013
##
## CP nsplit rel error xerror xstd
## 1 0 0 1 0 0
##
## Node number 1: 17013 observations
## predicted class=0 expected loss=0.09351672 P(node) =1
## class counts: 15422 1591
## probabilities: 0.906 0.094
#rpart.plot(fit_tree_dec, yesno=2, extra=4) # nothing to plot: the tree has no splits
Obtain the predicted values for training data and confusion matrix.
# Make predictions on the train data
pred_tree_train <- predict(fit_tree_dec, spotify_dec.train, type="class")
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_dec.train$track_popularity,
pred = pred_tree_train)
Cmatrix_train
## pred
## true 0 1
## 0 15422 0
## 1 1591 0
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.09351672
Observation: From this model we get a very small misclassification rate of 0.0935, but the confusion matrix on the training set shows that no observation is predicted as 1: the tree is a single root node. See the sketch below for why the default complexity parameter prunes every split away.
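With roughly 9% positives and symmetric costs, no split improves the misclassification loss enough to clear rpart's default complexity threshold (cp = 0.01). A sketch that forces growth by lowering cp (0.001 is an arbitrary choice, and `fit_tree_deep` our own name):
# force the tree to grow by lowering the complexity parameter;
# with symmetric costs the default cp = 0.01 prunes everything away
fit_tree_deep <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness +
                         loudness + instrumentalness + mode + key + acousticness +
                         liveness + valence + tempo,
                       data = spotify_dec.train,
                       control = rpart.control(cp = 0.001))
printcp(fit_tree_deep)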
Obtain the predicted values for testing data and confusion matrix.
# Make predictions on the test data
pred_tree_test <- predict(fit_tree_dec, spotify_dec.test, type="class")
# Confusion matrix to evaluate the model on test data
Cmatrix_test = table(true = spotify_dec.test$track_popularity,
pred = pred_tree_test)
Cmatrix_test
## pred
## true 0 1
## 0 10299 0
## 1 1044 0
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.09203914
Observation: Doing the same on the testing set, the MR is even smaller at 0.092, but still no observation is predicted as 1. Therefore we will refit the tree with an asymmetric cost matrix, putting more weight on false negatives, to see if this improves the model.
# We need to define a cost matrix first; the diagonal (correct classifications) stays 0
cost_matrix <- matrix(c(0, 1, # cost of 1 for FP
6, 0), # cost of 6 for FN
byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train, parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)
Since the weighted tree now sends some observations to the '1' side, we can visualize it in a plot. Of the predictors, energy is the one the tree splits on first, so it appears to be the strongest single predictor of popularity here. The plot indicates that an energy value below 0.82 pushes a track toward a prediction of 1, and a loudness above -5.1 likewise pushes toward 1.
#get predictions for training
pred_spotify_train <- predict(fit_tree_asym, spotify_dec.train,
type = "class")
#C matrix for training
Cmatrix_train_asy <- table( true = spotify_dec.train$track_popularity, pred = pred_spotify_train)
Cmatrix_train_asy
## pred
## true 0 1
## 0 13157 2265
## 1 1054 537
1 - sum(diag(Cmatrix_train_asy))/sum(Cmatrix_train_asy)
## [1] 0.1950861
Observation: Now that we have redistributed the weights we finally get some predictions of 1 (popular): 537 TP, 2265 FP, 13157 TN and 1054 FN on the training set. However, the MR is now larger at 0.1951. Let's get the asymmetric predictions for the testing set.
#get predictions for testing
pred_tree_test_asym <- predict(fit_tree_asym, spotify_dec.test, type = "class")
#C matrix for testing with more weight on "1"
Cmatrix_test_weight = table(true = spotify_dec.test$track_popularity, pred = pred_tree_test_asym)
Cmatrix_test_weight
## pred
## true 0 1
## 0 8820 1479
## 1 707 337
1 - sum(diag(Cmatrix_test_weight))/sum(Cmatrix_test_weight)
## [1] 0.192718
Observation: For the testing set we get the following values: 337 TP, 1479 FP, 8820 TN and 707 FN. The MR is now a bit lower, with a value of 0.1927. Next, we will obtain the AUC for the training and testing data.
# obtain predicted probability
pred_prob_train = predict(fit_tree_dec, spotify_dec.train, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_prob_train = pred_prob_train[,"1"]
library(ROCR)
pred <- prediction(pred_prob_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5
# obtain predicted probability
pred_prob_test = predict(fit_tree_dec, spotify_dec.test, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
pred_prob_test = pred_prob_test[,"1"] # replace "1" with the actual category if the response variable is a factor
#ROC
pred <- prediction(pred_prob_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5
Observation: For the unweighted model the AUC is 0.5 on both the training and testing sets, and the ROC curve is just the diagonal. This is because no observation is predicted as 1: the single-node tree gives every track the same probability, as the check below confirms. Now, let's get the AUC-ROC for the model with weights.
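A one-line check (a sketch) confirms the degenerate behaviour:
# a tree with no splits assigns the root-node probability (~0.094) to every row,
# so the ROC "curve" collapses to the diagonal
length(unique(pred_prob_train)) # 1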
# obtain predicted probability
pred_asym_train = predict(fit_tree_asym, spotify_dec.train, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_asym_train = pred_asym_train[,"1"]
library(ROCR)
pred <- prediction(pred_asym_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6613876
# obtain predicted probability
pred_asym_test = predict(fit_tree_asym, spotify_dec.test, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_asym_test = pred_asym_test[,"1"]
library(ROCR)
pred <- prediction(pred_asym_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6450638
Observation: Now that we have done AUC-ROC on the weighted model, we get a proper curve, with an AUC of 0.6614 for the training set and 0.6451 for the testing set, making it a better model than the one with no weights. Overall, although the MR is lower for the unweighted model, that does not make it better, since it classifies no observation as popular. The weighted model has a larger MR but does genuinely better classification, and its AUC is higher. The table below gathers the AUCs of all the models.
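To support the comparison, the AUCs reported across the document can be collected in one data frame (all values transcribed from the runs above; the Model 1 AUCs are the ones quoted in the logistic comparison):
# AUC summary of all fitted classifiers (numbers from the output above)
auc_summary <- data.frame(
  model     = c("logistic (audio features)", "logistic (genre/subgenre)",
                "SVM (linear)", "tree (weighted)"),
  train_AUC = c(0.6856, 0.7594, 0.5867, 0.6614),
  test_AUC  = c(0.6610, 0.7540, 0.5699, 0.6451))
auc_summary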
Conclusion: While exploring which variable best predicts track_popularity, we found that energy and loudness were our strongest predictors. We tested the variables with several prediction models (logistic regression, KNN, SVM and a tree model) to see which performed better. Although the categorical logistic model achieved the highest AUC, we judged the weighted tree model the most useful overall because, of all the models, it produced the best confusion matrix, actually identifying popular tracks.