The Spotify dataset comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify's API. Kaylin Pavlik had a recent blog post using the audio features to explore and classify songs; she used the spotifyr package to collect about 5,000 songs from 6 main categories.
The data contains general metadata around songs from Spotify's API: each song's popularity plus audio features such as acousticness, danceability, energy, speechiness, valence, key, and so on.
With this analysis we are interested in how track popularity is influenced by other attributes such as danceability, loudness, speechiness, and valence.
The plan is to analyze the relationship between popularity and the different features of a song in order to predict a song's future popularity. We plan on performing data preparation, EDA, and modelling using models such as linear regression, KNN, logistic regression, SVM, and tree models, and comparing which one performs better.
This is mainly beneficial for marketing to Spotify customers and improving their experience while using Spotify. Spotify itself would also be able to provide more accurate predictions of a new song's potential popularity even before its release.
library(tidyverse) #It assists with data import, tidying, manipulation, and data visualization.
library(ggplot2) # package for producing statistical, or data, graphics
library(kknn) # to perform k-nearest neighbor classification
library(corrplot) # graphical display of a correlation matrix, confidence interval
library(readr) # provides a fast way to read rectangular data
library(rpart) #implements the classification and regression tree algorithm (CART)
library(rpart.plot) #An Enhanced Plotting Package for rpart
tidyverse: assists with data import, tidying, manipulation, and visualization. ggplot2: produces statistical graphics. kknn: performs k-nearest neighbor classification. corrplot: graphical display of a correlation matrix. readr: provides a fast way to read rectangular data. rpart: implements the classification and regression tree (CART) algorithm. rpart.plot: an enhanced plotting package for rpart.
# Read in the dataset (read.csv is from base R; readr was already loaded above)
spotify <- read.csv("spotify.csv")
We are going to examine our data in order to see how we can best clean it for further analysis.
### Checking dimension of Data
dim(spotify)
## [1] 32833 23
The original dataset contains 32,833 rows and 23 columns. We will show the first 6 rows and provide each column name.
# show the first 6 rows
head(spotify)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
#### Checking column names
names(spotify)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
Now, we will look at the structure, summary, data types, duplicates, and missing values, and then clean the data.
### Checking structure of Data
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Observations: This shows the structure of the data. We can see that track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre are character variables. On the other hand, track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms are numeric. We need to change track_album_release_date to a date variable. We will also change playlist_genre to a factor, for future plotting.
#### Modifying Data Types
# Modifying Data Types
spotify$track_album_release_date<- as.Date(spotify$track_album_release_date)
spotify$playlist_genre<-as.factor(spotify$playlist_genre)
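As a quick sanity check (a minimal sketch), we can confirm the conversions took effect:
# Verify the new data types
class(spotify$track_album_release_date)  # should now be "Date"
class(spotify$playlist_genre)            # should now be "factor"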
#summary statistics
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
##
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Min. :1957-01-01
## Class :character Class :character 1st Qu.:2010-12-04
## Mode :character Mode :character Median :2017-01-27
## Mean :2012-09-09
## 3rd Qu.:2019-05-16
## Max. :2020-01-29
## NA's :1886
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 edm :6043 Length:32833
## Class :character Class :character latin:5155 Class :character
## Mode :character Mode :character pop :5507 Mode :character
## r&b :5431
## rap :5746
## rock :4951
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
##
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
##
Observations: This function displays the Min, Q1, Median, Mean, Q3, and Max of each variable. We can already see that there are probably some outliers and that some variables have very large maxima (duration_ms has a Max of 517,810; tempo has a Max of 239.44). We will apply some truncation, winsorization, or standardization to see how it affects the model.
# let's look at some tables for the categorical variables
table(spotify$playlist_genre)
##
## edm latin pop r&b rap rock
## 6043 5155 5507 5431 5746 4951
table(spotify$playlist_subgenre)
##
## album rock big room classic rock
## 1065 1206 1296
## dance pop electro house electropop
## 1298 1511 1408
## gangster rap hard rock hip hop
## 1458 1485 1322
## hip pop indie poptimism latin hip hop
## 1256 1672 1656
## latin pop neo soul new jack swing
## 1262 1637 1133
## permanent wave pop edm post-teen pop
## 1105 1517 1129
## progressive electro house reggaeton southern hip hop
## 1809 949 1675
## trap tropical urban contemporary
## 1291 1288 1405
table(spotify$key)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 3454 4010 2827 913 2201 2680 2670 3352 2430 3027 2273 2996
table(spotify$mode)
##
## 0 1
## 14259 18574
Observations: We can see that mode is a binary variable. playlist_genre and playlist_subgenre are categorical variables holding the genre of the music. The number of observations does not vary much across categories, which is good, and there are no apparent outliers.
#Looking for duplicates
dups_id <- sum(duplicated(spotify$track_id))
print(dups_id)
## [1] 4477
We can see that many songs are duplicated in this dataset; they share the same track_id. Therefore, we will remove them for further analysis.
spotify_dups <- spotify[duplicated(spotify$track_id), ]
spotify <- spotify[!duplicated(spotify$track_id), ]
We moved the duplicate songs to a separate dataset called spotify_dups and removed them from the current dataset.
#looking for missing values
sum(is.na(spotify))
## [1] 1693
There are 1,693 missing values in this dataset. However, we will not remove the affected rows, since they might still be important for the analysis.
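To see which columns the missing values come from, a per-column count helps (a quick sketch; most NAs likely come from release dates that as.Date() above could not parse):
# Count missing values per column
colSums(is.na(spotify))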
We will plot some variables to observe their skewness and interpret them.
histograms <- names(spotify)[c(4, 12:23)]
songs1 <- spotify %>%
  select(all_of(histograms)) %>%
  pivot_longer(cols = all_of(histograms))
songs1 %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name, ncol = 5, scales = 'free') +
  labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
  theme_void()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Observations:
- duration_ms and valence are roughly normally distributed.
- danceability, energy, and loudness are left-skewed.
- acousticness, liveness, and speechiness are right-skewed.
- key is a categorical variable.
- track_popularity peaks around 50-60 and is roughly normally distributed, but there are many zero values, possibly standing in for missing values; very few songs have a popularity above 90.
ggplot(spotify,aes(x = playlist_genre, y = track_popularity, fill = playlist_genre)) +geom_boxplot()
Observations: We can see that pop is the most popular genre, followed by latin and rock.
Now we will plot scatterplots comparing the different parameters to popularity, to see which genre has the highest popularity across the different variables and, among popular songs, which parameter best predicts popularity.
feature_names <- names(spotify)[c(12:13, 15, 17:18, 23)]
songs <- spotify %>%
  arrange(desc(track_popularity)) %>%
  head(n = 500) %>%
  pivot_longer(cols = all_of(feature_names))
songs %>%
  ggplot(aes(x = name, y = value)) +
  geom_jitter(aes(color = playlist_genre)) +
  facet_wrap(~name, ncol = 3, scales = 'free') +
  labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
  theme(axis.text.y = element_blank())
Observations: Pop and edm are the most popular genres among the top 500 songs, and higher speechiness and acousticness appear to go with higher popularity. Next, we will plot the variables in boxplots to look at the outliers and the structure of the data.
ggplot(data = songs1) +
  geom_boxplot(aes(y = value)) +
  facet_wrap(~name, nrow = 4, scales = "free") +
  coord_flip() +
  ggtitle("Outlier analysis", subtitle = "For different song attributes") +
  theme_void()
Observations: From the boxplots we can observe that many variables (danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, duration_ms) have outliers. Removing those rows outright would influence the analysis a lot.
Let's create a new dataset with winsorized and truncated variables to reduce the outliers, in case we need it later.
spotify_copy <- spotify
Now, we will truncate danceability, energy, speechiness, acousticness, instrumentalness, and liveness, at cut-offs chosen by eye from the boxplots above.
# truncation danceability, energy, speechiness, acousticness, instrumentalness and liveness
spotify_copy$danceability[spotify_copy$danceability <= 0.28] <- 0.28
spotify_copy$energy[spotify_copy$energy <= 0.2] <- 0.2
spotify_copy$speechiness[spotify_copy$speechiness >= 0.27] <- 0.27
spotify_copy$acousticness[spotify_copy$acousticness >= 0.6] <- 0.6
spotify_copy$instrumentalness[spotify_copy$instrumentalness >= 0.015] <- 0.015
spotify_copy$liveness[spotify_copy$liveness >= 0.4] <- 0.4
Now, we will winsorize loudness, tempo and duration.
# winsorization loudness
# Calculate the 5th and 95th percentiles for 'loudness'
lower_bound_loudness <- quantile(spotify_copy$loudness, 0.05, na.rm = TRUE)
upper_bound_loudness <- quantile(spotify_copy$loudness, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$loudness[spotify_copy$loudness < lower_bound_loudness] <- lower_bound_loudness
spotify_copy$loudness[spotify_copy$loudness > upper_bound_loudness] <- upper_bound_loudness
# winsorization tempo
# Calculate the 5th and 95th percentiles for 'tempo'
lower_bound_tempo <- quantile(spotify_copy$tempo, 0.05, na.rm = TRUE)
upper_bound_tempo <- quantile(spotify_copy$tempo, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$tempo[spotify_copy$tempo < lower_bound_tempo] <- lower_bound_tempo
spotify_copy$tempo[spotify_copy$tempo > upper_bound_tempo] <- upper_bound_tempo
#winsorize duration
# Calculate the 5th and 95th percentiles for 'duration'
lower_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.05, na.rm = TRUE)
upper_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$duration_ms[spotify_copy$duration_ms < lower_bound_duration_ms] <- lower_bound_duration_ms
spotify_copy$duration_ms[spotify_copy$duration_ms > upper_bound_duration_ms] <- upper_bound_duration_ms
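The three winsorization blocks above repeat the same pattern, so a small helper keeps things tidy. This is a sketch; winsorize() below is our own illustrative helper, not from a package:
# Hypothetical helper: cap a numeric vector at its 5th and 95th percentiles
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[x < bounds[1]] <- bounds[1]
  x[x > bounds[2]] <- bounds[2]
  x
}
# Equivalent to the three blocks above:
# spotify_copy$loudness    <- winsorize(spotify_copy$loudness)
# spotify_copy$tempo       <- winsorize(spotify_copy$tempo)
# spotify_copy$duration_ms <- winsorize(spotify_copy$duration_ms)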
Now we have capped the outliers in those variables. The data is cleaned: we have also dealt with missing values, duplicates, and data types. Let's create a table to see what the cleaned data looks like.
knitr::kable(head(spotify[, 1:23]), "simple")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
To uncover new information in the data, we first took a look at the descriptive statistics of the continuous variables, which showed us the mean, median, minimum, maximum, and quartiles of each variable. We then used histograms to visualize the distribution of each variable; this let us see the skewness of each variable, which gave us an idea of the characteristics of the data. Our visualizations gave us the clearest insight: the scatter plots let us see the relationships between the variables and the influence that a song's genre has over those relationships, and the box plots gave us a different look at the distributions and revealed outliers, which can have a major influence over the overall dataset.
The plots we can use to illustrate our findings are scatter plots, line charts, and histograms. We already used scatter plots and histograms in the discovery process, but they will also be useful for illustrating our findings because they can show evidence of relationships between variables and the distributions of those variables.
Currently, we do not know how to conduct statistical tests like t-test, ANOVA, and chi-square to be able to test our hypothesis.
To create new summary information, we plan on narrowing the data down to the variables we really plan on exploring, to help us gain insights into predicting track_popularity. We will do so by splitting the data into training and test sets.
# Set the seed for reproducibility
set.seed(2023)
# Randomly sample row indices for the training set split in 70% and 30%
train_indices <- sample(1:NROW(spotify),NROW(spotify)*0.70)
# Create the training set
train_data <- spotify[train_indices, ] #everything before comma is row selector and after comma is column selector
# Create the testing set
test_data <- spotify[-train_indices, ]
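As a quick check (a sketch), we can confirm the 70/30 split:
# Confirm the split proportions
NROW(train_data) / NROW(spotify)  # about 0.70
NROW(test_data) / NROW(spotify)   # about 0.30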
Now that we have split the data, we will fit a linear regression model on the training set using the song-characteristic variables.
# Train the linear regression model, comparing popularity to rest of numeric parameters
lm_model <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + key +
## loudness + mode + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.757 -17.360 2.935 18.119 60.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.684e+01 2.045e+00 32.682 < 2e-16 ***
## danceability 4.170e+00 1.288e+00 3.237 0.00121 **
## energy -2.318e+01 1.461e+00 -15.864 < 2e-16 ***
## key 1.398e-02 4.595e-02 0.304 0.76088
## loudness 1.132e+00 7.802e-02 14.512 < 2e-16 ***
## mode 7.590e-01 3.357e-01 2.261 0.02378 *
## speechiness -7.397e+00 1.657e+00 -4.464 8.09e-06 ***
## acousticness 4.934e+00 8.895e-01 5.547 2.95e-08 ***
## instrumentalness -9.426e+00 7.437e-01 -12.676 < 2e-16 ***
## liveness -4.420e+00 1.077e+00 -4.104 4.08e-05 ***
## valence 1.997e+00 7.861e-01 2.540 0.01110 *
## tempo 2.808e-02 6.261e-03 4.485 7.33e-06 ***
## duration_ms -4.277e-05 2.726e-06 -15.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.03 on 19836 degrees of freedom
## Multiple R-squared: 0.06003, Adjusted R-squared: 0.05946
## F-statistic: 105.6 on 12 and 19836 DF, p-value: < 2.2e-16
We can compile the actual and predicted values and view the first few records.
# Create a data frame to compare actual and predicted values
comparison_df <- data.frame(Actual = train_data$track_popularity, lm_predicted =lm_model$fitted.values)
head(comparison_df)
## Actual lm_predicted
## 25119 0 33.94611
## 11393 24 40.51391
## 29550 39 41.34777
## 2064 63 38.75230
## 8416 40 35.35537
## 32344 30 28.37513
We will look at the In-Sample MSE to evaluate the model.
lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 529.94"
We got an MSE of 529.94, which is high for this model; a value close to 0 would be perfect.
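Because MSE is in squared popularity units, the root mean squared error is easier to interpret on the 0-100 popularity scale (a quick sketch):
# RMSE in popularity points; sqrt(529.94) is roughly 23
lm_rmse_train <- sqrt(lm_mse_train)
print(paste("Training RMSE for Linear Model:", round(lm_rmse_train, 2)))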
Now we will look at the Out-Of-Sample MSE to evaluate the model.
# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate out-of-sample MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 527.51"
The MSE for the testing set is slightly lower, with a value of 527.51.
Now we will train a linear model on the categorical variables to see how they predict popularity.
# Train the linear regression model, comparing popularity to other character parameters
lm_model_ch <- lm(track_popularity ~ playlist_genre + playlist_subgenre, data = train_data)
summary(lm_model_ch)
##
## Call:
## lm(formula = track_popularity ~ playlist_genre + playlist_subgenre,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.418 -16.339 2.905 16.931 60.098
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9894 0.6853 35.005 < 2e-16
## playlist_genrelatin 18.0795 1.0338 17.489 < 2e-16
## playlist_genrepop 31.4286 1.0654 29.501 < 2e-16
## playlist_genrer&b 22.2155 1.0313 21.541 < 2e-16
## playlist_genrerap 24.4461 1.0227 23.902 < 2e-16
## playlist_genrerock 13.5668 1.0741 12.631 < 2e-16
## playlist_subgenrebig room 4.4078 1.0705 4.117 3.85e-05
## playlist_subgenreclassic rock -0.2592 1.1543 -0.225 0.822309
## playlist_subgenredance pop -3.0791 1.1009 -2.797 0.005164
## playlist_subgenreelectro house 9.7948 0.9814 9.981 < 2e-16
## playlist_subgenreelectropop -16.3501 1.0989 -14.878 < 2e-16
## playlist_subgenregangster rap -15.9361 1.0535 -15.127 < 2e-16
## playlist_subgenrehard rock -4.2979 1.1230 -3.827 0.000130
## playlist_subgenrehip hop 4.6597 1.0591 4.400 1.09e-05
## playlist_subgenrehip pop -1.8091 1.2060 -1.500 0.133599
## playlist_subgenreindie poptimism -14.2180 1.0637 -13.367 < 2e-16
## playlist_subgenrelatin hip hop -9.5706 1.0786 -8.873 < 2e-16
## playlist_subgenrelatin pop 4.3781 1.1083 3.950 7.83e-05
## playlist_subgenreneo soul -16.3026 1.0344 -15.760 < 2e-16
## playlist_subgenrenew jack swing -19.9120 1.1214 -17.757 < 2e-16
## playlist_subgenrepermanent wave 14.2161 1.1811 12.036 < 2e-16
## playlist_subgenrepop edm 12.5587 1.0861 11.563 < 2e-16
## playlist_subgenrepost-teen pop NA NA NA NA
## playlist_subgenreprogressive electro house NA NA NA NA
## playlist_subgenrereggaeton 4.2245 1.2535 3.370 0.000752
## playlist_subgenresouthern hip hop -13.6192 1.0064 -13.533 < 2e-16
## playlist_subgenretrap NA NA NA NA
## playlist_subgenretropical NA NA NA NA
## playlist_subgenreurban contemporary NA NA NA NA
##
## (Intercept) ***
## playlist_genrelatin ***
## playlist_genrepop ***
## playlist_genrer&b ***
## playlist_genrerap ***
## playlist_genrerock ***
## playlist_subgenrebig room ***
## playlist_subgenreclassic rock
## playlist_subgenredance pop **
## playlist_subgenreelectro house ***
## playlist_subgenreelectropop ***
## playlist_subgenregangster rap ***
## playlist_subgenrehard rock ***
## playlist_subgenrehip hop ***
## playlist_subgenrehip pop
## playlist_subgenreindie poptimism ***
## playlist_subgenrelatin hip hop ***
## playlist_subgenrelatin pop ***
## playlist_subgenreneo soul ***
## playlist_subgenrenew jack swing ***
## playlist_subgenrepermanent wave ***
## playlist_subgenrepop edm ***
## playlist_subgenrepost-teen pop
## playlist_subgenreprogressive electro house
## playlist_subgenrereggaeton ***
## playlist_subgenresouthern hip hop ***
## playlist_subgenretrap
## playlist_subgenretropical
## playlist_subgenreurban contemporary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.07 on 19825 degrees of freedom
## Multiple R-squared: 0.1372, Adjusted R-squared: 0.1362
## F-statistic: 137.1 on 23 and 19825 DF, p-value: < 2.2e-16
We can see that most coefficients are statistically significant at the 0.05 level; playlist_subgenreclassic rock and playlist_subgenrehip pop are not, and five subgenre levels are NA because of singularities (they are collinear with the genre dummies). The R^2 is larger here, meaning that about 13.7% of the variance can be explained by this model. That is still quite low.
# Training MSE (categorical model)
lm_mse_train_ch <- mean((lm_model_ch$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train_ch, 2)))
## [1] "Training MSE for Linear Model: 486.43"
# Predict on testing data (categorical model)
lm_test_pred_ch <- predict(lm_model_ch, newdata = test_data)
## Warning in predict.lm(lm_model_ch, newdata = test_data): prediction from a
## rank-deficient fit may be misleading
# Calculate out-of-sample MSE
lm_mse_test_ch <- mean((lm_test_pred_ch - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test_ch, 2)))
## [1] "Testing MSE for Linear Model: 485.8"
The MSE of the model with the categorical variables is still very high, with a training MSE of 486.43 and a testing MSE of 485.8.
Now we will try adding an interaction term to see how it fits. Note that this model is fit on the full spotify dataset rather than on the training split.
lm_model_int <- lm(track_popularity ~ danceability + loudness + danceability*loudness, data = spotify)
summary(lm_model_int)
##
## Call:
## lm(formula = track_popularity ~ danceability + loudness + danceability *
## loudness, data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.912 -18.249 3.069 18.643 60.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.4207 1.4221 19.985 < 2e-16 ***
## danceability 20.2423 2.1859 9.261 < 2e-16 ***
## loudness -0.8238 0.1768 -4.660 3.18e-06 ***
## danceability:loudness 1.7837 0.2743 6.502 8.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.64 on 28352 degrees of freedom
## Multiple R-squared: 0.004993, Adjusted R-squared: 0.004887
## F-statistic: 47.42 on 3 and 28352 DF, p-value: < 2.2e-16
As we saw before, almost every variable is significantly associated with track popularity, so it is very easy to find significant interactions. The interaction we added is significant, but the R^2 is still far too low, now about 0.5%. The interaction coefficient is 1.784, meaning that the slope of popularity on loudness increases by 1.784 for every one-unit increase in danceability (and, symmetrically, the slope on danceability increases by 1.784 per unit of loudness).
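To make the interaction concrete, we can compute the marginal effect of danceability at a given loudness from the fitted coefficients above (a sketch; marginal_danceability is our own illustrative helper):
# d(popularity)/d(danceability) = 20.2423 + 1.7837 * loudness
coefs <- coef(lm_model_int)
marginal_danceability <- function(loudness) {
  unname(coefs["danceability"] + coefs["danceability:loudness"] * loudness)
}
marginal_danceability(-6)  # at a typical loudness of about -6 dB, roughly 9.5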
Let's compute the MSE on the training and testing data. Because the interaction model was fit on the full dataset, its fitted values do not line up with train_data, which triggers the length-mismatch warning below; the training MSE here should therefore be treated with caution.
# Training MSE (interaction model; fit on the full dataset, see note above)
lm_train_pred_int <- mean((lm_model_int$fitted.values - train_data$track_popularity)^2)
## Warning in lm_model_int$fitted.values - train_data$track_popularity: longer
## object length is not a multiple of shorter object length
print(paste("Training MSE for Linear Model:", round(lm_train_pred_int, 2)))
## [1] "Training MSE for Linear Model: 567.09"
# Predict on testing data
lm_test_pred_int <- predict(lm_model_int, newdata = test_data)
# Calculate out-of-sample MSE
lm_mse_test <- mean((lm_test_pred_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 554.33"
Let's fit the same model with no interaction.
lm_model_no_int <- lm(track_popularity ~ danceability + loudness , data = spotify)
summary(lm_model_no_int)
##
## Call:
## lm(formula = track_popularity ~ danceability + loudness, data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.803 -18.306 2.989 18.645 59.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.38628 0.72256 50.357 < 2e-16 ***
## danceability 7.48487 0.96398 7.765 8.47e-15 ***
## loudness 0.28557 0.04629 6.170 6.93e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.66 on 28353 degrees of freedom
## Multiple R-squared: 0.003509, Adjusted R-squared: 0.003439
## F-statistic: 49.92 on 2 and 28353 DF, p-value: < 2.2e-16
If we fit the same model with no interaction, the R^2 is about 0.35%. This means that only 0.35% of the variance can be explained by this model, indicating that it is not a good model.
# Training MSE (no-interaction model; same length-mismatch caveat as above)
lm_train_pred_no_int <- mean((lm_model_no_int$fitted.values - train_data$track_popularity)^2)
## Warning in lm_model_no_int$fitted.values - train_data$track_popularity: longer
## object length is not a multiple of shorter object length
print(paste("Training MSE for Linear Model:", round(lm_train_pred_no_int, 2)))
## [1] "Training MSE for Linear Model: 566.13"
Let's predict on the testing data and look at the MSE.
# Predict on testing data
lm_test_pred_no_int <- predict(lm_model_no_int, newdata = test_data)
# Calculate out-of-sample MSE
lm_mse_test <- mean((lm_test_pred_no_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 555.23"
From the model with numeric variables, we could see that all variables except key were statistically significant at the 0.05 level, which means that variation in them influences track_popularity. The in-sample MSE was 529.94 and the out-of-sample MSE was 527.51. Then we fit a linear model with the categorical variables genre and subgenre, where most coefficients were statistically significant at the 0.05 level; the in-sample MSE was 486.43 and the out-of-sample MSE was 485.8. Then we fit a linear model with an interaction between danceability and loudness: the interaction model had a training MSE of 567.09 and a testing MSE of 554.33, while without the interaction the training MSE was 566.13 and the testing MSE was 555.23 (subject to the caveat above about those models being fit on the full dataset). All of this indicates that the model that performed best was the one with the categorical variables genre and subgenre, suggesting that those variables do influence track_popularity. We can also see that, on the test set, the model with the interaction performed slightly better than the model without it.
matrix <- spotify %>% select(track_popularity, danceability, energy, key, loudness, mode, speechiness,acousticness, instrumentalness, liveness, valence, tempo, duration_ms)
cor_matrix <- cor(matrix)
print(cor_matrix)
## track_popularity danceability energy key
## track_popularity 1.000000000 0.046596507 -0.10362202 -0.0081442988
## danceability 0.046596507 1.000000000 -0.08143602 0.0070203133
## energy -0.103622023 -0.081436016 1.00000000 0.0128766164
## key -0.008144299 0.007020313 0.01287662 1.0000000000
## loudness 0.037285154 0.015297214 0.68213774 -0.0005315332
## mode 0.016275225 -0.055211662 -0.00431074 -0.1760953351
## speechiness 0.005205509 0.183453077 -0.02900840 0.0231150971
## acousticness 0.091725358 -0.028878237 -0.54588619 0.0041975410
## instrumentalness -0.124431070 -0.002266695 0.02381792 0.0073927120
## liveness -0.052773029 -0.127002362 0.16370922 0.0021365502
## valence 0.022581329 0.333728627 0.14971046 0.0216180479
## tempo 0.004446180 -0.184575167 0.15154451 -0.0103554421
## duration_ms -0.139681783 -0.087560915 0.01758400 0.0188095544
## loudness mode speechiness acousticness
## track_popularity 0.0372851540 0.0162752246 0.005205509 0.091725358
## danceability 0.0152972137 -0.0552116618 0.183453077 -0.028878237
## energy 0.6821377425 -0.0043107400 -0.029008396 -0.545886191
## key -0.0005315332 -0.1760953351 0.023115097 0.004197541
## loudness 1.0000000000 -0.0176877795 0.012981015 -0.371602357
## mode -0.0176877795 1.0000000000 -0.059650521 0.006760044
## speechiness 0.0129810147 -0.0596505208 1.000000000 0.024945425
## acousticness -0.3716023568 0.0067600443 0.024945425 1.000000000
## instrumentalness -0.1543090413 -0.0057660749 -0.107959332 -0.003100502
## liveness 0.0819270381 -0.0002748805 0.059342894 -0.074539622
## valence 0.0495101626 -0.0030881898 0.064700922 -0.018998564
## tempo 0.0967061101 0.0167097090 0.032728605 -0.114335210
## duration_ms -0.1045846289 0.0128129792 -0.098435680 -0.094143218
## instrumentalness liveness valence tempo
## track_popularity -0.124431070 -0.0527730291 0.02258133 0.004446180
## danceability -0.002266695 -0.1270023619 0.33372863 -0.184575167
## energy 0.023817924 0.1637092244 0.14971046 0.151544510
## key 0.007392712 0.0021365502 0.02161805 -0.010355442
## loudness -0.154309041 0.0819270381 0.04951016 0.096706110
## mode -0.005766075 -0.0002748805 -0.00308819 0.016709709
## speechiness -0.107959332 0.0593428937 0.06470092 0.032728605
## acousticness -0.003100502 -0.0745396224 -0.01899856 -0.114335210
## instrumentalness 1.000000000 -0.0085048515 -0.17416318 0.021484050
## liveness -0.008504851 1.0000000000 -0.01992507 0.022026642
## valence -0.174163182 -0.0199250679 1.00000000 -0.025135288
## tempo 0.021484050 0.0220266420 -0.02513529 1.000000000
## duration_ms 0.059077441 0.0075705280 -0.03332403 -0.001695701
## duration_ms
## track_popularity -0.139681783
## danceability -0.087560915
## energy 0.017584005
## key 0.018809554
## loudness -0.104584629
## mode 0.012812979
## speechiness -0.098435680
## acousticness -0.094143218
## instrumentalness 0.059077441
## liveness 0.007570528
## valence -0.033324030
## tempo -0.001695701
## duration_ms 1.000000000
corrplot(cor_matrix, method="color", type="upper", order="hclust")
The strongest positive correlations with track_popularity are acousticness at 0.092 and danceability at 0.047. The strongest negative correlations with track_popularity are duration_ms at -0.140 and instrumentalness at -0.124. None of the features has a strong correlation with track_popularity. These results are computed on the data before the truncation performed above.
Looking at the other variables, there is a high positive correlation between energy and loudness at 0.682, and a fairly strong negative correlation between acousticness and energy at -0.546.
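To read the popularity correlations off directly, we can sort the relevant column of the correlation matrix (a quick sketch):
# Correlations with track_popularity, from strongest negative to strongest positive
sort(cor_matrix[, "track_popularity"])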
spotify_knn_model <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 5)
# Predict on training data
knn_train_pred <- fitted.values(spotify_knn_model)
# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))
## [1] "In-Sample MSE for KNN: 194.903869196156"
# Predict on testing data
knn_model_test <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 5)
knn_test_pred <- fitted.values(knn_model_test)
# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))
## [1] "Out-of-Sample MSE for KNN: 687.024025645934"
spotify_knn_model_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 20)
# Predict on training data
knn_train_pred_20 <- fitted.values(spotify_knn_model_20)
# Calculate in-sample MSE manually
knn_train_mse_20 <- mean((train_data$track_popularity - knn_train_pred_20)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse_20))
## [1] "In-Sample MSE for KNN: 389.061411631355"
# Predict on testing data
knn_model_test_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 20)
knn_test_pred_20 <- fitted.values(knn_model_test_20)
# Calculate out-of-sample MSE manually
knn_test_mse_20 <- mean((test_data$track_popularity - knn_test_pred_20)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse_20))
## [1] "Out-of-Sample MSE for KNN: 556.981777608275"
We tried the linear regression and KNN models to see which would be best for predicting track popularity, using the mean squared error to compare them. The MSE for the linear regression model with numeric predictors was 529.94 in-sample and 527.51 out-of-sample. The KNN model with k = 5 had an MSE of 194.90 in-sample and 687.02 out-of-sample. Comparing the two, the KNN model performed better than the regression model on the training set but worse on the testing set, a sign of overfitting. We then fit another KNN model with k = 20; its training MSE was 389.06 and its testing MSE 556.98, a better out-of-sample result.
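Rather than checking only k = 5 and k = 20, a small loop (a sketch, not run above) can trace the out-of-sample MSE across several values of k:
# Compare test MSE across a few values of k
for (k in c(5, 10, 20, 40)) {
  fit <- kknn(track_popularity ~ danceability + energy + key + loudness + mode +
                speechiness + acousticness + instrumentalness + liveness +
                valence + tempo + duration_ms,
              train = train_data, test = test_data, k = k)
  mse <- mean((test_data$track_popularity - fitted.values(fit))^2)
  print(paste("k =", k, "- test MSE:", round(mse, 2)))
}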
We did not use all of the variables in the data set, but we used all of our continuous variables. We went this route because we mostly wanted to explore track popularity and its relationship with the continuous variables in our data set.
Theoretically, the model that would fit the data best is the linear regression model because it is the easiest to interpret. However, it must meet four assumptions (linearity, independence, normality, and homoscedasticity). We can run diagnostic plots to see whether our model meets these assumptions.
# diagnostic plots
par(mfrow = c(2, 2))
plot(lm_model_ch)
From the diagnostic plot, we see that the model does not meet the four assumptions, which leads us to the conclusion that the linear regression model is not the best fit for this data.
In practice, the KNN model gave us the best in-sample performance, while the linear regression model gave us the best out-of-sample performance. The evaluation metric we have been using is the mean squared error: the model with the lowest MSE is the best for our predictions. The training data in the KNN model gave us the lowest MSE (194.90), which is what led us to the conclusion that the KNN model would be better for prediction.
Since the MSE is not quite where we want it to be, we are going to do some feature scaling to see if it has any impact on the KNN model. The variables that will be removed are key, mode, acousticness, and tempo.
# Creating a standardized train KNN model (kknn standardizes features by default)
train_new.x <- train_data[, c("track_popularity", "danceability", "energy", "loudness",
                              "speechiness", "instrumentalness", "liveness", "valence", "duration_ms")]
stand_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5)
# Creating an unstandardized KNN model
knn_unstand <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5, scale = FALSE)
#finding the MSE for our standardized model
#first we have to find the prediction of our standardized knn model
pred_stand <- fitted(stand_knn)
train_new.y <- train_new.x$track_popularity
#Standardized MSE
stand_mse <- mean((train_new.y - pred_stand)^2)
#Unstandardized MSE
pred_undstand <- fitted(knn_unstand)
unstand_mse <- mean((train_new.y - pred_undstand)^2)
#output
print(paste("Training MSE of standardized data :", stand_mse))
## [1] "Training MSE of standardized data : 206.7465001884"
print(paste("Training MSE of unstandardized data:", unstand_mse))
## [1] "Training MSE of unstandardized data: 217.941700903636"
The MSE of the standardized data is 206.75 and the MSE of the unstandardized data is 217.94, so there is a fairly small difference between the two. Our original training data had an MSE of 194.90, which is also close to the feature-scaled models. This leads us to the conclusion that feature scaling can make a difference in the overall model, but it might not be essential here.
Now let's evaluate the test data with the same variables removed.
#Creating a standardized test knn model
test_new.x <- test_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]
stand_test_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5)
# Creating a unstandardized knn model
knn_test_unstand <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5, scale = FALSE)
#finding the MSE for our standardized model
#first we have to find the prediction of our standardized knn model
pred_stand_test <- fitted(stand_test_knn)
test_new.y <- test_new.x$track_popularity
#Standardized MSE
test_stand_mse <- mean((test_new.y - pred_stand_test)^2)
#Unstandardized MSE
pred_undstand_test <- fitted(knn_test_unstand)
unstand_mse_test <- mean((test_new.y - pred_undstand_test)^2)
#output
print(paste("Testing MSE of standardized data :", test_stand_mse))
## [1] "Testing MSE of standardized data : 201.450244331452"
print(paste("Test MSE of unstandardized data:", unstand_mse_test))
## [1] "Test MSE of unstandardized data: 213.310812353227"
We fit new KNN models with the reduced variable set and k = 5 on the training and testing sets, with standardized and unstandardized data. The training-set MSE is 206.75 standardized versus 217.94 unstandardized, so standardization gives the better result. For the testing set the standardized model was also better, at 201.45 versus 213.31 unstandardized. Note, however, that these test-set models were both trained and evaluated on the test split, so the comparison with our earlier out-of-sample MSE (687.02) overstates the improvement; still, it suggests that feature scaling and dropping weak predictors have a positive impact on our model.
From the feature scaling, we can conclude that the KNN model is still the best model for our data compared to the linear regression model.
Now we are going to fit a logistic regression model to classify songs based on popularity. Songs whose track_popularity is at or above 70 will get a value of 1, and songs whose track_popularity is below 70 will get a value of 0. We will work on spotify_copy, a fresh copy of the dataset, to do this.
Let's recode track_popularity as a dummy variable: 1 when the popularity is at or above 70 and 0 when it is below 70, meaning that 1 is popular and 0 is not popular.
spotify_copy <- spotify
spotify_copy$track_popularity <- ifelse(spotify_copy$track_popularity >=70, 1 , 0)
Now we have modified the variable.
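As a quick check (a sketch), we can look at the class balance of the new binary target, since imbalance matters when choosing the cut-off later:
# Class balance of the binary popularity variable
table(spotify_copy$track_popularity)
round(prop.table(table(spotify_copy$track_popularity)), 3)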
Randomly split the data to training (80%) and testing (20%) datasets:
set.seed(2023)
index <- sample(1:nrow(spotify_copy),nrow(spotify_copy)*0.80)
spotify_copy.train = spotify_copy[index,]
spotify_copy.test = spotify_copy[-index,]
spotify_copy.glm0<- glm(track_popularity~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, family=binomial, data=spotify_copy.train)
summary(spotify_copy.glm0)
##
## Call:
## glm(formula = track_popularity ~ danceability + energy + key +
## loudness + mode + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, family = binomial,
## data = spotify_copy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1331 -0.5000 -0.4123 -0.2795 4.2276
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.066e+00 3.085e-01 3.455 0.000551 ***
## danceability 3.319e-01 1.902e-01 1.745 0.081027 .
## energy -3.095e+00 2.169e-01 -14.269 < 2e-16 ***
## key -9.644e-03 6.523e-03 -1.478 0.139311
## loudness 2.076e-01 1.310e-02 15.843 < 2e-16 ***
## mode 1.286e-01 4.805e-02 2.676 0.007446 **
## speechiness -7.592e-01 2.401e-01 -3.162 0.001568 **
## acousticness 1.473e-01 1.228e-01 1.199 0.230523
## instrumentalness -3.071e+00 2.814e-01 -10.916 < 2e-16 ***
## liveness -4.376e-01 1.710e-01 -2.558 0.010518 *
## valence 4.779e-01 1.159e-01 4.125 3.71e-05 ***
## tempo 2.652e-03 8.573e-04 3.093 0.001980 **
## duration_ms -1.883e-06 4.619e-07 -4.076 4.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14035 on 22683 degrees of freedom
## Residual deviance: 13193 on 22671 degrees of freedom
## AIC: 13219
##
## Number of Fisher Scoring iterations: 7
Observation: As we can see from this model, energy, loudness, mode, speechiness, instrumentalness, liveness, valence, tempo and duration_ms are significant in predicting whether a song is going to be popular or not.
pred.glm0.train <- predict(spotify_copy.glm0,type="response")
hist(pred.glm0.train)
Observation: This histogram shows the distribution of the predicted probabilities and illustrates the impact of choosing different cut-off probabilities. Choosing a small cut-off probability will result in many cases being predicted as 1, while choosing a large cut-off probability will result in few cases being predicted as 1.
#ROC curve
library(ROCR)
pred <- prediction(pred.glm0.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
Observations: The ROC curve plots the true positive rate against the false positive rate. We want the curve pulled toward the top-left corner, i.e., values closer to 1.
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6855791
Observations: We got the AUC (area under the curve). Again, we would like a number closer to 1, which would mean perfect prediction. At 0.6856, it is moderate but usable.
Now, let's do the out-of-sample prediction, which is more important.
#### Out-of-sample prediction
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")
pred <- prediction(pred.glm0.test, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6610358
Observations: The AUC is a bit lower out of sample, with a score of 0.661.
In order to get the confusion matrix of true positive and true negative predictions, we will determine the optimal cut-off probability using a grid search. We will first use weights of 5 (for false negatives) and 1 (for false positives).
# define a cost function with input "obs" being the observed response,
# "pred.p" being the predicted probability, and "pcut" being the threshold
costfunc <- function(obs, pred.p, pcut){
  weight1 <- 5 # weight for "true=1 but pred=0" (false negative)
  weight0 <- 1 # weight for "true=0 but pred=1" (false positive)
  c1 <- (obs == 1) & (pred.p < pcut)  # count for "true=1 but pred=0" (FN)
  c0 <- (obs == 0) & (pred.p >= pcut) # count for "true=0 but pred=1" (FP)
  cost <- mean(weight1 * c1 + weight0 * c0) # weighted misclassification cost
  return(cost)
}
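For example (a sketch), we can evaluate this weighted cost at the naive 0.5 cut-off before searching for a better one:
# Weighted misclassification cost at the default 0.5 cut-off
costfunc(obs = spotify_copy.train$track_popularity,
         pred.p = pred.glm0.train, pcut = 0.5)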
Next, we will define a sequence of candidate probabilities (we will search for the optimal cut-off within this sequence).
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
Then, we will calculate the cost (as defined above) for each probability in the sequence p.seq.
# loop over all p-cuts to see which one gives the smallest cost;
# first define a zero vector to store the cost for each pcut
cost <- rep(0, length(p.seq))
for (i in 1:length(p.seq)) {
  cost[i] <- costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0.train, pcut = p.seq[i])
}
#cbind(p.seq, cost)
Last, we will draw a plot of cost against p.seq and find the cut-off that gives us the minimum cost.
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, cost)
# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)
## [1] 0.16
Observations: With weights of 5 and 1, we got 0.16 as the optimal cut-off probability.
Now let's calculate the MR, FPR, FNR, and cost based on the optimal cut-off.
# step 1. get binary classification
class.glm0.train.opt<- (pred.glm0.train>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.train$track_popularity, class.glm0.train.opt, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 18732 1843
## 1 1636 473
Observations: With weights of 5 and 1, we got 18,732 TN, 473 TP, 1,843 FP, and 1,636 FN. For example, 473 songs were predicted as popular and actually were popular, although a large share of popular songs is still missed.
Now we want to obtain the MR, FPR, FNR, and cost.
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.opt)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.opt==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.opt==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.opt, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.153368012696174"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0895747266099635"
print(paste0("FNR:",FNR))
## [1] "FNR:0.775723091512565"
print(paste0("cost:",cost))
## [1] "cost:0.441853288661612"
We got 0.1534 for the MR, which is reasonably accurate for such complicated data. We also got 0.0896 for the FPR, 0.7757 for the FNR, and 0.4419 for the cost; the high FNR reflects how many popular songs are missed.
Now, let's do the same with the out-of-sample data, which is more important, and calculate the MR.
# step 0. obtain predicated values on testing data (we actually did this already, but let's do it again)
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")
# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.opt <- (pred.glm0.test>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.test$track_popularity, pred.glm0.test.opt, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 4698 448
## 1 416 110
Observations: With weights of 5 and 1 (optimal cut-off 0.16), we got 4,698 TN, 110 TP, 448 FP, and 416 FN. For example, 110 songs were predicted as popular and actually were popular.
MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.opt)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.opt==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.opt==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.opt, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.152327221438646"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0870579090555771"
print(paste0("FNR:",FNR))
## [1] "FNR:0.790874524714829"
print(paste0("cost:",cost))
## [1] "cost:0.445698166431594"
The out-of-sample MR is still good, with a score of 0.1523. We also got 0.0871 for the FPR, 0.7909 for the FNR, and 0.4457 for the cost. The MR and FPR were slightly better on the testing set, while the FNR and cost were slightly worse.
spotify_glm0_cat<- glm(track_popularity~ playlist_genre + playlist_subgenre, family=binomial, data=spotify_copy.train)
summary(spotify_glm0_cat)
##
## Call:
## glm(formula = track_popularity ~ playlist_genre + playlist_subgenre,
## family = binomial, data = spotify_copy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9468 -0.5258 -0.3472 -0.1463 3.3090
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.4706 0.4474 -12.229 < 2e-16
## playlist_genrelatin 2.1066 0.4832 4.360 1.30e-05
## playlist_genrepop 4.9005 0.4532 10.813 < 2e-16
## playlist_genrer&b 3.5616 0.4578 7.781 7.21e-15
## playlist_genrerap 3.4258 0.4587 7.469 8.10e-14
## playlist_genrerock 3.1048 0.4647 6.681 2.38e-11
## playlist_subgenrebig room 0.1464 0.6715 0.218 0.827462
## playlist_subgenreclassic rock -0.1167 0.1790 -0.652 0.514502
## playlist_subgenredance pop -0.6406 0.1037 -6.178 6.50e-10
## playlist_subgenreelectro house 0.9380 0.5332 1.759 0.078579
## playlist_subgenreelectropop -1.1165 0.1131 -9.868 < 2e-16
## playlist_subgenregangster rap -1.1283 0.1873 -6.025 1.69e-09
## playlist_subgenrehard rock -0.7950 0.2064 -3.853 0.000117
## playlist_subgenrehip hop 0.4123 0.1312 3.142 0.001676
## playlist_subgenrehip pop 0.1436 0.1470 0.977 0.328782
## playlist_subgenreindie poptimism -2.2086 0.1422 -15.535 < 2e-16
## playlist_subgenrelatin hip hop 0.9102 0.2183 4.169 3.06e-05
## playlist_subgenrelatin pop 1.9038 0.2023 9.410 < 2e-16
## playlist_subgenreneo soul -1.8874 0.2208 -8.549 < 2e-16
## playlist_subgenrenew jack swing -3.2198 0.4588 -7.018 2.25e-12
## playlist_subgenrepermanent wave 0.7548 0.1584 4.766 1.88e-06
## playlist_subgenrepop edm 2.7406 0.4715 5.812 6.16e-09
## playlist_subgenrepost-teen pop NA NA NA NA
## playlist_subgenreprogressive electro house NA NA NA NA
## playlist_subgenrereggaeton 1.5118 0.2209 6.844 7.71e-12
## playlist_subgenresouthern hip hop -1.0521 0.1711 -6.149 7.79e-10
## playlist_subgenretrap NA NA NA NA
## playlist_subgenretropical NA NA NA NA
## playlist_subgenreurban contemporary NA NA NA NA
##
## (Intercept) ***
## playlist_genrelatin ***
## playlist_genrepop ***
## playlist_genrer&b ***
## playlist_genrerap ***
## playlist_genrerock ***
## playlist_subgenrebig room
## playlist_subgenreclassic rock
## playlist_subgenredance pop ***
## playlist_subgenreelectro house .
## playlist_subgenreelectropop ***
## playlist_subgenregangster rap ***
## playlist_subgenrehard rock ***
## playlist_subgenrehip hop **
## playlist_subgenrehip pop
## playlist_subgenreindie poptimism ***
## playlist_subgenrelatin hip hop ***
## playlist_subgenrelatin pop ***
## playlist_subgenreneo soul ***
## playlist_subgenrenew jack swing ***
## playlist_subgenrepermanent wave ***
## playlist_subgenrepop edm ***
## playlist_subgenrepost-teen pop
## playlist_subgenreprogressive electro house
## playlist_subgenrereggaeton ***
## playlist_subgenresouthern hip hop ***
## playlist_subgenretrap
## playlist_subgenretropical
## playlist_subgenreurban contemporary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14035 on 22683 degrees of freedom
## Residual deviance: 12271 on 22660 degrees of freedom
## AIC: 12319
##
## Number of Fisher Scoring iterations: 7
Observation: Most of the coefficients are significant at the 0.05 level. The exceptions are playlist_subgenrebig room, playlist_subgenreclassic rock, playlist_subgenreelectro house and playlist_subgenrehip pop, which are not significant, while playlist_subgenrepost-teen pop, playlist_subgenreprogressive electro house, playlist_subgenretrap, playlist_subgenretropical and playlist_subgenreurban contemporary could not be estimated at all (NA): they are collinear with the playlist_genre terms, which is why R reports five coefficients as "not defined because of singularities".
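We can list the dropped coefficients directly from the fitted model object:
# the aliased (NA) coefficients: subgenre dummies that are linear
# combinations of the playlist_genre terms and so cannot be estimated
names(coef(spotify_glm0_cat))[is.na(coef(spotify_glm0_cat))]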
pred.glm0_cat.train <- predict(spotify_glm0_cat,type="response")
#ROC curve
library(ROCR)
pred <- prediction(pred.glm0_cat.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7593949
pred.glm0.test_cat<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred <- prediction(pred.glm0.test_cat, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7539842
Observation: For the training set we get an AUC of 0.7594 and for the testing set we get 0.7540. (The rank-deficient-fit warning above is expected, given the five aliased subgenre coefficients.)
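For a visual comparison of the two models, we can overlay their testing-set ROC curves (a sketch reusing the prediction vectors computed above: pred.glm0.test for Model 1 and pred.glm0.test_cat for Model 2):
# overlay the test-set ROC curves of both logistic models
perf1 <- performance(prediction(pred.glm0.test, spotify_copy.test$track_popularity), "tpr", "fpr")
perf2 <- performance(prediction(pred.glm0.test_cat, spotify_copy.test$track_popularity), "tpr", "fpr")
plot(perf1, col = "grey50")
plot(perf2, col = "steelblue", add = TRUE)
legend("bottomright", c("Model 1", "Model 2 (genre/subgenre)"), col = c("grey50", "steelblue"), lty = 1)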
# define a cost function with input "obs" being the observed response,
# "pred.p" the predicted probability, and "pcut" the threshold
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # weight for "true=1 but pred=0" (FN)
weight0 = 1 # weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # indicator for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # indicator for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # weighted misclassification cost
return(cost) # an R function must return a value
} # end of the function
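A tiny worked example (made-up values) shows the weighting in action: one false negative (weight 5) and one false positive (weight 1) among four observations give a cost of (5 + 1) / 4 = 1.5.
# made-up example: obs 3 is an FN (weight 5), obs 4 is an FP (weight 1)
costfunc(obs = c(1, 0, 1, 0), pred.p = c(0.20, 0.10, 0.05, 0.60), pcut = 0.15) # 1.5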
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0_cat.train, pcut = p.seq[i])
} # end of the loop
#cbind(p.seq, cost)
# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)
## [1] 0.17 0.18
# two cut-offs tie for the minimum cost; keep the first so the threshold
# used below is a single number rather than a length-2 vector
optimal.pcut.glm0 <- optimal.pcut.glm0[1]
The cost is minimized at two tied cut-offs, 0.17 and 0.18, and we keep the first. Now we will get MR, FPR, FNR and cost to compare the two models.
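As an aside, the same grid search can be written more compactly (an equivalent sketch of the loop above):
# evaluate the cost at every candidate cut-off in one vectorized call;
# which.min breaks ties by taking the first minimizer
cost <- sapply(p.seq, function(p) costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0_cat.train, pcut = p))
optimal.pcut.glm0 <- p.seq[which.min(cost)]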
# step 1. get binary classification
class.glm0.train.cat<- (pred.glm0_cat.train>optimal.pcut.glm0)*1
# get values
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.cat)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.cat==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.cat==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.cat, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.151340151648739"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0982260024301337"
print(paste0("FNR:",FNR))
## [1] "FNR:0.669511616880038"
print(paste0("cost:",cost))
## [1] "cost:0.400326221125022"
# step 0. obtain predicted values on testing data (we actually did this already, but let's do it again)
pred.glm0.test<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.cat <- (pred.glm0.test>optimal.pcut.glm0)*1
#Get values
MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.cat)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.cat==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.cat==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.cat, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.157087447108604"
print(paste0("FPR:",FPR))
## [1] "FPR:0.104741546832491"
print(paste0("FNR:",FNR))
## [1] "FNR:0.669201520912547"
print(paste0("cost:",cost))
## [1] "cost:0.405324400564175"
##### Conclusion: logistic regression
Observation: The new model achieves a better AUC on the training set (0.7594) than on the testing set (0.7540), and both values are better than Model 1's (0.6856 and 0.6610 respectively). We then calculated MR, FPR, FNR and cost for both models on both sets. For the first model we got MR = 0.1534, FPR = 0.0896, FNR = 0.7757 and cost = 0.4419 on the training set, and MR = 0.1523, FPR = 0.0871, FNR = 0.7909 and cost = 0.4457 on the testing set, so Model 1 did slightly better out of sample except for FNR. For the second model we got MR = 0.1513, FPR = 0.0982, FNR = 0.6695 and cost = 0.4003 on the training set, and MR = 0.1571, FPR = 0.1047, FNR = 0.6692 and cost = 0.4053 on the testing set, so Model 2 did slightly better in sample. Overall Model 2 is preferable: it has the higher AUC, the lower weighted cost and a substantially lower FNR, even though its testing MR is marginally higher than Model 1's.
We are first going to make a new data set without the categorical variables, then recode our track_popularity variable into two levels.
spotify_new = subset(spotify, select = -c(track_album_release_date, track_id, track_name, track_artist, track_album_name, track_album_id, playlist_genre, playlist_name, playlist_subgenre, playlist_id) )
spotify_new$track_popularity <- ifelse(spotify_new$track_popularity >=70, 1 , 0)
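Before modelling, it is worth checking how imbalanced the binarized response is, since this will matter for the SVM below:
# check the balance of the new binary response; tracks with popularity
# >= 70 ("popular") are a small minority of the data
prop.table(table(spotify_new$track_popularity))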
Now we are going to split our new spotify data.
#splitting the data
set.seed(2023)
index <- sample(1:nrow(spotify_new), nrow(spotify_new)*0.80)
spotify_svm_train <- spotify_new[index,]
spotify_svm_test <- spotify_new[-index,]
Now we are going to fit an SVM model without class weights.
#fitting the model without class weights
library(e1071)
spotify_new$track_popularity<- as.factor(spotify_new$track_popularity)
str(spotify_new)
## 'data.frame': 28356 obs. of 13 variables:
## $ track_popularity: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
#svm model
spotify_svm = svm(as.factor(track_popularity) ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = spotify_svm_train, kernel = 'linear')
#summary
summary(spotify_svm)
##
## Call:
## svm(formula = as.factor(track_popularity) ~ danceability + energy +
##     key + loudness + mode + speechiness + acousticness + instrumentalness +
##     liveness + valence + tempo + duration_ms, data = spotify_svm_train,
##     kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 4832
##
## ( 2723 2109 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
# predictions on the train data
pred_spotify_train = predict(spotify_svm, spotify_svm_train)
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_svm_train$track_popularity ,
pred = pred_spotify_train)
Cmatrix_train
## pred
## true 0 1
## 0 20575 0
## 1 2109 0
#Mis-classification rate
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.09297302
The confusion matrix on the training data shows 20575 true negatives, 0 false positives, 2109 false negatives and 0 true positives: the model classifies every track as not popular.
The mis-classification error rate is 0.0930, but this is low only because popular tracks are rare.
# Predictions on the test data
pred_spotify_test <- predict(spotify_svm, spotify_svm_test)
# Confusion matrix to evaluate the model on test data
Cmatrix_test = table(true = spotify_svm_test$track_popularity,
pred = pred_spotify_test)
Cmatrix_test
## pred
## true 0 1
## 0 5146 0
## 1 526 0
#Mis-classification rate
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.09273625
The confusion matrix on the testing data shows 5146 true negatives, 0 false positives, 526 false negatives and 0 true positives; again every track is predicted as not popular.
The mis-classification error rate is 0.0927.
Now we are going to fit an SVM model with class weights, penalizing errors on the popular class twice as heavily, using the training data.
spotify_svm_asymmetric = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear', class.weights = c("0" = 1, "1" = 2))
#Summary
summary(spotify_svm_asymmetric)
##
## Call:
## svm(formula = as.factor(track_popularity) ~ danceability + energy +
## loudness + speechiness + mode + key + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = spotify_svm_train,
## kernel = "linear", class.weights = c(`0` = 1, `1` = 2))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 7076
##
## ( 4967 2109 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
# predictions for train data
pred_spotify_train <- predict(spotify_svm_asymmetric, spotify_svm_train)
#Confusion matrix for train data
Cmatrix_train_21 <- table(true =spotify_svm_train$track_popularity,
pred = pred_spotify_train)
print(Cmatrix_train_21)
## pred
## true 0 1
## 0 20575 0
## 1 2109 0
#Mis-classification rate
1 - sum(diag(Cmatrix_train_21))/sum(Cmatrix_train_21)
## [1] 0.09297302
Even with the 1:2 class weights, the training confusion matrix is unchanged: 20575 true negatives, 0 false positives, 2109 false negatives and 0 true positives.
The mis-classification error rate is still 0.0930.
Now we evaluate the weighted SVM model on the testing data.
# predictions on the testing data
pred_spotify_test <- predict(spotify_svm_asymmetric, spotify_svm_test)
#Confusion matrix
Cmatrix_test_22 <- table( true = spotify_svm_test$track_popularity, pred = pred_spotify_test)
print(Cmatrix_test_22)
## pred
## true 0 1
## 0 5146 0
## 1 526 0
#Mis-classification rate
MR <- 1 - sum(diag(Cmatrix_test_22))/sum(Cmatrix_test_22)
MR
## [1] 0.09273625
The confusion matrix on the testing data again shows no track predicted as popular: 5146 true negatives, 0 false positives, 526 false negatives and 0 true positives.
The mis-classification error rate is 0.0927; a weight of 2 on the minority class is evidently not enough to move the decision boundary.
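Given the roughly 10:1 class imbalance, one option would be inverse-frequency class weights, which put about ten times more weight on the popular class (a sketch with our own variable names, not a tuned model):
# sketch (untuned): weight each class inversely to its frequency so the
# rare "popular" class is not ignored by the fit
train_bal <- spotify_svm_train
train_bal$track_popularity <- as.factor(train_bal$track_popularity)
tab <- table(train_bal$track_popularity)
w <- setNames(as.numeric(sum(tab) / (2 * tab)), names(tab))
spotify_svm_bal <- svm(track_popularity ~ ., data = train_bal, kernel = 'linear', class.weights = w)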
Now we are going to obtain the ROC and AUC for our SVM model. We will then use our output to compare it to the ROC and AUC we obtained in our logistic model.
#ROC and AUC using the train data
spotify_svm_prob = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms,
data = spotify_svm_train, kernel = 'linear',
probability = TRUE)
pred_prob_train = predict(spotify_svm_prob,
newdata = spotify_svm_train,
probability = TRUE)
# select the probability of class "1" by name rather than by column
# position, since the column order is not guaranteed
pred_prob_train = attr(pred_prob_train, "probabilities")[, "1"]
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5866711
The training AUC is 0.5867, barely better than random guessing.
# predicted probabilities on the testing data
pred_prob_test = predict(spotify_svm_prob,
newdata = spotify_svm_test,
probability = TRUE)
# again select the class-"1" probability column by name
pred_prob_test = attr(pred_prob_test, "probabilities")[, "1"]
#ROC
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity )
perf <- performance(pred, "tpr", "fpr")
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5698786
The testing AUC is 0.5699.
When comparing the ROC and AUC for our SVM model to our AUC and ROC for our logistic regression model, we found that the logistic regression model performed better because it had a higher AUC for both the train and test data.
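If we wanted to improve the SVM rather than discard it, cross-validating the cost parameter would be the natural next step (an illustrative sketch; the grid is an assumption, and linear SVMs are slow on a training set this large):
# illustrative only: tune the SVM cost over a small, assumed grid
set.seed(2023)
svm_tune_train <- transform(spotify_svm_train, track_popularity = as.factor(track_popularity))
tuned <- tune.svm(track_popularity ~ ., data = svm_tune_train, kernel = 'linear', cost = c(0.1, 1, 10))
summary(tuned) # cross-validated error for each cost value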
For the tree models, we will create a copy of the spotify data under a new name and drop the identifier and playlist variables for the purpose of this analysis.
spotify_dec = subset(spotify, select = -c(track_album_release_date, track_id, track_name, track_artist, track_album_name, track_album_id, playlist_genre, playlist_subgenre, playlist_name, playlist_id) )
spotify_dec$track_popularity <- ifelse(spotify_dec$track_popularity >=70, 1 , 0)
str(spotify_dec)
## 'data.frame': 28356 obs. of 13 variables:
## $ track_popularity: num 0 0 1 0 0 0 0 0 0 0 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
spotify_dec$mode<- as.factor(spotify_dec$mode)
spotify_dec$key<- as.factor(spotify_dec$key)
str(spotify_dec)
## 'data.frame': 28356 obs. of 13 variables:
## $ track_popularity: num 0 0 1 0 0 0 0 0 0 0 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
set.seed(2023)
index <- sample(1:nrow(spotify_dec),nrow(spotify_dec)*0.60)
spotify_dec.train = spotify_dec[index,]
spotify_dec.test = spotify_dec[-index,]
library(rpart)
library(rpart.plot)
# fit the model
fit_tree_dec <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train)
summary(fit_tree_dec)
## Call:
## rpart(formula = as.factor(track_popularity) ~ danceability +
## energy + speechiness + loudness + instrumentalness + mode +
## key + acousticness + liveness + valence + tempo, data = spotify_dec.train)
## n= 17013
##
## CP nsplit rel error xerror xstd
## 1 0 0 1 0 0
##
## Node number 1: 17013 observations
## predicted class=0 expected loss=0.09351672 P(node) =1
## class counts: 15422 1591
## probabilities: 0.906 0.094
# rpart.plot(fit_tree_dec, yesno=2, extra=4) # skipped: the tree has no splits, so there is nothing to plot
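The tree fails to split because no single split improves enough to beat rpart's default complexity penalty (cp = 0.01). Lowering cp is one way to force the tree to grow (a sketch, not the approach we take below):
# sketch: a smaller complexity parameter lets the tree grow; the CP table
# can then be used to choose a pruning point
fit_tree_deep <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness + mode + key + acousticness + liveness + valence + tempo, data = spotify_dec.train, control = rpart.control(cp = 0.001))
printcp(fit_tree_deep) # cross-validated error at each candidate cp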
Obtain the predicted values for training data and confusion matrix.
# Make predictions on the training data
pred_tree_train <- predict(fit_tree_dec, spotify_dec.train, type="class")
# Confusion matrix to evaluate the model on the training data
Cmatrix_train = table(true = spotify_dec.train$track_popularity,
pred = pred_tree_train)
Cmatrix_train
## pred
## true 0 1
## 0 15422 0
## 1 1591 0
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.09351672
Observation: This model has a very small misclassification rate of 0.0935, but the confusion matrix for the training set shows that no observation is predicted as 1.
Obtain the predicted values for testing data and confusion matrix.
# Make predictions on the testing data
pred_tree_test <- predict(fit_tree_dec, spotify_dec.test, type="class")
# Confusion matrix to evaluate the model on the testing data
Cmatrix_test = table(true = spotify_dec.test$track_popularity,
pred = pred_tree_test)
Cmatrix_test
## pred
## true 0 1
## 0 10299 0
## 1 1044 0
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.09203914
Observation:
Repeating this on the testing set, the MR is even smaller at 0.092, but still no observation is predicted as 1. We will therefore use asymmetric costs, assigning different weights to the two error types, to see whether this improves the model.
Fitting a classification tree with asymmetric costs works differently again: rpart takes a loss matrix.
# Define the loss matrix first: rows are the true classes, columns the
# predicted classes, and the diagonal (correct predictions) must stay 0
cost_matrix <- matrix(c(0, 1, # true 0: cost 1 for predicting 1 (FP)
6, 0), # true 1: cost 6 for predicting 0 (FN)
byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train, parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)
Since the tree now assigns some observations to class 1, we can visualize it in a plot. Energy sits at the root, making it the strongest single predictor of popularity in this tree: according to the plot, tracks with energy below 0.82 are predicted as 1, and tracks with loudness above -5.1 are also predicted as 1.
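To double-check this reading of the plot, the tree's splits can also be printed as plain-text rules (rpart.rules() comes from the rpart.plot package loaded above):
# print the decision rules of the asymmetric-cost tree as text
rpart.rules(fit_tree_asym)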
#get predictions for training
pred_spotify_train <- predict(fit_tree_asym, spotify_dec.train,
type = "class")
#C matrix for training
Cmatrix_train_asy <- table( true = spotify_dec.train$track_popularity, pred = pred_spotify_train)
Cmatrix_train_asy
## pred
## true 0 1
## 0 13157 2265
## 1 1054 537
1 - sum(diag(Cmatrix_train_asy))/sum(Cmatrix_train_asy)
## [1] 0.1950861
Observation: With the redistributed weights we finally get predictions of 1 (popular): 537 TP, 2265 FP, 13157 TN and 1054 FN on the training set. The trade-off is a larger MR of 0.1951. Let's get the asymmetric predictions for the testing set.
#get predictions for testing
pred_tree_asym_test <- predict(fit_tree_asym, spotify_dec.test, type = "class")
#C matrix for testing with more weight on "1"
Cmatrix_test_weight = table( true = spotify_dec.test$track_popularity, pred = pred_tree_asym_test)
Cmatrix_test_weight
## pred
## true 0 1
## 0 8820 1479
## 1 707 337
1 - sum(diag(Cmatrix_test_weight))/sum(Cmatrix_test_weight)
## [1] 0.192718
Observation: For the testing set we get 337 TP, 1479 FP, 8820 TN and 707 FN. The MR is now slightly lower, at 0.1927. Next, we will obtain the AUC for the training and testing data.
# obtain predicted probability
pred_prob_train = predict(fit_tree_dec, spotify_dec.train, type = "prob")
# predict() for a tree returns a two-column matrix of class probabilities,
# one column per class; keep the column for class "1"
pred_prob_train = pred_prob_train[,"1"]
library(ROCR)
pred <- prediction(pred_prob_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5
# obtain predicted probability
pred_prob_test = predict(fit_tree_dec, spotify_dec.test, type = "prob")
# keep the class-"1" probability column, as above
pred_prob_test = pred_prob_test[,"1"]
#ROC
pred <- prediction(pred_prob_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5
Observation: For the unweighted model, the AUC is 0.5 on both the training and testing data and the ROC curve is the diagonal line. This is because a root-only tree assigns every observation the same predicted probability, so it cannot rank tracks at all. Now let's get the AUC-ROC for the model with weights.
# obtain predicted probability
pred_asym_train = predict(fit_tree_asym, spotify_dec.train, type = "prob")
# keep the class-"1" probability column, as above
pred_asym_train = pred_asym_train[,"1"]
library(ROCR)
pred <- prediction(pred_asym_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6613876
# obtain predicted probability
pred_asym_test = predict(fit_tree_asym, spotify_dec.test, type = "prob")
# keep the class-"1" probability column, as above
pred_asym_test = pred_asym_test[,"1"]
library(ROCR)
pred <- prediction(pred_asym_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6450638
Observation: With the weighted model the ROC is a proper curve, with an AUC of 0.6614 for the training set and 0.6451 for the testing set, a clear improvement over the unweighted tree. Overall, although the MR is lower for the model without weights, that does not make it better, since it never classifies anything as popular; the weighted model has a larger MR but does genuinely useful classification, and its AUC is higher.
Conclusion: While exploring which variables best predict track_popularity, we found that energy and loudness were our strongest predictors. We tested the variables with several prediction models (logistic regression, KNN, SVM and tree models) to see which performed better, and found that the weighted tree model served our goal best: among all the models it produced the most useful confusion matrix, actually identifying popular tracks, together with a solid AUC.