The Spotify dataset comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata about songs from Spotify's API. Kaylin Pavlik wrote a recent blog post using the audio features to explore and classify songs; she used the spotifyr package to collect about 5,000 songs from 6 main categories.
The data contains general metadata about songs from Spotify's API: each track's popularity along with parameters such as acousticness, danceability, energy, speechiness, valence and key, plus genres and subgenres.
With this analysis we are interested in how track popularity is influenced by the other variables.
We want to answer the question: What characteristics of a song determine its popularity? Which genres and subgenres usually predict the most popular songs?
The plan is to analyze the relationship between popularity and different features of a song so that we can predict a song's future popularity with data modelling. We plan on performing data preparation, EDA and modelling with models such as linear regression, KNN, logistic regression, SVM and tree models, comparing which one performs better and which one best predicts whether a song is going to be popular or not.
This is beneficial for marketing to Spotify customers and improving their experience while using Spotify. Spotify will be able to provide more accurate predictions of a new song's potential popularity even before its release, and the consumer will be able to identify which factors influence the popularity of a song on Spotify.
library(tidyverse)
library(ggplot2)
library(kknn)
library(corrplot)
library(readr)
library(rpart)
library(rpart.plot)
tidyverse: assists with data import, tidying, manipulation, and visualization.
ggplot2: a package for producing statistical, or data, graphics.
kknn: performs k-nearest neighbor classification and regression.
corrplot: graphical display of a correlation matrix and confidence intervals.
readr: provides a fast way to read rectangular data.
rpart: implements the classification and regression tree (CART) algorithm.
rpart.plot: an enhanced plotting package for rpart.
spotify <- read.csv("/Users/evabeyebach/Desktop/spotify.csv")
We are going to examine our data in order to see how we can best clean it for further analysis.
### Checking dimension of Data
dim(spotify)
## [1] 32833 23
The original dataset contains 32833 rows and 23 columns.
# show the first few rows
head(spotify)
#### Checking column names
names(spotify)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
Now, we will look at the structure, summary statistics, data types, duplicates and missing values, and then clean the data.
### Checking structure of Data
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Observations:
str() shows the structure of the data. We can see that track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre and playlist_subgenre are character variables, while track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_ms are numeric. We need to change track_album_release_date to a date variable. We will also change playlist_genre to a factor for future plotting.
# Modifying data types
spotify$track_album_release_date <- as.Date(spotify$track_album_release_date)
spotify$playlist_genre <- as.factor(spotify$playlist_genre)
#summary statistics
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
##
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Min. :1957-01-01
## Class :character Class :character 1st Qu.:2010-12-04
## Mode :character Mode :character Median :2017-01-27
## Mean :2012-09-09
## 3rd Qu.:2019-05-16
## Max. :2020-01-29
## NA's :1886
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 edm :6043 Length:32833
## Class :character Class :character latin:5155 Class :character
## Mode :character Mode :character pop :5507 Mode :character
## r&b :5431
## rap :5746
## rock :4951
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
##
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
##
Observations:
duration_ms has a max of 517810 ms (about 8.6 minutes) and tempo has a max of 239.44 BPM. Since the dependent variable track_popularity is numeric and continuous, candidate models are a linear regression model or a regression tree.
# let's look at some tables for the categorical variables
table(spotify$playlist_genre)
##
## edm latin pop r&b rap rock
## 6043 5155 5507 5431 5746 4951
table(spotify$playlist_subgenre)
##
## album rock big room classic rock
## 1065 1206 1296
## dance pop electro house electropop
## 1298 1511 1408
## gangster rap hard rock hip hop
## 1458 1485 1322
## hip pop indie poptimism latin hip hop
## 1256 1672 1656
## latin pop neo soul new jack swing
## 1262 1637 1133
## permanent wave pop edm post-teen pop
## 1105 1517 1129
## progressive electro house reggaeton southern hip hop
## 1809 949 1675
## trap tropical urban contemporary
## 1291 1288 1405
table(spotify$key)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 3454 4010 2827 913 2201 2680 2670 3352 2430 3027 2273 2996
table(spotify$mode)
##
## 0 1
## 14259 18574
Observations:
mode is a binary variable. playlist_genre and playlist_subgenre are categorical variables describing the genre of the music.
# looking for duplicates
dups_id <- sum(duplicated(spotify$track_id))
print(dups_id)
## [1] 4477
We can see that a lot of songs are duplicated in this dataset; they share the same track_id. Therefore, we will remove them before further analysis.
spotify_dups <- spotify[duplicated(spotify$track_id), ]
spotify <- spotify[!duplicated(spotify$track_id), ]
We copied the duplicate rows to a separate dataset called spotify_dups and removed them from the working dataset.
#looking for missing values
sum(is.na(spotify))
## [1] 1693
There are 1693 missing values in this dataset. However, we will not remove them, since the affected rows might still be important for the analysis.
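As a quick check (a minimal sketch using base R), we can see which columns the missing values live in:
# count missing values per column and keep only columns that have any
na_counts <- colSums(is.na(spotify))
na_counts[na_counts > 0]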
After performing data preparation we found the following:
Different ways we could look at the data to answer our questions are histograms, to learn about the distribution of variables; boxplots, to find outliers and see which genres are most popular; scatterplots, to interpret the variables against track_popularity; and a correlation matrix, to find the correlation between variables.
We plan on splitting the data into training and testing sets and fitting different models to see which variables best predict the popularity of a song. We might choose different variables to see if a model performs better and improves its R^2. We are also going to compare each model's AUC or MR/MSE against the others.
We will not use track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name nor playlist_id; we only want to focus on the sound parameters and genres. We will plot some variables to observe their skewness and distribution, and interpret them.
histograms <- names(spotify)[c(4, 12:23)]
songs1 <- spotify %>%
  select(all_of(histograms)) %>%
  pivot_longer(cols = all_of(histograms))
songs1 %>%
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~name, ncol = 5, scales = 'free') +
labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
theme_void()
Observations:
Duration and valence look roughly normally distributed. Danceability, energy and loudness are left-skewed. Acousticness, liveness and speechiness are right-skewed. Key and mode are categorical variables. track_popularity peaks around 50-60 and is roughly normally distributed; there are a lot of zero values, possibly missing values, and very few songs have a popularity above 90.
ggplot(spotify, aes(x = playlist_genre, y = track_popularity, fill = playlist_genre)) + geom_boxplot()
ggplot(spotify, aes(x = playlist_subgenre, y = track_popularity, fill = playlist_subgenre)) + geom_boxplot()
Observations:
pop is the most popular genre, followed by latin and rock. post-teen pop is the most popular subgenre, followed by dance pop and permanent wave.
# new dataset with some variables
feature_names <- names(spotify)[c(12:13, 15, 17:19, 23)]
# new dataset displaying only these variables for the 500 most popular songs
songs <- spotify %>%
  arrange(desc(track_popularity)) %>%
  head(n = 500) %>%
  pivot_longer(cols = all_of(feature_names))
#plotting
songs %>%
ggplot(aes(x = name, y = value)) +
geom_jitter(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 4, scales = 'free') +
labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
theme(axis.text.y = element_blank())
Observation:
Among the most popular songs, higher danceability, energy and loudness usually mean more popularity; lower speechiness, acousticness and instrumentalness appear more often; and shorter duration tends to go with more popularity. If we turn track_popularity into a dummy variable, we could also use other models such as logistic regression, SVM and a regression tree. We will plot boxplots of every continuous variable to identify outliers.
ggplot(data = songs1) +
geom_boxplot(aes(y = value)) +
facet_wrap(~name, nrow = 4, scales = "free") +
coord_flip() +
ggtitle("Outlier analysis", subtitle = "For different song attributes") +
theme_void()
Observations:
From the boxplots we can observe that a lot of variables (danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, duration) have outliers. Removing them would influence the analysis a lot, so we will create a copy of the dataset and treat the outliers there.
spotify_copy <- spotify
Now, we will truncate danceability, energy, speechiness, acousticness, instrumentalness and liveness.
# truncation danceability, energy, speechiness, acousticness, instrumentalness and liveness
spotify_copy$danceability[spotify_copy$danceability <= 0.28] <- 0.28
spotify_copy$energy[spotify_copy$energy <= 0.2] <- 0.2
spotify_copy$speechiness[spotify_copy$speechiness >= 0.27] <- 0.27
spotify_copy$acousticness[spotify_copy$acousticness >= 0.6] <- 0.6
spotify_copy$instrumentalness[spotify_copy$instrumentalness >= 0.015] <- 0.015
spotify_copy$liveness[spotify_copy$liveness >= 0.4] <- 0.4
Now, we will winsorize loudness, tempo and duration.
# winsorization loudness
# Calculate the 5th and 95th percentiles for 'loudness'
lower_bound_loudness <- quantile(spotify_copy$loudness, 0.05, na.rm = TRUE)
upper_bound_loudness <- quantile(spotify_copy$loudness, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$loudness[spotify_copy$loudness < lower_bound_loudness] <- lower_bound_loudness
spotify_copy$loudness[spotify_copy$loudness > upper_bound_loudness] <- upper_bound_loudness
# winsorization tempo
# Calculate the 5th and 95th percentiles for 'tempo'
lower_bound_tempo <- quantile(spotify_copy$tempo, 0.05, na.rm = TRUE)
upper_bound_tempo <- quantile(spotify_copy$tempo, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$tempo[spotify_copy$tempo < lower_bound_tempo] <- lower_bound_tempo
spotify_copy$tempo[spotify_copy$tempo > upper_bound_tempo] <- upper_bound_tempo
#winsorize duration
# Calculate the 5th and 95th percentiles for 'duration'
lower_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.05, na.rm = TRUE)
upper_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$duration_ms[spotify_copy$duration_ms < lower_bound_duration_ms] <- lower_bound_duration_ms
spotify_copy$duration_ms[spotify_copy$duration_ms > upper_bound_duration_ms] <- upper_bound_duration_ms
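The three winsorization blocks above repeat the same pattern. A small helper function (a sketch of an equivalent refactor, not what we ran above) makes the intent clearer and less error-prone:
# clamp a numeric vector to its lower/upper quantiles (default 5th/95th)
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, bounds[1]), bounds[2])
}
# equivalent to the three blocks above:
# spotify_copy$loudness    <- winsorize(spotify_copy$loudness)
# spotify_copy$tempo       <- winsorize(spotify_copy$tempo)
# spotify_copy$duration_ms <- winsorize(spotify_copy$duration_ms)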
knitr::kable(head(spotify[, 1:23]), "simple")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
Observations:
Histograms helped us look at the distribution of variables, and since a lot of them were numeric we concluded that linear regression and a regression tree might be good models to predict track_popularity.
Scatterplots helped us see which parameters might influence our dependent variable the most: instrumentalness, speechiness and loudness, since the data was densest there.
Boxplots against popularity helped us identify the most popular genres: pop, latin and rock. Popular subgenres were post-teen pop, dance pop and permanent wave.
To create new summary information, we plan on narrowing our data down to the variables we really want to explore, to help us gain insights into our predictions.
We still need to compute a correlation matrix and perform modelling to see which variables best predict track_popularity and whether they agree with our plots.
First, we will select the variables for the correlation matrix. We will recode playlist_genre to numeric codes so that it can be included in the correlation.
matrix <- spotify %>% select(track_popularity, danceability, energy, key, loudness, mode, speechiness,acousticness, instrumentalness, liveness, valence, tempo, duration_ms, playlist_genre)
matrix$playlist_genre<-recode(matrix$playlist_genre, 'pop'=0, 'r&b'=1, 'rock'=2, 'latin'=3, 'edm'=4, 'rap'=5)
cor_matrix <- cor(matrix)
print(cor_matrix)
## track_popularity danceability energy key
## track_popularity 1.000000000 0.046596507 -0.10362202 -0.0081442988
## danceability 0.046596507 1.000000000 -0.08143602 0.0070203133
## energy -0.103622023 -0.081436016 1.00000000 0.0128766164
## key -0.008144299 0.007020313 0.01287662 1.0000000000
## loudness 0.037285154 0.015297214 0.68213774 -0.0005315332
## mode 0.016275225 -0.055211662 -0.00431074 -0.1760953351
## speechiness 0.005205509 0.183453077 -0.02900840 0.0231150971
## acousticness 0.091725358 -0.028878237 -0.54588619 0.0041975410
## instrumentalness -0.124431070 -0.002266695 0.02381792 0.0073927120
## liveness -0.052773029 -0.127002362 0.16370922 0.0021365502
## valence 0.022581329 0.333728627 0.14971046 0.0216180479
## tempo 0.004446180 -0.184575167 0.15154451 -0.0103554421
## duration_ms -0.139681783 -0.087560915 0.01758400 0.0188095544
## playlist_genre -0.073434088 0.190073891 0.08828930 0.0100195621
## loudness mode speechiness acousticness
## track_popularity 0.0372851540 0.0162752246 0.005205509 0.091725358
## danceability 0.0152972137 -0.0552116618 0.183453077 -0.028878237
## energy 0.6821377425 -0.0043107400 -0.029008396 -0.545886191
## key -0.0005315332 -0.1760953351 0.023115097 0.004197541
## loudness 1.0000000000 -0.0176877795 0.012981015 -0.371602357
## mode -0.0176877795 1.0000000000 -0.059650521 0.006760044
## speechiness 0.0129810147 -0.0596505208 1.000000000 0.024945425
## acousticness -0.3716023568 0.0067600443 0.024945425 1.000000000
## instrumentalness -0.1543090413 -0.0057660749 -0.107959332 -0.003100502
## liveness 0.0819270381 -0.0002748805 0.059342894 -0.074539622
## valence 0.0495101626 -0.0030881898 0.064700922 -0.018998564
## tempo 0.0967061101 0.0167097090 0.032728605 -0.114335210
## duration_ms -0.1045846289 0.0128129792 -0.098435680 -0.094143218
## playlist_genre 0.0715914257 -0.0505853182 0.298193374 -0.076573192
## instrumentalness liveness valence tempo
## track_popularity -0.124431070 -0.0527730291 0.02258133 0.004446180
## danceability -0.002266695 -0.1270023619 0.33372863 -0.184575167
## energy 0.023817924 0.1637092244 0.14971046 0.151544510
## key 0.007392712 0.0021365502 0.02161805 -0.010355442
## loudness -0.154309041 0.0819270381 0.04951016 0.096706110
## mode -0.005766075 -0.0002748805 -0.00308819 0.016709709
## speechiness -0.107959332 0.0593428937 0.06470092 0.032728605
## acousticness -0.003100502 -0.0745396224 -0.01899856 -0.114335210
## instrumentalness 1.000000000 -0.0085048515 -0.17416318 0.021484050
## liveness -0.008504851 1.0000000000 -0.01992507 0.022026642
## valence -0.174163182 -0.0199250679 1.00000000 -0.025135288
## tempo 0.021484050 0.0220266420 -0.02513529 1.000000000
## duration_ms 0.059077441 0.0075705280 -0.03332403 -0.001695701
## playlist_genre 0.147880944 0.0508644474 -0.06999690 0.050109733
## duration_ms playlist_genre
## track_popularity -0.139681783 -0.07343409
## danceability -0.087560915 0.19007389
## energy 0.017584005 0.08828930
## key 0.018809554 0.01001956
## loudness -0.104584629 0.07159143
## mode 0.012812979 -0.05058532
## speechiness -0.098435680 0.29819337
## acousticness -0.094143218 -0.07657319
## instrumentalness 0.059077441 0.14788094
## liveness 0.007570528 0.05086445
## valence -0.033324030 -0.06999690
## tempo -0.001695701 0.05010973
## duration_ms 1.000000000 -0.07577263
## playlist_genre -0.075772625 1.00000000
corrplot(cor_matrix, method="color", type="upper", order="hclust")
The overall ranking of the variables that correlate most strongly with track_popularity is the following:
duration_ms (-0.1396)
instrumentalness (-0.1244)
energy (-0.1036)
acousticness (0.0917)
playlist_genre (-0.0734)
Compared to the scatterplots, one would have expected instrumentalness, speechiness and loudness to have the strongest correlations, since they had the most of the most-popular songs concentrated in the smallest interval. In this correlation matrix, however, duration_ms, instrumentalness and energy influence track_popularity the most.
Our dependent variable is track_popularity, and we want to see which variables best predict it. Therefore, we will first build a linear regression model to see how well it predicts track_popularity.
# Set the seed for reproducibility
set.seed(2023)
# Randomly sample row indices for the training set split in 70% and 30%
train_indices <- sample(1:NROW(spotify),NROW(spotify)*0.70)
# Create the training set
train_data <- spotify[train_indices, ] #everything before comma is row selector and after comma is column selector
# Create the testing set
test_data <- spotify[-train_indices, ]
# Train the linear regression model, comparing popularity to rest of numeric parameters
lm_model <- lm(track_popularity ~ danceability + energy + mode + key + acousticness + loudness + speechiness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + mode +
## key + acousticness + loudness + speechiness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.757 -17.360 2.935 18.119 60.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.684e+01 2.045e+00 32.682 < 2e-16 ***
## danceability 4.170e+00 1.288e+00 3.237 0.00121 **
## energy -2.318e+01 1.461e+00 -15.864 < 2e-16 ***
## mode 7.590e-01 3.357e-01 2.261 0.02378 *
## key 1.398e-02 4.595e-02 0.304 0.76088
## acousticness 4.934e+00 8.895e-01 5.547 2.95e-08 ***
## loudness 1.132e+00 7.802e-02 14.512 < 2e-16 ***
## speechiness -7.397e+00 1.657e+00 -4.464 8.09e-06 ***
## instrumentalness -9.426e+00 7.437e-01 -12.676 < 2e-16 ***
## liveness -4.420e+00 1.077e+00 -4.104 4.08e-05 ***
## valence 1.997e+00 7.861e-01 2.540 0.01110 *
## tempo 2.808e-02 6.261e-03 4.485 7.33e-06 ***
## duration_ms -4.277e-05 2.726e-06 -15.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.03 on 19836 degrees of freedom
## Multiple R-squared: 0.06003, Adjusted R-squared: 0.05946
## F-statistic: 105.6 on 12 and 19836 DF, p-value: < 2.2e-16
The model suggests that all the variables are significant except key, since it is the only variable with a p-value higher than 0.05. This model has an R^2 of 0.06003, which means that only 6% of the data's variability can be explained by the regression model, indicating that this is not a good model.
Among the 0-1 audio features, instrumentalness, speechiness and liveness have large negative coefficients (-9.426, -7.397 and -4.420): increasing each by one unit lowers predicted track_popularity by that amount. duration_ms also matters, although its coefficient (-4.277e-05) is per millisecond, so its effect accumulates over a song's length. However, energy and loudness did not affect track_popularity as we expected.
We will look at the In-Sample MSE to evaluate the model.
lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 529.94"
We got an MSE of 529.94, which is high for this model; a value close to 0 would be perfect.
Now we will look at the Out-Of-Sample MSE to evaluate the model.
# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate the out-of-sample MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 527.51"
The MSE for the testing set is lower, with a value of 527.51.
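Since we will compute training and testing MSE for every model below, a small helper (a sketch; the chunks in this report compute the MSE inline) would keep the comparisons consistent:
# mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)
# for example, for the linear model above:
# mse(train_data$track_popularity, lm_model$fitted.values)                  # in-sample
# mse(test_data$track_popularity, predict(lm_model, newdata = test_data))   # out-of-sample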
Now, we will train the linear model, comparing it to other categorical values to see how they predict popularity.
# Train the linear regression model, comparing popularity to other character parameters
lm_model_ch <- lm(track_popularity ~ playlist_genre + playlist_subgenre, data = train_data)
summary(lm_model_ch)
##
## Call:
## lm(formula = track_popularity ~ playlist_genre + playlist_subgenre,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.418 -16.339 2.905 16.931 60.098
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9894 0.6853 35.005 < 2e-16
## playlist_genrelatin 18.0795 1.0338 17.489 < 2e-16
## playlist_genrepop 31.4286 1.0654 29.501 < 2e-16
## playlist_genrer&b 22.2155 1.0313 21.541 < 2e-16
## playlist_genrerap 24.4461 1.0227 23.902 < 2e-16
## playlist_genrerock 13.5668 1.0741 12.631 < 2e-16
## playlist_subgenrebig room 4.4078 1.0705 4.117 3.85e-05
## playlist_subgenreclassic rock -0.2592 1.1543 -0.225 0.822309
## playlist_subgenredance pop -3.0791 1.1009 -2.797 0.005164
## playlist_subgenreelectro house 9.7948 0.9814 9.981 < 2e-16
## playlist_subgenreelectropop -16.3501 1.0989 -14.878 < 2e-16
## playlist_subgenregangster rap -15.9361 1.0535 -15.127 < 2e-16
## playlist_subgenrehard rock -4.2979 1.1230 -3.827 0.000130
## playlist_subgenrehip hop 4.6597 1.0591 4.400 1.09e-05
## playlist_subgenrehip pop -1.8091 1.2060 -1.500 0.133599
## playlist_subgenreindie poptimism -14.2180 1.0637 -13.367 < 2e-16
## playlist_subgenrelatin hip hop -9.5706 1.0786 -8.873 < 2e-16
## playlist_subgenrelatin pop 4.3781 1.1083 3.950 7.83e-05
## playlist_subgenreneo soul -16.3026 1.0344 -15.760 < 2e-16
## playlist_subgenrenew jack swing -19.9120 1.1214 -17.757 < 2e-16
## playlist_subgenrepermanent wave 14.2161 1.1811 12.036 < 2e-16
## playlist_subgenrepop edm 12.5587 1.0861 11.563 < 2e-16
## playlist_subgenrepost-teen pop NA NA NA NA
## playlist_subgenreprogressive electro house NA NA NA NA
## playlist_subgenrereggaeton 4.2245 1.2535 3.370 0.000752
## playlist_subgenresouthern hip hop -13.6192 1.0064 -13.533 < 2e-16
## playlist_subgenretrap NA NA NA NA
## playlist_subgenretropical NA NA NA NA
## playlist_subgenreurban contemporary NA NA NA NA
##
## (Intercept) ***
## playlist_genrelatin ***
## playlist_genrepop ***
## playlist_genrer&b ***
## playlist_genrerap ***
## playlist_genrerock ***
## playlist_subgenrebig room ***
## playlist_subgenreclassic rock
## playlist_subgenredance pop **
## playlist_subgenreelectro house ***
## playlist_subgenreelectropop ***
## playlist_subgenregangster rap ***
## playlist_subgenrehard rock ***
## playlist_subgenrehip hop ***
## playlist_subgenrehip pop
## playlist_subgenreindie poptimism ***
## playlist_subgenrelatin hip hop ***
## playlist_subgenrelatin pop ***
## playlist_subgenreneo soul ***
## playlist_subgenrenew jack swing ***
## playlist_subgenrepermanent wave ***
## playlist_subgenrepop edm ***
## playlist_subgenrepost-teen pop
## playlist_subgenreprogressive electro house
## playlist_subgenrereggaeton ***
## playlist_subgenresouthern hip hop ***
## playlist_subgenretrap
## playlist_subgenretropical
## playlist_subgenreurban contemporary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.07 on 19825 degrees of freedom
## Multiple R-squared: 0.1372, Adjusted R-squared: 0.1362
## F-statistic: 137.1 on 23 and 19825 DF, p-value: < 2.2e-16
The R^2 is bigger, meaning that 13.62% of the variance can be explained by this model. It is still low. Among the genre coefficients, pop has the highest one (31.43), followed by rap and r&b. Looking back at the plots, it looked like pop, latin and rock were the most popular. Among the subgenres, playlist_subgenrehip pop and playlist_subgenreclassic rock were the only non-significant ones. new jack swing had the largest negative coefficient (-19.91), followed by electropop and neo soul (both around -16.3).
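A note on reading these coefficients: lm() dummy-codes a factor against its first level, which is why no edm coefficient appears above; every genre estimate is a difference from the edm baseline. A quick sketch to check (and change) the baseline:
# the first factor level is the implicit baseline for the dummies
levels(spotify$playlist_genre)   # "edm" comes first alphabetically
# to compare against a different baseline, relevel before fitting:
# train_data$playlist_genre <- relevel(train_data$playlist_genre, ref = "pop")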
# Training MSE for the categorical model
lm_train_pred_ch <- mean((lm_model_ch$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_train_pred_ch, 2)))
## [1] "Training MSE for Linear Model: 486.43"
# Predict on testing data (character)
lm_test_pred_ch <- predict(lm_model_ch, newdata = test_data)
## Warning in predict.lm(lm_model_ch, newdata = test_data): prediction from a
## rank-deficient fit may be misleading
# Calculate the out-of-sample MSE
lm_mse_test_ch <- mean((lm_test_pred_ch - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test_ch, 2)))
## [1] "Testing MSE for Linear Model: 485.8"
The MSE of the model with the categorical variables is still very high, with a training MSE of 486.43 and a testing MSE of 485.8.
Now we will try adding an interaction to see how it fits the model. We will use the variables that had the biggest coefficients before, to see if we can improve the R^2 with an interaction.
lm_model_int <- lm(track_popularity ~ instrumentalness + speechiness + instrumentalness*speechiness, data = spotify)
summary(lm_model_int)
##
## Call:
## lm(formula = track_popularity ~ instrumentalness + speechiness +
## instrumentalness * speechiness, data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.770 -17.723 3.269 18.519 64.505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.8548 0.2187 186.804 < 2e-16 ***
## instrumentalness -15.2125 0.8901 -17.091 < 2e-16 ***
## speechiness -3.3088 1.4190 -2.332 0.019718 *
## instrumentalness:speechiness 30.0642 8.0613 3.729 0.000192 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.51 on 28352 degrees of freedom
## Multiple R-squared: 0.01603, Adjusted R-squared: 0.01593
## F-statistic: 154 on 3 and 28352 DF, p-value: < 2.2e-16
As we saw before, almost every variable is significant for track popularity, so it is easy to find significant interactions. We performed an interaction which was significant; however, the R^2 is still far too low, now 1.6%. The interaction coefficient is 30.064, meaning that the effect of instrumentalness on track_popularity changes by 30.064 times the value of speechiness (and vice versa).
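To make that concrete, the marginal effect of instrumentalness implied by this model is -15.21 + 30.06 × speechiness; a short sketch evaluates it at a few speechiness values:
# marginal effect of instrumentalness at several speechiness levels,
# using the coefficients from lm_model_int above
b <- coef(lm_model_int)
speech <- c(0.05, 0.25, 0.50)
b["instrumentalness"] + b["instrumentalness:speechiness"] * speech
For speechiness above roughly 0.5, the implied effect of instrumentalness flips sign.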
# In-sample MSE; note that lm_model_int was fit on the full spotify data,
# so its fitted values do not line up with train_data (hence the warning below)
lm_train_pred_int <- mean((lm_model_int$fitted.values - train_data$track_popularity)^2)
## Warning in lm_model_int$fitted.values - train_data$track_popularity: longer
## object length is not a multiple of shorter object length
print(paste("Training MSE for Linear Model:", round(lm_train_pred_int, 2)))
## [1] "Training MSE for Linear Model: 572.16"
# Predict on testing data
lm_test_pred_int <- predict(lm_model_int, newdata = test_data)
# Calculate the out-of-sample MSE
lm_mse_test <- mean((lm_test_pred_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 549.2"
lm_model_no_int <- lm(track_popularity ~ instrumentalness + speechiness , data = spotify)
summary(lm_model_no_int)
##
## Call:
## lm(formula = track_popularity ~ instrumentalness + speechiness,
## data = spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.701 -17.639 3.357 18.561 64.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.7014 0.2149 189.438 <2e-16 ***
## instrumentalness -12.7742 0.6041 -21.145 <2e-16 ***
## speechiness -1.9240 1.3698 -1.405 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.52 on 28353 degrees of freedom
## Multiple R-squared: 0.01555, Adjusted R-squared: 0.01548
## F-statistic: 223.9 on 2 and 28353 DF, p-value: < 2.2e-16
If we perform the same model with no interaction, the R^2 is 1.55%. This means that only 1.55% of the variance can be explained by this model, indicating that it is not a good model. It is slightly worse than the model with the interaction.
# In-sample MSE; as above, lm_model_no_int was fit on the full spotify data,
# so the vector lengths differ (hence the warning below)
lm_train_pred_no_int <- mean((lm_model_no_int$fitted.values - train_data$track_popularity)^2)
## Warning in lm_model_no_int$fitted.values - train_data$track_popularity: longer
## object length is not a multiple of shorter object length
print(paste("Training MSE for Linear Model:", round(lm_train_pred_no_int, 2)))
## [1] "Training MSE for Linear Model: 572.03"
# Predict on testing data
lm_test_pred_no_int <- predict(lm_model_no_int, newdata = test_data)
# Calculate the out-of-sample MSE
lm_mse_test <- mean((lm_test_pred_no_int - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 549.59"
# create a matrix with 2 columns and 4 rows
data <- matrix(c('529.94', '527.51', '486.43', '485.8', '572.16', '549.2', '572.03', '549.59'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(data) = c('Training MSE','Testing MSE')
rownames(data) <- c('LM1','LM2','LM3','LM4')
# assign to table
lm_MSE=as.table(data)
# display
lm_MSE
## Training MSE Testing MSE
## LM1 529.94 527.51
## LM2 486.43 485.8
## LM3 572.16 549.2
## LM4 572.03 549.59
From the model with numeric variables, we could see that all variables except key were statistically significant at the 0.05 level, which means that a variation in them would influence track_popularity. Then we performed a linear model with categorical variables, genre and subgenre; there, all variables besides playlist_subgenrehip pop and playlist_subgenreclassic rock are statistically significant at the 0.05 level. From the table we can see that the model that performed the best was the one with the categorical variables (though its MSE is still high).
While we thought from the plots that pop, latin and rock were the most popular genres, this model indicated that it was pop, rap and r&b. Our model also indicated that loudness and energy might not affect popularity in the way we thought. The best R^2 was the model with genres and subgenres (13.62%), but that model is still not very good.
Next, we fit a KNN regression with k = 5 on the same continuous predictors, evaluating it first in-sample.
spotify_knn_model <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 5)
# Predict on training data
knn_train_pred <- fitted.values(spotify_knn_model)
# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))
## [1] "In-Sample MSE for KNN: 194.903869196156"
# Predict on testing data
knn_model_test <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 5)
knn_test_pred <- fitted.values(knn_model_test)
# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))
## [1] "Out-of-Sample MSE for KNN: 687.024025645934"
spotify_knn_model_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 20)
# Predict on training data
knn_train_pred_20 <- fitted.values(spotify_knn_model_20)
# Calculate in-sample MSE manually
knn_train_mse_20 <- mean((train_data$track_popularity - knn_train_pred_20)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse_20))
## [1] "In-Sample MSE for KNN: 389.061411631355"
# Predict on testing data
knn_model_test_20 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 20)
knn_test_pred_20 <- fitted.values(knn_model_test_20)
# Calculate out-of-sample MSE manually
knn_test_mse_20 <- mean((test_data$track_popularity - knn_test_pred_20)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse_20))
## [1] "Out-of-Sample MSE for KNN: 556.981777608275"
spotify_knn_model_500 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 500)
# Predict on training data
knn_train_pred_500 <- fitted.values(spotify_knn_model_500)
# Calculate in-sample MSE manually
knn_train_mse_500 <- mean((train_data$track_popularity - knn_train_pred_500)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse_500))
## [1] "In-Sample MSE for KNN: 517.746609308476"
# Predict on testing data
knn_model_test_500 <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 500)
knn_test_pred_500 <- fitted.values(knn_model_test_500)
# Calculate out-of-sample MSE manually
knn_test_mse_500 <- mean((test_data$track_popularity - knn_test_pred_500)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse_500))
## [1] "Out-of-Sample MSE for KNN: 524.140072017361"
# create a matrix with 2 columns and 3 rows
knn <- matrix(c('194.90', '687.02', '389.06', '556.981', '517.75', '524.14'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(knn) = c('Training MSE','Testing MSE')
rownames(knn) <- c('knn5', 'knn20', 'knn500' )
# assign to table
knn_MSE=as.table(knn)
# display
knn_MSE
## Training MSE Testing MSE
## knn5 194.90 687.02
## knn20 389.06 556.981
## knn500 517.75 524.14
lm_MSE
## Training MSE Testing MSE
## LM1 529.94 527.51
## LM2 486.43 485.8
## LM3 572.16 549.2
## LM4 572.03 549.59
We tried linear regression and the KNN model with different k's to see which model would be best for predicting track popularity, using the mean squared error to compare them. When comparing the LM to the KNN models, we saw that the model with the best training MSE was knn5, but the MSE on the testing set was too high on all three KNN models (687.02, 556.98 and 524.14). We did not use all of the variables in the dataset, but we used all of our continuous variables. We decided to go this route because we mostly wanted to explore track popularity and the relationship it has with the continuous variables in our dataset.
Since the MSE is not quite where we want it to be, we are going to do some feature scaling to see if there is any impact on a KNN model with fewer variables. The variables that will be removed are key, mode and acousticness.
#Creating a standardized train knn model
train_new.x <- train_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]
stand_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5)
# Creating an unstandardized knn model (scale = FALSE)
knn_unstand <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = train_new.x, test = train_new.x, k = 5, scale = FALSE)
#finding the MSE for our standardized model
#first we have to find the prediction of our standardized knn model
pred_stand <- fitted(stand_knn)
train_new.y <- train_new.x$track_popularity
#Standardized MSE
stand_mse <- mean((train_new.y - pred_stand)^2)
#Unstandardized MSE
pred_undstand <- fitted(knn_unstand)
unstand_mse <- mean((train_new.y - pred_undstand)^2)
#output
print(paste("Training MSE of standardized data :", stand_mse))
## [1] "Training MSE of standardized data : 206.7465001884"
print(paste("Training MSE of unstandardized data:", unstand_mse))
## [1] "Training MSE of unstandardized data: 217.941700903636"
For this model we took out key, mode and acousticness. The MSE of the standardized data is 206.7465 and the MSE of the unstandardized data is 217.9417, so there is a very small difference between the two. Our original training model had an MSE of 194.90, which is also a small difference compared to our feature-scaling models. This leads us to the conclusion that feature scaling can make a difference in the overall model, but it might not be needed here.
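kknn's scale argument standardizes the predictors internally; the sketch below shows the idea with an explicit z-score transformation (how kknn applies the scaling internally may differ slightly):
# standardize the predictors manually, leaving the response untouched
predictors <- setdiff(names(train_new.x), "track_popularity")
train_scaled <- train_new.x
train_scaled[predictors] <- scale(train_new.x[predictors])
summary(train_scaled$danceability) # mean ~0, sd ~1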
Now let's evaluate on the test data with the same variables removed.
# Creating a standardized knn model on the test split (note: the test split serves as both train and test here, mirroring the training chunk above)
test_new.x <- test_data[, c("track_popularity", "danceability", "energy" , "loudness","speechiness","instrumentalness" , "liveness","valence","duration_ms")]
stand_test_knn <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5)
# Creating an unstandardized knn model (scale = FALSE)
knn_test_unstand <- kknn(track_popularity ~ danceability + energy + loudness + speechiness + instrumentalness + liveness + valence + duration_ms, train = test_new.x, test = test_new.x, k = 5, scale = FALSE)
#finding the MSE for our standardized model
#first we have to find the prediction of our standardized knn model
pred_stand_test <- fitted(stand_test_knn)
test_new.y <- test_new.x$track_popularity
#Standardized MSE
test_stand_mse <- mean((test_new.y - pred_stand_test)^2)
#Unstandardized MSE
pred_undstand_test <- fitted(knn_test_unstand)
unstand_mse_test <- mean((test_new.y - pred_undstand_test)^2)
#output
print(paste("Testing MSE of standardized data :", test_stand_mse))
## [1] "Testing MSE of standardized data : 201.450244331452"
print(paste("Test MSE of unstandardized data:", unstand_mse_test))
## [1] "Test MSE of unstandardized data: 213.310812353227"
# create a matrix with 2 columns and 5 rows
knn <- matrix(c('194.90', '687.02', '389.06', '556.981', '517.75', '524.14', '217.94', '213.31', '206.75', '201.45'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(knn) = c('Training MSE','Testing MSE')
rownames(knn) <- c('knn5', 'knn20','knn500', 'knn_un', 'knn_stand' )
# assign to table
knn_MSE=as.table(knn)
# display
knn_MSE
## Training MSE Testing MSE
## knn5 194.90 687.02
## knn20 389.06 556.981
## knn500 517.75 524.14
## knn_un 217.94 213.31
## knn_stand 206.75 201.45
lm_MSE
## Training MSE Testing MSE
## LM1 529.94 527.51
## LM2 486.43 485.8
## LM3 572.16 549.2
## LM4 572.03 549.59
We fit a new KNN with fewer variables and k = 5 on the training and testing splits, with both unstandardized and standardized data. The MSE improved in both cases, though we were using different variables; the smallest testing MSE went down to 201.45 with the standardized model, against 213.31 for the same model without standardization. Feature scaling helps somewhat, but what we can really conclude is that the model improved a lot when we dropped some variables.
# diagnostic plot
par(mfrow = c(2, 2))
plot(lm_model)
##### Interpreting Models
Theoretically, the model that would fit the data best is the linear regression model because it is easier to interpret: you can see the coefficients and how Y changes for a one-unit increase in each variable. You also have to meet the four assumptions (linearity, independence, normality, and homoscedasticity); we can run a diagnostic plot to see if our model meets them.
From the diagnostic plot, we see that the model does not meet the four assumptions, which leads us to the conclusion that the linear regression model is not the best fit for this data. It also has a very low R^2 (also when we tried different variable sets) and a very high MSE.
KNN performed better in practice since the MSE was lower, but only at small k (k = 5). We ran the same model with k = 50 and the same variables and the MSE went up to about 490. Therefore, although the R^2 was very small and the MSE very high, we still think that the LM is a better model, since it is also more interpretable.
Let's create track_popularity as a dummy variable: a 1 when the popularity is at least 70 and a 0 when it is below 70, meaning that 1 is popular and 0 is not popular. track_popularity will again be our dependent variable.
spotify_copy <- spotify # start again from a fresh copy of the deduplicated data
spotify_copy$track_popularity <- ifelse(spotify_copy$track_popularity >=70, 1 , 0)
set.seed(2023)
index <- sample(1:nrow(spotify_copy),nrow(spotify_copy)*0.80)
spotify_copy.train = spotify_copy[index,]
spotify_copy.test = spotify_copy[-index,]
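Before fitting, it is worth checking how imbalanced the new dummy response is (a quick sketch); this imbalance is what motivates the weighted cost function used later:
# class balance of the popularity dummy in the training split
table(spotify_copy.train$track_popularity)
round(prop.table(table(spotify_copy.train$track_popularity)), 3)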
spotify_copy.glm0<- glm(track_popularity~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, family=binomial, data=spotify_copy.train)
summary(spotify_copy.glm0)
##
## Call:
## glm(formula = track_popularity ~ danceability + energy + key +
## loudness + mode + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, family = binomial,
## data = spotify_copy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1331 -0.5000 -0.4123 -0.2795 4.2276
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.066e+00 3.085e-01 3.455 0.000551 ***
## danceability 3.319e-01 1.902e-01 1.745 0.081027 .
## energy -3.095e+00 2.169e-01 -14.269 < 2e-16 ***
## key -9.644e-03 6.523e-03 -1.478 0.139311
## loudness 2.076e-01 1.310e-02 15.843 < 2e-16 ***
## mode 1.286e-01 4.805e-02 2.676 0.007446 **
## speechiness -7.592e-01 2.401e-01 -3.162 0.001568 **
## acousticness 1.473e-01 1.228e-01 1.199 0.230523
## instrumentalness -3.071e+00 2.814e-01 -10.916 < 2e-16 ***
## liveness -4.376e-01 1.710e-01 -2.558 0.010518 *
## valence 4.779e-01 1.159e-01 4.125 3.71e-05 ***
## tempo 2.652e-03 8.573e-04 3.093 0.001980 **
## duration_ms -1.883e-06 4.619e-07 -4.076 4.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14035 on 22683 degrees of freedom
## Residual deviance: 13193 on 22671 degrees of freedom
## AIC: 13219
##
## Number of Fisher Scoring iterations: 7
Observation: As we can see from this model, energy, loudness, mode, speechiness, instrumentalness, liveness, valence, tempo and duration_ms are significant in predicting whether a song is going to be popular or not. As in the linear regression, energy has the largest coefficient in magnitude (-3.095), followed closely by instrumentalness (-3.071) and then speechiness (-0.759); key (estimate -9.644e-03) is small and not significant.
pred.glm0.train <- predict(spotify_copy.glm0,type="response")
hist(pred.glm0.train)
Observation: This histogram illustrates the impact of choosing different cut-off probabilities. Choosing a small cut-off probability will result in many cases being predicted as 1, and choosing a large cut-off probability will result in few cases being predicted as 1.
#ROC curve
library(ROCR)
pred <- prediction(pred.glm0.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6855791
Observation:
We got the AUC (area under the curve). We would like a number closer to 1, which would mean perfect prediction; ours is 0.6856, which is not too bad.
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")
pred <- prediction(pred.glm0.test, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6610358
Observations:
In order to get the confusion matrix of true positive and true negative predictions, we will determine the optimal cut-off probability using a grid search. We will use weights of 5 (for false negatives) and 1 (for false positives).
# define a cost function with input "obs" being the observed response,
# "pred.p" being the predicted probability, and "pcut" being the threshold
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # return the weighted misclassification cost
} # end of the function
Next, we will define a sequence of probabilities (we will search for the optimal p-cut over this sequence).
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
Then, we will calculate the cost for each probability in the sequence p.seq.
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0.train, pcut = p.seq[i])
} # end of the loop
#cbind(p.seq, cost)
Last, we will draw a plot with cost against p.seq, and find the p-cut that gives us the minimum cost.
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, cost)
# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)
## [1] 0.16
Observation:
With weights of 5 and 1, we got 0.16 as the optimal cut-off probability.
# step 1. get binary classification
class.glm0.train.opt<- (pred.glm0.train>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.train$track_popularity, class.glm0.train.opt, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 18732 1843
## 1 1636 473
Observation:
With weights of 5 and 1, the confusion matrix shows 18732 TN, 473 TP, 1843 FP and 1636 FN. It means, for example, that 473 songs were predicted as popular and actually were popular.
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.opt)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.opt==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.opt==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.opt, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.153368012696174"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0895747266099635"
print(paste0("FNR:",FNR))
## [1] "FNR:0.775723091512565"
print(paste0("cost:",cost))
## [1] "cost:0.441853288661612"
We got 0.1534 for MR, which is fairly accurate for such complicated data. We also got 0.0896 for FPR, 0.7757 for FNR and 0.4419 for cost.
Now, let's do the same with the out-of-sample data, which is more important, and calculate the MR.
# step 0. obtain predicated values on testing data (we actually did this already, but let's do it again)
pred.glm0.test<- predict(spotify_copy.glm0, newdata = spotify_copy.test, type="response")
# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.opt <- (pred.glm0.test>optimal.pcut.glm0)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(spotify_copy.test$track_popularity, pred.glm0.test.opt, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 4698 448
## 1 416 110
Observation: With weights of 5 and 1 (optimal cut-off 0.16), the confusion matrix shows 4698 TN, 110 TP, 448 FP and 416 FN. It means, for example, that 110 songs were predicted as popular and actually were popular.
MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.opt)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.opt==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.opt==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.opt, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.152327221438646"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0870579090555771"
print(paste0("FNR:",FNR))
## [1] "FNR:0.790874524714829"
print(paste0("cost:",cost))
## [1] "cost:0.445698166431594"
The out-of-sample MR is still very good, at 0.1523. We also got 0.0871 for FPR, 0.7909 for FNR and 0.4457 for cost. All of the values were better for the testing set except the FNR.
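The MR/FPR/FNR computation above is repeated verbatim for the training and testing splits; a small helper (a sketch) computes all three from observed labels and 0/1 predictions:
# misclassification rate, false positive rate and false negative rate
class_metrics <- function(obs, pred) {
  c(MR  = mean(obs != pred),
    FPR = sum(obs == 0 & pred == 1) / sum(obs == 0),
    FNR = sum(obs == 1 & pred == 0) / sum(obs == 1))
}
# e.g. class_metrics(spotify_copy.test$track_popularity, pred.glm0.test.opt)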
spotify_glm0_cat<- glm(track_popularity~ playlist_genre + playlist_subgenre, family=binomial, data=spotify_copy.train)
summary(spotify_glm0_cat)
##
## Call:
## glm(formula = track_popularity ~ playlist_genre + playlist_subgenre,
## family = binomial, data = spotify_copy.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9468 -0.5258 -0.3472 -0.1463 3.3090
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.4706 0.4474 -12.229 < 2e-16
## playlist_genrelatin 2.1066 0.4832 4.360 1.30e-05
## playlist_genrepop 4.9005 0.4532 10.813 < 2e-16
## playlist_genrer&b 3.5616 0.4578 7.781 7.21e-15
## playlist_genrerap 3.4258 0.4587 7.469 8.10e-14
## playlist_genrerock 3.1048 0.4647 6.681 2.38e-11
## playlist_subgenrebig room 0.1464 0.6715 0.218 0.827462
## playlist_subgenreclassic rock -0.1167 0.1790 -0.652 0.514502
## playlist_subgenredance pop -0.6406 0.1037 -6.178 6.50e-10
## playlist_subgenreelectro house 0.9380 0.5332 1.759 0.078579
## playlist_subgenreelectropop -1.1165 0.1131 -9.868 < 2e-16
## playlist_subgenregangster rap -1.1283 0.1873 -6.025 1.69e-09
## playlist_subgenrehard rock -0.7950 0.2064 -3.853 0.000117
## playlist_subgenrehip hop 0.4123 0.1312 3.142 0.001676
## playlist_subgenrehip pop 0.1436 0.1470 0.977 0.328782
## playlist_subgenreindie poptimism -2.2086 0.1422 -15.535 < 2e-16
## playlist_subgenrelatin hip hop 0.9102 0.2183 4.169 3.06e-05
## playlist_subgenrelatin pop 1.9038 0.2023 9.410 < 2e-16
## playlist_subgenreneo soul -1.8874 0.2208 -8.549 < 2e-16
## playlist_subgenrenew jack swing -3.2198 0.4588 -7.018 2.25e-12
## playlist_subgenrepermanent wave 0.7548 0.1584 4.766 1.88e-06
## playlist_subgenrepop edm 2.7406 0.4715 5.812 6.16e-09
## playlist_subgenrepost-teen pop NA NA NA NA
## playlist_subgenreprogressive electro house NA NA NA NA
## playlist_subgenrereggaeton 1.5118 0.2209 6.844 7.71e-12
## playlist_subgenresouthern hip hop -1.0521 0.1711 -6.149 7.79e-10
## playlist_subgenretrap NA NA NA NA
## playlist_subgenretropical NA NA NA NA
## playlist_subgenreurban contemporary NA NA NA NA
##
## (Intercept) ***
## playlist_genrelatin ***
## playlist_genrepop ***
## playlist_genrer&b ***
## playlist_genrerap ***
## playlist_genrerock ***
## playlist_subgenrebig room
## playlist_subgenreclassic rock
## playlist_subgenredance pop ***
## playlist_subgenreelectro house .
## playlist_subgenreelectropop ***
## playlist_subgenregangster rap ***
## playlist_subgenrehard rock ***
## playlist_subgenrehip hop **
## playlist_subgenrehip pop
## playlist_subgenreindie poptimism ***
## playlist_subgenrelatin hip hop ***
## playlist_subgenrelatin pop ***
## playlist_subgenreneo soul ***
## playlist_subgenrenew jack swing ***
## playlist_subgenrepermanent wave ***
## playlist_subgenrepop edm ***
## playlist_subgenrepost-teen pop
## playlist_subgenreprogressive electro house
## playlist_subgenrereggaeton ***
## playlist_subgenresouthern hip hop ***
## playlist_subgenretrap
## playlist_subgenretropical
## playlist_subgenreurban contemporary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14035 on 22683 degrees of freedom
## Residual deviance: 12271 on 22660 degrees of freedom
## AIC: 12319
##
## Number of Fisher Scoring iterations: 7
Observation:
All of the variables are significant at the 0.05 level except playlist_subgenrebig room, playlist_subgenreclassic rock, playlist_subgenreelectro house and playlist_subgenrehip pop. The coefficients for post-teen pop, progressive electro house, trap, tropical and urban contemporary are not defined at all, because those subgenres are collinear with their genres (the singularities reported above).
Among the genres, the largest coefficient, i.e. the biggest change in log-odds per unit, belongs to pop, followed by r&b and rap. Among the subgenres, the largest positive coefficients are pop edm, latin pop and reggaeton, while the largest negative ones are new jack swing, indie poptimism and neo soul. In the earlier linear model the biggest effects were on new jack swing, electropop and neo soul.
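Since these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. A quick sketch:
# coefficients are log-odds; exp() converts them to odds ratios
# e.g. exp(4.9005) for playlist_genrepop is about 134, i.e. the odds of being
# popular are roughly 134 times the baseline genre's odds, holding subgenre fixed
round(exp(coef(spotify_glm0_cat)), 3)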
pred.glm0_cat.train <- predict(spotify_glm0_cat,type="response")
#ROC curve
library(ROCR)
pred <- prediction(pred.glm0_cat.train, spotify_copy.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7593949
pred.glm0.test_cat<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred <- prediction(pred.glm0.test_cat, spotify_copy.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7539842
plot(perf, colorize=TRUE)
Observation: For the training set we get an AUC of 0.7594 and for the testing set 0.7540.
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return a value when you write R functions
} # end of the function
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
cost = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
cost[i] = costfunc(obs = spotify_copy.train$track_popularity, pred.p = pred.glm0_cat.train, pcut = p.seq[i])
} # end of the loop
#cbind(p.seq, cost)
# find the optimal pcut
optimal.pcut.glm0 = p.seq[which(cost==min(cost))]
print(optimal.pcut.glm0)
## [1] 0.17 0.18
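To see how flat the cost curve is around these values, we can plot the cost against the candidate cut-offs:
# visualize the weighted cost over all candidate cut-offs
plot(p.seq, cost, type = "l", xlab = "p-cut", ylab = "weighted cost")
abline(v = optimal.pcut.glm0, lty = 2)  # mark the optimal cut-off(s)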
The optimal pcut is tied between 0.17 and 0.18. Now we will get MR, FPR, FNR and cost to compare the models.
# step 1. get binary classification; the cost is tied, so use the first optimal cut-off
optimal.pcut.glm0 <- optimal.pcut.glm0[1]
class.glm0.train.cat<- (pred.glm0_cat.train>optimal.pcut.glm0)*1
# get values
MR<- mean(spotify_copy.train$track_popularity!= class.glm0.train.cat)
FPR<- sum(spotify_copy.train$track_popularity==0 & class.glm0.train.cat==1)/sum(spotify_copy.train$track_popularity==0)
FNR<- sum(spotify_copy.train$track_popularity==1 & class.glm0.train.cat==0)/sum(spotify_copy.train$track_popularity==1)
cost<- costfunc(obs = spotify_copy.train$track_popularity, pred.p = class.glm0.train.cat, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.151340151648739"
print(paste0("FPR:",FPR))
## [1] "FPR:0.0982260024301337"
print(paste0("FNR:",FNR))
## [1] "FNR:0.669511616880038"
print(paste0("cost:",cost))
## [1] "cost:0.400326221125022"
# step 0. obtain predicted values on testing data (we actually did this already, but let's do it again)
pred.glm0.test<- predict(spotify_glm0_cat, newdata = spotify_copy.test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# step 1. get binary classification, I used the optimal cut-off
pred.glm0.test.cat <- (pred.glm0.test>optimal.pcut.glm0)*1
#Get values
MR<- mean(spotify_copy.test$track_popularity!= pred.glm0.test.cat)
FPR<- sum(spotify_copy.test$track_popularity==0 & pred.glm0.test.cat==1)/sum(spotify_copy.test$track_popularity==0)
FNR<- sum(spotify_copy.test$track_popularity==1 & pred.glm0.test.cat==0)/sum(spotify_copy.test$track_popularity==1)
cost<- costfunc(obs = spotify_copy.test$track_popularity, pred.p = pred.glm0.test.cat, pcut = optimal.pcut.glm0)
print(paste0("MR:",MR))
## [1] "MR:0.157087447108604"
print(paste0("FPR:",FPR))
## [1] "FPR:0.104741546832491"
print(paste0("FNR:",FNR))
## [1] "FNR:0.669201520912547"
print(paste0("cost:",cost))
## [1] "cost:0.405324400564175"
# create matrices with 2 columns and 5 rows
lg_num <- matrix(c('0.6856', '0.6610', '0.1534', '0.1523', '0.0896', '0.087', '0.7757', '0.7908', '0.4419', '0.4457'), ncol=2, byrow=TRUE)
lg_cat <- matrix(c('0.7593', '0.7539', '0.1513', '0.1571', '0.098', '0.1047', '0.6695', '0.6692', '0.4003', '0.4053'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(lg_num) = c('Training_num','Testing_num')
rownames(lg_num) <- c('AUC', 'MR', 'FPR', 'FNR', 'Cost' )
colnames(lg_cat) = c('Training_cat','Testing_cat')
rownames(lg_cat) <- c('AUC', 'MR', 'FPR', 'FNR', 'Cost' )
# assign to table
lg1_results=as.table(lg_num)
lg2_results=as.table(lg_cat)
# display
#lg1_results
#lg2_results
library(knitr)
kable(lg1_results)
| Training_num | Testing_num | |
|---|---|---|
| AUC | 0.6856 | 0.6610 |
| MR | 0.1534 | 0.1523 |
| FPR | 0.0896 | 0.087 |
| FNR | 0.7757 | 0.7908 |
| Cost | 0.4419 | 0.4457 |
kable(lg2_results)
| Training_cat | Testing_cat | |
|---|---|---|
| AUC | 0.7593 | 0.7539 |
| MR | 0.1513 | 0.1571 |
| FPR | 0.098 | 0.1047 |
| FNR | 0.6695 | 0.6692 |
| Cost | 0.4003 | 0.4053 |
Observations: We fit two logistic regression models to predict track_popularity: one with the numerical features and one with the categorical features. AUC was better for the categorical model on both the training and testing data. MR on testing was slightly better in the first model, while MR on training was slightly better in the second. Overall, the model with categorical variables predicted better than the one with numerical variables.
Something we like about this model is that the accuracy is not too bad and the MR is relatively small in all models. We prefer a lower FP rate, meaning fewer songs are predicted as popular when they are not. However, the FN rate is very high, meaning many songs are predicted as not popular when they actually are.
We will first make a new data set without the categorical variables, then convert our track_popularity variable into two levels.
spotify_new = subset(spotify, select = -c(track_album_release_date, track_id, track_name, track_artist, track_album_id, track_album_name, playlist_genre, playlist_name, playlist_subgenre, playlist_id) )
spotify_new$track_popularity <- ifelse(spotify_new$track_popularity >=70, 1 , 0)
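Before fitting, it is worth checking how imbalanced the binary response is, since that drives the class-weight choices below:
# proportion of popular (1) vs non-popular (0) songs; popular songs are a small minority
prop.table(table(spotify_new$track_popularity))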
#splitting the data
set.seed(2023)
index <- sample(1:nrow(spotify_new),nrow(spotify_new)*0.80)
spotify_svm_train <- spotify_new[index,]
spotify_svm_test <- spotify_new[-index,]
#fitting the data without class-weighted cost
library(e1071)
# note: the train/test split above keeps the numeric 0/1 response;
# the svm() formulas below apply as.factor() themselves
spotify_new$track_popularity<- as.factor(spotify_new$track_popularity)
str(spotify_new)
## 'data.frame': 28356 obs. of 13 variables:
## $ track_popularity: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
#svm model
spotify_svm = svm(as.factor(track_popularity) ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = spotify_svm_train, kernel = 'linear')
# predictions on the train data
pred_spotify_train = predict(spotify_svm, spotify_svm_train)
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_svm_train$track_popularity ,
pred = pred_spotify_train)
Cmatrix_train
## pred
## true 0 1
## 0 20575 0
## 1 2109 0
#Mis-classification rate
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.09297302
# Predictions on the test data
pred_spotify_test <- predict(spotify_svm, spotify_svm_test)
# Confusion matrix to evaluate the model on test data
Cmatrix_test = table(true = spotify_svm_test$track_popularity,
pred = pred_spotify_test)
Cmatrix_test
## pred
## true 0 1
## 0 5146 0
## 1 526 0
#Mis-classification rate
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.09273625
Observations: The confusion matrix for the training data shows 20575 true negatives, 0 false positives, 2109 false negatives and 0 true positives. The misclassification rate is 0.0930.
The confusion matrix for the testing data shows 5146 true negatives, 0 false positives, 526 false negatives and 0 true positives. The misclassification rate is 0.0927.
As we can see, this model is not working well, since no songs are predicted as popular. Therefore we will try different kernels and class weights to see whether we can get a better model.
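One way to choose the class weight is simply to try a few values and watch how the training confusion matrix changes. A sketch of this idea (train_f, fit_w and pred_w are our own names, and refitting an SVM for each weight is slow on a dataset this size):
# make a copy with a factor response so svm() treats this as classification
train_f <- spotify_svm_train
train_f$track_popularity <- as.factor(train_f$track_popularity)
for (w in c(3, 5, 7)) {
  fit_w <- svm(track_popularity ~ ., data = train_f, kernel = 'linear',
               class.weights = c("0" = 1, "1" = w))
  pred_w <- predict(fit_w, train_f)
  cat("weight", w, "training MR:", round(mean(pred_w != train_f$track_popularity), 4), "\n")
}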
Now we are going to fit an SVM model with class weights using the training data.
spotify_svm_asymmetric = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear', class.weights = c("0" = 1, "1" = 7))
# predictions for train data
pred_spotify_train <- predict(spotify_svm_asymmetric, spotify_svm_train)
#Confusion matrix for train data
Cmatrix_train_21 <- table(true =spotify_svm_train$track_popularity,
pred = pred_spotify_train)
print(Cmatrix_train_21)
## pred
## true 0 1
## 0 16364 4211
## 1 1240 869
#Mis-classification rate
1 - sum(diag(Cmatrix_train_21))/sum(Cmatrix_train_21)
## [1] 0.2403015
The confusion matrix of the predicted values on the training data shows 16364 TN, 4211 FP, 1240 FN and 869 TP.
The misclassification rate is 0.2403. This is the best model we could get with a linear kernel, but the MR is very high at 0.24.
Now we are going to evaluate the weighted SVM model on the testing data.
#predictions on the test data
pred_spotify_test <- predict(spotify_svm_asymmetric, spotify_svm_test)
#Confusion matrix
Cmatrix_test_22 <- table( true = spotify_svm_test$track_popularity, pred = pred_spotify_test)
print(Cmatrix_test_22)
## pred
## true 0 1
## 0 4051 1095
## 1 320 206
#Mis-classification rate
MR <- 1 - sum(diag(Cmatrix_test_22))/sum(Cmatrix_test_22)
MR
## [1] 0.2494711
The confusion matrix of the predicted values on the testing data shows 4051 TN, 1095 FP, 320 FN and 206 TP.
The MR is 0.2495, still very high.
Let's fit another model with a polynomial kernel and compare the two.
spotify_svm_asy_poly = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'polynomial', class.weights = c("0" = 1, "1" = 5))
# predictions for train data
pred_spotify_train_poly <- predict(spotify_svm_asy_poly, spotify_svm_train)
#Confusion matrix for train data
Cmatrix_train_23 <- table(true =spotify_svm_train$track_popularity,
pred = pred_spotify_train_poly)
print(Cmatrix_train_23)
## pred
## true 0 1
## 0 20085 490
## 1 1903 206
#Mis-classification rate
1 - sum(diag(Cmatrix_train_23))/sum(Cmatrix_train_23)
## [1] 0.1054929
#predictions on the test data
pred_spotify_test_asy <- predict(spotify_svm_asy_poly, spotify_svm_test)
#Confusion matrix
Cmatrix_test_24 <- table( true = spotify_svm_test$track_popularity, pred = pred_spotify_test_asy)
print(Cmatrix_test_24)
## pred
## true 0 1
## 0 5017 129
## 1 493 33
#Mis-classification rate
MR <- 1 - sum(diag(Cmatrix_test_24))/sum(Cmatrix_test_24)
MR
## [1] 0.1096615
Observations: The polynomial SVM with class weights is the best SVM model for this data set so far. We obtained some popular-song predictions (although not many), and its MR of 0.11 is lower than in the previous model. We also needed a smaller weight on class 1 than in the linear model.
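The kernel degree and cost could also be tuned more systematically with e1071's tune(), which cross-validates over a parameter grid. A sketch, reusing the factor-response copy train_f from the earlier weight sketch; the grid below is our own choice, and this is slow on the full training set:
# sketch: cross-validated grid search over polynomial degree and cost
tune_out <- tune(svm, track_popularity ~ ., data = train_f,
                 kernel = 'polynomial', class.weights = c("0" = 1, "1" = 5),
                 ranges = list(degree = 2:4, cost = c(0.1, 1, 10)))
summary(tune_out)  # reports the best parameter combination by CV error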
Now we are going to obtain the ROC and AUC for our SVM model. We will then use our output to compare it to the ROC and AUC we obtained in our logistic model.
#ROC and AUC using the train data
spotify_svm_prob = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear',
probability = TRUE)
pred_prob_train = predict(spotify_svm_prob,
newdata = spotify_svm_train,
probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, 2]
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5866711
The AUC is 0.5867.
#predictions on the testing data
pred_prob_test = predict(spotify_svm_prob,
newdata = spotify_svm_test,
probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]
#ROC
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity )
perf <- performance(pred, "tpr", "fpr")
#AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5698786
The AUC is 0.5699.
#refit the model with probability enabled
spotify_svm_asymmetric = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'linear', class.weights = c("0" = 1, "1" = 7),probability = TRUE)
#get predictions for training
pred_prob_train <- predict(spotify_svm_asymmetric, spotify_svm_train, probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, 2]
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6800004
#get predictions for testing
pred_prob_test <- predict(spotify_svm_asymmetric, spotify_svm_test, probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6587955
Observations: From our second model we got a better AUC, almost 10% higher, with scores of 0.68 on training and 0.66 on testing.
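To see the gap between in-sample and out-of-sample performance at a glance, the two ROC curves can be overlaid on one plot (a sketch; ROCR's plot method accepts add = TRUE):
# overlay training and testing ROC curves for the weighted linear SVM
perf_train <- performance(prediction(pred_prob_train, spotify_svm_train$track_popularity), "tpr", "fpr")
perf_test <- performance(prediction(pred_prob_test, spotify_svm_test$track_popularity), "tpr", "fpr")
plot(perf_train, col = "blue")
plot(perf_test, col = "red", add = TRUE)  # overlay the testing curve
legend("bottomright", legend = c("training", "testing"), col = c("blue", "red"), lty = 1)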
#refit the polynomial model with probability enabled
spotify_svm_asymmetric_POL = svm(as.factor(track_popularity) ~ danceability + energy + loudness + speechiness + mode + key + acousticness+ instrumentalness + liveness + valence + tempo +duration_ms, data = spotify_svm_train, kernel = 'polynomial', class.weights = c("0" = 1, "1" = 5),probability = TRUE)
#get predictions for training
pred_prob_train <- predict(spotify_svm_asymmetric_POL, spotify_svm_train, probability = TRUE)
pred_prob_train = attr(pred_prob_train, "probabilities")[, 2]
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_train, spotify_svm_train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7154458
#get predictions for testing
pred_prob_test <- predict(spotify_svm_asymmetric_POL, spotify_svm_test, probability = TRUE)
pred_prob_test = attr(pred_prob_test, "probabilities")[, 2]
#ROC and AUC
#ROC
library(ROCR)
pred <- prediction(pred_prob_test, spotify_svm_test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6612042
From our last model we get the best training AUC (0.7154) but a somewhat lower testing AUC (0.6612).
# create matrix with 2 columns and 6 rows (values taken from the outputs above)
svm_results <- matrix(c('0.5867', '0.5699', '0.6800', '0.6588', '0.7154', '0.6612', '0.0930', '0.0927', '0.2403', '0.2495', '0.1055', '0.1097'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(svm_results) = c('Training_svm','Testing_svm')
rownames(svm_results) <- c('AUC1','AUC2', 'AUC3', 'MR1','MR2', 'MR3' )
# assign to table
svm=as.table(svm_results)
# display
kable(svm)
| Training_svm | Testing_svm | |
|---|---|---|
| AUC1 | 0.5867 | 0.5699 |
| AUC2 | 0.6800 | 0.6588 |
| AUC3 | 0.7154 | 0.6612 |
| MR1 | 0.0930 | 0.0927 |
| MR2 | 0.2403 | 0.2495 |
| MR3 | 0.1055 | 0.1097 |
Observations: We fit three different SVM models. The first, with no class weights and a linear kernel, did not predict a single popular song. The second was a linear model with class weights (a weight of 7 on class 1), and the third a polynomial model with a weight of 5 on class 1. The unweighted model has the lowest MR only because it never predicts the positive class; among the weighted models, the polynomial one has the lower MR (0.1097 on testing) and the best testing AUC (0.6612), and its confusion matrix makes sense. Overall, the third model performed best, and it also required a smaller class weight.
We will create a copy of the spotify data under a new name for this analysis.
spotify_dec <- spotify
spotify_dec$track_popularity <- ifelse(spotify_dec$track_popularity >=70, 1 , 0)
spotify_dec$mode<- as.factor(spotify_dec$mode)
spotify_dec$key<- as.factor(spotify_dec$key)
set.seed(2023)
index <- sample(1:nrow(spotify_dec),nrow(spotify_dec)*0.60)
spotify_dec.train = spotify_dec[index,]
spotify_dec.test = spotify_dec[-index,]
# fit the model
fit_tree_dec <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train)
summary(fit_tree_dec)
## Call:
## rpart(formula = as.factor(track_popularity) ~ danceability +
## energy + speechiness + loudness + instrumentalness + mode +
## key + acousticness + liveness + valence + tempo, data = spotify_dec.train)
## n= 17013
##
## CP nsplit rel error xerror xstd
## 1 0 0 1 0 0
##
## Node number 1: 17013 observations
## predicted class=0 expected loss=0.09351672 P(node) =1
## class counts: 15422 1591
## probabilities: 0.906 0.094
#rpart.plot(fit_tree_dec, yesno=2, extra=4)
Obtain the predicted values for training data and confusion matrix.
# Make predictions on the train data
pred_tree_train <- predict(fit_tree_dec, spotify_dec.train, type="class")
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = spotify_dec.train$track_popularity,
pred = pred_tree_train)
Cmatrix_train
## pred
## true 0 1
## 0 15422 0
## 1 1591 0
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.09351672
Observation: This model gives a very small misclassification rate of 0.0935, but the confusion matrix for the training set shows that no songs are predicted as 1.
Obtain the predicted values for testing data and confusion matrix.
# Make predictions on the test data
pred_tree_test <- predict(fit_tree_dec, spotify_dec.test, type="class")
# Confusion matrix to evaluate the model on test data
Cmatrix_test = table(true = spotify_dec.test$track_popularity,
pred = pred_tree_test)
Cmatrix_test
## pred
## true 0 1
## 0 10299 0
## 1 1044 0
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.09203914
Observation:
Now that we have done the same on the testing set, we can see that the MR is even smaller at 0.092, but still no songs are predicted as 1. This model is not good. Therefore we will use asymmetric costs, giving different weights to the two error types, to see if this improves the model.
# We need to define a cost matrix first; keep the zeros on the diagonal
cost_matrix <- matrix(c(0, 1, # cost of 1 for FP
6, 0), # cost of 6 for FN
byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(track_popularity) ~ danceability + energy + speechiness + loudness + instrumentalness+ mode+ key + acousticness+ liveness + valence + tempo, data=spotify_dec.train, parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)
Since some observations now fall on the '1' side, we can visualize the tree in a plot. From the plot, energy is the variable that best predicts a popular song: according to the model, an energy value below 0.82 predicts 1, and a loudness above -5.1 also predicts 1.
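Which variables actually drive the splits can also be checked directly, since rpart stores an overall importance score for each variable on the fitted object:
# overall variable importance recorded by rpart for the weighted tree
round(fit_tree_asym$variable.importance, 2)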
#get predictions for training
pred_spotify_train <- predict(fit_tree_asym, spotify_dec.train,
type = "class")
#C matrix for training
Cmatrix_train_asy <- table( true = spotify_dec.train$track_popularity, pred = pred_spotify_train)
Cmatrix_train_asy
## pred
## true 0 1
## 0 13157 2265
## 1 1054 537
1 - sum(diag(Cmatrix_train_asy))/sum(Cmatrix_train_asy)
## [1] 0.1950861
Observation: Now that we have redistributed the weights, we finally get some predictions of 1 (popular): 537 TP, 2265 FP, 13157 TN and 1054 FN on the training set. However, we now have a bigger MR of 0.1951. Let's get the asymmetric predictions for the testing set.
#get predictions for testing
pred_tree_test <- predict(fit_tree_asym, spotify_dec.test, type = "class")
#C matrix for testing with more weight on "1"
Cmatrix_test_weight = table( true = spotify_dec.test$track_popularity, pred = pred_tree_test)
Cmatrix_test_weight
## pred
## true 0 1
## 0 8820 1479
## 1 707 337
1 - sum(diag(Cmatrix_test_weight))/sum(Cmatrix_test_weight)
## [1] 0.192718
Observation: For the testing set we get the following values: 337 TP, 1479 FP, 8820 TN and 707 FN. The MR is now slightly lower, at 0.1927. One weak point is that the number of songs correctly predicted as popular is small compared to the other cells. Now, we will obtain the AUC for the training and testing data.
# obtain predicted probability
pred_prob_train = predict(fit_tree_dec, spotify_dec.train, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_prob_train = pred_prob_train[,"1"]
library(ROCR)
pred <- prediction(pred_prob_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5
# obtain predicted probability
pred_prob_test = predict(fit_tree_dec, spotify_dec.test, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
pred_prob_test = pred_prob_test[,"1"] # replace "1" with the actual category if the response variable is a factor
#ROC
pred <- prediction(pred_prob_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.5
Observation:
When we compute the AUC for both the training and testing sets for the unweighted model, we get the same value of 0.5, and the ROC curve is a diagonal line. This is because no songs are predicted as 1.
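We can confirm this directly: with no splits, the root node assigns the same probability to every song, so ranking by predicted probability is uninformative and the ROC curve collapses to the diagonal.
# the unweighted tree has no splits, so every song gets one identical probability
length(unique(pred_prob_train))  # expect 1
Now, let's get the AUC and ROC for the model with weights.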
# obtain predicted probability
pred_asym_train = predict(fit_tree_asym, spotify_dec.train, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_asym_train = pred_asym_train[,"1"]
library(ROCR)
pred <- prediction(pred_asym_train, spotify_dec.train$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6613876
# obtain predicted probability
pred_asym_test = predict(fit_tree_asym, spotify_dec.test, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if response variable is a factor
pred_asym_test = pred_asym_test[,"1"]
library(ROCR)
pred <- prediction(pred_asym_test, spotify_dec.test$track_popularity)
perf <- performance(pred, "tpr", "fpr")
#plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6450638
Observations: The AUC for the weighted tree model is still not very good (0.6614 on training, 0.6451 on testing), even though class 1 is heavily weighted.
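Before settling on this tree, one could also inspect rpart's complexity-parameter table and prune at the cp value with the smallest cross-validated error; this is the standard rpart pruning workflow, sketched here (best_cp and fit_tree_pruned are our own names):
# cross-validated error by complexity parameter, then prune at the best cp
printcp(fit_tree_asym)
best_cp <- fit_tree_asym$cptable[which.min(fit_tree_asym$cptable[, "xerror"]), "CP"]
fit_tree_pruned <- prune(fit_tree_asym, cp = best_cp)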
# create matrix with 2 columns and 4 rows
tree_results <- matrix(c('0.5', '0.5', '0.6614', '0.6451', '0.0935', '0.0920', '0.1951', '0.1927'), ncol=2, byrow=TRUE)
# specify the column names and row names of matrix
colnames(tree_results) = c('Training_tree','Testing_tree')
rownames(tree_results) <- c('AUC1', 'AUC2', 'MR1', 'MR2' )
# assign to table
tree=as.table(tree_results)
kable(knn_MSE)
| Training MSE | Testing MSE | |
|---|---|---|
| knn5 | 194.90 | 687.02 |
| knn20 | 389.06 | 556.981 |
| knn500 | 517.75 | 524.14 |
| knn_un | 217.942 | 206.747 |
| knn_stand | 213.31 | 201.45 |
kable(lm_MSE)
| Training MSE | Testing MSE | |
|---|---|---|
| LM1 | 529.4 | 527.51 |
| LM2 | 486.43 | 485.8 |
| LM3 | 572.16 | 549.2 |
| LM4 | 572.03 | 549.59 |
kable(tree)
| Training_tree | Testing_tree | |
|---|---|---|
| AUC1 | 0.5 | 0.5 |
| AUC2 | 0.6614 | 0.6451 |
| MR1 | 0.0935 | 0.0920 |
| MR2 | 0.1951 | 0.1927 |
kable(lg1_results)
| Training_num | Testing_num | |
|---|---|---|
| AUC | 0.6856 | 0.6610 |
| MR | 0.1534 | 0.1523 |
| FPR | 0.0896 | 0.087 |
| FNR | 0.7757 | 0.7908 |
| Cost | 0.4419 | 0.4457 |
kable(lg2_results)
| Training_cat | Testing_cat | |
|---|---|---|
| AUC | 0.7593 | 0.7539 |
| MR | 0.1513 | 0.1571 |
| FPR | 0.098 | 0.1047 |
| FNR | 0.6695 | 0.6692 |
| Cost | 0.4003 | 0.4053 |
kable(svm)
| Training_svm | Testing_svm | |
|---|---|---|
| AUC1 | 0.587 | 0.5909 |
| AUC2 | 0.6800 | 0.658 |
| AUC3 | 0.715 | 0.66 |
| MR1 | 0.0929 | 0.0928 |
| MR2 | 0.0930 | 0.0927 |
| MR3 | 0.105 | 0.1096 |
Observation: Now that we have computed the AUC-ROC for the weighted tree model, we get a best AUC of 0.6614 for the training set and 0.6451 for the testing set, so the weighted model is the better one. Although the MR is lower for the unweighted model, that does not make it better, because none of the observations are classified as popular. The weighted model has a bigger MR, but it does better classification than model 1. Overall, among the classification trees, the model with weights fits the data best.
After cleaning our data, we did EDA to predict which variables and genres would best indicate whether a song is popular or not. From the visualizations and the models we saw that:

- instrumentalness, speechiness and liveness are the variables that best predict track_popularity; loudness and energy were not the strong predictors one would expect.
- pop was the genre that predicted popularity the most, and dance pop was the best predictor among subgenres.

From Data Modeling we concluded the following:

- key was non-significant, along with 2 variables in the second model. Out-of-sample MSE was best in the model with categorical values (485.8) when comparing LM to KNN, and it improved when adding an interaction; but the R^2 was still too low for it to be a good model. With KNN we got a lower MSE (though with fewer variables), and that model still does not give good estimates or coefficients.
- When comparing Trees, SVM and logistic regression, the third SVM model had the best training AUC (0.715), but its class weights are skewed heavily toward 1 and the model is not easy to interpret visually. Trees had lower AUC scores than SVM and logistic regression. Logistic regression was the best model for this dataset because its AUC was almost as high as the SVM's (0.6856 and 0.6610) with only a slightly higher MR; moreover, we did not have to assign any class weights, and its coefficients and estimates are interpretable.
Our prediction was that the linear model and the Classification Tree would be the best models. However, Logistic Regression performed better than the Tree, and we did not have to put in any weights. The linear model's MSE was too high and its R^2 too low for it to be a good model. Therefore, Logistic Regression is the best model for this dataset.