The Spotify dataset comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. Kaylin Pavlik wrote a blog post that uses the audio features to explore and classify songs. She used the spotifyr package to collect about 5000 songs from 6 main categories.
The data shows general metadata around songs from Spotify’s API. It shows the song’s popularity and other parameters such as acousticness, danceability, energy, speechiness, valence, key…
With this analysis we are interested in how track popularity is influenced by other attributes such as danceability, loudness, speechiness and valence.
The plan is to analyze the relationship between popularity and different features of a song in order to predict the future popularity of a song. We plan on performing data preparation, EDA and modeling using models such as linear regression, KNN or logistic regression.
This is mainly beneficial for marketing to Spotify customers and improving their experience while using Spotify. Spotify would also be able to provide more accurate predictions of a new song’s potential popularity even before its release.
library(tidyverse) # data import, tidying, manipulation and visualization
library(ggplot2) # statistical graphics (already attached by tidyverse)
library(kknn) # weighted k-nearest neighbors for classification and regression
library(corrplot) # graphical display of a correlation matrix
library(readr) # fast reading of rectangular data (already attached by tidyverse)
spotify <- read_csv("/Users/evabeyebach/Desktop/Projects/spotify.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
### Checking dimension of Data
dim(spotify)
## [1] 32833 23
The original dataset contains 32833 rows and 23 columns.
# show first 6 rows
head(spotify)
## # A tibble: 6 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5 67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson 70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm… 60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## # playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## # playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
#### Checking column names
names(spotify)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
### Checking structure of Data
str(spotify)
## spc_tbl_ [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
## - attr(*, "spec")=
## .. cols(
## .. track_id = col_character(),
## .. track_name = col_character(),
## .. track_artist = col_character(),
## .. track_popularity = col_double(),
## .. track_album_id = col_character(),
## .. track_album_name = col_character(),
## .. track_album_release_date = col_character(),
## .. playlist_name = col_character(),
## .. playlist_id = col_character(),
## .. playlist_genre = col_character(),
## .. playlist_subgenre = col_character(),
## .. danceability = col_double(),
## .. energy = col_double(),
## .. key = col_double(),
## .. loudness = col_double(),
## .. mode = col_double(),
## .. speechiness = col_double(),
## .. acousticness = col_double(),
## .. instrumentalness = col_double(),
## .. liveness = col_double(),
## .. valence = col_double(),
## .. tempo = col_double(),
## .. duration_ms = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The output shows the structure of the data. We can see that track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre and playlist_subgenre are character variables. On the other hand, track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_ms are numeric. We need to change track_album_release_date to a date variable. We will also change playlist_genre to a factor, for future plotting.
# Modifying Data Types
spotify$track_album_release_date <- as.Date(spotify$track_album_release_date)
spotify$playlist_genre <- as.factor(spotify$playlist_genre)
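The summary that follows reports NA values for track_album_release_date after this conversion. One possible cause, which is an assumption here and not something the output confirms, is that some release dates are stored as a year only or as a year and month. A minimal sketch of a helper that pads such values before parsing (parse_release_date is a hypothetical name, not part of the analysis) could look like this:

```r
# Hypothetical helper: pad partial dates before parsing.
# Assumes partial values look like "YYYY" or "YYYY-MM".
parse_release_date <- function(x) {
  n <- nchar(x)
  padded <- ifelse(n == 4, paste0(x, "-01-01"),
                   ifelse(n == 7, paste0(x, "-01"), x))
  as.Date(padded, format = "%Y-%m-%d")
}

example_dates <- c("2019-06-14", "2012", "2012-05")
parsed_dates <- parse_release_date(example_dates)
parsed_dates
```

Padding to the first day of the year or month keeps the rows but invents a day, so it is only appropriate if the analysis needs the year rather than the exact date.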
#summary statistics
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
##
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Min. :1957-01-01
## Class :character Class :character 1st Qu.:2010-12-04
## Mode :character Mode :character Median :2017-01-27
## Mean :2012-09-09
## 3rd Qu.:2019-05-16
## Max. :2020-01-29
## NA's :1886
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 edm :6043 Length:32833
## Class :character Class :character latin:5155 Class :character
## Mode :character Mode :character pop :5507 Mode :character
## r&b :5431
## rap :5746
## rock :4951
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
##
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
##
The summary displays the Min, Q1, Median, Mean, Q3 and Max of each variable. We can already see that there are probably some outliers and that some variables have a Max far above their third quartile (duration_ms has a Max of 517810; tempo has a Max of 239.44). We will try truncation, winsorization or standardization to see how it affects the model.
#lets look at some tables for categorical variables
table(spotify$track_popularity)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2703 575 387 321 240 240 192 189 201 195 174 172 161 207 201 190
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 219 206 242 205 209 228 207 228 243 242 272 271 266 277 345 323
## 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## 351 377 388 433 431 435 483 459 486 442 428 464 472 505 430 496
## 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
## 465 497 498 514 506 472 514 492 497 541 503 467 514 492 470 483
## 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## 424 462 441 468 425 443 410 408 339 357 353 306 334 326 224 265
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
## 172 167 126 183 122 120 91 89 104 27 59 58 27 44 37 15
## 96 97 98 99 100
## 7 22 36 4 2
table(spotify$playlist_genre)
##
## edm latin pop r&b rap rock
## 6043 5155 5507 5431 5746 4951
table(spotify$playlist_subgenre)
##
## album rock big room classic rock
## 1065 1206 1296
## dance pop electro house electropop
## 1298 1511 1408
## gangster rap hard rock hip hop
## 1458 1485 1322
## hip pop indie poptimism latin hip hop
## 1256 1672 1656
## latin pop neo soul new jack swing
## 1262 1637 1133
## permanent wave pop edm post-teen pop
## 1105 1517 1129
## progressive electro house reggaeton southern hip hop
## 1809 949 1675
## trap tropical urban contemporary
## 1291 1288 1405
table(spotify$key)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 3454 4010 2827 913 2201 2680 2670 3352 2430 3027 2273 2996
table(spotify$mode)
##
## 0 1
## 14259 18574
We can see that mode is a binary variable. playlist_genre and playlist_subgenre are categorical variables describing the genre of music.
# lets look at specific data types and class
str(spotify$track_popularity)
## num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
class(spotify$track_popularity)
## [1] "numeric"
#Looking for duplicates
dups_id <- sum(duplicated(spotify$track_id))
print(dups_id)
## [1] 4477
We can see that a lot of songs have been duplicated in this dataset. They have the same track_id. Therefore we will remove them, for further analysis.
spotify_dups = spotify[duplicated(spotify$track_id),]
spotify = spotify[!duplicated(spotify$track_id),]
We moved the duplicate songs to another dataset called spotify_dups and removed them from the current dataset.
#looking for missing values
sum(is.na(spotify))
## [1] 1693
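There are 1693 missing values in this dataset. Judging by the summary above, most of them are likely in track_album_release_date, where the conversion with as.Date() produced NA values. However, we will not remove them, since they might still be important for the analysis.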
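A single total does not say which columns the missing values sit in. As a small sketch on a toy data frame (the toy values are made up for illustration), colSums(is.na(...)) gives a per-column count; the same call could be applied to spotify:

```r
# Toy data frame with missing values in two columns (illustrative values only)
toy <- data.frame(
  track_name = c("a", NA, "c"),
  release = as.Date(c("2019-06-14", NA, NA)),
  energy = c(0.9, 0.8, 0.7)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# The overall total matches sum(is.na(toy))
sum(na_per_column)
```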
ggplot(spotify, aes(x= track_popularity)) +
geom_histogram(binwidth=5, color="darkblue", fill="lightblue") +
ggtitle("Popularity Distribution") +
xlab("Popularity") +
ylab("Frequency")
hist(spotify$danceability, col = 'blue', border = "black", xlab = 'danceability', ylab = 'Frequency', main = 'danceability Distribution')
hist(spotify$energy, col = 'blue', border = "black", xlab = 'energy', ylab = 'Frequency', main = 'energy Distribution')
hist(spotify$key, col = 'blue', border = "black", xlab = 'key', ylab = 'Frequency', main = 'key Distribution')
hist(spotify$loudness, col = 'blue', border = "black", xlab = 'loudness', ylab = 'Frequency', main = 'loudness distribution')
hist(spotify$mode, col = 'blue', border = "black", xlab = 'mode', ylab = 'Frequency', main = 'mode distribution')
hist(spotify$valence, col = 'blue', border = "black", xlab = 'valence', ylab = 'Frequency', main = 'valence Distribution')
hist(spotify$speechiness, col = 'blue', border = "black", xlab = 'speechiness', ylab = 'Frequency', main = 'speechiness Distribution')
hist(spotify$acousticness, col = 'blue', border = "black", xlab = 'acousticness', ylab = 'Frequency', main = 'acousticness Distribution')
hist(spotify$liveness, col = 'blue', border = "black", xlab = 'liveness', ylab = 'Frequency', main = 'liveness Distribution')
hist(spotify$instrumentalness, col = 'blue', border = "black", xlab = 'instrumentalness', ylab = 'Frequency', main = 'instrumentalness Distribution')
hist(spotify$tempo, col = 'blue', border = "black", xlab = 'tempo', ylab = 'Frequency', main = 'tempo Distribution')
hist(spotify$duration_ms, col = 'blue', border = "black", xlab = 'duration_ms', ylab = 'Frequency', main = 'duration Distribution')
plot(spotify$playlist_genre, col = 'blue', border = "black", xlab = 'Genre' , ylab = "Frequencies")
After plotting the histograms we can observe the following distributions:
Duration, tempo and valence are approximately normally distributed.
Danceability, energy and loudness are left-skewed.
Acousticness, speechiness and liveness are right-skewed.
By genre, edm has the most songs.
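The skewness statements above are read off the histograms by eye. A rough numeric check is the moment-based sample skewness, sketched below with a hypothetical helper named sample_skewness: negative values indicate a left skew, positive values a right skew, and values near zero a roughly symmetric shape.

```r
# Hypothetical helper: moment-based sample skewness (third central moment
# divided by the second central moment raised to the power 1.5)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

symmetric_example <- c(1, 2, 3, 4, 5)
right_skewed_example <- c(1, 1, 1, 2, 10)

sample_skewness(symmetric_example)
sample_skewness(right_skewed_example)
```

Applied to the data, a call such as sample_skewness(spotify$loudness) would be expected to return a negative value if the histogram reading is right.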
ggplot(spotify, aes(x=danceability, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("danceability and popularity")
ggplot(spotify, aes(x=energy, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("energy and popularity")
ggplot(spotify, aes(x=loudness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("loudness and popularity")
ggplot(spotify, aes(x=speechiness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("speechiness and popularity")
ggplot(spotify, aes(x=acousticness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("acousticness and popularity")
ggplot(spotify, aes(x=instrumentalness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("instrumentalness and popularity")
ggplot(spotify, aes(x=liveness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("liveness and popularity")
ggplot(spotify, aes(x=valence, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("valence and popularity")
ggplot(spotify, aes(x=tempo, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("tempo and popularity")
ggplot(spotify, aes(x=duration_ms, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("duration and popularity")
From these scatterplots, we can observe how every numerical variable relates to track_popularity. The points are colored by genre (edm, latin, pop, r&b, rap, rock) in every graph. From these visualizations we expect that cluster analysis and a KNN model will probably be the best ways to analyze this data. Also, as we look through the graphs we can see that most of the points with a higher popularity (closer to 100) are the green dots, which are the pop genre. Edm is usually plotted below 75, so those songs are less popular.
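The scatterplots show each relationship one at a time. A correlation matrix summarizes all pairwise linear relationships at once, and the corrplot package loaded in the setup can draw it. The sketch below uses simulated toy columns (not the real data) so that it runs on its own; the corrplot call is left as a comment because it assumes the package is installed.

```r
# Simulated toy data: energy is built to correlate with loudness
set.seed(1)
toy <- data.frame(loudness = rnorm(200))
toy$energy <- 0.7 * toy$loudness + rnorm(200, sd = 0.5)
toy$track_popularity <- rnorm(200)

# Pairwise Pearson correlations, dropping rows with missing values
cor_matrix <- cor(toy, use = "complete.obs")
round(cor_matrix, 2)

# With corrplot installed, the matrix could be drawn with:
# corrplot::corrplot(cor_matrix, method = "circle")
```

For the real data the same call would be applied to the numeric columns of spotify only, since cor() does not accept character or factor columns.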
boxplot(track_popularity ~ playlist_genre , data = spotify,
main = "Popular genre",
ylab = "Popularity",
xlab = "Genre",
col = "yellow")
boxplot( track_popularity ~ mode, data = spotify,
main = "Popular Mode",
ylab = "Popularity",
xlab = "mode",
col = "yellow")
boxplot(spotify$danceability,
main = "Boxplot distribution of Danceability",
col = "yellow")
boxplot(spotify$energy,
main = "Boxplot distribution of energy",
col = "yellow")
boxplot(spotify$key,
main = "Boxplot distribution of key",
col = "yellow")
boxplot(spotify$loudness,
main = "Boxplot distribution of loudness",
col = "yellow")
boxplot(spotify$mode,
main = "Boxplot distribution of mode",
col = "yellow")
boxplot(spotify$speechiness,
main = "Boxplot distribution of speechiness",
col = "yellow")
boxplot(spotify$acousticness,
main = "Boxplot distribution of acousticness",
col = "yellow")
boxplot(spotify$instrumentalness,
main = "Boxplot distribution of instrumentalness",
col = "yellow")
boxplot(spotify$liveness,
main = "Boxplot distribution of liveness",
col = "yellow")
boxplot(spotify$valence,
main = "Boxplot distribution of valence",
col = "yellow")
boxplot(spotify$tempo,
main = "Boxplot distribution of tempo",
col = "yellow")
boxplot(spotify$duration_ms,
main = "Boxplot distribution of duration",
col = "yellow")
From the boxplots we can observe that a lot of variables
(danceability, energy, loudness,
speechiness, acousticness,
instrumentalness, liveness,
duration) have outliers. Removing them would influence the
analysis a lot.
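To put a number on how many points the boxplots flag, the usual 1.5 × IQR rule can be applied directly. This is a sketch with a hypothetical helper named count_iqr_outliers; it uses quantile(), which is close to but not identical to the hinges that boxplot() uses, so the counts may differ slightly from the plots.

```r
# Hypothetical helper: count values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Toy vector with one obvious outlier
toy_values <- c(1, 2, 3, 4, 5, 100)
n_outliers <- count_iqr_outliers(toy_values)
n_outliers
```

A call such as count_iqr_outliers(spotify$duration_ms) would show how many rows a removal step would actually drop.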
Let's create a new dataset with winsorized and truncated variables to reduce the influence of outliers.
spotify_copy <- spotify
# truncation danceability, energy, speechiness, acousticness, instrumentalness and liveness
spotify_copy$danceability[spotify_copy$danceability <= 0.28] <- 0.28
spotify_copy$energy[spotify_copy$energy <= 0.2] <- 0.2
spotify_copy$speechiness[spotify_copy$speechiness >= 0.27] <- 0.27
spotify_copy$acousticness[spotify_copy$acousticness >= 0.6] <- 0.6
spotify_copy$instrumentalness[spotify_copy$instrumentalness >= 0.015] <- 0.015
spotify_copy$liveness[spotify_copy$liveness >= 0.4] <- 0.4
# winsorization loudness
# Calculate the 5th and 95th percentiles for 'loudness'
lower_bound_loudness <- quantile(spotify_copy$loudness, 0.05, na.rm = TRUE)
upper_bound_loudness <- quantile(spotify_copy$loudness, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$loudness[spotify_copy$loudness < lower_bound_loudness] <- lower_bound_loudness
spotify_copy$loudness[spotify_copy$loudness > upper_bound_loudness] <- upper_bound_loudness
# winsorization tempo
# Calculate the 5th and 95th percentiles for 'tempo'
lower_bound_tempo <- quantile(spotify_copy$tempo, 0.05, na.rm = TRUE)
upper_bound_tempo <- quantile(spotify_copy$tempo, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$tempo[spotify_copy$tempo < lower_bound_tempo] <- lower_bound_tempo
spotify_copy$tempo[spotify_copy$tempo > upper_bound_tempo] <- upper_bound_tempo
#winsorize duration
# Calculate the 5th and 95th percentiles for 'duration'
lower_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.05, na.rm = TRUE)
upper_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.95, na.rm = TRUE)
# Winsorize the data
spotify_copy$duration_ms[spotify_copy$duration_ms < lower_bound_duration_ms] <- lower_bound_duration_ms
spotify_copy$duration_ms[spotify_copy$duration_ms > upper_bound_duration_ms] <- upper_bound_duration_ms
Now we have capped the outliers in those variables in spotify_copy; the original spotify data frame is unchanged. The data is cleaned. We have also dealt with duplicates and data types, and we have kept the missing values.
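The three winsorization steps above repeat the same pattern of computing the 5th and 95th percentiles and capping values at them. As a sketch, the pattern can be wrapped in one function (winsorize is a hypothetical helper name, and the 5% and 95% defaults simply mirror the code above):

```r
# Hypothetical helper: cap values below the lower quantile and above the
# upper quantile at those quantiles; NA values stay NA
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- unname(quantile(x, c(lower, upper), na.rm = TRUE))
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Toy vector with one extreme value
toy_values <- c(1:99, 1000)
winsorized <- winsorize(toy_values)
range(winsorized)
```

With this helper, a step such as spotify_copy$tempo <- winsorize(spotify_copy$tempo) would replace the four lines used per variable.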
knitr::kable(head(spotify[, 1:23]), "simple")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
To uncover new information in the data, we first took a look at the descriptive statistics of the continuous variables, which showed us the mean, median, minimum, maximum, and quartiles of each variable in our data. We then used histograms to visualize the distribution of each variable. This allowed us to see the skewness of each variable, which gave us an idea of the characteristics of the data. Our visualizations gave us the clearest insight. The scatter plots allowed us to see the relationships between the variables and the influence that a song’s genre has over that relationship. The box plots allowed us to get a different look at the distribution and see outliers, which can have a major influence over our overall data set. The different ways we can look at this data is from the perspective of
The plots that we can use to illustrate our findings are scatter plots, line charts, and histograms. We already used scatter plots and histograms in the discovery process, but they will be useful to illustrate our findings because they can show evidence of relationships between variables and show the distribution of those variables.
Currently, we do not know how to conduct statistical tests like t-test, ANOVA, and chi-square to be able to test our hypothesis.
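For reference, the tests mentioned above are available in base R through aov() and t.test(). The sketch below runs them on simulated toy data (the group means are invented for illustration, not taken from the Spotify data) to show the shape of the calls; it is a starting point under those assumptions, not a finished analysis.

```r
# Simulated toy data: three genres with different mean popularity
set.seed(42)
toy <- data.frame(
  playlist_genre = factor(rep(c("edm", "pop", "rock"), each = 50)),
  track_popularity = c(rnorm(50, 35, 10), rnorm(50, 50, 10), rnorm(50, 42, 10))
)

# One-way ANOVA: does mean popularity differ between genres?
anova_fit <- aov(track_popularity ~ playlist_genre, data = toy)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Two-sample t-test: compare only two of the genres
two_genres <- droplevels(subset(toy, playlist_genre != "rock"))
t_result <- t.test(track_popularity ~ playlist_genre, data = two_genres)

c(anova_p = anova_p, t_test_p = t_result$p.value)
```

On the real data the same formulas would use spotify in place of toy, for example track_popularity ~ mode for a t-test between the two modes.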
To create new summary information, we plan on narrowing our data down to the variables we really plan on exploring to help us gain insights to our predictions.
We want to predict track_popularity. We will do so by splitting the data into training and test sets.
# Set the seed for reproducibility
set.seed(2023)
# Randomly sample row indices for a 70% training / 30% testing split
train_indices <- sample(1:NROW(spotify), NROW(spotify) * 0.70)
# Create the training set
train_data <- spotify[train_indices, ] # rows are selected before the comma, columns after it
# Create the testing set
test_data <- spotify[-train_indices, ]
# Train the linear regression model, comparing popularity to rest of numeric parameters
lm_model <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + key +
## loudness + mode + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.757 -17.360 2.935 18.119 60.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.684e+01 2.045e+00 32.682 < 2e-16 ***
## danceability 4.170e+00 1.288e+00 3.237 0.00121 **
## energy -2.318e+01 1.461e+00 -15.864 < 2e-16 ***
## key 1.398e-02 4.595e-02 0.304 0.76088
## loudness 1.132e+00 7.802e-02 14.512 < 2e-16 ***
## mode 7.590e-01 3.357e-01 2.261 0.02378 *
## speechiness -7.397e+00 1.657e+00 -4.464 8.09e-06 ***
## acousticness 4.934e+00 8.895e-01 5.547 2.95e-08 ***
## instrumentalness -9.426e+00 7.437e-01 -12.676 < 2e-16 ***
## liveness -4.420e+00 1.077e+00 -4.104 4.08e-05 ***
## valence 1.997e+00 7.861e-01 2.540 0.01110 *
## tempo 2.808e-02 6.261e-03 4.485 7.33e-06 ***
## duration_ms -4.277e-05 2.726e-06 -15.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.03 on 19836 degrees of freedom
## Multiple R-squared: 0.06003, Adjusted R-squared: 0.05946
## F-statistic: 105.6 on 12 and 19836 DF, p-value: < 2.2e-16
We can compile the actual and predicted values and view the first 6 records.
# Create a data frame to compare actual and predicted values
comparison_df <- data.frame(Actual = train_data$track_popularity, lm_predicted = lm_model$fitted.values)
head(comparison_df)
## Actual lm_predicted
## 1 0 33.94611
## 2 24 40.51391
## 3 39 41.34777
## 4 63 38.75230
## 5 40 35.35537
## 6 30 28.37513
lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 529.94"
# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate the test MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 527.51"
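MSE is expressed in squared popularity points, which is hard to read against a 0 to 100 scale. Taking the square root gives the RMSE in the original units. The sketch below defines two small hypothetical helpers (mse and rmse) and converts the test MSE printed above:

```r
# Hypothetical helpers for mean squared error and its square root
mse <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

# Test MSE copied from the linear model output above
lm_rmse_test <- sqrt(527.51)
round(lm_rmse_test, 2)
```

The result is roughly 23 popularity points, close to the residual standard error of 23.03 reported by summary(lm_model), which fits with the low R-squared of about 0.06.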
spotify_knn_model <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 5)
# Predict on training data
knn_train_pred <- fitted.values(spotify_knn_model)
# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))
## [1] "In-Sample MSE for KNN: 194.903869196156"
# Predict on testing data
knn_model_test <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 5)
knn_test_pred <- fitted.values(knn_model_test)
# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))
## [1] "Out-of-Sample MSE for KNN: 687.024025645934"
We tried the linear regression and the KNN model to see which model would be best for predicting track popularity, using the mean squared error to see which one performs better. We found that the MSE for the linear regression model was 529.94 for in-sample testing and 527.51 for out-of-sample testing. We also found that the KNN model has an MSE of 194.903 for the in-sample test and an MSE of 687.024 for the out-of-sample test. When comparing these two models, the KNN model performed better than the regression model in-sample, but the regression model performed better out-of-sample.
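A gap between the in-sample and out-of-sample MSE is expected for KNN to some degree: when the training set is also passed as the test set, every point can be among its own nearest neighbors. The sketch below uses simulated data and a simplified, unweighted KNN with k = 1 (not the weighted scheme that kknn implements; knn_predict is a hypothetical helper) to show the extreme case, where the in-sample error is zero by construction.

```r
# Hypothetical helper: unweighted KNN regression on a single predictor
knn_predict <- function(train_x, train_y, new_x, k) {
  sapply(new_x, function(x0) {
    nearest <- order(abs(train_x - x0))[seq_len(k)]
    mean(train_y[nearest])
  })
}
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Simulated data: a noisy linear relationship
set.seed(2023)
x <- runif(200)
y <- 2 * x + rnorm(200)
train <- 1:140
test <- 141:200

# In-sample: each training point is its own nearest neighbor
in_sample_mse <- mse(y[train], knn_predict(x[train], y[train], x[train], k = 1))
# Out-of-sample: test points are not in the training set
out_sample_mse <- mse(y[test], knn_predict(x[train], y[train], x[test], k = 1))

c(in_sample = in_sample_mse, out_of_sample = out_sample_mse)
```

For this reason the out-of-sample MSE is the fairer number when comparing KNN with the linear model.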
We did not use all of the variables in the data set, but we used all of our numeric audio features. We decided to go this route because we mostly wanted to explore track popularity and the relationship it has with the numeric variables in our data set.
Theoretically, the model that would fit the data the best is the linear regression model because it is easier to interpret. However, it has to meet four assumptions (linearity, independence, normality, and homoscedasticity). We can run diagnostic plots to see if our model meets these assumptions.
# diagnostic plot
par(mfrow = c(2, 2))
plot(lm_model)
From the diagnostic plot, we see that the model does not meet the four assumptions, which leads us to the conclusion that the linear regression model is not the best fit for this data.
The model that fits the training data best in practice is the KNN model. The KNN model gave us the best in-sample performance and the linear regression model gave us the best out-of-sample performance. The evaluation metric we have been using is the mean squared error, where the model with the lowest value is preferred. The training data in the KNN model gave us the lowest MSE (194.903), which is what led us to the conclusion that the KNN model would be better for prediction. However, its out-of-sample MSE (687.024) is higher than that of the linear regression model (527.51), so this conclusion should be treated with caution.