We plan to explore the Spotify dataset to understand music consumption patterns and preferences. We expect this analysis to reveal which song characteristics are associated with popularity on Spotify, which can benefit artists, music labels, and the platform itself. These findings can inform marketing strategies, playlist duration, and content creation.
Variables we plan to use:
First, we will explore the data using summary statistics and visualizations to identify trends, correlations, and patterns. Next, we will preprocess the data by handling any missing or inconsistent values. Then, we will fit a linear regression model, interpret its coefficients, and evaluate it on test data. After that, we will fit KNN models using the kknn package, testing different values of k and evaluating each model using mean squared error (MSE). Finally, we will compare the performance of the linear regression and KNN models and use them to answer our problem statement.
This will help the consumer understand user preferences, trends, and the factors influencing song popularity, which is crucial for both the music industry and for artists.
suppressPackageStartupMessages(library(tidyverse, quietly = TRUE))
suppressPackageStartupMessages(library(corrplot, quietly = TRUE))
suppressPackageStartupMessages(library(kknn, quietly = TRUE))
suppressPackageStartupMessages(library(psych, quietly = TRUE))
# Set the working directory for the midterm project
setwd("D:/Documents/School/Fall 2023/Data Mining for Bus Analytics/Midterm_Project")
We obtained the original data from Github: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md
The original purpose of the data was to use audio features to explore and classify songs, and it was collected in 2020. The original dataset had 23 variables. One peculiarity is that missing values appear to have been recorded as 0.
The following section consists of analyzing and investigating the data sets and summaries using Exploratory Data Analysis (EDA).
# Load the dataset
spotify <- read_csv("spotify_songs.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The head function displays the first few rows of the dataset. The structure function displays information about the type of variable, components, names, and first few values.
head(spotify)
## # A tibble: 6 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5 67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson 70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm… 60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## # playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## # playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
str(spotify)
## spc_tbl_ [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
## - attr(*, "spec")=
## .. cols(
## .. track_id = col_character(),
## .. track_name = col_character(),
## .. track_artist = col_character(),
## .. track_popularity = col_double(),
## .. track_album_id = col_character(),
## .. track_album_name = col_character(),
## .. track_album_release_date = col_character(),
## .. playlist_name = col_character(),
## .. playlist_id = col_character(),
## .. playlist_genre = col_character(),
## .. playlist_subgenre = col_character(),
## .. danceability = col_double(),
## .. energy = col_double(),
## .. key = col_double(),
## .. loudness = col_double(),
## .. mode = col_double(),
## .. speechiness = col_double(),
## .. acousticness = col_double(),
## .. instrumentalness = col_double(),
## .. liveness = col_double(),
## .. valence = col_double(),
## .. tempo = col_double(),
## .. duration_ms = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The summary function reports each variable's type and, for numeric variables, its minimum, quartiles, mean, and maximum.
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
The new dataset below includes the numeric variables as well as the track artist, playlist genre, and playlist subgenre variables, which we expect to have a large impact on track popularity. We cannot yet include every artist in the analysis because there are too many of them, so for now we restrict the data to five selected artists. In the future, the plan is to keep the more popular artists and categorize the remaining artists as “other”.
new <- spotify[c("track_popularity", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms", "track_artist", "playlist_genre", "playlist_subgenre")]
selected_artists <- c("Drake", "Don Omar", "The Weeknd", "David Guetta", "The Chainsmokers")
new <- new %>%
filter(track_artist %in% selected_artists)
summary(new)
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.214 Min. :0.1560 Min. : 0.000
## 1st Qu.:35.00 1st Qu.:0.571 1st Qu.:0.5930 1st Qu.: 1.000
## Median :59.00 Median :0.663 Median :0.7210 Median : 5.000
## Mean :51.27 Mean :0.653 Mean :0.7111 Mean : 5.143
## 3rd Qu.:70.00 3rd Qu.:0.758 3rd Qu.:0.8630 3rd Qu.: 8.000
## Max. :98.00 Max. :0.928 Max. :0.9950 Max. :11.000
## loudness mode speechiness acousticness
## Min. :-17.515 Min. :0.0000 Min. :0.0255 Min. :0.0000312
## 1st Qu.: -7.080 1st Qu.:0.0000 1st Qu.:0.0432 1st Qu.:0.0179000
## Median : -5.609 Median :1.0000 Median :0.0610 Median :0.0760000
## Mean : -5.820 Mean :0.5225 Mean :0.1020 Mean :0.1314883
## 3rd Qu.: -4.083 3rd Qu.:1.0000 3rd Qu.:0.1230 3rd Qu.:0.1800000
## Max. : -1.304 Max. :1.0000 Max. :0.5290 Max. :0.9510000
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.0258 Min. :0.0350 Min. : 74.63
## 1st Qu.:0.0000000 1st Qu.:0.1020 1st Qu.:0.3235 1st Qu.: 96.22
## Median :0.0000094 Median :0.1340 Median :0.4220 Median :120.12
## Mean :0.0205983 Mean :0.2009 Mean :0.4611 Mean :121.63
## 3rd Qu.:0.0007175 3rd Qu.:0.3055 3rd Qu.:0.6040 3rd Qu.:133.34
## Max. :0.9180000 Max. :0.8570 Max. :0.9650 Max. :203.59
## duration_ms track_artist playlist_genre playlist_subgenre
## Min. :106333 Length:511 Length:511 Length:511
## 1st Qu.:196818 Class :character Class :character Class :character
## Median :214354 Mode :character Mode :character Mode :character
## Mean :226749
## 3rd Qu.:244960
## Max. :486773
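The plan to keep the more popular artists and group the rest as “other” could be sketched with `fct_lump_n()` from forcats (part of the tidyverse). This is only an illustration: the artist vector below is made up, and `n = 2` is an arbitrary choice, not a value taken from the analysis.

```r
library(forcats)

# Hypothetical artist column; the real data would use spotify$track_artist.
artists <- c("Drake", "Drake", "Drake",
             "The Weeknd", "The Weeknd",
             "Artist A", "Artist B")

# Keep the n most frequent artists and collapse everyone else into "Other".
lumped <- fct_lump_n(factor(artists), n = 2, other_level = "Other")

print(table(lumped))
```

The resulting factor has a manageable number of levels, so it can be passed to `lm()` like any other categorical predictor.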
We create a new dataset that uses only the numeric variables from the original Spotify dataset.
num_spotify <- spotify[sapply(spotify, is.numeric)]
The cor function computes the pairwise correlation coefficients between the variables in the dataset.
corr_matrix <- cor(num_spotify)
print(corr_matrix)
## track_popularity danceability energy key
## track_popularity 1.0000000000 0.064747671 -0.109111533 -0.0006503533
## danceability 0.0647476713 1.000000000 -0.086073156 0.0117364748
## energy -0.1091115325 -0.086073156 1.000000000 0.0100516957
## key -0.0006503533 0.011736475 0.010051696 1.0000000000
## loudness 0.0576870774 0.025335088 0.676624523 0.0009586305
## mode 0.0106365762 -0.058647400 -0.004799733 -0.1740929567
## speechiness 0.0068194421 0.181721334 -0.032149611 0.0226069895
## acousticness 0.0851593365 -0.024519058 -0.539744630 0.0043058583
## instrumentalness -0.1498724125 -0.008655078 0.033246579 0.0059678178
## liveness -0.0545844404 -0.123859417 0.161223049 0.0028871809
## valence 0.0332313281 0.330523257 0.151103304 0.0199139115
## tempo -0.0053780630 -0.184084351 0.149951107 -0.0133701991
## duration_ms -0.1436823496 -0.096878789 0.012611444 0.0151393092
## loudness mode speechiness acousticness
## track_popularity 0.0576870774 0.010636576 0.006819442 0.085159337
## danceability 0.0253350882 -0.058647400 0.181721334 -0.024519058
## energy 0.6766245234 -0.004799733 -0.032149611 -0.539744630
## key 0.0009586305 -0.174092957 0.022606990 0.004305858
## loudness 1.0000000000 -0.019289482 0.010338981 -0.361638165
## mode -0.0192894815 1.000000000 -0.063512355 0.009415361
## speechiness 0.0103389807 -0.063512355 1.000000000 0.026091985
## acousticness -0.3616381651 0.009415361 0.026091985 1.000000000
## instrumentalness -0.1478240185 -0.006740665 -0.103424193 -0.006850273
## liveness 0.0776126010 -0.005548974 0.055425906 -0.077243449
## valence 0.0533835553 0.002614470 0.064659103 -0.016844738
## tempo 0.0937673598 0.014329047 0.044603290 -0.112723913
## duration_ms -0.1150575031 0.015633730 -0.089430567 -0.081580676
## instrumentalness liveness valence tempo
## track_popularity -0.149872413 -0.054584440 0.03323133 -0.005378063
## danceability -0.008655078 -0.123859417 0.33052326 -0.184084351
## energy 0.033246579 0.161223049 0.15110330 0.149951107
## key 0.005967818 0.002887181 0.01991391 -0.013370199
## loudness -0.147824018 0.077612601 0.05338356 0.093767360
## mode -0.006740665 -0.005548974 0.00261447 0.014329047
## speechiness -0.103424193 0.055425906 0.06465910 0.044603290
## acousticness -0.006850273 -0.077243449 -0.01684474 -0.112723913
## instrumentalness 1.000000000 -0.005507043 -0.17540218 0.023335266
## liveness -0.005507043 1.000000000 -0.02055977 0.021017804
## valence -0.175402179 -0.020559772 1.00000000 -0.025732148
## tempo 0.023335266 0.021017804 -0.02573215 1.000000000
## duration_ms 0.063234740 0.006138455 -0.03222518 -0.001411828
## duration_ms
## track_popularity -0.143682350
## danceability -0.096878789
## energy 0.012611444
## key 0.015139309
## loudness -0.115057503
## mode 0.015633730
## speechiness -0.089430567
## acousticness -0.081580676
## instrumentalness 0.063234740
## liveness 0.006138455
## valence -0.032225183
## tempo -0.001411828
## duration_ms 1.000000000
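Because the full matrix is hard to scan, it can help to pull out only the column for the response and sort it by absolute value. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, so that the block runs on its own), with `mpg` playing the role of `track_popularity`.

```r
# Stand-in data: mtcars replaces num_spotify so the example is self-contained.
corr_matrix <- cor(mtcars)

# Correlations of every variable with the response, dropping the response itself.
response <- "mpg"  # would be "track_popularity" for the Spotify data
corr_with_response <- corr_matrix[rownames(corr_matrix) != response, response]

# Sort by strength of association, regardless of sign.
ranked <- corr_with_response[order(abs(corr_with_response), decreasing = TRUE)]
print(round(ranked, 3))
```

Applied to the Spotify matrix, the same steps would list the features in order of how strongly they correlate with popularity.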
The function below is testing for any missing values in the dataset.
colSums(is.na(num_spotify))
## track_popularity danceability energy key
## 0 0 0 0
## loudness mode speechiness acousticness
## 0 0 0 0
## instrumentalness liveness valence tempo
## 0 0 0 0
## duration_ms
## 0
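A count of zero `NA` values does not rule out missing data that was recorded as 0. One quick way to look for that is to count exact zeros per column. The sketch below uses a small made-up data frame (hypothetical values) in place of `num_spotify`.

```r
# Hypothetical toy data standing in for num_spotify.
toy <- data.frame(
  tempo            = c(0, 121.9, 99.9, 133.9),
  danceability     = c(0, 0.672, 0.563, 0.761),
  instrumentalness = c(0, 0, 0.0048, 0)
)

# Number of NA values per column (what colSums(is.na(...)) reports).
na_counts <- colSums(is.na(toy))

# Number of exact zeros per column. A zero tempo is a likely placeholder,
# while a zero instrumentalness can be a genuine value.
zero_counts <- colSums(toy == 0)

print(rbind(na = na_counts, zero = zero_counts))
```

Whether a zero is a placeholder or a real measurement depends on the variable, so the counts are a prompt for inspection rather than a rule.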
The boxplot below shows the outliers for each variable in the dataset. Outliers exist in the following variables: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, tempo, and duration_ms. We handled them by truncation: values beyond a chosen threshold were replaced with that threshold rather than dropped, so the number of rows is unchanged.
boxplot(num_spotify, las=2, cex.axis=0.6)
# Truncate energy
num_spotify$energy[num_spotify$energy <= 0.2] <- 0.2
# Truncate danceability
num_spotify$danceability[num_spotify$danceability <= 0.3] <- 0.3
# Truncate loudness
num_spotify$loudness[num_spotify$loudness >= 0] <-0
num_spotify$loudness[num_spotify$loudness <= -13] <- -13
# Truncate speechiness
num_spotify$speechiness[num_spotify$speechiness >= 0.22] <- 0.22
# Truncate acousticness
num_spotify$acousticness[num_spotify$acousticness >= 0.6] <- 0.6
# Truncate instrumentalness
num_spotify$instrumentalness[num_spotify$instrumentalness >= 0.012] <- 0.012
# Truncate liveness
num_spotify$liveness[num_spotify$liveness >= 0.45] <- 0.45
# Truncate tempo
num_spotify$tempo[num_spotify$tempo >= 175] <- 175
num_spotify$tempo[num_spotify$tempo <= 50] <- 50
# Truncate duration_ms
num_spotify$duration_ms[num_spotify$duration_ms >= 350000] <- 350000
num_spotify$duration_ms[num_spotify$duration_ms <= 100000] <- 100000
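The capping steps above all follow the same pattern, so they could be wrapped in a small helper. This is a sketch only: `cap_values` is a hypothetical name, and the example vector is made up, although the bounds of 50 and 175 match the tempo thresholds used above.

```r
# Hypothetical helper: clamp x to the interval [lower, upper].
# pmax() raises values below `lower`; pmin() lowers values above `upper`.
cap_values <- function(x, lower = -Inf, upper = Inf) {
  pmin(pmax(x, lower), upper)
}

# Made-up tempo values, including one below and one above the bounds.
tempo <- c(0, 99.9, 121.9, 239.4)
capped <- cap_values(tempo, lower = 50, upper = 175)
print(capped)
```

With the helper, each variable needs one line, for example `num_spotify$tempo <- cap_values(num_spotify$tempo, 50, 175)`.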
Below is a boxplot of the truncated dataset, showing that the extreme values have been capped.
boxplot(num_spotify, las=2, cex.axis=0.6)
The dataset is now clean. The final data set is shown in the table below.
knitr::kable(head(num_spotify[,1:13]), "pipe")
| track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 66 | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 67 | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 70 | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 60 | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 69 | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 67 | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |
describe(num_spotify)
## vars n mean sd median trimmed mad
## track_popularity 1 32833 42.48 24.98 45.00 43.00 26.69
## danceability 2 32833 0.66 0.14 0.67 0.66 0.15
## energy 3 32833 0.70 0.18 0.72 0.71 0.19
## key 4 32833 5.37 3.61 6.00 5.35 4.45
## loudness 5 32833 -6.63 2.70 -6.17 -6.41 2.52
## mode 6 32833 0.57 0.50 1.00 0.58 0.00
## speechiness 7 32833 0.09 0.07 0.06 0.08 0.04
## acousticness 8 32833 0.16 0.19 0.08 0.13 0.11
## instrumentalness 9 32833 0.00 0.00 0.00 0.00 0.00
## liveness 10 32833 0.18 0.12 0.13 0.16 0.07
## valence 11 32833 0.51 0.23 0.51 0.51 0.27
## tempo 12 32833 120.44 25.82 121.98 119.12 26.75
## duration_ms 13 32833 223818.33 53113.48 216000.00 220382.28 47246.01
## min max range skew kurtosis se
## track_popularity 0.0 1.00e+02 1.00e+02 -0.23 -0.93 0.14
## danceability 0.3 9.80e-01 6.80e-01 -0.41 -0.31 0.00
## energy 0.2 1.00e+00 8.00e-01 -0.58 -0.24 0.00
## key 0.0 1.10e+01 1.10e+01 -0.02 -1.31 0.02
## loudness -13.0 0.00e+00 1.30e+01 -0.65 -0.15 0.01
## mode 0.0 1.00e+00 1.00e+00 -0.27 -1.93 0.00
## speechiness 0.0 2.20e-01 2.20e-01 0.96 -0.59 0.00
## acousticness 0.0 6.00e-01 6.00e-01 1.17 0.10 0.00
## instrumentalness 0.0 1.00e-02 1.00e-02 1.17 -0.53 0.00
## liveness 0.0 4.50e-01 4.50e-01 1.04 -0.12 0.00
## valence 0.0 9.90e-01 9.90e-01 -0.01 -0.90 0.00
## tempo 50.0 1.75e+02 1.25e+02 0.34 -0.43 0.14
## duration_ms 100000.0 3.50e+05 2.50e+05 0.53 0.05 293.12
In the summary statistics table for our cleaned data set, our key takeaways of the descriptive statistics for the variables of concern are:
To uncover information in the data that is not self-evident, we have applied, and will continue to apply, a number of methods, including visualizations, truncation, and model testing. In addition to the approaches already applied, we will test different parameter values for our linear regression and KNN models and different splits of the training and test data.
Below is a histogram of the “Track Popularity” variable.
ggplot(spotify, aes(x=track_popularity)) +
geom_histogram(binwidth=5, fill="blue", alpha=0.7) +
ggtitle("Histogram") +
xlab("Track Popularity") +
ylab("Frequency")
ggplot(new, aes(x = playlist_genre)) +
geom_bar(fill="blue", color="black", alpha=0.7) +
labs(
title="Bar Chart of Playlist Genres",
x="Playlist (Genre)",
y="Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5))
ggplot(new, aes(x = playlist_subgenre)) +
geom_bar(fill = "blue", color = "black", alpha = 0.7) +
coord_flip() +
labs(
title = "Bar Chart of Playlist SubGenres",
y = "Count",
x = "Subgenre"
) +
theme_minimal() +
theme(axis.text.y = element_text(angle = 0, hjust = 0.5))
ggplot(new, aes(x = track_artist)) +
geom_bar(fill="blue", color="black", alpha=0.7) +
labs(
title="Bar Chart of Track Artists",
x="Track Artist",
y="Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5))
ggplot(spotify, aes(x=loudness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Loudness and Popularity")
ggplot(spotify, aes(x=tempo, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) + ggtitle("Tempo and Popularity")
ggplot(spotify, aes(x=speechiness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Speechiness and Popularity")
ggplot(spotify, aes(x=danceability, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Danceability and Popularity")
ggplot(spotify, aes(x=energy, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Energy and Popularity")
ggplot(spotify, aes(x=liveness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Liveness and Popularity")
ggplot(spotify, aes(x=acousticness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Acousticness and Popularity")
ggplot(spotify, aes(x=duration_ms, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Duration and Popularity")
ggplot(spotify, aes(x=instrumentalness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Instrumentalness and Popularity")
ggplot(spotify, aes(x=valence, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
ggtitle("Valence and Popularity")
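The scatter plots above repeat the same call once per feature. An alternative is to reshape the data to long form and let `facet_wrap()` draw one panel per feature. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, so that the block is self-contained), with `mpg` in place of `track_popularity`.

```r
library(ggplot2)
library(tidyr)

# Stand-in data: mtcars replaces spotify; wt, hp, disp stand in for audio features.
wide <- mtcars[, c("mpg", "wt", "hp", "disp")]

# One row per (observation, feature) pair; the response column is kept as-is.
long <- pivot_longer(wide, cols = -mpg,
                     names_to = "feature", values_to = "value")

# One panel per feature, each with its own x-axis range.
p <- ggplot(long, aes(x = value, y = mpg)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ feature, scales = "free_x") +
  labs(x = "Feature value", y = "Response")

# Printing p (for example with print(p)) draws the faceted figure.
```

This keeps every panel on the same y-axis, which makes it easier to compare how each feature relates to the response.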
spotify_piechart <- data.frame(
variable = c("Valence", "Tempo", "Energy" , "Loudness"),
significance = c(0.15, 0.25, 0.20, 0.10)
)
spotify_piechart <- spotify_piechart %>%
arrange(desc(significance))
total_significance <- sum(spotify_piechart$significance)
spotify_piechart$percentage <- (spotify_piechart$significance / total_significance) * 100
ggplot(spotify_piechart, aes(x = "", y = percentage, fill = variable)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(
title = "Percentage Effect of Variables on Track Popularity",
fill = "Variable"
) +
theme_minimal() +
theme(legend.position = "right")
Below are visualization techniques that show the correlation between the dataset variables.
corr_matrix <- cor(num_spotify)
print(corr_matrix)
## track_popularity danceability energy key
## track_popularity 1.0000000000 0.065252575 -0.109780603 -0.0006503533
## danceability 0.0652525748 1.000000000 -0.091867642 0.0121403872
## energy -0.1097806032 -0.091867642 1.000000000 0.0098638427
## key -0.0006503533 0.012140387 0.009863843 1.0000000000
## loudness 0.0613782905 0.006647159 0.675069403 -0.0019776668
## mode 0.0106365762 -0.058102892 -0.003992262 -0.1740929567
## speechiness 0.0114869489 0.228969318 -0.001970227 0.0241236557
## acousticness 0.0917312887 0.002973148 -0.516121648 0.0054816873
## instrumentalness -0.1769045504 -0.044823526 0.077965754 0.0116707469
## liveness -0.0497280028 -0.123897962 0.174063032 0.0035209045
## valence 0.0332313281 0.329985784 0.149958408 0.0199139115
## tempo -0.0058760602 -0.176672607 0.156966992 -0.0136459690
## duration_ms -0.1421762934 -0.100706773 0.008993375 0.0153729324
## loudness mode speechiness acousticness
## track_popularity 0.061378290 0.010636576 0.011486949 0.091731289
## danceability 0.006647159 -0.058102892 0.228969318 0.002973148
## energy 0.675069403 -0.003992262 -0.001970227 -0.516121648
## key -0.001977667 -0.174092957 0.024123656 0.005481687
## loudness 1.000000000 -0.019990599 0.046304697 -0.324549507
## mode -0.019990599 1.000000000 -0.073720939 0.003658047
## speechiness 0.046304697 -0.073720939 1.000000000 0.023875232
## acousticness -0.324549507 0.003658047 0.023875232 1.000000000
## instrumentalness -0.151080835 -0.012133441 -0.160679795 -0.086739727
## liveness 0.096364878 -0.006556692 0.059462137 -0.087174106
## valence 0.044065911 0.002614470 0.070500612 0.008816439
## tempo 0.098546824 0.013802062 0.031490178 -0.117857457
## duration_ms -0.129157665 0.015229474 -0.096154178 -0.074329309
## instrumentalness liveness valence tempo
## track_popularity -0.1769045504 -0.0497280028 0.033231328 -0.005876060
## danceability -0.0448235263 -0.1238979617 0.329985784 -0.176672607
## energy 0.0779657539 0.1740630316 0.149958408 0.156966992
## key 0.0116707469 0.0035209045 0.019913911 -0.013645969
## loudness -0.1510808355 0.0963648780 0.044065911 0.098546824
## mode -0.0121334411 -0.0065566915 0.002614470 0.013802062
## speechiness -0.1606797952 0.0594621373 0.070500612 0.031490178
## acousticness -0.0867397269 -0.0871741062 0.008816439 -0.117857457
## instrumentalness 1.0000000000 -0.0004094716 -0.165237727 0.039792129
## liveness -0.0004094716 1.0000000000 -0.019332272 0.024927249
## valence -0.1652377269 -0.0193322723 1.000000000 -0.029128791
## tempo 0.0397921287 0.0249272491 -0.029128791 1.000000000
## duration_ms 0.0903161849 -0.0090259013 -0.021256337 -0.005129168
## duration_ms
## track_popularity -0.142176293
## danceability -0.100706773
## energy 0.008993375
## key 0.015372932
## loudness -0.129157665
## mode 0.015229474
## speechiness -0.096154178
## acousticness -0.074329309
## instrumentalness 0.090316185
## liveness -0.009025901
## valence -0.021256337
## tempo -0.005129168
## duration_ms 1.000000000
corrplot(corr_matrix, method="circle", type="upper", order="hclust",
tl.col="black", tl.srt=45)
corr_data <- as.data.frame(corr_matrix)
corr_data$row <- rownames(corr_matrix)
corr_data_long <- gather(corr_data, key = "column", value = "correlation", -row)
ggplot(data = corr_data_long, aes(x = row, y = column, fill = correlation)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red") +
theme_minimal() +
labs(title = "Correlation Heatmap")
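Since correlations run from -1 to 1 with 0 meaning no linear association, a diverging colour scale centred on 0 is often easier to read than a two-colour gradient. The sketch below is one possible variant, using `pivot_longer()` and `scale_fill_gradient2()`; the `mtcars` columns are a stand-in for `num_spotify`.

```r
library(ggplot2)
library(tidyr)

# Stand-in data: a few mtcars columns replace num_spotify.
corr_matrix <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])

# Reshape the matrix to one row per (row, column) pair.
corr_data <- as.data.frame(corr_matrix)
corr_data$row <- rownames(corr_matrix)
corr_long <- pivot_longer(corr_data, cols = -row,
                          names_to = "column", values_to = "correlation")

# White at 0, blue for negative and red for positive correlations.
p <- ggplot(corr_long, aes(x = row, y = column, fill = correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  theme_minimal() +
  labs(title = "Correlation Heatmap")
```

Fixing the limits at -1 and 1 keeps the colours comparable between heatmaps drawn before and after truncation.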
The types of plots and tables that we will use to help illustrate our findings for our questions include:
One variable that is not currently in our finalized dataset but is likely to have a big impact on track popularity is track artist. Currently, we include only the numerical variables from the Spotify dataset, but the ultimate goal is to incorporate track artist, a character variable, possibly through one-hot encoding or another technique we learn in the future.
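One-hot encoding of a character variable can be sketched in base R with `model.matrix()`. The data frame below is made up for illustration; only the column name `track_artist` comes from the dataset.

```r
# Hypothetical toy data with one character column and one numeric column.
toy <- data.frame(
  track_artist = c("Drake", "The Weeknd", "Drake", "Other"),
  danceability = c(0.75, 0.60, 0.68, 0.55)
)

# "- 1" drops the intercept so that every artist gets its own 0/1 column.
one_hot <- model.matrix(~ track_artist - 1, data = toy)

# Combine the indicator columns with the numeric predictors.
encoded <- cbind(toy["danceability"], one_hot)
print(encoded)
```

Note that `lm()` builds indicator columns for factor or character predictors on its own, so explicit encoding mainly matters for distance-based methods such as KNN.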
The current plan is to split the dataset, using 70% for training and 30% for testing. Moving forward, the goal is to experiment with this ratio to improve the performance of our models. Performance may also improve by adding interaction terms to the analysis, as well as the track artist variable.
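Trying several train/test ratios can be automated with a small function. The sketch below uses simulated data and a hypothetical helper named `test_mse_for_split`; neither comes from the analysis itself.

```r
set.seed(25)

# Simulated data (an assumption): a response y driven by two predictors.
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 * sim$x1 - sim$x2 + rnorm(n)

# Hypothetical helper: fit lm() on a random training fraction and
# return the MSE on the held-out rows.
test_mse_for_split <- function(data, train_frac) {
  idx <- sample(seq_len(nrow(data)), size = floor(train_frac * nrow(data)))
  fit <- lm(y ~ ., data = data[idx, ])
  pred <- predict(fit, newdata = data[-idx, ])
  mean((pred - data$y[-idx])^2)
}

fracs <- c(0.6, 0.7, 0.8)
split_mse <- sapply(fracs, function(f) test_mse_for_split(sim, f))
names(split_mse) <- paste0(fracs * 100, "% train")
print(round(split_mse, 3))
```

A single random split is noisy, so differences between ratios are more convincing when they persist across several seeds.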
The models we have tried so far are linear regression and KNN. In the linear regression model, the Residuals vs Fitted plot shows a roughly even spread of positive and negative residuals, which suggests there is no strong systematic bias in how the model relates our dependent variable (track_popularity) to the independent variables (danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms). The fit itself is weak, however: the model explains only about 7.5% of the variance in track popularity. In addition, the assumption of homoscedasticity appears reasonable, since the residuals are scattered fairly evenly; this is best judged from the Residuals vs Fitted and Scale-Location plots rather than the Residuals vs Leverage plot.
In the KNN model, as the value of k increases, the MSE rises on the training dataset but falls on the testing dataset. A low training MSE at small k therefore reflects overfitting rather than a better model: small values of k fit the training data very flexibly but generalize poorly, so k should be chosen based on the testing MSE.
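The comparison of training and testing MSE across values of k can be sketched as a loop over `kknn()` fits. The data below is simulated and the grid of k values is arbitrary; both are assumptions made so that the block runs on its own.

```r
library(kknn)

set.seed(25)

# Simulated data (an assumption): two predictors and a noisy linear response.
n <- 200
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n, sd = 0.3)

# 70/30 split, mirroring the ratio used in the report.
idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[idx, ]
test <- sim[-idx, ]

# For each k, predict on the training rows and on the held-out rows.
k_values <- c(1, 5, 15, 30)
mse_by_k <- sapply(k_values, function(k) {
  fit_train <- kknn(y ~ ., train = train, test = train, k = k)
  fit_test <- kknn(y ~ ., train = train, test = test, k = k)
  c(train = mean((fit_train$fitted.values - train$y)^2),
    test = mean((fit_test$fitted.values - test$y)^2))
})
colnames(mse_by_k) <- paste0("k=", k_values)
print(round(mse_by_k, 4))
```

When the training data is also passed as `test`, each point counts itself among its nearest neighbours, so the training MSE is optimistic at small k and the testing row is the one to compare across models.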
This step divides the data into training data and testing data. The model learns the relationship between the variables from the training data, and the testing data is used to check how well it generalizes.
# Set the seed for reproducibility
set.seed(25)
# Randomly sample row indices for the training set
train_indices <- sample(1:NROW(num_spotify),NROW(num_spotify)*0.70)
# Create the training set
train_data <- num_spotify[train_indices, ]
# Create the testing set
test_data <- num_spotify[-train_indices, ]
# Train the linear regression model
lm_model <- lm(track_popularity ~ ., data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = track_popularity ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.897 -17.411 2.986 18.879 62.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.923e+01 2.047e+00 38.701 < 2e-16 ***
## danceability 4.445e+00 1.269e+00 3.501 0.000464 ***
## energy -2.951e+01 1.450e+00 -20.356 < 2e-16 ***
## key 4.101e-02 4.452e-02 0.921 0.356991
## loudness 1.580e+00 8.519e-02 18.550 < 2e-16 ***
## mode 9.425e-01 3.264e-01 2.887 0.003888 **
## speechiness -1.272e+01 2.507e+00 -5.074 3.93e-07 ***
## acousticness 2.875e+00 9.976e-01 2.881 0.003963 **
## instrumentalness -6.081e+02 3.479e+01 -17.482 < 2e-16 ***
## liveness -4.948e+00 1.378e+00 -3.590 0.000331 ***
## valence 3.509e+00 7.638e-01 4.594 4.37e-06 ***
## tempo 2.702e-02 6.312e-03 4.281 1.87e-05 ***
## duration_ms -4.815e-05 3.068e-06 -15.697 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.02 on 22970 degrees of freedom
## Multiple R-squared: 0.07494, Adjusted R-squared: 0.07446
## F-statistic: 155.1 on 12 and 22970 DF, p-value: < 2.2e-16
According to our linear regression, the variables that have a high impact when it comes to track popularity are:
Below we manually check the results of the linear regression model by showing the first six records of the actual and predicted values.
# Create a data frame to compare actual and predicted values
comparison_df <- data.frame(Actual = train_data$track_popularity, lm_predicted =lm_model$fitted.values)
head(comparison_df)
## Actual lm_predicted
## 1 55 35.66503
## 2 63 39.93547
## 3 12 41.02927
## 4 63 36.84513
## 5 0 35.93214
## 6 57 49.87456
The diagnostic plots are used to check whether the data meet the assumptions of the linear regression model.
par(mfrow = c(2,2))
plot(lm_model)
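Each of the four default panels can also be drawn on its own through the `which` argument of `plot()` for `lm` objects, which makes it clearer which assumption each plot addresses. The sketch below fits a small model on the built-in `mtcars` data as a stand-in for `lm_model`.

```r
# Stand-in model: mtcars replaces train_data so the example is self-contained.
fit <- lm(mpg ~ wt + hp, data = mtcars)

pdf(NULL)  # discard graphics output; drop this line in an interactive session
par(mfrow = c(2, 2))
plot(fit, which = 1)  # Residuals vs Fitted: linearity, mean-zero residuals
plot(fit, which = 2)  # Normal Q-Q: normality of residuals
plot(fit, which = 3)  # Scale-Location: constant variance (homoscedasticity)
plot(fit, which = 5)  # Residuals vs Leverage: influential observations
dev.off()
```

The Scale-Location panel is the usual place to look for homoscedasticity, while the Residuals vs Leverage panel highlights individual points that pull strongly on the fit.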
Once the linear regression model is fitted, the in-sample MSE, or training MSE, can be computed as shown below.
lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 576.46"
Similarly, the Out-of-sample MSE, or Testing MSE, can be calculated.
# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate the testing MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 577.42"
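MSE is expressed in squared popularity units, which makes it hard to interpret directly. Taking the square root gives the RMSE, which is on the same 0 to 100 scale as `track_popularity`. The sketch below simply reuses the two MSE values printed above; the variable names are illustrative.

```r
# MSE values copied from the printed output of the linear model.
mse_train <- 576.46
mse_test <- 577.42

# RMSE is the square root of MSE, in the units of the response.
rmse <- sqrt(c(train = mse_train, test = mse_test))
print(round(rmse, 2))
```

Both values come out at roughly 24 popularity points, which is close to the standard deviation of `track_popularity` reported by `describe()` (24.98). This suggests the model improves only slightly on always predicting the mean.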
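Based on the linear regression results, we add interaction terms between selected pairs of variables. Both models below are fit on the full num_spotify dataset rather than on the training set.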
spotify_lm_it2 <- lm(track_popularity ~ . + tempo*valence + speechiness*liveness, data = num_spotify)
summary(spotify_lm_it2)
##
## Call:
## lm(formula = track_popularity ~ . + tempo * valence + speechiness *
##     liveness, data = num_spotify)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -59.586 -17.552   2.911  18.772  64.108
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           8.259e+01  2.242e+00  36.836  < 2e-16 ***
## danceability          5.376e+00  1.066e+00   5.043 4.61e-07 ***
## energy               -2.879e+01  1.211e+00 -23.770  < 2e-16 ***
## key                   4.942e-02  3.730e-02   1.325   0.1852
## loudness              1.644e+00  7.155e-02  22.981  < 2e-16 ***
## mode                  6.678e-01  2.727e-01   2.449   0.0143 *
## speechiness          -7.076e+00  3.711e+00  -1.907   0.0565 .
## acousticness          3.278e+00  8.317e-01   3.941 8.13e-05 ***
## instrumentalness     -6.221e+02  2.912e+01 -21.359  < 2e-16 ***
## liveness             -1.852e+00  1.992e+00  -0.930   0.3525
## valence              -4.118e+00  2.845e+00  -1.448   0.1478
## tempo                -6.844e-03  1.311e-02  -0.522   0.6015
## duration_ms          -4.872e-05  2.569e-06 -18.969  < 2e-16 ***
## valence:tempo         5.697e-02  2.278e-02   2.501   0.0124 *
## speechiness:liveness -3.258e+01  1.668e+01  -1.953   0.0508 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.01 on 32818 degrees of freedom
## Multiple R-squared: 0.0765, Adjusted R-squared: 0.07611
## F-statistic: 194.2 on 14 and 32818 DF, p-value: < 2.2e-16
spotify_lm_it <- lm(track_popularity ~ . + acousticness*danceability + energy*loudness, data = num_spotify)
summary(spotify_lm_it)
##
## Call:
## lm(formula = track_popularity ~ . + acousticness * danceability +
##     energy * loudness, data = num_spotify)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -60.678 -17.627   2.961  18.809  60.167
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                9.342e+01  2.349e+00  39.768  < 2e-16 ***
## danceability               1.496e+00  1.337e+00   1.119 0.263281
## energy                    -4.423e+01  2.278e+00 -19.419  < 2e-16 ***
## key                        4.555e-02  3.726e-02   1.222 0.221617
## loudness                   2.973e+00  1.878e-01  15.829  < 2e-16 ***
## mode                       7.086e-01  2.726e-01   2.599 0.009351 **
## speechiness               -1.278e+01  2.093e+00  -6.105 1.04e-09 ***
## acousticness              -6.860e+00  3.283e+00  -2.090 0.036649 *
## instrumentalness          -5.865e+02  2.944e+01 -19.919  < 2e-16 ***
## liveness                  -4.826e+00  1.152e+00  -4.188 2.82e-05 ***
## valence                    2.550e+00  6.398e-01   3.985 6.77e-05 ***
## tempo                      2.306e-02  5.298e-03   4.352 1.35e-05 ***
## duration_ms               -5.067e-05  2.576e-06 -19.669  < 2e-16 ***
## danceability:acousticness  1.670e+01  4.877e+00   3.425 0.000616 ***
## energy:loudness           -1.972e+00  2.578e-01  -7.652 2.04e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.99 on 32818 degrees of freedom
## Multiple R-squared: 0.07828, Adjusted R-squared: 0.07789
## F-statistic: 199.1 on 14 and 32818 DF, p-value: < 2.2e-16
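In an R formula, `a * b` expands to `a + b + a:b`, so when the main effects are already included through `.`, only the `a:b` product term is new. The sketch below illustrates this on simulated data and shows one possible way to test whether an interaction improves the fit, using a partial F-test with `anova()`. The data frame `sim` and the names `x1`, `x2`, `fit_main`, and `fit_int` are hypothetical and are not part of our Spotify analysis.

```r
# Simulated data with a true interaction effect (illustration only)
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + 0.5 * sim$x1 * sim$x2 + rnorm(n)

# Model with main effects only, and model with the interaction added
fit_main <- lm(y ~ x1 + x2, data = sim)
fit_int  <- lm(y ~ x1 * x2, data = sim)

# x1 * x2 expands to x1 + x2 + x1:x2
print(names(coef(fit_int)))

# Partial F-test: does adding x1:x2 significantly reduce the residual error?
print(anova(fit_main, fit_int))
```

The same comparison could in principle be run between a main-effects model and each of the interaction models above, provided both are fit on the same data.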
Train the KNN model on the training dataset.
# Train the KNN model
knn_model <- kknn(track_popularity ~ ., train = train_data, test = train_data, k = 5)
Again, we can manually check the results of the KNN model. The table below shows the first six records, with the linear model's predicted values, the KNN predicted values, and the actual values side by side.
# Add the KNN predictions to the existing comparison data frame
comparison_df$knn_std_predicted <- knn_model$fitted.values
head(comparison_df)
## Actual lm_predicted knn_std_predicted
## 1 55 35.66503 45.099896
## 2 63 39.93547 35.110958
## 3 12 41.02927 34.827113
## 4 63 36.84513 48.693988
## 5 0 35.93214 5.970955
## 6 57 49.87456 64.861724
Now that the KNN model is complete, the in-sample MSE and the out-of-sample MSE can be calculated.
# Predict on training data
knn_train_pred <- fitted.values(knn_model)
# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))
## [1] "In-Sample MSE for KNN: 179.138485697501"
# Predict on testing data
knn_model_test <- kknn(track_popularity ~ ., train = train_data, test = test_data, k = 5)
knn_test_pred <- fitted.values(knn_model_test)
# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))
## [1] "Out-of-Sample MSE for KNN: 611.882895509132"
To check that our model is as accurate as possible, we ran the KNN model with different values of k and compared the resulting MSE values.
# Initialize a dataframe to store MSE for each k
mse_df <- data.frame(k = integer(), MSE_train = numeric(), MSE_test = numeric())
for (k in 1:5) {
  # Fit the k-NN model using training data
  knn_model_train <- kknn(track_popularity ~ ., train = train_data, test = train_data, k = k)
  # Calculate the training MSE
  mse_train <- mean((knn_model_train$fitted.values - train_data$track_popularity)^2)
  # Test the k-NN with testing data
  knn_model_test <- kknn(track_popularity ~ ., train = train_data, test = test_data, k = k)
  mse_test <- mean((knn_model_test$fitted.values - test_data$track_popularity)^2)
  # Save the results
  mse_df <- rbind(mse_df, data.frame(k = k, MSE_train = mse_train, MSE_test = mse_test))
}
# Show the MSE dataframe
print(mse_df)
## k MSE_train MSE_test
## 1 1 28.36980 832.0829
## 2 2 45.63493 729.9191
## 3 3 94.42580 667.4293
## 4 4 140.30341 632.3014
## 5 5 179.13849 611.8829
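The test MSE in the table is still decreasing at k = 5, so a wider range of k may be worth checking. Choosing k by test MSE also reuses the test set for tuning. One alternative, sketched below on simulated data, is the leave-one-out cross-validation offered by `train.kknn()` in the kknn package, which searches over k using the training data only. The data frame `sim`, the object `cv_fit`, and the choice `kmax = 25` are assumptions made for illustration; we did not run this on the Spotify data.

```r
library(kknn)

# Simulated non-linear data (illustration only)
set.seed(42)
n <- 300
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- sin(2 * pi * sim$x1) + sim$x2 + rnorm(n, sd = 0.3)

# Leave-one-out cross-validation over k = 1, ..., kmax
# (kernel left at the package default, as in the kknn() calls above)
cv_fit <- train.kknn(y ~ ., data = sim, kmax = 25)

# k with the lowest cross-validated mean squared error
print(cv_fit$best.parameters$k)

# Cross-validated MSE for every k that was tried
print(cv_fit$MEAN.SQU)
```

The selected k could then be passed to `kknn()` and evaluated once on the held-out test data.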
We did not use all the variables in the Spotify dataset. Because of limited storage in RStudio and the limitations of our computers, we reduced the dataset to its numeric variables only. Restricting the dataset to numeric variables allowed us to continue with our research.
Theoretically, the KNN model should fit our data best. KNN makes no assumptions about the form of the relationships between variables, and our data included several variables that we expected to produce non-linear relationships. KNN is also more flexible, so with the varied variables we had, it could in theory capture such patterns better. An ideal linear regression model, by contrast, assumes homoscedasticity, linearity, normality of the residuals, and low correlation among the independent variables. We did not expect our data to show linearity, given the wide variability of our values, and we expected high correlation between some of the independent variables.
In practice, the linear regression model fit our data best. We assigned 70% of the data to the training set and fit the model on it. The training and testing MSE values were close (576.46 for training and 577.42 for testing), indicating that the model is not overfitting. Our scatter plot showed moderate linearity. Our histogram had a left tail (negative skew), but our residuals vs. fitted plot showed moderately random scatter. For the KNN model, by contrast, the low MSE on the training data (179.14) combined with the much higher MSE on the test data (611.88) suggested overfitting. Among the values we tested (k = 1 to 5), k = 5 appeared to be the best choice because it gave the lowest test MSE (611.88). Its training MSE (179.14) was the highest of the five, and the test MSE was still decreasing at k = 5, so larger values of k might perform better. KNN seemed poorly fitted to the dataset, as its test MSE was higher than that of the linear regression. The best in-sample performance came from the KNN model (training MSE of 179.14 at k = 5), and the best out-of-sample performance came from the linear regression (test MSE of 577.42). The evaluation metric we used was the mean squared error (MSE).
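The overfitting argument rests on the gap between training and test MSE. As a rough illustration, the sketch below re-enters the four MSE values reported above and computes that gap for each model. The data frame name `mse_reported` is hypothetical and used only here.

```r
# MSE values copied from the results reported above
# (hypothetical data frame, used only for this illustration)
mse_reported <- data.frame(
  model = c("Linear regression", "KNN (k = 5)"),
  train = c(576.46, 179.14),
  test  = c(577.42, 611.88)
)

# Generalization gap: how much worse the model does on unseen data
mse_reported$gap <- mse_reported$test - mse_reported$train
print(mse_reported)
```

The gap is under 1 for the linear regression and over 430 for KNN, which is the pattern described above.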
While conducting linear regression, we found that track popularity is individually and significantly associated with danceability, energy, loudness, speechiness, instrumentalness, liveness, valence, tempo, and duration_ms. After adding interaction terms between tempo and valence and between speechiness and liveness, we found that these pairs have only a weak joint effect on track popularity: the tempo and valence interaction is significant only at the 5% level (p = 0.0124), and the speechiness and liveness interaction is marginal (p = 0.0508). By comparison, the interaction of acousticness and danceability (p = 0.0006) and the interaction of energy and loudness (p = 2.04e-14) are both highly significant.