Introduction

Packages Required

library(tidyverse) # data import, tidying, manipulation, and visualization
library(ggplot2) # statistical graphics (also attached by tidyverse)
library(kknn) # k-nearest neighbor classification and regression
library(corrplot) # graphical display of a correlation matrix
library(readr) # fast reading of rectangular data (also attached by tidyverse)

Data Preparation

Data Loading
spotify <- read_csv("/Users/evabeyebach/Desktop/Projects/spotify.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Examination

### Checking dimension of Data
dim(spotify)
## [1] 32833    23

The original dataset contains 32833 rows and 23 columns.

# show first 6 rows
head(spotify)
## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5                   67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson               70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm…               60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>
#### Checking column name
names(spotify)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

Data Cleaning

### Checking structure of Data
str(spotify)
## spc_tbl_ [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The output shows the structure of the data. track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre, and playlist_subgenre are character variables. track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms are numeric. We need to convert track_album_release_date to a date variable. We will also convert playlist_genre to a factor for later plotting.
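As a small illustration of the date conversion described above, the sketch below parses a toy vector with an explicit format. The values and the name raw_dates are made up for illustration. The point is that strings that do not match the format (here a hypothetical year-only entry) become NA rather than raising an error, so it is worth counting NA values after converting.

```r
# Toy release-date strings; the last one is a hypothetical year-only value
raw_dates <- c("2019-06-14", "2019-12-13", "2012")

# Parse with an explicit format; non-matching strings become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Count how many values failed to parse
n_failed <- sum(is.na(parsed))
print(parsed)
print(n_failed)
```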

# Modifying Data Types
spotify$track_album_release_date<- as.Date(spotify$track_album_release_date)
spotify$playlist_genre<-as.factor(spotify$playlist_genre)
#summary statistics
summary(spotify)
##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##                                                                           
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Min.   :1957-01-01      
##  Class :character   Class :character   1st Qu.:2010-12-04      
##  Mode  :character   Mode  :character   Median :2017-01-27      
##                                        Mean   :2012-09-09      
##                                        3rd Qu.:2019-05-16      
##                                        Max.   :2020-01-29      
##                                        NA's   :1886            
##  playlist_name      playlist_id        playlist_genre playlist_subgenre 
##  Length:32833       Length:32833       edm  :6043     Length:32833      
##  Class :character   Class :character   latin:5155     Class :character  
##  Mode  :character   Mode  :character   pop  :5507     Mode  :character  
##                                        r&b  :5431                       
##                                        rap  :5746                       
##                                        rock :4951                       
##                                                                         
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##                                                                        
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##                                                                        
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810  
## 

The summary displays the Min, Q1, Median, Mean, Q3, and Max of each variable. We can already see that there are probably some outliers and that some variables have a very large Max relative to their quartiles (duration_ms has a Max of 517810; tempo has a Max of 239.44). We will apply truncation, winsorization, or standardization to see how it affects the model.

# let's look at frequency tables for the discrete variables
table(spotify$track_popularity)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2703  575  387  321  240  240  192  189  201  195  174  172  161  207  201  190 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##  219  206  242  205  209  228  207  228  243  242  272  271  266  277  345  323 
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
##  351  377  388  433  431  435  483  459  486  442  428  464  472  505  430  496 
##   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63 
##  465  497  498  514  506  472  514  492  497  541  503  467  514  492  470  483 
##   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79 
##  424  462  441  468  425  443  410  408  339  357  353  306  334  326  224  265 
##   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95 
##  172  167  126  183  122  120   91   89  104   27   59   58   27   44   37   15 
##   96   97   98   99  100 
##    7   22   36    4    2
table(spotify$playlist_genre)
## 
##   edm latin   pop   r&b   rap  rock 
##  6043  5155  5507  5431  5746  4951
table(spotify$playlist_subgenre)
## 
##                album rock                  big room              classic rock 
##                      1065                      1206                      1296 
##                 dance pop             electro house                electropop 
##                      1298                      1511                      1408 
##              gangster rap                 hard rock                   hip hop 
##                      1458                      1485                      1322 
##                   hip pop           indie poptimism             latin hip hop 
##                      1256                      1672                      1656 
##                 latin pop                  neo soul            new jack swing 
##                      1262                      1637                      1133 
##            permanent wave                   pop edm             post-teen pop 
##                      1105                      1517                      1129 
## progressive electro house                 reggaeton          southern hip hop 
##                      1809                       949                      1675 
##                      trap                  tropical        urban contemporary 
##                      1291                      1288                      1405
table(spotify$key)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11 
## 3454 4010 2827  913 2201 2680 2670 3352 2430 3027 2273 2996
table(spotify$mode)
## 
##     0     1 
## 14259 18574

We can see that mode is a binary variable. playlist_genre and playlist_subgenre are categorical variables describing the genre of music.

# lets look at specific data types and class
str(spotify$track_popularity)
##  num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
class(spotify$track_popularity)
## [1] "numeric"
#Looking for duplicates
dups_id <- sum(duplicated(spotify$track_id))
print(dups_id)
## [1] 4477

We can see that 4477 rows repeat a track_id that already appears earlier in the dataset. We will remove these duplicates before further analysis.

spotify_dups = spotify[duplicated(spotify$track_id),]
spotify = spotify[!duplicated(spotify$track_id),]

We moved the duplicate rows into a separate dataset called spotify_dups and removed them from the current dataset.
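To make the behaviour of duplicated() concrete, here is a minimal sketch on a toy data frame. The ids and playlist names are invented for illustration; only the duplicated() logic mirrors the step above.

```r
# Toy data: track "a" appears twice (hypothetical ids and playlists)
tracks <- data.frame(
  track_id = c("a", "b", "a", "c"),
  playlist = c("p1", "p1", "p2", "p2")
)

# duplicated() is FALSE for the first occurrence and TRUE for later repeats
is_dup <- duplicated(tracks$track_id)

removed <- tracks[is_dup, ]   # later occurrences only
kept <- tracks[!is_dup, ]     # first occurrence of each track_id

print(kept)
print(removed)
```

Because the first occurrence is the one kept, a track that appears in several playlists retains the playlist fields of whichever row comes first in the file.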

#looking for missing values
sum(is.na(spotify))
## [1] 1693

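There are 1693 missing values in this dataset, not counting the removed duplicates. Many of them are likely in track_album_release_date, where summary() reported 1886 NA values before the duplicates were removed. We will not remove them, since the affected rows might still be important for the analysis.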
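A single total does not show where the missing values sit. The sketch below, on a made-up data frame, counts NA values per column with colSums(); the column names and values are assumptions for illustration only.

```r
# Toy data frame with missing values in two columns (made-up values)
df <- data.frame(
  name = c("x", NA, "z"),
  date = as.Date(c("2020-01-01", NA, NA)),
  score = c(1, 2, 3)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Total, which is what sum(is.na(df)) reports
total_na <- sum(na_per_column)
print(total_na)
```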

Data Visualization

Visual histogram exploration to understand the data set.
ggplot(spotify, aes(x= track_popularity)) +
  geom_histogram(binwidth=5, color="darkblue", fill="lightblue") +
  ggtitle("Popularity Distribution") +
  xlab("Popularity") +
  ylab("Frequency")

hist(spotify$danceability, col = 'blue', border = "black", xlab = 'danceability', ylab = 'Frequency', main = 'danceability Distribution')

hist(spotify$energy, col = 'blue', border = "black", xlab = 'energy', ylab = 'Frequency', main = 'energy Distribution')

hist(spotify$key, col = 'blue', border = "black", xlab = 'key', ylab = 'Frequency', main = 'key Distribution')

hist(spotify$loudness, col = 'blue', border = "black", xlab = 'loudness', ylab = 'Frequency', main = 'loudness distribution')

hist(spotify$mode, col = 'blue', border = "black", xlab = 'mode', ylab = 'Frequency', main = 'mode distribution')

hist(spotify$valence, col = 'blue', border = "black", xlab = 'valence', ylab = 'Frequency', main = 'valence Distribution')

hist(spotify$speechiness, col = 'blue', border = "black", xlab = 'speechiness', ylab = 'Frequency', main = 'speechiness Distribution')

hist(spotify$acousticness, col = 'blue', border = "black", xlab = 'acousticness', ylab = 'Frequency', main = 'acousticness Distribution')

hist(spotify$liveness, col = 'blue', border = "black", xlab = 'liveness', ylab = 'Frequency', main = 'liveness Distribution')

hist(spotify$instrumentalness, col = 'blue', border = "black", xlab = 'instrumentalness', ylab = 'Frequency', main = 'instrumentalness Distribution')

hist(spotify$tempo, col = 'blue', border = "black", xlab = 'tempo', ylab = 'Frequency', main = 'tempo Distribution')

hist(spotify$duration_ms, col = 'blue', border = "black", xlab = 'duration_ms', ylab = 'Frequency', main = 'duration Distribution')

plot(spotify$playlist_genre, col = 'blue', border = "black", xlab = 'Genre' , ylab = "Frequencies")

After plotting the histograms we can observe the following distributions:

  • Duration, tempo, and valence are approximately normally distributed.
  • Danceability, energy, and loudness are left-skewed.
  • Acousticness, speechiness, and liveness are right-skewed.
  • By genre, edm is the most common.
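The visual impression of skew can be backed by a number. The sketch below defines a moment-based skewness function (a hypothetical helper, not taken from a package) and applies it to simulated data; negative values indicate a left skew and positive values a right skew.

```r
# Moment-based sample skewness: negative = left-skewed, positive = right-skewed
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
right_skewed <- rexp(1000)     # long right tail
left_skewed <- -right_skewed   # mirror image, long left tail
symmetric <- rnorm(1000)

print(sapply(list(right = right_skewed, left = left_skewed, sym = symmetric),
             skewness))
```

The same function could be applied to columns such as spotify$danceability to check the observations listed above.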

Visualization scatterplots based on genre
ggplot(spotify, aes(x=tempo, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) + ggtitle("tempo and popularity") 

ggplot(spotify, aes(x=danceability, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("danceability and popularity") 

ggplot(spotify, aes(x=energy, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("energy and popularity") 

ggplot(spotify, aes(x=loudness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("loudness and popularity")  

ggplot(spotify, aes(x=speechiness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("speechiness and popularity") 

ggplot(spotify, aes(x=acousticness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("acousticness and popularity") 

ggplot(spotify, aes(x=instrumentalness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("instrumentalness and popularity") 

ggplot(spotify, aes(x=liveness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("liveness and popularity") 

ggplot(spotify, aes(x=valence, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("valence and popularity") 

ggplot(spotify, aes(x=duration_ms, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("duration and popularity") 

From these scatterplots, we can observe how every numerical variable relates to track_popularity. Each point is colored by genre (edm, latin, pop, r&b, rap, rock). From these visualizations, cluster analysis and a KNN model look like promising ways to analyze this data. Also, most of the points with higher popularity (closer to 100) are the green dots, which are the pop genre. Edm is usually plotted below 75, meaning those songs are less popular.
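The scatterplots can be complemented with correlation coefficients between each feature and the target. The sketch below uses simulated stand-in data: the column names echo the dataset, but the values and the relationship between them are made up for illustration.

```r
set.seed(42)
n <- 500
# Simulated stand-in data; values are made up, only the column names echo the dataset
toy <- data.frame(energy = runif(n), loudness = rnorm(n, mean = -7, sd = 3))
toy$track_popularity <- 40 + 2 * toy$loudness + rnorm(n, sd = 10)

# Pearson correlation of every feature with the target
features <- setdiff(names(toy), "track_popularity")
cor_with_target <- sapply(toy[features], cor, y = toy$track_popularity,
                          use = "complete.obs")
print(round(cor_with_target, 2))
```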

Visualization Boxplots
boxplot(track_popularity ~ playlist_genre , data = spotify,
  main = "Popular genre",
  ylab = "Popularity",
  xlab = "Genre",
  col = "yellow")

boxplot( track_popularity ~ mode, data = spotify,
  main = "Popular Mode",
  ylab = "Popularity",
  xlab = "mode",
  col = "yellow")

boxplot(spotify$danceability,
  main = "Boxplot distribution of Danceability",
  col = "yellow")

boxplot(spotify$energy,
  main = "Boxplot distribution of energy",
  col = "yellow")

boxplot(spotify$key,
  main = "Boxplot distribution of key",
  col = "yellow")

boxplot(spotify$loudness,
  main = "Boxplot distribution of loudness",
  col = "yellow")

boxplot(spotify$mode,
  main = "Boxplot distribution of mode",
  col = "yellow")

boxplot(spotify$speechiness,
  main = "Boxplot distribution of speechiness",
  col = "yellow")

boxplot(spotify$acousticness,
  main = "Boxplot distribution of acousticness",
  col = "yellow")

boxplot(spotify$instrumentalness,
  main = "Boxplot distribution of instrumentalness",
  col = "yellow")

boxplot(spotify$liveness,
  main = "Boxplot distribution of liveness",
  col = "yellow")

boxplot(spotify$valence,
  main = "Boxplot distribution of valence",
  col = "yellow")

boxplot(spotify$tempo,
  main = "Boxplot distribution of tempo",
  col = "yellow")

boxplot(spotify$duration_ms,
  main = "Boxplot distribution of duration",
  col = "yellow")

From the boxplots we can observe that a lot of variables (danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, duration) have outliers. Removing them would influence the analysis a lot.
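The points a boxplot draws as outliers can also be counted directly. A minimal sketch on a toy vector, using boxplot.stats() from base R, which flags values more than 1.5 times the hinge spread beyond the hinges:

```r
# Toy vector with one extreme value
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() returns the points beyond the whiskers in $out
outliers <- boxplot.stats(x)$out
print(outliers)

# Share of observations flagged as outliers
outlier_share <- length(outliers) / length(x)
print(outlier_share)
```

Applied to a column such as spotify$duration_ms, the share gives a sense of how many rows a capping rule would touch.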

Let's create a new dataset with winsorized and truncated variables to reduce the effect of outliers.

spotify_copy <- spotify
# truncation (capping at fixed thresholds): danceability, energy, speechiness, acousticness, instrumentalness and liveness
spotify_copy$danceability[spotify_copy$danceability <= 0.28] <- 0.28

spotify_copy$energy[spotify_copy$energy <= 0.2] <- 0.2

spotify_copy$speechiness[spotify_copy$speechiness >= 0.27] <- 0.27

spotify_copy$acousticness[spotify_copy$acousticness >= 0.6] <- 0.6

spotify_copy$instrumentalness[spotify_copy$instrumentalness >= 0.015] <- 0.015

spotify_copy$liveness[spotify_copy$liveness >= 0.4] <- 0.4
# winsorization loudness
# Calculate the 5th and 95th percentiles for 'loudness'
lower_bound_loudness <- quantile(spotify_copy$loudness, 0.05, na.rm = TRUE)
upper_bound_loudness <- quantile(spotify_copy$loudness, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$loudness[spotify_copy$loudness < lower_bound_loudness] <- lower_bound_loudness
spotify_copy$loudness[spotify_copy$loudness > upper_bound_loudness] <- upper_bound_loudness
# winsorization tempo
# Calculate the 5th and 95th percentiles for 'tempo'
lower_bound_tempo <- quantile(spotify_copy$tempo, 0.05, na.rm = TRUE)
upper_bound_tempo <- quantile(spotify_copy$tempo, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$tempo[spotify_copy$tempo < lower_bound_tempo] <- lower_bound_tempo
spotify_copy$tempo[spotify_copy$tempo > upper_bound_tempo] <- upper_bound_tempo
#winsorize duration
# Calculate the 5th and 95th percentiles for 'duration'
lower_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.05, na.rm = TRUE)
upper_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$duration_ms[spotify_copy$duration_ms < lower_bound_duration_ms] <- lower_bound_duration_ms
spotify_copy$duration_ms[spotify_copy$duration_ms > upper_bound_duration_ms] <- upper_bound_duration_ms

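We have now capped the outliers in those variables in spotify_copy. The data is cleaned: we have also checked for missing values, removed duplicates, and fixed the data types. Note that the code in the modelling section uses spotify, so the capped values in spotify_copy are not used there.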
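The three winsorization steps above repeat the same pattern, which could be wrapped in one function. The sketch below is one possible way to do so; winsorize is a hypothetical helper name, not a function from a loaded package, and the 5th and 95th percentile defaults simply mirror the code above.

```r
# Cap values below/above the given quantiles (hypothetical helper, not from a package)
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  # pmax/pmin clamp element-wise and leave NA values as NA
  pmin(pmax(x, bounds[[1]]), bounds[[2]])
}

# Toy vector with one extreme value
x <- c(1:99, 1000)
x_w <- winsorize(x)
print(range(x))
print(range(x_w))
```

With such a helper, a call like winsorize(spotify_copy$tempo) would replace the separate quantile and assignment lines for each variable.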

knitr::kable(head(spotify[, 1:23]), "simple")
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre playlist_subgenre danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
6f807x0ima9a1j3VPbc7VN I Don’t Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx I Don’t Care (with Justin Bieber) [Loud Luxury Remix] 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.748 0.916 6 -2.634 1 0.0583 0.1020 0.00e+00 0.0653 0.518 122.036 194754
0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.726 0.815 11 -4.969 1 0.0373 0.0724 4.21e-03 0.3570 0.693 99.972 162600
1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.675 0.931 1 -3.432 0 0.0742 0.0794 2.33e-05 0.1100 0.613 124.008 176616
75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.718 0.930 7 -3.778 1 0.1020 0.0287 9.40e-06 0.2040 0.277 121.956 169093
1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.650 0.833 1 -4.672 1 0.0359 0.0803 0.00e+00 0.0833 0.725 123.976 189052
7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k Beautiful People (feat. Khalid) [Jack Wins Remix] 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.675 0.919 8 -5.385 1 0.1270 0.0799 0.00e+00 0.1430 0.585 124.982 163049

Exploratory Data Analysis

Modelling

Linear Regression model with training and test data

  • We want to predict track popularity and see which variables best predict it. We will therefore first build a linear regression model to see how well it predicts track_popularity, splitting the data into training and test sets.

Data splitting

# Set the seed for reproducibility
set.seed(2023)

# Randomly sample row indices for a 70% training / 30% testing split
train_indices <- sample(1:NROW(spotify),NROW(spotify)*0.70)

# Create the training set
train_data <- spotify[train_indices, ] # before the comma: row selector; after the comma: column selector

# Create the testing set
test_data <- spotify[-train_indices, ]
# Train the linear regression model, comparing popularity to rest of numeric parameters
lm_model <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data)
summary(lm_model)
## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + key + 
##     loudness + mode + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.757 -17.360   2.935  18.119  60.604 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.684e+01  2.045e+00  32.682  < 2e-16 ***
## danceability      4.170e+00  1.288e+00   3.237  0.00121 ** 
## energy           -2.318e+01  1.461e+00 -15.864  < 2e-16 ***
## key               1.398e-02  4.595e-02   0.304  0.76088    
## loudness          1.132e+00  7.802e-02  14.512  < 2e-16 ***
## mode              7.590e-01  3.357e-01   2.261  0.02378 *  
## speechiness      -7.397e+00  1.657e+00  -4.464 8.09e-06 ***
## acousticness      4.934e+00  8.895e-01   5.547 2.95e-08 ***
## instrumentalness -9.426e+00  7.437e-01 -12.676  < 2e-16 ***
## liveness         -4.420e+00  1.077e+00  -4.104 4.08e-05 ***
## valence           1.997e+00  7.861e-01   2.540  0.01110 *  
## tempo             2.808e-02  6.261e-03   4.485 7.33e-06 ***
## duration_ms      -4.277e-05  2.726e-06 -15.689  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.03 on 19836 degrees of freedom
## Multiple R-squared:  0.06003,    Adjusted R-squared:  0.05946 
## F-statistic: 105.6 on 12 and 19836 DF,  p-value: < 2.2e-16
  • The model suggests that all the variables are significant at the 5% level except key, which is the only variable with a p-value higher than 0.05. The model has an R^2 of 0.06003, which means that only about 6% of the variability in track_popularity is explained by the regression. This indicates that it is not a good model.
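
To show where the R^2 figure comes from, the sketch below recomputes it by hand as 1 - RSS/TSS. It uses the built-in mtcars data purely as a stand-in for the Spotify training set, so the numbers differ from those above.

```r
# Built-in mtcars data, used only as a stand-in for the Spotify training set
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r_squared <- 1 - rss / tss

print(r_squared)
print(summary(fit)$r.squared)  # same quantity as reported by summary()
```
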
Manually check the results

We can compile the actual and predicted values and view the first 6 records.

# Create a data frame to compare actual and predicted values
comparison_df <- data.frame(Actual =  train_data$track_popularity, lm_predicted =lm_model$fitted.values)
head(comparison_df)
##   Actual lm_predicted
## 1      0     33.94611
## 2     24     40.51391
## 3     39     41.34777
## 4     63     38.75230
## 5     40     35.35537
## 6     30     28.37513

In-sample (training) MSE

lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 529.94"

Out-of-sample (testing) MSE

# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate out-of-sample MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 527.51"
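
The MSE is expressed in squared popularity points, which is hard to read against the 0 to 100 popularity scale. Taking the square root gives the root mean squared error (RMSE) in the original units. The sketch below does this with the two MSE values printed above; the variable names are chosen for illustration.

```r
# MSE values copied from the linear-model output above
reported_mse_train <- 529.94
reported_mse_test <- 527.51

# RMSE is the square root of the MSE, in the same units as track_popularity
lm_rmse_train <- sqrt(reported_mse_train)
lm_rmse_test <- sqrt(reported_mse_test)

print(round(c(train = lm_rmse_train, test = lm_rmse_test), 2))
```

An RMSE of about 23 popularity points is close to the residual standard error of 23.03 reported by summary(lm_model).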

KNN model

spotify_knn_model <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 5)
# Predict on training data
knn_train_pred <- fitted.values(spotify_knn_model)

# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))
## [1] "In-Sample MSE for KNN:  194.903869196156"
# Predict on testing data
knn_model_test <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 5)

knn_test_pred <- fitted.values(knn_model_test)

# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))
## [1] "Out-of-Sample MSE for KNN:  687.024025645934"
Interpreting the models

We tried linear regression and KNN to see which model would be best for predicting track popularity, using the mean squared error (MSE) to compare them. The linear regression model had an MSE of 529.94 in-sample and 527.51 out-of-sample. The KNN model had an MSE of 194.903 in-sample and 687.024 out-of-sample. Comparing the two, the KNN model performed better than the regression model in-sample, while the regression model had the lower out-of-sample MSE.
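
The gap between the in-sample and out-of-sample KNN errors can be illustrated with a hand-written KNN regression on simulated data. This is a sketch, not the kknn implementation: knn_predict is a hypothetical helper using unweighted Euclidean neighbours, and the data are made up. When the training rows are predicted from themselves, each row is its own nearest neighbour, so the in-sample error is optimistic (exactly zero for k = 1).

```r
# Unweighted k-nearest-neighbour regression (hypothetical helper, base R only)
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    # Euclidean distance from this test row to every training row
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    # Average the targets of the k closest training rows
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Simulated data: the target depends on the first feature plus noise
set.seed(1)
n <- 200
x <- matrix(runif(n * 2), ncol = 2)
y <- 3 * x[, 1] + rnorm(n)

# 70% / 30% split, as in the analysis above
train_idx <- sample(seq_len(n), 140)
train_x <- x[train_idx, ]
train_y <- y[train_idx]
test_x <- x[-train_idx, ]
test_y <- y[-train_idx]

mse <- function(pred, actual) mean((pred - actual)^2)

in_mse_k1 <- mse(knn_predict(train_x, train_y, train_x, k = 1), train_y)
out_mse_k1 <- mse(knn_predict(train_x, train_y, test_x, k = 1), test_y)
print(c(in_sample = in_mse_k1, out_of_sample = out_mse_k1))
```

For this reason the out-of-sample MSE is the more reliable figure when comparing KNN with the linear model.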

We did not use all of the variables in the data set, but we used all of our numeric variables. We decided to go this route because we mostly wanted to explore track popularity and its relationship with the numeric variables in our data set.

Theoretically, the linear regression model would be the preferred model because it is easier to interpret. However, it requires four assumptions to hold (linearity, independence, normality, and homoscedasticity). We can run diagnostic plots to see if our model meets these assumptions.

# diagnostic plot
par(mfrow = c(2, 2))
plot(lm_model)

From the diagnostic plot, we see that the model does not meet the four assumptions, which leads us to the conclusion that the linear regression model is not the best fit for this data.

The model that fits the training data best in practice is the KNN model. KNN gave the best in-sample performance (MSE 194.903), while the linear regression model gave the best out-of-sample performance (MSE 527.51 versus 687.024 for KNN). We used the mean squared error as the evaluation metric, with the lowest value indicating the better model. The low training MSE of the KNN model is what led us to the conclusion that KNN would be better for prediction, although its higher out-of-sample MSE suggests that it overfits with k = 5, so this conclusion should be treated with caution.