Spotify Analysis

Introduction

The Spotify dataset comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. Kaylin Pavlik had a recent blog post using the audio features to explore and classify songs. She used the spotifyr package to collect about 5000 songs from 6 main categories.
The data shows general metadata around songs from Spotify’s API. It includes each song’s popularity and other attributes such as acousticness, danceability, energy, speechiness, valence and key.
With this analysis we are interested in how track popularity is influenced by other attributes like danceability, loudness, speechiness and valence.
The plan is to analyze the relationship between popularity and the different features of a song in order to predict the future popularity of a song. We plan on performing data preparation, EDA and modelling using models such as linear regression, KNN or logistic regression.
This is mainly beneficial for marketing to Spotify customers and improving their experience while using Spotify. Spotify would also be able to provide more accurate predictions of a new song’s potential popularity even before its release.

Packages Required

library(tidyverse) # data import, tidying, manipulation, and visualization
library(ggplot2) # statistical graphics (also attached by tidyverse)
library(kknn) # weighted k-nearest neighbors for classification and regression
library(corrplot) # graphical display of a correlation matrix
library(readr) # fast reading of rectangular data (also attached by tidyverse)

Data Preparation

Data Loading

spotify <- read_csv("/Users/evabeyebach/Desktop/Projects/spotify.csv")

## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Examination

### Checking dimension of Data
dim(spotify)

## [1] 32833    23

The original dataset contains 32833 rows and 23 columns.

# show first 6 rows (the default for head())
head(spotify)

## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5                   67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson               70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm…               60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>

#### Checking column name
names(spotify)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

Data Cleaning

### Checking structure of Data
str(spotify)

## spc_tbl_ [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The output above shows the structure of the data. We can see that track_id, track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_genre and playlist_subgenre are character variables. On the other hand, track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_ms are numeric. We need to change track_album_release_date to a date variable. We will also change playlist_genre to a factor for future plotting.

# Modifying Data Types
spotify$track_album_release_date<- as.Date(spotify$track_album_release_date)
spotify$playlist_genre<-as.factor(spotify$playlist_genre)

#summary statistics
summary(spotify)

##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##                                                                           
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Min.   :1957-01-01      
##  Class :character   Class :character   1st Qu.:2010-12-04      
##  Mode  :character   Mode  :character   Median :2017-01-27      
##                                        Mean   :2012-09-09      
##                                        3rd Qu.:2019-05-16      
##                                        Max.   :2020-01-29      
##                                        NA's   :1886            
##  playlist_name      playlist_id        playlist_genre playlist_subgenre 
##  Length:32833       Length:32833       edm  :6043     Length:32833      
##  Class :character   Class :character   latin:5155     Class :character  
##  Mode  :character   Mode  :character   pop  :5507     Mode  :character  
##                                        r&b  :5431                       
##                                        rap  :5746                       
##                                        rock :4951                       
##                                                                         
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##                                                                        
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##                                                                        
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810  
##

The summary displays the Min, Q1, Median, Mean, Q3 and Max of each variable. We can already see that there are probably some outliers and that some variables have a very large Max relative to their quartiles (duration_ms has a Max of 517810; tempo has a Max of 239.44). We will apply some truncation, winsorization or standardization to see how it affects the model.

# let's look at frequency tables for the discrete variables
table(spotify$track_popularity)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2703  575  387  321  240  240  192  189  201  195  174  172  161  207  201  190 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##  219  206  242  205  209  228  207  228  243  242  272  271  266  277  345  323 
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
##  351  377  388  433  431  435  483  459  486  442  428  464  472  505  430  496 
##   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63 
##  465  497  498  514  506  472  514  492  497  541  503  467  514  492  470  483 
##   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79 
##  424  462  441  468  425  443  410  408  339  357  353  306  334  326  224  265 
##   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95 
##  172  167  126  183  122  120   91   89  104   27   59   58   27   44   37   15 
##   96   97   98   99  100 
##    7   22   36    4    2

table(spotify$playlist_genre)

## 
##   edm latin   pop   r&b   rap  rock 
##  6043  5155  5507  5431  5746  4951

table(spotify$playlist_subgenre)

## 
##                album rock                  big room              classic rock 
##                      1065                      1206                      1296 
##                 dance pop             electro house                electropop 
##                      1298                      1511                      1408 
##              gangster rap                 hard rock                   hip hop 
##                      1458                      1485                      1322 
##                   hip pop           indie poptimism             latin hip hop 
##                      1256                      1672                      1656 
##                 latin pop                  neo soul            new jack swing 
##                      1262                      1637                      1133 
##            permanent wave                   pop edm             post-teen pop 
##                      1105                      1517                      1129 
## progressive electro house                 reggaeton          southern hip hop 
##                      1809                       949                      1675 
##                      trap                  tropical        urban contemporary 
##                      1291                      1288                      1405

table(spotify$key)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11 
## 3454 4010 2827  913 2201 2680 2670 3352 2430 3027 2273 2996

table(spotify$mode)

## 
##     0     1 
## 14259 18574

We can see that mode is a binary variable. Playlist_genre and playlist_subgenre are categorical variables with the genre of music.

# lets look at specific data types and class
str(spotify$track_popularity)

##  num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...

class(spotify$track_popularity)

## [1] "numeric"

#Looking for duplicates
dups_id <- sum(duplicated(spotify$track_id))
print(dups_id)

## [1] 4477

We can see that many songs are duplicated in this dataset: 4477 rows share a track_id with an earlier row. We will therefore remove them before further analysis.

spotify_dups = spotify[duplicated(spotify$track_id),]
spotify = spotify[!duplicated(spotify$track_id),]

We moved the duplicate songs to another dataset called spotify_dups and removed them from the current dataset.
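As a minimal sketch of how duplicated() behaves, the toy example below uses invented track IDs and playlist labels (they are not taken from the Spotify data). It illustrates that the first occurrence of each ID is kept and only later repeats are flagged.

```r
# Toy data: the IDs and playlist labels are invented for illustration.
tracks <- data.frame(
  track_id = c("a1", "b2", "a1", "c3", "b2"),
  playlist = c("pop", "pop", "edm", "rock", "rap"),
  stringsAsFactors = FALSE
)

# duplicated() is FALSE for the first occurrence of each ID and TRUE for repeats
is_dup <- duplicated(tracks$track_id)

tracks_dups   <- tracks[is_dup, ]   # the repeated rows
tracks_unique <- tracks[!is_dup, ]  # first occurrence of each track

nrow(tracks_unique)
sum(is_dup)
```

One consequence worth keeping in mind: a track that appears in several playlists keeps only the playlist and genre of its first row, so the genre labels after de-duplication depend on row order.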

#looking for missing values
sum(is.na(spotify))

## [1] 1693

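There are 1693 missing values in this dataset. Most of them are probably in track_album_release_date, which already showed NAs in the summary above after the conversion to a date. However, we will not remove them, since the affected rows might still be important for the analysis.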
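A single total does not show where the missing values sit, so a per-column count can help. The sketch below uses an invented three-row data frame. The year-only string "2012" is an assumption about how a release date might fail to parse, not a confirmed property of the Spotify data.

```r
# Toy data: values are invented; "2012" mimics a date given only as a year.
songs <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  release_date = c("2019-06-14", "2012", "2019-07-05"),
  popularity   = c(66, 67, 70),
  stringsAsFactors = FALSE
)

# With an explicit format, strings that do not match the format become NA
songs$release_date <- as.Date(songs$release_date, format = "%Y-%m-%d")

# Missing values per column rather than one grand total
na_per_column <- colSums(is.na(songs))
na_per_column
```

Applied to the real data, colSums(is.na(spotify)) would show how many of the missing values come from the date conversion and how many were already present in the raw file.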

Data Visualization

Visual histogram exploration to understand the data set.

ggplot(spotify, aes(x= track_popularity)) +
  geom_histogram(binwidth=5, color="darkblue", fill="lightblue") +
  ggtitle("Popularity Distribution") +
  xlab("Popularity") +
  ylab("Frequency")

hist(spotify$danceability, col = 'blue', border = "black", xlab = 'danceability', ylab = 'Frequency', main = 'danceability Distribution')

hist(spotify$energy, col = 'blue', border = "black", xlab = 'energy', ylab = 'Frequency', main = 'energy Distribution')

hist(spotify$key, col = 'blue', border = "black", xlab = 'key', ylab = 'Frequency', main = 'key Distribution')

hist(spotify$loudness, col = 'blue', border = "black", xlab = 'loudness', ylab = 'Frequency', main = 'loudness distribution')

hist(spotify$mode, col = 'blue', border = "black", xlab = 'mode', ylab = 'Frequency', main = 'mode distribution')

hist(spotify$valence, col = 'blue', border = "black", xlab = 'valence', ylab = 'Frequency', main = 'valence Distribution')

hist(spotify$speechiness, col = 'blue', border = "black", xlab = 'speechiness', ylab = 'Frequency', main = 'speechiness Distribution')

hist(spotify$acousticness, col = 'blue', border = "black", xlab = 'acousticness', ylab = 'Frequency', main = 'acousticness Distribution')

hist(spotify$liveness, col = 'blue', border = "black", xlab = 'liveness', ylab = 'Frequency', main = 'liveness Distribution')

hist(spotify$instrumentalness, col = 'blue', border = "black", xlab = 'instrumentalness', ylab = 'Frequency', main = 'instrumentalness Distribution')

hist(spotify$tempo, col = 'blue', border = "black", xlab = 'tempo', ylab = 'Frequency', main = 'tempo Distribution')

hist(spotify$duration_ms, col = 'blue', border = "black", xlab = 'duration_ms', ylab = 'Frequency', main = 'duration Distribution')

plot(spotify$playlist_genre, col = 'blue', border = "black", xlab = 'Genre' , ylab = "Frequencies")

After plotting the histograms we can observe the following distribution:

Duration, tempo and valence are approximately normally distributed.
Danceability, energy and loudness are left-skewed.
Acousticness, speechiness and liveness are right-skewed.
By genre, most of the songs are edm.
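The skewness read off the histograms can also be checked numerically. The following is a minimal sketch of a moment-based sample skewness in base R. The helper name skewness and the simulated vectors are illustrative assumptions, not part of the original analysis.

```r
# Moment-based sample skewness:
# negative = left-skewed, positive = right-skewed, near 0 = symmetric
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
right_skewed <- rexp(1000)       # long right tail
left_skewed  <- -right_skewed    # mirror image, long left tail
symmetric    <- c(-2, -1, 0, 1, 2)

skewness(right_skewed)  # positive
skewness(left_skewed)   # negative
skewness(symmetric)     # 0
```

For example, skewness(spotify$danceability) should come out negative if the visual impression of a left skew is right.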

Visualization scatterplots based on genre

ggplot(spotify, aes(x=tempo, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) + ggtitle("tempo and popularity")

ggplot(spotify, aes(x=danceability, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("danceability and popularity")

ggplot(spotify, aes(x=energy, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("energy and popularity")

ggplot(spotify, aes(x=loudness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("loudness and popularity")

ggplot(spotify, aes(x=speechiness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("speechiness and popularity")

ggplot(spotify, aes(x=acousticness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("acousticness and popularity")

ggplot(spotify, aes(x=instrumentalness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("instrumentalness and popularity")

ggplot(spotify, aes(x=liveness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("liveness and popularity")

ggplot(spotify, aes(x=valence, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("valence and popularity")

ggplot(spotify, aes(x=duration_ms, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("duration and popularity")

From these scatterplots, we can observe how every numerical variable relates to track_popularity. The points in every graph are colored by genre (edm, latin, pop, r&b, rap, rock). From these visualizations, cluster analysis and a KNN model look like promising ways to analyze this data. Also, as we look through the graphs we can see that most of the points with a higher popularity (closer to 100) are the green dots, which are the pop genre. Edm is usually plotted below 75, so those songs are less popular.

Visualization Boxplots

boxplot(track_popularity ~ playlist_genre , data = spotify,
  main = "Popular genre",
  ylab = "Popularity",
  xlab = "Genre",
  col = "yellow")

boxplot( track_popularity ~ mode, data = spotify,
  main = "Popular Mode",
  ylab = "Popularity",
  xlab = "mode",
  col = "yellow")

boxplot(spotify$danceability,
  main = "Boxplot distribution of Danceability",
  col = "yellow")

boxplot(spotify$energy,
  main = "Boxplot distribution of energy",
  col = "yellow")

boxplot(spotify$key,
  main = "Boxplot distribution of key",
  col = "yellow")

boxplot(spotify$loudness,
  main = "Boxplot distribution of loudness",
  col = "yellow")

boxplot(spotify$mode,
  main = "Boxplot distribution of mode",
  col = "yellow")

boxplot(spotify$speechiness,
  main = "Boxplot distribution of speechiness",
  col = "yellow")

boxplot(spotify$acousticness,
  main = "Boxplot distribution of acousticness",
  col = "yellow")

boxplot(spotify$instrumentalness,
  main = "Boxplot distribution of instrumentalness",
  col = "yellow")

boxplot(spotify$liveness,
  main = "Boxplot distribution of liveness",
  col = "yellow")

boxplot(spotify$valence,
  main = "Boxplot distribution of valence",
  col = "yellow")

boxplot(spotify$tempo,
  main = "Boxplot distribution of tempo",
  col = "yellow")

boxplot(spotify$duration_ms,
  main = "Boxplot distribution of duration",
  col = "yellow")

From the boxplots we can observe that a lot of variables (danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, duration) have outliers. Removing them would influence the analysis a lot.

Let's create a new dataset with winsorized and truncated variables to reduce the influence of outliers.

spotify_copy <- spotify

# truncation danceability, energy, speechiness, acousticness, instrumentalness and liveness
spotify_copy$danceability[spotify_copy$danceability <= 0.28] <- 0.28

spotify_copy$energy[spotify_copy$energy <= 0.2] <- 0.2

spotify_copy$speechiness[spotify_copy$speechiness >= 0.27] <- 0.27

spotify_copy$acousticness[spotify_copy$acousticness >= 0.6] <- 0.6

spotify_copy$instrumentalness[spotify_copy$instrumentalness >= 0.015] <- 0.015

spotify_copy$liveness[spotify_copy$liveness >= 0.4] <- 0.4

# winsorization loudness
# Calculate the 5th and 95th percentiles for 'loudness'
lower_bound_loudness <- quantile(spotify_copy$loudness, 0.05, na.rm = TRUE)
upper_bound_loudness <- quantile(spotify_copy$loudness, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$loudness[spotify_copy$loudness < lower_bound_loudness] <- lower_bound_loudness
spotify_copy$loudness[spotify_copy$loudness > upper_bound_loudness] <- upper_bound_loudness

# winsorization tempo
# Calculate the 5th and 95th percentiles for 'tempo'
lower_bound_tempo <- quantile(spotify_copy$tempo, 0.05, na.rm = TRUE)
upper_bound_tempo <- quantile(spotify_copy$tempo, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$tempo[spotify_copy$tempo < lower_bound_tempo] <- lower_bound_tempo
spotify_copy$tempo[spotify_copy$tempo > upper_bound_tempo] <- upper_bound_tempo

#winsorize duration
# Calculate the 5th and 95th percentiles for 'duration'
lower_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.05, na.rm = TRUE)
upper_bound_duration_ms <- quantile(spotify_copy$duration_ms, 0.95, na.rm = TRUE)

# Winsorize the data
spotify_copy$duration_ms[spotify_copy$duration_ms < lower_bound_duration_ms] <- lower_bound_duration_ms
spotify_copy$duration_ms[spotify_copy$duration_ms > upper_bound_duration_ms] <- upper_bound_duration_ms

We have now capped the outliers in those variables, with the results stored in spotify_copy. The data is cleaned: we have also examined missing values, removed duplicates, and corrected data types.
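The three winsorization steps repeat the same pattern, so they could be wrapped in a small helper. The function below is a hypothetical sketch (the name winsorize and its defaults are assumptions). It caps values at the chosen quantiles in the same way as the code above.

```r
# Hypothetical helper: cap values below/above the given quantiles
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

x <- c(1:99, 1000)  # toy vector with one extreme value
w <- winsorize(x)

range(x)  # original range, including the extreme value
range(w)  # capped at the 5th and 95th percentiles
```

With this helper, a call such as spotify_copy$loudness <- winsorize(spotify$loudness) would replace the two quantile lines and two assignment lines used for each variable.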

knitr::kable(head(spotify[, 1:23]), "simple")

track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
6f807x0ima9a1j3VPbc7VN	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616
75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093
1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052
7fvUMiyapMsRRxr07cU8Ef	Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	2yiy9cd2QktrNvWC2EUi0k	Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.919	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049

Exploratory Data Analysis

To uncover new information in the data, we first took a look at the descriptive statistics of the continuous variables. This showed us the mean, median, minimum, maximum, and quartiles of each variable in our data. We then used a histogram to visualize the distribution of each variable. This allowed us to see the skewness of each variable, which gave us an idea of the characteristics of the data. Our visualizations gave us the clearest insight. The scatter plots allowed us to see the relationships between the variables and the influence that a song’s genre has over that relationship. The box plots allowed us to get a different look at the distribution and see outliers, which can have a major influence over our overall data set. The different ways we can look at this data is from the perspective of
The plots that we can use to illustrate our findings are scatter plots, line charts, and histograms. We already used scatter plots and histograms in the discovery process, but they will be useful to illustrate our findings because they can show evidence of relationships between variables and show the distribution of those variables.
Currently, we do not know how to conduct statistical tests like the t-test, ANOVA, and chi-square to be able to test our hypothesis.
To create new summary information, we plan on narrowing our data down to the variables we really plan on exploring, to help us gain insights for our predictions.

Modelling

Linear Regression model with training and test data

We want to predict track popularity and see which variables best predict the dependent variable. Therefore we will first build a linear regression model to see how well it predicts track_popularity. We will do so by splitting the data into training and test sets.

Data splitting

# Set the seed for reproducibility
set.seed(2023)

# Randomly sample row indices for the training set split in 70% and 30%
train_indices <- sample(1:NROW(spotify),NROW(spotify)*0.70)

# Create the training set
train_data <- spotify[train_indices, ] # everything before the comma selects rows; everything after selects columns

# Create the testing set
test_data <- spotify[-train_indices, ]

# Train the linear regression model, comparing popularity to rest of numeric parameters
lm_model <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data = train_data)
summary(lm_model)

## 
## Call:
## lm(formula = track_popularity ~ danceability + energy + key + 
##     loudness + mode + speechiness + acousticness + instrumentalness + 
##     liveness + valence + tempo + duration_ms, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.757 -17.360   2.935  18.119  60.604 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.684e+01  2.045e+00  32.682  < 2e-16 ***
## danceability      4.170e+00  1.288e+00   3.237  0.00121 ** 
## energy           -2.318e+01  1.461e+00 -15.864  < 2e-16 ***
## key               1.398e-02  4.595e-02   0.304  0.76088    
## loudness          1.132e+00  7.802e-02  14.512  < 2e-16 ***
## mode              7.590e-01  3.357e-01   2.261  0.02378 *  
## speechiness      -7.397e+00  1.657e+00  -4.464 8.09e-06 ***
## acousticness      4.934e+00  8.895e-01   5.547 2.95e-08 ***
## instrumentalness -9.426e+00  7.437e-01 -12.676  < 2e-16 ***
## liveness         -4.420e+00  1.077e+00  -4.104 4.08e-05 ***
## valence           1.997e+00  7.861e-01   2.540  0.01110 *  
## tempo             2.808e-02  6.261e-03   4.485 7.33e-06 ***
## duration_ms      -4.277e-05  2.726e-06 -15.689  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.03 on 19836 degrees of freedom
## Multiple R-squared:  0.06003,    Adjusted R-squared:  0.05946 
## F-statistic: 105.6 on 12 and 19836 DF,  p-value: < 2.2e-16

The model suggests that all the variables are significant at the 5% level except key, which is the only variable with a p-value higher than 0.05 (0.76). This model has an R^2 of 0.06003, which means that only about 6% of the variability in track popularity can be explained by the regression model. This indicates that it is not a good model.
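To make the meaning of R^2 concrete, the sketch below recomputes it by hand as one minus the ratio of residual to total sum of squares. It uses R's built-in mtcars data purely so that it runs on its own. The variable names are illustrative and the numbers are unrelated to the Spotify model.

```r
# Built-in mtcars data, used only so the sketch is self-contained
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                   # variation left unexplained
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in the response
r_squared <- 1 - rss / tss

n <- nrow(mtcars)
p <- 2                                         # number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

all.equal(r_squared, summary(fit)$r.squared)          # TRUE
all.equal(adj_r_squared, summary(fit)$adj.r.squared)  # TRUE
```

An R^2 of about 0.06 therefore means the residual sum of squares is still about 94% of the total sum of squares.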

Manually check the results

We can compile the actual and predicted values and view the first 6 records.

# Create a data frame to compare actual and predicted values
comparison_df <- data.frame(Actual =  train_data$track_popularity, lm_predicted =lm_model$fitted.values)
head(comparison_df)

##   Actual lm_predicted
## 1      0     33.94611
## 2     24     40.51391
## 3     39     41.34777
## 4     63     38.75230
## 5     40     35.35537
## 6     30     28.37513

In-sample (training) MSE

lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))

## [1] "Training MSE for Linear Model: 529.94"

Out-of-sample (testing) MSE

# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate out-of-sample MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))

## [1] "Testing MSE for Linear Model: 527.51"
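Because MSE is in squared units, its square root (RMSE) is often easier to read. The sketch below defines two small helpers (the names mse and rmse are assumptions) and applies the square root to the two MSE values reported above.

```r
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

# MSE values reported above for the linear model
lm_mse <- c(train = 529.94, test = 527.51)
sqrt(lm_mse)  # roughly 23 popularity points in both cases

# Toy check of the helpers on invented numbers
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)
mse(actual, predicted)
rmse(actual, predicted)
```

An RMSE of roughly 23 on a 0 to 100 popularity scale is in line with the residual standard error of 23.03 in the model summary, and shows how far off a typical prediction is.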

KNN model

spotify_knn_model <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = train_data, k = 5)

# Predict on training data
knn_train_pred <- fitted.values(spotify_knn_model)

# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))

## [1] "In-Sample MSE for KNN:  194.903869196156"

# Predict on testing data
knn_model_test <- kknn(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms, train = train_data, test = test_data, k = 5)

knn_test_pred <- fitted.values(knn_model_test)

# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))

## [1] "Out-of-Sample MSE for KNN:  687.024025645934"
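The large gap between the in-sample and out-of-sample MSE is typical for KNN, because every training point counts itself among its own nearest neighbors when the training set is also used as the test set. The sketch below illustrates this with a deliberately simplified, unweighted k-NN regression in base R on simulated data. It is not the kernel-weighted algorithm that kknn implements, and the function name knn_predict is an assumption.

```r
# Minimal unweighted k-NN regression in base R (not the kknn algorithm)
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))  # distance to every training row
    mean(train_y[order(d)[seq_len(k)]])       # average the k nearest targets
  })
}

mse <- function(actual, predicted) mean((actual - predicted)^2)

set.seed(2023)
x <- matrix(rnorm(400), ncol = 2)  # simulated predictors
y <- x[, 1] + rnorm(200)           # simulated noisy target
train <- 1:140
test  <- 141:200

for (k in c(1, 5, 25)) {
  in_sample  <- mse(y[train], knn_predict(x[train, ], y[train], x[train, ], k))
  out_sample <- mse(y[test],  knn_predict(x[train, ], y[train], x[test, ],  k))
  cat("k =", k, " in-sample:", round(in_sample, 2),
      " out-of-sample:", round(out_sample, 2), "\n")
}
```

With k = 1 the in-sample MSE is exactly zero, since each point is its own nearest neighbor. This is why the training MSE of a KNN model says little about how well it predicts new songs, and why comparing several values of k on held-out data is a common next step.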

Interpreting the models

We tried linear regression and KNN to see which model would be best for predicting track popularity, using the mean squared error (MSE) to compare them. The MSE for the linear regression model was 529.94 in-sample and 527.51 out-of-sample. The KNN model had an MSE of 194.90 in-sample and 687.02 out-of-sample. When comparing these two models, the KNN model performed better than the regression model on the training data, while the regression model performed better on the test data.

We did not use all of the variables in the data set, but we used all of our numeric variables. We decided to go this route because we mostly wanted to explore track popularity and the relationship it has with the numeric variables in our data set.

In theory, the linear regression model is the more attractive choice because it is easier to interpret. However, it relies on four assumptions (linearity, independence, normality, and homoscedasticity). We can run diagnostic plots to see if our model meets these assumptions.

# diagnostic plot
par(mfrow = c(2, 2))
plot(lm_model)

From the diagnostic plot, we see that the model does not meet the four assumptions, which leads us to the conclusion that the linear regression model is not the best fit for this data.

The model that fits the training data best in practice is the KNN model. The KNN model gave us the best in-sample performance (MSE of 194.90), while the linear regression model gave us the best out-of-sample performance (527.51 versus 687.02). The evaluation metric we have been using is the mean squared error, where the model with the lowest value is preferred. The low training MSE is what led us to the conclusion that the KNN model would be better for prediction, although the gap between its training and test MSE suggests that it overfits with k = 5, so this conclusion should be treated with caution.