
1.0 Introduction

1.1 Problem statement

We analyze the Spotify dataset to understand music consumption patterns and preferences. We expect the analysis to reveal how listeners consume music on Spotify, which can benefit artists, music labels, and the platform itself. These insights can inform marketing strategies, playlist curation, and content creation.

1.2 Plan of Action

  • Identify the most popular music genres among Spotify users.
  • Determine the impact of factors like tempo, danceability, and energy on a song’s popularity.
  • Analyze how user-generated playlists influence song discovery.
  • Analyze the correlation between popularity of certain genres/songs and release date to identify how music preferences change seasonally.
  • Investigate the relationship between specific song features (song duration, tempo, energy, etc.) and their effect on each other.

Variables we plan to use:

  • Track popularity
  • Track features:
    • Tempo: This represents the beats per minute (BPM) of a song. It measures how fast or slow the song is.
    • Danceability: How suitable a track is for dancing, based on a combination of musical elements (e.g., rhythm, tempo, and beat strength). A value of 0.0 is least danceable and 1.0 is most danceable.
    • Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
    • Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive.
    • Loudness: Overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.
    • Speechiness: Detects the presence of spoken words in a track. Values above 0.6 likely indicate a podcast or talk show, values from 0.3 to 0.6 typically indicate tracks that mix music and speech, and values below 0.3 indicate mostly music.
    • Acousticness: A measure from 0.0 to 1.0 of how acoustic the track is.
    • Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • User Data: Playlists created and playlist name.
  • Genre: Playlist genre and sub genre.
  • Time Data: Release date of tracks and trends over time.
  • Song duration: duration_ms (we will convert milliseconds to minutes).
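
The two derived variables mentioned in the list (duration in minutes and the timing of a release) could be prepared roughly as follows. This is a minimal sketch: the sample values are taken from the first rows of the dataset, the column names `duration_min` and `release_month` are our own, and it assumes every release date is a full `YYYY-MM-DD` string (anything else would become `NA`).

```r
# Sample values taken from the first rows of the dataset
tracks <- data.frame(
  duration_ms              = c(194754, 162600, 176616),
  track_album_release_date = c("2019-06-14", "2019-12-13", "2019-07-05"),
  stringsAsFactors = FALSE
)

# 1 minute = 60,000 milliseconds
tracks$duration_min <- tracks$duration_ms / 60000

# Parse the character dates; strings that do not match the format become NA
release_date <- as.Date(tracks$track_album_release_date, format = "%Y-%m-%d")

# Month of release (1-12), a starting point for seasonal comparisons
tracks$release_month <- as.integer(format(release_date, "%m"))

print(tracks)
```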

1.3 Current Proposal Approach/Technique

First, we will explore the data using summary statistics and visualizations to understand the dataset and identify trends, correlations, and patterns. Next, we will preprocess the data by handling missing or inconsistent values. We will then fit a linear regression model, interpret its coefficients, and evaluate it on test data. After that, we will fit K-nearest neighbors (KNN) models with the kknn package, testing different values of k and evaluating each model by mean squared error (MSE). Finally, we will compare the performance of the linear regression and KNN models and use them to answer our problem statement.
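
The comparison described above might look something like the sketch below. It runs on simulated data rather than the Spotify data, and the 70/30 split, the seed, the candidate values of k, and the helper `mse` are all our own assumptions for illustration; it is not the final modeling code.

```r
library(kknn)

set.seed(42)  # arbitrary seed so the split is reproducible

# Simulated stand-in data; column names mirror the dataset, values are made up
n <- 500
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  tempo        = runif(n, 50, 175)
)
toy$track_popularity <- 40 + 10 * toy$danceability - 8 * toy$energy + rnorm(n, sd = 5)

# 70/30 train/test split (the ratio is an assumption)
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Linear regression baseline
lm_fit <- lm(track_popularity ~ ., data = train)
lm_mse <- mse(test$track_popularity, predict(lm_fit, newdata = test))

# KNN regression for several candidate values of k
k_values <- c(5, 15, 25)
knn_mse <- sapply(k_values, function(k) {
  fit <- kknn(track_popularity ~ ., train = train, test = test, k = k)
  mse(test$track_popularity, fit$fitted.values)
})
names(knn_mse) <- paste0("k=", k_values)

print(c(lm = lm_mse, knn_mse))
```

The model with the lowest test MSE would be preferred, although with real data the ranking can change from one split to another.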

1.4 Helping the Consumer

This will help the consumer understand user preferences, trends, and the factors influencing song popularity, which is crucial for both the music industry and for artists.

2.0 Packages Required

2.1/2.2 Importing Packages

suppressPackageStartupMessages(library(tidyverse, quietly = TRUE))
suppressPackageStartupMessages(library(corrplot, quietly = TRUE))
suppressPackageStartupMessages(library(kknn, quietly = TRUE))
suppressPackageStartupMessages(library(psych, quietly = TRUE)) 
# Setting work Directory for Midterm Project
setwd("D:/Documents/School/Fall 2023/Data Mining for Bus Analytics/Midterm_Project")

2.3 Purposes

  • We used tidyverse because it has dplyr and ggplot2. This package also encompasses other packages that we may also need.
  • We used corrplot for a correlation matrix.
  • We used kknn for KNN models.
  • We used psych for its describe function, which produces the descriptive statistics in Section 3.5.

3.0 Data Preparation

3.1 Original Source

We obtained the original data from Github: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md

3.2 Dataset Details

The data was collected in 2020, and its original purpose was to use audio features to explore and classify songs. The original dataset has 23 variables. One peculiarity: missing values appear to have been recorded as 0.
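
One way to investigate the suspicion that zeros stand in for missing values is to count exact zeros per column and, where a zero is implausible, recode it as NA. The sketch below uses a small hypothetical data frame, and the choice of which columns to treat as suspect is our assumption. Note that 0 is a legitimate value for key (it denotes the pitch class C), so not every zero signals missing data.

```r
# Small hypothetical data frame standing in for the numeric Spotify columns
sample_df <- data.frame(
  tempo        = c(122.0, 0, 124.0, 99.9),
  danceability = c(0.748, 0, 0.675, 0.726),
  key          = c(6, 0, 1, 11)   # 0 is a valid key (C), not a missing value
)

# Count exact zeros in each column
zero_counts <- colSums(sample_df == 0)
print(zero_counts)

# Recode zeros as NA only in columns where 0 is implausible (an assumption)
suspect_cols <- c("tempo", "danceability")
sample_df[suspect_cols] <- lapply(
  sample_df[suspect_cols],
  function(x) replace(x, x == 0, NA)
)
print(colSums(is.na(sample_df)))
```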

3.3 Importing and Cleaning

This section examines the dataset and its summaries using Exploratory Data Analysis (EDA).

# Load the dataset
spotify <- read_csv("spotify_songs.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Examine the Data and Variable Types

The head function displays the first few rows of the dataset. The str (structure) function displays each variable's type, length, and first few values.

head(spotify)
## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5                   67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson               70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm…               60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>
str(spotify)
## spc_tbl_ [32,833 × 23] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track_id = col_character(),
##   ..   track_name = col_character(),
##   ..   track_artist = col_character(),
##   ..   track_popularity = col_double(),
##   ..   track_album_id = col_character(),
##   ..   track_album_name = col_character(),
##   ..   track_album_release_date = col_character(),
##   ..   playlist_name = col_character(),
##   ..   playlist_id = col_character(),
##   ..   playlist_genre = col_character(),
##   ..   playlist_subgenre = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Summary statistics

The summary function reports the class of each character variable and, for each numeric variable, the minimum, quartiles, mean, and maximum.

summary(spotify)
##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

The new dataset below contains the numeric variables plus the track artist, playlist genre, and playlist subgenre variables, which we expect to have a large effect on track popularity. We cannot yet include every artist in the analysis because there are too many, so for now the code keeps only five selected artists. In the future, we plan to keep the more popular artists and group the remaining artists as “other”.

new <- spotify[c("track_popularity", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms", "track_artist", "playlist_genre", "playlist_subgenre")]

selected_artists <- c("Drake", "Don Omar", "The Weeknd", "David Guetta", "The Chainsmokers")

new <- new %>%
  filter(track_artist %in% selected_artists)

summary(new)
##  track_popularity  danceability       energy            key        
##  Min.   : 0.00    Min.   :0.214   Min.   :0.1560   Min.   : 0.000  
##  1st Qu.:35.00    1st Qu.:0.571   1st Qu.:0.5930   1st Qu.: 1.000  
##  Median :59.00    Median :0.663   Median :0.7210   Median : 5.000  
##  Mean   :51.27    Mean   :0.653   Mean   :0.7111   Mean   : 5.143  
##  3rd Qu.:70.00    3rd Qu.:0.758   3rd Qu.:0.8630   3rd Qu.: 8.000  
##  Max.   :98.00    Max.   :0.928   Max.   :0.9950   Max.   :11.000  
##     loudness            mode         speechiness      acousticness      
##  Min.   :-17.515   Min.   :0.0000   Min.   :0.0255   Min.   :0.0000312  
##  1st Qu.: -7.080   1st Qu.:0.0000   1st Qu.:0.0432   1st Qu.:0.0179000  
##  Median : -5.609   Median :1.0000   Median :0.0610   Median :0.0760000  
##  Mean   : -5.820   Mean   :0.5225   Mean   :0.1020   Mean   :0.1314883  
##  3rd Qu.: -4.083   3rd Qu.:1.0000   3rd Qu.:0.1230   3rd Qu.:0.1800000  
##  Max.   : -1.304   Max.   :1.0000   Max.   :0.5290   Max.   :0.9510000  
##  instrumentalness       liveness         valence           tempo       
##  Min.   :0.0000000   Min.   :0.0258   Min.   :0.0350   Min.   : 74.63  
##  1st Qu.:0.0000000   1st Qu.:0.1020   1st Qu.:0.3235   1st Qu.: 96.22  
##  Median :0.0000094   Median :0.1340   Median :0.4220   Median :120.12  
##  Mean   :0.0205983   Mean   :0.2009   Mean   :0.4611   Mean   :121.63  
##  3rd Qu.:0.0007175   3rd Qu.:0.3055   3rd Qu.:0.6040   3rd Qu.:133.34  
##  Max.   :0.9180000   Max.   :0.8570   Max.   :0.9650   Max.   :203.59  
##   duration_ms     track_artist       playlist_genre     playlist_subgenre 
##  Min.   :106333   Length:511         Length:511         Length:511        
##  1st Qu.:196818   Class :character   Class :character   Class :character  
##  Median :214354   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :226749                                                           
##  3rd Qu.:244960                                                           
##  Max.   :486773
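
The plan to group less popular artists as “other” could be sketched with `fct_lump_n` from forcats (part of the tidyverse). The artist vector here is hypothetical, and the sketch ranks artists by how often they appear rather than by popularity score, which is our simplifying assumption.

```r
library(forcats)

# Hypothetical artist column; the names are reused from the report for illustration
artists <- c("Drake", "Drake", "Drake", "The Weeknd", "The Weeknd",
             "David Guetta", "Don Omar")

# Keep the 2 most frequent artists and lump all others into "Other"
artist_grouped <- fct_lump_n(artists, n = 2, other_level = "Other")
print(table(artist_grouped))
```

The resulting factor has few enough levels to be used as a categorical predictor in a regression model.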

Correlation Matrix and Plot

We create a new dataset containing only the numeric variables from the original Spotify dataset.

num_spotify <- spotify[sapply(spotify, is.numeric)]

The cor function computes the pairwise correlation coefficients between the variables in the dataset.

corr_matrix <- cor(num_spotify)
print(corr_matrix)
##                  track_popularity danceability       energy           key
## track_popularity     1.0000000000  0.064747671 -0.109111533 -0.0006503533
## danceability         0.0647476713  1.000000000 -0.086073156  0.0117364748
## energy              -0.1091115325 -0.086073156  1.000000000  0.0100516957
## key                 -0.0006503533  0.011736475  0.010051696  1.0000000000
## loudness             0.0576870774  0.025335088  0.676624523  0.0009586305
## mode                 0.0106365762 -0.058647400 -0.004799733 -0.1740929567
## speechiness          0.0068194421  0.181721334 -0.032149611  0.0226069895
## acousticness         0.0851593365 -0.024519058 -0.539744630  0.0043058583
## instrumentalness    -0.1498724125 -0.008655078  0.033246579  0.0059678178
## liveness            -0.0545844404 -0.123859417  0.161223049  0.0028871809
## valence              0.0332313281  0.330523257  0.151103304  0.0199139115
## tempo               -0.0053780630 -0.184084351  0.149951107 -0.0133701991
## duration_ms         -0.1436823496 -0.096878789  0.012611444  0.0151393092
##                       loudness         mode  speechiness acousticness
## track_popularity  0.0576870774  0.010636576  0.006819442  0.085159337
## danceability      0.0253350882 -0.058647400  0.181721334 -0.024519058
## energy            0.6766245234 -0.004799733 -0.032149611 -0.539744630
## key               0.0009586305 -0.174092957  0.022606990  0.004305858
## loudness          1.0000000000 -0.019289482  0.010338981 -0.361638165
## mode             -0.0192894815  1.000000000 -0.063512355  0.009415361
## speechiness       0.0103389807 -0.063512355  1.000000000  0.026091985
## acousticness     -0.3616381651  0.009415361  0.026091985  1.000000000
## instrumentalness -0.1478240185 -0.006740665 -0.103424193 -0.006850273
## liveness          0.0776126010 -0.005548974  0.055425906 -0.077243449
## valence           0.0533835553  0.002614470  0.064659103 -0.016844738
## tempo             0.0937673598  0.014329047  0.044603290 -0.112723913
## duration_ms      -0.1150575031  0.015633730 -0.089430567 -0.081580676
##                  instrumentalness     liveness     valence        tempo
## track_popularity     -0.149872413 -0.054584440  0.03323133 -0.005378063
## danceability         -0.008655078 -0.123859417  0.33052326 -0.184084351
## energy                0.033246579  0.161223049  0.15110330  0.149951107
## key                   0.005967818  0.002887181  0.01991391 -0.013370199
## loudness             -0.147824018  0.077612601  0.05338356  0.093767360
## mode                 -0.006740665 -0.005548974  0.00261447  0.014329047
## speechiness          -0.103424193  0.055425906  0.06465910  0.044603290
## acousticness         -0.006850273 -0.077243449 -0.01684474 -0.112723913
## instrumentalness      1.000000000 -0.005507043 -0.17540218  0.023335266
## liveness             -0.005507043  1.000000000 -0.02055977  0.021017804
## valence              -0.175402179 -0.020559772  1.00000000 -0.025732148
## tempo                 0.023335266  0.021017804 -0.02573215  1.000000000
## duration_ms           0.063234740  0.006138455 -0.03222518 -0.001411828
##                   duration_ms
## track_popularity -0.143682350
## danceability     -0.096878789
## energy            0.012611444
## key               0.015139309
## loudness         -0.115057503
## mode              0.015633730
## speechiness      -0.089430567
## acousticness     -0.081580676
## instrumentalness  0.063234740
## liveness          0.006138455
## valence          -0.032225183
## tempo            -0.001411828
## duration_ms       1.000000000

Missing Values

The code below counts the missing (NA) values in each numeric variable.

colSums(is.na(num_spotify))
## track_popularity     danceability           energy              key 
##                0                0                0                0 
##         loudness             mode      speechiness     acousticness 
##                0                0                0                0 
## instrumentalness         liveness          valence            tempo 
##                0                0                0                0 
##      duration_ms 
##                0
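
The check above covers only the numeric columns. A sketch of the same check on a data frame that also contains character columns is shown below; the data frame is hypothetical and only illustrates the calls.

```r
# Hypothetical mini data frame with one missing track name
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_popularity = c(66, 67, 70),
  stringsAsFactors = FALSE
)

# NA count for every column, character and numeric alike
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]
print(nrow(incomplete_rows))
```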

Outliers and Data Truncation

The boxplot below shows the outliers for each variable in the dataset. Outliers exist in the following variables: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, tempo, and duration_ms. We handled them by truncation: values beyond a chosen threshold are set to that threshold rather than dropped, so the number of rows is unchanged.

boxplot(num_spotify, las=2, cex.axis=0.6)

# Truncate energy 
num_spotify$energy[num_spotify$energy <= 0.2] <- 0.2
# Truncate danceability 
num_spotify$danceability[num_spotify$danceability <= 0.3] <- 0.3
# Truncate loudness 
num_spotify$loudness[num_spotify$loudness >= 0] <- 0
num_spotify$loudness[num_spotify$loudness <= -13] <- -13
# Truncate speechiness 
num_spotify$speechiness[num_spotify$speechiness >= 0.22] <- 0.22
# Truncate acousticness 
num_spotify$acousticness[num_spotify$acousticness >= 0.6] <- 0.6
# Truncate instrumentalness 
num_spotify$instrumentalness[num_spotify$instrumentalness >= 0.012] <- 0.012
# Truncate liveness 
num_spotify$liveness[num_spotify$liveness >= 0.45] <- 0.45
# Truncate tempo 
num_spotify$tempo[num_spotify$tempo >= 175] <- 175
num_spotify$tempo[num_spotify$tempo <= 50] <- 50
# Truncate duration_ms 
num_spotify$duration_ms[num_spotify$duration_ms >= 350000] <- 350000
num_spotify$duration_ms[num_spotify$duration_ms <= 100000] <- 100000
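
The repeated assignments above all follow one pattern, which could be wrapped in a small helper. The function name `cap` is our own, and the example values are hypothetical; the bounds 50 and 175 are the tempo thresholds used above.

```r
# Cap the values of x to the interval [lower, upper]
cap <- function(x, lower = -Inf, upper = Inf) {
  pmin(pmax(x, lower), upper)
}

# Hypothetical tempo values, capped with the thresholds used above
tempo <- c(0, 99.97, 121.98, 239.44)
tempo_capped <- cap(tempo, lower = 50, upper = 175)
print(tempo_capped)
```

With such a helper, each variable needs a single line, for example `num_spotify$tempo <- cap(num_spotify$tempo, 50, 175)`.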

Below is a boxplot of the truncated dataset, in which the extreme values have been capped.

boxplot(num_spotify, las=2, cex.axis=0.6)

3.4 Finalized Dataset

The dataset is now clean. The first rows of the final dataset are shown in the table below.

knitr::kable(head(num_spotify[,1:13]), "pipe")
| track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 66 | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 67 | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 70 | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 60 | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 69 | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
| 67 | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 |

3.5 Summary

describe(num_spotify)
##                  vars     n      mean       sd    median   trimmed      mad
## track_popularity    1 32833     42.48    24.98     45.00     43.00    26.69
## danceability        2 32833      0.66     0.14      0.67      0.66     0.15
## energy              3 32833      0.70     0.18      0.72      0.71     0.19
## key                 4 32833      5.37     3.61      6.00      5.35     4.45
## loudness            5 32833     -6.63     2.70     -6.17     -6.41     2.52
## mode                6 32833      0.57     0.50      1.00      0.58     0.00
## speechiness         7 32833      0.09     0.07      0.06      0.08     0.04
## acousticness        8 32833      0.16     0.19      0.08      0.13     0.11
## instrumentalness    9 32833      0.00     0.00      0.00      0.00     0.00
## liveness           10 32833      0.18     0.12      0.13      0.16     0.07
## valence            11 32833      0.51     0.23      0.51      0.51     0.27
## tempo              12 32833    120.44    25.82    121.98    119.12    26.75
## duration_ms        13 32833 223818.33 53113.48 216000.00 220382.28 47246.01
##                       min      max    range  skew kurtosis     se
## track_popularity      0.0 1.00e+02 1.00e+02 -0.23    -0.93   0.14
## danceability          0.3 9.80e-01 6.80e-01 -0.41    -0.31   0.00
## energy                0.2 1.00e+00 8.00e-01 -0.58    -0.24   0.00
## key                   0.0 1.10e+01 1.10e+01 -0.02    -1.31   0.02
## loudness            -13.0 0.00e+00 1.30e+01 -0.65    -0.15   0.01
## mode                  0.0 1.00e+00 1.00e+00 -0.27    -1.93   0.00
## speechiness           0.0 2.20e-01 2.20e-01  0.96    -0.59   0.00
## acousticness          0.0 6.00e-01 6.00e-01  1.17     0.10   0.00
## instrumentalness      0.0 1.00e-02 1.00e-02  1.17    -0.53   0.00
## liveness              0.0 4.50e-01 4.50e-01  1.04    -0.12   0.00
## valence               0.0 9.90e-01 9.90e-01 -0.01    -0.90   0.00
## tempo                50.0 1.75e+02 1.25e+02  0.34    -0.43   0.14
## duration_ms      100000.0 3.50e+05 2.50e+05  0.53     0.05 293.12

Our key takeaways from the descriptive statistics of the cleaned dataset, for the variables of interest, are:

  • For our target variable, track_popularity:
    • The average popularity score is 42.48. This is out of a 0-100 scale.
    • The standard deviation is 24.98, about 59% of the mean (24.98 / 42.48), so popularity scores are widely dispersed around the mean.
    • The median is 45: half the tracks have a popularity below 45 and half above.
  • Speechiness varies substantially relative to its mean (mean 0.09, standard deviation 0.07).
  • Acousticness varies even more: its standard deviation (0.19) exceeds its mean (0.16).
  • Tempo and duration both span a wide range, even after truncation (tempo 50 to 175 BPM; duration 100,000 to 350,000 ms).
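
The comparisons of standard deviation to mean in the list above are coefficients of variation (sd divided by mean). A quick sketch of that arithmetic, using the rounded figures from the describe() output (so the ratios for small means are only approximate):

```r
# Means and standard deviations copied from the describe() output above
stats <- data.frame(
  variable = c("track_popularity", "speechiness", "acousticness"),
  mean     = c(42.48, 0.09, 0.16),
  sd       = c(24.98, 0.07, 0.19)
)

# Coefficient of variation: standard deviation as a share of the mean
stats$cv <- stats$sd / stats$mean
print(stats)
```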

4.0 Exploratory Data Analysis

4.1 Handling New Information

To uncover information in the data that is not self-evident, we have applied, and will continue to apply, several methods: visualization, truncation, and model testing. Beyond what we have already done, we plan to test different settings for our linear regression and KNN models (for example, different values of k) and different splits between training and test data.
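
Testing different training/test splits could be sketched as below. The data is simulated, and the helper `split_mse`, the seed, and the three split fractions are our own assumptions; the same loop could wrap a KNN model instead of `lm`.

```r
set.seed(1)  # arbitrary seed so the splits are reproducible

# Simulated stand-in for the Spotify data
n <- 400
toy <- data.frame(energy = runif(n), loudness = runif(n, -13, 0))
toy$track_popularity <- 45 - 10 * toy$energy + 0.5 * toy$loudness + rnorm(n, sd = 6)

# Fit on a random training fraction and return the MSE on the held-out rows
split_mse <- function(data, train_frac) {
  idx   <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  fit   <- lm(track_popularity ~ ., data = data[idx, ])
  preds <- predict(fit, newdata = data[-idx, ])
  mean((data$track_popularity[-idx] - preds)^2)
}

fracs   <- c(0.6, 0.7, 0.8)
results <- sapply(fracs, function(f) split_mse(toy, f))
names(results) <- paste0("train_", fracs)
print(results)
```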

4.2 Plots and Tables

Histogram

Below is a histogram of the “Track Popularity” variable.

ggplot(spotify, aes(x=track_popularity)) +
  geom_histogram(binwidth=5, fill="blue", alpha=0.7) +
  ggtitle("Histogram") +
  xlab("Track Popularity") +
  ylab("Frequency")

Bar Charts

ggplot(new, aes(x = playlist_genre)) +
  geom_bar(fill="blue", color="black", alpha=0.7) +
  labs(
    title="Bar Chart of Playlist Genres", 
    x="Playlist (Genre)", 
    y="Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5))

ggplot(new, aes(x = playlist_subgenre)) +
  geom_bar(fill = "blue", color = "black", alpha = 0.7) +
  coord_flip() +  
  labs(
    title = "Bar Chart of Playlist SubGenres",
    y = "Count", 
    x = "Subgenre"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(angle = 0, hjust = 0.5))

ggplot(new, aes(x = track_artist)) +
  geom_bar(fill="blue", color="black", alpha=0.7) +
  labs(
    title="Bar Chart of Track Artists", 
    x="Track Artist", 
    y="Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=0.5))

Scatter Plots

ggplot(spotify, aes(x=loudness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Loudness and Popularity")  

ggplot(spotify, aes(x=tempo, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Tempo and Popularity") 

ggplot(spotify, aes(x=speechiness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Speechiness and Popularity") 

ggplot(spotify, aes(x=danceability, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Danceability and Popularity") 

ggplot(spotify, aes(x=energy, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Energy and Popularity") 

ggplot(spotify, aes(x=liveness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Liveness and Popularity") 

ggplot(spotify, aes(x=acousticness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Acousticness and Popularity") 

ggplot(spotify, aes(x=duration_ms, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Duration and Popularity") 

ggplot(spotify, aes(x=instrumentalness, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Instrumentalness and Popularity") 

ggplot(spotify, aes(x=valence, y=track_popularity)) + geom_jitter(aes(color = playlist_genre)) +
  ggtitle("Valence and Popularity") 

Pie Chart

# Note: these significance values are hard-coded;
# they are not computed anywhere in this report
spotify_piechart <- data.frame(
  variable = c("Valence", "Tempo", "Energy", "Loudness"),
  significance = c(0.15, 0.25, 0.20, 0.10)
)

spotify_piechart <- spotify_piechart %>%
  arrange(desc(significance))

total_significance <- sum(spotify_piechart$significance)
spotify_piechart$percentage <- (spotify_piechart$significance / total_significance) * 100


ggplot(spotify_piechart, aes(x = "", y = percentage, fill = variable)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(
    title = "Percentage Effect of Variables on Track Popularity",
    fill = "Variable"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

Correlation Matrix, Plot, and Map

The visualizations below show the correlations between the variables, recomputed on the truncated num_spotify dataset.

corr_matrix <- cor(num_spotify)
print(corr_matrix)
##                  track_popularity danceability       energy           key
## track_popularity     1.0000000000  0.065252575 -0.109780603 -0.0006503533
## danceability         0.0652525748  1.000000000 -0.091867642  0.0121403872
## energy              -0.1097806032 -0.091867642  1.000000000  0.0098638427
## key                 -0.0006503533  0.012140387  0.009863843  1.0000000000
## loudness             0.0613782905  0.006647159  0.675069403 -0.0019776668
## mode                 0.0106365762 -0.058102892 -0.003992262 -0.1740929567
## speechiness          0.0114869489  0.228969318 -0.001970227  0.0241236557
## acousticness         0.0917312887  0.002973148 -0.516121648  0.0054816873
## instrumentalness    -0.1769045504 -0.044823526  0.077965754  0.0116707469
## liveness            -0.0497280028 -0.123897962  0.174063032  0.0035209045
## valence              0.0332313281  0.329985784  0.149958408  0.0199139115
## tempo               -0.0058760602 -0.176672607  0.156966992 -0.0136459690
## duration_ms         -0.1421762934 -0.100706773  0.008993375  0.0153729324
##                      loudness         mode  speechiness acousticness
## track_popularity  0.061378290  0.010636576  0.011486949  0.091731289
## danceability      0.006647159 -0.058102892  0.228969318  0.002973148
## energy            0.675069403 -0.003992262 -0.001970227 -0.516121648
## key              -0.001977667 -0.174092957  0.024123656  0.005481687
## loudness          1.000000000 -0.019990599  0.046304697 -0.324549507
## mode             -0.019990599  1.000000000 -0.073720939  0.003658047
## speechiness       0.046304697 -0.073720939  1.000000000  0.023875232
## acousticness     -0.324549507  0.003658047  0.023875232  1.000000000
## instrumentalness -0.151080835 -0.012133441 -0.160679795 -0.086739727
## liveness          0.096364878 -0.006556692  0.059462137 -0.087174106
## valence           0.044065911  0.002614470  0.070500612  0.008816439
## tempo             0.098546824  0.013802062  0.031490178 -0.117857457
## duration_ms      -0.129157665  0.015229474 -0.096154178 -0.074329309
##                  instrumentalness      liveness      valence        tempo
## track_popularity    -0.1769045504 -0.0497280028  0.033231328 -0.005876060
## danceability        -0.0448235263 -0.1238979617  0.329985784 -0.176672607
## energy               0.0779657539  0.1740630316  0.149958408  0.156966992
## key                  0.0116707469  0.0035209045  0.019913911 -0.013645969
## loudness            -0.1510808355  0.0963648780  0.044065911  0.098546824
## mode                -0.0121334411 -0.0065566915  0.002614470  0.013802062
## speechiness         -0.1606797952  0.0594621373  0.070500612  0.031490178
## acousticness        -0.0867397269 -0.0871741062  0.008816439 -0.117857457
## instrumentalness     1.0000000000 -0.0004094716 -0.165237727  0.039792129
## liveness            -0.0004094716  1.0000000000 -0.019332272  0.024927249
## valence             -0.1652377269 -0.0193322723  1.000000000 -0.029128791
## tempo                0.0397921287  0.0249272491 -0.029128791  1.000000000
## duration_ms          0.0903161849 -0.0090259013 -0.021256337 -0.005129168
##                   duration_ms
## track_popularity -0.142176293
## danceability     -0.100706773
## energy            0.008993375
## key               0.015372932
## loudness         -0.129157665
## mode              0.015229474
## speechiness      -0.096154178
## acousticness     -0.074329309
## instrumentalness  0.090316185
## liveness         -0.009025901
## valence          -0.021256337
## tempo            -0.005129168
## duration_ms       1.000000000
corrplot(corr_matrix, method="circle", type="upper", order="hclust",
         tl.col="black", tl.srt=45)

corr_data <- as.data.frame(corr_matrix) 
corr_data$row <- rownames(corr_matrix) 
corr_data_long <- gather(corr_data, key = "column", value = "correlation", -row) 
ggplot(data = corr_data_long, aes(x = row, y = column, fill = correlation)) + 
  geom_tile() + 
  scale_fill_gradient(low = "blue", high = "red") + 
  theme_minimal() + 
  labs(title = "Correlation Heatmap") 

The types of plots and tables that we will use to help illustrate our findings for our questions include:

  • What we can utilize in future:
    • Bar Charts:
      • We could use a bar chart to compare average track_popularity across groups of other significant variables, to show how those variables relate to track_popularity.
    • Pie Charts:
      • We could show the relative contribution of a few of the variables to track_popularity. For example, if danceability, valence, tempo, and energy have a significant influence on track_popularity, we could show each variable's share of the explained variation in a pie chart.
    • Scatter Plots:
      • We could use a scatter plot to show the trend between track_popularity and another continuous variable, and to reveal any clustering that affects that variable's relationship with track_popularity. For instance, we could plot loudness against track popularity and evaluate the relationship.
  • What we have used:
    • Histograms:
      • Our track_popularity histogram showed us the frequency distribution for track popularity, within the range (0-100).
    • Correlation Matrix:
      • Our correlation matrix supplied us with the correlation coefficients for our variables, including our target variable track_popularity.
    • Correlation Heatmap:
      • Our heatmap showed us the correlation between variables in a colorful, easy-to-digest visual.
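The scatter plot idea listed above could be sketched roughly as follows. This is only an illustration on made-up data: the `toy_tracks` data frame and its values are hypothetical stand-ins for the Spotify dataset, and loudness is used simply as one example of a continuous variable.

```r
library(ggplot2)

# Hypothetical toy data standing in for the Spotify dataset
set.seed(1)
toy_tracks <- data.frame(
  loudness = runif(200, min = -30, max = 0),
  track_popularity = sample(0:100, 200, replace = TRUE)
)

# Scatter plot with a fitted linear trend line
scatter_plot <- ggplot(toy_tracks, aes(x = loudness, y = track_popularity)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Track Popularity vs Loudness (toy data)")
print(scatter_plot)
```

With the real data, the same call would use the numeric dataset in place of `toy_tracks`.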

4.3 Current Limitations

One variable that is not currently in our finalized dataset, but that we expect to have a big impact on track popularity, is track artist. Currently, we include only the numerical variables in our Spotify dataset. The ultimate goal is to incorporate track artist, which is a character variable, possibly by using one-hot encoding or another technique learned in the future.
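As a rough sketch of what one-hot encoding could look like in base R, the example below uses `model.matrix()` on a small made-up data frame. The artist names and popularity values are hypothetical and are not taken from our dataset.

```r
# Hypothetical toy data; the artist names are made up for illustration
toy <- data.frame(
  track_artist = c("Artist A", "Artist B", "Artist A", "Artist C"),
  track_popularity = c(55, 63, 12, 70)
)

# model.matrix() expands a categorical variable into indicator columns;
# "- 1" drops the intercept so every artist gets its own column
artist_dummies <- model.matrix(~ track_artist - 1, data = toy)

# Combine the response with the indicator columns
toy_encoded <- cbind(toy["track_popularity"], artist_dummies)
print(toy_encoded)
```

One caveat with this approach is that a dataset with many distinct artists would produce one column per artist, which could make the storage limitations worse rather than better.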

4.4 Future Endeavors

The current plan is to split the dataset, using 70% for training and 30% for testing. Moving forward, the goal is to experiment with this ratio to improve the accuracy of our analysis. Accuracy may also improve by adding interaction terms to the analysis, as well as the track artist variable.
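Experimenting with the split ratio could be sketched along these lines. The data, the `split_mse()` helper, and the candidate ratios are all hypothetical and serve only to show the mechanics.

```r
# Hypothetical toy data standing in for num_spotify
set.seed(25)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(n)

# Fit on a given share of the rows and return the testing MSE
split_mse <- function(data, train_share) {
  idx <- sample(seq_len(nrow(data)), floor(train_share * nrow(data)))
  fit <- lm(y ~ ., data = data[idx, ])
  pred <- predict(fit, newdata = data[-idx, ])
  mean((pred - data$y[-idx])^2)
}

# Compare a few candidate training shares
ratios <- c(0.6, 0.7, 0.8)
results <- data.frame(
  train_share = ratios,
  test_mse = sapply(ratios, split_mse, data = toy)
)
print(results)
```

Because each ratio draws a different random sample, differences between rows partly reflect sampling noise; repeating the split several times per ratio would give a steadier comparison.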

5.0 Modeling

5.1 Analysis

The models that we have tried include Linear Regression and KNN. When looking at the Linear Regression model, the Residuals vs Fitted graph appears to have a roughly even number of positive and negative residuals. This suggests that the model is not systematically over- or under-predicting the dependent variable (track_popularity) from the independent variables (danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms), although the low R-squared (about 0.075) shows that it explains only a small share of the variation in popularity. In addition, the residual points appear evenly scattered, which is consistent with the assumption of homoscedasticity; this assumption is best judged from the Residuals vs Fitted and Scale-Location graphs, while the Residuals vs Leverage graph is used to identify influential observations.

When looking at the KNN model, as the value of k increases from 1 to 5, the MSE increases in the training dataset but decreases in the testing dataset. A lower MSE is better, so the very low training MSE at small values of k does not indicate a better model: it shows that a highly flexible model is fitting noise in the training data (overfitting), which is why its testing MSE is high.

Data Splitting

This step divides the data into a training set and a testing set. The models are fitted on the training set and then evaluated on the testing set, which shows how well the learned relationships between the variables generalize to unseen data.

# Set the seed for reproducibility
set.seed(25)
# Randomly sample row indices for the training set
train_indices <- sample(1:NROW(num_spotify),NROW(num_spotify)*0.70)
# Create the training set
train_data <- num_spotify[train_indices, ]
# Create the testing set
test_data <- num_spotify[-train_indices, ]

Linear Regression Model

# Train the linear regression model
lm_model <- lm(track_popularity ~ ., data = train_data)
summary(lm_model)
## 
## Call:
## lm(formula = track_popularity ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.897 -17.411   2.986  18.879  62.702 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.923e+01  2.047e+00  38.701  < 2e-16 ***
## danceability      4.445e+00  1.269e+00   3.501 0.000464 ***
## energy           -2.951e+01  1.450e+00 -20.356  < 2e-16 ***
## key               4.101e-02  4.452e-02   0.921 0.356991    
## loudness          1.580e+00  8.519e-02  18.550  < 2e-16 ***
## mode              9.425e-01  3.264e-01   2.887 0.003888 ** 
## speechiness      -1.272e+01  2.507e+00  -5.074 3.93e-07 ***
## acousticness      2.875e+00  9.976e-01   2.881 0.003963 ** 
## instrumentalness -6.081e+02  3.479e+01 -17.482  < 2e-16 ***
## liveness         -4.948e+00  1.378e+00  -3.590 0.000331 ***
## valence           3.509e+00  7.638e-01   4.594 4.37e-06 ***
## tempo             2.702e-02  6.312e-03   4.281 1.87e-05 ***
## duration_ms      -4.815e-05  3.068e-06 -15.697  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.02 on 22970 degrees of freedom
## Multiple R-squared:  0.07494,    Adjusted R-squared:  0.07446 
## F-statistic: 155.1 on 12 and 22970 DF,  p-value: < 2.2e-16

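According to our linear regression, the variables listed below are statistically significant predictors of track popularity at the 0.001 level. Mode and acousticness are significant at the 0.01 level, and key is not significant.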

  • danceability
  • energy
  • loudness
  • speechiness
  • instrumentalness
  • liveness
  • valence
  • tempo
  • duration_ms
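A list like the one above can also be pulled out of a fitted model programmatically. The sketch below shows one way to do so on hypothetical toy data; the variable names and the 0.001 cutoff are chosen for illustration only.

```r
# Hypothetical toy data; x2 has no true effect on y
set.seed(1)
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
toy$y <- 1 + 2 * toy$x1 + rnorm(300)

fit <- lm(y ~ ., data = toy)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients

# Keep predictors (not the intercept) with a p-value below 0.001
p_values <- coef_table[-1, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.001]
print(significant)
```

A small p-value indicates that an effect is unlikely to be zero, not that it is large. Coefficient sizes also depend on each variable's scale (duration_ms, for example, is measured in milliseconds), so raw estimates are not directly comparable across variables.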

Below we manually check the results of the linear regression model; the first six records of the actual and predicted values are shown.

# Create a data frame to compare actual and predicted values
comparison_df <- data.frame(Actual = train_data$track_popularity, lm_predicted = lm_model$fitted.values)
head(comparison_df)
##   Actual lm_predicted
## 1     55     35.66503
## 2     63     39.93547
## 3     12     41.02927
## 4     63     36.84513
## 5      0     35.93214
## 6     57     49.87456

The diagnostic plots are used to check whether the dataset meets the assumptions of the linear regression model.

par(mfrow = c(2,2))
plot(lm_model)
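Individual diagnostic plots can be requested through the `which` argument of `plot()` for `lm` objects. The sketch below, on a hypothetical toy model, draws only the two plots most relevant to linearity and homoscedasticity.

```r
# Hypothetical toy data and model standing in for lm_model
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- 1 + 2 * toy$x + rnorm(200)
fit <- lm(y ~ x, data = toy)

# which = 1: Residuals vs Fitted (linearity)
# which = 3: Scale-Location (homoscedasticity)
par(mfrow = c(1, 2))
plot(fit, which = c(1, 3))
par(mfrow = c(1, 1))  # restore the default single-panel layout
```

A roughly flat trend line in both panels is what the linear model assumptions would lead us to expect.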

Once the linear regression model is fitted, the in-sample MSE, or training MSE, can be computed as shown below.

lm_mse_train <- mean((lm_model$fitted.values - train_data$track_popularity)^2)
print(paste("Training MSE for Linear Model:", round(lm_mse_train, 2)))
## [1] "Training MSE for Linear Model: 576.46"

Similarly, the Out-of-sample MSE, or Testing MSE, can be calculated.

# Predict on testing data
lm_test_pred <- predict(lm_model, newdata = test_data)
# Calculate the out-of-sample (testing) MSE
lm_mse_test <- mean((lm_test_pred - test_data$track_popularity)^2)
print(paste("Testing MSE for Linear Model:", round(lm_mse_test, 2)))
## [1] "Testing MSE for Linear Model: 577.42"
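Since the same MSE calculation is repeated for every model, it could be wrapped in a small helper. The `mse()` function and the numbers below are a hypothetical illustration, not part of the original analysis.

```r
# Mean squared error between observed and predicted values
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Small worked example with made-up numbers
actual <- c(55, 63, 12)
predicted <- c(50, 60, 20)

# Squared errors are 25, 9, and 64, so the MSE is 98 / 3
example_mse <- mse(actual, predicted)
print(example_mse)

# The square root (RMSE) is on the same scale as track_popularity
print(sqrt(example_mse))
```

Taking the square root of the testing MSE reported above (577.42) gives roughly 24, meaning predictions are off by about 24 popularity points on average in the root-mean-square sense. This is in line with the residual standard error of 24.02 in the model summary.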

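Based on the linear regression results, we add interaction terms between selected pairs of variables. Note that the models below are fitted on the full dataset (num_spotify) rather than on the training set.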

Interaction Terms

spotify_lm_it2 <- lm(track_popularity ~ . + tempo*valence + speechiness*liveness, data = num_spotify)
summary(spotify_lm_it2)
## 
## Call:
## lm(formula = track_popularity ~ . + tempo * valence + speechiness * 
##     liveness, data = num_spotify)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.586 -17.552   2.911  18.772  64.108 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.259e+01  2.242e+00  36.836  < 2e-16 ***
## danceability          5.376e+00  1.066e+00   5.043 4.61e-07 ***
## energy               -2.879e+01  1.211e+00 -23.770  < 2e-16 ***
## key                   4.942e-02  3.730e-02   1.325   0.1852    
## loudness              1.644e+00  7.155e-02  22.981  < 2e-16 ***
## mode                  6.678e-01  2.727e-01   2.449   0.0143 *  
## speechiness          -7.076e+00  3.711e+00  -1.907   0.0565 .  
## acousticness          3.278e+00  8.317e-01   3.941 8.13e-05 ***
## instrumentalness     -6.221e+02  2.912e+01 -21.359  < 2e-16 ***
## liveness             -1.852e+00  1.992e+00  -0.930   0.3525    
## valence              -4.118e+00  2.845e+00  -1.448   0.1478    
## tempo                -6.844e-03  1.311e-02  -0.522   0.6015    
## duration_ms          -4.872e-05  2.569e-06 -18.969  < 2e-16 ***
## valence:tempo         5.697e-02  2.278e-02   2.501   0.0124 *  
## speechiness:liveness -3.258e+01  1.668e+01  -1.953   0.0508 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.01 on 32818 degrees of freedom
## Multiple R-squared:  0.0765, Adjusted R-squared:  0.07611 
## F-statistic: 194.2 on 14 and 32818 DF,  p-value: < 2.2e-16
spotify_lm_it <- lm(track_popularity ~ . + acousticness*danceability + energy*loudness, data = num_spotify)
summary(spotify_lm_it)
## 
## Call:
## lm(formula = track_popularity ~ . + acousticness * danceability + 
##     energy * loudness, data = num_spotify)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.678 -17.627   2.961  18.809  60.167 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                9.342e+01  2.349e+00  39.768  < 2e-16 ***
## danceability               1.496e+00  1.337e+00   1.119 0.263281    
## energy                    -4.423e+01  2.278e+00 -19.419  < 2e-16 ***
## key                        4.555e-02  3.726e-02   1.222 0.221617    
## loudness                   2.973e+00  1.878e-01  15.829  < 2e-16 ***
## mode                       7.086e-01  2.726e-01   2.599 0.009351 ** 
## speechiness               -1.278e+01  2.093e+00  -6.105 1.04e-09 ***
## acousticness              -6.860e+00  3.283e+00  -2.090 0.036649 *  
## instrumentalness          -5.865e+02  2.944e+01 -19.919  < 2e-16 ***
## liveness                  -4.826e+00  1.152e+00  -4.188 2.82e-05 ***
## valence                    2.550e+00  6.398e-01   3.985 6.77e-05 ***
## tempo                      2.306e-02  5.298e-03   4.352 1.35e-05 ***
## duration_ms               -5.067e-05  2.576e-06 -19.669  < 2e-16 ***
## danceability:acousticness  1.670e+01  4.877e+00   3.425 0.000616 ***
## energy:loudness           -1.972e+00  2.578e-01  -7.652 2.04e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.99 on 32818 degrees of freedom
## Multiple R-squared:  0.07828,    Adjusted R-squared:  0.07789 
## F-statistic: 199.1 on 14 and 32818 DF,  p-value: < 2.2e-16
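One way to check whether interaction terms improve on a main-effects model is an F-test between nested models using `anova()`. The sketch below uses hypothetical toy data in which a true interaction is built in.

```r
# Hypothetical toy data with a true interaction between x1 and x2
set.seed(1)
toy <- data.frame(x1 = rnorm(400), x2 = rnorm(400))
toy$y <- 1 + toy$x1 + toy$x2 + 0.8 * toy$x1 * toy$x2 + rnorm(400)

fit_main <- lm(y ~ x1 + x2, data = toy)  # main effects only
fit_int  <- lm(y ~ x1 * x2, data = toy)  # adds the x1:x2 interaction

# F-test: does the interaction term significantly reduce residual error?
comparison <- anova(fit_main, fit_int)
print(comparison)
```

For a valid comparison, both models need to be fitted on the same rows. Applied to our case, that would mean comparing the interaction models against a main-effects model that is also fitted on num_spotify.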

KNN Model

Train the KNN model with the training dataset.

# Train the KNN model
knn_model <- kknn(track_popularity ~ ., train = train_data, test = train_data, k = 5)

Again, we can manually check the results of the KNN model. The table below shows the first six records of the linear model's predicted values, along with the KNN predicted values and the actual values.

# Create a data frame to compare actual and predicted values
comparison_df$knn_std_predicted <- knn_model$fitted.values
head(comparison_df)
##   Actual lm_predicted knn_std_predicted
## 1     55     35.66503         45.099896
## 2     63     39.93547         35.110958
## 3     12     41.02927         34.827113
## 4     63     36.84513         48.693988
## 5      0     35.93214          5.970955
## 6     57     49.87456         64.861724

Now that the KNN model is complete, the in-sample MSE and the out-of-sample MSE are calculated.

# Predict on training data
knn_train_pred <- fitted.values(knn_model)
# Calculate in-sample MSE manually
knn_train_mse <- mean((train_data$track_popularity - knn_train_pred)^2)
print(paste("In-Sample MSE for KNN: ", knn_train_mse))
## [1] "In-Sample MSE for KNN:  179.138485697501"
# Predict on testing data
knn_model_test <- kknn(track_popularity ~ ., train = train_data, test = test_data, k = 5)
knn_test_pred <- fitted.values(knn_model_test)
# Calculate out-of-sample MSE manually
knn_test_mse <- mean((test_data$track_popularity - knn_test_pred)^2)
print(paste("Out-of-Sample MSE for KNN: ", knn_test_mse))
## [1] "Out-of-Sample MSE for KNN:  611.882895509132"

To make sure our model is as accurate as possible, the KNN model was run with different values of k to compare the resulting MSE.

# Initialize a dataframe to store MSE for each k
mse_df <- data.frame(k = integer(), MSE_train = numeric(), MSE_test = numeric())
for (k in c(1:5)) {
  # Fit the k-NN model using training data
  knn_model_train <- kknn(track_popularity ~ ., train = train_data, test = train_data, k = k)
  # Calculate the training MSE
  mse_train <-  mean((knn_model_train$fitted.values - train_data$track_popularity)^2)
  # Test the k-NN with testing data
  knn_model_test <- kknn(track_popularity ~ ., train = train_data, test = test_data, k = k)
  mse_test <-  mean((knn_model_test$fitted.values - test_data$track_popularity)^2)
  # save the results
  mse_df <- rbind(mse_df, data.frame(k = k, MSE_train = mse_train, MSE_test = mse_test))
}
# Show the MSE dataframe
print(mse_df)
##   k MSE_train MSE_test
## 1 1  28.36980 832.0829
## 2 2  45.63493 729.9191
## 3 3  94.42580 667.4293
## 4 4 140.30341 632.3014
## 5 5 179.13849 611.8829
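The choice of k can be read off the table programmatically. The sketch below rebuilds `mse_df` from the values printed above so that it runs on its own; the `gap` column is an addition made here for illustration.

```r
# MSE values copied from the table above
mse_df <- data.frame(
  k = 1:5,
  MSE_train = c(28.36980, 45.63493, 94.42580, 140.30341, 179.13849),
  MSE_test  = c(832.0829, 729.9191, 667.4293, 632.3014, 611.8829)
)

# k with the lowest out-of-sample MSE
best_k <- mse_df$k[which.min(mse_df$MSE_test)]
print(best_k)

# Gap between testing and training error; a large gap suggests overfitting
mse_df$gap <- mse_df$MSE_test - mse_df$MSE_train
print(mse_df)
```

The testing MSE is still falling at k = 5, so it may be worth extending the loop to larger values of k before settling on a final choice.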

5.2 Chosen Variables

We did not use all the variables in the Spotify dataset. Because of limited storage in RStudio, along with computer limitations, we had to reduce our dataset to only the numeric variables. Restricting the dataset to numeric variables allowed us to continue with our research.

5.3 Best Fit

Theoretically, the KNN model would fit our data best. This is because the KNN model does not make assumptions about the form of the relationships between variables, and we expected several of our variables to have non-linear relationships with track popularity. KNN predicts from the most similar observations, so with the varied variables we had it could, in theory, capture such relationships better; i.e., KNN is more flexible. An ideal linear regression model assumes homoscedasticity, linearity, normality of residuals, and low correlation among the independent variables. Theoretically, our data would not be expected to show linearity, due to the wide variability of our values. Also, we expected high correlation between some of the independent variables.
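Correlation among independent variables can be quantified with a variance inflation factor (VIF). The sketch below computes it by hand in base R on hypothetical toy data; the `vif_manual()` helper is an illustrative assumption, not something used in our analysis.

```r
# Hypothetical toy data: x2 is built to be strongly correlated with x1
set.seed(1)
toy <- data.frame(x1 = rnorm(300))
toy$x2 <- 0.7 * toy$x1 + rnorm(300, sd = 0.5)
toy$x3 <- rnorm(300)

# VIF for one predictor: regress it on the others, then 1 / (1 - R^2)
vif_manual <- function(data, target) {
  others <- setdiff(names(data), target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

# Compute the VIF for every predictor
vifs <- sapply(names(toy), function(v) vif_manual(toy, v))
print(round(vifs, 2))
```

In our correlation matrix, the pairs most likely to inflate these values are energy with loudness (0.675) and energy with acousticness (-0.516).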

5.4 Best Fit in Practice

In practice, the linear regression model fit our data the best. We used 70% of the data as the training set to fit our model. The training and testing MSE values were close (576.46 for training and 577.42 for testing), indicating that the model is not overfitting. Our scatter plot showed moderate linearity. Our histogram did have a left tail (negative skew), but our Residuals vs Fitted plot showed moderately random scatter. In contrast, for the KNN model with k=5, the low MSE on the training data (179.14) suggested overfitting, as performance dropped significantly on the test data (611.88). Among the values tested (k = 1 to 5), the best number of neighbors appeared to be k=5, as that is where the test MSE was lowest (611.88); the training MSE was lowest at k=1 (28.37), but that model had the highest test MSE (832.08). KNN seemed poorly fitted to the dataset, as its test MSE remained higher than that of the linear regression. The best in-sample performance came from the KNN model (training MSE of 179.14 at k=5), and the best out-of-sample performance came from the linear regression (testing MSE of 577.42). The evaluation metric we used was the Mean Squared Error.

Conclusion

While conducting linear regression, we discovered that track popularity is significantly associated with danceability, energy, loudness, speechiness, instrumentalness, liveness, valence, tempo, and duration_ms individually. Furthermore, after adding interaction terms between tempo and valence and between speechiness and liveness, it’s apparent that these pairs have only a weak joint effect on track popularity: the tempo and valence interaction is significant only at the 0.05 level (p = 0.0124), and the speechiness and liveness interaction is not significant at that level (p = 0.0508). By comparison, the interaction of acousticness and danceability (p = 0.000616) and the interaction of energy and loudness (p = 2.04e-14) are both highly significant.