library(dplyr)
library(FactoMineR)

Spotify Songs

Today, we’ll analyze a data set of songs and their audio characteristics pulled from Spotify, and build an unsupervised learning model (PCA) on it. You can find the data set here: https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db

Data Pre-processing

Let’s do some data pre-processing and take a look at our data set.

songs <- read.csv("SpotifyFeatures.csv")
glimpse(songs)
## Rows: 232,725
## Columns: 18
## $ genre            <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
## $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
## $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
## $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
## $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
## $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.0…
## $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
## $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
## $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…

Some of the character variables, such as key and time_signature, seem to take only a handful of repeating values. Let’s check.

unique(songs$key)
##  [1] "C#" "F#" "C"  "F"  "G"  "E"  "D#" "G#" "D"  "A#" "A"  "B"
unique(songs$time_signature)
## [1] "4/4" "5/4" "3/4" "1/4" "0/4"
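
To make this check systematic, we could count the distinct values in every column at once. Below is a minimal sketch on a small made-up data frame (`toy`, with hypothetical track names), not the real data set. Columns with fewer distinct values than rows are candidates for factors.

```r
# Toy stand-in for the songs data (values are made up for illustration)
toy <- data.frame(
  track_name = c("Song A", "Song B", "Song C", "Song D"),
  key        = c("C#", "F#", "C", "C#"),
  mode       = c("Major", "Minor", "Minor", "Major"),
  stringsAsFactors = FALSE
)

# Number of distinct values per column
n_unique <- sapply(toy, function(col) length(unique(col)))
n_unique

# Columns whose values repeat (fewer distinct values than rows)
repeating <- names(n_unique)[n_unique < nrow(toy)]
repeating
```

On the real data, the same `sapply()` call on `songs` should flag genre, key, mode, and time_signature, among others.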

Let’s also check if there are any NA values.

songs %>% is.na() %>% colSums()
##            genre      artist_name       track_name         track_id 
##                0                0                0                0 
##       popularity     acousticness     danceability      duration_ms 
##                0                0                0                0 
##           energy instrumentalness              key         liveness 
##                0                0                0                0 
##         loudness             mode      speechiness            tempo 
##                0                0                0                0 
##   time_signature          valence 
##                0                0

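Now let’s get rid of the non-numerical variables that have too many unique values: track_id, artist_name, and track_name. Before dropping them, we keep one row per track_name and use it as the row name, so that each song stays identifiable in the plots. Let’s also convert the non-numerical variables with repeating values into factors.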

songs <- songs %>% distinct(track_name, .keep_all = TRUE)
rownames(songs) <- songs$track_name

songs <- songs %>%
  select(-c(track_id, artist_name,track_name)) %>%
  mutate(genre=as.factor(genre),
         key=as.factor(key),
         mode=as.factor(mode),
         time_signature=as.factor(time_signature)
         )
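
Note that de-duplicating on track_name also drops different songs that happen to share a title. Here is a sketch of one possible alternative, not what we do in this post: de-duplicate on track_id and build unique row names with base R’s `make.unique()`. The `toy` data frame and its values are made up for illustration.

```r
# Toy data: the same track listed under two genres, plus a different
# song by another artist that shares the title "Intro" (all values made up)
toy <- data.frame(
  track_id    = c("id1", "id1", "id2", "id3"),
  track_name  = c("Intro", "Intro", "Intro", "Home"),
  artist_name = c("A", "A", "B", "C"),
  genre       = c("Pop", "Dance", "Jazz", "Folk"),
  stringsAsFactors = FALSE
)

# De-duplicating on the title keeps only 2 of the 3 distinct tracks
n_by_name <- sum(!duplicated(toy$track_name))

# De-duplicating on the id keeps all 3 distinct tracks
dedup <- toy[!duplicated(toy$track_id), ]

# Row names must be unique, so repeated titles get a numeric suffix
rownames(dedup) <- make.unique(dedup$track_name)
dedup
```

The trade-off is that some row names end in a suffix such as `.1`, which is a little less tidy in the plots.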

Now let’s check the ranges of the values.

summary(songs)
##          genre         popularity      acousticness     danceability   
##  Comedy     : 8640   Min.   :  0.00   Min.   :0.0000   Min.   :0.0569  
##  Classical  : 8395   1st Qu.: 24.00   1st Qu.:0.0474   1st Qu.:0.4060  
##  Alternative: 8305   Median : 36.00   Median :0.3050   Median :0.5530  
##  Anime      : 8258   Mean   : 35.69   Mean   :0.4151   Mean   :0.5356  
##  Opera      : 8096   3rd Qu.: 48.00   3rd Qu.:0.8100   3rd Qu.:0.6800  
##  Electronic : 7915   Max.   :100.00   Max.   :0.9960   Max.   :0.9870  
##  (Other)    :99006                                                     
##   duration_ms          energy          instrumentalness         key       
##  Min.   :  15387   Min.   :0.0000203   Min.   :0.0000000   C      :17495  
##  1st Qu.: 177380   1st Qu.:0.3300000   1st Qu.:0.0000000   G      :17328  
##  Median : 220000   Median :0.5890000   Median :0.0000887   D      :15730  
##  Mean   : 238023   Mean   :0.5523232   Mean   :0.1818384   A      :14778  
##  3rd Qu.: 270867   3rd Qu.:0.7910000   3rd Qu.:0.1300000   C#     :14191  
##  Max.   :5552917   Max.   :0.9990000   Max.   :0.9990000   F      :13128  
##                                                            (Other):55965  
##     liveness          loudness          mode        speechiness    
##  Min.   :0.00967   Min.   :-52.457   Major:98080   Min.   :0.0222  
##  1st Qu.:0.09775   1st Qu.:-13.302   Minor:50535   1st Qu.:0.0371  
##  Median :0.13100   Median : -8.323                 Median :0.0496  
##  Mean   :0.22882   Mean   :-10.382                 Mean   :0.1305  
##  3rd Qu.:0.28400   3rd Qu.: -5.659                 3rd Qu.:0.1030  
##  Max.   :1.00000   Max.   :  3.744                 Max.   :0.9670  
##                                                                    
##      tempo        time_signature    valence      
##  Min.   : 30.38   0/4:     6     Min.   :0.0000  
##  1st Qu.: 91.84   1/4:  2066     1st Qu.:0.2150  
##  Median :114.89   3/4: 18138     Median :0.4370  
##  Mean   :116.93   4/4:124534     Mean   :0.4486  
##  3rd Qu.:138.28   5/4:  3871     3rd Qu.:0.6660  
##  Max.   :242.90                  Max.   :1.0000  
## 

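The variables are on very different scales: duration_ms goes up to about 5.5 million, while most of the audio features lie between 0 and 1. We’ll scale the values when we run the PCA.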
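
To see why scaling matters, here is a small sketch using the first five rows of three columns from the `glimpse()` output above. It uses base R’s `scale()` as a stand-in for what `scale.unit = TRUE` does. The exact denominator FactoMineR uses may differ slightly, which should not matter at this sample size.

```r
# First five values of three columns, copied from the glimpse() output
toy <- data.frame(
  duration_ms = c(99373, 137373, 170267, 152427, 82625),
  loudness    = c(-1.828, -5.559, -13.879, -12.178, -21.150),
  energy      = c(0.910, 0.737, 0.131, 0.326, 0.225)
)

# Share of the total variance held by each raw column
raw_var   <- sapply(toy, var)
raw_share <- raw_var / sum(raw_var)
round(raw_share, 6)

# After standardizing, every column has mean 0 and standard deviation 1
scaled <- scale(toy)
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Without scaling, duration_ms would hold almost all of the variance, and the first PC would be little more than song length.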

PCA

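Since we have a mix of numerical and categorical variables, let’s use PCA() from FactoMineR, which builds the components from the numerical variables and lets us pass the categorical ones as supplementary variables. For the number of PCs to keep, let’s simply use the number of numerical variables for now.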

# quantitative column names
quanti <- songs %>% 
  select_if(is.numeric) %>% 
  colnames()

# numeric column index
quantivar <- which(colnames(songs) %in% quanti)

# qualitative column names
quali <- songs %>% 
  select_if(is.factor) %>% 
  colnames()

# categorical column index
qualivar <- which(colnames(songs) %in% quali)

Don’t forget to set the scale.unit to TRUE, as we want to scale our values.

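songs_pca <- PCA(
  X = songs,
  scale.unit = TRUE,      # standardize each numerical variable
  quali.sup = qualivar,   # factors are supplementary: not used to build the PCs
  graph = FALSE,          # don't draw the default plots
  ncp = length(quanti)    # keep every component for now
)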

songs_pca
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 148615 individuals, described by 15 variables
## *The results are available in the following objects:
## 
##    name                description                                          
## 1  "$eig"              "eigenvalues"                                        
## 2  "$var"              "results for the variables"                          
## 3  "$var$coord"        "coord. for the variables"                           
## 4  "$var$cor"          "correlations variables - dimensions"                
## 5  "$var$cos2"         "cos2 for the variables"                             
## 6  "$var$contrib"      "contributions of the variables"                     
## 7  "$ind"              "results for the individuals"                        
## 8  "$ind$coord"        "coord. for the individuals"                         
## 9  "$ind$cos2"         "cos2 for the individuals"                           
## 10 "$ind$contrib"      "contributions of the individuals"                   
## 11 "$quali.sup"        "results for the supplementary categorical variables"
## 12 "$quali.sup$coord"  "coord. for the supplementary categories"            
## 13 "$quali.sup$v.test" "v-test of the supplementary categories"             
## 14 "$call"             "summary statistics"                                 
## 15 "$call$centre"      "mean of the variables"                              
## 16 "$call$ecart.type"  "standard error of the variables"                    
## 17 "$call$row.w"       "weights for the individuals"                        
## 18 "$call$col.w"       "weights for the variables"
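
To get a feel for what `$eig` holds, here is a rough sketch with base R’s `prcomp()` as a stand-in for `PCA()`, run on a few columns of the built-in `mtcars` data (an assumption made so the example is self-contained). On standardized data, the eigenvalues are those of the correlation matrix and add up to the number of variables.

```r
# Stand-in data: five numeric columns from the built-in mtcars data set
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# PCA on standardized variables (prcomp is used here instead of FactoMineR::PCA)
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- pca$sdev^2
pct <- 100 * eig / sum(eig)

eig_table <- data.frame(
  eigenvalue   = eig,
  pct_variance = pct,
  cum_pct      = cumsum(pct),
  row.names    = paste("comp", seq_along(eig))
)
eig_table

# The eigenvalues sum to the number of variables
sum(eig)
```

This is also why asking for as many components as there are numerical variables keeps 100% of the variance.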

Analysis

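To analyze the results, let’s plot the individuals (the songs) on the first two PCs. With select = "contrib 10", the plot highlights and labels the 10 songs that contribute the most to these two axes, and habillage = "genre" colors the points by genre.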

plot.PCA(
  x=songs_pca,
  choix="ind",
  select="contrib 10", 
  invisible = "quali",
  habillage = "genre"
)
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps


It makes sense that each genre naturally clusters together, as people created genres to classify songs based on their characteristics in the first place. But we can infer other things from this plot as well. World music seems to share characteristics with Electronic and Soundtrack music. One World song seems quite different from the rest of its genre; in fact, the variation in characteristics within the World genre appears to be quite large. Another interesting point is that Comedy seems quite distinct from the other genres, as it sits apart at the top of the chart.
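
A plot can be deceiving, so a numerical summary helps to back up what we see. One rough way is to average the PC coordinates per group. The sketch below uses the built-in `iris` data and base R’s `prcomp()` as stand-ins for our songs and `PCA()`, with Species playing the role of genre.

```r
# Stand-in data: iris measurements, with Species in the role of genre
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Coordinates of every individual on the first two components
scores <- as.data.frame(pca$x[, 1:2])

# Average position of each group on PC1 and PC2
centroids <- aggregate(scores, by = list(group = iris$Species), FUN = mean)
centroids
```

Groups whose average positions lie far apart are well separated on these two axes. For our analysis, `songs_pca$quali.sup$coord` (listed in the PCA output above) holds the coordinates of the supplementary categories, so the genres can be compared the same way.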

How about the variables?

plot.PCA(
  x=songs_pca,
  choix = "var"
)


To interpret this chart, recall that for the angle between two variable arrows:
- an angle below 90 degrees shows a positive correlation
- an angle of 90 degrees shows no correlation
- an angle above 90 degrees shows a negative correlation, which gets stronger as the angle nears 180 degrees
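
The angle rule is an approximation, because the chart only shows two of the components. Here is a small sketch of why it works, using base R’s `prcomp()` on a few `mtcars` columns as a stand-in for our PCA. With every component included, the dot product of two variable arrows equals their correlation exactly. With only the first two, it is close as long as both arrows are long (near the circle).

```r
# Stand-in data: five numeric columns from the built-in mtcars data set
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Variable coordinates: eigenvectors scaled by the component standard deviations
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# All components: the dot products reproduce the correlation matrix
full_cor <- coord %*% t(coord)

# First two components only (what the chart shows): an approximation
approx_cor <- coord[, 1:2] %*% t(coord[, 1:2])

# Compare the true and the approximated correlation for one pair of variables
round(cor(X)["mpg", "wt"], 3)
round(approx_cor["mpg", "wt"], 3)
```

For short arrows, the variable is poorly represented on the two plotted axes, so the angle should be read with caution.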

Thus we can infer that, when it comes to a song’s popularity:
- tempo, loudness, valence, danceability, and energy correlate positively
- acousticness, instrumentalness, duration_ms, speechiness, and liveness correlate negatively

This all makes sense, as most songs that go viral tend to be short and fast-paced, the type that would be played at parties.

We can also see how much certain variables contribute to the information of the PC.

library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.1
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_contrib(X = songs_pca, 
             axes = 1, # = PC1
             choice = "var")


In our first PC, we can see that loudness and energy contribute the most. We can also check this numerically, as shown below.

# dimdesc: dimension description
songs_dim <- dimdesc(songs_pca)

# variables that contribute to PC1
songs_dim$Dim.1$quanti %>% as.data.frame()
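
The contribution percentages can also be computed by hand. As a rough sketch, again with base R’s `prcomp()` on a few `mtcars` columns as a stand-in: a variable’s contribution to a component is its squared eigenvector entry, expressed as a percentage. As far as I understand, this matches what FactoMineR stores in `$var$contrib`, but treat that as an assumption.

```r
# Stand-in data: five numeric columns from the built-in mtcars data set
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Contribution (%) of each variable to each component
contrib <- 100 * pca$rotation^2

# Contributions to PC1, largest first
sort(contrib[, "PC1"], decreasing = TRUE)

# Each component's contributions add up to 100
colSums(contrib)

# Contribution we would expect if all variables contributed equally
expected <- 100 / ncol(X)
expected
```

Variables above the expected value are the ones that matter most for that component.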

Dimension Reduction

We can also feed our PCA into supervised learning models. In that case, keeping only a limited number of dimensions helps to avoid overfitting and reduces computation time. To decide how many to keep, we can start by looking at the eigenvalues.

songs_pca$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   3.7330652              33.936956                          33.93696
## comp 2   1.7935575              16.305069                          50.24203
## comp 3   1.1467482              10.424984                          60.66701
## comp 4   0.9719350               8.835773                          69.50278
## comp 5   0.8486468               7.714971                          77.21775
## comp 6   0.7098561               6.453237                          83.67099
## comp 7   0.6447742               5.861583                          89.53257
## comp 8   0.4463965               4.058150                          93.59072
## comp 9   0.3382590               3.075082                          96.66581
## comp 10  0.2610486               2.373169                          99.03898
## comp 11  0.1057127               0.961025                         100.00000

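For example, if we’d like our PCA to retain no less than 90% of its information, we can discard comp 9-11 and keep comp 1-8, as shown below. Comp 1-8 together explain about 93.6% of the variance, while comp 1-7 would stop at about 89.5%.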

songs_keep <- songs_pca$ind$coord[ , 1:8 ] %>% as.data.frame()
songs_keep %>% head()
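
Rather than reading the cutoff off the table, we could also compute it. Here is a minimal sketch using the eigenvalues copied from the `songs_pca$eig` output above. The 90% threshold is just our example target.

```r
# Eigenvalues copied from the songs_pca$eig output above
eig <- c(3.7330652, 1.7935575, 1.1467482, 0.9719350, 0.8486468, 0.7098561,
         0.6447742, 0.4463965, 0.3382590, 0.2610486, 0.1057127)

# Cumulative percentage of variance
cum_pct <- cumsum(100 * eig / sum(eig))

# Smallest number of components that reaches the target
threshold <- 90
n_keep <- which(cum_pct >= threshold)[1]
n_keep  # 8
```

With the real object, the third column of `songs_pca$eig` already holds the cumulative percentage, so only the `which()` step would be needed.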

We can then combine this with the original data set by keeping only the categorical variables and replacing all the numerical ones with the dimensions we have selected. This data can be used for supervised learning models.

quali_songs <- songs %>% 
  select_if(is.factor)
songs_new <- merge(quali_songs,songs_keep, by="row.names", all=TRUE)
head(songs_new)
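
A note on what `merge()` does here: merging by row names moves the names into a `Row.names` column and sorts the result by it. If both data frames have the same rows in the same order, `cbind()` is a simpler alternative that keeps the row names in place. Here is a sketch on made-up toy data (the song names and coordinates are hypothetical):

```r
# Toy stand-ins for quali_songs and songs_keep (all values made up)
quali_toy <- data.frame(
  genre = factor(c("Pop", "Jazz", "Rock")),
  row.names = c("song_b", "song_a", "song_c")
)
coords_toy <- data.frame(
  Dim.1 = c(0.5, -1.2, 0.7),
  Dim.2 = c(0.1, 0.3, -0.4),
  row.names = c("song_b", "song_a", "song_c")
)

# merge(): row names become a "Row.names" column, rows are sorted by it
merged <- merge(quali_toy, coords_toy, by = "row.names", all = TRUE)
merged

# cbind(): only safe when the rows line up, so check that first
stopifnot(identical(rownames(quali_toy), rownames(coords_toy)))
bound <- cbind(quali_toy, coords_toy)
bound
```

Either way, it is worth checking that the result has as many rows as the inputs.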

Conclusion

We can use PCA to simplify data, visualize clusters, and figure out how the variables correlate with one another. In our case, we’ve used it to see how each genre clusters together and which variables correlate with a song’s popularity. The PCA output can also be used to train supervised learning models, although it will no longer contain the original values, as they have been scaled and projected onto the principal components.