Unsupervised Learning: Spotify Recommendation Analysis

Introduction

Spotify is a popular music streaming platform with millions of tracks from various genres. Spotify tracks are individual songs with attributes like danceability, acousticness, energy and more. Analyzing the Spotify Tracks dataset using unsupervised learning techniques allows us to discover hidden relationships, group similar tracks, and gain insights into the characteristics and patterns of the songs available on the platform. In this project, the dataset is used from Kaggle. Here is the link for the Spotify Recommendation.

Import Library

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(cluster)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(cowplot)

## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp

library(FactoMineR)
library(factoextra)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Read Data

The dataset were obtained from Kaggle by Zaheen Hamidani about the Spotify tracks.

spotify <- read.csv("dataset/data.csv")
glimpse(spotify)

## Rows: 195
## Columns: 14
## $ danceability     <dbl> 0.803, 0.762, 0.261, 0.722, 0.787, 0.778, 0.666, 0.92…
## $ energy           <dbl> 0.6240, 0.7030, 0.0149, 0.7360, 0.5720, 0.6320, 0.589…
## $ key              <int> 7, 10, 1, 3, 1, 8, 0, 7, 7, 3, 9, 1, 1, 6, 8, 5, 8, 8…
## $ loudness         <dbl> -6.764, -7.951, -27.528, -6.994, -7.516, -6.415, -8.4…
## $ mode             <int> 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ speechiness      <dbl> 0.0477, 0.3060, 0.0419, 0.0585, 0.2220, 0.1250, 0.324…
## $ acousticness     <dbl> 4.51e-01, 2.06e-01, 9.92e-01, 4.31e-01, 1.45e-01, 4.0…
## $ instrumentalness <dbl> 7.34e-04, 0.00e+00, 8.97e-01, 1.18e-06, 0.00e+00, 0.0…
## $ liveness         <dbl> 0.1000, 0.0912, 0.1020, 0.1230, 0.0753, 0.0912, 0.114…
## $ valence          <dbl> 0.6280, 0.5190, 0.0382, 0.5820, 0.6470, 0.8270, 0.776…
## $ tempo            <dbl> 95.968, 151.329, 75.296, 89.860, 155.117, 140.951, 74…
## $ duration_ms      <int> 304524, 247178, 286987, 208920, 179413, 224029, 14605…
## $ time_signature   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ liked            <int> 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,…

In this dataset, there are 195 data with 14 variables and we will only use the necessary variables.

Data Pre-processing

spotify <- spotify %>%
  select(-c(key, mode, time_signature, liked, loudness, tempo, duration_ms))
glimpse(spotify)

## Rows: 195
## Columns: 7
## $ danceability     <dbl> 0.803, 0.762, 0.261, 0.722, 0.787, 0.778, 0.666, 0.92…
## $ energy           <dbl> 0.6240, 0.7030, 0.0149, 0.7360, 0.5720, 0.6320, 0.589…
## $ speechiness      <dbl> 0.0477, 0.3060, 0.0419, 0.0585, 0.2220, 0.1250, 0.324…
## $ acousticness     <dbl> 4.51e-01, 2.06e-01, 9.92e-01, 4.31e-01, 1.45e-01, 4.0…
## $ instrumentalness <dbl> 7.34e-04, 0.00e+00, 8.97e-01, 1.18e-06, 0.00e+00, 0.0…
## $ liveness         <dbl> 0.1000, 0.0912, 0.1020, 0.1230, 0.0753, 0.0912, 0.114…
## $ valence          <dbl> 0.6280, 0.5190, 0.0382, 0.5820, 0.6470, 0.8270, 0.776…

colSums(is.na(spotify))

##     danceability           energy      speechiness     acousticness 
##                0                0                0                0 
## instrumentalness         liveness          valence 
##                0                0                0

There is no misssing data so we can continue to scalling data for the k-means clustering.

spotify_s <- scale(spotify)

Principal Component Analysis

First we need to check the correlation between the variables.

ggcorr(spotify_s, label = T)

There is a strong correlation between variables, namely valence and danceability.

spotify_pca <- prcomp(spotify_s, scale. = F)
summary(spotify_pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6    PC7
## Standard deviation     1.6986 1.2490 0.9606 0.8953 0.66258 0.47836 0.4030
## Proportion of Variance 0.4122 0.2229 0.1318 0.1145 0.06272 0.03269 0.0232
## Cumulative Proportion  0.4122 0.6351 0.7669 0.8814 0.94411 0.97680 1.0000

From the summary I got I want to retain information on the data as much as ~80%, PC1 to 4 will be maintained

spotify_wbcd <- as.data.frame(spotify_pca$x[, 1:4])
head(spotify_wbcd)

spotify_pca1 <- PCA(X = spotify_s,
              scale.unit = F,
              graph = F)

Visualize the individual factor plot.

plot.PCA(x = spotify_pca1,
         choix = "ind",
         select="contrib 5")

## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

There is no outlier based on the plot.

fviz_contrib(X = spotify_pca1,
             choice = "var",
             axes = 1)

The insight is that the 3 variables that contribute the most to PC 1 based on the correlation between the variables with PC 1 are instrumentalness, danceability and valence.

Clustering

Finding the optimal number of clusters

In order to perform cluster analysis, it is important to determine the ideal number of clusters beforehand. Clustering involves minimizing the total sum of squares within each cluster, which indicates that the distance between observations within the same cluster is minimized. The methods that can be used to determine the optimal number of clusters: the elbow method.

Elbow Method

The selection of the number of clusters using the elbow method is subjective. The general guideline is to choose the number of clusters at the point where the graph of the total within sum of squares starts to level off, resembling the shape of an elbow.

fviz_nbclust(spotify_s, kmeans, method = "wss", k.max = 5)  + labs(subtitle = "Elbow method")

A point is chosen where the total reduction within the sum of squares is no longer significant/sloping (elbow point). Based on the results of the elbow plot we will choose the value of k = 2.

K-Means Clustering

Perform k-means clustering based on the previous elbow method and store the results in the ws_kmeans object

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(333)

# your code here
ws_kmeans <- kmeans(x = spotify_s,
                    centers = 2)

Clustering Profile

Return the cluster label for each observation to the initial data before scaling, but the outliers have been removed.

spotify$cluster <- ws_kmeans$cluster

# your code here
spotify %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

Cluster 1: The lowest in speechiness, the highest in instrumentalness
Cluster 2: The lowest in instrumentalness, the highest in danceability

Visualize the cluster

library(ggiraphExtra)
ggRadar(data=spotify, aes(colour=cluster), interactive=TRUE)

Conclusion

By using PCA, the 3 variables that contribute the most to PC 1 based on the correlation between the variables and PC 1 are obtained, namely instrumentalness, danceability and valence.

By using K-means clustering, 2 clusters are obtained, namely:

Cluster 1: The lowest in speechiness, the highest in instrumentalness
Cluster 2: The lowest in instrumentalness, the highest in danceability