Introduction

In this exercise, we perform an exploratory data analysis of the Spotify dataset using PCA. Our objective is to convert the variables into principal components (PCs) that can later be used to help cluster the dataset.

Library

library(tidyverse)
library(tidymodels)
library(FactoMineR)
library(factoextra)
library(skimr)
library(wesanderson)

Import Data

spotify <- read_csv("data_input/SpotifyFeatures.csv")

## Parsed with column specification:
## cols(
##   genre = col_character(),
##   artist_name = col_character(),
##   track_name = col_character(),
##   track_id = col_character(),
##   popularity = col_double(),
##   acousticness = col_double(),
##   danceability = col_double(),
##   duration_ms = col_double(),
##   energy = col_double(),
##   instrumentalness = col_double(),
##   key = col_character(),
##   liveness = col_double(),
##   loudness = col_double(),
##   mode = col_character(),
##   speechiness = col_double(),
##   tempo = col_double(),
##   time_signature = col_character(),
##   valence = col_double()
## )

Data Preview

head(spotify)

Data Wrangling

We convert the genre, artist_name, key, mode, and time_signature columns to factors, and remove the track_name and track_id columns.

spotify_sel <- spotify %>% 
  mutate(genre = factor(genre),
         artist_name = factor(artist_name),
         key = factor(key),
         mode = factor(mode),
         time_signature = factor(time_signature)) %>% 
  select(-track_name, -track_id)

Data Downsampling

This step is not required for PCA, but the full dataset is large enough to cause memory issues. Since this is an exercise in data exploration with PCA, working with a 10% sample of the data should not be a problem.

set.seed(1234)
spot_split <- initial_split(spotify_sel, prop = .1)
spotify_working <- training(spot_split)

We also store a scaled copy of the numeric columns, which we use later in the elbow method to choose the number of clusters.

spotify_scale <- spotify_working %>% 
  select_if(is.numeric) %>% 
  scale()
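
As a small illustration of what scale() does, the sketch below standardizes a made-up data frame (the values are invented, not taken from the Spotify data). Each column ends up with mean 0 and standard deviation 1, which matters for k-means because a variable measured in large units, such as duration_ms, would otherwise dominate the Euclidean distances.

```r
# Toy data with two variables on very different scales (values are invented).
toy <- data.frame(
  tempo       = c(90, 120, 150, 100, 140),
  duration_ms = c(180000, 240000, 200000, 300000, 210000)
)

# Center each column to mean 0 and rescale it to standard deviation 1.
toy_scaled <- scale(toy)

round(colMeans(toy_scaled), 10)  # column means, approximately 0
apply(toy_scaled, 2, sd)         # column standard deviations, equal to 1
```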

PCA Functions

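We now run the PCA. The categorical variables genre, artist_name, key, mode, and time_signature are passed as supplementary qualitative variables (quali.sup), so they do not influence the components but can still be used to label and color the plots.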

spotify_pca <- PCA(spotify_working,
                   scale.unit = TRUE,
                   quali.sup = c(1, 2, 9, 12, 15),
                   graph = FALSE)
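
To show how the "PC data" can be pulled out of a FactoMineR result, here is a minimal sketch on the built-in iris data, which stands in for spotify_working (four numeric columns plus one factor). The object names iris_pca and pc_scores are only illustrative.

```r
library(FactoMineR)

# iris stands in for spotify_working: column 5 (Species) is a factor,
# so it is declared as a supplementary qualitative variable.
iris_pca <- PCA(iris, scale.unit = TRUE, quali.sup = 5, graph = FALSE)

# Eigenvalue, percentage of variance, and cumulative percentage per component.
iris_pca$eig

# Coordinates of each observation on the principal components.
pc_scores <- as.data.frame(iris_pca$ind$coord)
head(pc_scores)
```

The same two elements, spotify_pca$eig and spotify_pca$ind$coord, should be available on the PCA object created above.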

Our Individual PCA plot with Genre as our color information

plot.PCA(spotify_pca,
         choix = "ind",
         habillage = 1,
         select = "contrib5",
         invisible = "quali")

Our Variable PCA plot

plot.PCA(spotify_pca,
         choix = "var")

From the two plots, I think there are a few insights we can draw:
1. The Soundtrack and Movie genres are strongly associated with Instrumentalness
2. The Comedy genre dominates the Speechiness and Liveness variables
3. The Folk genre is strong on the Acousticness variable

Data Clustering with K-Means

Elbow Method

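The elbow method is a heuristic for choosing K for our K-means function. We plot the total within-cluster sum of squares (WSS) against K and pick the K at the "elbow", the point after which adding more clusters only reduces WSS slightly (equivalently, only increases the between-cluster sum of squares, BSS, slightly).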
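
The wss() function used in this step is not defined in the code shown here, and it does not appear to come from the packages loaded above, so it is presumably a custom helper. The sketch below is a hypothetical stand-in, not the author's implementation: it runs kmeans() for K = 1 to max_k (an assumed parameter) and plots the total within-cluster sum of squares for each K. It is demonstrated on the scaled iris data.

```r
# Hypothetical elbow-plot helper (an assumption, not the original wss()).
wss <- function(data, max_k = 10) {
  k_values <- seq_len(max_k)
  # Total within-cluster sum of squares for each candidate K.
  total_withinss <- sapply(k_values, function(k) {
    kmeans(data, centers = k, nstart = 5, iter.max = 30)$tot.withinss
  })
  plot(k_values, total_withinss, type = "b",
       xlab = "Number of clusters (K)",
       ylab = "Total within-cluster sum of squares")
  invisible(total_withinss)
}

set.seed(1234)
elbow_values <- wss(scale(iris[, 1:4]), max_k = 6)
elbow_values
```

The factoextra function fviz_nbclust(x, kmeans, method = "wss") draws a comparable plot and could be used instead of a custom helper.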

wss(spotify_scale)

From our elbow method visualization, I think our data can be divided into 3 clusters.

K-means visualization

Setting up our cluster and storing it inside our original, unscaled data.

spotify_km <- kmeans(spotify_scale, centers = 3)
spotify_working$cluster <- spotify_km$cluster
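
Because kmeans() starts from randomly chosen centers, the cluster labels and sizes can change between runs. The sketch below, again on the scaled iris data as a stand-in, shows one way to make the result more stable (a seed plus several random starts through nstart) and two quick checks on the fitted object. The choice of nstart = 25 is an assumption, not something taken from this analysis.

```r
set.seed(1234)  # fix the random starting centers for reproducibility

toy_scaled <- scale(iris[, 1:4])
toy_km <- kmeans(toy_scaled, centers = 3, nstart = 25)

toy_km$size                      # number of observations in each cluster
toy_km$betweenss / toy_km$totss  # share of total sum of squares between clusters
```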

Our data cluster visualization.

fviz_cluster(spotify_km, data = spotify_scale)

Visualization

To better understand our cluster characteristics, let’s look at the size of each cluster and the mean of every numeric variable per cluster.

spotify_working %>% 
  count(cluster)

spotify_working %>%
  select(-genre, -artist_name, -key, -mode, -time_signature) %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

Genre across Cluster

The genre counts are shown separately for each of the 3 clusters for easier comparison.

Cluster 1

spotify_working %>% 
  group_by(cluster, genre) %>% 
  select(-key, -mode, -time_signature) %>% 
  filter(cluster == 1) %>% 
  summarise (genre_n = n()) %>% 
  ggplot(aes(reorder(genre, -genre_n), genre_n, fill = genre)) + geom_col() +
  labs(title = "Genre in Cluster 1",
       x = "Genre",
       y = "Total in Cluster 1") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Cluster 2

spotify_working %>% 
  group_by(cluster, genre) %>% 
  select(-key, -mode, -time_signature) %>% 
  filter(cluster == 2) %>% 
  summarise (genre_n = n()) %>% 
  ggplot(aes(reorder(genre, -genre_n), genre_n, fill = genre)) + geom_col() +
  labs(title = "Genre in Cluster 2",
       x = "Genre",
       y = "Total in Cluster 2") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Cluster 3

spotify_working %>% 
  group_by(cluster, genre) %>% 
  select(-key, -mode, -time_signature) %>% 
  filter(cluster == 3) %>% 
  summarise (genre_n = n()) %>% 
  ggplot(aes(reorder(genre, -genre_n), genre_n, fill = genre)) + geom_col() +
  labs(title = "Genre in Cluster 3",
       x = "Genre",
       y = "Total in Cluster 3") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))
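
The three plots above repeat the same code with a different filter. As an alternative sketch, the counts for all clusters can be drawn in one figure with facet_wrap(). The toy data frame below is invented and only stands in for spotify_working. Unlike the plots above, the bars are not reordered by count within each panel.

```r
library(dplyr)
library(ggplot2)

# Invented toy data standing in for spotify_working.
toy <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  genre   = c("Pop", "Pop", "Rock", "Comedy", "Comedy",
              "Soundtrack", "Soundtrack", "Movie", "Pop")
)

genre_plot <- toy %>%
  count(cluster, genre, name = "genre_n") %>%  # one row per cluster/genre pair
  ggplot(aes(genre, genre_n, fill = genre)) +
  geom_col() +
  facet_wrap(~ cluster, scales = "free_x") +   # one panel per cluster
  labs(title = "Genre by Cluster",
       x = "Genre",
       y = "Total in Cluster") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

genre_plot
```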

Cluster Conclusion

After looking at all three clusters, I think we can roughly characterize each of them.

Cluster 1 is the “normal” kind of music
Cluster 2 is the Comedy cluster
Cluster 3 is the instrumental, movie-soundtrack cluster

Spotify PCA

Deo Ivan Mareza

3/7/2020