In this exercise, we perform an exploratory data analysis with PCA on a Spotify dataset. Our objective is to convert our variables into principal components (PCs), which we use later as the basis for clustering the dataset.
## Parsed with column specification:
## cols(
## genre = col_character(),
## artist_name = col_character(),
## track_name = col_character(),
## track_id = col_character(),
## popularity = col_double(),
## acousticness = col_double(),
## danceability = col_double(),
## duration_ms = col_double(),
## energy = col_double(),
## instrumentalness = col_double(),
## key = col_character(),
## liveness = col_double(),
## loudness = col_double(),
## mode = col_character(),
## speechiness = col_double(),
## tempo = col_double(),
## time_signature = col_character(),
## valence = col_double()
## )
We convert the Genre, Artist Name, Key, Mode, and Time Signature columns to factors, and remove Track Name and Track Id.
spotify_sel <- spotify %>%
  mutate(genre = factor(genre),
         artist_name = factor(artist_name),
         key = factor(key),
         mode = factor(mode),
         time_signature = factor(time_signature)) %>%
  select(-track_name, -track_id)

The next step, sampling the data, is not required for PCA; we do it because the full dataset is large enough to cause memory issues. Since this is an exercise in data exploration with PCA, I think working with a portion of the data shouldn’t be a problem.
set.seed(1234)
# Keep 10% of the rows as the working sample
spot_split <- initial_split(spotify_sel, prop = .1)
spotify_working <- training(spot_split)

We store a scaled copy of the data for the elbow method, which we use later to determine the best number of clusters.
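A minimal sketch of the scaling step, assuming base R's `scale()`; the notebook's exact call is not shown. `toy` is a hypothetical stand-in for `spotify_working` with made-up values. Scaling matters here because columns such as `duration_ms` and `tempo` sit on far larger numeric ranges than the 0 to 1 audio features, and both PCA and K-means are sensitive to that.

```r
set.seed(1234)

# Hypothetical stand-in for spotify_working (made-up values)
toy <- data.frame(
  genre       = factor(rep(c("Comedy", "Folk", "Soundtrack"), each = 20)),
  duration_ms = rnorm(60, mean = 230000, sd = 60000),
  tempo       = rnorm(60, mean = 120, sd = 25),
  energy      = runif(60)
)

# Factors cannot be scaled, so keep the numeric columns only
num_cols <- vapply(toy, is.numeric, logical(1))

# Center each column to mean 0 and rescale it to standard deviation 1
toy_scaled <- scale(toy[, num_cols])

# After scaling, every column has mean ~0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```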
Our PCA object. We use a few categorical variables (Genre, Artist Name, Key, Mode, and Time Signature) as grouping variables: they label the observations but are not used to compute the components.
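One way to sketch this step uses base R's `prcomp()`. This is an assumption, since the source does not show which PCA function was used, and `toy` is a hypothetical stand-in for `spotify_working` with made-up values. The categorical column is left out of the computation and re-attached to the scores afterwards so it can label or color the points.

```r
set.seed(1234)

# Hypothetical stand-in for spotify_working (made-up values)
toy <- data.frame(
  genre        = factor(rep(c("Comedy", "Folk", "Soundtrack"), each = 20)),
  speechiness  = runif(60),
  acousticness = runif(60),
  energy       = runif(60)
)

num_cols <- vapply(toy, is.numeric, logical(1))

# PCA on centered, unit-variance numeric columns
pca <- prcomp(toy[, num_cols], center = TRUE, scale. = TRUE)

# Standard deviation and proportion of variance for each component
summary(pca)$importance

# Scores (one row per track) with genre re-attached for grouping
scores <- data.frame(pca$x, genre = toy$genre)
head(scores)
```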
Our individual PCA plot, with points colored by genre.
Our variable PCA plot.
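The two plots can also be read numerically. The sketch below, again assuming `prcomp()` and a hypothetical `toy` data frame with made-up values, computes each variable's coordinates on the first two components (the loading multiplied by the component's standard deviation, which for standardized data equals the correlation between the variable and the component) and the mean position of each genre.

```r
set.seed(1234)

# Hypothetical stand-in for spotify_working (made-up values)
toy <- data.frame(
  genre            = factor(rep(c("Comedy", "Folk", "Soundtrack"), each = 20)),
  speechiness      = runif(60),
  liveness         = runif(60),
  instrumentalness = runif(60)
)
num <- toy[, c("speechiness", "liveness", "instrumentalness")]

pca <- prcomp(num, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- sweep(pca$rotation[, 1:2], 2, pca$sdev[1:2], `*`)
var_coord

# Mean position of each genre on PC1 and PC2
genre_means <- aggregate(as.data.frame(pca$x[, 1:2]),
                         by = list(genre = toy$genre), FUN = mean)
genre_means
```

A genre whose mean position points in the same direction as a variable's coordinates tends to score high on that variable.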
From the two plots, I think there are a few insights we can draw:
1. Soundtrack and Movie are strongly instrumental
2. The Comedy genre dominates the speechiness and liveness variables
3. The Folk genre is strong on the acousticness variable
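The elbow method is a heuristic for choosing the best K for our K-means function. For each candidate K we compute the total within-cluster sum of squares (WSS) and look for the point where adding another cluster stops reducing it sharply. Because the total sum of squares is fixed and equals WSS plus the between-cluster sum of squares (BSS), a smaller WSS also means a larger BSS.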
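The idea can be sketched with base R's `kmeans()`. The data below is hypothetical, built with three well-separated groups so the elbow is easy to see; the range of K and the `nstart` value are illustrative choices, not the notebook's settings.

```r
set.seed(1234)

# Hypothetical data with three well-separated groups (made-up values)
toy <- data.frame(
  speechiness      = c(rnorm(30, 0.10, 0.03), rnorm(30, 0.80, 0.03), rnorm(30, 0.05, 0.03)),
  instrumentalness = c(rnorm(30, 0.10, 0.03), rnorm(30, 0.05, 0.03), rnorm(30, 0.80, 0.03))
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for K = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the K after which the curve stops dropping sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

With K = 1 the WSS equals the total sum of squares, so the curve always starts at its maximum and falls as K grows; the useful signal is where the drop flattens.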
From our elbow method visualization, I think our data can be divided into 3 clusters.
We fit the clusters and store the cluster assignment in our original, unscaled data.
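A minimal sketch of this step, assuming base R's `kmeans()` with K = 3; `toy` is hypothetical data with made-up values, and `nstart = 25` is an illustrative choice. The model is fitted on the scaled copy, while the labels are attached to the unscaled data so that later summaries stay in the original units.

```r
set.seed(1234)

# Hypothetical data with three well-separated groups (made-up values)
toy <- data.frame(
  speechiness      = c(rnorm(30, 0.10, 0.03), rnorm(30, 0.80, 0.03), rnorm(30, 0.05, 0.03)),
  instrumentalness = c(rnorm(30, 0.10, 0.03), rnorm(30, 0.05, 0.03), rnorm(30, 0.80, 0.03))
)
toy_scaled <- scale(toy)

# Fit K-means on the scaled data; nstart reruns it from several random starts
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# km$cluster is in row order, so it can be attached to the unscaled data directly
toy$cluster <- factor(km$cluster)

# Cluster means in the original units
aggregate(. ~ cluster, data = toy, FUN = mean)

# Total SS splits into within-cluster SS plus between-cluster SS
all.equal(km$totss, km$tot.withinss + km$betweenss)
```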
Our cluster visualization.
To better understand the characteristics of each cluster, let’s compare them across different variables.
spotify_working %>%
  select(-genre, -artist_name, -key, -mode, -time_signature) %>%
  group_by(cluster) %>%
  summarise_all(mean)

I’ve split the genre breakdown into three plots, one per cluster, for easier comparison.
spotify_working %>%
  group_by(cluster, genre) %>%
  select(-key, -mode, -time_signature) %>%
  filter(cluster == 1) %>%
  # Count tracks per genre within the cluster
  summarise(genre_n = n()) %>%
  ggplot(aes(reorder(genre, -genre_n), genre_n, fill = genre)) +
  geom_col() +
  labs(title = "Genre in Cluster 1",
       x = "Genre",
       y = "Total in Cluster 1") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

spotify_working %>%
  group_by(cluster, genre) %>%
  select(-key, -mode, -time_signature) %>%
  filter(cluster == 2) %>%
  # Count tracks per genre within the cluster
  summarise(genre_n = n()) %>%
  ggplot(aes(reorder(genre, -genre_n), genre_n, fill = genre)) +
  geom_col() +
  labs(title = "Genre in Cluster 2",
       x = "Genre",
       y = "Total in Cluster 2") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

spotify_working %>%
  group_by(cluster, genre) %>%
  select(-key, -mode, -time_signature) %>%
  filter(cluster == 3) %>%
  # Count tracks per genre within the cluster
  summarise(genre_n = n()) %>%
  ggplot(aes(reorder(genre, -genre_n), genre_n, fill = genre)) +
  geom_col() +
  labs(title = "Genre in Cluster 3",
       x = "Genre",
       y = "Total in Cluster 3") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

After looking at all three clusters, I think we can roughly see the characteristics of each one:
Cluster 1 is the “normal” kind of music
Cluster 2 is the Comedy cluster
Cluster 3 is the instrumental, movie-soundtrack cluster