At this point in time, virtually everyone who owns a smartphone and likes listening to music has probably used Spotify or a similar service; I personally do not know anyone who still buys physical media, except as memorabilia. In this notebook I would like to cluster this dataset of tracks into several clusters for recommendation purposes. You might think anyone could just look at the genre they like and browse under that genre, and you would not be wrong, but why wouldn't you want to discover similar tracks based on the tracks you have already listened to? You might end up liking other genres and expanding your music library.
library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(ggplot2)
# stringsAsFactors = T turns every character column into a factor
spotify <- read.csv('SpotifyFeatures.csv', stringsAsFactors = T)
# the "ï.." prefix on the genre column comes from the file's byte-order mark
spotify <- spotify %>% filter(ï..genre %in% c("A Capella","Alternative", "Blues", "Classical", "Country", "Dance", "Electronic", "Folk", "Hip-Hop", "Indie", "Jazz", "Opera", "Pop", "R&B", "Rap", "Reggae", "Reggaeton", "Rock", "Ska", "Soul"))
levels(spotify$ï..genre)
## [1] "A Capella" "Alternative" "Anime"
## [4] "Blues" "Children's Music" "Childrenâ\200\231s Music"
## [7] "Classical" "Comedy" "Country"
## [10] "Dance" "Electronic" "Folk"
## [13] "Hip-Hop" "Indie" "Jazz"
## [16] "Movie" "Opera" "Pop"
## [19] "R&B" "Rap" "Reggae"
## [22] "Reggaeton" "Rock" "Ska"
## [25] "Soul" "Soundtrack" "World"
# re-encode genre as character and back to factor to drop the unused levels left over from filtering
spotify$ï..genre <- as.character(spotify$ï..genre)
spotify$ï..genre <- as.factor(spotify$ï..genre)
The dataset is obtained from Kaggle and was originally acquired through the Spotify API.
Primary:
- track_id (ID of the track generated by Spotify)

Numerical:
- acousticness (ranges from 0 to 1)
- danceability (ranges from 0 to 1)
- energy (ranges from 0 to 1)
- duration_ms (integer, typically ranging from 200k to 300k)
- instrumentalness (ranges from 0 to 1)
- valence (ranges from 0 to 1)
- popularity (ranges from 0 to 100)
- tempo (float, typically ranging from 50 to 150)
- liveness (ranges from 0 to 1)
- loudness (float, typically ranging from -60 to 0)
- speechiness (ranges from 0 to 1)

Dummy:
- mode (0 = Minor, 1 = Major)

Categorical:
- key (all keys of the octave encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, and so on…)
- time_signature (notational convention used in Western musical notation to specify how many beats are contained in each measure, for example '4/4', '5/4', '3/4', '1/4', '0/4')
- artist_name (list of artists mentioned)
- track_name (name of the song)
- genre (genre of the song)
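To cross-check these types against what read.csv actually produced, a quick structural peek can help; a minimal sketch using dplyr's glimpse (dplyr is already loaded):
# inspect column types and a few example values per column
glimpse(spotify)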
Check Missing Value
anyNA(spotify)
## [1] FALSE
Since we have already converted every string column to a factor, the data is now ready for the next step: EDA.
spotify %>% head()
## ï..genre artist_name track_name
## 1 R&B Mary J. Blige Be Without You - Kendu Mix
## 2 R&B Rihanna Desperado
## 3 R&B Yung Bleu Ice On My Baby (feat. Kevin Gates) - Remix
## 4 R&B Surfaces Heaven Falls / Fall on Me
## 5 R&B Olivia O'Brien Love Myself
## 6 R&B ELHAE Needs
## track_id popularity acousticness danceability duration_ms
## 1 2YegxR5As7BeQuVp2U6pek 65 0.0830 0.724 246333
## 2 6KFaHC9G178beAp7P0Vi5S 63 0.3230 0.685 186467
## 3 6muW8cSjJ3rusKJ0vH5olw 62 0.0675 0.762 199520
## 4 7yHqOZfsXYlicyoMt62yC6 61 0.3600 0.563 240597
## 5 4XzgjxGKqULifVf7mnDIQK 68 0.5960 0.653 213947
## 6 7KdRu0h7PQ0Ecfa37rUBzW 61 0.6610 0.510 205640
## energy instrumentalness key liveness loudness mode speechiness tempo
## 1 0.689 0.00e+00 D 0.3040 -5.922 Minor 0.1350 146.496
## 2 0.610 0.00e+00 C 0.1020 -5.221 Minor 0.0439 94.384
## 3 0.520 3.95e-06 F 0.1140 -5.237 Minor 0.0959 75.047
## 4 0.366 2.43e-03 B 0.0955 -6.896 Minor 0.1210 85.352
## 5 0.621 0.00e+00 B 0.0811 -5.721 Minor 0.0409 100.006
## 6 0.331 0.00e+00 B 0.1230 -13.073 Minor 0.0895 124.657
## time_signature valence
## 1 4/4 0.6930
## 2 3/4 0.3230
## 3 4/4 0.0862
## 4 4/4 0.7680
## 5 4/4 0.4660
## 6 4/4 0.2250
# average of every numeric feature per genre
spotifygenre <- spotify %>% select_if(is.numeric) %>% group_by(spotify[,1]) %>% summarise_all(mean)
# rename the grouping column to "genre" and drop the original
spotifygenre$genre <- spotifygenre$`spotify[, 1]`
spotifygenre$`spotify[, 1]` <- NULL
# z-score scale the 11 numeric columns (column 12 is the genre label)
spotifygenrescaled <- as.data.frame(scale(spotifygenre[,-12]))
ggcorr(spotify, label = T)
Several variables are strongly correlated with one another, so we could use PCA to reduce the number of variables.
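To put numbers on those correlations, one possible sketch (the num_cor and cor_pairs names are just placeholders I introduce here) is to flatten the correlation matrix and sort by absolute value:
# pairwise correlations among the numeric features
num_cor <- cor(spotify %>% select_if(is.numeric))
# flatten to long format, keep each unordered pair once, sort by strength
cor_pairs <- as.data.frame(as.table(num_cor)) %>%
  filter(as.character(Var1) < as.character(Var2)) %>%
  arrange(desc(abs(Freq)))
head(cor_pairs)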
While every track has its own designated genre, with the data provided we can also measure how close genres are to one another based on their audio properties. Below I plotted Popularity vs Energy; as you can see, many of the genres sit close to each other, which means we could group them into several clusters. However, because k-means is sensitive to range, we would have to remove any outlier genre, and the plot below cannot tell us which genre is actually an outlier because it uses only 2 of the 11 variables; for that we need PCA, as sketched right after the plot.
spotifygenre %>% ggplot(aes(popularity, energy, color = genre)) +
  geom_point()
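As a sketch of that PCA idea (reusing the spotifygenre and spotifygenrescaled objects built above; the intermediate names here are my own), the per-genre feature averages can be pushed through PCA so that every genre becomes a single point and outliers stand out:
# PCA on the per-genre feature averages; each point is one genre
genre_means <- as.data.frame(spotifygenrescaled)
rownames(genre_means) <- spotifygenre$genre
genre_pca <- PCA(genre_means, graph = FALSE)
# individuals plot: genres far from the main cloud are candidate outliers
fviz_pca_ind(genre_pca, repel = TRUE)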
# UL : K-Means Clustering
# keep the 11 numeric track-level features and attach the genre label
spotify2 <- spotify %>% select_if(is.numeric)
spotify2$genre <- spotify[,1]
# z-score scale the numeric columns (column 12 is genre)
spotify2z <- as.data.frame(scale(spotify2[,-12]))
Choosing K (number of clusters)
# function to plot the elbow method: total within-cluster sum of squares for k = 2..maxK
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(567)
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}
# example usage:
# kmeansTunning(your_data, maxK = 8)
kmeansTunning(spotify2z, maxK = 21)
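For reference, factoextra (already loaded) ships an equivalent elbow plot; a sketch, with the caveat that it re-runs k-means for every k and can be slow on a dataset of this size:
# built-in elbow (total within-cluster sum of squares) plot from factoextra
fviz_nbclust(spotify2z, kmeans, method = "wss", k.max = 21)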
Based on the elbow plot above, I decided to use k = 10.
spotify_kmeans <- kmeans(spotify2z, centers = 10)
# attach the cluster assignment back onto the original data
spotify$cluster <- as.factor(spotify_kmeans$cluster)
spotify_kmeans$size
## [1] 20296 9070 21630 15896 22056 24757 10811 10057 28121 10110
Now that every track has its own cluster, you can use this cluster assignment to push recommendations to listeners of a particular track! If the number of recommendations is too large, we can always increase the number of clusters; we do not have to abide strictly by the elbow method.
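As a minimal sketch of that recommendation idea (recommend_tracks is a hypothetical helper I am introducing here, not part of the original pipeline), we can look up the cluster of a seed track and sample other tracks from the same cluster:
# given a seed track name, suggest n other tracks from the same cluster
recommend_tracks <- function(data, seed_track, n = 5) {
  # cluster of the first row matching the seed track
  seed_cluster <- data %>%
    filter(track_name == seed_track) %>%
    pull(cluster) %>%
    first()
  # random picks from the same cluster, excluding the seed itself
  data %>%
    filter(cluster == seed_cluster, track_name != seed_track) %>%
    sample_n(n) %>%
    select(track_name, artist_name, cluster)
}
# example: recommend_tracks(spotify, "Desperado")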
spotify_pca <- PCA(spotify2 %>% select(-'genre'), graph=F, scale.unit = T)
spotify_pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.6344725 33.040659 33.04066
## comp 2 1.2255933 11.141757 44.18242
## comp 3 1.0732845 9.757132 53.93955
## comp 4 0.9900342 9.000311 62.93986
## comp 5 0.9688851 8.808047 71.74791
## comp 6 0.8891525 8.083204 79.83111
## comp 7 0.7066617 6.424197 86.25531
## comp 8 0.6803660 6.185145 92.44045
## comp 9 0.4320579 3.927799 96.36825
## comp 10 0.2829271 2.572064 98.94032
## comp 11 0.1165653 1.059684 100.00000
Through this process I can eliminate a certain number of PCs while retaining as much information as possible; this is dimensionality reduction. Based on the results above, if we keep only PC1-PC7 and eliminate the rest, we still retain about 86.25% of the information in the data while dropping 4 of the 11 dimensions, roughly 36% of the original ones.
spotify_pcax <- PCA(spotify2 %>% select(-'genre'), scale.unit = T, graph=F, ncp = 7)
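If we want to keep working with the reduced data (for example, to re-run the clustering on it), the scores of every track on the 7 retained components can be pulled out of the PCA object; a short sketch, with spotify_pca7 being an assumed name:
# coordinates of every track on the 7 retained principal components
spotify_pca7 <- as.data.frame(spotify_pcax$ind$coord)
dim(spotify_pca7)
head(spotify_pca7)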
This is how much of the information in each quantitative variable is summarized in PC1; change to Dim.2 for PC2, and so on.
# dimdesc: dimension description
dim <- dimdesc(spotify_pcax)
# variables that contribute to PC1
as.data.frame(dim$Dim.1$quanti) # quanti -> numeric variables
## correlation p.value
## loudness 0.88138016 0.000000e+00
## energy 0.84326902 0.000000e+00
## danceability 0.62126230 0.000000e+00
## valence 0.58009661 0.000000e+00
## popularity 0.47409265 0.000000e+00
## speechiness 0.29122712 0.000000e+00
## tempo 0.26810203 0.000000e+00
## liveness 0.09015844 1.175515e-308
## duration_ms -0.27681766 0.000000e+00
## instrumentalness -0.54557018 0.000000e+00
## acousticness -0.81252740 0.000000e+00
Here is the same information, visualized.
fviz_contrib(X = spotify_pcax,
             choice = "var",
             axes = 1)
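And, as noted above, the same contribution plot for PC2 only needs the axes argument changed:
fviz_contrib(X = spotify_pcax,
             choice = "var",
             axes = 2)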