Intro

At this point, virtually everyone who owns a smartphone and likes listening to music has used Spotify or a similar service. I personally do not know anyone who still buys physical media, except as memorabilia. In this notebook I would like to cluster the tracks in this dataset into several groups for recommendation purposes. You might think anyone could just browse the genres they already like, and you would not be wrong. But why not discover similar tracks based on the ones you have already listened to? You might end up liking other genres and expanding your music library.

library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
library(ggplot2)
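# Note: the "ï..genre" column name used below is most likely a byte-order-mark (BOM) artifact;
# reading with fileEncoding = "UTF-8-BOM" would normally give a plain "genre" column instead.
spotify <- read.csv('SpotifyFeatures.csv', stringsAsFactors = TRUE)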
spotify <- spotify %>% filter(ï..genre %in% c("A Capella","Alternative", "Blues", "Classical", "Country", "Dance", "Electronic", "Folk", "Hip-Hop", "Indie", "Jazz", "Opera", "Pop", "R&B", "Rap", "Reggae", "Reggaeton", "Rock", "Ska", "Soul"))
levels(spotify$ï..genre)
##  [1] "A Capella"          "Alternative"        "Anime"             
##  [4] "Blues"              "Children's Music"   "Childrenâ\200\231s Music"
##  [7] "Classical"          "Comedy"             "Country"           
## [10] "Dance"              "Electronic"         "Folk"              
## [13] "Hip-Hop"            "Indie"              "Jazz"              
## [16] "Movie"              "Opera"              "Pop"               
## [19] "R&B"                "Rap"                "Reggae"            
## [22] "Reggaeton"          "Rock"               "Ska"               
## [25] "Soul"               "Soundtrack"         "World"
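# filter() keeps the unused factor levels (see the output above), so round-trip
# through character to drop the genres that were filtered out
spotify$ï..genre <- as.character(spotify$ï..genre)
spotify$ï..genre <- as.factor(spotify$ï..genre)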
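
As a side note on the factor levels above: filtering rows does not remove unused levels, which is why the full genre list is still printed. Here is a minimal sketch of the same behaviour on a toy factor (the values are invented for illustration), with base R's droplevels() as an alternative to the character round trip.

```r
# Toy factor standing in for the genre column (illustrative values only)
genre <- factor(c("Rock", "Jazz", "Anime", "Rock", "Comedy"))

# Subsetting keeps every original level, even those with no rows left
kept <- genre[genre %in% c("Rock", "Jazz")]
levels(kept)

# droplevels() discards the unused levels in one step
kept_clean <- droplevels(kept)
levels(kept_clean)

# The as.character() / as.factor() round trip ends up with the same levels
identical(levels(kept_clean), levels(as.factor(as.character(kept))))
```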

Data Dictionary

The dataset was obtained from Kaggle and was originally collected through the Spotify API.

Primary:

  - track_id (ID of the track, generated by Spotify)

Numerical:

  - acousticness (ranges from 0 to 1)
  - danceability (ranges from 0 to 1)
  - energy (ranges from 0 to 1)
  - duration_ms (integer, typically ranging from 200k to 300k)
  - instrumentalness (ranges from 0 to 1)
  - valence (ranges from 0 to 1)
  - popularity (ranges from 0 to 100)
  - tempo (float, typically ranging from 50 to 150)
  - liveness (ranges from 0 to 1)
  - loudness (float, typically ranging from -60 to 0)
  - speechiness (ranges from 0 to 1)

Dummy:

  - mode (0 = Minor, 1 = Major)

Categorical:

  - key (all keys of the octave encoded as values from 0 to 11, starting with C as 0, C# as 1, and so on)
  - time_signature (notational convention used in Western musical notation to specify how many beats are contained in each measure, for example ‘4/4’, ‘5/4’, ‘3/4’, ‘1/4’, ‘0/4’)
  - artist_name (list of artists mentioned)
  - track_name (name of the song)
  - genre (genre of the song)

Data Prep

Check Missing Values

anyNA(spotify)
## [1] FALSE
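
anyNA() only answers yes or no for the whole data frame. If it ever returned TRUE, a per-column count would show where the gaps are. A small sketch on a made-up data frame (the `toy` object and its values are hypothetical, not part of the Spotify data):

```r
# Hypothetical toy data frame with two missing values
toy <- data.frame(
  energy  = c(0.7, NA, 0.5),
  tempo   = c(120, 95, NA),
  valence = c(0.3, 0.6, 0.9)
)

# Count the missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```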

Since every string column was already converted to a factor when the data was read, and there are no missing values, our data is ready for the next step: EDA.

Exploratory Data Analysis

spotify %>% head()
##   ï..genre    artist_name                                 track_name
## 1      R&B  Mary J. Blige                 Be Without You - Kendu Mix
## 2      R&B        Rihanna                                  Desperado
## 3      R&B      Yung Bleu Ice On My Baby (feat. Kevin Gates) - Remix
## 4      R&B       Surfaces                  Heaven Falls / Fall on Me
## 5      R&B Olivia O'Brien                                Love Myself
## 6      R&B          ELHAE                                      Needs
##                 track_id popularity acousticness danceability duration_ms
## 1 2YegxR5As7BeQuVp2U6pek         65       0.0830        0.724      246333
## 2 6KFaHC9G178beAp7P0Vi5S         63       0.3230        0.685      186467
## 3 6muW8cSjJ3rusKJ0vH5olw         62       0.0675        0.762      199520
## 4 7yHqOZfsXYlicyoMt62yC6         61       0.3600        0.563      240597
## 5 4XzgjxGKqULifVf7mnDIQK         68       0.5960        0.653      213947
## 6 7KdRu0h7PQ0Ecfa37rUBzW         61       0.6610        0.510      205640
##   energy instrumentalness key liveness loudness  mode speechiness   tempo
## 1  0.689         0.00e+00   D   0.3040   -5.922 Minor      0.1350 146.496
## 2  0.610         0.00e+00   C   0.1020   -5.221 Minor      0.0439  94.384
## 3  0.520         3.95e-06   F   0.1140   -5.237 Minor      0.0959  75.047
## 4  0.366         2.43e-03   B   0.0955   -6.896 Minor      0.1210  85.352
## 5  0.621         0.00e+00   B   0.0811   -5.721 Minor      0.0409 100.006
## 6  0.331         0.00e+00   B   0.1230  -13.073 Minor      0.0895 124.657
##   time_signature valence
## 1            4/4  0.6930
## 2            3/4  0.3230
## 3            4/4  0.0862
## 4            4/4  0.7680
## 5            4/4  0.4660
## 6            4/4  0.2250
spotifygenre <- spotify %>% select_if(is.numeric) %>%  group_by(spotify[,1]) %>% summarise_all(mean)
spotifygenre$genre <- spotifygenre$`spotify[, 1]`
spotifygenre$`spotify[, 1]` <- NULL
spotifygenrescaled <- as.data.frame(scale(spotifygenre[,-12]))

PCA Possibility

ggcorr(spotify, label = T)

Several variables are strongly correlated with one another, so we could use PCA to reduce the number of variables.
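
To make "strongly correlated" concrete, the correlation matrix can also be listed pair by pair. The sketch below uses the built-in mtcars data as a stand-in, and the 0.7 cut-off is my own assumption, not a rule from this notebook.

```r
# Stand-in numeric data (built-in dataset, not the Spotify data)
num <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pairwise Pearson correlations
cor_mat <- cor(num)

# Keep each pair once by blanking the lower triangle and the diagonal
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Reshape to one row per pair
pair_table <- as.data.frame(as.table(cor_mat))
names(pair_table) <- c("var1", "var2", "correlation")

# Assumed threshold for calling a pair "strongly correlated"
strong <- subset(pair_table, !is.na(correlation) & abs(correlation) >= 0.7)
strong[order(-abs(strong$correlation)), ]
```

The more pairs that end up in this table, the more redundancy there is for PCA to compress.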

Clustering Possibility

While every track has its own designated genre, the provided data lets us determine how close the genres are to each other based on their properties. Below I plot popularity against energy, averaged per genre. As you can see, many of the genres sit close to each other, which means we could group them into several clusters. However, because k-means is sensitive to the range of the variables and to outliers, we would have to remove the outlier genres first. The plot below also does not tell us which genres are actually outliers, because it uses only 2 of the 11 variables; for that we need PCA.

spotifygenre %>% ggplot(aes(popularity, energy, color = genre))+
  geom_point()
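
The point about k-means being sensitive to range can be checked with a small calculation. In this sketch (four invented tracks, not real data), I measure how much of the squared Euclidean distance between two tracks comes from energy, before and after scaling.

```r
# Four hypothetical tracks (values invented for illustration)
tracks <- data.frame(
  energy      = c(0.90, 0.10, 0.85, 0.15),
  duration_ms = c(200000, 201000, 230000, 231000),
  row.names   = c("a", "b", "c", "d")
)

# Share of the squared Euclidean distance between rows i and j
# that comes from the energy column
energy_share <- function(m, i, j) {
  d2 <- (m[i, ] - m[j, ])^2
  unname(d2["energy"] / sum(d2))
}

raw    <- as.matrix(tracks)
scaled <- scale(tracks)   # z-scores: mean 0, sd 1 per column

energy_share(raw, "a", "b")     # close to 0: duration_ms dominates
energy_share(scaled, "a", "b")  # close to 1: energy now counts
```

Tracks a and b have opposite energy but nearly the same duration. On the raw values the distance barely notices the energy gap, which is why the data is passed through scale() before kmeans().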

UL : K-Means Clustering

spotify2 <- spotify %>% select_if(is.numeric)
spotify2$genre <-  spotify[,1]
spotify2z <- as.data.frame(scale(spotify2[,-12]))

Choosing K (number of clusters)

# function to plot the elbow method
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(567)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}

# usage example:
# kmeansTunning(your_data, maxK = 8)
kmeansTunning(spotify2z, maxK = 21)

Based on the plot above, I decided to use k = 10.
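
If you would rather read the numbers behind the elbow than eyeball the plot, a variant of the helper can return them as a table. This is only a sketch: `elbow_wss` is a hypothetical name, `nstart = 10` is my own addition (it is not used in the function above), and the built-in iris data stands in for the scaled Spotify features.

```r
# Variant of the elbow helper that returns the values instead of plotting.
# nstart > 1 reruns k-means from several random starts and keeps the best fit.
elbow_wss <- function(data, max_k, seed = 567, nstart = 10) {
  ks <- 2:max_k
  wss <- vapply(ks, function(k) {
    set.seed(seed)
    kmeans(data, centers = k, nstart = nstart)$tot.withinss
  }, numeric(1))
  data.frame(k = ks, tot_withinss = wss)
}

# Stand-in data: scaled numeric columns of a built-in dataset
toy <- scale(iris[, 1:4])
res <- elbow_wss(toy, max_k = 8)
res
```

Plotting res$k against res$tot_withinss gives the same kind of elbow curve as before.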

spotify_kmeans <- kmeans(spotify2z, centers= 10)
spotify$cluster <-  as.factor(spotify_kmeans$cluster)
spotify_kmeans$size
##  [1] 20296  9070 21630 15896 22056 24757 10811 10057 28121 10110

Now that every track has been assigned a cluster, you can use the cluster labels to recommend tracks to listeners of a particular track. If a cluster yields too many recommendations, we can always increase the number of clusters; we do not have to abide by the elbow method.
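
Here is one way the cluster labels could be turned into recommendations. It is a sketch under assumptions: `recommend_from_cluster` is a hypothetical helper, the toy data is invented, and ranking candidates by popularity is just one possible choice.

```r
# Hypothetical helper: return up to n tracks from the same cluster as
# track_id, most popular first (the ranking rule is an assumption)
recommend_from_cluster <- function(tracks, track_id, n = 5) {
  target <- tracks$cluster[tracks$track_id == track_id]
  if (length(target) == 0) stop("track_id not found")
  # A track id could appear more than once, so use the first match
  pool <- tracks[tracks$cluster == target[1] & tracks$track_id != track_id, ]
  pool <- pool[order(-pool$popularity), ]
  head(pool, n)
}

# Invented toy data with the same column names as the notebook's data frame
toy <- data.frame(
  track_id   = c("t1", "t2", "t3", "t4", "t5"),
  popularity = c(65, 40, 80, 55, 70),
  cluster    = factor(c(1, 1, 2, 1, 2)),
  stringsAsFactors = FALSE
)

recs <- recommend_from_cluster(toy, "t1", n = 2)
recs
```

With more clusters, each pool gets smaller, which is the trade-off mentioned above.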

UL : PCA

spotify_pca <- PCA(spotify2 %>% select(-'genre'), graph=F, scale.unit = T)
spotify_pca$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   3.6344725              33.040659                          33.04066
## comp 2   1.2255933              11.141757                          44.18242
## comp 3   1.0732845               9.757132                          53.93955
## comp 4   0.9900342               9.000311                          62.93986
## comp 5   0.9688851               8.808047                          71.74791
## comp 6   0.8891525               8.083204                          79.83111
## comp 7   0.7066617               6.424197                          86.25531
## comp 8   0.6803660               6.185145                          92.44045
## comp 9   0.4320579               3.927799                          96.36825
## comp 10  0.2829271               2.572064                          98.94032
## comp 11  0.1165653               1.059684                         100.00000

Through this process, I can drop a certain number of PCs while retaining as much information as possible; this is dimensionality reduction. Based on the result above, if we keep only PC1-PC7 and drop the rest, we still retain about 86.3% of the information (variance) in our data while removing 4 of the 11 dimensions, about 36% of the original dimensionality.
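
The arithmetic behind that choice can be reproduced from the eigenvalues printed above. The 85% target below is an assumed threshold for illustration; pick whatever level suits your use case.

```r
# Eigenvalues copied from the spotify_pca$eig output above
eig <- c(3.6344725, 1.2255933, 1.0732845, 0.9900342, 0.9688851, 0.8891525,
         0.7066617, 0.6803660, 0.4320579, 0.2829271, 0.1165653)

# Percentage of variance per component and its running total
pct     <- eig / sum(eig) * 100
cum_pct <- cumsum(pct)

# Smallest number of components reaching the assumed 85% target
n_keep <- which(cum_pct >= 85)[1]
n_keep

# Variance retained and share of dimensions dropped
cum_pct[n_keep]
(length(eig) - n_keep) / length(eig) * 100
```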

spotify_pcax <- PCA(spotify2 %>% select(-'genre'), scale.unit = T, graph=F, ncp = 7)

The table below shows how strongly each quantitative variable correlates with PC1; change to Dim.2 for PC2, and so on.

# dimdesc: dimension description
dim_desc <- dimdesc(spotify_pcax)

# variables that contribute to PC1
as.data.frame(dim_desc$Dim.1$quanti) # quantitative -> numeric variables
##                  correlation       p.value
## loudness          0.88138016  0.000000e+00
## energy            0.84326902  0.000000e+00
## danceability      0.62126230  0.000000e+00
## valence           0.58009661  0.000000e+00
## popularity        0.47409265  0.000000e+00
## speechiness       0.29122712  0.000000e+00
## tempo             0.26810203  0.000000e+00
## liveness          0.09015844 1.175515e-308
## duration_ms      -0.27681766  0.000000e+00
## instrumentalness -0.54557018  0.000000e+00
## acousticness     -0.81252740  0.000000e+00
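
For intuition about where these correlations come from, here is a sketch using base R's prcomp() on the built-in iris data (a stand-in, not the Spotify data). For standardized variables, the correlation between a variable and a PC equals its loading multiplied by that PC's standard deviation.

```r
# Stand-in data: four numeric columns of a built-in dataset
x <- iris[, 1:4]

# PCA on standardized variables (comparable to scale.unit = TRUE)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Correlation of each variable with the PC1 scores, computed directly
cor_direct <- cor(x, pca$x[, 1])[, 1]

# The same quantity from the loadings and the PC1 standard deviation
cor_from_loadings <- pca$rotation[, 1] * pca$sdev[1]

all.equal(cor_direct, cor_from_loadings)
```

The sign of a principal component is arbitrary, so only the relative signs between variables carry meaning.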

Here is similar information, visualized as each variable's contribution to PC1.

fviz_contrib(X = spotify_pcax,
             choice = "var",
             axes = 1)
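
The contribution percentages in this kind of plot can be approximated by hand from the correlation table above. My understanding is that, with standardized variables, a variable's contribution to a PC is its squared correlation divided by the sum of squared correlations, times 100; treat this as a sketch rather than the exact internals of fviz_contrib().

```r
# Correlations with PC1, copied from the dimdesc output above
pc1_cor <- c(
  loudness = 0.88138016, energy = 0.84326902, danceability = 0.62126230,
  valence = 0.58009661, popularity = 0.47409265, speechiness = 0.29122712,
  tempo = 0.26810203, liveness = 0.09015844, duration_ms = -0.27681766,
  instrumentalness = -0.54557018, acousticness = -0.81252740
)

# The squared correlations add up to the PC1 eigenvalue
sum(pc1_cor^2)

# Contribution of each variable to PC1, in percent
contrib <- pc1_cor^2 / sum(pc1_cor^2) * 100
round(sort(contrib, decreasing = TRUE), 2)
```

Note that the sign disappears: acousticness correlates negatively with PC1 but still contributes heavily to it.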

Conclusion

  1. It is possible to cluster the dataset using either the original variables or the PCA variables; using PCA helps with computation.
  2. Using PCA, we could keep 7 PCs out of 11, reducing the dimensionality by about 36% while losing less than 14% of the information. Again, when the data is really large, like this dataset, PCA makes the computation lighter and faster.
  3. The new data obtained from PCA could be used further for clustering or classification.
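
As a rough sketch of the third point, here is k-means run on PC scores instead of the original variables. It uses base R's prcomp() and the built-in iris data as a stand-in; the 85% variance target and `centers = 3` are assumptions for the toy data, not values from this notebook.

```r
# Stand-in data: scaled numeric columns of a built-in dataset
x <- scale(iris[, 1:4])

# PCA, then keep enough components for an assumed 85% of the variance
pca <- prcomp(x)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc <- which(var_explained >= 0.85)[1]
scores <- pca$x[, seq_len(n_pc), drop = FALSE]

# k-means on the reduced scores (k chosen arbitrarily for this toy data)
set.seed(567)
km <- kmeans(scores, centers = 3, nstart = 10)
table(km$cluster)
```

With the FactoMineR object from this notebook, the analogous scores should be available in spotify_pcax$ind$coord.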