Andina - LBB Unsupervised Learning

Introduction

For this Learning By Building assessment, I use the Spotify dataset from Kaggle (https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db). I will practice K-means clustering to see which variables contribute most to popularity. Since the method only works on numerical variables, I will drop the character variables. A description of each variable can be found here: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/.

Data Wrangling

Import Data

library(dplyr)  # provides %>%, select(), filter(), mutate() and glimpse() used below
songs <- read.csv("SpotifyFeatures.csv")
head(songs)

I take out the character variables that I won’t need, since only numeric variables go into the clustering, and I also take out the duration because we won’t analyze it. The artist and track names are kept for identification only and are excluded again before scaling. Since we are trying to see which variables contribute the most to popularity, and the full dataset is very large, I keep only the observations with a popularity value above 75.

spotify <-  songs %>% select(-c(1,4,8,11,14,17)) %>% filter(popularity > 75) %>% mutate(track_name = as.factor(track_name))
glimpse(spotify)

## Observations: 2,956
## Variables: 12
## $ artist_name      <fct> Jason Derulo, Joji, Joji, Joji, Smash Mouth, Gorilla…
## $ track_name       <fct> "Tip Toe (feat. French Montana)", "Sanctuary", "SLOW…
## $ popularity       <int> 76, 83, 81, 76, 78, 77, 78, 76, 80, 78, 77, 78, 76, …
## $ acousticness     <dbl> 0.023300, 0.422000, 0.544000, 0.619000, 0.031900, 0.…
## $ danceability     <dbl> 0.845, 0.552, 0.515, 0.672, 0.731, 0.818, 0.559, 0.5…
## $ energy           <dbl> 0.709, 0.650, 0.479, 0.588, 0.861, 0.705, 0.345, 0.9…
## $ instrumentalness <dbl> 0.00000000, 0.00027500, 0.00598000, 0.24100000, 0.00…
## $ liveness         <dbl> 0.0940, 0.3720, 0.1910, 0.0992, 0.0829, 0.6130, 0.14…
## $ loudness         <dbl> -4.547, -7.199, -7.458, -9.573, -5.881, -6.679, -13.…
## $ speechiness      <dbl> 0.0714, 0.1280, 0.0261, 0.1330, 0.0323, 0.1770, 0.04…
## $ tempo            <dbl> 98.062, 167.788, 88.964, 169.033, 104.034, 138.559, …
## $ valence          <dbl> 0.620, 0.316, 0.284, 0.204, 0.780, 0.772, 0.458, 0.3…
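
For readers without the tidyverse, the same kind of wrangling can be sketched in base R. The data frame `toy_songs` and all of its values are invented for illustration; it only mimics a few columns of the Spotify file, and columns are dropped by name rather than by position to make the intent easier to read.

```r
# Toy stand-in for the Spotify data (values are invented for illustration)
toy_songs <- data.frame(
  genre       = c("Pop", "Rap", "Jazz", "Pop"),
  track_name  = c("A", "B", "C", "D"),
  popularity  = c(80L, 76L, 40L, 91L),
  energy      = c(0.71, 0.65, 0.30, 0.88),
  duration_ms = c(200000L, 180000L, 320000L, 210000L),
  stringsAsFactors = FALSE
)

# Drop columns by name, then keep only the popular tracks
drop_cols <- c("genre", "duration_ms")
toy_popular <- toy_songs[toy_songs$popularity > 75,
                         !(names(toy_songs) %in% drop_cols)]

# Identify which remaining columns are numeric (the ones kmeans can use)
numeric_cols <- names(toy_popular)[sapply(toy_popular, is.numeric)]
print(toy_popular)
print(numeric_cols)
```

Selecting by name is a design choice of this sketch: positional indices such as `-c(1,4,8,...)` silently select the wrong columns if the file layout ever changes.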

Exploratory Data Analysis

Let’s look at the correlation between each pair of variables to see if any of them are strongly correlated.

library(corrplot)  # provides corrplot()
corrplot(cor(spotify[,-c(1,2)]), type="upper", method="ellipse", tl.cex=0.9)

Every variable has a very weak correlation with popularity (pale colours), whilst energy correlates very strongly with loudness. We can also see that acousticness appears to correlate negatively with energy and loudness, which is plausible considering that acoustic and instrumental tunes are mostly used for relaxing, meditating, or treating anxiety/insomnia. In the next phase, let’s see if K-means clustering can help identify which variables contribute the most to popularity.
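
Reading colours off a correlation plot can be backed up with the numbers themselves. The sketch below uses synthetic data (the variables `energy`, `loudness` and `popularity` are simulated here, not taken from the Spotify file) to show how one row of the correlation matrix can be pulled out and ranked by absolute strength.

```r
set.seed(1)
n <- 200
energy     <- runif(n)
loudness   <- -20 + 15 * energy + rnorm(n, sd = 1)  # built to track energy
popularity <- rnorm(n, mean = 80, sd = 3)           # independent noise
toy <- data.frame(popularity, energy, loudness)

cor_mat <- cor(toy)

# Correlations of every other variable with popularity, strongest first
pop_cor <- cor_mat["popularity", colnames(cor_mat) != "popularity"]
pop_cor <- pop_cor[order(abs(pop_cor), decreasing = TRUE)]

print(round(cor_mat, 2))
print(round(pop_cor, 2))
```

On the real data, the same two lines applied to `cor(spotify[,-c(1,2)])` would give a ranked list of how strongly each audio feature is linearly related to popularity.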

Since the numeric variables are on very different scales, I will scale the data and then check how it is distributed.

spotify_scaled <- scale(spotify[,-c(1,2)])
hist(spotify_scaled)

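The histogram is centered around zero, which is expected after scaling: every variable now has mean 0 and standard deviation 1. Scaling does not make the data normally distributed, but K-means does not require normality. What matters is that the variables are on a comparable scale, because the method relies on Euclidean distance. The data is ready to go.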
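
A small sketch of what `scale()` does and does not change may help here. It uses two simulated columns (one skewed, one on a wide range); the helper `skewness` is a simple moment-based definition written for this example, not a function from any package.

```r
set.seed(42)
x <- cbind(skewed = rexp(1000), wide = rnorm(1000, mean = 120, sd = 30))
x_scaled <- scale(x)

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(x_scaled), 10))
print(apply(x_scaled, 2, sd))

# Simple moment-based skewness; a linear rescaling leaves it unchanged
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew_before <- apply(x, 2, skewness)
skew_after  <- apply(x_scaled, 2, skewness)
print(rbind(skew_before, skew_after))
```

The skewed column stays just as skewed after scaling, so a histogram centered at zero says the scaling worked, not that the data is normal.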

K-means Clustering

Before we run kmeans(), we need to identify the best number of centers with this function created by the Algoritma Team:

wss <- function(data, maxCluster = 9) {
    # Initialize with the total sum of squares, i.e. the within SS for 1 cluster
    SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
    set.seed(100)
    # Fill in the within-cluster sum of squares for 2..maxCluster clusters
    for (i in 2:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    # Elbow plot: look for the point where the curve stops dropping sharply
    plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}
wss(spotify_scaled)

The elbow appears to be at 7 clusters.
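
Because the elbow is judged by eye, it can help to look at the numbers behind the curve as well. The following is a minimal sketch on the built-in `iris` data, not on the Spotify data; the function name `elbow_wss` and the use of `nstart = 10` (several random starts per k, to reduce the chance of a poor local optimum) are choices made for this example rather than part of the original function.

```r
# Total within-cluster sum of squares for k = 1..max_k
elbow_wss <- function(data, max_k = 9, nstart = 10, seed = 100) {
  set.seed(seed)
  sapply(seq_len(max_k), function(k) {
    kmeans(data, centers = k, nstart = nstart)$tot.withinss
  })
}

iris_scaled <- scale(iris[, 1:4])
wss_values <- elbow_wss(iris_scaled, max_k = 6)

# Relative drop in WSS when going from k - 1 to k clusters
rel_drop <- -diff(wss_values) / head(wss_values, -1)
print(round(wss_values, 1))
print(round(rel_drop, 2))
```

A reasonable reading is to pick the k after which the relative drop becomes small and roughly constant; that is still a judgment call, just one supported by numbers.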

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(7777)
spotify_km <- kmeans(spotify_scaled, centers = 7)
spotify_km$size

## [1] 362 274 552 653 355 326 434

spotify_km$centers

##    popularity  acousticness danceability       energy instrumentalness
## 1 -0.05188304 -0.3306675421   -0.5685789  0.513165002       -0.1183880
## 2  2.25055010 -0.0140195965    0.3693307 -0.001958426       -0.1058578
## 3 -0.37918348  0.0006463842    0.6752738 -0.613980047       -0.1382529
## 4 -0.25981267 -0.3608185765    0.2479961  0.926461596        0.1243509
## 5 -0.12803900 -0.1892412412    0.7086820 -0.282159507       -0.1262373
## 6 -0.11991274  1.8658437366   -0.7229199 -1.489820794        0.4330825
## 7 -0.30957768 -0.4200084595   -1.0275895  0.310038354       -0.0677296
##      liveness   loudness speechiness       tempo    valence
## 1 -0.19274313  0.3934715 -0.03654098  1.48809131 -0.1029454
## 2  0.07684566  0.2405183 -0.14303552 -0.23180628 -0.1640598
## 3 -0.33183059 -0.3838081 -0.15544867 -0.27032192 -0.2539017
## 4  0.28575607  0.7093968 -0.37386369 -0.30722720  1.0250644
## 5  0.38768484 -0.2405291  2.02491248  0.30557063  0.1891629
## 6 -0.16709484 -1.4897491 -0.35110683 -0.01738674 -0.7070293
## 7 -0.08724915  0.2565283 -0.51157347 -0.52568333 -0.6535852

These are the insights:
- The size of each cluster (the number of songs in it) is: cluster 1 362, cluster 2 274, cluster 3 552, and so on.
- The centers show the mean value of each (scaled) variable within each cluster. Here we can see that the highest mean popularity is in cluster 2.
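
Centers computed on scaled data are in standard-deviation units, which makes them hard to read as actual feature values. The sketch below, run on the built-in `iris` data as a stand-in for the numeric Spotify columns, shows one way to convert them back to the original units using the attributes that `scale()` stores, and cross-checks the result against per-cluster means of the raw data.

```r
raw <- as.matrix(iris[, 1:4])   # stand-in for the numeric Spotify columns
scaled <- scale(raw)

set.seed(7777)
km <- kmeans(scaled, centers = 3, nstart = 10)

# Undo the scaling: multiply by each column's sd, then add back its mean
centers_raw <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), `*`)
centers_raw <- sweep(centers_raw, 2, attr(scaled, "scaled:center"), `+`)

# Cross-check against per-cluster means computed directly on the raw data
means_raw <- aggregate(raw, by = list(cluster = km$cluster), FUN = mean)
print(round(centers_raw, 2))
print(means_raw)
```

Since scaling is a linear transformation applied per column, the ranking of clusters on any single variable is the same in scaled and original units.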

Now let’s look at the clusters in a plot from factoextra.

library(factoextra)  # provides fviz_cluster()
fviz_cluster(spotify_km, data=spotify_scaled)

The plot shows which cluster each individual (identified by its row name) belongs to.
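
With more than two variables, a cluster plot like this has to be drawn on a two-dimensional projection, and as far as I understand `fviz_cluster()` uses the first two principal components for that. A rough base R equivalent, sketched on the built-in `iris` data rather than the Spotify data, looks like this:

```r
scaled <- scale(iris[, 1:4])
set.seed(7777)
km <- kmeans(scaled, centers = 3, nstart = 10)

# Project the scaled data onto its principal components
pca <- prcomp(scaled)
scores <- pca$x[, 1:2]
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scatter plot of the first two components, coloured by cluster
plot(scores, col = km$cluster, pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]),
     main = "K-means clusters on the first two principal components")

# Mark the cluster centers in the same projected space
centers_pc <- predict(pca, km$centers)[, 1:2]
points(centers_pc, pch = 4, cex = 2, lwd = 2)
```

The percentages on the axes matter: if the first two components explain only a small share of the variance, clusters that overlap in the plot may still be well separated in the full feature space.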

Principal Component Analysis

To see which variables contribute to popularity, I will use PCA() and then plot the result with plot.PCA().

library(FactoMineR)  # provides PCA() and plot.PCA()
spotify_pca <- PCA(spotify_scaled, 
                graph = FALSE, 
                ncp = 5)

plot.PCA(spotify_pca, 
         choix = c("ind"),
         habillage = 1,
         select = "contrib7",
         invisible = "quali")

plot.PCA(spotify_pca, choix = c("var"))

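From the two PCA plots (individuals and variables), we can see that the popularity arrow points in a similar direction to the danceability and speechiness arrows, which suggests that these variables tend to move together on the first two components.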
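
Judging arrow directions by eye can be made more concrete by computing the variable coordinates directly. This is a minimal sketch using base R's `prcomp()` on a few columns of the built-in `mtcars` data (a stand-in, not the Spotify data); the helper `cosine` is defined here for the example.

```r
scaled <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])
pca <- prcomp(scaled)

# Coordinates of each variable on the correlation circle:
# loading times the component's standard deviation = cor(variable, component)
var_coord <- sweep(pca$rotation[, 1:2], 2, pca$sdev[1:2], `*`)

# Cosine of the angle between two variable arrows in the PC1-PC2 plane
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_hp_wt  <- cosine(var_coord["hp", ], var_coord["wt", ])
cos_hp_mpg <- cosine(var_coord["hp", ], var_coord["mpg", ])

# Arrow length shows how well the plane represents each variable (max 1)
arrow_length <- sqrt(rowSums(var_coord^2))
print(round(var_coord, 2))
print(round(c(hp_wt = cos_hp_wt, hp_mpg = cos_hp_mpg), 2))
print(round(arrow_length, 2))
```

A cosine near 1 means two arrows point the same way and a cosine near -1 means they point in opposite directions, but the comparison is only trustworthy for variables with long arrows. A short arrow means the variable is poorly represented on these two components, so its direction there says little.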

Now, let’s see if the songs in cluster 2 have the highest mean popularity, danceability, and speechiness among all clusters.

# include the cluster column into Spotify data
spotify$cluster <- spotify_km$cluster

# subset data to variables we need
spoti_cluster <- spotify %>% group_by(cluster) %>% select(popularity, danceability, speechiness) %>% summarise_all("mean")

## Adding missing grouping variables: `cluster`

spoti_cluster

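The mean popularity in cluster 2 is 89.63504, the highest among all clusters. The same does not hold for the other two variables, though: in the scaled centers printed earlier, cluster 5 has the highest danceability (0.71) and speechiness (2.02), while cluster 2 sits at 0.37 and -0.14.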

Now, let’s see which songs are in cluster 2, the cluster with the highest popularity!

spotify %>% filter(cluster == "2") %>% arrange(desc(popularity))

I guess we have Ariana Grande as the WINNER!
hashtag QWEEN

Summary

By combining K-means clustering with Principal Component Analysis, the most popular songs can be narrowed down to a single cluster (cluster 2), and the variables that point in a similar direction to popularity on the PCA plot can be identified: danceability and speechiness. This suggests that songs with groovy tunes and rap music may be among the most popular ones at the moment. But I guess, in a party or when we’re alone, we all love to dance. Don’t you think?