For this Learning By Building assessment, I use the Spotify dataset from Kaggle (https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db). I will practice the K-means clustering method to see which variables contribute the most to popularity. Since the method only works with numerical variables, I will drop the character variables. The description of each variable can be read here: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/.
Since the focus is on popularity, I will keep only the observations whose popularity value is above 75, considering that the dataset is also very large.
spotify <- songs %>%
  select(-c(1, 4, 8, 11, 14, 17)) %>%
  filter(popularity > 75) %>%
  mutate(track_name = as.factor(track_name))
glimpse(spotify)
## Observations: 2,956
## Variables: 12
## $ artist_name <fct> Jason Derulo, Joji, Joji, Joji, Smash Mouth, Gorilla…
## $ track_name <fct> "Tip Toe (feat. French Montana)", "Sanctuary", "SLOW…
## $ popularity <int> 76, 83, 81, 76, 78, 77, 78, 76, 80, 78, 77, 78, 76, …
## $ acousticness <dbl> 0.023300, 0.422000, 0.544000, 0.619000, 0.031900, 0.…
## $ danceability <dbl> 0.845, 0.552, 0.515, 0.672, 0.731, 0.818, 0.559, 0.5…
## $ energy <dbl> 0.709, 0.650, 0.479, 0.588, 0.861, 0.705, 0.345, 0.9…
## $ instrumentalness <dbl> 0.00000000, 0.00027500, 0.00598000, 0.24100000, 0.00…
## $ liveness <dbl> 0.0940, 0.3720, 0.1910, 0.0992, 0.0829, 0.6130, 0.14…
## $ loudness <dbl> -4.547, -7.199, -7.458, -9.573, -5.881, -6.679, -13.…
## $ speechiness <dbl> 0.0714, 0.1280, 0.0261, 0.1330, 0.0323, 0.1770, 0.04…
## $ tempo <dbl> 98.062, 167.788, 88.964, 169.033, 104.034, 138.559, …
## $ valence <dbl> 0.620, 0.316, 0.284, 0.204, 0.780, 0.772, 0.458, 0.3…
Each variable has a very low correlation with popularity (the pale colours), whilst energy correlates very strongly with loudness. We can also see that acousticness has a very low correlation with energy and loudness, considering that instrumental songs/tunes are mostly used for relaxing, meditating, or treating anxiety/insomnia. In the next phase, let’s see whether K-means clustering can help identify which variables contribute the most to popularity.
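The correlation reading above can also be checked numerically rather than by colour alone. Here is a minimal sketch with base R's cor(), run on simulated stand-in data (the `demo` data frame and its values are made up for illustration, they are not the Spotify data):

```r
# Simulated stand-in for a few numeric columns (hypothetical values)
set.seed(100)
n <- 200
energy <- runif(n)
demo <- data.frame(
  energy = energy,
  loudness = -12 + 8 * energy + rnorm(n, sd = 0.5),  # built to track energy
  popularity = sample(76:100, n, replace = TRUE)     # unrelated by construction
)

# Pairwise Pearson correlations, rounded for easier reading
cor_mat <- round(cor(demo), 2)
print(cor_mat)

# Correlation of every other variable with popularity
cor_pop <- cor_mat["popularity", colnames(cor_mat) != "popularity"]
print(cor_pop)
```

On the real data, the same two calls on the numeric columns would give the numbers behind the pale and strong colours.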
It turns out the scaled data is centered around zero, which means the variables are on a comparable scale and ready to go.
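To make the scaling step concrete, here is a small sketch of what base R's scale() does to each column. The `demo` data frame is simulated and only stands in for the real numeric columns; I am assuming the spotify_scaled object used later was produced in a similar way.

```r
# Simulated columns on very different scales (hypothetical values)
set.seed(100)
demo <- data.frame(
  tempo = rnorm(100, mean = 120, sd = 30),
  loudness = rnorm(100, mean = -7, sd = 3)
)

# scale() subtracts each column's mean and divides by its standard deviation
demo_scaled <- scale(demo)

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(demo_scaled)
col_sds <- apply(demo_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

This matters for K-means because the distance between songs would otherwise be dominated by the variables with the largest raw values, such as tempo.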
Before running kmeans(), we need to identify the best number of centers with this function created by the Algoritma Team:
wss <- function(data, maxCluster = 9) {
  # Within sum of squares for one cluster is the total sum of squares
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
  set.seed(100)
  # Within sum of squares for 2 to maxCluster clusters
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch = 19)
}
wss(spotify_scaled)
It seems the elbow occurs at 7 clusters.
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
## [1] 362 274 552 653 355 326 434
## popularity acousticness danceability energy instrumentalness
## 1 -0.05188304 -0.3306675421 -0.5685789 0.513165002 -0.1183880
## 2 2.25055010 -0.0140195965 0.3693307 -0.001958426 -0.1058578
## 3 -0.37918348 0.0006463842 0.6752738 -0.613980047 -0.1382529
## 4 -0.25981267 -0.3608185765 0.2479961 0.926461596 0.1243509
## 5 -0.12803900 -0.1892412412 0.7086820 -0.282159507 -0.1262373
## 6 -0.11991274 1.8658437366 -0.7229199 -1.489820794 0.4330825
## 7 -0.30957768 -0.4200084595 -1.0275895 0.310038354 -0.0677296
## liveness loudness speechiness tempo valence
## 1 -0.19274313 0.3934715 -0.03654098 1.48809131 -0.1029454
## 2 0.07684566 0.2405183 -0.14303552 -0.23180628 -0.1640598
## 3 -0.33183059 -0.3838081 -0.15544867 -0.27032192 -0.2539017
## 4 0.28575607 0.7093968 -0.37386369 -0.30722720 1.0250644
## 5 0.38768484 -0.2405291 2.02491248 0.30557063 0.1891629
## 6 -0.16709484 -1.4897491 -0.35110683 -0.01738674 -0.7070293
## 7 -0.08724915 0.2565283 -0.51157347 -0.52568333 -0.6535852
These are the insights:
- The size of each cluster (the number of songs in it) is: cluster 1 has 362, cluster 2 has 274, cluster 3 has 552, and so on.
- The centers show the mean value of each variable in each cluster. Here we can see that the highest mean popularity is in cluster 2.
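For readers who want to reproduce this kind of reading, here is a hedged sketch of fitting K-means and pulling out the sizes, the centers, and the cluster with the highest center on one variable. It runs on simulated data; `demo_km`, the three groups, and `nstart = 25` are my own illustrative choices, not the settings behind the output above.

```r
# Three simulated, well separated groups in two variables (hypothetical data)
set.seed(100)
demo <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 5, 0.3), rnorm(50, 5, 0.3)),
  cbind(rnorm(50, 10, 0.3), rnorm(50, 0, 0.3))
)
colnames(demo) <- c("popularity", "danceability")
demo_scaled <- scale(demo)

# Fit K-means; nstart repeats the random initialisation and keeps the best fit
demo_km <- kmeans(demo_scaled, centers = 3, nstart = 25)

# Number of observations per cluster
print(demo_km$size)

# Mean of each (scaled) variable per cluster
print(demo_km$centers)

# Cluster whose center is highest on popularity
top_cluster <- which.max(demo_km$centers[, "popularity"])
print(top_cluster)
```

Because the data is scaled, a center of 2.25 on popularity means the cluster average sits 2.25 standard deviations above the overall mean, which is why cluster 2 stands out in the output above.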
With factoextra, the plot shows which cluster each individual (row name) belongs to.
To look at popularity, I will use PCA() and then plot the result with plot.PCA():
spotify_pca <- PCA(spotify_scaled,
                   graph = FALSE,
                   ncp = 5)
plot.PCA(spotify_pca,
         choix = "ind",
         habillage = 1,
         select = "contrib7",
         invisible = "quali")
From the two PCA plots of individuals and variables, we can see that the direction of popularity is similar to that of danceability and speechiness.
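The idea of two variables "pointing in a similar direction" can be sketched numerically. The example below swaps in base R's prcomp() instead of FactoMineR's PCA(), so it is not the code behind the plots above, and it uses simulated data in which popularity is built to move with danceability. The helper `cos_angle` is a hypothetical name of my own.

```r
# Simulated data: popularity is constructed to follow danceability
set.seed(100)
n <- 200
danceability <- rnorm(n)
demo <- data.frame(
  danceability = danceability,
  popularity = danceability + rnorm(n, sd = 0.3),
  acousticness = rnorm(n)  # independent of the other two by construction
)

# PCA on centered and scaled variables
demo_pca <- prcomp(demo, center = TRUE, scale. = TRUE)

# Loadings of each variable on the first two components
loadings <- demo_pca$rotation[, 1:2]
print(round(loadings, 2))

# Cosine of the angle between two variables in the PC1-PC2 plane:
# close to 1 means a similar direction, close to 0 means unrelated directions
cos_angle <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_pop_dance <- cos_angle(loadings["popularity", ], loadings["danceability", ])
print(cos_pop_dance)
```

A cosine near 1 is the numeric counterpart of two arrows lying almost on top of each other in a variables plot.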
Now, let’s see if songs in cluster 2 have the highest scores of popularity, danceability, and speechiness among all clusters.
# include the cluster column in the Spotify data
spotify$cluster <- spotify_km$cluster
# subset the data to the variables we need
spoti_cluster <- spotify %>%
  group_by(cluster) %>%
  select(popularity, danceability, speechiness) %>%
  summarise_all("mean")
## Adding missing grouping variables: `cluster`
The mean popularity in cluster 2 is 89.63504, the highest among all clusters.
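The per-cluster means can also be sketched with base R alone. This is an alternative to the dplyr pipeline, shown here on a tiny made-up table (the `demo` values are invented so the arithmetic is easy to follow by hand):

```r
# Tiny made-up table with a cluster label per song (hypothetical values)
demo <- data.frame(
  cluster = c(1, 1, 2, 2, 3, 3),
  popularity = c(76, 78, 90, 88, 80, 82),
  danceability = c(0.5, 0.7, 0.8, 0.6, 0.4, 0.6)
)

# Mean of each variable within each cluster
cluster_means <- aggregate(cbind(popularity, danceability) ~ cluster,
                           data = demo, FUN = mean)
print(cluster_means)

# Cluster with the highest mean popularity
best <- cluster_means$cluster[which.max(cluster_means$popularity)]
print(best)
```

Naming the grouping column explicitly in the formula also avoids the "Adding missing grouping variables" message that dplyr prints when select() leaves out the grouping column.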
Now let’s find the song with the highest popularity!
I guess we have Ariana Grande as the WINNER!
hashtag QWEEN
With K-means clustering combined with Principal Component Analysis, songs can be classified as Most Popular, identified through dimension and variance and also narrowed down into a cluster. The variables whose dimension is similar to that of popularity are danceability and speechiness. Songs with groovy tunes and rap music can be the most popular ones at the moment. But I guess, in a party or when we’re alone, we all love to dance. Don’t you think?