Hi, welcome to my Learning by Building (LBB) for Unsupervised Learning. In this project, I will try clustering Spotify tracks and their popularity. The data set was downloaded from kaggle.com.
Here are the libraries I will use in this LBB
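The package list itself did not survive in this write-up. Judging only from the functions called later (pivot_longer(), corrplot(), rescale(), geom_text_repel(), PCA(), plot_ly() and the dplyr verbs), a plausible set is sketched below. Treat the list as an assumption, not as the original chunk.

```r
# Assumed package list, inferred from the functions used later in this document.
pkgs <- c("dplyr", "tidyr", "stringr", "ggplot2", "corrplot",
          "scales", "ggrepel", "FactoMineR", "plotly")

# Check which packages are installed before attaching them.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach only the packages that are available.
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```

Checking with requireNamespace() first means the chunk reports missing packages instead of stopping at the first failed library() call.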
Let’s just load the data for this project
There are 18 columns in the dataset, which are
genre
: Track genre
artist_name
: Artist name
track_name
: Track title
track_id
: Track ID
popularity
: Popularity rate (0-100) of the track
acousticness
: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability
: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
duration_ms
: Track’s duration in milliseconds.
energy
: A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
instrumentalness
: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
key
: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation.
liveness
: Detects the presence of an audience in the recording.
loudness
: The loudness of a track in decibels (dB)
mode
: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived
speechiness
: Speechiness detects the presence of spoken words in a track.
tempo
: An estimated tempo of a track in beats per minute (BPM).
time_signature
: Time signature of a track
valence
: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
I will use my own custom theme for the plots
white_theme <- theme(
panel.background = element_rect(fill="white"),
plot.background = element_rect(fill="white"),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
text = element_text(color="black"),
axis.text = element_text(color="black"),
strip.background =element_rect(fill="snow3"),
strip.text = element_text(colour = 'black')
)
Before we continue, let’s check whether any genre is duplicated
## [1] "Movie" "R&B" "A Capella"
## [4] "Alternative" "Country" "Dance"
## [7] "Electronic" "Anime" "Folk"
## [10] "Blues" "Opera" "Hip-Hop"
## [13] "Children's Music" "Childrenâ\200\231s Music" "Rap"
## [16] "Indie" "Classical" "Pop"
## [19] "Reggae" "Reggaeton" "Jazz"
## [22] "Rock" "Ska" "Comedy"
## [25] "Soul" "Soundtrack" "World"
In genre, we have Children's Music and Childrenâ\200\231s Music, which are the same genre, but one of them contains mis-encoded characters in place of the apostrophe. Let’s take those characters out.
tracks <- tracks %>%
mutate(genre = as.factor(str_replace_all(genre, "â\200\231", "")))
unique(tracks$genre)
## [1] Movie R&B A Capella Alternative
## [5] Country Dance Electronic Anime
## [9] Folk Blues Opera Hip-Hop
## [13] Children's Music Childrens Music Rap Indie
## [17] Classical Pop Reggae Reggaeton
## [21] Jazz Rock Ska Comedy
## [25] Soul Soundtrack World
## 27 Levels: A Capella Alternative Anime Blues ... World
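The mis-encoded characters are gone, but there are still two levels, Children's Music and Childrens Music. They will be merged in the next step, when the apostrophe is removed as well.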
I will drop the columns artist_name, track_name and track_id since we won’t use them, convert genre, key and mode to factors using as.factor(), and remove the remaining apostrophe in genre using str_replace_all(). But before that, I will remove the duplicated tracks.
tracks <- tracks[!duplicated(tracks$track_id),]
tracks <- tracks %>%
select(-c(artist_name, track_name, track_id)) %>%
mutate_if(is.integer, as.numeric) %>%
mutate(genre = as.factor(str_replace_all(genre, "'", "")),
key = as.factor(key),
mode = as.factor(mode))
Let’s check the data types
## 'data.frame': 176774 obs. of 15 variables:
## $ genre : Factor w/ 26 levels "A Capella","Alternative",..: 15 15 15 15 15 15 15 15 15 15 ...
## $ popularity : num 0 1 3 0 4 0 2 15 0 10 ...
## $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
## $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
## $ duration_ms : num 99373 137373 170267 152427 82625 ...
## $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
## $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
## $ key : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
## $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
## $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
## $ mode : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
## $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
## $ tempo : num 167 174 99.5 171.8 140.6 ...
## $ time_signature : chr "4/4" "4/4" "5/4" "4/4" ...
## $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
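Now we have proper data types, except for time_signature, which is still a character column. It is left out of the numeric steps that follow.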
Let’s check if there are any missing values in our data
## genre popularity acousticness danceability
## 0 0 0 0
## duration_ms energy instrumentalness key
## 0 0 0 0
## liveness loudness mode speechiness
## 0 0 0 0
## tempo time_signature valence
## 0 0 0
There are no missing values in our data
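The call that produced the counts above is not shown. The output has the shape of a per-column NA count, so a minimal sketch, assuming colSums(is.na(...)) was used, looks like this on a made-up toy data frame:

```r
# Toy stand-in for the tracks data frame (values are made up).
toy <- data.frame(
  popularity = c(10, NA, 35),
  energy     = c(0.9, 0.4, NA),
  genre      = c("Pop", "Rock", "Jazz")
)

# Count missing values per column; all zeros would mean nothing is missing.
na_counts <- colSums(is.na(toy))
print(na_counts)
```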
Now, let’s check the correlation between our variables
tracks %>%
select(-c(genre, key, mode, time_signature)) %>%
cor() %>%
corrplot(type = "upper", method = "ellipse", tl.cex = 0.8)
The variables most correlated with popularity are danceability, energy, loudness, tempo and valence.
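Reading the strongest correlations off a plot can be double-checked numerically. The sketch below uses synthetic data (the column names and the 0.6 coefficient are made up for illustration) and ranks features by the absolute value of their correlation with popularity:

```r
set.seed(1)
n <- 200

# Synthetic stand-in: popularity depends on loudness but not on tempo.
toy <- data.frame(loudness = rnorm(n), tempo = rnorm(n))
toy$popularity <- 0.6 * toy$loudness + rnorm(n)

# Take the popularity column of the correlation matrix and drop the
# self-correlation entry.
cors <- cor(toy)[, "popularity"]
cors <- cors[names(cors) != "popularity"]

# Rank features by absolute correlation, strongest first.
ranked <- sort(abs(cors), decreasing = TRUE)
print(ranked)
```

On the real data, the same three lines applied to the numeric columns of tracks would give an explicit ranking to compare against the plot.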
Now, let’s check the correlations above with a density plot
feature <- tracks %>%
select(-c(key, mode, time_signature))
feature <- names(feature)[3:12]
tracks %>%
select(-c(key, mode, time_signature)) %>%
select(popularity, all_of(feature)) %>%
pivot_longer(cols = all_of(feature)) %>%
ggplot(aes(x = value)) +
geom_density(aes(color = popularity), alpha = 0.5) +
facet_wrap(~name, ncol = 2, scales = "free") +
labs(title = "Audio Feature Correlation with Popularity",
x = NULL, y = "density") +
theme(axis.text.y = element_blank(),
plot.title = element_text(hjust = 0.5)) +
white_theme
Based on the plot above, it looks like danceability, energy, loudness, tempo and valence have the strongest effect on popularity, which is consistent with the correlation plot above.
tracks %>%
select(-c(key, mode, time_signature, popularity)) %>%
select(genre, all_of(feature)) %>%
pivot_longer(cols = all_of(feature)) %>%
ggplot(aes(x = value)) +
geom_density(aes(color = genre), alpha = 0.5) +
facet_wrap(~name, ncol = 2, scales = "free") +
labs(title = "Spotify Audio Feature Density - by Genre",
x = NULL, y = "density") +
theme(axis.text.y = element_blank(),
plot.title = element_text(hjust = 0.5)) +
white_theme
Based on the density plot, the features that best distinguish genre are danceability, energy, loudness and tempo.
Next step, we will find out the distribution of popularity
Before we scale our data, let’s see which columns need to be scaled
## genre popularity acousticness danceability
## Comedy : 9674 Min. : 0.00 Min. :0.0000 Min. :0.0569
## Electronic : 9149 1st Qu.: 25.00 1st Qu.:0.0456 1st Qu.:0.4150
## Alternative: 9095 Median : 37.00 Median :0.2880 Median :0.5580
## Anime : 8935 Mean : 36.27 Mean :0.4041 Mean :0.5411
## Classical : 8711 3rd Qu.: 49.00 3rd Qu.:0.7910 3rd Qu.:0.6830
## Reggae : 8687 Max. :100.00 Max. :0.9960 Max. :0.9890
## (Other) :122523
## duration_ms energy instrumentalness key
## Min. : 15387 Min. :0.0000203 Min. :0.0000000 C :20970
## 1st Qu.: 178253 1st Qu.:0.3440000 1st Qu.:0.0000000 G :20476
## Median : 219453 Median :0.5920000 Median :0.0000704 D :18643
## Mean : 236127 Mean :0.5570245 Mean :0.1720729 A :17499
## 3rd Qu.: 268547 3rd Qu.:0.7890000 3rd Qu.:0.0908000 C# :16856
## Max. :5552917 Max. :0.9990000 Max. :0.9990000 F :15605
## (Other):66725
## liveness loudness mode speechiness
## Min. :0.00967 Min. :-52.457 Major:116619 Min. :0.0222
## 1st Qu.:0.09750 1st Qu.:-12.851 Minor: 60155 1st Qu.:0.0368
## Median :0.13000 Median : -8.191 Median :0.0494
## Mean :0.22453 Mean :-10.138 Mean :0.1274
## 3rd Qu.:0.27700 3rd Qu.: -5.631 3rd Qu.:0.1020
## Max. :1.00000 Max. : 3.744 Max. :0.9670
##
## tempo time_signature valence
## Min. : 30.38 Length:176774 Min. :0.0000
## 1st Qu.: 92.01 Class :character 1st Qu.:0.2220
## Median :115.01 Mode :character Median :0.4400
## Mean :117.20 Mean :0.4516
## 3rd Qu.:138.80 3rd Qu.:0.6670
## Max. :242.90 Max. :1.0000
##
The numeric columns are on very different scales (duration_ms runs into the millions, while most audio features lie between 0 and 1), so we will scale all numeric columns.
Time to scale our data.
The data is now scaled, so let’s continue
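The scaling chunk is not shown, although the later code uses an object called tracks_cleaned. A minimal sketch of standardising the numeric columns with base R scale() is given below. The toy values are the first few duration_ms and energy entries from the str() output above, and the assumption is that tracks_cleaned was built in a similar way.

```r
# Toy stand-in with one factor column and two numeric columns.
toy <- data.frame(
  genre       = factor(c("Pop", "Rock", "Jazz", "Pop")),
  duration_ms = c(99373, 137373, 170267, 152427),
  energy      = c(0.91, 0.737, 0.131, 0.326)
)

# Keep only numeric columns, then standardise each to mean 0 and sd 1.
num_cols   <- vapply(toy, is.numeric, logical(1))
toy_scaled <- scale(toy[, num_cols])

print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```

After scaling, duration_ms no longer dominates the Euclidean distances that k-means relies on.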
First, let’s find the optimum k for our clustering
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
withinall <- NULL
total_k <- NULL
for (i in 2:maxK) {
set.seed(101)
temp <- kmeans(data,i)$tot.withinss
withinall <- append(withinall, temp)
total_k <- append(total_k,i)
}
plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}
kmeansTunning(tracks_cleaned, maxK = 5)
Based on the elbow in the plot, the optimum k we get is 3
Now, time to fit the k-means model with k = 3
## Length Class Mode
## cluster 176774 -none- numeric
## centers 33 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
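The summary above is consistent with a k-means fit with 3 centers on 11 scaled features (centers has length 33 = 3 x 11). The fitting call is not shown, so the sketch below is an assumption about its shape, run on synthetic data. The nstart argument is an addition that is not taken from the original.

```r
set.seed(101)

# Toy stand-in for the scaled data: three well-separated groups, 2 features.
toy_scaled <- rbind(
  matrix(rnorm(40, mean = -5), ncol = 2),
  matrix(rnorm(40, mean =  0), ncol = 2),
  matrix(rnorm(40, mean =  5), ncol = 2)
)

# Fit k-means with 3 clusters; nstart = 10 tries several random starts.
km_toy <- kmeans(toy_scaled, centers = 3, nstart = 10)

print(km_toy$size)         # number of points per cluster
print(dim(km_toy$centers)) # clusters x features
```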
# Rescale every numeric column to 0-100 so the cluster means are comparable,
# then attach the cluster label
clust <- tracks %>%
  select_if(is.numeric) %>%
  mutate_all(~ rescale(.x, to = c(0, 100))) %>%
  mutate(cluster = as.factor(km$cluster))
clust %>%
group_by(cluster) %>%
summarise_all(mean) %>%
pivot_longer(cols = -1) %>%
ggplot(aes(x = name, y = value)) +
geom_col(aes(fill = name)) +
facet_wrap(~cluster)+
labs(x = NULL, y = NULL, title = "Cluster's Characteristic")+
theme(axis.text.x = element_text(angle = 50, hjust = 1),
plot.title = element_text(hjust = 0.5)) +
white_theme
From the plot above, we can infer that:
- Cluster 1: is strongest in danceability, energy, loudness and valence
- Cluster 2: is strongest in acousticness, danceability, energy, liveness, loudness and speechiness
- Cluster 3: is strongest in acousticness and loudness
Time to plot our cluster
From the plot, the most popular songs are in Cluster 2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9120 1.3358 1.0808 0.99152 0.92525 0.85280 0.80714
## Proportion of Variance 0.3323 0.1622 0.1062 0.08937 0.07783 0.06612 0.05923
## Cumulative Proportion 0.3323 0.4946 0.6008 0.69014 0.76797 0.83408 0.89331
## PC8 PC9 PC10 PC11
## Standard deviation 0.67140 0.58850 0.51758 0.32960
## Proportion of Variance 0.04098 0.03148 0.02435 0.00988
## Cumulative Proportion 0.93429 0.96577 0.99012 1.00000
In this project, I want to keep at least 80% of the information in the data, so I will use PC1 to PC6, which together explain 83.4% of the variance.
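The number of components can be checked directly from the standard deviations printed in the summary above, since each component's share of variance is its squared standard deviation divided by the total. This is a small worked check, not the original chunk:

```r
# Standard deviations copied from the PCA summary above.
sdev <- c(1.9120, 1.3358, 1.0808, 0.99152, 0.92525, 0.85280,
          0.80714, 0.67140, 0.58850, 0.51758, 0.32960)

# Each component's variance is its squared standard deviation.
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative share reaches 80%.
n_pc <- which(cum_var >= 0.80)[1]
print(n_pc)
print(round(cum_var[n_pc], 4))
```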
pca_tracks <- data.frame(pca_spotify$x, cluster = factor(km$cluster), genre = tracks$genre)
pca_data <- data.frame(varnames = rownames(pca_spotify$rotation),
pca_spotify$rotation)
x <- "PC1"
y <- "PC2"
data <- data.frame(obsnames=seq(nrow(pca_spotify$x)), pca_spotify$x)
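# Ratio of the score range to the loading range on each axis,
# so the loading arrows fit inside the point cloud
mult <- min(
  (max(data[,y]) - min(data[,y])) / (max(pca_data[,y]) - min(pca_data[,y])),
  (max(data[,x]) - min(data[,x])) / (max(pca_data[,x]) - min(pca_data[,x])))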
pca_data <- transform(pca_data,
v1 = .9 * mult * (get(x)),
v2 = .9 * mult * (get(y)))
ggplot(pca_tracks, aes(x = PC1, y = PC2)) +
geom_hline(aes(yintercept = 0), size = 0.2) +
geom_vline(aes(xintercept = 0), size = 0.2) +
coord_equal() +
geom_point(aes(color = cluster),size = 0.2) +
geom_segment(data = pca_data, aes( x =0, y = 0, xend = v1, yend = v2), arrow = arrow(length = unit(0.2, "cm"))) +
geom_text_repel(data = pca_data, aes(label = str_to_title(varnames)), point.padding = -10, segment.size = 0.5) +
scale_color_brewer(palette = "Pastel2") +
guides(colour = guide_legend(override.aes = list(size = 3))) +
labs(title = "3 Clusters with PCA and Factor Loading", color = "Cluster", x = NULL, y = NULL) +
theme(plot.title = element_text(hjust = 0.5)) +
white_theme
From the biplot above, we can see that the variables point in 4 common directions:
- acousticness, duration_ms and instrumentalness are negatively correlated with energy, valence, danceability and loudness
- liveness and speechiness are negatively correlated with tempo and popularity
This gives us the main groups of characteristics, which help explain the Cluster’s Characteristic plot:
- energy, danceability, valence, loudness, tempo and popularity
- acousticness, duration_ms and instrumentalness
- liveness and speechiness
Here is our 3-dimensional cluster plot
pca_tracks <- PCA(X = tracks_cleaned, scale.unit = F, graph = F)
df_pca <- data.frame(pca_tracks$ind$coord[, 1:3]) %>% bind_cols(cluster = as.factor(km$cluster)) %>%
select(cluster, 1:3)
plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("red",
"blue", "green")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"),
yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))
- We can classify the tracks into 3 clusters, based on the optimum k found with the elbow method.
- To keep at least 80% of the information we need 6 PCs, so less than 20% of the information is lost.
- Based on the Cluster’s Characteristic plot and the cluster plot, the most popular songs fall into Cluster 1, since Cluster 1 has strong characteristics in danceability, energy, loudness and valence, and its popularity value is the highest of the three clusters.