This report is created for learn by building assignment in Algoritma Academy for unsupervised learning course. Data used is spotify songlist dataset prepared by Zaheen Hamidani [https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db] that I got from Kaggle. In this report I am going to extract some pattern in song attributes that spotify has for each song in their database.
spotify <- read.delim("spotify.csv",sep = "\t",header = T)
glimpse(spotify)
## Observations: 228,159
## Variables: 18
## $ genre <fct> Pop, Dance, Pop, Pop, Rap, Dance, Pop, Hip-Hop,…
## $ artist_name <fct> Ariana Grande, Ariana Grande, Ariana Grande, Po…
## $ track_name <fct> "7 rings", "7 rings", "break up with your girlf…
## $ track_id <fct> 14msK75pk3pA33pzPVNtBF, 14msK75pk3pA33pzPVNtBF,…
## $ popularity <int> 100, 100, 99, 99, 99, 99, 98, 98, 98, 97, 97, 9…
## $ acousticness <dbl> 0.5780, 0.5780, 0.0421, 0.1630, 0.1630, 0.0421,…
## $ danceability <dbl> 0.725, 0.725, 0.726, 0.833, 0.833, 0.726, 0.737…
## $ duration_ms <int> 178640, 178640, 190440, 149520, 149520, 190440,…
## $ energy <dbl> 0.321, 0.321, 0.554, 0.539, 0.539, 0.554, 0.860…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 2.10e-06, 2.10e-0…
## $ key <fct> C#, C#, F, B, B, F, G#, G#, G#, D, F#, G#, F, C…
## $ liveness <dbl> 0.0884, 0.0884, 0.1060, 0.1010, 0.1010, 0.1060,…
## $ loudness <dbl> -10.744, -10.744, -5.290, -7.399, -7.399, -5.29…
## $ mode <fct> Minor, Minor, Minor, Minor, Minor, Minor, Minor…
## $ speechiness <dbl> 0.3230, 0.3230, 0.0917, 0.1780, 0.1780, 0.0917,…
## $ tempo <dbl> 70.142, 70.142, 169.999, 99.947, 99.947, 169.99…
## $ time_signature <fct> 4-Apr, 4-Apr, 4-Apr, 4-Apr, 4-Apr, 4-Apr, 4-Apr…
## $ valence <dbl> 0.319, 0.319, 0.335, 0.385, 0.385, 0.335, 0.656…
Each row indicates 1 song and column contain attributes for each song.
- genre Genre of the song.
- artist_name Name of the artist.
- track_name Title of the song.
- track_id Track id on spotify.
- popularity Measure the popularity from 0 to 100 based on play number of the track.
- acousticness Measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- danceability Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- duration_ms The duration of the track in milliseconds.
- energy Measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- instrumentalness Measure whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- key estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- liveness Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- loudness overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
- mode Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- speechiness Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- tempo Overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- time_signature Estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
- valence Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). The distribution of values for this feature look like this: Valence distribution.
As a music researcher, I am performing clustering and principal component analysis to answer some questions:
1. Can I extract a pattern to do clustering from the available variables in the data?
2. How each cluster differ?
# Remove NA's
spot <- na.omit(spotify)
# Filter unique song
spot <- spot[!duplicated(spot$track_id),]
# Adjust some row data type
spot <- spot %>%
mutate(time_signature = as.numeric(time_signature),
popularity = as.numeric(popularity),
duration_ms = as.numeric(duration_ms))
Checking Genre Distribution
# spot %>%
# group_by(genre) %>%
# summarise(avg_pop = mean(popularity),
# percent = round(n()/nrow(spot),3)) %>%
# arrange(desc(percent))
Checking Popularity Distribution
hist(spot$popularity)
hist(spot$popularity,breaks = 4)
From exploring the distribution on popularity, I want to make new variables that divide popularity into 4 groups for further analysis on clustering
spot <- spot %>%
mutate(pop_gr = as.factor(case_when(((popularity > 0) & (popularity < 20)) ~ "1",
((popularity >= 20) & (popularity < 40))~ "2",
((popularity >= 40) & (popularity < 60)) ~ "3",
TRUE ~ "4")))
table(spot$pop_gr)
##
## 1 2 3 4
## 18830 55812 59920 19123
colnames(spot)
## [1] "genre" "artist_name" "track_name"
## [4] "track_id" "popularity" "acousticness"
## [7] "danceability" "duration_ms" "energy"
## [10] "instrumentalness" "key" "liveness"
## [13] "loudness" "mode" "speechiness"
## [16] "tempo" "time_signature" "valence"
## [19] "pop_gr"
The variables that I choose in this analysis
for Cluster and PCA:
- Acousticness
- Loudness
- Danceability
- Energy
- Liveness
- Valence
Some other variables to help the further analysis:
- Popularity
- Track Name
- Artist Name
- Genre
- Popularity Group(pop_gr)
Excluded Variables:
- Track Id : only represent the unique for database needs.
- Speechiness : from the explanation of the variable above, the value of this variables must be 0.33 for music and non-speech tracks. since we want to analyze music and non speech track, so this variable is not giving much informations.
- Key, Mode, Time Signature : since data is not numeric and scale is not ordinal, it is better not to include this in clustering.
- Duration and Tempo : not included in this analysis.
spot1 <- spot %>%
select(acousticness,loudness,danceability,energy,liveness,valence,popularity,track_name,artist_name,genre,pop_gr)
glimpse(spot1)
## Observations: 153,685
## Variables: 11
## $ acousticness <dbl> 0.57800, 0.04210, 0.16300, 0.11000, 0.55600, 0.2970…
## $ loudness <dbl> -10.744, -5.290, -7.399, -2.652, -5.574, -7.050, -7…
## $ danceability <dbl> 0.725, 0.726, 0.833, 0.737, 0.760, 0.752, 0.741, 0.…
## $ energy <dbl> 0.321, 0.554, 0.539, 0.860, 0.479, 0.488, 0.520, 0.…
## $ liveness <dbl> 0.0884, 0.1060, 0.1010, 0.0574, 0.0703, 0.0936, 0.2…
## $ valence <dbl> 0.319, 0.335, 0.385, 0.656, 0.913, 0.533, 0.347, 0.…
## $ popularity <dbl> 100, 99, 99, 98, 97, 97, 97, 97, 97, 97, 96, 96, 96…
## $ track_name <fct> "7 rings", "break up with your girlfriend, i'm bore…
## $ artist_name <fct> Ariana Grande, Ariana Grande, Post Malone, Daddy Ya…
## $ genre <fct> Pop, Pop, Pop, Pop, Pop, Pop, Pop, Pop, Pop, Pop, P…
## $ pop_gr <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
Before performing cluster and PCA, the variables need to be scale.
spot1z <- as.data.frame(scale(spot1[,c(1:7)]))
summary(spot1z)
## acousticness loudness danceability energy
## Min. :-1.1196 Min. :-6.4254 Min. :-2.5014 Min. :-1.9980
## 1st Qu.:-0.9829 1st Qu.:-0.4231 1st Qu.:-0.6674 1st Qu.:-0.7723
## Median :-0.3071 Median : 0.3212 Median : 0.1043 Median : 0.1379
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.0625 3rd Qu.: 0.7057 3rd Qu.: 0.7569 3rd Qu.: 0.8306
## Max. : 1.5869 Max. : 1.8164 Max. : 2.3158 Max. : 1.6248
## liveness valence popularity
## Min. :-1.0117 Min. :-1.63783 Min. :-2.24363
## 1st Qu.:-0.6031 1st Qu.:-0.86701 1st Qu.:-0.66739
## Median :-0.4532 Median :-0.06627 Median : 0.03316
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.2429 3rd Qu.: 0.80183 3rd Qu.: 0.73371
## Max. : 3.5844 Max. : 2.10398 Max. : 3.59429
Using elbow method to determine how many cluster by looking at the homogenity for each number of cluster used.
wss <- function(data, maxCluster = 9) {
SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
SSw <- vector()
for (i in 2:maxCluster) {
SSw[i] <- sum(kmeans(data, centers = i)$withinss)
}
plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}
wss(spot1z)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 7684250)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 7684250)
By looking at the plot, we can’t see the “Elbow” clearly. But I think it happens when 3 cluster applied. So I choose 3 to be number of clusters for this analysis. After that, I perform the K-mean clustering.
set.seed(212)
km1 <- kmeans(spot1z,3,nstart = 20)
km_cl <- km1$cluster
km_ct <- data.frame(km1$centers,clust = rownames(km1$centers))
Perform the PCA with 6 variables that has been scale before.
pc1 <- prcomp(spot1z[,1:7],center = T)
summary(pc1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.8492 1.1089 0.9500 0.80721 0.61668 0.53368
## Proportion of Variance 0.4885 0.1757 0.1289 0.09308 0.05433 0.04069
## Cumulative Proportion 0.4885 0.6641 0.7931 0.88616 0.94048 0.98117
## PC7
## Standard deviation 0.36303
## Proportion of Variance 0.01883
## Cumulative Proportion 1.00000
Combining the cluster result with PCA and Initial Data
spot1z <- data.frame(spot1z,spot1[,-c(1:7)])
spot1z$clust <- as.factor(km_cl)
pr1 <- data.frame(pc1$x, clust = factor(km_cl), pop_gr = spot1$pop_gr, genre = spot$genre)
datapc <- data.frame(varnames = rownames(pc1$rotation), pc1$rotation)
x <- "PC1"
y <- "PC2"
data <- data.frame(obsnames=seq(nrow(pc1$x)), pc1$x)
mult <- min(
(max(data[,y]) - min(data[,y])/(max(datapc[,y])-min(datapc[,y]))),
(max(data[,x]) - min(data[,x])/(max(datapc[,x])-min(datapc[,x]))))
datapc <- transform(datapc,
v1 = .9 * mult * (get(x)),
v2 = .9 * mult * (get(y)))
ggplot(pr1, aes(x=PC1, y=PC2)) +
geom_hline(aes(yintercept=0), size=.2) +
geom_vline(aes(xintercept=0), size=.2) +
coord_equal() +
geom_point(aes(color = clust),size = 0.2) +
geom_segment(data = datapc, aes(x=0, y=0, xend=v1, yend=v2), arrow = arrow(length=unit(0.2, "cm"))) +
geom_text_repel(data = datapc, aes(label=str_to_title(varnames)),point.padding = -10,segment.size = 0.5) +
theme_dark()+
scale_color_brewer(palette = "Pastel1")+
guides(colour = guide_legend(override.aes = list(size=3)))+
labs(title = "3 Clusters with PCA and Factor Loading",color = "Cluster")
From the Bi-plot above we can see that it have a 4 common directions.
1. Acousticness : Have a negative correlation with Valence, Energy, Loudness, Danceability and little of populations.
2. Liveness : Have a negative correlation with Popularity.
3. Popularity : Have a negative correlation with Liveness.
4. Valence, Energy, Loudness, Danceability: Have a negative correlation with Acousticness.
Also the 3 clusters main characteristic.
1. Cluster 1 : Popularity, Valence, Energy, Loudness, Danceability.
2. Cluster 2 : Acousticness.
3. Cluster 3 : Liveness.
We will continue to analyze the clusters characteristic with another plot.
Before visualizing the attribute, I am scaling the data with min max to convert data range to 0-1
normalize <- function(x){
return (
(x - min(x))/(max(x) - min(x))
)
}
spot1gg <- normalize(spot1z[,c(1:7)])
spot1gg <- cbind(spot1gg,spot1z[,-c(1:7)])
Make a plot that shows clusters attributes
spot1gg %>%
group_by(clust) %>%
summarise(Acoustic = mean(acousticness),
Loud = mean(loudness),
Dance = mean(danceability),
Energy = mean(energy),
Liveness = mean(liveness),
Valence = mean(valence)) %>%
select(Loud,Acoustic,Dance,Energy,Liveness,Valence,clust) %>%
gather("Name","Value",-clust) %>%
ggplot(aes(y=Value,x = clust,col=Name,group = Name)) +
geom_point()+
geom_line()+
facet_wrap(~Name,scales = "free_y")+
scale_color_brewer(palette = "Pastel1")+
theme_dark()+
labs(x="Cluster",col = "Attributes",title = "Attributes for each cluster")
The plot above shows that cluster 3 have some Valence, Energy, Loudness, Danceability and Acoustic. It make sense because the live music performance can happen in acoustic format or full-band/electronic format that result in more Valence, Energy, Loudness, and Danceability.
Cluster Popularity Attribute
spot1gg %>%
group_by(clust) %>%
summarise(Popularity = mean(popularity)) %>%
select(Popularity,clust) %>%
gather("name","value",-clust) %>%
ggplot(aes(y=value,x = clust,group = name))+
geom_point()+
geom_line()+
theme_dark()+
labs(y = "Popularity Value",x="Cluster",title = "Average Popularity for each cluster")
From the plot above we can see that cluster 1 is most popular and 3 is least popular
Dive deeper to the clustering, I try to take sample and listen the song for each cluster. First I am selecting the song that have nearest distance to each cluster’s centroid.
Nearest distance to centroid songlist
spot1dist <- cbind(spot1gg,index = 1:nrow(spot1gg))
dist <- list()
for (i in 1:3){
d <- data.frame(spot1dist[spot1dist$clust == i,])
d <- data.frame(d %>%
mutate(dist = sqrt((d[,1] - km_ct[i,1])^2 +
(d[,2] - km_ct[i,2])^2 +
(d[,3] - km_ct[i,3])^2 +
(d[,4] - km_ct[i,4])^2 +
(d[,5] - km_ct[i,5])^2 +
(d[,6] - km_ct[i,6])^2 +
(d[,7] - km_ct[i,7])^2)) %>%
arrange(dist) %>%
head(1))
dist[[i]] <- d
}
distance <- do.call(rbind, dist)
distance %>% select(artist_name,track_name,clust,popularity)
## artist_name
## 1 Maranatha! Music
## 2 Hector Berlioz
## 3 Shakuhachi Sakano
## track_name
## 1 Oceans (Where Feet May Fail)
## 2 La Damnation de Faust, Op. 24, H. 111: Pt. II, Scène VII - Ballet des sylphes
## 3 The Shores of the Moon
## clust popularity
## 1 1 0.4697933
## 2 2 0.4173553
## 3 3 0.4406611
Nearest to centroid songs attributes
distance %>%
gather("name","value",1:6) %>%
mutate(label = as.character(paste(clust,artist_name,"-",track_name)),
text = round(value,2)) %>%
arrange(artist_name) %>%
ggplot(aes(x=name,y=value,fill = str_to_title(name)))+
geom_col(position = position_stack(),aes(fill = str_to_title(name)))+
geom_text(aes(label=text),position = position_stack(vjust = .5),size=2)+
facet_wrap(~label)+
theme_dark()+
scale_fill_brewer(palette = "Pastel1")+
theme(axis.text.x = element_text(size = 0),strip.text = element_text(size = 5))+
labs(x=NULL,y =NULL,fill = "Attributes",title = "Song closest to each cluster centroid")
After listening the track. This is what I identified from the track:
Cluster 1: Gospel Full-band. Have more Valence, Energy, Loudness, and Danceability.
Cluster 2: Classical Orchestra. Have more Acousticnes.
Cluster 3: Ambient For Meditation. Have high Liveness, but since the audio of this song is only the sound of the oceans, I think Spotify Algorithm misheard it and thought the voice was the audience’s voice. Since Liveness is measure the sound of audience in the track.
After that, I try to take a sample on most popular song for each centroid.
Most popular songs for each cluster
comb = list()
for (i in 1:3){
x <- data.frame(spot1gg %>% filter(clust == i) %>% arrange(desc(popularity)) %>% head(1) %>% select(-pop_gr))
comb[[i]] <- x
}
combine <- do.call(rbind, comb)
combine %>% select(artist_name,track_name,clust,popularity)
## artist_name track_name clust popularity
## 1 Ariana Grande 7 rings 1 1.0000000
## 2 gnash i hate u, i love u (feat. olivia o'brien) 2 0.9009504
## 3 Lady Gaga Always Remember Us This Way 3 0.9300826
Most popular songs attributes for each cluster
combine %>%
gather("name","value",1:6) %>%
mutate(label = as.character(paste(clust,artist_name,"-",str_to_title(track_name))),
text = round(value,2)) %>%
arrange(artist_name) %>%
ggplot(aes(x=name,y=value,fill = str_to_title(name)))+
geom_col(position = position_stack(),aes(fill = str_to_title(name)))+
geom_text(aes(label=text),position = position_stack(vjust = .5),size=2)+
facet_wrap(~label)+
theme_dark()+
scale_fill_brewer(palette = "Pastel1")+
theme(axis.text.x = element_text(size = 0),strip.text = element_text(size = 5))+
labs(x=NULL,y =NULL,fill = "Attributes",title = "Most Popular song for each cluster")
After listening the track. This is what I identified from the track:
Cluster 1 : Electronic Hip-hop. Also have more Valence, Energy, Loudness, and Danceability.
Cluster 2 : Acoustic Pop, Rap. Also have more Acousticness.
Cluster 3 : Full Band, Live, Pop. Also have more Liveness.
Cluster 1 is the most popular and Cluster 3 is least popular.
Popular (most listened) music tend to more have Valence(happy/cheerful), Energy, Loudness, Danceability and tend to less have the Liveness and Acousticness attributes.
Acoustic music, tend to less have Valence(happy/cheerful), Energy, Loudness, Danceability.
Live music can be acoustic or loud and tend to have less popularity.
Popularity group distribution in PCA. Group 1 is the most popular and 3 the least popular.
ggplot(pr1,aes(x=PC1,y=PC2))+
geom_jitter(aes(col=pop_gr),size =0.1)+
transition_manual(pop_gr,cumulative = T)+
theme_dark()+
scale_color_brewer(palette = "Pastel1")+
guides(colour = guide_legend(override.aes = list(size=3)))+
geom_segment(data = datapc, aes(x=0, y=0, xend=v1, yend=v2), arrow = arrow(length=unit(0.2, "cm"))) +
geom_text_repel(data = datapc, aes(label=str_to_title(varnames))) +
labs(title = "Popularity Group",col = "Popularity Group")
## nframes and fps adjusted to match transition
Genre Distribution in PCA
ggplot(pr1,aes(x=PC1,y=PC2))+
geom_jitter(aes(col=genre),size =0.3)+
transition_manual(genre,cumulative = T)+
theme_dark()+
guides(colour = guide_legend(override.aes = list(size=3)))+
geom_segment(data = datapc, aes(x=0, y=0, xend=v1, yend=v2), arrow = arrow(length=unit(0.2, "cm"))) +
geom_text_repel(data = datapc, aes(label=str_to_title(varnames))) +
labs(title = "Genre",col = "Genre")
## nframes and fps adjusted to match transition