1 Background

This report is created for learn by building assignment in Algoritma Academy for unsupervised learning course. Data used is spotify songlist dataset prepared by Zaheen Hamidani [https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db] that I got from Kaggle. In this report I am going to extract some pattern in song attributes that spotify has for each song in their database.

1.1 Data Structure

spotify <- read.delim("spotify.csv",sep = "\t",header = T)
glimpse(spotify)
## Observations: 228,159
## Variables: 18
## $ genre            <fct> Pop, Dance, Pop, Pop, Rap, Dance, Pop, Hip-Hop,…
## $ artist_name      <fct> Ariana Grande, Ariana Grande, Ariana Grande, Po…
## $ track_name       <fct> "7 rings", "7 rings", "break up with your girlf…
## $ track_id         <fct> 14msK75pk3pA33pzPVNtBF, 14msK75pk3pA33pzPVNtBF,…
## $ popularity       <int> 100, 100, 99, 99, 99, 99, 98, 98, 98, 97, 97, 9…
## $ acousticness     <dbl> 0.5780, 0.5780, 0.0421, 0.1630, 0.1630, 0.0421,…
## $ danceability     <dbl> 0.725, 0.725, 0.726, 0.833, 0.833, 0.726, 0.737…
## $ duration_ms      <int> 178640, 178640, 190440, 149520, 149520, 190440,…
## $ energy           <dbl> 0.321, 0.321, 0.554, 0.539, 0.539, 0.554, 0.860…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 2.10e-06, 2.10e-0…
## $ key              <fct> C#, C#, F, B, B, F, G#, G#, G#, D, F#, G#, F, C…
## $ liveness         <dbl> 0.0884, 0.0884, 0.1060, 0.1010, 0.1010, 0.1060,…
## $ loudness         <dbl> -10.744, -10.744, -5.290, -7.399, -7.399, -5.29…
## $ mode             <fct> Minor, Minor, Minor, Minor, Minor, Minor, Minor…
## $ speechiness      <dbl> 0.3230, 0.3230, 0.0917, 0.1780, 0.1780, 0.0917,…
## $ tempo            <dbl> 70.142, 70.142, 169.999, 99.947, 99.947, 169.99…
## $ time_signature   <fct> 4-Apr, 4-Apr, 4-Apr, 4-Apr, 4-Apr, 4-Apr, 4-Apr…
## $ valence          <dbl> 0.319, 0.319, 0.335, 0.385, 0.385, 0.335, 0.656…

Each row indicates 1 song and column contain attributes for each song.
- genre Genre of the song.
- artist_name Name of the artist.
- track_name Title of the song.
- track_id Track id on spotify.
- popularity Measure the popularity from 0 to 100 based on play number of the track.
- acousticness Measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- danceability Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- duration_ms The duration of the track in milliseconds.
- energy Measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- instrumentalness Measure whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- key estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- liveness Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- loudness overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
- mode Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- speechiness Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- tempo Overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- time_signature Estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
- valence Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). The distribution of values for this feature look like this: Valence distribution.

1.2 Objectives

As a music researcher, I am performing clustering and principal component analysis to answer some questions:
1. Can I extract a pattern to do clustering from the available variables in the data?
2. How each cluster differ?

2 Data Pre-processing

2.1 Data Cleaning

# Remove NA's
spot <- na.omit(spotify)

# Filter unique song
spot <- spot[!duplicated(spot$track_id),]

# Adjust some row data type
spot <- spot %>% 
  mutate(time_signature = as.numeric(time_signature),
         popularity = as.numeric(popularity),
         duration_ms = as.numeric(duration_ms))

2.2 Data Exploratory

Checking Genre Distribution

# spot %>% 
#   group_by(genre) %>% 
#   summarise(avg_pop = mean(popularity),
#             percent = round(n()/nrow(spot),3)) %>% 
#   arrange(desc(percent))

Checking Popularity Distribution

hist(spot$popularity)

hist(spot$popularity,breaks = 4)

2.3 Data Manipulation

From exploring the distribution on popularity, I want to make new variables that divide popularity into 4 groups for further analysis on clustering

spot <- spot %>% 
  mutate(pop_gr = as.factor(case_when(((popularity > 0) & (popularity < 20)) ~ "1",
                         ((popularity >= 20) & (popularity < 40))~ "2",
                         ((popularity >= 40) & (popularity < 60)) ~ "3",
                         TRUE ~ "4")))

table(spot$pop_gr)
## 
##     1     2     3     4 
## 18830 55812 59920 19123

2.4 Data Wrangling

colnames(spot)
##  [1] "genre"            "artist_name"      "track_name"      
##  [4] "track_id"         "popularity"       "acousticness"    
##  [7] "danceability"     "duration_ms"      "energy"          
## [10] "instrumentalness" "key"              "liveness"        
## [13] "loudness"         "mode"             "speechiness"     
## [16] "tempo"            "time_signature"   "valence"         
## [19] "pop_gr"

The variables that I choose in this analysis

for Cluster and PCA:
- Acousticness
- Loudness
- Danceability
- Energy
- Liveness
- Valence

Some other variables to help the further analysis:
- Popularity
- Track Name
- Artist Name
- Genre
- Popularity Group(pop_gr)

Excluded Variables:
- Track Id : only represent the unique for database needs.
- Speechiness : from the explanation of the variable above, the value of this variables must be 0.33 for music and non-speech tracks. since we want to analyze music and non speech track, so this variable is not giving much informations.
- Key, Mode, Time Signature : since data is not numeric and scale is not ordinal, it is better not to include this in clustering.
- Duration and Tempo : not included in this analysis.

spot1 <- spot %>% 
  select(acousticness,loudness,danceability,energy,liveness,valence,popularity,track_name,artist_name,genre,pop_gr)

glimpse(spot1)
## Observations: 153,685
## Variables: 11
## $ acousticness <dbl> 0.57800, 0.04210, 0.16300, 0.11000, 0.55600, 0.2970…
## $ loudness     <dbl> -10.744, -5.290, -7.399, -2.652, -5.574, -7.050, -7…
## $ danceability <dbl> 0.725, 0.726, 0.833, 0.737, 0.760, 0.752, 0.741, 0.…
## $ energy       <dbl> 0.321, 0.554, 0.539, 0.860, 0.479, 0.488, 0.520, 0.…
## $ liveness     <dbl> 0.0884, 0.1060, 0.1010, 0.0574, 0.0703, 0.0936, 0.2…
## $ valence      <dbl> 0.319, 0.335, 0.385, 0.656, 0.913, 0.533, 0.347, 0.…
## $ popularity   <dbl> 100, 99, 99, 98, 97, 97, 97, 97, 97, 97, 96, 96, 96…
## $ track_name   <fct> "7 rings", "break up with your girlfriend, i'm bore…
## $ artist_name  <fct> Ariana Grande, Ariana Grande, Post Malone, Daddy Ya…
## $ genre        <fct> Pop, Pop, Pop, Pop, Pop, Pop, Pop, Pop, Pop, Pop, P…
## $ pop_gr       <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …

3 Modeling

3.1 Data Clustering

Before performing cluster and PCA, the variables need to be scale.

spot1z <- as.data.frame(scale(spot1[,c(1:7)]))

summary(spot1z)
##   acousticness        loudness        danceability         energy       
##  Min.   :-1.1196   Min.   :-6.4254   Min.   :-2.5014   Min.   :-1.9980  
##  1st Qu.:-0.9829   1st Qu.:-0.4231   1st Qu.:-0.6674   1st Qu.:-0.7723  
##  Median :-0.3071   Median : 0.3212   Median : 0.1043   Median : 0.1379  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.0625   3rd Qu.: 0.7057   3rd Qu.: 0.7569   3rd Qu.: 0.8306  
##  Max.   : 1.5869   Max.   : 1.8164   Max.   : 2.3158   Max.   : 1.6248  
##     liveness          valence           popularity      
##  Min.   :-1.0117   Min.   :-1.63783   Min.   :-2.24363  
##  1st Qu.:-0.6031   1st Qu.:-0.86701   1st Qu.:-0.66739  
##  Median :-0.4532   Median :-0.06627   Median : 0.03316  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.2429   3rd Qu.: 0.80183   3rd Qu.: 0.73371  
##  Max.   : 3.5844   Max.   : 2.10398   Max.   : 3.59429

Using elbow method to determine how many cluster by looking at the homogenity for each number of cluster used.

wss <- function(data, maxCluster = 9) {
    SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
    SSw <- vector()
    for (i in 2:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}

wss(spot1z)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 7684250)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 7684250)

By looking at the plot, we can’t see the “Elbow” clearly. But I think it happens when 3 cluster applied. So I choose 3 to be number of clusters for this analysis. After that, I perform the K-mean clustering.

set.seed(212)
km1 <- kmeans(spot1z,3,nstart = 20)

km_cl <- km1$cluster
km_ct <- data.frame(km1$centers,clust = rownames(km1$centers))

3.2 PCA

Perform the PCA with 6 variables that has been scale before.

pc1 <- prcomp(spot1z[,1:7],center = T)
summary(pc1)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.8492 1.1089 0.9500 0.80721 0.61668 0.53368
## Proportion of Variance 0.4885 0.1757 0.1289 0.09308 0.05433 0.04069
## Cumulative Proportion  0.4885 0.6641 0.7931 0.88616 0.94048 0.98117
##                            PC7
## Standard deviation     0.36303
## Proportion of Variance 0.01883
## Cumulative Proportion  1.00000

Combining the cluster result with PCA and Initial Data

spot1z <- data.frame(spot1z,spot1[,-c(1:7)])
spot1z$clust <- as.factor(km_cl)

pr1 <- data.frame(pc1$x, clust = factor(km_cl), pop_gr = spot1$pop_gr, genre = spot$genre)

4 Result

4.1 Biplot

datapc <- data.frame(varnames = rownames(pc1$rotation), pc1$rotation)
x <- "PC1"
y <- "PC2"

data <- data.frame(obsnames=seq(nrow(pc1$x)), pc1$x)

mult <- min(
  (max(data[,y]) - min(data[,y])/(max(datapc[,y])-min(datapc[,y]))),
  (max(data[,x]) - min(data[,x])/(max(datapc[,x])-min(datapc[,x]))))
datapc <- transform(datapc,
                    v1 = .9 * mult * (get(x)),
                    v2 = .9 * mult * (get(y)))

ggplot(pr1, aes(x=PC1, y=PC2)) +
  geom_hline(aes(yintercept=0), size=.2) + 
  geom_vline(aes(xintercept=0), size=.2) +
  coord_equal() +
  geom_point(aes(color = clust),size = 0.2) +
  geom_segment(data = datapc, aes(x=0, y=0, xend=v1, yend=v2), arrow = arrow(length=unit(0.2, "cm"))) +
  geom_text_repel(data = datapc, aes(label=str_to_title(varnames)),point.padding = -10,segment.size = 0.5) +
  theme_dark()+
  scale_color_brewer(palette = "Pastel1")+
  guides(colour = guide_legend(override.aes = list(size=3)))+
  labs(title = "3 Clusters with PCA and Factor Loading",color = "Cluster")

From the Bi-plot above we can see that it have a 4 common directions.
1. Acousticness : Have a negative correlation with Valence, Energy, Loudness, Danceability and little of populations.
2. Liveness : Have a negative correlation with Popularity.
3. Popularity : Have a negative correlation with Liveness.
4. Valence, Energy, Loudness, Danceability: Have a negative correlation with Acousticness.

Also the 3 clusters main characteristic.
1. Cluster 1 : Popularity, Valence, Energy, Loudness, Danceability.
2. Cluster 2 : Acousticness.
3. Cluster 3 : Liveness.

We will continue to analyze the clusters characteristic with another plot.

4.2 Cluster Attributes

Before visualizing the attribute, I am scaling the data with min max to convert data range to 0-1

normalize <- function(x){
  return ( 
    (x - min(x))/(max(x) - min(x)) 
           )
}

spot1gg <- normalize(spot1z[,c(1:7)])
spot1gg <- cbind(spot1gg,spot1z[,-c(1:7)])

Make a plot that shows clusters attributes

spot1gg %>% 
  group_by(clust) %>% 
  summarise(Acoustic = mean(acousticness),
            Loud = mean(loudness),
            Dance = mean(danceability),
            Energy = mean(energy),
            Liveness = mean(liveness),
            Valence = mean(valence)) %>% 
  select(Loud,Acoustic,Dance,Energy,Liveness,Valence,clust) %>% 
  gather("Name","Value",-clust) %>% 
  ggplot(aes(y=Value,x = clust,col=Name,group = Name)) +
  geom_point()+
  geom_line()+
  facet_wrap(~Name,scales = "free_y")+
  scale_color_brewer(palette = "Pastel1")+
  theme_dark()+
  labs(x="Cluster",col = "Attributes",title = "Attributes for each cluster")

The plot above shows that cluster 3 have some Valence, Energy, Loudness, Danceability and Acoustic. It make sense because the live music performance can happen in acoustic format or full-band/electronic format that result in more Valence, Energy, Loudness, and Danceability.

Cluster Popularity Attribute

spot1gg %>% 
  group_by(clust) %>% 
  summarise(Popularity = mean(popularity)) %>% 
  select(Popularity,clust) %>% 
  gather("name","value",-clust) %>% 
  ggplot(aes(y=value,x = clust,group = name))+
  geom_point()+
  geom_line()+
  theme_dark()+
  labs(y = "Popularity Value",x="Cluster",title = "Average Popularity for each cluster")

From the plot above we can see that cluster 1 is most popular and 3 is least popular

4.3 Cluster Song

Dive deeper to the clustering, I try to take sample and listen the song for each cluster. First I am selecting the song that have nearest distance to each cluster’s centroid.

Nearest distance to centroid songlist

spot1dist <- cbind(spot1gg,index = 1:nrow(spot1gg))
dist <- list()

for (i in 1:3){
  d <- data.frame(spot1dist[spot1dist$clust == i,])
  d <- data.frame(d %>% 
    mutate(dist = sqrt((d[,1] - km_ct[i,1])^2 +
                       (d[,2] - km_ct[i,2])^2 +
                       (d[,3] - km_ct[i,3])^2 +
                       (d[,4] - km_ct[i,4])^2 +
                       (d[,5] - km_ct[i,5])^2 +
                       (d[,6] - km_ct[i,6])^2 +
                       (d[,7] - km_ct[i,7])^2)) %>% 
    arrange(dist) %>% 
    head(1))
 dist[[i]] <- d
}

distance <- do.call(rbind, dist)

distance %>% select(artist_name,track_name,clust,popularity)
##         artist_name
## 1  Maranatha! Music
## 2    Hector Berlioz
## 3 Shakuhachi Sakano
##                                                                      track_name
## 1                                                  Oceans (Where Feet May Fail)
## 2 La Damnation de Faust, Op. 24, H. 111: Pt. II, Scène VII - Ballet des sylphes
## 3                                                        The Shores of the Moon
##   clust popularity
## 1     1  0.4697933
## 2     2  0.4173553
## 3     3  0.4406611

Nearest to centroid songs attributes

distance %>% 
  gather("name","value",1:6) %>% 
  mutate(label = as.character(paste(clust,artist_name,"-",track_name)),
          text = round(value,2)) %>%
  arrange(artist_name) %>% 
  ggplot(aes(x=name,y=value,fill =  str_to_title(name)))+
  geom_col(position = position_stack(),aes(fill =  str_to_title(name)))+
  geom_text(aes(label=text),position = position_stack(vjust = .5),size=2)+
  facet_wrap(~label)+
  theme_dark()+
  scale_fill_brewer(palette = "Pastel1")+
  theme(axis.text.x = element_text(size = 0),strip.text = element_text(size = 5))+
  labs(x=NULL,y =NULL,fill = "Attributes",title = "Song closest to each cluster centroid")

After listening the track. This is what I identified from the track:
Cluster 1: Gospel Full-band. Have more Valence, Energy, Loudness, and Danceability.
Cluster 2: Classical Orchestra. Have more Acousticnes.
Cluster 3: Ambient For Meditation. Have high Liveness, but since the audio of this song is only the sound of the oceans, I think Spotify Algorithm misheard it and thought the voice was the audience’s voice. Since Liveness is measure the sound of audience in the track.

After that, I try to take a sample on most popular song for each centroid.

Most popular songs for each cluster

comb = list()

for (i in 1:3){
x <- data.frame(spot1gg %>% filter(clust == i) %>% arrange(desc(popularity)) %>% head(1) %>% select(-pop_gr))
comb[[i]] <- x
}

combine <- do.call(rbind, comb)

combine %>% select(artist_name,track_name,clust,popularity)
##     artist_name                                track_name clust popularity
## 1 Ariana Grande                                   7 rings     1  1.0000000
## 2         gnash i hate u, i love u (feat. olivia o'brien)     2  0.9009504
## 3     Lady Gaga               Always Remember Us This Way     3  0.9300826

Most popular songs attributes for each cluster

combine %>% 
  gather("name","value",1:6) %>% 
  mutate(label = as.character(paste(clust,artist_name,"-",str_to_title(track_name))),
          text = round(value,2)) %>%
  arrange(artist_name) %>% 
  ggplot(aes(x=name,y=value,fill = str_to_title(name)))+
  geom_col(position = position_stack(),aes(fill =  str_to_title(name)))+
  geom_text(aes(label=text),position = position_stack(vjust = .5),size=2)+
  facet_wrap(~label)+
  theme_dark()+
  scale_fill_brewer(palette = "Pastel1")+
  theme(axis.text.x = element_text(size = 0),strip.text = element_text(size = 5))+
  labs(x=NULL,y =NULL,fill = "Attributes",title = "Most Popular song for each cluster")

After listening the track. This is what I identified from the track:
Cluster 1 : Electronic Hip-hop. Also have more Valence, Energy, Loudness, and Danceability.
Cluster 2 : Acoustic Pop, Rap. Also have more Acousticness.
Cluster 3 : Full Band, Live, Pop. Also have more Liveness.

4.4 Summary

  • Factor Loading:
  1. Acousticness : Have a negative correlation with Valence, Energy, Loudness, Danceability and little of populations.
  2. Liveness : Have a negative correlation with Popularity.
  3. Popularity : Have a negative correlation with Liveness.
  4. Valence, Energy, Loudness, Danceability: Have a negative correlation with Acousticness.
  • Clusters Main Characteristic:
  1. Cluster 1 : Popularity, Valence, Energy, Loudness, Danceability.
  2. Cluster 2 : Acousticness.
  3. Cluster 3 : Liveness.
  • Cluster 1 is the most popular and Cluster 3 is least popular.

  • Popular (most listened) music tend to more have Valence(happy/cheerful), Energy, Loudness, Danceability and tend to less have the Liveness and Acousticness attributes.

  • Acoustic music, tend to less have Valence(happy/cheerful), Energy, Loudness, Danceability.

  • Live music can be acoustic or loud and tend to have less popularity.

5 Extras

5.1 Popularity Distribution

Popularity group distribution in PCA. Group 1 is the most popular and 3 the least popular.

ggplot(pr1,aes(x=PC1,y=PC2))+
  geom_jitter(aes(col=pop_gr),size =0.1)+
  transition_manual(pop_gr,cumulative = T)+
  theme_dark()+
  scale_color_brewer(palette = "Pastel1")+
  guides(colour = guide_legend(override.aes = list(size=3)))+
  geom_segment(data = datapc, aes(x=0, y=0, xend=v1, yend=v2), arrow = arrow(length=unit(0.2, "cm"))) +
  geom_text_repel(data = datapc, aes(label=str_to_title(varnames))) +
  labs(title = "Popularity Group",col = "Popularity Group")
## nframes and fps adjusted to match transition

5.2 Genre Distribution

Genre Distribution in PCA

ggplot(pr1,aes(x=PC1,y=PC2))+
  geom_jitter(aes(col=genre),size =0.3)+
  transition_manual(genre,cumulative = T)+
  theme_dark()+
  guides(colour = guide_legend(override.aes = list(size=3)))+
  geom_segment(data = datapc, aes(x=0, y=0, xend=v1, yend=v2), arrow = arrow(length=unit(0.2, "cm"))) +
  geom_text_repel(data = datapc, aes(label=str_to_title(varnames))) +
  labs(title = "Genre",col = "Genre")
## nframes and fps adjusted to match transition