Nikita Sankhe

UC BANA 19-20

Introduction

Dataset:

This dataset was extracted with the spotifyr package and obtained from the rfordatascience GitHub repository.

Problem Statement:

Spotify, as a music application, does a very good job of recommending music to its users. It suggests music based on the songs and artists you like most often. This particular dataset, built with the spotifyr package, contains track names, artists, genres, subgenres and other audio features.

Objective:

The idea behind the project is to use this dataset to:

  • Build a K-means model to identify the most popular songs in each cluster:
    • Which genres are popular by the clusters - i.e are they pop? are they rap?
    • How varied are these clusters?
    • What are the audio features/attributes of these clusters?
    • Understand how the audio features perform across clusters and thereby on the songs

End Goal:

This analysis aims to provide an understanding of which songs and genres are the most popular.
The idea is to help an end user gain a better understanding of what lies behind the most popular songs on Spotify.

Approach:

  1. Exploratory Data Analysis
  • Visualization techniques that uncover patterns and insights about the audio features and how they relate to each other
  • Statistical testing to understand variable behavior:
    • Chi-square test for categorical variables
    • Correlation plot for numeric variables
  2. K-Means Clustering
  • To understand the popularity of songs, I used K-means clustering to group the songs and examined how far apart the resulting clusters are.
  • The algorithm groups songs with similar characteristics into clusters, and these clusters help in understanding the audio attributes of popular songs.
  3. Insights from Analysis

Data Preparation

Packages

#Dataframe
library(knitr)
library(kableExtra)
library(DT)
#Data Manipulation
library(tidyverse)
library(dplyr)
library(tidyr)
#Data Viz
library(ggplot2)
library(GGally)
library(RColorBrewer)
library(viridis)
library(gridExtra)
#K-Means
library(factoextra)
library(fpc)
  • knitr : Helps display better outputs without any intense coding. The kable function in particular helps in presenting tables and manipulating table styles

  • kableExtra : In addition to the kable function, the kableExtra library provides formatting functions that control width, scrolling and similar styling

  • DT : Helps in presenting tables in a clean format, and has the ability to provide filters

  • GGally : Used to plot the correlations between variables in matrix form

  • tidyverse : Provides a collection of packages including “dplyr”, “tidyr” and “ggplot2”, explained below

    • dplyr provides functions for data manipulation, such as mutate (adds new variables that are functions of existing variables), select, rename, filter and summarise
    • tidyr helps in tidying data with functions such as drop_na and replace_na, and by extracting values from strings, thereby making the data more readable and complete
    • ggplot2 provides elegant visualizations that help to present insights clearly
  • RColorBrewer : Provides multiple color palettes to be used in conjunction with ggplot2 visualizations

  • viridis : Similar to RColorBrewer, helps with color palettes and other cosmetic purposes

  • gridExtra : Helps in arranging multiple plots on a grid

  • factoextra : Usually used to visualize the output of multivariate data analyses; in this project I have used it to plot the clusters of the K-means algorithm

  • fpc : Provides various methods for clustering and cluster validation

Importing Data

Loading Data

spotify <- read.csv("spotify_songs.csv", stringsAsFactors=FALSE)

About the data

## [1] 32833    23

The data set has 32833 rows of observations with 23 variables.

The following information about the variables is provided on the ‘rfordatascience’ website and will help the users to understand the dataset

kable(table_description, caption = "Spotify Dictionary")
Spotify Dictionary
| Variable | Description |
|---|---|
| track_id | Song unique ID |
| track_name | Song Name |
| track_artist | Song Artist |
| track_popularity | Song Popularity (0-100) where higher is better |
| track_album_id | Album unique ID |
| track_album_name | Song album name |
| track_album_release_date | Date when album released |
| playlist_name | Name of playlist |
| playlist_id | Playlist ID |
| playlist_genre | Playlist genre |
| playlist_subgenre | Playlist subgenre |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | Duration of song in milliseconds |

Data Wrangling

Data Cleaning

  • The total number of missing values for each variable in the data set are identified.
  • The following variables each have 5 missing values:

    • track_name
    • track_artist
    • track_album_name
colSums(is.na(spotify))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
  • From the results below it can be seen that all the missing values in the 3 variables belong to the same 5 rows.
  • The row indices are : 8152, 9283, 9284, 19569, 19812.
which(is.na(spotify$track_name))
## [1]  8152  9283  9284 19569 19812
which(is.na(spotify$track_artist))
## [1]  8152  9283  9284 19569 19812
which(is.na(spotify$track_album_name))
## [1]  8152  9283  9284 19569 19812
  • Five rows is a very small number of missing values in a dataset of 32833 rows, so dropping them is not detrimental to the analysis and it's okay to omit them.
spotify <-  spotify[-c(8152,9283,9284,19569,19812), ]
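The hard-coded row indices above only work for this exact file. As a hedged alternative, the same rows could be dropped without naming any index by using complete.cases from base R. The sketch below runs on a small made-up data frame; the values and the names songs, name_cols and songs_clean are assumptions for illustration, not part of the original analysis.

```r
# Toy data frame standing in for the Spotify data (hypothetical values)
songs <- data.frame(
  track_name   = c("A", NA, "C", "D"),
  track_artist = c("x", NA, "z", "w"),
  energy       = c(0.9, 0.5, 0.7, 0.3),
  stringsAsFactors = FALSE
)

# Columns whose missing values should cause a row to be dropped
name_cols <- c("track_name", "track_artist")

# complete.cases() is TRUE for rows with no NA in the given columns
keep_row    <- complete.cases(songs[, name_cols])
songs_clean <- songs[keep_row, ]

nrow(songs_clean)
colSums(is.na(songs_clean))
```

Because the rows are located by their content rather than their position, this approach keeps working if the source file is refreshed and the row order changes.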
  • Certain variables have incorrect data types, and before starting EDA they need to be corrected.
str(spotify)
## 'data.frame':    32828 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
  • From the above summary of the structure of the data, apart from numeric variables, the following variables need to be transformed to factors:
  1. playlist_genre : 6 types of genres, hence better to transform to factors of 6 levels.
unique(spotify$playlist_genre)
## [1] "pop"   "rap"   "rock"  "latin" "r&b"   "edm"
  2. playlist_subgenre : 24 types of subgenres, hence better to transform to factors of 24 levels.
unique(spotify$playlist_subgenre)
##  [1] "dance pop"                 "post-teen pop"            
##  [3] "electropop"                "indie poptimism"          
##  [5] "hip hop"                   "southern hip hop"         
##  [7] "gangster rap"              "trap"                     
##  [9] "album rock"                "classic rock"             
## [11] "permanent wave"            "hard rock"                
## [13] "tropical"                  "latin pop"                
## [15] "reggaeton"                 "latin hip hop"            
## [17] "urban contemporary"        "hip pop"                  
## [19] "new jack swing"            "neo soul"                 
## [21] "electro house"             "big room"                 
## [23] "pop edm"                   "progressive electro house"
  3. key : 12 types of keys, hence better to transform to factors of 12 levels.
unique(spotify$key)
##  [1]  6 11  1  7  8  5  4  2  0 10  9  3
  4. mode : 2 types of mode (0,1), hence better to transform to factors of 2 levels.
unique(spotify$mode)
## [1] 1 0
  • Therefore, we transform the above variables to factors and also fix some other variables
#Changing Data Types
#Columns are referenced directly inside mutate() instead of via spotify$...
spotify <- spotify %>% 
  mutate(
    track_name = as.factor(track_name),
    track_artist = as.factor(track_artist),
    playlist_genre = as.factor(playlist_genre),
    playlist_subgenre = as.factor(playlist_subgenre),
    key = as.factor(key),
    mode = as.factor(mode),
    track_popularity = as.numeric(track_popularity),
    duration_ms = as.numeric(duration_ms)
  )
  • Now, the variables track_id, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_subgenre and duration_ms are not important to the analysis and hence they are dropped, leaving 15 variables.
spotify <- spotify %>% select(2,3,4,10,12:22)
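Selecting by position, as above, silently picks the wrong columns if the column order ever changes. A minimal sketch of dropping the same kind of columns by name with dplyr follows. The toy data frame and the names songs, drop_cols and songs_small are assumptions for illustration only.

```r
library(dplyr)

# Toy data frame with a few of the Spotify columns (hypothetical values)
songs <- data.frame(
  track_id          = c("a1", "b2"),
  track_name        = c("Song A", "Song B"),
  playlist_subgenre = c("dance pop", "trap"),
  energy            = c(0.9, 0.6),
  duration_ms       = c(194754, 162600)
)

# Columns to drop, listed by name rather than by position
drop_cols <- c("track_id", "playlist_subgenre", "duration_ms")

# all_of() raises an error if a listed column does not exist
songs_small <- songs %>% select(-all_of(drop_cols))

names(songs_small)
```

The error raised by all_of for a misspelled or missing column is the main benefit here, since a positional select would not notice the problem.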

Summary of Cleaned Dataset

  • The cleaned dataset has 32828 observations of 15 variables
dim(spotify)
## [1] 32828    15
  • From the summary, it can be seen that the audio features fit the description given in the features table, value wise and range wise as well.

  • For speechiness, acousticness, instrumentalness and liveness, however, the mean and median are further apart than for the other variables, which suggests skewed distributions. Plots in the EDA section look at their behaviour more closely.

kable(summary(spotify)) %>% 
      kable_styling(bootstrap_options = c("striped", "hover"),
                    full_width = F,
                    font_size = 12,
                    position = "left") %>% 
                    scroll_box(width = "100%", 
                               height = "400px")
| track_name | track_artist | track_popularity | playlist_genre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Poison : 22 | Martin Garrix : 161 | Min. : 0.00 | edm :6043 | Min. :0.0000 | Min. :0.000175 | 1 : 4010 | Min. :-46.448 | 0:14256 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000000 | Min. :0.0000 | Min. :0.0000 | Min. : 0.00 |
| Breathe : 21 | Queen : 136 | 1st Qu.: 24.00 | latin:5153 | 1st Qu.:0.5630 | 1st Qu.:0.581000 | 0 : 3454 | 1st Qu.: -8.171 | 1:18572 | 1st Qu.:0.0410 | 1st Qu.:0.0151 | 1st Qu.:0.0000000 | 1st Qu.:0.0927 | 1st Qu.:0.3310 | 1st Qu.: 99.96 |
| Alive : 20 | The Chainsmokers: 123 | Median : 45.00 | pop :5507 | Median :0.6720 | Median :0.721000 | 7 : 3352 | Median : -6.166 | NA | Median :0.0625 | Median :0.0804 | Median :0.0000161 | Median :0.1270 | Median :0.5120 | Median :121.98 |
| Forever : 20 | David Guetta : 110 | Mean : 42.48 | r&b :5431 | Mean :0.6549 | Mean :0.698603 | 9 : 3027 | Mean : -6.720 | NA | Mean :0.1071 | Mean :0.1754 | Mean :0.0847599 | Mean :0.1902 | Mean :0.5106 | Mean :120.88 |
| Paradise: 19 | Don Omar : 102 | 3rd Qu.: 62.00 | rap :5743 | 3rd Qu.:0.7610 | 3rd Qu.:0.840000 | 11 : 2994 | 3rd Qu.: -4.645 | NA | 3rd Qu.:0.1320 | 3rd Qu.:0.2550 | 3rd Qu.:0.0048300 | 3rd Qu.:0.2480 | 3rd Qu.:0.6930 | 3rd Qu.:133.92 |
| Stay : 19 | Drake : 100 | Max. :100.00 | rock :4951 | Max. :0.9830 | Max. :1.000000 | 2 : 2827 | Max. : 1.275 | NA | Max. :0.9180 | Max. :0.9940 | Max. :0.9940000 | Max. :0.9960 | Max. :0.9910 | Max. :239.44 |
| (Other) :32707 | (Other) :32096 | NA | NA | NA | NA | (Other):13164 | NA | NA | NA | NA | NA | NA | NA | NA |

Exploratory Data Analysis

Understanding Attributes

  • Speechiness, acousticness, instrumentalness, liveness are right skewed, with instrumentalness behavior needing more explanation
#Plotting numeric values
spotify %>%
  keep(is.numeric) %>% #hist only for numeric
  gather() %>% #converts to key value
  ggplot(aes(value, fill = key)) + 
  facet_wrap(~ key, scales = "free") +
  geom_histogram(alpha = 0.7, bins = 30) + 
  ggtitle("Distribution of Audio Attributes") + 
  scale_x_continuous(guide = guide_axis(check.overlap = TRUE)) + #values are numeric, so a continuous scale is used
  theme(plot.title = element_text(hjust = 0.5))

  • As the histograms depict, many of the attributes are skewed which is reflected in the boxplots as well.

  • Instrumentalness has most values closer to 0, which is why the boxplot and histogram act this way.

#Boxplot for numeric values
spotify %>%
  keep(is.numeric) %>% #keep only numeric columns
  gather() %>% #converts to key value
  ggplot(aes(value, fill = key)) + 
  facet_wrap(~ key, scales = "free") +
  geom_boxplot(alpha = 0.7) + 
  ggtitle("Boxplots of Attributes") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

Understanding Genres

  • From Plot 1, it can be seen that the genres with the most songs are:
  1. EDM
  2. Rap
  3. Pop
  • To understand genres better, they are also plotted by their average popularity.

  • From Plot 2, it can be seen that the genres with the highest mean popularity are:

  1. Pop
  2. Latin
  3. Rap
  • Hence the cluster analysis pays particular attention to these genres.
#Plotting Genres
p1 <- ggplot(spotify, aes(x=factor(playlist_genre))) +
      geom_bar(width=0.7, 
           aes(fill=playlist_genre), 
           alpha=0.7) + 
      scale_fill_brewer(palette = "Paired") + 
      ggtitle("Plot 1 : Genre Count") + 
      theme(plot.title = element_text(hjust = 0.5)) + 
      xlab("Genre")

avg_popularity <- spotify %>% 
                  select(track_popularity, playlist_genre) %>% 
                  group_by(playlist_genre) %>% 
                  summarise("average_popularity" = round(mean(track_popularity)))

p2 <- ggplot(data=avg_popularity, 
             mapping = aes(x = (playlist_genre), 
                           y = average_popularity, 
                           fill = playlist_genre)) + 
      geom_col(width = 0.7,alpha=0.7) + 
      scale_fill_brewer(palette = "Paired") + 
      ggtitle("Plot 2 : Genres & Popularity") + 
      xlab("Genre") + ylab("Mean Popularity") + 
      theme(plot.title = element_text(hjust = 0.5))

grid.arrange(p1, p2, nrow=2, ncol=1)

Keys & Mode

In music, the key of a track identifies the scale, an ordered series of notes, on which its melody is built; the key signature gives the number of sharps or flats in that scale.

Is the key of a track related to its mode?

Since key and mode are both categorical variables, a chi-square test of independence can be used to check this.

  • Ho : Key is independent of mode
  • Ha : Key is not independent of mode

Applying the chisq.test function to the two variables gives a p-value below 2.2e-16, far smaller than an \(\alpha\) of 0.05.

Hence, the null hypothesis is rejected.

Therefore, key and mode are not independent: the distribution of keys differs between major and minor tracks.

chisq.test(spotify$key, spotify$mode)
## 
##  Pearson's Chi-squared test
## 
## data:  spotify$key and spotify$mode
## X-squared = 3046.2, df = 11, p-value < 2.2e-16
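With almost 33,000 rows, even a weak association yields a tiny p-value, so it can help to look at an effect size as well. The following is a minimal sketch, not part of the original analysis, that computes Cramér's V from the figures reported above. The variable names are assumptions; the numbers are the test statistic and the row count of the cleaned data.

```r
# Values reported in this report
x_squared <- 3046.2   # Pearson chi-squared statistic from chisq.test()
n_obs     <- 32828    # rows in the cleaned data
n_key     <- 12       # levels of key
n_mode    <- 2        # levels of mode

# Cramer's V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(x_squared / (n_obs * (min(n_key, n_mode) - 1)))
round(cramers_v, 2)
```

The value comes out at roughly 0.30 on a scale from 0 (no association) to 1 (perfect association), which by common rules of thumb would be read as a moderate rather than a strong relationship between key and mode.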
  • From the chart below it is seen that songs are in mode 1 (major) more often than in mode 0 (minor).

  • Key 1 (C♯/D♭) is the most frequently occurring key in songs

#Plotting Mode & Keys

g1 <- ggplot(spotify,aes(mode)) + 
      geom_bar(aes(fill=mode),alpha = 0.6) + 
      ggtitle("Modes") +
      theme(plot.title = element_text(hjust = 0.5)) + 
      scale_fill_brewer(palette = "Dark2")
g2 <- ggplot(spotify,aes(key)) + 
      geom_bar(aes(fill=key), alpha = 0.6) + 
      ggtitle("Keys") +
      theme(plot.title = element_text(hjust = 0.5))

grid.arrange(g1,g2,ncol=1)

K-Means Clustering

Data Structuring

Approach for Clustering:

  1. 70% of the data is split as train set and rest 30% as test set.
  2. The data is scaled to make the numerical attributes comparable
  3. Understand behaviour of numerical attributes from the correlation plot
  4. Find the optimal number of centers using elbow method to implement K-Means Clustering
  5. Fit the K Means Clustering Model
  6. Group the clusters and the attributes by their mean
  7. Assess the quality of the clustering
  8. In-depth analysis cluster wise
  9. Interpretation of Model and Results
#Splitting Data into Train & Test
set.seed(13437885) #fix the seed so the random split is reproducible
index <- sample(nrow(spotify), 0.7*nrow(spotify))
train_kmeans <- spotify[index,]
test_kmeans <- spotify[-index,]

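Standardization is an important step in data preprocessing, as it puts attributes measured on different scales (for example tempo in BPM and energy between 0 and 1) on a common footing. The scale function centers each numeric column to mean 0 and rescales it to standard deviation 1; the resulting values are not confined to a fixed range such as -1 to 1. K-Means relies on distances between points, so I have scaled the data before implementing K-Means Clustering.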

#Scaling Data
train_scale <- scale(train_kmeans[,-c(1,2,4,7,9)])
test_scale <- scale(test_kmeans[,-c(1,2,4,7,9)])
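The code above scales the test set with its own means and standard deviations. If the test set were later compared against clusters fitted on the training data, a common alternative is to reuse the training set's center and scale. The sketch below shows this on made-up numeric data; the matrices and names (train_x, test_x and so on) are assumptions for illustration.

```r
set.seed(1)

# Hypothetical numeric features standing in for the audio attributes
train_x <- matrix(rnorm(200, mean = 5, sd = 2), ncol = 2,
                  dimnames = list(NULL, c("energy", "tempo")))
test_x  <- matrix(rnorm(60, mean = 5, sd = 2), ncol = 2,
                  dimnames = list(NULL, c("energy", "tempo")))

# scale() stores the means and standard deviations it used as attributes
train_scaled <- scale(train_x)
train_center <- attr(train_scaled, "scaled:center")
train_sd     <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test data
test_scaled <- scale(test_x, center = train_center, scale = train_sd)

# Each training column now has mean 0 and standard deviation 1,
# while individual values can still fall well outside -1 and 1
round(colMeans(train_scaled), 10)
apply(train_scaled, 2, sd)
range(train_scaled)
```

Scaling both sets with the same parameters keeps a given raw value mapped to the same scaled value in train and test, which matters whenever distances to the fitted cluster centers are computed.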

Correlations

Correlation Plot Insights:

The plot below gives the following top few insights:

  1. Energy has a high positive correlation with loudness and a negative correlation with acousticness. It is also positively related to liveness.
  2. Like energy, loudness and tempo are negatively related to acousticness, i.e. as acousticness increases, loudness and tempo decrease.
  3. Popularity is negatively correlated with energy, liveness and instrumentalness, and positively associated with danceability, loudness and acousticness.
  4. Valence and danceability have a positive relation.
# Correlation Plot 
ggcorr(train_scale, 
       low = "blue3", 
       high = "red") + 
      ggtitle("Correlation Plot") + 
      theme(plot.title = element_text(hjust = 0.5))

K-Means Clustering

  • K-Means clustering is a simple and quick algorithm which deals with large data sets easily.

  • The idea behind K-Means is to group the data into clusters such that the variation inside the clusters (also known as total within-cluster sum of squares or WSS) is minimum, and the variation between the clusters is maximum.

  • This helps in understanding which songs tend to be popular in which groups

General K-Means Process

  1. Identify the number of clusters (K) to be created; in this analysis the elbow method is used for this
  2. Select K objects from the data set as the initial cluster centers and fit the kmeans model
  3. Plot the clusters
  4. Assess the quality of the clusters

Elbow method

  • The elbow method is a simple heuristic that lets the data suggest the number of clusters instead of fixing it arbitrarily.

  • In this method, the total WSS is plotted against the number of clusters k. The location of a bend (elbow) in the curve is taken as an indicator of the appropriate number of clusters.

  • With the elbow method, the ideal number of clusters is identified as 3. Therefore kmeans is implemented with 3 centers.

  • The total within-cluster sum of squares (WSS) measures the compactness of the clustering and we want it to be as small as possible. Since it keeps decreasing as k grows, the bend in the curve is used rather than the minimum.

#Elbow Method

wss <- (nrow(train_scale)-1)*sum(apply(train_scale,2,var))

for (i in 2:15) wss[i] <- sum(kmeans(train_scale,centers=i)$withinss)
plot(1:15, wss, type="b", pch=20, frame = FALSE, xlab="Number of Clusters K",ylab="Total WSS",main="Optimal Number of Clusters")

Fitting K Means Model

The k-means model is fit with 3 centers; nstart = 25 runs the algorithm from 25 random initial configurations and keeps the best one.

As seen from the output, this model results in 3 clusters of sizes 7964, 10796 and 4219.

#Fit kmeans
set.seed(13437885)
fit <- kmeans(train_scale, centers = 3, nstart = 25)
fit$size
## [1]  7964 10796  4219
#Plotting Kmeans
fviz_cluster(fit, 
             geom = c("point", "text"),  
             data = train_scale, 
             palette = "Set3",
             main = "K Means Clustering with 3 Centers", 
             alpha = 0.9) + theme(plot.title = element_text(hjust = 0.5))

  • The clusters are extracted and added to the data to do some descriptive statistics at the cluster level. The datatable below is a result of clustering.

  • The right most column depicts the cluster the songs belong to, and this will help in further analysis to understand the features of the clusters.

#Assigning cluster to df
train_kmeans$cluster <- as.factor(fit$cluster)
datatable(head(train_kmeans,5),options = list(dom = 't',scrollX = T,autoWidth = TRUE))

Model Quality Check

Interpreting the Quality of Clusters

  • The BSS is 51460.41.

    • The between-cluster sum of squares (BSS) is the size-weighted sum of squared distances between each cluster center and the overall center of the data.
    • The higher it is, the better, as we want the different cluster centers far apart from each other.
    • A large BSS implies that the clusters have distinct and easily identifiable characteristics.
round(fit$betweenss,2)
## [1] 51460.41
  • The idea is to maximize the bss/tss%.

This ratio can always be raised by increasing the number of clusters, so a low value on its own does not call for more clusters. Since the elbow method pointed to 3 clusters, we stay with 3.

round((fit$betweenss / fit$totss * 100),2)
## [1] 22.4
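The bss/tss% rests on the identity that the total sum of squares splits into a within-cluster and a between-cluster part. The sketch below checks this on small synthetic data; toy and toy_fit are hypothetical names and the data are simulated, not taken from the Spotify set.

```r
set.seed(42)

# Hypothetical two-feature data with three loose groups, then standardized
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
))
toy_fit <- kmeans(toy, centers = 3, nstart = 25)

# Total SS equals within-cluster SS plus between-cluster SS
c(totss = toy_fit$totss,
  within_plus_between = toy_fit$tot.withinss + toy_fit$betweenss)

# For standardized data, total SS is (rows - 1) * columns
(nrow(toy) - 1) * ncol(toy)

# Share of the total variation captured by the clustering, in percent
round(toy_fit$betweenss / toy_fit$totss * 100, 2)
```

Applied to this report, the same reasoning gives a total of (22979 - 1) * 10 = 229780 for the 22979 training rows and 10 scaled attributes, and 51460.41 / 229780 is about 22.4%, matching the output above.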

Prediction Strength

The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number that leads to a prediction strength above 0.8 or 0.9.

  • The prediction.strength function computes the prediction strength of a clustering of a dataset for different numbers of clusters.
  • The largest number of clusters with a prediction strength above the 0.8 cutoff is 3, hence although the bss/tss% is low we continue with 3 clusters.
  • The prediction strength is decent for every number of clusters tried, as it stays above 0.5 for 2 to 5 clusters
#Prediction Strength
prediction.strength(train_scale, Gmin=2, Gmax=5, M=10,cutoff=0.8)
## Prediction strength 
## Clustering method:  kmeans 
## Maximum number of clusters:  5 
## Resampled data sets:  10 
## Mean pred.str. for numbers of clusters:  1 0.8710546 0.8671857 0.6807116 0.5561761 
## Cutoff value:  0.8 
## Largest number of clusters better than cutoff:  3

Attribute Analysis

Cluster Behaviour Analysis

The behaviours of the clusters can be outlined as below:

  • Cluster 1: Liveness, Energy
    • Cluster 1 is second largest
  • Cluster 2: Track Popularity, Danceability, Energy, Valence
    • Cluster 2 is the largest
  • Cluster 3: Acousticness, danceability
    • Cluster 3 is the smallest
  • Acousticness and energy vary drastically across the clusters. Hence they will be used in the final analysis
  • Popularity of cluster 2 is the highest, followed by cluster 3 and finally cluster 1, but popularity doesn’t really distinguish the clusters
  • Similarly, danceability is not too distinct amongst the clusters
  • Cluster 1 songs rank high on energy
  • Valence is a distinguishing attribute of cluster 2
  • Acousticness is the highest in, and only significant for, cluster 3
#Grouping the Clusters by Mean
cluster_mean <- train_kmeans %>%
                group_by(cluster) %>% 
                summarise_if(is.numeric, "mean") %>% 
                mutate_if(is.numeric, .funs = "round", digits = 2)

datatable(cluster_mean, options = list(dom = 't',scrollX = T,autoWidth = TRUE))
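The summarise_if and mutate_if helpers used above still work but are superseded in current dplyr. A hedged sketch of the same per-cluster mean table written with across is shown below on a made-up data frame; toy_clusters and toy_means are hypothetical names and the values are invented.

```r
library(dplyr)

# Toy stand-in for train_kmeans (hypothetical values)
toy_clusters <- data.frame(
  cluster = factor(c(1, 1, 2, 2)),
  energy  = c(0.8, 0.6, 0.3, 0.5),
  valence = c(0.2, 0.4, 0.9, 0.7),
  genre   = c("pop", "rock", "rap", "pop")
)

# Mean of every numeric column per cluster, rounded to 2 digits
toy_means <- toy_clusters %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), ~ round(mean(.x), 2)))

toy_means
```

Non-numeric columns such as genre are simply left out of the result, which mirrors what summarise_if(is.numeric, ...) does in the original code.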
#Boxplots for Clusters
b1 <- train_kmeans %>% 
      ggplot(aes(x = cluster, 
      y = energy, 
      fill = cluster)) +
      geom_boxplot() + 
      scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) + 
      ggtitle("Clusters and Energy") + 
      theme(plot.title = element_text(hjust = 0.5))

b2 <- train_kmeans %>% 
      ggplot(aes(x = cluster, 
      y = acousticness, 
      fill = cluster)) +
      geom_boxplot() + 
      scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) + 
      ggtitle("Clusters and Acousticness") + 
      theme(plot.title = element_text(hjust = 0.5))

b3 <- train_kmeans %>% 
      ggplot(aes(x = cluster, 
      y = danceability, 
      fill = cluster)) +
      geom_boxplot() + 
      scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) + 
      ggtitle("Clusters and Danceability") + 
      theme(plot.title = element_text(hjust = 0.5))

b4 <- train_kmeans %>% 
      ggplot(aes(x = cluster, 
      y = valence, 
      fill = cluster)) +
      geom_boxplot() + 
      scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) + 
      ggtitle("Clusters and Valence") + 
      theme(plot.title = element_text(hjust = 0.5))

grid.arrange(b1, b2, b3, b4, nrow=2, ncol=2)

Cluster Insights

Individual Cluster Analysis

For the cluster analysis, songs with a popularity of 90 or above are treated as popular. The popular songs in each cluster are shown in the tables below.

Cluster 1 Insights:

  • Cluster 1, the second largest of the 3 clusters, stands out for its liveness and energy. Since pop, rock and rap are the most popular genres in this cluster, the table for cluster 1 is based on those.

  • As expected, most popular songs in cluster 1 are high on energy and low on acousticness

#Analysis on Cluster 1
c1 <- train_kmeans[which(train_kmeans$cluster==1), ]

#Grouping cluster by popularity
avg_pop <- c1 %>% 
          select(track_popularity, playlist_genre) %>% 
          group_by(playlist_genre) %>% 
          summarise("average_popularity" = round(mean(track_popularity)))

#Plotting genres across popularity
x1 <- ggplot(data=avg_pop, 
             mapping = aes(x = (playlist_genre), 
                           y = average_popularity, 
                           fill = playlist_genre)) + 
      geom_col(width = 0.7,alpha=0.7) + 
      scale_fill_brewer(palette = "Spectral") + 
      ggtitle("Cluster 1 - Genres & Popularity") + 
      xlab("Genre") + ylab("Mean Popularity") + 
      theme(plot.title = element_text(hjust = 0.5))
x1

n <- c1 %>% 
  select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>% 
  subset(track_popularity >= 90 & playlist_genre %in% c("rap","rock","pop")) %>% 
  distinct(track_name,.keep_all = TRUE) 

datatable(n, caption = 'Cluster 1: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))

Cluster 2 Insights:

  • Cluster 2 has the most popular tracks, largely because of its size, and its tracks also have the highest danceability, energy and valence

  • In this cluster the most popular genres are pop, latin and rock

  • The popular songs in cluster 2 are high on energy and low on acousticness.

#Analysis on Cluster 2
c2 <- train_kmeans[which(train_kmeans$cluster==2), ]

#Grouping cluster by popularity
avg_pop <- c2 %>% 
          select(track_popularity, playlist_genre) %>% 
          group_by(playlist_genre) %>% 
          summarise("average_popularity" = round(mean(track_popularity)))

#Plotting genres across popularity
x2 <- ggplot(data=avg_pop, 
             mapping = aes(x = (playlist_genre), 
                           y = average_popularity, 
                           fill = playlist_genre)) + 
      geom_col(width = 0.7,alpha=0.7) + 
      scale_fill_brewer(palette = "Spectral") + 
      ggtitle("Cluster 2 - Genres & Popularity") + 
      xlab("Genre") + ylab("Mean Popularity") + 
      theme(plot.title = element_text(hjust = 0.5))
x2

n <- c2 %>% 
  select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>% 
  subset(track_popularity >= 90 & playlist_genre %in% c("latin","rock","pop")) %>% 
  distinct(track_name,.keep_all = TRUE) 

datatable(n, caption = 'Cluster 2: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))

Cluster 3 Insights

  • Cluster 3 is the smallest, and its tracks have high acousticness, high danceability and mid-level energy compared to the other clusters.

  • In this cluster the most popular genres are pop, latin and rap

  • The popular songs are high on acousticness with average energy.

#Analysis on Cluster 3
c3 <- train_kmeans[which(train_kmeans$cluster==3), ]

#Grouping cluster by popularity
avg_pop <- c3 %>% 
          select(track_popularity, playlist_genre) %>% 
          group_by(playlist_genre) %>% 
          summarise("average_popularity" = round(mean(track_popularity)))

#Plotting genres across popularity
x3 <- ggplot(data=avg_pop, 
             mapping = aes(x = (playlist_genre), 
                           y = average_popularity, 
                           fill = playlist_genre)) + 
      geom_col(width = 0.7,alpha=0.7) + 
      scale_fill_brewer(palette = "Spectral") + 
      ggtitle("Cluster 3 - Genres & Popularity") + 
      xlab("Genre") + ylab("Mean Popularity") + 
      theme(plot.title = element_text(hjust = 0.5))
x3

n <- c3 %>% 
  select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>% 
  subset(track_popularity >= 90 & playlist_genre %in% c("latin","rap","pop")) %>% 
  distinct(track_name,.keep_all = TRUE) 

datatable(n, caption = 'Cluster 3: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))
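The three cluster sections repeat the same filtering steps with only the cluster number and genre list changed. One possible way to reduce the repetition is a small helper function. The sketch below is illustrative only: top_songs is a hypothetical helper, not something from the original report, and it is run here on made-up tracks.

```r
library(dplyr)

# Hypothetical helper: popular tracks of one cluster, limited to chosen genres
top_songs <- function(data, cluster_id, genres, min_popularity = 90) {
  data %>%
    filter(cluster == cluster_id,
           track_popularity >= min_popularity,
           playlist_genre %in% genres) %>%
    select(track_name, track_artist, playlist_genre,
           acousticness, energy, track_popularity) %>%
    distinct(track_name, .keep_all = TRUE) %>%   # one row per track name
    arrange(desc(track_popularity))
}

# Toy stand-in for train_kmeans (hypothetical values)
toy_tracks <- data.frame(
  track_name       = c("A", "A", "B", "C", "D"),
  track_artist     = c("x", "x", "y", "z", "w"),
  playlist_genre   = c("pop", "rap", "rock", "edm", "pop"),
  acousticness     = c(0.1, 0.1, 0.2, 0.3, 0.4),
  energy           = c(0.9, 0.9, 0.8, 0.7, 0.6),
  track_popularity = c(95, 95, 91, 99, 80),
  cluster          = factor(c(1, 1, 1, 1, 2))
)

cluster1_top <- top_songs(toy_tracks, cluster_id = 1,
                          genres = c("pop", "rock", "rap"))
cluster1_top
```

On the toy data this keeps tracks A and B: track C is excluded by genre, the duplicate entry for A is collapsed, and track D belongs to another cluster and falls below the popularity threshold.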

Conclusions

  • Summary:

    • This analysis aimed to understand what makes the clusters different from each other, which also led us to the top songs in each cluster
    • The analysis was carried out through visual exploration, statistical testing and K-means clustering to arrive at the takeaways below
    • To a consumer, this analysis gives an overview of the kind of music they might follow on Spotify based on their tastes.
  • Key Takeaways:

    • The three clusters do not vary too much on popularity, but instead vary highly on energy and acousticness.
    • The most popular genres turn out to be Pop, Latin and Rock
    • Cluster two, with low acousticness and mid-level energy, has the largest number of popular songs. One reason for this may be the high danceability associated with cluster 2.
  • Limitations:

    • The number of clusters K was chosen using only the elbow method. Trying the Gap statistic and the Silhouette method as well would enhance the quality of the analysis.
    • This analysis does not cover predicting the popularity of a song, which would be a good project on its own.
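As a pointer for the first limitation, the sketch below shows one way the Silhouette method could be tried, using the silhouette function from the cluster package on synthetic data. It is not part of the original analysis; toy, toy_dist, avg_sil and best_k are hypothetical names, and the candidate range 2 to 6 is an arbitrary choice.

```r
library(cluster)

set.seed(7)

# Hypothetical standardized data with three well-separated groups
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
))
toy_dist <- dist(toy)   # pairwise Euclidean distances

# Average silhouette width for each candidate number of clusters
candidate_k <- 2:6
avg_sil <- sapply(candidate_k, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, toy_dist)[, "sil_width"])
})
names(avg_sil) <- candidate_k

# The k with the largest average silhouette width is the suggested choice
best_k <- as.integer(names(which.max(avg_sil)))
round(avg_sil, 3)
best_k
```

On the full training set of about 23,000 songs the pairwise distance matrix would need roughly 2 GB of memory, so in practice this would likely be run on a random subsample of the scaled data.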