Cluster Analysis on Spotify Data

Introduction

Dataset:

The dataset was extracted with the spotifyr package and obtained from the rfordatascience GitHub repository.

Problem Statement:

Spotify as a music application does a very good job of recommending music to its users. It suggests music based on the songs and artists you like most often. This data set, built via the spotifyr package, has details of track names, artists, genres, subgenres and other audio features.

Objective:

The idea behind the project is to use this dataset to:

Build a K-means model to identify the most popular songs in each cluster:
- Which genres are popular in each cluster, i.e. are they pop? Are they rap?
- How varied are these clusters?
- What are the audio features/attributes of these clusters?
- Understand how the audio features behave across clusters and thereby across the songs

End Goal:

This analysis aims to provide an understanding of which songs and genres are the most popular.
The idea is to help an end user gain a better understanding of what lies behind the most popular songs on Spotify.

Approach:

Exploratory Data Analysis

Visualization techniques that uncover patterns and insights about the audio features and their behaviour with each other
Statistical Testing to understand variable behaviors :
- Chi square testing for categorical variables
- Correlation plot for numeric variables

K - Means Clustering

To understand the popularity of songs, I used the K-means clustering method to group songs into clusters and measured how far apart the clusters are from each other.
Songs with similar characteristics are grouped into clusters by the algorithm, and these clusters help in understanding the audio attributes of the popular songs.

Insights from Analysis

Data Preparation

Packages

#Dataframe
library(knitr)
library(kableExtra)
library(DT)
#Data Manipulation
library(tidyverse)
library(dplyr)
library(tidyr)
#Data Viz
library(ggplot2)
library(GGally)
library(RColorBrewer)
library(viridis)
library(gridExtra)
#K-Means
library(factoextra)
library(fpc)

knitr : Helps display output cleanly with little code. Its kable function is particularly useful for presenting tables
kableExtra : Extends the kable function with formatting functions that control table styling, width and scrolling
DT : Helps in presenting tables in a clean, interactive format, with the ability to sort and filter
GGally : Used to plot the correlations between variables in matrix form
tidyverse : A collection of packages including “dplyr”, “tidyr” and “ggplot2”, explained below
- dplyr provides functions for data manipulation, such as adding new variables that are functions of existing variables, and selecting, renaming, filtering and summarising data
- tidyr helps in tidying data, for example with the drop_na and replace_na functions or by extracting values from strings, thereby making the data more readable and complete
- ggplot2 provides elegant visualizations that help present insights clearly
RColorBrewer : Provides multiple color palettes to be used in conjunction with ggplot2 visualisations
viridis : Similar to RColorBrewer, provides color palettes
gridExtra : Helps in arranging multiple plots on a grid
factoextra : Usually used to visualize the output of multivariate data analyses; in this project I have used it to plot the clusters from the K-means algorithm
fpc : Provides various methods for clustering and cluster validation

Importing Data

Loading Data

spotify <- read.csv("spotify_songs.csv", stringsAsFactors=FALSE)

About the data

## [1] 32833    23

The data set has 32833 rows of observations with 23 variables.

The following information about the variables is provided on the ‘rfordatascience’ website and will help the users to understand the dataset

kable(table_description, caption = "Spotify Dictionary")

Spotify Dictionary
Variable	Description
track_id	Song unique ID
track_name	Song Name
track_artist	Song Artist
track_popularity	Song Popularity (0-100) where higher is better
track_album_id	Album unique ID
track_album_name	Song album name
track_album_release_date	Date when album released
playlist_name	Name of playlist
playlist_id	Playlist ID
playlist_genre	Playlist genre
playlist_subgenre	Playlist subgenre
danceability	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic
instrumentalness	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	Duration of song in milliseconds

Data Wrangling

Data Cleaning

The total number of missing values for each variable in the data set is identified first.
The following variables each have 5 missing values:
- track_name
- track_artist
- track_album_name

colSums(is.na(spotify))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

From the results below it can be seen that all the missing values in the 3 variables belong to the same 5 rows.
The row indices are : 8152, 9283, 9284, 19569, 19812.

which(is.na(spotify$track_name))

## [1]  8152  9283  9284 19569 19812

which(is.na(spotify$track_artist))

## [1]  8152  9283  9284 19569 19812

which(is.na(spotify$track_album_name))

## [1]  8152  9283  9284 19569 19812

Five rows out of 32833 is a very small share of the dataset, so omitting them does not harm the analysis.

spotify <-  spotify[-c(8152,9283,9284,19569,19812), ]
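The hard-coded row indices above work for this particular file, but they would silently point at the wrong rows if the file changed. A minimal sketch of a name-based alternative, using base R's complete.cases on a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not part of the Spotify data):

```r
# Toy data frame standing in for the Spotify data (values are made up).
toy <- data.frame(
  track_name       = c("a", NA, "c", "d"),
  track_artist     = c("x", NA, "z", "w"),
  track_popularity = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Columns that must be present for a row to be kept.
key_cols <- c("track_name", "track_artist")

# complete.cases() returns TRUE for rows with no NA in the given columns.
toy_clean <- toy[complete.cases(toy[, key_cols]), ]

# Number of rows dropped by the filter.
n_dropped <- nrow(toy) - nrow(toy_clean)
n_dropped
```

Applied to the real data, the same pattern would use the three affected columns (track_name, track_artist, track_album_name) as `key_cols`.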

Certain variables have incorrect data types, and before starting EDA they need to be corrected.

str(spotify)

## 'data.frame':    32828 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

From the above summary of the structure of the data, apart from numeric variables, the following variables need to be transformed to factors:

playlist_genre : 6 types of genres, hence better to transform to factors of 6 levels.

unique(spotify$playlist_genre)

## [1] "pop"   "rap"   "rock"  "latin" "r&b"   "edm"

playlist_subgenre : 24 types of subgenres, hence better to transform to factors of 24 levels.

unique(spotify$playlist_subgenre)

##  [1] "dance pop"                 "post-teen pop"            
##  [3] "electropop"                "indie poptimism"          
##  [5] "hip hop"                   "southern hip hop"         
##  [7] "gangster rap"              "trap"                     
##  [9] "album rock"                "classic rock"             
## [11] "permanent wave"            "hard rock"                
## [13] "tropical"                  "latin pop"                
## [15] "reggaeton"                 "latin hip hop"            
## [17] "urban contemporary"        "hip pop"                  
## [19] "new jack swing"            "neo soul"                 
## [21] "electro house"             "big room"                 
## [23] "pop edm"                   "progressive electro house"

key : 12 types of keys, hence better to transform to factors of 12 levels.

unique(spotify$key)

##  [1]  6 11  1  7  8  5  4  2  0 10  9  3

mode : 2 types of mode (0,1), hence better to transform to factors of 2 levels.

unique(spotify$mode)

## [1] 1 0

Therefore, we transform the above variables to factors and also fix some other variables

#Changing Data Types
spotify <- spotify %>% 
  mutate(
    #categorical variables become factors
    track_name = as.factor(track_name),
    track_artist = as.factor(track_artist),
    playlist_genre = as.factor(playlist_genre),
    playlist_subgenre = as.factor(playlist_subgenre),
    key = as.factor(key),
    mode = as.factor(mode),
    #integer measurements become numeric
    track_popularity = as.numeric(track_popularity),
    duration_ms = as.numeric(duration_ms)
  )

Now, the variables track_id, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id, playlist_subgenre and duration_ms are not important to the analysis and hence they are dropped.

spotify <- spotify %>% select(2,3,4,10,12:22)
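Selecting by position is compact but hard to audit. The sketch below maps the positions used in the select call back to column names, using the column order printed by str(spotify) earlier; the vector names `all_cols`, `kept` and `dropped` are introduced here for illustration only.

```r
# Column names in the order printed by str(spotify) above.
all_cols <- c(
  "track_id", "track_name", "track_artist", "track_popularity",
  "track_album_id", "track_album_name", "track_album_release_date",
  "playlist_name", "playlist_id", "playlist_genre", "playlist_subgenre",
  "danceability", "energy", "key", "loudness", "mode", "speechiness",
  "acousticness", "instrumentalness", "liveness", "valence", "tempo",
  "duration_ms"
)

# Same positions as in the select() call.
kept    <- all_cols[c(2, 3, 4, 10, 12:22)]
dropped <- setdiff(all_cols, kept)

length(kept)  # number of columns that survive the select
dropped       # columns removed by the select
```

A name-based call such as `select(all_of(kept))` should give the same result while staying robust to a change in column order.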

Summary of Cleaned Dataset

The cleaned dataset has 32828 observations of 15 variables

dim(spotify)

## [1] 32828    15

From the summary, it can be seen that the audio features match the values and ranges given in the data dictionary.
For speechiness, acousticness, instrumentalness and liveness, however, the mean is noticeably higher than the median, so their distributions are examined with plots in the EDA section.

kable(summary(spotify)) %>% 
      kable_styling(bootstrap_options = c("striped", "hover"),
                    full_width = F,
                    font_size = 12,
                    position = "left") %>% 
                    scroll_box(width = "100%", 
                               height = "400px")

track_name	track_artist	track_popularity	playlist_genre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo
Poison : 22	Martin Garrix : 161	Min. : 0.00	edm :6043	Min. :0.0000	Min. :0.000175	1 : 4010	Min. :-46.448	0:14256	Min. :0.0000	Min. :0.0000	Min. :0.0000000	Min. :0.0000	Min. :0.0000	Min. : 0.00
Breathe : 21	Queen : 136	1st Qu.: 24.00	latin:5153	1st Qu.:0.5630	1st Qu.:0.581000	0 : 3454	1st Qu.: -8.171	1:18572	1st Qu.:0.0410	1st Qu.:0.0151	1st Qu.:0.0000000	1st Qu.:0.0927	1st Qu.:0.3310	1st Qu.: 99.96
Alive : 20	The Chainsmokers: 123	Median : 45.00	pop :5507	Median :0.6720	Median :0.721000	7 : 3352	Median : -6.166	NA	Median :0.0625	Median :0.0804	Median :0.0000161	Median :0.1270	Median :0.5120	Median :121.98
Forever : 20	David Guetta : 110	Mean : 42.48	r&b :5431	Mean :0.6549	Mean :0.698603	9 : 3027	Mean : -6.720	NA	Mean :0.1071	Mean :0.1754	Mean :0.0847599	Mean :0.1902	Mean :0.5106	Mean :120.88
Paradise: 19	Don Omar : 102	3rd Qu.: 62.00	rap :5743	3rd Qu.:0.7610	3rd Qu.:0.840000	11 : 2994	3rd Qu.: -4.645	NA	3rd Qu.:0.1320	3rd Qu.:0.2550	3rd Qu.:0.0048300	3rd Qu.:0.2480	3rd Qu.:0.6930	3rd Qu.:133.92
Stay : 19	Drake : 100	Max. :100.00	rock :4951	Max. :0.9830	Max. :1.000000	2 : 2827	Max. : 1.275	NA	Max. :0.9180	Max. :0.9940	Max. :0.9940000	Max. :0.9960	Max. :0.9910	Max. :239.44
(Other) :32707	(Other) :32096	NA	NA	NA	NA	(Other):13164	NA	NA	NA	NA	NA	NA	NA	NA

Exploratory Data Analysis

Understanding Attributes

Speechiness, acousticness, instrumentalness, liveness are right skewed, with instrumentalness behavior needing more explanation

#Plotting numeric values
spotify %>%
  keep(is.numeric) %>% #hist only for numeric
  gather() %>% #converts to key value
  ggplot(aes(value, fill = key)) + 
  facet_wrap(~ key, scales = "free") +
  geom_histogram(alpha = 0.7, bins = 30) + 
  ggtitle("Distribution of Audio Attributes") + 
  scale_x_continuous(guide = guide_axis(check.overlap = TRUE)) + #values are continuous
  theme(plot.title = element_text(hjust = 0.5))

As the histograms depict, many of the attributes are skewed which is reflected in the boxplots as well.
Instrumentalness has most values closer to 0, which is why the boxplot and histogram act this way.

#Boxplot for numeric values
spotify %>%
  keep(is.numeric) %>% #keep only numeric columns
  gather() %>% #converts to key value
  ggplot(aes(value, fill = key)) + 
  facet_wrap(~ key, scales = "free") +
  geom_boxplot(alpha = 0.7) + 
  ggtitle("Boxplots of Attributes") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

Understanding Genres

From Plot 1, it can be seen that the genres with the most songs are:

EDM
Rap
Pop

To understand genres better, the mean popularity of each genre is plotted.
From Plot 2, it can be seen that the genres with the highest mean popularity are:

Pop
Latin
Rap

Hence the cluster analysis focuses on these genres.

#Plotting Genres
p1 <- ggplot(spotify, aes(x=factor(playlist_genre))) +
      geom_bar(width=0.7, 
           aes(fill=playlist_genre), 
           alpha=0.7) + 
      scale_fill_brewer(palette = "Paired") + 
      ggtitle("Plot 1 : Genre Count") + 
      theme(plot.title = element_text(hjust = 0.5)) + 
      xlab("Genre")

avg_popularity <- spotify %>% 
                  select(track_popularity, playlist_genre) %>% 
                  group_by(playlist_genre) %>% 
                  summarise("average_popularity" = round(mean(track_popularity)))

p2 <- ggplot(data=avg_popularity, 
             mapping = aes(x = (playlist_genre), 
                           y = average_popularity, 
                           fill = playlist_genre)) + 
      geom_col(width = 0.7,alpha=0.7) + 
      scale_fill_brewer(palette = "Paired") + 
      ggtitle("Plot 2 : Genres & Popularity") + 
      xlab("Genre") + ylab("Mean Popularity") + 
      theme(plot.title = element_text(hjust = 0.5))

grid.arrange(p1, p2, nrow=2, ncol=1)

Keys & Mode

In music, the key of a piece is the scale (an ascending series of notes) that its melody is built on; it is indicated by a key signature, i.e. the number of sharps or flats in the scale.

But is the key of a song related to its mode?

Since key and mode are both categorical variables, a chi-square test of independence is used to examine their relationship.

Ho : Key is independent of mode
Ha : Key is not independent of mode

On applying the chisq.test function to the two variables, the p-value is found to be less than 2.2e-16, which is far below an \(\alpha\) of 0.05.

Hence, the null hypothesis is rejected.

Therefore, key and mode are not independent: the distribution of keys differs between major and minor tracks.

chisq.test(spotify$key, spotify$mode)

## 
##  Pearson's Chi-squared test
## 
## data:  spotify$key and spotify$mode
## X-squared = 3046.2, df = 11, p-value < 2.2e-16
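With almost 33,000 rows, even a weak association produces a tiny p-value, so an effect size helps to judge how strong the dependence is. The sketch below computes Cramér's V, a measure not used in the original analysis, from the test output above. The values of `n_obs`, `n_key` and `n_mode` are taken from the cleaned dataset (32828 rows, 12 keys, 2 modes).

```r
# Values taken from the chisq.test output and the cleaned dataset.
x_squared <- 3046.2
n_obs     <- 32828   # rows after cleaning
n_key     <- 12      # levels of key
n_mode    <- 2       # levels of mode

# Cramér's V = sqrt(X^2 / (n * min(r - 1, c - 1)))
cramers_v <- sqrt(x_squared / (n_obs * min(n_key - 1, n_mode - 1)))
round(cramers_v, 2)
```

The result is roughly 0.30, which is usually read as a moderate association: statistically clear, but far from a deterministic link between key and mode.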

From the chart below it is seen that songs have mode 1 (major) more often than mode 0 (minor).
Key 1 (C♯/D♭) is the most frequently occurring key in songs.

#Plotting Mode & Keys

g1 <- ggplot(spotify,aes(mode)) + 
      geom_bar(aes(fill=mode),alpha = 0.6) + 
      ggtitle("Modes") +
      theme(plot.title = element_text(hjust = 0.5)) + 
      scale_fill_brewer(palette = "Dark2")
g2 <- ggplot(spotify,aes(key)) + 
      geom_bar(aes(fill=key), alpha = 0.6) + 
      ggtitle("Keys") +
      theme(plot.title = element_text(hjust = 0.5))

grid.arrange(g1,g2,ncol=1)

K-Means Clustering

Data Structuring

Approach for Clustering:

70% of the data is split as train set and rest 30% as test set.
The data is scaled to make the numerical attributes comparable
Understand behaviour of numerical attributes from the correlation plot
Find the optimal number of centers using elbow method to implement K-Means Clustering
Fit the K Means Clustering Model
Group the clusters and the attributes by their mean
Understand the accuracy of the model
In-depth analysis cluster wise
Interpretation of Model and Results

#Splitting Data into Train & Test
set.seed(13437885) #fix the seed so the random split is reproducible
index <- sample(nrow(spotify), 0.7*nrow(spotify))
train_kmeans <- spotify[index,]
test_kmeans <- spotify[-index,]

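Standardization is an important step in data preprocessing, as it puts the numeric attributes on a common scale. The scale function centers each numeric column to mean 0 and rescales it to standard deviation 1, so that no attribute dominates the distance calculation because of its units; the scaled values are not confined to a fixed range. Therefore, I have scaled the data before implementing K-Means Clustering.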

#Scaling Data
train_scale <- scale(train_kmeans[,-c(1,2,4,7,9)])
test_scale <- scale(test_kmeans[,-c(1,2,4,7,9)])
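The code above scales the test set with its own means and standard deviations. If the test set is later compared against the training clusters, a common alternative is to reuse the training parameters. A minimal sketch on simulated matrices (`train_x` and `test_x` are hypothetical stand-ins, not the Spotify features):

```r
set.seed(1)
# Hypothetical numeric matrices standing in for the train and test features.
train_x <- matrix(rnorm(200, mean = 5, sd = 2), ncol = 2)
test_x  <- matrix(rnorm(60,  mean = 5, sd = 2), ncol = 2)

train_scaled <- scale(train_x)

# scale() stores the column means and standard deviations as attributes.
train_center <- attr(train_scaled, "scaled:center")
train_sd     <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test set.
test_scaled <- scale(test_x, center = train_center, scale = train_sd)

round(colMeans(train_scaled), 10)  # column means of the scaled training data
apply(train_scaled, 2, sd)         # column standard deviations
range(train_scaled)                # scaled values are not bounded
```

This keeps both sets on the same scale, so a distance of one unit means the same thing in train and test.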

Correlations

Correlation Plot Insights:

The plot below gives the following top few insights:

Energy has a high positive correlation with loudness and a negative correlation with acousticness. It is also positively related to liveness.
Like energy, loudness and tempo are negatively related to acousticness, i.e. as acousticness increases, loudness and tempo decrease.
Popularity is negatively correlated with energy, liveness and instrumentalness, and positively associated with danceability, loudness and acousticness.
Valence and danceability have a positive relation.

# Correlation Plot 
ggcorr(train_scale, 
       low = "blue3", 
       high = "red") + 
      ggtitle("Correlation Plot") + 
      theme(plot.title = element_text(hjust = 0.5))

K-Means Clustering

K-Means clustering is a simple and quick algorithm which deals with large data sets easily.
The idea behind K-Means is to group the data into clusters such that the variation inside the clusters (also known as the total within-cluster sum of squares, or WSS) is minimal, and the variation between the clusters is maximal.
This helps in understanding which songs tend to be popular in which groups.

General K-Means Process

Identify the number of clusters (K) to be created, in this analysis Elbow Method has been used for the same
Select optimally identified k objects from the data set as the cluster centers and fit the kmeans model
Plot the clusters
Measure the accuracy

Elbow method

The elbow method is a simple heuristic for choosing the number of clusters from the data instead of fixing it arbitrarily.
In this method, a WSS curve is plotted against the number of clusters k. The location of a bend (knee) in the plot is considered an indicator of the appropriate number of clusters.
With the elbow method, the ideal number of clusters is identified as 3. Therefore kmeans is implemented with 3 centers.
The total within-cluster sum of square (wss) measures the compactness of the clustering and we want it to be as small as possible.

#Elbow Method

wss <- (nrow(train_scale)-1)*sum(apply(train_scale,2,var))

for (i in 2:15) wss[i] <- sum(kmeans(train_scale,centers=i)$withinss)
plot(1:15, wss, type="b", pch=20, frame = FALSE, xlab="Number of Clusters K",ylab="Total WSS",main="Optimal Number of Clusters")

Fitting K Means Model

The k-means model is fit with 3 centers; nstart = 25 runs the algorithm from 25 random initial configurations and keeps the best one.

As seen from the output, this model results in 3 clusters of sizes 7964, 10796 and 4219.

#Fit kmeans
set.seed(13437885)
fit <- kmeans(train_scale, centers = 3, nstart = 25)
fit$size

## [1]  7964 10796  4219

#Plotting Kmeans
fviz_cluster(fit, 
             geom = c("point", "text"),  
             data = train_scale, 
             palette = "Set3",
             main = "K Means Clustering with 3 Centers", 
             alpha = 0.9) + theme(plot.title = element_text(hjust = 0.5))

The clusters are extracted and added to the data to do some descriptive statistics at the cluster level. The datatable below is a result of clustering.
The right most column depicts the cluster the songs belong to, and this will help in further analysis to understand the features of the clusters.

#Assigning cluster to df
train_kmeans$cluster <- as.factor(fit$cluster)
datatable(head(train_kmeans,5),options = list(dom = 't',scrollX = T,autoWidth = TRUE))

Model Quality Check

Interpreting the Quality of Clusters

The BSS is 51460.41.
- The between-cluster sum of squares (BSS) is the sum of squared distances between each cluster center and the overall mean of the data, weighted by cluster size.
- The higher it is, the better, as we want the different cluster centers far apart from each other.
- A large BSS implies that the clusters have distinct, easily identifiable characteristics.

round(fit$betweenss,2)

## [1] 51460.41

The idea is to maximize the bss/tss%, i.e. the share of the total variation that is explained by the clusters.

Increasing the number of clusters always raises this value. But in this case, we found the number of clusters to be ideal at 3, hence we’ll stay with it.

round((fit$betweenss / fit$totss * 100),2)

## [1] 22.4
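The bss/tss ratio relies on the fact that the total sum of squares splits exactly into a within-cluster and a between-cluster part. A short sketch on simulated, standardized data (the matrix `x` and the object `fit_demo` are assumptions for illustration, not the Spotify model):

```r
set.seed(42)
# Simulated, standardized data; not the Spotify features.
x <- scale(matrix(rnorm(600), ncol = 3))
fit_demo <- kmeans(x, centers = 3, nstart = 10)

# Total sum of squares = within-cluster part + between-cluster part.
decomposition_ok <- all.equal(fit_demo$totss,
                              fit_demo$tot.withinss + fit_demo$betweenss)
decomposition_ok

# For standardized data, totss equals (n - 1) * number of columns.
totss_ok <- all.equal(fit_demo$totss, (nrow(x) - 1) * ncol(x))
totss_ok

# Share of total variation explained by the clustering, in percent.
round(fit_demo$betweenss / fit_demo$totss * 100, 2)
```

The same identity explains the figures above: with 22979 training rows and 10 scaled attributes, the total sum of squares is 22978 * 10 = 229780, and 51460.41 / 229780 is about 22.4%.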

Prediction Strength

Prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number that leads to a prediction strength above 0.8 or 0.9.

The prediction.strength function computes the prediction strength of a clustering of a dataset for different numbers of clusters.
The largest number of clusters whose prediction strength exceeds the 0.8 cutoff is 3, hence though there’s a low bss/tss% we continue with 3 clusters.
The mean prediction strength is above 0.5 for every number of clusters tested (2 to 5), and above 0.86 for 2 and 3 clusters.

#Prediction Strength
prediction.strength(train_scale, Gmin=2, Gmax=5, M=10,cutoff=0.8)

## Prediction strength 
## Clustering method:  kmeans 
## Maximum number of clusters:  5 
## Resampled data sets:  10 
## Mean pred.str. for numbers of clusters:  1 0.8710546 0.8671857 0.6807116 0.5561761 
## Cutoff value:  0.8 
## Largest number of clusters better than cutoff:  3

Attribute Analysis

Cluster Behaviour Analysis

The behaviours of the clusters can be outlined as below:

Cluster 1: Liveness, Energy
- Cluster 1 is the second largest
Cluster 2: Track Popularity, Danceability, Energy, Valence
- Cluster 2 is the largest
Cluster 3: Acousticness, Danceability
- Cluster 3 is the smallest
Acousticness and energy vary drastically across the clusters. Hence they will be used in the final analysis
Popularity of cluster 2 is the highest, followed by cluster 3 and finally cluster 1, but popularity doesn’t really distinguish the clusters
Similarly, danceability is not too distinct amongst the clusters
Cluster 1 songs are ranked high on energy
Valence is an important attribute for cluster 2
Acousticness is the highest in, and only significant for, cluster 3

#Grouping the Clusters by Mean
cluster_mean <- train_kmeans %>%
                group_by(cluster) %>% 
                summarise_if(is.numeric, "mean") %>% 
                mutate_if(is.numeric, .funs = "round", digits = 2)

datatable(cluster_mean, options = list(dom = 't',scrollX = T,autoWidth = TRUE))

#Boxplots for Clusters
b1 <- train_kmeans %>% 
      ggplot(aes(x = cluster, 
      y = energy, 
      fill = cluster)) +
      geom_boxplot() + 
      scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) + 
      ggtitle("Clusters and Energy") + 
      theme(plot.title = element_text(hjust = 0.5))

b2 <- train_kmeans %>% 
      ggplot(aes(x = cluster, 
      y = acousticness, 
      fill = cluster)) +
      geom_boxplot() + 
      scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) + 
      ggtitle("Clusters and Acousticness") + 
      theme(plot.title = element_text(hjust = 0.5))

b3 <- train_kmeans %>% 
      ggplot(aes(x = cluster, 
      y = danceability, 
      fill = cluster)) +
      geom_boxplot() + 
      scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) + 
      ggtitle("Clusters and Danceability") + 
      theme(plot.title = element_text(hjust = 0.5))

b4 <- train_kmeans %>% 
      ggplot(aes(x = cluster, 
      y = valence, 
      fill = cluster)) +
      geom_boxplot() + 
      scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) + 
      ggtitle("Clusters and Valence") + 
      theme(plot.title = element_text(hjust = 0.5))

grid.arrange(b1, b2, b3, b4, nrow=2, ncol=2)

Cluster Insights

Individual Cluster Analysis

For the individual cluster analysis, the baseline for a popular song is a track_popularity of 90 or above. The popular songs in each cluster are shown in the tables below.

Cluster 1 Insights:

Cluster 1, the second largest of the 3 clusters, is characterised by its liveness and energy. Since pop, rock and rap are its most popular genres, the table for cluster 1 is based on those.
As expected, the most popular songs in cluster 1 are high on energy and low on acousticness.

#Analysis on Cluster 1
c1 <- train_kmeans[which(train_kmeans$cluster==1), ]

#Grouping cluster by popularity
avg_pop <- c1 %>% 
          select(track_popularity, playlist_genre) %>% 
          group_by(playlist_genre) %>% 
          summarise("average_popularity" = round(mean(track_popularity)))

#Plotting genres across popularity
x1 <- ggplot(data=avg_pop, 
             mapping = aes(x = (playlist_genre), 
                           y = average_popularity, 
                           fill = playlist_genre)) + 
      geom_col(width = 0.7,alpha=0.7) + 
      scale_fill_brewer(palette = "Spectral") + 
      ggtitle("Cluster 1 - Genres & Popularity") + 
      xlab("Genre") + ylab("Mean Popularity") + 
      theme(plot.title = element_text(hjust = 0.5))
x1

n <- c1 %>% 
  select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>% 
  subset(track_popularity >= 90 & playlist_genre %in% c("rap","rock","pop")) %>% 
  distinct(track_name,.keep_all = TRUE) 

datatable(n, caption = 'Cluster 1: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))

Cluster 2 Insights:

Cluster 2 has the most popular tracks, largely because of its size, and its tracks also have the highest danceability, energy and valence.
The most popular genres are pop, latin and rock.
The cluster 2 songs are high on energy and low on acousticness.

#Analysis on Cluster 2
c2 <- train_kmeans[which(train_kmeans$cluster==2), ]

#Grouping cluster by popularity
avg_pop <- c2 %>% 
          select(track_popularity, playlist_genre) %>% 
          group_by(playlist_genre) %>% 
          summarise("average_popularity" = round(mean(track_popularity)))

#Plotting genres across popularity
x2 <- ggplot(data=avg_pop, 
             mapping = aes(x = (playlist_genre), 
                           y = average_popularity, 
                           fill = playlist_genre)) + 
      geom_col(width = 0.7,alpha=0.7) + 
      scale_fill_brewer(palette = "Spectral") + 
      ggtitle("Cluster 2 - Genres & Popularity") + 
      xlab("Genre") + ylab("Mean Popularity") + 
      theme(plot.title = element_text(hjust = 0.5))
x2

n <- c2 %>% 
  select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>% 
  subset(track_popularity >= 90 & playlist_genre %in% c("latin","rock","pop")) %>% 
  distinct(track_name,.keep_all = TRUE) 

datatable(n, caption = 'Cluster 2: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))

Cluster 3 Insights

Cluster 3 is the smallest, and its tracks have high acousticness, high danceability and mid-level energy compared to the other clusters.
The most popular genres are pop, latin and rap.
The popular songs are high on acousticness with average energy.

#Analysis on Cluster 3
c3 <- train_kmeans[which(train_kmeans$cluster==3), ]

#Grouping cluster by popularity
avg_pop <- c3 %>% 
          select(track_popularity, playlist_genre) %>% 
          group_by(playlist_genre) %>% 
          summarise("average_popularity" = round(mean(track_popularity)))

#Plotting genres across popularity
x3 <- ggplot(data=avg_pop, 
             mapping = aes(x = (playlist_genre), 
                           y = average_popularity, 
                           fill = playlist_genre)) + 
      geom_col(width = 0.7,alpha=0.7) + 
      scale_fill_brewer(palette = "Spectral") + 
      ggtitle("Cluster 3 - Genres & Popularity") + 
      xlab("Genre") + ylab("Mean Popularity") + 
      theme(plot.title = element_text(hjust = 0.5))
x3

n <- c3 %>% 
  select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>% 
  subset(track_popularity >= 90 & playlist_genre %in% c("latin","rap","pop")) %>% 
  distinct(track_name,.keep_all = TRUE) 

datatable(n, caption = 'Cluster 3: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))
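The three per-cluster blocks above differ only in the cluster number and the genre list, so the filtering step could be wrapped in one function. A minimal sketch, assuming dplyr is loaded; `top_songs` and the `toy` data frame are hypothetical names and made-up values used only to show the call.

```r
library(dplyr)

# Hypothetical helper: popular songs of one cluster, restricted to some genres.
top_songs <- function(df, cluster_id, genres, min_popularity = 90) {
  df %>%
    filter(cluster == cluster_id,
           track_popularity >= min_popularity,
           playlist_genre %in% genres) %>%
    select(track_name, track_artist, playlist_genre,
           acousticness, energy, track_popularity) %>%
    distinct(track_name, .keep_all = TRUE) %>%  # one row per track name
    arrange(desc(track_popularity))
}

# Toy rows (made up) to show the call.
toy <- data.frame(
  track_name       = c("A", "A", "B", "C"),
  track_artist     = c("x", "x", "y", "z"),
  playlist_genre   = c("pop", "pop", "rock", "edm"),
  acousticness     = c(0.1, 0.1, 0.2, 0.3),
  energy           = c(0.9, 0.9, 0.8, 0.7),
  track_popularity = c(95, 95, 91, 99),
  cluster          = factor(c(1, 1, 1, 1)),
  stringsAsFactors = FALSE
)

res <- top_songs(toy, 1, c("pop", "rock", "rap"))
res
```

With the real data, a call such as `top_songs(train_kmeans, 2, c("latin", "rock", "pop"))` would be expected to return the same rows as the cluster 2 block, already sorted by popularity.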

Conclusions

Summary:
- This analysis aimed to understand what makes the clusters different from each other, which also led us to the top songs in each cluster
- The analysis was carried out through visual exploration, statistical testing and K-means clustering to arrive at the takeaways below
- To a consumer, this analysis gives an overview of the kind of music to follow on Spotify based on their tastes.
Key Takeaways:
- The three clusters do not vary too much on popularity, but instead vary highly on energy and acousticness.
- The most popular genres turn out to be Pop, Latin and Rock
- Cluster two, with low acousticness and mid level energy, has the largest number of popular songs. One reason for this can be the high danceability associated with cluster 2.
Limitations:
- The number of clusters K was chosen using only the elbow method, due to its reputation. Trying the gap statistic and silhouette methods as well would enhance the quality of the analysis.
- This analysis does not cover predicting the popularity of a song, which would be a good project of its own.
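As a starting point for the silhouette method named in the limitations, the sketch below computes the average silhouette width for several values of k with the cluster package. It runs on simulated data with three well-separated groups, not on the Spotify features, and all object names are assumptions for illustration.

```r
library(cluster)  # recommended package that ships with standard R installs

set.seed(7)
# Simulated data with three well-separated groups; not the Spotify features.
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
d <- dist(x)  # pairwise Euclidean distances

# Average silhouette width for k = 2..6; higher means better-separated clusters.
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil

# k with the highest average silhouette width.
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

On the full training set the distance matrix grows quadratically with the number of rows, so it would probably be necessary to run this on a random subsample.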