Nikita Sankhe
UC BANA 19-20
Dataset:
This dataset was extracted with the spotifyr package and obtained from the rfordatascience GitHub repository.
Problem Statement:
Spotify does a very good job of recommending music to its users, suggesting tracks based on the songs and artists they frequently like. This data set, built via the spotifyr package, contains track names, artists, genres, subgenres and other audio features.
Objective:
The idea behind the project is to use this dataset to :
End Goal:
This analysis aims to provide an understanding of which songs and genres are the most popular.
The idea is to help an end user gain a better understanding of what lies behind the most popular songs on Spotify.
Approach:
#Dataframe
library(knitr)
library(kableExtra)
library(DT)
#Data Manipulation
library(tidyverse)
library(dplyr)
library(tidyr)
#Data Viz
library(ggplot2)
library(GGally)
library(RColorBrewer)
library(viridis)
library(gridExtra)
#K-Means
library(factoextra)
library(fpc)
knitr : Helps display better outputs without any intense coding. The kable function particularly helps in presenting tables, manipulating table styles
kableExtra : In addition to the kable function, the kableExtra library provides formatting functions that control column width and other table styling.
DT : Helps in presenting tables in a clean format, and has the ability to provide filters
GGally : To plot the correlation analysis of variables in matrix form
tidyverse : Tidyverse provides a collection of packages including “dplyr”, “tidyr”, “ggplot2” explained below
RColorBrewer : Provides multiple color palettes to be used in conjunction with ggplot2 visualisations
viridis : Similar to RColorBrewer, helps with color palettes and other cosmetic purposes
gridExtra : Helps in arranging multiple plots on a grid
factoextra : Factoextra is usually used to visualize the output of multivariate data analysis, but in this project I have used it to plot the clusters of K-means algorithm.
fpc : Provides various methods for clustering and cluster validation
Loading Data
spotify <- read.csv("spotify_songs.csv", stringsAsFactors=FALSE)
About the data
## [1] 32833 23
The data set has 32833 rows of observations with 23 variables.
The following information about the variables is provided on the ‘rfordatascience’ website and will help the users to understand the dataset
kable(table_description, caption = "Spotify Dictionary")
| Variable | Description |
|---|---|
| track_id | Song unique ID |
| track_name | Song Name |
| track_artist | Song Artist |
| track_popularity | Song Popularity (0-100) where higher is better |
| track_album_id | Album unique ID |
| track_album_name | Song album name |
| track_album_release_date | Date when album released |
| playlist_name | Name of playlist |
| playlist_id | Playlist ID |
| playlist_genre | Playlist genre |
| playlist_subgenre | Playlist subgenre |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic |
| instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | Duration of song in milliseconds |
Data Cleaning
The following variables each have 5 missing values:
colSums(is.na(spotify))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
which(is.na(spotify$track_name))
## [1] 8152 9283 9284 19569 19812
which(is.na(spotify$track_artist))
## [1] 8152 9283 9284 19569 19812
which(is.na(spotify$track_album_name))
## [1] 8152 9283 9284 19569 19812
spotify <- spotify[-c(8152,9283,9284,19569,19812), ]
str(spotify)
## 'data.frame': 32828 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
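The row removal above relies on indices copied from the `which()` output, which would point at the wrong rows if the input file ever changed. A minimal sketch of an index-free alternative in base R follows; the data frame `songs` and its values are hypothetical stand-ins, not taken from the dataset.

```r
# Toy data frame standing in for `spotify` (hypothetical values)
songs <- data.frame(
  track_name   = c("A", NA, "C", NA),
  track_artist = c("x", NA, "z", NA),
  energy       = c(0.5, 0.7, 0.9, 0.1),
  stringsAsFactors = FALSE
)

# Columns that must be present for a row to be kept
key_cols <- c("track_name", "track_artist")

# complete.cases() returns TRUE for rows with no NA in the given columns
songs_clean <- songs[complete.cases(songs[, key_cols]), ]

nrow(songs_clean)
```

Applied to this report, the same pattern would read `spotify[complete.cases(spotify), ]`, which drops the five incomplete rows without naming them.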
unique(spotify$playlist_genre)
## [1] "pop" "rap" "rock" "latin" "r&b" "edm"
unique(spotify$playlist_subgenre)
## [1] "dance pop" "post-teen pop"
## [3] "electropop" "indie poptimism"
## [5] "hip hop" "southern hip hop"
## [7] "gangster rap" "trap"
## [9] "album rock" "classic rock"
## [11] "permanent wave" "hard rock"
## [13] "tropical" "latin pop"
## [15] "reggaeton" "latin hip hop"
## [17] "urban contemporary" "hip pop"
## [19] "new jack swing" "neo soul"
## [21] "electro house" "big room"
## [23] "pop edm" "progressive electro house"
unique(spotify$key)
## [1] 6 11 1 7 8 5 4 2 0 10 9 3
unique(spotify$mode)
## [1] 1 0
#Changing Data Types
spotify <- spotify %>%
  mutate(
    # Inside mutate() columns can be referenced by name, without the spotify$ prefix
    track_name = as.factor(track_name),
    track_artist = as.factor(track_artist),
    playlist_genre = as.factor(playlist_genre),
    playlist_subgenre = as.factor(playlist_subgenre),
    key = as.factor(key),
    mode = as.factor(mode),
    track_popularity = as.numeric(track_popularity),
    duration_ms = as.numeric(duration_ms)
  )
spotify <- spotify %>% select(2,3,4,10,12:22)
Summary of Cleaned Dataset
dim(spotify)
## [1] 32828 15
From the summary, it can be seen that the audio features match the descriptions given in the features table, in both values and ranges.
For speechiness, acousticness, instrumentalness and liveness, however, the mean and median are further apart than for the other variables, so we will look at plots of their distributions in the EDA section.
kable(summary(spotify)) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = F,
font_size = 12,
position = "left") %>%
scroll_box(width = "100%",
height = "400px")
| track_name | track_artist | track_popularity | playlist_genre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Poison : 22 | Martin Garrix : 161 | Min. : 0.00 | edm :6043 | Min. :0.0000 | Min. :0.000175 | 1 : 4010 | Min. :-46.448 | 0:14256 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000000 | Min. :0.0000 | Min. :0.0000 | Min. : 0.00 |
| Breathe : 21 | Queen : 136 | 1st Qu.: 24.00 | latin:5153 | 1st Qu.:0.5630 | 1st Qu.:0.581000 | 0 : 3454 | 1st Qu.: -8.171 | 1:18572 | 1st Qu.:0.0410 | 1st Qu.:0.0151 | 1st Qu.:0.0000000 | 1st Qu.:0.0927 | 1st Qu.:0.3310 | 1st Qu.: 99.96 |
| Alive : 20 | The Chainsmokers: 123 | Median : 45.00 | pop :5507 | Median :0.6720 | Median :0.721000 | 7 : 3352 | Median : -6.166 | NA | Median :0.0625 | Median :0.0804 | Median :0.0000161 | Median :0.1270 | Median :0.5120 | Median :121.98 |
| Forever : 20 | David Guetta : 110 | Mean : 42.48 | r&b :5431 | Mean :0.6549 | Mean :0.698603 | 9 : 3027 | Mean : -6.720 | NA | Mean :0.1071 | Mean :0.1754 | Mean :0.0847599 | Mean :0.1902 | Mean :0.5106 | Mean :120.88 |
| Paradise: 19 | Don Omar : 102 | 3rd Qu.: 62.00 | rap :5743 | 3rd Qu.:0.7610 | 3rd Qu.:0.840000 | 11 : 2994 | 3rd Qu.: -4.645 | NA | 3rd Qu.:0.1320 | 3rd Qu.:0.2550 | 3rd Qu.:0.0048300 | 3rd Qu.:0.2480 | 3rd Qu.:0.6930 | 3rd Qu.:133.92 |
| Stay : 19 | Drake : 100 | Max. :100.00 | rock :4951 | Max. :0.9830 | Max. :1.000000 | 2 : 2827 | Max. : 1.275 | NA | Max. :0.9180 | Max. :0.9940 | Max. :0.9940000 | Max. :0.9960 | Max. :0.9910 | Max. :239.44 |
| (Other) :32707 | (Other) :32096 | NA | NA | NA | NA | (Other):13164 | NA | NA | NA | NA | NA | NA | NA | NA |
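The gap between mean and median noted for variables such as instrumentalness is a typical sign of right skew. As a small illustration (the vector `x` below is made-up data, not drawn from the dataset), a feature that is zero for most tracks but large for a few has a median at zero and a noticeably larger mean:

```r
set.seed(7)  # arbitrary seed so the toy numbers are repeatable

# Hypothetical right-skewed feature: mostly 0, some tiny values, a few large ones
x <- c(rep(0, 70), runif(20, 0, 0.01), runif(10, 0.5, 1))

# The median ignores the long right tail; the mean is pulled towards it
median(x)
mean(x)
```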
Understanding Attributes
#Plotting numeric values
spotify %>%
keep(is.numeric) %>% #hist only for numeric
gather() %>% #converts to key value
ggplot(aes(value, fill = key)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(alpha = 0.7, bins = 30) +
ggtitle("Distribution of Audio Attributes") +
scale_x_continuous(guide = guide_axis(check.overlap = TRUE)) + #values are continuous, so a continuous scale is needed
theme(plot.title = element_text(hjust = 0.5))
As the histograms show, many of the attributes are skewed, which is reflected in the boxplots as well.
Instrumentalness has most of its values close to 0, which explains its heavily right-skewed histogram and compressed boxplot.
#Boxplot for numeric values
spotify %>%
keep(is.numeric) %>% #boxplot only for numeric
gather() %>% #converts to key value
ggplot(aes(value, fill = key)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot(alpha = 0.7) +
ggtitle("Boxplots of Attributes") +
theme(plot.title = element_text(hjust = 0.5)) +
coord_flip()
Understanding Genres
To understand genres better, genres are plotted by their average popularity.
From Plot 2, it can be seen that the maximum number of popular songs belong to :
#Plotting Genres
p1 <- ggplot(spotify, aes(x=factor(playlist_genre))) +
geom_bar(width=0.7,
aes(fill=playlist_genre),
alpha=0.7) +
scale_fill_brewer(palette = "Paired") +
ggtitle("Plot 1 : Genre Count") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Genre")
avg_popularity <- spotify %>%
select(track_popularity, playlist_genre) %>%
group_by(playlist_genre) %>%
summarise("average_popularity" = round(mean(track_popularity)))
p2 <- ggplot(data=avg_popularity,
mapping = aes(x = (playlist_genre),
y = average_popularity,
fill = playlist_genre)) +
geom_col(width = 0.7,alpha=0.7) +
scale_fill_brewer(palette = "Paired") +
ggtitle("Plot 2 : Genres & Popularity") +
xlab("Genre") + ylab("Mean Popularity") +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1, p2, nrow=2, ncol=1)
Keys & Mode
In music, the “key” of a track identifies the scale, an ascending series of notes, on which its melody is based, and thereby the number of sharps or flats used.
But is the key of a track related to its mode?
Since key and mode are both categorical variables, a chi-square test of independence can show whether they are associated.
On applying the chisq.test function to the two variables, the p-value is found to be less than 2.2e-16, far below an \(\alpha\) of 0.05.
Hence, the null hypothesis that key and mode are independent is rejected.
Therefore, key and mode are associated: the distribution of keys differs between major and minor tracks.
chisq.test(spotify$key, spotify$mode)
##
## Pearson's Chi-squared test
##
## data: spotify$key and spotify$mode
## X-squared = 3046.2, df = 11, p-value < 2.2e-16
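To make the test statistic less of a black box, the sketch below reproduces Pearson's chi-squared calculation by hand on a small contingency table and cross-checks it against `chisq.test`. The counts are invented for illustration; only the procedure mirrors the test above, where 12 keys and 2 modes give (12 - 1) × (2 - 1) = 11 degrees of freedom.

```r
# Hypothetical 2 x 3 contingency table (rows: mode, columns: key)
counts <- matrix(c(30, 10, 20,
                   10, 30, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(mode = c("0", "1"), key = c("0", "1", "2")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic, degrees of freedom and upper-tail p-value
x2 <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# Cramér's V: an effect size between 0 and 1 that does not grow with sample size
cramers_v <- sqrt(x2 / (sum(counts) * (min(dim(counts)) - 1)))

# Cross-check against the built-in test
res <- chisq.test(counts)
c(by_hand = x2, built_in = unname(res$statistic))
```

With roughly 33,000 rows even a modest association produces a tiny p-value, so an effect size is a useful companion. Plugging the reported values into the same formula, sqrt(3046.2 / 32828), gives a Cramér's V of about 0.30.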
From the chart below it is seen that songs have mode 1 (major) more often than mode 0 (minor).
Key 1 (C♯/D♭) is the most frequently occurring key in songs.
#Plotting Mode & Keys
g1 <- ggplot(spotify,aes(mode)) +
geom_bar(aes(fill=mode),alpha = 0.6) +
ggtitle("Modes") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Dark2")
g2 <- ggplot(spotify,aes(key)) +
geom_bar(aes(fill=key), alpha = 0.6) +
ggtitle("Keys") +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(g1,g2,ncol=1)
Approach for Clustering:
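#Splitting Data into Train & Test
#Note: no seed is set before sample(), so the split and the numbers derived from it change between runs
index <- sample(nrow(spotify), 0.7*nrow(spotify))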
train_kmeans <- spotify[index,]
test_kmeans <- spotify[-index,]
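Standardization is an important step in data preprocessing because K-Means relies on distances, so features measured on larger scales would otherwise dominate the result. The scale function centers each numeric column to a mean of 0 and rescales it to a standard deviation of 1; it does not confine the values to a fixed interval. Therefore, I have scaled the data before implementing K-Means Clustering.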
#Scaling Data
train_scale <- scale(train_kmeans[,-c(1,2,4,7,9)])
test_scale <- scale(test_kmeans[,-c(1,2,4,7,9)])
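As a quick check of what `scale()` does, the sketch below standardises a toy matrix and inspects the result. It also shows one common way to scale held-out data with the training centre and spread, so that both sets share the same units; this is an alternative to scaling the test set on its own, not what the code above does. The matrices `train` and `test` are simulated and purely illustrative.

```r
set.seed(1)  # arbitrary seed, for a repeatable toy example

# Simulated stand-ins for the numeric training and test columns
train <- matrix(rnorm(200, mean = 50, sd = 10), ncol = 2,
                dimnames = list(NULL, c("a", "b")))
test  <- matrix(rnorm(60, mean = 50, sd = 10), ncol = 2,
                dimnames = list(NULL, c("a", "b")))

train_z <- scale(train)

# Each column now has mean 0 and standard deviation 1,
# but individual values are not restricted to [-1, 1]
round(colMeans(train_z), 10)
apply(train_z, 2, sd)
range(train_z)

# Reuse the training centre and spread when scaling held-out data
test_z <- scale(test,
                center = attr(train_z, "scaled:center"),
                scale  = attr(train_z, "scaled:scale"))
```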
Correlation Plot Insights:
The plot below gives the following top few insights:
# Correlation Plot
ggcorr(train_scale,
low = "blue3",
high = "red") +
ggtitle("Correlation Plot") +
theme(plot.title = element_text(hjust = 0.5))
K-Means clustering is a simple and quick algorithm which deals with large data sets easily.
The idea behind K-Means is to group the data into clusters such that the variation inside the clusters (also known as total within-cluster sum of squares or WSS) is minimal, and the variation between the clusters is maximal.
This helps in understanding which songs tend to be popular in which groups.
General K-Means Process
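As a rough illustration of the general process, the sketch below implements the textbook version of the algorithm (Lloyd's algorithm): pick initial centroids, assign each point to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing. The helper `simple_kmeans` is hypothetical and written only for this example; R's `kmeans` uses the Hartigan-Wong algorithm by default, so this is not what the report itself runs.

```r
# Minimal k-means (Lloyd's algorithm); `simple_kmeans` is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1. Pick k distinct observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # 2. Squared Euclidean distance from every point to every centroid,
    #    then assign each point to the nearest one
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # 4. Stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # 3. Move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Two well-separated simulated blobs of 20 points each
set.seed(42)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit_toy <- simple_kmeans(pts, k = 2)
table(fit_toy$cluster)
```

Because the result depends on the random starting centroids, practical implementations run several starts and keep the best one, which is what the `nstart` argument of `kmeans` does.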
The elbow method has been used to determine the number of clusters.
One reason for using this method is that it gives a data-driven basis for choosing the number of clusters rather than fixing it arbitrarily.
In this method, the WSS is plotted against the number of clusters k. The location of a bend (knee) in the plot is considered an indicator of the appropriate number of clusters.
With the elbow method, the ideal number of clusters is identified as 3. Therefore kmeans is implemented with 3 centers.
The total within-cluster sum of square (wss) measures the compactness of the clustering and we want it to be as small as possible.
#Elbow Method
wss <- (nrow(train_scale)-1)*sum(apply(train_scale,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(train_scale,centers=i)$withinss)
plot(1:15, wss, type="b", pch=20, frame = FALSE, xlab="Number of Clusters K",ylab="Total WSS",main="Optimal Number of Clusters")
Fitting K Means Model
The k-means model is fit with 3 centers; nstart = 25 runs the algorithm from 25 random initial configurations and keeps the best one.
As seen from the output, this model results in 3 clusters of sizes 7964, 10796 and 4219.
#Fit kmeans
set.seed(13437885)
fit <- kmeans(train_scale, centers = 3, nstart = 25)
fit$size
## [1] 7964 10796 4219
#Plotting Kmeans
fviz_cluster(fit,
geom = c("point", "text"),
data = train_scale,
palette = "Set3",
main = "K Means Clustering with 3 Centers",
alpha = 0.9) + theme(plot.title = element_text(hjust = 0.5))
The clusters are extracted and added to the data to do some descriptive statistics at the cluster level. The datatable below is a result of clustering.
The rightmost column shows the cluster each song belongs to, which will help in further analysis to understand the features of the clusters.
#Assigning cluster to df
train_kmeans$cluster <- as.factor(fit$cluster)
datatable(head(train_kmeans,5),options = list(dom = 't',scrollX = T,autoWidth = TRUE))
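The train/test split made earlier also produced `test_scale`, which is not used further in this report. As a sketch of how held-out rows could be given a cluster label, each row can be assigned to its nearest fitted centroid. The helper `assign_cluster` and the toy centroids below are hypothetical; base R's `kmeans` does not ship a predict method.

```r
# Assign each row of `newdata` to the closest row of `centers`
# (`assign_cluster` is a hypothetical helper written for this example)
assign_cluster <- function(newdata, centers) {
  newdata <- as.matrix(newdata)
  # Squared Euclidean distance from every row to every centroid
  d <- vapply(
    seq_len(nrow(centers)),
    function(j) rowSums(sweep(newdata, 2, centers[j, ])^2),
    numeric(nrow(newdata))
  )
  # Force an n x k matrix so a single-row input also works
  d <- matrix(d, nrow = nrow(newdata))
  max.col(-d, ties.method = "first")
}

# Toy centroids and new points (made-up values)
centers <- rbind(c(0, 0), c(5, 5))
new_points <- rbind(c(0.2, -0.1), c(4.8, 5.3), c(1, 1))
assign_cluster(new_points, centers)
```

With the objects from this report, the call would look like `assign_cluster(test_scale, fit$centers)`, ideally with the test data standardised using the training means and standard deviations.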
Interpreting the Quality of Clusters
The between-cluster sum of squares (BSS) is 51460.41.
round(fit$betweenss,2)
## [1] 51460.41
The ratio BSS/TSS, expressed as a percentage, is 22.4%. To get a higher value, we would need to increase the number of clusters. But in this case, we found the number of clusters to be ideal at 3, so we’ll stay with it.
round((fit$betweenss / fit$totss * 100),2)
## [1] 22.4
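The BSS/TSS ratio rests on the decomposition TSS = WSS + BSS: the total sum of squares splits into a within-cluster part and a between-cluster part. The sketch below checks that identity on R's built-in `iris` measurements, chosen only because they are always available; the seed is arbitrary and the resulting percentage says nothing about the Spotify data.

```r
# Standardise the four numeric iris columns, as done for the audio features
x <- scale(iris[, 1:4])

set.seed(123)  # arbitrary seed for a repeatable fit
km <- kmeans(x, centers = 3, nstart = 25)

# Total sum of squares: squared distances of all points to the overall mean
tss <- sum(scale(x, scale = FALSE)^2)
wss <- km$tot.withinss   # sum over clusters of within-cluster squared distances
bss <- km$betweenss      # squared distances of centroids to the overall mean, weighted by size

all.equal(tss, km$totss)   # matches the value kmeans reports
all.equal(tss, wss + bss)  # the decomposition holds

# Share of total variation explained by the clustering, in percent
round(100 * bss / tss, 2)
```

Since TSS is fixed by the data, a higher percentage can only come from a smaller WSS, which is why adding clusters always pushes the ratio up.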
Prediction Strength
The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number of clusters that leads to a prediction strength above 0.8 or 0.9.
Although the BSS/TSS percentage is low, 3 is the largest number of clusters with a prediction strength above the 0.8 cutoff, so we continue with 3 clusters.
#Prediction Strength
prediction.strength(train_scale, Gmin=2, Gmax=5, M=10,cutoff=0.8)
## Prediction strength
## Clustering method: kmeans
## Maximum number of clusters: 5
## Resampled data sets: 10
## Mean pred.str. for numbers of clusters: 1 0.8710546 0.8671857 0.6807116 0.5561761
## Cutoff value: 0.8
## Largest number of clusters better than cutoff: 3
Cluster Behaviour Analysis
The behaviours of the clusters can be outlined as below:
#Grouping the Clusters by Mean
cluster_mean <- train_kmeans %>%
group_by(cluster) %>%
summarise_if(is.numeric, "mean") %>%
mutate_if(is.numeric, .funs = "round", digits = 2)
datatable(cluster_mean, options = list(dom = 't',scrollX = T,autoWidth = TRUE))
#Boxplots for Clusters
b1 <- train_kmeans %>%
ggplot(aes(x = cluster,
y = energy,
fill = cluster)) +
geom_boxplot() +
scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) +
ggtitle("Clusters and Energy") +
theme(plot.title = element_text(hjust = 0.5))
b2 <- train_kmeans %>%
ggplot(aes(x = cluster,
y = acousticness,
fill = cluster)) +
geom_boxplot() +
scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) +
ggtitle("Clusters and Acousticness") +
theme(plot.title = element_text(hjust = 0.5))
b3 <- train_kmeans %>%
ggplot(aes(x = cluster,
y = danceability,
fill = cluster)) +
geom_boxplot() +
scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) +
ggtitle("Clusters and Danceability") +
theme(plot.title = element_text(hjust = 0.5))
b4 <- train_kmeans %>%
ggplot(aes(x = cluster,
y = valence,
fill = cluster)) +
geom_boxplot() +
scale_fill_viridis(option = "D",discrete = TRUE, alpha=0.5) +
ggtitle("Clusters and Valence") +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(b1, b2, b3, b4, nrow=2, ncol=2)
Individual Cluster Analysis
For the cluster analysis, the baseline for popularity is set at 90 and above. The popular songs in each cluster are shown in the tables below.
Cluster 1 Insights:
Cluster 1, the second largest of the 3 clusters, is characterised by its liveness and energy. Since pop, rock and rap are the most popular genres, the table for cluster 1 is based on those.
As expected, the most popular songs in cluster 1 are high on energy and low on acousticness.
#Analysis on Cluster 1
c1 <- train_kmeans[which(train_kmeans$cluster==1), ]
#Grouping cluster by popularity
avg_pop <- c1 %>%
select(track_popularity, playlist_genre) %>%
group_by(playlist_genre) %>%
summarise("average_popularity" = round(mean(track_popularity)))
#Plotting genres across popularity
x1 <- ggplot(data=avg_pop,
mapping = aes(x = (playlist_genre),
y = average_popularity,
fill = playlist_genre)) +
geom_col(width = 0.7,alpha=0.7) +
scale_fill_brewer(palette = "Spectral") +
ggtitle("Cluster 1 - Genres & Popularity") +
xlab("Genre") + ylab("Mean Popularity") +
theme(plot.title = element_text(hjust = 0.5))
x1
n <- c1 %>%
select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>%
subset(track_popularity >= 90 & playlist_genre %in% c("rap","rock","pop")) %>%
distinct(track_name,.keep_all = TRUE)
datatable(n, caption = 'Cluster 1: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))
Cluster 2 Insights:
Cluster 2 has the most popular tracks purely because of its size, and its tracks also have the highest danceability, energy and valence.
Its most popular genres are pop, latin and rock.
The cluster 2 songs are high on energy and low on acousticness.
#Analysis on Cluster 2
c2 <- train_kmeans[which(train_kmeans$cluster==2), ]
#Grouping cluster by popularity
avg_pop <- c2 %>%
select(track_popularity, playlist_genre) %>%
group_by(playlist_genre) %>%
summarise("average_popularity" = round(mean(track_popularity)))
#Plotting genres across popularity
x2 <- ggplot(data=avg_pop,
mapping = aes(x = (playlist_genre),
y = average_popularity,
fill = playlist_genre)) +
geom_col(width = 0.7,alpha=0.7) +
scale_fill_brewer(palette = "Spectral") +
ggtitle("Cluster 2 - Genres & Popularity") +
xlab("Genre") + ylab("Mean Popularity") +
theme(plot.title = element_text(hjust = 0.5))
x2
n <- c2 %>%
select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>%
subset(track_popularity >= 90 & playlist_genre %in% c("latin","rock","pop")) %>%
distinct(track_name,.keep_all = TRUE)
datatable(n, caption = 'Cluster 2: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))
Cluster 3 Insights
Cluster 3 is the smallest, and its tracks have high acousticness, high danceability and mid-level energy compared to the other clusters.
Its most popular genres are pop, latin and rap.
The popular songs are high on acousticness with average energy.
#Analysis on Cluster 3
c3 <- train_kmeans[which(train_kmeans$cluster==3), ]
#Grouping cluster by popularity
avg_pop <- c3 %>%
select(track_popularity, playlist_genre) %>%
group_by(playlist_genre) %>%
summarise("average_popularity" = round(mean(track_popularity)))
#Plotting genres across popularity
x3 <- ggplot(data=avg_pop,
mapping = aes(x = (playlist_genre),
y = average_popularity,
fill = playlist_genre)) +
geom_col(width = 0.7,alpha=0.7) +
scale_fill_brewer(palette = "Spectral") +
ggtitle("Cluster 3 - Genres & Popularity") +
xlab("Genre") + ylab("Mean Popularity") +
theme(plot.title = element_text(hjust = 0.5))
x3
n <- c3 %>%
select(track_name,track_artist,playlist_genre,acousticness,energy,track_popularity) %>%
subset(track_popularity >= 90 & playlist_genre %in% c("latin","rap","pop")) %>%
distinct(track_name,.keep_all = TRUE)
datatable(n, caption = 'Cluster 3: Top Songs', options = list(scrollX = T, autoWidth = TRUE, order = list((list(6, 'desc')))))
Summary:
Key Takeways:
Limitations: