Spotify is one of the biggest digital music, podcast, and video streaming services in the world, giving access to millions of songs and other content from artists all over the world. Not only does Spotify give us access to good songs everywhere (at work, at home, in the car), it has also introduced us to artists we would never have listened to before and to genres we had never experienced. Spotify uses advanced technology to track and identify each song uploaded to its platform.
The Spotify database provides an interesting look into its listening data. It records not just the popularity of tracks, but also audio features of the tracks in its library. In this project, we analyze a track’s popularity based on several audio features provided in the dataset and try to answer the question ‘Can we predict a track’s popularity from key features about the song?’ We also carry out a custom analysis based on a user’s listening profile, which should enable Spotify to stock more tracks similar to its hits and let go of songs that are not popular among listeners.
Consider yourself a Spotify user. Wouldn’t you be impressed to get a list of the most popular songs tailored to your taste without having to search for them manually? And wouldn’t you be a happy, returning customer if you kept getting a list of the latest top music, divided by genre, with easy access to recently released tracks? Our goal is to provide easy access to popular songs, and that is why you, as a Spotify user, should be interested.
We analyze the relationship between popularity and different audio features and genres through a detailed EDA. We then perform a clustering analysis using the K-means method to provide song recommendations based on a user’s recent listening on Spotify. We extract a pattern from the clusters and determine how each cluster differs from the others, which helps us identify the most and least popular clusters. Finally, we create a ‘Song Recommendation System’.
K-means clustering provides information about customer listening behavior, which should help Spotify upsell, cross-sell, or combine both to increase profit. Using this method, we analyze the relationship between the songs in a cluster and their popularity.
After clustering the songs, we build a song recommendation system that gives listeners effective suggestions for the next best songs according to their taste.
The consumer of our analysis would be Spotify’s programming team. Our analysis should enable the team to upsell, cross-sell, or combine both to increase profit. A better understanding of the different clusters should allow Spotify to target its content distribution better, leading to a reduced churn rate.
For example, if the team knows that 30% of customers who listen to track A also listen to track B, Spotify can market track B to customers shortly after they listen to track A, to speed up that process and capture those who might not otherwise have considered listening to track B. Customers who do not yet know track B will also appreciate the suggestion. This is how our analysis would help Spotify provide better services to its consumers and stay ahead of the curve.
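A figure like the "30% of track A listeners also listen to track B" in the example is a conditional share. The sketch below shows how it could be computed from a listening log. The `plays` table and its `user` and `track` columns are hypothetical and are not part of the dataset used in this project.

```r
# Hypothetical listening log: one row per (user, track) pair
plays <- data.frame(
  user  = c("u1", "u1", "u2", "u3", "u3", "u4", "u5"),
  track = c("A",  "B",  "A",  "A",  "B",  "B",  "A"),
  stringsAsFactors = FALSE
)

listeners_a <- unique(plays$user[plays$track == "A"])
listeners_b <- unique(plays$user[plays$track == "B"])

# Share of track A listeners who also listened to track B
share_b_given_a <- length(intersect(listeners_a, listeners_b)) / length(listeners_a)
share_b_given_a
```

In this toy log, four users played track A and two of them also played track B, so the share is 0.5.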
#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("corrplot")
#install.packages("factoextra")
#install.packages("plyr")
#install.packages("RColorBrewer")
#install.packages("funModeling")
#install.packages("knitr")
# plyr is loaded first so that dplyr's verbs (mutate, summarise, ...) are not masked by plyr's
library(plyr)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(RColorBrewer)
library(funModeling)
library(knitr)
tidyverse - for interacting with data through subsetting, transformation, visualization, etc.
dplyr - for data manipulation in R by combining, selecting, grouping, subsetting and transforming all or parts of dataset
ggplot2 - for declaratively creating graphics, based on The Grammar of Graphics
plotly - for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js
corrplot - for visualizing correlation matrices and confidence intervals
factoextra - to extract and visualize the output of multivariate data analyses and to simplify some clustering steps
plyr - to split data apart, transform the pieces, and combine them again, with simple control over the input and output data format
RColorBrewer - to choose sensible colour schemes for figures in R
funModeling - for quick exploratory plots such as plot_num, which draws histograms of all numeric variables
knitr - for rendering this report
The dataset used for this project is the Spotify song list prepared by Zaheen Hamidani, which we obtained from Kaggle (https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db).
The dataset was originally created by Zaheen Hamidani and uploaded to Kaggle in July 2019. Alternatively, the data is available through the spotifyr R package (version 2.1.1). Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get data or general metadata around songs from Spotify’s API. It lets you enter an artist’s name and retrieve their entire discography in seconds, along with Spotify’s audio features and track/album popularity metrics.
The primary purpose of the data was to analyze the relationship between valence and all the other measures that the Spotify API gives for every track. Approximately 10,000 songs were selected per genre, and there are 26 genres. However, the same data can also be used to analyze other statistics and obtain other useful information.
There is not much peculiarity in the data. It is moderately clean, with only 15 missing values. Since every track is unique in some sense, we have not imputed the missing values and have simply removed the affected rows.
First, we load the Spotify songs dataset into R to kick off the analysis. The dataset is imported using the read.csv function and saved as “spotify”.
set.seed(13232767)
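# stringsAsFactors = TRUE reads text columns as factors, as in the output below (no longer the default since R 4.0.0)
spotify <- read.csv("spotify_songs.csv", stringsAsFactors = TRUE)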
glimpse(spotify)
## Observations: 32,833
## Variables: 23
## $ track_id <fct> 6f807x0ima9a1j3VPbc7VN, 0r7CVbZTWZgbT...
## $ track_name <fct> I Don't Care (with Justin Bieber) - L...
## $ track_artist <fct> Ed Sheeran, Maroon 5, Zara Larsson, T...
## $ track_popularity <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 6...
## $ track_album_id <fct> 2oCs0DGTsRO98Gh5ZSl2Cx, 63rPSO264uRjW...
## $ track_album_name <fct> I Don't Care (with Justin Bieber) [Lo...
## $ track_album_release_date <fct> 2019-06-14, 2019-12-13, 2019-07-05, 2...
## $ playlist_name <fct> Pop Remix, Pop Remix, Pop Remix, Pop ...
## $ playlist_id <fct> 37i9dQZF1DXcZDD7cfEKhW, 37i9dQZF1DXcZ...
## $ playlist_genre <fct> pop, pop, pop, pop, pop, pop, pop, po...
## $ playlist_subgenre <fct> dance pop, dance pop, dance pop, danc...
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0....
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0....
## $ key <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, ...
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.67...
## $ mode <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.035...
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0...
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-0...
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.083...
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0....
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 12...
## $ duration_ms <int> 194754, 162600, 176616, 169093, 18905...
Our dataset has 32,833 observations and 23 variables.
Using the colSums function, we observe that there are 5 NAs in each of track_name, track_artist and track_album_name. We remove the affected observations using the na.omit function.
colSums(is.na(spotify))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
spotify <- na.omit(spotify)
Filtering for unique tracks by removing all duplicated tracks with the duplicated function:
spotify <- spotify[!duplicated(spotify$track_id),]
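To make the effect of these two cleaning steps concrete, here is a minimal sketch on a made-up table (the `tracks` data frame and its values are hypothetical). It shows that `na.omit` drops any row with a missing value and that `duplicated` keeps only the first occurrence of each `track_id`.

```r
# Hypothetical table: t1 appears in two playlists, t3 has a missing name
tracks <- data.frame(
  track_id       = c("t1", "t2", "t1", "t3", "t4"),
  track_name     = c("Song A", "Song B", "Song A", NA, "Song D"),
  playlist_genre = c("pop", "rap", "edm", "rock", "latin"),
  stringsAsFactors = FALSE
)

na_counts <- colSums(is.na(tracks))      # missing values per column

tracks_clean <- na.omit(tracks)          # drops the t3 row
tracks_clean <- tracks_clean[!duplicated(tracks_clean$track_id), ]  # keeps the first t1 row only
tracks_clean
```

One consequence worth keeping in mind: a track that appears in several playlists keeps only the genre of the playlist in which it is listed first ("pop" for t1 in this sketch).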
Converting key, mode, genre and subgenre to factors, since these columns hold categorical values:
spotify <- spotify %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))
Converting duration_ms to duration in minutes (duration_min), since minutes are easier to interpret:
spotify <- spotify %>% mutate(duration_min = duration_ms/60000)
To explore the distribution of popularity, we create a new variable that divides popularity into 4 groups for the cluster analysis. As written, the first rule requires track_popularity > 0, so tracks with a popularity of exactly 0 fall through to the last rule and are counted in group 4:
spotify <- spotify %>%
  mutate(popularity_group = case_when(
    track_popularity > 0 & track_popularity < 20 ~ 1,     # excludes popularity == 0
    track_popularity >= 20 & track_popularity < 40 ~ 2,
    track_popularity >= 40 & track_popularity < 60 ~ 3,
    TRUE ~ 4                                              # 60 and above, and also popularity == 0
  ))
table(spotify$popularity_group)
##
## 1 2 3 4
## 4182 6162 8975 9033
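The grouping rules are easiest to check at their boundaries. The sketch below replays the same rules in base R on a few hand-picked popularity values and compares them with a `cut()`-based alternative that uses half-open bins. The alternative is only an illustration and is not what produced the counts above.

```r
popularity <- c(0, 5, 19, 20, 39, 40, 59, 60, 100)

# Same rules as the case_when() call above, written with nested ifelse()
group_as_written <- ifelse(popularity > 0 & popularity < 20, 1,
                    ifelse(popularity >= 20 & popularity < 40, 2,
                    ifelse(popularity >= 40 & popularity < 60, 3, 4)))

# Alternative: bins [0,20), [20,40), [40,60), [60,100]
group_with_cut <- cut(popularity, breaks = c(0, 20, 40, 60, 100),
                      right = FALSE, include.lowest = TRUE, labels = FALSE)

data.frame(popularity, group_as_written, group_with_cut)
```

The two versions agree everywhere except at popularity 0, which the original rules place in group 4 and the `cut()` version places in group 1.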
We have removed track_id, track_album_id and playlist_id from the dataset since they are not useful for our analysis. These IDs are kept in the Spotify dataset only to uniquely identify a track, album or playlist in the database.
spotify <- spotify %>% select(-c(track_id, track_album_id, playlist_id))
summary(spotify)
## track_name track_artist track_popularity
## Breathe : 18 Queen : 130 Min. : 0.00
## Paradise: 17 Martin Garrix : 87 1st Qu.: 21.00
## Poison : 16 Don Omar : 84 Median : 42.00
## Alive : 15 David Guetta : 81 Mean : 39.34
## Forever : 14 Dimitri Vegas & Like Mike: 68 3rd Qu.: 58.00
## Stay : 14 Drake : 68 Max. :100.00
## (Other) :28258 (Other) :27834
## track_album_name track_album_release_date
## Greatest Hits : 135 2020-01-10: 201
## Ultimate Freestyle Mega Mix: 42 2013-01-01: 189
## Gold : 34 2019-11-22: 185
## Rock & Rios (Remastered) : 29 2019-12-06: 184
## Asian Dreamer : 20 2019-11-15: 183
## Trip Stories : 20 2008-01-01: 176
## (Other) :28072 (Other) :27234
## playlist_name playlist_genre
## Indie Poptimism : 294 edm :4877
## Permanent Wave : 223 latin:4136
## Hard Rock Workout : 211 pop :5132
## Southern Hip Hop : 174 r&b :4504
## post teen pop : 159 rap :5398
## Urban Contemporary: 157 rock :4305
## (Other) :27134
## playlist_subgenre danceability energy
## southern hip hop : 1582 Min. :0.0000 Min. :0.000175
## indie poptimism : 1547 1st Qu.:0.5610 1st Qu.:0.579000
## neo soul : 1478 Median :0.6700 Median :0.722000
## progressive electro house: 1460 Mean :0.6534 Mean :0.698372
## electro house : 1416 3rd Qu.:0.7600 3rd Qu.:0.843000
## gangster rap : 1314 Max. :0.9830 Max. :1.000000
## (Other) :19555
## key loudness mode speechiness
## 1 : 3436 Min. :-46.448 0:12318 Min. :0.0000
## 0 : 3001 1st Qu.: -8.310 1:16034 1st Qu.:0.0410
## 7 : 2907 Median : -6.261 Median :0.0626
## 9 : 2631 Mean : -6.818 Mean :0.1079
## 11 : 2577 3rd Qu.: -4.709 3rd Qu.:0.1330
## 2 : 2478 Max. : 1.275 Max. :0.9180
## (Other):11322
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0143 1st Qu.:0.0000000 1st Qu.:0.0926 1st Qu.:0.3290
## Median :0.0797 Median :0.0000206 Median :0.1270 Median :0.5120
## Mean :0.1772 Mean :0.0911294 Mean :0.1910 Mean :0.5104
## 3rd Qu.:0.2600 3rd Qu.:0.0065725 3rd Qu.:0.2490 3rd Qu.:0.6950
## Max. :0.9940 Max. :0.9940000 Max. :0.9960 Max. :0.9910
##
## tempo duration_ms duration_min popularity_group
## Min. : 0.00 Min. : 4000 Min. :0.06667 Min. :1.000
## 1st Qu.: 99.97 1st Qu.:187741 1st Qu.:3.12902 1st Qu.:2.000
## Median :121.99 Median :216933 Median :3.61555 Median :3.000
## Mean :120.96 Mean :226575 Mean :3.77624 Mean :2.806
## 3rd Qu.:134.00 3rd Qu.:254975 3rd Qu.:4.24959 3rd Qu.:4.000
## Max. :239.44 Max. :517810 Max. :8.63017 Max. :4.000
##
Each row represents one song, and the columns contain attributes of each song. The attributes are as follows:
track_id - Track id on spotify
track_name - Title of the song
track_artist - Name of the artist
track_popularity - Measure of popularity from 0 to 100, based on the number of plays of the track
acousticness - Measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability - Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
duration_ms - The duration of the track in milliseconds(ms).
energy - Measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
instrumentalness - Measure whether a track contains no vocals. ‘Ooh’ and ‘aah’ sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly ‘vocal’. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
key - Estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
liveness - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness - Overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness - Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
tempo - Overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
valence - Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
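The key and mode columns are easier to read once decoded. The following sketch maps the integer codes described above to pitch names. The helper name `key_name` is ours, and sharps and flats are written as `#` and `b` to keep the code ASCII-only.

```r
# Pitch classes in order: 0 = C, 1 = C#/Db, ..., 11 = B
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

key_name <- function(key, mode) {
  # key is 0-based in the data while R vectors are 1-based; -1 means no key was detected
  pitch <- pitch_classes[ifelse(key >= 0, key + 1, NA)]
  ifelse(is.na(pitch), NA_character_,
         paste(pitch, ifelse(mode == 1, "major", "minor")))
}

key_name(key = c(6, 11, 1, -1), mode = c(1, 1, 0, 1))
```

The first three inputs are the key and mode values of the first three tracks shown in the glimpse output; the fourth illustrates the "no key detected" case.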
The best way to uncover useful information that is not self-evident in the data is to perform EDA thoroughly. EDA helps us make sense of our data, so it is essential to explore a dataset before performing a formal analysis or building a model. It helps us better understand the patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables. We have used histograms, boxplots and a correlation plot to find such answers.
df1 <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
corrplot(cor(df1))
The plot shows that popularity does not have a strong correlation with any of the track features. However, some features are strongly correlated with each other, indicating multicollinearity in this dataset, which may make it unsuitable for some classification algorithms.
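Strongly correlated pairs can also be listed directly from the correlation matrix instead of being read off the plot. The sketch below does this on simulated data in which loudness is deliberately constructed to follow energy. The toy values and the 0.7 cut-off are assumptions chosen for illustration only.

```r
set.seed(1)
n <- 200
energy   <- runif(n)
loudness <- -20 + 18 * energy + rnorm(n, sd = 1)  # built to track energy closely
valence  <- runif(n)                              # independent of the other two
toy <- data.frame(energy, loudness, valence)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the chosen threshold
idx <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
high_pairs
```

Applied to `cor(df1)`, the same three lines would return the feature pairs responsible for the multicollinearity.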
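Analyzing the distribution of the audio features:
# Select the numeric audio features by name rather than by column position
spotify_hist <- spotify %>%
  select(danceability, energy, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, duration_min)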
plot_num(spotify_hist)
From the histograms, we can observe that:
The majority (85.4%) of observations have an instrumentalness value no larger than 0.1, which is why the difference between the mean and the median of instrumentalness is quite large
Most tracks have a duration of about 3-4 minutes; longer tracks are less frequent
Valence is approximately normally distributed
Danceability and energy are almost normally distributed
Most tracks have a loudness level of around -5 dB
Most tracks have a speechiness index below 0.2, so tracks with little spoken content dominate the dataset
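The first observation combines two simple computations: the share of values below a threshold, and the gap between mean and median in a right-skewed variable. A minimal sketch with made-up instrumentalness values:

```r
# Hypothetical values: mostly near zero, with two highly instrumental tracks
instrumentalness <- c(0, 0, 0, 0.00002, 0.0004, 0.003, 0.01, 0.05, 0.85, 0.92)

# mean() of a logical vector gives the proportion of TRUE values
share_low <- mean(instrumentalness <= 0.1)

c(share_low = share_low,
  mean      = mean(instrumentalness),
  median    = median(instrumentalness))
```

A few large values pull the mean far above the median, which is the same pattern the instrumentalness histogram shows on the full dataset.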
boxplot(energy~playlist_genre, data=spotify,
main = "Variation of energy between genres",
xlab = "Energy",
ylab = "Genre",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
EDM songs are highest in energy, as expected!
boxplot(danceability~playlist_genre, data=spotify,
main = "Variation of danceability between genres",
xlab = "Danceability",
ylab = "Genre",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
Rap songs have the highest danceability, as expected!
boxplot(liveness~playlist_genre, data=spotify,
main = "Variation of liveness between genres",
xlab = "Liveness",
ylab = "Genre",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
Rap songs are most lively, obviously!
boxplot(valence~playlist_genre, data=spotify,
main = "Variation of valence between genres",
xlab = "Valence",
ylab = "Genre",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
Songs in Latin genre have the highest valence, OK!
boxplot(loudness~playlist_genre, data=spotify,
main = "Variation of loudness between genres",
xlab = "Loudness",
ylab = "Genre",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
Songs in EDM genre are louder in nature, cool!
spotify$acousticness.scale <- scale(spotify$acousticness)
spotify %>%
select(popularity_group, acousticness.scale, playlist_genre) %>%
group_by(popularity_group)%>%
filter(!is.na(popularity_group)) %>%
filter(!is.na(acousticness.scale))%>%
ggplot(mapping = aes(x = acousticness.scale, y = popularity_group, color = playlist_genre))+
facet_wrap(~playlist_genre)+
geom_point()+
theme_minimal()
Acousticness does not appear to affect track popularity, as the level of acousticness is similar across all popularity groups.
spotify%>%
select(popularity_group, valence, playlist_genre) %>%
group_by(popularity_group)%>%
filter(!is.na(popularity_group)) %>%
filter(!is.na(valence))%>%
ggplot(mapping = aes(x = popularity_group, y = valence, color = playlist_genre, fill = playlist_genre))+
geom_bar(stat = 'identity')+
coord_polar()+
facet_wrap(~playlist_genre)+
theme_minimal()
Songs with a higher valence index are more popular in the pop and rap genres and less so in the EDM genre.
spotify$cut_energy <- cut(spotify$energy, breaks = 10)
spotify %>%
ggplot( aes(x=cut_energy ))+
geom_bar(width=0.2) +
coord_flip() +
scale_x_discrete(name="Energy")
This supports the findings from the energy histogram: most tracks in the dataset have high energy values.
spotify$cut_spe <- cut(spotify$speechiness, breaks = 10)
spotify %>%
ggplot( aes(x=cut_spe ))+
geom_bar(width=0.2) +
coord_flip() +
scale_x_discrete(name="Speechiness")
This graph also supports the findings of our histogram for speechiness: the vast majority of tracks in the dataset have a low speechiness value, so tracks that consist mostly of spoken words are rare in this data.
spotify$liveness.scale <- scale(spotify$liveness)
spotify$tempo.scale <- scale(spotify$tempo)
spotify %>%
select(tempo.scale, liveness.scale, playlist_genre) %>%
group_by(playlist_genre)%>%
filter(!is.na(tempo.scale)) %>%
filter(!is.na(liveness.scale))%>%
ggplot(mapping = aes(x = tempo.scale, y = liveness.scale, color = playlist_genre, fill = playlist_genre))+
geom_bar(stat = 'identity')+
coord_polar()+
theme_minimal()
Tempo is much higher for the EDM genre than for the others, while liveness is almost uniformly distributed across all genres.
spotify %>%
select(track_name, track_artist, track_album_name, playlist_genre, track_popularity)%>%
group_by(track_artist)%>%
filter(!is.na(track_name))%>%
filter(!is.na(track_artist))%>%
filter(!is.na(track_album_name))%>%
arrange(desc(track_popularity))%>%
head(n = 10)%>%
ggplot(mapping = aes(x = track_name, y = track_artist, color = track_artist, fill = track_artist, size = track_popularity ))+
geom_point()+
coord_polar()+
facet_wrap(~playlist_genre)+
theme_minimal()+
labs(x = 'track_name', y = 'track_artist', title = 'Top ten tracks on Spotify')+
theme(plot.title = element_text(hjust=0.5),legend.position ='bottom')
The top songs are primarily from the Pop genre, the most popular being by the artist ‘Tones & I’. The Latin and Rap genres also have one song each in the top 10.
Before performing clustering, we need to scale the numeric variables so that variables measured on larger scales do not dominate the distance calculation. The variables chosen for the cluster analysis are: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_min.
Checking the updated dataset, since several variables were added during the EDA:
str(spotify)
## 'data.frame': 28352 obs. of 27 variables:
## $ track_name : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
## $ track_artist : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_name : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7928 10684 981 2869 15185 1882 11515 13093 17788 8155 ...
## $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
## $ playlist_name : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ playlist_subgenre : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## $ duration_min : num 3.25 2.71 2.94 2.82 3.15 ...
## $ popularity_group : num 4 4 4 4 4 4 4 4 4 4 ...
## $ acousticness.scale : num [1:28352, 1] -0.337 -0.47 -0.439 -0.666 -0.435 ...
## ..- attr(*, "scaled:center")= num 0.177
## ..- attr(*, "scaled:scale")= num 0.223
## $ cut_energy : Factor w/ 10 levels "(-0.000825,0.1]",..: 10 9 10 10 9 10 9 10 10 9 ...
## $ cut_spe : Factor w/ 10 levels "(-0.000918,0.0918]",..: 1 1 1 2 1 2 1 1 1 1 ...
## $ liveness.scale : num [1:28352, 1] -0.8061 1.0652 -0.5193 0.0837 -0.6906 ...
## ..- attr(*, "scaled:center")= num 0.191
## ..- attr(*, "scaled:scale")= num 0.156
## $ tempo.scale : num [1:28352, 1] 0.04 -0.779 0.113 0.037 0.112 ...
## ..- attr(*, "scaled:center")= num 121
## ..- attr(*, "scaled:scale")= num 27
## - attr(*, "na.action")= 'omit' Named int 8152 9283 9284 19569 19812
## ..- attr(*, "names")= chr "8152" "9283" "9284" "19569" ...
Scaling the numeric variables required for cluster analysis:
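# Select the clustering variables by name rather than by column position, then standardize them
spotify_scaled <- spotify %>%
  select(danceability, energy, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, duration_min) %>%
  scale()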
summary(spotify_scaled)
## danceability energy loudness speechiness
## Min. :-4.4816 Min. :-3.8047 Min. :-13.0516 Min. :-1.0526
## 1st Qu.:-0.6336 1st Qu.:-0.6505 1st Qu.: -0.4915 1st Qu.:-0.6528
## Median : 0.1140 Median : 0.1288 Median : 0.1834 Median :-0.4421
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7314 3rd Qu.: 0.7881 3rd Qu.: 0.6946 3rd Qu.: 0.2444
## Max. : 2.2609 Max. : 1.6437 Max. : 2.6652 Max. : 7.8994
## acousticness instrumentalness liveness valence
## Min. :-0.7952 Min. :-0.3918 Min. :-1.2249 Min. :-2.177934
## 1st Qu.:-0.7311 1st Qu.:-0.3918 1st Qu.:-0.6309 1st Qu.:-0.774014
## Median :-0.4375 Median :-0.3918 Median :-0.4103 Median : 0.006889
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.3716 3rd Qu.:-0.3636 3rd Qu.: 0.3724 3rd Qu.: 0.787793
## Max. : 3.6659 Max. : 3.8823 Max. : 5.1643 Max. : 2.050894
## tempo duration_min
## Min. :-4.48750 Min. :-3.6439
## 1st Qu.:-0.77858 1st Qu.:-0.6358
## Median : 0.03841 Median :-0.1578
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.48381 3rd Qu.: 0.4650
## Max. : 4.39562 Max. : 4.7680
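For reference, `scale()` with its default arguments standardizes each column to a z-score: it subtracts the column mean and divides by the column standard deviation. The sketch below checks this by hand on a few made-up tempo values.

```r
tempo <- c(90, 100, 120, 128, 140)            # hypothetical values in BPM

z_manual <- (tempo - mean(tempo)) / sd(tempo) # z-score computed by hand
z_scale  <- as.numeric(scale(tempo))          # scale() returns a one-column matrix

all.equal(z_manual, z_scale)
c(mean = mean(z_scale), sd = sd(z_scale))
```

After scaling, every variable has mean 0 and standard deviation 1, which matches the `Mean : 0.0000` rows in the summary above. This puts tempo (in the hundreds) and danceability (between 0 and 1) on the same footing for the distance calculation in K-means.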
Determining the optimal number of clusters:
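wss <- function(data, maxCluster = 9) {
  # SSw[k] holds the total within-cluster sum of squares for k clusters
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))  # k = 1: total sum of squares
  for (i in 2:maxCluster) {
    # Raising iter.max or nstart in kmeans() helps if it warns that it did not converge
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch = 19)
}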
wss(spotify_scaled)
## Warning: did not converge in 10 iterations
Looking at the plot, we cannot see a clear “elbow”. However, the within-group sum of squares does not change much after k = 7, so we choose 7 as the number of clusters for this analysis.
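When the elbow is hard to see by eye, the relative drop in the within-group sum of squares from one k to the next gives a numeric view of the same curve. The sketch below illustrates the idea on simulated data with three well-separated groups. The toy data and the `nstart` and `iter.max` settings are assumptions for illustration, not the settings used in the run above.

```r
set.seed(42)

# Hypothetical data: three well-separated 2-D blobs of 50 points each
blob_centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, blob_centers[i, 1], 0.5), rnorm(50, blob_centers[i, 2], 0.5))
}))

wss_by_k <- c(
  sum(scale(toy, scale = FALSE)^2),  # k = 1: total sum of squares
  sapply(2:6, function(k) {
    kmeans(toy, centers = k, nstart = 10, iter.max = 50)$tot.withinss
  })
)

# Relative drop in WSS when going from k to k + 1 clusters
rel_drop <- -diff(wss_by_k) / wss_by_k[-length(wss_by_k)]
round(rel_drop, 2)
```

On this toy data the drop is large up to k = 3 and small afterwards, which is the numeric counterpart of an elbow at 3. On the Spotify data the drops shrink gradually, which is why the choice of k = 7 is a judgment call.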
spotify_kmeans <- kmeans(spotify_scaled, centers = 7)
spotify_kmeans$size
## [1] 1747 3754 7665 3256 3420 6265 2245
spotify_kmeans$centers
## danceability energy loudness speechiness acousticness
## 1 -0.32257364 0.4754506 0.26305322 0.03586906 -0.27112012
## 2 -0.35426056 -1.5639363 -1.30289624 -0.35081509 1.68479950
## 3 0.64505164 0.1531843 0.11746811 -0.32734861 -0.14326327
## 4 -0.82623557 0.3662025 0.30810693 0.04551065 -0.27190863
## 5 0.59334595 -0.2548104 -0.09046971 2.03734288 0.04973956
## 6 -0.41165009 0.4071884 0.48311917 -0.40457685 -0.47413275
## 7 0.08422147 0.4428987 -0.08436790 -0.36427511 -0.47541674
## instrumentalness liveness valence tempo duration_min
## 1 -0.11655566 2.82065999 -0.01703516 0.02854606 0.111877644
## 2 0.08086478 -0.27948077 -0.50374425 -0.31054449 -0.042750825
## 3 -0.31784071 -0.26766758 0.86230076 -0.35362881 -0.007234689
## 4 -0.32264063 -0.11869411 0.18920129 1.64069058 0.037763749
## 5 -0.34242492 -0.08164052 0.12854595 -0.24406061 -0.198196919
## 6 -0.28900243 -0.13567781 -0.73428622 -0.16176324 -0.067695698
## 7 2.83675577 -0.13859692 -0.50961352 0.14811775 0.445202000
spotify$cluster <- spotify_kmeans$cluster
tail(spotify)
## track_name
## 32828 Many Ways - Radio Edit
## 32829 City Of Lights - Official Radio Edit
## 32830 Closer - Sultan & Ned Shepard Remix
## 32831 Sweet Surrender - Radio Edit
## 32832 Only For You - Maor Levi Remix
## 32833 Typhoon - Original Mix
## track_artist track_popularity
## 32828 Ferry Corsten feat. Jenny Wahlstrom 27
## 32829 Lush & Simon 42
## 32830 Tegan and Sara 20
## 32831 Starkillers 14
## 32832 Mat Zo 15
## 32833 Julian Calor 27
## track_album_name track_album_release_date
## 32828 Many Ways 2013
## 32829 City Of Lights (Vocal Mix) 2014-04-28
## 32830 Closer Remixed 2013-03-08
## 32831 Sweet Surrender (Radio Edit) 2014-04-21
## 32832 Only For You (Remixes) 2014-01-01
## 32833 Typhoon/Storm 2014-03-03
## playlist_name playlist_genre playlist_subgenre
## 32828 ♥ EDM LOVE 2020 edm progressive electro house
## 32829 ♥ EDM LOVE 2020 edm progressive electro house
## 32830 ♥ EDM LOVE 2020 edm progressive electro house
## 32831 ♥ EDM LOVE 2020 edm progressive electro house
## 32832 ♥ EDM LOVE 2020 edm progressive electro house
## 32833 ♥ EDM LOVE 2020 edm progressive electro house
## danceability energy key loudness mode speechiness acousticness
## 32828 0.581 0.640 5 -8.367 1 0.0365 0.026600
## 32829 0.428 0.922 2 -1.814 1 0.0936 0.076600
## 32830 0.522 0.786 0 -4.462 1 0.0420 0.001710
## 32831 0.529 0.821 6 -4.899 0 0.0481 0.108000
## 32832 0.626 0.888 2 -3.361 1 0.1090 0.007920
## 32833 0.603 0.884 5 -4.571 0 0.0385 0.000133
## instrumentalness liveness valence tempo duration_ms duration_min
## 32828 0.00e+00 0.5720 0.2880 128.001 196993 3.283217
## 32829 0.00e+00 0.0668 0.2100 128.170 204375 3.406250
## 32830 4.27e-03 0.3750 0.4000 128.041 353120 5.885333
## 32831 1.11e-06 0.1500 0.4360 127.989 210112 3.501867
## 32832 1.27e-01 0.3430 0.3080 128.008 367432 6.123867
## 32833 3.41e-01 0.7420 0.0894 127.984 337500 5.625000
## popularity_group acousticness.scale cut_energy cut_spe
## 32828 2 -0.6758632 (0.6,0.7] (-0.000918,0.0918]
## 32829 3 -0.4514612 (0.9,1] (0.0918,0.184]
## 32830 2 -0.7875706 (0.7,0.8] (-0.000918,0.0918]
## 32831 1 -0.3105367 (0.8,0.9] (-0.000918,0.0918]
## 32832 1 -0.7596998 (0.8,0.9] (0.0918,0.184]
## 32833 2 -0.7946482 (0.8,0.9] (-0.000918,0.0918]
## liveness.scale tempo.scale cluster
## 32828 2.4443564 0.2612840 1
## 32829 -0.7964367 0.2675539 6
## 32830 1.1806267 0.2627680 6
## 32831 -0.2627194 0.2608388 6
## 32832 0.9753508 0.2615437 6
## 32833 3.5348845 0.2606533 1
The size of each cluster (the number of tracks in it) is: cluster 1: 1747, cluster 2: 3754, cluster 3: 7665, cluster 4: 3256, cluster 5: 3420, cluster 6: 6265 and cluster 7: 2245.
The centers show the mean value of each scaled variable within each cluster.
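One quick way to read the centers is to ask which feature deviates most from the overall mean in each cluster. The sketch below does this for four of the seven clusters and four of the ten features, using the center values printed above rounded to two decimals. Restricting to this subset is only to keep the example short.

```r
# Rounded centers for clusters 1, 2, 5 and 7, copied from the output above
centers <- matrix(
  c( 0.04, -0.27, -0.12,  2.82,
    -0.35,  1.68,  0.08, -0.28,
     2.04,  0.05, -0.34, -0.08,
    -0.36, -0.48,  2.84, -0.14),
  nrow = 4, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "5", "7"),
                  feature = c("speechiness", "acousticness",
                              "instrumentalness", "liveness"))
)

# For each cluster, the feature with the largest absolute (scaled) center
dominant <- colnames(centers)[apply(abs(centers), 1, which.max)]
names(dominant) <- rownames(centers)
dominant
```

Read this way, cluster 1 groups live-sounding tracks, cluster 2 acoustic tracks, cluster 5 speech-heavy tracks and cluster 7 instrumental tracks.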
Plotting using ‘factoextra’:
fviz_cluster(spotify_kmeans, data=spotify_scaled)
We can assess the quality of the clustering using the following 3 values:
1 - Within Sum of Squares (tot.withinss): the sum of squared distances from each observation to the centroid of its cluster
spotify_kmeans$tot.withinss
## [1] 167883.7
2 - Total Sum of Squares (totss): the sum of squared distances from each observation to the global sample mean
spotify_kmeans$totss
## [1] 283510
3 - Between Sum of Squares (betweenss): the sum of squared distances from each cluster centroid to the global sample mean, weighted by cluster size
spotify_kmeans$betweenss
## [1] 115626.3
Another ‘goodness’ measure is the ratio betweenss / totss (the closer the value is to 1, or 100%, the better):
((spotify_kmeans$betweenss)/(spotify_kmeans$totss))*100
## [1] 40.78384
A good clustering has high similarity within each cluster (low WSS) and maximum difference between clusters (high BSS). Equivalently, its BSS / totss ratio is close to 1 (100%).
From the unsupervised learning analysis above, we can summarize that K-means clustering is workable on this dataset: the BSS / totss ratio of 40.78% is moderate, meaning that the 7 clusters account for about 41% of the total variation in the scaled features.
After repeating the same exercise with multiple combinations of variables, we get the best fit by excluding loudness, tempo and duration_min.
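The three sums of squares are tied together by the identity totss = tot.withinss + betweenss, and for standardized data totss equals (n - 1) times the number of variables. The short check below uses only the numbers printed above (28,352 tracks, 10 scaled variables).

```r
tot_withinss <- 167883.7
betweenss    <- 115626.3
totss        <- 283510

# Decomposition: total = within + between
all.equal(tot_withinss + betweenss, totss)

# Each scaled column has variance 1, so totss = (n - 1) * p
(28352 - 1) * 10

# Share of the total variation explained by the clustering, in percent
ratio_pct <- betweenss / totss * 100
round(ratio_pct, 2)
```

Both identities hold for the reported values, and the ratio rounds to the 40.78% quoted in the text.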
Finding what kind of song characterises each cluster in the optimized model:
# Average only the numeric audio features; taking the mean of factor columns would produce NAs and warnings
spotify %>%
  group_by(cluster) %>%
  summarise_at(vars(acousticness, danceability, energy, instrumentalness, speechiness, valence, liveness), mean)
## # A tibble: 7 x 8
## cluster acousticness danceability energy instrumentalness speechiness
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.117 0.606 0.786 0.0640 0.112
## 2 2 0.553 0.602 0.411 0.110 0.0720
## 3 3 0.145 0.747 0.726 0.0172 0.0744
## 4 4 0.117 0.533 0.766 0.0161 0.113
## 5 5 0.188 0.740 0.652 0.0115 0.317
## 6 6 0.0715 0.593 0.773 0.0239 0.0665
## 7 7 0.0713 0.666 0.780 0.751 0.0706
## # ... with 2 more variables: valence <dbl>, liveness <dbl>
Now, let’s check which cluster my favourite song belongs to. My favourite track from the list is Memories by Maroon 5 (the Dillon Francis Remix).
spotify %>%
filter(track_name == "Memories - Dillon Francis Remix", track_artist == "Maroon 5")
## track_name track_artist track_popularity
## 1 Memories - Dillon Francis Remix Maroon 5 67
## track_album_name track_album_release_date playlist_name
## 1 Memories (Dillon Francis Remix) 2019-12-13 Pop Remix
## playlist_genre playlist_subgenre danceability energy key loudness mode
## 1 pop dance pop 0.726 0.815 11 -4.969 1
## speechiness acousticness instrumentalness liveness valence tempo
## 1 0.0373 0.0724 0.00421 0.357 0.693 99.972
## duration_ms duration_min popularity_group acousticness.scale cut_energy
## 1 162600 2.71 4 -0.4703109 (0.8,0.9]
## cut_spe liveness.scale tempo.scale cluster
## 1 (-0.000918,0.0918] 1.065159 -0.7785794 3
So my favourite song is in cluster 3.
Now I want to try a new genre, r&b. Since my favourite song, Memories by Maroon 5, is in cluster 3, let’s draw a few r&b songs from the same cluster, as these should be closest to my taste.
spotify %>%
filter(cluster == 3, playlist_genre == "r&b") %>%
sample_n(5)
## track_name track_artist track_popularity
## 1 You're Makin' Me High Toni Braxton 43
## 2 Just Ain't Gonna Work Out Mayer Hawthorne 51
## 3 When Can I See You Babyface 0
## 4 Tradición Gloria Estefan 32
## 5 Rolex Ayo & Teo 73
## track_album_name track_album_release_date
## 1 Secrets (Remix Package) 1996
## 2 A Strange Arrangement 2009
## 3 R&B Slow Grooves 2008-08-19
## 4 Mi Tierra 1993-06-03
## 5 Rolex 2017-03-15
## playlist_name playlist_genre
## 1 1987-1997 OLD SKOOL JAMZ r&b
## 2 Soul Coffee (The Best Neo-Soul Mixtape ever) r&b
## 3 90s R&B - The BET Planet Groove/Midnight Love Mix r&b
## 4 Cuban vibes only r&b
## 5 Hip pop r&b
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 new jack swing 0.852 0.576 10 -8.668 0 0.0377
## 2 neo soul 0.782 0.544 4 -5.448 1 0.0306
## 3 new jack swing 0.795 0.553 1 -8.752 0 0.0508
## 4 urban contemporary 0.571 0.698 0 -6.786 1 0.1090
## 5 hip pop 0.804 0.886 1 -2.512 1 0.0400
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.0108 1.21e-05 0.0848 0.902 92.123 267267
## 2 0.2120 1.06e-01 0.1790 0.753 90.562 150933
## 3 0.2470 5.39e-06 0.1800 0.586 84.621 228160
## 4 0.0839 3.19e-03 0.1270 0.909 132.215 320027
## 5 0.0837 0.00e+00 0.2660 0.789 144.946 238587
## duration_min popularity_group acousticness.scale cut_energy
## 1 4.454450 3 -0.7467743 (0.5,0.6]
## 2 2.515550 3 0.1562195 (0.5,0.6]
## 3 3.802667 4 0.3133010 (0.5,0.6]
## 4 5.333783 2 -0.4186985 (0.6,0.7]
## 5 3.976450 4 -0.4195961 (0.8,0.9]
## cut_spe liveness.scale tempo.scale cluster
## 1 (-0.000918,0.0918] -0.68096902 -1.0697738 3
## 2 (-0.000918,0.0918] -0.07668813 -1.1276862 3
## 3 (-0.000918,0.0918] -0.07027326 -1.3480946 3
## 4 (0.0918,0.184] -0.41026144 0.4176216 3
## 5 (-0.000918,0.0918] 0.48140569 0.8899360 3
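The two lookup steps above (find the favourite track's cluster, then filter that cluster by genre) can be wrapped into one function. The sketch below uses base R only and a tiny toy table built from rows shown above plus one placeholder row; `recommend_tracks` is a hypothetical helper, and ranking by `track_popularity` instead of sampling at random is an assumption made for this example.

```r
# Hypothetical helper: suggest up to n tracks from the same cluster as a
# favourite track, restricted to one genre. Column names follow the dataset.
recommend_tracks <- function(tracks, fav_name, fav_artist, genre, n = 5) {
  fav <- tracks[tracks$track_name == fav_name &
                  tracks$track_artist == fav_artist, ]
  if (nrow(fav) == 0) stop("favourite track not found")
  fav_cluster <- fav$cluster[1]  # assumption: use the first match if several

  pool <- tracks[tracks$cluster == fav_cluster &
                   tracks$playlist_genre == genre &
                   tracks$track_name != fav_name, ]
  # Assumption: rank by popularity rather than sampling at random.
  pool <- pool[order(-pool$track_popularity), ]
  head(pool, n)
}

# Toy table: four rows taken from the output above, plus one placeholder
# row in a different cluster to show that it is filtered out.
toy <- data.frame(
  track_name = c("Memories - Dillon Francis Remix", "You're Makin' Me High",
                 "Just Ain't Gonna Work Out", "Rolex", "Placeholder Track"),
  track_artist = c("Maroon 5", "Toni Braxton", "Mayer Hawthorne",
                   "Ayo & Teo", "Placeholder Artist"),
  playlist_genre = c("pop", "r&b", "r&b", "r&b", "r&b"),
  track_popularity = c(67, 43, 51, 73, 10),
  cluster = c(3, 3, 3, 3, 1),
  stringsAsFactors = FALSE
)

recs <- recommend_tracks(toy, "Memories - Dillon Francis Remix",
                         "Maroon 5", "r&b", n = 2)
print(recs$track_name)
```

On the full dataset, the same call with `tracks = spotify` would return the most popular r&b tracks in cluster 3, rather than the random five drawn by `sample_n(5)`.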
We also attempted association mining on this dataset. We got stuck here because of the large number of unique tracks in the dataset and our limited processing capacity. From this we learnt that association mining may not be a good fit for a dataset with a large number of unique values. We tried reducing the dataset by deleting random rows, but even with 600 rows we got 520K rules, with the top 10 rules at a 100% confidence level, which is not feasible to analyze.
From our correlation plot, we observed that several variables are strongly correlated with each other, indicating that this dataset has multicollinearity. On digging deeper into this, we learnt that such datasets are not well suited to various classification algorithms, so we dropped our plan for CART (Classification Tree) and Random Forest.
Learning such ML nuances was a major takeaway for us from this project.