1.INTRODUCTION

1.1 Background

Spotify is one of the biggest digital music, podcast, and video streaming services in the world, giving access to millions of songs and other content from artists all over the world. Not only does Spotify give us access to good songs everywhere (at work, at home, in the car), it has also introduced us to artists we would never have listened to otherwise and to genres we had never experienced. Spotify uses advanced technology to track and identify each song uploaded to its platform.


The Spotify database provides an interesting look into listening data. It records not just the popularity of tracks but also the audio features of every track in the library. In this project, we analyze a track’s popularity based on several audio features provided in the dataset and try to answer the question ‘Can we predict a track’s popularity from key features of the song?’ We also carry out a custom analysis based on a user’s listening profile, which should enable Spotify to stock more tracks similar to current hits and let go of songs that are not popular with listeners.

Consider yourself a Spotify user. Wouldn’t you be impressed to get a list of the most popular songs tailored to your taste without having to search for them manually? And wouldn’t you be a happy, returning customer if you kept getting a list of the latest top music, divided by genre, with easy access to recently released tracks? Our goal is to provide easy access to popular songs, and that is why you, as a Spotify user, should be interested.

1.2 Methodology Used

We analyzed the relationship between popularity and the different audio features and genres through a detailed EDA. We then performed a clustering analysis using the K-means method to provide song recommendations based on recent user listening on Spotify. We extracted patterns from the clusters and determined how the clusters differ from one another, which helped us identify the most and least popular clusters. Finally, we created a ‘Song Recommendation System’.

1.3 Proposed Analytical Approach

K-means clustering provides information about customer listening behavior, which should help Spotify upsell, cross-sell, or combine both to increase profit. Using this method, we analyzed the relationship between the songs in a cluster and their popularity.

After clustering the songs, we built a song recommendation system that gives listeners effective suggestions for the next best songs according to their taste.
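
As a rough illustration of the idea, the sketch below recommends the most popular tracks that share a cluster with a seed track. The toy table, its values and the helper recommend_from_cluster() are hypothetical and are not the recommendation system built in this project; the sketch uses base R only.

```r
# Toy stand-in for a clustered track table (all values are made up).
tracks <- data.frame(
  track_name       = c("t1", "t2", "t3", "t4", "t5", "t6"),
  track_popularity = c(80, 55, 90, 30, 70, 65),
  cluster          = c(1, 1, 1, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return up to n track names from the seed track's
# cluster, most popular first, excluding the seed itself.
recommend_from_cluster <- function(tracks, seed_track, n = 3) {
  seed_cluster <- tracks$cluster[tracks$track_name == seed_track][1]
  if (is.na(seed_cluster)) return(character(0))  # unknown seed track
  pool <- tracks[tracks$cluster == seed_cluster &
                   tracks$track_name != seed_track, ]
  pool <- pool[order(-pool$track_popularity), ]
  head(pool$track_name, n)
}

recs <- recommend_from_cluster(tracks, "t2", n = 2)
print(recs)
```

Ranking by popularity inside the cluster is only one possible choice; ranking by distance to the seed track in feature space would be another.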

1.4 Usefulness of Analysis

The consumer of our analysis would be Spotify’s programming team. Our analysis should enable the team to upsell, cross-sell, or combine both to increase profit. A better understanding of the different clusters should enable Spotify to target its content distribution better, leading to a reduced churn rate.

For example, if the team knows that 30% of customers who listen to track A also listen to track B, Spotify can market track B to customers shortly after they listen to track A, to speed up that process and capture those who might not otherwise have considered listening to track B. Customers who do not yet know track B will also be pleased to receive the suggestion. This is how our analysis would help Spotify provide better services to its consumers and stay ahead of the curve.
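
The "listeners of track A who also listen to track B" figure is a conditional share. The sketch below shows one way it could be computed; the listening log and the helper co_listen_share() are hypothetical, since the dataset used in this project contains no per-user listening data.

```r
# Hypothetical listening log: one row per (user, track) pair.
plays <- data.frame(
  user  = c("u1", "u1", "u2", "u3", "u3", "u4", "u5", "u5"),
  track = c("A",  "B",  "A",  "A",  "B",  "B",  "A",  "C"),
  stringsAsFactors = FALSE
)

# Share of the listeners of `from` who also listened to `to`.
co_listen_share <- function(plays, from, to) {
  from_users <- unique(plays$user[plays$track == from])
  to_users   <- unique(plays$user[plays$track == to])
  if (length(from_users) == 0) return(NA_real_)  # nobody played `from`
  mean(from_users %in% to_users)
}

share_ab <- co_listen_share(plays, "A", "B")
print(share_ab)
```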

2.PACKAGES USED

#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("corrplot")
#install.packages("factoextra")
#install.packages("plyr")
#install.packages("RColorBrewer")
#install.packages("funModeling")
#install.packages("knitr")



# plyr is loaded first so that it does not mask dplyr verbs
# such as mutate() and summarise()
library(plyr)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(RColorBrewer)
library(funModeling)
library(knitr)

tidyverse - for interacting with data through subsetting, transformation, visualization, etc.

dplyr - for data manipulation in R by combining, selecting, grouping, subsetting and transforming all or parts of dataset

ggplot2 - for declaratively creating graphics, based on The Grammar of Graphics

plotly - for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js

corrplot - for visualizing correlation matrices and confidence intervals

factoextra - to extract and visualize the output of multivariate data analyses and to simplify some clustering steps

plyr - to split data apart, apply a function to each piece, and combine the results, with simple control over the input and output data format

RColorBrewer - to choose sensible colour schemes for figures in R

funModeling - for quick exploratory plots, such as the plot_num histograms of numeric variables used below

3.DATA PREPARATION

3.1 Data Source

The dataset used for this project is the Spotify song list prepared by Zaheen Hamidani, which we obtained from Kaggle (https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db).

3.2 Data Information

Originally, the dataset was created by Zaheen Hamidani and uploaded to Kaggle in July 2019. Alternatively, the data can be retrieved with the spotifyr R package (version 2.1.1). Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get track data or general metadata from Spotify’s API. It lets you enter an artist’s name and retrieve their entire discography (the collection of all their songs) in seconds, along with Spotify’s audio features and track/album popularity metrics.

The primary purpose of the data was to analyze the relationship between valence and the other measures that the Spotify API provides for every track. Approximately 10,000 songs were selected per genre, and there are 26 genres. However, the same data can also be used to compute other statistics and obtain other useful information.

There is little that is peculiar in the data. It is moderately clean, with only 15 missing values. Since every track is unique in some sense, we have not imputed the missing values and have simply removed the affected observations.

3.3 Data Importing

First, we load the Spotify songs dataset into R to kick off the analysis. The dataset is imported using the read.csv function and saved as “spotify”.

set.seed(13232767)
spotify <- read.csv("spotify_songs.csv")
glimpse(spotify)
## Observations: 32,833
## Variables: 23
## $ track_id                 <fct> 6f807x0ima9a1j3VPbc7VN, 0r7CVbZTWZgbT...
## $ track_name               <fct> I Don't Care (with Justin Bieber) - L...
## $ track_artist             <fct> Ed Sheeran, Maroon 5, Zara Larsson, T...
## $ track_popularity         <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 6...
## $ track_album_id           <fct> 2oCs0DGTsRO98Gh5ZSl2Cx, 63rPSO264uRjW...
## $ track_album_name         <fct> I Don't Care (with Justin Bieber) [Lo...
## $ track_album_release_date <fct> 2019-06-14, 2019-12-13, 2019-07-05, 2...
## $ playlist_name            <fct> Pop Remix, Pop Remix, Pop Remix, Pop ...
## $ playlist_id              <fct> 37i9dQZF1DXcZDD7cfEKhW, 37i9dQZF1DXcZ...
## $ playlist_genre           <fct> pop, pop, pop, pop, pop, pop, pop, po...
## $ playlist_subgenre        <fct> dance pop, dance pop, dance pop, danc...
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0....
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0....
## $ key                      <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, ...
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.67...
## $ mode                     <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.035...
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0...
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-0...
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.083...
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0....
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 12...
## $ duration_ms              <int> 194754, 162600, 176616, 169093, 18905...

Our dataset has 32,833 observations and 23 variables.

3.4 Data Cleaning

3.4.1 Removing NA’s

Using the colSums function, we observe that there are 5 NA’s each in track_name, track_artist and track_album_name. We have removed the respective observations using the na.omit function.

colSums(is.na(spotify))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
spotify <- na.omit(spotify)
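
The number of missing cells and the number of rows that na.omit drops are different quantities: the 15 missing values above sit in only five rows, as the na.action attribute in the later str() output shows. A minimal sketch on a made-up data frame (the toy values are assumptions):

```r
# Toy data frame: two rows are missing both name and artist.
toy <- data.frame(
  track_name       = c("a", NA, "c", NA),
  track_artist     = c("x", NA, "z", NA),
  track_popularity = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

na_cells <- sum(is.na(toy))            # total number of missing cells
na_rows  <- sum(!complete.cases(toy))  # rows with at least one NA
cleaned  <- na.omit(toy)               # drops every row counted in na_rows

print(c(cells = na_cells, rows = na_rows, kept = nrow(cleaned)))
```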

3.4.2 Filtering

We keep unique tracks only, removing all duplicated tracks with the duplicated function.

spotify <- spotify[!duplicated(spotify$track_id),]
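
Note that duplicated() flags every occurrence after the first, so a track that appears in several playlists keeps the playlist and genre of its first row. A small sketch with made-up IDs illustrates this:

```r
# Toy data: track "id1" appears in two playlists with different genres.
toy <- data.frame(
  track_id       = c("id1", "id2", "id1", "id3"),
  playlist_genre = c("pop", "rock", "edm", "rap"),
  stringsAsFactors = FALSE
)

# Keep the first occurrence of each track_id, drop later ones.
deduped <- toy[!duplicated(toy$track_id), ]
print(deduped)
```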

3.4.3 Transforming Variables

We convert key, mode, genre and subgenre to factors, since these columns hold categorical data and factors make them easier to analyze.

spotify <- spotify %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))

We convert duration_ms to a duration in minutes (duration_min), since minutes are easier to interpret in the analysis.

spotify <- spotify %>% mutate(duration_min = duration_ms/60000)

3.4.4 Creating New Variables

To explore the distribution of popularity, we create a new variable that divides popularity into 4 groups for effective cluster analysis.

spotify <- spotify %>% 
  mutate(popularity_group = as.numeric(case_when(
    ((track_popularity > 0) & (track_popularity < 20)) ~ "1",
    ((track_popularity >= 20) & (track_popularity < 40))~ "2",
    ((track_popularity >= 40) & (track_popularity < 60)) ~ "3",
    # everything else lands here: popularity >= 60, and also popularity == 0,
    # which the strict "> 0" above excludes from group 1
    TRUE ~ "4"))
    )
table(spotify$popularity_group)
## 
##    1    2    3    4 
## 4182 6162 8975 9033
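
An alternative way to bin popularity is cut(), sketched below on made-up scores. Unlike the case_when call above, this version places a popularity of 0 in group 1, so it would not reproduce the counts in the table above; the 20/40/60 break points are the same.

```r
# Made-up popularity scores covering every boundary.
popularity <- c(0, 19, 20, 39, 40, 59, 60, 100)

# Left-closed bins [0,20), [20,40), [40,60), [60,100];
# include.lowest = TRUE closes the last bin so that 100 is kept.
popularity_group <- cut(
  popularity,
  breaks = c(0, 20, 40, 60, 100),
  right = FALSE,
  include.lowest = TRUE,
  labels = FALSE
)

print(popularity_group)
```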

3.4.5 Removing Variables

We have removed track_id, track_album_id and playlist_id from the dataset since they are not useful for our analysis. These IDs are kept in the Spotify dataset only to uniquely identify a track in the database.

spotify <- spotify %>% select(-c(track_id, track_album_id, playlist_id))
summary(spotify)
##     track_name                       track_artist   track_popularity
##  Breathe :   18   Queen                    :  130   Min.   :  0.00  
##  Paradise:   17   Martin Garrix            :   87   1st Qu.: 21.00  
##  Poison  :   16   Don Omar                 :   84   Median : 42.00  
##  Alive   :   15   David Guetta             :   81   Mean   : 39.34  
##  Forever :   14   Dimitri Vegas & Like Mike:   68   3rd Qu.: 58.00  
##  Stay    :   14   Drake                    :   68   Max.   :100.00  
##  (Other) :28258   (Other)                  :27834                   
##                     track_album_name track_album_release_date
##  Greatest Hits              :  135   2020-01-10:  201        
##  Ultimate Freestyle Mega Mix:   42   2013-01-01:  189        
##  Gold                       :   34   2019-11-22:  185        
##  Rock & Rios (Remastered)   :   29   2019-12-06:  184        
##  Asian Dreamer              :   20   2019-11-15:  183        
##  Trip Stories               :   20   2008-01-01:  176        
##  (Other)                    :28072   (Other)   :27234        
##             playlist_name   playlist_genre
##  Indie Poptimism   :  294   edm  :4877    
##  Permanent Wave    :  223   latin:4136    
##  Hard Rock Workout :  211   pop  :5132    
##  Southern Hip Hop  :  174   r&b  :4504    
##  post teen pop     :  159   rap  :5398    
##  Urban Contemporary:  157   rock :4305    
##  (Other)           :27134                 
##                  playlist_subgenre  danceability        energy        
##  southern hip hop         : 1582   Min.   :0.0000   Min.   :0.000175  
##  indie poptimism          : 1547   1st Qu.:0.5610   1st Qu.:0.579000  
##  neo soul                 : 1478   Median :0.6700   Median :0.722000  
##  progressive electro house: 1460   Mean   :0.6534   Mean   :0.698372  
##  electro house            : 1416   3rd Qu.:0.7600   3rd Qu.:0.843000  
##  gangster rap             : 1314   Max.   :0.9830   Max.   :1.000000  
##  (Other)                  :19555                                      
##       key           loudness       mode       speechiness    
##  1      : 3436   Min.   :-46.448   0:12318   Min.   :0.0000  
##  0      : 3001   1st Qu.: -8.310   1:16034   1st Qu.:0.0410  
##  7      : 2907   Median : -6.261             Median :0.0626  
##  9      : 2631   Mean   : -6.818             Mean   :0.1079  
##  11     : 2577   3rd Qu.: -4.709             3rd Qu.:0.1330  
##  2      : 2478   Max.   :  1.275             Max.   :0.9180  
##  (Other):11322                                               
##   acousticness    instrumentalness       liveness         valence      
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0143   1st Qu.:0.0000000   1st Qu.:0.0926   1st Qu.:0.3290  
##  Median :0.0797   Median :0.0000206   Median :0.1270   Median :0.5120  
##  Mean   :0.1772   Mean   :0.0911294   Mean   :0.1910   Mean   :0.5104  
##  3rd Qu.:0.2600   3rd Qu.:0.0065725   3rd Qu.:0.2490   3rd Qu.:0.6950  
##  Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910  
##                                                                        
##      tempo         duration_ms      duration_min     popularity_group
##  Min.   :  0.00   Min.   :  4000   Min.   :0.06667   Min.   :1.000   
##  1st Qu.: 99.97   1st Qu.:187741   1st Qu.:3.12902   1st Qu.:2.000   
##  Median :121.99   Median :216933   Median :3.61555   Median :3.000   
##  Mean   :120.96   Mean   :226575   Mean   :3.77624   Mean   :2.806   
##  3rd Qu.:134.00   3rd Qu.:254975   3rd Qu.:4.24959   3rd Qu.:4.000   
##  Max.   :239.44   Max.   :517810   Max.   :8.63017   Max.   :4.000   
## 

3.5 Description of Attributes

Each row represents one song, and the columns contain the attributes of each song. The attributes are as follows:

track_id - Track ID on Spotify

track_name - Title of the song

track_artist - Name of the artist

track_popularity - Popularity of the track on a scale from 0 to 100, based on its number of plays

acousticness - Measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

danceability - Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

duration_ms - The duration of the track in milliseconds(ms).

energy - Measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

instrumentalness - Measure whether a track contains no vocals. ‘Ooh’ and ‘aah’ sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly ‘vocal’. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

key - Estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

liveness - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

loudness - Overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

mode - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

speechiness - Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

tempo - Overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

valence - Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
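
The speechiness thresholds described above can be turned into labelled bands. The sketch below is one way to do it; the helper speech_class(), the band labels and the treatment of values exactly on a boundary are assumptions, since the description does not say which band 0.33 or 0.66 belongs to.

```r
# Hypothetical helper: label speechiness values using the 0.33 / 0.66
# thresholds from the attribute description. Boundary values fall into
# the lower band here (right-closed intervals), which is an assumption.
speech_class <- function(s) {
  cut(
    s,
    breaks = c(-Inf, 0.33, 0.66, Inf),
    labels = c("music", "mixed", "spoken"),
    right = TRUE
  )
}

bands <- speech_class(c(0.05, 0.5, 0.9))
print(bands)
```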

4.EXPLORATORY DATA ANALYSIS

The best way to uncover useful information that is not self-evident in the data is to perform EDA carefully. EDA helps us make sense of our data, so it is essential to explore a data set before performing a formal analysis or building any model. It helps us better understand the patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables. We have used histograms, boxplots and a correlation plot to find such answers.

4.1 Correlation Plot

df1 <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
corrplot(cor(df1))

The plot shows that popularity is not strongly correlated with any of the track features. However, some of the features are strongly correlated with each other, which indicates multicollinearity in this dataset and may make it unsuitable for some classification algorithms.
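
To list strongly correlated pairs instead of reading them off the plot, the correlation matrix can be filtered directly. The sketch below uses base R on made-up values; the helper strong_pairs() and the 0.6 threshold are assumptions, not part of the original analysis.

```r
# Hypothetical helper: return variable pairs whose absolute correlation
# is at least `threshold`, looking at the upper triangle only so that
# each pair is reported once.
strong_pairs <- function(df, threshold = 0.6) {
  cm  <- cor(df)
  idx <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Made-up values in which energy and loudness move together.
toy <- data.frame(
  energy   = c(0.2, 0.4, 0.6, 0.8, 1.0),
  loudness = c(-12, -9, -7, -4, -2),
  tempo    = c(120, 90, 130, 100, 110)
)

pairs_found <- strong_pairs(toy)
print(pairs_found)
```

On the project data, the same helper could be applied to the df1 data frame built above.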

4.2 Histogram

Analyzing data distribution of the audio features

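# Keep the ten numeric audio features, selected by name rather than by position
spotify_hist <- select(spotify, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_min)
plot_num(spotify_hist)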

From the histograms, we can observe that:

  • The majority of observations (about 85.4%) have an instrumentalness value no larger than 0.1, which is why the mean and median of instrumentalness differ so much

  • Most tracks have a duration of about 3-4 minutes; longer tracks are less frequent

  • Valence is approximately normally distributed

  • Danceability and energy are almost normally distributed

  • Most tracks have a loudness level of around -5 dB

  • Most tracks have a speechiness index below 0.2, so the dataset is dominated by less speechy songs
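
The instrumentalness share in the first bullet is a simple proportion, and the gap between mean and median is a quick sign of right skew. A minimal sketch on made-up values (the vector below is an assumption, not taken from the dataset):

```r
# Made-up, right-skewed values standing in for instrumentalness.
x <- c(0, 0, 0, 0.00002, 0.001, 0.004, 0.05, 0.09, 0.6, 0.9)

share_low <- mean(x <= 0.1) * 100   # percentage of values at or below 0.1
gap       <- mean(x) - median(x)    # positive when a few large values pull the mean up

print(c(share_low = share_low, mean = mean(x), median = median(x)))
```

On the project data, the equivalent call would be mean(spotify$instrumentalness <= 0.1) * 100.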

4.3 Boxplot

4.3.1 Genre by Energy

boxplot(energy~playlist_genre, data=spotify,
        main = "Variation of energy between genres",
        xlab = "Energy",
        ylab = "Genre",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE
)

EDM songs are highest in energy, as expected!

4.3.2 Genre by Danceability

boxplot(danceability~playlist_genre, data=spotify,
        main = "Variation of danceability between genres",
        xlab = "Danceability",
        ylab = "Genre",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE
)

Rap songs have the highest danceability, as we suspected!

4.3.3 Genre by Liveness

boxplot(liveness~playlist_genre, data=spotify,
        main = "Variation of liveness between genres",
        xlab = "Liveness",
        ylab = "Genre",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE
)

Liveness varies little between genres, in line with the near-uniform distribution noted in section 4.8.

4.3.4 Genre by Valence

boxplot(valence~playlist_genre, data=spotify,
        main = "Variation of valence between genres",
        xlab = "Valence",
        ylab = "Genre",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE
)

Songs in Latin genre have the highest valence, OK!

4.3.5 Genre by Loudness

boxplot(loudness~playlist_genre, data=spotify,
        main = "Variation of loudness between genres",
        xlab = "Loudness",
        ylab = "Genre",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE
)

Songs in EDM genre are louder in nature, cool!

4.4 Popularity by Acousticness

spotify$acousticness.scale <- scale(spotify$acousticness)
spotify %>%
  select(popularity_group, acousticness.scale, playlist_genre) %>%
  group_by(popularity_group)%>%
  filter(!is.na(popularity_group)) %>%
  filter(!is.na(acousticness.scale))%>%
  ggplot(mapping = aes(x = acousticness.scale, y = popularity_group, color = playlist_genre))+
  facet_wrap(~playlist_genre)+
  geom_point()+
  theme_minimal()

Acousticness does not affect track popularity: the level of acousticness is uniform across all popularity groups.
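
A visual impression like this can be backed with a per-group summary. The sketch below compares the median acousticness of each popularity group using base R; the toy values are made up and only show the mechanics.

```r
# Made-up data: two tracks per popularity group.
toy <- data.frame(
  popularity_group = c(1, 1, 2, 2, 3, 3, 4, 4),
  acousticness     = c(0.10, 0.30, 0.05, 0.35, 0.20, 0.22, 0.15, 0.25)
)

# Median acousticness per popularity group.
group_medians <- aggregate(acousticness ~ popularity_group, data = toy, FUN = median)

# Difference between the largest and smallest group median;
# a small value means the groups look alike on this feature.
spread <- diff(range(group_medians$acousticness))

print(group_medians)
```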

4.5 Popularity by Valence

spotify%>%
  select(popularity_group, valence, playlist_genre) %>%
  group_by(popularity_group)%>%
  filter(!is.na(popularity_group)) %>%
  filter(!is.na(valence))%>%
  ggplot(mapping = aes(x = popularity_group, y = valence, color = playlist_genre, fill = playlist_genre))+
  geom_bar(stat = 'identity')+
  coord_polar()+
  facet_wrap(~playlist_genre)+
  theme_minimal()

Songs with a higher valence index are more popular in the pop and rap genres and less popular in the EDM genre.

4.6 Energy Distribution

spotify$cut_energy <- cut(spotify$energy, breaks = 10)
spotify %>%
  ggplot( aes(x=cut_energy ))+
  geom_bar(width=0.2) +
  coord_flip() +
  scale_x_discrete(name="Energy")  

This supports the findings from the energy histogram: most tracks fall in the higher energy bins, so higher-energy songs dominate the catalogue.

4.7 Speechiness Distribution

spotify$cut_spe <- cut(spotify$speechiness, breaks = 10)
spotify %>%
  ggplot( aes(x=cut_spe ))+
  geom_bar(width=0.2) +
  coord_flip() +
  scale_x_discrete(name="Speechiness")  

This graph also supports the findings of our speechiness histogram. Most tracks fall in the lowest speechiness bins, so the dataset consists mainly of music rather than spoken-word content, which is consistent with listeners favouring less speechy songs.

4.8 Tempo and Liveness Distribution across Genre

spotify$liveness.scale <- scale(spotify$liveness)
spotify$tempo.scale <- scale(spotify$tempo)
spotify %>%
  select(tempo.scale, liveness.scale, playlist_genre) %>%
  group_by(playlist_genre)%>%
  filter(!is.na(tempo.scale)) %>%
  filter(!is.na(liveness.scale))%>%
  ggplot(mapping = aes(x = tempo.scale, y = liveness.scale, color = playlist_genre, fill = playlist_genre))+
  geom_bar(stat = 'identity')+
  coord_polar()+
  theme_minimal()

Tempo is noticeably higher for the EDM genre than for the others, while liveness is almost uniformly distributed across all genres.

4.9 Top 10 Artists and their Most Famous Tracks

spotify %>%
  select(track_name, track_artist, track_album_name, playlist_genre, track_popularity)%>%
  group_by(track_artist)%>%
  filter(!is.na(track_name))%>%
  filter(!is.na(track_artist))%>%
  filter(!is.na(track_album_name))%>%
  arrange(desc(track_popularity))%>%
  head(n = 10)%>%
  ggplot(mapping = aes(x = track_name, y =  track_artist, color = track_artist, fill = track_artist, size = track_popularity ))+
  geom_point()+
  coord_polar()+
  facet_wrap(~playlist_genre)+
  theme_minimal()+
  labs(x = 'track_name', y = 'track_artist', title = 'Top ten artists of spotify')+
  theme(plot.title = element_text(hjust=0.5),legend.position ='bottom')

The top songs are primarily from the Pop genre, the most popular song being by the artist ‘Tones & I’. Latin and Rap each also have one song in the top 10.

5.MODEL BUILDING

5.1 Data Clustering

Before clustering, we need to scale the numeric variables so that variables measured on larger scales do not dominate the result. The variables we chose for the cluster analysis are as follows:

  • Danceability

  • Energy

  • Loudness

  • Speechiness

  • Acousticness

  • Instrumentalness

  • Liveness

  • Valence

  • Tempo

  • Duration_min

We first check the updated dataset, since several variables were changed or added during the EDA.

str(spotify)
## 'data.frame':    28352 obs. of  27 variables:
##  $ track_name              : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
##  $ track_artist            : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_name        : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7928 10684 981 2869 15185 1882 11515 13093 17788 8155 ...
##  $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
##  $ playlist_name           : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
##  $ playlist_genre          : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ playlist_subgenre       : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
##  $ duration_min            : num  3.25 2.71 2.94 2.82 3.15 ...
##  $ popularity_group        : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ acousticness.scale      : num [1:28352, 1] -0.337 -0.47 -0.439 -0.666 -0.435 ...
##   ..- attr(*, "scaled:center")= num 0.177
##   ..- attr(*, "scaled:scale")= num 0.223
##  $ cut_energy              : Factor w/ 10 levels "(-0.000825,0.1]",..: 10 9 10 10 9 10 9 10 10 9 ...
##  $ cut_spe                 : Factor w/ 10 levels "(-0.000918,0.0918]",..: 1 1 1 2 1 2 1 1 1 1 ...
##  $ liveness.scale          : num [1:28352, 1] -0.8061 1.0652 -0.5193 0.0837 -0.6906 ...
##   ..- attr(*, "scaled:center")= num 0.191
##   ..- attr(*, "scaled:scale")= num 0.156
##  $ tempo.scale             : num [1:28352, 1] 0.04 -0.779 0.113 0.037 0.112 ...
##   ..- attr(*, "scaled:center")= num 121
##   ..- attr(*, "scaled:scale")= num 27
##  - attr(*, "na.action")= 'omit' Named int  8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr  "8152" "9283" "9284" "19569" ...

Scaling the numeric variables required for cluster analysis:

spotify_scaled <- scale(spotify[,-c(1,2,3,4,5,6,7,8,11,13,20,22,23,24,25,26,27)])
summary(spotify_scaled)
##   danceability         energy           loudness         speechiness     
##  Min.   :-4.4816   Min.   :-3.8047   Min.   :-13.0516   Min.   :-1.0526  
##  1st Qu.:-0.6336   1st Qu.:-0.6505   1st Qu.: -0.4915   1st Qu.:-0.6528  
##  Median : 0.1140   Median : 0.1288   Median :  0.1834   Median :-0.4421  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   :  0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7314   3rd Qu.: 0.7881   3rd Qu.:  0.6946   3rd Qu.: 0.2444  
##  Max.   : 2.2609   Max.   : 1.6437   Max.   :  2.6652   Max.   : 7.8994  
##   acousticness     instrumentalness     liveness          valence         
##  Min.   :-0.7952   Min.   :-0.3918   Min.   :-1.2249   Min.   :-2.177934  
##  1st Qu.:-0.7311   1st Qu.:-0.3918   1st Qu.:-0.6309   1st Qu.:-0.774014  
##  Median :-0.4375   Median :-0.3918   Median :-0.4103   Median : 0.006889  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000000  
##  3rd Qu.: 0.3716   3rd Qu.:-0.3636   3rd Qu.: 0.3724   3rd Qu.: 0.787793  
##  Max.   : 3.6659   Max.   : 3.8823   Max.   : 5.1643   Max.   : 2.050894  
##      tempo           duration_min    
##  Min.   :-4.48750   Min.   :-3.6439  
##  1st Qu.:-0.77858   1st Qu.:-0.6358  
##  Median : 0.03841   Median :-0.1578  
##  Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.48381   3rd Qu.: 0.4650  
##  Max.   : 4.39562   Max.   : 4.7680
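
With its default arguments, scale() centres each column on its mean and divides by its standard deviation, which is why every mean in the summary above is 0. The sketch below checks this on a made-up vector:

```r
# Made-up tempo-like values.
x <- c(100, 120, 140, 90, 150)

z_manual <- (x - mean(x)) / sd(x)   # z-score written out by hand
z_scale  <- as.numeric(scale(x))    # same result from scale(), matrix dropped

print(round(z_manual, 3))
```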

5.1.1 Elbow Method

Determining the optimal number of clusters:

wss <- function(data, maxCluster = 9) {
  # k = 1: the within-group sum of squares equals the total sum of squares
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:maxCluster) {
    # total within-cluster sum of squares for i clusters
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=19)
}

wss(spotify_scaled)
## Warning: did not converge in 10 iterations

By looking at the plot, we cannot see the “elbow” clearly. However, we think it occurs at k = 7, since the within-group sum of squares does not change significantly after k = 7. We therefore choose 7 as the number of clusters for this analysis.
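
The convergence warning above and the dependence on random starting centres can usually be reduced by passing nstart and iter.max to kmeans(). The sketch below does this on synthetic data, not on the Spotify data; the helper elbow_wss() and its parameter values are assumptions.

```r
set.seed(1)

# Synthetic data: three well-separated 2-D blobs of 50 points each.
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Hypothetical helper: total within-cluster sum of squares for k = 1..max_k.
# nstart tries several random starts per k; iter.max allows more iterations.
elbow_wss <- function(data, max_k = 6, nstart = 10, iter_max = 50) {
  wss <- numeric(max_k)
  wss[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))  # k = 1: total SS
  for (k in 2:max_k) {
    fit <- kmeans(data, centers = k, nstart = nstart, iter.max = iter_max)
    wss[k] <- fit$tot.withinss
  }
  wss
}

wss_values <- elbow_wss(toy)
print(round(wss_values, 1))
```

On data like this the curve drops sharply up to k = 3 and flattens afterwards; on the Spotify features the bend is far less pronounced, as noted above.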

5.1.2 K-means Clustering

spotify_kmeans <- kmeans(spotify_scaled, centers = 7)
spotify_kmeans$size
## [1] 1747 3754 7665 3256 3420 6265 2245
spotify_kmeans$centers
##   danceability     energy    loudness speechiness acousticness
## 1  -0.32257364  0.4754506  0.26305322  0.03586906  -0.27112012
## 2  -0.35426056 -1.5639363 -1.30289624 -0.35081509   1.68479950
## 3   0.64505164  0.1531843  0.11746811 -0.32734861  -0.14326327
## 4  -0.82623557  0.3662025  0.30810693  0.04551065  -0.27190863
## 5   0.59334595 -0.2548104 -0.09046971  2.03734288   0.04973956
## 6  -0.41165009  0.4071884  0.48311917 -0.40457685  -0.47413275
## 7   0.08422147  0.4428987 -0.08436790 -0.36427511  -0.47541674
##   instrumentalness    liveness     valence       tempo duration_min
## 1      -0.11655566  2.82065999 -0.01703516  0.02854606  0.111877644
## 2       0.08086478 -0.27948077 -0.50374425 -0.31054449 -0.042750825
## 3      -0.31784071 -0.26766758  0.86230076 -0.35362881 -0.007234689
## 4      -0.32264063 -0.11869411  0.18920129  1.64069058  0.037763749
## 5      -0.34242492 -0.08164052  0.12854595 -0.24406061 -0.198196919
## 6      -0.28900243 -0.13567781 -0.73428622 -0.16176324 -0.067695698
## 7       2.83675577 -0.13859692 -0.50961352  0.14811775  0.445202000
spotify$cluster <- spotify_kmeans$cluster
tail(spotify)
##                                 track_name
## 32828               Many Ways - Radio Edit
## 32829 City Of Lights - Official Radio Edit
## 32830  Closer - Sultan & Ned Shepard Remix
## 32831         Sweet Surrender - Radio Edit
## 32832       Only For You - Maor Levi Remix
## 32833               Typhoon - Original Mix
##                              track_artist track_popularity
## 32828 Ferry Corsten feat. Jenny Wahlstrom               27
## 32829                        Lush & Simon               42
## 32830                      Tegan and Sara               20
## 32831                         Starkillers               14
## 32832                              Mat Zo               15
## 32833                        Julian Calor               27
##                   track_album_name track_album_release_date
## 32828                    Many Ways                     2013
## 32829   City Of Lights (Vocal Mix)               2014-04-28
## 32830               Closer Remixed               2013-03-08
## 32831 Sweet Surrender (Radio Edit)               2014-04-21
## 32832       Only For You (Remixes)               2014-01-01
## 32833                Typhoon/Storm               2014-03-03
##           playlist_name playlist_genre         playlist_subgenre
## 32828    ♥ EDM LOVE 2020            edm progressive electro house
## 32829    ♥ EDM LOVE 2020            edm progressive electro house
## 32830    ♥ EDM LOVE 2020            edm progressive electro house
## 32831    ♥ EDM LOVE 2020            edm progressive electro house
## 32832    ♥ EDM LOVE 2020            edm progressive electro house
## 32833    ♥ EDM LOVE 2020            edm progressive electro house
##       danceability energy key loudness mode speechiness acousticness
## 32828        0.581  0.640   5   -8.367    1      0.0365     0.026600
## 32829        0.428  0.922   2   -1.814    1      0.0936     0.076600
## 32830        0.522  0.786   0   -4.462    1      0.0420     0.001710
## 32831        0.529  0.821   6   -4.899    0      0.0481     0.108000
## 32832        0.626  0.888   2   -3.361    1      0.1090     0.007920
## 32833        0.603  0.884   5   -4.571    0      0.0385     0.000133
##       instrumentalness liveness valence   tempo duration_ms duration_min
## 32828         0.00e+00   0.5720  0.2880 128.001      196993     3.283217
## 32829         0.00e+00   0.0668  0.2100 128.170      204375     3.406250
## 32830         4.27e-03   0.3750  0.4000 128.041      353120     5.885333
## 32831         1.11e-06   0.1500  0.4360 127.989      210112     3.501867
## 32832         1.27e-01   0.3430  0.3080 128.008      367432     6.123867
## 32833         3.41e-01   0.7420  0.0894 127.984      337500     5.625000
##       popularity_group acousticness.scale cut_energy            cut_spe
## 32828                2         -0.6758632  (0.6,0.7] (-0.000918,0.0918]
## 32829                3         -0.4514612    (0.9,1]     (0.0918,0.184]
## 32830                2         -0.7875706  (0.7,0.8] (-0.000918,0.0918]
## 32831                1         -0.3105367  (0.8,0.9] (-0.000918,0.0918]
## 32832                1         -0.7596998  (0.8,0.9]     (0.0918,0.184]
## 32833                2         -0.7946482  (0.8,0.9] (-0.000918,0.0918]
##       liveness.scale tempo.scale cluster
## 32828      2.4443564   0.2612840       1
## 32829     -0.7964367   0.2675539       6
## 32830      1.1806267   0.2627680       6
## 32831     -0.2627194   0.2608388       6
## 32832      0.9753508   0.2615437       6
## 32833      3.5348845   0.2606533       1

5.1.3 Insights

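  • The size of each cluster (the number of tracks assigned to it) is: cluster 1 1747, cluster 2 3754, cluster 3 7665, cluster 4 3256, cluster 5 3420, cluster 6 6265 and cluster 7 2245. Because K-means starts from randomly chosen centers, the cluster labels and sizes can change from one run to the next.

  • The centers show the mean value of each scaled variable within each cluster.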
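Both insights can be checked directly from a fitted model. The sketch below uses the built-in iris data as a stand-in for spotify_scaled (an assumption for illustration, as are the seed and the choice of 3 clusters). It shows that `size` is the count of labels per cluster and that `centers` holds the per-cluster column means.

```r
set.seed(42)
toy <- scale(iris[, 1:4])                     # stand-in for spotify_scaled
km  <- kmeans(toy, centers = 3, nstart = 10)

# Cluster sizes recomputed from the label vector
sizes_from_labels <- as.vector(table(km$cluster))

# Cluster centers recomputed as the mean of each variable within each cluster
centers_from_means <- aggregate(as.data.frame(toy),
                                by = list(cluster = km$cluster),
                                FUN = mean)

sizes_from_labels
centers_from_means
```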

Plotting using ‘factoextra’:

fviz_cluster(spotify_kmeans, data=spotify_scaled)

5.1.4 Goodness of Fit

We can assess the fit using the following 3 values:

1 - Within Sum of Squares tot.withinss : the sum of squared distances from each observation to the centroid of its cluster

spotify_kmeans$tot.withinss
## [1] 167883.7

2 - Total Sum of Squares totss : the sum of squared distances from each observation to the global sample mean

spotify_kmeans$totss
## [1] 283510

3 - Between Sum of Squares betweenss : the sum of squared distances from each cluster centroid to the global sample mean, weighted by the number of observations in the cluster

spotify_kmeans$betweenss
## [1] 115626.3

Another measure of goodness of fit is the ratio betweenss/totss (the closer the value is to 1 or 100%, the better):

((spotify_kmeans$betweenss)/(spotify_kmeans$totss))*100
## [1] 40.78384

A good clustering has high similarity within each cluster (low WSS) and large differences between clusters (high BSS). Equivalently, its BSS / totss ratio is close to 1 (100%).

From the unsupervised learning analysis above, we conclude that K-means clustering is workable on this dataset: the BSS / totss ratio of 40.78% means that the seven clusters account for about 41% of the total variation, which we consider reasonably high.
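The three quantities are linked by the identity totss = tot.withinss + betweenss, which is why the ratio lies between 0 and 1. As a quick check, the sketch below plugs in the rounded values printed above and then confirms the same identity on a toy fit. The iris data, the seed and the 3 clusters are assumptions for illustration only.

```r
# Values copied from the output above (rounded as printed)
tot_withinss <- 167883.7
betweenss    <- 115626.3
totss        <- 283510

# The decomposition holds up to rounding
decomposition_ok <- isTRUE(all.equal(tot_withinss + betweenss, totss))

# Share of total variation explained by the clusters, in percent
ratio <- betweenss / totss * 100

# The same identity on a toy k-means fit
set.seed(42)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 10)
toy_ok <- isTRUE(all.equal(km$tot.withinss + km$betweenss, km$totss))
```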

Repeating the same exercise with different combinations of variables, we obtained the best fit, and hence our optimized model, by excluding loudness, tempo and duration_min.
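One way such a comparison could be organised is sketched below. This is not the procedure used in the report: the helper `bss_ratio`, the seed, the iris stand-in data and the dropped column are all hypothetical and serve only to show the pattern of refitting on a subset of columns and comparing the betweenss/totss ratio.

```r
# Hypothetical helper: percent of total variation explained by a k-means fit
# on a chosen subset of columns (betweenss / totss * 100).
bss_ratio <- function(data, cols, k, n_start = 10) {
  fit <- kmeans(data[, cols, drop = FALSE], centers = k, nstart = n_start)
  fit$betweenss / fit$totss * 100
}

set.seed(2020)
toy <- scale(iris[, 1:4])          # stand-in for spotify_scaled

ratio_all     <- bss_ratio(toy, colnames(toy), k = 3)
ratio_reduced <- bss_ratio(toy, setdiff(colnames(toy), "Sepal.Width"), k = 3)

c(all = ratio_all, reduced = ratio_reduced)
```

Ratios computed on different variable sets are not strictly comparable, because dropping a variable also lowers totss, so the comparison is best treated as a rough guide.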

Finding what kind of song characterises each cluster in the optimized model:

spotify %>% 
  group_by(cluster) %>% 
  summarise_all(mean) %>% 
  select(cluster, acousticness, danceability, energy, instrumentalness, speechiness, valence, liveness)
## # A tibble: 7 x 8
##   cluster acousticness danceability energy instrumentalness speechiness
##     <int>        <dbl>        <dbl>  <dbl>            <dbl>       <dbl>
## 1       1       0.117         0.606  0.786           0.0640      0.112 
## 2       2       0.553         0.602  0.411           0.110       0.0720
## 3       3       0.145         0.747  0.726           0.0172      0.0744
## 4       4       0.117         0.533  0.766           0.0161      0.113 
## 5       5       0.188         0.740  0.652           0.0115      0.317 
## 6       6       0.0715        0.593  0.773           0.0239      0.0665
## 7       7       0.0713        0.666  0.780           0.751       0.0706
## # ... with 2 more variables: valence <dbl>, liveness <dbl>

5.1.5 Characteristics of Clusters

Based on the cluster means and centers printed above:

  • Cluster 1: Highest liveness, Highest energy
  • Cluster 2: Highest acousticness, Lowest energy, Lowest liveness
  • Cluster 3: Highest danceability, Highest valence
  • Cluster 4: Lowest danceability, Highest tempo
  • Cluster 5: Highest speechiness, Lowest instrumentalness
  • Cluster 6: Lowest speechiness, Lowest valence
  • Cluster 7: Highest instrumentalness, Lowest acousticness
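Reading the highest and lowest cluster per feature off a table of centers by eye is error-prone, so it can help to compute it. The sketch below rebuilds three columns of the centers printed in section 5.1.2 (values rounded to three decimals) and finds the extreme cluster for each. The object names are ours; on a real fit, the same two apply() calls could be run on spotify_kmeans$centers.

```r
# Three columns of the printed centers, rounded; rows are clusters 1 to 7
centers <- cbind(
  danceability = c(-0.323, -0.354,  0.645, -0.826,  0.593, -0.412,  0.084),
  energy       = c( 0.475, -1.564,  0.153,  0.366, -0.255,  0.407,  0.443),
  speechiness  = c( 0.036, -0.351, -0.327,  0.046,  2.037, -0.405, -0.364)
)
rownames(centers) <- 1:7

# For each feature, the cluster with the largest and the smallest center
extremes <- data.frame(
  feature = colnames(centers),
  highest = unname(apply(centers, 2, which.max)),
  lowest  = unname(apply(centers, 2, which.min))
)
extremes
```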

5.2 Song Recommendation System

Now, let’s check which cluster my favourite song belongs to. My favourite track from the list is Memories by Maroon 5 (listed in the dataset as the Dillon Francis remix).

spotify %>% 
  filter(track_name == "Memories - Dillon Francis Remix", track_artist == "Maroon 5")
##                        track_name track_artist track_popularity
## 1 Memories - Dillon Francis Remix     Maroon 5               67
##                  track_album_name track_album_release_date playlist_name
## 1 Memories (Dillon Francis Remix)               2019-12-13     Pop Remix
##   playlist_genre playlist_subgenre danceability energy key loudness mode
## 1            pop         dance pop        0.726  0.815  11   -4.969    1
##   speechiness acousticness instrumentalness liveness valence  tempo
## 1      0.0373       0.0724          0.00421    0.357   0.693 99.972
##   duration_ms duration_min popularity_group acousticness.scale cut_energy
## 1      162600         2.71                4         -0.4703109  (0.8,0.9]
##              cut_spe liveness.scale tempo.scale cluster
## 1 (-0.000918,0.0918]       1.065159  -0.7785794       3

So my favourite song is in cluster 3.

Now I want to try a new genre, r&b. Since my favourite song is Memories by Maroon 5, let’s draw a random sample of five r&b songs from the same cluster, which should be close to my taste.

spotify %>% 
  filter(cluster == 3, playlist_genre == "r&b") %>% 
  sample_n(5)
##                  track_name    track_artist track_popularity
## 1     You're Makin' Me High    Toni Braxton               43
## 2 Just Ain't Gonna Work Out Mayer Hawthorne               51
## 3        When Can I See You        Babyface                0
## 4                Tradición  Gloria Estefan               32
## 5                     Rolex       Ayo & Teo               73
##          track_album_name track_album_release_date
## 1 Secrets (Remix Package)                     1996
## 2   A Strange Arrangement                     2009
## 3        R&B Slow Grooves               2008-08-19
## 4               Mi Tierra               1993-06-03
## 5                   Rolex               2017-03-15
##                                       playlist_name playlist_genre
## 1                          1987-1997 OLD SKOOL JAMZ            r&b
## 2      Soul Coffee (The Best Neo-Soul Mixtape ever)            r&b
## 3 90s R&B - The BET Planet Groove/Midnight Love Mix            r&b
## 4                                  Cuban vibes only            r&b
## 5                                           Hip pop            r&b
##    playlist_subgenre danceability energy key loudness mode speechiness
## 1     new jack swing        0.852  0.576  10   -8.668    0      0.0377
## 2           neo soul        0.782  0.544   4   -5.448    1      0.0306
## 3     new jack swing        0.795  0.553   1   -8.752    0      0.0508
## 4 urban contemporary        0.571  0.698   0   -6.786    1      0.1090
## 5            hip pop        0.804  0.886   1   -2.512    1      0.0400
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.0108         1.21e-05   0.0848   0.902  92.123      267267
## 2       0.2120         1.06e-01   0.1790   0.753  90.562      150933
## 3       0.2470         5.39e-06   0.1800   0.586  84.621      228160
## 4       0.0839         3.19e-03   0.1270   0.909 132.215      320027
## 5       0.0837         0.00e+00   0.2660   0.789 144.946      238587
##   duration_min popularity_group acousticness.scale cut_energy
## 1     4.454450                3         -0.7467743  (0.5,0.6]
## 2     2.515550                3          0.1562195  (0.5,0.6]
## 3     3.802667                4          0.3133010  (0.5,0.6]
## 4     5.333783                2         -0.4186985  (0.6,0.7]
## 5     3.976450                4         -0.4195961  (0.8,0.9]
##              cut_spe liveness.scale tempo.scale cluster
## 1 (-0.000918,0.0918]    -0.68096902  -1.0697738       3
## 2 (-0.000918,0.0918]    -0.07668813  -1.1276862       3
## 3 (-0.000918,0.0918]    -0.07027326  -1.3480946       3
## 4     (0.0918,0.184]    -0.41026144   0.4176216       3
## 5 (-0.000918,0.0918]     0.48140569   0.8899360       3
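The two steps above (find the cluster of a favourite track, then pick songs of another genre from that cluster) can be wrapped in one function. The following is a minimal sketch in base R under stated assumptions: `recommend_songs` is a hypothetical name, and it ranks candidates by track_popularity rather than sampling at random as sample_n does above, which is our own reading of what "best" means. The toy data frame reuses a few rows shown in the outputs above and keeps only the columns the function needs.

```r
# Hypothetical helper: top-n tracks of a genre from the favourite track's cluster
recommend_songs <- function(data, track, artist, genre, n = 5) {
  is_fav <- data$track_name == track & data$track_artist == artist
  if (!any(is_fav)) stop("Track not found in the dataset")
  fav_cluster <- data$cluster[is_fav][1]

  # Same cluster, requested genre, and not the favourite track itself
  pool <- data[data$cluster == fav_cluster &
                 data$playlist_genre == genre & !is_fav, ]

  # Assumption: rank by popularity instead of sampling at random
  head(pool[order(-pool$track_popularity), ], n)
}

# Toy subset of the rows shown above
toy <- data.frame(
  track_name = c("Memories - Dillon Francis Remix", "You're Makin' Me High",
                 "Just Ain't Gonna Work Out", "Rolex",
                 "City Of Lights - Official Radio Edit"),
  track_artist = c("Maroon 5", "Toni Braxton", "Mayer Hawthorne",
                   "Ayo & Teo", "Lush & Simon"),
  track_popularity = c(67, 43, 51, 73, 42),
  playlist_genre = c("pop", "r&b", "r&b", "r&b", "edm"),
  cluster = c(3, 3, 3, 3, 6),
  stringsAsFactors = FALSE
)

recs <- recommend_songs(toy, "Memories - Dillon Francis Remix", "Maroon 5",
                        genre = "r&b", n = 2)
recs$track_name
```

Ranking by popularity makes the result repeatable, while sampling, as in the code above, gives more variety between runs.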

6.LEARNINGS & KEY TAKEAWAYS

We also attempted association mining on this dataset. We got stuck here because of the large number of unique tracks in the dataset and our limited processing capacity. From this we learnt that association mining may not be a good fit for a dataset with a large number of unique values. We tried reducing the dataset by deleting random rows, but even with 600 rows we got 520K rules, with the top 10 rules at 100% confidence, which is not feasible to analyse.

From our correlation plot, we observed that several variables are strongly correlated with each other, indicating that this dataset has multicollinearity. On digging deeper into this matter, we learnt that such datasets are not well suited to various classification algorithms. So we dropped our plan for CART (Classification Tree) and Random Forest.

Learning such ML nuances was a major takeaway for us from this project.