Introduction

Spotify is one of the largest music streaming services in the world. With 271 million monthly active users, including 124 million paying subscribers, it is an ideal platform for artists to reach their audience. At the heart of Spotify lives a massive and growing dataset. What if we could analyze the music we listen to using data science?

In this analysis, we will mine the nuggets of insight hidden beneath mountains of Spotify data and, in doing so, gain a better understanding of the genres, tracks and artists consumers have been listening to on Spotify.

Broadly, we will be performing the following steps to accomplish the project objectives:

  1. Perform Exploratory Data Analysis:
  • Visualization of audio features of different genres
  • Correlation between features
  • Popular artists within each genre
  2. Generate Insights:
  • Trends of popular features in the past decade
  • Variability of characteristics across genres
  3. Implement Clustering:
  • Apply the k-means clustering algorithm
  • Use the elbow curve to find the optimal number of clusters

Our analysis can help us understand consumer behavior and suggest what music listeners are looking for, thereby providing direction to artists and music producers.

Let’s explore the Spotify dataset to discover patterns and insights.

Packages Required

The following packages have been used for the analysis:

  • dplyr : For data manipulation
  • ggplot2 : For customizable graphical representation
  • plotly : For interactive plots
  • tidyverse : For data wrangling
  • kableExtra : For building tables
  • DT : For previewing the data sets
  • corrplot : For visualizing the correlation plots
  • gridExtra : For additional functionalities in grid graphics
  • treemap : For visualizing the treemap plots
  • viridisLite : For generating the color vectors
  • fmsb : For visualizing the radar plots
  • cowplot : For providing additional functionalities to ggplot
  • factoextra : For determining the optimal number of clusters
  • formattable : For more readable and impactful tabular formats
library(dplyr) 
library(ggplot2) 
library(plotly)
library(tidyverse) 
library(kableExtra) 
library(DT) 
library(corrplot) 
library(gridExtra) 
library(treemap)
library(viridisLite) 
library(fmsb) 
library(cowplot) 
library(factoextra)  
library(formattable) 

Data Preparation

Data Source

We will be using a subset of Spotify tracks’ metadata. This dataset was created using the spotifyr package and can be downloaded from this link

Explanation of Source Data

The dataset consists of 32,833 observations, one per track, each described by 23 attributes.

Below is the detailed data dictionary to understand all the variables present in the dataset.

DataDictionary <- read.csv("DataDictionary.csv")
songs <- read.csv("spotify_songs.csv")
DataDictionary%>% kable() %>% kable_styling(bootstrap_options = c("striped", "condensed", "responsive"), full_width = F)
variable class description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. ‘Ooh’ and ‘aah’ sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly ‘vocal’. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of song in milliseconds

Several of the audio features used to describe songs in the dataset are scaled between 0 and 1 for ease of comparison and interpretability.
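As a quick sanity check, we can verify that these features do stay within the 0-1 range (a minimal sketch using only the columns documented above):

# features documented as lying on a 0-1 scale
zero_one <- c("danceability", "energy", "speechiness", "acousticness",
              "instrumentalness", "liveness", "valence")

# range() per column: first row = minimum, second row = maximum
sapply(songs[, zero_one], range)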

Data Cleaning

We now take a look at the structure and summary statistics of the dataset. The summaries will help us spot anomalies such as negative values, and will also indicate which fields have missing values and how many.

str(songs)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
##  $ track_name              : Factor w/ 23449 levels "_away","¡Corre!",..: 9042 12696 896 3057 18176 1921 13676 15603 20742 9637 ...
##  $ track_artist            : Factor w/ 10692 levels "_tag","-M-","!!!",..: 2818 6149 10610 9348 5497 2818 4971 8291 749 8534 ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : Factor w/ 22545 levels "000f3dTtvpazVzv35NuZmn",..: 7684 17645 4144 4691 21907 8636 21592 17795 21050 13719 ...
##  $ track_album_name        : Factor w/ 19743 levels "_away","!","¡Hola!",..: 7760 10569 951 2836 15075 1853 11406 12985 17674 8047 ...
##  $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
##  $ playlist_name           : Factor w/ 449 levels "¡Viva Latino!",..: 309 309 309 309 309 309 309 309 309 309 ...
##  $ playlist_id             : Factor w/ 471 levels "0275i1VNfBnsNbPl0QIBpG",..: 237 237 237 237 237 237 237 237 237 237 ...
##  $ playlist_genre          : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ playlist_subgenre       : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
summary(songs)
##                    track_id        track_name              track_artist  
##  7BKLCZ1jbUBVqRi2FVlTVw:   10   Poison  :   22   Martin Garrix   :  161  
##  14sOS5L36385FJ3OL8hew4:    9   Breathe :   21   Queen           :  136  
##  3eekarcy7kvN4yt5ZFzltW:    9   Alive   :   20   The Chainsmokers:  123  
##  0nbXyq5TXYPCO7pr3N8S4I:    8   Forever :   20   David Guetta    :  110  
##  0qaWEvPkts34WF68r8Dzx9:    8   Paradise:   19   Don Omar        :  102  
##  0rIAC4PXANcKmitJfoqmVm:    8   (Other) :32726   (Other)         :32196  
##  (Other)               :32781   NA's    :    5   NA's            :    5  
##  track_popularity                track_album_id 
##  Min.   :  0.00   5L1xcowSxwzFUSJzvyMp48:   42  
##  1st Qu.: 24.00   5fstCqs5NpIlF42VhPNv23:   29  
##  Median : 45.00   7CjJb2mikwAWA1V6kewFBF:   28  
##  Mean   : 42.48   4VFG1DOuTeDMBjBLZT7hCK:   26  
##  3rd Qu.: 62.00   2HTbQ0RHwukKVXAlTmCZP2:   21  
##  Max.   :100.00   4CzT5ueFBRpbILw34HQYxi:   21  
##                   (Other)               :32666  
##                     track_album_name track_album_release_date
##  Greatest Hits              :  139   2020-01-10:  270        
##  Ultimate Freestyle Mega Mix:   42   2019-11-22:  244        
##  Gold                       :   35   2019-12-06:  235        
##  Malibu                     :   30   2019-12-13:  220        
##  Rock & Rios (Remastered)   :   29   2013-01-01:  219        
##  (Other)                    :32553   2019-11-15:  215        
##  NA's                       :    5   (Other)   :31430        
##                                                                    playlist_name  
##  Indie Poptimism                                                          :  308  
##  2020 Hits & 2019  Hits – Top Global Tracks \U0001f525\U0001f525\U0001f525:  247  
##  Permanent Wave                                                           :  244  
##  Hard Rock Workout                                                        :  219  
##  Ultimate Indie Presents... Best Indie Tracks of the 2010s                :  198  
##  Fitness Workout Electro | House | Dance | Progressive House              :  195  
##  (Other)                                                                  :31422  
##                  playlist_id    playlist_genre
##  4JkkvMpVl4lSioqQjeAL0q:  247   edm  :6043    
##  37i9dQZF1DWTHM4kX49UKs:  198   latin:5155    
##  6KnQDwp0syvhfHOR4lWP7x:  195   pop  :5507    
##  3xMQTDLOIGvj3lWH5e5x6F:  189   r&b  :5431    
##  3Ho3iO0iJykgEQNbjB2sic:  182   rap  :5746    
##  25ButZrVb1Zj1MJioMs09D:  109   rock :4951    
##  (Other)               :31713                 
##                  playlist_subgenre  danceability        energy        
##  progressive electro house: 1809   Min.   :0.0000   Min.   :0.000175  
##  southern hip hop         : 1675   1st Qu.:0.5630   1st Qu.:0.581000  
##  indie poptimism          : 1672   Median :0.6720   Median :0.721000  
##  latin hip hop            : 1656   Mean   :0.6548   Mean   :0.698619  
##  neo soul                 : 1637   3rd Qu.:0.7610   3rd Qu.:0.840000  
##  pop edm                  : 1517   Max.   :0.9830   Max.   :1.000000  
##  (Other)                  :22867                                      
##       key            loudness            mode         speechiness    
##  Min.   : 0.000   Min.   :-46.448   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.: -8.171   1st Qu.:0.0000   1st Qu.:0.0410  
##  Median : 6.000   Median : -6.166   Median :1.0000   Median :0.0625  
##  Mean   : 5.374   Mean   : -6.720   Mean   :0.5657   Mean   :0.1071  
##  3rd Qu.: 9.000   3rd Qu.: -4.645   3rd Qu.:1.0000   3rd Qu.:0.1320  
##  Max.   :11.000   Max.   :  1.275   Max.   :1.0000   Max.   :0.9180  
##                                                                      
##   acousticness    instrumentalness       liveness         valence      
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0151   1st Qu.:0.0000000   1st Qu.:0.0927   1st Qu.:0.3310  
##  Median :0.0804   Median :0.0000161   Median :0.1270   Median :0.5120  
##  Mean   :0.1753   Mean   :0.0847472   Mean   :0.1902   Mean   :0.5106  
##  3rd Qu.:0.2550   3rd Qu.:0.0048300   3rd Qu.:0.2480   3rd Qu.:0.6930  
##  Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910  
##                                                                        
##      tempo         duration_ms    
##  Min.   :  0.00   Min.   :  4000  
##  1st Qu.: 99.96   1st Qu.:187819  
##  Median :121.98   Median :216000  
##  Mean   :120.88   Mean   :225800  
##  3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :239.44   Max.   :517810  
## 

Missing values treatment

We can observe that there are 5 missing values in each of the columns ‘track_name’, ‘track_artist’ and ‘track_album_name’ in our dataset. Since these 5 records correspond to just 0.015% of the dataset, we remove these observations from further analysis.
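Counting the missing values per column confirms this before we drop the affected rows (a quick sketch):

# number of NA values in each column of the raw data
colSums(is.na(songs))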

songs_clean <- songs %>% filter(!is.na(track_name) & !is.na(track_artist) & !is.na(track_album_name))

Checking duplicate records

# flag rows that are exact duplicates of another row
songs_clean[duplicated(songs_clean) | duplicated(songs_clean, fromLast = TRUE), ]

The empty result above shows that all the rows in our dataset are unique.

Variable datatypes cleaning

We observed that the Spotify dataset has valid data types assigned to the corresponding variables, so no changes are required.
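As a one-line check (a sketch; the same information appears in the str() output above), the assigned type of every column can be listed with:

sapply(songs_clean, class)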

Creation of new variable

We will be generating year-wise trends in a later part of this project, so we extract the year from the ‘track_album_release_date’ variable into a new column.

songs_clean$year <- as.numeric(substring(songs_clean$track_album_release_date,1,4))

Deletion of unnecessary columns

We won’t need columns like ‘track_id’, ‘track_album_id’ and ‘playlist_id’ for the analysis, because they contain only long alphanumeric identifiers. Let’s get rid of these columns.

songs_clean <- songs_clean%>%dplyr::select(-track_id,-track_album_id,-playlist_id)

Checking final dimensions of cleaned dataset

After the data cleaning, we check the final number of rows and columns, as shown in the code below. The results show 32,828 unique tracks in the dataset.

dim(songs_clean)
## [1] 32828    21

Clean Data Preview

A preview of the clean dataset is given below:

head(songs_clean, 20) %>%
  datatable(options = list(scrollCollapse = TRUE,scrollX = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 1:4))
  ))

Exploratory Data Analysis

Data Overview

To begin our analysis, we simply want to plot the proportion of playlist genres across our dataset. The plot below depicts this proportion in the Spotify data.

Proportion of playlist genres

songs_clean_pie_data <- songs_clean %>% 
  group_by(playlist_genre) %>% 
  summarise(Total_number_of_tracks = length(playlist_genre))

ggplot(songs_clean_pie_data, aes(x="", y=Total_number_of_tracks, fill=playlist_genre)) + 
  geom_bar(width = 1, stat = "identity") + 
  coord_polar("y", start=0) + 
  geom_text(aes(label = paste(round(Total_number_of_tracks / sum(Total_number_of_tracks) * 100, 1), "%")),
            position = position_stack(vjust = 0.5))

  • It appears that our Spotify dataset is fairly uniformly distributed across playlist genres, with each genre accounting for 15-18% of the records.
  • EDM has the highest proportion at 18.4%.

Correlation between variables

To understand the correlation among variables, we’ll use the corrplot function in R, one of the basic data visualization functions.

songs_correlation <- cor(songs_clean[,-c(1,2,4,5,6,7,8)])
corrplot(songs_correlation, type = "upper", tl.srt = 45)

  • It appears that energy and loudness are highly positively correlated, while energy and acousticness are highly negatively correlated with each other.
  • Also, loudness & acousticness, valence & danceability, loudness & year and duration_ms & year are moderately correlated with each other (see the programmatic check below).
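The strongest pairs can also be extracted programmatically instead of being read off the plot (a minimal sketch over the same songs_correlation matrix):

# flatten the correlation matrix into pairs, keeping each pair once
cor_pairs <- as.data.frame(as.table(songs_correlation))
cor_pairs <- subset(cor_pairs, as.integer(Var1) < as.integer(Var2))

# top 5 pairs by absolute correlation
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 5)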

Density Plots of Variables

Let’s see how energy, danceability, valence, acousticness, speechiness and liveness are distributed over the observations in our dataset. We plot the density curves of all 6 variables together, as they are all on the same scale and range from 0 to 1.

correlated_density <- ggplot(songs_clean) +
    geom_density(aes(energy, fill = "energy"), alpha = 0.1) + 
    geom_density(aes(danceability, fill = "danceability"), alpha = 0.1) + 
    geom_density(aes(valence, fill = "valence"), alpha = 0.1) + 
    geom_density(aes(acousticness, fill = "acousticness"), alpha = 0.1) + 
    geom_density(aes(speechiness, fill = "speechiness"), alpha = 0.1) + 
    geom_density(aes(liveness, fill = "liveness"), alpha = 0.1) + 
    scale_x_continuous(name = "Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
    scale_y_continuous(name = "Density") +
    ggtitle("Density plot of Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
    theme_bw() +
    theme(plot.title = element_text(size = 10, face = "bold"),
          text = element_text(size = 10)) +
    theme(legend.title=element_blank()) +
    scale_fill_brewer(palette="Accent")

correlated_density

  • The graph shows that the distributions of speechiness, acousticness and liveness are right-skewed, with most values close to 0 and a long tail to the right.
  • The density curves of danceability and energy are left-skewed, with most values toward the high end, while valence has a roughly bell-shaped distribution (the numeric check below confirms these skew directions).
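The skew directions can be verified via the third standardized moment (a minimal sketch that avoids extra packages; positive values indicate right skew, negative values left skew):

# approximate sample skewness: third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sapply(songs_clean[, c("speechiness", "acousticness", "liveness",
                       "danceability", "energy", "valence")], skewness)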

Density plots of loudness, duration and track_popularity

loudness_density <- ggplot(songs_clean) +
    geom_density(aes(loudness, fill ="loudness")) + 
    scale_x_continuous(name = "Loudness") +
    scale_y_continuous(name = "Density") +
    ggtitle("Density plot of Loudness") +
    theme_bw() +
    theme(plot.title = element_text(size = 14, face = "bold"),
            text = element_text(size = 12)) +
    theme(legend.title=element_blank()) +
    scale_fill_brewer(palette="Paired")

duration_ms_density <- ggplot(songs_clean) +
    geom_density(aes(duration_ms, fill ="duration_ms")) + 
    scale_x_continuous(name = "duration_ms") +
    scale_y_continuous(name = "Density") +
    ggtitle("Density plot of duration_ms") +
    theme_bw() +
    theme(plot.title = element_text(size = 14, face = "bold"),
            text = element_text(size = 12)) +
    theme(legend.title=element_blank()) +
    scale_fill_brewer(palette="Dark2")

track_popularity_density <- ggplot(songs_clean) +
    geom_density(aes(track_popularity, fill ="track_popularity")) + 
    scale_x_continuous(name = "track_popularity") +
    scale_y_continuous(name = "Density") +
    ggtitle("Density plot of track_popularity") +
    theme_bw() +
    theme(plot.title = element_text(size = 14, face = "bold"),
            text = element_text(size = 12)) +
    theme(legend.title=element_blank()) +
    scale_fill_brewer(palette="RdBu")

grid.arrange(loudness_density, duration_ms_density,track_popularity_density, nrow = 3)

  • The distribution of loudness is left-skewed, with most tracks close to 0 decibels. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude).
  • The duration of tracks has a roughly normal distribution, with the mean around 226,000 ms.
  • Song popularity usually lies between 25 and 75, with a lot of songs having a popularity of 0 (see the quick check below).
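A quick numeric check of these claims against the cleaned data (a sketch):

mean(songs_clean$duration_ms)                        # average duration in ms
quantile(songs_clean$track_popularity, c(.25, .75))  # bulk of popularity scores
sum(songs_clean$track_popularity == 0)               # tracks with zero popularity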

Visualizing top artists within each genre

top_genre <- songs_clean %>% 
  select(playlist_genre, track_artist, track_popularity) %>% 
  group_by(playlist_genre, track_artist) %>% 
  summarise(n = n()) %>% 
  top_n(15, n)

tm <- treemap(top_genre, index = c("playlist_genre", "track_artist"), vSize = "n", vColor = 'playlist_genre', palette =  viridis(6),title="Top 15 Track Artists within each Playlist Genre")

  • The treemap above depicts the top 15 track artists within each of the 6 playlist genres. The size of each box corresponds to the number of tracks by that artist.
  • For the genres edm, rock, pop, rap, latin and r&b, the top track artists are Martin Garrix, Queen, The Chainsmokers, Logic, Don Omar and Bobby Brown respectively.

How do track characteristics vary across genres?

Let’s take a look at how various characteristics of our tracks are different among the 6 genres using radar charts.

A radar chart is useful for comparing the musical vibes of genres in a more visual way. To plot it, we normalized the danceability, energy, loudness, speechiness, valence, instrumentalness and acousticness values to lie between 0 and 1, which makes the chart clearer and more readable.

To generate these radar plots we built a user-defined function which takes a playlist genre as an argument and returns its corresponding radar chart.

Plots showing variability of characteristics across genres

radar_chart <- function(arg){
  songs_clean_filtered <- songs_clean %>% filter(playlist_genre == arg)
  radar_data_v1 <- songs_clean_filtered %>%
    select(danceability, energy, loudness, speechiness, valence, instrumentalness, acousticness)
  # min-max normalize each feature to 0-1, then take the per-feature mean
  radar_data_v2 <- apply(radar_data_v1, 2, function(x){(x - min(x)) / diff(range(x))})
  radar_data_v3 <- apply(radar_data_v2, 2, mean)
  # radarchart() expects the max and min rows first; there are 7 features
  radar_data_v4 <- rbind(rep(1, 7), rep(0, 7), radar_data_v3)
  return(radarchart(as.data.frame(radar_data_v4), title = arg))
}

par(mfrow = c(2, 3))
Chart_pop<-radar_chart("pop")
Chart_rb<-radar_chart("r&b")
Chart_edm<-radar_chart("edm")
Chart_latin<-radar_chart("latin")
Chart_rap<-radar_chart("rap")
Chart_rock<-radar_chart("rock")

  • Interestingly, Latin music is the loudest among all genres.
  • Moreover, Latin music is also the most danceable. It seems people prefer to dance to loud Latin music.
  • EDM music has the highest acousticness, which is in line with our expectations.
  • R&B and rap tracks tend to have relatively less energy in comparison to the rest of the genres.

Did music taste change in the last decade?

In this part of our project, we try to find out how the features change over time. We group the songs by release year, compute the average of each feature per year, and visualize the result. To generate these plots we built a user-defined function which takes a track feature as an argument and returns its trend chart.

Plots showing change in tracks’ feature values in last one decade

trend_chart <- function(arg){
  trend_change <- songs_clean %>% 
    filter(year > 2010) %>% 
    group_by(year) %>% 
    summarise(Average = mean(.data[[arg]]))
  chart <- ggplot(data = trend_change, aes(x = year, y = Average)) + 
    geom_line(color = "#00AFBB", size = 1) +
    scale_x_continuous(breaks = seq(2011, 2020, 1)) + 
    scale_y_continuous(name = arg)
  return(chart)
}

trend_chart_track_popularity<-trend_chart("track_popularity")
trend_chart_danceability<-trend_chart("danceability")
trend_chart_energy<-trend_chart("energy")
trend_chart_loudness<-trend_chart("loudness")
trend_chart_duration_ms<-trend_chart("duration_ms")
trend_chart_speechiness<-trend_chart("speechiness")

plot_grid(trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_ms, trend_chart_speechiness,ncol = 2, label_size = 1)

  • What interests us the most is that the duration of tracks shows a continuously decreasing trend, meaning that songs are getting shorter every year.
  • Furthermore, the danceability of tracks is continuously rising, which suggests people are enjoying danceable songs.
  • Energy and loudness follow almost the same trend each year, with peaks and dips in the same years, reflecting the high positive correlation between them.
  • Average song popularity reached its decade minimum in 2014 and has been rising continuously since, suggesting that, on average, songs are becoming more popular over time.

K-Means Clustering

In this section, we perform K-means clustering on the Spotify dataset and analyze how the output changes as the number of clusters increases. We then identify the optimal number of clusters K using the elbow method.

K-means clustering is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid. The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.
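To make this objective concrete, here is a small toy sketch (not part of the original analysis) showing that the total within-cluster sum of squares reported by kmeans() matches the objective recomputed by hand from the assignments and centroids:

# toy data: two columns of standard normal noise
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)
fit <- kmeans(toy, centers = 2, nstart = 25)

# recompute the objective: squared distances of points to their own centroid
wss <- sum(sapply(1:2, function(k) {
  pts <- toy[fit$cluster == k, , drop = FALSE]
  sum(scale(pts, center = fit$centers[k, ], scale = FALSE)^2)
}))

all.equal(wss, fit$tot.withinss)  # TRUE

kmeans() only guarantees a local optimum of this objective, which is why we use nstart = 25 random restarts throughout this section.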

Since K-means is a distance-based algorithm, differences in magnitude across features can create a problem, so let’s first bring all the variables to the same scale.

# select required song features for clustering
cluster.input <- songs_clean[, c('energy', 'liveness', 'tempo', 'speechiness', 'acousticness',
                                 'instrumentalness', 'danceability', 'duration_ms', 'loudness', 'valence')]

# scale features (mean 0, standard deviation 1) for input to clustering
cluster.input.scaled <- scale(cluster.input)

Visualizing clusters for different values of k

We will first fit multiple k-means models and in each successive model, we will increase the number of clusters. We will then plot the results for visualization as below.

# kmeans with different k values
k2 <- kmeans(cluster.input.scaled, centers = 2, nstart = 25)
k3 <- kmeans(cluster.input.scaled, centers = 3, nstart = 25)
k4 <- kmeans(cluster.input.scaled, centers = 4, nstart = 25)
k5 <- kmeans(cluster.input.scaled, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point",  data = cluster.input.scaled) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = cluster.input.scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = cluster.input.scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = cluster.input.scaled) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)

Identifying optimal number of clusters in K-means clustering

The elbow method is widely used to determine the optimal number of clusters in K-means clustering. It considers the total within-cluster sum of squares (WSS) as a function of the number of clusters K. The optimal value of K is the one beyond which adding another cluster no longer substantially reduces the total WSS.
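The curve can also be computed by hand, which makes the mechanics of the method explicit (a minimal sketch mirroring what fviz_nbclust() does with method = "wss"; as in the code below, only the first 1,000 rows are used):

set.seed(100)
wss <- sapply(1:10, function(k) {
  kmeans(cluster.input[1:1000, ], centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")

The factoextra helper below produces the same curve directly.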

set.seed(100)
fviz_nbclust(cluster.input[1:1000,], kmeans, method = "wss")

Looking at the above elbow curve, we can say that the optimal value of k is 3.

Let’s take a look at how our within sum of squared error changes with k in a table format.

n_clust<-fviz_nbclust(cluster.input[1:1000,], kmeans, method = "wss")
n_clust<-n_clust$data

n_clust %>% rename(Number_of_clusters=clusters,Within_sum_of_squared_error=y) %>% 
  mutate(Within_sum_of_squared_error = color_tile("white", "red")(Within_sum_of_squared_error)) %>% 
  kable("html", escape = F) %>% 
  kable_styling("hover", full_width = F) %>% 
  column_spec(2, width = "5cm") %>%
  row_spec(3:3, bold = T, color = "white", background = "grey")
Number_of_clusters Within_sum_of_squared_error
1 1.191996e+12
2 5.466280e+11
3 3.070086e+11
4 1.973334e+11
5 1.370321e+11
6 1.091560e+11
7 7.225112e+10
8 6.101159e+10
9 5.167471e+10
10 4.697773e+10

We observe that the drop in the within sum of squared error begins to level off around k = 3, which is consistent with what we inferred from the elbow curve above.

Summary

5.1 Problem Statement

The analysis was intended to understand the evolution of music over time as well as the characteristics of various genres of music. In addition, we identified the underlying patterns and relationships among the various features that describe music, using Spotify data.

5.2 Methodology

  • We started by looking at correlations amongst various features followed by the distributions for each of these features.
  • We then examined the top artists within each genre using a treemap.
  • This was followed by looking at variability of characteristics amongst the 6 genres using radar charts.
  • We then analysed how various major features evolved in the past decade.
  • Lastly, we performed k-means clustering to understand commonalities between various genres and found the optimal number of clusters using the elbow method.

5.3 Insights

  • Energy is highly positively correlated with loudness while it is highly negatively correlated with acousticness.
  • A lot of songs have a popularity score of 0, which means many songs haven’t been discovered yet.
  • For the genres EDM, rock, pop, rap, latin and r&b, the top artists are Martin Garrix, Queen, The Chainsmokers, Logic, Don Omar and Bobby Brown respectively.
  • Latin music is the loudest as well as the most danceable among all genres. R&B and rap tracks tend to have relatively less energy compared to other genres, while EDM has the highest acousticness.
  • Interestingly, songs are getting shorter by the year, while also becoming more danceable over time.
  • Even though we have six genres in the data, the clustering yielded only 3 optimal clusters, which suggests that many of the genres are musically very similar.

5.4 Implications

This analysis was conducted to explore the evolution of music over time. Diving deeper to understand what makes one song more danceable than another can help DJs, artists, and producers create music based on characteristics like tempo or level of speechiness.

Netflix has done a commendable job of leveraging data to produce video content, and the next music revolution could be brought about by similar techniques.

5.5 Limitations

Even though millions of songs exist, we only had about 32k records for our analysis, and hence we couldn’t obtain a full picture of the features of music.

The analysis could also be strengthened by incorporating user-related features such as demographic attributes and listening history.