Data Wrangling - Spotify Dataset Analysis

Introduction

Background

Spotify is a Swedish audio streaming and media services provider that launched in October 2008. It is now one of the biggest digital music, podcast, and video streaming services in the world, giving access to millions of songs from artists all over the world.

Spotify is a freemium service: basic features are free, with advertisements and limited control, while additional features such as offline and commercial-free listening are offered via paid subscriptions. Users can search for music by artist, album, or genre, and can create, edit, and share playlists. Not only does Spotify give us access to songs on multiple platforms, it also exposes listeners to trending and upcoming artists from genres they might never have encountered. Spotify also uses audio analysis technology to track and identify each song uploaded to its platform.

The Spotify dataset provides insight into the tracks that users listen to: it records not just the popularity of each track but also its audio features.

In this project I will analyze a track’s popularity in relation to several audio features provided in the dataset and determine whether popularity can be predicted from key features of the song.

I also plan to analyse users’ listening profiles so that Spotify can suggest and acquire similar tracks for its platform and improve the user experience.

Proposed Analytical Methodology

The plan is to analyze the relationship between popularity and the different features of a song, and possibly later to perform cluster analysis using the K-means method to provide song recommendations based on recent user listening on Spotify.

Usefulness of Analysis

This analysis is mainly useful for marketing songs to Spotify users and improving their experience. It will help us better understand the different clusters and enable Spotify to target content distribution more precisely, which would help the developers and the marketing team analyze trends, segment users better, increase profits, and provide a better user experience.

Packages

The following packages were used:

tidyverse - which will provide us functionality to model, transform, and visualize data.
dplyr - used for data manipulation in R
ggplot2 - used for plotting charts
plotly - for web-based graphs via the open source JavaScript graphing library plotly.js for interactive charts
corrplot - for displaying correlation matrices and confidence intervals
factoextra - to visualize the output of multivariate data analysis
funModeling - Exploratory Data Analysis and Data Preparation Tool-Box
plyr - break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together
RColorBrewer - to help you choose sensible colour schemes for figures in R

library(plyr)         # loaded first so that dplyr's mutate(), summarise(), etc. are not masked by plyr
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(knitr)
library(RColorBrewer)
library(funModeling)
library(kableExtra)
library(DT)
library(gridExtra)
library(viridisLite)
library(fmsb)

Data Preparation

This section contains all the procedures I’ve followed in preparing the data for analysis. Each step is explained alongside the code for that step.

Data Source for the Spotify Data

The dataset used for this project is the Spotify Genre dataset, which was provided in the course curriculum. More details about the dataset are provided below.

Information on the Data and its actual source

The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

It’s likely that Spotify uses these features to power products like Spotify Radio and custom playlists like Discover Weekly and Daily Mixes.

After an initial look, the data shows few peculiarities. It is moderately clean, with only 15 missing values. As every row is unique in some sense, I will not impute the missing values and will simply remove those rows instead.

Importing the Data

First, the Spotify dataset is loaded into R to begin the analysis. The dataset has been imported using the read_csv function from the readr package and saved as “spotify_data”.

spotify_data <- readr::read_csv('https://raw.githubusercontent.com/nairrj/DataWrangling/main/spotify_songs.csv')

Now we’ll take a brief look at the dataset using the head and glimpse functions, and check the column names and dimensions with colnames and dim.

head(spotify_data)

## # A tibble: 6 x 23
##   track_id track_name track_artist track_popularity track_album_id
##   <chr>    <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0~ I Don't C~ Ed Sheeran                 66 2oCs0DGTsRO98~
## 2 0r7CVbZ~ Memories ~ Maroon 5                   67 63rPSO264uRjW~
## 3 1z1Hg7V~ All the T~ Zara Larsson               70 1HoSmj2eLcsrR~
## 4 75Fpbth~ Call You ~ The Chainsm~               60 1nqYsOef1yKKu~
## 5 1e8PAfc~ Someone Y~ Lewis Capal~               69 7m7vv9wlQ4i0L~
## 6 7fvUMiy~ Beautiful~ Ed Sheeran                 67 2yiy9cd2QktrN~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

glimpse(spotify_data)

## Observations: 32,833
## Variables: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCY...
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud ...
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", ...
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58...
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X...
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud L...
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", ...
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Po...
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD...
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", ...
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "da...
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, ...
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, ...
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5,...
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5...
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0....
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.0803...
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0....
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0....
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, ...
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976...
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16...

colnames(spotify_data)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

dim(spotify_data)

## [1] 32833    23

Our dataset has 32,833 observations and 23 variables.

Cleaning the Data

Remove Null values

Using the colSums function, we observe that three columns (track_name, track_artist and track_album_name) have 5 NAs each.

I have then removed the affected observations using the na.omit function.

colSums(is.na(spotify_data))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

spotify_data <- na.omit(spotify_data)
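A per-column NA count does not say how many rows are affected, because several NAs can sit in the same row. The sketch below illustrates the check on a small made-up data frame (the `toy` object and its values are hypothetical, not taken from the Spotify data), using only base R.

```r
# Toy data frame standing in for spotify_data (values are made up)
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C", "Song D"),
  track_artist     = c("Artist 1", NA, "Artist 3", "Artist 4"),
  track_popularity = c(66, 0, 70, 45),
  stringsAsFactors = FALSE
)

# Missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with at least one NA (a row with several NAs is counted once)
n_incomplete <- sum(!complete.cases(toy))
print(n_incomplete)

# na.omit() drops every incomplete row
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Comparing `n_incomplete` with the column totals shows whether the missing values are concentrated in a few rows before they are dropped.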

Remove Duplicates

I will now keep only unique tracks by removing all rows with a duplicated track_id, using the duplicated function.

spotify_data <- spotify_data[!duplicated(spotify_data$track_id),]
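As a small illustration of what this filter does, the sketch below runs the same `duplicated()` idiom on a made-up data frame (the `toy` object is hypothetical). It shows that the first occurrence of each track_id is kept, so a track that appears on several playlists retains the playlist of its first row.

```r
# Toy data: track "a1" appears on two different playlists (made-up values)
toy <- data.frame(
  track_id      = c("a1", "b2", "a1", "c3"),
  playlist_name = c("Pop Remix", "Pop Remix", "Dance Pop", "Rock Classics"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each id
n_dupes <- sum(duplicated(toy$track_id))
print(n_dupes)

# Keep the first occurrence of every track_id
toy_unique <- toy[!duplicated(toy$track_id), ]
print(toy_unique)
```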

Transform the Variables

I have converted genre, subgenre, mode and key to factors to facilitate the analysis, since the values in those fields are categorical.

spotify_data <- spotify_data %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))

Converting duration_ms to a duration in minutes (duration_min), since minutes are easier to interpret in the analysis.

spotify_data <- spotify_data %>% mutate(duration_min = duration_ms/60000)

Creating New Variables

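To explore the distribution of popularity, I have created a new variable that divides popularity into 4 groups for the cluster analysis. Note that the first condition requires track_popularity > 0, so tracks with a popularity of exactly 0 fall through to the final TRUE branch and are counted in group 4.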

spotify_data <- spotify_data %>% 
  mutate(popularity_group = as.numeric(case_when(
    ((track_popularity > 0) & (track_popularity < 20)) ~ "1",
    ((track_popularity >= 20) & (track_popularity < 40))~ "2",
    ((track_popularity >= 40) & (track_popularity < 60)) ~ "3",
    TRUE ~ "4"))
    )
table(spotify_data$popularity_group)

## 
##    1    2    3    4 
## 4182 6162 8975 9033
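One way to make the bin edges explicit, including the boundary values 0 and 100, is base R's `cut()`. This is an alternative sketch on a made-up vector, not the code that produced the counts above; with these breaks a popularity of 0 lands in group 1.

```r
# Made-up popularity scores, including the boundary values 0 and 100
popularity <- c(0, 19, 20, 39, 40, 59, 60, 100)

popularity_group <- cut(
  popularity,
  breaks = c(0, 20, 40, 60, 100),
  right = FALSE,          # intervals are closed on the left: [0, 20), [20, 40), ...
  include.lowest = TRUE,  # with right = FALSE this closes the last interval: [60, 100]
  labels = FALSE          # return integer group codes 1..4 instead of a factor
)
print(popularity_group)
```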

Removing Variables

We have removed track_id, track_album_id and playlist_id from the dataset since they are not useful for our analysis. These IDs are in the Spotify dataset only to uniquely identify tracks, albums and playlists in the database.

spotify_data <- spotify_data %>% select(-c(track_id, track_album_id, playlist_id))

summary(spotify_data)

##   track_name        track_artist       track_popularity track_album_name  
##  Length:28352       Length:28352       Min.   :  0.00   Length:28352      
##  Class :character   Class :character   1st Qu.: 21.00   Class :character  
##  Mode  :character   Mode  :character   Median : 42.00   Mode  :character  
##                                        Mean   : 39.34                     
##                                        3rd Qu.: 58.00                     
##                                        Max.   :100.00                     
##                                                                           
##  track_album_release_date playlist_name      playlist_genre
##  Length:28352             Length:28352       edm  :4877    
##  Class :character         Class :character   latin:4136    
##  Mode  :character         Mode  :character   pop  :5132    
##                                              r&b  :4504    
##                                              rap  :5398    
##                                              rock :4305    
##                                                            
##                  playlist_subgenre  danceability        energy        
##  southern hip hop         : 1582   Min.   :0.0000   Min.   :0.000175  
##  indie poptimism          : 1547   1st Qu.:0.5610   1st Qu.:0.579000  
##  neo soul                 : 1478   Median :0.6700   Median :0.722000  
##  progressive electro house: 1460   Mean   :0.6534   Mean   :0.698372  
##  electro house            : 1416   3rd Qu.:0.7600   3rd Qu.:0.843000  
##  gangster rap             : 1314   Max.   :0.9830   Max.   :1.000000  
##  (Other)                  :19555                                      
##       key           loudness       mode       speechiness      acousticness   
##  1      : 3436   Min.   :-46.448   0:12318   Min.   :0.0000   Min.   :0.0000  
##  0      : 3001   1st Qu.: -8.310   1:16034   1st Qu.:0.0410   1st Qu.:0.0143  
##  7      : 2907   Median : -6.261             Median :0.0626   Median :0.0797  
##  9      : 2631   Mean   : -6.818             Mean   :0.1079   Mean   :0.1772  
##  11     : 2577   3rd Qu.: -4.709             3rd Qu.:0.1330   3rd Qu.:0.2600  
##  2      : 2478   Max.   :  1.275             Max.   :0.9180   Max.   :0.9940  
##  (Other):11322                                                                
##  instrumentalness       liveness         valence           tempo       
##  Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000000   1st Qu.:0.0926   1st Qu.:0.3290   1st Qu.: 99.97  
##  Median :0.0000206   Median :0.1270   Median :0.5120   Median :121.99  
##  Mean   :0.0911294   Mean   :0.1910   Mean   :0.5104   Mean   :120.96  
##  3rd Qu.:0.0065725   3rd Qu.:0.2490   3rd Qu.:0.6950   3rd Qu.:134.00  
##  Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910   Max.   :239.44  
##                                                                        
##   duration_ms      duration_min     popularity_group
##  Min.   :  4000   Min.   :0.06667   Min.   :1.000   
##  1st Qu.:187741   1st Qu.:3.12902   1st Qu.:2.000   
##  Median :216933   Median :3.61555   Median :3.000   
##  Mean   :226575   Mean   :3.77624   Mean   :2.806   
##  3rd Qu.:254975   3rd Qu.:4.24959   3rd Qu.:4.000   
##  Max.   :517810   Max.   :8.63017   Max.   :4.000   
##

head(spotify_data)

## # A tibble: 6 x 22
##   track_name track_artist track_popularity track_album_name track_album_rel~
##   <chr>      <chr>                   <dbl> <chr>            <chr>           
## 1 I Don't C~ Ed Sheeran                 66 I Don't Care (w~ 2019-06-14      
## 2 Memories ~ Maroon 5                   67 Memories (Dillo~ 2019-12-13      
## 3 All the T~ Zara Larsson               70 All the Time (D~ 2019-07-05      
## 4 Call You ~ The Chainsm~               60 Call You Mine -~ 2019-07-19      
## 5 Someone Y~ Lewis Capal~               69 Someone You Lov~ 2019-03-05      
## 6 Beautiful~ Ed Sheeran                 67 Beautiful Peopl~ 2019-07-11      
## # ... with 17 more variables: playlist_name <chr>, playlist_genre <fct>,
## #   playlist_subgenre <fct>, danceability <dbl>, energy <dbl>, key <fct>,
## #   loudness <dbl>, mode <fct>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>, duration_min <dbl>, popularity_group <dbl>

Description of Attributes

Each row represents one song, and each column contains an attribute of that song. The attributes are as follows:

track_id : Unique ID of the song
track_name : Title / name of the song
track_artist : Name of the artist
track_popularity : Popularity of the song from 0 to 100, based on the number of plays of the track
track_album_release_date : Release date of the album the song appears on
track_album_name : Name of the album the song appears on
playlist_name : Name of the playlist the song is in
playlist_genre : Genre of the playlist the song is in
acousticness : Measure of how acoustic the track is, ranging from 0.0 to 1.0
danceability : Describes how suitable a track is for dancing. Values range from 0.0 (least danceable) to 1.0 (most danceable)
duration_ms : Duration of the track in milliseconds (ms), which has been converted to minutes in duration_min
energy : Measure from 0.0 to 1.0 representing the perceived intensity and activity, i.e. the energy, of the song
instrumentalness : Predicts whether a track contains no vocals; “ooh” and “aah” sounds are treated as instrumental in this context. Values range from 0.0 to 1.0, with higher values indicating a greater likelihood that the track has no vocal content
speechiness : Detects the presence of spoken words in a track. According to Spotify’s documentation, values above 0.66 describe tracks that are probably made entirely of spoken words (such as a podcast or talk show), values between 0.33 and 0.66 describe tracks that may contain both music and speech, and values below 0.33 most likely represent music
valence : Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive, while tracks with low valence sound more negative
key : Estimated overall key of the track. If no key is detected, the value is -1
liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live
loudness : Overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB
mode : Indicates the modality (major or minor) of a track. Major is represented by 1 and minor by 0
tempo : Overall estimated tempo of a track in beats per minute (BPM)

Exploratory Data Analysis

Exploratory Data Analysis (EDA), when done correctly, helps us uncover useful information in the data that is not self-evident.

EDA is essential before we start to build a model on the data.

With EDA we can understand the patterns within the data, detect outliers or anomalous events and find interesting relations among the variables.

I have used a correlation plot, histograms and boxplots in my EDA.

Correlation Plot

corr_spotify <- select(spotify_data, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
corrplot(cor(corr_spotify), type = "lower")

Based on the plot, popularity does not have a strong correlation with any other track feature. However, several features are strongly correlated with each other, which indicates multicollinearity and means they might not be suitable as joint inputs to classification algorithms.
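To back a visual reading of a correlation plot with numbers, the strongly correlated pairs can be listed directly from the correlation matrix. The sketch below does this on simulated data; the `toy` data frame, its coefficients and the 0.7 threshold are all assumptions chosen for illustration, not values from the Spotify dataset.

```r
set.seed(1)
n <- 200
energy <- runif(n)

# Simulated features: loudness and acousticness are built from energy,
# tempo is independent (all values are made up)
toy <- data.frame(
  energy       = energy,
  loudness     = -20 + 15 * energy + rnorm(n, sd = 1),
  acousticness = 1 - energy + rnorm(n, sd = 0.1),
  tempo        = rnorm(n, mean = 120, sd = 20)
)

cor_mat <- cor(toy)

# Positions in the upper triangle whose absolute correlation exceeds the threshold
threshold <- 0.7
idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 2)
)
print(strong_pairs)
```

Applied to corr_spotify, the same pattern would list which feature pairs drive the multicollinearity.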

Histogram

Analyzing data distribution of the audio features, using the plot_num function (plots only numeric variables)

spotify_histograms <- spotify_data %>%
  select(danceability, energy, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, duration_min)
plot_num(spotify_histograms)

From the histograms, we can observe that:

Most tracks have a duration between 2.5 and 4 minutes
Most observations (~80% of the dataset) have an instrumentalness value no larger than 0.1
Energy and danceability are roughly bell-shaped with a mild left skew (their means sit slightly below their medians in the summary above), while valence is close to symmetric
Most of the songs have a loudness level between -10 dB and -5 dB
The majority of tracks have a speechiness below 0.25, so tracks with a lot of spoken content are rare in this dataset
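Observations like these can be checked numerically. A minimal sketch, using a made-up vector in place of a real feature column: comparing the mean with the median gives a quick hint about skew, and `mean()` of a logical condition gives the share of observations below a cut-off.

```r
# Made-up feature values on a 0..1 scale with a tail towards low values
x <- c(0.2, 0.6, 0.7, 0.8, 0.8, 0.9, 0.9)

# A mean below the median hints at a left-skewed distribution
centre <- c(mean = mean(x), median = median(x))
print(centre)

# Share of observations at or below a threshold (here 0.25)
share_low <- mean(x <= 0.25)
print(share_low)
```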

Boxplot

Genre by Energy

boxplot(energy~playlist_genre, data = spotify_data,
        main = "Variation:- Energy and  Genre",
        xlab = "Energy",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

The plot shows that the EDM genre has the songs with the highest energy.
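A boxplot ranking can be confirmed by computing the per-genre medians directly. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) and base R only.

```r
# Made-up energy values for three genres
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "rock", "rock", "rock", "pop", "pop", "pop"),
  energy         = c(0.90, 0.85, 0.95, 0.70, 0.80, 0.75, 0.60, 0.65, 0.70)
)

# Median energy per genre, highest first
medians <- vapply(split(toy$energy, toy$playlist_genre), median, numeric(1))
medians <- sort(medians, decreasing = TRUE)
print(medians)

top_genre <- names(medians)[1]
print(top_genre)
```

The same two lines, pointed at spotify_data and any audio feature, would give the numbers behind each of the boxplots in this section.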

Genre by Danceability

boxplot(danceability~playlist_genre, data = spotify_data,
        main = "Variation:- Danceability and Genre",
        xlab = "Danceability",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

As seen in the graph, Rap genre has the highest danceability factor.

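Genre by Liveness

# liveness (not danceability) goes on the value axis so the data matches the labels
boxplot(liveness~playlist_genre, data = spotify_data,
        main = "Variation:- Liveness and Genres",
        xlab = "Liveness",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

Rap songs appeared to rank highest, followed closely by the Latin genre, although that ordering mirrors the danceability result and should be confirmed on the liveness values.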

Genre by Valence

boxplot(valence~playlist_genre, data = spotify_data,
        main = "Variation:- Valence and Genre",
        xlab = "Valence",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

As seen above, Latin genre has a higher valence than others.

Genre by Loudness

boxplot(loudness~playlist_genre, data = spotify_data,
        main = "Variation:- Loudness and Genre",
        xlab = "Loudness",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

The loudness is pretty similar, only songs in EDM genre are a bit louder than the other genres.

Tempo and Liveness Distribution across Genre

spotify_data$liveness.scale <- scale(spotify_data$liveness)
spotify_data$tempo.scale <- scale(spotify_data$tempo)
spotify_data %>%
  select(tempo.scale, liveness.scale, playlist_genre) %>%
  group_by(playlist_genre) %>%
  filter(!is.na(tempo.scale)) %>%
  filter(!is.na(liveness.scale)) %>%
  ggplot(mapping = aes(x = tempo.scale, y = liveness.scale, color = playlist_genre, fill = playlist_genre)) +
  geom_bar(stat = 'identity') +
  coord_polar() +
  theme_dark() +
  theme(legend.position = "top")

As visible in the plot, tempo is considerably higher for the EDM genre than for the others, while liveness is almost uniformly distributed across all genres.

Energy Distribution of the songs

spotify_data$energy_only <- cut(spotify_data$energy, breaks = 10)
spotify_data %>%
  ggplot( aes(x = energy_only )) +
  geom_bar(width = 0.2, fill = "#FF9999", colour = "black") +
  scale_x_discrete(name = "Energy")

This plot shows that high-energy tracks dominate the dataset, which may suggest that higher-energy songs are favoured by Spotify listeners.

Speechiness Distribution of the songs

spotify_data$speech_only <- cut(spotify_data$speechiness, breaks = 10)
spotify_data %>%
  ggplot( aes(x = speech_only )) +  
  geom_bar(width = 0.2,  fill = "#FF9999", colour = "black") +
  scale_x_discrete(name = "Speechiness") +
  coord_flip()

This plot shows that most tracks have low speechiness, which may suggest that less speechy songs are favoured by Spotify listeners.

Proportion of playlist genres by count

We can see the distribution of playlist genres across our dataset. The statistics below show the number of tracks per genre in the Spotify data.

spotify_data$playlist_genre <- as.factor(spotify_data$playlist_genre)

summary(spotify_data$playlist_genre)

##   edm latin   pop   r&b   rap  rock 
##  4877  4136  5132  4504  5398  4305

It appears that the genres in our dataset are nearly uniformly distributed, with each genre accounting for roughly 15-19% of the data. Rap has the highest share of tracks (5,398, about 19%), followed by pop and EDM.
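The shares can be computed directly from the genre counts printed above. The sketch below re-enters those counts by hand as a named vector (the `genre_counts` name is introduced here for illustration); on the full data frame, `prop.table(table(spotify_data$playlist_genre))` would give the same proportions.

```r
# Genre counts copied from the summary() output above
genre_counts <- c(edm = 4877, latin = 4136, pop = 5132,
                  `r&b` = 4504, rap = 5398, rock = 4305)

# Percentage share of each genre, rounded to one decimal place
genre_share <- round(100 * genre_counts / sum(genre_counts), 1)
print(genre_share)

# Genre with the largest share
largest_genre <- names(which.max(genre_counts))
print(largest_genre)
```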

Plots showing variability of characteristics across genres

I’ve plotted how the characteristics of our tracks differ among the 6 genres using radar charts.

A radar chart is useful for comparing the musical vibes of genres in a more visual way. Within each genre, I’ve normalized the feature values to range from 0 to 1.

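radar_chart <- function(arg) {
  # Keep only the tracks of the requested genre
  spotify_data_filtered <- spotify_data %>% filter(playlist_genre == arg)
  radar_data_v1 <- spotify_data_filtered %>%
    select(danceability, energy, loudness, speechiness, valence, instrumentalness, acousticness)
  # Min-max normalize each feature to [0, 1] within this genre, then average
  radar_data_v2 <- apply(radar_data_v1, 2, function(x) {(x - min(x)) / diff(range(x))})
  radar_data_v3 <- apply(radar_data_v2, 2, mean)
  # radarchart() expects the axis maximum in row 1 and the minimum in row 2,
  # with one value per selected feature (7 here, not 6)
  n_features <- length(radar_data_v3)
  radar_data_v4 <- rbind(rep(1, n_features), rep(0, n_features), radar_data_v3)
  return(radarchart(as.data.frame(radar_data_v4), title = arg))
}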

par(mfrow = c(2, 3))
Chart_pop <- radar_chart("pop")
Chart_rb <- radar_chart("r&b")
Chart_edm <- radar_chart("edm")
Chart_latin <- radar_chart("latin")
Chart_rap <- radar_chart("rap")
Chart_rock <- radar_chart("rock")

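On these charts the Latin genre shows the highest normalized loudness and danceability. Because each chart is scaled within its own genre, these values are not directly comparable with the raw boxplots above.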
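Since the radar_chart function normalizes each feature within one genre, a genre's average depends on its own minimum and maximum. An alternative, sketched below on made-up data, is to min-max scale each feature across the whole dataset first and only then average by genre, which keeps the genre means on a common scale. The `toy` data frame and the `min_max` helper are hypothetical names introduced for this illustration.

```r
# Min-max scaling to [0, 1]
min_max <- function(x) (x - min(x)) / diff(range(x))

# Made-up tracks from two genres
toy <- data.frame(
  playlist_genre = c("latin", "latin", "edm", "edm"),
  loudness       = c(-8, -6, -4, -2),
  danceability   = c(0.8, 0.7, 0.6, 0.5)
)
features <- c("loudness", "danceability")

# Scale across ALL tracks, then average per genre
toy_scaled <- toy
toy_scaled[features] <- lapply(toy[features], min_max)
genre_means <- aggregate(toy_scaled[features],
                         by = list(playlist_genre = toy_scaled$playlist_genre),
                         FUN = mean)
print(genre_means)

# For contrast: scaling within a single genre hides its overall level
within_latin <- mean(min_max(toy$loudness[toy$playlist_genre == "latin"]))
print(within_latin)
```

In this toy example the globally scaled means separate the two genres, while the within-genre mean for Latin loudness is 0.5 regardless of how loud the genre is overall.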

Clustering

K-means clustering

I will perform K-means clustering on this dataset and try to analyze the change in output as the number of clusters increases.

K-means clustering is a centroid-based algorithm. Its objective is to minimize the sum of squared distances between the points and their respective cluster centroids.
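The objective can be made concrete by recomputing it by hand. The sketch below fits `kmeans()` from base R's stats package on simulated two-dimensional data (the data are made up) and checks that the sum of squared distances from each point to its assigned centroid matches the `tot.withinss` value reported by the fit.

```r
set.seed(42)

# Two simulated groups of 50 points each in two dimensions
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

fit <- kmeans(x, centers = 2, nstart = 10)

# Squared Euclidean distance of every point to the centroid of its own cluster
assigned_centres <- fit$centers[fit$cluster, ]
manual_wss <- sum((x - assigned_centres)^2)

print(c(manual = manual_wss, reported = fit$tot.withinss))
```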

First, I bring all the variables to the same scale using the scale function.

spotify.input <- spotify_data[, c('energy', 'liveness', 'tempo', 'speechiness', 'acousticness', 'instrumentalness', 'danceability', 'duration_ms', 'loudness', 'valence')]

# Standardize every selected feature to mean 0 and standard deviation 1
cluster.spotify.scaled <- scale(spotify.input)

kis2 <- kmeans(cluster.spotify.scaled, centers = 2, nstart = 25)
kis3 <- kmeans(cluster.spotify.scaled, centers = 3, nstart = 25)
kis4 <- kmeans(cluster.spotify.scaled, centers = 4, nstart = 25)
kis5 <- kmeans(cluster.spotify.scaled, centers = 5, nstart = 25)

plot1 <- fviz_cluster(kis2, geom = "point",  data = cluster.spotify.scaled) + ggtitle("k = 2")
plot2 <- fviz_cluster(kis3, geom = "point",  data = cluster.spotify.scaled) + ggtitle("k = 3")
plot3 <- fviz_cluster(kis4, geom = "point",  data = cluster.spotify.scaled) + ggtitle("k = 4")
plot4 <- fviz_cluster(kis5, geom = "point",  data = cluster.spotify.scaled) + ggtitle("k = 5")

grid.arrange(plot1, plot2, plot3, plot4, nrow = 2)

Based on the plots, we get the most distinct set of clusters when k is 3.

I will also verify the optimal number of clusters k using the elbow method (method = "wss"), which plots the total within-cluster sum of squares as a function of the number of clusters. Note that the call below is run on the first 1,000 rows of spotify.input, which holds the unscaled features.

set.seed(100)
fviz_nbclust(spotify.input[1:1000,], kmeans, method = "wss")

Based on the elbow curve, we can say that the optimal value of k is 3.
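The quantity behind an elbow curve can be computed with base R alone. The sketch below is an illustration on simulated, standardized data with three built-in groups (the data, the range of k and the `nstart` value are assumptions); it is not a re-run of the analysis above.

```r
set.seed(100)

# Three simulated groups of 30 points each, standardized like the clustering input
x <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
))

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
print(data.frame(k = ks, wss = round(wss, 1)))

# The elbow is the k after which wss stops dropping sharply; to draw it:
# plot(ks, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

With k = 1 the value equals the total sum of squares of the data, so the curve shows how much of that total each additional cluster removes.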

Summary

Final Project Conclusion

Summary of the Problem statement

The analysis is intended to understand the characteristics of various genres of music and to identify how popular a song would be based on its features. Along with that, I have also identified the patterns and relationships between the features that describe the songs in our dataset.

Methodology

We performed data wrangling on our Spotify dataset by removing null values, removing duplicates and transforming variables before starting our exploratory data analysis. From our correlation plot, we observed that several variables are strongly correlated with each other, indicating that this dataset has multicollinearity. I have also plotted histograms and boxplots to show the relations between the variables. I then used this dataset for clustering analysis, to analyze a track’s popularity based on several audio features provided in the dataset and to find whether we could predict a track’s popularity from key features of the song.

Insights

We have gained a lot of insights during this journey of exploring the data, and the interesting ones relevant to our analysis are the following. A lot of songs in our data have a popularity score of 0, which suggests that many of them haven’t been discovered by listeners yet. Energy has a high positive correlation with loudness and a high negative correlation with acousticness. The EDM genre has the songs with the highest energy, and the rap genre has the highest danceability. Rap and Latin appear to be the most lively genres, and loudness is fairly similar across all genres. Furthermore, even though we have six genres in the dataset, the clustering technique gave an optimal cluster count of three, which suggests that a few genres might be very similar. Lastly, tracks with low speechiness and high energy dominate the dataset, which may suggest that such songs are more likely to be popular among users.

Implications

With our clustering techniques we could provide more suggestions based on the three clusters that we found, which would improve the user experience. We also have an idea of which characteristics have an impact on the popularity of songs, which could help build more customized playlists.

Limitations of my analysis

Our dataset had a limited number of entries, so a much larger dataset would probably improve the analysis. Additional data points such as user demographics and historical data would also make the analysis more reliable.