Data Wrangling

Introduction

Background

Spotify is a Sweden-based audio streaming and media services provider that launched in October 2008. It is now one of the biggest digital music, podcast, and video streaming services in the world, giving access to millions of songs from artists all over the world.

It is a freemium service: basic features are free with advertisements and limited control, while additional features, such as offline listening and commercial-free listening, are offered via paid subscriptions. Users can search for music based on artist, album, or genre, and can create, edit, and share playlists. Not only does Spotify give us access to good songs on multiple platforms, it also exposes listeners to trending and upcoming artists from genres they might never have experienced. Spotify uses advanced technology to track and identify each song uploaded to its platform.

The Spotify dataset provides insight into which songs users listen to: it records not just the genre of the tracks in their libraries but also the audio features of those tracks.

In this project, we analyze playlist genre using several audio features provided in the dataset and test whether a playlist’s genre can be predicted from key features of its songs.

We plan to analyze users’ listening profiles so that Spotify can suggest and acquire similar songs for its platform and improve the user experience.

Proposed Analytical Methodology

The plan is to analyze the relationship between playlist genre and different features of a song, and then use a classification algorithm to predict song genre, which could support recommendations based on a user’s recent listening on Spotify.

Usefulness of Analysis

This analysis is mainly useful for marketing to Spotify users and improving their experience. It helps us better understand the genre of different songs and enables Spotify to target content distribution more precisely, which would help the developers and the marketing team analyze trends, segment users better, increase profits and provide a better user experience.

Packages required

The following packages were used:

tidyverse - which will provide us functionality to model, transform, and visualize data.

dplyr - used for data manipulation in R

ggplot2 - used for plotting charts

plotly - for web-based graphs via the open source JavaScript graphing library plotly.js for interactive charts

corrplot - for displaying correlation matrices and confidence intervals

factoextra - to visualize the output of multivariate data analysis

funModeling - Exploratory Data Analysis and Data Preparation Tool-Box

plyr - break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together

RColorBrewer - to help you choose sensible colour schemes for figures in R

knitr - for rendering tables with the kable function

randomForest - for fitting random forest classification models

e1071 - for fitting support vector machines with the svm function

# plyr is loaded first so that its functions do not mask dplyr verbs such as mutate()
library(plyr)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(knitr)
library(RColorBrewer)
library(funModeling)
library(randomForest)
library(e1071)

Data Preparation

This section contains all the procedures we followed in preparing the data for analysis. Each step is explained alongside the code for that step.

Data Source for the Spotify data

The dataset used for this project is the Spotify Genre dataset, which was provided in the course curriculum. More details about the dataset are given below.

Information on the Data and its actual source

The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

It’s likely that Spotify uses these features to power products like Spotify Radio and custom playlists like Discover Weekly and Daily Mixes.

After an initial look at the data, there is not much peculiarity in it. It is almost clean, with only 15 missing values. As every row is unique in some sense, we will not impute the missing values and will simply remove the affected rows instead.

Importing the Data

First, the Spotify dataset is loaded into R to begin the analysis. The dataset is imported using the read_csv function from the readr package and saved as “spotify_data”.

spotify_data<-readr::read_csv('https://raw.githubusercontent.com/nairrj/DataWrangling/main/spotify_songs.csv')

Now we’ll take a brief look at the dataset using the head, glimpse, colnames and dim functions.

head(spotify_data)
## # A tibble: 6 x 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C~ Ed Sheeran                 66 2oCs0DGTsRO98~
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories ~ Maroon 5                   67 63rPSO264uRjW~
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T~ Zara Larsson               70 1HoSmj2eLcsrR~
## 4 75FpbthrwQmzHlBJLuGdC7 Call You ~ The Chainsm~               60 1nqYsOef1yKKu~
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y~ Lewis Capal~               69 7m7vv9wlQ4i0L~
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful~ Ed Sheeran                 67 2yiy9cd2QktrN~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>
glimpse(spotify_data)
## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa~
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux~
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th~
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6~
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6~
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu~
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20~
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R~
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf~
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po~
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance~
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4~
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8~
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,~
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38~
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, ~
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127~
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, ~
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e~
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143~
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1~
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1~
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304~
colnames(spotify_data)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"
dim(spotify_data)
## [1] 32833    23

Our dataset has 32,833 observations and 23 variables.

Cleaning the Data

Remove null values

We observe that three columns, track_name, track_artist and track_album_name, have 5 NAs each; this information was retrieved using the colSums function.

I then removed the affected observations using the na.omit function.

colSums(is.na(spotify_data))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
spotify_data <- na.omit(spotify_data)
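
The effect of this step can be sketched on a small data frame. The toy values and the names toy, toy_clean and rows_dropped below are made up for illustration and are not part of the Spotify data.

```r
# Toy data frame standing in for spotify_data (values are invented).
toy <- data.frame(
  track_id   = c("a1", "b2", "c3", "d4"),
  track_name = c("Song A", NA, "Song C", "Song D"),
  energy     = c(0.91, 0.82, 0.93, 0.75),
  stringsAsFactors = FALSE
)

# Count missing values per column, as colSums(is.na(...)) does above.
na_counts <- colSums(is.na(toy))

# Keep only the rows that have no missing value in any column.
toy_clean <- na.omit(toy)

# complete.cases() flags the same rows without modifying the data,
# which is handy for checking how many rows will be lost.
rows_dropped <- sum(!complete.cases(toy))

print(na_counts)
print(nrow(toy_clean))
print(rows_dropped)
```

Checking rows_dropped before calling na.omit is a cheap way to confirm that only a handful of rows are removed.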

Remove duplicates

I will now keep only unique tracks by removing all duplicate tracks with the duplicated function.

spotify_data <- spotify_data[!duplicated(spotify_data$track_id),]
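
A minimal sketch of how duplicated() behaves, using invented track IDs and playlist names: it flags the second and later occurrences of an ID, so the first row seen for each track is the one that survives.

```r
# Toy playlist rows; track "a1" appears in two playlists (values are invented).
toy <- data.frame(
  track_id      = c("a1", "b2", "a1", "c3"),
  playlist_name = c("Pop Remix", "Pop Remix", "Dance Pop", "Rock Hits"),
  stringsAsFactors = FALSE
)

# TRUE for every repeat of an ID that has already been seen.
is_dup <- duplicated(toy$track_id)

# Dropping the flagged rows keeps the first occurrence of each track.
toy_unique <- toy[!is_dup, ]

print(is_dup)
print(toy_unique)
```

One consequence worth keeping in mind: a track that sits in several playlists keeps only the genre of the playlist in which it first appears.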

Transform the Variables

I have converted genre, subgenre, mode and key to factors to facilitate our data analysis, based on the values those fields contain.

# Inside mutate(), columns can be referenced by name directly
spotify_data <- spotify_data %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))

We convert duration_ms to a duration in minutes (duration_min), since minutes are easier to interpret in the analysis.

spotify_data <- spotify_data %>% mutate(duration_min = duration_ms/60000)
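
As a worked example of the conversion, the sketch below uses the first three duration_ms values shown in the glimpse output. The to_min_sec helper is hypothetical and is not used elsewhere in this report.

```r
# First three duration_ms values from the glimpse() output above.
duration_ms <- c(194754, 162600, 176616)

# One minute is 60,000 milliseconds.
duration_min <- duration_ms / 60000

# Hypothetical helper: format a duration as minutes:seconds,
# truncating any fraction of a second.
to_min_sec <- function(ms) {
  minutes <- as.integer(ms %/% 60000)
  seconds <- as.integer((ms %% 60000) %/% 1000)
  sprintf("%d:%02d", minutes, seconds)
}

print(round(duration_min, 2))
print(to_min_sec(duration_ms))
```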

Creating new variables

To explore the distribution of popularity, we create a new variable, Like, that splits the tracks into two groups at a popularity score of 55: 1 for tracks scoring 55 or below and 2 for tracks scoring above 55.

spotify_data <- spotify_data %>% 
  mutate(Like = as.numeric(case_when(
    track_popularity <= 55 ~ "1",
    # the second condition must cover scores above 55, otherwise those tracks become NA
    track_popularity > 55 ~ "2"
  )))
table(spotify_data$Like)
## 
##     1     2 
## 20159  8193
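
If a finer split than a single threshold is wanted, cut() can bin popularity into four groups. This is only a sketch: the break points 25, 50 and 75 and the toy scores are assumptions, not values used in this report.

```r
# Toy popularity scores on the 0 to 100 scale (invented values).
popularity <- c(0, 12, 25, 40, 55, 56, 70, 88, 100)

# Four groups: [0, 25], (25, 50], (50, 75], (75, 100].
# include.lowest = TRUE keeps a score of exactly 0 in the first group.
popularity_group <- cut(
  popularity,
  breaks = c(0, 25, 50, 75, 100),
  labels = 1:4,
  include.lowest = TRUE
)

group_counts <- table(popularity_group)
print(group_counts)
```

Because cut() assigns every value inside the outer breaks to exactly one interval, no track is left without a group.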

Description of Attributes

Each row represents one song, and each column contains an attribute of that song. The attributes are as follows:

  • track_id : Unique ID of the song
  • track_name : Title of the song
  • track_artist : Name of the artist
  • track_popularity : Popularity of the track from 0 to 100, based on the number of plays the track has received
  • track_album_release_date : Release date of the album the song appears on
  • track_album_name : Name of the album the song appears on
  • playlist_name : Name of the playlist the song is in
  • playlist_genre : Genre of the playlist the song is in
  • acousticness : Measure of how acoustic the track is, ranging from 0.0 to 1.0
  • danceability : Describes how suitable a track is for dancing. Values range from 0.0 (least danceable) to 1.0 (most danceable).
  • duration_ms : Duration of the track in milliseconds (ms); converted to minutes in the derived variable duration_min
  • energy : Measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity, i.e. the energy of the song
  • instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Values range from 0.0 to 1.0.
  • speechiness : Detects the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words, such as a podcast or talk show; values between 0.33 and 0.66 describe tracks that may contain both music and speech; values below 0.33 most likely represent music.
  • valence : Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive, while tracks with low valence sound more negative.
  • key : Estimated overall key of the track. If no key is detected, the value is -1.
  • liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • loudness : Overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.
  • mode : Indicates the modality (major or minor) of a track. Major is represented by 1 and minor by 0.
  • tempo : Overall estimated tempo of a track in beats per minute (BPM).

Exploratory Data Analysis

Exploratory data analysis (EDA) helps us uncover useful information in the data that is not self-evident, provided it is done correctly.

EDA is essential before we start to build a model on the data.

With EDA we can understand the patterns within the data, detect outliers or anomalous events and find interesting relations among the variables.

I have used a correlation plot, histograms and boxplots in my EDA.

Correlation Plot

corr_spotify <- select(spotify_data, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
corrplot(cor(corr_spotify), type="lower")

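Prior to any model creation, it is good practice to check for multicollinearity, which is correlation between the independent features within the dataset. The plot shows no pair of features correlated strongly enough to suggest severe multicollinearity, although energy is positively correlated with loudness and negatively correlated with acousticness.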
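
Alongside the plot, the strongest pairwise correlations can be listed numerically. The sketch below runs on simulated features rather than the Spotify data, and the 0.7 cutoff for flagging a pair is an assumed rule of thumb, not a threshold used in this report.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for audio features (not real Spotify values).
energy <- runif(n)
loudness <- -60 + 55 * energy + rnorm(n, sd = 5)
acousticness <- 1 - energy + rnorm(n, sd = 0.3)
tempo <- rnorm(n, mean = 120, sd = 25)
features <- data.frame(energy, loudness, acousticness, tempo)

cor_mat <- cor(features)

# Take each pair once by reading only the lower triangle of the matrix.
idx <- which(lower.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  feature_1 = rownames(cor_mat)[idx[, "row"]],
  feature_2 = colnames(cor_mat)[idx[, "col"]],
  correlation = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Sort by absolute correlation, strongest first.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$correlation)), ]

# Flag pairs above the assumed cutoff.
flagged <- cor_pairs[abs(cor_pairs$correlation) > 0.7, ]

print(cor_pairs)
print(flagged)
```

Applied to corr_spotify, the same steps would rank the real feature pairs and make any borderline case explicit.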

Histogram

We analyze the distribution of the audio features using the plot_num function from funModeling, which plots only numeric variables.

spotify_histograms <- spotify_data[,-c(1,2,3,4,5,6,7,8,11,13,20,22)]
plot_num(spotify_histograms)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

From the histograms, we can observe that:

  • Most songs are between 2.5 and 4 minutes long
  • About 80% of the observations have an instrumentalness value no larger than 0.1
  • Energy and danceability are only roughly normally distributed, whereas valence is close to normally distributed
  • Most of the songs have a loudness level between -10 dB and -5 dB
  • The majority of tracks have speechiness below 0.25, indicating that speech-heavy tracks are uncommon in the dataset

Boxplot

Genre by Energy

boxplot(energy~playlist_genre, data = spotify_data,
        main = "Variation:- Energy and  Genre",
        xlab = "Energy",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

The plot shows that the EDM genre has the songs with the highest energy.

Genre by Danceability

boxplot(danceability~playlist_genre, data = spotify_data,
        main = "Variation:- Danceability and Genre",
        xlab = "Danceability",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

As seen in the graph, the rap genre has the highest danceability.

Genre by Liveness

# Plot liveness (not danceability) so the data match the title and axis label
boxplot(liveness~playlist_genre, data = spotify_data,
        main = "Variation:- Liveness and Genres",
        xlab = "Liveness",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

Rap songs appear to have the highest liveness, followed closely by the Latin genre.

Genre by Valence

boxplot(valence~playlist_genre, data = spotify_data,
        main = "Variation:- Valence and Genre",
        xlab = "Valence",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

As seen above, the Latin genre has a higher valence than the others.

Genre by Loudness

boxplot(loudness~playlist_genre, data = spotify_data,
        main = "Variation:- Loudness and Genre",
        xlab = "Loudness",
        ylab = "Genre",
        col = "green",
        border = "blue",
        horizontal = TRUE,
        notch = TRUE
)

Loudness is fairly similar across genres; only songs in the EDM genre are slightly louder than the rest.

Tempo and Liveness Distribution across Genre

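# scale() returns a one-column matrix, so convert the result to a plain numeric vector
spotify_data$liveness.scale <- as.numeric(scale(spotify_data$liveness))
spotify_data$tempo.scale <- as.numeric(scale(spotify_data$tempo))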
spotify_data %>%
  select(tempo.scale, liveness.scale, playlist_genre) %>%
  group_by(playlist_genre) %>%
  filter(!is.na(tempo.scale)) %>%
  filter(!is.na(liveness.scale)) %>%
  ggplot(mapping = aes(x = tempo.scale, y = liveness.scale, color = playlist_genre, fill = playlist_genre)) +
  geom_bar(stat = 'identity') +
  coord_polar() +
  theme_dark() +
  theme(legend.position = "top")

As visible in the plot, tempo is considerably higher for the EDM genre than for the others, while liveness is almost uniformly distributed across all genres.

Energy Distribution of the songs

spotify_data$energy_only <- cut(spotify_data$energy, breaks = 10)
spotify_data %>%
  ggplot( aes(x = energy_only )) +
  geom_bar(width = 0.2, fill = "#FF9999", colour = "black") +
  scale_x_discrete(name = "Energy")

This plot shows that higher-energy songs make up most of the dataset, which suggests that they are popular among Spotify listeners.

Speechiness Distribution of the songs

spotify_data$speech_only <- cut(spotify_data$speechiness, breaks = 10)
spotify_data %>%
  ggplot( aes(x = speech_only )) +  
  geom_bar(width = 0.2,  fill = "#FF9999", colour = "black") +
  scale_x_discrete(name = "Speechiness") +
  coord_flip()

This plot shows that most tracks have low speechiness, which suggests that less speech-heavy songs are favoured by most Spotify listeners.

Data Modelling

Multiple models are built to explore the classification of playlist genre. Model performance is measured by the in-sample misclassification rate (MR), computed on a randomly chosen 75% of the original dataset, and the out-of-sample MR, computed on the remaining 25%.

Train-Test Split

The first step is to randomly split the whole dataset into a training set (75%) and a testing set (25%) for model validation.

set.seed(7052)
idx <- sample(1:nrow(spotify_data), nrow(spotify_data) * .75)

spotify_data <- spotify_data[,c(10, 12:23)]
songs_train <- spotify_data[idx,]
songs_test <- spotify_data[-idx,]
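
A quick check after a random split is to confirm that the two parts do not overlap and that class proportions are similar in both. The sketch below uses the built-in iris data as a stand-in, so the class names and sizes are not those of the Spotify data.

```r
set.seed(7052)

# iris stands in for the song data; Species plays the role of playlist_genre.
n <- nrow(iris)
idx <- sample(seq_len(n), size = floor(n * 0.75))

train <- iris[idx, ]
test <- iris[-idx, ]

# Class proportions in each part; large differences would suggest
# repeating the split or using a stratified sample instead.
prop_train <- prop.table(table(train$Species))
prop_test <- prop.table(table(test$Species))

print(round(prop_train, 3))
print(round(prop_test, 3))

# No row should appear in both parts.
n_overlap <- length(intersect(rownames(train), rownames(test)))
print(n_overlap)
```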

Random Forest

Random forest is an extension of bagging, which is an aggregation of trees. By bootstrapping (sampling with replacement) from the data and fitting a tree model to each sample, random forest aggregates those trees and makes a significant improvement in prediction.
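
The bootstrap step can be sketched in a few lines. Drawing n rows with replacement leaves roughly 63% of the distinct rows in each sample; the rows that are never drawn are the out-of-bag rows for that tree. The sample size of 10,000 below is an arbitrary choice for illustration.

```r
set.seed(7052)
n <- 10000

# One bootstrap sample: n row indices drawn with replacement.
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)

# Share of distinct rows that made it into the sample.
in_bag_share <- length(unique(boot_idx)) / n

# Rows never drawn are "out of bag" for the tree fitted on this sample.
oob_idx <- setdiff(seq_len(n), boot_idx)

print(in_bag_share)
print(1 - exp(-1))  # large-n limit of the in-bag share
print(length(oob_idx))
```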

The idea of random forests is to randomly select m out of p predictors as candidate variables for each split in each tree. Commonly, m = √p in classification. Doing this decorrelates the trees, which reduces variance when we aggregate them.

In this particular case, m should be the square root of the number of predictors, √12 ≈ 3.46, which we round up to 4. Let’s fit this model on the training data and make predictions on both the training and testing data.
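
The arithmetic behind the choice of m can be written out as follows. The predictor names are the twelve columns kept by the column selection in the train-test split (columns 12 to 23 of the dataset).

```r
# The twelve predictors that remain after the column selection above.
predictors <- c("danceability", "energy", "key", "loudness", "mode",
                "speechiness", "acousticness", "instrumentalness",
                "liveness", "valence", "tempo", "duration_ms")
p <- length(predictors)

sqrt_p <- sqrt(p)                # not a whole number for p = 12
mtry_floor <- floor(sqrt_p)      # rounding down
mtry_ceiling <- ceiling(sqrt_p)  # rounding up, as done in this report

print(c(p = p, sqrt_p = sqrt_p, floor = mtry_floor, ceiling = mtry_ceiling))
```

As far as the randomForest documentation describes it, the package default for classification rounds √p down, so passing mtry = 4 is an explicit choice rather than the default.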

songs_rf <- randomForest(playlist_genre~., data = songs_train, mtry = 4)

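# Without newdata, predict() returns the out-of-bag predictions for the training data
pred_train <- predict(songs_rf)
# Predictions for the held-out test set
pred_test <- predict(songs_rf, songs_test)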

Misclassification rate (MR) is defined as the proportion of songs whose genre is misclassified. We can compute both the in-sample and out-of-sample misclassification rates from the predictions obtained in the last step.

The in-sample and out-of-sample MR are both around 0.43 (about 0.429 and 0.434, computed from the confusion matrices below) and are close to each other, indicating that there is no overfitting problem in this model. Let’s check the confusion matrices for more detail.
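
The MR computation is not shown in the report, so here is a minimal sketch of one way to do it. The counts are copied from the in-sample confusion matrix reported for the random forest in this section; the function names are hypothetical.

```r
genres <- c("edm", "latin", "pop", "r&b", "rap", "rock")

# In-sample confusion matrix from this report (rows: actual, columns: predicted).
conf_in <- matrix(c(
  2545,  136,  555,   86,  208,  115,
   255, 1162,  603,  334,  564,  161,
   499,  358, 1595,  449,  376,  574,
    65,  250,  446, 1528,  752,  304,
   172,  262,  246,  353, 2962,  109,
   108,   87,  357,  273,   65, 2350
), nrow = 6, byrow = TRUE, dimnames = list(actual = genres, predicted = genres))

# MR from a confusion matrix: one minus the share of counts on the diagonal.
mr_from_confusion <- function(conf) 1 - sum(diag(conf)) / sum(conf)

# MR from two label vectors: the share of positions that disagree.
mr_from_labels <- function(predicted, actual) {
  mean(as.character(predicted) != as.character(actual))
}

mr_in <- mr_from_confusion(conf_in)
print(round(mr_in, 3))

# Tiny check of the vector version on invented labels.
print(mr_from_labels(c("pop", "rap", "rock", "edm"), c("pop", "rap", "pop", "edm")))
```

With the objects from the modelling step, the vector version would be called as mr_from_labels(pred_test, songs_test$playlist_genre).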

In-sample Confusion Matrix

kable(songs_rf$confusion)
|      |  edm| latin|  pop|  r&b|  rap| rock| class.error|
|:-----|----:|-----:|----:|----:|----:|----:|-----------:|
|edm   | 2545|   136|  555|   86|  208|  115|   0.3017833|
|latin |  255|  1162|  603|  334|  564|  161|   0.6226047|
|pop   |  499|   358| 1595|  449|  376|  574|   0.5858219|
|r&b   |   65|   250|  446| 1528|  752|  304|   0.5431988|
|rap   |  172|   262|  246|  353| 2962|  109|   0.2782651|
|rock  |  108|    87|  357|  273|   65| 2350|   0.2746914|

Out-of-sample Confusion Matrix

kable(table(pred_test, songs_test$playlist_genre))
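|      | edm| latin| pop| r&b| rap| rock|
|:-----|---:|-----:|---:|---:|---:|----:|
|edm   | 861|    93| 156|  25|  72|   44|
|latin |  59|   374| 123|  81|  80|   26|
|pop   | 177|   198| 546| 137|  98|  113|
|r&b   |  27|   122| 141| 548| 109|   80|
|rap   |  68|   201| 122| 274| 894|   15|
|rock  |  40|    69| 193|  94|  41|  787|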

From the two tables above, we can see that the random forest model classifies rock, rap and edm well, with in-sample class errors of 27.47%, 27.83% and 30.18% respectively, while it performs worse in identifying r&b, pop and latin, with in-sample class errors of 54.32%, 58.58% and 62.26%.
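
Per-genre error rates on the test set can be derived from the out-of-sample table in the same way. The sketch below copies the counts from that table. Because it was built with table(pred_test, songs_test$playlist_genre), rows are predicted genres and columns are actual genres, so the error for each actual genre divides by the column totals.

```r
genres <- c("edm", "latin", "pop", "r&b", "rap", "rock")

# Out-of-sample table from this report (rows: predicted, columns: actual).
conf_out <- matrix(c(
  861,  93, 156,  25,  72,  44,
   59, 374, 123,  81,  80,  26,
  177, 198, 546, 137,  98, 113,
   27, 122, 141, 548, 109,  80,
   68, 201, 122, 274, 894,  15,
   40,  69, 193,  94,  41, 787
), nrow = 6, byrow = TRUE, dimnames = list(predicted = genres, actual = genres))

# Error per actual genre: one minus correct predictions over all songs of that genre.
class_error_out <- 1 - diag(conf_out) / colSums(conf_out)

# Genres sorted from easiest to hardest to classify, in percent.
print(round(100 * sort(class_error_out), 1))
```

Dividing by rowSums instead would give the share of wrong calls among the songs predicted as each genre, which is a different quantity.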

SVM

SVM is probably one of the best off-the-shelf classifiers for many problems. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. It handles nonlinearity, is well regularized (which helps avoid overfitting), has few parameters, and is practical to fit on a dataset of this size.

In this particular case, we still use the training data to build the SVM model, and use the testing data to do model validation.

song_svm <- svm(playlist_genre ~ ., data = songs_train, cost = 1)

pred_train_svm <- predict(song_svm)
pred_test_svm <- predict(song_svm, songs_test)

The in-sample MR is about 0.444 and the out-of-sample MR is about 0.475 (computed from the confusion matrices below); the two are not far from each other and are similar to the performance of the random forest model. Let’s check the confusion matrices for more detail.

In-sample Confusion Matrix

kable(table(songs_train$playlist_genre, pred_train_svm))
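|      |  edm| latin|  pop|  r&b|  rap| rock|
|:-----|----:|-----:|----:|----:|----:|----:|
|edm   | 2356|   125|  646|   84|  290|  144|
|latin |  260|  1036|  631|  345|  618|  189|
|pop   |  487|   290| 1718|  400|  410|  546|
|r&b   |   79|   265|  459| 1466|  773|  303|
|rap   |  214|   257|  313|  301| 2933|   86|
|rock  |  143|    90|  375|  255|   57| 2320|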

Out-of-sample Confusion Matrix

kable(table(songs_test$playlist_genre, pred_test_svm))
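|      | edm| latin| pop| r&b| rap| rock|
|:-----|---:|-----:|---:|---:|---:|----:|
|edm   | 782|    51| 219|  28|  95|   57|
|latin |  83|   310| 237| 123| 223|   81|
|pop   | 158|   110| 543| 148| 126|  196|
|r&b   |  29|    94| 158| 482| 288|  108|
|rap   |  87|    86| 113| 101| 865|   42|
|rock  |  60|    38| 145|  76|   8|  738|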

Similarly, the SVM model classifies rock, edm and rap well, but does not do well in identifying r&b, latin and pop.

Summary

We have performed data wrangling on our Spotify dataset by removing null values, removing duplicates and transforming variables before starting our exploratory data analysis. We also checked the audio features for multicollinearity and found no severe cases.

We plotted histograms and boxplots to show the relationships between the variables, and we used the dataset to build models that predict song genre from several audio features provided in the dataset.

Additionally, we have explored:

  • The audio feature pattern of each genre, by visualizing audio features across different genres.
  • The correlation between features, by computing the correlation coefficients.
  • Given the features of a song, how to classify its genre with a random forest model and an SVM model.

Those explorations could give us some hints on:

  • The specific feature pattern of each genre. For example, edm appears to have medium danceability, higher energy and low valence compared to other genres, while rock appears to have lower danceability, higher energy and medium valence.
  • The relationship between audio features. For example, there is a strong negative correlation between acousticness and energy, a relatively high negative correlation between acousticness and loudness, and a relatively high positive correlation between energy and loudness, all of which are consistent with common sense.
  • How the classification model works on this problem. The random forest model is more accurate in classifying rock, edm and rap, but it does not work as well in classifying r&b, latin and pop.

This project does give us some insights into audio feature patterns and the classification of genre; however, due to limitations of data, models and time, the project is far from perfect. More data could be imported and more advanced models, such as neural networks, could be built to extract more insights and information from the data.