Data Wrangling Midterm Project - Spotify

Venkat Sureddi, Abhiteja Achanta and Vamsi Chand Emani

3/30/2020

How Popular is the song

Introduction

What is Spotify

Spotify is the most popular audio streaming service across the world. There are millions of tracks on the app, which can be browsed by parameters such as artist, album, and genre.

Problem Statement

In this project, we analyse the various features affecting the popularity of a song and also predict the genre of songs. In the process we will be answering:

  • The correlation between different genres and audio features
  • Who are the most popular artists?
  • Which is the most popular genre?
  • Most popular tracks in the dataset
  • What features affect the popularity of a song and how they affect it

Way Forward

  • Exploration of the summary statistics of each audio feature.
  • Data cleaning, i.e. removing the null values and outliers, if any.
  • Check for correlation between audio features and correlation between genres.
  • Identify the features of each genre
  • Perform basic EDA and observe the patterns across each audio feature and across genres.
  • Finally build a predictive model to identify the genre and estimate the popularity of the song.

This project aims to equip the consumer with the ability to estimate and explain the popularity and genre of a song when provided with the required audio features such as danceability, valence etc.

Packages Required

The packages which we are going to use in our analysis:

library(plotly)       #Useful for creating interactive visualisations
library(tidyr)        #Tidying data, i.e. converting into long form, etc.
library(ggplot2)      #Used in the visualisation of the data
library(dplyr)        #Used for data wrangling
library(rpart)        #Has the functions which assist in building the decision tree
library(knitr)        #Helps in the integration of R code into HTML
library(kableExtra)   #Useful for construction of complex tables and customisation of styles
library(caret)        #For splitting the data into training and testing
library(DT)           #Displaying data objects as tables on the HTML page
library(funModeling)  #For creating visualisations
library(corrplot)     #Helps in plotting the correlation of numerical values in the data
library(randomForest) #Useful for performing the Random Forest algorithm
library(e1071)        #Useful for running the SVM algorithm

Data Preparation

Data Source and Summary Of Variables

The spotify data being used for our analysis has been taken from this path: Spotify Data

The data has been made available via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

The variables in the dataset and their description:

Data Importing

The dataset contains 32,833 observations of 23 variables.

Data Cleaning and Manipulation

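From the summary of the dataset, we observe that loudness takes negative values (it is measured in decibels), and that mode and key are stored as numbers even though they are categorical. We clean these columns by treating mode and key as factors.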
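The conversion step itself is not shown in this report. A minimal sketch of how it could look, assuming key and mode arrive as numeric codes; toy_songs is a made-up data frame used only for illustration:

```r
# Toy data standing in for spotify_songs (values are made up)
toy_songs <- data.frame(
  key      = c(6, 11, 1, 7),
  mode     = c(1, 1, 0, 1),
  loudness = c(-2.6, -5.0, -3.4, -3.8)
)

# key and mode are category codes, so store them as factors
toy_songs$key  <- as.factor(toy_songs$key)
toy_songs$mode <- as.factor(toy_songs$mode)

str(toy_songs)
```

Storing these columns as factors keeps later models from treating key 11 as "larger" than key 1.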

We need to check whether any songs appear more than once. For this, we consider the track_id column and check for duplicates in that column.

#Removing Duplicates
spotify_songs_unique = spotify_songs[!duplicated(spotify_songs$track_id),]
dim(spotify_songs_unique)
## [1] 28350    23
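To make the behaviour of duplicated() concrete, here is a small sketch on a made-up data frame (toy and its values are hypothetical). It flags the second and later occurrences of each track_id, so the first occurrence is the one that is kept:

```r
# Made-up example: track "a1" appears twice
toy <- data.frame(
  track_id   = c("a1", "b2", "a1", "c3"),
  track_name = c("Song A", "Song B", "Song A", "Song C"),
  stringsAsFactors = FALSE
)

# TRUE for every repeat of an already seen track_id
dup_flag <- duplicated(toy$track_id)

# Keep only the first occurrence of each track
toy_unique <- toy[!dup_flag, ]

sum(dup_flag)     # number of rows that would be dropped
nrow(toy_unique)  # number of rows that remain
```

Because the first occurrence wins, a track that sits in several playlists keeps the genre of the first playlist it appears in.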

Now, we select only those columns which will be useful in our analysis and in the building of the model. We will go ahead and drop the following columns:

  • track_id
  • track_album_id
  • track_album_name
  • playlist_name
  • playlist_id
#Removing unnecessary columns
spotify_songs_final = spotify_songs_unique[-c(1,5,6,8,9)]
head(spotify_songs_final)
## # A tibble: 6 x 18
##   track_name track_artist track_popularity track_album_rel~ playlist_genre
##   <chr>      <chr>                   <dbl> <chr>            <chr>         
## 1 I Don't C~ Ed Sheeran                 66 2019-06-14       pop           
## 2 Memories ~ Maroon 5                   67 2019-12-13       pop           
## 3 All the T~ Zara Larsson               70 2019-07-05       pop           
## 4 Call You ~ The Chainsm~               60 2019-07-19       pop           
## 5 Someone Y~ Lewis Capal~               69 2019-03-05       pop           
## 6 Beautiful~ Ed Sheeran                 67 2019-07-11       pop           
## # ... with 13 more variables: playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <fct>, loudness <dbl>, mode <fct>,
## #   speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## #   liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>
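Dropping columns by position works as long as the column order never changes. As an alternative sketch, the same columns can be dropped by name; toy is a made-up one-row data frame and drop_cols is a name introduced here for illustration:

```r
# Made-up one-row data frame with a few of the original column names
toy <- data.frame(
  track_id       = "a1",
  track_name     = "Song A",
  track_album_id = "x9",
  playlist_id    = "p1",
  danceability   = 0.7,
  stringsAsFactors = FALSE
)

# Columns listed for removal in the text above
drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "playlist_name", "playlist_id")

# Keep every column whose name is not in drop_cols
toy_final <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy_final)
```

Names that are absent from the data (here track_album_name and playlist_name) are simply ignored, which makes this form tolerant of schema changes.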

We now check for the missing values across all the columns in the dataset.

colSums(is.na(spotify_songs_final))
##               track_name             track_artist         track_popularity 
##                        4                        4                        0 
## track_album_release_date           playlist_genre        playlist_subgenre 
##                        0                        0                        0 
##             danceability                   energy                      key 
##                        0                        0                        0 
##                 loudness                     mode              speechiness 
##                        0                        0                        0 
##             acousticness         instrumentalness                 liveness 
##                        0                        0                        0 
##                  valence                    tempo              duration_ms 
##                        0                        0                        0

There are four missing values each in track_name and track_artist. We remove these rows from our dataset so that they do not become roadblocks when building the model.

spotify_songs_final = spotify_songs_final[!is.na(spotify_songs_final$track_name),]

colSums(is.na(spotify_songs_final))
##               track_name             track_artist         track_popularity 
##                        0                        0                        0 
## track_album_release_date           playlist_genre        playlist_subgenre 
##                        0                        0                        0 
##             danceability                   energy                      key 
##                        0                        0                        0 
##                 loudness                     mode              speechiness 
##                        0                        0                        0 
##             acousticness         instrumentalness                 liveness 
##                        0                        0                        0 
##                  valence                    tempo              duration_ms 
##                        0                        0                        0
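The filter above tests only track_name, which is sufficient here because the counts show both columns ending up with no missing values. A slightly more defensive sketch, on made-up data, checks both columns at once with complete.cases():

```r
# Made-up data: the second row is missing both name and artist
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("X", NA, "Z"),
  tempo        = c(120, 100, 90),
  stringsAsFactors = FALSE
)

# TRUE only where both columns are present
keep <- complete.cases(toy[, c("track_name", "track_artist")])
toy_clean <- toy[keep, ]

colSums(is.na(toy_clean))
```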

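Now, we look at some of the rows from the final cleaned dataset. Because top_n() is called without a weighting column, it selects by the last column, so the rows shown are the 100 tracks with the largest duration_ms: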

spotify_songs_final %>% top_n(100)
## Selecting by duration_ms
## # A tibble: 100 x 18
##    track_name track_artist track_popularity track_album_rel~ playlist_genre
##    <chr>      <chr>                   <dbl> <chr>            <chr>         
##  1 Mirrors    Justin Timb~               77 2013-03-15       pop           
##  2 Bailando ~ Chela                      31 2011-07-06       pop           
##  3 Bring It ~ Geto Boys                  31 1993-03-09       rap           
##  4 Tonight I~ Betty Wright               41 2002-07-02       rap           
##  5 Sixteen    Rick Ross                   0 2012-01-01       rap           
##  6 Fat Frees~ Fat Pat                     3 2012-11-27       rap           
##  7 Still In ~ Shuya Okino                 0 2016-03-04       rock          
##  8 Al Andalu~ Miguel Rios                 0 2005-01-01       rock          
##  9 Dancing W~ Genesis                    48 1973-10-12       rock          
## 10 Killer     Van Der Gra~               33 1986-01-01       rock          
## # ... with 90 more rows, and 13 more variables: playlist_subgenre <chr>,
## #   danceability <dbl>, energy <dbl>, key <fct>, loudness <dbl>,
## #   mode <fct>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>
datatable(spotify_songs_final, filter = 'top', options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Now we will look at the structure and data type of each variable:

str(spotify_songs_final)
## Classes 'tbl_df', 'tbl' and 'data.frame':    28346 obs. of  18 variables:
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : num  194754 162600 176616 169093 189052 ...

Exploratory Data Analysis

We aim to visualise our data using a mixture of plots such as:

  • Correlation Matrix
  • Scatter plots
  • Histograms
  • Box plots

First, we check the number of songs in each genre:

spotify_songs_final %>% count(playlist_genre) %>% knitr::kable()
playlist_genre      n
edm              4875
latin            4136
pop              5132
r&b              4504
rap              5395
rock             4304

It is clear that our dataset is fairly balanced, with a good number of songs from each genre.

From our dataset, we now identify the most popular tracks and their artists.

#Most popular in the dataset and their artists

Top_tracks <- spotify_songs_final %>% 
  arrange(desc(track_popularity)) 

head(Top_tracks)
## # A tibble: 6 x 18
##   track_name track_artist track_popularity track_album_rel~ playlist_genre
##   <chr>      <chr>                   <dbl> <chr>            <chr>         
## 1 Dance Mon~ Tones and I               100 2019-10-17       pop           
## 2 ROXANNE    Arizona Zer~               99 2019-10-10       latin         
## 3 Tusa       KAROL G                    98 2019-11-07       pop           
## 4 Memories   Maroon 5                   98 2019-09-20       pop           
## 5 Blinding ~ The Weeknd                 98 2019-11-29       pop           
## 6 Circles    Post Malone                98 2019-09-06       pop           
## # ... with 13 more variables: playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <fct>, loudness <dbl>, mode <fct>,
## #   speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## #   liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>

Dance Monkey (100) and ROXANNE (99) are the two most popular tracks in our dataset, followed by Tusa, Memories, Blinding Lights and Circles, which are tied at 98.

We need to understand who the most popular artists are.

Most_popular_Artists <- spotify_songs_final %>%
  group_by(track_artist) %>%
  summarise(Mean_popularity = mean(track_popularity),Numberofsongs = n()) %>%
  arrange(desc(Mean_popularity),desc(Numberofsongs))
  
  head(Most_popular_Artists)
## # A tibble: 6 x 3
##   track_artist  Mean_popularity Numberofsongs
##   <chr>                   <dbl>         <int>
## 1 Trevor Daniel            97               1
## 2 Y2K                      91               1
## 3 Don Toliver              87.5             2
## 4 Kina                     85.5             2
## 5 JACKBOYS                 84.3             3
## 6 Dadá Boladão             84               1

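Trevor Daniel, Y2K, Don Toliver, Kina and JACKBOYS are the top 5 artists in terms of mean popularity. Note that each of them has only one to three songs in the dataset, so these averages rest on very few tracks.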
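A ranking by mean popularity can be dominated by artists with a single song. One possible refinement, sketched here in base R on made-up data, is to require a minimum number of songs before ranking; the threshold min_songs is a hypothetical choice, not something used in this report:

```r
# Made-up data: artist A has one hit, B and C have three songs each
toy <- data.frame(
  track_artist     = c("A", "B", "B", "B", "C", "C", "C"),
  track_popularity = c(97, 80, 70, 60, 50, 40, 30),
  stringsAsFactors = FALSE
)

# Mean popularity and song count per artist
mean_pop <- tapply(toy$track_popularity, toy$track_artist, mean)
n_songs  <- tapply(toy$track_popularity, toy$track_artist, length)

artist_stats <- data.frame(
  track_artist    = names(mean_pop),
  Mean_popularity = as.numeric(mean_pop),
  Numberofsongs   = as.integer(n_songs),
  stringsAsFactors = FALSE
)

min_songs <- 3  # hypothetical threshold, chosen only for illustration

# Keep artists with enough songs, then sort by mean popularity
eligible <- artist_stats[artist_stats$Numberofsongs >= min_songs, ]
eligible <- eligible[order(-eligible$Mean_popularity), ]

eligible
```

With this filter, artist A drops out despite the highest mean, because that mean comes from a single track.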

Popular Genres

Most_popular_genres <- spotify_songs_final %>%
  group_by(playlist_genre) %>%
  summarise(Mean_popularity = mean(track_popularity),Numberofsongs = n()) %>%
  arrange(desc(Mean_popularity),desc(Numberofsongs))
   
  head(Most_popular_genres)
## # A tibble: 6 x 3
##   playlist_genre Mean_popularity Numberofsongs
##   <chr>                    <dbl>         <int>
## 1 pop                       45.9          5132
## 2 rap                       41.9          5395
## 3 latin                     41.4          4136
## 4 rock                      39.7          4304
## 5 r&b                       35.9          4504
## 6 edm                       30.7          4875

Pop is the most popular genre, followed by Rap, Latin and Rock.

Box Plot of Track Popularity vs Playlist Genre:

#Boxplot For Track Popularity vs Genre
boxplot(track_popularity ~ playlist_genre, data = spotify_songs_final, ylab = "Popularity", xlab = "Genre")

Distribution Of Numerical Columns

From the above plots, we can infer that valence approximately follows a normal distribution. Speechiness, Acousticness, Instrumentalness, Tempo and Duration are concentrated at low values with a long right tail (positively skewed), whereas danceability and loudness are concentrated at high values with a long left tail (negatively skewed).
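The direction of skew can also be checked numerically. A minimal sketch using the moment-based skewness coefficient, where a positive value indicates a long right tail and a negative value a long left tail; the helper skewness() and the two vectors are made up for illustration and are not taken from the dataset:

```r
# Moment-based skewness: third central moment divided by variance^1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up values: mostly small with one large value (long right tail)
right_tailed <- c(0.03, 0.04, 0.05, 0.06, 0.90)

# Made-up values: mostly near zero with one very low value (long left tail)
left_tailed <- c(-30, -6, -5, -4, -3)

skewness(right_tailed)  # positive
skewness(left_tailed)   # negative
```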

Correlation of numerical columns to Popularity

# Correlation of these numerical variables to Popularity

options(repr.plot.width = 20, repr.plot.height = 15)
spotify_final_sliced <- spotify_songs_final[, -c(1,2,4,5,6,9,11)]
corr <- cor(spotify_final_sliced)

num <- corrplot(corr, method = "ellipse", type = "upper", tl.srt = 45)
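The numeric columns above are chosen by position. As a sketch of an alternative, they can be chosen by type, so that cor() never receives a character or factor column; toy is a made-up data frame:

```r
# Made-up data with a mix of character, numeric and factor columns
toy <- data.frame(
  track_name       = c("a", "b", "c", "d", "e"),
  track_popularity = c(66, 67, 70, 60, 69),
  energy           = c(0.92, 0.82, 0.93, 0.93, 0.83),
  key              = factor(c(6, 11, 1, 7, 1)),
  loudness         = c(-2.6, -5.0, -3.4, -3.8, -4.7),
  stringsAsFactors = FALSE
)

# Keep only columns that are numeric (factors are excluded)
is_num   <- sapply(toy, is.numeric)
corr_toy <- cor(toy[, is_num])

# Correlation of every numeric feature with popularity
round(corr_toy["track_popularity", ], 2)
```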

To predict the popularity of songs, we intend to use the numerical columns together with the track artist. We classify artists as highly popular, moderately popular or not popular, based on the mean popularity of their tracks.

#Classifying the artists
spotify_cleaned1 <- spotify_songs_final %>%
  group_by(track_artist) %>%
  summarise(Mean_popularity = mean(track_popularity)) %>%
  right_join(spotify_songs_final)
## Joining, by = "track_artist"
hist(spotify_cleaned1$Mean_popularity)

spotify_cleaned1 <- spotify_cleaned1 %>% 
  mutate(Popularity_factor = cut(x = spotify_cleaned1$Mean_popularity, breaks = c(0, 30, 60, 100),include.lowest=TRUE)) %>%
  as.data.frame()
levels(spotify_cleaned1$Popularity_factor) <- c("low", "Medium" , "High")
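To see how the breaks assign the bins, here is a small sketch with made-up mean popularity values. With include.lowest = TRUE the intervals are [0, 30], (30, 60] and (60, 100], so a value exactly on a break falls into the lower bin:

```r
# Made-up mean popularity values, including the boundary cases
mean_pop <- c(0, 30, 30.5, 60, 76.9, 100)

pop_factor <- cut(
  mean_pop,
  breaks = c(0, 30, 60, 100),
  include.lowest = TRUE,                # so that 0 is not turned into NA
  labels = c("low", "Medium", "High")   # same labels as in the report
)

data.frame(mean_pop, pop_factor)
```

Passing labels directly to cut() has the same effect as renaming the levels afterwards.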

Prior to the model building we remove the columns which are unnecessary:

#Remove Unnecessary columns
head(spotify_cleaned1)
##       track_artist Mean_popularity
## 1       Ed Sheeran        65.05882
## 2         Maroon 5        42.40909
## 3     Zara Larsson        53.06250
## 4 The Chainsmokers        49.22727
## 5    Lewis Capaldi        76.88889
## 6       Ed Sheeran        65.05882
##                                              track_name track_popularity
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix               66
## 2                       Memories - Dillon Francis Remix               67
## 3                       All the Time - Don Diablo Remix               70
## 4                     Call You Mine - Keanu Silva Remix               60
## 5               Someone You Loved - Future Humans Remix               69
## 6     Beautiful People (feat. Khalid) - Jack Wins Remix               67
##   track_album_release_date playlist_genre playlist_subgenre danceability
## 1               2019-06-14            pop         dance pop        0.748
## 2               2019-12-13            pop         dance pop        0.726
## 3               2019-07-05            pop         dance pop        0.675
## 4               2019-07-19            pop         dance pop        0.718
## 5               2019-03-05            pop         dance pop        0.650
## 6               2019-07-11            pop         dance pop        0.675
##   energy key loudness mode speechiness acousticness instrumentalness
## 1  0.916   6   -2.634    1      0.0583       0.1020         0.00e+00
## 2  0.815  11   -4.969    1      0.0373       0.0724         4.21e-03
## 3  0.931   1   -3.432    0      0.0742       0.0794         2.33e-05
## 4  0.930   7   -3.778    1      0.1020       0.0287         9.43e-06
## 5  0.833   1   -4.672    1      0.0359       0.0803         0.00e+00
## 6  0.919   8   -5.385    1      0.1270       0.0799         0.00e+00
##   liveness valence   tempo duration_ms Popularity_factor
## 1   0.0653   0.518 122.036      194754              High
## 2   0.3570   0.693  99.972      162600            Medium
## 3   0.1100   0.613 124.008      176616            Medium
## 4   0.2040   0.277 121.956      169093            Medium
## 5   0.0833   0.725 123.976      189052              High
## 6   0.1430   0.585 124.982      163049              High
spotify_cleaned2 <- spotify_cleaned1[, -c(1,2,3,5,7)]
str(spotify_cleaned2)
## 'data.frame':    28346 obs. of  15 variables:
##  $ track_popularity : num  66 67 70 60 69 67 62 69 68 67 ...
##  $ playlist_genre   : chr  "pop" "pop" "pop" "pop" ...
##  $ danceability     : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy           : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key              : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
##  $ loudness         : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode             : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
##  $ speechiness      : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness     : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness         : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence          : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo            : num  122 100 124 122 124 ...
##  $ duration_ms      : num  194754 162600 176616 169093 189052 ...
##  $ Popularity_factor: Factor w/ 3 levels "low","Medium",..: 3 2 2 2 3 3 2 1 2 2 ...

Modelling

Methodology

Because the variables are not all on the same scale, we opt for Random Forest: its tree-based splits depend only on the ordering of values, so no feature scaling is needed. It is also robust to outliers. Using this model, we will understand the importance of the variables. Prior to building our model, we change the data type of playlist_genre to factor.

train.index <- createDataPartition(spotify_cleaned2$Popularity_factor, p = .7, list = FALSE)
train_data <- spotify_cleaned2[ train.index,]
test_data  <- spotify_cleaned2[-train.index,]

train_data$playlist_genre <- as.factor(train_data$playlist_genre)

x_train <- train_data[,-1]
y_train <- train_data[,1]

spotify.rf<- randomForest(track_popularity~., data = train_data, importance=TRUE)
spotify.rf
## 
## Call:
##  randomForest(formula = track_popularity ~ ., data = train_data,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 336.2267
##                     % Var explained: 40.12
spotify.rf$importance
##                         %IncMSE IncNodePurity
## playlist_genre     13.292229244     423930.30
## danceability        0.033328943     507968.11
## energy             12.529168499     536838.18
## key                -9.891497975     648815.81
## loudness            8.651718845     563202.74
## mode               -0.721128914      59902.01
## speechiness        -0.004855765     507948.38
## acousticness        3.981640243     529580.07
## instrumentalness    2.071513631     450294.80
## liveness           -2.437507770     511314.39
## valence            -0.937745389     509395.44
## tempo              -0.706977596     541658.48
## duration_ms         0.136618031     621661.70
## Popularity_factor 389.922576181    4182132.01
varImpPlot(spotify.rf)

y_pred_rf_train<-predict(spotify.rf)
sqrt(mean((y_pred_rf_train - y_train)^2)) 
## [1] 18.33648

To build the Random Forest, we first split the data into training and test sets. We then run the algorithm on the training set to determine the most important variables in predicting song popularity. From varImpPlot, the artist’s popularity (Popularity_factor) is by far the most important variable on both measures. key ranks second on IncNodePurity, but its %IncMSE is negative, so its importance should be read with caution; on %IncMSE, playlist_genre, energy and loudness follow the artist’s popularity.

svm_model <- svm(track_popularity ~ . , train_data)

y_pred_svm_train<-predict(svm_model)
sqrt(mean((y_pred_svm_train - y_train)^2)) 
## [1] 17.49928

We then fit the SVM algorithm on the training data. The root mean squared error on the training set is 17.50, compared with 18.34 (out-of-bag) for the Random Forest.
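Both errors above are computed on the training rows (out-of-bag for the Random Forest, in-sample for the SVM); test_data is not used. A minimal sketch of how a held-out error could be computed, using simulated data and a linear model as a stand-in for the actual models; toy, rmse() and the seed are all assumptions made for illustration:

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Simulated data standing in for the real features and target
n   <- 200
toy <- data.frame(energy = runif(n), loudness = runif(n, -20, 0))
toy$track_popularity <- 40 + 10 * toy$energy + 0.5 * toy$loudness +
  rnorm(n, sd = 5)

# 70/30 split, mirroring the proportions used in the report
train_idx <- sample(n, size = round(0.7 * n))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Stand-in model; the report uses randomForest() and svm() instead
fit <- lm(track_popularity ~ ., data = train_toy)

rmse_train <- rmse(train_toy$track_popularity, predict(fit))
rmse_test  <- rmse(test_toy$track_popularity, predict(fit, newdata = test_toy))

c(train = rmse_train, test = rmse_test)
```

For the fitted models in this report, the same pattern would call predict() with newdata = test_data, after converting playlist_genre in test_data to a factor with the same levels as in train_data.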

Prediction Of Genre

Now, to predict the genres of songs, we use a decision tree built with rpart.

spotify_cleaned4 <- spotify_songs_final[,-c(1,2,3,4,6)]
index <- sample(nrow(spotify_cleaned4),nrow(spotify_cleaned4)*0.70)
spotify.train <- spotify_cleaned4[index,]
spotify.test <- spotify_cleaned4[-index,]

modfit.rpart <- rpart(playlist_genre ~ ., data=spotify.train, method="class")
print(modfit.rpart, digits = 3)
## n= 19842 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 19842 16100 rap (0.17 0.15 0.18 0.16 0.19 0.15)  
##    2) speechiness< 0.14 15133 11900 pop (0.19 0.15 0.21 0.15 0.1 0.19)  
##      4) danceability>=0.563 10894  8500 pop (0.21 0.19 0.22 0.17 0.12 0.097)  
##        8) tempo>=121 5305  3440 edm (0.35 0.14 0.2 0.096 0.12 0.087)  
##         16) tempo< 130 3227  1580 edm (0.51 0.13 0.2 0.055 0.047 0.066) *
##         17) tempo>=130 2078  1570 rap (0.11 0.17 0.2 0.16 0.24 0.12) *
##        9) tempo< 121 5589  4250 pop (0.074 0.23 0.24 0.23 0.12 0.11)  
##         18) duration_ms< 2.46e+05 3986  2860 pop (0.08 0.27 0.28 0.17 0.12 0.082)  
##           36) danceability>=0.71 2077  1350 latin (0.066 0.35 0.23 0.17 0.15 0.041) *
##           37) danceability< 0.71 1909  1260 pop (0.096 0.18 0.34 0.16 0.088 0.13) *
##         19) duration_ms>=2.46e+05 1603   972 r&b (0.057 0.13 0.14 0.39 0.11 0.17) *
##      5) danceability< 0.563 4239  2450 rock (0.15 0.055 0.19 0.12 0.061 0.42) *
##    3) speechiness>=0.14 4709  2550 rap (0.11 0.14 0.083 0.18 0.46 0.034) *
predictions_rpart <- predict(modfit.rpart, spotify.test[,-1], type = "class")
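The predictions are computed but not evaluated in this report. A minimal sketch of how they could be scored with a confusion matrix and overall accuracy; the two vectors below are made up and stand in for spotify.test$playlist_genre and predictions_rpart:

```r
# Made-up true and predicted genres for six songs
genre_levels <- c("edm", "pop", "rap", "rock")
actual    <- factor(c("pop", "rap", "rock", "pop", "edm", "rap"),
                    levels = genre_levels)
predicted <- factor(c("pop", "rap", "pop", "pop", "edm", "rock"),
                    levels = genre_levels)

# Rows are predictions, columns are true genres
conf_mat <- table(Predicted = predicted, Actual = actual)

# Share of songs on the diagonal, i.e. classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

conf_mat
accuracy
```

Using the same factor levels for both vectors keeps the table square, so the diagonal always lines up with the correct predictions.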

Summary

Problem Statement

The main focus of our analysis is to determine the audio features which predict the popularity of a song and also classify the given songs into particular genres.

Solution Methodology

To understand these audio features, we used tabular and graphical methods to observe the trends of audio features such as valence, danceability etc. We formulated a new categorical variable to classify the artist’s popularity into high, medium and low.

After this, we used the random forest and SVM algorithms to predict the popularity of the songs, and a decision tree to classify the songs into their respective genres.

Insights

Dance Monkey is the most popular track, whereas Trevor Daniel is the most popular artist by mean popularity. Pop is the most popular genre, followed by Rap and Latin. The loudness and energy in Pop are higher compared to the other genres. The artist’s popularity is the most important variable in predicting the popularity of a song.

Implications

From this report, the consumer can understand the audio features which determine the genre of a song. They can understand why they are hooked on a particular genre, that is, the features which keep them glued to that genre. Also, from the models, they can predict what the popularity of the next song from their favorite artist is going to be.

Limitations

More algorithms can be deployed in the prediction of genre and popularity of songs. This might help us in developing a more accurate model.