Data Wrangling Final Project

Spotify Genre Data

Introduction

Most of us like to listen to music of our own choosing and to share our playlists with others. Spotify is one platform where you can not only listen to songs but also create playlists that can be viewed and played by other people with similar interests. Many factors may determine the popularity of a track, including its acoustic characteristics. This is a brief analysis of a Spotify dataset in which we try to determine whether the popularity of a track is related to attributes such as artist name, duration and acoustic features like loudness and tempo.

First, we will clean and manipulate the Spotify data to make it suitable for our analysis. Then we will explore and visualize the data to gain insights that are not self-evident. Lastly, to address our problem statement, which is to predict the popularity of a track from its characteristics, we will apply some machine learning techniques and compare them to select the best model.

Exploring and visualizing data to uncover hidden information is known as exploratory data analysis. We will plot different graphs and charts, such as histograms, lollipop charts and radial graphs, to accomplish this goal.

We will apply supervised machine learning techniques to build our predictive models. We will begin with a linear regression model and then move on to more sophisticated models such as SVM and random forest. Finally, we will compare these three models to select the one best suited for prediction.
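The comparison step can be sketched as follows. This is a minimal illustration on simulated data, not the project's actual models: the data frame `toy`, the 80/20 split and the choice of RMSE as the comparison metric are all assumptions, and an intercept-only baseline stands in for the more sophisticated models so that the sketch needs only base R.

```r
# Simulated data: popularity loosely driven by one feature (made-up values).
set.seed(42)
n <- 200
toy <- data.frame(loudness = runif(n, -20, 0))
toy$popularity <- 60 + 2 * toy$loudness + rnorm(n, sd = 5)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Root mean squared error: lower values mean better predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Candidate 1: intercept-only baseline. Candidate 2: linear regression.
fit_base <- lm(popularity ~ 1, data = train)
fit_lm   <- lm(popularity ~ loudness, data = train)

# Score every candidate on the same held-out rows and pick the lowest error.
scores <- c(
  baseline = rmse(test$popularity, predict(fit_base, newdata = test)),
  linear   = rmse(test$popularity, predict(fit_lm, newdata = test))
)
best_model <- names(which.min(scores))
print(scores)
print(best_model)
```

The same pattern extends to any number of candidates: each model is fitted on the training rows only and scored on the same test rows, so the errors are directly comparable.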

Our analysis can be used directly by Spotify to better maintain its song database and to attract more customers by suggesting songs based on popularity. It can also be used by Spotify's end customers, like you and me, to search effectively for popular trends and playlists with the most popular tracks, or to create such playlists.

Packages Required

The packages we have used in this project are:

library(dplyr)
library(tidyr)        # provides gather(), which is used for the box plots
library(corrplot)
library(funModeling)
library(ggplot2)
library(GGally)
library(rsample)
library(randomForest)
library(e1071)

We used the chunk options message=FALSE and warning=FALSE to suppress the messages and warnings that can arise when this chunk runs.

Let’s have a look at each package we have used or plan to use:

dplyr: This package aims to provide a function for each basic verb of data manipulation:

filter() to select cases based on their values.
arrange() to reorder the cases.
select() and rename() to select variables based on their names.
mutate() and transmute() to add new variables that are functions of existing variables.
summarise() to condense multiple values to a single value.
sample_n() and sample_frac() to take random samples.

More info on: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

In our project, we use distinct() to remove duplicate tracks, along with several of the verbs listed above.
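A short sketch of how these verbs chain together is shown below. The data frame `tracks` and its values are made up for illustration; only the column names mirror the Spotify data.

```r
library(dplyr)

# Hypothetical toy tracks, including one duplicated track name.
tracks <- data.frame(
  track_name       = c("A", "B", "B", "C"),
  track_popularity = c(70, 40, 45, 90),
  duration_ms      = c(180000, 240000, 240000, 200000),
  stringsAsFactors = FALSE
)

result <- tracks %>%
  distinct(track_name, .keep_all = TRUE) %>%       # keep the first row per track name
  filter(track_popularity >= 50) %>%               # keep cases based on their values
  mutate(duration_min = duration_ms / 60000) %>%   # add a variable derived from another
  arrange(desc(track_popularity)) %>%              # reorder the cases
  select(track_name, track_popularity, duration_min)

print(result)
```

Each verb takes a data frame and returns a data frame, which is what makes the pipe-based style used throughout this report possible.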

corrplot: The corrplot package provides a graphical display of a correlation matrix and confidence intervals. It also contains some algorithms for matrix reordering. In addition, corrplot is good at details, including choosing colors, text labels, color labels, layout, etc.

More info on: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

We have used the corrplot() function from this package for data visualization.

funModeling: This package is generally used for data preparation, profiling, selecting the best variables, data visualization, assessing model performance and other tasks.

More info on: https://cran.r-project.org/web/packages/funModeling/index.html

We have used the plot_num() function from this package.

ggplot2: It is a system for declaratively creating graphics, based on The Grammar of Graphics. You generally supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).

More info on: https://cran.r-project.org/web/packages/ggplot2/index.html

We have used multiple functions from this package for visualizing the data.

GGally: It extends ‘ggplot2’ by adding several functions to reduce the complexity of combining geometric objects with transformed data. Some of these functions include a pairwise plot matrix, a two group pairwise plot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks.

More info on: https://cran.r-project.org/web/packages/GGally/index.html

We have used multiple functions from this package for visualizing the data.

rsample: It has functions to create variations of a data set that can be used to evaluate models or to estimate the sampling distribution of some statistic.

More info on: https://cran.r-project.org/web/packages/rsample/index.html

We have used initial_split() from this package to split the data into training and testing samples.

e1071: This package has functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier etc.

More info on: https://cran.r-project.org/web/packages/e1071/index.html

We have used this package to implement SVM.

randomForest: It implements Breiman’s random forest algorithm (based on Breiman and Cutler’s original Fortran code) for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.

More info on: https://cran.r-project.org/web/packages/randomForest/index.html

We have used this package to implement random forest.

Data Preparation

About Data

The dataset that we will be analyzing can be found here.

The original data comes from Spotify via the spotifyr package. The authors of this package are Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. It makes it easier to get either your own data or general metadata around songs from Spotify’s API.

spotifyr is an R wrapper that can be used to pull track audio features and other information from Spotify’s Web API in bulk. It lets you enter an artist’s name and retrieve their entire discography, along with Spotify’s audio features and track/album popularity metrics. It also allows you to pull the song and playlist information of any Spotify user.

A detailed description of the attributes and their data type is given in the table below:

Variable	Class	Description
track_id	character	Song unique ID
track_name	character	Song name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds
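
The integer encodings of key and mode described in the table can be decoded into readable labels. The sketch below is illustrative only: key_to_pitch() and mode_label are hypothetical names that are not used elsewhere in this project, sharps and flats are spelled with ASCII characters, and the sample values are the first three key and mode values reported by str() for this dataset plus a -1 to show the "no key detected" case.

```r
# Pitch-class names for key values 0..11 (0 = C, 1 = C#/Db, 2 = D, ...).
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: map integer keys to names.
# A key of -1 (no key detected) or any other out-of-range value becomes NA.
key_to_pitch <- function(key) {
  ifelse(key >= 0 & key <= 11,
         pitch_classes[pmax(key, 0) + 1],   # R vectors are 1-indexed
         NA_character_)
}

keys  <- c(6, 11, 1, -1)
modes <- c(1, 1, 0, 1)

pitches    <- key_to_pitch(keys)
mode_label <- ifelse(modes == 1, "major", "minor")  # 1 = major, 0 = minor

print(data.frame(key = keys, pitch = pitches, mode = mode_label))
```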

Import and Explore Data

The data is stored in a CSV file, so we use the read.csv() function to read it.

spotify <- read.csv("C:/Users/khann/Desktop/COLLEGE/R/spotify_songs.csv", stringsAsFactors = FALSE)

Then, we check the structure and dimensions of the data frame as well as a sample of the data.

str(spotify)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

head(spotify,5)

##                 track_id
## 1 6f807x0ima9a1j3VPbc7VN
## 2 0r7CVbZTWZgbTCYdfa2P31
## 3 1z1Hg7Vb0AhHDiEmnDE79l
## 4 75FpbthrwQmzHlBJLuGdC7
## 5 1e8PAfcKUYoKkxPhrHqw4x
##                                              track_name     track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix       Ed Sheeran
## 2                       Memories - Dillon Francis Remix         Maroon 5
## 3                       All the Time - Don Diablo Remix     Zara Larsson
## 4                     Call You Mine - Keanu Silva Remix The Chainsmokers
## 5               Someone You Loved - Future Humans Remix    Lewis Capaldi
##   track_popularity         track_album_id
## 1               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2               67 63rPSO264uRjW1X5E6cWv6
## 3               70 1HoSmj2eLcsrR0vE9gThr4
## 4               60 1nqYsOef1yKKuGOVchbsk6
## 5               69 7m7vv9wlQ4i0LFuJiE2zsQ
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
##   track_album_release_date playlist_name            playlist_id
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
##   playlist_genre playlist_subgenre danceability energy key loudness mode
## 1            pop         dance pop        0.748  0.916   6   -2.634    1
## 2            pop         dance pop        0.726  0.815  11   -4.969    1
## 3            pop         dance pop        0.675  0.931   1   -3.432    0
## 4            pop         dance pop        0.718  0.930   7   -3.778    1
## 5            pop         dance pop        0.650  0.833   1   -4.672    1
##   speechiness acousticness instrumentalness liveness valence   tempo
## 1      0.0583       0.1020         0.00e+00   0.0653   0.518 122.036
## 2      0.0373       0.0724         4.21e-03   0.3570   0.693  99.972
## 3      0.0742       0.0794         2.33e-05   0.1100   0.613 124.008
## 4      0.1020       0.0287         9.43e-06   0.2040   0.277 121.956
## 5      0.0359       0.0803         0.00e+00   0.0833   0.725 123.976
##   duration_ms
## 1      194754
## 2      162600
## 3      176616
## 4      169093
## 5      189052

The output of str() gives insight into the datatype of each variable along with some of the values for that variable. It can be observed that there are 23 variables and 32833 observations. The head() function gives the top 5 rows of the data frame.

Data Cleaning

To begin with, we will delete some attributes that are not useful for our analysis: track_id, track_album_id, track_album_name, track_album_release_date, playlist_name and playlist_id.

new_sp <- spotify[ -c(1,5:9) ]
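
Positions 1 and 5 to 9 correspond to track_id, track_album_id, track_album_name, track_album_release_date, playlist_name and playlist_id in the str() output above. Dropping columns by name instead of by position is a common alternative that keeps working if the column order changes. The following is a minimal sketch on a made-up data frame; `toy`, `drop_cols` and `kept` are hypothetical names, and this is not the code used in the project.

```r
# Toy data frame with a subset of the real column names (values are made up).
toy <- data.frame(
  track_id       = c("id1", "id2"),
  track_name     = c("Song A", "Song B"),
  playlist_id    = c("p1", "p1"),
  playlist_genre = c("pop", "pop"),
  energy         = c(0.9, 0.8),
  stringsAsFactors = FALSE
)

# Columns to drop, listed by name instead of position.
drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "track_album_release_date", "playlist_name", "playlist_id")

# setdiff() keeps every column that is not named in drop_cols and simply
# ignores names that are absent from the data frame.
kept <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
print(names(kept))
```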

Next, we will identify the distinct track names and then remove the duplicates. This can significantly reduce the number of records.

new_sp <- distinct(new_sp,track_name, .keep_all = TRUE )
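
To make the effect of this step concrete, the sketch below reproduces the same "keep the first row per track name" behaviour in base R on a made-up data frame. The names `toy`, `deduped` and `n_removed` are hypothetical. One consequence worth keeping in mind is that different songs that happen to share a title are also collapsed into a single row, because only track_name is compared.

```r
toy <- data.frame(
  track_name     = c("Song A", "Song B", "Song A", "Song C"),
  track_artist   = c("Artist 1", "Artist 2", "Artist 3", "Artist 1"),
  playlist_genre = c("pop", "rock", "edm", "pop"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence, so negating it
# keeps the first row for each track name, as distinct(..., .keep_all = TRUE) does.
deduped <- toy[!duplicated(toy$track_name), ]

# Number of rows removed by deduplication.
n_removed <- nrow(toy) - nrow(deduped)

print(deduped)
print(n_removed)
```

In this toy example the second "Song A" is by a different artist, yet it is still removed.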

Next, we will identify missing values.

colSums(is.na(new_sp))

##        track_name      track_artist  track_popularity    playlist_genre 
##                 1                 1                 0                 0 
## playlist_subgenre      danceability            energy               key 
##                 0                 0                 0                 0 
##          loudness              mode       speechiness      acousticness 
##                 0                 0                 0                 0 
##  instrumentalness          liveness           valence             tempo 
##                 0                 0                 0                 0 
##       duration_ms 
##                 0

This shows that there are missing values in track_name and track_artist. Since these variables are the names of the track and the artist, we cannot impute them. Because the number of missing values is very small and will not affect the analysis, we can safely remove these records.

new_sp <- na.omit(new_sp)
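
The two steps above, counting missing values per column and dropping incomplete rows, can be tried on a small made-up data frame. The names `toy`, `na_counts` and `cleaned` are hypothetical and the values are invented for illustration.

```r
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist 1", NA, "Artist 3"),
  energy       = c(0.9, 0.5, 0.7),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))

# na.omit() drops every row that contains at least one NA in any column.
cleaned <- na.omit(toy)

print(na_counts)
print(cleaned)
```

Note that na.omit() removes a row if any column is missing, so it is best applied after irrelevant columns have already been dropped.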

Lastly, we will reorder the columns to put all the numeric variables together. This makes it easier to code as well as analyze in further steps.

new_sp <- select(new_sp,track_name,track_artist, playlist_genre, playlist_subgenre, track_popularity, danceability:duration_ms)

Cleansed Data

Let’s have a look at the data after performing all the steps of data cleaning.

colSums(is.na(new_sp))

##        track_name      track_artist    playlist_genre playlist_subgenre 
##                 0                 0                 0                 0 
##  track_popularity      danceability            energy               key 
##                 0                 0                 0                 0 
##          loudness              mode       speechiness      acousticness 
##                 0                 0                 0                 0 
##  instrumentalness          liveness           valence             tempo 
##                 0                 0                 0                 0 
##       duration_ms 
##                 0

dim(new_sp)

## [1] 23449    17

names(new_sp)

##  [1] "track_name"        "track_artist"      "playlist_genre"   
##  [4] "playlist_subgenre" "track_popularity"  "danceability"     
##  [7] "energy"            "key"               "loudness"         
## [10] "mode"              "speechiness"       "acousticness"     
## [13] "instrumentalness"  "liveness"          "valence"          
## [16] "tempo"             "duration_ms"

As can be seen, there are no missing values now. The new data frame has 17 attributes and 23449 records, since the duplicate tracks have been removed. The output of the names() function lists the variables in the new data frame.

Summary Statistics

Before moving ahead with data exploration, let’s have a look at the output of the summary() function.

summary(new_sp, 5)

##   track_name        track_artist       playlist_genre    
##  Length:23449       Length:23449       Length:23449      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  playlist_subgenre  track_popularity  danceability        energy        
##  Length:23449       Min.   : 0.00    Min.   :0.0000   Min.   :0.000175  
##  Class :character   1st Qu.:23.00    1st Qu.:0.5630   1st Qu.:0.578000  
##  Mode  :character   Median :43.00    Median :0.6720   Median :0.720000  
##                     Mean   :39.74    Mean   :0.6552   Mean   :0.696638  
##                     3rd Qu.:58.00    3rd Qu.:0.7620   3rd Qu.:0.841000  
##                     Max.   :98.00    Max.   :0.9830   Max.   :1.000000  
##       key            loudness            mode         speechiness    
##  Min.   : 0.000   Min.   :-46.448   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.: -8.341   1st Qu.:0.0000   1st Qu.:0.0412  
##  Median : 6.000   Median : -6.285   Median :1.0000   Median :0.0637  
##  Mean   : 5.374   Mean   : -6.848   Mean   :0.5667   Mean   :0.1103  
##  3rd Qu.: 9.000   3rd Qu.: -4.727   3rd Qu.:1.0000   3rd Qu.:0.1390  
##  Max.   :11.000   Max.   :  1.275   Max.   :1.0000   Max.   :0.9180  
##   acousticness    instrumentalness       liveness         valence      
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0148   1st Qu.:0.0000000   1st Qu.:0.0927   1st Qu.:0.3300  
##  Median :0.0827   Median :0.0000192   Median :0.1270   Median :0.5130  
##  Mean   :0.1811   Mean   :0.0940749   Mean   :0.1916   Mean   :0.5114  
##  3rd Qu.:0.2660   3rd Qu.:0.0069600   3rd Qu.:0.2500   3rd Qu.:0.6950  
##  Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910  
##      tempo         duration_ms    
##  Min.   :  0.00   Min.   :  4000  
##  1st Qu.: 99.97   1st Qu.:187200  
##  Median :122.00   Median :216133  
##  Mean   :121.08   Mean   :226107  
##  3rd Qu.:134.78   3rd Qu.:254493  
##  Max.   :239.44   Max.   :517810

For string variables, this function displays the length, class and mode; for numeric variables, it displays the minimum, maximum, quartiles, median and mean.

There are four string variables- track_name, track_artist, playlist_genre and playlist_subgenre- each having 23449 records.

The rest are numeric variables. We observe that valence, liveness, instrumentalness, acousticness, speechiness, danceability and energy range between 0 and 1. Also, mode takes just two values, 0 and 1. In theory, track_popularity can take values between 0 and 100, but in this data its maximum value is 98.

Exploratory Data Analysis

Exploring Data

Histograms for Numeric Variables

options(repr.plot.width = 30, repr.plot.height = 30)
sp_sliced <- new_sp[, 5:17]
plot_num(sp_sliced)

Here, we can see the frequency distribution of each numeric variable. We observe that valence, tempo and duration_ms are approximately normally distributed, whereas speechiness, acousticness, instrumentalness and liveness are positively skewed. We also observe that loudness, danceability and energy are negatively skewed.
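
Skewness can also be quantified rather than read off a histogram. The sketch below computes the moment-based skewness coefficient, which is positive for a long right tail and negative for a long left tail. The function skewness() is defined here for illustration and is not part of the project code, and the example vectors are made up.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tailed <- c(1, 1, 1, 2, 2, 3, 10)   # long right tail: positive skew
left_tailed  <- -right_tailed             # mirrored values: negative skew
symmetric    <- c(-2, -1, 0, 1, 2)        # no skew

print(skewness(right_tailed))
print(skewness(left_tailed))
print(skewness(symmetric))
```

Applied column by column, for example with sapply(), this gives one number per variable that can be compared against the visual impression from the histograms.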

Box Plots to Analyze Outliers

# gather() comes from the tidyr package; it reshapes the 12 feature columns into long format
df_new <- tidyr::gather(sp_sliced, "danceability", "energy", "key", "loudness", "mode","speechiness","acousticness","instrumentalness","liveness","valence","tempo" ,"duration_ms", key="variables", value="value")
df_new$variables <- as.factor(df_new$variables)

ggplot(df_new, aes(x = variables, y = value)) +
  geom_boxplot(aes(fill= variables))+
    facet_wrap(vars(variables), scales="free") +theme_minimal(base_size = 26)

Next, we analyze the outliers in the numeric variables using box plots. Except for mode, valence and key, all of them have a significant number of outliers. Note that mode can only take the values 0 or 1, so it has no outliers.
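
The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule, which is the default in geom_boxplot(). The sketch below applies that rule directly in base R. The function iqr_outliers() and the example vectors are hypothetical and serve only to illustrate the rule.

```r
# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up durations in seconds with one extreme value.
durations <- c(180, 200, 210, 220, 230, 240, 900)
flags <- iqr_outliers(durations)
print(durations[flags])

# A binary variable such as mode cannot produce outliers under this rule
# when both values are common.
mode_values <- c(0, 0, 1, 1, 1)
print(any(iqr_outliers(mode_values)))
```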

Correlation Plot for Numeric Variables

corr <- cor(sp_sliced)
num <- corrplot(corr, method = "number", number.cex=0.60)

From this correlation plot, two values that stand out are 0.68 and -0.55. Here, 0.68 indicates a moderately positive correlation between loudness and energy whereas -0.55 indicates a moderately negative correlation between acousticness and energy.
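
Instead of reading the strongest values off the plot, the correlation matrix can be flattened into a table of variable pairs sorted by absolute correlation. The sketch below does this on simulated data. The data frame `toy` and the simulated relationships are assumptions chosen to mimic the signs reported above, not the real Spotify values.

```r
# Simulated features: loudness rises with energy, acousticness falls with it.
set.seed(1)
n <- 500
energy       <- runif(n)
loudness     <- -20 + 15 * energy + rnorm(n, sd = 3)
acousticness <- 1 - energy + rnorm(n, sd = 0.3)
toy <- data.frame(energy, loudness, acousticness)

corr <- cor(toy)

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx],
  stringsAsFactors = FALSE
)
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```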

Popularity Trends

General Popularity Trend

popular_df <- data.frame(x = new_sp$track_popularity)
popular_df %>%
  ggplot(aes(x=x, fill = "#AA66FF"))+
  geom_histogram(aes(y=..density..), color = 'black', fill="#AA66FF")+
  geom_density(aes(y=..density..),color = 'black',fill = 'grey',  alpha = 0.5,  kernel='gaussian')+
  geom_vline(aes(xintercept = mean(x)),color = 'red', linetype = 'dashed')+
  geom_vline(aes(xintercept = median(x)),color = 'blue', linetype = 'dashed')+
  geom_vline(aes(xintercept = quantile(x, probs = 0.25)),color = 'black')+
  geom_vline(aes(xintercept = quantile(x, probs = 0.75)),color = 'black')+
  theme_minimal()

# Add 1 before taking the log so that a popularity of 0 maps to 0 instead of -Inf
log_pop <- data.frame(x = log(new_sp$track_popularity + 1))
log_pop %>%
  ggplot(aes(x=x, fill = "#AA66FF"))+
  geom_histogram(aes(y=..density..), color = 'black', fill="#AA66FF")+
  geom_density(aes(y=..density..),color = 'black',fill = 'grey',  alpha = 0.5,  kernel='gaussian')+
  geom_vline(aes(xintercept = mean(x)),color = 'red', linetype = 'dashed')+
  geom_vline(aes(xintercept = median(x)),color = 'blue', linetype = 'dashed')+
  geom_vline(aes(xintercept = quantile(x, probs = 0.25)),color = 'black')+
  geom_vline(aes(xintercept = quantile(x, probs = 0.75)),color = 'black')+
  theme_minimal()

These histograms show the distribution of track_popularity. In the first graph, it is evident that the variable is not normally distributed and has a peak at the lower end.

In the next figure, we have applied a log transformation to track popularity to further analyze the distribution. After the transformation, the distribution is skewed to the left.
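
One detail matters when log-transforming popularity: the summary statistics show a minimum of 0, and log(0) is -Inf. Adding 1 inside the logarithm, or equivalently using log1p(), keeps every value finite. The sketch below shows the difference on a few made-up popularity scores; the vector `popularity` is hypothetical.

```r
# Made-up popularity scores covering the observed range, including zero.
popularity <- c(0, 1, 10, 50, 98)

plain   <- log(popularity)     # log(0) is -Inf
shifted <- log1p(popularity)   # log(1 + x), defined at zero

print(plain)
print(shifted)

# Adding 1 outside the log does not help: -Inf + 1 is still -Inf.
print(log(popularity) + 1)
```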

Popular Genres

new_sp %>%
  group_by(playlist_genre) %>%
  summarise(count = n(),
            avg_pop = mean(track_popularity)) %>%
  arrange(desc(avg_pop)) %>%  
  mutate(playlist_genre = factor(playlist_genre, levels = playlist_genre)) %>% 
  ggplot( aes(x = playlist_genre, y = avg_pop)) +
    geom_bar(stat="identity", width = 0.5, fill="violetred4") + 
    theme_light() +
    xlab("Genre") +
    ylab("Popularity Score")

From this graph, it can be observed that the genre “pop” has the highest mean track popularity. Hence, we will consider it the most popular genre on Spotify.

Popular Sub-genres

new_sp %>%
  group_by(playlist_genre, playlist_subgenre) %>%
  summarise(count=n(),
            avg_pop = mean(track_popularity)) %>%
  ggplot(aes(playlist_genre, playlist_subgenre))+
    geom_tile(aes(fill=avg_pop),colour="white")+
    scale_fill_gradient(low="pink",high = "brown")+
    theme_light() +
    xlab("Playlist Genre")+
    ylab("Playlist Subgenre")+
    ggtitle("Popularity of Genre and Subgenre")+
    guides(fill = "none")  # "none" hides the legend; FALSE is deprecated in recent ggplot2 versions

On diving deeper into the popularity of genre and sub-genre, we observe that post-teen pop and dance pop are more popular than other sub categories of pop. Also, it can be inferred that progressive electro house, new jack swing, new soul are among the least popular sub-genres.

Popular Artists

new_sp %>%
  group_by(track_artist) %>%
  summarise(count = n(),
            avg_pop = mean(track_popularity)) %>%
  arrange(avg_pop) %>%    
  top_n(20) %>%
  mutate(track_artist = factor(track_artist, levels = track_artist)) %>% 
  ggplot( aes(x = track_artist, y = avg_pop)) +
    geom_segment( aes(x = track_artist, xend = track_artist, y = avg_pop, yend = 0), color = "navyblue") +
    geom_point( size = 4, color = "blue", alpha = 0.8) +
    coord_flip() +
    theme_light() +
    xlab("Artist") +
    ylab("Popularity Score")

On analyzing the popularity of artists, Trevor Daniel, Y2K, Ant Saunders, Don Toliver and Kina emerge as the top 5 artists on Spotify.

Popular Tracks

new_sp %>%
  group_by(track_name) %>%
  summarise(count = n(),
            pop = track_popularity) %>%
  arrange(pop) %>%  
  top_n(20) %>%
  mutate(track_name = factor(track_name, levels = track_name)) %>% 
  ggplot( aes(x = track_name, y = pop)) +
    geom_bar(stat="identity", width = 0.5, fill="violetred4") + 
    theme_light() +  # a complete theme replaces any earlier one, so only theme_light() takes effect
    coord_flip() +
    xlab("Track Name") +
    ylab("Popularity Score")

This plot of the top 20 most popular tracks shows that Tusa by Nicki Minaj and Karol G has the highest popularity rating. Other notable songs in the top 20 are Memories by Maroon 5, Falling by Trevor Daniel, My Oh My by Camila Cabello, Lose You to Love Me by Selena Gomez and Bad Guy by Billie Eilish.

Scatter Plot of Popularity vs Valence per Genre

new_sp %>%
  select(track_popularity, valence, speechiness, tempo, track_artist, playlist_genre, playlist_subgenre) %>%
  group_by(track_popularity)%>%
  filter(!is.na(track_popularity)) %>%
  filter(!is.na(valence))%>%
  filter(!is.na(speechiness))%>%
  filter(!is.na(tempo))%>%
  ggplot(mapping = aes(x = valence, y = track_popularity, color = playlist_subgenre))+
  facet_wrap(~playlist_genre)+
  geom_point()+
  theme_minimal(base_size = 20)

This graph shows the relationship between popularity and valence for each genre. For instance, the plot shows that for EDM the most common sub-genres are permanent wave, pop EDM, progressive electro house, southern hip hop, album rock and dance pop. Similar insights can be derived for the other genres.

For EDM, most songs from permanent wave are highly popular, and they also have higher valence, or musical positiveness.

Most of the Latin songs have a higher valence, and they belong to either the progressive electro house or the latin pop sub-genre.

Other Visualizations

Radial Charts to Analyze Popularity in Multiple Dimensions

new_sp %>%
  select(track_popularity, valence, speechiness, tempo, playlist_subgenre) %>%
  group_by(track_popularity)%>%
  filter(!is.na(track_popularity)) %>%
  filter(!is.na(valence))%>%
  filter(!is.na(speechiness))%>%
  filter(!is.na(tempo))%>%
  ggplot(mapping = aes(x = track_popularity, y = valence, color = tempo, alpha = speechiness, fill = tempo))+
  geom_bar(stat = 'identity', position = 'dodge')+
  coord_polar()+
  facet_wrap(~playlist_subgenre)+
  theme_minimal(base_size = 20)

This visualization shows how the following parameters relate to track popularity in each sub genre:

Valence: shown by the radial distance
Tempo: shown by the sequential coloring
Speechiness: shown by the fadedness

Insights for each sub genre:

Album rock: Speechiness seems to be low, as the chart is particularly faded. Many tracks have a higher valence. Only a few songs have a lower tempo, irrespective of popularity.
Big room: Speechiness and valence are on the lower side for most tracks, and the tempo seems to be higher. One particular song with a higher valence is also very popular. The chart is sparse, so very few songs fall in this sub genre.
Classic rock: Many songs fall in this sub genre, and most of them have higher valence and tempo and low speechiness.
Dance pop: A lot of tracks fall in this sub genre, most of them with a higher tempo and valence. Some of the highly popular songs also have a lower valence.
Electro house: Not many songs have a popularity greater than 75. Tempo is in the medium range.
Electropop: Some tracks with lower valence are highly popular. Speechiness seems to be low.
Gangster rap: Speechiness is very high, as most of the chart is solid. Valence also seems to be high. One highly popular song has a higher tempo.
Hard rock: Speechiness seems to be low and tempo is higher in almost all the tracks. One of the highly popular tracks has a lower valence.
Hip hop: Many of these tracks are popular and have high speechiness and medium to low tempo. Valence seems to be higher.
Hip pop: Some of the tracks have a very low tempo. Valence is generally higher, but some popular tracks have a lower valence.
Indie poptimism: Most of the tracks have low speechiness and higher valence. Tempo seems to be medium to low. One of the highly popular tracks has a lower tempo and medium valence.
Latin hip hop: Most of the tracks have medium to low tempo, higher valence and medium speechiness.
Latin pop: Many tracks fall in this sub genre, with low speechiness, high valence and mostly medium tempo.
Neo soul: For most tracks, valence seems to be medium, with medium to low tempo. One very popular song has higher tempo and valence.
New jack swing: None of the tracks are highly popular. Tempo ranges from medium to low, and valence is higher.
Permanent wave: Most of the popular tracks have higher valence and low speechiness.
Pop EDM: Speechiness is low. One very popular song has high valence and tempo.
Post teen pop: Most tracks have medium to high valence and are very popular.
Progressive electro house: This sub genre is not highly popular, and speechiness is low.
Reggaeton: Most tracks are popular and have a higher valence. Speechiness is low.
Southern hip hop: Speechiness and valence are high. There are also many tracks in this sub genre, as the chart is dense.
Trap: Most of the tracks are popular, with high valence and lower speechiness.
Tropical: Many tracks fall in this sub genre. Valence is higher and tempo seems to be medium to low.
Urban contemporary: Many highly popular songs have a lower valence. Most songs are popular and overall tempo seems to be medium to low.

Radial Chart to Visualize Energy of Tracks per Subgenre

# Radial bar chart: one stacked segment per track, grouped by subgenre
new_sp %>%
  select(energy, playlist_subgenre, speechiness, tempo, playlist_genre) %>%
  # keep only rows with complete values for the plotted variables
  filter(!is.na(energy), !is.na(playlist_subgenre),
         !is.na(speechiness), !is.na(tempo)) %>%
  ggplot(mapping = aes(x = playlist_subgenre, y = energy,
                       alpha = speechiness, fill = playlist_genre)) +
  geom_bar(stat = 'identity') +
  coord_polar() +
  theme_minimal(base_size = 20)

This visualization shows the energy and speechiness of the tracks in each subgenre, with the bars colored by the parent genre.

Insights:

Southern hip hop, progressive electro house, dance pop, electro house and indie poptimism have higher energy levels.
Hip hop has more speechiness at higher energy levels; a similar pattern can be seen in Latin hip hop.
Hip hop, new jack swing and reggaeton have lower energy levels.
Indie poptimism tracks are less speechy with increasing energy, which can be linked to its popularity in the previous chart.
Higher energy generally corresponds to lower popularity.
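
Because geom_bar(stat = 'identity') stacks one segment per track, the height of each bar in the radial chart reflects the summed energy of a subgenre, so a subgenre with many tracks can look more energetic than it is. The sketch below is one way to separate the two effects by comparing the total and the mean energy per subgenre. It uses a small simulated data frame (toy, with made-up subgenre sizes and energy ranges) rather than new_sp, so the numbers are illustrative only.

```r
# Hypothetical stand-in for new_sp: three subgenres with different track counts
set.seed(73)
toy <- data.frame(
  playlist_subgenre = rep(c("dance pop", "hip hop", "trap"), times = c(50, 10, 20)),
  energy = c(runif(50, 0.4, 0.6), runif(10, 0.8, 0.9), runif(20, 0.5, 0.7))
)

# Per-subgenre track count, summed energy (bar height) and mean energy
n_tracks     <- tapply(toy$energy, toy$playlist_subgenre, length)
total_energy <- tapply(toy$energy, toy$playlist_subgenre, sum)
mean_energy  <- tapply(toy$energy, toy$playlist_subgenre, mean)

per_subgenre <- data.frame(
  subgenre     = names(mean_energy),
  n_tracks     = as.vector(n_tracks),
  total_energy = as.vector(total_energy),
  mean_energy  = as.vector(mean_energy)
)
print(per_subgenre)

# The tallest stacked bar and the highest average energy need not coincide
tallest_bar  <- per_subgenre$subgenre[which.max(per_subgenre$total_energy)]
highest_mean <- per_subgenre$subgenre[which.max(per_subgenre$mean_energy)]
```

In this toy data the tallest bar belongs to the subgenre with the most tracks, while the highest mean energy belongs to the smallest one. Running the same summary on new_sp would show whether the energy ranking above holds per track.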

Machine Learning Techniques

We have created three models to predict the popularity of a track based on predictor variables such as valence, tempo and energy.

Linear Regression

First of all, we split the dataset into training and testing subsets.

set.seed(073)
split_fun <- initial_split(sp_sliced, prop = .85)

train <- training(split_fun)
test  <- testing(split_fun)

x_train <- train[, -1]
y_target <- train[, 1]

x_test <- test[, -1]
y_test <- test[, 1]

training <- data.frame(x_train, target = y_target)

Now we create a multiple regression model with track popularity as the response variable.

Linear_model <- lm(target ~ danceability + energy + key + loudness + mode +
                     speechiness + acousticness + instrumentalness + liveness +
                     valence + tempo + duration_ms, data = training)

summary(Linear_model)

## 
## Call:
## lm(formula = target ~ danceability + energy + key + loudness + 
##     mode + speechiness + acousticness + instrumentalness + liveness + 
##     valence + tempo + duration_ms, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.945 -16.278   2.869  17.715  62.207 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.789e+01  1.995e+00  34.027  < 2e-16 ***
## danceability      4.845e+00  1.253e+00   3.866 0.000111 ***
## energy           -2.380e+01  1.429e+00 -16.649  < 2e-16 ***
## key               6.748e-03  4.506e-02   0.150 0.880965    
## loudness          1.208e+00  7.634e-02  15.822  < 2e-16 ***
## mode              1.146e+00  3.292e-01   3.481 0.000501 ***
## speechiness      -6.250e+00  1.582e+00  -3.950 7.83e-05 ***
## acousticness      4.154e+00  8.640e-01   4.807 1.54e-06 ***
## instrumentalness -9.298e+00  7.227e-01 -12.866  < 2e-16 ***
## liveness         -4.432e+00  1.044e+00  -4.247 2.18e-05 ***
## valence           2.481e+00  7.696e-01   3.224 0.001267 ** 
## tempo             2.798e-02  6.085e-03   4.598 4.30e-06 ***
## duration_ms      -4.528e-05  2.671e-06 -16.956  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.63 on 19919 degrees of freedom
## Multiple R-squared:  0.06443,    Adjusted R-squared:  0.06386 
## F-statistic: 114.3 on 12 and 19919 DF,  p-value: < 2.2e-16

The summary shows that every predictor except key is significant at the 0.05 level. Although the overall F-test is significant, the adjusted R-squared is very low (0.064): the model explains only about 6% of the variance in track popularity. This can happen when the data inherently contain a large amount of variability that the predictors cannot explain.

# PRESS: sum of squared leave-one-out residuals, computed from the hat values
PRESS <- function(linear.model) {
  pr <- residuals(linear.model) / (1 - lm.influence(linear.model)$hat)
  sum(pr^2)
}

# MSPE: PRESS averaged over the number of observations
MSPE <- function(linear.model) {
  PRESS(linear.model) / length(residuals(linear.model))
}

# Predicted R-squared: 1 - PRESS / total sum of squares
pred_r_squared <- function(linear.model) {
  lm.anova <- anova(linear.model)
  tss <- sum(lm.anova$'Sum Sq')
  1 - PRESS(linear.model) / tss
}

MSPE(Linear_model)

## [1] 512.2262

RMSE <- sqrt(MSPE(Linear_model))
RMSE

## [1] 22.63241

pred_r_squared(Linear_model)

## [1] 0.06320997

summary(Linear_model)$r.squared

## [1] 0.06442833
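
The PRESS statistic used above relies on the fact that, for ordinary least squares, the leave-one-out residual of observation i equals its ordinary residual divided by (1 - h_ii), where h_ii is its hat value. The sketch below checks that shortcut against an explicit refit-per-observation loop. It runs on a small simulated data set (d, with made-up predictors x1 and x2), not on the Spotify data, where refitting nearly 20,000 models would be slow.

```r
# Simulated data: a hypothetical stand-in, small enough to refit n times
set.seed(73)
n <- 40
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)

# Shortcut: residual / (1 - hat value), as in the PRESS function
press_shortcut <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Explicit leave-one-out: refit without row i, then predict row i
loo_resid <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = d[-i, ])
  d$y[i] - predict(fit_i, newdata = d[i, ])
})
press_loo <- sum(loo_resid^2)

# Predicted R-squared from PRESS and the total sum of squares
tss <- sum((d$y - mean(d$y))^2)
pred_r2 <- 1 - press_shortcut / tss

print(c(shortcut = press_shortcut, explicit = press_loo, pred_r2 = pred_r2))
```

The two PRESS values agree up to floating-point error. Note that PRESS measures leave-one-out error on the training data, so the RMSE derived from it is not a test-set RMSE.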

test[, 'linpred'] <- predict(Linear_model,test, type="response")
head(test)

##    track_popularity danceability energy key loudness mode speechiness
## 31               56        0.641  0.869  11   -4.754    1      0.0423
## 36               55        0.748  0.831   1   -5.029    1      0.1150
## 38               63        0.633  0.854   0   -4.046    0      0.0432
## 39               67        0.563  0.810   2   -2.921    1      0.0522
## 47               56        0.789  0.893  10   -4.364    0      0.1680
## 48               57        0.640  0.838   6   -4.203    1      0.0416
##    acousticness instrumentalness liveness valence   tempo duration_ms
## 31      0.03190         1.31e-03   0.4030   0.358 128.091      178125
## 36      0.08230         7.36e-05   0.0757   0.894 128.024      185273
## 38      0.03820         2.83e-05   0.4340   0.659 126.026      172360
## 39      0.00522         0.00e+00   0.0846   0.495 129.975      247385
## 47      0.05380         0.00e+00   0.2210   0.410 121.956      144073
## 48      0.05880         1.99e-05   0.0424   0.587 124.081      193861
##     linpred
## 31 40.26702
## 36 43.51167
## 38 41.06565
## 39 41.95346
## 47 41.35510
## 48 43.10152

With an MSPE of about 512 (an RMSE of about 22.6 popularity points) and a predicted R-squared of only 0.063, this model cannot be considered accurate.
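
The RMSE above comes from PRESS on the training data, while the SVM and random forest below are scored on the held-out test set. A minimal sketch of scoring a linear model the same way is given here. The helpers rmse() and mae() and the simulated data frame toy are assumptions introduced for illustration; with the project data the same two helpers would be applied to test$track_popularity and test$linpred.

```r
# Hypothetical helper functions for test-set error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Simulated stand-in for the Spotify data (column names borrowed for readability)
set.seed(73)
n <- 200
toy <- data.frame(energy = runif(n), loudness = rnorm(n, mean = -7, sd = 3))
toy$target <- 60 - 20 * toy$energy + 1.2 * toy$loudness + rnorm(n, sd = 15)

# 85/15 split, mirroring the proportion used above
test_idx  <- sample(n, size = round(0.15 * n))
toy_train <- toy[-test_idx, ]
toy_test  <- toy[test_idx, ]

# Fit on the training rows only, then score on the unseen test rows
fit  <- lm(target ~ ., data = toy_train)
pred <- predict(fit, newdata = toy_test)

test_rmse <- rmse(toy_test$target, pred)
test_mae  <- mae(toy_test$target, pred)
print(c(RMSE = test_rmse, MAE = test_mae))
```

Scoring all three models with the same helpers on the same test rows keeps the later comparison like for like.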

Support Vector Machine

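In classification, the goal of an SVM is to take groups of observations and construct boundaries that predict which group future observations belong to based on their measurements. Since track popularity is numeric, we use the regression variant (eps-regression), which fits a function that keeps as many observations as possible within a margin of epsilon and penalizes those outside it according to the cost parameter. For our data set, the optimal cost was calculated to be 0.1, which penalizes such errors only lightly; the model fitted below, however, uses cost = 1. The gamma value of 0.1 has no effect here because a linear kernel does not use gamma. Due to computational limitations, we could not run the tuning function on the full data to enhance the model.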

set.seed(073)

mod.svm <- svm(target~., data = training, type = "eps-regression", kernel = "linear", cost = '1', gamma = '0.1')
print(mod.svm)

## 
## Call:
## svm(formula = target ~ ., data = training, type = "eps-regression", 
##     kernel = "linear", cost = "1", gamma = "0.1")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  18438

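set.seed(073)
# Test-set predictions and RMSE of the SVM
# (the value is stored as RMSE_tree and reused in the comparison plot)
test_pred_svm <- predict(mod.svm, x_test)
RMSE_tree <- sqrt(mean((test_pred_svm - y_test)^2))
RMSE_tree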

## [1] 22.90203

MAE_tree <- mean(abs(test_pred_svm - y_test))
MAE_tree

## [1] 18.82682

test[, 'svmpred'] <- predict(mod.svm, test)
head(test)

##    track_popularity danceability energy key loudness mode speechiness
## 31               56        0.641  0.869  11   -4.754    1      0.0423
## 36               55        0.748  0.831   1   -5.029    1      0.1150
## 38               63        0.633  0.854   0   -4.046    0      0.0432
## 39               67        0.563  0.810   2   -2.921    1      0.0522
## 47               56        0.789  0.893  10   -4.364    0      0.1680
## 48               57        0.640  0.838   6   -4.203    1      0.0416
##    acousticness instrumentalness liveness valence   tempo duration_ms
## 31      0.03190         1.31e-03   0.4030   0.358 128.091      178125
## 36      0.08230         7.36e-05   0.0757   0.894 128.024      185273
## 38      0.03820         2.83e-05   0.4340   0.659 126.026      172360
## 39      0.00522         0.00e+00   0.0846   0.495 129.975      247385
## 47      0.05380         0.00e+00   0.2210   0.410 121.956      144073
## 48      0.05880         1.99e-05   0.0424   0.587 124.081      193861
##     linpred  svmpred
## 31 40.26702 42.66323
## 36 43.51167 46.70114
## 38 41.06565 43.96005
## 39 41.95346 45.33414
## 47 41.35510 43.93442
## 48 43.10152 46.15517

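The root mean squared error (RMSE) of this SVM on the test set is 22.90203 and the mean absolute error (MAE) is 18.82682. Because RMSE is close to MAE, the errors are fairly uniform in size rather than dominated by a few very large misses. Even so, an average miss of about 19 points on Spotify's 0 to 100 popularity scale is large, so this model needs further improvement before it can predict track popularity accurately.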
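
One way around the computational limits mentioned above is to tune the cost on a random subsample instead of the full training set. The sketch below does this with tune() from the e1071 package and 3-fold cross-validation. The simulated data frame toy_training, the subsample size of 200 and the cost grid are all assumptions chosen for illustration; whether a cost found on a subsample carries over to the full data would still need checking.

```r
library(e1071)

# Simulated stand-in for the training data (column names borrowed for readability)
set.seed(73)
n <- 1000
toy_training <- data.frame(
  energy   = runif(n),
  loudness = rnorm(n, mean = -7, sd = 3),
  valence  = runif(n)
)
toy_training$target <- 60 - 20 * toy_training$energy +
  1.2 * toy_training$loudness + rnorm(n, sd = 15)

# Tune on a small random subsample to keep the run time manageable
sub <- toy_training[sample(n, 200), ]

tuned <- tune(
  svm, target ~ ., data = sub,
  type = "eps-regression", kernel = "linear",
  ranges = list(cost = c(0.1, 1, 10)),      # hypothetical grid
  tunecontrol = tune.control(cross = 3)     # 3-fold cross-validation
)

best_cost <- tuned$best.parameters$cost
cv_rmse   <- sqrt(tuned$best.performance)   # performance is the CV mean squared error
print(c(best_cost = best_cost, cv_rmse = cv_rmse))
```

The selected cost can then be passed to svm() when refitting on the full training set.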

Random Forest

Random forest is a supervised learning algorithm that uses an ensemble learning method for classification and regression. It operates by constructing a multitude of decision trees at training time and outputting the mode of the predicted classes (classification) or the mean prediction (regression) of the individual trees. Here it is used for regression.

set.seed(073)
rf <-randomForest(target~.,data=training, mtry=2, importance=TRUE, ntree = 448)
print(rf)

## 
## Call:
##  randomForest(formula = target ~ ., data = training, mtry = 2,      importance = TRUE, ntree = 448) 
##                Type of random forest: regression
##                      Number of trees: 448
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 499.4887
##                     % Var explained: 8.65

test_pred <- predict(rf,x_test) 
RMSE_tree1 <- sqrt(mean((test_pred - y_test)^2))                      
RMSE_tree1

## [1] 22.55016

MAE_tree1 <- mean(abs(test_pred - y_test))
MAE_tree1

## [1] 18.71556

test[, 'rfpred'] <- predict(rf, test)
head(test)

##    track_popularity danceability energy key loudness mode speechiness
## 31               56        0.641  0.869  11   -4.754    1      0.0423
## 36               55        0.748  0.831   1   -5.029    1      0.1150
## 38               63        0.633  0.854   0   -4.046    0      0.0432
## 39               67        0.563  0.810   2   -2.921    1      0.0522
## 47               56        0.789  0.893  10   -4.364    0      0.1680
## 48               57        0.640  0.838   6   -4.203    1      0.0416
##    acousticness instrumentalness liveness valence   tempo duration_ms
## 31      0.03190         1.31e-03   0.4030   0.358 128.091      178125
## 36      0.08230         7.36e-05   0.0757   0.894 128.024      185273
## 38      0.03820         2.83e-05   0.4340   0.659 126.026      172360
## 39      0.00522         0.00e+00   0.0846   0.495 129.975      247385
## 47      0.05380         0.00e+00   0.2210   0.410 121.956      144073
## 48      0.05880         1.99e-05   0.0424   0.587 124.081      193861
##     linpred  svmpred   rfpred
## 31 40.26702 42.66323 35.28330
## 36 43.51167 46.70114 40.08330
## 38 41.06565 43.96005 40.56842
## 39 41.95346 45.33414 41.60759
## 47 41.35510 43.93442 40.85093
## 48 43.10152 46.15517 45.20850

We could tune the random forest's parameters further, but computational limitations prevented us from doing so. The mean of squared residuals is 499.4887 and the percentage of variance explained is 8.65. On the test set the RMSE is 22.55016 and the MAE is 18.71556, which indicates that the model is not well suited for accurate predictions.

Hence this model could be improved considerably with further optimization. Note also that for data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Variable importance scores from a random forest are therefore less reliable for such data, which may apply to this dataset because it contains variables with multiple categories.
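
As a sketch of the kind of optimization meant here, the loop below compares the out-of-bag (OOB) error for several values of mtry, using fewer trees so that each fit stays cheap. It runs on a simulated data frame (toy) rather than the training data, and the choices of 100 trees and mtry from 1 to 4 are assumptions for illustration.

```r
library(randomForest)

# Simulated stand-in for the training data (column names borrowed for readability)
set.seed(73)
n <- 300
toy <- data.frame(
  energy   = runif(n),
  loudness = rnorm(n, mean = -7, sd = 3),
  valence  = runif(n),
  tempo    = rnorm(n, mean = 120, sd = 25)
)
toy$target <- 60 - 20 * toy$energy + 1.2 * toy$loudness + rnorm(n, sd = 15)

# Fit one forest per candidate mtry and record the final OOB mean squared error
mtry_grid <- 1:4
oob_mse <- sapply(mtry_grid, function(m) {
  fit <- randomForest(target ~ ., data = toy, mtry = m, ntree = 100)
  tail(fit$mse, 1)   # OOB MSE after the last tree
})

results   <- data.frame(mtry = mtry_grid, oob_mse = oob_mse, oob_rmse = sqrt(oob_mse))
best_mtry <- results$mtry[which.min(results$oob_mse)]
print(results)
```

Because the OOB error is computed from rows each tree did not see, this comparison needs no separate validation split.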

varImpPlot(rf)

The varImpPlot gives the importance of each variable with respect to two measures:

%IncMSE: how much the out-of-bag mean squared error increases when the values of a variable are randomly permuted. By default the increase is scaled by its standard deviation, so it is not a literal percentage. A larger value means the model depends more on that variable; in this case energy scores about 100, which marks it as an influential predictor of popularity.
IncNodePurity: the total decrease in the residual sum of squares from splits on that variable, averaged over all trees. (The Gini index plays this role only in classification forests.) As mentioned earlier, this plot may not rank the variables accurately if the dataset contains variables with multiple categorical levels.
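
The idea behind the permutation measure can be sketched by hand: shuffle one predictor, predict again, and see how much the error grows. The example below does this with a linear model as a simple stand-in (not a random forest, and without the out-of-bag machinery or scaling that randomForest applies) on simulated data in which energy drives the target and key does not.

```r
# Simulated data: energy affects the target, key is pure noise by construction
set.seed(73)
n <- 500
toy <- data.frame(energy = runif(n), key = sample(0:11, n, replace = TRUE))
toy$target <- 60 - 25 * toy$energy + rnorm(n, sd = 5)

fit <- lm(target ~ energy + key, data = toy)

# Mean squared error of the fitted model on a given data frame
mse <- function(data) mean((data$target - predict(fit, newdata = data))^2)
baseline <- mse(toy)

# Percentage increase in MSE after shuffling each predictor in turn
pct_increase <- sapply(c("energy", "key"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and the target
  100 * (mse(shuffled) - baseline) / baseline
})
print(round(pct_increase, 1))
```

Shuffling energy inflates the error substantially, while shuffling key barely changes it, which is why a large increase signals an important variable.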

Comparison

# Collect the RMSE of each model (RMSE_tree is the SVM, RMSE_tree1 the random forest)
df1 <- data.frame(Model = c("LinearModel", "SVM", "RF"),
                  values = c(RMSE, RMSE_tree, RMSE_tree1))

p <- ggplot(df1, mapping = aes(x = Model, y = values, fill = Model)) +
  geom_bar(stat = "identity", width = 0.5) +
  theme_classic(base_size = 26)
p + scale_y_continuous(expand = c(0, 0))

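From this graph, it can be inferred that the RMSE values of the three models are almost identical (between 22.55 and 22.90). In the six test rows displayed earlier, the SVM predictions lie closest to the actual popularity, but over the whole test set the random forest has the lowest RMSE (22.55016) and MAE (18.71556), and the SVM has the highest RMSE (22.90203). Note that the linear model's RMSE is derived from PRESS on the training data, whereas the other two are measured on the test set.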
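
Because the bars are nearly the same height, a small table can make the differences easier to read. The sketch below collects the error values reported in the sections above into one data frame and ranks the models by RMSE. The column names and the rmse_source labels are our own additions, and the linear model's MAE is left as NA because it was not computed.

```r
# Error values copied from the outputs reported above
metrics <- data.frame(
  model       = c("LinearModel", "SVM", "RF"),
  rmse        = c(22.63241, 22.90203, 22.55016),
  mae         = c(NA, 18.82682, 18.71556),
  rmse_source = c("PRESS on training data", "test set", "test set")
)

# Rank by RMSE and show how far each model is from the best one
metrics <- metrics[order(metrics$rmse), ]
metrics$gap_to_best <- metrics$rmse - min(metrics$rmse)
print(metrics, row.names = FALSE)
```

The largest gap is about 0.35 popularity points, which is negligible next to an RMSE of roughly 22.6.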

Summary

Problem Statement: This project aims to explore the Spotify dataset and predict the popularity of a soundtrack based on its characteristics, such as valence, tempo and energy.

Methodology: We have performed exploratory data analysis to uncover hidden trends from the data. Then we have built supervised learning models to predict the popularity of a track. These are multiple linear regression, SVM and random forest. We have analyzed and compared them on the basis of RMSE values and predictions made by these models.

Insights: Some of the interesting insights gained from EDA are:

Trevor Daniel emerged as the most popular artist.
Tusa by Nicki Minaj and Karol G has the highest popularity rating.
Pop is the most popular genre, and post-teen pop and dance pop are more popular than its other subgenres.
Southern hip hop, progressive electro house, dance pop, electro house and indie poptimism have higher energy levels.

Other insights have been covered in detail in the EDA section.

Implications: Our analysis can help consumers learn about popular trends and find playlists with the most popular soundtracks. The predictive models can also be useful for the Spotify business and for music creators, as they indicate the factors on which popularity is based. These factors may not be known explicitly, but analysis of the tracks makes them clearer.

Limitations: The predictive models proposed in this project cannot be used to predict popularity accurately. They need further optimization, but we were bound by the technical limitations of our systems.

By proposing these models, we wanted to gain a general idea of how to build predictive models for this dataset, with the eventual aim of accurately predicting how popular a track can be from a given set of features. Optimization of these models is left as future scope for this project.