Spotify Data Analysis

Introduction

1.1 We are using the Spotify dataset, which contains a range of audio features for songs that are labelled by genre. Spotify users are likely to be interested in recommendations within the genres they listen to most often, and because Spotify aims to improve the user experience, better recommendations should help improve customer satisfaction. Hence, in this project we try to predict the genre of a song from its audio features.
1.2 To predict the genre, we will use classification models such as Decision Tree, K-Nearest Neighbors, Random Forest and XGBoost. We will then compare the accuracy metrics of the models and proceed with the one that gives the best predictions on the available data.
1.3 The approach to solve this problem will be as follows:
• Performing EDA on the entire dataset to analyze the data distribution
• Cleaning the dataset by changing the datatypes of columns if required, checking and imputing null values, and correcting/formatting any values if required
• Analyzing correlation between features and performing feature reduction
• Building the classification model to predict the genre of the songs
• Predicting the genres using the model and analyzing accuracy scores and other metrics.
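The modelling steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the built-in `iris` data stands in for the songs table (with `Species` playing the role of `playlist_genre`), the 70/30 split and `k = 5` are arbitrary choices, and only two of the planned models (a decision tree via `rpart` and K-Nearest Neighbors via `class`) are shown.

```r
library(rpart)  # decision trees (recommended package shipped with R)
library(class)  # k-nearest neighbours (recommended package shipped with R)

set.seed(42)

# Stand-in data (assumption): iris replaces the songs table,
# Species replaces playlist_genre.
n         <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train     <- iris[train_idx, ]
test      <- iris[-train_idx, ]

# Decision tree classifier
tree_fit  <- rpart(Species ~ ., data = train, method = "class")
tree_pred <- predict(tree_fit, newdata = test, type = "class")

# KNN on scaled features; the test set is scaled with the
# centre and spread estimated on the training set only.
train_x <- scale(train[, 1:4])
test_x  <- scale(test[, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
knn_pred <- knn(train_x, test_x, cl = train$Species, k = 5)

# Accuracy = share of test rows whose predicted class matches the true class
accuracy <- c(
  decision_tree = mean(tree_pred == test$Species),
  knn           = mean(knn_pred == test$Species)
)
accuracy

# Confusion matrix for one of the models
table(predicted = tree_pred, actual = test$Species)
```

Comparing the models on the same held-out rows keeps the accuracy figures comparable; the confusion matrix shows which classes get mixed up with each other.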
1.4 Through this genre prediction model, Spotify can classify songs of all categories easily and suggest to its customers the kinds of songs they like to listen to. As customers receive better song recommendations, they may be more likely to purchase the premium membership offered by Spotify. Hence a satisfied customer can ultimately result in increased revenue for Spotify.

Packages Required

library(ggplot2)  #for Plotting

library(dplyr)    #for wrangling with dataframe
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot) #for plotting correlation between variables
## corrplot 0.90 loaded

Data Preparation

3.1 The data was collected from the following URL - [GitHub-TidyTuesday](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-21)
3.2 The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. Make sure to check out the spotifyr package website to see how you can collect your own data! Spotifyr is an R wrapper for pulling track audio features and other information from Spotify’s Web API in bulk. By automatically batching API requests, it allows you to enter an artist’s name and retrieve their entire discography in seconds, along with Spotify’s audio features and track/album popularity metrics. You can also pull song and playlist information for a given Spotify User (including yourself!).

3.3 We will be performing the following data importing and cleaning steps on the dataset
• Reading the file from the designated URL

spotify_songs_data <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

• Checking the number of rows and columns in the dataset

dim(spotify_songs_data)
## [1] 32833    23

• Renaming column names for any column if required

colnames(spotify_songs_data)    #not required
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

• Getting to know the data

Ensuring all numerical datatype columns have only numeric data and no character data

str(spotify_songs_data)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

Datatypes look fine, so no changes are required.
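The datatype check can also be done programmatically rather than by reading the `str()` output. A minimal sketch, assuming a small toy data frame (`toy`) in place of the real dataset and a hand-written vector of columns expected to be numeric (`expected_numeric`); both names are illustrative only:

```r
# Toy stand-in for the songs table (assumption, not the real data)
toy <- data.frame(
  track_id       = c("a1", "b2", "c3"),
  danceability   = c(0.748, 0.726, 0.675),
  key            = c(6L, 11L, 1L),
  playlist_genre = c("pop", "rock", "rap"),
  stringsAsFactors = FALSE
)

# Columns we expect to hold numbers (illustrative subset)
expected_numeric <- c("danceability", "key")

# TRUE for every column that really is numeric (integer or double)
is_num <- vapply(toy[expected_numeric], is.numeric, logical(1))
is_num

# Stop early if any expected-numeric column was read in as text
stopifnot(all(is_num))
```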

head(spotify_songs_data, 4)
##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
summary(spotify_songs_data)
##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

We can see a clear outlier in duration_ms: the minimum is 4000 ms, i.e. a track only 4 seconds long. This is implausible for a song, so we remove that row.

spotify_songs_data <- filter(spotify_songs_data, duration_ms > 4000)  #keep only tracks longer than 4 seconds (4000 ms)
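The filter above uses a fixed threshold. As a complementary check (not the method used in this report), outliers can be flagged with the common 1.5 × IQR rule. The sketch below assumes a short hand-made vector of durations, so the values are illustrative rather than the real column:

```r
# Illustrative durations in milliseconds (assumption, not the full column)
duration_ms <- c(4000, 162600, 187819, 194754, 216000, 225800, 253585, 517810)

# Quartiles and interquartile range
q   <- unname(quantile(duration_ms, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences of the 1.5 * IQR rule
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are flagged for manual inspection
flagged <- duration_ms[duration_ms < lower | duration_ms > upper]
flagged
```

Flagged values are candidates for inspection, not automatic removal: a very long track may be legitimate, whereas a 4-second one is unlikely to be a real song.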

• Now, we are looking for the representation of different Genres in our sample dataset:

summary_data <- spotify_songs_data %>%
  group_by(playlist_genre) %>%
  summarise(Count = n())
summary_data
## # A tibble: 6 × 2
##   playlist_genre Count
##   <chr>          <int>
## 1 edm             6043
## 2 latin           5155
## 3 pop             5507
## 4 r&b             5431
## 5 rap             5746
## 6 rock            4950

So the sample appears reasonably balanced, as the songs are almost uniformly distributed across all 6 genres.
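The balance can be quantified from the counts printed above. A small sketch that re-enters those counts by hand (the vector name `genre_counts` is just for illustration):

```r
# Genre counts copied from the summary table above
genre_counts <- c(edm = 6043, latin = 5155, pop = 5507,
                  "r&b" = 5431, rap = 5746, rock = 4950)

# Share of each genre in percent
genre_share <- round(100 * genre_counts / sum(genre_counts), 1)
genre_share

# Ratio of the largest to the smallest class; values near 1 mean balanced
imbalance_ratio <- max(genre_counts) / min(genre_counts)
imbalance_ratio
```

Each genre holds roughly 15% to 18% of the rows, so plain accuracy is a reasonable first metric here.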

• Checking null values and then either imputing the null values with appropriate values, or removing them from the dataset

sum(is.na(spotify_songs_data))
## [1] 15

Since there are only 15 missing values, we drop the rows that contain them

spotify_songs_data <- na.omit(spotify_songs_data)  
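Note that `sum(is.na(...))` counts missing cells, while `na.omit()` removes whole rows, so the number of rows dropped can be smaller than the number of missing values. A minimal sketch on a toy data frame (an assumption standing in for the real data):

```r
# Toy data: one row is missing both its name and its artist
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C", "Song D"),
  track_artist = c("Artist 1", NA, "Artist 3", "Artist 4"),
  energy       = c(0.9, 0.8, 0.7, 0.6),
  stringsAsFactors = FALSE
)

total_na     <- sum(is.na(toy))            # missing cells across the table
na_per_col   <- colSums(is.na(toy))        # missing cells per column
rows_with_na <- sum(!complete.cases(toy))  # rows with at least one NA

cleaned <- na.omit(toy)                    # drops rows, never columns

total_na
na_per_col
rows_with_na
nrow(cleaned)
```

Checking `colSums(is.na(...))` first shows which columns the missing values sit in, which helps decide whether dropping rows loses anything relevant to the model.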

• Since we have track_id, we can remove the other identifying columns: track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id and playlist_subgenre. We are now left with 13 numeric independent variables (track_popularity and the 12 audio features) and 1 dependent variable (playlist_genre). Let us create another dataframe with only the numerical columns for analysis; we can add the response variable to it later.

num_col <- unlist(lapply(spotify_songs_data,is.numeric))

spotify_num_col <- spotify_songs_data[,num_col]

head(spotify_num_col, 5)
##   track_popularity danceability energy key loudness mode speechiness
## 1               66        0.748  0.916   6   -2.634    1      0.0583
## 2               67        0.726  0.815  11   -4.969    1      0.0373
## 3               70        0.675  0.931   1   -3.432    0      0.0742
## 4               60        0.718  0.930   7   -3.778    1      0.1020
## 5               69        0.650  0.833   1   -4.672    1      0.0359
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
par(mfrow=c(1,1))
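Adding the response variable back to the numeric-only data frame could look like the sketch below. It assumes a toy data frame in place of the real one, and the names `toy_num` and `toy_model` are illustrative:

```r
# Toy stand-in for the songs table (assumption)
toy <- data.frame(
  track_id       = c("a1", "b2", "c3"),
  playlist_genre = c("pop", "rock", "rap"),
  energy         = c(0.916, 0.815, 0.931),
  key            = c(6L, 11L, 1L),
  stringsAsFactors = FALSE
)

# Logical flag per column: TRUE where the column is numeric
num_col <- vapply(toy, is.numeric, logical(1))

# Keep numeric columns only; drop = FALSE keeps a data frame
# even if a single column is selected
toy_num <- toy[, num_col, drop = FALSE]

# Re-attach the response as a factor, ready for classification models
toy_model <- cbind(toy_num, playlist_genre = factor(toy$playlist_genre))
str(toy_model)
```

Storing the genre as a factor matters because most R classifiers decide between classification and regression from the type of the response.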

3.4 Displaying data

corrplot(cor(spotify_num_col), method = 'square', order = 'FPC', type = 'lower', diag = FALSE) #Correlation and Pairwise Graphs

A correlation of 0.67 was observed between loudness and energy. Hence, if required during modelling, we may drop one of these two features from further steps.
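Strongly correlated pairs can also be listed directly from the correlation matrix instead of being read off the plot. A sketch under stated assumptions: the data are simulated so that `energy` and `loudness` are related, and the 0.6 cut-off is an arbitrary illustrative threshold:

```r
set.seed(1)

# Simulated features (assumption): loudness depends on energy, tempo does not
energy <- runif(200)
toy <- data.frame(
  energy   = energy,
  loudness = -20 + 15 * energy + rnorm(200, sd = 3),
  tempo    = rnorm(200, mean = 120, sd = 25)
)

cor_mat   <- cor(toy)
threshold <- 0.6  # illustrative cut-off

# Look only above the diagonal so each pair is reported once
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
high_pairs
```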

• Analyzing the distribution of data in all columns using histogram and other plots

par(mfrow = c(2, 2))

hist(spotify_songs_data$danceability)
hist(spotify_songs_data$energy)
hist(spotify_songs_data$key)
hist(spotify_songs_data$loudness)

par(mfrow = c(2, 2))
hist(spotify_songs_data$mode)
hist(spotify_songs_data$speechiness)
hist(spotify_songs_data$acousticness)
hist(spotify_songs_data$instrumentalness)

par(mfrow = c(2, 2))
hist(spotify_songs_data$liveness)
hist(spotify_songs_data$valence)
hist(spotify_songs_data$tempo)
hist(spotify_songs_data$duration_ms)
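The repeated `hist()` calls can be written as a loop, which also gives each panel a readable title. A minimal sketch, assuming simulated columns in place of the real data:

```r
set.seed(7)

# Simulated stand-ins for four of the numeric columns (assumption)
toy <- data.frame(
  danceability = rbeta(500, 5, 3),
  energy       = rbeta(500, 6, 3),
  tempo        = rnorm(500, mean = 120, sd = 25),
  duration_ms  = rlnorm(500, meanlog = log(220000), sdlog = 0.25)
)

# 2 x 2 grid; par() returns the previous settings so they can be restored
old_par <- par(mfrow = c(2, 2))

# One histogram per column, titled with the column name
hist_list <- lapply(names(toy), function(col) {
  hist(toy[[col]], main = col, xlab = col)
})

par(old_par)
```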

3.5 Below is a description of all the columns in the dataset

| variable | class | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |

Proposed Exploratory Data Analysis

4.1 We will be performing EDA (Exploratory Data Analysis) on each column of the dataset to analyze the distribution of data. This will include observing the range, mean, median, and quantiles for each column. This analysis will give us a good estimate of how the data is distributed, and whether it involves any skewness. We will also be observing the relation between all pairs of numerical data columns to see if any variables are related in any way. If there is any column that contains data which can be split further to give additional insights, we will be performing the splitting as well.
4.2 We will be using the following types of plots for analysis
• Histogram – For analyzing the distribution frequency of all variables
• Scatterplots – For analyzing the relationship between pairs of variables
• Boxplots – For identifying any outliers that might be skewing the data
• Correlation matrix – For numerically identifying the linear correlation between all pairs of variables
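Two of the plot types listed above, the boxplot and the scatterplot, can be sketched with base graphics as follows. The data frame is simulated (an assumption), with three genres and two related features, purely to make the example runnable:

```r
set.seed(3)

# Simulated data (assumption): three genres with slightly different energy
toy <- data.frame(
  playlist_genre = rep(c("edm", "pop", "rock"), each = 50),
  energy         = c(rbeta(50, 8, 2), rbeta(50, 6, 3), rbeta(50, 7, 3))
)
toy$loudness <- -15 + 10 * toy$energy + rnorm(150, sd = 2)

old_par <- par(mfrow = c(1, 2))

# Boxplot: one box per genre; points beyond the whiskers are potential outliers
bp <- boxplot(energy ~ playlist_genre, data = toy,
              main = "Energy by genre", xlab = "Genre", ylab = "Energy")

# Scatterplot: relationship between a pair of numeric variables
plot(toy$loudness, toy$energy,
     main = "Energy vs loudness", xlab = "Loudness (dB)", ylab = "Energy")

par(old_par)

# Values drawn as outliers in the boxplot (may be empty)
bp$out
```

Splitting the boxplot by genre is useful here because a feature whose boxes barely overlap across genres is likely to be a helpful predictor.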
4.3 We are currently not familiar with the packages and coding syntax for the Machine Learning algorithms that need to be applied for predicting song genres. We also need to learn how to generate accuracy scores and other metrics, and how to plot them.
4.4 We plan on using Linear Regression for our analysis. By fitting a regression line on the scatterplots, we will get a good idea of the trend in the data. This should give a reasonable first approximation for identifying the genre of the songs based on a single variable. When multiple variables are combined, the accuracy of the prediction should increase further.
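Fitting a regression line on a scatterplot can be sketched with `lm()` and `abline()`. Because linear regression models a numeric response, this sketch relates two numeric features rather than the genre itself. The data are simulated and the coefficients used to generate them are arbitrary assumptions:

```r
set.seed(11)

# Simulated features (assumption): energy rises roughly linearly with loudness
toy <- data.frame(loudness = runif(100, min = -20, max = 0))
toy$energy <- 0.9 + 0.03 * toy$loudness + rnorm(100, sd = 0.05)

# Simple linear regression of energy on loudness
fit <- lm(energy ~ loudness, data = toy)
coef(fit)

# Scatterplot with the fitted line overlaid
plot(toy$loudness, toy$energy,
     xlab = "Loudness (dB)", ylab = "Energy",
     main = "Energy vs loudness with fitted line")
abline(fit, col = "red", lwd = 2)

# Share of variance in energy explained by loudness
r_squared <- summary(fit)$r.squared
r_squared
```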