Spotify Data Analysis
Introduction
1.1 We are using the Spotify dataset, which contains audio features for a wide variety of songs, each labelled with a genre. Spotify users are typically interested in recommendations within the genres they listen to most often, and because Spotify is concerned with improving the user experience, better recommendations should help improve customer satisfaction. Hence, through this project we are trying to predict the genre of a song based on its audio features.
1.2 To predict the genre, we will be using classification models such as Decision Tree, K-Nearest Neighbors, Random Forest and XGBoost. We will then be comparing the accuracy metrics for each of the models and going ahead with the model that gives the best prediction based on the available data.
1.3 The approach to solve this problem will be as follows:
• Performing EDA on the entire dataset to analyze the data distribution
• Cleaning the dataset by changing the datatypes of columns if required, checking and imputing null values, and correcting/formatting any values if required
• Analyzing correlation between features and performing feature reduction
• Building the classification model to predict the genre of the songs
• Predicting the genres using the model and analyzing accuracy scores and other metrics.
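The modelling steps listed above can be sketched on a small scale as follows. This is only an illustration under stated assumptions: the data frame `toy` is simulated (two made-up genres separated mainly by energy), and `rpart` is used as one possible decision tree implementation. It is not among the packages loaded in this report.

```r
# Toy stand-in for the Spotify data: simulated values, not real songs.
library(rpart)  # decision trees; assumed to be installed (ships with most R setups)

set.seed(42)
n <- 200
toy <- data.frame(
  genre        = factor(rep(c("edm", "rock"), each = n / 2)),
  energy       = c(rnorm(n / 2, 0.85, 0.05), rnorm(n / 2, 0.60, 0.05)),
  acousticness = c(rnorm(n / 2, 0.05, 0.02), rnorm(n / 2, 0.30, 0.05))
)

# Hold out 30% of the rows for testing
test_idx <- sample(seq_len(n), size = round(0.3 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit a classification tree on the training rows and predict the held-out genres
fit  <- rpart(genre ~ energy + acousticness, data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Accuracy = share of held-out songs whose genre was predicted correctly
accuracy <- mean(pred == test$genre)
accuracy
```

The same split can be reused for every candidate model so that their accuracy scores are compared on identical held-out songs.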
1.4 Through this genre prediction model, Spotify can classify songs of all categories easily and suggest to its customers the kinds of songs they like to listen to. And as customers get better song recommendations, they will be more likely to purchase the premium membership offered by Spotify. Hence a satisfied customer will ultimately result in increased revenue for Spotify.
Packages Required
library(ggplot2) #for Plotting
library(dplyr) #for wrangling with dataframe
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot) #for plotting correlation between variables
## corrplot 0.90 loaded
Data Preparation
3.1 The data was collected from the following URL - [Github-TidyTuesday](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-21)
3.2 The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API. Make sure to check out the spotifyr package website to see how you can collect your own data! Spotifyr is an R wrapper for pulling track audio features and other information from Spotify’s Web API in bulk. By automatically batching API requests, it allows you to enter an artist’s name and retrieve their entire discography in seconds, along with Spotify’s audio features and track/album popularity metrics. You can also pull song and playlist information for a given Spotify User (including yourself!).
3.3 We will be performing the following data importing and cleaning steps on the dataset
• Reading the file from the designated URL
spotify_songs_data <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
• Checking the number of rows and columns in the dataset
dim(spotify_songs_data)
## [1] 32833 23
• Renaming column names for any column if required
colnames(spotify_songs_data) #not required
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
• Getting to know the data
Ensuring all numerical datatype columns have only numeric data and no character data
str(spotify_songs_data)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
The datatypes look fine, so no conversion is required.
head(spotify_songs_data, 4)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
summary(spotify_songs_data)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
We can see that there is a clear outlier in duration_ms: the minimum is 4000 ms, i.e. a song only 4 seconds long. This is not plausible for a real track, so we simply remove that row.
spotify_songs_data <- filter(spotify_songs_data, duration_ms > 4000)
• Now, we are looking for the representation of different Genres in our sample dataset:
summary_data <- spotify_songs_data %>%
  group_by(playlist_genre) %>%
  summarise(Count = n())
summary_data
## # A tibble: 6 × 2
## playlist_genre Count
## <chr> <int>
## 1 edm 6043
## 2 latin 5155
## 3 pop 5507
## 4 r&b 5431
## 5 rap 5746
## 6 rock 4950
So, the sample set seems to be reasonably balanced, as it is almost uniformly distributed among all 6 genres.
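To put a number on the balance, the counts from the table above can be turned into shares. The sketch below copies the counts by hand into a named vector (`genre_counts` is a name introduced here for illustration) rather than recomputing them from the data.

```r
# Genre counts copied from the summary table above
genre_counts <- c(edm = 6043, latin = 5155, pop = 5507,
                  `r&b` = 5431, rap = 5746, rock = 4950)

# Share of each genre in the sample
genre_share <- round(genre_counts / sum(genre_counts), 3)
genre_share

# Ratio of the largest to the smallest class; values near 1 indicate balance
imbalance_ratio <- max(genre_counts) / min(genre_counts)
imbalance_ratio
```

The shares range from roughly 15% (rock) to 18% (edm), so the largest class is only about 1.2 times the size of the smallest.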
• Checking null values and then either imputing the null values with appropriate values, or removing them from the dataset
sum(is.na(spotify_songs_data))
## [1] 15
Since we have only 15 null values, we drop the rows that contain them
spotify_songs_data <- na.omit(spotify_songs_data)
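Before dropping anything, it can help to see where the missing values sit. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration). It shows per-column counts, the number of affected rows, and that `na.omit()` removes rows while keeping every column.

```r
# Toy data frame with deliberately missing values (hypothetical, not the real data)
toy <- data.frame(
  track_name   = c("a", NA, "c", "d"),
  track_artist = c("x", NA, "z", "w"),
  energy       = c(0.9, 0.8, 0.7, 0.6),
  stringsAsFactors = FALSE
)

# Missing values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy))
rows_with_na

# na.omit() drops those rows; all columns are kept
cleaned <- na.omit(toy)
dim(cleaned)
```

Note that `sum(is.na(...))` counts missing cells, so several missing cells can belong to the same row.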
• Since we have track_id, we can remove the other identifying columns: track_name, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name, playlist_id and playlist_subgenre. This leaves 13 numeric columns (track_popularity plus 12 audio features) as candidate independent variables, and 1 dependent variable (playlist_genre). Let us create another dataframe with only the numerical columns for analysis; we can add the response variable back to it later.
num_col <- unlist(lapply(spotify_songs_data, is.numeric))
spotify_num_col <- spotify_songs_data[, num_col]
head(spotify_num_col, 5)
## track_popularity danceability energy key loudness mode speechiness
## 1 66 0.748 0.916 6 -2.634 1 0.0583
## 2 67 0.726 0.815 11 -4.969 1 0.0373
## 3 70 0.675 0.931 1 -3.432 0 0.0742
## 4 60 0.718 0.930 7 -3.778 1 0.1020
## 5 69 0.650 0.833 1 -4.672 1 0.0359
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
par(mfrow=c(1,1))
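Adding the response back to the numeric columns could look roughly like the sketch below. It uses a hypothetical three-row data frame (`toy`) instead of the real data, and converting the genre to a factor is an assumption about what the later classifiers will expect.

```r
# Hypothetical miniature of the cleaned data (values are illustrative only)
toy <- data.frame(
  track_id       = c("t1", "t2", "t3"),
  playlist_genre = c("pop", "rock", "pop"),
  energy         = c(0.9, 0.7, 0.8),
  tempo          = c(122, 100, 124),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, as done above for the full dataset
num_col <- vapply(toy, is.numeric, logical(1))
toy_num <- toy[, num_col]

# Re-attach the response as a factor so that it is treated as a class label
toy_model <- cbind(toy_num, playlist_genre = factor(toy$playlist_genre))
str(toy_model)
```

Because no rows are reordered between the two steps, the genre labels stay aligned with their feature rows.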
3.4 Displaying data
corrplot(cor(spotify_num_col), method = 'square', order = 'FPC', type = 'lower', diag = FALSE) #Correlation and Pairwise Graphs
Between loudness and energy, a correlation of 0.67 was observed. Hence, if required during modelling, we may drop one of the two from further steps.
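Strongly correlated pairs can also be listed numerically instead of being read off the plot. The sketch below uses simulated columns (`toy_num` is made up, with loudness constructed to depend on energy), and the 0.6 cut-off is an arbitrary illustrative threshold.

```r
# Simulated numeric features (not the real data)
set.seed(1)
n <- 500
energy <- runif(n)
toy_num <- data.frame(
  energy   = energy,
  loudness = -10 + 8 * energy + rnorm(n, sd = 1.5),  # built to correlate with energy
  tempo    = rnorm(n, mean = 120, sd = 25)            # unrelated to the others
)

cor_mat <- cor(toy_num)

# Keep each pair once by blanking the diagonal and the upper triangle
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Reshape the matrix into one row per pair and filter on absolute correlation
pairs_df <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "correlation")
high_pairs <- subset(pairs_df, !is.na(correlation) & abs(correlation) > 0.6)
high_pairs
```

Applied to the real numeric columns, the same filter would give the candidate pairs from which one variable might be dropped.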
• Analyzing the distribution of data in all columns using histogram and other plots
par(mfrow = c(2, 2))
hist(spotify_songs_data$danceability)
hist(spotify_songs_data$energy)
hist(spotify_songs_data$key)
hist(spotify_songs_data$loudness)
par(mfrow = c(2, 2))
hist(spotify_songs_data$mode)
hist(spotify_songs_data$speechiness)
hist(spotify_songs_data$acousticness)
hist(spotify_songs_data$instrumentalness)
par(mfrow = c(2, 2))
hist(spotify_songs_data$liveness)
hist(spotify_songs_data$valence)
hist(spotify_songs_data$tempo)
hist(spotify_songs_data$duration_ms)
3.5 Below is a description of all the columns in the dataset
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
track_album_id | character | Album unique ID |
track_album_name | character | Song album name |
track_album_release_date | character | Date when album released |
playlist_name | character | Name of playlist |
playlist_id | character | Playlist ID |
playlist_genre | character | Playlist genre |
playlist_subgenre | character | Playlist subgenre |
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms | double | Duration of song in milliseconds |
Proposed Exploratory Data Analysis
4.1 We will be performing EDA (Exploratory Data Analysis) on each column of the dataset to analyze the distribution of data. This will include observing the range, mean, median, and quantiles for each column. This analysis will give us a good estimate of how the data is distributed, and whether it involves any skewness. We will also be observing the relation between all pairs of numerical data columns to see if any variables are related in any way. If there is any column that contains data which can be split further to give additional insights, we will be performing the splitting as well.
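A per-column summary of this kind could be produced along the following lines. The sketch uses simulated columns (`toy_num` is made up), and `describe_column` is a hypothetical helper written for this example. Skewness is computed here as the third standardized moment, which is one common definition.

```r
# Simulated numeric features (not the real data)
set.seed(7)
toy_num <- data.frame(
  valence     = runif(300),           # roughly symmetric
  speechiness = rexp(300, rate = 10)  # right-skewed
)

# Hypothetical helper: range, quartiles, mean and a simple skewness estimate
describe_column <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(min = min(x), q1 = q[1], median = q[2], mean = mean(x),
    q3 = q[3], max = max(x),
    skewness = mean((x - mean(x))^3) / sd(x)^3)
}

# One row per column, one statistic per field
eda_summary <- t(sapply(toy_num, describe_column))
round(eda_summary, 3)
```

A mean well above the median, together with a clearly positive skewness, points to a right-skewed column.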
4.2 We will be using the following types of plots for analysis
• Histogram – For analyzing the distribution frequency of all variables
• Scatterplots – For analyzing the relationship between pairs of variables
• Boxplots – For identifying any outliers that might be skewing the data
• Correlation matrix – For numerically identifying the linear correlation between all pairs of variables
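The boxplot criterion for outliers can also be applied without drawing the plot. The sketch below uses the first ten `duration_ms` values printed earlier plus the 4000 ms minimum. The variable names are introduced here for illustration.

```r
# First ten duration_ms values shown by str() above, plus the 4000 ms minimum
duration_ms <- c(194754, 162600, 176616, 169093, 189052,
                 163049, 187675, 207619, 193187, 253040, 4000)

# boxplot.stats() flags points lying more than 1.5 times the
# box height (interquartile spread) beyond the box edges
outliers <- boxplot.stats(duration_ms)$out
outliers

# Values kept after removing the flagged points
kept <- duration_ms[!duration_ms %in% outliers]
length(kept)
```

The rule is purely statistical, so it may also flag legitimately long tracks. Flagged values are best inspected before they are dropped.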
4.3 We are currently not familiar with the packages and coding syntax for the Machine Learning algorithms that need to be applied for predicting song genres. Also, how to generate accuracy scores and other metrics, and how to plot these metrics, is something that we need to learn.
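As a starting point for the metrics, accuracy and per-genre precision and recall can be derived from a confusion matrix using base R only. The labels below are hypothetical and exist purely to illustrate the arithmetic.

```r
# Hypothetical true and predicted genres for ten songs (illustrative only)
genres <- c("pop", "rap", "rock")
actual <- factor(c("pop", "pop", "rock", "rock", "rap",
                   "rap", "pop", "rock", "rap", "pop"), levels = genres)
predicted <- factor(c("pop", "rock", "rock", "rock", "rap",
                      "pop", "pop", "rock", "rap", "pop"), levels = genres)

# Rows are predictions, columns are the true genres
conf_mat <- table(Predicted = predicted, Actual = actual)
conf_mat

# Accuracy: correctly classified songs (the diagonal) over all songs
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Recall: share of each true genre that was found
recall <- diag(conf_mat) / colSums(conf_mat)

# Precision: share of each predicted genre that was right
precision <- diag(conf_mat) / rowSums(conf_mat)

accuracy
round(rbind(precision, recall), 2)
```

With six genres, looking at precision and recall per genre shows which genres are confused with each other, which a single accuracy score hides.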
4.4 We plan on using Linear Regression for our analysis. By fitting a regression line on the scatterplots, we will get a good idea of the trend in the data. It will be a good approximation for identifying the genre of the songs based on a single variable. When multiple variables are combined, the accuracy of the prediction should increase further.