Spotify is the most popular audio streaming service in the world. The app hosts millions of tracks, which can be browsed by parameters such as artist, album, or genre.
In this project, we use the available data to understand which features determine the genre of a song and which characteristics are responsible for its popularity.
The packages which we are going to use in our analysis:
library(plotly) #Useful for creating interactive visualisations
library(tidyr) #Tidying data, i.e. converting into long form, etc.
library(ggplot2) #Used in the visualisation of the data
library(dplyr) #Used for data wrangling
library(rpart) #Has the functions which assist in building the decision tree
library(knitr) #Helps in the integration of R code into HTML
library(kableExtra) #Useful for construction of complex tables and customisation of styles
library(missForest) #Imputes missing values using random forests (built on the randomForest package)
library(DT) #Displaying data objects as tables on the HTML page
The Spotify data used in our analysis has been taken from this path: Spotify Data
The data has been made available via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.
The variables in the dataset and their description:
The dataset contains 32,833 observations of 23 variables.
First, we need to check whether any songs appear more than once. To do this, we look for duplicates in the track_id column and keep only the first occurrence of each track, which leaves 28,356 unique songs.
#Removing Duplicates
spotify_songs_unique = spotify_songs[!duplicated(spotify_songs$track_id),]
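Before dropping rows, it can help to count how many duplicates there are. The following is a minimal sketch on a toy data frame; `toy_songs` and its values are invented for illustration, and only the `track_id` column name comes from the dataset.

```r
# Toy stand-in for spotify_songs; the values are invented for illustration.
toy_songs <- data.frame(
  track_id   = c("a1", "b2", "a1", "c3", "b2"),
  track_name = c("Song A", "Song B", "Song A", "Song C", "Song B"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each track_id.
n_duplicates <- sum(duplicated(toy_songs$track_id))

# Keep only the first occurrence of each track_id, as in the step above.
toy_unique <- toy_songs[!duplicated(toy_songs$track_id), ]

n_duplicates      # number of rows that would be removed
nrow(toy_unique)  # number of unique tracks kept
```

Comparing `nrow()` before and after the filter gives the same count and is a quick sanity check on the real data.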
Now, we keep only those columns which will be useful in our analysis and in the building of the model. We drop the five columns at positions 1, 5, 6, 8 and 9 of the deduplicated data, which leaves 18 columns:
#Removing unnecessary columns
spotify_songs_final = spotify_songs_unique[-c(1,5,6,8,9)]
head(spotify_songs_final)
## # A tibble: 6 x 18
## track_name track_artist track_popularity track_album_rel~ playlist_genre
## <chr> <chr> <dbl> <chr> <chr>
## 1 I Don't C~ Ed Sheeran 66 2019-06-14 pop
## 2 Memories ~ Maroon 5 67 2019-12-13 pop
## 3 All the T~ Zara Larsson 70 2019-07-05 pop
## 4 Call You ~ The Chainsm~ 60 2019-07-19 pop
## 5 Someone Y~ Lewis Capal~ 69 2019-03-05 pop
## 6 Beautiful~ Ed Sheeran 67 2019-07-11 pop
## # ... with 13 more variables: playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>,
## # speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>
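Dropping columns by position breaks silently if the column order ever changes. The same step can be sketched by name, as below. The five names in `drop_cols` are an assumption about which columns sit at positions 1, 5, 6, 8 and 9 (only `track_id` is named in the text), and `toy_unique` is a made-up one-row stand-in for `spotify_songs_unique`.

```r
library(dplyr)

# One-row toy stand-in with a subset of columns. The five dropped names are
# an ASSUMPTION about what sits at positions 1, 5, 6, 8 and 9 in the source.
toy_unique <- tibble(
  track_id = "a1", track_name = "Song A", track_artist = "Artist A",
  track_popularity = 50, track_album_id = "x1", track_album_name = "Album A",
  track_album_release_date = "2019-01-01", playlist_name = "Mix",
  playlist_id = "p1", playlist_genre = "pop"
)

drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "playlist_name", "playlist_id")

# all_of() raises an error if a listed column is missing, which guards
# against typos and upstream renames.
toy_final <- toy_unique %>% select(-all_of(drop_cols))

names(toy_final)
```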
We now check for the missing values across all the columns in the dataset.
colSums(is.na(spotify_songs_final))
## track_name track_artist track_popularity
## 4 4 0
## track_album_release_date playlist_genre playlist_subgenre
## 0 0 0
## danceability energy key
## 0 0 0
## loudness mode speechiness
## 0 0 0
## acousticness instrumentalness liveness
## 0 0 0
## valence tempo duration_ms
## 0 0 0
There are four missing values each in track_name and track_artist. Since very few values are missing and these two columns are not used in model building, we proceed without deleting any records.
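To see which records are affected rather than only how many, the incomplete rows can be pulled out directly. This is a small sketch on invented data; `toy_final` is a hypothetical stand-in for `spotify_songs_final`.

```r
# Toy stand-in; the values are invented for illustration.
toy_final <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_artist     = c("Artist A", NA, "Artist C"),
  track_popularity = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE for rows with no NA in the chosen columns,
# so negating it marks the rows with a missing name or artist.
incomplete <- !complete.cases(toy_final[, c("track_name", "track_artist")])

toy_final[incomplete, ]  # the affected records
sum(incomplete)          # how many there are
```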
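Now, we look at some rows from the final cleaned dataset. Because top_n(100) is called without a wt argument, it ranks by the last column, so the rows shown are the 100 tracks with the largest duration_ms: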
spotify_songs_final %>% top_n(100)
## Selecting by duration_ms
## # A tibble: 100 x 18
## track_name track_artist track_popularity track_album_rel~ playlist_genre
## <chr> <chr> <dbl> <chr> <chr>
## 1 Mirrors Justin Timb~ 77 2013-03-15 pop
## 2 Bailando ~ Chela 31 2011-07-06 pop
## 3 Bring It ~ Geto Boys 31 1993-03-09 rap
## 4 Tonight I~ Betty Wright 41 2002-07-02 rap
## 5 Sixteen Rick Ross 0 2012-01-01 rap
## 6 Fat Frees~ Fat Pat 3 2012-11-27 rap
## 7 Still In ~ Shuya Okino 0 2016-03-04 rock
## 8 Al Andalu~ Miguel Rios 0 2005-01-01 rock
## 9 Dancing W~ Genesis 48 1973-10-12 rock
## 10 Killer Van Der Gra~ 33 1986-01-01 rock
## # ... with 90 more rows, and 13 more variables: playlist_subgenre <chr>,
## # danceability <dbl>, energy <dbl>, key <dbl>, loudness <dbl>,
## # mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
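The "Selecting by duration_ms" message shows that `top_n()` fell back to the last column as its ranking variable. A sketch of more explicit alternatives follows, assuming dplyr 1.0.0 or later (where `slice_max()` and `slice_sample()` are available); `toy_final` is an invented stand-in for the real data.

```r
library(dplyr)

# Toy stand-in; the values are invented for illustration.
toy_final <- tibble(
  track_name  = c("Short", "Medium", "Long", "Longest"),
  duration_ms = c(120000, 200000, 300000, 400000)
)

# Explicit ranking column: the two longest tracks.
longest <- toy_final %>% slice_max(duration_ms, n = 2)

# The first rows in stored order, with no ranking at all.
first_rows <- toy_final %>% head(2)

# A random sample of rows (the seed makes the draw reproducible).
set.seed(1)
random_rows <- toy_final %>% slice_sample(n = 2)
```

Which variant fits depends on the intent: `slice_max()` for the longest tracks, `head()` or `slice_sample()` for a general preview.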
datatable(spotify_songs_final, filter = 'top', options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
We aim to visualise our data using a mixture of plots such as:
For now, we will look at the individual statistics of each variable:
str(spotify_songs_final)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28356 obs. of 18 variables:
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : num 194754 162600 176616 169093 189052 ...
summary(spotify_songs_final)
## track_name track_artist track_popularity
## Length:28356 Length:28356 Min. : 0.00
## Class :character Class :character 1st Qu.: 21.00
## Mode :character Mode :character Median : 42.00
## Mean : 39.33
## 3rd Qu.: 58.00
## Max. :100.00
## track_album_release_date playlist_genre playlist_subgenre
## Length:28356 Length:28356 Length:28356
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5610 1st Qu.:0.579000 1st Qu.: 2.000 1st Qu.: -8.309
## Median :0.6700 Median :0.722000 Median : 6.000 Median : -6.261
## Mean :0.6534 Mean :0.698388 Mean : 5.368 Mean : -6.818
## 3rd Qu.:0.7600 3rd Qu.:0.843000 3rd Qu.: 9.000 3rd Qu.: -4.709
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.01438 1st Qu.:0.0000000
## Median :1.0000 Median :0.0626 Median :0.07970 Median :0.0000206
## Mean :0.5655 Mean :0.1080 Mean :0.17718 Mean :0.0911168
## 3rd Qu.:1.0000 3rd Qu.:0.1330 3rd Qu.:0.26000 3rd Qu.:0.0065700
## Max. :1.0000 Max. :0.9180 Max. :0.99400 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0926 1st Qu.:0.3290 1st Qu.: 99.97 1st Qu.:187742
## Median :0.1270 Median :0.5120 Median :121.99 Median :216933
## Mean :0.1910 Mean :0.5104 Mean :120.96 Mean :226576
## 3rd Qu.:0.2490 3rd Qu.:0.6950 3rd Qu.:134.00 3rd Qu.:254975
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
We now check the number of songs in each genre:
spotify_songs_final %>% count(playlist_genre) %>% knitr::kable()
| playlist_genre | n |
|---|---|
| edm | 4877 |
| latin | 4137 |
| pop | 5132 |
| r&b | 4504 |
| rap | 5401 |
| rock | 4305 |
The dataset is fairly balanced across genres: each genre has between 4,137 (latin) and 5,401 (rap) songs.
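The balance across genres is easier to judge as proportions. The sketch below re-enters the counts from the table above by hand (on the real data the same column could be derived from the `count()` result) and adds a hypothetical `share` column. The shares come out at roughly 15% to 19% per genre.

```r
# Counts copied from the table above.
genre_counts <- data.frame(
  playlist_genre = c("edm", "latin", "pop", "r&b", "rap", "rock"),
  n = c(4877, 4137, 5132, 4504, 5401, 4305),
  stringsAsFactors = FALSE
)

# Share of each genre in the cleaned dataset, rounded to three decimals.
genre_counts$share <- round(genre_counts$n / sum(genre_counts$n), 3)

# Largest genre first.
genre_counts[order(-genre_counts$n), ]
```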
Going forward, we will compare the features across genres, check for correlation among the features and among the genres, and analyse all features within each individual genre.
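As a rough sketch of what the planned feature-correlation check could look like, the example below computes a correlation matrix with base R's `cor()`. The data are simulated: `toy_features` is hypothetical, and the link between energy and loudness is built into the simulation, so it is not a finding about the Spotify data.

```r
set.seed(42)
n <- 200
energy <- runif(n)

# Simulated features. loudness is constructed to track energy; valence is
# independent noise. None of these values come from the real dataset.
toy_features <- data.frame(
  energy   = energy,
  loudness = -20 + 15 * energy + rnorm(n, sd = 2),
  valence  = runif(n)
)

# Pairwise Pearson correlations between the numeric columns.
cor_matrix <- round(cor(toy_features), 2)
cor_matrix
```

On the real data, the same call would be applied to the numeric audio-feature columns only, since `cor()` does not accept character columns.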
We will try out decision tree and random forest algorithms on our data. Based on the findings from the planned EDA, we will see whether splitting the dataset by genre yields better models. After this, we plan to develop an interactive dashboard using R Shiny.
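To illustrate the planned decision-tree step, here is a minimal sketch using `rpart` on simulated data. The genre means, the 80/20 split, and the `toy_songs` object are all assumptions made for the example; this is not the model the project will end up with.

```r
library(rpart)

set.seed(123)
n_per_genre <- 150

# Simulated songs for three genres; the feature means are invented so that
# the classes are separable and do not reflect the real data.
toy_songs <- data.frame(
  playlist_genre = factor(rep(c("edm", "rap", "rock"), each = n_per_genre)),
  energy = c(rnorm(n_per_genre, 0.85, 0.08),
             rnorm(n_per_genre, 0.65, 0.08),
             rnorm(n_per_genre, 0.60, 0.08)),
  speechiness = c(rnorm(n_per_genre, 0.05, 0.03),
                  rnorm(n_per_genre, 0.25, 0.03),
                  rnorm(n_per_genre, 0.05, 0.03))
)

# Hold out 20% of the rows for evaluation.
train_idx <- sample(nrow(toy_songs), size = 0.8 * nrow(toy_songs))
train <- toy_songs[train_idx, ]
test  <- toy_songs[-train_idx, ]

# Classification tree predicting genre from the audio features.
tree_fit <- rpart(playlist_genre ~ ., data = train, method = "class")

# Predicted classes and accuracy on the held-out rows.
pred <- predict(tree_fit, newdata = test, type = "class")
accuracy <- mean(pred == test$playlist_genre)

table(predicted = pred, actual = test$playlist_genre)
```

Evaluating on held-out rows rather than the training set gives a fairer estimate of how well the tree separates genres.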