** What is Spotify **
Spotify is the most popular audio streaming service across the world. There are millions of tracks on the app which can be browsed by different parameters such as artist, album, genre.
** Problem Statement **
In this project, we analyse the various features affecting the popularity of a song and also predict the genre of songs. In the process we will be answering:
- What is the correlation between the different genres and the audio features?
- Who are the most popular artists?
- Which is the most popular genre?
- Which are the most popular tracks in the dataset?
- Which features affect the popularity of a song, and how do they affect it?
This project aims to help the consumer estimate and explain the popularity and genre of a song from its audio features, such as danceability and valence.
The packages which we are going to use in our analysis:
library(plotly) #Useful for creating interactive visualisations
library(tidyr) #Tidying data, i.e. converting into long form, etc.
library(ggplot2) #Used in the visualisation of the data
library(dplyr) #Used for data wrangling
library(rpart) #Has the functions which assist in building the decision tree
library(knitr) #Helps in the integration of R code into HTML
library(kableExtra) #Useful for construction of complex tables and customisation of styles
library(caret) #For splitting the data into training and testing
library(DT) #Displaying data objects as tables on the HTML page
library(funModeling) #For creating visualisations
library(corrplot) #Helps in plotting the correlation of numerical values in Data
library(randomForest) #Useful for performing the RandomForest algorithm
library(e1071) #Useful for running the SVM algorithm
The spotify data being used for our analysis has been taken from this path: Spotify Data
The data has been made available via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.
The variables in the dataset and their description:
The dataset contains 32,833 observations of 23 variables.
From the summary of the dataset we observe that loudness takes negative values (it is measured in decibels) and that mode and key are stored as numbers even though they are categorical. We will clean these columns; in particular, mode and key are converted to factors.
We need to check whether any songs appear more than once. To do this, we look for duplicate values in the track_id column.
#Removing Duplicates
spotify_songs_unique = spotify_songs[!duplicated(spotify_songs$track_id),]
dim(spotify_songs_unique)
## [1] 28350 23
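As a small illustration of the de-duplication step, the sketch below applies the same `duplicated()` filter to a toy data frame (the rows are made up for illustration) and shows what should be an equivalent `dplyr::distinct()` call. Both keep the first occurrence of each `track_id`.

```r
library(dplyr)

# Toy data: track "a1" appears in two playlists (hypothetical values)
toy <- data.frame(
  track_id       = c("a1", "b2", "a1", "c3"),
  playlist_genre = c("pop", "rap", "edm", "rock"),
  stringsAsFactors = FALSE
)

# Base R: keep rows whose track_id has not been seen before
toy_unique_base <- toy[!duplicated(toy$track_id), ]

# dplyr equivalent: first row per track_id, all columns kept
toy_unique_dplyr <- distinct(toy, track_id, .keep_all = TRUE)

dim(toy_unique_base)
```

Because only the first occurrence is kept, a track that appears in several playlists retains only the genre of the first playlist it was listed under.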
Now, we select only those columns which will be useful in our analysis and in the building of the model. We will go ahead and drop the following columns:
#Removing unnecessary columns
spotify_songs_final = spotify_songs_unique[-c(1,5,6,8,9)]
head(spotify_songs_final)
## # A tibble: 6 x 18
## track_name track_artist track_popularity track_album_rel~ playlist_genre
## <chr> <chr> <dbl> <chr> <chr>
## 1 I Don't C~ Ed Sheeran 66 2019-06-14 pop
## 2 Memories ~ Maroon 5 67 2019-12-13 pop
## 3 All the T~ Zara Larsson 70 2019-07-05 pop
## 4 Call You ~ The Chainsm~ 60 2019-07-19 pop
## 5 Someone Y~ Lewis Capal~ 69 2019-03-05 pop
## 6 Beautiful~ Ed Sheeran 67 2019-07-11 pop
## # ... with 13 more variables: playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <fct>, loudness <dbl>, mode <fct>,
## # speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>
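Dropping columns by position is fragile if the column order ever changes. Below is a minimal sketch of the same step done by name. It assumes the five dropped columns are the ID and album/playlist name fields (`track_id`, `track_album_id`, `track_album_name`, `playlist_name`, `playlist_id`); these names are an assumption inferred from the 18 columns that remain, and a toy data frame stands in for the real data.

```r
library(dplyr)

# Toy stand-in for spotify_songs_unique (values are made up)
toy <- data.frame(
  track_id = "a1", track_name = "Song", track_album_id = "x9",
  track_album_name = "Album", playlist_name = "Mix", playlist_id = "p7",
  track_popularity = 66
)

# Assumed names of the columns to drop
drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "playlist_name", "playlist_id")

# any_of() ignores names that are not present instead of raising an error
toy_final <- select(toy, -any_of(drop_cols))
names(toy_final)
```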
We now check for the missing values across all the columns in the dataset.
colSums(is.na(spotify_songs_final))
## track_name track_artist track_popularity
## 4 4 0
## track_album_release_date playlist_genre playlist_subgenre
## 0 0 0
## danceability energy key
## 0 0 0
## loudness mode speechiness
## 0 0 0
## acousticness instrumentalness liveness
## 0 0 0
## valence tempo duration_ms
## 0 0 0
There are four missing values each in track_name and track_artist. We remove these rows from our dataset so that they do not become roadblocks when building the model.
spotify_songs_final = spotify_songs_final[!is.na(spotify_songs_final$track_name),]
colSums(is.na(spotify_songs_final))
## track_name track_artist track_popularity
## 0 0 0
## track_album_release_date playlist_genre playlist_subgenre
## 0 0 0
## danceability energy key
## 0 0 0
## loudness mode speechiness
## 0 0 0
## acousticness instrumentalness liveness
## 0 0 0
## valence tempo duration_ms
## 0 0 0
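The same filter can also be written with `tidyr::drop_na()`, which makes it explicit which columns are checked. A minimal sketch on made-up rows:

```r
library(tidyr)

toy <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist A", NA, "Artist C"),
  tempo        = c(120, 98, 101),
  stringsAsFactors = FALSE
)

# Count missing values per column, as in the check above
colSums(is.na(toy))

# Drop rows where either track_name or track_artist is missing
toy_clean <- drop_na(toy, track_name, track_artist)
nrow(toy_clean)
```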
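Now, we look at some of the rows from the final cleaned dataset. Note that top_n(100) without a ranking column ranks by the last column, so the rows shown are the 100 tracks with the largest duration_ms: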
spotify_songs_final %>% top_n(100)
## Selecting by duration_ms
## # A tibble: 100 x 18
## track_name track_artist track_popularity track_album_rel~ playlist_genre
## <chr> <chr> <dbl> <chr> <chr>
## 1 Mirrors Justin Timb~ 77 2013-03-15 pop
## 2 Bailando ~ Chela 31 2011-07-06 pop
## 3 Bring It ~ Geto Boys 31 1993-03-09 rap
## 4 Tonight I~ Betty Wright 41 2002-07-02 rap
## 5 Sixteen Rick Ross 0 2012-01-01 rap
## 6 Fat Frees~ Fat Pat 3 2012-11-27 rap
## 7 Still In ~ Shuya Okino 0 2016-03-04 rock
## 8 Al Andalu~ Miguel Rios 0 2005-01-01 rock
## 9 Dancing W~ Genesis 48 1973-10-12 rock
## 10 Killer Van Der Gra~ 33 1986-01-01 rock
## # ... with 90 more rows, and 13 more variables: playlist_subgenre <chr>,
## # danceability <dbl>, energy <dbl>, key <fct>, loudness <dbl>,
## # mode <fct>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
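The "Selecting by duration_ms" message above comes from `top_n()` falling back to the last column when no ranking column is given. The sketch below, on a made-up data frame, contrasts that behaviour with two other ways of previewing rows: `head()` for the first rows and `slice_sample()` for a random sample.

```r
library(dplyr)

toy <- data.frame(
  track_name  = c("A", "B", "C", "D"),
  duration_ms = c(200000, 350000, 180000, 400000),
  stringsAsFactors = FALSE
)

# Without a ranking column, top_n() ranks by the last column (duration_ms)
longest <- top_n(toy, 2)

# First rows in their original order
first_rows <- head(toy, 2)

# A random sample of rows (seed fixed only for reproducibility)
set.seed(1)
random_rows <- slice_sample(toy, n = 2)
```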
datatable(spotify_songs_final, filter = 'top', options = list(pageLength = 10))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
Now we will look at the individual statistics of each variable:
str(spotify_songs_final)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28346 obs. of 18 variables:
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : num 194754 162600 176616 169093 189052 ...
We aim to visualise our data using a mixture of plots such as:
First, we check the number of songs in each genre:
spotify_songs_final %>% count(playlist_genre) %>% knitr::kable()
| playlist_genre | n |
|---|---|
| edm | 4875 |
| latin | 4136 |
| pop | 5132 |
| r&b | 4504 |
| rap | 5395 |
| rock | 4304 |
It is clear that our dataset is fairly diversified, with a good number of songs from each genre.
From our dataset, we now identify the most popular tracks and their artists.
#Most popular in the dataset and their artists
Top_tracks <- spotify_songs_final %>%
arrange(desc(track_popularity))
head(Top_tracks)
## # A tibble: 6 x 18
## track_name track_artist track_popularity track_album_rel~ playlist_genre
## <chr> <chr> <dbl> <chr> <chr>
## 1 Dance Mon~ Tones and I 100 2019-10-17 pop
## 2 ROXANNE Arizona Zer~ 99 2019-10-10 latin
## 3 Tusa KAROL G 98 2019-11-07 pop
## 4 Memories Maroon 5 98 2019-09-20 pop
## 5 Blinding ~ The Weeknd 98 2019-11-29 pop
## 6 Circles Post Malone 98 2019-09-06 pop
## # ... with 13 more variables: playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <fct>, loudness <dbl>, mode <fct>,
## # speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>
Dance Monkey (100) and ROXANNE (99) are the two most popular tracks in our dataset, followed by Tusa, Memories, Blinding Lights and Circles, which are tied at 98.
We need to understand who the most popular artists are.
Most_popular_Artists <- spotify_songs_final %>%
group_by(track_artist) %>%
summarise(Mean_popularity = mean(track_popularity),Numberofsongs = n()) %>%
arrange(desc(Mean_popularity),desc(Numberofsongs))
head(Most_popular_Artists)
## # A tibble: 6 x 3
## track_artist Mean_popularity Numberofsongs
## <chr> <dbl> <int>
## 1 Trevor Daniel 97 1
## 2 Y2K 91 1
## 3 Don Toliver 87.5 2
## 4 Kina 85.5 2
## 5 JACKBOYS 84.3 3
## 6 Dadá Boladão 84 1
Trevor Daniel, Y2K, Don Toliver, Kina and JACKBOYS are the top 5 artists in terms of mean popularity. Note that each of them has only one to three songs in the dataset.
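Since the artists at the top of this ranking have very few tracks, their means rest on little data. One option, sketched here on made-up numbers, is to require a minimum number of tracks before ranking. The threshold of 3 is an arbitrary illustrative choice, not something used in the analysis above.

```r
library(dplyr)

toy <- data.frame(
  track_artist     = c("A", "B", "B", "B", "C", "C", "C", "C"),
  track_popularity = c(97, 80, 85, 90, 70, 75, 60, 65),
  stringsAsFactors = FALSE
)

min_songs <- 3  # hypothetical threshold

ranked <- toy %>%
  group_by(track_artist) %>%
  summarise(Mean_popularity = mean(track_popularity),
            Numberofsongs   = n()) %>%
  filter(Numberofsongs >= min_songs) %>%   # drop artists with too few tracks
  arrange(desc(Mean_popularity))

ranked
```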
Popular Genres
Most_popular_genres <- spotify_songs_final %>%
group_by(playlist_genre) %>%
summarise(Mean_popularity = mean(track_popularity),Numberofsongs = n()) %>%
arrange(desc(Mean_popularity),desc(Numberofsongs))
head(Most_popular_genres)
## # A tibble: 6 x 3
## playlist_genre Mean_popularity Numberofsongs
## <chr> <dbl> <int>
## 1 pop 45.9 5132
## 2 rap 41.9 5395
## 3 latin 41.4 4136
## 4 rock 39.7 4304
## 5 r&b 35.9 4504
## 6 edm 30.7 4875
Pop is the most popular genre, followed by Rap, Latin and Rock.
Box Plot of Track Popularity vs Playlist Genre:
#Boxplot For Track Popularity vs Genre
boxplot(track_popularity ~ playlist_genre, data = spotify_songs_final, ylab = "Popularity", xlab = "Genre")
Distribution Of Numerical Columns
From the above plots, we can infer that valence approximately follows a normal distribution. Speechiness, Acousticness, Instrumentalness, Tempo and Duration are right-skewed (most values sit at the low end, with a long tail to the right), whereas danceability and loudness are left-skewed (most values sit at the high end, with a long tail to the left).
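To make the direction of skew concrete, the sketch below computes a moment-based sample skewness for two small made-up vectors. A positive value indicates a long right tail; a negative value indicates a long left tail. The helper `skewness()` is defined here for illustration and is not part of the original analysis.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

right_tailed <- c(0.01, 0.02, 0.02, 0.03, 0.90)  # mass at low values
left_tailed  <- c(-30, -6, -5, -5, -4)           # mass at high values

skewness(right_tailed)  # positive
skewness(left_tailed)   # negative
```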
Correlation of numerical columns to Popularity
# Correlation of these numerical variables to Popularity
options(repr.plot.width = 20, repr.plot.height = 15)
spotify_final_sliced <- spotify_songs_final[, -c(1,2,4,5,6,9,11)]
corr <- cor(spotify_final_sliced)
num <- corrplot(corr, method = "ellipse", type = "upper", tl.srt = 45)
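Alongside the plot, it can help to read off the correlations with popularity directly. A minimal sketch on a made-up numeric data frame, pulling the `track_popularity` row out of the correlation matrix and ordering it by absolute strength:

```r
# Toy numeric data (made up)
toy <- data.frame(
  track_popularity = c(10, 20, 30, 40, 50),
  energy           = c(0.9, 0.7, 0.6, 0.4, 0.2),
  valence          = c(0.1, 0.3, 0.2, 0.5, 0.4)
)

corr <- cor(toy)

# Correlation of every other column with track_popularity
pop_corr <- corr["track_popularity", colnames(corr) != "track_popularity"]

# Strongest relationships first, regardless of sign
pop_corr[order(abs(pop_corr), decreasing = TRUE)]
```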
To predict the popularity of songs we also want to use the track artist, which is a text column. We therefore compute each artist's mean track popularity and use it to classify artists as high, medium or low popularity.
#Classifying the artists
spotify_cleaned1 <- spotify_songs_final %>%
group_by(track_artist) %>%
summarise(Mean_popularity = mean(track_popularity)) %>%
right_join(spotify_songs_final)
## Joining, by = "track_artist"
hist(spotify_cleaned1$Mean_popularity)
spotify_cleaned1 <- spotify_cleaned1 %>%
mutate(Popularity_factor = cut(x = spotify_cleaned1$Mean_popularity, breaks = c(0, 30, 60, 100),include.lowest=TRUE)) %>%
as.data.frame()
levels(spotify_cleaned1$Popularity_factor) <- c("low", "Medium" , "High")
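The binning done by `cut()` can be checked on a few boundary values. In this sketch the input vector is made up; the breaks and labels are the ones used above, passed through the `labels` argument instead of a separate `levels<-` call.

```r
artist_mean <- c(0, 30, 30.1, 60, 100)

# Intervals are [0,30], (30,60], (60,100]; include.lowest = TRUE closes the
# first interval on the left so a mean of exactly 0 does not become NA
bins <- cut(artist_mean,
            breaks = c(0, 30, 60, 100),
            labels = c("low", "Medium", "High"),
            include.lowest = TRUE)

data.frame(artist_mean, bins)
```

Values that fall exactly on a break (30 or 60) go to the lower bin, because the intervals are closed on the right by default.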
Prior to the model building we remove the columns which are unnecessary:
#Remove Unnecessary columns
head(spotify_cleaned1)
## track_artist Mean_popularity
## 1 Ed Sheeran 65.05882
## 2 Maroon 5 42.40909
## 3 Zara Larsson 53.06250
## 4 The Chainsmokers 49.22727
## 5 Lewis Capaldi 76.88889
## 6 Ed Sheeran 65.05882
## track_name track_popularity
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix 66
## 2 Memories - Dillon Francis Remix 67
## 3 All the Time - Don Diablo Remix 70
## 4 Call You Mine - Keanu Silva Remix 60
## 5 Someone You Loved - Future Humans Remix 69
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix 67
## track_album_release_date playlist_genre playlist_subgenre danceability
## 1 2019-06-14 pop dance pop 0.748
## 2 2019-12-13 pop dance pop 0.726
## 3 2019-07-05 pop dance pop 0.675
## 4 2019-07-19 pop dance pop 0.718
## 5 2019-03-05 pop dance pop 0.650
## 6 2019-07-11 pop dance pop 0.675
## energy key loudness mode speechiness acousticness instrumentalness
## 1 0.916 6 -2.634 1 0.0583 0.1020 0.00e+00
## 2 0.815 11 -4.969 1 0.0373 0.0724 4.21e-03
## 3 0.931 1 -3.432 0 0.0742 0.0794 2.33e-05
## 4 0.930 7 -3.778 1 0.1020 0.0287 9.43e-06
## 5 0.833 1 -4.672 1 0.0359 0.0803 0.00e+00
## 6 0.919 8 -5.385 1 0.1270 0.0799 0.00e+00
## liveness valence tempo duration_ms Popularity_factor
## 1 0.0653 0.518 122.036 194754 High
## 2 0.3570 0.693 99.972 162600 Medium
## 3 0.1100 0.613 124.008 176616 Medium
## 4 0.2040 0.277 121.956 169093 Medium
## 5 0.0833 0.725 123.976 189052 High
## 6 0.1430 0.585 124.982 163049 High
spotify_cleaned2 <- spotify_cleaned1[, -c(1,2,3,5,7)]
str(spotify_cleaned2)
## 'data.frame': 28346 obs. of 15 variables:
## $ track_popularity : num 66 67 70 60 69 67 62 69 68 67 ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : Factor w/ 12 levels "0","1","2","3",..: 7 12 2 8 2 9 6 5 9 3 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 1 2 2 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : num 194754 162600 176616 169093 189052 ...
## $ Popularity_factor: Factor w/ 3 levels "low","Medium",..: 3 2 2 2 3 3 2 1 2 2 ...
Methodology
Because the variables are not all on the same scale, we opt for Random Forest, which is insensitive to feature scaling. It is also robust to outliers. Using this model, we will understand the importance of the variables. Prior to building our model, we change the datatype of playlist_genre to factor.
#Convert playlist_genre to factor before splitting so train and test share the same levels
spotify_cleaned2$playlist_genre <- as.factor(spotify_cleaned2$playlist_genre)
train.index <- createDataPartition(spotify_cleaned2$Popularity_factor, p = .7, list = FALSE)
train_data <- spotify_cleaned2[ train.index,]
test_data <- spotify_cleaned2[-train.index,]
x_train <- train_data[,-1]
y_train <- train_data[,1]
spotify.rf<- randomForest(track_popularity~., data = train_data, importance=TRUE)
spotify.rf
##
## Call:
## randomForest(formula = track_popularity ~ ., data = train_data, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 336.2267
## % Var explained: 40.12
spotify.rf$importance
## %IncMSE IncNodePurity
## playlist_genre 13.292229244 423930.30
## danceability 0.033328943 507968.11
## energy 12.529168499 536838.18
## key -9.891497975 648815.81
## loudness 8.651718845 563202.74
## mode -0.721128914 59902.01
## speechiness -0.004855765 507948.38
## acousticness 3.981640243 529580.07
## instrumentalness 2.071513631 450294.80
## liveness -2.437507770 511314.39
## valence -0.937745389 509395.44
## tempo -0.706977596 541658.48
## duration_ms 0.136618031 621661.70
## Popularity_factor 389.922576181 4182132.01
varImpPlot(spotify.rf)
y_pred_rf_train<-predict(spotify.rf)
sqrt(mean((y_pred_rf_train - y_train)^2))
## [1] 18.33648
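The quantity computed above is the root mean squared error (RMSE). Because `predict()` on a `randomForest` object without `newdata` returns out-of-bag predictions, this value is the square root of the "Mean of squared residuals" in the model summary. A minimal sketch, with a small helper `rmse()` defined here for illustration:

```r
# Root mean squared error between predictions and observed values
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Made-up example: errors of 3, -4 and 0
rmse(c(63, 56, 70), c(60, 60, 70))

# Consistency check against the model summary printed above
sqrt(336.2267)
```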
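To build the random forest, we first split the data into training and test sets and then fit the model on the training set to determine which variables matter most for predicting song popularity. From varImpPlot and the importance table, the artist’s popularity (Popularity_factor) is by far the most important variable on both measures. key ranks second on IncNodePurity, but its %IncMSE is negative, so its apparent importance should be treated with caution; on %IncMSE, playlist_genre, energy and loudness come next.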
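One caveat: `Popularity_factor` is derived from `track_popularity`, the variable being predicted, and it was computed before the train/test split, so its importance is likely to be overstated. A possible alternative, sketched below on made-up data and not part of the original analysis, is to compute each artist's mean popularity from the training rows only and then join it onto the test rows.

```r
library(dplyr)

train <- data.frame(
  track_artist     = c("A", "A", "B"),
  track_popularity = c(60, 80, 20),
  stringsAsFactors = FALSE
)
test <- data.frame(
  track_artist     = c("A", "C"),
  track_popularity = c(75, 40),
  stringsAsFactors = FALSE
)

# Artist means estimated from the training data only
artist_means <- train %>%
  group_by(track_artist) %>%
  summarise(Mean_popularity = mean(track_popularity))

# Artists unseen in training get NA and need a fallback
# (here: the overall training mean, an illustrative choice)
test_features <- test %>%
  left_join(artist_means, by = "track_artist") %>%
  mutate(Mean_popularity = ifelse(is.na(Mean_popularity),
                                  mean(train$track_popularity),
                                  Mean_popularity))

test_features
```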
svm_model <- svm(track_popularity ~ . , train_data)
y_pred_svm_train<-predict(svm_model)
sqrt(mean((y_pred_svm_train - y_train)^2))
## [1] 17.49928
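We then fit an SVM on the training data. Its root mean squared error on the training data is 17.50. This is an in-sample error, whereas the random forest figure of 18.34 is an out-of-bag estimate, so the two are not directly comparable.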
Prediction Of Genre
Now, to predict the genre of a song, we fit a decision tree with rpart on a 70/30 train/test split.
spotify_cleaned4 <- spotify_songs_final[,-c(1,2,3,4,6)]
index <- sample(nrow(spotify_cleaned4),nrow(spotify_cleaned4)*0.70)
spotify.train <- spotify_cleaned4[index,]
spotify.test <- spotify_cleaned4[-index,]
modfit.rpart <- rpart(playlist_genre ~ ., data=spotify.train, method="class")
print(modfit.rpart, digits = 3)
## n= 19842
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 19842 16100 rap (0.17 0.15 0.18 0.16 0.19 0.15)
## 2) speechiness< 0.14 15133 11900 pop (0.19 0.15 0.21 0.15 0.1 0.19)
## 4) danceability>=0.563 10894 8500 pop (0.21 0.19 0.22 0.17 0.12 0.097)
## 8) tempo>=121 5305 3440 edm (0.35 0.14 0.2 0.096 0.12 0.087)
## 16) tempo< 130 3227 1580 edm (0.51 0.13 0.2 0.055 0.047 0.066) *
## 17) tempo>=130 2078 1570 rap (0.11 0.17 0.2 0.16 0.24 0.12) *
## 9) tempo< 121 5589 4250 pop (0.074 0.23 0.24 0.23 0.12 0.11)
## 18) duration_ms< 2.46e+05 3986 2860 pop (0.08 0.27 0.28 0.17 0.12 0.082)
## 36) danceability>=0.71 2077 1350 latin (0.066 0.35 0.23 0.17 0.15 0.041) *
## 37) danceability< 0.71 1909 1260 pop (0.096 0.18 0.34 0.16 0.088 0.13) *
## 19) duration_ms>=2.46e+05 1603 972 r&b (0.057 0.13 0.14 0.39 0.11 0.17) *
## 5) danceability< 0.563 4239 2450 rock (0.15 0.055 0.19 0.12 0.061 0.42) *
## 3) speechiness>=0.14 4709 2550 rap (0.11 0.14 0.083 0.18 0.46 0.034) *
predictions_rpart <- predict(modfit.rpart, spotify.test[,-1], type = "class")
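The predictions above are not evaluated in the text. A common next step is a confusion matrix and the overall accuracy. The sketch below shows the pattern on R's built-in `iris` data rather than the Spotify data, so the numbers it produces say nothing about the genre model.

```r
library(rpart)

set.seed(42)  # arbitrary seed for a reproducible split
index <- sample(nrow(iris), floor(nrow(iris) * 0.70))
train <- iris[index, ]
test  <- iris[-index, ]

# Classification tree, same call pattern as the genre model
fit  <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(fit, test, type = "class")

# Rows: actual class, columns: predicted class
conf_mat <- table(actual = test$Species, predicted = pred)
conf_mat

# Overall accuracy: share of test rows on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

For the genre model, the analogous call would compare `spotify.test$playlist_genre` with `predictions_rpart`.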
Problem Statement
The main focus of our analysis is to determine the audio features which predict the popularity of a song and also classify the given songs into particular genres.
Solution Methodology
For the understanding of these audio features, we leveraged tabular and graphical methods to observe the trends of the audio features such as valence, danceability etc. We formulated a new categorical variable to classify the artist’s popularity into high, medium and low.
After this, we used the random forest and SVM algorithms to predict the popularity of the songs, and a decision tree to classify the songs into their respective genres.
Insights
Dance Monkey is the most popular track, whereas Trevor Daniel is the most popular artist. Pop is the most popular genre, followed by Rap and Latin. Loudness and energy are higher in Pop than in the other genres. The artist’s popularity is the most important variable in predicting the popularity of a song.
Implications
From this report, the consumer can understand the audio features which determine the genre of a song. They can understand why they are hooked on a particular genre, that is, which features keep them glued to it. Also, from the models, they can predict what the popularity of the next song from their favorite artist is going to be.
Limitations
More algorithms can be deployed in the prediction of the genre and popularity of songs. This might help us develop a more accurate model.