With over 320 million monthly users and home to 60 million tracks, four billion playlists and 1.9 million podcasts, Spotify is one of the most popular music (and, increasingly, talk content) streaming platforms in existence. Like its big tech rivals and partners, Spotify owes much of its success to data and analytics. By collecting and analyzing massive amounts of listener data, Spotify can identify emerging user trends in real time and rapidly develop new features or services to capitalize on them. One of Spotify’s major competitive advantages is its formidable recommendation engine. Using machine learning (ML) algorithms, natural language processing (NLP) and convolutional neural networks (CNNs), Spotify is able to transform historical listening data into personalized playlists and music recommendations.
For this project, we are interested in how track popularity is influenced by other attributes such as danceability, loudness, speechiness and valence.
The plan is to analyze the relationship between popularity and the other features of a song, to perform cluster analysis with the K-means method to get an idea of a song’s genre, and to use a random forest to predict song popularity.
This is mainly useful for marketing to Spotify users and improving their experience on the platform. The analysis should also help predict the popularity of a new song from its attributes before it is released.
The following packages were used:
tidyverse - Functions to model, transform and visualize data
ggplot2 - Plotting charts
plotly - Interactive, web-based charts via the open source JavaScript graphing library plotly.js
corrplot - Displaying correlation matrices and confidence intervals
factoextra - Visualizing the output of multivariate data analysis, including clustering results
funModeling - Exploratory data analysis and data preparation toolbox
RColorBrewer - Choosing sensible colour schemes for figures in R
lubridate - Easier handling of date and time data types
knitr - Integrating R code into R Markdown; here it is used to display the variables in a neat, scrollable tabular format
DT - Rendering R data objects as HTML tables
cowplot - Additional functionality for ggplot2, such as arranging plots in a grid
vtable - Printing summary statistics of the data
cluster - Clustering algorithms
purrr - Functional programming tools for R
randomForest - Running the random forest algorithm
library(tidyverse)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(knitr)
library(RColorBrewer)
library(funModeling)
library(lubridate)
library(DT)
library(cowplot)
library(vtable)
library(cluster)
library(purrr)
library(randomForest)
Description of Attributes
Each row represents one song, and each column contains an attribute of that song. The attributes are as follows:
track_id: Unique ID of the song
track_name: Title of the song
track_artist: Name of the artist
track_popularity: Popularity of the song from 0 to 100, based on the number of plays of the track
track_album_release_date: Release date of the album the song appears on
track_album_name: Name of the album the song appears on
playlist_name: Name of the playlist the song is in
playlist_genre: Genre of the playlist the song is in
acousticness: Measure of how acoustic the track is, ranging from 0.0 to 1.0
danceability: Describes how suitable a track is for dancing. Values range from 0.0 (least danceable) to 1.0 (most danceable).
duration_ms: Duration of the track in milliseconds (ms); we convert it to minutes in the data preparation step
energy: Measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity, i.e. the energy of the song
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Values range from 0.0 to 1.0, and values closer to 1.0 indicate a greater likelihood that the track has no vocal content.
speechiness: Detects the presence of spoken words in a track. According to Spotify’s audio feature documentation, values above 0.66 describe tracks that are probably made entirely of spoken words (such as a podcast or talk show), values between 0.33 and 0.66 describe tracks that may contain both music and speech, and values below 0.33 most likely represent music.
valence: Measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive, while tracks with low valence sound more negative.
key: Estimated overall key of the track. If no key is detected, the value is -1.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
loudness: Overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.
mode: Indicates the modality (major or minor) of a track. Major is represented by 1 and minor by 0.
tempo: Overall estimated tempo of a track in beats per minute (BPM).
This section contains all the procedures we followed to prepare the data for analysis. Each step is explained along with its code.
The dataset used for this project is the Spotify Genre dataset, which was provided in the course curriculum.
#### Reading Data
spotify_songs <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
### Checking dimension of Data
dim(spotify_songs)
## [1] 32833 23
The original dataset has 32,833 rows and 23 columns. The genres were chosen using an interesting visualization of the Spotify genre space maintained by a genre taxonomist. The dataset includes about 5,000 songs for each genre, split across various sub-genres. The main purpose of the original dataset was to explore the audio features described above.
The dataset consists of the following variables:
#### Checking column name
names(spotify_songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
Step 1: Handling Missing and Empty Values
#### Counting NA values in every column
colSums(is.na(spotify_songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
#### Removing NA values from the data
spotify_songs <- na.omit(spotify_songs)
As we can see, the track_name, track_album_name and track_artist variables each contain 5 missing values. We decided to remove these rows since they would hamper our analysis. A total of 5 rows were omitted (32,833 rows before, 32,828 after), which should not have a severe impact on the insights derived from the dataset.
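Before dropping rows, it can help to look at which rows are affected. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative assumptions and are not taken from the dataset.

```r
# Hypothetical toy data standing in for spotify_songs
toy <- data.frame(
  track_id     = c("a1", "b2", "c3", "d4"),
  track_name   = c("Song A", NA, "Song C", "Song D"),
  track_artist = c("Artist A", NA, "Artist C", "Artist D"),
  energy       = c(0.9, 0.5, 0.7, 0.3),
  stringsAsFactors = FALSE
)

# Rows with at least one missing value, for inspection
rows_with_na <- toy[!complete.cases(toy), ]

# Drop those rows, as na.omit() does in the step above
toy_clean <- na.omit(toy)

nrow(rows_with_na)
nrow(toy_clean)
```

Inspecting `rows_with_na` first makes it easy to confirm that the missing values sit in the same few rows rather than being spread across the data.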
Step 2: Checking the structure and changing datatypes of certain variables
#### checking Structure of the data
str(spotify_songs)
## tibble [32,828 x 23] (S3: tbl_df/tbl/data.frame)
## $ track_id : chr [1:32828] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32828] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32828] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32828] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32828] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32828] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32828] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32828] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32828] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32828] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32828] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32828] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32828] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32828] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32828] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32828] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32828] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32828] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32828] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32828] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32828] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32828] 122 100 124 122 124 ...
## $ duration_ms : num [1:32828] 194754 162600 176616 169093 189052 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 8152 9283 9284 19569 19812
## ..- attr(*, "names")= chr [1:5] "8152" "9283" "9284" "19569" ...
#### checking Summary of the data
summary(spotify_songs)
## track_id track_name track_artist track_popularity
## Length:32828 Length:32828 Length:32828 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32828 Length:32828 Length:32828
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32828 Length:32828 Length:32828 Length:32828
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6549 Mean :0.698603 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1754 Mean :0.0847599
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187805
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225797
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253581
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
#### Changing datatype of some columns
spotify_songs <- spotify_songs %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))
Step 3: Removing Duplicates
#### removing duplicated data
spotify_songs <- spotify_songs[!duplicated(spotify_songs$track_id),]
dim(spotify_songs)
## [1] 28352 23
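The same track can appear in several playlists, which is why de-duplicating on track_id reduces the row count. A minimal sketch, assuming dplyr is available and using a made-up `toy` data frame, shows how to count such repeats and an equivalent way to keep the first row per track:

```r
library(dplyr)

# Hypothetical toy data: track "a1" appears in two playlists
toy <- data.frame(
  track_id      = c("a1", "a1", "b2"),
  playlist_name = c("Pop Remix", "Dance Pop", "Pop Remix"),
  stringsAsFactors = FALSE
)

# Tracks that occur in more than one playlist
dup_counts <- toy %>%
  count(track_id, name = "n_playlists") %>%
  filter(n_playlists > 1)

# Keep the first row per track_id (same effect as !duplicated(track_id))
toy_unique <- toy %>% distinct(track_id, .keep_all = TRUE)
```

Note that keeping the first occurrence also keeps that row's playlist and genre, so a track that sits in playlists of different genres is assigned whichever genre happens to come first.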
Step 4: Extracting Year and Song Duration in minutes.
We want to analyze how the data trends by artist and genre over the release years. We therefore split track_album_release_date into year, month and day.
#### Extracting Year from songs
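spotify_songs <- spotify_songs %>%
  separate(track_album_release_date,
           c("year", "month", "day"),
           sep = "-",
           fill = "right") # if a date has fewer than three parts, fill the missing ones with NA without a warning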
#### Creating minutes from duration
spotify_songs<-spotify_songs %>%
mutate(duration_min=duration_ms/60000)
#### changing data type of year column
spotify_songs$year <- as.numeric(spotify_songs$year)
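If some release dates are stored with year or month precision only (an assumption about the data, not something shown above), the year can also be taken from the first four characters, which does not depend on how many "-" separated parts a date has. A minimal sketch on made-up dates:

```r
# Hypothetical release dates with different precision
dates <- c("2019-06-14", "2012", "1998-07")

# The year is always the first four characters
year <- as.numeric(substr(dates, 1, 4))

year
```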
Step 5: Selecting the required columns from the dataset
#### Dropping unnecessary columns
spotify_songs <- spotify_songs %>% select(-c(track_id,track_album_id,playlist_id))
A preview of the clean dataset is given below:
### displaying top 100 rows
output_data <- head(spotify_songs, n = 100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))
Correlation between variables
To understand the correlation among the variables, we use the corrplot function from the corrplot package.
### Correlation plot of numeric columns
songs_corr <- spotify_songs %>%
select(track_popularity,danceability,energy,loudness,speechiness,acousticness,instrumentalness, liveness, valence, tempo)
par(bg="#121212")
corrplot(cor(songs_corr),method = 'pie',type="lower",bg="#121212",col="#1DB954",tl.col="#1DB954",addgrid.col = "#1DB954")
Based on the plot, popularity does not have a strong correlation with any of the other track features. However, several features are strongly correlated with each other, which indicates multicollinearity and may make them less suitable for some classification algorithms.
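The plot can be complemented with numbers. The sketch below, which uses simulated data rather than the Spotify dataset, ranks features by their correlation with popularity and lists feature pairs whose absolute correlation exceeds a threshold. The `toy` data, the 0.7 cut-off and the name `high_pairs` are assumptions made for illustration.

```r
set.seed(1)
n <- 200
energy <- runif(n)

# Simulated data: loudness is built to depend strongly on energy
toy <- data.frame(
  energy           = energy,
  loudness         = -10 + 8 * energy + rnorm(n),
  acousticness     = runif(n),
  track_popularity = sample(0:100, n, replace = TRUE)
)

# Correlation of each feature with popularity, largest first
cors <- cor(toy)
pop_cor <- sort(
  cors["track_popularity", setdiff(colnames(cors), "track_popularity")],
  decreasing = TRUE
)

# Feature pairs with |r| above the cut-off (lower triangle only, so each pair appears once)
features <- cor(toy[, c("energy", "loudness", "acousticness")])
features[upper.tri(features, diag = TRUE)] <- NA
idx <- which(abs(features) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(features)[idx[, "row"]],
  var2 = colnames(features)[idx[, "col"]],
  r    = features[idx]
)

pop_cor
high_pairs
```

On the real data, `toy` would be replaced by the numeric columns selected into songs_corr.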
Density Plots of Variables
Let’s see how energy, danceability, valence, acousticness, speechiness and liveness are distributed over all the observations in our dataset. We plot the density of all six variables together, as they are all on the same scale and range from 0 to 1.
#### Plotting Density Plots
ggplot(spotify_songs) +
geom_density(aes(energy, fill = "energy"), alpha = 0.1) +
geom_density(aes(danceability, fill = "danceability"), alpha = 0.1) +
geom_density(aes(valence, fill = "valence"), alpha = 0.1) +
geom_density(aes(acousticness, fill = "acousticness"), alpha = 0.1) +
geom_density(aes(speechiness, fill = "speechiness"), alpha = 0.1) +
geom_density(aes(liveness, fill = "liveness"), alpha = 0.1) +
scale_x_continuous(name = "Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
scale_y_continuous(name = "Density") +
ggtitle("Density plot of Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
theme_bw() +
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent") +
theme(panel.background = element_rect(fill = "#121212")) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.x = element_text(colour = "darkgreen"))+
theme(axis.text.y = element_text(colour = "darkgreen"))
Box Plot
Genre by energy
#### Plotting Box Plot of genre by energy
ggplot(spotify_songs, aes(x=energy, y=playlist_genre)) +
geom_boxplot(color="white", fill="darkgreen") +
scale_x_continuous(name = "Energy") +
scale_y_discrete(name = "Genre") +
theme_bw() +
ggtitle("Variation: Energy and Genre") +
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent") +
theme(axis.text.x = element_text(colour = "darkgreen"))+
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.y = element_text(colour = "darkgreen"))
Genre by danceability
#### Plotting Box Plot of genre by danceability
ggplot(spotify_songs, aes(x=danceability, y=playlist_genre)) +
geom_boxplot(color="white", fill="darkgreen") +
scale_x_continuous(name = "Danceability") +
scale_y_discrete(name = "Genre") +
theme_bw() +
ggtitle("Danceability and Genre") +
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent") +
theme(axis.text.x = element_text(colour = "darkgreen"))+
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.y = element_text(colour = "darkgreen"))
Genre by liveness
#### Plotting Box Plot of genre by liveness
ggplot(spotify_songs, aes(x=liveness, y=playlist_genre)) +
geom_boxplot(color="white", fill="darkgreen") +
scale_x_continuous(name = "Liveness") +
scale_y_discrete(name = "Genre") +
theme_bw() +
ggtitle("Liveness and Genre") +
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent") +
theme(axis.text.x = element_text(colour = "darkgreen"))+
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.y = element_text(colour = "darkgreen"))
Genre by valence
#### Plotting Box Plot of genre by valence
ggplot(spotify_songs, aes(x=valence, y=playlist_genre)) +
geom_boxplot(color="white", fill="darkgreen") +
scale_x_continuous(name = "Valence") +
scale_y_discrete(name = "Genre") +
theme_bw() +
ggtitle("Valence and Genre") +
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent") +
theme(axis.text.x = element_text(colour = "darkgreen"))+
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.y = element_text(colour = "darkgreen"))
Genre by loudness
#### Plotting Box Plot of genre by loudness
ggplot(spotify_songs, aes(x=loudness, y=playlist_genre)) +
geom_boxplot(color="white", fill="darkgreen") +
scale_x_continuous(name = "Loudness") +
scale_y_discrete(name = "Genre") +
theme_bw() +
ggtitle("Loudness and Genre") +
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent") +
theme(axis.text.x = element_text(colour = "darkgreen"))+
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.y = element_text(colour = "darkgreen"))
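The five box plots above differ only in the feature on the x axis, so the shared styling could be wrapped in a function. The following is a sketch of that idea; `genre_boxplot` is a hypothetical helper that is not part of the original analysis, and the `toy` data is made up so the example runs on its own.

```r
library(ggplot2)

# Hypothetical helper: box plot of one feature by playlist genre, in the report's colour scheme
genre_boxplot <- function(data, feature, title = paste(feature, "and Genre")) {
  ggplot(data, aes(x = .data[[feature]], y = playlist_genre)) +
    geom_boxplot(color = "white", fill = "darkgreen") +
    labs(x = feature, y = "Genre", title = title) +
    theme_bw() +
    theme(
      plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
      text = element_text(size = 10, colour = "#1DB954"),
      axis.text = element_text(colour = "darkgreen"),
      panel.background = element_rect(fill = "#121212"),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.background = element_rect(fill = "#121212")
    )
}

# Made-up data with two genres and two features
toy <- data.frame(
  playlist_genre = rep(c("pop", "rock"), each = 5),
  energy  = seq(0.1, 1.0, length.out = 10),
  valence = seq(1.0, 0.1, length.out = 10)
)

# One plot per feature
plots <- lapply(c("energy", "valence"), genre_boxplot, data = toy)
```

With the real data, the call would be something like `genre_boxplot(spotify_songs, "energy")`, which keeps the styling in one place if the colour scheme changes.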
Energy Distribution of the songs
#### Histogram of Energy Distribution
spotify_songs$energy_only <- cut(spotify_songs$energy, breaks = 10)
spotify_songs %>%
ggplot( aes(x = energy_only )) +
geom_bar(width = 0.2, fill = "#1DB954", colour = "black") +
scale_x_discrete(name = "Energy") +
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "darkgreen")) +
theme(axis.text.x = element_text(colour = "#1DB954"))+
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.y = element_text(colour = "darkgreen"))
Speechiness Distribution of the songs
spotify_songs$speech_only <- cut(spotify_songs$speechiness, breaks = 10)
spotify_songs %>%
ggplot(aes(x = speech_only)) +
geom_bar(width = 0.2, fill = "#1DB954", colour = "black") +
scale_x_discrete(name = "Speechiness") +
theme(axis.text.x = element_text(colour = "darkgreen"))+
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.y = element_text(colour = "darkgreen"))+
coord_flip()
Trend Analysis by Year
trend_chart <- function(arg){
trend_change <- spotify_songs %>% filter(year > 2010) %>% group_by(year) %>% summarise(Average = mean(.data[[arg]]))
chart <- ggplot(data = trend_change, aes(x = year, y = Average)) +
geom_line(color = "#1DB954", size = 1) +
scale_x_continuous(breaks=seq(2011, 2020, 1)) + scale_y_continuous(name=paste("",arg,sep="")) +
theme(axis.text.x = element_text(colour = "darkgreen")) +
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212")) +
theme(axis.text.y = element_text(colour = "darkgreen"))
return(chart)
}
trend_chart_track_popularity<-trend_chart("track_popularity")
trend_chart_danceability<-trend_chart("danceability")
trend_chart_energy<-trend_chart("energy")
trend_chart_loudness<-trend_chart("loudness")
trend_chart_duration_min<-trend_chart("duration_min")
trend_chart_speechiness<-trend_chart("speechiness")
plot_grid(trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_min, trend_chart_speechiness,ncol = 2, label_size = 1)
To see how the features change over time, we group the songs by release year, compute the yearly average of each feature and plot it.
What interests us most is that track duration shows a continuously decreasing trend, i.e. songs are getting shorter each year.
Summary Statistics of the clean data
### Summary statistics of all the variables available in the data
st(spotify_songs)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| track_popularity | 28352 | 39.335 | 23.699 | 0 | 21 | 58 | 100 |
| year | 28352 | 2011.054 | 11.23 | 1957 | 2008 | 2019 | 2020 |
| playlist_genre | 28352 | ||||||
| … edm | 4877 | 17.2% | |||||
| … latin | 4136 | 14.6% | |||||
| … pop | 5132 | 18.1% | |||||
| … r&b | 4504 | 15.9% | |||||
| … rap | 5398 | 19% | |||||
| … rock | 4305 | 15.2% | |||||
| playlist_subgenre | 28352 | ||||||
| … album rock | 1039 | 3.7% | |||||
| … big room | 1034 | 3.6% | |||||
| … classic rock | 1100 | 3.9% | |||||
| … dance pop | 1298 | 4.6% | |||||
| … electro house | 1416 | 5% | |||||
| … electropop | 1251 | 4.4% | |||||
| … gangster rap | 1314 | 4.6% | |||||
| … hard rock | 1202 | 4.2% | |||||
| … hip hop | 1296 | 4.6% | |||||
| … hip pop | 803 | 2.8% | |||||
| … indie poptimism | 1547 | 5.5% | |||||
| … latin hip hop | 1194 | 4.2% | |||||
| … latin pop | 1097 | 3.9% | |||||
| … neo soul | 1478 | 5.2% | |||||
| … new jack swing | 1036 | 3.7% | |||||
| … permanent wave | 964 | 3.4% | |||||
| … pop edm | 967 | 3.4% | |||||
| … post-teen pop | 1036 | 3.7% | |||||
| … progressive electro house | 1460 | 5.1% | |||||
| … reggaeton | 687 | 2.4% | |||||
| … southern hip hop | 1582 | 5.6% | |||||
| … trap | 1206 | 4.3% | |||||
| … tropical | 1158 | 4.1% | |||||
| … urban contemporary | 1187 | 4.2% | |||||
| danceability | 28352 | 0.653 | 0.146 | 0 | 0.561 | 0.76 | 0.983 |
| energy | 28352 | 0.698 | 0.184 | 0 | 0.579 | 0.843 | 1 |
| key | 28352 | ||||||
| … 0 | 3001 | 10.6% | |||||
| … 1 | 3436 | 12.1% | |||||
| … 2 | 2478 | 8.7% | |||||
| … 3 | 797 | 2.8% | |||||
| … 4 | 1925 | 6.8% | |||||
| … 5 | 2301 | 8.1% | |||||
| … 6 | 2261 | 8% | |||||
| … 7 | 2907 | 10.3% | |||||
| … 8 | 2066 | 7.3% | |||||
| … 9 | 2631 | 9.3% | |||||
| … 10 | 1972 | 7% | |||||
| … 11 | 2577 | 9.1% | |||||
| loudness | 28352 | -6.818 | 3.036 | -46.448 | -8.31 | -4.709 | 1.275 |
| mode | 28352 | ||||||
| … 0 | 12318 | 43.4% | |||||
| … 1 | 16034 | 56.6% | |||||
| speechiness | 28352 | 0.108 | 0.103 | 0 | 0.041 | 0.133 | 0.918 |
| acousticness | 28352 | 0.177 | 0.223 | 0 | 0.014 | 0.26 | 0.994 |
| instrumentalness | 28352 | 0.091 | 0.233 | 0 | 0 | 0.007 | 0.994 |
| liveness | 28352 | 0.191 | 0.156 | 0 | 0.093 | 0.249 | 0.996 |
| valence | 28352 | 0.51 | 0.234 | 0 | 0.329 | 0.695 | 0.991 |
| tempo | 28352 | 120.958 | 26.955 | 0 | 99.972 | 133.999 | 239.44 |
| duration_ms | 28352 | 226574.631 | 61081.364 | 4000 | 187741.25 | 254975.25 | 517810 |
| duration_min | 28352 | 3.776 | 1.018 | 0.067 | 3.129 | 4.25 | 8.63 |
| energy_only | 28352 | ||||||
| … (-0.000825,0.1] | 67 | 0.2% | |||||
| … (0.1,0.2] | 225 | 0.8% | |||||
| … (0.2,0.3] | 518 | 1.8% | |||||
| … (0.3,0.4] | 1168 | 4.1% | |||||
| … (0.4,0.5] | 2368 | 8.4% | |||||
| … (0.5,0.6] | 3627 | 12.8% | |||||
| … (0.6,0.7] | 4964 | 17.5% | |||||
| … (0.7,0.8] | 5797 | 20.4% | |||||
| … (0.8,0.9] | 5763 | 20.3% | |||||
| … (0.9,1] | 3855 | 13.6% | |||||
| speech_only | 28352 | ||||||
| … (-0.000918,0.0918] | 18385 | 64.8% | |||||
| … (0.0918,0.184] | 4869 | 17.2% | |||||
| … (0.184,0.275] | 2459 | 8.7% | |||||
| … (0.275,0.367] | 1648 | 5.8% | |||||
| … (0.367,0.459] | 736 | 2.6% | |||||
| … (0.459,0.551] | 174 | 0.6% | |||||
| … (0.551,0.643] | 53 | 0.2% | |||||
| … (0.643,0.734] | 12 | 0% | |||||
| … (0.734,0.826] | 8 | 0% | |||||
| … (0.826,0.919] | 8 | 0% |
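In this section, we fit a random forest to the Spotify dataset to predict track popularity. Songs with a track_popularity above 62 are labelled popular (1) and the rest non-popular (0). The value 62 is the third quartile of track_popularity in the summary computed before duplicates were removed; in the de-duplicated data the 75th percentile is 58, so somewhat fewer than 25% of the songs are labelled popular.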
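One way to keep the label tied to the data actually used for modelling is to compute the cut-off with quantile() instead of hard-coding it. This is a sketch of that alternative on made-up popularity scores, not the approach taken in the code that follows.

```r
# Hypothetical popularity scores
popularity <- c(0, 10, 21, 35, 40, 58, 62, 70, 85, 100)

# 75th percentile of the scores at hand
threshold <- quantile(popularity, 0.75)

# 1 = popular (above the cut-off), 0 = not popular
popular <- as.integer(popularity > threshold)

threshold
table(popular)
```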
# Selecting the relevant data
set.seed(2021)
part<-sample(1:nrow(spotify_songs), nrow(spotify_songs)*.75)
data_for_popularity_analysis <- spotify_songs %>%
select(c('energy', 'liveness','tempo', 'speechiness', 'acousticness',
'instrumentalness', 'danceability', 'duration_min' ,
'loudness','valence' ,'track_popularity','key','mode','playlist_genre')) %>%
mutate( track_popularity = if_else(track_popularity > 62 , 1,0))
#Splitting data in train and Test
spotify_songs_train<- data_for_popularity_analysis[part,]
spotify_songs_test <- data_for_popularity_analysis[-part,]
# running random Forest
spotify_rand <- randomForest(as.factor(track_popularity)~., data=spotify_songs_train, mtry= 4,importance =TRUE )
spotify_rand
##
## Call:
## randomForest(formula = as.factor(track_popularity) ~ ., data = spotify_songs_train, mtry = 4, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 18.75%
## Confusion matrix:
## 0 1 class.error
## 0 16947 445 0.02558648
## 1 3543 329 0.91503099
Analysis of Random Forest Model
Calculating Confusion Matrix
#Predicting songs popularity on test data
predict_test <- predict(spotify_rand, spotify_songs_test[,-11])
table(spotify_songs_test$track_popularity, predict_test)
## predict_test
## 0 1
## 0 5687 123
## 1 1172 106
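The test-set confusion matrix above can be summarised with a few standard metrics. The sketch below re-enters the four counts from that output by hand (rows are actual classes, columns are predicted classes) and compares accuracy with the baseline of always predicting "not popular".

```r
# Counts copied from the test-set confusion matrix above
cm <- matrix(c(5687, 123,
               1172, 106),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # share of correct predictions
sensitivity <- cm["1", "1"] / sum(cm["1", ])    # popular songs correctly identified
specificity <- cm["0", "0"] / sum(cm["0", ])    # non-popular songs correctly identified
precision   <- cm["1", "1"] / sum(cm[, "1"])    # predicted-popular songs that are popular
baseline    <- sum(cm["0", ]) / sum(cm)         # accuracy of always predicting 0

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        baseline = baseline), 3)
```

Accuracy is about 81.7%, which is slightly below the 82.0% obtained by always predicting "not popular", and only about 8.3% of the popular songs are identified. Accuracy alone therefore overstates how well the model finds popular songs.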
Plotting Variable Importance
### Calculating variable importance (mean decrease in accuracy)
important <- importance(spotify_rand)
varImportance <- data.frame(Variables = row.names(important),
Importance = round(important[, "MeanDecreaseAccuracy"], 2))
rankImportance <- varImportance%>%
mutate(Rank= paste('#',dense_rank(desc(Importance))))
ggplot(rankImportance,aes(x=reorder(Variables,Importance) ,y=Importance,fill=Importance))+
geom_bar(stat = "identity",fill = "#1DB954") +
geom_text(aes(x = Variables, y = 0.5, label = Rank),hjust=0, vjust=0.55, size = 4, colour = "black") +
theme_bw() +
ggtitle("Variable Importance") +
scale_x_discrete(name = "Variables") +
theme(axis.text.x = element_text(colour = "darkgreen"))+
theme(plot.title = element_text(size = 10, face = "bold", colour = "#1DB954"),
text = element_text(size = 10,colour = "#1DB954")) +
theme(panel.background = element_rect(fill = "#121212"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(plot.background = element_rect(fill = "#121212")) +
theme(legend.background = element_rect(fill = "#121212"))+
theme(axis.text.y = element_text(colour = "darkgreen")) +
coord_flip()
A common assumption is that energy influences popularity, i.e. that energetic songs are more popular. However, we could not find any correlation between popularity and energy. The songs in the top 100 were not evenly distributed across genres.
We split the track_album_release_date variable into year, month and day. We also created new variables such as duration_min, energy_only and speech_only. We are trying to explain track popularity using different features of the song. To find the trends, we used the plot objects trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_min and trend_chart_speechiness.
What interests us most is that track duration shows a continuously decreasing trend, meaning that songs are getting shorter each year. Furthermore, the danceability of tracks is continuously rising, which suggests that listeners enjoy danceable songs. Energy and loudness follow almost the same trend, with peaks and dips in the same years, which is consistent with a high positive correlation between them.
Over the last decade, the average popularity of songs reached its minimum in 2014 and has been increasing continuously since then, which suggests that, on average, more recent songs are more popular.
We have used trend charts to find out how the features change across time. In order to understand the correlation among variables, we have used corrplot function in R. We have used boxplots to find out the outliers.
Even though millions of songs exist, we had only about 32k records for our analysis (about 28k after removing duplicates), so we could not obtain a full picture of the features of music. The analysis could also be strengthened by incorporating user-related features such as demographic attributes and listening history. We used the K-means clustering algorithm to recover the genre of the songs in each cluster but, perhaps because of the size of the data, we could not obtain clear clusters. We also tried to predict a track’s popularity from its key features using a random forest. The model classifies non-popular songs well (out-of-bag class error of about 2.6%) but misses most popular songs (out-of-bag class error of about 91.5%). Playlist genre and loudness are the two major contributors to a song’s popularity.
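For readers who want to see the clustering step, here is a minimal K-means sketch using base R’s kmeans() on simulated, well-separated data. It is not the analysis code; the `toy` data, the choice of two features and `centers = 2` are assumptions made purely for illustration.

```r
set.seed(2021)

# Simulated data: two clearly separated groups of 50 songs each
toy <- data.frame(
  energy       = c(rnorm(50, mean = 0.8, sd = 0.05), rnorm(50, mean = 0.3, sd = 0.05)),
  acousticness = c(rnorm(50, mean = 0.1, sd = 0.05), rnorm(50, mean = 0.7, sd = 0.05))
)

# Standardise the features so that each contributes equally to the distance
toy_scaled <- scale(toy)

# K-means with two centres and several random starts
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Compare the clusters with the known groups
table(cluster = km$cluster, group = rep(c("A", "B"), each = 50))
```

On the real audio features, genres overlap far more than these simulated groups do, which may be part of why the clusters did not separate cleanly.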