All of us like to listen to songs of our choice and to share our playlists and jams with other people. Spotify is one such platform, where you can not only listen to songs but also create playlists that can be viewed and played by other people with similar interests. Many factors can determine the popularity of a soundtrack, such as its acoustic characteristics. Here is a brief analysis of a Spotify dataset in which we try to determine whether the popularity of a particular soundtrack is related to attributes like artist name, duration and various other acoustic features such as loudness and tempo.
First, we will clean and manipulate the Spotify data to make it suitable for our analysis. Then we will explore and visualize it to gain valuable insights that are not self-evident from the raw data. Lastly, to address our problem statement, that is, to predict the popularity of a track based on its characteristics, we will apply several machine learning techniques and compare them to select the best model.
The technique of exploring and visualizing data to uncover hidden information is known as exploratory data analysis (EDA). We will plot different graphs and charts, such as histograms, lollipop charts and radial graphs, to accomplish this goal in the best possible manner.
We will apply supervised machine learning techniques to build our predictive models. We will begin with a linear regression model and then move towards more sophisticated models such as SVM and random forest. Finally, we will compare these three models to select the one best suited for prediction.
Our analysis can be used by the Spotify business directly to better maintain its song database and attract more customers by suggesting songs of their choice based on popularity. It can also be used by end customers of Spotify, like you and me, to search effectively for popular trends and for playlists containing the most popular soundtracks, or to create such playlists.
The packages we have used in this project are:
library(dplyr)
library(corrplot)
library(funModeling)
library(ggplot2)
library(GGally)
library(rsample)
library(randomForest)
library(e1071)
library(tidyr)   # for gather(), used later to reshape the data for the boxplots
We used the chunk options message=FALSE and warning=FALSE to suppress the warnings and messages that can arise from executing this chunk.
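These options can also be applied to every chunk at once from a setup chunk; a minimal sketch using knitr's global option setter (the setup-chunk approach is our own suggestion, not part of the original workflow):

# suppress package start-up messages and warnings for all subsequent chunks
knitr::opts_chunk$set(message = FALSE, warning = FALSE)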
Let’s have a look at each package we have used or plan to use:
dplyr: This package aims to provide a function for each basic verb of data manipulation:
filter() to select cases based on their values.
arrange() to reorder the cases.
select() and rename() to select variables based on their names.
mutate() and transmute() to add new variables that are functions of existing variables.
summarise() to condense multiple values to a single value.
sample_n() and sample_frac() to take random samples.
More info on: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
We have used the distinct() function in our project.
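As a quick illustration of these verbs, here is a minimal sketch on a small toy data frame (the toy data is made up purely for demonstration and is not part of the Spotify dataset):

toy <- data.frame(track = c("A", "A", "B", "C"), popularity = c(70, 70, 55, 80))
toy %>%
  distinct(track, .keep_all = TRUE) %>%   # drop duplicate tracks, keeping the other columns
  filter(popularity > 60) %>%             # keep only the more popular tracks
  arrange(desc(popularity))               # sort from most to least popular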
corrplot: The corrplot package provides a graphical display of a correlation matrix or confidence intervals. It also contains some algorithms for matrix reordering. In addition, corrplot is good at details, including choosing colors, text labels, color labels, layout, etc.
More info on: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
We have used the corrplot() function from this package for data visualization.
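For a quick feel of the function, a minimal example on a built-in dataset could look like this (mtcars is used here only for illustration):

corrplot(cor(mtcars), method = "number", number.cex = 0.7)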
funModeling: This package is generally used for data preparation, profiling, selecting the best variables, data visualization, assessing model performance and other tasks.
More info on: https://cran.r-project.org/web/packages/funModeling/index.html
We have used the plot_num() function from this package.
ggplot2: It is a system for declaratively creating graphics, based on The Grammar of Graphics. You generally supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
More info on: https://cran.r-project.org/web/packages/ggplot2/index.html
We have used multiple functions from this package for visualizing the data.
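A minimal example of this layered grammar, using a dataset that ships with ggplot2:

ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  facet_wrap(~drv) +
  theme_minimal()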
GGally: It extends ‘ggplot2’ by adding several functions to reduce the complexity of combining geometric objects with transformed data. Some of these functions include a pairwise plot matrix, a two group pairwise plot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks.
More info on: https://cran.r-project.org/web/packages/GGally/index.html
We have used multiple functions from this package for visualizing the data.
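For instance, ggpairs() from GGally produces a pairwise plot matrix in one call; a small sketch on a built-in dataset (iris is used only for illustration):

GGally::ggpairs(iris, columns = 1:4, mapping = aes(colour = Species))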
rsample: It has functions to create variations of a data set that can be used to evaluate models or to estimate the sampling distribution of some statistic.
More info on: https://cran.r-project.org/web/packages/rsample/index.html
We have used initial_split() from this package to split data into testing and training samples.
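The typical pattern is to create the split object and then pull out the two pieces with training() and testing(); a minimal sketch on a built-in dataset:

split_obj <- initial_split(mtcars, prop = 0.8)   # 80% training, 20% testing
train_toy <- training(split_obj)
test_toy  <- testing(split_obj)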
e1071: This package has functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier etc.
More info on: https://cran.r-project.org/web/packages/e1071/index.html
We have used this package to implement SVM.
randomForest: It implements Breiman’s random forest algorithm (based on Breiman and Cutler’s original Fortran code) for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.
More info on: https://cran.r-project.org/web/packages/randomForest/index.html
We have used this package to implement random forest.
The dataset that we will be analyzing can be found here.
The original data comes from Spotify via the spotifyr package. The authors of this package are Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. It makes it easier to get either your own data or general metadata around songs from Spotify’s API.
spotifyr is an R wrapper that can be used to pull track audio features and other information from Spotify’s Web API in bulk. It lets you enter an artist’s name and retrieve their entire discography, along with Spotify’s audio features and track/album popularity metrics. It also allows you to pull song and playlist information for any Spotify user.
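As an illustration of how such data could be pulled directly, a rough sketch with spotifyr might look like the following; the credential values are placeholders and the function names are taken from the spotifyr documentation, so check the current docs before relying on them:

library(spotifyr)

# API credentials from the Spotify developer dashboard (placeholders)
Sys.setenv(SPOTIFY_CLIENT_ID = "your-client-id")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your-client-secret")
access_token <- get_spotify_access_token()

# pull audio features for an artist's entire discography
artist_features <- get_artist_audio_features("Ed Sheeran")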
A detailed description of the attributes and their data type is given in the table below:
| Variable | Class | Description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
The data is in a CSV file, so we will use the read.csv() function to read it.
spotify <- read.csv("C:/Users/khann/Desktop/COLLEGE/R/spotify_songs.csv", stringsAsFactors = FALSE)
Then, we check the structure and dimensions of the data frame as well as a sample of the data.
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
head(spotify,5)
## track_id
## 1 6f807x0ima9a1j3VPbc7VN
## 2 0r7CVbZTWZgbTCYdfa2P31
## 3 1z1Hg7Vb0AhHDiEmnDE79l
## 4 75FpbthrwQmzHlBJLuGdC7
## 5 1e8PAfcKUYoKkxPhrHqw4x
## track_name track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2 Memories - Dillon Francis Remix Maroon 5
## 3 All the Time - Don Diablo Remix Zara Larsson
## 4 Call You Mine - Keanu Silva Remix The Chainsmokers
## 5 Someone You Loved - Future Humans Remix Lewis Capaldi
## track_popularity track_album_id
## 1 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 67 63rPSO264uRjW1X5E6cWv6
## 3 70 1HoSmj2eLcsrR0vE9gThr4
## 4 60 1nqYsOef1yKKuGOVchbsk6
## 5 69 7m7vv9wlQ4i0LFuJiE2zsQ
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## track_album_release_date playlist_name playlist_id
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## playlist_genre playlist_subgenre danceability energy key loudness mode
## 1 pop dance pop 0.748 0.916 6 -2.634 1
## 2 pop dance pop 0.726 0.815 11 -4.969 1
## 3 pop dance pop 0.675 0.931 1 -3.432 0
## 4 pop dance pop 0.718 0.930 7 -3.778 1
## 5 pop dance pop 0.650 0.833 1 -4.672 1
## speechiness acousticness instrumentalness liveness valence tempo
## 1 0.0583 0.1020 0.00e+00 0.0653 0.518 122.036
## 2 0.0373 0.0724 4.21e-03 0.3570 0.693 99.972
## 3 0.0742 0.0794 2.33e-05 0.1100 0.613 124.008
## 4 0.1020 0.0287 9.43e-06 0.2040 0.277 121.956
## 5 0.0359 0.0803 0.00e+00 0.0833 0.725 123.976
## duration_ms
## 1 194754
## 2 162600
## 3 176616
## 4 169093
## 5 189052
The output of str() gives insight into the datatype of each variable along with some of the values for that variable. It can be observed that there are 23 variables and 32833 observations. The head() function gives the top 5 rows of the data frame.
To begin with, we will delete some attributes that are not useful for our analysis. These include track_id, track_album_id, track_album_name, track_album_release_date, playlist_name and playlist_id.
new_sp <- spotify[ -c(1,5:9) ]
Next, we will identify the distinct track names and then remove the duplicates. This can significantly reduce the number of records.
new_sp <- distinct(new_sp,track_name, .keep_all = TRUE )
Moving further, we will identify null values.
colSums(is.na(new_sp))
## track_name track_artist track_popularity playlist_genre
## 1 1 0 0
## playlist_subgenre danceability energy key
## 0 0 0 0
## loudness mode speechiness acousticness
## 0 0 0 0
## instrumentalness liveness valence tempo
## 0 0 0 0
## duration_ms
## 0
This shows that there are missing values in track_name and track_artist. Since these variables are the names of the track and the artist, we cannot impute them. Also, because the number of missing values is very small and will not impact the analysis, we can safely remove these records.
new_sp <- na.omit(new_sp)
Lastly, we will reorder the columns to put all the numeric variables together. This makes it easier to code as well as analyze in further steps.
new_sp <- select(new_sp,track_name,track_artist, playlist_genre, playlist_subgenre, track_popularity, danceability:duration_ms)
Let’s have a look at the data after performing all the steps of data cleaning.
colSums(is.na(new_sp))
## track_name track_artist playlist_genre playlist_subgenre
## 0 0 0 0
## track_popularity danceability energy key
## 0 0 0 0
## loudness mode speechiness acousticness
## 0 0 0 0
## instrumentalness liveness valence tempo
## 0 0 0 0
## duration_ms
## 0
dim(new_sp)
## [1] 23449 17
names(new_sp)
## [1] "track_name" "track_artist" "playlist_genre"
## [4] "playlist_subgenre" "track_popularity" "danceability"
## [7] "energy" "key" "loudness"
## [10] "mode" "speechiness" "acousticness"
## [13] "instrumentalness" "liveness" "valence"
## [16] "tempo" "duration_ms"
As can be seen, there are no null values now. Also, the new data frame has 17 attributes with 23449 records since the duplicate values have been removed. The output of names() function gives the list of variables in the new data frame.
Before moving ahead with data exploration, let’s have a look at the output of the summary() function.
summary(new_sp, 5)
## track_name track_artist playlist_genre
## Length:23449 Length:23449 Length:23449
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_subgenre track_popularity danceability energy
## Length:23449 Min. : 0.00 Min. :0.0000 Min. :0.000175
## Class :character 1st Qu.:23.00 1st Qu.:0.5630 1st Qu.:0.578000
## Mode :character Median :43.00 Median :0.6720 Median :0.720000
## Mean :39.74 Mean :0.6552 Mean :0.696638
## 3rd Qu.:58.00 3rd Qu.:0.7620 3rd Qu.:0.841000
## Max. :98.00 Max. :0.9830 Max. :1.000000
## key loudness mode speechiness
## Min. : 0.000 Min. :-46.448 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.: -8.341 1st Qu.:0.0000 1st Qu.:0.0412
## Median : 6.000 Median : -6.285 Median :1.0000 Median :0.0637
## Mean : 5.374 Mean : -6.848 Mean :0.5667 Mean :0.1103
## 3rd Qu.: 9.000 3rd Qu.: -4.727 3rd Qu.:1.0000 3rd Qu.:0.1390
## Max. :11.000 Max. : 1.275 Max. :1.0000 Max. :0.9180
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0148 1st Qu.:0.0000000 1st Qu.:0.0927 1st Qu.:0.3300
## Median :0.0827 Median :0.0000192 Median :0.1270 Median :0.5130
## Mean :0.1811 Mean :0.0940749 Mean :0.1916 Mean :0.5114
## 3rd Qu.:0.2660 3rd Qu.:0.0069600 3rd Qu.:0.2500 3rd Qu.:0.6950
## Max. :0.9940 Max. :0.9940000 Max. :0.9960 Max. :0.9910
## tempo duration_ms
## Min. : 0.00 Min. : 4000
## 1st Qu.: 99.97 1st Qu.:187200
## Median :122.00 Median :216133
## Mean :121.08 Mean :226107
## 3rd Qu.:134.78 3rd Qu.:254493
## Max. :239.44 Max. :517810
This function displays the length, class and mode for character variables, and the minimum, maximum, quartile and mean values for numeric variables.
There are four string variables- track_name, track_artist, playlist_genre and playlist_subgenre- each having 23449 records.
The rest are numeric variables. We observe that valence, liveness, instrumentalness, acousticness, speechiness, danceability and energy range between 0 and 1. Also, mode has just two values, 0 and 1. In theory, track_popularity can take values between 0 and 100, but in this data its maximum value is 98.
options(repr.plot.width = 30, repr.plot.height = 30)
sp_sliced <- new_sp[, 5:17]
plot_num(sp_sliced)
Here, we can see the frequency distribution of all the numeric variables. We observe that valence, tempo and duration_ms are approximately normally distributed, whereas speechiness, acousticness, instrumentalness and liveness are positively skewed. Also, loudness, danceability and energy are negatively skewed.
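To back up this visual impression numerically, the skewness() function from e1071 (already loaded) can be applied to each numeric column; positive values indicate right skew and negative values left skew. A quick sketch using the sp_sliced data frame defined above:

round(sapply(sp_sliced, skewness), 2)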
df_new <- gather(sp_sliced, "danceability", "energy", "key", "loudness", "mode","speechiness","acousticness","instrumentalness","liveness","valence","tempo" ,"duration_ms", key="variables", value="value")
df_new$variables <- as.factor(df_new$variables)
ggplot(df_new, aes(x = variables, y = value)) +
geom_boxplot(aes(fill= variables))+
facet_wrap(vars(variables), scales="free") +theme_minimal(base_size = 26)
Next, we analyze the outliers in all the numeric variables using boxplots. It can be observed that, except for mode, valence and key, all of them have a significant number of outliers. It is also important to note that mode can only take the values 0 or 1, hence it has no outliers.
corr <- cor(sp_sliced)
num <- corrplot(corr, method = "number", number.cex=0.60)
From this correlation plot, two values that stand out are 0.68 and -0.55. Here, 0.68 indicates a moderately positive correlation between loudness and energy whereas -0.55 indicates a moderately negative correlation between acousticness and energy.
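The strongest pairs can also be extracted programmatically from the corr matrix computed above; a small sketch, where the 0.5 cut-off is our own choice:

strong <- which(abs(corr) > 0.5 & upper.tri(corr), arr.ind = TRUE)  # each pair once
data.frame(var1 = rownames(corr)[strong[, 1]],
           var2 = colnames(corr)[strong[, 2]],
           correlation = round(corr[strong], 2))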
popular_df <- data.frame(x = new_sp$track_popularity)
popular_df %>%
ggplot(aes(x=x, fill = "#AA66FF"))+
geom_histogram(aes(y=..density..), color = 'black', fill="#AA66FF")+
geom_density(aes(y=..density..),color = 'black',fill = 'grey', alpha = 0.5, kernel='gaussian')+
geom_vline(aes(xintercept = mean(x)),color = 'red', linetype = 'dashed')+
geom_vline(aes(xintercept = median(x)),color = 'blue', linetype = 'dashed')+
geom_vline(aes(xintercept = quantile(x, probs = 0.25)),color = 'black')+
geom_vline(aes(xintercept = quantile(x, probs = 0.75)),color = 'black')+
theme_minimal()
log_pop <- data.frame(x = log(new_sp$track_popularity + 1))
log_pop %>%
ggplot(aes(x=x, fill = "#AA66FF"))+
geom_histogram(aes(y=..density..), color = 'black', fill="#AA66FF")+
geom_density(aes(y=..density..),color = 'black',fill = 'grey', alpha = 0.5, kernel='gaussian')+
geom_vline(aes(xintercept = mean(x)),color = 'red', linetype = 'dashed')+
geom_vline(aes(xintercept = median(x)),color = 'blue', linetype = 'dashed')+
geom_vline(aes(xintercept = quantile(x, probs = 0.25)),color = 'black')+
geom_vline(aes(xintercept = quantile(x, probs = 0.75)),color = 'black')+
theme_minimal()
These histograms show the distribution of track_popularity. The first graph makes it evident that the variable does not follow a perfect normal distribution and has a peak at the lower end.
In the next figure, we have applied a log transformation to track popularity to further analyze the distribution. This shows that, after the transformation, the distribution is skewed to the left.
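Because track_popularity contains zeros, the shift by 1 is applied before taking the log; base R’s log1p() does the same in a single step. A quick equivalence check on the cleaned data:

all.equal(log1p(new_sp$track_popularity), log(new_sp$track_popularity + 1))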
new_sp %>%
group_by(playlist_genre) %>%
summarise(count = n(),
avg_pop = mean(track_popularity)) %>%
arrange(desc(avg_pop)) %>%
mutate(playlist_genre = factor(playlist_genre, levels = playlist_genre)) %>%
ggplot( aes(x = playlist_genre, y = avg_pop)) +
geom_bar(stat="identity", width = 0.5, fill="violetred4") +
theme_light() +
xlab("Genre") +
ylab("Popularity Score")
From this graph, it can be observed that the genre “pop” has the highest mean track popularity score. Hence, we will consider it the most popular genre on Spotify.
new_sp %>%
group_by(playlist_genre, playlist_subgenre) %>%
summarise(count=n(),
avg_pop = mean(track_popularity)) %>%
ggplot(aes(playlist_genre, playlist_subgenre))+
geom_tile(aes(fill=avg_pop),colour="white")+
scale_fill_gradient(low="pink",high = "brown")+
theme_light() +
xlab("Playlist Genre")+
ylab("Playlist Subgenre")+
ggtitle("Popularity of Genre and Subgenre")+
guides(fill=FALSE)
On diving deeper into the popularity of genres and sub-genres, we observe that post-teen pop and dance pop are more popular than the other sub-categories of pop. Also, it can be inferred that progressive electro house, new jack swing and neo soul are among the least popular sub-genres.
new_sp %>%
group_by(track_artist) %>%
summarise(count = n(),
avg_pop = mean(track_popularity)) %>%
arrange(avg_pop) %>%
top_n(20) %>%
mutate(track_artist = factor(track_artist, levels = track_artist)) %>%
ggplot( aes(x = track_artist, y = avg_pop)) +
geom_segment( aes(x = track_artist, xend = track_artist, y = avg_pop, yend = 0), color = "navyblue") +
geom_point( size = 4, color = "blue", alpha = 0.8) +
coord_flip() +
theme_light() +
xlab("Artist") +
ylab("Popularity Score")
On analyzing the popularity of artists, Trevor Daniel, Y2K, Ant Saunders, Don Toliver and Kina emerge as the top 5 artists on Spotify.
new_sp %>%
group_by(track_name) %>%
summarise(count = n(),
pop = track_popularity) %>%
arrange(pop) %>%
top_n(20) %>%
mutate(track_name = factor(track_name, levels = track_name)) %>%
ggplot( aes(x = track_name, y = pop)) +
geom_bar(stat="identity", width = 0.5, fill="violetred4") +
theme_classic(base_size = 26) +
theme_light() +
coord_flip() +
xlab("Track Name") +
ylab("Popularity Score")
This plot for top 20 popular tracks shows that Tusa by Nicki Minaj and Karol G has the highest popularity rating. Other notable songs in the top 20 are Memories by Maroon 5, Falling by Trevor Daniel, My Oh My by Camila Cabello, Lose you to love me by Selena Gomez and Bad Guy by Billie Eilish.
new_sp %>%
select(track_popularity, valence, speechiness, tempo, track_artist, playlist_genre, playlist_subgenre) %>%
group_by(track_popularity)%>%
filter(!is.na(track_popularity)) %>%
filter(!is.na(valence))%>%
filter(!is.na(speechiness))%>%
filter(!is.na(tempo))%>%
ggplot(mapping = aes(x = valence, y = track_popularity, color = playlist_subgenre))+
facet_wrap(~playlist_genre)+
geom_point()+
theme_minimal(base_size = 20)
This graph shows the relationship between popularity and valence for each genre. For instance, for EDM, the most common sub-genres visible in the plot include permanent wave, pop EDM, progressive electro house, southern hip hop, album rock and dance pop. We can derive similar insights for the other genres as well.
For EDM, most songs from permanent wave have been highly popular and they also possess a higher valence, or musical positiveness.
Most of the Latin songs have a higher valence and belong to either the progressive electro house or latin pop sub-genres.
new_sp %>%
select(track_popularity, valence, speechiness, tempo, playlist_subgenre) %>%
group_by(track_popularity)%>%
filter(!is.na(track_popularity)) %>%
filter(!is.na(valence))%>%
filter(!is.na(speechiness))%>%
filter(!is.na(tempo))%>%
ggplot(mapping = aes(x = track_popularity, y = valence, color = tempo, alpha = speechiness, fill = tempo))+
geom_bar(stat = 'identity', position = 'dodge')+
coord_polar()+
facet_wrap(~playlist_subgenre)+
theme_minimal(base_size = 20)
This visualization shows how valence (on the y-axis), tempo (colour and fill) and speechiness (transparency) relate to track popularity for each sub-genre.
Insights for each sub-genre:
Album rock: Speechiness appears low, as the chart is noticeably faded. Many tracks have a high valence. Only a few songs have a low tempo, irrespective of popularity.
Big room: Speechiness and valence are on the lower side for most tracks, while tempo tends to be high. One song with a high valence is also very popular. Since the chart is sparse, relatively few songs fall in this sub-genre.
Classic rock: Many songs fall in this sub-genre, and most of them have high valence and tempo with low speechiness.
Dance pop: A lot of tracks fall in this sub-genre, most of them with a high tempo and valence. Some of the most popular songs also have a low valence.
Electro house: Not many songs have a popularity greater than 75. Tempo is mostly in the medium range.
Electropop: Some tracks with a low valence are highly popular. Speechiness appears low.
Gangster rap: Speechiness is very high, as most of the chart appears solid. Valence also seems high. One highly popular song has a high tempo.
Hard rock: Speechiness appears low and tempo is high in almost all tracks. One of the most popular tracks has a low valence.
Hip hop: Many of these tracks are popular, with high speechiness and a medium-to-low tempo. Valence tends to be high.
Hip pop: Some tracks have a very low tempo. Valence is generally high, but some popular tracks have a low valence.
Indie poptimism: Most tracks have low speechiness and high valence. Tempo appears medium to low. One of the most popular tracks has a low tempo and medium valence.
Latin hip hop: Most tracks have a medium-to-low tempo, high valence and medium speechiness.
Latin pop: Many tracks fall in this sub-genre, with low speechiness, high valence and mostly medium tempo.
Neo soul: For most tracks, valence appears medium with a medium-to-low tempo. One very popular song has a high tempo and valence.
New jack swing: None of the tracks are highly popular. Tempo ranges from medium to low, and valence is high.
Permanent wave: Most of the popular tracks have a high valence and low speechiness.
Pop EDM: Speechiness is low; one very popular song has a high valence and tempo.
Post-teen pop: Most tracks have a medium-to-high valence and are very popular.
Progressive electro house: This sub-genre is not highly popular, and speechiness is low.
Reggaeton: Most tracks are popular and have a high valence. Speechiness is low.
Southern hip hop: Speechiness and valence are high. There are many tracks in this sub-genre, as the chart is dense.
Trap: Most tracks are popular, with a high valence and low speechiness.
Tropical: Many tracks fall in this sub-genre; valence is high and tempo appears medium to low.
Urban contemporary: Many highly popular songs have a low valence. Most songs are popular, and the overall tempo appears medium to low.
new_sp %>%
select(energy, playlist_subgenre, speechiness, tempo, playlist_genre) %>%
group_by(energy)%>%
filter(!is.na(energy)) %>%
filter(!is.na(playlist_subgenre))%>%
filter(!is.na(speechiness))%>%
filter(!is.na(tempo))%>%
ggplot(mapping = aes(x = playlist_subgenre, y = energy, alpha = speechiness, fill = playlist_genre))+
geom_bar(stat = 'identity')+
coord_polar()+
theme_minimal(base_size = 20)
This visualization shows the energy and speechiness of all the tracks in each sub-genre, coloured by their parent genre.
We have created three models to predict the popularity of a track based on predictor variables such as valence, tempo, energy and so on.
First of all, we split the dataset into training and testing subsets.
set.seed(073)
split_fun <- initial_split(sp_sliced, prop = .85)
train <- training(split_fun)
test <- testing(split_fun)
x_train <- train[, -1]
y_target <- train[, 1]
x_test <- test[, -1]
y_test <- test[, 1]
training <- data.frame(x_train, target = y_target)
Now we create a multiple regression model with track popularity as the response variable.
Linear_model <- lm(target ~ danceability + energy + key + loudness + mode
+ speechiness+ acousticness + instrumentalness + liveness + valence + tempo + duration_ms, data =training)
summary(Linear_model)
##
## Call:
## lm(formula = target ~ danceability + energy + key + loudness +
## mode + speechiness + acousticness + instrumentalness + liveness +
## valence + tempo + duration_ms, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.945 -16.278 2.869 17.715 62.207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.789e+01 1.995e+00 34.027 < 2e-16 ***
## danceability 4.845e+00 1.253e+00 3.866 0.000111 ***
## energy -2.380e+01 1.429e+00 -16.649 < 2e-16 ***
## key 6.748e-03 4.506e-02 0.150 0.880965
## loudness 1.208e+00 7.634e-02 15.822 < 2e-16 ***
## mode 1.146e+00 3.292e-01 3.481 0.000501 ***
## speechiness -6.250e+00 1.582e+00 -3.950 7.83e-05 ***
## acousticness 4.154e+00 8.640e-01 4.807 1.54e-06 ***
## instrumentalness -9.298e+00 7.227e-01 -12.866 < 2e-16 ***
## liveness -4.432e+00 1.044e+00 -4.247 2.18e-05 ***
## valence 2.481e+00 7.696e-01 3.224 0.001267 **
## tempo 2.798e-02 6.085e-03 4.598 4.30e-06 ***
## duration_ms -4.528e-05 2.671e-06 -16.956 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.63 on 19919 degrees of freedom
## Multiple R-squared: 0.06443, Adjusted R-squared: 0.06386
## F-statistic: 114.3 on 12 and 19919 DF, p-value: < 2.2e-16
The summary statistics show that, except for key, all the predictor variables are significant with a p-value of less than 0.05. The adjusted R-squared is very low even though the model as a whole is significant. This is possible when the data inherently contain a large amount of unexplainable variability.
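To assess predictive rather than in-sample fit, the next chunk defines helper functions based on the PRESS (predicted residual sum of squares) statistic. For a linear model, the leave-one-out residuals can be computed without refitting as e_i / (1 - h_ii), where e_i is the ordinary residual and h_ii the leverage of observation i; PRESS is the sum of their squares, MSPE is PRESS divided by the number of observations, and the predicted R-squared is 1 - PRESS/TSS, where TSS is the total sum of squares.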
PRESS <- function(linear.model) {
pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
PRESS <- sum(pr^2)
return(PRESS)
}
MSPE <- function(linear.model) {
return(PRESS(linear.model)/length(residuals(linear.model)))
}
pred_r_squared <- function(linear.model) {
lm.anova <- anova(linear.model)
tss <- sum(lm.anova$'Sum Sq')
pred.r.squared <- 1-PRESS(linear.model)/(tss)
return(pred.r.squared)
}
MSPE(Linear_model)
## [1] 512.2262
RMSE <- sqrt(MSPE(Linear_model))
RMSE
## [1] 22.63241
pred_r_squared(Linear_model)
## [1] 0.06320997
summary(Linear_model)$r.squared
## [1] 0.06442833
test[, 'linpred'] <- predict(Linear_model,test, type="response")
head(test)
## track_popularity danceability energy key loudness mode speechiness
## 31 56 0.641 0.869 11 -4.754 1 0.0423
## 36 55 0.748 0.831 1 -5.029 1 0.1150
## 38 63 0.633 0.854 0 -4.046 0 0.0432
## 39 67 0.563 0.810 2 -2.921 1 0.0522
## 47 56 0.789 0.893 10 -4.364 0 0.1680
## 48 57 0.640 0.838 6 -4.203 1 0.0416
## acousticness instrumentalness liveness valence tempo duration_ms
## 31 0.03190 1.31e-03 0.4030 0.358 128.091 178125
## 36 0.08230 7.36e-05 0.0757 0.894 128.024 185273
## 38 0.03820 2.83e-05 0.4340 0.659 126.026 172360
## 39 0.00522 0.00e+00 0.0846 0.495 129.975 247385
## 47 0.05380 0.00e+00 0.2210 0.410 121.956 144073
## 48 0.05880 1.99e-05 0.0424 0.587 124.081 193861
## linpred
## 31 40.26702
## 36 43.51167
## 38 41.06565
## 39 41.95346
## 47 41.35510
## 48 43.10152
Due to the high MSPE and the low predicted R-squared, the predictions from this model cannot be considered accurate.
The goal of an SVM is to take groups of observations and construct boundaries to predict which group future observations belong to, based on their measurements. For our data set, the optimal cost was calculated to be 0.1, which does not penalize the model much for misclassified observations. However, due to computational limitations, we could not run the tuning function to further enhance the model.
set.seed(073)
mod.svm <- svm(target ~ ., data = training, type = "eps-regression", kernel = "linear", cost = 1, gamma = 0.1)
print(mod.svm)
##
## Call:
## svm(formula = target ~ ., data = training, type = "eps-regression",
##     kernel = "linear", cost = 1, gamma = 0.1)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 1
## gamma: 0.1
## epsilon: 0.1
##
##
## Number of Support Vectors: 18438
set.seed(073)
test_pred_svm <- predict(mod.svm ,x_test)
RMSE_tree <- sqrt(mean((test_pred_svm - y_test)^2))
RMSE_tree
## [1] 22.90203
MAE_tree <- mean(abs(test_pred_svm - y_test))
MAE_tree
## [1] 18.82682
test[, 'svmpred'] <- predict(mod.svm, test)
head(test)
## track_popularity danceability energy key loudness mode speechiness
## 31 56 0.641 0.869 11 -4.754 1 0.0423
## 36 55 0.748 0.831 1 -5.029 1 0.1150
## 38 63 0.633 0.854 0 -4.046 0 0.0432
## 39 67 0.563 0.810 2 -2.921 1 0.0522
## 47 56 0.789 0.893 10 -4.364 0 0.1680
## 48 57 0.640 0.838 6 -4.203 1 0.0416
## acousticness instrumentalness liveness valence tempo duration_ms
## 31 0.03190 1.31e-03 0.4030 0.358 128.091 178125
## 36 0.08230 7.36e-05 0.0757 0.894 128.024 185273
## 38 0.03820 2.83e-05 0.4340 0.659 126.026 172360
## 39 0.00522 0.00e+00 0.0846 0.495 129.975 247385
## 47 0.05380 0.00e+00 0.2210 0.410 121.956 144073
## 48 0.05880 1.99e-05 0.0424 0.587 124.081 193861
## linpred svmpred
## 31 40.26702 42.66323
## 36 43.51167 46.70114
## 38 41.06565 43.96005
## 39 41.95346 45.33414
## 47 41.35510 43.93442
## 48 43.10152 46.15517
The root mean squared error (RMSE) of this SVM is 22.90203 and the mean absolute error (MAE) is 18.82682. Since the RMSE is close to the MAE, the errors are of a fairly uniform size rather than being dominated by a few large ones. Hence this model can be improved further for predicting track popularity accurately.
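For reference, such an improvement could start with a grid search over the cost parameter using e1071’s tune() function; a sketch under the assumption that enough compute is available (the grid values below are our own illustrative choices):

set.seed(073)
# grid search over cost with cross-validation (tune()'s default resampling scheme)
tune_out <- tune(svm, target ~ ., data = training,
                 kernel = "linear",
                 ranges = list(cost = c(0.1, 1, 10)))
summary(tune_out)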
Random forest is a supervised learning algorithm that uses an ensemble learning method for classification and regression. It operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
set.seed(073)
rf <-randomForest(target~.,data=training, mtry=2, importance=TRUE, ntree = 448)
print(rf)
##
## Call:
## randomForest(formula = target ~ ., data = training, mtry = 2, importance = TRUE, ntree = 448)
## Type of random forest: regression
## Number of trees: 448
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 499.4887
## % Var explained: 8.65
test_pred <- predict(rf,x_test)
RMSE_tree1 <- sqrt(mean((test_pred - y_test)^2))
RMSE_tree1
## [1] 22.55016
MAE_tree1 <- mean(abs(test_pred - y_test))
MAE_tree1
## [1] 18.71556
test[, 'rfpred'] <- predict(rf, test)
head(test)
## track_popularity danceability energy key loudness mode speechiness
## 31 56 0.641 0.869 11 -4.754 1 0.0423
## 36 55 0.748 0.831 1 -5.029 1 0.1150
## 38 63 0.633 0.854 0 -4.046 0 0.0432
## 39 67 0.563 0.810 2 -2.921 1 0.0522
## 47 56 0.789 0.893 10 -4.364 0 0.1680
## 48 57 0.640 0.838 6 -4.203 1 0.0416
## acousticness instrumentalness liveness valence tempo duration_ms
## 31 0.03190 1.31e-03 0.4030 0.358 128.091 178125
## 36 0.08230 7.36e-05 0.0757 0.894 128.024 185273
## 38 0.03820 2.83e-05 0.4340 0.659 126.026 172360
## 39 0.00522 0.00e+00 0.0846 0.495 129.975 247385
## 47 0.05380 0.00e+00 0.2210 0.410 121.956 144073
## 48 0.05880 1.99e-05 0.0424 0.587 124.081 193861
## linpred svmpred rfpred
## 31 40.26702 42.66323 35.28330
## 36 43.51167 46.70114 40.08330
## 38 41.06565 43.96005 40.56842
## 39 41.95346 45.33414 41.60759
## 47 41.35510 43.93442 40.85093
## 48 43.10152 46.15517 45.20850
We could further tune the optimization parameters of the random forest model, but due to computational limitations we were unable to do so. The mean of squared residuals is 499.4887 and the percentage of variance explained is 8.65. The RMSE for this model is 22.55016 and the MAE is 18.71556, which indicates that the model is not very suitable for accurate predictions.
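As a pointer, one inexpensive tuning step would be searching over mtry with randomForest’s tuneRF() helper; a sketch assuming the x_train and y_target objects created during the split above (the ntreeTry, stepFactor and improve values are our own illustrative choices):

set.seed(073)
# search for a better mtry value on the training predictors and response
tuned_mtry <- tuneRF(x_train, y_target, ntreeTry = 100, stepFactor = 1.5, improve = 0.01)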
Hence this model could be improved considerably using various optimization techniques. For data that include categorical variables with different numbers of levels, random forests are biased in favour of attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data, which can be the case for this particular dataset since it contains several categorical variables with many levels.
varImpPlot(rf)
The varImpPlot() output gives the importance of each variable with respect to two measures: %IncMSE (the increase in mean squared error when that variable’s values are permuted) and IncNodePurity (the total decrease in node impurity from splits on that variable).
df1 <- data.frame(Model = c("LinearModel", "SVM", "RF"), values = c(RMSE, RMSE_tree, RMSE_tree1))
p <- ggplot(df1, mapping = aes(x = Model, y = values, fill = Model)) +
  geom_bar(stat = "identity", width = 0.5) + theme_classic(base_size = 26)
p + scale_y_continuous(expand = c(0, 0))
From this graph, it can be inferred that the RMSE values for the three models are very similar, with random forest marginally the lowest. However, for the sample of test predictions shown earlier, the SVM predictions were closest to the actual popularity values.
Problem Statement: This project aims to explore the Spotify dataset and predict the popularity of a soundtrack based on its characteristics, such as valence, tempo and energy.
Methodology: We have performed exploratory data analysis to uncover hidden trends from the data. Then we have built supervised learning models to predict the popularity of a track. These are multiple linear regression, SVM and random forest. We have analyzed and compared them on the basis of RMSE values and predictions made by these models.
Insights: Some of the interesting insights gained from EDA are:
Pop has the highest mean popularity score and emerges as the most popular genre, with post-teen pop and dance pop the most popular sub-genres.
Trevor Daniel, Y2K, Ant Saunders, Don Toliver and Kina are the artists with the highest average track popularity.
Tusa has the highest popularity rating among individual tracks.
Loudness and energy show a moderately positive correlation (0.68), while acousticness and energy show a moderately negative correlation (-0.55).
Other insights have been covered in detail in the EDA section.
Implications: Our analysis can help consumers learn about popular trends and find playlists with the most popular soundtracks. The predictive models can also be useful for the Spotify business as well as for music creators, as they indicate the factors on which popularity depends. These factors may not be known explicitly, but analysis of the tracks makes them clearer.
Limitations: The predictive models proposed in this project cannot yet be used to predict popularity accurately. They need further optimization, but we were bound by the technical limitations of our systems.
By proposing these models, we wanted to get a general idea of how to build predictive models for this dataset and, eventually, to predict accurately how popular a track will be based on its features; optimization of these models is part of the future scope of this project.