Spotify is one of the largest digital music, podcast, and video streaming services in the world, giving access to millions of songs and other content from artists all over the world. Not only does Spotify give us access to good songs everywhere (at work, at home, in the car), it has also introduced us to artists we would never have listened to before and to genres we had never experienced. Spotify uses advanced technology to track and identify each song uploaded to its platform.
The Spotify database provides an interesting look into its listening data. It records not just the popularity of tracks but also audio features of the tracks in its library. In this project, we analyze a track’s popularity based on several audio features provided in the dataset to answer the question ‘Can we predict a track’s popularity from key features of the song?’ We also attempt a custom analysis based on a user’s listening profile, which could enable Spotify to stock more tracks similar to its hits and to let go of songs that are not popular among listeners.
Consider yourself a Spotify user. As a Spotify user, wouldn’t you be impressed to get a list of the most popular songs tailored to your taste without having to search for them manually? And wouldn’t you be a happy, returning customer if you kept getting a list of the latest top music, divided by genre, with easy access to recent releases? Our goal is to provide easy access to popular songs, and that is why you, as a Spotify user, should be interested.
We will study whether there is a relation between popularity and the different audio features and genres. We will run a clustering analysis using the K-means method to provide song recommendations based on recent user listening on Spotify, and we will try to reduce dimensionality using Principal Component Analysis (PCA). We want to check whether the available variables in the dataset contain a pattern that supports clustering, and how the clusters differ, so that we can identify the most and least popular clusters.
K-means clustering will provide information about customer listening behavior, which could help Spotify upsell, cross-sell, or combine both to increase profit. Using k-means clustering, we want to study whether cluster membership is related to track popularity.
With PCA, we want to reduce the dimensionality of the dataset to help prevent our model from overfitting. This should make our future model better suited to generalizing beyond the training set. We will also analyze how differences in features like acousticness, liveness, loudness, energy, instrumentalness, and valence affect a track’s popularity, and we may remove features that do not contribute to the analysis. Models also become more efficient with a reduced feature set, since removing redundant features speeds up training and lowers computation costs.
The consumer of our analysis would be Spotify’s programming team. Our analysis could enable the team to upsell, cross-sell, or combine both to increase profit. A better understanding of the different clusters could enable Spotify to target content distribution more precisely, leading to a reduced churn rate.
For example, if the team knows that 30% of customers who listen to track A also listen to track B, Spotify can market track B to customers shortly after they listen to track A, both to speed up that process and to capture those who might not otherwise have considered listening to track B. Customers who do not yet know track B are also likely to appreciate the suggestion. This is how our analysis would help Spotify provide better services to its consumers and stay ahead of the curve.
library(tidyverse)
library(dplyr)
library(corrplot)
The packages that we will be using for our analysis are:
tidyverse – The packages under the tidyverse umbrella help us work with the data: subsetting, transforming, visualizing, etc.
dplyr – One of the most widely used R packages for data manipulation. We use the pipe operator “%>%” to chain functions together. We also use dplyr for several other manipulations of the dataset, such as selecting columns (select()), grouping observations without changing the original dataset (group_by()), and creating new columns while preserving the existing variables (mutate()).
corrplot – This package is used to visualize the correlation between different variables.
factoextra – for PCA and k-means (to be covered in detail in the final project)
The dataset used for this project is the Spotify song list prepared by Zaheen Hamidani, which we obtained from Kaggle (https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db).
Originally, the dataset was created by Zaheen Hamidani and uploaded to Kaggle in July 2019. Alternatively, the data can be obtained through the spotifyr R package (version 2.1.1). Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get data and general metadata about songs from Spotify’s API. It lets you enter an artist’s name and retrieve their entire discography (the collection of all their songs) in seconds, along with Spotify’s audio features and track/album popularity metrics.
The primary purpose of the data was to analyze the relationship between valence and the other measures that the Spotify API provides for every track. Approximately 10,000 songs were selected per genre, and there are 26 genres. However, the same data can also be used to compute other statistics and obtain other useful information.
There is little that is peculiar about the data. It is moderately clean, with only 15 missing values. Since every track is unique in some sense, we did not impute the missing values and simply removed the affected rows.
First, we load the Spotify songs dataset into R to start the analysis. The dataset is imported using the read.csv function and saved as ‘spotify’.
spotify <- read.csv("spotify_songs.csv")
glimpse(spotify)
## Observations: 32,833
## Variables: 23
## $ track_id <fct> 6f807x0ima9a1j3VPbc7VN, 0r7CVbZTWZgbT...
## $ track_name <fct> I Don't Care (with Justin Bieber) - L...
## $ track_artist <fct> Ed Sheeran, Maroon 5, Zara Larsson, T...
## $ track_popularity <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 6...
## $ track_album_id <fct> 2oCs0DGTsRO98Gh5ZSl2Cx, 63rPSO264uRjW...
## $ track_album_name <fct> I Don't Care (with Justin Bieber) [Lo...
## $ track_album_release_date <fct> 2019-06-14, 2019-12-13, 2019-07-05, 2...
## $ playlist_name <fct> Pop Remix, Pop Remix, Pop Remix, Pop ...
## $ playlist_id <fct> 37i9dQZF1DXcZDD7cfEKhW, 37i9dQZF1DXcZ...
## $ playlist_genre <fct> pop, pop, pop, pop, pop, pop, pop, po...
## $ playlist_subgenre <fct> dance pop, dance pop, dance pop, danc...
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0....
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0....
## $ key <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, ...
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.67...
## $ mode <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.035...
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0...
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-0...
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.083...
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0....
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 12...
## $ duration_ms <int> 194754, 162600, 176616, 169093, 18905...
Our dataset has 32,833 observations and 23 variables.
colSums(is.na(spotify))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
# Remove rows with missing values
spotify <- na.omit(spotify)
# Keep only the first occurrence of each track
spotify <- spotify[!duplicated(spotify$track_id), ]
# Treat genre, subgenre, mode and key as categorical variables
spotify <- spotify %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key),
         loudness = as.numeric(loudness))
# Convert track duration from milliseconds to minutes
spotify <- spotify %>% mutate(duration_min = duration_ms / 60000)
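# Bin popularity into four groups: 1 = (0, 20), 2 = [20, 40), 3 = [40, 60), 4 = everything else.
# Note: a popularity of exactly 0 fails the first condition (> 0) and falls through to group "4".
spotify <- spotify %>%
  mutate(popularity_group = as.factor(case_when(
    (track_popularity > 0) & (track_popularity < 20) ~ "1",
    (track_popularity >= 20) & (track_popularity < 40) ~ "2",
    (track_popularity >= 40) & (track_popularity < 60) ~ "3",
    TRUE ~ "4"
  )))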
table(spotify$popularity_group)
##
## 1 2 3 4
## 4182 6162 8975 9033
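If the intent is for tracks with a popularity of 0 to belong to the lowest group rather than group 4, one possible alternative is base R's cut(). The sketch below is only an illustration under that assumption: bin_popularity is a hypothetical helper, the scores vector is toy data, and the group counts shown above were not produced with it.

```r
# Hypothetical helper (not part of the original analysis):
# bin a 0-100 popularity score into four groups, with 0 assigned to group "1".
bin_popularity <- function(popularity) {
  cut(popularity,
      breaks = c(0, 20, 40, 60, 100),
      labels = c("1", "2", "3", "4"),
      right = FALSE,          # intervals are [0, 20), [20, 40), [40, 60), [60, 100)
      include.lowest = TRUE)  # with right = FALSE this also admits the top value, 100
}

# Toy scores covering every boundary
scores <- c(0, 19, 20, 39, 40, 59, 60, 100)
groups <- bin_popularity(scores)
print(data.frame(scores, groups))
```

In the pipeline above this could be used as mutate(popularity_group = bin_popularity(track_popularity)), which would change the group counts.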
spotify <- spotify %>% select(-c(track_id, track_album_id, playlist_id))
summary(spotify)
## track_name track_artist track_popularity
## Breathe : 18 Queen : 130 Min. : 0.00
## Paradise: 17 Martin Garrix : 87 1st Qu.: 21.00
## Poison : 16 Don Omar : 84 Median : 42.00
## Alive : 15 David Guetta : 81 Mean : 39.34
## Forever : 14 Dimitri Vegas & Like Mike: 68 3rd Qu.: 58.00
## Stay : 14 Drake : 68 Max. :100.00
## (Other) :28258 (Other) :27834
## track_album_name track_album_release_date
## Greatest Hits : 135 2020-01-10: 201
## Ultimate Freestyle Mega Mix: 42 2013-01-01: 189
## Gold : 34 2019-11-22: 185
## Rock & Rios (Remastered) : 29 2019-12-06: 184
## Asian Dreamer : 20 2019-11-15: 183
## Trip Stories : 20 2008-01-01: 176
## (Other) :28072 (Other) :27234
## playlist_name playlist_genre
## Indie Poptimism : 294 edm :4877
## Permanent Wave : 223 latin:4136
## Hard Rock Workout : 211 pop :5132
## Southern Hip Hop : 174 r&b :4504
## post teen pop : 159 rap :5398
## Urban Contemporary: 157 rock :4305
## (Other) :27134
## playlist_subgenre danceability energy
## southern hip hop : 1582 Min. :0.0000 Min. :0.000175
## indie poptimism : 1547 1st Qu.:0.5610 1st Qu.:0.579000
## neo soul : 1478 Median :0.6700 Median :0.722000
## progressive electro house: 1460 Mean :0.6534 Mean :0.698372
## electro house : 1416 3rd Qu.:0.7600 3rd Qu.:0.843000
## gangster rap : 1314 Max. :0.9830 Max. :1.000000
## (Other) :19555
## key loudness mode speechiness
## 1 : 3436 Min. :-46.448 0:12318 Min. :0.0000
## 0 : 3001 1st Qu.: -8.310 1:16034 1st Qu.:0.0410
## 7 : 2907 Median : -6.261 Median :0.0626
## 9 : 2631 Mean : -6.818 Mean :0.1079
## 11 : 2577 3rd Qu.: -4.709 3rd Qu.:0.1330
## 2 : 2478 Max. : 1.275 Max. :0.9180
## (Other):11322
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0143 1st Qu.:0.0000000 1st Qu.:0.0926 1st Qu.:0.3290
## Median :0.0797 Median :0.0000206 Median :0.1270 Median :0.5120
## Mean :0.1772 Mean :0.0911294 Mean :0.1910 Mean :0.5104
## 3rd Qu.:0.2600 3rd Qu.:0.0065725 3rd Qu.:0.2490 3rd Qu.:0.6950
## Max. :0.9940 Max. :0.9940000 Max. :0.9960 Max. :0.9910
##
## tempo duration_ms duration_min popularity_group
## Min. : 0.00 Min. : 4000 Min. :0.06667 1:4182
## 1st Qu.: 99.97 1st Qu.:187741 1st Qu.:3.12902 2:6162
## Median :121.99 Median :216933 Median :3.61555 3:8975
## Mean :120.96 Mean :226575 Mean :3.77624 4:9033
## 3rd Qu.:134.00 3rd Qu.:254975 3rd Qu.:4.24959
## Max. :239.44 Max. :517810 Max. :8.63017
##
Exploratory data analysis (EDA) is one of the best ways to uncover information in the data that is not self-evident, and it helps us make sense of the data. Before performing a formal analysis, it is essential to explore the dataset; a model built without proper EDA is hard to trust. EDA helps us understand the patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables. We have used histograms, boxplots, and a correlation plot for this purpose.
df1 <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
corrplot(cor(df1))
The plot shows that popularity is not strongly correlated with any of the other track features. However, some features are strongly correlated with each other, which indicates multicollinearity in the dataset; the data might therefore not be suitable for various classification algorithms without further treatment.
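To read the strongly correlated pairs off the correlation matrix rather than off the plot, something along the following lines could be used. This is a minimal sketch: strong_pairs is a hypothetical helper, the 0.6 cutoff is an arbitrary choice for illustration, and the toy data frame merely stands in for df1.

```r
# Hypothetical helper: list variable pairs whose absolute Pearson
# correlation exceeds a cutoff (0.6 is an arbitrary illustrative value).
strong_pairs <- function(df, cutoff = 0.6) {
  cm <- cor(df)
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r = cm[idx],
             stringsAsFactors = FALSE)
}

# Toy data standing in for df1: loudness is built from energy, tempo is independent
set.seed(1)
energy <- runif(200)
toy <- data.frame(energy = energy,
                  loudness = 10 * energy + rnorm(200, sd = 0.5),
                  tempo = rnorm(200, mean = 120, sd = 20))
pairs_found <- strong_pairs(toy)
print(pairs_found)
```

On the real data the call would be strong_pairs(df1).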
The dataset also contains a genre variable that we can use to group the data. To keep the focus on the popularity variable, we look at the 5 genres with the highest average popularity. Within a genre, popularity is widely spread, from the lowest to the highest values. We will visualize the data:
hist(spotify$track_popularity)
The histogram above shows the distribution of track popularity. The distribution has a spike in the middle, in the range of 50-55.
genre_popularity <- spotify %>%
  select(track_popularity, playlist_genre) %>%
  group_by(playlist_genre) %>%
  summarise(average_popularity = round(mean(track_popularity)))
ggplot(data = genre_popularity,
       mapping = aes(x = reorder(playlist_genre, average_popularity),
                     y = average_popularity,
                     fill = playlist_genre)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(y = "Average popularity", x = "Genre")
hist(spotify$instrumentalness, main = 'Histogram of Instrumentalness', xlab = '', breaks = 10)
We see that the majority of observations (85.4%) have an instrumentalness value no larger than 0.1, which is why the mean of instrumentalness is so much larger than its median.
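A share like the one quoted above can be computed directly from the column. The sketch below uses a small made-up vector instead of spotify$instrumentalness, and share_at_most is a hypothetical helper name.

```r
# Hypothetical helper: percentage of values at or below a threshold
share_at_most <- function(x, threshold) {
  mean(x <= threshold, na.rm = TRUE) * 100
}

# Toy values standing in for spotify$instrumentalness (mostly near zero, a few large)
toy_instr <- c(0, 0, 0.00002, 0.004, 0.05, 0.1, 0.3, 0.6, 0.9, 0.95)

pct_low <- share_at_most(toy_instr, 0.1)
print(pct_low)  # 6 of the 10 toy values are <= 0.1

# A few large values pull the mean well above the median
print(c(mean = mean(toy_instr), median = median(toy_instr)))
```

On the real data the call would be share_at_most(spotify$instrumentalness, 0.1).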
hist(spotify$duration_min, main = 'Histogram of Duration in min', xlab = 'Duration in min', breaks = 10)
We see that the majority of songs have a duration of about 3-4 minutes, and that longer songs occur less frequently in the dataset.
Let’s check the loudness distribution corresponding to each genre.
boxplot(spotify$loudness~spotify$playlist_genre, main = 'Boxplot for each genre', xlab = 'Genre', ylab = 'Loudness')
Loudness: the minimum is -46.448, while the first quartile is -8.310. The Latin genre has relatively more outliers at the low end than the other genres. This minimum may therefore be a genuine value, since the Latin genre appears to have a more left-skewed distribution than the other genres.
We will use a correlation plot to analyze how popularity is correlated with the various track features. From this graph, we can also check for multicollinearity among the variables and determine whether the dataset is suitable for various classification algorithms.
We will use histograms of the track features to analyze how the data are distributed. This should help us identify patterns in the dataset such as skewness or spikes. With this information, we can get a first idea of which factors may influence track popularity more than others.
Before we use k-means for clustering, we will draw boxplots of popularity against factor variables such as key and genre to analyze whether they have any significant relation to popularity. This should assist the Spotify programming team in adding more popular tracks to the database. For instance, if we find that tracks in the Pop genre and in the key of A# are more popular than the others, adding such tracks could lead to increased customer satisfaction.
R offers various powerful functions, tools, visualization plots, and packages that can help us dive deeper into our analysis and build models using ML techniques.
We still want to explore the ggplot2 package further. For the final project, we also intend to use R Shiny for better presentation.
Yes, we plan on incorporating the following machine learning techniques to answer our questions:
We will use linear regression to determine which predictor variables are significant for popularity. By adding interaction terms between certain variables, such as energy and loudness or acousticness and instrumentalness, we will observe whether the model performance improves.
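As a rough illustration of how a model with and without an interaction term could be compared, here is a sketch on simulated data. The coefficients used to simulate the toy response are invented purely for demonstration and say nothing about the real Spotify data; only the column names are borrowed from the dataset.

```r
# Simulated toy data (assumption: invented coefficients, for illustration only)
set.seed(42)
n <- 300
toy <- data.frame(energy = runif(n),
                  loudness = runif(n, min = -20, max = 0))
toy$track_popularity <- 40 + 10 * toy$energy + 0.5 * toy$loudness +
  2 * toy$energy * toy$loudness + rnorm(n, sd = 5)

# Main effects only
fit_main <- lm(track_popularity ~ energy + loudness, data = toy)
# energy * loudness expands to energy + loudness + energy:loudness
fit_int <- lm(track_popularity ~ energy * loudness, data = toy)

# F-test comparing the nested models, plus adjusted R-squared for each
print(anova(fit_main, fit_int))
print(c(main = summary(fit_main)$adj.r.squared,
        interaction = summary(fit_int)$adj.r.squared))
```

For the real analysis, the same two lm() calls would take data = spotify, with the other audio features added as further predictors.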
We will study whether there is a relation between popularity and the different audio features and genres. We will do this through a clustering analysis using the K-means method, with the aim of providing song recommendations based on recent user listening on Spotify.
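The clustering step could look roughly like the sketch below, which uses kmeans() from base R's stats package on two simulated groups of tracks. The choice of two clusters, the two features used, and the simulated values are all assumptions made for illustration.

```r
# Two simulated groups of tracks (toy data, not the Spotify dataset)
set.seed(123)
toy <- rbind(
  data.frame(energy = rnorm(50, mean = 0.8, sd = 0.05),
             acousticness = rnorm(50, mean = 0.1, sd = 0.05),
             track_popularity = rnorm(50, mean = 60, sd = 10)),
  data.frame(energy = rnorm(50, mean = 0.3, sd = 0.05),
             acousticness = rnorm(50, mean = 0.8, sd = 0.05),
             track_popularity = rnorm(50, mean = 35, sd = 10))
)

# Standardize the features so that no single scale dominates the distances
features <- scale(toy[, c("energy", "acousticness")])

# k = 2 is assumed here; nstart = 25 reruns the algorithm from several random starts
km <- kmeans(features, centers = 2, nstart = 25)
toy$cluster <- factor(km$cluster)

# Average popularity per cluster (popularity itself was not used for clustering)
cluster_popularity <- aggregate(track_popularity ~ cluster, data = toy, FUN = mean)
print(cluster_popularity)
```

Popularity is deliberately left out of the clustering features so that differences in average popularity between clusters can be read as a relation between audio profile and popularity.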
We will try to reduce dimensionality using Principal Component Analysis (PCA). We will check whether the available variables in the dataset contain a pattern that supports clustering, and study how the clusters differ so that we can identify the most and least popular clusters.
We will also try to apply association rules to help Spotify cross-sell and upsell.
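A minimal PCA sketch with base R's prcomp() is shown below, again on simulated data. The 90% cumulative-variance threshold for choosing the number of components is an assumption for illustration, not a rule taken from this project.

```r
# Toy data: loudness is nearly a linear function of energy, so it is largely redundant
set.seed(7)
n <- 200
energy <- runif(n)
toy <- data.frame(energy = energy,
                  loudness = 10 * energy + rnorm(n, sd = 0.5),
                  valence = runif(n),
                  tempo = rnorm(n, mean = 120, sd = 20))

# Center and scale so that every feature contributes on the same footing
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))

# Keep the smallest number of components reaching 90% cumulative variance (assumed threshold)
n_keep <- which(cumsum(var_explained) >= 0.9)[1]
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
print(dim(scores))
```

The retained scores could then be passed to k-means in place of the original features.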
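The standard measures behind an association rule such as "listeners of track A also listen to track B" are support, confidence, and lift. The sketch below computes them in base R on a handful of made-up listening sessions; rule_metrics is a hypothetical helper, and the current dataset does not contain per-user listening data, so the sessions are purely illustrative.

```r
# Made-up listening sessions: each element is the set of tracks one user played
sessions <- list(
  c("A", "B"), c("A", "B"), c("A", "B", "C"), c("A"),
  c("A", "C"), c("A"), c("A", "C"), c("A"),
  c("A"), c("A", "C"), c("B"), c("C")
)

# Hypothetical helper: support, confidence and lift for the rule lhs -> rhs
rule_metrics <- function(sessions, lhs, rhs) {
  has_lhs <- vapply(sessions, function(s) lhs %in% s, logical(1))
  has_rhs <- vapply(sessions, function(s) rhs %in% s, logical(1))
  has_both <- has_lhs & has_rhs
  support <- mean(has_both)                    # share of all sessions with both tracks
  confidence <- sum(has_both) / sum(has_lhs)   # share of lhs sessions that also contain rhs
  lift <- confidence / mean(has_rhs)           # confidence relative to the base rate of rhs
  c(support = support, confidence = confidence, lift = lift)
}

metrics <- rule_metrics(sessions, lhs = "A", rhs = "B")
print(round(metrics, 3))
```

In this toy example, 3 of the 10 sessions containing track A also contain track B, giving a confidence of 0.3. The lift is below 1, meaning that listening to A does not make B more likely than its base rate here.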