Spotify is a Swedish audio streaming and media services provider that launched in October 2008. It is now one of the biggest digital music, podcast, and video streaming services in the world, giving access to millions of songs from artists all over the world.
It is a freemium service: basic features are free, with advertisements and limited control, while additional features such as offline and commercial-free listening are offered via paid subscriptions. Users can search for music by artist, album, or genre, and can create, edit, and share playlists. Not only does Spotify give us access to good songs on multiple platforms, it has also exposed listeners to trending and upcoming artists from genres they had never experienced. Spotify uses very advanced technology to track and identify each song uploaded to its platform.
The Spotify dataset provides insight into which songs users listen to: it records not just the genre of the tracks in their libraries, but also the audio features of those tracks.
In this project, we analyze playlist genre against several audio features provided in the dataset and ask whether a playlist’s genre can be predicted from key features of a song.
We plan to analyze users’ listening profiles so that Spotify can suggest and acquire similar songs for its platform and improve the user experience.
The plan is to analyze the relationship between playlist genre and the different features of a song, and possibly later to use a classification algorithm that predicts song genre, so that recommendations can be based on a user’s recent listening on Spotify.
This is mainly useful for marketing to Spotify users and improving their experience. The analysis will help us better understand the genres of different songs and enable Spotify to target content distribution more precisely, which would help the developers and the marketing team analyze trends, segment users better, increase profits, and provide a better user experience.
tidyverse - which will provide us functionality to model, transform, and visualize data.
dplyr - used for data manipulation in R
ggplot2 - used for plotting charts
plotly - for web-based graphs via the open source JavaScript graphing library plotly.js for interactive charts
corrplot - for displaying correlation matrices and confidence intervals
factoextra - to visualize the output of multivariate data analysis
funModeling - Exploratory Data Analysis and Data Preparation Tool-Box
plyr - break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together
RColorBrewer - to help you choose sensible colour schemes for figures in R
library(plyr) # loaded before dplyr so that dplyr verbs such as mutate() are not masked
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(knitr)
library(RColorBrewer)
library(funModeling)
This section contains all the procedures we followed in preparing the data for analysis. Each step is explained alongside its code.
The dataset used for this project is the Spotify Genre dataset, which was provided in the course curriculum. More details about the dataset are provided below.
The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.
It’s likely that Spotify uses these features to power products like Spotify Radio and custom playlists like Discover Weekly and Daily Mixes.
After an initial look at the data, there is not much peculiarity in it. It is almost clean, with only 15 missing values. As every row is unique in some sense, we will not perform any imputation for missing values and will simply remove the affected rows instead.
First, the Spotify dataset is loaded into R to begin the analysis. The dataset is imported using the read_csv function from the readr package and saved as “spotify_data”.
spotify_data<-readr::read_csv('https://raw.githubusercontent.com/nairrj/DataWrangling/main/spotify_songs.csv')
Now, we’ll take a brief look at the dataset using the head and glimpse functions.
head(spotify_data)
## # A tibble: 6 x 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C~ Ed Sheeran 66 2oCs0DGTsRO98~
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories ~ Maroon 5 67 63rPSO264uRjW~
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T~ Zara Larsson 70 1HoSmj2eLcsrR~
## 4 75FpbthrwQmzHlBJLuGdC7 Call You ~ The Chainsm~ 60 1nqYsOef1yKKu~
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y~ Lewis Capal~ 69 7m7vv9wlQ4i0L~
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful~ Ed Sheeran 67 2yiy9cd2QktrN~
## # ... with 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
glimpse(spotify_data)
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa~
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux~
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th~
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6~
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6~
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu~
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20~
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R~
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf~
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po~
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance~
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4~
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8~
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,~
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38~
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, ~
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127~
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, ~
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e~
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143~
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1~
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1~
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304~
colnames(spotify_data)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
dim(spotify_data)
## [1] 32833 23
Our dataset has 32,833 observations and 23 variables.
We observe that three columns, track_name, track_artist and track_album_name, have 5 NAs each. This information was retrieved using the colSums function.
I then removed the corresponding observations using the na.omit function.
colSums(is.na(spotify_data))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
spotify_data <- na.omit(spotify_data)
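The effect of na.omit can be checked on a small example before it is applied to the full dataset. The sketch below uses a made-up three-row data frame; the name toy and all of its values are assumptions for illustration, not rows from the Spotify data.

```r
# Toy data frame (hypothetical values) with one missing track name
toy <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist 1", "Artist 2", "Artist 3"),
  energy       = c(0.91, 0.45, 0.72),
  stringsAsFactors = FALSE
)

# Count missing values per column, in the same way as for spotify_data
na_counts <- colSums(is.na(toy))

# na.omit drops every row that contains at least one NA
toy_clean <- na.omit(toy)

# Number of rows removed; equals the number of incomplete rows
rows_removed <- nrow(toy) - nrow(toy_clean)
```

Comparing rows_removed with sum(!complete.cases(toy)) is a quick way to confirm that only incomplete rows were dropped.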
I will now filter for unique tracks by removing all duplicate tracks using the duplicated function.
spotify_data <- spotify_data[!duplicated(spotify_data$track_id),]
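One consequence of filtering with duplicated is worth seeing on a small example: the function flags only the second and later occurrences, so the first row for each track_id is the one that survives. The data frame below is hypothetical and exists only to illustrate this.

```r
# Toy data (hypothetical): track "a1" appears in two playlists
toy <- data.frame(
  track_id       = c("a1", "b2", "a1", "c3"),
  playlist_genre = c("pop", "rap", "edm", "rock"),
  stringsAsFactors = FALSE
)

# TRUE marks rows whose track_id has already been seen further up
dup_flag <- duplicated(toy$track_id)

# Keep only the first occurrence of every track_id
toy_unique <- toy[!dup_flag, ]
```

Here track "a1" keeps the genre "pop" from its first row. In the same way, a song that sits in playlists of several genres is assigned the genre of whichever playlist is listed first.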
I converted genre, subgenre, mode and key to factors to facilitate the analysis, based on the values those fields contain.
spotify_data <- spotify_data %>%
  mutate(playlist_genre = as.factor(playlist_genre),
         playlist_subgenre = as.factor(playlist_subgenre),
         mode = as.factor(mode),
         key = as.factor(key))
We convert duration_ms to a duration in minutes (duration_min), since minutes are easier to interpret in the analysis.
spotify_data <- spotify_data %>% mutate(duration_min = duration_ms/60000)
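As a worked check of the conversion, the sketch below applies it to the first three duration_ms values shown in the glimpse output. The minutes:seconds label is an extra added here for readability and is not part of the original analysis.

```r
# First three duration_ms values from the glimpse output
duration_ms <- c(194754, 162600, 176616)

# One minute is 60,000 milliseconds
duration_min <- duration_ms / 60000

# Optional human-readable label: whole minutes and whole seconds (rounded down)
mins  <- duration_ms %/% 60000
secs  <- (duration_ms %% 60000) %/% 1000
label <- sprintf("%d:%02d", as.integer(mins), as.integer(secs))
```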
To explore the distribution of popularity, we create a new variable, Like, that divides the tracks into two groups for later cluster analysis: 1 for a track_popularity of at most 55 and 2 for a track_popularity above 55.
spotify_data <- spotify_data %>%
  mutate(Like = case_when(
    track_popularity <= 55 ~ 1, # lower-popularity group
    track_popularity > 55 ~ 2   # higher-popularity group
  ))
table(spotify_data$Like)
##
## 1 2
## 20159 8193
We have removed track_id, track_album_id and playlist_id from the dataset since they are not useful for our analysis. These IDs are in the Spotify dataset only to uniquely identify tracks in the database.
spotify_data <- spotify_data %>% select(-c(track_id, track_album_id, playlist_id))
summary(spotify_data)
## track_name track_artist track_popularity track_album_name
## Length:28352 Length:28352 Min. : 0.00 Length:28352
## Class :character Class :character 1st Qu.: 21.00 Class :character
## Mode :character Mode :character Median : 42.00 Mode :character
## Mean : 39.34
## 3rd Qu.: 58.00
## Max. :100.00
##
## track_album_release_date playlist_name playlist_genre
## Length:28352 Length:28352 edm :4877
## Class :character Class :character latin:4136
## Mode :character Mode :character pop :5132
## r&b :4504
## rap :5398
## rock :4305
##
## playlist_subgenre danceability energy
## southern hip hop : 1582 Min. :0.0000 Min. :0.000175
## indie poptimism : 1547 1st Qu.:0.5610 1st Qu.:0.579000
## neo soul : 1478 Median :0.6700 Median :0.722000
## progressive electro house: 1460 Mean :0.6534 Mean :0.698372
## electro house : 1416 3rd Qu.:0.7600 3rd Qu.:0.843000
## gangster rap : 1314 Max. :0.9830 Max. :1.000000
## (Other) :19555
## key loudness mode speechiness acousticness
## 1 : 3436 Min. :-46.448 0:12318 Min. :0.0000 Min. :0.0000
## 0 : 3001 1st Qu.: -8.310 1:16034 1st Qu.:0.0410 1st Qu.:0.0143
## 7 : 2907 Median : -6.261 Median :0.0626 Median :0.0797
## 9 : 2631 Mean : -6.818 Mean :0.1079 Mean :0.1772
## 11 : 2577 3rd Qu.: -4.709 3rd Qu.:0.1330 3rd Qu.:0.2600
## 2 : 2478 Max. : 1.275 Max. :0.9180 Max. :0.9940
## (Other):11322
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000000 1st Qu.:0.0926 1st Qu.:0.3290 1st Qu.: 99.97
## Median :0.0000207 Median :0.1270 Median :0.5120 Median :121.99
## Mean :0.0911294 Mean :0.1910 Mean :0.5104 Mean :120.96
## 3rd Qu.:0.0065725 3rd Qu.:0.2490 3rd Qu.:0.6950 3rd Qu.:134.00
## Max. :0.9940000 Max. :0.9960 Max. :0.9910 Max. :239.44
##
## duration_ms duration_min Like
## Min. : 4000 Min. :0.06667 Min. :1.000
## 1st Qu.:187741 1st Qu.:3.12902 1st Qu.:1.000
## Median :216933 Median :3.61555 Median :1.000
## Mean :226575 Mean :3.77624 Mean :1.289
## 3rd Qu.:254975 3rd Qu.:4.24959 3rd Qu.:2.000
## Max. :517810 Max. :8.63017 Max. :2.000
##
head(spotify_data)
## # A tibble: 6 x 22
## track_name track_artist track_popularity track_album_name track_album_rel~
## <chr> <chr> <dbl> <chr> <chr>
## 1 I Don't Car~ Ed Sheeran 66 I Don't Care (wi~ 2019-06-14
## 2 Memories - ~ Maroon 5 67 Memories (Dillon~ 2019-12-13
## 3 All the Tim~ Zara Larsson 70 All the Time (Do~ 2019-07-05
## 4 Call You Mi~ The Chainsmo~ 60 Call You Mine - ~ 2019-07-19
## 5 Someone You~ Lewis Capaldi 69 Someone You Love~ 2019-03-05
## 6 Beautiful P~ Ed Sheeran 67 Beautiful People~ 2019-07-11
## # ... with 17 more variables: playlist_name <chr>, playlist_genre <fct>,
## # playlist_subgenre <fct>, danceability <dbl>, energy <dbl>, key <fct>,
## # loudness <dbl>, mode <fct>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>, duration_min <dbl>, Like <dbl>
Each row represents one song, and the columns contain the attributes of each song. The attributes are as follows:
Exploratory Data Analysis (EDA) helps us uncover useful information that is not self-evident in the data, but only if it is done correctly.
EDA is essential before we start to build a model on the data.
With EDA we can understand the patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
I have used a correlation plot, histograms and boxplots in my EDA.
corr_spotify <- select(spotify_data, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
corrplot(cor(corr_spotify), type="lower")
Prior to any model creation, it is good practice to check for multicollinearity, that is, strong correlation among the independent features in the dataset. Based on the correlation plot, we conclude that there is no multicollinearity among these features.
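A visual reading of the correlation plot can be backed up by listing every feature pair whose absolute correlation exceeds a chosen cutoff. The sketch below does this on simulated data: the columns a, b and c, and the cutoff of 0.8, are assumptions made for illustration, and the cutoff in particular is a judgment call rather than a fixed rule.

```r
set.seed(1)
n <- 200
a <- rnorm(n)

# Simulated features: b is built to be almost collinear with a, c is independent
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),
  c = rnorm(n)
)

cor_mat   <- cor(toy)
threshold <- 0.8 # assumed cutoff for a "high" correlation

# Look only at the upper triangle so that each pair is reported once
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

flagged <- data.frame(
  feature_1 = rownames(cor_mat)[high[, "row"]],
  feature_2 = colnames(cor_mat)[high[, "col"]],
  r         = cor_mat[high]
)
```

Running the same steps on cor(corr_spotify) would return an empty flagged table if no pair of audio features crosses the cutoff, and otherwise names the pairs that deserve a closer look.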
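Next, we analyze the distribution of the audio features using the plot_num function, which plots only numeric variables.
# keep the numeric audio features by name rather than by column position
spotify_histograms <- spotify_data %>% select(danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_min)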
plot_num(spotify_histograms)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
From the histograms, we can observe that:
boxplot(energy~playlist_genre, data = spotify_data,
main = "Variation:- Energy and Genre",
xlab = "Energy",
ylab = "Genre",
col = "green",
border = "blue",
horizontal = TRUE,
notch = TRUE
)
The plot shows that the EDM genre has the songs with the highest energy.
boxplot(danceability~playlist_genre, data = spotify_data,
main = "Variation:- Danceability and Genre",
xlab = "Danceability",
ylab = "Genre",
col = "green",
border = "blue",
horizontal = TRUE,
notch = TRUE
)
As seen in the graph, the Rap genre has the highest danceability.
boxplot(liveness~playlist_genre, data = spotify_data,
main = "Variation:- Liveness and Genres",
xlab = "Liveness",
ylab = "Genre",
col = "green",
border = "blue",
horizontal = TRUE,
notch = TRUE
)
This plot shows how liveness varies from genre to genre.
boxplot(valence~playlist_genre, data = spotify_data,
main = "Variation:- Valence and Genre",
xlab = "Valence",
ylab = "Genre",
col = "green",
border = "blue",
horizontal = TRUE,
notch = TRUE
)
As seen above, the Latin genre has a higher valence than the others.
boxplot(loudness~playlist_genre, data = spotify_data,
main = "Variation:- Loudness and Genre",
xlab = "Loudness",
ylab = "Genre",
col = "green",
border = "blue",
horizontal = TRUE,
notch = TRUE
)
Loudness is fairly similar across genres; only songs in the EDM genre are a bit louder than the rest.
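Readings taken from boxplots can be cross-checked numerically, because the centre line of each box is the group median. The sketch below computes per-genre medians with tapply on a small hypothetical data frame; the values are made up and do not come from the Spotify data.

```r
# Toy data (hypothetical energy values for three genres)
toy <- data.frame(
  playlist_genre = c("edm", "edm", "edm", "rock", "rock", "rock", "pop", "pop", "pop"),
  energy         = c(0.90, 0.85, 0.95, 0.70, 0.75, 0.65, 0.60, 0.72, 0.68)
)

# Median energy per genre: the value drawn as the centre line of each box
med <- tapply(toy$energy, toy$playlist_genre, median)

# Rank genres from highest to lowest median
med_sorted <- sort(med, decreasing = TRUE)
top_genre  <- names(med_sorted)[1]
```

Replacing toy with spotify_data, and energy with danceability, liveness, valence or loudness, gives a table of medians to set beside each of the boxplots.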
spotify_data$liveness.scale <- scale(spotify_data$liveness)
spotify_data$tempo.scale <- scale(spotify_data$tempo)
spotify_data %>%
select(tempo.scale, liveness.scale, playlist_genre) %>%
group_by(playlist_genre) %>%
filter(!is.na(tempo.scale)) %>%
filter(!is.na(liveness.scale)) %>%
ggplot(mapping = aes(x = tempo.scale, y = liveness.scale, color = playlist_genre, fill = playlist_genre)) +
geom_bar(stat = 'identity') +
coord_polar() +
theme_dark() +
theme(legend.position = "top")
As visible in the plot, tempo is much higher for the EDM genre than for the others, while liveness is almost uniformly distributed across all genres.
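The scale function used for tempo.scale and liveness.scale standardizes a variable: it subtracts the mean and divides by the standard deviation. The sketch below verifies this on four hypothetical tempo values and also shows that scale returns a one-column matrix rather than a plain vector.

```r
# Hypothetical tempo values in beats per minute
x <- c(100, 120, 140, 160)

# Standardization by hand: centre on the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() performs the same computation but returns a one-column matrix
z_scale <- scale(x)

# as.numeric() drops the matrix structure and the scaling attributes
z_vec <- as.numeric(z_scale)
```

Wrapping the call as as.numeric(scale(...)) is a common way to store the result as an ordinary numeric column.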
spotify_data$energy_only <- cut(spotify_data$energy, breaks = 10)
spotify_data %>%
ggplot( aes(x = energy_only )) +
geom_bar(width = 0.2, fill = "#FF9999", colour = "black") +
scale_x_discrete(name = "Energy")
This plot shows that most tracks in the dataset fall into the higher energy bins, which suggests that higher-energy songs are popular among Spotify listeners.
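To make clear what the bars measure, the sketch below repeats the binning step on ten hypothetical energy values, using 5 bins instead of 10 to keep the output small. cut splits the observed range into equal-width intervals, and table counts how many tracks fall into each one, which is the bar height drawn by geom_bar.

```r
# Hypothetical energy values
energy <- c(0.05, 0.15, 0.55, 0.62, 0.71, 0.78, 0.83, 0.88, 0.93, 0.99)

# Split the observed range into 5 equal-width intervals
energy_bin <- cut(energy, breaks = 5)

# Number of tracks per interval
bin_counts <- table(energy_bin)

# Label of the interval that holds the most tracks
fullest_bin <- names(bin_counts)[which.max(bin_counts)]
```

The bars therefore count tracks per energy interval; they do not use track_popularity.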
spotify_data$speech_only <- cut(spotify_data$speechiness, breaks = 10)
spotify_data %>%
ggplot( aes(x = speech_only )) +
geom_bar(width = 0.2, fill = "#FF9999", colour = "black") +
scale_x_discrete(name = "Speechiness") +
coord_flip()
This plot shows that most tracks have low speechiness, which suggests that less speech-heavy songs are favoured by most Spotify listeners.
We performed data wrangling on our Spotify dataset by removing missing values, removing duplicates and transforming variables before starting our exploratory data analysis. We also saw that there is no multicollinearity.
We plotted histograms and boxplots to show the relations between the variables, and we plan to use this dataset to build a model that predicts song genre from several of the audio features provided in the dataset.
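As a rough illustration of what such a model could look like, the sketch below trains a nearest-centroid classifier in base R on simulated data. Everything in it is an assumption: the two genres, the feature distributions, the 70/30 split and the choice of nearest-centroid itself, which stands in for whichever classification algorithm is eventually selected.

```r
set.seed(42)
n <- 100 # simulated tracks per genre

# Simulated data (not the Spotify data): two genres that differ in two features
toy <- data.frame(
  genre        = factor(rep(c("edm", "rock"), each = n)),
  energy       = c(rnorm(n, 0.85, 0.05), rnorm(n, 0.60, 0.05)),
  danceability = c(rnorm(n, 0.70, 0.05), rnorm(n, 0.50, 0.05))
)
features <- c("energy", "danceability")

# Hold out 30% of the rows for testing
test_idx <- sample(nrow(toy), size = 0.3 * nrow(toy))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# One centroid (mean feature vector) per genre, from the training rows only
centroids <- aggregate(train[features], by = list(genre = train$genre), FUN = mean)

# Squared Euclidean distance from every test row to every centroid
dists <- sapply(seq_len(nrow(centroids)), function(k) {
  centre <- unlist(centroids[k, features])
  rowSums(sweep(as.matrix(test[features]), 2, centre)^2)
})

# Predict the genre of the closest centroid and measure accuracy
closest   <- max.col(-dists, ties.method = "first")
predicted <- as.character(centroids$genre)[closest]
accuracy  <- mean(predicted == as.character(test$genre))
```

On the real data, the six playlist genres overlap far more than these simulated groups do, so accuracy on held-out tracks would have to be measured rather than assumed.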