Dataset: Spotify
Problem Statement:
Spotify, as a music application, does a very good job of recommending music to its users. It suggests music based on the songs and artists you play frequently and like. This data set, built via the spotifyr package, has details of track names, artists, genres, subgenres and other audio features.
Objective:
The idea behind the project is to use this dataset to :
End Goal:
This analysis aims to provide an overview of which songs are the most popular and what their attributes are. The idea is to help an end user gain a better understanding of what lies behind the most popular songs on Spotify.
Overall Approach:
#Dataframe
library(knitr)
library(DT)
#Data Manipulation
library(tidyverse)
library(dplyr)
library(tidyr)
#Data Viz
library(ggplot2)
library(GGally)
knitr : Helps display better outputs without intense coding. The kable function particularly helps in presenting tables, manipulating table styles.
DT : Helps in presenting tables in a clean format, and has the ability to provide filters.
GGally : Used to plot the correlation analysis of variables in matrix form.
tidyverse : Provides a collection of packages including “dplyr”, “tidyr” and “ggplot2”.
This dataset was extracted using the spotifyr package and obtained from the rfordatascience GitHub repository.
Importing Data
spotify <- read.csv("spotify_songs.csv", stringsAsFactors=FALSE)
About the data
dim(spotify)
## [1] 32833 23
The data set has 32833 rows of observations with 23 variables.
The following information about the variables is provided on the ‘rfordatascience’ website and helps the users to understand the dataset:
| Variable | Description |
|---|---|
| track_id | Song unique ID |
| track_name | Song Name |
| track_artist | Song Artist |
| track_popularity | Song Popularity (0-100) where higher is better |
| track_album_id | Album unique ID |
| track_album_name | Song album name |
| track_album_release_date | Date when album released |
| playlist_name | Name of playlist |
| playlist_id | Playlist ID |
| playlist_genre | Playlist genre |
| playlist_subgenre | Playlist subgenre |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | Duration of song in milliseconds |
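The speechiness thresholds in the table can be turned into a categorical label. The following is a minimal sketch on made-up values, not part of the original analysis; the label names and the treatment of the exact boundary values 0.33 and 0.66 are assumptions.

```r
# Toy speechiness values (hypothetical, not taken from the dataset)
speechiness <- c(0.04, 0.21, 0.33, 0.45, 0.66, 0.72, 0.95)

# Thresholds from the description: below 0.33 music, 0.33 to 0.66 mixed,
# above 0.66 speech. Assigning the boundary values themselves to the
# upper class is an assumption made for this sketch.
speech_class <- cut(
  speechiness,
  breaks = c(0, 0.33, 0.66, 1),
  labels = c("music", "mixed", "speech"),
  right = FALSE,          # intervals are [lower, upper)
  include.lowest = TRUE   # so that 1.0 falls into the last bin
)

# Number of tracks per class
table(speech_class)
```

On the real data the same call could be applied to spotify$speechiness to see how many tracks fall into each class.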
Data Cleaning
The following variables each have 5 missing values:
colSums(is.na(spotify))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
which(is.na(spotify$track_name))
## [1] 8152 9283 9284 19569 19812
which(is.na(spotify$track_artist))
## [1] 8152 9283 9284 19569 19812
which(is.na(spotify$track_album_name))
## [1] 8152 9283 9284 19569 19812
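The three which() calls return the same row indices, so the same five rows are missing all three text fields. A short sketch of how that check could be automated is given here on a toy data frame with hypothetical values; dropping the incomplete rows is shown only as an option and is not a step taken in this report.

```r
# Toy data frame standing in for spotify (hypothetical values)
songs <- data.frame(
  track_id         = c("a1", "b2", "c3", "d4"),
  track_name       = c("Song A", NA, "Song C", NA),
  track_artist     = c("Artist A", NA, "Artist C", NA),
  track_album_name = c("Album A", NA, "Album C", NA),
  stringsAsFactors = FALSE
)

text_cols <- c("track_name", "track_artist", "track_album_name")

# Row indices with a missing value, one integer vector per column
na_rows <- lapply(songs[text_cols], function(col) which(is.na(col)))

# TRUE when every column is missing in exactly the same rows
same_rows <- all(vapply(na_rows, identical, logical(1), na_rows[[1]]))

# Optional: keep only rows that are complete in the three text columns
songs_complete <- songs[complete.cases(songs[text_cols]), ]
```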
str(spotify)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
unique(spotify$playlist_genre)
## [1] "pop" "rap" "rock" "latin" "r&b" "edm"
unique(spotify$playlist_subgenre)
## [1] "dance pop" "post-teen pop"
## [3] "electropop" "indie poptimism"
## [5] "hip hop" "southern hip hop"
## [7] "gangster rap" "trap"
## [9] "album rock" "classic rock"
## [11] "permanent wave" "hard rock"
## [13] "tropical" "latin pop"
## [15] "reggaeton" "latin hip hop"
## [17] "urban contemporary" "hip pop"
## [19] "new jack swing" "neo soul"
## [21] "electro house" "big room"
## [23] "pop edm" "progressive electro house"
unique(spotify$key)
## [1] 6 11 1 7 8 5 4 2 0 10 9 3
unique(spotify$mode)
## [1] 1 0
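Since key and mode are integer codes, readable labels can make later plots easier to interpret. The sketch below maps the codes to pitch-class and major/minor names using toy vectors; the labelled columns are hypothetical and are not used elsewhere in this report, and ASCII "#" and "b" stand in for the sharp and flat signs.

```r
# Pitch class names for key values 0..11 (standard pitch class notation)
pitch_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Toy vectors (hypothetical); the real columns are spotify$key and spotify$mode
key  <- c(6, 11, 1, 7, 0)
mode <- c(1, 1, 0, 1, 0)

# Attach labels; mode 1 is major and 0 is minor, as in the variable table
key_label  <- factor(key,  levels = 0:11, labels = pitch_names)
mode_label <- factor(mode, levels = c(0, 1), labels = c("minor", "major"))

data.frame(key, key_label, mode, mode_label)
```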
spotify <- spotify %>% mutate(
  playlist_genre = as.factor(playlist_genre),
  playlist_subgenre = as.factor(playlist_subgenre),
  key = as.factor(key),
  mode = as.factor(mode)
)
spotify <- spotify %>% select(2,4,10:23)
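Selecting by position works, but it silently picks different columns if the column order ever changes. A name-based selection is sketched below on a toy data frame that carries only a few of the real column names with made-up values; on the full data, positions 2, 4 and 10:23 correspond to track_name, track_popularity and playlist_genre through duration_ms according to the str() output.

```r
library(dplyr)

# Toy data frame with a subset of the real column names (values are made up)
songs <- data.frame(
  track_id         = c("a1", "b2"),
  track_name       = c("Song A", "Song B"),
  track_artist     = c("Artist A", "Artist B"),
  track_popularity = c(66L, 70L),
  playlist_genre   = c("pop", "rap"),
  danceability     = c(0.748, 0.675),
  duration_ms      = c(194754L, 176616L)
)

# Select by name; the a:b form takes every column from a through b
songs_small <- songs %>%
  select(track_name, track_popularity, playlist_genre:duration_ms)

names(songs_small)
```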
Summary of Final & Cleaned Dataset
datatable(spotify, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
From the summaries, it can be seen that the audio features match the descriptions given in the features table, in both values and ranges.
For speechiness, acousticness, instrumentalness and liveness, however, the median and mean are not as close as they are for the other variables, so we will look at some plots to understand their behaviour.
summary(spotify)
## track_name track_popularity playlist_genre
## Length:32833 Min. : 0.00 edm :6043
## Class :character 1st Qu.: 24.00 latin:5155
## Mode :character Median : 45.00 pop :5507
## Mean : 42.48 r&b :5431
## 3rd Qu.: 62.00 rap :5746
## Max. :100.00 rock :4951
##
## playlist_subgenre danceability energy
## progressive electro house: 1809 Min. :0.0000 Min. :0.000175
## southern hip hop : 1675 1st Qu.:0.5630 1st Qu.:0.581000
## indie poptimism : 1672 Median :0.6720 Median :0.721000
## latin hip hop : 1656 Mean :0.6548 Mean :0.698619
## neo soul : 1637 3rd Qu.:0.7610 3rd Qu.:0.840000
## pop edm : 1517 Max. :0.9830 Max. :1.000000
## (Other) :22867
## key loudness mode speechiness acousticness
## 1 : 4010 Min. :-46.448 0:14259 Min. :0.0000 Min. :0.0000
## 0 : 3454 1st Qu.: -8.171 1:18574 1st Qu.:0.0410 1st Qu.:0.0151
## 7 : 3352 Median : -6.166 Median :0.0625 Median :0.0804
## 9 : 3027 Mean : -6.720 Mean :0.1071 Mean :0.1753
## 11 : 2996 3rd Qu.: -4.645 3rd Qu.:0.1320 3rd Qu.:0.2550
## 2 : 2827 Max. : 1.275 Max. :0.9180 Max. :0.9940
## (Other):13167
## instrumentalness liveness valence tempo
## Min. :0.0000000 Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000000 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96
## Median :0.0000161 Median :0.1270 Median :0.5120 Median :121.98
## Mean :0.0847472 Mean :0.1902 Mean :0.5106 Mean :120.88
## 3rd Qu.:0.0048300 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92
## Max. :0.9940000 Max. :0.9960 Max. :0.9910 Max. :239.44
##
## duration_ms
## Min. : 4000
## 1st Qu.:187819
## Median :216000
## Mean :225800
## 3rd Qu.:253585
## Max. :517810
##
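The gap between mean and median noted for speechiness, acousticness, instrumentalness and liveness can be quantified directly. The sketch below uses simulated columns (hypothetical values that only imitate a skewed and a symmetric feature) to show the idea; it is not computed on the Spotify data.

```r
set.seed(1)

# Toy right-skewed feature (hypothetical, imitating instrumentalness):
# mostly zeros, some tiny values, a few large ones
instrumental_like <- c(rep(0, 60), runif(30, 0, 0.01), runif(10, 0.5, 1))

# Toy roughly symmetric feature (hypothetical, imitating valence)
valence_like <- rnorm(100, mean = 0.5, sd = 0.1)

features <- data.frame(instrumental_like, valence_like)

# Mean minus median per column; a large positive gap suggests right skew
gap <- sapply(features, function(x) mean(x) - median(x))
round(gap, 3)
```

Applied to the numeric columns of spotify, the same sapply() call would rank the features by how far the mean is pulled away from the median.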
Visual EDA
Histograms
spotify %>%
  keep(is.numeric) %>% #histograms only for numeric columns
  gather() %>% #converts to key-value pairs (columns "key" and "value")
  ggplot(aes(value, fill = key)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(alpha = 0.7, bins = 30) + scale_x_continuous(guide = guide_axis(check.overlap = TRUE)) #values are continuous, so a continuous x scale is used
Barcharts
ggplot(spotify,aes(mode)) + geom_bar(aes(fill=mode),alpha = 0.7)
ggplot(spotify,aes(key)) + geom_bar(aes(fill=key), alpha = 0.7)
Boxplots
Loudness, tempo, speechiness, danceability and duration have some obvious outliers. We will take this into consideration while working on the data.
Instrumentalness has most values close to 0, which is why its boxplot is compressed near zero and its histogram is concentrated in the first bin.
spotify %>%
keep(is.numeric) %>% #boxplots only for numeric columns
gather() %>% #converts to key value
ggplot(aes(value, fill = key)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot(alpha = 0.7) + coord_flip()
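The outliers seen in the boxplots can also be counted. The sketch below applies the usual rule of 1.5 times the interquartile range beyond the quartiles to toy values; the helper count_outliers() and the data are hypothetical, and whether such points should be removed is left open.

```r
# Count points lying more than 1.5 * IQR outside the quartiles
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Toy data (hypothetical values): one extreme loudness value, no extreme tempo
toy <- data.frame(
  loudness = c(-6, -5, -7, -6.5, -5.5, -6.2, -46),
  tempo    = c(120, 122, 118, 121, 119, 123, 120)
)

# Number of flagged points per column
sapply(toy, count_outliers)
```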
I will be analyzing the popularity of songs, genres and audio features, and how they interact. To present these insights I intend to make use of:
For k-means clustering, I plan to use the elbow method to find the ideal number of clusters, which should eventually show what separates the cluster of popular songs from the rest.
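As a rough sketch of the elbow method, assuming simulated data in place of the scaled audio features: k-means is fitted for a range of k, and the total within-cluster sum of squares is plotted against k. The data, the range of k and the nstart value are all assumptions made for illustration; this has not been run on the Spotify data.

```r
set.seed(42)

# Toy data: three well-separated groups in two "audio features"
# (hypothetical values, not drawn from the Spotify data)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)
toy <- scale(toy)  # k-means is distance based, so features are standardised

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which adding clusters barely reduces WSS
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

For this toy data the curve should flatten after k = 3, matching the three simulated groups; real audio features are unlikely to give such a clean bend.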
Currently, I am unsure whether, apart from clustering, I will implement other ML methods such as random forest and bagging.