If there’s one thing many people can’t live without, it’s music. Spotify is an international media services provider. The company’s primary business is providing an audio streaming platform, the “Spotify” platform, that provides DRM-restricted music, videos and podcasts from record labels and media companies.
The motivation of this project is to enable anyone to discover patterns and insights about the music that they listen to. In doing so, they gain a better understanding of their listening behavior on Spotify.
Have you ever wondered how Spotify rates the popularity of songs? Or which factors determine a song’s genre? What characteristics of a song can determine its popularity? Using data analysis, we will try to answer these questions.
We plan to achieve this by performing the following tasks: Data Preparation, Exploratory Data Analysis and Predictive Modeling.
Based on our analysis, the consumer will be able to identify which factors influence the popularity of a song on Spotify.
The following packages are used in the analysis:
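library(tidyverse)    # data manipulation and plotting (attaches ggplot2 and dplyr)
library(ggplot2)      # plotting (already attached by tidyverse)
library(dplyr)        # data manipulation (already attached by tidyverse)
library(psych)        # descriptive statistics
library(DAAG)         # data analysis and graphics helpers
library(highcharter)  # interactive charts
library(knitr)        # report generation and kable() tables
library(kableExtra)   # styling for kable() tables
library(DT)           # interactive tables via datatable()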
The data set used in this project can be found here: Spotify Data
This data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get data or general metadata about songs from Spotify’s API.
The data set contains 32,833 observations of 23 variables.
The following table summarizes all the variables in the data set.
| variable_name | description |
|---|---|
| track_id | unique ID |
| track_name | Song Name |
| track_artist | Song Artist |
| track_popularity | Song Popularity (0-100) where higher is better |
| track_album_id | Album unique ID |
| track_album_name | Song album name |
| track_album_release_date | Date when album released |
| playlist_name | Name of playlist |
| playlist_id | Playlist ID |
| playlist_genre | Playlist genre |
| playlist_subgenre | Playlist subgenre |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. |
| key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. |
| loudness | The overall loudness of a track in decibels (dB). |
| mode | Mode indicates the modality (major or minor) of a track |
| speechiness | Speechiness detects the presence of spoken words in a track. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive. |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). |
| duration_ms | Duration of song in milliseconds |
spotify_data <- read.csv("C:/Users/Nikita/Downloads/spotify_songs.csv", header = TRUE)
head(spotify_data)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
We observe that many songs appear more than once in this data set: they have the same ‘track_id’ but a different ‘playlist_id’. Since a song’s ‘track_id’ is unique and its other quantifiable variables remain the same across playlists, we remove the duplicated songs based on ‘track_id’, keeping the first occurrence of each.
spotify_data_unique <- spotify_data[!duplicated(spotify_data$track_id), ]
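To see what this step does, here is a minimal sketch on a small made-up data frame. The `toy` object and its values are hypothetical and not taken from the Spotify data. `duplicated()` flags the second and later occurrences of each `track_id`, so negating it keeps the first row per track.

```r
# Hypothetical toy data: track "a" appears in two playlists
toy <- data.frame(
  track_id    = c("a", "b", "a", "c"),
  playlist_id = c("p1", "p1", "p2", "p2"),
  energy      = c(0.9, 0.5, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# duplicated() is TRUE for repeat occurrences of a track_id
dup_flags <- duplicated(toy$track_id)

# Keep only the first occurrence of each track
toy_unique <- toy[!dup_flags, ]

sum(dup_flags)    # number of rows removed
nrow(toy_unique)  # number of distinct tracks kept
```

Note that this keeps whichever playlist happens to come first for a track, so a song that sits in playlists of two different genres is counted under only one of them.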
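Now that there are no repeated songs in the list, and since we want to analyze which variables influence ‘track_popularity’, we can drop the columns that are not useful in our analysis: track_id, track_album_id, track_album_name, playlist_name, playlist_id and playlist_subgenre (columns 1, 5, 6, 8, 9 and 11).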
spotify_data_2 <- spotify_data_unique[c(-1, -5, -6, -8, -9, -11)]
head(spotify_data_2)
## track_name track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2 Memories - Dillon Francis Remix Maroon 5
## 3 All the Time - Don Diablo Remix Zara Larsson
## 4 Call You Mine - Keanu Silva Remix The Chainsmokers
## 5 Someone You Loved - Future Humans Remix Lewis Capaldi
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity track_album_release_date playlist_genre danceability energy
## 1 66 2019-06-14 pop 0.748 0.916
## 2 67 2019-12-13 pop 0.726 0.815
## 3 70 2019-07-05 pop 0.675 0.931
## 4 60 2019-07-19 pop 0.718 0.930
## 5 69 2019-03-05 pop 0.650 0.833
## 6 67 2019-07-11 pop 0.675 0.919
## key loudness mode speechiness acousticness instrumentalness liveness valence
## 1 6 -2.634 1 0.0583 0.1020 0.00e+00 0.0653 0.518
## 2 11 -4.969 1 0.0373 0.0724 4.21e-03 0.3570 0.693
## 3 1 -3.432 0 0.0742 0.0794 2.33e-05 0.1100 0.613
## 4 7 -3.778 1 0.1020 0.0287 9.43e-06 0.2040 0.277
## 5 1 -4.672 1 0.0359 0.0803 0.00e+00 0.0833 0.725
## 6 8 -5.385 1 0.1270 0.0799 0.00e+00 0.1430 0.585
## tempo duration_ms
## 1 122.036 194754
## 2 99.972 162600
## 3 124.008 176616
## 4 121.956 169093
## 5 123.976 189052
## 6 124.982 163049
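Negative positions work, but they would silently drop the wrong columns if the column order of the CSV ever changed. As an alternative sketch, the same six columns can be dropped by name. The one-row `toy` data frame below is hypothetical and only mimics part of the column layout.

```r
# Hypothetical one-row frame using a subset of the real column names
toy <- data.frame(
  track_id = "x1", track_name = "Song", track_album_id = "al1",
  track_album_name = "Album", playlist_name = "Mix", playlist_id = "p1",
  playlist_genre = "pop", playlist_subgenre = "dance pop",
  danceability = 0.7,
  stringsAsFactors = FALSE
)

# Columns the analysis does not need
drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "playlist_name", "playlist_id", "playlist_subgenre")

# Keep every column whose name is not in drop_cols
toy_2 <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy_2)
```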
Now that our data contains no duplicate or redundant columns and rows, we check for missing values. We use the colSums function together with is.na to count the missing values in each column.
colSums(is.na(spotify_data_2))
## track_name track_artist track_popularity
## 4 4 0
## track_album_release_date playlist_genre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
We observe that there are 4 missing values in each of the track_name and track_artist columns. We can keep these observations, since neither column is used as a predictor and the missing values therefore do not affect our analysis.
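Before deciding to keep such rows, it can help to look at them. A minimal sketch, again on hypothetical toy data, that counts missing values per column and pulls out the incomplete rows with `complete.cases()`:

```r
# Hypothetical toy data with one row missing its name and artist
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_artist     = c("Artist A", NA, "Artist C"),
  track_popularity = c(66L, 0L, 70L),
  stringsAsFactors = FALSE
)

# Missing values per column
na_counts <- colSums(is.na(toy))

# Rows with at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]

na_counts
incomplete_rows
```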
output_data <- head(spotify_data_2, n = 100)  # preview the first 100 rows only
datatable(output_data, filter = 'top', options = list(pageLength = 25))
We will create various visualizations to analyse the data.
Here is the initial EDA of our final data set:
dim(spotify_data_2)
## [1] 28356 17
glimpse(spotify_data_2)
## Observations: 28,356
## Variables: 17
## $ track_name <fct> I Don't Care (with Justin Bieber) - Loud L...
## $ track_artist <fct> Ed Sheeran, Maroon 5, Zara Larsson, The Ch...
## $ track_popularity <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58...
## $ track_album_release_date <fct> 2019-06-14, 2019-12-13, 2019-07-05, 2019-0...
## $ playlist_genre <fct> pop, pop, pop, pop, pop, pop, pop, pop, po...
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, ...
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, ...
## $ key <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5,...
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5...
## $ mode <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0....
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.0803...
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0....
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0....
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, ...
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976...
## $ duration_ms <int> 194754, 162600, 176616, 169093, 189052, 16...
str(spotify_data_2)
## 'data.frame': 28356 obs. of 17 variables:
## $ track_name : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
## $ track_artist : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
summary(spotify_data_2)
## track_name track_artist track_popularity
## Breathe : 18 Queen : 130 Min. : 0.00
## Paradise: 17 Martin Garrix : 87 1st Qu.: 21.00
## Poison : 16 Don Omar : 84 Median : 42.00
## Alive : 15 David Guetta : 81 Mean : 39.33
## Forever : 14 Dimitri Vegas & Like Mike: 68 3rd Qu.: 58.00
## (Other) :28272 (Other) :27902 Max. :100.00
## NA's : 4 NA's : 4
## track_album_release_date playlist_genre danceability energy
## 2020-01-10: 201 edm :4877 Min. :0.0000 Min. :0.000175
## 2013-01-01: 189 latin:4137 1st Qu.:0.5610 1st Qu.:0.579000
## 2019-11-22: 185 pop :5132 Median :0.6700 Median :0.722000
## 2019-12-06: 184 r&b :4504 Mean :0.6534 Mean :0.698388
## 2019-11-15: 183 rap :5401 3rd Qu.:0.7600 3rd Qu.:0.843000
## 2008-01-01: 176 rock :4305 Max. :0.9830 Max. :1.000000
## (Other) :27238
## key loudness mode speechiness
## Min. : 0.000 Min. :-46.448 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.: -8.309 1st Qu.:0.0000 1st Qu.:0.0410
## Median : 6.000 Median : -6.261 Median :1.0000 Median :0.0626
## Mean : 5.368 Mean : -6.818 Mean :0.5655 Mean :0.1080
## 3rd Qu.: 9.000 3rd Qu.: -4.709 3rd Qu.:1.0000 3rd Qu.:0.1330
## Max. :11.000 Max. : 1.275 Max. :1.0000 Max. :0.9180
##
## acousticness instrumentalness liveness valence
## Min. :0.00000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.01438 1st Qu.:0.0000000 1st Qu.:0.0926 1st Qu.:0.3290
## Median :0.07970 Median :0.0000206 Median :0.1270 Median :0.5120
## Mean :0.17718 Mean :0.0911168 Mean :0.1910 Mean :0.5104
## 3rd Qu.:0.26000 3rd Qu.:0.0065700 3rd Qu.:0.2490 3rd Qu.:0.6950
## Max. :0.99400 Max. :0.9940000 Max. :0.9960 Max. :0.9910
##
## tempo duration_ms
## Min. : 0.00 Min. : 4000
## 1st Qu.: 99.97 1st Qu.:187742
## Median :121.99 Median :216933
## Mean :120.96 Mean :226576
## 3rd Qu.:134.00 3rd Qu.:254975
## Max. :239.44 Max. :517810
##
We will also compute the correlation between the covariates (independent variables) and song popularity (dependent variable) to identify which variables influence a song’s popularity, followed by a linear regression model and predictive analysis. We may also merge the existing data set with other data sets that describe songs and their features, or with other Spotify data sets. Finally, we may split the data set into smaller ones by genre for better analysis.
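As a rough sketch of the planned correlation, regression and per-genre steps, the block below uses simulated data rather than the Spotify set, so the numbers it produces say nothing about real songs. The data frame `sim`, its coefficients and the choice of predictors are all assumptions made for illustration; on the real data, `spotify_data_2` would take its place.

```r
set.seed(42)  # reproducible simulated data

n <- 200
sim <- data.frame(
  danceability   = runif(n),
  energy         = runif(n),
  playlist_genre = sample(c("pop", "rock"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Assumed relationship, for illustration only
sim$track_popularity <- 30 + 20 * sim$danceability + rnorm(n, sd = 5)

# Correlation of each numeric covariate with popularity
num_cols <- c("danceability", "energy")
cors <- sapply(sim[num_cols], function(x) cor(x, sim$track_popularity))

# Linear regression of popularity on the covariates
fit <- lm(track_popularity ~ danceability + energy, data = sim)

# One data frame per genre for separate analysis
by_genre <- split(sim, sim$playlist_genre)

cors
coef(fit)
names(by_genre)
```

`summary(fit)` would then report standard errors and R-squared for the fitted model, and the same `lm()` call could be applied to each element of `by_genre`.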