Team members:
If there’s one thing many people can’t live without, it’s music.
Spotify is an international media services provider. The company’s primary business is providing an audio streaming platform, the “Spotify” platform, that provides DRM-restricted music, videos and podcasts from record labels and media companies.
The way Spotify suggests music to listeners has a major influence on their listening habits. The motivation of this project is to enable anyone to discover patterns and insights about the music that they listen to. In doing so, they gain a better understanding of their musical behavior when they listen to songs on Spotify.
Have you ever wondered how Spotify rates the popularity of songs? Or which factors determine a song’s genre? What characteristics of a song can determine its popularity?
This analysis aims to answer these questions.
The analysis proceeds through three tasks: data preparation, exploratory data analysis and predictive modeling.
Based on our analysis, the consumer will be able to identify which factors influence the popularity of a song on Spotify.
The following packages are used in the analysis:
library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(DAAG)
library(highcharter)
library(knitr)
library(kableExtra)
library(DT)
library(tm)
library(corrplot)
library(leaps)
The data set used in this project can be found here: Spotify Data
This data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either data or general metadata around songs from Spotify’s API.
The data set contains 32,833 observations of 23 variables.
The following table summarizes the variables in the data set.
variable_name | description |
---|---|
track_id | unique ID |
track_name | Song Name |
track_artist | Song Artist |
track_popularity | Song Popularity (0-100) where higher is better |
track_album_id | Album unique ID |
track_album_name | Song album name |
track_album_release_date | Date when album released |
playlist_name | Name of playlist |
playlist_id | Playlist ID |
playlist_genre | Playlist genre |
playlist_subgenre | Playlist subgenre |
danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. |
key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. |
loudness | The overall loudness of a track in decibels (dB). |
mode | Mode indicates the modality (major or minor) of a track |
speechiness | Speechiness detects the presence of spoken words in a track. |
acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
instrumentalness | Predicts whether a track contains no vocals. |
liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive. |
tempo | The overall estimated tempo of a track in beats per minute (BPM). |
duration_ms | Duration of song in milliseconds |
spotify_data <- read.csv("E:/Vinni_USA/MSIS coursework docs/Spring-20/4th Flex/Data Wrangling/Final Project/spotify/spotify_songs.csv", header = TRUE)
head(spotify_data)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
str(spotify_data)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
## $ track_name : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
## $ track_artist : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : Factor w/ 22545 levels "000f3dTtvpazVzv35NuZmn",..: 7684 17645 4144 4691 21907 8636 21592 17795 21050 13719 ...
## $ track_album_name : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7928 10684 981 2869 15185 1882 11515 13093 17788 8155 ...
## $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
## $ playlist_name : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
## $ playlist_id : Factor w/ 471 levels "0275i1VNfBnsNbPl0QIBpG",..: 237 237 237 237 237 237 237 237 237 237 ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ playlist_subgenre : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
summary(spotify_data)
## track_id track_name track_artist
## 7BKLCZ1jbUBVqRi2FVlTVw: 10 Poison : 22 Martin Garrix : 161
## 14sOS5L36385FJ3OL8hew4: 9 Breathe : 21 Queen : 136
## 3eekarcy7kvN4yt5ZFzltW: 9 Alive : 20 The Chainsmokers: 123
## 0nbXyq5TXYPCO7pr3N8S4I: 8 Forever : 20 David Guetta : 110
## 0qaWEvPkts34WF68r8Dzx9: 8 Paradise: 19 Don Omar : 102
## 0rIAC4PXANcKmitJfoqmVm: 8 (Other) :32726 (Other) :32196
## (Other) :32781 NA's : 5 NA's : 5
## track_popularity track_album_id
## Min. : 0.00 5L1xcowSxwzFUSJzvyMp48: 42
## 1st Qu.: 24.00 5fstCqs5NpIlF42VhPNv23: 29
## Median : 45.00 7CjJb2mikwAWA1V6kewFBF: 28
## Mean : 42.48 4VFG1DOuTeDMBjBLZT7hCK: 26
## 3rd Qu.: 62.00 2HTbQ0RHwukKVXAlTmCZP2: 21
## Max. :100.00 4CzT5ueFBRpbILw34HQYxi: 21
## (Other) :32666
## track_album_name track_album_release_date
## Greatest Hits : 139 2020-01-10: 270
## Ultimate Freestyle Mega Mix: 42 2019-11-22: 244
## Gold : 35 2019-12-06: 235
## Malibu : 30 2019-12-13: 220
## Rock & Rios (Remastered) : 29 2013-01-01: 219
## (Other) :32553 2019-11-15: 215
## NA's : 5 (Other) :31430
## playlist_name
## Indie Poptimism : 308
## 2020 Hits & 2019 Hits – Top Global Tracks 🔥🔥🔥 : 247
## Permanent Wave : 244
## Hard Rock Workout : 219
## Ultimate Indie Presents... Best Indie Tracks of the 2010s : 198
## Fitness Workout Electro | House | Dance | Progressive House: 195
## (Other) :31422
## playlist_id playlist_genre
## 4JkkvMpVl4lSioqQjeAL0q: 247 edm :6043
## 37i9dQZF1DWTHM4kX49UKs: 198 latin:5155
## 6KnQDwp0syvhfHOR4lWP7x: 195 pop :5507
## 3xMQTDLOIGvj3lWH5e5x6F: 189 r&b :5431
## 3Ho3iO0iJykgEQNbjB2sic: 182 rap :5746
## 25ButZrVb1Zj1MJioMs09D: 109 rock :4951
## (Other) :31713
## playlist_subgenre danceability energy
## progressive electro house: 1809 Min. :0.0000 Min. :0.000175
## southern hip hop : 1675 1st Qu.:0.5630 1st Qu.:0.581000
## indie poptimism : 1672 Median :0.6720 Median :0.721000
## latin hip hop : 1656 Mean :0.6548 Mean :0.698619
## neo soul : 1637 3rd Qu.:0.7610 3rd Qu.:0.840000
## pop edm : 1517 Max. :0.9830 Max. :1.000000
## (Other) :22867
## key loudness mode speechiness
## Min. : 0.000 Min. :-46.448 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.: -8.171 1st Qu.:0.0000 1st Qu.:0.0410
## Median : 6.000 Median : -6.166 Median :1.0000 Median :0.0625
## Mean : 5.374 Mean : -6.720 Mean :0.5657 Mean :0.1071
## 3rd Qu.: 9.000 3rd Qu.: -4.645 3rd Qu.:1.0000 3rd Qu.:0.1320
## Max. :11.000 Max. : 1.275 Max. :1.0000 Max. :0.9180
##
## acousticness instrumentalness liveness valence
## Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0151 1st Qu.:0.0000000 1st Qu.:0.0927 1st Qu.:0.3310
## Median :0.0804 Median :0.0000161 Median :0.1270 Median :0.5120
## Mean :0.1753 Mean :0.0847472 Mean :0.1902 Mean :0.5106
## 3rd Qu.:0.2550 3rd Qu.:0.0048300 3rd Qu.:0.2480 3rd Qu.:0.6930
## Max. :0.9940 Max. :0.9940000 Max. :0.9960 Max. :0.9910
##
## tempo duration_ms
## Min. : 0.00 Min. : 4000
## 1st Qu.: 99.96 1st Qu.:187819
## Median :121.98 Median :216000
## Mean :120.88 Mean :225800
## 3rd Qu.:133.92 3rd Qu.:253585
## Max. :239.44 Max. :517810
##
We observe that many songs appear more than once in this dataset. They have the same ‘track_id’ but a different ‘playlist_id’, so we need to remove the duplicated songs. Since a song’s ‘track_id’ is unique and the other quantifiable variables of that song remain the same, we delete the duplicates based on ‘track_id’.
spotify_data_unique = spotify_data[!duplicated(spotify_data$track_id),]
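The behavior of this de-duplication step can be illustrated on a small table. The following is a minimal sketch in base R; the `tracks` data frame and its values are hypothetical and only mimic the situation where one track appears in two playlists.

```r
# Toy data (hypothetical): track "a1" appears in two different playlists.
tracks <- data.frame(
  track_id    = c("a1", "b2", "a1", "c3"),
  playlist_id = c("p1", "p1", "p2", "p2"),
  energy      = c(0.91, 0.45, 0.91, 0.70),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each track_id,
# so negating it keeps only the first row seen for every track.
tracks_unique <- tracks[!duplicated(tracks$track_id), ]

print(tracks_unique)
nrow(tracks_unique)
```

One consequence of keeping the first occurrence is that the playlist-level fields retained for a track (for example its playlist genre) are those of the first playlist in which it appears.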
Now that there are no more repeated songs in the list, and since we would like to analyze which variables influence ‘track_popularity’, we can drop the following columns, which are not useful in our analysis: track_id, track_album_id, track_album_name, playlist_name, playlist_id and playlist_subgenre.
spotify_data_2 <- spotify_data_unique[c(-1, -5, -6, -8, -9, -11)]
ls(spotify_data_2)
## [1] "acousticness" "danceability"
## [3] "duration_ms" "energy"
## [5] "instrumentalness" "key"
## [7] "liveness" "loudness"
## [9] "mode" "playlist_genre"
## [11] "speechiness" "tempo"
## [13] "track_album_release_date" "track_artist"
## [15] "track_name" "track_popularity"
## [17] "valence"
spotify_data_3 <- spotify_data_2 %>%
separate(track_album_release_date,
c("track_album_release_year","track_album_release_month","track_album_release_day"),
sep = "-")
spotify_data_4 <- spotify_data_3[c(-5, -6)]
head(spotify_data_4)
## track_name track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2 Memories - Dillon Francis Remix Maroon 5
## 3 All the Time - Don Diablo Remix Zara Larsson
## 4 Call You Mine - Keanu Silva Remix The Chainsmokers
## 5 Someone You Loved - Future Humans Remix Lewis Capaldi
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity track_album_release_year playlist_genre danceability energy
## 1 66 2019 pop 0.748 0.916
## 2 67 2019 pop 0.726 0.815
## 3 70 2019 pop 0.675 0.931
## 4 60 2019 pop 0.718 0.930
## 5 69 2019 pop 0.650 0.833
## 6 67 2019 pop 0.675 0.919
## key loudness mode speechiness acousticness instrumentalness liveness valence
## 1 6 -2.634 1 0.0583 0.1020 0.00e+00 0.0653 0.518
## 2 11 -4.969 1 0.0373 0.0724 4.21e-03 0.3570 0.693
## 3 1 -3.432 0 0.0742 0.0794 2.33e-05 0.1100 0.613
## 4 7 -3.778 1 0.1020 0.0287 9.43e-06 0.2040 0.277
## 5 1 -4.672 1 0.0359 0.0803 0.00e+00 0.0833 0.725
## 6 8 -5.385 1 0.1270 0.0799 0.00e+00 0.1430 0.585
## tempo duration_ms
## 1 122.036 194754
## 2 99.972 162600
## 3 124.008 176616
## 4 121.956 169093
## 5 123.976 189052
## 6 124.982 163049
For easier analysis, we have split ‘track_album_release_date’ into three different columns, namely ‘track_album_release_year’, ‘track_album_release_month’ and ‘track_album_release_day’.
We focus only on the year of release for the analysis, so we delete ‘track_album_release_month’ and ‘track_album_release_day’ from our dataset.
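The factor levels shown earlier for ‘track_album_release_date’ include partial dates such as "1957-03", so not every value has all three pieces. Since only the year is needed, one alternative is to take the first four characters directly. This is a minimal base R sketch on a hypothetical vector, not the approach used in this report.

```r
# Hypothetical release dates, including a year-month value without a day.
release_date <- c("2019-06-14", "1957-03", "2013-01-01")

# The year is always the first four characters, whatever pieces follow.
# as.character() guards against the column being stored as a factor.
release_year <- as.integer(substr(as.character(release_date), 1, 4))

release_year
```

Storing the year as an integer rather than a character string also makes it easier to use in plots or models later on.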
Now that our data contains no duplicate songs or redundant columns, we check for missing values in the data set. We use the colSums function in R to count the missing values in each column.
colSums(is.na(spotify_data_4))
## track_name track_artist track_popularity
## 4 4 0
## track_album_release_year playlist_genre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
We observe that there are 4 missing values in each of the track_name and track_artist columns. We can keep these observations, since neither column is used as a covariate and their missing values do not affect our analysis.
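The reasoning above can be checked on a small example. In this sketch the `songs_toy` data frame is hypothetical; it shows that `colSums(is.na(...))` counts missing values per column and that `lm()` only drops rows with missing values in the variables that appear in the formula.

```r
# Hypothetical data: one song has no name, but all numeric fields are present.
songs_toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_popularity = c(66, 45, 70),
  energy           = c(0.9, 0.5, 0.7),
  stringsAsFactors = FALSE
)

# Count missing values in each column.
na_counts <- colSums(is.na(songs_toy))
na_counts

# track_name is not in the formula, so its NA does not remove the row.
fit <- lm(track_popularity ~ energy, data = songs_toy)
nobs(fit)
```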
# Audio features to inspect (all numeric columns except mode)
songs_data <- names(spotify_data_4)[c(6:9,11:17)]
# all_of() makes it explicit that songs_data is an external character vector
songs <- spotify_data_4 %>%
select(all_of(c('playlist_genre', songs_data))) %>%
pivot_longer(cols = all_of(songs_data))
ggplot(data = songs) +
geom_boxplot(aes(y = value)) +
facet_wrap(~name, nrow = 3, scales = "free") +
coord_flip() +
ggtitle("Outlier analysis", subtitle = "For different song attributes") +
theme(axis.text.x = element_text(angle = 50, hjust = 1),axis.text.y = element_blank())
We observe that there are many outliers in the dataset for different variables. Removing these outliers would change the data and affect our analysis. Until we have a proper justification to remove them, we keep these outliers.
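To put a number on what the boxplots show, the points beyond the whiskers can be counted per variable. The sketch below uses base R’s `boxplot.stats()`, which applies the 1.5 × IQR rule; the `toy` data frame and its column names are hypothetical. Note that `boxplot.stats()` computes the hinges slightly differently from ggplot2’s `geom_boxplot()`, so counts may differ marginally from the plot.

```r
# Hypothetical features: feature_a has one extreme value, feature_b has none.
toy <- data.frame(
  feature_a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50),
  feature_b = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# boxplot.stats()$out returns the values lying beyond 1.5 * IQR of the hinges.
count_outliers <- function(x) length(boxplot.stats(x)$out)

outlier_counts <- sapply(toy, count_outliers)
outlier_counts
```

Applied to the audio features, counts like these could help decide whether a variable’s outliers are a handful of odd tracks or a substantial share of the data.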
# Popularity plus all audio features
feature_names <- names(spotify_data_4)[c(3,6:17)]
songs <- spotify_data_4 %>%
select(all_of(feature_names)) %>%
pivot_longer(cols = all_of(feature_names))
songs %>%
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~name, ncol = 5, scales = 'free') +
labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
theme(axis.text.y = element_blank())
We plot histograms to summarize the distribution of the variables in the data set. We observe:
Displaying 100 rows of the cleaned data set.
output_data <- head(spotify_data_4, n = 100)
datatable(output_data, filter = 'top', options = list(pageLength = 25))
The dimensions of our final data set
dim(spotify_data_4)
## [1] 28356 17
There are 28,356 observations of 17 variables in the cleaned data set.
A glimpse into the data set to identify data types of all the variables.
str(spotify_data_4)
## 'data.frame': 28356 obs. of 17 variables:
## $ track_name : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
## $ track_artist : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_release_year: chr "2019" "2019" "2019" "2019" ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Summary statistics for the cleaned data set.
summary(spotify_data_4)
## track_name track_artist track_popularity
## Breathe : 18 Queen : 130 Min. : 0.00
## Paradise: 17 Martin Garrix : 87 1st Qu.: 21.00
## Poison : 16 Don Omar : 84 Median : 42.00
## Alive : 15 David Guetta : 81 Mean : 39.33
## Forever : 14 Dimitri Vegas & Like Mike: 68 3rd Qu.: 58.00
## (Other) :28272 (Other) :27902 Max. :100.00
## NA's : 4 NA's : 4
## track_album_release_year playlist_genre danceability energy
## Length:28356 edm :4877 Min. :0.0000 Min. :0.000175
## Class :character latin:4137 1st Qu.:0.5610 1st Qu.:0.579000
## Mode :character pop :5132 Median :0.6700 Median :0.722000
## r&b :4504 Mean :0.6534 Mean :0.698388
## rap :5401 3rd Qu.:0.7600 3rd Qu.:0.843000
## rock :4305 Max. :0.9830 Max. :1.000000
##
## key loudness mode speechiness
## Min. : 0.000 Min. :-46.448 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.: -8.309 1st Qu.:0.0000 1st Qu.:0.0410
## Median : 6.000 Median : -6.261 Median :1.0000 Median :0.0626
## Mean : 5.368 Mean : -6.818 Mean :0.5655 Mean :0.1080
## 3rd Qu.: 9.000 3rd Qu.: -4.709 3rd Qu.:1.0000 3rd Qu.:0.1330
## Max. :11.000 Max. : 1.275 Max. :1.0000 Max. :0.9180
##
## acousticness instrumentalness liveness valence
## Min. :0.00000 Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.01438 1st Qu.:0.0000000 1st Qu.:0.0926 1st Qu.:0.3290
## Median :0.07970 Median :0.0000206 Median :0.1270 Median :0.5120
## Mean :0.17718 Mean :0.0911168 Mean :0.1910 Mean :0.5104
## 3rd Qu.:0.26000 3rd Qu.:0.0065700 3rd Qu.:0.2490 3rd Qu.:0.6950
## Max. :0.99400 Max. :0.9940000 Max. :0.9960 Max. :0.9910
##
## tempo duration_ms
## Min. : 0.00 Min. : 4000
## 1st Qu.: 99.97 1st Qu.:187742
## Median :121.99 Median :216933
## Mean :120.96 Mean :226576
## 3rd Qu.:134.00 3rd Qu.:254975
## Max. :239.44 Max. :517810
##
Genre Characteristics
# Audio features to compare across genres (all numeric columns except mode)
songs_data <- names(spotify_data_4)[c(6:9,11:17)]
songs <- spotify_data_4 %>%
select(all_of(c('playlist_genre', songs_data))) %>%
pivot_longer(cols = all_of(songs_data))
songs %>%
ggplot(aes(x = value)) +
geom_density(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 4, scales = 'free') +
labs(title = 'Songs characteristics',x = '', y = '') +
theme(axis.text.x = element_text(angle = 50, hjust = 1),axis.text.y = element_blank())
From the visualization, we observe that songs of different genres follow different patterns in their characteristics.
Genre Classification
Based on the density plot, it looks like energy, valence and danceability may provide the most separation between genres during classification, while instrumentalness and key may not help much.
So, the combination of all these characteristics contributes to the classification of a song into its genre.
corr_plot_data <- spotify_data_4 %>%
select(track_popularity, danceability, energy, key, loudness, mode, speechiness,acousticness,
instrumentalness, liveness, valence, tempo, duration_ms)
corrplot(cor(corr_plot_data),
method = "color",
type = "upper",
order = "hclust")
We can use the correlation matrix to determine whether there is any correlation between track popularity and the song characteristics.
We observe a positive correlation between track popularity and acousticness, loudness, danceability and valence, and a negative correlation between track popularity and liveness, energy and instrumentalness.
We also observe that mode, speechiness, tempo and key have no strong correlation with track popularity.
Thus, we can conclude that popularity is influenced by the following characteristics: acousticness, loudness, valence and danceability.
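Reading signs and magnitudes off a colored matrix can be error-prone, so it may help to extract the popularity column of the correlation matrix and sort it. This is a minimal sketch on a hypothetical five-row data frame; the values are invented purely to illustrate the mechanics and do not come from the Spotify data.

```r
# Hypothetical data: popularity rises with loudness and falls with energy.
toy <- data.frame(
  track_popularity = c(10, 20, 30, 40, 50),
  loudness         = c(-12, -10, -8, -7, -5),
  energy           = c(0.90, 0.80, 0.85, 0.60, 0.50)
)

# Take the popularity column of the correlation matrix and drop the
# trivial correlation of popularity with itself.
cors <- cor(toy)[, "track_popularity"]
cors <- cors[names(cors) != "track_popularity"]

# Sort from most positive to most negative correlation.
sort(cors, decreasing = TRUE)
```

The same three lines applied to the data passed to `corrplot()` would give the exact coefficients behind the colors.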
# Audio features to plot for the 500 most popular songs
feature_names <- names(spotify_data_4)[c(6,7,9,11:16)]
songs <- spotify_data_4 %>%
arrange(desc(track_popularity)) %>%
head(n = 500) %>%
pivot_longer(cols = all_of(feature_names))
songs %>%
ggplot(aes(x = name, y = value)) +
geom_jitter(aes(color = playlist_genre)) +
facet_wrap(~name, ncol = 3, scales = 'free') +
labs(title = 'Audio Feature Pattern Frequency Plots', x = '', y = '') +
theme(axis.text.y = element_blank())
From the jitter plot, we observe that the most popular songs on Spotify are
Number of songs in every genre
spotify_data_4 %>%
filter(!is.na(track_artist)) %>%
count(playlist_genre) %>%
ggplot() +
geom_col(aes(x = playlist_genre, y = n, fill = playlist_genre)) +
coord_polar() +
theme(axis.text.x = element_text(hjust = 1), axis.text.y = element_text(hjust = 1)) +
ggtitle("Number of songs in every genre") +
xlab("Song Genre") +
ylab("Number of songs")
This graph shows the number of songs in each genre in our dataset. We observe that our dataset has more songs of the following genres:
This count will be useful to us in understanding if the popularity of songs depends on its genre.
Most popular songs in every genre
We are now identifying the most popular songs per genre in our dataset.
spotify_data_4 %>%
select(playlist_genre, track_popularity, track_name) %>%
group_by(playlist_genre) %>%
arrange(desc(track_popularity)) %>%
head(n = 500) %>%
ggplot(mapping = aes(x = playlist_genre, y = track_popularity,
color = playlist_genre, shape = playlist_genre,
fill = playlist_genre
, label = track_name
)) +
geom_point() +
theme_minimal() +
labs(x = 'genre', y = 'song popularity', title = 'Most popular Songs per genre') +
geom_text(check_overlap = TRUE, data = subset(spotify_data_4, track_popularity > 97) ) +
theme(plot.title = element_text(hjust = 0.5),legend.position = 'bottom')
From the plot, we observe that pop songs are more popular than those of the remaining genres, followed by latin and rap.
The count of pop songs in the most popular songs list is also higher than that of any other genre.
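Note that `head(n = 500)` is not group-aware: after `group_by()` and `arrange()`, it returns the 500 most popular songs overall rather than 500 per genre. If a per-genre selection is wanted instead, one way to sketch it in base R is shown below. The `toy` data frame and the helper `top_n_per_genre()` are hypothetical names introduced for illustration only.

```r
# Hypothetical songs across three genres.
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "rap", "rap", "rock"),
  track_name       = c("A", "B", "C", "D", "E", "F"),
  track_popularity = c(90, 98, 75, 88, 60, 70),
  stringsAsFactors = FALSE
)

# Split by genre, sort each piece by popularity, keep the first n rows,
# then bind the pieces back together.
top_n_per_genre <- function(df, n) {
  pieces <- lapply(split(df, df$playlist_genre), function(g) {
    g <- g[order(-g$track_popularity), ]
    head(g, n)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

top2 <- top_n_per_genre(toy, 2)
top2
```

In a tidyverse pipeline, `slice_max()` applied after `group_by()` would be an alternative way to get the same kind of per-group selection.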
We now identify the 10 most popular songs in the list.
This will help people who are new to music to listen to the top trending songs. The artist and genre of each song can also be identified from the table for more information.
top_songs <-
spotify_data_4 %>%
select(track_name, track_artist,playlist_genre, track_popularity) %>%
group_by(playlist_genre) %>%
arrange(desc(track_popularity)) %>%
head(n = 10)
top_songs %>%
ggplot(mapping = aes(x = track_name, y = track_popularity, color = track_name)) +
geom_point() +
coord_polar() +
theme_minimal() +
labs(x = 'track_name', y = 'track_popularity', title = 'Top 10 songs in Spotify') +
theme(plot.title = element_text(hjust = 0.5),legend.position = 'bottom')
top_songs %>%
kable() %>%
kable_styling()
track_name | track_artist | playlist_genre | track_popularity |
---|---|---|---|
Dance Monkey | Tones and I | pop | 100 |
ROXANNE | Arizona Zervas | latin | 99 |
Tusa | KAROL G | pop | 98 |
Memories | Maroon 5 | pop | 98 |
Blinding Lights | The Weeknd | pop | 98 |
Circles | Post Malone | pop | 98 |
The Box | Roddy Ricch | rap | 98 |
everything i wanted | Billie Eilish | pop | 97 |
Don’t Start Now | Dua Lipa | pop | 97 |
Falling | Trevor Daniel | pop | 97 |
Analyzing song characteristics of top 10 songs
As we saw earlier, track popularity is influenced by acousticness, loudness, valence and danceability.
Of all the popular songs on Spotify, which ones make it to the top 10? To find out, we analyze these characteristics for the top 10 songs on Spotify.
Acousticness
top_songs <-
spotify_data_4 %>%
select(track_name, playlist_genre, track_popularity, acousticness, loudness, valence, danceability) %>%
group_by(playlist_genre) %>%
arrange(desc(track_popularity)) %>%
head(n = 10)
ggplot(data = top_songs, aes(y = acousticness , x = track_name, fill = playlist_genre ,
shape = playlist_genre)) +
geom_col() +
theme(axis.text.x = element_text(angle = 25, hjust = 0.5)) +
ggtitle('Acousticness', subtitle = 'For top 10 songs on Spotify')
Acousticness values vary considerably across the top 10 songs.
Loudness
top_songs <-
spotify_data_4 %>%
select(track_name, playlist_genre, track_popularity, acousticness, loudness, valence, danceability) %>%
group_by(playlist_genre) %>%
arrange(desc(track_popularity)) %>%
head(n = 10)
ggplot(data = top_songs, aes(y = loudness , x = track_name, fill = playlist_genre, shape = playlist_genre)) +
geom_col() +
theme(axis.text.x = element_text(angle = 25, hjust = 0.5)) +
ggtitle('Loudness', subtitle = 'For top 10 songs on Spotify')
Loudness levels are similar across the tracks, except for ‘everything i wanted’.
Valence
top_songs <-
spotify_data_4 %>%
select(track_name, playlist_genre, track_popularity, acousticness, loudness, valence, danceability) %>%
group_by(playlist_genre) %>%
arrange(desc(track_popularity)) %>%
head(n = 10)
ggplot(data = top_songs, aes(y = valence , x = track_name, fill = playlist_genre, shape = playlist_genre)) +
geom_col() +
theme(axis.text.x = element_text(angle = 25, hjust = 0.5)) +
ggtitle('Valence', subtitle = 'For top 10 songs on Spotify')
Valence levels are similar across the tracks, except for ‘Falling’ and ‘everything i wanted’.
Danceability
top_songs <-
spotify_data_4 %>%
select(track_name, playlist_genre, track_popularity, acousticness, loudness, valence, danceability) %>%
group_by(playlist_genre) %>%
arrange(desc(track_popularity)) %>%
head(n = 10)
ggplot(data = top_songs, aes(y = danceability , x = track_name, fill = playlist_genre, shape = playlist_genre)) +
geom_col() +
theme(axis.text.x = element_text(angle = 25, hjust = 0.5)) +
ggtitle('Danceability', subtitle = 'For top 10 songs on Spotify')
Danceability is high for every track in top 10.
To summarize,
It can be concluded that the songs with the highest popularity, i.e. the top 10 songs on Spotify, have high danceability, valence and loudness. We cannot observe any pattern for acousticness.
This makes sense, since danceable and loud songs are popular at parties and clubs.
Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). People generally prefer upbeat and happy songs making them more popular.
These observations are in line with the insights we got from the jitter plot of the most popular songs.
We are identifying the top 10 artists in the list. This will help people identify other songs of these artists and listen to them in the future.
top_artist <-
spotify_data_4 %>%
select(track_name, track_artist,playlist_genre, track_popularity) %>%
filter(!is.na(track_artist)) %>%
arrange(desc(track_popularity)) %>%
top_n(100) %>%
count(track_artist) %>%
arrange(-n) %>%
head(10)
## Selecting by track_popularity
top_artist %>%
ggplot(aes(reorder(track_artist, n), n)) +
geom_col(fill = "cyan3") +
coord_flip() +
labs(x = 'Artist', y = 'song count', title = 'Top 10 Artists') +
theme(plot.title = element_text(hjust = 0.5),legend.position = 'bottom')
According to the graph, Post Malone has the largest number of popular songs in our dataset.
To predict popularity based on song characteristics, we use multiple linear regression.
Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.
Multiple linear regression predicts the value of a dependent variable, track_popularity in our case, from several independent variables, here the song characteristics.
We create a multiple linear regression model with track_popularity as the response variable and danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo and duration_ms as the covariates.
model_1 <- lm(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms,
data = spotify_data_4)
summary(model_1)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + key +
## loudness + mode + speechiness + acousticness + instrumentalness +
## liveness + valence + tempo + duration_ms, data = spotify_data_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.62 -17.22 2.95 18.10 60.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.783e+01 1.704e+00 39.805 < 2e-16 ***
## danceability 3.721e+00 1.072e+00 3.470 0.000522 ***
## energy -2.321e+01 1.220e+00 -19.028 < 2e-16 ***
## key 3.190e-03 3.844e-02 0.083 0.933870
## loudness 1.156e+00 6.527e-02 17.711 < 2e-16 ***
## mode 8.616e-01 2.809e-01 3.067 0.002161 **
## speechiness -6.328e+00 1.380e+00 -4.587 4.52e-06 ***
## acousticness 4.331e+00 7.466e-01 5.801 6.67e-09 ***
## instrumentalness -9.292e+00 6.255e-01 -14.856 < 2e-16 ***
## liveness -4.280e+00 8.990e-01 -4.761 1.93e-06 ***
## valence 1.788e+00 6.565e-01 2.724 0.006458 **
## tempo 2.609e-02 5.239e-03 4.979 6.42e-07 ***
## duration_ms -4.342e-05 2.294e-06 -18.925 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.01 on 28343 degrees of freedom
## Multiple R-squared: 0.05808, Adjusted R-squared: 0.05768
## F-statistic: 145.6 on 12 and 28343 DF, p-value: < 2.2e-16
All the covariates in the model except key are statistically significant, since the p-value for each of them is less than 0.05; key has a p-value of 0.934.
However, the adjusted R-squared value is 0.05768, which is low: the model explains less than 6% of the variation in track popularity. The p-value of the model is < 2.2e-16, which indicates that the covariates are jointly significant.
Next, we perform variable selection to identify the most useful covariates.
model_3 = regsubsets(track_popularity ~ danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms,
data = spotify_data_4,
nbest = 7)
plot(model_3, scale = "bic")
According to best subset selection, ‘energy’ has a greater influence than ‘loudness’.
Comparing both results, we conclude that the model with the inclusion pattern 1 1 0 1 1 1 1 1 1 1 1 1 (every covariate except ‘key’, the third term in the formula) is the best linear regression model for this dataset. In other words, all variables except ‘key’ are statistically significant in predicting track popularity.
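The idea of comparing candidate covariate sets by BIC can be sketched without the leaps package by fitting two nested models and calling base R’s `BIC()` on each. The example below uses R’s built-in `mtcars` dataset rather than the Spotify data, so the variables are unrelated to this analysis; it only illustrates the comparison, not how `regsubsets()` searches over subsets.

```r
# Two nested candidate models on the built-in mtcars data (illustrative only).
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# BIC penalizes each extra coefficient by log(n); lower values are preferred.
bic_values <- c(full = BIC(full), reduced = BIC(reduced))
bic_values

# Name of the candidate with the smaller BIC.
names(which.min(bic_values))
```

The same comparison between the model with and without ‘key’ would show whether dropping it lowers the BIC, which is what the subset plot summarizes graphically.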
model_2 <- lm(track_popularity ~ danceability + energy + loudness + mode + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms,
data = spotify_data_4)
summary(model_2)
##
## Call:
## lm(formula = track_popularity ~ danceability + energy + loudness +
## mode + speechiness + acousticness + instrumentalness + liveness +
## valence + tempo + duration_ms, data = spotify_data_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.624 -17.226 2.949 18.099 60.533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.785e+01 1.692e+00 40.113 < 2e-16 ***
## danceability 3.720e+00 1.072e+00 3.469 0.000523 ***
## energy -2.321e+01 1.219e+00 -19.030 < 2e-16 ***
## loudness 1.156e+00 6.527e-02 17.712 < 2e-16 ***
## mode 8.576e-01 2.765e-01 3.101 0.001930 **
## speechiness -6.326e+00 1.379e+00 -4.586 4.54e-06 ***
## acousticness 4.331e+00 7.465e-01 5.803 6.60e-09 ***
## instrumentalness -9.292e+00 6.254e-01 -14.856 < 2e-16 ***
## liveness -4.280e+00 8.990e-01 -4.761 1.93e-06 ***
## valence 1.789e+00 6.564e-01 2.726 0.006414 **
## tempo 2.608e-02 5.239e-03 4.979 6.44e-07 ***
## duration_ms -4.342e-05 2.294e-06 -18.928 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.01 on 28344 degrees of freedom
## Multiple R-squared: 0.05808, Adjusted R-squared: 0.05771
## F-statistic: 158.9 on 11 and 28344 DF, p-value: < 2.2e-16
The adjusted R-squared value is 0.05771. This implies that the model explains only about 5.77% of the variation in track popularity, so its predictions should be treated with caution.
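The adjusted R-squared values printed by `summary()` can be reproduced from the multiple R-squared, the number of observations n and the number of covariates p, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below plugs in the rounded figures from the two model summaries; the helper name `adj_r_squared()` is introduced here for illustration.

```r
# Standard adjusted R-squared formula.
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 28356  # observations in the cleaned data set

# Both summaries report a multiple R-squared of 0.05808 (rounded).
adj_full    <- adj_r_squared(0.05808, n, p = 12)  # model with key
adj_reduced <- adj_r_squared(0.05808, n, p = 11)  # model without key

round(c(full = adj_full, without_key = adj_reduced), 5)
# full ~ 0.05768, without_key ~ 0.05771
```

Dropping ‘key’ leaves the multiple R-squared unchanged at this precision, so the slightly higher adjusted value comes entirely from the smaller penalty for having one fewer covariate.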
par(mfrow = c(1,2))
# generate QQ plot
qqnorm(model_2$residuals,main = "Model")
qqline(model_2$residuals)
# generate Scatter Plot
plot(model_2$fitted.values,model_2$residuals,pch = 20)
abline(h = 0,col = "grey")
From the graphs, we observe that the Q-Q plot deviates from the reference line and that the residuals in the scatter plot are not evenly spread around zero.
Therefore, the model does not fully satisfy the normality, linearity and equal variance assumptions.
Now we use the model created to make predictions about the track popularity.
new_popularity <- data.frame(danceability = 0.718,
energy = 0.93,
loudness = -3.778,
mode = 1,
speechiness = 0.102,
acousticness = 0.0287,
instrumentalness = 0,
liveness = 0.204,
valence = 0.277,
tempo = 121.956,
duration_ms = 169093)
print(paste0("Observed popularity: ",60))
## [1] "Observed popularity: 60"
predicted <- predict(model_2, newdata = new_popularity)
print(paste0("Predicted popularity: ",(predicted)))
## [1] "Predicted popularity: 40.3716218419876"
We observe that the predicted popularity (about 40.4) is lower than the observed value (60). The discrepancy may be due to skewness in the data, and it is consistent with the low adjusted R-squared of the model.
The model needs to be transformed to make more accurate predictions of popularity.
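The prediction can also be reproduced by hand, which makes it clear how each characteristic contributes. The sketch below copies the rounded coefficients from the second model summary and the feature values of the song used above, then computes the intercept plus the sum of coefficient times value. Because the coefficients are rounded to four significant digits, the result only approximately matches the output of `predict()`.

```r
# Coefficients as printed in the model summary (rounded).
coefs <- c(
  "(Intercept)"    = 6.785e+01,
  danceability     = 3.720e+00,
  energy           = -2.321e+01,
  loudness         = 1.156e+00,
  mode             = 8.576e-01,
  speechiness      = -6.326e+00,
  acousticness     = 4.331e+00,
  instrumentalness = -9.292e+00,
  liveness         = -4.280e+00,
  valence          = 1.789e+00,
  tempo            = 2.608e-02,
  duration_ms      = -4.342e-05
)

# Feature values of the song being predicted; the intercept term is 1.
new_song <- c(
  "(Intercept)" = 1, danceability = 0.718, energy = 0.93,
  loudness = -3.778, mode = 1, speechiness = 0.102,
  acousticness = 0.0287, instrumentalness = 0, liveness = 0.204,
  valence = 0.277, tempo = 121.956, duration_ms = 169093
)

# Linear predictor: sum of coefficient * value, matched by name.
contributions <- coefs * new_song[names(coefs)]
predicted <- sum(contributions)

round(contributions, 2)
predicted
```

The per-term contributions show that the large negative energy term and the duration term pull the prediction down the most for this song.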
Conclusion
The popularity of a song is most influenced by the danceability, loudness and valence of the song. We came to this conclusion from the correlation matrix and the jitter plot for the most popular songs on Spotify.
The factors that determine a song’s genre are danceability, energy and valence. We came to this conclusion from the density plot of the characteristics of the songs.
Spotify could be determining a song’s popularity based on all the characteristics apart from ‘key’. We concluded this from the multiple linear regression model and the variable selection method.
The pop genre has the highest number of popular songs on Spotify. We concluded this from the plot of the most popular songs by genre.
Insights
A common assumption is that energy influences popularity, i.e. that energetic songs are more popular. However, we could not find a positive correlation between popularity and energy; in both the correlation matrix and the regression model, energy is negatively associated with popularity.
The most popular songs are not evenly distributed across genres. We observe that people prefer pop music over other genres.
Implications
The model we created could be used to estimate the popularity of a song from its characteristics. Such an estimate could help people understand how a song might fare when it is released, although the model’s low explanatory power limits its accuracy.
This analysis can be helpful to students studying music or wanting to pursue a career in music.
Future Scope
We can improve the model by applying transformations to the dependent variable and the covariates, which should give a better model for prediction.
We can include sub-genre as a factor that may determine the popularity of a song.
Combining the Spotify data with other music-related datasets will help in analyzing a song’s popularity more thoroughly.