Streaming music services have collected a wealth of data on song characteristics, stream counts, and user interaction. Spotify uses this data to create playlists curated to a user’s taste based on their listening habits. Considering how Spotify has quantified songs, is it possible to produce a predictive model for a song’s popularity?
Using the Spotify dataset retrieved with the spotifyr package, we will look at 12 variables from almost 33,000 songs and see how they interact with each other, and whether they are significant in determining popularity. As genres of music can have very different characteristics, it may make more sense to gauge popularity based on each genre’s representative statistics rather than on overall statistics.
We will first compare variables to determine which ones influence the popularity of a song. We will then apply a multiple linear regression to the data to predict which characteristics are likely, within each genre, to produce a popular song.
A predictive model of popularity can help songwriters craft a song based on data-driven methods. Users will also benefit by being able to discover songs based on their preferred characteristics, and see whether they prefer more or less popular songs.
This analysis will make use of the following packages:
library(tidyverse)
library(scales)
library(table1)
The data used in this analysis is available as part of the tidytuesdayR package, and is also available for download here.
The codebook is also available on the tidytuesday GitHub page.
songs <- read.csv('spotify_songs.csv', stringsAsFactors = FALSE)
The data set, which was originally created on 2020-01-21, consists of 32,833 records, each with 23 columns. Each record represents a single song, and the columns represent various aspects of each song:
htmltools::includeHTML("codebook.html")
| variable | class | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
Because the columns are named sensibly, we won’t need to re-name any of them.
In examining the structure of the data set, we can see that track_album_release_date should be re-formatted as a date, and that playlist_genre and playlist_subgenre should be made into factors.
str(songs)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
songs$track_album_release_date <- as.Date(songs$track_album_release_date)
songs$playlist_genre <- as.factor(songs$playlist_genre)
songs$playlist_subgenre <- as.factor(songs$playlist_subgenre)
Is there any missing data?
nrow(songs[complete.cases(songs), ])
## [1] 30942
colSums(is.na(songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 1886 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
The NA values in this data set occur in the track_name, track_artist, track_album_name, and track_album_release_date columns. Given the nature of these columns, it does not make sense to impute the missing values. Since we aren’t using track_album_release_date in our analysis, we’ll just drop it from the data set.
songs <- songs[ , !(names(songs) %in% c("track_album_release_date"))]
How many rows still have missing data?
na_rows <- nrow(songs) - nrow(songs[complete.cases(songs), ])
pct_na_rows <- na_rows / nrow(songs)
Only 5 rows are still missing data at this point. This is only about 0.015% of the total number of rows, so we’ll just drop those rows from the data set.
songs <- na.omit(songs)
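The share of incomplete rows is small enough that rounding it to a whole percent hides it. As a minimal sketch, the arithmetic can be reproduced from the two counts reported above (5 incomplete rows out of 32,833); the variable `pct_label` is introduced here only for illustration.

```r
# Counts taken from the output above: 5 incomplete rows out of 32,833.
n_rows  <- 32833
na_rows <- 5

pct_na_rows <- na_rows / n_rows

# Keep three decimal places so the value does not round to 0%.
pct_label <- sprintf("%.3f%%", 100 * pct_na_rows)
pct_label
```

Since the scales package is already loaded, `scales::percent(pct_na_rows, accuracy = 0.001)` should give an equivalent label.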
Next we’ll look at summaries of each of the numeric values to be sure they make sense, and check for any outliers.
summary(songs)
## track_id track_name track_artist track_popularity
## Length:32828 Length:32828 Length:32828 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
##
## track_album_id track_album_name playlist_name playlist_id
## Length:32828 Length:32828 Length:32828 Length:32828
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## playlist_genre playlist_subgenre danceability
## edm :6043 progressive electro house: 1809 Min. :0.0000
## latin:5153 southern hip hop : 1674 1st Qu.:0.5630
## pop :5507 indie poptimism : 1672 Median :0.6720
## r&b :5431 latin hip hop : 1655 Mean :0.6549
## rap :5743 neo soul : 1637 3rd Qu.:0.7610
## rock :4951 pop edm : 1517 Max. :0.9830
## (Other) :22864
## energy key loudness mode
## Min. :0.000175 Min. : 0.000 Min. :-46.448 Min. :0.0000
## 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171 1st Qu.:0.0000
## Median :0.721000 Median : 6.000 Median : -6.166 Median :1.0000
## Mean :0.698603 Mean : 5.374 Mean : -6.720 Mean :0.5657
## 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645 3rd Qu.:1.0000
## Max. :1.000000 Max. :11.000 Max. : 1.275 Max. :1.0000
##
## speechiness acousticness instrumentalness liveness
## Min. :0.0000 Min. :0.0000 Min. :0.0000000 Min. :0.0000
## 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000 1st Qu.:0.0927
## Median :0.0625 Median :0.0804 Median :0.0000161 Median :0.1270
## Mean :0.1071 Mean :0.1754 Mean :0.0847599 Mean :0.1902
## 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300 3rd Qu.:0.2480
## Max. :0.9180 Max. :0.9940 Max. :0.9940000 Max. :0.9960
##
## valence tempo duration_ms
## Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187805
## Median :0.5120 Median :121.98 Median :216000
## Mean :0.5106 Mean :120.88 Mean :225797
## 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253581
## Max. :0.9910 Max. :239.44 Max. :517810
##
songs %>%
  summarise(across(where(is.numeric), mean))
## track_popularity danceability energy key loudness mode
## 1 42.48355 0.6548504 0.6986026 5.373949 -6.719529 0.5657366
## speechiness acousticness instrumentalness liveness valence tempo
## 1 0.1070535 0.1753515 0.08475987 0.1901754 0.5105559 120.8836
## duration_ms
## 1 225796.8
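The numeric variables look plausible overall. Bounded measures such as danceability and energy stay within their stated 0.0 to 1.0 scale, and mode is a binary variable with 0 and 1 as the only possible values. A few extremes are worth noting: the maximum loudness (1.275 dB) sits slightly above the typical range of -60 to 0 dB given in the codebook, the minimum tempo is 0, and the shortest track lasts only 4000 ms (4 seconds). We leave these values in place. The longest song has a duration of 517810 ms, which is about 8.63 minutes and seems entirely reasonable.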
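A range check like the one described above can also be done programmatically rather than by reading the summary. The following is only a sketch on a hypothetical four-row data frame `toy`; the danceability and loudness bounds come from the codebook, while the tempo and duration bounds are assumptions chosen for illustration.

```r
# Hypothetical miniature of the numeric columns; values are illustrative only.
toy <- data.frame(
  danceability = c(0.75, 0.52, 0.00, 0.98),
  loudness     = c(-2.6, -46.4, 1.3, -6.2),
  tempo        = c(122, 0, 124, 239),
  duration_ms  = c(194754, 4000, 517810, 216000)
)

# Expected [lower, upper] range per column.
# danceability and loudness follow the codebook; tempo and duration_ms
# bounds are assumptions made for this sketch.
expected <- list(
  danceability = c(0, 1),
  loudness     = c(-60, 0),
  tempo        = c(1, 250),
  duration_ms  = c(30000, 600000)
)

# Count the values that fall outside the expected range in each column.
out_of_range <- sapply(names(expected), function(col) {
  rng <- expected[[col]]
  sum(toy[[col]] < rng[1] | toy[[col]] > rng[2])
})
out_of_range
```

Applied to the real `songs` data, a non-zero count would point to the columns that deserve a closer look before modeling.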
Finally, there are several columns we won’t need for our analysis, so we’ll drop them here:
unused_cols <- c('track_id', 'track_album_id', 'track_album_name',
                 'playlist_name', 'playlist_id', 'playlist_subgenre')
# Logical indexing is safe even if none of the names match
songs <- songs[ , !(names(songs) %in% unused_cols)]
The data is now clean, with 32828 observations of 16 variables. Here’s what it looks like:
dim(songs)
## [1] 32828 16
head(songs)
## track_name track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2 Memories - Dillon Francis Remix Maroon 5
## 3 All the Time - Don Diablo Remix Zara Larsson
## 4 Call You Mine - Keanu Silva Remix The Chainsmokers
## 5 Someone You Loved - Future Humans Remix Lewis Capaldi
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity playlist_genre danceability energy key loudness mode
## 1 66 pop 0.748 0.916 6 -2.634 1
## 2 67 pop 0.726 0.815 11 -4.969 1
## 3 70 pop 0.675 0.931 1 -3.432 0
## 4 60 pop 0.718 0.930 7 -3.778 1
## 5 69 pop 0.650 0.833 1 -4.672 1
## 6 67 pop 0.675 0.919 8 -5.385 1
## speechiness acousticness instrumentalness liveness valence tempo
## 1 0.0583 0.1020 0.00e+00 0.0653 0.518 122.036
## 2 0.0373 0.0724 4.21e-03 0.3570 0.693 99.972
## 3 0.0742 0.0794 2.33e-05 0.1100 0.613 124.008
## 4 0.1020 0.0287 9.43e-06 0.2040 0.277 121.956
## 5 0.0359 0.0803 0.00e+00 0.0833 0.725 123.976
## 6 0.1270 0.0799 0.00e+00 0.1430 0.585 124.982
## duration_ms
## 1 194754
## 2 162600
## 3 176616
## 4 169093
## 5 189052
## 6 163049
The variables we are concerned with are track_artist and playlist_genre, and the quantifying variables track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, and valence.
Here are summaries of the numeric columns of interest, broken down by playlist_genre:
table1::table1(~ track_popularity + danceability + energy + key + loudness + mode + speechiness + acousticness + instrumentalness +liveness + valence | playlist_genre, data = songs)
| | edm (N=6043) | latin (N=5153) | pop (N=5507) | r&b (N=5431) | rap (N=5743) | rock (N=4951) | Overall (N=32828) |
|---|---|---|---|---|---|---|---|
| track_popularity | |||||||
| Mean (SD) | 34.8 (23.2) | 47.0 (25.4) | 47.7 (25.2) | 41.2 (25.9) | 43.2 (23.3) | 41.7 (24.8) | 42.5 (25.0) |
| Median [Min, Max] | 36.0 [0, 99.0] | 50.0 [0, 100] | 52.0 [0, 100] | 44.0 [0, 99.0] | 47.0 [0, 98.0] | 46.0 [0, 95.0] | 45.0 [0, 100] |
| danceability | |||||||
| Mean (SD) | 0.655 (0.124) | 0.713 (0.115) | 0.639 (0.128) | 0.670 (0.138) | 0.718 (0.136) | 0.521 (0.140) | 0.655 (0.145) |
| Median [Min, Max] | 0.659 [0.162, 0.983] | 0.729 [0.0771, 0.979] | 0.652 [0.0985, 0.979] | 0.689 [0.140, 0.977] | 0.737 [0.150, 0.975] | 0.523 [0, 0.956] | 0.672 [0, 0.983] |
| energy | |||||||
| Mean (SD) | 0.802 (0.139) | 0.708 (0.152) | 0.701 (0.171) | 0.591 (0.179) | 0.651 (0.170) | 0.733 (0.195) | 0.699 (0.181) |
| Median [Min, Max] | 0.830 [0.106, 0.998] | 0.729 [0.000175, 1.00] | 0.727 [0.00814, 0.999] | 0.596 [0.0118, 0.995] | 0.665 [0.0161, 0.999] | 0.775 [0.0167, 0.998] | 0.721 [0.000175, 1.00] |
| key | |||||||
| Mean (SD) | 5.35 (3.56) | 5.48 (3.64) | 5.32 (3.64) | 5.40 (3.60) | 5.47 (3.70) | 5.21 (3.53) | 5.37 (3.61) |
| Median [Min, Max] | 6.00 [0, 11.0] | 6.00 [0, 11.0] | 5.00 [0, 11.0] | 6.00 [0, 11.0] | 6.00 [0, 11.0] | 5.00 [0, 11.0] | 6.00 [0, 11.0] |
| loudness | |||||||
| Mean (SD) | -5.43 (2.37) | -6.26 (2.87) | -6.32 (2.62) | -7.86 (2.89) | -7.04 (3.06) | -7.59 (3.38) | -6.72 (2.99) |
| Median [Min, Max] | -4.96 [-19.6, 1.14] | -5.75 [-46.4, -0.0460] | -5.84 [-26.3, -0.700] | -7.42 [-34.3, -0.478] | -6.51 [-26.2, 0.642] | -7.01 [-26.1, 1.28] | -6.17 [-46.4, 1.28] |
| mode | |||||||
| Mean (SD) | 0.520 (0.500) | 0.562 (0.496) | 0.588 (0.492) | 0.521 (0.500) | 0.522 (0.500) | 0.700 (0.458) | 0.566 (0.496) |
| Median [Min, Max] | 1.00 [0, 1.00] | 1.00 [0, 1.00] | 1.00 [0, 1.00] | 1.00 [0, 1.00] | 1.00 [0, 1.00] | 1.00 [0, 1.00] | 1.00 [0, 1.00] |
| speechiness | |||||||
| Mean (SD) | 0.0867 (0.0711) | 0.103 (0.0877) | 0.0740 (0.0678) | 0.117 (0.107) | 0.197 (0.132) | 0.0577 (0.0453) | 0.107 (0.101) |
| Median [Min, Max] | 0.0599 [0.0239, 0.624] | 0.0674 [0.0232, 0.662] | 0.0490 [0.0228, 0.869] | 0.0679 [0.0224, 0.918] | 0.177 [0.0243, 0.877] | 0.0419 [0, 0.488] | 0.0625 [0, 0.918] |
| acousticness | |||||||
| Mean (SD) | 0.0815 (0.145) | 0.211 (0.214) | 0.171 (0.219) | 0.260 (0.256) | 0.193 (0.219) | 0.145 (0.211) | 0.175 (0.220) |
| Median [Min, Max] | 0.0193 [0.00000251, 0.985] | 0.139 [0.0000356, 0.989] | 0.0767 [0.00000243, 0.992] | 0.165 [0.0000251, 0.989] | 0.109 [0.00000166, 0.994] | 0.0373 [0, 0.986] | 0.0804 [0, 0.994] |
| instrumentalness | |||||||
| Mean (SD) | 0.219 (0.326) | 0.0445 (0.168) | 0.0599 (0.184) | 0.0289 (0.122) | 0.0760 (0.230) | 0.0624 (0.174) | 0.0848 (0.224) |
| Median [Min, Max] | 0.00399 [0, 0.987] | 0.00000237 [0, 0.994] | 0.0000109 [0, 0.982] | 0.00000457 [0, 0.969] | 0 [0, 0.974] | 0.000211 [0, 0.969] | 0.0000161 [0, 0.994] |
| liveness | |||||||
| Mean (SD) | 0.212 (0.172) | 0.181 (0.151) | 0.177 (0.136) | 0.175 (0.140) | 0.192 (0.149) | 0.203 (0.170) | 0.190 (0.154) |
| Median [Min, Max] | 0.139 [0.0131, 0.976] | 0.121 [0.00946, 0.990] | 0.123 [0.0158, 0.988] | 0.120 [0.0165, 0.991] | 0.128 [0.00936, 0.983] | 0.136 [0, 0.996] | 0.127 [0, 0.996] |
| valence | |||||||
| Mean (SD) | 0.401 (0.226) | 0.605 (0.222) | 0.504 (0.220) | 0.531 (0.226) | 0.505 (0.225) | 0.537 (0.229) | 0.511 (0.233) |
| Median [Min, Max] | 0.370 [0.0269, 0.983] | 0.628 [0.0000100, 0.976] | 0.500 [0.0276, 0.981] | 0.542 [0.0366, 0.990] | 0.517 [0.0292, 0.977] | 0.531 [0, 0.991] | 0.512 [0, 0.991] |
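The per-genre means in the table can also be computed directly, which is convenient when the numbers are needed for later steps rather than for display. This is a minimal sketch using base R's `aggregate()` on a hypothetical six-row data frame `toy`; the values are made up and do not come from the Spotify data.

```r
# Hypothetical miniature of the cleaned data; values are illustrative only.
toy <- data.frame(
  playlist_genre   = c("edm", "edm", "pop", "pop", "rock", "rock"),
  track_popularity = c(30, 40, 50, 60, 35, 45),
  energy           = c(0.9, 0.7, 0.6, 0.8, 0.7, 0.9)
)

# Mean of each numeric column within each genre.
genre_means <- aggregate(
  cbind(track_popularity, energy) ~ playlist_genre,
  data = toy,
  FUN  = mean
)
genre_means
```

Swapping `mean` for `sd` or `median` gives the other statistics shown in the table.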
We will look at the typical characteristics of each variable in each genre. We will compare the variables in pairs to determine whether any of them influences another, and whether each is significant to the overall model. New variable creation will not be necessary, though further grouping by liveness or instrumentalness may be helpful.
By listing summary tables of variables we can see representative characteristics within each genre. Scatter plots, histograms, box plots, and plot pairs will help visualize correlations.
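One way to carry out the pairwise comparison is a correlation matrix over the numeric columns. The sketch below uses simulated data in a hypothetical data frame `toy`, in which loudness is deliberately constructed to depend on energy; it shows the mechanics only and says nothing about the real relationships in the Spotify data.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for three audio features (assumption: not real data).
energy   <- runif(n)
loudness <- -12 + 8 * energy + rnorm(n, sd = 1)  # built to correlate with energy
valence  <- runif(n)                             # independent of the others
toy <- data.frame(energy, loudness, valence)

# Pearson correlation between every pair of columns.
cor_mat <- cor(toy)
round(cor_mat, 2)

# pairs(toy) would draw the matching scatter-plot matrix.
```

Strongly correlated predictors, like energy and loudness in this toy example, are worth noting before fitting a regression because they make individual coefficients harder to interpret.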
We still need to determine whether multiple linear regression is adequate for this type of data, or whether other analysis methods would suit it better.
Multiple linear regression will be used to estimate how each characteristic relates to popularity and to build our predictive model.
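The per-genre modeling plan can be sketched as one `lm()` fit for each level of playlist_genre. The example below runs on simulated data in a hypothetical data frame `toy`, with made-up coefficients and only two predictors; it illustrates the workflow under those assumptions and is not the final model.

```r
set.seed(42)
n <- 300

# Simulated stand-in for the cleaned data (assumption: not real values).
toy <- data.frame(
  playlist_genre = factor(rep(c("edm", "pop", "rock"), each = n / 3)),
  danceability   = runif(n),
  energy         = runif(n)
)
toy$track_popularity <- 40 + 20 * toy$danceability -
  10 * toy$energy + rnorm(n, sd = 5)

# Fit one multiple linear regression per genre.
fits <- lapply(split(toy, toy$playlist_genre), function(d) {
  lm(track_popularity ~ danceability + energy, data = d)
})

# One row of estimated coefficients per genre.
coef_table <- t(sapply(fits, coef))
round(coef_table, 2)

# Predicted popularity for a hypothetical new pop song.
new_song <- data.frame(danceability = 0.8, energy = 0.6)
predict(fits[["pop"]], newdata = new_song)
```

Because track_popularity is bounded between 0 and 100, a linear model can produce predictions outside that range, which is one reason to check whether the method is adequate.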