Project - Midterm

Section 1: Introduction

Streaming music services have collected a wealth of data on song characteristics, stream counts, and user interaction. Spotify has used this data to create playlists curated to a user’s taste based on their listening habits. Given how Spotify has quantified songs, is it possible to produce a predictive model for a song’s popularity?

With the Spotify dataset retrieved with the spotifyr package, we will look at 12 variables from almost 33,000 songs and see how they interact with each other, and whether they are significant in determining popularity. As genres of music can have very different characteristics, it may make more sense to gauge popularity based on each genre’s representative statistics rather than overall.

We will first compare variables to determine which have influence on the popularity of a song. We will then apply a multiple linear regression to the data to make a prediction as to what characteristics are likely, within each genre, to produce a popular song.

A predictive model of popularity can help songwriters craft a song based on data-driven methods. Users will also benefit by being able to discover songs based on their preferred characteristics, and see whether they prefer more or less popular songs.

Section 2: Required Packages

library(tidyverse)
library(scales)
library(table1)

This analysis will make use of the following packages:

tidyverse - A collection of various packages designed to make it easier to make data tidy for analysis.
scales - A package of scaling and label-formatting functions (e.g., for formatting numbers as percentages).
table1 - A package for the creation of HTML tables of descriptive statistics.

Section 3: Data Preparation

songs <- read.csv('spotify_songs.csv', stringsAsFactors = FALSE)

The data used in this analysis is available as part of the tidytuesdayR package, and is also available for download here.

The codebook is also available on the tidytuesday GitHub page.

The data set, which was originally created on 2020-01-21, consists of 32,833 records, each with 23 columns. Each record represents a single song, and the columns represent various aspects of each song:

htmltools::includeHTML("codebook.html")

variable	class	description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds

Because the columns are named sensibly, we won’t need to re-name any of them.

In examining the structure of the data set, we can see that track_album_release_date should be re-formatted as a date, and that playlist_genre and playlist_subgenre should be made into factors.

str(songs)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

songs$track_album_release_date <- as.Date(songs$track_album_release_date)
songs$playlist_genre <- as.factor(songs$playlist_genre)
songs$playlist_subgenre <- as.factor(songs$playlist_subgenre)
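One thing to keep in mind with this conversion: as.Date() does not fail on strings it cannot parse, it silently returns NA for them. The sketch below uses made-up date strings (not taken from the data set) to show that behaviour; whether the raw file actually contains partial dates like these is an assumption, not something shown in the output above.

```r
# Hypothetical release-date strings: one full date and two partial ones.
release_dates <- c("2019-06-14", "2012", "2015-03")

# as.Date() settles on a format using the first non-missing element
# ("%Y-%m-%d" here) and returns NA for every element that does not match it.
parsed <- as.Date(release_dates)

is.na(parsed)       # which entries failed to parse
sum(is.na(parsed))  # how many NA values the conversion introduced
```

Comparing sum(is.na(...)) before and after the conversion is a quick way to tell values that were missing in the file from values lost during parsing.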

Is there any missing data?

nrow(songs[complete.cases(songs), ])

## [1] 30942

colSums(is.na(songs))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                     1886                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

The NA values in this data set exist in the track_name, track_artist, track_album_name, and track_album_release_date columns. Given the nature of the columns with missing values, it does not make sense for us to make any imputations. Since we aren’t using track_album_release_date in our analysis, we’ll just drop it from the data set.

songs <- songs[ , !(names(songs) %in% c("track_album_release_date"))]

How many rows still have missing data?

na_rows <- nrow(songs) - nrow(songs[complete.cases(songs), ])
pct_na_rows <- na_rows / nrow(songs)

Only 5 rows are still missing data at this point. This is only about 0.015% of the total number of rows (5 of 32,833), so we’ll just drop those rows from the data set.
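Because the share of incomplete rows is so small, it rounds to zero when formatted as a whole-number percentage. A minimal sketch of the arithmetic, using the row counts reported above and base R formatting (the variable pct_label is introduced here only for illustration):

```r
n_rows  <- 32833  # rows in the data set before dropping incomplete cases
na_rows <- 5      # rows that still contain at least one NA

pct_na_rows <- na_rows / n_rows

# Keep three decimal places so a very small share does not display as 0%.
pct_label <- sprintf("%.3f%%", 100 * pct_na_rows)
pct_label
```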

songs <- na.omit(songs)

Next we’ll look at summaries of each of the numeric values to be sure they make sense, and check for any outliers.

summary(songs)

##    track_id          track_name        track_artist       track_popularity
##  Length:32828       Length:32828       Length:32828       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##                                                                           
##  track_album_id     track_album_name   playlist_name      playlist_id       
##  Length:32828       Length:32828       Length:32828       Length:32828      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  playlist_genre                 playlist_subgenre  danceability   
##  edm  :6043     progressive electro house: 1809   Min.   :0.0000  
##  latin:5153     southern hip hop         : 1674   1st Qu.:0.5630  
##  pop  :5507     indie poptimism          : 1672   Median :0.6720  
##  r&b  :5431     latin hip hop            : 1655   Mean   :0.6549  
##  rap  :5743     neo soul                 : 1637   3rd Qu.:0.7610  
##  rock :4951     pop edm                  : 1517   Max.   :0.9830  
##                 (Other)                  :22864                   
##      energy              key            loudness            mode       
##  Min.   :0.000175   Min.   : 0.000   Min.   :-46.448   Min.   :0.0000  
##  1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171   1st Qu.:0.0000  
##  Median :0.721000   Median : 6.000   Median : -6.166   Median :1.0000  
##  Mean   :0.698603   Mean   : 5.374   Mean   : -6.720   Mean   :0.5657  
##  3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :11.000   Max.   :  1.275   Max.   :1.0000  
##                                                                        
##   speechiness      acousticness    instrumentalness       liveness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000   1st Qu.:0.0927  
##  Median :0.0625   Median :0.0804   Median :0.0000161   Median :0.1270  
##  Mean   :0.1071   Mean   :0.1754   Mean   :0.0847599   Mean   :0.1902  
##  3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300   3rd Qu.:0.2480  
##  Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960  
##                                                                        
##     valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187805  
##  Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.5106   Mean   :120.88   Mean   :225797  
##  3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253581  
##  Max.   :0.9910   Max.   :239.44   Max.   :517810  
##

songs %>%
  summarise(across(where(is.numeric), mean))

##   track_popularity danceability    energy      key  loudness      mode
## 1         42.48355    0.6548504 0.6986026 5.373949 -6.719529 0.5657366
##   speechiness acousticness instrumentalness  liveness   valence    tempo
## 1   0.1070535    0.1753515       0.08475987 0.1901754 0.5105559 120.8836
##   duration_ms
## 1    225796.8

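There do not appear to be any problematic outliers among the numeric variables, and the bounded features (e.g., danceability and energy, which are measured on a scale of 0.0 - 1.0) are all within their stated ranges. A few extremes in the summary are worth noting: a minimum tempo of 0, a minimum duration of 4000 ms (4 seconds), and a maximum loudness of 1.275 dB, slightly above the typical range of -60 to 0 dB. mode is a binary variable with 0 and 1 as the only possible values. The longest song has a duration of 517810 ms, which is about 8.63 minutes and seems entirely reasonable.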
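The range check can also be done programmatically rather than by reading the summary. The following is a minimal sketch on a toy data frame; the values and the expected list are illustrative assumptions (the bounds come from the codebook descriptions, and the loudness bounds are only described there as typical).

```r
# Toy data frame standing in for `songs`; the values are illustrative only.
toy <- data.frame(
  danceability = c(0.75, 0.62, 0.00, 0.98),
  energy       = c(0.92, 0.58, 0.70, 1.00),
  loudness     = c(-2.6, -8.2, -46.4, 1.3)
)

# Expected ranges per column, based on the codebook descriptions.
expected <- list(
  danceability = c(0, 1),
  energy       = c(0, 1),
  loudness     = c(-60, 0)
)

# For each listed column, count the values that fall outside its range.
out_of_range <- sapply(names(expected), function(col) {
  rng <- expected[[col]]
  sum(toy[[col]] < rng[1] | toy[[col]] > rng[2])
})
out_of_range
```

A non-zero count does not by itself mean a value is wrong; it only flags rows worth a closer look.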

Finally, there are several columns we won’t need for our analysis, so we’ll drop them here:

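# Columns that are not needed for the analysis
unused_cols <- c('track_id', 'track_album_id', 'track_album_name', 
                 'playlist_name', 'playlist_id', 'playlist_subgenre')
# Use a logical mask; -which() would drop every column if nothing matched
songs <- songs[ , !(names(songs) %in% unused_cols)]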
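A note on dropping columns by name: a logical mask is safer than negating which(), because the two behave differently when none of the names match. The sketch below shows the difference on a small toy data frame (the column names are hypothetical).

```r
toy <- data.frame(a = 1:3, b = 4:6, d = 7:9)
no_match <- c("x", "y")  # hypothetical names, none of them present in `toy`

# Negated which(): with no matches, which() returns integer(0), and
# -integer(0) is still integer(0), so every column is dropped.
dropped_all <- toy[ , -which(names(toy) %in% no_match)]
ncol(dropped_all)

# Logical mask: with no matches, all columns are kept as intended.
kept_all <- toy[ , !(names(toy) %in% no_match)]
ncol(kept_all)
```

When at least one name matches, both forms return the same columns, so the difference only shows up in the no-match case.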

The data is now clean, with 32828 observations of 16 variables. Here’s what it looks like:

dim(songs)

## [1] 32828    16

head(songs)

##                                              track_name     track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix       Ed Sheeran
## 2                       Memories - Dillon Francis Remix         Maroon 5
## 3                       All the Time - Don Diablo Remix     Zara Larsson
## 4                     Call You Mine - Keanu Silva Remix The Chainsmokers
## 5               Someone You Loved - Future Humans Remix    Lewis Capaldi
## 6     Beautiful People (feat. Khalid) - Jack Wins Remix       Ed Sheeran
##   track_popularity playlist_genre danceability energy key loudness mode
## 1               66            pop        0.748  0.916   6   -2.634    1
## 2               67            pop        0.726  0.815  11   -4.969    1
## 3               70            pop        0.675  0.931   1   -3.432    0
## 4               60            pop        0.718  0.930   7   -3.778    1
## 5               69            pop        0.650  0.833   1   -4.672    1
## 6               67            pop        0.675  0.919   8   -5.385    1
##   speechiness acousticness instrumentalness liveness valence   tempo
## 1      0.0583       0.1020         0.00e+00   0.0653   0.518 122.036
## 2      0.0373       0.0724         4.21e-03   0.3570   0.693  99.972
## 3      0.0742       0.0794         2.33e-05   0.1100   0.613 124.008
## 4      0.1020       0.0287         9.43e-06   0.2040   0.277 121.956
## 5      0.0359       0.0803         0.00e+00   0.0833   0.725 123.976
## 6      0.1270       0.0799         0.00e+00   0.1430   0.585 124.982
##   duration_ms
## 1      194754
## 2      162600
## 3      176616
## 4      169093
## 5      189052
## 6      163049

The variables we are concerned with are track_artist and playlist_genre, and the quantifying variables track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, and valence.

track_artist: There are 10,692 distinct artists in this data set.
playlist_genre: Each song is classified by one of 6 distinct genres.
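Counts like these can be computed directly from the columns. A minimal sketch, using a toy data frame in place of songs (the rows are illustrative, and the names n_artists and n_genres are introduced here only for the example):

```r
toy <- data.frame(
  track_artist   = c("Ed Sheeran", "Maroon 5", "Ed Sheeran", "Zara Larsson"),
  playlist_genre = factor(c("pop", "pop", "pop", "edm"))
)

# Number of distinct artists
n_artists <- length(unique(toy$track_artist))

# Number of genres actually present (droplevels() removes unused levels)
n_genres <- nlevels(droplevels(toy$playlist_genre))

c(artists = n_artists, genres = n_genres)
```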

Here are summaries of the numeric columns of interest, broken down by playlist_genre:

table1::table1(~ track_popularity + danceability + energy + key + loudness + mode +
                 speechiness + acousticness + instrumentalness + liveness + valence |
                 playlist_genre,
               data = songs)

	edm (N=6043)	latin (N=5153)	pop (N=5507)	r&b (N=5431)	rap (N=5743)	rock (N=4951)	Overall (N=32828)
track_popularity
Mean (SD)	34.8 (23.2)	47.0 (25.4)	47.7 (25.2)	41.2 (25.9)	43.2 (23.3)	41.7 (24.8)	42.5 (25.0)
Median [Min, Max]	36.0 [0, 99.0]	50.0 [0, 100]	52.0 [0, 100]	44.0 [0, 99.0]	47.0 [0, 98.0]	46.0 [0, 95.0]	45.0 [0, 100]
danceability
Mean (SD)	0.655 (0.124)	0.713 (0.115)	0.639 (0.128)	0.670 (0.138)	0.718 (0.136)	0.521 (0.140)	0.655 (0.145)
Median [Min, Max]	0.659 [0.162, 0.983]	0.729 [0.0771, 0.979]	0.652 [0.0985, 0.979]	0.689 [0.140, 0.977]	0.737 [0.150, 0.975]	0.523 [0, 0.956]	0.672 [0, 0.983]
energy
Mean (SD)	0.802 (0.139)	0.708 (0.152)	0.701 (0.171)	0.591 (0.179)	0.651 (0.170)	0.733 (0.195)	0.699 (0.181)
Median [Min, Max]	0.830 [0.106, 0.998]	0.729 [0.000175, 1.00]	0.727 [0.00814, 0.999]	0.596 [0.0118, 0.995]	0.665 [0.0161, 0.999]	0.775 [0.0167, 0.998]	0.721 [0.000175, 1.00]
key
Mean (SD)	5.35 (3.56)	5.48 (3.64)	5.32 (3.64)	5.40 (3.60)	5.47 (3.70)	5.21 (3.53)	5.37 (3.61)
Median [Min, Max]	6.00 [0, 11.0]	6.00 [0, 11.0]	5.00 [0, 11.0]	6.00 [0, 11.0]	6.00 [0, 11.0]	5.00 [0, 11.0]	6.00 [0, 11.0]
loudness
Mean (SD)	-5.43 (2.37)	-6.26 (2.87)	-6.32 (2.62)	-7.86 (2.89)	-7.04 (3.06)	-7.59 (3.38)	-6.72 (2.99)
Median [Min, Max]	-4.96 [-19.6, 1.14]	-5.75 [-46.4, -0.0460]	-5.84 [-26.3, -0.700]	-7.42 [-34.3, -0.478]	-6.51 [-26.2, 0.642]	-7.01 [-26.1, 1.28]	-6.17 [-46.4, 1.28]
mode
Mean (SD)	0.520 (0.500)	0.562 (0.496)	0.588 (0.492)	0.521 (0.500)	0.522 (0.500)	0.700 (0.458)	0.566 (0.496)
Median [Min, Max]	1.00 [0, 1.00]	1.00 [0, 1.00]	1.00 [0, 1.00]	1.00 [0, 1.00]	1.00 [0, 1.00]	1.00 [0, 1.00]	1.00 [0, 1.00]
speechiness
Mean (SD)	0.0867 (0.0711)	0.103 (0.0877)	0.0740 (0.0678)	0.117 (0.107)	0.197 (0.132)	0.0577 (0.0453)	0.107 (0.101)
Median [Min, Max]	0.0599 [0.0239, 0.624]	0.0674 [0.0232, 0.662]	0.0490 [0.0228, 0.869]	0.0679 [0.0224, 0.918]	0.177 [0.0243, 0.877]	0.0419 [0, 0.488]	0.0625 [0, 0.918]
acousticness
Mean (SD)	0.0815 (0.145)	0.211 (0.214)	0.171 (0.219)	0.260 (0.256)	0.193 (0.219)	0.145 (0.211)	0.175 (0.220)
Median [Min, Max]	0.0193 [0.00000251, 0.985]	0.139 [0.0000356, 0.989]	0.0767 [0.00000243, 0.992]	0.165 [0.0000251, 0.989]	0.109 [0.00000166, 0.994]	0.0373 [0, 0.986]	0.0804 [0, 0.994]
instrumentalness
Mean (SD)	0.219 (0.326)	0.0445 (0.168)	0.0599 (0.184)	0.0289 (0.122)	0.0760 (0.230)	0.0624 (0.174)	0.0848 (0.224)
Median [Min, Max]	0.00399 [0, 0.987]	0.00000237 [0, 0.994]	0.0000109 [0, 0.982]	0.00000457 [0, 0.969]	0 [0, 0.974]	0.000211 [0, 0.969]	0.0000161 [0, 0.994]
liveness
Mean (SD)	0.212 (0.172)	0.181 (0.151)	0.177 (0.136)	0.175 (0.140)	0.192 (0.149)	0.203 (0.170)	0.190 (0.154)
Median [Min, Max]	0.139 [0.0131, 0.976]	0.121 [0.00946, 0.990]	0.123 [0.0158, 0.988]	0.120 [0.0165, 0.991]	0.128 [0.00936, 0.983]	0.136 [0, 0.996]	0.127 [0, 0.996]
valence
Mean (SD)	0.401 (0.226)	0.605 (0.222)	0.504 (0.220)	0.531 (0.226)	0.505 (0.225)	0.537 (0.229)	0.511 (0.233)
Median [Min, Max]	0.370 [0.0269, 0.983]	0.628 [0.0000100, 0.976]	0.500 [0.0276, 0.981]	0.542 [0.0366, 0.990]	0.517 [0.0292, 0.977]	0.531 [0, 0.991]	0.512 [0, 0.991]

Section 4: Proposed Exploratory Data Analysis

We will look at the typical characteristics of each variable in each genre. We will compare the variables in pairs to determine whether any of them influence one another, and whether they are significant to the overall model. New variable creation will not be necessary, although further grouping by liveness or instrumentalness may be helpful.

By listing summary tables of variables we can see representative characteristics within each genre. Scatter plots, histograms, box plots, and plot pairs will help visualize correlations.
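As a starting point for the pairwise comparison, a correlation matrix gives a numeric counterpart to the scatter plots. The sketch below runs on simulated data, not on the Spotify data; the relationships built into it (for example between energy and loudness) are assumptions made purely for illustration.

```r
set.seed(1)
n <- 200

# Simulated stand-in for a few numeric columns of `songs`.
toy <- data.frame(energy = runif(n), valence = runif(n))
toy$loudness         <- -15 + 10 * toy$energy + rnorm(n, sd = 1.5)
toy$track_popularity <- 40 + 10 * toy$valence + rnorm(n, sd = 20)

# Pairwise Pearson correlations between all numeric columns.
cor_mat <- cor(toy)
round(cor_mat, 2)

# A scatter plot matrix of the same columns could be drawn with:
# pairs(toy)
```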

We will need to research if multiple linear regression is adequate for this type of data, or if there are other analysis methodologies that we should research as well.
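One simple adequacy check follows from the fact that track_popularity is bounded between 0 and 100, while a linear model is not. The sketch below, on simulated data with made-up coefficients, fits a model and then looks at the adjusted R-squared and at how many fitted values fall outside the valid range. It is one possible check under these assumptions, not a complete diagnostic procedure.

```r
set.seed(7)
n <- 500

# Simulated predictors and a response clipped to the 0-100 popularity scale.
toy <- data.frame(energy = runif(n), valence = runif(n))
raw <- 45 + 15 * toy$valence - 10 * toy$energy + rnorm(n, sd = 25)
toy$track_popularity <- pmin(100, pmax(0, raw))

fit <- lm(track_popularity ~ energy + valence, data = toy)

# Share of variance explained, adjusted for the number of predictors.
adj_r2 <- summary(fit)$adj.r.squared

# Share of fitted values outside the valid 0-100 range.
share_outside <- mean(fitted(fit) < 0 | fitted(fit) > 100)

c(adj_r2 = adj_r2, share_outside = share_outside)

# Residuals against fitted values could be inspected with:
# plot(fitted(fit), resid(fit))
```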

Provided it proves adequate, multiple linear regression will be used to estimate how these variables relate to track_popularity and to build our predictive model.
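A minimal sketch of what fitting a separate model for each genre could look like, using simulated data and only two predictors. The data, the choice of predictors, and the coefficients are all assumptions for illustration and are not results from this analysis.

```r
set.seed(42)
n <- 300

# Simulated stand-in for `songs` with three genres and two predictors.
toy <- data.frame(
  playlist_genre = factor(sample(c("edm", "pop", "rock"), n, replace = TRUE)),
  danceability   = runif(n),
  energy         = runif(n)
)
toy$track_popularity <- 30 + 20 * toy$danceability + rnorm(n, sd = 10)

# Split the data by genre and fit one multiple linear regression per subset.
fits <- lapply(split(toy, toy$playlist_genre), function(d) {
  lm(track_popularity ~ danceability + energy, data = d)
})

# Collect the coefficients: one row per term, one column per genre.
coef_by_genre <- sapply(fits, coef)
round(coef_by_genre, 2)
```

Calling summary() on an individual element, such as fits$pop, would show the standard errors and p-values for that genre's model.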

Johnny Arguedas & Mark Hageman

11/01/2020
