This project analyzes an excerpt of the Spotify database. It will be executed in two major phases:
Phase 1: Analyze the data and look for associations between song characteristics and song genres & sub-genres. This will include data clean-up, data wrangling and data visualization.
Phase 2: Create models to predict song popularity based on the most relevant song characteristics identified in Phase 1. This phase will include variable selection and evaluation of various model architectures.
Learnings from these analyses and the song popularity predictive model will be used by MakeYourSong, a made-up start-up, to guide its users on which song characteristics are likely to drive popularity. The predictive model will be available to MakeYourSong users.
#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("corrplot")
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
tidyverse - for interacting with data through subsetting, transformation, visualization, etc.
dplyr - for data manipulation in R by combining, selecting, grouping, subsetting and transforming all or parts of dataset
ggplot2 - for declaratively creating graphics, based on The Grammar of Graphics
plotly - for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js
corrplot - for visualizing correlation matrices and confidence intervals
The dataset is available on GitHub. Link to the data source is here.
The data to be analyzed is an excerpt of the Spotify database: 32,833 songs released between 1957 and 2020, described by 23 variables. There are 10,693 artists and 6 main genres, each with its own sub-genres. There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode.
Genres were selected from Every Noise, a visualization of the Spotify genre-space maintained by a genre taxonomist. The top four sub-genres for each genre were used to query Spotify for 20 playlists each, resulting in about 5,000 songs for each genre, split across a varied sub-genre space.
You can find the code for generating the dataset in spotify_dataset.R in the full GitHub repo.
# Code to import the data
spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/spotify.csv")
dictionary_spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/dictionary_spotify.csv")
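The absolute paths above only resolve on the author's machine. A minimal sketch of a more portable import follows, assuming the CSV files sit in a folder that is passed in explicitly. The helper `read_project_csv()` and its arguments are hypothetical names, and the UTF-8 encoding is an assumption about how the files were saved. The demo writes a tiny stand-in file to a temporary directory so that it runs anywhere.

```r
# Hypothetical helper: read a CSV from a given folder and fail early if it is missing.
read_project_csv <- function(data_dir, file_name) {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # fileEncoding = "UTF-8" is an assumption about how the CSV was saved
  read.csv(path, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
}

# Demo with a made-up two-row file in a temporary directory
demo_dir <- tempdir()
demo <- data.frame(track_id = c("a1", "b2"), track_popularity = c(66L, 67L))
write.csv(demo, file.path(demo_dir, "spotify_demo.csv"), row.names = FALSE)

spotify_demo <- read_project_csv(demo_dir, "spotify_demo.csv")
dim(spotify_demo)
```

With this pattern, only the folder argument changes between machines, and a wrong path produces an explicit error instead of a cryptic connection warning.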
# Code to view the Spotify codebook
# Use library knitr to format codebook table
library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
kable(dictionary_spotify[,], caption = "Spotify Codebook")
| variable | class | description |
|---|---|---|
| track_id | character | Song unique ID |
| track_name | character | Song Name |
| track_artist | character | Song Artist |
| track_popularity | double | Song Popularity (0-100) where higher is better |
| track_album_id | character | Album unique ID |
| track_album_name | character | Song album name |
| track_album_release_date | character | Date when album released |
| playlist_name | character | Name of playlist |
| playlist_id | character | Playlist ID |
| playlist_genre | character | Playlist genre |
| playlist_subgenre | character | Playlist subgenre |
| danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | double | Duration of song in milliseconds |
dim(spotify)
## [1] 32833 23
# Checking to see whether there are songs with the same ID
length(unique(spotify$track_id))
## [1] 28356
# Creating a new file with unique songs
spotify_unique = spotify[!duplicated(spotify$track_id),]
str(spotify_unique)
## 'data.frame': 28356 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
# Shortening the name of spotify_unique to spotify only
spotify <- spotify_unique
# Checking whether the deduplicated data contains only 28,356 rows
str(spotify)
## 'data.frame': 28356 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
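The drop from 32,833 rows to 28,356 means that some `track_id` values occur more than once, presumably because the same track can sit in several playlists. The sketch below uses made-up data to show how the duplicates could be inspected before they are dropped. Note that `!duplicated()` keeps the first occurrence of each track, so the playlist and genre labels of any later occurrence are discarded.

```r
# Made-up data: track "t1" appears in two playlists with different genres
toy <- data.frame(
  track_id       = c("t1", "t2", "t1", "t3"),
  playlist_genre = c("pop", "rock", "edm", "rap"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier track_id
n_dup <- sum(duplicated(toy$track_id))

# All rows involved in a duplication (first and later occurrences)
dup_ids  <- toy$track_id[duplicated(toy$track_id)]
dup_rows <- toy[toy$track_id %in% dup_ids, ]

# Keep the first occurrence of each track, as in the step above
toy_unique <- toy[!duplicated(toy$track_id), ]

n_dup
dup_rows
toy_unique
```

In the toy data, track "t1" keeps its "pop" label and loses "edm". If genre comparisons matter, it may be worth checking how often this happens in the real data.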
head(spotify, n=5)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 6/14/2019 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 12/13/2019 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 7/5/2019 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 7/19/2019 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 3/5/2019 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
tail(spotify, n=5)
## track_id track_name
## 32829 7bxnKAamR3snQ1VGLuVfC1 City Of Lights - Official Radio Edit
## 32830 5Aevni09Em4575077nkWHz Closer - Sultan & Ned Shepard Remix
## 32831 7ImMqPP3Q1yfUHvsdn7wEo Sweet Surrender - Radio Edit
## 32832 2m69mhnfQ1Oq6lGtXuYhgX Only For You - Maor Levi Remix
## 32833 29zWqhca3zt5NsckZqDf6c Typhoon - Original Mix
## track_artist track_popularity track_album_id
## 32829 Lush & Simon 42 2azRoBBWEEEYhqV6sb7JrT
## 32830 Tegan and Sara 20 6kD6KLxj7s8eCE3ABvAyf5
## 32831 Starkillers 14 0ltWNSY9JgxoIZO4VzuCa6
## 32832 Mat Zo 15 1fGrOkHnHJcStl14zNx8Jy
## 32833 Julian Calor 27 0X3mUOm6MhxR7PzxG95rAo
## track_album_name track_album_release_date playlist_name
## 32829 City Of Lights (Vocal Mix) 4/28/2014 ♥ EDM LOVE 2020
## 32830 Closer Remixed 3/8/2013 ♥ EDM LOVE 2020
## 32831 Sweet Surrender (Radio Edit) 4/21/2014 ♥ EDM LOVE 2020
## 32832 Only For You (Remixes) 1/1/2014 ♥ EDM LOVE 2020
## 32833 Typhoon/Storm 3/3/2014 ♥ EDM LOVE 2020
## playlist_id playlist_genre playlist_subgenre
## 32829 6jI1gFr6ANFtT8MmTvA2Ux edm progressive electro house
## 32830 6jI1gFr6ANFtT8MmTvA2Ux edm progressive electro house
## 32831 6jI1gFr6ANFtT8MmTvA2Ux edm progressive electro house
## 32832 6jI1gFr6ANFtT8MmTvA2Ux edm progressive electro house
## 32833 6jI1gFr6ANFtT8MmTvA2Ux edm progressive electro house
## danceability energy key loudness mode speechiness acousticness
## 32829 0.428 0.922 2 -1.814 1 0.0936 0.076600
## 32830 0.522 0.786 0 -4.462 1 0.0420 0.001710
## 32831 0.529 0.821 6 -4.899 0 0.0481 0.108000
## 32832 0.626 0.888 2 -3.361 1 0.1090 0.007920
## 32833 0.603 0.884 5 -4.571 0 0.0385 0.000133
## instrumentalness liveness valence tempo duration_ms
## 32829 0.00e+00 0.0668 0.2100 128.170 204375
## 32830 4.27e-03 0.3750 0.4000 128.041 353120
## 32831 1.11e-06 0.1500 0.4360 127.989 210112
## 32832 1.27e-01 0.3430 0.3080 128.008 367432
## 32833 3.41e-01 0.7420 0.0894 127.984 337500
sum(is.na(spotify))
## [1] 12
colSums(is.na(spotify))
## track_id track_name track_artist
## 0 4 4
## track_popularity track_album_id track_album_name
## 0 0 4
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
# Eliminating missing data since there are not too many missing values
spotify <- na.omit(spotify)
# Checking whether missing data was omitted
str(spotify)
## 'data.frame': 28352 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## - attr(*, "na.action")= 'omit' Named int [1:4] 7669 8693 8694 17666
## ..- attr(*, "names")= chr [1:4] "8152" "9283" "9284" "19569"
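The 12 missing values are spread over three columns (4 each), and `na.omit()` removed 4 rows (28,356 to 28,352). A small sketch on made-up data shows how `complete.cases()` could be used to look at the affected rows before dropping them. The toy data frame and its values are invented for illustration.

```r
# Made-up data: one track is missing both its name and its artist
toy <- data.frame(
  track_id     = c("t1", "t2", "t3"),
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist A", NA, "Artist C"),
  stringsAsFactors = FALSE
)

# Rows with at least one missing value
incomplete <- toy[!complete.cases(toy), ]

# na.omit() drops exactly those rows
toy_clean <- na.omit(toy)

incomplete
nrow(toy_clean)
```

Here the missing fields are text labels rather than audio features, so an alternative would be to keep such rows when only numeric columns are modeled.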
summary(spotify)
## track_id track_name track_artist track_popularity
## Length:28352 Length:28352 Length:28352 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 21.00
## Mode :character Mode :character Mode :character Median : 42.00
## Mean : 39.34
## 3rd Qu.: 58.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:28352 Length:28352 Length:28352
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:28352 Length:28352 Length:28352 Length:28352
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5610 1st Qu.:0.579000 1st Qu.: 2.000 1st Qu.: -8.310
## Median :0.6700 Median :0.722000 Median : 6.000 Median : -6.261
## Mean :0.6534 Mean :0.698372 Mean : 5.367 Mean : -6.818
## 3rd Qu.:0.7600 3rd Qu.:0.843000 3rd Qu.: 9.000 3rd Qu.: -4.709
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0143 1st Qu.:0.0000000
## Median :1.0000 Median :0.0626 Median :0.0797 Median :0.0000207
## Mean :0.5655 Mean :0.1079 Mean :0.1772 Mean :0.0911294
## 3rd Qu.:1.0000 3rd Qu.:0.1330 3rd Qu.:0.2600 3rd Qu.:0.0065725
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0926 1st Qu.:0.3290 1st Qu.: 99.97 1st Qu.:187741
## Median :0.1270 Median :0.5120 Median :121.99 Median :216933
## Mean :0.1910 Mean :0.5104 Mean :120.96 Mean :226575
## 3rd Qu.:0.2490 3rd Qu.:0.6950 3rd Qu.:134.00 3rd Qu.:254975
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
hist(spotify$danceability)
hist(spotify$energy)
hist(spotify$loudness)
hist(spotify$speechiness)
hist(spotify$acousticness)
hist(spotify$instrumentalness)
hist(spotify$liveness)
hist(spotify$valence)
hist(spotify$tempo)
hist(spotify$key)
hist(spotify$mode)
hist(spotify$track_popularity)
hist(spotify$duration_ms)
library(knitr)
kable(table(spotify$playlist_genre), align = "l", caption = "Playlist genre frequencies")
| Var1 | Freq |
|---|---|
| edm | 4877 |
| latin | 4136 |
| pop | 5132 |
| r&b | 4504 |
| rap | 5398 |
| rock | 4305 |
kable(table(spotify$playlist_subgenre),align = "l", caption = "Playlist sub-genre frequencies" )
| Var1 | Freq |
|---|---|
| album rock | 1039 |
| big room | 1034 |
| classic rock | 1100 |
| dance pop | 1298 |
| electro house | 1416 |
| electropop | 1251 |
| gangster rap | 1314 |
| hard rock | 1202 |
| hip hop | 1296 |
| hip pop | 803 |
| indie poptimism | 1547 |
| latin hip hop | 1194 |
| latin pop | 1097 |
| neo soul | 1478 |
| new jack swing | 1036 |
| permanent wave | 964 |
| pop edm | 967 |
| post-teen pop | 1036 |
| progressive electro house | 1460 |
| reggaeton | 687 |
| southern hip hop | 1582 |
| trap | 1206 |
| tropical | 1158 |
| urban contemporary | 1187 |
barplot(table(spotify$playlist_genre))
barplot(table(spotify$playlist_subgenre))
boxplot(spotify$track_popularity,xlab = "popularity")
boxplot(spotify$danceability,xlab = "danceability")
boxplot(spotify$duration_ms, xlab = "duration_ms")
boxplot(spotify$energy, xlab = "energy")
boxplot(spotify$loudness, xlab = "loudness")
boxplot(spotify$speechiness, xlab = "speechiness")
boxplot(spotify$acousticness, xlab = "acousticness")
boxplot(spotify$instrumentalness, xlab = "instrumentalness")
boxplot(spotify$liveness, xlab = "liveness")
boxplot(spotify$valence, xlab = "valence")
boxplot(spotify$tempo, xlab = "tempo")
The boxplots show outliers for the following variables: danceability, duration, energy, loudness, speechiness, acousticness, instrumentalness, liveness and tempo.
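Reading outliers off boxplots is subjective, so it can help to count them. The sketch below uses `boxplot.stats()`, which applies the same whisker rule as `boxplot()` (by default, points more than 1.5 times the interquartile range beyond the hinges). The helper name `count_outliers()` and the toy values are made up for illustration.

```r
# Count values that a default boxplot would draw as outlier points
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Made-up values: liveness has one extreme point, valence has none
toy <- data.frame(
  liveness = c(0.10, 0.12, 0.11, 0.13, 0.12, 0.95),
  valence  = c(0.30, 0.40, 0.50, 0.60, 0.70, 0.80)
)

outlier_counts <- sapply(toy, count_outliers)
outlier_counts
```

Applied to the numeric columns of the real data, the same call would give one count per variable, which makes the outlier statement above easy to check.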
length(unique(spotify$track_artist))
## [1] 10692
length(unique(spotify$playlist_id))
## [1] 470
plot(spotify$liveness, spotify$tempo)
plot(spotify$speechiness, spotify$liveness)
plot(spotify$track_popularity, spotify$liveness)
# Creating a subset of the data with numeric variables only to more easily check for correlations
library(tidyverse)
spotify_num <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode)
# Checking for variable correlations
cor(spotify_num)
## track_popularity danceability energy loudness
## track_popularity 1.000000000 0.046574393 -0.103510773 0.0373368426
## danceability 0.046574393 1.000000000 -0.081426757 0.0153113311
## energy -0.103510773 -0.081426757 1.000000000 0.6821643541
## loudness 0.037336843 0.015311331 0.682164354 1.0000000000
## speechiness 0.005439570 0.183558194 -0.029030115 0.0129401739
## acousticness 0.091624759 -0.028881286 -0.545878674 -0.3716005101
## instrumentalness -0.124546651 -0.002274667 0.023850025 -0.1543017028
## liveness -0.052752799 -0.127054574 0.163802644 0.0819049144
## valence 0.022594291 0.333751328 0.149662060 0.0495341593
## tempo 0.004321794 -0.184639775 0.151658031 0.0967103886
## key -0.007879063 0.007059769 0.012790256 -0.0005832657
## mode 0.016130687 -0.055270139 -0.004265523 -0.0176398787
## speechiness acousticness instrumentalness liveness
## track_popularity 0.00543957 0.091624759 -0.124546651 -0.0527527986
## danceability 0.18355819 -0.028881286 -0.002274667 -0.1270545736
## energy -0.02903011 -0.545878674 0.023850025 0.1638026443
## loudness 0.01294017 -0.371600510 -0.154301703 0.0819049144
## speechiness 1.00000000 0.025016481 -0.107921943 0.0592325869
## acousticness 0.02501648 1.000000000 -0.003128449 -0.0745330902
## instrumentalness -0.10792194 -0.003128449 1.000000000 -0.0084967401
## liveness 0.05923259 -0.074533090 -0.008496740 1.0000000000
## valence 0.06482384 -0.018997220 -0.174173559 -0.0197889232
## tempo 0.03275482 -0.114379959 0.021457069 0.0218915079
## key 0.02295464 0.004277595 0.007455312 0.0020729759
## mode -0.05955242 0.006721610 -0.005800667 -0.0002156869
## valence tempo key mode
## track_popularity 0.022594291 0.004321794 -0.0078790634 0.0161306874
## danceability 0.333751328 -0.184639775 0.0070597692 -0.0552701387
## energy 0.149662060 0.151658031 0.0127902556 -0.0042655230
## loudness 0.049534159 0.096710389 -0.0005832657 -0.0176398787
## speechiness 0.064823839 0.032754821 0.0229546448 -0.0595524151
## acousticness -0.018997220 -0.114379959 0.0042775954 0.0067216097
## instrumentalness -0.174173559 0.021457069 0.0074553115 -0.0058006673
## liveness -0.019788923 0.021891508 0.0020729759 -0.0002156869
## valence 1.000000000 -0.025046418 0.0216434352 -0.0031256418
## tempo -0.025046418 1.000000000 -0.0102970040 0.0166918679
## key 0.021643435 -0.010297004 1.0000000000 -0.1759585270
## mode -0.003125642 0.016691868 -0.1759585270 1.0000000000
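In the matrix above, the strongest relationships are between energy and loudness (0.68) and between energy and acousticness (-0.55), while no feature correlates with track_popularity more strongly than about 0.12 in absolute value. A 12 x 12 matrix is hard to scan, so the sketch below shows one way to rank variable pairs by absolute correlation. It runs on simulated data in which loudness is built to follow energy; all values are made up.

```r
set.seed(42)
n <- 200
energy <- runif(n)
toy <- data.frame(
  energy   = energy,
  loudness = -20 + 15 * energy + rnorm(n, sd = 1),  # built to track energy
  tempo    = rnorm(n, mean = 120, sd = 20)          # unrelated noise
)

cor_mat <- cor(toy)

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Strongest relationships first, regardless of sign
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
cor_pairs
```

The same ranking could be applied to `cor(spotify_num)`. Filtering the result to pairs that involve `track_popularity` would give a quick list of candidate predictors.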
library(tidyverse)
spotify_m <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode, playlist_genre, playlist_subgenre)
# Code to transform the character variables into factors
spotify_m$playlist_genre <- as.factor(spotify_m$playlist_genre)
spotify_m$playlist_subgenre <- as.factor(spotify_m$playlist_subgenre)
# Checking whether the factors were created
str(spotify_m)
## 'data.frame': 28352 obs. of 14 variables:
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ playlist_genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ playlist_subgenre: Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## - attr(*, "na.action")= 'omit' Named int [1:4] 7669 8693 8694 17666
## ..- attr(*, "names")= chr [1:4] "8152" "9283" "9284" "19569"
# Summary of the clean dataset
summary(spotify_m)
## track_popularity danceability energy loudness
## Min. : 0.00 Min. :0.0000 Min. :0.000175 Min. :-46.448
## 1st Qu.: 21.00 1st Qu.:0.5610 1st Qu.:0.579000 1st Qu.: -8.310
## Median : 42.00 Median :0.6700 Median :0.722000 Median : -6.261
## Mean : 39.34 Mean :0.6534 Mean :0.698372 Mean : -6.818
## 3rd Qu.: 58.00 3rd Qu.:0.7600 3rd Qu.:0.843000 3rd Qu.: -4.709
## Max. :100.00 Max. :0.9830 Max. :1.000000 Max. : 1.275
##
## speechiness acousticness instrumentalness liveness
## Min. :0.0000 Min. :0.0000 Min. :0.0000000 Min. :0.0000
## 1st Qu.:0.0410 1st Qu.:0.0143 1st Qu.:0.0000000 1st Qu.:0.0926
## Median :0.0626 Median :0.0797 Median :0.0000207 Median :0.1270
## Mean :0.1079 Mean :0.1772 Mean :0.0911294 Mean :0.1910
## 3rd Qu.:0.1330 3rd Qu.:0.2600 3rd Qu.:0.0065725 3rd Qu.:0.2490
## Max. :0.9180 Max. :0.9940 Max. :0.9940000 Max. :0.9960
##
## valence tempo key mode
## Min. :0.0000 Min. : 0.00 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.3290 1st Qu.: 99.97 1st Qu.: 2.000 1st Qu.:0.0000
## Median :0.5120 Median :121.99 Median : 6.000 Median :1.0000
## Mean :0.5104 Mean :120.96 Mean : 5.367 Mean :0.5655
## 3rd Qu.:0.6950 3rd Qu.:134.00 3rd Qu.: 9.000 3rd Qu.:1.0000
## Max. :0.9910 Max. :239.44 Max. :11.000 Max. :1.0000
##
## playlist_genre playlist_subgenre
## edm :4877 southern hip hop : 1582
## latin:4136 indie poptimism : 1547
## pop :5132 neo soul : 1478
## r&b :4504 progressive electro house: 1460
## rap :5398 electro house : 1416
## rock :4305 gangster rap : 1314
## (Other) :19555
Learning: The variables speechiness, acousticness, instrumentalness and liveness are highly skewed, with a significant number of outliers. These variables will need to be analyzed to decide whether they should be part of the analyses and predictive model.
# Summarize the clean dataset using means
# Columns are referenced by bare name inside summarise(); the duplicated loud_mean entry was dropped
summ1 <- summarise(spotify_m, popular_mean = mean(track_popularity, na.rm = TRUE),
danceab_mean = mean(danceability, na.rm = TRUE),
energ_mean = mean(energy, na.rm = TRUE),
loud_mean = mean(loudness, na.rm = TRUE),
speech_mean = mean(speechiness, na.rm = TRUE),
acoust_mean = mean(acousticness, na.rm = TRUE),
instr_mean = mean(instrumentalness, na.rm = TRUE),
liven_mean = mean(liveness, na.rm = TRUE),
valen_mean = mean(valence, na.rm = TRUE),
tempo_mean = mean(tempo, na.rm = TRUE),
key_mean = mean(key, na.rm = TRUE),
mode_mean = mean(mode, na.rm = TRUE),
n = n())
# Summarize the clean dataset using ranges
# range() returns two values (min, max), so the result has two rows;
# dplyr >= 1.1.0 warns here and recommends reframe() instead of summarise()
summ2 <- summarise(spotify_m, popular_mean = mean(track_popularity, na.rm = TRUE),
danceab_range = range(danceability, na.rm = TRUE),
energ_range = range(energy, na.rm = TRUE),
loud_range = range(loudness, na.rm = TRUE),
speech_range = range(speechiness, na.rm = TRUE),
acoust_range = range(acousticness, na.rm = TRUE),
instr_range = range(instrumentalness, na.rm = TRUE),
liven_range = range(liveness, na.rm = TRUE),
valen_range = range(valence, na.rm = TRUE),
tempo_range = range(tempo, na.rm = TRUE),
key_range = range(key, na.rm = TRUE),
mode_range = range(mode, na.rm = TRUE),
n = n() )
# Printing the two key summary tables
print(list(summ1, summ2))
## [[1]]
## popular_mean danceab_mean energ_mean loud_mean speech_mean acoust_mean
## 1 39.33532 0.6533752 0.6983725 -6.817777 0.1079392 0.177192
## instr_mean liven_mean valen_mean tempo_mean key_mean mode_mean n
## 1 0.09112945 0.1909547 0.5103855 120.9582 5.367417 0.5655333 28352
##
## [[2]]
## popular_mean danceab_range energ_range loud_range speech_range acoust_range
## 1 39.33532 0.000 0.000175 -46.448 0.000 0.000
## 2 39.33532 0.983 1.000000 1.275 0.918 0.994
## instr_range liven_range valen_range tempo_range key_range mode_range n
## 1 0.000 0.000 0.000 0.00 0 0 28352
## 2 0.994 0.996 0.991 239.44 11 1 28352
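The two summary tables above are wide and repeat the same pattern for every column. As an alternative sketch in base R, `sapply()` can compute the mean, minimum and maximum of every numeric column at once and return one row per variable. The toy data frame is made up for illustration.

```r
# Made-up data with two numeric features
toy <- data.frame(
  danceability = c(0.2, 0.5, 0.8),
  tempo        = c(90, 120, 150)
)

# One row per variable, one column per statistic
summ <- t(sapply(toy, function(x) {
  c(mean = mean(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
}))
summ
```

A long layout like this is easier to read than a single very wide row, and adding a statistic only requires one more entry in the inner vector.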
As shown below, the variables speechiness, acousticness, instrumentalness and liveness are highly skewed. In the case of instrumentalness, the median is essentially zero (about 0.00002). For the other three variables, the median is also much closer to the minimum value than to the maximum value. These variables may need to be re-scaled or eliminated from the model.
# Variables of concern
spotify_conc <- select(spotify_m, speechiness, acousticness, instrumentalness, liveness)
summary(spotify_conc)
## speechiness acousticness instrumentalness liveness
## Min. :0.0000 Min. :0.0000 Min. :0.0000000 Min. :0.0000
## 1st Qu.:0.0410 1st Qu.:0.0143 1st Qu.:0.0000000 1st Qu.:0.0926
## Median :0.0626 Median :0.0797 Median :0.0000207 Median :0.1270
## Mean :0.1079 Mean :0.1772 Mean :0.0911294 Mean :0.1910
## 3rd Qu.:0.1330 3rd Qu.:0.2600 3rd Qu.:0.0065725 3rd Qu.:0.2490
## Max. :0.9180 Max. :0.9940 Max. :0.9940000 Max. :0.9960
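Skewness can be quantified instead of judged from quartiles. The sketch below defines a moment-based sample skewness (one common definition; other definitions differ by a small-sample correction) and then shows one possible way to handle instrumentalness: turning it into a binary flag at 0.5, the threshold that the codebook describes as intended to represent instrumental tracks. The function name `skewness()` and the toy values are assumptions for illustration, not part of the original analysis.

```r
# Moment-based sample skewness: 0 for symmetric data, positive for a long right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up, instrumentalness-like values: mostly near zero, a few near one
instrumentalness <- c(0, 0, 0, 0.00002, 0.001, 0.006, 0.02, 0.85, 0.93)

skewness(c(1, 2, 3, 4, 5))   # symmetric reference
skewness(instrumentalness)   # strongly right-skewed

# Binary flag based on the codebook's 0.5 threshold
is_instrumental <- instrumentalness > 0.5
table(is_instrumental)
```

A flag like this gives up the fine-grained values but avoids feeding a heavily zero-inflated variable into a linear model. Whether that trade-off helps would need to be checked against model performance.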
As part of the data analysis and modeling, I’ll take a further look at correlation, skewness, outliers and value frequency, among other measures. I’ll slice the data to separate low and high popularity scores and determine whether these scores correlate with any song characteristics. I’ll also try binning the popularity scores into three segments (e.g. unpopular, popular, very popular) and eliminating highly skewed variables, in each case to see whether new trends emerge.
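The three-segment split of popularity mentioned above could be sketched with `cut()`. The cut points 33 and 66 are arbitrary placeholders rather than values taken from the analysis, and the scores are illustrative.

```r
# Illustrative popularity scores on the 0-100 scale
popularity <- c(0, 21, 42, 58, 100)

# Cut points 33 and 66 are placeholders; quantile-based cut points are another option
popularity_band <- cut(
  popularity,
  breaks = c(0, 33, 66, 100),
  labels = c("unpopular", "popular", "very popular"),
  include.lowest = TRUE  # so that a score of exactly 0 falls into the first band
)

table(popularity_band)
```

Because the mean popularity in this dataset is about 39, fixed thirds of the scale would give unequal group sizes. Cut points derived from `quantile()` would balance the groups instead.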
Correlation plots, aggregation, and grouping the data by specific values or variables (e.g. low, medium and high instrumentalness) can help reveal trends in the data. I will also create subsets of the data for the different genres and sub-genres to help answer my questions.
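As a sketch of the grouping step, base R's `aggregate()` can compute per-genre means of several features in one call. The toy values below are made up; with the real data, `spotify_m` would take the place of `toy`.

```r
# Made-up data: five tracks across three genres
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "rock", "rock", "rap"),
  danceability     = c(0.70, 0.80, 0.40, 0.50, 0.90),
  track_popularity = c(60, 70, 40, 50, 65)
)

# Mean of each feature within each genre
genre_means <- aggregate(
  cbind(danceability, track_popularity) ~ playlist_genre,
  data = toy,
  FUN  = mean
)
genre_means
```

Swapping `playlist_genre` for `playlist_subgenre` in the formula would give the same table at the sub-genre level.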
I know that some variables are highly skewed and could lead to low-accuracy predictive models for popularity. These highly skewed variables could also mask which characteristics are associated with each genre.
I plan to incorporate machine learning techniques such as linear regression, tree-based models and cluster analysis, among other model architectures, to help answer my questions.
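A minimal sketch of the first of these techniques, linear regression with a hold-out set, is shown below. It runs on simulated data: the predictors, the coefficients used to generate the response, and the 80/20 split are all assumptions made for the demo, not results from the Spotify data.

```r
set.seed(123)
n <- 500
sim <- data.frame(
  danceability     = runif(n),
  instrumentalness = runif(n)
)

# Simulated response; the coefficients are invented for the demo
sim$track_popularity <- 40 + 10 * sim$danceability -
  15 * sim$instrumentalness + rnorm(n, sd = 5)

# 80/20 train/hold-out split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train   <- sim[train_idx, ]
holdout <- sim[-train_idx, ]

fit <- lm(track_popularity ~ danceability + instrumentalness, data = train)

# Root mean squared error on rows the model has not seen
pred <- predict(fit, newdata = holdout)
rmse <- sqrt(mean((holdout$track_popularity - pred)^2))

coef(fit)
rmse
```

Evaluating on held-out rows rather than on the training data gives a fairer estimate of predictive accuracy, and the same split can be reused to compare other model architectures.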