Can We Predict Song Popularity?

1. Introduction

1.1. Project Objectives

Analyze the Spotify database to:

Understand how song characteristics (e.g. danceability, liveness) are associated with different song genres (e.g. pop, rock).
Create a model to predict song popularity based on song characteristics.

1.2. Plan to Deliver Against Project Objectives

Conduct preliminary data analyses to determine what cleaning steps, if any, are needed
Clean the data
Determine what variables might be correlated
Fit various types of models to the data, starting with the simplest models (e.g. multiple linear regression, trees)
Select best model
Re-evaluate variables to determine whether further data cleaning and/or collection should be recommended
Final summary and recommendations

1.3. Analysis and Modeling Proposal

This project will be executed in two major phases:

Phase 1: Analyze the data and look for associations between song characteristics and song genres & sub-genres. This will include data clean-up, data wrangling and data visualization.
Phase 2: Create models to predict song popularity based on most relevant song characteristics identified in phase 1. This phase will include variable selection and evaluation of various model architectures.

1.4. Expected Output

Insights from these analyses and the song popularity predictive model will be used by MakeYourSong, a fictional start-up, to guide its users on which song characteristics are likely to drive popularity. The predictive model will be made available to MakeYourSong users.

2. Packages Required

2.1. Packages Required

2.2. Messages and warnings resulting from loading the packages are suppressed

#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("plotly")
#install.packages("corrplot")

library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(corrplot)
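
Load messages can be silenced with `suppressPackageStartupMessages()`, or per chunk in R Markdown with the options `message = FALSE` and `warning = FALSE`. The sketch below is a minimal illustration only; `stats` and `utils` are stand-ins chosen because they ship with R, not the packages used in this project.

```r
# Packages to load quietly. 'stats' and 'utils' are stand-ins that ship with R;
# for this project the vector would hold tidyverse, plotly, corrplot, etc.
pkgs <- c("stats", "utils")

# Load each package without startup messages and record whether it loaded.
loaded <- vapply(
  pkgs,
  function(p) suppressPackageStartupMessages(
    require(p, character.only = TRUE, quietly = TRUE)
  ),
  logical(1)
)
print(loaded)
```

A `FALSE` entry in `loaded` would point to a package that still needs to be installed.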

2.3. Package Short Description

tidyverse - for interacting with data through subsetting, transformation, visualization, etc.

dplyr - for data manipulation in R by combining, selecting, grouping, subsetting and transforming all or parts of dataset

ggplot2 - for declaratively creating graphics, based on The Grammar of Graphics

plotly - for creating interactive web-based graphs via the open source JavaScript graphing library plotly.js

corrplot - for visualizing correlation matrices and confidence intervals

3. Data Preparation

3.1. Data Source

The dataset is available on GitHub. Link to the data source is here.

3.2. Explanation of Data Source

The data to be analyzed is an excerpt of the Spotify database. The data set contains 23 variables and 32,833 songs released between 1957 and 2020. There are 10,693 artists and 6 main genres, each with its own sub-genres. There are 12 audio features for each track, including confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode.

Genres were selected from Every Noise, a visualization of the Spotify genre-space maintained by a genre taxonomist. The top four sub-genres for each were used to query Spotify for 20 playlists each, resulting in about 5000 songs for each genre, split across a varied sub-genre space.

You can find the code for generating the dataset in spotify_dataset.R in the full GitHub repo.

3.3. Data Importing and Cleaning

# Code to import the data
spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/spotify.csv")
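
An import step tends to be more portable when the path is built with `file.path()` relative to the project folder and the file encoding is stated explicitly, so that non-ASCII playlist names survive on Windows. The sketch below illustrates this under stated assumptions: the file `spotify_demo.csv` and its three columns are hypothetical stand-ins written to a temporary folder, not the project data.

```r
# Hypothetical stand-in for spotify.csv, written to a temporary folder so the
# example runs anywhere. The file name and its three columns are assumptions.
csv_path <- file.path(tempdir(), "spotify_demo.csv")
demo_lines <- c(
  "track_id,playlist_name,track_popularity",
  "a1,\u2665 EDM LOVE 2020,42",
  "b2,Pop Remix,66"
)

# Write the lines as raw UTF-8 bytes.
con <- file(csv_path, open = "wb")
writeLines(enc2utf8(demo_lines), con, useBytes = TRUE)
close(con)

# Read the file back, declaring the encoding and keeping strings as characters.
demo <- read.csv(csv_path, encoding = "UTF-8", stringsAsFactors = FALSE)
str(demo)
```

For the project itself, the same call could point at something like `file.path("data", "spotify.csv")`, where the `data` folder name is again only an assumption.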

Identifying and reviewing the codebook

dictionary_spotify <- read.csv("C:/Users/king.nm/OneDrive - Procter and Gamble/UC/MS Business Analytics/Classes/Summer 2021/Data Wrangling BANA 7025 - Jun2021/Final Project/dictionary_spotify.csv")

# Code to view data Spotify codebook
# Use library knitr to format codebook table
library(knitr)

## Warning: package 'knitr' was built under R version 4.0.5

kable(dictionary_spotify[,], caption = "Spotify Codebook")

Spotify Codebook
variable	class	description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	double	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	double	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	double	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	double	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	double	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	double	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	double	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	double	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	double	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	double	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	double	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	double	Duration of song in milliseconds

Assessing dimensions of the dataset

dim(spotify)

## [1] 32833    23

Check for duplicate rows or columns

# Checking to see whether there are songs with the same ID
length(unique(spotify$track_id))

## [1] 28356

# Creating a new data frame that keeps the first row for each unique track_id
spotify_unique <- spotify[!duplicated(spotify$track_id), ]
str(spotify_unique)

## 'data.frame':    28356 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

# Shortening the name of spotify_unique to spotify only
spotify <- spotify_unique
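
The row count drops from 32,833 to 28,356 because 4,477 rows repeat a track_id that already appeared, typically a song listed in more than one playlist. The sketch below uses a small made-up data frame (not the Spotify data) to show how such repeats can be inspected before they are dropped.

```r
# Toy stand-in for the spotify data frame: track "t1" sits in two playlists.
songs <- data.frame(
  track_id       = c("t1", "t2", "t1", "t3"),
  playlist_genre = c("pop", "rock", "edm", "rap"),
  stringsAsFactors = FALSE
)

# How many rows repeat an earlier track_id?
n_repeats <- sum(duplicated(songs$track_id))
print(n_repeats)

# Show every row involved in a duplicate, not only the later copies.
dup_ids <- songs$track_id[duplicated(songs$track_id)]
print(songs[songs$track_id %in% dup_ids, ])

# Keep the first occurrence of each track.
songs_unique <- songs[!duplicated(songs$track_id), ]
print(songs_unique)
```

Because `duplicated()` keeps the first occurrence, a track that appears under several genres retains only the genre of the playlist it was listed in first. This is worth keeping in mind for the genre analysis.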

# Checking that the deduplicated data frame contains only 28356 rows
str(spotify)

## 'data.frame':    28356 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

Viewing the head and tail of the data

head(spotify, n=5)

##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1                6/14/2019     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               12/13/2019     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3                 7/5/2019     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4                7/19/2019     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5                 3/5/2019     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052

tail(spotify, n=5)

##                     track_id                           track_name
## 32829 7bxnKAamR3snQ1VGLuVfC1 City Of Lights - Official Radio Edit
## 32830 5Aevni09Em4575077nkWHz  Closer - Sultan & Ned Shepard Remix
## 32831 7ImMqPP3Q1yfUHvsdn7wEo         Sweet Surrender - Radio Edit
## 32832 2m69mhnfQ1Oq6lGtXuYhgX       Only For You - Maor Levi Remix
## 32833 29zWqhca3zt5NsckZqDf6c               Typhoon - Original Mix
##         track_artist track_popularity         track_album_id
## 32829   Lush & Simon               42 2azRoBBWEEEYhqV6sb7JrT
## 32830 Tegan and Sara               20 6kD6KLxj7s8eCE3ABvAyf5
## 32831    Starkillers               14 0ltWNSY9JgxoIZO4VzuCa6
## 32832         Mat Zo               15 1fGrOkHnHJcStl14zNx8Jy
## 32833   Julian Calor               27 0X3mUOm6MhxR7PzxG95rAo
##                   track_album_name track_album_release_date     playlist_name
## 32829   City Of Lights (Vocal Mix)                4/28/2014    ♥ EDM LOVE 2020
## 32830               Closer Remixed                 3/8/2013    ♥ EDM LOVE 2020
## 32831 Sweet Surrender (Radio Edit)                4/21/2014    ♥ EDM LOVE 2020
## 32832       Only For You (Remixes)                 1/1/2014    ♥ EDM LOVE 2020
## 32833                Typhoon/Storm                 3/3/2014    ♥ EDM LOVE 2020
##                  playlist_id playlist_genre         playlist_subgenre
## 32829 6jI1gFr6ANFtT8MmTvA2Ux            edm progressive electro house
## 32830 6jI1gFr6ANFtT8MmTvA2Ux            edm progressive electro house
## 32831 6jI1gFr6ANFtT8MmTvA2Ux            edm progressive electro house
## 32832 6jI1gFr6ANFtT8MmTvA2Ux            edm progressive electro house
## 32833 6jI1gFr6ANFtT8MmTvA2Ux            edm progressive electro house
##       danceability energy key loudness mode speechiness acousticness
## 32829        0.428  0.922   2   -1.814    1      0.0936     0.076600
## 32830        0.522  0.786   0   -4.462    1      0.0420     0.001710
## 32831        0.529  0.821   6   -4.899    0      0.0481     0.108000
## 32832        0.626  0.888   2   -3.361    1      0.1090     0.007920
## 32833        0.603  0.884   5   -4.571    0      0.0385     0.000133
##       instrumentalness liveness valence   tempo duration_ms
## 32829         0.00e+00   0.0668  0.2100 128.170      204375
## 32830         4.27e-03   0.3750  0.4000 128.041      353120
## 32831         1.11e-06   0.1500  0.4360 127.989      210112
## 32832         1.27e-01   0.3430  0.3080 128.008      367432
## 32833         3.41e-01   0.7420  0.0894 127.984      337500

Cleaning the Data (explanation of the data cleaning steps)

Identify missing data
Determine how to handle missing data
Look for outliers
Determine how to handle outliers
Review the frequency distribution of the variables

Identifying missing data

sum(is.na(spotify))

## [1] 12

colSums(is.na(spotify))

##                 track_id               track_name             track_artist 
##                        0                        4                        4 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        4 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

# Dropping rows with missing values: only 12 cells are missing, all in track_name, track_artist and track_album_name
spotify <- na.omit(spotify)
# Checking whether missing data was omitted
str(spotify)

## 'data.frame':    28352 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "6/14/2019" "12/13/2019" "7/5/2019" "7/19/2019" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
##  - attr(*, "na.action")= 'omit' Named int [1:4] 7669 8693 8694 17666
##   ..- attr(*, "names")= chr [1:4] "8152" "9283" "9284" "19569"
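
Note that `sum(is.na())` counts missing cells, not rows: the 12 missing values sit in only 4 rows, which is why the row count goes from 28,356 to 28,352. A minimal sketch on made-up data (the values below are illustrative, not from the Spotify file) shows how the affected rows can be inspected with `complete.cases()` before they are removed.

```r
# Toy data: one song is missing both its name and its artist.
songs <- data.frame(
  track_id     = c("t1", "t2", "t3"),
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist A", NA, "Artist C"),
  energy       = c(0.9, 0.5, 0.7),
  stringsAsFactors = FALSE
)

# Missing cells versus rows with at least one missing cell.
n_missing_cells <- sum(is.na(songs))
incomplete <- songs[!complete.cases(songs), ]
print(n_missing_cells)
print(incomplete)

# Drop incomplete rows and report how many were removed.
songs_clean <- na.omit(songs)
rows_dropped <- nrow(songs) - nrow(songs_clean)
print(rows_dropped)
```

Since the missing fields here are names rather than audio features, another option would be to keep those rows for modeling; dropping them is simply the approach taken in this analysis.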

Computing summary statistics for the variables

summary(spotify)

##    track_id          track_name        track_artist       track_popularity
##  Length:28352       Length:28352       Length:28352       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 21.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 42.00  
##                                                           Mean   : 39.34  
##                                                           3rd Qu.: 58.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:28352       Length:28352       Length:28352            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:28352       Length:28352       Length:28352       Length:28352      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5610   1st Qu.:0.579000   1st Qu.: 2.000   1st Qu.: -8.310  
##  Median :0.6700   Median :0.722000   Median : 6.000   Median : -6.261  
##  Mean   :0.6534   Mean   :0.698372   Mean   : 5.367   Mean   : -6.818  
##  3rd Qu.:0.7600   3rd Qu.:0.843000   3rd Qu.: 9.000   3rd Qu.: -4.709  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0143   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0626   Median :0.0797   Median :0.0000207  
##  Mean   :0.5655   Mean   :0.1079   Mean   :0.1772   Mean   :0.0911294  
##  3rd Qu.:1.0000   3rd Qu.:0.1330   3rd Qu.:0.2600   3rd Qu.:0.0065725  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0926   1st Qu.:0.3290   1st Qu.: 99.97   1st Qu.:187741  
##  Median :0.1270   Median :0.5120   Median :121.99   Median :216933  
##  Mean   :0.1910   Mean   :0.5104   Mean   :120.96   Mean   :226575  
##  3rd Qu.:0.2490   3rd Qu.:0.6950   3rd Qu.:134.00   3rd Qu.:254975  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810
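
Some extremes in this summary look implausible for real songs: a minimum tempo of 0, a minimum duration of 4,000 ms, and a maximum loudness above 0 dB. The sketch below shows one way such records could be flagged for review. The data are made up and the 30-second duration threshold is an assumption chosen for illustration, not a rule from the data source.

```r
# Toy data mimicking the extremes seen in the summary output.
songs <- data.frame(
  track_id    = c("t1", "t2", "t3", "t4"),
  tempo       = c(122.0, 0.0, 99.9, 128.0),
  duration_ms = c(194754, 4000, 216933, 253040),
  loudness    = c(-2.6, -46.4, 1.3, -6.3)
)

# Illustrative threshold (an assumption): tracks shorter than 30 seconds.
min_duration_ms <- 30000

# Flag rows with zero tempo, very short duration, or loudness above 0 dB.
suspect <- songs$tempo == 0 |
  songs$duration_ms < min_duration_ms |
  songs$loudness > 0

print(songs[suspect, ])
```

Flagging rather than deleting keeps the decision about these rows explicit and reversible.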

Learn about the data visually by plotting:

Histograms of numeric variables

hist(spotify$danceability)
hist(spotify$energy)
hist(spotify$loudness)
hist(spotify$speechiness)
hist(spotify$acousticness)
hist(spotify$instrumentalness)
hist(spotify$liveness)
hist(spotify$valence)
hist(spotify$tempo)
hist(spotify$key)
hist(spotify$mode)
hist(spotify$track_popularity)
hist(spotify$duration_ms)
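
The individual `hist()` calls can also be written as a loop over the numeric columns, which labels each panel and places several plots on one page. This is a sketch on simulated data; the column names mirror the Spotify variables but the values are random.

```r
# Simulated stand-in for a few numeric Spotify columns.
set.seed(1)
songs <- data.frame(
  danceability = runif(200),
  energy       = runif(200),
  valence      = runif(200),
  tempo        = rnorm(200, mean = 120, sd = 25)
)

# Pick out the numeric columns by type rather than by name.
num_cols <- names(songs)[vapply(songs, is.numeric, logical(1))]

# Draw a 2 x 2 grid of histograms, then restore the previous layout.
old_par <- par(mfrow = c(2, 2))
for (col in num_cols) {
  hist(songs[[col]], main = col, xlab = col)
}
par(old_par)
```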

Tables for character variables

library(knitr)
kable(table(spotify$playlist_genre), align = "l", caption = "Playlist genre frequencies")

Playlist genre frequencies
Var1	Freq
edm	4877
latin	4136
pop	5132
r&b	4504
rap	5398
rock	4305

kable(table(spotify$playlist_subgenre),align = "l", caption = "Playlist sub-genre frequencies" )

Playlist sub-genre frequencies
Var1	Freq
album rock	1039
big room	1034
classic rock	1100
dance pop	1298
electro house	1416
electropop	1251
gangster rap	1314
hard rock	1202
hip hop	1296
hip pop	803
indie poptimism	1547
latin hip hop	1194
latin pop	1097
neo soul	1478
new jack swing	1036
permanent wave	964
pop edm	967
post-teen pop	1036
progressive electro house	1460
reggaeton	687
southern hip hop	1582
trap	1206
tropical	1158
urban contemporary	1187

Bar Plots for the Character Variables

barplot(table(spotify$playlist_genre))
barplot(table(spotify$playlist_subgenre))
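
Sorting the bars and rotating the axis labels can make these plots easier to read, especially for the 24 sub-genres. The sketch below re-enters the genre counts from the frequency table above by hand, so that it runs without the data file.

```r
# Genre counts copied from the "Playlist genre frequencies" table.
genre_counts <- c(edm = 4877, latin = 4136, pop = 5132,
                  `r&b` = 4504, rap = 5398, rock = 4305)

# Share of tracks per genre, in percent.
genre_share <- round(100 * prop.table(genre_counts), 1)
print(genre_share)

# Bar plot sorted from most to least frequent, with perpendicular labels.
sorted_counts <- sort(genre_counts, decreasing = TRUE)
barplot(sorted_counts, las = 2,
        ylab = "Number of tracks", main = "Tracks per playlist genre")
```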

Box plots – looking for outliers

boxplot(spotify$track_popularity,xlab = "popularity")
boxplot(spotify$danceability,xlab = "danceability")
boxplot(spotify$duration_ms, xlab = "duration_ms")
boxplot(spotify$energy, xlab = "energy")
boxplot(spotify$loudness, xlab = "loudness")
boxplot(spotify$speechiness, xlab = "speechiness")
boxplot(spotify$acousticness, xlab = "acousticness")
boxplot(spotify$instrumentalness, xlab = "instrumentalness")
boxplot(spotify$liveness, xlab = "liveness")
boxplot(spotify$valence, xlab = "valence")
boxplot(spotify$tempo, xlab = "tempo")

The box plots show outliers for danceability, duration, energy, loudness, speechiness, acousticness, instrumentalness, liveness and tempo.
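
Box plots mark as outliers the points lying more than 1.5 times the interquartile range beyond the hinges. The same rule can be applied numerically with `boxplot.stats()` to count outliers per variable instead of reading them off the plots. The sketch below uses a small made-up data frame, and `count_outliers` is a helper name introduced here for illustration.

```r
# Helper (hypothetical name): number of points a box plot would mark as outliers.
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Toy data: one extreme tempo value, no extreme valence values.
songs <- data.frame(
  tempo   = c(100, 110, 120, 125, 130, 240),
  valence = c(0.2, 0.4, 0.5, 0.6, 0.7, 0.8)
)

# Outlier count for every column.
outlier_counts <- vapply(songs, count_outliers, integer(1))
print(outlier_counts)
```

Applied to the numeric Spotify columns, this would give a count per variable that can inform how outliers are handled.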

Number of Artists

length(unique(spotify$track_artist))

## [1] 10692

Number of Playlists IDs

length(unique(spotify$playlist_id))

## [1] 470

Number of Playlist Names

length(unique(spotify$playlist_name))

Scatter plots to Look for Correlations Between Variables

plot(spotify$liveness, spotify$tempo)
plot(spotify$speechiness, spotify$liveness)
plot(spotify$track_popularity, spotify$liveness)

# Creating a subset of the data with numeric variables only to more easily check for correlations
library(tidyverse)
spotify_num <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode)

# Checking for variable correlations
cor(spotify_num)

##                  track_popularity danceability       energy      loudness
## track_popularity      1.000000000  0.046574393 -0.103510773  0.0373368426
## danceability          0.046574393  1.000000000 -0.081426757  0.0153113311
## energy               -0.103510773 -0.081426757  1.000000000  0.6821643541
## loudness              0.037336843  0.015311331  0.682164354  1.0000000000
## speechiness           0.005439570  0.183558194 -0.029030115  0.0129401739
## acousticness          0.091624759 -0.028881286 -0.545878674 -0.3716005101
## instrumentalness     -0.124546651 -0.002274667  0.023850025 -0.1543017028
## liveness             -0.052752799 -0.127054574  0.163802644  0.0819049144
## valence               0.022594291  0.333751328  0.149662060  0.0495341593
## tempo                 0.004321794 -0.184639775  0.151658031  0.0967103886
## key                  -0.007879063  0.007059769  0.012790256 -0.0005832657
## mode                  0.016130687 -0.055270139 -0.004265523 -0.0176398787
##                  speechiness acousticness instrumentalness      liveness
## track_popularity  0.00543957  0.091624759     -0.124546651 -0.0527527986
## danceability      0.18355819 -0.028881286     -0.002274667 -0.1270545736
## energy           -0.02903011 -0.545878674      0.023850025  0.1638026443
## loudness          0.01294017 -0.371600510     -0.154301703  0.0819049144
## speechiness       1.00000000  0.025016481     -0.107921943  0.0592325869
## acousticness      0.02501648  1.000000000     -0.003128449 -0.0745330902
## instrumentalness -0.10792194 -0.003128449      1.000000000 -0.0084967401
## liveness          0.05923259 -0.074533090     -0.008496740  1.0000000000
## valence           0.06482384 -0.018997220     -0.174173559 -0.0197889232
## tempo             0.03275482 -0.114379959      0.021457069  0.0218915079
## key               0.02295464  0.004277595      0.007455312  0.0020729759
## mode             -0.05955242  0.006721610     -0.005800667 -0.0002156869
##                       valence        tempo           key          mode
## track_popularity  0.022594291  0.004321794 -0.0078790634  0.0161306874
## danceability      0.333751328 -0.184639775  0.0070597692 -0.0552701387
## energy            0.149662060  0.151658031  0.0127902556 -0.0042655230
## loudness          0.049534159  0.096710389 -0.0005832657 -0.0176398787
## speechiness       0.064823839  0.032754821  0.0229546448 -0.0595524151
## acousticness     -0.018997220 -0.114379959  0.0042775954  0.0067216097
## instrumentalness -0.174173559  0.021457069  0.0074553115 -0.0058006673
## liveness         -0.019788923  0.021891508  0.0020729759 -0.0002156869
## valence           1.000000000 -0.025046418  0.0216434352 -0.0031256418
## tempo            -0.025046418  1.000000000 -0.0102970040  0.0166918679
## key               0.021643435 -0.010297004  1.0000000000 -0.1759585270
## mode             -0.003125642  0.016691868 -0.1759585270  1.0000000000
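
In a matrix of this size the strongest relationships are easy to miss. Here the largest off-diagonal values are energy with loudness (0.68) and energy with acousticness (-0.55), while no feature correlates with track_popularity beyond about 0.12 in absolute value. The sketch below shows one way to list variable pairs by absolute correlation. It runs on simulated data, so its numbers are not those of the Spotify file.

```r
# Simulated data with a built-in energy/loudness relationship.
set.seed(42)
n <- 500
energy       <- runif(n)
loudness     <- -20 + 15 * energy + rnorm(n, sd = 2)
acousticness <- pmin(pmax(1 - energy + rnorm(n, sd = 0.3), 0), 1)
tempo        <- rnorm(n, mean = 120, sd = 25)
songs_num <- data.frame(energy, loudness, acousticness, tempo)

# Correlation matrix, then one row per unique pair (upper triangle only).
cor_mat <- cor(songs_num)
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Sort by absolute correlation, strongest first.
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs, row.names = FALSE)
```

The correlation matrix can also be passed to `corrplot()` from the corrplot package loaded in section 2 for a visual version.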

Create the final clean data frame with the numeric, genre and sub-genre variables that will be used for modeling

library(tidyverse)
spotify_m <- select(spotify, track_popularity, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, key, mode, playlist_genre, playlist_subgenre)

3.4. Clean Data (show the data in the most condensed form possible)

Transforming genre and sub-genre variables into factors

# Code to transform the character variables into factors
spotify_m$playlist_genre <- as.factor(spotify_m$playlist_genre)
spotify_m$playlist_subgenre <- as.factor(spotify_m$playlist_subgenre)
# Checking whether the factors were created
str(spotify_m)

## 'data.frame':    28352 obs. of  14 variables:
##  $ track_popularity : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ danceability     : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy           : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ loudness         : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ speechiness      : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness     : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness         : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence          : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo            : num  122 100 124 122 124 ...
##  $ key              : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ mode             : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ playlist_genre   : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ playlist_subgenre: Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
##  - attr(*, "na.action")= 'omit' Named int [1:4] 7669 8693 8694 17666
##   ..- attr(*, "names")= chr [1:4] "8152" "9283" "9284" "19569"

# Summary of the clean dataset
summary(spotify_m)

##  track_popularity  danceability        energy            loudness      
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   :-46.448  
##  1st Qu.: 21.00   1st Qu.:0.5610   1st Qu.:0.579000   1st Qu.: -8.310  
##  Median : 42.00   Median :0.6700   Median :0.722000   Median : -6.261  
##  Mean   : 39.34   Mean   :0.6534   Mean   :0.698372   Mean   : -6.818  
##  3rd Qu.: 58.00   3rd Qu.:0.7600   3rd Qu.:0.843000   3rd Qu.: -4.709  
##  Max.   :100.00   Max.   :0.9830   Max.   :1.000000   Max.   :  1.275  
##                                                                        
##   speechiness      acousticness    instrumentalness       liveness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.0410   1st Qu.:0.0143   1st Qu.:0.0000000   1st Qu.:0.0926  
##  Median :0.0626   Median :0.0797   Median :0.0000207   Median :0.1270  
##  Mean   :0.1079   Mean   :0.1772   Mean   :0.0911294   Mean   :0.1910  
##  3rd Qu.:0.1330   3rd Qu.:0.2600   3rd Qu.:0.0065725   3rd Qu.:0.2490  
##  Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960  
##                                                                        
##     valence           tempo             key              mode       
##  Min.   :0.0000   Min.   :  0.00   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:0.3290   1st Qu.: 99.97   1st Qu.: 2.000   1st Qu.:0.0000  
##  Median :0.5120   Median :121.99   Median : 6.000   Median :1.0000  
##  Mean   :0.5104   Mean   :120.96   Mean   : 5.367   Mean   :0.5655  
##  3rd Qu.:0.6950   3rd Qu.:134.00   3rd Qu.: 9.000   3rd Qu.:1.0000  
##  Max.   :0.9910   Max.   :239.44   Max.   :11.000   Max.   :1.0000  
##                                                                     
##  playlist_genre                 playlist_subgenre
##  edm  :4877     southern hip hop         : 1582  
##  latin:4136     indie poptimism          : 1547  
##  pop  :5132     neo soul                 : 1478  
##  r&b  :4504     progressive electro house: 1460  
##  rap  :5398     electro house            : 1416  
##  rock :4305     gangster rap             : 1314  
##                 (Other)                  :19555

Learning: The variables speechiness, acousticness, instrumentalness and liveness are highly skewed, with a significant number of outliers. These variables will need to be analyzed to decide whether they should be part of the analyses and predictive model.
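
Skewness can also be quantified rather than read off the quartiles. The sketch below computes the moment-based sample skewness in base R on simulated toy data; `moment_skewness` is a hypothetical helper written for this illustration (not a library function), and the two toy columns are made up. Values near 0 suggest a roughly symmetric distribution, while large positive values indicate a long right tail.

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated toy data: one symmetric column and one right-skewed column
set.seed(42)
toy <- data.frame(
  symmetric    = rnorm(1000),
  right_skewed = rexp(1000, rate = 10)
)

skew <- sapply(toy, moment_skewness)
skew
```

On the real data, the same helper could be applied with `sapply()` to the numeric columns of `spotify_m`.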

# Summarize the clean dataset using means
# Columns are referenced by name inside summarise(), so the spotify_m$ prefix is not needed
summ1 <- summarise(spotify_m,
          popular_mean = mean(track_popularity, na.rm = TRUE),
          danceab_mean = mean(danceability, na.rm = TRUE),
          energ_mean = mean(energy, na.rm = TRUE),
          loud_mean = mean(loudness, na.rm = TRUE),
          speech_mean = mean(speechiness, na.rm = TRUE),
          acoust_mean = mean(acousticness, na.rm = TRUE),
          instr_mean = mean(instrumentalness, na.rm = TRUE),
          liven_mean = mean(liveness, na.rm = TRUE),
          valen_mean = mean(valence, na.rm = TRUE),
          tempo_mean = mean(tempo, na.rm = TRUE),
          key_mean = mean(key, na.rm = TRUE),
          mode_mean = mean(mode, na.rm = TRUE),
          n = n())

# Summarize the clean dataset using ranges
# range() returns two values (min, max), so the result has two rows;
# dplyr 1.1.0 and later recommend reframe() instead of summarise() for multi-row results
summ2 <- summarise(spotify_m,
          popular_mean = mean(track_popularity, na.rm = TRUE),
          danceab_range = range(danceability, na.rm = TRUE),
          energ_range = range(energy, na.rm = TRUE),
          loud_range = range(loudness, na.rm = TRUE),
          speech_range = range(speechiness, na.rm = TRUE),
          acoust_range = range(acousticness, na.rm = TRUE),
          instr_range = range(instrumentalness, na.rm = TRUE),
          liven_range = range(liveness, na.rm = TRUE),
          valen_range = range(valence, na.rm = TRUE),
          tempo_range = range(tempo, na.rm = TRUE),
          key_range = range(key, na.rm = TRUE),
          mode_range = range(mode, na.rm = TRUE),
          n = n())

# Printing the two key summary tables
print(list(summ1, summ2))

## [[1]]
##   popular_mean danceab_mean energ_mean loud_mean speech_mean acoust_mean
## 1     39.33532    0.6533752  0.6983725 -6.817777   0.1079392    0.177192
##   instr_mean liven_mean valen_mean tempo_mean key_mean mode_mean     n
## 1 0.09112945  0.1909547  0.5103855   120.9582 5.367417 0.5655333 28352
## 
## [[2]]
##   popular_mean danceab_range energ_range loud_range speech_range acoust_range
## 1     39.33532         0.000    0.000175    -46.448        0.000        0.000
## 2     39.33532         0.983    1.000000      1.275        0.918        0.994
##   instr_range liven_range valen_range tempo_range key_range mode_range     n
## 1       0.000       0.000       0.000        0.00         0          0 28352
## 2       0.994       0.996       0.991      239.44        11          1 28352
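
The same kind of summary can be written more compactly with `across()`, which applies a function to every selected column. This is only a sketch on a made-up three-row data frame; `toy`, `means` and `ranges` are hypothetical names, and the generated column names (for example `danceability_mean`) differ from the hand-written ones used above.

```r
suppressPackageStartupMessages(library(dplyr))

# Made-up toy data with two numeric columns and one character column
toy <- data.frame(
  danceability   = c(0.2, 0.5, 0.8),
  energy         = c(0.1, 0.4, 0.7),
  playlist_genre = c("pop", "rock", "pop")
)

# Mean of every numeric column, plus the row count
means <- toy %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"),
            n = n())

# Minimum and maximum of every numeric column, one row in total
ranges <- toy %>%
  summarise(across(where(is.numeric),
                   list(min = ~ min(.x, na.rm = TRUE),
                        max = ~ max(.x, na.rm = TRUE))))

means
ranges
```

Reporting the minimum and maximum as separate columns keeps the result to a single row, which avoids the two-row output that `range()` produces.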

3.5. Provide summary information about the variables of concern in your cleaned data set.

As shown below, the variables speechiness, acousticness, instrumentalness and liveness are highly skewed. In the case of instrumentalness, the first quartile is zero and the median is nearly zero (about 0.00002). The medians of the other three variables are also much closer to the minimum than to the maximum. These variables may need to be re-scaled or eliminated from the model.

# Variables of concerns
spotify_conc <- select(spotify_m, speechiness, acousticness, instrumentalness, liveness)
summary(spotify_conc)

##   speechiness      acousticness    instrumentalness       liveness     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000  
##  1st Qu.:0.0410   1st Qu.:0.0143   1st Qu.:0.0000000   1st Qu.:0.0926  
##  Median :0.0626   Median :0.0797   Median :0.0000207   Median :0.1270  
##  Mean   :0.1079   Mean   :0.1772   Mean   :0.0911294   Mean   :0.1910  
##  3rd Qu.:0.1330   3rd Qu.:0.2600   3rd Qu.:0.0065725   3rd Qu.:0.2490  
##  Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000   Max.   :0.9960
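
One rough way to put a number on "the median sits close to the minimum" is to express the median as a fraction of the min-to-max range. The sketch below uses the minimum, median and maximum values copied from the summary output above; the `stats` data frame and the `median_position` column are hypothetical names introduced for this illustration.

```r
# Min, median and max copied from the summary() output above
stats <- data.frame(
  variable = c("speechiness", "acousticness", "instrumentalness", "liveness"),
  min      = c(0, 0, 0, 0),
  median   = c(0.0626, 0.0797, 0.0000207, 0.1270),
  max      = c(0.9180, 0.9940, 0.9940, 0.9960)
)

# 0 means the median equals the minimum, 0.5 means it sits mid-range
stats$median_position <- (stats$median - stats$min) / (stats$max - stats$min)
stats[order(stats$median_position), c("variable", "median_position")]
```

All four medians fall in the lowest 15% of their range, with instrumentalness the most extreme.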

4. Proposed Exploratory Data Analysis

4.1. Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

As part of the data analysis and modeling, I’ll take a closer look at correlation, skewness, outliers and value frequency, among other measures. I’ll slice the data to separate low and high popularity scores and determine whether these scores correlate with any song characteristics. I’ll also try binning the popularity scores into three segments (e.g. unpopular, popular, very popular) to determine whether new trends emerge. Finally, I’ll try eliminating variables that are highly skewed to see whether that reveals new trends.
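
The three-segment idea could be sketched with base R's `cut()`. The cut points 33 and 66 are arbitrary assumptions chosen for illustration (the analysis has not settled on thresholds), and the `popularity` vector is made up.

```r
# Made-up popularity scores on the 0-100 scale
popularity <- c(0, 12, 33, 34, 50, 66, 67, 85, 100)

# Hypothetical cut points: [0, 33], (33, 66], (66, 100]
segment <- cut(popularity,
               breaks = c(0, 33, 66, 100),
               labels = c("unpopular", "popular", "very popular"),
               include.lowest = TRUE)

table(segment)
```

Setting `include.lowest = TRUE` matters here: without it, songs with a popularity of exactly 0 would fall outside the first interval and become `NA`.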

4.2. What types of plots and tables will help you to illustrate the findings to your questions?

Correlation plots, along with tables that aggregate and group the data by specific values or variables (e.g. low, medium and high instrumentalness), can help reveal trends in the data. I will also create subsets of the data for the different genres and sub-genres to help answer my questions.
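
A grouped summary of this kind might look like the sketch below, which uses a six-row made-up data frame. The instrumentalness thresholds (0.05 and 0.5) and the names `toy`, `by_genre`, `by_level` and `instr_level` are assumptions for illustration only.

```r
suppressPackageStartupMessages(library(dplyr))

# Made-up toy data using column names from the cleaned data set
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "rock", "rock", "edm", "edm"),
  instrumentalness = c(0.00, 0.02, 0.10, 0.40, 0.85, 0.95),
  track_popularity = c(70, 60, 50, 40, 30, 20)
)

# Mean popularity and song count per genre
by_genre <- toy %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity), n = n(), .groups = "drop")

# Bin instrumentalness into low / medium / high (hypothetical thresholds)
by_level <- toy %>%
  mutate(instr_level = cut(instrumentalness,
                           breaks = c(0, 0.05, 0.5, 1),
                           labels = c("low", "medium", "high"),
                           include.lowest = TRUE)) %>%
  group_by(instr_level) %>%
  summarise(mean_popularity = mean(track_popularity), .groups = "drop")

by_genre
by_level
```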

4.3. What do you not know how to do right now that you need to learn to answer your questions?

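I know that some variables are highly skewed and could lead to low-accuracy predictive models for popularity. These highly skewed variables could also mask trends in which characteristics are associated with each genre. I need to learn how best to handle such variables (e.g. re-scaling or eliminating them, as noted in section 3.5).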

4.4. Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?

I plan to incorporate machine learning techniques such as linear regression, trees, cluster analysis and other model architectures to help answer my questions.
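
As a starting point, a multiple linear regression with a simple train/test split could be sketched as follows. The data are simulated (the coefficients 30 and 10 and the noise level are invented so that the example has a known answer), and the 80/20 split is an assumption rather than a decision made in this project.

```r
# Simulated toy data: popularity depends linearly on two characteristics
set.seed(123)
n <- 200
toy <- data.frame(danceability = runif(n), energy = runif(n))
toy$track_popularity <- 20 + 30 * toy$danceability + 10 * toy$energy +
  rnorm(n, sd = 5)

# Hold out 20% of the rows for evaluation (assumed split)
train_idx <- sample(n, size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows, predict on the held-out rows
fit  <- lm(track_popularity ~ danceability + energy, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((test$track_popularity - pred)^2))
coef(fit)
rmse
```

Evaluating on held-out rows gives a more honest estimate of predictive accuracy than the in-sample fit, and the same split can be reused to compare other model types.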