Data Wrangling Midterm Project

Spotify Data Analysis

Introduction

If there’s one thing many people can’t live without, it’s music. Spotify is an international media services provider. The company’s primary business is providing an audio streaming platform, the “Spotify” platform, that provides DRM-restricted music, videos and podcasts from record labels and media companies.

The motivation of this project is to enable anyone to discover patterns and insights about the music that they listen to. In doing so, they gain a better understanding of their own listening behavior on Spotify.

Have you ever wondered how Spotify rates the popularity of songs? Or which factors determine a song’s genre? What characteristics of a song can determine its popularity? Using data analysis, we will try to answer these questions.

The following tasks will be performed:

Find correlation between the different variables
Identify each genre’s features
Create different data models for analysis
Create a predictive model to identify genre and popularity of a song

We plan to achieve this by performing Data Preparation, Exploratory Data Analysis and Predictive Modeling.

Based on our analysis, the consumer will be able to identify which factors influence the popularity of a song on Spotify.

Packages required

The following packages will be used in the analysis:

tidyverse: a collection of packages that work in harmony; loading it attaches the core ‘tidyverse’ packages (including ggplot2 and dplyr) in a single step
ggplot2: a system for declaratively creating graphics, based on The Grammar of Graphics
dplyr: a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges
psych: provides multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, along with basic descriptive statistics
DAAG: provides the data sets and functions of the Data Analysis and Graphics package
highcharter: provides various types of charts, from scatter plots to heatmaps and treemaps
knitr: enables integration of R code into LaTeX, LyX, HTML, Markdown, AsciiDoc, and reStructuredText documents
kableExtra: allows users to construct complex tables and customize styles using a readable syntax
DT: provides an R interface to the JavaScript library DataTables. R data objects (matrices or data frames) can be displayed as tables on HTML pages, and DataTables provides filtering, pagination, sorting, and many other features in the tables

library(tidyverse)
library(ggplot2)
library(dplyr)
library(psych)
library(DAAG)
library(highcharter)
library(knitr)
library(kableExtra)
library(DT)

Data Preparation

Data Source

The data set used in this project can be found here: Spotify Data

Summary of variables

This data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either data or general metadata around songs from Spotify’s API.
The data set contains 32,833 observations of 23 variables.
The following table summarizes all the variables in the data set.

variable_name	description
track_id	unique ID
track_name	Song Name
track_artist	Song Artist
track_popularity	Song Popularity (0-100) where higher is better
track_album_id	Album unique ID
track_album_name	Song album name
track_album_release_date	Date when album released
playlist_name	Name of playlist
playlist_id	Playlist ID
playlist_genre	Playlist genre
playlist_subgenre	Playlist subgenre
danceability	Danceability describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
key	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation.
loudness	The overall loudness of a track in decibels (dB).
mode	Mode indicates the modality (major or minor) of a track
speechiness	Speechiness detects the presence of spoken words in a track.
acousticness	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	Predicts whether a track contains no vocals.
liveness	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive.
tempo	The overall estimated tempo of a track in beats per minute (BPM).
duration_ms	Duration of song in milliseconds

Reading the data from .csv file

spotify_data <- read.csv("C:/Users/Nikita/Downloads/spotify_songs.csv", header = TRUE)
head(spotify_data)

##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049

Cleaning the data set

Duplicate Data

We observe that many songs appear more than once in this dataset. They have the same ‘track_id’ but a different ‘playlist_id’, so we need to remove the duplicated songs. Since a song’s ‘track_id’ is unique and its other quantifiable variables remain the same across playlists, we delete the duplicates based on ‘track_id’, keeping the first occurrence of each song.

spotify_data_unique = spotify_data[!duplicated(spotify_data$track_id),]
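The de-duplication step can be checked on a small example. The sketch below uses a made-up data frame (`toy`, with hypothetical IDs and values) rather than the Spotify data, and shows that `!duplicated()` keeps the first row seen for each `track_id`.

```r
# Toy data (hypothetical values): the same track appears in two playlists.
toy <- data.frame(
  track_id    = c("a1", "b2", "a1", "c3"),
  playlist_id = c("p1", "p1", "p2", "p2"),
  energy      = c(0.9, 0.5, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each track_id,
# so negating it keeps the first row seen for every track.
toy_unique <- toy[!duplicated(toy$track_id), ]

# Number of rows removed as duplicates.
n_removed <- nrow(toy) - nrow(toy_unique)
```

Because only the first occurrence is kept, the playlist-level columns of the surviving row (for example its playlist genre) are those of whichever playlist happened to list the track first. With dplyr loaded, `distinct(toy, track_id, .keep_all = TRUE)` should give an equivalent result.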

Redundant Columns

Now that no songs are repeated in the list, and since we want to analyze which variables influence ‘track_popularity’, we can drop the following columns, which are not useful in our analysis:

track_id
track_album_id
track_album_name
playlist_name
playlist_id
playlist_subgenre

spotify_data_2 <- spotify_data_unique[c(-1, -5, -6, -8, -9, -11)]
head(spotify_data_2)

##                                              track_name     track_artist
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix       Ed Sheeran
## 2                       Memories - Dillon Francis Remix         Maroon 5
## 3                       All the Time - Don Diablo Remix     Zara Larsson
## 4                     Call You Mine - Keanu Silva Remix The Chainsmokers
## 5               Someone You Loved - Future Humans Remix    Lewis Capaldi
## 6     Beautiful People (feat. Khalid) - Jack Wins Remix       Ed Sheeran
##   track_popularity track_album_release_date playlist_genre danceability energy
## 1               66               2019-06-14            pop        0.748  0.916
## 2               67               2019-12-13            pop        0.726  0.815
## 3               70               2019-07-05            pop        0.675  0.931
## 4               60               2019-07-19            pop        0.718  0.930
## 5               69               2019-03-05            pop        0.650  0.833
## 6               67               2019-07-11            pop        0.675  0.919
##   key loudness mode speechiness acousticness instrumentalness liveness valence
## 1   6   -2.634    1      0.0583       0.1020         0.00e+00   0.0653   0.518
## 2  11   -4.969    1      0.0373       0.0724         4.21e-03   0.3570   0.693
## 3   1   -3.432    0      0.0742       0.0794         2.33e-05   0.1100   0.613
## 4   7   -3.778    1      0.1020       0.0287         9.43e-06   0.2040   0.277
## 5   1   -4.672    1      0.0359       0.0803         0.00e+00   0.0833   0.725
## 6   8   -5.385    1      0.1270       0.0799         0.00e+00   0.1430   0.585
##     tempo duration_ms
## 1 122.036      194754
## 2  99.972      162600
## 3 124.008      176616
## 4 121.956      169093
## 5 123.976      189052
## 6 124.982      163049
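The columns above are dropped by position, which silently removes the wrong columns if the column order of the input file ever changes. As an alternative sketch, the same selection can be written by name. The data frame `toy` below is a made-up stand-in with only a few of the report's column names.

```r
# Toy data frame (hypothetical values) with a few of the report's column names.
toy <- data.frame(
  track_id         = c("a1", "b2"),
  track_name       = c("Song A", "Song B"),
  playlist_id      = c("p1", "p2"),
  track_popularity = c(66L, 67L),
  stringsAsFactors = FALSE
)

# Columns to drop, listed by name instead of by position.
drop_cols <- c("track_id", "track_album_id", "track_album_name",
               "playlist_name", "playlist_id", "playlist_subgenre")

# setdiff() keeps the original column order and ignores names in drop_cols
# that are not present in this toy frame.
toy_reduced <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
```

Applied to `spotify_data_unique`, the same `setdiff()` expression should leave the 17 columns shown in the output above.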

Missing Values

Now that our data contains no duplicate rows or redundant columns, we check for missing values in the data set. We use the colSums function together with is.na to count the missing values in each column.

colSums(is.na(spotify_data_2))

##               track_name             track_artist         track_popularity 
##                        4                        4                        0 
## track_album_release_date           playlist_genre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

We observe that the track_name and track_artist columns each have 4 missing values. We can keep these observations, since neither column is used as a predictor and the missing values won’t impact our analysis.

Cleaned data set

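# Keep only the first 100 rows so the rendered table stays light
output_data <- head(spotify_data_2, n = 100)

# Interactive table with per-column filters, 25 rows per page
datatable(output_data, filter = 'top', options = list(pageLength = 25))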

Proposed Exploratory Data Analysis

Initial EDA

We will create various visualizations to analyse the data we have such as:

Scatter plots
Histograms
Box plots
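As a rough sketch of what these three plot types could look like with ggplot2, the example below builds one of each. It runs on synthetic data (`toy`, with made-up values) that only borrows column names from the cleaned data set, so the plots illustrate the code pattern and not any real finding.

```r
library(ggplot2)

# Synthetic stand-in for the cleaned data (values are made up for illustration).
set.seed(1)
toy <- data.frame(
  playlist_genre   = rep(c("pop", "rock", "edm"), each = 20),
  energy           = runif(60),
  loudness         = runif(60, min = -20, max = 0),
  track_popularity = sample(0:100, 60, replace = TRUE)
)

# Scatter plot: relationship between two numeric features.
p_scatter <- ggplot(toy, aes(x = energy, y = loudness)) +
  geom_point(alpha = 0.6)

# Histogram: distribution of a single numeric variable.
p_hist <- ggplot(toy, aes(x = track_popularity)) +
  geom_histogram(bins = 20)

# Box plot: a numeric variable compared across genres.
p_box <- ggplot(toy, aes(x = playlist_genre, y = track_popularity)) +
  geom_boxplot()
```

Printing any of the objects (for example `print(p_box)`) renders the plot. Swapping `toy` for `spotify_data_2` should work unchanged because the column names match.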

Here is the initial EDA of our cleaned data set:

dim(spotify_data_2)

## [1] 28356    17

glimpse(spotify_data_2)

## Observations: 28,356
## Variables: 17
## $ track_name               <fct> I Don't Care (with Justin Bieber) - Loud L...
## $ track_artist             <fct> Ed Sheeran, Maroon 5, Zara Larsson, The Ch...
## $ track_popularity         <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58...
## $ track_album_release_date <fct> 2019-06-14, 2019-12-13, 2019-07-05, 2019-0...
## $ playlist_genre           <fct> pop, pop, pop, pop, pop, pop, pop, pop, po...
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, ...
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, ...
## $ key                      <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5,...
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5...
## $ mode                     <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0....
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.0803...
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0....
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0....
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, ...
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976...
## $ duration_ms              <int> 194754, 162600, 176616, 169093, 189052, 16...

str(spotify_data_2)

## 'data.frame':    28356 obs. of  17 variables:
##  $ track_name              : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
##  $ track_artist            : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_release_date: Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
##  $ playlist_genre          : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

summary(spotify_data_2)

##     track_name                       track_artist   track_popularity
##  Breathe :   18   Queen                    :  130   Min.   :  0.00  
##  Paradise:   17   Martin Garrix            :   87   1st Qu.: 21.00  
##  Poison  :   16   Don Omar                 :   84   Median : 42.00  
##  Alive   :   15   David Guetta             :   81   Mean   : 39.33  
##  Forever :   14   Dimitri Vegas & Like Mike:   68   3rd Qu.: 58.00  
##  (Other) :28272   (Other)                  :27902   Max.   :100.00  
##  NA's    :    4   NA's                     :    4                   
##  track_album_release_date playlist_genre  danceability        energy        
##  2020-01-10:  201         edm  :4877     Min.   :0.0000   Min.   :0.000175  
##  2013-01-01:  189         latin:4137     1st Qu.:0.5610   1st Qu.:0.579000  
##  2019-11-22:  185         pop  :5132     Median :0.6700   Median :0.722000  
##  2019-12-06:  184         r&b  :4504     Mean   :0.6534   Mean   :0.698388  
##  2019-11-15:  183         rap  :5401     3rd Qu.:0.7600   3rd Qu.:0.843000  
##  2008-01-01:  176         rock :4305     Max.   :0.9830   Max.   :1.000000  
##  (Other)   :27238                                                           
##       key            loudness            mode         speechiness    
##  Min.   : 0.000   Min.   :-46.448   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.: -8.309   1st Qu.:0.0000   1st Qu.:0.0410  
##  Median : 6.000   Median : -6.261   Median :1.0000   Median :0.0626  
##  Mean   : 5.368   Mean   : -6.818   Mean   :0.5655   Mean   :0.1080  
##  3rd Qu.: 9.000   3rd Qu.: -4.709   3rd Qu.:1.0000   3rd Qu.:0.1330  
##  Max.   :11.000   Max.   :  1.275   Max.   :1.0000   Max.   :0.9180  
##                                                                      
##   acousticness     instrumentalness       liveness         valence      
##  Min.   :0.00000   Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.01438   1st Qu.:0.0000000   1st Qu.:0.0926   1st Qu.:0.3290  
##  Median :0.07970   Median :0.0000206   Median :0.1270   Median :0.5120  
##  Mean   :0.17718   Mean   :0.0911168   Mean   :0.1910   Mean   :0.5104  
##  3rd Qu.:0.26000   3rd Qu.:0.0065700   3rd Qu.:0.2490   3rd Qu.:0.6950  
##  Max.   :0.99400   Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910  
##                                                                         
##      tempo         duration_ms    
##  Min.   :  0.00   Min.   :  4000  
##  1st Qu.: 99.97   1st Qu.:187742  
##  Median :121.99   Median :216933  
##  Mean   :120.96   Mean   :226576  
##  3rd Qu.:134.00   3rd Qu.:254975  
##  Max.   :239.44   Max.   :517810  
##

Future Tasks (Correlation and Model Creation)

We will compute correlations between the covariates (independent variables) and song popularity (the dependent variable) to identify which variables influence a song’s popularity. Linear regression models and predictive analysis will follow. We also plan to merge the existing data set with other data sets that contain additional information about songs and their features, or with other sources of Spotify data. We may also split the data set into smaller subsets by genre for better analysis.
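A minimal sketch of the planned correlation and regression steps, using base R only. The data here is synthetic: `track_popularity` is deliberately constructed to depend on `energy` and not on `danceability`, an assumption made purely so that the output is easy to interpret. It says nothing about the real Spotify data.

```r
# Synthetic data (made up): popularity is built to depend on energy only.
set.seed(42)
n <- 200
toy <- data.frame(
  energy       = runif(n),
  danceability = runif(n)
)
toy$track_popularity <- 40 + 30 * toy$energy + rnorm(n, sd = 5)

# Pearson correlation of each feature with popularity (2 x 1 matrix).
cors <- cor(toy[, c("energy", "danceability")], toy$track_popularity)

# Linear regression of popularity on the two features.
fit <- lm(track_popularity ~ energy + danceability, data = toy)
coef(fit)
```

On the real data, the same pattern would use the numeric columns of `spotify_data_2` as covariates, and `summary(fit)` would report which coefficients are statistically distinguishable from zero. A per-genre analysis could start from `split(spotify_data_2, spotify_data_2$playlist_genre)`.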