Data Wrangling Midterm Project

Spotify Midterm Project

Introduction

Spotify is the most popular audio streaming service in the world. The app hosts millions of tracks, which can be browsed by parameters such as artist, album, and genre.

In this project, we aim to understand which audio features determine the genre of a song and which characteristics are responsible for its popularity, using the data we have.

Way Forward

Explore the summary statistics of each audio feature.
Clean the data, i.e. remove null values and outliers, if any.
Check for correlation between audio features and between genres.
Identify the characteristic features of each genre.
Perform basic EDA and observe patterns across each audio feature and across genres.
Build a predictive model to identify the genre and estimate the popularity of a song.
Finally, build an interactive dashboard using Shiny.

Packages Required

The packages which we are going to use in our analysis:

library(plotly)      #Useful for creating interactive visualisations
library(tidyr)       #Tidying data, i.e. converting into long form, etc.
library(ggplot2)     #Used in the visualisation of the data
library(dplyr)       #Used for data wrangling
library(rpart)       #Has the functions which assist in building the decision tree
library(knitr)       #Helps in the integration of R code into HTML
library(kableExtra)  #Useful for construction of complex tables and customisation of styles
library(missForest)  #Imputes missing values using random forests (it does not fit a predictive random forest model)
library(DT)          #Displaying data objects as tables on the HTML page

Data Preparation

Data Source and Summary Of Variables

The Spotify data used in our analysis was taken from this source: Spotify Data

The data has been made available via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

The variables in the dataset and their description:

Data Importing

The dataset contains 32,833 observations of 23 variables.
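
The import step itself is not shown in this report. As a rough sketch only, assuming the data is available as a local CSV file, reading it and checking its dimensions could look like the following. The temporary file, its three columns, and the name `toy_import` are hypothetical stand-ins so that the snippet runs on its own; they are not the real dataset.

```r
# Hypothetical stand-in for the real file: write a tiny CSV to a temp path
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "track_id,track_name,track_popularity",
  "a1,Song A,66",
  "b2,Song B,70"
), csv_path)

# Read the file; keep text columns as character rather than factor
toy_import <- read.csv(csv_path, stringsAsFactors = FALSE)

# Number of rows (observations) and columns (variables)
dim(toy_import)
```

On the full dataset, the same `dim()` call is what would report the 32,833 observations and 23 variables quoted above.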

Data Cleaning and Manipulation

First, we need to check whether any songs appear more than once. For this, we use the track_id column and keep only the first occurrence of each ID.

#Removing Duplicates
spotify_songs_unique = spotify_songs[!duplicated(spotify_songs$track_id),]
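
To illustrate what this step does, here is a minimal sketch on a small made-up data frame (`toy_songs` and its values are hypothetical, not taken from the Spotify data). `duplicated()` flags every repeat of a `track_id` after its first appearance, so negating it keeps one row per track. `dplyr::distinct()` with `.keep_all = TRUE` gives the same rows.

```r
library(dplyr)

# Hypothetical toy data: track "a1" appears in two playlists
toy_songs <- data.frame(
  track_id      = c("a1", "b2", "a1", "c3"),
  playlist_name = c("Pop Hits", "Pop Hits", "Dance Mix", "Rock Classics"),
  stringsAsFactors = FALSE
)

# Base R: keep the first row seen for each track_id
toy_unique_base <- toy_songs[!duplicated(toy_songs$track_id), ]

# dplyr equivalent: one row per track_id, keeping all other columns
toy_unique_dplyr <- distinct(toy_songs, track_id, .keep_all = TRUE)

# Number of rows dropped as duplicates
n_dropped <- nrow(toy_songs) - nrow(toy_unique_base)
```

One consequence worth keeping in mind: when a track appears in several playlists, only the playlist and genre of its first occurrence are retained.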

Removing duplicates leaves 28,356 unique tracks. Next, we keep only the columns that will be useful for our analysis and for building the model, and drop the following:

track_id
track_album_id
track_album_name
playlist_name
playlist_id

#Removing unnecessary columns
spotify_songs_final = spotify_songs_unique[-c(1,5,6,8,9)]
head(spotify_songs_final)

## # A tibble: 6 x 18
##   track_name track_artist track_popularity track_album_rel~ playlist_genre
##   <chr>      <chr>                   <dbl> <chr>            <chr>         
## 1 I Don't C~ Ed Sheeran                 66 2019-06-14       pop           
## 2 Memories ~ Maroon 5                   67 2019-12-13       pop           
## 3 All the T~ Zara Larsson               70 2019-07-05       pop           
## 4 Call You ~ The Chainsm~               60 2019-07-19       pop           
## 5 Someone Y~ Lewis Capal~               69 2019-03-05       pop           
## 6 Beautiful~ Ed Sheeran                 67 2019-07-11       pop           
## # ... with 13 more variables: playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>,
## #   speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## #   liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>
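
Dropping columns by position works here, but it silently removes the wrong columns if the column order ever changes. A possible alternative, sketched on a hypothetical toy data frame with only a few of the real column names, is to drop them by name:

```r
library(dplyr)

# Hypothetical toy data with a subset of the dataset's column names
toy_songs <- data.frame(
  track_id         = c("a1", "b2"),
  track_name       = c("Song A", "Song B"),
  track_album_id   = c("x9", "y8"),
  track_album_name = c("Album X", "Album Y"),
  playlist_name    = c("Pop Hits", "Rock Classics"),
  playlist_id      = c("p1", "p2"),
  danceability     = c(0.75, 0.42),
  stringsAsFactors = FALSE
)

# Columns to drop, referred to by name rather than by position
cols_to_drop <- c("track_id", "track_album_id", "track_album_name",
                  "playlist_name", "playlist_id")

# all_of() raises an error if any listed column is missing
toy_final <- select(toy_songs, -all_of(cols_to_drop))
names(toy_final)
```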

We now check for the missing values across all the columns in the dataset.

colSums(is.na(spotify_songs_final))

##               track_name             track_artist         track_popularity 
##                        4                        4                        0 
## track_album_release_date           playlist_genre        playlist_subgenre 
##                        0                        0                        0 
##             danceability                   energy                      key 
##                        0                        0                        0 
##                 loudness                     mode              speechiness 
##                        0                        0                        0 
##             acousticness         instrumentalness                 liveness 
##                        0                        0                        0 
##                  valence                    tempo              duration_ms 
##                        0                        0                        0

There are four missing values each in track_name and track_artist. This number is very low, and neither column is used in model building, so we proceed without deleting any records.

Now, we look at some of the rows from the final cleaned dataset. Because top_n(100) is called without a weighting column, it selects the 100 rows with the largest values in the last column, duration_ms:
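
Before deciding to keep such records, it can help to look at them directly. The sketch below uses a hypothetical three-row data frame (not the real data) to count missing values per column, pull out the incomplete rows, and confirm that their numeric features are still present.

```r
# Hypothetical toy data with one track missing its name and artist
toy_songs <- data.frame(
  track_name   = c("Song A", NA, "Song C"),
  track_artist = c("Artist 1", NA, "Artist 3"),
  energy       = c(0.81, 0.64, 0.92),
  stringsAsFactors = FALSE
)

# Missing values per column
na_counts <- colSums(is.na(toy_songs))

# Logical flag for rows where either text column is missing
incomplete <- !complete.cases(toy_songs[, c("track_name", "track_artist")])
toy_songs[incomplete, ]

# Check whether the numeric features of those rows contain any NA
feature_cols <- setdiff(names(toy_songs), c("track_name", "track_artist"))
features_missing <- anyNA(toy_songs[incomplete, feature_cols])
```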

spotify_songs_final %>% top_n(100)

## Selecting by duration_ms

## # A tibble: 100 x 18
##    track_name track_artist track_popularity track_album_rel~ playlist_genre
##    <chr>      <chr>                   <dbl> <chr>            <chr>         
##  1 Mirrors    Justin Timb~               77 2013-03-15       pop           
##  2 Bailando ~ Chela                      31 2011-07-06       pop           
##  3 Bring It ~ Geto Boys                  31 1993-03-09       rap           
##  4 Tonight I~ Betty Wright               41 2002-07-02       rap           
##  5 Sixteen    Rick Ross                   0 2012-01-01       rap           
##  6 Fat Frees~ Fat Pat                     3 2012-11-27       rap           
##  7 Still In ~ Shuya Okino                 0 2016-03-04       rock          
##  8 Al Andalu~ Miguel Rios                 0 2005-01-01       rock          
##  9 Dancing W~ Genesis                    48 1973-10-12       rock          
## 10 Killer     Van Der Gra~               33 1986-01-01       rock          
## # ... with 90 more rows, and 13 more variables: playlist_subgenre <chr>,
## #   danceability <dbl>, energy <dbl>, key <dbl>, loudness <dbl>,
## #   mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>
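
The "Selecting by duration_ms" message shows that the rows above are the longest tracks rather than an arbitrary preview. As a sketch on hypothetical toy data, the following makes that choice explicit with `slice_max()` and shows two other ways of previewing rows. It assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data
toy_songs <- data.frame(
  track_name  = c("Song A", "Song B", "Song C", "Song D"),
  duration_ms = c(210000, 185000, 340000, 265000),
  stringsAsFactors = FALSE
)

# Explicit version of the top_n() call: the 2 longest tracks
longest <- slice_max(toy_songs, duration_ms, n = 2)

# First rows in their original order
first_rows <- head(toy_songs, 2)

# Random sample of rows (seed fixed so the sample is reproducible)
set.seed(42)
random_rows <- slice_sample(toy_songs, n = 2)
```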

datatable(spotify_songs_final, filter = 'top', options = list(pageLength = 10))

## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Exploratory Data Analysis

We aim to visualise our data using a mixture of plots such as:

Correlation Matrix
Scatter plots
Histograms
Box plots
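
As a rough illustration of these four plot types, the sketch below runs on simulated data (all values are randomly generated, and the link between loudness and energy is built in by assumption purely so that a correlation is visible). It computes a correlation matrix with `cor()` and builds one ggplot2 object per plot type.

```r
library(ggplot2)

set.seed(1)
n <- 200

# Simulated toy data, not the Spotify dataset
toy_songs <- data.frame(
  playlist_genre = sample(c("pop", "rock", "rap"), n, replace = TRUE),
  energy         = runif(n),
  danceability   = runif(n)
)
# Assumption for illustration: loudness rises with energy, plus noise
toy_songs$loudness <- -12 + 8 * toy_songs$energy + rnorm(n, sd = 1.5)

# Correlation matrix of the numeric features
num_cols <- c("energy", "danceability", "loudness")
cor_mat <- cor(toy_songs[, num_cols])
round(cor_mat, 2)

# Scatter plot of two features
p_scatter <- ggplot(toy_songs, aes(x = energy, y = loudness)) +
  geom_point(alpha = 0.5)

# Histogram of a single feature
p_hist <- ggplot(toy_songs, aes(x = danceability)) +
  geom_histogram(bins = 20)

# Box plot of one feature split by genre
p_box <- ggplot(toy_songs, aes(x = playlist_genre, y = energy)) +
  geom_boxplot()
```

Printing any of the plot objects, for example `print(p_box)`, renders it.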

Initial EDA

For now, we will look at the individual statistics of each variable:

str(spotify_songs_final)

## Classes 'tbl_df', 'tbl' and 'data.frame':    28356 obs. of  18 variables:
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : num  194754 162600 176616 169093 189052 ...

summary(spotify_songs_final)

##   track_name        track_artist       track_popularity
##  Length:28356       Length:28356       Min.   :  0.00  
##  Class :character   Class :character   1st Qu.: 21.00  
##  Mode  :character   Mode  :character   Median : 42.00  
##                                        Mean   : 39.33  
##                                        3rd Qu.: 58.00  
##                                        Max.   :100.00  
##  track_album_release_date playlist_genre     playlist_subgenre 
##  Length:28356             Length:28356       Length:28356      
##  Class :character         Class :character   Class :character  
##  Mode  :character         Mode  :character   Mode  :character  
##                                                                
##                                                                
##                                                                
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5610   1st Qu.:0.579000   1st Qu.: 2.000   1st Qu.: -8.309  
##  Median :0.6700   Median :0.722000   Median : 6.000   Median : -6.261  
##  Mean   :0.6534   Mean   :0.698388   Mean   : 5.368   Mean   : -6.818  
##  3rd Qu.:0.7600   3rd Qu.:0.843000   3rd Qu.: 9.000   3rd Qu.: -4.709  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness     instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.01438   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0626   Median :0.07970   Median :0.0000206  
##  Mean   :0.5655   Mean   :0.1080   Mean   :0.17718   Mean   :0.0911168  
##  3rd Qu.:1.0000   3rd Qu.:0.1330   3rd Qu.:0.26000   3rd Qu.:0.0065700  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.99400   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0926   1st Qu.:0.3290   1st Qu.: 99.97   1st Qu.:187742  
##  Median :0.1270   Median :0.5120   Median :121.99   Median :216933  
##  Mean   :0.1910   Mean   :0.5104   Mean   :120.96   Mean   :226576  
##  3rd Qu.:0.2490   3rd Qu.:0.6950   3rd Qu.:134.00   3rd Qu.:254975  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

We now check the number of songs in each genre:

spotify_songs_final %>% count(playlist_genre) %>% knitr::kable()

playlist_genre	n
edm	4877
latin	4137
pop	5132
r&b	4504
rap	5401
rock	4305

The dataset is fairly balanced across genres: each of the six genres has between 4,137 (latin) and 5,401 (rap) songs.
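
To put the counts on a common scale, they can be converted to percentage shares. The sketch below re-enters the six counts from the table above by hand, which is an assumption made to keep the snippet self-contained; on the real data the same `mutate()` step could follow `count(playlist_genre)` directly.

```r
library(dplyr)

# Counts copied from the table above
genre_counts <- data.frame(
  playlist_genre = c("edm", "latin", "pop", "r&b", "rap", "rock"),
  n              = c(4877, 4137, 5132, 4504, 5401, 4305),
  stringsAsFactors = FALSE
)

# Share of each genre in percent, largest genre first
genre_shares <- genre_counts %>%
  mutate(share_pct = round(100 * n / sum(n), 1)) %>%
  arrange(desc(n))

# Total number of songs; matches the 28356 rows reported by str()
total_songs <- sum(genre_counts$n)
```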

Further EDA

Going forward, we will compare the audio features across genres, check for correlation among the features and among the genres, and examine the distribution of every feature within each individual genre.
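
One possible starting point for this comparison is a table of mean feature values per genre. The sketch below uses simulated data (random values, three genres only) and assumes dplyr 1.0.0 or later for `across()`.

```r
library(dplyr)

set.seed(2)

# Simulated toy data, not the Spotify dataset
toy_songs <- data.frame(
  playlist_genre = rep(c("edm", "rock", "rap"), each = 50),
  energy         = runif(150),
  valence        = runif(150),
  tempo          = rnorm(150, mean = 120, sd = 20),
  stringsAsFactors = FALSE
)

# Mean of each audio feature within each genre
genre_profile <- toy_songs %>%
  group_by(playlist_genre) %>%
  summarise(across(c(energy, valence, tempo), mean), .groups = "drop")

genre_profile
```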

Model Building and Road Ahead

We will try out decision tree and random forest algorithms on our data. Based on the findings from the planned EDA, we will see whether splitting the dataset by genre yields better models. After this, we plan to develop an interactive dashboard using Shiny.
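
As a preview of the decision tree step, here is a minimal sketch with rpart on simulated data. The two genres, the feature distributions, and the 70/30 split are all assumptions chosen for illustration; this is not the model that will be fitted to the real dataset.

```r
library(rpart)

set.seed(3)
n <- 300

# Simulated toy data: two genres that differ in speechiness and energy
genre <- factor(sample(c("rap", "rock"), n, replace = TRUE))
toy_songs <- data.frame(
  playlist_genre = genre,
  speechiness = ifelse(genre == "rap",
                       rnorm(n, mean = 0.25, sd = 0.08),
                       rnorm(n, mean = 0.05, sd = 0.03)),
  energy      = ifelse(genre == "rock",
                       rnorm(n, mean = 0.80, sd = 0.10),
                       rnorm(n, mean = 0.60, sd = 0.10))
)

# Hold out 30% of the rows for testing
test_idx <- sample(n, size = round(0.3 * n))
train <- toy_songs[-test_idx, ]
test  <- toy_songs[test_idx, ]

# Fit a classification tree on the training rows
tree_fit <- rpart(playlist_genre ~ speechiness + energy,
                  data = train, method = "class")

# Predict genres for the held-out rows
pred <- predict(tree_fit, newdata = test, type = "class")

# Confusion matrix and overall accuracy on the test set
conf_mat <- table(actual = test$playlist_genre, predicted = pred)
accuracy <- mean(pred == test$playlist_genre)
```

Evaluating on held-out rows rather than on the training rows gives a more honest estimate of how well the tree generalises.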

Venkat Sureddi, Abhiteja Achanta and Vamsi Chand Emani

3/30/2020