Spotify is an audio streaming platform that provides DRM-restricted music, videos and podcasts from record labels and media companies. It has over 50 million tracks, which users can browse by parameters such as artist, album, genre, or playlist. It pays artists and rights holders royalties amounting to approximately 70% of its revenue. Thus, it is also a good platform for musicians not only to showcase their talent but also to earn money.
Thus, from a musician's point of view, we have the following problem statements:
To find popular singers.
To find popular genres.
To find whether there is a relationship between artist and genre.
To find the average duration of popular songs.
To find whether there are specific parameters of a song that characterise a particular genre.
Answering these questions might help a musician identify a suitable singer and compose songs with a higher probability of becoming popular.
Methodology: We are using a Spotify dataset that was collected using the Spotify API and is available via the spotifyr package. We begin with an analysis of the data structure, then clean and process the data for further use. Next, we explore the data to find relationships between variables using lists, tables and visualizations.
library(DT)
library(tidyverse)
library(ggplot2)
The DT library is used to create an interactive table view of the dataset.
The tidyverse library is used for cleaning and summarizing the dataset.
The ggplot2 library is used for Exploratory Data Analysis (EDA).
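For completeness, the dataset can be read in before the steps below. This is a minimal sketch assuming the data is saved locally as spotify_songs.csv (an assumed file name, not shown in the original analysis); strings are read as factors to match the structure output shown later.
# Read the dataset from a local CSV file (assumed file name);
# stringsAsFactors = TRUE reproduces the factor columns shown in str() below
spotify_data = read.csv("spotify_songs.csv", stringsAsFactors = TRUE)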
There are 23 variables in the dataset.
Here are the variables we have:
Variables | Description
---|---
track_id | Song unique ID
track_name | Song name
track_artist | Song artist
track_popularity | Song popularity (0-100) where higher is better
track_album_id | Album unique ID
track_album_name | Song album name
track_album_release_date | Date when the album was released
playlist_name | Name of playlist
playlist_id | Playlist ID
playlist_genre | Playlist genre
playlist_subgenre | Playlist subgenre
danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1 (see the sketch after this table).
loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.
speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic.
instrumentalness | Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms | Duration of the song in milliseconds
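As an illustration of the key encoding, the integer codes can be mapped to pitch names with a lookup vector. This is an optional sketch, not part of the original analysis; pitch_labels and key_name are names introduced here purely for illustration.
# Map Pitch Class integers (0-11) to note names; -1 (no key detected) becomes NA
pitch_labels = c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
spotify_data$key_name = ifelse(spotify_data$key == -1, NA,
                               pitch_labels[spotify_data$key + 1])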
Renaming the data
We have renamed a few columns of the dataset so that the column names are not too long and their meaning is easily understood.
spotify_data = spotify_data %>%
rename(
genre = playlist_genre,
sub_genre = playlist_subgenre,
release_date = track_album_release_date,
artist = track_artist,
popularity = track_popularity,
album_name = track_album_name
)
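To confirm that the renaming worked as intended, one could list the column names; this check is an addition and not part of the original output.
# Verify the new column names
names(spotify_data)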
Number of Rows and Columns
# Number of rows and columns in the dataset
nrow(spotify_data)
## [1] 32833
ncol(spotify_data)
## [1] 23
There are 32,833 rows and 23 columns in the dataset.
# Count missing values in each column
missing_val = colSums(is.na(spotify_data))
missing_val
## track_id track_name artist popularity
## 0 5 5 0
## track_album_id album_name release_date playlist_name
## 0 5 0 0
## playlist_id genre sub_genre danceability
## 0 0 0 0
## energy key loudness mode
## 0 0 0 0
## speechiness acousticness instrumentalness liveness
## 0 0 0 0
## valence tempo duration_ms
## 0 0 0
We can see there are 5 missing values each in album_name, artist and track_name.
datatable(
head(spotify_data,50),
extensions = 'FixedColumns',
options = list(
scrollY = "400px",
scrollX = TRUE,
fixedColumns = TRUE
)
)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
The track_name, artist and album_name columns have missing values at the same row indices.
index_track_name = which(is.na(spotify_data$track_name))
index_track_name
## [1] 8152 9283 9284 19569 19812
index_artist = which(is.na(spotify_data$artist))
index_artist
## [1] 8152 9283 9284 19569 19812
index_album_name = which(is.na(spotify_data$album_name))
index_album_name
## [1] 8152 9283 9284 19569 19812
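A quick way to confirm that all three columns are missing in exactly the same rows (this check is an addition, not part of the original output):
# TRUE if all three index vectors are identical
identical(index_track_name, index_artist) && identical(index_artist, index_album_name)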
We are not going to remove the missing values from the dataset, as they will be handled later during data modelling.
Structure of Variables
str(spotify_data)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
## $ track_name : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9368 12887 944 3111 18360 1968 13859 15785 20934 9823 ...
## $ artist : Factor w/ 10692 levels "'Til Tuesday",..: 2848 6185 10633 9373 5530 2848 5000 8320 761 8562 ...
## $ popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : Factor w/ 22545 levels "000f3dTtvpazVzv35NuZmn",..: 7684 17645 4144 4691 21907 8636 21592 17795 21050 13719 ...
## $ album_name : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7928 10684 981 2869 15185 1882 11515 13093 17788 8155 ...
## $ release_date : Factor w/ 4530 levels "1957-01-01","1957-03",..: 4316 4493 4336 4349 4221 4341 4356 4389 4316 4321 ...
## $ playlist_name : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
## $ playlist_id : Factor w/ 471 levels "0275i1VNfBnsNbPl0QIBpG",..: 237 237 237 237 237 237 237 237 237 237 ...
## $ genre : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ sub_genre : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness: num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
Looking at the structure of the dataset, all the columns are already of the types required for EDA and modelling, so we will not be changing any data types.
Summary and Identifying Outliers
summary(spotify_data)
## track_id track_name artist
## 7BKLCZ1jbUBVqRi2FVlTVw: 10 Poison : 22 Martin Garrix : 161
## 14sOS5L36385FJ3OL8hew4: 9 Breathe : 21 Queen : 136
## 3eekarcy7kvN4yt5ZFzltW: 9 Alive : 20 The Chainsmokers: 123
## 0nbXyq5TXYPCO7pr3N8S4I: 8 Forever : 20 David Guetta : 110
## 0qaWEvPkts34WF68r8Dzx9: 8 Paradise: 19 Don Omar : 102
## 0rIAC4PXANcKmitJfoqmVm: 8 (Other) :32726 (Other) :32196
## (Other) :32781 NA's : 5 NA's : 5
## popularity track_album_id
## Min. : 0.00 5L1xcowSxwzFUSJzvyMp48: 42
## 1st Qu.: 24.00 5fstCqs5NpIlF42VhPNv23: 29
## Median : 45.00 7CjJb2mikwAWA1V6kewFBF: 28
## Mean : 42.48 4VFG1DOuTeDMBjBLZT7hCK: 26
## 3rd Qu.: 62.00 2HTbQ0RHwukKVXAlTmCZP2: 21
## Max. :100.00 4CzT5ueFBRpbILw34HQYxi: 21
## (Other) :32666
## album_name release_date
## Greatest Hits : 139 2020-01-10: 270
## Ultimate Freestyle Mega Mix: 42 2019-11-22: 244
## Gold : 35 2019-12-06: 235
## Malibu : 30 2019-12-13: 220
## Rock & Rios (Remastered) : 29 2013-01-01: 219
## (Other) :32553 2019-11-15: 215
## NA's : 5 (Other) :31430
## playlist_name
## Indie Poptimism : 308
## 2020 Hits & 2019 Hits â\200“ Top Global Tracks 🔥🔥🔥 : 247
## Permanent Wave : 244
## Hard Rock Workout : 219
## Ultimate Indie Presents... Best Indie Tracks of the 2010s : 198
## Fitness Workout Electro | House | Dance | Progressive House: 195
## (Other) :31422
## playlist_id genre sub_genre
## 4JkkvMpVl4lSioqQjeAL0q: 247 edm :6043 progressive electro house: 1809
## 37i9dQZF1DWTHM4kX49UKs: 198 latin:5155 southern hip hop : 1675
## 6KnQDwp0syvhfHOR4lWP7x: 195 pop :5507 indie poptimism : 1672
## 3xMQTDLOIGvj3lWH5e5x6F: 189 r&b :5431 latin hip hop : 1656
## 3Ho3iO0iJykgEQNbjB2sic: 182 rap :5746 neo soul : 1637
## 25ButZrVb1Zj1MJioMs09D: 109 rock :4951 pop edm : 1517
## (Other) :31713 (Other) :22867
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
##
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
##
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
##
At first glance at the summary above, we can see that a few of the columns, such as instrumentalness, liveness and acousticness, contain outliers.
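One way to make these outliers visible is a box plot of the affected features. This is a minimal sketch added here for illustration; it reshapes the three columns into long format and draws one box per feature using the packages already loaded.
# Box plots of the features that appear to contain outliers
spotify_data %>%
  select(instrumentalness, liveness, acousticness) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = feature, y = value)) +
  geom_boxplot()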
The main objective of the exploratory data analysis is to determine patterns between the variables of interest and discover insights that help answer our questions.
We will be using ggplot2 to make plots that illustrate and describe the data.
Example - Scatter Plots
ggplot(spotify_data, aes(x = energy, y = liveness , color = genre)) +
geom_point()
Filtering the data to the artist Ed Sheeran and displaying the same scatter plot for that subset.
spotify_edsheeran = spotify_data %>%
filter(artist == "Ed Sheeran")
ggplot(spotify_edsheeran, aes(x = energy, y = liveness , color = genre)) +
geom_point()
Example - Bar Graph
ggplot(spotify_data, aes(x = genre, y = duration_ms)) +
geom_col()
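Note that geom_col() sums duration_ms within each genre, so the bar heights reflect both track duration and the number of tracks per genre. To compare typical durations across genres (relevant to the question about the average duration of songs), one could instead plot the mean duration; this is an illustrative sketch, not part of the original analysis.
# Average track duration (in minutes) per genre
spotify_data %>%
  group_by(genre) %>%
  summarise(mean_duration_min = mean(duration_ms, na.rm = TRUE) / 60000) %>%
  ggplot(aes(x = genre, y = mean_duration_min)) +
  geom_col()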
Correlation plots, scatter plots and bar plots are some of the plots that will help us with our analysis.
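As a preview of the correlation plot, a heatmap of pairwise correlations between the numeric audio features can be built with the packages already loaded. This is a minimal sketch assuming the renamed spotify_data frame; it is not part of the original analysis.
# Correlation heatmap of the numeric audio features
spotify_data %>%
  select(danceability, energy, loudness, speechiness, acousticness,
         instrumentalness, liveness, valence, tempo, duration_ms) %>%
  cor(use = "complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("feature1") %>%
  pivot_longer(-feature1, names_to = "feature2", values_to = "correlation") %>%
  ggplot(aes(x = feature1, y = feature2, fill = correlation)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))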