Nikita Sankhe

Introduction

Dataset: Spotify

Problem Statement:

Spotify does a very good job of recommending music to its users, suggesting tracks based on the songs and artists a user plays often or has liked. This data set, built with the spotifyr package, contains track names, artists, genres, subgenres and other audio features.

Objective:

The idea behind the project is to use this dataset to:

  • Identify the most popular songs by group:
    • Which groups are popular, i.e. are they pop? Are they rap?
    • How varied are these groups?
    • What are their audio features/attributes?
  • Understand how the audio features behave and interact with each other, and how they affect the songs within each group.

End Goal:

This analysis aims to give an overview of which songs are the most popular and what their attributes are. The idea is to help an end user better understand what lies behind the most popular songs on Spotify.

Overall Approach:

  • K-Means Clustering
    • To interpret the popularity of songs and how different they are, I intend to use the K-means clustering method, find the centroid of each cluster and see how far these clusters are from each other.
    • Songs with similar characteristics will be grouped, and these groups will help in understanding the audio attributes.
  • Exploratory Data Analysis
    • Visualization techniques to uncover patterns and insights about the audio features and their behaviour with each other
    • Statistical testing to understand variable importance
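
As a rough illustration of the K-means idea above, the sketch below clusters a small synthetic data set and reports the distance between the cluster centroids. The data, the choice of two clusters and the variable names are assumptions made for the example, not results from the Spotify data.

```r
# Toy stand-in for two audio features; values are synthetic, not from the Spotify data.
set.seed(42)
features <- data.frame(
  danceability = c(rnorm(50, 0.8, 0.05), rnorm(50, 0.4, 0.05)),
  energy       = c(rnorm(50, 0.7, 0.05), rnorm(50, 0.3, 0.05))
)

# Scale so that no single feature dominates the Euclidean distance.
scaled <- scale(features)

# Fit k-means with 2 clusters; nstart repeats the fit from several random starts.
fit <- kmeans(scaled, centers = 2, nstart = 25)

# Centroid of each cluster (in scaled units).
fit$centers

# Pairwise Euclidean distance between the centroids.
centroid_dist <- dist(fit$centers)
centroid_dist
```

A larger distance between two centroids suggests that the corresponding groups of songs differ more in their (scaled) audio features.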

Packages Required

#Dataframe
library(knitr)
library(DT)
#Data Manipulation
library(tidyverse)
library(dplyr)
library(tidyr)
#Data Viz
library(ggplot2)
library(GGally)
  • knitr : Helps display better outputs without intense coding. The kable function particularly helps in presenting tables, manipulating table styles.

  • DT : Helps in presenting tables in a clean format, and has the ability to provide filters.

  • GGally : To plot the correlation analysis of variables in matrix form

  • tidyverse : Tidyverse provides a collection of packages including “dplyr”, “tidyr” and “ggplot2”, explained below.

    • dplyr provides functions for data manipulation, such as adding new variables that are functions of existing variables (mutate), select, rename, filter and summarise
    • tidyr helps in tidying data, for example handling missing values with drop_na, fill and replace_na and extracting values from strings, thereby making the data more readable and complete
    • ggplot2 provides elegant visualizations that help to present insights clearly

Data Preparation

This dataset was extracted using the spotifyr package and was obtained from the rfordatascience (TidyTuesday) GitHub repository.

Importing Data

spotify <- read.csv("spotify_songs.csv", stringsAsFactors=FALSE)

About the data

## [1] 32833    23

The data set has 32833 rows of observations with 23 variables.

The following information about the variables is provided on the ‘rfordatascience’ website and helps the users to understand the dataset:

Spotify Dictionary
Variable : Description
  • track_id : Song unique ID
  • track_name : Song Name
  • track_artist : Song Artist
  • track_popularity : Song Popularity (0-100) where higher is better
  • track_album_id : Album unique ID
  • track_album_name : Song album name
  • track_album_release_date : Date when album released
  • playlist_name : Name of playlist
  • playlist_id : Playlist ID
  • playlist_genre : Playlist genre
  • playlist_subgenre : Playlist subgenre
  • danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  • key : The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
  • loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
  • mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
  • speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
  • tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • duration_ms : Duration of song in milliseconds

Data Cleaning

  • First, the total number of missing values for each variable in the data set is identified.
  • The following variables each have 5 missing values:

    • track_name
    • track_artist
    • track_album_name
colSums(is.na(spotify))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
  • Next, we identify the rows with missing values in these 3 columns.
  • From the results below it can be seen that all the missing values in the 3 variables belong to the same 5 rows.
  • The row indices are: 8152, 9283, 9284, 19569 and 19812.
  • These 5 rows are a very small share of the 32833 rows (about 0.015%), so the missing values are not detrimental to our analysis.
which(is.na(spotify$track_name))
## [1]  8152  9283  9284 19569 19812
which(is.na(spotify$track_artist))
## [1]  8152  9283  9284 19569 19812
which(is.na(spotify$track_album_name))
## [1]  8152  9283  9284 19569 19812
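
The check above compares the three outputs by eye. A minimal sketch of doing the same comparison programmatically is shown below on a small synthetic data frame; the toy values and the helper names (`na_rows`, `same_rows`, `pct_missing`) are assumptions for illustration only.

```r
# Toy data frame with NAs placed in the same rows of three columns (synthetic values).
toy <- data.frame(
  track_name       = c("a", NA, "c", NA),
  track_artist     = c("x", NA, "z", NA),
  track_album_name = c("p", NA, "r", NA),
  stringsAsFactors = FALSE
)

cols <- c("track_name", "track_artist", "track_album_name")

# Row indices with a missing value, computed per column.
na_rows <- lapply(toy[cols], function(x) which(is.na(x)))

# TRUE when every column reports exactly the same set of rows.
same_rows <- all(sapply(na_rows, identical, na_rows[[1]]))

# Share of rows with at least one missing value in these columns, as a percentage.
pct_missing <- 100 * mean(!complete.cases(toy[cols]))
```

Applied to the real data, the same two quantities would confirm that the NAs line up and show how small the affected share is.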
  • Certain variables have incorrect data types, and before EDA they need to be corrected.
str(spotify)
## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
  • From the above summary of the structure of the data, the following variables need to be transformed to factors:
  1. playlist_genre : 6 genres, so it is better stored as a factor with 6 levels.
unique(spotify$playlist_genre)
## [1] "pop"   "rap"   "rock"  "latin" "r&b"   "edm"
  2. playlist_subgenre : 24 subgenres, so it is better stored as a factor with 24 levels.
unique(spotify$playlist_subgenre)
##  [1] "dance pop"                 "post-teen pop"            
##  [3] "electropop"                "indie poptimism"          
##  [5] "hip hop"                   "southern hip hop"         
##  [7] "gangster rap"              "trap"                     
##  [9] "album rock"                "classic rock"             
## [11] "permanent wave"            "hard rock"                
## [13] "tropical"                  "latin pop"                
## [15] "reggaeton"                 "latin hip hop"            
## [17] "urban contemporary"        "hip pop"                  
## [19] "new jack swing"            "neo soul"                 
## [21] "electro house"             "big room"                 
## [23] "pop edm"                   "progressive electro house"
  3. key : 12 keys, so it is better stored as a factor with 12 levels.
unique(spotify$key)
##  [1]  6 11  1  7  8  5  4  2  0 10  9  3
  4. mode : 2 modes (0, 1), so it is better stored as a factor with 2 levels.
unique(spotify$mode)
## [1] 1 0
  • Therefore, we transform the above 4 variables to factors
spotify <- spotify %>% mutate(
  playlist_genre = as.factor(playlist_genre),
  playlist_subgenre = as.factor(playlist_subgenre),
  key = as.factor(key),
  mode = as.factor(mode)
  )
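
An alternative way to write the same conversion is with dplyr's `across()`, which applies one function to several columns at once. The sketch below uses a small synthetic data frame; the toy values and the `factor_cols` name are assumptions for the example.

```r
library(dplyr)

# Toy stand-in with the same four columns (synthetic values).
toy <- data.frame(
  playlist_genre    = c("pop", "rap", "pop"),
  playlist_subgenre = c("dance pop", "trap", "electropop"),
  key               = c(6L, 11L, 1L),
  mode              = c(1L, 0L, 1L),
  stringsAsFactors  = FALSE
)

factor_cols <- c("playlist_genre", "playlist_subgenre", "key", "mode")

# Convert all listed columns to factors in one step.
toy <- toy %>% mutate(across(all_of(factor_cols), as.factor))

# Number of levels per converted column.
n_levels <- sapply(toy[factor_cols], nlevels)
n_levels
```

This keeps the list of columns in one place, which can be easier to maintain if more variables need converting later.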
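  • Also, track_id, track_artist, track_album_id, track_album_name, track_album_release_date, playlist_name and playlist_id are not needed for the analysis, so we drop them and keep track_name, track_popularity and columns 10 to 23 (playlist_genre through duration_ms).
spotify <- spotify %>% select(track_name, track_popularity, playlist_genre:duration_ms)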

Summary of Final & Cleaned Dataset

  • The cleaned dataset has 32833 observations of 16 variables
datatable(spotify, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
  • From the summary below, the audio features largely match the descriptions in the data dictionary, both in values and in ranges. One exception is loudness, whose maximum (1.275 dB) lies slightly above the typical upper bound of 0 dB.

  • For speechiness, acousticness, instrumentalness and liveness, however, the mean is noticeably larger than the median, which suggests right-skewed distributions, so we will look at some plots to understand their behaviour.

summary(spotify)
##   track_name        track_popularity playlist_genre
##  Length:32833       Min.   :  0.00   edm  :6043    
##  Class :character   1st Qu.: 24.00   latin:5155    
##  Mode  :character   Median : 45.00   pop  :5507    
##                     Mean   : 42.48   r&b  :5431    
##                     3rd Qu.: 62.00   rap  :5746    
##                     Max.   :100.00   rock :4951    
##                                                    
##                  playlist_subgenre  danceability        energy        
##  progressive electro house: 1809   Min.   :0.0000   Min.   :0.000175  
##  southern hip hop         : 1675   1st Qu.:0.5630   1st Qu.:0.581000  
##  indie poptimism          : 1672   Median :0.6720   Median :0.721000  
##  latin hip hop            : 1656   Mean   :0.6548   Mean   :0.698619  
##  neo soul                 : 1637   3rd Qu.:0.7610   3rd Qu.:0.840000  
##  pop edm                  : 1517   Max.   :0.9830   Max.   :1.000000  
##  (Other)                  :22867                                      
##       key           loudness       mode       speechiness      acousticness   
##  1      : 4010   Min.   :-46.448   0:14259   Min.   :0.0000   Min.   :0.0000  
##  0      : 3454   1st Qu.: -8.171   1:18574   1st Qu.:0.0410   1st Qu.:0.0151  
##  7      : 3352   Median : -6.166             Median :0.0625   Median :0.0804  
##  9      : 3027   Mean   : -6.720             Mean   :0.1071   Mean   :0.1753  
##  11     : 2996   3rd Qu.: -4.645             3rd Qu.:0.1320   3rd Qu.:0.2550  
##  2      : 2827   Max.   :  1.275             Max.   :0.9180   Max.   :0.9940  
##  (Other):13167                                                                
##  instrumentalness       liveness         valence           tempo       
##  Min.   :0.0000000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.0000000   1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96  
##  Median :0.0000161   Median :0.1270   Median :0.5120   Median :121.98  
##  Mean   :0.0847472   Mean   :0.1902   Mean   :0.5106   Mean   :120.88  
##  3rd Qu.:0.0048300   3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92  
##  Max.   :0.9940000   Max.   :0.9960   Max.   :0.9910   Max.   :239.44  
##                                                                        
##   duration_ms    
##  Min.   :  4000  
##  1st Qu.:187819  
##  Median :216000  
##  Mean   :225800  
##  3rd Qu.:253585  
##  Max.   :517810  
## 
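
The mean-versus-median comparison used above can be turned into a quick numeric check. The sketch below is a rough heuristic on synthetic data, not a formal skewness test; the samples and the `skew_gap` name are assumptions for illustration.

```r
# Synthetic right-skewed and symmetric samples (not the Spotify values).
set.seed(1)
toy <- data.frame(
  skewed    = rexp(1000, rate = 10),              # long right tail
  symmetric = rnorm(1000, mean = 0.5, sd = 0.1)   # roughly symmetric
)

# Mean minus median, divided by the standard deviation so columns are comparable.
# Values well above 0 suggest a right skew; values near 0 suggest symmetry.
skew_gap <- sapply(toy, function(x) (mean(x) - median(x)) / sd(x))
skew_gap
```

In the summary above, instrumentalness (mean 0.0847, median 0.0000161) would show a large positive gap under this heuristic.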

Visual EDA

Histograms

  • speechiness, acousticness, instrumentalness and liveness are right-skewed; instrumentalness needs more exploration
spotify %>%
  select(where(is.numeric)) %>% # histograms only for numeric columns
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% # long key-value format
  ggplot(aes(value, fill = key)) + 
    facet_wrap(~ key, scales = "free") +
    geom_histogram(alpha = 0.7, bins = 30) + scale_x_continuous(guide = guide_axis(check.overlap = TRUE))

Barcharts

  • From the chart below it is seen that songs are in mode 1 (major) more often than in mode 0 (minor).
  • Key 1 (C♯/D♭) occurs most frequently in songs
ggplot(spotify,aes(mode)) + geom_bar(aes(fill=mode),alpha = 0.7) 

ggplot(spotify,aes(key)) + geom_bar(aes(fill=key), alpha = 0.7)

Boxplots

  • Loudness, tempo, speechiness, danceability and duration have some obvious outliers. We will take this into consideration while working on the data.

  • Most instrumentalness values are close to 0, which is why its boxplot is compressed near 0 and its histogram shows a single tall bar there.

spotify %>%
  select(where(is.numeric)) %>% # boxplots only for numeric columns
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% # long key-value format
  ggplot(aes(value, fill = key)) + 
    facet_wrap(~ key, scales = "free") +
    geom_boxplot(alpha = 0.7) + coord_flip()
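
To count the outliers rather than only see them, one option is the 1.5 × IQR rule, which is the default rule geom_boxplot uses to draw points beyond the whiskers. The sketch below applies it to synthetic values; the data and the variable names are assumptions for the example.

```r
# Synthetic values with two planted extremes (not the Spotify data).
set.seed(7)
loudness_toy <- c(rnorm(200, mean = -6, sd = 2), -40, -35)

# Quartiles and the 1.5 * IQR fences.
q     <- quantile(loudness_toy, c(0.25, 0.75))
iqr   <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as outliers.
is_outlier <- loudness_toy < lower | loudness_toy > upper
n_outside  <- sum(is_outlier)
loudness_toy[is_outlier]
```

Whether flagged values should be removed, capped or kept is a separate decision that depends on the analysis.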

Proposed EDA

I will be analyzing the popularity of songs, genres and audio features, and how they interact. To present these insights I intend to make use of:

  • ggplot
  • GGally
  • k-means clustering
  • chi-square test
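
As one possible use of the chi-square test listed above, the sketch below tests whether two categorical variables (here genre and mode) are independent. The labels are synthetic and built so that mode depends on genre; the probabilities and names are assumptions, not findings from the Spotify data.

```r
# Synthetic genre labels and a 0/1 mode whose probability depends on genre.
set.seed(3)
genre <- rep(c("pop", "rap", "rock"), each = 200)
p_major <- c(pop = 0.7, rap = 0.4, rock = 0.6)
mode_major <- rbinom(length(genre), size = 1, prob = p_major[genre])

# Contingency table of genre by mode.
tab <- table(genre, mode_major)

# Pearson's chi-squared test of independence.
res <- chisq.test(tab)
res$p.value
```

A small p-value indicates that the distribution of mode differs across genres; it does not by itself measure how strong the association is.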

For k-means clustering, I plan to use the elbow method to find a suitable number of clusters, which should eventually show what separates the cluster of popular songs from the rest.
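
A minimal sketch of the elbow method is given below, assuming synthetic data with three well-separated groups and a search over k = 1 to 8; both choices are illustrative and not taken from the Spotify analysis.

```r
# Synthetic data with three well-separated groups (not the Spotify data).
set.seed(10)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
scaled <- scale(toy)

# Total within-cluster sum of squares for each candidate k.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(scaled, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this toy data the curve drops sharply up to k = 3 and then levels off; on real data the elbow is often less pronounced and needs some judgment.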

At this point, I am unsure whether I will implement other ML methods besides clustering, such as random forests and bagging.