Problem Statement:

The objective of this project is to explore various songs and classify them into genres by analysing their audio features. We can use this to answer questions such as which songs belong to which genre, and which genre is most suitable for dancing, for lifting your mood, or maybe for brooding over something :p

*I might also include some additional analysis of my personal favorite artists and music in order to better understand my own taste in music and which particular features in a song matter most to me.

Approach:

The dataset used for our analysis was extracted from Spotify using the spotifyr package (https://www.rcharlie.com/spotifyr/). The dataset includes 12 audio features or dimensions such as “danceability”, “valence”, and “energy”; these features are explained later in the document. After tidying the data, I plan to use some visualization to understand how each of the 12 audio features relates to each genre and to various artists, perform basic analysis on the data, and then use data mining techniques to classify songs into various categories or genres.

Proposed analytical technique:

I will first perform data cleaning on the dataset to find any missing values and then decide whether to delete or impute them. I will also check whether there are any outliers in the data and, if so, how they might affect the results, and accordingly decide whether to delete or keep them. For the purpose of classification I will most probably be using a decision tree or one of its extensions such as random forest. *I will explore other techniques which might give a better classification rate and then decide which technique is most appropriate.

How will it help the consumer?

Consumers can use the analysis to group songs into broader genres or categories: songs which traditionally may not belong to the same genre but have similar features. These groups can be used to create custom playlists suitable for a particular occasion or mood. Recommendations for new songs from different genres/languages which have similar features such as danceability, energy, valence, etc. can be made to a user depending on their listening history and preferences.

Packages required: data.table, fBasics, tidyverse, randomForest, rpart

#install.packages(c('data.table','tidyverse','randomForest','rpart','fBasics'))

# this package is used to read data files; its fread() function is faster than the readr package
library(data.table) 

# this package is used to get basic statistics of numerical variables using basicStats function
library(fBasics)

# a collection of packages used to clean, visualise, model, and communicate the data.
library(tidyverse) 

# this is used to generate a random forest (a data mining technique) to model the data and classify the results.
library(randomForest) 

# this is used to generate decision trees which will help in classifying the data into genres.
library(rpart) 

Data Preparation

3.1

The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.

Data downloaded from: https://www.dropbox.com/sh/qj0ueimxot3ltbf/AACzMOHv7sZCJsj3ErjtOG7ya?dl=1

3.1

Purpose

I could not find any particular purpose for the source data, since the spotifyr package is only used to extract data, and users can extract and use this data for various purposes. This package was published on 13th July, 2019.

For our analysis we will be using 12 audio features in the dataset: 1. acousticness, 2. liveness, 3. speechiness, 4. instrumentalness, 5. energy, 6. loudness, 7. danceability, 8. valence, 9. duration_ms, 10. tempo, 11. key, 12. mode.

3.2

The original dataset extracted via this package has 23 variables, as listed below. The data is imported using fread() from the data.table package and stored in a data frame called songs.

We then check the variable names in the data frame using the names() function, which returns the column names.

The first 6 observations are displayed using the head() function.

#reading the dataset into songs.
songs <- fread("BANA /Data Wrangling - R/Project/spotify_songs.csv")
#checking variable or column names
names(songs)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"
#displaying the first 6 observations.
head(songs)
##                  track_id
## 1: 6f807x0ima9a1j3VPbc7VN
## 2: 0r7CVbZTWZgbTCYdfa2P31
## 3: 1z1Hg7Vb0AhHDiEmnDE79l
## 4: 75FpbthrwQmzHlBJLuGdC7
## 5: 1e8PAfcKUYoKkxPhrHqw4x
## 6: 7fvUMiyapMsRRxr07cU8Ef
##                                               track_name     track_artist
## 1: I Don't Care (with Justin Bieber) - Loud Luxury Remix       Ed Sheeran
## 2:                       Memories - Dillon Francis Remix         Maroon 5
## 3:                       All the Time - Don Diablo Remix     Zara Larsson
## 4:                     Call You Mine - Keanu Silva Remix The Chainsmokers
## 5:               Someone You Loved - Future Humans Remix    Lewis Capaldi
## 6:     Beautiful People (feat. Khalid) - Jack Wins Remix       Ed Sheeran
##    track_popularity         track_album_id
## 1:               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2:               67 63rPSO264uRjW1X5E6cWv6
## 3:               70 1HoSmj2eLcsrR0vE9gThr4
## 4:               60 1nqYsOef1yKKuGOVchbsk6
## 5:               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6:               67 2yiy9cd2QktrNvWC2EUi0k
##                                         track_album_name
## 1: I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2:                       Memories (Dillon Francis Remix)
## 3:                       All the Time (Don Diablo Remix)
## 4:                           Call You Mine - The Remixes
## 5:               Someone You Loved (Future Humans Remix)
## 6:     Beautiful People (feat. Khalid) [Jack Wins Remix]
##    track_album_release_date playlist_name            playlist_id
## 1:               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 2:               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 3:               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 4:               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 5:               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 6:               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW
##    playlist_genre playlist_subgenre danceability energy key loudness mode
## 1:            pop         dance pop        0.748  0.916   6   -2.634    1
## 2:            pop         dance pop        0.726  0.815  11   -4.969    1
## 3:            pop         dance pop        0.675  0.931   1   -3.432    0
## 4:            pop         dance pop        0.718  0.930   7   -3.778    1
## 5:            pop         dance pop        0.650  0.833   1   -4.672    1
## 6:            pop         dance pop        0.675  0.919   8   -5.385    1
##    speechiness acousticness instrumentalness liveness valence   tempo
## 1:      0.0583       0.1020         0.00e+00   0.0653   0.518 122.036
## 2:      0.0373       0.0724         4.21e-03   0.3570   0.693  99.972
## 3:      0.0742       0.0794         2.33e-05   0.1100   0.613 124.008
## 4:      0.1020       0.0287         9.43e-06   0.2040   0.277 121.956
## 5:      0.0359       0.0803         0.00e+00   0.0833   0.725 123.976
## 6:      0.1270       0.0799         0.00e+00   0.1430   0.585 124.982
##    duration_ms
## 1:      194754
## 2:      162600
## 3:      176616
## 4:      169093
## 5:      189052
## 6:      163049

3.3

There are 32833 observations of 23 variables in total.

str() shows the structure of the data frame. sum() and is.na() are then used together to count all the missing values in the data frame; there are 15 missing values in the dataset. Next we check where these missing values are using apply(): the second argument, 2, applies which() to each column rather than each row, and which() returns the row indices where the missing values are.

#checking the structure of the dataframe songs.
str(songs)
## Classes 'data.table' and 'data.frame':   32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
##  - attr(*, ".internal.selfref")=<externalptr>
#summing the total na values in songs.
sum(is.na(songs))
## [1] 15
#finding out where these missing values are.

apply(is.na(songs), 2, which)
## $track_id
## integer(0)
## 
## $track_name
## [1]  8152  9283  9284 19569 19812
## 
## $track_artist
## [1]  8152  9283  9284 19569 19812
## 
## $track_popularity
## integer(0)
## 
## $track_album_id
## integer(0)
## 
## $track_album_name
## [1]  8152  9283  9284 19569 19812
## 
## $track_album_release_date
## integer(0)
## 
## $playlist_name
## integer(0)
## 
## $playlist_id
## integer(0)
## 
## $playlist_genre
## integer(0)
## 
## $playlist_subgenre
## integer(0)
## 
## $danceability
## integer(0)
## 
## $energy
## integer(0)
## 
## $key
## integer(0)
## 
## $loudness
## integer(0)
## 
## $mode
## integer(0)
## 
## $speechiness
## integer(0)
## 
## $acousticness
## integer(0)
## 
## $instrumentalness
## integer(0)
## 
## $liveness
## integer(0)
## 
## $valence
## integer(0)
## 
## $tempo
## integer(0)
## 
## $duration_ms
## integer(0)
#this function gives an error when I knit the html file, although it works fine when I run it normally. It gives the column names and row numbers where the missing values are. I am not sure how to resolve it.



#summarising the data mean, min and max for numerical variables.
options(digits = 2) # limiting printed output to 2 significant digits


#getting all the numeric variables/columns in numsongs
numsongs <- songs[,sapply(songs, is.numeric), with= FALSE] 
#getting expected value and range.
basicStats(numsongs)[c("Mean","Minimum", "Maximum"),]
##         track_popularity danceability  energy  key loudness mode
## Mean                  42         0.65 0.69862  5.4     -6.7 0.57
## Minimum                0         0.00 0.00017  0.0    -46.4 0.00
## Maximum              100         0.98 1.00000 11.0      1.3 1.00
##         speechiness acousticness instrumentalness liveness valence tempo
## Mean           0.11         0.18            0.085     0.19    0.51   121
## Minimum        0.00         0.00            0.000     0.00    0.00     0
## Maximum        0.92         0.99            0.994     1.00    0.99   239
##         duration_ms
## Mean         225800
## Minimum        4000
## Maximum      517810

We can see that observations 8152, 9283, 9284, 19569, and 19812 have missing values in the columns “track_name”, “track_artist”, and “track_album_name”. For these 5 observations, all the information related to the names of the song, artist, and album is missing. This shouldn’t really be a problem in our analysis as we have all the necessary information required to classify a song into a particular genre.

The data is otherwise clean and there is hardly anything we need to do; the missing values will not affect our analysis, as explained above.
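The same check can be written more compactly in base R. The sketch below runs on a small made-up table (the `toy` data frame and its values are hypothetical, not the Spotify data), so it only illustrates the idea: colSums() counts the missing values per column, complete.cases() flags the affected rows, and which() with arr.ind = TRUE lists the row and column position of every missing cell.

```r
# Toy stand-in for songs (hypothetical values, not the real dataset)
toy <- data.frame(
  track_name   = c("A", NA, "C", NA),
  track_artist = c("X", NA, "Z", NA),
  energy       = c(0.9, 0.8, 0.7, 0.6),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows that contain at least one missing value
na_rows <- which(!complete.cases(toy))

# Row/column position of every missing cell
na_cells <- which(is.na(toy), arr.ind = TRUE)

print(na_per_column)
print(na_rows)
print(na_cells)
```

Applied to songs, the same three calls should report the five rows listed above, assuming the data is loaded as shown earlier.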

Deleting columns

We delete the columns track_id, track_album_id, track_album_name, track_album_release_date, and playlist_id, as we will not need them for any of our analysis.

songs$track_id = NULL
songs$track_album_id = NULL
songs$track_album_name = NULL
songs$track_album_release_date = NULL
songs$playlist_id = NULL
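As an alternative to one assignment per column, the unwanted columns can be removed in a single step. This is a minimal sketch on a made-up data frame (`toy` and `drop_cols` are illustrative names, not part of the original analysis), using base R only.

```r
# Toy stand-in for songs (hypothetical values)
toy <- data.frame(
  track_id     = c("a1", "b2"),
  track_name   = c("Song A", "Song B"),
  playlist_id  = c("p1", "p1"),
  danceability = c(0.75, 0.73)
)

# Columns that are not needed for the analysis
drop_cols <- c("track_id", "playlist_id")

# Keep every column whose name is not in drop_cols
toy <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]

names(toy)

# On a data.table such as songs, the equivalent idiom is
# songs[, (drop_cols) := NULL]
```

Keeping the names in one vector makes it easy to see, and later change, exactly which columns were dropped.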

3.4

The first few observations of the dataset are displayed below to give an idea of what the data looks like.

head(songs)
##                                               track_name     track_artist
## 1: I Don't Care (with Justin Bieber) - Loud Luxury Remix       Ed Sheeran
## 2:                       Memories - Dillon Francis Remix         Maroon 5
## 3:                       All the Time - Don Diablo Remix     Zara Larsson
## 4:                     Call You Mine - Keanu Silva Remix The Chainsmokers
## 5:               Someone You Loved - Future Humans Remix    Lewis Capaldi
## 6:     Beautiful People (feat. Khalid) - Jack Wins Remix       Ed Sheeran
##    track_popularity playlist_name playlist_genre playlist_subgenre
## 1:               66     Pop Remix            pop         dance pop
## 2:               67     Pop Remix            pop         dance pop
## 3:               70     Pop Remix            pop         dance pop
## 4:               60     Pop Remix            pop         dance pop
## 5:               69     Pop Remix            pop         dance pop
## 6:               67     Pop Remix            pop         dance pop
##    danceability energy key loudness mode speechiness acousticness
## 1:         0.75   0.92   6     -2.6    1       0.058        0.102
## 2:         0.73   0.81  11     -5.0    1       0.037        0.072
## 3:         0.68   0.93   1     -3.4    0       0.074        0.079
## 4:         0.72   0.93   7     -3.8    1       0.102        0.029
## 5:         0.65   0.83   1     -4.7    1       0.036        0.080
## 6:         0.68   0.92   8     -5.4    1       0.127        0.080
##    instrumentalness liveness valence tempo duration_ms
## 1:          0.0e+00    0.065    0.52   122      194754
## 2:          4.2e-03    0.357    0.69   100      162600
## 3:          2.3e-05    0.110    0.61   124      176616
## 4:          9.4e-06    0.204    0.28   122      169093
## 5:          0.0e+00    0.083    0.72   124      189052
## 6:          0.0e+00    0.143    0.58   125      163049

Detailed explanation of 12 audio features which we will be using for our analysis

danceability - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

energy - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

key - The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

loudness - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

mode - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

speechiness - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

acousticness - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

instrumentalness - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

liveness - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

valence - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

tempo - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

duration_ms - Duration of the song in milliseconds.
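Some of these encodings are easier to read once decoded. The sketch below is illustrative only: `key_to_pitch()` and `speech_band()` are hypothetical helpers written for this example (they are not part of spotifyr), and the band labels are my own shorthand for the speechiness thresholds described above.

```r
# Pitch-class names for key values 0..11 (0 = C, 1 = C#/Db, 2 = D, ...)
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: map key integers to names; -1 (no key detected) becomes NA
key_to_pitch <- function(key) {
  out <- rep(NA_character_, length(key))
  ok <- !is.na(key) & key >= 0 & key <= 11
  out[ok] <- pitch_classes[key[ok] + 1]
  out
}

# Hypothetical helper: band speechiness using the 0.33 and 0.66 thresholds
speech_band <- function(speechiness) {
  cut(speechiness,
      breaks = c(-Inf, 0.33, 0.66, Inf),
      labels = c("music", "music and speech", "spoken word"))
}

# Decode a few example values
pitches <- key_to_pitch(c(6, 11, 1, -1))
modes   <- ifelse(c(1, 0) == 1, "major", "minor")
bands   <- speech_band(c(0.0583, 0.45, 0.92))
minutes <- 194754 / 60000  # duration_ms converted to minutes

print(pitches)
print(modes)
print(bands)
print(minutes)
```

Decoded columns like these are mainly useful for plot labels; the models can work with the original numeric values.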

Proposed exploratory data analysis

4.1

I plan to generate some boxplots to understand how many outliers there are and in which variables, check the correlation between features and genres, and analyse which artist has the most tracks, which artist has the most popular songs, and which features are most likely to appear in their songs, along with various other visualizations involving artists.

I removed some columns which we did not need for our analysis. As of now I don’t think new columns or data frames will be needed, though here and there I might copy the songs data into a different data frame to perform some analysis so that our original data remains intact.
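The outlier and correlation checks could look roughly like the sketch below. It runs on simulated numbers (the `toy` data frame is made up, with one extreme loudness value planted on purpose), and `count_outliers()` is a hypothetical helper; it counts the points that a default boxplot would draw beyond the whiskers, that is, more than 1.5 times the box length away from the box.

```r
set.seed(1)  # reproducible toy data

# Toy stand-in for the numeric audio features (simulated, not real Spotify values)
toy <- data.frame(
  energy   = runif(200),
  loudness = rnorm(200, mean = -7, sd = 3)
)
toy$loudness[1] <- -46  # plant one extreme value

# Count the values boxplot() would flag as outliers (default coef = 1.5)
count_outliers <- function(x) length(boxplot.stats(x)$out)

outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)

# Pairwise Pearson correlations between the numeric features
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))
```

On the real data the same two calls could be applied to numsongs, the table of numeric columns built earlier.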

Most of the questions in this case will be answered by visualisations.

4.2

Boxplots, frequency plots (bar charts and histograms), scatterplots, and maybe some heatmaps for better visualization.
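As a rough idea of how two of these plots could be produced, the sketch below uses base R graphics on a tiny made-up table (the `toy` values are hypothetical). It first averages each feature within each genre and then draws a boxplot and a heatmap; ggplot2 could be swapped in later for nicer output.

```r
# Toy stand-in for songs (hypothetical values, not the real dataset)
toy <- data.frame(
  playlist_genre = c("pop", "pop", "rock", "rock", "rap", "rap"),
  danceability   = c(0.75, 0.65, 0.45, 0.55, 0.80, 0.90),
  energy         = c(0.90, 0.80, 0.85, 0.95, 0.60, 0.70),
  valence        = c(0.50, 0.70, 0.40, 0.60, 0.30, 0.50)
)

# Mean of each feature within each genre
genre_means <- aggregate(. ~ playlist_genre, data = toy, FUN = mean)

# Turn the result into a genre-by-feature matrix for plotting
mean_matrix <- as.matrix(genre_means[, -1])
rownames(mean_matrix) <- as.character(genre_means$playlist_genre)

# One box per genre, then a heatmap of the genre means
boxplot(danceability ~ playlist_genre, data = toy,
        main = "Danceability by genre")
heatmap(mean_matrix, Rowv = NA, Colv = NA, scale = "none")
```

Setting Rowv and Colv to NA keeps the genres and features in their original order instead of clustering them.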

4.3

I am not very good at making beautiful plots, so I want to learn more about it and apply it in this project, as there are many libraries such as ggplot2 for this.

4.4

Yes, I will be using random forest and decision trees, which I studied in the Data Mining course last flex.
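A minimal sketch of the decision tree step, assuming a simple 70/30 train/test split, is shown below. The data is simulated (two made-up genres with different average speechiness and energy), so the accuracy it prints says nothing about the real Spotify data; it only shows the shape of the workflow with rpart().

```r
library(rpart)  # recursive partitioning; also loaded earlier in this document

set.seed(42)  # reproducible simulation and split

# Simulated stand-in for songs: two genres with different feature averages
# (made-up numbers, not real Spotify values)
n <- 200
toy <- data.frame(
  playlist_genre = factor(rep(c("rap", "rock"), each = n)),
  speechiness    = c(rnorm(n, 0.25, 0.05), rnorm(n, 0.06, 0.03)),
  energy         = c(rnorm(n, 0.65, 0.10), rnorm(n, 0.80, 0.10))
)

# Hold out 30% of the rows for testing
test_idx <- sample(nrow(toy), size = 0.3 * nrow(toy))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit a classification tree on the training rows
tree_fit <- rpart(playlist_genre ~ speechiness + energy,
                  data = train, method = "class")

# Predict genres for the held-out rows and compare with the truth
pred <- predict(tree_fit, newdata = test, type = "class")
confusion <- table(actual = test$playlist_genre, predicted = pred)
accuracy <- mean(pred == test$playlist_genre)

print(confusion)
print(accuracy)
```

randomForest() from the randomForest package accepts the same formula and data arguments, so the tree could later be swapped for a forest with few changes to this workflow.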