The objective of this project is to explore various songs and to classify them into genres by analysing their audio features. We can use this to answer questions like: which songs belong to which genre, and which genre is most suitable for dancing, for uplifting your mood, or maybe for brooding over something :p?
*I might also include some additional analysis of my personal favorite artists and music in order to better understand my own taste in music and which particular features in a song matter to me most.
The dataset used for our analysis has been extracted from Spotify using the spotifyr package (https://www.rcharlie.com/spotifyr/). The dataset includes 12 audio features or dimensions, such as "danceability", "valence", and "energy"; we explain these features later in the document. After tidying the data, I plan to use some visualization to understand how each of the 12 audio features relates to each genre and to various artists, perform basic analysis on the data, and then use data mining techniques to classify songs into various categories or genres.
I will first perform data cleaning on the dataset to find any missing values and then decide whether to delete or impute them, and check whether there are any outliers in the data; if there are, I will assess how they might affect the results and decide accordingly whether to delete or keep them. For the purpose of classification I will most probably use a decision tree or one of its extensions, such as a random forest. *I will explore other techniques that might give a better classification rate and then decide which technique is most appropriate.
Consumers can use the analysis to put songs into broader genres or categories: songs that traditionally may not belong to the same genre but have similar features can be grouped to create custom playlists suitable for a particular occasion or mood. Recommendations for new songs from different genres/languages with similar features such as danceability, energy, and valence can be made to a user based on their listening history and preferences.
Packages used: data.table, fBasics, tidyverse, randomForest, rpart
#install.packages(c('data.table','tidyverse','randomForest','rpart','fBasics'))
# data.table is used to read data files; its fread() is faster than the readr package
library(data.table)
# fBasics provides basic statistics of numerical variables via the basicStats() function
library(fBasics)
# tidyverse is a collection of packages used to clean, visualise, model, and communicate data
library(tidyverse)
# randomForest generates random forests (a data mining technique) to model the data and classify the results
library(randomForest)
# rpart generates decision trees, which will help in classifying the data into genres
library(rpart)
The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify's API.
Data downloaded from: https://www.dropbox.com/sh/qj0ueimxot3ltbf/AACzMOHv7sZCJsj3ErjtOG7ya?dl=1
The source data has no single stated purpose: spotifyr is a general-purpose package for extracting data from Spotify, and users can apply the extracted data to various ends. The package was published on 13th July, 2019.
For our analysis we will be using the 12 audio features in the dataset:
1. acousticness
2. liveness
3. speechiness
4. instrumentalness
5. energy
6. loudness
7. danceability
8. valence
9. duration
10. tempo
11. key
12. mode
The original dataset extracted via this package has 23 variables, as listed below. Data importing is done using fread() from the data.table package, and the result is stored in a data frame called songs.
We then check the variable names in the data frame using the names() function, which returns the column names.
The first 6 observations are displayed using the head() function.
#reading the dataset into songs.
songs <- fread("BANA /Data Wrangling - R/Project/spotify_songs.csv")
#checking variable or column names
names(songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
#displaying the first 6 observations.
head(songs)
## track_id
## 1: 6f807x0ima9a1j3VPbc7VN
## 2: 0r7CVbZTWZgbTCYdfa2P31
## 3: 1z1Hg7Vb0AhHDiEmnDE79l
## 4: 75FpbthrwQmzHlBJLuGdC7
## 5: 1e8PAfcKUYoKkxPhrHqw4x
## 6: 7fvUMiyapMsRRxr07cU8Ef
## track_name track_artist
## 1: I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2: Memories - Dillon Francis Remix Maroon 5
## 3: All the Time - Don Diablo Remix Zara Larsson
## 4: Call You Mine - Keanu Silva Remix The Chainsmokers
## 5: Someone You Loved - Future Humans Remix Lewis Capaldi
## 6: Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity track_album_id
## 1: 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2: 67 63rPSO264uRjW1X5E6cWv6
## 3: 70 1HoSmj2eLcsrR0vE9gThr4
## 4: 60 1nqYsOef1yKKuGOVchbsk6
## 5: 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6: 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1: I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2: Memories (Dillon Francis Remix)
## 3: All the Time (Don Diablo Remix)
## 4: Call You Mine - The Remixes
## 5: Someone You Loved (Future Humans Remix)
## 6: Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id
## 1: 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 2: 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 3: 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 4: 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 5: 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 6: 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## playlist_genre playlist_subgenre danceability energy key loudness mode
## 1: pop dance pop 0.748 0.916 6 -2.634 1
## 2: pop dance pop 0.726 0.815 11 -4.969 1
## 3: pop dance pop 0.675 0.931 1 -3.432 0
## 4: pop dance pop 0.718 0.930 7 -3.778 1
## 5: pop dance pop 0.650 0.833 1 -4.672 1
## 6: pop dance pop 0.675 0.919 8 -5.385 1
## speechiness acousticness instrumentalness liveness valence tempo
## 1: 0.0583 0.1020 0.00e+00 0.0653 0.518 122.036
## 2: 0.0373 0.0724 4.21e-03 0.3570 0.693 99.972
## 3: 0.0742 0.0794 2.33e-05 0.1100 0.613 124.008
## 4: 0.1020 0.0287 9.43e-06 0.2040 0.277 121.956
## 5: 0.0359 0.0803 0.00e+00 0.0833 0.725 123.976
## 6: 0.1270 0.0799 0.00e+00 0.1430 0.585 124.982
## duration_ms
## 1: 194754
## 2: 162600
## 3: 176616
## 4: 169093
## 5: 189052
## 6: 163049
There are 32833 observations of 23 variables in total.
sum() and is.na() are used to count the missing values in the data frame; there are 15 missing values in the dataset. Next we check where these missing values are using apply(); the second argument, 2, tells apply() to look for missing values by column rather than by row, and which() returns the row indices of the missing values within each column.
#checking the structure of the dataframe songs.
str(songs)
## Classes 'data.table' and 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## - attr(*, ".internal.selfref")=<externalptr>
#summing the total na values in songs.
sum(is.na(songs))
## [1] 15
#finding out where these missing values are.
apply(is.na(songs), 2, which)
## $track_id
## integer(0)
##
## $track_name
## [1] 8152 9283 9284 19569 19812
##
## $track_artist
## [1] 8152 9283 9284 19569 19812
##
## $track_popularity
## integer(0)
##
## $track_album_id
## integer(0)
##
## $track_album_name
## [1] 8152 9283 9284 19569 19812
##
## $track_album_release_date
## integer(0)
##
## $playlist_name
## integer(0)
##
## $playlist_id
## integer(0)
##
## $playlist_genre
## integer(0)
##
## $playlist_subgenre
## integer(0)
##
## $danceability
## integer(0)
##
## $energy
## integer(0)
##
## $key
## integer(0)
##
## $loudness
## integer(0)
##
## $mode
## integer(0)
##
## $speechiness
## integer(0)
##
## $acousticness
## integer(0)
##
## $instrumentalness
## integer(0)
##
## $liveness
## integer(0)
##
## $valence
## integer(0)
##
## $tempo
## integer(0)
##
## $duration_ms
## integer(0)
#Note: this call errors when I knit the HTML file, although it works fine when run interactively; it returns, for each column, the row numbers of the missing values. I am not sure yet how to resolve the knitting issue.
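A possible workaround (my assumption; I have not verified that it avoids the knitting problem) is base R's which() with arr.ind = TRUE, which returns the same information as a single matrix of (row, column) positions:
#workaround sketch: locate every missing value in one call.
which(is.na(songs), arr.ind = TRUE)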
#summarising the data: mean, min, and max for the numerical variables.
options(digits = 2) # limiting the decimal digits to 2
#selecting all the numeric variables/columns into numsongs
numsongs <- songs[, sapply(songs, is.numeric), with = FALSE]
#getting the expected value (mean) and range.
basicStats(numsongs)[c("Mean","Minimum", "Maximum"),]
## track_popularity danceability energy key loudness mode
## Mean 42 0.65 0.69862 5.4 -6.7 0.57
## Minimum 0 0.00 0.00017 0.0 -46.4 0.00
## Maximum 100 0.98 1.00000 11.0 1.3 1.00
## speechiness acousticness instrumentalness liveness valence tempo
## Mean 0.11 0.18 0.085 0.19 0.51 121
## Minimum 0.00 0.00 0.000 0.00 0.00 0
## Maximum 0.92 0.99 0.994 1.00 0.99 239
## duration_ms
## Mean 225800
## Minimum 4000
## Maximum 517810
We can see that observations 8152, 9283, 9284, 19569, and 19812 have missing values in the columns track_name, track_artist, and track_album_name. For these 5 observations, all the information related to the names of the song, artist, and album is missing. This shouldn't be a problem for our analysis, as we still have all the information required to classify a song into a particular genre.
The data is otherwise clean and there is hardly anything we need to do; as explained above, the missing values will not affect our analysis.
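Should we later need complete name fields (for example, for artist-level summaries), the five affected rows could be dropped with a single filter; songs_named is a hypothetical name for the filtered copy:
#optional: drop the 5 rows with missing name fields; not required for
#genre classification since all the audio features are present.
songs_named <- songs[!is.na(track_name)]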
Deleting the columns track_id, track_album_id, track_album_name, track_album_release_date, and playlist_id, as we will not need them for any of our analysis.
#dropping the five unneeded columns in one data.table step (same effect as assigning NULL to each).
songs[, c("track_id", "track_album_id", "track_album_name",
          "track_album_release_date", "playlist_id") := NULL]
Displaying the first few observations of the dataset to give an idea of what the data looks like:
head(songs)
## track_name track_artist
## 1: I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2: Memories - Dillon Francis Remix Maroon 5
## 3: All the Time - Don Diablo Remix Zara Larsson
## 4: Call You Mine - Keanu Silva Remix The Chainsmokers
## 5: Someone You Loved - Future Humans Remix Lewis Capaldi
## 6: Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity playlist_name playlist_genre playlist_subgenre
## 1: 66 Pop Remix pop dance pop
## 2: 67 Pop Remix pop dance pop
## 3: 70 Pop Remix pop dance pop
## 4: 60 Pop Remix pop dance pop
## 5: 69 Pop Remix pop dance pop
## 6: 67 Pop Remix pop dance pop
## danceability energy key loudness mode speechiness acousticness
## 1: 0.75 0.92 6 -2.6 1 0.058 0.102
## 2: 0.73 0.81 11 -5.0 1 0.037 0.072
## 3: 0.68 0.93 1 -3.4 0 0.074 0.079
## 4: 0.72 0.93 7 -3.8 1 0.102 0.029
## 5: 0.65 0.83 1 -4.7 1 0.036 0.080
## 6: 0.68 0.92 8 -5.4 1 0.127 0.080
## instrumentalness liveness valence tempo duration_ms
## 1: 0.0e+00 0.065 0.52 122 194754
## 2: 4.2e-03 0.357 0.69 100 162600
## 3: 2.3e-05 0.110 0.61 124 176616
## 4: 9.4e-06 0.204 0.28 122 169093
## 5: 0.0e+00 0.083 0.72 124 189052
## 6: 0.0e+00 0.143 0.58 125 163049
danceability - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key - The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms - Duration of the song in milliseconds.
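To make a couple of these definitions concrete, here is a small illustrative sketch; the key_name and duration_min columns are my own additions, not part of the dataset:
#sketch: translate key integers to pitch-class labels and duration_ms
#to minutes, per the definitions above.
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
songs_labeled <- copy(songs) #copy() so the original data stays intact
songs_labeled[, key_name := ifelse(key == -1, NA, pitch_classes[key + 1])]
songs_labeled[, duration_min := duration_ms / 60000]
head(songs_labeled[, .(track_name, key, key_name, duration_min)])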
I plan to generate some boxplots to understand how many outliers there are and in which variables, check the correlation between features and genres, and perform some analysis on which artist has the most tracks and which artist has the most popular songs, along with various other visualizations involving artists and the features most likely to appear in their songs.
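As a first pass at these checks, a sketch like the following (ggplot2 loads with the tidyverse) would surface the outliers in one feature per genre; swapping the y variable covers the other features:
#sketch: boxplot of danceability by genre to eyeball outliers.
ggplot(songs, aes(x = playlist_genre, y = danceability)) +
  geom_boxplot() +
  labs(x = "genre", y = "danceability")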
I removed some columns we do not need for our analysis. As of now I don't think new columns or data frames will be needed, though here and there I might copy the songs data into a different data frame before performing an analysis so that our original data remains intact.
Most of the questions in this case will be answered by visualisations.
Boxplots, frequency plots (bar charts and histograms), scatterplots, and maybe some heatmaps for better visualization.
I am not very good at producing polished plots, so I want to learn more about plotting and apply it in this project, as there are many libraries, such as ggplot2, for achieving this.
Yes, I will be using random forests and decision trees, which I studied in the Data Mining course last flex.
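A minimal sketch of the planned modeling step, assuming a simple 70/30 train/test split; the seed and ntree values are arbitrary placeholders, not tuned choices:
#sketch: fit a random forest on the 12 audio features and check the
#confusion matrix on the held-out data.
set.seed(42)
feature_cols <- c("danceability", "energy", "key", "loudness", "mode",
                  "speechiness", "acousticness", "instrumentalness",
                  "liveness", "valence", "tempo", "duration_ms")
train_idx <- sample(nrow(songs), floor(0.7 * nrow(songs)))
train <- songs[train_idx]
test <- songs[-train_idx]
rf_fit <- randomForest(x = train[, ..feature_cols],
                       y = as.factor(train$playlist_genre),
                       ntree = 200)
table(predicted = predict(rf_fit, test[, ..feature_cols]),
      actual = test$playlist_genre)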