The objective of this project is to explore various songs and to classify them into genres by analysing their audio features. This helps music streaming services such as Spotify, YouTube, and many other platforms create playlists and make recommendations to their users based on their preferences and the music they have listened to in the past.
The dataset used for our analysis has been extracted from Spotify using the spotifyr package (https://www.rcharlie.com/spotifyr/). It includes 12 audio features or dimensions such as “danceability”, “energy”, and “valence”; we explain these features later in the document. After tidying the data, I plan to use visualizations to understand how each of the 12 audio features relates to each genre and to various artists, perform basic analysis on the data, and then use data mining techniques to classify songs into various categories or genres.
I will first perform data cleaning on the dataset to find any missing values and then decide whether to delete or impute them, and check whether there are any outliers in the data; if there are, I will assess how they might affect the results and accordingly decide whether to delete or keep them. For classification I will most probably use a decision tree or one of its extensions such as random forest. I will also explore other techniques that might give a better classification rate and then decide which technique is most appropriate.
Consumers can make use of the analysis to put songs into broader genres or categories: songs that traditionally may not belong to the same genre but have similar features can be used to create custom playlists suitable for a particular occasion or mood. Recommendations for new songs from different genres or languages with similar features, such as danceability, energy, and valence, can be made to a user depending on their listening history and preferences.
Packages used: data.table, tidyverse, randomForest, rpart.
# this package is used to read data files; it is faster than the readr package
library(data.table)
# this package is used to get basic statistics of numerical variables via the basicStats() function
library(fBasics)
# a collection of packages used to clean, visualise, model, and communicate data
library(tidyverse)
# this is used to generate random forests (a data mining technique) to model the data and classify the results
library(randomForest)
# this is used to generate decision trees, which will help in classifying the data into genres
library(rpart)
# to arrange multiple ggplots with ggarrange()
library(ggpubr)
# provides the lda() function for linear discriminant analysis
library(MASS)
The data comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata around songs from Spotify’s API.
Data downloaded from: https://www.dropbox.com/sh/qj0ueimxot3ltbf/AACzMOHv7sZCJsj3ErjtOG7ya?dl=1
The source data has no single intended purpose: the spotifyr package is only used to extract data from Spotify’s API, and users can apply it however they like. The package version used here was published on 13 July 2019.
For our analysis we will be using the 12 audio features in the dataset: 1. acousticness, 2. liveness, 3. speechiness, 4. instrumentalness, 5. energy, 6. loudness, 7. danceability, 8. valence, 9. duration, 10. tempo, 11. key, 12. mode.
The original dataset extracted via this package has 23 variables, as listed below. The data is imported using fread() from the data.table package and stored in a data frame called songs.
We then check the variable names in the data frame using the names() function, which returns the column names.
The first 6 observations are displayed using the head() function.
# reading the dataset into songs.
songs <- fread("spotify_songs.csv")
#checking variable or column names
names(songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
#displaying the first 6 observations.
head(songs)
## track_id
## 1: 6f807x0ima9a1j3VPbc7VN
## 2: 0r7CVbZTWZgbTCYdfa2P31
## 3: 1z1Hg7Vb0AhHDiEmnDE79l
## 4: 75FpbthrwQmzHlBJLuGdC7
## 5: 1e8PAfcKUYoKkxPhrHqw4x
## 6: 7fvUMiyapMsRRxr07cU8Ef
## track_name track_artist
## 1: I Don't Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran
## 2: Memories - Dillon Francis Remix Maroon 5
## 3: All the Time - Don Diablo Remix Zara Larsson
## 4: Call You Mine - Keanu Silva Remix The Chainsmokers
## 5: Someone You Loved - Future Humans Remix Lewis Capaldi
## 6: Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran
## track_popularity track_album_id
## 1: 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2: 67 63rPSO264uRjW1X5E6cWv6
## 3: 70 1HoSmj2eLcsrR0vE9gThr4
## 4: 60 1nqYsOef1yKKuGOVchbsk6
## 5: 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6: 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1: I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2: Memories (Dillon Francis Remix)
## 3: All the Time (Don Diablo Remix)
## 4: Call You Mine - The Remixes
## 5: Someone You Loved (Future Humans Remix)
## 6: Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id
## 1: 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 2: 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 3: 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 4: 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 5: 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## 6: 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW
## playlist_genre playlist_subgenre danceability energy key loudness mode
## 1: pop dance pop 0.748 0.916 6 -2.634 1
## 2: pop dance pop 0.726 0.815 11 -4.969 1
## 3: pop dance pop 0.675 0.931 1 -3.432 0
## 4: pop dance pop 0.718 0.930 7 -3.778 1
## 5: pop dance pop 0.650 0.833 1 -4.672 1
## 6: pop dance pop 0.675 0.919 8 -5.385 1
## speechiness acousticness instrumentalness liveness valence tempo
## 1: 0.0583 0.1020 0.00e+00 0.0653 0.518 122.036
## 2: 0.0373 0.0724 4.21e-03 0.3570 0.693 99.972
## 3: 0.0742 0.0794 2.33e-05 0.1100 0.613 124.008
## 4: 0.1020 0.0287 9.43e-06 0.2040 0.277 121.956
## 5: 0.0359 0.0803 0.00e+00 0.0833 0.725 123.976
## 6: 0.1270 0.0799 0.00e+00 0.1430 0.585 124.982
## duration_ms
## 1: 194754
## 2: 162600
## 3: 176616
## 4: 169093
## 5: 189052
## 6: 163049
There are 32,833 observations of 23 variables in total.
sum() and is.na() are used to count the missing values in the data frame; there are 15 missing values in the dataset. Next we check where these missing values are using apply(): the second argument, 2, tells it to search for missing values down columns rather than across rows, and which() returns the row indices of the missing values within each column.
#checking the structure of the dataframe songs.
str(songs)
## Classes 'data.table' and 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
## - attr(*, ".internal.selfref")=<externalptr>
#summing the total na values in songs.
sum(is.na(songs))
## [1] 15
#finding out where these missing values are.
apply(is.na(songs), 2, which)
## $track_id
## integer(0)
##
## $track_name
## [1] 8152 9283 9284 19569 19812
##
## $track_artist
## [1] 8152 9283 9284 19569 19812
##
## $track_popularity
## integer(0)
##
## $track_album_id
## integer(0)
##
## $track_album_name
## [1] 8152 9283 9284 19569 19812
##
## $track_album_release_date
## integer(0)
##
## $playlist_name
## integer(0)
##
## $playlist_id
## integer(0)
##
## $playlist_genre
## integer(0)
##
## $playlist_subgenre
## integer(0)
##
## $danceability
## integer(0)
##
## $energy
## integer(0)
##
## $key
## integer(0)
##
## $loudness
## integer(0)
##
## $mode
## integer(0)
##
## $speechiness
## integer(0)
##
## $acousticness
## integer(0)
##
## $instrumentalness
## integer(0)
##
## $liveness
## integer(0)
##
## $valence
## integer(0)
##
## $tempo
## integer(0)
##
## $duration_ms
## integer(0)
# apply() above sometimes throws an error when the HTML file is knitted, even though it runs fine interactively; it lists, for each column, the row numbers containing missing values. I have not yet resolved the knitting issue.
#summarising the data mean, min and max for numerical variables.
options(digits = 2) # limiting the decimal digits to 2
#getting all the numeric variables/columns in numsongs
numsongs <- songs[,sapply(songs, is.numeric), with= FALSE]
# getting the mean, minimum, and maximum of each variable.
stat <- basicStats(numsongs)[c("Mean","Minimum", "Maximum"),]
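As an alternative to the apply() call above, which sometimes fails at knit time, the same information can be read off the logical matrix directly; this is a small sketch and not part of the original chunk.
# row and column position of every missing value; knits reliably
which(is.na(songs), arr.ind = TRUE)
# per-column counts of missing values
colSums(is.na(songs))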
We can see that observations 8152, 9283, 9284, 19569, and 19812 have missing values in the columns “track_name”, “track_artist”, and “track_album_name”. For these 5 observations, the song, artist, and album names are all missing. This should not be a problem for our analysis, as we still have all the information required to classify a song into a particular genre.
The data is otherwise clean and there is hardly anything we need to do; as explained above, the missing values will not affect our analysis.
We delete the columns track_id, track_album_id, track_album_name, track_album_release_date, and playlist_id, as we will not need them for any of our analysis.
songs$track_id = NULL
songs$track_album_id = NULL
songs$track_album_name = NULL
songs$track_album_release_date = NULL
songs$playlist_id = NULL
Displaying the first few observations of the dataset to give an idea of what the data looks like.
songs %>% head() %>% knitr::kable()
track_name | track_artist | track_popularity | playlist_name | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | Pop Remix | pop | dance pop | 0.75 | 0.92 | 6 | -2.6 | 1 | 0.06 | 0.10 | 0 | 0.07 | 0.52 | 122 | 194754 |
Memories - Dillon Francis Remix | Maroon 5 | 67 | Pop Remix | pop | dance pop | 0.73 | 0.82 | 11 | -5.0 | 1 | 0.04 | 0.07 | 0 | 0.36 | 0.69 | 100 | 162600 |
All the Time - Don Diablo Remix | Zara Larsson | 70 | Pop Remix | pop | dance pop | 0.68 | 0.93 | 1 | -3.4 | 0 | 0.07 | 0.08 | 0 | 0.11 | 0.61 | 124 | 176616 |
Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | Pop Remix | pop | dance pop | 0.72 | 0.93 | 7 | -3.8 | 1 | 0.10 | 0.03 | 0 | 0.20 | 0.28 | 122 | 169093 |
Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | Pop Remix | pop | dance pop | 0.65 | 0.83 | 1 | -4.7 | 1 | 0.04 | 0.08 | 0 | 0.08 | 0.72 | 124 | 189052 |
Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | Pop Remix | pop | dance pop | 0.68 | 0.92 | 8 | -5.4 | 1 | 0.13 | 0.08 | 0 | 0.14 | 0.58 | 125 | 163049 |
danceability - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key - The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms - Duration of song in milliseconds
I plan to generate some boxplots to understand how many outliers there are and in which variables, check the correlation between features and genres, and perform some analysis on which artist has the most tracks, which artist has the most popular songs, and various other visualizations involving artists and the features most likely to appear in their songs.
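For instance, the artist-count question could be answered with a quick dplyr sketch along these lines (not yet run as part of the analysis):
# artists with the most tracks in the dataset
songs %>%
  count(track_artist, sort = TRUE) %>%
  head(10)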
I removed some columns that we do not need for our analysis. As of now I don’t think new columns or data frames will be needed, though here and there I might copy the songs data into a different data frame to perform some analysis so that the original data remains intact.
Most of the questions in this case will be answered by visualisations.
Boxplots, frequency plots (bar charts and histograms), scatterplots, and maybe some heatmaps for better visualization.
I am not very good at producing polished plots, so I want to learn more about this and apply it in the project, as there are many libraries such as ggplot2 for the purpose.
Yes, I will be using random forests and decision trees, which I studied in the Data Mining course last flex.
Checking the total number of songs from each genre using ggplot and then summarizing it in a table.
#plotting count of each genre
songs %>% ggplot(aes(x = fct_infreq(playlist_genre) , fill = playlist_genre))+
geom_bar(width=0.3)+
labs(title = 'Number of songs from each genre', x= 'Genres', y = 'Count' )+
theme_minimal()
#pie-chart
songs %>% ggplot(aes(x = playlist_genre, fill = playlist_genre))+
geom_bar(width=1)+
coord_polar()+
theme_void()
songs %>%
count(playlist_genre) %>%
knitr::kable()
playlist_genre | n |
---|---|
edm | 6043 |
latin | 5155 |
pop | 5507 |
r&b | 5431 |
rap | 5746 |
rock | 4951 |
Here, I have plotted density plots of the audio features for all the genres, which show how each genre relates to each individual audio feature and how relevant that particular feature is in defining that genre.
#extracting all the audio feature columns
feature_names <- names(songs)[7:18]
songs %>%
dplyr::select(c('playlist_genre', feature_names)) %>%
pivot_longer(cols = feature_names) %>%
ggplot(aes(x = value)) +
geom_density(aes(color = playlist_genre), alpha = 0.5) +
facet_wrap(~name, ncol = 3, scales = 'free') +
labs(title = 'Spotify Audio Feature Density - by Genre',
x = '', y = 'density') +
theme(axis.text.y = element_blank())
### Observations
1. EDM tracks are the least likely to be acoustic, are high on energy, and are low on valence (sadder) compared to other genres.
2. Latin tracks have high danceability and high valence (happier).
3. Rock songs are the least danceable compared to other genres.
4. Rap scores high on speechiness, as one would expect, which means it contains more spoken words.

As per our density plots, the following features should provide the most separation among the genres: 1. valence, 2. energy, 3. danceability, and maybe 4. tempo.
So I’ll focus on these features and explore them in more detail.
Plotting boxplots of the genres against the 4 audio features mentioned above.
#Valence
p1 <- songs %>% ggplot(aes(x = playlist_genre, y = valence, color = playlist_genre)) +
geom_boxplot(alpha = 0.7, notch = TRUE) +
theme_bw() +
labs(title = 'How happy or sad are the genres?', x = 'Genres', y = 'Happiness')
#Energy
p2 <- songs %>% ggplot(aes(x = playlist_genre, y = energy, color = playlist_genre)) +
geom_boxplot(alpha = 0.1, notch = TRUE) +
theme_bw() +
labs(title = 'How energetic are the Genres?', x= 'Genres', y = 'Energy' )
#Danceability
p3 <- songs %>% ggplot(aes(x = playlist_genre, y = danceability, color = playlist_genre)) +
geom_boxplot(alpha = 0.5, notch = TRUE) +
theme_bw() +
labs(title = 'Genres and their danceablity', x= 'Genres', y = 'Danceability' )
#Tempo
p4 <- songs %>% ggplot(aes(x = playlist_genre, y = tempo, color = playlist_genre)) +
geom_boxplot(alpha = 0.5, notch = TRUE) +
theme_bw() +
labs(title = 'Genres and their Tempo', x= 'Genres', y = 'Tempo' )
ggarrange(p1,p2,p3,p4 , nrow = 2, ncol = 2)
How do boxplots and density plots together help us understand the genres and their features?
Valence - As observed in the density plot, valence provides a good separation between EDM and Latin tracks, as there is a considerable difference between their median values and ranges, while all the other genres have somewhat similar valence.
Energy - Latin and pop tracks have similar ranges and median values, so energy might not separate them well, while the remaining 4 genres show a decent separation on energy.
Danceability - As the density plots show, rock has the lowest danceability scores, while Latin tracks are closely packed at high danceability scores; rap has a little more variability than Latin but in general also scores high on danceability.
Tempo - It might do a good job of separating EDM tracks from the rest of the genres, as most EDM tracks are clustered around 125 BPM while the other genres have a broadly similar spread and variability.
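A quick numeric check of these spreads (a sketch, not part of the original analysis) is to compare the median tempo and its interquartile range by genre:
# median tempo and interquartile range per genre
songs %>%
  group_by(playlist_genre) %>%
  summarise(median_tempo = median(tempo), iqr_tempo = IQR(tempo))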
songs %>%
dplyr::select(feature_names) %>%
scale() %>%
cor() %>%
corrplot::corrplot(method = 'color',
order = 'hclust',
type = 'upper',
diag = FALSE,
tl.col = 'black',
addCoef.col = "grey30",
number.cex = .7,
col = colorRampPalette(colors = c('red','white','blue'))(200),
main = 'Audio Feature Correlation',
mar = c(2,2,2,2),
family = 'Avenir',
number.digits = 1
)
The correlations between energy and loudness, and between energy and acousticness, are on the higher side, so we will explore them further.
tibble(variable = 'energy', loudness = 0.7, acousticness = -0.5)
## # A tibble: 1 x 3
## variable loudness acousticness
## <chr> <dbl> <dbl>
## 1 energy 0.7 -0.5
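These two values were read off the correlation plot above; they could also be computed directly from the data, for example (a sketch):
# correlations of energy with loudness and acousticness, computed from the data
songs %>%
  summarise(cor_energy_loudness = cor(energy, loudness),
            cor_energy_acousticness = cor(energy, acousticness))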
Plotting scatterplots to understand the correlation better.
s1 <- songs %>% ggplot(aes(energy,loudness)) +
geom_point(color = 'red', alpha = .5, shape = 17) +
geom_smooth(color = 'black')
s2 <- songs %>% ggplot(aes(energy,acousticness)) +
geom_point(color = 'blue', alpha = .5, shape = 17) +
geom_smooth(color = 'black')
ggarrange(s1,s2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can see that as the loudness (dB) of a song increases, so does its energy, and as acousticness decreases, energy increases. The second pattern is not very linear, but there is nevertheless an observable negative correlation.
Thus, from the density plots, boxplots, and correlation graphs we can infer that loudness will not help much in our prediction, as energy gives us better separation. We will discard loudness.
#Scaling
songs_scaled <- songs %>%
mutate_if(is.numeric, scale)
set.seed(4715)
#Training - Testing
id_train <- sample(nrow(songs_scaled),nrow(songs_scaled)*0.80)
songs.train = songs_scaled[id_train,]
songs.test = songs_scaled[-id_train,]
#extracting playlist_genre column
train_resp <- songs_scaled[id_train, 'playlist_genre']
test_resp <- songs_scaled[-id_train, 'playlist_genre']
#creating a function to calculate model accuracy
model_accuracy_calc <- function(df, model_name) {
df %>%
mutate(match = ifelse(true_value == predicted_value, TRUE, FALSE)) %>%
count(match) %>%
mutate(accuracy = n/sum(n),
model = model_name)
}
First up, we will use LDA (linear discriminant analysis) to see how it performs on the data.
#Using lda function from MASS package
songs.lda <- lda(playlist_genre~ valence+energy+danceability+tempo+speechiness,data=songs.train)
#checking the fitted model
songs.lda
## Call:
## lda(playlist_genre ~ valence + energy + danceability + tempo +
## speechiness, data = songs.train)
##
## Prior probabilities of groups:
## edm latin pop r&b rap rock
## 0.19 0.16 0.17 0.16 0.18 0.15
##
## Group means:
## valence energy danceability tempo speechiness
## edm -0.471 0.571 0.0017 0.1795 -0.203
## latin 0.403 0.056 0.4010 -0.0880 -0.039
## pop -0.024 0.014 -0.1027 -0.0039 -0.323
## r&b 0.078 -0.595 0.0975 -0.2473 0.099
## rap -0.027 -0.262 0.4334 -0.0133 0.898
## rock 0.103 0.198 -0.9365 0.1405 -0.486
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## valence -0.198 -0.84 0.30 -0.717 -0.041
## energy -0.328 0.86 -0.08 -0.511 0.373
## danceability 0.735 0.67 0.69 0.091 -0.179
## tempo -0.015 0.20 -0.10 -0.158 -0.999
## speechiness 0.779 -0.11 -0.71 -0.327 0.177
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5
## 0.5748 0.3082 0.0816 0.0351 0.0003
#predicting the training data
pred.lda <- predict(songs.lda,data=songs.train)
#tabulating the predicted and observed values
table(songs.train$playlist_genre,pred.lda$class,dnn=c("Obs","Pred"))
## Pred
## Obs edm latin pop r&b rap rock
## edm 2945 591 283 200 375 468
## latin 798 1526 280 495 674 331
## pop 1170 780 558 687 330 831
## r&b 388 767 303 1397 947 510
## rap 725 528 129 602 2458 227
## rock 552 310 260 492 50 2299
#misclassification rate
mean(ifelse(songs.train$playlist_genre != pred.lda$class, 1, 0))
## [1] 0.57
We used valence, energy, tempo, danceability, and speechiness as our predictors; the model correctly classifies about 43% of the training data (a misclassification rate of 57%).
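Per-genre accuracy on the training data can also be read from the confusion table; a small sketch using the objects above (tab is just a temporary helper object):
# proportion of each observed genre that the LDA classifies correctly (training data)
tab <- table(songs.train$playlist_genre, pred.lda$class)
round(diag(tab) / rowSums(tab), 2)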
pred.lda.test <- predict(songs.lda,data=songs.test)
mean(ifelse(songs.test$playlist_genre != pred.lda.test$class, 1, 0))
## Warning in `!=.default`(songs.test$playlist_genre, pred.lda.test$class):
## longer object length is not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## [1] 0.84
The warnings above indicate that something went wrong: predict.lda() takes a newdata argument rather than data, so the call returned the training-set predictions again and recycled them against the test labels. The reported value of 0.84 is therefore not a valid test-set misclassification rate; a corrected call is sketched below.
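For reference, a corrected evaluation would pass the test set through newdata; this sketch would need to be re-run to obtain the actual test-set rate:
# predict.lda() expects newdata; this evaluates the fitted model on the held-out songs
pred.lda.test <- predict(songs.lda, newdata = songs.test)
mean(songs.test$playlist_genre != pred.lda.test$class)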
model_dt <- rpart(playlist_genre ~ valence+energy+danceability+tempo+speechiness , data = songs.train)
rpart.plot::rpart.plot(model_dt,
type = 5,
extra = 104,
box.palette = list(purple = "#490B32",
red = "#9A031E",
orange = '#FB8B24',
dark_blue = "#0F4C5C",
blue = "#5DA9E9",
grey = '#66717E'),
leaf.round = 0,
fallen.leaves = FALSE,
branch = 0.3,
under = TRUE,
under.col = 'grey40',
family = 'Avenir',
main = 'Genre Decision Tree',
tweak = 1.2)
According to the tree, speechiness is the most important feature, separating rap from the rest of the genres; tracks with low danceability are classified as rock.
High-tempo tracks are classified as either EDM or rap, with the highest-tempo songs classified as rap. This observation is in line with our boxplots, which show that both rap and EDM have high tempo; rap has a larger range, while EDM tracks are packed closely around their median.
The values under the leaves give the true genre composition of the tracks grouped into each leaf: for example, of the tracks classified as rap, 10% are EDM, 14% Latin, 7% pop, 20% R&B, 46% rap, and 2% rock. The same reading applies to the other leaves.
The best classification was achieved for EDM, with 56% classified correctly; the second best was rap, with 46%.
predict_dt <- predict(object = model_dt, newdata = songs.test)
max_id <- apply(predict_dt, 1, which.max)
pred <- levels(as.factor(songs.test$playlist_genre))[max_id]
compare_dt <- data.frame(true_value = songs.test$playlist_genre,
predicted_value = pred,
model = 'decision_tree',
stringsAsFactors = FALSE)
accuracy_dt <- model_accuracy_calc(df = compare_dt, model_name = 'decision_tree')
accuracy_dt
## # A tibble: 2 x 4
## match n accuracy model
## <lgl> <int> <dbl> <chr>
## 1 FALSE 4105 0.625 decision_tree
## 2 TRUE 2462 0.375 decision_tree
We get a classification rate of about 38% on the test data, which is lower than the 43% training-data rate we saw from LDA.
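A confusion table built from the comparison frame shows which genres the tree confuses most often (a quick sketch):
# rows are observed genres, columns are predicted genres
table(Obs = compare_dt$true_value, Pred = compare_dt$predicted_value)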
model_rf <- randomForest(as.factor(playlist_genre) ~ valence+energy+danceability+tempo+speechiness , ntree = 100, importance = TRUE, data = songs.train)
predict_rf <- predict(model_rf, songs.test)
compare_rf <- data.frame(true_value = test_resp,
predicted_value = predict_rf,
model = 'random_forest',
stringsAsFactors = FALSE)
accuracy_rf <- model_accuracy_calc(df = compare_rf, model_name = 'random_forest')
accuracy_rf
## # A tibble: 2 x 4
## match n accuracy model
## <lgl> <int> <dbl> <chr>
## 1 FALSE 3329 0.507 random_forest
## 2 TRUE 3238 0.493 random_forest
Random forest gives us the best accuracy so far, about 49% on the test data.
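Since the forest was fitted with importance = TRUE, we can also check which of the five predictors it relies on most; a small sketch (the exact values would come from re-running the model):
# permutation (mean decrease in accuracy) and Gini importance for each predictor
importance(model_rf)
varImpPlot(model_rf, main = 'Random forest variable importance')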
I tried to classify songs based on their audio features, which is an important task for any music streaming service, as it helps them create playlists and make recommendations to their users.
I started with some basic exploratory data analysis and plotted density plots, which helped me figure out which features would be most important for the classification.
Boxplots of the important audio features from the previous step gave insights similar to the density plots and provided more information on the spread of each feature within each genre.
The Spotify data was then divided into 80% training data and 20% test data, on which LDA, decision tree, and random forest models were fitted.
Random forest gave us the best classification rate, about 49%, on the test data.
One really interesting observation from the correlation plot of the audio features is that danceability is negatively correlated with tempo and energy, though the negative correlation is not very strong. I would want to explore this aspect further.
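As a quick check of this observation (a sketch; the exact values would come from re-running it on the data):
# correlations of danceability with tempo and energy
cor(songs$danceability, songs$tempo)
cor(songs$danceability, songs$energy)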
I am not satisfied with the outcomes of the models: even the random forest still misclassifies about half of the tracks, and I would want to explore this in more depth.
We could also try to single out individual genres instead of predicting all 6, to see which features are most relevant for each and how much that improves classification over fitting all 6 genres in one model (a sketch follows below).
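For example, a one-vs-rest setup for rap might look like the sketch below; the is_rap flag, songs.rap data frame, and model_rap object are hypothetical names, and this was not run as part of the project:
# hypothetical one-vs-rest model: rap vs every other genre
songs.rap <- songs.train %>%
  mutate(is_rap = factor(ifelse(playlist_genre == 'rap', 'rap', 'other')))
model_rap <- randomForest(is_rap ~ valence + energy + danceability + tempo + speechiness,
                          ntree = 100, data = songs.rap)
model_rap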
Lastly, I would want to try neural networks on this data, as I believe they could give better results, but due to a lack of time and of knowledge about neural networks I was not able to use that model in this project.