Background

In this case study, you will get a list of songs to play next based on your current mood. We start from a dataset that describes the audio characteristics of music tracks. We analyze the raw data with an unsupervised machine-learning approach (PCA followed by K-means clustering). The result is a list of tracks you can listen to next, based on the taste of the track you just played.

Data Pre-Processing

Import Library

# Import library
library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)

Import Data

Import the music-characteristics data. The data source is the public Spotify dataset at https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db

spotify <- read.csv("SpotifyFeatures.csv")
head(spotify)

Check the data type of each column in the Spotify data:

glimpse(spotify)

## Rows: 232,725
## Columns: 18
## $ genre            <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie"…
## $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willi…
## $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par …
## $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", …
## $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0,…
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900…
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.4…
## $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293…
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.27…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.…
## $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "…
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.10…
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, …
## $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major"…
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.95…
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, …
## $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/…
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.53…

Missing data check

anyNA(spotify)

## [1] FALSE

The result above shows that the data is complete, with no missing values.
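If anyNA() had returned TRUE, the next question would be which columns are affected. A minimal sketch of that follow-up check, using an invented toy data frame (not the Spotify data) so it runs on its own:

```r
# Toy data frame with invented values, standing in for `spotify`
toy <- data.frame(
  popularity = c(80L, NA, 91L),
  energy     = c(0.7, 0.5, 0.9),
  genre      = c("Pop", "Rock", NA),
  stringsAsFactors = FALSE
)

# TRUE if at least one value anywhere in the data frame is missing
has_na <- anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

print(has_na)
print(na_per_column)
```

The same two calls can be applied to spotify directly; for this dataset every count would be 0.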

Filter popular songs

Popular songs are the easiest for most people to enjoy. In this step, we keep only songs with a popularity of 80 or higher that are in a major mode.

spotify_filter <- spotify %>% 
  filter(popularity>=80 & mode == "Major") 
  
spotify_filter

Remove unnecessary data

Only the variables that describe the composition of a song are kept as predictors. All other columns are removed.

spotify_pop <- spotify_filter %>% 
 select(-c(artist_name, track_name, track_id, mode, time_signature, key, duration_ms, genre, popularity, tempo))

head(spotify_pop)

Exploratory Data Analysis

summary(spotify_pop)

##   acousticness       danceability        energy       instrumentalness   
##  Min.   :0.000147   Min.   :0.2770   Min.   :0.1470   Min.   :0.0000000  
##  1st Qu.:0.033500   1st Qu.:0.5940   1st Qu.:0.5200   1st Qu.:0.0000000  
##  Median :0.137000   Median :0.6800   Median :0.6360   Median :0.0000000  
##  Mean   :0.209465   Mean   :0.6766   Mean   :0.6379   Mean   :0.0037289  
##  3rd Qu.:0.310000   3rd Qu.:0.7650   3rd Qu.:0.7680   3rd Qu.:0.0000102  
##  Max.   :0.922000   Max.   :0.9420   Max.   :0.9530   Max.   :0.4330000  
##     liveness         loudness        speechiness        valence      
##  Min.   :0.0344   Min.   :-18.064   Min.   :0.0232   Min.   :0.0379  
##  1st Qu.:0.0935   1st Qu.: -7.458   1st Qu.:0.0432   1st Qu.:0.2840  
##  Median :0.1200   Median : -5.893   Median :0.0703   Median :0.4460  
##  Mean   :0.1677   Mean   : -6.307   Mean   :0.1100   Mean   :0.4646  
##  3rd Qu.:0.1940   3rd Qu.: -4.777   3rd Qu.:0.1450   3rd Qu.:0.6190  
##  Max.   :0.7640   Max.   : -2.188   Max.   :0.5650   Max.   :0.9280

plot(prcomp(spotify_pop))

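The bar chart above shows that the first component has a far larger variance than the others, most likely because loudness is measured on a much wider scale than the other variables. We therefore scale the data so that each variable contributes a comparable amount of variance.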

Scaling data

# scaling
spot_scale <- scale(spotify_pop)

# check the PCA again
plot(prcomp(spot_scale))

After scaling, the plot shows that the variances of the components are much more balanced.
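To see what scale() does to the variances, here is a small sketch on two invented columns that roughly mimic the ranges of loudness and energy in the summary above. The values are made up for illustration only.

```r
# Invented values on two very different scales
toy <- data.frame(
  loudness = c(-18.0, -7.5, -5.9, -4.8, -2.2),
  energy   = c(0.15, 0.52, 0.64, 0.77, 0.95)
)

# Variance of each column before scaling
raw_var <- sapply(toy, var)

# scale() centers each column to mean 0 and divides by its standard deviation
scaled <- scale(toy)

# Variance of each column after scaling (1 for every column)
scaled_var <- apply(scaled, 2, var)

print(raw_var)
print(scaled_var)
```

Before scaling, the loudness column has a much larger variance than energy and would dominate a PCA. After scaling, both columns have a variance of 1.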

Model : PCA

# use the data that has already been scaled
pca_spot <- PCA(spot_scale, scale.unit = FALSE)

pca_spot

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 745 individuals, described by 8 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

summary(pca_spot)

## 
## Call:
## PCA(X = spot_scale, scale.unit = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.294   1.342   1.019   0.945   0.862   0.734   0.562
## % of var.             28.718  16.803  12.750  11.832  10.794   9.183   7.033
## Cumulative % of var.  28.718  45.520  58.271  70.103  80.897  90.079  97.112
##                        Dim.8
## Variance               0.231
## % of var.              2.888
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                |  2.332 | -0.763  0.034  0.107 | -0.852  0.073  0.134 |
## 2                |  2.556 | -1.954  0.223  0.584 | -1.513  0.229  0.350 |
## 3                |  1.446 | -0.722  0.031  0.250 |  0.374  0.014  0.067 |
## 4                |  2.693 | -1.081  0.068  0.161 |  1.650  0.272  0.376 |
## 5                |  0.937 | -0.001  0.000  0.000 | -0.123  0.002  0.017 |
## 6                |  2.718 |  0.276  0.004  0.010 |  0.744  0.055  0.075 |
## 7                |  3.546 | -0.662  0.026  0.035 |  0.454  0.021  0.016 |
## 8                |  1.501 |  0.371  0.008  0.061 |  0.179  0.003  0.014 |
## 9                |  1.482 |  1.233  0.089  0.692 | -0.200  0.004  0.018 |
## 10               |  1.737 | -0.732  0.031  0.178 | -0.276  0.008  0.025 |
##                   Dim.3    ctr   cos2  
## 1                -1.669  0.367  0.512 |
## 2                -0.383  0.019  0.022 |
## 3                 0.449  0.027  0.096 |
## 4                -0.938  0.116  0.121 |
## 5                 0.531  0.037  0.321 |
## 6                 1.606  0.340  0.349 |
## 7                -1.704  0.382  0.231 |
## 8                -0.684  0.062  0.208 |
## 9                 0.262  0.009  0.031 |
## 10                0.311  0.013  0.032 |
## 
## Variables
##                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## acousticness     | -0.692 20.879  0.480 | -0.211  3.315  0.045 |  0.074  0.541
## danceability     |  0.166  1.199  0.028 |  0.793 46.850  0.630 |  0.123  1.496
## energy           |  0.861 32.296  0.742 | -0.212  3.348  0.045 |  0.160  2.524
## instrumentalness | -0.303  3.997  0.092 | -0.071  0.378  0.005 |  0.478 22.454
## liveness         |  0.160  1.117  0.026 | -0.102  0.769  0.010 | -0.834 68.363
## loudness         |  0.779 26.450  0.608 | -0.203  3.074  0.041 |  0.121  1.428
## speechiness      |  0.001  0.000  0.000 |  0.746 41.506  0.558 | -0.118  1.372
## valence          |  0.568 14.063  0.323 |  0.101  0.762  0.010 |  0.136  1.823
##                    cos2  
## acousticness      0.006 |
## danceability      0.015 |
## energy            0.026 |
## instrumentalness  0.229 |
## liveness          0.697 |
## loudness          0.015 |
## speechiness       0.014 |
## valence           0.019 |
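The eigenvalue table can be used to decide how many dimensions to keep. The sketch below recomputes the cumulative percentage from the eigenvalues printed above and finds the first dimension that reaches a threshold. The 80% threshold is an assumption chosen for illustration; it is not a rule used in this analysis.

```r
# Eigenvalues copied from the summary(pca_spot) output above
eig <- c(2.294, 1.342, 1.019, 0.945, 0.862, 0.734, 0.562, 0.231)

# Share of total variance per dimension, and the running total
pct_var <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct_var)

# Assumed threshold: keep enough dimensions to cover 80% of the variance
threshold <- 80
n_dims <- which(cum_pct >= threshold)[1]

print(round(cum_pct, 2))
print(n_dims)
```

With the fitted object available, the same table is stored in pca_spot$eig, so the eigenvalues do not have to be typed in by hand.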

Choosing Optimum K

After scaling the data and running PCA, the next step is to find the optimum number of clusters for K-means. Use fviz_nbclust() from factoextra to find the optimum K with the elbow method, which plots the total within-cluster sum of squares ("wss") for each candidate K.

fviz_nbclust(spot_scale, kmeans, method = "wss")

Based on the plot above, the optimum K is 6.
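The elbow curve can also be computed by hand with base R, which makes it clear what is being plotted. This is a sketch on the built-in iris data rather than the Spotify data, and the range of K from 1 to 10 and the nstart and iter.max values are assumptions.

```r
set.seed(101)

# Stand-in data: the four numeric columns of iris, scaled
x <- scale(iris[, 1:4])

# Candidate numbers of clusters (assumed range)
k_values <- 1:10

# Total within-cluster sum of squares for each K
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

print(round(wss, 2))
```

Passing k_values and wss to plot() with type = "b" draws the curve. The elbow is the K after which adding more clusters reduces wss only slightly.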

Clustering

In this step, the chosen K is used for K-means clustering, and a new cluster column is created to label each observation.

# set.seed to ensure reproducible example
set.seed(101)

# use kmeans (centers = number of clusters, which is 6)
# note: this call clusters the unscaled spotify_pop, not spot_scale
spot_cluster <- kmeans(spotify_pop, centers = 6)

# show how many observations on each cluster
data.frame(cbind(cluster=c(1:6), observation=spot_cluster$size))

spotify_filter$cluster <- spot_cluster$cluster
tail(spotify_pop)

fviz_cluster(object=spot_cluster,
             data = spotify_pop)
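K-means starts from randomly chosen centers, so a single run can end in a poor local solution. A common safeguard is the nstart argument of kmeans(), which tries several random starts and keeps the best one. The sketch below uses the built-in iris data and 3 clusters as stand-ins, and nstart = 25 is an assumed value, not a setting from this analysis.

```r
# Stand-in data: the four numeric columns of iris, scaled
x <- scale(iris[, 1:4])

# One random start
set.seed(101)
single <- kmeans(x, centers = 3, nstart = 1)

# 25 random starts; kmeans keeps the run with the lowest tot.withinss
set.seed(101)
multi <- kmeans(x, centers = 3, nstart = 25)

print(c(single = single$tot.withinss, multi = multi$tot.withinss))
print(multi$size)
```

The run with more starts usually reports a total within-cluster sum of squares that is no higher than the single-start run, at the cost of extra computation.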

Goodness of Fit

We can check it from 3 values:

Within Sum of Squares tot.withinss : the sum of squared distances from each observation to the centroid of its cluster

spot_cluster$tot.withinss

## [1] 351.3024

Total Sum of Squares totss : the sum of squared distances from each observation to the global sample mean

# totss
spot_cluster$totss

## [1] 3760.574

Between Sum of Squares betweenss : the sum of squared distances from each cluster centroid to the global sample mean, weighted by the number of observations in the cluster

spot_cluster$betweenss

## [1] 3409.271

Another ‘goodness’ measure is the ratio betweenss/totss (the closer the value is to 1 or 100%, the better):

# `betweenss`/`totss`
((spot_cluster$betweenss)/(spot_cluster$totss))*100

## [1] 90.65828

The ratio is above 90%, which means that most of the total variation lies between the clusters rather than inside them, so the clusters are well separated. This is what lets you hear the right music for your mood.

So, this is the result of grouping the tracks with the clustering method. The new cluster column summarizes the general composition of the music in each group.
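The three values are linked: the total sum of squares is the within-cluster part plus the between-cluster part. A quick check with the numbers printed above (copied by hand, so small rounding differences are expected):

```r
# Values copied from the kmeans output above
tot_withinss <- 351.3024
totss        <- 3760.574
betweenss    <- 3409.271

# totss should equal tot.withinss + betweenss, up to rounding of the printed values
decomposition_gap <- abs((tot_withinss + betweenss) - totss)

# Share of the total variation that lies between clusters, in percent
ratio_pct <- 100 * betweenss / totss

print(decomposition_gap)
print(round(ratio_pct, 2))
```

Because of this identity, a ratio close to 100% is the same statement as a small tot.withinss relative to totss.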

spotify_filter %>% 
  group_by(cluster) %>% 
  summarise(across(where(is.numeric), mean)) %>% 
  select(cluster, acousticness, danceability, energy, instrumentalness, speechiness, valence)

Try to find suggestion song

case : you are listening to ‘Shallow’ by Lady Gaga and you don’t yet know which song to play next. This app will show you the next songs with a similar taste and composition.

# Find out the cluster of favorite song
spotify_filter %>% 
  filter(artist_name == "Lady Gaga", track_name == "Shallow")

The result shows that Shallow by Lady Gaga appears under two genres, and both entries fall in the same cluster, cluster 2. Because the song has two genres, you have more options when choosing which genre you want to enjoy.

Let’s say you have chosen the Pop genre. Which songs will be suggested next? You will see a list of 5 songs from the same cluster that you can enjoy.

spotify_filter %>% 
  filter(cluster == 2, genre == "Pop") %>% 
  head(5)
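The two lookups above can be combined into one helper. The sketch below is a hypothetical function, suggest_tracks(), written in base R on an invented toy table so that it runs on its own. Excluding the seed song and sorting by popularity are assumptions added here; the original code simply takes the first 5 rows.

```r
# Toy stand-in for spotify_filter (all values invented for illustration)
tracks <- data.frame(
  artist_name = c("A", "B", "C", "D", "E"),
  track_name  = c("s1", "s2", "s3", "s4", "s5"),
  genre       = c("Pop", "Pop", "Rock", "Pop", "Pop"),
  popularity  = c(85, 90, 88, 81, 95),
  cluster     = c(2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: suggest up to n tracks from the seed song's cluster
suggest_tracks <- function(data, artist, track, genre, n = 5) {
  is_seed <- data$artist_name == artist & data$track_name == track
  if (!any(is_seed)) stop("seed track not found")

  # Cluster of the seed song (first match if it is listed under several genres)
  seed_cluster <- data$cluster[is_seed][1]

  # Same cluster, chosen genre, and not the seed song itself
  pool <- data[data$cluster == seed_cluster & data$genre == genre & !is_seed, ]

  # Assumption: most popular first
  pool <- pool[order(-pool$popularity), ]
  head(pool, n)
}

picks <- suggest_tracks(tracks, "A", "s1", "Pop", n = 5)
print(picks[, c("artist_name", "track_name", "popularity")])
```

On the real data, the call would take spotify_filter, "Lady Gaga", "Shallow" and "Pop" as arguments.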

Summary

The clustering reaches a betweenss/totss ratio of about 90.66%, which means the clusters are well separated. This model can be used by users who want to automatically play songs based on their mood. The list has been filtered to songs with a popularity of 80 or higher, so the suggestions should still sound familiar. From the result, the model can show 5 songs from which you can choose the best one to play next.

Spotify Music Recommendation using Classification Machine Learning

Adam Hafid

11/23/2020