In this project, you will get a list of songs to play next based on your current mood. We start from a data set that describes the audio characteristics of each track. The raw data is analyzed with an unsupervised machine learning method (clustering), and the result is a list of tracks that match the taste of the track you listened to before.
Import the track characteristics data. The data comes from a public Spotify data set on Kaggle: https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db
The data type of each column in the Spotify data is shown below.
## Rows: 232,725
## Columns: 18
## $ genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie"…
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willi…
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par …
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", …
## $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0,…
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900…
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.4…
## $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293…
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.27…
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.…
## $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "…
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.10…
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, …
## $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major"…
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.95…
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, …
## $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/…
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.53…
## [1] FALSE
The check above returns FALSE, so the data is complete and has no missing values.
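The missing-value check can be sketched as follows. This is a minimal example on a made-up three-row table, assuming the `[1] FALSE` output above came from a call such as `anyNA()`; the object name `tracks` and its values are illustrative and not part of the original data.

```r
# Toy stand-in for the Spotify table (values are illustrative only)
tracks <- data.frame(
  track_name = c("A", "B", "C"),
  popularity = c(10L, 85L, 92L),
  energy     = c(0.91, 0.74, 0.13)
)

# TRUE if any cell in the data frame is NA, FALSE otherwise
has_missing <- anyNA(tracks)

# Per-column count of missing cells, useful when the check above is TRUE
missing_per_column <- colSums(is.na(tracks))

print(has_missing)
print(missing_per_column)
```

If `has_missing` were TRUE, the per-column counts would show which variables need attention before clustering.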
Most people find popular songs the easiest to listen to, so in this step we keep only the songs with a popularity score higher than 80.
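A minimal sketch of the popularity filter, written in base R on a toy table. The name `tracks` is hypothetical; with dplyr the same step would typically be a `filter(popularity > 80)` call whose result is stored in `spotify_filter`.

```r
# Toy stand-in for the full track table (values are illustrative only)
tracks <- data.frame(
  track_name = c("A", "B", "C", "D"),
  popularity = c(10L, 80L, 85L, 92L),
  stringsAsFactors = FALSE
)

# Keep only rows whose popularity is strictly greater than 80
popular <- tracks[tracks$popularity > 80, ]

print(popular)
```

Note that the comparison is strict, so a track with popularity exactly 80 is dropped.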
Only the variables that describe the composition of a song are kept as predictors; all other columns are removed.
spotify_pop <- spotify_filter %>%
  select(-c(artist_name, track_name, track_id, mode, time_signature, key, duration_ms, genre, popularity, tempo))
head(spotify_pop)
## acousticness danceability energy instrumentalness
## Min. :0.000147 Min. :0.2770 Min. :0.1470 Min. :0.0000000
## 1st Qu.:0.033500 1st Qu.:0.5940 1st Qu.:0.5200 1st Qu.:0.0000000
## Median :0.137000 Median :0.6800 Median :0.6360 Median :0.0000000
## Mean :0.209465 Mean :0.6766 Mean :0.6379 Mean :0.0037289
## 3rd Qu.:0.310000 3rd Qu.:0.7650 3rd Qu.:0.7680 3rd Qu.:0.0000102
## Max. :0.922000 Max. :0.9420 Max. :0.9530 Max. :0.4330000
## liveness loudness speechiness valence
## Min. :0.0344 Min. :-18.064 Min. :0.0232 Min. :0.0379
## 1st Qu.:0.0935 1st Qu.: -7.458 1st Qu.:0.0432 1st Qu.:0.2840
## Median :0.1200 Median : -5.893 Median :0.0703 Median :0.4460
## Mean :0.1677 Mean : -6.307 Mean :0.1100 Mean :0.4646
## 3rd Qu.:0.1940 3rd Qu.: -4.777 3rd Qu.:0.1450 3rd Qu.:0.6190
## Max. :0.7640 Max. : -2.188 Max. :0.5650 Max. :0.9280
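The summary and bar chart above show that the gap between the variable ranges is quite large: loudness runs from about -18 to -2, while most of the other features lie between 0 and 1. To deal with this, the data is scaled so that the variables have comparable variances.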
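Scaling can be sketched with base R's `scale()`, which centers each column and divides it by its standard deviation. The data below is randomly generated for illustration; the name `features` is an assumption, and the original presumably stored its scaled data in `spot_scale`.

```r
set.seed(1)

# Two toy features on very different ranges, similar to energy and loudness
features <- data.frame(
  energy   = runif(50, min = 0, max = 1),
  loudness = runif(50, min = -18, max = -2)
)

# Center each column to mean 0 and rescale it to standard deviation 1
scaled <- scale(features)

print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

After scaling, no single variable dominates the distance computations only because of its unit of measurement.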
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 745 individuals, described by 8 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
##
## Call:
## PCA(X = spot_scale, scale.unit = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 2.294 1.342 1.019 0.945 0.862 0.734 0.562
## % of var. 28.718 16.803 12.750 11.832 10.794 9.183 7.033
## Cumulative % of var. 28.718 45.520 58.271 70.103 80.897 90.079 97.112
## Dim.8
## Variance 0.231
## % of var. 2.888
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 2.332 | -0.763 0.034 0.107 | -0.852 0.073 0.134 |
## 2 | 2.556 | -1.954 0.223 0.584 | -1.513 0.229 0.350 |
## 3 | 1.446 | -0.722 0.031 0.250 | 0.374 0.014 0.067 |
## 4 | 2.693 | -1.081 0.068 0.161 | 1.650 0.272 0.376 |
## 5 | 0.937 | -0.001 0.000 0.000 | -0.123 0.002 0.017 |
## 6 | 2.718 | 0.276 0.004 0.010 | 0.744 0.055 0.075 |
## 7 | 3.546 | -0.662 0.026 0.035 | 0.454 0.021 0.016 |
## 8 | 1.501 | 0.371 0.008 0.061 | 0.179 0.003 0.014 |
## 9 | 1.482 | 1.233 0.089 0.692 | -0.200 0.004 0.018 |
## 10 | 1.737 | -0.732 0.031 0.178 | -0.276 0.008 0.025 |
## Dim.3 ctr cos2
## 1 -1.669 0.367 0.512 |
## 2 -0.383 0.019 0.022 |
## 3 0.449 0.027 0.096 |
## 4 -0.938 0.116 0.121 |
## 5 0.531 0.037 0.321 |
## 6 1.606 0.340 0.349 |
## 7 -1.704 0.382 0.231 |
## 8 -0.684 0.062 0.208 |
## 9 0.262 0.009 0.031 |
## 10 0.311 0.013 0.032 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## acousticness | -0.692 20.879 0.480 | -0.211 3.315 0.045 | 0.074 0.541
## danceability | 0.166 1.199 0.028 | 0.793 46.850 0.630 | 0.123 1.496
## energy | 0.861 32.296 0.742 | -0.212 3.348 0.045 | 0.160 2.524
## instrumentalness | -0.303 3.997 0.092 | -0.071 0.378 0.005 | 0.478 22.454
## liveness | 0.160 1.117 0.026 | -0.102 0.769 0.010 | -0.834 68.363
## loudness | 0.779 26.450 0.608 | -0.203 3.074 0.041 | 0.121 1.428
## speechiness | 0.001 0.000 0.000 | 0.746 41.506 0.558 | -0.118 1.372
## valence | 0.568 14.063 0.323 | 0.101 0.762 0.010 | 0.136 1.823
## cos2
## acousticness 0.006 |
## danceability 0.015 |
## energy 0.026 |
## instrumentalness 0.229 |
## liveness 0.697 |
## loudness 0.015 |
## speechiness 0.014 |
## valence 0.019 |
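To show how the eigenvalue table can be read, here is a small sketch that uses base R's `prcomp()` as a stand-in for the `PCA()` call in the output above (it is not the same function, and the toy matrix is random). The 80% threshold is an assumed cut-off for illustration; applied to the table above, it would keep the first five dimensions, which reach 80.897% of the variance.

```r
set.seed(1)

# Toy data: four variables, two of them correlated
x <- matrix(rnorm(200 * 4), ncol = 4)
x[, 2] <- x[, 1] + 0.3 * x[, 2]
colnames(x) <- c("v1", "v2", "v3", "v4")

# Scale first, then run PCA on the already scaled data
x_scaled <- scale(x)
pca <- prcomp(x_scaled, center = FALSE, scale. = FALSE)

# Eigenvalues are the squared standard deviations of the components
eig  <- pca$sdev^2
prop <- eig / sum(eig)   # share of variance per component
cum  <- cumsum(prop)     # cumulative share of variance

# Smallest number of components reaching the assumed 80% threshold
n_keep <- which(cum >= 0.80)[1]

print(round(rbind(variance = eig, proportion = prop, cumulative = cum), 3))
print(n_keep)
```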
After scaling the data and running PCA, the next step is K-means clustering. First we need the optimum number of clusters for our data: use the kmeansTunning() helper function to find the optimum K with the elbow method.
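The elbow method can be sketched as below. This is not the author's `kmeansTunning()` function, whose definition is not shown here; `elbow_wss()` is a hypothetical helper that runs `kmeans()` for K = 1 up to `max_k` and records the total within-cluster sum of squares. The toy data has three well-separated groups.

```r
# Hypothetical helper: total within-cluster sum of squares for K = 1..max_k
elbow_wss <- function(data, max_k = 10, seed = 101) {
  sapply(seq_len(max_k), function(k) {
    set.seed(seed)  # same seed for every K so runs are reproducible
    kmeans(data, centers = k, nstart = 10)$tot.withinss
  })
}

set.seed(1)
# Toy data: three compact groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

wss <- elbow_wss(toy, max_k = 6)
print(data.frame(k = seq_along(wss), wss = round(wss, 1)))
```

Plotting `wss` against `k`, for example with `plot(seq_along(wss), wss, type = "b")`, gives the elbow curve. The optimum K is read off where the curve stops dropping sharply, which is K = 3 for this toy data.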
Based on the plot above, the optimum K is 6.
In this step, the chosen K is used in the clustering process, and a new cluster column is created to label each observation.
# set.seed to ensure reproducible example
set.seed(101)
# use kmeans (centers=clusters, which is 6)
spot_cluster <- kmeans(spotify_pop, centers = 6)
# show how many observations on each cluster
data.frame(cbind(cluster=c(1:6), observation=spot_cluster$size))
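The new cluster column is not created in the code above. One way to add it is sketched here on toy data; the names `tracks` and `fit` are illustrative, and `nstart = 5` is an added assumption to make the toy result stable. In the document's own objects, the equivalent step would presumably assign `spot_cluster$cluster` to a `cluster` column of `spotify_filter`.

```r
set.seed(101)

# Toy table: three calm tracks followed by three energetic tracks
tracks <- data.frame(
  track_name = paste("track", 1:6),
  energy     = c(0.10, 0.15, 0.12, 0.90, 0.88, 0.93),
  valence    = c(0.20, 0.25, 0.22, 0.80, 0.85, 0.79),
  stringsAsFactors = FALSE
)

# Cluster on the numeric features only
fit <- kmeans(tracks[, c("energy", "valence")], centers = 2, nstart = 5)

# kmeans() returns one label per input row, in row order,
# so the labels can be attached directly as a new column
tracks$cluster <- fit$cluster

print(tracks)
print(table(tracks$cluster))
```

This works because the rows used for clustering are in the same order as the rows of the table that receives the labels.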
We can check the quality of the clustering with three values. Within sum of squares (tot.withinss): the sum of squared distances from each observation to the centroid of its cluster.
## [1] 351.3024
Total sum of squares (totss): the sum of squared distances from each observation to the global sample mean.
## [1] 3760.574
Between sum of squares (betweenss): the sum of squared distances from each cluster centroid to the global sample mean, weighted by the number of observations in the cluster.
## [1] 3409.271
Another 'goodness' measure is the ratio betweenss/totss (the closer the value is to 1, or 100%, the better):
## [1] 90.65828
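The three values are linked by the identity totss = tot.withinss + betweenss, which the reported numbers satisfy (351.3024 + 3409.271 is approximately 3760.574). The sketch below checks the identity on a toy fit and recomputes the ratio from the rounded figures printed above; the toy data and the names `fit`, `gap`, and `reported_ratio` are illustrative.

```r
set.seed(101)

# Toy data: two well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 5)

# Identity: total SS = within SS + between SS (gap should be ~0)
gap <- fit$totss - (fit$tot.withinss + fit$betweenss)

# Share of total variation explained by the cluster separation, in percent
ratio_pct <- fit$betweenss / fit$totss * 100

# Same ratio recomputed from the rounded values printed in the text
reported_ratio <- 3409.271 / 3760.574 * 100

print(gap)
print(ratio_pct)
print(reported_ratio)
```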
The ratio betweenss/totss is above 90%, so most of the variation in the data is explained by the separation between clusters, which is a good result. Songs in the same cluster should therefore suit the same mood.
So, this is the result of grouping the tracks with the clustering method. As you can see, the mean values per cluster show the general composition of the music in each cluster.
# Select the numeric features first so that mean() is not applied to text columns
spotify_filter %>%
  select(cluster, acousticness, danceability, energy, instrumentalness, speechiness, valence) %>%
  group_by(cluster) %>%
  summarise_all(mean)
Case: you are listening to 'Shallow' by Lady Gaga and you don't yet know which song to play next. This app shows you the next songs with a similar taste and composition.
# Find out the cluster of favorite song
spotify_filter %>%
filter(artist_name == "Lady Gaga", track_name == "Shallow")
The result for Shallow by Lady Gaga shows that the song is listed under two genres, and both entries fall in the same cluster, cluster 2. Because the song has two genres, you have more options when choosing which genre you want to enjoy.
Let's say you choose the genre Pop. Which songs will be suggested next? You will see a top 5 list of songs that you can enjoy.
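The recommendation step can be sketched as a small function: look up the cluster of the current song, keep the other songs in that cluster and in the chosen genre, and return the most popular ones. Everything in this example is hypothetical, including the function name `recommend_next()`, the placeholder artists and songs, the second genre label, and the popularity values; only the column names follow the data set.

```r
# Toy table with made-up rows (only the column names match the data set)
tracks <- data.frame(
  artist_name = c("Lady Gaga", "Lady Gaga", "Artist A", "Artist B", "Artist C",
                  "Artist D", "Artist E", "Artist F", "Artist G", "Artist H"),
  track_name  = c("Shallow", "Shallow", "Song 1", "Song 2", "Song 3",
                  "Song 4", "Song 5", "Song 6", "Song 7", "Song 8"),
  genre       = c("Pop", "Other", "Pop", "Pop", "Pop",
                  "Other", "Pop", "Pop", "Pop", "Pop"),
  popularity  = c(90, 88, 95, 81, 99, 97, 85, 83, 91, 82),
  cluster     = c(2, 2, 2, 2, 1, 2, 2, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top-n songs from the same cluster and genre
recommend_next <- function(data, artist, track, genre, n = 5) {
  is_seed <- data$artist_name == artist & data$track_name == track
  if (!any(is_seed)) stop("seed track not found")
  seed_cluster <- data$cluster[is_seed][1]

  # Same cluster, chosen genre, and not the seed song itself
  candidates <- data[data$cluster == seed_cluster &
                       data$genre == genre & !is_seed, ]

  # Most popular first, then keep the first n rows
  candidates <- candidates[order(-candidates$popularity), ]
  head(candidates, n)
}

next_up <- recommend_next(tracks, "Lady Gaga", "Shallow", genre = "Pop")
print(next_up[, c("artist_name", "track_name", "popularity")])
```

Ranking by popularity is one possible design choice; ranking by distance to the seed song in feature space would be an alternative when closer matches matter more than familiarity.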
In terms of clustering quality, the ratio of between-cluster to total sum of squares is about 90.66%, which is a great result. This model can be used by users who want songs to play automatically based on their mood. The list has been filtered to songs with a popularity higher than 80, so the suggestions should still sound familiar. The result of this model is shown as a top 5 list from which you can choose the best song to play next.