Main Objective
In this report, we role-play as a data scientist who wants to cluster popular songs. Our main objective is to build playlists of songs with similar characteristics for our users.
This is a Learning By Building project for the Unsupervised Learning (UL) course. The data used for this analysis is Spotify Tracks, which is available on Kaggle. The data contains several columns describing track/song profiles, and we will use them to build clusters. This analysis is separated into 6 parts:
Let’s get started!
In this report, we will use the following libraries:
# import libraries
library(tidyverse)
library(glue)
library(plotly)
library(FactoMineR)
library(factoextra)
In this part, we’re going to read our data and prepare it as “clean data”.
First, let’s read our data and assign it to spotify.
spotify <- read.csv("data_input/SpotifyFeatures.csv")
spotify
The following is a description of each column. This part follows the Spotify for Developers website; please refer to it for more detailed information.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
duration_ms: The duration of the track in milliseconds.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
liveness: Detects the presence of an audience in the recording.
instrumentalness: Predicts whether a track contains no vocals.
loudness: The overall loudness of a track in decibels (dB).
speechiness: Speechiness detects the presence of spoken words in a track.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
time_signature: An estimated time signature.
tempo: The overall estimated tempo of a track in beats per minute (BPM).
key: The key the track is in.
Check for any mismatched data types in our data.
# check data types for each column
str(spotify)
#> 'data.frame': 232725 obs. of 18 variables:
#> $ ï..genre : chr "Movie" "Movie" "Movie" "Movie" ...
#> $ artist_name : chr "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#> $ track_name : chr "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#> $ track_id : chr "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
#> $ popularity : int 0 1 3 0 4 0 2 15 0 10 ...
#> $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#> $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#> $ duration_ms : int 99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#> $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#> $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#> $ key : chr "C#" "F#" "C" "C#" ...
#> $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#> $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
#> $ mode : chr "Major" "Minor" "Minor" "Major" ...
#> $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#> $ tempo : num 167 174 99.5 171.8 140.6 ...
#> $ time_signature : chr "4/4" "4/4" "5/4" "4/4" ...
#> $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
Several columns have mismatched data types. We will first rename ï..genre to genre to make the analysis easier, then convert genre, key, and mode into factors.
# rename genre column name
spotify <- spotify %>%
  rename(genre = ï..genre)
# changing data type
spotify_clean <- spotify %>%
  mutate_at(
    vars(
      genre,
      key,
      mode
    ),
    as.factor
  )
str(spotify_clean)
#> 'data.frame': 232725 obs. of 18 variables:
#> $ genre : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...
#> $ artist_name : chr "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#> $ track_name : chr "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#> $ track_id : chr "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
#> $ popularity : int 0 1 3 0 4 0 2 15 0 10 ...
#> $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#> $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#> $ duration_ms : int 99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#> $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#> $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#> $ key : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
#> $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#> $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
#> $ mode : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
#> $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#> $ tempo : num 167 174 99.5 171.8 140.6 ...
#> $ time_signature : chr "4/4" "4/4" "5/4" "4/4" ...
#> $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
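Each column we converted is now in its proper type. Note that time_signature is still a character column; it is dropped later together with the other character columns.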
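The ï.. prefix on the first column name usually appears when a CSV file starts with a UTF-8 byte-order mark (BOM) that is read under a different encoding. As an alternative to renaming the column afterwards, the file can be read with fileEncoding = "UTF-8-BOM". The sketch below is a minimal illustration on a small temporary file; the file contents are made up for the example, and whether SpotifyFeatures.csv really carries a BOM is an assumption.

```r
# Build a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# The two rows below are illustrative only, not taken from the Spotify data.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("genre,popularity\nPop,80\nRap,75\n")), tmp)

# Declaring the BOM encoding lets R strip it from the first column name
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

If this applies to the real file, the rename(genre = ï..genre) step would no longer be needed.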
# check missing value
colSums(is.na(spotify_clean))
#> genre artist_name track_name track_id
#> 0 0 0 0
#> popularity acousticness danceability duration_ms
#> 0 0 0 0
#> energy instrumentalness key liveness
#> 0 0 0 0
#> loudness mode speechiness tempo
#> 0 0 0 0
#> time_signature valence
#> 0 0
There are no missing values in our data.
Now that we have clean data, we’re going to explore it. We can observe the data in general by using summary.
# data summary
summary(spotify_clean)
#> genre artist_name track_name track_id
#> Comedy : 9681 Length:232725 Length:232725 Length:232725
#> Soundtrack: 9646 Class :character Class :character Class :character
#> Indie : 9543 Mode :character Mode :character Mode :character
#> Jazz : 9441
#> Pop : 9386
#> Electronic: 9377
#> (Other) :175651
#> popularity acousticness danceability duration_ms
#> Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. : 15387
#> 1st Qu.: 29.00 1st Qu.:0.0376 1st Qu.:0.4350 1st Qu.: 182857
#> Median : 43.00 Median :0.2320 Median :0.5710 Median : 220427
#> Mean : 41.13 Mean :0.3686 Mean :0.5544 Mean : 235122
#> 3rd Qu.: 55.00 3rd Qu.:0.7220 3rd Qu.:0.6920 3rd Qu.: 265768
#> Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :5552917
#>
#> energy instrumentalness key liveness
#> Min. :0.0000203 Min. :0.0000000 C :27583 Min. :0.00967
#> 1st Qu.:0.3850000 1st Qu.:0.0000000 G :26390 1st Qu.:0.09740
#> Median :0.6050000 Median :0.0000443 D :24077 Median :0.12800
#> Mean :0.5709577 Mean :0.1483012 C# :23201 Mean :0.21501
#> 3rd Qu.:0.7870000 3rd Qu.:0.0358000 A :22671 3rd Qu.:0.26400
#> Max. :0.9990000 Max. :0.9990000 F :20279 Max. :1.00000
#> (Other):88524
#> loudness mode speechiness tempo
#> Min. :-52.457 Major:151744 Min. :0.0222 Min. : 30.38
#> 1st Qu.:-11.771 Minor: 80981 1st Qu.:0.0367 1st Qu.: 92.96
#> Median : -7.762 Median :0.0501 Median :115.78
#> Mean : -9.570 Mean :0.1208 Mean :117.67
#> 3rd Qu.: -5.501 3rd Qu.:0.1050 3rd Qu.:139.05
#> Max. : 3.744 Max. :0.9670 Max. :242.90
#>
#> time_signature valence
#> Length:232725 Min. :0.0000
#> Class :character 1st Qu.:0.2370
#> Mode :character Median :0.4440
#> Mean :0.4549
#> 3rd Qu.:0.6600
#> Max. :1.0000
#>
From the summary above, we know that:
We have a popularity column in our data, and we want to focus our analysis on tracks with a popularity of 70 and above.
spotify_top70 <-
  spotify_clean %>%
  filter(popularity >= 70)
spotify_top70
There are 9001 tracks with a popularity of 70 and above. Let’s analyze which genres are the most popular.
# genre count in popular list
popular_genre <-
  spotify_top70 %>%
  group_by(genre) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
popular_genre
Insight:
Let’s observe the top 3 genres: Pop, Rap, and Dance.
observation <-
  spotify_top70 %>%
  filter(genre %in% c("Pop", "Rap", "Dance"))
observation
Our data has 5128 rows and 18 columns. We want to reduce its dimension while losing as little information as possible. We can use Principal Component Analysis (PCA) for this, and we want to keep 85% of the information (variance) in our data.
# names and positions of the character columns
character <- observation %>%
  select_if(is.character) %>%
  colnames()
character_var <- which(colnames(observation) %in% character)
# observation w/o character columns
# all_of() makes it explicit that character_var is an external vector of positions
observation_nochr <- observation %>%
  select(-all_of(character_var))
# numeric data
observation_numeric <-
  observation_nochr %>%
  select_if(is.numeric)
# select numeric column
quanti <- observation_nochr %>%
  select_if(is.numeric) %>%
  colnames()
quantivar <- which(colnames(observation_nochr) %in% quanti)
# select categorical column
quali <- observation_nochr %>%
  select_if(is.factor) %>%
  colnames()
qualivar <- which(colnames(observation_nochr) %in% quali)
# PCA with FactoMineR
spotify_pca <- PCA(X = observation_nochr,
                   scale.unit = T,
                   quali.sup = qualivar,
                   graph = F)
summary(spotify_pca)
#>
#> Call:
#> PCA(X = observation_nochr, scale.unit = T, quali.sup = qualivar,
#> graph = F)
#>
#>
#> Eigenvalues
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
#> Variance 2.324 1.387 1.184 1.040 1.023 0.955 0.891
#> % of var. 21.124 12.611 10.763 9.454 9.297 8.678 8.099
#> Cumulative % of var. 21.124 33.735 44.498 53.952 63.249 71.927 80.026
#> Dim.8 Dim.9 Dim.10 Dim.11
#> Variance 0.816 0.656 0.502 0.223
#> % of var. 7.420 5.965 4.564 2.025
#> Cumulative % of var. 87.446 93.411 97.975 100.000
#>
#> Individuals (the 10 first)
#> Dist Dim.1 ctr cos2 Dim.2 ctr cos2
#> 1 | 5.283 | 0.101 0.000 0.000 | 1.597 0.036 0.091 |
#> 2 | 6.596 | -3.447 0.100 0.273 | 3.270 0.150 0.246 |
#> 3 | 4.648 | -0.843 0.006 0.033 | 1.968 0.054 0.179 |
#> 4 | 5.294 | -3.250 0.089 0.377 | 1.231 0.021 0.054 |
#> 5 | 4.390 | -1.383 0.016 0.099 | 2.407 0.081 0.301 |
#> 6 | 4.117 | -0.177 0.000 0.002 | 1.438 0.029 0.122 |
#> 7 | 3.570 | 0.977 0.008 0.075 | 1.150 0.019 0.104 |
#> 8 | 3.930 | 0.431 0.002 0.012 | 1.154 0.019 0.086 |
#> 9 | 4.475 | -1.458 0.018 0.106 | 1.380 0.027 0.095 |
#> 10 | 4.522 | -0.925 0.007 0.042 | 0.477 0.003 0.011 |
#> Dim.3 ctr cos2
#> 1 -0.410 0.003 0.006 |
#> 2 -0.913 0.014 0.019 |
#> 3 -1.227 0.025 0.070 |
#> 4 -1.997 0.066 0.142 |
#> 5 -0.340 0.002 0.006 |
#> 6 -1.740 0.050 0.179 |
#> 7 -0.323 0.002 0.008 |
#> 8 -0.724 0.009 0.034 |
#> 9 -0.394 0.003 0.008 |
#> 10 0.520 0.004 0.013 |
#>
#> Variables (the 10 first)
#> Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
#> popularity | -0.002 0.000 0.000 | 0.344 8.531 0.118 | -0.240 4.852
#> acousticness | -0.680 19.894 0.462 | 0.003 0.001 0.000 | -0.147 1.826
#> danceability | 0.148 0.946 0.022 | 0.787 44.608 0.619 | 0.049 0.200
#> duration_ms | -0.070 0.212 0.005 | -0.509 18.677 0.259 | 0.104 0.914
#> energy | 0.891 34.149 0.793 | -0.198 2.824 0.039 | -0.036 0.108
#> instrumentalness | -0.106 0.485 0.011 | -0.204 3.011 0.042 | -0.037 0.113
#> liveness | 0.205 1.802 0.042 | -0.229 3.795 0.053 | 0.375 11.895
#> loudness | 0.827 29.402 0.683 | -0.094 0.631 0.009 | -0.145 1.770
#> speechiness | -0.050 0.107 0.002 | 0.363 9.496 0.132 | 0.761 48.932
#> tempo | 0.123 0.647 0.015 | -0.146 1.546 0.021 | 0.578 28.245
#> cos2
#> popularity 0.057 |
#> acousticness 0.022 |
#> danceability 0.002 |
#> duration_ms 0.011 |
#> energy 0.001 |
#> instrumentalness 0.001 |
#> liveness 0.141 |
#> loudness 0.021 |
#> speechiness 0.579 |
#> tempo 0.334 |
#>
#> Supplementary categories (the 10 first)
#> Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
#> Dance | 0.501 | 0.351 0.490 8.028 | -0.132 0.069 -3.902 |
#> Pop | 0.192 | -0.072 0.141 -3.741 | -0.109 0.323 -7.325 |
#> Rap | 0.628 | -0.106 0.029 -2.960 | 0.329 0.274 11.818 |
#> A | 0.234 | 0.056 0.056 0.713 | 0.060 0.065 0.995 |
#> A# | 0.196 | -0.120 0.374 -1.566 | 0.089 0.206 1.502 |
#> B | 0.276 | 0.161 0.342 2.392 | 0.125 0.204 2.394 |
#> C | 0.324 | -0.166 0.262 -2.737 | -0.203 0.393 -4.338 |
#> C# | 0.370 | 0.058 0.025 1.160 | 0.271 0.537 6.976 |
#> D | 0.301 | 0.004 0.000 0.058 | -0.216 0.517 -3.935 |
#> D# | 0.560 | -0.100 0.032 -0.726 | -0.221 0.156 -2.081 |
#> Dim.3 cos2 v.test
#> Dance -0.312 0.387 -9.988 |
#> Pop -0.103 0.289 -7.497 |
#> Rap 0.449 0.511 17.478 |
#> A 0.059 0.064 1.065 |
#> A# 0.004 0.000 0.079 |
#> B 0.097 0.124 2.020 |
#> C -0.162 0.251 -3.753 |
#> C# 0.164 0.197 4.578 |
#> D 0.118 0.154 2.324 |
#> D# -0.424 0.574 -4.322 |
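Since we want to keep 85% of the information, the number of components to retain can be read off the eigenvalue table. The sketch below re-enters by hand the eigenvalues printed in the summary above (rounded to three decimals) and finds the first dimension at which the cumulative share of variance reaches 85%. With the fitted FactoMineR object, the same table should be available in spotify_pca$eig.

```r
# Eigenvalues copied from the PCA summary above (Dim.1 to Dim.11)
eigenvalues <- c(2.324, 1.387, 1.184, 1.040, 1.023, 0.955,
                 0.891, 0.816, 0.656, 0.502, 0.223)

# Cumulative percentage of variance explained
cum_var <- cumsum(eigenvalues) / sum(eigenvalues) * 100

# First dimension whose cumulative variance reaches the 85% target
n_dim <- which(cum_var >= 85)[1]
n_dim
```

With these values, seven dimensions explain about 80% of the variance and eight explain about 87%, so eight dimensions are needed to meet the 85% target.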
plot.PCA(x = spotify_pca,
         choix = "ind",
         invisible = "quali",
         select = "contrib 5",
         habillage = 1)
Based on the plot, observations 3703, 367, 4695, 2881, and 4332 are outliers. We can remove these outliers before clustering.
fviz_contrib(X = spotify_pca, choice = "var", axes = 1)
energy, loudness, acousticness, and valence have a big influence on PC1.
fviz_contrib(X = spotify_pca, choice = "var", axes = 2)
danceability, duration, and speechiness have a big influence on PC2.
To answer our main objective, we’ll cluster the songs.
# take out outlier
outlier <- c(3703, 367, 4695, 2881, 4332)
observation <- observation[-outlier,]
observation_numeric_clean <- observation_numeric[-outlier,]
# scale data
observation_numeric_scale <-
  observation_numeric_clean %>%
  scale() %>%
  as.data.frame()
head(observation_numeric_scale)
RNGkind(sample.kind = "Rounding")
set.seed(100)
# k-optimum
fviz_nbclust(x = observation_numeric_scale,
             FUNcluster = kmeans,
             method = "wss",
             print.summary = T)
The most significant reduction in total WSS occurs from 1 to 2 clusters. However, we want more options (profiles) for our tracks, so we will use 3 clusters.
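The elbow plot is built from the total within-cluster sum of squares for each candidate k. A minimal sketch of that computation is shown below. It uses the built-in USArrests data as a stand-in for observation_numeric_scale, and the range of k and nstart = 10 are illustrative choices rather than the settings used in this report.

```r
set.seed(100)

# Stand-in data: a scaled built-in dataset (not the Spotify data)
x <- scale(USArrests)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where adding clusters stops reducing wss noticeably
data.frame(k = k_values, wss = round(wss, 2))
```

Replacing x with observation_numeric_scale would give the values behind the plot above, up to the randomness of the k-means starts.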
RNGkind(sample.kind = "Rounding")
set.seed(150)
# k-means clustering with 3 clusters
spotify_clustering <- kmeans(x = observation_numeric_scale,
                             centers = 3)
# visualization
fviz_cluster(object = spotify_clustering,
             data = observation_numeric_scale) +
  theme_minimal()
# attach cluster labels to the original data
observation$cluster <- spotify_clustering$cluster
observation_numeric_clean$cluster <- spotify_clustering$cluster
agg_observation <- observation_numeric_clean %>%
  mutate(cluster = as.factor(cluster)) %>%
  group_by(cluster) %>%
  summarise_all(mean)
agg_observation
Cluster Profiling:
We can use this cluster profiling as a guideline to create our playlist. For example, suppose I want to make a group of cheerful pop songs.
observation %>%
  filter(cluster == "2") %>%
  filter(genre == "Pop")
And now you have 1443 song recommendations to create your own playlist.
To evaluate our model, we will use the within-cluster sum of squares (withinss) and the ratio of the between-cluster sum of squares to the total sum of squares (betweenss / totss). A good clustering has low withinss and a betweenss-to-totss ratio close to 1.
spotify_clustering$withinss
#> [1] 12895.34 21434.82 10859.95
spotify_clustering$betweenss / spotify_clustering$totss
#> [1] 0.1979321
Our withinss is high, and the ratio is very far from 1. There is still a lot to improve in our clustering model.
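The reported ratio can be cross-checked by hand, because totss = tot.withinss + betweenss. The sketch below assumes the scaled data has 5123 rows (the 5128 tracks minus the 5 removed outliers) and 11 numeric columns. For data standardized with scale(), each column contributes n - 1 to the total sum of squares.

```r
# Within-cluster sums of squares reported above
withinss <- c(12895.34, 21434.82, 10859.95)
tot_withinss <- sum(withinss)

# Assumed dimensions of the scaled data: 5128 - 5 rows, 11 numeric columns
n <- 5128 - 5
p <- 11
totss <- (n - 1) * p  # each standardized column has sum of squares n - 1

# betweenss is what remains of totss after removing tot_withinss
ratio <- (totss - tot_withinss) / totss
round(ratio, 4)
```

Under these assumptions the ratio comes out at about 0.198, in line with the value printed above: the three clusters account for only about a fifth of the total variation.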
To sum up our analysis, we return to our main objective. We have clustered popular tracks based on their sound characteristics, and we can make a list of song recommendations based on our clustering model.
According to the visualization, the separation between clusters is not very good (the clusters still overlap), and the quantitative check (withinss and the betweenss-to-totss ratio) shows that our clustering is far from good.
Here are recommendations that may improve our model: