Spotify is a digital music service available on Android, iOS, Windows, and other operating systems. Spotify provides access to songs from artists all over the world. Because so many songs are available, it can be hard to choose the ones that suit our tastes.
Based on this, I will try to cluster the songs with an unsupervised learning method so that readers can more easily choose songs that match the criteria they want.
Load the required libraries.
library(tidyverse)
library(factoextra)
library(FactoMineR)
library(animation)
library(lubridate)
library(ggiraphExtra)
Load the data for the unsupervised learning model.
music <- read.csv(file = "SpotifyFeatures.csv", stringsAsFactors = F)
# The first column is read as "ï..genre" because the CSV file starts with a byte-order mark,
# so we copy it to a clean column name and drop the original
music$genre <- music$ï..genre
music <- music %>%
  select(-ï..genre) %>%
  mutate(genre = as.factor(genre))
After importing the data, we inspect its contents. We could use the view() function to look at the data, but browsing the whole dataset takes time, so we use head() to see only the first rows.
Descriptions:
artist_name: The artist’s name.
track_name: The track’s name.
track_id: The Spotify ID for the track.
popularity: The track’s popularity rating on Spotify.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
duration_ms: The duration of the track in milliseconds.
energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
key: The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. Range: -1 to 11. In this dataset the key is stored as a note name such as “C#”.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode: Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0. In this dataset it is stored as “Major” or “Minor”.
speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of “3/4” to “7/4”. In this dataset it is stored as text such as “4/4”.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). Range: 0 to 1.
genre: The track’s genre.
Check the structure of the data. Are there any columns whose data types do not match?
str(music)
#> 'data.frame': 232725 obs. of 18 variables:
#> $ artist_name : chr "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#> $ track_name : chr "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#> $ track_id : chr "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
#> $ popularity : int 0 1 3 0 4 0 2 15 0 10 ...
#> $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#> $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#> $ duration_ms : int 99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#> $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#> $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#> $ key : chr "C#" "F#" "C" "C#" ...
#> $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#> $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
#> $ mode : chr "Major" "Minor" "Minor" "Major" ...
#> $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#> $ tempo : num 167 174 99.5 171.8 140.6 ...
#> $ time_signature : chr "4/4" "4/4" "5/4" "4/4" ...
#> $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
#> $ genre : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...
Columns whose data types do not match:
artist_name -> as factor
track_name -> as factor
key -> as factor
mode -> as factor
time_signature -> as factor
Are there any variables that can be removed?
track_id -> we do not really need this variable (the track’s ID) because the track is already represented by its name.
music_clean <- music %>%
  mutate(artist_name = as.factor(artist_name),
         track_name = as.factor(track_name),
         key = as.factor(key),
         mode = as.factor(mode),
         time_signature = as.factor(time_signature)) %>%
  select(-track_id)
glimpse(music_clean)
#> Rows: 232,725
#> Columns: 17
#> $ artist_name <fct> "Henri Salvador", "Martin & les fées", "Joseph Willi~
#> $ track_name <fct> "C'est beau de faire un Show", "Perdu d'avance (par G~
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123~
#> $ key <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G, ~
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
#> $ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo~
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
#> $ time_signature <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4~
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~
#> $ genre <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi~
We have converted the columns to the appropriate data types and removed the unnecessary column.
Next, we check for missing values in our dataset.
colSums(is.na(music_clean))
#> artist_name track_name popularity acousticness
#> 0 0 0 0
#> danceability duration_ms energy instrumentalness
#> 0 0 0 0
#> key liveness loudness mode
#> 0 0 0 0
#> speechiness tempo time_signature valence
#> 0 0 0 0
#> genre
#> 0
The check shows that there are no missing (NA) values in the data.
Before modeling, we subset the data by popularity level, keeping only tracks with a popularity value of at least 75.
popular_music <- music_clean %>% filter(popularity >= 75)
str(popular_music)
#> 'data.frame': 3593 obs. of 17 variables:
#> $ artist_name : Factor w/ 14564 levels "'Til Tuesday",..: 5953 6476 6476 6476 11641 4906 10544 5008 12071 12945 ...
#> $ track_name : Factor w/ 148615 levels "' Cello Song",..: 132378 107362 112957 146257 7961 42036 136590 133047 47566 120168 ...
#> $ popularity : int 76 83 81 76 78 77 78 76 75 75 ...
#> $ acousticness : num 0.0233 0.422 0.544 0.619 0.0319 0.00836 0.0576 0.00847 0.443 0.0495 ...
#> $ danceability : num 0.845 0.552 0.515 0.672 0.731 0.818 0.559 0.56 0.656 0.612 ...
#> $ duration_ms : int 187521 180019 209274 174358 200373 222640 264307 218013 222374 240400 ...
#> $ energy : num 0.709 0.65 0.479 0.588 0.861 0.705 0.345 0.936 0.432 0.807 ...
#> $ instrumentalness: num 0 0.000275 0.00598 0.241 0 0.00233 0.000105 0 0 0.0177 ...
#> $ key : Factor w/ 12 levels "A","A#","B","C",..: 2 5 7 5 3 10 8 7 10 2 ...
#> $ liveness : num 0.094 0.372 0.191 0.0992 0.0829 0.613 0.141 0.161 0.132 0.101 ...
#> $ loudness : num -4.55 -7.2 -7.46 -9.57 -5.88 ...
#> $ mode : Factor w/ 2 levels "Major","Minor": 2 1 1 1 1 1 1 1 2 1 ...
#> $ speechiness : num 0.0714 0.128 0.0261 0.133 0.0323 0.177 0.0459 0.0439 0.217 0.0336 ...
#> $ tempo : num 98.1 167.8 89 169 104 ...
#> $ time_signature : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 4 4 4 4 4 4 4 4 ...
#> $ valence : num 0.62 0.316 0.284 0.204 0.78 0.772 0.458 0.371 0.0897 0.398 ...
#> $ genre : Factor w/ 27 levels "A Capella","Alternative",..: 19 2 2 2 2 2 2 2 2 2 ...
We only need the numeric (int/num) columns because PCA works on variances. Now that the data types have been corrected, we can select only the numeric columns.
music_num <- popular_music %>%
  select_if(is.numeric)
head(music_num)
After removing the non-numeric columns, we can proceed to the Principal Component Analysis (PCA) stage.
cov(music_num)
#> popularity acousticness danceability
#> popularity 18.661155817 0.0102555657 0.0703245592
#> acousticness 0.010255566 0.0486669591 -0.0029457106
#> danceability 0.070324559 -0.0029457106 0.0194399385
#> duration_ms -21990.426755470 -456.4368769197 -1320.1215394549
#> energy -0.003681580 -0.0190052942 -0.0026170937
#> instrumentalness -0.011014591 0.0006582834 -0.0003472753
#> liveness 0.002062613 -0.0015510822 -0.0004202905
#> loudness 0.710601166 -0.2130338207 0.0134905267
#> speechiness 0.023145655 -0.0015862877 0.0041648203
#> tempo -1.135436994 -0.2319680495 -0.3496705530
#> valence -0.018426741 -0.0054057076 0.0059469199
#> duration_ms energy instrumentalness liveness
#> popularity -21990.4268 -0.0036815799 -0.0110145910 0.0020626130
#> acousticness -456.4369 -0.0190052942 0.0006582834 -0.0015510822
#> danceability -1320.1215 -0.0026170937 -0.0003472753 -0.0004202905
#> duration_ms 1997156721.4270 530.3447259975 164.5176016874 124.1994234542
#> energy 530.3447 0.0286155400 -0.0003566577 0.0024352975
#> instrumentalness 164.5176 -0.0003566577 0.0022049200 0.0003036391
#> liveness 124.1994 0.0024352975 0.0003036391 0.0145596145
#> loudness -4468.8300 0.2963591703 -0.0119807231 0.0188037160
#> speechiness -455.4490 -0.0015152587 -0.0003637197 0.0007722376
#> tempo -17144.3703 0.2153668199 0.0273663386 0.1508496362
#> valence -861.0436 0.0151149328 -0.0009154053 0.0022851267
#> loudness speechiness tempo valence
#> popularity 0.71060117 0.0231456554 -1.13543699 -0.0184267412
#> acousticness -0.21303382 -0.0015862877 -0.23196805 -0.0054057076
#> danceability 0.01349053 0.0041648203 -0.34967055 0.0059469199
#> duration_ms -4468.83003999 -455.4489515630 -17144.37033558 -861.0436485855
#> energy 0.29635917 -0.0015152587 0.21536682 0.0151149328
#> instrumentalness -0.01198072 -0.0003637197 0.02736634 -0.0009154053
#> liveness 0.01880372 0.0007722376 0.15084964 0.0022851267
#> loudness 5.88518536 -0.0143229921 1.40317680 0.1289770325
#> speechiness -0.01432299 0.0102833668 0.34144781 0.0006709665
#> tempo 1.40317680 0.3414478066 833.06800793 -0.1080728628
#> valence 0.12897703 0.0006709665 -0.10807286 0.0485241808
plot(prcomp(music_num))
After checking the covariance values and the variance plot, we can see that the variables are on very different scales and that duration_ms has a far higher variance than the other variables.
Data whose variables differ greatly in scale are not good for PCA or clustering analysis because the result is biased: the variable with the largest variance is treated as capturing almost all of the information, while the other variables appear to provide none.
Therefore, we have to scale the data before clustering.
# scaling
music_scale <- scale(music_num)
head(music_scale)
#> popularity acousticness danceability duration_ms energy
#> [1,] -0.7281633 -0.8120669 1.18338351 -0.68676402 0.39243398
#> [2,] 0.8922610 0.9952284 -0.91807189 -0.85463321 0.04365441
#> [3,] 0.4292826 1.5482508 -1.18344339 -0.20000603 -0.96721519
#> [4,] -0.7281633 1.8882236 -0.05740756 -0.98130709 -0.32285971
#> [5,] -0.2651850 -0.7730834 0.36575240 -0.39918007 1.29098474
#> [6,] -0.4966742 -0.8797895 0.98973403 0.09907949 0.36878791
#> instrumentalness liveness loudness speechiness tempo valence
#> [1,] -0.14734191 -0.6142960 0.7419484 -0.4066667 -0.8049667 0.5852270
#> [2,] -0.14148543 1.6896371 -0.3512361 0.1514805 1.6107974 -0.7948219
#> [3,] -0.01999020 0.1895944 -0.4579988 -0.8533817 -1.1201809 -0.9400902
#> [4,] 4.98505960 -0.5712008 -1.3298258 0.2007868 1.6539323 -1.3032610
#> [5,] -0.14734191 -0.7062875 0.1920585 -0.7922419 -0.5980576 1.3115685
#> [6,] -0.09772159 3.6869316 -0.1368862 0.6346822 0.5981139 1.2752514
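As a minimal sketch of what scale() does, the toy matrix below (made-up values, not taken from the Spotify data) has one column in milliseconds and one on a 0 to 1 scale. The names toy, toy_scaled, and toy_manual are only for illustration.

```r
# Toy data (hypothetical values): two columns on very different scales
toy <- cbind(
  duration_ms = c(180000, 200000, 220000, 240000),
  energy      = c(0.2, 0.4, 0.6, 0.8)
)

# Before scaling, duration_ms dominates the variances
apply(toy, 2, var)

# scale() subtracts each column mean and divides by the column standard deviation
toy_scaled <- scale(toy)

# The same z-scores computed by hand
toy_manual <- sweep(sweep(toy, 2, colMeans(toy)), 2, apply(toy, 2, sd), "/")

# After scaling, every column has mean 0 and variance 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, var)
```

Once both columns have unit variance, neither one can dominate the PCA or the distance calculations in K-means just because of its unit of measurement.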
PCA summarizes the information (variance) of the initial variables using new dimensions called principal components (PCs). Only a few PCs are needed to retain most of the information.
music_pca <- PCA(music_scale,scale.unit = F, graph = F)
#Visualization
plot.PCA(music_pca,
choix = "ind",
select = "contrib 5",
habillage = 1)
The plot labels the five observations that contribute most to the components, which we treat as outliers: 2693, 2097, 406, 1326, and 2820.
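To make the idea of "a few PCs that summarize the information" concrete, here is a small sketch using base R's prcomp() on the built-in USArrests data, which only stands in for the scaled Spotify features. The 80% threshold is an arbitrary assumption for illustration, not a value used in this analysis.

```r
# Built-in USArrests data stands in for the Spotify features (assumption)
toy_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance captured by each PC is its squared standard deviation
pc_var <- toy_pca$sdev^2

# Proportion and cumulative proportion of the total variance per PC
prop_var <- pc_var / sum(pc_var)
cum_var  <- cumsum(prop_var)
round(cum_var, 3)

# Smallest number of PCs that retains at least 80% of the information
# (0.80 is a hypothetical threshold chosen for this sketch)
n_pc <- which(cum_var >= 0.80)[1]
n_pc
```

The same reasoning applies to any PCA result: decide how much of the total variance to keep, then take the first PCs whose cumulative proportion reaches that amount.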
Next, we look at the contribution of the variables to each PC and at the correlations between variables.
plot.PCA(
  x = music_pca,
  choix = "var"
)
fviz_contrib(
  X = music_pca,
  choice = "var",
  axes = 1
)
fviz_contrib(
  X = music_pca,
  choice = "var",
  axes = 2
)
From the visualizations we get the following insights:
The two variables that contribute most to PC1: energy & loudness
The two variables that contribute most to PC2: danceability & speechiness
Variable pairs that appear positively correlated on the plot:
speechiness & popularity
danceability & popularity
danceability & speechiness
duration_ms & speechiness
instrumentalness & speechiness
danceability & tempo
Clustering groups data based on their characteristics. It aims to produce clusters in which observations in the same cluster are as similar as possible and observations in different clusters are as different as possible.
Before we perform cluster analysis, we first need to determine the optimal number of clusters. In K-means, we try to minimize the within-cluster sum of squares (meaning the distance between observations in the same cluster is minimal). Three methods can be used to find the optimal number of clusters: the elbow method, the silhouette method, and the gap statistic. Here we will use the elbow method.
Choosing the number of clusters with the elbow method is somewhat subjective. The rule of thumb is to choose the number of clusters in the “bend” area, where the total within-cluster sum of squares starts to level off as the number of clusters increases.
RNGkind(sample.kind = "Rounding")
set.seed(100)
fviz_nbclust(music_scale, kmeans, method = "wss")
We take the value of k beyond which adding another cluster no longer reduces the total within-cluster sum of squares by much (the curve flattens), so we will take k = 5.
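The quantity behind the elbow plot can also be computed by hand. The sketch below runs kmeans() for several values of k on simulated toy data (three made-up groups, not the Spotify data) and records the total within-cluster sum of squares. The range 1 to 8 and nstart = 10 are assumptions made for this example.

```r
set.seed(100)

# Toy data (hypothetical): three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The drop in WSS gained by each additional cluster;
# the elbow is where this drop becomes small
data.frame(k = k_values, wss = round(wss, 1), drop = c(NA, round(-diff(wss), 1)))
```

Plotting wss against k_values gives the kind of curve that the "wss" method draws, and the row where drop becomes small marks the bend.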
Here’s the algorithm behind K-Means Clustering:
1. Randomly assign a number from 1 to K to each observation. This serves as the initial cluster assignment for the observations.
2. Iterate until the cluster assignments stop changing:
   a. For each of the K clusters, calculate the cluster centroid. The centroid of cluster k is the vector of the p feature means for the observations in cluster k.
   b. Assign each observation to the cluster with the closest centroid (using Euclidean distance or another distance measure).
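The steps above can be sketched in a few lines of base R. This is only an illustration of the algorithm as described, not what kmeans() does internally: simple_kmeans is a hypothetical helper, the handling of empty clusters is my own assumption, and the toy data are simulated.

```r
# simple_kmeans is a hypothetical helper written for illustration only
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Step 1: random initial assignment; rep_len() ensures every cluster starts non-empty
  cluster <- sample(rep_len(seq_len(k), n))
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(x))
  d2 <- matrix(NA_real_, nrow = n, ncol = k)
  for (iter in seq_len(max_iter)) {
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      # Step 2a: the centroid is the vector of feature means of the cluster's members
      # (assumption: an empty cluster is re-seeded with a randomly chosen observation)
      centers[j, ] <- if (nrow(members) > 0) colMeans(members) else x[sample.int(n, 1), ]
      # Squared Euclidean distance from every observation to centroid j
      d2[, j] <- rowSums(sweep(x, 2, centers[j, ])^2)
    }
    # Step 2b: move each observation to its closest centroid
    new_cluster <- apply(d2, 1, which.min)
    # Stop when no assignment changes
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Toy data (hypothetical): two groups of 20 points each
set.seed(100)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster, rep(c("group A", "group B"), each = 20))
```

Because the starting assignment is random, different seeds can give different cluster labels or even different groupings, which is why a seed is set before clustering.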
RNGkind(sample.kind = "Rounding")
set.seed(100)
music_kmeans <- kmeans(music_scale, centers = 5)
The quality of the grouping can be judged from three values:
Within Sum of Squares ($withinss): the sum of the squared distances from each observation to the centroid of its cluster.
Between Sum of Squares ($betweenss): the sum of the squared distances from each centroid to the global mean, weighted by the number of observations in the cluster.
Total Sum of Squares ($totss): the sum of the squared distances from each observation to the global mean.
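To see how these three values relate, here is a small sketch on the built-in iris measurements, which only stand in for the Spotify features. The names toy, toy_kmeans, and between_manual, the choice of 3 centers, and nstart = 10 are assumptions for the example.

```r
set.seed(100)

# Built-in iris measurements, scaled, stand in for the Spotify features (assumption)
toy <- scale(iris[, 1:4])
toy_kmeans <- kmeans(toy, centers = 3, nstart = 10)

# Within sum of squares: one value per cluster
toy_kmeans$withinss

# Between sum of squares computed by hand: squared distance from each centroid
# to the global mean, weighted by the number of observations in the cluster
global_mean <- colMeans(toy)
between_manual <- sum(
  toy_kmeans$size * rowSums(sweep(toy_kmeans$centers, 2, global_mean)^2)
)
all.equal(between_manual, toy_kmeans$betweenss)

# The total sum of squares splits into the within and between parts
all.equal(toy_kmeans$totss, toy_kmeans$tot.withinss + toy_kmeans$betweenss)

# Share of the total variation that lies between clusters
toy_kmeans$betweenss / toy_kmeans$totss
```

A ratio of betweenss to totss closer to 1 means the clusters are compact and well separated, so it is a quick way to compare clusterings of the same data.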
music_kmeans$cluster
#> [1] 3 1 4 4 3 3 4 2 4 2 2 2 2 2 2 2 2 3 3 2 3 2 2 2 3 2 2 2 3 1 1 1 4 1 1 1 1
#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 5 1 1 1 1
#> [75] 1 1 1 2 3 3 1 1 1 1 2 1 1 2 4 1 1 1 1 5 5 4 1 1 5 1 2 1 3 3 1 3 4 4 4 1 3
#> [112] 3 4 1 5 2 1 1 1 3 4 1 1 2 4 3 5 3 2 2 3 1 3 3 4 2 3 4 1 1 2 3 1 3 3 1 3 1
#> [149] 3 2 1 5 3 3 2 2 4 2 2 2 1 2 2 2 3 2 3 3 2 2 2 3 3 3 3 5 1 2 2 3 3 3 4 2 4
#> [186] 2 2 1 3 3 2 4 2 2 3 4 4 2 2 5 4 2 4 2 4 5 3 2 3 2 5 4 3 2 2 4 3 5 3 4 2 3
#> [223] 1 2 3 3 2 1 3 3 5 2 2 3 3 2 3 2 2 3 2 3 2 3 1 3 5 3 4 1 3 4 2 5 4 2 4 3 3
#> [260] 5 2 2 3 2 5 2 3 3 3 4 5 2 3 4 2 3 2 2 2 3 2 2 3 2 3 3 3 3 3 2 2 2 3 3 3 3
#> [297] 3 3 3 3 2 3 2 4 3 1 5 3 2 2 2 5 3 2 2 3 2 3 4 3 5 4 3 3 3 2 4 2 3 2 2 3 4
#> [334] 2 4 2 3 3 2 3 4 5 5 2 2 3 2 2 3 3 3 2 2 4 5 3 3 3 3 2 2 3 3 3 2 3 4 3 3 3
#> [371] 3 5 3 3 2 2 2 2 2 3 2 4 3 3 3 3 5 3 3 5 3 2 2 2 4 3 3 2 2 3 3 2 3 3 3 3 2
#> [408] 3 3 2 2 3 3 3 3 3 2 2 3 3 3 3 3 3 2 2 3 3 3 2 3 2 3 4 3 3 2 3 3 2 3 3 2 2
#> [445] 3 2 3 3 3 2 4 3 3 4 2 2 3 3 3 3 1 2 2 3 3 3 3 4 2 2 2 4 3 2 2 2 4 4 4 2 4
#> [482] 4 3 5 4 3 4 4 2 4 2 2 4 2 3 4 4 2 2 4 4 2 2 4 4 4 4 2 4 2 4 2 2 2 1 1 1 5
#> [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 5 1 5 1 4 5 1 1 1 1 5 1 2 1 4 1 1 5 5 5
#> [556] 5 1 5 5 1 5 5 5 4 1 1 1 5 5 4 1 5 2 2 5 5 5 2 5 3 1 1 5 4 5 5 4 5 5 5 5 5
#> [593] 5 3 5 5 5 5 1 5 1 3 5 2 4 5 3 2 5 2 5 5 2 5 5 5 5 5 5 5 4 5 4 5 2 5 5 5 3
#> [630] 5 5 5 5 2 5 5 5 5 5 2 5 1 5 5 4 5 3 5 5 1 5 2 5 5 5 2 5 3 5 5 5 5 4 5 5 5
#> [667] 5 5 4 4 5 3 4 1 5 3 5 3 5 4 5 3 5 4 5 5 4 3 5 5 2 5 5 5 5 2 5 5 5 4 5 5 5
#> [704] 5 5 5 5 3 3 3 5 2 3 3 3 3 5 5 5 3 5 4 5 5 5 5 5 4 4 5 2 5 5 5 5 5 4 5 4 5
#> [741] 2 5 3 5 4 3 5 3 4 2 5 5 5 5 5 3 5 5 5 4 4 4 3 5 2 3 5 5 4 4 3 2 5 4 5 2 4
#> [778] 5 5 5 5 5 5 4 2 2 5 5 5 5 2 4 2 5 5 2 5 4 3 3 2 2 3 5 5 5 5 2 5 5 2 2 4 5
#> [815] 3 5 3 3 4 3 2 4 2 2 5 3 3 3 5 2 5 5 5 3 4 5 2 2 5 3 3 5 5 3 3 5 4 2 4 3 2
#> [852] 5 3 3 3 3 1 3 3 1 5 3 3 3 3 3 2 3 3 3 5 3 3 4 3 3 3 3 5 3 3 2 2 3 3 5 1 3
#> [889] 3 3 3 3 3 3 2 3 4 3 5 3 2 3 5 3 3 3 3 3 3 2 5 3 5 3 4 2 4 4 4 5 4 2 4 3 3
#> [926] 2 4 4 4 4 3 2 2 2 2 4 3 3 4 4 4 4 2 2 2 2 3 4 2 3 2 3 2 3 4 5 2 2 3 2 3 2
#> [963] 4 3 4 2 2 4 4 2 2 4 2 2 3 3 3 3 3 2 3 2 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2
#> [1000] 3 2 3 3 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [1037] 1 1 1 1 5 1 1 1 1 1 1 1 5 1 5 1 1 1 5 1 1 4 5 1 1 1 5 5 1 1 1 1 5 2 5 1 2
#> [1074] 1 5 2 5 5 5 5 5 1 1 5 5 2 2 5 1 1 1 5 5 1 4 5 1 1 1 1 1 1 1 3 5 3 4 2 3 4
#> [1111] 4 5 3 2 5 5 5 5 5 4 2 2 1 1 4 4 1 4 5 1 5 3 5 5 5 5 5 4 3 2 5 1 5 5 3 4 1
#> [1148] 3 2 5 5 5 5 5 5 5 5 3 2 4 5 5 4 1 5 1 5 5 5 2 5 3 4 2 3 5 2 5 2 5 5 4 4 2
#> [1185] 5 5 4 5 5 2 5 5 5 2 5 2 2 5 5 3 5 5 3 5 3 1 3 5 3 5 5 5 2 4 4 5 5 3 4 3 5
#> [1222] 2 4 5 2 5 4 2 4 5 4 5 5 3 2 5 1 5 5 5 5 2 5 4 5 5 4 5 3 2 5 5 5 2 2 2 5 1
#> [1259] 5 5 5 5 2 2 4 5 5 5 2 5 5 5 5 5 5 5 4 4 5 3 5 5 2 5 5 4 5 2 5 5 2 5 5 5 5
#> [1296] 5 5 4 5 3 5 5 5 4 2 2 2 5 5 2 5 5 5 4 5 5 4 2 3 5 5 2 4 5 5 5 4 4 1 5 5 3
#> [1333] 5 4 5 3 5 3 5 4 3 2 5 3 5 5 2 4 5 2 5 5 5 5 5 3 5 5 3 2 4 5 2 2 2 5 2 4 5
#> [1370] 2 4 5 5 2 5 5 3 5 4 3 5 5 4 5 5 2 5 3 2 2 2 5 5 3 3 5 4 2 5 4 5 5 4 3 4 4
#> [1407] 5 5 5 2 4 2 2 5 3 5 3 5 4 2 3 3 5 4 5 5 5 5 2 2 3 5 5 5 5 2 5 4 4 5 4 5 3
#> [1444] 2 3 2 3 4 3 4 3 3 2 4 2 1 5 3 2 3 2 3 5 5 2 4 2 3 3 2 5 5 4 3 2 5 3 2 5 5
#> [1481] 5 5 5 3 5 2 5 2 2 3 3 3 5 5 5 3 2 3 2 4 3 5 4 2 3 3 2 2 2 3 5 3 5 4 3 3 4
#> [1518] 5 1 2 2 2 4 1 3 2 4 2 2 4 2 1 3 2 4 2 3 4 3 2 3 4 5 3 4 4 4 3 2 5 4 4 4 2
#> [1555] 2 2 4 2 2 5 2 2 5 2 4 3 5 5 2 2 3 3 4 2 3 2 4 2 2 3 3 4 3 2 3 2 4 2 3 4 2
#> [1592] 4 2 5 4 3 4 2 4 2 4 2 2 4 2 4 2 1 1 1 1 1 1 1 1 1 4 1 1 1 4 1 1 1 1 1 1 1
#> [1629] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 4 1 1 5 1 1 1 1 5 1 1
#> [1666] 1 1 1 5 1 1 1 1 1 1 2 4 5 1 1 1 1 1 3 5 2 1 1 1 5 1 5 4 1 1 4 1 1 1 1 5 4
#> [1703] 1 4 5 1 4 4 1 3 1 4 1 1 1 1 1 1 3 1 3 3 1 4 1 1 1 5 5 1 5 2 4 1 1 4 1 3 2
#> [1740] 2 1 4 4 5 5 4 4 4 4 1 1 2 1 1 1 1 1 4 1 1 1 5 1 4 2 5 4 4 3 4 1 1 5 1 4 2
#> [1777] 4 3 4 1 5 1 2 2 1 1 1 4 5 2 5 5 5 1 5 4 3 2 4 2 5 4 1 4 1 4 2 4 5 3 4 5 1
#> [1814] 2 1 3 5 4 3 4 4 3 2 5 5 2 5 1 3 4 1 2 4 5 3 3 4 5 1 2 3 3 4 2 4 1 1 5 3 5
#> [1851] 3 4 5 5 3 4 4 2 2 1 3 1 1 5 3 4 5 1 2 2 5 1 4 2 4 4 4 2 3 5 3 5 4 4 3 3 2
#> [1888] 2 2 1 4 5 4 1 1 2 2 1 5 1 4 4 4 3 5 2 1 1 2 4 3 5 2 5 2 4 5 2 3 2 5 4 4 2
#> [1925] 3 2 2 2 5 5 2 5 4 5 4 2 5 4 2 2 5 4 3 2 2 2 2 5 5 2 5 4 2 2 4 5 5 2 5 2 5
#> [1962] 3 4 5 5 1 1 1 5 3 2 1 3 5 1 4 2 1 1 3 4 3 3 3 4 2 5 4 5 3 2 3 5 3 1 1 4 5
#> [1999] 2 3 2 5 3 2 2 2 3 4 2 5 2 2 2 2 2 2 3 4 5 5 2 5 4 5 5 2 2 4 5 4 1 5 3 4 5
#> [2036] 2 2 3 5 4 4 2 4 2 5 5 5 2 3 2 2 2 1 4 5 4 5 5 2 1 3 5 2 4 3 3 3 3 3 3 4 5
#> [2073] 2 2 3 3 4 2 4 3 4 1 1 3 1 4 2 2 1 5 2 4 2 3 5 3 3 5 4 2 5 1 3 2 3 5 2 5 3
#> [2110] 5 3 3 2 3 5 2 1 5 2 2 2 4 2 2 3 5 4 3 4 3 5 4 3 4 2 3 3 3 2 2 5 3 3 2 5 4
#> [2147] 2 1 3 3 2 2 4 3 2 2 4 5 4 1 2 2 5 4 5 1 5 2 1 4 2 3 2 5 3 2 3 2 5 2 3 2 4
#> [2184] 3 5 2 4 2 1 3 3 3 3 4 4 3 5 3 2 4 3 3 3 2 4 3 5 3 1 3 5 5 2 4 3 2 5 5 3 4
#> [2221] 5 4 2 4 5 4 4 4 5 3 2 5 3 2 3 3 5 4 2 2 2 3 3 2 5 3 5 5 5 3 4 5 2 4 3 5 5
#> [2258] 4 5 3 3 4 3 5 5 4 3 3 5 5 3 5 2 2 2 3 3 5 3 2 2 3 2 2 4 3 4 2 2 2 5 3 5 4
#> [2295] 4 3 2 4 3 3 3 3 2 3 2 1 2 2 2 3 3 5 1 2 2 4 3 4 2 3 5 4 3 5 2 3 3 2 3 5 3
#> [2332] 5 2 2 4 3 2 3 2 3 4 2 4 3 3 5 4 1 5 5 5 5 5 3 2 3 3 3 2 4 1 3 3 3 2 4 2 3
#> [2369] 5 2 2 4 2 5 2 4 2 2 4 5 5 3 4 3 2 3 3 3 2 4 3 4 3 3 5 3 5 3 3 3 3 2 3 2 2
#> [2406] 3 1 3 3 2 3 2 3 2 5 5 5 3 5 2 2 2 3 5 3 3 2 2 5 5 5 2 5 5 2 4 3 4 5 4 2 4
#> [2443] 2 5 5 2 3 4 3 2 2 4 5 5 2 2 1 4 2 5 4 2 5 4 2 2 4 2 5 3 2 2 5 3 1 4 3 3 3
#> [2480] 4 3 4 4 3 2 4 3 2 3 2 2 2 2 4 3 3 2 3 5 5 5 2 3 4 2 2 2 2 2 3 2 3 5 3 3 3
#> [2517] 2 2 3 3 2 2 3 4 3 4 4 3 3 4 3 4 3 5 2 5 4 3 3 3 3 2 5 2 5 4 2 2 4 2 5 3 3
#> [2554] 2 3 3 5 5 2 3 3 3 2 4 3 4 3 3 3 2 3 3 3 4 2 5 3 4 5 3 2 2 3 2 3 2 5 2 2 2
#> [2591] 3 3 2 5 3 3 2 3 3 3 3 3 3 2 2 2 2 3 3 1 3 2 4 1 2 2 3 3 3 3 3 5 3 3 4 2 2
#> [2628] 4 3 3 3 3 2 5 5 3 3 4 2 3 3 2 3 3 3 2 2 2 5 2 2 2 2 3 3 3 3 3 2 4 5 3 3 3
#> [2665] 2 3 2 2 3 2 2 3 3 3 3 2 2 3 3 3 3 3 2 4 3 2 3 2 1 3 3 3 3 4 3 3 5 2 2 2 2
#> [2702] 3 2 2 3 5 2 4 3 5 2 5 3 3 4 3 3 4 3 2 3 4 4 3 2 1 2 2 3 2 2 2 2 2 3 3 3 3
#> [2739] 3 3 4 4 2 3 3 3 2 5 4 4 3 4 2 2 3 2 3 5 3 3 4 3 4 3 3 3 3 3 4 3 3 5 3 2 2
#> [2776] 3 2 2 4 3 3 3 3 3 4 5 3 3 3 2 3 3 3 3 4 3 3 3 2 3 3 3 4 3 3 2 3 1 4 5 3 2
#> [2813] 2 5 5 3 3 3 5 2 4 4 2 3 3 3 5 3 3 3 2 3 3 2 3 3 3 3 3 3 5 4 5 3 3 3 2 2 4
#> [2850] 2 3 3 5 3 2 3 3 2 2 2 3 4 3 2 2 3 3 3 2 3 3 3 2 2 3 3 3 3 2 3 3 3 5 3 3 5
#> [2887] 3 1 1 1 1 1 1 1 1 5 1 1 1 2 2 5 1 1 2 2 1 3 1 4 5 3 1 5 3 3 1 5 3 5 3 3 5
#> [2924] 3 1 3 4 3 3 1 1 2 3 2 3 3 3 5 3 2 5 2 3 5 3 3 4 3 3 3 2 3 3 1 3 3 1 1 3 3
#> [2961] 5 3 3 3 2 3 3 3 3 5 3 3 3 4 5 3 1 2 3 5 3 3 5 2 3 5 5 3 5 3 3 3 3 3 3 3 3
#> [2998] 3 3 3 3 5 3 5 5 3 5 5 3 3 3 3 2 3 3 4 4 2 3 3 2 3 3 4 4 3 3 3 3 4 4 2 3 4
#> [3035] 1 4 1 4 2 4 4 1 4 2 1 4 2 5 5 2 4 2 4 4 4 1 1 5 1 4 5 4 2 4 5 5 2 4 4 3 2
#> [3072] 4 4 5 1 4 2 2 4 4 3 2 4 2 4 4 2 4 4 4 1 5 2 4 2 2 4 4 5 2 3 3 4 4 3 2 4 5
#> [3109] 4 2 2 2 4 4 2 4 2 5 4 4 3 3 3 3 2 3 4 3 3 2 2 5 4 3 5 3 2 4 3 3 4 4 3 2 3
#> [3146] 5 2 2 3 3 2 2 3 4 3 3 3 3 2 5 4 5 3 2 4 2 3 3 4 5 3 2 2 3 3 2 2 2 5 4 3 3
#> [3183] 3 2 2 3 2 5 2 3 3 3 4 2 3 3 2 3 3 3 3 2 3 2 3 3 1 4 3 2 1 1 1 2 4 2 2 1 2
#> [3220] 4 4 1 4 4 2 3 2 3 3 2 1 4 4 2 2 3 3 4 2 4 2 1 2 2 3 2 2 2 3 2 2 3 1 3 2 2
#> [3257] 3 2 4 3 4 2 3 4 4 4 2 3 2 4 2 3 2 4 2 3 3 3 4 4 3 4 4 4 4 4 3 4 3 2 3 3 4
#> [3294] 3 3 2 2 4 3 3 2 2 2 3 3 3 3 3 2 4 3 2 2 3 4 4 3 4 5 2 4 4 3 4 2 4 2 3 2 4
#> [3331] 2 3 2 3 2 2 2 3 3 2 2 2 3 2 2 3 3 2 4 2 3 3 3 3 2 4 2 4 2 2 4 5 2 3 2 2 2
#> [3368] 4 5 3 3 3 3 4 2 2 2 2 3 3 3 2 3 4 2 2 2 2 4 2 2 3 2 2 2 2 2 3 2 4 2 4 3 4
#> [3405] 2 2 2 3 4 2 3 3 4 3 2 4 3 2 4 3 3 2 4 2 2 3 3 2 3 2 4 4 5 4 4 2 2 2 4 3 3
#> [3442] 4 2 3 2 3 3 2 3 3 2 4 2 4 3 3 2 2 2 2 2 4 3 2 2 2 2 4 4 4 2 2 2 2 2 4 2 4
#> [3479] 2 2 3 4 3 2 2 5 3 3 3 2 3 2 2 2 2 2 4 4 4 4 3 4 4 5 3 4 4 4 5 4 4 4 2 4 2
#> [3516] 2 4 2 4 3 4 2 2 4 2 4 3 2 5 3 2 2 4 2 2 4 3 4 2 4 3 2 3 3 2 4 3 4 2 4 3 4
#> [3553] 4 2 4 3 3 3 3 2 4 4 3 4 2 3 2 2 3 4 4 4 4 3 4 4 4 4 4 3 3 4 2 2 4 2 3 2 2
#> [3590] 5 3 4 2
Interpretation/Cluster Profiling
# enter the results of clustering into the initial data
music_num$cluster <- music_kmeans$cluster
popular_music$cluster <- music_kmeans$cluster
# doing profiling by summarizing data
music_centroid <- music_num %>%
group_by(cluster) %>% # grouping every each cluster
summarise_all(mean) #calculate mean cluster
music_centroid
library(ggiraphExtra)
ggRadar(
  data = music_num,
  mapping = aes(colours = cluster),
  interactive = T
)
music_centroid %>%
  pivot_longer(-cluster) %>%
  group_by(name) %>%
  summarize(
    kelompok_min = which.min(value), # cluster with the lowest mean for this variable
    kelompok_max = which.max(value)) # cluster with the highest mean for this variable
Cluster 1: Songs that tend to be popular and have a short duration.
Cluster 2: Songs with a long duration but low acousticness, danceability, speechiness, and valence, and also low popularity.
Cluster 3: Songs with high energy, liveness, loudness, and valence.
Cluster 4: Songs with high instrumentalness but low energy, low liveness, a slow tempo, and low loudness.
Cluster 5: Songs with a fast tempo, lots of lyrics, and high danceability.
popular_music %>%
  filter(popularity >= 80) %>% head(5)
Among the tracks returned by the popularity filter is 7 rings by Ariana Grande, which has a high popularity score. According to the clustering results, this track is in cluster 3 and in the Dance genre.
For example, suppose you chose the genre “Alternative”: which music would we suggest next?
popular_music %>%
  filter(cluster == 1 & genre == "Dance") %>% head(5)
The result lists tracks that fall in the same cluster as the 7 rings track and so have something in common with it.
From the analysis of unsupervised learning above, it can be concluded that:
Dimensionality reduction can be applied to this dataset. To do so, we choose PCs from the total of 11 PCs according to how much of the total information we want to retain.
Using K-Means, we can group the tracks into the clusters below:
Cluster 1: Songs that tend to be popular and have a short duration.
Cluster 2: Songs with a long duration but low acousticness, danceability, speechiness, and valence, and also low popularity.
Cluster 3: Songs with high energy, liveness, loudness, and valence.
Cluster 4: Songs with high instrumentalness but low energy, low liveness, a slow tempo, and low loudness.
Cluster 5: Songs with a fast tempo, lots of lyrics, and high danceability.