Intro

Spotify is a digital music service available on Android, iOS, Windows, and other operating systems. It provides access to songs from artists all over the world. With so many songs available, it can be hard to choose one that suits our tastes.

Based on this, I will try to cluster the songs with an unsupervised learning method, so that readers can choose songs that match the criteria they want.

Data Preparation

Load the required libraries.

library(tidyverse)
library(factoextra)
library(FactoMineR)
library(animation)
library(lubridate)
library(ggiraphExtra)

Load the data used for the unsupervised learning model.

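music <- read.csv(file = "SpotifyFeatures.csv", stringsAsFactors = FALSE)
# The first column arrives as `ï..genre`, most likely because the file starts
# with a byte-order mark; copy it into a properly named column
music$genre <- music$ï..genre

# Drop the mangled column and store genre as a factor
music <- music %>%
  select(-ï..genre) %>%
  mutate(genre = as.factor(genre))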
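
The `ï..genre` column name is what a UTF-8 byte-order mark at the start of a CSV file typically looks like after `read.csv()`. As a minimal sketch, assuming that is the cause here, the mark can be dropped while reading by passing `fileEncoding = "UTF-8-BOM"`. The temporary file and its two rows below are made up for the demonstration and are not the Spotify data.

```r
# Hypothetical demo file: a tiny CSV that starts with a UTF-8 byte-order mark
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("genre,popularity\nMovie,0\nDance,76\n")), tmp)

# "UTF-8-BOM" asks R to strip the mark, so the first column keeps its real name
demo <- read.csv(tmp, stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
names(demo)
```

With this option the column would already be called `genre`, although it would stay in the first position instead of the last one.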

Inspect Data

After importing the data, we inspect its contents. We could use the view() function, but looking through the whole dataset takes time, so we look only at the first rows with head().

Descriptions:

  • artist_name: Artist’s name.
  • track_name: Track’s name.
  • track_id: The Spotify ID for the track.
  • popularity: Track’s popularity rating on Spotify.
  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • duration_ms: The duration of the track in milliseconds.
  • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • key: The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on up to 11. If no key was detected, the value is -1. In this dataset the key is stored as a note name such as "C#".
  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
  • mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. In this dataset the mode is stored as "Major" or "Minor".
  • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of “3/4” to “7/4”. In this dataset it is stored as text such as "4/4".
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
  • genre: Track’s genre.

Data Wrangling & Exploratory Data Analysis

Check the structure of the data. Are there any columns whose data types do not match?

str(music)
#> 'data.frame':    232725 obs. of  18 variables:
#>  $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#>  $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#>  $ track_id        : chr  "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
#>  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
#>  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#>  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#>  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#>  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#>  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#>  $ key             : chr  "C#" "F#" "C" "C#" ...
#>  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#>  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
#>  $ mode            : chr  "Major" "Minor" "Minor" "Major" ...
#>  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#>  $ tempo           : num  167 174 99.5 171.8 140.6 ...
#>  $ time_signature  : chr  "4/4" "4/4" "5/4" "4/4" ...
#>  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
#>  $ genre           : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...

Columns whose data types should be changed:

  • artist_name -> as factor
  • track_name -> as factor
  • key -> as factor
  • mode -> as factor
  • time_signature -> as factor

Can any of the variables be removed?

  • track_id -> we don’t need this variable (the track’s ID) because each track is already identified by its name.

music_clean <- music %>%
  mutate(artist_name = as.factor(artist_name),
         track_name = as.factor(track_name),
         key = as.factor(key),
         mode = as.factor(mode),
         time_signature = as.factor(time_signature))%>%
  select(-track_id)

glimpse(music_clean)
#> Rows: 232,725
#> Columns: 17
#> $ artist_name      <fct> "Henri Salvador", "Martin & les fées", "Joseph Willi~
#> $ track_name       <fct> "C'est beau de faire un Show", "Perdu d'avance (par G~
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
#> $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
#> $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
#> $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123~
#> $ key              <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G, ~
#> $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
#> $ mode             <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo~
#> $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
#> $ time_signature   <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4~
#> $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~
#> $ genre            <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi~

We have converted the columns to the appropriate data types and removed the unnecessary column.

Next, we check for missing values in the dataset.

colSums(is.na(music_clean))
#>      artist_name       track_name       popularity     acousticness 
#>                0                0                0                0 
#>     danceability      duration_ms           energy instrumentalness 
#>                0                0                0                0 
#>              key         liveness         loudness             mode 
#>                0                0                0                0 
#>      speechiness            tempo   time_signature          valence 
#>                0                0                0                0 
#>            genre 
#>                0

The check shows that there are no NA or blank values.

Pre-Processing Data

Before modeling, we subset the data by popularity, keeping only tracks with a popularity value of at least 75.

popular_music <- music_clean %>% filter(popularity >= 75)
str(popular_music)
#> 'data.frame':    3593 obs. of  17 variables:
#>  $ artist_name     : Factor w/ 14564 levels "'Til Tuesday",..: 5953 6476 6476 6476 11641 4906 10544 5008 12071 12945 ...
#>  $ track_name      : Factor w/ 148615 levels "' Cello Song",..: 132378 107362 112957 146257 7961 42036 136590 133047 47566 120168 ...
#>  $ popularity      : int  76 83 81 76 78 77 78 76 75 75 ...
#>  $ acousticness    : num  0.0233 0.422 0.544 0.619 0.0319 0.00836 0.0576 0.00847 0.443 0.0495 ...
#>  $ danceability    : num  0.845 0.552 0.515 0.672 0.731 0.818 0.559 0.56 0.656 0.612 ...
#>  $ duration_ms     : int  187521 180019 209274 174358 200373 222640 264307 218013 222374 240400 ...
#>  $ energy          : num  0.709 0.65 0.479 0.588 0.861 0.705 0.345 0.936 0.432 0.807 ...
#>  $ instrumentalness: num  0 0.000275 0.00598 0.241 0 0.00233 0.000105 0 0 0.0177 ...
#>  $ key             : Factor w/ 12 levels "A","A#","B","C",..: 2 5 7 5 3 10 8 7 10 2 ...
#>  $ liveness        : num  0.094 0.372 0.191 0.0992 0.0829 0.613 0.141 0.161 0.132 0.101 ...
#>  $ loudness        : num  -4.55 -7.2 -7.46 -9.57 -5.88 ...
#>  $ mode            : Factor w/ 2 levels "Major","Minor": 2 1 1 1 1 1 1 1 2 1 ...
#>  $ speechiness     : num  0.0714 0.128 0.0261 0.133 0.0323 0.177 0.0459 0.0439 0.217 0.0336 ...
#>  $ tempo           : num  98.1 167.8 89 169 104 ...
#>  $ time_signature  : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 4 4 4 4 4 4 4 4 ...
#>  $ valence         : num  0.62 0.316 0.284 0.204 0.78 0.772 0.458 0.371 0.0897 0.398 ...
#>  $ genre           : Factor w/ 27 levels "A Capella","Alternative",..: 19 2 2 2 2 2 2 2 2 2 ...

PCA works on variances, so we only need the numeric (int/num) columns. Now that the data types are correct, we can select only the numeric columns.

music_num <- popular_music %>% 
  select_if(is.numeric)
head(music_num)

After removing the non-numeric columns, we can move on to Principal Component Analysis (PCA). First, we check the covariance matrix and plot the variance captured by each component of the unscaled data.

cov(music_num)
#>                        popularity    acousticness     danceability
#> popularity           18.661155817    0.0102555657     0.0703245592
#> acousticness          0.010255566    0.0486669591    -0.0029457106
#> danceability          0.070324559   -0.0029457106     0.0194399385
#> duration_ms      -21990.426755470 -456.4368769197 -1320.1215394549
#> energy               -0.003681580   -0.0190052942    -0.0026170937
#> instrumentalness     -0.011014591    0.0006582834    -0.0003472753
#> liveness              0.002062613   -0.0015510822    -0.0004202905
#> loudness              0.710601166   -0.2130338207     0.0134905267
#> speechiness           0.023145655   -0.0015862877     0.0041648203
#> tempo                -1.135436994   -0.2319680495    -0.3496705530
#> valence              -0.018426741   -0.0054057076     0.0059469199
#>                      duration_ms         energy instrumentalness       liveness
#> popularity           -21990.4268  -0.0036815799    -0.0110145910   0.0020626130
#> acousticness           -456.4369  -0.0190052942     0.0006582834  -0.0015510822
#> danceability          -1320.1215  -0.0026170937    -0.0003472753  -0.0004202905
#> duration_ms      1997156721.4270 530.3447259975   164.5176016874 124.1994234542
#> energy                  530.3447   0.0286155400    -0.0003566577   0.0024352975
#> instrumentalness        164.5176  -0.0003566577     0.0022049200   0.0003036391
#> liveness                124.1994   0.0024352975     0.0003036391   0.0145596145
#> loudness              -4468.8300   0.2963591703    -0.0119807231   0.0188037160
#> speechiness            -455.4490  -0.0015152587    -0.0003637197   0.0007722376
#> tempo                -17144.3703   0.2153668199     0.0273663386   0.1508496362
#> valence                -861.0436   0.0151149328    -0.0009154053   0.0022851267
#>                        loudness     speechiness           tempo         valence
#> popularity           0.71060117    0.0231456554     -1.13543699   -0.0184267412
#> acousticness        -0.21303382   -0.0015862877     -0.23196805   -0.0054057076
#> danceability         0.01349053    0.0041648203     -0.34967055    0.0059469199
#> duration_ms      -4468.83003999 -455.4489515630 -17144.37033558 -861.0436485855
#> energy               0.29635917   -0.0015152587      0.21536682    0.0151149328
#> instrumentalness    -0.01198072   -0.0003637197      0.02736634   -0.0009154053
#> liveness             0.01880372    0.0007722376      0.15084964    0.0022851267
#> loudness             5.88518536   -0.0143229921      1.40317680    0.1289770325
#> speechiness         -0.01432299    0.0102833668      0.34144781    0.0006709665
#> tempo                1.40317680    0.3414478066    833.06800793   -0.1080728628
#> valence              0.12897703    0.0006709665     -0.10807286    0.0485241808
plot(prcomp(music_num))

From the covariance matrix and the variance plot, we can see that the variables are on very different scales: duration_ms has a far higher variance than any other variable.

Data with such large differences in scale is not good for clustering because the result is biased. The variable with the largest scale is treated as capturing almost all of the variance, and the other variables are treated as carrying almost no information.

Therefore, we have to do the scaling before doing the clustering.

# scaling 
music_scale <- scale(music_num)
head(music_scale)
#>      popularity acousticness danceability duration_ms      energy
#> [1,] -0.7281633   -0.8120669   1.18338351 -0.68676402  0.39243398
#> [2,]  0.8922610    0.9952284  -0.91807189 -0.85463321  0.04365441
#> [3,]  0.4292826    1.5482508  -1.18344339 -0.20000603 -0.96721519
#> [4,] -0.7281633    1.8882236  -0.05740756 -0.98130709 -0.32285971
#> [5,] -0.2651850   -0.7730834   0.36575240 -0.39918007  1.29098474
#> [6,] -0.4966742   -0.8797895   0.98973403  0.09907949  0.36878791
#>      instrumentalness   liveness   loudness speechiness      tempo    valence
#> [1,]      -0.14734191 -0.6142960  0.7419484  -0.4066667 -0.8049667  0.5852270
#> [2,]      -0.14148543  1.6896371 -0.3512361   0.1514805  1.6107974 -0.7948219
#> [3,]      -0.01999020  0.1895944 -0.4579988  -0.8533817 -1.1201809 -0.9400902
#> [4,]       4.98505960 -0.5712008 -1.3298258   0.2007868  1.6539323 -1.3032610
#> [5,]      -0.14734191 -0.7062875  0.1920585  -0.7922419 -0.5980576  1.3115685
#> [6,]      -0.09772159  3.6869316 -0.1368862   0.6346822  0.5981139  1.2752514
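
As a small illustration of why scaling matters, the sketch below compares variances before and after `scale()` on simulated data. The two columns are made-up stand-ins for a large-scale feature and a 0 to 1 feature; they are not taken from the Spotify dataset.

```r
set.seed(100)
# Simulated stand-ins for two features on very different scales
toy <- data.frame(
  duration_ms = rnorm(200, mean = 210000, sd = 45000),
  energy      = runif(200, min = 0, max = 1)
)

# Raw variances differ by many orders of magnitude
raw_var <- sapply(toy, var)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy)
scaled_var <- apply(toy_scaled, 2, var)

raw_var
scaled_var
```

After scaling, both columns have unit variance, so neither dominates a distance or variance calculation simply because of its unit of measurement.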

Principal Component Analysis (PCA)

PCA summarizes the information (variance) of the initial variables using new dimensions called principal components (PCs). Often only a few PCs are needed to retain most of the information.
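
To make "a few PCs that summarize the information" concrete, here is a rough sketch on simulated data using base R's `prcomp()` rather than FactoMineR. The data, the column names, and the 80% threshold are all assumptions chosen for illustration only.

```r
set.seed(100)
# Simulated data: three correlated columns plus one independent column
n <- 300
z <- rnorm(n)
toy <- data.frame(
  a = z + rnorm(n, sd = 0.3),
  b = z + rnorm(n, sd = 0.3),
  c = -z + rnorm(n, sd = 0.3),
  d = rnorm(n)
)

# prcomp() from base R; scale. = TRUE standardizes the columns first
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of variance carried by each PC, and the running total
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of PCs that keeps at least 80% of the variance
# (the 80% cut-off is an arbitrary choice for this sketch)
n_keep <- which(cum_share >= 0.80)[1]
round(cum_share, 3)
n_keep
```

The same idea applies to a FactoMineR result, where the eigenvalues and cumulative percentages of variance are available in the `eig` element of the fitted object.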

# music_scale is already standardized, so scale.unit is switched off
music_pca <- PCA(music_scale, scale.unit = FALSE, graph = FALSE)

# Visualization
plot.PCA(music_pca,
         choix = "ind",
         select = "contrib 5",
         habillage = 1)

The plot shows outliers at observations 2693, 2097, 406, 1326, and 2820.

Next, we look at the contribution of each variable to the PCs and at the correlations between variables.

plot.PCA(
  x = music_pca,
  choix = "var"
)

fviz_contrib(
  X = music_pca,
  choice = "var",
  axes = 1
)

fviz_contrib(
  X = music_pca,
  choice = "var",
  axes = 2
)

From these visualizations we get the following insights:

  • The two variables that contribute most to PC1: energy & loudness

  • The two variables that contribute most to PC2: danceability & speechiness

  • Variables with high positive correlation:

  1. speechiness & popularity

  2. danceability & popularity

  3. danceability & speechiness

  • Variables with high negative correlation:

  1. duration_ms & speechiness

  2. instrumentalness & speechiness

  3. danceability & tempo

  • Variables with almost no correlation:

  1. acousticness & speechiness
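
Relationships read off a variable plot can be cross-checked against the plain correlation matrix. The helper below, `rank_pairs()`, is a hypothetical function written for this sketch (it is not part of any package), and the data is simulated.

```r
# Hypothetical helper: list every pair of variables ordered by correlation
rank_pairs <- function(df) {
  cm  <- cor(df)
  idx <- which(upper.tri(cm), arr.ind = TRUE)  # each pair once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(out$r, decreasing = TRUE), ]
}

set.seed(100)
n <- 500
base <- rnorm(n)
toy <- data.frame(
  base  = base,
  x_pos = base + rnorm(n, sd = 0.5),   # moves with base
  x_neg = -base + rnorm(n, sd = 0.5),  # moves against base
  x_ind = rnorm(n)                     # unrelated to the others
)

pairs_ranked <- rank_pairs(toy)
pairs_ranked
```

Calling `rank_pairs(music_num)` would give the same kind of table for the track features: strongly positive pairs at the top, strongly negative pairs at the bottom, and near-zero pairs in the middle.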

Clustering

Clustering is grouping data based on its characteristics. Clustering aims to produce clusters where:

  • Observations in the same cluster have similar characteristics.
  • Observations from different clusters have different characteristics.

K Optimum

Before performing cluster analysis, we need to determine the optimal number of clusters. K-means tries to minimize the within-cluster sum of squares (meaning that the distance between observations in the same cluster is minimal). Three methods can be used to find the optimal number of clusters: the elbow method, the silhouette method, and the gap statistic. Here we use the elbow method.

Choosing the number of clusters with the elbow method is somewhat subjective. The rule of thumb is to choose the number of clusters at the “bend”, where the within-cluster sum of squares starts to level off as the number of clusters increases.

RNGkind(sample.kind = "Rounding")
set.seed(100)

fviz_nbclust(music_scale, kmeans, method = "wss")

We take the value of k beyond which adding another cluster no longer reduces the total within-cluster sum of squares by much (the curve flattens), so we take k = 5.
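
The curve behind the elbow plot can also be computed by hand, which makes the "bend" easier to inspect numerically. This is a minimal sketch on simulated data with three obvious groups; the data and the range of k are assumptions for illustration.

```r
set.seed(100)
# Simulated data with three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Relative drop in WSS when moving from k - 1 to k clusters
rel_drop <- -diff(wss) / wss[-length(wss)]
data.frame(k = k_values[-1], relative_drop = round(rel_drop, 2))
```

A large relative drop followed by small ones marks the bend. On real data such as the track features the drops usually shrink gradually, which is why the final choice of k remains a judgment call.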

K-Means Clustering

Here’s the algorithm behind K-Means Clustering:

  1. Assign a number at random, from 1 to K, to each observation. It serves as the initial cluster assignment.
  2. Iterate until the cluster assignments stop changing:
     a. For each of the K clusters, calculate the cluster centroid. The centroid of cluster k is the vector of the p feature means of the observations in cluster k.
     b. Assign each observation to the cluster with the closest centroid (using Euclidean distance or another distance measure).
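
The steps above can be sketched in a few lines of R. `simple_kmeans()` is a hypothetical helper written only to illustrate the description; it is not what `kmeans()` runs internally, since R's `kmeans()` uses the Hartigan-Wong algorithm by default and starts from randomly chosen rows. Re-seeding an empty cluster with a random observation is my own assumption, as the description does not cover that case.

```r
# Hypothetical helper that follows the steps described above
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Step 1: random initial label from 1 to k for every observation
  cluster <- sample(rep_len(seq_len(k), n))
  for (iter in seq_len(max_iter)) {
    # Step 2a: centroid = column means of the cluster's members
    # (assumption: an empty cluster is re-seeded with a random observation)
    centers <- do.call(rbind, lapply(seq_len(k), function(j) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) == 0) x[sample(n, 1), ] else colMeans(members)
    }))
    # Step 2b: squared Euclidean distance to every centroid, pick the nearest
    d2 <- vapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    }, numeric(n))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Stop once no observation changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Simulated data with two groups, only to show the call
set.seed(100)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting labels are random, different seeds can end in different local optima, which is one reason to fix the seed before clustering.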

RNGkind(sample.kind = "Rounding")
set.seed(100)

music_kmeans <- kmeans(music_scale, centers=5)

Profiling

The quality of the grouping can be judged from three values:

  • Within Sum of Squares ($withinss): the sum of squared distances from each observation to the centroid of its cluster.
  • Between Sum of Squares ($betweenss): the sum of squared distances from each centroid to the global mean, weighted by the number of observations in the cluster.
  • Total Sum of Squares ($totss): the sum of squared distances from each observation to the global mean.

The cluster assigned to each track is stored in $cluster:
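
To see how these quantities relate, the sketch below recomputes them by hand on simulated data and compares them with what `kmeans()` reports. The toy data and object names are assumptions for illustration.

```r
set.seed(100)
# Simulated data with two groups
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Total: squared distances from every observation to the global mean
global_mean  <- colMeans(toy)
totss_manual <- sum(sweep(toy, 2, global_mean)^2)

# Within: squared distances to the own centroid, one value per cluster
withinss_manual <- sapply(1:2, function(j) {
  members <- toy[fit$cluster == j, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[j, ])^2)
})

# Between: squared centroid-to-global-mean distances, weighted by cluster size
betweenss_manual <- sum(fit$size * rowSums(sweep(fit$centers, 2, global_mean)^2))

c(total = fit$totss, within = fit$tot.withinss, between = fit$betweenss)
fit$betweenss / fit$totss
```

The total always splits into within plus between, so the ratio `betweenss / totss` is a quick summary: the closer it is to 1, the more of the variation is explained by the cluster structure.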

music_kmeans$cluster
#>    [1] 3 1 4 4 3 3 4 2 4 2 2 2 2 2 2 2 2 3 3 2 3 2 2 2 3 2 2 2 3 1 1 1 4 1 1 1 1
#>   [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 5 1 1 1 1
#>   [75] 1 1 1 2 3 3 1 1 1 1 2 1 1 2 4 1 1 1 1 5 5 4 1 1 5 1 2 1 3 3 1 3 4 4 4 1 3
#>  [112] 3 4 1 5 2 1 1 1 3 4 1 1 2 4 3 5 3 2 2 3 1 3 3 4 2 3 4 1 1 2 3 1 3 3 1 3 1
#>  [149] 3 2 1 5 3 3 2 2 4 2 2 2 1 2 2 2 3 2 3 3 2 2 2 3 3 3 3 5 1 2 2 3 3 3 4 2 4
#>  [186] 2 2 1 3 3 2 4 2 2 3 4 4 2 2 5 4 2 4 2 4 5 3 2 3 2 5 4 3 2 2 4 3 5 3 4 2 3
#>  [223] 1 2 3 3 2 1 3 3 5 2 2 3 3 2 3 2 2 3 2 3 2 3 1 3 5 3 4 1 3 4 2 5 4 2 4 3 3
#>  [260] 5 2 2 3 2 5 2 3 3 3 4 5 2 3 4 2 3 2 2 2 3 2 2 3 2 3 3 3 3 3 2 2 2 3 3 3 3
#>  [297] 3 3 3 3 2 3 2 4 3 1 5 3 2 2 2 5 3 2 2 3 2 3 4 3 5 4 3 3 3 2 4 2 3 2 2 3 4
#>  [334] 2 4 2 3 3 2 3 4 5 5 2 2 3 2 2 3 3 3 2 2 4 5 3 3 3 3 2 2 3 3 3 2 3 4 3 3 3
#>  [371] 3 5 3 3 2 2 2 2 2 3 2 4 3 3 3 3 5 3 3 5 3 2 2 2 4 3 3 2 2 3 3 2 3 3 3 3 2
#>  [408] 3 3 2 2 3 3 3 3 3 2 2 3 3 3 3 3 3 2 2 3 3 3 2 3 2 3 4 3 3 2 3 3 2 3 3 2 2
#>  [445] 3 2 3 3 3 2 4 3 3 4 2 2 3 3 3 3 1 2 2 3 3 3 3 4 2 2 2 4 3 2 2 2 4 4 4 2 4
#>  [482] 4 3 5 4 3 4 4 2 4 2 2 4 2 3 4 4 2 2 4 4 2 2 4 4 4 4 2 4 2 4 2 2 2 1 1 1 5
#>  [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 5 1 5 1 4 5 1 1 1 1 5 1 2 1 4 1 1 5 5 5
#>  [556] 5 1 5 5 1 5 5 5 4 1 1 1 5 5 4 1 5 2 2 5 5 5 2 5 3 1 1 5 4 5 5 4 5 5 5 5 5
#>  [593] 5 3 5 5 5 5 1 5 1 3 5 2 4 5 3 2 5 2 5 5 2 5 5 5 5 5 5 5 4 5 4 5 2 5 5 5 3
#>  [630] 5 5 5 5 2 5 5 5 5 5 2 5 1 5 5 4 5 3 5 5 1 5 2 5 5 5 2 5 3 5 5 5 5 4 5 5 5
#>  [667] 5 5 4 4 5 3 4 1 5 3 5 3 5 4 5 3 5 4 5 5 4 3 5 5 2 5 5 5 5 2 5 5 5 4 5 5 5
#>  [704] 5 5 5 5 3 3 3 5 2 3 3 3 3 5 5 5 3 5 4 5 5 5 5 5 4 4 5 2 5 5 5 5 5 4 5 4 5
#>  [741] 2 5 3 5 4 3 5 3 4 2 5 5 5 5 5 3 5 5 5 4 4 4 3 5 2 3 5 5 4 4 3 2 5 4 5 2 4
#>  [778] 5 5 5 5 5 5 4 2 2 5 5 5 5 2 4 2 5 5 2 5 4 3 3 2 2 3 5 5 5 5 2 5 5 2 2 4 5
#>  [815] 3 5 3 3 4 3 2 4 2 2 5 3 3 3 5 2 5 5 5 3 4 5 2 2 5 3 3 5 5 3 3 5 4 2 4 3 2
#>  [852] 5 3 3 3 3 1 3 3 1 5 3 3 3 3 3 2 3 3 3 5 3 3 4 3 3 3 3 5 3 3 2 2 3 3 5 1 3
#>  [889] 3 3 3 3 3 3 2 3 4 3 5 3 2 3 5 3 3 3 3 3 3 2 5 3 5 3 4 2 4 4 4 5 4 2 4 3 3
#>  [926] 2 4 4 4 4 3 2 2 2 2 4 3 3 4 4 4 4 2 2 2 2 3 4 2 3 2 3 2 3 4 5 2 2 3 2 3 2
#>  [963] 4 3 4 2 2 4 4 2 2 4 2 2 3 3 3 3 3 2 3 2 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2
#> [1000] 3 2 3 3 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [1037] 1 1 1 1 5 1 1 1 1 1 1 1 5 1 5 1 1 1 5 1 1 4 5 1 1 1 5 5 1 1 1 1 5 2 5 1 2
#> [1074] 1 5 2 5 5 5 5 5 1 1 5 5 2 2 5 1 1 1 5 5 1 4 5 1 1 1 1 1 1 1 3 5 3 4 2 3 4
#> [1111] 4 5 3 2 5 5 5 5 5 4 2 2 1 1 4 4 1 4 5 1 5 3 5 5 5 5 5 4 3 2 5 1 5 5 3 4 1
#> [1148] 3 2 5 5 5 5 5 5 5 5 3 2 4 5 5 4 1 5 1 5 5 5 2 5 3 4 2 3 5 2 5 2 5 5 4 4 2
#> [1185] 5 5 4 5 5 2 5 5 5 2 5 2 2 5 5 3 5 5 3 5 3 1 3 5 3 5 5 5 2 4 4 5 5 3 4 3 5
#> [1222] 2 4 5 2 5 4 2 4 5 4 5 5 3 2 5 1 5 5 5 5 2 5 4 5 5 4 5 3 2 5 5 5 2 2 2 5 1
#> [1259] 5 5 5 5 2 2 4 5 5 5 2 5 5 5 5 5 5 5 4 4 5 3 5 5 2 5 5 4 5 2 5 5 2 5 5 5 5
#> [1296] 5 5 4 5 3 5 5 5 4 2 2 2 5 5 2 5 5 5 4 5 5 4 2 3 5 5 2 4 5 5 5 4 4 1 5 5 3
#> [1333] 5 4 5 3 5 3 5 4 3 2 5 3 5 5 2 4 5 2 5 5 5 5 5 3 5 5 3 2 4 5 2 2 2 5 2 4 5
#> [1370] 2 4 5 5 2 5 5 3 5 4 3 5 5 4 5 5 2 5 3 2 2 2 5 5 3 3 5 4 2 5 4 5 5 4 3 4 4
#> [1407] 5 5 5 2 4 2 2 5 3 5 3 5 4 2 3 3 5 4 5 5 5 5 2 2 3 5 5 5 5 2 5 4 4 5 4 5 3
#> [1444] 2 3 2 3 4 3 4 3 3 2 4 2 1 5 3 2 3 2 3 5 5 2 4 2 3 3 2 5 5 4 3 2 5 3 2 5 5
#> [1481] 5 5 5 3 5 2 5 2 2 3 3 3 5 5 5 3 2 3 2 4 3 5 4 2 3 3 2 2 2 3 5 3 5 4 3 3 4
#> [1518] 5 1 2 2 2 4 1 3 2 4 2 2 4 2 1 3 2 4 2 3 4 3 2 3 4 5 3 4 4 4 3 2 5 4 4 4 2
#> [1555] 2 2 4 2 2 5 2 2 5 2 4 3 5 5 2 2 3 3 4 2 3 2 4 2 2 3 3 4 3 2 3 2 4 2 3 4 2
#> [1592] 4 2 5 4 3 4 2 4 2 4 2 2 4 2 4 2 1 1 1 1 1 1 1 1 1 4 1 1 1 4 1 1 1 1 1 1 1
#> [1629] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 4 1 1 5 1 1 1 1 5 1 1
#> [1666] 1 1 1 5 1 1 1 1 1 1 2 4 5 1 1 1 1 1 3 5 2 1 1 1 5 1 5 4 1 1 4 1 1 1 1 5 4
#> [1703] 1 4 5 1 4 4 1 3 1 4 1 1 1 1 1 1 3 1 3 3 1 4 1 1 1 5 5 1 5 2 4 1 1 4 1 3 2
#> [1740] 2 1 4 4 5 5 4 4 4 4 1 1 2 1 1 1 1 1 4 1 1 1 5 1 4 2 5 4 4 3 4 1 1 5 1 4 2
#> [1777] 4 3 4 1 5 1 2 2 1 1 1 4 5 2 5 5 5 1 5 4 3 2 4 2 5 4 1 4 1 4 2 4 5 3 4 5 1
#> [1814] 2 1 3 5 4 3 4 4 3 2 5 5 2 5 1 3 4 1 2 4 5 3 3 4 5 1 2 3 3 4 2 4 1 1 5 3 5
#> [1851] 3 4 5 5 3 4 4 2 2 1 3 1 1 5 3 4 5 1 2 2 5 1 4 2 4 4 4 2 3 5 3 5 4 4 3 3 2
#> [1888] 2 2 1 4 5 4 1 1 2 2 1 5 1 4 4 4 3 5 2 1 1 2 4 3 5 2 5 2 4 5 2 3 2 5 4 4 2
#> [1925] 3 2 2 2 5 5 2 5 4 5 4 2 5 4 2 2 5 4 3 2 2 2 2 5 5 2 5 4 2 2 4 5 5 2 5 2 5
#> [1962] 3 4 5 5 1 1 1 5 3 2 1 3 5 1 4 2 1 1 3 4 3 3 3 4 2 5 4 5 3 2 3 5 3 1 1 4 5
#> [1999] 2 3 2 5 3 2 2 2 3 4 2 5 2 2 2 2 2 2 3 4 5 5 2 5 4 5 5 2 2 4 5 4 1 5 3 4 5
#> [2036] 2 2 3 5 4 4 2 4 2 5 5 5 2 3 2 2 2 1 4 5 4 5 5 2 1 3 5 2 4 3 3 3 3 3 3 4 5
#> [2073] 2 2 3 3 4 2 4 3 4 1 1 3 1 4 2 2 1 5 2 4 2 3 5 3 3 5 4 2 5 1 3 2 3 5 2 5 3
#> [2110] 5 3 3 2 3 5 2 1 5 2 2 2 4 2 2 3 5 4 3 4 3 5 4 3 4 2 3 3 3 2 2 5 3 3 2 5 4
#> [2147] 2 1 3 3 2 2 4 3 2 2 4 5 4 1 2 2 5 4 5 1 5 2 1 4 2 3 2 5 3 2 3 2 5 2 3 2 4
#> [2184] 3 5 2 4 2 1 3 3 3 3 4 4 3 5 3 2 4 3 3 3 2 4 3 5 3 1 3 5 5 2 4 3 2 5 5 3 4
#> [2221] 5 4 2 4 5 4 4 4 5 3 2 5 3 2 3 3 5 4 2 2 2 3 3 2 5 3 5 5 5 3 4 5 2 4 3 5 5
#> [2258] 4 5 3 3 4 3 5 5 4 3 3 5 5 3 5 2 2 2 3 3 5 3 2 2 3 2 2 4 3 4 2 2 2 5 3 5 4
#> [2295] 4 3 2 4 3 3 3 3 2 3 2 1 2 2 2 3 3 5 1 2 2 4 3 4 2 3 5 4 3 5 2 3 3 2 3 5 3
#> [2332] 5 2 2 4 3 2 3 2 3 4 2 4 3 3 5 4 1 5 5 5 5 5 3 2 3 3 3 2 4 1 3 3 3 2 4 2 3
#> [2369] 5 2 2 4 2 5 2 4 2 2 4 5 5 3 4 3 2 3 3 3 2 4 3 4 3 3 5 3 5 3 3 3 3 2 3 2 2
#> [2406] 3 1 3 3 2 3 2 3 2 5 5 5 3 5 2 2 2 3 5 3 3 2 2 5 5 5 2 5 5 2 4 3 4 5 4 2 4
#> [2443] 2 5 5 2 3 4 3 2 2 4 5 5 2 2 1 4 2 5 4 2 5 4 2 2 4 2 5 3 2 2 5 3 1 4 3 3 3
#> [2480] 4 3 4 4 3 2 4 3 2 3 2 2 2 2 4 3 3 2 3 5 5 5 2 3 4 2 2 2 2 2 3 2 3 5 3 3 3
#> [2517] 2 2 3 3 2 2 3 4 3 4 4 3 3 4 3 4 3 5 2 5 4 3 3 3 3 2 5 2 5 4 2 2 4 2 5 3 3
#> [2554] 2 3 3 5 5 2 3 3 3 2 4 3 4 3 3 3 2 3 3 3 4 2 5 3 4 5 3 2 2 3 2 3 2 5 2 2 2
#> [2591] 3 3 2 5 3 3 2 3 3 3 3 3 3 2 2 2 2 3 3 1 3 2 4 1 2 2 3 3 3 3 3 5 3 3 4 2 2
#> [2628] 4 3 3 3 3 2 5 5 3 3 4 2 3 3 2 3 3 3 2 2 2 5 2 2 2 2 3 3 3 3 3 2 4 5 3 3 3
#> [2665] 2 3 2 2 3 2 2 3 3 3 3 2 2 3 3 3 3 3 2 4 3 2 3 2 1 3 3 3 3 4 3 3 5 2 2 2 2
#> [2702] 3 2 2 3 5 2 4 3 5 2 5 3 3 4 3 3 4 3 2 3 4 4 3 2 1 2 2 3 2 2 2 2 2 3 3 3 3
#> [2739] 3 3 4 4 2 3 3 3 2 5 4 4 3 4 2 2 3 2 3 5 3 3 4 3 4 3 3 3 3 3 4 3 3 5 3 2 2
#> [2776] 3 2 2 4 3 3 3 3 3 4 5 3 3 3 2 3 3 3 3 4 3 3 3 2 3 3 3 4 3 3 2 3 1 4 5 3 2
#> [2813] 2 5 5 3 3 3 5 2 4 4 2 3 3 3 5 3 3 3 2 3 3 2 3 3 3 3 3 3 5 4 5 3 3 3 2 2 4
#> [2850] 2 3 3 5 3 2 3 3 2 2 2 3 4 3 2 2 3 3 3 2 3 3 3 2 2 3 3 3 3 2 3 3 3 5 3 3 5
#> [2887] 3 1 1 1 1 1 1 1 1 5 1 1 1 2 2 5 1 1 2 2 1 3 1 4 5 3 1 5 3 3 1 5 3 5 3 3 5
#> [2924] 3 1 3 4 3 3 1 1 2 3 2 3 3 3 5 3 2 5 2 3 5 3 3 4 3 3 3 2 3 3 1 3 3 1 1 3 3
#> [2961] 5 3 3 3 2 3 3 3 3 5 3 3 3 4 5 3 1 2 3 5 3 3 5 2 3 5 5 3 5 3 3 3 3 3 3 3 3
#> [2998] 3 3 3 3 5 3 5 5 3 5 5 3 3 3 3 2 3 3 4 4 2 3 3 2 3 3 4 4 3 3 3 3 4 4 2 3 4
#> [3035] 1 4 1 4 2 4 4 1 4 2 1 4 2 5 5 2 4 2 4 4 4 1 1 5 1 4 5 4 2 4 5 5 2 4 4 3 2
#> [3072] 4 4 5 1 4 2 2 4 4 3 2 4 2 4 4 2 4 4 4 1 5 2 4 2 2 4 4 5 2 3 3 4 4 3 2 4 5
#> [3109] 4 2 2 2 4 4 2 4 2 5 4 4 3 3 3 3 2 3 4 3 3 2 2 5 4 3 5 3 2 4 3 3 4 4 3 2 3
#> [3146] 5 2 2 3 3 2 2 3 4 3 3 3 3 2 5 4 5 3 2 4 2 3 3 4 5 3 2 2 3 3 2 2 2 5 4 3 3
#> [3183] 3 2 2 3 2 5 2 3 3 3 4 2 3 3 2 3 3 3 3 2 3 2 3 3 1 4 3 2 1 1 1 2 4 2 2 1 2
#> [3220] 4 4 1 4 4 2 3 2 3 3 2 1 4 4 2 2 3 3 4 2 4 2 1 2 2 3 2 2 2 3 2 2 3 1 3 2 2
#> [3257] 3 2 4 3 4 2 3 4 4 4 2 3 2 4 2 3 2 4 2 3 3 3 4 4 3 4 4 4 4 4 3 4 3 2 3 3 4
#> [3294] 3 3 2 2 4 3 3 2 2 2 3 3 3 3 3 2 4 3 2 2 3 4 4 3 4 5 2 4 4 3 4 2 4 2 3 2 4
#> [3331] 2 3 2 3 2 2 2 3 3 2 2 2 3 2 2 3 3 2 4 2 3 3 3 3 2 4 2 4 2 2 4 5 2 3 2 2 2
#> [3368] 4 5 3 3 3 3 4 2 2 2 2 3 3 3 2 3 4 2 2 2 2 4 2 2 3 2 2 2 2 2 3 2 4 2 4 3 4
#> [3405] 2 2 2 3 4 2 3 3 4 3 2 4 3 2 4 3 3 2 4 2 2 3 3 2 3 2 4 4 5 4 4 2 2 2 4 3 3
#> [3442] 4 2 3 2 3 3 2 3 3 2 4 2 4 3 3 2 2 2 2 2 4 3 2 2 2 2 4 4 4 2 2 2 2 2 4 2 4
#> [3479] 2 2 3 4 3 2 2 5 3 3 3 2 3 2 2 2 2 2 4 4 4 4 3 4 4 5 3 4 4 4 5 4 4 4 2 4 2
#> [3516] 2 4 2 4 3 4 2 2 4 2 4 3 2 5 3 2 2 4 2 2 4 3 4 2 4 3 2 3 3 2 4 3 4 2 4 3 4
#> [3553] 4 2 4 3 3 3 3 2 4 4 3 4 2 3 2 2 3 4 4 4 4 3 4 4 4 4 4 3 3 4 2 2 4 2 3 2 2
#> [3590] 5 3 4 2

Interpretation/Cluster Profiling

# add the clustering results to the data
music_num$cluster <- music_kmeans$cluster
popular_music$cluster <- music_kmeans$cluster

# profile the clusters by summarizing the data
music_centroid <- music_num %>% 
  group_by(cluster) %>%  # group by cluster
  summarise_all(mean)    # mean of every variable per cluster

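music_centroid

# radar chart of the clusters (ggiraphExtra is already loaded above)
ggRadar(
  data = music_num,
  mapping = aes(colours = cluster),
  interactive = TRUE
)

# for each variable, find the clusters with the lowest and highest mean
music_centroid %>% 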
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(
    kelompok_min = which.min(value),
    kelompok_max = which.max(value))

Cluster 1: Songs that tend to be popular and have a short duration.

Cluster 2: Songs with a long duration but low acousticness, danceability, speechiness, valence, and popularity.

Cluster 3: Songs with high energy, liveness, loudness, and valence.

Cluster 4: Songs with high instrumentalness but low energy, liveness, tempo, and loudness.

Cluster 5: Songs with a fast tempo and lots of lyrics that are suitable for dancing.

Case

popular_music %>% 
  filter(popularity >= 80) %>% head(5)

Filtering by popularity returns, among others, the track 7 rings by Ariana Grande, which has a high popularity score. According to the clustering results, this track is in cluster 3 and in the Dance genre.

For example, suppose you chose the genre “Alternative”. What music would we suggest next?

popular_music %>% 
  filter(cluster == 1 & genre == "Dance") %>% head(5)

The results show tracks that share a cluster with 7 rings and so have something in common with it:

  • break up with your girlfriend, I’m bored
  • Without Me
  • NASA
  • thank you, next
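
The lookup above can be wrapped in a small function. `recommend_tracks()` is a hypothetical helper, and ordering the suggestions by popularity is my own assumption rather than part of the analysis. The toy data frame is invented just to show the call.

```r
# Hypothetical helper: suggest other tracks from the same cluster and genre
recommend_tracks <- function(tracks, track, n = 5) {
  stopifnot(track %in% tracks$track_name)
  # Use the first matching row as the reference track
  ref <- tracks[tracks$track_name == track, , drop = FALSE][1, ]
  same <- tracks$cluster == ref$cluster &
    tracks$genre == ref$genre &
    tracks$track_name != track
  out <- tracks[same, , drop = FALSE]
  # Assumption: show the most popular candidates first
  head(out[order(out$popularity, decreasing = TRUE), ], n)
}

# Made-up toy data, only to show the call
toy <- data.frame(
  track_name = c("song_a", "song_b", "song_c", "song_d", "song_e"),
  genre      = c("Dance", "Dance", "Dance", "Rock", "Dance"),
  popularity = c(90, 85, 80, 95, 70),
  cluster    = c(1, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)
suggestions <- recommend_tracks(toy, "song_a")
suggestions
```

With the clustered data, a call such as `recommend_tracks(popular_music, "7 rings")` would return tracks from the same cluster and genre as 7 rings.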

Conclusion

From the unsupervised learning analysis above, it can be concluded that:

Dimensionality reduction can be applied to this dataset. To reduce the dimensions, we can keep some of the 11 PCs (one per numeric variable), depending on how much of the total information we want to retain.

Using K-Means, we can group the tracks into five clusters, as below:

Cluster 1: Songs that tend to be popular and have a short duration.

Cluster 2: Songs with a long duration but low acousticness, danceability, speechiness, valence, and popularity.

Cluster 3: Songs with high energy, liveness, loudness, and valence.

Cluster 4: Songs with high instrumentalness but low energy, liveness, tempo, and loudness.

Cluster 5: Songs with a fast tempo and lots of lyrics that are suitable for dancing.