Spotify is a popular music streaming platform with millions of tracks across many genres. Each track carries audio attributes such as danceability, acousticness, and energy. Analyzing the Spotify Tracks dataset with unsupervised learning techniques lets us discover hidden relationships, group similar tracks, and gain insight into the characteristics and patterns of the songs available on the platform. The dataset used in this project is the Spotify Recommendation dataset from Kaggle.
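Based on the attach messages in the original output, the analysis loads these packages:

library(tidyverse)   # wrangling and ggplot2
library(factoextra)  # PCA and clustering visualizations
library(GGally)      # correlation matrix plot
library(scales)      # scale/label helpers
library(cowplot)     # combining plots
library(plotly)      # interactive plots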
The dataset was obtained from Kaggle, published by Zaheen Hamidani, and describes Spotify tracks.
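The import chunk is not shown in the original; a minimal sketch, where the file name is an assumption:

spotify <- read.csv("spotify.csv")  # file name is an assumption
glimpse(spotify)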
## Rows: 195
## Columns: 14
## $ danceability <dbl> 0.803, 0.762, 0.261, 0.722, 0.787, 0.778, 0.666, 0.92…
## $ energy <dbl> 0.6240, 0.7030, 0.0149, 0.7360, 0.5720, 0.6320, 0.589…
## $ key <int> 7, 10, 1, 3, 1, 8, 0, 7, 7, 3, 9, 1, 1, 6, 8, 5, 8, 8…
## $ loudness <dbl> -6.764, -7.951, -27.528, -6.994, -7.516, -6.415, -8.4…
## $ mode <int> 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ speechiness <dbl> 0.0477, 0.3060, 0.0419, 0.0585, 0.2220, 0.1250, 0.324…
## $ acousticness <dbl> 4.51e-01, 2.06e-01, 9.92e-01, 4.31e-01, 1.45e-01, 4.0…
## $ instrumentalness <dbl> 7.34e-04, 0.00e+00, 8.97e-01, 1.18e-06, 0.00e+00, 0.0…
## $ liveness <dbl> 0.1000, 0.0912, 0.1020, 0.1230, 0.0753, 0.0912, 0.114…
## $ valence <dbl> 0.6280, 0.5190, 0.0382, 0.5820, 0.6470, 0.8270, 0.776…
## $ tempo <dbl> 95.968, 151.329, 75.296, 89.860, 155.117, 140.951, 74…
## $ duration_ms <int> 304524, 247178, 286987, 208920, 179413, 224029, 14605…
## $ time_signature <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ liked <int> 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,…
The dataset contains 195 observations of 14 variables; we will keep only the variables needed for clustering.
spotify <- spotify %>%
  select(-c(key, mode, time_signature, liked, loudness, tempo, duration_ms))
glimpse(spotify)

## Rows: 195
## Columns: 7
## $ danceability <dbl> 0.803, 0.762, 0.261, 0.722, 0.787, 0.778, 0.666, 0.92…
## $ energy <dbl> 0.6240, 0.7030, 0.0149, 0.7360, 0.5720, 0.6320, 0.589…
## $ speechiness <dbl> 0.0477, 0.3060, 0.0419, 0.0585, 0.2220, 0.1250, 0.324…
## $ acousticness <dbl> 4.51e-01, 2.06e-01, 9.92e-01, 4.31e-01, 1.45e-01, 4.0…
## $ instrumentalness <dbl> 7.34e-04, 0.00e+00, 8.97e-01, 1.18e-06, 0.00e+00, 0.0…
## $ liveness <dbl> 0.1000, 0.0912, 0.1020, 0.1230, 0.0753, 0.0912, 0.114…
## $ valence <dbl> 0.6280, 0.5190, 0.0382, 0.5820, 0.6470, 0.8270, 0.776…
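Next, check for missing values. The original chunk is not shown; colSums(is.na()) reproduces the counts below:

colSums(is.na(spotify))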
## danceability energy speechiness acousticness
## 0 0 0 0
## instrumentalness liveness valence
## 0 0 0
There is no missing data, so we can continue to scaling the data for k-means clustering.
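A minimal scaling step; the object name spotify_scaled is an assumption carried through the later sketches:

spotify_scaled <- scale(spotify)  # z-score standardization of the 7 features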
First, we need to check the correlation between the variables.
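The plotting chunk is not shown; the already-loaded GGally package offers one way to draw the correlation heatmap:

ggcorr(spotify, label = TRUE, label_round = 2)  # pairwise correlations with labels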
There is a strong correlation between some of the variables, most notably between valence and danceability.
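The PCA chunk is not shown; a sketch that reproduces the summary below, using the scaled data:

spotify_pca <- prcomp(spotify_scaled)  # PCA on the standardized features
summary(spotify_pca)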
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6986 1.2490 0.9606 0.8953 0.66258 0.47836 0.4030
## Proportion of Variance 0.4122 0.2229 0.1318 0.1145 0.06272 0.03269 0.0232
## Cumulative Proportion 0.4122 0.6351 0.7669 0.8814 0.94411 0.97680 1.0000
From the summary, I want to retain roughly 80% of the information in the data, so PC1 through PC4 will be kept (their cumulative proportion of variance is about 88%).
Visualize the individual factor plot (the projection of each track onto the first two principal components).
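The plotting chunk is not shown; factoextra's individual plot is one way to reproduce it, with repel = TRUE matching the ggrepel warning below:

fviz_pca_ind(spotify_pca, repel = TRUE)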
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Based on the plot, there are no outliers.
The insight is that the three variables contributing the most to PC1, based on the correlation between the variables and PC1, are instrumentalness, danceability, and valence.
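One way to confirm this with factoextra (the original chunk is not shown):

fviz_contrib(spotify_pca, choice = "var", axes = 1)  # variable contributions to PC1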
In order to perform cluster analysis, it is important to determine the ideal number of clusters beforehand. Clustering minimizes the total within-cluster sum of squares, which means the distance between observations within the same cluster is minimized. A common method for determining the optimal number of clusters is the elbow method.
The selection of the number of clusters using the elbow method is subjective. The general guideline is to choose the number of clusters at the point where the graph of the total within sum of squares starts to level off, resembling the shape of an elbow.
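The elbow-plot chunk is not shown; a minimal sketch with factoextra:

fviz_nbclust(spotify_scaled, kmeans, method = "wss")  # total within sum of squares vs. number of clusters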
A point is chosen where the reduction in the total within sum of squares is no longer significant (the elbow point). Based on the elbow plot, we choose k = 2.
Perform k-means clustering with the k chosen by the elbow method and store the results in the ws_kmeans object.
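A sketch of the clustering call. The RNGkind(sample.kind = "Rounding") line matches the warning below; the seed value is an assumption:

RNGkind(sample.kind = "Rounding")  # legacy sampler; triggers the warning below
set.seed(100)                      # seed value is an assumption
ws_kmeans <- kmeans(spotify_scaled, centers = 2)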
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
Return the cluster label for each observation to the initial data from before scaling (no outliers had to be removed, as the earlier plot showed none).
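A minimal sketch of attaching the labels:

spotify$cluster <- as.factor(ws_kmeans$cluster)  # one cluster label per track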
Visualize the clusters.
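The plotting chunk is not shown; factoextra's cluster plot is one way to reproduce it:

fviz_cluster(ws_kmeans, data = spotify_scaled)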
Using PCA, the three variables that contribute the most to PC1, based on the correlation between the variables and PC1, are instrumentalness, danceability, and valence.
Using k-means clustering, 2 clusters are obtained, namely: