Spotify

1. Introduction

Business Introduction

Spotify is a digital music streaming service that offers millions of songs and podcasts from across the world. The firm, which was founded in Sweden in 2006, has swiftly grown to become one of the most popular music platforms, with over 365 million monthly active users and 165 million subscribers in over 170 countries. Spotify has altered the way people listen to music with its easy user interface and tailored suggestions, and has emerged as a prominent competitor in the highly competitive music market.

In the age of big data, Spotify learns listeners' tastes by asking questions when they first log in, such as which music genres they like, and uses machine learning algorithms to recommend songs daily and weekly. Listening history is also collected to identify each user's preferences.

How Spotify sorts music into broad genres is of particular interest. What are the distinguishing features of each genre, and how are they used to classify tracks? We explore these questions in this project.

Goals

In the data set used here, each track is described by a genre label, a popularity score, and a set of audio features such as acousticness, danceability, and energy. These variables will be the focus of the next sections.

The goals of this project are to:

  • Understand the relationships between the variables
  • Discover patterns in the audio features across genres
  • Discover which characteristics make a song popular

To achieve these goals, we will:

  • Carry out exploratory data analysis (EDA) to examine the relationships between variables
  • Study the correlation between features
  • Build a K-Means clustering model

About the Data

The data set is publicly available here

Basics of Unsupervised Learning

What is unsupervised learning?

  • ❌ It does not have a target variable/label.
  • 🎯 The goal is to identify patterns in the data, which is useful for generating information. It is used in the pre-processing and exploratory data analysis (EDA) stages, for example for dimension reduction, clustering, or anomaly detection.
  • 🚫 The model cannot be evaluated against a “ground truth”, because there is no actual label.

In this project, we use K-Means as a simple unsupervised learning model.

The k-means algorithm partitions the data into K groups, where K is chosen by the user. It is a flat clustering method, meaning that the groups are not nested in a hierarchy: every group sits at the same level as the others.

K-means workflow:

  • Business questions
  • Read data
  • Data cleansing: check that all columns are numeric and handle NA values
  • EDA: check whether the variables share the same scale
  • Data preprocessing: scale the variables if their scales differ
  • Determine the optimum k:
    • Business needs
    • Elbow method
  • Create the clusters with the function kmeans(data, centers = k)
  • Combine the cluster label column with the initial data
  • Profile the clusters to understand the characteristics of each one; the profiles can be used for product recommendations, etc. (if needed)
  • Addendum: the clustering can be visualized with biplots to make profiling easier
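The workflow above can be sketched end to end on a small toy data set. This is only an illustration in base R: the data frame `toy`, its values, and the choice of k = 2 are assumptions made for the example, not part of the Spotify analysis.

```r
# Toy data: two numeric features on very different scales (hypothetical values)
set.seed(1)
toy <- data.frame(
  tempo  = c(rnorm(20, mean = 90, sd = 5), rnorm(20, mean = 150, sd = 5)),
  energy = c(rnorm(20, mean = 0.3, sd = 0.05), rnorm(20, mean = 0.8, sd = 0.05))
)

# Data cleansing: every column must be numeric and free of NA
stopifnot(all(sapply(toy, is.numeric)), !anyNA(toy))

# Preprocessing: scale, because tempo and energy have different ranges
toy_scaled <- scale(toy)

# Clustering with a chosen k (k = 2 because the toy data was built with two groups)
toy_km <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Combine the cluster label with the initial data
toy$cluster <- as.factor(toy_km$cluster)

# Cluster profiling: mean of each feature per cluster
profile <- aggregate(cbind(tempo, energy) ~ cluster, data = toy, FUN = mean)
print(profile)
```

The profile table has one row per cluster, which is the kind of summary the profiling step relies on.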

2. Preparation

Import Library

library(dplyr) # data wrangling
library(tidyr) # data tidying
library(factoextra) # for fviz_nbclust(), fviz_contrib(), fviz_pca_biplot()
library(FactoMineR) # for PCA()
library(ggiraphExtra) # for cluster profiling visualization
library(GGally) # for ggcorr()
library(ggplot2) # plotting
library(plotly) # interactive plots

Read Data

spotify <- read.csv("data_input/SpotifyFeatures.csv", stringsAsFactors = T,encoding = "UTF-8")
head(spotify)
dim(spotify)
#> [1] 232725     18

Insight :

  • The spotify data set has 232,725 rows and 18 columns.
  • Each row is a track (track_name) with its genre, its popularity, and audio features from acousticness, danceability, and duration_ms through valence.

glimpse(spotify)
#> Rows: 232,725
#> Columns: 18
#> $ genre            <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi…
#> $ artist_name      <fct> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name       <fct> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id         <fct> 0BRjO6ga9RKCKjfDqeFgWV, 0BjC1NfoEOOusryehmNudP, 0CoSD…
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key              <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G, …
#> $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode             <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
#> $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature   <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4…
#> $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…

Description of variables/columns :

  • genre: the genre of the track.
  • artist_name: the name of the artist.
  • track_name: the title of the track.
  • track_id: the ID of the track.
  • popularity: the popularity rating of the track, from 0 to 100.
  • acousticness: a confidence measure from 0 to 1 of whether the track is acoustic.
  • danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythmic stability, beat strength, and overall regularity. A value of 0 is least danceable and 1 is most danceable.
  • duration_ms: the duration of the track in milliseconds.
  • energy: a perceptual measure of intensity and activity from 0 to 1. Typically, energetic tracks feel fast, loud, and noisy. Values close to 1 indicate a more energetic track, while values close to 0 indicate a calmer, softer one.
  • instrumentalness: predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context, while rap or spoken-word tracks are clearly vocal. The closer the value is to 1, the more likely the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.
  • key: the estimated overall key of the track. Spotify encodes the key as an integer using standard Pitch Class notation, for example 0 = C, 1 = C♯ / D♭, 2 = D, and so on, with -1 when no key is detected. In this data set the key is stored as a pitch name (C, C#, D, and so on).
  • liveness: detects the presence of an audience in the recording. Higher liveness values indicate an increased probability that the track was performed live. Values above 0.8 indicate a strong likelihood that the track is live.
  • loudness: the overall loudness of the track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks.
  • mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Spotify encodes major as 1 and minor as 0; in this data set the column holds the labels Major and Minor.
  • speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk shows, audiobooks, poetry), the closer the value is to 1. Values above 0.66 describe tracks that are probably made up entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including cases such as rap music. Values below 0.33 most likely represent music and other non-speech tracks.
  • tempo: the estimated overall tempo of the track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a piece and is derived directly from the average beat duration.
  • time_signature: an estimate of the overall time signature of a track. The time signature (meter) is a notational convention that specifies how many beats are in each bar (or measure).
  • valence: a measure from 0 to 1 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, excited), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
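The speechiness thresholds described above (0.33 and 0.66) can be turned into a categorical label with `cut()`. This is a minimal sketch on made-up values: the vector `speechiness`, the label names, and the treatment of values exactly on a threshold are assumptions for illustration.

```r
# Hypothetical speechiness values, including the two thresholds themselves
speechiness <- c(0.04, 0.33, 0.50, 0.66, 0.95)

# Intervals are [0, 0.33), [0.33, 0.66), [0.66, 1]; boundary handling is an assumption
speech_type <- cut(
  speechiness,
  breaks = c(0, 0.33, 0.66, 1),
  labels = c("music", "mixed", "spoken"),
  right = FALSE,          # intervals are closed on the left
  include.lowest = TRUE   # with right = FALSE this keeps the value 1 in the last bin
)

print(data.frame(speechiness, speech_type))
```

The same pattern applies to the other thresholded features, such as liveness above 0.8 or instrumentalness above 0.5.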

EDA(Exploratory Data Analysis)

Feature Engineering

Because this project focuses on popularity, we start with a summary of the popularity variable.

range(spotify$popularity)
#> [1]   0 100
summary(spotify$popularity)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.00   29.00   43.00   41.13   55.00  100.00

Insight :

  • Popularity ranges from 0 to 100, with a median of 43 and a mean of 41.13.

We now group the tracks into 4 classes based on popularity.

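# Note: a popularity of exactly 0 fails the first condition (popularity > 0)
# and therefore falls through to class 4 via the TRUE branch.
spotify <- spotify %>% 
  mutate(popularity.class = as.numeric(case_when(
    ((popularity > 0) & (popularity < 20)) ~ "1",   # 1 to 19
    ((popularity >= 20) & (popularity < 40))~ "2",  # 20 to 39
    ((popularity >= 40) & (popularity < 60)) ~ "3", # 40 to 59
    TRUE ~ "4"))                                    # 60 to 100, plus 0 (see note)
    )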
table(spotify$popularity.class)
#> 
#>     1     2     3     4 
#> 24138 70473 95348 42766
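In the class assignment above, a score of exactly 0 does not satisfy `popularity > 0`, so it ends up in class 4. If the intent is for zero-popularity tracks to belong to class 1, one possible alternative is `cut()` with left-closed intervals. The sketch below uses a made-up vector `popularity` and is an assumption about the intended binning; the counts reported in this report come from the original code.

```r
# Hypothetical popularity scores covering every class boundary
popularity <- c(0, 5, 19, 20, 39, 40, 59, 60, 100)

# Intervals are [0, 20), [20, 40), [40, 60), [60, 100]
popularity_class <- cut(
  popularity,
  breaks = c(0, 20, 40, 60, 100),
  labels = 1:4,
  right = FALSE,          # intervals are closed on the left, so 0 lands in class 1
  include.lowest = TRUE   # keeps the maximum score of 100 in class 4
)

print(table(popularity_class))
```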

Check Missing Values

colSums(is.na(spotify))
#>            genre      artist_name       track_name         track_id 
#>                0                0                0                0 
#>       popularity     acousticness     danceability      duration_ms 
#>                0                0                0                0 
#>           energy instrumentalness              key         liveness 
#>                0                0                0                0 
#>         loudness             mode      speechiness            tempo 
#>                0                0                0                0 
#>   time_signature          valence popularity.class 
#>                0                0                0

Insight :

  • There are no missing values in any column.

Correlation between Predictors

ggcorr(spotify,label=T)

Insight :

  • We can see from the graph that there are a few factors with a strong relationship. We must either pick one of the variables or apply dimensional reduction techniques to avoid multicollinearity.
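To go from the correlation plot to an explicit list of strongly related pairs, the correlation matrix can be filtered directly. The sketch below uses synthetic data in which `loudness` is deliberately built from `energy`; the variable values and the cutoff of 0.7 are assumptions for illustration, not results from the Spotify data.

```r
set.seed(42)
n <- 200
energy   <- runif(n)
loudness <- -20 + 15 * energy + rnorm(n, sd = 1)  # constructed to correlate with energy
tempo    <- rnorm(n, mean = 120, sd = 20)         # generated independently
toy <- data.frame(energy, loudness, tempo)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the chosen cutoff
idx <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(strong_pairs)
```

Pairs listed this way are the candidates for dropping one variable or for dimension reduction.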

Music Analysis

We now draw insights from the subset of popular tracks (popularity class 4).

spotify_popular <- spotify %>% 
  filter(popularity.class == 4)
head(spotify_popular)

Business Question

What genres do streamers listen to the most?

# Count popular tracks per genre
spotify_genre_rank <- spotify_popular %>%
  count(genre, name = "total")

# Sort by count so that head(20) really keeps the top 20 genres, then plot
spotify_genre_rank <- spotify_genre_rank %>%
  arrange(desc(total)) %>% 
  head(20) %>% 
  ggplot(mapping = aes(x = total, y = reorder(genre, total))) +
  geom_col(aes(fill=total))+
  scale_fill_gradient(low ='#90e0ef', high ='#415a77')+
  geom_label(aes(label=total),color="black",size=2,nudge_x= 0.8)+
  theme(legend.position = 'none')+
  labs(title = "Top 20 Genres Spotify Streaming", 
       subtitle = 'Across All Categories', 
       x = 'Number of Tracks', 
       y ='Genre',
       fill='count',
       caption = 'Source Spotify Track DB')
spotify_genre_rank

Insight :

  • Among the top 20 genres, Pop is the most popular with 9,343 tracks, followed by Rap and Hip-Hop.

Which artists have the most songs on Spotify?

# Count popular tracks per artist
spotify_artist_rank <- spotify_popular %>%
  count(artist_name, name = "total")

# Sort by count so that head(20) really keeps the top 20 artists, then plot
spotify_artist_rank <- spotify_artist_rank %>%
  arrange(desc(total)) %>% 
  head(20) %>% 
  ggplot(mapping = aes(x = total, y = reorder(artist_name, total))) +
  geom_col(aes(fill=total))+
  scale_fill_gradient(low ='#90e0ef', high ='#415a77')+
  geom_label(aes(label=total),color="black",size=2,nudge_x= 0.8)+
  theme(legend.position = 'none')+
  labs(title = "Top 20 Artists with the Most Songs on Spotify", 
       subtitle = 'Across All Categories', 
       x = 'Number of Tracks', 
       y ='Artist Name',
       fill='count',
       caption = 'Source Spotify Track DB')
spotify_artist_rank

Insight :

  • Among the top 20 artists, Suicide Boys have the most songs on Spotify with 184 tracks, in their own hip-hop and rap style.

How are genres distributed across the popularity classes?

# Count tracks per genre and popularity class, then plot the counts as grouped bars
spotify_genre_data <- spotify %>%
  mutate(popularity.class = as.factor(popularity.class)) %>% 
  count(genre, popularity.class, name = "total") %>% 
  ggplot(mapping = aes(fill = genre, y = total, x = popularity.class)) + 
    geom_bar(position = "dodge", stat = "identity")
ggplotly(spotify_genre_data)

Insight :

  • As the graph above shows, Pop is the most popular genre, followed by Rap.
  • Pop’s appeal might be attributed to how easy it is to sing along to, which matters to many people who want lively music in the car while commuting or passing time.
  • Hip-Hop is the least popular genre, mostly because it has fewer songs.

Summary of the spotify data set

spotify_clean <- spotify %>% 
  select(-track_id) %>% # Drop out column track id
  mutate_at(vars(genre,artist_name,track_name,time_signature,mode,key),as.factor)
summary(spotify_clean)
#>         genre                         artist_name       track_name    
#>  Comedy    :  9681   Giuseppe Verdi         :  1394   Home   :   100  
#>  Soundtrack:  9646   Giacomo Puccini        :  1137   You    :    71  
#>  Indie     :  9543   Kimbo Children's Music :   971   Intro  :    69  
#>  Jazz      :  9441   Nobuo Uematsu          :   825   Stay   :    63  
#>  Pop       :  9386   Richard Wagner         :   804   Wake Up:    59  
#>  Electronic:  9377   Wolfgang Amadeus Mozart:   800   Closer :    58  
#>  (Other)   :175651   (Other)                :226794   (Other):232305  
#>    popularity      acousticness     danceability     duration_ms     
#>  Min.   :  0.00   Min.   :0.0000   Min.   :0.0569   Min.   :  15387  
#>  1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350   1st Qu.: 182857  
#>  Median : 43.00   Median :0.2320   Median :0.5710   Median : 220427  
#>  Mean   : 41.13   Mean   :0.3686   Mean   :0.5544   Mean   : 235122  
#>  3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920   3rd Qu.: 265768  
#>  Max.   :100.00   Max.   :0.9960   Max.   :0.9890   Max.   :5552917  
#>                                                                      
#>      energy          instrumentalness         key           liveness      
#>  Min.   :0.0000203   Min.   :0.0000000   C      :27583   Min.   :0.00967  
#>  1st Qu.:0.3850000   1st Qu.:0.0000000   G      :26390   1st Qu.:0.09740  
#>  Median :0.6050000   Median :0.0000443   D      :24077   Median :0.12800  
#>  Mean   :0.5709577   Mean   :0.1483012   C#     :23201   Mean   :0.21501  
#>  3rd Qu.:0.7870000   3rd Qu.:0.0358000   A      :22671   3rd Qu.:0.26400  
#>  Max.   :0.9990000   Max.   :0.9990000   F      :20279   Max.   :1.00000  
#>                                          (Other):88524                    
#>     loudness          mode         speechiness         tempo       
#>  Min.   :-52.457   Major:151744   Min.   :0.0222   Min.   : 30.38  
#>  1st Qu.:-11.771   Minor: 80981   1st Qu.:0.0367   1st Qu.: 92.96  
#>  Median : -7.762                  Median :0.0501   Median :115.78  
#>  Mean   : -9.570                  Mean   :0.1208   Mean   :117.67  
#>  3rd Qu.: -5.501                  3rd Qu.:0.1050   3rd Qu.:139.05  
#>  Max.   :  3.744                  Max.   :0.9670   Max.   :242.90  
#>                                                                    
#>  time_signature    valence       popularity.class
#>  0/4:     8     Min.   :0.0000   Min.   :1.000   
#>  1/4:  2608     1st Qu.:0.2370   1st Qu.:2.000   
#>  3/4: 24111     Median :0.4440   Median :3.000   
#>  4/4:200760     Mean   :0.4549   Mean   :2.674   
#>  5/4:  5238     3rd Qu.:0.6600   3rd Qu.:3.000   
#>                 Max.   :1.0000   Max.   :4.000   
#> 

Insight :

  • Each numeric column has its own scale; the summary shows widely different ranges (for example, duration_ms reaches the millions while most audio features lie between 0 and 1).
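Because k-means relies on distances, columns with large ranges would dominate unless the data is standardized. The sketch below applies `scale()` to the first five duration_ms and danceability values shown in the glimpse output earlier; the data frame name `toy` is just for illustration.

```r
# First five rows of two columns, copied from the glimpse() output above
toy <- data.frame(
  duration_ms  = c(99373, 137373, 170267, 152427, 82625),
  danceability = c(0.389, 0.590, 0.663, 0.240, 0.331)
)

# Raw ranges differ by several orders of magnitude
print(sapply(toy, range))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy)
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```

After scaling, both columns contribute on a comparable footing to the Euclidean distance used by k-means.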

Data Pre-processing

# Change popularity class into a factor
spotify_clean <- spotify_clean %>% 
  mutate(popularity.class = as.factor(popularity.class))
glimpse(spotify_clean)
#> Rows: 232,725
#> Columns: 18
#> $ genre            <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi…
#> $ artist_name      <fct> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name       <fct> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key              <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G, …
#> $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode             <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
#> $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature   <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4…
#> $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
#> $ popularity.class <fct> 4, 1, 1, 4, 1, 4, 1, 1, 4, 1, 4, 1, 1, 1, 4, 4, 4, 1,…

Because the data has a large number of observations (232,725 rows), we take a sample to reduce computation time in the later stages, starting with finding the optimum K.

RNGkind(sample.kind = "Rounding")
set.seed(100) # fix the random seed for reproducibility

# Sample about a thousand observations (0.44% of the rows)
reduction1 <- sample(x = nrow(spotify_clean), size = nrow(spotify_clean)*0.0044)
spotify_keep <- spotify_clean[reduction1,]
# Column's name numeric (quantitative)
quanti <- spotify_keep %>% 
  select_if(is.numeric) %>% 
  colnames()

# index numeric columns
quantivar <- which(colnames(spotify_keep) %in% quanti)

# Column's name category (qualitative)
quali <- spotify_keep %>% 
  select_if(is.factor) %>% 
  colnames()

# index category columns
qualivar <- which(colnames(spotify_keep) %in% quali)

K Optimum

We can choose the optimum K based on:

  • Business needs: how many groups are required from a business perspective.
  • The elbow method: choose the value of k beyond which adding another cluster no longer reduces the total within-cluster sum of squares (Total WSS) substantially, i.e. where the curve flattens.

In this project, we use the elbow method.

spotify_num <- spotify_keep %>% 
  select_if(is.numeric) %>%
  scale()

library(factoextra)
fviz_nbclust(x = spotify_num,
             FUNcluster = kmeans,
             method = "wss")

Insight :

  • We take k = 3 as the number of clusters, based on the elbow plot above.
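The quantity behind the elbow plot is the total within-cluster sum of squares for each candidate k, which can also be computed by hand with base R. The sketch below uses synthetic data built with three well-separated groups; the data, the range of k, and `nstart = 20` are assumptions for illustration.

```r
set.seed(100)
# Hypothetical toy data: three groups of 30 points in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1, ..., 8
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 20)$tot.withinss
})

# The elbow is where adding a cluster stops reducing WSS by much
print(data.frame(k = ks, wss = round(wss, 2)))
```

Plotting `wss` against `ks`, for example with `plot(ks, wss, type = "b")`, gives the same kind of curve that `fviz_nbclust()` draws with `method = "wss"`.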

3. Outlier Detection

PCA

# Use the PCA individual factor map to check for outliers.

pca_spotify <- PCA(X = spotify_keep, # the sample drawn earlier
                scale.unit = T, # scale the variables
                quali.sup = qualivar, # indices of the qualitative columns
                graph = F, 
                ncp = 11) # 11 numeric columns in spotify_keep
plot.PCA(x = pca_spotify, # object returned by PCA() from FactoMineR
         choix = "ind", # type of plot: "ind" -> individual factor map
         invisible = "quali", # hide the labels of the categorical variables
         select = "contrib 10", # label the 10 observations with the highest contribution
         habillage = "popularity.class") # color the points by a categorical variable (index or column name)

spotify_num1 <- spotify_keep %>% 
  select_if(is.numeric)

spotify_num1[c("130688","71625"),]

Insight :

  • The plot labels the 10 observations with the highest contribution as outliers. Nine of them are listed here: 130688, 216889, 130346, 81385, 128160, 105057, 175004, 172070, 211861. We are going to remove these outliers.
  • For example, row 130688 is an outlier because its liveness, instrumentalness, and acousticness are high.

plot.PCA(x = pca_spotify, # object returned by PCA()
         choix = "var") # variable factor map

Insight :

  • The percentages displayed on the Dim 1 (31.53%) and Dim 2 (16.60%) axes indicate how much information each axis summarizes. Together, the two axes explain about 48.13% of the information in the original data.

  • PC1 mainly summarizes four variables: energy, valence, danceability, loudness

  • PC2 mainly summarizes seven variables: speechiness, liveness, tempo, popularity, instrumentalness, acousticness, duration_ms

  • Pairs of variables that are highly positively correlated:

    • speechiness, liveness
    • valence, energy
    • danceability, loudness
    • tempo, popularity
    • duration_ms , instrumentalness
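The share of information carried by each axis is the eigenvalue of that component divided by the sum of all eigenvalues. The sketch below shows the computation with base R's `prcomp()` on synthetic data, as a stand-in for the FactoMineR `PCA()` object used in this report (which stores the same percentages in its `eig` element); the data frame `toy` is an assumption for illustration.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
toy <- data.frame(
  a = x1,
  b = x1 + rnorm(n, sd = 0.3),  # strongly tied to a
  c = rnorm(n),
  d = rnorm(n)
)

# PCA on standardized variables
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Percentage of variance per component, and the running total
var_share <- pca$sdev^2 / sum(pca$sdev^2) * 100
print(round(var_share, 2))
print(round(cumsum(var_share), 2))

# Same arithmetic for the two axis percentages reported in the insight
print(31.53 + 16.60)  # 48.13
```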

4. Clustering

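# Row ids flagged as outliers in the PCA plot
outlier <- c(130688,216889,130346,81385,128160,105057,175004,172070,211861)
# Caution: -outlier drops by position, not by row name. spotify_num has only about
# a thousand rows, so these positions do not exist and no row is actually removed.
spotify_no <- spotify_num[-outlier,]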
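Since the sampled matrix keeps the original row numbers only as row names, dropping rows by name is one way to make the outlier removal take effect. The sketch below shows the difference on a tiny made-up matrix `m`; the values and the two ids are for illustration, and the clustering output in this report was produced with the positional indexing above.

```r
# Toy matrix whose row names are original row ids, like a sample of a larger table
m <- matrix(1:10, ncol = 2,
            dimnames = list(c("130688", "71625", "59967", "216889", "13122"),
                            c("x", "y")))
outlier_ids <- c("130688", "216889")

# Negative numeric indices are positions; positions beyond nrow(m) are ignored
m_by_position <- m[-as.numeric(outlier_ids), ]
print(nrow(m_by_position))  # still 5 rows

# Matching on row names removes the intended rows
keep <- !(rownames(m) %in% outlier_ids)
m_no <- m[keep, , drop = FALSE]
print(rownames(m_no))
```

Keeping the ids as character strings avoids any surprises when large numbers are converted to text for matching.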

K means Clustering

# Clustering with the optimum k
RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_kmeans <- kmeans(x = spotify_no,
                     centers = 3)
spotify_kmeans
#> K-means clustering with 3 clusters of sizes 197, 766, 60
#> 
#> Cluster means:
#>   popularity acousticness danceability duration_ms     energy instrumentalness
#> 1 -0.6834540    1.2567025  -1.06133891   0.2596139 -1.3961322        1.1940948
#> 2  0.2643987   -0.4172683   0.27122789  -0.0502160  0.3292904       -0.2713331
#> 3 -1.1314829    1.2009523   0.02205331  -0.2113080  0.3800266       -0.4565918
#>     liveness   loudness speechiness      tempo    valence
#> 1 -0.2010456 -1.4650248  -0.3943022 -0.3381463 -0.9665134
#> 2 -0.1311695  0.4071616  -0.1786706  0.1481276  0.2628519
#> 3  2.3346967 -0.3879322   3.5756535 -0.7808484 -0.1823572
#> 
#> Clustering vector:
#>  71625  59967 128539  13122 109042 112584 189062  86181 127194  39623 145447 
#>      2      2      1      2      2      2      2      2      1      1      1 
#> 205293  65242  92733 177455 155689  47616  83199  83653 160635 124686 165407 
#>      1      2      2      2      2      2      2      2      2      1      2 
#> 125276 174288  97759  39890 179249 205229 127774  64626 113627 216058  81139 
#>      1      3      2      1      1      1      1      2      2      2      1 
#> 222025 161785 206967  41979 146452 230259  30317  76940 201300 180931 192499 
#>      2      2      1      2      2      2      2      2      1      3      2 
#> 140383 114300 181574 205741  48331  71452  76907  46228  54840  63959 137584 
#>      2      2      2      1      2      1      2      2      2      1      2 
#>  58957  28732  53492 139037  49188 107888 150558 223490 157373 103569  83240 
#>      1      2      1      2      2      2      2      2      2      2      2 
#> 106031 103630  57023 161545  95910  76247 133209 224975 153964 145336 199300 
#>      1      2      2      2      2      2      2      2      2      2      1 
#> 180251 194034  21290 106907 139447 213967 228647   8795 134452 170598  57867 
#>      2      3      2      1      2      2      1      2      2      3      1 
#>  69963 170631 210990  48811  83315 104289 210863  90596 120377  29135   7013 
#>      2      3      2      1      2      1      2      2      2      2      2 
#> 179543  76165  90603   9550  84069 132822 159317 225901 163289   2686 124574 
#>      2      1      2      2      2      2      2      2      2      1      1 
#> 194600 187689  18826  55574 224700   8763 213165 168927  46702 195436  92266 
#>      1      2      2      2      2      1      2      3      2      2      2 
#>  91345 109919 135760  81966   6624 231457 222727 128313  23715  55338 200054 
#>      2      2      2      1      2      2      2      1      2      2      1 
#> 171715 115671 134885   3765 109813   9860 107724 146528 156588  20243  33396 
#>      3      2      2      2      2      2      1      2      2      2      2 
#> 211296  28499 169519 221038  10084   4559  46300 117471 215454  32189  39350 
#>      3      2      3      2      2      2      2      2      2      2      2 
#> 141084 189749 196145 183282   4424 162352 189563 132190 111645  37529  20758 
#>      2      2      2      2      1      2      2      2      2      2      2 
#>  37834   6216 165031 176981 199411 101689  97007 136195 191848 184446  75847 
#>      2      2      2      2      1      2      2      2      2      1      2 
#> 222767 151859 106664 140928  67141 166928 214730 156952  43239  80940  28724 
#>      2      2      1      2      2      2      2      2      2      1      2 
#>  25142  69286 194912 230855 101649  47103 224131 153561  69322  27732 139460 
#>      2      2      2      1      2      2      2      2      2      2      2 
#>  27948 183331  85941 222369 212428 191444  74287 204083 186097 142152  16840 
#>      2      1      2      2      2      2      1      1      2      2      2 
#>  98026  80094 174790  50863  67925  82764 146938 207522 172994 106308   8393 
#>      2      1      2      2      2      1      2      2      3      1      2 
#> 132183 102073 139615 221199  62644 153180  17780  16574  86101  69076 128123 
#>      2      1      2      1      2      2      2      2      2      2      1 
#>  85998 196617 144329  92797  69669  88607 163006 220134 179672  51043 166485 
#>      2      1      2      2      2      2      2      1      2      2      2 
#> 154420 164191  65009 165584 153656   9535  14214  65038  70200 222388  90623 
#>      2      2      2      2      2      2      2      2      2      2      2 
#>  86431 196007 196957  74235  30804 143458 183952  78585 210484  45925 184595 
#>      2      2      1      2      2      1      1      2      2      2      1 
#> 175416 211863  75012  20031 211815 222036 156011 173175 102284  26742 157054 
#>      3      3      2      2      3      2      2      3      2      3      2 
#> 170052 112502  39760 157099  61131  79376  48871   3794  87576 130704 158056 
#>      3      2      2      2      2      1      2      2      2      1      2 
#> 173258 220834  37909  75486  30848 148130  76919 150867  70401  16557 153959 
#>      3      1      2      2      2      2      2      2      2      2      2 
#> 176554 128616 125341 197467 151826 221008 143457 114547 226844 113966 152274 
#>      2      1      2      2      2      1      2      2      2      2      2 
#> 139172 220231  85532 204103 105719 115385 107053 143609 140409 182645 129049 
#>      2      2      2      1      1      2      1      2      1      3      1 
#> 178716  93959 118694 121672 230729  99851 231474 182806 119897 116824 211861 
#>      2      2      2      2      2      2      2      1      2      2      3 
#>  61449  40404  93035 125110  56928  87345 134845  48559 186219 148449 171113 
#>      2      2      2      1      2      2      2      2      2      2      3 
#> 101973 134739  59501 107760  39183 142657 222642 111122 174552   4784  39729 
#>      2      2      1      1      2      2      2      2      2      2      2 
#> 148711  38337  82428  43320 208588  55254 228869   4957  24989  56889 167851 
#>      2      2      1      2      2      2      2      2      2      2      2 
#>   7607 127648 158658  69764  90335 170189 223642 176729 135155 107550  82790 
#>      2      1      2      2      2      3      2      2      2      1      1 
#>  89205  47814  32246  90403  61816 163409  94769  61703  93165  45845 192828 
#>      2      2      2      2      2      2      2      2      2      2      2 
#> 122673  91805 133355 225593 151054  77508  77007 218959  88545 131110 118841 
#>      2      2      2      2      2      2      2      2      2      2      2 
#>  32262  55710 166813  69053 118774  64519  83789 101648 186570 120949 161730 
#>      2      2      2      2      2      2      2      2      2      2      2 
#> 196968 196473  91040  35664 148548  66714 215665  36156 223911    293 164364 
#>      2      3      2      2      2      2      2      2      2      2      2 
#> 146397 179591 207358 118755 173996 215200  21315 115374  45880 230936   8554 
#>      2      2      2      2      3      2      2      2      2      2      2 
#>  50540 206961  27629  85726  37719  38721 225904 188247 221447  92895 197966 
#>      1      1      2      2      1      2      2      2      2      2      1 
#>  14988  70273 113929 150115 118611 185188 131240  82116 153094  55578  61602 
#>      2      2      2      2      2      2      2      1      1      2      2 
#> 120381 171397 181082  26641 142236 211583 147263  63796  81952 155811 210597 
#>      2      3      2      2      2      3      2      2      1      2      2 
#> 169607 213627  88825 198383  12713  58991  85269  83745  65152  90276  80856 
#>      3      2      2      1      2      1      2      2      2      2      1 
#>  64781 186054 231092 174940 140126 137943 135235  55428 105505 231202  19218 
#>      2      2      2      3      1      2      2      2      1      2      2 
#>  12055 114680  12540 220731 147051  62690 231495  83327  86336  76690  24244 
#>      2      2      2      2      1      2      2      1      2      1      2 
#> 134671  77354    878 230973  85746  51144 172580  71051  93402  91396 102353 
#>      2      2      2      2      2      2      3      2      2      2      2 
#> 179461  48666 173727 135929  75342  87092   6256 112237 225264     92  24546 
#>      2      2      3      2      2      2      2      2      2      2      1 
#> 102159  72518 225550 135903 170422 154073 167879  25778  50755 144560 222743 
#>      2      3      2      2      3      2      3      2      2      2      2 
#> 188618   5147 158483 192904 187568 195470  43575  22046 104210 139425 200511 
#>      2      2      2      2      2      1      2      2      1      2      1 
#> 197697 162015 149521 152010 204403 198965 232493  75812 127002 170506 186971 
#>      1      2      2      2      1      1      2      2      1      3      2 
#>  47931 204351  91959 228227 130346 210954 166032  54708    907 111296 202479 
#>      2      1      2      2      1      1      2      2      2      2      1 
#>  60903 154018 204221 211616  90617 200853 119473 172075 222918 132262  72223 
#>      2      2      1      3      2      1      2      3      2      2      1 
#> 172560 104579 110015 110136 171955  24885 149081  98009 185835 143184  57890 
#>      3      1      2      2      3      2      2      2      2      2      1 
#> 158141  80725 206945 152571 217900  31148 225743  81422  55016 142908 173434 
#>      2      1      1      2      2      2      2      1      2      2      3 
#> 223663 216505 183890  72363  21684  48119   7374 134564  35795  29079  34351 
#>      2      2      2      3      2      2      2      2      2      1      2 
#> 212006  59755 125916 150650  34631 197052  68127 128160 202666 150173  92191 
#>      2      2      2      2      1      2      2      1      1      2      2 
#>   6855  82830 162138  69504 186815 179714  49006  87541  72271 208327  96661 
#>      2      1      2      2      2      2      2      2      2      2      2 
#> 194008  74606  31890 143369 156761  63230  43218  76666  28274  54665 216503 
#>      2      2      2      2      2      2      2      1      1      2      2 
#> 181618 104849  25716 206248 188820  45671 139301   9274 109951 180317 134566 
#>      2      2      2      1      2      2      2      2      2      1      2 
#>  74727 230541 115172 158146 171252 210631 102455 173617 175100 103354  85024 
#>      1      2      2      2      3      1      2      3      3      2      2 
#>  18356  47619 205376 212783 143313  84944 156846  95242 155324 211206  96274 
#>      2      2      1      2      2      2      2      2      2      2      1 
#>  35494  87594 108088  56684  27433   1922 151033  13126  30192 222924  60233 
#>      2      2      2      2      1      2      2      2      2      2      2 
#>   2104 230388 151501 168110 165742 191437  34399 156626 196505 230314  70560 
#>      2      2      2      3      2      2      2      2      1      2      2 
#>  33737 120433 193085 213043  80770  77193 194639 103539 161163 209473   6839 
#>      2      2      1      2      1      2      1      1      2      1      2 
#> 109470  94871  78177 153892 202583  20130  77643 216643 213126 159335  88987 
#>      2      2      2      2      1      2      2      2      2      2      2 
#>   2084 167033 111579    864 132734  18713  88137 231893  22613 186579  19542 
#>      1      2      2      1      2      2      2      2      2      2      1 
#> 203506 227948 141485 158879 114744  64028 232007 154142 127196 138274  71533 
#>      1      2      2      2      2      2      2      2      2      2      1 
#> 208161 212537 225419  79427 215381 213108 160404 178608 139698 178467  22730 
#>      2      2      2      1      2      2      2      2      2      2      2 
#>  53762 163969   6686 146409 155923 105057 156789   2179  14726  64774 108535 
#>      2      2      1      2      2      1      2      2      2      2      2 
#> 180112  85114 208935 120947  67179  56734 136656  88320 163627  26917 138816 
#>      2      2      2      2      2      1      2      2      2      2      2 
#> 196251  81385  73336 180969  59099 194885 174285  71622 106243  96234  75386 
#>      3      1      1      2      1      2      3      2      1      2      2 
#>  31456 211094  34881 194349 126239  26894  10259  65568 204397 194118  13397 
#>      2      1      2      2      1      2      2      2      1      1      2 
#>  11301  73439  30484   8260  28026 131708  29534 175813  23060  53328 102888 
#>      2      1      2      2      2      2      2      3      2      2      2 
#> 119604 214503  10311  34593 196707  35474 191324 200188 151755   5105 231435 
#>      2      2      2      1      2      1      2      2      2      2      2 
#>  34242  19146 184993  79792 168969  79092 199631  94437 131851 209072 216889 
#>      1      2      2      1      3      2      1      2      2      2      1 
#> 196466  96012  80570 105954 209582  83547 134302 145672 165452   5116 115228 
#>      2      2      1      1      1      2      2      1      2      2      2 
#> 102148 129334  40724 130688 141370 148177 183881  80335  77334 228975 166130 
#>      2      1      2      1      2      2      1      1      2      2      2 
#>  95311 132227  24336 201098 110808  64593 107362   9527 202735 172358  16842 
#>      1      2      2      1      2      2      1      2      1      3      2 
#>  98768 119591 171280  62145 112987 223143 195201  62164  60455 211776 138908 
#>      2      2      3      2      2      2      1      2      2      3      2 
#>  14762 166325 172070  60910 122887 186491 168745 200641  10757   1874 217794 
#>      2      2      3      2      2      2      3      1      2      2      1 
#>  56029 208592 118553 161579 221885  12559  30046 178099 199685 168748 209627 
#>      2      2      2      2      2      2      2      2      1      3      2 
#> 171908  37444 111848  95063 108012 102816   6150 151604  90548   6080  67461 
#>      3      2      2      2      2      2      2      2      2      1      2 
#>    105 149634 191699 175004 119345 143635  92261  99806 147139 111223 129668 
#>      2      2      2      3      2      2      2      2      1      1      1 
#> 118165 191595 129837   3079 110247   8448  56755 221302  62501  10945 222926 
#>      2      2      1      2      2      2      2      1      2      2      2 
#>   9545 131795 113437 189074 121980 193976  16649 172816  78221 143805 166421 
#>      2      2      2      2      2      1      2      3      2      1      2 
#> 220114  63036 225254 109498 184879 210667 229627  51299  44606 164550 196017 
#>      1      1      2      2      2      2      2      2      2      2      1 
#>  10142 168810  69346  59726 227251 200565 153925 201381 195478 224048 211667 
#>      2      3      2      2      2      1      2      1      2      2      3 
#>  65732 106106 201383 119719  61621 201315 159955 137785 214048  34621 186902 
#>      2      1      1      2      2      1      2      2      2      2      2 
#>  72180 213419 150544  41335 102292 174890  59721 231504 121078 231311  93304 
#>      2      2      2      2      2      3      2      2      2      2      2 
#> 135410 206673 197777  99472  64604 206681 228799  32349 125231  71975 197458 
#>      2      1      1      1      1      2      2      2      1      2      2 
#>  10088  86245 119569 177916  66588 189243   1223   5170  56303 192648  17159 
#>      2      2      2      1      2      2      2      2      2      2      2 
#>  25923 144583 155505  84786  42447   6192 113647 138087 221554  83045  41583 
#>      2      2      2      2      2      2      2      2      1      1      1 
#> 229977   7114 167552  10294  65398  13417 152431 208767 111671 166397 232057 
#>      1      2      3      2      2      2      2      1      2      2      2 
#> 
#> Within cluster sum of squares by cluster:
#> [1] 2048.1328 4863.1969  411.4773
#>  (between_SS / total_SS =  34.9 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"

Insight :

Each cluster contains a different number of songs:

  • Cluster 1 has 197 songs
  • Cluster 2 has 766 songs
  • Cluster 3 has 60 songs

As k increases:

  • WSS (within-cluster sum of squares) gets closer to 0
  • BSS (between-cluster sum of squares) gets closer to TSS (so BSS/TSS gets closer to 1)
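
The relationship between k and these sums of squares can be checked with a short sketch. It does not use the Spotify data: the built-in `USArrests` data set stands in for the scaled audio features, and the range of k and the `nstart` value are illustrative assumptions. The fitted object exposes the same components listed in the output above (`totss`, `tot.withinss`, `betweenss`).

```r
# Built-in data standing in for the scaled Spotify features (an assumption)
x <- scale(USArrests)

set.seed(123)
ratio <- sapply(2:6, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  # TSS always splits into a within-cluster and a between-cluster part
  stopifnot(isTRUE(all.equal(fit$totss, fit$tot.withinss + fit$betweenss)))
  # Share of the total variation explained by the clustering
  fit$betweenss / fit$totss
})
names(ratio) <- paste0("k=", 2:6)
round(ratio, 3)
```

Because a larger k always pushes this ratio towards 1, a high BSS/TSS value on its own is not evidence of a good clustering; it has to be balanced against how many clusters remain interpretable.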

Cluster Profiling

# Create a new data set that contains only the numeric columns
spotify_qty <- as.data.frame(spotify_num)


# Cluster profiling: attach the cluster labels to both data sets
spotify_keep$cluster <- as.factor(spotify_kmeans$cluster)
spotify_qty$cluster <- as.factor(spotify_kmeans$cluster)
# Mean of every numeric feature within each cluster
spotify_qty <- spotify_qty %>% 
  group_by(cluster) %>% 
  summarise_all(mean)
spotify_qty
# Cluster centroids from the fitted model
spotify_kmeans$centers
#>   popularity acousticness danceability duration_ms     energy instrumentalness
#> 1 -0.6834540    1.2567025  -1.06133891   0.2596139 -1.3961322        1.1940948
#> 2  0.2643987   -0.4172683   0.27122789  -0.0502160  0.3292904       -0.2713331
#> 3 -1.1314829    1.2009523   0.02205331  -0.2113080  0.3800266       -0.4565918
#>     liveness   loudness speechiness      tempo    valence
#> 1 -0.2010456 -1.4650248  -0.3943022 -0.3381463 -0.9665134
#> 2 -0.1311695  0.4071616  -0.1786706  0.1481276  0.2628519
#> 3  2.3346967 -0.3879322   3.5756535 -0.7808484 -0.1823572
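spotify_qty %>% 
  tidyr::pivot_longer(-cluster) %>% 
  group_by(name) %>% # one group per feature column
  # Row position of the lowest/highest mean; this equals the cluster number
  # because the rows within each feature are ordered by cluster
  summarize(cluster_min_val = which.min(value),
            cluster_max_val = which.max(value))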
library(ggiraphExtra)
library(ggplot2)
ggRadar(data=spotify_qty, aes(colour=cluster), interactive=TRUE)

Insight :

  • Cluster 1 is dominant in acousticness, duration_ms, and instrumentalness.
  • Cluster 2 is dominant in danceability, loudness, popularity, tempo, and valence.
  • Cluster 3 is dominant in energy, liveness, and speechiness.
spotify_keep %>% 
  select(c(artist_name, track_name, cluster))

Insight :

  • Kanye West’s track So Appalled falls in cluster 2, which is characterized by danceability, loudness, popularity, tempo, and valence.
  • Hillsong United’s track Heart Like Heaven (Falling) - Live falls in cluster 1, which is characterized by acousticness, duration_ms, and instrumentalness.
  • Ryan Stout’s track Out to Killing & Eating falls in cluster 3, which is characterized by energy, liveness, and speechiness.
fviz_cluster(object = spotify_kmeans, # fitted kmeans object
             data = spotify_no)+ # numeric variables used for clustering
  labs(title = "Cluster Plot of Spotify Dataset using k=3")+
  theme(panel.grid.minor = element_line(linetype = "dashed"),
        panel.grid.major = element_line(linetype = "dashed"))

Insight :

5. Conclusion

Business Answer

After exploring the data and clustering Spotify’s songs by their characteristics, we found:

  • We used K-Means clustering on numerous audio features to group similar songs regardless of genre.
  • We used the elbow method to choose the optimal number of clusters and obtained k = 3; in a real case, this choice should also be weighed against business needs.
  • With the K-Means model we developed, a listener could be offered radio-style features based on the songs he or she usually listens to, provided the audio characteristics of those songs are already known.
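
One way to read the last point is that a song with known audio characteristics can be placed in the cluster whose centroid is nearest. The sketch below illustrates that idea only; it is not part of the analysis above. It uses the built-in `USArrests` data as a stand-in for the Spotify features, and `assign_cluster`, `mu`, `sigma`, and `new_obs` are hypothetical names introduced here.

```r
# Stand-in data (an assumption); scale() stores the means and standard deviations
x <- scale(USArrests)
mu <- attr(x, "scaled:center")
sigma <- attr(x, "scaled:scale")

set.seed(1)
fit <- kmeans(x, centers = 3, nstart = 10)

# Assign one new observation to the nearest centroid.
# new_obs must list the features in the same order as the training columns.
assign_cluster <- function(new_obs, fit, mu, sigma) {
  z <- (new_obs - mu) / sigma               # apply the training-time scaling
  d <- rowSums(sweep(fit$centers, 2, z)^2)  # squared distance to each centroid
  unname(which.min(d))
}

new_obs <- c(Murder = 5, Assault = 120, UrbanPop = 70, Rape = 18)
assign_cluster(new_obs, fit, mu, sigma)
```

The new observation has to be scaled with the means and standard deviations of the training data; rescaling it on its own would put it in different units from the centroids.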

Basic Theory

Clustering is grouping data based on its characteristics. Clustering aims to produce clusters where:

  • Each observation in the same cluster has similar characteristics.
  • Each observation from a different cluster has different characteristics.

K-means is a centroid-based clustering algorithm, meaning that each cluster has a centroid/center point that represents the cluster. K-means is an iterative process consisting of:

  • Random initialization: Create k cluster centers (centroids) randomly.
  • Cluster assignment: Assign each observation to the nearest cluster, based on the distance between the observation and each cluster center.
  • Centroid update: Shift each centroid to the mean of the observations assigned to its cluster.
  • Repeat the cluster assignment and centroid update steps until the observations assigned to each cluster no longer change.
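
The iteration described above can be sketched from scratch in a few lines of R. This is a minimal illustration of the steps, not the implementation behind `kmeans()` (which uses the Hartigan-Wong algorithm by default); `simple_kmeans` and the toy matrix are hypothetical names introduced here.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Random initialization: use k distinct observations as starting centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Cluster assignment: squared Euclidean distance of every row to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop once no observation changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Centroid update: move each centroid to the mean of its members
    for (j in seq_len(k)) {
      members <- assignment == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centers, iter = iter)
}

# Toy data: two well-separated groups of four points each
toy <- rbind(cbind(c(0, 0.1, 0.2, 0.1), c(0, 0.1, 0, 0.2)),
             cbind(c(10, 10.1, 10.2, 10.1), c(10, 10.1, 10, 10.2)))
set.seed(7)
fit <- simple_kmeans(toy, k = 2)
fit$cluster
```

Because the starting centroids are random, the cluster numbers themselves are arbitrary; only the grouping is meaningful. On less clearly separated data, different starts can also end in different groupings, which is why `kmeans()` offers the `nstart` argument.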

K-means workflow:

  • Business questions
  • Read data
  • Data cleansing: check that every column is numeric and handle NA values
  • EDA: check whether the variables are on the same scale
  • Data preprocessing: scale the data if the scales differ
  • Determine the optimal k:
    • Business needs
    • Elbow method
  • Create the clusters with the function kmeans(data, centers = k)
  • Combine the resulting cluster label column with the initial data
  • Cluster profiling to understand the characteristics of each cluster; this can be used for product recommendations, etc. (if needed).
  • Addendum: the clustering can be visualized with biplots to make profiling easier
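
The workflow above can be condensed into a short base R sketch. It is an illustration under stated assumptions rather than the code used in this project: the built-in `USArrests` data stands in for the Spotify data, and the range of k, the `nstart` value, and the final choice of k = 3 are illustrative only.

```r
# Read data (built-in stand-in; an assumption)
raw <- USArrests

# Data cleansing: every column must be numeric, rows with NA are dropped
stopifnot(all(sapply(raw, is.numeric)))
raw <- na.omit(raw)

# Preprocessing: the columns are on different scales, so standardize them
scaled <- scale(raw)

# Elbow method: total within-cluster sum of squares for several values of k
set.seed(42)
wss <- sapply(2:8, function(k) kmeans(scaled, centers = k, nstart = 10)$tot.withinss)
names(wss) <- 2:8
wss

# Fit the model with the chosen k (k = 3 is illustrative)
fit <- kmeans(scaled, centers = 3, nstart = 10)

# Combine the cluster labels with the initial data
raw$cluster <- factor(fit$cluster)

# Cluster profiling: mean of each variable per cluster, in the original units
profile <- aggregate(. ~ cluster, data = raw, FUN = mean)
profile
```

Profiling on the unscaled data keeps the cluster means in their original units, which tends to be easier to explain to a business audience than standardized values.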

For further projects, you can try Hierarchical Clustering, Fuzzy C-means, or DBSCAN.

6. Reference

Bruce, P., & Bruce, A. (2017). Practical statistics for data scientists: 50 essential concepts. O’Reilly Media, Inc.