Introduction

“Without music, life would be a mistake”, Friedrich Nietzsche
Music has undoubtedly become an important part of everyday life, since people can now access it anywhere and at any time. Spotify is one of the most widely used online music platforms in the world. Every year, Spotify publishes its 50 most popular songs, along with details such as artist and genre. In this analysis I would like to group the top 50 songs of 2019 into several categories in order to see what types of songs they are. The data was taken from https://www.kaggle.com/leonardopena/top50spotify2019.

Read & Cleaning Data

In this step we first load the data from our working directory and name it spotify.

spotify <- read.csv("top50.csv")

After that we try to take a glimpse of our data structure using str().

str(spotify)

## 'data.frame':    50 obs. of  14 variables:
##  $ X               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Track.Name      : Factor w/ 50 levels "0.958333333",..: 38 10 7 6 16 20 37 19 30 4 ...
##  $ Artist.Name     : Factor w/ 38 levels "Ali Gatie","Anuel AA",..: 33 2 3 10 29 10 21 31 20 5 ...
##  $ Genre           : Factor w/ 21 levels "atl hip hop",..: 7 20 9 16 10 16 21 16 8 12 ...
##  $ Beats.Per.Minute: int  117 105 190 93 150 102 180 111 136 135 ...
##  $ Energy          : int  55 81 80 65 65 68 64 68 62 43 ...
##  $ Danceability    : int  76 79 40 64 58 80 75 48 88 70 ...
##  $ Loudness..dB..  : int  -6 -4 -4 -8 -4 -5 -6 -5 -6 -11 ...
##  $ Liveness        : int  8 8 16 8 11 9 7 8 11 10 ...
##  $ Valence.        : int  75 61 70 55 18 84 23 35 64 56 ...
##  $ Length.         : int  191 302 186 198 175 220 131 202 157 194 ...
##  $ Acousticness..  : int  4 8 12 12 45 9 2 15 5 33 ...
##  $ Speechiness.    : int  3 9 46 19 7 4 29 9 10 38 ...
##  $ Popularity      : int  79 92 85 86 94 84 92 90 87 95 ...

Variable Description:
* Track.Name : Name of the track (song title)
* Artist.Name : Name of the artist (singer)
* Genre : The genre of the track
* Beats.Per.Minute : The tempo of the song
* Energy : The energy of the song - the higher the value, the more energetic the song
* Danceability : The higher the value, the easier it is to dance to this song
* Loudness..dB.. : The higher the value, the louder the song
* Liveness : The higher the value, the more likely the song is a live recording
* Valence. : The higher the value, the more positive mood for the song
* Length. : The duration of the song
* Acousticness.. : The higher the value the more acoustic the song is
* Speechiness. : The higher the value the more spoken word the song contains
* Popularity : The higher the value the more popular the song is

There is one variable, X, that is not used in this analysis, so we first remove it using the tidyverse library. Fortunately, the remaining variables are already in an appropriate format, so we do not have to convert any data types.

library(tidyverse) # for data wrangling
spotify <- spotify %>% 
  select(-X)

Then we check whether any variable has missing values, using colSums(is.na()).

colSums(is.na(spotify))

##       Track.Name      Artist.Name            Genre Beats.Per.Minute 
##                0                0                0                0 
##           Energy     Danceability   Loudness..dB..         Liveness 
##                0                0                0                0 
##         Valence.          Length.   Acousticness..     Speechiness. 
##                0                0                0                0 
##       Popularity 
##                0

There are no missing values in our data frame, so we can proceed to the next step.
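To make the check concrete, here is a minimal sketch of what it reports when values are missing. The data frame toy and its columns are made up for illustration and are not part of the Spotify data.

```r
# Hypothetical toy data frame with two missing values (not the Spotify data)
toy <- data.frame(
  energy = c(55, NA, 80),
  genre  = c("pop", "latin", NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with no missing value in any column
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```

A column count above zero would mean that we have to decide between dropping those rows and imputing the values before clustering.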

Basic Exploratory Data Analysis

In this step I would like to explore the data with a few simple analyses.

What are the top 10 songs on Spotify in 2019?

library(plotly) # for interactive plot
library(glue) # for glue text

top10_song <- spotify %>% 
  arrange(desc(Popularity)) %>% 
  head(10) %>% 
  select(c(Track.Name, Artist.Name, Genre, Popularity, Length.)) %>% 
  mutate(mean_length = mean(Length.),
         text = glue(
    "Artist = {Artist.Name}
    Genre = {Genre}"
  ))

plot_top10_song <- ggplot(data = top10_song, aes(x = reorder(Track.Name, Popularity),
                                                 y = Popularity,
                                                 text = text,
                                                 label = Popularity))+
  geom_col(aes(fill = Popularity), show.legend = F)+
  theme_bw()+
  coord_flip()+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12, colour = "black"),
        title = element_text(size = 12, colour = "black"))+
  geom_text(aes(label = Popularity), color = "white", size = 6, fontface = "bold", position = position_stack(0.8))+
  labs(title = "Top 10 Song on Spotify in 2019",
       x = "Song Title",
       y = "Popularity Rate",
       caption = "Source : Kaggle Dataset")

ggplotly(plot_top10_song, tooltip = "text")

Bad Guy by Billie Eilish, in the electropop genre, was the most popular song on Spotify in 2019.

What are the top 3 genres of the 2019 Spotify top songs?

top3_genre <- spotify %>% 
  group_by(Genre) %>% 
  summarise(song = n()) %>% 
  ungroup() %>% 
  mutate(song = song/50) %>% 
  arrange(desc(song)) %>% 
  head(3)

library(ggplot2) #to make plot
plot_top3_genre <- ggplot(data = top3_genre, aes(x = reorder(Genre, song),
                                                 y = song,
                                                 label = song))+
  geom_col(aes(fill = song), show.legend = FALSE)+
  theme_bw()+
  coord_flip()+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, colour = "black"),
        title = element_text(size = 14, colour = "black"))+
  geom_text(aes(label = scales::percent(song)), color = "white", size = 12, fontface = "bold", position = position_stack(0.7))+
  labs(title = "Top 3 Genre of Spotify Most Popular Song 2019",
       x = "Genre of Music",
       y = "Rate of Genre",
       caption = "Source : Kaggle Dataset")

plot_top3_genre

Based on the graph above, three genres account for 40% of the most popular Spotify songs in 2019: dance pop is the most common (16%), followed by pop (14%) and latin (10%). Every other genre accounts for 4% or less.
This is a little intriguing, since the most popular track of 2019 belongs to the electropop genre.
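The genre shares can also be computed in base R. The sketch below uses a made-up vector of ten genre labels (an assumption for illustration, not the real top 50 list) to show how counts become shares and how the combined share of the top three is obtained.

```r
# Hypothetical genre labels for ten songs (not the real top 50 list)
genre <- c("dance pop", "pop", "dance pop", "latin", "pop",
           "dance pop", "edm", "latin", "dance pop", "pop")

# prop.table() turns counts into shares of the total
genre_share <- sort(prop.table(table(genre)), decreasing = TRUE)

# Keep the three largest shares and their combined share
top3 <- head(genre_share, 3)
print(top3)
print(sum(top3))
```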

Selecting Variables

Based on my business judgment, we drop the variables that are not suitable for this analysis, namely the non-numeric variables, which are not relevant to this clustering.

spotify_ppt <- spotify %>% 
  select_if(is.numeric) %>% 
  select(-Popularity) # dropped even though it is an integer, since it is not related to this clustering

glimpse(spotify_ppt)

## Observations: 50
## Variables: 9
## $ Beats.Per.Minute <int> 117, 105, 190, 93, 150, 102, 180, 111, 136, 135, 1...
## $ Energy           <int> 55, 81, 80, 65, 65, 68, 64, 68, 62, 43, 62, 71, 41...
## $ Danceability     <int> 76, 79, 40, 64, 58, 80, 75, 48, 88, 70, 61, 82, 50...
## $ Loudness..dB..   <int> -6, -4, -4, -8, -4, -5, -6, -5, -6, -11, -5, -4, -...
## $ Liveness         <int> 8, 8, 16, 8, 11, 9, 7, 8, 11, 10, 24, 15, 11, 6, 1...
## $ Valence.         <int> 75, 61, 70, 55, 18, 84, 23, 35, 64, 56, 24, 38, 45...
## $ Length.          <int> 191, 302, 186, 198, 175, 220, 131, 202, 157, 194, ...
## $ Acousticness..   <int> 4, 8, 12, 12, 45, 9, 2, 15, 5, 33, 60, 28, 75, 7, ...
## $ Speechiness.     <int> 3, 9, 46, 19, 7, 4, 29, 9, 10, 38, 31, 7, 3, 20, 5...

Data Preprocessing

We need to make sure that our data is properly scaled in order to get a useful PCA. Here I use the scale() function to scale the numeric variables and store the result as spotify_scale.

spotify_scale <- scale(spotify_ppt, center = T, scale = T)
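As a quick illustration of what scale() does, the sketch below standardizes a small made-up data frame (toy and its columns are hypothetical) and checks that every column ends up with mean 0 and standard deviation 1.

```r
# Hypothetical features on very different scales (not the Spotify data)
toy <- data.frame(
  tempo  = c(90, 120, 150, 180),
  length = c(150, 200, 250, 300),
  loud   = c(-8, -6, -5, -3)
)

# Center each column to mean 0 and scale it to standard deviation 1
toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

print(round(colMeans(toy_scaled), 10))  # all zero up to rounding
print(apply(toy_scaled, 2, sd))         # all one

# The means and standard deviations that were used are kept as attributes
print(attr(toy_scaled, "scaled:center"))
print(attr(toy_scaled, "scaled:scale"))
```

The stored attributes are useful later if cluster centroids need to be converted back to the original units.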

K-Means Clustering

Data clustering is a common data mining technique that creates clusters of data which can be identified as “data with some characteristics”. Since there are no outliers in our data, we do not need an outlier-removal step and can proceed to the next step.

Choosing Optimum K

The next step in building a K-means clustering is to find the optimum number of clusters to model our data. We use the kmeansTunning() function defined below to find the optimum K with the elbow method, setting maxK to 7 so that the plot covers 2 to 7 clusters.

RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK){
  withinall <-  NULL
  total_k <-  NULL
  for (i in 2: maxK){
    set.seed(101)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <-  append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total Within")
}

kmeansTunning(spotify_scale, maxK = 7)

Based on the elbow plot generated above, the optimal number of clusters is 6.
K-means is a clustering algorithm that groups the data based on distance. The resulting clusters are stated to be optimum if the distance between data in the same cluster is low and the distance between data from different clusters is high.
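The elbow idea can be tried on data where the right answer is known. The following is a minimal sketch on simulated data, not the Spotify data: three well-separated blobs are generated, and the total within-cluster sum of squares is computed for K = 2 to 7. The names blob_centers and toy, and the choice of nstart = 25, are assumptions made for this illustration.

```r
set.seed(101)

# Hypothetical 2-D data: three well-separated blobs of 30 points each
blob_centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, blob_centers[i, 1]), rnorm(30, blob_centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 2..7;
# nstart = 25 reruns each fit from 25 random starts and keeps the best one
k_values <- 2:7
tot_within <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

print(round(tot_within, 1))
plot(k_values, tot_within, type = "o",
     xlab = "Number of clusters", ylab = "Total within sum of squares")
```

On data like this the curve should drop sharply up to the true number of groups and flatten afterwards; on real data the bend is usually less clear, so the choice of K remains partly a judgment call.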

Building Cluster

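Once we have found the optimum K in the previous section, we run K-means clustering on our data and store the result as spotify_cluster. We use set.seed(101) to make the example reproducible. Note that the call below clusters the unscaled spotify_ppt, whereas the elbow plot above was built from spotify_scale; the sums of squares reported later come from this unscaled fit. We then extract the cluster labels from the resulting K-means object with spotify_cluster$cluster and add them to spotify_ppt as a new column named cluster.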

set.seed(101)
spotify_cluster <- kmeans(spotify_ppt, 6)
spotify_ppt$cluster <- spotify_cluster$cluster
spotify_ppt$cluster <- as.factor(spotify_ppt$cluster)
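Because K-means groups observations by Euclidean distance, a feature measured on a large scale carries more weight than one measured on a small scale. The sketch below uses simulated data (the columns signal and noise and the vector truth are made up) to compare a fit on raw values with a fit on scale()d values. It is only an illustration of the effect, not a rerun of the analysis above.

```r
set.seed(101)

# Hypothetical data: 'signal' separates two groups, 'noise' has a large scale
toy <- data.frame(
  signal = c(rnorm(20, mean = 0), rnorm(20, mean = 8)),
  noise  = rnorm(40, mean = 200, sd = 50)
)
truth <- rep(1:2, each = 20)

# Same algorithm, once on raw values and once on standardized values
fit_raw    <- kmeans(toy, centers = 2, nstart = 25)
fit_scaled <- kmeans(scale(toy), centers = 2, nstart = 25)

# Cross-tabulate the found clusters against the known groups
print(table(truth, raw = fit_raw$cluster))
print(table(truth, scaled = fit_scaled$cluster))

# Attach the labels as a factor column, mirroring the step above
toy$cluster <- as.factor(fit_scaled$cluster)
```

If the two tables differ noticeably, the raw fit is being driven by the large-scale column rather than by the group structure.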

Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
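These properties can be checked numerically. The sketch below uses base R's prcomp() on the built-in USArrests data set as a stand-in (an assumption for illustration; the analysis itself uses FactoMineR on the Spotify data).

```r
# USArrests ships with base R and is used here only as a stand-in data set
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance of each component, in decreasing order
variances <- pca$sdev^2
print(round(variances, 3))

# Component scores are uncorrelated: off-diagonal correlations are ~0
print(round(cor(pca$x), 10))

# The loading vectors form an orthonormal basis (identity matrix)
print(round(crossprod(pca$rotation), 10))
```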

Build Principal Component

Next, we generate the principal components from spotify_ppt using the FactoMineR library. The PCA() function standardizes the variables itself when scale.unit = T, and the cluster column (column 10) is passed through quali.sup as a supplementary qualitative variable so that it does not influence the components. We store the result as pca_spotify.

library(FactoMineR) # for PCA
pca_spotify <- PCA(spotify_ppt, quali.sup =10, graph = F, scale.unit = T)

# plot
plot.PCA(pca_spotify, choix = "ind", label = "none", habillage = 10)

Then check the summary of the pca_spotify.

summary(pca_spotify)

## 
## Call:
## PCA(X = spotify_ppt, scale.unit = T, quali.sup = 10, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               2.252   1.578   1.273   1.015   0.898   0.732   0.692
## % of var.             25.020  17.532  14.144  11.282   9.982   8.139   7.691
## Cumulative % of var.  25.020  42.553  56.697  67.979  77.961  86.100  93.791
##                        Dim.8   Dim.9
## Variance               0.335   0.224
## % of var.              3.723   2.486
## Cumulative % of var.  97.514 100.000
## 
## Individuals (the 10 first)
##                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                |  1.886 |  0.154  0.021  0.007 | -0.310  0.122  0.027 |
## 2                |  3.269 |  2.085  3.860  0.407 |  0.189  0.045  0.003 |
## 3                |  4.937 | -0.002  0.000  0.000 |  4.103 21.336  0.691 |
## 4                |  1.874 | -0.689  0.422  0.135 | -0.050  0.003  0.001 |
## 5                |  2.816 | -0.725  0.467  0.066 |  0.039  0.002  0.000 |
## 6                |  2.102 |  1.293  1.485  0.378 | -0.314  0.125  0.022 |
## 7                |  3.624 | -1.589  2.242  0.192 |  2.276  6.565  0.394 |
## 8                |  2.364 | -0.052  0.002  0.000 | -0.098  0.012  0.002 |
## 9                |  2.182 | -0.066  0.004  0.001 |  0.394  0.196  0.033 |
## 10               |  3.905 | -3.226  9.242  0.682 |  1.025  1.333  0.069 |
##                   Dim.3    ctr   cos2  
## 1                -1.322  2.745  0.491 |
## 2                -0.135  0.028  0.002 |
## 3                 2.071  6.738  0.176 |
## 4                -0.159  0.040  0.007 |
## 5                 1.284  2.589  0.208 |
## 6                -1.315  2.718  0.392 |
## 7                -0.457  0.329  0.016 |
## 8                 1.211  2.303  0.262 |
## 9                -1.789  5.031  0.672 |
## 10               -0.063  0.006  0.000 |
## 
## Variables
##                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Beats.Per.Minute | -0.231  2.375  0.053 |  0.840 44.735  0.706 |  0.082  0.522
## Energy           |  0.845 31.691  0.714 |  0.339  7.303  0.115 | -0.002  0.000
## Danceability     |  0.126  0.710  0.016 | -0.084  0.451  0.007 | -0.737 42.615
## Loudness..dB..   |  0.813 29.325  0.660 |  0.115  0.842  0.013 |  0.151  1.792
## Liveness         |  0.371  6.127  0.138 | -0.238  3.597  0.057 |  0.578 26.252
## Valence.         |  0.502 11.180  0.252 |  0.212  2.860  0.045 | -0.406 12.962
## Length.          |  0.359  5.718  0.129 |  0.009  0.006  0.000 |  0.355  9.896
## Acousticness..   | -0.362  5.832  0.131 | -0.277  4.868  0.077 |  0.211  3.505
## Speechiness.     | -0.398  7.042  0.159 |  0.747 35.339  0.558 |  0.177  2.456
##                    cos2  
## Beats.Per.Minute  0.007 |
## Energy            0.000 |
## Danceability      0.542 |
## Loudness..dB..    0.023 |
## Liveness          0.334 |
## Valence.          0.165 |
## Length.           0.126 |
## Acousticness..    0.045 |
## Speechiness.      0.031 |
## 
## Supplementary categories
##                      Dist    Dim.1   cos2 v.test    Dim.2   cos2 v.test  
## cluster_1        |  1.458 | -1.042  0.511 -3.032 |  0.434  0.089  1.510 |
## cluster_2        |  2.697 | -0.191  0.005 -0.297 |  2.412  0.800  4.480 |
## cluster_3        |  2.162 |  1.820  0.708  3.135 | -0.521  0.058 -1.072 |
## cluster_4        |  1.835 | -0.370  0.041 -0.808 | -1.136  0.383 -2.966 |
## cluster_5        |  1.285 |  0.140  0.012  0.387 | -0.553  0.186 -1.828 |
## cluster_6        |  3.224 |  2.047  0.403  2.412 |  0.802  0.062  1.129 |
##                   Dim.3   cos2 v.test  
## cluster_1        -0.532  0.133 -2.059 |
## cluster_2         1.016  0.142  2.100 |
## cluster_3         0.597  0.076  1.368 |
## cluster_4         0.886  0.233  2.575 |
## cluster_5        -0.704  0.300 -2.587 |
## cluster_6        -0.012  0.000 -0.018 |

Based on the summary, if we tolerate no more than 20% information loss, we need to keep 6 principal components (PCs), which together retain 86.1% of the variance; the first 5 retain only 78.0%.
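The number of components can be derived directly from the eigenvalues. The sketch below copies the eigenvalues from the summary output above and finds the smallest number of components whose cumulative share of variance reaches 80%. The variable names are chosen for illustration.

```r
# Eigenvalues copied from the summary output above
eigenvalues <- c(2.252, 1.578, 1.273, 1.015, 0.898, 0.732, 0.692, 0.335, 0.224)

# Cumulative share of variance retained by the first k components
cum_share <- cumsum(eigenvalues) / sum(eigenvalues)
print(round(100 * cum_share, 1))

# Smallest k whose cumulative share reaches 80%
n_components <- which(cum_share >= 0.80)[1]
print(n_components)
```

With the fitted object at hand, the same table should be available as pca_spotify$eig, so the eigenvalues would not need to be typed in by hand.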

Another great application of PCA is visualizing high-dimensional data in a 2-dimensional plot for various purposes, such as cluster analysis or outlier detection. To visualize the PCA, we apply the plot.PCA() function to pca_spotify. This generates an individual PCA plot.

plot.PCA(pca_spotify)

As we can see, there are no outliers in our data.
We can also create a variable PCA plot that shows the variable loading information of the PCA by simply adding choix = "var" to plot.PCA(). The loading information is represented by the length of the arrow from the origin: the longer the arrow, the larger the loading of that variable. However, this may not be an efficient method if we have many features, because some variables would overlap each other, making the variable names hard to read.

An alternative way to extract the loading information is to apply the dimdesc() function to pca_spotify. We store the result as pca_dimdesc and inspect the loading information of the first dimension/PC by calling pca_dimdesc$Dim.1, since the first dimension is the one that holds the most information.

pca_dimdesc <-  dimdesc(pca_spotify)

pca_dimdesc$Dim.1

## $quanti
##                correlation                p.value
## Energy           0.8447698 0.00000000000001244339
## Loudness..dB..   0.8126227 0.00000000000077516122
## Valence.         0.5017403 0.00020554496547279090
## Liveness         0.3714510 0.00791004433592319527
## Length.          0.3588254 0.01049895605194324857
## Acousticness..  -0.3623999 0.00970087606008841058
## Speechiness.    -0.3982108 0.00418260004773495647
## 
## $quali
##                R2       p.value
## cluster 0.4379973 0.00008282538
## 
## $category
##                    Estimate     p.value
## cluster=cluster_3  1.419174 0.001110847
## cluster=cluster_6  1.646350 0.014255483
## cluster=cluster_1 -1.442859 0.001677372
## 
## attr(,"class")
## [1] "condes" "list "

Energy and Loudness are the two variables that contribute most to PC 1. This makes sense, given that 16% of the songs in the 2019 Spotify top tracks are in the dance pop genre.
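The correlations reported for a dimension are the correlations between each original variable and the scores on that component. A minimal sketch with base R's prcomp() on the built-in USArrests data set (a stand-in, not the Spotify data) shows this and confirms that, for standardized variables, the correlation equals the loading multiplied by the component's standard deviation.

```r
# USArrests ships with base R and is used here only as a stand-in data set
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Correlation of every original variable with the first component's scores
cor_pc1 <- cor(USArrests, pca$x[, "PC1"])[, 1]
print(round(sort(cor_pc1), 3))

# Equivalent: loading times the component's standard deviation
same <- pca$rotation[, "PC1"] * pca$sdev[1]
print(all.equal(cor_pc1, same))
```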

Combining PCA with K-Means

plot.PCA(pca_spotify, choix = "var", col.ind = spotify_ppt$cluster)

library(factoextra)
fviz_cluster(spotify_cluster,
             data = spotify_ppt[,-10])+
  theme_minimal()

Goodness of Fit

We can measure how good our clustering model is with 3 values:
* Within Sum of Squares (withinss): the sum of squared distances from each observation to the centroid of its cluster, reported per cluster.
* Total Sum of Squares (totss): the sum of squared distances from each observation to the global sample mean (overall data average).
* Between Sum of Squares (betweenss): the sum of squared distances from each cluster centroid to the global sample mean, weighted by the number of observations in the cluster.

spotify_cluster$withinss

## [1] 28670.214  8828.800  6850.333 12348.444 18546.462  4674.667

spotify_cluster$totss

## [1] 193255

spotify_cluster$betweenss

## [1] 113336.1

The closer the value of betweenss/totss is to 1, the better the clustering. So here we inspect that value:

spotify_cluster$betweenss/spotify_cluster$totss

## [1] 0.5864587

Based on the value above, about 59% of the total variation is explained by the cluster assignment, so our model is fairly good at clustering the Spotify 2019 top songs.
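The three quantities are linked: the total sum of squares is the sum of the within-cluster and between-cluster parts. The short check below uses the numbers printed above, typed in by hand, to verify the decomposition and reproduce the ratio.

```r
# Values copied from the kmeans output above
withinss  <- c(28670.214, 8828.800, 6850.333, 12348.444, 18546.462, 4674.667)
betweenss <- 113336.1
totss     <- 193255

# The total sum of squares splits into a within part and a between part
tot_withinss <- sum(withinss)
print(tot_withinss + betweenss)  # close to totss, up to rounding of the printout

# Share of the total variation explained by the cluster assignment
ratio <- betweenss / totss
print(round(ratio, 4))
```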

Conclusion

Summarize Cluster

spotify_ppt %>% 
  group_by(cluster) %>% 
  summarise_all("mean")

From the summary above, we can see that the data has been clustered into 6 categories, each with its own distinct characteristics. There are 6 big groups of popular songs that people listened to on Spotify in 2019.
* Cluster 1 has high beats per minute and danceability. Cluster 1, which contains the largest share of our data, therefore consists of upbeat, danceable songs with an average length of 167 seconds (under three minutes), the shortest of all the clusters.
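Statements such as "the largest cluster" and "the shortest average length" can be backed by a per-cluster table of means and sizes. The following sketch uses a small made-up data frame (toy and its values are hypothetical) and base R only, as one possible alternative to the dplyr call above.

```r
# Hypothetical clustered data (not the Spotify result)
toy <- data.frame(
  cluster = factor(c(1, 1, 2, 2, 2)),
  tempo   = c(100, 120, 150, 160, 170),
  length  = c(200, 180, 170, 160, 150)
)

# Mean of every feature per cluster, using base R
cluster_means <- aggregate(. ~ cluster, data = toy, FUN = mean)
print(cluster_means)

# Number of songs per cluster, useful for judging which cluster is largest
cluster_sizes <- table(toy$cluster)
print(cluster_sizes)
```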

Clustering Analysis of Top 50 Spotify Songs - 2019

Meinari

4/17/2020