“Without music, life would be a mistake”, Friedrich Nietzsche
There is no doubt that music has become an important part of people's lives, since they can now access it anywhere and at any time. Spotify is one of the most popular online music platforms and is used by people around the world. Every year, Spotify releases its 50 most popular songs, listed by singer, genre, and so on. In this analysis I would like to group the top 50 songs of 2019 into several categories in order to see what types of songs they are. The data was taken from https://www.kaggle.com/leonardopena/top50spotify2019.
In this step we first load the data from our directory folder and name it spotify.
After that we take a glimpse of the data structure using str().
## 'data.frame': 50 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Track.Name : Factor w/ 50 levels "0.958333333",..: 38 10 7 6 16 20 37 19 30 4 ...
## $ Artist.Name : Factor w/ 38 levels "Ali Gatie","Anuel AA",..: 33 2 3 10 29 10 21 31 20 5 ...
## $ Genre : Factor w/ 21 levels "atl hip hop",..: 7 20 9 16 10 16 21 16 8 12 ...
## $ Beats.Per.Minute: int 117 105 190 93 150 102 180 111 136 135 ...
## $ Energy : int 55 81 80 65 65 68 64 68 62 43 ...
## $ Danceability : int 76 79 40 64 58 80 75 48 88 70 ...
## $ Loudness..dB.. : int -6 -4 -4 -8 -4 -5 -6 -5 -6 -11 ...
## $ Liveness : int 8 8 16 8 11 9 7 8 11 10 ...
## $ Valence. : int 75 61 70 55 18 84 23 35 64 56 ...
## $ Length. : int 191 302 186 198 175 220 131 202 157 194 ...
## $ Acousticness.. : int 4 8 12 12 45 9 2 15 5 33 ...
## $ Speechiness. : int 3 9 46 19 7 4 29 9 10 38 ...
## $ Popularity : int 79 92 85 86 94 84 92 90 87 95 ...
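The loading and inspection steps above can be sketched as follows. This is only an illustration: the real CSV path is not given in this write-up, so the snippet writes a tiny hypothetical file (made-up rows, a subset of the columns) to a temporary location and reads that instead. The empty first header field is an assumption about how the Kaggle file is laid out; it is what makes read.csv() name the index column X.

```r
# Hypothetical stand-in for the Kaggle file: two made-up rows in a temp CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",Track.Name,Artist.Name,Genre,Beats.Per.Minute,Popularity",
  "1,Song A,Artist A,pop,117,79",
  "2,Song B,Artist B,latin,105,92"
), csv_path)

# stringsAsFactors = TRUE reproduces the Factor columns seen in the str() output
spotify <- read.csv(csv_path, stringsAsFactors = TRUE)

# Compact overview: one line per column with its type and first values
str(spotify)
```

With the real file, csv_path would simply point at the downloaded CSV.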
Variable Description :
* Track.Name : Name of the track (song title)
* Artist.Name : Name of the artist (singer)
* Genre : The genre of the track
* Beats.Per.Minute : The tempo of the song
* Energy : The energy of the song - the higher the value, the more energetic the song
* Danceability : The higher the value, the easier it is to dance to this song
* Loudness..dB.. : The higher the value, the louder the song
* Liveness : The higher the value, the more likely the song is a live recording
* Valence. : The higher the value, the more positive mood for the song
* Length. : The duration of the song
* Acousticness.. : The higher the value the more acoustic the song is
* Speechiness. : The higher the value the more spoken word the song contains
* Popularity : The higher the value the more popular the song is
The variable X is not used in this analysis, so we first remove it using the tidyverse library. Fortunately the rest of our data is already in an appropriate format, so we do not have to convert any data types.
Then, we inspect whether there are any missing values in our observations using `colSums(is.na())`.
## Track.Name Artist.Name Genre Beats.Per.Minute
## 0 0 0 0
## Energy Danceability Loudness..dB.. Liveness
## 0 0 0 0
## Valence. Length. Acousticness.. Speechiness.
## 0 0 0 0
## Popularity
## 0
There is no missing data in our data frame, so we can proceed to the next step.
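The two clean-up steps above (dropping X and counting missing values) might look like the sketch below. The three-row data frame is a hypothetical stand-in for spotify, and base R is used for the column drop so the snippet runs without extra packages; with the tidyverse loaded, the equivalent is spotify %>% select(-X).

```r
# Hypothetical three-row stand-in for the spotify data frame
spotify <- data.frame(
  X = 1:3,
  Track.Name = c("Song A", "Song B", "Song C"),
  Energy = c(55L, 81L, 80L),
  Popularity = c(79L, 92L, 85L)
)

# Drop the index column X (dplyr equivalent: spotify %>% select(-X))
spotify <- spotify[, names(spotify) != "X"]

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(spotify))
print(na_counts)
```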
In this step I would like to explore the data through several analyses.
library(plotly) # for interactive plots
library(glue) # for building the tooltip text
# ten most popular songs, with a tooltip string for each bar
top10_song <- spotify %>%
  arrange(desc(Popularity)) %>%
  head(10) %>%
  select(c(Track.Name, Artist.Name, Genre, Popularity, Length.)) %>%
  mutate(mean_length = mean(Length.),
         text = glue("Artist = {Artist.Name}\nGenre = {Genre}"))
# horizontal bar chart ordered by popularity
plot_top10_song <- ggplot(data = top10_song, aes(x = reorder(Track.Name, Popularity),
                                                 y = Popularity,
                                                 text = text,
                                                 label = Popularity))+
  geom_col(aes(fill = Popularity), show.legend = FALSE)+
  theme_bw()+
  coord_flip()+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12, colour = "black"),
        title = element_text(size = 12, colour = "black"))+
  geom_text(aes(label = Popularity), color = "white", size = 6, fontface = "bold", position = position_stack(0.8))+
  labs(title = "Top 10 Song on Spotify in 2019",
       x = "Song Title",
       y = "Popularity Rate",
       caption = "Source : Kaggle Dataset")
ggplotly(plot_top10_song, tooltip = "text")
Bad Guy, sung by Billie Eilish and in the electropop genre, was the most popular song played on Spotify in 2019.
# share of the 50 songs that belongs to each genre, keeping the three largest
top3_genre <- spotify %>%
  group_by(Genre) %>%
  summarise(song = n()) %>%
  ungroup() %>%
  mutate(song = song/50) %>%
  arrange(desc(song)) %>%
  head(3)
library(ggplot2) # to make plots
plot_top3_genre <- ggplot(data = top3_genre, aes(x = reorder(Genre, song),
                                                 y = song,
                                                 label = song))+
  geom_col(aes(fill = song), show.legend = FALSE)+
  theme_bw()+
  coord_flip()+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, colour = "black"),
        title = element_text(size = 14, colour = "black"))+
  geom_text(aes(label = scales::percent(song)), color = "white", size = 12, fontface = "bold", position = position_stack(0.7))+
  labs(title = "Top 3 Genre of Spotify Most Popular Song 2019",
       x = "Genre of Music",
       y = "Rate of Genre",
       caption = "Source : Kaggle Dataset")
plot_top3_genre
Based on the graph above, 40% of Spotify's popular songs in 2019 fall into just 3 genres: the most common is dance pop (16%), followed by pop (14%) and latin (10%). Every other genre accounts for 4% or less.
This is a little intriguing, since the most popular track of 2019 is an electropop song.
Based on my business judgment, we deselect the variables that are probably not suitable for this analysis, namely the non-numeric variables, which are not related to this classification.
spotify_ppt <- spotify %>%
  select_if(is.numeric) %>%
  select(-Popularity) # dropped even though it is an integer, since it is not related to this classification
glimpse(spotify_ppt)
## Observations: 50
## Variables: 9
## $ Beats.Per.Minute <int> 117, 105, 190, 93, 150, 102, 180, 111, 136, 135, 1...
## $ Energy <int> 55, 81, 80, 65, 65, 68, 64, 68, 62, 43, 62, 71, 41...
## $ Danceability <int> 76, 79, 40, 64, 58, 80, 75, 48, 88, 70, 61, 82, 50...
## $ Loudness..dB.. <int> -6, -4, -4, -8, -4, -5, -6, -5, -6, -11, -5, -4, -...
## $ Liveness <int> 8, 8, 16, 8, 11, 9, 7, 8, 11, 10, 24, 15, 11, 6, 1...
## $ Valence. <int> 75, 61, 70, 55, 18, 84, 23, 35, 64, 56, 24, 38, 45...
## $ Length. <int> 191, 302, 186, 198, 175, 220, 131, 202, 157, 194, ...
## $ Acousticness.. <int> 4, 8, 12, 12, 45, 9, 2, 15, 5, 33, 60, 28, 75, 7, ...
## $ Speechiness. <int> 3, 9, 46, 19, 7, 4, 29, 9, 10, 38, 31, 7, 3, 20, 5...
We need to make sure that our data is properly scaled in order to get a useful PCA. Here I use the scale() function to scale the numeric variables and store the result as spotify_scale.
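A minimal sketch of the scaling step, using only the first five values of two columns from the str() output shown earlier (the real spotify_ppt has 50 rows and 9 columns). After scale(), every column has mean 0 and standard deviation 1, so a variable measured in large units such as Beats.Per.Minute no longer dominates the distance calculations.

```r
# Small stand-in for spotify_ppt: first five values of two columns
spotify_ppt <- data.frame(
  Beats.Per.Minute = c(117, 105, 190, 93, 150),
  Energy = c(55, 81, 80, 65, 65)
)

# scale() centers each column on its mean and divides by its standard deviation
spotify_scale <- scale(spotify_ppt)

col_means <- colMeans(spotify_scale)      # numerically zero
col_sds <- apply(spotify_scale, 2, sd)    # exactly one, up to rounding
print(round(col_means, 10))
print(col_sds)
```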
Data clustering is a common data mining technique for creating clusters of data that can be identified as “data with some characteristics”. Since our data has no outliers, we do not need an outlier-removal step and can proceed to the next step.
The next step in building a K-means clustering is to find the optimum number of clusters to model our data. Use the kmeansTunning() function defined below to find the optimum K using the elbow method. Set maxK to 7 so that the plot covers at most 7 clusters.
RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK){
  withinall <- NULL
  total_k <- NULL
  # fit K-means for every K from 2 to maxK and record the total within-cluster sum of squares
  for (i in 2:maxK){
    set.seed(101)
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total Within")
}
kmeansTunning(spotify_scale, maxK = 7)
Based on the elbow plot generated above, the optimal number of clusters is 6.
K-means is a clustering algorithm that groups the data based on distance. The resulting clusters are stated to be optimum if the distance between data in the same cluster is low and the distance between data from different clusters is high.
Once we have found the optimum K in the previous section, we run K-means clustering on our data and store the result as spotify_cluster. Use set.seed(101) to guarantee a reproducible example. Extract the cluster information from the resulting K-means object using spotify_cluster$cluster and add it as a new column named cluster to the spotify_ppt dataset.
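The clustering step described above can be sketched as follows. To keep the example small it uses only the first ten values of three columns from the str() output, and k = 3 rather than the 6 clusters chosen from the elbow plot; both are assumptions made for illustration. Storing the labels as a factor is also an assumption, made so that the column can later serve as a qualitative variable.

```r
# First ten values of three features, taken from the str() output
songs <- data.frame(
  Beats.Per.Minute = c(117, 105, 190, 93, 150, 102, 180, 111, 136, 135),
  Energy = c(55, 81, 80, 65, 65, 68, 64, 68, 62, 43),
  Danceability = c(76, 79, 40, 64, 58, 80, 75, 48, 88, 70)
)
songs_scale <- scale(songs)

k <- 3            # illustration only; the analysis uses 6 on all 50 songs
set.seed(101)     # K-means starts from random centers, so fix the seed
spotify_cluster <- kmeans(songs_scale, centers = k)

# Attach each song's cluster label to the unscaled data
songs$cluster <- as.factor(spotify_cluster$cluster)
cluster_sizes <- table(songs$cluster)
print(cluster_sizes)
```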
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
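We have prepared the data to be used for PCA. Next, we generate the principal components from spotify_ppt. Recall how we use the FactoMineR library to perform PCA: call the PCA() function from the library and store the result as pca_spotify. Because the call sets scale.unit = T, PCA() standardizes the numeric variables itself, and quali.sup = 10 marks the tenth column (cluster) as a supplementary qualitative variable.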
library(FactoMineR) # for PCA
pca_spotify <- PCA(spotify_ppt, quali.sup =10, graph = F, scale.unit = T)
# plot
plot.PCA(pca_spotify, choix = "ind", label = "none", habillage = 10)
Then check the summary of pca_spotify.
##
## Call:
## PCA(X = spotify_ppt, scale.unit = T, quali.sup = 10, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 2.252 1.578 1.273 1.015 0.898 0.732 0.692
## % of var. 25.020 17.532 14.144 11.282 9.982 8.139 7.691
## Cumulative % of var. 25.020 42.553 56.697 67.979 77.961 86.100 93.791
## Dim.8 Dim.9
## Variance 0.335 0.224
## % of var. 3.723 2.486
## Cumulative % of var. 97.514 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 1.886 | 0.154 0.021 0.007 | -0.310 0.122 0.027 |
## 2 | 3.269 | 2.085 3.860 0.407 | 0.189 0.045 0.003 |
## 3 | 4.937 | -0.002 0.000 0.000 | 4.103 21.336 0.691 |
## 4 | 1.874 | -0.689 0.422 0.135 | -0.050 0.003 0.001 |
## 5 | 2.816 | -0.725 0.467 0.066 | 0.039 0.002 0.000 |
## 6 | 2.102 | 1.293 1.485 0.378 | -0.314 0.125 0.022 |
## 7 | 3.624 | -1.589 2.242 0.192 | 2.276 6.565 0.394 |
## 8 | 2.364 | -0.052 0.002 0.000 | -0.098 0.012 0.002 |
## 9 | 2.182 | -0.066 0.004 0.001 | 0.394 0.196 0.033 |
## 10 | 3.905 | -3.226 9.242 0.682 | 1.025 1.333 0.069 |
## Dim.3 ctr cos2
## 1 -1.322 2.745 0.491 |
## 2 -0.135 0.028 0.002 |
## 3 2.071 6.738 0.176 |
## 4 -0.159 0.040 0.007 |
## 5 1.284 2.589 0.208 |
## 6 -1.315 2.718 0.392 |
## 7 -0.457 0.329 0.016 |
## 8 1.211 2.303 0.262 |
## 9 -1.789 5.031 0.672 |
## 10 -0.063 0.006 0.000 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Beats.Per.Minute | -0.231 2.375 0.053 | 0.840 44.735 0.706 | 0.082 0.522
## Energy | 0.845 31.691 0.714 | 0.339 7.303 0.115 | -0.002 0.000
## Danceability | 0.126 0.710 0.016 | -0.084 0.451 0.007 | -0.737 42.615
## Loudness..dB.. | 0.813 29.325 0.660 | 0.115 0.842 0.013 | 0.151 1.792
## Liveness | 0.371 6.127 0.138 | -0.238 3.597 0.057 | 0.578 26.252
## Valence. | 0.502 11.180 0.252 | 0.212 2.860 0.045 | -0.406 12.962
## Length. | 0.359 5.718 0.129 | 0.009 0.006 0.000 | 0.355 9.896
## Acousticness.. | -0.362 5.832 0.131 | -0.277 4.868 0.077 | 0.211 3.505
## Speechiness. | -0.398 7.042 0.159 | 0.747 35.339 0.558 | 0.177 2.456
## cos2
## Beats.Per.Minute 0.007 |
## Energy 0.000 |
## Danceability 0.542 |
## Loudness..dB.. 0.023 |
## Liveness 0.334 |
## Valence. 0.165 |
## Length. 0.126 |
## Acousticness.. 0.045 |
## Speechiness. 0.031 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
## cluster_1 | 1.458 | -1.042 0.511 -3.032 | 0.434 0.089 1.510 |
## cluster_2 | 2.697 | -0.191 0.005 -0.297 | 2.412 0.800 4.480 |
## cluster_3 | 2.162 | 1.820 0.708 3.135 | -0.521 0.058 -1.072 |
## cluster_4 | 1.835 | -0.370 0.041 -0.808 | -1.136 0.383 -2.966 |
## cluster_5 | 1.285 | 0.140 0.012 0.387 | -0.553 0.186 -1.828 |
## cluster_6 | 3.224 | 2.047 0.403 2.412 | 0.802 0.062 1.129 |
## Dim.3 cos2 v.test
## cluster_1 -0.532 0.133 -2.059 |
## cluster_2 1.016 0.142 2.100 |
## cluster_3 0.597 0.076 1.368 |
## cluster_4 0.886 0.233 2.575 |
## cluster_5 -0.704 0.300 -2.587 |
## cluster_6 -0.012 0.000 -0.018 |
Based on the summary, if we tolerate no more than 20% information loss, we need 6 principal components (PCs) for this analysis: the first five retain only 77.96% of the variance, while the first six retain 86.10%.
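The count of 6 components can be checked directly from the "% of var." row of the summary above. The sketch below copies those nine percentages by hand; the 80% threshold is the tolerance stated in the text.

```r
# "% of var." for Dim.1 to Dim.9, copied from the PCA summary
pct_var <- c(25.020, 17.532, 14.144, 11.282, 9.982, 8.139, 7.691, 3.723, 2.486)

# Running total of explained variance
cum_var <- cumsum(pct_var)

# Smallest number of components that keeps at least 80% of the information
n_pc <- which(cum_var >= 80)[1]
print(n_pc)
```

With a fitted FactoMineR object the same percentages are available in pca_spotify$eig, so they do not need to be typed in.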
Another great use of PCA is to visualize high-dimensional data in a 2-dimensional plot for various purposes, such as cluster analysis or outlier detection. To visualize the PCA, apply the plot.PCA() function to pca_spotify. This generates an individual PCA plot.
As we can see, there is no outlier in our data.
We can also create a variable PCA plot that shows the variable loading information of the PCA by simply adding choix = "var" in plot.PCA(). The loading information is represented by the length of the arrow from the center of the coordinates: the longer the arrow, the bigger the loading of that variable. However, this may not be an efficient method if we have many features, since some variables overlap each other, making the variable names hard to read.
An alternative way to extract the loading information is to apply the dimdesc() function to pca_spotify. Store the result as pca_dimdesc, then inspect the loading information of the first dimension/PC by calling pca_dimdesc$Dim.1, since the first dimension is the one that holds the most information.
## $quanti
## correlation p.value
## Energy 0.8447698 0.00000000000001244339
## Loudness..dB.. 0.8126227 0.00000000000077516122
## Valence. 0.5017403 0.00020554496547279090
## Liveness 0.3714510 0.00791004433592319527
## Length. 0.3588254 0.01049895605194324857
## Acousticness.. -0.3623999 0.00970087606008841058
## Speechiness. -0.3982108 0.00418260004773495647
##
## $quali
## R2 p.value
## cluster 0.4379973 0.00008282538
##
## $category
## Estimate p.value
## cluster=cluster_3 1.419174 0.001110847
## cluster=cluster_6 1.646350 0.014255483
## cluster=cluster_1 -1.442859 0.001677372
##
## attr(,"class")
## [1] "condes" "list "
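The correlation column in $quanti is the correlation between each original variable and the scores on the first component. The sketch below illustrates that idea with base R's prcomp() swapped in for FactoMineR, on the first ten values of three columns from the str() output, so the numbers will not match the table above. It also skips the p-value filtering that dimdesc() applies, and the sign of a principal component is arbitrary, so the correlations may come out with flipped signs.

```r
# First ten values of three features, taken from the str() output
songs <- data.frame(
  Energy = c(55, 81, 80, 65, 65, 68, 64, 68, 62, 43),
  Loudness..dB.. = c(-6, -4, -4, -8, -4, -5, -6, -5, -6, -11),
  Acousticness.. = c(4, 8, 12, 12, 45, 9, 2, 15, 5, 33)
)

# PCA on standardized variables (the counterpart of scale.unit = T)
pca <- prcomp(songs, center = TRUE, scale. = TRUE)

# Score of every song on the first principal component
pc1_scores <- pca$x[, 1]

# Correlation of each original variable with PC 1
pc1_cor <- sapply(songs, function(v) cor(v, pc1_scores))
print(sort(pc1_cor, decreasing = TRUE))
```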
Energy and Loudness are the two variables that contribute most to PC 1. This makes sense, since 16% of the songs in the 2019 Spotify top 50 are dance pop.
# Goodness of Fit
We can measure how good our clustering model is with 3 values:
* Within Sum of Squares (withinss): the sum of squared distances from each observation to the centroid of its cluster, reported per cluster.
* Total Sum of Squares (totss): the sum of squared distances from each observation to the global sample mean (overall data average).
* Between Sum of Squares (betweenss): the sum of squared distances from each cluster centroid to the global sample mean, weighted by cluster size.
## [1] 28670.214 8828.800 6850.333 12348.444 18546.462 4674.667
## [1] 193255
## [1] 113336.1
The closer the value of betweenss/totss is to 1, the better the clustering. So here we inspect that value:
## [1] 0.5864587
Based on the value above (about 0.59), we can see that our model is fairly good at clustering the Spotify 2019 top songs.
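As a quick check on the three measures, the printed values can be plugged in by hand: the per-cluster withinss values plus betweenss should add up to totss, and betweenss divided by totss gives the ratio reported above. The numbers below are copied from the output and therefore rounded; on a fitted model they come from spotify_cluster$withinss, spotify_cluster$totss and spotify_cluster$betweenss.

```r
# Values copied from the printed output (rounded)
withinss <- c(28670.214, 8828.800, 6850.333, 12348.444, 18546.462, 4674.667)
totss <- 193255
betweenss <- 113336.1

# Decomposition: total = within (summed over clusters) + between
tot_withinss <- sum(withinss)
gap <- totss - (tot_withinss + betweenss)   # close to zero, up to rounding

# Share of the total variation that is explained by the clusters
ratio <- betweenss / totss
print(round(ratio, 4))
```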
From the graph, we can see that the data has been clustered into 6 categories, each with its own distinct characteristics. There are 6 big groups of popular songs that people listened to on Spotify in 2019.
* Cluster 1 has high beats per minute and danceability. Cluster 1, the largest cluster in our grouping, therefore consists of upbeat, danceable songs with an average length of 167 seconds (under three minutes), the shortest of all the clusters.