The aim of this research paper is to evaluate the feasibility of using clustering algorithms, specifically K-Means, to auto-assign music genres to tracks based on their audio characteristics. The dataset contains 15,150 observations and 19 variables describing tracks, including features such as danceability, energy, loudness, and others. The end goal is to uncover patterns and structures within the dataset using clustering and analyze how they align with predefined genre labels.
Data source: https://www.kaggle.com/datasets/thebumpkin/10400-classic-hits-10-genres-1923-to-2023
Variables are defined as follows, according to the creator of the dataset:
Track: The title of the song.
Artist: The name of the performing artist or band.
Year: The year the track was released.
Duration: Length of the track in milliseconds.
Time_Signature: The musical time signature of the track (e.g., x/4).
Danceability: A measure of how suitable the track is for dancing, ranging from 0.0 to 1.0.
Energy: A measure of the intensity and activity of the track, ranging from 0.0 to 1.0.
Key: The key of the track (e.g., 0=C).
Loudness: The overall loudness of the track in decibels (dB).
Mode: The modality of the track, typically major (1) or minor (0).
Speechiness: A measure indicating the presence of spoken words in the track, ranging from 0.0 to 1.0.
Acousticness: A measure of how acoustic the track is, ranging from 0.0 to 1.0.
Instrumentalness: A measure of the likelihood that the track is instrumental, ranging from 0.0 to 1.0.
Liveness: A measure of the presence of a live audience in the track, ranging from 0.0 to 1.0.
Valence: A measure of the musical positiveness of the track, ranging from 0.0 to 1.0.
Tempo: The speed of the track in beats per minute (BPM).
Popularity: A measure of the track’s popularity, ranging from 0 to 100.
Genre: 19 distinct genres of music.
library(factoextra)
library(cluster)
library(flexclust)
library(fpc)
library(ClusterR)
library(hopkins)
library(FeatureImpCluster)
library(attempt)
library(stats)
library(ggplot2)
library(corrplot)
library(clusterCrit)
library(FactoMineR)
library(psych)
library(gridExtra)
library(data.table)  # provides as.data.table(), used in the feature importance step
The first step is to clean the data. Music tracks often exist in several remasters and editions, which have to be removed here. A row is treated as a duplicate only when both the track name and the duration match, because different tracks often share the same title. After these duplicates were removed, a subset of the data containing only the numerical variables suited for clustering was created.
data <- read.csv("ClassicHit.csv")
duplicate_rows <- data[duplicated(data[,c("Track","Duration")]), ]
data <- data[!duplicated(data[,c("Track","Duration")]), ]
#Removing non-numerical variables
data_c <-data[,c("Danceability", "Energy", "Loudness", "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Valence", "Tempo", "Popularity")]
#Scaling the data
data_n <- scale(data_c)
corrplot(cor(data_c), method = "color")
The correlation plot reveals some clear relationships, such as a negative correlation between Acousticness and Energy and a positive one between Energy and Loudness. These align with intuitive expectations (e.g., loud tracks tend to be more energetic). However, none of the correlations is strong enough to warrant removing a variable.
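The visual check can be complemented by listing the variable pairs numerically. The following is a minimal sketch on synthetic data: the data frame `toy`, its values, and the 0.9 cut-off are assumptions made for illustration and are not taken from the dataset.

```r
# Synthetic stand-in for data_c (hypothetical values, not the real dataset)
set.seed(1)
n <- 200
energy <- runif(n)
toy <- data.frame(
  Energy       = energy,
  Loudness     = 10 * energy + rnorm(n, sd = 1),    # built to correlate positively
  Acousticness = 1 - energy + rnorm(n, sd = 0.15),  # built to correlate negatively
  Tempo        = rnorm(n, mean = 120, sd = 20)      # unrelated to the others
)

cm <- cor(toy)

# Look only at the upper triangle so that each pair appears once
idx <- which(upper.tri(cm), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)

# Hypothetical cut-off for "redundant enough to consider dropping"
redundant <- cor_pairs[abs(cor_pairs$r) > 0.9, ]
```

Replacing `toy` with `data_c` would produce the same kind of listing for the real features. The cut-off is a judgment call and should be chosen to suit the analysis.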
To evaluate how well clustering aligns with the predefined genres, K-Means is first applied with k = 19, corresponding to the number of unique genres in the dataset.
set.seed(42)
data_n <- scale(data_c)
# The first clustering uses 19 clusters, the number of genres in the data set, to check whether clustering produces similar groupings
km1<-eclust(data_n,"kmeans",hc_metric="euclidean",k=19, iter.max = 50)
data$cluster <-km1$cluster
table(data$Genre, data$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## Alt. Rock 67 24 1 13 12 82 31 6 94 127 125 26 5 20 46 35 62
## Blues 31 35 23 113 2 12 27 87 9 8 0 46 13 24 19 8 31
## Country 52 97 25 136 1 1 39 70 0 39 19 29 3 3 28 65 5
## Disco 20 100 5 18 6 0 28 20 4 53 4 245 2 6 52 9 54
## EDM 9 0 0 0 24 58 9 1 108 54 185 36 17 34 37 6 119
## Folk 20 27 60 162 4 0 33 53 2 6 0 14 13 2 10 31 5
## Funk 19 48 7 8 12 1 19 5 5 26 6 90 7 12 24 6 37
## Gospel 9 1 26 27 5 31 47 16 31 8 12 12 5 34 9 19 0
## Jazz 6 12 45 139 1 1 37 69 3 4 1 11 2 7 5 26 63
## Metal 34 3 0 1 1 193 47 2 286 24 88 17 6 55 21 8 98
## Pop 239 367 50 107 122 81 143 138 80 544 360 172 61 41 213 427 99
## Punk 84 10 5 3 9 179 34 6 176 13 10 46 11 57 33 3 59
## R&B 48 116 19 47 16 3 37 46 6 97 23 62 7 7 50 71 6
## Rap 9 9 0 2 291 9 2 3 9 61 91 18 154 12 22 14 9
## Reggae 25 90 5 0 77 1 6 4 3 49 3 85 31 4 17 1 12
## Rock 43 20 13 35 1 70 142 33 79 12 11 154 3 36 21 14 38
## SKA 65 26 1 1 5 30 3 1 24 14 1 82 5 16 15 0 15
## Today 36 9 1 3 52 15 4 7 15 139 158 5 39 7 16 92 0
## World 10 4 15 41 3 4 27 17 4 7 2 37 5 12 9 11 30
##
## 18 19
## Alt. Rock 1 3
## Blues 30 159
## Country 9 212
## Disco 1 25
## EDM 2 0
## Folk 23 97
## Funk 6 12
## Gospel 0 16
## Jazz 255 91
## Metal 1 1
## Pop 82 146
## Punk 2 5
## R&B 3 98
## Rap 0 1
## Reggae 3 23
## Rock 19 44
## SKA 2 21
## Today 1 0
## World 41 35
Modes <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
most_common_cluster_in_genre <- c(1:19)
genre_list<-unique(data$Genre)
data$cluster <- as.numeric(as.character(data$cluster))
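for (i in seq_along(genre_list)) {
  # Keep only the first mode when several clusters tie (e.g., World: clusters 4 and 18)
  most_common_cluster_in_genre[i] <- Modes(data[data$Genre == genre_list[i], "cluster"])[1]
}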
print(most_common_cluster_in_genre)
## [1] 10 19 19 12 11 4 12 7 18 9 10 6 2 5 2 12 12 11 18
The list above gives the dominant cluster for each genre, measured by the number of observations. Many genres share the same dominant cluster, indicating that K-Means with k = 19 fails to adequately distinguish between genres. This could result from overlap between genres in the feature space or from uneven genre sample sizes.
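The same information can be read directly from the contingency table, together with the share of each genre that falls into its dominant cluster. Here is a small sketch on toy labels. The vectors `genre` and `cluster` are hypothetical and only illustrate the mechanics.

```r
# Toy labels (hypothetical): 3 genres, 2 clusters
genre   <- c("Jazz", "Jazz", "Jazz", "Metal", "Metal", "Metal",
             "Punk", "Punk", "Punk", "Punk")
cluster <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 1)

tab <- table(genre, cluster)

# Dominant cluster per genre: the column with the largest count in each row
dominant <- colnames(tab)[apply(tab, 1, which.max)]
names(dominant) <- rownames(tab)

# Share of each genre's tracks that fall into its dominant cluster
share <- apply(tab, 1, max) / rowSums(tab)

# Clusters that are dominant for more than one genre
shared <- names(which(table(dominant) > 1))

print(data.frame(dominant, share = round(share, 2)))
print(shared)
```

A low share means that the dominant cluster holds only a minority of the genre, which is another sign that the genre is spread across several clusters.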
sil <- silhouette(km1$cluster, dist(data_n))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 826 0.08
## 2 2 998 0.13
## 3 3 301 0.09
## 4 4 856 0.14
## 5 5 644 0.14
## 6 6 771 0.12
## 7 7 715 0.09
## 8 8 584 0.09
## 9 9 938 0.11
## 10 10 1285 0.16
## 11 11 1099 0.13
## 12 12 1187 0.14
## 13 13 389 0.02
## 14 14 389 0.09
## 15 15 647 0.08
## 16 16 846 0.12
## 17 17 742 0.12
## 18 18 481 0.19
## 19 19 989 0.13
The average silhouette width is close to 0.12, which indicates low-quality clustering overall. The clusters also vary considerably in size (from 301 to 1,285 observations), and cluster 13 has a silhouette width of only 0.02. In light of these results, the best course of action is to repeat the clustering with a more appropriate number of clusters and determine whether the algorithm can uncover subgroups of tracks that are still meaningful even if they do not align with the traditional genre division. Another aspect of this analysis is whether cross-genre patterns exist in songs.
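For readers unfamiliar with the measure, the silhouette width of one observation is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The sketch below computes it by hand in base R on two synthetic blobs. The data and the helper `sil_width` are illustrative assumptions, not part of the analysis above.

```r
set.seed(1)
# Two well-separated toy blobs (hypothetical data)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
cl <- rep(1:2, each = 20)
d <- as.matrix(dist(x))

# Silhouette width of observation i given a distance matrix and cluster labels
sil_width <- function(i, d, cl) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # mean distance within its own cluster
  others <- setdiff(unique(cl), cl[i])
  # mean distance to each other cluster; keep the nearest one
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

s <- sapply(seq_len(nrow(x)), sil_width, d = d, cl = cl)
mean(s)   # average silhouette width of the toy clustering
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate points lying between clusters, and negative values suggest a point may sit in the wrong cluster.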
fviz_nbclust(data_n, kmeans, method = "wss")
Optimal_Clusters_KMeans(data_n, max_clusters=20, plot_clusters=TRUE, criterion="silhouette")
## [1] 0.0000000 0.1990990 0.1526359 0.1591607 0.1624798 0.1672372 0.1422907
## [8] 0.1363529 0.1436810 0.1322141 0.1308640 0.1315249 0.1306508 0.1271082
## [15] 0.1278780 0.1247002 0.1207977 0.1205069 0.1185581 0.1148549
As the elbow plot shows, 19 clusters is excessive. This is to be expected: in music, distinctions between genres are often not sharp enough to warrant a separate cluster. Combined with the uneven sample size across genres, this makes it very difficult for K-Means to reproduce the man-made genre groupings. For the rest of the analysis, k is therefore set to 4* in order to explore which genres overlap and which cluster each of them is assigned to.
*The silhouette is higher for k = 2, 5 and 6, but k = 4 was chosen to allow a meaningful analysis.
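The elbow criterion itself is easy to reproduce without plotting helpers. The sketch below computes the total within-cluster sum of squares for a range of k with `stats::kmeans`. The three synthetic blobs and the choice of `nstart = 20` are assumptions for illustration only.

```r
set.seed(42)
# Three toy blobs (hypothetical data standing in for data_n)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2),
           cbind(rnorm(50, mean = 0), rnorm(50, mean = 6)))

ks <- 1:8
# Total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# Relative reduction in WSS when moving from k - 1 to k clusters
rel_drop <- -diff(wss) / wss[-length(wss)]
print(round(data.frame(k = ks[-1], rel_drop = rel_drop), 3))
```

The elbow is the point after which the relative reduction becomes small. On real data such as this one the curve is usually smoother, which is why it is combined with the silhouette criterion here.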
km2<-eclust(data_n,"kmeans",hc_metric="euclidean",k=4)
data$cluster<-km2$cluster
table(data$Genre, data$cluster)
##
## 1 2 3 4
## Alt. Rock 461 233 35 51
## Blues 69 156 59 393
## Country 57 297 34 445
## Disco 38 490 46 78
## EDM 473 161 55 10
## Folk 20 104 59 379
## Funk 34 245 32 39
## Gospel 101 41 77 89
## Jazz 35 68 50 625
## Metal 747 52 65 22
## Pop 774 1740 197 761
## Punk 531 116 77 21
## R&B 65 440 49 208
## Rap 150 459 98 9
## Reggae 11 386 19 23
## Rock 307 231 63 187
## SKA 101 185 32 9
## Today 218 302 16 63
## World 42 79 33 160
While not ideal, most genres now have a dominant cluster, which makes analysis and interpretation possible. That said, cluster 3 is not dominant for any genre, so effectively only three clusters are relevant. Similar problems arise for k = 5, which is why k = 3 was chosen for the final clustering.
km3<-eclust(data_n,"kmeans",hc_metric="euclidean",k=3)
data$cluster<-km3$cluster
genre_cluster <- apply(table(data$Genre, data$cluster), 1, which.max)
# Create a data frame mapping clusters to genres
cluster_genres <- split(names(genre_cluster), genre_cluster)
# Convert to a readable table
cluster_table <- data.frame(
Cluster = names(cluster_genres),
Genres = sapply(cluster_genres, function(x) paste(x, collapse = ", ")),
row.names = NULL
)
# Display the table
print(cluster_table, row.names = FALSE)
## Cluster Genres
## 1 Disco, Funk, Pop, R&B, Rap, Reggae, SKA, Today
## 2 Blues, Country, Folk, Jazz, World
## 3 Alt. Rock, EDM, Gospel, Metal, Punk, Rock
This table groups the genres by their dominant cluster. From a logical perspective these groups make sense. The following assumptions will be checked in the next part of the paper:
Group 1 is popular, happy music which people often dance to.
Group 2 is calm music, often with lower tempo.
Group 3 is loud, energetic music which often has long instrumental segments.
km3$centers
## Danceability Energy Loudness Speechiness Acousticness Instrumentalness
## 1 0.7533171 0.1908460 0.1700509 0.1251305 -0.3044365 -0.22082248
## 2 -0.3818617 -1.2349531 -1.0122061 -0.3591653 1.2694734 0.29622639
## 3 -0.7326799 0.7918369 0.6298277 0.1322645 -0.6615557 0.05651161
## Liveness Valence Tempo Popularity
## 1 -0.24062868 0.6477836 -0.1485194 0.22890328
## 2 -0.07199776 -0.4434115 -0.2512287 -0.44483713
## 3 0.40061452 -0.5312408 0.4248492 0.05973275
heatmap(km3$centers, main = "Cluster Centers", labRow = rownames(km3$centers), Rowv = NA, Colv = NA, margins = c(9, 5))
Looking at the centers, Danceability and Valence are the defining characteristics of cluster 1, which is also the most popular of the three clusters. The genres in group 1 are in fact a good fit: Disco, Funk, Pop, R&B and Reggae are known for being positive and usually have lyrics, which is consistent with low Acousticness and Instrumentalness. Group 2 likewise matches the most distinctive centers of its cluster. Jazz, Folk and Blues tracks often have low volume (Loudness = -1.01) and low Energy (-1.23). Instruments are also often at the front (Acousticness = 1.27), and these genres are not the most popular. As predicted, cluster 3 has high Energy and Loudness, though it may be a bit surprising to see Acousticness so low. People also do not often dance to Metal, Punk or Rock, and live recordings of these genres are common on services such as Spotify, which fits the elevated Liveness. Overall, the results are surprisingly positive. With the exception of EDM (electronic dance music) landing in the cluster with the lowest Danceability, the genre groups and their characteristics are well matched and fit real-world stereotypes.
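Because the centers are expressed in standard deviations, it can help to convert them back to the original units (dB, BPM and so on) before interpreting them. A minimal sketch follows, assuming the scaled matrix was produced by `scale()` so that its centering and scaling attributes are available. The data frame `raw` and its values are made up for illustration.

```r
set.seed(7)
# Toy features on very different scales (hypothetical values)
raw <- data.frame(Loudness = rnorm(100, mean = -9, sd = 4),
                  Tempo    = rnorm(100, mean = 120, sd = 25))
scaled <- scale(raw)
km <- kmeans(scaled, centers = 2, nstart = 10)

# Undo the z-score: multiply by the column sd, then add the column mean
mu  <- attr(scaled, "scaled:center")
sdv <- attr(scaled, "scaled:scale")
centers_raw <- sweep(sweep(km$centers, 2, sdv, "*"), 2, mu, "+")

# Cross-check against the per-cluster means of the raw columns
check <- aggregate(raw, by = list(cluster = km$cluster), FUN = mean)
print(round(centers_raw, 2))
print(check)
```

The two printed tables should agree, since a K-Means center is the mean of the points assigned to it. Applying the same two `sweep()` calls to `km3$centers` with the attributes of `data_n` would express the centers above in the original units.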
sil <- silhouette(km3$cluster, dist(data_n))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 6330 0.19
## 2 2 3861 0.15
## 3 3 4496 0.09
The silhouette is still not of the desired quality, which suggests that the data simply has a lot of overlap. This might be caused by the dominance of Pop music in the sample, which is known for often having repetitive patterns.
Dunn3 <- intCriteria(as.matrix(data_n), km3$cluster, c("Dunn"))
CH3 <- intCriteria(as.matrix(data_n), km3$cluster, c("Calinski_Harabasz"))
crit1 <- intCriteria(as.matrix(data_n), km1$cluster, c("Calinski_Harabasz"))
cat("Dunn Index for k=3:", Dunn3[[1]], "\n")
## Dunn Index for k=3: 0.003103574
cat("Calinski Harabasz Index for k=3:", CH3[[1]], "\n")
## Calinski Harabasz Index for k=3: 2714.601
cat("For comparison, Calinski Harabasz Index for k=19:", crit1[[1]], "\n")
## For comparison, Calinski Harabasz Index for k=19: 1260.003
z_scores <- scale(data_n)
# Identify rows with z-scores > 3 or < -3
outliers <- which(abs(z_scores) > 3, arr.ind = TRUE)
The very low Dunn index also confirms that the clusters may be overlapping or poorly defined. It is important to mention that this index is quite sensitive to outliers, of which there are many in the data set (1756, to be exact, found through the Z-score method), so it might be artificially low. Despite that, the conclusion is in line with the silhouette analysis. The Calinski-Harabasz index shows that clustering compactness and quality are better than with 19 clusters.
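One detail of the Z-score method is worth keeping in mind: `which(..., arr.ind = TRUE)` returns one row per flagged value, so a track that is extreme in several features is listed several times. The sketch below shows the difference between counting flagged values and counting affected observations. It uses synthetic data with one deliberately extreme row, both of which are assumptions for illustration.

```r
set.seed(3)
x <- matrix(rnorm(5000), ncol = 5)
x[1, ] <- 10   # one row that is extreme in every column (hypothetical)
z <- scale(x)

# One entry per value with |z| > 3
flagged <- which(abs(z) > 3, arr.ind = TRUE)

n_cells <- nrow(flagged)                      # number of extreme values
n_rows  <- length(unique(flagged[, "row"]))   # number of distinct observations

c(values = n_cells, observations = n_rows)
```

Which of the two counts is more relevant depends on the purpose. For judging how many tracks could distort the Dunn index, the number of distinct observations is arguably the more natural one.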
hopkins(data_n)
## [1] 0.9999976
The Hopkins statistic is surprisingly high, which indicates that the data has a significant clusterable structure. Together with the very low Dunn index, this may mean that the clustering algorithm was suboptimal for this study.
km4<-kcca(data_n, k=4)
FeatureImp_km<-FeatureImpCluster(km4, as.data.table(data_n))
plot(FeatureImp_km)
We can observe that most variables have a considerable impact on the clustering result, especially Danceability, which means there is no need to alter the data set.
# Treat the cluster label as a categorical variable on the x axis
ggplot(data, aes(x = factor(cluster), y = Danceability, fill = factor(cluster))) +
  geom_boxplot()
The figure shows how the most important variable, Danceability, is distributed across the clusters. It is highest in cluster 1, at around 0.7, while for clusters 2 and 3 it hovers around 0.5.
This analysis demonstrates that K-Means clustering can group tracks into meaningful subgroups based on audio features, though complete alignment with predefined genres remains challenging due to feature overlap and noise. With k = 3, we identified:
A cluster of danceable and happy tracks (e.g., Pop, Disco).
A cluster of calm, acoustic tracks (e.g., Jazz, Folk).
A cluster of loud, energetic tracks (e.g., Rock, Metal).
The results are promising but highlight limitations in K-Means for datasets with significant overlap. Future studies could explore alternative clustering methods, such as DBSCAN or Gaussian Mixture Models, and refine the dataset by addressing outliers and dominant genres like Pop.
Prior to conducting PCA, only the numerical columns are extracted from the dataset. The data are also standardized, since PCA is highly sensitive to differences in range across variables, and such differences are present here.
data_pc <- subset(data,select =-c(Track,Artist,Genre,cluster))
data_pca <- subset(data_pc,select=-c(Year,Mode,Key,Time_Signature))
data_pca_scaled <- scale(data_pca)
pca<- prcomp(data_pca_scaled, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6224 1.2773 1.0700 1.04360 1.02289 0.95944 0.90993
## Proportion of Variance 0.2393 0.1483 0.1041 0.09901 0.09512 0.08368 0.07527
## Cumulative Proportion 0.2393 0.3876 0.4917 0.59070 0.68582 0.76950 0.84477
## PC8 PC9 PC10 PC11
## Standard deviation 0.85121 0.65390 0.63893 0.38357
## Proportion of Variance 0.06587 0.03887 0.03711 0.01338
## Cumulative Proportion 0.91064 0.94951 0.98662 1.00000
The table above presents the proportion of variance explained by each of the 11 principal components. A common benchmark is to retain approximately 80-85% cumulative variance, which ensures that most of the original dataset’s information is preserved while reducing dimensionality. Based on this criterion and the scree plot analysis, the optimal number of principal components is seven.
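The selection rule can be checked directly from the reported standard deviations. The sketch below copies them from the `summary(pca)` output and picks the first component at which the cumulative proportion reaches the benchmark. The 0.80 threshold is the lower end of the range mentioned above.

```r
# Standard deviations copied from the summary(pca) output above
sdev <- c(1.6224, 1.2773, 1.0700, 1.04360, 1.02289, 0.95944,
          0.90993, 0.85121, 0.65390, 0.63893, 0.38357)

# Each component's variance is its squared standard deviation
var_explained <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(var_explained)

# First component at which cumulative variance reaches the threshold
n_keep <- which(cum_var >= 0.80)[1]

print(round(cum_var, 4))
print(n_keep)
```

With a fitted object the same calculation would start from `pca$sdev` instead of the hard-coded vector.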
fviz_eig(pca, choice='eigenvalue')
fviz_pca_var(pca, col.var = "black")
From the PCA variable correlation plot, it can be inferred that the key
contributors to Component 1 include Energy, Loudness, and their inverse
Acousticness. Component 2 is primarily influenced by Danceability and
Valence, with additional contributions from Liveness and Duration.
a<-fviz_contrib(pca, "var",axes = 1)
b<-fviz_contrib(pca, "var",axes = 2)
c<-fviz_contrib(pca, "var",axes = 3)
d<-fviz_contrib(pca, "var",axes = 4)
e<-fviz_contrib(pca, "var",axes = 5)
f<-fviz_contrib(pca, "var",axes = 6)
g<-fviz_contrib(pca, "var",axes = 7)
plots <- list(a, b, c, d, e, f, g)
grid.arrange(grobs = plots, nrow = 4, ncol = 2)
pca2<-principal(data_pca_scaled, nfactors=7, rotate="varimax")
pca2
## Principal Components Analysis
## Call: principal(r = data_pca_scaled, nfactors = 7, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 RC7 RC5 RC6 RC4 h2 u2 com
## Duration 0.04 -0.10 0.04 0.04 -0.04 -0.01 0.97 0.96 0.039 1.0
## Danceability 0.01 0.78 -0.25 -0.09 -0.26 0.24 0.02 0.81 0.189 1.7
## Energy 0.93 0.07 0.11 0.03 0.12 0.03 0.00 0.90 0.104 1.1
## Loudness 0.84 -0.01 -0.01 -0.19 0.02 0.05 -0.11 0.76 0.238 1.1
## Speechiness 0.11 0.05 0.06 -0.01 0.06 0.96 -0.01 0.95 0.052 1.0
## Acousticness -0.83 -0.08 0.10 -0.01 -0.04 -0.08 -0.16 0.75 0.255 1.1
## Instrumentalness -0.09 -0.10 0.01 0.97 -0.03 -0.01 0.04 0.95 0.045 1.0
## Liveness 0.15 -0.19 0.82 -0.11 -0.13 0.16 -0.05 0.78 0.215 1.4
## Valence 0.11 0.88 0.10 -0.04 0.12 -0.10 -0.13 0.84 0.165 1.2
## Tempo 0.13 -0.05 -0.03 -0.03 0.95 0.06 -0.04 0.93 0.075 1.1
## Popularity 0.30 -0.19 -0.62 -0.26 -0.19 0.15 -0.17 0.67 0.332 2.7
##
## RC1 RC2 RC3 RC7 RC5 RC6 RC4
## SS loadings 2.43 1.49 1.16 1.06 1.06 1.06 1.03
## Proportion Var 0.22 0.14 0.11 0.10 0.10 0.10 0.09
## Cumulative Var 0.22 0.36 0.46 0.56 0.65 0.75 0.84
## Proportion Explained 0.26 0.16 0.12 0.11 0.11 0.11 0.11
## Cumulative Proportion 0.26 0.42 0.55 0.66 0.78 0.89 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 7 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 7825.4 with prob < NA
##
## Fit based upon off diagonal values = 0.88
print(loadings(pca2), digits=3, cutoff=0.4, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC3 RC7 RC5 RC6 RC4
## Energy 0.929
## Loudness 0.843
## Acousticness -0.834
## Danceability 0.783
## Valence 0.878
## Liveness 0.819
## Popularity -0.619
## Instrumentalness 0.967
## Tempo 0.949
## Speechiness 0.963
## Duration 0.972
##
## RC1 RC2 RC3 RC7 RC5 RC6 RC4
## SS loadings 2.434 1.492 1.159 1.062 1.057 1.056 1.033
## Proportion Var 0.221 0.136 0.105 0.097 0.096 0.096 0.094
## Cumulative Var 0.221 0.357 0.462 0.559 0.655 0.751 0.845
pca2$complexity
## Duration Danceability Energy Loudness
## 1.033393 1.697492 1.077065 1.146466
## Speechiness Acousticness Instrumentalness Liveness
## 1.044693 1.145974 1.043751 1.360893
## Valence Tempo Popularity
## 1.170296 1.057793 2.719432
pca2$uniquenesses
## Duration Danceability Energy Loudness
## 0.03854904 0.18852017 0.10387791 0.23802715
## Speechiness Acousticness Instrumentalness Liveness
## 0.05201880 0.25470491 0.04533429 0.21523704
## Valence Tempo Popularity
## 0.16484983 0.07452223 0.33185604
After examining the contribution plots and loadings, the Popularity variable was excluded from further analysis. It is spread across multiple components and has the highest complexity (2.72) and the highest uniqueness (0.33) of all variables, which suggests that it is not well represented in the principal component structure. Popularity is also not a strictly musical characteristic: it is influenced by external factors such as listener preferences and trends (for example, the rise of hip hop and rap in recent times), so including it may introduce undesired effects.
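To make these two diagnostics concrete, they can be recomputed from the rotated loadings of Popularity. This is a sketch under one assumption: that `psych` reports complexity as Hoffman's index, (Σa²)² / Σa⁴, as described in its documentation. The loadings are the rounded values from the table above, so the results only approximate the reported ones.

```r
# Rotated loadings of Popularity, copied (rounded) from the output above
a <- c(RC1 = 0.30, RC2 = -0.19, RC3 = -0.62, RC7 = -0.26,
       RC5 = -0.19, RC6 = 0.15, RC4 = -0.17)

h2  <- sum(a^2)               # communality: variance explained by the 7 components
u2  <- 1 - h2                 # uniqueness: variance left unexplained
com <- sum(a^2)^2 / sum(a^4)  # Hoffman's complexity index (assumed definition)

print(round(c(h2 = h2, u2 = u2, com = com), 2))
```

A complexity of 1 means a variable loads on a single component, and a value approaching 3 means its variance is spread over roughly three components, which is the pattern seen for Popularity.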
data_pca_scaled_no_pop <-subset(data_pca_scaled,select=-c(Popularity))
pca_no_pop<- prcomp(data_pca_scaled_no_pop, center = TRUE, scale. = TRUE)
summary(pca_no_pop)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5963 1.2730 1.0441 1.0242 0.98209 0.94646 0.90844
## Proportion of Variance 0.2548 0.1621 0.1090 0.1049 0.09645 0.08958 0.08253
## Cumulative Proportion 0.2548 0.4169 0.5259 0.6308 0.72724 0.81682 0.89934
## PC8 PC9 PC10
## Standard deviation 0.65959 0.6512 0.38405
## Proportion of Variance 0.04351 0.0424 0.01475
## Cumulative Proportion 0.94285 0.9852 1.00000
a<-fviz_contrib(pca_no_pop, "var",axes = 1)
b<-fviz_contrib(pca_no_pop, "var",axes = 2)
c<-fviz_contrib(pca_no_pop, "var",axes = 3)
d<-fviz_contrib(pca_no_pop, "var",axes = 4)
e<-fviz_contrib(pca_no_pop, "var",axes = 5)
f<-fviz_contrib(pca_no_pop, "var",axes = 6)
plots2 <- list(a, b, c, d, e, f)
grid.arrange(grobs = plots2, nrow = 3, ncol = 2)
Excluding the Popularity variable allows the 80% variance benchmark to be reached with six principal components instead of seven. The contribution plots indicate that PC3 and PC5 are both largely defined by Tempo, while PC4 and PC6 show high contributions from Speechiness and Liveness. To further investigate whether these components represent distinct musical dimensions or are redundant, biplots for PC3 vs. PC5 and PC4 vs. PC6 were analyzed.
plot1 <- fviz_pca_var(pca_no_pop, axes = c(3,5))
plot2 <- fviz_pca_var(pca_no_pop, axes = c(4,6))
grid.arrange(plot1, plot2, ncol = 2)
The results indicate that while Tempo contributes similarly to both PC3
and PC5, other variables introduce sufficient differentiation to justify
retaining both components. Likewise, for PC4 and PC6, although
Speechiness exhibits similar effects, Liveness varies in direction,
reinforcing the necessity of both components.
pca2_no_pop<-principal(data_pca_scaled_no_pop, nfactors=6, rotate="varimax")
pca2_no_pop
## Principal Components Analysis
## Call: principal(r = data_pca_scaled_no_pop, nfactors = 6, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 RC5 RC6 RC4 h2 u2 com
## Duration 0.16 -0.17 0.71 -0.32 -0.01 0.01 0.66 0.338 1.7
## Danceability 0.03 0.79 -0.05 -0.28 -0.22 0.23 0.81 0.191 1.6
## Energy 0.91 0.08 0.02 0.15 0.14 0.03 0.88 0.124 1.1
## Loudness 0.84 -0.02 -0.23 0.02 0.03 0.05 0.77 0.229 1.2
## Speechiness 0.11 0.04 -0.03 0.05 0.07 0.97 0.97 0.031 1.0
## Acousticness -0.85 -0.07 -0.10 -0.02 0.07 -0.08 0.75 0.252 1.1
## Instrumentalness -0.23 0.01 0.76 0.26 0.04 -0.05 0.70 0.304 1.4
## Liveness 0.07 -0.07 0.02 -0.03 0.98 0.07 0.97 0.031 1.0
## Valence 0.09 0.89 -0.09 0.14 0.07 -0.12 0.85 0.146 1.1
## Tempo 0.16 -0.06 0.00 0.88 -0.03 0.05 0.81 0.186 1.1
##
## RC1 RC2 RC3 RC5 RC6 RC4
## SS loadings 2.39 1.47 1.15 1.08 1.04 1.04
## Proportion Var 0.24 0.15 0.12 0.11 0.10 0.10
## Cumulative Var 0.24 0.39 0.50 0.61 0.71 0.82
## Proportion Explained 0.29 0.18 0.14 0.13 0.13 0.13
## Cumulative Proportion 0.29 0.47 0.61 0.75 0.87 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 6 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.08
## with the empirical chi square 9394.34 with prob < NA
##
## Fit based upon off diagonal values = 0.85
print(loadings(pca2_no_pop), digits=3, cutoff=0.4, sort=TRUE)
##
## Loadings:
## RC1 RC2 RC3 RC5 RC6 RC4
## Energy 0.909
## Loudness 0.845
## Acousticness -0.849
## Danceability 0.791
## Valence 0.894
## Duration 0.709
## Instrumentalness 0.758
## Tempo 0.885
## Liveness 0.977
## Speechiness 0.973
##
## RC1 RC2 RC3 RC5 RC6 RC4
## SS loadings 2.390 1.474 1.153 1.076 1.039 1.036
## Proportion Var 0.239 0.147 0.115 0.108 0.104 0.104
## Cumulative Var 0.239 0.386 0.502 0.609 0.713 0.817
pca2_no_pop$complexity
## Duration Danceability Energy Loudness
## 1.653789 1.625447 1.122370 1.160977
## Speechiness Acousticness Instrumentalness Liveness
## 1.047390 1.075312 1.438552 1.031507
## Valence Tempo
## 1.140092 1.081202
Applying varimax rotation to the PCA results allows for a clearer interpretation of the component structure. The complexity values above show that three variables (Duration, Danceability, and Instrumentalness) have comparatively high complexity, between about 1.4 and 1.7. This suggests that these variables are not fully explained by a single component and instead load on several components. Conversely, variables such as Speechiness, Acousticness, Liveness, and Tempo are predominantly associated with a single principal component.
pca2_no_pop$uniquenesses
## Duration Danceability Energy Loudness
## 0.33826078 0.19062182 0.12440787 0.22930339
## Speechiness Acousticness Instrumentalness Liveness
## 0.03058102 0.25187598 0.30354481 0.03111654
## Valence Tempo
## 0.14622115 0.18589760
The uniqueness values indicate that most of the variance is effectively captured within the principal components. This suggests that PCA successfully models the underlying structure of the dataset, confirming the appropriateness of the selected components.
| Principal Component | Umbrella Name | Most Contributing Variables |
|---|---|---|
| PC1 | Intensity | Energy, Loudness, Acousticness |
| PC2 | Danceability | Danceability, Valence |
| PC3 | Pacing | Duration, Tempo |
| PC4 | Lack of Words | Speechiness, Liveness |
| PC5 | Presence of Instruments | Instrumentalness, Tempo |
| PC6 | Noise | Liveness, Speechiness, Valence |
PC1: Intensity – Represents the overall intensity of a song, as determined by its energy level and loudness. The negative correlation with acousticness suggests that more intense songs tend to have lower acoustic elements.
PC2: Danceability – No need for an umbrella name here as the component is primarily influenced by danceability scores and valence which are both important for dancing.
PC3: Pacing – Represents tempo-related characteristics, including the song’s duration and speed.
PC4: Lack of Words – This component is defined by a negative correlation with speechiness, meaning higher PC4 values correspond to tracks with fewer spoken words.
PC5: Presence of Instruments – Similar to PC4 but in the opposite direction, this component correlates positively with instrumentalness, indicating the presence of more instrumental elements in a track.
PC6: Noise – Reflects ambient and crowd noise characteristics, as inferred from high contributions of liveness and speechiness.
Overall, these principal components effectively capture distinct musical characteristics, supporting the use of PCA as a dimensionality reduction tool for genre classification and track analysis.