Introduction

The aim of this research paper is to evaluate the feasibility of using clustering algorithms, specifically K-Means, to auto-assign music genres to tracks based on their audio characteristics. The dataset contains 15,150 observations and 19 variables describing tracks, including features such as danceability, energy, loudness, and others. The end goal is to uncover patterns and structures within the dataset using clustering and analyze how they align with predefined genre labels.

Data Cleaning and Preprocessing

The dataset is available at https://www.kaggle.com/datasets/thebumpkin/10400-classic-hits-10-genres-1923-to-2023
According to the creator of the dataset, the variables are defined as follows:
Track: The title of the song.
Artist: The name of the performing artist or band.
Year: The year the track was released.
Duration: Length of the track in milliseconds.
Time_Signature: The musical time signature of the track (e.g., x/4).
Danceability: A measure of how suitable the track is for dancing, ranging from 0.0 to 1.0.
Energy: A measure of the intensity and activity of the track, ranging from 0.0 to 1.0.
Key: The key of the track (e.g., 0=C).
Loudness: The overall loudness of the track in decibels (dB).
Mode: The modality of the track, typically major (1) or minor (0).
Speechiness: A measure indicating the presence of spoken words in the track, ranging from 0.0 to 1.0.
Acousticness: A measure of how acoustic the track is, ranging from 0.0 to 1.0.
Instrumentalness: A measure of the likelihood that the track is instrumental, ranging from 0.0 to 1.0.
Liveness: A measure of the presence of a live audience in the track, ranging from 0.0 to 1.0.
Valence: A measure of the musical positiveness of the track, ranging from 0.0 to 1.0.
Tempo: The speed of the track in beats per minute (BPM).
Popularity: A measure of the track’s popularity, ranging from 0 to 100.
Genre: 19 distinct genres of music.

library(factoextra)
library(cluster)
library(flexclust)
library(fpc)
library(ClusterR)
library(hopkins)
library(FeatureImpCluster)
library(attempt)
library(stats)
library(ggplot2)
library(corrplot)
library(clusterCrit)
library(FactoMineR)
library(psych)
library(gridExtra)

The first step is to clean the data. Music tracks often exist in several remasters and editions, and these duplicates need to be removed here. Duplicates were identified by requiring both the track name and the duration to match, since distinct tracks often share a name. After the duplicates were removed, a subset containing only the numerical variables suited to clustering was created.
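As a small illustration of this de-duplication rule, the sketch below applies it to a made-up data frame. The track names, artists and durations are hypothetical and are not taken from the dataset; only the use of `duplicated()` on the Track and Duration columns mirrors the approach described above.

```r
# Hypothetical toy data: three tracks share a title, but only two of them
# also share a duration (for example, a remaster of the same recording).
toy <- data.frame(
  Track    = c("Hello", "Hello", "Hello", "Yesterday"),
  Artist   = c("Artist A", "Artist A (Remaster)", "Artist B", "Artist C"),
  Duration = c(210000, 210000, 185000, 125000),
  stringsAsFactors = FALSE
)

# A row is a duplicate only if BOTH Track and Duration match an earlier row
keep <- !duplicated(toy[, c("Track", "Duration")])
deduped <- toy[keep, ]
print(deduped)
```

The remaster is dropped, while the track by "Artist B" survives because its duration differs even though the title is the same.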

data <- read.csv("ClassicHit.csv")
# Rows repeating an earlier (Track, Duration) pair, kept for inspection
duplicate_rows <- data[duplicated(data[, c("Track", "Duration")]), ]
data <- data[!duplicated(data[, c("Track", "Duration")]), ]
# Removing non-numerical variables
data_c <- data[, c("Danceability", "Energy", "Loudness", "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Valence", "Tempo", "Popularity")]
# Scaling the data (mean 0, standard deviation 1 for each variable)
data_n <- scale(data_c)
corrplot(cor(data_c), method = "color")

The correlation plot reveals several notable correlations, such as a negative relationship between Acousticness and Energy and a positive one between Energy and Loudness. These align with intuitive expectations (e.g., loud tracks tend to be more energetic). However, none of the correlations is strong enough to warrant removing a variable.
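One way to make the "strong enough to remove" judgment explicit is to list every variable pair whose absolute correlation exceeds a chosen cutoff. The sketch below does this on made-up data; the values and the 0.9 threshold are assumptions for illustration, not figures from this study.

```r
# Hypothetical toy variables: Loudness is built to track Energy closely,
# Tempo is unrelated to both.
energy <- 1:10
toy <- data.frame(
  Energy   = energy,
  Loudness = energy + rep(c(0.5, -0.5), 5),
  Tempo    = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

r <- cor(toy)
threshold <- 0.9  # assumed cutoff, not a value used in the paper

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(r) > threshold & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(high_pairs)
```

An empty result on the real data would support keeping all variables under the chosen cutoff.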

Part 1: K-means Clustering

To evaluate how well clustering aligns with predefined genres, the initial application of K-Means clustering is with k = 19, corresponding to the number of unique genres in the dataset.

set.seed(42)
data_n <- scale(data_c)
#The first clustering uses 19 clusters, the number of genres in the data set, to check whether clustering produces similar groupings

km1<-eclust(data_n,"kmeans",hc_metric="euclidean",k=19, iter.max = 50)

data$cluster <-km1$cluster
table(data$Genre, data$cluster)
##            
##               1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##   Alt. Rock  67  24   1  13  12  82  31   6  94 127 125  26   5  20  46  35  62
##   Blues      31  35  23 113   2  12  27  87   9   8   0  46  13  24  19   8  31
##   Country    52  97  25 136   1   1  39  70   0  39  19  29   3   3  28  65   5
##   Disco      20 100   5  18   6   0  28  20   4  53   4 245   2   6  52   9  54
##   EDM         9   0   0   0  24  58   9   1 108  54 185  36  17  34  37   6 119
##   Folk       20  27  60 162   4   0  33  53   2   6   0  14  13   2  10  31   5
##   Funk       19  48   7   8  12   1  19   5   5  26   6  90   7  12  24   6  37
##   Gospel      9   1  26  27   5  31  47  16  31   8  12  12   5  34   9  19   0
##   Jazz        6  12  45 139   1   1  37  69   3   4   1  11   2   7   5  26  63
##   Metal      34   3   0   1   1 193  47   2 286  24  88  17   6  55  21   8  98
##   Pop       239 367  50 107 122  81 143 138  80 544 360 172  61  41 213 427  99
##   Punk       84  10   5   3   9 179  34   6 176  13  10  46  11  57  33   3  59
##   R&B        48 116  19  47  16   3  37  46   6  97  23  62   7   7  50  71   6
##   Rap         9   9   0   2 291   9   2   3   9  61  91  18 154  12  22  14   9
##   Reggae     25  90   5   0  77   1   6   4   3  49   3  85  31   4  17   1  12
##   Rock       43  20  13  35   1  70 142  33  79  12  11 154   3  36  21  14  38
##   SKA        65  26   1   1   5  30   3   1  24  14   1  82   5  16  15   0  15
##   Today      36   9   1   3  52  15   4   7  15 139 158   5  39   7  16  92   0
##   World      10   4  15  41   3   4  27  17   4   7   2  37   5  12   9  11  30
##            
##              18  19
##   Alt. Rock   1   3
##   Blues      30 159
##   Country     9 212
##   Disco       1  25
##   EDM         2   0
##   Folk       23  97
##   Funk        6  12
##   Gospel      0  16
##   Jazz      255  91
##   Metal       1   1
##   Pop        82 146
##   Punk        2   5
##   R&B         3  98
##   Rap         0   1
##   Reggae      3  23
##   Rock       19  44
##   SKA         2  21
##   Today       1   0
##   World      41  35
# Return the most frequent value(s) of x; ties yield several values
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}
genre_list <- unique(data$Genre)
most_common_cluster_in_genre <- numeric(length(genre_list))
data$cluster <- as.numeric(as.character(data$cluster))
for (i in seq_along(genre_list)) {
  # Keep the first mode so that a tie cannot break the assignment
  most_common_cluster_in_genre[i] <- Modes(data[data$Genre == genre_list[i], "cluster"])[1]
}
print(most_common_cluster_in_genre)
##  [1] 10 19 19 12 11  4 12  7 18  9 10  6  2  5  2 12 12 11 18

The list above gives the dominant cluster for each genre (measured by number of observations), in the order in which the genres first appear in the data. Several genres share the same dominant cluster (cluster 12, for example, is dominant for four genres), and only 11 of the 19 clusters are dominant for any genre. This indicates that K-Means with k = 19 fails to adequately distinguish between genres, which could result from the overlap between genres in the feature space or from uneven genre sample sizes.
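The idea of a dominant cluster can be summarized with two simple quantities: how many distinct clusters are dominant for at least one genre, and what share of each genre falls into its dominant cluster. The sketch below computes both from a small made-up contingency table; the genre labels and counts are hypothetical.

```r
# Hypothetical genre-by-cluster counts (rows: genres, columns: clusters)
tab <- matrix(c(40,  5,  5,
                10, 30, 10,
                35, 10,  5),
              nrow = 3, byrow = TRUE,
              dimnames = list(Genre = c("A", "B", "C"),
                              Cluster = c("1", "2", "3")))

# Index of the largest count in each row, i.e. the dominant cluster
dominant <- apply(tab, 1, which.max)

# Fraction of each genre that sits in its dominant cluster
share <- apply(tab, 1, max) / rowSums(tab)

# Number of clusters that are dominant for at least one genre
n_distinct <- length(unique(dominant))

print(data.frame(dominant, share))
print(n_distinct)
```

Here genres A and C share a dominant cluster, which is the same kind of collision seen in the list above.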

sil <- silhouette(km1$cluster, dist(data_n))
fviz_silhouette(sil)
##    cluster size ave.sil.width
## 1        1  826          0.08
## 2        2  998          0.13
## 3        3  301          0.09
## 4        4  856          0.14
## 5        5  644          0.14
## 6        6  771          0.12
## 7        7  715          0.09
## 8        8  584          0.09
## 9        9  938          0.11
## 10      10 1285          0.16
## 11      11 1099          0.13
## 12      12 1187          0.14
## 13      13  389          0.02
## 14      14  389          0.09
## 15      15  647          0.08
## 16      16  846          0.12
## 17      17  742          0.12
## 18      18  481          0.19
## 19      19  989          0.13


The average silhouette score is close to 0.12, which indicates low-quality clustering overall. Cluster sizes also vary widely, from 301 to 1,285 observations, and the weakest cluster (13) has an average silhouette width of only 0.02. In light of these results, the next step is to cluster with a more appropriate number of clusters and determine whether the algorithm can uncover subgroups of tracks that are still meaningful even if they do not align with the traditional genre division. A further question is whether cross-genre patterns exist among the songs.
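To make the silhouette width concrete, the sketch below computes it by hand in base R for a tiny made-up data set. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The data and the helper name `silhouette_width` are illustrative assumptions, and the helper assumes every cluster has at least two points.

```r
# Hypothetical 1-D data with two well separated groups
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d  <- as.matrix(dist(x))

# Silhouette width of observation i (illustrative helper, not a library call)
silhouette_width <- function(i, d, cl) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # cohesion: mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
}

s <- sapply(seq_along(x), silhouette_width, d = d, cl = cl)
print(round(s, 3))
print(mean(s))
```

Values near 1 indicate compact, well separated clusters; values near 0, like those reported above, indicate points sitting between clusters.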

fviz_nbclust(data_n, kmeans, method = "wss")

Optimal_Clusters_KMeans(data_n, max_clusters=20, plot_clusters=TRUE, criterion="silhouette")

##  [1] 0.0000000 0.1990990 0.1526359 0.1591607 0.1624798 0.1672372 0.1422907
##  [8] 0.1363529 0.1436810 0.1322141 0.1308640 0.1315249 0.1306508 0.1271082
## [15] 0.1278780 0.1247002 0.1207977 0.1205069 0.1185581 0.1148549

As the elbow plot shows, 19 clusters is excessive. This is to be expected, since distinctions between music genres are often not sharp enough to warrant a separate cluster. Combined with the uneven sample size across genres, this makes it effectively impossible for K-Means to assign tracks to groups resembling the man-made genres. With this in mind, k is set to 4* for the next step in order to explore which genres overlap and which cluster each of them is assigned to.

*The average silhouette is higher for k = 2, 5, and 6, but 4 was chosen to allow a meaningful analysis.
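The ranking of candidate values of k can be read directly from the silhouette values printed above. The sketch below copies those 20 values into a vector (position i corresponds to k = i) and sorts them; nothing is recomputed from the data.

```r
# Average silhouette widths for k = 1..20, copied from the output above
avg_sil <- c(0.0000000, 0.1990990, 0.1526359, 0.1591607, 0.1624798,
             0.1672372, 0.1422907, 0.1363529, 0.1436810, 0.1322141,
             0.1308640, 0.1315249, 0.1306508, 0.1271082, 0.1278780,
             0.1247002, 0.1207977, 0.1205069, 0.1185581, 0.1148549)

best_k <- which.max(avg_sil)                      # k with the highest silhouette
ranked <- order(avg_sil, decreasing = TRUE)[1:5]  # five best values of k

print(best_k)
print(ranked)
print(avg_sil[19])  # silhouette for the genre-count choice, k = 19
```

The differences among k = 3 to 6 are small compared with the gap to k = 2, so choosing within that range on interpretability grounds costs little in silhouette terms.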

km2<-eclust(data_n,"kmeans",hc_metric="euclidean",k=4)

data$cluster<-km2$cluster
table(data$Genre, data$cluster)
##            
##                1    2    3    4
##   Alt. Rock  461  233   35   51
##   Blues       69  156   59  393
##   Country     57  297   34  445
##   Disco       38  490   46   78
##   EDM        473  161   55   10
##   Folk        20  104   59  379
##   Funk        34  245   32   39
##   Gospel     101   41   77   89
##   Jazz        35   68   50  625
##   Metal      747   52   65   22
##   Pop        774 1740  197  761
##   Punk       531  116   77   21
##   R&B         65  440   49  208
##   Rap        150  459   98    9
##   Reggae      11  386   19   23
##   Rock       307  231   63  187
##   SKA        101  185   32    9
##   Today      218  302   16   63
##   World       42   79   33  160

While not ideal, most genres now have a clearly dominant cluster, which makes analysis and interpretation possible. That said, cluster 3 is not dominant for any genre, so effectively only three clusters are relevant. Similar problems arise for k = 5, which is why k = 3 was chosen for the final clustering.

Final Clustering (k = 3)

km3<-eclust(data_n,"kmeans",hc_metric="euclidean",k=3)

data$cluster<-km3$cluster
genre_cluster <- apply(table(data$Genre, data$cluster), 1, which.max)

# Create a data frame mapping clusters to genres
cluster_genres <- split(names(genre_cluster), genre_cluster)

# Convert to a readable table
cluster_table <- data.frame(
  Cluster = names(cluster_genres),
  Genres = sapply(cluster_genres, function(x) paste(x, collapse = ", ")),
  row.names = NULL
)

# Display the table
print(cluster_table, row.names = FALSE)
##  Cluster                                         Genres
##        1 Disco, Funk, Pop, R&B, Rap, Reggae, SKA, Today
##        2              Blues, Country, Folk, Jazz, World
##        3      Alt. Rock, EDM, Gospel, Metal, Punk, Rock

This table groups genres by their dominant cluster. From a logical perspective these groups make sense. My assumptions, to be checked in the following part of the paper, are:
Group 1 is popular, happy music which people often dance to.
Group 2 is calm music, often with a lower tempo.
Group 3 is loud, energetic music which often has long instrumental segments.

km3$centers
##   Danceability     Energy   Loudness Speechiness Acousticness Instrumentalness
## 1    0.7533171  0.1908460  0.1700509   0.1251305   -0.3044365      -0.22082248
## 2   -0.3818617 -1.2349531 -1.0122061  -0.3591653    1.2694734       0.29622639
## 3   -0.7326799  0.7918369  0.6298277   0.1322645   -0.6615557       0.05651161
##      Liveness    Valence      Tempo  Popularity
## 1 -0.24062868  0.6477836 -0.1485194  0.22890328
## 2 -0.07199776 -0.4434115 -0.2512287 -0.44483713
## 3  0.40061452 -0.5312408  0.4248492  0.05973275
heatmap(km3$centers, main = "Cluster Centers", labRow = rownames(km3$centers), Rowv = NA, Colv = NA, margins = c(9, 5))


Looking at the centers, Danceability and Valence are the defining characteristics of cluster 1. It is also the most popular of the three clusters. All the genres in group 1 appear to be a good fit: Disco, Funk, Pop, R&B and Reggae are known for being positive and usually have lyrics, which is consistent with low Acousticness and Instrumentalness. Group 2 likewise matches the most extreme values of its cluster center. Jazz, Folk and Blues tracks often have low Loudness (-1.01) and Energy (-1.23). Instruments are also often at the forefront (Acousticness = 1.27), and these genres are not the most popular. As predicted, cluster 3 has high Energy and Loudness, though it may be a bit surprising to see Acousticness so low. People also do not often dance to Metal, Punk or Rock, and live recordings of these genres are common on services such as Spotify. Overall, the results are surprisingly positive. With the exception of EDM (electronic dance music) falling in the cluster with the lowest Danceability, the genre groups and cluster characteristics are well matched and fit real-world stereotypes.
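The centers above are in standardized units, so a value such as -1.01 means about one standard deviation below the overall mean. A minimal sketch of how such centers could be converted back to original units follows. It uses made-up data and relies on the centering and scaling attributes that `scale()` stores on its result; the variable values are assumptions, not dataset figures.

```r
set.seed(42)
# Hypothetical raw data with two clearly different groups
raw <- data.frame(
  Tempo    = c(rnorm(50, 90, 5), rnorm(50, 140, 5)),
  Loudness = c(rnorm(50, -12, 1), rnorm(50, -5, 1))
)

scaled <- scale(raw)
km <- kmeans(scaled, centers = 2, nstart = 10)

# scale() records the column means and standard deviations it used
mu  <- attr(scaled, "scaled:center")
sdv <- attr(scaled, "scaled:scale")

# Undo the standardization: multiply by the sd, then add the mean
centers_raw <- sweep(sweep(km$centers, 2, sdv, "*"), 2, mu, "+")

# Cross-check against the per-cluster means of the raw data
group_means <- aggregate(raw, by = list(cluster = km$cluster), FUN = mean)
print(centers_raw)
print(group_means)
```

The two printed tables agree, which shows that a back-transformed center is simply the cluster mean in the original units (BPM, dB and so on).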

Clustering Quality

sil <- silhouette(km3$cluster, dist(data_n))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 6330          0.19
## 2       2 3861          0.15
## 3       3 4496          0.09


The silhouette is still not of the desired quality, which suggests that the data simply has a lot of overlap. This might be caused by the dominance of pop music in the sample, which is known for its repetitive patterns.

Dunn3 <- intCriteria(as.matrix(data_n), km3$cluster, c("Dunn"))
CH3 <- intCriteria(as.matrix(data_n), km3$cluster, c("Calinski_Harabasz"))
crit1 <- intCriteria(as.matrix(data_n), km1$cluster, c("Calinski_Harabasz"))

cat("Dunn Index for k=3:", Dunn3[[1]], "\n")
## Dunn Index for k=3: 0.003103574
cat("Calinski Harabasz Index for k=3:", CH3[[1]], "\n")
## Calinski Harabasz Index for k=3: 2714.601
cat("For comparison, Calinski Harabasz Index for k=19:", crit1[[1]], "\n")
## For comparison, Calinski Harabasz Index for k=19: 1260.003
z_scores <- scale(data_n)

# Identify rows with z-scores > 3 or < -3
outliers <- which(abs(z_scores) > 3, arr.ind = TRUE)

The very low Dunn index also confirms that the clusters may be overlapping or poorly defined. It is important to mention that this index is quite sensitive to outliers, of which there are many in the data set (1,756, to be exact, found through the Z-score method), so it might be artificially low. Despite that, the conclusion is in line with the silhouette analysis. The Calinski-Harabasz index shows that clustering compactness and quality are better than in the case with 19 clusters.
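When outliers are flagged with `which(..., arr.ind = TRUE)` on a matrix of z-scores, the result has one row per flagged cell, so a track with extreme values in two variables is listed twice. The sketch below shows the difference between counting flagged cells and counting flagged observations on a small made-up matrix; the numbers are hypothetical.

```r
# Hypothetical z-scores: 4 observations (rows) by 3 variables (columns)
z <- matrix(c( 0.1,  3.5, -0.2,
               4.0, -3.2,  0.0,
               0.5,  0.3,  0.1,
              -0.4,  0.2,  3.1),
            ncol = 3, byrow = TRUE)

flagged <- which(abs(z) > 3, arr.ind = TRUE)

n_cells <- nrow(flagged)                    # extreme values (cells)
n_rows  <- length(unique(flagged[, "row"])) # observations with any extreme value

print(n_cells)
print(n_rows)
```

Depending on which of the two counts is reported, the number of outliers can differ noticeably, so it is worth stating which one is meant.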

hopkins(data_n)
## [1] 0.9999976

The Hopkins statistic is surprisingly high, indicating that the data has a significant clusterable structure. In combination with the very low Dunn index, this may mean that the clustering algorithm was suboptimal for this study.

# Feature importance is computed on a separate kcca fit with k = 4
km4 <- kcca(data_n, k = 4)
FeatureImp_km <- FeatureImpCluster(km4, data.table::as.data.table(data_n))
plot(FeatureImp_km)


We can observe that most variables have a considerable impact on the clustering result, especially Danceability, which means there is no need to alter the data set.

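# Treat the cluster label as a category on both the x axis and the fill
ggplot(data, aes(x = factor(cluster), y = Danceability, fill = factor(cluster))) +
  geom_boxplot() +
  labs(x = "cluster", fill = "cluster")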

The figure shows how the most important variable, Danceability, is distributed across the clusters. It is highest in cluster 1, at around 0.7, while for clusters 2 and 3 it hovers around 0.5.

Conclusion of Part 1

This analysis demonstrates that K-Means clustering can group tracks into meaningful subgroups based on audio features, though complete alignment with predefined genres remains challenging due to feature overlap and noise. With k = 3, we identified:

A cluster of danceable and happy tracks (e.g., Pop, Disco).
A cluster of calm, acoustic tracks (e.g., Jazz, Folk).
A cluster of loud, energetic tracks (e.g., Rock, Metal).

The results are promising but highlight limitations in K-Means for datasets with significant overlap. Future studies could explore alternative clustering methods, such as DBSCAN or Gaussian Mixture Models, and refine the dataset by addressing outliers and dominant genres like Pop.

Part 2: Principal Component Analysis

Prior to conducting PCA, only the numerical columns are extracted from the dataset. Standardization is also essential, as PCA is highly sensitive to differences in range across variables, and such differences are present here.

data_pc <- subset(data,select =-c(Track,Artist,Genre,cluster))
data_pca <- subset(data_pc,select=-c(Year,Mode,Key,Time_Signature))
data_pca_scaled <- scale(data_pca)
pca<- prcomp(data_pca_scaled, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.6224 1.2773 1.0700 1.04360 1.02289 0.95944 0.90993
## Proportion of Variance 0.2393 0.1483 0.1041 0.09901 0.09512 0.08368 0.07527
## Cumulative Proportion  0.2393 0.3876 0.4917 0.59070 0.68582 0.76950 0.84477
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.85121 0.65390 0.63893 0.38357
## Proportion of Variance 0.06587 0.03887 0.03711 0.01338
## Cumulative Proportion  0.91064 0.94951 0.98662 1.00000

Summary of PCA Results

The table above presents the proportion of variance explained by each of the 11 principal components. A common benchmark is to retain approximately 80-85% cumulative variance, which ensures that most of the original dataset’s information is preserved while reducing dimensionality. Based on this criterion and the scree plot analysis, the optimal number of principal components is seven.
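The cumulative-variance criterion can be checked directly from the standard deviations printed in the summary above. The sketch below copies those values, squares them to obtain variances, and finds the first component at which the cumulative proportion reaches an assumed 80% target.

```r
# Standard deviations of PC1..PC11, copied from the summary above
sdev <- c(1.6224, 1.2773, 1.0700, 1.04360, 1.02289, 0.95944,
          0.90993, 0.85121, 0.65390, 0.63893, 0.38357)

prop_var <- sdev^2 / sum(sdev^2)  # proportion of variance per component
cum_var  <- cumsum(prop_var)      # cumulative proportion

target <- 0.80                    # assumed lower end of the 80-85% benchmark
n_keep <- which(cum_var >= target)[1]

print(round(cum_var, 4))
print(n_keep)
```

Six components fall just short of the target (about 77%), which is why the seventh is needed.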

fviz_eig(pca, choice='eigenvalue')

fviz_pca_var(pca, col.var = "black")

From the PCA variable correlation plot, it can be inferred that the key contributors to Component 1 include Energy, Loudness, and their inverse Acousticness. Component 2 is primarily influenced by Danceability and Valence, with additional contributions from Liveness and Duration.

a<-fviz_contrib(pca, "var",axes = 1)
b<-fviz_contrib(pca, "var",axes = 2)
c<-fviz_contrib(pca, "var",axes = 3)
d<-fviz_contrib(pca, "var",axes = 4)
e<-fviz_contrib(pca, "var",axes = 5)
f<-fviz_contrib(pca, "var",axes = 6)
g<-fviz_contrib(pca, "var",axes = 7)
plots <- list(a, b, c, d, e, f, g)
grid.arrange(grobs = plots, nrow = 4, ncol = 2)

pca2<-principal(data_pca_scaled, nfactors=7, rotate="varimax")
pca2
## Principal Components Analysis
## Call: principal(r = data_pca_scaled, nfactors = 7, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                    RC1   RC2   RC3   RC7   RC5   RC6   RC4   h2    u2 com
## Duration          0.04 -0.10  0.04  0.04 -0.04 -0.01  0.97 0.96 0.039 1.0
## Danceability      0.01  0.78 -0.25 -0.09 -0.26  0.24  0.02 0.81 0.189 1.7
## Energy            0.93  0.07  0.11  0.03  0.12  0.03  0.00 0.90 0.104 1.1
## Loudness          0.84 -0.01 -0.01 -0.19  0.02  0.05 -0.11 0.76 0.238 1.1
## Speechiness       0.11  0.05  0.06 -0.01  0.06  0.96 -0.01 0.95 0.052 1.0
## Acousticness     -0.83 -0.08  0.10 -0.01 -0.04 -0.08 -0.16 0.75 0.255 1.1
## Instrumentalness -0.09 -0.10  0.01  0.97 -0.03 -0.01  0.04 0.95 0.045 1.0
## Liveness          0.15 -0.19  0.82 -0.11 -0.13  0.16 -0.05 0.78 0.215 1.4
## Valence           0.11  0.88  0.10 -0.04  0.12 -0.10 -0.13 0.84 0.165 1.2
## Tempo             0.13 -0.05 -0.03 -0.03  0.95  0.06 -0.04 0.93 0.075 1.1
## Popularity        0.30 -0.19 -0.62 -0.26 -0.19  0.15 -0.17 0.67 0.332 2.7
## 
##                        RC1  RC2  RC3  RC7  RC5  RC6  RC4
## SS loadings           2.43 1.49 1.16 1.06 1.06 1.06 1.03
## Proportion Var        0.22 0.14 0.11 0.10 0.10 0.10 0.09
## Cumulative Var        0.22 0.36 0.46 0.56 0.65 0.75 0.84
## Proportion Explained  0.26 0.16 0.12 0.11 0.11 0.11 0.11
## Cumulative Proportion 0.26 0.42 0.55 0.66 0.78 0.89 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 7 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.07 
##  with the empirical chi square  7825.4  with prob <  NA 
## 
## Fit based upon off diagonal values = 0.88
print(loadings(pca2), digits=3, cutoff=0.4, sort=TRUE)
## 
## Loadings:
##                  RC1    RC2    RC3    RC7    RC5    RC6    RC4   
## Energy            0.929                                          
## Loudness          0.843                                          
## Acousticness     -0.834                                          
## Danceability             0.783                                   
## Valence                  0.878                                   
## Liveness                        0.819                            
## Popularity                     -0.619                            
## Instrumentalness                       0.967                     
## Tempo                                         0.949              
## Speechiness                                          0.963       
## Duration                                                    0.972
## 
##                  RC1   RC2   RC3   RC7   RC5   RC6   RC4
## SS loadings    2.434 1.492 1.159 1.062 1.057 1.056 1.033
## Proportion Var 0.221 0.136 0.105 0.097 0.096 0.096 0.094
## Cumulative Var 0.221 0.357 0.462 0.559 0.655 0.751 0.845
pca2$complexity
##         Duration     Danceability           Energy         Loudness 
##         1.033393         1.697492         1.077065         1.146466 
##      Speechiness     Acousticness Instrumentalness         Liveness 
##         1.044693         1.145974         1.043751         1.360893 
##          Valence            Tempo       Popularity 
##         1.170296         1.057793         2.719432
pca2$uniquenesses
##         Duration     Danceability           Energy         Loudness 
##       0.03854904       0.18852017       0.10387791       0.23802715 
##      Speechiness     Acousticness Instrumentalness         Liveness 
##       0.05201880       0.25470491       0.04533429       0.21523704 
##          Valence            Tempo       Popularity 
##       0.16484983       0.07452223       0.33185604

Principal Component Contributions

After examining the contribution plots and loadings, it was determined that the Popularity variable should be excluded from further analysis. The variable is spread across multiple components, has the highest complexity (2.72), and has the highest uniqueness (0.33), suggesting that it is not well represented in the principal component structure. Popularity is also not a strictly musical characteristic: it is influenced by external factors such as listener preferences and trends (for example, the rise of hip hop and rap in recent times), so including it may introduce undesired effects.
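The two diagnostics used here can be approximately reproduced from the rounded Popularity loadings in the table above. The sketch assumes that the reported complexity is Hofmann's index, (sum of squared loadings)^2 divided by the sum of fourth powers, and that uniqueness is one minus the communality; because the loadings are rounded to two decimals, the results only approximate the printed values.

```r
# Rotated loadings of Popularity, copied (rounded) from the table above
popularity_loadings <- c(RC1 = 0.30, RC2 = -0.19, RC3 = -0.62, RC7 = -0.26,
                         RC5 = -0.19, RC6 = 0.15, RC4 = -0.17)

# Communality: variance of the variable explained by the retained components
communality <- sum(popularity_loadings^2)

# Uniqueness: the share left unexplained
uniqueness <- 1 - communality

# Hofmann's complexity index (assumed to be what the output reports):
# 1 means the variable loads on a single component, larger means it is spread out
complexity <- sum(popularity_loadings^2)^2 / sum(popularity_loadings^4)

print(round(c(uniqueness = uniqueness, complexity = complexity), 3))
```

A complexity close to 3 means the variable behaves as if it were shared among roughly three components, which is the pattern that motivates dropping Popularity.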

data_pca_scaled_no_pop <-subset(data_pca_scaled,select=-c(Popularity))
pca_no_pop<- prcomp(data_pca_scaled_no_pop, center = TRUE, scale. = TRUE)
summary(pca_no_pop)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.5963 1.2730 1.0441 1.0242 0.98209 0.94646 0.90844
## Proportion of Variance 0.2548 0.1621 0.1090 0.1049 0.09645 0.08958 0.08253
## Cumulative Proportion  0.2548 0.4169 0.5259 0.6308 0.72724 0.81682 0.89934
##                            PC8    PC9    PC10
## Standard deviation     0.65959 0.6512 0.38405
## Proportion of Variance 0.04351 0.0424 0.01475
## Cumulative Proportion  0.94285 0.9852 1.00000
a<-fviz_contrib(pca_no_pop, "var",axes = 1)
b<-fviz_contrib(pca_no_pop, "var",axes = 2)
c<-fviz_contrib(pca_no_pop, "var",axes = 3)
d<-fviz_contrib(pca_no_pop, "var",axes = 4)
e<-fviz_contrib(pca_no_pop, "var",axes = 5)
f<-fviz_contrib(pca_no_pop, "var",axes = 6)
plots2 <- list(a, b, c, d, e, f)
grid.arrange(grobs = plots2, nrow = 3, ncol = 2)

PCA Without Popularity Variable

Excluding the Popularity variable allows the 80% variance benchmark to be reached with six principal components instead of seven. The contribution plots indicate that PC3 and PC5 are both largely defined by Tempo, while PC4 and PC6 show high contributions from Speechiness and Liveness. To further investigate whether these components represent distinct musical dimensions or are redundant, biplots for PC3 vs. PC5 and PC4 vs. PC6 were analyzed.

plot1 <- fviz_pca_var(pca_no_pop, axes = c(3,5))
plot2 <- fviz_pca_var(pca_no_pop, axes = c(4,6))
grid.arrange(plot1, plot2, ncol = 2)

The results indicate that while Tempo contributes similarly to both PC3 and PC5, other variables introduce sufficient differentiation to justify retaining both components. Likewise, for PC4 and PC6, although Speechiness exhibits similar effects, Liveness varies in direction, reinforcing the necessity of both components.

pca2_no_pop<-principal(data_pca_scaled_no_pop, nfactors=6, rotate="varimax")
pca2_no_pop
## Principal Components Analysis
## Call: principal(r = data_pca_scaled_no_pop, nfactors = 6, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                    RC1   RC2   RC3   RC5   RC6   RC4   h2    u2 com
## Duration          0.16 -0.17  0.71 -0.32 -0.01  0.01 0.66 0.338 1.7
## Danceability      0.03  0.79 -0.05 -0.28 -0.22  0.23 0.81 0.191 1.6
## Energy            0.91  0.08  0.02  0.15  0.14  0.03 0.88 0.124 1.1
## Loudness          0.84 -0.02 -0.23  0.02  0.03  0.05 0.77 0.229 1.2
## Speechiness       0.11  0.04 -0.03  0.05  0.07  0.97 0.97 0.031 1.0
## Acousticness     -0.85 -0.07 -0.10 -0.02  0.07 -0.08 0.75 0.252 1.1
## Instrumentalness -0.23  0.01  0.76  0.26  0.04 -0.05 0.70 0.304 1.4
## Liveness          0.07 -0.07  0.02 -0.03  0.98  0.07 0.97 0.031 1.0
## Valence           0.09  0.89 -0.09  0.14  0.07 -0.12 0.85 0.146 1.1
## Tempo             0.16 -0.06  0.00  0.88 -0.03  0.05 0.81 0.186 1.1
## 
##                        RC1  RC2  RC3  RC5  RC6  RC4
## SS loadings           2.39 1.47 1.15 1.08 1.04 1.04
## Proportion Var        0.24 0.15 0.12 0.11 0.10 0.10
## Cumulative Var        0.24 0.39 0.50 0.61 0.71 0.82
## Proportion Explained  0.29 0.18 0.14 0.13 0.13 0.13
## Cumulative Proportion 0.29 0.47 0.61 0.75 0.87 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 6 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.08 
##  with the empirical chi square  9394.34  with prob <  NA 
## 
## Fit based upon off diagonal values = 0.85
print(loadings(pca2_no_pop), digits=3, cutoff=0.4, sort=TRUE)
## 
## Loadings:
##                  RC1    RC2    RC3    RC5    RC6    RC4   
## Energy            0.909                                   
## Loudness          0.845                                   
## Acousticness     -0.849                                   
## Danceability             0.791                            
## Valence                  0.894                            
## Duration                        0.709                     
## Instrumentalness                0.758                     
## Tempo                                  0.885              
## Liveness                                      0.977       
## Speechiness                                          0.973
## 
##                  RC1   RC2   RC3   RC5   RC6   RC4
## SS loadings    2.390 1.474 1.153 1.076 1.039 1.036
## Proportion Var 0.239 0.147 0.115 0.108 0.104 0.104
## Cumulative Var 0.239 0.386 0.502 0.609 0.713 0.817
pca2_no_pop$complexity
##         Duration     Danceability           Energy         Loudness 
##         1.653789         1.625447         1.122370         1.160977 
##      Speechiness     Acousticness Instrumentalness         Liveness 
##         1.047390         1.075312         1.438552         1.031507 
##          Valence            Tempo 
##         1.140092         1.081202

Applying varimax rotation to the PCA results allows for a clearer interpretation of the component structure. The complexity values above reveal that three variables (Duration, Danceability, and Instrumentalness) exhibit comparatively high complexity. This suggests that these variables are not easily explained by a single component and instead influence multiple principal components. Conversely, variables such as Speechiness, Acousticness, Liveness, and Tempo exhibit strong thematic coherence, meaning they are predominantly associated with a single principal component.

pca2_no_pop$uniquenesses
##         Duration     Danceability           Energy         Loudness 
##       0.33826078       0.19062182       0.12440787       0.22930339 
##      Speechiness     Acousticness Instrumentalness         Liveness 
##       0.03058102       0.25187598       0.30354481       0.03111654 
##          Valence            Tempo 
##       0.14622115       0.18589760

The uniqueness values indicate that most of the variance is effectively captured within the principal components. This suggests that PCA successfully models the underlying structure of the dataset, confirming the appropriateness of the selected components.

Interpretation of Principal Components

Principal Component   Umbrella Name             Most Contributing Variables
PC1                   Intensity                 Energy, Loudness, Acousticness
PC2                   Danceability              Danceability, Valence
PC3                   Pacing                    Duration, Tempo
PC4                   Lack of Words             Speechiness, Liveness
PC5                   Presence of Instruments   Instrumentalness, Tempo
PC6                   Noise                     Liveness, Speechiness, Valence

PC1: Intensity – Represents the overall intensity of a song, as determined by its energy level and loudness. The negative correlation with acousticness suggests that more intense songs tend to have lower acoustic elements.

PC2: Danceability – No need for an umbrella name here as the component is primarily influenced by danceability scores and valence which are both important for dancing.

PC3: Pacing – Represents tempo-related characteristics, including the song’s duration and speed.

PC4: Lack of Words – This component is defined by a negative correlation with speechiness, meaning higher PC4 values correspond to tracks with fewer spoken words.

PC5: Presence of Instruments – Similar to PC4 but in the opposite direction, this component correlates positively with instrumentalness, indicating the presence of more instrumental elements in a track.

PC6: Noise – Reflects ambient and crowd noise characteristics, as inferred from high contributions of liveness and speechiness.

Overall, these principal components effectively capture distinct musical characteristics, supporting the use of PCA as a dimensionality reduction tool for genre classification and track analysis.