The past two years of equity trading have been characterized by alternating AI-driven rallies and macroeconomic drawdowns, motivating a reassessment of how unlabeled stocks group together when measured through risk, return, and technical behavior. This paper applies unsupervised learning to 48 liquid equities from November 2023 to November 2025, aggregating daily market data into six conceptual feature blocks: risk/return, momentum, volatility, liquidity, technical patterns, and distributional characteristics. We evaluate K-means, PAM, hierarchical clustering, and DBSCAN, interpreting Silhouette, Davies–Bouldin, and Calinski–Harabasz scores together with density cues and financial diagnostics, to see how assumptions shape segmentation. The results highlight the sensitivity of financial clustering to preprocessing choices and demonstrate that weak or unstable groupings are themselves informative. Weak clustering suggests that market behaviors overlap heavily, consistent with the hypothesis that equities do not partition cleanly into static behavioral regimes.
Segmenting stocks is useful for several reasons. First, many financial features lack natural labels, making supervised learning unsuitable; clustering allows us to explore whether meaningful structure exists at all. Second, grouping stocks with similar risk or momentum behavior supports diversification by avoiding unintended exposure to a single regime. Finally, cluster patterns can reveal how assets co-move during volatility shocks or macroeconomic cycles, offering insight into market regimes.
Clustering financial assets is inherently challenging because market data often exhibits weak or overlapping structure. Features such as volatility, momentum, and liquidity are highly correlated, heavy-tailed, and regime-dependent, which means natural clusters may not exist or may be unstable across periods. Rather than expecting clean partitions, the purpose of clustering here is exploratory: to assess whether any persistent structure emerges and to understand how algorithmic assumptions influence the segmentation. In this context, “weak clustering” is still informative because it indicates that financial behaviors are diffuse rather than distinctly separated.
This paper evaluates how different clustering algorithms behave when applied to aggregated stock features. The objective is not to identify a single “best” segmentation but to understand how assumptions—distance metrics, standardization, dimensionality reduction, and linkage rules—affect the resulting groups. By comparing K-means, PAM, hierarchical clustering, and DBSCAN in a PCA-reduced space, we examine robustness, sensitivity, and interpretability.
The study addresses four questions:
# Load the parquet file
global_data <- read_parquet("data/ds1_global_data.parquet")
# Separate tickers and features
tickers <- global_data$ticker
features <- global_data %>% dplyr::select(-ticker)
# Remove zero variance features
variances <- apply(features, 2, var, na.rm = TRUE)
zero_var_features <- names(variances[variances == 0])
features <- features %>% dplyr::select(-all_of(zero_var_features))
Twenty features were grouped into six conceptual categories:
Risk/Return (mean_daily_return, mean_stock_vs_market, beta_global, mean_sharpe_20d, worst_drawdown): captures performance and risk-adjusted returns, helping to separate high-risk/high-return stocks from low-risk/stable ones.
Momentum (mean_momentum_20d, mean_close_vs_sma200, mean_adx_14, mean_price_pos_20d): identifies trending versus range-bound stocks.
Volatility (mean_volatility_20d, mean_atr_14, mean_volatility_ratio, mean_bb_width): separates high-volatility from low-volatility stocks, which is important for risk management and volatility-based strategies.
Liquidity (mean_liquidity_20d, mean_volume_ratio): identifies liquid versus illiquid stocks for execution and market-impact considerations.
Technical patterns (mean_rsi_14, mean_macd_hist, mean_stoch_k): represents short-term price behavior patterns.
Distributional characteristics (return_skewness, return_kurtosis): detects asymmetric return behavior, useful for risk modeling.
All features represent stock-level aggregates computed over the entire sample period (one row per stock), rather than time-series sequences.
Highly skewed features were transformed using Yeo–Johnson to reduce heavy-tailed behavior. All features were then standardized to ensure equal contribution to distance calculations. PCA was applied to mitigate multicollinearity and reduce noise.
# Apply Yeo-Johnson transformation
features_transformed <- features
for(feat in skew_data) {
yn_obj <- yeojohnson(features[[feat]], standardize = FALSE, eps = 1e-8)
features_transformed[[feat]] <- predict(yn_obj)}
# Standardization
features_scaled <- as.data.frame(scale(features_transformed))
# Apply PCA
pca_result <- prcomp(features_scaled, center = FALSE, scale. = FALSE)
To verify that reducing the feature space did not distort cluster structure, we compared K-Means performance using the first 3 versus 4 principal components. Adding PC4 increased total explained variance from 67.4% to 74.8%, but it did not improve internal validation scores: the Silhouette score fell and the Davies–Bouldin index rose, and the cluster boundaries in the 4-component space were not more distinct. This indicates that the additional variance captured by PC4 reflects noise or idiosyncratic effects rather than meaningful structure.
# Extract 3 and 4 components
feature_pca_3 <- as.data.frame(pca_result$x[, 1:3])
feature_pca_4 <- as.data.frame(pca_result$x[, 1:4])
| Components | Variance_Explained | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|---|
| 3 | 67.4 | 0.41 | 0.67 | 36.21 |
| 4 | 74.8 | 0.28 | 1.34 | 22.75 |
Given the limited sample size (n = 48) and the diminishing returns after PC3, we retained the 3-component solution for all clustering analyses.
fviz_pca_var(pca_result, col.var="steelblue", title = 'Figure 4: PCA Biplot of Variable loadings')
The PCA biplot shows that many features load strongly and directionally onto PC1, confirming a high degree of multicollinearity across risk, momentum, and technical indicators. This redundancy means that distance-based methods (especially K-Means and Ward’s method) would overweight these correlated features unless dimensionality is reduced. Retaining the first three components removes noise while preserving broad market structure (trend, volatility, momentum). The goal is not interpretability of PCs, but to reduce distortion in cluster geometry caused by correlated inputs.
get_clust_tendency(features_scaled, n = 2, graph = TRUE,
gradient = list(low="red", mid="white", high="blue"),
seed = 123)
## $hopkins_stat
## [1] 0.7832834
##
## $plot
The Hopkins statistic of 0.78 indicates a strong clustering tendency and aligns with the dissimilarity heatmap, which shows block-like low-distance regions separated by high-distance bands. The data therefore contains some structure, though its clarity must be validated across multiple clustering methods.
Clustering uncovers latent patterns in unlabeled financial features. It helps assess similarity in risk, momentum, or volatility profiles without assuming predefined categories. This is valuable for identifying diversification groups, stress-response regimes, and behavioral-style clusters that may not align with sector classifications.
To select the number of clusters, we used NbClust, which aggregates multiple internal validity indices.
We evaluated four internal indices (Gap Statistic, Silhouette, Davies–Bouldin, Calinski–Harabasz). Results varied: Gap suggested k=2, CH suggested k=5, and both Silhouette and Davies–Bouldin converged on k=4. We selected k=4 based on the majority of stability-focused metrics and consistency with broad financial regimes.
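As a supplementary cross-check of the k selection (this call is not part of the original pipeline; it is an assumed extra diagnostic using factoextra, which is already loaded), a single-index sweep over candidate k can be plotted in one line:
# Supplementary sketch: average silhouette width across candidate k under K-means,
# computed on the same PCA-reduced features used throughout the analysis
fviz_nbclust(feature_pca, kmeans, method = "silhouette", k.max = 10)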
dist_eucl <- dist(feature_pca, method = "euclidean")
dist_manh <- dist(feature_pca, method = "manhattan")
cor_matrix <- cor(t(feature_pca))
dist_cos <- as.dist(1 - cor_matrix)
We compared Euclidean, Manhattan, and cosine-like distance (1 − correlation). These metrics capture different geometric assumptions and allow assessment of how cluster assignments depend on distance definitions.
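For precision, the "cosine-like" distance computed above is the correlation distance: because cor(t(feature_pca)) centers each stock's vector of PC scores before normalizing, the resulting dissimilarity is
\[
d_{\mathrm{cos}}(x_i, x_j) = 1 - \frac{(x_i - \bar{x}_i\mathbf{1})^{\top}(x_j - \bar{x}_j\mathbf{1})}{\lVert x_i - \bar{x}_i\mathbf{1}\rVert \, \lVert x_j - \bar{x}_j\mathbf{1}\rVert},
\]
where \(\bar{x}_i\) is the mean of stock \(i\)'s components. This is the cosine distance between row-centered feature vectors, so it compares the direction of each stock's profile rather than its magnitude.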
K-means provides a fast baseline for comparison. It assumes spherical clusters and equal variance, making it sensitive to outliers and skewed financial features. Standardized PCA features partly mitigate this, but K-means remains prone to boundary distortion. Metrics and visualization evaluate whether the method captures coherent groups.
km_final <- kmeans(feature_pca, centers = 4, nstart = 10, iter.max = 300)
| Metric | Value |
|---|---|
| Silhouette Score (Avg) | 0.336 |
| Davies-Bouldin Index | 1.220 |
| Calinski-Harabasz Index | 28.614 |
PAM uses medoids, making it more robust to outliers and suitable for noisy financial distributions. Because medoids represent actual stocks, cluster centers are easier to interpret. We evaluate PAM with multiple distance metrics to test stability and robustness relative to K-means.
pam_eucl <- pam(dist_eucl, k = 4, diss = TRUE)
pam_manh <- pam(dist_manh, k = 4, diss = TRUE)
pam_cos <- pam(dist_cos, k = 4, diss = TRUE)
| Distance | Silhouette | Davies_Bouldin | Calinski_Harabasz | Medoids |
|---|---|---|---|---|
| Euclidean | 0.309 | 1.142 | 26.132 | MSFT, GOOGL, PFE, TSLA |
| Manhattan | 0.254 | 1.482 | 20.973 | CAT, T, MRK, NEE |
| Cosine | 0.623 | 1.865 | 18.279 | TSM, SPY, COP, META |
Agglomerative hierarchical clustering with Ward, complete, and average linkage was applied. Dendrograms reveal multi-scale structure without fixing k upfront. Cophenetic correlation assesses how well each linkage preserves original pairwise distances.
# Euclidean + Ward
hc_eucl_ward_tree <- hclust(dist_eucl, method = "ward.D2")
hc_eucl_ward_cl <- cutree(hc_eucl_ward_tree, k = 4)
# Euclidean + Complete
hc_eucl_comp_tree <- hclust(dist_eucl, method = "complete")
hc_eucl_comp_cl <- cutree(hc_eucl_comp_tree, k = 4)
# Euclidean + Average
hc_eucl_avg_tree <- hclust(dist_eucl, method = "average")
hc_eucl_avg_cl <- cutree(hc_eucl_avg_tree, k = 4)
# Manhattan + Complete
hc_manh_comp_tree <- hclust(dist_manh, method = "complete")
hc_manh_comp_cl <- cutree(hc_manh_comp_tree, k = 4)
# Manhattan + Average
hc_manh_avg_tree <- hclust(dist_manh, method = "average")
hc_manh_avg_cl <- cutree(hc_manh_avg_tree, k = 4)
# Cosine + Complete
hc_cos_comp_tree <- hclust(dist_cos, method = "complete")
hc_cos_comp_cl <- cutree(hc_cos_comp_tree, k = 4)
# Cosine + Average
hc_cos_avg_tree <- hclust(dist_cos, method = "average")
hc_cos_avg_cl <- cutree(hc_cos_avg_tree, k = 4)
| | Distance | Linkage | Silhouette | Cophenetic_Corr |
|---|---|---|---|---|
| 7 | Cosine | Average | 0.737 | 0.849 |
| 5 | Manhattan | Average | 0.386 | 0.832 |
| 3 | Euclidean | Average | 0.417 | 0.805 |
| 4 | Manhattan | Complete | 0.389 | 0.786 |
| 2 | Euclidean | Complete | 0.373 | 0.713 |
| 6 | Cosine | Complete | 0.615 | 0.710 |
| 1 | Euclidean | Ward.D2 | 0.410 | 0.614 |
We evaluated Ward, Complete, and Average linkage with different distance matrices. Average linkage yielded the highest cophenetic correlation coefficient (0.85, with the cosine distance), indicating that it most accurately preserved the original pairwise dissimilarities.
We use three complementary internal validation metrics to assess cluster quality:
Silhouette Score: Measures how similar each point is to its own cluster compared to the nearest other cluster. Values range from -1 to 1, with higher values indicating better-defined clusters.
Davies-Bouldin Index: Calculates the average similarity ratio of each cluster with its most similar cluster. Lower values indicate better clustering, as they reflect higher within-cluster compactness and greater between-cluster separation.
Calinski-Harabasz Index: Also known as the Variance Ratio Criterion, this measures the ratio of between-cluster variance to within-cluster variance. Higher values indicate better-defined clusters with clear separation. This metric is particularly useful for comparing different numbers of clusters.
Together, these metrics evaluate different aspects of cluster quality: individual point assignment (Silhouette), cluster-level separation (Davies-Bouldin), and overall variance structure (Calinski-Harabasz).
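For reference, the standard formulations of these three indices are
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}, \qquad
CH = \frac{B/(k-1)}{W/(n-k)},
\]
where \(a(i)\) is the mean distance from observation \(i\) to the other members of its cluster, \(b(i)\) is the mean distance to the members of the nearest other cluster, \(\sigma_i\) is the average distance of cluster \(i\)'s members from its centroid \(c_i\), \(d(c_i, c_j)\) is the distance between centroids, and \(B\) and \(W\) are the between- and within-cluster sums of squares for \(n\) observations in \(k\) clusters.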
| Method | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|
| HC_Cosine_Average | 0.74 | 1.12 | 18.80 |
| PAM_Cosine | 0.62 | 1.86 | 18.28 |
| HC_Cosine_Complete | 0.62 | 1.42 | 20.20 |
| HC_Euclidean_Average | 0.42 | 0.65 | 33.16 |
| HC_Euclidean_Ward | 0.41 | 0.64 | 31.53 |
| HC_Manhattan_Complete | 0.39 | 0.61 | 29.06 |
| HC_Manhattan_Average | 0.39 | 0.58 | 24.33 |
| HC_Euclidean_Complete | 0.37 | 0.70 | 31.38 |
| K-Means | 0.34 | 1.22 | 28.61 |
| PAM_Euclidean | 0.31 | 1.14 | 26.13 |
| PAM_Manhattan | 0.25 | 1.48 | 20.97 |
Clustering financial data often yields unstable results because feature distributions overlap and the underlying structure is weak. To assess how sensitive the cluster assignments are to algorithmic assumptions, we compute the pairwise Adjusted Rand Index (ARI) across all clustering methods. ARI measures the agreement between two partitions after correcting for chance; values near 1 indicate nearly identical clusters, while values near 0 indicate unrelated segmentations.
The ARI heatmap shows moderate similarity among Euclidean-based methods (K-Means, Ward's linkage, and PAM-Euclidean), with values generally between 0.4 and 0.6; no method pair consistently exceeds 0.6 except hierarchical algorithms sharing the same distance metric. Manhattan- and cosine-based partitions show substantially lower agreement, indicating that small changes in how distances are computed lead to different groupings. This is direct evidence of weak intrinsic structure in the feature space rather than a failure of any particular algorithm, and it confirms that the segmentation is highly sensitive to methodological assumptions.
For financial applications, this instability is itself an informative result:
In short, the stability analysis confirms that the dataset does not contain strongly separable clusters, and any conclusions must account for this sensitivity.
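For example, the agreement between the K-Means partition and the cosine/average hierarchical partition can be computed directly with adjustedRandIndex from the mclust package loaded earlier (both label vectors are defined above):
# Pairwise agreement between two specific partitions, corrected for chance
adjustedRandIndex(km_final$cluster, hc_cos_avg_cl)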
Across the PCA-reduced feature space, the algorithms produced different partitions in terms of compactness, separation, and stability. K-means achieved one of the stronger Calinski–Harabasz values, indicating well-defined centroids under Euclidean distance. PAM delivered the most consistent Silhouette and Davies–Bouldin scores across distance metrics, producing compact clusters in multiple configurations.
Hierarchical clustering produced heterogeneous results depending on the linkage method. Ward’s linkage generated compact clusters comparable to K-means, while Average linkage achieved the highest cophenetic correlation, indicating it best preserved the original dissimilarity structure. In contrast, Complete linkage produced more elongated cluster shapes and lower internal scores.
DBSCAN identified a small set of points as noise and formed only one or two dense groups, reflecting the absence of clear density-separated regions. It did not yield a multi-cluster segmentation comparable to the partitioning or hierarchical methods.
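The DBSCAN configuration is not echoed in the code shown in this report, so the following is a minimal sketch of how such a run could be reproduced on the retained principal components; the eps and minPts values are illustrative assumptions read off a kNN-distance plot, not the exact settings behind the result above.
# Sketch only: choose eps near the "knee" of the sorted k-NN distances,
# then cluster the PCA scores; points labelled 0 are treated as noise
kNNdistplot(feature_pca, k = 4)
abline(h = 1.5, lty = 2)                      # illustrative eps guess
db_res <- dbscan::dbscan(feature_pca, eps = 1.5, minPts = 4)
table(db_res$cluster)                         # cluster 0 = noise points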
Across all algorithms, cluster boundaries were diffuse and partially overlapping. PAM and Ward linkage generated the most internally consistent partitions, while distance-dependent methods showed greater variability in cluster assignments.
Cosine-distance hierarchical clustering with average linkage produced four behaviorally meaningful groups:
Cluster 1 — Broad Market Backbone: Large, liquid mega-caps such as AAPL, JPM, HD, and IBM. These stocks exhibit stable, low-volatility behavior with moderate momentum and consistent factor exposures. Their feature directions are strongly aligned, leading cosine similarity to cluster them tightly.
Cluster 2 — Defensive & Low-Beta Assets: JNJ, INTC, WMT, MA, LLY, and ABBV form a low-beta cluster characterized by stable returns, defensive fundamentals, and weak participation in speculative market cycles. These stocks share a consistent risk profile despite belonging to different sectors.
Cluster 3 — Momentum-Led Tech & Growth: NVDA, META, NFLX, AMZN, TSLA, and BABA appear together due to strong momentum, high volatility, and price trends dominated by macro/AI catalysts. Cosine distance groups them because their normalized feature vectors show similar trend direction even when magnitudes differ.
Cluster 4 — Macro-Sensitive & Idiosyncratic Risk: XOM, CVX, DIS, TGT, PEP, PG, NEE, CMCSA, and MRK exhibit higher idiosyncratic volatility or macro-driven patterns (energy shocks, consumer cyclicality, policy sensitivity). This cluster captures stocks whose behavior is irregular or influenced by sector-specific risks rather than broad market trends.
Overall, the HC–Cosine–Average model reveals a latent structure organized primarily along volatility–momentum axes, with defensive stocks at one extreme and growth/momentum stocks at the other, while macro-driven equities form a distinct idiosyncratic group.
A critical finding in this study was the presence of structural noise that challenged standard clustering assumptions. We employed two diagnostic methods to isolate these anomalies.
The Silhouette Width \(s(i)\) measures how similar a point is to its own cluster compared to other clusters. Points with \(s(i) < 0\) suggest misclassification.
sil_values <- silhouette(hc_cos_avg_cl, dist_cos)
## [1] "CVX"
To interpret the underlying “DNA” of the identified segments, we analyzed the standardized means of all 20 features across the four clusters. This profiling reveals the specific factor exposures that drive each group:
Cluster 1 (Trend Leaders): Characterized by high relative strength (mean_stock_vs_market), strong technical indicators (mean_stoch_k, mean_rsi_14), and high price positioning (mean_price_pos_20d).
Cluster 2 (Defensive/Low-Beta): Defined by negative global beta and the lowest volatility scores (mean_volatility_20d, mean_bb_width) across the dataset.
Cluster 3 (High-Volatility Speculative): Displays extreme scores in momentum, daily returns, and volatility expansion, indicated by high Bollinger Band width.
Cluster 4 (Laggards): Shows negative means across nearly all performance metrics, representing assets in persistent downtrends during the sample period.
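These profiles come from aggregating the standardized features by cluster; a minimal sketch of one way to reproduce that aggregation (an assumed reconstruction using the HC Cosine + Average labels in hc_cos_avg_cl) is:
# Assumed reconstruction: mean of each standardized feature within each cluster
cluster_profiles <- aggregate(features_scaled,
                              by = list(Cluster = hc_cos_avg_cl),
                              FUN = mean)
knitr::kable(cluster_profiles, digits = 2,
             caption = "Standardized feature means by cluster (profiling sketch)")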
| | Antecedent (Behavioral Features) | Consequent (Target Cluster) | Support | Confidence | Lift | N |
|---|---|---|---|---|---|---|
| 2 | mean_rsi_14=High,mean_adx_14=High | {Cluster 1} | 0.104 | 1 | 3.692 | 5 |
| 3 | mean_stock_vs_market=High,mean_rsi_14=High | {Cluster 1} | 0.146 | 1 | 3.692 | 7 |
| 4 | mean_daily_return=High,mean_rsi_14=High | {Cluster 1} | 0.146 | 1 | 3.692 | 7 |
| 1 | beta_global=Low,mean_rsi_14=Med | {Cluster 2} | 0.062 | 1 | 6.857 | 3 |
| 31 | beta_global=Low,mean_rsi_14=Med,return_skewness=Low | {Cluster 2} | 0.062 | 1 | 6.857 | 3 |
| 41 | beta_global=Low,worst_drawdown=High,mean_rsi_14=Med | {Cluster 2} | 0.062 | 1 | 6.857 | 3 |
| 11 | mean_bb_width=High | {Cluster 3} | 0.104 | 1 | 6.000 | 5 |
| 21 | mean_volatility_20d=High | {Cluster 3} | 0.104 | 1 | 6.000 | 5 |
| 32 | mean_volatility_20d=High,mean_bb_width=High | {Cluster 3} | 0.104 | 1 | 6.000 | 5 |
| 12 | mean_momentum_20d=Low | {Cluster 4} | 0.146 | 1 | 2.400 | 7 |
| 22 | mean_stoch_k=Low | {Cluster 4} | 0.167 | 1 | 2.400 | 8 |
| 33 | mean_price_pos_20d=Low | {Cluster 4} | 0.167 | 1 | 2.400 | 8 |
While clustering identifies which stocks group together, Association Rule Mining (ARM) reveals the deterministic “if-then” logic defining those groups. Using the Apriori algorithm on discretized features, we identified 1,308 rules that characterize these market regimes. Several high-conviction rules (Confidence = 1.0) provide a roadmap for cluster membership:
Rule for Cluster 1: The combination of {mean_rsi_14=High} and {mean_adx_14=High} results in a Lift of 3.69, identifying assets in well-defined bullish trends.
Rule for Cluster 2: A combination of {beta_global=Low} and {mean_rsi_14=Med} is a perfect predictor (Confidence = 1.0) for this defensive segment, yielding the highest observed Lift of 6.86.
Rule for Cluster 3: The presence of {mean_bb_width=High} or {mean_volatility_20d=High} alone is sufficient to identify these “Regime-Breakers” with a Lift of 6.0.
Rule for Cluster 4: Low 20-day Sharpe ratios paired with low relative market performance are deterministic markers for this underperforming group.
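These rules can be browsed directly from the sorted rule sets built in the code appendix; for example, the single highest-lift rule for the defensive cluster:
# Inspect the top rule predicting membership in Cluster 2
inspect(head(rules2_sorted, 1))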
The analysis suggests that the equity market, at least among liquid large-cap equities, is not structurally rigid. The relatively low cross-method ARI scores (< 0.5 for many pairs) indicate that stock groupings are fluid rather than discrete. However, the recurring emergence of a four-cluster structure across the more stable methods (PAM and Ward's method) reveals a latent hierarchy driven by risk and momentum. Specifically, the separation of the "Defensive" cluster (Cluster 2) from the "High Momentum" cluster (Cluster 3) validates the "Risk-On / Risk-Off" dynamic prevalent in the 2023–2025 trading period. The clustering results suggest that unsupervised learning can successfully recover these fundamental investment factors (Value vs. Growth, Low Vol vs. High Beta) without prior labeling.
Comparing the four algorithms reveals clear trade-offs between stability and financial utility:
Several limitations constrain the generalizability of these findings. First, the sample size (\(N=48\)) is small relative to the feature space, creating sparse density that challenges algorithms like DBSCAN. Second, the analysis aggregates features over a two-year window (static clustering). This ignores “regime shifts”—for example, a stock like Meta might behave as a “Value” stock in 2023 but a “Momentum” stock in 2024. Finally, the analysis is limited to large-cap, liquid equities; results may differ significantly for small-cap or illiquid assets where idiosyncratic noise is higher.
The results align with Modern Portfolio Theory (MPT) and the Capital Asset Pricing Model (CAPM). The primary axis of differentiation found by PCA (PC1) and the resulting clusters proxies closely for Market Beta. The fact that the algorithms spontaneously separated “High Beta/High Volatility” stocks from “Low Beta/Defensive” stocks confirms that systematic risk remains the primary organizing principle of equity markets, even when viewed through the lens of complex, high-dimensional technical features.
This comparative study demonstrated that financial market segmentation is most effective when distance-based clustering is validated by rule-based logic. Our final findings include:
Algorithmic Synergy: Hierarchical Clustering (Cosine + Average) effectively groups assets by trend similarity, while Association Rules provide the specific technical thresholds that trigger cluster membership.
Structural Stability: The market naturally stratifies into four behavioral regimes—Defensive, Core/Trend, Momentum/Volatile, and Laggards—validated by both geometric distance and logical associations.
Actionable Logic: The discovery of rules with high Lift (up to 6.86) suggests that “weak clustering” in financial data can be overcome by identifying high-conviction behavioral “recipes”.
For quantitative portfolio managers, these findings imply that sector labels (e.g., “Technology”) are insufficient proxies for behavior. A risk-managed portfolio should ensure exposure across the identified behavioral-style clusters rather than just industry verticals. Representative assets within each cluster may act as indicative behavioral proxies, though formal factor-model validation would be required to determine whether they can be reliably used for hedging or risk premia construction.
Future research should move from static to dynamic clustering (Rolling Window Analysis) to detect when stocks migrate between clusters. This would allow for the creation of “Regime Change Signals,” alerting traders when a defensive stock begins exhibiting momentum characteristics. Additionally, applying Dynamic Time Warping (DTW) as a distance metric could better capture the temporal similarity of price paths compared to aggregated statistical features.
if (!require("pacman")) install.packages("pacman")
# Load all libraries
pacman::p_load(
arrow, # Reading .parquet files (Dataset source)
dplyr, # Data manipulation (filter, select, mutate)
tidyr, # Data tidying and reshaping
knitr, # For kable
ggplot2, # Base plotting engine
factoextra, # Visualization for PCA & Clustering
corrplot, # Visualizing correlation matrices
psych, # Descriptive statistics
e1071, # Skewness and kurtosis calculations
bestNormalize, # Yeo-Johnson transformation for non-normal features
cluster, # Algorithms: PAM, Silhouette, Monothetic
dbscan, # Density-based clustering (DBSCAN) & kNN plots
mclust, # Model-based clustering & Adjusted Rand Index
NbClust, # Determining optimal 'k'
fpc, # Flexible Procedures for Clustering
clusterSim, # Cluster Similarity measures
clusterCrit, # Additional internal validity indices
arules,
arulesViz
)
# Load the parquet file
global_data <- read_parquet("data/ds1_global_data.parquet")
# Separate tickers and features
tickers <- global_data$ticker
features <- global_data %>% dplyr::select(-ticker)
# Remove zero variance features
variances <- apply(features, 2, var, na.rm = TRUE)
zero_var_features <- names(variances[variances == 0])
features <- features %>% dplyr::select(-all_of(zero_var_features))
# Skewness analysis and transformation
skew_threshold <- 2
skewness_vals <- apply(features, 2, e1071::skewness, na.rm = TRUE)
skew_data <- names(skewness_vals[abs(skewness_vals) > skew_threshold])
normal_data <- names(skewness_vals[abs(skewness_vals) <= skew_threshold])
# Visualize skewness
skew_df <- data.frame(
feature = names(skewness_vals),
skewness = skewness_vals
) %>% arrange(desc(skewness))
ggplot(skew_df, aes(x = skewness, y = reorder(feature, skewness))) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_vline(xintercept = skew_threshold, color = "red", linetype = "dashed") +
geom_vline(xintercept = -skew_threshold, color = "red", linetype = "dashed") +
geom_vline(xintercept = 0, color = "black", linetype = "solid") +
labs(title = "Figure 1: Feature Skewness Ranking",
x = "Skewness Coefficient",
y = "Feature") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
# Apply Yeo-Johnson transformation
features_transformed <- features
for(feat in skew_data) {
yn_obj <- yeojohnson(features[[feat]], standardize = FALSE, eps = 1e-8)
features_transformed[[feat]] <- predict(yn_obj)}
# Standardization
features_scaled <- as.data.frame(scale(features_transformed))
# Apply PCA
pca_result <- prcomp(features_scaled, center = FALSE, scale. = FALSE)
# Calculate explained variance
explained_var <- summary(pca_result)$importance[2, ]
cumulative_var <- cumsum(explained_var)
# Find number of components for 65% variance
n_components_65 <- which(cumulative_var >= 0.65)[1]
# Extract PCA components
feature_pca <- as.data.frame(pca_result$x[, 1:n_components_65])
# Visualize explained variance
eig.val <- get_eigenvalue(pca_result)
eig.val.top10 <- head(eig.val, 10)
eig.val.top10$PC_Label <- factor(1:nrow(eig.val.top10), levels = 1:nrow(eig.val.top10))
ggplot(eig.val.top10, aes(x = PC_Label)) +
geom_bar(aes(y = variance.percent), stat = "identity", fill = "steelblue", alpha = 0.7) +
geom_line(aes(y = cumulative.variance.percent, group = 1), color = "red", size = 1.2) +
geom_point(aes(y = cumulative.variance.percent), color = "red", size = 2) +
geom_text(aes(y = variance.percent, label = round(variance.percent, 1)), vjust = -0.5, size = 3) +
geom_text(aes(y = cumulative.variance.percent, label = round(cumulative.variance.percent, 1)), vjust = -0.5, color = "red", size = 3) +
scale_y_continuous(limits = c(0, 105), breaks = seq(0, 100, 20)) +
labs(title = "Figure 2: Scree Plot & Cumulative Variance",
subtitle = "Cumulative Variance Explained %",
x = "Principal Components",
y = "Percentage of Variance Explained") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Extract 3 and 4 components
feature_pca_3 <- as.data.frame(pca_result$x[, 1:3])
feature_pca_4 <- as.data.frame(pca_result$x[, 1:4])
# Test K-Means with both
km_3 <- kmeans(feature_pca_3, centers = 4, nstart = 10)
km_4 <- kmeans(feature_pca_4, centers = 4, nstart = 10)
# Calculate metrics for 3 components
dist_3 <- dist(feature_pca_3)
sil_3 <- mean(silhouette(km_3$cluster, dist_3)[, 3])
db_3 <- index.DB(feature_pca_3, km_3$cluster)$DB
ch_3 <- calinhara(feature_pca_3, km_3$cluster)
# Calculate metrics for 4 components
dist_4 <- dist(feature_pca_4)
sil_4 <- mean(silhouette(km_4$cluster, dist_4)[, 3])
db_4 <- index.DB(feature_pca_4, km_4$cluster)$DB
ch_4 <- calinhara(feature_pca_4, km_4$cluster)
# Compare results
comparison_3vs4 <- data.frame(
Components = c(3, 4),
Variance_Explained = c(67.4, 74.8),
Silhouette = round(c(sil_3, sil_4), 4),
Davies_Bouldin = round(c(db_3, db_4), 4),
Calinski_Harabasz = round(c(ch_3, ch_4), 4),
stringsAsFactors = FALSE
)
knitr::kable(comparison_3vs4, caption = "3 vs 4 components comparison", digits = 2)
plot_3 <- fviz_cluster(km_3,
data = feature_pca_3,
geom = "point",
ellipse.type = "convex",
palette = "Set2",
ggtheme = theme_minimal(),
main = "Figure 3.1: 3 Components (67.4% variance)",
xlab = "PC1", ylab = "PC2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot_3)
plot_4 <- fviz_cluster(km_4,
data = feature_pca_4,
geom = "point",
ellipse.type = "convex",
palette = "Set2",
ggtheme = theme_minimal(),
main = "Figure 3.2: 4 Components (74.8% variance)",
xlab = "PC1", ylab = "PC2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot_4)
fviz_pca_var(pca_result, col.var="steelblue", title = 'Figure 4: PCA Biplot of Variable loadings')
get_clust_tendency(features_scaled, n = 2, graph = TRUE,
gradient = list(low="red", mid="white", high="blue"),
seed = 123)
data_for_nbclust <- as.matrix(feature_pca)
create_index_plot <- function(nb_result, title, y_label, higher_better = TRUE) {
idx_df <- data.frame(
k = as.numeric(names(nb_result$All.index)),
value = as.numeric(nb_result$All.index)
)
y_pos <- if(higher_better) max(idx_df$value, na.rm = TRUE) else min(idx_df$value, na.rm = TRUE)
v_just <- if(higher_better) -0.5 else 1.5
ggplot(idx_df, aes(x = k, y = value)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = nb_result$Best.nc[1],
color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = nb_result$Best.nc[1], y = y_pos,
label = paste("k =", nb_result$Best.nc[1]),
color = "red", fontface = "bold", vjust = v_just) +
labs(title = title, x = "Number of Clusters (k)", y = y_label) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
}
# Run NbClust
ch_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete",
index = "ch", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
sil_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete",
index = "silhouette", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
db_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete",
index = "db", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
gap_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete",
index = "gap", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
# Create plots
p1 <- create_index_plot(ch_result, "Figure 5.1: Calinski-Harabasz Index vs k (Higher is Better)",
"Calinski-Harabasz Index", higher_better = TRUE)
p2 <- create_index_plot(sil_result, "Figure 5.2: Silhouette Index vs k (Higher is Better)",
"Silhouette Index", higher_better = TRUE)
p3 <- create_index_plot(db_result, "Figure 5.3: Davies-Bouldin Index vs k (Lower is Better)",
"Davies-Bouldin Index", higher_better = FALSE)
p4 <- create_index_plot(gap_result, "Figure 5.4: Gap Statistic vs k (Higher is Better)",
"Gap Statistic", higher_better = TRUE)
print(p1)
print(p2)
print(p3)
print(p4)
dist_eucl <- dist(feature_pca, method = "euclidean")
dist_manh <- dist(feature_pca, method = "manhattan")
cor_matrix <- cor(t(feature_pca))
dist_cos <- as.dist(1 - cor_matrix)
km_final <- kmeans(feature_pca, centers = 4, nstart = 10, iter.max = 300)
sil_obj <- silhouette(km_final$cluster, dist(feature_pca))
mean_sil <- mean(sil_obj[, 3])
db_obj <- index.DB(feature_pca, km_final$cluster)
db_score <- db_obj$DB
ch_score <- calinhara(feature_pca, km_final$cluster)
metrics_df <- data.frame(
Metric = c("Silhouette Score (Avg)", "Davies-Bouldin Index", "Calinski-Harabasz Index"),
Value = round(c(mean_sil, db_score, ch_score), 4)
)
knitr::kable(metrics_df, caption = "K-Means Clustering", digits = 3)
fviz_cluster(km_final,
data = feature_pca,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal(),
main = "Figure 6: K-Means Clustering (k=4)",
xlab = "PC1",
ylab = "PC2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
pam_eucl <- pam(dist_eucl, k = 4, diss = TRUE)
pam_manh <- pam(dist_manh, k = 4, diss = TRUE)
pam_cos <- pam(dist_cos, k = 4, diss = TRUE)
# PAM + Euclidean
sil_eucl <- mean(silhouette(pam_eucl$clustering, dist_eucl)[, 3])
db_eucl <- index.DB(feature_pca, pam_eucl$clustering)$DB
med_eucl <- paste(tickers[pam_eucl$medoids], collapse=", ")
ch_eucl <- calinhara(feature_pca, pam_eucl$cluster)
# PAM + Manhattan
sil_manh <- mean(silhouette(pam_manh$clustering, dist_manh)[, 3])
db_manh <- index.DB(feature_pca, pam_manh$clustering)$DB
med_manh <- paste(tickers[pam_manh$medoids], collapse=", ")
ch_manh <- calinhara(feature_pca, pam_manh$cluster)
# PAM + Cosine
sil_cos <- mean(silhouette(pam_cos$clustering, dist_cos)[, 3])
db_cos <- index.DB(feature_pca, pam_cos$clustering)$DB
med_cos <- paste(tickers[pam_cos$medoids], collapse=", ")
ch_cos <- calinhara(feature_pca, pam_cos$cluster)
pam_results <- data.frame(
Metric = c("Euclidean", "Manhattan", "Cosine"),
Silhouette = c(sil_eucl, sil_manh, sil_cos),
Davies_Bouldin = c(db_eucl, db_manh, db_cos),
Calinski_Harabasz = c(ch_eucl, ch_manh, ch_cos),
Medoids = c(med_eucl, med_manh, med_cos)
)
knitr::kable(pam_results, caption = "PAM Clustering Performance", digits = 3)
plot_pam <- function(model, title) {
plot_df <- as.data.frame(feature_pca)
plot_df$cluster <- as.factor(model)
col_names <- names(plot_df)[1:2]
ggplot(plot_df, aes_string(x = col_names[1], y = col_names[2], color = "cluster")) +
geom_point(size = 3, alpha = 1) +
stat_ellipse(type = "norm", level = 0.95, linetype = 2) +
labs(title = title) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(size = 10, face = "bold")
)
}
p1 <- plot_pam(pam_eucl$clustering, paste("Figure 7.1: Euclidean (Sil:", round(sil_eucl, 2), ")"))
print(p1)
p2 <- plot_pam(pam_manh$clustering, paste("Figure 7.2: Manhattan (Sil:", round(sil_manh, 2), ")"))
print(p2)
p3 <- plot_pam(pam_cos$clustering, paste("Figure 7.3: Cosine (Sil:", round(sil_cos, 2), ")"))
print(p3)
# Euclidean + Ward
hc_eucl_ward_tree <- hclust(dist_eucl, method = "ward.D2")
hc_eucl_ward_cl <- cutree(hc_eucl_ward_tree, k = 4)
sil_eucl_ward <- mean(silhouette(hc_eucl_ward_cl, dist_eucl)[, 3])
coph_eucl_ward <- cor(dist_eucl, cophenetic(hc_eucl_ward_tree))
# Euclidean + Complete
hc_eucl_comp_tree <- hclust(dist_eucl, method = "complete")
hc_eucl_comp_cl <- cutree(hc_eucl_comp_tree, k = 4)
sil_eucl_comp <- mean(silhouette(hc_eucl_comp_cl, dist_eucl)[, 3])
coph_eucl_comp <- cor(dist_eucl, cophenetic(hc_eucl_comp_tree))
# Euclidean + Average
hc_eucl_avg_tree <- hclust(dist_eucl, method = "average")
hc_eucl_avg_cl <- cutree(hc_eucl_avg_tree, k = 4)
sil_eucl_avg <- mean(silhouette(hc_eucl_avg_cl, dist_eucl)[, 3])
coph_eucl_avg <- cor(dist_eucl, cophenetic(hc_eucl_avg_tree))
# Manhattan + Complete
hc_manh_comp_tree <- hclust(dist_manh, method = "complete")
hc_manh_comp_cl <- cutree(hc_manh_comp_tree, k = 4)
sil_manh_comp <- mean(silhouette(hc_manh_comp_cl, dist_manh)[, 3])
coph_manh_comp <- cor(dist_manh, cophenetic(hc_manh_comp_tree))
# Manhattan + Average
hc_manh_avg_tree <- hclust(dist_manh, method = "average")
hc_manh_avg_cl <- cutree(hc_manh_avg_tree, k = 4)
sil_manh_avg <- mean(silhouette(hc_manh_avg_cl, dist_manh)[, 3])
coph_manh_avg <- cor(dist_manh, cophenetic(hc_manh_avg_tree))
# Cosine + Complete
hc_cos_comp_tree <- hclust(dist_cos, method = "complete")
hc_cos_comp_cl <- cutree(hc_cos_comp_tree, k = 4)
sil_cos_comp <- mean(silhouette(hc_cos_comp_cl, dist_cos)[, 3])
coph_cos_comp <- cor(dist_cos, cophenetic(hc_cos_comp_tree))
# Cosine + Average
hc_cos_avg_tree <- hclust(dist_cos, method = "average")
hc_cos_avg_cl <- cutree(hc_cos_avg_tree, k = 4)
sil_cos_avg <- mean(silhouette(hc_cos_avg_cl, dist_cos)[, 3])
coph_cos_avg <- cor(dist_cos, cophenetic(hc_cos_avg_tree))
hier_results <- data.frame(
Distance = c(rep("Euclidean", 3), rep("Manhattan", 2), rep("Cosine", 2)),
Linkage = c("Ward.D2", "Complete", "Average", "Complete", "Average", "Complete", "Average"),
Silhouette = c(sil_eucl_ward, sil_eucl_comp, sil_eucl_avg,
sil_manh_comp, sil_manh_avg,
sil_cos_comp, sil_cos_avg),
Cophenetic_Corr = c(coph_eucl_ward, coph_eucl_comp, coph_eucl_avg,
coph_manh_comp, coph_manh_avg,
coph_cos_comp, coph_cos_avg)
)
hier_results <- hier_results[order(-hier_results$Cophenetic_Corr), ]
knitr::kable(hier_results, caption = "Hierarchical Clustering Performance", digits = 3)
p1 <- fviz_dend(hc_eucl_ward_tree, k=4, show_labels=FALSE, main="Figure 8.1: Euclidean + Ward", rect=TRUE)
print(p1)
p2 <- fviz_dend(hc_eucl_comp_tree, k=4, show_labels=FALSE, main="Figure 8.2: Euclidean + Complete", rect=TRUE)
print(p2)
p3 <- fviz_dend(hc_eucl_avg_tree, k=4, show_labels=FALSE, main="Figure 8.3: Euclidean + Average", rect=TRUE)
print(p3)
p4 <- fviz_dend(hc_manh_comp_tree, k=4, show_labels=FALSE, main="Figure 8.4: Manhattan + Complete", rect=TRUE)
print(p4)
p5 <- fviz_dend(hc_manh_avg_tree, k=4, show_labels=FALSE, main="Figure 8.5: Manhattan + Average", rect=TRUE)
print(p5)
p6 <- fviz_dend(hc_cos_comp_tree, k=4, show_labels=FALSE, main="Figure 8.6: Cosine + Complete", rect=TRUE)
print(p6)
p7 <- fviz_dend(hc_cos_avg_tree, k=4, show_labels=FALSE, main="Figure 8.7: Cosine + Average", rect=TRUE)
print(p7)
compute_metrics <- function(clusters, data, dist_matrix = NULL) {
sil <- mean(silhouette(clusters, dist_matrix)[, 3])
db <- index.DB(data, clusters)$DB
ch <- calinhara(data, clusters)
return(list(
silhouette = sil,
davies_bouldin = db,
calinski_harabasz = ch
))
}
# Collect all clustering results
all_clusterings <- list(
"K-Means" = km_final$cluster,
"PAM_Euclidean" = pam_eucl$clustering,
"PAM_Manhattan" = pam_manh$clustering,
"PAM_Cosine" = pam_cos$clustering,
"HC_Euclidean_Ward" = hc_eucl_ward_cl,
"HC_Euclidean_Complete" = hc_eucl_comp_cl,
"HC_Euclidean_Average" = hc_eucl_avg_cl,
"HC_Manhattan_Complete" = hc_manh_comp_cl,
"HC_Manhattan_Average" = hc_manh_avg_cl,
"HC_Cosine_Complete" = hc_cos_comp_cl,
"HC_Cosine_Average" = hc_cos_avg_cl
)
# Compute metrics for each method
metrics_list <- list()
for (method_name in names(all_clusterings)) {
clusters <- all_clusterings[[method_name]]
# Determine which distance matrix to use
if (grepl("Euclidean", method_name) || method_name == "K-Means") {
dist_used <- dist_eucl
} else if (grepl("Manhattan", method_name)) {
dist_used <- dist_manh
} else if (grepl("Cosine", method_name)) {
dist_used <- dist_cos
} else {
dist_used <- dist_eucl
}
metrics_list[[method_name]] <- compute_metrics(clusters, feature_pca, dist_used)
}
# Create summary table
metrics_summary <- data.frame(
Silhouette = round(sapply(metrics_list, function(x) x$silhouette), 4),
Davies_Bouldin = round(sapply(metrics_list, function(x) x$davies_bouldin), 4),
Calinski_Harabasz = round(sapply(metrics_list, function(x) x$calinski_harabasz), 2)
)
# Sort by Silhouette score (descending)
metrics_summary <- metrics_summary[order(-metrics_summary$Silhouette), ]
knitr::kable(metrics_summary, caption = "Clustering Comparison by Silhouette Score", digits = 2)
# Compute pairwise ARI matrix
n_methods <- length(all_clusterings)
ari_matrix <- matrix(0, nrow = n_methods, ncol = n_methods)
rownames(ari_matrix) <- names(all_clusterings)
colnames(ari_matrix) <- names(all_clusterings)
for (i in 1:n_methods) {
for (j in 1:n_methods) {
if (i == j) {
ari_matrix[i, j] <- 1.0
} else {
ari_matrix[i, j] <- adjustedRandIndex(
all_clusterings[[i]],
all_clusterings[[j]]
)
}
}
}
# Visualize ARI matrix as heatmap
corrplot(ari_matrix,
method = "color",
type = "upper",
is.corr = FALSE,
tl.cex = 0.7,
tl.col = "black",
addCoef.col = "black",
number.cex = 0.6,
title = "Figure 9: Pairwise Adjusted Rand Index (ARI) Matrix",
mar = c(0, 0, 2, 0))
sil_values <- silhouette(hc_cos_avg_cl, dist_cos)
# Identify negative silhouette samples (Potential Outliers)
neg_sil_indices <- which(sil_values[, 3] < 0)
print(tickers[neg_sil_indices])
fviz_silhouette(sil_values, palette = "jco", print.summary = FALSE) +
labs(title = "Figure 10: Silhouette Analysis: Hierarchical Clustering (Cosine + Average)",
subtitle = paste("Average Silhouette Width =", round(mean(sil_values[, 3]), 2))) +
theme_minimal() +
theme(axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, face="bold"),
plot.subtitle = element_text(hjust = 0.5))
# Prepare discrete data
df_rules <- features_scaled
df_rules$Cluster <- as.factor(paste0("Cluster_", hc_cos_avg_cl))
# Discretize all numeric features into 3 levels
df_discrete <- as.data.frame(lapply(df_rules, function(x) {
if(is.numeric(x)) {
# Check if the variable has enough unique values
if(length(unique(x)) < 3) {
return(as.factor(x)) # Return as is if it's essentially a constant
}
tryCatch({
discretize(x, method = "cluster", breaks = 3,
labels = c("Low", "Med", "High"))
}, error = function(e) {
# Fallback to 'interval' if clustering fails
discretize(x, method = "interval", breaks = 3,
labels = c("Low", "Med", "High"))
})
} else {
return(x)
}
}))
# Look for rules with at least 5% support and 80% confidence
rules_cluster_1 <- apriori(df_discrete,
parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
appearance = list(rhs = "Cluster=Cluster_1", default = "lhs"),
control = list(verbose = FALSE))
rules1_sorted <- sort(rules_cluster_1, by = "lift")
rules_cluster_2 <- apriori(df_discrete,
parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
appearance = list(rhs = "Cluster=Cluster_2", default = "lhs"),
control = list(verbose = FALSE))
rules2_sorted <- sort(rules_cluster_2, by = "lift")
rules_cluster_3 <- apriori(df_discrete,
parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
appearance = list(rhs = "Cluster=Cluster_3", default = "lhs"),
control = list(verbose = FALSE))
rules3_sorted <- sort(rules_cluster_3, by = "lift")
rules_cluster_4 <- apriori(df_discrete,
parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
appearance = list(rhs = "Cluster=Cluster_4", default = "lhs"),
control = list(verbose = FALSE))
rules4_sorted <- sort(rules_cluster_4, by = "lift")
df_r1 <- DATAFRAME(head(rules1_sorted, 3), separate = TRUE)
df_r2 <- DATAFRAME(head(rules2_sorted, 3), separate = TRUE)
df_r3 <- DATAFRAME(head(rules3_sorted, 3), separate = TRUE)
df_r4 <- DATAFRAME(head(rules4_sorted, 3), separate = TRUE)
combined_rules <- rbind(df_r1, df_r2, df_r3, df_r4)
final_rules_table <- combined_rules %>%
dplyr::select(LHS, RHS, support, confidence, lift, count) %>%
dplyr::mutate(
LHS = gsub("[{}]", "", LHS),
RHS = gsub("Cluster=Cluster_", "Cluster ", RHS)
)
knitr::kable(final_rules_table,
caption = "Summary of Top Association Rules Across All Behavioral Clusters",
digits = 3,
col.names = c("Antecedent (Behavioral Features)", "Consequent (Target Cluster)",
"Support", "Confidence", "Lift", "N"))