The past two years of equity trading have been characterized by alternating AI-driven rallies and macroeconomic drawdowns, motivating a reassessment of how unlabeled stocks group together when measured through risk, return, and technical behavior. This paper applies unsupervised learning to 48 liquid equities from November 2023 to November 2025, aggregating daily market data into six conceptual feature blocks: risk/return, momentum, volatility, liquidity, technical patterns, and distributional characteristics. We evaluate K-means, PAM, hierarchical clustering, and DBSCAN, interpreting Silhouette, Davies–Bouldin, and Calinski–Harabasz scores together with density cues and financial diagnostics, to see how assumptions shape segmentation. The results highlight the sensitivity of financial clustering to preprocessing choices and demonstrate that weak or unstable groupings are themselves informative. Weak clustering suggests that market behaviors overlap heavily, consistent with the hypothesis that equities do not partition cleanly into static behavioral regimes.
Segmenting stocks is useful for several reasons. First, many financial features lack natural labels, making supervised learning unsuitable; clustering allows us to explore whether meaningful structure exists at all. Second, grouping stocks with similar risk or momentum behavior supports diversification by avoiding unintended exposure to a single regime. Finally, cluster patterns can reveal how assets co-move during volatility shocks or macroeconomic cycles, offering insight into market regimes.
Clustering financial assets is inherently challenging because market data often exhibits weak or overlapping structure. Features such as volatility, momentum, and liquidity are highly correlated, heavy-tailed, and regime-dependent, which means natural clusters may not exist or may be unstable across periods. Rather than expecting clean partitions, the purpose of clustering here is exploratory: to assess whether any persistent structure emerges and to understand how algorithmic assumptions influence the segmentation. In this context, “weak clustering” is still informative because it indicates that financial behaviors are diffuse rather than distinctly separated.
This paper evaluates how different clustering algorithms behave when applied to aggregated stock features. The objective is not to identify a single “best” segmentation but to understand how assumptions—distance metrics, standardization, dimensionality reduction, and linkage rules—affect the resulting groups. By comparing K-means, PAM, hierarchical clustering, and DBSCAN in a PCA-reduced space, we examine robustness, sensitivity, and interpretability.
The study addresses four questions:
# Load the parquet file
global_data <- read_parquet("data/ds1_global_data.parquet")
# Separate tickers and features
tickers <- global_data$ticker
features <- global_data %>% dplyr::select(-ticker)
# Remove zero variance features
variances <- apply(features, 2, var, na.rm = TRUE)
zero_var_features <- names(variances[variances == 0])
features <- features %>% dplyr::select(-all_of(zero_var_features))
Twenty features were grouped into six conceptual categories:
Risk/Return (mean_daily_return, mean_stock_vs_market, beta_global, mean_sharpe_20d, worst_drawdown): captures performance and risk-adjusted returns, helping to separate high-risk/high-return stocks from low-risk/stable ones.
Momentum (mean_momentum_20d, mean_close_vs_sma200, mean_adx_14, mean_price_pos_20d): identifies trending versus range-bound stocks.
Volatility (mean_volatility_20d, mean_atr_14, mean_volatility_ratio, mean_bb_width): separates high-volatility from low-volatility stocks, which is important for risk management and volatility-based strategies.
Liquidity (mean_liquidity_20d, mean_volume_ratio): identifies liquid versus illiquid stocks for execution and market-impact considerations.
Technical patterns (mean_rsi_14, mean_macd_hist, mean_stoch_k): represents short-term price behavior patterns.
Distributional characteristics (return_skewness, return_kurtosis): detects asymmetric return behavior, useful for risk modeling.
All features represent stock-level aggregates computed over the entire sample period (one row per stock), rather than time-series sequences.
Highly skewed features were transformed using Yeo–Johnson to reduce heavy-tailed behavior. All features were then standardized to ensure equal contribution to distance calculations. PCA was applied to mitigate multicollinearity and reduce noise.
# Apply Yeo-Johnson transformation
features_transformed <- features
for(feat in skew_data) {
yn_obj <- yeojohnson(features[[feat]], standardize = FALSE, eps = 1e-8)
features_transformed[[feat]] <- predict(yn_obj)}
# Standardization
features_scaled <- as.data.frame(scale(features_transformed))
# Apply PCA
pca_result <- prcomp(features_scaled, center = FALSE, scale. = FALSE)
To verify that reducing the feature space did not distort cluster structure, we compared K-Means performance using the first 3 versus 4 principal components. Adding PC4 increased total explained variance from 67.4% to 74.8%, but it did not improve internal validation scores: the Silhouette score fell and the Davies–Bouldin index rose, and the cluster boundaries in the 4-component space were not more distinct. This indicates that the additional variance captured by PC4 reflects noise or idiosyncratic effects rather than meaningful structure.
# Extract 3 and 4 components
feature_pca_3 <- as.data.frame(pca_result$x[, 1:3])
feature_pca_4 <- as.data.frame(pca_result$x[, 1:4])
| Components | Variance_Explained | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|---|
| 3 | 67.4 | 0.41 | 0.67 | 36.21 |
| 4 | 74.8 | 0.28 | 1.34 | 22.75 |
Given the limited sample size (n = 48) and the diminishing returns after PC3, we retained the 3-component solution for all clustering analyses.
fviz_pca_var(pca_result, col.var="steelblue", title = 'Figure 4: PCA Biplot of Variable loadings')
The PCA biplot shows that many features load strongly and directionally onto PC1, confirming a high degree of multicollinearity across risk, momentum, and technical indicators. This redundancy means that distance-based methods (especially K-Means and Ward’s method) would overweight these correlated features unless dimensionality is reduced. Retaining the first three components removes noise while preserving broad market structure (trend, volatility, momentum). The goal is not interpretability of PCs, but to reduce distortion in cluster geometry caused by correlated inputs.
get_clust_tendency(features_scaled, n = 2, graph = TRUE,
gradient = list(low="red", mid="white", high="blue"),
seed = 123)
## $hopkins_stat
## [1] 0.7832834
##
## $plot
The Hopkins statistic of 0.78 indicates a strong clustering tendency and aligns with the dissimilarity heatmap, which shows block-like low-distance regions separated by high-distance bands. The data therefore contains some structure, though its clarity must be validated across multiple clustering methods.
Clustering uncovers latent patterns in unlabeled financial features. It helps assess similarity in risk, momentum, or volatility profiles without assuming predefined categories. This is valuable for identifying diversification groups, stress-response regimes, and behavioral-style clusters that may not align with sector classifications.
To select the number of clusters, we used NbClust, which aggregates multiple internal validity indices.
We evaluated four internal indices (Gap Statistic, Silhouette, Davies–Bouldin, Calinski–Harabasz). Results varied: Gap suggested k=2, CH suggested k=5, and both Silhouette and Davies–Bouldin converged on k=4. We selected k=4 based on the majority of stability-focused metrics and consistency with broad financial regimes.
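As a supplementary cross-check of the k selection (this call is not part of the original pipeline; it is an assumed extra diagnostic using factoextra, which is already loaded), a single-index sweep over candidate k can be plotted in one line:
# Supplementary sketch: average silhouette width across candidate k under K-means,
# computed on the same PCA-reduced features used throughout the analysis
fviz_nbclust(feature_pca, kmeans, method = "silhouette", k.max = 10)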
dist_eucl <- dist(feature_pca, method = "euclidean")
dist_manh <- dist(feature_pca, method = "manhattan")
cor_matrix <- cor(t(feature_pca))
dist_cos <- as.dist(1 - cor_matrix)
We compared Euclidean, Manhattan, and cosine-like distance (1 − correlation). These metrics capture different geometric assumptions and allow assessment of how cluster assignments depend on distance definitions.
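For precision, the "cosine-like" distance computed above is the correlation distance: because cor(t(feature_pca)) centers each stock's vector of PC scores before normalizing, the resulting dissimilarity is
\[
d_{\mathrm{cos}}(x_i, x_j) = 1 - \frac{(x_i - \bar{x}_i\mathbf{1})^{\top}(x_j - \bar{x}_j\mathbf{1})}{\lVert x_i - \bar{x}_i\mathbf{1}\rVert \, \lVert x_j - \bar{x}_j\mathbf{1}\rVert},
\]
where \(\bar{x}_i\) is the mean of stock \(i\)'s components. This is the cosine distance between row-centered feature vectors, so it compares the direction of each stock's profile rather than its magnitude.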
K-means provides a fast baseline for comparison. It assumes spherical clusters and equal variance, making it sensitive to outliers and skewed financial features. Standardized PCA features partly mitigate this, but K-means remains prone to boundary distortion. Metrics and visualization evaluate whether the method captures coherent groups.
km_final <- kmeans(feature_pca, centers = 4, nstart = 10, iter.max = 300)
| Metric | Value |
|---|---|
| Silhouette Score (Avg) | 0.336 |
| Davies-Bouldin Index | 1.220 |
| Calinski-Harabasz Index | 28.614 |
PAM uses medoids, making it more robust to outliers and suitable for noisy financial distributions. Because medoids represent actual stocks, cluster centers are easier to interpret. We evaluate PAM with multiple distance metrics to test stability and robustness relative to K-means.
pam_eucl <- pam(dist_eucl, k = 4, diss = TRUE)
pam_manh <- pam(dist_manh, k = 4, diss = TRUE)
pam_cos <- pam(dist_cos, k = 4, diss = TRUE)
| Distance | Silhouette | Davies_Bouldin | Calinski_Harabasz | Medoids |
|---|---|---|---|---|
| Euclidean | 0.309 | 1.142 | 26.132 | MSFT, GOOGL, PFE, TSLA |
| Manhattan | 0.254 | 1.482 | 20.973 | CAT, T, MRK, NEE |
| Cosine | 0.623 | 1.865 | 18.279 | TSM, SPY, COP, META |
Agglomerative hierarchical clustering with Ward, complete, and average linkage was applied. Dendrograms reveal multi-scale structure without fixing k upfront. Cophenetic correlation assesses how well each linkage preserves original pairwise distances.
# Euclidean + Ward
hc_eucl_ward_tree <- hclust(dist_eucl, method = "ward.D2")
hc_eucl_ward_cl <- cutree(hc_eucl_ward_tree, k = 4)
# Euclidean + Complete
hc_eucl_comp_tree <- hclust(dist_eucl, method = "complete")
hc_eucl_comp_cl <- cutree(hc_eucl_comp_tree, k = 4)
# Euclidean + Average
hc_eucl_avg_tree <- hclust(dist_eucl, method = "average")
hc_eucl_avg_cl <- cutree(hc_eucl_avg_tree, k = 4)
# Manhattan + Complete
hc_manh_comp_tree <- hclust(dist_manh, method = "complete")
hc_manh_comp_cl <- cutree(hc_manh_comp_tree, k = 4)
# Manhattan + Average
hc_manh_avg_tree <- hclust(dist_manh, method = "average")
hc_manh_avg_cl <- cutree(hc_manh_avg_tree, k = 4)
# Cosine + Complete
hc_cos_comp_tree <- hclust(dist_cos, method = "complete")
hc_cos_comp_cl <- cutree(hc_cos_comp_tree, k = 4)
# Cosine + Average
hc_cos_avg_tree <- hclust(dist_cos, method = "average")
hc_cos_avg_cl <- cutree(hc_cos_avg_tree, k = 4)
| | Distance | Linkage | Silhouette | Cophenetic_Corr |
|---|---|---|---|---|
| 7 | Cosine | Average | 0.737 | 0.849 |
| 5 | Manhattan | Average | 0.386 | 0.832 |
| 3 | Euclidean | Average | 0.417 | 0.805 |
| 4 | Manhattan | Complete | 0.389 | 0.786 |
| 2 | Euclidean | Complete | 0.373 | 0.713 |
| 6 | Cosine | Complete | 0.615 | 0.710 |
| 1 | Euclidean | Ward.D2 | 0.410 | 0.614 |
We evaluated Ward, Complete, and Average linkage with different distance matrices. Average linkage yielded the highest cophenetic correlation coefficient (0.85, with the cosine distance), indicating that it most accurately preserved the original pairwise dissimilarities.
We use three complementary internal validation metrics to assess cluster quality:
Silhouette Score: Measures how similar each point is to its own cluster compared to the nearest other cluster. Values range from -1 to 1, with higher values indicating better-defined clusters.
Davies-Bouldin Index: Calculates the average similarity ratio of each cluster with its most similar cluster. Lower values indicate better clustering, as they reflect higher within-cluster compactness and greater between-cluster separation.
Calinski-Harabasz Index: Also known as the Variance Ratio Criterion, this measures the ratio of between-cluster variance to within-cluster variance. Higher values indicate better-defined clusters with clear separation. This metric is particularly useful for comparing different numbers of clusters.
Together, these metrics evaluate different aspects of cluster quality: individual point assignment (Silhouette), cluster-level separation (Davies-Bouldin), and overall variance structure (Calinski-Harabasz).
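For reference, the standard formulations of these three indices are
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}, \qquad
CH = \frac{B/(k-1)}{W/(n-k)},
\]
where \(a(i)\) is the mean distance from observation \(i\) to the other members of its cluster, \(b(i)\) is the mean distance to the members of the nearest other cluster, \(\sigma_i\) is the average distance of cluster \(i\)'s members from its centroid \(c_i\), \(d(c_i, c_j)\) is the distance between centroids, and \(B\) and \(W\) are the between- and within-cluster sums of squares for \(n\) observations in \(k\) clusters.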
| Method | Silhouette | Davies_Bouldin | Calinski_Harabasz |
|---|---|---|---|
| HC_Cosine_Average | 0.74 | 1.12 | 18.80 |
| PAM_Cosine | 0.62 | 1.86 | 18.28 |
| HC_Cosine_Complete | 0.62 | 1.42 | 20.20 |
| HC_Euclidean_Average | 0.42 | 0.65 | 33.16 |
| HC_Euclidean_Ward | 0.41 | 0.64 | 31.53 |
| HC_Manhattan_Complete | 0.39 | 0.61 | 29.06 |
| HC_Manhattan_Average | 0.39 | 0.58 | 24.33 |
| HC_Euclidean_Complete | 0.37 | 0.70 | 31.38 |
| K-Means | 0.34 | 1.22 | 28.61 |
| PAM_Euclidean | 0.31 | 1.14 | 26.13 |
| PAM_Manhattan | 0.25 | 1.48 | 20.97 |
Clustering financial data often yields unstable results because feature distributions overlap and the underlying structure is weak. To assess how sensitive the cluster assignments are to algorithmic assumptions, we compute the pairwise Adjusted Rand Index (ARI) across all clustering methods. ARI measures the agreement between two partitions after correcting for chance; values near 1 indicate nearly identical clusters, while values near 0 indicate unrelated segmentations.
The ARI heatmap shows moderate similarity among Euclidean-based methods (K-Means, Ward's linkage, and PAM-Euclidean), with values generally between 0.4 and 0.6; no method pair consistently exceeds 0.6 except hierarchical algorithms sharing the same distance metric. Manhattan- and cosine-based partitions show substantially lower agreement, indicating that small changes in how distances are computed lead to different groupings. This is direct evidence of weak intrinsic structure in the feature space rather than a failure of any particular algorithm, and it confirms that the segmentation is highly sensitive to methodological assumptions.
For financial applications, this instability is itself an informative result:
In short, the stability analysis confirms that the dataset does not contain strongly separable clusters, and any conclusions must account for this sensitivity.
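For example, the agreement between the K-Means partition and the cosine/average hierarchical partition can be computed directly with adjustedRandIndex from the mclust package loaded earlier (both label vectors are defined above):
# Pairwise agreement between two specific partitions, corrected for chance
adjustedRandIndex(km_final$cluster, hc_cos_avg_cl)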
Across the PCA-reduced feature space, the algorithms produced different partitions in terms of compactness, separation, and stability. K-means achieved one of the stronger Calinski–Harabasz values, indicating well-defined centroids under Euclidean distance. PAM delivered the most consistent Silhouette and Davies–Bouldin scores across distance metrics, producing compact clusters in multiple configurations.
Hierarchical clustering produced heterogeneous results depending on the linkage method. Ward’s linkage generated compact clusters comparable to K-means, while Average linkage achieved the highest cophenetic correlation, indicating it best preserved the original dissimilarity structure. In contrast, Complete linkage produced more elongated cluster shapes and lower internal scores.
DBSCAN identified a small set of points as noise and formed only one or two dense groups, reflecting the absence of clear density-separated regions. It did not yield a multi-cluster segmentation comparable to the partitioning or hierarchical methods.
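The DBSCAN configuration is not echoed in the code shown in this report, so the following is a minimal sketch of how such a run could be reproduced on the retained principal components; the eps and minPts values are illustrative assumptions read off a kNN-distance plot, not the exact settings behind the result above.
# Sketch only: choose eps near the "knee" of the sorted k-NN distances,
# then cluster the PCA scores; points labelled 0 are treated as noise
kNNdistplot(feature_pca, k = 4)
abline(h = 1.5, lty = 2)                      # illustrative eps guess
db_res <- dbscan::dbscan(feature_pca, eps = 1.5, minPts = 4)
table(db_res$cluster)                         # cluster 0 = noise points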
Across all algorithms, cluster boundaries were diffuse and partially overlapping. PAM and Ward linkage generated the most internally consistent partitions, while distance-dependent methods showed greater variability in cluster assignments.
Cosine-distance hierarchical clustering with average linkage produced four behaviorally meaningful groups:
Cluster 1 — Broad Market Backbone: Large, liquid mega-caps such as AAPL, JPM, HD, and IBM. These stocks exhibit stable, low-volatility behavior with moderate momentum and consistent factor exposures. Their feature directions are strongly aligned, leading cosine similarity to cluster them tightly.
Cluster 2 — Defensive & Low-Beta Assets: JNJ, INTC, WMT, MA, LLY, and ABBV form a low-beta cluster characterized by stable returns, defensive fundamentals, and weak participation in speculative market cycles. These stocks share a consistent risk profile despite belonging to different sectors.
Cluster 3 — Momentum-Led Tech & Growth: NVDA, META, NFLX, AMZN, TSLA, and BABA appear together due to strong momentum, high volatility, and price trends dominated by macro/AI catalysts. Cosine distance groups them because their normalized feature vectors show similar trend direction even when magnitudes differ.
Cluster 4 — Macro-Sensitive & Idiosyncratic Risk: XOM, CVX, DIS, TGT, PEP, PG, NEE, CMCSA, and MRK exhibit higher idiosyncratic volatility or macro-driven patterns (energy shocks, consumer cyclicality, policy sensitivity). This cluster captures stocks whose behavior is irregular or influenced by sector-specific risks rather than broad market trends.
Overall, the HC–Cosine–Average model reveals a latent structure organized primarily along volatility–momentum axes, with defensive stocks at one extreme and growth/momentum stocks at the other, while macro-driven equities form a distinct idiosyncratic group.
A critical finding in this study was the presence of structural noise that challenged standard clustering assumptions. We employed two diagnostic methods to isolate these anomalies.
The Silhouette Width \(s(i)\) measures how similar a point is to its own cluster compared to other clusters. Points with \(s(i) < 0\) suggest misclassification.
sil_values <- silhouette(hc_cos_avg_cl, dist_cos)
## [1] "CVX"
To interpret the underlying “DNA” of the identified segments, we analyzed the standardized means of all 20 features across the four clusters. This profiling reveals the specific factor exposures that drive each group:
Cluster 1 (Trend Leaders): Characterized by high relative strength (mean_stock_vs_market), strong technical indicators (mean_stoch_k, mean_rsi_14), and high price positioning (mean_price_pos_20d).
Cluster 2 (Defensive/Low-Beta): Defined by negative global beta and the lowest volatility scores (mean_volatility_20d, mean_bb_width) across the dataset.
Cluster 3 (High-Volatility Speculative): Displays extreme scores in momentum, daily returns, and volatility expansion, indicated by high Bollinger Band width.
Cluster 4 (Laggards): Shows negative means across nearly all performance metrics, representing assets in persistent downtrends during the sample period.
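These profiles come from aggregating the standardized features by cluster; a minimal sketch of one way to reproduce that aggregation (an assumed reconstruction using the HC Cosine + Average labels in hc_cos_avg_cl) is:
# Assumed reconstruction: mean of each standardized feature within each cluster
cluster_profiles <- aggregate(features_scaled,
                              by = list(Cluster = hc_cos_avg_cl),
                              FUN = mean)
knitr::kable(cluster_profiles, digits = 2,
             caption = "Standardized feature means by cluster (profiling sketch)")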
| | Antecedent (Behavioral Features) | Consequent (Target Cluster) | Support | Confidence | Lift | N |
|---|---|---|---|---|---|---|
| 2 | mean_rsi_14=High,mean_adx_14=High | {Cluster 1} | 0.104 | 1 | 3.692 | 5 |
| 3 | mean_stock_vs_market=High,mean_rsi_14=High | {Cluster 1} | 0.146 | 1 | 3.692 | 7 |
| 4 | mean_daily_return=High,mean_rsi_14=High | {Cluster 1} | 0.146 | 1 | 3.692 | 7 |
| 1 | beta_global=Low,mean_rsi_14=Med | {Cluster 2} | 0.062 | 1 | 6.857 | 3 |
| 31 | beta_global=Low,mean_rsi_14=Med,return_skewness=Low | {Cluster 2} | 0.062 | 1 | 6.857 | 3 |
| 41 | beta_global=Low,worst_drawdown=High,mean_rsi_14=Med | {Cluster 2} | 0.062 | 1 | 6.857 | 3 |
| 11 | mean_bb_width=High | {Cluster 3} | 0.104 | 1 | 6.000 | 5 |
| 21 | mean_volatility_20d=High | {Cluster 3} | 0.104 | 1 | 6.000 | 5 |
| 32 | mean_volatility_20d=High,mean_bb_width=High | {Cluster 3} | 0.104 | 1 | 6.000 | 5 |
| 12 | mean_momentum_20d=Low | {Cluster 4} | 0.146 | 1 | 2.400 | 7 |
| 22 | mean_stoch_k=Low | {Cluster 4} | 0.167 | 1 | 2.400 | 8 |
| 33 | mean_price_pos_20d=Low | {Cluster 4} | 0.167 | 1 | 2.400 | 8 |
While clustering identifies which stocks group together, Association Rule Mining (ARM) reveals the deterministic “if-then” logic defining those groups. Using the Apriori algorithm on discretized features, we identified 1,308 rules that characterize these market regimes. Several high-conviction rules (Confidence = 1.0) provide a roadmap for cluster membership:
Rule for Cluster 1: The combination of {mean_rsi_14=High} and {mean_adx_14=High} results in a Lift of 3.69, identifying assets in well-defined bullish trends.
Rule for Cluster 2: A combination of {beta_global=Low} and {mean_rsi_14=Med} is a perfect predictor (Confidence = 1.0) for this defensive segment, yielding the highest observed Lift of 6.86.
Rule for Cluster 3: The presence of {mean_bb_width=High} or {mean_volatility_20d=High} alone is sufficient to identify these “Regime-Breakers” with a Lift of 6.0.
Rule for Cluster 4: Low 20-day Sharpe ratios paired with low relative market performance are deterministic markers for this underperforming group.
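These rules can be browsed directly from the sorted rule sets built in the code appendix; for example, the single highest-lift rule for the defensive cluster:
# Inspect the top rule predicting membership in Cluster 2
inspect(head(rules2_sorted, 1))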
The analysis suggests that the equity market, at least among liquid large-cap equities, is not structurally rigid. The relatively low cross-method ARI scores (< 0.5 for many pairs) indicate that stock groupings are fluid rather than discrete. However, the recurring emergence of a four-cluster structure across the more stable methods (PAM and Ward's method) reveals a latent hierarchy driven by risk and momentum. Specifically, the separation of the "Defensive" cluster (Cluster 2) from the "High Momentum" cluster (Cluster 3) validates the "Risk-On / Risk-Off" dynamic prevalent in the 2023–2025 trading period. The clustering results suggest that unsupervised learning can successfully recover these fundamental investment factors (Value vs. Growth, Low Vol vs. High Beta) without prior labeling.
Comparing the four algorithms reveals clear trade-offs between stability and financial utility:
Several limitations constrain the generalizability of these findings. First, the sample size (\(N=48\)) is small relative to the feature space, creating sparse density that challenges algorithms like DBSCAN. Second, the analysis aggregates features over a two-year window (static clustering). This ignores “regime shifts”—for example, a stock like Meta might behave as a “Value” stock in 2023 but a “Momentum” stock in 2024. Finally, the analysis is limited to large-cap, liquid equities; results may differ significantly for small-cap or illiquid assets where idiosyncratic noise is higher.
The results align with Modern Portfolio Theory (MPT) and the Capital Asset Pricing Model (CAPM). The primary axis of differentiation found by PCA (PC1) and the resulting clusters proxies closely for Market Beta. The fact that the algorithms spontaneously separated “High Beta/High Volatility” stocks from “Low Beta/Defensive” stocks confirms that systematic risk remains the primary organizing principle of equity markets, even when viewed through the lens of complex, high-dimensional technical features.
This comparative study demonstrated that financial market segmentation is most effective when distance-based clustering is validated by rule-based logic. Our final findings include:
Algorithmic Synergy: Hierarchical Clustering (Cosine + Average) effectively groups assets by trend similarity, while Association Rules provide the specific technical thresholds that trigger cluster membership.
Structural Stability: The market naturally stratifies into four behavioral regimes—Defensive, Core/Trend, Momentum/Volatile, and Laggards—validated by both geometric distance and logical associations.
Actionable Logic: The discovery of rules with high Lift (up to 6.86) suggests that “weak clustering” in financial data can be overcome by identifying high-conviction behavioral “recipes”.
For quantitative portfolio managers, these findings imply that sector labels (e.g., “Technology”) are insufficient proxies for behavior. A risk-managed portfolio should ensure exposure across the identified behavioral-style clusters rather than just industry verticals. Representative assets within each cluster may act as indicative behavioral proxies, though formal factor-model validation would be required to determine whether they can be reliably used for hedging or risk premia construction.
Future research should move from static to dynamic clustering (Rolling Window Analysis) to detect when stocks migrate between clusters. This would allow for the creation of “Regime Change Signals,” alerting traders when a defensive stock begins exhibiting momentum characteristics. Additionally, applying Dynamic Time Warping (DTW) as a distance metric could better capture the temporal similarity of price paths compared to aggregated statistical features.
if (!require("pacman")) install.packages("pacman")
# Load all libraries
pacman::p_load(
arrow, # Reading .parquet files (Dataset source)
dplyr, # Data manipulation (filter, select, mutate)
tidyr, # Data tidying and reshaping
knitr, # For kable
ggplot2, # Base plotting engine
factoextra, # Visualization for PCA & Clustering
corrplot, # Visualizing correlation matrices
psych, # Descriptive statistics
e1071, # Skewness and kurtosis calculations
bestNormalize, # Yeo-Johnson transformation for non-normal features
cluster, # Algorithms: PAM, Silhouette, Monothetic
dbscan, # Density-based clustering (DBSCAN) & kNN plots
mclust, # Model-based clustering & Adjusted Rand Index
NbClust, # Determining optimal 'k'
fpc, # Flexible Procedures for Clustering
clusterSim, # Cluster Similarity measures
clusterCrit, # Additional internal validity indices
arules,
arulesViz
)
# Load the parquet file
global_data <- read_parquet("data/ds1_global_data.parquet")
# Separate tickers and features
tickers <- global_data$ticker
features <- global_data %>% dplyr::select(-ticker)
# Remove zero variance features
variances <- apply(features, 2, var, na.rm = TRUE)
zero_var_features <- names(variances[variances == 0])
features <- features %>% dplyr::select(-all_of(zero_var_features))
# Skewness analysis and transformation
skew_threshold <- 2
skewness_vals <- apply(features, 2, e1071::skewness, na.rm = TRUE)
skew_data <- names(skewness_vals[abs(skewness_vals) > skew_threshold])
normal_data <- names(skewness_vals[abs(skewness_vals) <= skew_threshold])
# Visualize skewness
skew_df <- data.frame(
feature = names(skewness_vals),
skewness = skewness_vals
) %>% arrange(desc(skewness))
ggplot(skew_df, aes(x = skewness, y = reorder(feature, skewness))) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_vline(xintercept = skew_threshold, color = "red", linetype = "dashed") +
geom_vline(xintercept = -skew_threshold, color = "red", linetype = "dashed") +
geom_vline(xintercept = 0, color = "black", linetype = "solid") +
labs(title = "Figure 1: Feature Skewness Ranking",
x = "Skewness Coefficient",
y = "Feature") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
# Apply Yeo-Johnson transformation
features_transformed <- features
for(feat in skew_data) {
yn_obj <- yeojohnson(features[[feat]], standardize = FALSE, eps = 1e-8)
features_transformed[[feat]] <- predict(yn_obj)}
# Standardization
features_scaled <- as.data.frame(scale(features_transformed))
# Apply PCA
pca_result <- prcomp(features_scaled, center = FALSE, scale. = FALSE)
# Calculate explained variance
explained_var <- summary(pca_result)$importance[2, ]
cumulative_var <- cumsum(explained_var)
# Find number of components for 65% variance
n_components_65 <- which(cumulative_var >= 0.65)[1]
# Extract PCA components
feature_pca <- as.data.frame(pca_result$x[, 1:n_components_65])
# Visualize explained variance
eig.val <- get_eigenvalue(pca_result)
eig.val.top10 <- head(eig.val, 10)
eig.val.top10$PC_Label <- factor(1:nrow(eig.val.top10), levels = 1:nrow(eig.val.top10))
ggplot(eig.val.top10, aes(x = PC_Label)) +
geom_bar(aes(y = variance.percent), stat = "identity", fill = "steelblue", alpha = 0.7) +
geom_line(aes(y = cumulative.variance.percent, group = 1), color = "red", size = 1.2) +
geom_point(aes(y = cumulative.variance.percent), color = "red", size = 2) +
geom_text(aes(y = variance.percent, label = round(variance.percent, 1)), vjust = -0.5, size = 3) +
geom_text(aes(y = cumulative.variance.percent, label = round(cumulative.variance.percent, 1)), vjust = -0.5, color = "red", size = 3) +
scale_y_continuous(limits = c(0, 105), breaks = seq(0, 100, 20)) +
labs(title = "Figure 2: Scree Plot & Cumulative Variance",
subtitle = "Cumulative Variance Explained %",
x = "Principal Components",
y = "Percentage of Variance Explained") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Extract 3 and 4 components
feature_pca_3 <- as.data.frame(pca_result$x[, 1:3])
feature_pca_4 <- as.data.frame(pca_result$x[, 1:4])
# Test K-Means with both
km_3 <- kmeans(feature_pca_3, centers = 4, nstart = 10)
km_4 <- kmeans(feature_pca_4, centers = 4, nstart = 10)
# Calculate metrics for 3 components
dist_3 <- dist(feature_pca_3)
sil_3 <- mean(silhouette(km_3$cluster, dist_3)[, 3])
db_3 <- index.DB(feature_pca_3, km_3$cluster)$DB
ch_3 <- calinhara(feature_pca_3, km_3$cluster)
# Calculate metrics for 4 components
dist_4 <- dist(feature_pca_4)
sil_4 <- mean(silhouette(km_4$cluster, dist_4)[, 3])
db_4 <- index.DB(feature_pca_4, km_4$cluster)$DB
ch_4 <- calinhara(feature_pca_4, km_4$cluster)
# Compare results
comparison_3vs4 <- data.frame(
Components = c(3, 4),
Variance_Explained = c(67.4, 74.8),
Silhouette = round(c(sil_3, sil_4), 4),
Davies_Bouldin = round(c(db_3, db_4), 4),
Calinski_Harabasz = round(c(ch_3, ch_4), 4),
stringsAsFactors = FALSE
)
knitr::kable(comparison_3vs4, caption = "3 vs 4 components comparison", digits = 2)
plot_3 <- fviz_cluster(km_3,
data = feature_pca_3,
geom = "point",
ellipse.type = "convex",
palette = "Set2",
ggtheme = theme_minimal(),
main = "Figure 3.1: 3 Components (67.4% variance)",
xlab = "PC1", ylab = "PC2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot_3)
plot_4 <- fviz_cluster(km_4,
data = feature_pca_4,
geom = "point",
ellipse.type = "convex",
palette = "Set2",
ggtheme = theme_minimal(),
main = "Figure 3.2: 4 Components (74.8% variance)",
xlab = "PC1", ylab = "PC2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot_4)
fviz_pca_var(pca_result, col.var="steelblue", title = 'Figure 4: PCA Biplot of Variable loadings')
get_clust_tendency(features_scaled, n = 2, graph = TRUE,
gradient = list(low="red", mid="white", high="blue"),
seed = 123)
data_for_nbclust <- as.matrix(feature_pca)
create_index_plot <- function(nb_result, title, y_label, higher_better = TRUE) {
idx_df <- data.frame(
k = as.numeric(names(nb_result$All.index)),
value = as.numeric(nb_result$All.index)
)
y_pos <- if(higher_better) max(idx_df$value, na.rm = TRUE) else min(idx_df$value, na.rm = TRUE)
v_just <- if(higher_better) -0.5 else 1.5
ggplot(idx_df, aes(x = k, y = value)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = nb_result$Best.nc[1],
color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = nb_result$Best.nc[1], y = y_pos,
label = paste("k =", nb_result$Best.nc[1]),
color = "red", fontface = "bold", vjust = v_just) +
labs(title = title, x = "Number of Clusters (k)", y = y_label) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
}
# Run NbClust
ch_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete",
index = "ch", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
sil_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete",
index = "silhouette", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
db_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete",
index = "db", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
gap_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete",
index = "gap", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
# Create plots
p1 <- create_index_plot(ch_result, "Figure 5.1: Calinski-Harabasz Index vs k (Higher is Better)",
"Calinski-Harabasz Index", higher_better = TRUE)
p2 <- create_index_plot(sil_result, "Figure 5.2: Silhouette Index vs k (Higher is Better)",
"Silhouette Index", higher_better = TRUE)
p3 <- create_index_plot(db_result, "Figure 5.3: Davies-Bouldin Index vs k (Lower is Better)",
"Davies-Bouldin Index", higher_better = FALSE)
p4 <- create_index_plot(gap_result, "Figure 5.4: Gap Statistic vs k (Higher is Better)",
"Gap Statistic", higher_better = TRUE)
print(p1)
print(p2)
print(p3)
print(p4)
dist_eucl <- dist(feature_pca, method = "euclidean")
dist_manh <- dist(feature_pca, method = "manhattan")
cor_matrix <- cor(t(feature_pca))
dist_cos <- as.dist(1 - cor_matrix)
km_final <- kmeans(feature_pca, centers = 4, nstart = 10, iter.max = 300)
sil_obj <- silhouette(km_final$cluster, dist(feature_pca))
mean_sil <- mean(sil_obj[, 3])
db_obj <- index.DB(feature_pca, km_final$cluster)
db_score <- db_obj$DB
ch_score <- calinhara(feature_pca, km_final$cluster)
metrics_df <- data.frame(
Metric = c("Silhouette Score (Avg)", "Davies-Bouldin Index", "Calinski-Harabasz Index"),
Value = round(c(mean_sil, db_score, ch_score), 4)
)
knitr::kable(metrics_df, caption = "K-Means Clustering", digits = 3)
fviz_cluster(km_final,
data = feature_pca,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal(),
main = "Figure 6: K-Means Clustering (k=4)",
xlab = "PC1",
ylab = "PC2") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
pam_eucl <- pam(dist_eucl, k = 4, diss = TRUE)
pam_manh <- pam(dist_manh, k = 4, diss = TRUE)
pam_cos <- pam(dist_cos, k = 4, diss = TRUE)
# PAM + Euclidean
sil_eucl <- mean(silhouette(pam_eucl$clustering, dist_eucl)[, 3])
db_eucl <- index.DB(feature_pca, pam_eucl$clustering)$DB
med_eucl <- paste(tickers[pam_eucl$medoids], collapse=", ")
ch_eucl <- calinhara(feature_pca, pam_eucl$cluster)
# PAM + Manhattan
sil_manh <- mean(silhouette(pam_manh$clustering, dist_manh)[, 3])
db_manh <- index.DB(feature_pca, pam_manh$clustering)$DB
med_manh <- paste(tickers[pam_manh$medoids], collapse=", ")
ch_manh <- calinhara(feature_pca, pam_manh$cluster)
# PAM + Cosine
sil_cos <- mean(silhouette(pam_cos$clustering, dist_cos)[, 3])
db_cos <- index.DB(feature_pca, pam_cos$clustering)$DB
med_cos <- paste(tickers[pam_cos$medoids], collapse=", ")
ch_cos <- calinhara(feature_pca, pam_cos$cluster)
pam_results <- data.frame(
Metric = c("Euclidean", "Manhattan", "Cosine"),
Silhouette = c(sil_eucl, sil_manh, sil_cos),
Davies_Bouldin = c(db_eucl, db_manh, db_cos),
Calinski_Harabasz = c(ch_eucl, ch_manh, ch_cos),
Medoids = c(med_eucl, med_manh, med_cos)
)
knitr::kable(pam_results, caption = "PAM Clustering Performance", digits = 3)
plot_pam <- function(model, title) {
plot_df <- as.data.frame(feature_pca)
plot_df$cluster <- as.factor(model)
col_names <- names(plot_df)[1:2]
ggplot(plot_df, aes_string(x = col_names[1], y = col_names[2], color = "cluster")) +
geom_point(size = 3, alpha = 1) +
stat_ellipse(type = "norm", level = 0.95, linetype = 2) +
labs(title = title) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(size = 10, face = "bold")
)
}
p1 <- plot_pam(pam_eucl$clustering, paste("Figure 7.1: Euclidean (Sil:", round(sil_eucl, 2), ")"))
print(p1)
p2 <- plot_pam(pam_manh$clustering, paste("Figure 7.2: Manhattan (Sil:", round(sil_manh, 2), ")"))
print(p2)
p3 <- plot_pam(pam_cos$clustering, paste("Figure 7.3: Cosine (Sil:", round(sil_cos, 2), ")"))
print(p3)
# Euclidean + Ward
hc_eucl_ward_tree <- hclust(dist_eucl, method = "ward.D2")
hc_eucl_ward_cl <- cutree(hc_eucl_ward_tree, k = 4)
sil_eucl_ward <- mean(silhouette(hc_eucl_ward_cl, dist_eucl)[, 3])
coph_eucl_ward <- cor(dist_eucl, cophenetic(hc_eucl_ward_tree))
# Euclidean + Complete
hc_eucl_comp_tree <- hclust(dist_eucl, method = "complete")
hc_eucl_comp_cl <- cutree(hc_eucl_comp_tree, k = 4)
sil_eucl_comp <- mean(silhouette(hc_eucl_comp_cl, dist_eucl)[, 3])
coph_eucl_comp <- cor(dist_eucl, cophenetic(hc_eucl_comp_tree))
# Euclidean + Average
hc_eucl_avg_tree <- hclust(dist_eucl, method = "average")
hc_eucl_avg_cl <- cutree(hc_eucl_avg_tree, k = 4)
sil_eucl_avg <- mean(silhouette(hc_eucl_avg_cl, dist_eucl)[, 3])
coph_eucl_avg <- cor(dist_eucl, cophenetic(hc_eucl_avg_tree))
# Manhattan + Complete
hc_manh_comp_tree <- hclust(dist_manh, method = "complete")
hc_manh_comp_cl <- cutree(hc_manh_comp_tree, k = 4)
sil_manh_comp <- mean(silhouette(hc_manh_comp_cl, dist_manh)[, 3])
coph_manh_comp <- cor(dist_manh, cophenetic(hc_manh_comp_tree))
# Manhattan + Average
hc_manh_avg_tree <- hclust(dist_manh, method = "average")
hc_manh_avg_cl <- cutree(hc_manh_avg_tree, k = 4)
sil_manh_avg <- mean(silhouette(hc_manh_avg_cl, dist_manh)[, 3])
coph_manh_avg <- cor(dist_manh, cophenetic(hc_manh_avg_tree))
# Cosine + Complete
hc_cos_comp_tree <- hclust(dist_cos, method = "complete")
hc_cos_comp_cl <- cutree(hc_cos_comp_tree, k = 4)
sil_cos_comp <- mean(silhouette(hc_cos_comp_cl, dist_cos)[, 3])
coph_cos_comp <- cor(dist_cos, cophenetic(hc_cos_comp_tree))
# Cosine + Average
hc_cos_avg_tree <- hclust(dist_cos, method = "average")
hc_cos_avg_cl <- cutree(hc_cos_avg_tree, k = 4)
sil_cos_avg <- mean(silhouette(hc_cos_avg_cl, dist_cos)[, 3])
coph_cos_avg <- cor(dist_cos, cophenetic(hc_cos_avg_tree))
hier_results <- data.frame(
Distance = c(rep("Euclidean", 3), rep("Manhattan", 2), rep("Cosine", 2)),
Linkage = c("Ward.D2", "Complete", "Average", "Complete", "Average", "Complete", "Average"),
Silhouette = c(sil_eucl_ward, sil_eucl_comp, sil_eucl_avg,
sil_manh_comp, sil_manh_avg,
sil_cos_comp, sil_cos_avg),
Cophenetic_Corr = c(coph_eucl_ward, coph_eucl_comp, coph_eucl_avg,
coph_manh_comp, coph_manh_avg,
coph_cos_comp, coph_cos_avg)
)
hier_results <- hier_results[order(-hier_results$Cophenetic_Corr), ]
knitr::kable(hier_results, caption = "Hierarchical Clustering Performance", digits = 3)
p1 <- fviz_dend(hc_eucl_ward_tree, k=4, show_labels=FALSE, main="Figure 8.1: Euclidean + Ward", rect=TRUE)
print(p1)
p2 <- fviz_dend(hc_eucl_comp_tree, k=4, show_labels=FALSE, main="Figure 8.2: Euclidean + Complete", rect=TRUE)
print(p2)
p3 <- fviz_dend(hc_eucl_avg_tree, k=4, show_labels=FALSE, main="Figure 8.3: Euclidean + Average", rect=TRUE)
print(p3)
p4 <- fviz_dend(hc_manh_comp_tree, k=4, show_labels=FALSE, main="Figure 8.4: Manhattan + Complete", rect=TRUE)
print(p4)
p5 <- fviz_dend(hc_manh_avg_tree, k=4, show_labels=FALSE, main="Figure 8.5: Manhattan + Average", rect=TRUE)
print(p5)
p6 <- fviz_dend(hc_cos_comp_tree, k=4, show_labels=FALSE, main="Figure 8.6: Cosine + Complete", rect=TRUE)
print(p6)
p7 <- fviz_dend(hc_cos_avg_tree, k=4, show_labels=FALSE, main="Figure 8.7: Cosine + Average", rect=TRUE)
print(p7)
compute_metrics <- function(clusters, data, dist_matrix = NULL) {
sil <- mean(silhouette(clusters, dist_matrix)[, 3])
db <- index.DB(data, clusters)$DB
ch <- calinhara(data, clusters)
return(list(
silhouette = sil,
davies_bouldin = db,
calinski_harabasz = ch
))
}
# Collect all clustering results
all_clusterings <- list(
"K-Means" = km_final$cluster,
"PAM_Euclidean" = pam_eucl$clustering,
"PAM_Manhattan" = pam_manh$clustering,
"PAM_Cosine" = pam_cos$clustering,
"HC_Euclidean_Ward" = hc_eucl_ward_cl,
"HC_Euclidean_Complete" = hc_eucl_comp_cl,
"HC_Euclidean_Average" = hc_eucl_avg_cl,
"HC_Manhattan_Complete" = hc_manh_comp_cl,
"HC_Manhattan_Average" = hc_manh_avg_cl,
"HC_Cosine_Complete" = hc_cos_comp_cl,
"HC_Cosine_Average" = hc_cos_avg_cl
)
# Compute metrics for each method
metrics_list <- list()
for (method_name in names(all_clusterings)) {
clusters <- all_clusterings[[method_name]]
# Determine which distance matrix to use
if (grepl("Euclidean", method_name) || method_name == "K-Means") {
dist_used <- dist_eucl
} else if (grepl("Manhattan", method_name)) {
dist_used <- dist_manh
} else if (grepl("Cosine", method_name)) {
dist_used <- dist_cos
} else {
dist_used <- dist_eucl
}
metrics_list[[method_name]] <- compute_metrics(clusters, feature_pca, dist_used)
}
# Create summary table
metrics_summary <- data.frame(
Silhouette = round(sapply(metrics_list, function(x) x$silhouette), 4),
Davies_Bouldin = round(sapply(metrics_list, function(x) x$davies_bouldin), 4),
Calinski_Harabasz = round(sapply(metrics_list, function(x) x$calinski_harabasz), 2)
)
# Sort by Silhouette score (descending)
metrics_summary <- metrics_summary[order(-metrics_summary$Silhouette), ]
knitr::kable(metrics_summary, caption = "Clustering Comparison by Silhouette Score", digits = 2)
# Compute pairwise ARI matrix
n_methods <- length(all_clusterings)
ari_matrix <- matrix(0, nrow = n_methods, ncol = n_methods)
rownames(ari_matrix) <- names(all_clusterings)
colnames(ari_matrix) <- names(all_clusterings)
for (i in 1:n_methods) {
for (j in 1:n_methods) {
if (i == j) {
ari_matrix[i, j] <- 1.0
} else {
ari_matrix[i, j] <- adjustedRandIndex(
all_clusterings[[i]],
all_clusterings[[j]]
)
}
}
}
# Visualize ARI matrix as heatmap
corrplot(ari_matrix,
method = "color",
type = "upper",
is.corr = FALSE,
tl.cex = 0.7,
tl.col = "black",
addCoef.col = "black",
number.cex = 0.6,
title = "Figure 9: Pairwise Adjusted Rand Index (ARI) Matrix",
mar = c(0, 0, 2, 0))
sil_values <- silhouette(hc_cos_avg_cl, dist_cos)
# Identify negative silhouette samples (Potential Outliers)
neg_sil_indices <- which(sil_values[, 3] < 0)
print(tickers[neg_sil_indices])
fviz_silhouette(sil_values, palette = "jco", print.summary = FALSE) +
labs(title = "Figure 10: Silhouette Analysis: Hierarchical Clustering (Cosine + Average)",
subtitle = paste("Average Silhouette Width =", round(mean(sil_values[, 3]), 2))) +
theme_minimal() +
theme(axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, face="bold"),
plot.subtitle = element_text(hjust = 0.5))
# Prepare discrete data
df_rules <- features_scaled
df_rules$Cluster <- as.factor(paste0("Cluster_", hc_cos_avg_cl))
# Discretize all numeric features into 3 levels
df_discrete <- as.data.frame(lapply(df_rules, function(x) {
if(is.numeric(x)) {
# Check if the variable has enough unique values
if(length(unique(x)) < 3) {
return(as.factor(x)) # Return as is if it's essentially a constant
}
tryCatch({
discretize(x, method = "cluster", breaks = 3,
labels = c("Low", "Med", "High"))
}, error = function(e) {
# Fallback to 'interval' if clustering fails
discretize(x, method = "interval", breaks = 3,
labels = c("Low", "Med", "High"))
})
} else {
return(x)
}
}))
# Look for rules with at least 5% support and 80% confidence
rules_cluster_1 <- apriori(df_discrete,
parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
appearance = list(rhs = "Cluster=Cluster_1", default = "lhs"),
control = list(verbose = FALSE))
rules1_sorted <- sort(rules_cluster_1, by = "lift")
rules_cluster_2 <- apriori(df_discrete,
parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
appearance = list(rhs = "Cluster=Cluster_2", default = "lhs"),
control = list(verbose = FALSE))
rules2_sorted <- sort(rules_cluster_2, by = "lift")
rules_cluster_3 <- apriori(df_discrete,
parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
appearance = list(rhs = "Cluster=Cluster_3", default = "lhs"),
control = list(verbose = FALSE))
rules3_sorted <- sort(rules_cluster_3, by = "lift")
rules_cluster_4 <- apriori(df_discrete,
parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
appearance = list(rhs = "Cluster=Cluster_4", default = "lhs"),
control = list(verbose = FALSE))
rules4_sorted <- sort(rules_cluster_4, by = "lift")
df_r1 <- DATAFRAME(head(rules1_sorted, 3), separate = TRUE)
df_r2 <- DATAFRAME(head(rules2_sorted, 3), separate = TRUE)
df_r3 <- DATAFRAME(head(rules3_sorted, 3), separate = TRUE)
df_r4 <- DATAFRAME(head(rules4_sorted, 3), separate = TRUE)
combined_rules <- rbind(df_r1, df_r2, df_r3, df_r4)
final_rules_table <- combined_rules %>%
dplyr::select(LHS, RHS, support, confidence, lift, count) %>%
dplyr::mutate(
LHS = gsub("[{}]", "", LHS),
RHS = gsub("Cluster=Cluster_", "Cluster ", RHS)
)
knitr::kable(final_rules_table,
caption = "Summary of Top Association Rules Across All Behavioral Clusters",
digits = 3,
col.names = c("Antecedent (Behavioral Features)", "Consequent (Target Cluster)",
"Support", "Confidence", "Lift", "N"))