The stock market over the last two years has been a rollercoaster, driven by AI hype and shifting economic conditions. This paper uses machine learning to analyze 48 major stocks and determine if they naturally fall into distinct groups based on their behavior—looking at factors like risk, speed of price changes (momentum), and trading volume. We tested three different sorting algorithms (K-means, PAM, and Hierarchical Clustering) to identify hidden market patterns. Crucially, we extended this analysis using a technique called Association Rule Mining to translate complex mathematical groups into simple “If-Then” rules. The results show that while the boundaries between stock groups can be blurry, they are actually governed by clear, identifiable logic—allowing us to distinguish between “Safe Defensive” stocks and “Risky High-Flyers” using simple technical thresholds.

1. Introduction

1.1 Problem context

Grouping stocks into meaningful categories is useful for several reasons. First, because financial data doesn’t come with pre-existing labels (like “Safe” or “Risky”), we need methods that can find patterns on their own. Second, grouping stocks with similar behaviors helps investors avoid putting all their eggs in one basket—if you own ten stocks that all behave exactly the same way, you aren’t truly diversified. Finally, these groups can reveal how different companies react to real-world events, such as economic downturns or sudden market rallies.

However, organizing stocks into clear-cut groups is difficult. Financial data is messy; features like price trends and trading volume are often tangled together, meaning clear “borders” between groups rarely exist. Instead of expecting perfect, separate boxes, the goal of this study is exploratory: to see if any consistent patterns emerge and to understand how different mathematical methods influence the results. In this context, even “weak” groups are valuable findings because they show that market behavior is often a spectrum rather than a set of rigid categories.

1.2 Purpose of the paper

This paper tests how different computer algorithms group stocks based on their behavior. The goal is not just to find the “best” method, but to understand the “why” behind the groupings. We compare three popular techniques (K-means, PAM, and Hierarchical Clustering) to see how sensitive they are to different assumptions. Most importantly, we take these complex mathematical groups and translate them into simple logic using a method called Association Rule Mining. This allows us to move from saying “these stocks belong to Cluster A” to explaining why—for example, “these stocks belong to Cluster A because they are all stable, low-risk companies.”

1.3 Research questions

The study addresses four questions:

How do different clustering algorithms compare when trying to group stocks?
Do consistent groups appear regardless of which method we use?
Can mathematical quality scores help us decide which groups are “real”?
Can we find simple “If-Then” rules (like “If a stock is risky AND rising fast…”) that explain exactly why a stock falls into a specific group?

2. Data & Feature Explanation

2.1 Data Source

The dataset analyzed in this study comprises daily trading data for 48 large-cap equities across diverse industries. The raw data was sourced from https://massive.com/ and covers a period of nearly two years (Nov 2023 – Nov 2025)
To ensure reproducibility, the final aggregated dataset has been deposited in the Zenodo repository and is available at: Aggregated trading profile of 48 stocks in the US market

# Load the parquet file
global_data <- read_parquet("data/ds1_global_data.parquet")
# Separate tickers and features
tickers <- global_data$ticker
features <- global_data %>% dplyr::select(-ticker)
# Remove zero variance features
variances <- apply(features, 2, var, na.rm = TRUE)
zero_var_features <- names(variances[variances == 0])
features <- features %>% dplyr::select(-all_of(zero_var_features))

2.2 Feature Logic

We grouped 20 different financial metrics into six simple categories to capture every aspect of how a stock behaves:

Risk/Return (mean_daily_return, mean_stock_vs_market, beta_global, mean_sharpe_20d, worst_drawdown): These measure the trade-off between how much money a stock makes versus how much it might lose. They help distinguish between “slow and steady” stocks and “high-risk, high-reward” ones.
Momentum (mean_momentum_20d, mean_close_vs_sma200, mean_adx_14, mean_price_pos_20): These measure the “speed” of a stock. Is it trending upward aggressively, or is it stuck in a flat range?
Volatility (mean_volatility_20d, mean_atr_14, mean_volatility_ratio, mean_bb_width): These measuren how wildly the price swings up and down. This is crucial for understanding how risky a stock is on a daily basis.
Volume/Liquidity (mean_liquidity_20d, mean_volume_ratio): This measures how easy it is to buy or sell the stock. “Liquid” stocks trade frequently without moving the price much; “illiquid” stocks are harder to exit.
Technical indicators (mean_rsi_14, mean_macd_hist, mean_stoch_k): These are mathematical patterns commonly used by traders to predict short-term price movements.
Distributional features (return_skewness, return_kurtosis): These look for statistical quirks—like whether a stock is prone to sudden, extreme crashes (“fat tails”) that normal averages might miss.

Instead of looking at a timeline of prices, we summarized two years of behavior into a single “profile” for each stock.

2.3 Transformations & Standardization

Highly skewed features were transformed using Yeo–Johnson to reduce heavy-tailed behavior. All features were then standardized to ensure equal contribution to distance calculations. PCA was applied to mitigate multicollinearity and reduce noise.

# Apply Yeo-Johnson transformation
features_transformed <- features
  for(feat in skew_data) {
    yn_obj <- yeojohnson(features[[feat]], standardize = FALSE, eps = 1e-8)
    features_transformed[[feat]] <- predict(yn_obj)}
# Standardization
features_scaled <- as.data.frame(scale(features_transformed))

2.4 PCA (Principal Component Analysis)

# Apply PCA
pca_result <- prcomp(features_scaled, center = FALSE, scale. = FALSE)

To verify that reducing the feature space did not distort cluster structure, we compared K-Means performance using the first 3 versus 4 principal components. Adding PC4 increased total explained variance from 67.4% to 74.8%, but did not consistently improve internal validation scores. Silhouette and Davies–Bouldin values changed only marginally, and the cluster boundaries in the 4-component space were not more distinct. This indicates that the additional variance captured by PC4 reflects noise or idiosyncratic effects rather than meaningful structure.

# Extract 3 and 4 components
feature_pca_3 <- as.data.frame(pca_result$x[, 1:3])
feature_pca_4 <- as.data.frame(pca_result$x[, 1:4])

3 vs 4 components comparison
Components	Variance_Explained	Silhouette	Davies_Bouldin	Calinski_Harabasz
3	67.4	0.41	0.67	36.21
4	74.8	0.35	0.79	27.24

Given the limited sample size (n = 48) and the diminishing returns after PC3, we retained the 3-component solution for all clustering analyses.

fviz_pca_var(pca_result, col.var="steelblue", title = 'Figure 4: PCA Biplot of Variable loadings')

The PCA biplot shows that many features load strongly and directionally onto PC1, confirming a high degree of multicollinearity across risk, momentum, and technical indicators. This redundancy means that distance-based methods (especially K-Means and Ward’s method) would overweight these correlated features unless dimensionality is reduced. Retaining the first three components removes noise while preserving broad market structure (trend, volatility, momentum). The goal is not interpretability of PCs, but to reduce distortion in cluster geometry caused by correlated inputs.

2.5 Cluster tendency

get_clust_tendency(features_scaled, n = 2, graph = TRUE,
                   gradient = list(low="red", mid="white", high="blue"),
                   seed = 123)

## $hopkins_stat
## [1] 0.7832834
## 
## $plot

Hopkins statistic = 0.78, indicating strong cluster tendency, aligns with the dissimilarity heatmap, which shows block-like low-distance regions separated by high-distance bands. The data therefore contains some structure, though its clarity must be validated through multiple clustering methods.

3. Methods

3.1 Why clustering?

Clustering is a way to find hidden groups in data that doesn’t have labels. In the stock market, we usually group companies by sector (e.g., “Technology” vs. “Healthcare”). However, a tech stock might sometimes behave exactly like a healthcare stock. Clustering allows us to ignore the official labels and group stocks based on how they actually act. This is valuable for building safer portfolios — if we own five stocks that behave differently, we are better protected than if we own five stocks that act the same, regardless of their sector.

3.2 Optimal Number of Clusters

Deciding how many groups (clusters) exist in the data is a difficult problem with no single right answer. To solve this, we used a tool called NbClust, which runs multiple mathematical tests to vote on the best number.

The tests gave different suggestions: some said there were only 2 groups, while others suggested 5. However, the majority of stability tests pointed to 4 clusters. We chose 4 because it provided the best balance—grouping stocks into distinct regimes without splitting them into tiny, meaningless fragments.

3.3 Different distance matrices

dist_eucl <- dist(feature_pca, method = "euclidean")
dist_manh <- dist(feature_pca, method = "manhattan")
cor_matrix <- cor(t(feature_pca))
dist_cos <- as.dist(1 - cor_matrix)

We compared Euclidean, Manhattan, and cosine-like distance (1 − correlation). These metrics capture different geometric assumptions and allow assessment of how cluster assignments depend on distance definitions.

3.4 K-Means

K-means provides a fast baseline for comparison. It assumes spherical clusters and equal variance, making it sensitive to outliers and skewed financial features. Standardized PCA features partly mitigate this, but K-means remains prone to boundary distortion. Metrics and visualization evaluate whether the method captures coherent groups.

km_final <- kmeans(feature_pca, centers = 4, nstart = 10, iter.max = 300)

K-Means Clustering
Metric	Value
Silhouette Score (Avg)	0.336
Davies-Bouldin Index	1.220
Calinski-Harabasz Index	28.614

3.5 PAM (Partitioning Around Medoids)

PAM uses medoids, making it more robust to outliers and suitable for noisy financial distributions. Because medoids represent actual stocks, cluster centers are easier to interpret. We evaluate PAM with multiple distance metrics to test stability and robustness relative to K-means.

pam_eucl <- pam(dist_eucl, k = 4, diss = TRUE)
pam_manh <- pam(dist_manh, k = 4, diss = TRUE)
pam_cos  <- pam(dist_cos, k = 4, diss = TRUE)

PAM Clustering Performance
Metric	Silhouette	Davies_Bouldin	Calinski_Harabasz	Medoids
Euclidean	0.309	1.142	26.132	MSFT, GOOGL, PFE, TSLA
Manhattan	0.254	1.482	20.973	CAT, T, MRK, NEE
Cosine	0.623	1.865	18.279	TSM, SPY, COP, META

3.6 Hierarchical Clustering

Agglomerative hierarchical clustering with Ward, complete, and average linkage was applied. Dendrograms reveal multi-scale structure without fixing k upfront. Cophenetic correlation assesses how well each linkage preserves original pairwise distances.

# Euclidean + Ward
hc_eucl_ward_tree <- hclust(dist_eucl, method = "ward.D2")
hc_eucl_ward_cl   <- cutree(hc_eucl_ward_tree, k = 4)
# Euclidean + Complete
hc_eucl_comp_tree <- hclust(dist_eucl, method = "complete")
hc_eucl_comp_cl   <- cutree(hc_eucl_comp_tree, k = 4)
# Euclidean + Average
hc_eucl_avg_tree  <- hclust(dist_eucl, method = "average")
hc_eucl_avg_cl    <- cutree(hc_eucl_avg_tree, k = 4)
# Manhattan + Complete
hc_manh_comp_tree <- hclust(dist_manh, method = "complete")
hc_manh_comp_cl   <- cutree(hc_manh_comp_tree, k = 4)
# Manhattan + Average
hc_manh_avg_tree  <- hclust(dist_manh, method = "average")
hc_manh_avg_cl    <- cutree(hc_manh_avg_tree, k = 4)
# Cosine + Complete
hc_cos_comp_tree  <- hclust(dist_cos, method = "complete")
hc_cos_comp_cl    <- cutree(hc_cos_comp_tree, k = 4)
# Cosine + Average
hc_cos_avg_tree   <- hclust(dist_cos, method = "average")
hc_cos_avg_cl     <- cutree(hc_cos_avg_tree, k = 4)

Hierarchical Clustering Performance
	Distance	Linkage	Silhouette	Cophenetic_Corr
7	Cosine	Average	0.737	0.849
5	Manhattan	Average	0.386	0.832
3	Euclidean	Average	0.417	0.805
4	Manhattan	Complete	0.389	0.786
2	Euclidean	Complete	0.373	0.713
6	Cosine	Complete	0.615	0.710
1	Euclidean	Ward.D2	0.410	0.614

We evaluated Single, Complete, Average, and Ward linkage methods with different distance matrices. Average linkage yielded the highest Cophenetic Correlation Coefficient (0.85), indicating it most accurately preserved the pairwise distances of the original feature space.

3.7 Evaluation Metrics & Their Meaning

We use three complementary internal validation metrics to assess cluster quality:

Silhouette Score: This asks, “Does this stock fit well in its group?” A high score means the stock sits comfortably in its cluster; a low score means it might belong somewhere else.
Davies-Bouldin Index: This asks, “Are the groups distinct?” We want low scores here, which would mean the clusters are tight and don’t overlap too much with each other.
Calinski-Harabasz Index: This measures “compactness.” It rewards clusters that are dense (members are close together) and well-separated from other clusters.

Together, these scores help us decide if our computer-generated groups are mathematically sound or just random noise.

Clustering Comparison by Silhouette Score
	Silhouette	Davies_Bouldin	Calinski_Harabasz
HC_Cosine_Average	0.74	1.12	18.80
PAM_Cosine	0.62	1.86	18.28
HC_Cosine_Complete	0.62	1.42	20.20
HC_Euclidean_Average	0.42	0.65	33.16
HC_Euclidean_Ward	0.41	0.64	31.53
HC_Manhattan_Complete	0.39	0.61	29.06
HC_Manhattan_Average	0.39	0.58	24.33
HC_Euclidean_Complete	0.37	0.70	31.38
K-Means	0.34	1.22	28.61
PAM_Euclidean	0.31	1.14	26.13
PAM_Manhattan	0.25	1.48	20.97

The ARI matrix shows that no method pair consistently exceeds 0.6 except among hierarchical algorithms using the same distance metric. This confirms that clustering is highly sensitive to methodological assumptions and indicates that the underlying financial structure is weakly separable.

3.8 Stability Analysis

A major challenge in financial data is that stocks rarely fit into perfect, separate boxes. To test if our clusters were “real” or just a fluke of the algorithm, we used the Adjusted Rand Index (ARI). Think of ARI as a “consistency score”: if we run two different methods, do they put the same stocks in the same groups?

Score of 1.0: The methods agree perfectly.

Score of 0.0: The methods essentially guessed randomly.

Our analysis showed that methods using similar math (like Euclidean distance) agreed reasonably well (scores around 0.6). However, when we changed how we measured “distance” (e.g., looking at correlation instead of raw value), the agreement dropped.

This “instability” is actually an important finding. It proves that the stock market is not made of rigid, separate islands. Instead, stock behavior is a continuous spectrum. The fact that different algorithms draw different borders confirms that while we can find broad “regimes” (like Defensive vs. Aggressive), the lines between them are naturally blurry.

4. Results

4.1 Comparison of Methods

When we applied the different algorithms to the stock data, they produced slightly different ways of slicing the market.

K-Means: This method tried to create tight, circular groups. It worked mathematically but struggled with the messy, overlapping reality of stock behavior.
PAM: This was the most reliable performer. It was consistent and didn’t get confused by extreme data points (outliers), giving us the most stable groups.
Hierarchical Clustering: This method builds a “family tree” of stocks. We found that connecting stocks based on their average similarity worked best, as it preserved the natural relationships in the data without distorting the shapes of the groups.

Overall, no method found perfect, separated islands. The borders between groups were fuzzy, which confirms that stock behaviors often overlap rather than sitting in rigid boxes.

4.2 Cluster Interpretability

Cosine-distance hierarchical clustering with average linkage produced four behaviorally meaningful groups:

Cluster 1 — The Market Leaders: Acts as the “anchor” of the economy. It includes massive, established companies like Apple (AAPL) and JPMorgan (JPM). These stocks don’t move as fast as the tech superstars, but they are reliable and define the general direction of the market.

Cluster 2 — The Safety Net: Represents “safety.” It includes companies that people need regardless of the economy, such as Walmart (WMT), Johnson & Johnson (JNJ), and Coca-Cola (KO). In financial terms, these are “Low Beta” stocks—meaning when the market crashes, these stocks usually stay stable or fall much less.

Cluster 3 — The High Flyers: This group is the “risk-on” engine. It includes the fastest-moving companies like Nvidia (NVDA), Tesla (TSLA), and Meta. These stocks offer the highest potential rewards but come with the highest volatility. They are visually located on the far right of the chart, indicating they are the most sensitive to market enthusiasm.

Cluster 4 — The Sector Movers: Consists of companies that react strongly to specific economic cycles rather than general market hype. It includes Energy stocks like Exxon (XOM) (reacts to oil prices) and Retailers like Target (TGT) (reacts to consumer spending). They don’t follow the tech crowd; instead, they move based on real-world economic news.

Using our best model, we discovered that the market organizes itself along a clear spectrum. At one extreme, you have Safe/Defensive stocks (like Walmart). At the other extreme, you have Risky/Aggressive stocks (like Nvidia). In between, there are groups that react to specific economic events. This proves that despite the daily noise of the stock market, there is a hidden structure based on Risk and Speed (Momentum).

4.3 Outlier detection

A key finding was that some stocks simply didn’t fit the standard rules. To identify these “misfits,” we used a diagnostic tool.

4.3.1 By Silhouette width

The Silhouette Width \(s(i)\) measures how similar a point is to its own cluster compared to other clusters. Points with \(s(i) < 0\) suggest misclassification.

sil_values <- silhouette(hc_cos_avg_cl, dist_cos)

## [1] "CVX"

4.4 Association rules

Summary of Top Association Rules Across All Behavioral Clusters
	Antecedent (Behavioral Features)	Consequent (Target Cluster)	Support	Confidence	Lift	N
2	beta_global=High,mean_price_pos_20d=High	{Cluster 1}	0.062	1	3.692	3
4	mean_stock_vs_market=High,mean_price_pos_20d=High	{Cluster 1}	0.083	1	3.692	4
5	mean_daily_return=High,mean_price_pos_20d=High	{Cluster 1}	0.083	1	3.692	4
1	beta_global=Low,mean_stoch_k=Med	{Cluster 2}	0.062	1	6.857	3
21	beta_global=Low,mean_price_pos_20d=Med	{Cluster 2}	0.062	1	6.857	3
6	beta_global=Low,mean_stoch_k=Med,return_skewness=Low	{Cluster 2}	0.062	1	6.857	3
11	mean_volatility_20d=High	{Cluster 3}	0.104	1	6.000	5
22	mean_bb_width=High	{Cluster 3}	0.104	1	6.000	5
3	mean_volatility_20d=High,mean_bb_width=High	{Cluster 3}	0.104	1	6.000	5
12	mean_momentum_20d=Low	{Cluster 4}	0.146	1	2.400	7
23	mean_adx_14=Low	{Cluster 4}	0.146	1	2.400	7
31	mean_rsi_14=Low	{Cluster 4}	0.188	1	2.400	9

If clustering tells us who hangs out together, Association Rules tell us why. We used a computer algorithm to find simple “If-Then” logic that defines each group.

Rule for Cluster 1: If a stock has High RSI (strength) and High ADX (trend strength), it is almost certainly a Leader.
Rule for Cluster 2: If a stock has Low Beta (stability) and Medium RSI , it is a perfect match for the Defensive group. This simple rule was 100% accurate in identifying safety stocks.
Rule for Cluster 3: High Volatility. If the price swings are wide enough, the stock is automatically flagged as a “High Flyer.”
Rule for Cluster 4: If a stock has a Low Risk-Adjusted Return (Sharpe Ratio) and is losing to the market average, it is mathematically defined as a Laggard.

5. Discussion

5.1 The Big Picture of the Market

Our analysis proves that the stock market isn’t made of rigid, separate boxes. Stock behaviors overlap and shift. However, a clear structure still emerged: stocks naturally sort themselves based on Risk and Speed (Momentum). This confirms that even without human labels, the computer identified the fundamental “Safe vs. Risky” dynamic that drives the real economy.

5.2 Methodological Assessment

We tested four different methods to see which understood the market best:

K-Means: Struggled. It got confused by extreme outliers (like Tesla) and couldn’t handle the messiness of real-world data.
PAM: Very stable, but sometimes missed the difference between “good volatility” (winning big) and “bad volatility” (losing big).
Hierarchical Clustering: The best method. It grouped stocks based on how they move together, not just how much they move.
Association Rules: The most useful addition. It translated complex math into simple English rules (e.g., “If Risk is Low, then it is a Safety Stock”), making the results actually usable.

5.3 Limitations

There are three limits to this study:

Small Group: We only looked at 48 major stocks.
Time Snapshot: We looked at a 2-year average. We did not track how a stock’s personality might have changed from month to month.
Simplification: Converting complex data into simple “High/Low” rules loses some fine details.

6. Conclusions

6.1 Summary of Findings

Best Approach: The most effective way to analyze the market is to combine Clustering (grouping similar items) with Rules (explaining why they are grouped).
Four Personalities: The market naturally splits into four behavioral types: Safety Net, Market Leaders, High Flyers, and Laggards.
Clear Recipes: We found that even “weak” groups follow strict rules. We can now mathematically define exactly what makes a stock “Defensive” or “Speculative.”

6.2 Practical Implications

For investors, the lesson is simple: Don’t trust labels. A “Technology” stock might behave exactly like a “Utility” stock. Instead of diversifying by industry names, investors should diversify by behavior. Using our simple rules, anyone can quickly tag a stock as “Safe” or “Risky” based on its actual performance rather than its category.

Appendix: Full R Scripts

if (!require("pacman")) install.packages("pacman")

# Load all libraries
pacman::p_load(
  arrow,          # Reading .parquet files (Dataset source)
  dplyr,          # Data manipulation (filter, select, mutate)
  tidyr,          # Data tidying and reshaping
  knitr,          # For kable
  ggplot2,        # Base plotting engine
  factoextra,     # Visualization for PCA & Clustering
  corrplot,       # Visualizing correlation matrices
  psych,          # Descriptive statistics
  e1071,          # Skewness and kurtosis calculations
  bestNormalize,  # Yeo-Johnson transformation for non-normal features
  cluster,        # Algorithms: PAM, Silhouette, Monothetic
  dbscan,         # Density-based clustering (DBSCAN) & kNN plots
  mclust,         # Model-based clustering & Adjusted Rand Index
  NbClust,        # Determining optimal 'k'
  fpc,            # Flexible Procedures for Clustering
  clusterSim,     # Cluster Similarity measures
  clusterCrit,    # Additional internal validity indices
  arules,         # Association Rules
  arulesViz,      # Visualizing Rules
  ggrepel         # Non-overlapping labels for plots
)

# Load the parquet file
global_data <- read_parquet("data/ds1_global_data.parquet")
# Separate tickers and features
tickers <- global_data$ticker
features <- global_data %>% dplyr::select(-ticker)
# Remove zero variance features
variances <- apply(features, 2, var, na.rm = TRUE)
zero_var_features <- names(variances[variances == 0])
features <- features %>% dplyr::select(-all_of(zero_var_features))

# Skewness analysis and transformation
skew_threshold <- 2
skewness_vals <- apply(features, 2, e1071::skewness, na.rm = TRUE)
skew_data <- names(skewness_vals[abs(skewness_vals) > skew_threshold])
normal_data <- names(skewness_vals[abs(skewness_vals) <= skew_threshold])

# Visualize skewness
skew_df <- data.frame(
  feature = names(skewness_vals),
  skewness = skewness_vals
) %>% arrange(desc(skewness))

ggplot(skew_df, aes(x = skewness, y = reorder(feature, skewness))) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_vline(xintercept = skew_threshold, color = "red", linetype = "dashed") +
  geom_vline(xintercept = -skew_threshold, color = "red", linetype = "dashed") +
  geom_vline(xintercept = 0, color = "black", linetype = "solid") +
  labs(title = "Figure 1: Feature Skewness Ranking",
       x = "Skewness Coefficient",
       y = "Feature") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

# Apply Yeo-Johnson transformation
features_transformed <- features
for(feat in skew_data) {
  yn_obj <- yeojohnson(features[[feat]], standardize = FALSE, eps = 1e-8)
  features_transformed[[feat]] <- predict(yn_obj)}
# Standardization
features_scaled <- as.data.frame(scale(features_transformed))

# Apply PCA
pca_result <- prcomp(features_scaled, center = FALSE, scale. = FALSE)

# Calculate explained variance
explained_var <- summary(pca_result)$importance[2, ]
cumulative_var <- cumsum(explained_var)

# Find number of components for 65% variance
n_components_65 <- which(cumulative_var >= 0.65)[1]

# Extract PCA components
feature_pca <- as.data.frame(pca_result$x[, 1:n_components_65])

# Visualize explained variance
eig.val <- get_eigenvalue(pca_result)
eig.val.top10 <- head(eig.val, 10)
eig.val.top10$PC_Label <- factor(1:nrow(eig.val.top10), levels = 1:nrow(eig.val.top10))

ggplot(eig.val.top10, aes(x = PC_Label)) +
  geom_bar(aes(y = variance.percent), stat = "identity", fill = "steelblue", alpha = 0.7) +
  geom_line(aes(y = cumulative.variance.percent, group = 1), color = "red", size = 1.2) +
  geom_point(aes(y = cumulative.variance.percent), color = "red", size = 2) +
  geom_text(aes(y = variance.percent, label = round(variance.percent, 1)), vjust = -0.5, size = 3) +
  geom_text(aes(y = cumulative.variance.percent, label = round(cumulative.variance.percent, 1)), vjust = -0.5, color = "red", size = 3) +
  scale_y_continuous(limits = c(0, 105), breaks = seq(0, 100, 20)) +
  labs(title = "Figure 2: Scree Plot & Cumulative Variance",
       subtitle = "Cumulative Variance Explained %",
       x = "Principal Components",
       y = "Percentage of Variance Explained") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Extract 3 and 4 components
feature_pca_3 <- as.data.frame(pca_result$x[, 1:3])
feature_pca_4 <- as.data.frame(pca_result$x[, 1:4])

# Test K-Means with both
km_3 <- kmeans(feature_pca_3, centers = 4, nstart = 10)
km_4 <- kmeans(feature_pca_4, centers = 4, nstart = 10)

# Calculate metrics for 3 components
dist_3 <- dist(feature_pca_3)
sil_3 <- mean(silhouette(km_3$cluster, dist_3)[, 3])
db_3 <- index.DB(feature_pca_3, km_3$cluster)$DB
ch_3 <- calinhara(feature_pca_3, km_3$cluster)

# Calculate metrics for 4 components
dist_4 <- dist(feature_pca_4)
sil_4 <- mean(silhouette(km_4$cluster, dist_4)[, 3])
db_4 <- index.DB(feature_pca_4, km_4$cluster)$DB
ch_4 <- calinhara(feature_pca_4, km_4$cluster)

# Compare results
comparison_3vs4 <- data.frame(
  Components = c(3, 4),
  Variance_Explained = c(67.4, 74.8),
  Silhouette = round(c(sil_3, sil_4), 4),
  Davies_Bouldin = round(c(db_3, db_4), 4),
  Calinski_Harabasz = round(c(ch_3, ch_4), 4),
  stringsAsFactors = FALSE
)
knitr::kable(comparison_3vs4, caption = "3 vs 4 components comparison", digits = 2)

plot_3 <- fviz_cluster(km_3,
                       data = feature_pca_3,
                       geom = "point",
                       ellipse.type = "convex",
                       palette = "Set2",
                       ggtheme = theme_minimal(),
                       main = "Figure 3.1: 3 Components (67.4% variance)",
                       xlab = "PC1", ylab = "PC2") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot_3)

plot_4 <- fviz_cluster(km_4,
                       data = feature_pca_4,
                       geom = "point",
                       ellipse.type = "convex",
                       palette = "Set2",
                       ggtheme = theme_minimal(),
                       main = "Figure 3.2: 4 Components (74.8% variance)",
                       xlab = "PC1", ylab = "PC2") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot_4)

fviz_pca_var(pca_result, col.var="steelblue", title = 'Figure 4: PCA Biplot of Variable loadings')

get_clust_tendency(features_scaled, n = 2, graph = TRUE,
                   gradient = list(low="red", mid="white", high="blue"),
                   seed = 123)

data_for_nbclust <- as.matrix(feature_pca)
create_index_plot <- function(nb_result, title, y_label, higher_better = TRUE) {
  idx_df <- data.frame(
    k = as.numeric(names(nb_result$All.index)),
    value = as.numeric(nb_result$All.index)
  )
  
  y_pos <- if(higher_better) max(idx_df$value, na.rm = TRUE) else min(idx_df$value, na.rm = TRUE)
  v_just <- if(higher_better) -0.5 else 1.5
  
  ggplot(idx_df, aes(x = k, y = value)) +
    geom_line(linewidth = 1, color = "steelblue") +
    geom_point(size = 3, color = "steelblue") +
    geom_vline(xintercept = nb_result$Best.nc[1], 
               color = "red", linetype = "dashed", linewidth = 1) +
    annotate("text", x = nb_result$Best.nc[1], y = y_pos,
             label = paste("k =", nb_result$Best.nc[1]),
             color = "red", fontface = "bold", vjust = v_just) +
    labs(title = title, x = "Number of Clusters (k)", y = y_label) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))
}

# Run NbClust
ch_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete", 
                     index = "ch", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
sil_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete", 
                      index = "silhouette", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
db_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete", 
                     index = "db", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))
gap_result <- NbClust(data_for_nbclust, distance = "euclidean", method = "complete", 
                      index = "gap", min.nc = 2, max.nc = min(10, nrow(data_for_nbclust) - 1))

# Create plots
p1 <- create_index_plot(ch_result, "Figure 5.1: Calinski-Harabasz Index vs k (Higher is Better)", 
                        "Calinski-Harabasz Index", higher_better = TRUE)
p2 <- create_index_plot(sil_result, "Figure 5.2: Silhouette Index vs k (Higher is Better)", 
                        "Silhouette Index", higher_better = TRUE)
p3 <- create_index_plot(db_result, "Figure 5.3: Davies-Bouldin Index vs k (Lower is Better)", 
                        "Davies-Bouldin Index", higher_better = FALSE)
p4 <- create_index_plot(gap_result, "Figure 5.4: Gap Statistic vs k (Higher is Better)", 
                        "Gap Statistic", higher_better = TRUE)

print(p1)
print(p2)
print(p3)
print(p4)

dist_eucl <- dist(feature_pca, method = "euclidean")
dist_manh <- dist(feature_pca, method = "manhattan")
cor_matrix <- cor(t(feature_pca))
dist_cos <- as.dist(1 - cor_matrix)

km_final <- kmeans(feature_pca, centers = 4, nstart = 10, iter.max = 300)

sil_obj <- silhouette(km_final$cluster, dist(feature_pca))
mean_sil <- mean(sil_obj[, 3])
db_obj <- index.DB(feature_pca, km_final$cluster)
db_score <- db_obj$DB
ch_score <- calinhara(feature_pca, km_final$cluster)

metrics_df <- data.frame(
  Metric = c("Silhouette Score (Avg)", "Davies-Bouldin Index", "Calinski-Harabasz Index"),
  Value = round(c(mean_sil, db_score, ch_score), 4)
)
knitr::kable(metrics_df, caption = "K-Means Clustering", digits = 3)

fviz_cluster(km_final,
             data = feature_pca,
             geom = "point",
             ellipse.type = "convex",
             ggtheme = theme_minimal(),
             main = "Figure 6: K-Means Clustering (k=4)",
             xlab = "PC1",
             ylab = "PC2") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

pam_eucl <- pam(dist_eucl, k = 4, diss = TRUE)
pam_manh <- pam(dist_manh, k = 4, diss = TRUE)
pam_cos  <- pam(dist_cos, k = 4, diss = TRUE)

# PAM + Euclidean
sil_eucl <- mean(silhouette(pam_eucl$clustering, dist_eucl)[, 3])
db_eucl  <- index.DB(feature_pca, pam_eucl$clustering)$DB
med_eucl <- paste(tickers[pam_eucl$medoids], collapse=", ")
ch_eucl <- calinhara(feature_pca, pam_eucl$cluster)
# PAM + Manhattan
sil_manh <- mean(silhouette(pam_manh$clustering, dist_manh)[, 3])
db_manh  <- index.DB(feature_pca, pam_manh$clustering)$DB 
med_manh <- paste(tickers[pam_manh$medoids], collapse=", ")
ch_manh <- calinhara(feature_pca, pam_manh$cluster)
# PAM + Cosine
sil_cos  <- mean(silhouette(pam_cos$clustering, dist_cos)[, 3])
db_cos   <- index.DB(feature_pca, pam_cos$clustering)$DB
med_cos  <- paste(tickers[pam_cos$medoids], collapse=", ")
ch_cos <- calinhara(feature_pca, pam_cos$cluster)

pam_results <- data.frame(
  Metric = c("Euclidean", "Manhattan", "Cosine"),
  Silhouette = c(sil_eucl, sil_manh, sil_cos),
  Davies_Bouldin = c(db_eucl, db_manh, db_cos),
  Calinski_Harabasz = c(ch_eucl, ch_manh, ch_cos),
  Medoids = c(med_eucl, med_manh, med_cos)
)

knitr::kable(pam_results, caption = "PAM Clustering Performance", digits = 3)
plot_pam <- function(model, title) {
  plot_df <- as.data.frame(feature_pca)
  plot_df$cluster <- as.factor(model)
  col_names <- names(plot_df)[1:2]
  ggplot(plot_df, aes_string(x = col_names[1], y = col_names[2], color = "cluster")) +
    geom_point(size = 3, alpha = 1) +
    stat_ellipse(type = "norm", level = 0.95, linetype = 2) +
    labs(title = title) +
    theme_minimal() +
    theme(
      legend.position = "none",
      plot.title = element_text(size = 10, face = "bold")
    )
}

p1 <- plot_pam(pam_eucl$clustering, paste("Figure 7.1: Euclidean (Sil:", round(sil_eucl, 2), ")"))
print(p1)
p2 <- plot_pam(pam_manh$clustering, paste("Figure 7.2: Manhattan (Sil:", round(sil_manh, 2), ")"))
print(p2)
p3 <- plot_pam(pam_cos$clustering,  paste("Figure 7.3: Cosine (Sil:", round(sil_cos, 2), ")"))
print(p3)

# Euclidean + Ward
hc_eucl_ward_tree <- hclust(dist_eucl, method = "ward.D2")
hc_eucl_ward_cl   <- cutree(hc_eucl_ward_tree, k = 4)
sil_eucl_ward     <- mean(silhouette(hc_eucl_ward_cl, dist_eucl)[, 3])
coph_eucl_ward    <- cor(dist_eucl, cophenetic(hc_eucl_ward_tree))
# Euclidean + Complete
hc_eucl_comp_tree <- hclust(dist_eucl, method = "complete")
hc_eucl_comp_cl   <- cutree(hc_eucl_comp_tree, k = 4)
sil_eucl_comp     <- mean(silhouette(hc_eucl_comp_cl, dist_eucl)[, 3])
coph_eucl_comp    <- cor(dist_eucl, cophenetic(hc_eucl_comp_tree))
# Euclidean + Average
hc_eucl_avg_tree  <- hclust(dist_eucl, method = "average")
hc_eucl_avg_cl    <- cutree(hc_eucl_avg_tree, k = 4)
sil_eucl_avg      <- mean(silhouette(hc_eucl_avg_cl, dist_eucl)[, 3])
coph_eucl_avg     <- cor(dist_eucl, cophenetic(hc_eucl_avg_tree))
# Manhattan + Complete
hc_manh_comp_tree <- hclust(dist_manh, method = "complete")
hc_manh_comp_cl   <- cutree(hc_manh_comp_tree, k = 4)
sil_manh_comp     <- mean(silhouette(hc_manh_comp_cl, dist_manh)[, 3])
coph_manh_comp    <- cor(dist_manh, cophenetic(hc_manh_comp_tree))
# Manhattan + Average
hc_manh_avg_tree  <- hclust(dist_manh, method = "average")
hc_manh_avg_cl    <- cutree(hc_manh_avg_tree, k = 4)
sil_manh_avg      <- mean(silhouette(hc_manh_avg_cl, dist_manh)[, 3])
coph_manh_avg     <- cor(dist_manh, cophenetic(hc_manh_avg_tree))
# Cosine + Complete
hc_cos_comp_tree  <- hclust(dist_cos, method = "complete")
hc_cos_comp_cl    <- cutree(hc_cos_comp_tree, k = 4)
sil_cos_comp      <- mean(silhouette(hc_cos_comp_cl, dist_cos)[, 3])
coph_cos_comp     <- cor(dist_cos, cophenetic(hc_cos_comp_tree))
# Cosine + Average
hc_cos_avg_tree   <- hclust(dist_cos, method = "average")
hc_cos_avg_cl     <- cutree(hc_cos_avg_tree, k = 4)
sil_cos_avg       <- mean(silhouette(hc_cos_avg_cl, dist_cos)[, 3])
coph_cos_avg      <- cor(dist_cos, cophenetic(hc_cos_avg_tree))

hier_results <- data.frame(
  Distance = c(rep("Euclidean", 3), rep("Manhattan", 2), rep("Cosine", 2)),
  Linkage  = c("Ward.D2", "Complete", "Average", "Complete", "Average", "Complete", "Average"),
  Silhouette = c(sil_eucl_ward, sil_eucl_comp, sil_eucl_avg, 
                 sil_manh_comp, sil_manh_avg, 
                 sil_cos_comp, sil_cos_avg),
  Cophenetic_Corr = c(coph_eucl_ward, coph_eucl_comp, coph_eucl_avg, 
                      coph_manh_comp, coph_manh_avg, 
                      coph_cos_comp, coph_cos_avg)
)

hier_results <- hier_results[order(-hier_results$Cophenetic_Corr), ]
knitr::kable(hier_results, caption = "Hierarchical Clustering Performance", digits = 3)

p1 <- fviz_dend(hc_eucl_ward_tree, k=4, show_labels=FALSE, main="Figure 8.1: Euclidean + Ward", rect=TRUE)
print(p1)
p2 <- fviz_dend(hc_eucl_comp_tree,  k=4, show_labels=FALSE, main="Figure 8.2: Euclidean + Complete", rect=TRUE)
print(p2)
p3 <- fviz_dend(hc_eucl_avg_tree,  k=4, show_labels=FALSE, main="Figure 8.3: Euclidean + Average", rect=TRUE)
print(p3)
p4 <- fviz_dend(hc_manh_comp_tree,   k=4, show_labels=FALSE, main="Figure 8.4: Manhattan + Complete", rect=TRUE)
print(p4)
p5 <- fviz_dend(hc_manh_avg_tree,   k=4, show_labels=FALSE, main="Figure 8.5: Manhattan + Average", rect=TRUE)
print(p5)
p6 <- fviz_dend(hc_cos_comp_tree,   k=4, show_labels=FALSE, main="Figure 8.6: Cosine + Complete", rect=TRUE)
print(p6)
p7 <- fviz_dend(hc_cos_avg_tree,   k=4, show_labels=FALSE, main="Figure 8.7: Cosine + Average", rect=TRUE)
print(p7)

compute_metrics <- function(clusters, data, dist_matrix = NULL) {
  
  sil <- mean(silhouette(clusters, dist_matrix)[, 3])
  db <- index.DB(data, clusters)$DB
  ch <- calinhara(data, clusters)
  
  return(list(
    silhouette = sil,
    davies_bouldin = db,
    calinski_harabasz = ch
  ))
}

# Collect all clustering results
all_clusterings <- list(
  "K-Means" = km_final$cluster,
  "PAM_Euclidean" = pam_eucl$clustering,
  "PAM_Manhattan" = pam_manh$clustering,
  "PAM_Cosine" = pam_cos$clustering,
  "HC_Euclidean_Ward" = hc_eucl_ward_cl,
  "HC_Euclidean_Complete" = hc_eucl_comp_cl,
  "HC_Euclidean_Average" = hc_eucl_avg_cl,
  "HC_Manhattan_Complete" = hc_manh_comp_cl,
  "HC_Manhattan_Average" = hc_manh_avg_cl,
  "HC_Cosine_Complete" = hc_cos_comp_cl,
  "HC_Cosine_Average" = hc_cos_avg_cl
)

# Compute metrics for each method
metrics_list <- list()

for (method_name in names(all_clusterings)) {
  clusters <- all_clusterings[[method_name]]
  
  # Determine which distance matrix to use
  if (grepl("Euclidean", method_name) || method_name == "K-Means") {
    dist_used <- dist_eucl
  } else if (grepl("Manhattan", method_name)) {
    dist_used <- dist_manh
  } else if (grepl("Cosine", method_name)) {
    dist_used <- dist_cos
  } else {
    dist_used <- dist_eucl
  }
  
  metrics_list[[method_name]] <- compute_metrics(clusters, feature_pca, dist_used)
}

# Create summary table
metrics_summary <- data.frame(
  Silhouette = round(sapply(metrics_list, function(x) x$silhouette), 4),
  Davies_Bouldin = round(sapply(metrics_list, function(x) x$davies_bouldin), 4),
  Calinski_Harabasz = round(sapply(metrics_list, function(x) x$calinski_harabasz), 2)
)

# Sort by Silhouette score (descending)
metrics_summary <- metrics_summary[order(-metrics_summary$Silhouette), ]
knitr::kable(metrics_summary, caption = "Clustering Comparison by Silhouette Score", digits = 2)

# Compute pairwise ARI matrix
n_methods <- length(all_clusterings)
ari_matrix <- matrix(0, nrow = n_methods, ncol = n_methods)
rownames(ari_matrix) <- names(all_clusterings)
colnames(ari_matrix) <- names(all_clusterings)

for (i in 1:n_methods) {
  for (j in 1:n_methods) {
    if (i == j) {
      ari_matrix[i, j] <- 1.0
    } else {
      ari_matrix[i, j] <- adjustedRandIndex(
        all_clusterings[[i]], 
        all_clusterings[[j]]
      )
    }
  }
}

# Visualize ARI matrix as heatmap
corrplot(ari_matrix, 
         method = "color", 
         type = "upper",
         is.corr = FALSE,
         tl.cex = 0.7,
         tl.col = "black",
         addCoef.col = "black",
         number.cex = 0.6,
         title = "Figure 9: Pairwise Adjusted Rand Index (ARI) Matrix",
         mar = c(0, 0, 2, 0))

# Cluster Visualization with ggrepel
pca_viz_df <- data.frame(feature_pca)
pca_viz_df$Cluster <- as.factor(hc_cos_avg_cl)
pca_viz_df$Ticker <- tickers

plot_cluster_highlight <- function(target_cluster, title_text, color_code) {
  
  ggplot(pca_viz_df, aes(x = PC1, y = PC2)) +
    geom_point(data = subset(pca_viz_df, Cluster != target_cluster),
               color = "grey85", alpha = 0.6, size = 2) +
    
    geom_point(data = subset(pca_viz_df, Cluster == target_cluster),
               color = color_code, size = 3.5) +
    
    geom_text_repel(data = subset(pca_viz_df, Cluster == target_cluster),
                    aes(label = Ticker), 
                    color = "black", 
                    box.padding = 0.4, 
                    point.padding = 0.3,
                    max.overlaps = Inf,
                    size = 3) +
    
    labs(title = title_text,
         subtitle = "Cluster vs. Broad Market",
         x = "PC1 (Risk / Market Beta)", 
         y = "PC2 (Momentum)") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", size = 11),
          panel.grid.minor = element_blank())
}

# Plot Clusters
plot_cluster_highlight(target_cluster = "1", 
                       title_text = "Figure 10.1: Cluster 1 - The Market Leaders", 
                       color_code = "#2E9FDF")

plot_cluster_highlight(target_cluster = "2", 
                       title_text = "Figure 10.2: Cluster 2 - The Safety Net", 
                       color_code = "#00AFBB")

plot_cluster_highlight(target_cluster = "3", 
                       title_text = "Figure 10.3: Cluster 3 - The High Flyers", 
                       color_code = "#FC4E07")

plot_cluster_highlight(target_cluster = "4", 
                       title_text = "Figure 10.4: Cluster 4 - The Sector Movers", 
                       color_code = "#E7B800")

# Silhouette Width Analysis
sil_values <- silhouette(hc_cos_avg_cl, dist_cos)

# Identify negative silhouette samples (Potential Outliers)
neg_sil_indices <- which(sil_values[, 3] < 0)
print(tickers[neg_sil_indices])

fviz_silhouette(sil_values, palette = "jco", print.summary = FALSE) +
  labs(title = "Figure 11: Silhouette Analysis: Hierarchical Clustering (Cosine + Average)",
       subtitle = paste("Average Silhouette Width =", round(mean(sil_values[, 3]), 2))) +
  theme_minimal() +
  theme(axis.text.x = element_blank(), 
        plot.title = element_text(hjust = 0.5, face="bold"),
        plot.subtitle = element_text(hjust = 0.5))

# Prepare discrete data
df_rules <- features_scaled
df_rules$Cluster <- as.factor(paste0("Cluster_", hc_cos_avg_cl))

# Discretize all numeric features into 3 levels
df_discrete <- as.data.frame(lapply(df_rules, function(x) {
  if(is.numeric(x)) {
    # Check if the variable has enough unique values
    if(length(unique(x)) < 3) {
      return(as.factor(x)) # Return as is if it's essentially a constant
    }
    tryCatch({
      discretize(x, method = "cluster", breaks = 3, 
                 labels = c("Low", "Med", "High"))
    }, error = function(e) {
      # Fallback to 'interval' if clustering fails
      discretize(x, method = "interval", breaks = 3, 
                 labels = c("Low", "Med", "High"))
    })
  } else {
    return(x)
  }
}))

# Look for rules with at least 5% support and 80% confidence
rules_cluster_1 <- apriori(df_discrete, 
                           parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
                           appearance = list(rhs = "Cluster=Cluster_1", default = "lhs"),
                           control = list(verbose = FALSE))

rules1_sorted <- sort(rules_cluster_1, by = "lift")

rules_cluster_2 <- apriori(df_discrete, 
                           parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
                           appearance = list(rhs = "Cluster=Cluster_2", default = "lhs"),
                           control = list(verbose = FALSE))

rules2_sorted <- sort(rules_cluster_2, by = "lift")

rules_cluster_3 <- apriori(df_discrete, 
                           parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
                           appearance = list(rhs = "Cluster=Cluster_3", default = "lhs"),
                           control = list(verbose = FALSE))

rules3_sorted <- sort(rules_cluster_3, by = "lift")

rules_cluster_4 <- apriori(df_discrete, 
                           parameter = list(supp = 0.05, conf = 0.8, minlen = 2),
                           appearance = list(rhs = "Cluster=Cluster_4", default = "lhs"),
                           control = list(verbose = FALSE))

rules4_sorted <- sort(rules_cluster_4, by = "lift")

df_r1 <- DATAFRAME(head(rules1_sorted, 3), separate = TRUE)
df_r2 <- DATAFRAME(head(rules2_sorted, 3), separate = TRUE)
df_r3 <- DATAFRAME(head(rules3_sorted, 3), separate = TRUE)
df_r4 <- DATAFRAME(head(rules4_sorted, 3), separate = TRUE)

combined_rules <- rbind(df_r1, df_r2, df_r3, df_r4)

final_rules_table <- combined_rules %>%
  dplyr::select(LHS, RHS, support, confidence, lift, count) %>%
  dplyr::mutate(
    LHS = gsub("[{}]", "", LHS),
    RHS = gsub("Cluster=Cluster_", "Cluster ", RHS)
  )

knitr::kable(final_rules_table, 
             caption = "Summary of Top Association Rules Across All Behavioral Clusters",
             digits = 3,
             col.names = c("Antecedent (Behavioral Features)", "Consequent (Target Cluster)", 
                           "Support", "Confidence", "Lift", "N"))

Comparative Analysis of Partitioning and Hierarchical Clustering Methods for Financial Market Segmentation

Phuong Giang Tran

2025-12-03