Introduction

In the finance industry, it is well established that stock price behavior is driven by multiple factors. This study focuses specifically on the sector component of listed companies.

A stock's sector is among the primary determinants of its price dynamics. For instance, during periods of economic crisis, technology stocks (whose growth largely relies on financing and a favorable economic cycle) can show high price volatility, whereas utilities and healthcare services tend to be more resilient because demand for their services remains stable regardless of economic conditions. Professionals often use these sector characteristics to construct diversified portfolios or to develop sector-oriented trading strategies.

Several classification frameworks have been proposed. Morningstar®, for example, offers its own freely accessible classification, which assigns each company to one of 11 sectors. The assignment is based on the company's revenue breakdown, in other words, the industry from which most of its revenue comes. No information from the stock's price time series enters this classification framework.

The goal of this study is to validate and challenge the classification proposed by Morningstar® using a PCA and clustering approach applied to a sample of price time series of stocks listed in Tokyo. While PCA is generally used for dimensionality reduction, here it can also extract market factors and show how stocks react to them over a given period. From there, we cluster the stocks using the PCA output, specifically the sensitivity (loading) of each stock to these PCs. After removing the least important PCs, we cluster the stocks into 11 unlabeled categories, check the quality of the clustering with several metrics, and see how well it aligns with the corporate classification.

We will also experiment with two different approaches: using the full period without any manipulation, and applying a method found in another scientific paper (Costa, 2025) which subdivides the period in order to counter overfitting.

Libraries and Dataset Loading

#LOADING USEFUL LIBRARIES
library(corrplot)
library(knitr)
library(cluster)
library(ggplot2)
library(reshape2)
library(factoextra)
library(clusterSim)
library(mclust)
#SET WORKING DIRECTORY
setwd("/Users/lukadecaters/Desktop/FTSE")

df_japan_full <- read.csv("df_japan_full", stringsAsFactors=FALSE)
df_japan_full$Date <- as.Date(df_japan_full$Date)
df_japan_classified <- read.csv("df_japan_full_classified.csv", stringsAsFactors=FALSE)

Data

For our study, we use Japanese stocks from the Nikkei 225 index, which tracks 225 of the largest and most traded companies listed on the Tokyo Stock Exchange. The period that we analyze extends from the year 2000 until the end of 2025. Since the Nikkei includes stocks that were introduced on the stock exchange only after 2000, we simply remove those stocks to keep only the ones listed throughout the studied period.

Afterwards, we select the daily closing price of each stock and compute the daily log-return (a standard practice in financial research thanks to the properties of logarithms). We also remove stocks that lacked liquidity at some point during the studied period: concretely, we delete stocks that show a log-return of 0 for 10 consecutive days. Those stocks were probably of very small importance at the beginning of the period and were rarely traded (lack of liquidity). Keeping them would create complications, since long runs of zero returns distort the correlation structure on which PCA is based.
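The actual preprocessing was done in Python; the following R sketch only illustrates the logic under hypothetical names (prices is an assumed data frame of daily closing prices with a Date column, and has_zero_run and df_clean are illustrative names, not the exact objects used to build df_japan_full).

#DAILY LOG-RETURN: r_t = log(P_t / P_(t-1))
log_returns <- as.data.frame(lapply(prices[, setdiff(names(prices), "Date")],
                                    function(p) diff(log(p))))
log_returns <- cbind(Date = prices$Date[-1], log_returns)

#FLAG STOCKS SHOWING A LOG-RETURN OF 0 FOR AT LEAST 10 CONSECUTIVE DAYS (ILLIQUIDITY PROXY)
has_zero_run <- function(r, run_length = 10) {
  runs <- rle(r == 0)
  any(runs$values & runs$lengths >= run_length, na.rm = TRUE)}

illiquid <- sapply(log_returns[, setdiff(names(log_returns), "Date")], has_zero_run)

#KEEP ONLY THE LIQUID STOCKS
df_clean <- log_returns[, c("Date", names(illiquid)[!illiquid])]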

All this information was extracted with the yfinance library in Python, an API that connects to the data of Yahoo Finance, a widely used platform providing historical stock market data.

In the end we obtain the log-returns of 153 stocks over 25 years (2000-2025), which should be sufficient to apply our methodology.

Below, all the tickers used in our study are displayed. Tickers should be seen as keys that identify stocks without writing out the full company name.

Data description

#DISPLAY TICKERS
setdiff(names(df_japan_full), "Date")
##   [1] "X1332.T" "X1605.T" "X1801.T" "X1802.T" "X1803.T" "X1812.T" "X1925.T"
##   [8] "X1928.T" "X1963.T" "X2002.T" "X2282.T" "X2501.T" "X2502.T" "X2503.T"
##  [15] "X2768.T" "X2801.T" "X2802.T" "X2871.T" "X2914.T" "X3086.T" "X3382.T"
##  [22] "X3401.T" "X3402.T" "X3405.T" "X3407.T" "X3861.T" "X4004.T" "X4005.T"
##  [29] "X4021.T" "X4042.T" "X4061.T" "X4063.T" "X4151.T" "X4183.T" "X4188.T"
##  [36] "X4208.T" "X4452.T" "X4502.T" "X4503.T" "X4506.T" "X4507.T" "X4519.T"
##  [43] "X4523.T" "X4543.T" "X4568.T" "X4901.T" "X4902.T" "X4911.T" "X5101.T"
##  [50] "X5108.T" "X5201.T" "X5233.T" "X5301.T" "X5332.T" "X5333.T" "X5401.T"
##  [57] "X5406.T" "X5631.T" "X5706.T" "X5711.T" "X5713.T" "X5714.T" "X5801.T"
##  [64] "X5802.T" "X5803.T" "X6301.T" "X6302.T" "X6326.T" "X6361.T" "X6367.T"
##  [71] "X6471.T" "X6472.T" "X6473.T" "X6479.T" "X6501.T" "X6503.T" "X6504.T"
##  [78] "X6674.T" "X6701.T" "X6702.T" "X6752.T" "X6753.T" "X6758.T" "X6762.T"
##  [85] "X6770.T" "X6841.T" "X6857.T" "X6902.T" "X6952.T" "X6954.T" "X6971.T"
##  [92] "X6976.T" "X7004.T" "X7011.T" "X7013.T" "X7201.T" "X7202.T" "X7203.T"
##  [99] "X7205.T" "X7211.T" "X7261.T" "X7267.T" "X7269.T" "X7270.T" "X7731.T"
## [106] "X7733.T" "X7751.T" "X7752.T" "X7911.T" "X7912.T" "X7951.T" "X8001.T"
## [113] "X8002.T" "X8015.T" "X8031.T" "X8035.T" "X8053.T" "X8058.T" "X8233.T"
## [120] "X8252.T" "X8253.T" "X8267.T" "X8308.T" "X8316.T" "X8331.T" "X8601.T"
## [127] "X8604.T" "X8801.T" "X8802.T" "X8830.T" "X9001.T" "X9005.T" "X9007.T"
## [134] "X9008.T" "X9009.T" "X9020.T" "X9021.T" "X9064.T" "X9104.T" "X9107.T"
## [141] "X9202.T" "X9432.T" "X9433.T" "X9501.T" "X9502.T" "X9503.T" "X9531.T"
## [148] "X9532.T" "X9602.T" "X9735.T" "X9766.T" "X9983.T" "X9984.T"
#SHOW PIECE OF THE DATAFRAME
kable(head(df_japan_full[1:10, 1:6]),
      digits = 4)
Date X1332.T X1605.T X1801.T X1802.T X1803.T
2000-01-06 0.0375 -0.0134 -0.0707 -0.0783 -0.0508
2000-01-07 0.0361 0.0067 0.0156 0.0499 0.0508
2000-01-11 -0.0059 0.0034 -0.0052 -0.0383 -0.0326
2000-01-12 -0.0180 -0.0135 -0.0157 0.0023 -0.0213
2000-01-13 0.0702 0.0688 0.0157 0.0159 0.0243
2000-01-14 0.0496 0.0584 0.0406 0.0592 0.0498
#SHOW THE DISTRIBUTION IN TERMS OF MORNINGSTAR CATEGORIES
counts <- table(df_japan_classified$Sector)
barplot(counts,
        main = "Distribution of categories",
        ylab = "Number of Stocks",
        col = "darkblue",
        border = "aliceblue",
        las=2,
        cex.names = 0.4)

This section presented a sample of the dataframe and all the tickers that we will treat. Additionally, we can see the distribution of categories: industrial stocks are highly represented. This is probably caused by a bias introduced by our data cleaning. Since we kept only stocks listed from 2000 until now, we likely filtered out recent companies that did not exist in 2000, whereas industrial companies are generally long-established firms.

Methodology

Computing the correlation matrix is the first step of our PCA.

#REMOVE DATE FROM THE ANALYSIS
df_japan_full <- df_japan_full[, setdiff(names(df_japan_full), "Date")]

#CONVERT ALL VALUES AS NUMERIC
df_japan_full[] <- lapply(df_japan_full, function(x) as.numeric(as.character(x)))

#CREATE CORR PLOT (PEARSON CORRELATION)
corr_mat <- cor(df_japan_full, method = "pearson")

#DISPLAY CORR PLOT
corrplot(corr_mat,
         method = "color",
         type   = "full",
         order  = "hclust",
         tl.cex = 0.25,
         tl.col = "black")

Now, we diagonalize the correlation matrix. This transformation decomposes the matrix into a set of uncorrelated directions. The process yields eigenvalues (which measure the amount of total variance explained by each factor) and eigenvectors, which are our PCs: linear combinations of stocks. Each weight (loading) indicates how much a stock contributes to the eigenvector (PC).

Each stock can now be seen as a combination of these principal components; from the other perspective, each principal component can be considered a combination of stocks with different weights. Each principal component explains a certain amount of variance in the dataset, and PCs are ranked by explained variance.

# DIAGONALIZE THE CORRELATION MATRIX TO EXTRACT EIGENVALUES AND EIGENVECTORS
pca_result <- prcomp(df_japan_full, scale. = TRUE)
values   <- pca_result$sdev^2
loadings <- pca_result$rotation
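
As a quick sanity check (a minimal sketch, not part of the original pipeline): since prcomp with scale. = TRUE works on standardized returns, its squared standard deviations should equal the eigenvalues of the Pearson correlation matrix computed earlier.

#DIAGONALIZE THE CORRELATION MATRIX DIRECTLY
eig <- eigen(corr_mat)

#EIGENVALUES FROM eigen() SHOULD MATCH prcomp()$sdev^2 UP TO NUMERICAL PRECISION
all.equal(eig$values, unname(pca_result$sdev^2))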

Here is an example with Toyota: \[ \text{Toyota (X7203.T)} = -0.0971 \cdot PC_1 + 0.039 \cdot PC_2 + 0.0298 \cdot PC_3 + \dots \]
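
These coefficients are simply Toyota's loadings on the first PCs; a minimal sketch to retrieve them from the prcomp output (signs may be flipped, since the sign of an eigenvector is arbitrary):

#EXTRACT TOYOTA'S LOADINGS ON THE FIRST THREE PCS
round(loadings["X7203.T", 1:3], 4)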

The following table displays the first 15 of the 153 PCs that we obtain. Each PC is associated with a certain amount of explained variance. Here, we can see that PC1 explains 39% of the variance. In this kind of study, PC1 is commonly interpreted as the "market factor" of the analysis.

# GENERATE AND VIEW THE TABLE
pca_table <- get_eig(pca_result)
head(pca_table, 15)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   59.702424       39.0211919                    39.02119
## Dim.2    6.314584        4.1271793                    43.14837
## Dim.3    3.394726        2.2187750                    45.36715
## Dim.4    2.358369        1.5414179                    46.90856
## Dim.5    2.046563        1.3376228                    48.24619
## Dim.6    1.896008        1.2392212                    49.48541
## Dim.7    1.870843        1.2227734                    50.70818
## Dim.8    1.595576        1.0428604                    51.75104
## Dim.9    1.553854        1.0155912                    52.76663
## Dim.10   1.286665        0.8409575                    53.60759
## Dim.11   1.240100        0.8105228                    54.41811
## Dim.12   1.218021        0.7960923                    55.21421
## Dim.13   1.145282        0.7485505                    55.96276
## Dim.14   1.061029        0.6934831                    56.65624
## Dim.15   1.046373        0.6839037                    57.34014
#SCREE PLOT
plot(1:30, values[1:30], type="b", main="Scree Plot (First 30 PCs) - Full Period",
     xlab="PC", ylab="Eigenvalue", col="darkblue", pch=19)
abline(h=1, col="red", lty=2, lwd=2)

This result already tells us interesting things. PCA applied to this kind of market data generally yields different results in the reference papers. While a moderately low PC1 still seems realistic, the subsequent PCs are extremely low compared to other studies performing similar analyses. We interpret this as a specificity of the Japanese stock market combined with the bias introduced by our stock selection. First, the Japanese stock market is less diversified and offers less variation than the more frequently analyzed American market. Second, since we only kept companies listed since the year 2000, some companies that could bring more diversity to the selection were filtered out.

We will now apply hierarchical agglomerative clustering using Ward's method. Ward's method is generally considered one of the best hierarchical approaches because it minimizes within-cluster variance. The difference with other agglomerative approaches is that merges are decided based on variance: at each step, the two clusters whose merger produces the smallest increase in within-cluster variance are joined.
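
In other words, at each step the algorithm merges the pair of clusters A and B (of sizes |A| and |B|, with centroids m_A and m_B) that minimizes the increase in total within-cluster sum of squares:

\[ \Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|} \, \lVert m_A - m_B \rVert^2 \]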

As PCA is meant to reduce the number of dimensions, we obviously have to decrease the number of PCs, which is currently 153. Several methods are possible; since we have many PCs, we use Kaiser's rule, which discards PCs whose eigenvalue is smaller than 1. Without this step, we would risk the curse of dimensionality. We could use cumulative variance as a threshold instead, but due to the specificity of our data, a cumulative threshold of, for example, 90% would retain too many PCs and defeat the objective of reducing dimensions. We end up with 15 PCs.

#KEEP PCS USING KAISER'S RULE
keep_pcs <- which(values > 1)

#HOW MANY PCS WE KEEP
length(keep_pcs)
## [1] 15
loadings_selected <- loadings[, keep_pcs]
#DO THE HIERARCHICAL CLUSTERING (EUCLIDIAN DISTANCE & WARD METHOD)
dist_matrix <- dist(loadings_selected, method = "euclidean")
hclust_results <- hclust(dist_matrix, method = "ward.D2")

One issue with our data is that taking the 25-year period as a whole could lead to over-fitting: aberrant market behavior (the 2008-2009 crisis and COVID) may completely dominate the dataset and negatively affect the analysis. This is why we imitate the framework used by Costa (2025) in his short paper. We divide the period into smaller sub-periods (12 in our case). For each sub-period, we independently perform PCA and apply Kaiser's rule to extract significant principal components, then compute Euclidean distances between stocks in their respective PC spaces. This produces 12 separate distance matrices, each capturing the co-movement structure during that specific time window. To aggregate these temporal snapshots into a single, robust clustering solution, we compute the harmonic mean of distances across all sub-periods for each pair of stocks.

This temporal aggregation approach yields a final distance matrix that reflects stable, persistent relationships rather than being dominated by extreme events in any single period.

Hierarchical clustering is then applied to this “multi-temporal” distance matrix, producing clusters that capture long-term market structure without being biased by transient crisis-driven correlations.

#CREATE FUNCTION TO SPLIT THE PERIOD
split_into_k <- function(d, k) {
  n <- nrow(d)
  split(d, cut(seq_len(n), breaks = k, labels = FALSE))}

#SPLIT 2
halves <- split_into_k(df_japan_full, 2)
df_half_1 <- halves[[1]]
df_half_2 <- halves[[2]]

#SPLIT 3
thirds <- split_into_k(df_japan_full, 3)
df_third_1 <- thirds[[1]]
df_third_2 <- thirds[[2]]
df_third_3 <- thirds[[3]]

#SPLIT 7
sevenths <- split_into_k(df_japan_full, 7)
df_seventh_1 <- sevenths[[1]]
df_seventh_2 <- sevenths[[2]]
df_seventh_3 <- sevenths[[3]]
df_seventh_4 <- sevenths[[4]]
df_seventh_5 <- sevenths[[5]]
df_seventh_6 <- sevenths[[6]]
df_seventh_7 <- sevenths[[7]]

#LIST OF PERIODS
df_list <- list(
  df_half_1, df_half_2,
  df_third_1, df_third_2, df_third_3,
  df_seventh_1, df_seventh_2, df_seventh_3, df_seventh_4, 
  df_seventh_5, df_seventh_6, df_seventh_7)

Here, we simply apply the method used on our 25-year frame, but repeated on each of the sub-periods.

stock_names <- colnames(df_half_1)[colnames(df_half_1) != "Date"]
n_stocks <- length(stock_names)
distance_list <- list()

for(i in 1:length(df_list)) {
  df_subperiod <- df_list[[i]][, setdiff(names(df_list[[i]]), "Date")]
  df_subperiod[] <- lapply(df_subperiod, function(x) as.numeric(as.character(x)))
  
  pca_result <- prcomp(df_subperiod, scale. = TRUE)
  
  keep_pcs <- which(pca_result$sdev^2 > 1)
  
  distance_list[[i]] <- as.matrix(dist(pca_result$rotation[, keep_pcs], method = "euclidean"))}

harmonic_dist_matrix <- matrix(0, nrow = n_stocks, ncol = n_stocks)
rownames(harmonic_dist_matrix) <- stock_names
colnames(harmonic_dist_matrix) <- stock_names

This part, however, is specific to the sub-periods procedure. We aggregate the 12 distance matrices into a single distance matrix: for each pair of stocks, we extract their distances across all 12 sub-periods and compute the harmonic mean of these values. An epsilon value is added to prevent division by zero in cases where two stocks have identical loadings. This aggregation produces a final 153x153 distance matrix, which is then used as input for hierarchical clustering, exactly as for the full period.

The harmonic mean is particularly suited for this aggregation as it dampens the influence of unusually large distances observed in a single sub-period.
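
For a pair of stocks (i, j) with distance d_k(i, j) in sub-period k (K = 12 sub-periods in total), the aggregated distance computed below is:

\[ d_{\mathrm{harm}}(i, j) = \frac{K}{\sum_{k=1}^{K} \dfrac{1}{d_k(i, j) + \varepsilon}} \]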

epsilon <- 1e-10

for(i in 1:n_stocks) {
  for(j in 1:n_stocks) {
    if(i != j) {
      distances_ij <- sapply(distance_list, function(mat) mat[i, j])
      harmonic_dist_matrix[i, j] <- length(distances_ij) / sum(1/(distances_ij + epsilon))}}} 

final_dist <- as.dist(harmonic_dist_matrix)

hclust_temporal <- hclust(final_dist, method = "ward.D2")

What would be the optimal number of clusters

This section uses different methods to assess what would be the best number of clusters for our dataset.

Optimal Clusters - WSS (Elbow) Method

The WSS (Elbow) method checks how compact the clusters are. It tries different numbers of clusters and plots the within-cluster sum of squares for each of them. The appropriate count lies where the steep descent of the curve ends and the plot forms a shape similar to an "elbow".
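
Concretely, for a partition into k clusters C_1, ..., C_k with centroids m_1, ..., m_k, the quantity plotted for each k is:

\[ W_k = \sum_{r=1}^{k} \sum_{x \in C_r} \lVert x - m_r \rVert^2 \]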

Full Period

fviz_nbclust(loadings_selected, hcut, method = "wss", k.max = 20)

Sub-Periods

fviz_nbclust(as.matrix(final_dist), hcut, method = "wss", k.max = 20)

Optimal Clusters - Silhouette Method

The silhouette measures whether each point is closer to the other points of its own cluster than to those of the nearest other cluster. Silhouette values lie in the interval [-1, 1]; the closer to 1, the better. On the graph, the highest point indicates the best number of clusters.
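
Formally, for a point i, with a(i) the mean distance to the other members of its own cluster and b(i) the mean distance to the members of the nearest other cluster:

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]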

Full Period

fviz_nbclust(loadings_selected, hcut, method = "silhouette", k.max = 20)

Sub-Periods

fviz_nbclust(as.matrix(final_dist), hcut, method = "silhouette", k.max = 20)

Optimal Clusters - Gap Statistic

The gap statistic measures how much better the clustering is compared to clustering random, uniformly distributed data. The optimal count is found where the graph reaches a peak.
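
Formally, the statistic compares the observed within-cluster dispersion W_k with its expected value under a uniform reference distribution (estimated here from the nboot bootstrap samples):

\[ \mathrm{Gap}(k) = \mathbb{E}^{*}\!\left[\log W_k^{*}\right] - \log W_k \]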

Full Period

fviz_nbclust(loadings_selected, hcut, method = "gap_stat", k.max = 20, nboot = 20)

Sub-Periods

fviz_nbclust(as.matrix(final_dist), hcut, method = "gap_stat", k.max = 20, nboot = 20)


Treat this section as an experimental one. The goal of our study is to validate the Morningstar classification, not to create a new one; doing so would lead us into a different research question (which could possibly be an interesting master's thesis subject).

We see that none of these metrics gives a clear result. This can be interpreted as data fuzziness: the data structure is unclear and hard to cluster and organize. However, a count of 11 categories does not seem aberrant; on the contrary, it looks like a good compromise between a realistic number of categories and clustering quality.

With WSS (Elbow), the optimal count is close to where the curve starts to have a less abrupt slope. With the silhouette, although counts around 15 or 16 score better, 11 remains realistic. The same goes for the gap statistic: although no count under 20 reaches a peak, the slope starts to decrease around 12-13.

If we compare sub-periods and full period, we see that the Elbow and gap statistic give the 11-cluster solution a better result under the sub-periods method. The silhouette does not differentiate much, but indicates that only 2 categories would be an excellent choice under the sub-periods methodology.

Dendrogram visualization

Full Period

#FULL PERIOD
plot(hclust_results,
     main = "Hierarchical clustering - Full Period",
     xlab = "Stocks",
     ylab = "Mean Distance",
     cex = 0.4)

Sub-Periods

#WITH SUBPERIODS
plot(hclust_temporal,
     main = "Hierarchical Clustering - With Sub Periods",
     xlab = "Stocks",
     ylab = "Mean Distance",
     cex = 0.4)


Results and validation

Composition of clusters barplot

Both methods give a distribution highly dominated by one cluster. This cluster probably corresponds to industrial stocks, as the exploratory data analysis showed that a large share of our stocks is classified in the Industrials category according to Morningstar®.

Full period

#FULL PERIOD
clusters_k_11 <- cutree(hclust_results, k = 11)
barplot(table(clusters_k_11),
        main = "Distribution of Stocks Across Clusters (Full period)",
        xlab = "Cluster",
        ylab = "Number of Stocks",
        col = "steelblue",
        border = "white")

Sub-Periods

#WITH SUBPERIODS
clusters_temporal_11 <- cutree(hclust_temporal, k = 11)
barplot(table(clusters_temporal_11),
        main = "Distribution of Stocks Across Clusters (With Sub periods)",
        xlab = "Cluster",
        ylab = "Number of Stocks",
        col = "steelblue",
        border = "white")


Silhouette plot

This result is interesting and gives us many insights. First, and perhaps surprisingly, not subdividing into sub-periods gives better silhouette results. The overall result is also rather poor, but the variation across clusters is very high! For example, cluster 3 is extremely well clustered in both cases, while cluster 2 displays catastrophic results. Many stocks, especially with the full-period approach, even appear to sit in the wrong cluster.

These results need further investigation, as identifying what each cluster represents may bring important information and tell us a lot about sector behavior in stock markets.

Full Period

#FULL PERIOD
sil_full_k_11 <- silhouette(clusters_k_11, dist_matrix)

fviz_silhouette(sil_full_k_11,
                palette = "viridis(11)",
                ggtheme = theme_minimal()) +
  labs(title = "Silhouette Analysis - Full Period",
       subtitle = paste("Average Silhouette Width:", round(mean(sil_full_k_11[, 3]), 3)))
##    cluster size ave.sil.width
## 1        1    9          0.31
## 2        2   24          0.07
## 3        3    4          0.88
## 4        4   13          0.08
## 5        5    5          0.26
## 6        6    8          0.32
## 7        7   55          0.15
## 8        8    8          0.50
## 9        9   14          0.35
## 10      10    8          0.53
## 11      11    5          0.45

Sub-Periods

#SUBPERIODS
sil_temporal_11 <- silhouette(clusters_temporal_11, final_dist)

fviz_silhouette(sil_temporal_11,
                palette = "viridis(11)",
                ggtheme = theme_minimal()) +
  labs(title = "Silhouette Analysis - Sub Periods",
       subtitle = paste("Average Silhouette Width:", round(mean(sil_temporal_11[, 3]), 3)))
##    cluster size ave.sil.width
## 1        1   12          0.15
## 2        2   55          0.04
## 3        3    4          0.78
## 4        4   14          0.16
## 5        5   10          0.06
## 6        6   12          0.33
## 7        7    8          0.38
## 8        8   16          0.26
## 9        9    9          0.25
## 10      10    8          0.46
## 11      11    5          0.40

Composition of clusters

stock_names <- colnames(df_japan_full)

Full Period

cluster_list_k_11 <- split(stock_names, clusters_k_11)
max_size_k_11 <- max(sapply(cluster_list_k_11, length)) 
cluster_df_k_11 <- data.frame( 
  lapply(cluster_list_k_11, function(x) {
    c(x, rep("", max_size_k_11 - length(x)))})) 
kable(cluster_df_k_11, 
      col.names = paste("Cluster", 1:11)) 
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9 Cluster 10 Cluster 11
X1332.T X1605.T X1801.T X1925.T X2501.T X3086.T X3401.T X4151.T X5101.T X9001.T X9501.T
X2002.T X1963.T X1802.T X1928.T X4183.T X3382.T X3402.T X4502.T X6471.T X9005.T X9502.T
X2282.T X2768.T X1803.T X8253.T X5108.T X4452.T X3405.T X4503.T X6472.T X9007.T X9503.T
X2502.T X5401.T X1812.T X8308.T X8001.T X4911.T X3407.T X4506.T X6473.T X9008.T X9531.T
X2503.T X5406.T X8316.T X9984.T X8233.T X3861.T X4507.T X6902.T X9009.T X9532.T
X2801.T X5631.T X8331.T X8252.T X4004.T X4519.T X7201.T X9020.T
X2802.T X5706.T X8601.T X8267.T X4005.T X4523.T X7202.T X9021.T
X2871.T X5711.T X8604.T X9983.T X4021.T X4568.T X7203.T X9202.T
X2914.T X5713.T X8801.T X4042.T X7205.T
X5714.T X8802.T X4061.T X7211.T
X5801.T X8830.T X4063.T X7261.T
X5803.T X9432.T X4188.T X7267.T
X6302.T X9433.T X4208.T X7269.T
X6674.T X4543.T X7270.T
X7004.T X4901.T
X7011.T X4902.T
X7013.T X5201.T
X8002.T X5233.T
X8015.T X5301.T
X8031.T X5332.T
X8053.T X5333.T
X8058.T X5802.T
X9104.T X6301.T
X9107.T X6326.T
X6361.T
X6367.T
X6479.T
X6501.T
X6503.T
X6504.T
X6701.T
X6702.T
X6752.T
X6753.T
X6758.T
X6762.T
X6770.T
X6841.T
X6857.T
X6952.T
X6954.T
X6971.T
X6976.T
X7731.T
X7733.T
X7751.T
X7752.T
X7911.T
X7912.T
X7951.T
X8035.T
X9064.T
X9602.T
X9735.T
X9766.T

Sub-Periods

cluster_list_temporal_11 <- split(stock_names, clusters_temporal_11) 
max_size_temporal_11 <- max(sapply(cluster_list_temporal_11, length)) 
cluster_df_temporal_11 <- data.frame( 
  lapply(cluster_list_temporal_11, function(x) {
    c(x, rep("", max_size_temporal_11 - length(x)))})) 
kable(cluster_df_temporal_11,
      col.names = paste("Cluster", 1:11))
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9 Cluster 10 Cluster 11
X1332.T X1605.T X1801.T X2768.T X3086.T X3401.T X4151.T X5101.T X8253.T X9001.T X9501.T
X2002.T X1925.T X1802.T X5401.T X3382.T X3402.T X4502.T X5108.T X8308.T X9005.T X9502.T
X2282.T X1928.T X1803.T X5406.T X8233.T X3405.T X4503.T X5802.T X8316.T X9007.T X9503.T
X2501.T X1963.T X1812.T X5706.T X8252.T X3407.T X4506.T X6471.T X8331.T X9008.T X9531.T
X2502.T X3861.T X5711.T X8267.T X4004.T X4507.T X6472.T X8601.T X9009.T X9532.T
X2503.T X4021.T X5713.T X9432.T X4005.T X4519.T X6473.T X8604.T X9020.T
X2801.T X4063.T X5714.T X9433.T X4042.T X4523.T X6902.T X8801.T X9021.T
X2802.T X4543.T X8001.T X9766.T X4061.T X4568.T X7201.T X8802.T X9202.T
X2871.T X4901.T X8002.T X9983.T X4183.T X7202.T X8830.T
X2914.T X4902.T X8031.T X9984.T X4188.T X7203.T
X4452.T X5201.T X8053.T X4208.T X7205.T
X4911.T X5233.T X8058.T X5301.T X7211.T
X5332.T X9104.T X7261.T
X5333.T X9107.T X7267.T
X5631.T X7269.T
X5801.T X7270.T
X5803.T
X6301.T
X6302.T
X6326.T
X6361.T
X6367.T
X6479.T
X6501.T
X6503.T
X6504.T
X6674.T
X6701.T
X6702.T
X6752.T
X6753.T
X6758.T
X6762.T
X6770.T
X6841.T
X6857.T
X6952.T
X6954.T
X6971.T
X6976.T
X7004.T
X7011.T
X7013.T
X7731.T
X7733.T
X7751.T
X7752.T
X7911.T
X7912.T
X7951.T
X8015.T
X8035.T
X9064.T
X9602.T
X9735.T

Confusion matrix

This matrix now allows us to identify what each of our generated clusters represents. It is very interesting to see that our third cluster, which was so well clustered according to the silhouette, is represented by a specific group of industrial stocks: Taisei, Obayashi, Shimizu and Kajima.

We also get very different results with the full-period and sub-period methodologies. With the full period, the Morningstar Industrials category is split between the second and seventh clusters, while with sub-periods the industrial stocks mainly go into the second cluster.

Based on the matrix, some sectors also seem to be more accurately recovered, such as Utilities, Healthcare and Consumer Defensive (with both methods).

We further analyze this confusion matrix using two other metrics, ARI and Purity.

Full period

#FULL PERIOD
confusion_matrix_11x11_full <- table(df_japan_classified$Sector, clusters_k_11)

conf_df_full <- as.data.frame(confusion_matrix_11x11_full)
colnames(conf_df_full) <- c("Sector", "Cluster", "Count")

ggplot(conf_df_full, aes(x = Cluster, y = Sector, fill = Count)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = Count), color = "black", size = 4, fontface = "bold") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Morningstar Sectors vs PCA Clusters - With Full Period",
       x = "Cluster",
       y = "Morningstar Sector",
       fill = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
        axis.text.y = element_text(size = 9),
        plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        panel.grid = element_blank())

Sub-Periods

#SUBPERIODS
confusion_matrix_11x11 <- table(df_japan_classified$Sector, clusters_temporal_11)

conf_df <- as.data.frame(confusion_matrix_11x11)
colnames(conf_df) <- c("Sector", "Cluster", "Count")

ggplot(conf_df, aes(x = Cluster, y = Sector, fill = Count)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = Count), color = "black", size = 4, fontface = "bold") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Morningstar Sectors vs PCA Clusters - With sub-periods",
       x = "Cluster",
       y = "Morningstar Sector",
       fill = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
        axis.text.y = element_text(size = 9),
        plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
        panel.grid = element_blank())


ARI

ARI measures pairwise agreement between the clustering result and the true sector labels, adjusted for chance. The result is better than 0, which would correspond to random assignment, but is still very low. Here, unlike the silhouette, the sub-periods method gets a slightly better score. The silhouette cares about variance within clusters, while ARI does not "see" what happens inside a cluster variance-wise.
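
In terms of the unadjusted Rand index RI, the correction for chance takes the standard form:

\[ \mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]} \]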

Full period

ari_full <- adjustedRandIndex(clusters_k_11, as.numeric(factor(df_japan_classified$Sector)))
cat("ARI - Full Period:", round(ari_full, 3), "\n")
## ARI - Full Period: 0.213

Sub-Periods

ari_sub <- adjustedRandIndex(clusters_temporal_11, as.numeric(factor(df_japan_classified$Sector)))
cat("ARI - Sub-periods:", round(ari_sub, 3), "\n")
## ARI - Sub-periods: 0.29

Purity

Purity is another metric based on the confusion matrix. Here, we simply check how "pure" the clusters are, i.e., the share of stocks belonging to the majority sector of their cluster, without caring how clusters map onto sectors. The score is fairly good, and we again get a better result with the sub-periods method. This may indicate that there are some clear categories "underneath" the ones we have from Morningstar®.
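
With N stocks, clusters C_1, ..., C_k and Morningstar classes T_1, ..., T_m, the function below computes:

\[ \text{Purity} = \frac{1}{N} \sum_{r=1}^{k} \max_{j} \, \lvert C_r \cap T_j \rvert \]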

purity <- function(clusters, classes) {
  tbl <- table(clusters, classes)
  sum(apply(tbl, 1, max)) / length(clusters)}

Full period

purity_full <- purity(clusters_k_11, df_japan_classified$Sector)
cat("Purity - Full Period:", round(purity_full, 3), "\n")
## Purity - Full Period: 0.621

Sub-Periods

purity_sub <- purity(clusters_temporal_11, df_japan_classified$Sector)
cat("Purity - Sub-periods:", round(purity_sub, 3), "\n")
## Purity - Sub-periods: 0.699

Conclusion

This study was rich in insights and information. When it comes to answering our research question, we can say that PCA was not optimal in our case for sector clustering. However, the algorithms we used proposed clusters that are often disregarded and not analyzed by professionals, especially those who work in financial markets and may simply rely on the sectors given by the GICS® or Morningstar® classification.

We could see that 4 companies from the Nikkei clustered extremely well compared to the others. These companies are simply categorized as Industrials by Morningstar®. However, they are actually part of the "Big 5 super general contractors", an oligopoly that dominates the construction market in Japan (Takenaka, the fifth one, is not a listed company). Knowing that these companies have very similar stock behavior could be a crucial insight for arbitrage-oriented traders. Some other sectors were also better clustered than others, such as Utilities, Healthcare and Consumer Defensive, showing that companies from those sectors display very similar patterns in their stock price evolution.

Of course, we cannot forget the limitations mentioned previously in the study, the main one being the selection bias caused by our data preparation. Some of them could be addressed fairly easily, allowing further research based on methodological improvements. Other approaches could also be tried to increase accuracy in sector prediction or to find other insights and even stronger "hidden categories", such as different clustering techniques or other PCA thresholds.

References

Costa, A. P. M. (2025). Sector analysis on the S&P 500: A PCA approach. FGV Invest Short Studies Series.

Yang, L., Rea, W., & Rea, A. (2017). Financial insights from the last few components of a stock market PCA. International Journal of Financial Studies, 5(3), 15.

Salvati, F. (2023). Innovative clustering approach for S&P 500 companies: Beyond sector boundaries and cyclical/non-cyclical portfolios.

Pasini, G. (2017). Principal component analysis for stock portfolio management. International Journal of Pure and Applied Mathematics, 115(1), 153-167.