In the finance industry, it is well established that stock price behavior is driven by multiple factors. This study focuses specifically on the sectoral component of listed companies.
A stock's sector is among the primary determinants of its price dynamics. For instance, during periods of economic crisis, technology stocks (whose growth largely relies on financing and a favorable economic cycle) can show high price volatility. Utilities and healthcare stocks, on the other hand, are more resilient, as demand for their services remains stable regardless of economic conditions. Professionals often exploit these sectoral characteristics to construct diversified financial portfolios or to develop sector-oriented trading strategies.
Several classification frameworks have been proposed. Morningstar®, for example, maintains a freely accessible classification that assigns each company to one of 11 sectors. The assignment is based on revenue shares, in other words, on the industry from which most of the company's revenue comes. No information from the stock's price time series enters this classification framework.
The goal of this study is to validate and challenge the classification proposed by Morningstar® using a PCA and clustering approach applied to a sample of stock price time series from stocks listed in Tokyo. While PCA is generally used for dimensionality reduction, here it can also extract market factors and show how stocks react to them over a given period. We can then cluster the stocks using the PCA output, specifically the sensitivity of each stock to these PCs. After removing the least important PCs, we cluster the stocks into 11 unlabeled categories, check the quality of the clustering with several metrics, and see how well it aligns with the corporate classification.
We also experiment with two different approaches: using the full period without any manipulation, and a method borrowed from another scientific paper (Costa, 2025) that subdivides the period in order to counter overfitting.
#LOADING USEFUL LIBRARIES
library(corrplot)
library(knitr)
library(cluster)
library(ggplot2)
library(reshape2)
library(factoextra)
library(clusterSim)
library(mclust)
#SET WORKING DIRECTORY
setwd("/Users/lukadecaters/Desktop/FTSE")
df_japan_full <- read.csv("df_japan_full", stringsAsFactors=FALSE)
df_japan_full$Date <- as.Date(df_japan_full$Date)
df_japan_classified <- read.csv("df_japan_full_classified.csv", stringsAsFactors=FALSE)
For this study, we use Japanese stocks from the Nikkei 225, an index that aggregates 225 of the largest Japanese listed companies. The period we analyze extends from the year 2000 to the end of 2025. Since the Nikkei includes stocks that were only introduced on the stock exchange after 2000, we simply remove those stocks and keep only those listed throughout the studied period.
Afterwards, we select the daily closing price of each stock and compute the daily log-return (log-returns are standard practice in financial research because, among other useful properties of logarithms, they are additive across periods). We also remove stocks that lacked liquidity at some point during the studied period: we simply delete stocks that show a log-return of 0 for 10 consecutive days. Those stocks were probably of very minor importance at the beginning of the period and were rarely traded (lack of liquidity). Keeping them would cause complications, since long runs of zero returns distort the correlation estimates on which PCA is based.
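As an illustration, here is a minimal R sketch of this preprocessing, assuming a hypothetical data frame df_prices with a Date column followed by one raw closing-price column per ticker (the actual extraction and cleaning were done in Python, as described below). The daily log-return is \(r_t = \ln(P_t / P_{t-1})\), and any ticker with a run of 10 or more consecutive zero returns is dropped.

#SKETCH OF THE PREPROCESSING (df_prices IS A HYPOTHETICAL PRICE DATA FRAME)
#DAILY LOG-RETURNS FROM RAW CLOSING PRICES
log_returns <- as.data.frame(lapply(df_prices[-1], function(p) diff(log(p))))
#LENGTH OF THE LONGEST RUN OF ZERO RETURNS IN ONE SERIES
max_zero_run <- function(r) {
  runs <- rle(r == 0)
  zero_runs <- runs$lengths[runs$values]
  if (length(zero_runs) == 0) 0 else max(zero_runs)}
#KEEP ONLY TICKERS WHOSE LONGEST ZERO-RETURN RUN IS SHORTER THAN 10 DAYS
log_returns <- log_returns[, sapply(log_returns, max_zero_run) < 10]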
All this information was extracted with the yfinance library in Python, an API that connects to the data of Yahoo Finance, a widely used stock screener providing decades of historical stock market data.
In the end, we obtain the log-returns of 153 stocks over 25 years (2000-2025), which should be sufficient to apply our methodology.
Below are displayed all the tickers used in our study. Tickers should be seen as keys that identify stocks in place of the full company name.
#DISPLAY TICKERS
setdiff(names(df_japan_full), "Date")
## [1] "X1332.T" "X1605.T" "X1801.T" "X1802.T" "X1803.T" "X1812.T" "X1925.T"
## [8] "X1928.T" "X1963.T" "X2002.T" "X2282.T" "X2501.T" "X2502.T" "X2503.T"
## [15] "X2768.T" "X2801.T" "X2802.T" "X2871.T" "X2914.T" "X3086.T" "X3382.T"
## [22] "X3401.T" "X3402.T" "X3405.T" "X3407.T" "X3861.T" "X4004.T" "X4005.T"
## [29] "X4021.T" "X4042.T" "X4061.T" "X4063.T" "X4151.T" "X4183.T" "X4188.T"
## [36] "X4208.T" "X4452.T" "X4502.T" "X4503.T" "X4506.T" "X4507.T" "X4519.T"
## [43] "X4523.T" "X4543.T" "X4568.T" "X4901.T" "X4902.T" "X4911.T" "X5101.T"
## [50] "X5108.T" "X5201.T" "X5233.T" "X5301.T" "X5332.T" "X5333.T" "X5401.T"
## [57] "X5406.T" "X5631.T" "X5706.T" "X5711.T" "X5713.T" "X5714.T" "X5801.T"
## [64] "X5802.T" "X5803.T" "X6301.T" "X6302.T" "X6326.T" "X6361.T" "X6367.T"
## [71] "X6471.T" "X6472.T" "X6473.T" "X6479.T" "X6501.T" "X6503.T" "X6504.T"
## [78] "X6674.T" "X6701.T" "X6702.T" "X6752.T" "X6753.T" "X6758.T" "X6762.T"
## [85] "X6770.T" "X6841.T" "X6857.T" "X6902.T" "X6952.T" "X6954.T" "X6971.T"
## [92] "X6976.T" "X7004.T" "X7011.T" "X7013.T" "X7201.T" "X7202.T" "X7203.T"
## [99] "X7205.T" "X7211.T" "X7261.T" "X7267.T" "X7269.T" "X7270.T" "X7731.T"
## [106] "X7733.T" "X7751.T" "X7752.T" "X7911.T" "X7912.T" "X7951.T" "X8001.T"
## [113] "X8002.T" "X8015.T" "X8031.T" "X8035.T" "X8053.T" "X8058.T" "X8233.T"
## [120] "X8252.T" "X8253.T" "X8267.T" "X8308.T" "X8316.T" "X8331.T" "X8601.T"
## [127] "X8604.T" "X8801.T" "X8802.T" "X8830.T" "X9001.T" "X9005.T" "X9007.T"
## [134] "X9008.T" "X9009.T" "X9020.T" "X9021.T" "X9064.T" "X9104.T" "X9107.T"
## [141] "X9202.T" "X9432.T" "X9433.T" "X9501.T" "X9502.T" "X9503.T" "X9531.T"
## [148] "X9532.T" "X9602.T" "X9735.T" "X9766.T" "X9983.T" "X9984.T"
#SHOW PIECE OF THE DATAFRAME
kable(head(df_japan_full[, 1:6]),
digits = 4)
| Date | X1332.T | X1605.T | X1801.T | X1802.T | X1803.T |
|---|---|---|---|---|---|
| 2000-01-06 | 0.0375 | -0.0134 | -0.0707 | -0.0783 | -0.0508 |
| 2000-01-07 | 0.0361 | 0.0067 | 0.0156 | 0.0499 | 0.0508 |
| 2000-01-11 | -0.0059 | 0.0034 | -0.0052 | -0.0383 | -0.0326 |
| 2000-01-12 | -0.0180 | -0.0135 | -0.0157 | 0.0023 | -0.0213 |
| 2000-01-13 | 0.0702 | 0.0688 | 0.0157 | 0.0159 | 0.0243 |
| 2000-01-14 | 0.0496 | 0.0584 | 0.0406 | 0.0592 | 0.0498 |
#SHOW THE DISTRIBUTION IN TERMS OF MORNINGSTAR CATEGORIES
counts <- table(df_japan_classified$Sector)
barplot(counts,
main = "Distribution of categories",
ylab = "Number of Stocks",
col = "darkblue",
border = "aliceblue",
las=2,
cex.names = 0.4)
This section presented a sample of the dataframe and all the tickers we will treat. Additionally, the distribution of categories shows a high representation of industrial stocks. This is probably a bias introduced by our data cleaning: since we kept only stocks listed from 2000 until now, we filtered out recent companies that did not exist in 2000, while industrial companies are generally well-established, older companies.
Computing the correlation matrix of the log-returns is the first step of our PCA.
#REMOVE DATE FROM THE ANALYSIS
df_japan_full <- df_japan_full[, setdiff(names(df_japan_full), "Date")]
#CONVERT ALL VALUES AS NUMERIC
df_japan_full[] <- lapply(df_japan_full, function(x) as.numeric(as.character(x)))
#CREATE CORR PLOT (PEARSON CORRELATION)
corr_mat <- cor(df_japan_full, method = "pearson")
#DISPLAY CORR PLOT
corrplot(corr_mat,
method = "color",
type = "full",
order = "hclust",
tl.cex = 0.25,
tl.col = "black")
Now, we must diagonalize the correlation matrix. This transformation decomposes the matrix into a structure of uncorrelated vectors. The process yields eigenvalues (which represent the proportion of total variance explained by each factor) and eigenvectors, which are our PCs: linear combinations of stocks. Each weight indicates how much a stock contributes to the eigenvector (PC).
Each stock can now be seen as a combination of those principal components; from another perspective, the principal components can be considered combinations of stocks with different weights. Each principal component explains a certain amount of the variance in our dataset, and PCs are ranked by explained variance.
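In matrix notation, writing \(C\) for the correlation matrix, this decomposition reads \[ C = V \Lambda V^{\top}, \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_{153}), \qquad \lambda_1 \ge \dots \ge \lambda_{153} \ge 0, \] where the columns of \(V\) are the eigenvectors (the PC loadings) and the \(\lambda_i\) are the eigenvalues.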
# DIAGONALIZE THE CORRELATION MATRIX TO EXTRACT EIGENVALUES AND EIGENVECTORS
pca_result <- prcomp(df_japan_full, scale. = TRUE)
values <- pca_result$sdev^2
loadings <- pca_result$rotation
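As a quick sanity check (a minimal sketch, not part of the original pipeline), prcomp() on the scaled data should reproduce, up to numerical tolerance and the arbitrary sign of each eigenvector, a direct eigendecomposition of the Pearson correlation matrix computed earlier:

#OPTIONAL CHECK: PRCOMP VS DIRECT EIGENDECOMPOSITION OF THE CORRELATION MATRIX
eig <- eigen(corr_mat)
all.equal(eig$values, values) #SAME EIGENVALUES
all.equal(abs(eig$vectors[, 1]), abs(unname(loadings[, 1]))) #PC1 MATCHES UP TO SIGN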
Here is an example with Toyota \[ \text{Toyota (X7203.T)} = -0.0971 \cdot PC_1 + 0.039 \cdot PC_2 + 0.0298 \cdot PC_3 + \dots \]
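These coefficients are simply Toyota's row of the loading matrix and can be retrieved directly (here for the first three PCs):

#TOYOTA'S LOADINGS ON THE FIRST THREE PCS
round(loadings["X7203.T", 1:3], 4)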
The following table displays the first 15 of the 153 PCs that we obtain. Each PC is associated with a certain amount of explained variance; here, we can see that PC1 explains 39% of the variance. In this kind of study, PC1 is generally interpreted as the "market factor" of the analysis.
# GENERATE AND VIEW THE TABLE
pca_table <- get_eig(pca_result)
head(pca_table, 15)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 59.702424 39.0211919 39.02119
## Dim.2 6.314584 4.1271793 43.14837
## Dim.3 3.394726 2.2187750 45.36715
## Dim.4 2.358369 1.5414179 46.90856
## Dim.5 2.046563 1.3376228 48.24619
## Dim.6 1.896008 1.2392212 49.48541
## Dim.7 1.870843 1.2227734 50.70818
## Dim.8 1.595576 1.0428604 51.75104
## Dim.9 1.553854 1.0155912 52.76663
## Dim.10 1.286665 0.8409575 53.60759
## Dim.11 1.240100 0.8105228 54.41811
## Dim.12 1.218021 0.7960923 55.21421
## Dim.13 1.145282 0.7485505 55.96276
## Dim.14 1.061029 0.6934831 56.65624
## Dim.15 1.046373 0.6839037 57.34014
#SCREE PLOT
plot(1:30, values[1:30], type="b", main="Scree Plot (First 30 PCs) - Full Period",
xlab="PC", ylab="Eigenvalue", col="darkblue", pch=19)
abline(h=1, col="red", lty=2, lwd=2)
This result already tells us interesting things. PCA applied to this kind of market data generally produces different results in the papers we use as references. While the moderately low PC1 still seems realistic, the other PCs are extremely low compared to papers performing similar analyses. We interpret this as a specificity of the Japanese stock market combined with the bias introduced by our stock selection. First, the Japanese stock market is less diversified and offers less variation than the more frequently analyzed American market. Second, since we only kept companies listed since the year 2000, some companies that could bring more diversity to the selection were filtered out.
We will now apply hierarchical agglomerative clustering using Ward's method. Ward's method is generally considered one of the best hierarchical clustering criteria because it minimizes variance across clusters. The difference with other agglomerative approaches is that here, clusters are merged based on variance: at each step, the two clusters whose merger produces the smallest increase in within-cluster variance are joined.
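Formally, the increase in total within-cluster variance caused by merging clusters \(A\) and \(B\), with centroids \(\mu_A\) and \(\mu_B\), is \[ \Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\,\lVert \mu_A - \mu_B \rVert^2, \] and at each step Ward's method merges the pair with the smallest \(\Delta\).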
As PCA is meant to reduce dimensionality, we obviously have to decrease the number of PCs, which currently stands at 153. Several methods are possible; since we have many PCs, we use Kaiser's rule, which tells us to drop PCs whose eigenvalue is smaller than 1. Without this step, we would risk the curse of dimensionality. We could use cumulative variance as a threshold, but given the specificity of our data, a cumulative threshold of, say, 90% would retain too many PCs, defeating the objective of reducing dimensions. We end up with 15 PCs.
#KEEP PCS USING KAISER'S RULE
keep_pcs <- which(values > 1)
#HOW MANY PCS WE KEEP
length(keep_pcs)
## [1] 15
loadings_selected <- loadings[, keep_pcs]
#DO THE HIERARCHICAL CLUSTERING (EUCLIDIAN DISTANCE & WARD METHOD)
dist_matrix <- dist(loadings_selected, method = "euclidean")
hclust_results <- hclust(dist_matrix, method = "ward.D2")
One issue with our data is that taking the 25-year period as a whole could lead to overfitting: aberrant market behavior (the 2008-2009 crisis and COVID-19) may completely dominate our dataset and hurt the analysis. This is why we imitate the framework used by Costa (2025) in his short paper. We divide the period into smaller sub-periods (12 in our case: the period split into halves, thirds, and sevenths). For each sub-period, we independently perform PCA and apply Kaiser's rule to extract the significant principal components, then calculate Euclidean distances between stocks in their respective PC spaces. This produces 12 separate distance matrices, each capturing the co-movement structure during that specific time window. To aggregate these temporal snapshots into a single, robust clustering solution, we compute the harmonic mean of distances across all sub-periods for each pair of stocks.
This temporal aggregation approach yields a final distance matrix that reflects stable, persistent relationships rather than being dominated by extreme events in any single period.
Hierarchical clustering is then applied to this “multi-temporal” distance matrix, producing clusters that capture long-term market structure without being biased by transient crisis-driven correlations.
#CREATE FUNCTION TO SPLIT THE PERIOD
split_into_k <- function(d, k) {
n <- nrow(d)
split(d, cut(seq_len(n), breaks = k, labels = FALSE))}
#SPLIT 2
halves <- split_into_k(df_japan_full, 2)
df_half_1 <- halves[[1]]
df_half_2 <- halves[[2]]
#SPLIT 3
thirds <- split_into_k(df_japan_full, 3)
df_third_1 <- thirds[[1]]
df_third_2 <- thirds[[2]]
df_third_3 <- thirds[[3]]
#SPLIT 7
sevenths <- split_into_k(df_japan_full, 7)
df_seventh_1 <- sevenths[[1]]
df_seventh_2 <- sevenths[[2]]
df_seventh_3 <- sevenths[[3]]
df_seventh_4 <- sevenths[[4]]
df_seventh_5 <- sevenths[[5]]
df_seventh_6 <- sevenths[[6]]
df_seventh_7 <- sevenths[[7]]
#LIST OF PERIODS
df_list <- list(
df_half_1, df_half_2,
df_third_1, df_third_2, df_third_3,
df_seventh_1, df_seventh_2, df_seventh_3, df_seventh_4,
df_seventh_5, df_seventh_6, df_seventh_7)
Here, we simply apply the method used on the full 25-year frame, but now repeatedly on each of our sub-periods.
stock_names <- colnames(df_half_1)[colnames(df_half_1) != "Date"]
n_stocks <- length(stock_names)
distance_list <- list()
for(i in 1:length(df_list)) {
df_subperiod <- df_list[[i]][, setdiff(names(df_list[[i]]), "Date")]
df_subperiod[] <- lapply(df_subperiod, function(x) as.numeric(as.character(x)))
pca_result <- prcomp(df_subperiod, scale. = TRUE)
keep_pcs <- which(pca_result$sdev^2 > 1)
distance_list[[i]] <- as.matrix(dist(pca_result$rotation[, keep_pcs], method = "euclidean"))}
harmonic_dist_matrix <- matrix(0, nrow = n_stocks, ncol = n_stocks)
rownames(harmonic_dist_matrix) <- stock_names
colnames(harmonic_dist_matrix) <- stock_names
This part, however, is specific to the sub-periods procedure. We aggregate the 12 distance matrices into a single one. For each pair of stocks, we extract their distances across all 12 sub-periods and compute the harmonic mean of these values. An epsilon is added to prevent division by zero in case two stocks have identical loadings. This produces a final 153×153 distance matrix, which is then used as input for hierarchical clustering, exactly as was done for the full period.
The harmonic mean is particularly suited for this aggregation: it is dominated by the smallest values, so a single sub-period with an extreme distance cannot inflate the aggregate.
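Concretely, writing \(d_{ij}^{(k)}\) for the distance between stocks \(i\) and \(j\) in sub-period \(k\), and with \(K = 12\) sub-periods, the aggregated distance implemented below is \[ d_{ij} = \frac{K}{\sum_{k=1}^{K} \dfrac{1}{d_{ij}^{(k)} + \varepsilon}}. \]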
epsilon <- 1e-10
for(i in 1:n_stocks) {
for(j in 1:n_stocks) {
if(i != j) {
distances_ij <- sapply(distance_list, function(mat) mat[i, j])
harmonic_dist_matrix[i, j] <- length(distances_ij) / sum(1/(distances_ij + epsilon))}}}
final_dist <- as.dist(harmonic_dist_matrix)
hclust_temporal <- hclust(final_dist, method = "ward.D2")
This section uses several methods to assess the best number of clusters for our dataset.
The WSS or elbow method checks how compact the clusters are: it tries different cluster counts and plots the within-cluster sum of squares for each. A good count is where the curve stops descending sharply, forming a shape similar to an "elbow".
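For a partition into clusters \(C_1, \dots, C_k\) with centroids \(\mu_1, \dots, \mu_k\), \[ \mathrm{WSS}(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2 . \]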
fviz_nbclust(loadings_selected, hcut, method = "wss", k.max = 20)
fviz_nbclust(as.matrix(final_dist), hcut, method = "wss", k.max = 20)
The silhouette measures how close each point is to the other points of its cluster, relative to the nearest other cluster. Silhouette values lie in the \([-1, 1]\) interval; the closer to 1, the better. On the graph, the highest point indicates the best cluster count.
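For a point \(i\), let \(a(i)\) be the mean distance to the other points of its own cluster and \(b(i)\) the mean distance to the points of the nearest other cluster; the silhouette is then \[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}. \]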
fviz_nbclust(loadings_selected, hcut, method = "silhouette", k.max = 20)
fviz_nbclust(as.matrix(final_dist), hcut, method = "silhouette", k.max = 20)
The gap statistic measures how much better the clustering is compared to clustering random, uniformly distributed data. The optimal count is found where the graph reaches a peak.
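Formally, the statistic compares the observed within-cluster dispersion \(W_k\) to its expectation under a uniform reference distribution, \[ \mathrm{Gap}(k) = \mathbb{E}^{*}\!\left[\log W_k\right] - \log W_k, \] where the expectation is estimated by bootstrap (the nboot argument in the calls below).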
fviz_nbclust(loadings_selected, hcut, method = "gap_stat", k.max = 20, nboot = 20)
fviz_nbclust(as.matrix(final_dist), hcut, method = "gap_stat", k.max = 20, nboot = 20)
Treat this section as an experimental one. The goal of our study is to validate the Morningstar classification, not to create a new one; doing so would lead us to another research question (which could be an interesting master's thesis subject).
We see that none of these metrics gives a clear result. This can be interpreted as data fuzziness: the data structure is unclear and hard to cluster and organize. However, a count of 11 categories does not seem aberrant; on the contrary, it seems a good compromise between a realistic number of categories and clustering quality.
With WSS (elbow), the optimal count is close to where the curve starts to flatten. With silhouette, although values around 15 or 16 seem better, a count of 11 stays realistic. Similarly for the gap statistic, although no count under 20 reaches a peak, the slope starts to decrease around 12-13.
Comparing sub-periods and full period, elbow and gap give the 11-cluster count a better result under the sub-periods method. Silhouette does not differentiate much, but shows that only 2 categories would be an excellent choice under the sub-periods methodology.
#FULL PERIOD
plot(hclust_results,
main = "Hierarchical clustering - Full Period",
xlab = "Stocks",
ylab = "Mean Distance",
cex = 0.4)
#WITH SUBPERIODS
plot(hclust_temporal,
main = "Hierarchical Clustering - With Sub Periods",
xlab = "Stocks",
ylab = "Mean Distance",
cex = 0.4)
Both methods give a distribution highly dominated by one cluster. This cluster probably corresponds to industrial stocks, as the exploratory data analysis showed that many stocks are classified in the industrial category according to Morningstar®.
#FULL PERIOD
clusters_k_11 <- cutree(hclust_results, k = 11)
barplot(table(clusters_k_11),
main = "Distribution of Stocks Across Clusters (Full period)",
xlab = "Cluster",
ylab = "Number of Stocks",
col = "steelblue",
border = "white")
#WITH SUBPERIODS
clusters_temporal_11 <- cutree(hclust_temporal, k = 11)
barplot(table(clusters_temporal_11),
main = "Distribution of Stocks Across Clusters (With Sub periods)",
xlab = "Cluster",
ylab = "Number of Stocks",
col = "steelblue",
border = "white")
This result is interesting and gives us several insights. First, and it may be surprising, not subdividing into sub-periods gives better silhouette results. The overall result is rather poor, but the variation across clusters is very high! For example, cluster 3 is extremely well clustered in both cases, whereas cluster 2 displays catastrophic results. Many stocks, especially with the full-period method, even appear to sit in the wrong cluster.
Those results need further investigation, as identifying what each cluster represents may bring important information about sectoral behavior in stock markets.
#FULL PERIOD
sil_full_k_11 <- silhouette(clusters_k_11, dist_matrix)
fviz_silhouette(sil_full_k_11,
palette = "viridis(11)",
ggtheme = theme_minimal()) +
labs(title = "Silhouette Analysis - Full Period",
subtitle = paste("Average Silhouette Width:", round(mean(sil_full_k_11[, 3]), 3)))
## cluster size ave.sil.width
## 1 1 9 0.31
## 2 2 24 0.07
## 3 3 4 0.88
## 4 4 13 0.08
## 5 5 5 0.26
## 6 6 8 0.32
## 7 7 55 0.15
## 8 8 8 0.50
## 9 9 14 0.35
## 10 10 8 0.53
## 11 11 5 0.45
#SUBPERIODS
sil_temporal_11 <- silhouette(clusters_temporal_11, final_dist)
fviz_silhouette(sil_temporal_11,
palette = "viridis(11)",
ggtheme = theme_minimal()) +
labs(title = "Silhouette Analysis - Sub Periods",
subtitle = paste("Average Silhouette Width:", round(mean(sil_temporal_11[, 3]), 3)))
## cluster size ave.sil.width
## 1 1 12 0.15
## 2 2 55 0.04
## 3 3 4 0.78
## 4 4 14 0.16
## 5 5 10 0.06
## 6 6 12 0.33
## 7 7 8 0.38
## 8 8 16 0.26
## 9 9 9 0.25
## 10 10 8 0.46
## 11 11 5 0.40
stock_names <- colnames(df_japan_full)
cluster_list_k_11 <- split(stock_names, clusters_k_11)
max_size_k_11 <- max(sapply(cluster_list_k_11, length))
cluster_df_k_11 <- data.frame(
lapply(cluster_list_k_11, function(x) {
c(x, rep("", max_size_k_11 - length(x)))}))
kable(cluster_df_k_11,
col.names = paste("Cluster", 1:11))
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 | Cluster 9 | Cluster 10 | Cluster 11 |
|---|---|---|---|---|---|---|---|---|---|---|
| X1332.T | X1605.T | X1801.T | X1925.T | X2501.T | X3086.T | X3401.T | X4151.T | X5101.T | X9001.T | X9501.T |
| X2002.T | X1963.T | X1802.T | X1928.T | X4183.T | X3382.T | X3402.T | X4502.T | X6471.T | X9005.T | X9502.T |
| X2282.T | X2768.T | X1803.T | X8253.T | X5108.T | X4452.T | X3405.T | X4503.T | X6472.T | X9007.T | X9503.T |
| X2502.T | X5401.T | X1812.T | X8308.T | X8001.T | X4911.T | X3407.T | X4506.T | X6473.T | X9008.T | X9531.T |
| X2503.T | X5406.T | X8316.T | X9984.T | X8233.T | X3861.T | X4507.T | X6902.T | X9009.T | X9532.T | |
| X2801.T | X5631.T | X8331.T | X8252.T | X4004.T | X4519.T | X7201.T | X9020.T | |||
| X2802.T | X5706.T | X8601.T | X8267.T | X4005.T | X4523.T | X7202.T | X9021.T | |||
| X2871.T | X5711.T | X8604.T | X9983.T | X4021.T | X4568.T | X7203.T | X9202.T | |||
| X2914.T | X5713.T | X8801.T | X4042.T | X7205.T | ||||||
| X5714.T | X8802.T | X4061.T | X7211.T | |||||||
| X5801.T | X8830.T | X4063.T | X7261.T | |||||||
| X5803.T | X9432.T | X4188.T | X7267.T | |||||||
| X6302.T | X9433.T | X4208.T | X7269.T | |||||||
| X6674.T | X4543.T | X7270.T | ||||||||
| X7004.T | X4901.T | |||||||||
| X7011.T | X4902.T | |||||||||
| X7013.T | X5201.T | |||||||||
| X8002.T | X5233.T | |||||||||
| X8015.T | X5301.T | |||||||||
| X8031.T | X5332.T | |||||||||
| X8053.T | X5333.T | |||||||||
| X8058.T | X5802.T | |||||||||
| X9104.T | X6301.T | |||||||||
| X9107.T | X6326.T | |||||||||
| X6361.T | ||||||||||
| X6367.T | ||||||||||
| X6479.T | ||||||||||
| X6501.T | ||||||||||
| X6503.T | ||||||||||
| X6504.T | ||||||||||
| X6701.T | ||||||||||
| X6702.T | ||||||||||
| X6752.T | ||||||||||
| X6753.T | ||||||||||
| X6758.T | ||||||||||
| X6762.T | ||||||||||
| X6770.T | ||||||||||
| X6841.T | ||||||||||
| X6857.T | ||||||||||
| X6952.T | ||||||||||
| X6954.T | ||||||||||
| X6971.T | ||||||||||
| X6976.T | ||||||||||
| X7731.T | ||||||||||
| X7733.T | ||||||||||
| X7751.T | ||||||||||
| X7752.T | ||||||||||
| X7911.T | ||||||||||
| X7912.T | ||||||||||
| X7951.T | ||||||||||
| X8035.T | ||||||||||
| X9064.T | ||||||||||
| X9602.T | ||||||||||
| X9735.T | ||||||||||
| X9766.T |
cluster_list_temporal_11 <- split(stock_names, clusters_temporal_11)
max_size_temporal_11 <- max(sapply(cluster_list_temporal_11, length))
cluster_df_temporal_11 <- data.frame(
lapply(cluster_list_temporal_11, function(x) {
c(x, rep("", max_size_temporal_11 - length(x)))}))
kable(cluster_df_temporal_11,
col.names = paste("Cluster", 1:11))
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 | Cluster 9 | Cluster 10 | Cluster 11 |
|---|---|---|---|---|---|---|---|---|---|---|
| X1332.T | X1605.T | X1801.T | X2768.T | X3086.T | X3401.T | X4151.T | X5101.T | X8253.T | X9001.T | X9501.T |
| X2002.T | X1925.T | X1802.T | X5401.T | X3382.T | X3402.T | X4502.T | X5108.T | X8308.T | X9005.T | X9502.T |
| X2282.T | X1928.T | X1803.T | X5406.T | X8233.T | X3405.T | X4503.T | X5802.T | X8316.T | X9007.T | X9503.T |
| X2501.T | X1963.T | X1812.T | X5706.T | X8252.T | X3407.T | X4506.T | X6471.T | X8331.T | X9008.T | X9531.T |
| X2502.T | X3861.T | X5711.T | X8267.T | X4004.T | X4507.T | X6472.T | X8601.T | X9009.T | X9532.T | |
| X2503.T | X4021.T | X5713.T | X9432.T | X4005.T | X4519.T | X6473.T | X8604.T | X9020.T | ||
| X2801.T | X4063.T | X5714.T | X9433.T | X4042.T | X4523.T | X6902.T | X8801.T | X9021.T | ||
| X2802.T | X4543.T | X8001.T | X9766.T | X4061.T | X4568.T | X7201.T | X8802.T | X9202.T | ||
| X2871.T | X4901.T | X8002.T | X9983.T | X4183.T | X7202.T | X8830.T | ||||
| X2914.T | X4902.T | X8031.T | X9984.T | X4188.T | X7203.T | |||||
| X4452.T | X5201.T | X8053.T | X4208.T | X7205.T | ||||||
| X4911.T | X5233.T | X8058.T | X5301.T | X7211.T | ||||||
| X5332.T | X9104.T | X7261.T | ||||||||
| X5333.T | X9107.T | X7267.T | ||||||||
| X5631.T | X7269.T | |||||||||
| X5801.T | X7270.T | |||||||||
| X5803.T | ||||||||||
| X6301.T | ||||||||||
| X6302.T | ||||||||||
| X6326.T | ||||||||||
| X6361.T | ||||||||||
| X6367.T | ||||||||||
| X6479.T | ||||||||||
| X6501.T | ||||||||||
| X6503.T | ||||||||||
| X6504.T | ||||||||||
| X6674.T | ||||||||||
| X6701.T | ||||||||||
| X6702.T | ||||||||||
| X6752.T | ||||||||||
| X6753.T | ||||||||||
| X6758.T | ||||||||||
| X6762.T | ||||||||||
| X6770.T | ||||||||||
| X6841.T | ||||||||||
| X6857.T | ||||||||||
| X6952.T | ||||||||||
| X6954.T | ||||||||||
| X6971.T | ||||||||||
| X6976.T | ||||||||||
| X7004.T | ||||||||||
| X7011.T | ||||||||||
| X7013.T | ||||||||||
| X7731.T | ||||||||||
| X7733.T | ||||||||||
| X7751.T | ||||||||||
| X7752.T | ||||||||||
| X7911.T | ||||||||||
| X7912.T | ||||||||||
| X7951.T | ||||||||||
| X8015.T | ||||||||||
| X8035.T | ||||||||||
| X9064.T | ||||||||||
| X9602.T | ||||||||||
| X9735.T |
These tables now allow us to identify what each generated cluster represents. It is very interesting to see that our third cluster, which was so nicely classified according to silhouette, is represented by a few specific industrial stocks: Taisei, Obayashi, Shimizu, and Kajima.
We also get very different results between the full-period and sub-period methodologies. With the full period, the Morningstar industrial category is split between the second and seventh clusters, while with sub-periods, industrial stocks mainly go into the second cluster.
Based on the tables, some sectors also seem to be more accurately recovered, such as utilities, healthcare, and consumer defensive (with both methods).
We further analyze the correspondence between clusters and Morningstar sectors using confusion matrices and two additional metrics, ARI and purity.
#FULL PERIOD
confusion_matrix_11x11_full <- table(df_japan_classified$Sector, clusters_k_11)
conf_df_full <- as.data.frame(confusion_matrix_11x11_full)
colnames(conf_df_full) <- c("Sector", "Cluster", "Count")
ggplot(conf_df_full, aes(x = Cluster, y = Sector, fill = Count)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = Count), color = "black", size = 4, fontface = "bold") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Morningstar Sectors vs PCA Clusters - With Full Period",
x = "Cluster",
y = "Morningstar Sector",
fill = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
axis.text.y = element_text(size = 9),
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
panel.grid = element_blank())
#SUBPERIODS
confusion_matrix_11x11 <- table(df_japan_classified$Sector, clusters_temporal_11)
conf_df <- as.data.frame(confusion_matrix_11x11)
colnames(conf_df) <- c("Sector", "Cluster", "Count")
ggplot(conf_df, aes(x = Cluster, y = Sector, fill = Count)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = Count), color = "black", size = 4, fontface = "bold") +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Morningstar Sectors vs PCA Clusters - With sub-periods",
x = "Cluster",
y = "Morningstar Sector",
fill = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
axis.text.y = element_text(size = 9),
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
panel.grid = element_blank())
The ARI checks pairwise agreement between clustering results and true sector labels, adjusted for chance. Our result is better than 0, which would correspond to random assignment, but is still quite low. Here, unlike with silhouette, the sub-periods method gets a slightly better score. Silhouette cares about variance within the cluster, while ARI does not "see" what happens inside the cluster variance-wise.
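Formally, starting from the Rand index \(\mathrm{RI}\) (the share of stock pairs on which the two partitions agree), \[ \mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]}, \] where the expectation is taken over random labelings with the same cluster sizes, so that 0 means chance-level agreement and 1 perfect agreement.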
ari_full <- adjustedRandIndex(clusters_k_11, as.numeric(factor(df_japan_classified$Sector)))
cat("ARI - Full Period:", round(ari_full, 3), "\n")
## ARI - Full Period: 0.213
ari_sub <- adjustedRandIndex(clusters_temporal_11, as.numeric(factor(df_japan_classified$Sector)))
cat("ARI - Sub-periods:", round(ari_sub, 3), "\n")
## ARI - Sub-periods: 0.29
Purity is another metric based on the confusion matrix. Here, we simply check how "pure" the clusters are: each cluster is matched to its most frequent sector, and purity is the fraction of stocks that fall into their cluster's majority sector, regardless of how the clusters map to sectors overall. The score is fairly good, and we again get a better result with the sub-periods method. This may indicate that there are clear categories "underneath" the ones we have from Morningstar®.
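The function below implements \[ \mathrm{purity} = \frac{1}{N} \sum_{c} \max_{s} \, n_{cs}, \] where \(n_{cs}\) counts the stocks of cluster \(c\) labeled with Morningstar sector \(s\) and \(N\) is the total number of stocks.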
purity <- function(clusters, classes) {
tbl <- table(clusters, classes)
sum(apply(tbl, 1, max)) / length(clusters)}
purity_full <- purity(clusters_k_11, df_japan_classified$Sector)
cat("Purity - Full Period:", round(purity_full, 3), "\n")
## Purity - Full Period: 0.621
purity_sub <- purity(clusters_temporal_11, df_japan_classified$Sector)
cat("Purity - Sub-periods:", round(purity_sub, 3), "\n")
## Purity - Sub-periods: 0.699
This study was rich in insights. Regarding our research question, we can say that PCA was not optimal in our case for sectoral clustering. However, the algorithms we used proposed clusters that are often disregarded by professionals, especially those working in financial markets who may simply rely on the sectors given by the GICS® or Morningstar® classifications.
We saw that four companies from the Nikkei clustered extremely well compared to the others. Those companies are simply categorized as industrial by Morningstar®. However, they are actually part of the "Big 5 super general contractors", an oligopoly that dominates the construction market in Japan (the fifth, Takenaka, is not a listed company). Knowing that these companies have very similar stock behavior could be a crucial insight for arbitrage-oriented traders. Some other sectors were also better clustered than others, such as utilities, healthcare, and consumer defensive, showing that companies from those sectors display very similar patterns in their stock price evolution.
Of course, we cannot forget the limitations mentioned previously in the study, the main one being the selection bias caused by our data preparation. Some of these limitations could be addressed easily, opening the way for further research based on methodological improvements. Other approaches could also be tried to increase accuracy in sector prediction or to uncover other insights and even stronger "hidden categories", such as different clustering techniques or other PCA thresholds.
Costa, A. P. M. (2025). Sector analysis on the S&P 500: A PCA approach. FGV Invest Short Studies Series.
Yang, L., Rea, W., & Rea, A. (2017). Financial insights from the last few components of a stock market PCA. International Journal of Financial Studies, 5(3), 15.
Salvati, F. (2023). Innovative clustering approach for S&P 500 companies: Beyond sector boundaries and cyclical/non-cyclical portfolios.
Pasini, G. (2017). Principal component analysis for stock portfolio management. International Journal of Pure and Applied Mathematics, 115(1), 153-167.