The Fama–French five-factor model constitutes one of the most influential empirical frameworks in modern asset pricing, extending the traditional CAPM by incorporating size, value, profitability, and investment-related risk factors. While the individual economic interpretations of these factors are well established in the literature, less attention is typically devoted to the joint correlation structure and latent geometry of the factor space itself.
This project aims to explore the multivariate dependence structure of the monthly Fama–French five factors using dimensionality reduction techniques. Rather than focusing on predictive performance or asset pricing tests, the analysis emphasizes the internal geometry of the factor space and investigates whether the observed factors can be represented by a smaller number of latent dimensions without substantial loss of information.
To this end, Principal Component Analysis (PCA), rotated PCA, and Multidimensional Scaling (MDS) are employed and systematically compared. Particular attention is paid to distance preservation, interpretability of latent components, and the degree to which linear methods adequately capture the structure of the data.
The empirical analysis is based on the Fama–French Five-Factor model using monthly data. The dataset originates from the publicly available data library (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) maintained by Kenneth R. French and is widely used in empirical asset pricing research.
The following five risk factors are employed in the analysis:
Mkt–RF (Market Excess Return)
The excess return on the value-weighted market portfolio over the
risk-free rate. This factor captures systematic market risk and
represents the traditional CAPM component.
SMB (Small Minus Big)
The return differential between portfolios of small-cap and large-cap
firms, designed to proxy for the size effect observed in equity
markets.
HML (High Minus Low)
The return differential between high book-to-market (value) and low
book-to-market (growth) firms, capturing the value premium.
RMW (Robust Minus Weak)
The return differential between firms with robust and weak operating
profitability, reflecting the profitability effect in asset
returns.
CMA (Conservative Minus Aggressive)
The return differential between firms that invest conservatively versus
aggressively, capturing the investment effect.
The risk-free rate (RF) is excluded from the analysis, as the focus is placed on the joint structure and dependence between the systematic risk factors themselves rather than on excess-return construction.
The dataset consists of 749 monthly observations, from July 1963 to November 2025, covering more than six decades of U.S. equity market history. Monthly frequency is chosen to balance noise reduction with sufficient time-series length for reliable multivariate analysis.
All factor return series are standardized to zero mean and unit variance (z-scores) before the multivariate analysis.
Standardization ensures that the subsequent distance-based and variance-based methods (PCA and MDS) are not dominated by differences in scale or volatility across factors. As a result, the analysis focuses purely on correlation structure and geometric relationships among the factors.
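As a minimal sketch of what standardization does, the following R snippet z-scores a small simulated return matrix and checks that the covariance matrix of the standardized data coincides with the correlation matrix of the raw data. The simulated data and the names `sim_returns` and `z` are illustrative assumptions, not the actual factor series.

```r
# Illustrative only: simulated "returns" with deliberately different volatilities.
set.seed(1)
n <- 200
sim_returns <- cbind(
  A = rnorm(n, mean = 0.5, sd = 4.5),
  B = rnorm(n, mean = 0.2, sd = 3.0),
  C = rnorm(n, mean = 0.3, sd = 2.0)
)

# scale() centers each column and divides it by its sample standard deviation.
z <- scale(sim_returns)

col_means <- colMeans(z)        # numerically zero after centering
col_sds   <- apply(z, 2, sd)    # one after rescaling

# After standardization, covariance and correlation coincide.
max_diff <- max(abs(cov(z) - cor(sim_returns)))

print(round(col_means, 10))
print(round(col_sds, 10))
print(max_diff)
```

Because the covariance of the z-scores equals the correlation matrix, any variance-based method applied to `z` works on correlations only, which is the property the paragraph above relies on.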
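Despite the relatively small number of variables, dimensionality reduction is well motivated due to the presence of economically meaningful correlations among the factors. The objective is not variable reduction per se, but rather to examine the internal geometry of the factor space, to assess whether the five factors can be represented by fewer latent dimensions without substantial loss of information, and to compare how well PCA, rotated PCA, and MDS preserve distances and yield interpretable components.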
This approach aligns with the broader asset pricing literature, where factor interdependence and redundancy are known to play a crucial role in model interpretation and factor construction.
The analysis is based on monthly observations of the Fama–French five factors obtained from Kenneth French’s data library. The sample comprises 749 monthly observations (July 1963 to November 2025), providing a sufficiently long time horizon to study the stable dependence structure between factors.
The five analyzed factors include the excess market return (Mkt–RF), size (SMB), value (HML), profitability (RMW), and investment (CMA). The risk-free rate is excluded, as the focus lies on the joint behavior of systematic risk factors rather than excess return construction.
All factor series are standardized prior to analysis to ensure comparability and to prevent scale differences from influencing distance-based methods. Standardization is particularly important in the context of PCA and MDS, where the geometry of the data is directly affected by variable scaling.
library(tidyverse)
library(lubridate)
library(factoextra)
library(gridExtra)
library(smacof)
library(vegan)
library(psych)
library(tibble)
ff5_raw <- readr::read_csv("FF5_monthly.csv", show_col_types = FALSE)
names(ff5_raw)
#> [1] "...1" "Mkt-RF" "SMB" "HML" "RMW" "CMA" "RF"
head(ff5_raw, 5)
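# Keep monthly rows only, parse dates, and coerce the factor columns to numeric
ff5 <- ff5_raw %>%
  rename(DATE = 1) %>%                            # first column holds the date
  mutate(DATE = as.character(DATE)) %>%
  filter(str_detect(DATE, "^[0-9]{6}$")) %>%      # keep six-digit YYYYMM rows
  mutate(date = ymd(paste0(DATE, "01"))) %>%      # first day of each month
  select(date, `Mkt-RF`, SMB, HML, RMW, CMA) %>%  # RF is excluded
  arrange(date) %>%
  mutate(across(-date, as.numeric))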
glimpse(ff5)
#> Rows: 749
#> Columns: 6
#> $ date <date> 1963-07-01, 1963-08-01, 1963-09-01, 1963-10-01, 1963-11-01, …
#> $ `Mkt-RF` <dbl> -0.39, 5.08, -1.57, 2.54, -0.86, 1.83, 2.27, 1.55, 1.41, 0.11…
#> $ SMB <dbl> -0.48, -0.80, -0.43, -1.34, -0.85, -1.89, 0.10, 0.33, 1.41, -…
#> $ HML <dbl> -0.81, 1.70, 0.00, -0.04, 1.73, -0.21, 1.63, 2.81, 3.29, -0.5…
#> $ RMW <dbl> 0.64, 0.40, -0.78, 2.79, -0.43, 0.12, 0.21, 0.11, -2.03, -1.3…
#> $ CMA <dbl> -1.15, -0.38, 0.15, -2.25, 2.27, -0.25, 1.48, 0.81, 2.98, -1.…
range(ff5$date)
#> [1] "1963-07-01" "2025-11-01"
summary(ff5)
#> date Mkt-RF SMB HML
#> Min. :1963-07-01 Min. :-23.1900 Min. :-15.5400 Min. :-13.8300
#> 1st Qu.:1979-02-01 1st Qu.: -1.9600 1st Qu.: -1.5800 1st Qu.: -1.4400
#> Median :1994-09-01 Median : 1.0200 Median : 0.0200 Median : 0.2000
#> Mean :1994-08-31 Mean : 0.5956 Mean : 0.1801 Mean : 0.2837
#> 3rd Qu.:2010-04-01 3rd Qu.: 3.4200 3rd Qu.: 1.9400 3rd Qu.: 1.7300
#> Max. :2025-11-01 Max. : 16.1000 Max. : 18.4600 Max. : 12.8600
#> RMW CMA
#> Min. :-18.9500 Min. :-7.0800
#> 1st Qu.: -0.8500 1st Qu.:-1.0400
#> Median : 0.2500 Median : 0.0900
#> Mean : 0.2628 Mean : 0.2407
#> 3rd Qu.: 1.3100 3rd Qu.: 1.4900
#> Max. : 13.0500 Max. : 9.0100
diag_tbl <- tibble(
  start_date = min(ff5$date),
  end_date = max(ff5$date),
  n_months = nrow(ff5),
  n_factors = ncol(ff5) - 1
)
knitr::kable(diag_tbl, caption = "Dataset diagnostics")
| start_date | end_date | n_months | n_factors |
|---|---|---|---|
| 1963-07-01 | 2025-11-01 | 749 | 5 |
X <- ff5 %>% select(`Mkt-RF`, SMB, HML, RMW, CMA)
corr_mat <- cor(X, use = "pairwise.complete.obs")
round(corr_mat, 2)
#> Mkt-RF SMB HML RMW CMA
#> Mkt-RF 1.00 0.28 -0.21 -0.19 -0.35
#> SMB 0.28 1.00 0.01 -0.34 -0.08
#> HML -0.21 0.01 1.00 0.09 0.68
#> RMW -0.19 -0.34 0.09 1.00 0.00
#> CMA -0.35 -0.08 0.68 0.00 1.00
knitr::kable(round(corr_mat, 2), caption = "Correlation matrix")
| | Mkt-RF | SMB | HML | RMW | CMA |
|---|---|---|---|---|---|
| Mkt-RF | 1.00 | 0.28 | -0.21 | -0.19 | -0.35 |
| SMB | 0.28 | 1.00 | 0.01 | -0.34 | -0.08 |
| HML | -0.21 | 0.01 | 1.00 | 0.09 | 0.68 |
| RMW | -0.19 | -0.34 | 0.09 | 1.00 | 0.00 |
| CMA | -0.35 | -0.08 | 0.68 | 0.00 | 1.00 |
corr_df <- as.data.frame(corr_mat) %>%
  rownames_to_column("factor1") %>%
  pivot_longer(-factor1, names_to = "factor2", values_to = "corr")
ggplot(corr_df, aes(factor1, factor2, fill = corr)) +
  geom_tile() +
  geom_text(aes(label = round(corr, 2)), size = 4) +
  scale_fill_gradient2(limits = c(-1, 1)) +
  coord_equal() +
  labs(title = "Correlation heatmap", fill = "Corr") +
  theme_minimal()
corr_with_mkt <- sort(corr_mat[, "Mkt-RF"], decreasing = TRUE)
round(corr_with_mkt, 2)
#> Mkt-RF SMB RMW HML CMA
#> 1.00 0.28 -0.19 -0.21 -0.35
The correlation matrix reveals several notable interdependencies among the Fama–French factors, which motivates dimensionality reduction. Most prominently, the value factor (HML) is strongly positively correlated with the investment factor (CMA) at 0.68, which is consistent with theoretical interpretations linking value firms to conservative investment behavior.
The market factor (Mkt–RF) and the size factor (SMB) are modestly positively correlated (0.28), reflecting the empirical tendency of small-cap stocks to exhibit higher market beta exposure. The market factor is also negatively correlated with CMA (-0.35) and HML (-0.21). In contrast, profitability (RMW) is nearly uncorrelated with HML (0.09) and CMA (0.00) but negatively correlated with SMB (-0.34), suggesting that it captures a dimension largely distinct from the value and investment pair.
The presence of such structured correlations indicates that the five-factor system may be effectively represented by a smaller number of latent dimensions without substantial loss of information.
X_scaled <- scale(X)
pca <- prcomp(X_scaled)
eig <- pca$sdev^2
prop <- eig / sum(eig)
pca_var_tbl <- tibble(
  PC = paste0("PC", 1:5),
  eigenvalue = eig,
  prop_var = prop,
  cum_var = cumsum(prop)
)
pca_var_tbl
knitr::kable(
  pca_var_tbl %>% mutate(across(where(is.numeric), ~ round(.x, 4))),
  caption = "PCA: eigenvalues and explained variance"
)
| PC | eigenvalue | prop_var | cum_var |
|---|---|---|---|
| PC1 | 1.9480 | 0.3896 | 0.3896 |
| PC2 | 1.3776 | 0.2755 | 0.6651 |
| PC3 | 0.7727 | 0.1545 | 0.8197 |
| PC4 | 0.6201 | 0.1240 | 0.9437 |
| PC5 | 0.2816 | 0.0563 | 1.0000 |
fviz_eig(pca, main = "Scree plot: variance explained by PCs")
loadings <- pca$rotation[, 1:3] %>%
  as.data.frame() %>%
  rownames_to_column("Factor")
knitr::kable(
  loadings %>% mutate(across(-Factor, ~ round(.x, 3))),
  caption = "PCA loadings (PC1–PC3)"
)
| Factor | PC1 | PC2 | PC3 |
|---|---|---|---|
| Mkt-RF | 0.467 | 0.214 | 0.654 |
| SMB | 0.269 | 0.612 | 0.161 |
| HML | -0.547 | 0.374 | 0.357 |
| RMW | -0.245 | -0.571 | 0.647 |
| CMA | -0.592 | 0.337 | -0.009 |
Inspection of factor loadings allows for an economically meaningful interpretation of the extracted components, keeping in mind that the sign of each component is arbitrary. The first principal component loads most heavily on the investment (CMA, -0.592) and value (HML, -0.547) factors, with an opposite-signed loading on the market factor (Mkt–RF, 0.467). It therefore contrasts market exposure with a latent dimension related to firms’ balance sheet structure and long-term investment behavior, and can be interpreted as a Value–Investment axis.
The second component is dominated by size (SMB, 0.612) and profitability (RMW, -0.571), which enter with opposite signs, consistent with their negative correlation of -0.34. It contrasts small firms with highly profitable firms rather than isolating profitability alone.
The third component loads mainly on the market (Mkt–RF, 0.654) and profitability (RMW, 0.647) factors, while size (SMB, 0.161) contributes little. The unrotated components thus mix several factors, and a cleaner separation into Market–Size and profitability dimensions appears only in the rotated solution.
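The relative weight of each factor in a component can be read directly from the loadings table above. The sketch below re-enters the rounded loadings by hand and converts them to percentage contributions, assuming the usual definition in which a variable's contribution to a component is its squared loading (the quantity plotted by contribution charts for a correlation-based PCA). The object names `loadings_tbl`, `contrib_pct`, and `top_factor` are illustrative.

```r
# Loadings as reported in the PCA loadings table (rounded to 3 decimals).
loadings_tbl <- matrix(
  c( 0.467,  0.214,  0.654,
     0.269,  0.612,  0.161,
    -0.547,  0.374,  0.357,
    -0.245, -0.571,  0.647,
    -0.592,  0.337, -0.009),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("Mkt-RF", "SMB", "HML", "RMW", "CMA"),
                  c("PC1", "PC2", "PC3"))
)

# Each column is (up to rounding) a unit vector, so squared loadings sum to 1
# and can be read as percentage contributions.
contrib_pct <- 100 * loadings_tbl^2
col_totals  <- colSums(contrib_pct)

# Factor with the largest contribution on each component.
top_factor <- apply(contrib_pct, 2, function(v) names(v)[which.max(v)])

print(round(contrib_pct, 1))
print(round(col_totals, 1))
print(top_factor)
```

Squaring removes the sign, so the contributions show which factors drive a component but not whether they enter with the same or opposite direction; the signed loadings are still needed for interpretation.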
gridExtra::grid.arrange(
  fviz_contrib(pca, choice = "var", axes = 1, title = "Contribution to PC1"),
  fviz_contrib(pca, choice = "var", axes = 2, title = "Contribution to PC2"),
  fviz_contrib(pca, choice = "var", axes = 3, title = "Contribution to PC3"),
  ncol = 1
)
scores <- as_tibble(pca$x) %>%
  mutate(date = ff5$date)
ggplot(scores, aes(PC1, PC2)) +
  geom_point(alpha = 0.5) +
  labs(title = "Months in PC space (PC1 vs PC2)") +
  theme_minimal()
fviz_pca_biplot(
  pca,
  axes = c(1, 2),
  geom.ind = "point",
  alpha.ind = 0.5,
  col.ind = "grey40",
  col.var = "steelblue",
  repel = TRUE
) +
  labs(title = "PCA biplot: PC1 vs PC2")
Principal Component Analysis is employed to identify orthogonal linear combinations of the original factors that explain the maximum variance in the data. Given the strong correlation structure observed earlier, PCA serves as a natural first step in uncovering latent dimensions underlying the factor space.
The eigenvalue decomposition reveals that the first three principal components jointly explain approximately 82% of total variance, while the first two explain about 67%. This result suggests that the effective dimensionality of the Fama–French five-factor system is lower than the nominal number of factors, although the fourth component still accounts for a further 12%.
From an economic perspective, this finding implies that multiple observed risk factors may be manifestations of a smaller number of fundamental sources of systematic risk.
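Because the factors are standardized, PCA here amounts to an eigendecomposition of the correlation matrix. As a rough sketch, the snippet below re-enters the correlation matrix reported earlier (rounded to two decimals, so the eigenvalues will only approximately match the PCA table) and computes the share of variance per component. The object names `R`, `eigenvalues`, `prop_var`, and `cum_var` are illustrative.

```r
# Correlation matrix as reported above (rounded to 2 decimals).
factors <- c("Mkt-RF", "SMB", "HML", "RMW", "CMA")
R <- matrix(
  c( 1.00,  0.28, -0.21, -0.19, -0.35,
     0.28,  1.00,  0.01, -0.34, -0.08,
    -0.21,  0.01,  1.00,  0.09,  0.68,
    -0.19, -0.34,  0.09,  1.00,  0.00,
    -0.35, -0.08,  0.68,  0.00,  1.00),
  nrow = 5, byrow = TRUE, dimnames = list(factors, factors)
)

# Eigenvalues of a correlation matrix are the variances of the principal
# components of the standardized variables; they sum to the number of variables.
eigenvalues <- eigen(R, symmetric = TRUE)$values
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

print(round(data.frame(eigenvalue = eigenvalues,
                       prop_var = prop_var,
                       cum_var = cum_var), 4))
```

The eigenvalues sum to five, one unit of variance per standardized factor, which is why the proportion of variance explained is simply each eigenvalue divided by five.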
dist_mat <- dist(scale(X))
mds_cmd <- cmdscale(dist_mat, k = 2)
mds_df <- tibble(
  Dim1 = mds_cmd[, 1],
  Dim2 = mds_cmd[, 2]
)
ggplot(mds_df, aes(Dim1, Dim2)) +
  geom_point(alpha = 0.5, color = "grey40") +
  labs(title = "Classical MDS (2D)") +
  theme_minimal()
mds_smacof <- smacof::mds(dist_mat, ndim = 2, type = "ratio")
mds_smacof$stress
#> [1] 0.1650079
knitr::kable(
tibble(stress_2D = round(mds_smacof$stress, 4)),
caption = "SMACOF MDS stress (2D)"
)
| stress_2D |
|---|
| 0.165 |
plot(mds_smacof, main = "SMACOF MDS (stress-based)")
stress_curve <- tibble(
  ndim = 1:5,
  stress = map_dbl(1:5, ~ smacof::mds(dist_mat, ndim = .x, type = "ratio")$stress)
)
knitr::kable(
  stress_curve %>% mutate(stress = round(stress, 4)),
  caption = "Stress vs number of dimensions"
)
| ndim | stress |
|---|---|
| 1 | 0.3527 |
| 2 | 0.1650 |
| 3 | 0.0956 |
| 4 | 0.0383 |
| 5 | 0.0000 |
ggplot(stress_curve, aes(ndim, stress)) +
  geom_line() +
  geom_point(size = 2) +
  labs(title = "SMACOF MDS: stress curve") +
  theme_minimal()
Multidimensional Scaling offers a complementary, distance-based perspective on the structure of the data. Unlike PCA, which is variance-oriented, MDS focuses explicitly on preserving pairwise distances between observations.
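Using Euclidean distances computed on standardized factors, both classical MDS and stress-based SMACOF MDS are applied. The two-dimensional SMACOF solution has a stress of 0.165, which is moderate rather than low: by Kruskal's conventional benchmarks it falls between fair (0.10) and poor (0.20), so the two-dimensional map should be read as an approximation of the five-dimensional configuration.
The stress curve shows the largest drop between one and two dimensions (0.353 to 0.165), followed by smaller gains at three (0.096) and four (0.038) dimensions, suggesting diminishing returns beyond two to three dimensions.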
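To make the notion of stress concrete, the sketch below computes a stress-type measure for classical MDS configurations of increasing dimension, using base R only. It runs on simulated data, and the normalization by the squared original distances is an assumption chosen for simplicity; the SMACOF stress reported above comes from an iterative fit with its own normalization, so the values are not directly comparable. The helper `stress_k` is hypothetical.

```r
# Illustrative only: simulated, standardized data with five variables.
set.seed(42)
n <- 60
sim <- scale(matrix(rnorm(n * 5), ncol = 5))
delta <- as.vector(dist(sim))   # original pairwise Euclidean distances

# Stress of a k-dimensional classical MDS configuration:
# root of squared distance errors relative to squared original distances.
stress_k <- function(k) {
  conf  <- cmdscale(dist(sim), k = k)
  d_hat <- as.vector(dist(conf))
  sqrt(sum((delta - d_hat)^2) / sum(delta^2))
}

stress_by_dim <- sapply(1:5, stress_k)
print(round(stress_by_dim, 4))
```

With five variables, a five-dimensional configuration reproduces the distances exactly, so the measure falls to zero at `k = 5`, mirroring the last row of the stress table.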
d_orig <- as.vector(dist(scale(X)))
d_pca <- as.vector(dist(as.matrix(scores[, c("PC1","PC2")])))
d_mds <- as.vector(dist(mds_cmd))
cor(d_orig, d_pca)
#> [1] 0.9296938
cor(d_orig, d_mds)
#> [1] 0.9296938
knitr::kable(
  tibble(
    comparison = c("Original vs PCA(2D)", "Original vs MDS(2D)"),
    correlation = round(c(cor(d_orig, d_pca), cor(d_orig, d_mds)), 4)
  ),
  caption = "Distance preservation"
)
| comparison | correlation |
|---|---|
| Original vs PCA(2D) | 0.9297 |
| Original vs MDS(2D) | 0.9297 |
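To formally assess the similarity between PCA and MDS embeddings, pairwise distances in the original standardized space are compared with distances in the reduced two-dimensional representations. For both PCA and classical MDS the correlation with the original distances is 0.93, indicating that the two-dimensional linear projection retains most of the distance structure.
Furthermore, Procrustes analysis reveals essentially perfect alignment between the PCA and MDS configurations: the sum of squares is on the order of 1e-16, and the two configurations differ only by a reflection of the second axis. This agreement is expected by construction, because classical MDS applied to Euclidean distances is mathematically equivalent to PCA. It confirms the consistency of the two computations rather than providing independent evidence that the factor space is linear, so whether nonlinear manifold-learning techniques would add insight is not settled by this comparison.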
proc <- vegan::procrustes(
  as.matrix(scores[, c("PC1","PC2")]),
  as.matrix(mds_cmd),
  symmetric = TRUE
)
plot(proc)
summary(proc)
#>
#> Call:
#> vegan::procrustes(X = as.matrix(scores[, c("PC1", "PC2")]), Y = as.matrix(mds_cmd), symmetric = TRUE)
#>
#> Number of objects: 749 Number of dimensions: 2
#>
#> Procrustes sum of squares:
#> 6.661338e-16
#> Procrustes root mean squared error:
#> 9.430611e-10
#> Quantiles of Procrustes errors:
#> Min 1Q Median 3Q Max
#> 8.673617e-19 3.265015e-17 5.583589e-17 8.626653e-17 1.112297e-15
#>
#> Rotation matrix:
#> [,1] [,2]
#> [1,] 1.000000e+00 -3.746221e-15
#> [2,] -3.746221e-15 -1.000000e+00
#>
#> Translation of averages:
#> [,1] [,2]
#> [1,] 5.387559e-19 2.378681e-19
#>
#> Scaling of target:
#> [1] 1
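The identical distance correlations and the near-zero Procrustes sum of squares reported above follow from a general property: classical MDS on Euclidean distances reproduces PCA scores up to axis reflections. The following base R sketch illustrates this on simulated data; the data and the names `pca_scores` and `mds_coords` are illustrative and not part of the analysis.

```r
# Illustrative only: simulated, standardized data with five variables.
set.seed(7)
sim <- scale(matrix(rnorm(100 * 5), ncol = 5))

pca_scores <- prcomp(sim)$x[, 1:2]          # first two principal component scores
mds_coords <- cmdscale(dist(sim), k = 2)    # classical MDS on Euclidean distances

# Axes are identified only up to sign, so compare absolute coordinates.
max_abs_diff <- max(abs(abs(pca_scores) - abs(mds_coords)))

# Both embeddings therefore reproduce the original distances equally well.
d_orig  <- as.vector(dist(sim))
cor_pca <- cor(d_orig, as.vector(dist(pca_scores)))
cor_mds <- cor(d_orig, as.vector(dist(mds_coords)))

print(max_abs_diff)
print(c(pca = cor_pca, mds = cor_mds))
```

A comparison between PCA and a genuinely different embedding, such as a stress-based or non-metric MDS solution, would be needed for the two methods to provide independent views of the data.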
rot_pca <- psych::principal(X_scaled, nfactors = 3, rotate = "varimax")
print(rot_pca$loadings, cutoff = 0.3)
#>
#> Loadings:
#>        RC1    RC3    RC2
#> Mkt-RF         0.871
#> SMB            0.609 -0.541
#> HML     0.931
#> RMW                   0.939
#> CMA     0.875
#>
#>                  RC1   RC3   RC2
#> SS loadings    1.709 1.202 1.187
#> Proportion Var 0.342 0.240 0.237
#> Cumulative Var 0.342 0.582 0.820
To enhance interpretability, a varimax rotation is applied to the first three principal components. Rotation does not alter the explanatory power of the solution but redistributes variance across components in a way that promotes sparsity in loadings.
The rotated solution yields components that align closely with the theoretical structure of the Fama–French model. Each rotated factor loads predominantly on a small subset of original variables, reinforcing the interpretation of the latent dimensions as economically distinct sources of risk rather than statistical artifacts.
This result provides further evidence that the observed correlation structure arises from a limited number of underlying economic mechanisms.
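The claim that rotation redistributes variance without changing the total can be checked numerically. The sketch below uses base R's `stats::varimax()` on simulated data instead of `psych::principal()`, purely to keep the example dependency-free; the simulated block structure and all object names are illustrative assumptions.

```r
# Illustrative only: five simulated variables, two correlated pairs plus noise.
set.seed(3)
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
sim <- scale(cbind(
  a = f1 + 0.5 * rnorm(n),
  b = f1 + 0.5 * rnorm(n),
  c = f2 + 0.5 * rnorm(n),
  d = f2 + 0.5 * rnorm(n),
  e = rnorm(n)
))

pca <- prcomp(sim)
k <- 3
# Scale eigenvectors by component standard deviations to obtain loadings
# (correlations between variables and components).
raw_loadings <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])

rot <- varimax(raw_loadings)
rot_loadings <- unclass(rot$loadings)

# Communalities (row sums of squared loadings) are unchanged by rotation,
# so the total explained variance is unchanged; only the split across
# components moves.
communality_diff <- max(abs(rowSums(raw_loadings^2) - rowSums(rot_loadings^2)))
total_raw <- sum(raw_loadings^2)
total_rot <- sum(rot_loadings^2)

print(round(colSums(raw_loadings^2), 3))   # variance per unrotated component
print(round(colSums(rot_loadings^2), 3))   # variance per rotated component
print(c(total_raw = total_raw, total_rot = total_rot))
```

The column sums correspond to the "SS loadings" row of a rotated solution: they change under rotation, while their total stays equal to the variance captured by the retained components.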
Inspection of the rotated loadings allows for an economically meaningful interpretation of the components. The first rotated component (RC1) loads heavily on the value (HML, 0.931) and investment (CMA, 0.875) factors, indicating a latent dimension related to firms’ balance sheet structure and long-term investment behavior. This component can be interpreted as a Value–Investment axis.
The second rotated component (RC2) is dominated by profitability (RMW, 0.939), with a secondary negative loading on size (SMB, -0.541), capturing variation related to firms’ operating efficiency and earnings quality. This suggests that profitability represents a distinct and relatively orthogonal source of systematic risk.
The third rotated component (RC3) loads on the market (Mkt–RF, 0.871) and size (SMB, 0.609) factors, reflecting a Market–Size dimension that aligns with traditional beta-driven risk exposure and firm capitalization effects.
Despite the relatively small number of variables, dimensionality reduction proves to be both statistically justified and economically meaningful in the context of the Fama–French five-factor model. The correlations among the factors, most notably between HML and CMA, mean that three components capture about 82% of total variance, so the effective dimensionality of the system is lower than five.
PCA and classical MDS yield identical geometric representations, as expected for Euclidean distances, and the two-dimensional embedding retains most of the pairwise distance structure (correlation of 0.93 with the original distances). Rotated PCA further enhances interpretability, revealing latent dimensions that closely correspond to theoretical constructs such as value, profitability, and market exposure.
Overall, the results suggest that the Fama–French factor system can be understood as a low-dimensional representation of a small number of fundamental sources of systematic risk.