Fama–French 5 Factors: PCA & MDS Geometry (Monthly)

0.1 Introduction

The Fama–French five-factor model constitutes one of the most influential empirical frameworks in modern asset pricing, extending the traditional CAPM by incorporating size, value, profitability, and investment-related risk factors. While the individual economic interpretations of these factors are well established in the literature, less attention is typically devoted to the joint correlation structure and latent geometry of the factor space itself.

This project aims to explore the multivariate dependence structure of the monthly Fama–French five factors using dimensionality reduction techniques. Rather than focusing on predictive performance or asset pricing tests, the analysis emphasizes the internal geometry of the factor space and investigates whether the observed factors can be represented by a smaller number of latent dimensions without substantial loss of information.

To this end, Principal Component Analysis (PCA), rotated PCA, and Multidimensional Scaling (MDS) are employed and systematically compared. Particular attention is paid to distance preservation, interpretability of latent components, and the degree to which linear methods adequately capture the structure of the data.

0.2 Data Description: Fama–French Five-Factor Model

The empirical analysis is based on the Fama–French Five-Factor model using monthly data. The dataset originates from the publicly available data library (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) maintained by Kenneth R. French and is widely used in empirical asset pricing research.

0.2.1 Factor Definitions

The following five risk factors are employed in the analysis:

Mkt–RF (Market Excess Return)
The excess return on the value-weighted market portfolio over the risk-free rate. This factor captures systematic market risk and represents the traditional CAPM component.
SMB (Small Minus Big)
The return differential between portfolios of small-cap and large-cap firms, designed to proxy for the size effect observed in equity markets.
HML (High Minus Low)
The return differential between high book-to-market (value) and low book-to-market (growth) firms, capturing the value premium.
RMW (Robust Minus Weak)
The return differential between firms with robust and weak operating profitability, reflecting the profitability effect in asset returns.
CMA (Conservative Minus Aggressive)
The return differential between firms that invest conservatively versus aggressively, capturing the investment effect.

The risk-free rate (RF) is excluded from the analysis, as the focus is placed on the joint structure and dependence between the systematic risk factors themselves rather than on excess-return construction.

0.2.2 Data Frequency and Sample Size

The dataset consists of 749 monthly observations, from July 1963 to November 2025, covering more than six decades of U.S. equity market history. Monthly frequency is chosen to balance noise reduction with sufficient time-series length for reliable multivariate analysis.

0.2.3 Data Preprocessing

All factor return series are transformed as follows:

Returns are expressed in percentage terms.
Each factor is standardized using z-scores, i.e. centered to zero mean and scaled to unit variance.

Standardization ensures that the subsequent distance-based and variance-based methods (PCA and MDS) are not dominated by differences in scale or volatility across factors. As a result, the analysis focuses purely on correlation structure and geometric relationships among the factors.
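The effect of standardization can be illustrated with a minimal sketch. The data below are synthetic stand-ins with deliberately different volatilities (not the Fama–French series), and the object names `toy` and `toy_z` are hypothetical. The point is only that, after z-scoring, the covariance matrix coincides with the correlation matrix.

```r
# Synthetic stand-in for factor returns (NOT the Fama-French data):
# three series with deliberately different volatilities.
set.seed(1)
n <- 500
common <- rnorm(n)
toy <- cbind(
  A = 5.0 * common + rnorm(n),            # high-volatility series
  B = 0.5 * common + rnorm(n, sd = 0.5),  # low-volatility series, same driver
  C = rnorm(n, sd = 2)                    # unrelated series
)

toy_z <- scale(toy)                       # center to mean 0, scale to sd 1

round(colMeans(toy_z), 10)                # zero up to floating-point error
apply(toy_z, 2, sd)                       # one for every column

# After standardization the covariance matrix equals the correlation matrix,
# so variance-based methods see only the correlation structure.
all.equal(cov(toy_z), cor(toy), check.attributes = FALSE)
```

Without the call to `scale()`, series A would dominate both the leading principal component and the Euclidean distances simply because of its larger variance.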

0.2.4 Analytical Scope

Despite the relatively small number of variables, dimensionality reduction is well motivated due to the presence of strong and economically meaningful correlations among the factors. The objective is not variable reduction per se, but rather to:

uncover latent dimensions driving common variation,
assess whether the factor space exhibits an approximately linear structure,
compare variance-preserving (PCA) and distance-preserving (MDS) representations.

This approach aligns with the broader asset pricing literature, where factor interdependence and redundancy are known to play a crucial role in model interpretation and factor construction.

0.3 1. Environment

0.4 Data Description

The analysis is based on monthly observations of the Fama–French five factors obtained from Kenneth French’s data library. The sample comprises 749 monthly observations (July 1963 to November 2025), a time horizon long enough to study the stable dependence structure between factors.

The five analyzed factors include the excess market return (Mkt–RF), size (SMB), value (HML), profitability (RMW), and investment (CMA). The risk-free rate is excluded, as the focus lies on the joint behavior of systematic risk factors rather than excess return construction.

All factor series are standardized prior to analysis to ensure comparability and to prevent scale differences from influencing distance-based methods. Standardization is particularly important in the context of PCA and MDS, where the geometry of the data is directly affected by variable scaling.

library(tidyverse)
library(lubridate)
library(factoextra)
library(gridExtra)
library(smacof)
library(vegan)
library(psych)
library(tibble)

0.5 2. Data loading and cleaning (FF5 monthly)

ff5_raw <- readr::read_csv("FF5_monthly.csv", show_col_types = FALSE)

names(ff5_raw)
#> [1] "...1"   "Mkt-RF" "SMB"    "HML"    "RMW"    "CMA"    "RF"
head(ff5_raw, 5)

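ff5 <- ff5_raw %>%
  rename(DATE = 1) %>%                            # first column holds the YYYYMM stamp
  mutate(DATE = as.character(DATE)) %>%
  filter(str_detect(DATE, "^[0-9]{6}$")) %>%      # keep only rows with a six-digit date
  mutate(date = ymd(paste0(DATE, "01"))) %>%      # date each month by its first day
  select(date, `Mkt-RF`, SMB, HML, RMW, CMA) %>%  # drop RF and the raw DATE column
  arrange(date) %>%
  mutate(across(-date, as.numeric))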

glimpse(ff5)
#> Rows: 749
#> Columns: 6
#> $ date     <date> 1963-07-01, 1963-08-01, 1963-09-01, 1963-10-01, 1963-11-01, …
#> $ `Mkt-RF` <dbl> -0.39, 5.08, -1.57, 2.54, -0.86, 1.83, 2.27, 1.55, 1.41, 0.11…
#> $ SMB      <dbl> -0.48, -0.80, -0.43, -1.34, -0.85, -1.89, 0.10, 0.33, 1.41, -…
#> $ HML      <dbl> -0.81, 1.70, 0.00, -0.04, 1.73, -0.21, 1.63, 2.81, 3.29, -0.5…
#> $ RMW      <dbl> 0.64, 0.40, -0.78, 2.79, -0.43, 0.12, 0.21, 0.11, -2.03, -1.3…
#> $ CMA      <dbl> -1.15, -0.38, 0.15, -2.25, 2.27, -0.25, 1.48, 0.81, 2.98, -1.…
range(ff5$date)
#> [1] "1963-07-01" "2025-11-01"
summary(ff5)
#>       date                Mkt-RF              SMB                HML          
#>  Min.   :1963-07-01   Min.   :-23.1900   Min.   :-15.5400   Min.   :-13.8300  
#>  1st Qu.:1979-02-01   1st Qu.: -1.9600   1st Qu.: -1.5800   1st Qu.: -1.4400  
#>  Median :1994-09-01   Median :  1.0200   Median :  0.0200   Median :  0.2000  
#>  Mean   :1994-08-31   Mean   :  0.5956   Mean   :  0.1801   Mean   :  0.2837  
#>  3rd Qu.:2010-04-01   3rd Qu.:  3.4200   3rd Qu.:  1.9400   3rd Qu.:  1.7300  
#>  Max.   :2025-11-01   Max.   : 16.1000   Max.   : 18.4600   Max.   : 12.8600  
#>       RMW                CMA         
#>  Min.   :-18.9500   Min.   :-7.0800  
#>  1st Qu.: -0.8500   1st Qu.:-1.0400  
#>  Median :  0.2500   Median : 0.0900  
#>  Mean   :  0.2628   Mean   : 0.2407  
#>  3rd Qu.:  1.3100   3rd Qu.: 1.4900  
#>  Max.   : 13.0500   Max.   : 9.0100

diag_tbl <- tibble(
  start_date = min(ff5$date),
  end_date   = max(ff5$date),
  n_months   = nrow(ff5),
  n_factors  = ncol(ff5) - 1
)
knitr::kable(diag_tbl, caption = "Dataset diagnostics")

Dataset diagnostics
start_date	end_date	n_months	n_factors
1963-07-01	2025-11-01	749	5

0.6 3. Correlation structure

X <- ff5 %>% select(`Mkt-RF`, SMB, HML, RMW, CMA)

corr_mat <- cor(X, use = "pairwise.complete.obs")
round(corr_mat, 2)
#>        Mkt-RF   SMB   HML   RMW   CMA
#> Mkt-RF   1.00  0.28 -0.21 -0.19 -0.35
#> SMB      0.28  1.00  0.01 -0.34 -0.08
#> HML     -0.21  0.01  1.00  0.09  0.68
#> RMW     -0.19 -0.34  0.09  1.00  0.00
#> CMA     -0.35 -0.08  0.68  0.00  1.00

knitr::kable(round(corr_mat, 2), caption = "Correlation matrix")

Correlation matrix
	Mkt-RF	SMB	HML	RMW	CMA
Mkt-RF	1.00	0.28	-0.21	-0.19	-0.35
SMB	0.28	1.00	0.01	-0.34	-0.08
HML	-0.21	0.01	1.00	0.09	0.68
RMW	-0.19	-0.34	0.09	1.00	0.00
CMA	-0.35	-0.08	0.68	0.00	1.00

corr_df <- as.data.frame(corr_mat) %>%
  rownames_to_column("factor1") %>%
  pivot_longer(-factor1, names_to = "factor2", values_to = "corr")

ggplot(corr_df, aes(factor1, factor2, fill = corr)) +
  geom_tile() +
  geom_text(aes(label = round(corr, 2)), size = 4) +
  scale_fill_gradient2(limits = c(-1, 1)) +
  coord_equal() +
  labs(title = "Correlation heatmap", fill = "Corr") +
  theme_minimal()

corr_with_mkt <- sort(corr_mat[, "Mkt-RF"], decreasing = TRUE)
round(corr_with_mkt, 2)
#> Mkt-RF    SMB    RMW    HML    CMA 
#>   1.00   0.28  -0.19  -0.21  -0.35

0.7 Correlation Structure of the Fama–French Factors

The correlation matrix reveals structured but uneven interdependencies among the Fama–French factors, which motivates dimensionality reduction. The strongest link is between the value factor (HML) and the investment factor (CMA), with a correlation of 0.68, consistent with theoretical interpretations linking value firms to conservative investment behavior. All other pairwise correlations are at most 0.35 in absolute value.

The market factor (Mkt–RF) is moderately positively correlated with size (SMB, 0.28), reflecting the empirical tendency of small-cap stocks to carry higher market beta, and negatively correlated with CMA (−0.35). Profitability (RMW) is essentially uncorrelated with HML (0.09) and CMA (0.00) but negatively related to SMB (−0.34), suggesting that it captures a dimension largely separate from the value–investment pair.

The presence of such structured correlations indicates that the five-factor system may be effectively represented by a smaller number of latent dimensions without substantial loss of information.

0.8 4. PCA analysis

X_scaled <- scale(X)
pca <- prcomp(X_scaled)

eig <- pca$sdev^2
prop <- eig / sum(eig)

pca_var_tbl <- tibble(
  PC = paste0("PC", 1:5),
  eigenvalue = eig,
  prop_var = prop,
  cum_var = cumsum(prop)
)

pca_var_tbl

knitr::kable(
  pca_var_tbl %>% mutate(across(where(is.numeric), ~ round(.x, 4))),
  caption = "PCA: eigenvalues and explained variance"
)

PCA: eigenvalues and explained variance
PC	eigenvalue	prop_var	cum_var
PC1	1.9480	0.3896	0.3896
PC2	1.3776	0.2755	0.6651
PC3	0.7727	0.1545	0.8197
PC4	0.6201	0.1240	0.9437
PC5	0.2816	0.0563	1.0000

fviz_eig(pca, main = "Scree plot: variance explained by PCs")

loadings <- pca$rotation[, 1:3] %>%
  as.data.frame() %>%
  rownames_to_column("Factor")

knitr::kable(
  loadings %>% mutate(across(-Factor, ~ round(.x, 3))),
  caption = "PCA loadings (PC1–PC3)"
)

PCA loadings (PC1–PC3)
Factor	PC1	PC2	PC3
Mkt-RF	0.467	0.214	0.654
SMB	0.269	0.612	0.161
HML	-0.547	0.374	0.357
RMW	-0.245	-0.571	0.647
CMA	-0.592	0.337	-0.009

0.9 Interpretation of Principal Components

Inspection of the loadings allows an economic interpretation of the extracted components, although the unrotated components are contrasts between factors rather than clean single-factor axes. The first principal component loads most heavily on investment (CMA, −0.592) and value (HML, −0.547), with the market factor entering with the opposite sign (0.467). It can be interpreted as a Value–Investment axis set against market exposure; the overall sign of a component is arbitrary.

The second component is a contrast between size (SMB, 0.612) and profitability (RMW, −0.571), with smaller same-signed loadings on HML (0.374) and CMA (0.337). Profitability therefore does not define this component on its own; it is isolated as a separate dimension only after rotation (Section 8).

The third component loads almost equally on the market (Mkt–RF, 0.654) and profitability (RMW, 0.647), while size contributes little (SMB, 0.161). A Market–Size dimension aligned with traditional beta-driven risk exposure and firm capitalization effects is not visible in the unrotated solution and emerges only after the varimax rotation.
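As a rough check on which factors drive each component, the loadings reported in the table above can be converted into percentage contributions. The sketch below copies the three-decimal loadings by hand, so small rounding differences are expected; `load_tbl`, `contrib_pct` and `dominant` are hypothetical names. Because each eigenvector has unit length, `100 * loading^2` is the share of a component attributable to each factor, which should correspond to the quantity shown in the contribution plots.

```r
# Loadings copied from the "PCA loadings (PC1-PC3)" table (3 decimals).
load_tbl <- matrix(
  c( 0.467,  0.214,  0.654,
     0.269,  0.612,  0.161,
    -0.547,  0.374,  0.357,
    -0.245, -0.571,  0.647,
    -0.592,  0.337, -0.009),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("Mkt-RF", "SMB", "HML", "RMW", "CMA"),
                  c("PC1", "PC2", "PC3"))
)

# Unit-length eigenvectors: squared loadings sum to ~1 in each column,
# so 100 * loading^2 is the percentage contribution of a factor to a PC.
colSums(load_tbl^2)
contrib_pct <- round(100 * load_tbl^2, 1)
contrib_pct

# Factor with the largest absolute loading on each component
dominant <- apply(abs(load_tbl), 2, function(v) names(which.max(v)))
dominant
```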

gridExtra::grid.arrange(
  fviz_contrib(pca, choice = "var", axes = 1, title = "Contribution to PC1"),
  fviz_contrib(pca, choice = "var", axes = 2, title = "Contribution to PC2"),
  fviz_contrib(pca, choice = "var", axes = 3, title = "Contribution to PC3"),
  ncol = 1
)

scores <- as_tibble(pca$x) %>%
  mutate(date = ff5$date)

ggplot(scores, aes(PC1, PC2)) +
  geom_point(alpha = 0.5) +
  labs(title = "Months in PC space (PC1 vs PC2)") +
  theme_minimal()

fviz_pca_biplot(
  pca,
  axes = c(1, 2),
  geom.ind = "point",
  alpha.ind = 0.5,
  col.ind = "grey40",
  col.var = "steelblue",
  repel = TRUE
) +
  labs(title = "PCA biplot: PC1 vs PC2")

0.10 Principal Component Analysis

Principal Component Analysis is employed to identify orthogonal linear combinations of the original factors that explain the maximum variance in the data. Given the strong correlation structure observed earlier, PCA serves as a natural first step in uncovering latent dimensions underlying the factor space.

The eigenvalue decomposition shows that the first three principal components jointly explain about 82% of total variance (66.5% for the first two alone). This result suggests that the effective dimensionality of the Fama–French five-factor system is lower than the nominal number of factors, although the fourth component still accounts for a further 12.4%.

From an economic perspective, this finding implies that multiple observed risk factors may be manifestations of a smaller number of fundamental sources of systematic risk.
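Because the factors are standardized, PCA here amounts to an eigen-decomposition of the correlation matrix. The following sketch reconstructs the variance table from the correlation matrix printed earlier. The matrix is typed in from the two-decimal output, so the eigenvalues agree with the PCA table only approximately. The eigenvalue-greater-than-one count at the end is a common heuristic added for illustration, not something used elsewhere in this analysis.

```r
# Correlation matrix copied from the output above (rounded to 2 decimals),
# so the eigenvalues match the PCA table only approximately.
fac <- c("Mkt-RF", "SMB", "HML", "RMW", "CMA")
R <- matrix(
  c( 1.00,  0.28, -0.21, -0.19, -0.35,
     0.28,  1.00,  0.01, -0.34, -0.08,
    -0.21,  0.01,  1.00,  0.09,  0.68,
    -0.19, -0.34,  0.09,  1.00,  0.00,
    -0.35, -0.08,  0.68,  0.00,  1.00),
  nrow = 5, byrow = TRUE, dimnames = list(fac, fac)
)

ev <- eigen(R, symmetric = TRUE)$values   # returned in decreasing order
prop_var <- ev / sum(ev)                  # sum(ev) equals 5, the trace of R

data.frame(
  PC         = paste0("PC", 1:5),
  eigenvalue = round(ev, 3),
  cum_var    = round(cumsum(prop_var), 3)
)

sum(ev > 1)   # components with eigenvalue above 1 (illustrative heuristic)
```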

0.11 5. Multidimensional Scaling (MDS)

dist_mat <- dist(scale(X))
mds_cmd <- cmdscale(dist_mat, k = 2)

mds_df <- tibble(
  Dim1 = mds_cmd[, 1],
  Dim2 = mds_cmd[, 2]
)

ggplot(mds_df, aes(Dim1, Dim2)) +
  geom_point(alpha = 0.5, color = "grey40") +
  labs(title = "Classical MDS (2D)") +
  theme_minimal()

mds_smacof <- smacof::mds(dist_mat, ndim = 2, type = "ratio")
mds_smacof$stress
#> [1] 0.1650079

knitr::kable(
  tibble(stress_2D = round(mds_smacof$stress, 4)),
  caption = "SMACOF MDS stress (2D)"
)

SMACOF MDS stress (2D)
stress_2D
0.165

plot(mds_smacof, main = "SMACOF MDS (stress-based)")

stress_curve <- tibble(
  ndim = 1:5,
  stress = map_dbl(1:5, ~ smacof::mds(dist_mat, ndim = .x, type = "ratio")$stress)
)

knitr::kable(
  stress_curve %>% mutate(stress = round(stress, 4)),
  caption = "Stress vs number of dimensions"
)

Stress vs number of dimensions
ndim	stress
1	0.3527
2	0.1650
3	0.0956
4	0.0383
5	0.0000

ggplot(stress_curve, aes(ndim, stress)) +
  geom_line() +
  geom_point(size = 2) +
  labs(title = "SMACOF MDS: stress curve") +
  theme_minimal()

0.12 Multidimensional Scaling

Multidimensional Scaling offers a complementary, distance-based perspective on the structure of the data. Unlike PCA, which is variance-oriented, MDS focuses explicitly on preserving pairwise distances between observations.

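Using Euclidean distances computed on standardized factors, both classical MDS and stress-based SMACOF MDS are applied. The two-dimensional SMACOF solution has a stress of 0.165. Under Kruskal's conventional guidelines (roughly 0.20 poor, 0.10 fair, 0.05 good), this indicates a moderate rather than a high-quality fit; three dimensions (stress 0.096) are needed to reach the fair range.

The stress curve falls most steeply between one and two dimensions (0.353 to 0.165) and declines more gradually thereafter (0.096 and 0.038 for three and four dimensions), so the elbow around two to three dimensions is mild rather than sharp.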
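To make the stress criterion concrete, the sketch below computes a simplified Kruskal stress-1 for classical MDS configurations of increasing dimension. It runs on synthetic data (not the Fama–French sample), the helper `stress1()` is hypothetical, and it omits the rescaling of dissimilarities that SMACOF performs, so its values are not expected to reproduce those of `smacof::mds()`.

```r
# Synthetic data (NOT the Fama-French sample): 200 points in 5 dimensions,
# mixed through a random matrix so that the columns are correlated.
set.seed(42)
Z <- scale(matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25, -1, 1), 5, 5))
d_full <- dist(Z)

# Simplified Kruskal stress-1 between target dissimilarities and the
# distances of a configuration; no optimal rescaling is applied, so the
# values need not coincide with those reported by smacof::mds().
stress1 <- function(d_target, config) {
  d_conf <- dist(config)
  sqrt(sum((d_target - d_conf)^2) / sum(d_conf^2))
}

# Classical MDS configurations with 1 to 5 dimensions
stress_by_dim <- sapply(1:5, function(k) stress1(d_full, cmdscale(d_full, k = k)))
round(stress_by_dim, 4)
```

With five dimensions the configuration reproduces the original distances and the stress drops to zero up to floating-point error, mirroring the last row of the stress table.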

0.13 6. PCA vs MDS: geometry comparison

d_orig <- as.vector(dist(scale(X)))
d_pca  <- as.vector(dist(as.matrix(scores[, c("PC1","PC2")])))
d_mds  <- as.vector(dist(mds_cmd))

cor(d_orig, d_pca)
#> [1] 0.9296938
cor(d_orig, d_mds)
#> [1] 0.9296938

knitr::kable(
  tibble(
    comparison = c("Original vs PCA(2D)", "Original vs MDS(2D)"),
    correlation = round(c(cor(d_orig, d_pca), cor(d_orig, d_mds)), 4)
  ),
  caption = "Distance preservation"
)

Distance preservation
comparison	correlation
Original vs PCA(2D)	0.9297
Original vs MDS(2D)	0.9297

0.14 Comparison of PCA and MDS Representations

To assess how well the two-dimensional embeddings retain the original geometry, pairwise distances in the standardized five-dimensional space are correlated with distances in each reduced representation. Both PCA and classical MDS give a correlation of 0.93. The two values coincide to every reported digit because classical MDS applied to Euclidean distances is mathematically equivalent to PCA: the two configurations differ at most by a reflection or rotation of the axes.

The Procrustes analysis in Section 7 confirms this: the Procrustes sum of squares is numerically zero and the fitted rotation is a reflection of the second axis. This alignment holds by construction and should not be read as independent evidence that the factor space is linear. The view that nonlinear manifold-learning techniques are unlikely to provide additional insight rests instead on the high distance correlation achieved by the linear two-dimensional projection.
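The relationship between PCA and classical MDS on Euclidean distances can be checked directly on a small example. The sketch below uses synthetic correlated data (not the Fama–French sample) and hypothetical object names. It compares the first two principal component scores with a two-dimensional `cmdscale()` solution.

```r
# Synthetic correlated data (NOT the Fama-French sample): two latent drivers.
set.seed(7)
n <- 300
f <- rnorm(n)
g <- rnorm(n)
toy <- cbind(
   f + rnorm(n, sd = 0.5),
   f + rnorm(n, sd = 0.5),
  -f + rnorm(n, sd = 0.5),
   g + rnorm(n, sd = 0.5),
   g + rnorm(n, sd = 0.5)
)
Zs <- scale(toy)

pca_xy <- prcomp(Zs)$x[, 1:2]         # first two principal component scores
mds_xy <- cmdscale(dist(Zs), k = 2)   # classical (Torgerson) MDS

# Coordinates agree up to a sign flip of each axis ...
all.equal(abs(unname(pca_xy)), abs(unname(mds_xy)), tolerance = 1e-8)

# ... so pairwise distances in the two embeddings are the same, and any
# distance-preservation score is necessarily identical for both methods.
all.equal(as.vector(dist(pca_xy)), as.vector(dist(mds_xy)), tolerance = 1e-8)
```

Stress-based MDS such as SMACOF optimizes a different criterion and is not covered by this equivalence, so its configuration can differ from the PCA scores.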

0.15 7. Procrustes analysis

proc <- vegan::procrustes(
  as.matrix(scores[, c("PC1","PC2")]),
  as.matrix(mds_cmd),
  symmetric = TRUE
)

plot(proc)

summary(proc)
#> 
#> Call:
#> vegan::procrustes(X = as.matrix(scores[, c("PC1", "PC2")]), Y = as.matrix(mds_cmd),      symmetric = TRUE) 
#> 
#> Number of objects: 749    Number of dimensions: 2 
#> 
#> Procrustes sum of squares:  
#>  6.661338e-16 
#> Procrustes root mean squared error: 
#>  9.430611e-10 
#> Quantiles of Procrustes errors:
#>          Min           1Q       Median           3Q          Max 
#> 8.673617e-19 3.265015e-17 5.583589e-17 8.626653e-17 1.112297e-15 
#> 
#> Rotation matrix:
#>               [,1]          [,2]
#> [1,]  1.000000e+00 -3.746221e-15
#> [2,] -3.746221e-15 -1.000000e+00
#> 
#> Translation of averages:
#>              [,1]         [,2]
#> [1,] 5.387559e-19 2.378681e-19
#> 
#> Scaling of target:
#> [1] 1

0.16 8. Rotated PCA (Varimax)

rot_pca <- psych::principal(X_scaled, nfactors = 3, rotate = "varimax")
print(rot_pca$loadings, cutoff = 0.3)
#> 
#> Loadings:
#>        RC1    RC3    RC2   
#> Mkt-RF         0.871       
#> SMB            0.609 -0.541
#> HML     0.931              
#> RMW                   0.939
#> CMA     0.875              
#> 
#>                  RC1   RC3   RC2
#> SS loadings    1.709 1.202 1.187
#> Proportion Var 0.342 0.240 0.237
#> Cumulative Var 0.342 0.582 0.820

0.17 Rotated Principal Components

To enhance interpretability, a varimax rotation is applied to the first three principal components. Rotation does not alter the total explanatory power of the solution (82.0% cumulatively) but redistributes variance more evenly across components (34.2%, 24.0% and 23.7%) in a way that promotes sparsity in loadings.

The rotated solution yields components that align closely with the theoretical structure of the Fama–French model. RC1 loads on value (HML, 0.931) and investment (CMA, 0.875), RC3 on the market (Mkt–RF, 0.871) and size (SMB, 0.609), and RC2 on profitability (RMW, 0.939) with a negative cross-loading from size (SMB, −0.541). Apart from SMB, each original factor loads on a single rotated component, which supports reading the latent dimensions as economically distinct sources of risk.
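The claim that rotation redistributes but does not change explained variance can be illustrated with base R. The sketch below rebuilds approximate component loadings from the rounded eigenvalues and eigenvectors reported earlier and applies `stats::varimax()`. This is an approximation of the `psych::principal()` output, which may order and sign the components differently. The names `V`, `L_unrot` and `L_rot` are hypothetical.

```r
# Unrotated PCA output copied from the tables above (rounded values).
V <- matrix(
  c( 0.467,  0.214,  0.654,
     0.269,  0.612,  0.161,
    -0.547,  0.374,  0.357,
    -0.245, -0.571,  0.647,
    -0.592,  0.337, -0.009),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("Mkt-RF", "SMB", "HML", "RMW", "CMA"),
                  c("PC1", "PC2", "PC3"))
)
eig3 <- c(1.9480, 1.3776, 0.7727)

# Component loadings = eigenvectors scaled by sqrt(eigenvalue)
L_unrot <- sweep(V, 2, sqrt(eig3), `*`)

# Base-R varimax (Kaiser-normalized by default); ordering and signs may
# differ from psych::principal(), so treat this as an approximation.
rot   <- stats::varimax(L_unrot)
L_rot <- unclass(rot$loadings)
round(L_rot, 3)

# An orthogonal rotation leaves each factor's communality unchanged,
# and hence the total variance explained by the three components.
all.equal(rowSums(L_unrot^2), rowSums(L_rot^2))
sum(L_rot^2) / 5   # share of total variance, about 0.82
```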

This result provides further evidence that the observed correlation structure arises from a limited number of underlying economic mechanisms.

0.19 9. Conclusions

Despite the relatively small number of variables, dimensionality reduction proves to be both statistically justified and economically meaningful in the context of the Fama–French five-factor model. Three components capture about 82% of total variance, so the effective dimensionality of the system is lower than five, driven chiefly by the strong HML–CMA correlation.

PCA and classical MDS yield the same two-dimensional configuration up to a reflection, as expected for Euclidean distances, and distances in that plane correlate at 0.93 with those in the full standardized space. Rotated PCA further enhances interpretability, revealing latent dimensions that correspond to value and investment, market and size, and profitability.

Overall, the results suggest that the Fama–French factor system can be understood as a low-dimensional representation of a small number of fundamental sources of systematic risk.