Introduction

The Fama-French five-factor model constitutes one of the most influential empirical frameworks in modern asset pricing, extending the traditional CAPM by incorporating size, value, profitability, and investment-related risk factors. While the individual economic interpretations of these factors are well established in the literature, less attention is typically devoted to the joint correlation structure and latent geometry of the factor space itself.

This project aims to explore the multivariate dependence structure of the monthly Fama-French five factors using dimensionality reduction techniques. Rather than focusing on predictive performance or asset pricing tests, the analysis emphasizes the internal geometry of the factor space and investigates whether the observed factors can be represented by a smaller number of latent dimensions without substantial loss of information.

To this end, Principal Component Analysis (PCA), rotated PCA, and Multidimensional Scaling (MDS) are employed and systematically compared. Particular attention is paid to distance preservation, interpretability of latent components, and the degree to which linear methods adequately capture the structure of the data.

Data Description: Fama-French Five-Factor Model

The empirical analysis is based on the Fama-French Five-Factor model using monthly data. The dataset originates from the publicly available data library (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) maintained by Kenneth R. French and is widely used in empirical asset pricing research.

Factor Definitions

The following five risk factors are employed in the analysis:

  • Mkt-RF (Market Excess Return)
    The excess return on the value-weighted market portfolio over the risk-free rate. This factor captures systematic market risk and represents the traditional CAPM component.

  • SMB (Small Minus Big)
    The return differential between portfolios of small-cap and large-cap firms, designed to proxy for the size effect observed in equity markets.

  • HML (High Minus Low)
    The return differential between high book-to-market (value) and low book-to-market (growth) firms, capturing the value premium.

  • RMW (Robust Minus Weak)
    The return differential between firms with robust and weak operating profitability, reflecting the profitability effect in asset returns.

  • CMA (Conservative Minus Aggressive)
    The return differential between firms that invest conservatively versus aggressively, capturing the investment effect.

The risk-free rate (RF) is excluded from the analysis, as the focus is placed on the joint structure and dependence between the systematic risk factors themselves rather than on excess-return construction.

Data Frequency and Sample Size

The dataset consists of approximately 750 monthly observations, covering several decades of U.S. equity market history. Monthly frequency is chosen to balance noise reduction with sufficient time-series length for reliable multivariate analysis.

Data Preprocessing

All factor return series are transformed as follows:

  • Returns are expressed in percentage terms.
  • Each factor is standardized using z-scores, i.e. centered to zero mean and scaled to unit variance.

Standardization ensures that the subsequent distance-based and variance-based methods (PCA and MDS) are not dominated by differences in scale or volatility across factors. As a result, the analysis focuses purely on correlation structure and geometric relationships among the factors.

Analytical Scope

Despite the relatively small number of variables, dimensionality reduction is well motivated due to the presence of strong and economically meaningful correlations among the factors. The objective is not variable reduction per se, but rather to:

  • uncover latent dimensions driving common variation,
  • assess whether the factor space exhibits an approximately linear structure,
  • compare variance-preserving (PCA) and distance-preserving (MDS) representations.

This approach aligns with the broader asset pricing literature, where factor interdependence and redundancy are known to play a crucial role in model interpretation and factor construction.

1 Environment

1.1 Data Description

The analysis is based on monthly observations of the Fama-French five factors obtained from Kenneth French’s data library. The sample spans approximately 750 monthly observations, providing a sufficiently long time horizon to study the stable dependence structure between factors.

The five analyzed factors include the excess market return (Mkt-RF), size (SMB), value (HML), profitability (RMW), and investment (CMA). The risk-free rate is excluded, as the focus lies on the joint behavior of systematic risk factors rather than excess return construction.

All factor series are standardized prior to analysis to ensure comparability and to prevent scale differences from influencing distance-based methods. Standardization is particularly important in the context of PCA and MDS, where the geometry of the data is directly affected by variable scaling.

options(repos = c(CRAN = "https://cloud.r-project.org"))

required_pkgs <- c(
  "tidyverse", "lubridate", "gridExtra", "smacof", "vegan", "psych", "tibble"
)

missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]

if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, dependencies = TRUE)
}

invisible(lapply(required_pkgs, library, character.only = TRUE))

2 Data Loading and Cleaning (FF5 Monthly)

ff5_raw <- readr::read_csv("FF5_monthly.csv", show_col_types = FALSE)

names(ff5_raw)
#> [1] "...1"   "Mkt-RF" "SMB"    "HML"    "RMW"    "CMA"    "RF"
head(ff5_raw, 5)
ff5 <- ff5_raw %>%
  rename(DATE = 1) %>%
  mutate(DATE = as.character(DATE)) %>%
  filter(str_detect(DATE, "^[0-9]{6}$")) %>%
  mutate(date = ymd(paste0(DATE, "01"))) %>%
  select(date, `Mkt-RF`, SMB, HML, RMW, CMA) %>%
  arrange(date) %>%
  mutate(across(-date, as.numeric))

glimpse(ff5)
#> Rows: 749
#> Columns: 6
#> $ date     <date> 1963-07-01, 1963-08-01, 1963-09-01, 1963-10-01, 1963-11-01, …
#> $ `Mkt-RF` <dbl> -0.39, 5.08, -1.57, 2.54, -0.86, 1.83, 2.27, 1.55, 1.41, 0.11…
#> $ SMB      <dbl> -0.48, -0.80, -0.43, -1.34, -0.85, -1.89, 0.10, 0.33, 1.41, -…
#> $ HML      <dbl> -0.81, 1.70, 0.00, -0.04, 1.73, -0.21, 1.63, 2.81, 3.29, -0.5…
#> $ RMW      <dbl> 0.64, 0.40, -0.78, 2.79, -0.43, 0.12, 0.21, 0.11, -2.03, -1.3…
#> $ CMA      <dbl> -1.15, -0.38, 0.15, -2.25, 2.27, -0.25, 1.48, 0.81, 2.98, -1.…
range(ff5$date)
#> [1] "1963-07-01" "2025-11-01"
summary(ff5)
#>       date                Mkt-RF              SMB                HML          
#>  Min.   :1963-07-01   Min.   :-23.1900   Min.   :-15.5400   Min.   :-13.8300  
#>  1st Qu.:1979-02-01   1st Qu.: -1.9600   1st Qu.: -1.5800   1st Qu.: -1.4400  
#>  Median :1994-09-01   Median :  1.0200   Median :  0.0200   Median :  0.2000  
#>  Mean   :1994-08-31   Mean   :  0.5956   Mean   :  0.1801   Mean   :  0.2837  
#>  3rd Qu.:2010-04-01   3rd Qu.:  3.4200   3rd Qu.:  1.9400   3rd Qu.:  1.7300  
#>  Max.   :2025-11-01   Max.   : 16.1000   Max.   : 18.4600   Max.   : 12.8600  
#>       RMW                CMA         
#>  Min.   :-18.9500   Min.   :-7.0800  
#>  1st Qu.: -0.8500   1st Qu.:-1.0400  
#>  Median :  0.2500   Median : 0.0900  
#>  Mean   :  0.2628   Mean   : 0.2407  
#>  3rd Qu.:  1.3100   3rd Qu.: 1.4900  
#>  Max.   : 13.0500   Max.   : 9.0100
diag_tbl <- tibble(
  start_date = min(ff5$date),
  end_date   = max(ff5$date),
  n_months   = nrow(ff5),
  n_factors  = ncol(ff5) - 1
)
knitr::kable(diag_tbl, caption = "Dataset diagnostics")
Dataset diagnostics
start_date end_date n_months n_factors
1963-07-01 2025-11-01 749 5

3 Correlation Structure

X <- ff5 %>% select(`Mkt-RF`, SMB, HML, RMW, CMA)

corr_mat <- cor(X, use = "pairwise.complete.obs")
round(corr_mat, 2)
#>        Mkt-RF   SMB   HML   RMW   CMA
#> Mkt-RF   1.00  0.28 -0.21 -0.19 -0.35
#> SMB      0.28  1.00  0.01 -0.34 -0.08
#> HML     -0.21  0.01  1.00  0.09  0.68
#> RMW     -0.19 -0.34  0.09  1.00  0.00
#> CMA     -0.35 -0.08  0.68  0.00  1.00
knitr::kable(round(corr_mat, 2), caption = "Correlation matrix")
Correlation matrix
Mkt-RF SMB HML RMW CMA
Mkt-RF 1.00 0.28 -0.21 -0.19 -0.35
SMB 0.28 1.00 0.01 -0.34 -0.08
HML -0.21 0.01 1.00 0.09 0.68
RMW -0.19 -0.34 0.09 1.00 0.00
CMA -0.35 -0.08 0.68 0.00 1.00
corr_df <- as.data.frame(corr_mat) %>%
  rownames_to_column("factor1") %>%
  pivot_longer(-factor1, names_to = "factor2", values_to = "corr")

ggplot(corr_df, aes(factor1, factor2, fill = corr)) +
  geom_tile() +
  geom_text(aes(label = round(corr, 2)), size = 4) +
  scale_fill_gradient2(limits = c(-1, 1)) +
  coord_equal() +
  labs(title = "Correlation heatmap", fill = "Corr") +
  theme_minimal()

corr_with_mkt <- sort(corr_mat[, "Mkt-RF"], decreasing = TRUE)
round(corr_with_mkt, 2)
#> Mkt-RF    SMB    RMW    HML    CMA 
#>   1.00   0.28  -0.19  -0.21  -0.35

3.1 Correlation Structure of the Fama-French Factors

The correlation matrix confirms non-trivial dependence between factors and supports dimensionality reduction. The strongest relationship is between HML and CMA (\(r =\) 0.68), consistent with a shared value-investment structure.

Mkt-RF is moderately positively related to SMB (\(r =\) 0.28) and negatively related to HML, RMW, and CMA. RMW is not fully orthogonal: it is close to zero versus HML and CMA, but materially negative versus SMB (\(r =\) -0.34).

Overall, the pattern suggests a few latent dimensions rather than five independent sources of variation.

4 PCA Analysis

X_scaled <- scale(X)
pca <- prcomp(X_scaled)

eig <- pca$sdev^2
prop <- eig / sum(eig)

pca_var_tbl <- tibble(
  PC = paste0("PC", 1:5),
  eigenvalue = eig,
  prop_var = prop,
  cum_var = cumsum(prop)
)

pca_var_tbl
knitr::kable(
  pca_var_tbl %>% mutate(across(where(is.numeric), ~ round(.x, 4))),
  caption = "PCA: eigenvalues and explained variance"
)
PCA: eigenvalues and explained variance
PC eigenvalue prop_var cum_var
PC1 1.9480 0.3896 0.3896
PC2 1.3776 0.2755 0.6651
PC3 0.7727 0.1545 0.8197
PC4 0.6201 0.1240 0.9437
PC5 0.2816 0.0563 1.0000
scree_tbl <- pca_var_tbl %>%
  mutate(PC = factor(PC, levels = paste0("PC", 1:5)))

ggplot(scree_tbl, aes(PC, prop_var, group = 1)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_line(color = "grey30") +
  geom_point(color = "grey30", size = 2) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    title = "Scree plot: variance explained by PCs",
    x = NULL,
    y = "Explained variance"
  ) +
  theme_minimal()

loadings <- pca$rotation[, 1:3] %>%
  as.data.frame() %>%
  rownames_to_column("Factor")

knitr::kable(
  loadings %>% mutate(across(-Factor, ~ round(.x, 3))),
  caption = "PCA loadings (PC1-PC3)"
)
PCA loadings (PC1-PC3)
Factor PC1 PC2 PC3
Mkt-RF 0.467 0.214 0.654
SMB 0.269 0.612 0.161
HML -0.547 0.374 0.357
RMW -0.245 -0.571 0.647
CMA -0.592 0.337 -0.009

4.1 Interpretation of Principal Components

Interpretation should be based on loading magnitudes (the sign itself is arbitrary in PCA). In this sample:

  • PC1 is mainly driven by HML and CMA (with additional market exposure), indicating a dominant Value-Investment dimension.
  • PC2 loads strongly on SMB and RMW with opposite signs, suggesting a Size-Profitability contrast rather than a pure profitability factor.
  • PC3 is strongest on Mkt-RF and RMW, which is closer to a Market-Profitability component than a Market-Size axis.
contrib_tbl <- as_tibble(pca$rotation[, 1:3], rownames = "Factor") %>%
  pivot_longer(-Factor, names_to = "PC", values_to = "loading") %>%
  group_by(PC) %>%
  mutate(contrib = 100 * (loading^2) / sum(loading^2)) %>%
  ungroup()

plot_contrib <- function(pc_name) {
  ggplot(
    contrib_tbl %>% filter(PC == pc_name) %>% arrange(desc(contrib)),
    aes(x = reorder(Factor, contrib), y = contrib)
  ) +
    geom_col(fill = "steelblue", alpha = 0.85) +
    coord_flip() +
    labs(title = paste("Contribution to", pc_name), x = NULL, y = "Contribution (%)") +
    theme_minimal()
}

gridExtra::grid.arrange(
  plot_contrib("PC1"),
  plot_contrib("PC2"),
  plot_contrib("PC3"),
  ncol = 1
)

scores <- as_tibble(pca$x) %>%
  mutate(date = ff5$date)

ggplot(scores, aes(PC1, PC2)) +
  geom_point(alpha = 0.5) +
  labs(title = "Months in PC space (PC1 vs PC2)") +
  theme_minimal()

scores_12 <- as_tibble(pca$x[, 1:2], .name_repair = "minimal") %>%
  rename(PC1 = 1, PC2 = 2)

loadings_12 <- as_tibble(pca$rotation[, 1:2], rownames = "Factor")

arrow_scale <- min(
  diff(range(scores_12$PC1)) / diff(range(loadings_12$PC1)),
  diff(range(scores_12$PC2)) / diff(range(loadings_12$PC2))
) * 0.25

loadings_plot <- loadings_12 %>%
  mutate(
    PC1 = PC1 * arrow_scale,
    PC2 = PC2 * arrow_scale
  )

ggplot(scores_12, aes(PC1, PC2)) +
  geom_point(alpha = 0.5, color = "grey40") +
  geom_segment(
    data = loadings_plot,
    aes(x = 0, y = 0, xend = PC1, yend = PC2),
    inherit.aes = FALSE,
    color = "steelblue",
    arrow = arrow(length = grid::unit(0.2, "cm"))
  ) +
  geom_text(
    data = loadings_plot,
    aes(x = PC1, y = PC2, label = Factor),
    inherit.aes = FALSE,
    color = "steelblue",
    nudge_y = 0.05,
    size = 4
  ) +
  labs(title = "PCA biplot: PC1 vs PC2") +
  theme_minimal()

4.2 Principal Component Analysis

Principal Component Analysis is employed to identify orthogonal linear combinations of the original factors that explain the maximum variance in the data. Given the strong correlation structure observed earlier, PCA serves as a natural first step in uncovering latent dimensions underlying the factor space.

For this dataset, the first three principal components explain 82% of total variance. This indicates that effective dimensionality is lower than the nominal five factors, although not all variation is captured in 2D.

From an economic perspective, this finding implies that multiple observed risk factors may be manifestations of a smaller number of fundamental sources of systematic risk.

5 Multidimensional Scaling (MDS)

dist_mat <- dist(scale(X))
mds_cmd <- cmdscale(dist_mat, k = 2)

mds_df <- tibble(
  Dim1 = mds_cmd[, 1],
  Dim2 = mds_cmd[, 2]
)

ggplot(mds_df, aes(Dim1, Dim2)) +
  geom_point(alpha = 0.5, color = "grey40") +
  labs(title = "Classical MDS (2D)") +
  theme_minimal()

mds_smacof <- smacof::mds(dist_mat, ndim = 2, type = "ratio")
mds_smacof$stress
#> [1] 0.1650079
knitr::kable(
  tibble(stress_2D = round(mds_smacof$stress, 4)),
  caption = "SMACOF MDS stress (2D)"
)
SMACOF MDS stress (2D)
stress_2D
0.165
plot(mds_smacof, main = "SMACOF MDS (stress-based)")

stress_curve <- tibble(
  ndim = 1:5,
  stress = map_dbl(1:5, ~ smacof::mds(dist_mat, ndim = .x, type = "ratio")$stress)
)

knitr::kable(
  stress_curve %>% mutate(stress = round(stress, 4)),
  caption = "Stress vs number of dimensions"
)
Stress vs number of dimensions
ndim stress
1 0.3527
2 0.1650
3 0.0956
4 0.0383
5 0.0000
ggplot(stress_curve, aes(ndim, stress)) +
  geom_line() +
  geom_point(size = 2) +
  labs(title = "SMACOF MDS: stress curve") +
  theme_minimal()

5.1 Multidimensional Scaling

Multidimensional Scaling offers a complementary, distance-based perspective on the structure of the data. Unlike PCA, which is variance-oriented, MDS focuses explicitly on preserving pairwise distances between observations.

Using Euclidean distances computed on standardized factors, both classical MDS and stress-based SMACOF MDS are applied. The 2D SMACOF solution has stress 0.165, which is acceptable for visualization but still indicates non-negligible distortion.

The stress curve shows substantial improvement from 2D to 3D (0.165 \(\rightarrow\) 0.096), and further improvement in 4D. This supports using 2D for communication, while noting that higher-dimensional embeddings represent distances more faithfully.

6 PCA vs MDS: Geometry Comparison

d_orig <- as.vector(dist(scale(X)))
d_pca  <- as.vector(dist(as.matrix(scores[, c("PC1","PC2")])))
d_mds  <- as.vector(dist(mds_cmd))

cor(d_orig, d_pca)
#> [1] 0.9296938
cor(d_orig, d_mds)
#> [1] 0.9296938
knitr::kable(
  tibble(
    comparison = c("Original vs PCA(2D)", "Original vs MDS(2D)"),
    correlation = round(c(cor(d_orig, d_pca), cor(d_orig, d_mds)), 4)
  ),
  caption = "Distance preservation"
)
Distance preservation
comparison correlation
Original vs PCA(2D) 0.9297
Original vs MDS(2D) 0.9297

6.1 Comparison of PCA and MDS Representations

Distance preservation is evaluated here as the Pearson correlation of pairwise Euclidean distances between the original standardized space and each 2D embedding. In this sample, both PCA(2D) and MDS(2D) achieve high correlation (0.93 and 0.93).

Procrustes alignment between PCA and classical MDS is numerically near-perfect (6.66e-16 sum of squares), showing that both methods recover almost the same 2D geometry up to rotation/reflection/scale.

Methodological caveat: because both embeddings are linear and are built from the same standardized Euclidean structure, strong similarity is expected. Therefore, this comparison supports internal consistency of the geometry rather than serving as a standalone proof that nonlinear methods would never add information.

7 Procrustes Analysis

proc <- vegan::procrustes(
  as.matrix(scores[, c("PC1","PC2")]),
  as.matrix(mds_cmd),
  symmetric = TRUE
)

proc_overlay <- bind_rows(
  tibble(Dim1 = proc$X[, 1], Dim2 = proc$X[, 2], Space = "PCA (2D)"),
  tibble(Dim1 = proc$Yrot[, 1], Dim2 = proc$Yrot[, 2], Space = "MDS aligned")
)

ggplot(proc_overlay, aes(Dim1, Dim2, color = Space)) +
  geom_point(alpha = 0.35, size = 1) +
  labs(title = "Procrustes alignment: PCA vs MDS", x = "Dimension 1", y = "Dimension 2") +
  theme_minimal()


summary(proc)
#> 
#> Call:
#> vegan::procrustes(X = as.matrix(scores[, c("PC1", "PC2")]), Y = as.matrix(mds_cmd),      symmetric = TRUE) 
#> 
#> Number of objects: 749    Number of dimensions: 2 
#> 
#> Procrustes sum of squares:  
#>  6.661338e-16 
#> Procrustes root mean squared error: 
#>  9.430611e-10 
#> Quantiles of Procrustes errors:
#>          Min           1Q       Median           3Q          Max 
#> 8.673617e-19 3.265015e-17 5.583589e-17 8.626653e-17 1.112297e-15 
#> 
#> Rotation matrix:
#>               [,1]          [,2]
#> [1,]  1.000000e+00 -3.746221e-15
#> [2,] -3.746221e-15 -1.000000e+00
#> 
#> Translation of averages:
#>              [,1]         [,2]
#> [1,] 5.387559e-19 2.378681e-19
#> 
#> Scaling of target:
#> [1] 1

8 Rotated PCA (Varimax)

rot_pca <- psych::principal(X_scaled, nfactors = 3, rotate = "varimax")
print(rot_pca$loadings, cutoff = 0.3)
#> 
#> Loadings:
#>        RC1    RC3    RC2   
#> Mkt-RF         0.871       
#> SMB            0.609 -0.541
#> HML     0.931              
#> RMW                   0.939
#> CMA     0.875              
#> 
#>                  RC1   RC3   RC2
#> SS loadings    1.709 1.202 1.187
#> Proportion Var 0.342 0.240 0.237
#> Cumulative Var 0.342 0.582 0.820

8.1 Rotated Principal Components

To enhance interpretability, a varimax rotation is applied to the first three principal components. Rotation does not alter the explanatory power of the solution but redistributes variance across components in a way that promotes sparsity in loadings.

The rotated solution is easier to read economically: one component is concentrated on HML/CMA (value-investment), one is concentrated on RMW (profitability), and one is centered on Mkt-RF with support from SMB (market-size exposure). This supports a low-dimensional but interpretable structure of common risk dimensions.

8.2 Interpretation of Principal Components

Given the rotated structure above, interpretation is more robust when anchored in rotated loadings rather than reusing unrotated labels. The key message is that the five observed factors can be organized around a small set of economically meaningful dimensions: value-investment, profitability, and market-size exposure.

9 Portfolio-Level Validation: PCA vs Original FF5 on 25 Portfolios

This chapter moves from factor geometry to asset-pricing application. The goal is to test whether latent factors extracted from PCA are practically useful in explaining the cross-section of Fama-French 25 portfolios and how they compare to the original pre-PCA FF5 factors.

9.1 Data Used in This Chapter

The portfolio dataset comes from 25_Portfolios_5x5.csv (CRSP 202601 vintage). It contains monthly value-weighted and equal-weighted returns for portfolios formed on size (ME) and book-to-market (BEME). We use the value-weighted monthly block only, consistent with typical factor-model validation practice. Missing values are encoded as -99.99 or -999 and are treated as NA.

The factor-side data come from the already prepared monthly FF5 table and PCA scores computed earlier in the report. For pricing comparability, dependent variables are modeled as excess returns: \(R^e_{i,t} = R_{i,t} - RF_t\).

portfolio_lines <- readr::read_lines("25_Portfolios_5x5.csv")
ew_monthly_marker <- which(str_detect(portfolio_lines, "Average Equal Weighted Returns -- Monthly"))[1]

if (is.na(ew_monthly_marker) || ew_monthly_marker <= 3) {
  stop("Cannot locate the monthly value-weighted block in 25_Portfolios_5x5.csv")
}

vw_last_data_line <- max(which(str_detect(portfolio_lines[1:(ew_monthly_marker - 1)], "^[0-9]{6},")))

vw_monthly_text <- paste(portfolio_lines[1:vw_last_data_line], collapse = "\n")

port25_vw <- readr::read_csv(
  I(vw_monthly_text),
  show_col_types = FALSE,
  na = c("-99.99", "-999")
) %>%
  rename(DATE = 1) %>%
  mutate(
    DATE = as.character(DATE),
    date = ymd(paste0(DATE, "01"))
  ) %>%
  filter(str_detect(DATE, "^[0-9]{6}$")) %>%
  select(-DATE)

rf_monthly <- ff5_raw %>%
  rename(DATE = 1) %>%
  mutate(DATE = as.character(DATE)) %>%
  filter(str_detect(DATE, "^[0-9]{6}$")) %>%
  mutate(date = ymd(paste0(DATE, "01"))) %>%
  select(date, RF) %>%
  mutate(RF = as.numeric(RF))

pc_scores <- scores %>% select(date, PC1, PC2, PC3)
ff5_original <- ff5 %>% select(date, `Mkt-RF`, SMB, HML, RMW, CMA)

analysis_panel <- port25_vw %>%
  inner_join(pc_scores, by = "date") %>%
  inner_join(ff5_original, by = "date") %>%
  inner_join(rf_monthly, by = "date")

portfolio_names <- setdiff(names(port25_vw), "date")

section9_diag <- tibble(
  start_date = min(analysis_panel$date),
  end_date = max(analysis_panel$date),
  n_months = nrow(analysis_panel),
  n_portfolios = length(portfolio_names),
  predictors_PCA = 3,
  predictors_FF5 = 5
)

knitr::kable(section9_diag, caption = "Chapter 9 data panel diagnostics")
Chapter 9 data panel diagnostics
start_date end_date n_months n_portfolios predictors_PCA predictors_FF5
1963-07-01 2025-11-01 749 25 3 5

9.2 Model Design

For each of the 25 portfolios, we estimate two linear models on the same sample window:

PCA latent-factor model:

\[ R^e_{i,t} = \alpha_i + \beta_{i1}PC1_t + \beta_{i2}PC2_t + \beta_{i3}PC3_t + \varepsilon_{i,t} \]

Original FF5 model (pre-PCA factors):

\[ R^e_{i,t} = \alpha_i + \gamma_{iM}(Mkt\text{-}RF)_t + \gamma_{iS}SMB_t + \gamma_{iH}HML_t + \gamma_{iR}RMW_t + \gamma_{iC}CMA_t + \varepsilon_{i,t} \]

We compare explanatory power using both \(R^2\) and adjusted \(R^2\), where adjusted \(R^2\) is especially important given different model dimensionality (3 vs 5 predictors).

comparison_results <- map_dfr(portfolio_names, function(pname) {
  model_data <- analysis_panel %>%
    transmute(
      ret_excess = .data[[pname]] - RF,
      PC1, PC2, PC3,
      `Mkt-RF`, SMB, HML, RMW, CMA
    ) %>%
    drop_na()

  fit_pca <- lm(ret_excess ~ PC1 + PC2 + PC3, data = model_data)
  fit_ff5 <- lm(ret_excess ~ `Mkt-RF` + SMB + HML + RMW + CMA, data = model_data)

  co_pca <- coef(fit_pca)
  co_ff5 <- coef(fit_ff5)

  s_pca <- summary(fit_pca)
  s_ff5 <- summary(fit_ff5)

  tibble(
    portfolio = pname,
    alpha_PCA = unname(co_pca[1]),
    beta_PC1 = unname(co_pca[2]),
    beta_PC2 = unname(co_pca[3]),
    beta_PC3 = unname(co_pca[4]),
    alpha_FF5 = unname(co_ff5[1]),
    beta_MKT = unname(co_ff5[2]),
    beta_SMB = unname(co_ff5[3]),
    beta_HML = unname(co_ff5[4]),
    beta_RMW = unname(co_ff5[5]),
    beta_CMA = unname(co_ff5[6]),
    R2_PCA = s_pca$r.squared,
    adjR2_PCA = s_pca$adj.r.squared,
    R2_FF5 = s_ff5$r.squared,
    adjR2_FF5 = s_ff5$adj.r.squared,
    delta_adjR2 = s_pca$adj.r.squared - s_ff5$adj.r.squared,
    R2_retention = if_else(s_ff5$r.squared > 0, s_pca$r.squared / s_ff5$r.squared, NA_real_),
    n_obs = nobs(fit_pca)
  )
})

knitr::kable(
  comparison_results %>%
    select(portfolio, beta_PC1, beta_PC2, beta_PC3, R2_PCA, adjR2_PCA, R2_FF5, adjR2_FF5, delta_adjR2) %>%
    mutate(across(-portfolio, ~ round(.x, 4))),
  caption = "Portfolio-level comparison: PCA latent model vs original FF5"
)
Portfolio-level comparison: PCA latent model vs original FF5
portfolio beta_PC1 beta_PC2 beta_PC3 R2_PCA adjR2_PCA R2_FF5 adjR2_FF5 delta_adjR2
SMALL LoBM 4.3004 3.4896 2.5298 0.9085 0.9082 0.9209 0.9204 -0.0123
ME1 BM2 3.4459 3.4716 2.5385 0.9275 0.9273 0.9368 0.9364 -0.0092
ME1 BM3 2.6466 3.1193 3.2004 0.9427 0.9425 0.9475 0.9471 -0.0047
ME1 BM4 2.2184 3.2323 3.3260 0.9500 0.9498 0.9518 0.9515 -0.0017
SMALL HiBM 1.9168 3.5660 3.7643 0.9029 0.9026 0.9061 0.9055 -0.0030
ME2 BM1 4.0742 2.6044 2.8779 0.9340 0.9337 0.9529 0.9525 -0.0188
ME2 BM2 2.9308 2.6245 3.3657 0.9430 0.9428 0.9537 0.9534 -0.0106
ME2 BM3 2.2179 2.4581 3.8017 0.9359 0.9356 0.9418 0.9414 -0.0058
ME2 BM4 1.8129 2.6944 3.7544 0.9393 0.9391 0.9485 0.9481 -0.0090
ME2 BM5 1.8654 3.3820 4.2996 0.9444 0.9442 0.9539 0.9536 -0.0095
ME3 BM1 3.7859 1.9407 2.8689 0.9256 0.9253 0.9490 0.9486 -0.0233
ME3 BM2 2.4998 2.0284 3.5423 0.9066 0.9062 0.9317 0.9312 -0.0250
ME3 BM3 1.8046 2.0199 3.7529 0.8829 0.8825 0.9169 0.9163 -0.0339
ME3 BM4 1.5075 2.2854 3.9107 0.8829 0.8825 0.9249 0.9244 -0.0420
ME3 BM5 1.4637 2.8504 4.4726 0.8795 0.8790 0.9112 0.9106 -0.0317
ME4 BM1 3.2879 1.3740 2.6518 0.8651 0.8646 0.9309 0.9305 -0.0659
ME4 BM2 2.0897 1.5249 3.5958 0.8191 0.8183 0.9066 0.9059 -0.0876
ME4 BM3 1.5905 1.6417 3.8412 0.8037 0.8029 0.8952 0.8945 -0.0916
ME4 BM4 1.4456 1.9525 3.7750 0.7960 0.7952 0.8865 0.8857 -0.0905
ME4 BM5 1.3292 2.5400 4.3275 0.7881 0.7872 0.8797 0.8789 -0.0917
BIG LoBM 2.3777 -0.0103 2.7315 0.7614 0.7604 0.9533 0.9530 -0.1926
ME5 BM2 1.6356 0.6500 3.1090 0.6809 0.6796 0.8980 0.8973 -0.2177
ME5 BM3 1.2239 0.9154 3.0900 0.6075 0.6059 0.8508 0.8498 -0.2439
ME5 BM4 0.8541 1.3977 3.6687 0.6443 0.6429 0.8914 0.8907 -0.2478
BIG HiBM 1.0451 1.9319 3.8085 0.5835 0.5818 0.8198 0.8185 -0.2367
model_compare_summary <- comparison_results %>%
  summarize(
    mean_R2_PCA = mean(R2_PCA, na.rm = TRUE),
    mean_R2_FF5 = mean(R2_FF5, na.rm = TRUE),
    mean_adjR2_PCA = mean(adjR2_PCA, na.rm = TRUE),
    mean_adjR2_FF5 = mean(adjR2_FF5, na.rm = TRUE),
    mean_R2_retention = mean(R2_retention, na.rm = TRUE),
    median_R2_retention = median(R2_retention, na.rm = TRUE),
    mean_delta_adjR2 = mean(delta_adjR2, na.rm = TRUE),
    share_adjR2_within_95pct = mean(adjR2_PCA >= 0.95 * adjR2_FF5, na.rm = TRUE),
    share_adjR2_PCA_gt_FF5 = mean(adjR2_PCA > adjR2_FF5, na.rm = TRUE)
  )

knitr::kable(
  model_compare_summary %>% mutate(across(everything(), ~ round(.x, 4))),
  caption = "Chapter 9 effectiveness summary"
)
Chapter 9 effectiveness summary
mean_R2_PCA mean_R2_FF5 mean_adjR2_PCA mean_adjR2_FF5 mean_R2_retention median_R2_retention mean_delta_adjR2 share_adjR2_within_95pct share_adjR2_PCA_gt_FF5
0.8462 0.9184 0.8456 0.9178 0.9188 0.9651 -0.0723 0.6 0
ggplot(comparison_results, aes(adjR2_FF5, adjR2_PCA)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(size = 2.2, alpha = 0.85, color = "steelblue") +
  labs(
    title = "Adjusted R-squared: PCA latent model vs original FF5",
    x = "Adjusted R-squared (FF5)",
    y = "Adjusted R-squared (PCA latent factors)"
  ) +
  theme_minimal()

beta_long <- comparison_results %>%
  select(portfolio, beta_PC1, beta_PC2, beta_PC3) %>%
  pivot_longer(-portfolio, names_to = "component", values_to = "beta") %>%
  mutate(component = str_remove(component, "beta_"))

portfolio_grid <- tibble(portfolio = unique(beta_long$portfolio)) %>%
  mutate(
    size = case_when(
      portfolio == "SMALL LoBM" ~ 1L,
      portfolio == "SMALL HiBM" ~ 1L,
      portfolio == "BIG LoBM" ~ 5L,
      portfolio == "BIG HiBM" ~ 5L,
      TRUE ~ as.integer(str_match(portfolio, "^ME([1-5]) BM([1-5])$")[, 2])
    ),
    bm = case_when(
      portfolio == "SMALL LoBM" ~ 1L,
      portfolio == "SMALL HiBM" ~ 5L,
      portfolio == "BIG LoBM" ~ 1L,
      portfolio == "BIG HiBM" ~ 5L,
      TRUE ~ as.integer(str_match(portfolio, "^ME([1-5]) BM([1-5])$")[, 3])
    ),
    size_label = factor(size, levels = 5:1, labels = c("Big", "4", "3", "2", "Small")),
    bm_label = factor(bm, levels = 1:5, labels = c("LoBM", "2", "3", "4", "HiBM"))
  )

beta_heatmap <- beta_long %>%
  left_join(portfolio_grid, by = "portfolio")

ggplot(beta_heatmap, aes(bm_label, size_label, fill = beta)) +
  geom_tile(color = "white") +
  facet_wrap(~ component, nrow = 1) +
  scale_fill_gradient2(low = "#2c7bb6", mid = "white", high = "#d7191c") +
  labs(
    title = "PCA latent-factor exposures across the 5x5 size-value grid",
    x = "Book-to-Market bucket",
    y = "Size bucket",
    fill = "Beta"
  ) +
  theme_minimal()

9.3 Finance Interpretation: Effectiveness and Practical Sense

The PCA model compresses FF5 information from five correlated factors into three orthogonal latent drivers. Empirically, this chapter shows how much explanatory power is retained when moving from the original FF5 specification to the PCA representation.

In this sample, the PCA model retains on average about 91.9% of FF5 \(R^2\) (median retention: 96.5%). The mean adjusted-\(R^2\) gap is -0.0723, and in 60% of portfolios the PCA adjusted \(R^2\) reaches at least 95% of the FF5 adjusted \(R^2\).

From a asset-pricing perspective, this is a strong signal that PCA factors are economically usable as a parsimonious approximation layer. They are especially attractive when interpretability at the latent-dimension level, orthogonality of regressors, and model compactness are operational priorities. The trade-off is expected: some fit is sacrificed relative to the full FF5 basis, but often not enough to invalidate use in exploratory attribution, clustering of exposures, or reduced-form risk dashboards.

Methodological caveat: this chapter is an in-sample structural test, not an out-of-sample forecasting evaluation. Therefore, conclusions should be interpreted as evidence of explanatory compression quality, not forecasting superiority.

10 Conclusions

Dimensionality reduction is statistically justified in this dataset and remains economically meaningful when interpreted with discipline. Strong inter-factor co-movement implies that the effective dimensionality of FF5 is below five, which is consistent with PCA and MDS diagnostics reported earlier.

Both PCA and MDS provide closely aligned 2D maps of factor geometry, while the first three PCs explain about 82% of total variance. At the same time, the SMACOF stress level (0.165 in 2D) reminds us that two-dimensional projections still involve non-trivial distortion.

At the portfolio-pricing layer (Chapter 9), PCA latent factors are practically effective but not a free lunch. The 3-factor PCA model retains on average 91.9% of FF5 \(R^2\), with a mean adjusted-\(R^2\) gap of -0.0723. In 60% of portfolios, PCA reaches at least 95% of FF5 adjusted \(R^2\), indicating that much of the explanatory content can be preserved after orthogonal compression.

The methodological conclusion is therefore conditional: PCA is highly sensible as a compact explanatory layer for structure discovery, exposure mapping, and reduced-form risk analytics, while the original FF5 basis remains preferable when maximizing in-sample fit is the main objective. This framing balances efficiency, interpretability, and economic fidelity.

Reproducibility Appendix

pkg_versions <- tibble(
  package = c("R", required_pkgs),
  version = c(
    as.character(getRversion()),
    vapply(required_pkgs, function(p) as.character(packageVersion(p)), character(1))
  )
)

knitr::kable(pkg_versions, caption = "R and package versions used in this run")
R and package versions used in this run
package version
R 4.5.2
tidyverse 2.0.0
lubridate 1.9.5
gridExtra 2.3
smacof 2.1.7
vegan 2.7.3
psych 2.6.1
tibble 3.3.0
sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Warsaw
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] psych_2.6.1      vegan_2.7-3      permute_0.9-10   smacof_2.1-7    
#>  [5] e1071_1.7-17     colorspace_2.1-2 plotrix_3.8-14   gridExtra_2.3   
#>  [9] lubridate_1.9.5  forcats_1.0.1    stringr_1.6.0    dplyr_1.1.4     
#> [13] purrr_1.2.1      readr_2.2.0      tidyr_1.3.2      tibble_3.3.0    
#> [17] ggplot2_4.0.2    tidyverse_2.0.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rdpack_2.6.6        mnormt_2.1.2        polynom_1.4-1      
#>  [4] rlang_1.1.6         magrittr_2.0.4      otel_0.2.0         
#>  [7] compiler_4.5.2      mgcv_1.9-3          gdata_3.0.1        
#> [10] vctrs_0.6.5         pkgconfig_2.0.3     shape_1.4.6.1      
#> [13] crayon_1.5.3        fastmap_1.2.0       backports_1.5.0    
#> [16] labeling_0.4.3      rmarkdown_2.30      tzdb_0.5.0         
#> [19] nloptr_2.2.1        bit_4.6.0           xfun_0.56          
#> [22] glmnet_4.1-10       jomo_2.7-6          cachem_1.1.0       
#> [25] jsonlite_2.0.0      pan_1.9             broom_1.0.12       
#> [28] parallel_4.5.2      cluster_2.1.8.2     R6_2.6.1           
#> [31] bslib_0.10.0        stringi_1.8.7       RColorBrewer_1.1-3 
#> [34] boot_1.3-32         rpart_4.1.24        jquerylib_0.1.4    
#> [37] Rcpp_1.1.0          iterators_1.0.14    knitr_1.51         
#> [40] base64enc_0.1-6     weights_1.1.2       Matrix_1.7-4       
#> [43] nnls_1.6            splines_4.5.2       nnet_7.3-20        
#> [46] timechange_0.4.0    tidyselect_1.2.1    rstudioapi_0.18.0  
#> [49] yaml_2.3.12         doParallel_1.0.17   codetools_0.2-20   
#> [52] lattice_0.22-7      withr_3.0.2         S7_0.2.1           
#> [55] evaluate_1.0.5      foreign_0.8-90      survival_3.8-3     
#> [58] proxy_0.4-29        pillar_1.11.1       mice_3.19.0        
#> [61] checkmate_2.3.4     foreach_1.5.2       reformulas_0.4.4   
#> [64] ellipse_0.5.0       generics_0.1.4      vroom_1.7.0        
#> [67] hms_1.1.4           scales_1.4.0        minqa_1.2.8        
#> [70] gtools_3.9.5        class_7.3-23        glue_1.8.0         
#> [73] Hmisc_5.2-5         tools_4.5.2         data.table_1.18.2.1
#> [76] lme4_2.0-1          grid_4.5.2          rbibutils_2.4.1    
#> [79] nlme_3.1-168        htmlTable_2.4.3     Formula_1.2-5      
#> [82] cli_3.6.5           gtable_0.3.6        sass_0.4.10        
#> [85] digest_0.6.39       wordcloud_2.6       htmlwidgets_1.6.4  
#> [88] farver_2.1.2        htmltools_0.5.9     lifecycle_1.0.5    
#> [91] mitml_0.4-5         bit64_4.6.0-1       MASS_7.3-65