The Fama-French five-factor model constitutes one of the most influential empirical frameworks in modern asset pricing, extending the traditional CAPM by incorporating size, value, profitability, and investment-related risk factors. While the individual economic interpretations of these factors are well established in the literature, less attention is typically devoted to the joint correlation structure and latent geometry of the factor space itself.
This project aims to explore the multivariate dependence structure of the monthly Fama-French five factors using dimensionality reduction techniques. Rather than focusing on predictive performance or asset pricing tests, the analysis emphasizes the internal geometry of the factor space and investigates whether the observed factors can be represented by a smaller number of latent dimensions without substantial loss of information.
To this end, Principal Component Analysis (PCA), rotated PCA, and Multidimensional Scaling (MDS) are employed and systematically compared. Particular attention is paid to distance preservation, interpretability of latent components, and the degree to which linear methods adequately capture the structure of the data.
The empirical analysis is based on the Fama-French Five-Factor model using monthly data. The dataset originates from the publicly available data library (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) maintained by Kenneth R. French and is widely used in empirical asset pricing research.
The following five risk factors are employed in the analysis:
Mkt-RF (Market Excess Return)
The excess return on the value-weighted market portfolio over the
risk-free rate. This factor captures systematic market risk and
represents the traditional CAPM component.
SMB (Small Minus Big)
The return differential between portfolios of small-cap and large-cap
firms, designed to proxy for the size effect observed in equity
markets.
HML (High Minus Low)
The return differential between high book-to-market (value) and low
book-to-market (growth) firms, capturing the value premium.
RMW (Robust Minus Weak)
The return differential between firms with robust and weak operating
profitability, reflecting the profitability effect in asset
returns.
CMA (Conservative Minus Aggressive)
The return differential between firms that invest conservatively versus
aggressively, capturing the investment effect.
The risk-free rate (RF) is excluded from the analysis, as the focus is placed on the joint structure and dependence between the systematic risk factors themselves rather than on excess-return construction.
The dataset consists of approximately 750 monthly observations, covering several decades of U.S. equity market history. Monthly frequency is chosen to balance noise reduction with sufficient time-series length for reliable multivariate analysis.
All factor return series are transformed as follows:
Standardization ensures that the subsequent distance-based and variance-based methods (PCA and MDS) are not dominated by differences in scale or volatility across factors. As a result, the analysis focuses purely on correlation structure and geometric relationships among the factors.
Despite the relatively small number of variables, dimensionality reduction is well motivated due to the presence of strong and economically meaningful correlations among the factors. The objective is not variable reduction per se, but rather to:
This approach aligns with the broader asset pricing literature, where factor interdependence and redundancy are known to play a crucial role in model interpretation and factor construction.
The analysis is based on monthly observations of the Fama-French five factors obtained from Kenneth French’s data library. The sample spans approximately 750 monthly observations, providing a sufficiently long time horizon to study the stable dependence structure between factors.
The five analyzed factors include the excess market return (Mkt-RF), size (SMB), value (HML), profitability (RMW), and investment (CMA). The risk-free rate is excluded, as the focus lies on the joint behavior of systematic risk factors rather than excess return construction.
All factor series are standardized prior to analysis to ensure comparability and to prevent scale differences from influencing distance-based methods. Standardization is particularly important in the context of PCA and MDS, where the geometry of the data is directly affected by variable scaling.
options(repos = c(CRAN = "https://cloud.r-project.org"))
required_pkgs <- c(
"tidyverse", "lubridate", "gridExtra", "smacof", "vegan", "psych", "tibble"
)
missing_pkgs <- required_pkgs[
!vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]
if (length(missing_pkgs) > 0) {
install.packages(missing_pkgs, dependencies = TRUE)
}
invisible(lapply(required_pkgs, library, character.only = TRUE))
ff5_raw <- readr::read_csv("FF5_monthly.csv", show_col_types = FALSE)
names(ff5_raw)
#> [1] "...1" "Mkt-RF" "SMB" "HML" "RMW" "CMA" "RF"
head(ff5_raw, 5)
ff5 <- ff5_raw %>%
rename(DATE = 1) %>%
mutate(DATE = as.character(DATE)) %>%
filter(str_detect(DATE, "^[0-9]{6}$")) %>%
mutate(date = ymd(paste0(DATE, "01"))) %>%
select(date, `Mkt-RF`, SMB, HML, RMW, CMA) %>%
arrange(date) %>%
mutate(across(-date, as.numeric))
glimpse(ff5)
#> Rows: 749
#> Columns: 6
#> $ date <date> 1963-07-01, 1963-08-01, 1963-09-01, 1963-10-01, 1963-11-01, …
#> $ `Mkt-RF` <dbl> -0.39, 5.08, -1.57, 2.54, -0.86, 1.83, 2.27, 1.55, 1.41, 0.11…
#> $ SMB <dbl> -0.48, -0.80, -0.43, -1.34, -0.85, -1.89, 0.10, 0.33, 1.41, -…
#> $ HML <dbl> -0.81, 1.70, 0.00, -0.04, 1.73, -0.21, 1.63, 2.81, 3.29, -0.5…
#> $ RMW <dbl> 0.64, 0.40, -0.78, 2.79, -0.43, 0.12, 0.21, 0.11, -2.03, -1.3…
#> $ CMA <dbl> -1.15, -0.38, 0.15, -2.25, 2.27, -0.25, 1.48, 0.81, 2.98, -1.…
range(ff5$date)
#> [1] "1963-07-01" "2025-11-01"
summary(ff5)
#> date Mkt-RF SMB HML
#> Min. :1963-07-01 Min. :-23.1900 Min. :-15.5400 Min. :-13.8300
#> 1st Qu.:1979-02-01 1st Qu.: -1.9600 1st Qu.: -1.5800 1st Qu.: -1.4400
#> Median :1994-09-01 Median : 1.0200 Median : 0.0200 Median : 0.2000
#> Mean :1994-08-31 Mean : 0.5956 Mean : 0.1801 Mean : 0.2837
#> 3rd Qu.:2010-04-01 3rd Qu.: 3.4200 3rd Qu.: 1.9400 3rd Qu.: 1.7300
#> Max. :2025-11-01 Max. : 16.1000 Max. : 18.4600 Max. : 12.8600
#> RMW CMA
#> Min. :-18.9500 Min. :-7.0800
#> 1st Qu.: -0.8500 1st Qu.:-1.0400
#> Median : 0.2500 Median : 0.0900
#> Mean : 0.2628 Mean : 0.2407
#> 3rd Qu.: 1.3100 3rd Qu.: 1.4900
#> Max. : 13.0500 Max. : 9.0100
diag_tbl <- tibble(
start_date = min(ff5$date),
end_date = max(ff5$date),
n_months = nrow(ff5),
n_factors = ncol(ff5) - 1
)
knitr::kable(diag_tbl, caption = "Dataset diagnostics")
| start_date | end_date | n_months | n_factors |
|---|---|---|---|
| 1963-07-01 | 2025-11-01 | 749 | 5 |
X <- ff5 %>% select(`Mkt-RF`, SMB, HML, RMW, CMA)
corr_mat <- cor(X, use = "pairwise.complete.obs")
round(corr_mat, 2)
#> Mkt-RF SMB HML RMW CMA
#> Mkt-RF 1.00 0.28 -0.21 -0.19 -0.35
#> SMB 0.28 1.00 0.01 -0.34 -0.08
#> HML -0.21 0.01 1.00 0.09 0.68
#> RMW -0.19 -0.34 0.09 1.00 0.00
#> CMA -0.35 -0.08 0.68 0.00 1.00
knitr::kable(round(corr_mat, 2), caption = "Correlation matrix")
| Mkt-RF | SMB | HML | RMW | CMA | |
|---|---|---|---|---|---|
| Mkt-RF | 1.00 | 0.28 | -0.21 | -0.19 | -0.35 |
| SMB | 0.28 | 1.00 | 0.01 | -0.34 | -0.08 |
| HML | -0.21 | 0.01 | 1.00 | 0.09 | 0.68 |
| RMW | -0.19 | -0.34 | 0.09 | 1.00 | 0.00 |
| CMA | -0.35 | -0.08 | 0.68 | 0.00 | 1.00 |
corr_df <- as.data.frame(corr_mat) %>%
rownames_to_column("factor1") %>%
pivot_longer(-factor1, names_to = "factor2", values_to = "corr")
ggplot(corr_df, aes(factor1, factor2, fill = corr)) +
geom_tile() +
geom_text(aes(label = round(corr, 2)), size = 4) +
scale_fill_gradient2(limits = c(-1, 1)) +
coord_equal() +
labs(title = "Correlation heatmap", fill = "Corr") +
theme_minimal()
corr_with_mkt <- sort(corr_mat[, "Mkt-RF"], decreasing = TRUE)
round(corr_with_mkt, 2)
#> Mkt-RF SMB RMW HML CMA
#> 1.00 0.28 -0.19 -0.21 -0.35
The correlation matrix confirms non-trivial dependence between factors and supports dimensionality reduction. The strongest relationship is between HML and CMA (\(r =\) 0.68), consistent with a shared value-investment structure.
Mkt-RF is moderately positively related to SMB (\(r =\) 0.28) and negatively related to HML, RMW, and CMA. RMW is not fully orthogonal: it is close to zero versus HML and CMA, but materially negative versus SMB (\(r =\) -0.34).
Overall, the pattern suggests a few latent dimensions rather than five independent sources of variation.
X_scaled <- scale(X)
pca <- prcomp(X_scaled)
eig <- pca$sdev^2
prop <- eig / sum(eig)
pca_var_tbl <- tibble(
PC = paste0("PC", 1:5),
eigenvalue = eig,
prop_var = prop,
cum_var = cumsum(prop)
)
pca_var_tbl
knitr::kable(
pca_var_tbl %>% mutate(across(where(is.numeric), ~ round(.x, 4))),
caption = "PCA: eigenvalues and explained variance"
)
| PC | eigenvalue | prop_var | cum_var |
|---|---|---|---|
| PC1 | 1.9480 | 0.3896 | 0.3896 |
| PC2 | 1.3776 | 0.2755 | 0.6651 |
| PC3 | 0.7727 | 0.1545 | 0.8197 |
| PC4 | 0.6201 | 0.1240 | 0.9437 |
| PC5 | 0.2816 | 0.0563 | 1.0000 |
scree_tbl <- pca_var_tbl %>%
mutate(PC = factor(PC, levels = paste0("PC", 1:5)))
ggplot(scree_tbl, aes(PC, prop_var, group = 1)) +
geom_col(fill = "steelblue", alpha = 0.8) +
geom_line(color = "grey30") +
geom_point(color = "grey30", size = 2) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
title = "Scree plot: variance explained by PCs",
x = NULL,
y = "Explained variance"
) +
theme_minimal()
loadings <- pca$rotation[, 1:3] %>%
as.data.frame() %>%
rownames_to_column("Factor")
knitr::kable(
loadings %>% mutate(across(-Factor, ~ round(.x, 3))),
caption = "PCA loadings (PC1-PC3)"
)
| Factor | PC1 | PC2 | PC3 |
|---|---|---|---|
| Mkt-RF | 0.467 | 0.214 | 0.654 |
| SMB | 0.269 | 0.612 | 0.161 |
| HML | -0.547 | 0.374 | 0.357 |
| RMW | -0.245 | -0.571 | 0.647 |
| CMA | -0.592 | 0.337 | -0.009 |
Interpretation should be based on loading magnitudes (the sign itself is arbitrary in PCA). In this sample:
contrib_tbl <- as_tibble(pca$rotation[, 1:3], rownames = "Factor") %>%
pivot_longer(-Factor, names_to = "PC", values_to = "loading") %>%
group_by(PC) %>%
mutate(contrib = 100 * (loading^2) / sum(loading^2)) %>%
ungroup()
plot_contrib <- function(pc_name) {
ggplot(
contrib_tbl %>% filter(PC == pc_name) %>% arrange(desc(contrib)),
aes(x = reorder(Factor, contrib), y = contrib)
) +
geom_col(fill = "steelblue", alpha = 0.85) +
coord_flip() +
labs(title = paste("Contribution to", pc_name), x = NULL, y = "Contribution (%)") +
theme_minimal()
}
gridExtra::grid.arrange(
plot_contrib("PC1"),
plot_contrib("PC2"),
plot_contrib("PC3"),
ncol = 1
)
scores <- as_tibble(pca$x) %>%
mutate(date = ff5$date)
ggplot(scores, aes(PC1, PC2)) +
geom_point(alpha = 0.5) +
labs(title = "Months in PC space (PC1 vs PC2)") +
theme_minimal()
scores_12 <- as_tibble(pca$x[, 1:2], .name_repair = "minimal") %>%
rename(PC1 = 1, PC2 = 2)
loadings_12 <- as_tibble(pca$rotation[, 1:2], rownames = "Factor")
arrow_scale <- min(
diff(range(scores_12$PC1)) / diff(range(loadings_12$PC1)),
diff(range(scores_12$PC2)) / diff(range(loadings_12$PC2))
) * 0.25
loadings_plot <- loadings_12 %>%
mutate(
PC1 = PC1 * arrow_scale,
PC2 = PC2 * arrow_scale
)
ggplot(scores_12, aes(PC1, PC2)) +
geom_point(alpha = 0.5, color = "grey40") +
geom_segment(
data = loadings_plot,
aes(x = 0, y = 0, xend = PC1, yend = PC2),
inherit.aes = FALSE,
color = "steelblue",
arrow = arrow(length = grid::unit(0.2, "cm"))
) +
geom_text(
data = loadings_plot,
aes(x = PC1, y = PC2, label = Factor),
inherit.aes = FALSE,
color = "steelblue",
nudge_y = 0.05,
size = 4
) +
labs(title = "PCA biplot: PC1 vs PC2") +
theme_minimal()
Principal Component Analysis is employed to identify orthogonal linear combinations of the original factors that explain the maximum variance in the data. Given the strong correlation structure observed earlier, PCA serves as a natural first step in uncovering latent dimensions underlying the factor space.
For this dataset, the first three principal components explain 82% of total variance. This indicates that effective dimensionality is lower than the nominal five factors, although not all variation is captured in 2D.
From an economic perspective, this finding implies that multiple observed risk factors may be manifestations of a smaller number of fundamental sources of systematic risk.
dist_mat <- dist(scale(X))
mds_cmd <- cmdscale(dist_mat, k = 2)
mds_df <- tibble(
Dim1 = mds_cmd[, 1],
Dim2 = mds_cmd[, 2]
)
ggplot(mds_df, aes(Dim1, Dim2)) +
geom_point(alpha = 0.5, color = "grey40") +
labs(title = "Classical MDS (2D)") +
theme_minimal()
mds_smacof <- smacof::mds(dist_mat, ndim = 2, type = "ratio")
mds_smacof$stress
#> [1] 0.1650079
knitr::kable(
tibble(stress_2D = round(mds_smacof$stress, 4)),
caption = "SMACOF MDS stress (2D)"
)
| stress_2D |
|---|
| 0.165 |
plot(mds_smacof, main = "SMACOF MDS (stress-based)")
stress_curve <- tibble(
ndim = 1:5,
stress = map_dbl(1:5, ~ smacof::mds(dist_mat, ndim = .x, type = "ratio")$stress)
)
knitr::kable(
stress_curve %>% mutate(stress = round(stress, 4)),
caption = "Stress vs number of dimensions"
)
| ndim | stress |
|---|---|
| 1 | 0.3527 |
| 2 | 0.1650 |
| 3 | 0.0956 |
| 4 | 0.0383 |
| 5 | 0.0000 |
ggplot(stress_curve, aes(ndim, stress)) +
geom_line() +
geom_point(size = 2) +
labs(title = "SMACOF MDS: stress curve") +
theme_minimal()
Multidimensional Scaling offers a complementary, distance-based perspective on the structure of the data. Unlike PCA, which is variance-oriented, MDS focuses explicitly on preserving pairwise distances between observations.
Using Euclidean distances computed on standardized factors, both classical MDS and stress-based SMACOF MDS are applied. The 2D SMACOF solution has stress 0.165, which is acceptable for visualization but still indicates non-negligible distortion.
The stress curve shows substantial improvement from 2D to 3D (0.165 \(\rightarrow\) 0.096), and further improvement in 4D. This supports using 2D for communication, while noting that higher-dimensional embeddings represent distances more faithfully.
d_orig <- as.vector(dist(scale(X)))
d_pca <- as.vector(dist(as.matrix(scores[, c("PC1","PC2")])))
d_mds <- as.vector(dist(mds_cmd))
cor(d_orig, d_pca)
#> [1] 0.9296938
cor(d_orig, d_mds)
#> [1] 0.9296938
knitr::kable(
tibble(
comparison = c("Original vs PCA(2D)", "Original vs MDS(2D)"),
correlation = round(c(cor(d_orig, d_pca), cor(d_orig, d_mds)), 4)
),
caption = "Distance preservation"
)
| comparison | correlation |
|---|---|
| Original vs PCA(2D) | 0.9297 |
| Original vs MDS(2D) | 0.9297 |
Distance preservation is evaluated here as the Pearson correlation of pairwise Euclidean distances between the original standardized space and each 2D embedding. In this sample, both PCA(2D) and MDS(2D) achieve high correlation (0.93 and 0.93).
Procrustes alignment between PCA and classical MDS is numerically near-perfect (6.66e-16 sum of squares), showing that both methods recover almost the same 2D geometry up to rotation/reflection/scale.
Methodological caveat: because both embeddings are linear and are built from the same standardized Euclidean structure, strong similarity is expected. Therefore, this comparison supports internal consistency of the geometry rather than serving as a standalone proof that nonlinear methods would never add information.
proc <- vegan::procrustes(
as.matrix(scores[, c("PC1","PC2")]),
as.matrix(mds_cmd),
symmetric = TRUE
)
proc_overlay <- bind_rows(
tibble(Dim1 = proc$X[, 1], Dim2 = proc$X[, 2], Space = "PCA (2D)"),
tibble(Dim1 = proc$Yrot[, 1], Dim2 = proc$Yrot[, 2], Space = "MDS aligned")
)
ggplot(proc_overlay, aes(Dim1, Dim2, color = Space)) +
geom_point(alpha = 0.35, size = 1) +
labs(title = "Procrustes alignment: PCA vs MDS", x = "Dimension 1", y = "Dimension 2") +
theme_minimal()
summary(proc)
#>
#> Call:
#> vegan::procrustes(X = as.matrix(scores[, c("PC1", "PC2")]), Y = as.matrix(mds_cmd), symmetric = TRUE)
#>
#> Number of objects: 749 Number of dimensions: 2
#>
#> Procrustes sum of squares:
#> 6.661338e-16
#> Procrustes root mean squared error:
#> 9.430611e-10
#> Quantiles of Procrustes errors:
#> Min 1Q Median 3Q Max
#> 8.673617e-19 3.265015e-17 5.583589e-17 8.626653e-17 1.112297e-15
#>
#> Rotation matrix:
#> [,1] [,2]
#> [1,] 1.000000e+00 -3.746221e-15
#> [2,] -3.746221e-15 -1.000000e+00
#>
#> Translation of averages:
#> [,1] [,2]
#> [1,] 5.387559e-19 2.378681e-19
#>
#> Scaling of target:
#> [1] 1
rot_pca <- psych::principal(X_scaled, nfactors = 3, rotate = "varimax")
print(rot_pca$loadings, cutoff = 0.3)
#>
#> Loadings:
#> RC1 RC3 RC2
#> Mkt-RF 0.871
#> SMB 0.609 -0.541
#> HML 0.931
#> RMW 0.939
#> CMA 0.875
#>
#> RC1 RC3 RC2
#> SS loadings 1.709 1.202 1.187
#> Proportion Var 0.342 0.240 0.237
#> Cumulative Var 0.342 0.582 0.820
To enhance interpretability, a varimax rotation is applied to the first three principal components. Rotation does not alter the explanatory power of the solution but redistributes variance across components in a way that promotes sparsity in loadings.
The rotated solution is easier to read economically: one component is concentrated on HML/CMA (value-investment), one is concentrated on RMW (profitability), and one is centered on Mkt-RF with support from SMB (market-size exposure). This supports a low-dimensional but interpretable structure of common risk dimensions.
Given the rotated structure above, interpretation is more robust when anchored in rotated loadings rather than reusing unrotated labels. The key message is that the five observed factors can be organized around a small set of economically meaningful dimensions: value-investment, profitability, and market-size exposure.
This chapter moves from factor geometry to asset-pricing application. The goal is to test whether latent factors extracted from PCA are practically useful in explaining the cross-section of Fama-French 25 portfolios and how they compare to the original pre-PCA FF5 factors.
The portfolio dataset comes from 25_Portfolios_5x5.csv
(CRSP 202601 vintage). It contains monthly value-weighted and
equal-weighted returns for portfolios formed on size (ME) and
book-to-market (BEME). We use the value-weighted monthly
block only, consistent with typical factor-model validation
practice. Missing values are encoded as -99.99 or
-999 and are treated as NA.
The factor-side data come from the already prepared monthly FF5 table and PCA scores computed earlier in the report. For pricing comparability, dependent variables are modeled as excess returns: \(R^e_{i,t} = R_{i,t} - RF_t\).
portfolio_lines <- readr::read_lines("25_Portfolios_5x5.csv")
ew_monthly_marker <- which(str_detect(portfolio_lines, "Average Equal Weighted Returns -- Monthly"))[1]
if (is.na(ew_monthly_marker) || ew_monthly_marker <= 3) {
stop("Cannot locate the monthly value-weighted block in 25_Portfolios_5x5.csv")
}
vw_last_data_line <- max(which(str_detect(portfolio_lines[1:(ew_monthly_marker - 1)], "^[0-9]{6},")))
vw_monthly_text <- paste(portfolio_lines[1:vw_last_data_line], collapse = "\n")
port25_vw <- readr::read_csv(
I(vw_monthly_text),
show_col_types = FALSE,
na = c("-99.99", "-999")
) %>%
rename(DATE = 1) %>%
mutate(
DATE = as.character(DATE),
date = ymd(paste0(DATE, "01"))
) %>%
filter(str_detect(DATE, "^[0-9]{6}$")) %>%
select(-DATE)
rf_monthly <- ff5_raw %>%
rename(DATE = 1) %>%
mutate(DATE = as.character(DATE)) %>%
filter(str_detect(DATE, "^[0-9]{6}$")) %>%
mutate(date = ymd(paste0(DATE, "01"))) %>%
select(date, RF) %>%
mutate(RF = as.numeric(RF))
pc_scores <- scores %>% select(date, PC1, PC2, PC3)
ff5_original <- ff5 %>% select(date, `Mkt-RF`, SMB, HML, RMW, CMA)
analysis_panel <- port25_vw %>%
inner_join(pc_scores, by = "date") %>%
inner_join(ff5_original, by = "date") %>%
inner_join(rf_monthly, by = "date")
portfolio_names <- setdiff(names(port25_vw), "date")
section9_diag <- tibble(
start_date = min(analysis_panel$date),
end_date = max(analysis_panel$date),
n_months = nrow(analysis_panel),
n_portfolios = length(portfolio_names),
predictors_PCA = 3,
predictors_FF5 = 5
)
knitr::kable(section9_diag, caption = "Chapter 9 data panel diagnostics")
| start_date | end_date | n_months | n_portfolios | predictors_PCA | predictors_FF5 |
|---|---|---|---|---|---|
| 1963-07-01 | 2025-11-01 | 749 | 25 | 3 | 5 |
For each of the 25 portfolios, we estimate two linear models on the same sample window:
PCA latent-factor model:
\[ R^e_{i,t} = \alpha_i + \beta_{i1}PC1_t + \beta_{i2}PC2_t + \beta_{i3}PC3_t + \varepsilon_{i,t} \]
Original FF5 model (pre-PCA factors):
\[ R^e_{i,t} = \alpha_i + \gamma_{iM}(Mkt\text{-}RF)_t + \gamma_{iS}SMB_t + \gamma_{iH}HML_t + \gamma_{iR}RMW_t + \gamma_{iC}CMA_t + \varepsilon_{i,t} \]
We compare explanatory power using both \(R^2\) and adjusted \(R^2\), where adjusted \(R^2\) is especially important given different model dimensionality (3 vs 5 predictors).
comparison_results <- map_dfr(portfolio_names, function(pname) {
model_data <- analysis_panel %>%
transmute(
ret_excess = .data[[pname]] - RF,
PC1, PC2, PC3,
`Mkt-RF`, SMB, HML, RMW, CMA
) %>%
drop_na()
fit_pca <- lm(ret_excess ~ PC1 + PC2 + PC3, data = model_data)
fit_ff5 <- lm(ret_excess ~ `Mkt-RF` + SMB + HML + RMW + CMA, data = model_data)
co_pca <- coef(fit_pca)
co_ff5 <- coef(fit_ff5)
s_pca <- summary(fit_pca)
s_ff5 <- summary(fit_ff5)
tibble(
portfolio = pname,
alpha_PCA = unname(co_pca[1]),
beta_PC1 = unname(co_pca[2]),
beta_PC2 = unname(co_pca[3]),
beta_PC3 = unname(co_pca[4]),
alpha_FF5 = unname(co_ff5[1]),
beta_MKT = unname(co_ff5[2]),
beta_SMB = unname(co_ff5[3]),
beta_HML = unname(co_ff5[4]),
beta_RMW = unname(co_ff5[5]),
beta_CMA = unname(co_ff5[6]),
R2_PCA = s_pca$r.squared,
adjR2_PCA = s_pca$adj.r.squared,
R2_FF5 = s_ff5$r.squared,
adjR2_FF5 = s_ff5$adj.r.squared,
delta_adjR2 = s_pca$adj.r.squared - s_ff5$adj.r.squared,
R2_retention = if_else(s_ff5$r.squared > 0, s_pca$r.squared / s_ff5$r.squared, NA_real_),
n_obs = nobs(fit_pca)
)
})
knitr::kable(
comparison_results %>%
select(portfolio, beta_PC1, beta_PC2, beta_PC3, R2_PCA, adjR2_PCA, R2_FF5, adjR2_FF5, delta_adjR2) %>%
mutate(across(-portfolio, ~ round(.x, 4))),
caption = "Portfolio-level comparison: PCA latent model vs original FF5"
)
| portfolio | beta_PC1 | beta_PC2 | beta_PC3 | R2_PCA | adjR2_PCA | R2_FF5 | adjR2_FF5 | delta_adjR2 |
|---|---|---|---|---|---|---|---|---|
| SMALL LoBM | 4.3004 | 3.4896 | 2.5298 | 0.9085 | 0.9082 | 0.9209 | 0.9204 | -0.0123 |
| ME1 BM2 | 3.4459 | 3.4716 | 2.5385 | 0.9275 | 0.9273 | 0.9368 | 0.9364 | -0.0092 |
| ME1 BM3 | 2.6466 | 3.1193 | 3.2004 | 0.9427 | 0.9425 | 0.9475 | 0.9471 | -0.0047 |
| ME1 BM4 | 2.2184 | 3.2323 | 3.3260 | 0.9500 | 0.9498 | 0.9518 | 0.9515 | -0.0017 |
| SMALL HiBM | 1.9168 | 3.5660 | 3.7643 | 0.9029 | 0.9026 | 0.9061 | 0.9055 | -0.0030 |
| ME2 BM1 | 4.0742 | 2.6044 | 2.8779 | 0.9340 | 0.9337 | 0.9529 | 0.9525 | -0.0188 |
| ME2 BM2 | 2.9308 | 2.6245 | 3.3657 | 0.9430 | 0.9428 | 0.9537 | 0.9534 | -0.0106 |
| ME2 BM3 | 2.2179 | 2.4581 | 3.8017 | 0.9359 | 0.9356 | 0.9418 | 0.9414 | -0.0058 |
| ME2 BM4 | 1.8129 | 2.6944 | 3.7544 | 0.9393 | 0.9391 | 0.9485 | 0.9481 | -0.0090 |
| ME2 BM5 | 1.8654 | 3.3820 | 4.2996 | 0.9444 | 0.9442 | 0.9539 | 0.9536 | -0.0095 |
| ME3 BM1 | 3.7859 | 1.9407 | 2.8689 | 0.9256 | 0.9253 | 0.9490 | 0.9486 | -0.0233 |
| ME3 BM2 | 2.4998 | 2.0284 | 3.5423 | 0.9066 | 0.9062 | 0.9317 | 0.9312 | -0.0250 |
| ME3 BM3 | 1.8046 | 2.0199 | 3.7529 | 0.8829 | 0.8825 | 0.9169 | 0.9163 | -0.0339 |
| ME3 BM4 | 1.5075 | 2.2854 | 3.9107 | 0.8829 | 0.8825 | 0.9249 | 0.9244 | -0.0420 |
| ME3 BM5 | 1.4637 | 2.8504 | 4.4726 | 0.8795 | 0.8790 | 0.9112 | 0.9106 | -0.0317 |
| ME4 BM1 | 3.2879 | 1.3740 | 2.6518 | 0.8651 | 0.8646 | 0.9309 | 0.9305 | -0.0659 |
| ME4 BM2 | 2.0897 | 1.5249 | 3.5958 | 0.8191 | 0.8183 | 0.9066 | 0.9059 | -0.0876 |
| ME4 BM3 | 1.5905 | 1.6417 | 3.8412 | 0.8037 | 0.8029 | 0.8952 | 0.8945 | -0.0916 |
| ME4 BM4 | 1.4456 | 1.9525 | 3.7750 | 0.7960 | 0.7952 | 0.8865 | 0.8857 | -0.0905 |
| ME4 BM5 | 1.3292 | 2.5400 | 4.3275 | 0.7881 | 0.7872 | 0.8797 | 0.8789 | -0.0917 |
| BIG LoBM | 2.3777 | -0.0103 | 2.7315 | 0.7614 | 0.7604 | 0.9533 | 0.9530 | -0.1926 |
| ME5 BM2 | 1.6356 | 0.6500 | 3.1090 | 0.6809 | 0.6796 | 0.8980 | 0.8973 | -0.2177 |
| ME5 BM3 | 1.2239 | 0.9154 | 3.0900 | 0.6075 | 0.6059 | 0.8508 | 0.8498 | -0.2439 |
| ME5 BM4 | 0.8541 | 1.3977 | 3.6687 | 0.6443 | 0.6429 | 0.8914 | 0.8907 | -0.2478 |
| BIG HiBM | 1.0451 | 1.9319 | 3.8085 | 0.5835 | 0.5818 | 0.8198 | 0.8185 | -0.2367 |
model_compare_summary <- comparison_results %>%
summarize(
mean_R2_PCA = mean(R2_PCA, na.rm = TRUE),
mean_R2_FF5 = mean(R2_FF5, na.rm = TRUE),
mean_adjR2_PCA = mean(adjR2_PCA, na.rm = TRUE),
mean_adjR2_FF5 = mean(adjR2_FF5, na.rm = TRUE),
mean_R2_retention = mean(R2_retention, na.rm = TRUE),
median_R2_retention = median(R2_retention, na.rm = TRUE),
mean_delta_adjR2 = mean(delta_adjR2, na.rm = TRUE),
share_adjR2_within_95pct = mean(adjR2_PCA >= 0.95 * adjR2_FF5, na.rm = TRUE),
share_adjR2_PCA_gt_FF5 = mean(adjR2_PCA > adjR2_FF5, na.rm = TRUE)
)
knitr::kable(
model_compare_summary %>% mutate(across(everything(), ~ round(.x, 4))),
caption = "Chapter 9 effectiveness summary"
)
| mean_R2_PCA | mean_R2_FF5 | mean_adjR2_PCA | mean_adjR2_FF5 | mean_R2_retention | median_R2_retention | mean_delta_adjR2 | share_adjR2_within_95pct | share_adjR2_PCA_gt_FF5 |
|---|---|---|---|---|---|---|---|---|
| 0.8462 | 0.9184 | 0.8456 | 0.9178 | 0.9188 | 0.9651 | -0.0723 | 0.6 | 0 |
ggplot(comparison_results, aes(adjR2_FF5, adjR2_PCA)) +
geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
geom_point(size = 2.2, alpha = 0.85, color = "steelblue") +
labs(
title = "Adjusted R-squared: PCA latent model vs original FF5",
x = "Adjusted R-squared (FF5)",
y = "Adjusted R-squared (PCA latent factors)"
) +
theme_minimal()
beta_long <- comparison_results %>%
select(portfolio, beta_PC1, beta_PC2, beta_PC3) %>%
pivot_longer(-portfolio, names_to = "component", values_to = "beta") %>%
mutate(component = str_remove(component, "beta_"))
portfolio_grid <- tibble(portfolio = unique(beta_long$portfolio)) %>%
mutate(
size = case_when(
portfolio == "SMALL LoBM" ~ 1L,
portfolio == "SMALL HiBM" ~ 1L,
portfolio == "BIG LoBM" ~ 5L,
portfolio == "BIG HiBM" ~ 5L,
TRUE ~ as.integer(str_match(portfolio, "^ME([1-5]) BM([1-5])$")[, 2])
),
bm = case_when(
portfolio == "SMALL LoBM" ~ 1L,
portfolio == "SMALL HiBM" ~ 5L,
portfolio == "BIG LoBM" ~ 1L,
portfolio == "BIG HiBM" ~ 5L,
TRUE ~ as.integer(str_match(portfolio, "^ME([1-5]) BM([1-5])$")[, 3])
),
size_label = factor(size, levels = 5:1, labels = c("Big", "4", "3", "2", "Small")),
bm_label = factor(bm, levels = 1:5, labels = c("LoBM", "2", "3", "4", "HiBM"))
)
beta_heatmap <- beta_long %>%
left_join(portfolio_grid, by = "portfolio")
ggplot(beta_heatmap, aes(bm_label, size_label, fill = beta)) +
geom_tile(color = "white") +
facet_wrap(~ component, nrow = 1) +
scale_fill_gradient2(low = "#2c7bb6", mid = "white", high = "#d7191c") +
labs(
title = "PCA latent-factor exposures across the 5x5 size-value grid",
x = "Book-to-Market bucket",
y = "Size bucket",
fill = "Beta"
) +
theme_minimal()
The PCA model compresses FF5 information from five correlated factors into three orthogonal latent drivers. Empirically, this chapter shows how much explanatory power is retained when moving from the original FF5 specification to the PCA representation.
In this sample, the PCA model retains on average about 91.9% of FF5 \(R^2\) (median retention: 96.5%). The mean adjusted-\(R^2\) gap is -0.0723, and in 60% of portfolios the PCA adjusted \(R^2\) reaches at least 95% of the FF5 adjusted \(R^2\).
From a asset-pricing perspective, this is a strong signal that PCA factors are economically usable as a parsimonious approximation layer. They are especially attractive when interpretability at the latent-dimension level, orthogonality of regressors, and model compactness are operational priorities. The trade-off is expected: some fit is sacrificed relative to the full FF5 basis, but often not enough to invalidate use in exploratory attribution, clustering of exposures, or reduced-form risk dashboards.
Methodological caveat: this chapter is an in-sample structural test, not an out-of-sample forecasting evaluation. Therefore, conclusions should be interpreted as evidence of explanatory compression quality, not forecasting superiority.
Dimensionality reduction is statistically justified in this dataset and remains economically meaningful when interpreted with discipline. Strong inter-factor co-movement implies that the effective dimensionality of FF5 is below five, which is consistent with PCA and MDS diagnostics reported earlier.
Both PCA and MDS provide closely aligned 2D maps of factor geometry, while the first three PCs explain about 82% of total variance. At the same time, the SMACOF stress level (0.165 in 2D) reminds us that two-dimensional projections still involve non-trivial distortion.
At the portfolio-pricing layer (Chapter 9), PCA latent factors are practically effective but not a free lunch. The 3-factor PCA model retains on average 91.9% of FF5 \(R^2\), with a mean adjusted-\(R^2\) gap of -0.0723. In 60% of portfolios, PCA reaches at least 95% of FF5 adjusted \(R^2\), indicating that much of the explanatory content can be preserved after orthogonal compression.
The methodological conclusion is therefore conditional: PCA is highly sensible as a compact explanatory layer for structure discovery, exposure mapping, and reduced-form risk analytics, while the original FF5 basis remains preferable when maximizing in-sample fit is the main objective. This framing balances efficiency, interpretability, and economic fidelity.
pkg_versions <- tibble(
package = c("R", required_pkgs),
version = c(
as.character(getRversion()),
vapply(required_pkgs, function(p) as.character(packageVersion(p)), character(1))
)
)
knitr::kable(pkg_versions, caption = "R and package versions used in this run")
| package | version |
|---|---|
| R | 4.5.2 |
| tidyverse | 2.0.0 |
| lubridate | 1.9.5 |
| gridExtra | 2.3 |
| smacof | 2.1.7 |
| vegan | 2.7.3 |
| psych | 2.6.1 |
| tibble | 3.3.0 |
sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Warsaw
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] psych_2.6.1 vegan_2.7-3 permute_0.9-10 smacof_2.1-7
#> [5] e1071_1.7-17 colorspace_2.1-2 plotrix_3.8-14 gridExtra_2.3
#> [9] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4
#> [13] purrr_1.2.1 readr_2.2.0 tidyr_1.3.2 tibble_3.3.0
#> [17] ggplot2_4.0.2 tidyverse_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rdpack_2.6.6 mnormt_2.1.2 polynom_1.4-1
#> [4] rlang_1.1.6 magrittr_2.0.4 otel_0.2.0
#> [7] compiler_4.5.2 mgcv_1.9-3 gdata_3.0.1
#> [10] vctrs_0.6.5 pkgconfig_2.0.3 shape_1.4.6.1
#> [13] crayon_1.5.3 fastmap_1.2.0 backports_1.5.0
#> [16] labeling_0.4.3 rmarkdown_2.30 tzdb_0.5.0
#> [19] nloptr_2.2.1 bit_4.6.0 xfun_0.56
#> [22] glmnet_4.1-10 jomo_2.7-6 cachem_1.1.0
#> [25] jsonlite_2.0.0 pan_1.9 broom_1.0.12
#> [28] parallel_4.5.2 cluster_2.1.8.2 R6_2.6.1
#> [31] bslib_0.10.0 stringi_1.8.7 RColorBrewer_1.1-3
#> [34] boot_1.3-32 rpart_4.1.24 jquerylib_0.1.4
#> [37] Rcpp_1.1.0 iterators_1.0.14 knitr_1.51
#> [40] base64enc_0.1-6 weights_1.1.2 Matrix_1.7-4
#> [43] nnls_1.6 splines_4.5.2 nnet_7.3-20
#> [46] timechange_0.4.0 tidyselect_1.2.1 rstudioapi_0.18.0
#> [49] yaml_2.3.12 doParallel_1.0.17 codetools_0.2-20
#> [52] lattice_0.22-7 withr_3.0.2 S7_0.2.1
#> [55] evaluate_1.0.5 foreign_0.8-90 survival_3.8-3
#> [58] proxy_0.4-29 pillar_1.11.1 mice_3.19.0
#> [61] checkmate_2.3.4 foreach_1.5.2 reformulas_0.4.4
#> [64] ellipse_0.5.0 generics_0.1.4 vroom_1.7.0
#> [67] hms_1.1.4 scales_1.4.0 minqa_1.2.8
#> [70] gtools_3.9.5 class_7.3-23 glue_1.8.0
#> [73] Hmisc_5.2-5 tools_4.5.2 data.table_1.18.2.1
#> [76] lme4_2.0-1 grid_4.5.2 rbibutils_2.4.1
#> [79] nlme_3.1-168 htmlTable_2.4.3 Formula_1.2-5
#> [82] cli_3.6.5 gtable_0.3.6 sass_0.4.10
#> [85] digest_0.6.39 wordcloud_2.6 htmlwidgets_1.6.4
#> [88] farver_2.1.2 htmltools_0.5.9 lifecycle_1.0.5
#> [91] mitml_0.4-5 bit64_4.6.0-1 MASS_7.3-65