Fama-MacBeth (1973) regression is a two-step procedure widely used in empirical asset pricing. It is especially powerful when a dataset has many cross-sectional units (firms, portfolios) but a limited number of time periods.
Its key advantage is providing standard errors corrected for cross-sectional correlation, which makes inference more reliable than pooled OLS.
| Step | Name | What happens |
|---|---|---|
| Step 0 | Time-Series Regression | Regress each firm’s return on risk factors → get betas |
| Step 1 | Cross-Sectional Regression | For each date t, regress returns on the betas from Step 0 |
| Step 2 | Average the Coefficients | Average Step 1 estimates across all T periods; test via t-test |
pkgs <- c("plm", "lmtest", "sandwich", "dplyr", "tidyr", "ggplot2")
new <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(new)) install.packages(new, repos = "https://cloud.r-project.org")
library(plm)
library(lmtest)
library(sandwich)
library(dplyr)
library(tidyr)
library(ggplot2)| Variable | Description |
|---|---|
date |
Trading date |
symbol |
Company ticker |
ri |
Daily excess stock return |
MKT |
Market excess return (Rm − Rf) |
SMB |
Small-minus-Big factor |
HML |
High-minus-Low factor |
# ── If you downloaded data.csv from payhip.com/b/uqxcH, use: ─────────────────
# df <- read.csv("data.csv")
# ── Synthetic dataset with the same structure as the tutorial ─────────────────
set.seed(42)
symbols <- c("AAPL", "MSFT", "AMZN", "GOOG", "META",
"TSLA", "NVDA", "JPM", "BAC", "GS")
dates <- seq.Date(as.Date("2018-01-02"), as.Date("2020-12-31"), by = "day")
dates <- dates[!weekdays(dates) %in% c("Saturday", "Sunday")]
n <- length(dates)
factors <- data.frame(
date = dates,
MKT = rnorm(n, mean = 0.0003, sd = 0.010),
SMB = rnorm(n, mean = 0.0001, sd = 0.005),
HML = rnorm(n, mean = 0.0001, sd = 0.005)
)
true_betas <- data.frame(
symbol = symbols,
bMKT = c( 1.2, 1.0, 1.3, 1.1, 1.4, 1.8, 1.5, 0.9, 1.0, 1.1),
bSMB = c(-0.3, -0.2, 0.1, -0.1, 0.2, 0.5, 0.3, -0.4, -0.2, -0.1),
bHML = c(-0.5, -0.3, -0.4, -0.6, -0.5, -0.8, -0.7, 0.3, 0.2, 0.1)
)
stock_list <- lapply(symbols, function(s) {
b <- true_betas[true_betas$symbol == s, ]
ri <- b$bMKT * factors$MKT +
b$bSMB * factors$SMB +
b$bHML * factors$HML +
rnorm(n, 0, 0.015)
data.frame(date = dates, symbol = s, ri = ri)
})
df <- merge(do.call(rbind, stock_list), factors, by = "date")
cat("Rows:", nrow(df), " | Columns:", ncol(df))Rows: 7830 | Columns: 6
date symbol ri MKT SMB HML
1 2018-01-02 AAPL -0.005158844 0.01400958 -0.005899808 -0.0001253861
2 2018-01-02 BAC 0.018617175 0.01400958 -0.005899808 -0.0001253861
3 2018-01-02 GOOG 0.017787733 0.01400958 -0.005899808 -0.0001253861
4 2018-01-02 JPM 0.028230190 0.01400958 -0.005899808 -0.0001253861
5 2018-01-02 GS 0.015298307 0.01400958 -0.005899808 -0.0001253861
6 2018-01-02 META 0.041423986 0.01400958 -0.005899808 -0.0001253861
7 2018-01-02 NVDA 0.028802207 0.01400958 -0.005899808 -0.0001253861
8 2018-01-02 MSFT 0.001434172 0.01400958 -0.005899808 -0.0001253861
9 2018-01-02 AMZN 0.019361548 0.01400958 -0.005899808 -0.0001253861
10 2018-01-02 TSLA 0.004414289 0.01400958 -0.005899808 -0.0001253861
ri MKT SMB
Min. :-0.0752901 Min. :-0.0298793 Min. :-1.676e-02
1st Qu.:-0.0138257 1st Qu.:-0.0063677 1st Qu.:-3.244e-03
Median :-0.0003607 Median :-0.0001070 Median : 1.190e-04
Mean :-0.0004306 Mean :-0.0001625 Mean : 1.778e-05
3rd Qu.: 0.0127971 3rd Qu.: 0.0066082 3rd Qu.: 3.458e-03
Max. : 0.0696857 Max. : 0.0325907 Max. : 1.758e-02
HML
Min. :-0.0140424
1st Qu.:-0.0031990
Median : 0.0001552
Mean : 0.0002495
3rd Qu.: 0.0036783
Max. : 0.0180233
For each firm we run:
\[r_{i,t} = \alpha_i + \hat\beta_{i,MKT} \cdot MKT_t + \hat\beta_{i,SMB} \cdot SMB_t + \hat\beta_{i,HML} \cdot HML_t + \varepsilon_{i,t}\]
step0_list <- lapply(symbols, function(s) {
sub <- df[df$symbol == s, ]
fit <- lm(ri ~ MKT + SMB + HML, data = sub)
data.frame(symbol = s,
b_MKT = coef(fit)[["MKT"]],
b_SMB = coef(fit)[["SMB"]],
b_HML = coef(fit)[["HML"]])
})
step0 <- do.call(rbind, step0_list)
rownames(step0) <- NULL
step0 symbol b_MKT b_SMB b_HML
1 AAPL 1.1333636 -0.29261029 -0.51343558
2 MSFT 0.9931190 -0.02935885 -0.43969603
3 AMZN 1.3522774 0.12812690 -0.60822728
4 GOOG 1.1231302 0.00187908 -0.54792910
5 META 1.4492575 0.45209023 -0.57007662
6 TSLA 1.8797296 0.51677714 -0.94233080
7 NVDA 1.5325296 0.30480087 -0.57394295
8 JPM 0.8958927 -0.41014050 0.07536340
9 BAC 0.8795994 -0.25351288 0.04337075
10 GS 1.0092382 -0.01243681 -0.17365546
symbol date ri MKT SMB HML
1 AAPL 2018-01-02 -0.005158844 0.014009584 -0.005899808 -0.0001253861
2 AAPL 2019-04-05 0.005958435 -0.008052058 0.003533895 -0.0018224335
3 AAPL 2018-06-04 -0.031770047 0.001491610 -0.001003277 -0.0045588590
4 AAPL 2019-09-05 0.009865348 -0.002200651 0.002293478 -0.0099335247
5 AAPL 2020-07-08 -0.022169419 -0.005769892 -0.006896938 0.0082455558
b_MKT b_SMB b_HML
1 1.133364 -0.2926103 -0.5134356
2 1.133364 -0.2926103 -0.5134356
3 1.133364 -0.2926103 -0.5134356
4 1.133364 -0.2926103 -0.5134356
5 1.133364 -0.2926103 -0.5134356
pal <- c("Market (MKT)" = "#1565C0",
"Size (SMB)" = "#2E7D32",
"Value (HML)" = "#E65100")
step0_long <- reshape(step0,
varying = c("b_MKT", "b_SMB", "b_HML"),
v.names = "Beta",
timevar = "Factor",
times = c("Market (MKT)", "Size (SMB)", "Value (HML)"),
direction = "long")
step0_long$Factor <- factor(step0_long$Factor,
levels = c("Market (MKT)", "Size (SMB)", "Value (HML)"))
ggplot(step0_long, aes(x = symbol, y = Beta, fill = Factor)) +
geom_col(position = "dodge", width = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
facet_wrap(~ Factor, scales = "free_y") +
scale_fill_manual(values = pal) +
labs(title = "Step 0: Factor Loadings per Firm",
subtitle = "Time-series OLS — ri ~ MKT + SMB + HML",
x = NULL, y = "Beta") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1))For each date t:
\[r_{i,t} = \lambda_{0,t} + \lambda_{MKT,t}\hat\beta_{i,MKT} + \lambda_{SMB,t}\hat\beta_{i,SMB} + \lambda_{HML,t}\hat\beta_{i,HML} + u_{i,t}\]
unique_dates <- sort(unique(df2$date))
step1_list <- lapply(unique_dates, function(d) {
sub <- df2[df2$date == d, ]
if (nrow(sub) < 5) return(NULL)
fit <- lm(ri ~ b_MKT + b_SMB + b_HML, data = sub)
data.frame(date = d,
lambda_0 = coef(fit)[["(Intercept)"]],
lambda_MKT = coef(fit)[["b_MKT"]],
lambda_SMB = coef(fit)[["b_SMB"]],
lambda_HML = coef(fit)[["b_HML"]])
})
step1 <- do.call(rbind, step1_list)
step1 <- step1[complete.cases(step1), ]
rownames(step1) <- NULL
cat("Cross-sectional regressions run:", nrow(step1), "\n")Cross-sectional regressions run: 783
date lambda_0 lambda_MKT lambda_SMB lambda_HML
1 2018-01-02 0.027959280 0.008139872 0.046994590 0.053673405
2 2018-01-03 -0.012175904 0.018014760 0.022455366 0.043817934
3 2018-01-04 0.096495344 -0.099436791 0.059088947 -0.061609966
4 2018-01-05 -0.009064687 0.008219238 -0.007236964 -0.011435900
5 2018-01-08 -0.047260313 0.042248330 -0.046463760 -0.003663960
6 2018-01-09 0.086021432 -0.093788386 0.012551422 -0.079061384
7 2018-01-10 0.013619564 0.009825615 0.008979299 0.016894167
8 2018-01-11 0.062185618 -0.053850415 0.062701026 0.002870204
step1_long <- reshape(step1,
varying = c("lambda_MKT", "lambda_SMB", "lambda_HML"),
v.names = "Lambda",
timevar = "Factor",
times = c("Market (MKT)", "Size (SMB)", "Value (HML)"),
direction = "long")
step1_long$Factor <- factor(step1_long$Factor,
levels = c("Market (MKT)", "Size (SMB)", "Value (HML)"))
ggplot(step1_long, aes(x = date, y = Lambda, colour = Factor)) +
geom_line(alpha = 0.55, linewidth = 0.35) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey30") +
facet_wrap(~ Factor, ncol = 1, scales = "free_y") +
scale_colour_manual(values = pal) +
labs(title = "Step 1: Risk Premium Estimates Over Time",
subtitle = "Cross-sectional lambda for each trading day",
x = NULL, y = "Lambda") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
strip.text = element_text(face = "bold"))ggplot(step1_long, aes(x = Lambda, fill = Factor)) +
geom_histogram(bins = 60, colour = "white", alpha = 0.85) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey30") +
facet_wrap(~ Factor, scales = "free") +
scale_fill_manual(values = pal) +
labs(title = "Step 1: Distribution of Per-Period Lambda Estimates",
x = "Lambda (Risk Premium)", y = "Count") +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
strip.text = element_text(face = "bold"))\[\hat\lambda_k = \frac{1}{T}\sum_{t=1}^T \hat\lambda_{k,t} \qquad t_k = \frac{\hat\lambda_k}{s(\hat\lambda_k)/\sqrt{T}}\]
T_obs <- nrow(step1)
lcols <- c("lambda_0", "lambda_MKT", "lambda_SMB", "lambda_HML")
means <- colMeans(step1[lcols])
sds <- apply(step1[lcols], 2, sd)
se <- sds / sqrt(T_obs)
t_stats <- means / se
p_vals <- 2 * pt(-abs(t_stats), df = T_obs - 1)
results <- data.frame(
Factor = c("Intercept", "Market (MKT)", "Size (SMB)", "Value (HML)"),
Estimate = round(means, 6),
Std_Error = round(se, 6),
t_stat = round(t_stats, 4),
p_value = round(p_vals, 4),
Sig = ifelse(p_vals < 0.001, "***",
ifelse(p_vals < 0.01, "**",
ifelse(p_vals < 0.05, "*",
ifelse(p_vals < 0.10, ".", ""))))
)
rownames(results) <- NULL
results Factor Estimate Std_Error t_stat p_value Sig
1 Intercept 0.000093 0.001602 0.0579 0.9538
2 Market (MKT) -0.000229 0.001589 -0.1444 0.8852
3 Size (SMB) 0.000208 0.001263 0.1650 0.8690
4 Value (HML) 0.000590 0.001158 0.5094 0.6106
Significance:
***p < 0.001**p < 0.01*p < 0.05.p < 0.1
plot_data <- results[-1, ]
plot_data$Sig_flag <- plot_data$p_value < 0.05
ggplot(plot_data,
aes(x = Factor, y = Estimate,
ymin = Estimate - 1.96 * Std_Error,
ymax = Estimate + 1.96 * Std_Error,
colour = Sig_flag)) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
geom_errorbar(width = 0.18, linewidth = 1.1) +
geom_point(size = 4.5) +
scale_colour_manual(values = c("TRUE" = "#1565C0", "FALSE" = "#B71C1C"),
labels = c("TRUE" = "p < 0.05", "FALSE" = "Not significant"),
name = NULL) +
labs(title = "Fama-MacBeth: Risk Premium Estimates",
subtitle = "Error bars = 95% confidence interval",
x = NULL, y = "Average Lambda") +
theme_minimal(base_size = 13) +
theme(legend.position = "top",
plot.title = element_text(face = "bold"))pmg()The plm package provides a direct Fama-MacBeth
implementation via pmg().
pdata <- pdata.frame(df2, index = c("symbol", "date"), drop.index = FALSE)
fm_plm <- pmg(ri ~ b_MKT + b_SMB + b_HML,
data = pdata,
index = c("symbol", "date"))
summary(fm_plm)Mean Groups model
Call:
pmg(formula = ri ~ b_MKT + b_SMB + b_HML, data = pdata, index = c("symbol",
"date"))
Balanced Panel: n = 10, T = 783, N = 7830
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-7.477021e-02 -1.341936e-02 9.247065e-05 1.319155e-02 6.968253e-02
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) -0.00043055 0.00013696 -3.1436 0.001669 **
b_MKT NA NA NA NA
b_SMB NA NA NA NA
b_HML NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 3.0331
Residual Sum of Squares: 3.0317
Multiple R-squared: 0.00043584
Corrects for serial autocorrelation in the lambda time series (2 lags).
for (col in c("lambda_MKT", "lambda_SMB", "lambda_HML")) {
fit <- lm(step1[[col]] ~ 1)
nw <- coeftest(fit, vcov = NeweyWest(fit, lag = 2, prewhite = FALSE))
cat("\n", col, ":\n", sep = "")
print(nw)
}
lambda_MKT:
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00022942 0.00159930 -0.1435 0.886
lambda_SMB:
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00020839 0.00124461 0.1674 0.8671
lambda_HML:
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00058993 0.00113211 0.5211 0.6025
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Asia/Ulaanbaatar
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_4.0.2 tidyr_1.3.2 dplyr_1.2.1 sandwich_3.1-1 lmtest_0.9-40
[6] zoo_1.8-15 plm_2.6-7
loaded via a namespace (and not attached):
[1] sass_0.4.10 generics_0.1.4 lattice_0.22-9 digest_0.6.39
[5] magrittr_2.0.5 evaluate_1.0.5 grid_4.5.3 RColorBrewer_1.1-3
[9] fastmap_1.2.0 jsonlite_2.0.0 Formula_1.2-5 purrr_1.2.2
[13] scales_1.4.0 jquerylib_0.1.4 Rdpack_2.6.6 cli_3.6.6
[17] rlang_1.2.0 rbibutils_2.4.1 miscTools_0.6-30 withr_3.0.2
[21] cachem_1.1.0 yaml_2.3.12 tools_4.5.3 parallel_4.5.3
[25] bdsmatrix_1.3-7 maxLik_1.5-2.2 vctrs_0.7.3 R6_2.6.1
[29] lifecycle_1.0.5 MASS_7.3-65 pkgconfig_2.0.3 pillar_1.11.1
[33] bslib_0.10.0 gtable_0.3.6 glue_1.8.0 Rcpp_1.1.1
[37] collapse_2.1.6 xfun_0.57 tibble_3.3.1 tidyselect_1.2.1
[41] rstudioapi_0.18.0 knitr_1.51 farver_2.1.2 htmltools_0.5.9
[45] nlme_3.1-168 labeling_0.4.3 rmarkdown_2.31 compiler_4.5.3
[49] S7_0.2.1
Replicates the tutorial by The Data Hall — YouTube: https://www.youtube.com/watch?v=dLvjmYj-PVA | Data: https://payhip.com/b/uqxcH