1 Introduction

Fama-MacBeth (1973) regression is a two-step procedure widely used in empirical asset pricing. It is especially powerful when a dataset has many cross-sectional units (firms, portfolios) but a limited number of time periods.

Its key advantage is providing standard errors corrected for cross-sectional correlation, which makes inference more reliable than pooled OLS.

1.1 The Three-Step Procedure

Step Name What happens
Step 0 Time-Series Regression Regress each firm’s return on risk factors → get betas
Step 1 Cross-Sectional Regression For each date t, regress returns on the betas from Step 0
Step 2 Average the Coefficients Average Step 1 estimates across all T periods; test via t-test

2 Packages

pkgs <- c("plm", "lmtest", "sandwich", "dplyr", "tidyr", "ggplot2")
new  <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(new)) install.packages(new, repos = "https://cloud.r-project.org")

library(plm)
library(lmtest)
library(sandwich)
library(dplyr)
library(tidyr)
library(ggplot2)

3 Data

3.1 Variable Definitions

Variable Description
date Trading date
symbol Company ticker
ri Daily excess stock return
MKT Market excess return (Rm − Rf)
SMB Small-minus-Big factor
HML High-minus-Low factor

3.2 Load Data

# ── If you downloaded data.csv from payhip.com/b/uqxcH, use: ─────────────────
# df <- read.csv("data.csv")

# ── Synthetic dataset with the same structure as the tutorial ─────────────────
set.seed(42)

symbols <- c("AAPL", "MSFT", "AMZN", "GOOG", "META",
             "TSLA", "NVDA", "JPM",  "BAC",  "GS")

dates <- seq.Date(as.Date("2018-01-02"), as.Date("2020-12-31"), by = "day")
dates <- dates[!weekdays(dates) %in% c("Saturday", "Sunday")]
n     <- length(dates)

factors <- data.frame(
  date = dates,
  MKT  = rnorm(n, mean = 0.0003, sd = 0.010),
  SMB  = rnorm(n, mean = 0.0001, sd = 0.005),
  HML  = rnorm(n, mean = 0.0001, sd = 0.005)
)

true_betas <- data.frame(
  symbol = symbols,
  bMKT   = c( 1.2,  1.0,  1.3,  1.1,  1.4,  1.8,  1.5,  0.9,  1.0,  1.1),
  bSMB   = c(-0.3, -0.2,  0.1, -0.1,  0.2,  0.5,  0.3, -0.4, -0.2, -0.1),
  bHML   = c(-0.5, -0.3, -0.4, -0.6, -0.5, -0.8, -0.7,  0.3,  0.2,  0.1)
)

stock_list <- lapply(symbols, function(s) {
  b  <- true_betas[true_betas$symbol == s, ]
  ri <- b$bMKT * factors$MKT +
        b$bSMB * factors$SMB +
        b$bHML * factors$HML +
        rnorm(n, 0, 0.015)
  data.frame(date = dates, symbol = s, ri = ri)
})

df <- merge(do.call(rbind, stock_list), factors, by = "date")
cat("Rows:", nrow(df), " | Columns:", ncol(df))
Rows: 7830  | Columns: 6

3.3 Preview

head(df, 10)
         date symbol           ri        MKT          SMB           HML
1  2018-01-02   AAPL -0.005158844 0.01400958 -0.005899808 -0.0001253861
2  2018-01-02    BAC  0.018617175 0.01400958 -0.005899808 -0.0001253861
3  2018-01-02   GOOG  0.017787733 0.01400958 -0.005899808 -0.0001253861
4  2018-01-02    JPM  0.028230190 0.01400958 -0.005899808 -0.0001253861
5  2018-01-02     GS  0.015298307 0.01400958 -0.005899808 -0.0001253861
6  2018-01-02   META  0.041423986 0.01400958 -0.005899808 -0.0001253861
7  2018-01-02   NVDA  0.028802207 0.01400958 -0.005899808 -0.0001253861
8  2018-01-02   MSFT  0.001434172 0.01400958 -0.005899808 -0.0001253861
9  2018-01-02   AMZN  0.019361548 0.01400958 -0.005899808 -0.0001253861
10 2018-01-02   TSLA  0.004414289 0.01400958 -0.005899808 -0.0001253861

3.4 Summary Statistics

summary(df[c("ri", "MKT", "SMB", "HML")])
       ri                  MKT                  SMB            
 Min.   :-0.0752901   Min.   :-0.0298793   Min.   :-1.676e-02  
 1st Qu.:-0.0138257   1st Qu.:-0.0063677   1st Qu.:-3.244e-03  
 Median :-0.0003607   Median :-0.0001070   Median : 1.190e-04  
 Mean   :-0.0004306   Mean   :-0.0001625   Mean   : 1.778e-05  
 3rd Qu.: 0.0127971   3rd Qu.: 0.0066082   3rd Qu.: 3.458e-03  
 Max.   : 0.0696857   Max.   : 0.0325907   Max.   : 1.758e-02  
      HML            
 Min.   :-0.0140424  
 1st Qu.:-0.0031990  
 Median : 0.0001552  
 Mean   : 0.0002495  
 3rd Qu.: 0.0036783  
 Max.   : 0.0180233  

4 Step 0 — Time-Series Regression

For each firm we run:

\[r_{i,t} = \alpha_i + \hat\beta_{i,MKT} \cdot MKT_t + \hat\beta_{i,SMB} \cdot SMB_t + \hat\beta_{i,HML} \cdot HML_t + \varepsilon_{i,t}\]

step0_list <- lapply(symbols, function(s) {
  sub <- df[df$symbol == s, ]
  fit <- lm(ri ~ MKT + SMB + HML, data = sub)
  data.frame(symbol = s,
             b_MKT  = coef(fit)[["MKT"]],
             b_SMB  = coef(fit)[["SMB"]],
             b_HML  = coef(fit)[["HML"]])
})

step0        <- do.call(rbind, step0_list)
rownames(step0) <- NULL
step0
   symbol     b_MKT       b_SMB       b_HML
1    AAPL 1.1333636 -0.29261029 -0.51343558
2    MSFT 0.9931190 -0.02935885 -0.43969603
3    AMZN 1.3522774  0.12812690 -0.60822728
4    GOOG 1.1231302  0.00187908 -0.54792910
5    META 1.4492575  0.45209023 -0.57007662
6    TSLA 1.8797296  0.51677714 -0.94233080
7    NVDA 1.5325296  0.30480087 -0.57394295
8     JPM 0.8958927 -0.41014050  0.07536340
9     BAC 0.8795994 -0.25351288  0.04337075
10     GS 1.0092382 -0.01243681 -0.17365546

4.1 Merge betas back into main data

df2 <- merge(df, step0, by = "symbol")
head(df2, 5)
  symbol       date           ri          MKT          SMB           HML
1   AAPL 2018-01-02 -0.005158844  0.014009584 -0.005899808 -0.0001253861
2   AAPL 2019-04-05  0.005958435 -0.008052058  0.003533895 -0.0018224335
3   AAPL 2018-06-04 -0.031770047  0.001491610 -0.001003277 -0.0045588590
4   AAPL 2019-09-05  0.009865348 -0.002200651  0.002293478 -0.0099335247
5   AAPL 2020-07-08 -0.022169419 -0.005769892 -0.006896938  0.0082455558
     b_MKT      b_SMB      b_HML
1 1.133364 -0.2926103 -0.5134356
2 1.133364 -0.2926103 -0.5134356
3 1.133364 -0.2926103 -0.5134356
4 1.133364 -0.2926103 -0.5134356
5 1.133364 -0.2926103 -0.5134356

4.2 Plot: Beta loadings per firm

pal <- c("Market (MKT)" = "#1565C0",
         "Size (SMB)"   = "#2E7D32",
         "Value (HML)"  = "#E65100")

step0_long <- reshape(step0,
                      varying   = c("b_MKT", "b_SMB", "b_HML"),
                      v.names   = "Beta",
                      timevar   = "Factor",
                      times     = c("Market (MKT)", "Size (SMB)", "Value (HML)"),
                      direction = "long")
step0_long$Factor <- factor(step0_long$Factor,
                            levels = c("Market (MKT)", "Size (SMB)", "Value (HML)"))

ggplot(step0_long, aes(x = symbol, y = Beta, fill = Factor)) +
  geom_col(position = "dodge", width = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
  facet_wrap(~ Factor, scales = "free_y") +
  scale_fill_manual(values = pal) +
  labs(title    = "Step 0: Factor Loadings per Firm",
       subtitle = "Time-series OLS — ri ~ MKT + SMB + HML",
       x = NULL, y = "Beta") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        strip.text      = element_text(face = "bold"),
        axis.text.x     = element_text(angle = 45, hjust = 1))


5 Step 1 — Cross-Sectional Regression

For each date t:

\[r_{i,t} = \lambda_{0,t} + \lambda_{MKT,t}\hat\beta_{i,MKT} + \lambda_{SMB,t}\hat\beta_{i,SMB} + \lambda_{HML,t}\hat\beta_{i,HML} + u_{i,t}\]

unique_dates <- sort(unique(df2$date))

step1_list <- lapply(unique_dates, function(d) {
  sub <- df2[df2$date == d, ]
  if (nrow(sub) < 5) return(NULL)
  fit <- lm(ri ~ b_MKT + b_SMB + b_HML, data = sub)
  data.frame(date       = d,
             lambda_0   = coef(fit)[["(Intercept)"]],
             lambda_MKT = coef(fit)[["b_MKT"]],
             lambda_SMB = coef(fit)[["b_SMB"]],
             lambda_HML = coef(fit)[["b_HML"]])
})

step1        <- do.call(rbind, step1_list)
step1        <- step1[complete.cases(step1), ]
rownames(step1) <- NULL

cat("Cross-sectional regressions run:", nrow(step1), "\n")
Cross-sectional regressions run: 783 
head(step1, 8)
        date     lambda_0   lambda_MKT   lambda_SMB   lambda_HML
1 2018-01-02  0.027959280  0.008139872  0.046994590  0.053673405
2 2018-01-03 -0.012175904  0.018014760  0.022455366  0.043817934
3 2018-01-04  0.096495344 -0.099436791  0.059088947 -0.061609966
4 2018-01-05 -0.009064687  0.008219238 -0.007236964 -0.011435900
5 2018-01-08 -0.047260313  0.042248330 -0.046463760 -0.003663960
6 2018-01-09  0.086021432 -0.093788386  0.012551422 -0.079061384
7 2018-01-10  0.013619564  0.009825615  0.008979299  0.016894167
8 2018-01-11  0.062185618 -0.053850415  0.062701026  0.002870204

5.1 Plot: Lambda estimates over time

step1_long <- reshape(step1,
                      varying   = c("lambda_MKT", "lambda_SMB", "lambda_HML"),
                      v.names   = "Lambda",
                      timevar   = "Factor",
                      times     = c("Market (MKT)", "Size (SMB)", "Value (HML)"),
                      direction = "long")
step1_long$Factor <- factor(step1_long$Factor,
                            levels = c("Market (MKT)", "Size (SMB)", "Value (HML)"))

ggplot(step1_long, aes(x = date, y = Lambda, colour = Factor)) +
  geom_line(alpha = 0.55, linewidth = 0.35) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey30") +
  facet_wrap(~ Factor, ncol = 1, scales = "free_y") +
  scale_colour_manual(values = pal) +
  labs(title    = "Step 1: Risk Premium Estimates Over Time",
       subtitle = "Cross-sectional lambda for each trading day",
       x = NULL, y = "Lambda") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        strip.text      = element_text(face = "bold"))

5.2 Plot: Distribution of lambda estimates

ggplot(step1_long, aes(x = Lambda, fill = Factor)) +
  geom_histogram(bins = 60, colour = "white", alpha = 0.85) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey30") +
  facet_wrap(~ Factor, scales = "free") +
  scale_fill_manual(values = pal) +
  labs(title = "Step 1: Distribution of Per-Period Lambda Estimates",
       x = "Lambda (Risk Premium)", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        strip.text      = element_text(face = "bold"))


6 Step 2 — Average Coefficients + T-Test

\[\hat\lambda_k = \frac{1}{T}\sum_{t=1}^T \hat\lambda_{k,t} \qquad t_k = \frac{\hat\lambda_k}{s(\hat\lambda_k)/\sqrt{T}}\]

T_obs   <- nrow(step1)
lcols   <- c("lambda_0", "lambda_MKT", "lambda_SMB", "lambda_HML")
means   <- colMeans(step1[lcols])
sds     <- apply(step1[lcols], 2, sd)
se      <- sds / sqrt(T_obs)
t_stats <- means / se
p_vals  <- 2 * pt(-abs(t_stats), df = T_obs - 1)

results <- data.frame(
  Factor    = c("Intercept", "Market (MKT)", "Size (SMB)", "Value (HML)"),
  Estimate  = round(means,   6),
  Std_Error = round(se,      6),
  t_stat    = round(t_stats, 4),
  p_value   = round(p_vals,  4),
  Sig       = ifelse(p_vals < 0.001, "***",
              ifelse(p_vals < 0.01,  "**",
              ifelse(p_vals < 0.05,  "*",
              ifelse(p_vals < 0.10,  ".", ""))))
)
rownames(results) <- NULL
results
        Factor  Estimate Std_Error  t_stat p_value Sig
1    Intercept  0.000093  0.001602  0.0579  0.9538    
2 Market (MKT) -0.000229  0.001589 -0.1444  0.8852    
3   Size (SMB)  0.000208  0.001263  0.1650  0.8690    
4  Value (HML)  0.000590  0.001158  0.5094  0.6106    

Significance: *** p < 0.001   ** p < 0.01   * p < 0.05   . p < 0.1

6.1 Plot: Final FM estimates with 95% CI

plot_data          <- results[-1, ]
plot_data$Sig_flag <- plot_data$p_value < 0.05

ggplot(plot_data,
       aes(x = Factor, y = Estimate,
           ymin = Estimate - 1.96 * Std_Error,
           ymax = Estimate + 1.96 * Std_Error,
           colour = Sig_flag)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_errorbar(width = 0.18, linewidth = 1.1) +
  geom_point(size = 4.5) +
  scale_colour_manual(values = c("TRUE" = "#1565C0", "FALSE" = "#B71C1C"),
                      labels = c("TRUE" = "p < 0.05", "FALSE" = "Not significant"),
                      name = NULL) +
  labs(title    = "Fama-MacBeth: Risk Premium Estimates",
       subtitle = "Error bars = 95% confidence interval",
       x = NULL, y = "Average Lambda") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top",
        plot.title      = element_text(face = "bold"))


7 PLM Method — pmg()

The plm package provides a direct Fama-MacBeth implementation via pmg().

pdata  <- pdata.frame(df2, index = c("symbol", "date"), drop.index = FALSE)
fm_plm <- pmg(ri ~ b_MKT + b_SMB + b_HML,
              data  = pdata,
              index = c("symbol", "date"))
summary(fm_plm)
Mean Groups model

Call:
pmg(formula = ri ~ b_MKT + b_SMB + b_HML, data = pdata, index = c("symbol", 
    "date"))

Balanced Panel: n = 10, T = 783, N = 7830

Residuals:
         Min.       1st Qu.        Median       3rd Qu.          Max. 
-7.477021e-02 -1.341936e-02  9.247065e-05  1.319155e-02  6.968253e-02 

Coefficients:
               Estimate  Std. Error z-value Pr(>|z|)   
(Intercept) -0.00043055  0.00013696 -3.1436 0.001669 **
b_MKT                NA          NA      NA       NA   
b_SMB                NA          NA      NA       NA   
b_HML                NA          NA      NA       NA   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 3.0331
Residual Sum of Squares: 3.0317
Multiple R-squared: 0.00043584

8 Newey-West Standard Errors

Corrects for serial autocorrelation in the lambda time series (2 lags).

for (col in c("lambda_MKT", "lambda_SMB", "lambda_HML")) {
  fit <- lm(step1[[col]] ~ 1)
  nw  <- coeftest(fit, vcov = NeweyWest(fit, lag = 2, prewhite = FALSE))
  cat("\n", col, ":\n", sep = "")
  print(nw)
}

lambda_MKT:

t test of coefficients:

               Estimate  Std. Error t value Pr(>|t|)
(Intercept) -0.00022942  0.00159930 -0.1435    0.886


lambda_SMB:

t test of coefficients:

              Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00020839 0.00124461  0.1674   0.8671


lambda_HML:

t test of coefficients:

              Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00058993 0.00113211  0.5211   0.6025

9 Interpretation

  • \(\hat\lambda_k > 0\), significant → factor is positively priced; investors demand a premium for that risk exposure.
  • \(\hat\lambda_k < 0\), significant → factor is negatively priced.
  • \(p > 0.05\) → factor does not significantly explain the cross-section of returns in this sample.

10 Session Info

sessionInfo()
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Asia/Ulaanbaatar
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_4.0.2  tidyr_1.3.2    dplyr_1.2.1    sandwich_3.1-1 lmtest_0.9-40 
[6] zoo_1.8-15     plm_2.6-7     

loaded via a namespace (and not attached):
 [1] sass_0.4.10        generics_0.1.4     lattice_0.22-9     digest_0.6.39     
 [5] magrittr_2.0.5     evaluate_1.0.5     grid_4.5.3         RColorBrewer_1.1-3
 [9] fastmap_1.2.0      jsonlite_2.0.0     Formula_1.2-5      purrr_1.2.2       
[13] scales_1.4.0       jquerylib_0.1.4    Rdpack_2.6.6       cli_3.6.6         
[17] rlang_1.2.0        rbibutils_2.4.1    miscTools_0.6-30   withr_3.0.2       
[21] cachem_1.1.0       yaml_2.3.12        tools_4.5.3        parallel_4.5.3    
[25] bdsmatrix_1.3-7    maxLik_1.5-2.2     vctrs_0.7.3        R6_2.6.1          
[29] lifecycle_1.0.5    MASS_7.3-65        pkgconfig_2.0.3    pillar_1.11.1     
[33] bslib_0.10.0       gtable_0.3.6       glue_1.8.0         Rcpp_1.1.1        
[37] collapse_2.1.6     xfun_0.57          tibble_3.3.1       tidyselect_1.2.1  
[41] rstudioapi_0.18.0  knitr_1.51         farver_2.1.2       htmltools_0.5.9   
[45] nlme_3.1-168       labeling_0.4.3     rmarkdown_2.31     compiler_4.5.3    
[49] S7_0.2.1          

Replicates the tutorial by The Data Hall — YouTube: https://www.youtube.com/watch?v=dLvjmYj-PVA | Data: https://payhip.com/b/uqxcH