Overview

In their excellent paper, Green, Hand, and Zhang (2017) (GHZ) study which characteristics in the so-called “factor zoo” provide information about U.S. stock returns. This is a central issue in empirical asset pricing, as the number of characteristics that supposedly predict returns keeps increasing, and asset pricing theory does not explicitly state which factors should be used. The methodology used in their paper is based on the Fama and MacBeth (1973) two-step regression procedure. The first step consists of a cross-sectional regression of stocks returns on firm characteristics, which is repeated for each month of the sample. In the second step, the characteristics are assessed using the time-series average and standard errors of the corresponding coefficients.

In this short article, I reproduce some of the results in GHZ using a slightly different, and longer, sample. Notably, I don’t include I/B/E/S variables, so I effectively use fewer characteristics. GHZ made available the SAS code that they used to compile a large number of firm characteristics from the accounting and empirical asset pricing literature, which was a huge step in terms of fostering reproducible research. I hope this short article is helpful to others working in this field.

Data

The data was compiled from CRSP and Compustat and covers the period from 1970 to 2018. The data was originally downloaded using an adaptation of the SAS code kindly provided by Jeremiah Green, and was used in Hwang and Rubesam (2020) to create long-short “candidate” asset pricing factors. The characteristics were first winsorized at the 1st and 99th percentiles and standardized to have a zero mean and unit standard deviation. Missing values were replaced by the mean value of zero. The sample only includes non-microcap stocks, defined as those stocks whose market capitalization at a given month were above the 20th percentile of market cap of NYSE stocks. Therefore, my results should be compared to the OLS results for non-microcaps in GHZ. The total number of firm characteristics is 85 firm, slightly lower than the 94 used in GHZ. All the pre-treatment of the data was done in SAS.

I start by loading the data. Since the file is relatively large (~1GB), I use the fread function from the data.table package. I then count and plot the number of stocks available in each month. The number of non-microcap stocks peaks at over 2,500 at the height of the dot-com bubble, and has been fluctuating between 1,500 and 2,000 more recently.

# load useful packages
library(data.table) # for fread function
library(corrplot)   # for corrplot function
library(car)        # for vif function
library(sandwich)   # for NeweyWest function

# read CRSP/Compustat data file
crsp_data <- fread("C://Users/a.rubesam/OneDrive - IESEG/Courses/Machine Learning/ML Finance/data/rps1970subsetWSStd.csv")

# format date
crsp_data$DATE <- as.Date(as.character(crsp_data$DATE), format="%Y%m%d")

# unique months in the sample
months <- unique(crsp_data$DATE)
nMonths <- length(months)

# count number of stocks each month
nStocks <- matrix(nMonths, 1)
for (i in 1:nMonths){
  nStocks[i] <- dim(crsp_data[crsp_data$DATE == months[i], ])[1]
}

# plot the number of stocks in each month
plot(nStocks ~ months, type = "l", xlab = "Date", ylab = "Number of stocks")

Missing Values and Correlations

As mentioned previously, missing values are set to the mean value at each month (0) after winsorization and standardization. Below, I calculate the percentage of missing values for each characteristics, and plot the ones with more than 20% missing values. The figures are very close to those reported by GHZ (Table 2, pg 4400).

xtrain <- model.matrix( RET ~ . -DATE - permno , data = crsp_data)[, -1]

# missing values have been replaced by zero. Let's check how many are zero for each column
percMiss <- colSums(xtrain == 0)/dim(xtrain)[1]

# plot characteristics that have more than 20% missing values
barplot(percMiss[percMiss >= 0.2], las=2, cex.names = 1)

Next, I look at the correlations among the characteristics. I use the corrplot function from the package with the same name to visualize the correlations. Overall, most characteristics have relatively low cross-correlations, but a few extreme correlations can be seen in the graph. These may cause multicollinearity issues. In order to identify the characteristics more likely to cause multicollinearity problems, I first calculate variance inflation factors (VIFs) from a pooled regression using the vif function from the car package. Overall, the distribution of the VIFs is very similar to that reported by GHZ.

corrMat <- cor(xtrain)
corrplot(corrMat, tl.cex = 0.5)

# run a large pooled regression to get some VIFs
fm <- as.formula(paste("RET ~ ", paste(colnames(xtrain), collapse = "+")))
ols.pooled <- lm(fm, data = crsp_data)

# get the VIFs
vifs <- vif(ols.pooled)

# some descriptive stats of the VIFs
summary(vifs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.013   1.263   1.859   3.077   3.173  17.386
quantile(vifs, c(0, 0.1, 0.5, 0.9, 1))
##        0%       10%       50%       90%      100% 
##  1.013356  1.115168  1.859322  7.879090 17.386225

Several characteristics have high VIFs. GHZ exclude some characteristics until the highest VIF is lower than 7. I start by removing the squared beta (betasq), which has a VIF of 16.15 in my sample. This allows me to keep beta, which makes more sense from an asset pricing perspective. I then recalculate the VIFs and remove variables with highest VIF until the maximum VIF is lower than 7. In the end, the following six variables are removed: betasq, stdcf, dolvol, mom6m, quick, retvol. This list is similar, but not identical, to the list of removed variables in GHZ. The maximum VIF after removing these six variables is 5.86.

# characteristics with VIFs > 7
vifs[vifs > 7]
##    currat     quick       mve    stdacc     stdcf     mom6m    dolvol    retvol 
##  7.706295  8.147438  8.847172 10.636433 10.712662  9.214088  9.623314  7.994286 
##      beta    betasq 
## 17.386225 16.148264
# remove a few characteristics with high VIF. Start with betasq
removed_variables <- "betasq"
xtrain2 <- xtrain[, -which(colnames(xtrain) == removed_variables)]
fm <- as.formula(paste("RET ~ ", paste(colnames(xtrain2), collapse = "+")))
ols.pooled <- lm(fm, data = crsp_data)
vifs <- vif(ols.pooled)
vifs[vifs > 7]
##    currat     quick       mve    stdacc     stdcf     mom6m    dolvol    retvol 
##  7.687820  8.132379  8.829127 10.635078 10.711900  9.213691  9.557715  7.981722
# do a loop to remove characteristics until the maximum VIF < 7
max_vif <- max(vifs)
while (max_vif > 7){
  removed_variables <- names(which(vifs == max(vifs)))  
  xtrain2 <- xtrain2[, -which(colnames(xtrain2) == removed_variables)]
  fm <- as.formula(paste("RET ~ ", paste(colnames(xtrain2), collapse = "+")))
  ols.pooled <- lm(fm, data = crsp_data)
  vifs <- vif(ols.pooled)
  max_vif <- max(vifs)
  print(paste("Variable", removed_variables, "was removed. Maximum VIF is now", max_vif))
}
## [1] "Variable stdcf was removed. Maximum VIF is now 9.55769383869023"
## [1] "Variable dolvol was removed. Maximum VIF is now 9.20882375380867"
## [1] "Variable mom6m was removed. Maximum VIF is now 8.12848654617556"
## [1] "Variable quick was removed. Maximum VIF is now 7.91916356826568"
## [1] "Variable retvol was removed. Maximum VIF is now 5.86191251697774"

The total number of firm characteristics after removing the variables more likely to cause multicollinearity is 79. I now run cross-sectional regressions of returns on these characteristics each month, and store the coefficients.

# to make the output a bit nicer, I sort the names in alphabetical order
fm <- as.formula(paste("RET ~ ", paste(sort(colnames(xtrain2)), collapse="+")))
coeffs <- matrix(0, nrow = nMonths, ncol = ncol(xtrain2))
colnames(coeffs) <- sort(colnames(xtrain2))

for (i in 1:nMonths){
  ols_month <- lm(fm, data = crsp_data[crsp_data$DATE == months[i], ])
  coeffs[i, ] <- ols_month$coefficients[2:length(ols_month$coefficients)]
}

Calculating the Fama-MacBeth Coefficients

All that’s left is to summarize the coefficients. It should be noted that some characteristics have missing values for all stocks in some of the first months. I follow GHZ and report the Fama-MacBeth coefficients as the means of the monthly estimated coefficients, multiplied by 100. The t−statistics are calculated using the time series of monthly coefficient estimates, adjusted using the Newey-West procedure with 12 lags.

Despite the differences in sample period and variables, the values of coefficients and t-stats are very close to those reported by GHZ (Table 5, column B, pages 4409-4410). In their paper, GHZ further correct the p-values based on the Fama-MacBeth t-stats to adjust for multiple testing. They comment that this approach leads to similar conclusions to adopting the recommendation of Harvey, Liu, and Zhu (2016), i.e. using a t-stat above 3 in absolute value. There are 12 characteristics with absolute t-stat larger than 3 in my results, which are shown below. This list is almost identical to that reported by GHZ, with a few exceptions (I/B/E/S characteristics and one that I excluded, but they did not).

# create a matrix to store Fama-MacBeth coeffs and t-stats
coeffs.FM <- matrix(0, nrow = ncol(xtrain2), ncol = 2)

# run a regression of the coefficients on a constant and apply Newey-West
for (i in 1:ncol(coeffs)){
  ols.FM <- lm(coeffs[, i] ~ 1)
  ols_NW <- NeweyWest(ols.FM, lag = 12)
  coeffs.FM[i, 1] <- ols.FM$coefficients*100
  coeffs.FM[i, 2] <- ols.FM$coefficients/sqrt(ols_NW)
}

rownames(coeffs.FM) <- sort(colnames(xtrain2))
colnames(coeffs.FM) <- c("FM Coeff", "t-stat")

coeffs.FM[abs(coeffs.FM[,2 ]) > 3, ]
##             FM Coeff    t-stat
## agr      -0.16606607 -4.656516
## cash      0.15958592  3.942712
## ear       0.06222597  5.493107
## grcapx   -0.03907752 -3.237871
## ill      -0.07981191 -3.681215
## mom1m    -0.44679470 -8.037031
## mve      -0.17548513 -3.923681
## nincr     0.08377441  5.844713
## rd_mve    0.11984191  3.445927
## std_turn  0.20462272  5.966624
## stdacc   -0.40666848 -6.656700
## turn     -0.28343205 -7.577388

What are the magnitudes of the risk premia for some prominent factors/anomalies? Below, I show the coefficients and t-statistics for size (mve), book-to-market (bm), asset growth (agr), 12-month momentum (mom12m), and return on equity (roeq). For example, an increase of one standard deviation in the standardized size characteristic is associated with a -0.18% decrease in return (compared to -0.17% in GHZ), confirming the expectation that smaller firms outperform large firms. For book-to-market, the corresponding figure is 0.06% (compared to 0.07 in GHZ), confirming the value effect, although in the presence of other characteristics, it is not significant, as in GHZ. The coefficient on asset growth has the expected negative magnitude (Cooper, Gulen, and Schill, 2008). Similar comments apply to momentum and return on equity (positively associated with future returns).

factors <- c("mve", "bm", "agr", "mom12m", "roeq")
coeffs.FM[factors,]
##           FM Coeff    t-stat
## mve    -0.17548513 -3.923681
## bm      0.06449908  1.913136
## agr    -0.16606607 -4.656516
## mom12m  0.14188386  2.264414
## roeq    0.04999324  2.019661

The Post-2003 Period

GHZ show that a structural break ocurred around 2003, after which fewer characteristics seem to be relevant, and the profitability of a strategy that uses characteristics to forecast returns largely disappears. Below, I repeat the estimation for the post-2003 period. Considering the combined results from two sets of regressions (one using weighted least squares and all stocks, the other using only non-microcap stocks, as I do), GHZ find that only two characteristics are significant: the industry-adjusted change in number of employees (chempia) and the number of consecutive quarters with earnings (nincr). In my sample, the t-stats for the post-2003 period (2004-2018, compared to 2004-2014 in GHZ) for these characteristics are 2.23 and 1.81, as shown below. In contrast, I find that two other characteristics, earnings announcement returns (ear) and a variable that measures sales concentration (herf) have an absolute t-stat above 3. While ear also has a high t-stat in the post-2003 period in GHZ (t-stat = 2.7), they report a much lower t-stat for herf, which may be due to its correlation with some characteristics that exclude.

# identify months after 2003 
post2003 <- which(year(months)>2003)
# create a matrix to store Fama-MacBeth coeffs and t-stats
coeffs.FM.Post2003 <- matrix(0, nrow = ncol(xtrain2), ncol = 2)

# run a regression of the coefficients on a constant and apply Newey-West
for (i in 1:ncol(coeffs)){
  ols.FM <- lm(coeffs[post2003, i] ~ 1)
  ols_NW <- NeweyWest(ols.FM, lag = 12)
  coeffs.FM.Post2003[i, 1] <- ols.FM$coefficients*100
  coeffs.FM.Post2003[i, 2] <- ols.FM$coefficients/sqrt(ols_NW)
}

rownames(coeffs.FM.Post2003) <- sort(colnames(xtrain2))
colnames(coeffs.FM.Post2003) <- c("FM Coeff", "t-stat")


# t-stats of significant characteristics after 2003 in GHZ:
coeffs.FM.Post2003[c("chempia", "nincr"), ]
##           FM Coeff   t-stat
## chempia 0.05852378 2.236852
## nincr   0.04297246 1.814560
# characteristics with |t-stat| > 3 
coeffs.FM.Post2003[abs(coeffs.FM.Post2003[,2 ]) > 3, ]
##         FM Coeff    t-stat
## ear   0.09265991  4.579091
## herf -0.05292658 -3.102928

Concluding Remarks

This short article illustrates the Fama-MacBeth regression approach by replicating some of the results in Green, Hand, and Zhang (2017). Confirming their results, I find that about 12 characteristics are significant determinants of average U.S. stock returns, despite the differences in sample period and the characteristics used. In the post-2003 period, I also confirm that only two characteristics seem to be important.