SMB_HML_CMA_RMW

My R Code

Load packages

options(scipen=999) # shut down scientific notation
library(tidyverse);library(Hmisc)
rm(list=ls());gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2134209 114.0    4300632 229.7  2765582 147.7
## Vcells 3574882  27.3    8388608  64.0  7007195  53.5

Load data

CRSP

You will need to download the data from CRSP, to which WRDS has access. The data items used here are given below.

CRSP Permanent Number (PERMNO) (select PERMNO as company codes)
CRSP Permanent Company Number (permco)
Share Code (shrcd)
Exchange Code (exchcd)
Primary Exchange (primexch)
Price (prc)
Holding Period Return (ret)
Number of Shares Outstanding (shrout)
Delisting Return (dlret)

Note the difference between PERMNO and PERMCO, where PERMNO is security-specific and PERMCO is company-specific. For example, Google has two class of shares, therefore Google has one PERMCO but two PERMNO.

CCM

In addition to CRSP, I use the CRSP-COMPUSTAT-Merged link table (CCM) to connect CRSP to COMPUSTAT. CCM is accessible under CRSP in WRDS. I download the entire link table as it is not a big file.

Code

# Load data
crsp <- readRDS("Data/CRSP_2022.rds")
cmp_a <- readRDS("Data/COMPUSTAT_2022.rds")
ccm <- read.csv("Data/CCM.csv")

# convert all column names to lower case to keep consistent
names(crsp) <- tolower(names(crsp))
names(cmp_a) <- tolower(names(cmp_a))
names(ccm) <- tolower(names(ccm))

Process CRSP

This section deals with the raw CRSP data. It includes converting date format, computing firm size, and filling missing returns with zeros.

One can argue about whether filling missing returns with zeros is reasonable, as there can be many different reasons for a return to be missing. If a stock is suspended from trading within month t while it is selected in your portfolio (long or short) at the end of month t-1, it will have a missing return but it takes a non-zero weight in your portfolio in month t. Without further information, taking a missing return as zero would make more sense than ignoring it in your portfolio in this case.

You can comment out the three-line code to ignore delisting returns, though it does not make any difference.

Code

######################
##   process crsp   ##
######################
df <- crsp

df$date.o <- df$date
df$date <- as.integer(substr(df$date, 1, 6))
df$year <- as.integer(substr(df$date, 1, 4))
df$month <- as.integer(substr(df$date, 5, 6))
# a minus sign before PRC means it is the average of bid and ask prices
df$prc <- abs(df$prc) # absolute value of price 
df$me <- df$prc * df$shrout # firm ME (ME--market equity)
df$ret <- as.numeric(df$ret) # NAs will be introduced by coercion
df$dlret <- as.numeric(df$dlret)

## Deal with Delist returns and NA returns
df$ret.o <- df$ret # keep the original return data
df[is.na(df$ret), 'ret'] <- 0
df[is.na(df$dlret), 'dlret'] <- 0
# Take the delist return into account
df$ret <- (1+df$ret)*(1+df$dlret) - 1

# lag me 
df <- df %>%
  group_by(permno) %>%
  arrange(date, .by_group=T) %>%
  mutate(
    me.lag = dplyr::lag(me, 1),
    date.dif = date - dplyr::lag(date,1)
  )
# df$me.lag[!df$date.dif %in% c(1,89) & !is.na(df$date.dif)] <- NA

# screening stocks following conventions
df <- df[(df$shrcd%in%c(10,11)) & (df$primexch%in%c('N','A','Q')),]

# portfolios are formed at the end of June in each year
df$year.ff <- df$year
df$year.ff[df$month<=6] <- df$year.ff[df$month<=6]-1

# save 
crsp.ready <- df

Process CCM and merge CRSP and CCM

This section merges CRSP with CCM. You can find more information about how to use CCM on WRDS: https://wrds-www.wharton.upenn.edu/pages/classroom/using-crspcompustat-merged-database/.

After combining the two datasets, I use gvkey as the firm id to construct portfolios. This procedure can help keep consistent with the other factor constructions.

Code

#####################
##   process ccm   ##
#####################
df <- ccm

# "NU" "LU" "NR" "LC" "LX" "LD" "LN" "LS" "NP"
df <- df[df$linktype %in% c('LC','LU'), ]
df <- df[df$linkprim %in% c('P','C','J'),]
df <- df[c('lpermno','gvkey','cik','linkdt','linkenddt')]
colnames(df)[1] <- 'permno'
df$linkenddt <- as.integer(df$linkenddt)
df$linkenddt <- ifelse(is.na(df$linkenddt), Inf, df$linkenddt)

# merge crsp and ccm
df <- merge(crsp.ready, df, by='permno')
# the ccm link is only valid between linkdt and linkenddt
df <- df[df$date.o >= df$linkdt & df$date.o <= df$linkenddt, ]
df$linkdt <- NULL
df$linkenddt <- NULL

# sum up the size (me) of different permno under the same gvkey
df <- df %>%
  group_by(gvkey, date) %>%
  mutate(
    me.sum = sum(me),
    me.sum.lag = sum(me.lag)
  )

df[c("permco","cusip",'shrcd','shrcls','dlret','retx',"shrout")] <- NULL

# save
crsp_ccm <- df

June in year t (dataframe pt6) and December in year t-1

Code

##################
## FF Procedure ##
##################
# ME in June t-1
df <- crsp_ccm[crsp_ccm$month==6, c("gvkey","year","me.sum","primexch")]
df$me.jun <- df$me.sum
df <- df[!duplicated(df[c('gvkey','year')]),]
pt6 <- df

# need ME.Dec as to cut BEME
# ME in Dec t-1
df <- crsp_ccm[crsp_ccm$month==12, c("gvkey","year","me.sum")]
df <- df[!duplicated(df[c('gvkey','year')]),]
df$year <- df$year + 1
names(df)[3] <- 'me.dec'
pt6 <- merge(pt6, df, by=c('gvkey','year'), all.x = T)

Compustat Annual Data (dataframe cmp_a)

Code

###########################
##   process compustat   ##
###########################
df <- cmp_a

# index is from 1 to n
df <- df %>%
  group_by(gvkey) %>%
  arrange(fyear, .by_group=T) %>%
  mutate(
    index = 1:n(),
    year.dif = fyear - dplyr::lag(fyear,1)
  )
# remove the first two obs for each firm due the back-filling practice
# df <- df[df$index>2,] # remove the first two obs

# # Book value (annual) according to Fama-French (1993)
# # create shareholders' equity (seq)
# df$seq <- ifelse(is.na(df$seq), 0, df$seq)
# # create preferred stock (ps)
# # (pstkrv, pstkl, pstk, 0)
# df$ps <- ifelse(is.na(df$pstkrv), df$pstkl, df$pstkrv)
# df$ps <- ifelse(is.na(df$ps), df$pstk, df$ps)
# df$ps <- ifelse(is.na(df$ps), 0, df$ps)
# # fill NA of txditcq with 0
# df$txditc <- ifelse(is.na(df$txditc), 0, df$txditc)
# # create book equity (be)
# df$be <- df$seq + df$txditc - df$ps

# The easy way out
df$be <- df$ceq

# lagged at
df <- df %>%
  group_by(gvkey) %>%
  arrange(fyear, .by_group = T) %>%
  mutate(
    across(all_of(c('at')), ~dplyr::lag(.x, 1), .names = "{.col}.lag")
  )
df$at.lag[df$year.dif!=1&!is.na(df$year.dif)] <- NA

# RMW factor
# to be included, a firm needs a non-missing revt 
#   and non-missing value for one of ('cogs','xsga','xint')
# roe.ok == T means a firm is qualified to be included 
df$revt.ok <- ifelse(!is.na(df$revt), T, F) # T means revt is ok
df$roe.ok <- rowSums(!is.na(df[c('cogs','xsga','xint')]))>=1 # T means others are ok
df$roe.ok <- df$roe.ok + df$revt.ok == 2 # T mean both requirements are satisfied

df$revt[is.na(df$revt)] <- 0
df$cogs[is.na(df$cogs)] <- 0
df$xsga[is.na(df$xsga)] <- 0
df$xint[is.na(df$xint)] <- 0

## FF profitability: (revt-cogs-xsga-xint)/be
df$roe <- (df$revt-df$cogs-df$xsga-df$xint)/df$be

########
df$dat <- (df$at - df$at.lag)
df$dat__at <- df$dat / df$at.lag

# we remove the obs with at.lag=0
df$dat__at[df$at.lag==0&!is.na(df$at.lag)] <- NA

## Why use the fyear + 1? Refer to FF (2015)
df$aval_year <- df$fyear + 1

# save
df.save <- df

# fundamental data (df)  is on fyear+1 and 
#        crsp data (pt6) is on June of each calendar year
df <- merge(pt6,
            df.save[c("gvkey","aval_year","index",
                      "be","roe","roe.ok","dat__at",'sic')],
            by.x=c('gvkey','year'),
            by.y=c('gvkey','aval_year'), 
            all.x = T)
# Size cut uses me in June t, beme uses me in Dec t-1
df$me <- df$me.jun
df$beme <- df$be / df$me.dec
pt6_cmp <- df

Form portfolios and compute factor returns

This section cuts firms into 2 ME groups, 3 IA groups, and 3 ROE groups independently at the end of June in year t. Note that Fama-French and Hou-Xue-Zhang always use NYSE cut-points to form portfolios, hence I use [primexch==“N”] when calculating the cut-points to take care of this convention.

Code

#########################################
## Form portfolios and compute returns ##
#########################################
year.st = 1963
# year.ed = 2021

# The list to store results of cma, rmw, hml, smb
res.list <- list()

#######################################
### factor that you are replicating ###
#######################################
var.factor <- 'cma'

# parameters
fin.ex = F
be.neg.ex = F

# variables used
var.test.1 <- 'me'
var.test.2 <- 'dat__at'

#######################
df <- pt6_cmp
df <- df[df$index>=2,]

if(fin.ex==T){
  # df <- df[df$sic<4900|df$sic>4999,]
  df <- df[df$sic<6000|df$sic>6999,]
  # df <- df[df$sic<9000,]
}
if(be.neg.ex==T){
  df <- df[df$be>0,]
}

df <- df[df$year>=year.st,]
# df <- df[df$year<=year.ed,]
df <- df[!is.na(df[,var.test.1]),] 
df <- df[!is.na(df[,var.test.2]),]

# Rank var.1 and var.2
pt <- df %>%
  group_by(year) %>%
  mutate(
    across(all_of(var.test.1), ~findInterval(.x, quantile(.x[primexch=="N"], probs=c(0.5)))+1, .names = "rank.var1"),
    across(all_of(var.test.2), ~findInterval(.x, quantile(.x[primexch=="N"], probs=c(0.3,0.7)))+1, .names = "rank.var2")
  )

# merge back to CRSP
df <- merge(crsp_ccm[c("gvkey","year.ff","date","ret","me.lag")], 
            pt[c('gvkey','year','rank.var1','rank.var2')],
            by.x=c('gvkey','year.ff'), by.y=c('gvkey','year'))

df <- df[!is.na(df$me.lag),]
# factor: 2-by-3 portfolio
pt.factor <- df %>%
  group_by(date, rank.var1, rank.var2) %>%
  summarise(
    # nstocks = n(),
    ret_vw = weighted.mean(ret, me.lag, na.rm = TRUE)*100,
    # ret_ew = mean(ret, na.rm = TRUE)*100,
    .groups='drop')
pt.factor$group <- paste(pt.factor$rank.var1, pt.factor$rank.var2)
pt.factor <- reshape2::dcast(pt.factor, date ~ group, value.var = 'ret_vw')

res <- pt.factor
res$var <- (res$`1 3`+res$`2 3`)/2 - (res$`1 1`+res$`2 1`)/2
res$var <- -res$var
names(res)[ncol(res)] <- var.factor

res$smb <- (res$`1 1`+res$`1 2`+res$`1 3`)/3 - (res$`2 1`+res$`2 2`+res$`2 3`)/3
res <- res[order(res$date),]
#
res.list[[var.factor]] <- res
# write.csv(res.list[[var.factor]],
#           paste0('ff_',var.factor,'_2_3_portfolios.csv'),
#           row.names = F)


#######################################
### factor that you are replicating ###
#######################################
var.factor <- 'rmw'

# parameters
fin.ex = F
be.neg.ex = T

# variables used
var.test.1 <- 'me'
var.test.2 <- 'roe'

#######################
df <- pt6_cmp
df <- df[df$index>=2,]

if(fin.ex==T){
  df <- df[df$sic<6000|df$sic>6999,]
}
if(be.neg.ex==T){
  df <- df[df$be>0,]
}

df <- df[df$year>=year.st,]
# df <- df[df$year<=year.ed,]
df <- df[!is.na(df[,var.test.1]),] 
df <- df[!is.na(df[,var.test.2]),]
# unique to RMW factor
df <- df[df$roe.ok==T,]

# Rank var.1 and var.2
pt <- df %>%
  group_by(year) %>%
  mutate(
    across(all_of(var.test.1), ~findInterval(.x, quantile(.x[primexch=="N"], probs=c(0.5)))+1, .names = "rank.var1"),
    across(all_of(var.test.2), ~findInterval(.x, quantile(.x[primexch=="N"], probs=c(0.3,0.7)))+1, .names = "rank.var2")
  )

# merge back to CRSP
df <- merge(crsp_ccm[c("gvkey","year.ff","date","ret","me.lag")], 
            pt[c('gvkey','year','rank.var1','rank.var2')],
            by.x=c('gvkey','year.ff'), by.y=c('gvkey','year'))

df <- df[!is.na(df$me.lag),]
# factor: 2-by-3 portfolio
pt.factor <- df %>%
  group_by(date, rank.var1, rank.var2) %>%
  summarise(
    # nstocks = n(),
    ret_vw = weighted.mean(ret, me.lag, na.rm = TRUE)*100,
    # ret_ew = mean(ret, na.rm = TRUE)*100,
    .groups='drop')
pt.factor$group <- paste(pt.factor$rank.var1, pt.factor$rank.var2)
pt.factor <- reshape2::dcast(pt.factor, date ~ group, value.var = 'ret_vw')

res <- pt.factor
res$var <- (res$`1 3`+res$`2 3`)/2 - (res$`1 1`+res$`2 1`)/2
names(res)[ncol(res)] <- var.factor

res$smb <- (res$`1 1`+res$`1 2`+res$`1 3`)/3 - (res$`2 1`+res$`2 2`+res$`2 3`)/3
res <- res[order(res$date),]
#
res.list[[var.factor]] <- res
# write.csv(res.list[[var.factor]],
#           paste0('ff_',var.factor,'_2_3_portfolios.csv'),
#           row.names = F)


#######################################
### factor that you are replicating ###
#######################################
var.factor <- 'hml'

# parameters
fin.ex = F
be.neg.ex = T

# variables used
var.test.1 <- 'me'
var.test.2 <- 'beme'

#######################
df <- pt6_cmp
df <- df[df$index>=2,]

if(fin.ex==T){
  df <- df[df$sic<6000|df$sic>6999,]
}
if(be.neg.ex==T){
  df <- df[df$be>0,]
}

df <- df[df$year>=year.st,]
# df <- df[df$year<=year.ed,]
df <- df[!is.na(df[,var.test.1]),] 
df <- df[!is.na(df[,var.test.2]),]

# Rank var.1 and var.2
pt <- df %>%
  group_by(year) %>%
  mutate(
    across(all_of(var.test.1), ~findInterval(.x, quantile(.x[primexch=="N"], probs=c(0.5)))+1, .names = "rank.var1"),
    across(all_of(var.test.2), ~findInterval(.x, quantile(.x[primexch=="N"], probs=c(0.3,0.7)))+1, .names = "rank.var2")
  )

# merge back to CRSP
df <- merge(crsp_ccm[c("gvkey","year.ff","date","ret","me.lag")], 
            pt[c('gvkey','year','rank.var1','rank.var2')],
            by.x=c('gvkey','year.ff'), by.y=c('gvkey','year'))

df <- df[!is.na(df$me.lag),]
# factor: 2-by-3 portfolio
pt.factor <- df %>%
  group_by(date, rank.var1, rank.var2) %>%
  summarise(
    # nstocks = n(),
    ret_vw = weighted.mean(ret, me.lag, na.rm = TRUE)*100,
    # ret_ew = mean(ret, na.rm = TRUE)*100,
    .groups='drop')
pt.factor$group <- paste(pt.factor$rank.var1, pt.factor$rank.var2)
pt.factor <- reshape2::dcast(pt.factor, date ~ group, value.var = 'ret_vw')

res <- pt.factor
res$var <- (res$`1 3`+res$`2 3`)/2 - (res$`1 1`+res$`2 1`)/2
names(res)[ncol(res)] <- var.factor

res$smb <- (res$`1 1`+res$`1 2`+res$`1 3`)/3 - (res$`2 1`+res$`2 2`+res$`2 3`)/3
res <- res[order(res$date),]
#
res.list[[var.factor]] <- res
# write.csv(res.list[[var.factor]],
#           paste0('ff_',var.factor,'_2_3_portfolios.csv'),
#           row.names = F)

Verify the results and plot the cumulative returns

This section plots the cumulative returns of my result (with .rp) and factor returns from the authors’ websites. You need to download it from their website and save it in your “Data” folder. It helps visualize the consistency and it concludes my replication.

Code

##########################
### Verify the results ###
##########################
# load Fama-French factor
ff <- read.csv("Data/FF_All_Factors.csv")
names(ff) <- tolower(names(ff))

res <- res.list$cma[c('date')]
res$cma.rp <- res.list$cma$cma
res$rmw.rp <- res.list$rmw$rmw
res$hml.rp <- res.list$hml$hml

res$smb.cma.rp <- res.list$cma$smb
res$smb.rmw.rp <- res.list$rmw$smb
res$smb3.rp <- res.list$hml$smb

res$smb5.rp <- (res$smb.cma.rp+res$smb.rmw.rp+res$smb3.rp)/3

res <- merge(res, ff, by='date')
res <- res[order(res$date),]

write.csv(res, 'ff_factors_all.csv', row.names = F)
#
res.sum <- data.frame(factor = c('cma','rmw','hml','smb3','smb5'),
                      corr = NA, mean.ff = NA, mean.rp = NA,
                      t_stat.ff = NA, t_stat.rp = NA)
#
for (i in 1:length(res.sum$factor)) {
  # i=1
  #
  i <- res.sum$factor[i]
  i.rp <- paste0(i,'.rp')
  #
  res.sum$corr[res.sum$factor==i] <- cor(res[i], res[i.rp])
  
  tmp <- t.test(res[i])
  res.sum$mean.ff[res.sum$factor==i] <- tmp$estimate
  res.sum$t_stat.ff[res.sum$factor==i] <- tmp$statistic
  
  tmp <- t.test(res[i.rp])
  res.sum$mean.rp[res.sum$factor==i] <- tmp$estimate
  res.sum$t_stat.rp[res.sum$factor==i] <- tmp$statistic
  
  ############################
  ## Plot cumulative return ##
  ############################
  res.p <- res[c('date',i,i.rp)]
  # convert to R date format
  res.p$date <- as.Date(paste0(res.p$date,'01'), '%Y%m%d')
  res.p$date <- res.p$date + months(1) - 1
  
  # transform wide table into long table
  res.p <- reshape2::melt(res.p, id='date')
  names(res.p) <- c("Date","Factor","Return")
  
  # compute cumulative return
  res.p <- res.p %>%
    group_by(Factor) %>% 
    arrange(Date, .by_group = T) %>%
    mutate(
      Return.Cum = cumprod(1+Return/100)
    )
  
  # cut 1950-2021 into 4 subsamples
  res.p$Year <- as.integer(substr(res.p$Date,1,4))
  res.p$Subsample <- cut2(res.p$Year, g=4)
  
  # ggplot
  fig <- ggplot(data = res.p, aes(x=Date, y=Return.Cum, group=Factor)) +
    geom_line(aes(linetype=Factor, color=Factor)) +
    scale_linetype_manual(values=c('solid',"longdash","dotted","dashed","twodash"))+
    scale_color_manual(values=c('blue','brown','black','black','coral4'))+
    ylab('Portfolio worth if invest $1 at the beginning') +
    scale_x_date(date_labels = "%m-%Y", date_breaks = "4 year") +
    theme_bw() +
    theme(legend.title = element_blank()
    ) + 
    facet_wrap(vars(Subsample),
               scales = "free") + 
    scale_y_continuous() + 
    labs(title=paste('Cumulative return of factors from',min(res$date)))
  print(fig)
  # ggsave(paste0('cumulative_return_',i,'.pdf'), width = 12, height = 7)
  # ggsave(paste0('cumulative_return_',i,'.png'), width = 12, height = 7)
  
}

res.sum[,-1] <- round(res.sum[,-1], 2)
res.sum

##   factor corr mean.ff mean.rp t_stat.ff t_stat.rp
## 1    cma 0.98    0.27    0.26      3.58      3.40
## 2    rmw 0.98    0.28    0.30      3.30      3.41
## 3    hml 0.98    0.27    0.29      2.49      2.66
## 4   smb3 0.99    0.20    0.18      1.73      1.63
## 5   smb5 1.00    0.23    0.22      1.99      1.98

# write.csv(res.sum, 'factor_replicating_results.csv', row.names = F)

SMB_HML_CMA_RMW_FF5

Kai Wang

2/24/2022

To Start

My R Code

Load packages

Load data

Process CRSP

Process CCM and merge CRSP and CCM

June in year t (dataframe pt6) and December in year t-1

Compustat Annual Data (dataframe cmp_a)

Form portfolios and compute factor returns

Verify the results and plot the cumulative returns