It is common to come across SAS or Stata manuals while working on empirical asset pricing or corporate finance. Nonetheless, given the popularity of open-source languages such as R and Python, it is surprising how few resources and manuals are available for working with common databases such as CRSP or COMPUSTAT. In my research I work mainly with R, whether for optimization, time series analysis, data manipulation, Monte Carlo simulation, panel regression, or machine learning.
In this vignette, I illustrate how to work with the CRSP and COMPUSTAT databases, two staples of empirical finance research. Specifically, I discuss how to clean, merge, and analyze both data sets using R. For concrete illustration, I demonstrate how to form portfolios with respect to the Fama-French size and value premiums. At the end of the discussion, I study the sensitivity of the results to the different inputs on which the portfolio construction depends, in order to understand how the asset pricing implications change with the specification and design of the experiment. I hope this vignette encourages further reproducible empirical asset pricing research.
The discussion assumes that the user has already downloaded the CRSP and COMPUSTAT data sets as two separate files, both in CSV format. Users interested in working with the WRDS API may consult the WRDS manuals for further information. I work with a number of libraries to perform the analysis. The core packages of interest are data.table and lubridate. In a few cases I also refer to plyr for basic summary statistics; nonetheless, the main analysis is conducted in the data.table environment. To learn more about data.table, which I strongly advocate, I highly recommend the DataCamp course “Data Manipulation with data.table in R” by Matt Dowle and Arun Srinivasan.
library(data.table)
library(lubridate)
library(ggplot2)
library(plyr)
library(parallel)
library(plotly)
rm(list = ls())
Among the many advantages of data.table, one convenient feature is the option to specify a subset of variables while reading the data. Typically, after downloading the full data set, I focus on a subset of variables of interest.
file.i <- "/home/simaan/Dropbox/Data/DATA/CRSP_1960_2019.csv"
select.var <- c("PERMCO","date","COMNAM","SHROUT","SHRCD","DLRET",
"DLSTCD","DLPDT","EXCHCD","RET","PRC","CUSIP")
DT <- fread(file.i,select = select.var)
The above commands load the monthly CRSP data set with pre-specified variables. These are the permanent identifier of the security (PERMCO), the date, the company name (COMNAM), shares outstanding in thousands (SHROUT), the share code (SHRCD), the exchange code (EXCHCD), the security return (RET), the price (PRC), and the CUSIP identifier. Note that the CUSIP is the key link between the CRSP and COMPUSTAT data. Additionally, I consider the delisting-related variables, denoted by the DL prefix, to which I return later.
After loading the data, I apply a few filters and cleaning procedures. In particular, I keep common shares, i.e. those with share codes 10 or 11, and I drop observations with missing prices. There are also certain price flags denoted by -44, -55, -66, -77, -88, and -99, which I drop from the data as well. Additionally, I keep securities listed on the major exchanges (NYSE, AMEX, or NASDAQ). Given these filters, I compute the market cap for each stock-month in the data.
DT <- DT[DT$SHRCD %in% 10:11,]
DT <- DT[!is.na(DT$PRC),]
DT <- DT[!DT$PRC %in% (-(4:9)*11),]
DT$RET <- as.numeric(DT$RET)
NAs introduced by coercion
DT <- DT[!is.na(DT$RET),]
DT$PRC <- abs(DT$PRC)
DT$MKTCAP <- DT$PRC*DT$SHROUT
DT <- DT[DT$EXCHCD %in% 1:3,]
Depending on the identifier of interest, there may be duplicates in the data. To check for this, consider the following commands:
DT <- unique(DT)
DT <- DT[order(DT$PERMCO,DT$date),]
DT[ , `:=`( duplicate_N = .N ) , by= list(PERMCO,date)]
table(DT$duplicate_N)/nrow(DT)*100
1 2 3 4 6 7
98.014830993 1.902150108 0.044272559 0.018337003 0.010738451 0.009670885
Note that the := operator adds a new variable to the existing data.table, where .N denotes the number of observations within each group. By grouping the data by stock identifier and month, we can count how many observations fall within each case. As the table above shows, about 2% of the observations are duplicates, i.e. multiple observations for the same stock-month. For instance, the table below illustrates a case with 7 duplicates:
head(DT[DT$duplicate_N == 7,],7)
Note that the duplicates arise due to different CUSIP identifiers. Since we plan to link the data with COMPUSTAT using CUSIP, we check whether there are duplicates for the same CUSIP-month:
DT[ , `:=`( duplicate_N = .N ) , by= list(CUSIP,date)]
table(DT$duplicate_N)/nrow(DT)*100
1
100
DT$duplicate_N <- NULL
In all cases, observe that each CUSIP-date observation is unique. However, if one is working with CRSP alone, it is common to aggregate duplicates by value-weighting the observations based on market cap, yielding a unique PERMCO-month observation (see the sketch below). Additionally, note that grouping on the first 6 CUSIP characters (the issuer-level identifier) yields the same unique mapping.
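For completeness, here is a minimal sketch of such an aggregation with data.table, value-weighting returns by market cap within each PERMCO-month. It is illustrative only and not used in the remainder of the analysis, since the CUSIP-month observations are already unique; the column names follow the CRSP extract above.
# Illustrative sketch (not used below): collapse PERMCO-month duplicates by
# value-weighting returns with market cap and summing market cap
DT_agg <- DT[, list(RET = sum(RET*MKTCAP/sum(MKTCAP,na.rm = TRUE),na.rm = TRUE),
MKTCAP = sum(MKTCAP,na.rm = TRUE)),
by = list(PERMCO,date)]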
In terms of date formatting, I utilize the lubridate library to manipulate dates:
DT$date <- ymd(DT$date)
DT$date <- ceiling_date(DT$date,"m") - 1
It is common to require a minimum history for each security in the data. For instance, a researcher may need to estimate the market beta over a rolling window using an initial sample of 2 years. For the sake of illustration, I require that each security have at least two months of data. At the end of this vignette, I discuss the sensitivity of the portfolio results to this choice. Nonetheless, dropping observations in this manner could have major implications in terms of data snooping and survivorship bias.
crsp_keep <- 2
DT[ , `:=`( N_obs = .N ) , by= list(CUSIP)]
DT <- DT[DT$N_obs >= crsp_keep,]
DT$N_obs <- NULL
Finally, we have 24581 unique securities, 720 months, and a total of 3187327 security-month observations.
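As a quick sanity check, these counts can be computed directly from the cleaned data, for instance:
uniqueN(DT$CUSIP)   # number of unique securities
uniqueN(DT$date)    # number of distinct months
nrow(DT)            # total security-month observations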
An important feature of the CRSP database is that it includes historical companies that were delisted in the past. Not controlling for such delistings creates a survivorship bias, especially when researching the cross-section of stock returns - see, e.g., this paper for further information.
To control for delisted returns, I follow the methodology recommended by Bali et al. (2016). I demonstrate the steps below; but before doing so, let’s take a quick look at the summary statistics of the monthly returns:
summary(DT$RET)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.99360 -0.06667 0.00000 0.01155 0.07143 24.00000
DT$DLRET <- as.numeric(DT$DLRET)
NAs introduced by coercion
DT <- DT[order(DT$CUSIP,DT$date),]
# record the last observed date of each security
DT[,`:=` (last_date = date[.N]), by = list(CUSIP) ]
# delisting code 100 means the issue is still active, i.e. no delisting
DT[DT$DLSTCD %in% 100,"DLSTCD"] <- NA
# identify securities with a delisting event and keep their last observation
cusip_delist <- unique(DT[!is.na(DT$DLSTCD),])$CUSIP
DT_delist <- DT[DT$CUSIP %in% cusip_delist,]
DT_delist_last <- DT_delist[, .SD[.N], by= list(CUSIP)]
rm(DT_delist)
# append one extra month after the last observation and use the delisting return
DT_delist_last$date <- ceiling_date(DT_delist_last$date + 1,"m") - 1
DT_delist_last$RET <- DT_delist_last$DLRET
# split depending on whether the delisting return is available
na_dlret_index <- is.na(DT_delist_last$RET)
DT_delist_last_na <- DT_delist_last[na_dlret_index,]
DT_delist_last <- DT_delist_last[!na_dlret_index,]
# when the delisting return is missing, impute -30% for the selected
# delisting codes and -100% otherwise
select_code <- c(500,520:551,573,574,580,584)
DT_delist_last_na[DT_delist_last_na$DLSTCD %in% select_code,"RET"] <- -0.3
DT_delist_last_na[!DT_delist_last_na$DLSTCD %in% select_code,"RET"] <- -1
DT_delist_last <- rbind(DT_delist_last,DT_delist_last_na)
# adjust market cap and price for the delisting month
DT_delist_last$MKTCAP <- DT_delist_last$MKTCAP*(1 + DT_delist_last$RET)
DT_delist_last$PRC <- DT_delist_last$PRC*(1 + DT_delist_last$RET)
# add the delisting-month observations back to the main data
DT <- rbind(DT,DT_delist_last)
DT <- DT[order(DT$CUSIP,DT$date),]
summary(DT$RET)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.00000 -0.06667 0.00000 0.01151 0.07143 24.00000
The above steps adjust the returns when delisting returns are available. Otherwise, I impute an arbitrary return following the suggestions in the literature. We can see that the minimum return becomes -100%, while the mean return stays roughly the same due to the large sample size. It is also worth mentioning that dropping stocks below a certain price level potentially eliminates outliers and penny stocks, which are more likely to get delisted.
Before we link the CRSP data to COMPUSTAT, let’s perform some basic analysis. For instance, we can easily aggregate the security returns at the monthly level to create either a value-weighted or an equally weighted portfolio. To do so, consider the following commands:
PORT_RET <- DT[,list(EW_RET = lapply(.SD,mean,na.rm = T)), by = list(date), .SDcols = "RET"]
PORT_RET2 <- DT[,list(VW_RET = lapply(.SD, function(x) sum(x*MKTCAP/sum(MKTCAP,na.rm = T),na.rm = T))),
by = list(date), .SDcols = "RET"]
PORT_RET <- merge(PORT_RET,PORT_RET2)
rm(PORT_RET2)
PORT_RET <- PORT_RET[order(PORT_RET$date),]
To summarize the returns over time, consider the time series of cumulative returns for each portfolio. I use the ggplot2 library and shade the NBER recession periods:
bar.col <- "gray"
ggplot_recession0 <- geom_rect(fill = bar.col,col = bar.col,
aes(xmin=date("1973-11-30"),
xmax=date("1975-03-31"),
ymin=-Inf, ymax=Inf))
ggplot_recession1 <- geom_rect(fill = bar.col,col = bar.col,
aes(xmin=date("1980-01-31"),
xmax=date("1980-07-31"),
ymin=-Inf, ymax=Inf))
ggplot_recession2 <- geom_rect(fill = bar.col,col = bar.col,
aes(xmin=date("1981-07-31"),
xmax=date("1982-11-30"),
ymin=-Inf, ymax=Inf))
ggplot_recession3 <- geom_rect(fill = bar.col,col = bar.col,
aes(xmin=date("1990-07-31"),
xmax=date("1991-03-31"),
ymin=-Inf, ymax=Inf))
ggplot_recession4 <- geom_rect(fill = bar.col,col = bar.col,
aes(xmin=date("2001-03-31"),
xmax=date("2001-11-30"),
ymin=-Inf, ymax=Inf))
ggplot_recession5 <- geom_rect(fill = bar.col,col = bar.col,
aes(xmin=date("2007-12-31"),
xmax=date("2009-06-30"),
ymin=-Inf, ymax=Inf))
ds.plot1 <- data.frame(date = PORT_RET$date, CumRet = cumsum(PORT_RET$EW_RET),
Type = "Equally-Weighted" )
ds.plot2 <- data.frame(date = PORT_RET$date, CumRet = cumsum(PORT_RET$VW_RET),
Type = "Value-Weighted" )
ds.plot <- rbind(ds.plot1,ds.plot2)
ds.plot$Type <- as.factor(ds.plot$Type)
p <- ggplot(ds.plot, aes(date, CumRet,colour = Type))
p <- p + ggplot_recession0 + ggplot_recession1 +
ggplot_recession2 + ggplot_recession3 +
ggplot_recession4 + ggplot_recession5
p <- p + geom_line(alpha = 0.4)
p <- p + xlab("Date") + ylab("Cumulative Return")
p <- p + geom_abline(intercept = 0, slope = 0, color="black", linetype="dashed", size=0.2)
p
We note that, overall, the equally weighted portfolio under-performs the value-weighted one. The equally weighted portfolio attributes greater weight to small-cap stocks. Interestingly, and contrary to Fama-French, we observe that the value-weighted portfolio (tilted toward large caps) outperforms the equally weighted one (tilted toward small caps). Nonetheless, this pattern only becomes evident beginning in the late 80s.
We take a closer look at the above by grouping securities into size portfolios. Each month, I sort securities into 5 groups based on market cap cutoffs. To perform the cutoff, I refer to the ntile function from the dplyr library. In the following analysis, I proceed with equal weighting for the sake of simplicity.
library(dplyr)
cut.n <- 5
DT <- DT[,`:=` (Group_Size = ntile(MKTCAP,cut.n)), by = list(date)]
N_G <- DT[,.N, by = list(date,Group_Size) ]
N_G[,mean(N), by = list(Group_Size)]
We observe that, on average, each group contains 885 firms over the sample period. In addition to the size group, I consider the next one-month return on each security. I use the shift function from data.table and apply it to each security as follows:
DT <- DT[order(DT$CUSIP,DT$date),]
DT <- DT[,`:=` (RET_1 = shift(RET,-1)), by = list(CUSIP)]
Given the above, I report summary statistics on the size group level:
DT_size <- DT[,lapply(.SD,mean,na.rm = T),by = list(Group_Size),
.SDcols = c("RET","RET_1","PRC","SHROUT","MKTCAP","EXCHCD")]
DT_size <- DT_size[order(DT_size$Group_Size),]
DT_size$RET_1 <- DT_size$RET_1*12
DT_size$RET <- DT_size$RET*12
DT_size
Clearly, securities ranked in the 5th (largest) size group have a higher market cap, which is associated with both higher prices and more shares outstanding. Additionally, note that large stocks are more likely to be listed on NYSE (EXCHCD = 1), whereas small-cap stocks are more likely to be listed on NASDAQ (EXCHCD = 3). In terms of returns, the results are sensitive to whether we consider the in-sample return or the next-month return. In-sample, we observe that large caps outperform small caps. Nonetheless, the more relevant measure in practice is the out-of-sample return, and in that case we observe that small-cap stocks outperform large-cap stocks.
In annual terms, small-cap stocks yield a mean return about 10% higher than large-cap stocks. Ignoring transaction costs and assuming that investors re-balance their portfolios on a monthly basis, this evidence is consistent with the Fama-French size premium, i.e. investors expect a higher return for investing in small-cap stocks.
The above discussion relates to the CRSP data alone. In the following, I focus on the COMPUSTAT data, which contains accounting-related information on both public and non-public firms. Similar to the above, I focus on a subset of variables. For identification, I look into the CUSIP and the CIK number. The latter is relevant for identifying firms via the SEC EDGAR system - useful for those interested in merging the data with SEC filings to perform textual analysis. The FYEARQ and FQTR variables are the fiscal year and quarter of the data point. For accounting variables, I consider total assets (atq), net income (niq), and common equity (ceqq).
file.j <- "/home/simaan/Dropbox/Data/DATA/COMPUSTAT_1960_2020.csv"
select.var2 <- tolower(c("FYEARQ","FQTR","cusip","cik","sic","niq","atq","ceqq"))
DT2 <- unique(fread(file.j,select = select.var2))
To link COMPUSTAT with CRSP, we need to make a small adjustment to the CUSIP identifiers. In CRSP, the CUSIP has 8 characters, whereas in COMPUSTAT it has 9.
table(nchar(DT$CUSIP))
8
3187327
table(nchar(DT2$cusip))
0 9
130 1806228
Note that there are a few cases in which the CUSIP is unavailable in COMPUSTAT. The adjustments are described below:
DT2 <- DT2[!nchar(DT2$cusip) == 0,]
DT2$CUSIP <- substr(DT2$cusip,0,8)
DT2$cusip <- NULL
DT2 <- unique(DT2[DT2$CUSIP %in% DT$CUSIP,])
In order to merge with the CRSP data, which is indexed by calendar dates, I adjust the fiscal dates in COMPUSTAT. It is common to use a 6-month lag to allow the financial disclosures to become publicly available. To do so, I perform the following adjustments:
DT2$date <- ymd(DT2$fyearq*10000 + DT2$fqtr*3*100 + 1)
DT2$date <- DT2$date + months(6)
DT2$date <- ceiling_date(DT2$date,"q") - 1
DT2 <- DT2[month(DT2$date) == 6,]
DT2$fyearq <- DT2$fqtr <- NULL
Additionally, I keep annual rather than quarterly data. Again, this depends on the purpose of the investigation; in this illustration, portfolios are formed once a year, so I keep only the June observations for the portfolio formation process.
As before, let’s check for duplicates:
DT2 <- unique(DT2)
DT2[ , `:=`( duplicate_N = .N ) , by= list(CUSIP,date)]
table(DT2$duplicate_N)
1
247903
DT2$duplicate_N <- NULL
The COMPUSTAT data seem to have unique cusip-date observations.
Given the final COMPUSTAT data, I compute a few summary statistics for each industry, where industry is defined by the leading digit of the SIC code.
DT2 <- na.omit(DT2)
DT2$ROA <- DT2$niq/DT2$atq
DT2$ROE <- DT2$niq/DT2$ceqq
DT2$BL <- DT2$atq/DT2$ceqq
DT2$Industry <- floor(DT2$sic/1000)
DT2_sum <- DT2[,lapply(.SD,median,na.rm = T), by = list(Industry),
.SDcols = c("atq","ROA","ROE","BL")]
DT2_sum <- DT2_sum[order(DT2_sum$Industry),]
DT2_sum
Observe that financial firms are associated with the highest book leverage, in line with the business nature of financial institutions. Also, note that the financial industry is associated with the largest total assets. This is not surprising given the increasing concentration of the financial industry over time, in which a few entities hold the majority of assets. To relate to this last point, consider the share of the industry's total assets held by the top 10% of firms.
DT2_Fin <- DT2[DT2$Industry == 6,]
DT2_Fin <- DT2_Fin[,`:=` (Group_Size = ntile(atq,10)), by = list(date)]
DT2_Fin <- DT2_Fin[,`:=` (Total_Assets = sum(atq)), by = list(date)]
DT2_Fin <- DT2_Fin[,lapply(.SD, function(x) sum(x/Total_Assets)),
by = list(date,Group_Size), .SDcols = "atq"]
DT2_Fin_Top <- DT2_Fin[DT2_Fin$Group_Size == 10,]
DT2_Fin_Top <- DT2_Fin_Top[order(DT2_Fin_Top$date),]
plot(atq~date,data = DT2_Fin_Top,type = "l",
main = "Proportion of Assets held by Top 10% in Financial Industry",
ylab = "", xlab = "Date")
grid(10)
We discern that the top 10% increased their share of total assets from 34% in the late 60s up to 90% more recently.
Note that the COMPUSTAT data set is annual, while CRSP is monthly. If we seek to form portfolios based on the book-to-market (BM) ratio, as in Fama-French, we need to create an annual BM variable in the COMPUSTAT data set. Before we merge the data altogether, let’s add the market value of equity (ME) to the COMPUSTAT data set. Additionally, I add the stock price, which will be useful later on for filtering penny stocks out of the portfolio formation.
BM <- DT[,c("date","CUSIP","MKTCAP","PRC")]
DT2 <- merge(DT2,BM, by = c("CUSIP","date"), all = F)
DT2$BM_ratio <- DT2$ceqq/(DT2$MKTCAP/1000)
DT2$ME <- DT2$MKTCAP
DT2$PRC_LAST <- DT2$PRC
DT2$MKTCAP <- DT2$PRC <- NULL
DT2 <- DT2[order(DT2$CUSIP,DT2$date),]
DT2[,`:=` (N_years = 1:.N), by = list(CUSIP)]
rm(BM); gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1462689 78.2 2908574 155.4 2908574 155.4
Vcells 83773005 639.2 192790929 1470.9 192790929 1470.9
The above commands link the COMPUSTAT and CRSP data sets to determine the market equity of each firm and, hence, the book-to-market ratio (note the division by 1000, since MKTCAP is expressed in thousands of dollars whereas ceqq is in millions). Also, I rename the stock price as PRC_LAST to refer to the most recent stock price in the annual data.
Before we finally merge the data altogether, we need one small trick on the CRSP side. The ultimate goal of this illustration is to attribute the next 12 months of returns to the BM ratio from the previous June. To do so, consider the following:
prev_june_f <- function(x) floor_date(floor_date(x,"m") + months(6),"y") - months(6) - 1
DT$date_june <- prev_june_f(DT$date)
DT2$date_june <- DT2$date
DT2$date <- NULL
While prev_june_f seems obscure, breaking the function into its components clarifies how it works. Consider the special case where we have
x <- DT$date[1:13]
x
[1] "1970-12-31" "1971-01-31" "1971-02-28" "1971-03-31" "1971-04-30" "1971-05-31" "1971-06-30" "1971-07-31" "1971-08-31"
[10] "1971-09-30" "1971-10-31" "1971-11-30" "1971-12-31"
The first step floors each date to the beginning of its month and then shifts it six months ahead.
floor_date(x,"m") + months(6)
[1] "1971-06-01" "1971-07-01" "1971-08-01" "1971-09-01" "1971-10-01" "1971-11-01" "1971-12-01" "1972-01-01" "1972-02-01"
[10] "1972-03-01" "1972-04-01" "1972-05-01" "1972-06-01"
The second step identifies the floor of each date on the annual level:
floor_date(floor_date(x,"m") + months(6),"y")
[1] "1971-01-01" "1971-01-01" "1971-01-01" "1971-01-01" "1971-01-01" "1971-01-01" "1971-01-01" "1972-01-01" "1972-01-01"
[10] "1972-01-01" "1972-01-01" "1972-01-01" "1972-01-01"
The final step subtracts 6 months to identify the June of the previous year; subtracting one additional day retrieves June 30th rather than July 1st, such that
floor_date(floor_date(x,"m") + months(6),"y") - months(6) - 1
[1] "1970-06-30" "1970-06-30" "1970-06-30" "1970-06-30" "1970-06-30" "1970-06-30" "1970-06-30" "1971-06-30" "1971-06-30"
[10] "1971-06-30" "1971-06-30" "1971-06-30" "1971-06-30"
Looking at the returns between Dec 1970 and June 1971, the function traces the first 7 observations back to June 1970. Once the next June arrives, the date_june variable adjusts accordingly. Finally, the merged data set is given by
DT12 <- merge(DT,DT2,by = c("CUSIP","date_june"))
The DT12 object contains the final merged CRSP-COMPUSTAT data. The merge is constructed so that portfolios are formed at the end of June each year, as mentioned above; users may wish to change this to a different month. This setting provides a convenient way to form portfolios without relying on loops. Nonetheless, it also imposes that portfolio managers re-balance their portfolios on a monthly basis according to the market cap from the previous June.
Given the merged data set, I form groups at the end of each June based on market equity (size) and the book-to-market ratio (value):
DT12 <- DT12[,`:=` (Group_ME = ntile(ME,5)), by = list(date_june)]
DT12 <- DT12[,`:=` (Group_BM = ntile(BM_ratio,5)), by = list(date_june,Group_ME)]
The above two commands perform a dependent portfolio sort, in which firms are first sorted on size and then on the book-to-market ratio within each size group. I use value weighting to compute the future returns on each size-value portfolio. This results in 25 value-weighted portfolios.
PORT_RET <- DT12[,list(VW_RET = sum(RET*ME/sum(ME)) ),
by = list(date,Group_ME,Group_BM)]
PORT_RET <- PORT_RET[order(PORT_RET$date,PORT_RET$Group_BM,PORT_RET$Group_ME),]
PORT_RET$VW_RET <- as.numeric(PORT_RET$VW_RET)*100
ret_matrix <- PORT_RET[,lapply(.SD,mean,na.rm = T),
by = list(Group_ME,Group_BM), .SDcols = "VW_RET"]
ret_matrix <- ret_matrix[order(ret_matrix$Group_BM,ret_matrix$Group_ME),]
ret_matrix <- matrix(as.numeric(ret_matrix$VW_RET),5)
rownames(ret_matrix) <- 1:5
colnames(ret_matrix) <- paste("BM",1:5,sep = "_")
rownames(ret_matrix)[1] <- "Small"
rownames(ret_matrix)[5] <- "Big"
colnames(ret_matrix)[1] <- "Low"
colnames(ret_matrix)[5] <- "High"
round(data.frame(ret_matrix*12),2)
Within each column above, we observe that the mean raw returns decrease with size, a pattern commonly known as the size effect. At the same time, within each size group, the mean return increases with the BM ratio, which is known as the value effect. In other words, investors expect higher returns from small firms and from undervalued stocks (those trading at a low price relative to book value). Nonetheless, recent discussions debate whether this is still the case. For further information, see this article.
The above portfolio results are raw returns. In order to price these portfolios, I decompose the returns into (1) systematic components (risk premiums) and (2) an intercept plus residuals, where the intercept denotes the risk-adjusted return. Additionally, I consider the excess return on each portfolio, i.e. the portfolio return minus the 1-month Treasury yield.
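Concretely, for each of the 25 portfolios the excess return is regressed on the three Fama-French factors,

$$ R_{p,t} - R_{f,t} = \alpha_p + \beta_p\,(R_{m,t} - R_{f,t}) + s_p\,SMB_t + h_p\,HML_t + \varepsilon_{p,t}, $$

where the intercept $\alpha_p$ is the risk-adjusted return reported below; this is exactly the lm regression of RAR on Mkt.RF, SMB, and HML estimated later on.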
The Fama-French risk factors are easily obtained from Ken French's public data library using the commands below. Note that I focus on the three factors: market, size, and value:
FF_file <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip"
temp <- tempfile()
download.file(FF_file,temp)
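The subsequent commands rely on a data set ds that contains the monthly factor data (date, Mkt.RF, SMB, HML, and RF, all in percent) with dates aligned to month-end, as in PORT_RET. The code that builds ds from the downloaded zip is not shown above; the following is a minimal sketch that assumes the usual layout of the F-F_Research_Data_Factors CSV (a few descriptive lines, a monthly block indexed by yyyymm, then an annual block), so the skip value and clean-up steps may need adjusting.
# Sketch only: construct the factor data set `ds` used below (assumptions noted above)
ff_file <- unzip(temp, exdir = tempdir())                    # path of the extracted CSV
ds <- fread(ff_file, skip = 3)                               # skip the descriptive header
setnames(ds, c("date","Mkt.RF","SMB","HML","RF"))
ds <- ds[grepl("^[0-9]{6}$", ds$date),]                      # keep the monthly rows only
ds[, c("Mkt.RF","SMB","HML","RF") := lapply(.SD, as.numeric),
.SDcols = c("Mkt.RF","SMB","HML","RF")]
ds$date <- ceiling_date(ymd(paste0(ds$date,"01")),"m") - 1   # month-end dates as in DT
Note that the factors and RF are kept in percent, consistent with VW_RET, which was multiplied by 100 above.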
Given the time series of portfolio returns, I merge them with the factor data and regress the excess returns on the three factors. To do so, I leverage some functional programming using the base lapply function along with the dlply function from the plyr library.
PORT_RET_RA <- merge(PORT_RET,ds)
PORT_RET_RA$RAR <- PORT_RET_RA$VW_RET - PORT_RET_RA$RF
lm_list <- dlply(PORT_RET_RA,c("Group_ME","Group_BM"), function(x) lm( RAR ~ Mkt.RF + SMB + HML, data = x ) )
rar_matrix <- lapply(lm_list, coef)
rar_matrix <- sapply(rar_matrix,function(x) x[[1]])
rar_matrix <- matrix(rar_matrix,5,5,byrow = T)
rownames(rar_matrix) <- 1:5
colnames(rar_matrix) <- paste("BM",1:5,sep = "_")
rownames(rar_matrix)[1] <- "Small"
rownames(rar_matrix)[5] <- "Big"
colnames(rar_matrix)[1] <- "Low"
colnames(rar_matrix)[5] <- "High"
rar_matrix <- round(data.frame(rar_matrix*12),2)
rar_matrix
Note that after controlling for the three risk factors, the small-size/high-BM portfolio still yields an annual risk-adjusted return of 9%. A potential argument is that such alpha is attributable to other risk factors that the three-factor model fails to price. More recently, Fama-French have suggested a five-factor model to better capture the systematic components of returns.
The above results are subject to the particular filters, the time period, and the number of months of history each stock is required to have. One major challenge is whether the securities in the analysis are tradable. For instance, it is common to consider only stocks with a price greater than $5. The major issue with doing so is that such a filter can introduce a forward-looking bias: at the time of portfolio formation, the decision maker cannot know whether a stock will remain above $5 in the future (this is less of an issue for large-cap stocks). Hence, a minimum price filter should be applied on a recurring basis. Rather than dropping all observations in which the price is less than $5, one should include only the stocks that qualified at the time of portfolio formation. The same issue applies to the required window of available history and to any other filter one may wish to control for.
The following is a generalized portfolio formation function that takes four arguments. The first two set the minimum price and the number of years of history a stock must have, as of June each year, to be considered investable. The other two set the time period over which the researcher is interested in computing the portfolio returns. The function returns a list containing the raw and risk-adjusted returns of the 25 portfolios, along with the average number of stocks within each portfolio.
FF3_anomaly <- function(min_price,year_keep = 1,year1 = 1965,year2 = 2019) {
DT12_sub <- DT12
# keep only stock-Junes that qualify at formation time:
# enough years of history and a price above the minimum
keep_cusip_date <- unique(DT12_sub[,list(CUSIP,date_june,N_years,PRC_LAST)])
keep_cusip_date <- keep_cusip_date[keep_cusip_date$N_years >= year_keep & keep_cusip_date$PRC_LAST >= min_price ,]
keep_cusip_date <- unique(keep_cusip_date[,list(CUSIP,date_june)])
DT12_sub <- merge(DT12_sub,keep_cusip_date)
# dependent sort: size groups first, then BM groups within each size group
DT12_sub <- DT12_sub[,`:=` (Group_ME = ntile(ME,5)), by = list(date_june)]
DT12_sub <- DT12_sub[,`:=` (Group_BM = ntile(BM_ratio,5)), by = list(date_june,Group_ME)]
# average number of stocks in each of the 25 portfolios
N_G <- DT12_sub[,.N, by = list(date,Group_ME,Group_BM) ]
N_G <- (N_G[,mean(N), by = list(Group_ME,Group_BM) ])
N_G <- N_G[order(N_G$Group_ME,N_G$Group_BM),]
N_G <- matrix(N_G$V1,5,5,byrow = T)
rownames(N_G) <- 1:5
colnames(N_G) <- paste("BM",1:5,sep = "_")
rownames(N_G)[1] <- "Small"
rownames(N_G)[5] <- "Big"
colnames(N_G)[1] <- "Low"
colnames(N_G)[5] <- "High"
N_G <- round((N_G))
# value-weighted monthly returns of the 25 portfolios (in percent)
PORT_RET <- DT12_sub[,list(VW_RET = sum(RET*ME/sum(ME)) ),
by = list(date,Group_ME,Group_BM)]
PORT_RET <- PORT_RET[order(PORT_RET$date,PORT_RET$Group_BM,PORT_RET$Group_ME),]
PORT_RET$VW_RET <- as.numeric(PORT_RET$VW_RET)*100
# merge with the factor data and compute excess returns over the chosen period
PORT_RET_RA <- merge(PORT_RET,ds)
PORT_RET_RA$RAR <- PORT_RET_RA$VW_RET - PORT_RET_RA$RF
PORT_RET_RA_sub <- PORT_RET_RA
PORT_RET_RA_sub <- PORT_RET_RA_sub[year(PORT_RET_RA_sub$date) >= year1 & year(PORT_RET_RA_sub$date) <= year2 ,]
# mean raw returns, annualized below by multiplying by 12
ret_matrix <- PORT_RET_RA_sub[,lapply(.SD,mean,na.rm = T),
by = list(Group_ME,Group_BM), .SDcols = "VW_RET"]
ret_matrix <- ret_matrix[order(ret_matrix$Group_BM,ret_matrix$Group_ME),]
ret_matrix <- matrix(as.numeric(ret_matrix$VW_RET),5)
rownames(ret_matrix) <- 1:5
colnames(ret_matrix) <- paste("BM",1:5,sep = "_")
rownames(ret_matrix)[1] <- "Small"
rownames(ret_matrix)[5] <- "Big"
colnames(ret_matrix)[1] <- "Low"
colnames(ret_matrix)[5] <- "High"
ret_matrix <- round(data.frame(ret_matrix*12),2)
# risk-adjusted returns: intercepts from the three-factor regressions
lm_list <- dlply(PORT_RET_RA_sub,c("Group_ME","Group_BM"), function(x) lm( RAR ~ Mkt.RF + SMB + HML, data = x ) )
rar_matrix <- lapply(lm_list, coef)
rar_matrix <- sapply(rar_matrix,function(x) x[[1]])
rar_matrix <- matrix(rar_matrix,5,5,byrow = T)
rownames(rar_matrix) <- 1:5
colnames(rar_matrix) <- paste("BM",1:5,sep = "_")
rownames(rar_matrix)[1] <- "Small"
rownames(rar_matrix)[5] <- "Big"
colnames(rar_matrix)[1] <- "Low"
colnames(rar_matrix)[5] <- "High"
rar_matrix <- round(data.frame(rar_matrix*12),2)
list(N_G,ret_matrix,rar_matrix)
}
Given the above function, I test the sensitivity of the size and value premiums to the minimum price required for a stock to enter the portfolios. In particular, I run the function for a sequence of minimum prices ranging between $0 and $50. With the parallel library, I can easily distribute the computation over four cores; it takes a few seconds to run:
p_seq <- 0:50
list_port_price <- mclapply(p_seq, function(p) FF3_anomaly(p),mc.cores = 4)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1274392 68.1 2908574 155.4 2908574 155.4
Vcells 119639785 912.8 190429023 1452.9 190146018 1450.7
Given the list list_port_price, I extract the result of interest. In this case, I focus on the raw returns. For instance, the first item of this list corresponds to the same results from above
list_port_price[[1]][[2]]
On the other hand, if we impose a $5 minimum price (the sixth element of the list), we see that the results are sensitive to this entry threshold:
list_port_price[[6]][[2]]
To capture this for each matrix of raw returns, I compute the high-minus-low (HML) premium, i.e. column 5 minus column 1, for each size level. This results in 5 HML premiums, one per size group. I summarize the results in the plot below:
RAR_list <- lapply(list_port_price,function(x) x[[2]])
RAR_list_Value <- t(sapply(RAR_list, function(x) x[,5] - x[,1] ))
colnames(RAR_list_Value) <- rownames(RAR_list[[1]])
RAR_list_Size <- t(sapply(RAR_list, function(x) x[1,] - x[5,] ))
ds.plot <- lapply(1:5, function(i) data.frame(HML = RAR_list_Value[,i],Min_Price = p_seq, Size = i))
ds.plot <- Reduce(rbind,ds.plot)
ds.plot <- ds.plot[order(ds.plot$Size,ds.plot$Min_Price),]
ds.plot <- ds.plot[ds.plot$Size %in% c(1,3,5),]
ds.plot$Size <- as.factor(ds.plot$Size)
p <- ggplot(ds.plot,aes(x = Min_Price, y = HML, colour = Size, shape = Size))
p <- p + geom_point() + geom_line()
p <- p + geom_abline(intercept = 0, slope = 0, color="black", linetype="dashed")
ggplotly(p,width = 800, height = 600)
We observe that the HML premium is more evident for small-cap stocks. However, at the same time, note that the premium declines as we increase the minimum price of entry. One argument is that, as we increase the minimum price, the size effect is mitigated and, with it, the value premium. Another is that it remains unclear whether investors can actually exploit such alphas in small-cap stocks, which tend to be less liquid.
I repeat the same exercise for the size premium. In particular, I compute the small-minus-big (SMB) premium, i.e. row 1 minus row 5, for each BM level. This results in 5 SMB premiums, one per BM group. Similar to the previous plot, I get the following:
ds.plot <- lapply(1:5, function(i) data.frame(SMB = unlist(RAR_list_Size[,i]),Min_Price = p_seq, BM = i))
ds.plot <- Reduce(rbind,ds.plot)
ds.plot <- ds.plot[order(ds.plot$BM,ds.plot$Min_Price),]
ds.plot <- ds.plot[ds.plot$BM %in% c(1,3,5),]
ds.plot$BM <- as.factor(ds.plot$BM)
p <- ggplot(ds.plot,aes(x = Min_Price, y = SMB, colour = BM, shape = BM))
p <- p + geom_point() + geom_line()
p <- p + geom_abline(intercept = 0, slope = 0, color="black", linetype="dashed")
ggplotly(p,width = 800, height = 600)
Consistent with the HML premium, the SMB premium seems to shrink as we increase the minimum price of entry. While different forces could drive this, it is of great relevance for researchers and investors to understand the underlying mechanisms. Interested readers may find this article relevant concerning the impact of liquidity and limits of arbitrage on the inclusion/exclusion of small-cap stocks.
This vignette provides a brief illustration of how to merge the CRSP (security prices/returns) data with the COMPUSTAT (financial statement) data. The illustration is conducted for the purpose of portfolio formation using book data, with the end result combining monthly market data with annual accounting data. Researchers interested in panel analysis or in applying predictive machine learning models may prefer to merge the data at a common frequency; I leave this for a future endeavor. Nonetheless, I hope this vignette encourages further reproducible empirical asset pricing research, while also helping to bridge the gap between the data science community and the empirical asset pricing literature.