I will create some fake scenario in this exercise that helps to understand causal inference methods and the type of data patterns or treatment effects they help reveal. For this exercise, I will use two methods:
Differences-in-Differences (DID) with Natural Experiment
Regression Discontinuity Design (RDD)
Assume that there are 100 fruits and vegetable markets in 5 states the US (20 in each state). Lets imagine we have the data for 5 states for two periods: 2000 and 2005. There was a Food Quality Control Policy implemented in 2003. Once the policy is implemented in certain market, government assumes that there will be excellent compliance, and the positive effect constantly persists. The policy implemented by Farm Section of USDA in 2003 was decided back in 1990.
We notice that the food borne disease outbreak has been an issue in the US market. We are interested in understanding the causal relation between food safety practices and food-borne disease outbreak in the US market. In this case, quality control policy allows us to estimate the true effect. To accompany our analysis, we have the data on food borne disease annual outbreaks after 1995 in these states from CDC.
Given that the food quality control policy implemented by Farm Section of USDAin 2003 was decided back in 1990, above analysis assumes that our treatment is exogenous. Also, fruits and vegetable consumers, CDC or market sellers do not have control over USDA’s decision-making and CDC have not published outbreak data in 1990 (they started publication in 1995) means USDA’s policy mimics quasi-experimental setting.
set.seed(12345)
n = 100 #100 markets
data1 <- data.frame(market = rep(c(1:n), each=2)) #create marketid; we need two values for each market
data1$year = rep(c("2000", "2005"), n/2) #each market has obs for 2000 and 2005
data1$policy = ifelse(data1$year=="2000",0,rep(c(0,1), n/2)) #policy=treatment, in 2005, half of the markets are treated and in 2000, none of them are treated.
data1$after = c(rep(0,50), rep(1,50))
data1$outbreaks = rnorm(n, 800, 120) #total annual outbreaks with mean 800 and 120 SD
#assume that the true effect is 300 less outbreaks when policy in action
data1$outbreaks[data1$policy == 1 & data1$after == 1] <- data1$outbreaks[data1$policy == 1 & data1$after == 1] - 300 #if policy and after both =1, then the real effect can be seen.
data1_plot<-ggplot(data1, aes(market, outbreaks, shape = year,color = policy)) +
geom_point(size = 1.5) +
geom_smooth(method="lm")+ #geomline created from linear regression using these market observations
ggtitle("DID Simulation")+
xlab("") + ylab("outbreaks")
data1_plot
This
plot reveals that the markets without the treatment the trend of
food-borne disease outbreaks would remain straight as black line (with
mean slightly above 800). But, treatment (Quality Control Policy) helped
markets to reduce the disease outbreak as seen in the blue line.
did <- lm(outbreaks ~ after+policy+policy*after, data1) #linear regression
did
##
## Call:
## lm(formula = outbreaks ~ after + policy + policy * after, data = data1)
##
## Coefficients:
## (Intercept) after policy after:policy
## 821.53320 8.55474 0.02951 -285.60663
library(stargazer)
stargazer (did,
type="text", #type="html" or "latex" did not worked for me
align=TRUE,
no.space=TRUE,
covariate.labels = c("After=1", "Policy (Treatment)=1", "Policy*After=1" ),
title="TABLE: DIFF-IN-DIFF USING ORDINARY LEAST SQUARES")
##
## TABLE: DIFF-IN-DIFF USING ORDINARY LEAST SQUARES
## ================================================
## Dependent variable:
## ---------------------------
## outbreaks
## ------------------------------------------------
## After=1 8.555
## (26.823)
## Policy (Treatment)=1 0.030
## (26.823)
## Policy*After=1 -285.607***
## (37.933)
## Constant 821.533***
## (18.967)
## ------------------------------------------------
## Observations 200
## R2 0.455
## Adjusted R2 0.446
## Residual Std. Error 134.114 (df = 196)
## F Statistic 54.485*** (df = 3; 196)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
\(Disease Outbreak_{it} = \beta + \alpha \times Policy_i + \tau \times After_t + \gamma (Policy \times After) + \epsilon_{it}\)
Result: Disease Outbreak = 821.533 + 0.030 Policy + 8.555 After - 285.607 (Policy*After)
library(plm)
didplm <- plm(outbreaks ~ policy+policy*after, data1, index="after", model="within")
didplm
##
## Model Formula: outbreaks ~ policy + policy * after
##
## Coefficients:
## policy policy:after1
## 0.029505 -285.606628
library(stargazer)
stargazer (didplm,
type="text", #type="html" or "latex" did not worked for me
align=TRUE,
no.space=TRUE,
covariate.labels = c("After=1", "Policy (Treatment)=1", "Policy*After=1" ),
title="TABLE: DIFF-IN-DIFF USING ORDINARY LEAST SQUARES")
##
## TABLE: DIFF-IN-DIFF USING ORDINARY LEAST SQUARES
## ================================================
## Dependent variable:
## ---------------------------
## outbreaks
## ------------------------------------------------
## After=1 0.030
## (26.823)
## Policy (Treatment)=1 -285.607***
## (37.933)
## ------------------------------------------------
## Observations 200
## R2 0.366
## Adjusted R2 0.357
## F Statistic 56.677*** (df = 2; 196)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Lets explore slightly different scenario. We are interested in looking the interventions to improve household nutritional security. We want to know if higher food consumption helps; there can be multiple interventions that helps improving household consumption level. Along the line, we found that Supplemental Nutrition Assistance Program (SNAP) makes an interesting case because households with income < $15000 are eligible to participate in the program. Our identification strategy is discontinuity that is observed at the cutoff. We assume that there will be full compliance because who would reject the free money? It also does keep the things simple.
We can use nutritional security measure like dietary diversity, specific micro nutrient intake using blood samples, etc. Lets assume we collected blood sample from randomly selected household member, and identified the level of Zinc (Zn) present in the blood. The normal range of Zn in the human body is 70-120 micro grams. For this example, higher the variety of fresh foods the household consumes (like meat, fish, seafood), higher will be their zinc concentration and household will be considered as the nutrionally secure (note that, measuring only one mineral is not enough in real life).
set.seed(12345)
n = 200 #200 HHs
data2 <- data.frame(household = rep(c(1:n), each=1)) #create HHid
data2$zn = runif(n, 10, 120) #zn is zinc level
data2$income = rnorm(n, 15005, 1000) #mean is 1505
#assume that the true effect on Zn level is 30 points higher when household receives SNAP benefits
data2$zn[data2$income<=15000] <- data2$zn[data2$income<=15000] + 30
#lets create a treatment dummy; treatment=1 if income <=15000
data2$treatment <- as.factor(ifelse(data2$income<=15000, 1, 0))
We assume that the cutoff at $15000 exogenously determines the level of fresh food consumption. While households earning exactly below and above the cutoff are not very different, one household gets free food (through SNAP) and other does not, respectively. This mimic the randomization near the cutoff; and, with the appropriate bandwidth selection, we can compare the outcomes of these two groups.
Call the households receiving SNAP as “eligible” ones. Lets approximate the plot using local linear function.
#lets split data to two halves (over and under 15000 income) to draw fitted lines:
# NOTE: I am using some of my codes from past homeworks for this course.
data2_eligible <- subset(data2, data2$treatment == 1)
data2_ineligible <- subset(data2, data2$treatment == 0)
library(ggplot2)
rddplot <- ggplot()+
geom_point(data = data2, aes(x = income, y = zn), shape = 5) +
geom_smooth(aes(x=income, y=zn), data=data2_eligible, method ="lm")+
geom_smooth(aes(x=income, y=zn), data=data2_ineligible, method ="lm")+
scale_x_continuous(breaks=seq(13000, 17000, by=500)) + scale_y_continuous(breaks=seq(55, 150, by= 10)) +
xlab("Income") +
ylab("Zn Level in Blood Sample (in micrograms") +
ggtitle("Zn Level vs Income Cutoff")
rddplot
Notice
that there is a sharp jump at the cutoff at Income=15000. Zn Level of
the households receiving SNAP benefits are relatively higher than the
ineligible households near the cutoff.
simple_RDD <- lm(zn ~ income + treatment, data2)
library(stargazer)
stargazer (simple_RDD,
type="text",
align=TRUE,
no.space=TRUE,
covariate.labels = c("Income", "Income<=15000"),
title="Table: Simple RDD")
##
## Table: Simple RDD
## ===============================================
## Dependent variable:
## ---------------------------
## zn
## -----------------------------------------------
## Income 0.004
## (0.004)
## Income<=15000 34.573***
## (7.512)
## Constant 1.618
## (61.113)
## -----------------------------------------------
## Observations 200
## R2 0.162
## Adjusted R2 0.154
## Residual Std. Error 32.433 (df = 197)
## F Statistic 19.060*** (df = 2; 197)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The estimate obtained from the simple linear model (34.57) is almost equal to the real effect (i.e., 30 points). The households eligible for SNAP benefits have 34.573 points higher Zn level than the ineligible households. This also means that ineligible households are Zn-deficient. The result is significant at 1% level. The jump at the cutoff of Income=15000 is captured by this simple framework.
There can be non-linearity in the original model:
library(ggplot2)
rddplot <- ggplot()+
geom_point(data = data2, aes(x = income, y = zn), shape = 5) +
geom_smooth(aes(x=income, y=zn), data=data2_eligible)+
geom_smooth(aes(x=income, y=zn), data=data2_ineligible)+
scale_x_continuous(breaks=seq(13000, 17000, by=500)) + scale_y_continuous(breaks=seq(55, 150, by= 10)) +
xlab("Income") +
ylab("Zn Level in Blood Sample (in micrograms") +
ggtitle("Zn Level vs Income Cutoff")
rddplot
Above
we are using local linear approximation, it is very important to
understand the model that we used to approximate the underlying data.
For example, if we decide on using polynomial function to approximate
above curves, we should be cautious of the chances of
over/underestimating the jump at the threshold.
library(rdd)
RDD <- RDestimate(zn ~ income, data = data2, cutpoint = 15000, bw=800)
RDD2 <- RDestimate(zn ~ income, data = data2, cutpoint = 15000, kernel= "rectangular", bw=1000)
RDD
##
## Call:
## RDestimate(formula = zn ~ income, data = data2, cutpoint = 15000,
## bw = 800)
##
## Coefficients:
## LATE Half-BW Double-BW
## -39.18 -51.14 -28.42
plot (RDD)
Since there is some non-linearity in the plot obtained from pacakged RDestimate, lets allow slopes to vary:
#create interaction
data2$treatment <- ifelse(data2$income<=15000, 1, 0)
data2$interaction <- data2$income*data2$treatment
flexible_RDD <- lm(zn ~ income + treatment + interaction, data2)
library(stargazer)
stargazer (flexible_RDD,
type="text",
align=TRUE,
no.space=TRUE,
covariate.labels = c("Income", "Income<=15000", "Income * Income<=15000"),
title="Table: Flexible RDD")
##
## Table: Flexible RDD
## ==================================================
## Dependent variable:
## ---------------------------
## zn
## --------------------------------------------------
## Income 0.0004
## (0.005)
## Income<=15000 -97.783
## (116.251)
## Income * Income<=15000 0.009
## (0.008)
## Constant 64.572
## (82.303)
## --------------------------------------------------
## Observations 200
## R2 0.168
## Adjusted R2 0.155
## Residual Std. Error 32.409 (df = 196)
## F Statistic 13.160*** (df = 3; 196)
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The interaction term is not significant, however.