I will create some fake scenario in this exercise that helps to understand causal inference methods and the type of data patterns or treatment effects they help reveal. For this exercise, I will use two methods:

  1. Differences-in-Differences (DID) with Natural Experiment

  2. Regression Discontinuity Design (RDD)

A) DID Scenario:

Assume that there are 100 fruits and vegetable markets in 5 states the US (20 in each state). Lets imagine we have the data for 5 states for two periods: 2000 and 2005. There was a Food Quality Control Policy implemented in 2003. Once the policy is implemented in certain market, government assumes that there will be excellent compliance, and the positive effect constantly persists. The policy implemented by Farm Section of USDA in 2003 was decided back in 1990.

We notice that the food borne disease outbreak has been an issue in the US market. We are interested in understanding the causal relation between food safety practices and food-borne disease outbreak in the US market. In this case, quality control policy allows us to estimate the true effect. To accompany our analysis, we have the data on food borne disease annual outbreaks after 1995 in these states from CDC.

Given that the food quality control policy implemented by Farm Section of USDAin 2003 was decided back in 1990, above analysis assumes that our treatment is exogenous. Also, fruits and vegetable consumers, CDC or market sellers do not have control over USDA’s decision-making and CDC have not published outbreak data in 1990 (they started publication in 1995) means USDA’s policy mimics quasi-experimental setting.

A.1. Fake Data

set.seed(12345)
n = 100                   #100 markets
data1 <- data.frame(market = rep(c(1:n), each=2))     #create marketid; we need two values for each market
data1$year = rep(c("2000", "2005"), n/2)              #each market has obs for 2000 and 2005
data1$policy = ifelse(data1$year=="2000",0,rep(c(0,1), n/2))        #policy=treatment, in 2005, half of the markets are treated and in 2000, none of them are treated.
data1$after = c(rep(0,50), rep(1,50))

data1$outbreaks = rnorm(n, 800, 120) #total annual outbreaks with mean 800 and 120 SD
#assume that the true effect is 300 less outbreaks when policy in action
data1$outbreaks[data1$policy == 1 & data1$after == 1] <- data1$outbreaks[data1$policy == 1 & data1$after == 1] - 300                    #if policy and after both =1, then the real effect can be seen.

A.2. Plotting Data Patterns

data1_plot<-ggplot(data1, aes(market, outbreaks, shape = year,color = policy)) +
  geom_point(size = 1.5) +
  geom_smooth(method="lm")+             #geomline created from linear regression using these market observations
  ggtitle("DID Simulation")+
  xlab("") + ylab("outbreaks")
data1_plot

This plot reveals that the markets without the treatment the trend of food-borne disease outbreaks would remain straight as black line (with mean slightly above 800). But, treatment (Quality Control Policy) helped markets to reduce the disease outbreak as seen in the blue line.

A.3. Relevant Regression

did <-  lm(outbreaks ~ after+policy+policy*after, data1)       #linear regression
did
## 
## Call:
## lm(formula = outbreaks ~ after + policy + policy * after, data = data1)
## 
## Coefficients:
##  (Intercept)         after        policy  after:policy  
##    821.53320       8.55474       0.02951    -285.60663
library(stargazer)
stargazer (did,
           type="text",         #type="html" or "latex" did not worked for me
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("After=1", "Policy (Treatment)=1", "Policy*After=1" ),
           title="TABLE: DIFF-IN-DIFF USING ORDINARY LEAST SQUARES")
## 
## TABLE: DIFF-IN-DIFF USING ORDINARY LEAST SQUARES
## ================================================
##                          Dependent variable:    
##                      ---------------------------
##                               outbreaks         
## ------------------------------------------------
## After=1                         8.555           
##                               (26.823)          
## Policy (Treatment)=1            0.030           
##                               (26.823)          
## Policy*After=1               -285.607***        
##                               (37.933)          
## Constant                     821.533***         
##                               (18.967)          
## ------------------------------------------------
## Observations                     200            
## R2                              0.455           
## Adjusted R2                     0.446           
## Residual Std. Error      134.114 (df = 196)     
## F Statistic            54.485*** (df = 3; 196)  
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

A.4. Results and Discussion:

\(Disease Outbreak_{it} = \beta + \alpha \times Policy_i + \tau \times After_t + \gamma (Policy \times After) + \epsilon_{it}\)

Result: Disease Outbreak = 821.533 + 0.030 Policy + 8.555 After - 285.607 (Policy*After)

  • The simulated outcome has the mean of 800 with 120 SD. The intercept (\(\beta\)) in the DID model in part A.3 resulted almost the correct mean.
  • For 2005 (compared to 2000), the outbreak is 8.55 unit higher (\(\tau\)), but the result is insignificant.
  • Treated Markets (i.e., where the Food Control Policy has been implemented) were found to have 0.030 unit higher (\(\alpha\)) food borne disease; again the result is insignificant.
  • Likewise, treated markets showed significantly different outbreak scenario in 2005. That is, these markets have 285.61 units lower outbreak than the untreated markets. The real effect in our simulated data was -300. The simulated DID model almost captured the real difference.

A.5. Additional Exercise:

library(plm)
didplm <-  plm(outbreaks ~ policy+policy*after, data1, index="after", model="within")
didplm
## 
## Model Formula: outbreaks ~ policy + policy * after
## 
## Coefficients:
##        policy policy:after1 
##      0.029505   -285.606628
library(stargazer)
stargazer (didplm,
           type="text",         #type="html" or "latex" did not worked for me
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("After=1", "Policy (Treatment)=1", "Policy*After=1" ),
           title="TABLE: DIFF-IN-DIFF USING ORDINARY LEAST SQUARES")
## 
## TABLE: DIFF-IN-DIFF USING ORDINARY LEAST SQUARES
## ================================================
##                          Dependent variable:    
##                      ---------------------------
##                               outbreaks         
## ------------------------------------------------
## After=1                         0.030           
##                               (26.823)          
## Policy (Treatment)=1         -285.607***        
##                               (37.933)          
## ------------------------------------------------
## Observations                     200            
## R2                              0.366           
## Adjusted R2                     0.357           
## F Statistic            56.677*** (df = 2; 196)  
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

B) RDD Scenario:

Lets explore slightly different scenario. We are interested in looking the interventions to improve household nutritional security. We want to know if higher food consumption helps; there can be multiple interventions that helps improving household consumption level. Along the line, we found that Supplemental Nutrition Assistance Program (SNAP) makes an interesting case because households with income < $15000 are eligible to participate in the program. Our identification strategy is discontinuity that is observed at the cutoff. We assume that there will be full compliance because who would reject the free money? It also does keep the things simple.

We can use nutritional security measure like dietary diversity, specific micro nutrient intake using blood samples, etc. Lets assume we collected blood sample from randomly selected household member, and identified the level of Zinc (Zn) present in the blood. The normal range of Zn in the human body is 70-120 micro grams. For this example, higher the variety of fresh foods the household consumes (like meat, fish, seafood), higher will be their zinc concentration and household will be considered as the nutrionally secure (note that, measuring only one mineral is not enough in real life).

B.1. Fake Data

set.seed(12345)
n = 200                   #200 HHs
data2 <- data.frame(household = rep(c(1:n), each=1))     #create HHid
data2$zn = runif(n, 10, 120)          #zn is zinc level
data2$income = rnorm(n, 15005, 1000)            #mean is 1505

#assume that the true effect on Zn level is 30 points higher when household receives SNAP benefits
data2$zn[data2$income<=15000] <- data2$zn[data2$income<=15000] + 30

#lets create a treatment dummy; treatment=1 if income <=15000
data2$treatment <- as.factor(ifelse(data2$income<=15000, 1, 0))

We assume that the cutoff at $15000 exogenously determines the level of fresh food consumption. While households earning exactly below and above the cutoff are not very different, one household gets free food (through SNAP) and other does not, respectively. This mimic the randomization near the cutoff; and, with the appropriate bandwidth selection, we can compare the outcomes of these two groups.

B.2. Plotting Data Patterns

Call the households receiving SNAP as “eligible” ones. Lets approximate the plot using local linear function.

#lets split data to two halves (over and under 15000 income) to draw fitted lines:
# NOTE: I am using some of my codes from past homeworks for this course.

data2_eligible <- subset(data2, data2$treatment == 1)
data2_ineligible <- subset(data2, data2$treatment == 0)

library(ggplot2)
rddplot <- ggplot()+
  geom_point(data = data2, aes(x = income, y = zn), shape = 5) +
  geom_smooth(aes(x=income, y=zn), data=data2_eligible, method ="lm")+
  geom_smooth(aes(x=income, y=zn), data=data2_ineligible, method ="lm")+
  scale_x_continuous(breaks=seq(13000, 17000, by=500)) +  scale_y_continuous(breaks=seq(55, 150, by= 10)) +
  xlab("Income") + 
  ylab("Zn Level in Blood Sample (in micrograms") +
  ggtitle("Zn Level vs Income Cutoff") 
rddplot

Notice that there is a sharp jump at the cutoff at Income=15000. Zn Level of the households receiving SNAP benefits are relatively higher than the ineligible households near the cutoff.

B.3. Relevant Regression

simple_RDD <- lm(zn ~ income + treatment, data2)
library(stargazer)
stargazer (simple_RDD,
           type="text",
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("Income", "Income<=15000"),
           title="Table: Simple RDD")
## 
## Table: Simple RDD
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 zn             
## -----------------------------------------------
## Income                         0.004           
##                               (0.004)          
## Income<=15000                34.573***         
##                               (7.512)          
## Constant                       1.618           
##                              (61.113)          
## -----------------------------------------------
## Observations                    200            
## R2                             0.162           
## Adjusted R2                    0.154           
## Residual Std. Error      32.433 (df = 197)     
## F Statistic           19.060*** (df = 2; 197)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

B.4. Results and Discussion:

The estimate obtained from the simple linear model (34.57) is almost equal to the real effect (i.e., 30 points). The households eligible for SNAP benefits have 34.573 points higher Zn level than the ineligible households. This also means that ineligible households are Zn-deficient. The result is significant at 1% level. The jump at the cutoff of Income=15000 is captured by this simple framework.

B.5. Additional Exercise:

There can be non-linearity in the original model:

library(ggplot2)
rddplot <- ggplot()+
  geom_point(data = data2, aes(x = income, y = zn), shape = 5) +
  geom_smooth(aes(x=income, y=zn), data=data2_eligible)+
  geom_smooth(aes(x=income, y=zn), data=data2_ineligible)+
  scale_x_continuous(breaks=seq(13000, 17000, by=500)) +  scale_y_continuous(breaks=seq(55, 150, by= 10)) +
  xlab("Income") + 
  ylab("Zn Level in Blood Sample (in micrograms") +
  ggtitle("Zn Level vs Income Cutoff") 
rddplot

Above we are using local linear approximation, it is very important to understand the model that we used to approximate the underlying data. For example, if we decide on using polynomial function to approximate above curves, we should be cautious of the chances of over/underestimating the jump at the threshold.

library(rdd)
RDD <- RDestimate(zn ~ income, data = data2, cutpoint = 15000, bw=800)
RDD2 <- RDestimate(zn ~ income, data = data2, cutpoint = 15000, kernel= "rectangular", bw=1000)
RDD
## 
## Call:
## RDestimate(formula = zn ~ income, data = data2, cutpoint = 15000, 
##     bw = 800)
## 
## Coefficients:
##      LATE    Half-BW  Double-BW  
##    -39.18     -51.14     -28.42
plot (RDD)

Since there is some non-linearity in the plot obtained from pacakged RDestimate, lets allow slopes to vary:

#create interaction
data2$treatment <- ifelse(data2$income<=15000, 1, 0)
data2$interaction <- data2$income*data2$treatment
  
flexible_RDD <- lm(zn ~ income + treatment + interaction, data2)
library(stargazer)
stargazer (flexible_RDD,
           type="text",
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("Income", "Income<=15000", "Income * Income<=15000"),
           title="Table: Flexible RDD")
## 
## Table: Flexible RDD
## ==================================================
##                            Dependent variable:    
##                        ---------------------------
##                                    zn             
## --------------------------------------------------
## Income                           0.0004           
##                                  (0.005)          
## Income<=15000                    -97.783          
##                                 (116.251)         
## Income * Income<=15000            0.009           
##                                  (0.008)          
## Constant                         64.572           
##                                 (82.303)          
## --------------------------------------------------
## Observations                       200            
## R2                                0.168           
## Adjusted R2                       0.155           
## Residual Std. Error         32.409 (df = 196)     
## F Statistic              13.160*** (df = 3; 196)  
## ==================================================
## Note:                  *p<0.1; **p<0.05; ***p<0.01

The interaction term is not significant, however.