Due dates: 18 January 2024 (share screens in class), 23 January (submissions due)
Assignment: Propose and simulate a data-generating process in which (i) students who live on campus in their freshman year tend to have better outcomes, but at the same time (ii) the causal effect of living on campus is negative.
It’s up to you whether the causal effect is negative for all students, or positive for those who select into living on campus and negative for those who do not. The latter would actually make sense in a model where students selected into living accommodations with knowledge of what suited them best.
Complete this task in a markdown file and make the argument that’s relevant to the policymakers visually (e.g., likely something that starts with ggplot).
I’d like you to eventually upload an html file. Prep something for 18 January, anticipating that we can have a look together before we submit our responses on the 23rd.
Let’s model this as an omitted-variable bias (OVB) problem, where the true relationship is captured in \[y = x_1 + x_2 + \beta~\mathbb{1}(T=1) + \epsilon~,\] and \(\epsilon \sim N(0,1)\). (Maybe make \(x_1\) continuous and \(x_2\) discrete?) In order to introduce the potential for non-random treatment (i.e., on-campus living), let’s assume that the linear combination \(z = 1 + a_1x_1 + a_2x_2\) determines the probability each individual lives on campus, such that \[\Pr[T=1] = \frac{1}{1+e^{-z}}~.\] Setting \(a_1, a_2 > 0\) will then imply that “better” students live on campus, on average; that is, there is positive selection into treatment. (While there are many reasonable ways to model this, we’ll see some good intuition unfold following this setup.)
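One way to see where the bias comes from, as a sketch under this setup and assuming \(\epsilon\) is independent of \(x_1\), \(x_2\), and \(T\), is to compare mean outcomes across the two groups. The conditional-expectation notation \(E[\cdot \mid T]\) is introduced here for illustration and is not part of the assignment text.
\[E[y \mid T=1] - E[y \mid T=0] = \beta + \Big(E[x_1 + x_2 \mid T=1] - E[x_1 + x_2 \mid T=0]\Big)~.\]
With positive selection the term in parentheses is positive, so a naive comparison overstates \(\beta\). It can show a positive gap even when \(\beta < 0\), provided that term exceeds \(|\beta|\).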
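# DGP
library(pacman)
p_load(dplyr)

set.seed(4321)
n = 1000  # number of simulated students

dpg_df = tibble(
  e = rnorm(n, sd = 1),                 # idiosyncratic error, N(0, 1)
  b = -0.5,                             # true causal effect of living on campus
  x1 = rnorm(n, sd = 1),                # continuous confounder
  x2 = sample(0:1, n, replace = TRUE),  # discrete (binary) confounder
  z = 1 + x1 + x2,                      # selection index with a1 = a2 = 1
  prob_t = 1 / (1 + exp(-z)),           # logistic probability of living on campus
  t = rbinom(n, 1, prob_t),             # realized treatment (1 = on campus)
  y = x1 + x2 + b * t + e               # outcome from the true model
)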
# ggplot
p_load(ggplot2)

# Raw outcomes by treatment status, with no adjustment for x1 or x2
plot1 = ggplot(data = dpg_df, aes(x = t, y = y, color = factor(t))) +
  geom_point() +
  scale_color_manual(values = c("#69b3a2", "#404080")) +
  labs(
    title = "Educational Outcomes vs. On-Campus Residency",
    x = "Treatment (0: off campus, 1: on campus)",
    y = "Educational Outcome",
    color = "Treatment"
  )
plot1
Plotting educational outcomes (GPA, graduation rates, etc.) against whether students lived on campus (T=1) or off campus (T=0) in their first year suggests that students who live on campus have more success in school. So it must be that living on campus causes students to do better, right?
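# Estimate the models and tabulate the results
p_load(fixest)
p_load(broom)

# Naive regression: omits both confounders
Naive_Regression <- feols(y ~ t, data = dpg_df)
# Controls for the continuous confounder only
Cont_OV <- feols(y ~ x1 + t, data = dpg_df)
# Controls for the discrete confounder only
Discrete_OV <- feols(y ~ x2 + t, data = dpg_df)
# Matches the DGP: controls for both confounders
True_Model <- feols(y ~ x1 + x2 + t, data = dpg_df)

# Side-by-side coefficient table
comb_summary <- etable(Naive_Regression, Cont_OV, Discrete_OV, True_Model)
print(comb_summary)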
##                  Naive_Regression     Cont_OV              Discrete_OV          True_Model
## Dependent Var.:  y                    y                    y                    y
##
## Constant         -0.2115* (0.0879)    0.3322*** (0.0737)   -0.5543*** (0.0882)  -0.0230 (0.0693)
## t                0.4828*** (0.1018)   -0.2766** (0.0871)   0.3010** (0.0972)    -0.4905*** (0.0788)
## x1                                    0.9355*** (0.0391)                        0.9583*** (0.0349)
## x2                                                         0.9569*** (0.0846)   1.028*** (0.0639)
## _______________  ___________________  ___________________  ___________________  ___________________
## S.E. type        IID                  IID                  IID                  IID
## Observations     1,000                1,000                1,000                1,000
## R2               0.02206              0.37899              0.13323              0.50719
## Adj. R2          0.02108              0.37774              0.13150              0.50570
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first regression is the naive reading of the descriptive plot above: regressing outcomes on treatment alone yields a significantly positive estimate of the effect of living on campus. The second and third regressions each add one of the omitted variables (x1 or x2). Controlling for x1 alone turns the estimated treatment effect negative, while controlling for x2 alone leaves it positive but smaller. The last regression estimates the true model from the simulation, controlling for both x1 and x2, and yields a significantly negative estimate of the treatment effect. That estimate (-0.4905, standard error 0.0788) is statistically indistinguishable from -0.5, the value of b set in the data-generating process.
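The comparison between the True_Model estimate and -0.5 can be checked by hand. The sketch below is a minimal illustration rather than part of the original analysis: it copies the estimate and standard error for t from the True_Model column of the table, and it assumes a t reference distribution with n minus 4 residual degrees of freedom, which is an assumption about how the IID standard errors were computed. The names est, se, b_true, df_resid, t_stat, p_value, and ci_95 are hypothetical.

```r
# Estimate and standard error for t, copied from the True_Model column
est <- -0.4905
se <- 0.0788

# Value of b set in the DGP
b_true <- -0.5

# Assumed residual degrees of freedom: 1,000 observations, 4 coefficients
df_resid <- 1000 - 4

# t statistic for H0: coefficient on t equals b_true
t_stat <- (est - b_true) / se

# Two-sided p-value under the assumed t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the coefficient on t
ci_95 <- est + c(-1, 1) * qt(0.975, df = df_resid) * se

cat("t statistic:", round(t_stat, 3), "\n")
cat("p-value:", round(p_value, 3), "\n")
cat("95% CI:", round(ci_95, 3), "\n")
```

If the interval contains -0.5 and the p-value is large, the data give no reason to reject the true parameter value, which is what "statistically the same" means here.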
# Outcomes net of x1 and x2 (each has a true coefficient of 1 in the DGP)
plot2 <- ggplot(dpg_df, aes(x = t, y = y - x1 - x2, color = factor(t))) +
  geom_point() +
  scale_color_manual(values = c("#69b3a2", "#404080")) +
  labs(
    title = "Controlling for Selection into Treatment",
    x = "Treatment (0: off campus, 1: on campus)",
    y = "Educational Outcome (net of x1 and x2)",
    color = "Treatment"
  )
plot2
The plot above supports the evidence from the final regression. Once we control for selection into treatment (here by subtracting x1 and x2, which each enter the DGP with a coefficient of 1), the effect of living on campus, all else equal, is actually negative. Selection arises because students who choose to live on campus have, on average, characteristics that help them later in school (wealth, race, etc.). This provides a proof of concept in which students who choose to live on campus tend to do better even though living on campus itself hurts their performance. So if we believe there is selection into treatment, which is very likely, a policy requiring all students to live on campus would produce worse educational outcomes than the naive “first look” suggests.
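To put numbers next to the two plots, the sketch below re-simulates a DGP of the same form in base R and splits the raw on-campus gap into a causal part and a selection part. It is an illustration under stated assumptions, not the original analysis: the object names (treat, raw_gap, adj_gap, selection_gap) are hypothetical, and because the draws are regenerated here the values need not match the regression table exactly.

```r
set.seed(4321)
n <- 1000
b <- -0.5                                   # true causal effect, as in the DGP

e <- rnorm(n)                               # idiosyncratic error
x1 <- rnorm(n)                              # continuous confounder
x2 <- sample(0:1, n, replace = TRUE)        # binary confounder
prob_t <- 1 / (1 + exp(-(1 + x1 + x2)))     # logistic selection, a1 = a2 = 1
treat <- rbinom(n, 1, prob_t)               # 1 = lives on campus
y <- x1 + x2 + b * treat + e                # outcome

# Raw gap in mean outcomes: what the first plot shows
raw_gap <- mean(y[treat == 1]) - mean(y[treat == 0])

# Gap after removing x1 and x2: what the second plot shows
adj <- y - x1 - x2
adj_gap <- mean(adj[treat == 1]) - mean(adj[treat == 0])

# Gap in the confounders themselves: the selection component
conf_sum <- x1 + x2
selection_gap <- mean(conf_sum[treat == 1]) - mean(conf_sum[treat == 0])

cat("Raw gap:      ", round(raw_gap, 3), "\n")
cat("Adjusted gap: ", round(adj_gap, 3), "\n")
cat("Selection gap:", round(selection_gap, 3), "\n")
```

By construction raw_gap equals adj_gap plus selection_gap, so the printout shows how much of the apparent on-campus advantage comes from who selects into treatment instead of from the treatment itself.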