A1: Campus mandate


Due dates: 18 January 2024 (share screens in class), 23 January (submissions due)


Assignment: Propose and simulate a data-generating process in which (i) students who live on campus in their freshman year tend to have better outcomes, while (ii) the causal effect of living on campus is negative.



Let’s model this as an omitted-variable bias (OVB) problem, where the true relationship is \[y = x_1 + x_2 + \beta~\mathbf{1}(T=1) + \epsilon~,\] with \(\epsilon \sim N(0,1)\). (Maybe make \(x_1\) continuous and \(x_2\) discrete?) To allow for non-random treatment (i.e., on-campus living), let’s assume that the linear combination \(z = 1 + a_1 x_1 + a_2 x_2\) determines the probability that each individual lives on campus, such that \[\Pr[T=1] = \frac{1}{1+e^{-z}}~.\] Setting \(a_1 > 0\) and \(a_2 > 0\) then implies that “better” students live on campus, on average: positive selection into treatment. (While there are many reasonable ways to model this, we’ll see some good intuition unfold following this setup.)
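A short derivation shows why the naive comparison can have the wrong sign under this setup. It assumes, as the setup implies, that \(\epsilon\) is independent of \(x_1\), \(x_2\) and \(T\). Regressing \(y\) on \(T\) alone gives a slope that converges to

\[\frac{Cov(y, T)}{Var(T)} = \beta + \frac{Cov(x_1 + x_2,~T)}{Var(T)} = \beta + \Big(E[x_1 + x_2 \mid T=1] - E[x_1 + x_2 \mid T=0]\Big)~.\]

The last equality uses the fact that \(T\) is binary. With \(a_1 > 0\) and \(a_2 > 0\), the term in parentheses is positive, so the naive slope overstates \(\beta\). It is positive whenever the gap in \(x_1 + x_2\) between the two groups exceeds \(|\beta|\).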

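# DGP: simulate covariates, selection into treatment, and outcomes
library(pacman)
p_load(dplyr)
set.seed(4321)
n = 1000
dpg_df = tibble(
  e = rnorm(n, sd = 1),                  # outcome noise, N(0,1)
  b = -0.5,                              # true causal effect of living on campus
  x1 = rnorm(n, sd = 1),                 # continuous characteristic
  x2 = sample(0:1, n, replace = TRUE),   # discrete characteristic
  z = 1 + x1 + x2,                       # selection index with a_1 = a_2 = 1
  prob_t = 1 / (1 + exp(-z)),            # Pr[T = 1], logistic in z
  t = rbinom(n, 1, prob_t),              # realized treatment (1 = on campus)
  y = x1 + x2 + b * t + e                # observed outcome
)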
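One way to check that the selection mechanism behaves as intended is to compare covariate means across treatment groups. The sketch below rebuilds the DGP in base R. The helper name `simulate_dgp`, its arguments, and the seed are assumptions introduced here for illustration, so it does not reproduce the exact draws above.

```r
# Hypothetical helper (not part of the original code): same DGP, base R only
simulate_dgp <- function(n = 1000, b = -0.5, a1 = 1, a2 = 1) {
  x1    <- rnorm(n)                             # continuous characteristic
  x2    <- sample(0:1, n, replace = TRUE)       # discrete characteristic
  z     <- 1 + a1 * x1 + a2 * x2                # selection index
  treat <- rbinom(n, 1, plogis(z))              # plogis(z) = 1 / (1 + exp(-z))
  y     <- x1 + x2 + b * treat + rnorm(n)       # outcome with causal effect b
  data.frame(x1, x2, treat, y)
}

set.seed(1)  # arbitrary seed, chosen for this sketch only
sim <- simulate_dgp()

# Mean of each covariate by treatment status (row 1: treat = 0, row 2: treat = 1)
cov_means <- aggregate(cbind(x1, x2) ~ treat, data = sim, FUN = mean)
print(cov_means)
```

If selection is positive, both covariate means should be higher in the `treat = 1` row than in the `treat = 0` row.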

Descriptives

# Raw outcomes by treatment status
p_load(ggplot2)
plot1 = ggplot(data = dpg_df, aes(x = t, y = y, color = factor(t))) +
  geom_point() +
  scale_color_manual(values = c("#69b3a2", "#404080")) +
  labs(title = "Educational Outcomes vs. On-Campus Residency",
       x = "Treatment (0: off campus, 1: on campus)",
       y = "Educational Outcome",
       color = "Treatment")
plot1

First Glance

Plotting educational outcomes (GPA, graduation rates, etc.) against whether students lived on campus (T=1) or off campus (T=0) in their first year suggests that students who live on campus are more successful in school. So it must be that living on campus causes students to do better, right?

Regression analysis

# Estimate four specifications and compare them in one table
p_load(fixest)
Naive_Regression <- feols(y ~ t, data = dpg_df)        # no controls
Cont_OV <- feols(y ~ x1 + t, data = dpg_df)            # controls for x1 only
Discrete_OV <- feols(y ~ x2 + t, data = dpg_df)        # controls for x2 only
True_Model <- feols(y ~ x1 + x2 + t, data = dpg_df)    # matches the DGP

comb_summary <- etable(Naive_Regression, Cont_OV, Discrete_OV, True_Model)
print(comb_summary)
##                   Naive_Regression            Cont_OV         Discrete_OV
## Dependent Var.:                  y                  y                   y
##                                                                          
## Constant         -0.2115* (0.0879) 0.3322*** (0.0737) -0.5543*** (0.0882)
## t               0.4828*** (0.1018) -0.2766** (0.0871)   0.3010** (0.0972)
## x1                                 0.9355*** (0.0391)                    
## x2                                                     0.9569*** (0.0846)
## _______________ __________________ __________________ ___________________
## S.E. type                      IID                IID                 IID
## Observations                 1,000              1,000               1,000
## R2                         0.02206            0.37899             0.13323
## Adj. R2                    0.02108            0.37774             0.13150
## 
##                          True_Model
## Dependent Var.:                   y
##                                    
## Constant           -0.0230 (0.0693)
## t               -0.4905*** (0.0788)
## x1               0.9583*** (0.0349)
## x2                1.028*** (0.0639)
## _______________ ___________________
## S.E. type                       IID
## Observations                  1,000
## R2                          0.50719
## Adj. R2                     0.50570
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Discussion

The first regression is the naive reading of the descriptive plot above: regressing outcomes on treatment alone yields a significantly positive estimate of the effect of living on campus (0.48). The second and third regressions each add one of the omitted variables. Controlling for x1 turns the estimate negative (-0.28), while controlling for x2 only shrinks it, leaving it positive (0.30). The last regression estimates the true model of the simulation, controlling for both x1 and x2, and yields a significantly negative estimate (-0.49, standard error 0.08). That estimate is statistically indistinguishable from -0.5, the value of \(\beta\) set in the data-generating process.
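The gap between the naive and full estimates can be accounted for exactly with the standard OLS omitted-variable decomposition: the short-regression slope equals the long-regression slope plus each omitted covariate’s coefficient times the slope from regressing that covariate on treatment. The sketch below illustrates this in base R on a freshly simulated sample. The variable names and the seed are assumptions made here, and the numbers will differ from the table above.

```r
# Rebuild the DGP in base R (arbitrary seed, for this sketch only)
set.seed(1)
n     <- 1000
x1    <- rnorm(n)
x2    <- sample(0:1, n, replace = TRUE)
treat <- rbinom(n, 1, plogis(1 + x1 + x2))
y     <- x1 + x2 - 0.5 * treat + rnorm(n)

naive <- coef(lm(y ~ treat))["treat"]        # short regression: no controls
full  <- coef(lm(y ~ x1 + x2 + treat))       # long regression: matches the DGP

# Auxiliary regressions: how each omitted covariate differs with treatment
d1 <- coef(lm(x1 ~ treat))["treat"]
d2 <- coef(lm(x2 ~ treat))["treat"]

# Omitted-variable bias and the naive slope rebuilt from its pieces
bias    <- unname(full["x1"] * d1 + full["x2"] * d2)
rebuilt <- unname(full["treat"]) + bias

print(c(naive = unname(naive), rebuilt = rebuilt, bias = bias))
```

The `naive` and `rebuilt` values agree up to floating-point error because the decomposition is an in-sample algebraic identity, not an approximation.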

# Outcomes net of x1 and x2 (both have a true coefficient of 1), by treatment status
plot2 <- ggplot(dpg_df, aes(x = t, y = y - x1 - x2, color = factor(t))) +
  geom_point() +
  scale_color_manual(values = c("#69b3a2", "#404080")) +
  labs(title = "Controlling for Selection into Treatment",
       x = "Treatment (0: off campus, 1: on campus)",
       y = "Educational Outcome (net of x1 and x2)",
       color = "Treatment")
plot2

Conclusion

The plot above supports the final regression. Once we control for selection into treatment (students who choose to live on campus have, on average, characteristics that help them later in school, such as wealth, race, etc.), the effect of living on campus, all else equal, is negative. This is a proof of concept: students who choose to live on campus can do better on average even when living on campus itself hurts their performance. So if we believe there is selection into treatment, which is very likely, a policy requiring all students to live on campus would produce worse educational outcomes than the naive “first look” suggests.
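A single simulated sample could show this pattern by chance, so a small Monte Carlo exercise is one way to check that it is a feature of the DGP. The sketch below repeats the simulation 200 times in base R and averages the naive and fully controlled estimates. The function name `one_draw`, the number of replications, and the seed are assumptions made for illustration.

```r
# Hypothetical helper: simulate one sample and return both treatment estimates
one_draw <- function(n = 1000, b = -0.5) {
  x1    <- rnorm(n)
  x2    <- sample(0:1, n, replace = TRUE)
  treat <- rbinom(n, 1, plogis(1 + x1 + x2))   # positive selection on x1 and x2
  y     <- x1 + x2 + b * treat + rnorm(n)
  c(naive = unname(coef(lm(y ~ treat))["treat"]),
    full  = unname(coef(lm(y ~ x1 + x2 + treat))["treat"]))
}

set.seed(2024)                          # arbitrary seed, for this sketch only
draws <- replicate(200, one_draw())     # 2 x 200 matrix: rows "naive" and "full"
avg   <- rowMeans(draws)                # average estimate per specification
print(round(avg, 3))
```

Under this setup the average naive estimate should be positive, while the average fully controlled estimate should sit close to the true value of -0.5.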