Question 1

\[ \begin{aligned} & \mathbb{E}_X\left[\mathbb{E}(Y \mid A=1, X) - \mathbb{E}(Y \mid A=0, X)\right] \\ &= \mathbb{E}_X\left[\mathbb{E}(-X_1 + \log|\!X_1| + 1 + \varepsilon) - \mathbb{E}(-X_1 + \log|\!X_1| + \varepsilon)\right] \\ &= \mathbb{E}_X[1] \\ &= 1 \end{aligned} \]

Question 2

Observed data distribution (the actual world)?

Sample size: \(n = 200\); number of Monte Carlo replications: \(2000\).

Covariates: \(X_1 \sim \mathcal{N}(0, 1),\; X_2 \sim \mathcal{N}(0, 1)\).

Treatment: \(A \sim \text{Bernoulli}(\pi(X))\), where \(\pi(X) = \text{expit}(-1 + X_1 + X_2)\).

Outcome: \(Y = -X_1 + \log(|X_1|) + A + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0, 1)\).

The observed data distribution is the outcome above with the given distribution for the variables A, X, and E

Target data distribution (counterfactual world)

\[ [Y \mid A = 1, X_1, \varepsilon] - [Y \mid A = 0, X_1, \varepsilon] = \left( -X_1 + \log(|X_1|) + \varepsilon + 1 \right) - \left( -X_1 + \log(|X_1|) + \varepsilon \right) = 1 \] The target distribution is the distribution of the difference of outcomes for the treated (A=1) and those that weren’t treated (A=0).

IPW formula for estimating the ATE \[ \mathbb{E}[Y \mid A = 1] \approx \frac{1}{200} \sum_{i=1}^{200} \frac{1}{{\pi}_i} Y_i \]

Question 3

The average treatement effect for the true model is: 0.9806

The average treatement effect for the estimated model is: 0.9601

The average treatment effect for the misspecified model is: 0.9600

boxplot(q3data, names=c('True estimated treatment','Estimated treatment', 'Misspecified treatment'), ylab = "ATE estimates")

The variance for the misspecified treatment is the lowest, followed by the estimated treatment and then the the true estimated treatment. This is expected since the more complex models have more covariates that introduce variance whereas the misspecified treatment model only has the one covariate. However, this would indicate that the mispecified treatement will have more bias compared to the other two (with the true estimated model having the least bias).

Appendix

#Initialize Values for code used in all parts
n<-200
reps<-2000
ateA<-NULL
ateB<-NULL
ateC<-NULL

expit<-function(x){
  1/(1+exp(-x))}

#Question 3 code
for (i in 1:reps) {
  set.seed(i)
  
  #define variables
  x1<-rnorm(n)
  x2<-rnorm(n)
  err<-rnorm(n)
  A<-rbinom(n,1,expit(-1+x1+x2))
  
  #Question 3a
  q3<-as.data.frame(cbind(x1,x2,err,A,Y))
  regAb<-glm(A~x1+x2, data=q3, family = 'binomial')
  
    #calculate IPW
  Y<-(-x1+log(abs(x1))+A+err)
  trueAa<- A/(expit(-1+x1+x2))
  contAa<- (1-A)/(1-(expit(-1+x1+x2)))
  ipwA<-mean(trueAa*Y)-mean(contAa*Y)
  ateA[i]<-ipwA
  
  #for question 3b:
  predAb<-predict(regAb, type='response')
  
    #calculate IPW
  treatedB<-A/predAb
  nottreatB<-(1-A)/(1-predAb)
  ipwB<-mean(treatedB*Y)-mean(nottreatB*Y)
  ateB[i]<-ipwB
  
  #for question 3c:
  regAc<-glm(A~x1, data=q3, family = 'binomial')
  predAc<-predict(regAc, type='response') #pi hat
  
    #calculate IPW
  treatedC<-A/predAc
  nottreatC<-(1-A)/(1-predAc)
  ipwC<-mean(treatedC*Y) - mean(nottreatC*Y)
  ateC[i]<-ipwC
}

#Put in data frame then plot in boxplot for comparison
q3data<-as.data.frame(cbind(ateA,ateB,ateC))
boxplot(q3data, names=c('True estimated treatment','Estimated treatment', 'Misspecified treatment'), ylab = "ATE estimates")

#3.a.
#Mean Y[A=1] - Y[A=0]
trueATE<-mean(q3data$ateA)

#3.b.
estATE<-mean(q3data$ateB)

#3.c.
mispecifiedATE<-mean(q3data$ateC)

#4