\[ \begin{aligned} & \mathbb{E}_X\left[\mathbb{E}(Y \mid A=1, X) - \mathbb{E}(Y \mid A=0, X)\right] \\ &= \mathbb{E}_X\left[\mathbb{E}(-X_1 + \log|\!X_1| + 1 + \varepsilon) - \mathbb{E}(-X_1 + \log|\!X_1| + \varepsilon)\right] \\ &= \mathbb{E}_X[1] \\ &= 1 \end{aligned} \]
Observed data distribution (the actual world)?
Sample size: \(n = 200\); number of Monte Carlo replications: \(2000\).
Covariates: \(X_1 \sim \mathcal{N}(0, 1),\; X_2 \sim \mathcal{N}(0, 1)\).
Treatment: \(A \sim \text{Bernoulli}(\pi(X))\), where \(\pi(X) = \text{expit}(-1 + X_1 + X_2)\).
Outcome: \(Y = -X_1 + \log(|X_1|) + A + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(0, 1)\).
The observed data distribution is the outcome above with the given distribution for the variables A, X, and E
Target data distribution (counterfactual world)
\[ [Y \mid A = 1, X_1, \varepsilon] - [Y \mid A = 0, X_1, \varepsilon] = \left( -X_1 + \log(|X_1|) + \varepsilon + 1 \right) - \left( -X_1 + \log(|X_1|) + \varepsilon \right) = 1 \] The target distribution is the distribution of the difference of outcomes for the treated (A=1) and those that weren’t treated (A=0).
IPW formula for estimating the ATE \[
\mathbb{E}[Y \mid A = 1] \approx \frac{1}{200} \sum_{i=1}^{200}
\frac{1}{{\pi}_i} Y_i
\]
The average treatement effect for the true model is: 0.9806
The average treatement effect for the estimated model is: 0.9601
The average treatment effect for the misspecified model is: 0.9600
boxplot(q3data, names=c('True estimated treatment','Estimated treatment', 'Misspecified treatment'), ylab = "ATE estimates")
The variance for the misspecified treatment is the lowest, followed by the estimated treatment and then the the true estimated treatment. This is expected since the more complex models have more covariates that introduce variance whereas the misspecified treatment model only has the one covariate. However, this would indicate that the mispecified treatement will have more bias compared to the other two (with the true estimated model having the least bias).
#Initialize Values for code used in all parts
n<-200
reps<-2000
ateA<-NULL
ateB<-NULL
ateC<-NULL
expit<-function(x){
1/(1+exp(-x))}
#Question 3 code
for (i in 1:reps) {
set.seed(i)
#define variables
x1<-rnorm(n)
x2<-rnorm(n)
err<-rnorm(n)
A<-rbinom(n,1,expit(-1+x1+x2))
#Question 3a
q3<-as.data.frame(cbind(x1,x2,err,A,Y))
regAb<-glm(A~x1+x2, data=q3, family = 'binomial')
#calculate IPW
Y<-(-x1+log(abs(x1))+A+err)
trueAa<- A/(expit(-1+x1+x2))
contAa<- (1-A)/(1-(expit(-1+x1+x2)))
ipwA<-mean(trueAa*Y)-mean(contAa*Y)
ateA[i]<-ipwA
#for question 3b:
predAb<-predict(regAb, type='response')
#calculate IPW
treatedB<-A/predAb
nottreatB<-(1-A)/(1-predAb)
ipwB<-mean(treatedB*Y)-mean(nottreatB*Y)
ateB[i]<-ipwB
#for question 3c:
regAc<-glm(A~x1, data=q3, family = 'binomial')
predAc<-predict(regAc, type='response') #pi hat
#calculate IPW
treatedC<-A/predAc
nottreatC<-(1-A)/(1-predAc)
ipwC<-mean(treatedC*Y) - mean(nottreatC*Y)
ateC[i]<-ipwC
}
#Put in data frame then plot in boxplot for comparison
q3data<-as.data.frame(cbind(ateA,ateB,ateC))
boxplot(q3data, names=c('True estimated treatment','Estimated treatment', 'Misspecified treatment'), ylab = "ATE estimates")
#3.a.
#Mean Y[A=1] - Y[A=0]
trueATE<-mean(q3data$ateA)
#3.b.
estATE<-mean(q3data$ateB)
#3.c.
mispecifiedATE<-mean(q3data$ateC)
#4