Causal Inference Lab 1: Individual assignment

The perfect doctor

Lab specs can be found here

Exercise 1

Question: Recreate table 1 shown below, inserting values appropriately for the three empty colums: (i) The column labeled \(\delta_i\): please enter each patient’s treatment effect (ii) The column labeled \(D\): the optimal treatment for this patient (iii) The column labeled \(Y\): the observed outcomes. Calculate the average treatment effect (ATE) and the average treatment effect for the treated (ATT) when comparing the outcome of the ventilators treatment with that of the bedrest treatment and comment as to which type of intervention is more effective on average. Finally, explain under which conditions might SUTVA be violated for treatments of covid-19 in the scenario described above.

# Write your code here

tribble(
  ~patient, ~`Y(0)`, ~`Y(1)`, ~Age, ~`   delta   `, ~`   D   `, ~`   Y   `,
  1, 10, 1,  29, -9, 0, 10,
  2,  5, 1,  35, -4, 0, 5,
  3,  4, 1,  19, -3, 0, 4,
  4,  6, 5,  45, -1, 0, 6,
  5,  1, 5,  65, 4, 1, 5,
  6,  7, 6,  50,-1, 0, 7, 
  7,  8, 7,  77, -1, 0,7,
  8, 10, 7,  18, -3, 0, 10, 
  9,  2, 8,  85, 6, 1, 8,
  10, 6, 9,  96, 3, 1, 9,
  11, 7, 10, 77, 3, 1,10) %>% 
  gt()

patient	Y(0)	Y(1)	Age	delta	D	Y
1	10	1	29	-9	0	10
2	5	1	35	-4	0	5
3	4	1	19	-3	0	4
4	6	5	45	-1	0	6
5	1	5	65	4	1	5
6	7	6	50	-1	0	7
7	8	7	77	-1	0	7
8	10	7	18	-3	0	10
9	2	8	85	6	1	8
10	6	9	96	3	1	9
11	7	10	77	3	1	10

##prepare data
data <- data.frame(
  patient = 1:11,
  Y0 = c(10, 5, 4, 6, 1, 7, 8, 10, 2, 6, 7),
  Y1 = c(1, 1, 1, 5, 5, 6, 7, 7, 8, 9, 10),
  Age = c(29, 35, 19, 45, 65, 50, 77, 18, 85, 96, 77),
  delta=c (-9, -4, -3, -1, 4, -1, -1, -3, 6, 3, 3),
  D = c(0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1),
  Y = c(10, 5, 4, 6, 5, 7, 8, 10, 8, 9, 10)
)
##ATE
ate <- mean(data$delta)
ate

## [1] -0.5454545

##ATT
att <- mean(data$delta[data$D==1])
att

## [1] 4

print(data)

##    patient Y0 Y1 Age delta D  Y
## 1        1 10  1  29    -9 0 10
## 2        2  5  1  35    -4 0  5
## 3        3  4  1  19    -3 0  4
## 4        4  6  5  45    -1 0  6
## 5        5  1  5  65     4 1  5
## 6        6  7  6  50    -1 0  7
## 7        7  8  7  77    -1 0  8
## 8        8 10  7  18    -3 0 10
## 9        9  2  8  85     6 1  8
## 10      10  6  9  96     3 1  9
## 11      11  7 10  77     3 1 10

cat("Average Treatment Effect (ATE):", ate, "\n")

## Average Treatment Effect (ATE): -0.5454545

cat("Average Treatment Effect for the Treated (ATT):", att, "\n")

## Average Treatment Effect for the Treated (ATT): 4

Answer: [Average Treatment Effect (ATE): -0.5454545 Average Treatment Effect for the Treated (ATT): 4 A positive ATT (4) suggests that ventilators are beneficial treatment, a negative ATE suggests that when considering the entire population, the ventilators treatment might have negative inflence. ‘explain under which conditions might [SUTVA] be violated’ when people received treatment, and cut transmission of virus, the decrease the chance of virus exposure.]

Exercise 2

Question: Calculate the simple difference in outcomes (SDO), showing the details of your calculation. Is the SDO a good estimation for the ATE? Finally, check whether the SDO is equal to the sum of the ATT and the selection bias, \(E[Y(0)|T=1] - E[Y(0)|T=0]\).

# Write your code here
library(dplyr)


# Calculate the Simple Difference in Outcomes (SDO)
treat_mean <- mean(data$Y[data$D==1])
treat_mean

## [1] 8

untreated_mean <- mean(data$Y[data$D==0])
untreated_mean

## [1] 7.142857

SDO <- treat_mean - untreated_mean
SDO

## [1] 0.8571429

# Calculate Average Treatment Effect (ATE)
ATE <- mean(data$delta)
ATE

## [1] -0.5454545

##selection bias
T1 <- mean(data$Y0[data$D == 1])
T0 <- mean(data$Y0[data$D == 0])

# Selection Bias
selection_bias <- T1 - T0
selection_bias

## [1] -3.142857

# Calculate Average Treatment Effect for the Treated (ATT)
ATT <- mean(data$delta[data$D == 1])
ATT

## [1] 4

# Calculate selection bias
E_Y0_T1 <- mean(data$Y0[data$D == 1])
E_Y0_T0 <- mean(data$Y0[data$D == 0])
selection_bias <- E_Y0_T1 - E_Y0_T0

# Verify that SDO = ATT + Selection Bias
SDO_check <- ATT + selection_bias
SDO_check

## [1] 0.8571429

Answer: [SDO=0.8571429 ATE=-0.5454545 ATT=4 selection bias=-3.142857 No, SDO is not a good estimator for ATE in this context. The SDO (0.8571) and ATE (-0.5454) are significantly different, indicating that SDO does not capture the true average treatment effect across the entire population. SDO is equal to the sum of the ATT and the selection bias]

Exercise 3

Question: Compare the treatment effect for both groups: for those treated with a ventilator and for those treated with bedrest. What explains the difference in the average effect? Now compare all four measures of effects. What are the advantages and disadvantages of each? Is the ATE equal to the mean of the ATU and the ATT? Why or why not?

# Write your code here
ATE <- mean(data$delta)

# Calculate Average Treatment Effect for the Treated (ATT)
ATT <- mean(data$delta[data$D == 1])

# Calculate Average Treatment Effect for the Untreated (ATU)
ATU <- mean(data$delta[data$D == 0])
 
##SDO
sdo <- mean(data$Y1[data$D == 1]) - mean(data$Y0[data$D == 0])
sdo

## [1] 0.8571429

##output results
cat("ATE: ", ATE, "\n")

## ATE:  -0.5454545

cat("ATT: ", ATT, "\n")

## ATT:  4

cat("ATU : ", ATU , "\n")

## ATU :  -3.142857

cat("SDO: ", sdo, "\n")

## SDO:  0.8571429

Answer: [ATE: -0.5454545 ATT: 4 ATU : -3.142857 SDO: 0.8571429 ATE is not equal to mean of ATT and ATU differences of treatment effects could be due to the diffent demographic characteristics of included groups. SDO did not consider confounders effect.]

Using regression to estimate effects

The following exercises demonstrate that regression is a useful tool to estimate average outcomes and treatment effects in the different groups. Notice that in contrast to the role that regressions play in traditional statistics, here standard errors and significance are not of primary concern. Instead, we are interested in using regression to calculate average effects proper.

Exercise 4

Question: Calculate the outcome, conditional on getting the bedrest treatment \(\mathbb{E}[Y|D=0]\). Now estimate the following regression, comparing the coefficients \(\alpha\) and \(\delta\) to the statistics you’ve previously calculated. What did you find? How would you explain these finding?

# Write your code here
##calculate expected outcome {E}[Y|D=0]
E_Y0 <- mean(data$Y[data$D==0])
cat("Expected outcome for bedrest treatment E[Y|D=0]:", E_Y0,"\n")

## Expected outcome for bedrest treatment E[Y|D=0]: 7.142857

##estimate the regression
regression <- lm(Y~D, data=data)
summary(regression)

## 
## Call:
## lm(formula = Y ~ D, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.143 -1.643  0.000  1.500  2.857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.1429     0.8624   8.282 1.68e-05 ***
## D             0.8571     1.4302   0.599    0.564    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.282 on 9 degrees of freedom
## Multiple R-squared:  0.03838,    Adjusted R-squared:  -0.06847 
## F-statistic: 0.3592 on 1 and 9 DF,  p-value: 0.5637

Answer: [1. Expected outcome for bedrest treatment E[Y|D=0]: 7.142857 2. Treatment effect (delta) from regression: 0.8571429 3. The intercept from the regression should match the calculated E[Y|D=0] if the model is correctly specified. 4. The coefficient of D (delta) represents the average treatment effect (ATE) assuming linear effects and no confounding. 5. The treatment effect (delta) in the regression should be scrutinized for assumptions of linearity and equal variance across treatment groups.]

Exercise 5

Question: Now estimate the same regression, but this time, controlling for age, again comparing, the coefficient \(\delta\) to the statistics you’ve previously calculated. What did you find? How do you explain these results?

# Write your code here
##toncrol age
regression2 <- lm(Y ~ D + Age, data = data)
summary(regression2)

## 
## Call:
## lm(formula = Y ~ D + Age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7390 -1.6630 -0.0858  1.3839  3.2811 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  6.35543    1.90736   3.332   0.0104 *
## D            0.01419    2.33997   0.006   0.9953  
## Age          0.02019    0.04309   0.469   0.6519  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.388 on 8 degrees of freedom
## Multiple R-squared:  0.06407,    Adjusted R-squared:  -0.1699 
## F-statistic: 0.2738 on 2 and 8 DF,  p-value: 0.7673

Answer: [1. treatment effect(regression model controlling age) is 0.01419 when control age, number of D shows treatment effect difference across patients, age is a confounder ]

Exercise 6

Question: Estimate the following three regression models. The first model is the same as the one above. The second equation is the auxiliary regression of \(D\) onto \(X_{age}\). The third equation regresses \(Y\) onto \(\tilde{D}\) which is the residual from the second equation. Compare the coefficient on \(D\) from the first equation to the coefficient on \(\tilde{D}\) in the third equation. What does this tell you about how to interpret multivariate regressions?

# Write your code here
##regression2 Y~D+age
regression2

## 
## Call:
## lm(formula = Y ~ D + Age, data = data)
## 
## Coefficients:
## (Intercept)            D          Age  
##     6.35543      0.01419      0.02019

##auxiliary regression  D~age
regression3 <- lm(D~Age, data=data)
summary(regression3)

## 
## Call:
## lm(formula = D ~ Age, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68663 -0.16289  0.04442  0.17433  0.48323 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.403323   0.236116  -1.708  0.12178   
## Age          0.014155   0.003925   3.606  0.00569 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3401 on 9 degrees of freedom
## Multiple R-squared:  0.591,  Adjusted R-squared:  0.5455 
## F-statistic:    13 on 1 and 9 DF,  p-value: 0.005694

residuals3 <- residuals(regression3)

##third equation regresses $Y$ onto $\tilde{D}$ which is the residual from the second equation
data1 <- mutate(data,D_t= residuals3)
regerssion4 <- lm(Y~D_t, data=data1)
summary(regerssion4)

## 
## Call:
## lm(formula = Y ~ D_t, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4565 -1.9522  0.5426  2.0429  2.5456 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.45455    0.70156  10.626 2.15e-06 ***
## D_t          0.01419    2.28040   0.006    0.995    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.327 on 9 degrees of freedom
## Multiple R-squared:  4.302e-06,  Adjusted R-squared:  -0.1111 
## F-statistic: 3.872e-05 on 1 and 9 DF,  p-value: 0.9952

Answer: [coefficient of regression 2(model1) only adjust for age, the regression 4(model 3) isoloates the component of D unrerlated with age. when compre coefficient of model 1 and model 3, it shows importance of confounders when estimating treatment effects. The disparity between these coefficients illustrates the influence of omitted variable bias when Age is not considered]