The standardized mean is the weighted average of the conditional means in each stratum, and the weights are the probability of occurrence Pr [L = l ] in each stratum. But in high-dimensional data—such as our smoking cessation example—it is impossible for us to estimate E[Y|A=1,C =0,L=l] in a nonparametric way. (C=0 means no censor)
That is, when calculating the standardized mean, the stratum with the largest number of people has the most weight.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## here() starts at C:/Users/hed2/Desktop
# some preprocessing of the data
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)
fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)
+ smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
+ wt71 + I(wt71*wt71) + qsmk*smokeintensity, data=nhefs)
summary(fit)
##
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) +
## as.factor(education) + smokeintensity + I(smokeintensity *
## smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71) + qsmk * smokeintensity,
## data = nhefs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -42.056 -4.171 -0.343 3.891 44.606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5881657 4.3130359 -0.368 0.712756
## qsmk 2.5595941 0.8091486 3.163 0.001590 **
## sex -1.4302717 0.4689576 -3.050 0.002328 **
## race 0.5601096 0.5818888 0.963 0.335913
## age 0.3596353 0.1633188 2.202 0.027809 *
## I(age * age) -0.0061010 0.0017261 -3.534 0.000421 ***
## as.factor(education)2 0.7904440 0.6070005 1.302 0.193038
## as.factor(education)3 0.5563124 0.5561016 1.000 0.317284
## as.factor(education)4 1.4915695 0.8322704 1.792 0.073301 .
## as.factor(education)5 -0.1949770 0.7413692 -0.263 0.792589
## smokeintensity 0.0491365 0.0517254 0.950 0.342287
## I(smokeintensity * smokeintensity) -0.0009907 0.0009380 -1.056 0.291097
## smokeyrs 0.1343686 0.0917122 1.465 0.143094
## I(smokeyrs * smokeyrs) -0.0018664 0.0015437 -1.209 0.226830
## as.factor(exercise)1 0.2959754 0.5351533 0.553 0.580298
## as.factor(exercise)2 0.3539128 0.5588587 0.633 0.526646
## as.factor(active)1 -0.9475695 0.4099344 -2.312 0.020935 *
## as.factor(active)2 -0.2613779 0.6845577 -0.382 0.702647
## wt71 0.0455018 0.0833709 0.546 0.585299
## I(wt71 * wt71) -0.0009653 0.0005247 -1.840 0.066001 .
## qsmk:smokeintensity 0.0466628 0.0351448 1.328 0.184463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 53.5683)
##
## Null deviance: 97176 on 1565 degrees of freedom
## Residual deviance: 82763 on 1545 degrees of freedom
## (63 observations deleted due to missingness)
## AIC: 10701
##
## Number of Fisher Scoring iterations: 2
nhefs$predicted.meanY <- predict(fit, nhefs) # conditional E[Y | A=1,C =0,L= l]
nhefs[which(nhefs$seqn==24770), c("predicted.meanY", "qsmk", "sex",
"race", "age", "education",
"smokeintensity", "smokeyrs", "exercise",
"active", "wt71")]
## predicted.meanY qsmk sex race age education smokeintensity smokeyrs
## 1582 0.3421569 0 0 0 26 4 15 12
## exercise active wt71
## 1582 1 0 111.58
summary(nhefs$predicted.meanY[nhefs$cens==0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.876 1.116 3.042 2.638 4.511 9.876
summary(nhefs$wt82_71[nhefs$cens==0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -41.280 -1.478 2.604 2.638 6.690 48.538
Our goal is to estimate when A=a, ∑E[Y|A=a,C=0,L =l]*Pr[L l]. The summation in L should be written as integral notation, since the variables in L are essentially continuous and cannot be represented by simple probabilities. But whatever the notation, we have now estimated E [Y | A = a, C =0, L = l ] for all combinations of A and L. Next we need to normalize these means according to the value l in L.
In principle, we could compute this probability non parametrically by simply dividing the number of people with L = l by the total number of people.
Next we describe how to estimate in the treatment group ( A = 1 ) and the non-treatment group ( A = 0 ).
This method has four steps: data augmentation, outcome regression, prediction, and standardization.
We need to add the second and third plates to the original data, fit the model of E[Y|A=a,C =0,L= l], and then generate the predicted value. The mean of the predicted values in the second panel is 1.66, the value in the third panel is 5.18, and the causal effect E(Ya1,c0)-E(Ya0,c0) is 3.5.
# create a dataset with 3 copies of each subject
nhefs$interv <- -1 # 1st copy: equal to original one
# original one
interv0 <- nhefs # 2nd copy: treatment set to 0, outcome to missing
interv0$interv <- 0 # a=0
interv0$qsmk <- 0 #
interv0$wt82_71 <- NA
interv1 <- nhefs # 3rd copy: treatment set to 1, outcome to missing
interv1$interv <- 1 #
interv1$qsmk <- 1 # A=1
interv1$wt82_71 <- NA #
onesample <- rbind(nhefs, interv0, interv1) # combining datasets # by rows
# linear model to estimate mean outcome conditional on treatment and confounders . parameters are estimated using original observations only (nhefs)
std <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age)
+ as.factor(education) + smokeintensity
+ I(smokeintensity*smokeintensity) + smokeyrs
+ I(smokeyrs*smokeyrs) + as.factor(exercise)
+ as.factor(active) + wt71 + I(wt71*wt71) + I(qsmk*smokeintensity),
data=onesample)# modeling
summary(std)
##
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) +
## as.factor(education) + smokeintensity + I(smokeintensity *
## smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) +
## as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity),
## data = onesample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -42.056 -4.171 -0.343 3.891 44.606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5881657 4.3130359 -0.368 0.712756
## qsmk 2.5595941 0.8091486 3.163 0.001590 **
## sex -1.4302717 0.4689576 -3.050 0.002328 **
## race 0.5601096 0.5818888 0.963 0.335913
## age 0.3596353 0.1633188 2.202 0.027809 *
## I(age * age) -0.0061010 0.0017261 -3.534 0.000421 ***
## as.factor(education)2 0.7904440 0.6070005 1.302 0.193038
## as.factor(education)3 0.5563124 0.5561016 1.000 0.317284
## as.factor(education)4 1.4915695 0.8322704 1.792 0.073301 .
## as.factor(education)5 -0.1949770 0.7413692 -0.263 0.792589
## smokeintensity 0.0491365 0.0517254 0.950 0.342287
## I(smokeintensity * smokeintensity) -0.0009907 0.0009380 -1.056 0.291097
## smokeyrs 0.1343686 0.0917122 1.465 0.143094
## I(smokeyrs * smokeyrs) -0.0018664 0.0015437 -1.209 0.226830
## as.factor(exercise)1 0.2959754 0.5351533 0.553 0.580298
## as.factor(exercise)2 0.3539128 0.5588587 0.633 0.526646
## as.factor(active)1 -0.9475695 0.4099344 -2.312 0.020935 *
## as.factor(active)2 -0.2613779 0.6845577 -0.382 0.702647
## wt71 0.0455018 0.0833709 0.546 0.585299
## I(wt71 * wt71) -0.0009653 0.0005247 -1.840 0.066001 .
## I(qsmk * smokeintensity) 0.0466628 0.0351448 1.328 0.184463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 53.5683)
##
## Null deviance: 97176 on 1565 degrees of freedom
## Residual deviance: 82763 on 1545 degrees of freedom
## (3321 observations deleted due to missingness)
## AIC: 10701
##
## Number of Fisher Scoring iterations: 2
onesample$predicted_meanY <- predict(std, onesample) # prediction
# estimate mean outcome in each of the groups interv=0, and interv=1
# this mean outcome is a weighted average of the mean outcomes in each combination of values of treatment and confounders, that is, the standardized outcome
mean(onesample[which(onesample$interv==-1),]$predicted_meanY) # calculate
## [1] 2.56319
mean(onesample[which(onesample$interv==0),]$predicted_meanY)
## [1] 1.660267
mean(onesample[which(onesample$interv==1),]$predicted_meanY)
## [1] 5.178841
#install.packages("boot") # install package if required
library(boot) #
# function to calculate difference in means
# indices -i by row,frequencies-f ,weights-w ,how to smaple
standardization <- function(data, indices) {
# create a dataset with 3 copies of each subject
d <- data[indices,] # 1st copy: equal to original one,
d$interv <- -1
d0 <- d # 2nd copy: treatment set to 0, outcome to missing
d0$interv <- 0
d0$qsmk <- 0
d0$wt82_71 <- NA
d1 <- d # 3rd copy: treatment set to 1, outcome to missing
d1$interv <- 1
d1$qsmk <- 1
d1$wt82_71 <- NA
d.onesample <- rbind(d, d0, d1) # combining datasets
# parameter estimates are used to predict mean outcome for observations with set treatment (interv=0 and interv=1)
fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) +
as.factor(education) + smokeintensity +
I(smokeintensity*smokeintensity) + smokeyrs + I(smokeyrs*smokeyrs) +
as.factor(exercise) + as.factor(active) + wt71 + I(wt71*wt71),
data=d.onesample)
d.onesample$predicted_meanY <- predict(fit, d.onesample)
# estimate mean outcome in each of the groups interv=-1, interv=0, and interv=1
return(c(mean(d.onesample$predicted_meanY[d.onesample$interv==-1]),
mean(d.onesample$predicted_meanY[d.onesample$interv==0]),
mean(d.onesample$predicted_meanY[d.onesample$interv==1]),
mean(d.onesample$predicted_meanY[d.onesample$interv==1])-
mean(d.onesample$predicted_meanY[d.onesample$interv==0])))
}
# bootstrap
results <- boot(data=nhefs, statistic=standardization, R=5)
# generating confidence intervals
se <- c(sd(results$t[,1]), sd(results$t[,2]),
sd(results$t[,3]), sd(results$t[,4]))
mean <- results$t0 # t0 ?
ll <- mean - qnorm(0.975)*se
ul <- mean + qnorm(0.975)*se
bootstrap <- data.frame(cbind(c("Observed", "No Treatment", "Treatment",
"Treatment - No Treatment"), mean, se, ll, ul))
bootstrap
## V1 mean se ll
## 1 Observed 2.56188497106103 0.33474575441222 1.90579534843539
## 2 No Treatment 1.65212306626746 0.391679078638321 0.884446178638516
## 3 Treatment 5.11474489549346 0.613098099595701 3.91309470129594
## 4 Treatment - No Treatment 3.46262182922601 0.728545662975968 2.03469856870025
## ul
## 1 3.21797459368667
## 2 2.4197999538964
## 3 6.31639508969099
## 4 4.89054508975176
ipw and standardization is suitbale for simple randomization control trial.
The inverse probability weighted mean and the standardized mean are equal only when the model is not used, otherwise they are not necessarily equal.
The validity of causal inference depends on the following conditions: interchangeability, positivity, and consistency. It does not involve measurement errors and model errors.
However, effect estimates from observational studies are always severely flawed. The problem is that no one can guarantee that these conditions will be perfectly established. Unmeasured confounding variables, nonoverlapping confounding variable distributions, poorly specified interventions, mismeasured variables, and misspecified models are all common problems we encounter in effect estimation. For example, we can try different models, but it is impossible to adjust for variables that are not measured.
The important problem with causal inference is that we cannot observe all the counterfactual data, so we use the above assumptions to approximate these data. The more we deviate from these assumptions, the more our effect estimates deviate from the true causal effect. Therefore, causal inferences drawn from observational studies must be viewed with caution.