Standardization and G formula

The standardized mean is the weighted average of the conditional means in each stratum, and the weights are the probability of occurrence Pr [L = l ] in each stratum. But in high-dimensional data—such as our smoking cessation example—it is impossible for us to estimate E[Y|A=1,C =0,L=l] in a nonparametric way. (C=0 means no censor)

That is, when calculating the standardized mean, the stratum with the largest number of people has the most weight.

Estimated prediction mean by model

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## here() starts at C:/Users/hed2/Desktop

# some preprocessing of the data
nhefs$cens <- ifelse(is.na(nhefs$wt82), 1, 0)

fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + as.factor(education)
           + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
           + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
           + wt71 + I(wt71*wt71) + qsmk*smokeintensity, data=nhefs)
summary(fit)

## 
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) + 
##     as.factor(education) + smokeintensity + I(smokeintensity * 
##     smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71) + qsmk * smokeintensity, 
##     data = nhefs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -42.056   -4.171   -0.343    3.891   44.606  
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -1.5881657  4.3130359  -0.368 0.712756    
## qsmk                                2.5595941  0.8091486   3.163 0.001590 ** 
## sex                                -1.4302717  0.4689576  -3.050 0.002328 ** 
## race                                0.5601096  0.5818888   0.963 0.335913    
## age                                 0.3596353  0.1633188   2.202 0.027809 *  
## I(age * age)                       -0.0061010  0.0017261  -3.534 0.000421 ***
## as.factor(education)2               0.7904440  0.6070005   1.302 0.193038    
## as.factor(education)3               0.5563124  0.5561016   1.000 0.317284    
## as.factor(education)4               1.4915695  0.8322704   1.792 0.073301 .  
## as.factor(education)5              -0.1949770  0.7413692  -0.263 0.792589    
## smokeintensity                      0.0491365  0.0517254   0.950 0.342287    
## I(smokeintensity * smokeintensity) -0.0009907  0.0009380  -1.056 0.291097    
## smokeyrs                            0.1343686  0.0917122   1.465 0.143094    
## I(smokeyrs * smokeyrs)             -0.0018664  0.0015437  -1.209 0.226830    
## as.factor(exercise)1                0.2959754  0.5351533   0.553 0.580298    
## as.factor(exercise)2                0.3539128  0.5588587   0.633 0.526646    
## as.factor(active)1                 -0.9475695  0.4099344  -2.312 0.020935 *  
## as.factor(active)2                 -0.2613779  0.6845577  -0.382 0.702647    
## wt71                                0.0455018  0.0833709   0.546 0.585299    
## I(wt71 * wt71)                     -0.0009653  0.0005247  -1.840 0.066001 .  
## qsmk:smokeintensity                 0.0466628  0.0351448   1.328 0.184463    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 53.5683)
## 
##     Null deviance: 97176  on 1565  degrees of freedom
## Residual deviance: 82763  on 1545  degrees of freedom
##   (63 observations deleted due to missingness)
## AIC: 10701
## 
## Number of Fisher Scoring iterations: 2

nhefs$predicted.meanY <- predict(fit, nhefs) # conditional E[Y | A=1,C =0,L= l]

nhefs[which(nhefs$seqn==24770), c("predicted.meanY", "qsmk", "sex", 
                                 "race", "age", "education", 
                                 "smokeintensity", "smokeyrs", "exercise",
                                 "active", "wt71")]

##      predicted.meanY qsmk sex race age education smokeintensity smokeyrs
## 1582       0.3421569    0   0    0  26         4             15       12
##      exercise active   wt71
## 1582        1      0 111.58

summary(nhefs$predicted.meanY[nhefs$cens==0])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -10.876   1.116   3.042   2.638   4.511   9.876

summary(nhefs$wt82_71[nhefs$cens==0])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -41.280  -1.478   2.604   2.638   6.690  48.538

Our goal is to estimate when A=a, ∑E[Y|A=a,C=0,L =l]*Pr[L l]. The summation in L should be written as integral notation, since the variables in L are essentially continuous and cannot be represented by simple probabilities. But whatever the notation, we have now estimated E [Y | A = a, C =0, L = l ] for all combinations of A and L. Next we need to normalize these means according to the value l in L.

Standardization of outcome means according to the distribution of confounding variables

In principle, we could compute this probability non parametrically by simply dividing the number of people with L = l by the total number of people.

Next we describe how to estimate in the treatment group ( A = 1 ) and the non-treatment group ( A = 0 ).

This method has four steps: data augmentation, outcome regression, prediction, and standardization.

We need to add the second and third plates to the original data, fit the model of E[Y|A=a,C =0,L= l], and then generate the predicted value. The mean of the predicted values in the second panel is 1.66, the value in the third panel is 5.18, and the causal effect E(Ya1,c0)-E(Ya0,c0) is 3.5.

# create a dataset with 3 copies of each subject
nhefs$interv <- -1 # 1st copy: equal to original one
# original one

interv0 <- nhefs # 2nd copy: treatment set to 0, outcome to missing 
interv0$interv <- 0 # a=0
interv0$qsmk <- 0 #  
interv0$wt82_71 <- NA

interv1 <- nhefs # 3rd copy: treatment set to 1, outcome to missing
interv1$interv <- 1 #  
interv1$qsmk <- 1 # A=1
interv1$wt82_71 <- NA #

onesample <- rbind(nhefs, interv0, interv1) # combining datasets # by rows

# linear model to estimate mean outcome conditional on treatment and confounders . parameters are estimated using original observations only (nhefs)

std <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age)
           + as.factor(education) + smokeintensity 
           + I(smokeintensity*smokeintensity) + smokeyrs 
           + I(smokeyrs*smokeyrs) + as.factor(exercise) 
           + as.factor(active) + wt71 + I(wt71*wt71) + I(qsmk*smokeintensity), 
           data=onesample)# modeling
summary(std)

## 
## Call:
## glm(formula = wt82_71 ~ qsmk + sex + race + age + I(age * age) + 
##     as.factor(education) + smokeintensity + I(smokeintensity * 
##     smokeintensity) + smokeyrs + I(smokeyrs * smokeyrs) + as.factor(exercise) + 
##     as.factor(active) + wt71 + I(wt71 * wt71) + I(qsmk * smokeintensity), 
##     data = onesample)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -42.056   -4.171   -0.343    3.891   44.606  
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -1.5881657  4.3130359  -0.368 0.712756    
## qsmk                                2.5595941  0.8091486   3.163 0.001590 ** 
## sex                                -1.4302717  0.4689576  -3.050 0.002328 ** 
## race                                0.5601096  0.5818888   0.963 0.335913    
## age                                 0.3596353  0.1633188   2.202 0.027809 *  
## I(age * age)                       -0.0061010  0.0017261  -3.534 0.000421 ***
## as.factor(education)2               0.7904440  0.6070005   1.302 0.193038    
## as.factor(education)3               0.5563124  0.5561016   1.000 0.317284    
## as.factor(education)4               1.4915695  0.8322704   1.792 0.073301 .  
## as.factor(education)5              -0.1949770  0.7413692  -0.263 0.792589    
## smokeintensity                      0.0491365  0.0517254   0.950 0.342287    
## I(smokeintensity * smokeintensity) -0.0009907  0.0009380  -1.056 0.291097    
## smokeyrs                            0.1343686  0.0917122   1.465 0.143094    
## I(smokeyrs * smokeyrs)             -0.0018664  0.0015437  -1.209 0.226830    
## as.factor(exercise)1                0.2959754  0.5351533   0.553 0.580298    
## as.factor(exercise)2                0.3539128  0.5588587   0.633 0.526646    
## as.factor(active)1                 -0.9475695  0.4099344  -2.312 0.020935 *  
## as.factor(active)2                 -0.2613779  0.6845577  -0.382 0.702647    
## wt71                                0.0455018  0.0833709   0.546 0.585299    
## I(wt71 * wt71)                     -0.0009653  0.0005247  -1.840 0.066001 .  
## I(qsmk * smokeintensity)            0.0466628  0.0351448   1.328 0.184463    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 53.5683)
## 
##     Null deviance: 97176  on 1565  degrees of freedom
## Residual deviance: 82763  on 1545  degrees of freedom
##   (3321 observations deleted due to missingness)
## AIC: 10701
## 
## Number of Fisher Scoring iterations: 2

onesample$predicted_meanY <- predict(std, onesample) # prediction

# estimate mean outcome in each of the groups interv=0, and interv=1
# this mean outcome is a weighted average of the mean outcomes in each combination of values of treatment and confounders, that is, the standardized outcome
mean(onesample[which(onesample$interv==-1),]$predicted_meanY) # calculate

## [1] 2.56319

mean(onesample[which(onesample$interv==0),]$predicted_meanY)

## [1] 1.660267

mean(onesample[which(onesample$interv==1),]$predicted_meanY)

## [1] 5.178841

Bootstrap method to calculate the confidence interval

#install.packages("boot") # install package if required
library(boot) # 

# function to calculate difference in means
# indices -i by row，frequencies-f ，weights-w   ，how to smaple

standardization <- function(data, indices) {
  # create a dataset with 3 copies of each subject
  d <- data[indices,] # 1st copy: equal to original one，
  d$interv <- -1
  d0 <- d # 2nd copy: treatment set to 0, outcome to missing
  d0$interv <- 0
  d0$qsmk <- 0
  d0$wt82_71 <- NA
  d1 <- d # 3rd copy: treatment set to 1, outcome to missing
  d1$interv <- 1
  d1$qsmk <- 1
  d1$wt82_71 <- NA
  d.onesample <- rbind(d, d0, d1) # combining datasets
  
  # parameter estimates are used to predict mean outcome for observations with set treatment (interv=0 and interv=1)
  fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age*age) + 
               as.factor(education) + smokeintensity + 
               I(smokeintensity*smokeintensity) + smokeyrs + I(smokeyrs*smokeyrs) +
               as.factor(exercise) + as.factor(active) + wt71 + I(wt71*wt71), 
             data=d.onesample)
  
  d.onesample$predicted_meanY <- predict(fit, d.onesample)
  
  # estimate mean outcome in each of the groups interv=-1, interv=0, and interv=1
  return(c(mean(d.onesample$predicted_meanY[d.onesample$interv==-1]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==0]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==1]),
           mean(d.onesample$predicted_meanY[d.onesample$interv==1])-
             mean(d.onesample$predicted_meanY[d.onesample$interv==0])))
}

# bootstrap
results <- boot(data=nhefs, statistic=standardization, R=5)

# generating confidence intervals
se <- c(sd(results$t[,1]), sd(results$t[,2]), 
        sd(results$t[,3]), sd(results$t[,4]))
mean <- results$t0 # t0 ?
ll <- mean - qnorm(0.975)*se
ul <- mean + qnorm(0.975)*se

bootstrap <- data.frame(cbind(c("Observed", "No Treatment", "Treatment", 
                                "Treatment - No Treatment"), mean, se, ll, ul))
bootstrap

##                         V1             mean                se                ll
## 1                 Observed 2.56188497106103  0.33474575441222  1.90579534843539
## 2             No Treatment 1.65212306626746 0.391679078638321 0.884446178638516
## 3                Treatment 5.11474489549346 0.613098099595701  3.91309470129594
## 4 Treatment - No Treatment 3.46262182922601 0.728545662975968  2.03469856870025
##                 ul
## 1 3.21797459368667
## 2  2.4197999538964
## 3 6.31639508969099
## 4 4.89054508975176

ipw and standardization is suitbale for simple randomization control trial.

The inverse probability weighted mean and the standardized mean are equal only when the model is not used, otherwise they are not necessarily equal.

The validity of causal inference depends on the following conditions: interchangeability, positivity, and consistency. It does not involve measurement errors and model errors.

However, effect estimates from observational studies are always severely flawed. The problem is that no one can guarantee that these conditions will be perfectly established. Unmeasured confounding variables, nonoverlapping confounding variable distributions, poorly specified interventions, mismeasured variables, and misspecified models are all common problems we encounter in effect estimation. For example, we can try different models, but it is impossible to adjust for variables that are not measured.

The important problem with causal inference is that we cannot observe all the counterfactual data, so we use the above assumptions to approximate these data. The more we deviate from these assumptions, the more our effect estimates deviate from the true causal effect. Therefore, causal inferences drawn from observational studies must be viewed with caution.

Standardization and G formula

2023-05-04

Standardization and G formula