BIOS 623 (Longitudinal Data Analysis)

Class 9 (Oct 28, 2021)

Outline: missing data concept and MI for cross-sectional data

Example: CHAIN Data

setwd(“~/Desktop/BIOS234/Lab9_10.28”) Messeri P, Lee G, Abramson DM, et al. Antiretroviral therapy and declining AIDS mortality in New York City. MEDICAL CARE. 2003 41(4): 512-521.

Example: CHAIN Data (con’t) – read into R (from SPSS data)

library("haven")
setwd("~/Desktop/BIOS234/Lab9_10.28")
chain <-  read_spss(file = "CHAIN.sav")
dim(chain)
## [1] 532   8
head(chain)
## # A tibble: 6 × 8
##      id log_virus   age income healthy mental damage treatment
##   <dbl>     <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>     <dbl>
## 1     1     10.0     29      5    34.2      1      4         1
## 2     2      0       38      5    58.1      0      5         2
## 3     3     NA       47      6    21.9      0      1         1
## 4     4     NA       53      2    18.7      0      5         0
## 5     5      7.09    42      2    54.1      0      4         1
## 6     6     NA       49      4    45.6      0      4         2

Example: CHAIN Data (con’t) – the 7 variables (n=532)

Purpose of Modeling

For the sake of today’s topic, we set to assess treatment’s impact on viral load’s (i.e., confounding model): \[ Y_{} = \beta_{0} + \beta_{1}S + e_{}, \quad \quad \text{(univariate model)} \] \[ Y_{} = \beta_{0} + \beta_{1}S + \beta_{c_{1}}C_{1} + \ldots + \beta_{c_{p}}C_{p} + e_{}, \quad \quad \text{(multivariate model)} \]

#unadjusted model, i.e., univariate model
univ.fit <- lm(log_virus ~ factor(treatment), data = chain)

#adjusted model, multivariate model
mult.fit <- lm(log_virus ~ factor(treatment) + age + 
           income + healthy + mental, data = chain)

Q: why we exclude the “damage” variable in the multivariate model?

Results univariate and multivariate model

Pay attention to the “Observations” under model (1) and (2).

Missing data pattern

Using “> md.pattern(chain, plot=FALSE))”, we could generate a missing pattern, as below:

##     id age healthy mental treatment income damage log_virus    
## 335  1   1       1      1         1      1      1         1   0
## 122  1   1       1      1         1      1      1         0   1
## 9    1   1       1      1         1      1      0         1   1
## 28   1   1       1      1         1      1      0         0   2
## 8    1   1       1      1         1      0      1         1   1
## 4    1   1       1      1         1      0      1         0   2
## 1    1   1       1      1         1      0      0         1   2
## 1    1   1       1      1         1      0      0         0   3
## 24   1   0       0      0         0      0      0         0   7
##      0  24      24     24        24     38     63       179 376

Missing data: what and why?

In general,

Three broad categories of reasons:

Missing Data Mechanism

Little and Rubin (2002) provides taxonomy of missingness mechanisms:

Missing completely at random (MCAR)

Little RJA, Rubin DB. Statistical Analysis with Missing Data, 2nd Edition. Wiley: New York, 2002.

MCAR

Missing at random (MAR)

Missing not at random (MNAR)

We can not empirically distinguish between MAR and MNAR with data !!!

Simple fixes

Imputation by filling with mean

Imputation by regression

Multiple Imputation (MI)

MI procedure (and rationale):

MI Steps

Examine missing data among variables of interest (exploratory data analyses and missing data patterns, as we did for the CHAIN data); identify auxiliary variables as needed

  1. Imputation:

    • Specify models for the relevant variables
    • For the models to predict missing values
    • Repeat M times to create M complete datasets
  2. Analysis: Perform your desired analysis in each dataset

  3. Pooling: Combine estimates across datasets using common rules

Remark:

MI: the imputation step

An analyst must have to specify a model for the data. These vary across software packages.

In R’s mice package, conditional model method was implemented (Van Buuren 2018).

Van Burren S. (2018), Flexible imputation of missing data (2nd edition). CRC Press.(online version: https://stefvanbuuren.name/fimd/)

MI: the pooling step

Suppose we want to inference a regression coefficient, \(\beta\)

We obtain estimates \(\hat{\beta}_{m}\) in each of the M datasets as well as standard errors, \(s_{1}, \ldots, s_{M}\)

To obtain an overall point estimate, we average over the estimates from the separate imputed datasets: \[\hat{\beta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}_{m}\] A final variance estimate \(V_{\beta}\) reflects variation within and between imputations: \[V_{\beta} = V_{W} + (1 + \frac{1}{M})V_{B} + \text{correction factor}\]

These rules work for any parameter of interest

MI in action for CHAIN data (FCS based): imputation

# Impute the data, 5 copies 
names(chain)
## [1] "id"        "log_virus" "age"       "income"    "healthy"   "mental"   
## [7] "damage"    "treatment"
data.mice <- mice(chain, 
               m=5, 
               meth=c("","pmm","pmm","pmm", "pmm", "logreg", "polr", "polr"),
               printFlag=FALSE)

Remarks:

Exam the imputed data

One could exam the imputed data, using summary() etc, the following web link show few of such examples

https://datascienceplus.com/imputing-missing-data-with-r-mice-package/

Exam the imputed data

stripplot(data.mice, pch = 20, cex = 1.2)

MI in action: analysis and pooling

mice.fit <- with(data.mice,
                 lm(log_virus~ 
                      factor(treatment) 
                    + age 
                    + income 
                    + healthy 
                    + mental))
result.mice <-  pool(mice.fit)

summary(result.mice);  summary(mult.fit)
##                 term    estimate  std.error statistic        df      p.value
## 1        (Intercept) 13.02161327 2.64236013  4.928024  7.926683 1.183718e-03
## 2 factor(treatment)1 -2.35199730 0.57577918 -4.084895 54.045555 1.469317e-04
## 3 factor(treatment)2 -2.30735376 0.53367738 -4.323499 48.232195 7.662921e-05
## 4                age -0.09365580 0.03374523 -2.775379 16.326459 1.332164e-02
## 5             income -0.36545526 0.12634010 -2.892631 25.965831 7.631477e-03
## 6            healthy -0.05047707 0.02712661 -1.860795  9.833850 9.289597e-02
## 7            mental1  1.48729757 0.54444313  2.731778 32.277835 1.012912e-02
## 
## Call:
## lm(formula = log_virus ~ factor(treatment) + age + income + healthy + 
##     mental, data = chain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.720 -3.798 -1.344  3.986 10.367 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        13.70070    1.85584   7.382 1.23e-12 ***
## factor(treatment)1 -2.39691    0.62267  -3.849 0.000142 ***
## factor(treatment)2 -2.21450    0.55999  -3.955 9.35e-05 ***
## age                -0.10147    0.03044  -3.334 0.000952 ***
## income             -0.36593    0.11825  -3.094 0.002137 ** 
## healthy            -0.05925    0.02041  -2.903 0.003941 ** 
## mental              1.21077    0.56034   2.161 0.031416 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.47 on 337 degrees of freedom
##   (188 observations deleted due to missingness)
## Multiple R-squared:  0.1484, Adjusted R-squared:  0.1332 
## F-statistic: 9.785 on 6 and 337 DF,  p-value: 6.018e-10

Pooling for complicate models

Note that mice (or other MI packages) might not support complex model.

As an example, with the CHAIN data, following mediation modelling can be conducted to assess the direct/indirect effect of ART treatment \(^{*}\):

\(^{*}\): See mediation R package for details for the syntax.

Pooling for complicate models (con’t)

Then, one could use the following coding schema/algorithm (using the mediation example above as prototype):

data_full <- complete(data.mice, action = “long”, include = TRUE)

theta <- numeric(20); theta.se <- numeric(20)

for (i in 1:20) {

(1): do complete data analysis on each imputed data

data1 <- data_full[data_full$.imp == i,]

med.fit <- lm(…, data=data1)

out.fit <- glm(…, data=data1)

mediation.m <- mediate(med.fit, out.fit, …, data=data1)

(2): obtaining the point estimate of interested parameter and SE \(^{*}\)

e.g., theta[i] <- mediation.m$coef[1]

e.g., theta.se[i] <- mediation.m$var[1]

}

Pooling the theta[1], …, theta[20] to get final theta (e.g., as the mean).

Pooling the theta.se[1], …, theta.se[20] (e.g., using Rubin’s formula) to get the final SE.

\(^{*}\): syntax depends on specific model.

Strength/weakness of MI

Strengths:

Weaknesses:

Notes on MI

Lab

Tasks for today’s lab:

tinytex::install_tinytex() render