Class 9 (Oct 28, 2021)
CHAIN data with confounding model
Missing data concept and handling
Multiple imputation (MI) method
The CHAIN dataset is a subset of the Community Health Advisory and Information Network (CHAIN) study.
This study is a longitudinal cohort study of people living with HIV in New York City, conducted by the Columbia University School of Public Health (Messeri et al. 2003).
The CHAIN cohort was recruited in 1994 from a large number of medical care and social service agencies serving people living with HIV in New York City. Cohort members were interviewed up to 8 times through 2002. A total of 532 CHAIN participants completed at least one interview at the 1st, 2nd, or 3rd round of interviews.
For simplicity, our analysis here discards the time aspect of the dataset and uses only the first round of the survey.
Messeri P, Lee G, Abramson DM, et al. Antiretroviral therapy and declining AIDS mortality in New York City. Medical Care. 2003;41(4):512-521.
library("haven")
setwd("~/Desktop/BIOS234/Lab9_10.28")
chain <- read_spss(file = "CHAIN.sav")
dim(chain)
## [1] 532   8
head(chain)
## # A tibble: 6 × 8
## id log_virus age income healthy mental damage treatment
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 10.0 29 5 34.2 1 4 1
## 2 2 0 38 5 58.1 0 5 2
## 3 3 NA 47 6 21.9 0 1 1
## 4 4 NA 53 2 18.7 0 5 0
## 5 5 7.09 42 2 54.1 0 4 1
## 6 6 NA 49 4 45.6 0 4 2
log_virus: log of the self-reported viral load level
age: age at the time of the interview
income: annual family income, recorded in 10 intervals
healthy: a continuous scale of physical health with a theoretical range between 0 and 100, where better health is associated with higher values
mental: a binary measure of poor mental health (1 = Yes, 0 = No)
damage: ordered interval for the CD4 count, an indicator of how much damage HIV (the virus) has caused to the immune system
treatment: a three-level ordered variable: 0 = not currently taking HAART (Highly Active Antiretroviral Therapy), 1 = taking HAART but nonadherent, 2 = taking HAART and adherent
For the sake of today’s topic, we set out to assess the treatment’s impact on viral load (i.e., a confounding model): \[ Y = \beta_{0} + \beta_{1}S + e, \quad \quad \text{(univariate model)} \] \[ Y = \beta_{0} + \beta_{1}S + \beta_{c_{1}}C_{1} + \ldots + \beta_{c_{p}}C_{p} + e, \quad \quad \text{(multivariate model)} \] where \(Y\) is log_virus, \(S\) is the treatment variable, and \(C_{1}, \ldots, C_{p}\) are potential confounders.
#unadjusted model, i.e., univariate model
univ.fit <- lm(log_virus ~ factor(treatment), data = chain)
#adjusted model, multivariate model
mult.fit <- lm(log_virus ~ factor(treatment) + age +
income + healthy + mental, data = chain)

Q: Why do we exclude the "damage" variable from the multivariate model?
Pay attention to the "Observations" counts under models (1) and (2).
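As a quick check of the sample sizes actually used (a small addition, not in the original handout):

nobs(univ.fit)   # observations used in the unadjusted model
nobs(mult.fit)   # fewer: rows with any missing covariate are dropped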
Using md.pattern(chain, plot = FALSE) from the mice package, we can generate the missing-data pattern table, as below:
## id age healthy mental treatment income damage log_virus
## 335 1 1 1 1 1 1 1 1 0
## 122 1 1 1 1 1 1 1 0 1
## 9 1 1 1 1 1 1 0 1 1
## 28 1 1 1 1 1 1 0 0 2
## 8 1 1 1 1 1 0 1 1 1
## 4 1 1 1 1 1 0 1 0 2
## 1 1 1 1 1 1 0 0 1 2
## 1 1 1 1 1 1 0 0 0 3
## 24 1 0 0 0 0 0 0 0 7
## 0 24 24 24 24 38 63 179 376
Each row is a distinct missing-data pattern (1 = observed, 0 = missing): the left margin counts the subjects with that pattern and the right margin counts the variables missing in it. The bottom row gives the number of missing values per variable; log_virus has the most (179), and 24 subjects are missing everything except id.
In general, there are three broad categories of reasons for missing data. Little and Rubin (2002) provide a taxonomy of missingness mechanisms:
Missing completely at random (MCAR)
A variable is missing completely at random if the probability of missingness is the same for all respondents
Respondents who have missing data are a random subset of the complete sample of subjects
The probability of missingness is not related to any subject characteristics
Missingness does not depend on any observable or unobservable data
Little RJA, Rubin DB. Statistical Analysis with Missing Data, 2nd edition. New York: Wiley, 2002.
Examples: e.g., a blood sample is accidentally destroyed in the lab, or a page of a questionnaire is lost in the mail
Under MCAR, throwing out cases with missing data does not bias the inferences (it only reduces statistical power)
It is possible to test whether data are MCAR, but often one can only show that they are not
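For example, Little's (1988) chi-squared test of MCAR is implemented in the naniar package (a sketch; this package is not otherwise used in the lab):

library(naniar)
mcar_test(chain)   # a small p-value is evidence against MCAR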
Missing at random (MAR)
A variable is missing at random if the probability of missingness depends only on available information
Reasons for missingness are based on other observed subject characteristics
Missing data are considered random conditional on these other observed characteristics, which determined the missingness and are available at the time of the analysis
Examples:
Doctors in a study are less likely to order blood tests for patients who are sicker
Study participants who are sicker drop out of the study
Study protocol requires that a subject be removed from the study whenever the value of an outcome variable falls outside of a certain clinical range of values.
Missing not at random (MNAR)
An observation is missing not at random if missingness depends on unobserved characteristics, such as the value of the observation itself
If missingness is not at random, it must be explicitly modeled, or else you must accept some bias in your inferences
Examples: e.g., people with very high incomes may be less likely to report their income; the sickest patients may drop out for reasons related to their unrecorded, worsening condition
We cannot empirically distinguish between MAR and MNAR from the observed data alone!
MI procedure (and rationale):
Attempts to take into account both sampling variability and model uncertainty
Uses available data to create several complete datasets
Data analyses are performed on each imputed dataset
The results from each dataset are combined using standard rules
Examine missing data among variables of interest (exploratory data analyses and missing data patterns, as we did for the CHAIN data); identify auxiliary variables as needed
Imputation: Use the available data to create M imputed (completed) datasets
Analysis: Perform your desired analysis on each imputed dataset
Pooling: Combine the estimates across datasets using Rubin's rules
Remark:
it separates the solution of the missing-data problem from the solution of the complete-data (scientific) problem
this separation allows for a better understanding of the scientific question
The analyst must specify a model for the data; the available models vary across software packages.
In R’s mice package, a fully conditional specification (chained equations) approach is implemented (van Buuren 2018).
van Buuren S. (2018). Flexible Imputation of Missing Data, 2nd edition. CRC Press. (online version: https://stefvanbuuren.name/fimd/)
Suppose we want to make inferences about a regression coefficient, \(\beta\)
We obtain an estimate \(\hat{\beta}_{m}\) in each of the M datasets, as well as standard errors \(s_{1}, \ldots, s_{M}\)
To obtain an overall point estimate, we average the estimates from the separate imputed datasets: \[\hat{\beta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}_{m}\] A final variance estimate \(V_{\beta}\) reflects variation both within and between imputations: \[V_{\beta} = V_{W} + \left(1 + \frac{1}{M}\right)V_{B}\] where \(V_{W} = \frac{1}{M}\sum_{m=1}^{M}s_{m}^{2}\) is the average within-imputation variance, \(V_{B} = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\beta}_{m} - \hat{\beta}\right)^{2}\) is the between-imputation variance, and \(\left(1 + \frac{1}{M}\right)\) acts as the finite-M correction factor
These rules work for any parameter of interest
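As a concrete illustration, a minimal sketch of these pooling rules in R (pool_rubin is a hypothetical helper written for this note, not a mice function):

pool_rubin <- function(est, se) {
  M <- length(est)
  qbar <- mean(est)                # pooled point estimate
  VW <- mean(se^2)                 # average within-imputation variance
  VB <- var(est)                   # between-imputation variance of the estimates
  VT <- VW + (1 + 1/M) * VB        # total variance (Rubin's rule)
  c(estimate = qbar, se = sqrt(VT))
}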
## [1] "id" "log_virus" "age" "income" "healthy" "mental"
## [7] "damage" "treatment"
data.mice <- mice(chain,
m=5,
meth=c("","pmm","pmm","pmm", "pmm", "logreg", "polr", "polr"),
printFlag=FALSE)

Remarks:
m = 5 generates 5 imputed datasets
a different imputation model is used for each type of variable (pmm: predictive mean matching for numeric variables; logreg: logistic regression for binary factors; polr: proportional-odds model for ordered factors; "" leaves the fully observed id un-imputed)
note that we included the 'damage' variable in the imputation
usually, one needs only about 5 replicates for model building/testing, and 20 to 100 for the final model
One could examine the imputed data using summary() etc.; the following link shows a few such examples:
https://datascienceplus.com/imputing-missing-data-with-r-mice-package/
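For instance (a short sketch using the objects defined above):

summary(complete(data.mice, 1))       # summary of the first completed dataset
head(data.mice$imp$log_virus)         # imputed log_virus values, one column per imputation
densityplot(data.mice, ~ log_virus)   # compare observed vs. imputed distributions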
mice.fit <- with(data.mice,
lm(log_virus~
factor(treatment)
+ age
+ income
+ healthy
+ mental))
result.mice <- pool(mice.fit)
summary(result.mice)
##                 term    estimate  std.error statistic        df      p.value
## 1 (Intercept) 13.02161327 2.64236013 4.928024 7.926683 1.183718e-03
## 2 factor(treatment)1 -2.35199730 0.57577918 -4.084895 54.045555 1.469317e-04
## 3 factor(treatment)2 -2.30735376 0.53367738 -4.323499 48.232195 7.662921e-05
## 4 age -0.09365580 0.03374523 -2.775379 16.326459 1.332164e-02
## 5 income -0.36545526 0.12634010 -2.892631 25.965831 7.631477e-03
## 6 healthy -0.05047707 0.02712661 -1.860795 9.833850 9.289597e-02
## 7 mental1 1.48729757 0.54444313 2.731778 32.277835 1.012912e-02
For comparison, here is the complete-case (listwise-deletion) fit of the same multivariate model:

##
## Call:
## lm(formula = log_virus ~ factor(treatment) + age + income + healthy +
## mental, data = chain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.720 -3.798 -1.344 3.986 10.367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.70070 1.85584 7.382 1.23e-12 ***
## factor(treatment)1 -2.39691 0.62267 -3.849 0.000142 ***
## factor(treatment)2 -2.21450 0.55999 -3.955 9.35e-05 ***
## age -0.10147 0.03044 -3.334 0.000952 ***
## income -0.36593 0.11825 -3.094 0.002137 **
## healthy -0.05925 0.02041 -2.903 0.003941 **
## mental 1.21077 0.56034 2.161 0.031416 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.47 on 337 degrees of freedom
## (188 observations deleted due to missingness)
## Multiple R-squared: 0.1484, Adjusted R-squared: 0.1332
## F-statistic: 9.785 on 6 and 337 DF,  p-value: 6.018e-10
Note that the complete-case fit uses only 344 of the 532 subjects (188 deleted due to missingness), whereas the MI analysis draws on all 532; the coefficient estimates are similar, but the pooled MI standard errors properly reflect the added uncertainty from imputation.
Note that mice (or other MI packages) might not support complex models.
As an example, with the CHAIN data, the following mediation modelling can be conducted to assess the direct/indirect effects of ART treatment \(^{*}\):
\(^{*}\): See mediation R package for details for the syntax.
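For concreteness, one such specification might look like the following (a hypothetical sketch; the choice of healthy as the mediator and the covariates are illustrative assumptions, not from the lab):

library(mediation)
med.fit <- lm(healthy ~ treatment + age + income, data = chain)              # mediator model
out.fit <- lm(log_virus ~ treatment + healthy + age + income, data = chain)  # outcome model
med.out <- mediate(med.fit, out.fit, treat = "treatment", mediator = "healthy",
                   sims = 1000)
summary(med.out)   # average direct and indirect (mediated) effects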
Then, one could use the following coding scheme/algorithm (using the mediation example above as a prototype):
data_full <- complete(data.mice, action = "long", include = TRUE)
theta <- numeric(20); theta.se <- numeric(20)
for (i in 1:20) {   # assumes the data were imputed with m = 20
  ## (1) complete-data analysis on the i-th imputed dataset
  data1 <- data_full[data_full$.imp == i, ]
  med.fit <- lm(..., data = data1)
  out.fit <- glm(..., data = data1)
  mediation.m <- mediate(med.fit, out.fit, ...)
  ## (2) extract the point estimate and SE of the parameter of interest \(^{*}\)
  theta[i]    <- mediation.m$coef[1]   # e.g.; the exact slot depends on the model
  theta.se[i] <- mediation.m$var[1]    # e.g.
}
Pool theta[1], ..., theta[20] to obtain the final point estimate (e.g., as their mean).
Pool theta.se[1], ..., theta.se[20], together with the between-imputation spread of the theta[i], to obtain the final SE (e.g., using Rubin's formula).
\(^{*}\): syntax depends on specific model.
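Using the pool_rubin() helper sketched earlier, this pooling step could be, for example:

final <- pool_rubin(theta, theta.se)
final["estimate"]   # pooled point estimate
final["se"]         # pooled SE via Rubin's formula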
Strengths:
Maintains entire dataset, uses all available information
Weak (more plausible) assumption about the missing data mechanism (MAR)
Properly reflects two kinds of uncertainty (sampling and model)
Maintains relationships between variables
One set of imputed datasets can be used for many analyses, avoiding different ad-hoc imputations (the original motivation for the method was survey analysis)
Weaknesses:
Can be complex to implement (though with current software this is becoming less of an issue)
Have to rely on modeling assumptions
The imputation model should be “congenial” to or consistent with the analytic model
All relationships between variables should be represented and estimated simultaneously
Sensitivity analysis is important (both within the method and across methods, e.g., IPW, inverse probability weighting)
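For reference, a minimal sketch of the IPW idea applied to the CHAIN outcome (a simplified illustration, not part of the lab; the choice of response-model covariates is an assumption):

## model the probability that log_virus is observed, given the covariates
chain$obs <- as.numeric(!is.na(chain$log_virus))
p.fit <- glm(obs ~ age + income + healthy + mental, family = binomial,
             data = chain, na.action = na.exclude)
chain$w <- 1 / fitted(p.fit)   # na.exclude pads the fitted values to all 532 rows
## reweight the complete cases by the inverse of their response probability
ipw.fit <- lm(log_virus ~ factor(treatment) + age + income + healthy + mental,
              data = chain, weights = w)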
Tasks for today’s lab:
repeat the MI-based analysis
upload the results onto BB
if needed, run tinytex::install_tinytex() first, then render the R Markdown report