Class 9 (Oct 28, 2021)
CHAIN data with confounding model
Missing data concept and handling
Multiple imputation (MI) method
The CHAIN dataset is a subset of the Community Health Advisory and Information Network (CHAIN) study.
This study is a longitudinal cohort study of people living with HIV in New York City, conducted by the Columbia University School of Public Health (Messeri et al. 2003).
The CHAIN cohort was recruited in 1994 from a large number of medical care and social service agencies serving people living with HIV in New York City. Cohort members were interviewed up to 8 times through 2002. A total of 532 CHAIN participants completed at least one interview at the 1st, 2nd, or 3rd round of interviews.
For simplicity, our analysis here discards the time aspect of the dataset and uses only the first round of the survey.
Messeri P, Lee G, Abramson DM, et al. Antiretroviral therapy and declining AIDS mortality in New York City. Medical Care. 2003;41(4):512-521.
library("haven")
setwd("~/Desktop/BIOS234/Lab9_10.28")
chain <- read_spss(file = "CHAIN.sav")
dim(chain)
## [1] 532   8
head(chain)
## # A tibble: 6 × 8
## id log_virus age income healthy mental damage treatment
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 10.0 29 5 34.2 1 4 1
## 2 2 0 38 5 58.1 0 5 2
## 3 3 NA 47 6 21.9 0 1 1
## 4 4 NA 53 2 18.7 0 5 0
## 5 5 7.09 42 2 54.1 0 4 1
## 6 6 NA 49 4 45.6 0 4 2
log_virus: log of the self-reported viral load level
age: age at the time of the interview
income: annual family income, recorded in 10 intervals
healthy: a continuous scale of physical health with a theoretical range between 0 and 100, where better health is associated with higher values
mental: a binary measure of poor mental health (1 = Yes, 0 = No)
damage: ordered interval for the CD4 count, an indicator of how much damage HIV (the virus) has caused to the immune system
treatment: a three-level ordered variable: 0 = not currently taking HAART (Highly Active Antiretroviral Therapy), 1 = taking HAART but nonadherent, 2 = taking HAART and adherent
For the sake of today’s topic, we set out to assess the treatment’s impact on viral load (i.e., a confounding model): \[ Y = \beta_{0} + \beta_{1}S + e, \quad \quad \text{(univariate model)} \] \[ Y = \beta_{0} + \beta_{1}S + \beta_{c_{1}}C_{1} + \ldots + \beta_{c_{p}}C_{p} + e, \quad \quad \text{(multivariate model)} \] where \(Y\) is log_virus, \(S\) is the treatment variable, and \(C_{1}, \ldots, C_{p}\) are potential confounders.
#unadjusted model, i.e., univariate model
univ.fit <- lm(log_virus ~ factor(treatment), data = chain)
#adjusted model, multivariate model
mult.fit <- lm(log_virus ~ factor(treatment) + age +
income + healthy + mental, data = chain)

Q: Why do we exclude the "damage" variable from the multivariate model?
Pay attention to the "Observations" counts under models (1) and (2).
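As a quick check of the sample sizes actually used (a small addition, not in the original handout):

nobs(univ.fit)   # observations used in the unadjusted model
nobs(mult.fit)   # fewer: rows with any missing covariate are dropped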
Using md.pattern(chain, plot = FALSE) from the mice package, we can generate the missing-data pattern table, as below:
## id age healthy mental treatment income damage log_virus
## 335 1 1 1 1 1 1 1 1 0
## 122 1 1 1 1 1 1 1 0 1
## 9 1 1 1 1 1 1 0 1 1
## 28 1 1 1 1 1 1 0 0 2
## 8 1 1 1 1 1 0 1 1 1
## 4 1 1 1 1 1 0 1 0 2
## 1 1 1 1 1 1 0 0 1 2
## 1 1 1 1 1 1 0 0 0 3
## 24 1 0 0 0 0 0 0 0 7
## 0 24 24 24 24 38 63 179 376
Each row is a distinct missing-data pattern (1 = observed, 0 = missing): the left margin counts the subjects with that pattern and the right margin counts the variables missing in it. The bottom row gives the number of missing values per variable; log_virus has the most (179), and 24 subjects are missing everything except id.
In general, there are three broad categories of reasons for missing data. Little and Rubin (2002) provide a taxonomy of missingness mechanisms:
Missing completely at random (MCAR)
A variable is missing completely at random if the probability of missingness is the same for all respondents
Respondents who have missing data are a random subset of the complete sample of subjects
The probability of missingness is not related to any subject characteristics
Missingness does not depend on any observable or unobservable data
Little RJA, Rubin DB. Statistical Analysis with Missing Data, 2nd edition. New York: Wiley, 2002.
Examples: e.g., a blood sample is accidentally destroyed in the lab, or a page of a questionnaire is lost in the mail
Under MCAR, throwing out cases with missing data does not bias the inferences (it only reduces statistical power)
It is possible to test whether data are MCAR, but often one can only show that they are not
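For example, Little's (1988) chi-squared test of MCAR is implemented in the naniar package (a sketch; this package is not otherwise used in the lab):

library(naniar)
mcar_test(chain)   # a small p-value is evidence against MCAR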
Missing at random (MAR)
A variable is missing at random if the probability of missingness depends only on available information
Reasons for missingness are based on other observed subject characteristics
Missing data are considered random conditional on these other observed characteristics, which determined the missingness and are available at the time of the analysis
Examples:
Doctors in a study are less likely to order blood tests for patients who are sicker
Study participants who are sicker drop out of the study
Study protocol requires that a subject be removed from the study whenever the value of an outcome variable falls outside of a certain clinical range of values.
Missing not at random (MNAR)
An observation is missing not at random if missingness depends on unobserved characteristics, such as the value of the observation itself
If missingness is not at random, it must be explicitly modeled, or else you must accept some bias in your inferences
Examples: e.g., people with very high incomes may be less likely to report their income; the sickest patients may drop out for reasons related to their unrecorded, worsening condition
We cannot empirically distinguish between MAR and MNAR from the observed data alone!
MI procedure (and rationale):
Attempts to take into account both sampling variability and model uncertainty
Uses available data to create several complete datasets
Data analyses are performed on each imputed dataset
The results from each dataset are combined using standard rules
Examine missing data among variables of interest (exploratory data analyses and missing data patterns, as we did for the CHAIN data); identify auxiliary variables as needed
Imputation: Use the available data to create M imputed (completed) datasets
Analysis: Perform your desired analysis on each imputed dataset
Pooling: Combine the estimates across datasets using Rubin's rules
Remark:
it separates the solution of the missing-data problem from the solution of the complete-data (scientific) problem
this separation allows for a better understanding of the scientific question
The analyst must specify a model for the data; the available models vary across software packages.
In R’s mice package, a fully conditional specification (chained equations) approach is implemented (van Buuren 2018).
van Buuren S. (2018). Flexible Imputation of Missing Data, 2nd edition. CRC Press. (online version: https://stefvanbuuren.name/fimd/)
Suppose we want to make inferences about a regression coefficient, \(\beta\)
We obtain an estimate \(\hat{\beta}_{m}\) in each of the M datasets, as well as standard errors \(s_{1}, \ldots, s_{M}\)
To obtain an overall point estimate, we average the estimates from the separate imputed datasets: \[\hat{\beta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}_{m}\] A final variance estimate \(V_{\beta}\) reflects variation both within and between imputations: \[V_{\beta} = V_{W} + \left(1 + \frac{1}{M}\right)V_{B}\] where \(V_{W} = \frac{1}{M}\sum_{m=1}^{M}s_{m}^{2}\) is the average within-imputation variance, \(V_{B} = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\beta}_{m} - \hat{\beta}\right)^{2}\) is the between-imputation variance, and \(\left(1 + \frac{1}{M}\right)\) acts as the finite-M correction factor
These rules work for any parameter of interest
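As a concrete illustration, a minimal sketch of these pooling rules in R (pool_rubin is a hypothetical helper written for this note, not a mice function):

pool_rubin <- function(est, se) {
  M <- length(est)
  qbar <- mean(est)                # pooled point estimate
  VW <- mean(se^2)                 # average within-imputation variance
  VB <- var(est)                   # between-imputation variance of the estimates
  VT <- VW + (1 + 1/M) * VB        # total variance (Rubin's rule)
  c(estimate = qbar, se = sqrt(VT))
}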
## [1] "id" "log_virus" "age" "income" "healthy" "mental"
## [7] "damage" "treatment"
data.mice <- mice(chain,
m=5,
meth=c("","pmm","pmm","pmm", "pmm", "logreg", "polr", "polr"),
printFlag=FALSE)

Remarks:
m = 5 generates 5 imputed datasets
a different imputation model is used for each type of variable (pmm: predictive mean matching for numeric variables; logreg: logistic regression for binary factors; polr: proportional-odds model for ordered factors; "" leaves the fully observed id un-imputed)
note that we included the 'damage' variable in the imputation
usually, one needs only about 5 replicates for model building/testing, and 20 to 100 for the final model
One could examine the imputed data using summary() etc.; the following link shows a few such examples:
https://datascienceplus.com/imputing-missing-data-with-r-mice-package/
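For instance (a short sketch using the objects defined above):

summary(complete(data.mice, 1))       # summary of the first completed dataset
head(data.mice$imp$log_virus)         # imputed log_virus values, one column per imputation
densityplot(data.mice, ~ log_virus)   # compare observed vs. imputed distributions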
mice.fit <- with(data.mice,
lm(log_virus~
factor(treatment)
+ age
+ income
+ healthy
+ mental))
result.mice <- pool(mice.fit)
summary(result.mice)
##                 term    estimate  std.error statistic        df      p.value
## 1 (Intercept) 13.02161327 2.64236013 4.928024 7.926683 1.183718e-03
## 2 factor(treatment)1 -2.35199730 0.57577918 -4.084895 54.045555 1.469317e-04
## 3 factor(treatment)2 -2.30735376 0.53367738 -4.323499 48.232195 7.662921e-05
## 4 age -0.09365580 0.03374523 -2.775379 16.326459 1.332164e-02
## 5 income -0.36545526 0.12634010 -2.892631 25.965831 7.631477e-03
## 6 healthy -0.05047707 0.02712661 -1.860795 9.833850 9.289597e-02
## 7 mental1 1.48729757 0.54444313 2.731778 32.277835 1.012912e-02
For comparison, here is the complete-case (listwise-deletion) fit of the same multivariate model:

##
## Call:
## lm(formula = log_virus ~ factor(treatment) + age + income + healthy +
## mental, data = chain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.720 -3.798 -1.344 3.986 10.367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.70070 1.85584 7.382 1.23e-12 ***
## factor(treatment)1 -2.39691 0.62267 -3.849 0.000142 ***
## factor(treatment)2 -2.21450 0.55999 -3.955 9.35e-05 ***
## age -0.10147 0.03044 -3.334 0.000952 ***
## income -0.36593 0.11825 -3.094 0.002137 **
## healthy -0.05925 0.02041 -2.903 0.003941 **
## mental 1.21077 0.56034 2.161 0.031416 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.47 on 337 degrees of freedom
## (188 observations deleted due to missingness)
## Multiple R-squared: 0.1484, Adjusted R-squared: 0.1332
## F-statistic: 9.785 on 6 and 337 DF,  p-value: 6.018e-10
Note that the complete-case fit uses only 344 of the 532 subjects (188 deleted due to missingness), whereas the MI analysis draws on all 532; the coefficient estimates are similar, but the pooled MI standard errors properly reflect the added uncertainty from imputation.
Note that mice (or other MI packages) might not support complex models.
As an example, with the CHAIN data, the following mediation modelling can be conducted to assess the direct/indirect effects of ART treatment \(^{*}\):
\(^{*}\): See mediation R package for details for the syntax.
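For concreteness, one such specification might look like the following (a hypothetical sketch; the choice of healthy as the mediator and the covariates are illustrative assumptions, not from the lab):

library(mediation)
med.fit <- lm(healthy ~ treatment + age + income, data = chain)              # mediator model
out.fit <- lm(log_virus ~ treatment + healthy + age + income, data = chain)  # outcome model
med.out <- mediate(med.fit, out.fit, treat = "treatment", mediator = "healthy",
                   sims = 1000)
summary(med.out)   # average direct and indirect (mediated) effects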
Then, one could use the following coding scheme/algorithm (using the mediation example above as a prototype):
data_full <- complete(data.mice, action = "long", include = TRUE)
theta <- numeric(20); theta.se <- numeric(20)
for (i in 1:20) {   # assumes the data were imputed with m = 20
  ## (1) complete-data analysis on the i-th imputed dataset
  data1 <- data_full[data_full$.imp == i, ]
  med.fit <- lm(..., data = data1)
  out.fit <- glm(..., data = data1)
  mediation.m <- mediate(med.fit, out.fit, ...)
  ## (2) extract the point estimate and SE of the parameter of interest \(^{*}\)
  theta[i]    <- mediation.m$coef[1]   # e.g.; the exact slot depends on the model
  theta.se[i] <- mediation.m$var[1]    # e.g.
}
Pool theta[1], ..., theta[20] to obtain the final point estimate (e.g., as their mean).
Pool theta.se[1], ..., theta.se[20], together with the between-imputation spread of the theta[i], to obtain the final SE (e.g., using Rubin's formula).
\(^{*}\): syntax depends on specific model.
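Using the pool_rubin() helper sketched earlier, this pooling step could be, for example:

final <- pool_rubin(theta, theta.se)
final["estimate"]   # pooled point estimate
final["se"]         # pooled SE via Rubin's formula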
Strengths:
Maintains entire dataset, uses all available information
Weak (more plausible) assumption about the missing data mechanism (MAR)
Properly reflects two kinds of uncertainty (sampling and model)
Maintains relationships between variables
One set of imputed datasets can be used for many analyses, avoiding different ad-hoc imputations (the original motivation for the method was survey analysis)
Weaknesses:
Can be complex to implement (though with current software this is becoming less of an issue)
Have to rely on modeling assumptions
The imputation model should be “congenial” to or consistent with the analytic model
All relationships between variables should be represented and estimated simultaneously
Sensitivity analysis is important (both within the method and across methods, e.g., IPW, inverse probability weighting)
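For reference, a minimal sketch of the IPW idea applied to the CHAIN outcome (a simplified illustration, not part of the lab; the choice of response-model covariates is an assumption):

## model the probability that log_virus is observed, given the covariates
chain$obs <- as.numeric(!is.na(chain$log_virus))
p.fit <- glm(obs ~ age + income + healthy + mental, family = binomial,
             data = chain, na.action = na.exclude)
chain$w <- 1 / fitted(p.fit)   # na.exclude pads the fitted values to all 532 rows
## reweight the complete cases by the inverse of their response probability
ipw.fit <- lm(log_virus ~ factor(treatment) + age + income + healthy + mental,
              data = chain, weights = w)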
Tasks for today’s lab:
repeat the MI-based analysis
upload the results onto BB
if needed, run tinytex::install_tinytex() first, then render the R Markdown report