knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
Previous sections:
Note: This analysis is for practice. In this section, I will summarize some basic commands for imputation, based on several online sources.
R Packages:
# More packages will be shown during the analysis process
# Load the required library
library(tidyverse) # Data Wrangling
library(conflicted) # Dealing with conflict package
library(readxl) # Read Excel files
library(DT) # For using datatable()
library(mice) # For imputation
library(VIM) # For aggr function
library(visdat) # For vis_dat; vis_miss function
library(simputation) # For imputation missing value
library(naniar) # For bind_shadow function
library(arsenal) # For comparing 2 datasets comparedf() (equal or unequal)
library(diffdf) # For comparing 2 datasets diffdf()
Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions are
in conflict. R's default conflict resolution system gives precedence to
the most recently loaded package. This can make it hard to detect
conflicts, particularly when they are introduced by an update to an
existing package.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")
My source for this description of mechanisms is Chapter 25 of Gelman
and Hill (2007)
https://thomaselove.github.io/432-notes/dealing-with-missingness-single-imputation.html#plotting-missingness
MCAR causes enlarged standard errors due to the reduced sample size, but it does not cause bias ('systematic error', i.e., overestimation of benefits and underestimation of harms).
It is often reasonable to model this process as a logistic regression, where the outcome variable equals 1 for observed cases and 0 for missing. When an outcome variable is missing at random, it is acceptable to exclude the missing cases (that is, to treat them as NA), as long as the regression controls for all the variables that affect the probability of missingness.
MAR allows prediction of the missing values based on the participants with complete data.
If a particular treatment causes discomfort, a patient is more likely to drop out of the study. This missingness is not at random (unless “discomfort” is measured and observed for all patients). If missingness is not at random, it must be explicitly modeled, or else you must accept some bias in your inferences.
Missingness that depends on the missing value itself. Finally, a
particularly difficult situation arises when the probability of
missingness depends on the (potentially missing) variable itself. For
example, suppose that people with higher earnings are less likely to
reveal them.
Essentially, MNAR mechanisms are referred to collectively as non-random missingness, and they cause more trouble for us than MCAR and MAR.
Note: Moving from MCAR to MNAR, you become more confident about the reason for the missing values. For example, under MNAR you can explain why a patient is more likely to drop out of the study => missing value.
Best-worst and worst-best case sensitivity analyses may be used if in doubt:
First a ‘best-worst-case’ scenario dataset is generated where it is assumed that all participants lost to follow-up in one group (referred to as group 1) have had a beneficial outcome (for example, had no serious adverse event); and all those with missing outcomes in the other group (group 2) have had a harmful outcome (for example, have had a serious adverse event).
Then a ‘worst-best-case’ scenario dataset is generated where it is assumed that all participants lost to follow-up in group 1 have had a harmful outcome; and that all those lost to follow-up in group 2 have had a beneficial outcome.
If continuous outcomes are used, then a 'beneficial outcome' might be the group mean plus 2 standard deviations (or 1 standard deviation), and a 'harmful outcome' might be the group mean minus 2 standard deviations (or 1 standard deviation).
For dichotomised data, these best-worst and worst-best case sensitivity analyses will then show the range of uncertainty due to missing data, and if this range does not give qualitatively contradicting results, then the missing data may be ignored. For continuous data imputation with 2 SD will represent a possible range of uncertainty given 95% of the observed data (if normally distributed).
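As an illustration, here is a minimal sketch of generating the best-worst-case values for a continuous outcome, assuming a hypothetical outcome vector with NAs, a group vector with levels "1" and "2", and that higher outcome values are beneficial:
# Hypothetical best-worst-case imputation for a continuous outcome
best_worst <- function(outcome, group) {
  m <- tapply(outcome, group, mean, na.rm = TRUE)
  s <- tapply(outcome, group, sd, na.rm = TRUE)
  out <- outcome
  out[is.na(outcome) & group == "1"] <- m["1"] + 2 * s["1"] # beneficial for group 1
  out[is.na(outcome) & group == "2"] <- m["2"] - 2 * s["2"] # harmful for group 2
  out
}
# The worst-best-case dataset simply swaps the signs (- for group 1, + for group 2)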
If only the dependent variable has missing values and auxiliary variables (variables not included in the regression analysis, but correlated with a variable with missing values and/or related to its missingness) are not identified, complete case analysis may be used as the primary analysis and no specific methods ought to be used to handle the missing data. No additional information will be obtained by, for example, using multiple imputation but the standard errors may increase due to the uncertainty introduced by the multiple imputation.
It would also be valid just to perform complete case analysis if it is relatively certain that the data are MCAR. It is relatively rare that it is certain that the data are MCAR. It is possible to test the hypothesis that the data are MCAR with Little’s test, but it may be unwise to build on tests that turned out to be insignificant. Hence, if there is reasonable doubt if the data are MCAR, even if Little’s test is insignificant (fail to reject the null hypothesis that data is MCAR), then MCAR should not be assumed.
Note: When the proportion of missing data is too large:
If large proportions of data are missing, it ought to be considered just to report the results of the complete case analysis and then clearly discuss the resulting interpretative limitations of the trial results. If multiple imputations or other methods are used to handle missing data it might indicate that the results of the trial are confirmative, which they are not if the missingness is considerable.
If the proportions of missing data are very large (for example, more than 40%) on important variables, then trial results may only be considered as hypothesis generating results. A rare exception would be if the underlying mechanism behind the missing data can be described as MCAR (see paragraph above).
When data are ready to be analysed, it should be thoroughly assessed, based on inspection of the data, whether statistical methods ought to be used to handle missing data. Bell et al. aimed to assess handling of missing data in randomised clinical trials published between July and December 2013 in the BMJ, JAMA, Lancet, and New England Journal of Medicine.
There are several available methods for dealing with missing data that are MCAR or MAR, but they basically boil down to:
Discard data: Complete Case; Available Case; Dropping variable
analyses.
Retain data:
** Mean, median and mode.
** Random Sampling Imputation.
** Predictive/Statistical models: Linear Regression; Random Forest; kNN
(k Nearest Neighbour); Maximum likelihood.
Single Imputation vs Multiple Imputation.
More info can be found here:
https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184
Complete case analyses: The most common approach to missing data is to omit the cases with missing data (observations whose rows contain an NA value) and analyse the remaining data. This is the default approach for many statistical software packages; it may introduce unpredictable bias and fails to include some useful, often hard-won information.
Two problems with complete-case analysis:
1. If the units with missingness differ systematically from the completely observed cases, the complete-case analysis will be biased.
2. If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded for the sake of a simple analysis.
Dropping variables: If there is too much data missing for a variable, it may be an option to delete that variable (column) from the dataset. There is no rule of thumb for this; it depends on the situation, and a proper analysis of the data is needed before the variable is dropped altogether. This should be the last option, and we need to check whether model performance improves after the deletion of the variable.
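For reference, complete case analysis itself is a one-liner in R (df here stands for any data frame with missing values):
# Keep only the rows without any NA (complete cases)
df_cc <- df[complete.cases(df), ]
# Equivalently, with tidyr (part of the tidyverse): df_cc <- tidyr::drop_na(df)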
1. Mean
In a mean substitution, the mean value of a variable is used in place of
the missing data value for that same variable. This has the benefit of
not changing the sample mean for that variable.
The theoretical background of the mean substitution is that the mean is
a reasonable estimate for a randomly selected observation from a normal
distribution. However, with missing values that are not strictly random,
especially in the presence of great inequality in the number of missing
values for the different variables, the mean substitution method may
lead to inconsistent bias. Distortion of
original variance and distortion of co-variance with
remaining variables within the dataset are two major drawbacks
of this method.
2. Median
Median can be used when the variable has a skewed distribution.
3. Mode
The rationale for Mode is to replace the population of missing values
with the most frequent value since this is the most likely
occurrence.
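A minimal sketch of these three substitutions on small hypothetical vectors:
# Mean / median substitution for a numeric vector with NAs
x <- c(2, 4, NA, 8, NA, 6)
x_mean <- replace(x, is.na(x), mean(x, na.rm = TRUE))
x_median <- replace(x, is.na(x), median(x, na.rm = TRUE))
# Mode substitution for a categorical vector: use the most frequent level
f <- factor(c("a", "b", "a", NA, "a"))
f[is.na(f)] <- names(which.max(table(f)))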
Random sampling imputation is in principle similar to mean/median
imputation because it aims to preserve the statistical
parameters of the original variable, for which data is
missing.
Random sampling consists of taking a random observation from the pool of
available observations and using that randomly extracted value to fill
the NA. In Random Sampling, one takes as many random observations as
missing values are present in the variable. Random sample imputation
assumes that the data are missing completely at random (MCAR). If this
is the case, it makes sense to substitute the missing values with values
extracted from the original variable distribution.
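A minimal sketch, again on a hypothetical vector:
# Random sampling imputation: draw one observed value for each NA
set.seed(1234)
x <- c(2, 4, NA, 8, NA, 6)
x[is.na(x)] <- sample(x[!is.na(x)], size = sum(is.na(x)), replace = TRUE)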
In regression imputation, the existing variables are used to predict,
and then the predicted value is substituted as if an actually obtained
value. This approach has several advantages because the imputation
retains a great deal of data over the listwise or pairwise deletion and
avoids significantly altering the standard deviation or the
shape of the distribution.
However, as with a mean substitution, a regression imputation substitutes values predicted from the other variables; no novel information is added, while the sample size is increased and the standard error is reduced.
Regression: imp_mean <- mice(dataset, method = "norm.predict", m = 1, maxit = 1)
Stochastic regression: imp_mean <- mice(dataset, method = "norm.nob", m = 1, maxit = 1)
To extract the imputed dataset: imp_mean <- complete(imp_mean)
For multiple imputation - I will demonstrate with the example later.
With Bayesian stochastic regression imputation, uncertainty is accounted for not only by adding error variance to the predicted values, but also by taking into account the uncertainty in estimating the regression coefficients of the imputation model. The Bayesian idea is that there is not one (true) population regression coefficient, but that the regression coefficients themselves also follow a distribution.
The command in the mice package: imp_mean <- mice(dataset, method = "norm", m = 1, maxit = 1)
To extract the imputed dataset: imp_mean <- complete(imp_mean)
Random forest is a non-parametric imputation method applicable to various variable types that works well with data that are missing at random and not missing at random.
Random forest uses multiple decision trees to estimate missing values
and outputs OOB (out of the bag) imputation error estimates. One caveat
is that random forest works best with large datasets,
and using random forest on small datasets runs the risk of
overfitting.
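A minimal sketch using the missForest package (not loaded above, so this is an assumption; mice also offers a random forest method via method = "rf"). df here stands for any data frame with missing values:
# Random forest imputation with missForest (assumes the package is installed)
library(missForest)
set.seed(1234)
rf_out <- missForest(as.data.frame(df)) # missForest expects a plain data.frame
rf_out$OOBerror # OOB imputation error (NRMSE for numeric, PFC for factor variables)
df_rf <- rf_out$ximp # the imputed dataset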
k-NN imputes the missing attribute values based on the K nearest neighbours. Neighbours are determined based on a distance measure. Once the K neighbours are determined, the missing value is imputed by taking the mean, median or mode of the known values of that attribute among the neighbours.
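A minimal sketch with the kNN() function from the VIM package (loaded above); k = 5 is the default, and df again stands for any data frame with missing values:
# k-nearest-neighbour imputation with VIM
df_knn <- kNN(df, k = 5)
# kNN() appends logical indicator columns (suffix "_imp"); keep only the originals
df_knn <- df_knn[, names(df)]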
The assumption that the observed data are a sample drawn from a multivariate normal distribution is relatively easy to understand. After the parameters are estimated using the available data, the missing data are estimated based on the parameters which have just been estimated. Several strategies are using the maximum likelihood method to handle the missing data.
Expectation-Maximization (EM) is the maximum likelihood method used to create a new data set. All missing values are imputed with values estimated by the maximum likelihood methods.
This approach begins with the expectation step, during which the
parameters (e.g., variances, covariances, and means) are estimated,
perhaps using listwise deletion.
Those estimates are then used to create a regression equation to
predict the missing data. The maximization step uses those equations to
fill in the missing data.
The expectation step is then repeated with the new parameters,
where the new regression equations are determined to “fill in” the
missing data.
The expectation and maximization steps are repeated until the system stabilizes.
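To make the iteration concrete, here is a toy sketch for a bivariate case (x fully observed, y partly missing) on simulated data. This simplified version only iterates regression predictions and omits the residual-variance correction of the full EM algorithm:
# Toy EM-style iteration: M-step refits the model, E-step predicts missing y
set.seed(1234)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)
y[sample(100, 20)] <- NA # make 20 values missing
y_fill <- ifelse(is.na(y), mean(y, na.rm = TRUE), y) # crude starting values
for (it in 1:50) {
  fit <- lm(y_fill ~ x) # "M-step": estimate parameters from completed data
  y_new <- ifelse(is.na(y), predict(fit, data.frame(x = x)), y) # "E-step"
  if (max(abs(y_new - y_fill)) < 1e-8) break # stop when the system stabilizes
  y_fill <- y_new
}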
In single imputation analyses, NA values are estimated/replaced one time with one particular data value for the purpose of obtaining more complete samples, at the expense of creating some potential bias in the eventual conclusions or obtaining slightly less accurate estimates than would be available if there were no missing values in the data.
A single imputation can be just a replacement with the
mean or median (for a quantity) or the mode (for a categorical
variable). However, such an approach, though easy to
understand, underestimates variance and ignores
the relationship of missing values to other variables.
Single imputation can also be done using a variety of models to
try to capture information about the NA values that are available in
other variables within the data set.
The simputation package can help us execute single imputations using a wide variety of techniques, within the pipe approach used by the tidyverse. Another approach I have used in the past is the mice package, which can also perform single imputations.
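For example, a minimal sketch with simputation inside a pipe, using variables from the fakestroke data introduced later (sbp imputed by linear regression on age and nihss, aspects by its overall median):
# Single imputation with simputation, tidyverse-style
df_si <- df %>%
  impute_lm(sbp ~ age + nihss) %>%
  impute_median(aspects ~ 1)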
Multiple Imputation (MI) is a statistical technique for handling missing data. The key concept of MI is to use the distribution of the observed data to estimate a set of plausible values for the missing data. Random components are incorporated into these estimated values to show their uncertainty. Multiple datasets are created and then analysed individually but identically to obtain a set of parameter estimates. Estimates are combined to obtain a set of parameter estimates.
The benefit of the multiple imputations is that restoring the natural variability of the missing values incorporates the uncertainty due to the missing data, which results in a valid statistical inference.
As a flexible way of handling more than one missing variable, apply a Multiple Imputation by Chained Equations (MICE) approach. In the MICE algorithm, a chain of regression equations is used to obtain imputations, which means that variables with missing data are imputed one by one. The regression models use information from all other variables in the model, i.e. (conditional) imputation models. In order to add sampling variability to the imputations, residual error is added to create the imputed values. This residual error can either be added to the predicted values directly, which is essentially similar to repeating stochastic regression imputation over several imputation runs. Residual variance can also be added to the parameter estimates of the regression model, i.e. the regression coefficients, which makes it a Bayesian method. The Bayesian method is the default in the mice package in R. Refer to the reference for more information about MICE.
A single variable regression analysis includes a dependent variable and the stratification variables used in the randomisation. The stratification variables often include a centre indicator if the trial is a multi-centre trial and usually one or more adjusting variables with prognostic information which are correlated with the outcome. When using a continuous dependent variable, a baseline value of the dependent variable may also be included.
=> Therefore, in all events, a single variable imputation (with or without auxiliary variables included as appropriate) is conducted if only the baseline variable is missing.
If both the dependent variable and the baseline variable are missing and the missingness is monotone, a monotonic imputation is done.
Assume a data matrix where patients are represented by rows and variables by columns. The missingness of such a data matrix is said to be monotone if its columns can be reordered such that, for any patient, whenever a value is missing, all values in the subsequent columns are also missing.
If the missingness is monotone, the method of multiple imputation is also relatively straightforward, even if more than one variable has missing values. In this case it is relatively simple to impute the missing data using sequential regression imputation where the missing values are imputed for each variable at a time.
If missingness is not monotone, a multiple imputation is conducted using the chained equations or the MCMC method. Auxiliary variables are included in the model if they are available.
One of the distinct advantages of multiple imputation is that it can produce unbiased estimates with correct confidence intervals with a low number of imputed datasets, even as low as m = 2.
The classic advice is to use a low number of imputations, somewhere between 3 and 5, for moderate amounts of missing information. Several authors have investigated the influence of m on various aspects of the results. The picture emerging from this work is that it is often beneficial to set m higher, somewhere in the range of 20-100 imputations.
Theoretically it is always better to use a higher m, but this involves more computation and storage. Setting m very high (say m = 200) may be useful for low-level estimands that are very uncertain, and for which we want to approximate the full distribution, or for parameters that are notoriously difficult to estimate, like variance components. On the other hand, setting m high may not be worth the extra wait if the primary interest is in the point estimates (and not in standard errors, p-values, and so on). In that case using m = 5-20 will be enough under moderate missingness.
Imputing a dataset in practice often involves trial and error to adapt and refine the imputation model. Such initial explorations do not require large m . It is convenient to set m = 5 during model building, and increase m only after being satisfied with the model for the “final” round of imputation. So if calculation is not prohibitive, we may set m to the average percentage of missing data. The substantive conclusions are unlikely to change as a result of raising m beyond m = 5.
Researchers often assume that the number of imputations needed to generate valid results can be set at 3-5 imputations. This idea was based on the work of Rubin (Rubin (1987)). He showed that the precision of a pooled parameter becomes lower when a finite number of multiply imputed datasets is used compared to an infinite number (finite means a limited number of imputed datasets, like 5, and infinite means unlimited, recognized by the mathematical symbol ∞). Specifically, the standard error (SE) based on m imputed datasets is √(1 + FMI/m) times as large as the SE based on an infinite number of imputed datasets.
FMI is the fraction of missing information and m is the number of imputed datasets. FMI is roughly equal to the proportion of missing data in the simplest case of one variable with missing data. When there are more variables in the imputation model, and these variables are correlated with the variables with missing data, the FMI becomes lower. For FMIs of 0.05, 0.1, 0.2, 0.3 and 0.5, the following numbers of imputed datasets are needed: ≥3, 6, 12, 24 and 59, respectively. Another rule of thumb is that the number of imputed datasets should be at least equal to the percentage of missing cases. This means that when 10% of the subjects have missing values, at least 10 imputed datasets should be generated.
Iterations: Van Buuren (Van Buuren (2018)) states that the number of iterations may depend on the correlation between variables and the percentage of missing data in the variables. He proposed that 5-20 iterations are enough to reach convergence. This number may be adjusted upwards when the percentage of missing data is high. Nowadays computers are fast, so a higher number of iterations can easily be used.
It should be clear that this rule is not universally appropriate. Some non-MCAR missing data patterns have FMI greater than the fraction of missing data. Some settings require a greater or smaller degree of reproducibility. Larger numbers of imputations may be required for method comparison studies.
Include the covariates and outcome from the analysis model: To avoid bias in the analysis model, the imputation model must include all variables that are in the analysis model. In particular, when imputing missing values of analysis model covariates, the analysis model outcome must be in the imputation model. If the imputed data are to be used to fit several different analysis models, then the imputation model should contain every variable included in any of the analysis models.
Include predictors of the incomplete variable: This is beneficial for two reasons. First, it may make the MAR assumption more plausible and hence reduce bias. Second, adding predictors of the incomplete variable to the imputation model can improve the imputations and hence reduces the standard errors of the estimates for the analysis model.
In practice, one could include in the imputation model all variables that significantly predict the incomplete variable, or whose association with the incomplete variable exceeds some threshold. One might also include any other variables which significantly predict whether the incomplete variable is missing, on the grounds that bias may be avoided if the included variables have a true association with the incomplete variable that fails to reach statistical significance, whereas little loss of precision is likely if the included variables do not predict the incomplete variable.
Multiple imputation consists of three steps:
Imputation step: we create several m complete versions of the data by replacing the missing values by plausible data values – multiple imputation represents multiple sets of plausible values. When using multiple imputation, missing values are identified and are replaced by a random sample of plausible values imputations (completed datasets). Multiple completed datasets are generated via some chosen imputation model. Five imputed datasets (m=5) have traditionally been suggested to be sufficient on theoretical grounds, but 50 datasets (or more) seem preferable to reduce sampling variability from the imputation process.
Completed-data analysis (estimation) step: The desired analysis is performed separately for each dataset that is generated during the imputation step. Hereby, for example, 50 analysis results are constructed.
Pooling step: The results obtained from each completed-data analysis are combined into a single multiple-imputation result. There is no need to conduct a weighted meta-analysis, as all (say, 50) analysis results are considered to have the same statistical weight.
It is of great importance that there is either compatibility between the imputation model and the analysis model or the imputation model is more general than the analysis model (for example, that the imputation model includes more independent covariates than the analysis model). For example, if the analysis model has significant interactions, then the imputation model should include them as well, if the analysis model uses a transformed version of a variable then the imputation model should use the same transformation, etc.
The classic workflow in mice runs functions mice(), with() and pool() in succession, each time saving the intermediate result:
library(mice)
imp <- mice(x, seed = 123, print = FALSE)
fit <- with(imp, lm(y ~ z))
est1 <- pool(fit)
The objects imp, fit and est1 have classes mids, mira and mipo, respectively. The classic workflow works because mice contains a with() function that understands how to deal with a mids object. The classic mids workflow has been widely adopted, but there are more possibilities.
The magrittr package introduced the pipe operator to R. This operator removes the need to save and reread objects, resulting in more compact and better readable code:
library(magrittr)
est2 <- x %>% mice(seed = 123, print = FALSE) %>% with(lm(y ~ z)) %>% pool()
The with() function handles two tasks: filling in the missing data and analyzing the data. Splitting these over two separate functions gives the user easier access to the imputed data, and hence is more flexible.
Many types of estimates are approximately normally distributed, e.g., means, standard deviations, regression coefficients, proportions and linear predictors.
What about non-normal distributions: correlation coefficients, odds ratios, relative risks, hazard ratios, measures of explained variance and so on? The quality of the pooled estimate and the confidence intervals can be improved when pooling is done on a scale for which the distribution is close to normal. Thus, transformation toward normality and back-transformation into the original scale improves statistical inference.
More info can be found here: https://stefvanbuuren.name/fimd/sec-pooling.html
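As a small illustration of pooling on a transformed scale, consider pooling a correlation via Fisher's z-transformation (the r values below are hypothetical estimates from m = 5 imputed datasets):
# Pool correlation estimates on the approximately normal Fisher z scale
r <- c(0.42, 0.45, 0.39, 0.44, 0.41) # hypothetical per-dataset estimates
z <- atanh(r) # Fisher z-transformation
r_pooled <- tanh(mean(z)) # pool on the z scale, then back-transform to r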
Schafer (1997) distinguished three types of statistics in multi-parameter tests: D1 (multivariate Wald test), D2 (combining test statistics) and D3 (likelihood ratio test).
# 
If the estimates are approximately normal and if the software can produce the required variance-covariance matrices, we recommend using D1 with an adjustment for small samples if n < 100. D1 is a direct extension of Rubin's rules to multi-parameter problems: theoretically convincing, mature and widely applied. D1 is insensitive to the assumption of equal fractions of missing information, is well calibrated, works well with small m (unless the fractions of missing information are large and variable) and suffers only a modest loss of power.
If only the test statistics are available for pooling, then the D2-statistic is a good option, provided that the number of imputations m > 20. The test is easy to calculate and applies to different test statistics. For m < 20, the power may be low. D2 tends to become optimistic for high fractions of missing information (> 0.3), and this effect unfortunately increases with sample size. Thus, careless application of D2 to large datasets with many missing values may yield high rates of false positives.
The likelihood ratio statistic D3 is theoretically sound. Calculation of D3 requires refitting the repeated analysis models with the estimates constrained to their pooled values. This was once an issue, but is probably less so in the future. D3 is asymptotically equivalent to D1, and may be preferred for theoretical reasons: it does not require normality in the complete-data model, it is often more powerful, and it may be more stable than D1 if k is large. We found that D3 produces Type 1 error rates comparable to D1. D3 tends to be somewhat conservative in smaller samples, especially with high fractions of missing information and high k. Also, D3 has lower statistical power in some of the extreme scenarios. For small samples, D1 has a slight edge over D3, so given the currently available evidence D1 is the better option for n < 200. In larger samples (n ≥ 200), D1 and D3 appear equally good, so the choice between them is mostly a matter of convenience.
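In mice, these pooled tests are available as the functions D1(), D2() and D3(), which compare a full and a restricted model fitted to the same mids object. A sketch, where imp is the imputation object created later in these notes and the models are hypothetical:
# Multi-parameter tests comparing nested models across imputed datasets
fit1 <- with(imp, lm(sbp ~ age + nihss)) # full model
fit0 <- with(imp, lm(sbp ~ age)) # restricted model
D1(fit1, fit0) # multivariate Wald test
D2(fit1, fit0) # combining test statistics
D3(fit1, fit0) # likelihood ratio test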
There are six limitations of MICE; they are discussed in the following paper:
White, I.R., Royston, P. and Wood, A.M., 2011. Multiple imputation using
chained equations: issues and guidance for practice. Statistics in
medicine, 30(4), pp.377-399.
Data used in these notes
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)
Loading the data, adjust the character of each column:
# Loading the data, adjust the character of each column
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx",
col_types = c("text", "text", "numeric",
"text", "numeric", "text", "text",
"numeric", "numeric", "text", "numeric",
"text", "numeric", "numeric", "text",
"numeric", "numeric", "numeric"))
df = data
The fakestroke.csv file contains the following 18 variables for 500 patients:
# To make a table in R markdown: 1st row: header, 2nd row: Alignment; the remaining row: for content
## |Variable | Description |
## |:------- | :---------- |
## |studyid | Study ID # (z001 through z500) |
Variable | Description |
---|---|
studyid | Study ID # (z001 through z500) |
trt | Treatment group (Intervention or Control) |
age | Age in years |
sex | Male or Female |
nihss | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
location | Stroke Location - Left or Right Hemisphere |
hx.isch | History of Ischemic Stroke (Yes/No) |
afib | Atrial Fibrillation (1 = Yes, 0 = No) |
dm | Diabetes Mellitus (1 = Yes, 0 = No) |
mrankin | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability - complete range is 0 (no symptoms) to 6 (death) |
sbp | Systolic blood pressure, in mm Hg |
iv.altep | Treatment with IV alteplase (Yes/No) |
time.iv | Time from stroke onset to start of IV alteplase (minutes) if iv.altep=Yes |
aspects | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0 - 10; higher scores indicate fewer early ischemic changes |
ia.occlus | Intracranial arterial occlusion, based on vessel imaging - five categories |
extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
time.rand | Time from stroke onset to study randomization, in minutes |
time.punc | Time from stroke onset to groin puncture, in minutes (only if Intervention) |
A quick look at the simulated data in fakestroke
df
## # A tibble: 500 x 18
## studyid trt age sex nihss locat~1 hx.isch afib dm mrankin sbp
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 z001 Control 53 Male 21 Right No 0 0 2 127
## 2 z002 Interven~ 51 Male 23 Left No 1 0 0 137
## 3 z003 Control 68 Fema~ 11 Right No 0 0 0 138
## 4 z004 Control 28 Male 22 Left No 0 0 0 122
## 5 z005 Control 91 Male 24 Right No 0 0 0 162
## 6 z006 Control 34 Fema~ 18 Left No 0 0 2 166
## 7 z007 Interven~ 75 Male 25 Right No 0 0 0 140
## 8 z008 Control 89 Fema~ 18 Right No 0 0 0 157
## 9 z009 Control 75 Male 25 Left No 1 0 2 129
## 10 z010 Interven~ 26 Fema~ 27 Right No 0 0 0 143
## # ... with 490 more rows, 7 more variables: iv.altep <chr>, time.iv <dbl>,
## # aspects <dbl>, ia.occlus <chr>, extra.ica <dbl>, time.rand <dbl>,
## # time.punc <dbl>, and abbreviated variable name 1: location
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
## $ studyid : chr [1:500] "z001" "z002" "z003" "z004" ...
## $ trt : chr [1:500] "Control" "Intervention" "Control" "Control" ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : chr [1:500] "Male" "Male" "Female" "Male" ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
## $ hx.isch : chr [1:500] "No" "No" "No" "No" ...
## $ afib : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
## $ dm : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
## $ mrankin : chr [1:500] "2" "0" "0" "0" ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
## $ time.iv : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
## $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
## $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
summary(df)
## studyid trt age sex
## Length:500 Length:500 Min. :23.00 Length:500
## Class :character Class :character 1st Qu.:55.00 Class :character
## Mode :character Mode :character Median :65.75 Mode :character
## Mean :64.71
## 3rd Qu.:76.00
## Max. :96.00
##
## nihss location hx.isch afib
## Min. :10.00 Length:500 Length:500 Min. :0.00
## 1st Qu.:14.00 Class :character Class :character 1st Qu.:0.00
## Median :18.00 Mode :character Mode :character Median :0.00
## Mean :18.03 Mean :0.27
## 3rd Qu.:22.00 3rd Qu.:1.00
## Max. :28.00 Max. :1.00
##
## dm mrankin sbp iv.altep
## Min. :0.000 Length:500 Min. : 78.0 Length:500
## 1st Qu.:0.000 Class :character 1st Qu.:128.5 Class :character
## Median :0.000 Mode :character Median :145.0 Mode :character
## Mean :0.126 Mean :145.5
## 3rd Qu.:0.000 3rd Qu.:162.5
## Max. :1.000 Max. :231.0
## NA's :1
## time.iv aspects ia.occlus extra.ica
## Min. : 42.00 Min. : 5.000 Length:500 Min. :0.0000
## 1st Qu.: 67.00 1st Qu.: 7.000 Class :character 1st Qu.:0.0000
## Median : 86.00 Median : 9.000 Mode :character Median :0.0000
## Mean : 92.64 Mean : 8.506 Mean :0.2906
## 3rd Qu.:115.00 3rd Qu.:10.000 3rd Qu.:1.0000
## Max. :218.00 Max. :10.000 Max. :1.0000
## NA's :55 NA's :4 NA's :1
## time.rand time.punc
## Min. :100.0 Min. :180
## 1st Qu.:151.2 1st Qu.:212
## Median :201.5 Median :260
## Mean :208.6 Mean :263
## 3rd Qu.:257.8 3rd Qu.:313
## Max. :360.0 Max. :360
## NA's :2 NA's :267
Before constructing a table, I must check which variables are numeric and which are categorical.
For example, in our data:
df = data %>% select(-studyid,-time.punc, -time.iv)
# Changing coding variable to categorical:
df$afib <- factor(df$afib , levels=0:1, labels=c("No", "Yes"))
df$dm <- factor(df$dm , levels=0:1, labels=c("No", "Yes"))
df$extra.ica <- factor(df$extra.ica, levels=0:1, labels=c("No", "Yes"))
str(df)
## tibble [500 x 15] (S3: tbl_df/tbl/data.frame)
## $ trt : chr [1:500] "Control" "Intervention" "Control" "Control" ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : chr [1:500] "Male" "Male" "Female" "Male" ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
## $ hx.isch : chr [1:500] "No" "No" "No" "No" ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : chr [1:500] "2" "0" "0" "0" ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
df <- df%>%mutate_if(is.character, as.factor)
str(df)
## tibble [500 x 15] (S3: tbl_df/tbl/data.frame)
## $ trt : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
# Relevel factors:
df$trt = factor(df$trt, levels=c("Intervention", "Control"))
df$mrankin = factor(df$mrankin, levels=c("0", "1", "2", "> 2"))
df$ia.occlus = factor(df$ia.occlus, levels=c("Intracranial ICA", "ICA with M1",
"M1", "M2", "A1 or A2"))
# Check order of factor after relevel:
df$trt[1:2]
## [1] Control Intervention
## Levels: Intervention Control
df$mrankin[1:2]
## [1] 2 0
## Levels: 0 1 2 > 2
df$ia.occlus[1:3]
## [1] M1 M1 ICA with M1
## Levels: Intracranial ICA ICA with M1 M1 M2 A1 or A2
Before moving on to determining the specifics of multiple imputation, we should first explore and see the pattern of missing data in our dataset.
There are several ways to check the missing value (NA) in the dataset:
# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(df))),
"% NA" = round(sort(colMeans(is.na(df))) * 100, 2))
## # NA % NA
## trt 0 0.0
## age 0 0.0
## sex 0 0.0
## nihss 0 0.0
## location 0 0.0
## hx.isch 0 0.0
## afib 0 0.0
## dm 0 0.0
## mrankin 0 0.0
## iv.altep 0 0.0
## sbp 1 0.2
## ia.occlus 1 0.2
## extra.ica 1 0.2
## time.rand 2 0.4
## aspects 4 0.8
Discuss
There are 5 variables with a total of 9 NAs.
aspects (0.8%) > time.rand (0.4%).
sbp = ia.occlus = extra.ica (0.2%).
Ncc <- cbind(
"#" = table(complete.cases(df)),
"%" = round(100 * table(complete.cases(df))/nrow(df), 2)
)
rownames(Ncc) <- c("incompl.", "complete")
Ncc
## # %
## incompl. 8 1.6
## complete 492 98.4
Discuss
Among 500 observations:
8 observations contain NA.
492 observations don't have NA.
p_missing <- unlist(lapply(df, function(x) sum(is.na(x))))/nrow(df)
sort(p_missing[p_missing > 0], decreasing = TRUE)
## aspects time.rand sbp ia.occlus extra.ica
## 0.008 0.004 0.002 0.002 0.002
library(plyr)
ddply(df, c(sbp = "ifelse(is.na(sbp), 'missing', 'observed')",
aspects = "ifelse(is.na(aspects), 'missing', 'observed')",
time.rand = "ifelse(is.na(time.rand), 'missing', 'observed')",
ia.occlus = "ifelse(is.na(ia.occlus), 'missing', 'observed')",
extra.ica = "ifelse(is.na(extra.ica), 'missing', 'observed')"
),
summarize,
N = length(sbp)
)
## sbp aspects time.rand ia.occlus extra.ica N
## 1 missing observed observed observed observed 1
## 2 observed missing observed observed observed 4
## 3 observed observed missing observed observed 2
## 4 observed observed observed missing missing 1
## 5 observed observed observed observed observed 492
Discuss
This command provides more info than the previous two. Among 500 observations:
8 observations (obs) contain NA.
** 1 obs has NA at sbp.
** 4 obs have NA at aspects.
** 2 obs have NA at time.rand.
** 1 obs has NA at both ia.occlus and extra.ica.
492 obs don't have NA.
Using the following command:
md.pattern(x)
md.pattern(df)
## trt age sex nihss location hx.isch afib dm mrankin iv.altep sbp ia.occlus
## 492 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 1 1 1 0 1
## 0 0 0 0 0 0 0 0 0 0 1 1
## extra.ica time.rand aspects
## 492 1 1 1 0
## 4 1 1 0 1
## 2 1 0 1 1
## 1 0 1 1 2
## 1 1 1 1 1
## 1 2 4 9
Discuss
In each row of the output:
** 1: the variable is complete in that pattern.
** 0: the variable in that pattern contains missing values.
** The last row: the total number of missing values for each variable.
In each column of the output:
** The 1st column on the left (without a column name) shows the number of cases with a specific pattern.
** The column on the right: the number of variables that are incomplete in that pattern.
Return to our data result:
At each row:
** 1st line: 492 obs have complete values.
** 2nd line: 4 obs have NA at aspects.
** 3rd line: 2 obs have NA at time.rand.
** 4th line: 1 obs has NA at both ia.occlus and extra.ica.
** 5th line: 1 obs has NA at sbp.
At each column - variable, the total number of missing values:
** sbp: 1
** ia.occlus: 1
** extra.ica: 1
** time.rand: 2
** aspects: 4
Installing this package:
devtools::install_github("ropensci/visdat")
Using the following command:
vis_miss(x, cluster = F, sort_miss = F, show_perc = T, show_perc_col = T)
vis_miss(df, cluster = T, sort_miss = T, show_perc = T, show_perc_col = T)
vis_dat(df, sort_type = T)
Discuss
More visualization.
However, it is hard to read off the exact number of missing values in each variable.
Using the following command:
aggr(x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis = .7, gap = 3)
aggr(df, plot = F, numbers = T, prop = T)
##
## Missings in variables:
## Variable Count
## sbp 1
## aspects 4
## ia.occlus 1
## extra.ica 1
## time.rand 2
aggr(df, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)
##
## Variables sorted by number of missings:
## Variable Count
## aspects 0.008
## time.rand 0.004
## sbp 0.002
## ia.occlus 0.002
## extra.ica 0.002
## trt 0.000
## age 0.000
## sex 0.000
## nihss 0.000
## location 0.000
## hx.isch 0.000
## afib 0.000
## dm 0.000
## mrankin 0.000
## iv.altep 0.000
aggr(df, plot = T, numbers = T, prop = T, combined= T, sortVars = F)
Discuss
We can see there are 5 variables with missing values:
** 1 obs has NA at both ia.occlus and extra.ica.
Among these variables:
** Numeric variables: sbp, aspects, time.rand.
** Categorical variables: ia.occlus, extra.ica.
The total proportion of missing values across these 5 variables is < 5% => complete case analysis can be applied. However, in this example, I will still use single imputation and multiple imputation to demonstrate how they work.
Summary
I recommend using mice and/or the aggr() function for inspecting missing values:
The code above calculates what percentage of data is missing. A simple look at this table would warn us about any variables that have more than 25% missing; it is useful to remove such variables from the dataset before imputing, as they might mess up the imputation. (Here, no variable comes close to that threshold.)
There are 5 variables with missing values:
Numeric variables: sbp, aspects, time.rand.
Categorical variables: ia.occlus, extra.ica.
The outcome variables of this project: nihss; mrankin.
The missing data patterns show that the total proportion of missing values is < 5% => complete case analysis can be applied.
=> However, for demonstration, I still continue with the imputation practice.
Little's MCAR test is available in the naniar package as the mcar_test function. To apply the test, we first select only the continuous variables; we use it on the same dataset as in the previous paragraph. A significant p-value would indicate that the missingness does not seem to be completely at random.
df_select <- df[,c("age", "nihss", "sbp","aspects" ,"time.rand" )]
library(naniar)
mcar_test(data=df_select)
## # A tibble: 1 x 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 18.4 12 0.104 4
mcar_test(data = df)
## # A tibble: 1 x 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 64.3 55 0.183 5
Discuss
The p-values for both tests are > 0.05 => Fail to reject the null hypothesis.
=> The missingness is consistent with being completely at random (MCAR).
=> The result agrees with the missing data patterns (< 5%): missingness is assumed not to matter for the analysis.
In R, the group of non-responders (i.e., participants with missing observations) can also be compared to the responders (i.e., participants without missing observations). For this we can use the missing_compare function of the finalfit package.
First we have to copy and rename the variables with missing data. We
rename the continuous variables sbp,
aspects and time.rand and call them
sbp_MAR, aspects_MAR and time.rand_MAR
respectively.
The missing_compare function uses the Kruskal-Wallis test for continuous variables and the Chi-squared test for categorical variables. (Note: the outcomes of concern in this research are mrankin and nihss.)
In this example, I will demonstrate the command to compare groups on a numeric variable: time.rand. Similar commands can be used for the other numeric variables:
df_select <- df[,c("age", "nihss", "sbp","aspects" ,"time.rand",
"trt", "mrankin", "ia.occlus", "extra.ica")]
# Copy variable with missing data and rename
df_select$sbp_MAR <- df_select$sbp
df_select$aspects_MAR <- df_select$aspects
df_select$time.rand_MAR <- df_select$time.rand
df_select$extra.ica_MAR <- df_select$extra.ica
library(finalfit)
explanatory = c("age", "nihss", "sbp" ,"aspects")
dependent = "time.rand_MAR"
df_select %>%
missing_compare(dependent, explanatory) %>%
knitr::kable(row.names=FALSE, align = c("l", "l", "r", "r", "r"),
caption = "Mean comparisons between values of responders (Not missing) and
non-responders (Missing) on the time.rand variable.")
Missing data analysis: time.rand_MAR | | Not missing | Missing | p |
---|---|---|---|---|
age | Mean (SD) | 64.6 (17.0) | 94.0 (1.4) | 0.015 |
nihss | Mean (SD) | 18.0 (4.7) | 18.0 (1.4) | 0.994 |
sbp | Mean (SD) | 145.6 (25.1) | 105.0 (8.5) | 0.022 |
aspects | Mean (SD) | 8.5 (1.6) | 9.0 (1.4) | 0.653 |
I can also compare groups on the extra.ica variable.
explanatory = c("trt", "mrankin")
dependent = "extra.ica_MAR"
df_select %>%
missing_compare(dependent, explanatory) %>%
knitr::kable(row.names=FALSE, align = c("l", "l", "r", "r", "r"),
caption = "Mean comparisons between values of responders (Not missing) and
non-responders (Missing) on the extra.ica variable.")
Missing data analysis: extra.ica_MAR | | Not missing | Missing | p |
---|---|---|---|---|
trt | Intervention | 233 (100.0) | 0 (0.0) | 1.000 |
Control | 266 (99.6) | 1 (0.4) | ||
mrankin | 0 | 403 (99.8) | 1 (0.2) | 0.971 |
1 | 50 (100.0) | 0 (0.0) | ||
2 | 25 (100.0) | 0 (0.0) | ||
> 2 | 21 (100.0) | 0 (0.0) |
Discuss
On the time.rand_MAR variable, the patients that have observed values (column not missing) differ significantly from patients with missing values (column Missing) on age (P = 0.015) and sbp (P=0.022).
On the extra.ica variable, the patients that have observed values (column not missing) don’t differ significantly from patients with missing values (column Missing) on trt and mrankin.
Before imputing the missing data, it is important to take a look at how the incomplete variables are distributed, so that we can choose appropriate imputation models.
# One option that allows to quickly get an overview of all variables,
# categorical and continuous (replace the "df" dataset in the code below)
nc <- max(5, ceiling(sqrt(ncol(df))))
nr <- ceiling(ncol(df) / nc)
par(mfrow = c(nr, nc), mgp = c(2, 0.6, 0), mar = c(2, 3, 3, 0.5))
for (i in 1:ncol(df)) {
  if (is.numeric(df[[i]])) {   # use [[i]] so the check also works on a tibble
    hist(df[[i]], nclass = 50, xlab = "",
         main = paste0(names(df)[i], " (",
                       round(mean(is.na(df[[i]])) * 100, 2), "% NA)")
    )
  } else {
    barplot(table(df[[i]]), ylab = "Frequency",
            main = paste0(names(df)[i], " (",
                          round(mean(is.na(df[[i]])) * 100, 2), "% NA)"))
  }
}
Package used: mice
Usage:
mice(dataset, method = , predictorMatrix = , m = , maxit = , seed = 1234, printFlag = F)
There are several guidelines that can be used to customize the imputation model for each variable by the predictor matrix. To summarize:
First, do a set-up run of mice(), without any iterations (maxit = 0). This is to get all the objects that we may need to adjust in order to customize the imputation to our data.
imp0 <- mice(df, maxit = 0, print = F)
There are many imputation methods available in mice. You can find the list in the help page of the mice() function. We will focus here on the following ones:
Name | Description | Variable type |
---|---|---|
pmm | Predictive mean matching | any |
norm | Bayesian linear regression | numeric |
logreg | Logistic regression | factor, 2 levels |
polyreg | Polytomous logistic regression | unordered factor, > 2 levels |
polr | Proportional odds model | ordered factor, > 2 levels |
If unspecified, mice will use the defaults: pmm for numeric variables, logreg for binary factors, polyreg for unordered factors with more than 2 levels, and polr for ordered factors.
For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:
imp0$meth
## trt age sex nihss location hx.isch afib dm
## "" "" "" "" "" "" "" ""
## mrankin sbp iv.altep aspects ia.occlus extra.ica time.rand
## "" "pmm" "" "pmm" "polyreg" "logreg" "pmm"
Discuss
"": The variable is complete and therefore not imputed, denoted by the empty string.
pmm: sbp; aspects; time.rand
polyreg: ia.occlus
logreg: extra.ica
When a normal imputation model seems to be appropriate for most of the continuous covariates, you may want to specify norm as the default method in the setup run. The order of the types of variable is: continuous, binary, factor, ordered factor.
imp0 <- mice(df, maxit = 0,
defaultMethod = c("norm", 'logreg', 'polyreg', 'polr'))
Notice that mice has set the imputation method for the binary variable extra.ica to logreg, which implements multiple imputation by logistic regression. An up-to-date overview of the methods in mice can be found by:
methods(mice)
## [1] mice.impute.2l.bin mice.impute.2l.lmer
## [3] mice.impute.2l.norm mice.impute.2l.pan
## [5] mice.impute.2lonly.mean mice.impute.2lonly.norm
## [7] mice.impute.2lonly.pmm mice.impute.cart
## [9] mice.impute.jomoImpute mice.impute.lasso.logreg
## [11] mice.impute.lasso.norm mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm mice.impute.lda
## [15] mice.impute.logreg mice.impute.logreg.boot
## [17] mice.impute.mean mice.impute.midastouch
## [19] mice.impute.mnar.logreg mice.impute.mnar.norm
## [21] mice.impute.norm mice.impute.norm.boot
## [23] mice.impute.norm.nob mice.impute.norm.predict
## [25] mice.impute.panImpute mice.impute.passive
## [27] mice.impute.pmm mice.impute.polr
## [29] mice.impute.polyreg mice.impute.quadratic
## [31] mice.impute.rf mice.impute.ri
## [33] mice.impute.sample mice.mids
## [35] mice.theme
## see '?methods' for accessing help and source code
Let’s change the imputation method for sbp, aspects, time.rand so that it is imputed with predictive mean matching (pmm):
# Turn their methods matrix into the specified imputation models
meth <- imp0$meth
meth[c("sbp", "aspects", "time.rand")] <- "pmm"
meth
## trt age sex nihss location hx.isch afib dm
## "" "" "" "" "" "" "" ""
## mrankin sbp iv.altep aspects ia.occlus extra.ica time.rand
## "" "pmm" "" "pmm" "polyreg" "logreg" "pmm"
Discuss After exploring the data distribution, the methods we specified are the same as the defaults that mice recommended (see the 1st command of the imputation method section: imp0$meth). However, we shouldn't rely completely on the program; we should check the data distribution and make the judgement ourselves.
The Predictor Matrix informs us which variables are going to be used to predict a plausible value for variables (1 means a variable is used to predict another variable, 0 otherwise).
mice gives you complete control over the predictor matrix, enabling you to choose your own predictor relations. This can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about relations in the data at hand. You can use mice() to give you the initial predictor matrix, and change it afterwards, without running the algorithm.
There is a special function called quickpred() for a quick selection procedure of predictors, which can be handy for datasets containing many variables. See ?quickpred for more info. For example, selecting predictors according to data relations with a minimum correlation of ρ = .1 can be done as follows:
ini <- mice(df, pred=quickpred(df, mincor=.1), print=F)
ini$pred
## trt age sex nihss location hx.isch afib dm mrankin sbp iv.altep
## trt 0 0 0 0 0 0 0 0 0 0 0
## age 0 0 0 0 0 0 0 0 0 0 0
## sex 0 0 0 0 0 0 0 0 0 0 0
## nihss 0 0 0 0 0 0 0 0 0 0 0
## location 0 0 0 0 0 0 0 0 0 0 0
## hx.isch 0 0 0 0 0 0 0 0 0 0 0
## afib 0 0 0 0 0 0 0 0 0 0 0
## dm 0 0 0 0 0 0 0 0 0 0 0
## mrankin 0 0 0 0 0 0 0 0 0 0 0
## sbp 0 0 0 0 0 0 0 1 0 0 0
## iv.altep 0 0 0 0 0 0 0 0 0 0 0
## aspects 0 0 0 0 0 0 1 0 0 0 0
## ia.occlus 0 0 0 1 0 0 0 0 0 0 0
## extra.ica 0 0 0 0 0 0 0 0 0 0 0
## time.rand 0 1 0 0 0 0 0 0 0 1 0
## aspects ia.occlus extra.ica time.rand
## trt 0 0 0 0
## age 0 0 0 0
## sex 0 0 0 0
## nihss 0 0 0 0
## location 0 0 0 0
## hx.isch 0 0 0 0
## afib 0 0 0 0
## dm 0 0 0 0
## mrankin 0 0 0 0
## sbp 0 0 0 0
## iv.altep 0 0 0 0
## aspects 0 0 0 0
## ia.occlus 1 0 1 0
## extra.ica 1 1 0 0
## time.rand 0 0 0 0
Let's return to our dataset:
pred <- imp0$predictorMatrix
pred
## trt age sex nihss location hx.isch afib dm mrankin sbp iv.altep
## trt 0 1 1 1 1 1 1 1 1 1 1
## age 1 0 1 1 1 1 1 1 1 1 1
## sex 1 1 0 1 1 1 1 1 1 1 1
## nihss 1 1 1 0 1 1 1 1 1 1 1
## location 1 1 1 1 0 1 1 1 1 1 1
## hx.isch 1 1 1 1 1 0 1 1 1 1 1
## afib 1 1 1 1 1 1 0 1 1 1 1
## dm 1 1 1 1 1 1 1 0 1 1 1
## mrankin 1 1 1 1 1 1 1 1 0 1 1
## sbp 1 1 1 1 1 1 1 1 1 0 1
## iv.altep 1 1 1 1 1 1 1 1 1 1 0
## aspects 1 1 1 1 1 1 1 1 1 1 1
## ia.occlus 1 1 1 1 1 1 1 1 1 1 1
## extra.ica 1 1 1 1 1 1 1 1 1 1 1
## time.rand 1 1 1 1 1 1 1 1 1 1 1
## aspects ia.occlus extra.ica time.rand
## trt 1 1 1 1
## age 1 1 1 1
## sex 1 1 1 1
## nihss 1 1 1 1
## location 1 1 1 1
## hx.isch 1 1 1 1
## afib 1 1 1 1
## dm 1 1 1 1
## mrankin 1 1 1 1
## sbp 1 1 1 1
## iv.altep 1 1 1 1
## aspects 0 1 1 1
## ia.occlus 1 0 1 1
## extra.ica 1 1 0 1
## time.rand 1 1 1 0
# pred[, "hx.isch"] <- 0
# pred["trt", "age"] <- 0 : Row trt - column age = 0: Eliminate age from trt prediction
# pred
What if we wanted to impute variables in a particular order, or use only certain variables to impute the others? Info can be found here:
https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#visit_sequence
Since, given our predictor matrix, we don't need to impute in a particular order, nothing will change in the visit sequence.
visSeq <- imp0$visitSequence
The considerations for choosing the number of imputed datasets (m) and the number of iterations (maxit) were discussed above in the section on the number of imputations (Rubin (1987); Van Buuren (2018)).
Based on the above information, I decide to choose m = 5 and maxit = 10.
More info can be found here:
https://bookdown.org/mwheymans/bookmi/multiple-imputation.html#number-of-imputed-datasets-and-iterations
With the changes that we have made to the predictorMatrix and method, we can now perform the imputation. Use m = 5 and maxit = 10.
m: Number of multiple imputation datasets.
** Set m = 5: Generate 5 imputed datasets.
maxit: A scalar giving the number of iterations.
** Set maxit = 10: 10 iterations for each imputed dataset.
printFlag = F: By default, the mice function prints information about the iteration and imputation steps of the imputed variables under the columns named "iter", "imp" and "variable" respectively. This information can be turned off by setting the mice function parameter printFlag = FALSE, which results in silent computation of the missing values.
imp <- mice(df, method = meth,
predictorMatrix = pred,
visitSequence = visSeq,
maxit = 10, m = 5,
seed = 1234,
printFlag = FALSE)
mice() prints the name of the variable being imputed for each iteration and imputation. If you run mice() on your own computer the output will show up continuously. There, you may notice that imputation is slowest for categorical variables, especially when they have many categories.
Retrieve the imputation results: take the 1st and 2nd results of the multiple imputations:
df_complete1 = complete(imp, action = 1)
df_complete2 = complete(imp, action = 2)
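Alternatively, all m imputed datasets can be retrieved at once in long format; the .imp column indexes the imputation number:
# Stack all 5 imputed datasets; .imp = imputation number, .id = row id
df_long <- complete(imp, action = "long")
table(df_long$.imp)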
Compare to the original dataset:
diffdf(df_complete1,df)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## sbp 1
## aspects 4
## ia.occlus 1
## extra.ica 1
## time.rand 2
## ------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## sbp 131 121 <NA>
## ----------------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## aspects 19 10 <NA>
## aspects 169 9 <NA>
## aspects 201 8 <NA>
## aspects 222 10 <NA>
## ----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## ia.occlus 404 M1 <NA>
## -----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## extra.ica 404 No <NA>
## -----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## time.rand 434 110 <NA>
## time.rand 463 120 <NA>
## -----------------------------------------
Compare the two imputed datasets that we just retrieved:
diffdf(df_complete1,df_complete2)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## sbp 1
## aspects 3
## ia.occlus 1
## extra.ica 1
## time.rand 2
## ------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## sbp 131 121 153
## ----------------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## aspects 19 10 6
## aspects 169 9 8
## aspects 201 8 9
## ----------------------------------------
##
##
## All rows are shown in table below
##
## =============================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ---------------------------------------------
## ia.occlus 404 M1 ICA with M1
## ---------------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## extra.ica 404 No Yes
## -----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## time.rand 434 110 156
## time.rand 463 120 252
## -----------------------------------------
Discuss
The 1st and 2nd imputed datasets differ, as expected: each imputation draws different plausible values for the missing cells.
At this step, you can retrieve your data for further analysis, or continue with complete-data analysis + pooling.
A summary of the imputation results can be obtained by calling the imp object.
imp
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## trt age sex nihss location hx.isch afib dm
## "" "" "" "" "" "" "" ""
## mrankin sbp iv.altep aspects ia.occlus extra.ica time.rand
## "" "pmm" "" "pmm" "polyreg" "logreg" "pmm"
## PredictorMatrix:
## trt age sex nihss location hx.isch afib dm mrankin sbp iv.altep
## trt 0 1 1 1 1 1 1 1 1 1 1
## age 1 0 1 1 1 1 1 1 1 1 1
## sex 1 1 0 1 1 1 1 1 1 1 1
## nihss 1 1 1 0 1 1 1 1 1 1 1
## location 1 1 1 1 0 1 1 1 1 1 1
## hx.isch 1 1 1 1 1 0 1 1 1 1 1
## aspects ia.occlus extra.ica time.rand
## trt 1 1 1 1
## age 1 1 1 1
## sex 1 1 1 1
## nihss 1 1 1 1
## location 1 1 1 1
## hx.isch 1 1 1 1
The original data can be found in:
# imp$data
mice() does not return a data.frame. Find out the class of the object returned by mice() function using the function class(), and take a look at the help file for this class: https://www.rdocumentation.org/packages/mice/versions/2.46.0/topics/mids-class
class(imp)
## [1] "mids"
It is good practice to make sure that mice() has not done any processing of the data that was not planned or that you are not aware of. This means checking the loggedEvents, but also ensuring that the correct method, predictorMatrix and visitSequence were used.
Do these checks for imp.
imp$method
## trt age sex nihss location hx.isch afib dm
## "" "" "" "" "" "" "" ""
## mrankin sbp iv.altep aspects ia.occlus extra.ica time.rand
## "" "pmm" "" "pmm" "polyreg" "logreg" "pmm"
imp$predictorMatrix
## trt age sex nihss location hx.isch afib dm mrankin sbp iv.altep
## trt 0 1 1 1 1 1 1 1 1 1 1
## age 1 0 1 1 1 1 1 1 1 1 1
## sex 1 1 0 1 1 1 1 1 1 1 1
## nihss 1 1 1 0 1 1 1 1 1 1 1
## location 1 1 1 1 0 1 1 1 1 1 1
## hx.isch 1 1 1 1 1 0 1 1 1 1 1
## afib 1 1 1 1 1 1 0 1 1 1 1
## dm 1 1 1 1 1 1 1 0 1 1 1
## mrankin 1 1 1 1 1 1 1 1 0 1 1
## sbp 1 1 1 1 1 1 1 1 1 0 1
## iv.altep 1 1 1 1 1 1 1 1 1 1 0
## aspects 1 1 1 1 1 1 1 1 1 1 1
## ia.occlus 1 1 1 1 1 1 1 1 1 1 1
## extra.ica 1 1 1 1 1 1 1 1 1 1 1
## time.rand 1 1 1 1 1 1 1 1 1 1 1
## aspects ia.occlus extra.ica time.rand
## trt 1 1 1 1
## age 1 1 1 1
## sex 1 1 1 1
## nihss 1 1 1 1
## location 1 1 1 1
## hx.isch 1 1 1 1
## afib 1 1 1 1
## dm 1 1 1 1
## mrankin 1 1 1 1
## sbp 1 1 1 1
## iv.altep 1 1 1 1
## aspects 0 1 1 1
## ia.occlus 1 0 1 1
## extra.ica 1 1 0 1
## time.rand 1 1 1 0
imp$visitSequence
## [1] "trt" "age" "sex" "nihss" "location" "hx.isch"
## [7] "afib" "dm" "mrankin" "sbp" "iv.altep" "aspects"
## [13] "ia.occlus" "extra.ica" "time.rand"
imp$loggedEvents
## NULL
imp$pad$data
## NULL
Discuss
Let’s see what else would have come up if we change
maxit from 10 to 5, and take a look at the loggedEvents of impnaive.
impnaive <- mice(df, method = meth,
predictorMatrix = pred,
visitSequence = visSeq,
maxit = 5, m = 5,
seed = 1234,
printFlag = FALSE)
Extract the 1st imputed dataset:
df_naivecomplete1 = complete(impnaive, action = 1)
Compare it with the 1st imputed dataset from the maxit = 10 run:
diffdf(df_complete1, df_naivecomplete1)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## sbp 1
## aspects 1
## ia.occlus 1
## time.rand 2
## ------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## sbp 131 121 89
## ----------------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## aspects 169 9 10
## ----------------------------------------
##
##
## All rows are shown in table below
##
## =============================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ---------------------------------------------
## ia.occlus 404 M1 ICA with M1
## ---------------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## time.rand 434 110 237
## time.rand 463 120 186
## -----------------------------------------
impnaive$loggedEvents
## NULL
impnaive$pad$categories
## NULL
Discuss
The imputed values are different just from changing the number of
maxit iterations. Therefore, we must choose these settings
carefully.
The logged events are no different: still NULL.
More info can be found here:
https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#run_the_imputation
The mean and variance of the imputed values per iteration and variable are stored in the elements chainMean and chainVar of the mids object.
imp$chainMean
## , , Chain 1
##
## 1 2 3 4 5 6 7 8 9 10
## trt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## age NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sex NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## nihss NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## location NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## hx.isch NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## afib NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## dm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## mrankin NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sbp 166.00 119.0 148.00 139.00 89.0 110.00 127.00 161.0 121.00 121.00
## iv.altep NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## aspects 8.75 9.0 8.25 9.75 9.5 9.25 8.25 9.0 8.75 9.25
## ia.occlus 4.00 3.0 3.00 4.00 2.0 2.00 3.00 3.0 3.00 3.00
## extra.ica 1.00 1.0 1.00 1.00 1.0 1.00 1.00 1.0 1.00 1.00
## time.rand 199.50 161.5 243.00 188.00 211.5 222.00 174.00 244.5 231.50 115.00
##
## , , Chain 2
##
## 1 2 3 4 5 6 7 8 9 10
## trt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## age NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sex NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## nihss NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## location NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## hx.isch NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## afib NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## dm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## mrankin NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sbp 144.00 157.0 158.00 131.0 126.00 130 130.0 127.00 166.00 153.00
## iv.altep NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## aspects 8.75 9.0 9.75 7.5 6.75 8 8.5 8.25 8.75 8.25
## ia.occlus 3.00 2.0 3.00 3.0 2.00 3 2.0 3.00 2.00 2.00
## extra.ica 2.00 1.0 2.00 2.0 1.00 1 2.0 2.00 2.00 2.00
## time.rand 190.50 242.5 213.00 222.5 163.00 190 168.5 183.00 151.50 204.00
##
## , , Chain 3
##
## 1 2 3 4 5 6 7 8 9 10
## trt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## age NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sex NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## nihss NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## location NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## hx.isch NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## afib NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## dm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## mrankin NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sbp 181.00 145.0 168.0 162.00 164.0 167.00 173.00 146.00 167.00 119.00
## iv.altep NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## aspects 8.25 9.0 9.5 9.25 9.5 9.25 8.25 8.75 8.75 8.75
## ia.occlus 3.00 4.0 3.0 3.00 3.0 3.00 2.00 3.00 3.00 3.00
## extra.ica 1.00 1.0 1.0 1.00 1.0 1.00 2.00 2.00 1.00 1.00
## time.rand 201.00 181.5 262.5 179.00 205.0 202.00 126.50 158.00 167.50 162.00
##
## , , Chain 4
##
## 1 2 3 4 5 6 7 8 9 10
## trt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## age NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sex NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## nihss NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## location NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## hx.isch NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## afib NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## dm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## mrankin NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sbp 144 115.00 105 160.00 112.0 121.0 165.0 162.00 116 152.0
## iv.altep NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## aspects 9 8.25 7 7.75 9.0 9.5 7.5 9.75 9 9.5
## ia.occlus 2 3.00 3 2.00 5.0 3.0 3.0 2.00 2 5.0
## extra.ica 1 1.00 1 1.00 2.0 1.0 2.0 2.00 1 1.0
## time.rand 266 245.00 179 173.50 194.5 125.0 172.0 231.50 186 158.0
##
## , , Chain 5
##
## 1 2 3 4 5 6 7 8 9 10
## trt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## age NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sex NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## nihss NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## location NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## hx.isch NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## afib NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## dm NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## mrankin NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## sbp 161.0 128.00 109.00 178.0 121.00 161.0 149.00 105.00 167.00 154.00
## iv.altep NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
## aspects 8.5 7.75 9.25 8.5 8.25 8.5 8.25 8.25 8.75 7.75
## ia.occlus 2.0 3.00 3.00 2.0 2.00 3.0 1.00 3.00 3.00 3.00
## extra.ica 2.0 2.00 2.00 2.0 1.00 2.0 2.00 1.00 1.00 1.00
## time.rand 156.0 200.00 161.50 219.5 159.00 145.5 146.00 227.00 216.50 196.50
Discuss
The number of chains is equal to the number of imputed datasets. A chain refers to the chain of regression models that is used to generate the imputed values. The length of each chain is equal to the number of iterations.
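We can check this structure directly: chainMean is an array with one row per variable, one column per iteration, and one slice per chain, so for our run its dimensions should be 15 x 10 x 5:
dim(imp$chainMean)
## [1] 15 10  5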
The convergence can be visualised by plotting the means in a convergence plot. For our example, the convergence plots are shown below. In these plots, the variance between the imputation chains is almost equal to the variance within the chains, which indicates healthy convergence.
Let’s plot them to see if our imputation imp has converged:
# implemented plotting function (use layout to change the number of rows and columns)
plot(imp, layout = c(4, 3))
# plot for a single variable
matplot(imp$chainMean["sbp", , ], type = "l", ylab = "imputed value",
xlab = "iteration", main = "sbp")
# or the ggplot version
library(ggplot2)
library(reshape2)
ggplot(melt(imp$chainMean), aes(x = Var2, y = value, color = Var3)) +
geom_line() +
facet_wrap("Var1", scales = 'free') +
theme(legend.position = 'none') +
xlab("iteration") +
ylab("imputed value")
Discuss
There are two main things you want to note in a trace plot: the chains should mix well (intermingle freely, with no chain staying consistently above or below the others), and there should be no trend in the later iterations.
=> From what I see, the chains in imp seem to
have converged, but in practice, more iterations should be done to
confirm this.
More info can be found here:
https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#convergence
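If we did want those extra iterations, we would not need to start from scratch: mice.mids() continues the sampler from where an existing mids object stopped. A sketch (not run here):
# append 10 more iterations to the existing imputation, then re-check the trace plots
imp_extended <- mice.mids(imp, maxit = 10, printFlag = FALSE)
plot(imp_extended, layout = c(4, 3))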
It can also be of interest to compare the values that are imputed with those that are observed. For that, the stripplot function in mice can be used. This function visualises the observed and imputed values in one plot. By comparing the observed and the imputed data points we get an idea of whether the imputed values are in the range of the observed data. If there are no large differences between the imputed and observed values, then we can conclude the imputed values are plausible.
stripplot(imp, pch = c(1, 20))
Now that we know that imp has converged, we can compare the distribution of the imputed values against the distribution of the observed values.
Plot the distributions of the imputed variables (continuous and categorical), and make sure the imputed values are realistic (e.g., no height of 2.50m or weight of 10kg for adults). If there are differences in the distributions of observed and imputed values see if those differences can be explained by other variables.
densityplot(imp)
# Create a long dataset with all imputed data sets stacked onto each other
longDF <- complete(imp, "long")
summary(longDF)
## .imp .id trt age sex
## Min. :1 Min. : 1.0 Intervention:1165 Min. :23.00 Female:1040
## 1st Qu.:2 1st Qu.:125.8 Control :1335 1st Qu.:55.00 Male :1460
## Median :3 Median :250.5 Median :65.75
## Mean :3 Mean :250.5 Mean :64.71
## 3rd Qu.:4 3rd Qu.:375.2 3rd Qu.:76.00
## Max. :5 Max. :500.0 Max. :96.00
## nihss location hx.isch afib dm mrankin
## Min. :10.00 Left :1345 No :2230 No :1825 No :2185 0 :2020
## 1st Qu.:14.00 Right:1155 Yes: 270 Yes: 675 Yes: 315 1 : 250
## Median :18.00 2 : 125
## Mean :18.03 > 2: 105
## 3rd Qu.:22.00
## Max. :28.00
## sbp iv.altep aspects ia.occlus extra.ica
## Min. : 78.0 No : 275 Min. : 5.000 Intracranial ICA: 20 No :1774
## 1st Qu.:128.0 Yes:2225 1st Qu.: 7.000 ICA with M1 : 671 Yes: 726
## Median :145.0 Median : 9.000 M1 :1598
## Mean :145.5 Mean : 8.508 M2 : 195
## 3rd Qu.:162.2 3rd Qu.:10.000 A1 or A2 : 16
## Max. :231.0 Max. :10.000
## time.rand
## Min. :100.0
## 1st Qu.:150.8
## Median :201.0
## Mean :208.4
## 3rd Qu.:258.0
## Max. :360.0
# get the summary of the "raw" imputed values
lapply(imp$imp, function(x)
summary(unlist(x))
)
## $trt
## Mode
## logical
##
## $age
## Mode
## logical
##
## $sex
## Mode
## logical
##
## $nihss
## Mode
## logical
##
## $location
## Mode
## logical
##
## $hx.isch
## Mode
## logical
##
## $afib
## Mode
## logical
##
## $dm
## Mode
## logical
##
## $mrankin
## Mode
## logical
##
## $sbp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 119.0 121.0 152.0 139.8 153.0 154.0
##
## $iv.altep
## Mode
## logical
##
## $aspects
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 8.0 9.0 8.7 10.0 10.0
##
## $ia.occlus
## Intracranial ICA ICA with M1 M1 M2
## 0 1 3 0
## A1 or A2
## 1
##
## $extra.ica
## No Yes
## 4 1
##
## $time.rand
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 110.0 122.0 149.0 167.1 190.5 273.0
# Plot for all categorical variables
propplot <- function(x, formula, facet = "wrap", ...) {
library(ggplot2)
cd <- data.frame(mice::complete(x, "long", include = TRUE))
cd$.imp <- factor(cd$.imp)
r <- as.data.frame(is.na(x$data))
impcat <- x$meth != "" & sapply(x$data, is.factor)
vnames <- names(impcat)[impcat]
if (missing(formula)) {
formula <- as.formula(paste(paste(vnames, collapse = "+",
sep = ""), "~1", sep = ""))
}
tmsx <- terms(formula[-3], data = x$data)
xnames <- attr(tmsx, "term.labels")
xnames <- xnames[xnames %in% vnames]
if (paste(formula[3]) != "1") {
wvars <- gsub("[[:space:]]*\\|[[:print:]]*", "", paste(formula)[3])
# wvars <- all.vars(as.formula(paste("~", wvars)))
wvars <- attr(terms(as.formula(paste("~", wvars))), "term.labels")
if (grepl("\\|", formula[3])) {
svars <- gsub("[[:print:]]*\\|[[:space:]]*", "", paste(formula)[3])
svars <- all.vars(as.formula(paste("~", svars)))
} else {
svars <- ".imp"
}
} else {
wvars <- NULL
svars <- ".imp"
}
for (i in seq_along(xnames)) {
xvar <- xnames[i]
select <- cd$.imp != 0 & !r[, xvar]
cd[select, xvar] <- NA
}
for (i in which(!wvars %in% names(cd))) {
cd[, wvars[i]] <- with(cd, eval(parse(text = wvars[i])))
}
meltDF <- reshape2::melt(cd[, c(wvars, svars, xnames)], id.vars = c(wvars, svars))
meltDF <- meltDF[!is.na(meltDF$value), ]
wvars <- if (!is.null(wvars)) paste0("`", wvars, "`")
a <- plyr::ddply(meltDF, c(wvars, svars, "variable", "value"), plyr::summarize,
count = length(value))
b <- plyr::ddply(meltDF, c(wvars, svars, "variable"), plyr::summarize,
tot = length(value))
mdf <- merge(a,b)
mdf$prop <- mdf$count / mdf$tot
plotDF <- merge(unique(meltDF), mdf)
plotDF$value <- factor(plotDF$value,
levels = unique(unlist(lapply(x$data[, xnames], levels))),
ordered = T)
p <- ggplot(plotDF, aes(x = value, fill = get(svars), y = prop)) +
geom_bar(position = "dodge", stat = "identity") +
theme(legend.position = "bottom", ...) +
ylab("proportion") +
scale_fill_manual(name = "",
values = c("black",
colorRampPalette(
RColorBrewer::brewer.pal(9, "Blues"))(x$m + 3)[1:x$m + 3])) +
guides(fill = guide_legend(nrow = 1))
if (facet == "wrap")
if (length(xnames) > 1) {
print(p + facet_wrap(c("variable", wvars), scales = "free"))
} else {
if (is.null(wvars)) {
print(p)
} else {
print(p + facet_wrap(wvars, scales = "free"))
}
}
if (facet == "grid")
if (!is.null(wvars)) {
print(p + facet_grid(paste(paste(wvars, collapse = "+"), "~ variable"),
scales = "free"))
}
}
propplot(imp)
propplot(imp, ia.occlus ~ sex)
propplot(imp, extra.ica ~ sex)
Discuss
Each bar represents the proportion of the data in the category given on
the x-axis in one (imputed) dataset. The black bars show the
distribution of the original, incomplete data. The blue bars show the
corresponding distribution of the imputed values in each of the m
(5, in this case) imputed datasets.
When the blue bars are approximately as high as the black bar, the distribution of the categorical variable in the imputed values is similar to the distribution in the observed data. Note, however, that here the marginal distributions are compared, but the MAR assumption is that the incomplete cases have the same distribution as the observed cases conditional on the other variables.
So if you do see differences this does not necessarily indicate a problem with the imputation. It could be that differences can be explained by covariates used in the imputation model.
# Look into the distribution of aspects
densityplot(imp, ~ aspects)
# Look into the distribution of time.rand
densityplot(imp, ~time.rand)
Discuss
For the categorical variables, I compare the imputed values with the non-NA observed values. Each variable has only one missing value, which makes it difficult to judge the distribution of the imputed values.
The distributions of the imputed values for aspects and time.rand differ from the distributions of the observed data, and these differences could not be explained by adjusting for other variables. This might be due to the small number of NAs in the dataset (<5%).
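With so few imputed values, it can also help to look at each imputed dataset separately instead of overlaying all five; the lattice interface of mice allows conditioning on .imp (a sketch):
# one density panel per imputed dataset
densityplot(imp, ~ time.rand | .imp)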
Brand (1999) was the first to recognize and treat the variable selection problem.
The majority method is identical to step 1 of Brand (1999), whereas the Wald test method is similar to Brand’s step 2, with the likelihood ratio test replaced by the Wald test. The Wald test method is recommended since it is a well-established approach that follows Rubin’s rules, whereas the majority and stack methods fail to take into account the uncertainty caused by the missing data. Indeed, Wood, White, and Royston (2008) found that the Wald test method is the only procedure that preserved the type I error.
In practice, it may be useful to combine methods. The Wald test method is computationally intensive, but is now easily available in mice as the D1() function. A strong point of the majority method is that it gives insight into the variability between the imputed datasets. An advantage of the stack method is that only one dataset needs to be analyzed. The discussion of Wood, White, and Royston (2008) contains additional simulations of a two-step method, in which a preselection made by the majority and stack methods is followed by the Wald test. This yielded a faster method with better theoretical properties. In practice, a judicious combination of approaches might turn out best.
The following steps illustrate what is involved in
implementing a simple majority method to select variables in mice.
https://stefvanbuuren.name/fimd/sec-stepwise.html
Remember our 2 outcome variables: nihss and mrankin.
# Remind imp from step 3
#imp <- mice(df, method = meth,
# predictorMatrix = pred,
# visitSequence = visSeq,
# maxit = 10, m = 5,
# seed = 1234,
# printFlag = FALSE)
# Continue the practical:
scope <- list(upper = ~ trt + age + sex + location + hx.isch + afib + dm +
mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand,
lower = ~1)
expr <- expression(f1 <- lm(nihss ~ 1),
f2 <- step(f1, scope = scope))
fit <- with(imp, expr)
## Start: AIC=1541.79
## nihss ~ 1
##
## Df Sum of Sq RSS AIC
## + dm 1 210.416 10664 1534.0
## + ia.occlus 4 238.098 10637 1538.7
## + location 1 65.168 10810 1540.8
## + time.rand 1 64.332 10810 1540.8
## + hx.isch 1 61.447 10813 1541.0
## <none> 10875 1541.8
## + aspects 1 19.884 10855 1542.9
## + sbp 1 17.363 10857 1543.0
## + age 1 12.399 10862 1543.2
## + sex 1 4.904 10870 1543.6
## + afib 1 1.847 10873 1543.7
## + trt 1 1.588 10873 1543.7
## + extra.ica 1 0.048 10875 1543.8
## + mrankin 3 61.940 10813 1544.9
##
## Step: AIC=1534.02
## nihss ~ dm
##
## Df Sum of Sq RSS AIC
## + ia.occlus 4 258.528 10406 1529.8
## + location 1 70.677 10594 1532.7
## + time.rand 1 60.505 10604 1533.2
## + hx.isch 1 56.310 10608 1533.4
## <none> 10664 1534.0
## + aspects 1 19.511 10645 1535.1
## + sbp 1 17.221 10647 1535.2
## + age 1 9.019 10655 1535.6
## + sex 1 5.901 10658 1535.8
## + afib 1 3.067 10661 1535.9
## + trt 1 1.750 10662 1535.9
## + extra.ica 1 0.028 10664 1536.0
## + mrankin 3 48.829 10615 1537.7
## - dm 1 210.416 10875 1541.8
##
## Step: AIC=1529.75
## nihss ~ dm + ia.occlus
##
## Df Sum of Sq RSS AIC
## + time.rand 1 63.336 10342 1528.7
## + location 1 62.837 10343 1528.7
## + hx.isch 1 49.124 10357 1529.4
## <none> 10406 1529.8
## + extra.ica 1 19.037 10387 1530.8
## + aspects 1 12.747 10393 1531.1
## + sex 1 11.937 10394 1531.2
## + age 1 11.624 10394 1531.2
## + sbp 1 10.657 10395 1531.2
## + trt 1 3.224 10402 1531.6
## + afib 1 1.900 10404 1531.7
## + mrankin 3 42.795 10363 1533.7
## - ia.occlus 4 258.528 10664 1534.0
## - dm 1 230.847 10637 1538.7
##
## Step: AIC=1528.7
## nihss ~ dm + ia.occlus + time.rand
##
## Df Sum of Sq RSS AIC
## + hx.isch 1 58.908 10284 1527.8
## + location 1 53.257 10289 1528.1
## <none> 10342 1528.7
## + extra.ica 1 21.765 10321 1529.7
## - time.rand 1 63.336 10406 1529.8
## + sex 1 15.833 10326 1529.9
## + age 1 13.226 10329 1530.1
## + aspects 1 12.281 10330 1530.1
## + sbp 1 8.765 10334 1530.3
## + trt 1 6.519 10336 1530.4
## + afib 1 2.618 10340 1530.6
## + mrankin 3 48.093 10294 1532.4
## - ia.occlus 4 261.358 10604 1533.2
## - dm 1 227.285 10570 1537.6
##
## Step: AIC=1527.84
## nihss ~ dm + ia.occlus + time.rand + hx.isch
##
## Df Sum of Sq RSS AIC
## + location 1 58.917 10225 1527.0
## <none> 10284 1527.8
## - hx.isch 1 58.908 10342 1528.7
## + extra.ica 1 20.482 10263 1528.8
## + sex 1 16.405 10267 1529.0
## + aspects 1 11.116 10272 1529.3
## + age 1 10.933 10272 1529.3
## - time.rand 1 73.120 10357 1529.4
## + sbp 1 8.314 10275 1529.4
## + trt 1 4.993 10278 1529.6
## + afib 1 2.842 10281 1529.7
## + mrankin 3 44.140 10239 1531.7
## - ia.occlus 4 254.059 10538 1532.0
## - dm 1 221.371 10505 1536.5
##
## Step: AIC=1526.97
## nihss ~ dm + ia.occlus + time.rand + hx.isch + location
##
## Df Sum of Sq RSS AIC
## <none> 10225 1527.0
## - location 1 58.917 10284 1527.8
## + extra.ica 1 19.349 10205 1528.0
## - time.rand 1 62.805 10287 1528.0
## - hx.isch 1 64.568 10289 1528.1
## + sex 1 17.052 10208 1528.1
## + aspects 1 12.628 10212 1528.3
## + age 1 11.285 10213 1528.4
## + sbp 1 9.476 10215 1528.5
## + trt 1 7.537 10217 1528.6
## + afib 1 3.135 10221 1528.8
## + mrankin 3 52.798 10172 1530.4
## - ia.occlus 4 245.952 10470 1530.8
## - dm 1 225.674 10450 1535.9
## Start: AIC=1541.79
## nihss ~ 1
##
## Df Sum of Sq RSS AIC
## + dm 1 210.416 10664 1534.0
## + ia.occlus 4 229.750 10645 1539.1
## + time.rand 1 65.740 10809 1540.8
## + location 1 65.168 10810 1540.8
## + hx.isch 1 61.447 10813 1541.0
## <none> 10875 1541.8
## + aspects 1 22.162 10852 1542.8
## + sbp 1 16.007 10859 1543.0
## + age 1 12.399 10862 1543.2
## + sex 1 4.904 10870 1543.6
## + afib 1 1.847 10873 1543.7
## + trt 1 1.588 10873 1543.7
## + extra.ica 1 0.372 10874 1543.8
## + mrankin 3 61.940 10813 1544.9
##
## Step: AIC=1534.02
## nihss ~ dm
##
## Df Sum of Sq RSS AIC
## + ia.occlus 4 250.203 10414 1530.2
## + location 1 70.677 10594 1532.7
## + time.rand 1 62.333 10602 1533.1
## + hx.isch 1 56.310 10608 1533.4
## <none> 10664 1534.0
## + aspects 1 21.506 10643 1535.0
## + sbp 1 15.104 10649 1535.3
## + age 1 9.019 10655 1535.6
## + sex 1 5.901 10658 1535.8
## + afib 1 3.067 10661 1535.9
## + trt 1 1.750 10662 1535.9
## + extra.ica 1 0.285 10664 1536.0
## + mrankin 3 48.829 10615 1537.7
## - dm 1 210.416 10875 1541.8
##
## Step: AIC=1530.15
## nihss ~ dm + ia.occlus
##
## Df Sum of Sq RSS AIC
## + time.rand 1 66.610 10347 1528.9
## + location 1 62.376 10352 1529.2
## + hx.isch 1 49.526 10364 1529.8
## <none> 10414 1530.2
## + extra.ica 1 21.903 10392 1531.1
## + aspects 1 14.026 10400 1531.5
## + sex 1 11.438 10403 1531.6
## + age 1 11.158 10403 1531.6
## + sbp 1 9.039 10405 1531.7
## + trt 1 3.324 10411 1532.0
## + afib 1 1.865 10412 1532.1
## - ia.occlus 4 250.203 10664 1534.0
## + mrankin 3 42.544 10372 1534.1
## - dm 1 230.869 10645 1539.1
##
## Step: AIC=1528.94
## nihss ~ dm + ia.occlus + time.rand
##
## Df Sum of Sq RSS AIC
## + hx.isch 1 59.855 10288 1528.0
## + location 1 53.239 10294 1528.4
## <none> 10347 1528.9
## + extra.ica 1 24.417 10323 1529.8
## - time.rand 1 66.610 10414 1530.2
## + sex 1 14.979 10332 1530.2
## + aspects 1 13.379 10334 1530.3
## + age 1 12.272 10335 1530.3
## + sbp 1 7.454 10340 1530.6
## + trt 1 6.519 10341 1530.6
## + afib 1 2.508 10345 1530.8
## + mrankin 3 47.978 10300 1532.6
## - ia.occlus 4 254.480 10602 1533.1
## - dm 1 227.776 10575 1537.8
##
## Step: AIC=1528.04
## nihss ~ dm + ia.occlus + time.rand + hx.isch
##
## Df Sum of Sq RSS AIC
## + location 1 58.993 10229 1527.2
## <none> 10288 1528.0
## + extra.ica 1 22.922 10265 1528.9
## - hx.isch 1 59.855 10347 1528.9
## + sex 1 15.532 10272 1529.3
## + aspects 1 12.065 10276 1529.5
## + age 1 10.031 10278 1529.5
## + sbp 1 7.090 10280 1529.7
## - time.rand 1 76.940 10364 1529.8
## + trt 1 4.970 10283 1529.8
## + afib 1 2.724 10285 1529.9
## + mrankin 3 44.024 10244 1531.9
## - ia.occlus 4 247.685 10535 1531.9
## - dm 1 221.850 10509 1536.7
##
## Step: AIC=1527.17
## nihss ~ dm + ia.occlus + time.rand + hx.isch + location
##
## Df Sum of Sq RSS AIC
## <none> 10229 1527.2
## - location 1 58.993 10288 1528.0
## + extra.ica 1 21.979 10207 1528.1
## - hx.isch 1 65.609 10294 1528.4
## + sex 1 16.212 10212 1528.4
## - time.rand 1 67.151 10296 1528.4
## + aspects 1 14.124 10214 1528.5
## + age 1 10.392 10218 1528.7
## + sbp 1 8.041 10220 1528.8
## + trt 1 7.538 10221 1528.8
## + afib 1 3.029 10226 1529.0
## + mrankin 3 52.682 10176 1530.6
## - ia.occlus 4 239.059 10468 1530.7
## - dm 1 226.059 10455 1536.1
## Start: AIC=1541.79
## nihss ~ 1
##
## Df Sum of Sq RSS AIC
## + dm 1 210.416 10664 1534.0
## + ia.occlus 4 238.098 10637 1538.7
## + location 1 65.168 10810 1540.8
## + time.rand 1 63.829 10811 1540.8
## + hx.isch 1 61.447 10813 1541.0
## <none> 10875 1541.8
## + aspects 1 19.327 10855 1542.9
## + sbp 1 17.446 10857 1543.0
## + age 1 12.399 10862 1543.2
## + sex 1 4.904 10870 1543.6
## + afib 1 1.847 10873 1543.7
## + trt 1 1.588 10873 1543.7
## + extra.ica 1 0.048 10875 1543.8
## + mrankin 3 61.940 10813 1544.9
##
## Step: AIC=1534.02
## nihss ~ dm
##
## Df Sum of Sq RSS AIC
## + ia.occlus 4 258.528 10406 1529.8
## + location 1 70.677 10594 1532.7
## + time.rand 1 60.254 10604 1533.2
## + hx.isch 1 56.310 10608 1533.4
## <none> 10664 1534.0
## + aspects 1 18.837 10645 1535.1
## + sbp 1 17.354 10647 1535.2
## + age 1 9.019 10655 1535.6
## + sex 1 5.901 10658 1535.8
## + afib 1 3.067 10661 1535.9
## + trt 1 1.750 10662 1535.9
## + extra.ica 1 0.028 10664 1536.0
## + mrankin 3 48.829 10615 1537.7
## - dm 1 210.416 10875 1541.8
##
## Step: AIC=1529.75
## nihss ~ dm + ia.occlus
##
## Df Sum of Sq RSS AIC
## + time.rand 1 63.310 10342 1528.7
## + location 1 62.837 10343 1528.7
## + hx.isch 1 49.124 10357 1529.4
## <none> 10406 1529.8
## + extra.ica 1 19.037 10387 1530.8
## + sex 1 11.937 10394 1531.2
## + age 1 11.624 10394 1531.2
## + aspects 1 11.164 10395 1531.2
## + sbp 1 10.760 10395 1531.2
## + trt 1 3.224 10402 1531.6
## + afib 1 1.900 10404 1531.7
## + mrankin 3 42.795 10363 1533.7
## - ia.occlus 4 258.528 10664 1534.0
## - dm 1 230.847 10637 1538.7
##
## Step: AIC=1528.7
## nihss ~ dm + ia.occlus + time.rand
##
## Df Sum of Sq RSS AIC
## + hx.isch 1 59.061 10283 1527.8
## + location 1 53.580 10289 1528.1
## <none> 10342 1528.7
## + extra.ica 1 21.655 10321 1529.7
## - time.rand 1 63.310 10406 1529.8
## + sex 1 15.622 10327 1529.9
## + age 1 12.936 10330 1530.1
## + aspects 1 10.679 10332 1530.2
## + sbp 1 9.061 10333 1530.3
## + trt 1 6.401 10336 1530.4
## + afib 1 2.572 10340 1530.6
## + mrankin 3 48.094 10294 1532.4
## - ia.occlus 4 261.584 10604 1533.2
## - dm 1 227.562 10570 1537.6
##
## Step: AIC=1527.84
## nihss ~ dm + ia.occlus + time.rand + hx.isch
##
## Df Sum of Sq RSS AIC
## + location 1 59.282 10224 1527.0
## <none> 10283 1527.8
## - hx.isch 1 59.061 10342 1528.7
## + extra.ica 1 20.368 10263 1528.8
## + sex 1 16.179 10267 1529.0
## + age 1 10.649 10273 1529.3
## + aspects 1 9.575 10274 1529.4
## - time.rand 1 73.247 10357 1529.4
## + sbp 1 8.613 10275 1529.4
## + trt 1 4.884 10278 1529.6
## + afib 1 2.792 10281 1529.7
## + mrankin 3 44.142 10239 1531.7
## - ia.occlus 4 254.296 10538 1532.0
## - dm 1 221.654 10505 1536.5
##
## Step: AIC=1526.95
## nihss ~ dm + ia.occlus + time.rand + hx.isch + location
##
## Df Sum of Sq RSS AIC
## <none> 10224 1527.0
## - location 1 59.282 10283 1527.8
## + extra.ica 1 19.250 10205 1528.0
## - time.rand 1 63.297 10287 1528.0
## - hx.isch 1 64.763 10289 1528.1
## + sex 1 16.852 10207 1528.1
## + aspects 1 11.435 10213 1528.4
## + age 1 11.020 10213 1528.4
## + sbp 1 9.784 10214 1528.5
## + trt 1 7.431 10217 1528.6
## + afib 1 3.090 10221 1528.8
## + mrankin 3 52.833 10171 1530.4
## - ia.occlus 4 246.159 10470 1530.8
## - dm 1 225.943 10450 1535.9
## Start: AIC=1541.79
## nihss ~ 1
##
## Df Sum of Sq RSS AIC
## + dm 1 210.416 10664 1534.0
## + ia.occlus 4 252.239 10622 1538.1
## + location 1 65.168 10810 1540.8
## + time.rand 1 64.967 10810 1540.8
## + hx.isch 1 61.447 10813 1541.0
## <none> 10875 1541.8
## + aspects 1 19.127 10856 1542.9
## + sbp 1 16.050 10859 1543.0
## + age 1 12.399 10862 1543.2
## + sex 1 4.904 10870 1543.6
## + afib 1 1.847 10873 1543.7
## + trt 1 1.588 10873 1543.7
## + extra.ica 1 0.048 10875 1543.8
## + mrankin 3 61.940 10813 1544.9
##
## Step: AIC=1534.02
## nihss ~ dm
##
## Df Sum of Sq RSS AIC
## + ia.occlus 4 270.710 10394 1529.2
## + location 1 70.677 10594 1532.7
## + time.rand 1 61.338 10603 1533.1
## + hx.isch 1 56.310 10608 1533.4
## <none> 10664 1534.0
## + aspects 1 18.823 10645 1535.1
## + sbp 1 15.169 10649 1535.3
## + age 1 9.019 10655 1535.6
## + sex 1 5.901 10658 1535.8
## + afib 1 3.067 10661 1535.9
## + trt 1 1.750 10662 1535.9
## + extra.ica 1 0.028 10664 1536.0
## + mrankin 3 48.829 10615 1537.7
## - dm 1 210.416 10875 1541.8
##
## Step: AIC=1529.17
## nihss ~ dm + ia.occlus
##
## Df Sum of Sq RSS AIC
## + location 1 65.341 10328 1528.0
## + time.rand 1 62.302 10331 1528.2
## + hx.isch 1 48.404 10345 1528.8
## <none> 10394 1529.2
## + extra.ica 1 19.775 10374 1530.2
## + sex 1 13.305 10380 1530.5
## + age 1 12.300 10381 1530.6
## + aspects 1 9.513 10384 1530.7
## + sbp 1 9.412 10384 1530.7
## + trt 1 2.680 10391 1531.0
## + afib 1 2.204 10391 1531.1
## + mrankin 3 42.764 10351 1533.1
## - ia.occlus 4 270.710 10664 1534.0
## - dm 1 228.888 10622 1538.1
##
## Step: AIC=1528.01
## nihss ~ dm + ia.occlus + location
##
## Df Sum of Sq RSS AIC
## + hx.isch 1 54.804 10273 1527.3
## + time.rand 1 52.827 10275 1527.5
## <none> 10328 1528.0
## + extra.ica 1 18.952 10309 1529.1
## - location 1 65.341 10394 1529.2
## + sex 1 14.304 10314 1529.3
## + age 1 12.932 10315 1529.4
## + aspects 1 10.937 10317 1529.5
## + sbp 1 10.490 10318 1529.5
## + trt 1 5.017 10323 1529.8
## + afib 1 2.539 10326 1529.9
## + mrankin 3 52.351 10276 1531.5
## - ia.occlus 4 265.374 10594 1532.7
## - dm 1 233.309 10562 1537.2
##
## Step: AIC=1527.35
## nihss ~ dm + ia.occlus + location + hx.isch
##
## Df Sum of Sq RSS AIC
## + time.rand 1 61.969 10211 1526.3
## <none> 10273 1527.3
## - hx.isch 1 54.804 10328 1528.0
## + extra.ica 1 17.553 10256 1528.5
## + sex 1 14.536 10259 1528.6
## - location 1 71.742 10345 1528.8
## + age 1 10.651 10263 1528.8
## + sbp 1 10.250 10263 1528.8
## + aspects 1 10.075 10263 1528.9
## + trt 1 3.653 10270 1529.2
## + afib 1 2.700 10271 1529.2
## + mrankin 3 48.021 10225 1531.0
## - ia.occlus 4 256.964 10530 1531.7
## - dm 1 228.101 10502 1536.3
##
## Step: AIC=1526.33
## nihss ~ dm + ia.occlus + location + hx.isch + time.rand
##
## Df Sum of Sq RSS AIC
## <none> 10211 1526.3
## - location 1 61.559 10273 1527.3
## + extra.ica 1 19.952 10192 1527.3
## - time.rand 1 61.969 10273 1527.3
## + sex 1 18.448 10193 1527.4
## - hx.isch 1 63.946 10275 1527.5
## + age 1 11.728 10200 1527.8
## + aspects 1 9.455 10202 1527.9
## + sbp 1 8.246 10203 1527.9
## + trt 1 6.645 10205 1528.0
## + afib 1 3.474 10208 1528.2
## + mrankin 3 53.017 10158 1529.7
## - ia.occlus 4 257.759 10469 1530.8
## - dm 1 224.191 10436 1535.2
## Start: AIC=1541.79
## nihss ~ 1
##
## Df Sum of Sq RSS AIC
## + dm 1 210.416 10664 1534.0
## + ia.occlus 4 238.098 10637 1538.7
## + time.rand 1 66.144 10808 1540.7
## + location 1 65.168 10810 1540.8
## + hx.isch 1 61.447 10813 1541.0
## <none> 10875 1541.8
## + aspects 1 20.059 10855 1542.9
## + sbp 1 15.963 10859 1543.1
## + age 1 12.399 10862 1543.2
## + sex 1 4.904 10870 1543.6
## + afib 1 1.847 10873 1543.7
## + trt 1 1.588 10873 1543.7
## + extra.ica 1 0.048 10875 1543.8
## + mrankin 3 61.940 10813 1544.9
##
## Step: AIC=1534.02
## nihss ~ dm
##
## Df Sum of Sq RSS AIC
## + ia.occlus 4 258.528 10406 1529.8
## + location 1 70.677 10594 1532.7
## + time.rand 1 62.693 10602 1533.1
## + hx.isch 1 56.310 10608 1533.4
## <none> 10664 1534.0
## + aspects 1 19.315 10645 1535.1
## + sbp 1 15.038 10649 1535.3
## + age 1 9.019 10655 1535.6
## + sex 1 5.901 10658 1535.8
## + afib 1 3.067 10661 1535.9
## + trt 1 1.750 10662 1535.9
## + extra.ica 1 0.028 10664 1536.0
## + mrankin 3 48.829 10615 1537.7
## - dm 1 210.416 10875 1541.8
##
## Step: AIC=1529.75
## nihss ~ dm + ia.occlus
##
## Df Sum of Sq RSS AIC
## + time.rand 1 65.982 10340 1528.6
## + location 1 62.837 10343 1528.7
## + hx.isch 1 49.124 10357 1529.4
## <none> 10406 1529.8
## + extra.ica 1 19.037 10387 1530.8
## + sex 1 11.937 10394 1531.2
## + age 1 11.624 10394 1531.2
## + aspects 1 11.225 10394 1531.2
## + sbp 1 8.989 10397 1531.3
## + trt 1 3.224 10402 1531.6
## + afib 1 1.900 10404 1531.7
## + mrankin 3 42.795 10363 1533.7
## - ia.occlus 4 258.528 10664 1534.0
## - dm 1 230.847 10637 1538.7
##
## Step: AIC=1528.57
## nihss ~ dm + ia.occlus + time.rand
##
## Df Sum of Sq RSS AIC
## + hx.isch 1 59.352 10280 1527.7
## + location 1 53.657 10286 1528.0
## <none> 10340 1528.6
## + extra.ica 1 21.620 10318 1529.5
## - time.rand 1 65.982 10406 1529.8
## + sex 1 15.534 10324 1529.8
## + age 1 12.775 10327 1530.0
## + aspects 1 11.292 10328 1530.0
## + sbp 1 7.401 10332 1530.2
## + trt 1 6.378 10333 1530.3
## + afib 1 2.551 10337 1530.5
## + mrankin 3 48.194 10292 1532.2
## - ia.occlus 4 261.818 10602 1533.1
## - dm 1 227.712 10567 1537.5
##
## Step: AIC=1527.69
## nihss ~ dm + ia.occlus + time.rand + hx.isch
##
## Df Sum of Sq RSS AIC
## + location 1 59.396 10221 1526.8
## <none> 10280 1527.7
## - hx.isch 1 59.352 10340 1528.6
## + extra.ica 1 20.326 10260 1528.7
## + sex 1 16.080 10264 1528.9
## + age 1 10.485 10270 1529.2
## + aspects 1 10.124 10270 1529.2
## + sbp 1 7.040 10273 1529.3
## - time.rand 1 76.210 10357 1529.4
## + trt 1 4.856 10276 1529.5
## + afib 1 2.768 10278 1529.6
## + mrankin 3 44.237 10236 1531.5
## - ia.occlus 4 254.531 10535 1531.9
## - dm 1 221.803 10502 1536.4
##
## Step: AIC=1526.8
## nihss ~ dm + ia.occlus + time.rand + hx.isch + location
##
## Df Sum of Sq RSS AIC
## <none> 10221 1526.8
## - location 1 59.396 10280 1527.7
## + extra.ica 1 19.220 10202 1527.8
## - hx.isch 1 65.090 10286 1528.0
## + sex 1 16.773 10204 1528.0
## - time.rand 1 66.374 10287 1528.0
## + aspects 1 12.251 10209 1528.2
## + age 1 10.868 10210 1528.3
## + sbp 1 7.981 10213 1528.4
## + trt 1 7.415 10214 1528.4
## + afib 1 3.070 10218 1528.7
## + mrankin 3 52.956 10168 1530.2
## - ia.occlus 4 246.383 10467 1530.7
## - dm 1 226.075 10447 1535.7
Discuss
This code fits a stepwise linear model predicting nihss
separately to each of the m = 5 imputed datasets in
imp. The following code block counts how many
times each variable was selected.
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)
## votes
## dm hx.isch ia.occlus location time.rand
## 5 5 5 5 5
Discuss
The lapply() function is used three times. The first statement extracts the model formulas fitted to the m imputed datasets. The second lapply() call decomposes the model formulas into pieces, and the third call extracts the names of the variables included in all m models. The table() function counts the number of times that each variable appears across the 5 replications.
Variables dm, hx.isch, ia.occlus, location and
time.rand are always included, whereas the other variables
were never selected, not even once.
If any variable appears in more than 50% of the models, we can use the Wald test to determine whether it should be in the final model.
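Continuing from the votes table above, a minimal sketch of that majority rule:
# variables retained in more than 50% of the m fitted models
majority <- names(which(table(votes) > length(fit$analyses) / 2))
majority  # expected: dm, hx.isch, ia.occlus, location, time.rand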
Let’s look at the following 3 examples:
Example 1: with or without age
fit.without <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + location + time.rand))
fit.with <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + location + time.rand + age))
D1(fit.with, fit.without) # the model with more variables goes before the model with fewer variables
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 0.5302907 1 4 490 0.5068257 0.0003151527
D3(fit.with, fit.without)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 0.5412293 1 103837222881 490 0.4619237 -6.206557e-06
Discuss
Both D1 and D3 give p-values > 0.05 => age is not needed in the model. If we go one step further and test the removal of dm, we obtain:
Example 2: with or without dm
fit.without <- with(imp, lm(nihss ~ hx.isch + ia.occlus + location + time.rand))
fit.with <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + location + time.rand))
D1(fit.with, fit.without)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 10.83548 1 4 491 0.03016651 3.83061e-05
D3(fit.with, fit.without)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 10.91439 1 28548643783 491 0.0009541989 1.183702e-05
Example 3: with or without location
fit.without <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + time.rand ))
fit.with <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + location + time.rand))
D1(fit.with, fit.without)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 2.863189 1 4 491 0.1658847 0.0003026752
D3(fit.with, fit.without)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 2.907485 1 79355075 491 0.08816928 0.000224564
Discuss
The result is a little bit strange. According to https://stefvanbuuren.name/fimd/, based on stepwise model selection, the final nihss model should contain dm, hx.isch, ia.occlus, location and time.rand.
However, when I apply the Wald test to hx.isch, location and time.rand (as done for location in Example 3), the p-values are > 0.05, suggesting that those 3 variables are not needed; the final nihss model would then contain only dm and ia.occlus. So what happened here?
The main danger of data-driven model building strategies is that the model found may depend highly on the sample at hand. For example, Viallefont, Raftery, and Richardson (2001) showed that of the variables declared to be “significant” with p -values between 0.01 and 0.05 by stepwise variable selection, only 49% actually were true risk factors.
Various solutions have been proposed to counter such model optimism. A popular procedure is bootstrapping the model as developed in Sauerbrei and Schumacher (1992) and Harrell (2001). The method randomly draws multiple samples with replacement from the observed sample, thus mimicking the sampling variation in the population from which the sample was drawn. Stepwise regression analyses are replayed in each bootstrap sample. The proportion of times that each prognostic variable is retained in the stepwise regression model is known as the inclusion frequency (Sauerbrei and Schumacher 1992). This proportion provides information about the strength of the evidence that an indicator is an independent predictor. In addition, each bootstrap model can be fitted to the original sample. The difference between the apparent performance and the bootstrap performance provides the basis for performance measures that correct for model optimism.
Clearly, the presence of missing data adds uncertainty to the model building process, so optimism can be expected to be more severe with missing data. It is not yet clear what the best way is to estimate optimism from incomplete data. Heymans et al. (2007) explored the combination of multiple imputation and the bootstrap. A drawback of Heymans’s method is the use of classic stepwise variable selection techniques, which do not generalize well to high-dimensional data. There appear to be at least four general procedures; two of them are described below:
Long and Johnson (2015) developed a procedure, called bootstrap imputation and stability selection (BI-SS), that generates bootstrap samples from the original data, imputes each bootstrap sample by single imputation, obtains the randomized LASSO estimate from each sample, and then selects the active set according to majority. The multiple imputation random LASSO (MIRL) method by Liu et al. (2016) first performs multiple imputation, obtains bootstrap samples from each imputed dataset, estimates regression weights under LASSO, and then selects the active set by majority. It is not yet known how BI-SS and MIRL compare to each other.
When we are confident that the imputation was successful, the imputed data can be analysed with any complete-data method.
Choose any analysis model of interest and fit it on the imputed data. The model could for instance be a linear regression model (using lm) or a generalized linear regression model, e.g., logistic regression (using glm()). Find out the class and structure of the resulting object.
imp is not a data.frame but an object of class
mids, so we cannot use the normal model specification in
which we set the data argument to a data.frame. Instead,
we use with(imp, ...):
# for example:
models1 <- with(imp, lm(nihss ~ trt + age + sex + location + hx.isch + afib + dm +
mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand))
models2 <- with(imp, lm(nihss ~ dm + ia.occlus))
models3 <- with(imp, glm(mrankin ~ trt + age + sex + location + hx.isch + afib + dm +
nihss + sbp + aspects + ia.occlus + extra.ica + time.rand,
family = "binomial"))
# to determine the class
class(models1)
## [1] "mira" "matrix"
# to determine the structure
names(models1)
## [1] "call" "call1" "nmis" "analyses"
lapply(models1, class)
## $call
## [1] "call"
##
## $call1
## [1] "call"
##
## $nmis
## [1] "integer"
##
## $analyses
## [1] "list"
# explore further
models1$analyses
## [[1]]
##
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib +
## dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
##
## Coefficients:
## (Intercept) trtControl age
## 17.252363 0.336268 -0.008557
## sexMale locationRight hx.ischYes
## 0.314477 0.781461 -1.073092
## afibYes dmYes mrankin1
## 0.081253 -1.960446 -0.221030
## mrankin2 mrankin> 2 sbp
## -0.361164 1.439753 -0.005320
## aspects ia.occlusICA with M1 ia.occlusM1
## -0.114684 2.961485 4.091233
## ia.occlusM2 ia.occlusA1 or A2 extra.icaYes
## 4.889223 8.060141 0.411613
## time.rand
## -0.006184
##
##
## [[2]]
##
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib +
## dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
##
## Coefficients:
## (Intercept) trtControl age
## 17.206006 0.336380 -0.008177
## sexMale locationRight hx.ischYes
## 0.303715 0.785275 -1.080764
## afibYes dmYes mrankin1
## 0.073071 -1.959072 -0.221581
## mrankin2 mrankin> 2 sbp
## -0.357597 1.439351 -0.004910
## aspects ia.occlusICA with M1 ia.occlusM1
## -0.118385 3.010038 4.107655
## ia.occlusM2 ia.occlusA1 or A2 extra.icaYes
## 4.907026 8.075932 0.440491
## time.rand
## -0.006325
##
##
## [[3]]
##
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib +
## dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
##
## Coefficients:
## (Intercept) trtControl age
## 17.214181 0.330692 -0.008417
## sexMale locationRight hx.ischYes
## 0.311406 0.785978 -1.076008
## afibYes dmYes mrankin1
## 0.079610 -1.961131 -0.223412
## mrankin2 mrankin> 2 sbp
## -0.362819 1.439835 -0.005404
## aspects ia.occlusICA with M1 ia.occlusM1
## -0.108330 2.956945 4.085079
## ia.occlusM2 ia.occlusA1 or A2 extra.icaYes
## 4.873101 8.067876 0.408500
## time.rand
## -0.006188
##
##
## [[4]]
##
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib +
## dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
##
## Coefficients:
## (Intercept) trtControl age
## 17.051717 0.317570 -0.008639
## sexMale locationRight hx.ischYes
## 0.328283 0.795258 -1.070596
## afibYes dmYes mrankin1
## 0.093718 -1.953017 -0.218021
## mrankin2 mrankin> 2 sbp
## -0.339789 1.441107 -0.004962
## aspects ia.occlusICA with M1 ia.occlusM1
## -0.099336 2.960153 4.083512
## ia.occlusM2 ia.occlusA1 or A2 extra.icaYes
## 4.891359 7.930349 0.418464
## time.rand
## -0.006133
##
##
## [[5]]
##
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib +
## dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
##
## Coefficients:
## (Intercept) trtControl age
## 17.161027 0.330387 -0.008370
## sexMale locationRight hx.ischYes
## 0.307164 0.787312 -1.079104
## afibYes dmYes mrankin1
## 0.079369 -1.957972 -0.223150
## mrankin2 mrankin> 2 sbp
## -0.363033 1.444367 -0.004875
## aspects ia.occlusICA with M1 ia.occlusM1
## -0.109658 2.971916 4.099981
## ia.occlusM2 ia.occlusA1 or A2 extra.icaYes
## 4.887867 8.070109 0.407125
## time.rand
## -0.006321
Let’s look at the summaries of the 3 models:
summary(models1$analyses[[1]])
##
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib +
## dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1563 -3.7547 -0.5499 3.6602 11.5062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.252363 3.104956 5.556 4.57e-08 ***
## trtControl 0.336268 0.419092 0.802 0.42273
## age -0.008557 0.012240 -0.699 0.48485
## sexMale 0.314477 0.420237 0.748 0.45462
## locationRight 0.781461 0.417205 1.873 0.06166 .
## hx.ischYes -1.073092 0.668206 -1.606 0.10895
## afibYes 0.081253 0.467415 0.174 0.86207
## dmYes -1.960446 0.623112 -3.146 0.00176 **
## mrankin1 -0.221030 0.698761 -0.316 0.75190
## mrankin2 -0.361164 0.980914 -0.368 0.71289
## mrankin> 2 1.439753 1.045147 1.378 0.16898
## sbp -0.005320 0.008239 -0.646 0.51881
## aspects -0.114684 0.134705 -0.851 0.39499
## ia.occlusICA with M1 2.961485 2.374193 1.247 0.21287
## ia.occlusM1 4.091233 2.352406 1.739 0.08264 .
## ia.occlusM2 4.889223 2.470246 1.979 0.04836 *
## ia.occlusA1 or A2 8.060141 3.558860 2.265 0.02397 *
## extra.icaYes 0.411613 0.475725 0.865 0.38734
## time.rand -0.006184 0.003226 -1.917 0.05588 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.582 on 481 degrees of freedom
## Multiple R-squared: 0.07152, Adjusted R-squared: 0.03677
## F-statistic: 2.058 on 18 and 481 DF, p-value: 0.006495
summary(models2$analyses[[1]])
##
## Call:
## lm(formula = nihss ~ dm + ia.occlus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5038 -3.5038 -0.5038 3.5504 10.5542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.7500 2.2948 6.428 3.06e-10 ***
## dmYes -2.0530 0.6202 -3.310 0.0010 ***
## ia.occlusICA with M1 2.6958 2.3297 1.157 0.2478
## ia.occlusM1 3.7538 2.3105 1.625 0.1049
## ia.occlusM2 4.6698 2.4122 1.936 0.0534 .
## ia.occlusA1 or A2 7.5833 3.5053 2.163 0.0310 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.59 on 494 degrees of freedom
## Multiple R-squared: 0.04312, Adjusted R-squared: 0.03344
## F-statistic: 4.453 on 5 and 494 DF, p-value: 0.0005612
summary(models3$analyses[[1]])
##
## Call:
## glm(formula = mrankin ~ trt + age + sex + location + hx.isch +
## afib + dm + nihss + sbp + aspects + ia.occlus + extra.ica +
## time.rand, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6296 -0.6786 -0.5854 -0.4667 2.2363
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1630483 1.7069637 0.096 0.9239
## trtControl 0.0472200 0.2397982 0.197 0.8439
## age -0.0056011 0.0068957 -0.812 0.4166
## sexMale 0.1656146 0.2397950 0.691 0.4898
## locationRight -0.3418088 0.2387721 -1.432 0.1523
## hx.ischYes 0.0668760 0.3806121 0.176 0.8605
## afibYes 0.1301763 0.2612503 0.498 0.6183
## dmYes -0.3528833 0.3876486 -0.910 0.3627
## nihss 0.0055946 0.0256744 0.218 0.8275
## sbp -0.0001552 0.0046825 -0.033 0.9736
## aspects 0.1228582 0.0797173 1.541 0.1233
## ia.occlusICA with M1 -3.0054740 1.1977587 -2.509 0.0121 *
## ia.occlusM1 -2.4927321 1.1839151 -2.105 0.0352 *
## ia.occlusM2 -2.7362239 1.2658713 -2.162 0.0307 *
## ia.occlusA1 or A2 -0.2168552 1.7294198 -0.125 0.9002
## extra.icaYes 0.2908186 0.2645062 1.099 0.2716
## time.rand 0.0007048 0.0018015 0.391 0.6956
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 489.11 on 499 degrees of freedom
## Residual deviance: 469.25 on 483 degrees of freedom
## AIC: 503.25
##
## Number of Fisher Scoring iterations: 4
Analyses performed on mids objects will result in a mira object (multiply imputed repeated analyses).
A mira object is a list with the following elements: call (the call that created the mira object), call1 (the call that created the mids object), nmis (the number of missing values per variable) and analyses (a list with the fitted model for each imputed dataset).
The class of each of the components of analyses will depend on the type of model that was fitted.
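For example, since models1 was fitted with lm() and models3 with glm(), their analysis components should carry the corresponding classes:
class(models1$analyses[[1]])
## [1] "lm"
class(models3$analyses[[1]])
## [1] "glm" "lm"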
To pool the results from the repeated analyses contained in a mira object, the function pool() can be used.
Pool one of the provided mira objects (models1, models2, models3) and investigate its structure.
# Pool the result accross 5 imputations
pooled1 <- pool(models1)
# investigate the structure
class(pooled1)
## [1] "mipo" "data.frame"
names(pooled1)
## [1] "call" "m" "pooled" "glanced"
str(pooled1)
## Classes 'mipo' and 'data.frame': 0 obs. of 4 variables:
## $ call : language pool(object = models1)
## $ m : int 5
## $ pooled :'data.frame': 19 obs. of 11 variables:
## ..$ term : Factor w/ 19 levels "(Intercept)",..: 1 2 3 4 5 6 7 8 9 10 ...
## ..$ m : int 5 5 5 5 5 5 5 5 5 5 ...
## ..$ estimate: num 17.17706 0.33026 -0.00843 0.31301 0.78706 ...
## ..$ ubar : num 9.63947 0.17546 0.00015 0.17657 0.17409 ...
## ..$ b : num 5.96e-03 5.87e-05 3.18e-08 8.96e-05 2.57e-05 ...
## ..$ t : num 9.64663 0.17553 0.00015 0.17668 0.17412 ...
## ..$ dfcom : int 481 481 481 481 481 481 481 481 481 481 ...
## ..$ df : num 479 479 479 479 479 ...
## ..$ riv : num 0.000742 0.000401 0.000255 0.000609 0.000177 ...
## ..$ lambda : num 0.000742 0.000401 0.000255 0.000609 0.000177 ...
## ..$ fmi : num 0.00489 0.00455 0.0044 0.00476 0.00433 ...
## $ glanced:'data.frame': 5 obs. of 12 variables:
## ..$ r.squared : num 0.0715 0.0712 0.0714 0.0724 0.0715
## ..$ adj.r.squared: num 0.0368 0.0364 0.0366 0.0377 0.0368
## ..$ sigma : num 4.58 4.58 4.58 4.58 4.58
## ..$ statistic : num 2.06 2.05 2.05 2.09 2.06
## ..$ p.value : num 0.0065 0.00683 0.00664 0.00566 0.00648
## ..$ df : num 18 18 18 18 18
## ..$ logLik : num -1461 -1461 -1461 -1461 -1461
## ..$ AIC : num 2962 2962 2962 2961 2962
## ..$ BIC : num 3046 3046 3046 3045 3046
## ..$ deviance : num 10097 10100 10098 10087 10097
## ..$ df.residual : int 481 481 481 481 481
## ..$ nobs : int 500 500 500 500 500
Pooling of a mira object results in a mipo object (multiply imputed pooled analysis).
The mipo object is a list that has the following elements:
Object | Description |
---|---|
call | The call that created the mipo object. |
call1 | The call that created the mids object. |
call2 | The call that created the mira object. |
data | The original, incomplete data. |
nmis | The number of missing values per variable. |
m | The number of imputations. |
qhat | The coefficients from all analyses: a matrix with m rows and npar columns. |
u | The variance-covariance matrix of the coefficients from each analysis: an array of dimension c(m, npar, npar). |
qbar | The pooled parameter estimates: a vector of length npar. |
ubar | The average within imputation variance: a matrix with npar rows and npar columns containing the averaged variance-covariance matrices. |
b | The between imputation variance-covariance matrix. |
t | The total variance-covariance matrix: t = ubar + b + b/m. |
r | The relative increase in variance due to the missing values. |
dfcom | Degrees of freedom in the hypothetically complete data (N - # free parameters). |
df | Degrees of freedom associated with the t-statistics. |
fmi | Fraction of missing information. |
lambda | Proportion of the variation due to the missing values: (b+b/m)/t. |
From the elements contained in the mipo object, we could obtain the pooled coefficients (qbar) and variances (t), and calculate the pooled confidence intervals and p-values by hand, using the function provided in the course notes.
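For illustration, here is a minimal sketch of that by-hand calculation (my own, not the course function), using the columns of pooled1$pooled that appeared in the str() output above:
# Rubin's rules by hand: pooled estimate, SE = sqrt(total variance t),
# and a 95% confidence interval based on the t distribution with df degrees of freedom
pl <- pooled1$pooled
se <- sqrt(pl$t)
data.frame(term = pl$term,
           estimate = pl$estimate,
           std.error = se,
           conf.low = pl$estimate - qt(0.975, pl$df) * se,
           conf.high = pl$estimate + qt(0.975, pl$df) * se)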
However, the function summary() applied to the mipo object will do this for us.
Note:
pool() extracts the model coefficients and variance-covariance matrices using the functions coef() and vcov(). Hence, pooling with the pool() function from mice only works for those types of models for which these functions return the regression coefficients and the corresponding variance-covariance matrix.
Moreover, pool() only works on objects of class mira. When the data have been imputed in a different package/software, or the analyses have not been performed in the way described above, a list of fitted models can be converted to a mira object using as.mira().
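For illustration, a hedged sketch of that conversion; the mids object name imp and the simplified formula are assumptions, not objects defined above:
# Sketch: pooling models fitted outside with() via as.mira()
fits <- lapply(1:5, function(i) {
  # mice::complete() avoids the clash with tidyr::complete()
  lm(nihss ~ trt + age, data = mice::complete(imp, i))  # deliberately simplified model
})
pool(as.mira(fits))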
library(kableExtra)
pooled1 <- pool(models1)
summary(pooled1, conf.int = TRUE) %>%
select(-statistic, -df) %>%
kable(digits = 3)
term | estimate | std.error | p.value | 2.5 % | 97.5 % |
---|---|---|---|---|---|
(Intercept) | 17.177 | 3.106 | 0.000 | 11.074 | 23.280 |
trtControl | 0.330 | 0.419 | 0.431 | -0.493 | 1.154 |
age | -0.008 | 0.012 | 0.491 | -0.032 | 0.016 |
sexMale | 0.313 | 0.420 | 0.457 | -0.513 | 1.139 |
locationRight | 0.787 | 0.417 | 0.060 | -0.033 | 1.607 |
hx.ischYes | -1.076 | 0.668 | 0.108 | -2.389 | 0.237 |
afibYes | 0.081 | 0.468 | 0.862 | -0.838 | 1.000 |
dmYes | -1.958 | 0.623 | 0.002 | -3.183 | -0.734 |
mrankin1 | -0.221 | 0.699 | 0.751 | -1.594 | 1.152 |
mrankin2 | -0.357 | 0.980 | 0.716 | -2.282 | 1.568 |
mrankin> 2 | 1.441 | 1.045 | 0.169 | -0.613 | 3.495 |
sbp | -0.005 | 0.008 | 0.537 | -0.021 | 0.011 |
aspects | -0.110 | 0.135 | 0.415 | -0.375 | 0.155 |
ia.occlusICA with M1 | 2.972 | 2.374 | 0.211 | -1.693 | 7.637 |
ia.occlusM1 | 4.093 | 2.352 | 0.082 | -0.529 | 8.716 |
ia.occlusM2 | 4.890 | 2.470 | 0.048 | 0.037 | 9.743 |
ia.occlusA1 or A2 | 8.041 | 3.508 | 0.022 | 1.149 | 14.933 |
extra.icaYes | 0.417 | 0.476 | 0.381 | -0.518 | 1.352 |
time.rand | -0.006 | 0.003 | 0.054 | -0.013 | 0.000 |
Discuss
We’ll interpret 2 of the predictors here: dm and ia.occlus:
If we have two subjects (A and B) with the same values of all other predictors (trt, age, sex, etc.) but A is dm = Yes and B is dm = No, then the model predicts that A’s outcome will be 1.958 points lower than B’s, with a 95% confidence interval around that estimate of (-3.183, -0.734).
Similar comments can be made about ia.occlus.
We can compare these results to the complete case analysis:
library(magrittr)  # for the %$% exposition pipe
library(broom)     # for tidy()
df_cc <- df %$%
  lm(nihss ~ trt + age + sex + location + hx.isch + afib + dm +
       mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
tidy(df_cc, conf.int = TRUE) %>%
select(term, estimate, std.error, p.value, conf.low, conf.high) %>%
kable(digits = 3)
term | estimate | std.error | p.value | conf.low | conf.high |
---|---|---|---|---|---|
(Intercept) | 17.199 | 3.122 | 0.000 | 11.064 | 23.334 |
trtControl | 0.249 | 0.424 | 0.558 | -0.584 | 1.081 |
age | -0.008 | 0.012 | 0.515 | -0.032 | 0.016 |
sexMale | 0.306 | 0.425 | 0.472 | -0.530 | 1.141 |
locationRight | 0.811 | 0.422 | 0.055 | -0.018 | 1.640 |
hx.ischYes | -1.037 | 0.671 | 0.123 | -2.357 | 0.282 |
afibYes | 0.107 | 0.472 | 0.821 | -0.821 | 1.035 |
dmYes | -2.065 | 0.635 | 0.001 | -3.313 | -0.816 |
mrankin1 | -0.193 | 0.702 | 0.783 | -1.572 | 1.186 |
mrankin2 | -0.437 | 1.004 | 0.664 | -2.410 | 1.536 |
mrankin> 2 | 1.472 | 1.050 | 0.161 | -0.591 | 3.535 |
sbp | -0.005 | 0.008 | 0.516 | -0.022 | 0.011 |
aspects | -0.108 | 0.136 | 0.427 | -0.377 | 0.160 |
ia.occlusICA with M1 | 2.862 | 2.385 | 0.231 | -1.824 | 7.549 |
ia.occlusM1 | 4.039 | 2.361 | 0.088 | -0.600 | 8.679 |
ia.occlusM2 | 4.797 | 2.483 | 0.054 | -0.082 | 9.677 |
ia.occlusA1 or A2 | 8.102 | 3.572 | 0.024 | 1.083 | 15.122 |
extra.icaYes | 0.409 | 0.482 | 0.396 | -0.538 | 1.356 |
time.rand | -0.006 | 0.003 | 0.073 | -0.012 | 0.001 |
Note that there are some sizeable differences here, although nothing enormous. If we want the pooled R^2 or pooled adjusted R^2 after imputation, R will provide it (and a 95% confidence interval around the estimate) with:
pool.r.squared(models1)
## est lo 95 hi 95 fmi
## R^2 0.07160315 0.03392821 0.1205161 0.0005197753
pool.r.squared(models1, adjusted = TRUE)
## est lo 95 hi 95 fmi
## adj R^2 0.03686002 0.01124442 0.07567558 0.001006057
We can see the fraction of missing information about each coefficient due to non-response (fmi) and other details with the following code:
pooled1
## Class: mipo m = 5
## term m estimate ubar b t
## 1 (Intercept) 5 17.177058905 9.639473e+00 5.963575e-03 9.646629e+00
## 2 trtControl 5 0.330259428 1.754636e-01 5.869607e-05 1.755341e-01
## 3 age 5 -0.008432130 1.497218e-04 3.184904e-08 1.497601e-04
## 4 sexMale 5 0.313009143 1.765744e-01 8.964159e-05 1.766820e-01
## 5 locationRight 5 0.787056842 1.740886e-01 2.574415e-05 1.741195e-01
## 6 hx.ischYes 5 -1.075912627 4.465767e-01 1.749045e-05 4.465977e-01
## 7 afibYes 5 0.081404020 2.186886e-01 5.711044e-05 2.187571e-01
## 8 dmYes 5 -1.958327545 3.882901e-01 1.030698e-05 3.883024e-01
## 9 mrankin1 5 -0.221438946 4.882727e-01 4.671731e-06 4.882783e-01
## 10 mrankin2 5 -0.356880358 9.595191e-01 9.602684e-05 9.596344e-01
## 11 mrankin> 2 5 1.440882599 1.092470e+00 4.227877e-06 1.092475e+00
## 12 sbp 5 -0.005093975 6.791841e-05 6.155365e-08 6.799228e-05
## 13 aspects 5 -0.110078715 1.814773e-02 5.221231e-05 1.821039e-02
## 14 ia.occlusICA with M1 5 2.972107444 5.636021e+00 4.811054e-04 5.636598e+00
## 15 ia.occlusM1 5 4.093492042 5.533161e+00 1.045480e-04 5.533286e+00
## 16 ia.occlusM2 5 4.889715285 6.099775e+00 1.455083e-04 6.099949e+00
## 17 ia.occlusA1 or A2 5 8.040881534 1.229913e+01 3.849952e-03 1.230375e+01
## 18 extra.icaYes 5 0.417238545 2.261252e-01 1.881161e-04 2.263510e-01
## 19 time.rand 5 -0.006229975 1.042510e-05 7.668257e-09 1.043431e-05
## dfcom df riv lambda fmi
## 1 481 478.6255 7.423943e-04 7.418436e-04 0.004891367
## 2 481 478.8110 4.014238e-04 4.012627e-04 0.004550603
## 3 481 478.8864 2.552657e-04 2.552005e-04 0.004404497
## 4 481 478.6995 6.092044e-04 6.088335e-04 0.004758272
## 5 481 478.9256 1.774556e-04 1.774241e-04 0.004326706
## 6 481 478.9639 4.699875e-05 4.699654e-05 0.004196490
## 7 481 478.8567 3.133795e-04 3.132813e-04 0.004462593
## 8 481 478.9639 3.185343e-05 3.185242e-05 0.004181408
## 9 481 478.9639 1.148145e-05 1.148132e-05 0.004161122
## 10 481 478.9541 1.200937e-04 1.200793e-04 0.004269354
## 11 481 478.9639 4.644020e-06 4.643998e-06 0.004154313
## 12 481 478.4245 1.087546e-03 1.086364e-03 0.005236189
## 13 481 476.6909 3.452485e-03 3.440606e-03 0.007595613
## 14 481 478.9627 1.024351e-04 1.024246e-04 0.004251698
## 15 481 478.9639 2.267377e-05 2.267326e-05 0.004172267
## 16 481 478.9639 2.862565e-05 2.862483e-05 0.004178194
## 17 481 478.8245 3.756315e-04 3.754905e-04 0.004524821
## 18 481 478.4777 9.982935e-04 9.972979e-04 0.005147034
## 19 481 478.5454 8.826683e-04 8.818899e-04 0.005031522
More information on pooling the independent t-test, chi-squared test, ANOVA, and regression models is available at:
https://bookdown.org/mwheymans/bookmi/data-analysis-after-multiple-imputation.html#the-pooled-independent-t-test
Loading the data, adjust the character of each column:
# Loading the data, adjust the character of each column
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx",
col_types = c("text", "text", "numeric",
"text", "numeric", "text", "text",
"numeric", "numeric", "text", "numeric",
"text", "numeric", "numeric", "text",
"numeric", "numeric", "numeric"))
df = data
The first three steps are similar to Multiple Imputation with mice().
Before constructing a table, I must check which variables are numeric and which are categorical.
For example, in our data:
df = data %>% select(-studyid,-time.punc, -time.iv)
# Changing coding variable to categorical:
df$afib <- factor(df$afib , levels=0:1, labels=c("No", "Yes"))
df$dm <- factor(df$dm , levels=0:1, labels=c("No", "Yes"))
df$extra.ica <- factor(df$extra.ica, levels=0:1, labels=c("No", "Yes"))
str(df)
## tibble [500 x 15] (S3: tbl_df/tbl/data.frame)
## $ trt : chr [1:500] "Control" "Intervention" "Control" "Control" ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : chr [1:500] "Male" "Male" "Female" "Male" ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
## $ hx.isch : chr [1:500] "No" "No" "No" "No" ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : chr [1:500] "2" "0" "0" "0" ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
df <- df%>%mutate_if(is.character, as.factor)
str(df)
## tibble [500 x 15] (S3: tbl_df/tbl/data.frame)
## $ trt : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
## $ age : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
## $ nihss : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
## $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
## $ hx.isch : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ afib : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ dm : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mrankin : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
## $ sbp : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
## $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ aspects : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
## $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
## $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
## $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
# Relevel factors:
df$trt = factor(df$trt, levels=c("Intervention", "Control"))
df$mrankin = factor(df$mrankin, levels=c("0", "1", "2", "> 2"))
df$ia.occlus = factor(df$ia.occlus, levels=c("Intracranial ICA", "ICA with M1",
"M1", "M2", "A1 or A2"))
# Check order of factor after relevel:
df$trt[1:2]
## [1] Control Intervention
## Levels: Intervention Control
df$mrankin[1:2]
## [1] 2 0
## Levels: 0 1 2 > 2
df$ia.occlus[1:3]
## [1] M1 M1 ICA with M1
## Levels: Intracranial ICA ICA with M1 M1 M2 A1 or A2
Before moving on to the specifics of multiple imputation, we should first explore the pattern of missing data in our dataset.
There are several ways to check for missing values (NA) in the dataset:
# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(df))),
"% NA" = round(sort(colMeans(is.na(df))) * 100, 2))
## # NA % NA
## trt 0 0.0
## age 0 0.0
## sex 0 0.0
## nihss 0 0.0
## location 0 0.0
## hx.isch 0 0.0
## afib 0 0.0
## dm 0 0.0
## mrankin 0 0.0
## iv.altep 0 0.0
## sbp 1 0.2
## ia.occlus 1 0.2
## extra.ica 1 0.2
## time.rand 2 0.4
## aspects 4 0.8
Discuss
There are 5 variables with missing values, totalling 9 NAs.
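The same per-variable counts are available in tidy form from naniar; a minimal sketch:
# One row per variable, with n_miss and pct_miss columns
library(naniar)
miss_var_summary(df)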
Ncc <- cbind(
"#" = table(complete.cases(df)),
"%" = round(100 * table(complete.cases(df))/nrow(df), 2)
)
rownames(Ncc) <- c("incompl.", "complete")
Ncc
## # %
## incompl. 8 1.6
## complete 492 98.4
Discuss
Among the 500 observations, 8 are incomplete and 492 are complete.
p_missing <- unlist(lapply(df, function(x) sum(is.na(x))))/nrow(df)
sort(p_missing[p_missing > 0], decreasing = TRUE)
## aspects time.rand sbp ia.occlus extra.ica
## 0.008 0.004 0.002 0.002 0.002
library(plyr)
# plyr is loaded after dplyr here, so call plyr::summarize explicitly
# to avoid the masking conflict with dplyr::summarize
ddply(df, c(sbp = "ifelse(is.na(sbp), 'missing', 'observed')",
            aspects = "ifelse(is.na(aspects), 'missing', 'observed')",
            time.rand = "ifelse(is.na(time.rand), 'missing', 'observed')",
            ia.occlus = "ifelse(is.na(ia.occlus), 'missing', 'observed')",
            extra.ica = "ifelse(is.na(extra.ica), 'missing', 'observed')"
),
plyr::summarize,
N = length(sbp)
)
## sbp aspects time.rand ia.occlus extra.ica N
## 1 missing observed observed observed observed 1
## 2 observed missing observed observed observed 4
## 3 observed observed missing observed observed 2
## 4 observed observed observed missing missing 1
## 5 observed observed observed observed observed 492
Discuss
This command provides more info than the previous two. Among 500 observations:
8 observations (obs) contain NA:
** 1 obs has NA at sbp.
** 4 obs have NA at aspects.
** 2 obs have NA at time.rand.
** 1 obs has NA at both ia.occlus and extra.ica.
492 obs don't have any NA.
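The same cross-tabulation can be built without plyr; a hedged dplyr sketch (across() requires dplyr >= 1.0):
# Tidyverse alternative to the ddply call above
df %>%
  mutate(across(c(sbp, aspects, time.rand, ia.occlus, extra.ica),
                ~ ifelse(is.na(.x), "missing", "observed"))) %>%
  count(sbp, aspects, time.rand, ia.occlus, extra.ica)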
Using the following command:
md.pattern (x)
md.pattern(df)
## trt age sex nihss location hx.isch afib dm mrankin iv.altep sbp ia.occlus
## 492 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 1 1 1 0 1
## 0 0 0 0 0 0 0 0 0 0 1 1
## extra.ica time.rand aspects
## 492 1 1 1 0
## 4 1 1 0 1
## 2 1 0 1 1
## 1 0 1 1 2
## 1 1 1 1 1
## 1 2 4 9
Discuss
In each row of the output:
** 1: the variable is complete in that pattern.
** 0: the variable contains missing values in that pattern.
** The last row: the total number of missing values for each variable.
In each column of the output:
** The 1st column on the left (without a column name) shows the number of cases with a specific pattern.
** The column on the right: the number of variables that are incomplete in that pattern.
Returning to our data:
** 1st line: 492 obs have complete values.
** 2nd line: 4 obs have NA at aspects.
** 3rd line: 2 obs have NA at time.rand.
** 4th line: 1 obs has missing values at both ia.occlus and extra.ica.
** 5th line: 1 obs has NA at sbp.
For each column/variable, the total number of missing values:
** sbp: 1
** ia.occlus: 1
** extra.ica: 1
** time.rand: 2
** aspects: 4
Install this package if needed:
devtools::install_github("ropensci/visdat")
Using the following command:
vis_miss(x, cluster = F, sort_miss = F, show_perc = T,
show_perc_col = T)
vis_miss(df, cluster = T, sort_miss = T, show_perc = T, show_perc_col = T)
vis_dat(df, sort_type = T)
Discuss
Using the following command:
aggr (x, plot = T, prop = T, numbers = F, combined = F, sortVars
= T, cex.axis=.7, gap=3)
aggr(df, plot = F, numbers = T, prop = T)
##
## Missings in variables:
## Variable Count
## sbp 1
## aspects 4
## ia.occlus 1
## extra.ica 1
## time.rand 2
aggr(df, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)
##
## Variables sorted by number of missings:
## Variable Count
## aspects 0.008
## time.rand 0.004
## sbp 0.002
## ia.occlus 0.002
## extra.ica 0.002
## trt 0.000
## age 0.000
## sex 0.000
## nihss 0.000
## location 0.000
## hx.isch 0.000
## afib 0.000
## dm 0.000
## mrankin 0.000
## iv.altep 0.000
aggr(df, plot = T, numbers = T, prop = T, combined= T, sortVars = F)
Discuss
We can see there are 5 variables with missing values:
** 1 obs has NA at both ia.occlus and extra.ica.
Among these variables:
** Numeric variables: sbp, aspects, time.rand.
** Categorical variables: ia.occlus, extra.ica.
The total proportion of missing values across these 5 variables is < 5% => complete case analysis can be applied. However, in this example, I will still use single imputation and multiple imputation to demonstrate how they work.
Summary
I recommend using mice and/or the aggr() function for exploring missing values:
The code above calculates what percent of the data is missing. A simple look at this table warns us about any variables that have more than 25% missing; it is useful to remove such variables from the dataset first, as they might mess up the imputation.
There are 5 variables with missing values:
** Numeric variables: sbp, aspects, time.rand.
** Categorical variables: ia.occlus, extra.ica.
The outcome variables of this project: nihss; mrankin.
The missing data patterns show that the total proportion of missing values is < 5% => complete case analysis could be applied.
=> However, for demonstration, I still continue with the imputation practice.
Little's MCAR test is available in the naniar package as the mcar_test function. To apply the test, we select only the continuous variables; we use it for the same dataset as in the previous paragraph. A significant p-value would indicate that the missingness does not seem to be completely at random.
df_select <- df[,c("age", "nihss", "sbp","aspects" ,"time.rand" )]
library(naniar)
mcar_test(data=df_select)
## # A tibble: 1 x 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 18.4 12 0.104 4
mcar_test(data = df)
## # A tibble: 1 x 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 64.3 55 0.183 5
Discuss
The p-values for both tests are > 0.05 => we fail to reject the null hypothesis.
=> The missingness appears to be completely at random (MCAR).
=> This result is consistent with the missing data patterns (< 5%): the missingness is assumed not to matter for the analysis.
In R, the group of non-responders (i.e., participants with missing observations) can also be compared to the responders (i.e., participants without missing observations). For this we can use the function missing_compare of the finalfit package.
First we have to copy and rename the variables with missing data. We rename the continuous variables sbp, aspects and time.rand and call them sbp_MAR, aspects_MAR and time.rand_MAR respectively.
The missing_compare function uses the Kruskal Wallis test for continuous variables and the Chi-squared test for categorical variables. (Note: The concerned outcome of this research: mrankin, nihss)
In this example, I will demonstrate the command to compare groups for the numeric variable time.rand. Similar commands can be used for the other numeric variables:
df_select <- df[,c("age", "nihss", "sbp","aspects" ,"time.rand",
"trt", "mrankin", "ia.occlus", "extra.ica")]
# Copy variable with missing data and rename
df_select$sbp_MAR <- df_select$sbp
df_select$aspects_MAR <- df_select$aspects
df_select$time.rand_MAR <- df_select$time.rand
df_select$extra.ica_MAR <- df_select$extra.ica
library(finalfit)
explanatory = c("age", "nihss", "sbp" ,"aspects")
dependent = "time.rand_MAR"
df_select %>%
missing_compare(dependent, explanatory) %>%
knitr::kable(row.names=FALSE, align = c("l", "l", "r", "r", "r"),
caption = "Mean comparisons between values of responders (Not missing) and
non-responders (Missing) on the time.rand variable.")
Missing data analysis: time.rand_MAR | Not missing | Missing | p | |
---|---|---|---|---|
age | Mean (SD) | 64.6 (17.0) | 94.0 (1.4) | 0.015 |
nihss | Mean (SD) | 18.0 (4.7) | 18.0 (1.4) | 0.994 |
sbp | Mean (SD) | 145.6 (25.1) | 105.0 (8.5) | 0.022 |
aspects | Mean (SD) | 8.5 (1.6) | 9.0 (1.4) | 0.653 |
I can also compare groups for the extra.ica variable.
explanatory = c("trt", "mrankin")
dependent = "extra.ica_MAR"
df_select %>%
missing_compare(dependent, explanatory) %>%
knitr::kable(row.names=FALSE, align = c("l", "l", "r", "r", "r"),
caption = "Mean comparisons between values of responders (Not missing) and
non-responders (Missing) on the extra.ica variable.")
Missing data analysis: extra.ica_MAR | Not missing | Missing | p | |
---|---|---|---|---|
trt | Intervention | 233 (100.0) | 0 (0.0) | 1.000 |
Control | 266 (99.6) | 1 (0.4) | ||
mrankin | 0 | 403 (99.8) | 1 (0.2) | 0.971 |
1 | 50 (100.0) | 0 (0.0) | ||
2 | 25 (100.0) | 0 (0.0) | ||
> 2 | 21 (100.0) | 0 (0.0) |
Discuss
On the time.rand_MAR variable, the patients that have observed values (column Not missing) differ significantly from the patients with missing values (column Missing) on age (P = 0.015) and sbp (P = 0.022).
On the extra.ica_MAR variable, the patients that have observed values (column Not missing) don’t differ significantly from the patients with missing values (column Missing) on trt and mrankin.
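The responders/non-responders comparison can also be approached with naniar's shadow matrix; a hedged sketch (the summary choice is arbitrary):
# Append shadow columns (*_NA) and summarize age by missingness of time.rand
library(naniar)
df %>%
  bind_shadow() %>%
  group_by(time.rand_NA) %>%
  summarize(mean_age = mean(age), n = n())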
Before imputing the missing data, it is important to take a look at how the incomplete variables are distributed so that we can choose appropriate imputation models. Pay attention to the shape and range of each distribution:
# One option that allows us to quickly get an overview of all variables,
# categorical and continuous (replace the "df" dataset in the code below).
# Note: "[[" is used for column extraction so the loop also works on tibbles;
# with "[, i]" a tibble returns a one-column tibble and hist()/is.numeric() fail.
nc <- max(5, ceiling(sqrt(ncol(df))))
nr <- ceiling(ncol(df) / nc)
par(mfrow = c(nr, nc), mgp = c(2, 0.6, 0), mar = c(2, 3, 3, 0.5))
for (i in 1:ncol(df)) {
  if (is.numeric(df[[i]])) {
    hist(df[[i]], nclass = 50, xlab = "",
         main = paste0(names(df)[i], " (",
                       round(mean(is.na(df[[i]])) * 100, 2), "% NA)"))
  } else {
    barplot(table(df[[i]]), ylab = "Frequency",
            main = paste0(names(df)[i], " (",
                          round(mean(is.na(df[[i]])) * 100, 2), "% NA)"))
  }
}
Package used: simputation
Usage: impute_<model>(data, formula, ...)
The formula is of the form:
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
Note: If several variables are imputed from the same set of predictors, we can combine them on the left-hand side of the formula, which produces the same results. For example:
set.seed(1234)
df_multi <- df %>% data.frame() %>%
impute_rlm(., sbp + aspects + time.rand ~ trt + sex + age) %>%
tbl_df() # tbl_df() is deprecated in recent dplyr; as_tibble() is the modern equivalent
diffdf(df, df_multi)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## sbp 1
## aspects 4
## time.rand 2
## ------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## sbp 131 <NA> 143.8762
## -----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## aspects 19 <NA> 8.723411
## aspects 169 <NA> 8.740693
## aspects 201 <NA> 8.743835
## aspects 222 <NA> 8.684133
## -----------------------------------------
##
##
## All rows are shown in table below
##
## ==========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ------------------------------------------
## time.rand 434 <NA> 195.4407
## time.rand 463 <NA> 195.6484
## ------------------------------------------
set.seed(1234)
df_multi <- df %>% data.frame() %>%
impute_rlm(., aspects ~ trt + sex + age) %>%
impute_rlm(., time.rand ~ trt + sex + age) %>%
impute_rlm(., sbp ~ trt + sex + age) %>%
tbl_df()
diffdf(df, df_multi)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## sbp 1
## aspects 4
## time.rand 2
## ------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## sbp 131 <NA> 143.8762
## -----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## aspects 19 <NA> 8.723411
## aspects 169 <NA> 8.740693
## aspects 201 <NA> 8.743835
## aspects 222 <NA> 8.684133
## -----------------------------------------
##
##
## All rows are shown in table below
##
## ==========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ------------------------------------------
## time.rand 434 <NA> 195.4407
## time.rand 463 <NA> 195.6484
## ------------------------------------------
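The grouping part of the formula was not used above; as a hedged sketch (the predictor choice is arbitrary), "| trt" fits a separate robust model within each treatment group:
# Sketch: one robust regression per trt group; NAs in sbp are imputed group-wise
set.seed(1234)
df_grp <- df %>%
  data.frame() %>%
  impute_rlm(., sbp ~ age | trt) %>%
  tbl_df()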
We will impute the missing values using several different strategies, all supported nicely by the simputation package.
Impute types | Target Variable | Description |
---|---|---|
impute_rlm | Numeric | Imputation using robust linear regression with M-estimators |
impute_lm | Numeric | Imputation using linear regression |
impute_median | Numeric | Impute (group-wise) medians: the median of all non-missing observations is imputed into the missing values for the variable |
impute_pmm | Numeric | Imputation using predictive mean matching |
impute_em | Numeric | Multivariate, model-based imputation: the variables are assumed to follow a multivariate normal distribution whose means and covariance matrix are estimated with the EM algorithm |
impute_mf | Numeric, Categorical, or Mixed data | Multivariate, model-based imputation: missing values are imputed with a rough guess, after which a predictive random forest is trained and used to re-impute the missing values; this is iterated until convergence |
impute_cart | Numeric, Categorical, or Mixed data | Decision tree imputation using classification and regression trees |
impute_rf | Numeric, Categorical, or Mixed data | Decision tree imputation using a Random Forest model |
impute_knn | Numeric, Categorical, or Mixed data | Imputation using k-nearest neighbors methods |
impute_rhd | Numeric, Categorical, or Mixed data | Random “hot deck” imputation: drawing at random from the complete cases for that variable |
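Most of these functions share the same interface. As a hedged illustration of one method from the table that is not used below (the predictor set is arbitrary):
# Sketch: predictive mean matching; donor values are drawn from observed sbp
set.seed(1234)
df_pmm <- df %>%
  data.frame() %>%
  impute_pmm(., sbp ~ age + nihss) %>%
  tbl_df()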
Packages used: diffdf; arsenal
After imputing the discrete variables, I will use diffdf() from the diffdf package to check the differences between 2 datasets: diffdf(x, y)
Or I can use comparedf() from the arsenal package: summary(comparedf(x, y))
Because time.punc only applies to the Intervention group (trt) and time.iv only applies to patients with iv.altep = Yes, I eliminate these two variables from the imputation. Therefore, we only have 5 variables:
Numeric variables: sbp, aspects, time.rand.
Categorical variables: ia.occlus, extra.ica.
Here, we’ll arbitrarily impute our categorical variables (ia.occlus, extra.ica) as follows:
impute_rhd(., x ~ y) and impute_knn(., x ~ y)
y ~ 1 specifies that no grouping is to be applied.
y ~ . specifies that all other variables will be involved.
For extra.ica (Yes/No) we’ll use the
impute_rhd approach to draw a random observation from
the existing set of Yes and No in the complete
extra.ica data.
For ia.occlus (5 categories) we’ll use a method called impute_knn - k-nearest neighbors methods which takes the result from a model based on the (imputed) extra.ica value and the remaining variables and converts it to the nearest value in the observed ia.occlus data.
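As a hedged illustration of the y ~ . shorthand (not used in the analysis below), every other variable can serve as a predictor for the tree:
# Sketch: "~ ." expands to all remaining variables in the data
set.seed(1234)
df_dot <- df %>%
  data.frame() %>%
  impute_cart(., ia.occlus ~ .) %>%
  tbl_df()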
1/ trt + iv.altep + extra.ica:
set.seed(1234)
df_trt_iv.altep <- df %>%
data.frame() %>%
impute_rhd(.,
extra.ica ~ 1) %>%
impute_knn(., ia.occlus ~ trt + iv.altep + extra.ica, k=5) %>%
tbl_df()
set.seed(1234)
df_all <- df %>%
data.frame() %>%
impute_rhd(.,
extra.ica ~ 1) %>%
impute_knn(., ia.occlus ~ trt + iv.altep + sex +location + hx.isch + afib + dm + mrankin + extra.ica, k=5) %>%
tbl_df()
Checking the changes: comparedf() vs diffdf()
After creating the imputations for the discrete variables, I will use diffdf() from the diffdf package to check the differences between the 2 datasets:
diffdf(x, y)
diffdf(df_trt_iv.altep, df_all)
## No issues were found!
Or I can use comparedf() from the arsenal package:
summary(comparedf(x, y))
summary(comparedf(df_trt_iv.altep, df_all))
##
##
## Table: Summary of data.frames
##
## version arg ncol nrow
## -------- ---------------- ----- -----
## x df_trt_iv.altep 15 500
## y df_all 15 500
##
##
##
## Table: Summary of overall comparison
##
## statistic value
## ------------------------------------------------------------ ------
## Number of by-variables 0
## Number of non-by variables in common 15
## Number of variables compared 15
## Number of variables in x but not y 0
## Number of variables in y but not x 0
## Number of variables compared with some values unequal 0
## Number of variables compared with all values equal 15
## Number of observations in common 500
## Number of observations in x but not y 0
## Number of observations in y but not x 0
## Number of observations with some compared variables unequal 0
## Number of observations with all compared variables equal 500
## Number of values unequal 0
##
##
##
## Table: Variables not shared
##
##
## ------------------------
## No variables not shared
## ------------------------
##
##
##
## Table: Other variables not compared
##
##
## --------------------------------
## No other variables not compared
## --------------------------------
##
##
##
## Table: Observations not shared
##
##
## ---------------------------
## No observations not shared
## ---------------------------
##
##
##
## Table: Differences detected by variable
##
## var.x var.y n NAs
## ---------- ---------- --- ----
## trt trt 0 0
## age age 0 0
## sex sex 0 0
## nihss nihss 0 0
## location location 0 0
## hx.isch hx.isch 0 0
## afib afib 0 0
## dm dm 0 0
## mrankin mrankin 0 0
## sbp sbp 0 0
## iv.altep iv.altep 0 0
## aspects aspects 0 0
## ia.occlus ia.occlus 0 0
## extra.ica extra.ica 0 0
## time.rand time.rand 0 0
##
##
##
## Table: Differences detected
##
##
## ------------------------
## No differences detected
## ------------------------
##
##
##
## Table: Non-identical attributes
##
##
## ----------------------------
## No non-identical attributes
## ----------------------------
Discuss
With no grouping applied in the 1st imputation, extra.ica is imputed as No and ia.occlus as M1.
Because ia.occlus and extra.ica have only 1 missing value each (a very small number), no difference is detected between the impute_rhd and impute_knn results across the different predictor combinations above.
impute_cart(., x ~ y) for both variables
set.seed(1234)
df_trt_iv.altep <- df %>%
data.frame() %>%
impute_cart(.,
extra.ica ~ trt) %>%
impute_cart(., ia.occlus ~ trt + iv.altep) %>%
tbl_df()
diffdf(df, df_trt_iv.altep)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## ia.occlus 1
## extra.ica 1
## ------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## ia.occlus 404 <NA> M1
## -----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## extra.ica 404 <NA> No
## -----------------------------------------
impute_cart(., x ~ y) and impute_knn(., x ~ y)
* trt / trt + iv.altep + extra.ica:
set.seed(1234)
df_trt_iv.altep <- df %>%
data.frame() %>%
impute_cart(.,
extra.ica ~ trt) %>%
impute_knn(., ia.occlus ~ trt + iv.altep + extra.ica, k=5) %>%
tbl_df()
diffdf(df, df_trt_iv.altep)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## ia.occlus 1
## extra.ica 1
## ------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## ia.occlus 404 <NA> M1
## -----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## extra.ica 404 <NA> No
## -----------------------------------------
impute_rhd(., x ~ y) and impute_knn(., x ~ y)
* trt / trt + iv.altep + extra.ica:
set.seed(1234)
df_trt_iv.altep1 <- df %>%
data.frame() %>%
impute_rhd(.,
extra.ica ~ trt) %>%
impute_knn(., ia.occlus ~ trt + iv.altep + extra.ica, k=5) %>%
tbl_df()
diffdf(df, df_trt_iv.altep1)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## ia.occlus 1
## extra.ica 1
## ------------------------------
##
##
## All rows are shown in table below
##
## =============================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ---------------------------------------------
## ia.occlus 404 <NA> ICA with M1
## ---------------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## extra.ica 404 <NA> Yes
## -----------------------------------------
Discuss
I will use df_trt_iv.altep (the result of combining impute_cart and impute_knn) for the next step: single imputation for the numeric variables.
After dealing with the 2 categorical variables, we only have to deal with the remaining 3 numeric variables.
In this 1st trial, I’m choosing to use:
set.seed(1234)
df_num1 <- df_trt_iv.altep %>% data.frame() %>%
impute_rlm(., sbp ~ afib + dm + mrankin) %>%
impute_knn(., aspects ~ trt + sex + nihss, k = 5) %>%
impute_rlm(., time.rand ~ trt + sex + age + nihss + hx.isch) %>%
tbl_df()
Checking the changes: diffdf()
In the checking step, we have 2 ways to compare:
diffdf(df_trt_iv.altep, df_num1)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## sbp 1
## aspects 4
## time.rand 2
## ------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## sbp 131 <NA> 145.8476
## -----------------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## aspects 19 <NA> 8
## aspects 169 <NA> 8
## aspects 201 <NA> 9
## aspects 222 <NA> 5
## ----------------------------------------
##
##
## All rows are shown in table below
##
## ==========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ------------------------------------------
## time.rand 434 <NA> 196.3632
## time.rand 463 <NA> 198.9834
## ------------------------------------------
diffdf(df, df_num1)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## sbp 1
## aspects 4
## ia.occlus 1
## extra.ica 1
## time.rand 2
## ------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## sbp 131 <NA> 145.8476
## -----------------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## aspects 19 <NA> 8
## aspects 169 <NA> 8
## aspects 201 <NA> 9
## aspects 222 <NA> 5
## ----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## ia.occlus 404 <NA> M1
## -----------------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## extra.ica 404 <NA> No
## -----------------------------------------
##
##
## All rows are shown in table below
##
## ==========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ------------------------------------------
## time.rand 434 <NA> 196.3632
## time.rand 463 <NA> 198.9834
## ------------------------------------------
In the 2nd trial, I’m choosing to use:
set.seed(1234)
df_num2 <- df_trt_iv.altep %>% data.frame() %>%
impute_rhd(., sbp ~ 1) %>%
impute_median(., aspects ~ trt + sex) %>%
impute_rlm(., time.rand ~ trt + sex + age) %>%
tbl_df()
Compare to the result of 1st trial
diffdf(df_num2, df_num1)
## Differences found between the objects!
##
## A summary is given below.
##
## Not all Values Compared Equal
## All rows are shown in table below
##
## ==============================
## Variable No of Differences
## ------------------------------
## sbp 1
## aspects 3
## time.rand 2
## ------------------------------
##
##
## All rows are shown in table below
##
## =========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## -----------------------------------------
## sbp 131 139 145.8476
## -----------------------------------------
##
##
## All rows are shown in table below
##
## ========================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------
## aspects 19 9 8
## aspects 169 9 8
## aspects 222 9 5
## ----------------------------------------
##
##
## All rows are shown in table below
##
## ==============================================
## VARIABLE ..ROWNUMBER.. BASE COMPARE
## ----------------------------------------------
## time.rand 434 195.4407 196.3632
## time.rand 463 195.6484 198.9834
## ----------------------------------------------
Discuss
The result of the 1st trial is different from that of the 2nd trial.
=> The choice of impute_ method matters; choose it deliberately, based on the variable type and the available predictors.
Personally, I recommend:
** Numeric variables: impute_rlm, impute_cart and impute_knn.
** Categorical variables: impute_cart and impute_knn.
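Putting these recommendations together, a hedged sketch of one consolidated single-imputation pipeline; it simply chains the steps already shown above (df_num1, saved below, remains the object used going forward):
# Sketch: categorical variables first, then the numeric ones
set.seed(1234)
df_final <- df %>%
  data.frame() %>%
  impute_cart(., extra.ica ~ trt) %>%
  impute_knn(., ia.occlus ~ trt + iv.altep + extra.ica, k = 5) %>%
  impute_rlm(., sbp ~ afib + dm + mrankin) %>%
  impute_knn(., aspects ~ trt + sex + nihss, k = 5) %>%
  impute_rlm(., time.rand ~ trt + sex + age + nihss + hx.isch) %>%
  tbl_df()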
saveRDS(df_num1, file = "df_num1.Rds")
https://thomaselove.github.io/432-notes/dealing-with-missingness-single-imputation.html#plotting-missingness
https://bookdown.org/mwheymans/bookmi/
https://github.com/benjaminrich/table1/issues/52
https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184
https://stefvanbuuren.name/fimd/workflow.html
https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#getting_to_know_the_data
https://www.gerkovink.com/miceVignettes/Convergence_pooling/Convergence_and_pooling.html
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression
and Multilevel/Hierarchical Models. New York: Cambridge University
Press.
He, Y., 2010. Missing data analysis using multiple imputation: getting
to the heart of the matter. Circulation: Cardiovascular Quality and
Outcomes, 3(1), pp.98-105.
Jakobsen, J.C., Gluud, C., Wetterslev, J. et al. When and how should
multiple imputation be used for handling missing data in randomised
clinical trials – a practical guide with flowcharts. BMC Med Res
Methodol 17, 162 (2017). https://doi.org/10.1186/s12874-017-0442-1.
White, I.R., Royston, P. and Wood, A.M., 2011. Multiple imputation using
chained equations: issues and guidance for practice. Statistics in
medicine, 30(4), pp.377-399.