knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Previous sections:

  1. Clustering practice: https://rpubs.com/minhtri/923711
  2. Table Descriptive Analysis: https://rpubs.com/minhtri/929114

 

Note: This analysis is for practice. In this section, I will summarize some basic commands for imputation based on several online sources.

 

R Packages:

# More packages will be shown during the analysis process
# Load the required library
library(tidyverse)    # Data Wrangling
library(conflicted)   # Dealing with conflict package
library(readxl)       # Read Excel files
library(DT)           # For using datatable()
library(mice)         # For imputation 
library(VIM)          # For aggr function
library(visdat)       # For vis_dat; vis_miss function
library(simputation)  # For imputation missing value
library(naniar)       # For bind_shadow function
library(arsenal)      # For comparing 2 datasets comparedf() (equal or unequal)
library(diffdf)       # For comparing 2 datasets diffdf()

 

Dealing with Conflicts
There are a lot of packages here, and sometimes individual functions are in conflict. R’s default conflict resolution system gives precedence to the most recently loaded package. This can make it hard to detect conflicts, particularly when they are introduced by an update to an existing package.

  • Using the code below helps the entire section run properly. You may or may not need to look into the conflicted package for your work.
conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("Predict", "rms")
conflict_prefer("impute_median", "simputation")
conflict_prefer("summarize", "dplyr")

 

1 Introduction

1.1 Types of missing data

My source for this description of mechanisms is Chapter 25 of Gelman and Hill (2007)
https://thomaselove.github.io/432-notes/dealing-with-missingness-single-imputation.html#plotting-missingness

  1. MCAR (Missingness completely at random): A variable is missing completely at random if the probability of missingness is the same for all units:
  • For example, if for each subject, we decide whether to collect the diabetes status by rolling a die and refusing to answer if a “6” shows up.
  • Another example is a weighing scale that ran out of batteries. Some of the data will be missing simply because of bad luck.
  • Another example is when we take a random sample of a population, where each member has the same chance of being included in the sample. The (unobserved) data of members in the population that were not included in the sample are MCAR.

MCAR causes enlarged standard errors due to the reduced sample size, but it does not cause bias (‘systematic error’, i.e., overestimation of benefits and underestimation of harms).
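To make the die-rolling example concrete, here is a minimal sketch (my own illustration on simulated data) of MCAR missingness, where every unit has the same 1/6 probability of a missing value:

# Simulated MCAR: the chance of missingness is 1/6 for everyone,
# regardless of the (observed or unobserved) data values
set.seed(42)
diabetes <- rbinom(100, 1, 0.3)                 # fully observed "true" values
roll     <- sample(1:6, 100, replace = TRUE)    # one die roll per subject
diabetes[roll == 6] <- NA                       # refuse to answer on a "6"
mean(is.na(diabetes))                           # close to 1/6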

 

  2. MAR (Missingness at random): Missingness that depends only on observed predictors; that is, the probability that a variable is missing depends only on available information. Here, we would have to be willing to assume that the probability of nonresponse to diabetes depends only on the other, fully recorded variables in the data.

It is often reasonable to model this process as a logistic regression, where the outcome variable equals 1 for observed cases and 0 for missing. When an outcome variable is missing at random, it is acceptable to exclude the missing cases (that is, to treat them as NA), as long as the regression controls for all the variables that affect the probability of missingness.
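As an illustration of that logistic-regression idea, the sketch below (my own toy example with simulated data; variable names are hypothetical) models the response indicator with glm():

# Simulate a dataset where missingness in 'diabetes' depends only on 'age'
set.seed(1)
dat <- data.frame(age = rnorm(200, 60, 10), sex = rbinom(200, 1, 0.5))
dat$diabetes <- rbinom(200, 1, plogis(-4 + 0.05 * dat$age))
dat$diabetes[rbinom(200, 1, plogis(-6 + 0.08 * dat$age)) == 1] <- NA

# Model the response indicator (TRUE = observed, FALSE = missing)
# as a logistic regression on the fully recorded variables
fit <- glm(!is.na(diabetes) ~ age + sex, data = dat, family = binomial)
summary(fit)   # age should predict the probability of missingness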

MAR allows prediction of the missing values based on the participants with complete data.

 

  3. MNAR (Missingness not at random): Missingness that depends on unobserved predictors. Missingness is no longer “at random” if it depends on information that has not been recorded and this information also predicts the missing values.
  • If a particular treatment causes discomfort, a patient is more likely to drop out of the study. This missingness is not at random (unless “discomfort” is measured and observed for all patients). If missingness is not at random, it must be explicitly modeled, or else you must accept some bias in your inferences.

  • Missingness that depends on the missing value itself. Finally, a particularly difficult situation arises when the probability of missingness depends on the (potentially missing) variable itself. For example, suppose that people with higher earnings are less likely to reveal them.

  • Essentially, MNAR mechanisms are referred to collectively as non-random missingness, and they cause more trouble for us than MCAR and MAR.

     

Note: Moving from MCAR to MNAR, the missingness is increasingly tied to a specific, identifiable reason. For example, under MNAR you can explain why a patient is more likely to drop out of the study and hence produce a missing value.

 

1.2 When to ignore missing data

  1. Complete case analysis may be used as the primary analysis if the proportions of missing data are below approximately 5% (as a rule of thumb) and it is implausible that certain patient groups (for example, the very sick or the very ‘well’ participants) specifically are lost to follow-up in one of the compared groups. In other words, if the potential impact of the missing data is negligible, then the missing data may be ignored in the analysis.

Best-worst and worst-best case sensitivity analyses may be used if in doubt:

  • First a ‘best-worst-case’ scenario dataset is generated where it is assumed that all participants lost to follow-up in one group (referred to as group 1) have had a beneficial outcome (for example, had no serious adverse event); and all those with missing outcomes in the other group (group 2) have had a harmful outcome (for example, have had a serious adverse event).

  • Then a ‘worst-best-case’ scenario dataset is generated where it is assumed that all participants lost to follow-up in group 1 have had a harmful outcome; and that all those lost to follow-up in group 2 have had a beneficial outcome.

  • If continuous outcomes are used, then a ‘beneficial outcome’ might be the group mean plus 2 standard deviations (or 1 standard deviation) of the group mean, and a ‘harmful outcome’ might be the group mean minus 2 standard deviations (or 1 standard deviation) of the group mean.

  • For dichotomised data, these best-worst and worst-best case sensitivity analyses will then show the range of uncertainty due to missing data; if this range does not give qualitatively contradicting results, then the missing data may be ignored. For continuous data, imputation with 2 SD will represent a possible range of uncertainty given 95% of the observed data (if normally distributed). A minimal sketch implementing the best-worst/worst-best idea for a binary outcome appears after this list.

  2. If only the dependent variable has missing values and auxiliary variables (variables not included in the regression analysis, but correlated with a variable with missing values and/or related to its missingness) are not identified, complete case analysis may be used as the primary analysis and no specific methods ought to be used to handle the missing data. No additional information will be obtained by, for example, using multiple imputation, but the standard errors may increase due to the uncertainty introduced by the multiple imputation.

  3. It would also be valid just to perform complete case analysis if it is relatively certain that the data are MCAR, although it is rare to be certain of that. It is possible to test the hypothesis that the data are MCAR with Little’s test, but it may be unwise to build on tests that turned out to be insignificant. Hence, if there is reasonable doubt about whether the data are MCAR, even if Little’s test is insignificant (fails to reject the null hypothesis that the data are MCAR), then MCAR should not be assumed.
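To illustrate the best-worst/worst-best idea mentioned above, here is a minimal sketch (my own illustration on simulated data) for a dichotomised outcome, where event = 1 denotes a serious adverse event:

set.seed(7)
trial <- data.frame(group = rep(1:2, each = 50),
                    event = rbinom(100, 1, 0.3))
trial$event[sample(100, 10)] <- NA                 # lost to follow-up

best_worst <- trial                                # group 1 beneficial, group 2 harmful
best_worst$event[is.na(trial$event) & trial$group == 1] <- 0
best_worst$event[is.na(trial$event) & trial$group == 2] <- 1

worst_best <- trial                                # group 1 harmful, group 2 beneficial
worst_best$event[is.na(trial$event) & trial$group == 1] <- 1
worst_best$event[is.na(trial$event) & trial$group == 2] <- 0

# Re-run the primary analysis on both datasets and compare conclusions
lapply(list(best_worst = best_worst, worst_best = worst_best),
       function(d) table(d$group, d$event))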

 

Note: When the proportion of missing data is too large:

  • If large proportions of data are missing, it ought to be considered just to report the results of the complete case analysis and then clearly discuss the resulting interpretative limitations of the trial results. If multiple imputations or other methods are used to handle missing data it might indicate that the results of the trial are confirmative, which they are not if the missingness is considerable.

  • If the proportions of missing data are very large (for example, more than 40%) on important variables, then trial results may only be considered as hypothesis generating results. A rare exception would be if the underlying mechanism behind the missing data can be described as MCAR (see paragraph above).

     

1.3 Options to handle Missingness

When data are ready to be analysed, it should be thoroughly assessed, based on inspection of the data, whether statistical methods ought to be used to handle missing data. Bell et al. aimed to assess handling of missing data in randomised clinical trials published between July and December 2013 in the BMJ, JAMA, Lancet, and New England Journal of Medicine.

  • 95% of the 77 identified trials reported some missing outcome data.
  • The most commonly used method to handle missing data in the primary analysis was:
    ** Complete case analysis (45%).
    ** Single imputation (27%).
    ** Model-based methods (for example, mixed models or generalised estimating equations) (19%).
    ** Multiple imputation (8%).

There are several available methods for dealing with missing data that are MCAR or MAR, but they basically boil down to:

  • Discard data: Complete Case; Available Case; Dropping variable analyses.

  • Retain data:
    ** Mean, median and mode.
    ** Random Sampling Imputation.
    ** Predictive/Statistical models: Linear Regression; Random Forest; kNN (k Nearest Neighbour); Maximum likelihood.

  • Single Imputation vs Multiple Imputation.

More info can be found here:
https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184

 

1.3.1 Discard data

1.3.1.1 Complete Case analyses (list-wise deletion)

Complete case analysis: the most common approach to missing data is to omit the cases with missing data (i.e., observations whose row contains any NA value) and analyse the remaining data. This is the default approach in many statistical software packages, and it may introduce unpredictable bias and discard useful, often hard-won information.

  • A complete case analysis can be appropriate when the number of missing observations is not large, and the missing pattern is MCAR (missing completely at random).
  • If the missing data are MCAR, the complete case analysis will have reduced statistical power due to the reduced sample size, but the observed data will not be biased. For example, suppose data are MCAR across 20 variables and the missingness fraction is 5% for each variable. Complete case analysis will then lose close to two-thirds of the subjects, because fully observed subjects account for only (1 - 0.05)^20 ≈ 36% of the original sample (see the one-line check after this list).
  • When missing data are not MCAR, the complete case analysis estimate of the intervention effect might be biased, i.e., there will often be a risk of overestimation of benefit and underestimation of harm.
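The arithmetic behind the example in the first bullet:

(1 - 0.05)^20   # ≈ 0.358: only ~36% of rows remain fully observed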

Two problems with complete-case analysis:

  • If the units with missing values differ systematically from the completely observed cases, this could bias the complete-case analysis.
  • If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded for the sake of a straightforward analysis.

1.3.1.2 Pairwise (available case analysis — ACA) deletion

  • A related approach is available-case analysis where different aspects of a problem are studied with different subsets of the data, perhaps identified on the basis of what is missing in them.
  • In this case, if there is missing data elsewhere in the data set, the existing values are used. Since a pairwise deletion uses all information observed, it preserves more information than the listwise deletion.
  • Less biased for the MCAR or MAR data. However, if there are many missing observations, the analysis will be deficient.

1.3.1.3 Dropping Variables

If too much data is missing for a variable, deleting that variable (column) from the dataset may be an option. There is no rule of thumb for this; it depends on the situation, and a proper analysis of the data is needed before the variable is dropped altogether. This should be the last option, and we need to check whether model performance improves after the deletion of the variable.

 

1.3.2 Retain data

1.3.2.1 Mean, median and mode

1. Mean
In a mean substitution, the mean value of a variable is used in place of the missing data value for that same variable. This has the benefit of not changing the sample mean for that variable.
The theoretical background of the mean substitution is that the mean is a reasonable estimate for a randomly selected observation from a normal distribution. However, with missing values that are not strictly random, especially in the presence of great inequality in the number of missing values for the different variables, the mean substitution method may lead to inconsistent bias. Distortion of original variance and distortion of co-variance with remaining variables within the dataset are two major drawbacks of this method.

  • The command in the mice package: imp_mean <- mice(dataset, method = "mean", m = 1, maxit = 1)
    To extract the mean-imputed dataset: imp_mean <- complete(imp_mean)
  • For multiple imputation - I will demonstrate with the example later.
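A runnable version of the command above, using the small nhanes demo dataset that ships with mice (my choice of example data):

library(mice)
# Single mean imputation (m = 1, one iteration)
imp_mean <- mice(nhanes, method = "mean", m = 1, maxit = 1, print = FALSE)
head(complete(imp_mean))   # extract the mean-imputed dataset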

2. Median
Median can be used when the variable has a skewed distribution.

3. Mode
The rationale for the mode is to replace missing values with the most frequent value, since this is the most likely occurrence.

 

1.3.2.2 Random Sampling Imputation

Random sampling imputation is in principle similar to mean/median imputation because it aims to preserve the statistical parameters of the original variable, for which data is missing.
Random sampling consists of taking random observations from the pool of available observations and using those randomly extracted values to fill the NAs; one takes as many random draws as there are missing values in the variable. Random sample imputation assumes that the data are missing completely at random (MCAR). If this is the case, it makes sense to substitute the missing values with values extracted from the original variable distribution.
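A minimal sketch (my own illustration) of random sampling imputation for a single variable:

set.seed(123)
x <- c(2.1, NA, 3.5, 4.0, NA, 2.8, 3.1)
# Draw as many random observed values as there are NAs
x[is.na(x)] <- sample(x[!is.na(x)], sum(is.na(x)), replace = TRUE)
x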

 

1.3.2.3 Model: Linear Regression

In regression imputation, the existing variables are used to predict the missing value, and the predicted value is then substituted as if it were an actually observed value. This approach has several advantages: the imputation retains a great deal of data over listwise or pairwise deletion and avoids significantly altering the standard deviation or the shape of the distribution.
However, as with mean substitution, regression imputation adds no novel information, while the sample size is increased and the standard error is reduced.

  • The command in mice package:

Regression: imp_mean <- mice(dataset, method = "norm.predict", m = 1, maxit = 1)
Stochastic regression: imp_mean <- mice(dataset, method = "norm.nob", m = 1, maxit = 1)
To extract the imputed dataset: imp_mean <- complete(imp_mean)

  • For multiple imputation - I will demonstrate with the example later.
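A runnable version of both commands on the nhanes demo data that ships with mice (my choice of example data):

library(mice)
# Deterministic regression imputation: predicted values only
imp_reg  <- mice(nhanes, method = "norm.predict", m = 1, maxit = 1, print = FALSE)
# Stochastic regression imputation: residual error added to the predictions
imp_stoc <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1, print = FALSE)
head(complete(imp_stoc))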

     

1.3.2.4 Model: Bayesian Stochastic regression imputation

With Bayesian stochastic regression imputation, uncertainty is accounted for not only by adding error variance to the predicted values but also by taking into account the uncertainty in estimating the regression coefficients of the imputation model. The Bayesian idea is that there is not one (true) population regression coefficient; rather, the regression coefficients themselves also follow a distribution.

  • The command in the mice package: imp_mean <- mice(dataset, method = "norm", m = 1, maxit = 1)
    To extract the imputed dataset: imp_mean <- complete(imp_mean)

     

1.3.2.5 Model: Random Forest

Random forest is a non-parametric imputation method applicable to various variable types that can work well with data that are missing at random and not missing at random.
Random forest uses multiple decision trees to estimate missing values and outputs out-of-bag (OOB) imputation error estimates. One caveat is that random forest works best with large datasets; using random forest on small datasets runs the risk of overfitting.
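One way to run random-forest imputation is mice’s built-in "rf" method (a sketch on the nhanes demo data; the missForest package is a common alternative):

library(mice)
# Random-forest multiple imputation; requires the randomForest package
imp_rf <- mice(nhanes, method = "rf", m = 5, seed = 123, print = FALSE)
head(complete(imp_rf))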

 

1.3.2.6 Model: kNN

k-NN imputes the missing attribute values based on the K nearest neighbours, where neighbours are determined by a distance measure. Once the K neighbours are found, the missing value is imputed by taking the mean, median or mode of the neighbours’ known values for that attribute.
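A sketch with the kNN() function from the VIM package (loaded earlier), using the sleep demo data that ships with VIM:

library(VIM)
data(sleep, package = "VIM")
sleep_knn <- kNN(sleep, k = 5)   # impute from the 5 nearest neighbours
head(sleep_knn)                  # *_imp columns flag which cells were imputed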

 

1.3.2.7 Model: Maximum likelihood - Expectation-Maximization (ML - EM)

The assumption that the observed data are a sample drawn from a multivariate normal distribution is relatively easy to understand. After the parameters are estimated using the available data, the missing data are estimated based on those newly estimated parameters. Several strategies use the maximum likelihood method to handle missing data.

Expectation-Maximization (EM) is the maximum likelihood method used to create a new data set. All missing values are imputed with values estimated by the maximum likelihood methods.

  • This approach begins with the expectation step, during which the parameters (e.g., variances, covariances, and means) are estimated, perhaps using the listwise deletion.

  • Those estimates are then used to create a regression equation to predict the missing data. The maximization step uses those equations to fill in the missing data.

  • The expectation step is then repeated with the new parameters, where the new regression equations are determined to “fill in” the missing data.

  • The expectation and maximization steps are repeated until the system stabilizes.
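The sketch below is a deliberately simplified illustration of this E/M alternation, my own toy example with one incomplete variable predicted from one complete variable (a full EM for the bivariate normal would also track variances):

set.seed(1)
n  <- 200
x1 <- rnorm(n, 10, 2)                      # fully observed
x2 <- 3 + 0.8 * x1 + rnorm(n)              # partially observed
x2[sample(n, 50)] <- NA

x2_fill <- ifelse(is.na(x2), mean(x2, na.rm = TRUE), x2)   # crude start
for (i in 1:50) {
  fit    <- lm(x2_fill ~ x1)                               # "M-step": estimate parameters
  x2_new <- ifelse(is.na(x2), predict(fit, data.frame(x1 = x1)), x2)  # "E-step"
  if (max(abs(x2_new - x2_fill)) < 1e-8) break             # stop when stabilised
  x2_fill <- x2_new
}
coef(fit)   # parameter estimates after convergence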

     

1.4 Single Imputation

In single imputation analyses, NA values are estimated/replaced one time with one particular data value for the purpose of obtaining more complete samples, at the expense of creating some potential bias in the eventual conclusions or obtaining slightly less accurate estimates than would be available if there were no missing values in the data.

  • A single imputation can be just a replacement with the mean or median (for a quantity) or the mode (for a categorical variable). However, such an approach, though easy to understand, underestimates variance and ignores the relationship of missing values to other variables.

  • Single imputation can also be done using a variety of models to try to capture information about the NA values that are available in other variables within the data set.

  • The simputation package can help us execute single imputations using a wide variety of techniques, within the pipe approach used by the tidyverse. Another approach I have used in the past is the mice package, which can also perform single imputations.
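A sketch of the simputation pipe style, using variable names from the fakestroke data introduced in Section 2 (the model formulas are illustrative choices of mine, not recommendations):

library(simputation)
library(dplyr)
df_si <- df %>%
  impute_median(aspects ~ trt) %>%      # group-wise median for aspects
  impute_lm(sbp ~ age + nihss)          # linear-model imputation for sbp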

     

1.5 Multiple Imputation

1.5.1 Basic concept

Multiple Imputation (MI) is a statistical technique for handling missing data. The key concept of MI is to use the distribution of the observed data to estimate a set of plausible values for the missing data. Random components are incorporated into these estimated values to reflect their uncertainty. Multiple datasets are created and then analysed individually but identically; the resulting parameter estimates are then combined (pooled) into a single set of results.

The benefit of the multiple imputations is that restoring the natural variability of the missing values incorporates the uncertainty due to the missing data, which results in a valid statistical inference.

 

As a flexible way of handling more than one missing variable, a Multiple Imputation by Chained Equations (MICE) approach can be applied. In the MICE algorithm, a chain of regression equations is used to obtain imputations, which means that variables with missing data are imputed one by one. The regression models use information from all other variables in the model, i.e. (conditional) imputation models. In order to add sampling variability to the imputations, residual error is added to create the imputed values. This residual error can either be added to the predicted values directly, which is essentially similar to repeating stochastic regression imputation over several imputation runs, or be added to the parameter estimates of the regression model, i.e. the regression coefficients, which makes it a Bayesian method. The Bayesian method is the default in the mice package in R. Refer to the references for more information about MICE.

#![](/Statistics/R/R data/Model practise/image/3. Dealing with missingness/3.1 Flowchart-handle missing data RCT.jpeg)  

 

1.5.2 Types of multiple imputation

  1. Single variable regression analysis

A single variable regression analysis includes a dependent variable and the stratification variables used in the randomisation. The stratification variables often include a centre indicator if the trial is a multi-centre trial and usually one or more adjusting variables with prognostic information which are correlated with the outcome. When using a continuous dependent variable, a baseline value of the dependent variable may also be included.

  • As mentioned previously, if only the dependent variable has missing values and auxiliary variables are not identified, a complete case analysis should be performed and no specific methods ought to be used to handle the missing data.
  • If auxiliary variables have been identified, a single variable imputation may be performed. If there is significant missingness on the baseline value of a continuous variable, a complete case analysis may provide biased results.

=> Therefore, in all events, a single variable imputation (with or without auxiliary variables included as appropriate) is conducted if only the baseline variable is missing.

  2. Monotonic imputation

If both the dependent variable and the baseline variable are missing and the missingness is monotone, a monotonic imputation is done.

Assume a data matrix where patients are represented by rows and variables by columns. The missingness of such a data matrix is said to be monotone if its columns can be reordered such that for any patient:

  1. if a value is missing all values to the right of its position are also missing, and
  2. if a value is observed all values to the left of this value are also observed.

If the missingness is monotone, the method of multiple imputation is also relatively straightforward, even if more than one variable has missing values. In this case it is relatively simple to impute the missing data using sequential regression imputation where the missing values are imputed for each variable at a time.

  3. Chained equations or the Markov chain Monte Carlo (MCMC) method

If missingness is not monotone, a multiple imputation is conducted using the chained equations or the MCMC method. Auxiliary variables are included in the model if they are available.

 

1.5.3 How many imputations?

One of the distinct advantages of multiple imputation is that it can produce unbiased estimates with correct confidence intervals with a low number of imputed datasets, even as low as m = 2.

The classic advice is to use a low number of imputations, somewhere between 3 and 5, for moderate amounts of missing information. Several authors have investigated the influence of m on various aspects of the results. The picture emerging from this work is that it is often beneficial to set m higher, somewhere in the range of 20-100 imputations.

Theoretically it is always better to use a higher m, but this involves more computation and storage. Setting m very high (say m = 200) may be useful for low-level estimands that are very uncertain and for which we want to approximate the full distribution, or for parameters that are notoriously difficult to estimate, like variance components. On the other hand, setting m high may not be worth the extra wait if the primary interest is in the point estimates (and not in standard errors, p-values, and so on). In that case, using m = 5-20 will be enough under moderate missingness.

Imputing a dataset in practice often involves trial and error to adapt and refine the imputation model. Such initial explorations do not require a large m. It is convenient to set m = 5 during model building, and to increase m only after being satisfied with the model for the “final” round of imputation. If calculation is not prohibitive, we may set m to the average percentage of missing data. The substantive conclusions are unlikely to change as a result of raising m beyond m = 5.

Researchers often assume that the number of imputations needed to generate valid results can be set at 3-5. This idea is based on the work of Rubin (1987). He showed that the precision of a pooled parameter becomes lower when a finite number of multiply imputed datasets is used compared to an infinite number (finite means a limited number of imputed datasets, like 5 imputed datasets; infinite means unlimited, recognisable by the mathematical symbol ∞). The relative efficiency of an estimate based on m imputed datasets, compared with an infinite number, is RE = (1 + FMI/m)^(-1). Accordingly, when 5 imputed datasets are used and the relative efficiency is 0.93, the standard error (SE) based on an infinite number of imputed datasets is √0.93 ≈ 0.96 times the SE based on 5 datasets; equivalently, using only 5 datasets inflates the SE by a factor of about 1.04.

Here FMI is the fraction of missing information and m is the number of imputed datasets. FMI is roughly equal to the percentage of missing data in the simplest case of one variable with missing data; when there are more variables in the imputation model, and these variables are correlated with the variables with missing data, the FMI becomes lower. For FMIs of 0.05, 0.1, 0.2, 0.3 and 0.5, at least 3, 6, 12, 24 and 59 imputed datasets are needed, respectively. Another rule of thumb is that the number of imputed datasets should be at least equal to the percentage of missing cases: when 10% of the subjects have missing values, at least 10 imputed datasets should be generated.
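The SE inflation implied by Rubin’s relative-efficiency formula can be computed directly (a quick sketch):

# SE ratio relative to m = Inf is sqrt(1 + FMI/m)
fmi <- c(0.05, 0.1, 0.2, 0.3, 0.5)
round(sqrt(1 + fmi / 5), 3)    # m = 5: e.g. FMI = 0.5 inflates the SE by ~5%
round(sqrt(1 + fmi / 50), 3)   # m = 50: inflation becomes negligible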

Iterations: Van Buuren (2018) states that the number of iterations may depend on the correlation between variables and the percentage of missing data in the variables. He proposed that 5-20 iterations are enough to reach convergence; this number may be increased when the percentage of missing data is high. Nowadays computers are fast, so a higher number of iterations can easily be used.

It should be clear that this rule is not universally appropriate. Some non-MCAR missing data patterns have FMI greater than the fraction of missing data. Some settings require a greater or smaller degree of reproducibility. Larger numbers of imputations may be required for method comparison studies.
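Convergence of the chains can be checked visually in mice with trace plots (a sketch on the nhanes demo data):

library(mice)
imp <- mice(nhanes, m = 5, maxit = 20, seed = 123, print = FALSE)
plot(imp)   # well-mixed, trend-free lines suggest convergence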

 

1.5.4 Imputation model: Variable selection

Include the covariates and outcome from the analysis model: To avoid bias in the analysis model, the imputation model must include all variables that are in the analysis model. In particular, when imputing missing values of analysis model covariates, the analysis model outcome must be in the imputation model. If the imputed data are to be used to fit several different analysis models, then the imputation model should contain every variable included in any of the analysis models.

Include predictors of the incomplete variable: This is beneficial for two reasons. First, it may make the MAR assumption more plausible and hence reduce bias. Second, adding predictors of the incomplete variable to the imputation model can improve the imputations and hence reduces the standard errors of the estimates for the analysis model.

In practice, one could include in the imputation model all variables that significantly predict the incomplete variable, or whose association with the incomplete variable exceeds some threshold. One might also include any other variables which significantly predict whether the incomplete variable is missing, on the grounds that bias may be avoided if the included variables have a true association with the incomplete variable that fails to reach statistical significance, whereas little loss of precision is likely if the included variables do not predict the incomplete variable.
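mice implements this kind of threshold-based predictor selection in quickpred(); the sketch below (nhanes demo data, mincor chosen arbitrarily) includes predictors whose correlation with the incomplete variable or with its missingness exceeds 0.1:

library(mice)
pred <- quickpred(nhanes, mincor = 0.1)   # build a predictor matrix
pred                                      # rows: targets; columns: predictors
imp <- mice(nhanes, predictorMatrix = pred, seed = 123, print = FALSE)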

 

1.5.5 Workflow of imputation

# ![](/Statistics/R/R data/Model practise/image/3. Dealing with missingness/3.2 Workflow of multiple imputation.jpeg)   


Multiple imputation consists of three steps:

  • Imputation step: we create m complete versions of the data by replacing the missing values with plausible data values; multiple imputation represents multiple sets of plausible values. Missing values are identified and replaced by random samples of plausible values, producing multiple completed datasets generated via some chosen imputation model. Five imputed datasets (m = 5) have traditionally been suggested as sufficient on theoretical grounds, but 50 datasets (or more) seem preferable to reduce sampling variability from the imputation process.

  • Completed-data analysis (estimation) step: The desired analysis is performed separately for each dataset that is generated during the imputation step. Hereby, for example, 50 analysis results are constructed.

  • Pooling step: the results obtained from each completed-data analysis are combined into a single multiple-imputation result. There is no need to conduct a weighted meta-analysis, as all (say) 50 analysis results are considered to have the same statistical weight.

It is of great importance that there is either compatibility between the imputation model and the analysis model or the imputation model is more general than the analysis model (for example, that the imputation model includes more independent covariates than the analysis model). For example, if the analysis model has significant interactions, then the imputation model should include them as well, if the analysis model uses a transformed version of a variable then the imputation model should use the same transformation, etc.

  1. mids workflow using saved objects

The classic workflow in mice runs functions mice(), with() and pool() in succession, each time saving the intermediate result:

library(mice)
imp <- mice(x, seed = 123, print = FALSE)
fit <- with(imp, lm(y ~ z))
est1 <- pool(fit)

The objects imp, fit and est1 have classes mids, mira and mipo, respectively. The classic workflow works because mice contains a with() function that understands how to deal with a mids object. The classic mids workflow has been widely adopted, but there are more possibilities.

  2. mids workflow using pipes

The magrittr package introduced the pipe operator to R. This operator removes the need to save and reread objects, resulting in more compact and more readable code:

library(magrittr)
est2 <- x %>% mice(seed = 123, print = FALSE) %>% with(lm(y ~ z)) %>% pool()

The with() function handles two tasks: filling in the missing data and analysing the data. Splitting these over two separate functions gives the user easier access to the imputed data, and hence is more flexible.

1.5.6 Parameter pooling

Many types of estimates are approximately normally distributed, e.g., means, standard deviations, regression coefficients, proportions and linear predictors.

What about estimates with non-normal distributions, such as correlation coefficients, odds ratios, relative risks, hazard ratios and measures of explained variance? The quality of the pooled estimate and the confidence intervals can be improved when pooling is done on a scale for which the distribution is close to normal. Thus, transformation toward normality and back-transformation into the original scale improve statistical inference.
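For example, correlation coefficients can be pooled on the Fisher z scale (a sketch with made-up estimates; full Rubin’s rules would also pool the within- and between-imputation variances on the z scale):

r_m <- c(0.42, 0.47, 0.44, 0.50, 0.45)   # hypothetical estimates from m = 5 datasets
z_m <- atanh(r_m)                        # Fisher z transform (close to normal)
tanh(mean(z_m))                          # pooled estimate, back-transformed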

More info can be found here: https://stefvanbuuren.name/fimd/sec-pooling.html

 

1.5.7 Multi-parameter inference

Schafer (1997) distinguished three types of statistics in multi-parameter tests: D1 (multivariate Wald test), D2 (combining test statistics) and D3 (likelihood ratio test).
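All three are available in mice as the functions D1(), D2() and D3(); the sketch below compares two nested models on the nhanes2 demo data (my choice of example):

library(mice)
imp  <- mice(nhanes2, seed = 123, print = FALSE)
fit1 <- with(imp, lm(chl ~ age + bmi))   # full model
fit0 <- with(imp, lm(chl ~ bmi))         # restricted model
D1(fit1, fit0)   # multivariate Wald test
D2(fit1, fit0)   # combining test statistics
D3(fit1, fit0)   # likelihood ratio test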

 

1.5.7.1 D1 Multivariate Wald test

# ![](/Statistics/R/R data/Model practise/image/3. Dealing with missingness/3.3 D1 test example.jpeg)   

 

1.5.7.2 D2 Combining Test Statistics

 

1.5.7.3 D3 Likelihood Ratio Test

 

1.5.7.4 Which D ?

If the estimates are approximately normal and if the software can produce the required variance-covariance matrices, we recommend using D1 with an adjustment for small samples if n < 100. D1 is a direct extension of Rubin’s rules to multi-parameter problems: theoretically convincing, mature and widely applied. D1 is insensitive to the assumption of equal fractions of missing information, is well calibrated, works well with small m (unless the fractions of missing information are large and variable) and suffers only a modest loss of power.

If only the test statistics are available for pooling, then the D2 statistic is a good option, provided that the number of imputations m > 20. The test is easy to calculate and applies to different test statistics. For m < 20, the power may be low. D2 tends to become optimistic for high fractions of missing information (> 0.3), and this effect unfortunately increases with sample size. Thus, careless application of D2 to large datasets with many missing values may yield high rates of false positives.

The likelihood ratio statistic D3 is theoretically sound. Calculation of D3 requires refitting the repeated analysis models with the estimates constrained to their pooled values. This was once an issue, but will probably be less so in the future. D3 is asymptotically equivalent to D1 and may be preferred for theoretical reasons: it does not require normality in the complete-data model, it is often more powerful, and it may be more stable if k is large. D3 produces Type 1 error rates comparable to D1. D3 tends to be somewhat conservative in smaller samples, especially with high fractions of missing information and high k, and has lower statistical power in some extreme scenarios. For small samples, D1 has a slight edge over D3, so given the currently available evidence D1 is the better option for n < 200. In larger samples (n ≥ 200), D1 and D3 appear equally good, so the choice between them is mostly a matter of convenience.

1.5.8 Limitations and pitfalls in MICE

There are six limitations of MICE, as follows:

  • Lack of theoretical basis.
  • Perfect prediction.
  • Departures from MAR.
  • Non-convergence.
  • Checking the imputation models.
  • Too many variables.

More information can be found in this paper:
White, I.R., Royston, P. and Wood, A.M., 2011. Multiple imputation using chained equations: issues and guidance for practice. Statistics in medicine, 30(4), pp.377-399.

2 Simulated fakestroke data

Data used in these notes:
fakestroke.csv
(Source: https://github.com/THOMASELOVE/432-data/blob/master/data/fakestroke.csv)

Loading the data and specifying the type of each column:

# Load the data and specify the type of each column
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 
df = data

The fakestroke.csv file contains the following 18 variables for 500 patients:

# To make a table in R markdown: 1st row: header, 2nd row: Alignment; the remaining row: for content
## |Variable |  Description |
## |:------- | :----------  |
## |studyid  | Study ID  # (z001 through z500) | 
| Variable | Description |
|:-------- |:----------- |
| studyid | Study ID # (z001 through z500) |
| trt | Treatment group (Intervention or Control) |
| age | Age in years |
| sex | Male or Female |
| nihss | NIH Stroke Scale Score (can range from 0-42; higher scores indicate more severe neurological deficits) |
| location | Stroke Location - Left or Right Hemisphere |
| hx.isch | History of Ischemic Stroke (Yes/No) |
| afib | Atrial Fibrillation (1 = Yes, 0 = No) |
| dm | Diabetes Mellitus (1 = Yes, 0 = No) |
| mrankin | Pre-stroke modified Rankin scale score (0, 1, 2 or > 2) indicating functional disability; complete range is 0 (no symptoms) to 6 (death) |
| sbp | Systolic blood pressure, in mm Hg |
| iv.altep | Treatment with IV alteplase (Yes/No) |
| time.iv | Time from stroke onset to start of IV alteplase (minutes) if iv.altep = Yes |
| aspects | Alberta Stroke Program Early Computed Tomography score, which measures extent of stroke from 0-10; higher scores indicate fewer early ischemic changes |
| ia.occlus | Intracranial arterial occlusion, based on vessel imaging (five categories) |
| extra.ica | Extracranial ICA occlusion (1 = Yes, 0 = No) |
| time.rand | Time from stroke onset to study randomization, in minutes |
| time.punc | Time from stroke onset to groin puncture, in minutes (only if Intervention) |

A quick look at the simulated data in fakestroke

df
## # A tibble: 500 x 18
##    studyid trt         age sex   nihss locat~1 hx.isch  afib    dm mrankin   sbp
##    <chr>   <chr>     <dbl> <chr> <dbl> <chr>   <chr>   <dbl> <dbl> <chr>   <dbl>
##  1 z001    Control      53 Male     21 Right   No          0     0 2         127
##  2 z002    Interven~    51 Male     23 Left    No          1     0 0         137
##  3 z003    Control      68 Fema~    11 Right   No          0     0 0         138
##  4 z004    Control      28 Male     22 Left    No          0     0 0         122
##  5 z005    Control      91 Male     24 Right   No          0     0 0         162
##  6 z006    Control      34 Fema~    18 Left    No          0     0 2         166
##  7 z007    Interven~    75 Male     25 Right   No          0     0 0         140
##  8 z008    Control      89 Fema~    18 Right   No          0     0 0         157
##  9 z009    Control      75 Male     25 Left    No          1     0 2         129
## 10 z010    Interven~    26 Fema~    27 Right   No          0     0 0         143
## # ... with 490 more rows, 7 more variables: iv.altep <chr>, time.iv <dbl>,
## #   aspects <dbl>, ia.occlus <chr>, extra.ica <dbl>, time.rand <dbl>,
## #   time.punc <dbl>, and abbreviated variable name 1: location
str(df)
## tibble [500 x 18] (S3: tbl_df/tbl/data.frame)
##  $ studyid  : chr [1:500] "z001" "z002" "z003" "z004" ...
##  $ trt      : chr [1:500] "Control" "Intervention" "Control" "Control" ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : chr [1:500] "Male" "Male" "Female" "Male" ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
##  $ hx.isch  : chr [1:500] "No" "No" "No" "No" ...
##  $ afib     : num [1:500] 0 1 0 0 0 0 0 0 1 0 ...
##  $ dm       : num [1:500] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrankin  : chr [1:500] "2" "0" "0" "0" ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
##  $ time.iv  : num [1:500] 63 68 NA 78 121 78 97 NA 49 99 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
##  $ extra.ica: num [1:500] 0 1 1 1 0 0 0 0 0 0 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
##  $ time.punc: num [1:500] NA 281 NA NA NA NA 268 NA NA 259 ...
summary(df)
##    studyid              trt                 age            sex           
##  Length:500         Length:500         Min.   :23.00   Length:500        
##  Class :character   Class :character   1st Qu.:55.00   Class :character  
##  Mode  :character   Mode  :character   Median :65.75   Mode  :character  
##                                        Mean   :64.71                     
##                                        3rd Qu.:76.00                     
##                                        Max.   :96.00                     
##                                                                          
##      nihss         location           hx.isch               afib     
##  Min.   :10.00   Length:500         Length:500         Min.   :0.00  
##  1st Qu.:14.00   Class :character   Class :character   1st Qu.:0.00  
##  Median :18.00   Mode  :character   Mode  :character   Median :0.00  
##  Mean   :18.03                                         Mean   :0.27  
##  3rd Qu.:22.00                                         3rd Qu.:1.00  
##  Max.   :28.00                                         Max.   :1.00  
##                                                                      
##        dm          mrankin               sbp          iv.altep        
##  Min.   :0.000   Length:500         Min.   : 78.0   Length:500        
##  1st Qu.:0.000   Class :character   1st Qu.:128.5   Class :character  
##  Median :0.000   Mode  :character   Median :145.0   Mode  :character  
##  Mean   :0.126                      Mean   :145.5                     
##  3rd Qu.:0.000                      3rd Qu.:162.5                     
##  Max.   :1.000                      Max.   :231.0                     
##                                     NA's   :1                         
##     time.iv          aspects        ia.occlus           extra.ica     
##  Min.   : 42.00   Min.   : 5.000   Length:500         Min.   :0.0000  
##  1st Qu.: 67.00   1st Qu.: 7.000   Class :character   1st Qu.:0.0000  
##  Median : 86.00   Median : 9.000   Mode  :character   Median :0.0000  
##  Mean   : 92.64   Mean   : 8.506                      Mean   :0.2906  
##  3rd Qu.:115.00   3rd Qu.:10.000                      3rd Qu.:1.0000  
##  Max.   :218.00   Max.   :10.000                      Max.   :1.0000  
##  NA's   :55       NA's   :4                           NA's   :1       
##    time.rand       time.punc  
##  Min.   :100.0   Min.   :180  
##  1st Qu.:151.2   1st Qu.:212  
##  Median :201.5   Median :260  
##  Mean   :208.6   Mean   :263  
##  3rd Qu.:257.8   3rd Qu.:313  
##  Max.   :360.0   Max.   :360  
##  NA's   :2       NA's   :267

 
 

3 Multiple Imputation with mice()

3.1 Step 1: Identify categorical or numeric variables

Before constructing a table, I must check which variables are numeric and which are categorical.
For example, in our data:

  • Numeric variable: age; nihss; sbp; time.iv; aspects; time.rand; time.punc.
  • Character variable: sex; location; hx.isch; afib; dm; mrankin; iv.altep; ia.occlus; extra.ica.
  1. Because time.punc applies only to the Intervention group (trt) and time.iv applies only to patients with iv.altep = Yes, I remove these two variables (along with the identifier studyid) from the imputation:
df = data %>% select(-studyid,-time.punc, -time.iv)

 

  2. In the Excel file, afib, dm and extra.ica are coded as “1” and “0”. Therefore, I convert them to “Yes”/“No” categorical variables:
# Changing coding variable to categorical:
df$afib       <- factor(df$afib , levels=0:1, labels=c("No", "Yes"))
df$dm         <- factor(df$dm , levels=0:1, labels=c("No", "Yes"))
df$extra.ica  <- factor(df$extra.ica, levels=0:1, labels=c("No", "Yes"))

 

  3. Define factors for all categorical variables:
  • Before converting:
str(df)
## tibble [500 x 15] (S3: tbl_df/tbl/data.frame)
##  $ trt      : chr [1:500] "Control" "Intervention" "Control" "Control" ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : chr [1:500] "Male" "Male" "Female" "Male" ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
##  $ hx.isch  : chr [1:500] "No" "No" "No" "No" ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : chr [1:500] "2" "0" "0" "0" ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
  • After:
df <- df%>%mutate_if(is.character, as.factor)
str(df) 
## tibble [500 x 15] (S3: tbl_df/tbl/data.frame)
##  $ trt      : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...

 

  4. Reorder the levels of some factor variables:
  • trt: Intervention (baseline) , Control
  • mrankin: 0, 1, 2, > 2.
  • ia.occlus: rearrange the ia.occlus variable to the order presented in Berkhemer et al. (2015).
# Relevel factors:
df$trt = factor(df$trt, levels=c("Intervention", "Control"))
df$mrankin = factor(df$mrankin, levels=c("0", "1", "2", "> 2"))
df$ia.occlus = factor(df$ia.occlus, levels=c("Intracranial ICA", "ICA with M1", 
                                             "M1", "M2", "A1 or A2"))

# Check order of factor after relevel: 
df$trt[1:2]
## [1] Control      Intervention
## Levels: Intervention Control
df$mrankin[1:2]
## [1] 2 0
## Levels: 0 1 2 > 2
df$ia.occlus[1:3]
## [1] M1          M1          ICA with M1
## Levels: Intracranial ICA ICA with M1 M1 M2 A1 or A2

 

3.2 Step 2: Checking missing value (NA)

3.2.1 Step 2.1: Missing data patterns

Before moving on to the specifics of multiple imputation, we should first explore the pattern of missing data in our dataset.

There are several ways to check for missing values (NA) in the dataset:

3.2.1.1 Complete vs incomplete case

# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(df))),
      "% NA" = round(sort(colMeans(is.na(df))) * 100, 2))
##           # NA % NA
## trt          0  0.0
## age          0  0.0
## sex          0  0.0
## nihss        0  0.0
## location     0  0.0
## hx.isch      0  0.0
## afib         0  0.0
## dm           0  0.0
## mrankin      0  0.0
## iv.altep     0  0.0
## sbp          1  0.2
## ia.occlus    1  0.2
## extra.ica    1  0.2
## time.rand    2  0.4
## aspects      4  0.8

Discuss

There are 5 variables with 9 NAs in total.

  • aspects (0.8%) > time.rand (0.4%).

  • sbp = ia.occlus = extra.ica (0.2% each).

     

3.2.1.2 Number and proportion of complete cases

Ncc <- cbind(
  "#" = table(complete.cases(df)),
  "%" = round(100 * table(complete.cases(df))/nrow(df), 2)
)
rownames(Ncc) <- c("incompl.", "complete")
Ncc
##            #    %
## incompl.   8  1.6
## complete 492 98.4

Discuss

Among 500 observations:

  • 8 observations contain NA.

  • 492 observations don't have NA.

     

3.2.1.3 unlist()

p_missing <- unlist(lapply(df, function(x) sum(is.na(x))))/nrow(df)
sort(p_missing[p_missing > 0], decreasing = TRUE)
##   aspects time.rand       sbp ia.occlus extra.ica 
##     0.008     0.004     0.002     0.002     0.002

 

3.2.1.4 plyr package

library(plyr)
ddply(df, c(sbp = "ifelse(is.na(sbp), 'missing', 'observed')",
                aspects = "ifelse(is.na(aspects), 'missing', 'observed')",
                time.rand = "ifelse(is.na(time.rand), 'missing', 'observed')",
                ia.occlus = "ifelse(is.na(ia.occlus), 'missing', 'observed')",
                extra.ica = "ifelse(is.na(extra.ica), 'missing', 'observed')"
                ),
      summarize,
      N = length(sbp)
)
##        sbp  aspects time.rand ia.occlus extra.ica   N
## 1  missing observed  observed  observed  observed   1
## 2 observed  missing  observed  observed  observed   4
## 3 observed observed   missing  observed  observed   2
## 4 observed observed  observed   missing   missing   1
## 5 observed observed  observed  observed  observed 492

Discuss

This command provides more info than the previous ones. Among 500 observations:

  • 8 observations (obs) contain NA:
    ** 1 obs has NA at sbp.
    ** 4 obs have NA at aspects.
    ** 2 obs have NA at time.rand.
    ** 1 obs has NA at both ia.occlus and extra.ica.

  • 492 obs don't have NA.

     

3.2.1.5 Library (mice) (recommended)

Using the following command:
md.pattern(x)

md.pattern(df)

##     trt age sex nihss location hx.isch afib dm mrankin iv.altep sbp ia.occlus
## 492   1   1   1     1        1       1    1  1       1        1   1         1
## 4     1   1   1     1        1       1    1  1       1        1   1         1
## 2     1   1   1     1        1       1    1  1       1        1   1         1
## 1     1   1   1     1        1       1    1  1       1        1   1         0
## 1     1   1   1     1        1       1    1  1       1        1   0         1
##       0   0   0     0        0       0    0  0       0        0   1         1
##     extra.ica time.rand aspects  
## 492         1         1       1 0
## 4           1         1       0 1
## 2           1         0       1 1
## 1           0         1       1 2
## 1           1         1       1 1
##             1         2       4 9

Discuss

  • In each row output:
    ** 1: the variable is complete.
    ** 0: the variable in that pattern contains missing values.
    ** The last row: the total number of missing values for each variable.

  • In each column output:
    ** 1st column on the left (without a column name) shows the number of cases with a specific pattern.
    ** The column on the right : the number of variables that is incomplete in that pattern.

Return to our data result:

  • At each row:
    ** 1st line: 492 obs have full values.
    ** 2nd line: 4 obs have NA at aspects.
    ** 3rd line: 2 obs have NA at time.rand.
    ** 4th line: 1 obs has NA at both ia.occlus and extra.ica.
    ** 5th line: 1 obs has NA at sbp.

  • At each column - variable, total number of missing value:
    ** sbp: 1
    ** ia.occlus: 1
    ** extra.ica: 1
    ** time.rand: 2
    ** aspects: 4

     

3.2.1.6 library(visdat)

Installing this package: devtools::install_github("ropensci/visdat")
Using the following command:
vis_miss(x, cluster = F, sort_miss = F, show_perc = T, show_perc_col = T)

  • cluster = F (default): use hierarchical clustering (mcquitty method) to arrange rows according to missingness.
  • sort_miss = F (default): arranges the columns in order of missingness.
  • show_perc = T (default): adds % of missing/complete data in the whole dataset into the legend.
  • show_perc_col = T (default): adds in the % missing data in a given column into the x axis.
vis_miss(df, cluster = T, sort_miss = T, show_perc = T, show_perc_col = T)

vis_dat(df, sort_type = T)

Discuss

  • More visualization.

  • Hard to identify the number of missing values in each variable.

     

3.2.1.7 library(VIM) (recommended)

Using the following command:
aggr(x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis = .7, gap = 3)

  • plot: whether the results should be plotted.
  • numbers: whether the proportion or frequencies of the different combinations should be represented by numbers.
  • prop: whether the proportion of missing/imputed values should be used rather than the total amount.
  • combined: a logical indicating whether the two plots should be combined.
  • sortVars: whether the variables should be sorted by the number of missing/imputed values.
aggr(df, plot = F, numbers = T, prop = T)
## 
##  Missings in variables:
##   Variable Count
##        sbp     1
##    aspects     4
##  ia.occlus     1
##  extra.ica     1
##  time.rand     2
aggr(df, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)

## 
##  Variables sorted by number of missings: 
##   Variable Count
##    aspects 0.008
##  time.rand 0.004
##        sbp 0.002
##  ia.occlus 0.002
##  extra.ica 0.002
##        trt 0.000
##        age 0.000
##        sex 0.000
##      nihss 0.000
##   location 0.000
##    hx.isch 0.000
##       afib 0.000
##         dm 0.000
##    mrankin 0.000
##   iv.altep 0.000
aggr(df, plot = T, numbers = T, prop = T, combined= T, sortVars = F)

Discuss

  • We can see there are 5 variables with missing values:
    ** 1 obs has NA at both ia.occlus and extra.ica.

  • Among these variables:
    ** Numeric variables: sbp, aspects, time.rand.
    ** Categorical variables: ia.occlus, extra.ica.

  • The total proportion of missing values across these 5 variables is < 5% => complete case analysis can be applied. However, in this example, I will still use single imputation and multiple imputation to demonstrate how they work.

     

Summary
I recommend using md.pattern() from mice and/or the aggr() function from VIM for observing missing values:

  • The code above calculates what percentage of data is missing. In general, a quick look at such a table warns us about variables that have more than 25% missing; it is useful to remove such variables from the dataset first, as they might mess up the imputation. (No variable here comes close to that threshold.)

  • There are 5 variables with missing values:

  • Numeric variables: sbp, aspects, time.rand.

  • Categorical variables: ia.occlus, extra.ica.

  • The outcome variables of this project: nihss ; mrankin.

  • The missing data patterns show that the total proportion of missing values is < 5% => complete case analysis could be applied.
    => However, for demonstration, I still continue with the imputation practice.

     

3.2.2 Step 2.2: Missing data evaluation

3.2.2.1 Little’s MCAR test in R

Little's MCAR test is available in the naniar package as the mcar_test function. To apply the test, we select only the continuous variables, using the same dataset as in the previous paragraph. If the p-value for the test were significant, it would indicate that the missingness does not seem to be completely at random.

  • Choosing only continuous variables:
df_select <- df[,c("age", "nihss", "sbp","aspects" ,"time.rand" )]
library(naniar)
mcar_test(data=df_select)
## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1      18.4    12   0.104                4
  • Choosing the whole dataset (not suitable because the dataset contains categorical variables):
mcar_test(data = df)
## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1      64.3    55   0.183                5

Discuss
The p-values for both tests are > 0.05 => we fail to reject the null hypothesis.
=> The missingness is consistent with MCAR.
=> This agrees with the missing data patterns (< 5%): the missingness is assumed not to matter for the analysis.

 

3.2.2.2 Compare and test group comparisons

In R, the group of non-responders (i.e., participants with missing observations) can also be compared to the responders (i.e., participants without missing observations). For this we can use the missing_compare function of the finalfit package.
First we copy the variables with missing data and rename the copies: the continuous variables sbp, aspects and time.rand become sbp_MAR, aspects_MAR and time.rand_MAR, respectively.

The missing_compare function uses the Kruskal-Wallis test for continuous variables and the chi-squared test for categorical variables. (Note: the outcomes of concern in this research are mrankin and nihss.)

In this example, I will demonstrate the command to compare groups on a numeric variable, time.rand. Similar commands can be used for the other numeric variables:

df_select <- df[,c("age", "nihss", "sbp","aspects" ,"time.rand", 
                   "trt", "mrankin", "ia.occlus", "extra.ica")]

# Copy variable with missing data and rename 
df_select$sbp_MAR <- df_select$sbp
df_select$aspects_MAR <- df_select$aspects
df_select$time.rand_MAR <- df_select$time.rand
df_select$extra.ica_MAR <- df_select$extra.ica

library(finalfit)

explanatory = c("age", "nihss", "sbp" ,"aspects")
dependent = "time.rand_MAR"
df_select %>% 
  missing_compare(dependent, explanatory) %>% 
    knitr::kable(row.names=FALSE, align = c("l", "l", "r", "r", "r"), 
        caption = "Mean comparisons between values of responders (Not missing) and 
        non-responders (Missing) on the time.rand variable.") 
Mean comparisons between values of responders (Not missing) and non-responders (Missing) on the time.rand variable.

Missing data analysis: time.rand_MAR    Not missing     Missing       p
age       Mean (SD)                     64.6 (17.0)     94.0 (1.4)    0.015
nihss     Mean (SD)                     18.0 (4.7)      18.0 (1.4)    0.994
sbp       Mean (SD)                     145.6 (25.1)    105.0 (8.5)   0.022
aspects   Mean (SD)                     8.5 (1.6)       9.0 (1.4)     0.653

 

I can also compare groups for the extra.ica variable.

explanatory = c("trt", "mrankin")
dependent = "extra.ica_MAR"
df_select %>% 
  missing_compare(dependent, explanatory) %>% 
    knitr::kable(row.names=FALSE, align = c("l", "l", "r", "r", "r"), 
        caption = "Mean comparisons between values of responders (Not missing) and 
        non-responders (Missing) on the extra.ica variable.") 
Mean comparisons between values of responders (Not missing) and non-responders (Missing) on the extra.ica variable.

Missing data analysis: extra.ica_MAR    Not missing     Missing    p
trt       Intervention                  233 (100.0)     0 (0.0)    1.000
          Control                       266 (99.6)      1 (0.4)
mrankin   0                             403 (99.8)      1 (0.2)    0.971
          1                             50 (100.0)      0 (0.0)
          2                             25 (100.0)      0 (0.0)
          > 2                           21 (100.0)      0 (0.0)

Discuss

  • For each variable with missing data, an indicator variable (missing vs. not missing) is defined, and the group means of the other variables are compared across this indicator and tested for significance (using the Kruskal-Wallis and Chi-squared tests mentioned above). The variables compared here are age, nihss, sbp, aspects, trt and mrankin.
  • A drawback of this table is that it does not show the number of missing values (n); sometimes there is only 1 missing value. A quick count per variable is sketched below.
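Since the table omits the missing counts, a quick tally per variable fills that gap. A minimal sketch (naniar is already loaded; the second line is the base-R equivalent):

miss_var_summary(df_select)   # n_miss and pct_miss for each variable
colSums(is.na(df_select))     # base R: number of NAs per variable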

On the time.rand_MAR variable, the patients that have observed values (column not missing) differ significantly from patients with missing values (column Missing) on age (P = 0.015) and sbp (P=0.022).

  • For age variable:
    ** When we look at the mean of the age variable, we see that the mean of patients with missing values is higher compared to the mean of patients with observed scores. This means that there is a higher probability of missing data for patients with higher age scores.
    ** If time.rand and age are correlated, the missing values on the time.rand variable can also be explained by the age variable.
  • For sbp variable:
    ** There is a higher probability of missing data for patients with lower sbp scores.
    ** If time.rand and sbp are correlated, the missing values on the time.rand variable can also be explained by the sbp variable.

On the extra.ica variable, the patients that have observed values (column not missing) don’t differ significantly from patients with missing values (column Missing) on trt and mrankin.

 

3.3 Step 3: Distribution of the data

Before imputing the missing data it is important to take a look at how the incomplete variables are distributed so that we can choose appropriate imputation models. Pay attention to:

  • Whether continuous distributions deviate considerably from the normal distribution.
  • Whether variables have values close to the limits of the range they are defined in.
  • Whether categorical variables are very unbalanced (i.e., some categories are very small).
# One option that allows us to quickly get an overview of all variables,
# categorical and continuous (replace the "df" dataset in the code below)

nc <- max(5, ceiling(sqrt(ncol(df))))
nr <- ceiling(ncol(df) / nc)
par(mfrow = c(nr, nc), mgp = c(2, 0.6, 0), mar = c(2, 3, 3, 0.5))
for (i in 1:ncol(df)) {
  if (is.numeric(df[, i])) {
    hist(df[, i], nclass = 50, xlab = "",
         main = paste0(names(df[i]), " (",
                       round(mean(is.na(df[, i])) * 100, 2), "% NA)")
    )
  } else {
    barplot(table(df[, i]), ylab = "Frequency",
            main = paste0(names(df[i]), " (",
                          round(mean(is.na(df[, i])) * 100, 2), "% NA)"))
  }
}

 

3.4 Step 4: Multiple Imputation mice package

Package used: mice

Usage:
mice(dataset, method = ..., predictorMatrix = ..., m = ..., maxit = ..., seed = 1234, printFlag = FALSE)

  • method: The method for imputation. Type the command methods(mice) for more options, or use the default method.
  • predictorMatrix: Which variables are used to predict the other variables.
  • m: Number of multiple imputation datasets.
  • maxit: A scalar giving the number of iterations.
  • printFlag = F: By default, the mice function prints information about the iteration and imputation steps of the imputed variables under the columns named "iter", "imp" and "variable" respectively. This information can be turned off by setting printFlag = FALSE, which results in silent computation of the missing values.

3.4.1 Guidelines for the Imputation model

There are several guidelines that can be used to customize the imputation model for each variable by the predictor matrix. To summarize:

  • Include all variables that are part of the analysis model, including the dependent (outcome) variable.
  • Include the variables in the same way in the imputation model as they appear in the analysis model (i.e., if interaction terms are in the analysis model, also include them in the imputation model).
  • Include additional (auxiliary) variables that are related to missingness or to variables with missing values.
  • Be careful when including variables with a high correlation (> 0.90).
  • Be careful with variables with high percentages of missing data (> 50%). Quick checks for these last two points are sketched below.
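The last two guidelines can be checked quickly before setting up the imputation model. A minimal sketch, assuming df is the data.frame used throughout this section:

# Pairwise correlations among the numeric variables (flag |r| > 0.90)
num <- df[sapply(df, is.numeric)]
round(cor(num, use = "pairwise.complete.obs"), 2)
# Percentage of missing values per variable (flag > 50%)
round(colMeans(is.na(df)) * 100, 1)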

3.4.2 Step 4.1: Set-up run

First, do a set-up run of mice(), without any iterations (maxit = 0). This is to get all the objects that we may need to adjust in order to customize the imputation to our data.

imp0 <- mice(df, maxit = 0, print = F)

3.4.3 Step 4.2: Imputation method

There are many imputation methods available in mice. You can find the list in the help page of the mice() function. We will focus here on the following ones:

Name      Description                       Variable type
pmm       Predictive mean matching          any
norm      Bayesian linear regression        numeric
logreg    Logistic regression               factor, 2 levels
polyreg   Polytomous logistic regression    factor, >= 2 levels
polr      Proportional odds model           ordered, >= 2 levels

  1. The default imputation methods that mice() selects can be specified in the argument defaultMethod.

If unspecified, mice will use:

  • pmm for numeric columns
  • logreg for factor columns with two categories
  • polyreg for unordered factor columns with more than two categories
  • polr for ordered factor columns with more than two categories.
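Which default mice picks for a column depends on the column's class, so a quick look at the classes predicts the method vector. A small sketch:

# Class of each column (ordered factors show as "ordered/factor")
sapply(df, function(x) paste(class(x), collapse = "/"))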

For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:

imp0$meth
##       trt       age       sex     nihss  location   hx.isch      afib        dm 
##        ""        ""        ""        ""        ""        ""        ""        "" 
##   mrankin       sbp  iv.altep   aspects ia.occlus extra.ica time.rand 
##        ""     "pmm"        ""     "pmm" "polyreg"  "logreg"     "pmm"

Discuss

  • "": The variable is complete and therefore not imputed, which is denoted by the empty string.

  • pmm: sbp; aspects; time.rand

  • polyreg: ia.occlus

  • logreg: extra.ica

     

When a normal imputation model seems appropriate for most of the continuous covariates, you may want to specify norm as the default method in the set-up run. The order of the variable types in defaultMethod is: continuous, binary, unordered factor, ordered factor.

imp0 <- mice(df, maxit = 0, 
             defaultMethod = c("norm", 'logreg', 'polyreg', 'polr'))

 

  1. Looking at the histograms we made for the continuous variables, we can see that the variables sbp, aspects and time.rand have skewed distributions, so using a normal imputation model may not work well.

Notice that mice has set the imputation method for the variable extra.ica to logreg, which implements multiple imputation by logistic regression. An up-to-date overview of the methods in mice can be found by:

methods(mice)
##  [1] mice.impute.2l.bin              mice.impute.2l.lmer            
##  [3] mice.impute.2l.norm             mice.impute.2l.pan             
##  [5] mice.impute.2lonly.mean         mice.impute.2lonly.norm        
##  [7] mice.impute.2lonly.pmm          mice.impute.cart               
##  [9] mice.impute.jomoImpute          mice.impute.lasso.logreg       
## [11] mice.impute.lasso.norm          mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm   mice.impute.lda                
## [15] mice.impute.logreg              mice.impute.logreg.boot        
## [17] mice.impute.mean                mice.impute.midastouch         
## [19] mice.impute.mnar.logreg         mice.impute.mnar.norm          
## [21] mice.impute.norm                mice.impute.norm.boot          
## [23] mice.impute.norm.nob            mice.impute.norm.predict       
## [25] mice.impute.panImpute           mice.impute.passive            
## [27] mice.impute.pmm                 mice.impute.polr               
## [29] mice.impute.polyreg             mice.impute.quadratic          
## [31] mice.impute.rf                  mice.impute.ri                 
## [33] mice.impute.sample              mice.mids                      
## [35] mice.theme                     
## see '?methods' for accessing help and source code

 

Let's explicitly set the imputation method for sbp, aspects and time.rand so that they are imputed with predictive mean matching (pmm):

# Set the entries of the method vector to the specified imputation methods
meth <- imp0$meth
meth[c("sbp", "aspects", "time.rand")] <- "pmm"
meth
##       trt       age       sex     nihss  location   hx.isch      afib        dm 
##        ""        ""        ""        ""        ""        ""        ""        "" 
##   mrankin       sbp  iv.altep   aspects ia.occlus extra.ica time.rand 
##        ""     "pmm"        ""     "pmm" "polyreg"  "logreg"     "pmm"

Discuss
After exploring the data distribution, the methods we specified turn out to be the same as the defaults that mice recommended (see the first imp0$meth output under Imputation method). However, we shouldn't rely completely on the program; we should check the data distribution and use our own judgement.

 

3.4.4 Step 4.3: Predictor matrix

The predictor matrix tells us which variables are used to predict plausible values for the incomplete variables (1 means a variable is used to predict another variable, 0 otherwise).

  • Each variable in the data has a row and a column in the predictor matrix. A value of 1 indicates that the column variable is used to impute the row variable. For example, a 1 at entry [row, column] = [age, trt] indicates that the variable trt is used to impute the incomplete variable age.
  • Note that the diagonal is zero, because a variable is not allowed to impute itself.
  • Setting a variable's column to all zeros removes it from the set of predictors, but still leaves it to be imputed by the other variables.

mice gives you complete control over the predictor matrix, enabling you to choose your own predictor relations. This can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about relations in the data at hand. You can use mice() to give you the initial predictor matrix, and change it afterwards, without running the algorithm.

There is a special function called quickpred() for a quick selection procedure of predictors, which can be handy for datasets containing many variables. See ?quickpred for more info. For example, predictors can be selected from the data relations with a minimum correlation of ρ = .1:

ini <- mice(df, pred=quickpred(df, mincor=.1), print=F)  
ini$pred 
##           trt age sex nihss location hx.isch afib dm mrankin sbp iv.altep
## trt         0   0   0     0        0       0    0  0       0   0        0
## age         0   0   0     0        0       0    0  0       0   0        0
## sex         0   0   0     0        0       0    0  0       0   0        0
## nihss       0   0   0     0        0       0    0  0       0   0        0
## location    0   0   0     0        0       0    0  0       0   0        0
## hx.isch     0   0   0     0        0       0    0  0       0   0        0
## afib        0   0   0     0        0       0    0  0       0   0        0
## dm          0   0   0     0        0       0    0  0       0   0        0
## mrankin     0   0   0     0        0       0    0  0       0   0        0
## sbp         0   0   0     0        0       0    0  1       0   0        0
## iv.altep    0   0   0     0        0       0    0  0       0   0        0
## aspects     0   0   0     0        0       0    1  0       0   0        0
## ia.occlus   0   0   0     1        0       0    0  0       0   0        0
## extra.ica   0   0   0     0        0       0    0  0       0   0        0
## time.rand   0   1   0     0        0       0    0  0       0   1        0
##           aspects ia.occlus extra.ica time.rand
## trt             0         0         0         0
## age             0         0         0         0
## sex             0         0         0         0
## nihss           0         0         0         0
## location        0         0         0         0
## hx.isch         0         0         0         0
## afib            0         0         0         0
## dm              0         0         0         0
## mrankin         0         0         0         0
## sbp             0         0         0         0
## iv.altep        0         0         0         0
## aspects         0         0         0         0
## ia.occlus       1         0         1         0
## extra.ica       1         1         0         0
## time.rand       0         0         0         0

 

Let's return to our dataset:

  • The original (unmodified) predictor matrix:
pred <- imp0$predictorMatrix
pred
##           trt age sex nihss location hx.isch afib dm mrankin sbp iv.altep
## trt         0   1   1     1        1       1    1  1       1   1        1
## age         1   0   1     1        1       1    1  1       1   1        1
## sex         1   1   0     1        1       1    1  1       1   1        1
## nihss       1   1   1     0        1       1    1  1       1   1        1
## location    1   1   1     1        0       1    1  1       1   1        1
## hx.isch     1   1   1     1        1       0    1  1       1   1        1
## afib        1   1   1     1        1       1    0  1       1   1        1
## dm          1   1   1     1        1       1    1  0       1   1        1
## mrankin     1   1   1     1        1       1    1  1       0   1        1
## sbp         1   1   1     1        1       1    1  1       1   0        1
## iv.altep    1   1   1     1        1       1    1  1       1   1        0
## aspects     1   1   1     1        1       1    1  1       1   1        1
## ia.occlus   1   1   1     1        1       1    1  1       1   1        1
## extra.ica   1   1   1     1        1       1    1  1       1   1        1
## time.rand   1   1   1     1        1       1    1  1       1   1        1
##           aspects ia.occlus extra.ica time.rand
## trt             1         1         1         1
## age             1         1         1         1
## sex             1         1         1         1
## nihss           1         1         1         1
## location        1         1         1         1
## hx.isch         1         1         1         1
## afib            1         1         1         1
## dm              1         1         1         1
## mrankin         1         1         1         1
## sbp             1         1         1         1
## iv.altep        1         1         1         1
## aspects         0         1         1         1
## ia.occlus       1         0         1         1
## extra.ica       1         1         0         1
## time.rand       1         1         1         0

 

  • If we want to modify something, e.g. eliminate hx.isch from the set of predictors:
# pred[, "hx.isch"] <- 0   # eliminate hx.isch as a predictor for all variables
# pred["trt", "age"] <- 0  # row trt - column age = 0: eliminate age from the prediction of trt
# pred

What if we wanted to impute variables separately, or use one variable to impute the others? Info can be found here:
https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#visit_sequence
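For example, to restrict the imputation of one variable to a hand-picked set of predictors, zero out its row in the predictor matrix and switch on only the chosen columns. A sketch in the same commented style as above (not run; age and sbp are chosen only for illustration):

# pred["time.rand", ] <- 0                 # drop all predictors for time.rand
# pred["time.rand", c("age", "sbp")] <- 1  # keep only age and sbp as predictors
# pred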

 

3.4.5 Step 4.4: Visit sequence

Since the predictor matrix does not require the variables to be imputed in any particular order, nothing needs to change in the visit sequence.
https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#visit_sequence

visSeq <- imp0$visitSequence
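If a particular order did matter, the visit sequence could be adjusted before running mice(). A sketch (not run; mice() also accepts the keyword "monotone", which sorts the variables from low to high proportion of missingness):

# visSeq_rev <- rev(visSeq)                   # impute in reverse column order
# imp <- mice(df, visitSequence = "monotone", maxit = 0)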

 

3.4.6 Step 4.5: Number of Imputed datasets and iterations

Researchers often assume that the number of imputations needed to generate valid results can be set at 3-5. This idea was based on the work of Rubin (Rubin (1987)). He showed that the precision of a pooled parameter becomes lower when a finite number of multiply imputed datasets is used compared to an infinite number (finite means a limited number of imputed datasets, like 5, and infinite means unlimited, recognizable by the mathematical symbol ∞). More precisely, the standard error (SE) based on m imputed datasets is √(1 + FMI/m) times as large as the SE based on an infinite number of imputed datasets.

Here FMI is the fraction of missing information and m is the number of imputed datasets. In the simplest case of one variable with missing data, FMI is roughly equal to the percentage of missing data. When there are more variables in the imputation model, and these variables are correlated with the variables with missing data, the FMI becomes lower. For FMIs of 0.05, 0.1, 0.2, 0.3 and 0.5, the following numbers of imputed datasets are needed: ≥ 3, 6, 12, 24 and 59, respectively. Another rule of thumb is that the number of imputed datasets should be at least equal to the percentage of missing cases. This means that when 10% of the subjects have missing values, at least 10 imputed datasets should be generated.
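Rubin's result can be verified with a one-line computation: the SE based on m imputed datasets is √(1 + FMI/m) times the SE for an infinite number of datasets. A worked sketch:

sqrt(1 + 0.3 / 5)    # FMI = 0.3, m = 5  -> ~1.03 (about 3% larger SE)
sqrt(1 + 0.3 / 24)   # FMI = 0.3, m = 24 -> ~1.01 (about 1% larger SE)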

Iterations: Van Buuren (Van Buuren (2018)) states that the number of iterations may depend on the correlation between variables and the percentage of missing data in the variables. He proposed that 5-20 iterations are enough to reach convergence. This number may be adjusted upwards when the percentage of missing data is high. Nowadays computers are fast, so a higher number of iterations can easily be used.

Based on the above information, I decide to choose m = 5 and maxit = 10.

More info can be found here:
https://bookdown.org/mwheymans/bookmi/multiple-imputation.html#number-of-imputed-datasets-and-iterations

 

3.4.7 Step 4.6: Run the imputation

With the changes that we have made to the predictorMatrix and method, we can now perform the imputation. Use m = 5 and maxit = 10.

  • m: Number of multiple imputation datasets.
    ** Set m = 5: Generate 5 imputed datasets.

  • maxit: A scalar giving the number of iterations.
    ** Set maxit = 10: 10 iterations for each imputed dataset.

  • printFlag = F: By default, the mice function prints information about the iteration and imputation steps of the imputed variables under the columns named "iter", "imp" and "variable" respectively. This information can be turned off by setting printFlag = FALSE, which results in silent computation of the missing values.

imp <- mice(df, method = meth, 
            predictorMatrix = pred,
            visitSequence = visSeq,
            maxit = 10, m = 5,
            seed = 1234,
            printFlag = FALSE)

With printFlag = TRUE, mice() prints the name of the variable being imputed for each iteration and imputation; here we suppressed this output. If you run mice() with live output on your own computer, you may notice that imputation is slowest for categorical variables, especially when they have many categories.

3.5 Step 5*: Retrieve or continue the imputation?

Retrieve the imputation results: take the 1st and 2nd of the multiply imputed datasets:

df_complete1 = complete(imp, action = 1)
df_complete2 = complete(imp, action = 2)

Compare to the original dataset:

diffdf(df_complete1,df)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##       sbp             1         
##     aspects           4         
##    ia.occlus          1         
##    extra.ica          1         
##    time.rand          2         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##      sbp          131       121    <NA>   
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##    aspects         19        10    <NA>   
##    aspects        169         9    <NA>   
##    aspects        201         8    <NA>   
##    aspects        222        10    <NA>   
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    ia.occlus       404        M1    <NA>   
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    extra.ica       404        No    <NA>   
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    time.rand       434       110    <NA>   
##    time.rand       463       120    <NA>   
##   -----------------------------------------

 

Compare the two imputed datasets that we just retrieved:

diffdf(df_complete1,df_complete2)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##       sbp             1         
##     aspects           3         
##    ia.occlus          1         
##    extra.ica          1         
##    time.rand          2         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##      sbp          131       121     153   
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##    aspects         19        10      6    
##    aspects        169         9      8    
##    aspects        201         8      9    
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =============================================
##    VARIABLE   ..ROWNUMBER..  BASE    COMPARE   
##   ---------------------------------------------
##    ia.occlus       404        M1   ICA with M1 
##   ---------------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    extra.ica       404        No     Yes   
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    time.rand       434       110     156   
##    time.rand       463       120     252   
##   -----------------------------------------

 

Discuss

  • The 1st and 2nd imputations are different.

  • At this step, you can retrieve your data for further analysis, or continue with the complete-data analysis + pooling (a minimal sketch of the pooled route is shown below).
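The pooled route follows mice's standard with() + pool() workflow. A minimal sketch (the model below is only an illustration, not the analysis model of this project):

fit_demo <- with(imp, lm(nihss ~ age + sbp))  # fit the model in each imputed dataset
summary(pool(fit_demo))                       # pool the estimates with Rubin's rules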

     

3.6 Step 6: Complete-data analysis

A summary of the imputation results can be obtained by calling the imp object.

imp
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##       trt       age       sex     nihss  location   hx.isch      afib        dm 
##        ""        ""        ""        ""        ""        ""        ""        "" 
##   mrankin       sbp  iv.altep   aspects ia.occlus extra.ica time.rand 
##        ""     "pmm"        ""     "pmm" "polyreg"  "logreg"     "pmm" 
## PredictorMatrix:
##          trt age sex nihss location hx.isch afib dm mrankin sbp iv.altep
## trt        0   1   1     1        1       1    1  1       1   1        1
## age        1   0   1     1        1       1    1  1       1   1        1
## sex        1   1   0     1        1       1    1  1       1   1        1
## nihss      1   1   1     0        1       1    1  1       1   1        1
## location   1   1   1     1        0       1    1  1       1   1        1
## hx.isch    1   1   1     1        1       0    1  1       1   1        1
##          aspects ia.occlus extra.ica time.rand
## trt            1         1         1         1
## age            1         1         1         1
## sex            1         1         1         1
## nihss          1         1         1         1
## location       1         1         1         1
## hx.isch        1         1         1         1

 

The original data can be found in:

# imp$data

3.6.1 Step 6.1: Type of class

mice() does not return a data.frame. Find out the class of the object returned by the mice() function using class(), and take a look at the help file for this class: https://www.rdocumentation.org/packages/mice/versions/2.46.0/topics/mids-class

class(imp)
## [1] "mids"

3.6.2 Step 6.2: Elements of the mids object

It is good practice to make sure that mice() has not done any processing of the data that was not planned or that you are not aware of. This means checking the loggedEvents, but also ensuring that the correct method, predictorMatrix and visitSequence were used.

Do these checks for imp.

imp$method
##       trt       age       sex     nihss  location   hx.isch      afib        dm 
##        ""        ""        ""        ""        ""        ""        ""        "" 
##   mrankin       sbp  iv.altep   aspects ia.occlus extra.ica time.rand 
##        ""     "pmm"        ""     "pmm" "polyreg"  "logreg"     "pmm"
imp$predictorMatrix
##           trt age sex nihss location hx.isch afib dm mrankin sbp iv.altep
## trt         0   1   1     1        1       1    1  1       1   1        1
## age         1   0   1     1        1       1    1  1       1   1        1
## sex         1   1   0     1        1       1    1  1       1   1        1
## nihss       1   1   1     0        1       1    1  1       1   1        1
## location    1   1   1     1        0       1    1  1       1   1        1
## hx.isch     1   1   1     1        1       0    1  1       1   1        1
## afib        1   1   1     1        1       1    0  1       1   1        1
## dm          1   1   1     1        1       1    1  0       1   1        1
## mrankin     1   1   1     1        1       1    1  1       0   1        1
## sbp         1   1   1     1        1       1    1  1       1   0        1
## iv.altep    1   1   1     1        1       1    1  1       1   1        0
## aspects     1   1   1     1        1       1    1  1       1   1        1
## ia.occlus   1   1   1     1        1       1    1  1       1   1        1
## extra.ica   1   1   1     1        1       1    1  1       1   1        1
## time.rand   1   1   1     1        1       1    1  1       1   1        1
##           aspects ia.occlus extra.ica time.rand
## trt             1         1         1         1
## age             1         1         1         1
## sex             1         1         1         1
## nihss           1         1         1         1
## location        1         1         1         1
## hx.isch         1         1         1         1
## afib            1         1         1         1
## dm              1         1         1         1
## mrankin         1         1         1         1
## sbp             1         1         1         1
## iv.altep        1         1         1         1
## aspects         0         1         1         1
## ia.occlus       1         0         1         1
## extra.ica       1         1         0         1
## time.rand       1         1         1         0
imp$visitSequence
##  [1] "trt"       "age"       "sex"       "nihss"     "location"  "hx.isch"  
##  [7] "afib"      "dm"        "mrankin"   "sbp"       "iv.altep"  "aspects"  
## [13] "ia.occlus" "extra.ica" "time.rand"
imp$loggedEvents
## NULL
imp$pad$data
## NULL

 

Discuss

  • Everything is exactly as we expected. If something goes wrong, you can look at the following link for solutions:

https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#logged_events

 

3.6.3 Step 6.3: Logged events

Let's see what else would come up if we change maxit from 10 to 5. We re-run the imputation with maxit = 5 and the same seed:

 

Take a look at the loggedEvents of impnaive.

impnaive <- mice(df, method = meth, 
            predictorMatrix = pred,
            visitSequence = visSeq,
            maxit = 5, m = 5,
            seed = 1234,
            printFlag = FALSE)

Extracting the imputation: Take the 1st and 2nd imputation:

df_naivecomplete1 = complete(impnaive, action = 1)

Compare with the first completed dataset from the maxit = 10 run (df_complete1):

diffdf(df_complete1, df_naivecomplete1)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##       sbp             1         
##     aspects           1         
##    ia.occlus          1         
##    time.rand          2         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##      sbp          131       121     89    
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##    aspects        169        9      10    
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =============================================
##    VARIABLE   ..ROWNUMBER..  BASE    COMPARE   
##   ---------------------------------------------
##    ia.occlus       404        M1   ICA with M1 
##   ---------------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    time.rand       434       110     237   
##    time.rand       463       120     186   
##   -----------------------------------------
impnaive$loggedEvents
## NULL
impnaive$pad$categories
## NULL

Discuss

  • With maxit = 5 instead of 10, several imputed values differ from those of the maxit = 10 run, which is expected: the chains are stopped at a different point, so the random draws differ.

  • loggedEvents is again NULL, so mice logged no problems during this run.

3.6.4 Step 6.4: Convergence

The mean and variance of the imputed values per iteration and variable is stored in the elements chainMean and chainVar of the mids object.

imp$chainMean
## , , Chain 1
## 
##                1     2      3      4     5      6      7     8      9     10
## trt          NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## age          NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## sex          NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## nihss        NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## location     NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## hx.isch      NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## afib         NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## dm           NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## mrankin      NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## sbp       166.00 119.0 148.00 139.00  89.0 110.00 127.00 161.0 121.00 121.00
## iv.altep     NaN   NaN    NaN    NaN   NaN    NaN    NaN   NaN    NaN    NaN
## aspects     8.75   9.0   8.25   9.75   9.5   9.25   8.25   9.0   8.75   9.25
## ia.occlus   4.00   3.0   3.00   4.00   2.0   2.00   3.00   3.0   3.00   3.00
## extra.ica   1.00   1.0   1.00   1.00   1.0   1.00   1.00   1.0   1.00   1.00
## time.rand 199.50 161.5 243.00 188.00 211.5 222.00 174.00 244.5 231.50 115.00
## 
## , , Chain 2
## 
##                1     2      3     4      5   6     7      8      9     10
## trt          NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## age          NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## sex          NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## nihss        NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## location     NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## hx.isch      NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## afib         NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## dm           NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## mrankin      NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## sbp       144.00 157.0 158.00 131.0 126.00 130 130.0 127.00 166.00 153.00
## iv.altep     NaN   NaN    NaN   NaN    NaN NaN   NaN    NaN    NaN    NaN
## aspects     8.75   9.0   9.75   7.5   6.75   8   8.5   8.25   8.75   8.25
## ia.occlus   3.00   2.0   3.00   3.0   2.00   3   2.0   3.00   2.00   2.00
## extra.ica   2.00   1.0   2.00   2.0   1.00   1   2.0   2.00   2.00   2.00
## time.rand 190.50 242.5 213.00 222.5 163.00 190 168.5 183.00 151.50 204.00
## 
## , , Chain 3
## 
##                1     2     3      4     5      6      7      8      9     10
## trt          NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## age          NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## sex          NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## nihss        NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## location     NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## hx.isch      NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## afib         NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## dm           NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## mrankin      NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## sbp       181.00 145.0 168.0 162.00 164.0 167.00 173.00 146.00 167.00 119.00
## iv.altep     NaN   NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN
## aspects     8.25   9.0   9.5   9.25   9.5   9.25   8.25   8.75   8.75   8.75
## ia.occlus   3.00   4.0   3.0   3.00   3.0   3.00   2.00   3.00   3.00   3.00
## extra.ica   1.00   1.0   1.0   1.00   1.0   1.00   2.00   2.00   1.00   1.00
## time.rand 201.00 181.5 262.5 179.00 205.0 202.00 126.50 158.00 167.50 162.00
## 
## , , Chain 4
## 
##             1      2   3      4     5     6     7      8   9    10
## trt       NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## age       NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## sex       NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## nihss     NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## location  NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## hx.isch   NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## afib      NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## dm        NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## mrankin   NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## sbp       144 115.00 105 160.00 112.0 121.0 165.0 162.00 116 152.0
## iv.altep  NaN    NaN NaN    NaN   NaN   NaN   NaN    NaN NaN   NaN
## aspects     9   8.25   7   7.75   9.0   9.5   7.5   9.75   9   9.5
## ia.occlus   2   3.00   3   2.00   5.0   3.0   3.0   2.00   2   5.0
## extra.ica   1   1.00   1   1.00   2.0   1.0   2.0   2.00   1   1.0
## time.rand 266 245.00 179 173.50 194.5 125.0 172.0 231.50 186 158.0
## 
## , , Chain 5
## 
##               1      2      3     4      5     6      7      8      9     10
## trt         NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## age         NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## sex         NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## nihss       NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## location    NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## hx.isch     NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## afib        NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## dm          NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## mrankin     NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## sbp       161.0 128.00 109.00 178.0 121.00 161.0 149.00 105.00 167.00 154.00
## iv.altep    NaN    NaN    NaN   NaN    NaN   NaN    NaN    NaN    NaN    NaN
## aspects     8.5   7.75   9.25   8.5   8.25   8.5   8.25   8.25   8.75   7.75
## ia.occlus   2.0   3.00   3.00   2.0   2.00   3.0   1.00   3.00   3.00   3.00
## extra.ica   2.0   2.00   2.00   2.0   1.00   2.0   2.00   1.00   1.00   1.00
## time.rand 156.0 200.00 161.50 219.5 159.00 145.5 146.00 227.00 216.50 196.50

Discuss

  • The number of chains is equal to the number of imputed datasets. A chain refers to the chain of regression models that is used to generate the imputed values. The length of each chain is equal to the number of iterations.

  • The convergence can be visualised by plotting the means in a convergence plot. For our example, the convergence plots are shown below. In this plot you see that the variance between the imputation chains is almost equal to the variance within the chains, which indicates healthy convergence.

Let’s plot them to see if our imputation imp has converged:

# implemented plotting function (use layout to change the number of rows and columns)
plot(imp, layout = c(4, 3))

# plot for a single variable
matplot(imp$chainMean["sbp", , ], type = "l", ylab = "imputed value",
        xlab = "iteration", main = "sbp")

# or the ggplot version
library(ggplot2)
library(reshape2)
ggplot(melt(imp$chainMean), aes(x = Var2, y = value, color = Var3)) +
  geom_line() +
  facet_wrap("Var1", scales = 'free') +
  theme(legend.position = 'none') +
  xlab("iteration") +
  ylab("imputed value")

Discuss
There are two main things you want to note in a trace plot:

  • First, assess whether the algorithm appears to have reached a stable posterior distribution by examining the plot to see if the imputed values remain relatively constant.
  • Second, you want to examine the plot to see how long it takes to reach this stationary phase.

=> From what I see, the chains in imp seem to have converged, but in practice, more iterations should be done to confirm this.
More info can be found here:
https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#convergence
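To add iterations without starting over, mice provides mice.mids(), which continues the chains of an existing mids object. A sketch:

# Continue the existing chains for 10 additional iterations
imp_more <- mice.mids(imp, maxit = 10, printFlag = FALSE)
plot(imp_more, layout = c(4, 3))   # re-check convergence with the longer chains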

 

3.6.5 Step 6.5: Imputation diagnostics

  1. Guidance from https://bookdown.org/mwheymans/bookmi/

It can also be of interest to compare the imputed values with the observed ones. For that, the stripplot function of mice can be used. This function visualises the observed and imputed values in one plot. By comparing the observed and the imputed data points we get an idea of whether the imputed values are within the range of the observed data. If there are no large differences between the imputed and observed values, then we can conclude that the imputed values are plausible.

stripplot(imp,  pch = c(1, 20))
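The same plot can also be drawn per variable, for example sbp by imputation number, using mice's formula interface. A sketch:

stripplot(imp, sbp ~ .imp, pch = 20)   # sbp values (observed and imputed) per imputed dataset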

  1. Guidance from https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#structure_of_mira_objects

Now that we know that imp has converged, we can compare the distribution of the imputed values against the distribution of the observed values.

Plot the distributions of the imputed variables (continuous and categorical), and make sure the imputed values are realistic (e.g., no height of 2.50m or weight of 10kg for adults). If there are differences in the distributions of observed and imputed values see if those differences can be explained by other variables.

  • Draw a densityplot for imp:
densityplot(imp)

 

  • Create a long dataset with all imputed data sets stacked onto each other:
# Create a long dataset with all imputed data sets stacked onto each other
longDF <- complete(imp, "long")
summary(longDF)
##       .imp        .id                  trt            age            sex      
##  Min.   :1   Min.   :  1.0   Intervention:1165   Min.   :23.00   Female:1040  
##  1st Qu.:2   1st Qu.:125.8   Control     :1335   1st Qu.:55.00   Male  :1460  
##  Median :3   Median :250.5                       Median :65.75                
##  Mean   :3   Mean   :250.5                       Mean   :64.71                
##  3rd Qu.:4   3rd Qu.:375.2                       3rd Qu.:76.00                
##  Max.   :5   Max.   :500.0                       Max.   :96.00                
##      nihss        location    hx.isch     afib        dm       mrankin   
##  Min.   :10.00   Left :1345   No :2230   No :1825   No :2185   0  :2020  
##  1st Qu.:14.00   Right:1155   Yes: 270   Yes: 675   Yes: 315   1  : 250  
##  Median :18.00                                                 2  : 125  
##  Mean   :18.03                                                 > 2: 105  
##  3rd Qu.:22.00                                                           
##  Max.   :28.00                                                           
##       sbp        iv.altep      aspects                  ia.occlus    extra.ica 
##  Min.   : 78.0   No : 275   Min.   : 5.000   Intracranial ICA:  20   No :1774  
##  1st Qu.:128.0   Yes:2225   1st Qu.: 7.000   ICA with M1     : 671   Yes: 726  
##  Median :145.0              Median : 9.000   M1              :1598             
##  Mean   :145.5              Mean   : 8.508   M2              : 195             
##  3rd Qu.:162.2              3rd Qu.:10.000   A1 or A2        :  16             
##  Max.   :231.0              Max.   :10.000                                     
##    time.rand    
##  Min.   :100.0  
##  1st Qu.:150.8  
##  Median :201.0  
##  Mean   :208.4  
##  3rd Qu.:258.0  
##  Max.   :360.0

 

  • Get the summary of the “raw” imputed values
# get the summary of the "raw" imputed values
lapply(imp$imp, function(x) 
  summary(unlist(x))
)
## $trt
##    Mode 
## logical 
## 
## $age
##    Mode 
## logical 
## 
## $sex
##    Mode 
## logical 
## 
## $nihss
##    Mode 
## logical 
## 
## $location
##    Mode 
## logical 
## 
## $hx.isch
##    Mode 
## logical 
## 
## $afib
##    Mode 
## logical 
## 
## $dm
##    Mode 
## logical 
## 
## $mrankin
##    Mode 
## logical 
## 
## $sbp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   119.0   121.0   152.0   139.8   153.0   154.0 
## 
## $iv.altep
##    Mode 
## logical 
## 
## $aspects
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0     8.0     9.0     8.7    10.0    10.0 
## 
## $ia.occlus
## Intracranial ICA      ICA with M1               M1               M2 
##                0                1                3                0 
##         A1 or A2 
##                1 
## 
## $extra.ica
##  No Yes 
##   4   1 
## 
## $time.rand
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   110.0   122.0   149.0   167.1   190.5   273.0

 

# Plot for all categorical variables
propplot <- function(x, formula, facet = "wrap", ...) {
  library(ggplot2)

  cd <- data.frame(mice::complete(x, "long", include = TRUE))
  cd$.imp <- factor(cd$.imp)
  
  r <- as.data.frame(is.na(x$data))
  
  impcat <- x$meth != "" & sapply(x$data, is.factor)
  vnames <- names(impcat)[impcat]
  
  if (missing(formula)) {
    formula <- as.formula(paste(paste(vnames, collapse = "+",
                                      sep = ""), "~1", sep = ""))
  }
  
  tmsx <- terms(formula[-3], data = x$data)
  xnames <- attr(tmsx, "term.labels")
  xnames <- xnames[xnames %in% vnames]
  
  if (paste(formula[3]) != "1") {
    wvars <- gsub("[[:space:]]*\\|[[:print:]]*", "", paste(formula)[3])
    # wvars <- all.vars(as.formula(paste("~", wvars)))
    wvars <- attr(terms(as.formula(paste("~", wvars))), "term.labels")
    if (grepl("\\|", formula[3])) {
      svars <- gsub("[[:print:]]*\\|[[:space:]]*", "", paste(formula)[3])
      svars <- all.vars(as.formula(paste("~", svars)))
    } else {
      svars <- ".imp"
    }
  } else {
    wvars <- NULL
    svars <- ".imp"
  }
  
  for (i in seq_along(xnames)) {
    xvar <- xnames[i]
    select <- cd$.imp != 0 & !r[, xvar]
    cd[select, xvar] <- NA
  }
  
  
  for (i in which(!wvars %in% names(cd))) {
    cd[, wvars[i]] <- with(cd, eval(parse(text = wvars[i])))
  }
  
  meltDF <- reshape2::melt(cd[, c(wvars, svars, xnames)], id.vars = c(wvars, svars))
  meltDF <- meltDF[!is.na(meltDF$value), ]
  
  
  wvars <- if (!is.null(wvars)) paste0("`", wvars, "`")
  
  a <- plyr::ddply(meltDF, c(wvars, svars, "variable", "value"), plyr::summarize,
             count = length(value))
  b <- plyr::ddply(meltDF, c(wvars, svars, "variable"), plyr::summarize,
             tot = length(value))
  mdf <- merge(a,b)
  mdf$prop <- mdf$count / mdf$tot
  
  plotDF <- merge(unique(meltDF), mdf)
  plotDF$value <- factor(plotDF$value,
                         levels = unique(unlist(lapply(x$data[, xnames], levels))),
                         ordered = T)
  
  p <- ggplot(plotDF, aes(x = value, fill = get(svars), y = prop)) +
    geom_bar(position = "dodge", stat = "identity") +
    theme(legend.position = "bottom", ...) +
    ylab("proportion") +
    scale_fill_manual(name = "",
                      values = c("black",
                                 colorRampPalette(
                                   RColorBrewer::brewer.pal(9, "Blues"))(x$m + 3)[1:x$m + 3])) +
    guides(fill = guide_legend(nrow = 1))
  
  if (facet == "wrap")
    if (length(xnames) > 1) {
      print(p + facet_wrap(c("variable", wvars), scales = "free"))
    } else {
      if (is.null(wvars)) {
        print(p)
      } else {
        print(p + facet_wrap(wvars, scales = "free"))
      }
    }
  
  if (facet == "grid")
    if (!is.null(wvars)) {
      print(p + facet_grid(paste(paste(wvars, collapse = "+"), "~ variable"),
                           scales = "free"))
    }
}
propplot(imp)

propplot(imp, ia.occlus ~ sex)

propplot(imp, extra.ica ~ sex)

 

Discuss
Each bar represents the proportion of the data in the category given on the x-axis in one (imputed) dataset. The black bars show the distribution of the original, incomplete data. The blue bars show the corresponding distribution of the imputed values in each of the m (5, in this case) imputed datasets.

When the blue bars are approximately as high as the black bar, the distribution of the categorical variable in the imputed values is similar to the distribution in the observed data. Note, however, that here the marginal distributions are compared, but the MAR assumption is that the incomplete cases have the same distribution as the observed cases conditional on the other variables.

So if you do see differences this does not necessarily indicate a problem with the imputation. It could be that differences can be explained by covariates used in the imputation model.

  • Look into the distribution of aspects and time.rand
# Look into the distribution of aspects  
densityplot(imp, ~ aspects)

# Look into the distribution of time.rand  
densityplot(imp, ~time.rand)

Discuss

  • For the categorical variables, I compare the imputed values with the observed (non-NA) values. Each variable has only one missing value, which makes it difficult to judge the distribution of the imputed values.

  • The distributions of the imputed values for aspects and time.rand differ from the distributions of the observed data, but these differences cannot be explained by adjusting for any variables. This might be due to the small number of NAs in the dataset (< 5%).

3.7 Mini step: Stepwise model selection

3.7.1 Theory: Model selection techniques

Brand (1999) was the first to recognize and treat the variable selection problem.

  • The first step involves performing stepwise model selection separately on each imputed dataset, followed by the construction of a new supermodel that contains all variables that were present in at least half of the initial models. The idea is that this criterion excludes variables that were selected accidentally. Moreover, it is a rough correction for multiple testing.
  • Second, a special procedure for backward elimination is applied to all variables present in the supermodel. Each variable is removed in turn, and the pooled likelihood ratio p-value is calculated. If the largest p-value is larger than 0.05, the corresponding variable is removed, and the procedure is repeated on the smaller model. The procedure stops when all p ≤ 0.05. The procedure was found to be a considerable improvement over complete-case analysis.

The majority method is identical to step 1 of Brand (1999), whereas the Wald test method is similar to Brand’s step 2, with the likelihood ratio test replaced by the Wald test. The Wald test method is recommended since it is a well-established approach that follows Rubin’s rules, whereas the majority and stack methods fail to take into account the uncertainty caused by the missing data. Indeed, Wood, White, and Royston (2008) found that the Wald test method is the only procedure that preserved the type I error.

In practice, it may be useful to combine methods. The Wald test method is computationally intensive, but is now easily available in mice as the D1() function. A strong point of the majority method is that it gives insight into the variability between the imputed datasets. An advantage of the stack method is that only one dataset needs to be analyzed. The discussion of Wood, White, and Royston (2008) contains additional simulations of a two-step method, in which a preselection made by the majority and stack methods is followed by the Wald test. This yielded a faster method with better theoretical properties. In practice, a judicious combination of approaches might turn out best.
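Since D1() is available in mice, the Wald test method can be sketched directly on our imputed data. A minimal illustration (the nested models below are chosen only for demonstration):

fit1 <- with(imp, lm(nihss ~ dm + ia.occlus + time.rand))  # larger model
fit0 <- with(imp, lm(nihss ~ ia.occlus + time.rand))       # model without dm
D1(fit1, fit0)                                             # pooled Wald test for dropping dm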

3.7.2 Practice

The following code illustrates the main steps involved in implementing a simple majority method to select variables in mice.
https://stefvanbuuren.name/fimd/sec-stepwise.html

Remember our 2 outcome variables: nihss and mrankin.

# Remind imp from step 3
#imp <- mice(df, method = meth, 
#            predictorMatrix = pred,
#            visitSequence = visSeq,
#            maxit = 10, m = 5,
#            seed = 1234,
#            printFlag = FALSE)

# Continue the practice:
scope <- list(upper = ~ trt + age + sex  + location + hx.isch + afib + dm  + 
                mrankin + sbp  + aspects + ia.occlus + extra.ica + time.rand,
              lower = ~1)
expr <- expression(f1 <- lm(nihss ~ 1),
                   f2 <- step(f1, scope = scope))
fit <- with(imp, expr)
## Start:  AIC=1541.79
## nihss ~ 1
## 
##             Df Sum of Sq   RSS    AIC
## + dm         1   210.416 10664 1534.0
## + ia.occlus  4   238.098 10637 1538.7
## + location   1    65.168 10810 1540.8
## + time.rand  1    64.332 10810 1540.8
## + hx.isch    1    61.447 10813 1541.0
## <none>                   10875 1541.8
## + aspects    1    19.884 10855 1542.9
## + sbp        1    17.363 10857 1543.0
## + age        1    12.399 10862 1543.2
## + sex        1     4.904 10870 1543.6
## + afib       1     1.847 10873 1543.7
## + trt        1     1.588 10873 1543.7
## + extra.ica  1     0.048 10875 1543.8
## + mrankin    3    61.940 10813 1544.9
## 
## Step:  AIC=1534.02
## nihss ~ dm
## 
##             Df Sum of Sq   RSS    AIC
## + ia.occlus  4   258.528 10406 1529.8
## + location   1    70.677 10594 1532.7
## + time.rand  1    60.505 10604 1533.2
## + hx.isch    1    56.310 10608 1533.4
## <none>                   10664 1534.0
## + aspects    1    19.511 10645 1535.1
## + sbp        1    17.221 10647 1535.2
## + age        1     9.019 10655 1535.6
## + sex        1     5.901 10658 1535.8
## + afib       1     3.067 10661 1535.9
## + trt        1     1.750 10662 1535.9
## + extra.ica  1     0.028 10664 1536.0
## + mrankin    3    48.829 10615 1537.7
## - dm         1   210.416 10875 1541.8
## 
## Step:  AIC=1529.75
## nihss ~ dm + ia.occlus
## 
##             Df Sum of Sq   RSS    AIC
## + time.rand  1    63.336 10342 1528.7
## + location   1    62.837 10343 1528.7
## + hx.isch    1    49.124 10357 1529.4
## <none>                   10406 1529.8
## + extra.ica  1    19.037 10387 1530.8
## + aspects    1    12.747 10393 1531.1
## + sex        1    11.937 10394 1531.2
## + age        1    11.624 10394 1531.2
## + sbp        1    10.657 10395 1531.2
## + trt        1     3.224 10402 1531.6
## + afib       1     1.900 10404 1531.7
## + mrankin    3    42.795 10363 1533.7
## - ia.occlus  4   258.528 10664 1534.0
## - dm         1   230.847 10637 1538.7
## 
## Step:  AIC=1528.7
## nihss ~ dm + ia.occlus + time.rand
## 
##             Df Sum of Sq   RSS    AIC
## + hx.isch    1    58.908 10284 1527.8
## + location   1    53.257 10289 1528.1
## <none>                   10342 1528.7
## + extra.ica  1    21.765 10321 1529.7
## - time.rand  1    63.336 10406 1529.8
## + sex        1    15.833 10326 1529.9
## + age        1    13.226 10329 1530.1
## + aspects    1    12.281 10330 1530.1
## + sbp        1     8.765 10334 1530.3
## + trt        1     6.519 10336 1530.4
## + afib       1     2.618 10340 1530.6
## + mrankin    3    48.093 10294 1532.4
## - ia.occlus  4   261.358 10604 1533.2
## - dm         1   227.285 10570 1537.6
## 
## Step:  AIC=1527.84
## nihss ~ dm + ia.occlus + time.rand + hx.isch
## 
##             Df Sum of Sq   RSS    AIC
## + location   1    58.917 10225 1527.0
## <none>                   10284 1527.8
## - hx.isch    1    58.908 10342 1528.7
## + extra.ica  1    20.482 10263 1528.8
## + sex        1    16.405 10267 1529.0
## + aspects    1    11.116 10272 1529.3
## + age        1    10.933 10272 1529.3
## - time.rand  1    73.120 10357 1529.4
## + sbp        1     8.314 10275 1529.4
## + trt        1     4.993 10278 1529.6
## + afib       1     2.842 10281 1529.7
## + mrankin    3    44.140 10239 1531.7
## - ia.occlus  4   254.059 10538 1532.0
## - dm         1   221.371 10505 1536.5
## 
## Step:  AIC=1526.97
## nihss ~ dm + ia.occlus + time.rand + hx.isch + location
## 
##             Df Sum of Sq   RSS    AIC
## <none>                   10225 1527.0
## - location   1    58.917 10284 1527.8
## + extra.ica  1    19.349 10205 1528.0
## - time.rand  1    62.805 10287 1528.0
## - hx.isch    1    64.568 10289 1528.1
## + sex        1    17.052 10208 1528.1
## + aspects    1    12.628 10212 1528.3
## + age        1    11.285 10213 1528.4
## + sbp        1     9.476 10215 1528.5
## + trt        1     7.537 10217 1528.6
## + afib       1     3.135 10221 1528.8
## + mrankin    3    52.798 10172 1530.4
## - ia.occlus  4   245.952 10470 1530.8
## - dm         1   225.674 10450 1535.9
## Start:  AIC=1541.79
## nihss ~ 1
## 
##             Df Sum of Sq   RSS    AIC
## + dm         1   210.416 10664 1534.0
## + ia.occlus  4   229.750 10645 1539.1
## + time.rand  1    65.740 10809 1540.8
## + location   1    65.168 10810 1540.8
## + hx.isch    1    61.447 10813 1541.0
## <none>                   10875 1541.8
## + aspects    1    22.162 10852 1542.8
## + sbp        1    16.007 10859 1543.0
## + age        1    12.399 10862 1543.2
## + sex        1     4.904 10870 1543.6
## + afib       1     1.847 10873 1543.7
## + trt        1     1.588 10873 1543.7
## + extra.ica  1     0.372 10874 1543.8
## + mrankin    3    61.940 10813 1544.9
## 
## Step:  AIC=1534.02
## nihss ~ dm
## 
##             Df Sum of Sq   RSS    AIC
## + ia.occlus  4   250.203 10414 1530.2
## + location   1    70.677 10594 1532.7
## + time.rand  1    62.333 10602 1533.1
## + hx.isch    1    56.310 10608 1533.4
## <none>                   10664 1534.0
## + aspects    1    21.506 10643 1535.0
## + sbp        1    15.104 10649 1535.3
## + age        1     9.019 10655 1535.6
## + sex        1     5.901 10658 1535.8
## + afib       1     3.067 10661 1535.9
## + trt        1     1.750 10662 1535.9
## + extra.ica  1     0.285 10664 1536.0
## + mrankin    3    48.829 10615 1537.7
## - dm         1   210.416 10875 1541.8
## 
## Step:  AIC=1530.15
## nihss ~ dm + ia.occlus
## 
##             Df Sum of Sq   RSS    AIC
## + time.rand  1    66.610 10347 1528.9
## + location   1    62.376 10352 1529.2
## + hx.isch    1    49.526 10364 1529.8
## <none>                   10414 1530.2
## + extra.ica  1    21.903 10392 1531.1
## + aspects    1    14.026 10400 1531.5
## + sex        1    11.438 10403 1531.6
## + age        1    11.158 10403 1531.6
## + sbp        1     9.039 10405 1531.7
## + trt        1     3.324 10411 1532.0
## + afib       1     1.865 10412 1532.1
## - ia.occlus  4   250.203 10664 1534.0
## + mrankin    3    42.544 10372 1534.1
## - dm         1   230.869 10645 1539.1
## 
## Step:  AIC=1528.94
## nihss ~ dm + ia.occlus + time.rand
## 
##             Df Sum of Sq   RSS    AIC
## + hx.isch    1    59.855 10288 1528.0
## + location   1    53.239 10294 1528.4
## <none>                   10347 1528.9
## + extra.ica  1    24.417 10323 1529.8
## - time.rand  1    66.610 10414 1530.2
## + sex        1    14.979 10332 1530.2
## + aspects    1    13.379 10334 1530.3
## + age        1    12.272 10335 1530.3
## + sbp        1     7.454 10340 1530.6
## + trt        1     6.519 10341 1530.6
## + afib       1     2.508 10345 1530.8
## + mrankin    3    47.978 10300 1532.6
## - ia.occlus  4   254.480 10602 1533.1
## - dm         1   227.776 10575 1537.8
## 
## Step:  AIC=1528.04
## nihss ~ dm + ia.occlus + time.rand + hx.isch
## 
##             Df Sum of Sq   RSS    AIC
## + location   1    58.993 10229 1527.2
## <none>                   10288 1528.0
## + extra.ica  1    22.922 10265 1528.9
## - hx.isch    1    59.855 10347 1528.9
## + sex        1    15.532 10272 1529.3
## + aspects    1    12.065 10276 1529.5
## + age        1    10.031 10278 1529.5
## + sbp        1     7.090 10280 1529.7
## - time.rand  1    76.940 10364 1529.8
## + trt        1     4.970 10283 1529.8
## + afib       1     2.724 10285 1529.9
## + mrankin    3    44.024 10244 1531.9
## - ia.occlus  4   247.685 10535 1531.9
## - dm         1   221.850 10509 1536.7
## 
## Step:  AIC=1527.17
## nihss ~ dm + ia.occlus + time.rand + hx.isch + location
## 
##             Df Sum of Sq   RSS    AIC
## <none>                   10229 1527.2
## - location   1    58.993 10288 1528.0
## + extra.ica  1    21.979 10207 1528.1
## - hx.isch    1    65.609 10294 1528.4
## + sex        1    16.212 10212 1528.4
## - time.rand  1    67.151 10296 1528.4
## + aspects    1    14.124 10214 1528.5
## + age        1    10.392 10218 1528.7
## + sbp        1     8.041 10220 1528.8
## + trt        1     7.538 10221 1528.8
## + afib       1     3.029 10226 1529.0
## + mrankin    3    52.682 10176 1530.6
## - ia.occlus  4   239.059 10468 1530.7
## - dm         1   226.059 10455 1536.1
## Start:  AIC=1541.79
## nihss ~ 1
## 
##             Df Sum of Sq   RSS    AIC
## + dm         1   210.416 10664 1534.0
## + ia.occlus  4   238.098 10637 1538.7
## + location   1    65.168 10810 1540.8
## + time.rand  1    63.829 10811 1540.8
## + hx.isch    1    61.447 10813 1541.0
## <none>                   10875 1541.8
## + aspects    1    19.327 10855 1542.9
## + sbp        1    17.446 10857 1543.0
## + age        1    12.399 10862 1543.2
## + sex        1     4.904 10870 1543.6
## + afib       1     1.847 10873 1543.7
## + trt        1     1.588 10873 1543.7
## + extra.ica  1     0.048 10875 1543.8
## + mrankin    3    61.940 10813 1544.9
## 
## Step:  AIC=1534.02
## nihss ~ dm
## 
##             Df Sum of Sq   RSS    AIC
## + ia.occlus  4   258.528 10406 1529.8
## + location   1    70.677 10594 1532.7
## + time.rand  1    60.254 10604 1533.2
## + hx.isch    1    56.310 10608 1533.4
## <none>                   10664 1534.0
## + aspects    1    18.837 10645 1535.1
## + sbp        1    17.354 10647 1535.2
## + age        1     9.019 10655 1535.6
## + sex        1     5.901 10658 1535.8
## + afib       1     3.067 10661 1535.9
## + trt        1     1.750 10662 1535.9
## + extra.ica  1     0.028 10664 1536.0
## + mrankin    3    48.829 10615 1537.7
## - dm         1   210.416 10875 1541.8
## 
## Step:  AIC=1529.75
## nihss ~ dm + ia.occlus
## 
##             Df Sum of Sq   RSS    AIC
## + time.rand  1    63.310 10342 1528.7
## + location   1    62.837 10343 1528.7
## + hx.isch    1    49.124 10357 1529.4
## <none>                   10406 1529.8
## + extra.ica  1    19.037 10387 1530.8
## + sex        1    11.937 10394 1531.2
## + age        1    11.624 10394 1531.2
## + aspects    1    11.164 10395 1531.2
## + sbp        1    10.760 10395 1531.2
## + trt        1     3.224 10402 1531.6
## + afib       1     1.900 10404 1531.7
## + mrankin    3    42.795 10363 1533.7
## - ia.occlus  4   258.528 10664 1534.0
## - dm         1   230.847 10637 1538.7
## 
## Step:  AIC=1528.7
## nihss ~ dm + ia.occlus + time.rand
## 
##             Df Sum of Sq   RSS    AIC
## + hx.isch    1    59.061 10283 1527.8
## + location   1    53.580 10289 1528.1
## <none>                   10342 1528.7
## + extra.ica  1    21.655 10321 1529.7
## - time.rand  1    63.310 10406 1529.8
## + sex        1    15.622 10327 1529.9
## + age        1    12.936 10330 1530.1
## + aspects    1    10.679 10332 1530.2
## + sbp        1     9.061 10333 1530.3
## + trt        1     6.401 10336 1530.4
## + afib       1     2.572 10340 1530.6
## + mrankin    3    48.094 10294 1532.4
## - ia.occlus  4   261.584 10604 1533.2
## - dm         1   227.562 10570 1537.6
## 
## Step:  AIC=1527.84
## nihss ~ dm + ia.occlus + time.rand + hx.isch
## 
##             Df Sum of Sq   RSS    AIC
## + location   1    59.282 10224 1527.0
## <none>                   10283 1527.8
## - hx.isch    1    59.061 10342 1528.7
## + extra.ica  1    20.368 10263 1528.8
## + sex        1    16.179 10267 1529.0
## + age        1    10.649 10273 1529.3
## + aspects    1     9.575 10274 1529.4
## - time.rand  1    73.247 10357 1529.4
## + sbp        1     8.613 10275 1529.4
## + trt        1     4.884 10278 1529.6
## + afib       1     2.792 10281 1529.7
## + mrankin    3    44.142 10239 1531.7
## - ia.occlus  4   254.296 10538 1532.0
## - dm         1   221.654 10505 1536.5
## 
## Step:  AIC=1526.95
## nihss ~ dm + ia.occlus + time.rand + hx.isch + location
## 
##             Df Sum of Sq   RSS    AIC
## <none>                   10224 1527.0
## - location   1    59.282 10283 1527.8
## + extra.ica  1    19.250 10205 1528.0
## - time.rand  1    63.297 10287 1528.0
## - hx.isch    1    64.763 10289 1528.1
## + sex        1    16.852 10207 1528.1
## + aspects    1    11.435 10213 1528.4
## + age        1    11.020 10213 1528.4
## + sbp        1     9.784 10214 1528.5
## + trt        1     7.431 10217 1528.6
## + afib       1     3.090 10221 1528.8
## + mrankin    3    52.833 10171 1530.4
## - ia.occlus  4   246.159 10470 1530.8
## - dm         1   225.943 10450 1535.9
## Start:  AIC=1541.79
## nihss ~ 1
## 
##             Df Sum of Sq   RSS    AIC
## + dm         1   210.416 10664 1534.0
## + ia.occlus  4   252.239 10622 1538.1
## + location   1    65.168 10810 1540.8
## + time.rand  1    64.967 10810 1540.8
## + hx.isch    1    61.447 10813 1541.0
## <none>                   10875 1541.8
## + aspects    1    19.127 10856 1542.9
## + sbp        1    16.050 10859 1543.0
## + age        1    12.399 10862 1543.2
## + sex        1     4.904 10870 1543.6
## + afib       1     1.847 10873 1543.7
## + trt        1     1.588 10873 1543.7
## + extra.ica  1     0.048 10875 1543.8
## + mrankin    3    61.940 10813 1544.9
## 
## Step:  AIC=1534.02
## nihss ~ dm
## 
##             Df Sum of Sq   RSS    AIC
## + ia.occlus  4   270.710 10394 1529.2
## + location   1    70.677 10594 1532.7
## + time.rand  1    61.338 10603 1533.1
## + hx.isch    1    56.310 10608 1533.4
## <none>                   10664 1534.0
## + aspects    1    18.823 10645 1535.1
## + sbp        1    15.169 10649 1535.3
## + age        1     9.019 10655 1535.6
## + sex        1     5.901 10658 1535.8
## + afib       1     3.067 10661 1535.9
## + trt        1     1.750 10662 1535.9
## + extra.ica  1     0.028 10664 1536.0
## + mrankin    3    48.829 10615 1537.7
## - dm         1   210.416 10875 1541.8
## 
## Step:  AIC=1529.17
## nihss ~ dm + ia.occlus
## 
##             Df Sum of Sq   RSS    AIC
## + location   1    65.341 10328 1528.0
## + time.rand  1    62.302 10331 1528.2
## + hx.isch    1    48.404 10345 1528.8
## <none>                   10394 1529.2
## + extra.ica  1    19.775 10374 1530.2
## + sex        1    13.305 10380 1530.5
## + age        1    12.300 10381 1530.6
## + aspects    1     9.513 10384 1530.7
## + sbp        1     9.412 10384 1530.7
## + trt        1     2.680 10391 1531.0
## + afib       1     2.204 10391 1531.1
## + mrankin    3    42.764 10351 1533.1
## - ia.occlus  4   270.710 10664 1534.0
## - dm         1   228.888 10622 1538.1
## 
## Step:  AIC=1528.01
## nihss ~ dm + ia.occlus + location
## 
##             Df Sum of Sq   RSS    AIC
## + hx.isch    1    54.804 10273 1527.3
## + time.rand  1    52.827 10275 1527.5
## <none>                   10328 1528.0
## + extra.ica  1    18.952 10309 1529.1
## - location   1    65.341 10394 1529.2
## + sex        1    14.304 10314 1529.3
## + age        1    12.932 10315 1529.4
## + aspects    1    10.937 10317 1529.5
## + sbp        1    10.490 10318 1529.5
## + trt        1     5.017 10323 1529.8
## + afib       1     2.539 10326 1529.9
## + mrankin    3    52.351 10276 1531.5
## - ia.occlus  4   265.374 10594 1532.7
## - dm         1   233.309 10562 1537.2
## 
## Step:  AIC=1527.35
## nihss ~ dm + ia.occlus + location + hx.isch
## 
##             Df Sum of Sq   RSS    AIC
## + time.rand  1    61.969 10211 1526.3
## <none>                   10273 1527.3
## - hx.isch    1    54.804 10328 1528.0
## + extra.ica  1    17.553 10256 1528.5
## + sex        1    14.536 10259 1528.6
## - location   1    71.742 10345 1528.8
## + age        1    10.651 10263 1528.8
## + sbp        1    10.250 10263 1528.8
## + aspects    1    10.075 10263 1528.9
## + trt        1     3.653 10270 1529.2
## + afib       1     2.700 10271 1529.2
## + mrankin    3    48.021 10225 1531.0
## - ia.occlus  4   256.964 10530 1531.7
## - dm         1   228.101 10502 1536.3
## 
## Step:  AIC=1526.33
## nihss ~ dm + ia.occlus + location + hx.isch + time.rand
## 
##             Df Sum of Sq   RSS    AIC
## <none>                   10211 1526.3
## - location   1    61.559 10273 1527.3
## + extra.ica  1    19.952 10192 1527.3
## - time.rand  1    61.969 10273 1527.3
## + sex        1    18.448 10193 1527.4
## - hx.isch    1    63.946 10275 1527.5
## + age        1    11.728 10200 1527.8
## + aspects    1     9.455 10202 1527.9
## + sbp        1     8.246 10203 1527.9
## + trt        1     6.645 10205 1528.0
## + afib       1     3.474 10208 1528.2
## + mrankin    3    53.017 10158 1529.7
## - ia.occlus  4   257.759 10469 1530.8
## - dm         1   224.191 10436 1535.2
## Start:  AIC=1541.79
## nihss ~ 1
## 
##             Df Sum of Sq   RSS    AIC
## + dm         1   210.416 10664 1534.0
## + ia.occlus  4   238.098 10637 1538.7
## + time.rand  1    66.144 10808 1540.7
## + location   1    65.168 10810 1540.8
## + hx.isch    1    61.447 10813 1541.0
## <none>                   10875 1541.8
## + aspects    1    20.059 10855 1542.9
## + sbp        1    15.963 10859 1543.1
## + age        1    12.399 10862 1543.2
## + sex        1     4.904 10870 1543.6
## + afib       1     1.847 10873 1543.7
## + trt        1     1.588 10873 1543.7
## + extra.ica  1     0.048 10875 1543.8
## + mrankin    3    61.940 10813 1544.9
## 
## Step:  AIC=1534.02
## nihss ~ dm
## 
##             Df Sum of Sq   RSS    AIC
## + ia.occlus  4   258.528 10406 1529.8
## + location   1    70.677 10594 1532.7
## + time.rand  1    62.693 10602 1533.1
## + hx.isch    1    56.310 10608 1533.4
## <none>                   10664 1534.0
## + aspects    1    19.315 10645 1535.1
## + sbp        1    15.038 10649 1535.3
## + age        1     9.019 10655 1535.6
## + sex        1     5.901 10658 1535.8
## + afib       1     3.067 10661 1535.9
## + trt        1     1.750 10662 1535.9
## + extra.ica  1     0.028 10664 1536.0
## + mrankin    3    48.829 10615 1537.7
## - dm         1   210.416 10875 1541.8
## 
## Step:  AIC=1529.75
## nihss ~ dm + ia.occlus
## 
##             Df Sum of Sq   RSS    AIC
## + time.rand  1    65.982 10340 1528.6
## + location   1    62.837 10343 1528.7
## + hx.isch    1    49.124 10357 1529.4
## <none>                   10406 1529.8
## + extra.ica  1    19.037 10387 1530.8
## + sex        1    11.937 10394 1531.2
## + age        1    11.624 10394 1531.2
## + aspects    1    11.225 10394 1531.2
## + sbp        1     8.989 10397 1531.3
## + trt        1     3.224 10402 1531.6
## + afib       1     1.900 10404 1531.7
## + mrankin    3    42.795 10363 1533.7
## - ia.occlus  4   258.528 10664 1534.0
## - dm         1   230.847 10637 1538.7
## 
## Step:  AIC=1528.57
## nihss ~ dm + ia.occlus + time.rand
## 
##             Df Sum of Sq   RSS    AIC
## + hx.isch    1    59.352 10280 1527.7
## + location   1    53.657 10286 1528.0
## <none>                   10340 1528.6
## + extra.ica  1    21.620 10318 1529.5
## - time.rand  1    65.982 10406 1529.8
## + sex        1    15.534 10324 1529.8
## + age        1    12.775 10327 1530.0
## + aspects    1    11.292 10328 1530.0
## + sbp        1     7.401 10332 1530.2
## + trt        1     6.378 10333 1530.3
## + afib       1     2.551 10337 1530.5
## + mrankin    3    48.194 10292 1532.2
## - ia.occlus  4   261.818 10602 1533.1
## - dm         1   227.712 10567 1537.5
## 
## Step:  AIC=1527.69
## nihss ~ dm + ia.occlus + time.rand + hx.isch
## 
##             Df Sum of Sq   RSS    AIC
## + location   1    59.396 10221 1526.8
## <none>                   10280 1527.7
## - hx.isch    1    59.352 10340 1528.6
## + extra.ica  1    20.326 10260 1528.7
## + sex        1    16.080 10264 1528.9
## + age        1    10.485 10270 1529.2
## + aspects    1    10.124 10270 1529.2
## + sbp        1     7.040 10273 1529.3
## - time.rand  1    76.210 10357 1529.4
## + trt        1     4.856 10276 1529.5
## + afib       1     2.768 10278 1529.6
## + mrankin    3    44.237 10236 1531.5
## - ia.occlus  4   254.531 10535 1531.9
## - dm         1   221.803 10502 1536.4
## 
## Step:  AIC=1526.8
## nihss ~ dm + ia.occlus + time.rand + hx.isch + location
## 
##             Df Sum of Sq   RSS    AIC
## <none>                   10221 1526.8
## - location   1    59.396 10280 1527.7
## + extra.ica  1    19.220 10202 1527.8
## - hx.isch    1    65.090 10286 1528.0
## + sex        1    16.773 10204 1528.0
## - time.rand  1    66.374 10287 1528.0
## + aspects    1    12.251 10209 1528.2
## + age        1    10.868 10210 1528.3
## + sbp        1     7.981 10213 1528.4
## + trt        1     7.415 10214 1528.4
## + afib       1     3.070 10218 1528.7
## + mrankin    3    52.956 10168 1530.2
## - ia.occlus  4   246.383 10467 1530.7
## - dm         1   226.075 10447 1535.7

 

Discuss
This code imputes the df data m = 5 times and fits a stepwise linear model predicting nihss separately on each of the imputed datasets.
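For reference, here is a minimal sketch of the kind of call that produces the stepwise output above (an assumption reconstructed from the output, not necessarily the exact code used): it assumes imp is the mids object created earlier and lists all candidate predictors in the scope.

# Run stepwise AIC selection, starting from the empty model, within each imputed dataset
scope <- list(upper = ~ trt + age + sex + location + hx.isch + afib + dm +
                mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand,
              lower = ~ 1)
fit <- with(imp, step(lm(nihss ~ 1), scope = scope))

The following code block then counts how many times each variable was selected across the 5 fitted models.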

formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)
## votes
##        dm   hx.isch ia.occlus  location time.rand 
##         5         5         5         5         5

Discuss

The lapply() function is used three times. The first call extracts the model formulas fitted to the m imputed datasets. The second call decomposes the model formulas into terms objects, and the third call extracts the names of the variables included in each of the m models. The table() function counts the number of times each variable appears across the 5 replications.

  • Variables dm, hx.isch, ia.occlus, location and time.rand are always included, whereas the other variables were never selected, not even once, in the models.

  • If a variable appears in more than 50% of the models, we can use the Wald test to determine whether it should be in the final model.

     

Let's look at the following 3 examples:
Example 1: with or without age

fit.without <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + location + time.rand))
fit.with <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + location + time.rand + age))
D1(fit.with, fit.without) # the model with more variables goes before the model with fewer variables
##    test statistic df1 df2 dfcom   p.value          riv
##  1 ~~ 2 0.5302907   1   4   490 0.5068257 0.0003151527
D3(fit.with, fit.without)
##    test statistic df1          df2 dfcom   p.value           riv
##  1 ~~ 2 0.5412293   1 103837222881   490 0.4619237 -6.206557e-06

Discuss

  • Both D1 and D3 have p-values > 0.05 => age is not needed in the model. (In mice, D1 is the multivariate Wald test and D3 is the likelihood-ratio test.) If we go one step further and test dm, we obtain:

     

Example 2: with or without dm

fit.without <- with(imp, lm(nihss ~ hx.isch + ia.occlus + location + time.rand))
fit.with <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + location + time.rand))
D1(fit.with, fit.without)
##    test statistic df1 df2 dfcom    p.value         riv
##  1 ~~ 2  10.83548   1   4   491 0.03016651 3.83061e-05
D3(fit.with, fit.without)
##    test statistic df1         df2 dfcom      p.value          riv
##  1 ~~ 2  10.91439   1 28548643783   491 0.0009541989 1.183702e-05

 

Example 3: with or without location

fit.without <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + time.rand ))
fit.with <- with(imp, lm(nihss ~ dm + hx.isch + ia.occlus + location + time.rand))
D1(fit.with, fit.without)
##    test statistic df1 df2 dfcom   p.value          riv
##  1 ~~ 2  2.863189   1   4   491 0.1658847 0.0003026752
D3(fit.with, fit.without)
##    test statistic df1      df2 dfcom    p.value         riv
##  1 ~~ 2  2.907485   1 79355075   491 0.08816928 0.000224564

Discuss

  • The result is a little bit weird. According to https://stefvanbuuren.name/fimd/, based on stepwise model selection, the final nihss model should contain dm, hx.isch, ia.occlus, location and time.rand.

  • However, while Example 2 confirms that dm should stay (both p-values < 0.05), testing hx.isch, location and time.rand the same way (Example 3 shows location) gives p-values > 0.05, suggesting that those 3 variables are not needed. The final nihss model would then contain only dm and ia.occlus. So what happened here?

     

3.7.3 Explanation: Model optimism

The main danger of data-driven model building strategies is that the model found may depend highly on the sample at hand. For example, Viallefont, Raftery, and Richardson (2001) showed that of the variables declared to be “significant” with p -values between 0.01 and 0.05 by stepwise variable selection, only 49% actually were true risk factors.

Various solutions have been proposed to counter such model optimism. A popular procedure is bootstrapping the model as developed in Sauerbrei and Schumacher (1992) and Harrell (2001). The method randomly draws multiple samples with replacement from the observed sample, thus mimicking the sampling variation in the population from which the sample was drawn. Stepwise regression analyses are replayed in each bootstrap sample. The proportion of times that each prognostic variable is retained in the stepwise regression model is known as the inclusion frequency (Sauerbrei and Schumacher 1992). This proportion provides information about the strength of the evidence that an indicator is an independent predictor. In addition, each bootstrap model can be fitted to the original sample. The difference between the apparent performance and the bootstrap performance provides the basis for performance measures that correct for model optimism.

Clearly, the presence of missing data adds uncertainty to the model building process, so optimism can be expected to be more severe with missing data. It is not yet clear what the best way is to estimate optimism from incomplete data. Heymans et al. (2007) explored the combination of multiple imputation and the bootstrap. A drawback of Heymans's method is the use of classic stepwise variable selection techniques, which do not generalize well to high-dimensional data. There appear to be at least four general procedures:

  1. Imputation. Multiple imputation generates 100 imputed datasets. Automatic backward selection is applied to each dataset. Any differences found between the 100 fitted models are due to the missing data (see the sketch after this list).
  2. Bootstrap. 200 bootstrap samples are drawn from one singly imputed, completed dataset. Automatic backward selection is applied to each sample. Any differences found between the 200 fitted models are due to sampling variation.
  3. Nested bootstrap. The bootstrap method is applied to each of the multiply imputed datasets. Automatic backward selection is applied to each of the 100 × 200 datasets. Differences between the fitted models portray both sampling and missing-data uncertainty.
  4. Nested imputation. The imputation method is applied to each of the bootstrapped datasets.
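A rough sketch of procedure 1, under loud assumptions: it reuses the df and predictor set from earlier in this section, sets m = 100 as in the description, and picks an arbitrary seed; this run is computationally heavy.

# Procedure 1: backward selection repeated over 100 imputed datasets
imp100 <- mice(df, m = 100, printFlag = FALSE, seed = 123)
fit100 <- with(imp100,
               step(lm(nihss ~ trt + age + sex + location + hx.isch + afib + dm +
                         mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand),
                    direction = "backward", trace = 0))
votes100 <- unlist(lapply(lapply(lapply(fit100$analyses, formula), terms), labels))
# Inclusion frequency: the proportion of the 100 models that retain each variable
sort(table(votes100) / 100, decreasing = TRUE)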

Long and Johnson (2015) developed a procedure, called bootstrap imputation and stability selection (BI-SS), that generates bootstrap samples from the original data, imputes each bootstrap sample by single imputation, obtains the randomized LASSO estimate from each sample, and then selects the active set according to majority. The multiple imputation random LASSO (MIRL) method by Liu et al. (2016) first performs multiple imputation, obtains bootstrap samples from each imputed dataset, estimates regression weights under LASSO, and then selects the active set by majority. It is not yet known how BI-SS and MIRL compare to each other.

 

3.8 Step 7: Analysis

3.8.1 Fitting models on mids objects

When we are confident that the imputation was successful, the imputed data can be analysed with any complete-data method.

Choose any analysis model of interest and fit it on the imputed data. The model could for instance be a linear regression model (using lm) or a generalized linear regression model, e.g., logistic regression (using glm()). Find out the class and structure of the resulting object.

imp is not a data frame but an object of class mids, so we cannot use the normal model specification in which the data argument is a data.frame. Instead, use with(imp, expr).

# for example:
models1 <- with(imp, lm(nihss   ~ trt + age + sex  + location + hx.isch + afib + dm  +
                          mrankin + sbp  + aspects + ia.occlus + extra.ica + time.rand))
models2 <- with(imp, lm(nihss ~ dm + ia.occlus))
models3 <- with(imp, glm(mrankin ~ trt + age + sex  + location + hx.isch + afib + dm  +
                          nihss + sbp  + aspects + ia.occlus + extra.ica + time.rand,
                       family = "binomial"))
# to determine the class
class(models1)
## [1] "mira"   "matrix"
# to determine the structure
names(models1)
## [1] "call"     "call1"    "nmis"     "analyses"
lapply(models1, class)
## $call
## [1] "call"
## 
## $call1
## [1] "call"
## 
## $nmis
## [1] "integer"
## 
## $analyses
## [1] "list"
# explore further
models1$analyses
## [[1]]
## 
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib + 
##     dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
## 
## Coefficients:
##          (Intercept)            trtControl                   age  
##            17.252363              0.336268             -0.008557  
##              sexMale         locationRight            hx.ischYes  
##             0.314477              0.781461             -1.073092  
##              afibYes                 dmYes              mrankin1  
##             0.081253             -1.960446             -0.221030  
##             mrankin2            mrankin> 2                   sbp  
##            -0.361164              1.439753             -0.005320  
##              aspects  ia.occlusICA with M1           ia.occlusM1  
##            -0.114684              2.961485              4.091233  
##          ia.occlusM2     ia.occlusA1 or A2          extra.icaYes  
##             4.889223              8.060141              0.411613  
##            time.rand  
##            -0.006184  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib + 
##     dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
## 
## Coefficients:
##          (Intercept)            trtControl                   age  
##            17.206006              0.336380             -0.008177  
##              sexMale         locationRight            hx.ischYes  
##             0.303715              0.785275             -1.080764  
##              afibYes                 dmYes              mrankin1  
##             0.073071             -1.959072             -0.221581  
##             mrankin2            mrankin> 2                   sbp  
##            -0.357597              1.439351             -0.004910  
##              aspects  ia.occlusICA with M1           ia.occlusM1  
##            -0.118385              3.010038              4.107655  
##          ia.occlusM2     ia.occlusA1 or A2          extra.icaYes  
##             4.907026              8.075932              0.440491  
##            time.rand  
##            -0.006325  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib + 
##     dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
## 
## Coefficients:
##          (Intercept)            trtControl                   age  
##            17.214181              0.330692             -0.008417  
##              sexMale         locationRight            hx.ischYes  
##             0.311406              0.785978             -1.076008  
##              afibYes                 dmYes              mrankin1  
##             0.079610             -1.961131             -0.223412  
##             mrankin2            mrankin> 2                   sbp  
##            -0.362819              1.439835             -0.005404  
##              aspects  ia.occlusICA with M1           ia.occlusM1  
##            -0.108330              2.956945              4.085079  
##          ia.occlusM2     ia.occlusA1 or A2          extra.icaYes  
##             4.873101              8.067876              0.408500  
##            time.rand  
##            -0.006188  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib + 
##     dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
## 
## Coefficients:
##          (Intercept)            trtControl                   age  
##            17.051717              0.317570             -0.008639  
##              sexMale         locationRight            hx.ischYes  
##             0.328283              0.795258             -1.070596  
##              afibYes                 dmYes              mrankin1  
##             0.093718             -1.953017             -0.218021  
##             mrankin2            mrankin> 2                   sbp  
##            -0.339789              1.441107             -0.004962  
##              aspects  ia.occlusICA with M1           ia.occlusM1  
##            -0.099336              2.960153              4.083512  
##          ia.occlusM2     ia.occlusA1 or A2          extra.icaYes  
##             4.891359              7.930349              0.418464  
##            time.rand  
##            -0.006133  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib + 
##     dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
## 
## Coefficients:
##          (Intercept)            trtControl                   age  
##            17.161027              0.330387             -0.008370  
##              sexMale         locationRight            hx.ischYes  
##             0.307164              0.787312             -1.079104  
##              afibYes                 dmYes              mrankin1  
##             0.079369             -1.957972             -0.223150  
##             mrankin2            mrankin> 2                   sbp  
##            -0.363033              1.444367             -0.004875  
##              aspects  ia.occlusICA with M1           ia.occlusM1  
##            -0.109658              2.971916              4.099981  
##          ia.occlusM2     ia.occlusA1 or A2          extra.icaYes  
##             4.887867              8.070109              0.407125  
##            time.rand  
##            -0.006321

 
Let's look at the summaries of the 3 models:

summary(models1$analyses[[1]])
## 
## Call:
## lm(formula = nihss ~ trt + age + sex + location + hx.isch + afib + 
##     dm + mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1563 -3.7547 -0.5499  3.6602 11.5062 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          17.252363   3.104956   5.556 4.57e-08 ***
## trtControl            0.336268   0.419092   0.802  0.42273    
## age                  -0.008557   0.012240  -0.699  0.48485    
## sexMale               0.314477   0.420237   0.748  0.45462    
## locationRight         0.781461   0.417205   1.873  0.06166 .  
## hx.ischYes           -1.073092   0.668206  -1.606  0.10895    
## afibYes               0.081253   0.467415   0.174  0.86207    
## dmYes                -1.960446   0.623112  -3.146  0.00176 ** 
## mrankin1             -0.221030   0.698761  -0.316  0.75190    
## mrankin2             -0.361164   0.980914  -0.368  0.71289    
## mrankin> 2            1.439753   1.045147   1.378  0.16898    
## sbp                  -0.005320   0.008239  -0.646  0.51881    
## aspects              -0.114684   0.134705  -0.851  0.39499    
## ia.occlusICA with M1  2.961485   2.374193   1.247  0.21287    
## ia.occlusM1           4.091233   2.352406   1.739  0.08264 .  
## ia.occlusM2           4.889223   2.470246   1.979  0.04836 *  
## ia.occlusA1 or A2     8.060141   3.558860   2.265  0.02397 *  
## extra.icaYes          0.411613   0.475725   0.865  0.38734    
## time.rand            -0.006184   0.003226  -1.917  0.05588 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.582 on 481 degrees of freedom
## Multiple R-squared:  0.07152,    Adjusted R-squared:  0.03677 
## F-statistic: 2.058 on 18 and 481 DF,  p-value: 0.006495
summary(models2$analyses[[1]])
## 
## Call:
## lm(formula = nihss ~ dm + ia.occlus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5038 -3.5038 -0.5038  3.5504 10.5542 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           14.7500     2.2948   6.428 3.06e-10 ***
## dmYes                 -2.0530     0.6202  -3.310   0.0010 ***
## ia.occlusICA with M1   2.6958     2.3297   1.157   0.2478    
## ia.occlusM1            3.7538     2.3105   1.625   0.1049    
## ia.occlusM2            4.6698     2.4122   1.936   0.0534 .  
## ia.occlusA1 or A2      7.5833     3.5053   2.163   0.0310 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.59 on 494 degrees of freedom
## Multiple R-squared:  0.04312,    Adjusted R-squared:  0.03344 
## F-statistic: 4.453 on 5 and 494 DF,  p-value: 0.0005612
summary(models3$analyses[[1]])
## 
## Call:
## glm(formula = mrankin ~ trt + age + sex + location + hx.isch + 
##     afib + dm + nihss + sbp + aspects + ia.occlus + extra.ica + 
##     time.rand, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6296  -0.6786  -0.5854  -0.4667   2.2363  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)  
## (Intercept)           0.1630483  1.7069637   0.096   0.9239  
## trtControl            0.0472200  0.2397982   0.197   0.8439  
## age                  -0.0056011  0.0068957  -0.812   0.4166  
## sexMale               0.1656146  0.2397950   0.691   0.4898  
## locationRight        -0.3418088  0.2387721  -1.432   0.1523  
## hx.ischYes            0.0668760  0.3806121   0.176   0.8605  
## afibYes               0.1301763  0.2612503   0.498   0.6183  
## dmYes                -0.3528833  0.3876486  -0.910   0.3627  
## nihss                 0.0055946  0.0256744   0.218   0.8275  
## sbp                  -0.0001552  0.0046825  -0.033   0.9736  
## aspects               0.1228582  0.0797173   1.541   0.1233  
## ia.occlusICA with M1 -3.0054740  1.1977587  -2.509   0.0121 *
## ia.occlusM1          -2.4927321  1.1839151  -2.105   0.0352 *
## ia.occlusM2          -2.7362239  1.2658713  -2.162   0.0307 *
## ia.occlusA1 or A2    -0.2168552  1.7294198  -0.125   0.9002  
## extra.icaYes          0.2908186  0.2645062   1.099   0.2716  
## time.rand             0.0007048  0.0018015   0.391   0.6956  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 489.11  on 499  degrees of freedom
## Residual deviance: 469.25  on 483  degrees of freedom
## AIC: 503.25
## 
## Number of Fisher Scoring iterations: 4

3.8.2 Structure of mira objects

Analyses performed on mids objects will result in a mira object (multiply imputed repeated analyses).

A mira object is a list with the following elements:

  • call: The call that created the mira object.
  • call1: The call that created the mids object.
  • nmis: The number of missing observations per column.
  • analyses: The models fitted on each of the imputed datasets: A list with imp$m components, where each component contains a fitted model object.

The class of each of the components of imp$analyses will depend on the type of model that was fitted.
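For example, since models1 was fitted with lm(), each of its components is an lm object:

class(models1$analyses[[1]])
## [1] "lm"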

 

3.9 Step 8: Pooling

3.9.1 Combining results

To pool the results from the repeated analyses contained in a mira object, the function pool() can be used.

Pool one of the provided mira objects (models1, models2, models3) and investigate its structure.

 

  • Models1: Pool the results across the 5 imputations
# Pool the results across the 5 imputations
pooled1 <- pool(models1)

# investigate the structure
class(pooled1)
## [1] "mipo"       "data.frame"
names(pooled1)
## [1] "call"    "m"       "pooled"  "glanced"
str(pooled1)
## Classes 'mipo' and 'data.frame': 0 obs. of  4 variables:
##  $ call   : language pool(object = models1)
##  $ m      : int 5
##  $ pooled :'data.frame': 19 obs. of  11 variables:
##   ..$ term    : Factor w/ 19 levels "(Intercept)",..: 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ m       : int  5 5 5 5 5 5 5 5 5 5 ...
##   ..$ estimate: num  17.17706 0.33026 -0.00843 0.31301 0.78706 ...
##   ..$ ubar    : num  9.63947 0.17546 0.00015 0.17657 0.17409 ...
##   ..$ b       : num  5.96e-03 5.87e-05 3.18e-08 8.96e-05 2.57e-05 ...
##   ..$ t       : num  9.64663 0.17553 0.00015 0.17668 0.17412 ...
##   ..$ dfcom   : int  481 481 481 481 481 481 481 481 481 481 ...
##   ..$ df      : num  479 479 479 479 479 ...
##   ..$ riv     : num  0.000742 0.000401 0.000255 0.000609 0.000177 ...
##   ..$ lambda  : num  0.000742 0.000401 0.000255 0.000609 0.000177 ...
##   ..$ fmi     : num  0.00489 0.00455 0.0044 0.00476 0.00433 ...
##  $ glanced:'data.frame': 5 obs. of  12 variables:
##   ..$ r.squared    : num  0.0715 0.0712 0.0714 0.0724 0.0715
##   ..$ adj.r.squared: num  0.0368 0.0364 0.0366 0.0377 0.0368
##   ..$ sigma        : num  4.58 4.58 4.58 4.58 4.58
##   ..$ statistic    : num  2.06 2.05 2.05 2.09 2.06
##   ..$ p.value      : num  0.0065 0.00683 0.00664 0.00566 0.00648
##   ..$ df           : num  18 18 18 18 18
##   ..$ logLik       : num  -1461 -1461 -1461 -1461 -1461
##   ..$ AIC          : num  2962 2962 2962 2961 2962
##   ..$ BIC          : num  3046 3046 3046 3045 3046
##   ..$ deviance     : num  10097 10100 10098 10087 10097
##   ..$ df.residual  : int  481 481 481 481 481
##   ..$ nobs         : int  500 500 500 500 500

3.9.2 Structure of mipo objects

Pooling of a mira object results in a mipo object (multiply imputed pooled analysis).

The mipo object is a list that has the following elements:

  • call: The call that created the mipo object.
  • call1: The call that created the mids object.
  • call2: The call that created the mira object.
  • data: The original, incomplete data.
  • nmis: The number of missing values per variable.
  • m: The number of imputations.
  • qhat: The coefficients from all analyses: a matrix with m rows and npar columns.
  • u: The variance-covariance matrix of the coefficients from each analysis: an array of dimension c(m, npar, npar).
  • qbar: The pooled parameter estimates: a vector of length npar.
  • ubar: The average within-imputation variance: a matrix with npar rows and npar columns containing the averaged variance-covariance matrices.
  • b: The between-imputation variance-covariance matrix.
  • t: The total variance-covariance matrix (ubar + b + b/m).
  • r: The relative increase in variance due to the missing values.
  • dfcom: Degrees of freedom in the hypothetically complete data (N minus the number of free parameters).
  • df: Degrees of freedom associated with the t-statistics.
  • fmi: Fraction of missing information.
  • lambda: Proportion of the variation due to the missing values: (b + b/m)/t.

From the elements contained in the mipo object, we could obtain the pooled coefficients (qbar) and variances (t), and calculate the pooled confidence intervals and p-values by hand, using the function provided in the course notes.
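For illustration, a by-hand sketch of Rubin's rules using the pooled element shown in str(pooled1) above:

# Pooled SEs, 95% CIs and p-values computed from the mipo internals
p  <- pooled1$pooled
se <- sqrt(p$t)                            # total variance: t = ubar + (1 + 1/m) * b
lo <- p$estimate - qt(0.975, p$df) * se    # lower 95% confidence limit
hi <- p$estimate + qt(0.975, p$df) * se    # upper 95% confidence limit
pv <- 2 * pt(-abs(p$estimate / se), p$df)  # two-sided p-value
data.frame(term = p$term, estimate = p$estimate, lo, hi, pv)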

However, the function summary() applied to the mipo object will do this for us.

Note:

  • pool() extracts the model coefficients and variance-covariance matrices using the functions coef() and vcov(). Hence, pooling using the pool() function from mice only works for those types of models for which these functions will return the regression coefficients and corresponding variance-covariance matrix.

  • Moreover, pool() only works for objects of class mira. When data has been imputed in a different package/software, or analyses have not been performed in the way described above, a list of fitted models can be converted to a mira object using as.mira().
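As a minimal sketch of that conversion (assuming the imp object from earlier), a plain list of fitted models can be pooled like this:

# Fit the same model manually on each completed dataset, then pool via as.mira()
fits <- lapply(1:5, function(i) lm(nihss ~ dm + ia.occlus, data = mice::complete(imp, i)))
pool(as.mira(fits))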

library(kableExtra)
pooled1 <- pool(models1)
summary(pooled1, conf.int = TRUE) %>%
    select(-statistic, -df) %>%
    kable(digits = 3)
term                  estimate  std.error  p.value    2.5 %   97.5 %
(Intercept)             17.177      3.106    0.000   11.074   23.280
trtControl               0.330      0.419    0.431   -0.493    1.154
age                     -0.008      0.012    0.491   -0.032    0.016
sexMale                  0.313      0.420    0.457   -0.513    1.139
locationRight            0.787      0.417    0.060   -0.033    1.607
hx.ischYes              -1.076      0.668    0.108   -2.389    0.237
afibYes                  0.081      0.468    0.862   -0.838    1.000
dmYes                   -1.958      0.623    0.002   -3.183   -0.734
mrankin1                -0.221      0.699    0.751   -1.594    1.152
mrankin2                -0.357      0.980    0.716   -2.282    1.568
mrankin> 2               1.441      1.045    0.169   -0.613    3.495
sbp                     -0.005      0.008    0.537   -0.021    0.011
aspects                 -0.110      0.135    0.415   -0.375    0.155
ia.occlusICA with M1     2.972      2.374    0.211   -1.693    7.637
ia.occlusM1              4.093      2.352    0.082   -0.529    8.716
ia.occlusM2              4.890      2.470    0.048    0.037    9.743
ia.occlusA1 or A2        8.041      3.508    0.022    1.149   14.933
extra.icaYes             0.417      0.476    0.381   -0.518    1.352
time.rand               -0.006      0.003    0.054   -0.013    0.000

 

Discuss
We’ll interpret 2 of the predictors here: dm and ia.occlus:

  • If we have two subjects (A and B) with the same values of all other predictors (trt, age, sex, etc.) but A is dm = Yes and B is dm = No, then the model predicts that A's nihss will be 1.958 points lower than B's, with a 95% confidence interval around that estimate ranging from -3.183 to -0.734.

  • Similar comments can be made for the ia.occlus contrasts.
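The numbers quoted above can also be pulled out programmatically; a small sketch, with column names as in the pooled summary table:

# Extract the dmYes row from the pooled summary
a <- summary(pooled1, conf.int = TRUE) %>% filter(term == "dmYes")
round(c(estimate = a$estimate, lo = a$`2.5 %`, hi = a$`97.5 %`), 3)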

     

     

We can compare these results to the complete case analysis

library(magrittr)   # provides the %$% exposition pipe used below
library(broom)      # provides tidy()

df_cc <- df %$% 
    lm(nihss ~ trt + age + sex + location + hx.isch + afib + dm +
         mrankin + sbp + aspects + ia.occlus + extra.ica + time.rand)
tidy(df_cc, conf.int = TRUE) %>%
    select(term, estimate, std.error, p.value, conf.low, conf.high) %>%
    kable(digits = 3)
term                  estimate  std.error  p.value  conf.low  conf.high
(Intercept)             17.199      3.122    0.000    11.064     23.334
trtControl               0.249      0.424    0.558    -0.584      1.081
age                     -0.008      0.012    0.515    -0.032      0.016
sexMale                  0.306      0.425    0.472    -0.530      1.141
locationRight            0.811      0.422    0.055    -0.018      1.640
hx.ischYes              -1.037      0.671    0.123    -2.357      0.282
afibYes                  0.107      0.472    0.821    -0.821      1.035
dmYes                   -2.065      0.635    0.001    -3.313     -0.816
mrankin1                -0.193      0.702    0.783    -1.572      1.186
mrankin2                -0.437      1.004    0.664    -2.410      1.536
mrankin> 2               1.472      1.050    0.161    -0.591      3.535
sbp                     -0.005      0.008    0.516    -0.022      0.011
aspects                 -0.108      0.136    0.427    -0.377      0.160
ia.occlusICA with M1     2.862      2.385    0.231    -1.824      7.549
ia.occlusM1              4.039      2.361    0.088    -0.600      8.679
ia.occlusM2              4.797      2.483    0.054    -0.082      9.677
ia.occlusA1 or A2        8.102      3.572    0.024     1.083     15.122
extra.icaYes             0.409      0.482    0.396    -0.538      1.356
time.rand               -0.006      0.003    0.073    -0.012      0.001

 
 

Note that there are some sizeable differences here, although nothing enormous. If we want the pooled R^2 or pooled adjusted R^2 after imputation, R will provide it (and a 95% confidence interval around the estimate) with:

pool.r.squared(models1)
##            est      lo 95     hi 95          fmi
## R^2 0.07160315 0.03392821 0.1205161 0.0005197753
pool.r.squared(models1, adjusted = TRUE)
##                est      lo 95      hi 95         fmi
## adj R^2 0.03686002 0.01124442 0.07567558 0.001006057

 
 

We can see the fraction of missing information about each coefficient due to non-response (fmi) and other details with the following code:

pooled1
## Class: mipo    m = 5 
##                    term m     estimate         ubar            b            t
## 1           (Intercept) 5 17.177058905 9.639473e+00 5.963575e-03 9.646629e+00
## 2            trtControl 5  0.330259428 1.754636e-01 5.869607e-05 1.755341e-01
## 3                   age 5 -0.008432130 1.497218e-04 3.184904e-08 1.497601e-04
## 4               sexMale 5  0.313009143 1.765744e-01 8.964159e-05 1.766820e-01
## 5         locationRight 5  0.787056842 1.740886e-01 2.574415e-05 1.741195e-01
## 6            hx.ischYes 5 -1.075912627 4.465767e-01 1.749045e-05 4.465977e-01
## 7               afibYes 5  0.081404020 2.186886e-01 5.711044e-05 2.187571e-01
## 8                 dmYes 5 -1.958327545 3.882901e-01 1.030698e-05 3.883024e-01
## 9              mrankin1 5 -0.221438946 4.882727e-01 4.671731e-06 4.882783e-01
## 10             mrankin2 5 -0.356880358 9.595191e-01 9.602684e-05 9.596344e-01
## 11           mrankin> 2 5  1.440882599 1.092470e+00 4.227877e-06 1.092475e+00
## 12                  sbp 5 -0.005093975 6.791841e-05 6.155365e-08 6.799228e-05
## 13              aspects 5 -0.110078715 1.814773e-02 5.221231e-05 1.821039e-02
## 14 ia.occlusICA with M1 5  2.972107444 5.636021e+00 4.811054e-04 5.636598e+00
## 15          ia.occlusM1 5  4.093492042 5.533161e+00 1.045480e-04 5.533286e+00
## 16          ia.occlusM2 5  4.889715285 6.099775e+00 1.455083e-04 6.099949e+00
## 17    ia.occlusA1 or A2 5  8.040881534 1.229913e+01 3.849952e-03 1.230375e+01
## 18         extra.icaYes 5  0.417238545 2.261252e-01 1.881161e-04 2.263510e-01
## 19            time.rand 5 -0.006229975 1.042510e-05 7.668257e-09 1.043431e-05
##    dfcom       df          riv       lambda         fmi
## 1    481 478.6255 7.423943e-04 7.418436e-04 0.004891367
## 2    481 478.8110 4.014238e-04 4.012627e-04 0.004550603
## 3    481 478.8864 2.552657e-04 2.552005e-04 0.004404497
## 4    481 478.6995 6.092044e-04 6.088335e-04 0.004758272
## 5    481 478.9256 1.774556e-04 1.774241e-04 0.004326706
## 6    481 478.9639 4.699875e-05 4.699654e-05 0.004196490
## 7    481 478.8567 3.133795e-04 3.132813e-04 0.004462593
## 8    481 478.9639 3.185343e-05 3.185242e-05 0.004181408
## 9    481 478.9639 1.148145e-05 1.148132e-05 0.004161122
## 10   481 478.9541 1.200937e-04 1.200793e-04 0.004269354
## 11   481 478.9639 4.644020e-06 4.643998e-06 0.004154313
## 12   481 478.4245 1.087546e-03 1.086364e-03 0.005236189
## 13   481 476.6909 3.452485e-03 3.440606e-03 0.007595613
## 14   481 478.9627 1.024351e-04 1.024246e-04 0.004251698
## 15   481 478.9639 2.267377e-05 2.267326e-05 0.004172267
## 16   481 478.9639 2.862565e-05 2.862483e-05 0.004178194
## 17   481 478.8245 3.756315e-04 3.754905e-04 0.004524821
## 18   481 478.4777 9.982935e-04 9.972979e-04 0.005147034
## 19   481 478.5454 8.826683e-04 8.818899e-04 0.005031522

 

Besides, there is more info on pooling the independent t-test, Chi-square test, ANOVA and regression models at:
https://bookdown.org/mwheymans/bookmi/data-analysis-after-multiple-imputation.html#the-pooled-independent-t-test

 
 

4 Single Imputation with simputation

Loading the data and setting the type of each column:

# Loading the data and setting the type of each column
data <- read_excel("D:/Statistics/R/R data/fakestroke.xlsx", 
                 col_types = c("text", "text", "numeric",
                               "text", "numeric", "text", "text", 
                               "numeric", "numeric", "text", "numeric", 
                               "text", "numeric", "numeric", "text", 
                               "numeric", "numeric", "numeric")) 
df = data

 

The first three steps are similar to those for Multiple Imputation with mice()

4.1 Step 1: Identify categorical or numeric variables

Before constructing a table, I must identify the numeric and categorical variables.
For example, in our data:

  • Numeric variables: age; nihss; sbp; time.iv; aspects; time.rand; time.punc.
  • Character variables: sex; location; hx.isch; afib; dm; mrankin; iv.altep; ia.occlus; extra.ica.
  1. Because time.punc applies only to the Intervention group (trt) and time.iv applies only to patients with iv.altep = Yes, I eliminate these two variables (along with the identifier studyid) from the imputation:
df = data %>% select(-studyid, -time.punc, -time.iv)

 

  2. In the excel file, afib, dm and extra.ica are coded as “1” and “0”. Therefore, I convert them back to “Yes”/“No” factors:
# Changing coding variable to categorical:
df$afib       <- factor(df$afib , levels=0:1, labels=c("No", "Yes"))
df$dm         <- factor(df$dm , levels=0:1, labels=c("No", "Yes"))
df$extra.ica  <- factor(df$extra.ica, levels=0:1, labels=c("No", "Yes"))

 

  3. Define factors for all remaining categorical variables:
  • Before converting:
str(df)
## tibble [500 x 15] (S3: tbl_df/tbl/data.frame)
##  $ trt      : chr [1:500] "Control" "Intervention" "Control" "Control" ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : chr [1:500] "Male" "Male" "Female" "Male" ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : chr [1:500] "Right" "Left" "Right" "Left" ...
##  $ hx.isch  : chr [1:500] "No" "No" "No" "No" ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : chr [1:500] "2" "0" "0" "0" ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : chr [1:500] "Yes" "Yes" "No" "Yes" ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: chr [1:500] "M1" "M1" "ICA with M1" "ICA with M1" ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...
  • After:
df <- df%>%mutate_if(is.character, as.factor)
str(df) 
## tibble [500 x 15] (S3: tbl_df/tbl/data.frame)
##  $ trt      : Factor w/ 2 levels "Control","Intervention": 1 2 1 1 1 1 2 1 1 2 ...
##  $ age      : num [1:500] 53 51 68 28 91 34 75 89 75 26 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 2 1 ...
##  $ nihss    : num [1:500] 21 23 11 22 24 18 25 18 25 27 ...
##  $ location : Factor w/ 2 levels "Left","Right": 2 1 2 1 2 1 2 2 1 2 ...
##  $ hx.isch  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ afib     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
##  $ dm       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrankin  : Factor w/ 4 levels "> 2","0","1",..: 4 2 2 2 2 4 2 2 4 2 ...
##  $ sbp      : num [1:500] 127 137 138 122 162 166 140 157 129 143 ...
##  $ iv.altep : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 1 2 2 ...
##  $ aspects  : num [1:500] 10 10 10 10 8 5 10 9 6 10 ...
##  $ ia.occlus: Factor w/ 6 levels "A1 or A2","ICA with M1",..: 4 4 2 2 4 1 4 4 4 4 ...
##  $ extra.ica: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 1 1 ...
##  $ time.rand: num [1:500] 139 118 178 160 214 154 122 147 271 141 ...

 

  4. Reorder the levels of some factor variables:
  • trt: Intervention (baseline) , Control
  • mrankin: 0, 1, 2, > 2.
  • ia.occlus: rearrange the ia.occlus variable to the order presented in Berkhemer et al. (2015).
# Relevel factors:
df$trt = factor(df$trt, levels=c("Intervention", "Control"))
df$mrankin = factor(df$mrankin, levels=c("0", "1", "2", "> 2"))
df$ia.occlus = factor(df$ia.occlus, levels=c("Intracranial ICA", "ICA with M1", 
                                             "M1", "M2", "A1 or A2"))

# Check order of factor after relevel: 
df$trt[1:2]
## [1] Control      Intervention
## Levels: Intervention Control
df$mrankin[1:2]
## [1] 2 0
## Levels: 0 1 2 > 2
df$ia.occlus[1:3]
## [1] M1          M1          ICA with M1
## Levels: Intracranial ICA ICA with M1 M1 M2 A1 or A2

 

4.2 Step 2: Checking missing value (NA)

4.2.1 Step 2.1: Missing data patterns

Before moving on to determining the specifics of the imputation, we should first explore and see the pattern of missing data in our dataset.

There are several ways to check the missing value (NA) in the dataset:

4.2.1.1 Complete vs incomplete case

# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(df))),
      "% NA" = round(sort(colMeans(is.na(df))) * 100, 2))
##           # NA % NA
## trt          0  0.0
## age          0  0.0
## sex          0  0.0
## nihss        0  0.0
## location     0  0.0
## hx.isch      0  0.0
## afib         0  0.0
## dm           0  0.0
## mrankin      0  0.0
## iv.altep     0  0.0
## sbp          1  0.2
## ia.occlus    1  0.2
## extra.ica    1  0.2
## time.rand    2  0.4
## aspects      4  0.8

Discuss

There are 5 variables with a total of 9 NAs.

  • aspects (0.8%) > time.rand (0.4%).
  • sbp = ia.occlus = extra.ica (0.2% each).

4.2.1.2 Number and proportion of complete cases

Ncc <- cbind(
  "#" = table(complete.cases(df)),
  "%" = round(100 * table(complete.cases(df))/nrow(df), 2)
)
rownames(Ncc) <- c("incompl.", "complete")
Ncc
##            #    %
## incompl.   8  1.6
## complete 492 98.4

Discuss

Among 500 observations:

  • 8 observations contain NA.
  • 492 observations don't have NA.

4.2.1.3 unlist()

p_missing <- unlist(lapply(df, function(x) sum(is.na(x))))/nrow(df)
sort(p_missing[p_missing > 0], decreasing = TRUE)
##   aspects time.rand       sbp ia.occlus extra.ica 
##     0.008     0.004     0.002     0.002     0.002

4.2.1.4 plyr package

library(plyr)
ddply(df, c(sbp = "ifelse(is.na(sbp), 'missing', 'observed')",
                aspects = "ifelse(is.na(aspects), 'missing', 'observed')",
                time.rand = "ifelse(is.na(time.rand), 'missing', 'observed')",
                ia.occlus = "ifelse(is.na(ia.occlus), 'missing', 'observed')",
                extra.ica = "ifelse(is.na(extra.ica), 'missing', 'observed')"
                ),
      summarize,
      N = length(sbp)
)
##        sbp  aspects time.rand ia.occlus extra.ica   N
## 1  missing observed  observed  observed  observed   1
## 2 observed  missing  observed  observed  observed   4
## 3 observed observed   missing  observed  observed   2
## 4 observed observed  observed   missing   missing   1
## 5 observed observed  observed  observed  observed 492

Discuss

This command provides more info than the previous ones. Among 500 observations:

  • 8 observations (obs) contain NA:
    ** 1 obs has NA at sbp.
    ** 4 obs have NA at aspects.
    ** 2 obs have NA at time.rand.
    ** 1 obs has NA at both ia.occlus and extra.ica.

  • 492 obs don't have NA.

     

4.2.1.5 library(mice) (recommended)

Using the following command:
md.pattern(x)

md.pattern(df)

##     trt age sex nihss location hx.isch afib dm mrankin iv.altep sbp ia.occlus
## 492   1   1   1     1        1       1    1  1       1        1   1         1
## 4     1   1   1     1        1       1    1  1       1        1   1         1
## 2     1   1   1     1        1       1    1  1       1        1   1         1
## 1     1   1   1     1        1       1    1  1       1        1   1         0
## 1     1   1   1     1        1       1    1  1       1        1   0         1
##       0   0   0     0        0       0    0  0       0        0   1         1
##     extra.ica time.rand aspects  
## 492         1         1       1 0
## 4           1         1       0 1
## 2           1         0       1 1
## 1           0         1       1 2
## 1           1         1       1 1
##             1         2       4 9

Discuss

  • In each row output:
    ** 1: the variable is complete.
    ** 0: the variable in that pattern contains missing values.
    ** The last row: the total number of missing values for each variable.

  • In each column output:
    ** 1st column on the left (without a column name) shows the number of cases with a specific pattern.
    ** The last column on the right: the number of variables that are incomplete in that pattern.

Returning to our data:

  • At each row:
    ** 1st line: 492 obs have complete values.
    ** 2nd line: 4 obs have NA at aspects.
    ** 3rd line: 2 obs have NA at time.rand.
    ** 4th line: 1 obs has NA at both ia.occlus and extra.ica.
    ** 5th line: 1 obs has NA at sbp.

  • At each column, the total number of missing values per variable:
    ** sbp: 1
    ** ia.occlus: 1
    ** extra.ica: 1
    ** time.rand: 2
    ** aspects: 4

     

4.2.1.6 library(visdat)

Installing this package: devtools::install_github("ropensci/visdat")
Using the following command:
vis_miss(x, cluster = F, sort_miss = F, show_perc = T, show_perc_col = T)

  • cluster = F (default): if TRUE, uses hierarchical clustering (mcquitty method) to arrange rows according to missingness.
  • sort_miss = F (default): if TRUE, arranges the columns in order of missingness.
  • show_perc = T (default): adds the % of missing/complete data in the whole dataset to the legend.
  • show_perc_col = T (default): adds the % of missing data in a given column to the x axis.
vis_miss(df, cluster = T, sort_miss = T, show_perc = T, show_perc_col = T)

vis_dat(df, sort_type = T)

Discuss

  • More visual presentation.
  • Hard to identify the exact number of missing values in each variable.

4.2.1.7 library(VIM) (recommended)

Using the following command:
aggr(x, plot = T, prop = T, numbers = F, combined = F, sortVars = T, cex.axis = .7, gap = 3)

  • plot: whether the results should be plotted.
  • numbers: whether the proportion or frequencies of the different combinations should be represented by numbers.
  • prop: whether the proportion of missing/imputed values should be used rather than the total amount.
  • combined: a logical indicating whether the two plots should be combined.
  • sortVars: whether the variables should be sorted by the number of missing/imputed values.
aggr(df, plot = F, numbers = T, prop = T)
## 
##  Missings in variables:
##   Variable Count
##        sbp     1
##    aspects     4
##  ia.occlus     1
##  extra.ica     1
##  time.rand     2
aggr(df, plot = T, numbers = T, prop = T, combined= F, sortVars = T, cex.axis=.7, gap=3)

## 
##  Variables sorted by number of missings: 
##   Variable Count
##    aspects 0.008
##  time.rand 0.004
##        sbp 0.002
##  ia.occlus 0.002
##  extra.ica 0.002
##        trt 0.000
##        age 0.000
##        sex 0.000
##      nihss 0.000
##   location 0.000
##    hx.isch 0.000
##       afib 0.000
##         dm 0.000
##    mrankin 0.000
##   iv.altep 0.000
aggr(df, plot = T, numbers = T, prop = T, combined= T, sortVars = F)

Discuss

  • We can see there are 5 variables with missing values:
    ** 1 obs has NA at both ia.occlus and extra.ica.

  • Among these variables:
    ** Numeric variables: sbp, aspects, time.rand.
    ** Categorical variables: ia.occlus, extra.ica.

  • The total proportion of missing values across these 5 variables is < 5% => complete case analysis could be applied. However, in this example, I will still use single imputation and multiple imputation to demonstrate how they work.

     

Summary
I recommend using mice and/or the aggr() function for inspecting missing values:

  • The code above calculates what percent of data is missing per variable. A simple look at this table warns us about any variables that have more than 25% missing. It is useful to remove such variables from the dataset first, as they might mess up the imputation (see the sketch after this list).

  • There are 5 variables with missing values:
    ** Numeric variables: sbp, aspects, time.rand.
    ** Categorical variables: ia.occlus, extra.ica.

  • The outcome variables of this project: nihss; mrankin.

  • The missing data patterns show that the total proportion of missing values is < 5% => complete case analysis could be applied.
    => However, for demonstration, I still continue with the imputation practice.
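
A minimal sketch of the 25% screen mentioned above (my own helper code; no variable in this dataset crosses the threshold):

miss_pct <- sort(colMeans(is.na(df)) * 100, decreasing = TRUE)
round(miss_pct, 2)               # percent missing per variable
names(miss_pct)[miss_pct > 25]   # variables to consider removing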

     

4.2.2 Step 2.2: Missing data evaluation

4.2.2.1 Little’s MCAR test in R

Little's MCAR test is available in the naniar package as the mcar_test function. To apply the test, we select only the continuous variables, using the same dataset as in the previous paragraph. A significant p-value would indicate that the missingness does not seem to be completely at random.

  • Choosing only continuous variables:
df_select <- df[,c("age", "nihss", "sbp","aspects" ,"time.rand" )]
library(naniar)
mcar_test(data=df_select)
## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1      18.4    12   0.104                4
  • Choosing the whole dataset (not suitable because the dataset contains categorical data):
mcar_test(data = df)
## # A tibble: 1 x 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1      64.3    55   0.183                5

Discuss
The p-value of both tests is > 0.05 => we fail to reject the null hypothesis.
=> The missingness is consistent with missing completely at random (MCAR).
=> This agrees with the conclusion from the missing data patterns (< 5%): the missingness is assumed not to matter for the analysis.
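
As an aside, the continuous columns can also be selected programmatically instead of by name (a sketch using dplyr; this assumes every numeric column is continuous):

df_num <- df %>% select(where(is.numeric))
mcar_test(data = df_num)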

 

4.2.2.2 Compare and test group comparisons

In R, the group of non-responders (i.e., participants with missing observations) can also be compared to the group of responders (i.e., participants without missing observations). For this we can use the missing_compare function of the finalfit package.
First we have to copy and rename the variables with missing data. We rename the continuous variables sbp, aspects and time.rand to sbp_MAR, aspects_MAR and time.rand_MAR respectively (and likewise extra.ica to extra.ica_MAR).

The missing_compare function uses the Kruskal-Wallis test for continuous variables and the chi-squared test for categorical variables. (Note: the outcomes of interest in this research are mrankin and nihss.)

In this example, I will demonstrate the command to compare groups for the numeric variable time.rand. Similar commands can be used for the other numeric variables:

df_select <- df[,c("age", "nihss", "sbp","aspects" ,"time.rand", 
                   "trt", "mrankin", "ia.occlus", "extra.ica")]

# Copy variable with missing data and rename 
df_select$sbp_MAR <- df_select$sbp
df_select$aspects_MAR <- df_select$aspects
df_select$time.rand_MAR <- df_select$time.rand
df_select$extra.ica_MAR <- df_select$extra.ica

library(finalfit)

explanatory = c("age", "nihss", "sbp" ,"aspects")
dependent = "time.rand_MAR"
df_select %>% 
  missing_compare(dependent, explanatory) %>% 
    knitr::kable(row.names=FALSE, align = c("l", "l", "r", "r", "r"), 
        caption = "Mean comparisons between values of responders (Not missing) and 
        non-responders (Missing) on the time.rand variable.") 
Mean comparisons between values of responders (Not missing) and non-responders (Missing) on the time.rand variable.

| Missing data analysis: time.rand_MAR |           |  Not missing |     Missing |     p |
|:--------------------------------------|:----------|-------------:|------------:|------:|
| age                                    | Mean (SD) |  64.6 (17.0) |  94.0 (1.4) | 0.015 |
| nihss                                  | Mean (SD) |   18.0 (4.7) |  18.0 (1.4) | 0.994 |
| sbp                                    | Mean (SD) | 145.6 (25.1) | 105.0 (8.5) | 0.022 |
| aspects                                | Mean (SD) |    8.5 (1.6) |   9.0 (1.4) | 0.653 |

 

I can also compare groups for the extra.ica variable.

explanatory = c("trt", "mrankin")
dependent = "extra.ica_MAR"
df_select %>% 
  missing_compare(dependent, explanatory) %>% 
    knitr::kable(row.names=FALSE, align = c("l", "l", "r", "r", "r"), 
        caption = "Mean comparisons between values of responders (Not missing) and 
        non-responders (Missing) on the extra.ica variable.") 
Mean comparisons between values of responders (Not missing) and non-responders (Missing) on the extra.ica variable.

| Missing data analysis: extra.ica_MAR |              | Not missing | Missing |     p |
|:--------------------------------------|:-------------|------------:|--------:|------:|
| trt                                    | Intervention | 233 (100.0) | 0 (0.0) | 1.000 |
|                                        | Control      |  266 (99.6) | 1 (0.4) |       |
| mrankin                                | 0            |  403 (99.8) | 1 (0.2) | 0.971 |
|                                        | 1            |  50 (100.0) | 0 (0.0) |       |
|                                        | 2            |  25 (100.0) | 0 (0.0) |       |
|                                        | > 2          |  21 (100.0) | 0 (0.0) |       |

Discuss

  • For each variable with missing data, a missingness indicator (Missing vs. Not missing) is defined and used to compare the group means of the other variables, which can then be tested for significance. The variables compared across the indicator groups are age, nihss, sbp, aspects, trt and mrankin.
  • The drawback of this table: it does not show the number of missing values.

On the time.rand_MAR variable, the patients that have observed values (column Not missing) differ significantly from patients with missing values (column Missing) on age (P = 0.015) and sbp (P = 0.022).

  • For the age variable:
    ** When we look at the mean of the age variable, we see that the mean for patients with missing values is higher than the mean for patients with observed values. This means that there is a higher probability of missing data for patients of higher age.
    ** If time.rand and age are correlated, the missing values on the time.rand variable can also be explained by the age variable.
  • For the sbp variable:
    ** There is a higher probability of missing data for patients with lower sbp values.
    ** If time.rand and sbp are correlated, the missing values on the time.rand variable can also be explained by the sbp variable. A quick correlation check is sketched below.
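
A quick check of those correlations (a minimal sketch; pairwise complete observations are used because of the missing values):

cor(df$age, df$time.rand, use = "pairwise.complete.obs")
cor(df$sbp, df$time.rand, use = "pairwise.complete.obs")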

On the extra.ica variable, the patients with observed values (column Not missing) do not differ significantly from patients with missing values (column Missing) on trt and mrankin.

 

4.3 Step 3: Distribution of the data

Before imputing the missing data it is important to take a look at how the incomplete variables are distributed so that we can choose appropriate imputation models. Pay attention to:

  • whether continuous distributions deviate considerably from the normal distribution;
  • whether variables have values close to the limits of the range they are defined on;
  • whether categorical variables are very unbalanced (i.e., some categories are very small).
# One option that allows us to quickly get an overview of all variables,
# categorical and continuous (replace the "df" dataset in the code below).
# Note: df must be a plain data.frame (not a tibble) so that df[, i]
# returns a vector.

nc <- max(5, ceiling(sqrt(ncol(df))))
nr <- ceiling(ncol(df) / nc)
par(mfrow = c(nr, nc), mgp = c(2, 0.6, 0), mar = c(2, 3, 3, 0.5))
for (i in 1:ncol(df)) {
  if (is.numeric(df[, i])) {
    hist(df[, i], nclass = 50, xlab = "",
         main = paste0(names(df[i]), " (",
                       round(mean(is.na(df[, i])) * 100, 2), "% NA)")
    )
  } else {
    barplot(table(df[, i]), ylab = "Frequency",
            main = paste0(names(df[i]), " (",
                          round(mean(is.na(df[, i])) * 100, 2), "% NA)"))
  }
}

4.4 Introduction: the simputation package

4.4.1 Imputation: simputation package

Package used: simputation

Usage: impute_(data, formula, [model-specific options])

Formula is of the form:
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]

  • The left-hand side of the formula object lists the variable or variables to be imputed.
  • Variables on the right-hand side are used as predictors.
  • If grouping variables are specified, the data set is split according to the values of those variables, and model estimation and imputation occur independently for each group (see the sketch below).
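
The grouping part of the formula is not demonstrated elsewhere in this section, so here is a minimal sketch (my own illustration): fit a separate robust linear model of sbp on age within each sex group, and impute accordingly.

set.seed(1234)
df_grp <- df %>% data.frame() %>%
    impute_rlm(., sbp ~ age | sex) %>%   # separate fit within each sex
  tbl_df()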

Note: If several variables share the same predictors, we can combine them on the left-hand side of the formula; this produces the same results as imputing them one at a time. For example:

set.seed(1234)
df_multi <- df %>% data.frame() %>%
    impute_rlm(., sbp + aspects + time.rand ~ trt + sex + age) %>%
  tbl_df()

diffdf(df, df_multi)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##       sbp             1         
##     aspects           4         
##    time.rand          2         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE  
##   -----------------------------------------
##      sbp          131       <NA>  143.8762 
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE  
##   -----------------------------------------
##    aspects         19       <NA>  8.723411 
##    aspects        169       <NA>  8.740693 
##    aspects        201       <NA>  8.743835 
##    aspects        222       <NA>  8.684133 
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ==========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE  
##   ------------------------------------------
##    time.rand       434       <NA>  195.4407 
##    time.rand       463       <NA>  195.6484 
##   ------------------------------------------
set.seed(1234)
df_multi <- df %>% data.frame() %>%
    impute_rlm(., aspects ~ trt + sex + age) %>%
    impute_rlm(., time.rand ~ trt + sex + age) %>%
    impute_rlm(., sbp ~ trt + sex + age) %>%
  tbl_df()

diffdf(df, df_multi)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##       sbp             1         
##     aspects           4         
##    time.rand          2         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE  
##   -----------------------------------------
##      sbp          131       <NA>  143.8762 
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE  
##   -----------------------------------------
##    aspects         19       <NA>  8.723411 
##    aspects        169       <NA>  8.740693 
##    aspects        201       <NA>  8.743835 
##    aspects        222       <NA>  8.684133 
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ==========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE  
##   ------------------------------------------
##    time.rand       434       <NA>  195.4407 
##    time.rand       463       <NA>  195.6484 
##   ------------------------------------------

 

We will impute the missing values using several different strategies, all supported nicely by the simputation package.

| Impute type   | Target variable                | Description |
|:--------------|:-------------------------------|:------------|
| impute_rlm    | Numeric                        | Imputation using robust linear regression with M-estimators |
| impute_lm     | Numeric                        | Imputation using linear regression |
| impute_median | Numeric                        | Impute (group-wise) medians: the median of all non-missing observations is imputed into the missing values of the variable |
| impute_pmm    | Numeric                        | Imputation using predictive mean matching |
| impute_em     | Numeric                        | Multivariate, model-based imputation: the variables are assumed to follow a multivariate normal distribution whose means and covariance matrix are estimated with the EM algorithm |
| impute_mf     | Numeric, categorical, or mixed | Multivariate, model-based imputation: missing values are imputed with a rough guess, after which a predictive random forest is trained and used to re-impute the missing values; this is iterated until convergence |
| impute_cart   | Numeric, categorical, or mixed | Decision-tree imputation using classification and regression trees |
| impute_rf     | Numeric, categorical, or mixed | Decision-tree imputation using a random forest model |
| impute_knn    | Numeric, categorical, or mixed | Imputation using the k-nearest neighbours method |
| impute_rhd    | Numeric, categorical, or mixed | Random "hot deck" imputation: draws at random from the complete cases for that variable |
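
As a quick illustration of two methods from this table that are not used later in this section (a sketch; the predictor sets here are my own arbitrary choices):

set.seed(1234)
df_pmm <- df %>% data.frame() %>%
    impute_pmm(., sbp ~ age + nihss) %>%
    impute_cart(., aspects ~ trt + sex + nihss) %>%
  tbl_df()

diffdf(df, df_pmm)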

4.4.2 Checking the changes

Packages used: diffdf; arsenal

After imputing the discrete variables, I will use diffdf() from the diffdf package to check the differences between the 2 datasets: diffdf(x, y)

Or I can use comparedf() from the arsenal package: summary(comparedf(x, y))

 

4.5 Step 4: Single imputation for categorical variables

Because time.punc only applies to the Intervention group (trt) and time.iv only applies to patients with iv.altep = Yes, I eliminate these two variables from the imputation. Therefore, we only have 5 variables:

  • Numeric variables: sbp, aspects, time.rand.

  • Categorical variables: ia.occlus, extra.ica.

     

Here, we’ll arbitrarily impute our categorical variables (ia.occlus, extra.ica) as follows:

  1. No grouping applied for extra.ica in the 1st imputation

impute_rhd(., x ~ y) and impute_knn(., x ~ y)

  • x ~ 1 specifies that no grouping is to be applied.

  • x ~ . specifies that all other variables are used as predictors.

  • For extra.ica (Yes/No) we'll use the impute_rhd approach to draw a random observation from the existing set of Yes and No values in the complete extra.ica data.

  • For ia.occlus (5 categories) we'll use impute_knn - the k-nearest neighbours method - which takes the result of a model based on the (imputed) extra.ica value and the remaining variables and converts it to the nearest value in the observed ia.occlus data.

     

  • Formulas: extra.ica ~ 1 ; ia.occlus ~ trt + iv.altep + extra.ica:

set.seed(1234)
df_trt_iv.altep <- df %>%
  data.frame() %>%
    impute_rhd(., 
               extra.ica ~ 1) %>%
    impute_knn(., ia.occlus  ~ trt + iv.altep + extra.ica, k=5) %>%
  tbl_df()

 

  • Formulas: extra.ica ~ 1 ; ia.occlus ~ trt + iv.altep + sex + location + hx.isch + afib + dm + mrankin + extra.ica for the second imputation:
set.seed(1234)
df_all <- df %>%
  data.frame() %>%
    impute_rhd(., 
               extra.ica ~ 1) %>%
    impute_knn(., ia.occlus ~ trt + iv.altep + sex + location + hx.isch + afib + dm + mrankin + extra.ica, k=5) %>%
  tbl_df()

 

Checking the changes: comparedf() vs diffdf()
After imputing the discrete variables, I will use diffdf() from the diffdf package to check the differences between the 2 datasets:
diffdf(x, y)

diffdf(df_trt_iv.altep, df_all)
## No issues were found!

 

Or I can use comparedf() from the arsenal package:
summary(comparedf(x, y))

summary(comparedf(df_trt_iv.altep, df_all))
## 
## 
## Table: Summary of data.frames
## 
## version   arg                ncol   nrow
## --------  ----------------  -----  -----
## x         df_trt_iv.altep      15    500
## y         df_all               15    500
## 
## 
## 
## Table: Summary of overall comparison
## 
## statistic                                                      value
## ------------------------------------------------------------  ------
## Number of by-variables                                             0
## Number of non-by variables in common                              15
## Number of variables compared                                      15
## Number of variables in x but not y                                 0
## Number of variables in y but not x                                 0
## Number of variables compared with some values unequal              0
## Number of variables compared with all values equal                15
## Number of observations in common                                 500
## Number of observations in x but not y                              0
## Number of observations in y but not x                              0
## Number of observations with some compared variables unequal        0
## Number of observations with all compared variables equal         500
## Number of values unequal                                           0
## 
## 
## 
## Table: Variables not shared
## 
##                          
##  ------------------------
##  No variables not shared 
##  ------------------------
## 
## 
## 
## Table: Other variables not compared
## 
##                                  
##  --------------------------------
##  No other variables not compared 
##  --------------------------------
## 
## 
## 
## Table: Observations not shared
## 
##                             
##  ---------------------------
##  No observations not shared 
##  ---------------------------
## 
## 
## 
## Table: Differences detected by variable
## 
## var.x       var.y         n   NAs
## ----------  ----------  ---  ----
## trt         trt           0     0
## age         age           0     0
## sex         sex           0     0
## nihss       nihss         0     0
## location    location      0     0
## hx.isch     hx.isch       0     0
## afib        afib          0     0
## dm          dm            0     0
## mrankin     mrankin       0     0
## sbp         sbp           0     0
## iv.altep    iv.altep      0     0
## aspects     aspects       0     0
## ia.occlus   ia.occlus     0     0
## extra.ica   extra.ica     0     0
## time.rand   time.rand     0     0
## 
## 
## 
## Table: Differences detected
## 
##                          
##  ------------------------
##  No differences detected 
##  ------------------------
## 
## 
## 
## Table: Non-identical attributes
## 
##                              
##  ----------------------------
##  No non-identical attributes 
##  ----------------------------

 

Discuss

  • With no grouping applied in the 1st imputation, the imputed values are extra.ica: No; ia.occlus: M1.

  • Because ia.occlus and extra.ica each have only 1 missing value - a small number - no difference is detected when applying impute_rhd and impute_knn with different combinations of variables.

     

  2. Grouping applied for extra.ica in the 1st imputation

impute_cart(., x ~ y) and impute_cart(., x ~ y)

  • Formulas: extra.ica ~ trt ; ia.occlus ~ trt + iv.altep:
set.seed(1234)
df_trt_iv.altep <- df %>%
  data.frame() %>%
    impute_cart(., 
               extra.ica ~ trt) %>%
    impute_cart(., ia.occlus     ~ trt + iv.altep) %>%
  tbl_df()
diffdf(df, df_trt_iv.altep)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##    ia.occlus          1         
##    extra.ica          1         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    ia.occlus       404       <NA>    M1    
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    extra.ica       404       <NA>    No    
##   -----------------------------------------

 

impute_cart(., x ~ y) and impute_knn(., x ~ y)
  • Formulas: extra.ica ~ trt ; ia.occlus ~ trt + iv.altep + extra.ica:

set.seed(1234)
df_trt_iv.altep <- df %>%
  data.frame() %>%
    impute_cart(., 
               extra.ica ~ trt) %>%
    impute_knn(., ia.occlus  ~ trt + iv.altep + extra.ica, k=5) %>%
  tbl_df()
diffdf(df, df_trt_iv.altep)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##    ia.occlus          1         
##    extra.ica          1         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    ia.occlus       404       <NA>    M1    
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    extra.ica       404       <NA>    No    
##   -----------------------------------------

impute_rhd(., x ~ y) and impute_knn(., x ~ y)
  • Formulas: extra.ica ~ trt ; ia.occlus ~ trt + iv.altep + extra.ica:

set.seed(1234)
df_trt_iv.altep1 <- df %>%
  data.frame() %>%
    impute_rhd(., 
               extra.ica ~ trt) %>%
    impute_knn(., ia.occlus  ~ trt + iv.altep + extra.ica, k=5) %>%
  tbl_df()
diffdf(df, df_trt_iv.altep1)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##    ia.occlus          1         
##    extra.ica          1         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   =============================================
##    VARIABLE   ..ROWNUMBER..  BASE    COMPARE   
##   ---------------------------------------------
##    ia.occlus       404       <NA>  ICA with M1 
##   ---------------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    extra.ica       404       <NA>    Yes   
##   -----------------------------------------

Discuss

  • As we can see, different groupings and methods lead to different imputation results. This comes from just 2 missing values in the categorical variables - imagine what would happen with a dataset containing many missing values.
  • Therefore, choosing a suitable imputation method is crucial.
  • For single imputation of categorical variables, I personally recommend using impute_cart and impute_knn:
    ** dataset %>% data.frame() %>% impute_cart(., x ~ y) %>% tbl_df()
    ** dataset %>% data.frame() %>% impute_knn(., x ~ y) %>% tbl_df()

I will use df_trt_iv.altep - the result of the combination of impute_cart and impute_knn - for the next step: single imputation for numeric variables.

 

4.6 Step 5: Single imputation for numeric variables

After dealing with the 2 categorical variables, we only have to deal with the remaining 3 numeric variables:

  • Categorical variables (done): ia.occlus, extra.ica.
  • Numeric variables: sbp, aspects, time.rand.

In this 1st trial, I'm choosing to use:

  • impute_rlm for sbp, based on afib + dm + mrankin.
  • impute_knn for aspects, based on trt + sex + nihss.
  • impute_rlm for time.rand, based on trt + sex + age + nihss + hx.isch.
set.seed(1234)
df_num1 <- df_trt_iv.altep %>% data.frame() %>%
    impute_rlm(., sbp ~ afib + dm + mrankin) %>%
    impute_knn(., aspects ~ trt + sex + nihss, k = 5) %>%
    impute_rlm(., time.rand ~ trt + sex + age   + nihss + hx.isch) %>%
  tbl_df()
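
Before comparing datasets, a quick sanity check (my own addition) confirms that no NA values remain in the imputed variables:

colSums(is.na(df_num1))   # should be 0 for every variable
anyNA(df_num1)            # should be FALSE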

Checking the changes: diffdf()
In the checking step, we have 2 ways to compare:

  • Between the categorical imputation dataset and this final imputation dataset:
diffdf(df_trt_iv.altep, df_num1)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##       sbp             1         
##     aspects           4         
##    time.rand          2         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE  
##   -----------------------------------------
##      sbp          131       <NA>  145.8476 
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##    aspects         19       <NA>     8    
##    aspects        169       <NA>     8    
##    aspects        201       <NA>     9    
##    aspects        222       <NA>     5    
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ==========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE  
##   ------------------------------------------
##    time.rand       434       <NA>  196.3632 
##    time.rand       463       <NA>  198.9834 
##   ------------------------------------------
  • Between the original dataset and this final imputation dataset:
diffdf(df, df_num1)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##       sbp             1         
##     aspects           4         
##    ia.occlus          1         
##    extra.ica          1         
##    time.rand          2         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE  
##   -----------------------------------------
##      sbp          131       <NA>  145.8476 
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##    aspects         19       <NA>     8    
##    aspects        169       <NA>     8    
##    aspects        201       <NA>     9    
##    aspects        222       <NA>     5    
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    ia.occlus       404       <NA>    M1    
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE 
##   -----------------------------------------
##    extra.ica       404       <NA>    No    
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ==========================================
##    VARIABLE   ..ROWNUMBER..  BASE  COMPARE  
##   ------------------------------------------
##    time.rand       434       <NA>  196.3632 
##    time.rand       463       <NA>  198.9834 
##   ------------------------------------------

 

In the 2nd trial, I'm choosing to use:

  • impute_rhd for sbp, with no grouping.
  • impute_median for aspects, grouped by trt + sex.
  • impute_rlm for time.rand, based on trt + sex + age.
set.seed(1234)
df_num2 <- df_trt_iv.altep %>% data.frame() %>%
    impute_rhd(., sbp ~ 1) %>%
    impute_median(., aspects ~ trt + sex) %>%
    impute_rlm(., time.rand ~ trt + sex + age) %>%
  tbl_df()

Comparing to the result of the 1st trial:

diffdf(df_num2, df_num1)
## Differences found between the objects!
## 
## A summary is given below.
## 
## Not all Values Compared Equal
## All rows are shown in table below
## 
##   ==============================
##    Variable   No of Differences 
##   ------------------------------
##       sbp             1         
##     aspects           3         
##    time.rand          2         
##   ------------------------------
## 
## 
## All rows are shown in table below
## 
##   =========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE  
##   -----------------------------------------
##      sbp          131       139   145.8476 
##   -----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ========================================
##    VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
##   ----------------------------------------
##    aspects         19        9       8    
##    aspects        169        9       8    
##    aspects        222        9       5    
##   ----------------------------------------
## 
## 
## All rows are shown in table below
## 
##   ==============================================
##    VARIABLE   ..ROWNUMBER..    BASE    COMPARE  
##   ----------------------------------------------
##    time.rand       434       195.4407  196.3632 
##    time.rand       463       195.6484  198.9834 
##   ----------------------------------------------

Discuss

  • The result of the 1st trial differs from that of the 2nd trial.
    => Choose impute_(data, formula, [model-specific options]) carefully.

  • Personally, I recommend:
    ** Numeric variables: impute_rlm, impute_cart and impute_knn.
    ** Categorical variables: impute_cart and impute_knn.

     

4.7 Step 6: Saving the new tibbles

saveRDS(df_num1, file = "df_num1.Rds")
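
The saved tibble can be reloaded later with the readRDS() counterpart:

df_num1 <- readRDS("df_num1.Rds")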

5 References

https://thomaselove.github.io/432-notes/dealing-with-missingness-single-imputation.html#plotting-missingness
https://bookdown.org/mwheymans/bookmi/
https://github.com/benjaminrich/table1/issues/52
https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184
https://stefvanbuuren.name/fimd/workflow.html
https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_mice_2018#getting_to_know_the_data
https://www.gerkovink.com/miceVignettes/Convergence_pooling/Convergence_and_pooling.html
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press.
He, Y., 2010. Missing data analysis using multiple imputation: getting to the heart of the matter. Circulation: Cardiovascular Quality and Outcomes, 3(1), pp.98-105.
Jakobsen, J.C., Gluud, C., Wetterslev, J. et al. When and how should multiple imputation be used for handling missing data in randomised clinical trials – a practical guide with flowcharts. BMC Med Res Methodol 17, 162 (2017). https://doi.org/10.1186/s12874-017-0442-1.
White, I.R., Royston, P. and Wood, A.M., 2011. Multiple imputation using chained equations: issues and guidance for practice. Statistics in medicine, 30(4), pp.377-399.