library(woeBinning) # 'germancredit' data
library(PRROC) # ROC curves in data frames
library(tidyverse) # everything
library(imputeMissings) # median/mode imputation
library(visdat) # missingness plots
library(doParallel) # parallel missForest
library(doRNG) # reproducible parallel results
library(gridExtra) # combining graphs
library(randomForest) # random forests
library(missForest) # random forest imputation
library(kableExtra) # pretty tables

options(scipen = 999) # no scientific notation

RNGkind(kind = "Mersenne-Twister", 
        normal.kind = "Inversion", 
        sample.kind = "Rejection") # because different versions of R giving different RNG results annoys me

theme_set(theme_light()) # ggplot theme


Introduction

MissForest - An Overview

MissForest is a random forest imputation algorithm for missing data, implemented in R in the missForest package. It initially imputes all missing data using the mean/mode, then, for each variable with missing values, fits a random forest on the observed part and predicts the missing part. This train-and-predict process repeats iteratively until a stopping criterion is met, or a user-specified maximum number of iterations is reached.

The reason for the multiple iterations is that, from iteration 2 onwards, the random forests performing the imputation are trained on progressively better-quality data that has itself been predictively imputed.

The original paper, Stekhoven & Bühlmann (2011), randomly introduced missing values at rates of 10%, 20% & 30% into 10 publicly available datasets, comparing the accuracy of the imputations (w.r.t. Normalized RMSE & Percentage Falsely Classified) to some of the best and most widely used imputation algorithms available (KNNimpute, MICE, MissPALasso).

At almost every level of missingness in every dataset, MissForest outperformed these other algorithms, in some cases reducing the imputation error by more than 50%. It is also easier to use than any of these alternatives and doesn't require tuning (because random forests are so effective at their default parameters).

Imputation of a matrix of predictors X is as easy as:

imputed_X <- missForest(X)$ximp
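
The returned object contains more than just the imputed data. A minimal sketch (with X being any data frame of predictors containing missing values):

imp <- missForest(X)

imp$ximp     # the imputed data frame
imp$OOBerror # estimated imputation error: NRMSE (numeric) & PFC (categorical)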




Advantages & Disadvantages

Advantages:

  • Can be applied to mixed data types (missings in numeric & categorical variables)
  • No pre-processing required (no dummy-coding, standardization, data splitting, etc.)
  • No assumptions required (aside from the usual missing-data assumption of MAR/MCAR)
  • Robust to noisy data, as random forests effectively have built-in feature selection. Methods like KNN imputation perform poorly in datasets with weak & non-informative predictors, whereas missForest() will make little to no use of these features
  • Non-parametric: makes no assumptions about the relationships between features, unlike MICE, which typically assumes linearity
  • Excellent predictive power
  • Can leverage non-linear and interaction effects between features to improve imputation accuracy
  • Gives an OOB error estimate for its predictions (Numeric: NRMSE/MSE, Categorical: PFC)
  • Works with high-dimensional data ($p \gg n$)


Disadvantages:

  • Imputation time, which increases with the number of observations, predictors and number of predictors containing missing values
  • It inherits the same lack of interpretability of random forests
  • It is an algorithm, not a model object you can store somewhere. This means it has to run each time missing data has to be imputed, which could be problematic in some production environments




The Algorithm

General Notation

Let $X$ be our $n \times p$ matrix of predictors that requires imputation:

$$X = (X_1, X_2, \dots, X_p) = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & & & \vdots \\ \vdots & & \ddots & \\ x_{n1} & & \cdots & x_{np} \end{bmatrix}$$

An arbitrary variable $X_s$ contains missing values at entries $i_{\text{mis}}^{(s)} \subseteq \{1, 2, \dots, n\}$.

For every variable Xs that contains missing values, we can separate the dataset into 4 categories:

  1. The non-missing values of variable $X_s$, denoted by $y_{\text{obs}}^{(s)}$.
  2. The missing values of variable $X_s$, denoted by $y_{\text{mis}}^{(s)}$.
  3. The variables other than $X_s$, with observations $i_{\text{obs}}^{(s)} = \{1, 2, \dots, n\} \setminus i_{\text{mis}}^{(s)}$, denoted by $x_{\text{obs}}^{(s)}$.
  4. The variables other than $X_s$, with observations $i_{\text{mis}}^{(s)}$, denoted by $x_{\text{mis}}^{(s)}$.


MissForest algorithm pseudocode:

Requires:

  • $X$, our $n \times p$ matrix for imputation
  • $\gamma$, the stopping criterion (the algorithm works in iterations, stopping as soon as the difference between consecutively imputed data matrices increases for both the numeric and the categorical variables, or after 10 iterations with the default maxiter parameter)


Algorithm:

  1. Make an initial guess for all missing categorical/numeric values (e.g. mean, mode)
  2. $k \leftarrow$ vector of column indices in $X$, sorted in ascending order of % missing
  3. while not $\gamma$ do:
  4.       $X_{\text{old}}^{\text{imp}} \leftarrow$ store previous imputed matrix
  5.       for $s$ in $k$ do:
  6.             Fit a random forest predicting the non-missing values of $X_s$: $y_{\text{obs}}^{(s)} \sim x_{\text{obs}}^{(s)}$
  7.             Use this to predict the missing values of $X_s$: predict $y_{\text{mis}}^{(s)}$ using $x_{\text{mis}}^{(s)}$
  8.             $X_{\text{new}}^{\text{imp}} \leftarrow$ update imputed matrix, using the predicted $y_{\text{mis}}^{(s)}$
  9.       end for
  10.       update $\gamma$
  11. end while
  12. return the final imputed matrix $X^{\text{imp}}$
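
To make the pseudocode concrete, below is a minimal sketch of steps 1 and 6-8 (a single variable, a single iteration) written directly with randomForest(). The toy data and variable names are my own illustration; missForest() performs this loop internally, handling mixed types and the stopping criterion for you.

library(randomForest)

X <- iris[, 1:4] # toy all-numeric predictor matrix

set.seed(1)
X[sample(nrow(X), 15), "Sepal.Length"] <- NA # introduce missings into X_s

s <- "Sepal.Length"                # the variable X_s being imputed
i_mis <- which(is.na(X[[s]]))      # i_mis^(s): rows where X_s is missing
others <- setdiff(names(X), s)     # the variables other than X_s

# step 1: initial guess (mean imputation for a numeric X_s)
X_imp <- X
X_imp[i_mis, s] <- mean(X[[s]], na.rm = TRUE)

# steps 6-8: fit y_obs^(s) ~ x_obs^(s), then predict y_mis^(s) from x_mis^(s)
rf_s <- randomForest(x = X_imp[-i_mis, others], y = X_imp[-i_mis, s])
X_imp[i_mis, s] <- predict(rf_s, X_imp[i_mis, others])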




Example & Diagram

Let $X = (X_1, X_2, X_3, X_4, X_5)$ be an $n \times p$ matrix, with $n = 20$ and $p = 5$ (20 observations, 5 features).

Let $X_1$ have missing values in rows 3 & 15 (10% missing).

Let $X_3$ have missing values in rows 6, 7 & 16 (15% missing).


Notation (for $X_1$):

  • $i_{\text{mis}}^{(1)} = \{3, 15\} \subseteq \{1, 2, \dots, 20\}$ (the entries of $X_1$ that are missing)
  • $y_{\text{obs}}^{(1)} = X_1$ with rows 3 & 15 removed
  • $y_{\text{mis}}^{(1)} =$ rows 3 & 15 of $X_1$
  • $x_{\text{obs}}^{(1)} = (X_2, X_3, X_4, X_5)$, with rows 3 & 15 removed
  • $x_{\text{mis}}^{(1)} =$ rows 3 & 15 of $(X_2, X_3, X_4, X_5)$


Applied to the algorithm:

  1. Roughly impute missings of $X_1$ and $X_3$ with the mean/mode
  2. $k \leftarrow (1, 3)$, the columns that require imputation, in ascending order of missingness ($X_1$ has 10%, $X_3$ has 15%)
  3. while not $\gamma$ do:
  4.       $X_{\text{old}}^{\text{imp}} \leftarrow$ store previous imputed matrix
  5.       $s = 1$:
  6.             Fit a random forest on all values of $X_1$ except rows 3 & 15: $y_{\text{obs}}^{(1)} \sim x_{\text{obs}}^{(1)}$
  7.             Predict the missing values 3 & 15 of $X_1$, i.e. predict $y_{\text{mis}}^{(1)}$ using $x_{\text{mis}}^{(1)}$
  8.             $X_{\text{new}}^{\text{imp}} \leftarrow$ update imputed matrix, using the predicted $y_{\text{mis}}^{(1)}$
  9.       $s = 3$:
  10.             Fit a random forest on all values of $X_3$ except rows 6, 7 & 16: $y_{\text{obs}}^{(3)} \sim x_{\text{obs}}^{(3)}$
  11.             Predict the missing values 6, 7 & 16 of $X_3$, i.e. predict $y_{\text{mis}}^{(3)}$ using $x_{\text{mis}}^{(3)}$
  12.             $X_{\text{new}}^{\text{imp}} \leftarrow$ update imputed matrix, using the predicted $y_{\text{mis}}^{(3)}$
  13.       update $\gamma$: has the stopping criterion been reached?
  14. end while (it can take several iterations before the difference between consecutive imputations begins to increase, i.e. before the stopping criterion $\gamma$ is met)
  15. return the final imputed matrix $X^{\text{imp}}$


Diagram:

I put together the image below to help visualize the initial mean/mode imputation, followed by two random forest imputations (one for X1, one for X3).

  • Table 1 shows the initial predictor matrix $X$
  • Table 2 shows $X$ with mean/mode imputation and the first (regression) random forest being built for $X_1$
  • Table 3 shows $X$ with the (originally) missing values of $X_1$ now replaced by the predictions from the regression random forest. It also shows the second (classification) random forest being built for $X_3$
  • Table 4 shows $X$ with the (originally) missing values of $X_3$ now replaced by the predictions from the classification random forest


Tables 2 - 4 therefore show one iteration of the algorithm. You can picture these three steps repeating until the stopping criterion is reached.

Since random forests use bootstrap sampling, each tree (in the final forests and in all of the intermediate iterations) is trained on roughly $\frac{2}{3}$ of the observations. This is why MissForest has the unique property of also being able to provide OOB errors for its imputations.




Application - the germancredit dataset

Objective

The original authors thoroughly tested and demonstrated the accuracy of missForest()'s imputations. Instead of repeating their work, I want to test the effect that changing the imputation method has on an overall modelling objective relevant to my own work: maximizing out-of-sample ROC AUC when predicting credit default.

Since it is so widely used, I will compare missForest() with median/mode imputation (for numeric & categorical missings respectively) via imputeMissings::impute(). Both of these functions are one-liners that require virtually no thinking or pre-processing.

My hope is that I can demonstrate that missForest() is just as easy to use, but that it has a statistically significant positive impact on the test AUC for this dataset.




Random Forests vs Median/Mode

Pre-processing

I use the woeBinning::germancredit dataset here, and perform minimal data cleaning:

germancredit$purpose <- ifelse(germancredit$purpose %in% c("business", "car (new)", "car (used)", "education", "furniture/equipment", "radio/television"), 
                               as.character(germancredit$purpose),
                               "OTHER") %>% factor()

germancredit$job <- ifelse(germancredit$job %in% c("skilled employee / official", "unskilled - resident"), 
                           as.character(germancredit$job),
                           "OTHER") %>% factor()

germancredit$number.of.people.being.liable.to.provide.maintenance.for <- factor(germancredit$number.of.people.being.liable.to.provide.maintenance.for)


glimpse(germancredit)
## Rows: 1,000
## Columns: 21
## $ status.of.existing.checking.account                      <fct> ... < 0 DM...
## $ duration.in.month                                        <dbl> 6, 48, 12,...
## $ credit.history                                           <fct> critical a...
## $ purpose                                                  <fct> radio/tele...
## $ credit.amount                                            <dbl> 1169, 5951...
## $ savings.account.and.bonds                                <fct> unknown/ n...
## $ present.employment.since                                 <fct> ... >= 7 y...
## $ installment.rate.in.percentage.of.disposable.income      <dbl> 4, 2, 2, 2...
## $ personal.status.and.sex                                  <fct> male : sin...
## $ other.debtors.or.guarantors                              <fct> none, none...
## $ present.residence.since                                  <dbl> 4, 2, 3, 4...
## $ property                                                 <fct> "real esta...
## $ age.in.years                                             <dbl> 67, 22, 49...
## $ other.installment.plans                                  <fct> none, none...
## $ housing                                                  <fct> own, own, ...
## $ number.of.existing.credits.at.this.bank                  <dbl> 2, 1, 1, 1...
## $ job                                                      <fct> unskilled ...
## $ number.of.people.being.liable.to.provide.maintenance.for <fct> 1, 1, 2, 2...
## $ telephone                                                <fct> "yes, regi...
## $ foreign.worker                                           <fct> yes, yes, ...
## $ creditability                                            <fct> good, bad,...

I split the data into train/test, randomly introducing 20% missing values into the predictors of both datasets:

# train/test:
set.seed(111)
train_index <- sample(nrow(germancredit), 700) 

train <- germancredit[train_index, ]
train_X <- prodNA(select(train, -creditability), 0.2)

test <- germancredit[-train_index, ]
test_X <- prodNA(select(test, -creditability), 0.2)

vis_miss(rbind(train_X, test_X), show_perc = F) + 
  coord_flip()


Imputation using missForest()

I then take the following steps:

  1. Impute the training data using missForest(): train_X → imp_train_X
  2. Build a random forest rf predicting creditability (the response) on imp_train_X
  3. Combine imp_train_X & test_X → train_test_X
  4. Run missForest() on train_test_X, then extract imp_test_X (the test observations only)
  5. Use rf to get the probability predictions on the test data (using the imputed data, imp_test_X)
  6. Calculate the resulting test-data ROC curve & AUC
# 1) impute train
imp_train_X <- missForest(train_X)$ximp
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!
##   missForest iteration 7 in progress...done!
##   missForest iteration 8 in progress...done!
##   missForest iteration 9 in progress...done!
# 2) build model
rf <- randomForest(x = imp_train_X, y = train$creditability)

Notice that in steps 3 & 4 I am using the imputed training predictors to give the algorithm additional data with which to impute the test predictors. This is perfectly fine and should decrease imputation error, with the trade-off of increased computing time. I would argue that this method is more robust when imputing smaller future datasets.

Consider a realistic situation where the training data is 1M rows and this model is put into production to score ~500 observations hourly. By appending the 1M imputed training observations onto future observations before imputing, we strengthen the predictive power of the algorithm (each random forest will be built on >1M observations instead of <500).
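
As a sketch of that production pattern (score_new is a hypothetical helper of my own, not part of missForest; new_X is assumed to share the columns of the stored imp_train_X):

score_new <- function(new_X, imp_train_X, model) {
  combined <- rbind(new_X, imp_train_X)                  # append the stored imputed training rows
  imp_new <- missForest(combined)$ximp[1:nrow(new_X), ]  # keep only the newly imputed rows
  predict(model, imp_new, type = "prob")[, 2]            # event probabilities for the new observations
}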

# 3) & 4) combine & impute test
train_test_X <- rbind(test_X, imp_train_X)
imp_test_X <- missForest(train_test_X)$ximp[1:nrow(test_X), ]
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!
##   missForest iteration 7 in progress...done!
# 5) predict for test
pred_test <- predict(rf, imp_test_X, type = "prob")

# 6) test ROC & AUC
test_scores <- data.frame(event_prob = pred_test[ ,2], labels = test$creditability)

test_roc_v1 <- roc.curve(scores.class0 = test_scores[test_scores$labels == "good", ]$event_prob, # scores for the POSITIVE class
                      scores.class1 = test_scores[test_scores$labels == "bad", ]$event_prob, # scores for the NEGATIVE class
                      curve=T)

The resulting test ROC AUC:

test_roc_v1$auc # 0.7764794
## [1] 0.7764794




Median/Mode Imputation using impute()

I then take the following steps:

  1. Impute the training data using median/mode imputation: train_X → imp_train_X
  2. Do the same for the test dataset: test_X → imp_test_X
  3. Build a random forest rf predicting creditability (the response) on imp_train_X
  4. Use rf to get the probability predictions on the test data (using the imputed data, imp_test_X)
  5. Calculate the resulting test-data ROC curve & AUC
# 1) impute train
imp_train_X <- impute(train_X, method = "median/mode")

# 2) impute test
imp_test_X <- impute(test_X, method = "median/mode")

# 3) build model
rf <- randomForest(x = imp_train_X, y = train$creditability)

# 4) predict for test
pred_test <- predict(rf, imp_test_X, type = "prob")

# 5) test ROC & AUC
test_scores <- data.frame(event_prob = pred_test[ ,2], labels = test$creditability)

test_roc_v2 <- roc.curve(scores.class0 = test_scores[test_scores$labels == "good", ]$event_prob, # scores for the POSITIVE class
                      scores.class1 = test_scores[test_scores$labels == "bad", ]$event_prob, # scores for the NEGATIVE class
                      curve=T)

The resulting test ROC AUC:

test_roc_v2$auc # 0.7566735
## [1] 0.7566735




Comparison

I plot the two ROC curves produced using both methodologies:

data.frame(method = "missForest", FPR = test_roc_v1$curve[ ,1], TPR = test_roc_v1$curve[ ,2]) %>%
  rbind(data.frame(method = "median/mode", FPR = test_roc_v2$curve[ ,1], TPR = test_roc_v2$curve[ ,2])) %>%
  ggplot(aes(x = FPR, y = TPR, col = method)) + 
  geom_line() + 
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") + 
  labs(title = "MissForest vs Median/Mode Imputation", 
       subtitle = paste0("MissForest AUC = ", round(test_roc_v1$auc, 3), ", Median/Mode AUC = ", round(test_roc_v2$auc, 3)), 
       col = "Imputation Method")

Great! Predictive imputation using missForest() succeeded in increasing our out-of-sample AUC.

With a large dataset this would be meaningful, as train & test would be similar, but with this 1,000-row sample there will be considerable variability in these results, simply due to the randomness of the train/test split.

In the next section I repeat this process many times, choosing an appropriate statistical test to quantify whether missForest() truly delivers a higher test AUC over many samples.




Does predictive imputation really help?

Repeated Testing

To answer this question, I decided to do something similar to Monte Carlo cross-validation, wrapping the above process in a for() loop: 100 train/test splits, each time introducing missing data, imputing it, building a model on the imputed training data to predict credit default, and finally evaluating the model's ROC AUC on the imputed test data.

Fun facts:

  • Since there are missing values in each of the 20 variables, each iteration requires 20 random forests to be trained
  • Each forest consists of 100 trees
  • On average, these datasets reached the stopping criterion after 6 iterations
  • There are 2 datasets to impute (train_X & test_X)
  • This process is repeated 100 times

So the code below builds $20 \times 100 \times 6 \times 2 \times 100 = 2{,}400{,}000$ decision trees in total.

missforest_auc <- c()
median_mode_auc <- c()


registerDoParallel(cores = 4) # set number of cores to use
registerDoRNG(seed = 123)


for (i in 1:100) {
  
  train_index <- sample(nrow(germancredit), 700) 
  
  train <- germancredit[train_index, ]
  train_X <- prodNA(select(train, -creditability), 0.2)
  
  test <- germancredit[-train_index, ]
  test_X <- prodNA(select(test, -creditability), 0.2)
  
  
  ##### ##### ##### missForest() ##### ##### ##### 
  
  # 1) impute train
  imp_train_X <- missForest(train_X, parallelize = "forests")$ximp
  
  # 2) build model
  rf <- randomForest(x = imp_train_X, y = train$creditability)
  
  # 3) impute test
  train_test_X <- rbind(test_X, imp_train_X)
  imp_test_X <- missForest(train_test_X, parallelize = "forests")$ximp[1:nrow(test_X), ]
  
  # 4) evaluate model
  pred_test <- predict(rf, imp_test_X, type = "prob")
  
  test_scores <- data.frame(event_prob = pred_test[ ,2], labels = test$creditability)
  
  test_roc_v1 <- roc.curve(scores.class0 = test_scores[test_scores$labels == "good", ]$event_prob,
                           scores.class1 = test_scores[test_scores$labels == "bad", ]$event_prob,
                           curve=T)
  
  missforest_auc[i] <- test_roc_v1$auc
  
  
  ##### ##### ##### median/mode ##### ##### ##### 
  
  # 1) impute train & test
  imp_train_X <- impute(train_X, method = "median/mode")
  imp_test_X <- impute(test_X, method = "median/mode")
  
  # 2) build model
  rf <- randomForest(x = imp_train_X, y = train$creditability)
  
  # 3) evaluate model
  pred_test <- predict(rf, imp_test_X, type = "prob")
  
  test_scores <- data.frame(event_prob = pred_test[ ,2], labels = test$creditability)
  
  test_roc_v2 <- roc.curve(scores.class0 = test_scores[test_scores$labels == "good", ]$event_prob,
                           scores.class1 = test_scores[test_scores$labels == "bad", ]$event_prob,
                           curve=T)
  
  median_mode_auc[i] <- test_roc_v2$auc
  
  
  print(paste0("i = ", i, " . . . missforest_auc = ", round(missforest_auc[i], 3), ", median_mode_auc = ", round(median_mode_auc[i], 3)))
}


auc_df <- data.frame(missforest_auc, median_mode_auc) %>%
  mutate(iteration = 1:nrow(.))

# write.csv(auc_df, "auc_df.csv", row.names = F)

Without parallel processing, each loop took ~44 seconds. With parallel processing, this was reduced to ~25 seconds. With more (and faster) cores, alongside following the Increasing Speed section at the end of this document, this could be reduced significantly.




Results & Conclusion

A glimpse() of the results can be seen below.

auc_df <- read.csv("./auc_df.csv")
glimpse(auc_df)
## Rows: 100
## Columns: 3
## $ missforest_auc  <dbl> 0.7616422, 0.7602458, 0.7172750, 0.7402216, 0.72515...
## $ median_mode_auc <dbl> 0.7503064, 0.7642765, 0.6990250, 0.7228860, 0.72109...
## $ iteration       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...

missForest() average test AUC:

round(mean(auc_df$missforest_auc), 3) # 0.749758
## [1] 0.75

Median/mode average test AUC:

round(mean(auc_df$median_mode_auc), 3) # 0.7419909
## [1] 0.742

An almost 0.01 increase in test AUC when predicting credit default is really impressive for doing nothing more than changing the imputation method.


I produced a few different visualizations of the results below:

g1 <- auc_df %>%
  gather("method",
         "AUC",
         -iteration) %>%
  ggplot(aes(x = AUC, fill = method)) +
  geom_vline(aes(xintercept = mean(auc_df$missforest_auc)), size = 1, col = "#00BFC4") +
  geom_vline(aes(xintercept = mean(auc_df$median_mode_auc)), size = 1, col = "#F8766D") +
  geom_density(alpha = 0.4) + 
  scale_fill_discrete(labels = c("median/mode", "missForest")) +
  theme(legend.position = "bottom", 
        legend.title = element_blank(), 
        axis.title.y = element_blank(), 
        axis.ticks.y = element_blank(), 
        axis.text.y = element_blank())

g2 <- auc_df %>%
  mutate(diff = missforest_auc - median_mode_auc) %>%
  gather("method",
         "AUC",
         -diff) %>%
  ggplot(aes(x = diff)) +
  geom_vline(xintercept = 0, col = "grey20") +
  geom_vline(aes(xintercept = mean(auc_df$missforest_auc - auc_df$median_mode_auc), col = "mediumseagreen"), size = 1) + 
  scale_color_manual(values = "mediumseagreen", labels = "mean(diff)") +
  geom_density(alpha = 0.4, fill = "mediumseagreen") + 
  scale_x_continuous(breaks = seq(-0.06, 0.06, 0.01)) + 
  labs(x = "diff = (missForest AUC) - (median/mode AUC)") + 
  theme(legend.position = "bottom", 
        legend.title = element_blank(), 
        axis.title.y = element_blank(), 
        axis.ticks.y = element_blank(), 
        axis.text.y = element_blank())

g3 <- auc_df %>% 
  gather("method", 
         "AUC", 
         -iteration) %>% 
  ggplot(aes(x = method, y = AUC, fill = method)) + 
  geom_boxplot() + 
  scale_fill_discrete(labels = c("median/mode", "missForest")) +
  coord_flip() + 
  theme(legend.position = "bottom", 
        legend.title = element_blank(),
        axis.title.y = element_blank(), 
        axis.ticks.y = element_blank(), 
        axis.text.y = element_blank())

g4 <- auc_df %>%
  mutate(missforest_best = factor(missforest_auc - median_mode_auc > 0)) %>%
  ggplot(aes(x = median_mode_auc, y = missforest_auc, col = missforest_best)) + 
  geom_abline(intercept = 0, slope = 1, col = "deepskyblue3", size = 1, alpha = 0.5, linetype = "dashed") + 
  geom_point() + 
  scale_color_manual(values = c("grey30", "mediumseagreen")) +
  labs(x = "median/mode AUC", 
       y = "missForest AUC", 
       col = "missForest best?:") + 
  theme(legend.position = "bottom")


grid.arrange(g1, g2, g3, g4, ncol = 2, nrow = 2)


To determine the likelihood that the difference in means can be explained by random variation, I use a paired t-test:

$$H_0: \mu_d = 0 \qquad H_1: \mu_d > 0$$

Our null hypothesis is that the true mean difference ($\mu_d$) is zero, where difference = missforest_auc - median_mode_auc.

The alternative hypothesis is that the true mean difference is greater than zero.

  • A one-tailed test is justified here as the question we are trying to answer is asymmetric - I am only interested in whether we can justify replacing median/mode imputation on this dataset with missForest(), and if it isn’t better, we will throw it away and favour the simpler methodology (regardless of whether missForest() performs worse or the same).
  • A paired test is appropriate because each iteration trained and tested on the same datasets

The resulting p-value:

t.test(auc_df$missforest_auc, auc_df$median_mode_auc, paired = T, alternative = "greater")$p.value %>% round(7) # 0.0004236
## [1] 0.0004236

$p < 0.05$, so we reject the null hypothesis. The probability of observing a mean difference at least this large under the null hypothesis (that the true means are equal) is just 0.042%.

missForest() took no extra pre-processing of the data and had a statistically significant positive impact on the test AUC when predicting credit default for this dataset.




Advanced Tips

Increasing Speed

Parallel processing with parallelize:

To run missForest() in parallel you need to set up a parallel backend. It’s easy to do and can speed up performance significantly. For reproducible results in parallel you will need to set the seed via doRNG::registerDoRNG():

doParallel::registerDoParallel(cores = 4) # set based on number of CPU cores
doRNG::registerDoRNG(seed = 123)
imputed_X <- missForest(X, parallelize = 'forests')$ximp

Using set.seed(123, kind = "L'Ecuyer-CMRG") will provide reproducible results in most cases and requires no packages, but doRNG::registerDoRNG() is more reliable (doesn’t fail in loops).
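
For completeness, that base-R alternative would look like this (a sketch; as noted, doRNG is the more reliable option):

set.seed(123, kind = "L'Ecuyer-CMRG")
doParallel::registerDoParallel(cores = 4)
imputed_X <- missForest(X, parallelize = 'forests')$ximp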

There are two options for parallelize to run missForest() in parallel. Taken from the documentation:

  1. If parallelize = 'forests', the total number of trees in each random forest is split across the registered cores. This is most useful if a single forest object takes a long time to compute and there are not many variables in the data.
  2. If parallelize = 'variables', the data is split into pieces of size equal to the number of cores registered in the parallel backend. This is most useful if the data contains many variables and computing the individual random forests is not taking too long.

Whether parallelize = 'forests' or parallelize = 'variables' is more suitable depends on the data - try running a single iteration and compare the speeds. I have personally seen the greatest performance increase with parallelize = 'forests'.
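
A quick way to run that comparison yourself (a sketch, with X as in the snippet above; maxiter = 1 restricts each run to a single iteration):

system.time(missForest(X, parallelize = 'forests', maxiter = 1))
system.time(missForest(X, parallelize = 'variables', maxiter = 1))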


Reducing maxiter:

missForest() has a hard stop at 10 iterations (reached when the stopping criterion isn't met sooner), which can be increased or decreased via maxiter. The stopping criterion makes some sense, but it often won't end on the lowest-error iteration, and it can mean you are waiting around far longer than you need to for the imputation to finish.

Whenever you run missForest(), I'd recommend first specifying verbose = TRUE, which allows you to monitor the OOB error estimate across iterations. You will often find that there are significantly diminishing returns after just 2 random forest iterations. If you see this happening, cancel the imputation run and specify maxiter = 2; if the default algorithm would've continued to 10 iterations, you've just reduced computation time by 80%.
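
In code, that workflow might look like this (a sketch):

missForest(X, verbose = TRUE)                 # watch the OOB errors, cancel once they plateau
imputed_X <- missForest(X, maxiter = 2)$ximp  # re-run, stopping after 2 iterations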


Reducing ntree:

Reducing ntree is usually a trade-off between accuracy and imputation time, but ntree has a linear relationship with the time taken per iteration; halving ntree from the default 100 to 50 will halve the computation time of each iteration.

You can reduce ntree more aggressively than you would with a single random forest - the multiple iterations mean that each observation gets predicted multiple times, so 10 iterations with ntree = 100 still means 1,000 trees are used per variable in the imputation. As a start, try setting ntree to 50 and observe whether the (usually small) increase in OOB error is an acceptable price for halving the imputation time.
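
For example (a sketch):

imputed_X <- missForest(X, ntree = 50)$ximp # roughly half the per-iteration time of the default 100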


Reducing mtry:

The number of variables randomly selected at each split (mtry) gives you another option for reducing training time. The default value is $\lfloor\sqrt{p}\rfloor$. Reducing this from the default will have the biggest impact for high-dimensional data where $p$ is large, and can either increase or decrease the OOB error.
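
For example (a sketch; with this dataset's 20 predictors the default would be $\lfloor\sqrt{20}\rfloor = 4$):

imputed_X <- missForest(X, mtry = 2)$ximp # fewer candidate variables per split = faster trees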




Increasing Accuracy

I'm just going to go through a quick example of how you might use the parameters to increase imputation accuracy, as opposed to decreasing training time.

I again use the full germancredit dataset, introducing 20% missing values. I run missForest() in parallel for all of these examples.

X_true <- select(germancredit, -creditability)

set.seed(1)
X_missing <- prodNA(X_true, 0.2)

registerDoParallel(cores = 4)


Choosing an iteration:

Running missForest() normally with verbose = T shows that the best iteration is actually iteration 3 (the algorithm didn't stop there because the differences between the new and previous imputed matrices hadn't yet increased for both the continuous and the categorical variables).

registerDoRNG(seed = 1)
missForest_v1 <- missForest(X_missing, parallelize = 'forests', verbose = T) # 3 iterations seems good
##   parallelizing computation of the random forest model objects
##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.5025207 0.3453656 
##     difference(s): 0.02717122 0.06128571 
##     time: 2.52 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.4765437 0.3252075 
##     difference(s): 0.002889239 0.024 
##     time: 2.18 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.4534789 0.3202094 
##     difference(s): 0.001816984 0.01864286 
##     time: 1.91 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.4632786 0.3184532 
##     difference(s): 0.001843269 0.01764286 
##     time: 1.72 seconds
## 
##   missForest iteration 5 in progress...done!
##     estimated error(s): 0.4672895 0.3161335 
##     difference(s): 0.001811026 0.01707143 
##     time: 2.29 seconds
## 
##   missForest iteration 6 in progress...done!
##     estimated error(s): 0.464786 0.314633 
##     difference(s): 0.001440648 0.01735714 
##     time: 1.94 seconds
## 
##   missForest iteration 7 in progress...done!
##     estimated error(s): 0.4666038 0.3157098 
##     difference(s): 0.001678937 0.01714286 
##     time: 1.92 seconds
## 
##   missForest iteration 8 in progress...done!
##     estimated error(s): 0.465116 0.315166 
##     difference(s): 0.001125223 0.01607143 
##     time: 1.61 seconds
## 
##   missForest iteration 9 in progress...done!
##     estimated error(s): 0.4625869 0.311944 
##     difference(s): 0.001412786 0.01721429 
##     time: 2 seconds

We might determine that 3 iterations is sufficient (performance-wise or accuracy-wise) while the code is still running for a large dataset, so we could just stop the process and re-run specifying maxiter = 3 so the algorithm stops at that stage:

registerDoRNG(seed = 1)
missForest_v2 <- missForest(X_missing, parallelize = 'forests', maxiter = 3)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
missForest_v2$OOBerror
##     NRMSE       PFC 
## 0.4534789 0.3202094


Basic mtry selection:

Because running missForest() in parallel on a small dataset is already really fast, I instead try tuning mtry. I wrote a loop to try mtry values from 2-19, and wrapped this in an outer loop that changes the seed (to try and reduce the variability of the OOB error estimates).

mtry_used <- c()
OOB_NRMSE <- c()
OOB_PFC <- c()

i <- 1

for (loop_seed in 1:5) {
  for (mtry in 2:19) { 
    print(paste0("seed = ", loop_seed, ", mtry = ", mtry))
    
    registerDoRNG(seed = loop_seed)
    missForest_v3 <- missForest(X_missing, 
                                      parallelize = 'forests', 
                                      ntree = 100, 
                                      maxiter = 3, 
                                      mtry = mtry)
    
    OOB_NRMSE[i] <- missForest_v3$OOBerror[1]
    OOB_PFC[i] <- missForest_v3$OOBerror[2]
    mtry_used[i] <- mtry
    
    i <- i + 1
  }
}


mtry_df <- data.frame(mtry_used, OOB_NRMSE, OOB_PFC) %>%
  group_by(mtry_used) %>%
  summarize(`OOB NRMSE` = mean(OOB_NRMSE), 
            `OOB PFC` = mean(OOB_PFC)) %>%
  gather("metric", "error", -mtry_used)

# write.csv(mtry_df, "mtry_df.csv", row.names = F)

I extracted the OOB errors (NRMSE & PFC) each time, averaged the results over the seeds and plotted the resulting graph below.

mtry_df <- read.csv("./mtry_df.csv")

ggplot(mtry_df, aes(x = mtry_used, y = error, col = factor(metric))) + 
  geom_point() +
  geom_line() + 
  scale_x_continuous(breaks = seq(2, 20, 2)) + 
  scale_color_manual(values = c("deepskyblue", "mediumseagreen")) +
  facet_wrap(~ metric, scales = "free_y") + 
  theme(legend.position = "none", 
        axis.title.y = element_blank()) + 
  labs(x = "mtry")

This approach is only really practical for small datasets, but you could easily sample a large dataset, reduce maxiter and do some basic checks that the default value of mtry is reasonable.


Alternative mtry selection:

There is absolutely no reason why you couldn’t choose one mtry for numeric variables and another for categorical. The code to implement this is below.

categorical_vars <- X_missing %>% select_if(is.factor) %>% names()
numeric_vars <- X_missing %>% select_if(is.numeric) %>% names()

X_imp_cat <- missForest(X_missing, 
                        mtry = 7, 
                        parallelize = 'forests', 
                        maxiter = 3)$ximp[ ,categorical_vars]

X_imp_num <- missForest(X_missing, 
                        mtry = 15, 
                        parallelize = 'forests', 
                        maxiter = 3)$ximp[ ,numeric_vars]

X_imp <- cbind(X_imp_cat, X_imp_num)

You could also select mtry based on how the imputation performs on a set of features that you know will be the best predictors of the target. Specifying variablewise = T will mean that the OOB error will be reported per variable (and the metric for numeric variables will change to MSE).

registerDoRNG(seed = 1)
missForest_v4 <- missForest(X_missing, parallelize = 'forests', variablewise = T)

The output when doing this can be quite messy, but you can easily put the errors into a dataframe:

data.frame(varname = names(missForest_v4$ximp), 
           error_type = names(missForest_v4$OOBerror), 
           error = missForest_v4$OOBerror) %>%
  mutate(error_type = cell_spec(error_type, color = "white", background = ifelse(error_type == "MSE", "deepskyblue", "mediumseagreen"))) %>%
  kable(escape = F) %>% 
  kable_styling(full_width = F)
varname                                                    error_type            error
status.of.existing.checking.account                        PFC               0.4683544
duration.in.month                                          MSE              74.6415515
credit.history                                             PFC               0.2789539
purpose                                                    PFC               0.6235885
credit.amount                                              MSE         3626289.8668973
savings.account.and.bonds                                  PFC               0.4208543
present.employment.since                                   PFC               0.5368550
installment.rate.in.percentage.of.disposable.income        MSE               0.9228126
personal.status.and.sex                                    PFC               0.3659148
other.debtors.or.guarantors                                PFC               0.0962963
present.residence.since                                    MSE               0.9022829
property                                                   PFC               0.4573935
age.in.years                                               MSE              86.5584998
other.installment.plans                                    PFC               0.2004981
housing                                                    PFC               0.1730061
number.of.existing.credits.at.this.bank                    MSE               0.2091348
job                                                        PFC               0.3324841
number.of.people.being.liable.to.provide.maintenance.for   PFC               0.1356147
telephone                                                  PFC               0.2844388
foreign.worker                                             PFC               0.0380711

If anyone wanted to experiment with this methodology, you could create a custom metric that combines the OOB errors of the most important predictors and use this as the basis for the selection of the mtry hyper-parameter.
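
As a sketch of that idea (the choice of "important" variables below is hypothetical, and I restrict it to categorical predictors so the PFC values are on a comparable scale; numeric errors would need normalizing before mixing them in):

key_vars <- c("credit.history", "savings.account.and.bonds", "purpose")

oob_df <- data.frame(varname = names(missForest_v4$ximp), 
                     error = missForest_v4$OOBerror)

custom_metric <- mean(oob_df$error[oob_df$varname %in% key_vars])
custom_metric # lower is better - compare across candidate mtry values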

For the purposes of this document I am happy to proceed with maxiter = 3 and mtry = 9.


Increasing ntree:

Reducing ntree will generally increase the imputation error. Since we are only running three iterations, let's instead try increasing ntree quite significantly, to 1000:

registerDoRNG(seed = 1)
missForest_v5 <- missForest(X_missing, parallelize = 'forests', verbose = T, maxiter = 3, mtry = 9, ntree = 1000)
##   parallelizing computation of the random forest model objects
##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.4916868 0.3267124 
##     difference(s): 0.03673848 0.06035714 
##     time: 22.14 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.4351275 0.3086136 
##     difference(s): 0.003046984 0.01578571 
##     time: 21.33 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.4368263 0.3003819 
##     difference(s): 0.0006723632 0.007428571 
##     time: 20.37 seconds


Results:

Running the default algorithm gave the following OOB errors:

missForest_v1$OOBerror
##    NRMSE      PFC 
## 0.465116 0.315166

The end result was successful in decreasing the estimated imputation error, giving the following OOB errors:

missForest_v5$OOBerror
##     NRMSE       PFC 
## 0.4368263 0.3003819