library(woeBinning) # 'germancredit' data
library(PRROC) # ROC curves in data frames
library(tidyverse) # everything
library(imputeMissings) # median/mode imputation
library(visdat) # missingness plots
library(doParallel) # parallel missForest
library(doRNG) # reproducible parallel results
library(gridExtra) # combining graphs
library(randomForest) # random forests
library(missForest) # ...
library(kableExtra) # pretty tables
options(scipen = 999) # no scientific notation
RNGkind(kind = "Mersenne-Twister",
normal.kind = "Inversion",
sample.kind = "Rejection") # because different versions of R giving different RNG results annoys me
theme_set(theme_light()) # ggplot theme
MissForest is a random forest imputation algorithm for missing data, implemented in R in the missForest package. It initially imputes all missing data using the mean/mode, then for each variable with missing values, MissForest fits a random forest on the observed part and predicts the missing part. This process of training and predicting repeats iteratively until a stopping criterion is met, or a user-specified maximum number of iterations is reached.
The reason for the multiple iterations is that, from iteration 2 onwards, the random forests performing the imputation are trained on progressively better-quality data that has itself been predictively imputed.
The original paper, Stekhoven & Buhlmann (2011), randomly introduced missing values at rates of 10%, 20% & 30% into 10 publicly available datasets, comparing the accuracy of the imputations (w.r.t. Normalized RMSE & Percentage Falsely Classified) to some of the best and most widely used imputation algorithms available (KNNimpute, MICE, MissPALasso).
At almost every level of missingness in every dataset, MissForest out-performed these other algorithms, in some cases reducing the imputation error by more than 50%. It is also easier to use than all of these alternatives and doesn’t require tuning (because random forests are so effective at default parameters).
Imputation of a matrix of predictors X is as easy as:
imputed_X <- missForest(X)$ximp
Advantages: it can be applied directly to mixed-type data, needs no tuning or pre-processing, and weak or non-informative predictors do little harm, since missForest() will make little to no use of these features.
Disadvantages: imputation time grows with the number of observations, predictors and variables containing missing values, and the imputation is an algorithm rather than a stored model object, so it must be re-run whenever new data needs to be imputed.
Let X be our n×p matrix of predictors that requires imputation:
$$X = (X_1, X_2, \dots, X_p) = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & \ddots & & \vdots \\ \vdots & & \ddots & \\ x_{n1} & & \dots & x_{np} \end{bmatrix}$$
An arbitrary variable $X_s$ contains missing values at entries $i_{mis}^{(s)} \subseteq \{1, 2, \dots, n\}$.
For every variable $X_s$ that contains missing values, we can separate the dataset into 4 categories:
1. $y_{obs}^{(s)}$: the observed values of $X_s$;
2. $y_{mis}^{(s)}$: the missing values of $X_s$;
3. $x_{obs}^{(s)}$: the values of the other variables in the rows $i_{obs}^{(s)} = \{1, \dots, n\} \setminus i_{mis}^{(s)}$ where $X_s$ is observed;
4. $x_{mis}^{(s)}$: the values of the other variables in the rows $i_{mis}^{(s)}$ where $X_s$ is missing.
MissForest algorithm pseudocode:
Requires: $X$, an $n \times p$ matrix to impute, and a stopping criterion $\gamma$ (met when the difference between iterations $i$ and $i + 1$ of the imputed dataframes begins to increase for both categorical and numeric variables, or after 10 iterations with the default maxiter parameter).
Algorithm:
1. Make an initial guess for the missing values (mean/mode imputation);
2. Sort the variables in order of increasing amount of missing values;
3. While $\gamma$ is not met, loop over the variables with missing values: fit a random forest of $y_{obs}^{(s)}$ on $x_{obs}^{(s)}$, predict $y_{mis}^{(s)}$ using $x_{mis}^{(s)}$, and update the imputed matrix with these predictions;
4. Return the final imputed matrix.
Let $X = (X_1, X_2, X_3, X_4, X_5)$ be an $n \times p$ matrix, with n = 20 and p = 5 (20 observations, 5 features).
Let X1 have missing values in rows 3 & 15 (10% missing).
Let X3 have missing values in rows 6, 7 & 16 (15% missing).
Notation (for $X_1$): $y_{obs}^{(1)}$ is the observed values of $X_1$ (every row except 3 & 15), $y_{mis}^{(1)}$ is the missing values of $X_1$ (rows 3 & 15), $x_{obs}^{(1)}$ is the values of $X_2, \dots, X_5$ in every row except 3 & 15, and $x_{mis}^{(1)}$ is the values of $X_2, \dots, X_5$ in rows 3 & 15.
Applied to the algorithm: a random forest is trained with $y_{obs}^{(1)}$ as the response and $x_{obs}^{(1)}$ as the predictors, then used to predict $y_{mis}^{(1)}$ from $x_{mis}^{(1)}$; these predictions fill the missing entries of $X_1$. The same is then done for $X_3$ (rows 6, 7 & 16).
Diagram:
I put together the image below to help visualize the initial mean/mode imputation, followed by two random forest imputations (one for X1, one for X3).
Tables 2 - 4 therefore show one iteration of the algorithm. You can picture these three steps repeating until the stopping criterion is reached.
Since random forests use bootstrap sampling, the final forests (and all of those in the intermediate iterations) will only be trained on ≈2/3 of the data. This is why MissForest has the unique property of also being able to provide OOB (out-of-bag) errors for its imputations.
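To make the loop concrete, below is a minimal sketch of a single MissForest-style pass written with the already-loaded randomForest package. It is purely illustrative (numeric columns only, no sorting of variables by missingness, no stopping criterion) and is not the package's actual implementation.
# Minimal sketch of one MissForest-style pass (numeric columns only, illustration only)
one_missforest_pass <- function(X_mis) {
  X_imp <- X_mis
  # initial guess: mean-impute every column
  for (j in seq_along(X_imp)) {
    X_imp[[j]][is.na(X_imp[[j]])] <- mean(X_mis[[j]], na.rm = TRUE)
  }
  # one random forest per column that contains missing values
  for (j in which(colSums(is.na(X_mis)) > 0)) {
    mis_rows <- is.na(X_mis[[j]])
    rf <- randomForest(x = X_imp[!mis_rows, -j, drop = FALSE],   # observed part
                       y = X_imp[!mis_rows, j])
    X_imp[mis_rows, j] <- predict(rf, X_imp[mis_rows, -j, drop = FALSE])  # predict missing part
  }
  X_imp
}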
germancredit dataset
The original authors already tested and demonstrated the accuracy of missForest()'s imputations. Instead of repeating their work, I want to test the effect that changing imputation methodology has on an overall modelling objective relevant to my own work, such as maximizing out-of-sample ROC AUC when predicting credit default.
Since it is so widely used, I will compare missForest() with median/mode imputation (for numeric & categorical missings respectively) via imputeMissings::impute(). Both of these functions are one-liners that require virtually no thinking or pre-processing.
My hope is that I can demonstrate that missForest()
is just as easy to use, but that it has a statistically significant positive impact on the test
AUC for this dataset.
I use the woeBinning::germancredit
dataset here, and perform minimal data cleaning:
germancredit$purpose <- ifelse(germancredit$purpose %in% c("business", "car (new)", "car (used)", "education", "furniture/equipment", "radio/television"),
as.character(germancredit$purpose),
"OTHER") %>% factor()
germancredit$job <- ifelse(germancredit$job %in% c("skilled employee / official", "unskilled - resident"),
as.character(germancredit$job),
"OTHER") %>% factor()
germancredit$number.of.people.being.liable.to.provide.maintenance.for <- factor(germancredit$number.of.people.being.liable.to.provide.maintenance.for)
glimpse(germancredit)
## Rows: 1,000
## Columns: 21
## $ status.of.existing.checking.account <fct> ... < 0 DM...
## $ duration.in.month <dbl> 6, 48, 12,...
## $ credit.history <fct> critical a...
## $ purpose <fct> radio/tele...
## $ credit.amount <dbl> 1169, 5951...
## $ savings.account.and.bonds <fct> unknown/ n...
## $ present.employment.since <fct> ... >= 7 y...
## $ installment.rate.in.percentage.of.disposable.income <dbl> 4, 2, 2, 2...
## $ personal.status.and.sex <fct> male : sin...
## $ other.debtors.or.guarantors <fct> none, none...
## $ present.residence.since <dbl> 4, 2, 3, 4...
## $ property <fct> "real esta...
## $ age.in.years <dbl> 67, 22, 49...
## $ other.installment.plans <fct> none, none...
## $ housing <fct> own, own, ...
## $ number.of.existing.credits.at.this.bank <dbl> 2, 1, 1, 1...
## $ job <fct> unskilled ...
## $ number.of.people.being.liable.to.provide.maintenance.for <fct> 1, 1, 2, 2...
## $ telephone <fct> "yes, regi...
## $ foreign.worker <fct> yes, yes, ...
## $ creditability <fct> good, bad,...
I split the data into train
/test
, randomly introducing 20% missing values into the predictors of both datasets:
# train/test:
set.seed(111)
train_index <- sample(nrow(germancredit), 700)
train <- germancredit[train_index, ]
train_X <- prodNA(select(train, -creditability), 0.2)
test <- germancredit[-train_index, ]
test_X <- prodNA(select(test, -creditability), 0.2)
vis_miss(rbind(train_X, test_X), show_perc = F) +
coord_flip()
missForest()
I then take the following steps:
1. Use missForest() to impute the training predictors: train_X → imp_train_X
2. Build a random forest rf predicting creditability (the response) on imp_train_X
3. Combine imp_train_X & test_X → train_test_X
4. Run missForest() on train_test_X, then extract imp_test_X (the test observations only)
5. Use rf to get the probability predictions on the test data (using the imputed data, imp_test_X)
# 1) impute train
imp_train_X <- missForest(train_X)$ximp
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
## missForest iteration 7 in progress...done!
## missForest iteration 8 in progress...done!
## missForest iteration 9 in progress...done!
# 2) build model
rf <- randomForest(x = imp_train_X, y = train$creditability)
Notice that in steps 3 & 4 I am using the imputed training predictors to give the algorithm additional data in which to impute the test predictors. This is perfectly fine and would decrease imputation error with the trade-off of increased computing time. I would argue that this method would be more robust when imputing smaller future datasets.
Consider a realistic situation where the training data was 1M rows and this model was put into production to score ~500 observations hourly. By appending these 1M imputed training observations onto future observations before imputing we strengthen the predictive power of the algorithm (as each random forest will be built on >1M observations, instead of <500).
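As a rough sketch of what that hourly scoring step could look like (the helper name and new_batch_X are hypothetical; imp_train_X is the stored, already-imputed training predictors):
# Hypothetical scoring-time helper: append the stored imputed training predictors so
# each small batch of new observations is imputed with far more supporting rows
impute_new_batch <- function(new_X, imp_train_X) {
  combined <- rbind(new_X, imp_train_X)        # new rows first, as in the code below
  missForest(combined)$ximp[1:nrow(new_X), ]   # keep only the new rows
}
# imp_new_X <- impute_new_batch(new_batch_X, imp_train_X)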
# 3) & 4) combine & impute test
train_test_X <- rbind(test_X, imp_train_X)
imp_test_X <- missForest(train_test_X)$ximp[1:nrow(test_X), ]
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
## missForest iteration 7 in progress...done!
# 5) predict for test
pred_test <- predict(rf, imp_test_X, type = "prob")
# 6) test ROC & AUC
test_scores <- data.frame(event_prob = pred_test[ ,2], labels = test$creditability)
test_roc_v1 <- roc.curve(scores.class0 = test_scores[test_scores$labels == "good", ]$event_prob, # scores for the POSITIVE class
scores.class1 = test_scores[test_scores$labels == "bad", ]$event_prob, # scores for the NEGATIVE class
curve=T)
The resulting test
ROC AUC:
test_roc_v1$auc # 0.7764794
## [1] 0.7764794
impute()
I then take the following steps:
1. Use impute() to impute the training predictors: train_X → imp_train_X
2. Use impute() to impute the test predictors: test_X → imp_test_X
3. Build a random forest rf predicting creditability (the response) on imp_train_X
4. Use rf to get the probability predictions on the test data (using the imputed data, imp_test_X)
# 1) impute train
imp_train_X <- impute(train_X, method = "median/mode")
# 2) impute test
imp_test_X <- impute(test_X, method = "median/mode")
# 3) build model
rf <- randomForest(x = imp_train_X, y = train$creditability)
# 4) predict for test
pred_test <- predict(rf, imp_test_X, type = "prob")
# 5) test ROC & AUC
test_scores <- data.frame(event_prob = pred_test[ ,2], labels = test$creditability)
test_roc_v2 <- roc.curve(scores.class0 = test_scores[test_scores$labels == "good", ]$event_prob, # scores for the POSITIVE class
scores.class1 = test_scores[test_scores$labels == "bad", ]$event_prob, # scores for the NEGATIVE class
curve=T)
The resulting test
ROC AUC:
test_roc_v2$auc # 0.7566735
## [1] 0.7566735
I plot the two ROC curves produced using both methodologies:
data.frame(method = "missForest", FPR = test_roc_v1$curve[ ,1], TPR = test_roc_v1$curve[ ,2]) %>%
rbind(data.frame(method = "median/mode", FPR = test_roc_v2$curve[ ,1], TPR = test_roc_v2$curve[ ,2])) %>%
ggplot(aes(x = FPR, y = TPR, col = method)) +
geom_line() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
labs(title = "MissForest vs Median/Mode Imputation",
subtitle = paste0("MissForest AUC = ", round(test_roc_v1$auc, 3), ", Median/Mode AUC = ", round(test_roc_v2$auc, 3)),
col = "Imputation Method")
Great! Predictive imputation using missForest()
succeeded in increasing our out-of-sample AUC.
With a large dataset this would be meaningful as train
& test
would be similar, but with this 1,000-row sample there will be considerable variability in these results, simply based on variation in the train
/test
split of the data.
In the next section I explore repeating this process many times, choosing an appropriate statistical test to quantify whether the difference in test AUC is truly greater for missForest()
over many samples.
To answer this question, I decided to do something similar to Monte Carlo Cross-Validation, in which I wrap the above process in a for()
loop. This involved performing 100 train/test splits, introducing missing data, imputing it, building a model on the imputed training data to predict credit default and finally evaluating this model's ROC AUC on the imputed test data.
Fun facts: each call to missForest() fits one random forest per variable (20 predictors in train_X & test_X), each forest contains the default ntree = 100 trees, a run takes roughly 6 iterations on average, there are 2 missForest() calls per loop (one for the training data, one for the combined train/test data), and the loop runs 100 times. So the code below builds ≈ 20 × 100 × 6 × 2 × 100 = 2,400,000 decision trees in total.
missforest_auc <- c()
median_mode_auc <- c()
registerDoParallel(cores = 4) # set number of cores to use
registerDoRNG(seed = 123)
for (i in 1:100) {
train_index <- sample(nrow(germancredit), 700)
train <- germancredit[train_index, ]
train_X <- prodNA(select(train, -creditability), 0.2)
test <- germancredit[-train_index, ]
test_X <- prodNA(select(test, -creditability), 0.2)
##### ##### ##### missForest() ##### ##### #####
# 1) impute train
imp_train_X <- missForest(train_X, parallelize = "forests")$ximp
# 2) build model
rf <- randomForest(x = imp_train_X, y = train$creditability)
# 3) impute test
train_test_X <- rbind(test_X, imp_train_X)
imp_test_X <- missForest(train_test_X, parallelize = "forests")$ximp[1:nrow(test_X), ]
# 4) evaluate model
pred_test <- predict(rf, imp_test_X, type = "prob")
test_scores <- data.frame(event_prob = pred_test[ ,2], labels = test$creditability)
test_roc_v1 <- roc.curve(scores.class0 = test_scores[test_scores$labels == "good", ]$event_prob,
scores.class1 = test_scores[test_scores$labels == "bad", ]$event_prob,
curve=T)
missforest_auc[i] <- test_roc_v1$auc
##### ##### ##### median/mode ##### ##### #####
# 1) impute train & test
imp_train_X <- impute(train_X, method = "median/mode")
imp_test_X <- impute(test_X, method = "median/mode")
# 2) build model
rf <- randomForest(x = imp_train_X, y = train$creditability)
# 3) evaluate model
pred_test <- predict(rf, imp_test_X, type = "prob")
test_scores <- data.frame(event_prob = pred_test[ ,2], labels = test$creditability)
test_roc_v2 <- roc.curve(scores.class0 = test_scores[test_scores$labels == "good", ]$event_prob,
scores.class1 = test_scores[test_scores$labels == "bad", ]$event_prob,
curve=T)
median_mode_auc[i] <- test_roc_v2$auc
print(paste0("i = ", i, " . . . missforest_auc = ", round(missforest_auc[i], 3), ", median_mode_auc = ", round(median_mode_auc[i], 3)))
}
auc_df <- data.frame(missforest_auc, median_mode_auc) %>%
mutate(iteration = 1:nrow(.))
# write.csv(auc_df, "auc_df.csv", row.names = F)
Without parallel processing, each loop took ~44 seconds. With parallel processing, this was reduced to ~25 seconds. With more (and faster) cores, alongside following the Increasing Speed section at the end of this document, this could be reduced significantly.
A glimpse()
of the results can be seen below.
auc_df <- read.csv("./auc_df.csv")
glimpse(auc_df)
## Rows: 100
## Columns: 3
## $ missforest_auc <dbl> 0.7616422, 0.7602458, 0.7172750, 0.7402216, 0.72515...
## $ median_mode_auc <dbl> 0.7503064, 0.7642765, 0.6990250, 0.7228860, 0.72109...
## $ iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
missForest()
average test AUC:
round(mean(auc_df$missforest_auc), 3) # 0.749758
## [1] 0.75
Median/mode average test AUC:
round(mean(auc_df$median_mode_auc), 3) # 0.7419909
## [1] 0.742
An almost 0.01 increase in test AUC predicting credit default is actually really impressive for doing nothing more than changing our imputation method.
I produced a few different visualizations of the results below:
g1 <- auc_df %>%
gather("method",
"AUC",
-iteration) %>%
ggplot(aes(x = AUC, fill = method)) +
geom_vline(aes(xintercept = mean(auc_df$missforest_auc)), size = 1, col = "#00BFC4") +
geom_vline(aes(xintercept = mean(auc_df$median_mode_auc)), size = 1, col = "#F8766D") +
geom_density(alpha = 0.4) +
scale_fill_discrete(labels = c("median/mode", "missForest")) +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank())
g2 <- auc_df %>%
mutate(diff = missforest_auc - median_mode_auc) %>%
gather("method",
"AUC",
-diff) %>%
ggplot(aes(x = diff)) +
geom_vline(xintercept = 0, col = "grey20") +
geom_vline(aes(xintercept = mean(auc_df$missforest_auc - auc_df$median_mode_auc), col = "mediumseagreen"), size = 1) +
scale_color_manual(values = "mediumseagreen", labels = "mean(diff)") +
geom_density(alpha = 0.4, fill = "mediumseagreen") +
scale_x_continuous(breaks = seq(-0.06, 0.06, 0.01)) +
labs(x = "diff = (missForest AUC) - (median/mode AUC)") +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank())
g3 <- auc_df %>%
gather("method",
"AUC",
-iteration) %>%
ggplot(aes(x = method, y = AUC, fill = method)) +
geom_boxplot() +
scale_fill_discrete(labels = c("median/mode", "missForest")) +
coord_flip() +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank())
g4 <- auc_df %>%
mutate(missforest_best = factor(missforest_auc - median_mode_auc > 0)) %>%
ggplot(aes(x = median_mode_auc, y = missforest_auc, col = missforest_best)) +
geom_abline(intercept = 0, slope = 1, col = "deepskyblue3", size = 1, alpha = 0.5, linetype = "dashed") +
geom_point() +
scale_color_manual(values = c("grey30", "mediumseagreen")) +
labs(x = "median/mode AUC",
y = "missForest AUC",
col = "missForest best?:") +
theme(legend.position = "bottom")
grid.arrange(g1, g2, g3, g4, ncol = 2, nrow = 2)
To determine the likelihood that the difference in means can be explained by random variation, I use a paired t-test:
$$H_0: \mu_d = 0 \qquad H_1: \mu_d > 0$$
Our null hypothesis is that the true mean difference (μd) is zero, where difference = missforest_auc - median_mode_auc
.
The alternative hypothesis is that the true mean difference is greater than zero.
I use a one-sided alternative because we only care about whether there is an improvement from switching to missForest(); if it isn't better, we will throw it away and favour the simpler methodology (regardless of whether missForest() performs worse or the same).
The resulting p-value:
t.test(auc_df$missforest_auc, auc_df$median_mode_auc, paired = T, alternative = "greater")$p.value %>% round(7) # 0.0004236
## [1] 0.0004236
p<0.05, so we reject the null hypothesis. The probability of observing these test results under the null hypothesis that the means were equal is ≈0.042%.
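As a quick sanity check (assuming auc_df is still loaded), a paired t-test is just a one-sample t-test on the vector of differences:
d <- auc_df$missforest_auc - auc_df$median_mode_auc
t_stat <- mean(d) / (sd(d) / sqrt(length(d)))        # paired t-statistic
pt(t_stat, df = length(d) - 1, lower.tail = FALSE)   # one-sided p-value, matches t.test() above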
missForest()
took no extra pre-processing of the data and had a statistically significant positive impact on the test
AUC when predicting credit default for this dataset.
Parallel processing with parallelize
:
To run missForest()
in parallel you need to set up a parallel backend. It’s easy to do and can speed up performance significantly. For reproducible results in parallel you will need to set the seed via doRNG::registerDoRNG()
:
doParallel::registerDoParallel(cores = 4) # set based on number of CPU cores
doRNG::registerDoRNG(seed = 123)
imputed_X <- missForest(X, parallelize = 'forests')$ximp
Using set.seed(123, kind = "L'Ecuyer-CMRG")
will provide reproducible results in most cases and requires no packages, but doRNG::registerDoRNG()
is more reliable (doesn’t fail in loops).
There are two options for parallelize when running missForest() in parallel. Taken from the documentation:
With parallelize = 'forests', the total number of trees in each random forest is split into pieces equal to the number of registered cores. This is most useful if a single forest object takes long to compute and there are not many variables in the data.
With parallelize = 'variables', the data is split into pieces of a size equal to the number of cores registered in the parallel backend. This is most useful if the data contains many variables and computing the random forests is not taking too long.
Whether parallelize = 'forests' or parallelize = 'variables' is more suitable depends on the data - try running a single iteration and compare the speeds. I have personally seen the greatest performance increase with parallelize = 'forests'.
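A quick way to compare the two (a sketch, assuming a backend is already registered and X holds the predictors to impute) is to time a single iteration under each option:
# Time one iteration under each parallelize option before committing to a full run
system.time(missForest(X, maxiter = 1, parallelize = "forests"))
system.time(missForest(X, maxiter = 1, parallelize = "variables"))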
Reducing maxiter
:
missForest()
has a hard stop of 10 iterations (when the stopping criterion isn’t reached) which can be increased or decreased. The stopping criterion makes some sense, but it often won’t end on the lowest error iteration, and can mean you are waiting around far longer than you need to for imputation to finish.
Whenever you run missForest()
, I’d recommend first specifying verbose = T
which allows you to monitor the OOB error estimate across iterations. You will often find that there are significantly diminishing returns after just 2 random forest iterations. If you see this happening, cancel the imputation run and specify maxiter = 2
. Obviously if the default algorithm would’ve continued until 10 iterations you’ve just reduced computation time by 80%.
Reducing ntree
:
Reducing ntree
is usually a bit of a trade-off between accuracy and imputation time, but it has a linear relationship with the time taken per iteration; halving ntree
from the default 100
to 50
will halve the computation time of each iteration.
You can reduce ntree
more than you would with a single random forest - multiple iterations mean that each observation will get predicted multiple times, so 10 iterations with ntree = 100
still means 1,000 trees are used per variable in the imputation. As a start, try setting ntree
to 50 and observe whether the reduction in OOB error is worth the halved imputation times.
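For example (a sketch, with X as the predictors to impute):
# Halving ntree roughly halves the time per iteration; check what it costs in OOB error
imp_fast <- missForest(X, ntree = 50)
imp_fast$OOBerror    # compare NRMSE / PFC against a run with the default ntree = 100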
Reducing mtry
:
The number of variables randomly selected at each split (mtry
) gives you the option of reducing training time. The default value is ⌊√p⌋. Reducing this from the default will have the biggest impact for high dimensionality data where p is large, and can increase or decrease OOB error.
I’m just going to go through a quick example of how you might use the parameters to increase imputation accuracy as opposed to decreasing training time.
I again use the full germancredit
dataset, introducing 20% missing values. I run missForest()
in parallel for all of these examples.
X_true <- select(germancredit, -creditability)
set.seed(1)
X_missing <- prodNA(X_true, 0.2)
registerDoParallel(cores = 4)
Choosing an iteration:
Running missForest()
normally with verbose = T
shows that the best iteration is actually iteration 3 (the algorithm just didn’t stop there because the differences between the new and previous imputed X matrices didn’t increase for both the continuous and categorical variables).
registerDoRNG(seed = 1)
missForest_v1 <- missForest(X_missing, parallelize = 'forests', verbose = T) # 3 iterations seems good
## parallelizing computation of the random forest model objects
## missForest iteration 1 in progress...done!
## estimated error(s): 0.5025207 0.3453656
## difference(s): 0.02717122 0.06128571
## time: 2.52 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.4765437 0.3252075
## difference(s): 0.002889239 0.024
## time: 2.18 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.4534789 0.3202094
## difference(s): 0.001816984 0.01864286
## time: 1.91 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.4632786 0.3184532
## difference(s): 0.001843269 0.01764286
## time: 1.72 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.4672895 0.3161335
## difference(s): 0.001811026 0.01707143
## time: 2.29 seconds
##
## missForest iteration 6 in progress...done!
## estimated error(s): 0.464786 0.314633
## difference(s): 0.001440648 0.01735714
## time: 1.94 seconds
##
## missForest iteration 7 in progress...done!
## estimated error(s): 0.4666038 0.3157098
## difference(s): 0.001678937 0.01714286
## time: 1.92 seconds
##
## missForest iteration 8 in progress...done!
## estimated error(s): 0.465116 0.315166
## difference(s): 0.001125223 0.01607143
## time: 1.61 seconds
##
## missForest iteration 9 in progress...done!
## estimated error(s): 0.4625869 0.311944
## difference(s): 0.001412786 0.01721429
## time: 2 seconds
We might determine that 3 iterations is sufficient (performance-wise or accuracy-wise) while the code is still running for a large dataset, so we could just stop the process and re-run specifying maxiter = 3
so the algorithm stops at that stage:
registerDoRNG(seed = 1)
missForest_v2 <- missForest(X_missing, parallelize = 'forests', maxiter = 3)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
missForest_v2$OOBerror
## NRMSE PFC
## 0.4534789 0.3202094
Basic mtry
selection:
Because running missForest()
in parallel on a small dataset is already really fast, I instead try tuning mtry
. I wrote a loop to try mtry
values from 2-19, and wrapped this in an outer loop that changes the seed (to try and reduce the variability of the OOB error estimates).
mtry_used <- c()
OOB_NRMSE <- c()
OOB_PFC <- c()
i <- 1
for (loop_seed in 1:5) {
for (mtry in 2:19) {
print(paste0("seed = ", loop_seed, ", mtry = ", mtry))
registerDoRNG(seed = loop_seed)
missForest_v3 <- missForest(X_missing,
parallelize = 'forests',
ntree = 100,
maxiter = 3,
mtry = mtry)
OOB_NRMSE[i] <- missForest_v3$OOBerror[1]
OOB_PFC[i] <- missForest_v3$OOBerror[2]
mtry_used[i] <- mtry
i <- i + 1
}
}
mtry_df <- data.frame(mtry_used, OOB_NRMSE, OOB_PFC) %>%
group_by(mtry_used) %>%
summarize(`OOB NRMSE` = mean(OOB_NRMSE),
`OOB PFC` = mean(OOB_PFC)) %>%
gather("metric", "error", -mtry_used)
# write.csv(mtry_df, "mtry_df.csv", row.names = F)
I extracted the OOB errors (NRMSE & PFC) each time, averaged the results and plotted the resulting graph below.
mtry_df <- read.csv("./mtry_df.csv")
ggplot(mtry_df, aes(x = mtry_used, y = error, col = factor(metric))) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(2, 20, 2)) +
scale_color_manual(values = c("deepskyblue", "mediumseagreen")) +
facet_wrap(~ metric, scales = "free_y") +
theme(legend.position = "none",
axis.title.y = element_blank()) +
labs(x = "mtry")
This approach is only really practical for small datasets, but you could easily sample a large dataset, reduce maxiter
and do some basic checks that the default value of mtry
is reasonable.
Alternative mtry
selection:
There is absolutely no reason why you couldn’t choose one mtry
for numeric variables and another for categorical. The code to implement this is below.
categorical_vars <- X_missing %>% select_if(is.factor) %>% names()
numeric_vars <- X_missing %>% select_if(is.numeric) %>% names()
X_imp_cat <- missForest(X_missing,
mtry = 7,
parallelize = 'forests',
maxiter = 3)$ximp[ ,categorical_vars]
X_imp_num <- missForest(X_missing,
mtry = 15,
parallelize = 'forests',
maxiter = 3)$ximp[ ,numeric_vars]
X_imp <- cbind(X_imp_cat, X_imp_num)
You could also select mtry
based on how the imputation performs on a set of features that you know will be the best predictors of the target. Specifying variablewise = T
will mean that the OOB error will be reported per variable (and the metric for numeric variables will change to MSE).
registerDoRNG(seed = 1)
missForest_v4 <- missForest(X_missing, parallelize = 'forests', variablewise = T)
The output when doing this can be quite messy, but you can easily put the errors into a dataframe:
data.frame(varname = names(missForest_v4$ximp),
error_type = names(missForest_v4$OOBerror),
error = missForest_v4$OOBerror) %>%
mutate(error_type = cell_spec(error_type, color = "white", background = ifelse(error_type == "MSE", "deepskyblue", "mediumseagreen"))) %>%
kable(escape = F) %>%
kable_styling(full_width = F)
varname | error_type | error |
---|---|---|
status.of.existing.checking.account | PFC | 0.4683544 |
duration.in.month | MSE | 74.6415515 |
credit.history | PFC | 0.2789539 |
purpose | PFC | 0.6235885 |
credit.amount | MSE | 3626289.8668973 |
savings.account.and.bonds | PFC | 0.4208543 |
present.employment.since | PFC | 0.5368550 |
installment.rate.in.percentage.of.disposable.income | MSE | 0.9228126 |
personal.status.and.sex | PFC | 0.3659148 |
other.debtors.or.guarantors | PFC | 0.0962963 |
present.residence.since | MSE | 0.9022829 |
property | PFC | 0.4573935 |
age.in.years | MSE | 86.5584998 |
other.installment.plans | PFC | 0.2004981 |
housing | PFC | 0.1730061 |
number.of.existing.credits.at.this.bank | MSE | 0.2091348 |
job | PFC | 0.3324841 |
number.of.people.being.liable.to.provide.maintenance.for | PFC | 0.1356147 |
telephone | PFC | 0.2844388 |
foreign.worker | PFC | 0.0380711 |
If anyone wanted to experiment with this methodology, you could create a custom metric that combines the OOB errors of the most important predictors and use this as the basis for the selection of the mtry
hyper-parameter.
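A minimal sketch of that idea is below (the chosen variables and the simple average are purely illustrative; note that MSE and PFC are on different scales, so in practice you would want to rescale them or only combine variables of the same type):
# Hypothetical metric: average the variablewise OOB errors of the predictors we care about most
key_vars <- c("status.of.existing.checking.account", "credit.history", "savings.account.and.bonds")

key_var_error <- function(mf_fit, vars) {
  errs <- data.frame(varname = names(mf_fit$ximp), error = mf_fit$OOBerror)
  mean(errs$error[errs$varname %in% vars])
}

key_var_error(missForest_v4, key_vars)   # could be recomputed for each candidate mtry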
For the purposes of this document I am happy to proceed with maxiter = 3
and mtry = 9
.
Increasing ntree
:
Reducing ntree
will generally increase the imputation error. Since we are only running three iterations, let’s instead try increasing ntree
quite significantly to 1000
:
registerDoRNG(seed = 1)
missForest_v5 <- missForest(X_missing, parallelize = 'forests', verbose = T, maxiter = 3, mtry = 9, ntree = 1000)
## parallelizing computation of the random forest model objects
## missForest iteration 1 in progress...done!
## estimated error(s): 0.4916868 0.3267124
## difference(s): 0.03673848 0.06035714
## time: 22.14 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.4351275 0.3086136
## difference(s): 0.003046984 0.01578571
## time: 21.33 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.4368263 0.3003819
## difference(s): 0.0006723632 0.007428571
## time: 20.37 seconds
Results:
Running the default algorithm gave the following OOB errors:
missForest_v1$OOBerror
## NRMSE PFC
## 0.465116 0.315166
The end result was successful in decreasing the estimated imputation error, giving the following OOB errors:
missForest_v5$OOBerror
## NRMSE PFC
## 0.4368263 0.3003819