Nonparametric missing value imputation using Random Forest

In this presentation we will learn how to impute missing values with the {missForest} package. The package trains a random forest on the observed values of a data matrix and uses it to predict the missing values. It can impute continuous data, categorical data, or a mix of both.

https://cran.r-project.org/web/packages/missForest/missForest.pdf
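Before turning to the diabetes data, the basic call pattern can be sketched on a small built-in dataset. This is only an illustration: the choice of `iris`, the seed, and the object names (`iris_na`, `fit`, `iris_imp`) are assumptions made for this sketch and are not part of the original walkthrough. `prodNA()` ships with {missForest} and removes a given share of cells at random.

```r
library(missForest)

set.seed(42)                          # assumed seed, only to make the sketch repeatable
iris_na <- prodNA(iris, noNA = 0.1)   # set about 10% of all cells to NA

fit <- missForest(iris_na)            # iterative random-forest imputation

names(fit)                            # components of the returned list
iris_imp <- fit$ximp                  # the completed data frame
sapply(iris_imp, class)               # numeric columns stay numeric, Species stays a factor
```

The imputed data live in the `ximp` element, which is why the steps below extract `$ximp` from the result.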

Here are the steps to impute missing values in a dataset:

  1. Load the libraries.
library(tidyverse)   # for almost everything 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dlookr)      # for exploratory data analysis and imputation
## Registered S3 methods overwritten by 'dlookr':
##   method          from  
##   plot.transform  scales
##   print.transform scales
## 
## Attaching package: 'dlookr'
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## The following object is masked from 'package:base':
## 
##     transform
library(visdat)      # for visualizing NAs
library(missRanger)  # to generate NAs
library(missForest)  # for imputation
  1. Load the dataset.
library(mlbench)
data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
  1. Use the {missRanger} package to set 8% of the values in each variable of ‘PimaIndiansDiabetes’ to missing, and save the result as ‘PimaIndiansDiabetes_NA’.
library(missRanger)
set.seed(111)
PimaIndiansDiabetes_NA <- generateNA(PimaIndiansDiabetes, p = 0.08) %>% 
  mutate(diabetes = factor(diabetes))
  1. Visualize the NAs in the ‘PimaIndiansDiabetes_NA’ dataset.
library(visdat)
vis_miss(PimaIndiansDiabetes_NA)

# Also, view NAs using {dlookr} package
library(dlookr)      
plot_na_pareto(PimaIndiansDiabetes_NA, only_na = TRUE)
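As a numeric complement to the plots, the NAs can also be counted per variable with base R. The sketch below rebuilds the dataset under the hypothetical name `pima_na` so that it runs on its own; the tutorial's own object is ‘PimaIndiansDiabetes_NA’.

```r
library(mlbench)
library(missRanger)

data(PimaIndiansDiabetes)
set.seed(111)
pima_na <- generateNA(PimaIndiansDiabetes, p = 0.08)   # same call as in the step above

na_count <- colSums(is.na(pima_na))                    # number of NAs per variable
na_share <- round(100 * colMeans(is.na(pima_na)), 1)   # percentage of NAs per variable

data.frame(na_count, na_share_pct = na_share)
```

Each variable should show a share close to the requested 8%.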

  1. Print the first 50 rows of the ‘PimaIndiansDiabetes_NA’ dataset.
print(as_tibble(PimaIndiansDiabetes_NA), n = 50)
## # A tibble: 768 × 9
##    pregnant glucose pressure triceps insulin  mass pedigree   age diabetes
##       <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl> <fct>   
##  1        6     148       72      35       0  33.6    0.627    50 pos     
##  2        1      85       66      29       0  26.6    0.351    31 neg     
##  3        8     183       64      NA       0  23.3    0.672    32 pos     
##  4        1      89       66      23      94  28.1    0.167    NA neg     
##  5        0     137       40      35      NA  43.1    2.29     33 pos     
##  6        5     116       NA       0       0  25.6    0.201    30 neg     
##  7        3      78       50      32      88  31      0.248    26 pos     
##  8       10     115        0      NA      NA  35.3    0.134    29 neg     
##  9        2     197       70      45     543  30.5   NA        53 pos     
## 10        8     125       96       0       0   0     NA        54 pos     
## 11        4     110       92       0       0  NA      0.191    30 neg     
## 12       10     168       NA       0       0  38      0.537    34 <NA>    
## 13       10      NA       80       0       0  27.1    1.44     57 neg     
## 14        1     189       60      23     846  30.1    0.398    59 <NA>    
## 15        5     166       72      19     175  25.8    0.587    51 pos     
## 16        7     100        0      NA       0  30      0.484    32 pos     
## 17        0     118       84      47     230  45.8    0.551    31 pos     
## 18        7     107       74       0       0  29.6    0.254    31 <NA>    
## 19        1     103       NA      38      83  43.3   NA        33 neg     
## 20        1     115       70      30      96  34.6    0.529    32 <NA>    
## 21        3     126       88      41     235  39.3    0.704    27 neg     
## 22        8      99       84       0       0  35.4   NA        50 <NA>    
## 23        7     196       90       0       0  39.8    0.451    41 pos     
## 24        9     119       80      35       0  29      0.263    29 pos     
## 25       NA     143       94      33     146  36.6    0.254    51 pos     
## 26       10     125       70      NA     115  31.1    0.205    41 pos     
## 27       NA     147       76       0       0  39.4    0.257    43 pos     
## 28        1      97       66      15     140  23.2    0.487    22 neg     
## 29       13     145       82      19     110  22.2    0.245    57 neg     
## 30        5     117       92      NA       0  34.1    0.337    38 neg     
## 31        5     109       75      26       0  36      0.546    60 neg     
## 32        3     158       76      36     245  31.6   NA        NA pos     
## 33        3      88       58      11      54  24.8    0.267    22 neg     
## 34        6      92       92       0       0  19.9   NA        28 neg     
## 35       10     122       78      31       0  27.6    0.512    45 neg     
## 36        4     103       60      33     192  24      0.966    33 neg     
## 37       11     138       76       0       0  33.2    0.42     35 <NA>    
## 38       NA     102       76      NA       0  NA      0.665    46 pos     
## 39        2      90       68      42       0  38.2    0.503    27 pos     
## 40       NA     111       72      47     207  37.1    1.39     56 pos     
## 41        3     180       64      25      70  NA      0.271    26 neg     
## 42       NA     133       84       0       0  40.2   NA        37 neg     
## 43        7     106       92      18       0  22.7    0.235    48 neg     
## 44        9     171      110      24     240  45.4    0.721    54 pos     
## 45       NA     159       64       0       0  27.4    0.294    40 neg     
## 46        0     180       66      39       0  42      1.89     25 pos     
## 47        1     146       56       0       0  29.7    0.564    29 neg     
## 48        2      71       NA      27      NA  NA      0.586    22 neg     
## 49        7     103       66      32       0  39.1   NA        31 pos     
## 50        7      NA        0      NA      NA   0      0.305    24 neg     
## # ℹ 718 more rows

Perform imputation using the {missForest} package.

  1. Summarize the ‘PimaIndiansDiabetes_NA’ dataset to see the number of NAs in each variable.
summary(PimaIndiansDiabetes_NA)
##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.836   Mean   :120.9   Mean   : 69.29   Mean   :20.72  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##  NA's   :61       NA's   :61      NA's   :61       NA's   :61     
##     insulin           mass          pedigree           age       diabetes  
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.0   neg :462  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2445   1st Qu.:24.0   pos :245  
##  Median : 37.0   Median :32.00   Median :0.3780   Median :29.0   NA's: 61  
##  Mean   : 81.7   Mean   :31.98   Mean   :0.4747   Mean   :33.3             
##  3rd Qu.:130.0   3rd Qu.:36.45   3rd Qu.:0.6265   3rd Qu.:40.0             
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.0             
##  NA's   :61      NA's   :61      NA's   :61       NA's   :61
  1. Impute the ‘PimaIndiansDiabetes_NA’ dataset and save the imputed dataset as ‘imp_PimaIndiansDiabetes’.
library(missForest)
imp_PimaIndiansDiabetes <- missForest(PimaIndiansDiabetes_NA)$ximp
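The call above keeps only `$ximp`, but the object returned by `missForest()` also carries an out-of-bag error estimate. Because the NAs here were generated artificially, the complete data can additionally be passed through the `xtrue` argument to obtain the true imputation error. The following is a sketch under those assumptions; `pima_na` and `fit` are hypothetical names, and the exact numbers will vary with the seed and package version.

```r
library(mlbench)
library(missRanger)
library(missForest)

data(PimaIndiansDiabetes)
set.seed(111)
pima_na <- generateNA(PimaIndiansDiabetes, p = 0.08)

# Keep the whole result and supply the complete data as the ground truth
fit <- missForest(pima_na, xtrue = PimaIndiansDiabetes)

fit$OOBerror   # estimated error: NRMSE (continuous) and PFC (categorical)
fit$error      # true error against xtrue, in the same two measures
```

For both measures, values closer to 0 indicate better imputation. Comparing `OOBerror` with `error` gives a rough sense of how far the out-of-bag estimate can be trusted when no ground truth is available.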
  1. After imputation, summarize the ‘imp_PimaIndiansDiabetes’ dataset to confirm that no variable contains NAs.
summary(imp_PimaIndiansDiabetes)
##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.854   Mean   :120.7   Mean   : 69.26   Mean   :20.66  
##  3rd Qu.: 6.000   3rd Qu.:140.3   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin            mass          pedigree           age        diabetes 
##  Min.   :  0.00   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   neg:507  
##  1st Qu.:  0.00   1st Qu.:27.40   1st Qu.:0.2537   1st Qu.:24.00   pos:261  
##  Median : 44.76   Median :32.00   Median :0.3925   Median :29.00            
##  Mean   : 82.01   Mean   :32.00   Mean   :0.4751   Mean   :33.31            
##  3rd Qu.:130.00   3rd Qu.:36.15   3rd Qu.:0.6122   3rd Qu.:40.00            
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00
  1. Print the first 50 rows of the ‘imp_PimaIndiansDiabetes’ dataset.
print(as_tibble(imp_PimaIndiansDiabetes), n = 50)
## # A tibble: 768 × 9
##    pregnant glucose pressure triceps insulin  mass pedigree   age diabetes
##       <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl> <fct>   
##  1     6      148       72     35        0    33.6    0.627  50   pos     
##  2     1       85       66     29        0    26.6    0.351  31   neg     
##  3     8      183       64      9.19     0    23.3    0.672  32   pos     
##  4     1       89       66     23       94    28.1    0.167  24.5 neg     
##  5     0      137       40     35      294.   43.1    2.29   33   pos     
##  6     5      116       65.5    0        0    25.6    0.201  30   neg     
##  7     3       78       50     32       88    31      0.248  26   pos     
##  8    10      115        0     22.6    101.   35.3    0.134  29   neg     
##  9     2      197       70     45      543    30.5    1.06   53   pos     
## 10     8      125       96      0        0     0      0.405  54   pos     
## 11     4      110       92      0        0    31.8    0.191  30   neg     
## 12    10      168       72.7    0        0    38      0.537  34   pos     
## 13    10      128.      80      0        0    27.1    1.44   57   neg     
## 14     1      189       60     23      846    30.1    0.398  59   neg     
## 15     5      166       72     19      175    25.8    0.587  51   pos     
## 16     7      100        0      5.32     0    30      0.484  32   pos     
## 17     0      118       84     47      230    45.8    0.551  31   pos     
## 18     7      107       74      0        0    29.6    0.254  31   neg     
## 19     1      103       74.0   38       83    43.3    0.654  33   neg     
## 20     1      115       70     30       96    34.6    0.529  32   neg     
## 21     3      126       88     41      235    39.3    0.704  27   neg     
## 22     8       99       84      0        0    35.4    0.353  50   neg     
## 23     7      196       90      0        0    39.8    0.451  41   pos     
## 24     9      119       80     35        0    29      0.263  29   pos     
## 25     7.78   143       94     33      146    36.6    0.254  51   pos     
## 26    10      125       70     30.6    115    31.1    0.205  41   pos     
## 27     5.82   147       76      0        0    39.4    0.257  43   pos     
## 28     1       97       66     15      140    23.2    0.487  22   neg     
## 29    13      145       82     19      110    22.2    0.245  57   neg     
## 30     5      117       92     12.1      0    34.1    0.337  38   neg     
## 31     5      109       75     26        0    36      0.546  60   neg     
## 32     3      158       76     36      245    31.6    0.544  29.8 pos     
## 33     3       88       58     11       54    24.8    0.267  22   neg     
## 34     6       92       92      0        0    19.9    0.309  28   neg     
## 35    10      122       78     31        0    27.6    0.512  45   neg     
## 36     4      103       60     33      192    24      0.966  33   neg     
## 37    11      138       76      0        0    33.2    0.42   35   pos     
## 38     6.87   102       76     16.2      0    30.0    0.665  46   pos     
## 39     2       90       68     42        0    38.2    0.503  27   pos     
## 40     7.38   111       72     47      207    37.1    1.39   56   pos     
## 41     3      180       64     25       70    32.1    0.271  26   neg     
## 42     5.10   133       84      0        0    40.2    0.530  37   neg     
## 43     7      106       92     18        0    22.7    0.235  48   neg     
## 44     9      171      110     24      240    45.4    0.721  54   pos     
## 45     5.93   159       64      0        0    27.4    0.294  40   neg     
## 46     0      180       66     39        0    42      1.89   25   pos     
## 47     1      146       56      0        0    29.7    0.564  29   neg     
## 48     2       71       64.6   27       53.0  31.4    0.586  22   neg     
## 49     7      103       66     32        0    39.1    0.553  31   pos     
## 50     7       94.9      0     13.8     70.5   0      0.305  24   neg     
## # ℹ 718 more rows
  1. Lastly, visualize the imputed dataset to confirm that no NAs remain.
library(visdat)      
vis_miss(imp_PimaIndiansDiabetes)
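The same check can be made without a plot. The sketch below, which uses the hypothetical names `pima_na` and `pima_imp` so that it runs on its own, confirms that no NAs remain and that the values observed before imputation were left untouched in the numeric columns.

```r
library(mlbench)
library(missRanger)
library(missForest)

data(PimaIndiansDiabetes)
set.seed(111)
pima_na  <- generateNA(PimaIndiansDiabetes, p = 0.08)
pima_imp <- missForest(pima_na)$ximp

anyNA(pima_imp)              # FALSE when every NA has been filled in
colSums(is.na(pima_imp))     # remaining NAs per variable

# For each numeric column, compare the cells that were observed before imputation
num_cols <- names(pima_na)[sapply(pima_na, is.numeric)]
observed_kept <- sapply(num_cols, function(col) {
  seen <- !is.na(pima_na[[col]])
  all(pima_imp[[col]][seen] == pima_na[[col]][seen])
})
observed_kept
```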

Congratulations! You imputed missing values using the {missForest} package.

A.M.D.G.