In this presentation we will learn how to impute missing values using the {missForest} package. This package uses a random forest trained on the observed values of a data matrix to predict the missing values. It can be used to impute continuous and/or categorical data.
Package reference manual: https://cran.r-project.org/web/packages/missForest/missForest.pdf
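To make the idea of "a random forest trained on the observed values" concrete, here is a minimal sketch of a single imputation pass for one variable. It is an illustration under stated assumptions, not the package's actual implementation: it uses the {randomForest} package directly, the built-in `airquality` data instead of the dataset used later, and a plain column-mean fill as the starting guess. {missForest} itself cycles over every variable with missing values and repeats the passes until the imputations stop improving.

```r
library(randomForest)  # assumption: installed; used here only for illustration

set.seed(1)
df <- airquality               # built-in data; Ozone and Solar.R contain NAs
target <- "Ozone"              # the single variable imputed in this sketch
miss <- is.na(df[[target]])    # rows where the target is missing

# Step 1: crude initial guess, fill every NA with its column mean
filled <- df
for (v in names(filled)) {
  filled[[v]][is.na(filled[[v]])] <- mean(filled[[v]], na.rm = TRUE)
}

# Step 2: train a forest on the rows where the target was observed
predictors <- setdiff(names(filled), target)
rf <- randomForest(x = filled[!miss, predictors],
                   y = df[[target]][!miss],
                   ntree = 100)

# Step 3: replace the initial guess with the forest's predictions
filled[[target]][miss] <- predict(rf, newdata = filled[miss, predictors])

head(filled[miss, ])
```

Observed values are never changed; only the cells that were `NA` receive predictions.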
The steps below load the required packages, add missing values to a complete dataset, inspect them, and then impute them:
library(tidyverse) # for almost everything
library(dlookr) # for exploratory data analysis and imputation
library(visdat) # for visualizing NAs
library(missRanger) # to generate NAs
library(missForest) # for imputation
library(mlbench) # for the PimaIndiansDiabetes dataset
data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
library(missRanger)
set.seed(111) # make the generated NA pattern reproducible
# Replace 8% of the values in every column with NA
PimaIndiansDiabetes_NA <- generateNA(PimaIndiansDiabetes, p = 0.08) %>%
  mutate(diabetes = factor(diabetes))
library(visdat)
vis_miss(PimaIndiansDiabetes_NA)
# Also view the NAs with the {dlookr} package
library(dlookr)
plot_na_pareto(PimaIndiansDiabetes_NA, only_na = TRUE)
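The plots give a visual overview; the same information can be checked numerically with base R. The sketch below rebuilds the data with missing values so that it runs on its own, and `pima_na` is just an illustrative name for that copy.

```r
library(mlbench)     # PimaIndiansDiabetes
library(missRanger)  # generateNA()

data(PimaIndiansDiabetes)
set.seed(111)
pima_na <- generateNA(PimaIndiansDiabetes, p = 0.08)  # illustrative copy

# Number and share of missing values in each column
na_count <- colSums(is.na(pima_na))
na_share <- round(colMeans(is.na(pima_na)), 3)
data.frame(na_count, na_share)
```

With `p = 0.08`, each column should show a share of missing values close to 8%.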
print(as_tibble(PimaIndiansDiabetes_NA), n = 50)
## # A tibble: 768 × 9
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 NA 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 NA neg
## 5 0 137 40 35 NA 43.1 2.29 33 pos
## 6 5 116 NA 0 0 25.6 0.201 30 neg
## 7 3 78 50 32 88 31 0.248 26 pos
## 8 10 115 0 NA NA 35.3 0.134 29 neg
## 9 2 197 70 45 543 30.5 NA 53 pos
## 10 8 125 96 0 0 0 NA 54 pos
## 11 4 110 92 0 0 NA 0.191 30 neg
## 12 10 168 NA 0 0 38 0.537 34 <NA>
## 13 10 NA 80 0 0 27.1 1.44 57 neg
## 14 1 189 60 23 846 30.1 0.398 59 <NA>
## 15 5 166 72 19 175 25.8 0.587 51 pos
## 16 7 100 0 NA 0 30 0.484 32 pos
## 17 0 118 84 47 230 45.8 0.551 31 pos
## 18 7 107 74 0 0 29.6 0.254 31 <NA>
## 19 1 103 NA 38 83 43.3 NA 33 neg
## 20 1 115 70 30 96 34.6 0.529 32 <NA>
## 21 3 126 88 41 235 39.3 0.704 27 neg
## 22 8 99 84 0 0 35.4 NA 50 <NA>
## 23 7 196 90 0 0 39.8 0.451 41 pos
## 24 9 119 80 35 0 29 0.263 29 pos
## 25 NA 143 94 33 146 36.6 0.254 51 pos
## 26 10 125 70 NA 115 31.1 0.205 41 pos
## 27 NA 147 76 0 0 39.4 0.257 43 pos
## 28 1 97 66 15 140 23.2 0.487 22 neg
## 29 13 145 82 19 110 22.2 0.245 57 neg
## 30 5 117 92 NA 0 34.1 0.337 38 neg
## 31 5 109 75 26 0 36 0.546 60 neg
## 32 3 158 76 36 245 31.6 NA NA pos
## 33 3 88 58 11 54 24.8 0.267 22 neg
## 34 6 92 92 0 0 19.9 NA 28 neg
## 35 10 122 78 31 0 27.6 0.512 45 neg
## 36 4 103 60 33 192 24 0.966 33 neg
## 37 11 138 76 0 0 33.2 0.42 35 <NA>
## 38 NA 102 76 NA 0 NA 0.665 46 pos
## 39 2 90 68 42 0 38.2 0.503 27 pos
## 40 NA 111 72 47 207 37.1 1.39 56 pos
## 41 3 180 64 25 70 NA 0.271 26 neg
## 42 NA 133 84 0 0 40.2 NA 37 neg
## 43 7 106 92 18 0 22.7 0.235 48 neg
## 44 9 171 110 24 240 45.4 0.721 54 pos
## 45 NA 159 64 0 0 27.4 0.294 40 neg
## 46 0 180 66 39 0 42 1.89 25 pos
## 47 1 146 56 0 0 29.7 0.564 29 neg
## 48 2 71 NA 27 NA NA 0.586 22 neg
## 49 7 103 66 32 0 39.1 NA 31 pos
## 50 7 NA 0 NA NA 0 0.305 24 neg
## # ℹ 718 more rows
summary(PimaIndiansDiabetes_NA)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.836 Mean :120.9 Mean : 69.29 Mean :20.72
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :61 NA's :61 NA's :61 NA's :61
## insulin mass pedigree age diabetes
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.0 neg :462
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2445 1st Qu.:24.0 pos :245
## Median : 37.0 Median :32.00 Median :0.3780 Median :29.0 NA's: 61
## Mean : 81.7 Mean :31.98 Mean :0.4747 Mean :33.3
## 3rd Qu.:130.0 3rd Qu.:36.45 3rd Qu.:0.6265 3rd Qu.:40.0
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.0
## NA's :61 NA's :61 NA's :61 NA's :61
library(missForest)
# missForest() returns a list; its $ximp element is the imputed data frame
imp_PimaIndiansDiabetes <- missForest(PimaIndiansDiabetes_NA)$ximp
summary(imp_PimaIndiansDiabetes)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.854 Mean :120.7 Mean : 69.26 Mean :20.66
## 3rd Qu.: 6.000 3rd Qu.:140.3 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin mass pedigree age diabetes
## Min. : 0.00 Min. : 0.00 Min. :0.0780 Min. :21.00 neg:507
## 1st Qu.: 0.00 1st Qu.:27.40 1st Qu.:0.2537 1st Qu.:24.00 pos:261
## Median : 44.76 Median :32.00 Median :0.3925 Median :29.00
## Mean : 82.01 Mean :32.00 Mean :0.4751 Mean :33.31
## 3rd Qu.:130.00 3rd Qu.:36.15 3rd Qu.:0.6122 3rd Qu.:40.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
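Comparing summaries shows that the NAs are gone, but not how accurate the imputed values are. Because the missing values in this tutorial were generated artificially, the complete data are still available and can be passed to `missForest()` through its `xtrue` argument. The sketch below is one way to do that; it rebuilds the data so that it runs on its own, `pima_na` and `fit` are illustrative names, and the error values will vary with the random seed.

```r
library(mlbench)     # PimaIndiansDiabetes
library(missRanger)  # generateNA()
library(missForest)  # missForest()

data(PimaIndiansDiabetes)
set.seed(111)
pima_na <- generateNA(PimaIndiansDiabetes, p = 0.08)

# xtrue can be supplied here only because the NAs were added artificially
fit <- missForest(pima_na, xtrue = PimaIndiansDiabetes)

fit$OOBerror  # out-of-bag estimate: NRMSE (numeric columns), PFC (factor columns)
fit$error     # the same two measures, computed against the true values
```

With real data the true values are unknown, so only `OOBerror` is available. NRMSE is the normalized root mean squared error and PFC is the proportion of falsely classified entries; for both, values closer to 0 indicate better imputation.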
print(as_tibble(imp_PimaIndiansDiabetes), n = 50)
## # A tibble: 768 × 9
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 9.19 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 24.5 neg
## 5 0 137 40 35 294. 43.1 2.29 33 pos
## 6 5 116 65.5 0 0 25.6 0.201 30 neg
## 7 3 78 50 32 88 31 0.248 26 pos
## 8 10 115 0 22.6 101. 35.3 0.134 29 neg
## 9 2 197 70 45 543 30.5 1.06 53 pos
## 10 8 125 96 0 0 0 0.405 54 pos
## 11 4 110 92 0 0 31.8 0.191 30 neg
## 12 10 168 72.7 0 0 38 0.537 34 pos
## 13 10 128. 80 0 0 27.1 1.44 57 neg
## 14 1 189 60 23 846 30.1 0.398 59 neg
## 15 5 166 72 19 175 25.8 0.587 51 pos
## 16 7 100 0 5.32 0 30 0.484 32 pos
## 17 0 118 84 47 230 45.8 0.551 31 pos
## 18 7 107 74 0 0 29.6 0.254 31 neg
## 19 1 103 74.0 38 83 43.3 0.654 33 neg
## 20 1 115 70 30 96 34.6 0.529 32 neg
## 21 3 126 88 41 235 39.3 0.704 27 neg
## 22 8 99 84 0 0 35.4 0.353 50 neg
## 23 7 196 90 0 0 39.8 0.451 41 pos
## 24 9 119 80 35 0 29 0.263 29 pos
## 25 7.78 143 94 33 146 36.6 0.254 51 pos
## 26 10 125 70 30.6 115 31.1 0.205 41 pos
## 27 5.82 147 76 0 0 39.4 0.257 43 pos
## 28 1 97 66 15 140 23.2 0.487 22 neg
## 29 13 145 82 19 110 22.2 0.245 57 neg
## 30 5 117 92 12.1 0 34.1 0.337 38 neg
## 31 5 109 75 26 0 36 0.546 60 neg
## 32 3 158 76 36 245 31.6 0.544 29.8 pos
## 33 3 88 58 11 54 24.8 0.267 22 neg
## 34 6 92 92 0 0 19.9 0.309 28 neg
## 35 10 122 78 31 0 27.6 0.512 45 neg
## 36 4 103 60 33 192 24 0.966 33 neg
## 37 11 138 76 0 0 33.2 0.42 35 pos
## 38 6.87 102 76 16.2 0 30.0 0.665 46 pos
## 39 2 90 68 42 0 38.2 0.503 27 pos
## 40 7.38 111 72 47 207 37.1 1.39 56 pos
## 41 3 180 64 25 70 32.1 0.271 26 neg
## 42 5.10 133 84 0 0 40.2 0.530 37 neg
## 43 7 106 92 18 0 22.7 0.235 48 neg
## 44 9 171 110 24 240 45.4 0.721 54 pos
## 45 5.93 159 64 0 0 27.4 0.294 40 neg
## 46 0 180 66 39 0 42 1.89 25 pos
## 47 1 146 56 0 0 29.7 0.564 29 neg
## 48 2 71 64.6 27 53.0 31.4 0.586 22 neg
## 49 7 103 66 32 0 39.1 0.553 31 pos
## 50 7 94.9 0 13.8 70.5 0 0.305 24 neg
## # ℹ 718 more rows
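Note that the imputed values for `pregnant` are fractional (for example 7.78 in row 25), because the column is stored as numeric and the forest predicts an average. If whole numbers are needed, one possible post-processing step is to round such columns. The sketch below uses three rows copied from the output above; which columns count as whole-number columns is an assumption you would adapt to your own data.

```r
# Values copied from rows 25, 27 and 38 of the imputed output
imputed <- data.frame(
  pregnant = c(7.78, 5.82, 6.87),
  age      = c(51, 43, 46)
)

# Assumption: these columns should only hold whole numbers
count_cols <- c("pregnant", "age")
imputed[count_cols] <- lapply(imputed[count_cols], round)

imputed$pregnant  # 8 6 7
```

Rounding is a convenience for presentation; whether it is appropriate depends on how the imputed data will be analysed.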
library(visdat)
vis_miss(imp_PimaIndiansDiabetes)
Congratulations! You imputed missing values using the {missForest} package.
A.M.D.G.