Hello there!

# 1 What is this all about?

In this short tutorial you will learn how to impute missing (NA) data using a package called missForest. It uses random forests to predict the values of your missing data from the data in the other variables.

First of all, install and load the gapminder and missForest packages (and some other tools too).
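If you are not sure which of these packages you already have, a small check like the one below can help. It is only a sketch: the package list is taken from the library() calls in this tutorial, and the install.packages() line is left commented out so nothing is installed without you deciding to.

```r
# Packages used in this tutorial
pkgs <- c("missForest", "gapminder", "tidyverse", "skimr")

# requireNamespace() returns FALSE (instead of failing) when a package is absent
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them from CRAN
} else {
  message("All packages are installed.")
}
```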

knitr::opts_chunk$set(echo = TRUE)
library(missForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
library(gapminder)
library(tidyverse)
## -- Attaching packages ------------------
## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -- tidyverse_conflicts() --
## x purrr::accumulate() masks foreach::accumulate()
## x dplyr::combine()    masks randomForest::combine()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
## x ggplot2::margin()   masks randomForest::margin()
## x purrr::when()       masks foreach::when()
library(skimr)

# 3 A simple example

Let’s start with a simple example. I have created three vectors, one of which deliberately has one missing value.

variable_1 <- 1:10
variable_2 <- c(11:14, NA, 16:20)
variable_3 <- 21:30
variable_1
##  [1]  1  2  3  4  5  6  7  8  9 10
variable_2
##  [1] 11 12 13 14 NA 16 17 18 19 20
variable_3
##  [1] 21 22 23 24 25 26 27 28 29 30

Let’s create a data frame with those vectors; you can clearly see where the missing data is.

dat_miss <- data.frame(variable_1, variable_2, variable_3)
dat_miss

By looking at the data we could clearly estimate that the missing value is variable_2=15. Let’s use the function missForest() to impute the missing data.

dat_impo <- missForest(dat_miss)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
class(dat_impo)
## [1] "missForest"

As you can see, dat_impo is an object of the class "missForest". To get the new data frame we should do the following:

new_data <- dat_impo$ximp
new_data

As you can see, the imputed value is variable_2=14.44. Not too bad, hey! We were expecting 15. Now let’s try to use this function for a more complex problem.
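To get a feeling for what happens under the hood, here is a rough sketch of a single imputation step done by hand with the randomForest package: train a forest on the rows where variable_2 is known, then predict it for the row where it is missing. This is a simplification and not the actual missForest code. As far as I understand it, missForest starts from a simple initial guess and repeats this kind of step for every variable with gaps over several iterations. The names observed, predictors and manual_imputation are made up for this sketch.

```r
library(randomForest)

# Same toy data as in the example above
dat_miss <- data.frame(
  variable_1 = 1:10,
  variable_2 = c(11:14, NA, 16:20),
  variable_3 = 21:30
)

set.seed(1)  # forests are random, so fix the seed for reproducibility

observed   <- !is.na(dat_miss$variable_2)
predictors <- c("variable_1", "variable_3")

# Train on the rows where variable_2 is known ...
rf <- randomForest(
  x = dat_miss[observed, predictors],
  y = dat_miss$variable_2[observed],
  ntree = 100
)

# ... and predict it for the row where it is missing
manual_imputation <- predict(rf, newdata = dat_miss[!observed, predictors])
manual_imputation
```

A forest prediction is an average of observed values, so the result always lies between the smallest and largest known variable_2. That is also why the imputed value lands near 15 rather than exactly on it.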

# 4 A not so simple example

First, let’s have a look at the gapminder data set. We will build a table of GDP per capita for every survey year between 1980 and 2010, with one column per year, leaving out continent, population and life expectancy.

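head(gapminder)
gapm <- gapminder %>%
  select(!c("continent", "pop", "lifeExp")) %>%
  filter(year %in% c(1980:2010)) %>%
  # reshape to one row per country and one GDP per capita column per year
  pivot_wider(names_from = year, values_from = gdpPercap)
head(gapm)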
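If you wonder which years survive the filter, this base R sketch reproduces the condition without loading any package. It assumes what the gapminder data contains: one observation per country every five years from 1952 to 2007. The names survey_years and kept_years are made up for the sketch.

```r
# gapminder reports one observation per country every five years
survey_years <- seq(1952, 2007, by = 5)

# Same condition as in filter(year %in% c(1980:2010))
kept_years <- survey_years[survey_years %in% 1980:2010]
kept_years
# 1982 1987 1992 1997 2002 2007
```

So six years remain, which is why the table ends up with six GDP per capita columns.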

Imagine that we have lost 10% of our values! Let’s also remove the country names, and then use prodNA() to delete values at random.

set.seed(110)
gapm <- gapm[, -1]                   # drop the country column
gap.mis <- prodNA(gapm, noNA = 0.1)  # set 10% of the cells to NA at random
gap.mis <- as.data.frame(gap.mis)
head(gap.mis)
head(gapm)
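To make it clearer what "losing 10% of the values" means, here is a base R sketch of the idea. The function knock_out_cells() is a hypothetical helper written for this illustration, not part of missForest. I am assuming that prodNA() works along the same lines, picking cells completely at random over the whole table.

```r
# Hypothetical helper, written for illustration; not part of missForest
knock_out_cells <- function(x, fraction = 0.1) {
  n_cells <- nrow(x) * ncol(x)
  n_drop  <- floor(n_cells * fraction)

  # Pick cell positions at random, ignoring which row or column they sit in
  drop_mask <- rep(FALSE, n_cells)
  drop_mask[sample(n_cells, n_drop)] <- TRUE

  x[matrix(drop_mask, nrow = nrow(x), ncol = ncol(x))] <- NA
  x
}

set.seed(110)
toy <- data.frame(a = 1:10, b = 11:20, c = 21:30)
toy_mis <- knock_out_cells(toy, fraction = 0.1)

sum(is.na(toy_mis))      # 3 of the 30 cells are now NA
colSums(is.na(toy_mis))  # how they spread over the columns is left to chance
```

Because the cells are drawn over the whole table, the number of gaps per column is not fixed. Only the total is.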

Let’s double check the missing data. As you can see, on average about 14 GDP per capita values are missing for each year (85 in total, which is 10% of the 852 cells).

skim(gap.mis)
|                        |         |
|:-----------------------|:--------|
| Name                   | gap.mis |
| Number of rows         | 142     |
| Number of columns      | 6       |
| Column type frequency: |         |
| numeric                | 6       |
| Group variables        | None    |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:--------------|----------:|--------------:|-----:|---:|---:|----:|----:|----:|-----:|:-----|
| 1982 | 8 | 0.94 | 7361.07 | 7713.25 | 424.00 | 1313.30 | 4176.26 | 11652.52 | 33693.18 | ▇▂▂▁▁ |
| 1987 | 17 | 0.88 | 8008.55 | 8361.39 | 389.88 | 1361.94 | 4314.11 | 12037.27 | 31540.97 | ▇▂▁▁▁ |
| 1992 | 12 | 0.92 | 7943.65 | 8895.86 | 347.00 | 1219.89 | 4264.57 | 9604.72 | 34932.92 | ▇▁▁▂▁ |
| 1997 | 16 | 0.89 | 9223.21 | 10157.66 | 312.19 | 1366.84 | 5015.81 | 14073.69 | 41283.16 | ▇▂▁▁▁ |
| 2002 | 16 | 0.89 | 10386.03 | 11397.13 | 241.17 | 1647.27 | 5764.15 | 14542.65 | 44683.98 | ▇▁▁▂▁ |
| 2007 | 16 | 0.89 | 12115.81 | 13063.96 | 277.55 | 1624.84 | 7049.75 | 19166.11 | 49357.19 | ▇▂▁▂▁ |
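The n_missing and complete_rate columns can be checked with a few lines of arithmetic. The numbers below are copied from the skim() output, and 142 is the number of rows it reports. The variable names are made up for this sketch.

```r
# n_missing column copied from the skim() table above
n_missing <- c(`1982` = 8, `1987` = 17, `1992` = 12,
               `1997` = 16, `2002` = 16, `2007` = 16)

total_missing <- sum(n_missing)             # 85
mean_missing  <- mean(n_missing)            # about 14.2 per year
share_missing <- total_missing / (142 * 6)  # about 0.0998, i.e. roughly 10%

# complete_rate is the share of non-missing rows in each column
complete_rate <- round(1 - n_missing / 142, 2)
complete_rate
```

The rounded rates come out as 0.94, 0.88, 0.92, 0.89, 0.89 and 0.89, matching the table.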

Let’s impute the missing values using the default parameters of the missForest() function. Passing the complete data as xtrue makes missForest also report the true imputation error.

set.seed(110)
gap.inp <- missForest(gap.mis, xtrue = gapm)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
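The defaults are a good starting point, but a few arguments are worth knowing. The sketch below uses a small synthetic data set (made up for this example, not the gapminder data) to show ntree, maxiter and variablewise. The values chosen are arbitrary, so treat them as something to experiment with rather than as recommendations.

```r
library(missForest)

set.seed(110)

# Small synthetic data set (made up for this sketch): three correlated columns
x1 <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = 2 * x1 + rnorm(100, sd = 0.1),
                  x3 = -x1 + rnorm(100, sd = 0.1))
toy_mis <- prodNA(toy, noNA = 0.1)

toy_imp <- missForest(toy_mis,
                      ntree = 50,          # trees per forest (default 100)
                      maxiter = 5,         # upper limit on iterations (default 10)
                      variablewise = TRUE) # report one OOB error per column

toy_imp$OOBerror     # one value per column instead of a single NRMSE
anyNA(toy_imp$ximp)  # FALSE: every gap has been filled
```

With variablewise = TRUE you can see which columns are hard to impute, which a single overall error would hide.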

Again, let’s look at the new table.

head(gap.inp$ximp)

How big are the errors of the imputation?

gap.inp$OOBerror
##     NRMSE
## 0.1648565
gap.inp$error
##     NRMSE
## 0.1660809

Not too bad: both the out-of-bag estimate and the true error are around 16 to 17%. NRMSE is the normalized root mean squared error, defined as:

$\sqrt{\frac{\mathrm{mean}\left((X_{true} - X_{imp})^2\right)}{\mathrm{var}(X_{true})}}$

Check by yourself where the missing data is, what it was replaced with, and how similar the imputed values are to the original data!

head(gapm)
head(gap.mis)
head(gap.inp$ximp)
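The NRMSE definition can be reproduced by hand on a tiny example. The helper nrmse_by_hand() is hypothetical and written only for this sketch. I am assuming, as I understand the package, that the error is evaluated only on the entries that were missing. The numbers are made up so that the arithmetic is easy to follow.

```r
# Hypothetical helper for illustration; missForest computes this internally
nrmse_by_hand <- function(x_imp, x_mis, x_true) {
  gaps <- is.na(x_mis)  # only the cells that were missing are scored
  sqrt(mean((x_true[gaps] - x_imp[gaps])^2) / var(x_true[gaps]))
}

x_true <- c(10, 20, 30, 40, 50)
x_mis  <- c(10, NA, 30, NA, 50)
x_imp  <- c(10, 22, 30, 37, 50)  # made-up imputations for the two gaps

nrmse_by_hand(x_imp, x_mis, x_true)
# squared errors: 4 and 9, mean 6.5; var(c(20, 40)) = 200
# sqrt(6.5 / 200) is about 0.18
```

Dividing by the variance of the true values puts the error on a relative scale: a value near 0 means the imputations are close to the truth, while a value near 1 means they are about as far off as the true values vary among themselves.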