Hello there!

1 What is this all about?

In this short tutorial you will learn how to input NA data using a package called missForest. it uses random forests to input the values of your missing data based on data of other variables.

2 Load packages

First of all download and open the gapminder and missForest packages (and some other tools too)

## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
## -- Attaching packages ------------------
## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -- tidyverse_conflicts() --
## x purrr::accumulate() masks foreach::accumulate()
## x dplyr::combine()    masks randomForest::combine()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
## x ggplot2::margin()   masks randomForest::margin()
## x purrr::when()       masks foreach::when()

3 A simple example

Let’s start with a simple example, I have created three vectors, one which deliberately has one missing value.

##  [1]  1  2  3  4  5  6  7  8  9 10
##  [1] 11 12 13 14 NA 16 17 18 19 20
##  [1] 21 22 23 24 25 26 27 28 29 30

Let’s create a data-frame with those vectors, you can clearly see where the missing data is

By looking at the data we could clearly estimate that the missing value is variable_2=15. Let’s use the function missForest() to input the missing data.

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
## [1] "missForest"

As you can see dat_impo becomes an object of the class "missForest". In order to get the new data frame we should do the following:

As you can see the input value is variable_2=14.44 not too bad hey! we were expecting 15. Now let’s try to use this function for a more complex problem.

4 A not so simple Example

First let’s have a look at the gapminder data-set and let’s create a data set for all years for which we have GDP per capita data between 1980 and 2010… also let’s not consider population and life expectancy

Imagine that we have lost 10% of our values! let’s remove the name of the countries too

Let’s double check the missing data, as you can see on average we have 17 GDP per capita missing values for each year.

Data summary
Name gap.mis
Number of rows 142
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1982 8 0.94 7361.07 7713.25 424.00 1313.30 4176.26 11652.52 33693.18 ▇▂▂▁▁
1987 17 0.88 8008.55 8361.39 389.88 1361.94 4314.11 12037.27 31540.97 ▇▂▁▁▁
1992 12 0.92 7943.65 8895.86 347.00 1219.89 4264.57 9604.72 34932.92 ▇▁▁▂▁
1997 16 0.89 9223.21 10157.66 312.19 1366.84 5015.81 14073.69 41283.16 ▇▂▁▁▁
2002 16 0.89 10386.03 11397.13 241.17 1647.27 5764.15 14542.65 44683.98 ▇▁▁▂▁
2007 16 0.89 12115.81 13063.96 277.55 1624.84 7049.75 19166.11 49357.19 ▇▂▁▂▁

Let’s impute missing values, using all the default parameters of the missForest function

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!

Again lets look at the new table

How big are the errors of the imputation?

##     NRMSE 
## 0.1648565
##     NRMSE 
## 0.1660809

Not too bad, around 16%. NRME is The normalized root mean squared error, it is defined as:

\[\sqrt\frac{mean((Xtrue−Ximp)^2}{var(Xtrue))}\]

Check by yourself where the missing data is and what was it replaced with! moreover how similar it is compared with the original data!

5 References

LS0tDQp0aXRsZTogIlVzaW5nIHRoZSBtaXNzRm9yZXN0IHBhY2thZ2UiDQphdXRob3I6ICJGZWxpcGUgU2FudG9zLU1hcnF1ZXosIFF1YVJDUyBsYWIsIEdTSUQsIE5hZ295YSBVbml2ZXJzaXR5Ig0Kb3V0cHV0OiANCiAgaHRtbF9kb2N1bWVudDoNCiAgICBjb2RlX2Rvd25sb2FkOiB0cnVlDQogICAgZGZfcHJpbnQ6IHBhZ2VkDQogICAgdG9jOiB0cnVlDQogICAgdG9jX2Zsb2F0Og0KICAgICAgY29sbGFwc2VkOiBmYWxzZQ0KICAgICAgc21vb3RoX3Njcm9sbDogZmFsc2UNCiAgICB0b2NfZGVwdGg6IDQNCiAgICBudW1iZXJfc2VjdGlvbnM6IHRydWUNCiAgICBjb2RlX2ZvbGRpbmc6ICJzaG93Ig0KICAgIHRoZW1lOiAiY29zbW8iDQogICAgaGlnaGxpZ2h0OiAibW9ub2Nocm9tZSINCiAgZ2l0aHViX2RvY3VtZW50OiBkZWZhdWx0DQogIHBkZl9kb2N1bWVudDogZGVmYXVsdA0KICB3b3JkX2RvY3VtZW50OiBkZWZhdWx0DQogIGh0bWxfbm90ZWJvb2s6DQogICAgdG9jOiB0cnVlDQogICAgdG9jX2Zsb2F0Og0KICAgICAgY29sbGFwc2VkOiBmYWxzZQ0KICAgICAgc21vb3RoX3Njcm9sbDogZmFsc2UNCiAgICB0b2NfZGVwdGg6IDQNCiAgICBudW1iZXJfc2VjdGlvbnM6IHRydWUNCiAgICBjb2RlX2ZvbGRpbmc6ICJoaWRlIg0KICAgIHRoZW1lOiAiY29zbW8iDQogICAgaGlnaGxpZ2h0OiAibW9ub2Nocm9tZSINCiAgICBkZl9wcmludDogImthYmxlIg0KLS0tDQoNCjxzdHlsZT4NCmgxLnRpdGxlIHtmb250LXNpemU6IDE4cHQ7IGNvbG9yOiBEYXJrQmx1ZTt9IA0KYm9keSwgaDEsIGgyLCBoMywgaDQge2ZvbnQtZmFtaWx5OiAiUGFsYXRpbm8iLCBzZXJpZjt9DQpib2R5IHtmb250LXNpemU6IDEycHQ7fQ0KLyogSGVhZGVycyAqLw0KaDEsaDIsaDMsaDQsaDUsaDZ7Zm9udC1zaXplOiAxNHB0OyBjb2xvcjogIzAwMDA4Qjt9DQpib2R5IHtjb2xvcjogIzMzMzMzMzt9DQphLCBhOmhvdmVyIHtjb2xvcjogIzhCM0E2Mjt9DQpwcmUge2ZvbnQtc2l6ZTogMTJweDt9DQo8L3N0eWxlPg0KDQoNCkhlbGxvIHRoZXJlISANCg0KIyBXaGF0IGlzIHRoaXMgYWxsIGFib3V0Pw0KDQpJbiB0aGlzIHNob3J0IHR1dG9yaWFsIHlvdSB3aWxsIGxlYXJuIGhvdyB0byBpbnB1dCBgTkFgIGRhdGEgdXNpbmcgYSBwYWNrYWdlIGNhbGxlZCBgbWlzc0ZvcmVzdGAuIGl0IHVzZXMgcmFuZG9tIGZvcmVzdHMgdG8gaW5wdXQgdGhlIHZhbHVlcyBvZiB5b3VyIG1pc3NpbmcgZGF0YSBiYXNlZCBvbiBkYXRhIG9mIG90aGVyIHZhcmlhYmxlcy4NCg0KIyBMb2FkIHBhY2thZ2VzIA0KDQpGaXJzdCBvZiBhbGwgZG93bmxvYWQgYW5kIG9wZW4gdGhlIGBnYXBtaW5kZXJgIGFuZCBgbWlzc0ZvcmVzdGAgcGFja2FnZXMgKGFuZCBzb21lIG90aGVyIHRvb2xzIHRvbykNCg0KYGBge3J9DQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUpDQpsaWJyYXJ5KG1pc3NGb3Jlc3QpDQpsaWJyYXJ5KGdhcG1pbmRlcikNCmxpYnJhcnkodGlkeXZlcnNlKQ0KbGlicmFyeShza2ltcikNCmBgYA0KDQojIEEgc2ltcGxlIGV4YW1wbGUNCg0KTGV0J3Mgc3RhcnQgd2l0aCBhIHNpbXBsZSBleGFtcGxlLCBJIGhhdmUgY3JlYXRlZCB0aHJlZSB2ZWN0b3JzLCBvbmUgd2hpY2ggZGVsaWJlcmF0ZWx5IGhhcyBvbmUgbWlzc2luZyB2YWx1ZS4gDQoNCmBgYHtyfQ0KdmFyaWFibGVfMT1jKDE6MTApDQp2YXJpYWJsZV8yPWMoYygxMToxNCksIE5BLCBjKDE2OjIwKSkNCnZhcmlhYmxlXzM9YygyMTozMCkNCnZhcmlhYmxlXzENCnZhcmlhYmxlXzINCnZhcmlhYmxlXzMNCmBgYA0KDQpMZXQncyBjcmVhdGUgYSBkYXRhLWZyYW1lIHdpdGggdGhvc2UgdmVjdG9ycywgeW91IGNhbiBjbGVhcmx5IHNlZSB3aGVyZSB0aGUgbWlzc2luZyBkYXRhIGlzDQoNCmBgYHtyfQ0KZGF0X21pc3M8LSBkYXRhLmZyYW1lKHZhcmlhYmxlXzEsdmFyaWFibGVfMix2YXJpYWJsZV8zKQ0KZGF0X21pc3MNCmBgYA0KDQpCeSBsb29raW5nIGF0IHRoZSBkYXRhIHdlIGNvdWxkIGNsZWFybHkgZXN0aW1hdGUgdGhhdCB0aGUgbWlzc2luZyB2YWx1ZSBpcyBgdmFyaWFibGVfMj0xNWAuIExldCdzIHVzZSB0aGUgZnVuY3Rpb24gYG1pc3NGb3Jlc3QoKWAgdG8gaW5wdXQgdGhlIG1pc3NpbmcgZGF0YS4NCg0KYGBge3J9DQpkYXRfaW1wbyA8LSBtaXNzRm9yZXN0KGRhdF9taXNzKQ0KY2xhc3MoZGF0X2ltcG8pDQpgYGANCg0KQXMgeW91IGNhbiBzZWUgYGRhdF9pbXBvYCBiZWNvbWVzIGFuIG9iamVjdCBvZiB0aGUgY2xhc3MgYCJtaXNzRm9yZXN0ImAuIEluIG9yZGVyIHRvIGdldCB0aGUgbmV3IGRhdGEgZnJhbWUgd2Ugc2hvdWxkIGRvIHRoZSBmb2xsb3dpbmc6DQoNCmBgYHtyfQ0KbmV3X2RhdGEgPC0gZGF0X2ltcG8keGltcA0KbmV3X2RhdGENCmBgYA0KDQpBcyB5b3UgY2FuIHNlZSB0aGUgaW5wdXQgdmFsdWUgaXMgYHZhcmlhYmxlXzI9MTQuNDRgIG5vdCB0b28gYmFkIGhleSEgd2Ugd2VyZSBleHBlY3RpbmcgYDE1YC4gTm93IGxldCdzIHRyeSB0byB1c2UgdGhpcyBmdW5jdGlvbiBmb3IgYSBtb3JlIGNvbXBsZXggcHJvYmxlbS4NCg0KIyBBIG5vdCBzbyBzaW1wbGUgRXhhbXBsZQ0KDQpGaXJzdCBsZXQncyBoYXZlIGEgbG9vayBhdCB0aGUgZ2FwbWluZGVyIGRhdGEtc2V0IGFuZCBsZXQncyBjcmVhdGUgYSBkYXRhIHNldCBmb3IgYWxsIHllYXJzIGZvciB3aGljaCB3ZSBoYXZlIEdEUCBwZXIgY2FwaXRhIGRhdGEgYmV0d2VlbiAxOTgwIGFuZCAyMDEwLi4uIGFsc28gbGV0J3Mgbm90IGNvbnNpZGVyIHBvcHVsYXRpb24gYW5kIGxpZmUgZXhwZWN0YW5jeQ0KDQpgYGB7cn0NCmhlYWQoZ2FwbWluZGVyKQ0KZ2FwbSA8LSBnYXBtaW5kZXIgJT4lIA0KICBzZWxlY3QoIWMoImNvbnRpbmVudCIsInBvcCIsICJsaWZlRXhwIikpICU+JSANCiAgZmlsdGVyKHllYXIgJWluJSBjKDE5ODA6MjAxMCkpICU+JSANCiAgc3ByZWFkKHllYXIsIGdkcFBlcmNhcCkNCmhlYWQoZ2FwbSkNCmBgYA0KDQoNCkltYWdpbmUgdGhhdCB3ZSBoYXZlIGxvc3QgMTAlIG9mIG91ciB2YWx1ZXMhIGxldCdzIHJlbW92ZSB0aGUgbmFtZSBvZiB0aGUgY291bnRyaWVzIHRvbw0KDQoNCmBgYHtyfQ0Kc2V0LnNlZWQoMTEwKQ0KZ2FwbTwtIGdhcG1bLC0xXQ0KZ2FwLm1pcyA8LSBwcm9kTkEoZ2FwbSwgbm9OQSA9IDAuMSkNCmdhcC5taXMgPC0gYXMuZGF0YS5mcmFtZShnYXAubWlzKQ0KaGVhZChnYXAubWlzKQ0KaGVhZChnYXBtKQ0KYGBgDQoNCkxldCdzIGRvdWJsZSBjaGVjayB0aGUgbWlzc2luZyBkYXRhLCBhcyB5b3UgY2FuIHNlZSBvbiBhdmVyYWdlIHdlIGhhdmUgMTcgR0RQIHBlciBjYXBpdGEgbWlzc2luZyB2YWx1ZXMgZm9yIGVhY2ggeWVhci4NCg0KYGBge3J9DQpza2ltKGdhcC5taXMpDQpgYGANCg0KDQpMZXQncyAgaW1wdXRlIG1pc3NpbmcgdmFsdWVzLCB1c2luZyBhbGwgdGhlIGRlZmF1bHQgcGFyYW1ldGVycyBvZiB0aGUgYG1pc3NGb3Jlc3RgIGZ1bmN0aW9uDQoNCmBgYHtyfQ0Kc2V0LnNlZWQoMTEwKQ0KZ2FwLmlucCA8LSBtaXNzRm9yZXN0KGdhcC5taXMsIHh0cnVlID0gZ2FwbSkNCmBgYA0KDQpBZ2FpbiBsZXRzIGxvb2sgYXQgdGhlIG5ldyB0YWJsZQ0KDQpgYGB7cn0NCmhlYWQoZ2FwLmlucCR4aW1wKQ0KYGBgDQoNCg0KSG93IGJpZyBhcmUgdGhlIGVycm9ycyBvZiB0aGUgaW1wdXRhdGlvbj8NCg0KYGBge3J9DQpnYXAuaW5wJE9PQmVycm9yDQpnYXAuaW5wJGVycm9yDQpgYGANCg0KTm90IHRvbyBiYWQsIGFyb3VuZCAxNiUuIE5STUUgaXMgVGhlIG5vcm1hbGl6ZWQgcm9vdCBtZWFuIHNxdWFyZWQgZXJyb3IsIGl0ICBpcyBkZWZpbmVkIGFzOg0KDQokJFxzcXJ0XGZyYWN7bWVhbigoWHRydWXiiJJYaW1wKV4yfXt2YXIoWHRydWUpKX0kJA0KIA0KQ2hlY2sgYnkgeW91cnNlbGYgd2hlcmUgdGhlIG1pc3NpbmcgZGF0YSBpcyBhbmQgd2hhdCB3YXMgaXQgcmVwbGFjZWQgd2l0aCEgbW9yZW92ZXIgaG93IHNpbWlsYXIgaXQgaXMgY29tcGFyZWQgd2l0aCB0aGUgb3JpZ2luYWwgZGF0YSEgDQogDQpgYGB7cn0NCmhlYWQoZ2FwbSkNCmhlYWQoZ2FwLm1pcykNCmhlYWQoZ2FwLmlucCR4aW1wKQ0KYGBgDQoNCg0KIyBSZWZlcmVuY2VzDQoNCi0gaHR0cHM6Ly93d3cucmRvY3VtZW50YXRpb24ub3JnL3BhY2thZ2VzL21pc3NGb3Jlc3QvdmVyc2lvbnMvMS40L3RvcGljcy9taXNzRm9yZXN0DQoNCi0gaHR0cHM6Ly93d3cuYW5hbHl0aWNzdmlkaHlhLmNvbS9ibG9nLzIwMTYvMDMvdHV0b3JpYWwtcG93ZXJmdWwtcGFja2FnZXMtaW1wdXRpbmctbWlzc2luZy12YWx1ZXMvDQoNCi0gU3Rla2hvdmVuLCBELkouIGFuZCBCdWVobG1hbm4sIFAuICgyMDEyKSwgJ01pc3NGb3Jlc3QgLSBub25wYXJhbWV0cmljIG1pc3NpbmcgdmFsdWUgaW1wdXRhdGlvbiBmb3IgbWl4ZWQtdHlwZSBkYXRhJywgQmlvaW5mb3JtYXRpY3MsIDI4KDEpIDIwMTIsIDExMi0xMTgsIGRvaTogMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHI1OTcNCg==