Hello there!
What is this all about?
In this short tutorial you will learn how to input NA
data using a package called missForest
. it uses random forests to input the values of your missing data based on data of other variables.
Load packages
First of all download and open the gapminder
and missForest
packages (and some other tools too)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
## -- Attaching packages ------------------
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -- tidyverse_conflicts() --
## x purrr::accumulate() masks foreach::accumulate()
## x dplyr::combine() masks randomForest::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x ggplot2::margin() masks randomForest::margin()
## x purrr::when() masks foreach::when()
A simple example
Let’s start with a simple example, I have created three vectors, one which deliberately has one missing value.
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 11 12 13 14 NA 16 17 18 19 20
## [1] 21 22 23 24 25 26 27 28 29 30
Let’s create a data-frame with those vectors, you can clearly see where the missing data is
By looking at the data we could clearly estimate that the missing value is variable_2=15
. Let’s use the function missForest()
to input the missing data.
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## [1] "missForest"
As you can see dat_impo
becomes an object of the class "missForest"
. In order to get the new data frame we should do the following:
As you can see the input value is variable_2=14.44
not too bad hey! we were expecting 15
. Now let’s try to use this function for a more complex problem.
A not so simple Example
First let’s have a look at the gapminder data-set and let’s create a data set for all years for which we have GDP per capita data between 1980 and 2010… also let’s not consider population and life expectancy
Imagine that we have lost 10% of our values! let’s remove the name of the countries too
Let’s double check the missing data, as you can see on average we have 17 GDP per capita missing values for each year.
Data summary
Name |
gap.mis |
Number of rows |
142 |
Number of columns |
6 |
_______________________ |
|
Column type frequency: |
|
numeric |
6 |
________________________ |
|
Group variables |
None |
Variable type: numeric
1982 |
8 |
0.94 |
7361.07 |
7713.25 |
424.00 |
1313.30 |
4176.26 |
11652.52 |
33693.18 |
▇▂▂▁▁ |
1987 |
17 |
0.88 |
8008.55 |
8361.39 |
389.88 |
1361.94 |
4314.11 |
12037.27 |
31540.97 |
▇▂▁▁▁ |
1992 |
12 |
0.92 |
7943.65 |
8895.86 |
347.00 |
1219.89 |
4264.57 |
9604.72 |
34932.92 |
▇▁▁▂▁ |
1997 |
16 |
0.89 |
9223.21 |
10157.66 |
312.19 |
1366.84 |
5015.81 |
14073.69 |
41283.16 |
▇▂▁▁▁ |
2002 |
16 |
0.89 |
10386.03 |
11397.13 |
241.17 |
1647.27 |
5764.15 |
14542.65 |
44683.98 |
▇▁▁▂▁ |
2007 |
16 |
0.89 |
12115.81 |
13063.96 |
277.55 |
1624.84 |
7049.75 |
19166.11 |
49357.19 |
▇▂▁▂▁ |
Let’s impute missing values, using all the default parameters of the missForest
function
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
Again lets look at the new table
How big are the errors of the imputation?
## NRMSE
## 0.1648565
## NRMSE
## 0.1660809
Not too bad, around 16%. NRME is The normalized root mean squared error, it is defined as:
\[\sqrt\frac{mean((Xtrue−Ximp)^2}{var(Xtrue))}\]
Check by yourself where the missing data is and what was it replaced with! moreover how similar it is compared with the original data!
LS0tDQp0aXRsZTogIlVzaW5nIHRoZSBtaXNzRm9yZXN0IHBhY2thZ2UiDQphdXRob3I6ICJGZWxpcGUgU2FudG9zLU1hcnF1ZXosIFF1YVJDUyBsYWIsIEdTSUQsIE5hZ295YSBVbml2ZXJzaXR5Ig0Kb3V0cHV0OiANCiAgaHRtbF9kb2N1bWVudDoNCiAgICBjb2RlX2Rvd25sb2FkOiB0cnVlDQogICAgZGZfcHJpbnQ6IHBhZ2VkDQogICAgdG9jOiB0cnVlDQogICAgdG9jX2Zsb2F0Og0KICAgICAgY29sbGFwc2VkOiBmYWxzZQ0KICAgICAgc21vb3RoX3Njcm9sbDogZmFsc2UNCiAgICB0b2NfZGVwdGg6IDQNCiAgICBudW1iZXJfc2VjdGlvbnM6IHRydWUNCiAgICBjb2RlX2ZvbGRpbmc6ICJzaG93Ig0KICAgIHRoZW1lOiAiY29zbW8iDQogICAgaGlnaGxpZ2h0OiAibW9ub2Nocm9tZSINCiAgZ2l0aHViX2RvY3VtZW50OiBkZWZhdWx0DQogIHBkZl9kb2N1bWVudDogZGVmYXVsdA0KICB3b3JkX2RvY3VtZW50OiBkZWZhdWx0DQogIGh0bWxfbm90ZWJvb2s6DQogICAgdG9jOiB0cnVlDQogICAgdG9jX2Zsb2F0Og0KICAgICAgY29sbGFwc2VkOiBmYWxzZQ0KICAgICAgc21vb3RoX3Njcm9sbDogZmFsc2UNCiAgICB0b2NfZGVwdGg6IDQNCiAgICBudW1iZXJfc2VjdGlvbnM6IHRydWUNCiAgICBjb2RlX2ZvbGRpbmc6ICJoaWRlIg0KICAgIHRoZW1lOiAiY29zbW8iDQogICAgaGlnaGxpZ2h0OiAibW9ub2Nocm9tZSINCiAgICBkZl9wcmludDogImthYmxlIg0KLS0tDQoNCjxzdHlsZT4NCmgxLnRpdGxlIHtmb250LXNpemU6IDE4cHQ7IGNvbG9yOiBEYXJrQmx1ZTt9IA0KYm9keSwgaDEsIGgyLCBoMywgaDQge2ZvbnQtZmFtaWx5OiAiUGFsYXRpbm8iLCBzZXJpZjt9DQpib2R5IHtmb250LXNpemU6IDEycHQ7fQ0KLyogSGVhZGVycyAqLw0KaDEsaDIsaDMsaDQsaDUsaDZ7Zm9udC1zaXplOiAxNHB0OyBjb2xvcjogIzAwMDA4Qjt9DQpib2R5IHtjb2xvcjogIzMzMzMzMzt9DQphLCBhOmhvdmVyIHtjb2xvcjogIzhCM0E2Mjt9DQpwcmUge2ZvbnQtc2l6ZTogMTJweDt9DQo8L3N0eWxlPg0KDQoNCkhlbGxvIHRoZXJlISANCg0KIyBXaGF0IGlzIHRoaXMgYWxsIGFib3V0Pw0KDQpJbiB0aGlzIHNob3J0IHR1dG9yaWFsIHlvdSB3aWxsIGxlYXJuIGhvdyB0byBpbnB1dCBgTkFgIGRhdGEgdXNpbmcgYSBwYWNrYWdlIGNhbGxlZCBgbWlzc0ZvcmVzdGAuIGl0IHVzZXMgcmFuZG9tIGZvcmVzdHMgdG8gaW5wdXQgdGhlIHZhbHVlcyBvZiB5b3VyIG1pc3NpbmcgZGF0YSBiYXNlZCBvbiBkYXRhIG9mIG90aGVyIHZhcmlhYmxlcy4NCg0KIyBMb2FkIHBhY2thZ2VzIA0KDQpGaXJzdCBvZiBhbGwgZG93bmxvYWQgYW5kIG9wZW4gdGhlIGBnYXBtaW5kZXJgIGFuZCBgbWlzc0ZvcmVzdGAgcGFja2FnZXMgKGFuZCBzb21lIG90aGVyIHRvb2xzIHRvbykNCg0KYGBge3J9DQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUpDQpsaWJyYXJ5KG1pc3NGb3Jlc3QpDQpsaWJyYXJ5KGdhcG1pbmRlcikNCmxpYnJhcnkodGlkeXZlcnNlKQ0KbGlicmFyeShza2ltcikNCmBgYA0KDQojIEEgc2ltcGxlIGV4YW1wbGUNCg0KTGV0J3Mgc3RhcnQgd2l0aCBhIHNpbXBsZSBleGFtcGxlLCBJIGhhdmUgY3JlYXRlZCB0aHJlZSB2ZWN0b3JzLCBvbmUgd2hpY2ggZGVsaWJlcmF0ZWx5IGhhcyBvbmUgbWlzc2luZyB2YWx1ZS4gDQoNCmBgYHtyfQ0KdmFyaWFibGVfMT1jKDE6MTApDQp2YXJpYWJsZV8yPWMoYygxMToxNCksIE5BLCBjKDE2OjIwKSkNCnZhcmlhYmxlXzM9YygyMTozMCkNCnZhcmlhYmxlXzENCnZhcmlhYmxlXzINCnZhcmlhYmxlXzMNCmBgYA0KDQpMZXQncyBjcmVhdGUgYSBkYXRhLWZyYW1lIHdpdGggdGhvc2UgdmVjdG9ycywgeW91IGNhbiBjbGVhcmx5IHNlZSB3aGVyZSB0aGUgbWlzc2luZyBkYXRhIGlzDQoNCmBgYHtyfQ0KZGF0X21pc3M8LSBkYXRhLmZyYW1lKHZhcmlhYmxlXzEsdmFyaWFibGVfMix2YXJpYWJsZV8zKQ0KZGF0X21pc3MNCmBgYA0KDQpCeSBsb29raW5nIGF0IHRoZSBkYXRhIHdlIGNvdWxkIGNsZWFybHkgZXN0aW1hdGUgdGhhdCB0aGUgbWlzc2luZyB2YWx1ZSBpcyBgdmFyaWFibGVfMj0xNWAuIExldCdzIHVzZSB0aGUgZnVuY3Rpb24gYG1pc3NGb3Jlc3QoKWAgdG8gaW5wdXQgdGhlIG1pc3NpbmcgZGF0YS4NCg0KYGBge3J9DQpkYXRfaW1wbyA8LSBtaXNzRm9yZXN0KGRhdF9taXNzKQ0KY2xhc3MoZGF0X2ltcG8pDQpgYGANCg0KQXMgeW91IGNhbiBzZWUgYGRhdF9pbXBvYCBiZWNvbWVzIGFuIG9iamVjdCBvZiB0aGUgY2xhc3MgYCJtaXNzRm9yZXN0ImAuIEluIG9yZGVyIHRvIGdldCB0aGUgbmV3IGRhdGEgZnJhbWUgd2Ugc2hvdWxkIGRvIHRoZSBmb2xsb3dpbmc6DQoNCmBgYHtyfQ0KbmV3X2RhdGEgPC0gZGF0X2ltcG8keGltcA0KbmV3X2RhdGENCmBgYA0KDQpBcyB5b3UgY2FuIHNlZSB0aGUgaW5wdXQgdmFsdWUgaXMgYHZhcmlhYmxlXzI9MTQuNDRgIG5vdCB0b28gYmFkIGhleSEgd2Ugd2VyZSBleHBlY3RpbmcgYDE1YC4gTm93IGxldCdzIHRyeSB0byB1c2UgdGhpcyBmdW5jdGlvbiBmb3IgYSBtb3JlIGNvbXBsZXggcHJvYmxlbS4NCg0KIyBBIG5vdCBzbyBzaW1wbGUgRXhhbXBsZQ0KDQpGaXJzdCBsZXQncyBoYXZlIGEgbG9vayBhdCB0aGUgZ2FwbWluZGVyIGRhdGEtc2V0IGFuZCBsZXQncyBjcmVhdGUgYSBkYXRhIHNldCBmb3IgYWxsIHllYXJzIGZvciB3aGljaCB3ZSBoYXZlIEdEUCBwZXIgY2FwaXRhIGRhdGEgYmV0d2VlbiAxOTgwIGFuZCAyMDEwLi4uIGFsc28gbGV0J3Mgbm90IGNvbnNpZGVyIHBvcHVsYXRpb24gYW5kIGxpZmUgZXhwZWN0YW5jeQ0KDQpgYGB7cn0NCmhlYWQoZ2FwbWluZGVyKQ0KZ2FwbSA8LSBnYXBtaW5kZXIgJT4lIA0KICBzZWxlY3QoIWMoImNvbnRpbmVudCIsInBvcCIsICJsaWZlRXhwIikpICU+JSANCiAgZmlsdGVyKHllYXIgJWluJSBjKDE5ODA6MjAxMCkpICU+JSANCiAgc3ByZWFkKHllYXIsIGdkcFBlcmNhcCkNCmhlYWQoZ2FwbSkNCmBgYA0KDQoNCkltYWdpbmUgdGhhdCB3ZSBoYXZlIGxvc3QgMTAlIG9mIG91ciB2YWx1ZXMhIGxldCdzIHJlbW92ZSB0aGUgbmFtZSBvZiB0aGUgY291bnRyaWVzIHRvbw0KDQoNCmBgYHtyfQ0Kc2V0LnNlZWQoMTEwKQ0KZ2FwbTwtIGdhcG1bLC0xXQ0KZ2FwLm1pcyA8LSBwcm9kTkEoZ2FwbSwgbm9OQSA9IDAuMSkNCmdhcC5taXMgPC0gYXMuZGF0YS5mcmFtZShnYXAubWlzKQ0KaGVhZChnYXAubWlzKQ0KaGVhZChnYXBtKQ0KYGBgDQoNCkxldCdzIGRvdWJsZSBjaGVjayB0aGUgbWlzc2luZyBkYXRhLCBhcyB5b3UgY2FuIHNlZSBvbiBhdmVyYWdlIHdlIGhhdmUgMTcgR0RQIHBlciBjYXBpdGEgbWlzc2luZyB2YWx1ZXMgZm9yIGVhY2ggeWVhci4NCg0KYGBge3J9DQpza2ltKGdhcC5taXMpDQpgYGANCg0KDQpMZXQncyAgaW1wdXRlIG1pc3NpbmcgdmFsdWVzLCB1c2luZyBhbGwgdGhlIGRlZmF1bHQgcGFyYW1ldGVycyBvZiB0aGUgYG1pc3NGb3Jlc3RgIGZ1bmN0aW9uDQoNCmBgYHtyfQ0Kc2V0LnNlZWQoMTEwKQ0KZ2FwLmlucCA8LSBtaXNzRm9yZXN0KGdhcC5taXMsIHh0cnVlID0gZ2FwbSkNCmBgYA0KDQpBZ2FpbiBsZXRzIGxvb2sgYXQgdGhlIG5ldyB0YWJsZQ0KDQpgYGB7cn0NCmhlYWQoZ2FwLmlucCR4aW1wKQ0KYGBgDQoNCg0KSG93IGJpZyBhcmUgdGhlIGVycm9ycyBvZiB0aGUgaW1wdXRhdGlvbj8NCg0KYGBge3J9DQpnYXAuaW5wJE9PQmVycm9yDQpnYXAuaW5wJGVycm9yDQpgYGANCg0KTm90IHRvbyBiYWQsIGFyb3VuZCAxNiUuIE5STUUgaXMgVGhlIG5vcm1hbGl6ZWQgcm9vdCBtZWFuIHNxdWFyZWQgZXJyb3IsIGl0ICBpcyBkZWZpbmVkIGFzOg0KDQokJFxzcXJ0XGZyYWN7bWVhbigoWHRydWXiiJJYaW1wKV4yfXt2YXIoWHRydWUpKX0kJA0KIA0KQ2hlY2sgYnkgeW91cnNlbGYgd2hlcmUgdGhlIG1pc3NpbmcgZGF0YSBpcyBhbmQgd2hhdCB3YXMgaXQgcmVwbGFjZWQgd2l0aCEgbW9yZW92ZXIgaG93IHNpbWlsYXIgaXQgaXMgY29tcGFyZWQgd2l0aCB0aGUgb3JpZ2luYWwgZGF0YSEgDQogDQpgYGB7cn0NCmhlYWQoZ2FwbSkNCmhlYWQoZ2FwLm1pcykNCmhlYWQoZ2FwLmlucCR4aW1wKQ0KYGBgDQoNCg0KIyBSZWZlcmVuY2VzDQoNCi0gaHR0cHM6Ly93d3cucmRvY3VtZW50YXRpb24ub3JnL3BhY2thZ2VzL21pc3NGb3Jlc3QvdmVyc2lvbnMvMS40L3RvcGljcy9taXNzRm9yZXN0DQoNCi0gaHR0cHM6Ly93d3cuYW5hbHl0aWNzdmlkaHlhLmNvbS9ibG9nLzIwMTYvMDMvdHV0b3JpYWwtcG93ZXJmdWwtcGFja2FnZXMtaW1wdXRpbmctbWlzc2luZy12YWx1ZXMvDQoNCi0gU3Rla2hvdmVuLCBELkouIGFuZCBCdWVobG1hbm4sIFAuICgyMDEyKSwgJ01pc3NGb3Jlc3QgLSBub25wYXJhbWV0cmljIG1pc3NpbmcgdmFsdWUgaW1wdXRhdGlvbiBmb3IgbWl4ZWQtdHlwZSBkYXRhJywgQmlvaW5mb3JtYXRpY3MsIDI4KDEpIDIwMTIsIDExMi0xMTgsIGRvaTogMTAuMTA5My9iaW9pbmZvcm1hdGljcy9idHI1OTcNCg==