Missing data is a common problem in data analysis. This post walks through one practical approach: replacing every missing value in the well-known Pima Tribe diabetes dataset with a value imputed by a random forest model.
The Pima Tribe diabetes dataset can be obtained from the UCI Machine Learning Repository or from the R package faraway. It comes from a study by the National Institute of Diabetes and Digestive and Kidney Diseases of 768 adult female Pima Indians living near Phoenix, Arizona, USA. The nine columns of the data set are shown in the str() output below.
The dataset is 768 rows by 9 columns and has many hidden missing values: careful inspection reveals many zeros in columns where a zero is biologically impossible.
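# Install any required package that is not already available.
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("faraway", quietly = TRUE)) install.packages("faraway")
if (!requireNamespace("randomForest", quietly = TRUE)) install.packages("randomForest")
if (!requireNamespace("tibble", quietly = TRUE)) install.packages("tibble")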
library(dplyr)
library(faraway)
library(randomForest)
library(tibble)
pima <- tibble::as_tibble(faraway::pima)
str(pima)
## Classes 'tbl_df', 'tbl' and 'data.frame': 768 obs. of 9 variables:
## $ pregnant : int 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ diastolic: int 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : int 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ bmi : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ diabetes : num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ test : int 1 0 1 0 1 0 1 0 1 1 ...
The variable \(\text{test}\) is stored as an integer but records a binary outcome, so it should be coded as a factor.
pima$test <- as.factor(ifelse(pima$test == 0, "negative", "positive"))
Next, we look at the numerical summary of the dataset.
summary(pima)
## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin bmi diabetes age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## negative:500
## positive:268
From the numerical summary, we can see that zero is the minimum value for \(\text{glucose}\), \(\text{diastolic}\), \(\text{triceps}\), \(\text{insulin}\), and \(\text{bmi}\).
As mentioned before, these zeros are not real biological measurements. Unfortunately, it seems the researchers chose to encode missing data as zero.
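Before replacing anything, it can help to count how many zeros each column holds. The sketch below uses a small invented data frame (toy, not the real data) so that it runs with base R alone; on the real data, the same colSums() call could be applied to the five affected columns of pima.

```r
# Hypothetical toy data standing in for the five affected columns;
# the values are invented for illustration only.
toy <- data.frame(
  glucose   = c(148, 85, 0, 89),
  diastolic = c(72, 0, 64, 0),
  triceps   = c(35, 29, 0, 0),
  insulin   = c(0, 0, 0, 94),
  bmi       = c(33.6, 26.6, 23.3, 0)
)

# Comparing the data frame to 0 gives a logical matrix;
# colSums() then counts the TRUE entries in each column.
zero_counts <- colSums(toy == 0)
print(zero_counts)
```

Note that this count only works cleanly while the columns contain no NA values, which is the case before the replacement step.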
We replace these zeroes with NA values.
pima$glucose[ pima$glucose == 0 ] <- NA
pima$diastolic[ pima$diastolic == 0 ] <- NA
pima$triceps[ pima$triceps == 0 ] <- NA
pima$insulin[ pima$insulin == 0 ] <- NA
pima$bmi[ pima$bmi == 0 ] <- NA
sum(complete.cases(pima))
## [1] 392
sum(!complete.cases(pima))
## [1] 376
head(pima)
## # A tibble: 6 x 9
## pregnant glucose diastolic triceps insulin bmi diabetes age test
## <int> <int> <int> <int> <int> <dbl> <dbl> <int> <fct>
## 1 6 148 72 35 NA 33.6 0.627 50 positive
## 2 1 85 66 29 NA 26.6 0.351 31 negative
## 3 8 183 64 NA NA 23.3 0.672 32 positive
## 4 1 89 66 23 94 28.1 0.167 21 negative
## 5 0 137 40 35 168 43.1 2.29 33 positive
## 6 5 116 74 NA NA 25.6 0.201 30 negative
tail(pima)
## # A tibble: 6 x 9
## pregnant glucose diastolic triceps insulin bmi diabetes age test
## <int> <int> <int> <int> <int> <dbl> <dbl> <int> <fct>
## 1 9 89 62 NA NA 22.5 0.142 33 negative
## 2 10 101 76 48 180 32.9 0.171 63 negative
## 3 2 122 70 27 NA 36.8 0.34 27 negative
## 4 5 121 72 23 112 26.2 0.245 30 negative
## 5 1 126 60 NA NA 30.1 0.349 47 positive
## 6 1 93 70 31 NA 30.4 0.315 23 negative
summary(pima)
##     pregnant         glucose        diastolic         triceps
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##                   NA's   :5       NA's   :35       NA's   :227
##     insulin            bmi           diabetes           age
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00
##  NA's   :374      NA's   :11
##        test
##  negative:500
##  positive:268
The dataset now has 392 rows with complete data and 376 rows with some missing data values. There are 5 NA’s for \(\text{glucose}\), 35 NA’s for \(\text{diastolic}\), 227 NA’s for \(\text{triceps}\), 374 NA’s for \(\text{insulin}\), and 11 NA’s for \(\text{bmi}\).
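The five separate assignments used earlier could also be written as one vectorised step, and per-column NA counts like the ones quoted above can be checked with colSums(is.na(...)). A minimal sketch on an invented data frame follows; toy and zero_cols are illustrative names, not part of the original analysis.

```r
# Invented values for illustration only.
toy <- data.frame(
  pregnant = c(6, 1, 0, 1),      # zero is a legitimate value here
  glucose  = c(148, 85, 0, 89),
  insulin  = c(0, 0, 0, 94)
)

# Columns where zero encodes "missing"; pregnant is deliberately excluded.
zero_cols <- c("glucose", "insulin")

# Replace zeros with NA in the selected columns only.
toy[zero_cols] <- lapply(toy[zero_cols], function(x) replace(x, x == 0, NA))

na_counts  <- colSums(is.na(toy))        # NA tally per column
n_complete <- sum(complete.cases(toy))   # rows with no NA at all
print(na_counts)
print(n_complete)
```

Listing the affected columns explicitly keeps legitimate zeros, such as a pregnancy count of zero, from being turned into NA by accident.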
The function rfImpute(), in the R package randomForest, imputes missing values in predictor data using the proximity matrix from randomForest(). The algorithm starts with a rough fix from na.roughfix(), which fills each NA with the column median (numeric variables) or mode (factors). A random forest is then fit to the completed data, and its proximity matrix is used to update the imputed values: for a continuous predictor, the new value is the proximity-weighted average of the non-missing observations. This fit-and-update cycle is repeated iter times. rfImpute() requires a response variable without missing values; for the Pima Tribe diabetes dataset, we choose \(\text{test}\).
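# rfImpute() returns the response as the first column, so select()
# restores the original column order. Results vary from run to run
# unless a seed is set with set.seed() beforehand.
pima_imputed <- tibble::as_tibble(
  randomForest::rfImpute(test ~ ., ntree = 200, iter = 5, data = pima)
) %>%
  select(pregnant, glucose, diastolic, triceps,
         insulin, bmi, diabetes, age, test)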
## ntree OOB 1 2
## 200: 24.35% 16.00% 39.93%
## ntree OOB 1 2
## 200: 25.13% 16.40% 41.42%
## ntree OOB 1 2
## 200: 24.48% 16.60% 39.18%
## ntree OOB 1 2
## 200: 25.65% 17.80% 40.30%
## ntree OOB 1 2
## 200: 24.48% 15.60% 41.04%
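To make the two stages of the algorithm concrete, here is a base-R sketch of the idea for a single numeric column: a median rough fix followed by one proximity-weighted update. The proximity matrix prox is written by hand purely for illustration; in rfImpute() it comes from the fitted random forest, and the package's actual implementation may differ in its details.

```r
# Invented numeric column with one missing value (illustration only).
x <- c(10, 12, NA, 30, 32)
miss <- is.na(x)

# Stage 1: rough fix, replacing NA with the median of the observed values.
x_rough <- replace(x, miss, median(x, na.rm = TRUE))

# Hypothetical proximity matrix: entry [i, j] stands for the fraction of
# trees in which rows i and j fall in the same terminal node.
# Row 3 is made close to rows 4 and 5.
prox <- matrix(0.1, nrow = 5, ncol = 5)
diag(prox) <- 1
prox[3, 4] <- prox[4, 3] <- 0.8
prox[3, 5] <- prox[5, 3] <- 0.6

# Stage 2: re-impute each missing entry as the proximity-weighted
# average of the observed (non-missing) values.
x_updated <- x_rough
for (i in which(miss)) {
  w <- prox[i, !miss]
  x_updated[i] <- sum(w * x[!miss]) / sum(w)
}
print(x_rough)
print(x_updated)
```

In this toy case the rough fix places the missing value at the overall median, while the update pulls it toward the rows that the (invented) proximities mark as most similar.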
head(pima_imputed)
## # A tibble: 6 x 9
## pregnant glucose diastolic triceps insulin bmi diabetes age test
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <fct>
## 1 6 148 72 35 277. 33.6 0.627 50 positive
## 2 1 85 66 29 68.8 26.6 0.351 31 negative
## 3 8 183 64 29.8 255. 23.3 0.672 32 positive
## 4 1 89 66 23 94 28.1 0.167 21 negative
## 5 0 137 40 35 168 43.1 2.29 33 positive
## 6 5 116 74 18.7 83.0 25.6 0.201 30 negative
tail(pima_imputed)
## # A tibble: 6 x 9
## pregnant glucose diastolic triceps insulin bmi diabetes age test
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <fct>
## 1 9 89 62 20.5 81.2 22.5 0.142 33 negative
## 2 10 101 76 48 180 32.9 0.171 63 negative
## 3 2 122 70 27 132. 36.8 0.34 27 negative
## 4 5 121 72 23 112 26.2 0.245 30 negative
## 5 1 126 60 27.5 197. 30.1 0.349 47 positive
## 6 1 93 70 31 73.4 30.4 0.315 23 negative
summary(pima_imputed)
## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:21.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.6 Mean : 72.35 Mean :28.86
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:35.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin bmi diabetes age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 81.07 1st Qu.:27.40 1st Qu.:0.2437 1st Qu.:24.00
## Median :133.71 Median :32.15 Median :0.3725 Median :29.00
## Mean :154.51 Mean :32.41 Mean :0.4719 Mean :33.24
## 3rd Qu.:200.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## negative:500
## positive:268
nrow(pima_imputed[ complete.cases(pima_imputed) == TRUE, ])
## [1] 768
nrow(pima_imputed[ complete.cases(pima_imputed) != TRUE, ])
## [1] 0
After imputation, all the missing values have been filled in, and all 768 rows are complete.
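A further sanity check is to confirm that imputation changed only the cells that were missing. The helper below, check_imputation(), is a hypothetical name introduced here for illustration; it is sketched on invented vectors rather than on the real data.

```r
# Hypothetical helper: TRUE if `after` has no NAs and agrees with
# `before` wherever `before` was observed.
check_imputation <- function(before, after) {
  observed <- !is.na(before)
  !anyNA(after) && all(before[observed] == after[observed])
}

before <- c(148, NA, 183, 89)     # invented values
good   <- c(148, 120, 183, 89)    # only the NA cell was filled
bad    <- c(148, 120, 180, 89)    # an observed value was altered

print(check_imputation(before, good))
print(check_imputation(before, bad))
```

With the real data, the same check could be applied column by column to the matching columns of pima and pima_imputed.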