Introduction


Missing data is a common problem in data analysis. Here are some practical ways to handle missing values in a dataset:

  1. deleting observations (acceptable if you have plenty of data, a luxury that some data analysts don't have)
  2. deleting the variable (acceptable if you don't really need the variable, though more often than not you do)
  3. imputation of missing values with either means, medians or modes
  4. imputation of missing values using a predictive model trained with complete data, i.e. a subset of the original dataset with no missing values
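
The first three options can be sketched in a few lines of base R. This is a minimal illustration on a made-up data frame (the name toy and its values are hypothetical, not taken from the Pima data), using the median as the summary statistic for option 3.

```r
# Toy data frame with hypothetical values (not the Pima data)
toy <- data.frame(
  glucose = c(148, NA, 183, 89, 137),
  insulin = c(NA, NA, NA, 94, 168),
  age     = c(50, 31, 32, 21, 33)
)

# 1. Delete observations: keep only the rows with no missing values
toy_rows_dropped <- na.omit(toy)

# 2. Delete the variable: drop the column with the most missing values
toy_col_dropped <- toy[, setdiff(names(toy), "insulin")]

# 3. Impute with a summary statistic: here, the median of each column
toy_median <- toy
for (col in names(toy_median)) {
  missing <- is.na(toy_median[[col]])
  toy_median[[col]][missing] <- median(toy_median[[col]], na.rm = TRUE)
}
```

Note how much data option 1 throws away even in this tiny example: only two of the five rows survive.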

This post shows how to replace all the missing values in the well-known Pima Tribe diabetes dataset with values imputed by a random forest model.

Missing Values in the Pima Tribe Diabetes Data


The Pima Tribe diabetes dataset can be obtained from the UCI Machine Learning Repository or from the R package faraway. It comes from a study that the National Institute of Diabetes and Digestive and Kidney Diseases conducted on 768 adult female Pima Indians living near Phoenix, Arizona, USA. The dataset contains the following columns:

  1. pregnant - number of times pregnant
  2. glucose - plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. diastolic - diastolic blood pressure in mmHg
  4. triceps - triceps skin fold thickness in mm
  5. insulin - two-hour serum insulin in μU/ml
  6. bmi - body mass index in kg/m^2
  7. diabetes - diabetes pedigree function
  8. age - patient age in years
  9. test - whether the patient showed signs of diabetes (0 = negative, 1 = positive)

The dataset has 768 rows and 9 columns, and it contains many hidden missing values: a careful inspection reveals many zeros in columns where a value of zero is biologically impossible.

if( !require(dplyr) ){ install.packages("dplyr") }
if( !require(faraway) ){ install.packages("faraway") }
if( !require(randomForest) ){ install.packages("randomForest") }
if( !require(tibble) ){ install.packages("tibble") }

library(dplyr)
library(faraway)
library(randomForest)
library(tibble)

pima <- tibble::as_tibble(faraway::pima)
str(pima)
## Classes 'tbl_df', 'tbl' and 'data.frame':    768 obs. of  9 variables:
##  $ pregnant : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose  : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ diastolic: int  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps  : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin  : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ bmi      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ diabetes : num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age      : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ test     : int  1 0 1 0 1 0 1 0 1 1 ...

The variable \(\text{test}\) should be coded as a factor rather than as an integer.

pima$test <- as.factor(ifelse(pima$test == 0, "negative", "positive"))

Next, we look at the numerical summary of the dataset.

summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           bmi           diabetes           age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##        test    
##  negative:500  
##  positive:268  
##                
##                
##                
## 

From the numerical summary, we can see that zero is the minimum value for \(\text{glucose}\), \(\text{diastolic}\), \(\text{triceps}\), \(\text{insulin}\), and \(\text{bmi}\). (Zero is also the minimum for \(\text{pregnant}\), but there it is a plausible value.)
As mentioned before, these zero values are not real biological measurements. Unfortunately, it seems the researchers made the poor choice of using zero to encode missing data.
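
A quick way to see how widespread such zeros are is to count them per column. The sketch below uses a small made-up data frame (toy, with hypothetical values) as a stand-in for a few of the Pima columns; the same colSums() call should work on any data frame whose columns are all numeric.

```r
# Toy stand-in for a few Pima columns (hypothetical values)
toy <- data.frame(
  pregnant = c(6L, 1L, 0L, 1L),
  glucose  = c(148L, 0L, 183L, 89L),
  insulin  = c(0L, 0L, 0L, 94L),
  bmi      = c(33.6, 26.6, 0, 28.1)
)

# toy == 0 is a logical matrix; colSums() counts the TRUEs in each column
zero_counts <- colSums(toy == 0)
print(zero_counts)
```

The counts only flag candidates. Whether a zero really means "missing" is a judgment call for each variable: zero pregnancies is a legitimate value, while a body mass index of zero is not.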

We replace these zeroes with NA values.

pima$glucose[ pima$glucose == 0 ] <- NA
pima$diastolic[ pima$diastolic == 0 ] <- NA
pima$triceps[ pima$triceps == 0 ] <- NA
pima$insulin[ pima$insulin == 0 ] <- NA
pima$bmi[ pima$bmi == 0 ] <- NA

# number of rows with no missing values
sum(complete.cases(pima))
## [1] 392
# number of rows with at least one missing value
sum(!complete.cases(pima))
## [1] 376
head(pima)
## # A tibble: 6 x 9
##   pregnant glucose diastolic triceps insulin   bmi diabetes   age test    
##      <int>   <int>     <int>   <int>   <int> <dbl>    <dbl> <int> <fct>   
## 1        6     148        72      35      NA  33.6    0.627    50 positive
## 2        1      85        66      29      NA  26.6    0.351    31 negative
## 3        8     183        64      NA      NA  23.3    0.672    32 positive
## 4        1      89        66      23      94  28.1    0.167    21 negative
## 5        0     137        40      35     168  43.1    2.29     33 positive
## 6        5     116        74      NA      NA  25.6    0.201    30 negative
tail(pima)
## # A tibble: 6 x 9
##   pregnant glucose diastolic triceps insulin   bmi diabetes   age test    
##      <int>   <int>     <int>   <int>   <int> <dbl>    <dbl> <int> <fct>   
## 1        9      89        62      NA      NA  22.5    0.142    33 negative
## 2       10     101        76      48     180  32.9    0.171    63 negative
## 3        2     122        70      27      NA  36.8    0.34     27 negative
## 4        5     121        72      23     112  26.2    0.245    30 negative
## 5        1     126        60      NA      NA  30.1    0.349    47 positive
## 6        1      93        70      31      NA  30.4    0.315    23 negative
summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     insulin            bmi           diabetes           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##  NA's   :374      NA's   :11                                      
##        test    
##  negative:500  
##  positive:268  
##                
##                
##                
##                
## 

The dataset now has 392 rows with complete data and 376 rows with some missing data values. There are 5 NA’s for \(\text{glucose}\), 35 NA’s for \(\text{diastolic}\), 227 NA’s for \(\text{triceps}\), 374 NA’s for \(\text{insulin}\), and 11 NA’s for \(\text{bmi}\).
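
The zero-to-NA replacement and the tallies above can also be written more compactly. The following is a sketch on a made-up data frame (toy and the vector zero_is_missing are hypothetical names), assuming that zero encodes a missing value only in the listed columns.

```r
# Toy stand-in for a few Pima columns (hypothetical values)
toy <- data.frame(
  pregnant = c(6L, 1L, 0L, 1L),
  glucose  = c(148L, 0L, 183L, 89L),
  insulin  = c(0L, 0L, 0L, 94L)
)

# Columns in which a zero is assumed to mean "missing"
zero_is_missing <- c("glucose", "insulin")

# Replace zeros with NA in those columns only; pregnant is left untouched
toy[zero_is_missing] <- lapply(
  toy[zero_is_missing],
  function(x) replace(x, x == 0, NA)
)

# Missing values per column, and the number of fully observed rows
na_counts  <- colSums(is.na(toy))
n_complete <- sum(complete.cases(toy))
```

Listing the affected columns in one vector keeps the assumption about which zeros are invalid in a single, easily reviewed place.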

Imputation of Missing Values with a Random Forest


The function rfImpute(), in the R package randomForest, imputes missing values in predictor data using the proximity matrix from randomForest(). The algorithm starts by imputing NAs with medians or modes (via na.roughfix()), then fits a random forest to the completed data. The proximity matrix from that forest is used to update the imputed values: for a continuous predictor, the new value is the proximity-weighted average of the non-missing observations, and for a categorical predictor it is the category with the largest average proximity. This process is repeated iter times. A response variable without missing values must be chosen; for the Pima Tribe diabetes dataset, we choose \(\text{test}\).
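
To make the proximity-based update concrete: according to the randomForest documentation, the imputed value of a continuous predictor is the weighted average of its non-missing observations, with proximities as weights. The sketch below reproduces a single such update by hand. The values in x and the proximities in prox_row3 are made up for illustration, and the code is not the package's internal implementation.

```r
# One continuous predictor with a missing value in row 3 (hypothetical values)
x <- c(10, 20, NA, 40)

# Hypothetical proximities between row 3 and every row. In a random forest,
# the proximity of two rows is based on how often they fall in the same
# terminal node across the trees.
prox_row3 <- c(0.5, 0.3, 1.0, 0.2)

# Only rows where x is observed contribute to the update
observed <- !is.na(x)

# Proximity-weighted average of the observed values
x_imputed <- x
x_imputed[3] <- weighted.mean(x[observed], w = prox_row3[observed])
print(x_imputed[3])
```

Rows that the forest considers similar to row 3 pull the imputed value toward their own, which is what distinguishes this update from the plain median used in the initial rough fix.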

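# rfImpute() returns the response in the first column, so select() restores
# the original column order. Results vary from run to run unless a seed is
# fixed beforehand with set.seed().
pima_imputed <- tibble::as_tibble(
                randomForest::rfImpute(test ~ ., ntree = 200, iter = 5, data = pima)
                ) %>% select(pregnant, glucose, diastolic, triceps, 
                             insulin, bmi, diabetes, age, test)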
## ntree      OOB      1      2
##   200:  24.35% 16.00% 39.93%
## ntree      OOB      1      2
##   200:  25.13% 16.40% 41.42%
## ntree      OOB      1      2
##   200:  24.48% 16.60% 39.18%
## ntree      OOB      1      2
##   200:  25.65% 17.80% 40.30%
## ntree      OOB      1      2
##   200:  24.48% 15.60% 41.04%
head(pima_imputed)
## # A tibble: 6 x 9
##   pregnant glucose diastolic triceps insulin   bmi diabetes   age test    
##      <int>   <dbl>     <dbl>   <dbl>   <dbl> <dbl>    <dbl> <int> <fct>   
## 1        6     148        72    35     277.   33.6    0.627    50 positive
## 2        1      85        66    29      68.8  26.6    0.351    31 negative
## 3        8     183        64    29.8   255.   23.3    0.672    32 positive
## 4        1      89        66    23      94    28.1    0.167    21 negative
## 5        0     137        40    35     168    43.1    2.29     33 positive
## 6        5     116        74    18.7    83.0  25.6    0.201    30 negative
tail(pima_imputed)
## # A tibble: 6 x 9
##   pregnant glucose diastolic triceps insulin   bmi diabetes   age test    
##      <int>   <dbl>     <dbl>   <dbl>   <dbl> <dbl>    <dbl> <int> <fct>   
## 1        9      89        62    20.5    81.2  22.5    0.142    33 negative
## 2       10     101        76    48     180    32.9    0.171    63 negative
## 3        2     122        70    27     132.   36.8    0.34     27 negative
## 4        5     121        72    23     112    26.2    0.245    30 negative
## 5        1     126        60    27.5   197.   30.1    0.349    47 positive
## 6        1      93        70    31      73.4  30.4    0.315    23 negative
summary(pima_imputed)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:21.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.6   Mean   : 72.35   Mean   :28.86  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:35.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin            bmi           diabetes           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 81.07   1st Qu.:27.40   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :133.71   Median :32.15   Median :0.3725   Median :29.00  
##  Mean   :154.51   Mean   :32.41   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:200.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##        test    
##  negative:500  
##  positive:268  
##                
##                
##                
## 
# number of rows with no missing values
sum(complete.cases(pima_imputed))
## [1] 768
# number of rows with at least one missing value
sum(!complete.cases(pima_imputed))
## [1] 0

After imputation, all the missing values have been filled in, and all 768 rows are complete.
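
A useful sanity check after any imputation is to confirm that only the missing entries changed. The helper below, observed_unchanged(), is a hypothetical function written for this illustration (it is not part of randomForest); it compares two data frames column by column at the positions that were observed in the original. The before and after data frames contain made-up values.

```r
# Returns TRUE if every value observed in `original` is the same in `imputed`
observed_unchanged <- function(original, imputed) {
  cols <- intersect(names(original), names(imputed))
  all(vapply(cols, function(col) {
    keep <- !is.na(original[[col]])
    isTRUE(all.equal(original[[col]][keep], imputed[[col]][keep]))
  }, logical(1)))
}

# Hypothetical before/after data: only the NA in glucose was filled in
before <- data.frame(glucose = c(148, NA, 183), age = c(50, 31, 32))
after  <- data.frame(glucose = c(148, 120, 183), age = c(50, 31, 32))

observed_unchanged(before, after)
```

Assuming the objects from this post are in the workspace, the same check could be run as observed_unchanged(pima, pima_imputed).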