Handling Missing Values

Introduction

Missing values are a common occurrence in statistical analysis and can have a significant effect on the conclusions. A missing value occurs when no data is stored for a variable in an observation.
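As a small illustration of how R represents a missing value, the sketch below uses a made-up three-element vector (the name x and its values are only for illustration). It shows that NA propagates through a calculation unless it is explicitly removed.

```r
# A hypothetical vector with one missing value
x <- c(22.7, NA, 20.4)

# is.na() flags the missing entries
is.na(x)

# Most summaries return NA as soon as one value is missing ...
mean(x)

# ... unless the missing values are dropped first
mean(x, na.rm = TRUE)
```

Dropping the NA here is harmless, but on a real dataset the same shortcut silently discards information, which is the problem the rest of this vignette addresses.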

There are several methods to handle missing values. In statistics, imputation is the process of replacing missing data with substituted values, and it is the method discussed in this vignette. To do this, I will introduce the mice package and two visualisation packages, VIM and lattice.

Data Source

The data source for this example is Phase 1 of the Third National Health and Nutrition Examination Survey (NHANES) by the US National Center for Health Statistics, which ships with the mice package.

Loading the Data

Since the dataset is part of the mice package, the package is loaded first and then the structure of the dataset is inspected.

library(mice)

data(nhanes)

str(nhanes)

## 'data.frame':    25 obs. of  4 variables:
##  $ age: num  1 2 1 3 1 3 1 1 2 2 ...
##  $ bmi: num  NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
##  $ hyp: num  NA 1 1 NA 1 NA 1 1 1 NA ...
##  $ chl: num  NA 187 187 NA 113 184 118 187 238 NA ...

The dataset has 25 observations and 4 variables: age, body-mass index (bmi), hypertension level (hyp) and cholesterol level (chl).

Finding Missing Values

Missing data can be identified in two ways: (1) with the summary() function and (2) with the md.pattern() function in the mice package.

summary(nhanes)

##       age            bmi             hyp             chl       
##  Min.   :1.00   Min.   :20.40   Min.   :1.000   Min.   :113.0  
##  1st Qu.:1.00   1st Qu.:22.65   1st Qu.:1.000   1st Qu.:185.0  
##  Median :2.00   Median :26.75   Median :1.000   Median :187.0  
##  Mean   :1.76   Mean   :26.56   Mean   :1.235   Mean   :191.4  
##  3rd Qu.:2.00   3rd Qu.:28.93   3rd Qu.:1.000   3rd Qu.:212.0  
##  Max.   :3.00   Max.   :35.30   Max.   :2.000   Max.   :284.0  
##                 NA's   :9       NA's   :8       NA's   :10

The result shows that 9 values are missing for bmi, 8 for hyp and 10 for chl. No values are missing for age.
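The same counts can also be read off directly, without scanning the summary table. The following is a minimal sketch that assumes only that the mice package is installed so that nhanes is available.

```r
library(mice)
data(nhanes)

# Number of missing values in each column
colSums(is.na(nhanes))

# The same information as a proportion of the 25 rows
colMeans(is.na(nhanes))
```

The proportions are the values that the VIM plot later in this vignette reports for each variable.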

The same information, broken down by pattern, is available from the md.pattern() function.

md.pattern(nhanes)

##    age hyp bmi chl   
## 13   1   1   1   1  0
##  1   1   1   0   1  1
##  3   1   1   1   0  1
##  1   1   0   0   1  2
##  7   1   0   0   0  3
##      0   8   9  10 27

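Each row is one missingness pattern (1 = observed, 0 = missing). The first column gives the number of observations with that pattern and the last column gives the number of variables missing in it. The result shows that 13 observations are complete on all four variables, 1 observation is missing only bmi, 3 are missing only chl, 1 is missing both hyp and bmi, and 7 are missing hyp, bmi and chl. The bottom row gives the number of missing values per variable, 27 in total.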
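To make the pattern table less of a black box, the sketch below rebuilds the pattern counts with base R only. It is not how md.pattern() is implemented; the helper name pattern is just for illustration.

```r
library(mice)
data(nhanes)

# Logical matrix: TRUE where a value is missing
miss <- is.na(nhanes)

# Label each row with the names of its missing variables, e.g. "bmi+hyp"
pattern <- apply(miss, 1, function(r) paste(names(r)[r], collapse = "+"))
pattern[pattern == ""] <- "complete"

# Count how often each pattern occurs
table(pattern)

# Number of fully observed rows
sum(complete.cases(nhanes))
```

The counts should agree with the first column of the md.pattern() output above.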

The missing-data patterns can be visualised using the VIM package.

library(VIM)

## Loading required package: colorspace

## Loading required package: grid

## Loading required package: data.table

## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.

## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

nhanes_plot <- aggr(nhanes, col=c('lightblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(nhanes), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable Count
##       chl  0.40
##       bmi  0.36
##       hyp  0.32
##       age  0.00

The histogram on the left shows the proportion of missing values for each variable, and the plot on the right shows how often each combination of missing variables occurs. Light blue refers to observed data and red to missing data. All four variables are observed in 52% of the observations (13 of 25). The most frequent missing pattern, in which chl, bmi and hyp are all missing, accounts for 28% (7 observations).
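The percentages quoted above follow directly from the pattern counts reported by md.pattern(). A quick check, with the counts typed in by hand from that output (the vector names are only labels):

```r
# Observations per missingness pattern, copied from the md.pattern() output
counts <- c(complete = 13, bmi = 1, chl = 3, hyp_bmi = 1, hyp_bmi_chl = 7)

# Share of the 25 observations that falls in each pattern
round(counts / sum(counts), 2)
```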

Handling Missing Data

Our hypothesis is that the cholesterol level has a linear relationship with age and body-mass index. The easiest approach is to delete the observations with missing values and fit a linear model to the rest, which is what the lm() function does by default.

lreg.cc <- lm(chl ~ age + bmi, data=nhanes)

summary(lreg.cc)

## 
## Call:
## lm(formula = chl ~ age + bmi, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.187 -19.517  -0.310   6.915  60.606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -80.194     58.772  -1.364 0.202327    
## age           53.069     11.293   4.699 0.000842 ***
## bmi            6.884      1.846   3.730 0.003913 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.67 on 10 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.7318, Adjusted R-squared:  0.6781 
## F-statistic: 13.64 on 2 and 10 DF,  p-value: 0.001388

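Both bmi and age are significant, but the fit rests on only 13 observations because the other 12 were deleted. Discarding almost half of the data wastes information and can bias the estimates unless the values are missing completely at random, so this is not a desirable approach.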
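The deletion happens silently inside lm(), so it is worth confirming which rows were actually used. The sketch below assumes R's default na.action (na.omit); the names model_vars, used and fit_explicit are introduced here for illustration.

```r
library(mice)
data(nhanes)

# Rows that are complete on the three variables in the model
model_vars <- c("chl", "age", "bmi")
used <- complete.cases(nhanes[, model_vars])
sum(used)

# lm() drops the incomplete rows itself ...
fit <- lm(chl ~ age + bmi, data = nhanes)
nobs(fit)

# ... so fitting on the complete rows only gives the same coefficients
fit_explicit <- lm(chl ~ age + bmi, data = nhanes[used, ])
all.equal(coef(fit), coef(fit_explicit))
```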

Imputation using mice package

This is a three-step process.

1. mice() - imputes the missing values several times, producing multiple completed datasets
2. with() - analyses each of the completed datasets separately
3. pool() - combines the results into a single set of estimates

impute = mice(nhanes, m=20, printFlag=FALSE, maxit = 40, seed=2525)

The object impute holds m = 20 completed datasets. The with() function fits the model to each of them, and pool() combines the resulting 20 sets of estimates.
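Before fitting any model it can help to look at one of the completed datasets. The sketch below uses a separate object, imp_demo, with only m = 5 imputations to keep it quick; that name and the smaller m are choices made for this example, not part of the analysis above. The call is written as mice::complete() because other packages also export a function called complete().

```r
library(mice)

# A small imputation run for illustration only (m = 5 rather than 20)
imp_demo <- mice(nhanes, m = 5, printFlag = FALSE, seed = 2525)

# Extract the first completed dataset and confirm nothing is missing
first_completed <- mice::complete(imp_demo, 1)
anyNA(first_completed)
head(first_completed)

# All completed datasets stacked, with .imp identifying the imputation
stacked <- mice::complete(imp_demo, "long")
table(stacked$.imp)
```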

fit.mi = with(data=impute, expr = lm(chl ~ age + bmi))

summary(pool(fit.mi))

##                    est        se          t        df   Pr(>|t|)
## (Intercept) -12.779432 73.441871 -0.1740074 11.256311 0.86494966
## age          35.101831 12.270967  2.8605595  9.786956 0.01729236
## bmi           5.446575  2.283797  2.3848768 12.027350 0.03441232
##                    lo 95     hi 95 nmis       fmi    lambda
## (Intercept) -173.9759633 148.41710   NA 0.4747763 0.3890697
## age            7.6795228  62.52414    0 0.5426942 0.4579056
## bmi            0.4718631  10.42129    9 0.4403476 0.3544279
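The pooled estimates and standard errors come from Rubin's rules: the point estimate is the average across the imputed datasets, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The sketch below reproduces that arithmetic by hand on a small run. It assumes that the fitted models are stored in the analyses component of the object returned by with(); imp_demo, fit_demo and the other names are introduced only for this example.

```r
library(mice)

# Small illustrative run (m = 5), separate from the analysis above
imp_demo <- mice(nhanes, m = 5, printFlag = FALSE, seed = 2525)
fit_demo <- with(imp_demo, lm(chl ~ age + bmi))

# One row per coefficient, one column per imputed dataset
est  <- sapply(fit_demo$analyses, coef)
vars <- sapply(fit_demo$analyses, function(f) diag(vcov(f)))

m <- ncol(est)
pooled_est  <- rowMeans(est)        # pooled point estimate
within_var  <- rowMeans(vars)       # average sampling variance
between_var <- apply(est, 1, var)   # variance of the estimates across imputations
total_var   <- within_var + (1 + 1 / m) * between_var

# Proportion of the total variance that is due to the missing data
lambda <- (1 + 1 / m) * between_var / total_var

cbind(est = pooled_est, se = sqrt(total_var), lambda = lambda)
```

The est and se columns should match what summary(pool(fit_demo)) reports for the same run. The numbers will differ from the table above because this run uses fewer imputations and iterations.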

Checking Imputations Visually

The distributions of the observed and the imputed data can be compared using a scatterplot for each imputed dataset, drawn with the lattice package.

library(lattice)

xyplot(impute, bmi ~ chl | .imp, pch = 20, cex = 1.4)

Blue represents observed data and red represents the imputed data. It also shows that the values imputed are all positive.
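The visual check can be backed up numerically. The sketch below assumes the default imputation method for numeric columns, predictive mean matching ("pmm"), which fills each gap with a value borrowed from an observed case. Under that assumption the imputed bmi values cannot fall outside the observed range, which would explain why they are all positive. As before, imp_demo is a small illustrative run rather than the impute object used above.

```r
library(mice)

imp_demo <- mice(nhanes, m = 5, printFlag = FALSE, seed = 2525)

# Imputation method used for each variable (empty for fully observed age)
imp_demo$method

# Range of the observed bmi values
obs_range <- range(nhanes$bmi, na.rm = TRUE)
obs_range

# Range of all imputed bmi values across the five imputations
imp_range <- range(unlist(imp_demo$imp$bmi))
imp_range
```

If a different method is chosen, for example a regression-based one, this guarantee no longer holds and the ranges are worth checking explicitly.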