Missing values are common in statistical analysis and can have a significant effect on the conclusions. A value is missing when no data is stored for a variable in an observation.
There are several methods for handling missing values. In statistics, imputation is the process of replacing missing data with substituted values, and it is the subject of this vignette. To that end, I will introduce the mice package and two visualisation packages, VIM and lattice.
The data source for this example is Phase 1 of the Third National Health and Nutrition Examination Survey (NHANES) by the US National Center for Health Statistics, which ships with the mice package.
Since the dataset is part of the mice package, we load the package first and then inspect the structure of the dataset.
library(mice)
data(nhanes)
str(nhanes)
## 'data.frame': 25 obs. of 4 variables:
## $ age: num 1 2 1 3 1 3 1 1 2 2 ...
## $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
## $ hyp: num NA 1 1 NA 1 NA 1 1 1 NA ...
## $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
The dataset has 25 observations and 4 variables: age, body-mass index (bmi), hypertension status (hyp) and cholesterol level (chl).
Missing data can be identified in two ways: (1) with the summary() function and (2) with the md.pattern() function from the mice package.
summary(nhanes)
## age bmi hyp chl
## Min. :1.00 Min. :20.40 Min. :1.000 Min. :113.0
## 1st Qu.:1.00 1st Qu.:22.65 1st Qu.:1.000 1st Qu.:185.0
## Median :2.00 Median :26.75 Median :1.000 Median :187.0
## Mean :1.76 Mean :26.56 Mean :1.235 Mean :191.4
## 3rd Qu.:2.00 3rd Qu.:28.93 3rd Qu.:1.000 3rd Qu.:212.0
## Max. :3.00 Max. :35.30 Max. :2.000 Max. :284.0
## NA's :9 NA's :8 NA's :10
The result shows that 9 values are missing for bmi, 8 for hyp and 10 for chl.
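The same counts can be obtained directly, without reading them off the summary table. The following is a minimal sketch using base R functions on the nhanes data; the variable name na_counts is only illustrative.

```r
library(mice)   # provides the nhanes data
data(nhanes)

# is.na() marks every missing cell; colSums() adds them up per variable
na_counts <- colSums(is.na(nhanes))
print(na_counts)
```

This is convenient on wider data frames, where the NA's row of summary() becomes hard to scan.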
Now we can check the same using the md.pattern() function.
md.pattern(nhanes)
## age hyp bmi chl
## 13 1 1 1 1 0
## 1 1 1 0 1 1
## 3 1 1 1 0 1
## 1 1 0 0 1 2
## 7 1 0 0 0 3
## 0 8 9 10 27
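Each row is a missingness pattern (1 = observed, 0 = missing): the first column gives the number of observations with that pattern and the last column the number of variables missing in it. The result shows that 13 observations are complete on all four variables, 1 observation is missing only bmi, 3 are missing only chl, 1 is missing both hyp and bmi, and 7 are missing hyp, bmi and chl. The bottom row counts the missing values per variable, 27 in total.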
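To see where these pattern counts come from, the table can be rebuilt by hand. This is only a sketch in base R, not how md.pattern() is implemented; the names pattern and pattern_counts are illustrative, and the columns here follow the data frame order (age, bmi, hyp, chl) rather than the sorted order md.pattern() prints.

```r
library(mice)   # provides the nhanes data
data(nhanes)

# Encode each row as a string of 1 (observed) / 0 (missing),
# in column order age, bmi, hyp, chl
pattern <- apply(!is.na(nhanes), 1,
                 function(r) paste(as.integer(r), collapse = ""))

# Count how many observations share each pattern
pattern_counts <- table(pattern)
print(pattern_counts)

# Total number of missing cells (bottom-right value of md.pattern)
total_missing <- sum(is.na(nhanes))
print(total_missing)
```

The string "1111" corresponds to the complete cases and "1000" to the observations where only age is observed.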
The missing patterns can be visualised using the VIM package.
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
nhanes_plot <- aggr(nhanes, col=c('lightblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(nhanes), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## chl 0.40
## bmi 0.36
## hyp 0.32
## age 0.00
The histogram on the left shows the proportion of missing values in each variable, and the plot on the right shows the frequency of each combination of missing variables. Blue refers to observed data and red to missing data. All four variables are observed in 52% of the observations (13 of 25), and the most frequent missingness pattern, in which chl, bmi and hyp are all missing, occurs in 28% of them (7 observations).
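The percentages read off the plot can be checked numerically. The sketch below uses only base R; the names n_missing, prop_complete, prop_three and var_prop are illustrative. Because age is never missing, an observation with three missing values is necessarily missing chl, bmi and hyp.

```r
library(mice)   # provides the nhanes data
data(nhanes)

# Number of missing values in each observation
n_missing <- rowSums(is.na(nhanes))

prop_complete <- mean(n_missing == 0)   # all four variables observed
prop_three    <- mean(n_missing == 3)   # chl, bmi and hyp all missing

# Proportion of missing values per variable, largest first
var_prop <- sort(colMeans(is.na(nhanes)), decreasing = TRUE)

print(c(complete = prop_complete, three_missing = prop_three))
print(var_prop)
```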
Our hypothesis is that the cholesterol level has a linear relationship with age and body-mass index. The easiest approach is to delete the observations with missing values and fit a linear model to the rest, which the lm() function does by default.
lreg.cc <- lm(chl ~ age + bmi, data=nhanes)
summary(lreg.cc)
##
## Call:
## lm(formula = chl ~ age + bmi, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.187 -19.517 -0.310 6.915 60.606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.194 58.772 -1.364 0.202327
## age 53.069 11.293 4.699 0.000842 ***
## bmi 6.884 1.846 3.730 0.003913 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.67 on 10 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.7318, Adjusted R-squared: 0.6781
## F-statistic: 13.64 on 2 and 10 DF, p-value: 0.001388
Both bmi and age are significant, but the fit is based on only 13 observations (12 were deleted). Discarding almost half of the data is not a desirable approach: it wastes information and can bias the estimates unless the values are missing completely at random.
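The deletion that lm() performs silently can be made explicit. The sketch below assumes R's default na.action (na.omit) and uses illustrative names (model_vars, complete_rows); it drops the incomplete rows by hand and confirms that the fit on the reduced data matches the fit above.

```r
library(mice)   # provides the nhanes data
data(nhanes)

# Keep only the variables used in the model, then drop incomplete rows
model_vars    <- nhanes[, c("chl", "age", "bmi")]
complete_rows <- na.omit(model_vars)
print(nrow(complete_rows))          # rows that lm() can actually use

# Fit on the full data (lm drops rows itself) and on the reduced data
lreg.cc     <- lm(chl ~ age + bmi, data = nhanes)
lreg.manual <- lm(chl ~ age + bmi, data = complete_rows)

print(nobs(lreg.cc))                # observations used in the fit
print(nrow(nhanes) - nobs(lreg.cc)) # observations dropped
print(all.equal(coef(lreg.cc), coef(lreg.manual)))
```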
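A better alternative is multiple imputation with the mice package, which is a three-step process.
1. mice() - imputes the missing values m times, producing m completed datasets
2. with() - analyses each of the completed datasets separately
3. pool() - combines all the results together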
impute = mice(nhanes, m=20, printFlag=FALSE, maxit = 40, seed=2525)
The object impute holds m = 20 sets of imputed values, one for each completed dataset. Each completed dataset can be analysed using the with() function.
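Before analysing the imputations it can help to look inside the returned object. The following sketch repeats the mice() call from above so that it runs on its own, then inspects the imputation method per variable, the stored imputations for bmi, and the first completed dataset; the name completed1 is illustrative.

```r
library(mice)
data(nhanes)

impute <- mice(nhanes, m = 20, printFlag = FALSE, maxit = 40, seed = 2525)

# Imputation method chosen for each variable (empty for age, which is complete)
print(impute$method)

# Imputed values for bmi: one row per missing value, one column per imputation
print(dim(impute$imp$bmi))

# Extract the first completed dataset and confirm nothing is missing
completed1 <- mice::complete(impute, 1)
print(anyNA(completed1))
print(head(completed1))
```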
fit.mi = with(data=impute, expr = lm(chl ~ age + bmi))
summary(pool(fit.mi))
## est se t df Pr(>|t|)
## (Intercept) -12.779432 73.441871 -0.1740074 11.256311 0.86494966
## age 35.101831 12.270967 2.8605595 9.786956 0.01729236
## bmi 5.446575 2.283797 2.3848768 12.027350 0.03441232
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -173.9759633 148.41710 NA 0.4747763 0.3890697
## age 7.6795228 62.52414 0 0.5426942 0.4579056
## bmi 0.4718631 10.42129 9 0.4403476 0.3544279
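The pooling step follows Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below computes these by hand from the fitted models stored in fit.mi$analyses; it is an illustration of the rules, not the internals of pool(), and the names est, u, qbar, ubar, b and pooled_se are illustrative. The exact numbers depend on the installed mice version.

```r
library(mice)
data(nhanes)

impute <- mice(nhanes, m = 20, printFlag = FALSE, maxit = 40, seed = 2525)
fit.mi <- with(data = impute, expr = lm(chl ~ age + bmi))

m   <- length(fit.mi$analyses)                           # number of imputations
est <- sapply(fit.mi$analyses, coef)                     # coefficients, one column per fit
u   <- sapply(fit.mi$analyses, function(f) diag(vcov(f))) # squared standard errors

qbar <- rowMeans(est)           # pooled estimate
ubar <- rowMeans(u)             # within-imputation variance
b    <- apply(est, 1, var)      # between-imputation variance
pooled_se <- sqrt(ubar + (1 + 1 / m) * b)

print(cbind(estimate = qbar, se = pooled_se))
```

The pooled standard errors are larger than those of any single completed-data analysis would suggest, because they also carry the uncertainty about the imputed values.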
The distributions of the observed and the imputed data can be compared with a scatterplot for each imputed dataset, using the lattice package.
library(lattice)
xyplot(impute, bmi ~ chl | .imp, pch = 20, cex = 1.4)
Blue represents the observed data and red represents the imputed data. The plots also show that the imputed values are all positive.
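The visual check can be backed by a numeric one. The sketch below assumes the default imputation method for numeric variables (predictive mean matching), which fills in values taken from observed cases, so the imputed values should stay inside the observed range; the names vars, in_range and all_positive are illustrative.

```r
library(mice)
data(nhanes)

impute <- mice(nhanes, m = 20, printFlag = FALSE, maxit = 40, seed = 2525)

vars <- c("bmi", "hyp", "chl")   # the variables that have missing values

# For each variable, check that the imputed values lie within the observed range
in_range <- sapply(vars, function(v) {
  obs <- range(nhanes[[v]], na.rm = TRUE)
  imp <- range(unlist(impute$imp[[v]]))
  imp[1] >= obs[1] && imp[2] <= obs[2]
})
print(in_range)

# Check that every imputed value is positive
all_positive <- all(unlist(impute$imp[vars]) > 0)
print(all_positive)
```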