Why do we need imputation?

Consider a dataset in which several variables are measured on several subjects, but some of the values are missing. The problem is that most standard analyses use only complete cases: a subject with even one missing value is dropped from the analysis.
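To make the cost of dropping incomplete subjects concrete, here is a minimal sketch in base R. The data frame `toy` and its values are made up for illustration and are not part of nhanes2.

```r
# Hypothetical toy data: 6 subjects, 3 variables, 4 missing cells in total
toy <- data.frame(
  bmi = c(22.7, NA, 20.4, 27.2, 25.1, NA),
  hyp = factor(c("no", "no", NA, "yes", "no", "yes")),
  chl = c(187, 187, 113, NA, 204, 184)
)

n_total    <- nrow(toy)                 # all subjects
n_missing  <- sum(is.na(toy))           # number of missing cells
n_complete <- sum(complete.cases(toy))  # subjects with no missing value

# na.omit() keeps only the complete rows; this is also what lm() does
# under R's default na.action setting
kept <- na.omit(toy)

cat(n_missing, "missing cells;", n_complete, "of", n_total, "subjects kept\n")
```

In this toy example no subject is missing more than one value, yet only two of the six subjects survive a complete-case analysis.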

Imputation is the usual remedy: each missing value is filled in with a plausible estimate.

Load the mice library.

library(mice)
## Loading required package: Rcpp
## mice 2.25 2015-11-09

Load the data nhanes2.

data("nhanes2")
dim(nhanes2)
## [1] 25  4
head(nhanes2)
##     age  bmi  hyp chl
## 1 20-39   NA <NA>  NA
## 2 40-59 22.7   no 187
## 3 20-39   NA   no 187
## 4 60-99   NA <NA>  NA
## 5 20-39 20.4   no 113
## 6 60-99   NA <NA> 184
summary(nhanes2)
##     age          bmi          hyp          chl       
##  20-39:12   Min.   :20.40   no  :13   Min.   :113.0  
##  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0  
##  60-99: 6   Median :26.75   NA's: 8   Median :187.0  
##             Mean   :26.56             Mean   :191.4  
##             3rd Qu.:28.93             3rd Qu.:212.0  
##             Max.   :35.30             Max.   :284.0  
##             NA's   :9                 NA's   :10

Imputation Technique Principle.

We start with a regression model, here the model for bmi.

bmi = B0 + B1*hyp + B2*chl + B3*age(40-59) + B4*age(60-99) + Error

The above is a multiple regression model.

  1. The Error is assumed to be normally distributed with a mean of 0 and a constant standard deviation, which is estimated from the residuals.
  2. The regression equation is fitted on the complete cases, that is, the subjects with an observed value on every variable.
  3. The missing bmi value is predicted from this regression equation.
  4. The error term is a single value simulated from that normal distribution and added to the prediction.
  5. Because the error is random, the imputed value of bmi will differ each time.
  6. When a subject is missing more than one variable, another equation is fitted for each missing variable using the remaining observed values, and the process is repeated.
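The steps in the list can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `toy` is made up, a single predictor (chl) is used to keep it short, and the helper `impute_once` is hypothetical. A full multiple-imputation routine would also account for uncertainty in the fitted coefficients, which this sketch ignores.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical toy data: bmi is missing for subjects 3 and 6
toy <- data.frame(
  chl = c(187, 113, 184, 118, 238, 206, 199, 229, 131, 218),
  bmi = c(22.7, 20.4, NA, 22.5, 30.1, NA, 22.0, 27.2, 22.7, 25.5)
)

# Step 2: fit the regression on the complete cases (lm drops NA rows by default)
fit <- lm(bmi ~ chl, data = toy)
sigma_hat <- summary(fit)$sigma  # estimated standard deviation of the Error

# Step 3: predict bmi for the subjects where it is missing
miss <- is.na(toy$bmi)
pred <- predict(fit, newdata = toy[miss, , drop = FALSE])

# Step 4: add one simulated normal error per missing value
impute_once <- function() pred + rnorm(sum(miss), mean = 0, sd = sigma_hat)

# Step 5: two runs give two different sets of imputed values
draw1 <- impute_once()
draw2 <- impute_once()
print(rbind(draw1, draw2))
```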

If the missing value is a binary variable then we need a logistic model and repeat the same process.

Pr(hyp = yes) = e^(B0 + B1*bmi + B2*chl + B3*age(40-59) + B4*age(60-99)) / (1 + e^(B0 + B1*bmi + B2*chl + B3*age(40-59) + B4*age(60-99)))

A logistic model has no additive error term. Instead, the randomness comes from drawing the imputed value from a Bernoulli distribution with this probability. A Bernoulli distribution is the distribution of a variable that takes only two values.
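For a binary variable, the same idea can be sketched with `glm` and `rbinom` in base R. The data frame `toy` is invented for illustration and only bmi is used as a predictor; this is a simplified sketch, not the exact procedure mice follows internally.

```r
set.seed(2)  # arbitrary seed so the sketch is repeatable

# Hypothetical toy data: hyp is missing for subjects 10 and 12
toy <- data.frame(
  bmi = c(22.7, 20.4, 30.1, 22.5, 27.2, 35.3, 25.5, 33.2, 27.5, 24.9, 28.7, 21.7),
  hyp = factor(c("no", "no", "yes", "no", "no", "yes", "yes", "no", "yes", NA, "no", NA),
               levels = c("no", "yes"))
)

# Fit the logistic model on the subjects with observed hyp
fit <- glm(hyp ~ bmi, family = binomial, data = toy)

# Predicted probability of hyp = "yes" for the subjects with missing hyp
miss  <- is.na(toy$hyp)
p_yes <- predict(fit, newdata = toy[miss, , drop = FALSE], type = "response")

# Draw each imputed value from a Bernoulli distribution with that probability
draw    <- rbinom(sum(miss), size = 1, prob = p_yes)
imputed <- factor(ifelse(draw == 1, "yes", "no"), levels = c("no", "yes"))
print(data.frame(bmi = toy$bmi[miss], p_yes = p_yes, imputed = imputed))
```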

The “mice” function carries out the multiple imputation and builds the completed datasets, here m = 10 of them.

db_imp <- mice(nhanes2, m = 10, printFlag = FALSE)

Repeated imputations are done to generate multiple copies of the dataset.

The regression model is then fitted to each imputed dataset, and the resulting estimates are examined.

Note: Imputation is appropriate when the data are missing at random (MAR).

db_imp
## Multiply imputed data set
## Call:
## mice(data = nhanes2, m = 10, printFlag = FALSE)
## Number of multiple imputations:  10
## Missing cells per column:
## age bmi hyp chl 
##   0   9   8  10 
## Imputation methods:
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm" 
## VisitSequence:
## bmi hyp chl 
##   2   3   4 
## PredictorMatrix:
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0
## Random generator seed value:  NA

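The imputation method is the model used to impute each variable: "pmm" is predictive mean matching (used here for the numeric variables bmi and chl), "logreg" is logistic regression (used for the binary variable hyp), and the empty string for age means nothing was imputed because age has no missing values. The visit sequence is the order in which the variables were imputed.
The predictor matrix shows which variables were used as predictors when imputing each variable: each row is a variable being imputed, and a 1 marks a column that was used as a predictor for it.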
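The printed summary reports a seed value of NA, so each run produces different imputations. As a hedged sketch, the `seed` argument of `mice` can be set to make a run repeatable, and the pieces of the summary can be read directly from the returned object. The seed 123 and the names `imp_a` and `imp_b` are arbitrary choices for illustration.

```r
library(mice)
data("nhanes2")

# Fixing the seed makes the random draws, and so the imputations, repeatable
imp_a <- mice(nhanes2, m = 10, seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes2, m = 10, seed = 123, printFlag = FALSE)

imp_a$method           # imputation method chosen for each variable
imp_a$predictorMatrix  # rows: variable being imputed; 1 = column used as predictor
dim(imp_a$imp$bmi)     # one row per missing bmi value, one column per imputation

# The 2nd completed dataset is the same in both runs
same <- identical(mice::complete(imp_a, 2), mice::complete(imp_b, 2))
print(same)
```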

data2 <- complete(db_imp, action = 2) # extract the 2nd of the 10 completed datasets
summary(data2) 
##     age          bmi         hyp          chl       
##  20-39:12   Min.   :20.40   no :16   Min.   :113.0  
##  40-59: 7   1st Qu.:24.90   yes: 9   1st Qu.:186.0  
##  60-99: 6   Median :27.20            Median :204.0  
##             Mean   :27.23            Mean   :200.8  
##             3rd Qu.:28.70            3rd Qu.:229.0  
##             Max.   :35.30            Max.   :284.0

R functions for any distribution

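R has 4 fundamental functions for every distribution, distinguished by the prefixes r, d, p and q. For the normal distribution, for example:
1. rnorm(100, 2, 1) - returns 100 random observations from a normal distribution with a mean of 2 and a standard deviation of 1.
2. dnorm(100, 2, 1) - returns the height of the density of that distribution at the value 100. It does not draw a plot; a density curve is obtained by evaluating dnorm over a range of values and plotting the result.
3. pnorm(1.7, 2, 1) - returns the probability that an observation from that distribution is less than or equal to 1.7.
4. qnorm(.31, 2, 1) - returns the quantile, that is, the value below which 31% of that distribution lies. It is the inverse of pnorm.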
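A short base R sketch of the four functions for a normal distribution with mean 2 and standard deviation 1. The seed and the evaluation points are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed so the random draws are repeatable

x <- rnorm(100, mean = 2, sd = 1)   # 100 random draws
d <- dnorm(2, mean = 2, sd = 1)     # density height at x = 2, the peak of the curve
p <- pnorm(1.7, mean = 2, sd = 1)   # P(X <= 1.7)
q <- qnorm(0.31, mean = 2, sd = 1)  # value below which 31% of the distribution lies

# qnorm is the inverse of pnorm, so this recovers 0.31
back <- pnorm(q, mean = 2, sd = 1)

# To draw the density curve rather than a single height:
# curve(dnorm(x, mean = 2, sd = 1), from = -2, to = 6)

print(c(density = d, prob = p, quantile = q, roundtrip = back))
```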

Regression for each imputed dataset

fit <- with(db_imp, lm(bmi ~ age + hyp + chl)) # with() fits the regression to each of the 10 imputed datasets
fit
## call :
## with.mids(data = db_imp, expr = lm(bmi ~ age + hyp + chl))
## 
## call1 :
## mice(data = nhanes2, m = 10, printFlag = FALSE)
## 
## nmis :
## age bmi hyp chl 
##   0   9   8  10 
## 
## analyses :
## [[1]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    16.96093     -5.23550     -8.06432      3.63642      0.06269  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    22.66550     -5.76398     -5.29910      2.82839      0.03204  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    20.16707     -4.41016     -5.97332      2.49175      0.04273  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    22.27666     -4.89316     -7.42982      2.54075      0.03922  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    20.74525     -5.38344     -4.87158      1.59272      0.04051  
## 
## 
## [[6]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    18.47276     -4.41523     -5.74059      1.13107      0.05793  
## 
## 
## [[7]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    19.63548     -6.05103     -5.50243      1.38547      0.05249  
## 
## 
## [[8]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##     17.8936      -4.1110      -6.2273       1.2909       0.0528  
## 
## 
## [[9]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    18.88558     -6.65801     -9.40702      2.21012      0.06217  
## 
## 
## [[10]]
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)         age2         age3         hyp2          chl  
##    19.37600     -5.22792     -5.49072      2.54963      0.04879
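To see how much the estimates vary across the imputed datasets, the chl coefficients printed above can be summarised in base R. The values are copied by hand from the ten fits shown here; in a live session they could presumably be extracted with something like `sapply(fit$analyses, coef)`, and the `pool` function in mice combines the fits along with their standard errors, which this sketch does not attempt.

```r
# chl coefficients copied from the ten fits printed above
chl_coef <- c(0.06269, 0.03204, 0.04273, 0.03922, 0.04051,
              0.05793, 0.05249, 0.05280, 0.06217, 0.04879)

chl_mean  <- mean(chl_coef)   # average estimate across the 10 imputations
chl_range <- range(chl_coef)  # smallest and largest estimate
chl_sd    <- sd(chl_coef)     # spread caused by the imputation step

print(c(mean = chl_mean, min = chl_range[1], max = chl_range[2], sd = chl_sd))
```

The spread between imputations reflects the uncertainty introduced by the missing values, which a single imputed dataset would hide.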