Consider a dataset in which several variables are measured on several subjects, but some of the values are missing. The problem is that many analyses (for example lm() with its default settings) drop a subject entirely if even one of its values is missing.
Imputation is the usual way to fill in the missing values.
Load the mice package.
library(mice)
## Loading required package: Rcpp
## mice 2.25 2015-11-09
data("nhanes2")
dim(nhanes2)
## [1] 25 4
head(nhanes2)
## age bmi hyp chl
## 1 20-39 NA <NA> NA
## 2 40-59 22.7 no 187
## 3 20-39 NA no 187
## 4 60-99 NA <NA> NA
## 5 20-39 20.4 no 113
## 6 60-99 NA <NA> 184
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :13 Min. :113.0
## 40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
## 60-99: 6 Median :26.75 NA's: 8 Median :187.0
## Mean :26.56 Mean :191.4
## 3rd Qu.:28.93 3rd Qu.:212.0
## Max. :35.30 Max. :284.0
## NA's :9 NA's :10
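To see how much of this dataset a complete-case analysis would discard, one can count the rows that have no missing value. This is a minimal sketch using base R's complete.cases() and na.omit(); the variable names are chosen only for illustration.

```r
library(mice)      # provides the nhanes2 data
data("nhanes2")

# TRUE for subjects with no missing value in any column
is_complete <- complete.cases(nhanes2)
n_complete  <- sum(is_complete)
n_dropped   <- sum(!is_complete)

# na.omit() performs the same listwise deletion that lm() applies by default
cc_data <- na.omit(nhanes2)

c(kept = n_complete, dropped = n_dropped)
```

Every subject counted under "dropped" would be lost to a regression on the raw data, which is the motivation for imputing instead.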
We start with a regression model for bmi.
$$\text{bmi} = B_0 + B_1\,\text{hyp} + B_2\,\text{chl} + B_3\,\text{age}_{40-59} + B_4\,\text{age}_{60-99} + \text{Error}$$
The above is a multiple regression model.
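The idea of imputing from a regression model can be sketched by hand. The example below is a simplified illustration, not what mice() does internally: it assumes age is the only predictor (because hyp and chl have missing values of their own), fits the model on the observed bmi values, and fills each missing bmi with its prediction plus a simulated normal error. The seed and the variable names are arbitrary.

```r
library(mice)      # provides the nhanes2 data
data("nhanes2")
set.seed(1)        # arbitrary seed so the simulated errors are reproducible

missing_bmi <- is.na(nhanes2$bmi)

# Fit the regression on subjects with observed bmi (lm drops the NA rows)
fit_bmi <- lm(bmi ~ age, data = nhanes2)

# Predicted bmi for the subjects whose bmi is missing
pred <- predict(fit_bmi, newdata = nhanes2[missing_bmi, ])

# The "Error" term: a random draw using the residual standard deviation
noise <- rnorm(sum(missing_bmi), mean = 0, sd = sigma(fit_bmi))

bmi_filled <- nhanes2$bmi
bmi_filled[missing_bmi] <- pred + noise
summary(bmi_filled)
```

Adding the random error keeps the imputed values from all sitting exactly on the regression line, which would understate the variability of bmi.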
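If the variable with missing values is binary (such as hyp), we use a logistic regression model instead and repeat the same process.
$$\Pr(\text{hyp} = \text{yes}) = \frac{e^{B_0 + B_1\,\text{bmi} + B_2\,\text{chl} + B_3\,\text{age}_{40-59} + B_4\,\text{age}_{60-99}}}{1 + e^{B_0 + B_1\,\text{bmi} + B_2\,\text{chl} + B_3\,\text{age}_{40-59} + B_4\,\text{age}_{60-99}}}$$
Here the randomness does not come from an additive error term. Instead, the imputed value is simulated from a Bernoulli distribution with the probability given by the model. A Bernoulli distribution is the distribution of a variable that takes only 2 values.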
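The Bernoulli step can be sketched in base R. The linear predictor values below are hypothetical numbers chosen for illustration, not estimates from nhanes2; plogis() is the logistic function and rbinom() with size = 1 draws Bernoulli values.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical values of B0 + B1*bmi + ... for three subjects
eta <- c(-2, 0, 1.5)

# Logistic transform: exp(eta) / (1 + exp(eta))
p <- plogis(eta)

# One Bernoulli draw per subject, using that subject's probability
draws <- rbinom(length(p), size = 1, prob = p)

# Map 0/1 back to the labels used by hyp
hyp_imputed <- factor(draws, levels = c(0, 1), labels = c("no", "yes"))
data.frame(eta, p, hyp_imputed)
```

A subject with a high probability is usually, but not always, imputed as "yes", so repeating the draw gives different completed datasets.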
Multiple imputation with the “mice” function fills in the missing values and builds completed datasets.
db_imp <- mice(nhanes2, m = 10, printFlag = FALSE)
The imputation is repeated to generate multiple completed copies of the dataset (here m = 10).
The regression model is then fitted to each completed dataset and the resulting estimates are inspected.
Note: Imputation of this kind is valid when the data are missing at random (MAR).
db_imp
## Multiply imputed data set
## Call:
## mice(data = nhanes2, m = 10, printFlag = FALSE)
## Number of multiple imputations: 10
## Missing cells per column:
## age bmi hyp chl
## 0 9 8 10
## Imputation methods:
## age bmi hyp chl
## "" "pmm" "logreg" "pmm"
## VisitSequence:
## bmi hyp chl
## 2 3 4
## PredictorMatrix:
## age bmi hyp chl
## age 0 0 0 0
## bmi 1 0 1 1
## hyp 1 1 0 1
## chl 1 1 1 0
## Random generator seed value: NA
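Imputation methods lists the method used to impute each variable: here "pmm" (predictive mean matching) for bmi and chl, and "logreg" (logistic regression) for hyp. age has no missing values, so its method is empty. Visit sequence is the order in which the variables were imputed.
The predictor matrix shows which variables were used as predictors when imputing each variable: a 1 in a given row and column means that the column variable was used to impute the row variable.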
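The same information can be read directly from the fitted object, and the imputation can be made reproducible by passing a seed. This is a minimal sketch: seed = 123 is an arbitrary choice, and db_seeded is a name introduced here so that it does not overwrite db_imp. With a seed set, the imputed values will differ from the output printed in this post.

```r
library(mice)
data("nhanes2")

# Same call as before, plus an arbitrary seed for reproducibility
db_seeded <- mice(nhanes2, m = 10, printFlag = FALSE, seed = 123)

db_seeded$method           # imputation method per variable
db_seeded$predictorMatrix  # rows: variable being imputed, columns: predictors
db_seeded$seed             # the seed stored with the object
```

The predictorMatrix argument of mice() accepts a matrix of the same shape, which is how one would exclude a variable from the imputation model.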
data2 <- complete(db_imp, action = 2) # extract the 2nd completed dataset generated by the imputation
summary(data2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :16 Min. :113.0
## 40-59: 7 1st Qu.:24.90 yes: 9 1st Qu.:186.0
## 60-99: 6 Median :27.20 Median :204.0
## Mean :27.23 Mean :200.8
## 3rd Qu.:28.70 3rd Qu.:229.0
## Max. :35.30 Max. :284.0
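To compare all 10 completed datasets rather than just the 2nd, complete() can stack them with action = "long", which adds an .imp column identifying the imputation. A minimal sketch, repeating the setup so that it runs on its own (the object names long and bmi_means are illustrative):

```r
library(mice)
data("nhanes2")
db_imp <- mice(nhanes2, m = 10, printFlag = FALSE)

# Stack all completed datasets; .imp holds the imputation number (1 to 10)
long <- complete(db_imp, action = "long")

# Mean bmi within each completed dataset
bmi_means <- tapply(long$bmi, long$.imp, mean)
bmi_means
```

The means differ slightly from one imputation to the next, and that spread reflects the uncertainty about the missing values.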
R has 4 fundamental functions for every distribution, with the prefixes r, d, p and q. For the normal distribution:
1. rnorm(100, 2, 1) - draws 100 random observations from a normal distribution with a mean of 2 and a standard deviation of 1.
2. dnorm(x, 2, 1) - returns the value of the density function of that distribution at x. It does not draw a plot; plotting dnorm over a grid of x values gives the density curve.
3. pnorm(1.7, 2, 1) - returns the cumulative probability P(X <= 1.7) for a normal distribution with mean 2 and sd 1.
4. qnorm(.31, 2, 1) - returns the quantile, that is, the value below which 31% of that distribution lies. It is the inverse of pnorm.
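The four functions can be tried side by side in base R. This is a small sketch with an arbitrary seed; the grid of x values used for the density curve is also an arbitrary choice.

```r
set.seed(1)  # arbitrary seed so the random draws are reproducible

# r: 100 random draws from N(mean = 2, sd = 1)
x <- rnorm(100, mean = 2, sd = 1)

# d: height of the density at a point; at the mean it equals 1 / sqrt(2 * pi)
d_at_mean <- dnorm(2, mean = 2, sd = 1)

# d over a grid gives the density curve (uncomment to draw it)
grid <- seq(-2, 6, by = 0.1)
# plot(grid, dnorm(grid, mean = 2, sd = 1), type = "l")

# p: cumulative probability P(X <= 1.7)
p <- pnorm(1.7, mean = 2, sd = 1)

# q: inverse of p, so this recovers 1.7
q <- qnorm(p, mean = 2, sd = 1)
c(density = d_at_mean, prob = p, quantile = q)
```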
fit <- with(db_imp, lm(bmi ~ age + hyp + chl)) # with() fits the regression separately on each of the 10 imputed datasets
fit
## call :
## with.mids(data = db_imp, expr = lm(bmi ~ age + hyp + chl))
##
## call1 :
## mice(data = nhanes2, m = 10, printFlag = FALSE)
##
## nmis :
## age bmi hyp chl
## 0 9 8 10
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 16.96093 -5.23550 -8.06432 3.63642 0.06269
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 22.66550 -5.76398 -5.29910 2.82839 0.03204
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 20.16707 -4.41016 -5.97332 2.49175 0.04273
##
##
## [[4]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 22.27666 -4.89316 -7.42982 2.54075 0.03922
##
##
## [[5]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 20.74525 -5.38344 -4.87158 1.59272 0.04051
##
##
## [[6]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 18.47276 -4.41523 -5.74059 1.13107 0.05793
##
##
## [[7]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 19.63548 -6.05103 -5.50243 1.38547 0.05249
##
##
## [[8]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 17.8936 -4.1110 -6.2273 1.2909 0.0528
##
##
## [[9]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 18.88558 -6.65801 -9.40702 2.21012 0.06217
##
##
## [[10]]
##
## Call:
## lm(formula = bmi ~ age + hyp + chl)
##
## Coefficients:
## (Intercept) age2 age3 hyp2 chl
## 19.37600 -5.22792 -5.49072 2.54963 0.04879
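The ten sets of coefficients above are easier to compare in a single table. The sketch below collects them from fit$analyses and then, as one possible next step that this post does not cover in detail, combines them with mice's pool() function, which applies Rubin's rules. It repeats the setup so that it runs on its own; coefs and pooled are illustrative names.

```r
library(mice)
data("nhanes2")
db_imp <- mice(nhanes2, m = 10, printFlag = FALSE)
fit <- with(db_imp, lm(bmi ~ age + hyp + chl))

# One column of coefficients per imputed dataset
coefs <- sapply(fit$analyses, coef)
round(coefs, 3)

# Smallest and largest value of each coefficient across the imputations
apply(coefs, 1, range)

# Combine the 10 fits into a single set of estimates and standard errors
pooled <- pool(fit)
summary(pooled)
```

The pooled standard errors account for both the sampling variability within each dataset and the variability between imputations.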