Missing data in research can cause serious bias and reduce statistical power. Several statistical methods have been developed to deal with missing data, such as the expectation-maximization (EM) algorithm and multiple imputation (MI). The key difference between the two is that the EM algorithm does not fill in the missing data; instead, model parameters are estimated from the likelihood function. With MI, by contrast, each missing value is filled in several times with calculated values, producing several completed data sets. In this note, I will show how to conduct multiple imputation for both continuous and categorical missing data.
Most statistical software, such as SAS, Stata, R and SPSS, has a multiple imputation function. Here, I focus only on R and the multivariate imputation by chained equations (MICE) method. The MICE algorithm was developed by van Buuren and Groothuis-Oudshoorn.\(^1\) It can be used to impute continuous data, binary data, unordered categorical data and ordered categorical data. The mice package in R implements the MICE algorithm.
The nhanes2 data set comes with the mice package. It has 25 observations and 4 variables: age, age group (1 = 20-39, 2 = 40-59, 3 = 60+); bmi, body mass index (kg/m\(^2\)); hyp, hypertensive (1 = no, 2 = yes); and chl, total serum cholesterol (mg/dL). The data set has missing values in both continuous and categorical variables.
library(mice)
nhanes2
## age bmi hyp chl
## 1 20-39 NA <NA> NA
## 2 40-59 22.7 no 187
## 3 20-39 NA no 187
## 4 60-99 NA <NA> NA
## 5 20-39 20.4 no 113
## 6 60-99 NA <NA> 184
## 7 20-39 22.5 no 118
## 8 20-39 30.1 no 187
## 9 40-59 22.0 no 238
## 10 40-59 NA <NA> NA
## 11 20-39 NA <NA> NA
## 12 40-59 NA <NA> NA
## 13 60-99 21.7 no 206
## 14 40-59 28.7 yes 204
## 15 20-39 29.6 no NA
## 16 20-39 NA <NA> NA
## 17 60-99 27.2 yes 284
## 18 40-59 26.3 yes 199
## 19 20-39 35.3 no 218
## 20 60-99 25.5 yes NA
## 21 20-39 NA <NA> NA
## 22 20-39 33.2 no 229
## 23 20-39 27.5 no 131
## 24 60-99 24.9 no NA
## 25 40-59 27.4 no 186
We write a small function to check the class of each variable in the data set.
col_classes <- function(df) {
  data.frame(
    variable = names(df),
    class = unname(sapply(df, class))
  )
}
col_classes(nhanes2)
## variable class
## 1 age factor
## 2 bmi numeric
## 3 hyp factor
## 4 chl numeric
We see that bmi and chl are numeric (continuous) variables and that age and hyp are factor variables. Note that categorical variables are often stored as character variables. In that situation we need to convert them to factors first; otherwise mice cannot impute them. This is very important.
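The conversion from character to factor can be sketched in base R as follows. The data frame `df` and its columns `sex` and `smoker` are made up for illustration and are not part of nhanes2.

```r
# Hypothetical toy data: sex and smoker are stored as character, not factor
df <- data.frame(
  sex    = c("F", "M", NA, "F"),
  smoker = c("no", NA, "yes", "no"),
  bmi    = c(22.7, NA, 27.2, 30.1),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor;
# numeric columns are left untouched and NA values stay NA
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# Check the result: sex and smoker should now be factors
sapply(df, class)
```

Converting all character columns in one step is only a convenience; when a variable is ordered, it may be better to convert it by hand with `factor(..., ordered = TRUE)`.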
Let us check the missing pattern:
md.pattern(nhanes2)
## age hyp bmi chl
## 13 1 1 1 1 0
## 3 1 1 1 0 1
## 1 1 1 0 1 1
## 1 1 0 0 1 2
## 7 1 0 0 0 3
## 0 8 9 10 27
In the figure and table above, a red square (0 in the table) stands for a missing value and a blue square (1 in the table) stands for a non-missing value.
First, let us read the table row by row. There are 13 rows (observations, subjects, or data points) with no missing values. 3 rows have a missing value in chl only, 1 row has a missing value in bmi only, 1 row has missing values in hyp and bmi, and 7 rows have missing values in hyp, bmi and chl.
Next, let us read the table column by column. There are 10 subjects with a missing value in chl: 7 who are missing hyp, bmi and chl, and 3 who are missing chl only (7+3). There are 9 subjects with a missing value in bmi (1+1+7) and 8 subjects with a missing value in hyp (1+7).
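As a cross-check on the counts read off the md.pattern table, the missing values can also be counted directly with base R functions. This is a minimal sketch that assumes the mice package is installed, since it supplies nhanes2.

```r
library(mice)  # provides the nhanes2 data set

# Number of missing values in each variable (column by column)
colSums(is.na(nhanes2))

# Number of rows with 0, 1, 2 or 3 missing values (row by row)
table(rowSums(is.na(nhanes2)))
```

The column counts should agree with the bottom row of the md.pattern table (0 for age, 8 for hyp, 9 for bmi and 10 for chl).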
Usually, missing data patterns can be categorized as monotone, non-monotone, and some other patterns. “A missing data pattern is said to be monotone if the variables \(Y_j\) can be ordered such that if \(Y_j\) is missing then all variables \(Y_k\) with \(k>j\) are also missing. This occurs, for example, in longitudinal studies with drop-out. If the pattern is not monotone, it is called non-monotone or general.”\(^1\)
The following missing data pattern is monotone. When missing data are monotone, a monotone multiple imputation can be performed.
id V1 V2 V3 V4
1 2 5 9 3
2 3 1 2 .
3 2 6 5 .
4 1 4 . .
5 3 . . .
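The definition of a monotone pattern can be checked in code. The sketch below rebuilds the small table above with "." recoded as NA; `is_monotone` is a hypothetical helper written for this note, not a function from the mice package.

```r
# The toy data shown above, with "." recoded as NA
toy <- data.frame(
  V1 = c(2, 3, 2, 1, 3),
  V2 = c(5, 1, 6, 4, NA),
  V3 = c(9, 2, 5, NA, NA),
  V4 = c(3, NA, NA, NA, NA)
)

# Hypothetical helper: TRUE if the columns can be ordered so that,
# in every row, a missing value is followed only by missing values
is_monotone <- function(df) {
  miss <- is.na(df)
  # Order the columns from fewest to most missing values
  miss <- miss[, order(colSums(miss)), drop = FALSE]
  # Within each row, missingness must never switch from TRUE back to FALSE
  all(apply(miss, 1, function(r) all(diff(r) >= 0)))
}

is_monotone(toy)  # TRUE for this pattern
```

Sorting the columns by their number of missing values is enough here, because in a monotone pattern the sets of incomplete rows are nested, so any valid ordering must run from the least to the most incomplete variable.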
The mice function is usually called as follows:
mice(dataset, method = , predictorMatrix = , m = , maxit = , seed = 1234, printFlag = FALSE)
dataset: a data frame or a matrix containing the incomplete data. Missing values are coded as NA. This is the data set we want to impute.
method: the method used to impute the missing values. Different types of variables call for different imputation methods; some common choices are listed below. For other imputation methods, see the mice documentation on RDocumentation.\(^2\)
| Method name | Imputation method | Variable type |
|---|---|---|
| norm | Bayesian linear regression | numeric |
| logreg | Logistic regression | binary (factor, 2 levels) |
| polyreg | Polytomous logistic regression | unordered factor, > 2 levels |
| polr | Proportional odds model | ordered factor, > 2 levels |
predictorMatrix: a matrix of 0s and 1s specifying which variables (columns) are used to predict each target variable (rows).
m: the number of imputed data sets (default 5).
maxit: a scalar giving the number of iterations (default 5).
seed: an integer used to set the random seed, so that the imputations are reproducible.
printFlag = FALSE: by default, the mice function prints the iteration history to the console under the columns “iter”, “imp” and “variable”. Setting printFlag = FALSE turns this output off, so that the imputation runs silently.
Before we conduct the imputation, we first run mice with maxit = 0 to obtain the default method vector and predictor matrix:
library(mice)
init = mice(nhanes2, maxit=0)
meth = init$method
predM = init$predictorMatrix
Next, we tell mice which method to use for each variable, for example:
meth[c("bmi")]="norm"
meth[c("hyp")]="logreg"
meth[c("chl")]="norm"
meth[c("age")]=""
# We don't impute age since it has no missing values; however, it will be used for prediction
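To show how the method vector and predictor matrix feed into the mice call described earlier, here is a minimal end-to-end sketch. The values m = 5 and maxit = 5 are simply the package defaults, chosen here for illustration, and the names `imp` and `completed` are my own.

```r
library(mice)

# Start from the defaults that mice proposes for nhanes2
init  <- mice(nhanes2, maxit = 0)
meth  <- init$method
predM <- init$predictorMatrix

# Same method choices as in the text; age is complete, so it is not imputed
meth[c("bmi", "chl")] <- "norm"
meth["hyp"] <- "logreg"
meth["age"] <- ""

# Assumption: m = 5 and maxit = 5 are illustrative values (the package defaults)
imp <- mice(nhanes2, method = meth, predictorMatrix = predM,
            m = 5, maxit = 5, seed = 1234, printFlag = FALSE)

# Extract the first of the m completed data sets and check for remaining NAs
completed <- complete(imp, 1)
anyNA(completed)
```

Here `complete(imp, 1)` returns only one of the imputed data sets; an analysis would normally be run on all of them and the results pooled.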
To be continued…
1. https://stefvanbuuren.name/fimd/missing-data-pattern.html
2. https://www.rdocumentation.org/packages/mice/versions/3.16.0/topics/mice