MICE stands for Multivariate Imputation by Chained Equations.
MICE operates under the assumption that given the variables used in the imputation procedure, the missing data are Missing At Random (MAR), which means that the probability that a value is missing depends only on observed values and not on unobserved values (Schafer and Graham, 2002). Implementing MICE when data are not MAR could result in biased estimates.
The MICE process for missing data imputation:
Step 1: Initial Placeholder Imputation
Replace missing values with simple imputations, such as using the mean,
for every variable with missing values. Think of these imputations as
temporary placeholders.
Step 2: Prepare a Variable
Choose one variable (let’s call it “var”) that originally had missing
values.
Set the placeholder (mean) imputations of “var” back to missing. This
prepares “var” for imputation.
Step 3: Regression Imputation
Use the observed values of “var” (from Step 2) as the dependent variable
in a regression model.
Other variables (not necessarily all in the dataset) act as independent
variables in this regression model.
The regression model is built based on the same assumptions as regular
linear, logistic, or Poisson regression models.
The purpose is to predict “var” using the relationships observed in the
data.
Step 4: Replace Missing Values
Replace the missing values of “var” with predictions (imputations)
obtained from the regression model.
These imputed values, along with the observed values, will be used when
“var” is involved in regression models for other variables.
Step 5: Cycle Through Variables
Repeat Steps 2 to 4 for each variable with missing values. Each
iteration through all variables constitutes one cycle. By the end of
each cycle, all missing values have been replaced with imputations based
on regression predictions.
Step 6: Repeated Iterations
Repeat the whole cycle (Steps 2 to 4 for every variable with missing values) for a number of cycles. At each cycle, the imputations are updated based on the latest regression predictions.
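The six steps can be sketched in base R. This is a simplified illustration, not how the mice package is implemented: the toy variables `x` and `y`, the number of cycles, and the use of plain `lm()` predictions are all assumptions made for the example. Real MICE implementations add random variation to the imputations and repeat the whole procedure to produce several data sets.

```r
# Toy data (hypothetical): two correlated numeric variables with values removed
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)
dat$x[sample(n, 20)] <- NA
dat$y[sample(n, 20)] <- NA

miss <- is.na(dat)  # remember which cells were originally missing
imp <- dat

# Step 1: placeholder imputation with the observed mean of each variable
for (v in names(imp)) {
  imp[[v]][miss[, v]] <- mean(dat[[v]], na.rm = TRUE)
}

# Steps 2-6: cycle through the variables for a fixed number of cycles
n_cycles <- 10
for (cycle in seq_len(n_cycles)) {
  for (v in names(imp)) {
    if (!any(miss[, v])) next
    # Steps 2-3: regress v on the other variables, using only the rows where
    # v was observed; the predictors hold observed or currently imputed values
    fit <- lm(reformulate(setdiff(names(imp), v), response = v),
              data = imp[!miss[, v], ])
    # Step 4: overwrite the previous imputations with the model's predictions
    imp[[v]][miss[, v]] <- predict(fit, newdata = imp[miss[, v], ])
  }
}

anyNA(imp)  # no missing values remain after imputation
```

Observed values are never modified; only the cells flagged in `miss` are updated at each cycle.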
Example:
To make the chained equation approach more concrete, imagine a simple
example where we have three variables in our dataset: age, income, and
gender, and all three have at least some missing values.
The MAR assumption would imply that the probability of a particular
variable being missing depends only on the observed values, and that,
for example, whether someone’s income is missing does not depend on
their (unobserved) income.
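To see what a MAR mechanism looks like in data, here is a small simulation. All numbers (sample size, coefficients, the age cut-off of 45) are made up for illustration. The probability that income is missing is driven only by age, which is fully observed, and never by income itself.

```r
set.seed(1)
n <- 1000
age <- round(runif(n, 20, 70))
income <- 20000 + 800 * age + rnorm(n, sd = 5000)

# MAR: the chance that income is missing depends only on the observed age
p_missing <- plogis(-4 + 0.06 * age)
income_mar <- ifelse(runif(n) < p_missing, NA, income)

# By construction, older respondents have a higher share of missing incomes
miss_rate <- tapply(is.na(income_mar), age >= 45, mean)
miss_rate
```

If the missingness probability had been computed from `income` instead of `age`, the data would no longer be MAR given age alone.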
In Step 1 of the MICE process, each variable would first be imputed
using, e.g. mean imputation, temporarily setting any missing value equal
to the mean observed value for that variable.
Then in Step 2 the imputed mean values of age would be set back to
missing.
In Step 3, a linear regression of age predicted by income and gender
would be run using all cases where age was observed.
In Step 4, predictions of the missing age values would be obtained from
that regression equation and imputed. At this point, age does not have
any missingness. Steps 2–4 would then be repeated for the income
variable. The originally missing values of income would be set back to
missing and a linear regression of income predicted by age and gender
would be run using all cases with income observed; imputations
(predictions) would be obtained from that regression equation for the
missing income values. Then, Steps 2–4 would again be repeated for the
variable gender. The originally missing values of gender would be set
back to missing and a logistic regression of gender on age and income
would be run using all cases with gender observed; predictions from that
logistic regression model would be used to impute the missing gender
values.
This entire process of iterating through the three variables would be
repeated until convergence; the observed data and the final set of
imputed values would then constitute one “complete” data set.
In the simplest description, linear regression is used to impute
continuous variables and logistic regression is used for categorical
ones. Running the whole procedure several times generates multiple data
sets that differ only in their imputed values. It is generally
considered good practice to fit the model on each of these data sets
separately and then combine the results.
More precisely, the default methods used by the mice package are:
- pmm (Predictive Mean Matching) – for numeric variables
- logreg (Logistic Regression) – for binary variables (2 levels)
- polyreg (Polytomous Logistic Regression) – for unordered factor variables (> 2 levels)
- polr (Proportional Odds Model) – for ordered factor variables (> 2 levels)
# install.packages("mice")
require(mice)
data(iris)
iris_dup <- iris
set.seed(10)
iris_dup$Sepal.Length[sample(1:150, 3)] <- NA
iris_dup$Petal.Width[sample(1:150, 6)] <- NA
iris_dup$Petal.Length[sample(1:150, 2)] <- NA
iris_dup$Species[sample(1:150, 7)] <- NA
md.pattern(iris_dup, rotate.names = TRUE, plot = FALSE)
##     Sepal.Width Petal.Length Sepal.Length Petal.Width Species
## 133           1            1            1           1       1  0
## 7             1            1            1           1       0  1
## 5             1            1            1           0       1  1
## 2             1            1            0           1       1  1
## 1             1            1            0           0       1  2
## 2             1            0            1           1       1  1
##               0            2            3           6       7 18
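Meaning: each row is a missingness pattern (1 = observed, 0 = missing). The first number in a row is how many cases show that pattern, and the last number is how many variables are missing in it. Here, 133 observations are complete, 7 cases are missing only Species, 5 cases are missing only Petal.Width, 1 case is missing both Sepal.Length and Petal.Width, and so on. The bottom row gives the number of missing values per variable (18 in total).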
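The same counts can be cross-checked with base R, without mice. The sketch below rebuilds `iris_dup` so that it runs on its own. The per-variable counts are fixed by the number of values removed, while the number of complete cases depends on which rows the random sampling happens to pick.

```r
iris_dup <- iris
set.seed(10)
iris_dup$Sepal.Length[sample(1:150, 3)] <- NA
iris_dup$Petal.Width[sample(1:150, 6)] <- NA
iris_dup$Petal.Length[sample(1:150, 2)] <- NA
iris_dup$Species[sample(1:150, 7)] <- NA

# Missing values per variable (the bottom row of the md.pattern() table)
na_per_column <- colSums(is.na(iris_dup))
na_per_column

# Number of rows with no missing value at all (the first row of the table)
n_complete <- sum(complete.cases(iris_dup))
n_complete
```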
Imputing missing data using the mice() function:
imputed_iris <- mice(iris_dup, m = 5, maxit = 50, seed = 500, printFlag = FALSE)
summary(imputed_iris)
## Class: mids
## Number of multiple imputations:  5
## Imputation methods:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##        "pmm"           ""        "pmm"        "pmm"    "polyreg"
## PredictorMatrix:
##              Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Sepal.Length            0           1            1           1       1
## Sepal.Width             1           0            1           1       1
## Petal.Length            1           1            0           1       1
## Petal.Width             1           1            1           0       1
## Species                 1           1            1           1       0
Arguments:
- m: Number of multiple imputations. The default is m=5.
- maxit: A scalar giving the number of iterations. The default is
5.
- seed: An integer that is used as argument by the set.seed() for
offsetting the random number generator. Default is to leave the random
number generator alone.
- method: “pmm” for Predictive mean matching, “logreg” for Logistic
regression, “polyreg” for Polytomous logistic regression, “polr” for
Proportional odds model.
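As a hedged illustration of the method argument, the sketch below starts from the defaults that mice picks for each column and overrides one of them. Swapping Sepal.Length to "norm" (Bayesian linear regression) is an arbitrary choice for the example, not a recommendation, and the data set is rebuilt in reduced form so that the block runs on its own.

```r
library(mice)

iris_dup <- iris
set.seed(10)
iris_dup$Sepal.Length[sample(1:150, 3)] <- NA
iris_dup$Species[sample(1:150, 7)] <- NA

# Default method per column; complete columns get the empty string ""
meth <- make.method(iris_dup)

# Override one column: Bayesian linear regression instead of pmm
meth["Sepal.Length"] <- "norm"

imp <- mice(iris_dup, m = 5, maxit = 5, method = meth,
            seed = 500, printFlag = FALSE)
imp$method
```

The method vector has one entry per column, so each variable can be imputed with a different model.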
Since there are 5 imputed data sets, you can select any of them using the complete() function.
iris_com <- complete(imputed_iris, 1) # obtain the 1st imputed dataset (used below)
iris_all <- complete(imputed_iris, "all") # all imputed datasets as a list (mild object)
iris_all_orig <- complete(imputed_iris, "all", include = TRUE) # same, but also include the original data
iris_sel <- complete(imputed_iris, c(0, 3, 5), mild = TRUE) # select original + 3 + 5, store as mild
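complete() can also stack every imputed data set into one long data frame, which is convenient for checking how much the imputations differ from one another. The sketch below rebuilds a reduced version of the example (fewer missing values, maxit = 5 to keep it quick) so that it runs on its own. Those settings are assumptions for the illustration.

```r
library(mice)

iris_dup <- iris
set.seed(10)
iris_dup$Petal.Width[sample(1:150, 6)] <- NA
iris_dup$Species[sample(1:150, 7)] <- NA

imp <- mice(iris_dup, m = 5, maxit = 5, seed = 500, printFlag = FALSE)

# Long format: the 5 completed data sets stacked on top of each other;
# .imp is the imputation number and .id identifies the original row
iris_long <- complete(imp, "long")
dim(iris_long)

# Imputed Petal.Width values for the originally missing rows,
# one column per imputation
was_missing <- is.na(iris_dup$Petal.Width)
sapply(complete(imp, "all"), function(d) d$Petal.Width[was_missing])
```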
Let’s compare the summary statistics of original and imputed variables:
data.frame(
  Petal_Length_Statistics = names(summary(iris$Petal.Length)),
  iris = summary(iris$Petal.Length) |> as.numeric() |> round(3),
  iris_imputed = summary(iris_com$Petal.Length) |> as.numeric() |> round(3)
)
##   Petal_Length_Statistics  iris iris_imputed
## 1                    Min. 1.000        1.000
## 2                 1st Qu. 1.600        1.600
## 3                  Median 4.350        4.350
## 4                    Mean 3.758        3.753
## 5                 3rd Qu. 5.100        5.100
## 6                    Max. 6.900        6.900
data.frame(
  Sepal_Length_Statistics = names(summary(iris$Sepal.Length)),
  iris = summary(iris$Sepal.Length) |> as.numeric() |> round(3),
  iris_imputed = summary(iris_com$Sepal.Length) |> as.numeric() |> round(3)
)
##   Sepal_Length_Statistics  iris iris_imputed
## 1                    Min. 4.300        4.300
## 2                 1st Qu. 5.100        5.100
## 3                  Median 5.800        5.800
## 4                    Mean 5.843        5.851
## 5                 3rd Qu. 6.400        6.400
## 6                    Max. 7.900        7.900
table(iris$Species)
##
##     setosa versicolor  virginica
##         50         50         50
table(iris_com$Species)
##
##     setosa versicolor  virginica
##         50         50         50
Fit the model on each of the 5 imputed datasets with with(), then combine the 5 sets of results into one consolidated output using the pool() command.
fit <- with(data = imputed_iris, expr = lm(Sepal.Width ~ Sepal.Length + Petal.Width))
pool(fit)
## Class: mipo    m = 5
##           term m   estimate        ubar            b           t dfcom       df
## 1  (Intercept) 5  1.9229495 0.101699772 7.701043e-04 0.102623897   147 143.3163
## 2 Sepal.Length 5  0.2893699 0.004296081 3.622191e-05 0.004339547   147 143.0720
## 3  Petal.Width 5 -0.4653243 0.005074095 4.035304e-05 0.005122519   147 143.2093
##           riv      lambda        fmi
## 1 0.009086797 0.009004970 0.02255090
## 2 0.010117660 0.010016319 0.02357106
## 3 0.009543306 0.009453092 0.02300281
summary(pool(fit))
##           term   estimate  std.error statistic       df      p.value
## 1  (Intercept)  1.9229495 0.32034965  6.002659 143.3163 1.516196e-08
## 2 Sepal.Length  0.2893699 0.06587524  4.392695 143.0720 2.164925e-05
## 3  Petal.Width -0.4653243 0.07157177 -6.501506 143.2093 1.235539e-09
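The ubar, b, and t columns of the pool() output are linked by Rubin's rules: ubar is the average within-imputation variance, b is the between-imputation variance, and the total variance is t = ubar + (1 + 1/m) * b. The pooled standard error is the square root of t. The following arithmetic check copies the printed values by hand, so it only reproduces the output up to rounding.

```r
m <- 5

# Values copied from the pool(fit) output: (Intercept), Sepal.Length, Petal.Width
ubar <- c(0.101699772, 0.004296081, 0.005074095)
b    <- c(7.701043e-04, 3.622191e-05, 4.035304e-05)

# Rubin's rules: total variance and pooled standard error
t_total   <- ubar + (1 + 1 / m) * b
std_error <- sqrt(t_total)

# Relative increase in variance due to missingness (the riv column)
riv <- (1 + 1 / m) * b / ubar

round(t_total, 9)
round(std_error, 8)
round(riv, 9)
```

The results agree with the t, std.error, and riv columns shown above. Because b is tiny compared with ubar here, the imputation adds very little uncertainty to these estimates.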
Model on the original dataset:
summary(lm(Sepal.Width ~ Sepal.Length + Petal.Width, iris))
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Petal.Width, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.99563 -0.24690 -0.00503  0.23354  1.01131
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.92632    0.32094   6.002 1.45e-08 ***
## Sepal.Length  0.28929    0.06605   4.380 2.24e-05 ***
## Petal.Width  -0.46641    0.07175  -6.501 1.17e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3841 on 147 degrees of freedom
## Multiple R-squared:  0.234,  Adjusted R-squared:  0.2236
## F-statistic: 22.46 on 2 and 147 DF,  p-value: 3.091e-09
Resources:
1. Multiple imputation by chained equations: what is it and how does it work?
2. https://github.com/amices/mice
3. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
The Amelia package is named after Amelia Earhart, the first female
aviator to fly solo across the Atlantic Ocean. She disappeared
mysteriously (went missing) while flying over the Pacific Ocean in 1937,
which is why a package built to solve missing value problems carries her
name.
Amelia II “multiply imputes” missing data in a single cross-section
(such as a survey), from a time series (like variables collected for
each year in a country), or from a time-series-cross-sectional data set
(such as collected by years for each of several countries).
It generalizes existing approaches by allowing for trends in time series
across observations within a cross-sectional unit, as well as priors
that allow experts to incorporate beliefs they have about the values of
missing cells in their data.
Unless the rate of missingness is exceptionally high, m=5 (the program
default) will usually be adequate.
Other methods of dealing with missing data, such as listwise deletion,
mean substitution, or single imputation, are in common circumstances
biased, inefficient, or both.
– To be continued –
# install.packages("Amelia")
require(Amelia)
Resources:
1. https://gking.harvard.edu/amelia
2. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/