Missing Values Imputation

mice

mice stands for Multivariate Imputation by Chained Equations.

MICE operates under the assumption that given the variables used in the imputation procedure, the missing data are Missing At Random (MAR), which means that the probability that a value is missing depends only on observed values and not on unobserved values (Schafer and Graham, 2002). Implementing MICE when data are not MAR could result in biased estimates.
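
To make the MAR assumption concrete, here is a minimal base R sketch on simulated data. The variables `age` and `income`, the sample size, and the missingness model are all invented for illustration; the only point is that the probability of `income` being missing is driven by the fully observed `age`, not by `income` itself.

```r
set.seed(1)
n <- 1000
age <- round(runif(n, 20, 70))                      # always observed
income <- 20000 + 800 * age + rnorm(n, sd = 5000)   # will receive missing values

# MAR mechanism (assumed for this sketch): the chance that income is missing
# depends only on age, which is observed for everyone
p_miss <- plogis(-3 + 0.05 * age)
income_mar <- ifelse(runif(n) < p_miss, NA, income)

# Share of missing income values within three age bands
miss <- is.na(income_mar)
rates <- tapply(miss, cut(age, breaks = c(19, 40, 55, 70)), mean)
rates

# Mean of the full simulated income versus the mean of the observed cases only
c(full = mean(income), complete_cases = mean(income_mar, na.rm = TRUE))
```

Because older respondents are more likely to have a missing income in this simulation, simply dropping the incomplete cases shifts the observed mean. An imputation model that conditions on `age` can, in principle, correct for that.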

The MICE process for missing data imputation:

Step 1: Initial Placeholder Imputation
Replace missing values with simple imputations, such as using the mean, for every variable with missing values. Think of these imputations as temporary placeholders.

Step 2: Prepare a Variable
Choose one variable (let’s call it “var”) that originally had missing values.
Set the mean imputations of “var” back to missing. This prepares “var” for imputation.

Step 3: Regression Imputation
Use the observed values of “var” (from Step 2) as the dependent variable in a regression model.
Other variables (not necessarily all in the dataset) act as independent variables in this regression model.
The regression model is built based on the same assumptions as regular linear, logistic, or Poisson regression models.
The purpose is to predict “var” using the relationships observed in the data.

Step 4: Replace Missing Values
Replace the missing values of “var” with predictions (imputations) obtained from the regression model.
These imputed values, along with the observed values, will be used when “var” is involved in regression models for other variables.

Step 5: Cycle Through Variables
Repeat Steps 2 to 4 for each variable with missing values. Each iteration through all variables constitutes one cycle. By the end of each cycle, all missing values have been replaced with imputations based on regression predictions.

Step 6: Repeated Iterations
Continue repeating Steps 2 to 4 for a number of cycles. At each cycle, the imputations are updated based on the latest regression predictions.
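
The six steps can be sketched in a few lines of base R. This is a simplified illustration, not how the mice package is implemented: it uses the built-in `airquality` data (an assumption made only to have real missing values available), plain linear regression for every variable, and deterministic predictions, whereas mice adds random variation to each imputation. The choice of 10 cycles is arbitrary.

```r
dat  <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
miss <- is.na(dat)                       # remember which cells were originally missing
vars <- names(dat)[colSums(miss) > 0]    # variables that need imputation

# Step 1: placeholder imputation with the observed mean of each variable
for (v in vars) {
  dat[miss[, v], v] <- mean(dat[[v]], na.rm = TRUE)
}

# Step 6: repeat the whole cycle a fixed number of times
for (cycle in 1:10) {
  # Step 5: visit every incomplete variable in turn
  for (v in vars) {
    obs <- !miss[, v]
    # Steps 2-3: regress v on the other variables, using only rows where v was observed
    form <- reformulate(setdiff(names(dat), v), response = v)
    fit  <- lm(form, data = dat[obs, ])
    # Step 4: overwrite the current imputations of v with the new predictions
    dat[!obs, v] <- predict(fit, newdata = dat[!obs, ])
  }
}

anyNA(dat)   # no missing values remain
```

Observed values are never changed; only the cells flagged in `miss` are updated on each pass.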

Example:
To make the chained equation approach more concrete, imagine a simple example where we have three variables in our dataset: age, income, and gender, and all three have at least some missing values.
The MAR assumption would imply that the probability of a particular variable being missing depends only on the observed values, and that, for example, whether someone’s income is missing does not depend on their (unobserved) income.
In Step 1 of the MICE process, each variable would first be imputed using, e.g. mean imputation, temporarily setting any missing value equal to the mean observed value for that variable.
Then in Step 2 the imputed mean values of age would be set back to missing.
In Step 3, a linear regression of age predicted by income and gender would be run using all cases where age was observed.
In Step 4, predictions of the missing age values would be obtained from that regression equation and imputed. At this point, age does not have any missingness. Steps 2–4 would then be repeated for the income variable. The originally missing values of income would be set back to missing and a linear regression of income predicted by age and gender would be run using all cases with income observed; imputations (predictions) would be obtained from that regression equation for the missing income values. Then, Steps 2–4 would again be repeated for the variable gender. The originally missing values of gender would be set back to missing and a logistic regression of gender on age and income would be run using all cases with gender observed; predictions from that logistic regression model would be used to impute the missing gender values.
This entire process of iterating through the three variables would be repeated until convergence; the observed data and the final set of imputed values would then constitute one “complete” data set.
In the mice package, the default imputation model depends on the type of variable: predictive mean matching is used for numeric variables, and logistic or polytomous regression is used for categorical variables. Running the whole procedure m times generates multiple data sets, which differ only in the imputed values. It is generally considered good practice to fit the model of interest on each of these data sets separately and then combine the results.

Precisely, the default methods used by this package are:
- pmm (Predictive Mean Matching) – for numeric variables
- logreg (Logistic Regression) – for binary variables (2 levels)
- polyreg (Bayesian polytomous regression) – for unordered factor variables (> 2 levels)
- polr (Proportional odds model) – for ordered factor variables (> 2 levels)
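
The idea behind predictive mean matching can be illustrated with a simplified base R sketch. It is not the package's implementation (mice also draws the regression parameters at random before matching), and the use of `airquality` and of 5 donors per missing value are assumptions made for this example.

```r
set.seed(42)
dat <- airquality[, c("Ozone", "Wind", "Temp")]
obs <- !is.na(dat$Ozone)

# Fit a regression on the rows where Ozone is observed
fit <- lm(Ozone ~ Wind + Temp, data = dat[obs, ])
pred_obs  <- predict(fit, newdata = dat[obs, ])    # predictions for observed rows
pred_miss <- predict(fit, newdata = dat[!obs, ])   # predictions for missing rows

k <- 5  # number of candidate donors per missing value (assumed)
imputed <- vapply(pred_miss, function(p) {
  donors <- order(abs(pred_obs - p))[1:k]          # k observed rows with the closest predictions
  as.numeric(dat$Ozone[obs][sample(donors, 1)])    # borrow the observed value of one random donor
}, numeric(1))

dat$Ozone[!obs] <- imputed
summary(dat$Ozone)
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of plausible, actually observed values.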

R code

# install.packages("mice")
library(mice)

data(iris)
iris_dup <- iris

set.seed(10)
iris_dup$Sepal.Length[sample(1:150, 3)] <- NA
iris_dup$Petal.Width[sample(1:150, 6)] <- NA
iris_dup$Petal.Length[sample(1:150, 2)] <- NA
iris_dup$Species[sample(1:150, 7)] <- NA
md.pattern(iris_dup, rotate.names = TRUE, plot = FALSE)

##     Sepal.Width Petal.Length Sepal.Length Petal.Width Species   
## 133           1            1            1           1       1  0
## 7             1            1            1           1       0  1
## 5             1            1            1           0       1  1
## 2             1            1            0           1       1  1
## 1             1            1            0           0       1  2
## 2             1            0            1           1       1  1
##               0            2            3           6       7 18

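Meaning: each row is a missingness pattern (1 = observed, 0 = missing). The left-most number is the count of cases with that pattern, the right-most number is how many variables are missing in that pattern, and the bottom row gives the number of missing values per variable. Here, there are 133 complete observations, 7 cases are missing only Species, 5 cases are missing only Petal.Width, 1 case is missing both Sepal.Length and Petal.Width, and so on.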
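
The margins of this table can be cross-checked with base R alone. The sketch below recreates `iris_dup` as above so that it runs on its own; the number of complete rows depends on which positions `sample()` happens to pick, so it is printed rather than assumed.

```r
iris_dup <- iris
set.seed(10)
iris_dup$Sepal.Length[sample(1:150, 3)] <- NA
iris_dup$Petal.Width[sample(1:150, 6)] <- NA
iris_dup$Petal.Length[sample(1:150, 2)] <- NA
iris_dup$Species[sample(1:150, 7)] <- NA

na_per_var <- colSums(is.na(iris_dup))   # bottom row of md.pattern()
na_per_var
sum(na_per_var)                          # total number of missing cells

sum(complete.cases(iris_dup))            # count shown in the all-ones row
table(rowSums(is.na(iris_dup)))          # rows missing 0, 1, 2, ... variables
```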

Imputing missing data using mice() function:

imputed_iris <- mice(iris_dup, m = 5, maxit = 50, seed = 500, printFlag = FALSE)
summary(imputed_iris)

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##        "pmm"           ""        "pmm"        "pmm"    "polyreg" 
## PredictorMatrix:
##              Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Sepal.Length            0           1            1           1       1
## Sepal.Width             1           0            1           1       1
## Petal.Length            1           1            0           1       1
## Petal.Width             1           1            1           0       1
## Species                 1           1            1           1       0

Arguments:
- m: Number of multiple imputations. The default is m=5.
- maxit: A scalar giving the number of iterations. The default is 5.
- seed: An integer that is used as argument by the set.seed() for offsetting the random number generator. Default is to leave the random number generator alone.
- method: “pmm” for Predictive mean matching, “logreg” for Logistic regression, “polyreg” for Polytomous logistic regression, “polr” for Proportional odds model.
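
As a hedged sketch of how these arguments fit together, the method vector and predictor matrix can be built with `make.method()` and `make.predictorMatrix()`, adjusted, and passed to `mice()`. Dropping Sepal.Width as a predictor of Species is an arbitrary choice made only to show the mechanics, and `maxit = 5` is used to keep the run short.

```r
library(mice)

# Recreate the incomplete data so the example runs on its own
iris_dup <- iris
set.seed(10)
iris_dup$Sepal.Length[sample(1:150, 3)] <- NA
iris_dup$Petal.Width[sample(1:150, 6)] <- NA
iris_dup$Petal.Length[sample(1:150, 2)] <- NA
iris_dup$Species[sample(1:150, 7)] <- NA

meth <- make.method(iris_dup)            # default method for each column
pred <- make.predictorMatrix(iris_dup)   # default: every variable predicts every other

# Rows are the variables being imputed, columns are their predictors.
# Illustrative change: do not use Sepal.Width when imputing Species.
pred["Species", "Sepal.Width"] <- 0

imp <- mice(iris_dup, m = 5, maxit = 5, method = meth,
            predictorMatrix = pred, seed = 500, printFlag = FALSE)
imp$method
imp$predictorMatrix
```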

Since there are 5 imputed data sets, you can select any of them using the complete() function.

iris_com <- complete(imputed_iris, 1)  # obtain the 1st imputed dataset

# Other ways to extract the data, stored under separate names so that iris_com is not overwritten
iris_all <- complete(imputed_iris, "all")  # all 5 imputed datasets as a list (mild object)
iris_all_orig <- complete(imputed_iris, "all", include = TRUE)  # same, but also include the original data
iris_sel <- complete(imputed_iris, c(0, 3, 5), mild = TRUE)  # select original + 3 + 5, store as mild

Let’s compare the summary statistics of original and imputed variables:

data.frame(
  Petal_Length_Statistics = names(summary(iris$Petal.Length)),
  iris = summary(iris$Petal.Length) |> as.numeric() |> round(3),
  iris_imputed = summary(iris_com$Petal.Length) |> as.numeric() |> round(3)
)

##   Petal_Length_Statistics  iris iris_imputed
## 1                    Min. 1.000        1.000
## 2                 1st Qu. 1.600        1.600
## 3                  Median 4.350        4.350
## 4                    Mean 3.758        3.753
## 5                 3rd Qu. 5.100        5.100
## 6                    Max. 6.900        6.900

data.frame(
  Sepal_Length_Statistics = names(summary(iris$Sepal.Length)),
  iris = summary(iris$Sepal.Length) |> as.numeric() |> round(3),
  iris_imputed = summary(iris_com$Sepal.Length) |> as.numeric() |> round(3)
)

##   Sepal_Length_Statistics  iris iris_imputed
## 1                    Min. 4.300        4.300
## 2                 1st Qu. 5.100        5.100
## 3                  Median 5.800        5.800
## 4                    Mean 5.843        5.851
## 5                 3rd Qu. 6.400        6.400
## 6                    Max. 7.900        7.900

table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

table(iris_com$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

Combine the results from the 5 models fitted on the 5 imputed datasets and obtain a consolidated output using the pool() function.

fit <- with(data = imputed_iris, expr = lm(Sepal.Width ~ Sepal.Length + Petal.Width))
pool(fit)

## Class: mipo    m = 5 
##           term m   estimate        ubar            b           t dfcom       df
## 1  (Intercept) 5  1.9229495 0.101699772 7.701043e-04 0.102623897   147 143.3163
## 2 Sepal.Length 5  0.2893699 0.004296081 3.622191e-05 0.004339547   147 143.0720
## 3  Petal.Width 5 -0.4653243 0.005074095 4.035304e-05 0.005122519   147 143.2093
##           riv      lambda        fmi
## 1 0.009086797 0.009004970 0.02255090
## 2 0.010117660 0.010016319 0.02357106
## 3 0.009543306 0.009453092 0.02300281

summary(pool(fit))

##           term   estimate  std.error statistic       df      p.value
## 1  (Intercept)  1.9229495 0.32034965  6.002659 143.3163 1.516196e-08
## 2 Sepal.Length  0.2893699 0.06587524  4.392695 143.0720 2.164925e-05
## 3  Petal.Width -0.4653243 0.07157177 -6.501506 143.2093 1.235539e-09
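
The columns of the pooled output are related through Rubin's rules. As a small check, the sketch below takes `ubar` and `b` for the intercept from the `pool(fit)` output above and recomputes the total variance, the standard error, `riv`, and `lambda` by hand; the variable names are chosen for this example only.

```r
m    <- 5
ubar <- 0.101699772    # within-imputation variance: mean of the 5 squared standard errors
b    <- 7.701043e-04   # between-imputation variance: variance of the 5 estimates

t_total <- ubar + (1 + 1 / m) * b      # total variance (column t)
se      <- sqrt(t_total)               # pooled standard error (column std.error)
riv     <- (1 + 1 / m) * b / ubar      # relative increase in variance due to missingness
lambda  <- (1 + 1 / m) * b / t_total   # proportion of total variance due to missingness

round(c(t = t_total, se = se, riv = riv, lambda = lambda), 6)
```

The small values of `riv` and `lambda` reflect that only a few cells were missing, so the imputations add very little uncertainty to the estimates.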

Model on the original dataset:

summary(lm(Sepal.Width ~ Sepal.Length + Petal.Width, iris))

## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99563 -0.24690 -0.00503  0.23354  1.01131 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.92632    0.32094   6.002 1.45e-08 ***
## Sepal.Length  0.28929    0.06605   4.380 2.24e-05 ***
## Petal.Width  -0.46641    0.07175  -6.501 1.17e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3841 on 147 degrees of freedom
## Multiple R-squared:  0.234,  Adjusted R-squared:  0.2236 
## F-statistic: 22.46 on 2 and 147 DF,  p-value: 3.091e-09

Resources:
1. Azur, Stuart, Frangakis, and Leaf (2011). Multiple imputation by chained equations: what is it and how does it work?
2. https://github.com/amices/mice
3. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

Amelia

This package is named after Amelia Earhart, the first female aviator to fly solo across the Atlantic Ocean. She mysteriously disappeared (went missing) while flying over the Pacific Ocean in 1937, hence the name of a package designed to solve missing value problems.
Amelia II “multiply imputes” missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries).
It generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data.
Unless the rate of missingness is exceptionally high, m=5 (the program default) will usually be adequate.
Other methods of dealing with missing data, such as listwise deletion, mean substitution, or single imputation, are in common circumstances biased, inefficient, or both.

– To be continued –

R code

# install.packages("Amelia")
library(Amelia)
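
Since this section is still to be continued, the following is only a minimal sketch of what a basic call might look like, not the author's worked example. It assumes the iris data with a few numeric values set to missing, marks Species as nominal through `noms`, and uses `p2s = 0` to suppress screen output.

```r
library(Amelia)

# Small incomplete data set, built only for this illustration
iris_dup <- iris
set.seed(10)
iris_dup$Sepal.Length[sample(1:150, 3)] <- NA
iris_dup$Petal.Width[sample(1:150, 6)] <- NA

# m = 5 imputations (the program default); Species is treated as a nominal variable
a_out <- amelia(iris_dup, m = 5, noms = "Species", p2s = 0)

length(a_out$imputations)      # one completed data set per imputation
head(a_out$imputations[[1]])   # first completed data set
```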

Resources:
1. https://gking.harvard.edu/amelia
2. https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

Md Ahsanul Islam
