Author: Elhakim Ibrahim
Instructor: Corey Sparks, PhD1
April 27, 2020
This exercise illustrates modal and multiple imputation approaches to addressing missing data issues in binary logistic regression analysis. Model estimates from model fit on the imputed data and conventional filtered data with no missing data are compared to assess the potential impact of data missingness on model fit.
The exercise is based on 2018 Nigeria Demographic and Health Survey women recode data set. The Nigeria Demographic and Health Surveys data are accessible with permission from DHS Program archive.
setwd("C:/A/01/02S20/D283/H/06")
suppressPackageStartupMessages({
library(haven)
library(survey)
library(foreign)
library(expss)
library(readr)
library(car)
library(dplyr)
library(mice)
library(MASS)
library(gtools)
library(lmtest)
library(sjPlot)
library(sjmisc)
library(forcats)
library(lattice)
library(reshape2)
library(stargazer)
library(questionr)
library(htmlTable)
library(sjlabelled)
library(summarytools)
})ir18 <- read.dta("C:/Research/ndhs-ir/ir18.dta")
mydata0 <- ir18 %>% dplyr::select(v007, v001, v005, v021, v022, v228, v362,
v012, v502, v201, v613, v605, v149, v190, v101, v102, v536, v312)
# write.dta(mydata0, 'mydata0.dta')mydata0 <- mydata0 %>% rename(surveyyear_ = v007, clusterno_ = v001, iweight = v005,
psu = v021, strata = v022, terminate_ = v228, age_ = v012, unionstatus_ = v502,
parity_ = v201, paritypref_ = v613, fertdesire_ = v605, education_ = v149,
wealth_ = v190, region_ = v101, location_ = v102, sexrecency_ = v536, fpmethod_ = v312)The outcome of interest is history of abortion among fecund, nonsterilized study population age 15 to 44 years. This will be specified as a binary outcome: Never had versus Ever had. The Never had category represents those who had never had a terminated pregancy within five years to the interview date, while Ever had category represents those who have had an abortion during the same period. For the purpose of analysis, age, union status, parity, parity preference, fertility desire, educational attainment, wealth status, as well as region and place of residence would be included as indenpendent correlates.
Terminated
# v228 / terminate / ever had terminated pregnancy freq(mydata0$terminate_,
# round.digits=1, cumul=T, report.nas=T)
mydata0$terminate <- factor(mydata0$terminate_, levels = c("no", "yes"), labels = c("Never had",
"Ever had"))
# freq(mydata0$terminate, round.digits=1, cumul=T, report.nas=T)Age group
# v012 / age / age in single year freq(mydata0$age_, round.digits=1,
# cumul=T, report.nas=T)
mydata0$agegroup <- Recode(mydata0$age_, recodes = "15:19=\"15-19\";20:24=\"20-24\";
25:29=\"25-29\";30:34=\"30-34\";35:39=\"35-39\";
40:44=\"40-44\";else=NA",
as.factor = T)
# freq(mydata0$agegroup, round.digits=1, cumul=T, report.nas=T)Sexual recency
# v536 / sexrecency / recent sexual activity freq(mydata0$sexrecency_,
# round.digits=1, cumul=T, report.nas=T)
mydata0$sexrecency <- Recode(mydata0$sexrecency_, recodes = "\"active in last 4 weeks\"=\"Recent\";
\"not active in last 4 weeks - postpartum abstinence\"=\"Not recent\";
\"not active in last 4 weeks - not postpartum abstinence\"=\"Not recent\";
else=NA",
as.factor = T)
mydata0$sexrecency <- fct_relevel(mydata0$sexrecency, "Recent")
# freq(mydata0$sexrecency, round.digits=1, cumul=T, report.nas=T)Fertility desire
# v605 /fertdesire / desire for more children freq(mydata0$fertdesire_,
# round.digits=1, cumul=T, report.nas=T)
mydata0$fertdesire <- Recode(mydata0$fertdesire_, recodes = "\"wants no more\"=\"Wants no more\";
\"wants after 2+ years\"=\"Wants more\";
\"wants within 2 years\"=\"Wants more\";
\"wants, unsure timing\"=\"Wants more\";
\"undecided\"=\"Undecided\";
else=NA",
as.factor = T)
mydata0$fertdesire <- fct_relevel(mydata0$fertdesire, "Wants more", "Undecided",
"Wants no more")
# freq(mydata0$fertdesire, round.digits=1, cumul=T, report.nas=T)Educational attainment
# v149 / education / educational attainment freq(mydata0$education_,
# round.digits=1, cumul=T, report.nas=T)
mydata0$education <- factor(mydata0$education_, levels = c("no education", "incomplete primary",
"complete primary", "incomplete secondary", "complete secondary", "higher"),
labels = c("No education", "Incomplete primary", "Complete primary", "Incomplete secondary",
"Complete secondary", "Higher"))
# freq(mydata0$education, round.digits=1, cumul=T, report.nas=T)Location of residence
# v102 / location / type of location of residence freq(mydata0$location_,
# round.digits=1, cumul=T, report.nas=T)
mydata0$location <- factor(mydata0$location_, levels = c("rural", "urban"),
labels = c("Rural", "Urban"))
# freq(mydata0$location, round.digits=1, cumul=T, report.nas=T)The summary results show the number of missing record (NA) for each variable: 3891 for age, 6749 for sex recency and 1187 for fertility desire.
mydata <- mydata0 %>% dplyr::select(terminate, agegroup, sexrecency, fertdesire,
education, location)
summary(mydata) terminate agegroup sexrecency fertdesire
Never had:36983 15-19:8423 Recent :23855 Wants more :28894
Ever had : 4838 20-24:6844 Not recent:11217 Undecided : 3053
25-29:7203 NA's : 6749 Wants no more: 8687
30-34:5997 NA's : 1187
35-39:5406
40-44:4057
NA's :3891
education location
No education :14398 Rural:24837
Incomplete primary : 1908 Urban:16984
Complete primary : 4475
Incomplete secondary: 7246
Complete secondary : 9452
Higher : 4342
By modal imputation, we simply replace the missing values with the most common value/modal value of each variable with missing information. We then compare the orginal data distribution to the imputed data distribution. This is done for age, sex recency and fertility desire as follows.
mcv.agegroup <- factor(names(which.max(table(mydata$agegroup))), levels = levels(mydata$agegroup))
mydata$agegroup.imp <- as.factor(ifelse(is.na(mydata$agegroup) == T, mcv.agegroup,
mydata$agegroup))
levels(mydata$agegroup.imp) <- levels(mydata$agegroup)
15-19 20-24 25-29 30-34 35-39 40-44
22.2 18.0 19.0 15.8 14.3 10.7
15-19 20-24 25-29 30-34 35-39 40-44
29.4 16.4 17.2 14.3 12.9 9.7
mcv.sexrecency <- factor(names(which.max(table(mydata$sexrecency))), levels = levels(mydata$sexrecency))
mydata$sexrecency.imp <- as.factor(ifelse(is.na(mydata$sexrecency) == T, mcv.sexrecency,
mydata$sexrecency))
levels(mydata$sexrecency.imp) <- levels(mydata$sexrecency)
Recent Not recent
68 32
Recent Not recent
73.2 26.8
mcv.fertdesire <- factor(names(which.max(table(mydata$fertdesire))), levels = levels(mydata$fertdesire))
mydata$fertdesire.imp <- as.factor(ifelse(is.na(mydata$fertdesire) == T, mcv.fertdesire,
mydata$fertdesire))
levels(mydata$fertdesire.imp) <- levels(mydata$fertdesire)
Wants more Undecided Wants no more
71.1 7.5 21.4
Wants more Undecided Wants no more
71.9 7.3 20.8
Multiple imputation can be achieved using a number of packages such as mice and Amelia. First, using md.pattern() function in mice package, we can pictorially explore the patterns of missingness in the data, with each row corresponding to a particular pattern of missingness. In the diagram, the blue color respresents completeness of data between variables while the red colors indicate missingness. In the accompanying crosstable, 1 corresponds to observed/nonmissing while 0 corresponds to unobserved/missing.
terminate education location agegroup.imp sexrecency.imp
30626 1 1 1 1 1
6618 1 1 1 1 1
3386 1 1 1 1 1
4 1 1 1 1 1
560 1 1 1 1 1
126 1 1 1 1 1
500 1 1 1 1 1
1 1 1 1 1 1
0 0 0 0 0
fertdesire.imp fertdesire agegroup sexrecency
30626 1 1 1 1 0
6618 1 1 1 0 1
3386 1 1 0 1 1
4 1 1 0 0 2
560 1 0 1 1 1
126 1 0 1 0 2
500 1 0 0 1 2
1 1 0 0 0 3
0 1187 3891 6749 11827
In summary:
In order to see how pairs of variables are missing together, we will use the md.pairs() function.
$rr
terminate agegroup sexrecency fertdesire education location
terminate 41821 37930 35072 40634 41821 41821
agegroup 37930 37930 31186 37244 37930 37930
sexrecency 35072 31186 35072 34012 35072 35072
fertdesire 40634 37244 34012 40634 40634 40634
education 41821 37930 35072 40634 41821 41821
location 41821 37930 35072 40634 41821 41821
$rm
terminate agegroup sexrecency fertdesire education location
terminate 0 3891 6749 1187 0 0
agegroup 0 0 6744 686 0 0
sexrecency 0 3886 0 1060 0 0
fertdesire 0 3390 6622 0 0 0
education 0 3891 6749 1187 0 0
location 0 3891 6749 1187 0 0
$mr
terminate agegroup sexrecency fertdesire education location
terminate 0 0 0 0 0 0
agegroup 3891 0 3886 3390 3891 3891
sexrecency 6749 6744 0 6622 6749 6749
fertdesire 1187 686 1060 0 1187 1187
education 0 0 0 0 0 0
location 0 0 0 0 0 0
$mm
terminate agegroup sexrecency fertdesire education location
terminate 0 0 0 0 0 0
agegroup 0 3891 5 501 0 0
sexrecency 0 5 6749 127 0 0
fertdesire 0 501 127 1187 0 0
education 0 0 0 0 0 0
location 0 0 0 0 0 0
As shown above, the md.pairs() function generates four patterns of missigness for a pair of variables in the data.
Pattern rr: Both variables are observed;Pattern rm: The first variable is observed and the second variable is missing;Pattern mr: The first variable is missing and the second variable is observed; andPattern mm: Both first and second variables are missing.Next, we we use the mice package to generate multiple imputed data.
multimp <- mice(data = mydata[, c("terminate", "agegroup", "sexrecency", "fertdesire",
"education", "location")], seed = 22, m = 5)
iter imp variable
1 1 agegroup sexrecency fertdesire
1 2 agegroup sexrecency fertdesire
1 3 agegroup sexrecency fertdesire
1 4 agegroup sexrecency fertdesire
1 5 agegroup sexrecency fertdesire
2 1 agegroup sexrecency fertdesire
2 2 agegroup sexrecency fertdesire
2 3 agegroup sexrecency fertdesire
2 4 agegroup sexrecency fertdesire
2 5 agegroup sexrecency fertdesire
3 1 agegroup sexrecency fertdesire
3 2 agegroup sexrecency fertdesire
3 3 agegroup sexrecency fertdesire
3 4 agegroup sexrecency fertdesire
3 5 agegroup sexrecency fertdesire
4 1 agegroup sexrecency fertdesire
4 2 agegroup sexrecency fertdesire
4 3 agegroup sexrecency fertdesire
4 4 agegroup sexrecency fertdesire
4 5 agegroup sexrecency fertdesire
5 1 agegroup sexrecency fertdesire
5 2 agegroup sexrecency fertdesire
5 3 agegroup sexrecency fertdesire
5 4 agegroup sexrecency fertdesire
5 5 agegroup sexrecency fertdesire
Below output shows that 5 multiple imputations were done. In addition, variables with missingness and the imputation method used per each variable are listed.
Class: mids
Number of multiple imputations: 5
Imputation methods:
terminate agegroup sexrecency fertdesire education location
"" "polyreg" "logreg" "polyreg" "" ""
PredictorMatrix:
terminate agegroup sexrecency fertdesire education location
terminate 0 1 1 1 1 1
agegroup 1 0 1 1 1 1
sexrecency 1 1 0 1 1 1
fertdesire 1 1 1 0 1 1
education 1 1 1 1 0 1
location 1 1 1 1 1 0
It is a good practice to check the plausibility of the imputed values by examining the distribution of observations by age group, sex recency and fertility desire. We proceed to run diagnosis on the imputed values for the first 5 cases across the 5 different imputations as well as the numeric and pictorial summaries of the imputed values.
1 2 3 4 5
15-19: 277 15-19: 269 15-19: 289 15-19: 281 15-19: 277
20-24: 272 20-24: 215 20-24: 248 20-24: 258 20-24: 244
25-29: 405 25-29: 416 25-29: 443 25-29: 409 25-29: 422
30-34: 639 30-34: 682 30-34: 646 30-34: 716 30-34: 674
35-39:1056 35-39:1048 35-39:1009 35-39:1030 35-39:1016
40-44:1242 40-44:1261 40-44:1256 40-44:1197 40-44:1258
1 2 3 4
Recent :3663 Recent :3587 Recent :3755 Recent :3545
Not recent:3086 Not recent:3162 Not recent:2994 Not recent:3204
5
Recent :3568
Not recent:3181
1 2 3
Wants more :706 Wants more :704 Wants more :718
Undecided : 93 Undecided :100 Undecided : 87
Wants no more:388 Wants no more:383 Wants no more:382
4 5
Wants more :712 Wants more :720
Undecided :120 Undecided :100
Wants no more:355 Wants no more:367
Generally, the plots show that the imputed data plausibly correspond with the observed data. The distribution of the imputed data (red circles) align well with the observed data (blue bars) across the levels of age group, sex recency and fertility desire in different imputation runs.
Before we proceed, let’s have a cursory look at our complete imputed data and compare with the observed data. We can achieve this using the complete() function, which by default extracts the first run of the imputed data set. To get complete data set from the third run, we have to modify with the action=3 option; i.e. complete(imp, action=3).
head(mydata[is.na(mydata$agegroup) == T, c("terminate", "agegroup", "sexrecency",
"fertdesire", "education", "location")], n = 10)The outputs highlight the patterns of imputation done to the original data for each of the first ten cases missing on agegroup and fertdesire variables using information from respondents with similar characteristics and complete data.
By this approach, we excluding all cases with any missing record from the original data, leaving only the observations with complete data for across all variables.
[1] 41821 6
[1] 30626 6
This shows that the imputed data (41821) is more complete than the filtered data (30626) with over 11000 more observations. This is the main purpose of data imputation exercise: to have more number of cases with complete information for analysis.
Below we fIt models on the imputed data and observed data which excludes observations with missing information. Next, we compare estimates from the models to assess whether missingness in any variable occur at random or otherwise.
impfit <- glm(terminate ~ agegroup + sexrecency + fertdesire + education + location,
data = impdata, family = "binomial")
filfit <- glm(terminate ~ agegroup + sexrecency + fertdesire + education + location,
data = fildata, family = "binomial")
stargazer(impfit, filfit, style = "demography", type = "text", font.size = "normalsize",
dep.var.caption = c("Data preparation approach"), dep.var.labels.include = F,
align = T, single.row = T, no.space = T, column.sep.width = "5pt", covariate.labels = c("20-24",
"25-29", "30-34", "35-39", "40-44", "Not recent", "Undecided", "Wants no more",
"Incomplete primary", "Complete primary", "Incomplete secondary", "Complete secondary",
"Higher", "Urban"), title = "Comparison of estimates from imputed and filtered data",
column.labels = c("Imputed data", "Filtered data"), column.separate = c(1,
1, 1), ci.level = 0.95, object.names = F, model.numbers = F, report = "vcs*",
notes.align = "l", notes.label = "Notes:", keep.stat = "n", multicolumn = T,
out = "table.htm")
Comparison of estimates from imputed and filtered data
--------------------------------------------------------
Imputed data Filtered data
--------------------------------------------------------
20-24 1.390 (0.086)*** 0.526 (0.091)***
25-29 1.800 (0.082)*** 0.844 (0.087)***
30-34 2.170 (0.082)*** 1.150 (0.088)***
35-39 2.370 (0.083)*** 1.360 (0.089)***
40-44 2.380 (0.088)*** 1.380 (0.095)***
Not recent -0.374 (0.036)*** -0.394 (0.040)***
Undecided -0.302 (0.065)*** -0.296 (0.071)***
Wants no more -0.392 (0.043)*** -0.379 (0.049)***
Incomplete primary 0.361 (0.071)*** 0.392 (0.078)***
Complete primary 0.078 (0.053) 0.082 (0.059)
Incomplete secondary 0.040 (0.054) 0.129 (0.058)*
Complete secondary -0.051 (0.045) -0.004 (0.049)
Higher -0.001 (0.055) 0.075 (0.060)
Urban 0.015 (0.035) 0.076 (0.038)*
Constant -3.640 (0.079)*** -2.710 (0.082)***
N 41,821 30,626
--------------------------------------------------------
Notes: *p < .05; **p < .01; ***p < .001
Results show similar patterns of association between abortion and corresponding variables in imputed data (specifically age, sex recency and fertility desire) and filtered data. This suggests plausibility of the imputation conducted.
This content was inspired by Dr. Corey Sparks’ lab session on “Multiple Imputation & Missing Data” and his other easy-to-follow instructional materials accessible at https://rpubs.com/corey_sparks.
Data from Demographic and Health Surveys archive can be requested from https://www.dhsprogram.com/.
Stargazer is authored by Marek Hlavac and full document can be downloaded from https://cran.r-project.org/web/packages/stargazer/stargazer.pdf