DEM7283: Statistics for Demographic Data II

Objectives of the exercise
Data source
Set working directory and load packages
Load data, select and rename working variables
Outcome of interest and variables specification
- Outcome of interest
- Variables specification
Imputation procedures
Models comparison
Acknowledgements

Author: Elhakim Ibrahim

Instructor: Corey Sparks, PhD¹

April 27, 2020

Objectives of the exercise

This exercise illustrates modal and multiple imputation approaches to addressing missing data issues in binary logistic regression analysis. Model estimates from model fit on the imputed data and conventional filtered data with no missing data are compared to assess the potential impact of data missingness on model fit.

Data source

The exercise is based on 2018 Nigeria Demographic and Health Survey women recode data set. The Nigeria Demographic and Health Surveys data are accessible with permission from DHS Program archive.

Set working directory and load packages

setwd("C:/A/01/02S20/D283/H/06")

suppressPackageStartupMessages({
    library(haven)
    library(survey)
    library(foreign)
    library(expss)
    library(readr)
    library(car)
    library(dplyr)
    library(mice)
    library(MASS)
    library(gtools)
    library(lmtest)
    library(sjPlot)
    library(sjmisc)
    library(forcats)
    library(lattice)
    library(reshape2)
    library(stargazer)
    library(questionr)
    library(htmlTable)
    library(sjlabelled)
    library(summarytools)
})

Load data, select and rename working variables

ir18 <- read.dta("C:/Research/ndhs-ir/ir18.dta")

mydata0 <- ir18 %>% dplyr::select(v007, v001, v005, v021, v022, v228, v362, 
    v012, v502, v201, v613, v605, v149, v190, v101, v102, v536, v312)

# write.dta(mydata0, 'mydata0.dta')

mydata0 <- read.dta("C:/A/01/02S20/D283/H/06/mydata0.dta")

mydata0 <- mydata0 %>% rename(surveyyear_ = v007, clusterno_ = v001, iweight = v005, 
    psu = v021, strata = v022, terminate_ = v228, age_ = v012, unionstatus_ = v502, 
    parity_ = v201, paritypref_ = v613, fertdesire_ = v605, education_ = v149, 
    wealth_ = v190, region_ = v101, location_ = v102, sexrecency_ = v536, fpmethod_ = v312)

Outcome of interest and variables specification

Outcome of interest

The outcome of interest is history of abortion among fecund, nonsterilized study population age 15 to 44 years. This will be specified as a binary outcome: Never had versus Ever had. The Never had category represents those who had never had a terminated pregancy within five years to the interview date, while Ever had category represents those who have had an abortion during the same period. For the purpose of analysis, age, union status, parity, parity preference, fertility desire, educational attainment, wealth status, as well as region and place of residence would be included as indenpendent correlates.

Variables specification

Terminated

# v228 / terminate / ever had terminated pregnancy freq(mydata0$terminate_,
# round.digits=1, cumul=T, report.nas=T)

mydata0$terminate <- factor(mydata0$terminate_, levels = c("no", "yes"), labels = c("Never had", 
    "Ever had"))

# freq(mydata0$terminate, round.digits=1, cumul=T, report.nas=T)

Age group

# v012 / age / age in single year freq(mydata0$age_, round.digits=1,
# cumul=T, report.nas=T)

mydata0$agegroup <- Recode(mydata0$age_, recodes = "15:19=\"15-19\";20:24=\"20-24\";
                       25:29=\"25-29\";30:34=\"30-34\";35:39=\"35-39\";
                       40:44=\"40-44\";else=NA", 
    as.factor = T)

# freq(mydata0$agegroup, round.digits=1, cumul=T, report.nas=T)

Sexual recency

# v536 / sexrecency / recent sexual activity freq(mydata0$sexrecency_,
# round.digits=1, cumul=T, report.nas=T)

mydata0$sexrecency <- Recode(mydata0$sexrecency_, recodes = "\"active in last 4 weeks\"=\"Recent\";
                       \"not active in last 4 weeks - postpartum abstinence\"=\"Not recent\";
                       \"not active in last 4 weeks - not postpartum abstinence\"=\"Not recent\";
                       else=NA", 
    as.factor = T)

mydata0$sexrecency <- fct_relevel(mydata0$sexrecency, "Recent")
# freq(mydata0$sexrecency, round.digits=1, cumul=T, report.nas=T)

Fertility desire

# v605 /fertdesire / desire for more children freq(mydata0$fertdesire_,
# round.digits=1, cumul=T, report.nas=T)

mydata0$fertdesire <- Recode(mydata0$fertdesire_, recodes = "\"wants no more\"=\"Wants no more\";
                      \"wants after 2+ years\"=\"Wants more\";
                      \"wants within 2 years\"=\"Wants more\";
                      \"wants, unsure timing\"=\"Wants more\";
                      \"undecided\"=\"Undecided\";
                      else=NA", 
    as.factor = T)

mydata0$fertdesire <- fct_relevel(mydata0$fertdesire, "Wants more", "Undecided", 
    "Wants no more")
# freq(mydata0$fertdesire, round.digits=1, cumul=T, report.nas=T)

Educational attainment

# v149 / education / educational attainment freq(mydata0$education_,
# round.digits=1, cumul=T, report.nas=T)

mydata0$education <- factor(mydata0$education_, levels = c("no education", "incomplete primary", 
    "complete primary", "incomplete secondary", "complete secondary", "higher"), 
    labels = c("No education", "Incomplete primary", "Complete primary", "Incomplete secondary", 
        "Complete secondary", "Higher"))

# freq(mydata0$education, round.digits=1, cumul=T, report.nas=T)

Location of residence

# v102 / location / type of location of residence freq(mydata0$location_,
# round.digits=1, cumul=T, report.nas=T)

mydata0$location <- factor(mydata0$location_, levels = c("rural", "urban"), 
    labels = c("Rural", "Urban"))

# freq(mydata0$location, round.digits=1, cumul=T, report.nas=T)

Imputation procedures

Check missingness in the data

The summary results show the number of missing record (NA) for each variable: 3891 for age, 6749 for sex recency and 1187 for fertility desire.

mydata <- mydata0 %>% dplyr::select(terminate, agegroup, sexrecency, fertdesire, 
    education, location)
summary(mydata)

     terminate      agegroup         sexrecency            fertdesire   
 Never had:36983   15-19:8423   Recent    :23855   Wants more   :28894  
 Ever had : 4838   20-24:6844   Not recent:11217   Undecided    : 3053  
                   25-29:7203   NA's      : 6749   Wants no more: 8687  
                   30-34:5997                      NA's         : 1187  
                   35-39:5406                                           
                   40-44:4057                                           
                   NA's :3891                                           
                education      location    
 No education        :14398   Rural:24837  
 Incomplete primary  : 1908   Urban:16984  
 Complete primary    : 4475                
 Incomplete secondary: 7246                
 Complete secondary  : 9452                
 Higher              : 4342

Modal imputation

By modal imputation, we simply replace the missing values with the most common value/modal value of each variable with missing information. We then compare the orginal data distribution to the imputed data distribution. This is done for age, sex recency and fertility desire as follows.

mcv.agegroup <- factor(names(which.max(table(mydata$agegroup))), levels = levels(mydata$agegroup))
mydata$agegroup.imp <- as.factor(ifelse(is.na(mydata$agegroup) == T, mcv.agegroup, 
    mydata$agegroup))
levels(mydata$agegroup.imp) <- levels(mydata$agegroup)

round(prop.table(table(mydata$agegroup)) * 100, digits = 3)


15-19 20-24 25-29 30-34 35-39 40-44 
 22.2  18.0  19.0  15.8  14.3  10.7

round(prop.table(table(mydata$agegroup.imp)) * 100, digits = 3)


15-19 20-24 25-29 30-34 35-39 40-44 
 29.4  16.4  17.2  14.3  12.9   9.7

mcv.sexrecency <- factor(names(which.max(table(mydata$sexrecency))), levels = levels(mydata$sexrecency))
mydata$sexrecency.imp <- as.factor(ifelse(is.na(mydata$sexrecency) == T, mcv.sexrecency, 
    mydata$sexrecency))
levels(mydata$sexrecency.imp) <- levels(mydata$sexrecency)

round(prop.table(table(mydata$sexrecency)) * 100, digits = 1)


    Recent Not recent 
        68         32

round(prop.table(table(mydata$sexrecency.imp)) * 100, digits = 3)


    Recent Not recent 
      73.2       26.8

mcv.fertdesire <- factor(names(which.max(table(mydata$fertdesire))), levels = levels(mydata$fertdesire))
mydata$fertdesire.imp <- as.factor(ifelse(is.na(mydata$fertdesire) == T, mcv.fertdesire, 
    mydata$fertdesire))
levels(mydata$fertdesire.imp) <- levels(mydata$fertdesire)

round(prop.table(table(mydata$fertdesire)) * 100, digits = 1)


   Wants more     Undecided Wants no more 
         71.1           7.5          21.4

round(prop.table(table(mydata$fertdesire.imp)) * 100, digits = 3)


   Wants more     Undecided Wants no more 
         71.9           7.3          20.8

Multiple imputation

Multiple imputation can be achieved using a number of packages such as mice and Amelia. First, using md.pattern() function in mice package, we can pictorially explore the patterns of missingness in the data, with each row corresponding to a particular pattern of missingness. In the diagram, the blue color respresents completeness of data between variables while the red colors indicate missingness. In the accompanying crosstable, 1 corresponds to observed/nonmissing while 0 corresponds to unobserved/missing.

md.pattern(mydata)

      terminate education location agegroup.imp sexrecency.imp
30626         1         1        1            1              1
6618          1         1        1            1              1
3386          1         1        1            1              1
4             1         1        1            1              1
560           1         1        1            1              1
126           1         1        1            1              1
500           1         1        1            1              1
1             1         1        1            1              1
              0         0        0            0              0
      fertdesire.imp fertdesire agegroup sexrecency      
30626              1          1        1          1     0
6618               1          1        1          0     1
3386               1          1        0          1     1
4                  1          1        0          0     2
560                1          0        1          1     1
126                1          0        1          0     2
500                1          0        0          1     2
1                  1          0        0          0     3
                   0       1187     3891       6749 11827

In summary:

The first row shows the number of cases in the data that are complete;
The rows with 0 attributed to a variable in the column shows the number of cases missing only the variable;
Rows with multiple 0s in the columns indicate that multiple variables are missing;
The bottom row indicates total number of cases missing each variable, in ANY combination with other variables.

In order to see how pairs of variables are missing together, we will use the md.pairs() function.

md.pairs(mydata[, c("terminate", "agegroup", "sexrecency", "fertdesire", "education", 
    "location")])

$rr
           terminate agegroup sexrecency fertdesire education location
terminate      41821    37930      35072      40634     41821    41821
agegroup       37930    37930      31186      37244     37930    37930
sexrecency     35072    31186      35072      34012     35072    35072
fertdesire     40634    37244      34012      40634     40634    40634
education      41821    37930      35072      40634     41821    41821
location       41821    37930      35072      40634     41821    41821

$rm
           terminate agegroup sexrecency fertdesire education location
terminate          0     3891       6749       1187         0        0
agegroup           0        0       6744        686         0        0
sexrecency         0     3886          0       1060         0        0
fertdesire         0     3390       6622          0         0        0
education          0     3891       6749       1187         0        0
location           0     3891       6749       1187         0        0

$mr
           terminate agegroup sexrecency fertdesire education location
terminate          0        0          0          0         0        0
agegroup        3891        0       3886       3390      3891     3891
sexrecency      6749     6744          0       6622      6749     6749
fertdesire      1187      686       1060          0      1187     1187
education          0        0          0          0         0        0
location           0        0          0          0         0        0

$mm
           terminate agegroup sexrecency fertdesire education location
terminate          0        0          0          0         0        0
agegroup           0     3891          5        501         0        0
sexrecency         0        5       6749        127         0        0
fertdesire         0      501        127       1187         0        0
education          0        0          0          0         0        0
location           0        0          0          0         0        0

As shown above, the md.pairs() function generates four patterns of missigness for a pair of variables in the data.

Pattern rr: Both variables are observed;
Pattern rm: The first variable is observed and the second variable is missing;
Pattern mr: The first variable is missing and the second variable is observed; and
Pattern mm: Both first and second variables are missing.

Next, we we use the mice package to generate multiple imputed data.

multimp <- mice(data = mydata[, c("terminate", "agegroup", "sexrecency", "fertdesire", 
    "education", "location")], seed = 22, m = 5)


 iter imp variable
  1   1  agegroup  sexrecency  fertdesire
  1   2  agegroup  sexrecency  fertdesire
  1   3  agegroup  sexrecency  fertdesire
  1   4  agegroup  sexrecency  fertdesire
  1   5  agegroup  sexrecency  fertdesire
  2   1  agegroup  sexrecency  fertdesire
  2   2  agegroup  sexrecency  fertdesire
  2   3  agegroup  sexrecency  fertdesire
  2   4  agegroup  sexrecency  fertdesire
  2   5  agegroup  sexrecency  fertdesire
  3   1  agegroup  sexrecency  fertdesire
  3   2  agegroup  sexrecency  fertdesire
  3   3  agegroup  sexrecency  fertdesire
  3   4  agegroup  sexrecency  fertdesire
  3   5  agegroup  sexrecency  fertdesire
  4   1  agegroup  sexrecency  fertdesire
  4   2  agegroup  sexrecency  fertdesire
  4   3  agegroup  sexrecency  fertdesire
  4   4  agegroup  sexrecency  fertdesire
  4   5  agegroup  sexrecency  fertdesire
  5   1  agegroup  sexrecency  fertdesire
  5   2  agegroup  sexrecency  fertdesire
  5   3  agegroup  sexrecency  fertdesire
  5   4  agegroup  sexrecency  fertdesire
  5   5  agegroup  sexrecency  fertdesire

Below output shows that 5 multiple imputations were done. In addition, variables with missingness and the imputation method used per each variable are listed.

print(multimp)

Class: mids
Number of multiple imputations:  5 
Imputation methods:
 terminate   agegroup sexrecency fertdesire  education   location 
        ""  "polyreg"   "logreg"  "polyreg"         ""         "" 
PredictorMatrix:
           terminate agegroup sexrecency fertdesire education location
terminate          0        1          1          1         1        1
agegroup           1        0          1          1         1        1
sexrecency         1        1          0          1         1        1
fertdesire         1        1          1          0         1        1
education          1        1          1          1         0        1
location           1        1          1          1         1        0

It is a good practice to check the plausibility of the imputed values by examining the distribution of observations by age group, sex recency and fertility desire. We proceed to run diagnosis on the imputed values for the first 5 cases across the 5 different imputations as well as the numeric and pictorial summaries of the imputed values.

head(multimp$imp$agegroup, 5)

summary(multimp$imp$agegroup)

     1            2            3            4            5       
 15-19: 277   15-19: 269   15-19: 289   15-19: 281   15-19: 277  
 20-24: 272   20-24: 215   20-24: 248   20-24: 258   20-24: 244  
 25-29: 405   25-29: 416   25-29: 443   25-29: 409   25-29: 422  
 30-34: 639   30-34: 682   30-34: 646   30-34: 716   30-34: 674  
 35-39:1056   35-39:1048   35-39:1009   35-39:1030   35-39:1016  
 40-44:1242   40-44:1261   40-44:1256   40-44:1197   40-44:1258

stripplot(multimp, agegroup ~ terminate | .imp)

head(multimp$imp$sexrecency, 5)

summary(multimp$imp$sexrecency)

          1                 2                 3                 4       
 Recent    :3663   Recent    :3587   Recent    :3755   Recent    :3545  
 Not recent:3086   Not recent:3162   Not recent:2994   Not recent:3204  
          5       
 Recent    :3568  
 Not recent:3181

stripplot(multimp, sexrecency ~ terminate | .imp)

head(multimp$imp$fertdesire, 5)

summary(multimp$imp$fertdesire)

             1                   2                   3      
 Wants more   :706   Wants more   :704   Wants more   :718  
 Undecided    : 93   Undecided    :100   Undecided    : 87  
 Wants no more:388   Wants no more:383   Wants no more:382  
             4                   5      
 Wants more   :712   Wants more   :720  
 Undecided    :120   Undecided    :100  
 Wants no more:355   Wants no more:367

stripplot(multimp, fertdesire ~ terminate | .imp)

Generally, the plots show that the imputed data plausibly correspond with the observed data. The distribution of the imputed data (red circles) align well with the observed data (blue bars) across the levels of age group, sex recency and fertility desire in different imputation runs.

Before we proceed, let’s have a cursory look at our complete imputed data and compare with the observed data. We can achieve this using the complete() function, which by default extracts the first run of the imputed data set. To get complete data set from the third run, we have to modify with the action=3 option; i.e. complete(imp, action=3).

impdata <- complete(multimp, action = 1)

head(impdata[is.na(mydata$agegroup) == T, ], n = 10)

head(mydata[is.na(mydata$agegroup) == T, c("terminate", "agegroup", "sexrecency", 
    "fertdesire", "education", "location")], n = 10)

The outputs highlight the patterns of imputation done to the original data for each of the first ten cases missing on agegroup and fertdesire variables using information from respondents with similar characteristics and complete data.

Filter approach

By this approach, we excluding all cases with any missing record from the original data, leaving only the observations with complete data for across all variables.

# filter out observations with incomplete information
fildata <- mydata %>% dplyr::select(terminate, agegroup, sexrecency, fertdesire, 
    education, location) %>% dplyr::filter(complete.cases(.))

Compare number of cases in imputed and filter data

dim(impdata)

[1] 41821     6

impdata = as.data.frame(impdata)

dim(fildata)

[1] 30626     6

fildata = as.data.frame(fildata)

This shows that the imputed data (41821) is more complete than the filtered data (30626) with over 11000 more observations. This is the main purpose of data imputation exercise: to have more number of cases with complete information for analysis.

Models comparison

Below we fIt models on the imputed data and observed data which excludes observations with missing information. Next, we compare estimates from the models to assess whether missingness in any variable occur at random or otherwise.

impfit <- glm(terminate ~ agegroup + sexrecency + fertdesire + education + location, 
    data = impdata, family = "binomial")

filfit <- glm(terminate ~ agegroup + sexrecency + fertdesire + education + location, 
    data = fildata, family = "binomial")

stargazer(impfit, filfit, style = "demography", type = "text", font.size = "normalsize", 
    dep.var.caption = c("Data preparation approach"), dep.var.labels.include = F, 
    align = T, single.row = T, no.space = T, column.sep.width = "5pt", covariate.labels = c("20-24", 
        "25-29", "30-34", "35-39", "40-44", "Not recent", "Undecided", "Wants no more", 
        "Incomplete primary", "Complete primary", "Incomplete secondary", "Complete secondary", 
        "Higher", "Urban"), title = "Comparison of estimates from imputed and filtered data", 
    column.labels = c("Imputed data", "Filtered data"), column.separate = c(1, 
        1, 1), ci.level = 0.95, object.names = F, model.numbers = F, report = "vcs*", 
    notes.align = "l", notes.label = "Notes:", keep.stat = "n", multicolumn = T, 
    out = "table.htm")


Comparison of estimates from imputed and filtered data
--------------------------------------------------------
                       Imputed data      Filtered data  
--------------------------------------------------------
20-24                1.390 (0.086)***  0.526 (0.091)*** 
25-29                1.800 (0.082)***  0.844 (0.087)*** 
30-34                2.170 (0.082)***  1.150 (0.088)*** 
35-39                2.370 (0.083)***  1.360 (0.089)*** 
40-44                2.380 (0.088)***  1.380 (0.095)*** 
Not recent           -0.374 (0.036)*** -0.394 (0.040)***
Undecided            -0.302 (0.065)*** -0.296 (0.071)***
Wants no more        -0.392 (0.043)*** -0.379 (0.049)***
Incomplete primary   0.361 (0.071)***  0.392 (0.078)*** 
Complete primary       0.078 (0.053)     0.082 (0.059)  
Incomplete secondary   0.040 (0.054)    0.129 (0.058)*  
Complete secondary    -0.051 (0.045)    -0.004 (0.049)  
Higher                -0.001 (0.055)     0.075 (0.060)  
Urban                  0.015 (0.035)    0.076 (0.038)*  
Constant             -3.640 (0.079)*** -2.710 (0.082)***
N                         41,821            30,626      
--------------------------------------------------------
Notes:               *p < .05; **p < .01; ***p < .001

Results show similar patterns of association between abortion and corresponding variables in imputed data (specifically age, sex recency and fertility desire) and filtered data. This suggests plausibility of the imputation conducted.

Acknowledgements

This content was inspired by Dr. Corey Sparks’ lab session on “Multiple Imputation & Missing Data” and his other easy-to-follow instructional materials accessible at https://rpubs.com/corey_sparks.
Data from Demographic and Health Surveys archive can be requested from https://www.dhsprogram.com/.
Stargazer is authored by Marek Hlavac and full document can be downloaded from https://cran.r-project.org/web/packages/stargazer/stargazer.pdf

##########################################################################################################