This example will illustrate typical aspects of dealing with missing data. Topics will include: Mean imputation, modal imputation for categorical data, and multiple imputation of complex patterns of missing data.
For this example I am using 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART county data. Link
Every time we ask a question on a survey, there are a variety of reasons why we may fail to get a complete response
This is a cause for concern for those using survey data
Total nonresponse
This is where a respondent completely refuses to participate in the survey, or some other barrier prevents them from participating (not at home, ill, disability)
This is usually accounted for by survey weights
Noncoverage
When an individual is not included in the survey’s sampling frame
Usually adjusted for in the weights
Item nonresponse
When a respondent fails to provide an acceptable response to a survey question
Could be refusal
Could be “I don’t know”
Could give an inconsistent response
The interviewer could fail to ask the question
Partial nonresponse
Somewhere between total nonresponse and item nonresponse
Respondent could end the interview
e.g. Incomplete responses for a wave in a panel study
Can be dealt with by using weights, which eliminates respondents with partial responses
Can be dealt with by imputation, where the value is filled in by data editors
Missing completely at random (MCAR)
Missing values do not depend on any characteristic of an individual
Missing at random (MAR)
Missing values do not relate to the item of interest itself, but it could be related to another characteristic of the respondent
These two are said to be “ignorable”
Missing not at random (MNAR)
Think of a question on satisfaction: those who are extremely dissatisfied may be less likely to respond, so the results may be biased because they don't include those folks
This is “non-ignorable”
One way to estimate if an element is MCAR or MAR is to form a missing data variable (1=missing, 0=nonmissing) and estimate a logistic regression for that variable using key characteristics of a respondent
If all these characteristics show insignificant effects, we can pretty much assume the element is MCAR/MAR
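To make that concrete, here is a minimal sketch of this check, assuming a hypothetical data frame dat with an item y and respondent characteristics age and educ (these names are illustrative, not part of this example's data):
#flag whether y is missing (1=missing, 0=nonmissing)
dat$y.miss<-as.numeric(is.na(dat$y))
#logistic regression of missingness on respondent characteristics
miss.mod<-glm(y.miss~age+educ, data=dat, family=binomial)
summary(miss.mod) #insignificant coefficients are consistent with MCAR/MAR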
Typically the computer will do one of two things
Delete the entire case if any variable in the equation is missing; this is called listwise deletion
Delete the case only for a particular comparison; this is called pairwise deletion
Both of these lead to fewer cases for entire models or for particular tests
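As a concrete sketch of the difference (using a hypothetical data frame dat with variables y, x1 and x2): model-fitting functions like lm() use listwise deletion by default, while cor() can be told to use pairwise deletion:
#listwise deletion: lm() drops any row with an NA in y, x1 or x2
fit<-lm(y~x1+x2, data=dat)
#pairwise deletion: each correlation uses all rows that are complete for that pair of variables
cor(dat[,c("y","x1","x2")], use="pairwise.complete.obs")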
Can we do anything about missing data? There are easy ways and hard ways, but the answer is yes.
Easy ways == bad ways
Mean substitution
Plugs in the average of the observed responses for every individual with a missing value
If a large % of the respondents are missing the value (say >20%), the mean itself may not be estimated very well
If everyone who is missing is given the mean, the variance of the variable will be artificially driven down
Can lead to smaller effect sizes/betas in the models using the data
Multiple imputation uses the other characteristics of individuals with complete observations to predict the missing values for individuals with incomplete observations
This would use the observed characteristics of the respondent to predict their missing value
Think regression!
Income is a good example of such a variable
People don't like to report their income, but they may be willing to report their education and other characteristics
We can use those to predict their income
This is sensitive to assumptions of the regression model being used!
This works like the regression imputation method, but applies it recursively to all patterns of missing data
Typically this will be done several times (5 is the rule most programs use), and the differences in the imputations are compared across runs
i.e. if we impute the values then run a regression for the outcome, how sensitive are the results of the model to the imputation
There are various ways to do this, and it depends on the scale of the variable
Find a set of variables that are associated with your outcome
If there are cases similar to yours (i.e. they are similar on the non-missing variables), you can plug in their variable of interest for the missing case
This is what the census did for years
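A minimal hot-deck style sketch (my own illustration, not part of the original example): within groups defined by observed characteristics, a missing value is replaced by the value of a randomly chosen donor from the same group. The function and variable names here are hypothetical:
#simple random hot deck: fill missing y's with a donor's value from the same group
impute.hotdeck<-function(y, group){
  for(g in unique(group)){
    idx<-which(group==g)
    miss<-idx[is.na(y[idx])]
    obs<-idx[!is.na(y[idx])]
    if(length(miss)>0 & length(obs)>0){
      #sample donor positions within the group, then take their y values
      y[miss]<-y[obs][sample.int(length(obs), size=length(miss), replace=TRUE)]
    }
  }
  y
}
#e.g. dat$inc.hd<-impute.hotdeck(dat$inc, group=interaction(dat$educ, dat$agec))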
Regress Y on the observed X's, using the cases where Y is not missing
Form fitted or “predicted” values
Replace the missing values with a fitted value from the regression from non-missing observations who are “close” to the missing cases
i.e. replace the missing values with fitted values from similar individuals
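A minimal regression-imputation sketch along these lines, assuming a continuous variable income and fully observed predictors educ and age in a hypothetical data frame dat:
#fit the regression using only the cases where income is observed
obs<-!is.na(dat$income)
reg.mod<-lm(income~educ+age, data=dat[obs,])
#predict income for the missing cases from their observed characteristics
dat$income.imp<-dat$income
dat$income.imp[!obs]<-predict(reg.mod, newdata=dat[!obs,])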
library(car)
## Loading required package: carData
library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
load(file = "~/Google Drive/classes/dem7283/class_19_7283//data/brfss16_mmsa.Rdata")
nams<-names(brfss16m)
head(nams, n=10)
## [1] "DISPCODE" "HHADULT" "GENHLTH" "PHYSHLTH" "MENTHLTH" "POORHLTH"
## [7] "HLTHPLN1" "PERSDOC2" "MEDCOST" "CHECKUP1"
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
set.seed(1234)
newnames<-tolower(gsub(pattern = "_",replacement = "",x = nams))
names(brfss16m)<-newnames
samp<-sample(1:dim(brfss16m)[1], size = 25000) #smaller sample for brevity
brfss16m<-brfss16m[samp,]
#Healthy days
brfss16m$healthdays<-Recode(brfss16m$physhlth, recodes = "88=0; 77=NA; 99=NA")
#Healthy mental health days
brfss16m$healthmdays<-Recode(brfss16m$menthlth, recodes = "88=0; 77=NA; 99=NA")
brfss16m$badhealth<-Recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-Recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-Recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-Recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-Recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")
brfss16m$race_eth<-Recode(brfss16m$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")
#insurance
brfss16m$ins<-Recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss16m$inc<-Recode(brfss16m$incomg, recodes = "9= NA;1='1_lt15k'; 2='2_15-25k';3='3_25-35k';4='4_35-50k';5='5_50kplus'", as.factor = T)
#brfss16m$inc<-as.ordered(brfss16m$inc)
#education level
brfss16m$educ<-Recode(brfss16m$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')
#employloyment
brfss16m$employ<-Recode(brfss16m$employ1,
recodes="1:2='employloyed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='employloyed')
#marital status
brfss16m$marst<-Recode(brfss16m$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')
#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=c(0,24,39,59,79,99))
#BMI, in the brfss16ma the bmi variable has 2 implied decimal places,
#so we must divide by 100 to get real bmi's
brfss16m$bmi<-brfss16m$bmi5/100
#smoking currently
brfss16m$smoke<-Recode(brfss16m$smoker3,
recodes="1:2=1; 3:4=0; else=NA")
#brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")
Now, we can get a general idea of the missingness of these variables by using summary() on the recoded variables:
summary(brfss16m[, c("ins", "smoke", "bmi", "badhealth", "race_eth", "educ", "employ", "marst", "inc")])
## ins smoke bmi badhealth
## Min. :0.0000 Min. :0.0000 Min. :13.15 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:23.73 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :26.78 Median :0.0000
## Mean :0.9303 Mean :0.1358 Mean :27.82 Mean :0.1718
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:30.79 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :81.53 Max. :1.0000
## NA's :110 NA's :1048 NA's :2122 NA's :70
## race_eth educ employ
## nhwhite :18331 2hsgrad : 6285 employloyed:12692
## hispanic : 2367 0Prim : 587 nilf : 3218
## nh black : 2471 1somehs : 1129 retired : 7262
## nh multirace: 368 3somecol: 6782 unable : 1622
## nh other : 990 4colgrad:10118 NA's : 206
## NA's : 473 NA's : 99
##
## marst inc
## married :12911 1_lt15k : 1846
## cohab : 788 2_15-25k : 3208
## divorced : 3259 3_25-35k : 2065
## nm : 4404 4_35-50k : 2771
## separated: 474 5_50kplus:10836
## widowed : 2967 NA's : 4274
## NA's : 197
Which shows that, among these recoded variables, inc, the income variable, has the most missing data: 4274 people in the BRFSS, or 17.096% of the sample, are missing income.
The lowest number of missings is in the bad health variable, which only has 0.28% missing.
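If you want the exact counts and percentages of missing values, rather than reading them off summary(), one quick way (a small sketch using the recoded variables above) is:
myvars<-c("ins", "smoke", "bmi", "badhealth", "race_eth", "educ", "employ", "marst", "inc")
#number and percent missing for each recoded variable
n.miss<-colSums(is.na(brfss16m[,myvars]))
data.frame(n.miss=n.miss, pct.miss=100*n.miss/nrow(brfss16m))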
Now, I'm going to illustrate mean imputation of a continuous variable, BMI.
#I'm going to play with 3 outcomes, bmi, having a regular doctor and income category
summary(brfss16m$bmi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.15 23.73 26.78 27.82 30.79 81.53 2122
#what happens when we replace the missings with the mean?
brfss16m$bmi.imp.mean<-ifelse(is.na(brfss16m$bmi)==T, mean(brfss16m$bmi, na.rm=T), brfss16m$bmi)
mean(brfss16m$bmi, na.rm=T)
## [1] 27.8213
mean(brfss16m$bmi.imp.mean) #no difference!
## [1] 27.8213
median(brfss16m$bmi, na.rm=T)
## [1] 26.78
median(brfss16m$bmi.imp.mean) #slight difference
## [1] 27.44
var(brfss16m$bmi, na.rm=T)
## [1] 36.46526
var(brfss16m$bmi.imp.mean) # more noticeable difference!
## [1] 33.36997
So what we see here is that imputing with the mean does nothing to central tendency when measured using the mean (though it does shift the median slightly), but it does reduce the variance of the outcome. This is because you're replacing all missing cases with the single most likely value (the mean), so you're artificially deflating the variance. That's not good.
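We can verify the size of this shrinkage directly: since every imputed case sits exactly at the observed mean, the sum of squared deviations is unchanged, so the variance of the imputed variable should just be the observed variance scaled by (n_obs - 1)/(n - 1). A quick check:
n<-length(brfss16m$bmi)
n.obs<-sum(!is.na(brfss16m$bmi))
#expected variance after mean imputation
var(brfss16m$bmi, na.rm=T)*(n.obs-1)/(n-1)
#compare to the variance of the mean-imputed variable computed above
var(brfss16m$bmi.imp.mean)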
We can see this in a histogram, where the imputed values increase the peak in the distribution:
#plot the histogram
library(reshape2)
brfss16m%>%
select(bmi.imp.mean, bmi)%>%
melt()%>%
ggplot()+geom_freqpoly(aes(x = value,
y = ..density.., colour = variable))
## No id variables; using all as measure variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2122 rows containing non-finite values (stat_bin).
If we have a categorical variable, an easy way to impute the values is to use modal imputation, or impute cases with the mode, or most common value. It doesn’t make sense to use the mean, because what would that mean for a categorical variable?
table(brfss16m$employ)
##
## employloyed nilf retired unable
## 12692 3218 7262 1622
#find the most common value
mcv.employ<-factor(names(which.max(table(brfss16m$employ))), levels=levels(brfss16m$employ))
mcv.employ
## [1] employloyed
## Levels: employloyed nilf retired unable
#impute the cases
brfss16m$employ.imp<-as.factor(ifelse(is.na(brfss16m$employ)==T, mcv.employ, brfss16m$employ))
levels(brfss16m$employ.imp)<-levels(brfss16m$employ)
prop.table(table(brfss16m$employ))
##
## employloyed nilf retired unable
## 0.51189804 0.12978947 0.29289344 0.06541905
prop.table(table(brfss16m$employ.imp))
##
## employloyed nilf retired unable
## 0.51592 0.12872 0.29048 0.06488
barplot(prop.table(table(brfss16m$employ)), main="Original Data", ylim=c(0, .6))
barplot(prop.table(table(brfss16m$employ.imp)), main="Imputed Data",ylim=c(0, .6))
Which doesn’t look like much of a difference because only 206 people were missing. Now let’s try modal imputation on income group:
table(brfss16m$inc)
##
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 1846 3208 2065 2771 10836
#find the most common value
mcv.inc<-factor(names(which.max(table(brfss16m$inc))), levels = levels(brfss16m$inc))
mcv.inc
## [1] 5_50kplus
## Levels: 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
#impute the cases
brfss16m$inc.imp<-as.factor(ifelse(is.na(brfss16m$inc)==T, mcv.inc, brfss16m$inc))
levels(brfss16m$inc.imp)<-levels(as.factor(brfss16m$inc))
prop.table(table(brfss16m$inc))
##
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.08906687 0.15478143 0.09963331 0.13369681 0.52282158
prop.table(table(brfss16m$inc.imp))
##
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.07384 0.12832 0.08260 0.11084 0.60440
barplot(prop.table(table(brfss16m$inc)), main="Original Data", ylim=c(0, .6))
barplot(prop.table(table(brfss16m$inc.imp)), main="Imputed Data", ylim=c(0, .6))
Which shows how dramatically we alter the distribution of the variable by imputing at the mode.
These days, these types of imputation have been far surpassed by more complete methods that are based upon regression methods. These methods are generally referred to as multiple imputation, because we are really interested in imputing multiple variables simultaneously. Instead of reviewing this perspective here, I suggest you have a look at Joe Schafer’s site that gives a nice treatment of the subject. Here, I will use the imputation techniques in the mice library in R, which you can read about here.
I have used these in practice in publications and generally like the framework the library uses. Another popular technique is in the Amelia library of Gary King, which I haven’t used much. If you are serious about doing multiple imputation it would be advised to investigate multiple methodologies.
To begin, I explore the various patterns of missingness in the data. The md.pattern function in mice does this nicely. Here, each row corresponds to a particular pattern of missingness (1 = observed, 0=missing)
#look at the patterns of missingness
md.pattern(brfss16m[,c("bmi", "inc", "agec","educ","race_eth")])
## agec educ race_eth bmi inc
## 19297 1 1 1 1 1 0
## 3182 1 1 1 1 0 1
## 1119 1 1 1 0 1 1
## 855 1 1 1 0 0 2
## 251 1 1 0 1 1 1
## 94 1 1 0 1 0 2
## 28 1 1 0 0 1 2
## 75 1 1 0 0 0 3
## 22 1 0 1 1 1 1
## 23 1 0 1 1 0 2
## 7 1 0 1 0 1 2
## 22 1 0 1 0 0 3
## 9 1 0 0 1 0 3
## 2 1 0 0 0 1 3
## 14 1 0 0 0 0 4
## 0 99 473 2122 4274 6968
The first row shows the number of observations in the data that are complete.
The second row shows the number of people who are missing only the inc variable.
Rows that have multiple 0’s in the columns indicate missing data patterns where multiple variables are missing.
The bottom row tells how many total people are missing each variable, in ANY combination with other variables.
If you want to see how pairs of variables are missing together, the md.pairs() function will show this.
A pair of variables can have exactly four missingness patterns: Both variables are observed (pattern rr), the first variable is observed and the second variable is missing (pattern rm), the first variable is missing and the second variable is observed (pattern mr), and both are missing (pattern mm).
md.pairs(brfss16m[,c("bmi", "inc", "agec","educ","race_eth")])
## $rr
## bmi inc agec educ race_eth
## bmi 22878 19570 22878 22824 22524
## inc 19570 20726 20726 20695 20445
## agec 22878 20726 25000 24901 24527
## educ 22824 20695 24901 24901 24453
## race_eth 22524 20445 24527 24453 24527
##
## $rm
## bmi inc agec educ race_eth
## bmi 0 3308 0 54 354
## inc 1156 0 0 31 281
## agec 2122 4274 0 99 473
## educ 2077 4206 0 0 448
## race_eth 2003 4082 0 74 0
##
## $mr
## bmi inc agec educ race_eth
## bmi 0 1156 2122 2077 2003
## inc 3308 0 4274 4206 4082
## agec 0 0 0 0 0
## educ 54 31 99 0 74
## race_eth 354 281 473 448 0
##
## $mm
## bmi inc agec educ race_eth
## bmi 2122 966 0 45 119
## inc 966 4274 0 68 192
## agec 0 0 0 0 0
## educ 45 68 0 99 25
## race_eth 119 192 0 25 473
We can perform a basic multiple imputation by simply doing the following. Note that this may take a very long time with big data sets:
imp<-mice(data = brfss16m[,c("bmi", "inc", "agec","educ","race_eth")], seed = 22, m = 5)
##
## iter imp variable
## 1 1 bmi inc educ race_eth
## 1 2 bmi inc educ race_eth
## 1 3 bmi inc educ race_eth
## 1 4 bmi inc educ race_eth
## 1 5 bmi inc educ race_eth
## 2 1 bmi inc educ race_eth
## 2 2 bmi inc educ race_eth
## 2 3 bmi inc educ race_eth
## 2 4 bmi inc educ race_eth
## 2 5 bmi inc educ race_eth
## 3 1 bmi inc educ race_eth
## 3 2 bmi inc educ race_eth
## 3 3 bmi inc educ race_eth
## 3 4 bmi inc educ race_eth
## 3 5 bmi inc educ race_eth
## 4 1 bmi inc educ race_eth
## 4 2 bmi inc educ race_eth
## 4 3 bmi inc educ race_eth
## 4 4 bmi inc educ race_eth
## 4 5 bmi inc educ race_eth
## 5 1 bmi inc educ race_eth
## 5 2 bmi inc educ race_eth
## 5 3 bmi inc educ race_eth
## 5 4 bmi inc educ race_eth
## 5 5 bmi inc educ race_eth
print(imp)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## bmi inc agec educ race_eth
## "pmm" "polyreg" "" "polyreg" "polyreg"
## PredictorMatrix:
## bmi inc agec educ race_eth
## bmi 0 1 1 1 1
## inc 1 0 1 1 1
## agec 1 1 0 1 1
## educ 1 1 1 0 1
## race_eth 1 1 1 1 0
Printing the imp object shows how many imputations were done, the total missingness, and which imputation method was used for each variable (because you wouldn't want to use a normal distribution for a categorical variable!).
It also shows the sequence in which each variable is visited (or imputed; the default is left to right).
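If the defaults are not what you want, mice() also lets you supply your own per-variable methods and predictor matrix. A sketch of the usual approach (not run here): do a dry run with maxit = 0 to grab the defaults, modify them, and re-run:
#dry run to get the default method vector and predictor matrix
ini<-mice(brfss16m[,c("bmi", "inc", "agec","educ","race_eth")], maxit = 0)
meth<-ini$method
pred<-ini$predictorMatrix
meth["bmi"]<-"norm"          #e.g. Bayesian linear regression instead of pmm
pred["inc", "race_eth"]<-0   #e.g. don't use race_eth when imputing inc
imp2<-mice(brfss16m[,c("bmi", "inc", "agec","educ","race_eth")],
           method = meth, predictorMatrix = pred, m = 5, seed = 22, printFlag = FALSE)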
We may want to make sure the imputed values are plausible by having a look. For instance, are the imputed BMI values outside the range of the observed data?
head(imp$imp$bmi)
summary(imp$imp$bmi)
## 1 2 3 4
## Min. :14.63 Min. :13.19 Min. :13.73 Min. :15.00
## 1st Qu.:23.49 1st Qu.:23.89 1st Qu.:23.74 1st Qu.:23.18
## Median :27.12 Median :27.18 Median :27.34 Median :26.89
## Mean :28.21 Mean :28.13 Mean :28.11 Mean :27.94
## 3rd Qu.:31.58 3rd Qu.:31.02 3rd Qu.:31.53 3rd Qu.:30.56
## Max. :67.39 Max. :74.88 Max. :57.88 Max. :73.39
## 5
## Min. :13.19
## 1st Qu.:23.71
## Median :27.06
## Mean :28.07
## 3rd Qu.:31.28
## Max. :69.06
summary(brfss16m$bmi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.15 23.73 26.78 27.82 30.79 81.53 2122
head(imp$imp$inc)
summary(imp$imp$inc)
## 1 2 3 4
## 1_lt15k : 463 1_lt15k : 481 1_lt15k : 477 1_lt15k : 497
## 2_15-25k : 827 2_15-25k : 811 2_15-25k : 836 2_15-25k : 780
## 3_25-35k : 516 3_25-35k : 486 3_25-35k : 489 3_25-35k : 483
## 4_35-50k : 556 4_35-50k : 566 4_35-50k : 576 4_35-50k : 604
## 5_50kplus:1912 5_50kplus:1930 5_50kplus:1896 5_50kplus:1910
## 5
## 1_lt15k : 491
## 2_15-25k : 806
## 3_25-35k : 499
## 4_35-50k : 579
## 5_50kplus:1899
Which shows the imputed values for the first 6 cases across the 5 different imputations, as well as the numeric summary of the imputed values. We can see that there is variation across the imputations, because the imputed values are not the same.
We can also do some plotting. For instance if we want to see how the observed and imputed values of bmi look with respect to race, we can do:
library(lattice)
stripplot(imp,bmi~race_eth|.imp, pch=20)
and we see the distribution of the original data (blue dots) and the imputed data (red dots) across the levels of race, for each of the five different imputation runs (the number at the top shows which run, and the first panel is the original data).
This plot shows that the imputed bmi values correspond well with the observed data, so they are probably plausible values.
If we want to get our new, imputed data, we can use the complete() function, which by default extracts the first imputed data set. If we want a different one, we can do complete(imp, action=3) for example, to get the third imputed data set.
dat.imp<-complete(imp, action = 1)
head(dat.imp, n=10)
#Compare to the original data
head(brfss16m[,c("bmi", "inc", "agec","educ","race_eth")], n=10)
While the first few cases don’t show much missingness, we can coax some more interesting cases out and compare the original data to the imputed:
head(dat.imp[is.na(brfss16m$bmi)==T,], n=10)
head(brfss16m[is.na(brfss16m$bmi)==T,c("bmi", "inc", "agec","educ","race_eth")], n=10)
A key element of using imputed data is that the relationships we want to know about should be maintained after imputation, and presumably the relationships within each imputed data set will be the same. So if we used each of the (5 in this case) imputed data sets in a model, we should see similar results across the five different models.
Here I look at a linear model for bmi:
#Now, I will see the variability in the 5 different imputations for each outcome
fit.bmi<-with(data=imp ,expr=lm(bmi~inc+agec+educ+race_eth))
fit.bmi
## call :
## with.mids(data = imp, expr = lm(bmi ~ inc + agec + educ + race_eth))
##
## call1 :
## mice(data = brfss16m[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2122 4274 0 99 473
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 25.414966 0.232223 -0.071817
## inc4_35-50k inc5_50kplus agec(24,39]
## -0.008227 -0.449068 2.714127
## agec(39,59] agec(59,79] agec(79,99]
## 3.883869 3.407280 1.136255
## educ0Prim educ1somehs educ3somecol
## 0.741329 -0.378400 -0.093667
## educ4colgrad race_ethhispanic race_ethnh black
## -1.441387 0.355282 1.716517
## race_ethnh multirace race_ethnh other
## 0.876980 -1.021510
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 25.28461 0.32409 0.18358
## inc4_35-50k inc5_50kplus agec(24,39]
## 0.09249 -0.32688 2.75534
## agec(39,59] agec(59,79] agec(79,99]
## 3.80311 3.32086 1.13634
## educ0Prim educ1somehs educ3somecol
## 0.77934 -0.19762 -0.05626
## educ4colgrad race_ethhispanic race_ethnh black
## -1.32038 0.39924 1.66740
## race_ethnh multirace race_ethnh other
## 0.80897 -1.21945
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 25.29289 0.32772 0.06447
## inc4_35-50k inc5_50kplus agec(24,39]
## 0.09829 -0.36607 2.64244
## agec(39,59] agec(59,79] agec(79,99]
## 3.95123 3.38139 1.18906
## educ0Prim educ1somehs educ3somecol
## 0.75822 -0.24948 -0.06864
## educ4colgrad race_ethhispanic race_ethnh black
## -1.37430 0.29361 1.65049
## race_ethnh multirace race_ethnh other
## 0.72844 -1.07957
##
##
## [[4]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 25.45341 0.20144 -0.07910
## inc4_35-50k inc5_50kplus agec(24,39]
## -0.02428 -0.50619 2.77218
## agec(39,59] agec(59,79] agec(79,99]
## 3.79329 3.31690 1.00672
## educ0Prim educ1somehs educ3somecol
## 0.99992 -0.33585 -0.08870
## educ4colgrad race_ethhispanic race_ethnh black
## -1.31450 0.39730 1.57012
## race_ethnh multirace race_ethnh other
## 0.80981 -1.29124
##
##
## [[5]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 25.222557 0.267287 0.090658
## inc4_35-50k inc5_50kplus agec(24,39]
## 0.154271 -0.395245 2.833178
## agec(39,59] agec(59,79] agec(79,99]
## 3.887220 3.388620 1.225407
## educ0Prim educ1somehs educ3somecol
## 0.699915 -0.188286 -0.004645
## educ4colgrad race_ethhispanic race_ethnh black
## -1.251291 0.268642 1.670443
## race_ethnh multirace race_ethnh other
## 0.862450 -1.213154
with(data=imp, expr=(sd(bmi)))
## call :
## with.mids(data = imp, expr = (sd(bmi)))
##
## call1 :
## mice(data = brfss16m[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2122 4274 0 99 473
##
## analyses :
## [[1]]
## [1] 6.080498
##
## [[2]]
## [1] 6.091277
##
## [[3]]
## [1] 6.078349
##
## [[4]]
## [1] 6.129393
##
## [[5]]
## [1] 6.070973
with(data=imp, expr=(prop.table(table(inc))))
## call :
## with.mids(data = imp, expr = (prop.table(table(inc))))
##
## call1 :
## mice(data = brfss16m[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2122 4274 0 99 473
##
## analyses :
## [[1]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09236 0.16140 0.10324 0.13308 0.50992
##
## [[2]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09308 0.16076 0.10204 0.13348 0.51064
##
## [[3]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09292 0.16176 0.10216 0.13388 0.50928
##
## [[4]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09372 0.15952 0.10192 0.13500 0.50984
##
## [[5]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09348 0.16056 0.10256 0.13400 0.50940
with(data=imp, expr=(prop.table(table(race_eth))))
## call :
## with.mids(data = imp, expr = (prop.table(table(race_eth))))
##
## call1 :
## mice(data = brfss16m[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2122 4274 0 99 473
##
## analyses :
## [[1]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.74792 0.09652 0.10040 0.01496 0.04020
##
## [[2]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.74752 0.09664 0.10064 0.01492 0.04028
##
## [[3]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.74736 0.09696 0.10044 0.01484 0.04040
##
## [[4]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.74684 0.09680 0.10068 0.01516 0.04052
##
## [[5]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.74756 0.09640 0.10052 0.01516 0.04036
with(data=imp, expr=(prop.table(table(educ))))
## call :
## with.mids(data = imp, expr = (prop.table(table(educ))))
##
## call1 :
## mice(data = brfss16m[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2122 4274 0 99 473
##
## analyses :
## [[1]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.25268 0.02368 0.04532 0.27244 0.40588
##
## [[2]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.25252 0.02364 0.04540 0.27240 0.40604
##
## [[3]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.25248 0.02356 0.04540 0.27244 0.40612
##
## [[4]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.25256 0.02368 0.04564 0.27240 0.40572
##
## [[5]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.25260 0.02368 0.04560 0.27192 0.40620
Now we pool the separate models from each imputed data set:
est.p<-pool(fit.bmi)
print(est.p)
## Class: mipo m = 5
## estimate ubar b t dfcom
## (Intercept) 25.33368775 0.04254320 0.009340574 0.05375189 24983
## inc2_15-25k 0.27055212 0.02424901 0.003097640 0.02796618 24983
## inc3_25-35k 0.03755912 0.02987411 0.012609874 0.04500596 24983
## inc4_35-50k 0.06250827 0.02729409 0.005783733 0.03423457 24983
## inc5_50kplus -0.40869277 0.02139611 0.004956579 0.02734400 24983
## agec(24,39] 2.74345488 0.03258700 0.005020105 0.03861113 24983
## agec(39,59] 3.86374439 0.02881108 0.004312554 0.03398614 24983
## agec(59,79] 3.36300885 0.02845255 0.001714075 0.03050944 24983
## agec(79,99] 1.13875571 0.04347791 0.006871428 0.05172362 24983
## educ0Prim 0.79574400 0.07048905 0.013876789 0.08714119 24983
## educ1somehs -0.26992853 0.03753564 0.007105973 0.04606281 24983
## educ3somecol -0.06238160 0.01092212 0.001270402 0.01244660 24983
## educ4colgrad -1.34037074 0.01052314 0.005089906 0.01663103 24983
## race_ethhispanic 0.34281546 0.01925459 0.003557930 0.02352410 24983
## race_ethnh black 1.65499398 0.01644212 0.002850428 0.01986263 24983
## race_ethnh multirace 0.81733039 0.09626037 0.003405557 0.10034703 24983
## race_ethnh other -1.16498318 0.03748388 0.012276835 0.05221608 24983
## df riv lambda fmi
## (Intercept) 91.56341 0.26346605 0.20852642 0.22526595
## inc2_15-25k 224.07112 0.15329152 0.13291654 0.14055365
## inc3_25-35k 35.30939 0.50652057 0.33621882 0.37087253
## inc4_35-50k 96.84870 0.25428505 0.20273306 0.21870256
## inc5_50kplus 84.17503 0.27798962 0.21752103 0.23547294
## agec(24,39] 163.05164 0.18486284 0.15602046 0.16618573
## agec(39,59] 171.12304 0.17962068 0.15226986 0.16200700
## agec(59,79] 848.01377 0.07229194 0.06741815 0.06960985
## agec(79,99] 156.22058 0.18965297 0.15941873 0.16997743
## educ0Prim 108.94786 0.23623737 0.19109386 0.20554534
## educ1somehs 116.05616 0.22717521 0.18512044 0.19880943
## educ3somecol 263.43041 0.13957758 0.12248186 0.12906908
## educ4colgrad 29.60072 0.58042424 0.36725850 0.40607615
## race_ethhispanic 120.71392 0.22174019 0.18149537 0.19472759
## race_ethnh black 134.00701 0.20803361 0.17220846 0.18429239
## race_ethnh multirace 2191.21164 0.04245432 0.04072535 0.04159972
## race_ethnh other 50.10927 0.39302764 0.28213915 0.30917251
summary(est.p)
We need to pay attention to the fmi column and the lambda column. These convey information about how much the missingness of each particular variable affects the model coefficients.
lam<-data.frame(lam=est.p$pooled$lambda, param=row.names(est.p$pooled))
ggplot(data=lam,aes(x=param, y=lam))+geom_col()+theme(axis.text.x = element_text(angle = 45, hjust = 1))
It appears that a couple of the education and income coefficients have relatively large lambda values, meaning a sizable share of their variance is attributable to the missing data. This suggests that there may be noticeable variation in the resulting coefficients of the model, depending on which imputed data set we use.
We can also compare to the model fit on the original data, with missings eliminated:
library(dplyr)
bnm<-brfss16m%>%
select(bmi, inc, agec, educ, race_eth)%>%
filter(complete.cases(.))%>%
as.data.frame()
summary(lm(bmi~inc+agec+educ+race_eth, bnm))
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth, data = bnm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.491 -3.966 -0.856 2.873 52.475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.65888 0.24472 104.848 < 2e-16 ***
## inc2_15-25k 0.33102 0.18246 1.814 0.06966 .
## inc3_25-35k 0.05852 0.20076 0.292 0.77066
## inc4_35-50k 0.13059 0.19066 0.685 0.49342
## inc5_50kplus -0.41660 0.16926 -2.461 0.01385 *
## agec(24,39] 2.60708 0.21187 12.305 < 2e-16 ***
## agec(39,59] 3.69211 0.20031 18.432 < 2e-16 ***
## agec(59,79] 3.21911 0.19958 16.130 < 2e-16 ***
## agec(79,99] 1.04703 0.24975 4.192 2.77e-05 ***
## educ0Prim 0.74977 0.33896 2.212 0.02698 *
## educ1somehs -0.52764 0.22922 -2.302 0.02135 *
## educ3somecol -0.10030 0.11980 -0.837 0.40247
## educ4colgrad -1.46219 0.11631 -12.571 < 2e-16 ***
## race_ethhispanic 0.45596 0.15959 2.857 0.00428 **
## race_ethnh black 1.66481 0.14608 11.396 < 2e-16 ***
## race_ethnh multirace 0.70557 0.34990 2.016 0.04376 *
## race_ethnh other -1.35430 0.22481 -6.024 1.73e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.898 on 19280 degrees of freedom
## Multiple R-squared: 0.05391, Adjusted R-squared: 0.05313
## F-statistic: 68.67 on 16 and 19280 DF, p-value: < 2.2e-16
Here, I compare the coefficients from the model where we eliminated all missing data to the one that we fit on the imputed data:
fit1<-lm(bmi~inc+agec+educ+race_eth, data=brfss16m)
summary(fit1)
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth, data = brfss16m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.491 -3.966 -0.856 2.873 52.475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.65888 0.24472 104.848 < 2e-16 ***
## inc2_15-25k 0.33102 0.18246 1.814 0.06966 .
## inc3_25-35k 0.05852 0.20076 0.292 0.77066
## inc4_35-50k 0.13059 0.19066 0.685 0.49342
## inc5_50kplus -0.41660 0.16926 -2.461 0.01385 *
## agec(24,39] 2.60708 0.21187 12.305 < 2e-16 ***
## agec(39,59] 3.69211 0.20031 18.432 < 2e-16 ***
## agec(59,79] 3.21911 0.19958 16.130 < 2e-16 ***
## agec(79,99] 1.04703 0.24975 4.192 2.77e-05 ***
## educ0Prim 0.74977 0.33896 2.212 0.02698 *
## educ1somehs -0.52764 0.22922 -2.302 0.02135 *
## educ3somecol -0.10030 0.11980 -0.837 0.40247
## educ4colgrad -1.46219 0.11631 -12.571 < 2e-16 ***
## race_ethhispanic 0.45596 0.15959 2.857 0.00428 **
## race_ethnh black 1.66481 0.14608 11.396 < 2e-16 ***
## race_ethnh multirace 0.70557 0.34990 2.016 0.04376 *
## race_ethnh other -1.35430 0.22481 -6.024 1.73e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.898 on 19280 degrees of freedom
## (5703 observations deleted due to missingness)
## Multiple R-squared: 0.05391, Adjusted R-squared: 0.05313
## F-statistic: 68.67 on 16 and 19280 DF, p-value: < 2.2e-16
fit.imp<-lm(bmi~inc+agec+educ+race_eth, data=dat.imp)
summary(fit.imp)
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth, data = dat.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.397 -3.950 -0.877 2.871 52.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.414966 0.206002 123.372 < 2e-16 ***
## inc2_15-25k 0.232223 0.155500 1.493 0.13535
## inc3_25-35k -0.071817 0.172233 -0.417 0.67670
## inc4_35-50k -0.008227 0.165182 -0.050 0.96028
## inc5_50kplus -0.449068 0.146174 -3.072 0.00213 **
## agec(24,39] 2.714127 0.180028 15.076 < 2e-16 ***
## agec(39,59] 3.883869 0.169290 22.942 < 2e-16 ***
## agec(59,79] 3.407280 0.168220 20.255 < 2e-16 ***
## agec(79,99] 1.136255 0.207874 5.466 4.64e-08 ***
## educ0Prim 0.741329 0.264595 2.802 0.00509 **
## educ1somehs -0.378400 0.193327 -1.957 0.05032 .
## educ3somecol -0.093667 0.104210 -0.899 0.36875
## educ4colgrad -1.441387 0.102379 -14.079 < 2e-16 ***
## race_ethhispanic 0.355282 0.138382 2.567 0.01025 *
## race_ethnh black 1.716517 0.127916 13.419 < 2e-16 ***
## race_ethnh multirace 0.876980 0.309876 2.830 0.00466 **
## race_ethnh other -1.021510 0.193405 -5.282 1.29e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.911 on 24983 degrees of freedom
## Multiple R-squared: 0.05558, Adjusted R-squared: 0.05498
## F-statistic: 91.9 on 16 and 24983 DF, p-value: < 2.2e-16
In the analysis that only uses complete cases, the income effects on bmi are noticeably different from those we get once we impute the missing values. This suggests a selection effect for the income variable.
We can also construct a flag variable for missingness on income. This is a useful exercise to see whether the data are missing not at random:
fit1<-lm(bmi~agec+educ+race_eth+is.na(inc), data=brfss16m)
summary(fit1)
##
## Call:
## lm(formula = bmi ~ agec + educ + race_eth + is.na(inc), data = brfss16m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.581 -3.935 -0.866 2.845 52.035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.4481 0.1788 142.292 < 2e-16 ***
## agec(24,39] 2.6840 0.1886 14.233 < 2e-16 ***
## agec(39,59] 3.7168 0.1766 21.041 < 2e-16 ***
## agec(59,79] 3.3477 0.1752 19.108 < 2e-16 ***
## agec(79,99] 1.3056 0.2154 6.060 1.38e-09 ***
## educ0Prim 0.8551 0.2927 2.922 0.003485 **
## educ1somehs -0.1798 0.2024 -0.888 0.374409
## educ3somecol -0.1463 0.1088 -1.345 0.178501
## educ4colgrad -1.5871 0.1008 -15.748 < 2e-16 ***
## race_ethhispanic 0.5080 0.1442 3.523 0.000427 ***
## race_ethnh black 1.7843 0.1336 13.351 < 2e-16 ***
## race_ethnh multirace 0.8551 0.3243 2.637 0.008369 **
## race_ethnh other -1.1379 0.2042 -5.572 2.55e-08 ***
## is.na(inc)TRUE -0.8299 0.1142 -7.268 3.77e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.88 on 22465 degrees of freedom
## (2521 observations deleted due to missingness)
## Multiple R-squared: 0.05518, Adjusted R-squared: 0.05463
## F-statistic: 100.9 on 13 and 22465 DF, p-value: < 2.2e-16
And indeed we see that those with missing incomes have significantly lower BMIs.
If we wanted to see the ranges of the betas in the five imputed data models, we could do that:
#get the coefficients from each of the 5 imputations of bmi
coefs<-as.data.frame(matrix(unlist(lapply(fit.bmi$analyses, coef)), nrow=5, ncol=17, byrow=T))
names(coefs)<-names(fit.bmi$analyses[[1]]$coef)
#plot the coefficients from each of the different rounds of imputation to see the variability in the
#results
coefs%>%
select(`inc2_15-25k`:`race_ethnh other`)%>%
melt()%>%
mutate(imp=rep(1:5, 16))%>%
ggplot()+geom_point(aes(y=value,x=variable, group=variable, color=as.factor(imp) ))+ggtitle("Estimated Betas from Each Imputed Regression Model for BMI Outcome")+theme(axis.text.x = element_text(angle = 45, hjust = 1))
## No id variables; using all as measure variables