This example illustrates typical aspects of dealing with missing data. Topics include mean imputation, modal imputation for categorical data, and multiple imputation of complex patterns of missing data.
For this example I am using 2017 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART county data.
Every time we ask a question on a survey, there are a variety of reasons why we may fail to get a complete response
This is a cause for concern for those using survey data
Total nonresponse
This is where a respondent completely refuses to participate in the survey, or some other barrier prevents them from participating (not at home, ill, disability)
This is usually accounted for by survey weights
Noncoverage
When an individual is not included in the survey’s sampling frame
Usually adjusted for in the weights
Item nonresponse
When a respondent fails to provide an acceptable response to a survey question
Could be refusal
Could be “I don’t know”
Could give an inconsistent response
The interviewer could fail to ask the question
Partial nonresponse
Somewhere between total nonresponse and item nonresponse
Respondent could end the interview
e.g. Incomplete responses for a wave in a panel study
Can be dealt with by using weights, which eliminates respondents with partial responses
Can be dealt with by imputation, where the value is filled in by data editors
Missing completely at random (MCAR)
Missing values do not depend on any characteristic of an individual
Missing at random (MAR)
Missing values do not relate to the item of interest itself, but it could be related to another characteristic of the respondent
These two are said to be “ignorable”
Missing Not at random (MNAR)
Think of a question on satisfaction: those who are extremely dissatisfied may be less likely to respond, so the results may be biased because they don’t include those folks
This is “non-ignorable”
One way to assess whether an element is MCAR or MAR is to form a missing data variable (1=missing, 0=nonmissing) and estimate a logistic regression for that variable using key characteristics of the respondents
If all these characteristics show insignificant effects, we can reasonably assume the element is MCAR/MAR
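A minimal sketch of this check in R (dat, income, age and educ here are placeholder names, not variables from the data used below):
#flag whether income is missing (1=missing, 0=nonmissing)
dat$miss.inc<-ifelse(is.na(dat$income), 1, 0)
#model the probability of missingness using respondent characteristics
fit.miss<-glm(miss.inc~age+educ, data=dat, family=binomial)
summary(fit.miss) #insignificant predictors are consistent with MCAR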
Typically, statistical software will do one of two things when it encounters missing values
Delete the entire case if any variable in the equation is missing; this is called listwise deletion
Delete the case only for a particular comparison; this is called pairwise deletion
Both of these lead to fewer cases for entire models or for particular tests
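A quick sketch of the difference, using a small made-up data frame:
#toy data with scattered missing values
d<-data.frame(x=c(1,2,NA,4), y=c(2,NA,3,5), z=c(1,1,2,2))
cor(d, use="complete.obs") #listwise: only rows with no missing values at all
cor(d, use="pairwise.complete.obs") #pairwise: each pair of variables uses its own complete rows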
Can we do anything about missing data? There are easy ways and hard ways, but the answer is yes.
Easy ways == bad ways
Mean substitution
Plugs in the average of the observed responses for every individual with a missing value
If a large % of the respondents are missing the value (say >20%), the mean itself may not be estimated very well
The variance of the variable will be driven down, because everyone who is missing is given the same value (the mean)
This can lead to smaller effect sizes/betas in models using the data
Multiple imputation will use other characteristics of individuals with complete observations to predict the missing values for individuals with incomplete data
This would use the observed characteristics of the respondent to predict their missing value
Think regression!
Income is a good example of such a variable
People don’t like to report their income, but they may be willing to report their education and other characteristics
We can use those to predict their income
This is sensitive to the assumptions of the regression model being used!
This works like the regression imputation method, but applies it iteratively across all patterns of missing data
Typically this will be done several times (5 imputations is the default in most programs), and the differences in the imputed values are compared across runs
i.e. if we impute the values then run a regression for the outcome, how sensitive are the results of the model to the imputation
There are various ways to do this, and it depends on the scale of the variable
Find a set of variables that are associated with your outcome
If there are cases similar to yours (i.e. they are similar on the non-missing variables), you can plug in their variable of interest for the missing case
This is hot-deck imputation, which is what the U.S. Census Bureau did for years
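A minimal sketch of a random hot deck in R; hot.deck.impute is a helper I am writing for illustration (the predictive mean matching used by mice later on is a more sophisticated donor-based method):
#random hot-deck imputation: y is the variable with missing values, and the
#... arguments are the characteristics that define "similar" cases
hot.deck.impute<-function(y, ...){
  grp<-interaction(...) #cells of similar cases
  for(g in levels(grp)){
    idx<-which(grp==g)
    miss<-idx[is.na(y[idx])]
    obs<-idx[!is.na(y[idx])]
    if(length(miss)>0 & length(obs)>0) #each cell needs at least one donor
      y[miss]<-y[obs][sample(length(obs), length(miss), replace=TRUE)]
  }
  y
}
#e.g. dat$y.hd<-hot.deck.impute(dat$y, dat$g1, dat$g2), with placeholder names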
Regress Y on the observed X’s using the complete cases
Form fitted or “predicted” values
Replace the missing values with fitted values from the regression, based on non-missing observations that are “close” to the missing cases
i.e. replace the missing values with fitted values from similar individuals
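A minimal sketch of this approach, using the BMI and predictor variables created in the recoding section below (bmi.imp.reg is a name I am introducing here):
#fit a regression on the complete cases (lm drops rows with missing values)
fit.reg<-lm(bmi~agec+educ+race_eth, data=brfss_17)
#form fitted values for everyone with observed predictors
pred.bmi<-predict(fit.reg, newdata=brfss_17)
#plug fitted values into the missing cases; cases missing a predictor stay NA
brfss_17$bmi.imp.reg<-ifelse(is.na(brfss_17$bmi), pred.bmi, brfss_17$bmi)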
library(car)
library(mice)
library(ggplot2)
library(dplyr)
load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))
set.seed(1234)
samp<-sample(1:dim(brfss_17)[1], size = 25000) #smaller sample for brevity
brfss_17<-brfss_17[samp,]
#Healthy days
brfss_17$healthdays<-Recode(brfss_17$physhlth, recodes = "88=0; 77=NA; 99=NA")
#Healthy mental health days
brfss_17$healthmdays<-Recode(brfss_17$menthlth, recodes = "88=0; 77=NA; 99=NA")
brfss_17$badhealth<-Recode(brfss_17$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss_17$black<-Recode(brfss_17$racegr3, recodes="2=1; 9=NA; else=0")
brfss_17$white<-Recode(brfss_17$racegr3, recodes="1=1; 9=NA; else=0")
brfss_17$other<-Recode(brfss_17$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss_17$hispanic<-Recode(brfss_17$racegr3, recodes="5=1; 9=NA; else=0")
brfss_17$race_eth<-Recode(brfss_17$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor = T)
brfss_17$race_eth<-relevel(brfss_17$race_eth, ref = "nhwhite")
#insurance
brfss_17$ins<-Recode(brfss_17$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss_17$inc<-Recode(brfss_17$incomg, recodes = "9= NA;1='1_lt15k'; 2='2_15-25k';3='3_25-35k';4='4_35-50k';5='5_50kplus'", as.factor = T)
#brfss_17$inc<-as.ordered(brfss_17$inc)
#education level
brfss_17$educ<-Recode(brfss_17$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor=T)
brfss_17$educ<-relevel(brfss_17$educ, ref='2hsgrad')
#employment
brfss_17$employ<-Recode(brfss_17$employ1,
recodes="1:2='employloyed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor=T)
brfss_17$employ<-relevel(brfss_17$employ, ref='employloyed')
#marital status
brfss_17$marst<-Recode(brfss_17$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor=T)
brfss_17$marst<-relevel(brfss_17$marst, ref='married')
#Age cut into intervals
brfss_17$agec<-cut(brfss_17$age80, breaks=c(0,24,39,59,79,99))
#BMI: in the BRFSS, the bmi5 variable has 2 implied decimal places,
#so we must divide by 100 to get real BMI values
brfss_17$bmi<-brfss_17$bmi5/100
#smoking currently
brfss_17$smoke<-Recode(brfss_17$smoker3,
recodes="1:2=1; 3:4=0; else=NA")
#brfss_17$smoke<-relevel(brfss_17$smoke, ref = "NeverSmoked")
Now, we can get a general idea of the missingness in these variables by using summary() on the recoded variables:
summary(brfss_17[, c("ins", "smoke", "bmi", "badhealth", "race_eth", "educ", "employ", "marst", "inc")])
## ins smoke bmi badhealth
## Min. :0.0000 Min. :0.0000 Min. :12.55 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:23.75 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :26.93 Median :0.0000
## Mean :0.9229 Mean :0.1363 Mean :27.97 Mean :0.1756
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:30.90 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :93.97 Max. :1.0000
## NA's :111 NA's :1025 NA's :2118 NA's :67
## race_eth educ employ marst
## nhwhite :18106 2hsgrad : 6046 employloyed:12929 married :12600
## hispanic : 2326 0Prim : 564 nilf : 3181 cohab : 823
## nh black : 2594 1somehs : 968 retired : 7095 divorced : 3412
## nh multirace: 443 3somecol: 6861 unable : 1558 nm : 4526
## nh other : 1008 4colgrad:10435 NA's : 237 separated: 557
## NA's : 523 NA's : 126 widowed : 2860
## NA's : 222
## inc
## 1_lt15k : 1810
## 2_15-25k : 3123
## 3_25-35k : 2051
## 4_35-50k : 2798
## 5_50kplus:11016
## NA's : 4202
##
Which shows that, among these recoded variables, inc, the income variable, has the most missingness: 4202 people, or 16.8% of the sample, are missing it.
The lowest amount of missingness is in the bad health variable, which has only 0.268% missing.
Now, I’m going to illustrate mean imputation of a continuous variable, BMI.
#I'm going to play with 3 variables: bmi, employment status and income category
summary(brfss_17$bmi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12.55 23.75 26.93 27.97 30.90 93.97 2118
#what happens when we replace the missings with the mean?
brfss_17$bmi.imp.mean<-ifelse(is.na(brfss_17$bmi)==T, mean(brfss_17$bmi, na.rm=T), brfss_17$bmi)
mean(brfss_17$bmi, na.rm=T)
## [1] 27.9666
mean(brfss_17$bmi.imp.mean) #no difference!
## [1] 27.9666
median(brfss_17$bmi, na.rm=T)
## [1] 26.93
median(brfss_17$bmi.imp.mean) #slight difference
## [1] 27.46
var(brfss_17$bmi, na.rm=T)
## [1] 38.38787
var(brfss_17$bmi.imp.mean) # more noticeable difference!
## [1] 35.13552
So what we see here is that imputing with the mean does nothing to central tendency when measured using the mean (though it does shift the median slightly), but it does reduce the variance of the outcome. This is because you’re replacing all missing cases with the most likely value (the mean), so you’re artificially deflating the variance. That’s not good.
We can see this in a histogram, where the imputed values increase the peak in the distribution:
#plot the histogram
library(reshape2)
brfss_17%>%
select(bmi.imp.mean, bmi)%>%
melt()%>%
ggplot()+geom_freqpoly(aes(x = value,
y = ..density.., colour = variable))
## No id variables; using all as measure variables
## Warning: attributes are not identical across measure variables; they will be
## dropped
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2118 rows containing non-finite values (stat_bin).
If we have a categorical variable, an easy way to impute the values is modal imputation: replacing missing cases with the mode, or most common value. It doesn’t make sense to use the mean, because what would the mean even be for a categorical variable?
table(brfss_17$employ)
##
## employloyed nilf retired unable
## 12929 3181 7095 1558
#find the most common value
mcv.employ<-factor(names(which.max(table(brfss_17$employ))), levels=levels(brfss_17$employ))
mcv.employ
## [1] employloyed
## Levels: employloyed nilf retired unable
#impute the cases
brfss_17$employ.imp<-as.factor(ifelse(is.na(brfss_17$employ)==T, mcv.employ, brfss_17$employ))
levels(brfss_17$employ.imp)<-levels(brfss_17$employ)
prop.table(table(brfss_17$employ))
##
## employloyed nilf retired unable
## 0.52210960 0.12845778 0.28651617 0.06291645
prop.table(table(brfss_17$employ.imp))
##
## employloyed nilf retired unable
## 0.52664 0.12724 0.28380 0.06232
barplot(prop.table(table(brfss_17$employ)), main="Original Data", ylim=c(0, .6))
barplot(prop.table(table(brfss_17$employ.imp)), main="Imputed Data",ylim=c(0, .6))
Which doesn’t look like much of a difference because only 237 people were missing. Now let’s try modal imputation on income group:
table(brfss_17$inc)
##
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 1810 3123 2051 2798 11016
#find the most common value
mcv.inc<-factor(names(which.max(table(brfss_17$inc))), levels = levels(brfss_17$inc))
mcv.inc
## [1] 5_50kplus
## Levels: 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
#impute the cases
brfss_17$inc.imp<-as.factor(ifelse(is.na(brfss_17$inc)==T, mcv.inc, brfss_17$inc))
levels(brfss_17$inc.imp)<-levels(as.factor(brfss_17$inc))
prop.table(table(brfss_17$inc))
##
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.08702760 0.15015867 0.09861525 0.13453217 0.52966631
prop.table(table(brfss_17$inc.imp))
##
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.07240 0.12492 0.08204 0.11192 0.60872
barplot(prop.table(table(brfss_17$inc)), main="Original Data", ylim=c(0, .6))
barplot(prop.table(table(brfss_17$inc.imp)), main="Imputed Data", ylim=c(0, .6))
Which shows how dramatically we alter the distribution of the variable by imputing at the mode.
We can also construct a missing-data flag variable and include it in a model. This is a useful exercise to see whether values are missing at random within the data:
fit1<-lm(bmi~agec+educ+race_eth+is.na(inc), data=brfss_17)
summary(fit1)
##
## Call:
## lm(formula = bmi ~ agec + educ + race_eth + is.na(inc), data = brfss_17)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.436 -4.010 -0.963 2.849 64.876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.74297 0.18012 142.922 < 2e-16 ***
## agec(24,39] 2.64753 0.19103 13.859 < 2e-16 ***
## agec(39,59] 3.53998 0.17989 19.679 < 2e-16 ***
## agec(59,79] 3.23609 0.17791 18.190 < 2e-16 ***
## agec(79,99] 0.80195 0.21907 3.661 0.000252 ***
## educ0Prim 0.44550 0.31063 1.434 0.151536
## educ1somehs 0.09697 0.22194 0.437 0.662161
## educ3somecol -0.18926 0.11243 -1.683 0.092321 .
## educ4colgrad -1.45153 0.10447 -13.894 < 2e-16 ***
## race_ethhispanic 0.54036 0.15067 3.586 0.000336 ***
## race_ethnh black 1.65255 0.13448 12.289 < 2e-16 ***
## race_ethnh multirace 0.44961 0.29998 1.499 0.133930
## race_ethnh other -1.36175 0.20947 -6.501 8.15e-11 ***
## is.na(inc)TRUE -0.95461 0.11910 -8.015 1.15e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.03 on 22422 degrees of freedom
## (2564 observations deleted due to missingness)
## Multiple R-squared: 0.0507, Adjusted R-squared: 0.05015
## F-statistic: 92.12 on 13 and 22422 DF, p-value: < 2.2e-16
And indeed we see that those with missing incomes have significantly lower BMIs. This implies that income is probably not missing completely at random, because its missingness is related to another characteristic of the respondent (BMI). This is a good process to go through when you are assessing whether your data are missing at random or not.
These days, these types of imputation have been far surpassed by more complete methods based on regression. These methods are generally referred to as multiple imputation, because we generate several imputed data sets rather than a single filled-in one (and, in the chained-equations approach, impute multiple variables simultaneously). Instead of reviewing the theory here, I suggest you have a look at Joe Schafer’s site, which gives a nice treatment of the subject. Here, I will use the imputation techniques in the mice library in R, which you can read about here.
I have used these in practice in publications and generally like the framework the library uses. Another popular technique is in the Amelia library by Gary King, which I haven’t used much. If you are serious about doing multiple imputation, it would be wise to investigate multiple methodologies.
To begin, I explore the various patterns of missingness in the data. The md.pattern function in mice does this nicely. Here, each row corresponds to a particular pattern of missingness (1 = observed, 0 = missing).
#look at the patterns of missingness
md.pattern(brfss_17[,c("bmi", "inc", "agec","educ","race_eth")])
## agec educ race_eth bmi inc
## 19383 1 1 1 1 1 0
## 3053 1 1 1 1 0 1
## 1054 1 1 1 0 1 1
## 903 1 1 1 0 0 2
## 274 1 1 0 1 1 1
## 105 1 1 0 1 0 2
## 42 1 1 0 0 1 2
## 60 1 1 0 0 0 3
## 28 1 0 1 1 1 1
## 26 1 0 1 1 0 2
## 6 1 0 1 0 1 2
## 24 1 0 1 0 0 3
## 5 1 0 0 1 1 2
## 8 1 0 0 1 0 3
## 6 1 0 0 0 1 3
## 23 1 0 0 0 0 4
## 0 126 523 2118 4202 6969
The first row shows the number of observations in the data that are complete (19383 cases).
The second row shows the number of people who are missing only the inc variable.
Rows with multiple 0’s in the columns indicate missing-data patterns where several variables are missing together.
The bottom row tells how many total people are missing each variable, in ANY combination with the other variables.
If you want to see how pairs of variables are missing together, the md.pairs() function will show this.
A pair of variables can have exactly four missingness patterns: both variables are observed (pattern rr), the first variable is observed and the second is missing (pattern rm), the first is missing and the second is observed (pattern mr), and both are missing (pattern mm).
md.pairs(brfss_17[,c("bmi", "inc", "agec","educ","race_eth")])
## $rr
## bmi inc agec educ race_eth
## bmi 22882 19690 22882 22815 22490
## inc 19690 20798 20798 20753 20471
## agec 22882 20798 25000 24874 24477
## educ 22815 20753 24874 24874 24393
## race_eth 22490 20471 24477 24393 24477
##
## $rm
## bmi inc agec educ race_eth
## bmi 0 3192 0 67 392
## inc 1108 0 0 45 327
## agec 2118 4202 0 126 523
## educ 2059 4121 0 0 481
## race_eth 1987 4006 0 84 0
##
## $mr
## bmi inc agec educ race_eth
## bmi 0 1108 2118 2059 1987
## inc 3192 0 4202 4121 4006
## agec 0 0 0 0 0
## educ 67 45 126 0 84
## race_eth 392 327 523 481 0
##
## $mm
## bmi inc agec educ race_eth
## bmi 2118 1010 0 59 131
## inc 1010 4202 0 81 196
## agec 0 0 0 0 0
## educ 59 81 0 126 42
## race_eth 131 196 0 42 523
We can perform a basic multiple imputation by simply doing the following. Note that this may take a very long time with big data sets.
dat2<-brfss_17
#"knock out" 500 observed bmi values; these cases have known true values,
#which could be used to check the quality of the imputations
samp2<-sample(1:dim(dat2)[1], replace = F, size = 500)
dat2$bmiknock<-dat2$bmi
dat2$bmiknock[samp2]<-NA
head(dat2[, c("bmiknock","bmi")])
imp<-mice(data = dat2[,c("bmi", "inc", "agec","educ","race_eth")], seed = 22, m = 5)
##
## iter imp variable
## 1 1 bmi inc educ race_eth
## 1 2 bmi inc educ race_eth
## 1 3 bmi inc educ race_eth
## 1 4 bmi inc educ race_eth
## 1 5 bmi inc educ race_eth
## 2 1 bmi inc educ race_eth
## 2 2 bmi inc educ race_eth
## 2 3 bmi inc educ race_eth
## 2 4 bmi inc educ race_eth
## 2 5 bmi inc educ race_eth
## 3 1 bmi inc educ race_eth
## 3 2 bmi inc educ race_eth
## 3 3 bmi inc educ race_eth
## 3 4 bmi inc educ race_eth
## 3 5 bmi inc educ race_eth
## 4 1 bmi inc educ race_eth
## 4 2 bmi inc educ race_eth
## 4 3 bmi inc educ race_eth
## 4 4 bmi inc educ race_eth
## 4 5 bmi inc educ race_eth
## 5 1 bmi inc educ race_eth
## 5 2 bmi inc educ race_eth
## 5 3 bmi inc educ race_eth
## 5 4 bmi inc educ race_eth
## 5 5 bmi inc educ race_eth
print(imp)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## bmi inc agec educ race_eth
## "pmm" "polyreg" "" "polyreg" "polyreg"
## PredictorMatrix:
## bmi inc agec educ race_eth
## bmi 0 1 1 1 1
## inc 1 0 1 1 1
## agec 1 1 0 1 1
## educ 1 1 1 0 1
## race_eth 1 1 1 1 0
The printed object shows how many imputations were done, which imputation method was used for each variable (because you wouldn’t want to use a normal distribution for a categorical variable!!), and the predictor matrix, which records which variables are used to impute each other.
The variables are imputed in a visit sequence (by default, left to right across the columns).
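If the defaults are not what you want, both can be overridden; here is a sketch (the "norm" method and the visitSequence keywords are described in the mice documentation):
#override the default method for bmi and impute in order of missingness
meth<-make.method(dat2[,c("bmi", "inc", "agec","educ","race_eth")])
meth["bmi"]<-"norm" #Bayesian linear regression instead of predictive mean matching
imp2<-mice(data = dat2[,c("bmi", "inc", "agec","educ","race_eth")],
           method = meth, visitSequence = "monotone", m = 5, seed = 22)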
We may want to make sure the imputed values are plausible by having a look at them. For instance, are the imputed BMI values outside the range of the observed data?
head(imp$imp$bmi)
summary(imp$imp$bmi)
## 1 2 3 4
## Min. :14.17 Min. :13.31 Min. :14.64 Min. :15.36
## 1st Qu.:24.21 1st Qu.:22.96 1st Qu.:23.45 1st Qu.:24.37
## Median :27.46 Median :25.79 Median :26.57 Median :27.71
## Mean :29.28 Mean :27.78 Mean :27.83 Mean :28.47
## 3rd Qu.:33.20 3rd Qu.:31.01 3rd Qu.:30.68 3rd Qu.:31.64
## Max. :65.42 Max. :62.73 Max. :81.51 Max. :81.51
## 5
## Min. :13.31
## 1st Qu.:23.41
## Median :26.85
## Mean :28.34
## 3rd Qu.:31.09
## Max. :81.51
summary(brfss_17$bmi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12.55 23.75 26.93 27.97 30.90 93.97 2118
head(imp$imp$inc)
summary(imp$imp$inc)
## 1 2 3 4
## 1_lt15k : 445 1_lt15k : 462 1_lt15k : 473 1_lt15k : 432
## 2_15-25k : 758 2_15-25k : 779 2_15-25k : 805 2_15-25k : 761
## 3_25-35k : 490 3_25-35k : 463 3_25-35k : 448 3_25-35k : 472
## 4_35-50k : 585 4_35-50k : 590 4_35-50k : 599 4_35-50k : 573
## 5_50kplus:1924 5_50kplus:1908 5_50kplus:1877 5_50kplus:1964
## 5
## 1_lt15k : 437
## 2_15-25k : 775
## 3_25-35k : 481
## 4_35-50k : 571
## 5_50kplus:1938
Which shows the imputed values for the first 6 cases across the 5 different imputations, as well as numeric summaries of the imputed values. We can see that there is variation across the imputations: the imputed values are not the same in each run.
We can also do some plotting. For instance if we want to see how the observed and imputed values of bmi look with respect to race, we can do:
library(lattice)
stripplot(imp,bmi~race_eth|.imp, pch=20)
and we see the distribution of the original data (blue dots) and the imputed data (red dots) across the levels of race, for each of the five different imputation runs (the number at the top shows which run; the first panel is the original data).
This plot shows that the bmi values correspond well with the observed data, so they are probably plausible values.
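Another quick diagnostic along these lines is a density plot of the observed versus the imputed values, which mice provides for mids objects:
#compare the distributions of observed (blue) and imputed (red) bmi values
densityplot(imp, ~bmi)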
If we want to get our new, imputed data, we can use the complete() function, which by default extracts the first imputed data set. If we want a different one, we can do complete(imp, action=3), for example, to get the third imputed data set.
dat.imp<-complete(imp, action = 1)
head(dat.imp, n=10)
#Compare to the original data
head(brfss_17[,c("bmi", "inc", "agec","educ","race_eth")], n=10)
While the first few cases don’t show much missingness, we can coax some more interesting cases out and compare the original data to the imputed:
head(dat.imp[is.na(brfss_17$bmi)==T,], n=10)
head(brfss_17[is.na(brfss_17$bmi)==T,c("bmi", "inc", "agec","educ","race_eth")], n=10)
A key element of using imputed data is that the relationships we want to study should be maintained after imputation, and presumably the relationships within each imputed data set will be similar. So if we use each of the (5, in this case) imputed data sets in a model, we should see similar results across the five different models.
Here I look at a linear model for bmi:
#Now, I will see the variability in the 5 different imputations for each outcome
fit.bmi<-with(data=imp, expr=lm(bmi~inc+agec+educ+race_eth))
fit.bmi
## call :
## with.mids(data = imp, expr = lm(bmi ~ inc + agec + educ + race_eth))
##
## call1 :
## mice(data = dat2[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2118 4202 0 126 523
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 26.0887 -0.1531 -0.5448
## inc4_35-50k inc5_50kplus agec(24,39]
## -0.4067 -0.9910 2.7745
## agec(39,59] agec(59,79] agec(79,99]
## 3.8217 3.2960 0.6177
## educ0Prim educ1somehs educ3somecol
## 0.4120 -0.2355 -0.1365
## educ4colgrad race_ethhispanic race_ethnh black
## -0.9614 0.4341 1.3222
## race_ethnh multirace race_ethnh other
## 0.2438 -1.6502
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 26.10329 -0.31193 -0.72832
## inc4_35-50k inc5_50kplus agec(24,39]
## -0.60587 -1.15468 2.79760
## agec(39,59] agec(59,79] agec(79,99]
## 3.77182 3.28025 0.69203
## educ0Prim educ1somehs educ3somecol
## 0.48565 -0.05169 -0.04539
## educ4colgrad race_ethhispanic race_ethnh black
## -0.98006 0.37436 1.35323
## race_ethnh multirace race_ethnh other
## 0.19938 -1.64472
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 25.86497 -0.11397 -0.33251
## inc4_35-50k inc5_50kplus agec(24,39]
## -0.24635 -0.89570 2.81354
## agec(39,59] agec(59,79] agec(79,99]
## 3.76143 3.27423 0.69896
## educ0Prim educ1somehs educ3somecol
## 0.25674 -0.10659 -0.01724
## educ4colgrad race_ethhispanic race_ethnh black
## -1.03366 0.41848 1.37973
## race_ethnh multirace race_ethnh other
## 0.41664 -1.59279
##
##
## [[4]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 25.78473 -0.10513 -0.37739
## inc4_35-50k inc5_50kplus agec(24,39]
## -0.28025 -0.84747 3.02625
## agec(39,59] agec(59,79] agec(79,99]
## 3.89794 3.39239 0.83281
## educ0Prim educ1somehs educ3somecol
## 0.73084 -0.11819 -0.01748
## educ4colgrad race_ethhispanic race_ethnh black
## -1.09952 0.37712 1.46931
## race_ethnh multirace race_ethnh other
## 0.14095 -1.53960
##
##
## [[5]]
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k
## 26.0781 -0.3209 -0.4279
## inc4_35-50k inc5_50kplus agec(24,39]
## -0.4987 -1.0290 3.0979
## agec(39,59] agec(59,79] agec(79,99]
## 3.9029 3.3433 0.8623
## educ0Prim educ1somehs educ3somecol
## 0.5439 -0.3043 -0.2062
## educ4colgrad race_ethhispanic race_ethnh black
## -1.2640 0.3315 1.4140
## race_ethnh multirace race_ethnh other
## 0.3541 -1.5951
with(data=imp, expr=(sd(bmi)))
## call :
## with.mids(data = imp, expr = (sd(bmi)))
##
## call1 :
## mice(data = dat2[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2118 4202 0 126 523
##
## analyses :
## [[1]]
## [1] 6.288433
##
## [[2]]
## [1] 6.261847
##
## [[3]]
## [1] 6.25253
##
## [[4]]
## [1] 6.199815
##
## [[5]]
## [1] 6.332331
with(data=imp, expr=(prop.table(table(inc))))
## call :
## with.mids(data = imp, expr = (prop.table(table(inc))))
##
## call1 :
## mice(data = dat2[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2118 4202 0 126 523
##
## analyses :
## [[1]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09020 0.15524 0.10164 0.13532 0.51760
##
## [[2]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09088 0.15608 0.10056 0.13552 0.51696
##
## [[3]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09132 0.15712 0.09996 0.13588 0.51572
##
## [[4]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.08968 0.15536 0.10092 0.13484 0.51920
##
## [[5]]
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.08988 0.15592 0.10128 0.13476 0.51816
with(data=imp, expr=(prop.table(table(race_eth))))
## call :
## with.mids(data = imp, expr = (prop.table(table(race_eth))))
##
## call1 :
## mice(data = dat2[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2118 4202 0 126 523
##
## analyses :
## [[1]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.73948 0.09552 0.10588 0.01796 0.04116
##
## [[2]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.74036 0.09496 0.10600 0.01792 0.04076
##
## [[3]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.73956 0.09536 0.10572 0.01828 0.04108
##
## [[4]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.73988 0.09484 0.10608 0.01804 0.04116
##
## [[5]]
## race_eth
## nhwhite hispanic nh black nh multirace nh other
## 0.74000 0.09540 0.10568 0.01800 0.04092
with(data=imp, expr=(prop.table(table(educ))))
## call :
## with.mids(data = imp, expr = (prop.table(table(educ))))
##
## call1 :
## mice(data = dat2[, c("bmi", "inc", "agec", "educ", "race_eth")],
## m = 5, seed = 22)
##
## nmis :
## bmi inc agec educ race_eth
## 2118 4202 0 126 523
##
## analyses :
## [[1]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.24308 0.02272 0.03880 0.27604 0.41936
##
## [[2]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.24312 0.02260 0.03912 0.27568 0.41948
##
## [[3]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.24300 0.02280 0.03884 0.27604 0.41932
##
## [[4]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.24340 0.02268 0.03900 0.27572 0.41920
##
## [[5]]
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.24344 0.02276 0.03896 0.27552 0.41932
Now we pool the separate models from each imputed data set:
est.p<-pool(fit.bmi)
print(est.p)
## Class: mipo m = 5
## term m estimate ubar b t dfcom
## 1 (Intercept) 5 25.98395833 0.04401972 0.021980046 0.07039577 24983
## 2 inc2_15-25k 5 -0.20102208 0.02646174 0.011437119 0.04018629 24983
## 3 inc3_25-35k 5 -0.48218771 0.03234935 0.025207076 0.06259784 24983
## 4 inc4_35-50k 5 -0.40757159 0.02933427 0.022457380 0.05628313 24983
## 5 inc5_50kplus 5 -0.98355795 0.02314905 0.014410105 0.04044117 24983
## 6 agec(24,39] 5 2.90196482 0.03333074 0.022197674 0.05996795 24983
## 7 agec(39,59] 5 3.83116790 0.02980201 0.004520246 0.03522630 24983
## 8 agec(59,79] 5 3.31723271 0.02910634 0.002498749 0.03210484 24983
## 9 agec(79,99] 5 0.74077549 0.04482372 0.010626923 0.05757602 24983
## 10 educ0Prim 5 0.48582499 0.07856408 0.030332068 0.11496256 24983
## 11 educ1somehs 5 -0.16325284 0.04592982 0.010701301 0.05877138 24983
## 12 educ3somecol 5 -0.08456945 0.01181514 0.007018025 0.02023677 24983
## 13 educ4colgrad 5 -1.06773061 0.01137210 0.014918571 0.02927439 24983
## 14 race_ethhispanic 5 0.38711951 0.02089252 0.001636899 0.02285680 24983
## 15 race_ethnh black 5 1.38771278 0.01684907 0.003222442 0.02071600 24983
## 16 race_ethnh multirace 5 0.27098289 0.08553636 0.012726518 0.10080819 24983
## 17 race_ethnh other 5 -1.60448020 0.03934524 0.002036888 0.04178950 24983
## df riv lambda fmi
## 1 28.44082 0.59918728 0.37468237 0.41445982
## 2 34.22280 0.51865602 0.34152304 0.37690335
## 3 17.10785 0.93505714 0.48321940 0.53462028
## 4 17.42425 0.91868166 0.47880880 0.52984530
## 5 21.84473 0.74699083 0.42758715 0.47366636
## 6 20.24358 0.79917837 0.44419074 0.49201549
## 7 167.36115 0.18201107 0.15398423 0.16391626
## 8 449.45663 0.10301875 0.09339710 0.09740457
## 9 81.19864 0.28449910 0.22148641 0.23997872
## 10 39.81002 0.46329675 0.31661162 0.34853818
## 11 83.42503 0.27959092 0.21850024 0.23658527
## 12 23.06019 0.71278303 0.41615489 0.46096233
## 13 10.68416 1.57422853 0.61153410 0.66831011
## 14 529.05814 0.09401832 0.08593852 0.08937447
## 15 114.15425 0.22950412 0.18666397 0.20054884
## 16 172.86757 0.17854185 0.15149386 0.16114324
## 17 1113.85192 0.06212354 0.05848994 0.06017594
summary(est.p)
We need to pay attention to the fmi column and the lambda column. These convey information about how much the missingness of each particular variable affects the model coefficients.
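These quantities come from Rubin's rules: with m imputations, the total variance of a pooled coefficient is t = ubar + (1 + 1/m) * b, where ubar is the average within-imputation variance and b is the between-imputation variance. lambda is the share of the total variance attributable to the missing data, lambda = (1 + 1/m) * b / t, and fmi additionally adjusts for the finite number of imputations. We can verify the lambda column directly from the pooled object:
#check lambda by hand from the pooled components
pl<-est.p$pooled
all.equal(pl$lambda, (1 + 1/pl$m)*pl$b/pl$t)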
lam<-data.frame(lam=est.p$pooled$lambda, param=est.p$pooled$term)
ggplot(data=lam,aes(x=param, y=lam))+geom_col()+theme(axis.text.x = element_text(angle = 45, hjust = 1))
It appears that a few of the education and income terms have large values of lambda. This suggests that there may be noticeable variation in the resulting coefficients of the model, depending on which imputed data set we use.
We can also compare to the model fit on the original data, with missings eliminated:
library(dplyr)
bnm<-brfss_17%>%
select(bmi, inc, agec, educ, race_eth)%>%
filter(complete.cases(.))%>%
as.data.frame()
summary(lm(bmi~inc+agec+educ+race_eth, bnm))
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth, data = bnm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.593 -4.022 -0.952 2.895 65.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.0718 0.2465 105.788 < 2e-16 ***
## inc2_15-25k -0.2512 0.1882 -1.335 0.18192
## inc3_25-35k -0.5160 0.2057 -2.508 0.01213 *
## inc4_35-50k -0.4409 0.1952 -2.259 0.02390 *
## inc5_50kplus -1.0528 0.1733 -6.076 1.26e-09 ***
## agec(24,39] 2.9123 0.2130 13.673 < 2e-16 ***
## agec(39,59] 3.8858 0.2025 19.185 < 2e-16 ***
## agec(59,79] 3.5116 0.2012 17.455 < 2e-16 ***
## agec(79,99] 0.7940 0.2514 3.158 0.00159 **
## educ0Prim 0.2611 0.3542 0.737 0.46106
## educ1somehs -0.3379 0.2530 -1.335 0.18181
## educ3somecol -0.0481 0.1231 -0.391 0.69591
## educ4colgrad -1.1201 0.1199 -9.342 < 2e-16 ***
## race_ethhispanic 0.2501 0.1637 1.528 0.12650
## race_ethnh black 1.5093 0.1473 10.248 < 2e-16 ***
## race_ethnh multirace 0.2954 0.3237 0.913 0.36147
## race_ethnh other -1.4296 0.2295 -6.228 4.81e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.031 on 19366 degrees of freedom
## Multiple R-squared: 0.05047, Adjusted R-squared: 0.04969
## F-statistic: 64.34 on 16 and 19366 DF, p-value: < 2.2e-16
Here, I compare the coefficients from the model where we eliminated all missing data to the one that we fit on the imputed data:
fit1<-lm(bmi~inc+agec+educ+race_eth, data=brfss_17)
summary(fit1)
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth, data = brfss_17)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.593 -4.022 -0.952 2.895 65.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.0718 0.2465 105.788 < 2e-16 ***
## inc2_15-25k -0.2512 0.1882 -1.335 0.18192
## inc3_25-35k -0.5160 0.2057 -2.508 0.01213 *
## inc4_35-50k -0.4409 0.1952 -2.259 0.02390 *
## inc5_50kplus -1.0528 0.1733 -6.076 1.26e-09 ***
## agec(24,39] 2.9123 0.2130 13.673 < 2e-16 ***
## agec(39,59] 3.8858 0.2025 19.185 < 2e-16 ***
## agec(59,79] 3.5116 0.2012 17.455 < 2e-16 ***
## agec(79,99] 0.7940 0.2514 3.158 0.00159 **
## educ0Prim 0.2611 0.3542 0.737 0.46106
## educ1somehs -0.3379 0.2530 -1.335 0.18181
## educ3somecol -0.0481 0.1231 -0.391 0.69591
## educ4colgrad -1.1201 0.1199 -9.342 < 2e-16 ***
## race_ethhispanic 0.2501 0.1637 1.528 0.12650
## race_ethnh black 1.5093 0.1473 10.248 < 2e-16 ***
## race_ethnh multirace 0.2954 0.3237 0.913 0.36147
## race_ethnh other -1.4296 0.2295 -6.228 4.81e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.031 on 19366 degrees of freedom
## (5617 observations deleted due to missingness)
## Multiple R-squared: 0.05047, Adjusted R-squared: 0.04969
## F-statistic: 64.34 on 16 and 19366 DF, p-value: < 2.2e-16
fit.imp<-lm(bmi~inc+agec+educ+race_eth, data=dat.imp)
summary(fit.imp)
##
## Call:
## lm(formula = bmi ~ inc + agec + educ + race_eth, data = dat.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.328 -4.072 -0.985 2.864 65.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.0887 0.2114 123.397 < 2e-16 ***
## inc2_15-25k -0.1531 0.1637 -0.936 0.34940
## inc3_25-35k -0.5448 0.1804 -3.020 0.00253 **
## inc4_35-50k -0.4067 0.1723 -2.360 0.01829 *
## inc5_50kplus -0.9910 0.1531 -6.474 9.73e-11 ***
## agec(24,39] 2.7745 0.1833 15.134 < 2e-16 ***
## agec(39,59] 3.8217 0.1734 22.041 < 2e-16 ***
## agec(59,79] 3.2960 0.1713 19.238 < 2e-16 ***
## agec(79,99] 0.6177 0.2125 2.907 0.00365 **
## educ0Prim 0.4120 0.2813 1.465 0.14305
## educ1somehs -0.2355 0.2156 -1.092 0.27481
## educ3somecol -0.1365 0.1092 -1.250 0.21117
## educ4colgrad -0.9614 0.1071 -8.974 < 2e-16 ***
## race_ethhispanic 0.4341 0.1449 2.996 0.00274 **
## race_ethnh black 1.3222 0.1303 10.151 < 2e-16 ***
## race_ethnh multirace 0.2438 0.2942 0.829 0.40732
## race_ethnh other -1.6502 0.1989 -8.296 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.138 on 24983 degrees of freedom
## Multiple R-squared: 0.04791, Adjusted R-squared: 0.0473
## F-statistic: 78.58 on 16 and 24983 DF, p-value: < 2.2e-16
Comparing the two models, the income effects on BMI are similar and remain significant after imputation, but some coefficients change noticeably: for example, the Hispanic coefficient is larger and statistically significant once the missing values are imputed. This suggests a selection effect in who is missing on these variables.
If we wanted to see the ranges of the betas in the five imputed data models, we could do that:
#get the coefficients from each of the 5 imputations of bmi
coefs<-as.data.frame(matrix(unlist(lapply(fit.bmi$analyses, coef)), nrow=5, ncol=17, byrow=T))
names(coefs)<-names(fit.bmi$analyses[[1]]$coef)
#plot the coefficients from each of the different rounds of imputation to see the variability in the
#results
coefs%>%
select(`inc2_15-25k`:`race_ethnh other`)%>%
melt()%>%
mutate(imp=rep(1:5, 16))%>%
ggplot()+geom_point(aes(y=value,x=variable, group=variable, color=as.factor(imp) ))+ggtitle("Estimated Betas from Each Imputed Regression Model for BMI Outcome")+theme(axis.text.x = element_text(angle = 45, hjust = 1))
## No id variables; using all as measure variables