Including survey design information in multilevel models
When analyzing data from surveys, we must pay attention to the design of the survey. Survey data sources provide information in their codebooks or user guides identifying the survey design variables in the data, as well as the person-level weighting factors associated with each respondent.
These pieces of information are integral to any analysis of survey data: the design variables (survey stratum and primary sampling unit identifiers) allow us to correctly calculate standard errors for the parameters of the models we construct, while the person weights correct for differential non-response, any oversampling built into the survey design, and other factors needed to make the sample representative of the population it was designed to represent.
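Design-based analyses declare these variables explicitly. A hypothetical sketch with the survey package, where svy_dat, strat, and wt are placeholder names for a survey dataset, its stratum identifier, and its person weight:
library(survey)
#declare the design using the stratum and weight variables from the codebook
des<-svydesign(ids=~1, strata=~strat, weights=~wt, data=svy_dat)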
Multilevel models are natural choices for analyzing survey data, because their nested specifications can directly mimic the design of the survey. Indeed, Snijders and Bosker (2012) devote chapter 14 of their multilevel modeling text to using multilevel models to analyze data from complex survey designs.
They offer several suggestions for using multilevel models on complex survey design data (such as the BRFSS). We will discuss and illustrate two of these suggestions below; a schematic of both appears after this paragraph.
Method 1 controls for the non-randomness of observations within sampling units by including the sample design variables (stratum and primary sampling unit identifiers) as random effects in the model, which accounts for homogeneity within, and heterogeneity between, these survey units. Method 2 uses the case weights provided in the data, including them as weights in the model to correct for the factors the weights were constructed to address (representativeness, differential non-response, etc.).
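In lme4 syntax the two approaches look like the following schematic, where y, x, dat, and scaled_wt are hypothetical placeholders, and mmsa and ststr are the BRFSS design variables used throughout this lesson:
library(lme4)
#Method 1 (sketch): design variables as random effects
#m1<-glmer(y ~ x + (1|mmsa) + (1|ststr), family=binomial, data=dat)
#Method 2 (sketch): scaled person weights supplied via the weights argument
#m2<-glmer(y ~ x + (1|mmsa), weights=scaled_wt, family=binomial, data=dat)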
Now we turn to the practical aspects of including survey design information in the multilevel model. Method 1 is straightforward, as long as our data include the sampling design variables, which most survey datasets do. In the BRFSS, the variable ststr identifies the sampling stratum. Following Snijders and Bosker's suggestion, we include a random effect in the model for the survey strata, so the model now has two random effects: one at the MSA level and one at the stratum level.
To measure macro-level variables, I will include some Census variables from the ACS 2011 5-year estimates, loaded from tidycensus. The first set of variables includes information on the economic conditions of each metro area, specifically poverty and unemployment.
library(tidycensus)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
    filter, lag
The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union
library(ggplot2)
usacs<-get_acs(geography = "metropolitan statistical area/micropolitan statistical area", year = 2011,
variables=c( "DP05_0001E", "DP03_0009P", "DP03_0062E", "DP03_0119PE",
"DP05_0001E","DP02_0009PE","DP02_0008PE", "DP02_0040E","DP02_0038E",
"DP02_0066PE","DP02_0067PE","DP02_0080PE","DP02_0092PE",
"DP03_0005PE","DP03_0028PE","DP03_0062E","DP03_0099PE","DP03_0101PE",
"DP03_0119PE","DP04_0046PE","DP05_0072PE","DP05_0073PE",
"DP05_0066PE","DP05_0033","DP05_0032", "DP05_0072PE", "DP02_0113PE", "DP04_0003PE") ,
summary_var = "B01001_001",
geometry = F, output = "wide")
Getting data from the 2007-2011 5-year ACS
Using the ACS Data Profile
usacs<-usacs%>%
mutate(totpop= DP05_0001E, fertrate = DP02_0040E,pwhite=DP05_0072PE/100, nwhite=DP05_0032E,nblack=DP05_0033E,
pblack=DP05_0073PE/100 , phisp=DP05_0066PE/100, pfemhh=DP02_0008PE/100,
phsormore=DP02_0066PE/100,punemp=DP03_0009PE/100, medhhinc=DP03_0062E,
ppov=DP03_0119PE/100, pforn=DP02_0092PE/100,plep=DP02_0113PE/100, pvacant=DP04_0003PE/100)%>%
select(GEOID,NAME, totpop, pwhite, pblack, phisp, pfemhh, phsormore, punemp, medhhinc, ppov, pforn,plep, pvacant)
head(usacs)
myscale<-function(x){as.numeric(scale(x))}
usacs<-usacs %>%
mutate_at(c("ppov","punemp", "pblack","phisp", "pfemhh", "phsormore", "medhhinc", "pvacant"),myscale)
merged<-usacs%>%filter(complete.cases(.))
head(merged)
Let’s see the geographic variation in these economic indicators. We can make a quick map of the indicators by merging these data with the appropriate level of Census geography; in this case, MSAs correspond to Core Based Statistical Areas (CBSAs) in the Census geographic hierarchy.
library(tigris)
library(sf)
options(tigris_class="sf")
msa<-core_based_statistical_areas(cb=T)
msa<-msa%>%
st_transform(crs = 102009)
sts<-states(cb = T)
sts<-sts%>%
st_transform(crs = 102009)%>%
filter(!STATEFP%in%c(60, 66, 69, 72, 78))
msa_ec<-geo_join(msa, usacs, "CBSAFP", "GEOID", how="inner")
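With the joined data, a quick map of one indicator, here the standardized poverty rate ppov, could be sketched as:
#Sketch: map the standardized poverty rate across MSAs, with state outlines
ggplot()+
  geom_sf(data=msa_ec, aes(fill=ppov))+
  geom_sf(data=sts, fill=NA)+
  ggtitle("Poverty rate (z-score) by MSA, ACS 2007-2011")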
load("~/Google Drive/classes/dem7283/class_19_7283/data/brfss16_mmsa.Rdata")
set.seed(12345)
#samps<-sample(1:nrow(brfss16m), size = 40000, replace=F)
#brfss16m<-brfss16m[samps,]
#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss16m)
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
newnames<-gsub(pattern = "_",replacement = "",x = nams)
names(brfss16m)<-tolower(newnames)
library(car)
Loading required package: carData
Attaching package: ‘car’
The following object is masked from ‘package:dplyr’:
    recode
library(dplyr)
###BRFSS DATA
#nice MSA name
brfss16m$mmsa_name<-substr(brfss16m$mmsaname, 1,nchar(brfss16m$mmsaname)-31)
#sex
brfss16m$male<-ifelse(brfss16m$sex==1, 1, 0)
#BMI: bmi5 has 2 implied decimal places, so we divide by 100 to get real BMIs
brfss16m$bmi<-brfss16m$bmi5/100
#Healthy days
brfss16m$healthdays<-Recode(brfss16m$physhlth, recodes = "88=0; 77=NA; 99=NA")
#Healthy mental health days
brfss16m$healthmdays<-Recode(brfss16m$menthlth, recodes = "88=0; 77=NA; 99=NA")
brfss16m$badhealth<-Recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-Recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-Recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-Recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-Recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")
brfss16m$race_eth<-Recode(brfss16m$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")
#insurance
brfss16m$ins<-Recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss16m$inc<-Recode(brfss16m$incomg, recodes = "9= NA;1='1_lt15k'; 2='2_15-25k';3='3_25-35k';4='4_35-50k';5='5_50kplus'", as.factor = T)
brfss16m$inc<-as.ordered(brfss16m$inc)
#education level
brfss16m$educ<-Recode(brfss16m$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')
#employment
brfss16m$employ<-Recode(brfss16m$employ1,
recodes="1:2='employed'; 3:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='employed')
#marital status
brfss16m$marst<-Recode(brfss16m$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')
#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=seq(20,80,10))
#smoking currently
brfss16m$smoke<-Recode(brfss16m$smoker3,
recodes="1:2=1; 3:4=0; else=NA")
#brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")
brfss16m$obese<-ifelse(is.na(brfss16m$bmi)==T, NA,
ifelse(brfss16m$bmi>30,1,0))
The merge command in R has several arguments: x and y are the individual- and higher-level data frames, and by.x and by.y name the key field (the geographic identifier in this case) in each table; the keys need not have the same name. The all.x=F argument tells R to keep only the BRFSS observations that have a matching MSA in the MSA table.
merged<-merge(x=brfss16m,
y=usacs,
by.x="mmsa",
by.y="GEOID",
all.x=F)
merged<-merged%>%
select( mmsa, ststr, agec, educ, black, hispanic, other,smoke,obese, badhealth,pblack,phisp, medhhinc,ppov, male, mmsawt, mmsa_name )%>%
filter(complete.cases(.))
library(lme4)
Loading required package: Matrix
model1<-glmer(obese ~ agec + male + educ + black + hispanic + other + medhhinc + pblack + phisp+ (1|mmsa_name),
family=binomial,
data=merged,
control = glmerControl(optimizer = c("Nelder_Mead","bobyqa"),
optCtrl=list(maxfun=2e9)))
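To fit the model that adds the survey strata (model1s, compared to model1 below), we include a second random intercept for ststr. A minimal sketch, assuming the same fixed effects as model1:
model1s<-glmer(obese ~ agec + male + educ + black + hispanic + other + medhhinc + pblack + phisp + (1|mmsa_name) + (1|ststr),
family=binomial,
data=merged,
control = glmerControl(optimizer = c("Nelder_Mead","bobyqa"),
optCtrl=list(maxfun=2e9)))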
When we compare model1s to model1 from earlier in the lesson, we see that the standard errors for some of the parameters have increased, and some of the \(\beta\)’s have also increased to a small extent. The effects of the higher-level variables are smaller in the model that includes the survey strata.
library(stargazer)
stargazer(model1, model1s,
column.labels = c("No Survey Design", "Including Strata"),
type = "text",
style="demography", out = "~/Google Drive/classes/dem7283/class_19_7283/code/table2.html")
------------------------------------------------
obese
No Survey Design Including Strata
Model 1 Model 2
------------------------------------------------
agec(30,40] 0.457*** 0.457***
(0.025) (0.025)
agec(40,50] 0.660*** 0.659***
(0.024) (0.024)
agec(50,60] 0.627*** 0.624***
(0.022) (0.023)
agec(60,70] 0.595*** 0.594***
(0.022) (0.022)
agec(70,80] 0.125*** 0.123***
(0.023) (0.023)
male 0.059*** 0.060***
(0.011) (0.011)
educ0Prim 0.106** 0.108**
(0.038) (0.038)
educ1somehs 0.111*** 0.111***
(0.026) (0.026)
educ3somecol -0.049*** -0.048***
(0.014) (0.014)
educ4colgrad -0.432*** -0.427***
(0.014) (0.014)
black 0.517*** 0.518***
(0.019) (0.019)
hispanic 0.193*** 0.196***
(0.024) (0.024)
other -0.081** -0.079**
(0.027) (0.027)
medhhinc -0.088*** -0.085***
(0.016) (0.016)
pblack -0.028 -0.023
(0.016) (0.016)
phisp -0.078*** -0.076***
(0.016) (0.017)
Constant -1.109*** -1.112***
(0.027) (0.027)
N 169,887 169,887
Log Likelihood -101,575.400 -101,559.500
AIC 203,186.800 203,157.100
BIC 203,367.500 203,347.900
------------------------------------------------
*p < .05; **p < .01; ***p < .001
The second method includes each respondent's sample weight in the analysis. Before we do this, we should inspect the weights. Person weights are numeric values that scale each person in the data up to the number of people in the target population that they represent. They are constructed by calculating each person's probability of selection into the survey, based on their characteristics, and then inverting it to form an inverse probability weight. A person who is relatively rare in the sample, and therefore represents many people in the population, will have a large weight, and vice versa.
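For instance, a hypothetical illustration (not actual BRFSS values): a person whose probability of selection is 1/5,000 receives a weight of 5,000.
p_select<-1/5000 #hypothetical probability of selection
w<-1/p_select #inverse probability weight: this person represents 5,000 people
w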
Examining the numerical values of the BRFSS weights shows these expansion factors.
In the summary below, the minimum weight is 1.24 and the maximum is 31,222.72, indicating that at least one person in the survey represents over 31,000 people in the US adult population.
summary(merged$mmsawt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.237 104.459 242.248 528.153 561.161 31222.720
Indeed, if we sum the weights, we get a value of 89,726,275, which is approximately the total adult population living in the sampled metropolitan areas in 2016.
sum(merged$mmsawt)
[1] 89726275
If we include the weights in the model as is, they will inflate the sample size from 169,887 to 89,726,275, which would lead to problems with inference, as the standard errors would become artificially small. Here is what this would look like.
glm1<-glm(obese ~ agec + male + educ + black + hispanic + other,
data=merged,
family = binomial)
glmw<-glm(obese ~ agec + male + educ + black + hispanic + other,
data=merged,
family = binomial,
weights=mmsawt)
non-integer #successes in a binomial glm!
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
In the model where the weights are included as is, the model coefficients become huge and we get numerical warnings about fitted probabilities of 0 and 1 being produced. This is not good.
Instead of using the raw weights, we need to scale them in order to use them in models. There are several ways to do this, and Carle (2009) provides a few options.
The most straightforward is to divide the weights by their mean, which makes the average weight 1 instead of 528.
mean(merged$mmsawt)
[1] 528.1527
mean(merged$mmsawt/mean(merged$mmsawt))
[1] 1
We can use the scaled weights in the analysis, and they will do the job they are designed to do.
glm_w2<-glm(obese ~ agec + male + educ + black + hispanic + other,
data=merged,
family = binomial,
weights=mmsawt/mean(mmsawt))
non-integer #successes in a binomial glm!
summary(glm_w2)
Call:
glm(formula = obese ~ agec + male + educ + black + hispanic +
other, family = binomial, data = merged, weights = mmsawt/mean(mmsawt))
Deviance Residuals:
Min 1Q Median 3Q Max
-6.9578 -0.6907 -0.3704 0.5706 12.4060
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.11664 0.01809 -61.737 < 2e-16 ***
agec(30,40] 0.36414 0.01848 19.705 < 2e-16 ***
agec(40,50] 0.60556 0.01848 32.767 < 2e-16 ***
agec(50,60] 0.52713 0.01822 28.929 < 2e-16 ***
agec(60,70] 0.53464 0.01922 27.810 < 2e-16 ***
agec(70,80] 0.07203 0.02134 3.375 0.000739 ***
male 0.02008 0.01074 1.869 0.061629 .
educ0Prim 0.12094 0.02866 4.219 2.45e-05 ***
educ1somehs 0.05256 0.02119 2.480 0.013120 *
educ3somecol -0.05098 0.01379 -3.696 0.000219 ***
educ4colgrad -0.50425 0.01483 -34.000 < 2e-16 ***
black 0.45809 0.01558 29.394 < 2e-16 ***
hispanic 0.13244 0.01576 8.404 < 2e-16 ***
other -0.49209 0.02609 -18.865 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 208748 on 169886 degrees of freedom
Residual deviance: 203394 on 169873 degrees of freedom
AIC: 189606
Number of Fisher Scoring iterations: 4
When we compare this output to that of the glm1 model, we see substantive differences: many of the coefficients change, and once the weights are applied the estimates are corrected for the survey design and, in that sense, unbiased for the target population.
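The comparison table below was presumably produced by a stargazer call like this sketch, matching the column labels shown:
stargazer(glm1, glm_w2,
column.labels = c("No Weights", "Scaled Weights"),
type = "text",
style="demography")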
------------------------------------------
obese
No Weights Scaled Weights
Model 1 Model 2
------------------------------------------
agec(30,40] 0.456*** 0.364***
(0.025) (0.018)
agec(40,50] 0.653*** 0.606***
(0.024) (0.018)
agec(50,60] 0.621*** 0.527***
(0.022) (0.018)
agec(60,70] 0.590*** 0.535***
(0.022) (0.019)
agec(70,80] 0.117*** 0.072***
(0.023) (0.021)
male 0.056*** 0.020
(0.011) (0.011)
educ0Prim 0.119** 0.121***
(0.038) (0.029)
educ1somehs 0.121*** 0.053*
(0.026) (0.021)
educ3somecol -0.064*** -0.051***
(0.014) (0.014)
educ4colgrad -0.462*** -0.504***
(0.014) (0.015)
black 0.514*** 0.458***
(0.018) (0.016)
hispanic 0.091*** 0.132***
(0.020) (0.016)
other -0.106*** -0.492***
(0.027) (0.026)
Constant -1.165*** -1.117***
(0.022) (0.018)
N 169,887 169,887
Log Likelihood -101,830.900 -94,788.880
AIC 203,689.700 189,605.800
------------------------------------------
*p < .05; **p < .01; ***p < .001
Scaling the weights by the mean weight is a good, simple option, but Snijders and Bosker, as well as Carle, suggest scaling the weights to the stratum sample sizes. This involves a little more coding, but is easy with dplyr.
First, we make an id that equals the strata variable. Then we sum the weights within each stratum and count the number of respondents in each stratum. Finally, we merge these totals back to the data and standardize the weights, as shown in the formula and code below.
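In symbols, the scaled weight for person \(i\) in stratum \(j\) is \[w^*_{ij} = w_{ij}\frac{n_j}{\sum_i w_{ij}}\] where \(n_j\) is the stratum sample size and \(\sum_i w_{ij}\) is the sum of the raw weights in stratum \(j\); this is what the code below computes.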
wts_strat<-merged%>%
mutate(sampleid=ststr)%>%
group_by(sampleid)%>%
summarise(sumwt=sum(mmsawt), nj=n())%>%
ungroup()
merged<-merge(merged, wts_strat, by.x="ststr", by.y="sampleid", all.x=T)
#In the new data set, multiply the original weight by the fraction of the
#sampling unit total population each person represents
merged$swts<-merged$mmsawt*(merged$nj/merged$sumwt)
Now that we have weights standardized to the within-stratum sample sizes, a quick summary shows that they are indeed much smaller than the original weights, and very similar to the mean-standardized weights.
summary(merged$swts)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01599 0.54348 0.82436 1.00000 1.24991 23.68977
summary(merged$mmsawt/mean(merged$mmsawt))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00234 0.19778 0.45867 1.00000 1.06250 59.11684
Now that we have our weights, we can use them in the multilevel model and see whether they affect our results.
model2w<-glmer(obese~ agec + male + educ + black + hispanic + other + ppov + black*pblack + hispanic*phisp + black*medhhinc + hispanic*medhhinc+ (1|mmsa_name)+(1|ststr),
family=binomial,
data=merged,
weights=swts,
control = glmerControl(optimizer = c("Nelder_Mead","bobyqa"),
optCtrl=list(maxfun=2e9)))
non-integer #successes in a binomial glm!
summary(model2w)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [
glmerMod]
Family: binomial ( logit )
Formula: obese ~ agec + male + educ + black + hispanic + other + ppov +
black * pblack + hispanic * phisp + black * medhhinc + hispanic *
medhhinc + (1 | mmsa_name) + (1 | ststr)
Data: merged
Weights: swts
Control:
glmerControl(optimizer = c("Nelder_Mead", "bobyqa"), optCtrl = list(maxfun = 2e+09))
AIC BIC logLik deviance df.resid
189787.7 190028.7 -94869.8 189739.7 169863
Scaled residuals:
Min 1Q Median 3Q Max
-3.1351 -0.6557 -0.4441 0.9906 7.8720
Random effects:
Groups Name Variance Std.Dev.
ststr (Intercept) 0.02070 0.1439
mmsa_name (Intercept) 0.01024 0.1012
Number of obs: 169887, groups: ststr, 1248; mmsa_name, 123
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.08639 0.02751 -39.484 < 2e-16 ***
agec(30,40] 0.42180 0.02044 20.634 < 2e-16 ***
agec(40,50] 0.63479 0.02036 31.176 < 2e-16 ***
agec(50,60] 0.59386 0.01996 29.753 < 2e-16 ***
agec(60,70] 0.56213 0.02095 26.836 < 2e-16 ***
agec(70,80] 0.07917 0.02264 3.496 0.000472 ***
male 0.05159 0.01118 4.615 3.92e-06 ***
educ0Prim 0.07372 0.03148 2.341 0.019207 *
educ1somehs 0.10270 0.02158 4.759 1.95e-06 ***
educ3somecol -0.03807 0.01399 -2.722 0.006495 **
educ4colgrad -0.46244 0.01575 -29.362 < 2e-16 ***
black 0.39042 0.03406 11.461 < 2e-16 ***
hispanic 0.06892 0.03848 1.791 0.073296 .
other -0.22961 0.02818 -8.150 3.65e-16 ***
ppov -0.01912 0.05600 -0.342 0.732722
pblack -0.01748 0.01968 -0.888 0.374475
phisp -0.05485 0.02917 -1.880 0.060070 .
medhhinc -0.09722 0.02710 -3.587 0.000334 ***
black:pblack 0.10874 0.02325 4.676 2.92e-06 ***
hispanic:phisp 0.07541 0.02611 2.889 0.003869 **
black:medhhinc -0.02107 0.02507 -0.841 0.400584
hispanic:medhhinc 0.09313 0.02802 3.324 0.000887 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation matrix not shown by default, as p = 22 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
Indeed, our cross-level interaction term between the individual-level black dummy variable and the MSA-level pblack variable is now significant in the model. This suggests that Non-Hispanic black people living in MSAs with higher proportions of Non-Hispanic black population have higher odds of being obese compared to those in MSAs with an average proportion of Non-Hispanic black population. The interaction between the black dummy and MSA-level median household income remains non-significant, although the corresponding interaction for Hispanics is significant.
This also suggests that the weights are likely controlling for some type of non-response bias or non-representativeness in the survey.
This example illustrates a way to combine individual survey data with aggregate data on MSAs to produce an MSA-level estimate of essentially any health indicator measured in the BRFSS. The framework I use below takes observed individual-level survey responses from the BRFSS and merges them with MSA-level variables from the ACS. This allows me to estimate the overall regression model for MSA-level prevalence, controlling for MSA-level variables. I can then use this equation to predict prevalence for MSAs where I have not observed survey respondents but have observed the MSA-level characteristics.
This corresponds to a multilevel logistic model with higher-level variables as predictors and can be written: \[\ln \left( \frac{\pi_{ij}}{1-\pi_{ij}} \right) = \beta_{0j} + \sum_k \beta_k x_{ik} + \sum_m \gamma_m z_{mj}\]
library(lme4)
library(lmerTest)
Attaching package: ‘lmerTest’
The following object is masked from ‘package:lme4’:
    lmer
The following object is masked from ‘package:stats’:
    step
fit.mix.bin<-glmer(obese ~ agec + male + black + hispanic + other + (1|mmsa),
family=binomial,
data=merged,
weights=mmsawt/mean(mmsawt),
control = glmerControl(optimizer = c("Nelder_Mead","bobyqa"),
optCtrl=list(maxfun=2e9)))
non-integer #successes in a binomial glm!
summary(fit.mix.bin)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [
glmerMod]
Family: binomial ( logit )
Formula: obese ~ agec + male + black + hispanic + other + (1 | mmsa)
Data: merged
Weights: mmsawt/mean(mmsawt)
Control:
glmerControl(optimizer = c("Nelder_Mead", "bobyqa"), optCtrl = list(maxfun = 2e+09))
AIC BIC logLik deviance df.resid
180331.5 180442.0 -90154.8 180309.5 169876
Scaled residuals:
Min 1Q Median 3Q Max
-5.2424 -0.5463 -0.3037 0.5321 13.2273
Random effects:
Groups Name Variance Std.Dev.
mmsa (Intercept) 0.02119 0.1456
Number of obs: 169887, groups: mmsa, 123
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.26469 0.02175 -58.148 < 2e-16 ***
agec(30,40] 0.33610 0.01880 17.878 < 2e-16 ***
agec(40,50] 0.59211 0.01891 31.305 < 2e-16 ***
agec(50,60] 0.54029 0.01890 28.593 < 2e-16 ***
agec(60,70] 0.56751 0.02047 27.727 < 2e-16 ***
agec(70,80] 0.16293 0.02315 7.038 1.95e-12 ***
male 0.01646 0.01140 1.444 0.149
black 0.52829 0.01663 31.769 < 2e-16 ***
hispanic 0.33010 0.01732 19.058 < 2e-16 ***
other -0.55336 0.02788 -19.849 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) a(30,4 a(40,5 a(50,6 a(60,7 a(70,8 male black hispnc
agec(30,40] -0.461
agec(40,50] -0.463 0.534
agec(50,60] -0.473 0.534 0.534
agec(60,70] -0.448 0.493 0.494 0.500
agec(70,80] -0.411 0.436 0.438 0.444 0.415
male -0.281 0.005 0.006 0.004 0.008 0.037
black -0.162 0.007 0.020 0.037 0.061 0.068 0.025
hispanic -0.174 0.000 0.035 0.077 0.109 0.116 -0.004 0.215
other -0.107 0.006 0.016 0.045 0.056 0.067 0.005 0.126 0.154
#odds ratios
exp(fixef(fit.mix.bin)[-1])
agec(30,40] agec(40,50] agec(50,60] agec(60,70] agec(70,80] male black
1.3994751 1.8078076 1.7165037 1.7638778 1.1769486 1.0165935 1.6960381
hispanic other
1.3911079 0.5750138
exp(confint(fit.mix.bin, method="Wald"))
2.5 % 97.5 %
.sig01 NA NA
(Intercept) 0.2705446 0.2946222
agec(30,40] 1.3488476 1.4520029
agec(40,50] 1.7420162 1.8760838
agec(50,60] 1.6540949 1.7812671
agec(60,70] 1.6945177 1.8360769
agec(70,80] 1.1247424 1.2315779
male 0.9941402 1.0395539
black 1.6416499 1.7522281
hispanic 1.3446748 1.4391444
other 0.5444374 0.6073075
These predicted values allow us to set up a counterfactual type of analysis: we can change a predictor's value away from its observed value to see how the predicted outcome changes, as sketched after the predictions below.
#get newdata using real values
dat1<-expand.grid(male=mean(merged$male), agec=levels(merged$agec), black=c(0,1), hispanic=c(0,1), other=c(0,1) , mmsa= levels(as.factor(merged$mmsa)))
#predictions using the observed covariate values
dat1$pred<-predict(fit.mix.bin, dat1, type="response", re.form=~(1 | mmsa))
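For a true counterfactual, we could then modify one of the predictors. A hypothetical sketch, where dat2 and pred_cf are illustrative names not in the original code:
#Hypothetical counterfactual: predict as if everyone were in the oldest age group
dat2<-dat1
dat2$agec<-factor("(70,80]", levels=levels(merged$agec))
dat2$pred_cf<-predict(fit.mix.bin, dat2, type="response", re.form=~(1|mmsa))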
library(ipumsr)
ddi<-read_ipums_ddi("~/Google Drive/classes/dem7283/class_19_7283/data/usa_00077.xml")
census<-read_ipums_micro(ddi)
Use of data from IPUMS-USA is subject to conditions including that users should
cite the data appropriately. Use command `ipums_conditions()` for more details.
names(census)<-tolower(names(census))
census<-zap_labels(census)
census<-census%>%
filter(age>20)
census$hisp <- recode(census$hispan, recodes = "9=NA; 1:4='Hispanic'; 0='NotHispanic'")
NAs introduced by coercion
census$race_rec <- recode(census$race, recodes = "1='White'; 2='Black'; 3='Other/Multiple'; 4:6='Asian'; 7:9='Other/Multiple'", as.factor = T)
census$race_eth <- interaction(census$hisp, census$race_rec, sep = " ")
census$race_eth <- as.factor(ifelse(substr(as.character(census$race_eth),1,8) == "Hispanic", "Hispanic", as.character(census$race_eth)))
census$race_eth <- relevel(census$race_eth, ref = "NotHispanic White")
census$black<-ifelse(census$race_eth=="NotHispanic Black",1,0)
census$other<-ifelse(census$race_eth%in%c("NotHispanic Asian", "NotHispanic Other/Multiple") ,1,0)
census$hispanic<-ifelse(census$race_eth=="Hispanic",1,0)
#sex (IV)
census$male <- ifelse(census$sex == 1,1,0)
census$pwt <- census$perwt
census$agec<-cut(census$age, breaks = seq(20, 80, 10))
library(survey)
Loading required package: grid
Loading required package: survival
Attaching package: ‘survey’
The following object is masked from ‘package:graphics’:
    dotchart
des<-svydesign(ids=~cluster, strata = ~strata, weights = ~pwt, data=census[census$statefip=="48",])
Now I estimate the population totals by the same characteristics that went into the GLMM.
msatotals<-svyby(~perwt, ~met2013+agec+male+black+hispanic+other, des, FUN=svytotal, na.rm=T)
msatotals$mmsa<-msatotals$met2013
head(msatotals)
The perwt column is the estimated number of people in each place within each combination of the characteristics provided.
Now we use these data to get fitted values for each row:
library(merTools)
package ‘merTools’ was built under R version 3.5.3
Loading required package: arm
Loading required package: MASS
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
    select
arm (Version 1.10-1, built: 2018-4-12)
Working directory is C:/Users/ozd504/Google Drive/classes/dem7283/class_19_7283/code
Attaching package: ‘arm’
The following object is masked from ‘package:car’:
    logit
msatotals$p<-predict(fit.mix.bin, newdata=msatotals,allow.new.levels=T, type="response" )
est_cis<-msa_totals_variances<-predictInterval(fit.mix.bin, newdata=msatotals, level=.5, n.sims=500, type="probability")
The following levels of mmsa from newdata
-- 11100, 13140, 15180, 19100, 29700, 31180, 32580, 33260, 36220, 41660, 46340, 47380 -- are not in the model data.
Currently, predictions for these values are based only on the
fixed coefficients and the observation-level error.
msatotals$est_lci<-est_cis$lwr
msatotals$est_uc<-est_cis$upr
msatotals$est_pt<-est_cis$fit
head(msatotals)
Now, to do post-stratification estimates, we need to calculate:
\[\pi^{pred}_{j} = \frac{\sum_l N_l \theta_l}{\sum_l N_l}\] where \(\theta_l\) is the probability of being obese in each of the \(l\) strata (groups of people), which we have as the p estimate from above. The \(N_l\) estimates are the estimated counts of the population by the age, sex, and race characteristics noted above. Now we make the estimates.
msatotals$est<-msatotals$p*msatotals$perwt
msarates<-msatotals%>%
group_by(mmsa)%>%
summarise(ob_rate_ps = sum(est)/sum(perwt))
head(msarates)
#San Antonio
msarates[msarates$mmsa=="41700",]
Not bad, based on other estimates.
Here is the rest of Texas:
msarates$mmsa_c<-as.character(msarates$mmsa)
ob_msa<-geo_join(msa_ec, msarates, by_sp="GEOID", by_df="mmsa_c", how="inner")
ob_msa%>%
ggplot()+geom_sf(aes(fill=ob_rate_ps))