Including survey design information in multilevel models
When analyzing data from surveys, we must pay attention to the design of the survey when we conduct our analysis. The survey data sources will provide information in their code books or user guides that identify the survey design variables in the data source, as well as the person-weighting factors associated with each respondent.
These pieces of information are integral when we do any analysis of survey data, because the design variables (survey stratum and primary sampling units identifiers) allow us to correctly calculate standard errors for model parameters in models that we construct. Likewise, the person weights in the data will correct for differential non-response to the survey, any oversampling that exists in the survey design and any factors to make the survey sample representative of the population it was designed to represent.
Multilevel models are natural models to use when analyzing survey data, because they can automatically mimic the design of the survey using the nested specifications we have discussed so far. Indeed, Snijders and Bosker (2012) in their text on multilevel models devote chapter 14 of their book to the discussion of how to use multilevel models to analyze complex survey design data.
They offer suggestions of things to do when using multilevel models on complex survey design data (such as the BRFSS). We will discuss and illustrate two of these suggestions below. These two methods are where we
Method 1 equates to controlling for non-randomness of observations within sampling units by including the sample design variables (stratum and primary sampling units) as random effects in the model. This controls for the homogeneity within and between these survey units. Method 2 uses the provided case weights in the data and includes them as weights in the model to correct for the design factors that the weights have been used to correct for (representativeness, differential non response, etc.).
Now we turn to the practical aspects of including survey design information in the multilevel model. Method 1 is straightforward, as long as the data we have includes the sampling design variables, which most survey datasets do. In the BRFSS, there is a variable ststr that is the sampling stratum. If we were to include this, as Snijders and Bosker suggest, then we will include a random effect in the model for the survey strata. In this case, we now have two random effects, the MSA level and the stratum level.
To measure macro level variables, I will include some Census variables from the ACS 2011 5 Year estimates load in ACS data from tidycensus. The first set of variables includes information on the economic conditions of the county, specifically poverty and unemployment.
library(tidycensus)
library(dplyr)
library(ggplot2)
usacs<-get_acs(geography = "metropolitan statistical area/micropolitan statistical area", year = 2011,
variables=c( "DP05_0001E", "DP03_0009P", "DP03_0062E", "DP03_0119PE",
"DP05_0001E","DP02_0009PE","DP02_0008PE", "DP02_0040E","DP02_0038E",
"DP02_0066PE","DP02_0067PE","DP02_0080PE","DP02_0092PE",
"DP03_0005PE","DP03_0028PE","DP03_0062E","DP03_0099PE","DP03_0101PE",
"DP03_0119PE","DP04_0046PE","DP05_0072PE","DP05_0073PE",
"DP05_0066PE","DP05_0033","DP05_0032", "DP05_0072PE", "DP02_0113PE", "DP04_0003PE") ,
summary_var = "B01001_001",
geometry = F, output = "wide")
Getting data from the 2007-2011 5-year ACS
Using the ACS Data Profile
Using the ACS Data Profile
usacs<-usacs%>%
mutate(totpop= DP05_0001E, fertrate = DP02_0040E,pwhite=DP05_0072PE/100, nwhite=DP05_0032E,nblack=DP05_0033E,
pblack=DP05_0073PE/100 , phisp=DP05_0066PE/100, pfemhh=DP02_0008PE/100,
phsormore=DP02_0066PE/100,punemp=DP03_0009PE/100, medhhinc=DP03_0062E,
ppov=DP03_0119PE/100, pforn=DP02_0092PE/100,plep=DP02_0113PE/100, pvacant=DP04_0003PE/100)%>%
select(GEOID,NAME, totpop, pwhite, pblack, phisp, pfemhh, phsormore, punemp, medhhinc, ppov, pforn,plep, pvacant)
head(usacs)
myscale<-function(x){as.numeric(scale(x))}
usacs<-usacs %>%
mutate_at(c("ppov","punemp", "pblack","phisp", "pfemhh", "phsormore", "medhhinc", "pvacant"),myscale)
merged<-usacs%>%filter(complete.cases(.))
head(merged)
Let’s see the geographic variation in these economic indicators: We can also make a quick map of the indicators by merging these data with the appropriate level of census geography, in this case, MSAs are the same as Core Based Statistical Areas (CBSAs) in the Census geographic hierarchy.
library(tigris)
library(sf)
options(tigris_class="sf")
msa<-core_based_statistical_areas(cb=T)
msa<-msa%>%
st_transform(crs = 102009)
sts<-states(cb = T)
sts<-sts%>%
st_transform(crs = 102009)%>%
filter(!STATEFP%in%c(60, 66, 69, 72, 78))
msa_ec<-geo_join(msa, usacs, "CBSAFP", "GEOID", how="inner")
load("~/Google Drive/classes/dem7283/class_19_7283/data/brfss16_mmsa.Rdata")
set.seed(12345)
#samps<-sample(1:nrow(brfss16m), size = 40000, replace=F)
#brfss16m<-brfss16m[samps,]
#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss16m)
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
newnames<-gsub(pattern = "_",replacement = "",x = nams)
names(brfss16m)<-tolower(newnames)
library(car)
library(dplyr)
###BRFSS DATA
#nice MSA name
brfss16m$mmsa_name<-substr(brfss16m$mmsaname, 1,nchar(brfss16m$mmsaname)-31)
#sex
brfss16m$male<-ifelse(brfss16m$sex==1, 1, 0)
#BMI
brfss16m$bmi<-ifelse(is.na(brfss16m$bmi5)==T, NA, brfss16m$bmi5/100)
#Healthy days
brfss16m$healthdays<-Recode(brfss16m$physhlth, recodes = "88=0; 77=NA; 99=NA")
#Healthy mental health days
brfss16m$healthmdays<-Recode(brfss16m$menthlth, recodes = "88=0; 77=NA; 99=NA")
brfss16m$badhealth<-Recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-Recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-Recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-Recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-Recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")
brfss16m$race_eth<-Recode(brfss16m$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")
#insurance
brfss16m$ins<-Recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss16m$inc<-Recode(brfss16m$incomg, recodes = "9= NA;1='1_lt15k'; 2='2_15-25k';3='3_25-35k';4='4_35-50k';5='5_50kplus'", as.factor = T)
brfss16m$inc<-as.ordered(brfss16m$inc)
#education level
brfss16m$educ<-Recode(brfss16m$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')
#employment
brfss16m$employ<-Recode(brfss16m$employ1,
recodes="1:2='employloyed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='employloyed')
#marital status
brfss16m$marst<-Recode(brfss16m$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')
#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=seq(20,80,10))
#BMI, in the brfss16ma the bmi variable has 2 implied decimal places,
#so we must divide by 100 to get real bmi's
brfss16m$bmi<-brfss16m$bmi5/100
#smoking currently
brfss16m$smoke<-Recode(brfss16m$smoker3,
recodes="1:2=1; 3:4=0; else=NA")
#brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")
brfss16m$obese<-ifelse(is.na(brfss16m$bmi)==T, NA,
ifelse(brfss16m$bmi>30,1,0))
brfss16m$owtob<-ifelse(is.na(brfss16m$bmi)==T, NA,
ifelse(brfss16m$bmi>25,1,0))
The merge command in R has several arguments. x and y are the names of the individual and higher level data frames, by.x and by.y are the names of the key field (geographic identifier in this case) in both tables, which don’t have to have the same name. The all.x=F command tells R to join only the observations from the BRFSS that have a matching MSA in the MSA table.
merged<-merge(x=brfss16m,
y=usacs,
by.x="mmsa",
by.y="GEOID",
all.x=F)
merged<-merged%>%
select( mmsa, ststr, agec, educ, black, hispanic, other,smoke,owtob, badhealth,pblack,phisp, medhhinc,ppov, male, mmsawt, mmsa_name )%>%
filter(complete.cases(.))
Scaling the weights by the mean weight is a good, simple option, but Snijders and Bosker and Carle all suggest scaling the weights to the stratum sample size. This involves a little more coding, but is easy with dplyr
First, we make an id that is the same as the strata variable. Then we sum the weights within strata and count up the number of people in each strata. Then we merge these back to the data, and standardize the weights.
wts_strat<-merged%>%
mutate(sampleid=ststr)%>%
group_by(sampleid)%>%
summarise(sumwt=sum(mmsawt), nj=n())%>%
ungroup()
merged<-merge(merged, wts_strat, by.x="ststr", by.y="sampleid", all.x=T)
#In the new data set, multiply the original weight by the fraction of the
#sampling unit total population each person represents
merged$swts<-merged$mmsawt*(merged$nj/merged$sumwt)
Now we have weights that are standardized to the within stratum sample sizes, we can do a quick summary of them and see that they are indeed much smaller than the original weights, and very similar to the mean standardized weights.
summary(merged$swts)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01599 0.54348 0.82436 1.00000 1.24991 23.68977
summary(merged$mmsawt/mean(merged$mmsawt))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00234 0.19778 0.45867 1.00000 1.06250 59.11684
So now we have our weights, we can use them in the multilevel model and see if it affects our results.
set.seed(1115)
library(lme4)
Loading required package: Matrix
samps<-sample(1:dim(merged)[1], size = 25000)
model2w<-glmer(owtob~ agec + male + black + hispanic + other + (1|mmsa_name)+(1|ststr),
family=binomial,
data=merged[samps, ],
weights=swts,
control = glmerControl(optimizer = c("Nelder_Mead","bobyqa"),
optCtrl=list(maxfun=2e9)))
non-integer #successes in a binomial glm!
summary(model2w)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: owtob ~ agec + male + black + hispanic + other + (1 | mmsa_name) + (1 | ststr)
Data: merged[samps, ]
Weights: swts
Control: glmerControl(optimizer = c("Nelder_Mead", "bobyqa"), optCtrl = list(maxfun = 2e+09))
AIC BIC logLik deviance df.resid
28224.1 28321.6 -14100.0 28200.1 24988
Scaled residuals:
Min 1Q Median 3Q Max
-4.9891 -0.9600 0.4585 0.6608 2.8659
Random effects:
Groups Name Variance Std.Dev.
ststr (Intercept) 0.05663 0.2380
mmsa_name (Intercept) 0.01937 0.1392
Number of obs: 25000, groups: ststr, 986; mmsa_name, 123
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.22273 0.04453 -5.002 5.69e-07 ***
agec(30,40] 0.57947 0.04879 11.877 < 2e-16 ***
agec(40,50] 0.77297 0.05047 15.315 < 2e-16 ***
agec(50,60] 0.91451 0.04950 18.477 < 2e-16 ***
agec(60,70] 1.01151 0.05256 19.245 < 2e-16 ***
agec(70,80] 0.62836 0.05262 11.941 < 2e-16 ***
male 0.57604 0.02942 19.578 < 2e-16 ***
black 0.41771 0.05264 7.935 2.11e-15 ***
hispanic 0.39650 0.05605 7.074 1.51e-12 ***
other -0.37946 0.06443 -5.889 3.88e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) a(30,4 a(40,5 a(50,6 a(60,7 a(70,8 male black hispnc
agec(30,40] -0.564
agec(40,50] -0.577 0.488
agec(50,60] -0.607 0.501 0.500
agec(60,70] -0.588 0.474 0.477 0.500
agec(70,80] -0.615 0.476 0.484 0.511 0.498
male -0.334 0.013 0.057 0.046 0.046 0.062
black -0.195 0.036 0.049 0.063 0.066 0.087 0.035
hispanic -0.182 0.007 0.028 0.071 0.081 0.098 0.021 0.119
other -0.122 -0.005 -0.007 0.031 0.036 0.058 0.010 0.098 0.106
This also suggests that the weights are likely controlling for some type of non-response bias or non-representativeness in the survey.
This example will illustrate a way to combine individual survey data with aggregate data on MSAs to produce a MSA level estimate of basically any health indicator measured using the BRFSS. The framework I use below takes observed individual level survey responses from the BRFSS and merges these to MSA level variables from the ACS. This allows me to estimate the overall regression model for MSA-level prevalence, controlling for MSA level variables. Then, I can use this equation for prediction for MSAs where I have not observed survey respondents, but for which I have observed the MSA level characteristics.
This corresponds to a multilevel logistic model with a higher level variables as predictors and can be written: \[ln \left( \frac {\pi_{ij}}{1-\pi{ij}} \right ) = \beta_{0j} +\sum {\beta_k x_{ik}}+\sum {\gamma_j z_j} \]
mmsas<-data.frame(mmsa=unique(merged$mmsa[samps]))
mmsas$struct<-1:length(mmsas$mmsa)
merged<-merge(merged[samps,], mmsas, by="mmsa")
library(INLA)
Loading required package: sp
This is INLA_18.07.12 built 2018-07-12 11:05:18 UTC.
See www.r-inla.org/contact-us for how to get help.
fit.mix.bin<-inla(owtob ~ agec + male + black + hispanic + other + f(struct, model="iid",prior="loggamma", param=c(.05, .1)),
family="binomial",
data=merged[samps,],
Ntrials=1)
summary(fit.mix.bin)
Call:
c("inla(formula = owtob ~ agec + male + black + hispanic + other + ", " f(struct, model = \"iid\", prior = \"loggamma\", param = c(0.05, ", " 0.1)), family = \"binomial\", data = merged[samps, ], Ntrials = 1)" )
Time used:
Pre-processing Running inla Post-processing Total
3.4647 7.0098 0.4638 10.9383
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) -0.2620 0.1185 -0.4946 -0.2620 -0.0294 -0.2621 0
agec(30,40] 0.5865 0.1492 0.2944 0.5862 0.8800 0.5856 0
agec(40,50] 0.7421 0.1447 0.4588 0.7419 1.0268 0.7413 0
agec(50,60] 0.8932 0.1344 0.6297 0.8930 1.1573 0.8927 0
agec(60,70] 0.9246 0.1346 0.6609 0.9245 1.1891 0.9241 0
agec(70,80] 0.5427 0.1312 0.2851 0.5426 0.8003 0.5425 0
male 0.6349 0.0758 0.4866 0.6348 0.7841 0.6344 0
black 0.6507 0.1464 0.3683 0.6490 0.9428 0.6456 0
hispanic 0.4678 0.1515 0.1747 0.4663 0.7693 0.4633 0
other -0.0905 0.1771 -0.4340 -0.0919 0.2607 -0.0946 0
Random effects:
Name Model
struct IID model
Model hyperparameters:
Expected number of effective parameters(std dev): 35.33(7.449)
Number of equivalent replicates : 101.79
Marginal log-Likelihood: -2256.06
#odds ratios
exp(fit.mix.bin$summary.fixed[, c(1,3,4,5)])
plot(inla.tmarginal(fun=function(x) 1/x, marginal = fit.mix.bin$marginals.hyperpar$`Precision for struct`), type="l",
main="Marginal distribution of between MSA variance")
library(ipumsr)
ddi<-read_ipums_ddi("~/Google Drive/classes/dem7283/class_19_7283/data/usa_00077.xml")
census<-read_ipums_micro(ddi)
Use of data from IPUMS-USA is subject to conditions including that users should
cite the data appropriately. Use command `ipums_conditions()` for more details.
names(census)<-tolower(names(census))
census<-zap_labels(census)
census<-census%>%
filter(age>20)
census$hisp <- recode(census$hispan, recodes = "9=NA; 1:4='Hispanic'; 0='NotHispanic'")
NAs introduced by coercion
census$race_rec <- recode(census$race, recodes = "1='White'; 2='Black'; 3='Other/Multiple'; 4:6='Asian'; 7:9='Other/Multiple'", as.factor = T)
census$race_eth <- interaction(census$hisp, census$race_rec, sep = " ")
census$race_eth <- as.factor(ifelse(substr(as.character(census$race_eth),1,8) == "Hispanic", "Hispanic", as.character(census$race_eth)))
census$race_eth <- relevel(census$race_eth, ref = "NotHispanic White")
census$black<-ifelse(census$race_eth=="NotHispanic Black",1,0)
census$other<-ifelse(census$race_eth%in%c("NotHispanic Asian", "NotHispanic Other/Multiple") ,1,0)
census$hispanic<-ifelse(census$race_eth=="Hispanic",1,0)
#sex (IV)
census$male <- ifelse(census$sex == 1,1,0)
census$pwt <- census$perwt/100
census$agec<-cut(census$age, breaks = seq(20, 80, 10))
library(survey)
Loading required package: grid
Loading required package: survival
Attaching package: <U+393C><U+3E31>survey<U+393C><U+3E32>
The following object is masked from <U+393C><U+3E31>package:graphics<U+393C><U+3E32>:
dotchart
des<-svydesign(ids=~cluster, strata = ~strata, weights = ~pwt, data=census[census$statefip=="48",])
Now I make the populations by the same charactersistics that went into the GLMM
msatotals<-svyby(~pwt, ~met2013+agec+male+black+hispanic+other, des, FUN=svytotal, na.rm=T)
msatotals$mmsa<-msatotals$met2013
msatotals$owtob<-NA
msatotals$mmsa<-msatotals$met2013
head(msatotals)
The perwt column is the estimate of the number of people in each place by the characteristics provided.
Now we use these data to get fitted values for each row. To do this, we merge them at the end of the brfss data, and run the model. The model will estimate the missing values for the overweight variable as it fits.
merged_new<-rbind(merged[, c("mmsa","owtob", "agec", "black", "hispanic", "other", "male")],msatotals[, c("mmsa","owtob", "agec", "black", "hispanic", "other", "male")])
fit.pred<-inla(owtob ~ agec + male + black + hispanic + other + f(mmsa, model="iid",prior="loggamma", param=c(.05, .1)),
family="binomial",
data=merged_new,
control.predictor = list(link=1),
Ntrials=1)
summary(fit.pred)
Call:
c("inla(formula = owtob ~ agec + male + black + hispanic + other + ", " f(mmsa, model = \"iid\", prior = \"loggamma\", param = c(0.05, ", " 0.1)), family = \"binomial\", data = merged_new, Ntrials = 1, ", " control.predictor = list(link = 1))")
Time used:
Pre-processing Running inla Post-processing Total
3.3913 96.2239 2.4530 102.0682
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) -0.2395 0.0489 -0.3355 -0.2396 -0.1435 -0.2396 0
agec(30,40] 0.5551 0.0580 0.4414 0.5551 0.6689 0.5550 0
agec(40,50] 0.7191 0.0564 0.6084 0.7191 0.8299 0.7190 0
agec(50,60] 0.8024 0.0527 0.6990 0.8024 0.9059 0.8024 0
agec(60,70] 0.8984 0.0519 0.7965 0.8984 1.0003 0.8984 0
agec(70,80] 0.5396 0.0515 0.4384 0.5396 0.6407 0.5396 0
male 0.6362 0.0284 0.5805 0.6362 0.6919 0.6361 0
black 0.6018 0.0558 0.4929 0.6015 0.7119 0.6011 0
hispanic 0.3498 0.0580 0.2366 0.3496 0.4641 0.3492 0
other -0.2290 0.0645 -0.3551 -0.2291 -0.1022 -0.2293 0
Random effects:
Name Model
mmsa IID model
Model hyperparameters:
Expected number of effective parameters(std dev): 68.51(6.567)
Number of equivalent replicates : 364.91
Marginal log-Likelihood: -15412.68
Posterior marginals for linear predictor and fitted values computed
msatotals_est<-cbind(msatotals, fit.pred$summary.fitted.values[is.na(merged_new$owtob)==T,])
msatotals_est$est_lci<-msatotals_est$`0.025quant`
msatotals_est$est_uci<-msatotals_est$`0.975quant`
msatotals_est$ptest<-msatotals_est$mean
head(msatotals_est)
Now, to do post-stratification estimates, we need to calculate:
\[\pi^{pred}_{msa_j} = \frac{\sum_j N_l \theta_l}{\sum_l N_l}\] Where \(\theta\) is the probability of being obese in each of the \(l\) strata (groups of people), which we have as the p estimate from above. The \(N_l\) estiamtes are the estimated counts of the population by the age, sex, and race characteristics noted above. Now we make the estimates
msatotals_est$est<-msatotals_est$ptest*msatotals_est$pwt
msatotals_est$lest<-msatotals_est$est_lci*msatotals_est$pwt
msatotals_est$uest<-msatotals_est$est_uci*msatotals_est$pwt
msarates<-msatotals_est%>%
group_by(mmsa)%>%
summarise(ob_rate_ps = sum(est)/sum(pwt), lowerrate=sum(lest)/sum(pwt), upperrate=sum(uest)/sum(pwt))
head(msarates)
#San Antonio
msarates[msarates$mmsa=="41700",]
Not bad, based on other estimates by the CDC
Here is the rest of Texas
msarates$mmsa_c<-as.character(msarates$mmsa)
ob_msa<-geo_join(msa_ec, msarates, by_sp="GEOID", by_df="mmsa_c", how="inner")
ob_msa%>%
ggplot()+geom_sf(aes(fill=ob_rate_ps))