Names

* Mixed effects models
* Hierarchical linear models
* Random effects models
These all refer to the same suite of models, but different disciplines/authors/applications have managed to confuse us all.
Basic multilevel perspectives

* The social epidemiology perspective
* General ecological models
* Longitudinal models for change
Each of these situates humans within some higher, contextual level that is assumed to either influence or be influenced by their actions and behaviors. Ideally, we would like to be able to incorporate covariates at both the individual and the higher level that could influence those behaviors.
Here’s a picture of this (figures from Kawachi and Berkman, 2003):
Multistage Sampling
When we have a research statement that involves individuals within some context, this is a multi-level proposition. In this sense, we are interested in questions that relate variables at different levels, the micro and the macro. This also holds in general if a sample was collected with a multi-stage sampling scheme.
In a multilevel proposition, variables are present at two different levels, and we are interested in the relationship between both the micro and macro level association with our outcome, y.
Multi-level
This can be contrasted with a purely micro level proposition, where all our observed variables are at the level of the individual:
micro-level
Likewise, if we are only interested in the relationship between macro level variables, we have this situation:
macro-level
Macro - micro propositions
We commonly encounter the situation where a macro level variable affects a micro level outcome. This can happen in several different ways.
macro-micro
The first case is a macro to micro proposition, which may be exemplified by a statement such as: “Living in areas with high environmental contamination leads to higher risk of death”.
The second frame illustrates a more specific special case, where there is a macro level effect net of an individual level predictor, which may be stated: “For individuals with a given level of education, living in areas with high environmental contamination leads to higher risk of death”.
The last panel illustrates what is known as a cross-level interaction, or a macro-micro interaction, where the relationship between x and y depends on z. This leads to the statement: “Individuals with low levels of education, living in areas with high environmental contamination, have a higher risk of death”.
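Stated a bit more formally (a sketch, with notation of my own choosing rather than taken from the figures), these three propositions correspond to models like:

\[\begin{aligned}
\text{macro to micro:} \quad & y_{ij} = \beta_0 + \gamma z_j + e_{ij} \\
\text{net of a micro predictor:} \quad & y_{ij} = \beta_0 + \beta_1 x_{ij} + \gamma z_j + e_{ij} \\
\text{cross-level interaction:} \quad & y_{ij} = \beta_0 + \beta_1 x_{ij} + \gamma z_j + \delta\, x_{ij} z_j + e_{ij}
\end{aligned}\]

where \(x_{ij}\) is the micro-level predictor (e.g. education) and \(z_j\) is the macro-level variable (e.g. contamination).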
Longitudinal models for change

These kinds of models are used when you have observations on the same individuals over multiple time points.

* The observations don’t have to be at the same times/ages for each individual, which gives more flexibility
* These models treat the individual as the higher level unit, and you are interested in studying:
    * Change over time within an individual
    * Impacts of prior circumstances on later outcomes
* Two modeling strategies will allow us to consider individual change over time versus population averaged change over time (a minimal growth-model sketch follows this list)
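As a quick hedged sketch (the data frame long, with variables id, age, and y, is simulated here; none of these names come from the BRFSS example below), a growth-curve model in lme4 treats the person as the grouping level:

library(lme4)
#simulate a small longitudinal data set: 100 people, 4 unevenly spaced ages each
set.seed(1)
long<-data.frame(id=rep(1:100, each=4), age=rep(c(0, 1, 2.5, 4), 100))
#person-specific intercepts plus a common age trend
long$y<-2 + 0.5*long$age + rep(rnorm(100, 0, 1), each=4) + rnorm(400, 0, 0.5)
#the person is the higher level unit: random intercept and random slope for age
growth<-lmer(y~age+(age|id), data=long)
summary(growth)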
\[y_{ij} = \mu + u_{j} + e_{ij}\]
where \(\mu\) is the grand mean, \(u_{j}\) is the fixed effect of the \(j^{th}\) group, and \(e_{ij}\) is the residual for each individual
This model assumes that you are capturing all variation in y by the group factor differences, \(u_{j}\), alone.
If:

* you have all your groups,
* your only predictor is the group (factor) level,
* and you expect there to be directional differences across groups a priori,

then this is probably the model for you. You might use this framework if you want to crudely model the effect of region of residence in an analysis, i.e. is the mean different across regions?
\[ y_{ij} = \mu + u_{j} + e_{ij} \]

where each group’s mean is \( \mu + u_{j} \).

Also, the global F-test will show us if ANY of the means are different from one another.
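As a minimal sketch of this in R (the data are simulated here, echoing the region-of-residence example; none of these object names come from the BRFSS analysis below):

#simulate 5 hypothetical regions with shifted means
set.seed(2)
dat<-data.frame(region=rep(letters[1:5], each=30))
dat$y<-rnorm(150, mean=rep(c(0, .2, .4, .6, .8), each=30))
fit.anova<-lm(y~factor(region), dat)
anova(fit.anova) #the global F-test: are ANY of the region means different?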
In the ANCOVA model, you want to examine the effect of a covariate in each group. As a simple example:

\[ y_{ij} = \mu + u_{j}+\beta x_{i}+ e_{ij} \]
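And a sketch of the corresponding fit in R, reusing the simulated dat from the ANOVA sketch above with a hypothetical covariate x:

dat$x<-rnorm(150)
fit.ancova<-lm(y~factor(region)+x, dat) #region differences net of x
summary(fit.ancova)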
\[ y_{ij} = \mu + u_{j} + e_{ij} \]
The random effects model assumes that each of the group means \(\mu + u_j\) is composed of a grand mean and an iid random effect. This differs from the ANOVA model because the \(u_j\)’s are not treated as fixed constants identified by setting a comparison group; instead, they are random draws from a common distribution.
\[ y_{ij} = \mu + u_{j} + e_{ij} \]
\[u_j \sim N(0, \sigma^2_u)\]
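A sketch of the random-effects version in lme4 (again with simulated data; I use 30 groups here, since the random-effects view makes the most sense with many groups):

library(lme4)
set.seed(3)
dat2<-data.frame(region=rep(paste0("r", 1:30), each=20))
#group means are random draws around a grand mean of 1
dat2$y<-1 + rep(rnorm(30, 0, 0.5), each=20) + rnorm(600)
fit.re<-lmer(y~1+(1|region), data=dat2) #u_j ~ N(0, sigma^2_u)
summary(fit.re) #reports Std.Dev. of the region intercepts and of the residual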
There are differences between these classic models and the linear mixed model. As a rule, you use the fixed-effects models when:

* you have a small number of groups,
* and you also know all the groups a priori, e.g. sex or race.

If you have many groups, say more than 10, then the odds that you are really interested in all possible differences in the means is probably pretty low.
We can also include a group-level covariate, z, in the intercept:

\[\beta_{0j} = \beta_0 + \gamma z_j + u_j\]

where z = the poverty rate or median income in the county, and again u is assumed to come from:

\[u_j \sim N(0, \sigma^2_u)\]

Graphically, the \(\beta_{0j}\) term can be seen as:
Random Intercepts
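A sketch of this model in lme4, attaching a hypothetical group-level covariate z (think of it as a poverty rate) to the simulated data from the random-effects sketch above:

zvals<-runif(30, 0.05, 0.30) #one z value per region
dat2$z<-zvals[as.numeric(factor(dat2$region))]
fit.rez<-lmer(y~z+(1|region), data=dat2)
summary(fit.rez)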
The total variance is \(\sigma^2 = \sigma^2 _{e}+\sigma^2 _{u}\).

* These are called the variance components of the model,
* and they separate the variance into differences between individuals and differences between groups.

The resulting correlation between two individuals \(i\) and \(i'\) in the same group \(j\) (the intra-class correlation) is:

\[\rho(y_{ij},y_{i'j}) = \frac{ \sigma^2 _{u} }{ \sigma^2 _{u} + \sigma^2 _{e} }\]
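As a sketch, the variance components and this ratio can be pulled from the toy fit.re above:

vc<-as.data.frame(VarCorr(fit.re))
vc$vcov[1]/sum(vc$vcov) #sigma^2_u / (sigma^2_u + sigma^2_e)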
In this example, I introduce how to fit the multi-level model using the lme4 package. This example considers the linear case of the model, where the outcome is assumed to be continuous, and the model error term is assumed to be Gaussian. Subsequent examples will highlight the Generalized Linear Mixed Model (GLMM).
This example shows how to:

* Examine variation between groups using fixed effects
* Fit the basic random intercept model: \(y_{ij} = \mu + u_{j} + e_{ij}\) with \(u_j \sim N(0, \sigma^2_u)\), where the intercepts (\(u_j\)) for each group vary randomly around the overall mean (\(\mu\))
* Include group-level covariates, fit the random slopes model, and fit the model for cross-level interactions
The example merges data from the 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART MSA data (Link) and the 2010 American Community Survey 5-year estimates at the MSA level. More details on these data are found below.
#load brfss
library(car)
library(stargazer)
library(survey)
library(sjPlot)
library(ggplot2)
library(pander)
library(knitr)
load("~/Google Drive/classes/dem7283/class18/data/brfss16_mmsa.Rdata")
set.seed(12345)
#samps<-sample(1:nrow(brfss16m), size = 40000, replace=F)
#brfss16m<-brfss16m[samps,]
#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss16m)
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
newnames<-gsub(pattern = "_",replacement = "",x = nams)
names(brfss16m)<-tolower(newnames)
#sex
brfss16m$male<-ifelse(brfss16m$sex==1, 1, 0)
#BMI
brfss16m$bmi<-ifelse(is.na(brfss16m$bmi5)==T, NA, brfss16m$bmi5/100)
#Healthy days
brfss16m$healthdays<-recode(brfss16m$physhlth, recodes = "88=0; 77=NA; 99=NA")
#Healthy mental health days
brfss16m$healthmdays<-recode(brfss16m$menthlth, recodes = "88=0; 77=NA; 99=NA")
brfss16m$badhealth<-recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")
brfss16m$race_eth<-recode(brfss16m$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor.result = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")
#insurance
brfss16m$ins<-recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss16m$inc<-recode(brfss16m$incomg, recodes = "9= NA;1='1_lt15k'; 2='2_15-25k';3='3_25-35k';4='4_35-50k';5='5_50kplus'", as.factor.result = T)
brfss16m$inc<-as.ordered(brfss16m$inc)
#education level
brfss16m$educ<-recode(brfss16m$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor.result=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')
#employment status
brfss16m$employ<-recode(brfss16m$employ1,
recodes="1:2='employed'; 3:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor.result=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='employed')
#marital status
brfss16m$marst<-recode(brfss16m$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor.result=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')
#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=c(0,24,39,59,79,99))
#BMI: in brfss16m the bmi5 variable has 2 implied decimal places,
#so we must divide by 100 to get real BMIs
brfss16m$bmi<-brfss16m$bmi5/100
#smoking currently
brfss16m$smoke<-recode(brfss16m$smoker3,
recodes="1:2=1; 3:4=0; else=NA")
#brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")
brfss16m$obese<-ifelse(is.na(brfss16m$bmi)==T, NA,
ifelse(brfss16m$bmi>30,1,0))
I want to see how many people we have in each MSA in the data:
#Now we will begin fitting the multilevel regression model with the msa
#that the person lives in being the higher level
head(data.frame(name=table(brfss16m$mmsaname),id=unique(brfss16m$mmsa)))
## name.Var1
## 1 Albany-Schenectady-Troy, NY, Metropolitan Statistical Area
## 2 Albuquerque, NM, Metropolitan Statistical Area
## 3 Allentown-Bethlehem-Easton, PA-NJ, Metropolitan Statistical Area
## 4 Anchorage, AK, Metropolitan Statistical Area
## 5 Atlanta-Sandy Springs-Roswell, GA, Metropolitan Statistical Area
## 6 Augusta-Richmond County, GA-SC, Metropolitan Statistical Area
## name.Freq id
## 1 2901 10580
## 2 1300 10740
## 3 675 10900
## 4 1028 11260
## 5 2444 12060
## 6 849 12260
#people within each msa
#How many total MSAs are in the data?
length(table(brfss16m$mmsa))
## [1] 143
#MSAs
We will often be interested in factors at both the individual AND contextual levels. To illustrate this, I will use data from the American Community Survey measured at the MSA level. Specifically, I use tables describing the economic characteristics of places (income inequality, poverty, housing vacancy, and housing age) from the 2010 5-year ACS Link.
library(acs)
#Get 2010 ACS 5-year economic indicators for all MSAs
#(mykey is your Census API key, installed beforehand)
msaacs<-geo.make(msa="*")
#ACS tables B17 is poverty, B19 is Gini, B25001 is housing vacancy, B25035 is median year built
acsecon<-acs.fetch(key=mykey, endyear=2010, span=5, geography=msaacs, variable = c("B19083_001","B17001_001","B17001_002" , "B25002_001","B25002_003", "B25035_001"))
colnames(acsecon@estimate)
## [1] "B19083_001" "B17001_001" "B17001_002" "B25002_001" "B25002_003"
## [6] "B25035_001"
msaecon<-data.frame(gini=acsecon@estimate[, "B19083_001"],
ppoverty=acsecon@estimate[, "B17001_002"]/acsecon@estimate[, "B17001_001"],
giniz=scale(acsecon@estimate[, "B19083_001"]),
pvacant=acsecon@estimate[,"B25002_003"]/acsecon@estimate[, "B25002_001"],
ppovertyz=scale(acsecon@estimate[, "B17001_002"]/acsecon@estimate[, "B17001_001"]),
pvacantz=scale(acsecon@estimate[,"B25002_003"]/acsecon@estimate[, "B25002_001"]),
medhouse=acsecon@estimate[, "B25035_001" ],
medhousez=scale(acsecon@estimate[, "B25035_001" ]))
msaecon$ids<-paste(acsecon@geography$metropolitanstatisticalareamicropolitanstatisticalarea)
head(msaecon)
## gini ppoverty giniz
## Adjuntas, PR Micro Area 0.525 0.5925656 2.718785
## Aguadilla-Isabela-San Sebastián, PR Metro Area 0.529 0.5577311 2.847562
## Coamo, PR Micro Area 0.532 0.5633984 2.944145
## Fajardo, PR Metro Area 0.483 0.4309070 1.366623
## Guayama, PR Metro Area 0.518 0.4980518 2.493425
## Jayuya, PR Micro Area 0.502 0.5446137 1.978315
## pvacant ppovertyz
## Adjuntas, PR Micro Area 0.1996029 6.271934
## Aguadilla-Isabela-San Sebastián, PR Metro Area 0.2005956 5.762039
## Coamo, PR Micro Area 0.1500353 5.844994
## Fajardo, PR Metro Area 0.3143172 3.905632
## Guayama, PR Metro Area 0.1951184 4.888474
## Jayuya, PR Micro Area 0.1562109 5.570031
## pvacantz medhouse
## Adjuntas, PR Micro Area 0.9053869 1978
## Aguadilla-Isabela-San Sebastián, PR Metro Area 0.9199516 1979
## Coamo, PR Micro Area 0.1781186 1979
## Fajardo, PR Metro Area 2.5885036 1977
## Guayama, PR Metro Area 0.8395882 1978
## Jayuya, PR Micro Area 0.2687281 1978
## medhousez ids
## Adjuntas, PR Micro Area 0.4923898 10260
## Aguadilla-Isabela-San Sebastián, PR Metro Area 0.5982027 10380
## Coamo, PR Micro Area 0.5982027 17620
## Fajardo, PR Metro Area 0.3865770 21940
## Guayama, PR Metro Area 0.4923898 25020
## Jayuya, PR Micro Area 0.4923898 27580
summary(msaecon)
## gini ppoverty giniz pvacant
## Min. :0.3510 Min. :0.02425 Min. :-2.88303 Min. :0.04246
## 1st Qu.:0.4190 1st Qu.:0.12308 1st Qu.:-0.69381 1st Qu.:0.09454
## Median :0.4390 Median :0.15172 Median :-0.04993 Median :0.12076
## Mean :0.4406 Mean :0.16409 Mean : 0.00000 Mean :0.13790
## 3rd Qu.:0.4590 3rd Qu.:0.18746 3rd Qu.: 0.59396 3rd Qu.:0.15964
## Max. :0.5730 Max. :0.59257 Max. : 4.26411 Max. :0.63877
## ppovertyz pvacantz medhouse medhousez
## Min. :-2.0468 Min. :-1.4002 Min. :1940 Min. :-3.5285
## 1st Qu.:-0.6002 1st Qu.:-0.6361 1st Qu.:1968 1st Qu.:-0.5657
## Median :-0.1810 Median :-0.2514 Median :1975 Median : 0.1750
## Mean : 0.0000 Mean : 0.0000 Mean :1973 Mean : 0.0000
## 3rd Qu.: 0.3421 3rd Qu.: 0.3190 3rd Qu.:1980 3rd Qu.: 0.7040
## Max. : 6.2719 Max. : 7.3489 Max. :1998 Max. : 2.6086
## ids
## Length:955
## Class :character
## Mode :character
##
##
##
Let’s see the geographic variation in these economic indicators:
library(tigris)
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
##
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
##
## plot
msa<-core_based_statistical_areas(cb=T)
msa_ec<-geo_join(msa, msaecon, "CBSAFP", "ids", how="inner")
library(RColorBrewer)
library(sp)
spplot(msa_ec, "gini", at=quantile(msa_ec$gini), col.regions=brewer.pal(n=6, "Greys"), col="transparent")
spplot(msa_ec, "pvacant", at=quantile(msa_ec$pvacant), col.regions=brewer.pal(n=6, "Greys"), col="transparent")
spplot(msa_ec, "medhouse", at=quantile(msa_ec$medhouse), col.regions=brewer.pal(n=6, "Greys"), col="transparent")
Merge the MSA data to the BRFSS data
joindata<-merge(brfss16m, msaecon, by.x="mmsa",by.y="ids", all.x=T)
joindata$bmiz<-as.numeric(scale(joindata$bmi, center=T, scale=T))
#joindata<-joindata[complete.cases(joindata[, c("bmiz", "race_eth", "agec", "educ", "gini")]),]
#select the analysis variables and keep only complete cases
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:acs':
##
## combine
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
joindata<-joindata%>%
select(bmiz, obese, mmsa, agec, educ, race_eth,smoke, healthmdays, badhealth,bmi,gini, ppoverty, ppovertyz,giniz,medhouse,medhousez, pvacant, pvacantz, male, mmsawt, mmsaname )%>%
filter(complete.cases(.))
head(joindata[, c("bmiz", "male", "agec", "educ", "gini","medhouse", "pvacant", "mmsa")])
## bmiz male agec educ gini medhouse pvacant mmsa
## 1 -0.5184799 0 (79,99] 3somecol 0.427 1960 0.1115985 10580
## 2 0.3421579 0 (59,79] 2hsgrad 0.427 1960 0.1115985 10580
## 3 -0.2272375 1 (79,99] 4colgrad 0.427 1960 0.1115985 10580
## 4 -0.3646778 1 (59,79] 2hsgrad 0.427 1960 0.1115985 10580
## 5 0.3683370 0 (79,99] 3somecol 0.427 1960 0.1115985 10580
## 6 -1.1026011 0 (39,59] 1somehs 0.427 1960 0.1115985 10580
meanbmi<-mean(joindata$bmi, na.rm=T)
sdbmi<-sd(joindata$bmi, na.rm=T)
As a general rule, I will do a basic fixed-effects ANOVA as a precursor to doing full multi-level models, just to see if there is any variation amongst my higher level units (groups). If I do not see any variation in my higher level units, I generally will not proceed with the process of multi-level modeling.
fit.an<-lm(bmiz~as.factor(mmsa), joindata)
anova(fit.an)
## Analysis of Variance Table
##
## Response: bmiz
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(mmsa) 122 1309 10.7330 10.685 < 2.2e-16 ***
## Residuals 171412 172174 1.0044
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#use glm() for non-normal outcomes
fit.ob<-glm(obese~as.factor(mmsa),family=binomial, joindata)
anova(fit.ob, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: obese
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 171534 209187
## as.factor(mmsa) 122 1094.2 171412 208093 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So we see significant variation in our outcomes across the higher level units.
Now we fit the hierarchical model
library(lme4)
library(lmerTest)
library(arm)
Basic hierarchical model for means of each MSA:
fit<-lmer(obese~(1|mmsa), data=joindata)
arm::display(fit, detail=T)
## lme4::lmer(formula = obese ~ (1 | mmsa), data = joindata)
## coef.est coef.se t value
## (Intercept) 0.31 0.00 88.86
##
## Error terms:
## Groups Name Std.Dev.
## mmsa (Intercept) 0.04
## Residual 0.46
## ---
## number of obs: 171535, groups: mmsa, 123
## AIC = 217956, DIC = 217931.3
## deviance = 217940.8
joindata$predbmi<-sdbmi*(fitted(fit))+meanbmi
head(ranef(fit))
## $mmsa
## (Intercept)
## 10580 -0.0145649692
## 10740 -0.0381033354
## 10900 -0.0108108484
## 11260 0.0008080330
## 12060 -0.0069256065
## 12260 -0.0022816384
## 12420 -0.0347823705
## 12580 0.0037975656
## 12940 0.0063121494
## 13220 0.0879929518
## 13620 0.0133708916
## 13740 -0.0132753456
## 13780 0.0125690682
## 13820 0.0365653425
## 13900 0.0057647203
## 14260 -0.0264866854
## 15380 -0.0019479482
## 15540 -0.0375314615
## 16300 0.0208628657
## 16620 0.0471533345
## 16700 0.0111967880
## 16740 0.0005840331
## 16860 0.0079561966
## 16980 0.0041801032
## 17140 0.0244629014
## 17200 -0.0145216757
## 17460 -0.0253559234
## 17780 -0.0097198273
## 17820 -0.0699226342
## 17900 0.0219073048
## 18140 -0.0012141587
## 18580 0.0424418544
## 18880 -0.0079775854
## 19060 0.0305713636
## 19380 0.0083306458
## 19660 -0.0245532764
## 19740 -0.0826948655
## 19780 -0.0077741731
## 20260 0.0015162516
## 21340 0.0133073184
## 22020 0.0215572210
## 22220 0.0052379033
## 23060 0.0195212189
## 23540 -0.0182912995
## 24020 0.0228298882
## 24220 0.0146331347
## 24260 0.0323776251
## 24340 -0.0116337538
## 24860 0.0167793077
## 25180 0.0473299837
## 25540 -0.0501667768
## 25940 -0.0665008748
## 26420 -0.0047288398
## 26580 0.0904982613
## 26900 0.0013889563
## 27140 0.0583865456
## 27260 -0.0045630191
## 28140 -0.0093240171
## 28700 0.0028896242
## 28940 0.0107168882
## 29620 0.0319725149
## 30700 -0.0242177113
## 30780 0.0397111505
## 30860 -0.0021051887
## 31140 0.0259210740
## 32820 0.0498308466
## 33100 -0.0684229124
## 33340 0.0099218453
## 33460 -0.0449414310
## 33500 0.0058536396
## 34820 0.0020133951
## 34980 -0.0076288641
## 35380 0.0322775775
## 35740 0.0280472790
## 35820 0.0795494706
## 35840 -0.0648530200
## 36260 -0.0059712931
## 36420 0.0195746357
## 36540 0.0256492847
## 36740 -0.0165833133
## 37460 0.0063017545
## 37860 -0.0096708741
## 38060 -0.0379486349
## 38300 -0.0003648090
## 38860 -0.0400156999
## 38900 -0.0339661914
## 38940 -0.0655724809
## 39300 -0.0325044001
## 39340 -0.0431117327
## 39580 -0.0128415779
## 39660 -0.0447517635
## 39900 -0.0542059111
## 40060 0.0093188546
## 40140 -0.0176488323
## 40340 -0.0212296173
## 40380 0.0291821421
## 40900 -0.0467686006
## 41060 -0.0098117524
## 41180 0.0089282415
## 41420 0.0422124242
## 41540 0.0334402689
## 41620 -0.0435774522
## 41700 0.0403544782
## 41940 -0.0721431136
## 41980 -0.0037139826
## 42420 0.0038996147
## 43580 0.0486275026
## 43620 0.0102836506
## 43900 0.0214939740
## 44060 -0.0044947365
## 44140 -0.0165457919
## 45060 0.0118300329
## 45220 0.0466110137
## 45300 -0.0192354451
## 45780 0.0301316486
## 45820 0.0362422570
## 46140 -0.0154166978
## 46220 0.0242670626
## 46540 0.0072924957
## 47260 0.0107727755
## 48620 0.0005012869
## 48660 -0.0127551819
## 49340 -0.0291405108
dim(ranef(fit)$mmsa)
## [1] 123 1
rate<- fixef(fit)+ranef(fit)$mmsa
est<-data.frame(rate =rate, id=rownames(ranef(fit)$mmsa))
msa_ec<-merge(msa_ec, est, by.x="CBSAFP", by.y="id")
spplot(msa_ec, "X.Intercept.")
spplot(msa_ec[is.na(msa_ec$X.Intercept.)==F,], "X.Intercept.")
head(est)
## X.Intercept. id
## 10580 0.2923401 10580
## 10740 0.2688017 10740
## 10900 0.2960942 10900
## 11260 0.3077131 11260
## 12060 0.2999794 12060
## 12260 0.3046234 12260
citymeans<-aggregate(cbind(bmi, predbmi)~mmsaname,joindata, mean)
head(citymeans, n=10)
## mmsaname
## 1 Albany-Schenectady-Troy, NY, Metropolitan Statistical Area
## 2 Albuquerque, NM, Metropolitan Statistical Area
## 3 Allentown-Bethlehem-Easton, PA-NJ, Metropolitan Statistical Area
## 4 Anchorage, AK, Metropolitan Statistical Area
## 5 Atlanta-Sandy Springs-Roswell, GA, Metropolitan Statistical Area
## 6 Augusta-Richmond County, GA-SC, Metropolitan Statistical Area
## 7 Austin-Round Rock, TX, Metropolitan Statistical Area
## 8 Baltimore-Columbia-Towson, MD, Metropolitan Statistical Area
## 9 Baton Rouge, LA, Metropolitan Statistical Area
## 10 Beckley, WV, Metropolitan Statistical Area
## bmi predbmi
## 1 28.00466 29.78739
## 2 27.30400 29.64272
## 3 27.91680 29.81046
## 4 28.38423 29.88188
## 5 28.09558 29.83435
## 6 27.98689 29.86289
## 7 27.42907 29.66313
## 8 28.21727 29.90025
## 9 28.50822 29.91571
## 10 29.35557 30.41775
plot(predbmi~bmi, citymeans)
dim(msa_ec@data)
## [1] 891 18
Model with individual level predictors:
fit2<-lmer(bmiz~male+agec+educ+(1|mmsa), data=joindata, na.action=na.omit)
arm::display(fit2, detail=T)
## lme4::lmer(formula = bmiz ~ male + agec + educ + (1 | mmsa),
## data = joindata, na.action = na.omit)
## coef.est coef.se t value
## (Intercept) -0.42 0.01 -32.75
## male 0.10 0.00 20.73
## agec(24,39] 0.47 0.01 41.00
## agec(39,59] 0.64 0.01 58.78
## agec(59,79] 0.55 0.01 51.40
## agec(79,99] 0.16 0.01 12.25
## educ0Prim 0.12 0.02 6.60
## educ1somehs 0.05 0.01 4.47
## educ3somecol -0.02 0.01 -2.76
## educ4colgrad -0.22 0.01 -36.49
##
## Error terms:
## Groups Name Std.Dev.
## mmsa (Intercept) 0.07
## Residual 0.98
## ---
## number of obs: 171535, groups: mmsa, 123
## AIC = 480887, DIC = 480704.9
## deviance = 480783.8
We do a likelihood ratio test to see if our individual level variables are explaining anything about our outcome:
anova(fit, fit2)
## refitting model(s) with ML (instead of REML)
## Data: joindata
## Models:
## object: obese ~ (1 | mmsa)
## ..1: bmiz ~ male + agec + educ + (1 | mmsa)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## object 3 217947 217977 -108970 217941
## ..1 12 480808 480928 -240392 480784 0 9 1
They do.
We also typically would test whether the random effect term is significant. SAS prints a test of this by default, so we do the same kind of thing in R:
rand(fit2)
## Analysis of Random effects Table:
## Chi.sq Chi.DF p.value
## mmsa 743 1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which shows significant variation in average BMI across MSAs.
To specify a random intercept model for higher levels, we add a model term of the form (1|HIGHER LEVEL VARIABLE), which tells R to fit only a random intercept for each higher level unit; in our case it will be (1|mmsa).
fit.mix<-lmer(bmiz~male+agec+educ+(1|mmsa), data=joindata)
#do a test for the random effect
rand(fit.mix)
## Analysis of Random effects Table:
## Chi.sq Chi.DF p.value
## mmsa 743 1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
display(fit.mix, detail=T)
## lme4::lmer(formula = bmiz ~ male + agec + educ + (1 | mmsa),
## data = joindata)
## coef.est coef.se t value
## (Intercept) -0.42 0.01 -32.75
## male 0.10 0.00 20.73
## agec(24,39] 0.47 0.01 41.00
## agec(39,59] 0.64 0.01 58.78
## agec(59,79] 0.55 0.01 51.40
## agec(79,99] 0.16 0.01 12.25
## educ0Prim 0.12 0.02 6.60
## educ1somehs 0.05 0.01 4.47
## educ3somecol -0.02 0.01 -2.76
## educ4colgrad -0.22 0.01 -36.49
##
## Error terms:
## Groups Name Std.Dev.
## mmsa (Intercept) 0.07
## Residual 0.98
## ---
## number of obs: 171535, groups: mmsa, 123
## AIC = 480887, DIC = 480704.9
## deviance = 480783.8
So we see that our standard deviation at the MSA level is 0.074, and the standard deviation at the individual level (residual standard deviation) is .98.
Square these to get variances, of course.
Our fixed effects are interpreted as usual: males have higher average BMIs than females, older people have higher BMIs than younger people, those with less than a high school education have higher BMIs, while people with a college education have lower BMIs than those with a high school education only. See, just like ordinary regression.
Some may be interested in getting the intra-class correlation coefficient. While I don’t usually pay attention to this, here it is:
#it can be a little hairy to get it, but it can be done using various parts of VarCorr()
#NOTE: attr(VarCorr(fit), "sc") returns the residual standard deviation, not the
#variance, so this first version understates the ICC; the data.frame version below,
#which uses the variance column, is the correct calculation
ICC1<-VarCorr(fit)$mmsa[1]/( VarCorr(fit)$mmsa[1]+attr(VarCorr(fit), "sc"))
ICC1
## [1] 0.002722609
icc<-data.frame(VarCorr(fit))
icc[1,4]/sum(icc[,4])
## [1] 0.005946203
So less than 1% of the variance in the outcome is due to differences between MSAs. That’s not much, but according to our random effect testing, it is not, statistically speaking, 0.
Sometimes, to gain an appreciation of this variation, we may want to plot the random effects. I first show the fixed effects, then the random effects:
#I need to rescale the estimates to the BMI scale from the z-score scale
fixcoef<-fit.an$coefficients[1]+fit.an$coefficients[-1]
fixcoef<-(fixcoef*sdbmi)+meanbmi
plot(NULL,ylim=c(20, 30), xlim=c(0,1),
ylab="Intercept", xlab="") # get the ylim from a summary of rancoefs1
title(main="Fixed Effect Model")
for (i in 1:length(fixcoef)){
#draw a horizontal line at each group's estimated mean
abline(a=fixcoef[i], b=0, lwd=1.5, col="green")
}
#It may be easier to visualize the random intercepts by plotting them
rancoefs1<-ranef(fit.mix)$mmsa+fixef(fit.mix)[1]
rancoefs1<-(rancoefs1*sdbmi)+meanbmi
summary(rancoefs1)
## (Intercept)
## Min. :24.20
## 1st Qu.:25.15
## Median :25.46
## Mean :25.41
## 3rd Qu.:25.69
## Max. :26.41
plot(NULL,ylim=c(20,30), xlim=c(0,1),
ylab="Intercept", xlab="") # get the ylim from a summary of rancoefs1
title(main="Random Intercept Model")
for (i in 1:length(rancoefs1[,1])){
#draw a horizontal line at each MSA's estimated mean
abline(a=rancoefs1[i,1], b=0, lwd=1.5, col="maroon")
}
Now, I fit the same model as above, but this time I include a predictor at the MSA level: the median year housing was built.
#Now I estimate the multilevel model including the effects for the
#MSA level variables
fit.mix2<-lmer(bmiz~male+agec+educ+medhousez+(1|mmsa), data=joindata)
display(fit.mix2, detail=T)
## lme4::lmer(formula = bmiz ~ male + agec + educ + medhousez +
## (1 | mmsa), data = joindata)
## coef.est coef.se t value
## (Intercept) -0.42 0.01 -32.65
## male 0.10 0.00 20.72
## agec(24,39] 0.47 0.01 41.00
## agec(39,59] 0.64 0.01 58.77
## agec(59,79] 0.55 0.01 51.39
## agec(79,99] 0.16 0.01 12.23
## educ0Prim 0.12 0.02 6.62
## educ1somehs 0.05 0.01 4.49
## educ3somecol -0.02 0.01 -2.74
## educ4colgrad -0.22 0.01 -36.46
## medhousez -0.02 0.01 -3.43
##
## Error terms:
## Groups Name Std.Dev.
## mmsa (Intercept) 0.07
## Residual 0.98
## ---
## number of obs: 171535, groups: mmsa, 123
## AIC = 480886, DIC = 480685.3
## deviance = 480772.4
rand(fit.mix2)
## Analysis of Random effects Table:
## Chi.sq Chi.DF p.value
## mmsa 678 1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ICC2<-VarCorr(fit.mix2)$mmsa[1]/( VarCorr(fit.mix2)$mmsa[1]+attr(VarCorr(fit.mix2), "sc"))
ICC2
## [1] 0.004988259
#compare the random intercept model and the model with the MSA-level covariate using a LRT
anova(fit.mix, fit.mix2)
## refitting model(s) with ML (instead of REML)
## Data: joindata
## Models:
## object: bmiz ~ male + agec + educ + (1 | mmsa)
## ..1: bmiz ~ male + agec + educ + medhousez + (1 | mmsa)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## object 12 480808 480928 -240392 480784
## ..1 13 480798 480929 -240386 480772 11.417 1 0.0007279 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, we see a significant effect of adding the housing age variable to the model. This suggests that the housing environment is associated with BMI, net of individual level factors.
\[\begin{aligned} E(Y) &= \beta_{0j}+ \beta_{j} x\\ \beta_{0j} &= \beta_0 + u_{1j}\\ \beta_{j} &= \beta + u_{2j}\\ \end{aligned}\]

where \(\beta_0\) is the average intercept and \(u_{1j}\) is the group-specific deviation in the intercept, and \(\beta\) is the average slope and \(u_{2j}\) is the group-specific deviation in the slope.
This effectively allows one or more of the individual level variables to have different effects in each of the groups.
library(stargazer)
#stargazer(fit.mix3, fit.mix2, type = "html", out = "~/Google Drive/classes/dem7283/class18/lmerout.html")
#To do a random slope model, I do:
fit.mix3<-lmer(bmiz~male+agec+educ+(male|mmsa), joindata, REML=F)
#fit.mix3<-refit(fit.mix3)
display(fit.mix3, detail=T)
## lme4::lmer(formula = bmiz ~ male + agec + educ + (male | mmsa),
## data = joindata, REML = F)
## coef.est coef.se t value
## (Intercept) -0.42 0.01 -29.40
## male 0.09 0.01 11.29
## agec(24,39] 0.47 0.01 40.99
## agec(39,59] 0.64 0.01 58.85
## agec(59,79] 0.55 0.01 51.47
## agec(79,99] 0.16 0.01 12.30
## educ0Prim 0.12 0.02 6.61
## educ1somehs 0.05 0.01 4.37
## educ3somecol -0.02 0.01 -2.77
## educ4colgrad -0.22 0.01 -36.55
##
## Error terms:
## Groups Name Std.Dev. Corr
## mmsa (Intercept) 0.10
## male 0.06 -0.89
## Residual 0.98
## ---
## number of obs: 171535, groups: mmsa, 123
## AIC = 480687, DIC = 480659
## deviance = 480659.0
#compare the models with a LRT
anova(fit.mix, fit.mix3)
## refitting model(s) with ML (instead of REML)
## Data: joindata
## Models:
## object: bmiz ~ male + agec + educ + (1 | mmsa)
## ..1: bmiz ~ male + agec + educ + (male | mmsa)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## object 12 480808 480928 -240392 480784
## ..1 14 480687 480828 -240330 480659 124.76 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#the random slope model fits better
#plot random slopes and intercepts
rancoefs2<-ranef(fit.mix3)
plot(NULL, ylim=c(-.8, 0), xlim=c(0,1),ylab="Intercept", xlab="Male Gender == 1")
title (main="Random Slope and Intercept Model - Male Gender by Metro")
cols<-rainbow(length(unique(joindata$mmsa)))
for (i in 1: dim(rancoefs2$mmsa)[1]){
abline(a=fixef(fit.mix3)["(Intercept)"]+rancoefs2$mmsa[[1]][i],
b=fixef(fit.mix3)["male"]+rancoefs2$mmsa[, "male"][i], col=cols[i])
}
Here, I show the model for a cross-level interaction. This model fits an interaction term between (at least) one individual level variable and a group level variable. This allows you to ask very informative questions regarding individuals within specific contexts.
#Cross-level interaction model:
fit.mix4<-lmer(bmiz~male+agec+educ+medhousez*race_eth+(1|mmsa), joindata, REML=F)
display(fit.mix4, detail=T)
## lme4::lmer(formula = bmiz ~ male + agec + educ + medhousez *
## race_eth + (1 | mmsa), data = joindata, REML = F)
## coef.est coef.se t value
## (Intercept) -0.46 0.01 -36.50
## male 0.11 0.00 22.25
## agec(24,39] 0.47 0.01 40.93
## agec(39,59] 0.64 0.01 59.24
## agec(59,79] 0.56 0.01 52.50
## agec(79,99] 0.18 0.01 14.09
## educ0Prim 0.09 0.02 5.01
## educ1somehs 0.03 0.01 2.58
## educ3somecol -0.01 0.01 -1.80
## educ4colgrad -0.20 0.01 -33.14
## medhousez -0.04 0.01 -5.17
## race_ethhispanic 0.12 0.01 10.74
## race_ethnh black 0.29 0.01 31.27
## race_ethnh multirace 0.14 0.02 7.25
## race_ethnh other -0.17 0.01 -11.82
## medhousez:race_ethhispanic -0.02 0.01 -1.51
## medhousez:race_ethnh black 0.03 0.01 3.47
## medhousez:race_ethnh multirace -0.02 0.02 -0.90
## medhousez:race_ethnh other 0.02 0.01 1.32
##
## Error terms:
## Groups Name Std.Dev.
## mmsa (Intercept) 0.07
## Residual 0.98
## ---
## number of obs: 171535, groups: mmsa, 123
## AIC = 479329, DIC = 479286.7
## deviance = 479286.7
This basically says that NH Blacks in MSAs with more recently built housing (mean + 1 SD = 1982) have higher BMIs than NH Whites living in areas with housing of average age (mean = 1973).
#compare the models with a LRT
anova(fit.mix, fit.mix4)
## refitting model(s) with ML (instead of REML)
## Data: joindata
## Models:
## object: bmiz ~ male + agec + educ + (1 | mmsa)
## ..1: bmiz ~ male + agec + educ + medhousez * race_eth + (1 | mmsa)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## object 12 480808 480928 -240392 480784
## ..1 21 479329 479540 -239643 479287 1497.1 9 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, the cross-level interaction model shows a significantly better fit than the model with only the individual level effects.
In mixed models, we observe an effect referred to as “shrinkage” of the estimates of the means for each group.
This term refers to how group-level estimates in multi level models use information from all groups to produce the individual group estimates.
So, if we have a group with a larger sample size, the estimate of the mean in that group will have lots of information, and low variance in the estimate.
A group with a small sample size, on the other hand, will have less information about the mean for that group, and generally a higher-variance estimate of the mean. Following Gelman and Hill (p. 477), if we have a multilevel model where
\[y_i \sim N \left( \alpha_{j[i]}, \sigma_y^2 \right)\] and \[\alpha_{j} \sim N\left( \mu_{\alpha}, \sigma_{\alpha}^2 \right)\] and \(n_j\) is the sample size within each group. The multilevel estimate of the mean in each group, \(\alpha_j\) is
\[\alpha_j^{multilevel} = \omega_j \mu_{\alpha} + (1-\omega_j) \bar y_j\] where \[\omega_j = 1- \frac{\sigma_{\alpha}^2}{\sigma_{\alpha}^2+\sigma_{y}^2/n_j} \]
is a pooling, or weighting, factor. If \(\omega_j = 1\) then we have complete pooling, and the group means equal the population mean; when \(\omega_j = 0\), we have no pooling, and the group means are estimated entirely from their own data, with no contribution from the overall mean. The factor \(\omega_j\) thus describes how much each group's estimate is shrunk toward the overall population mean, while \(1-\omega_j\) is the weight given to each group's own data.
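As a sketch, the pooling factors can be computed directly from a fitted lmer model; here I assume the intercept-only model fit0 (bmi ~ 1 + (1|mmsa)) that is fit in the code just below:

vc<-as.data.frame(VarCorr(fit0))
sigma2.a<-vc$vcov[1] #between-MSA variance
sigma2.y<-vc$vcov[2] #within-MSA (residual) variance
nj<-as.numeric(table(fit0@frame$mmsa)) #sample size in each MSA
omega<-1 - sigma2.a/(sigma2.a + sigma2.y/nj)
summary(omega) #larger MSAs have smaller omega, i.e. less pooling toward the grand mean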
So what’s the difference? It may be informative to plot the estimated BMIs for each MSA from the OLS and multilevel models. This section illustrates the effects of “pooling”, as in Gelman & Hill ch. 12.
#these models are good for estimating group means better than traditional methods
#this follows the examples in chapter 12 of Gelman and Hill, I stole the code directly from them.
#complete pooling, this model fits the grand mean ONLY
fit.cp<-lm(bmi~1, joindata)
display(fit.cp)
## lm(formula = bmi ~ 1, data = joindata)
## coef.est coef.se
## (Intercept) 27.99 0.01
## ---
## n = 171535, k = 1
## residual sd = 6.15, R-Squared = 0.00
#No pooling i.e. fixed effects regression, this model fits separate means for each MSA using OLS
lm.unpooled<-lm(bmi~factor(mmsa)-1, joindata)
#partial pooling, this model fits the population mean and MSA deviations using multilevel models
fit0<-lmer(bmi~1+(1|mmsa), joindata)
#partial pooling with covariate
fit0.1<-lmer(bmi~ppovertyz+giniz+(1|mmsa), joindata)
#Plot the means of the counties
J<-length(unique(joindata$mmsa))
ns<-as.numeric(table(fit0.1@frame$mmsa))
sample.size <- as.numeric(table(fit0.1@frame$mmsa))
sample.size.jittered <- sample.size*exp (runif (J, -.1, .1))
par (mar=c(5,5,4,2)+.1)
plot (sample.size.jittered, coef(lm.unpooled), cex.lab=1.2, cex.axis=1.2,
xlab="sample size in MSA j",
ylab=expression (paste("est. intercept, ", alpha[j], " (no pooling)")),
pch=20, log="x", ylim=c(25, 30), yaxt="n", xaxt="n")
axis (1, quantile(ns), cex.axis=1.1)
axis (2, seq(25, 30), cex.axis=1.1)
for (j in 1:J){
lines (rep(sample.size.jittered[j],2),
coef(lm.unpooled)[j] + c(-1,1)*se.coef(lm.unpooled)[j], lwd=.5)
}
abline (coef(fit.cp)[1], 0, lwd=.5)
title(main="Estimates of MSA Means from the Fixed Effect Model")
#plot MLM estimates of MSA means + se's
par (mar=c(5,5,4,2)+.1)
a.hat.M1 <- coef(fit0)$mmsa[,1]
a.se.M1 <- se.coef(fit0)$mmsa
ns<-as.numeric(table(fit0@frame$mmsa))
plot (as.numeric(ns), t(a.hat.M1), cex.lab=1.2, cex.axis=1.1,
xlab="sample size in MSA j",
ylab=expression (paste("est. intercept, ", alpha[j], "(multilevel model)")),
pch=20, log="x", ylim=c(25, 30), yaxt="n", xaxt="n")
axis (1, quantile(ns), cex.axis=1.1)
axis (2, seq(25,30), cex.axis=1.1)
for (j in 1:length(unique(joindata$mmsa))){
lines (rep(as.numeric(ns)[j],2),
as.vector(a.hat.M1[j]) + c(-1,1)*a.se.M1[j], lwd=.5, col="gray10")
}
abline (coef(fit.cp)[1], 0, lwd=.5)
title(main="Estimates of MSA Means from the MLM")
#plot MLM estimates of MSA means + se's, model with MSA covariates
par (mar=c(5,5,4,2)+.1)
a.hat.M2 <- coef(fit0.1)$mmsa[,1]
a.se.M2 <- se.coef(fit0.1)$mmsa
ns<-as.numeric(table(fit0.1@frame$mmsa))
plot (as.numeric(ns), t(a.hat.M2), cex.lab=1.2, cex.axis=1.1,
xlab="sample size in MSA j",
ylab=expression (paste("est. intercept, ", alpha[j], "(multilevel model with covariates)")),
pch=20, log="x", ylim=c(25, 30), yaxt="n", xaxt="n")
axis (1, quantile(ns), cex.axis=1.1)
axis (2, seq(25,30), cex.axis=1.1)
for (j in 1:length(unique(joindata$mmsa))){
lines (rep(as.numeric(ns)[j],2),
as.vector(a.hat.M2[j]) + c(-1,1)*a.se.M2[j], lwd=.5, col="gray10")
}
abline (coef(fit.cp)[1], 0, lwd=.5)
title(main="Estimates of MSA Means from the MLM with MSA predictors")
In the model with no pooling, we see greater variance in the estimates for MSAs with smaller sample sizes (although not terribly variable in this case, since we have lots of data). In the model with pooling, the variance in these estimates is reduced, because the multilevel model borrows information from all MSAs: the estimates for MSAs with little data are shrunk toward the overall mean, which is itself estimated with great precision.