- Lecture notes
- Fundamental problem
- Perspectives in Multi-level Modeling
- Multi-stage sampling
- Multi-level propositions
- Longitudinal Models
- Multilevel data Preface
- Linear Mixed Model
- ANOVA and ANCOVA
- Basic Random Effect Models
- Choosing…
- Forms of the random effect model
- Random Intercept Model
- Variance components
- Empirical Example

**Names** - Mixed effects models - Hierarchical linear models - Random effects models

These all refer to the same suite of models, but different disciplines/authors/applications have managed to confuse us all.

**Basic multilevel perspectives** * The Social epidemiology perspective * General ecological models * Longitudinal models for change

Each of these situate humans within some higher, contextual level that is assumed to either influence or be influenced by their actions/behaviors

And we would ideally like to be able to incorporate covariates at the individual and higher level that could influence behaviors

- Non-random sampling
- Population consists of known sub-groups called
*clusters* - A 2 -stage sample might be households within neighborhoods, or children within schools
- We may choose a random sample of schools/neighborhoods at the first stage, and a random sample of people within each school/neighborhood as the second stage
- We need to be
*careful*because the observations in the second stage are not*independent*of one another - Increased probability of selection for children in a selected school
- This type of sampling leads to
*dependent*observations

*Figures from Kawachi and Berkman (2003)*

Here’s a picture of this:

When we have a research statement that involves individuals within some context, this is a multi-level proposition. In this sense, we are interested in questions that relate variables at different levels, the micro and the macro. This also holds in general if a sample was collected with a multi-stage sampling scheme.

In a multilevel proposition, variables are present at two different levels, and we are interested in the relationship between both the micro and macro level association with our outcome, y.

This can be contrasted with a purely micro level proposition, where all our observed variables are the level of the individual

Likewise, if we are only interested in the relationship between macro level variables, we have this situation:

**Macro - micro propositions**

We commonly encounter the situation where a macro level variable effects a micro level outcome. This can happen in several different ways.

The first case is a macro to micro proposition, which may be exemplified by a statement such as: “Individuals in areas with high environmental contamination leads to higher risk of death”.

Whereas the second frame illustrates a more specific special case, where there is a macro level effect, net of the individual level predictor, and may be stated “For individuals with the a given level of education, living in areas with high environmental contamination leads to higher risk of death”.

The last panel illustrates what is known as a *cross level interaction*, or a macro-micro interaction. This is where the relationship between x and y is dependent on Z. This leads to the statement “Individuals with low levels of education, living in areas with high environmental contamination have higher risk of death”.

These kinds of models are used when you have observations on the same individuals over multiple time points * Don’t have to be the same times/ages for each observation + More flexibility * These models treat the individual as the higher level unit, and you are interested in studying * Change over time within an individual * Impacts of prior circumstances on later outcomes * Two modeling strategies will allow us to consider individual change over time, versus population averaged change over time.

- Not all data sets will allow you to do multi-level modeling
- Many data sets don’t have any higher level units identified, or the units they do have are not necessarily meaningful

- Not all problems are multi-level problems
- Unless you are specifying a problem that is interested in how some characteristic of some higher level structure is influencing behavior, these models are not for you.

- In the traditional linear model, groups are treated as fixed effects
- ANOVA, ANCOVA, MANOVA
- For instance the ANOVA model assumes that the group effects are fixed and do not change relative to the reference group
- Also the groups represent distinct representations of all possible groups in the population
- This model is of the form:

\[y_{ij} = \mu + u_{j} + e_{ij}\]

where \[\mu\] is the grand mean, \(u_{j}\) is the fixed effect of the \(j^{th}\) group and \(e_{ij}\) is the residual for each individual

======================================================== This model assumes that you are capturing all variation in y by the group factor differences, \(u_{j}\) alone.

**If** you have all your groups,

**and** your only predictor is the group (factor) level,

**and** if you expect there to be directional differences across groups a priori,

*then* this is probably the model for you.

You might use this framework if you want to crudely model the effect of region of residence in an analysis. + i.e. is the mean different across my region?

- The ANOVA and ANCOVA models are extremely useful if:
- You simply want to test differences in the mean across groups
*(ANOVA)*

\[ y_{ij} = \mu + u_{j} + e_{ij} \]

- Each cell (group) has its mean defined by:

\[ \mu + u_{j} \]

- And we
*typically*set one group as the “reference group” - We test if each \[\mu_{j} = 0\] using a t-test and see if all our group means are equal
Also, the global F-test will show us if

*ANY*of the means are different from one anotherIn the ANCOVA model, you want to examine the effect of a covariate in each group

*(ANCOVA)*

\[ y_{ij} = \mu + u_{j}*\beta x_{i}+ e_{ij} \] + as a simple example

- This model contains the interaction between the group factor \(u_{j}\) and \(x_i\)
- This is often called the
*parallel slopes*model, because it is testing the assumption from the simpler model:

\[ y_{ij} = \mu + u_{j}+\beta x_{i}+ e_{ij} \]

- That all groups have the same \(\beta\) effect on the mean

- Consider the ANOVA model:

\[ y_{ij} = \mu + u_{j} + e_{ij} \]

The random effects model assumes that each of the group means \(\mu + u_j\) are composed of a grand mean and an

*iid*random effectThis differs from the ANOVA model because the \(u_j\)’s are not considered

*fixed*, by setting a comparison group.

\[ y_{ij} = \mu + u_{j} + e_{ij} \]

- Generally, this
*iid*random effect,*u*is assumed to come from :

\[u_j \sim N(0, \sigma^2)\]

- the random effects are centered around the mean \(\mu\)
- So that’s why there’s a 0 mean, and the variation in the groups is modeled by the estimated variance in the distribution \(\sigma^2\) +Basically, if \(\sigma^2\) = 0, then there is no variation between groups!

- This model is called the random intercept model, because only the
*intercepts*are allowed to vary randomly

There are differences between these classic models and the linear mixed model. As a rule, you use the fixed-effects models when:

- You know that each group is regarded as unique, and you want to draw conclusions on each of these specific groups,

and you also know all the groups a priori e.g. sex or race

- If the groups can be considered as a sample from some (real or hypothetical) population of groups, and you want to draw conclusions about this population of groups, then the random effects model is appropriate.

*WHY?*because if you have a*LARGE*number of groups,say \(n_j\) > 10, then the odds that you are really interested in all possible difference in the means is probably pretty low

- There are 2
*basic*forms for the mixed model - These models may be extended in MANY, MANY, MANY more ways
*which is why we’re here*- Random Intercepts model
- Random Slopes model

- The random intercept model assumes you have:
*j groups (j=1 to J)**i individuals within the j groups (i=1 to \(n_j\))*- for each individual in the
*j*groups you have measured \(y_{ij}\) and \(x_{ij}\) and*potentially*for each group*j*, we have may have measured \(z_j\) which is a covariate measured at the group level For example, you do a survey on health and you measure: - y = the health status of each individual
- x= SES, race, etc of each individual
- j = the county each individual lives in, and
z = the poverty rate or median income in the county

- We write our full model, with
*k*predictors as: \[y_{ij} = \beta_{0j} + \sum_{k} {\beta_k x_{ik}} + \gamma z_j + e_{ij}\] - This model has a few features that we can use or not use, as it suits us
- e.g. if we don’t have a group-level predictor
*z*, then we won’t have that component of the model - \(\beta_{0j}\) is called the random intercept,
- We can write the random intercept as: \[ \beta_{0j} = \beta_0 + u_{j} \]
- i.e. a fixed mean intercept and each group’s
*iid*deviation from it and again

*u*is assumed to come from : \[u_j \sim N(0, \sigma^2)\]Graphically the \(\beta_{0j}\) term can be seen as:

- This model also estimates variance components, so you can see how much variability is accounted for by adding the random intercept term.
- if var( \(y_{ij}\)) is the total variance, \(\sigma^2\)
- and var(\(u_j\)) is the higher level variance in the random intercepts, \(\sigma^2 _{u}\)
- and var(\(e_{ij}\)) is the residual individual level variance, \(\sigma^2 _{e}\)
- we can write the total variance as:

\(\sigma^2\) = \(\sigma^2 _{e}+\sigma^2 _{u}\) + These are called the variance components of the model, + and separate the variance into differences between individuals and differences between groups

- the correlation between any two individuals within a given group is:

\[\rho(y_{ij},y_{i'j}) = \frac{ \sigma^2 _{u} }{ \sigma^2 _{u} + \sigma^2 _{e} }\]

- is called the
*intra-class correlation coefficient*, and can be interpreted as the correlation between 2 random individuals in a random group, but, I find it more informative to interpret as the fraction of the variance that is due to the groups.

In this example, I introduce how to fit the multi-level model using the `lme4`

package. This example considers the linear case of the model, where the outcome is assumed to be continuous, and the model error term is assumed to be Gaussian. Subsequent examples will highlight the Generalized Linear Mixed Model (GLMM).

This example shows how to: * Examine variation between groups using fixed effects * Fit the basic random intercept model :

\(y_{ij} = \mu + u_{j} + e_{ij}\) with \(u_j \sim N(0, \sigma^2)\)

Where the intercepts (\(u_j\)) for each group vary randomly around the overall mean (\(\mu\))

*I also illustrate how to include group-level covariates and how to fit the random slopes model and the model for cross level interactions

The example merges data from the 2014 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART MSA data. Link and the 2010 American Community Survey 5-year estimates at the MSA level. More details on these data are found below.

```
#load brfss
library(car)
library(stargazer)
library(survey)
library(sjPlot)
library(ggplot2)
library(pander)
library(knitr)
load("~/Google Drive/classes/dem7283/class18/data/brfss16_mmsa.Rdata")
set.seed(12345)
#samps<-sample(1:nrow(brfss16m), size = 40000, replace=F)
#brfss16m<-brfss16m[samps,]
#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss16m)
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
newnames<-gsub(pattern = "_",replacement = "",x = nams)
names(brfss16m)<-tolower(newnames)
```

```
#sex
brfss16m$male<-ifelse(brfss16m$sex==1, 1, 0)
#BMI
brfss16m$bmi<-ifelse(is.na(brfss16m$bmi5)==T, NA, brfss16m$bmi5/100)
#Healthy days
brfss16m$healthdays<-recode(brfss16m$physhlth, recodes = "88=0; 77=NA; 99=NA")
#Healthy mental health days
brfss16m$healthmdays<-recode(brfss16m$menthlth, recodes = "88=0; 77=NA; 99=NA")
brfss16m$badhealth<-recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")
brfss16m$race_eth<-recode(brfss16m$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor.result = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")
#insurance
brfss16m$ins<-recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss16m$inc<-recode(brfss16m$incomg, recodes = "9= NA;1='1_lt15k'; 2='2_15-25k';3='3_25-35k';4='4_35-50k';5='5_50kplus'", as.factor.result = T)
brfss16m$inc<-as.ordered(brfss16m$inc)
#education level
brfss16m$educ<-recode(brfss16m$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor.result=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')
#employloyment
brfss16m$employ<-recode(brfss16m$employ1,
recodes="1:2='employloyed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor.result=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='employloyed')
#marital status
brfss16m$marst<-recode(brfss16m$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor.result=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')
#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=c(0,24,39,59,79,99))
#BMI, in the brfss16ma the bmi variable has 2 implied decimal places,
#so we must divide by 100 to get real bmi's
brfss16m$bmi<-brfss16m$bmi5/100
#smoking currently
brfss16m$smoke<-recode(brfss16m$smoker3,
recodes="1:2=1; 3:4=0; else=NA")
#brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")
brfss16m$obese<-ifelse(is.na(brfss16m$bmi)==T, NA,
ifelse(brfss16m$bmi>30,1,0))
```

I want to see how many people we have in each MSA in the data:

```
#Now we will begin fitting the multilevel regression model with the msa
#that the person lives in being the higher level
head(data.frame(name=table(brfss16m$mmsaname),id=unique(brfss16m$mmsa)))
```

```
## name.Var1
## 1 Albany-Schenectady-Troy, NY, Metropolitan Statistical Area
## 2 Albuquerque, NM, Metropolitan Statistical Area
## 3 Allentown-Bethlehem-Easton, PA-NJ, Metropolitan Statistical Area
## 4 Anchorage, AK, Metropolitan Statistical Area
## 5 Atlanta-Sandy Springs-Roswell, GA, Metropolitan Statistical Area
## 6 Augusta-Richmond County, GA-SC, Metropolitan Statistical Area
## name.Freq id
## 1 2901 10580
## 2 1300 10740
## 3 675 10900
## 4 1028 11260
## 5 2444 12060
## 6 849 12260
```

```
#people within each msa
#How many total MSAs are in the data?
length(table(brfss16m$mmsa))
```

`## [1] 143`

`#MSAs`

We will often be interested in factors at both the individual *AND* contextual levels. To illustrate this, I will use data from the American Community Survey measured at the MSA level. Specifically, I use the DP3 table, which provides economic characteristics of places, from the 2010 5 year ACS Link.

```
library(acs)
#Get 2010 ACS median household incomes for tracts in Texas
msaacs<-geo.make(msa="*")
#ACS tables B17 is poverty, B19 is Gini, B25001 is housing vacancy, B25035 is median year built
acsecon<-acs.fetch(key=mykey, endyear=2010, span=5, geography=msaacs, variable = c("B19083_001","B17001_001","B17001_002" , "B25002_001","B25002_003", "B25035_001"))
colnames(acsecon@estimate)
```

```
## [1] "B19083_001" "B17001_001" "B17001_002" "B25002_001" "B25002_003"
## [6] "B25035_001"
```

```
msaecon<-data.frame(gini=acsecon@estimate[, "B19083_001"],
ppoverty=acsecon@estimate[, "B17001_002"]/acsecon@estimate[, "B17001_001"],
giniz=scale(acsecon@estimate[, "B19083_001"]),
pvacant=acsecon@estimate[,"B25002_003"]/acsecon@estimate[, "B25002_001"],
ppovertyz=scale(acsecon@estimate[, "B17001_002"]/acsecon@estimate[, "B17001_001"]),
pvacantz=scale(acsecon@estimate[,"B25002_003"]/acsecon@estimate[, "B25002_001"]),
medhouse=acsecon@estimate[, "B25035_001" ],
medhousez=scale(acsecon@estimate[, "B25035_001" ]))
msaecon$ids<-paste(acsecon@geography$metropolitanstatisticalareamicropolitanstatisticalarea)
head(msaecon)
```

```
## gini ppoverty giniz
## Adjuntas, PR Micro Area 0.525 0.5925656 2.718785
## Aguadilla-Isabela-San SebastiÃ¡n, PR Metro Area 0.529 0.5577311 2.847562
## Coamo, PR Micro Area 0.532 0.5633984 2.944145
## Fajardo, PR Metro Area 0.483 0.4309070 1.366623
## Guayama, PR Metro Area 0.518 0.4980518 2.493425
## Jayuya, PR Micro Area 0.502 0.5446137 1.978315
## pvacant ppovertyz
## Adjuntas, PR Micro Area 0.1996029 6.271934
## Aguadilla-Isabela-San SebastiÃ¡n, PR Metro Area 0.2005956 5.762039
## Coamo, PR Micro Area 0.1500353 5.844994
## Fajardo, PR Metro Area 0.3143172 3.905632
## Guayama, PR Metro Area 0.1951184 4.888474
## Jayuya, PR Micro Area 0.1562109 5.570031
## pvacantz medhouse
## Adjuntas, PR Micro Area 0.9053869 1978
## Aguadilla-Isabela-San SebastiÃ¡n, PR Metro Area 0.9199516 1979
## Coamo, PR Micro Area 0.1781186 1979
## Fajardo, PR Metro Area 2.5885036 1977
## Guayama, PR Metro Area 0.8395882 1978
## Jayuya, PR Micro Area 0.2687281 1978
## medhousez ids
## Adjuntas, PR Micro Area 0.4923898 10260
## Aguadilla-Isabela-San SebastiÃ¡n, PR Metro Area 0.5982027 10380
## Coamo, PR Micro Area 0.5982027 17620
## Fajardo, PR Metro Area 0.3865770 21940
## Guayama, PR Metro Area 0.4923898 25020
## Jayuya, PR Micro Area 0.4923898 27580
```

`summary(msaecon)`

```
## gini ppoverty giniz pvacant
## Min. :0.3510 Min. :0.02425 Min. :-2.88303 Min. :0.04246
## 1st Qu.:0.4190 1st Qu.:0.12308 1st Qu.:-0.69381 1st Qu.:0.09454
## Median :0.4390 Median :0.15172 Median :-0.04993 Median :0.12076
## Mean :0.4406 Mean :0.16409 Mean : 0.00000 Mean :0.13790
## 3rd Qu.:0.4590 3rd Qu.:0.18746 3rd Qu.: 0.59396 3rd Qu.:0.15964
## Max. :0.5730 Max. :0.59257 Max. : 4.26411 Max. :0.63877
## ppovertyz pvacantz medhouse medhousez
## Min. :-2.0468 Min. :-1.4002 Min. :1940 Min. :-3.5285
## 1st Qu.:-0.6002 1st Qu.:-0.6361 1st Qu.:1968 1st Qu.:-0.5657
## Median :-0.1810 Median :-0.2514 Median :1975 Median : 0.1750
## Mean : 0.0000 Mean : 0.0000 Mean :1973 Mean : 0.0000
## 3rd Qu.: 0.3421 3rd Qu.: 0.3190 3rd Qu.:1980 3rd Qu.: 0.7040
## Max. : 6.2719 Max. : 7.3489 Max. :1998 Max. : 2.6086
## ids
## Length:955
```