Introduction
Welcome to Module IV of our Spatial Statistics and Disease Mapping
course! In this module, we will explore multilevel analysis, a powerful
statistical technique for analyzing hierarchically structured data.
We’ll discuss the basic concepts of multilevel models, how to
incorporate spatial random effects, how to handle survey data, and
how to interpret the outputs of multilevel spatial models. We’ll also cover
model comparison and assessment using the intra-class correlation (ICC),
the Akaike and Bayesian information criteria (AIC, BIC), and the
likelihood-ratio test (LRT). Finally, we will work through an example of
cleaning real-world survey data in Stata.
1. Introduction to Multilevel Models
1.1 What are Multilevel Models?
Multilevel models (also known as hierarchical models or mixed-effects
models) are statistical models that are used to analyze data with a
nested or hierarchical structure. This type of structure arises when
observations are grouped within higher-level units, such as individuals
within households, households within communities, or students within
schools.
1.2 Why Use Multilevel Models?
- Accounting for Grouping: Multilevel models
explicitly acknowledge the non-independence of observations within
groups. This means they address the issue that individuals within the
same group are more similar to each other than to individuals in
different groups.
- Variance Partitioning: They partition the total
variance into components at each level, allowing researchers to
understand how much variation in the outcome is due to individual
characteristics versus group-level factors.
- Contextual Effects: They enable the study of
contextual effects – how group-level variables influence individual
outcomes.
- Correct Standard Errors: They yield standard errors
and confidence intervals that account for the non-independence of the
data; ignoring clustering typically makes standard errors too small.
1.3 Basic Concepts
- Levels: The hierarchical structure of the data,
such as level-1 (individuals) and level-2 (communities). Levels can
extend to more than 2.
- Fixed Effects: The effects of variables that are
assumed to be constant across all groups.
- Random Effects: The effects of variables that are
allowed to vary across different groups, allowing group-specific
intercepts and slopes.
- Intercepts: Group-specific intercept terms are
included as random effects so that the baseline level of the outcome can
differ from group to group.
- Slopes: Group-specific slopes can be included as
random effects so that the relationship between a predictor and the
outcome can vary across groups.
- Variance Components: The partition of the total variation
across levels, showing how much of the variation lies at the
level of the individuals and how much at the level of the group.
- Intra-class Correlation (ICC): The proportion of the
total variance that lies between groups, that is, group-level variance
divided by the sum of group-level and individual-level variance.
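As a minimal sketch of the ICC definition, the short Python snippet below computes it from two variance components. The variance values are made up for illustration. For a multilevel logistic model, the sketch assumes the common latent-variable convention in which the individual-level variance is fixed at π²/3.

```python
import math

def icc(group_variance, residual_variance):
    """Share of the total variance that lies between groups."""
    return group_variance / (group_variance + residual_variance)

# Linear model: both variance components are estimated (illustrative values).
icc_linear = icc(group_variance=2.0, residual_variance=6.0)

# Logistic model: level-1 variance fixed at pi^2 / 3 on the latent scale
# (an assumption of the latent-variable approach, not an estimate).
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3
icc_logistic = icc(group_variance=1.5,
                   residual_variance=LOGISTIC_RESIDUAL_VARIANCE)

print(f"ICC (linear):   {icc_linear:.3f}")
print(f"ICC (logistic): {icc_logistic:.3f}")
```

An ICC near 0 suggests that grouping matters little; a larger ICC means that a substantial share of the variation sits between groups.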
1.4 Mathematical Representation
A basic two-level model can be represented as follows:
Level 1 (Individual Level): yᵢⱼ = β₀ⱼ + β₁xᵢⱼ + εᵢⱼ
Level 2 (Group Level): β₀ⱼ = γ₀₀ + γ₀₁zⱼ + u₀ⱼ
Where:
- yᵢⱼ is the outcome variable for individual i in group j.
- xᵢⱼ is the individual-level predictor.
- β₀ⱼ is the intercept for group j.
- β₁ is the fixed slope of the individual-level predictor.
- εᵢⱼ is the level-1 (individual-level) residual.
- zⱼ is the group-level predictor.
- γ₀₀ is the overall intercept (the expected outcome when all predictors are zero).
- γ₀₁ is the fixed coefficient of the group-level predictor.
- u₀ⱼ is the level-2 residual, that is, the random intercept deviation of group j.
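To see how the two levels combine, the level-2 equation can be substituted into the level-1 equation. The Python sketch below does this for two hypothetical individuals with the same x value who live in different groups. All coefficient values are invented for illustration, and the level-1 residual is left out so that only the systematic part is shown.

```python
def predicted_outcome(gamma00, gamma01, z_j, u0j, beta1, x_ij):
    """Substitute the level-2 equation into the level-1 equation.

    The level-1 residual is omitted, so this is the expected outcome
    for an individual with predictor x_ij in group j.
    """
    beta0j = gamma00 + gamma01 * z_j + u0j  # group-specific intercept
    return beta0j + beta1 * x_ij

# Same individual-level predictor, different groups (hypothetical values).
same_x = 3.0
group_a = predicted_outcome(gamma00=10.0, gamma01=2.0, z_j=1.0, u0j=0.5,
                            beta1=1.5, x_ij=same_x)
group_b = predicted_outcome(gamma00=10.0, gamma01=2.0, z_j=0.0, u0j=-0.5,
                            beta1=1.5, x_ij=same_x)

print(group_a, group_b, group_a - group_b)
```

The difference between the two predictions comes entirely from the group level: part from the group-level predictor (γ₀₁ times the difference in zⱼ) and part from the random intercepts.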
2. Spatial Random Effects
Incorporating spatial dependencies into multilevel models is crucial
when the data is not only hierarchical but also spatially structured.
Spatial random effects account for spatial autocorrelation at the group
level.
2.1 What are Spatial Random Effects?
Spatial random effects are additional random effects that capture the
spatial structure among groups: they allow neighboring groups to be
correlated, which independent group-level errors cannot represent.
This is particularly important when the groups are spatial units such as
census tracts, administrative regions or survey clusters.
2.2 How to Incorporate Spatial Random Effects
- Spatial Weights Matrix (W): A matrix that defines
the spatial relationships between groups using the concept of proximity
or neighborhood. This matrix can be based on distance, adjacency or
contiguity.
- Spatial Autocorrelation Parameter (ρ): A spatial
autoregressive parameter that, combined with the spatial weights matrix,
models the strength of the spatial dependence, just as in the spatial
regression models.
- Spatial Random Intercepts: Spatial random
intercepts account for a spatially structured variation in the mean
outcome of the response variable across groups.
- Spatial Random Slopes: Spatial random slopes
incorporate spatial heterogeneity in the effect of predictors on the
outcome.
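As a small illustration of the spatial weights matrix, the Python sketch below builds a binary contiguity matrix from an adjacency list and then row-standardizes it. The four regions and their borders are hypothetical. In practice the neighbor structure would come from a shapefile or a distance threshold.

```python
# Hypothetical adjacency list: which regions share a border.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}
regions = sorted(neighbors)

# Binary contiguity matrix: W[i][j] = 1 if regions i and j are neighbors.
W = [[1 if b in neighbors[a] else 0 for b in regions] for a in regions]

# Row-standardized version: each row sums to 1, so a spatial lag
# becomes the average over a region's neighbors.
W_std = [[w / sum(row) for w in row] for row in W]

for name, row in zip(regions, W_std):
    print(name, [round(w, 2) for w in row])
```

Row-standardization is one common convention; whether to use it depends on the spatial model being fitted.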
2.3 Types of Spatial Random Effects
- Conditional Autoregressive (CAR) Model: One of the
most commonly used models for spatial random effects, where the random
effect at each location is influenced by random effects at neighboring
locations.
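- Simultaneous Autoregressive (SAR) Model: Closely related to
the CAR model, but it specifies the random effects jointly, with each
effect written as a weighted sum of its neighbors' effects plus an
independent error, rather than through conditional distributions.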
- Gaussian Process Models: Can be used for modelling
spatial autocorrelation, especially when spatial locations can be
represented as points.
2.4 Mathematical Representation (CAR Model)
Level 1 (Individual Level): yᵢⱼ = β₀ⱼ + β₁xᵢⱼ + εᵢⱼ
Level 2 (Group Level): β₀ⱼ = γ₀₀ + γ₀₁zⱼ + u₀ⱼ, where u₀ⱼ ~ CAR(W)
Where:
- All variables are defined as before.
- CAR(W) denotes the spatial random effects, which follow a conditional
autoregressive process based on the spatial weights matrix W.
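To make "conditional autoregressive" concrete, the Python sketch below assumes the intrinsic CAR variant, which is one common choice and is not specified in the text above. Under that variant, the random effect of a group, given all the others, is normally distributed around the average of its neighbors' effects, with a variance that shrinks as the number of neighbors grows. The weights matrix, effect values and variance parameter are hypothetical.

```python
def icar_conditional(j, u, W, sigma2):
    """Conditional mean and variance of u[j] given the other effects,
    assuming an intrinsic CAR prior with binary weights matrix W."""
    n_j = sum(W[j])  # number of neighbors of group j
    mean = sum(W[j][k] * u[k] for k in range(len(u))) / n_j
    return mean, sigma2 / n_j

# Hypothetical 4-group neighborhood structure and random effects.
W = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
u = [0.4, -0.2, 0.6, 0.0]

mean_b, var_b = icar_conditional(1, u, W, sigma2=0.9)
print(f"conditional mean: {mean_b:.3f}, conditional variance: {var_b:.3f}")
```

The second group has three neighbors, so its conditional mean is the average of their three effects and its conditional variance is one third of the variance parameter.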
3. Application of Multilevel Analysis with Survey Data
Applying multilevel models to survey data requires careful attention
to survey design features such as weighting, stratification, and
clustering. Accounting for them is essential for obtaining reliable
results from the analysis.
3.1 Survey Designs in R and Stata
A survey design specification describes how the data were sampled from
the population: the sampling weights, the strata, the clusters, and any
multi-stage sampling.
3.1.1 Survey Design in R
In R, you can specify survey designs using the survey package. Here’s a basic overview:
- Install and load the package: install.packages("survey") and library(survey)
- Create a survey design object:

```r
library(survey)

survey_design <- svydesign(
  id      = ~cluster_variable,   # cluster (primary sampling unit) variable
  strata  = ~strata_variable,    # stratification variable
  weights = ~weights_variable,   # survey weight variable
  data    = your_data
)
```

- Analysis with survey design:

```r
# Linear regression (the gaussian family is the default)
svyglm(outcome_variable ~ predictor_variable, design = survey_design)
# Logistic regression for a binary outcome
svyglm(outcome_variable ~ predictor_variable, design = survey_design,
       family = quasibinomial())
# Boxplot for survey data
svyboxplot(outcome_variable ~ group_variable, design = survey_design)
```
3.1.2 Survey Design in Stata
In Stata, you can use svyset to declare the survey design:
- Specify the survey design:

```stata
* Declare the cluster, sampling weight, and strata variables
svyset cluster_variable [pweight=weights_variable], strata(strata_variable)
```

- Analysis with survey design:

```stata
* Linear regression
svy: regress outcome_variable predictor_variable
* Logistic regression
svy: logit outcome_variable predictor_variable
* Cross-tabulation with row percentages and confidence intervals
svy: tab outcome_variable group_variable, ci percent row format(%7.1f)
```
3.2 Incorporating Survey Weights
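Survey weights are necessary to account for unequal selection
probabilities in the sample. To incorporate survey weights in R and
Stata:
- R: survey weights are specified using the weights argument of the
svydesign function.
- Stata: weights are specified using [pweight=weights_variable] in
svyset, after which commands run with the svy prefix apply them
automatically. Commands run without the svy prefix need the weight given
explicitly. [iw=weights_variable] is often used for commands that do not
accept pweights, but it reproduces only the weighted point estimates,
not design-based standard errors.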
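The effect of weighting can be seen in a small Python sketch. The four records below are invented: two respondents come from an oversampled group (small weights) and two from an undersampled group (large weights). The weighted proportion is the sum of weight times outcome divided by the sum of weights.

```python
# Hypothetical records: (binary outcome, survey weight).
records = [(1, 50.0), (0, 200.0), (1, 50.0), (0, 200.0)]

# Unweighted proportion treats every respondent as equally representative.
unweighted = sum(y for y, _ in records) / len(records)

# Weighted proportion: each respondent counts in proportion to the
# number of population members he or she represents.
weighted = sum(y * w for y, w in records) / sum(w for _, w in records)

print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

Here the unweighted proportion overstates the outcome because the oversampled group, where the outcome is common, counts too heavily.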
3.3 Handling Clustering and Stratification
- Clustering: In both R and Stata, the cluster
variable needs to be specified to account for potential correlation
within clusters.
- Stratification: Stratification also needs to be
included in the survey design to account for differences between
strata.
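One way to see why clustering cannot be ignored is the design effect. The Python sketch below uses Kish's approximation for equal-sized clusters, deff = 1 + (m - 1) × ICC, which is a simplification brought in here for illustration. The sample size, cluster size and ICC are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish's approximation for equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc

# Hypothetical survey: 2,000 respondents in clusters of 20, ICC of 0.05.
n, m, icc = 2000, 20, 0.05
deff = design_effect(m, icc)
effective_n = n / deff  # sample size a simple random sample would need

print(f"design effect: {deff:.2f}, effective sample size: {effective_n:.0f}")
```

Even a small within-cluster correlation can roughly halve the effective sample size, which is why standard errors that ignore clustering tend to be too small.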
3.4 Multilevel Analysis with Survey Data
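- In R, packages such as lme4 or rstanarm fit multilevel models, but
they are not survey-aware by default. In particular, the weights
argument of lmer() and glmer() in lme4 is intended for precision weights
rather than sampling weights, so check how a package treats weights
before relying on it to reflect the survey design.
- In Stata, use melogit and the other multilevel modelling commands.
With the svy prefix (svy: melogit), the variance estimates reflect the
survey design; this typically requires the svyset declaration to supply
weights for each level of the model.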
4. Interpretation and Model Assessment of Multilevel Spatial Outputs
4.1 Interpreting Fixed Effects
- Fixed effects represent the average effect of the predictor
variables on the outcome, holding the other predictors constant.
- Interpret coefficients as you would in a standard regression; when
the survey design is accounted for, the estimates generalize to the
target population.
4.2 Interpreting Random Effects
- Random effects represent the group-level variability that is not
explained by the fixed effects. It is summarized by the group-level
variance, which can be used to calculate the intra-class correlation
(ICC), a measure of the degree of clustering in the response variable.
- For spatial random effects, also interpret the spatial dependence
parameter, which indicates how strongly the effects of neighboring
groups resemble one another.
- Look at the variance components to assess the degree of
between-group differences and of spatial clustering.
5. Data Cleaning and Multilevel Modeling using Stata: An Example
Let’s illustrate these steps using the example of the Somalia
Demographic and Health Survey (SDHS) 2020 data for sanitation access,
using the code provided with this module.
5.1 Data Cleaning in Stata
- Setting Dependent Variable (DV):
  - The provided code reclassifies household sanitation access into the
  binary categories “Improved” (0) and “Unimproved” (1), based on the
  type of sanitation available in each household. This is saved as the
  variable dv.
- Selecting Independent Variables (IVs):
  - Individual-level factors included: sex and age of household heads,
  media exposure (TV, Radio), wealth status, shared facilities, and family
  size.
  - Community-level factors included: place of residence, region, time
  to reach the water source.
- Data Cleaning:
  - The dataset is restricted to the selected IVs, the DV, the cluster
  variable (HV001), the strata variable (STRU_CD), and the sampling
  weight (HV005).
  - Observations with missing values on the selected variables are
  dropped from the analysis using the command drop if missing(...).
- Data saving: A clean dataset of the selected variables
is saved as a Stata .dta file.
5.2 Survey Design Specification in Stata
- Setting Survey Design: The code specifies the
survey design using svyset, including the cluster (HV001), strata
(STRU_CD), and sampling weights (HV005), which are then used for
weighted descriptive analysis to understand the distribution of the DV
and IVs.
- Recoding IVs: The IVs are recoded to make them
categorical variables for ease of interpretation.
- Bivariate Analysis: The code performs chi-squared
tests using svy: tab to examine associations between the
outcome variable and various predictors while accounting for the survey
design.
- Multicollinearity test: A test for
multicollinearity is performed to check for correlation between
predictor variables.
- Multivariable Logistic Regression: A multivariable
logistic regression model is used to identify predictors of unimproved
sanitation.
5.3 Multilevel Analysis in Stata
- Empty Model: The code starts by fitting an empty
model (melogit dv || HV001:) to estimate the community-level variance
and obtain the ICC.
- Individual-Level Model: Individual-level factors
are added to the model to assess their influence on unimproved
sanitation, with estimation of the ICC and model assessment statistics
(AIC and BIC).
- Community-Level Model: Community-level factors are
added to assess their influence.
- Full Model: Both individual and community-level
factors are included in the final model, and model comparison criteria
are used to assess the fit of the full model compared to the models
nested within it.
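The comparison criteria used across these models rest on simple formulas: AIC = -2·logL + 2k, BIC = -2·logL + k·ln(n), and the likelihood-ratio statistic 2·(logL_full - logL_restricted). The Python sketch below applies them to two models. The log-likelihoods, parameter counts and sample size are hypothetical and are not taken from the SDHS analysis.

```python
import math

def aic(loglik, k):
    """Akaike information criterion; k is the number of parameters."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """Bayesian information criterion; n is taken as the number of
    observations (a choice that is debated for multilevel models)."""
    return -2 * loglik + k * math.log(n)

def lrt_statistic(loglik_restricted, loglik_full):
    """Likelihood-ratio statistic for two nested models."""
    return 2 * (loglik_full - loglik_restricted)

# Hypothetical fits: name -> (log-likelihood, number of parameters).
models = {"empty": (-5200.0, 2), "full": (-4950.0, 12)}
n_obs = 10000

for name, (ll, k) in models.items():
    print(f"{name}: AIC = {aic(ll, k):.1f}, BIC = {bic(ll, k, n_obs):.1f}")

lrt = lrt_statistic(models["empty"][0], models["full"][0])
df = models["full"][1] - models["empty"][1]
print(f"LRT = {lrt:.1f} on {df} degrees of freedom")
```

Lower AIC and BIC indicate better fit after penalizing complexity. The LRT statistic is compared with a chi-squared distribution whose degrees of freedom equal the difference in the number of parameters.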
5.4 Integrating Spatial Component
- Although the provided code does not explicitly use spatial
information, we can add a spatial component at the community level.
- By using the geographical identifiers (e.g., region), we can develop
a spatial weights matrix and include spatial random effects, using CAR
or SAR models to model spatial autocorrelation.
- After incorporating spatial effects, we can assess how the spatial
model compares with the non-spatial models, for example using AIC or BIC.
- The spatial component can be used to map predicted probabilities, as
we did in previous modules, in order to understand the geographical
distribution of areas with poor sanitation and the clustering of these
areas.
6. Aligning Multilevel Analysis to Spatial Analysis
Once you’ve established your multilevel model, here’s how to align it
with spatial analysis:
- Mapping random effects: Map the spatial random effects to visualize
the spatial patterns of variation at the group level.
- Spatial residuals analysis: Assess the spatial autocorrelation of the
residuals at the group level, using diagnostic tests.
- Spatial random effects: If spatial autocorrelation is detected in the
residuals, re-fit the multilevel model with a spatial random component.
- Mapping of estimates: Map the predicted probabilities or the fitted
values of the outcome variable at the group level to visualize the
geographical distribution of the outcome across different regions.
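One common diagnostic for the residual check is Moran's I, used here as an illustrative choice. The Python sketch below computes it for group-level values given a spatial weights matrix. The four regions are arranged in a line, and both sets of residuals are invented to show a positive and a negative case.

```python
def morans_i(values, W):
    """Moran's I for a list of values and a spatial weights matrix W."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in W)  # sum of all weights
    cross = sum(W[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

# Hypothetical neighborhood structure: four regions in a line (1-2-3-4).
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

smooth = morans_i([1.0, 2.0, 3.0, 4.0], W)         # neighbors are similar
alternating = morans_i([1.0, -1.0, 1.0, -1.0], W)  # neighbors are opposite

print(f"smooth pattern: {smooth:.3f}, alternating pattern: {alternating:.3f}")
```

Values above the expectation under no autocorrelation, -1/(n - 1), point to clustering of similar residuals among neighbors, which would motivate re-fitting the model with a spatial random component.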
7. Conclusion
In this module, you have learned the fundamentals of multilevel
models, how to incorporate spatial random effects, and the crucial
considerations for using survey data. You have also seen how to compare
and assess models, how to check for spatial autocorrelation, and how to
fit multilevel models to survey data in Stata. You now have the skills
to conduct complex statistical analyses on hierarchically structured
spatial data in public health and related fields.
In Module V we will dive deeper into spatial mapping, data handling
and manipulation of spatial data using different R packages.