Introduction

Welcome to Module IV of our Spatial Statistics and Disease Mapping course! In this module, we will explore multilevel analysis, a powerful statistical technique for analyzing hierarchically structured data. We’ll discuss the basic concepts of multilevel models, how to incorporate spatial random effects, how to handle survey data, and how to interpret the outputs of multilevel spatial models. We’ll also cover model comparison and assessment using tools such as the intra-class correlation (ICC), AIC, BIC, and the likelihood ratio test (LRT). Additionally, we will go through an example of data cleaning for a real-world survey dataset using STATA.

1. Introduction to Multilevel Models

1.1 What are Multilevel Models?

Multilevel models (also known as hierarchical models or mixed-effects models) are statistical models that are used to analyze data with a nested or hierarchical structure. This type of structure arises when observations are grouped within higher-level units, such as individuals within households, households within communities, or students within schools.

1.2 Why Use Multilevel Models?

  • Accounting for Grouping: Multilevel models explicitly acknowledge the non-independence of observations within groups. This means they address the issue that individuals within the same group are more similar to each other than to individuals in different groups.
  • Variance Partitioning: They partition the total variance into components at each level, allowing researchers to understand how much variation in the outcome is due to individual characteristics versus group-level factors.
  • Contextual Effects: They enable the study of contextual effects – how group-level variables influence individual outcomes.
  • Correct Standard Errors: They give correct standard errors and confidence intervals that take into account non-independence of data.

1.3 Basic Concepts

  • Levels: The hierarchical structure of the data, such as level-1 (individuals) and level-2 (communities). Levels can extend to more than 2.
  • Fixed Effects: The effects of variables that are assumed to be constant across all groups.
  • Random Effects: The effects of variables that are allowed to vary across different groups, allowing group-specific intercepts and slopes.
  • Intercepts: Group-specific intercept terms are included as random effects to allow the starting value of each group to be different.
  • Slopes: Level-specific slopes can be included as random effects to allow the relationship between independent and dependent variable to vary across groups.
  • Variance Components: Partitioning of variation to different levels so that we can know how much of the variation is at the level of the individuals and how much is at the level of the group.
  • Intra-class Correlation (ICC): Measures the proportion of the total variance that is explained by the group level.

1.4 Mathematical Representation

A basic two-level model can be represented as follows:

Level 1 (Individual Level): yᵢⱼ = β₀ⱼ + β₁xᵢⱼ + εᵢⱼ

Level 2 (Group Level): β₀ⱼ = γ₀₀ + γ₀₁zⱼ + u₀ⱼ

Where:

  • yᵢⱼ is the outcome variable for individual i in group j.
  • xᵢⱼ is the individual-level predictor.
  • β₀ⱼ is the intercept for group j.
  • β₁ is the fixed slope for the individual-level predictor.
  • εᵢⱼ is the level-1 (individual-level) residual error term.
  • zⱼ is the group-level predictor.
  • γ₀₀ is the overall intercept (the mean outcome when all predictors are zero).
  • γ₀₁ is the fixed coefficient of the group-level predictor.
  • u₀ⱼ is the level-2 (group-level) residual error term, i.e. the random intercept for group j.
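The two-level model above can be made concrete with a small simulation. The sketch below is written in Python (standard library only) rather than R or STATA, and every name and parameter value in it is an assumption chosen for illustration. It generates data from the level-1 and level-2 equations and then recovers the ICC with a balanced one-way ANOVA estimator, which is a simple stand-in for a fitted multilevel model, not a replacement for one.

```python
import random
from statistics import mean


def simulate_two_level(n_groups, n_per_group, gamma00, gamma01, beta1,
                       sd_u, sd_e, seed=1):
    """Simulate (group, y) pairs from the two-level random-intercept model."""
    rng = random.Random(seed)
    data = []
    for j in range(n_groups):
        z_j = rng.gauss(0.0, 1.0)                    # group-level predictor
        u0j = rng.gauss(0.0, sd_u)                   # level-2 residual
        beta0j = gamma00 + gamma01 * z_j + u0j       # level-2 equation
        for _ in range(n_per_group):
            x_ij = rng.gauss(0.0, 1.0)               # individual-level predictor
            e_ij = rng.gauss(0.0, sd_e)              # level-1 residual
            data.append((j, beta0j + beta1 * x_ij + e_ij))  # level-1 equation
    return data


def anova_icc(data):
    """Estimate the ICC from balanced data with the one-way ANOVA estimator."""
    groups = {}
    for j, y in data:
        groups.setdefault(j, []).append(y)
    n_groups = len(groups)
    n = len(next(iter(groups.values())))             # assumes equal group sizes
    grand_mean = mean(y for _, y in data)
    group_means = {j: mean(ys) for j, ys in groups.items()}
    ms_between = n * sum((m - grand_mean) ** 2
                         for m in group_means.values()) / (n_groups - 1)
    ms_within = sum((y - group_means[j]) ** 2
                    for j, ys in groups.items() for y in ys) / (n_groups * (n - 1))
    var_group = max((ms_between - ms_within) / n, 0.0)
    return var_group / (var_group + ms_within)


# Empty model (no predictors): true ICC = 1 / (1 + 4) = 0.2
sample = simulate_two_level(200, 20, gamma00=0.0, gamma01=0.0, beta1=0.0,
                            sd_u=1.0, sd_e=2.0)
print(round(anova_icc(sample), 3))
```

With the predictors switched off, the estimate should land close to the true value of 0.2, with some sampling noise.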

2. Spatial Random Effects

Incorporating spatial dependencies into multilevel models is crucial when the data is not only hierarchical but also spatially structured. Spatial random effects account for spatial autocorrelation at the group level.

2.1 What are Spatial Random Effects?

Spatial random effects are additional random effects that account for the spatial structure within groups, allowing for spatial correlation between neighboring groups that were not captured by independent errors. This is particularly important when the groups are spatial units such as census tracts, administrative regions or survey clusters.

2.2 How to Incorporate Spatial Random Effects

  1. Spatial Weights Matrix (W): A matrix that defines the spatial relationships between groups using the concept of proximity or neighborhood. This matrix can be based on distance, adjacency or contiguity.
  2. Spatial Autocorrelation Parameter (ρ): Incorporate a spatial autoregressive parameter to model the spatial dependencies using the spatial weights matrix, just like in the spatial regression models.
  3. Spatial Random Intercepts: Spatial random intercepts account for a spatially structured variation in the mean outcome of the response variable across groups.
  4. Spatial Random Slopes: Spatial random slopes incorporate spatial heterogeneity in the effect of predictors on the outcome.
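As a minimal sketch of the first step, the Python snippet below builds a row-standardized contiguity weights matrix from a neighbour list. The region names and adjacency structure are hypothetical, and in practice the neighbour list would be derived from a shapefile with spatial software rather than typed by hand.

```python
def row_standardized_weights(neighbors):
    """Build a row-standardized spatial weights matrix W.

    neighbors maps each region id to the list of regions it borders.
    Returns the sorted region ids and W as a list of rows.
    """
    ids = sorted(neighbors)
    index = {region: k for k, region in enumerate(ids)}
    W = [[0.0] * len(ids) for _ in ids]
    for region, nbrs in neighbors.items():
        for other in nbrs:
            # Each neighbour gets an equal share, so the row sums to 1.
            W[index[region]][index[other]] = 1.0 / len(nbrs)
    return ids, W


# Hypothetical chain of four regions: A - B - C - D
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
ids, W = row_standardized_weights(adjacency)
for region, row in zip(ids, W):
    print(region, row)
```

Row standardization makes the spatially lagged value of a variable equal to the average over a region's neighbours. A region with no neighbours keeps a row of zeros.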

2.3 Types of Spatial Random Effects

  • Conditional Autoregressive (CAR) Model: One of the most commonly used models for spatial random effects, where the random effect at each location is influenced by random effects at neighboring locations.
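  • Simultaneous Autoregressive (SAR) Model: Similar to the CAR model, but it specifies the spatial dependence through a set of simultaneous equations for all locations at once, rather than through the conditional distribution of each location given its neighbors.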
  • Gaussian Process Models: Can be used for modelling spatial autocorrelation, especially when spatial locations can be represented as points.

2.4 Mathematical Representation (CAR Model)

Level 1 (Individual Level): yᵢⱼ = β₀ⱼ + β₁xᵢⱼ + εᵢⱼ

Level 2 (Group Level): β₀ⱼ = γ₀₀ + γ₀₁zⱼ + u₀ⱼ, where u₀ⱼ ~ CAR(W)

Where:

  • All variables are as defined before.
  • CAR(W) denotes the spatial random effects, which follow a conditional autoregressive process based on the spatial weights matrix W.
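One common way to write out the CAR(W) shorthand is through the conditional distribution of each group's random effect given the others. The form below is a sketch of the widely used proper CAR specification. The symbols w_{jk} (elements of W), ρ (spatial dependence parameter) and τ² (conditional variance parameter) are notation introduced here, and the exact parameterization differs between software packages.

```latex
u_{0j} \mid u_{0k},\, k \neq j \;\sim\;
\mathcal{N}\!\left(
  \rho \, \frac{\sum_{k} w_{jk}\, u_{0k}}{\sum_{k} w_{jk}},\;
  \frac{\tau^{2}}{\sum_{k} w_{jk}}
\right)
```

The conditional mean is a ρ-scaled weighted average of the neighbouring random effects, and the conditional variance shrinks as a group has more neighbours. Setting ρ = 1 gives the intrinsic CAR (ICAR) model.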

3. Application of Multilevel Analysis with Survey Data

Applying multilevel models to survey data requires careful attention to survey design features, such as weighting, stratification, and clustering, which is very important to obtain reliable results from the analysis.

3.1 Survey Designs in R and STATA

Survey design involves specifying how data was sampled from a population. This involves defining sample weights, stratification, clustering and multi-stage sampling techniques.

3.1.1 Survey Design in R

In R, you can specify survey designs using the survey package. Here’s a basic overview:

  • Install and load package: install.packages("survey") and library(survey)
  • Create a survey design object:

```r
survey_design <- svydesign(
  id = ~cluster_variable,       # Specify the cluster variable
  strata = ~strata_variable,    # Specify the stratification variable
  weights = ~weights_variable,  # Specify the survey weight variable
  data = your_data
)
```

  • Analysis with survey design:

```r
# Linear regression (svyglm defaults to the gaussian family)
svyglm(outcome_variable ~ predictor_variable, design = survey_design)

# Logistic regression: set family = quasibinomial()
svyglm(outcome_variable ~ predictor_variable, design = survey_design,
       family = quasibinomial())

# Boxplot for survey data
svyboxplot(outcome_variable ~ group_variable, design = survey_design)
```

3.1.2 Survey Design in STATA

In STATA, you can use svyset to declare survey design:

  • Specify survey design:

```stata
svyset cluster_variable [pweight=weights_variable], strata(strata_variable)
```

  • Analysis with survey design:

```stata
* Linear regression
svy: regress outcome_variable predictor_variable
* Logistic regression
svy: logit outcome_variable predictor_variable
* Cross-tabulations
svy: tab outcome_variable group_variable, ci percent row format(%7.1f)
```

3.2 Incorporating Survey Weights

Survey weights are necessary to account for unequal selection probabilities in the sample. To incorporate survey weights in R and Stata:

  • R: survey weights are specified using the weights argument in the svydesign function.
  • STATA: weights are specified using [pweight=weights_variable] in svyset and apply to every command run with the svy prefix. Commands run without the svy prefix ignore svyset and need their own weight specification, for example [pweight=weights_variable], or [iweight=weights_variable] where a command does not accept pweights.
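To see why the weights matter, the short Python sketch below compares an unweighted and a weighted proportion. The outcomes and weights are invented for illustration, and the function name is hypothetical. It also shows that multiplying all weights by a constant leaves the weighted proportion unchanged.

```python
def weighted_proportion(outcomes, weights):
    """Weighted proportion of a 0/1 outcome: sum(w * y) / sum(w)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * y for y, w in zip(outcomes, weights)) / total_weight


# Hypothetical sample: households with outcome 0 were under-sampled,
# so each of them represents more households in the population.
outcomes = [1, 1, 0, 0]
weights = [1.0, 1.0, 3.0, 3.0]

print(sum(outcomes) / len(outcomes))            # unweighted: 0.5
print(weighted_proportion(outcomes, weights))   # weighted: 0.25

# Rescaling every weight by the same constant gives the same estimate.
scaled = [w * 1_000_000 for w in weights]
print(weighted_proportion(outcomes, scaled))    # 0.25
```

This covers point estimates only. Correct standard errors additionally need the cluster and strata information, which is what svydesign and svyset supply.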

3.3 Handling Clustering and Stratification

  • Clustering: In both R and Stata, the cluster variable needs to be specified to account for potential correlation within clusters.
  • Stratification: Stratification also needs to be included in the survey design to account for differences between strata.
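A rough feel for the cost of ignoring clustering comes from the Kish design effect approximation, deff = 1 + (m − 1) × ICC, where m is the average cluster size. The Python sketch below uses made-up numbers and assumes clusters of roughly equal size. It is a back-of-the-envelope check, not a substitute for design-based variance estimation.

```python
def design_effect(avg_cluster_size, icc):
    """Kish approximation: variance inflation caused by cluster sampling."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n, avg_cluster_size, icc):
    """Size of a simple random sample carrying the same information."""
    return n / design_effect(avg_cluster_size, icc)


# Hypothetical survey: 2000 households, 20 per cluster, ICC of 0.1
print(round(design_effect(20, 0.1), 2))
print(round(effective_sample_size(2000, 20, 0.1)))
```

Even a modest ICC can shrink the effective sample size considerably when clusters are large, which is why standard errors that ignore clustering tend to be too small.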

3.4 Multilevel Analysis with Survey Data

  • Use R packages such as lme4 or rstanarm to fit multilevel models, but note that the weights argument in lme4 is a precision (prior) weight rather than a sampling weight, so passing survey weights to it does not by itself account for the survey design.
  • Use melogit and other multilevel modelling commands in STATA to perform analyses. Where the svy prefix is used for design-based variance estimation, the weights at each level of the model should be declared in svyset.

4. Interpretation and Model Assessment of Multilevel Spatial Outputs

4.1 Interpreting Fixed Effects

  • Fixed effects represent the average effect of the predictor variables on the outcome, holding all other factors constant.
  • Interpret coefficients as you would with standard regression, but make sure that your interpretations are for the population (taking into account the survey design).

4.2 Interpreting Random Effects

  • Random effects represent the group-level variability that is not explained by the fixed effects. This is in the form of the group level variance which can be used to calculate intra-class correlation, which reflects the degree of clustering in the response variable.
  • For spatial random effects, also interpret the spatial dependence parameter (for example ρ), which indicates how strongly a group's random effect depends on the random effects of its neighbors.
  • Look at the variance components to assess the degree of between-group differences, spatial correlation and the spatial clustering.

4.3 Model Comparison and Checking Tools

  • Intra-class Correlation (ICC): Measures the proportion of the total variance that is attributable to group differences. A high ICC suggests strong within-group similarity and that a multilevel model is appropriate.
    • ICC = Var_group / (Var_group + Var_individual)
  • Akaike Information Criterion (AIC): A measure of the goodness-of-fit of the model that penalizes model complexity, allowing comparison of models with different numbers of parameters. Lower values indicate a better balance of fit and complexity.
    • AIC = -2(log-likelihood) + 2(number of parameters)
  • Bayesian Information Criterion (BIC): Another measure of goodness-of-fit that penalizes model complexity, with a heavier penalty than AIC whenever log(n) exceeds 2.
    • BIC = -2(log-likelihood) + log(n)(number of parameters), where n is the number of observations
  • Likelihood Ratio Test (LRT): A statistical test used to compare nested models to see if the more complex model has a significantly improved model fit compared to a simpler model.
  • Deviance Information Criterion (DIC): For Bayesian models, the DIC balances model fit against model complexity, in the same spirit as AIC.
  • Residual Plots: Check the distribution of the residuals, whether they follow the normality assumptions, or whether they show any pattern.
  • Spatial Autocorrelation: Use diagnostic tests to check for any spatial pattern in the residuals and evaluate whether your model adequately accounts for spatial clustering.
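The AIC, BIC and LRT formulas above can be checked by hand once the log-likelihoods are known. The Python sketch below uses the standard library only. The log-likelihoods, parameter counts and sample size are invented for illustration, and the chi-square tail probability is computed with a small helper written for this example rather than taken from a statistics package.

```python
import math


def aic(log_lik, n_params):
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n_obs):
    return -2.0 * log_lik + math.log(n_obs) * n_params


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)                 # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))      # closed form for df = 1
    while 2 * a < df:                            # recurrence: df -> df + 2
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(log_lik_simple, log_lik_complex, df):
    """LRT statistic and p-value for two nested models."""
    stat = 2.0 * (log_lik_complex - log_lik_simple)
    return stat, chi2_sf(stat, df)


# Hypothetical nested models fitted to the same 2000 observations
ll_simple, k_simple = -1200.0, 2
ll_full, k_full = -1180.0, 5

print(aic(ll_simple, k_simple), aic(ll_full, k_full))
print(bic(ll_simple, k_simple, 2000), bic(ll_full, k_full, 2000))
print(likelihood_ratio_test(ll_simple, ll_full, df=k_full - k_simple))
```

The LRT is valid only for nested models fitted to the same observations. AIC and BIC can also rank non-nested models, again provided the data are identical.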

5. Data Cleaning and Multilevel Modeling using STATA: An Example

Let’s illustrate these steps using the example of the Somalia Demographic and Health Survey (SDHS) 2020 data for sanitation access, using the provided code.

5.1 Data Cleaning in STATA

  1. Setting Dependent Variable (DV):
    • The provided code reclassifies household sanitation access into binary categories of “Improved” (0) and “Unimproved” (1), based on the type of sanitation available in each household. This is saved as the variable dv.
  2. Selecting Independent Variables (IVs):
    • Individual-level factors included: sex and age of household heads, media exposure (TV, Radio), wealth status, shared facilities, and family size.
    • Community-level factors included: place of residence, region, time to reach the water source.
  3. Data Cleaning:
    • The data set is reduced to the selected IVs, the DV, the cluster variable (HV001), the strata variable (STRU_CD), and the sampling weight variable (HV005).
    • Observations with missing values on the selected variables are excluded from the analysis using the command drop if missing(...).
  4. Data saving: A clean dataset of selected variables is saved as a Stata .dta file.
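The logic of these cleaning steps can be sketched outside STATA as well. The Python snippet below is an illustration only: the field names, the toilet categories and the set treated as "improved" are hypothetical placeholders, not the actual SDHS coding or the provided STATA code.

```python
# Hypothetical category labels treated as improved sanitation
IMPROVED = {"flush_to_sewer", "flush_to_septic_tank",
            "vip_latrine", "pit_latrine_with_slab"}


def clean_households(rows, required_fields):
    """Drop rows with missing values and add the binary outcome dv.

    dv = 0 for improved sanitation, 1 for unimproved sanitation.
    """
    cleaned = []
    for row in rows:
        # Equivalent in spirit to: drop if missing(...)
        if any(row.get(field) is None for field in required_fields):
            continue
        out = dict(row)
        out["dv"] = 0 if row["toilet_type"] in IMPROVED else 1
        cleaned.append(out)
    return cleaned


# Made-up records for illustration
households = [
    {"cluster": 1, "toilet_type": "vip_latrine", "family_size": 5},
    {"cluster": 1, "toilet_type": "open_pit", "family_size": 8},
    {"cluster": 2, "toilet_type": None, "family_size": 4},
]
for row in clean_households(households, ["toilet_type", "family_size"]):
    print(row["cluster"], row["toilet_type"], row["dv"])
```

Coding the unimproved category as 1 means the later regression models describe the odds of unimproved sanitation, matching the outcome definition above.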

5.2 Survey Design Specification in STATA

  1. Setting Survey Design: The code specifies the survey design using svyset, including cluster (HV001), strata (STRU_CD), and sampling weights (HV005), which are then used for weighted descriptive analysis to understand the distribution of the DV and IVs.
  2. Recoding IVs: The IVs are recoded to make them categorical variables for ease of interpretation.
  3. Bivariate Analysis: The code performs chi-squared tests using svy: tab to examine associations between the outcome variable and various predictors while accounting for survey design.
  4. Multicollinearity test: A test for multicollinearity is performed to check for correlation between predictor variables.
  5. Multivariable Logistic Regression: A multivariable logistic regression model is used to identify predictors of unimproved sanitation.

5.3 Multilevel Analysis in STATA

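  1. Empty Model: The code starts by fitting an empty model (melogit dv || HV001:) to estimate the community-level variance and obtain the ICC. In a multilevel logistic model the individual-level variance is not estimated; on the latent scale it is fixed at π²/3 (about 3.29).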
  2. Individual-Level Model: Individual-level factors are added to the model to assess their influence on unimproved sanitation, with estimation of ICC and model assessment statistics (AIC and BIC).
  3. Community-Level Model: Community-level factors are added to assess their influence.
  4. Full Model: Both individual and community-level factors are included in the final model, and model comparison criteria are used to assess model fit of the full model compared to its nested models.
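For a binary outcome, the ICC reported after each of these models is usually computed on the latent scale, with the individual-level variance fixed at π²/3. The Python sketch below shows that calculation. The community-level variances are invented placeholders, not results from the SDHS analysis.

```python
import math

# Variance of the standard logistic distribution (level-1, latent scale)
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def latent_icc(community_variance):
    """ICC for a two-level logistic model on the latent-variable scale."""
    if community_variance < 0:
        raise ValueError("variance cannot be negative")
    return community_variance / (community_variance + LOGISTIC_RESIDUAL_VARIANCE)


# Hypothetical community-level variances from the four models
hypothetical_variances = {
    "empty model": 2.5,
    "individual-level model": 2.1,
    "community-level model": 1.4,
    "full model": 1.2,
}
for model, variance in hypothetical_variances.items():
    print(f"{model}: ICC = {latent_icc(variance):.3f}")
```

A falling ICC across the models suggests that the added predictors explain part of the between-community variation.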

5.4 Integrating Spatial Component

  • Although the provided code does not explicitly use spatial information, we can add a spatial component at the community level.
  • By using the geographical identifiers (e.g., region), we can develop a spatial weights matrix and include spatial random effects, using CAR or SAR models to model spatial autocorrelation.
  • After incorporating spatial effects we can assess how the spatial effects compare to non-spatial models.
  • The spatial component can be used to map predicted probabilities, as we did in previous modules, in order to understand the geographical distribution of areas with poor sanitation and the clustering of these areas.

6. Aligning Multilevel Analysis to Spatial Analysis

Once you’ve established your multilevel model, here’s how to align it with spatial analysis:

  • Mapping random effects: Map the spatial random effects to visualize the spatial patterns of variation at the group level.
  • Spatial residuals analysis: Assess the spatial autocorrelation of the residuals at the group level, using diagnostic tests.
  • Spatial random effects: If spatial autocorrelation is detected in residuals, re-fit the multilevel model with a spatial random component.
  • Mapping of estimates: Map the predicted probabilities or the fitted values of the outcome variable at the group level to visualize the geographical distribution of the outcome across different regions.
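One widely used diagnostic for the residual check is Moran's I. The Python sketch below computes it for group-level residuals with the standard library only. The four-region layout and the residual values are hypothetical, and a real analysis would also need a significance test (for example by permutation), which is omitted here.

```python
def morans_i(values, W):
    """Moran's I for a list of values and a spatial weights matrix W."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    sum_sq = sum(d * d for d in deviations)
    weight_sum = sum(sum(row) for row in W)
    if sum_sq == 0 or weight_sum == 0:
        raise ValueError("values must vary and W must have non-zero weights")
    cross = sum(W[i][j] * deviations[i] * deviations[j]
                for i in range(n) for j in range(n))
    return (n / weight_sum) * (cross / sum_sq)


# Hypothetical chain of four regions (binary contiguity): 1 - 2 - 3 - 4
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [1.0, 1.0, -1.0, -1.0]     # similar residuals sit next to each other
alternating = [1.0, -1.0, 1.0, -1.0]   # neighbours always differ

print(morans_i(clustered, W))          # positive: spatial clustering
print(morans_i(alternating, W))        # negative: neighbours are dissimilar
```

Values well above the expected value under no autocorrelation, −1/(n − 1), point to spatial clustering left in the residuals and suggest re-fitting with a spatial random component.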

7. Conclusion

In this module, you have learned the fundamentals of multilevel models, how to incorporate spatial random effects, and the crucial considerations for using survey data. You have also learned how to use model comparison tools, how to check for spatial autocorrelation, and have seen an example of using STATA for multilevel modelling with survey data. You now have the skills to conduct complex statistical analyses on hierarchically structured spatial data in public health and related fields.

In Module V we will dive deeper into spatial mapping, data handling and manipulation of spatial data using different R packages.