Introduction
Welcome to Module III of our Spatial Statistics and Disease Mapping
course! In this module, we will explore regression analysis, a
fundamental statistical technique, and how it can be enhanced by
incorporating spatial considerations. We’ll start by reviewing basic
linear regression, then delve into spatial regression models, and
finally discuss how to apply these methods with survey data and
interpret their outputs. We will also cover different types of
regression based on the scale of measurement of the outcome variables,
and then how these can be used for spatial analysis.
1. Review of Linear Regression Models
1.1 What is Regression Analysis?
Regression analysis is a statistical method used to examine the
relationship between a dependent variable (also called the response or
outcome variable) and one or more independent variables (also called
predictor or explanatory variables). The goal is to model how the
dependent variable changes as the independent variable(s) change. In
essence, we’re looking for a mathematical equation that can best
describe this relationship.
1.2 The Linear Regression Model
The simplest form is linear regression, which assumes a linear
relationship between the dependent variable (y) and the independent
variable(s) (x):
y = β₀ + β₁x₁ + β₂x₂ + … + ε
Where:
- y is the dependent variable.
- x₁, x₂,… are independent variables.
- β₀ is the intercept (the value of y when all x’s
are zero).
- β₁, β₂,… are the regression coefficients
(each representing the expected change in y for a one-unit increase in
the corresponding x, holding the other predictors constant).
- ε is the error term (representing the variability
in y that is not explained by the model).
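To make the roles of β₀, β₁ and ε concrete, the following is a minimal sketch of ordinary least squares with a single predictor, written in plain Python. The function name `fit_simple_ols` and the data values are made up for illustration; in practice a statistical package would be used.

```python
# Minimal sketch: ordinary least squares for one predictor.
# The data values below are invented purely for illustration.

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx                      # estimate of beta_1
    intercept = mean_y - slope * mean_x    # estimate of beta_0
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)

# Residuals are the sample counterpart of the error term.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The residuals computed at the end are what the assumptions of linear regression are checked against.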
1.3 Assumptions of Linear Regression
For valid inference from a linear regression, certain assumptions must hold:
- Linearity: A linear relationship between the
independent and dependent variables.
- Independence: The error terms are independent of
each other.
- Homoscedasticity: The variance of the error terms
is constant across all values of the independent variables.
- Normality: The error terms are normally
distributed.
2. Types of Regression Based on Outcome Variable
Regression models can be classified based on the scale of measurement
of the outcome variable. This determines the specific type of model to
use:
2.1 Binary Logistic Regression
- Outcome Variable: Binary (dichotomous) outcome,
taking two values (e.g., 0 or 1, yes or no, diseased or not
diseased).
- Model: Logistic regression model using a logit link
function: log(p/(1-p)) = β₀ + β₁x₁ + β₂x₂ + … where p is the probability
of the event occurring.
- Application: used in spatial analysis to model the
geographical pattern of the presence or absence of an event (e.g., mapping
the probability of malaria infection based on survey and environmental
variables).
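The logit link can be illustrated with a short sketch that converts a linear predictor into a probability and recovers the odds ratio for a one-unit increase in x. The coefficient values are hypothetical and chosen only to show the arithmetic.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, for illustration only.
beta0 = -2.0   # log-odds of the event when x = 0
beta1 = 0.8    # change in log-odds per one-unit increase in x

p_at_0 = inverse_logit(beta0)
p_at_1 = inverse_logit(beta0 + beta1)

odds_at_0 = p_at_0 / (1 - p_at_0)
odds_at_1 = p_at_1 / (1 - p_at_1)

# The ratio of the two odds equals exp(beta1), whatever the value of beta0.
odds_ratio = odds_at_1 / odds_at_0
print(f"p(x=0) = {p_at_0:.3f}, p(x=1) = {p_at_1:.3f}, OR = {odds_ratio:.3f}")
```

This is why logistic regression coefficients are usually reported as odds ratios, exp(β), rather than on the log-odds scale.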
2.2 Ordinal Logistic Regression
- Outcome Variable: Ordinal outcome, with ordered
categories (e.g., low, medium, high; strongly agree, agree, neutral,
disagree, strongly disagree).
- Model: Proportional odds logistic regression model,
modelling probabilities of being at or below each category.
- Application: used to model the geographical pattern
of responses on ordinal scales (e.g., mapping the severity of disease
outcomes using survey data, the severity of individuals’ pain levels,
or attitudes towards policies).
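As a rough sketch of how the proportional odds model turns cumulative probabilities into category probabilities, the example below assumes the parameterisation logit P(Y ≤ j) = αⱼ − βx, which is one common convention (sign conventions differ between software packages). The cut-points and coefficient are hypothetical.

```python
import math

def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def category_probabilities(thresholds, beta, x):
    """Category probabilities under logit P(Y <= j) = alpha_j - beta * x.

    `thresholds` must be increasing; K - 1 thresholds give K categories.
    """
    # Cumulative probabilities P(Y <= j); the last category always reaches 1.
    cumulative = [inverse_logit(a - beta * x) for a in thresholds] + [1.0]
    # Differences of consecutive cumulative probabilities give P(Y = j).
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]
    return probs

thresholds = [-1.0, 1.0]   # hypothetical cut-points: low|medium, medium|high
beta = 0.5                 # hypothetical coefficient

probs_x0 = category_probabilities(thresholds, beta, x=0.0)
probs_x2 = category_probabilities(thresholds, beta, x=2.0)
print(probs_x0, probs_x2)
```

Under this convention a positive β shifts probability mass towards the higher categories as x increases, and the same β applies at every cut-point, which is the proportional odds assumption.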
2.3 Multinomial (Nominal) Logistic Regression
- Outcome Variable: Nominal (categorical) outcome,
with multiple unordered categories (e.g., marital status, mode of
transportation, region).
- Model: Multinomial logistic regression model, using
one category as the reference against which the probabilities of belonging
to each of the other categories are modelled.
- Application: used to model geographical patterns in
responses with nominal scales (e.g., mapping geographical distribution
of different types of disease or types of employment or agricultural
activity in a region).
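The role of the reference category can be sketched as follows: each non-reference category gets its own linear predictor, the reference category's predictor is fixed at zero, and the probabilities follow from normalising the exponentiated predictors. The function name and the predictor values are illustrative assumptions.

```python
import math

def multinomial_probabilities(linear_predictors):
    """Probabilities for K categories given K - 1 linear predictors.

    The reference category has its linear predictor fixed at 0 and is
    returned first in the list.
    """
    etas = [0.0] + list(linear_predictors)
    max_eta = max(etas)                          # subtract the max for numerical stability
    exps = [math.exp(e - max_eta) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear predictors for two non-reference categories.
probs = multinomial_probabilities([0.5, -1.0])

# Each linear predictor is the log-odds of that category versus the reference.
log_odds_vs_ref = [math.log(p / probs[0]) for p in probs[1:]]
print(probs, log_odds_vs_ref)
```

The recovered log-odds match the inputs, which shows that coefficients in this model are always interpreted relative to the chosen reference category.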
2.4 Count Regression
- Outcome Variable: Count data, representing the
number of occurrences of an event (e.g., number of hospital visits,
number of crimes, number of children).
- Model: Poisson regression is the usual starting point;
negative binomial regression is often used when the counts are
over-dispersed (variance greater than the mean).
- Application: used to model geographical variation
in count data (e.g., mapping number of disease cases per region from
aggregated survey data, analyzing traffic accidents in each
municipality).
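A quick, informal way to see whether counts might be over-dispersed is to compare the sample variance with the sample mean, since the two are equal under a Poisson distribution. The sketch below uses made-up counts and ignores covariates, so it is only a first look and not a formal test.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    return variance(counts) / mean(counts)

# Made-up case counts per region, for illustration only.
counts = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

index = dispersion_index(counts)
print(f"dispersion index = {index:.2f}")
```

A value well above 1 suggests that a negative binomial model may describe the counts better than a Poisson model, although the comparison should be repeated on the fitted model's residuals once covariates are included.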
2.5 Linear Regression (Continuous Outcome)
- Outcome Variable: Continuous variable, taking any
numerical value within a range (e.g., temperature, income, blood
pressure).
- Model: The standard linear regression model y = β₀
+ β₁x₁ + β₂x₂ + … + ε
- Application: used to model relationships between
geographical characteristics and continuous health indicators (e.g., the
relationship between altitude and mean blood pressure).
3. Spatial Regression Models
Spatial regression models extend linear regression by accounting for
spatial autocorrelation in the data. Ignoring spatial autocorrelation
violates the independence assumption and can lead to biased or
inefficient estimates, understated standard errors, and incorrect
inferences.
3.1 Spatial Lag Model
Concept: The spatial lag model assumes that the
value of the dependent variable at a location depends on the values of
the dependent variable at neighboring locations.
Equation: y = ρWy + β₀ + β₁x₁ + β₂x₂ + … + ε
Where:
- ρ is the spatial autoregressive coefficient that
measures the influence of neighboring values on the value of the outcome
variable.
- W is the spatial weights matrix, representing the
spatial relationships between locations.
Use Case: When the outcome in a location is
influenced by the outcome in neighboring locations.
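The term Wy can be illustrated with a small sketch. It assumes a binary contiguity matrix that is row-standardised (a common but not universal choice), so that Wy is the average outcome among each location's neighbors. The four-region layout and the outcome values are hypothetical.

```python
def row_standardize(adjacency):
    """Divide each row of a binary adjacency matrix by its row sum."""
    weights = []
    for row in adjacency:
        total = sum(row)
        weights.append([v / total if total else 0.0 for v in row])
    return weights

def spatial_lag(weights, values):
    """Compute Wy: for each location, the weighted average of its neighbors."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

# Four hypothetical regions arranged in a line: 0 - 1 - 2 - 3
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
y = [10.0, 20.0, 30.0, 40.0]

W = row_standardize(adjacency)
Wy = spatial_lag(W, y)
print(Wy)
```

In the spatial lag model, ρ multiplies this neighbor average, so a positive ρ means that locations tend to have higher outcomes when their neighbors do.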
3.2 Spatial Error Model
Concept: The spatial error model assumes that
spatial autocorrelation exists in the error term. In this case, the
errors at nearby locations are correlated because of unmeasured factors
that are themselves spatially patterned.
Equation: y = β₀ + β₁x₁ + β₂x₂ + … + ε, with ε = λWε + μ
Where:
- λ is the spatial autoregressive coefficient of the
error term.
- W is the spatial weights matrix, representing the
spatial relationships between locations.
- μ represents the error term in the spatial error
model which is not spatially autocorrelated.
Use Case: When there is a spatial pattern of
unmeasured factors affecting the outcome.
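To show how independent shocks μ become spatially correlated errors, the sketch below builds errors that satisfy ε = λWε + μ by simple fixed-point iteration. It assumes a row-standardised W and |λ| < 1, under which the iteration converges; the matrix, shocks and λ are hypothetical, and real software solves this system directly.

```python
def solve_spatial_errors(weights, mu, lam, iterations=200):
    """Iterate eps <- lam * W eps + mu until (approximate) convergence.

    Converges when |lam| < 1 and W is row-standardised.
    """
    eps = list(mu)
    for _ in range(iterations):
        eps = [lam * sum(w * e for w, e in zip(row, eps)) + m
               for row, m in zip(weights, mu)]
    return eps

# Row-standardised weights for four hypothetical regions in a line.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
mu = [1.0, -0.5, 0.3, -0.8]   # made-up independent shocks
lam = 0.6                     # hypothetical spatial error coefficient

eps = solve_spatial_errors(W, mu, lam)

# Check that the result satisfies eps = lam * W eps + mu.
check = [lam * sum(w * e for w, e in zip(row, eps)) + m for row, m in zip(W, mu)]
print(eps)
```

Each element of the resulting error vector mixes its own shock with shocks from neighboring regions, which is the spatial pattern of unmeasured factors the model is designed to absorb.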
3.3 Choosing Between Spatial Lag and Spatial Error Models
The choice between spatial lag and spatial error models depends on
the nature of spatial dependence in the data:
- The spatial lag model is suitable when the outcome at one location
directly influences the outcome at neighboring locations (for
example, through spillover or diffusion).
- The spatial error model is suitable when the spatial autocorrelation
is due to an underlying spatial process or unmeasured factors that
affect the outcome.
- Lagrange Multiplier (LM) tests are used to identify which model is
more appropriate for a particular dataset. These tests, computed from
the residuals of the non-spatial model, assess the significance of the
spatial lag and spatial error terms.
4. Application of Spatial Regression with Survey Data
Applying spatial regression with survey data requires careful
handling of complex survey designs and weights.
4.1 Integrating Survey Weights
- Inclusion of Weights: When doing regression
analysis with survey data, survey weights must be used to account for
unequal probabilities of selection. Not doing so may lead to biased
parameter estimates and invalid inferences.
- Weighted Regression: Standard regression procedures
can be modified to incorporate survey weights, so that each observation
contributes to the estimates in proportion to its weight.
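As a minimal sketch of what incorporating weights means for the point estimates, the function below fits a one-predictor regression in which each observation contributes in proportion to its survey weight. The function name and data are hypothetical, and the sketch gives coefficients only: valid standard errors additionally require design-based variance estimation from survey software.

```python
def fit_weighted_ols(x, y, w):
    """Weighted least squares for one predictor; returns (intercept, slope)."""
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up respondents: predictor, outcome, and survey weight.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 2.5, 5.0]
w = [100.0, 50.0, 50.0, 10.0]

weighted = fit_weighted_ols(x, y, w)
unweighted = fit_weighted_ols(x, y, [1.0] * len(x))
print("weighted:", weighted, "unweighted:", unweighted)
```

A weight of 2 has the same effect on the estimates as including that respondent twice, which is one way to read a survey weight: the number of population members the respondent represents.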
4.2 Aggregation to Areal Units
- Spatial aggregation: Individual-level data from
surveys needs to be aggregated to predefined areal units (e.g.,
administrative regions) for analysis with spatial regression
models.
- Weight Application in Aggregation: Survey weights
must be applied to obtain reliable area estimates when aggregating
individual survey data to areas for use in spatial regression and
disease mapping.
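The aggregation step can be sketched as a weighted proportion per area: the sum of weights among respondents with the outcome, divided by the sum of all weights in that area. The record layout `(area, outcome, weight)` and the values are assumptions made for illustration.

```python
from collections import defaultdict

def weighted_area_prevalence(records):
    """Weighted prevalence per area.

    `records` is an iterable of (area, outcome, weight) with outcome 0 or 1.
    """
    weighted_cases = defaultdict(float)
    weight_totals = defaultdict(float)
    for area, outcome, weight in records:
        weighted_cases[area] += weight * outcome
        weight_totals[area] += weight
    return {area: weighted_cases[area] / weight_totals[area]
            for area in weight_totals}

# Made-up survey records for two areas.
records = [
    ("A", 1, 120.0), ("A", 0, 80.0), ("A", 0, 200.0),
    ("B", 1, 50.0), ("B", 1, 150.0), ("B", 0, 100.0),
]

prevalence = weighted_area_prevalence(records)
print(prevalence)
```

The resulting area-level estimates are what would then be joined to the areal units and passed to a spatial regression or mapped directly.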
4.3 Addressing Complex Survey Designs
- Survey Specific Analysis: Design features such as
stratification, cluster sampling, and multi-stage sampling affect the
standard errors and must be declared in the analysis.
- Survey specific statistical methods: Use procedures
and packages written for complex survey designs (available in R and
Stata) rather than standard routines that assume simple random
sampling.
5. Interpretation and Model Assessment of Spatial Regression Outputs
Interpreting and assessing spatial regression outputs is crucial for
understanding the spatial relationships in your data:
5.1 Interpreting Regression Coefficients
- Sign and Magnitude: The sign and magnitude of each
coefficient indicate the direction and strength of the relationship
between that independent variable and the outcome.
- Spatial Parameters: The spatial autoregressive
parameters (ρ in spatial lag and λ in spatial error) indicate the
direction and strength of spatial dependence between neighboring
locations.
- Interpretation with survey weights: Make sure you
emphasize that your interpretations are based on population estimates
obtained using the survey weights.
5.2 Goodness-of-Fit Measures
- R-squared: Assess the proportion of variance in the
outcome variable that is explained by the model.
- AIC and BIC: Model selection criteria that balance
goodness-of-fit against model complexity; lower values indicate a
preferred model.
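Both criteria are simple functions of the maximised log-likelihood: AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of estimated parameters and n the number of observations. The sketch below compares two candidate models using hypothetical log-likelihood values, not results from any real fit.

```python
import math

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fitted models: log-likelihood and number of parameters.
candidates = {
    "ols": {"loglik": -250.0, "k": 3},
    "spatial_lag": {"loglik": -240.0, "k": 4},
}
n_obs = 100

scores = {
    name: (aic(m["loglik"], m["k"]), bic(m["loglik"], m["k"], n_obs))
    for name, m in candidates.items()
}

# Lower is better for both criteria.
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(scores, best_by_aic, best_by_bic)
```

Because BIC penalises each extra parameter by ln(n) rather than 2, it favours simpler models more strongly than AIC once n exceeds about 7.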
5.3 Diagnostics and Model Validation
- Residual Analysis: Check the model assumptions by
analyzing the spatial patterns of residuals.
- Spatial Autocorrelation in Residuals: Residuals
should be assessed for spatial autocorrelation to detect whether the
model fails to capture the spatial dependency in the outcome
variable.
- Spatial diagnostics tests: Use spatial diagnostic
tests to check the model adequacy and the appropriateness of the spatial
specification.
- Cross-Validation: Evaluate the model’s predictive
ability on new data to determine whether the model remains valid across
the geographical space.
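One widely used statistic for checking residuals for spatial autocorrelation is Moran's I. The sketch below computes it for two made-up residual patterns on four hypothetical regions arranged in a line, using a binary contiguity matrix. It returns the statistic only; judging significance would need a permutation or analytical test, which is not shown.

```python
def morans_i(values, weights):
    """Moran's I for `values` under the spatial weights matrix `weights`."""
    n = len(values)
    mean_v = sum(values) / n
    z = [v - mean_v for v in values]                 # deviations from the mean
    s0 = sum(sum(row) for row in weights)            # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    denom = sum(zi ** 2 for zi in z)
    return (n / s0) * cross / denom

# Binary contiguity for four hypothetical regions in a line: 0 - 1 - 2 - 3
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [1.0, 1.0, -1.0, -1.0]      # similar residuals sit next to each other
alternating = [1.0, -1.0, 1.0, -1.0]    # neighbors have opposite residuals

i_clustered = morans_i(clustered, W)
i_alternating = morans_i(alternating, W)
print(i_clustered, i_alternating)
```

Positive values indicate that similar residuals cluster in space, negative values indicate that neighbors tend to differ, and values near the expectation under no autocorrelation, −1/(n − 1), suggest the model has captured the spatial structure.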
5.4 Spatial Implications
- Spatial Interpretation: Discuss the spatial
patterns and spatial effects based on the model parameter estimates and
their statistical significance, using maps and other visualization
techniques.
6. Aligning Regression Models to Spatial Analysis
Once a regression model (regardless of the outcome type) has
identified significant factors, we can align these findings with spatial
analysis, particularly for areal and geostatistical data.
6.1 For Areal Data
- Spatial Autocorrelation of Residuals: After fitting
the regression model, check the residuals for spatial autocorrelation.
Spatial patterns in the residuals may indicate that a spatial regression
model (lag or error) is necessary to take spatial dependency into
account.
- Mapping Predictors: Map the significant predictors
identified by the regression model across the areal units to see spatial
patterns and relationships.
- Geographical visualization: Map predicted values or
the impact of key predictors to highlight geographical patterns of
health outcomes or policy outcomes.
- Spatial Regression: Use spatial lag or error models
when the spatial autocorrelation is significant.
6.2 For Geostatistical Data
- Spatial interpolation: Spatially interpolate the
significant predictors and the fitted values of the response variable
using Kriging and other spatial interpolation techniques for prediction
purposes.
- Spatial Regression: Develop regression models using
the predictor variables as covariates in a spatial model, predict the
values across the study region, and map the results to identify
geographical variations.
- Spatial Variation Maps: Map the spatial variation
in key regression coefficients across the spatial locations to assess if
the influence of independent variables varies spatially.
- Geographically weighted regression (GWR): Fit
geographically weighted regression to identify variations in the
relationship between variables over space.
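The idea behind GWR can be sketched as a separate weighted regression at each focal location, with weights that decay with distance. The example below assumes a Gaussian kernel with a fixed bandwidth and a single predictor; the coordinates, data and bandwidth are invented so that the true slope is 1 in the west and 3 in the east. Dedicated GWR software also handles bandwidth selection and inference, which are omitted here.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Weight that decays smoothly with distance from the focal location."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def local_fit(focal, coords, x, y, bandwidth):
    """Weighted one-predictor regression centred on `focal`; returns (intercept, slope)."""
    w = [gaussian_kernel(math.hypot(cx - focal[0], cy - focal[1]), bandwidth)
         for cx, cy in coords]
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Ten hypothetical sites along a west-east transect.
coords = [(float(s), 0.0) for s in range(10)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0]
# Made-up outcome: slope 1 for the five western sites, slope 3 for the eastern ones.
y = [(1.0 if s < 5 else 3.0) * xi for s, xi in enumerate(x)]

_, slope_west = local_fit(coords[0], coords, x, y, bandwidth=1.0)
_, slope_east = local_fit(coords[9], coords, x, y, bandwidth=1.0)
print(f"west slope = {slope_west:.3f}, east slope = {slope_east:.3f}")
```

Mapping the local slopes obtained this way at every site is what produces the spatial variation maps of regression coefficients.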
7. Conclusion
In this module, we have reviewed linear regression, covered regression
models for binary, ordinal, nominal, count, and continuous outcomes, and
introduced spatial regression models. We’ve discussed how to apply these
models with
survey data, incorporate survey weights, and interpret their outputs
with spatial applications. We have also learnt how to align results of
regression models with spatial analysis for both areal and
geostatistical data. By understanding these methods, you are now
equipped to conduct robust analyses of spatially referenced data,
accounting for spatial dependencies and the nuances of survey data.
In the next module, we will explore multilevel analysis and see how
spatial analysis concepts can be incorporated in hierarchical models.