Introduction

Welcome to Module III of our Spatial Statistics and Disease Mapping course! In this module, we will explore regression analysis, a fundamental statistical technique, and how it can be enhanced by incorporating spatial considerations. We’ll start by reviewing basic linear regression, then delve into spatial regression models, and finally discuss how to apply these methods with survey data and interpret their outputs. We will also cover different types of regression based on the scale of measurement of the outcome variables, and then how these can be used for spatial analysis.

1. Review of Linear Regression Models

1.1 What is Regression Analysis?

Regression analysis is a statistical method used to examine the relationship between a dependent variable (also called the response or outcome variable) and one or more independent variables (also called predictor or explanatory variables). The goal is to model how the dependent variable changes as the independent variable(s) change. In essence, we’re looking for a mathematical equation that can best describe this relationship.

1.2 The Linear Regression Model

The simplest form is linear regression, which assumes a linear relationship between the dependent variable (y) and the independent variable(s) (x):

y = β₀ + β₁x₁ + β₂x₂ + … + ε

Where:

  • y is the dependent variable.
  • x₁, x₂,… are independent variables.
  • β₀ is the intercept (the value of y when all x’s are zero).
  • β₁, β₂,… are the regression coefficients (each representing the expected change in y for a one-unit increase in the corresponding x, holding the other independent variables constant).
  • ε is the error term (representing the variability in y that is not explained by the model).
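As a minimal sketch of the model above, the following Python example simulates data with known coefficients and recovers them by ordinary least squares. The data, the coefficient values, and the variable names are illustrative assumptions, not part of the course material.

```python
import numpy as np

# Simulated data (illustrative values only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)        # error term
y = 1.0 + 2.0 * x1 - 0.5 * x2 + eps        # true intercept 1, slopes 2 and -0.5

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimates of the intercept and the two slopes.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
print(beta_hat)
```

The estimates should land close to the values used in the simulation, and the residuals are the sample counterpart of the error term ε.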

1.3 Assumptions of Linear Regression

For a valid linear regression, certain assumptions must hold:

  1. Linearity: A linear relationship between the independent and dependent variables.
  2. Independence: The error terms are independent of each other.
  3. Homoscedasticity: The variance of the error terms is constant across all values of the independent variables.
  4. Normality: The error terms are normally distributed.

2. Types of Regression Based on Outcome Variable

Regression models can be classified based on the scale of measurement of the outcome variable. This determines the specific type of model to use:

2.1 Binary Logistic Regression

  • Outcome Variable: Binary (dichotomous) outcome, taking two values (e.g., 0 or 1, yes or no, diseased or not diseased).
  • Model: Logistic regression model using a logit link function: log(p/(1-p)) = β₀ + β₁x₁ + β₂x₂ + … where p is the probability of the event occurring.
  • Application: Used in spatial analysis to model the geographical pattern of the presence or absence of an event (e.g., mapping the probability of malaria incidence based on survey and environmental variables).
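To make the logit link concrete, the short sketch below converts a linear predictor into a probability and checks that exp(β₁) is the odds ratio for a one-unit increase in x. The coefficient values are hypothetical and chosen only for illustration.

```python
import math

def inv_logit(eta: float) -> float:
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p: float) -> float:
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients: intercept and the effect of one covariate.
beta0, beta1 = -2.0, 0.8

p_x0 = inv_logit(beta0 + beta1 * 0)   # probability of the event when x = 0
p_x1 = inv_logit(beta0 + beta1 * 1)   # probability of the event when x = 1

# exp(beta1) is the odds ratio for a one-unit increase in x.
odds_ratio = math.exp(beta1)
print(p_x0, p_x1, odds_ratio, odds(p_x1) / odds(p_x0))
```

Note that the coefficient acts multiplicatively on the odds, not additively on the probability, which is why coefficients of a logistic model are usually reported as odds ratios.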

2.2 Ordinal Logistic Regression

  • Outcome Variable: Ordinal outcome, with ordered categories (e.g., low, medium, high; strongly agree, agree, neutral, disagree, strongly disagree).
  • Model: Proportional odds logistic regression model, modelling probabilities of being at or below each category.
  • Application: Used to model the geographical pattern of responses on ordinal scales (e.g., mapping the severity of disease outcomes using survey data). Non-spatial examples include the severity of an individual’s pain or attitudes towards policies.
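The proportional odds idea can be sketched as follows. The example assumes one common parameterisation, logit P(Y ≤ j) = αⱼ − η, where η is the linear predictor shared by all categories; sign conventions differ between software packages, and the cutpoints used here are hypothetical.

```python
import numpy as np

def category_probs(cutpoints, eta):
    """Category probabilities under a proportional odds model.

    Assumes logit P(Y <= j) = cutpoints[j] - eta, with the same eta
    (linear predictor) for every cumulative split.
    """
    cutpoints = np.asarray(cutpoints, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-(cutpoints - eta)))   # P(Y <= j) for the J-1 splits
    cum = np.concatenate([[0.0], cum, [1.0]])        # add the trivial bounds 0 and 1
    return np.diff(cum)                              # P(Y = j) for each category

# Hypothetical cutpoints for three ordered categories (low, medium, high).
probs_low_eta = category_probs([-1.0, 1.0], eta=0.0)
probs_high_eta = category_probs([-1.0, 1.0], eta=2.0)
print(probs_low_eta, probs_high_eta)
```

Under this sign convention, a larger linear predictor shifts probability towards the higher categories while the cutpoints stay fixed.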

2.3 Multinomial (Nominal) Logistic Regression

  • Outcome Variable: Nominal (categorical) outcome, with multiple unordered categories (e.g., marital status, mode of transportation, region).
  • Model: Multinomial logistic regression model, using one category as the reference to model the probabilities of belonging to each of the other categories.
  • Application: used to model geographical patterns in responses with nominal scales (e.g., mapping geographical distribution of different types of disease or types of employment or agricultural activity in a region).
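The role of the reference category can be illustrated with a small sketch. Each non-reference category has its own linear predictor, interpreted as the log-odds of that category relative to the reference; the numerical values below are hypothetical.

```python
import numpy as np

def multinomial_probs(etas):
    """Probabilities for K categories given K-1 linear predictors.

    The reference category (index 0) has its linear predictor fixed at 0.
    """
    etas = np.concatenate([[0.0], np.asarray(etas, dtype=float)])
    expo = np.exp(etas - etas.max())   # subtract the maximum for numerical stability
    return expo / expo.sum()

# Hypothetical linear predictors for categories 1 and 2 versus reference category 0.
probs = multinomial_probs([0.5, -1.0])
log_odds_vs_reference = np.log(probs[1:] / probs[0])
print(probs, log_odds_vs_reference)
```

The recovered log-odds equal the linear predictors that were supplied, which is exactly the sense in which the coefficients are interpreted relative to the reference category.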

2.4 Count Regression

  • Outcome Variable: Count data, representing the number of occurrences of an event (e.g., number of hospital visits, number of crimes, number of children).
  • Model: Poisson regression is the standard starting point; negative binomial regression is often used instead when the counts are over-dispersed (variance greater than the mean).
  • Application: used to model geographical variation in count data (e.g., mapping number of disease cases per region from aggregated survey data, analyzing traffic accidents in each municipality).
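A quick way to see what over-dispersion means is to compare the variance-to-mean ratio of simulated counts. The sketch below draws Poisson and negative binomial counts with the same mean; the mean and the dispersion parameter are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
mean_count = 5.0

# Poisson counts: variance equals the mean in theory.
poisson_counts = rng.poisson(lam=mean_count, size=5000)

# Negative binomial counts with the same mean but extra variance.
# With size parameter r, the variance is mean + mean**2 / r.
r = 2.0
nb_counts = rng.negative_binomial(n=r, p=r / (r + mean_count), size=5000)

def dispersion_ratio(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    return counts.var(ddof=1) / counts.mean()

print(dispersion_ratio(poisson_counts), dispersion_ratio(nb_counts))
```

A ratio well above 1 in observed counts is a signal to consider a negative binomial model rather than a Poisson model. In a fitted regression the same check is applied to residuals, since covariates explain part of the variation.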

2.5 Linear Regression (Continuous Outcome)

  • Outcome Variable: Continuous variable, taking any numerical value within a range (e.g., temperature, income, blood pressure).
  • Model: The standard linear regression model y = β₀ + β₁x₁ + β₂x₂ + … + ε
  • Application: used to model relationships between geographical characteristics and continuous health indicators (e.g., the relationship between altitude and mean blood pressure).

3. Spatial Regression Models

Spatial regression models extend linear regression by accounting for spatial autocorrelation in the data. Ignoring spatial autocorrelation violates the independence assumption of linear regression and can lead to biased or inefficient estimates and incorrect inferences.

3.1 Spatial Lag Model

  • Concept: The spatial lag model assumes that the value of the dependent variable at a location depends on the values of the dependent variable at neighboring locations.

  • Equation: y = ρWy + β₀ + β₁x₁ + β₂x₂ + … + ε

    Where:

    • ρ is the spatial autoregressive coefficient that measures the influence of neighboring values on the value of the outcome variable.
    • W is the spatial weights matrix, representing the spatial relationships between locations.
  • Use Case: When the outcome in a location is influenced by the outcome in neighboring locations.
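Because y appears on both sides of the spatial lag equation, data from this model are generated by solving (I − ρW)y = β₀ + β₁x₁ + … + ε. The sketch below does this for a toy layout in which locations sit on a ring and each has two neighbours; the layout, the parameter values, and the helper name ring_weights are assumptions made for illustration.

```python
import numpy as np

def ring_weights(n):
    """Row-standardised weights: each location's neighbours are the two adjacent ones on a ring."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W

rng = np.random.default_rng(2)
n, rho = 50, 0.6
W = ring_weights(n)

x1 = rng.normal(size=n)
eps = rng.normal(scale=0.3, size=n)
xb = 1.0 + 2.0 * x1                      # non-spatial part of the model

# Solve (I - rho * W) y = xb + eps for y.
y = np.linalg.solve(np.eye(n) - rho * W, xb + eps)

# W @ y is the spatial lag: the weighted average of the neighbours' outcomes.
spatial_lag = W @ y
print(np.corrcoef(y, spatial_lag)[0, 1])
```

Each row of W sums to one, so the spatial lag at a location is the average outcome of its neighbours and ρ scales how strongly that average feeds into the local outcome.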

3.2 Spatial Error Model

  • Concept: The spatial error model assumes that spatial autocorrelation exists in the error term. In this case, outcome values at nearby locations are correlated because of unmeasured factors that are themselves spatially patterned.

  • Equation: y = β₀ + β₁x₁ + β₂x₂ + … + ε, with ε = λWε + μ

    Where:

    • λ is the spatial autoregressive coefficient of the error term.
    • W is the spatial weights matrix, representing the spatial relationships between locations.
    • μ represents the error term in the spatial error model which is not spatially autocorrelated.
  • Use Case: When there is a spatial pattern of unmeasured factors affecting the outcome.
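In the spatial error model the dependence sits in the error term, so spatially correlated errors can be generated by solving (I − λW)ε = μ and then adding them to the usual linear predictor. The sketch below uses a regular grid with rook contiguity (cells sharing an edge are neighbours); the grid size, the parameter values, and the helper name rook_weights are illustrative assumptions.

```python
import numpy as np

def rook_weights(nrow, ncol):
    """Row-standardised rook-contiguity weights for an nrow x ncol grid."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    W[i, rr * ncol + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
W = rook_weights(6, 6)
n = W.shape[0]
lam = 0.7

mu = rng.normal(size=n)                            # independent error component
eps = np.linalg.solve(np.eye(n) - lam * W, mu)     # spatially correlated errors

x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + eps
print(np.corrcoef(eps, W @ eps)[0, 1])
```

Here the covariate effect is unchanged by the spatial structure; only the errors are correlated between neighbouring cells.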

3.3 Choosing Between Spatial Lag and Spatial Error Models

The choice between spatial lag and spatial error models depends on the nature of spatial dependence in the data:

  • The spatial lag model is suitable when the outcome at one location is directly influenced by outcomes at neighboring locations (for example, through diffusion or spillover).
  • The spatial error model is suitable when the spatial autocorrelation is due to an underlying spatial process or unmeasured factors that the covariates do not capture.
  • Lagrange Multiplier (LM) tests, computed from the residuals of an ordinary linear regression, are used to identify which model is more appropriate for a particular dataset. These tests assess the significance of spatial lag and spatial error dependence.
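The following is a rough sketch of the simple (non-robust) LM statistics, computed from ordinary least squares residuals and compared with a chi-squared distribution with one degree of freedom. It assumes a row-standardised weights matrix and uses simulated data on a ring of locations; the function names are hypothetical, and dedicated spatial econometrics software should be preferred in practice, since it also provides the robust versions of these tests.

```python
import math
import numpy as np

def ring_weights(n):
    """Row-standardised weights on a ring (two neighbours per location)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W

def lm_tests(y, X, W):
    """Return the (LM-lag, LM-error) statistics based on OLS residuals."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    sigma2 = e @ e / n                          # ML estimate of the error variance
    T = np.trace(W.T @ W + W @ W)
    lm_error = (e @ W @ e / sigma2) ** 2 / T
    WXb = W @ X @ beta
    proj, *_ = np.linalg.lstsq(X, WXb, rcond=None)
    M_WXb = WXb - X @ proj                      # part of WXb not explained by X
    lm_lag = (e @ W @ y / sigma2) ** 2 / (WXb @ M_WXb / sigma2 + T)
    return lm_lag, lm_error

def chi2_1_pvalue(stat):
    """Upper-tail p-value of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Simulated data with spatial lag dependence (illustrative values).
rng = np.random.default_rng(4)
n, rho = 100, 0.6
W = ring_weights(n)
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = np.linalg.solve(np.eye(n) - rho * W, 1.0 + 2.0 * x1 + rng.normal(scale=0.3, size=n))

lm_lag, lm_error = lm_tests(y, X, W)
print(lm_lag, chi2_1_pvalue(lm_lag), lm_error, chi2_1_pvalue(lm_error))
```

When both statistics are significant, as often happens, the simple tests cannot separate the two forms of dependence, which is the reason robust LM tests are usually consulted as well.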

4. Application of Spatial Regression with Survey Data

Applying spatial regression with survey data requires careful handling of complex survey designs and weights.

4.1 Integrating Survey Weights

  • Inclusion of Weights: When doing regression analysis with survey data, survey weights must be used to account for unequal probabilities of selection. Not doing so may lead to biased parameter estimates and invalid inferences.
  • Weighted Regression: Standard regression procedures can be modified to incorporate survey weights.
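As a minimal sketch of weighted regression, the example below computes weighted least squares point estimates, β = (XᵀDX)⁻¹XᵀDy with D the diagonal matrix of survey weights. The data and weights are simulated, and the function name is hypothetical. The sketch gives point estimates only: valid standard errors also require the survey design information discussed in Section 4.3.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Weighted least squares coefficients with observation weights w."""
    Xw = X * w[:, None]                    # each row of X multiplied by its weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

# Simulated survey-like data with integer weights (illustrative values).
rng = np.random.default_rng(5)
n = 120
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = 0.5 + 1.5 * x1 + rng.normal(scale=0.4, size=n)
w = rng.integers(1, 5, size=n).astype(float)   # each respondent represents 1 to 4 people

beta_weighted = weighted_least_squares(X, y, w)
beta_unweighted = weighted_least_squares(X, y, np.ones(n))
print(beta_weighted, beta_unweighted)
```

With integer weights, the weighted estimates coincide with an unweighted fit to a dataset in which each respondent is repeated as many times as the weight indicates, which is a useful way to think about what the weights do.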

4.2 Aggregation to Areal Units

  • Spatial aggregation: Individual-level data from surveys needs to be aggregated to predefined areal units (e.g., administrative regions) for analysis with spatial regression models.
  • Weight Application in Aggregation: Survey weights must be applied to obtain reliable area estimates when aggregating individual survey data to areas for use in spatial regression and disease mapping.
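The aggregation step can be sketched as a weighted proportion per area: the sum of the weights of respondents with the condition divided by the sum of all weights in that area. The tiny dataset below is made up for illustration.

```python
import numpy as np

# Hypothetical respondent-level data: area index, binary outcome, survey weight.
area = np.array([0, 0, 0, 1, 1, 2, 2, 2])
outcome = np.array([1, 0, 1, 0, 0, 1, 1, 0])
weight = np.array([2.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 2.0])

# Weighted prevalence per area.
weighted_cases = np.bincount(area, weights=weight * outcome)
weighted_total = np.bincount(area, weights=weight)
weighted_prev = weighted_cases / weighted_total

# Unweighted prevalence for comparison.
unweighted_prev = np.bincount(area, weights=outcome) / np.bincount(area)

print(weighted_prev)
print(unweighted_prev)
```

The weighted and unweighted estimates differ whenever respondents with and without the condition carry different weights, which is why the weights cannot be dropped at the aggregation stage.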

4.3 Addressing Complex Survey Designs

  • Survey Design Features: Stratification, cluster sampling, and multi-stage sampling affect the standard errors of the estimates and must be declared in the analysis.
  • Survey-Specific Software: Use procedures designed for complex survey designs (e.g., survey-specific packages in R or the corresponding procedures in Stata) rather than standard regression routines.

5. Interpretation and Model Assessment of Spatial Regression Outputs

Interpreting and assessing spatial regression outputs is crucial for understanding the spatial relationships in your data:

5.1 Interpreting Regression Coefficients

  • Sign and Magnitude: The sign of a coefficient indicates the direction of the relationship between an independent variable and the outcome, and its magnitude indicates the strength of that relationship.
  • Spatial Parameters: The spatial autoregressive parameters (ρ in the spatial lag model and λ in the spatial error model) indicate the direction and strength of spatial dependence; positive values mean that neighboring locations tend to have similar values.
  • Interpretation with survey weights: State clearly that your interpretations are based on population estimates obtained using the survey weights.

5.2 Goodness-of-Fit Measures

  • R-squared: Assess the proportion of variance in the outcome variable that is explained by the model.
  • AIC and BIC: Model selection criteria that balance goodness-of-fit against model complexity; among models fitted to the same data, lower values are preferred.
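For a linear model with normally distributed errors, these measures can be computed directly from the residual sum of squares. The sketch below assumes the Gaussian log-likelihood evaluated at the maximum likelihood estimates and counts the error variance as a parameter; the data are simulated and the function name is hypothetical.

```python
import numpy as np

def fit_summary(y, X):
    """R-squared, AIC and BIC for a Gaussian linear model fitted by least squares."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    sigma2 = rss / n                                         # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    k = p + 1                                                # coefficients plus the error variance
    return {
        "r2": 1.0 - rss / tss,
        "aic": 2 * k - 2 * loglik,
        "bic": k * np.log(n) - 2 * loglik,
    }

rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

null_model = fit_summary(y, np.ones((n, 1)))                    # intercept only
full_model = fit_summary(y, np.column_stack([np.ones(n), x1]))  # intercept and x1
print(null_model, full_model)
```

The same comparison applies to spatial models fitted by maximum likelihood, where the log-likelihood reported by the software replaces the Gaussian formula used here.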

5.3 Diagnostics and Model Validation

  • Residual Analysis: Check the model assumptions by analyzing the spatial patterns of residuals.
  • Spatial Autocorrelation in Residuals: Residuals should be assessed for spatial autocorrelation to detect whether the model fails to capture the spatial dependency in the outcome variable.
  • Spatial diagnostic tests: Use spatial diagnostic tests to check model adequacy and the appropriateness of the spatial specification.
  • Cross-Validation: Evaluate the model’s predictive ability on new data and determine whether the model remains valid across the geographical space.
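One common check for spatial autocorrelation in residuals is Moran's I with a permutation test. The sketch below is a simplified version that assumes a ring of locations with row-standardised weights and uses two artificial patterns in place of real residuals; the function names are hypothetical.

```python
import numpy as np

def ring_weights(n):
    """Row-standardised weights on a ring (two neighbours per location)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W

def morans_i(values, W):
    """Moran's I: (n / sum of weights) * (z'Wz) / (z'z), with z the centred values."""
    z = values - values.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_pvalue(values, W, n_perm=999, seed=0):
    """One-sided p-value for positive autocorrelation by random relabelling."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, W)
    sims = np.array([morans_i(rng.permutation(values), W) for _ in range(n_perm)])
    return (np.sum(sims >= observed) + 1) / (n_perm + 1)

n = 20
W = ring_weights(n)
smooth = np.sin(2 * np.pi * np.arange(n) / n)      # neighbours have similar values
alternating = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)   # neighbours have opposite values

i_smooth = morans_i(smooth, W)
i_alternating = morans_i(alternating, W)
p_smooth = permutation_pvalue(smooth, W)
print(i_smooth, i_alternating, p_smooth)
```

Applied to model residuals, a significantly positive Moran's I suggests that the model has not captured the spatial dependence in the outcome.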

5.4 Spatial Implications

  • Spatial Interpretation: Discuss the spatial patterns and spatial effects based on the model parameter estimates and their statistical significance, using maps and other visualization techniques.

6. Aligning Regression Models to Spatial Analysis

Once a regression model (regardless of the outcome type) has identified significant factors, we can align these findings with spatial analysis, particularly for areal and geostatistical data.

6.1 For Areal Data

  • Spatial Autocorrelation of Residuals: After fitting the regression model, check the residuals for spatial autocorrelation. Spatial patterns in the residuals may indicate that a spatial regression model (lag or error) is necessary to take spatial dependency into account.
  • Mapping Predictors: Map the significant predictors identified by the regression model across the areal units to see spatial patterns and relationships.
  • Geographical visualization: Map predicted values or the impact of key predictors to highlight the geographical pattern of health or policy outcomes.
  • Spatial Regression: Use spatial lag or error models when the spatial autocorrelation is significant.

6.2 For Geostatistical Data

  • Spatial interpolation: Spatially interpolate the significant predictors and the fitted values of the response variable using Kriging and other spatial interpolation techniques for prediction purposes.
  • Spatial Regression: Develop regression models that use the predictor variables as covariates in a spatial model, predict values across the study region, and map the results to identify geographical variation.
  • Spatial Variation Maps: Map the spatial variation in key regression coefficients across the spatial locations to assess if the influence of independent variables varies spatially.
  • Geographically weighted regression (GWR): Fit geographically weighted regression to identify variations in the relationship between variables over space.
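The idea behind geographically weighted regression can be sketched as a separate weighted least squares fit at each location, with weights that decay with distance. The example below assumes a Gaussian kernel with a fixed bandwidth and simulated data in which the slope increases from west to east; the bandwidth, the data, and the function name are illustrative assumptions, and dedicated GWR software additionally handles bandwidth selection and inference.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Local regression coefficients at every location (Gaussian kernel weights)."""
    n, p = X.shape
    betas = np.empty((n, p))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)    # distances to location i
        w = np.exp(-0.5 * (d / bandwidth) ** 2)           # nearby points get more weight
        Xw = X * w[:, None]
        betas[i] = np.linalg.solve(X.T @ Xw, Xw.T @ y)    # local weighted least squares
    return betas

# Simulated point data in the unit square (illustrative values).
rng = np.random.default_rng(7)
n = 300
coords = rng.uniform(size=(n, 2))
x1 = rng.normal(size=n)
true_slope = 1.0 + 2.0 * coords[:, 0]                     # slope grows with the east coordinate
y = 0.5 + true_slope * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1])
betas = gwr_coefficients(coords, X, y, bandwidth=0.15)
local_slopes = betas[:, 1]
print(np.corrcoef(local_slopes, true_slope)[0, 1])
```

Mapping the local slopes would show the west-to-east gradient that a single global regression coefficient averages away.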

7. Conclusion

In this module, we have reviewed linear regression and introduced spatial regression models, covering binary, ordinal, nominal, count, and continuous outcomes. We’ve discussed how to apply these models with survey data, incorporate survey weights, and interpret their outputs with spatial applications. We have also learnt how to align results of regression models with spatial analysis for both areal and geostatistical data. By understanding these methods, you are now equipped to conduct robust analyses of spatially referenced data, accounting for spatial dependencies and the nuances of survey data.

In the next module, we will explore multilevel analysis and see how spatial analysis concepts can be incorporated in hierarchical models.