Module I Introduction to Spatial Data Analysis

# Load essential packages here to avoid clutter later
library(sf)

## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

library(sp)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(mapview)
library(tmap)

## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')

library(geosphere) 
library(spData)

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

library(terra)

## terra 1.7.71

library(spatstat)

## Loading required package: spatstat.data

## Loading required package: spatstat.geom

## spatstat.geom 3.2-9

## 
## Attaching package: 'spatstat.geom'

## The following objects are masked from 'package:terra':
## 
##     area, delaunay, is.empty, rescale, rotate, shift, where.max,
##     where.min

## The following object is masked from 'package:geosphere':
## 
##     perimeter

## Loading required package: spatstat.random

## spatstat.random 3.2-3

## Loading required package: spatstat.explore

## Loading required package: nlme

## 
## Attaching package: 'nlme'

## The following object is masked from 'package:dplyr':
## 
##     collapse

## spatstat.explore 3.2-7

## Loading required package: spatstat.model

## Loading required package: rpart

## spatstat.model 3.2-11

## Loading required package: spatstat.linnet

## spatstat.linnet 3.1-5

## 
## spatstat 3.0-8 
## For an introduction to spatstat, type 'beginner'

library(SpatialEpi)
# Note: rgdal was retired in October 2023 and has been removed from CRAN;
# sf and terra (loaded above) cover its GDAL and PROJ functionality.

Module Overview

Welcome to Module I of “Spatial Analysis and Disease Mapping.” This module lays the foundation for understanding spatial data, spatial concepts, and how we can leverage these to investigate health-related phenomena. We will introduce the principles of spatial analysis, explore different types of spatial data, and demonstrate how these data types are relevant to disease mapping and related public health investigations. Throughout the module, we will use R as our primary tool for data manipulation, visualization, and analysis.

Learning Objectives

Upon completion of this module, students will be able to:

Define spatial analysis and explain its importance in health research.
Distinguish between different types of spatial data, including areal, point pattern, and geostatistical data, and understand their properties.
Identify appropriate spatial data types for specific research questions related to disease mapping.
Load, manipulate, and visualize spatial data using R.
Understand the differences between spatial data types and their implications for spatial analysis techniques.
Explain spatial autocorrelation and its importance in disease mapping.

1. Introduction to Spatial Analysis

1.1 What is Spatial Analysis?

Spatial analysis is a powerful set of analytical techniques used to study spatial phenomena. It involves the application of various methods to examine spatial data, looking for patterns, relationships, and variations that may be linked to underlying spatial processes. In essence, spatial analysis goes beyond traditional statistical analyses by explicitly considering the spatial context of data. This is particularly important in disease mapping where geographic location is a crucial determinant of disease occurrence and spread.

1.2 Why is Spatial Analysis Important in Health Research?

Disease Mapping: Spatial analysis enables the visualization and analysis of disease distributions to identify hotspots, clusters, and areas of high risk. This aids in targeting interventions effectively.
Understanding Risk Factors: Spatial patterns of disease can reveal potential environmental, socio-economic, and behavioral risk factors that may not be apparent through non-spatial analyses.
Resource Allocation: By identifying areas with the highest need, spatial analysis helps optimize the allocation of health resources like clinics, personnel, and medication.
Epidemiological Investigations: Spatial analysis is vital for understanding disease transmission, particularly for infectious diseases that spread through spatial proximity.
Public Health Planning: The results of spatial analysis inform public health planning, intervention strategies, and policy development to prevent and control diseases.

1.3 Spatial Concepts: Basic Ideas

Before proceeding to data types, let us review some of the key concepts that underpin spatial analysis:

Spatial Autocorrelation: The tendency for values at nearby locations to be more similar than values that are far apart. This principle, often described as “Tobler’s First Law of Geography”, is foundational in spatial analysis.
Spatial Dependence: The assumption that the observation at a location is influenced by the observations in the neighboring locations.
Spatial Heterogeneity: Variation in data values, or in the relationships between variables, from one part of the study area to another.
Distance: The amount of separation between two points in space. Different metrics can be employed to measure distance (e.g., Euclidean distance, Manhattan distance, geodesic distance).
Neighborhood: The area around a specific location within which other locations are considered to be its neighbors.
Spatial Weights Matrix: A matrix that quantifies the relationship between every pair of locations, typically assigning larger weights to closer locations and smaller (or zero) weights to distant ones.
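To make the distance, weights, and autocorrelation concepts above concrete, here is a minimal base R sketch. The five locations, the rates, the approximate city coordinates, and the helper name `haversine_km` are all made up for illustration. Inverse-distance weighting is only one of several common ways to build a weights matrix, and the statistic computed at the end is the standard global Moran's I.

```r
# Five hypothetical locations on a plane (arbitrary units) with made-up disease rates
pts  <- data.frame(x = c(0, 1, 2, 5, 6), y = c(0, 1, 0, 5, 6))
rate <- c(10, 12, 11, 30, 28)

# Euclidean (straight-line) and Manhattan (city-block) distance matrices
d_euc <- as.matrix(dist(pts, method = "euclidean"))
d_man <- as.matrix(dist(pts, method = "manhattan"))

# Geodesic (great-circle) distance via the haversine formula, assuming a
# spherical Earth of radius 6371 km; haversine_km is a hypothetical helper
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}
# Approximate coordinates for Nairobi and Mombasa (illustrative only)
d_geo <- haversine_km(36.82, -1.29, 39.67, -4.04)

# Spatial weights matrix: inverse Euclidean distance, no self-neighbours,
# row-standardised so that each row sums to 1
w <- 1 / d_euc
diag(w) <- 0
w <- w / rowSums(w)

# Global Moran's I: positive values indicate that nearby locations
# tend to have similar rates
z <- rate - mean(rate)
n <- length(rate)
morans_i <- (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
morans_i
```

Because the two groups of nearby points have similar rates within each group, Moran's I comes out positive here. In practice, a dedicated package would normally be used to build neighbour lists and test the statistic for significance.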

2. Types of Spatial Data

Spatial data comes in different forms, each with its own characteristics and analytical applications. We will cover three major types relevant to health mapping:

2.1 Areal Data (Lattice Data)

2.1.1 Definition

Areal data represents spatial information aggregated within predefined areas. These areas can be administrative boundaries (e.g., districts, counties, regions), census tracts, or any other defined spatial units. The data associated with these areas often represents summaries of events, characteristics, or attributes (e.g., counts, rates, averages).

2.1.2 Characteristics

Fixed, finite number of spatial units
Units may be irregular (e.g., districts) or regular (e.g., grid cells)
Variables are aggregated or summarized within each unit

2.1.3 Examples from Health Research

Disease rates by district: Number of malaria cases per 1000 people in each district of a country.
Vaccination coverage by province: Percentage of children vaccinated in each province.
Prevalence of stunting by region: The proportion of children under five suffering from stunting in each region, based on a Demographic and Health Survey (DHS).
Hospital admission rates by postcode areas: The number of hospital admissions due to respiratory infections per 1000 population in each postcode area in a city.
Access to improved sanitation by municipality: Percentage of households with access to improved sanitation in each municipality or town, from a population-based survey such as the DHS.
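The rates in the examples above are simple arithmetic on area-level counts. The following base R sketch, using four hypothetical districts with made-up case counts and populations, computes a crude rate per 1000 and a crude observed-to-expected ratio. Note that this ratio is not age-adjusted, so it is only a rough stand-in for the standardized ratios used in formal disease mapping.

```r
# Hypothetical district-level counts (all values are made up)
districts <- data.frame(
  district   = c("A", "B", "C", "D"),
  cases      = c(120, 45, 300, 80),
  population = c(40000, 30000, 60000, 50000)
)

# Crude rate per 1000 population
districts$rate_per_1000 <- 1000 * districts$cases / districts$population

# Expected cases if every district experienced the overall rate
overall_rate <- sum(districts$cases) / sum(districts$population)
districts$expected <- overall_rate * districts$population

# Observed / expected ratio: values above 1 flag districts with
# more cases than their population size alone would suggest
districts$smr <- districts$cases / districts$expected
districts
```

Ratios from small populations are unstable, which is one reason later disease mapping methods smooth area-level estimates.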

2.1.4 R Demonstration: Areal Data with DHS data

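Let us generate a map of the prevalence of unmet need for family planning among women in Kenya, an indicator reported in the 2022 Kenya DHS. Because the DHS spatial files must be downloaded separately, the code uses simulated values in place of the real survey estimates.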

# Load the DHS data. You would need to download the spatial data from the DHS website
# and prepare it as a sf object for use.
# Here, I will be using a fake dataset to simulate the real situation.
kenya_map <- st_read("path/to/your/kenya_admin_shapefile.shp", quiet=TRUE)

# Create fake data to simulate DHS indicators (you can replace this with actual data)
set.seed(123)
kenya_data <- data.frame(
  district = unique(kenya_map$admin_name),
  unmet_need_percentage = sample(10:40, length(unique(kenya_map$admin_name)), replace=TRUE),
  anc_coverage = sample(20:90, length(unique(kenya_map$admin_name)), replace = TRUE)
)


# Merge the map data with our simulated data based on the common field (e.g., district or county name)
kenya_map_merged <- left_join(kenya_map, kenya_data, by = c("admin_name" = "district"))


# Interactive visualization of the map
# (mapview has no `main` argument; `layer.name` sets the legend title instead)
mapview(kenya_map_merged, zcol = "unmet_need_percentage",
        layer.name = "Unmet need for family planning (%)",
        map.types = "CartoDB.Positron")

# Basic static plot of the map
ggplot(kenya_map_merged) +
  geom_sf(aes(fill = unmet_need_percentage)) +
  labs(title = "Unmet Need for Family Planning in Kenya",
       fill = "Unmet need (%)") +
  theme_minimal()
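Because the Kenya shapefile path above is a placeholder, the same workflow can be tried on data that is available immediately. The sketch below assumes the sf and ggplot2 packages are installed and uses the North Carolina counties shapefile that ships with sf; the column name `sids_rate` is introduced here for illustration.

```r
library(sf)
library(ggplot2)

# North Carolina counties shapefile bundled with the sf package
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Areal rate: sudden infant deaths per 1000 live births, 1974-78
# (SID74 and BIR74 are columns in the bundled dataset)
nc$sids_rate <- 1000 * nc$SID74 / nc$BIR74

# Choropleth map of the county-level rate
p <- ggplot(nc) +
  geom_sf(aes(fill = sids_rate)) +
  labs(title = "SIDS Rate in North Carolina Counties, 1974-78",
       fill = "Deaths per 1000 births") +
  theme_minimal()
p
```

The steps mirror the Kenya example: read the polygons, attach an area-level indicator, and map it with `geom_sf()`.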

2.1.5 Research Alignment

Disease Surveillance: Areal data can be used to map disease prevalence by geographical regions to identify hotspots of disease transmission and target prevention and control programs.
Health Resource Allocation: Spatial analysis of areal data enables effective allocation of resources based on population and health needs such as vaccination coverage or number of health facilities needed.
Health Equity: Socioeconomic characteristics of areas can be mapped alongside health outcomes to identify regions where ill-health is concentrated and where resources are inequitably distributed.

2.2 Point Patterns

2.2.1 Definition

Point pattern data represent individual locations or events as points in space. Each point represents a specific occurrence or observation, without reference to administrative units. The analysis focuses on the locations and spatial arrangement of the points themselves rather than on values aggregated across areas.

2.2.2 Characteristics

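Event locations are themselves the outcome of a random process
Events are recorded individually, not aggregated
Points can be marked with additional attributes (e.g., age or disease status)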

2.2.3 Examples in Health Research

Location of disease cases: Spatial coordinates of individual patients diagnosed with a particular disease, for example the residences of people diagnosed with HIV.
Birth locations: The geographic coordinates of birth locations to study the spatial distribution of birth outcomes like low birth weight.
Locations of healthcare facilities: Spatial data on the geographic coordinates of hospitals, clinics, and pharmacies for planning health facility access.
Residential address of individual cases of mental health problems: Point locations of the residences of patients seeking help for mental health problems can help identify areas with higher needs for mental health services.

2.2.4 R Demonstration: Point Patterns using a simulated dataset

# Simulate point pattern of disease cases
set.seed(456)
n_cases <- 150
x_coords <- runif(n_cases, min = 0, max = 100)
y_coords <- runif(n_cases, min = 0, max = 100)
disease_points <- data.frame(x = x_coords, y = y_coords)

# Convert to an sp SpatialPoints object
coordinates(disease_points) <- ~ x + y
proj4string(disease_points) <- CRS("+proj=utm +zone=37 +datum=WGS84")

# Plot the case locations (coordinates are projected, so the axes show
# eastings and northings rather than longitude and latitude)
plot(disease_points, pch = 16, main = "Simulated Disease Cases", axes = TRUE)

# Let's generate an intensity map for the data
# Define the observation window as the 100 x 100 study square
window <- owin(xrange = c(0, 100), yrange = c(0, 100))

# Build a spatstat point pattern (ppp) object directly from the coordinates
points_pp <- ppp(x = x_coords, y = y_coords, window = window)

plot(density(points_pp), main = "Kernel Density of Simulated Disease Cases")
plot(points_pp, add = TRUE, pch = 16, cex = 0.4)
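A simple formal check of whether a pattern like the simulated one is compatible with complete spatial randomness is a quadrat count test. The base R sketch below re-simulates the same 150 points, counts them in a 5 x 5 grid of quadrats, and compares the counts with the uniform expectation using a chi-square statistic. The grid size is an arbitrary choice for illustration, and spatstat offers a fuller implementation of this kind of test.

```r
# Re-create the simulated case locations
set.seed(456)
n_cases <- 150
x <- runif(n_cases, min = 0, max = 100)
y <- runif(n_cases, min = 0, max = 100)

# Divide the 100 x 100 study square into 5 x 5 quadrats of 20 x 20 units
breaks <- seq(0, 100, by = 20)
counts <- table(cut(x, breaks, include.lowest = TRUE),
                cut(y, breaks, include.lowest = TRUE))

# Under complete spatial randomness every quadrat has the same expected count
expected <- n_cases / length(counts)

# Chi-square statistic; large values suggest clustering or uneven intensity
chi_sq <- sum((counts - expected)^2 / expected)
df <- length(counts) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

Since these points were drawn uniformly at random, a small p-value would be a chance result rather than evidence of real clustering.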

2.2.5 Research Alignment

Disease Clustering: Detecting spatial clusters of disease cases to identify potential sources of outbreaks or areas requiring targeted interventions.
Accessibility: Mapping locations of healthcare facilities relative to the population to analyze spatial accessibility and plan health service delivery.
Spatial Epidemiology: Investigating environmental risk factors by examining the spatial association between health outcomes and environmental exposure.
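The accessibility idea above can be sketched in base R by computing, for each household, the straight-line distance to its nearest health facility. The household and facility coordinates are simulated or made up, and the 25-unit catchment threshold is an arbitrary assumption. Real accessibility studies often use travel time along road networks rather than straight-line distance.

```r
# Simulated household locations and three hypothetical facility locations
set.seed(1)
households <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100))
facilities <- data.frame(x = c(20, 50, 80), y = c(30, 70, 40))

# 200 x 3 matrix of Euclidean distances from each household to each facility
dx <- outer(households$x, facilities$x, "-")
dy <- outer(households$y, facilities$y, "-")
d  <- sqrt(dx^2 + dy^2)

# Index of, and distance to, the nearest facility for every household
households$nearest_facility <- apply(d, 1, which.min)
households$dist_to_nearest  <- apply(d, 1, min)

# Share of households within an assumed 25-unit catchment radius
prop_within_25 <- mean(households$dist_to_nearest <= 25)
prop_within_25
```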

2.3 Geostatistical Data

2.3.1 Definition

Geostatistical data consist of values of a variable measured at a finite set of locations within a continuous spatial field. Because the underlying variable exists everywhere in the study area, values at unobserved locations can be predicted by interpolation from the known samples.

2.3.2 Characteristics

Continuous spatial variable
Values observed at different locations within the study area
Often requires interpolation to predict values at unobserved locations

2.3.3 Examples in Health Research

Air pollution levels: Continuous measures of air pollutants (e.g., PM2.5, ozone) from monitoring stations spread across a region.
Temperature and humidity: Meteorological data from weather stations used to understand spatial variation in environmental variables impacting vector-borne diseases.
Soil contamination levels: Data on the levels of toxic substances in soil samples across a region impacting health outcomes.
Groundwater quality: Data on the concentration of heavy metals in water samples from different boreholes.
Nutrient levels in soils: Continuous data on the nutrient levels in agricultural land in different locations across a region.

2.3.4 R Demonstration: Geostatistical data using simulated data

# Simulate geostatistical data
set.seed(789)
n_stations <- 50
x_coords <- runif(n_stations, min = 0, max = 100)
y_coords <- runif(n_stations, min = 0, max = 100)
# Simulate temperature as a smooth spatial trend plus independent Gaussian noise
temp <- 10 + 0.01 * (x_coords^2 + y_coords^2) + rnorm(n_stations, 0, 3)
temp_data <- data.frame(x = x_coords, y = y_coords, temperature = temp)

# Convert to an sf point object
utm_crs <- "+proj=utm +zone=37 +datum=WGS84"
temp_points <- st_as_sf(temp_data, coords = c("x", "y"), crs = utm_crs)

# Generate raster map
# The stations are irregularly spaced, so they cannot be treated as a grid directly.
# Instead, average the station values falling in each cell of a coarse raster;
# cells without any station remain NA.
grid_template <- terra::rast(xmin = 0, xmax = 100, ymin = 0, ymax = 100,
                             resolution = 20, crs = utm_crs)
raster_temp <- terra::rasterize(terra::vect(temp_points), grid_template,
                                field = "temperature", fun = mean)

plot(raster_temp, main = "Simulated Temperature Data")

# Plot the points
points(temp_data$x, temp_data$y, pch = 16, col = "black", cex = 0.5)
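Predicting values at unobserved locations can be sketched with inverse distance weighting (IDW), a simple deterministic interpolator used here in place of a full geostatistical method such as kriging. The helper name `idw_predict`, the choice of `power = 2`, and the 10 x 10 prediction grid are illustrative assumptions. The station data are re-simulated so the block runs on its own.

```r
# Re-create the simulated station data
set.seed(789)
n_stations <- 50
x_coords <- runif(n_stations, min = 0, max = 100)
y_coords <- runif(n_stations, min = 0, max = 100)
temp <- 10 + 0.01 * (x_coords^2 + y_coords^2) + rnorm(n_stations, 0, 3)

# IDW prediction at one target location (x0, y0): a weighted average of the
# observed values, with weights proportional to 1 / distance^power
idw_predict <- function(x0, y0, x, y, z, power = 2) {
  d <- sqrt((x - x0)^2 + (y - y0)^2)
  if (any(d == 0)) return(mean(z[d == 0]))  # target coincides with a station
  w <- 1 / d^power
  sum(w * z) / sum(w)
}

# Predict on a regular grid of cell centres covering the study square
grid <- expand.grid(x = seq(5, 95, by = 10), y = seq(5, 95, by = 10))
grid$temperature <- mapply(idw_predict, grid$x, grid$y,
                           MoreArgs = list(x = x_coords, y = y_coords, z = temp))

# Display the interpolated surface with the stations overlaid
pred_matrix <- matrix(grid$temperature, nrow = 10, ncol = 10)
image(seq(5, 95, by = 10), seq(5, 95, by = 10), pred_matrix,
      xlab = "x", ylab = "y", main = "IDW-Interpolated Temperature")
points(x_coords, y_coords, pch = 16, cex = 0.5)
```

IDW predictions always lie between the smallest and largest observed values and come with no measure of uncertainty, which is one motivation for model-based interpolation.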

2.3.5 Research Alignment

Environmental Exposure Mapping: Estimating and mapping the spatial distribution of environmental risk factors to assess exposure and predict health risks.
Disease Risk Assessment: Combining spatial data on the prevalence of disease with environmental risk factors to identify areas at higher risk of disease incidence.
Spatial variation of disease risk: Analyzing the spatial variation of a disease based on environmental risk factors at different locations within the study area.

3. Conclusion

In this module, we introduced the basic principles of spatial analysis and distinguished between different types of spatial data (areal, point patterns, and geostatistical). You learned to visualize these types of data in R. Understanding spatial data types is crucial in our disease mapping and spatial analysis journey, as different data types require unique analytical approaches. We will build upon this foundation in subsequent modules, moving into more advanced methods of spatial analysis and their application to disease mapping and public health.

4. Next Steps

In the next module, we will learn about data wrangling and manipulation, and how to join and integrate different spatial data types so that we can model more complex datasets.