Introduction

The present document is being prepared as a companion handbook to Hands-on workshop on Spatial Epidemiology using R during 51st Annual National Conference of Indian Association of Preventive and Social Medicine (IAPSMCON 2024).

Why learn spatial epidemiology?

Spatial epidemiology is the description and analysis of geographically indexed health data with respect to demographic, environmental, behavioral, socio-economic, genetic, and infectious risk factors. Spatial Epidemiology concerns the analysis of the spatial/geographical distribution of the incidence of disease. The rational of this workshop is to equip public health professionals with spatial epidemiology skills for disease analysis and risk assessment.

The spatial proximity and concepts in spatial epidemiology are closely related to transmission dynamics in infectious diseases. The rate of transmission is likely to be higher among individuals in close proximity as compared to those far apart. As a result, if spatial dependence is ignored, the confidence interval obtained is likely to be narrower and having higher chances of type I error (assuming positive spatial autocorrelation).

This chapter provides an introduction to geographic information systems (GIS) in R. The field of GIS in healthcare has become extremely useful in providing a fresh outlook to public health. GIS provides an opportunity to enable overlaying data with its spatial representation to support better planning and decision-making in healthcare. The convergence of many new sub-disciplines such as medical geography, public health, health informatics and data science help us better understand the similarities and differences in population health across the world. Some of the applications of GIS in healthcare include disease surveillance, environmental health, infectious diseases such as mathematical modelling and agent based modelling, and even medical imagining. While traditional uses of GIS in healthcare still are relevant, newer methods and advancing technology would be monumental for public health research.

It is now being realized that the healthcare industry can benefit tremendously from the potential of GIS. Innovative ways are being developed to harness the power of GIS through data integration pipelines and spatial visualization. Both the public and private sectors are adopting GIS to provide value addition across different sectors of healthcare - from public health departments and public health policy and research organizations to hospitals, medical centres, and health insurance organizations. With this background, let us dive right in!

It must be evident by now that geographic information systems (GIS) try to answer the question of WHERE:

  • Where diseases are prevalent?
  • Where do vulnerable populations live?
  • Where are we supposed to undertake public health correction measures?
  • Where are resources most needed for improving health conditions?

GIS enabled systems has the potential to offer crucial insights into such questions? However, it is also important that the right questions as being asked. In order to understand better on how to ask the right questions, we may have to revisit the history of GIS in Public Health.

Early examples using maps as tools to better understand disease and death have been applications such as medical mapping and disease topography. From early 17th Century, maps have not simply been used to illustrate a situation but also to prove an argument. Pioneering works in this regard have helped disprove the Theory of Miasma and identified that water as the cause of disease rather than air. Soon by the 18th Century medical maps saw many applications in public health such as plague, yellow fever, and cholera.

In 1854, an English physician, John Snow, provided the classic example of how GIS mapping can be used in epidemiological research. He identified the water source responsible for an outbreak of cholera in London by mapping the locations of those afflicted. By plotting the number and location of fatalities using stacks of bars on a map, Snow was able to perform a task that is now easily taken for granted: he visualized a spatial distribution. Looking at the results, the pattern on the map seems unmistakable. The map appears to support Snow’s claims that cholera is a waterborne disease and that the pump on Broad Street is the source of the outbreak. Despite its hand-drawn, back-of-the-envelope appearance, Snow writes:

Brief summary of the workshop.

The overall goal of the Spatial Epidemiology workshop is to provide an introduction to spatial data analysis and workflows in R. It is aimed at building a strong foundation in Spatial Epidemiology and spatial data science for healthcare professionals.

The day is divided into four sessions. The first session aims to introduce the basic concepts and spatial data visualization. This will be followed by a group activity on map making. The second session will introduce the workflows and principles of working with spatial data using R through a live demo session. This will be followed by a session on the fundamentals of spatial autocorrelation analysis and group activity on spatial clustering.

The third session will focus on Geographic Weighted Regression (GWR). The last session will be comprising of introduction to advanced concepts including iterations and interactive visualization. This will again be followed up by a group activity and subsequently, all the groups will present their assignments before the feedback session.

Assumptions during the workshop

For the workshop, it is assumed that the data collection has been completed by the researcher and attribute data has been pre-processed for analysis. Pre-processing of datasets is of utmost importance and needs deliberation before starting with analysis of any dataset.

Map projections

Projections transform the curved, three-dimensional surface of the earth into a flat, two-dimensional plane. All map projections have distortions (distance, area, direction, and/or shape). The choice of projection depends on the intended use of the map. An equal-area map projection is a good selection for portraying geographic data distributions and is suitable for most other maps. However, if the map is attempting to show distance from patients to providers, for example, then an equidistant projection is appropriate because it preserves distance.

We sometimes refer to coordinate systems (or grid systems) and datums in context with map projections. With the help of coordinate reference systems (CRS) every place on the earth can be specified by a set of three numbers, called coordinates. Various coordinate systems and datums are used throughout the world. Most mapping starts with a projected map and a coordinate system overlay, which enables locational referencing. Datums, based upon different ellipsoids (idealized versions of the shape of the earth), define the origin and orientation of latitude and longitude lines. The most recently developed and widely used datum is WGS 1984. Global positioning system (GPS) data are often collected in the WGS 1984 datum.

It is important to note that this CRS represented in can be represented in many ways. For example:

PROJ.4 format as +proj=longlat +datum=WGS84 +no_defs +type=crs, and in Well Known Text (WKT) format as

GEOGCS[“WGS 84”, DATUM[“WGS_1984”, SPHEROID[“WGS 84”,6378137,298.257223563]], PRIMEM[“Greenwich”,0], UNIT[“degree”,0.0174532925199433, AUTHORITY[“EPSG”,“9122”]], AUTHORITY[“EPSG”,“4326”]]

Spatial Files and Extensions

There are over 60 different file formats in GIS. However, it is important to understand that the spatial data is broadly categorized into field (also known as continuous data and usually available in raster data format) and discrete (includes point, line, polygon data sets and usually available in vector data format in multitude of combinations) objects. The sf package is the preferred option to use vector data in R, and the raster and terra packages are the choice for raster data.

Vector graphics are comprised of vertices and paths. The three basic symbol types in vector data are points, lines and polygons. Among vector data, the commonly used file formats are the ESRI Shape file extensions (.shp, .dbf, .shx), Geographic JavaScript Object Notation (.geojson, .json) and Google Keyhole Markup Language (.kml, .kmz). Please note that all files of the ESRI Shapefiles are necessary for the file to work properly.

On the contrary, raster data is made up of pixels, also referred to as grid cells). They are usually square but can also be hexagons or other shapes. Rasters have pixels that are associated with a value (continuous) or class (discrete). One of the most widely used raster formats are the GeoTiff extensions (.tif, .tiff).

Introducing the sf package

There are several packages that are available on CRAN that can perform spatial analytical tasks and operations. However, not all of them are having continued support from the original authors and are turning obsolete.

The figure below shows the different spatial packages on CRAN and their popularity, its is clear that the sf package has surpassed all other packages in terms of popularity and user downloads.

The sf package provides simple features access for R. Simple Features is a set of standards that specify a common storage and access model of geographic feature made of mostly two-dimensional geometries (point, line, polygon, multi-point, multi-line, etc.) used by geographic information systems. It is formalized by both the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO).

What is a feature?

A feature is thought of as a thing, or an object in the real world, such as a building or a tree. As is the case with objects, they often consist of other objects. A set of features can form a single feature. A tree in a forest can be a feature, a forest can be a feature, a city can be a feature. An image pixel from a satellite can be a feature, or a complete MRI scan image can be a feature too.

Common types of geometry in `sf` package

Common types of geometry in sf package

Datasets and libraries used in the workshop.

Lets introduce ourselves to the datasets used in this workshop before proceeding further.

library(tidyverse) # Tidy data management
library(sf) # Handling spatial and vector data
library(raster) # Handling raster data
library(spdep) # Moran's I
library(rgeoda) # LISA statistics
library(spgwr) # Spatial regression packages
library(spatialreg)

John snow datasets.

The Snow data consists of the relevant 1854 London streets, the location of 578 deaths from cholera, and the position of 13 water pumps (wells) that can be used to re-create John Snow’s map showing deaths from cholera in the area surrounding Broad Street, London in the 1854 outbreak.

snow_deaths <- HistData::Snow.deaths
snow_deaths |> 
  head()

Another data frame giving coordinates used to draw the 528 street segment lines within the boundaries of the map.

snow_streets <- HistData::Snow.streets
snow_streets |> 
  head()

Additionally, we shall be using dataset giving the locations of water pumps within the boundaries of the map.

snow_pumps <- HistData::Snow.pumps
snow_pumps |> 
  head()

World map.

world <- readRDS(here::here("data",
                            "spatial_files", 
                            "world_india_compliant.rds"))
world |> 
  ggplot() +
  geom_sf()

India map with state boundaries.

india <- readRDS(here::here("data",
                            "spatial_files", 
                            "India_states.rds"))
india |> 
  ggplot() +
  geom_sf()

COVID-19 state wise aggregated datasets.

This dataset has been downloaded from crowd-sourced COVID-19 data ans subsetted from 14 March 2020 to 31 Oct 2021 for demonstration purposes. The dataset has been further modified to calculate the cumulative number of confirmed cases for each state.

covid <- rio::import(here::here("data",
                                "covid.csv"))
# Subset confirmed cases
covid <- covid |>  
  filter(Status == "Confirmed")
# Tidy data format
covid <- covid |> 
  pivot_longer(4:42,
               names_to = "state",
               values_to = "cases")
# Calculating cumulative cases
covid <- covid |> 
  group_by(state) |> 
  mutate(cumsum = cumsum(cases))

# Understanding data
covid |> 
  head()
covid |> 
  tail()

World population density.

The Gridded Population of the World (GPW) collection, now in its fourth version (GPWv4), models the distribution of human population (counts and densities) on a continuous global raster surface. The purpose of GPW is to provide a spatially disaggregated population layer that is compatible with data sets from social, economic, and Earth science disciplines, and remote sensing. It provides globally consistent and spatially explicit data for use in research, policy-making, and communications.

pop_density <- raster::raster(here::here("data", 
                   "spatial_files",
                   "pop_density.tif"))
plot(pop_density)

New York Leukemia dataset

We will use New York leukemia data from spData package which is a data frame with 281 observations on the following 12 variables, and the binary coded spatial weights used in the source. The variables are:- - AREANAME name of census tract
- AREAKEY unique FIPS code for each tract - X x-coordinate of tract centroid (in km) - Y y-coordinate of tract centroid (in km) - POP8 population size (1980 U.S. Census) - TRACTCAS number of cases 1978-1982 - PROPCAS proportion of cases per tract - PCTOWNHOME percentage of people in each tract owning their own home - PCTAGE65P percentage of people in each tract aged 65 or more - Z transformed proportions - AVGIDIST average distance between centroid and TCE sites - PEXPOSURE “exposure potential”: inverse distance between each census tract centroid and the nearest TCE site, IDIST, transformed via log(100*IDIST)

df_gwr <- spData::nydata

df_gwr |> glimpse()
## Rows: 281
## Columns: 12
## $ AREANAME   <fct> Binghamton city, Binghamton city, Binghamton city, Binghamt…
## $ AREAKEY    <fct> 36007000100, 36007000200, 36007000300, 36007000400, 3600700…
## $ X          <dbl> 4.069397, 4.639371, 5.709063, 7.613831, 7.315968, 8.558753,…
## $ Y          <dbl> -67.3533, -66.8619, -66.9775, -65.9958, -67.3183, -66.9344,…
## $ POP8       <dbl> 3540, 3560, 3739, 2784, 2571, 2729, 3952, 993, 1908, 948, 1…
## $ TRACTCAS   <dbl> 3.08, 4.08, 1.09, 1.07, 3.06, 1.06, 2.09, 0.02, 2.04, 0.02,…
## $ PROPCAS    <dbl> 0.000870, 0.001146, 0.000292, 0.000384, 0.001190, 0.000388,…
## $ PCTOWNHOME <dbl> 0.32773109, 0.42682927, 0.33773959, 0.46160483, 0.19243697,…
## $ PCTAGE65P  <dbl> 0.14661017, 0.23511236, 0.13800481, 0.11889368, 0.14157915,…
## $ Z          <dbl> 0.14197, 0.35555, -0.58165, -0.29634, 0.45689, -0.28123, -0…
## $ AVGIDIST   <dbl> 0.2373852, 0.2087413, 0.1708548, 0.1406045, 0.1577753, 0.17…
## $ PEXPOSURE  <dbl> 3.167099, 3.038511, 2.838229, 2.643366, 2.758587, 2.848411,…

Spatial Data Visualization in Epidemiology

Introduction.

The role of spatial data visualization in understanding distribution and determinants of a disease were classically demonstrated in John Snow’s work in cholera epidemic of Golden Square, London in 1854 and maps have remained the most common applied aspect of Spatial epidemiology till date. Maps enable real time visualizations. However, though the concepts of spatial data visualization has its historical roots in the practice of cartography1, with the evolution of map making computational abilities, the requirement to know the cartography concepts for visualisation of spatial data for general use has decreased to minimal. For eg. we can point a location in google maps and reach a destination without deliberations on cartographic concepts (such as northing, scale, etc). This has led to \(neogeography\)2 and thus crowd-sourced databases due to volunteer contributions.

In professional and academic forums, the concepts of cartography are pre-requisites to meaningful spatial data visualization. Further, the concept of using visualization methods has evolved from just a tool for communication to the concept of visualization as an analysis tool. Visualization of spatial data is used to understand distribution patterns and generate hypothesis as well as for inferential procedures. The underlying principles in Scientific visualization has led to use of visualization methods as complementary to statistical inference procedures in spatial epidemiology. For example, the interpretation of Moran’s I3 is incomplete without accompanying visualization of dataset. The maps as analysis tools have several possible meanings and are thus polysemic in nature.

Expectations from the present session.

It is expected that by end of this session, participants should be able to develop visualizations, understand and communicate epidemiological information, and develop hypothesis based on the visual representations of spatial data sets.

Graphic Variables.

\(Jacques\) \(Bertin\), a french cartographer published a book \(Semiologie\) \(Graphique\), which brought forward concept of graphic variables. These variables highlight methods using which different information can be obtained from the data. The seven graphic variables as mentioned in \(Semiology\) \(of\) \(graphics\) (English version) are as under:-

Location.

The location of a spatial object varies with changing projections. It is important to understand that the transformation of CRS from geographic to projected, from raster to vector should be done with caution. The method of choice should be guided by the research objectives and need to preserve shape or location.

Value.

The value of a variable is related to the darkness/ lightness of a symbol. It is usually used for variables on an interval or ratio scale. Traditionally, higher values are depicted darker.

Color

Color can be considered as most complex graphic variable. It represents varied sensations to different viewers. In addition, human eye is not equally sensitive across the spectrum of colors. Further, cultural affiliations, traditional representations and understandings provide challanges. A golden rule recommended is to first use color-blind friendly palettes. In the domain of Public Health, it is important to consider the color and its depiction during visualization to avoid ambiguous communication. For example: Color choice in vaccination coverage visualization as compared to color choice in obesity.

Size.

In proportional symbol maps, the size of a symbol represents the difference in quantum of observations. The underlying type of data, classification adopted, transformations, and outlier management among others should be explored during visualizations.

Shape.

Similar to size, choice of shape of the symbol can provide clarity or increase confusion, if not considered carefully. Different shapes can be provided to distinguish different types of variables. The same can be considered to distinguish sub-types of a variable. For example, dotted lines and solid lines can be used to distinguish roads from rivers but can also be used to distinguish main roads from side-roads. Understanding of datasets and variables help in deciding correct shapes for representation of data.

Space.

The arrangement of symbols can also be used to represent differences. For example: In a dot map representing dengue occurrence data, space between the dots represent disease clustering, if any. Again, the interpretation of a variable needs careful consideration, as the case occurrence may be higher just because of underlying population figures and not otherwise.

Orientation.

Understanding the orientation of a pattern may provide critical insights into disease epidemiology. For eg. occurrence of a disease along with the course of a river may indicate risk factors associated with disease.

Map components.

Map.

The vector or raster map is central to map visualizations.

Title.

A short and clear title must be provided. In addition, subtitles can be added, if required.

Distance and Scale.

There are two options for displaying information related to distance and scale in a map.
1. Display of a scale bar.
2. Use of x and y axes of the display as labeling axes.

Direction.

Similar to symbolism of scale, direction can be displayed as a symbol or can be implicit in the labeled axes. Conventionally, true North is provided as a symbol in maps.

Legend.

Not all maps require a legend. When present, due consideration to Graphic variables should be given.

Credits.

The data sources, source of geospatial files should be provided.

Understanding the process of Map creation.

The first step in any map making should be based on careful understanding of Co-ordinate Reference Systems. For beginners, it is recommended that all map making and spatial manipulations should be undertaken using ESPG 4326.

# Setting crs to 4326
india <- india |> 
  st_transform(crs = 4326)

Now, lets first create a map of India as per component of maps learned so far.

Creating a Base map.

india |> 
  ggplot() +
  geom_sf()

Adding Title/ sub-title.

india |> ggplot()+
  geom_sf()+
  labs(title = "India map",
       subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")

Describing Distance and Scale.

india |> ggplot()+
  geom_sf()+
  labs(title = "India map",
       subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")+
  ggspatial::annotation_scale(location = "bl") #bl/ br/ tr/ tl

Direction symbols.

india |> ggplot()+
  geom_sf()+
  labs(title = "India map",
       subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")+
  ggspatial::annotation_scale(location = "bl")+
  ggspatial::annotation_north_arrow(location = "br")

Adding axis labels

india |> ggplot()+
  geom_sf()+
  labs(title = "India map",
       subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")+
  ggspatial::annotation_scale(location = "bl")+
  ggspatial::annotation_north_arrow(location = "tr")+
  xlab("Longitude")+
  ylab("Latitude")

Acknowledging Credits.

india |> ggplot()+
  geom_sf()+
  labs(title = "India map",
       subtitle = "Spatial Epidemiology Workshop")+
  ggspatial::annotation_scale(location = "bl")+
  ggspatial::annotation_north_arrow(location = "tr")+
  xlab("Longitude")+
  ylab("Latitude")+
  geom_text(aes(x = 87.5, 
                y = 10.3, 
                label = "Credits: IAPSMCON 2024"),
            color = "black", 
            check_overlap = T, 
            size = 3)

Adding labels and legends.

Addition of labels and legends is optional in map making. The following code can be used to add labels to the map

#adding abbreviations
india$abbr <- c("LA","AP","AR", "AS","BR", "CH","CT", "DL",
              "GA", "GJ","HR", "HP","JH", "KA","KL", "MP",
              "MH","MN","ML", "MZ","NL", "OR","PY", "PB",
              "RJ", "SK","TN", "TR","UP", "UT","WB", "TG",
              "JK", "LD","AN", "DDDN")
## labeling abbreviations
india |> ggplot()+
  geom_sf()+
  labs(title = "India map",
       subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")+
  ggspatial::annotation_scale(location = "bl")+
  ggspatial::annotation_north_arrow(location = "tr")+
  xlab("Longitude")+
  ylab("Latitude")+
  geom_text(aes(x = 87.5, y = 10.3, label = "Credits: IAPSMCON 2024"),
             color = "black", check_overlap = T, size = 2)+
  geom_sf_label(aes(label = abbr))

Visualisation of point data.

Kindly note: For the purpose of clarity on a particular aspect of the map, we will avoid placing all components of a map across the session. This is important to understand that when communicating maps/ figures, all components should be present.

Dot Maps.

The simplest method to visualize disease occurrence patterns is creation of dot maps. They help us understand spatial dependencies in disease occurrence. The spatial clustering of a disease is suggestive of factors present/ absent in locations with high disease occurrence as compared to locations with low disease occurrence. Thus, point maps provides visual tool for providing estimates of Spatial Clustering. Important limitation of these dot maps is the inability to visualize independent events as the number of observations increase or multiple events are observed at the same location. To overcome the same methods such as smoothing, rasterisation and dot density maps is recommended.

Illustrative Example: John Snow maps

Let us create a dot map of John Snow data. To do the same, we need to convert data.frame into an sf object using the st_as_af() function.

snow_deaths_sf <- snow_deaths |> 
  st_as_sf(coords = c('x', 'y'), 
           crs = 4326)

snow_deaths_sf |> 
  ggplot() +
  geom_sf() 

Now, we can add additional layers such as the water pumps and the streets of Soho, London.

snow_streets_sf <- snow_streets |> 
  st_as_sf(coords = c('x', 'y'), crs = 4326) |> 
  group_by(street) |> 
  summarize(n = mean(n)) |>  
  st_cast('LINESTRING')

snow_streets_sf |> 
  ggplot() +
  geom_sf() 

Let us add the streets of Soho, London as a new layer in our previously created deaths map.

ggplot() +
  geom_sf(data = snow_streets_sf) +
  geom_sf(data = snow_deaths_sf, 
          color = 'red', 
          alpha = 0.5, 
          size = 1)

We can now add a new layer of water pumps on the above map.

snow_pumps_sf <- snow_pumps |> 
  st_as_sf(coords = c('x', 'y'), 
           crs = 4326)

ggplot() +
  geom_sf(data = snow_streets_sf) +
  geom_sf(data = snow_deaths_sf, color = 'red', alpha = 0.2, size = 3) +
  geom_sf(data = snow_pumps_sf , shape = 22, size = 1, fill = 'blue', color = 'blue') +
  geom_sf_label(data = snow_pumps_sf, 
                aes(label = label), 
                nudge_x = 0.025, 
                nudge_y = -0.5)

Visualisation of aggregate data.

Choropleth Maps.

The term “choropleth” is derived from Greek words \(Khoros\) meaning “place” and \(plethein\) meaning “full”. In choropleth maps, aggregated value for a given area forms the basis of visual representation. The three major challenges in interpretation of choropleth maps arise from large area polygons, Modified Unit Area Problem, and presence of skewed data. Deliberation on these issues, creation of cartograms, data transformation methods such Box-Cox transformations and multi-level analysis are recommended to overcome these limitations.

Example: COVID 19 cases

Lets summarise state wise total covid 19 cases and merge it with india map to plot a choropleth map as demonstration.

df <- covid |> 
  group_by(state) |> 
  summarise(cases = sum(cases))

india <- left_join(india, df,
          by = c("abbr" = "state")) 

india |> 
  ggplot()+
  geom_sf(aes(fill = cases)) +
  scale_fill_gradient(low = "white", high = "red") +
  labs(fill = "COVID-19 cases",
       caption = "For demonstration only",
       title = "India COVID-19 cases (Subset data)",
       subtitle = "For demonstration only")

Think!! Why such maps are not good visualizations?

Visualization using Web based maps.

Many a times it becomes difficult to get a shape file of the stud area. Various sources of shape files is available online (eg. DataMeet, SEDAC, HGL, DIVA-GIS, Bhuvan, Survey of India, etc.). Leaflet package helps us to use web maps from open sources such as Open Street Map, ESRI maps, cartoDB, etc.

Spatial Data Management

Introduction.

It is expected that by end of the session, participants will be able to understand as well as apply varied methods for manipulation of spatial data sets.

Understanding and Tranformation of Coordinated Reference System (CRS).

Know the CRS

To know the crs of a spatial dataset, use st_crs() function.

india |> 
  st_crs()
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

Transforming CRS of Spatial Data

When we have two or more related spatial data sets, and the CRS are different, we can transform the CRS using st_transform() function.

india <- india |> 
  st_transform(crs = 4326)

Manipulating Spatial Data based on their location and shape.

Subset Spatial data based on attributes.

The process of selecting only those locations which have a specified attribute is known as sub-setting and the output is subset. It is done to understand spatial distribution of a given variable and thus identify underlying patterns. Sf objects have inherent data frame class. Hence, filter() function can be used for obtaining spatial subsets.

mh_subset <- india |> 
  filter(NAME_1 == "Maharashtra") 

mh_subset |> 
  ggplot() +
  geom_sf() +
  labs(title = "Subset map of Maharashtra")

Spatial relationships based on Topology.

Topology is defined as spatial relationship between the objects. These include relationships such as common boundaries, intersections, one within the other, crossings, dis junctions, etc.

mh_subset |> 
  st_intersects(india)
## Sparse geometry binary predicate list of length 1, where the predicate
## was `intersects'
##  1: 7, 9, 10, 14, 16, 17, 32, 36

Similarly, st_within() is used to find the relationship for those strictly within boundaries, st_disjoint() is opposite of st_intersect(), st_touches() is for those objects which are touching each other (eg. polygons) and st_within_distance() for the objects within a buffer distance from the said object.

Spatial overlay/ joins.

Presence of a unique key is pre-requisite to joining of datasets. The same concept is utilised for overlay or joining of two or more spatial datasets with location coordinates as unique key. st_join() performs spatial joins using location coordinates as unique key. By default, left join is undertaken. in is performed.

Lets learn it through an example. We shall add a new variable to Maharashtra subset and add the same in India file using spatial overlay.

mh_subset <- mh_subset |> 
  mutate(check_variable = "The states sharing border with Maharashtra subset data")

india_joined <- india |> 
  st_join(mh_subset)

india_joined |> 
  ggplot() +
  geom_sf(aes(fill = ifelse(
    check_variable == "The states sharing border with Maharashtra subset data",
    "red", "white"
  ))) +
  theme(legend.position = "none")

Obtaining aggregated spatial data.

Aggregation of spatial data is a process of condensing the dataset to obtain stratified summary measures. The same can be done using summarise() function.

india_joined |> 
  st_drop_geometry() |>
  group_by(check_variable) |> 
  summarise(mean_pop_density = mean(pop_density_india),
            .groups = "keep")

Area calculations.

To determine area of a polygon, st_area() function is used.

india |> 
  st_area()  
## Units: [m^2]
##  [1] 164871219452 163066121108  82163787556  78559330387  94431499839
##  [6]    118238781 135799676380   1504310841   3708836029 185709212363
## [11]  44081446376  55538115194  80226215548 192071260093  37853176465
## [16] 308776245994 308259630387  22368416342  22471908102  21187513837
## [21]  16631116132 156080143451    549465348  50274014378 342516183864
## [26]   7152632353 130728371017  10472109552 241157134272  53705294257
## [31]  85488226543 112699854048  52883743670   7528052264    665633067
## [36]   1050420948

Geometry based functions.

We can transform geometries with unary and binary transformations. Unary transformations are based on a single geometry in isolation and includes measures such as creation of buffers and centroids. Binary operations modify the shape of one geometry based on the other and includes measures such as clipping and geometry unions.

Centroid creation.

The centroid based processes identify the center of a sf object and is similar to measure of central tendency. There are many variations in ways to define centre of an sf object, however, use of geographical centre remains most common. st_centroid() provides centroid values.

india_point <- india |> 
  st_centroid() 
india_point

Adding centroids to the dataset.

india <- cbind(india, 
               st_coordinates(st_centroid(india)))

india

Buffer creation

buffer_bihar <- india_point |> 
  filter(NAME_1 == "Bihar") |> 
  st_buffer(dist = 100000)

ggplot() +
  geom_sf(data = india) +
  geom_sf(data = buffer_bihar,
          fill = "red",
          alpha = 0.5) 

Saving Spatial Data and its output.

Though st_write() function helps in saving spatial datasets as shapefiles, it is recommended that RDS data format be used as it uses less space in the system and is computationally faster to load subsequently.

Methods for spatial data manipulation of Raster datasets.

pop_density
## class      : RasterLayer 
## dimensions : 180, 360, 64800  (nrow, ncol, ncell)
## resolution : 1, 1  (x, y)
## extent     : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## source     : pop_density.tif 
## names      : pop_density

Resolution

Similar to area calculations in a vector dataset, we would like to know about the resolution of the raster data. The res() function provides dimensions of each of the cell of the raster grid in metres (By default, UTM projection is in meters).

res(pop_density)
## [1] 1 1

Raster data subsets.

To extract value for a specified location terra::extract() function can be used.

To subset based on extent, the bounding box details can be extracted using st_bbox() function.

Map algebra.

Understanding of map algebra makes raster processing faster. There are four subclasses of raster based operations as under:-

Local.

These operations act independently for each cell of the raster.

Focal.

Also known a kernel, moving window or filter operations for raster data, they choose a central cell and its neighbourhood (such as surrounding eight cells, for example). The focal operation acts similar to spatial data aggregation and results in an output of aggregated values for each combination of central cell and its neighbourhood. raster::focal() function can be used for obtaining focal values by defining the matrix for central cell and its neighbourhood.

Zonal.

The principle behind zonal operations is same as that of focal operations. However, unlike in focal operations, zonal operations can go beyond a rectangular matrix of neighbourhood and can provide aggregated values for a pre-defined shape or distance or area, etc. The raster::zonal() function provides zonal statistics.

Global.

These operations are applied on the entire dataset and are used to summarise the characteristics of the given raster data such as mean, median, IQR, range etc of the datasets.

Merging rasters.

Two rasters can be merged using raster::merge() function in general. However, depending upon the raster characteristics, more refined methods are available such as from packages landsat, satellite, etc. Merging rasters require same resolution and crs of the rasters as prerequisites.

Illustrative raster data example: Population density raster

pop_density_india <- exactextractr::exact_extract(pop_density,
                             india,
                             "mean",
                             append_cols = "NAME_1")
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |=====================                                                 |  31%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |=================================================                     |  69%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |======================================================================| 100%
india <- left_join(india, pop_density_india)

india <- india |> 
  rename(pop_density_india = mean)
india |>
  ggplot() +
  geom_sf(aes(fill = pop_density_india)) +
  scale_fill_gradient(low = "green", high = "red") +
  labs(title = "Distribution of population density in India",
       caption =  "Source: SEDAC")

Spatial Statistics in Epidemiology: An introduction.

The descriptive measures in routine data analysis includes measures of central tendency (e.g. mean, median, mode) and measures of dispersion such as (e.g. SD, range, IQR). The choice of the descriptive measure chosen to describe the study variable depends upon the type of data and on the scale of measurement. While describing spatial datasets, these known measures of describing a study variable are complemented with additional summary measures to describe spatial data.

Expectations from the present session.

By the end of the session, students should be able to understand methods which are used to describe spatial distribution of study variable. The students are also expected to understand the difference in concept as well as methods in describing global estimates as compared to local estimates for presence of spatial clustering, if any.

Complete Spatial Randomness (CSR).

It is also known as “Random Spatial Distribution” or “Independent Spatial Distribution”. When there is no spatial dependence, an event is expected to have CSR. Lack of CSR indicates spatial clustering.

Stationarity.

A spatial process is said to be stationary when the dependence between measurements of the same variable across space is the same for all locations in an area. When the event occurrence is in CSR or in a regular pattern, the underlying process is said to be stationary.

Isotropy.

If the dependence in a stationary process is affected by distance, but this is the same in all directions, the process is considered to be isotropic. When the stationary process is affected by distance and is not same in all directions, the process is considered to be anisotropic.

Spatial Cluster.

Knox defined a cluster as “a geographically and/or temporally bounded group of occurrences of sufficient size and concentration to be unlikely to have occurred by chance”.

First and second order effects.

First-order effects describe large-scale variations in the mean of the outcome of interest due to location or other explanatory variables, while second-order effects describe small-scale variations.

Global measures of Spatial Clustering.

Global clustering methods are used to assess whether clustering is apparent throughout the study region but do not identify the location of clusters. The null hypothesis for global clustering methods is simply that ‘no clustering exists’ (i.e. random spatial dispersion) across the study region.

Local measures of Spatial Clustering.

Global measures of spatial association assume that the spatial process under investigation is stationary. Under this assumption, which is rarely met, global tests of association run the risk of obscuring local effects. As a result, significant local clustering may not be detected and conversely, large non-clustered areas within a study area may be ignored. Local statistics overcome this problem by scanning the entire dataset, but only measuring dependence in limited portions of the study area, the bounds of which have to be specified.Local (specific) methods of cluster detection define the locations and extent of clusters. They are further divided into focused and non-focused tests.

Scan circles.

Local measures of spatial clustering are divided into those primarily designed for aggregated data, and those for point data, although this distinction is not rigid and most can be used with either type of data. An important concept in understanding calculations behind local estimates of spatial clustering when point data is under consideration is the scan circles. These scan circles can be defined in terms of geographic distance (e.g. Openshaw’s method), number of cases (e.g. Besag and Newell’s method), and population size (e.g. Turnbull’s method and Kulldorff’s scan statistic). It is important to consider the optimal size of the scan circle while calculating local estimates of spatial clustering because a large scan circle is likely to identify spurious clusters whereas a small scan circle is likely to miss large clusters.

Neighbours

To define a neighbor, two commonly used approaches are based on adjacency and distance.

Other complex neighborhoods can be defined on basis of spatial relationships such as percentage of common boundaries.

Neighbours defined on basis of adjacency/ contiguity

Commonly used approaches based on adjacency to define neighbors are:-

Rook contiguity

Polygons are said to be neighbors or adjacent when they share a common border.

Queen contiguity.

Polygons are adjacent if they share a border or corner

Higher order contiguities.

Also called spatial lags.

First-order adjacency

Also called adjacent neighbors

Second-order adjacency

Also called neighbors-of-neighbors

Neighbours defined on basis of distance.

In case of point data, all the events within a specified distance/ radius are said to be in neighborhood.

Monte Carlo simulations.

Many tests for clustering use Monte Carlo simulation in order to determine the statistical significance of the cluster (i.e. does the observed spatial pattern differ significantly from the null hypothesis of complete spatial randomness). This involves first calculating the test statistic using the observed data, and then re-calculating it using a specified number (e.g. 99, 499, 999) of simulated data sets (or permutations). The latter is used to generate the expected distribution of the test statistic under the null hypothesis. The likelihood of obtaining the value for the test statistic derived from the observed data is then calculated, and expressed as the p-value.

Measures of Spatial Clustering for agggregated data.

Auto-correlation statistics for aggregated data provide an estimate of the degree of spatial similarity observed among neighboring values of an attribute over a study area.

Fundamental to all auto-correlation statistics is the \(weights\) \(matrix\), used to define the spatial relationships of the regions so that those that are close in space are given greater weight in the calculation than those that are distant.

Global estimates.

Moran’s I.

Description.

  • Moran’s I is similar to Pearson’s correlation coefficient (r).
  • It quantifies similarity of outcome among areas that are defined as spatially related.
  • It generally lies between +1 and -1.
  • Although Moran’s I is intended for use with continuous data it can also be used to analyze count data.

Interpretation.

A Moran’s I of zero indicates the null hypothesis of no clustering or presence of CSR. A positive Moran’s I indicates positive spatial autocorrelation (i.e. clustering of areas of similar attribute values), while a negative coefficient indicates negative spatial autocorrelation (i.e. that neighboring areas tend to have dissimilar attribute values).

Tests of significance.

Moran’s I is approximately normally distributed and the significance of Moran’s I can be assessed using Monte Carlo randomization.

Local estimates.

Local Moran test.

Also called as Local Indicators of Spatial Association (LISA).

The local Moran test detects local spatial autocorrelation in aggregated data by decomposing Moran’s I statistic into contributions for each area within a study region.

Moran scatterplot

  • It detect clusters of either similar or dissimilar disease frequency values around a given observation.

  • The Moran’s scatter plot provides an indication of the stability of the spatial association throughout the data.

  • Horizontal axis: Observed values.

  • Vertical axis: Weighted average of neighbouring values.

  • Four types of association:- low–high, high–high, high–low and low–low.

Illustrative example: Moran’s I and LISA statistics

india_moran <- india |> 
  filter(NAME_1 != "Andaman & Nicobar Island") |> 
  filter(NAME_1 != "Lakshadweep") 

# Creating a neighbor list from polygon list
neighbor_list_q <- india_moran |> 
  poly2nb(queen = T)
# Neighbors list object summary
summary(neighbor_list_q)
## Neighbour list object:
## Number of regions: 34 
## Number of nonzero links: 136 
## Percentage nonzero weights: 11.76471 
## Average number of links: 4 
## Link number distribution:
## 
## 1 2 3 4 5 6 7 9 
## 2 8 6 4 6 4 3 1 
## 2 least connected regions:
## 19 26 with 1 link
## 1 most connected region:
## 29 with 9 links
# Structure of neighbors list: Visualization
plot(india_moran$geometry, 
     border="black"); plot(neighbor_list_q, 
                           st_coordinates(st_centroid(india_moran)), 
                           pch = 19, cex = 0.6, add = TRUE, col= "red")

# Spatial weights
neighbor_weights_q_w <-
  neighbor_list_q %>%
  nb2listw(style = "W")

neighbor_weights_q_w
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 34 
## Number of nonzero links: 136 
## Percentage nonzero weights: 11.76471 
## Average number of links: 4 
## 
## Weights style: W 
## Weights constants summary:
##    n   nn S0       S1       S2
## W 34 1156 34 19.43254 151.3604
# Moran's I
moran_values <- moran.test(india_moran$pop_density_india, 
           listw = neighbor_weights_q_w,
           na.action=na.omit)

moran_values
## 
##  Moran I test under randomisation
## 
## data:  india_moran$pop_density_india  
## weights: neighbor_weights_q_w    
## 
## Moran I statistic standard deviate = 2.4164, p-value = 0.007837
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.207391989      -0.030303030       0.009676153
# Local Moran's I


# Create weights object
queen_w <- queen_weights(india_moran)
summary(queen_w)
# Calculate LISA
lisa <- local_moran(queen_w, 
          india_moran["pop_density_india"],
          permutations = 999,
          significance_cutoff = 0.1,
          seed = 5197,
          cpu_threads = 6)
lisa_value <- lisa_values(lisa)
lisa_pvalue <- lisa_pvalues(lisa)
lisa_quad <- lisa_clusters(lisa)
shp <- cbind(india_moran, 
      lisa_value, 
      lisa_pvalue,
      lisa_quad)
# Create Local Moran's Map
lisa_colors <- lisa_colors(lisa)
lisa_labels <- lisa_labels(lisa)
lisa_clusters <- lisa_clusters(lisa)

shp$lisa_quad <- factor(lisa_quad,
                       levels =  c(0,1,2,3,4),
                       labels = c("Not significant",
                                  "High-High",
                                  "Low-Low",
                                  "Low-High",
                                  "High-Low"))
lisa_map <- ggplot(shp)+
  geom_sf(aes(fill = lisa_quad))+
  scale_fill_manual(values = lisa_colors,
                    name = "LISA Quadrant")+
  labs(title = "LISA map")+
  ggspatial::annotation_north_arrow(height = unit(0.5, "cm"),
                                    width = unit(0.5, "cm"),
                                    location = "tl")+
  ggspatial::annotation_scale(height = unit(0.15, "cm"),
                              location = "br")
lisa_map

Point pattern analysis.

There are multiple statistical tests which can be used to measure spatial clustering of point data. It is based on either the locations, distance or buffer matrix for global estimates and scan circles for local associations.

Measures of Spatial clustering around a source (Focused tests).

Focused tests are used to establish inference regarding role of a source (eg. factory, pond, etc) in causation of disease. It is hypothesized that the disease occurrence is higher among population groups closer to the source as compared to those staying at a far away distance from the source.

Spatial variation in risk.

Epidemiological disease investigations should include an assessment of the spatial variation of disease risk, as this may provide important clues leading to causal explanations. Spatial variation estimation methods broadly includes:

  1. Smoothing methods.

  2. Interpolation.

Spatial Regression models: An introduction

Introduction

When we undertake classic linear regression, one of the assumptions include that the independent variables should be independent of each other. However, in case the variables are related spatially, the assumption is voilated. It is therefore recommended that at the outset, in the exploratory phase of any study, we should look for spatial correlation. In case, it is not present, the researcher can proceed with non spatial regression models, however, if present, Geographically Weighted Regression (GWR) should be undertaken.

What to do if spatial dependence is suspected?

If you feel like your data has spatial dependence, then the following steps should be carried out:

  • Inspect your data, construct an appropriate spatial neighborhood, and assign weights.

  • Run an OLS regression with your variables of interest. This regression assumes spatial independence.

  • Inspect results and plot your residuals.

  • Compute the Moran’s I value for the residuals.

  • If you find significant spatial correlation, do a GWR, else proceed with non spatial models.

Demonstration

Analysis without considering spatial dependence

It is expected that the locations where the exposure is high should have higher leukemia rates. In this case, based on our previous understanding, it seems obvious, however, many times we may not be aware of underlying spatial associations and ignore the spatial aspect. Lets see what happens!!

df_gwr |>  
  ggplot()+
  geom_point(
    aes(x = PEXPOSURE, y = Z, color = AREANAME)
  ) +
  theme(legend.position = "none") +
  labs(x = "Exposure",
       y = "Leukemia rate",
       title = "Scatter plot: Association of Exposure and Leukemia rate") +
  geom_smooth(aes(x = PEXPOSURE, y = Z),
              method = lm)

Simple linear regression without spatial dynamics

model <- lm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, 
   data = df_gwr)
summary(model)
## 
## Call:
## lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = df_gwr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7417 -0.3957 -0.0326  0.3353  4.1398 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.51728    0.15856  -3.262  0.00124 ** 
## PEXPOSURE    0.04884    0.03506   1.393  0.16480    
## PCTAGE65P    3.95089    0.60550   6.525 3.22e-10 ***
## PCTOWNHOME  -0.56004    0.17031  -3.288  0.00114 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6571 on 277 degrees of freedom
## Multiple R-squared:  0.1932, Adjusted R-squared:  0.1844 
## F-statistic:  22.1 on 3 and 277 DF,  p-value: 7.306e-13

These results suggest that PEXPOSURE is not significant, but census variables are (home ownership and percent oldies). It’s also useful to glance at R-squared estimates. It gives a sense of how well model fits the data… it tells us how much variation we’ve explained out of the total variation in the model.

Looking at the residuals

In the model above, the residuals should not be spatially related. Lets have a look!

df_gwr <- df_gwr |> 
  mutate(residuals = model$residuals)

# Transform df_gwr to sf object
df_gwr <- st_as_sf(df_gwr,
         coords = c(lon = "X",
                    lat = "Y"))

Moran’s I

Using neighbor list from polygon list and Spatial weights

neighbor_weights_q_w <- spData::listw_NY
df_gwr |> 
  pull(residuals) |> 
  moran.test(listw = neighbor_weights_q_w)
## 
##  Moran I test under randomisation
## 
## data:  pull(df_gwr, residuals)  
## weights: neighbor_weights_q_w    
## 
## Moran I statistic standard deviate = 2.4457, p-value = 0.007229
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.083090278      -0.003571429       0.001255603

Geographically weighted regression: spatial Lag model

model <- lagsarlm(formula=Z~PEXPOSURE+PCTAGE65P+PCTOWNHOME,data=df_gwr,listw=neighbor_weights_q_w,quiet=T)
summary(model)
## 
## Call:lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = df_gwr, 
##     listw = neighbor_weights_q_w, quiet = T)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.586752 -0.391580 -0.022469  0.338017  4.029430 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.514495   0.156154 -3.2948  0.000985
## PEXPOSURE    0.047627   0.034509  1.3801  0.167542
## PCTAGE65P    3.648198   0.599046  6.0900 1.129e-09
## PCTOWNHOME  -0.414601   0.169554 -2.4453  0.014475
## 
## Rho: 0.038893, LR test value: 6.9683, p-value: 0.0082967
## Asymptotic standard error: 0.015053
##     z-value: 2.5837, p-value: 0.0097755
## Wald statistic: 6.6754, p-value: 0.0097755
## 
## Log likelihood: -275.2447 for lag model
## ML residual variance (sigma squared): 0.41166, (sigma: 0.6416)
## Number of observations: 281 
## Number of parameters estimated: 6 
## AIC: 562.49, (AIC for lm: 567.46)
## LM test for residual autocorrelation
## test value: 1.4633, p-value: 0.22641

As expected, spatial auto-regressive parameter \((\rho)\) is highly significant (p = .008). Test for residual autocorrelation is not significant, which suggests that we cannot reject the null of randomly distributed residuals.

Moran’s I

df_gwr |> 
  pull(residuals) |> 
  moran.test(listw = neighbor_weights_q_w)
## 
##  Moran I test under randomisation
## 
## data:  pull(df_gwr, residuals)  
## weights: neighbor_weights_q_w    
## 
## Moran I statistic standard deviate = 2.4457, p-value = 0.007229
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.083090278      -0.003571429       0.001255603

  1. The science and art of making maps↩︎

  2. The process of map creation as primary objective but owing little to past cartographic traditions.↩︎

  3. Measure of spatial clustering. Details covered in “Measures of Spatial Clustering” session↩︎