The present document is being prepared as a companion handbook to Hands-on workshop on Spatial Epidemiology using R during 51st Annual National Conference of Indian Association of Preventive and Social Medicine (IAPSMCON 2024).
Spatial epidemiology is the description and analysis of geographically indexed health data with respect to demographic, environmental, behavioral, socio-economic, genetic, and infectious risk factors. Spatial Epidemiology concerns the analysis of the spatial/geographical distribution of the incidence of disease. The rational of this workshop is to equip public health professionals with spatial epidemiology skills for disease analysis and risk assessment.
The spatial proximity and concepts in spatial epidemiology are closely related to transmission dynamics in infectious diseases. The rate of transmission is likely to be higher among individuals in close proximity as compared to those far apart. As a result, if spatial dependence is ignored, the confidence interval obtained is likely to be narrower and having higher chances of type I error (assuming positive spatial autocorrelation).
This chapter provides an introduction to geographic information systems (GIS) in R. The field of GIS in healthcare has become extremely useful in providing a fresh outlook to public health. GIS provides an opportunity to enable overlaying data with its spatial representation to support better planning and decision-making in healthcare. The convergence of many new sub-disciplines such as medical geography, public health, health informatics and data science help us better understand the similarities and differences in population health across the world. Some of the applications of GIS in healthcare include disease surveillance, environmental health, infectious diseases such as mathematical modelling and agent based modelling, and even medical imagining. While traditional uses of GIS in healthcare still are relevant, newer methods and advancing technology would be monumental for public health research.
It is now being realized that the healthcare industry can benefit tremendously from the potential of GIS. Innovative ways are being developed to harness the power of GIS through data integration pipelines and spatial visualization. Both the public and private sectors are adopting GIS to provide value addition across different sectors of healthcare - from public health departments and public health policy and research organizations to hospitals, medical centres, and health insurance organizations. With this background, let us dive right in!
It must be evident by now that geographic information systems (GIS) try to answer the question of WHERE:
GIS enabled systems has the potential to offer crucial insights into such questions? However, it is also important that the right questions as being asked. In order to understand better on how to ask the right questions, we may have to revisit the history of GIS in Public Health.
Early examples using maps as tools to better understand disease and death have been applications such as medical mapping and disease topography. From early 17th Century, maps have not simply been used to illustrate a situation but also to prove an argument. Pioneering works in this regard have helped disprove the Theory of Miasma and identified that water as the cause of disease rather than air. Soon by the 18th Century medical maps saw many applications in public health such as plague, yellow fever, and cholera.
In 1854, an English physician, John Snow, provided the classic example of how GIS mapping can be used in epidemiological research. He identified the water source responsible for an outbreak of cholera in London by mapping the locations of those afflicted. By plotting the number and location of fatalities using stacks of bars on a map, Snow was able to perform a task that is now easily taken for granted: he visualized a spatial distribution. Looking at the results, the pattern on the map seems unmistakable. The map appears to support Snow’s claims that cholera is a waterborne disease and that the pump on Broad Street is the source of the outbreak. Despite its hand-drawn, back-of-the-envelope appearance, Snow writes:
The overall goal of the Spatial Epidemiology workshop is to provide an introduction to spatial data analysis and workflows in R. It is aimed at building a strong foundation in Spatial Epidemiology and spatial data science for healthcare professionals.
The day is divided into four sessions. The first session aims to introduce the basic concepts and spatial data visualization. This will be followed by a group activity on map making. The second session will introduce the workflows and principles of working with spatial data using R through a live demo session. This will be followed by a session on the fundamentals of spatial autocorrelation analysis and group activity on spatial clustering.
The third session will focus on Geographic Weighted Regression (GWR). The last session will be comprising of introduction to advanced concepts including iterations and interactive visualization. This will again be followed up by a group activity and subsequently, all the groups will present their assignments before the feedback session.
For the workshop, it is assumed that the data collection has been completed by the researcher and attribute data has been pre-processed for analysis. Pre-processing of datasets is of utmost importance and needs deliberation before starting with analysis of any dataset.
Projections transform the curved, three-dimensional surface of the earth into a flat, two-dimensional plane. All map projections have distortions (distance, area, direction, and/or shape). The choice of projection depends on the intended use of the map. An equal-area map projection is a good selection for portraying geographic data distributions and is suitable for most other maps. However, if the map is attempting to show distance from patients to providers, for example, then an equidistant projection is appropriate because it preserves distance.
We sometimes refer to coordinate systems (or grid systems) and datums in context with map projections. With the help of coordinate reference systems (CRS) every place on the earth can be specified by a set of three numbers, called coordinates. Various coordinate systems and datums are used throughout the world. Most mapping starts with a projected map and a coordinate system overlay, which enables locational referencing. Datums, based upon different ellipsoids (idealized versions of the shape of the earth), define the origin and orientation of latitude and longitude lines. The most recently developed and widely used datum is WGS 1984. Global positioning system (GPS) data are often collected in the WGS 1984 datum.
It is important to note that this CRS represented in can be represented in many ways. For example:
PROJ.4 format as
+proj=longlat +datum=WGS84 +no_defs +type=crs, and in Well
Known Text (WKT) format as
GEOGCS[“WGS 84”, DATUM[“WGS_1984”, SPHEROID[“WGS 84”,6378137,298.257223563]], PRIMEM[“Greenwich”,0], UNIT[“degree”,0.0174532925199433, AUTHORITY[“EPSG”,“9122”]], AUTHORITY[“EPSG”,“4326”]]
There are over 60 different file formats in GIS. However, it is important to understand that the spatial data is broadly categorized into field (also known as continuous data and usually available in raster data format) and discrete (includes point, line, polygon data sets and usually available in vector data format in multitude of combinations) objects. The sf package is the preferred option to use vector data in R, and the raster and terra packages are the choice for raster data.
Vector graphics are comprised of vertices and paths. The three basic
symbol types in vector data are points, lines and polygons. Among vector
data, the commonly used file formats are the ESRI Shape file extensions
(.shp, .dbf, .shx), Geographic
JavaScript Object Notation (.geojson, .json)
and Google Keyhole Markup Language (.kml,
.kmz). Please note that all files of the ESRI Shapefiles
are necessary for the file to work properly.
On the contrary, raster data is made up of pixels, also referred to
as grid cells). They are usually square but can also be hexagons or
other shapes. Rasters have pixels that are associated with a value
(continuous) or class (discrete). One of the most widely used raster
formats are the GeoTiff extensions (.tif,
.tiff).
sf packageThere are several packages that are available on CRAN that can perform spatial analytical tasks and operations. However, not all of them are having continued support from the original authors and are turning obsolete.
The figure below shows the different spatial packages on CRAN and
their popularity, its is clear that the sf package has
surpassed all other packages in terms of popularity and user
downloads.
The sf package provides simple features access for R.
Simple Features is a set of standards that specify a common storage and
access model of geographic feature made of mostly two-dimensional
geometries (point, line, polygon, multi-point, multi-line, etc.) used by
geographic information systems. It is formalized by both the Open
Geospatial Consortium (OGC) and the International Organization for
Standardization (ISO).
A feature is thought of as a thing, or an object in the real world, such as a building or a tree. As is the case with objects, they often consist of other objects. A set of features can form a single feature. A tree in a forest can be a feature, a forest can be a feature, a city can be a feature. An image pixel from a satellite can be a feature, or a complete MRI scan image can be a feature too.
Common types of geometry in sf package
Lets introduce ourselves to the datasets used in this workshop before proceeding further.
library(tidyverse) # Tidy data management
library(sf) # Handling spatial and vector data
library(raster) # Handling raster data
library(spdep) # Moran's I
library(rgeoda) # LISA statistics
library(spgwr) # Spatial regression packages
library(spatialreg)
The Snow data consists of the relevant 1854 London streets, the location of 578 deaths from cholera, and the position of 13 water pumps (wells) that can be used to re-create John Snow’s map showing deaths from cholera in the area surrounding Broad Street, London in the 1854 outbreak.
snow_deaths <- HistData::Snow.deaths
snow_deaths |>
head()
Another data frame giving coordinates used to draw the 528 street segment lines within the boundaries of the map.
snow_streets <- HistData::Snow.streets
snow_streets |>
head()
Additionally, we shall be using dataset giving the locations of water pumps within the boundaries of the map.
snow_pumps <- HistData::Snow.pumps
snow_pumps |>
head()
world <- readRDS(here::here("data",
"spatial_files",
"world_india_compliant.rds"))
world |>
ggplot() +
geom_sf()
india <- readRDS(here::here("data",
"spatial_files",
"India_states.rds"))
india |>
ggplot() +
geom_sf()
This dataset has been downloaded from crowd-sourced COVID-19 data ans subsetted from 14 March 2020 to 31 Oct 2021 for demonstration purposes. The dataset has been further modified to calculate the cumulative number of confirmed cases for each state.
covid <- rio::import(here::here("data",
"covid.csv"))
# Subset confirmed cases
covid <- covid |>
filter(Status == "Confirmed")
# Tidy data format
covid <- covid |>
pivot_longer(4:42,
names_to = "state",
values_to = "cases")
# Calculating cumulative cases
covid <- covid |>
group_by(state) |>
mutate(cumsum = cumsum(cases))
# Understanding data
covid |>
head()
covid |>
tail()
The Gridded Population of the World (GPW) collection, now in its fourth version (GPWv4), models the distribution of human population (counts and densities) on a continuous global raster surface. The purpose of GPW is to provide a spatially disaggregated population layer that is compatible with data sets from social, economic, and Earth science disciplines, and remote sensing. It provides globally consistent and spatially explicit data for use in research, policy-making, and communications.
pop_density <- raster::raster(here::here("data",
"spatial_files",
"pop_density.tif"))
plot(pop_density)
We will use New York leukemia data from spData package
which is a data frame with 281 observations on the following 12
variables, and the binary coded spatial weights used in the source. The
variables are:- - AREANAME name of census tract
- AREAKEY unique FIPS code for each tract - X x-coordinate of tract
centroid (in km) - Y y-coordinate of tract centroid (in km) - POP8
population size (1980 U.S. Census) - TRACTCAS number of cases 1978-1982
- PROPCAS proportion of cases per tract - PCTOWNHOME percentage of
people in each tract owning their own home - PCTAGE65P percentage of
people in each tract aged 65 or more - Z transformed proportions -
AVGIDIST average distance between centroid and TCE sites - PEXPOSURE
“exposure potential”: inverse distance between each census tract
centroid and the nearest TCE site, IDIST, transformed via
log(100*IDIST)
df_gwr <- spData::nydata
df_gwr |> glimpse()
## Rows: 281
## Columns: 12
## $ AREANAME <fct> Binghamton city, Binghamton city, Binghamton city, Binghamt…
## $ AREAKEY <fct> 36007000100, 36007000200, 36007000300, 36007000400, 3600700…
## $ X <dbl> 4.069397, 4.639371, 5.709063, 7.613831, 7.315968, 8.558753,…
## $ Y <dbl> -67.3533, -66.8619, -66.9775, -65.9958, -67.3183, -66.9344,…
## $ POP8 <dbl> 3540, 3560, 3739, 2784, 2571, 2729, 3952, 993, 1908, 948, 1…
## $ TRACTCAS <dbl> 3.08, 4.08, 1.09, 1.07, 3.06, 1.06, 2.09, 0.02, 2.04, 0.02,…
## $ PROPCAS <dbl> 0.000870, 0.001146, 0.000292, 0.000384, 0.001190, 0.000388,…
## $ PCTOWNHOME <dbl> 0.32773109, 0.42682927, 0.33773959, 0.46160483, 0.19243697,…
## $ PCTAGE65P <dbl> 0.14661017, 0.23511236, 0.13800481, 0.11889368, 0.14157915,…
## $ Z <dbl> 0.14197, 0.35555, -0.58165, -0.29634, 0.45689, -0.28123, -0…
## $ AVGIDIST <dbl> 0.2373852, 0.2087413, 0.1708548, 0.1406045, 0.1577753, 0.17…
## $ PEXPOSURE <dbl> 3.167099, 3.038511, 2.838229, 2.643366, 2.758587, 2.848411,…
The role of spatial data visualization in understanding distribution and determinants of a disease were classically demonstrated in John Snow’s work in cholera epidemic of Golden Square, London in 1854 and maps have remained the most common applied aspect of Spatial epidemiology till date. Maps enable real time visualizations. However, though the concepts of spatial data visualization has its historical roots in the practice of cartography1, with the evolution of map making computational abilities, the requirement to know the cartography concepts for visualisation of spatial data for general use has decreased to minimal. For eg. we can point a location in google maps and reach a destination without deliberations on cartographic concepts (such as northing, scale, etc). This has led to \(neogeography\)2 and thus crowd-sourced databases due to volunteer contributions.
In professional and academic forums, the concepts of cartography are pre-requisites to meaningful spatial data visualization. Further, the concept of using visualization methods has evolved from just a tool for communication to the concept of visualization as an analysis tool. Visualization of spatial data is used to understand distribution patterns and generate hypothesis as well as for inferential procedures. The underlying principles in Scientific visualization has led to use of visualization methods as complementary to statistical inference procedures in spatial epidemiology. For example, the interpretation of Moran’s I3 is incomplete without accompanying visualization of dataset. The maps as analysis tools have several possible meanings and are thus polysemic in nature.
It is expected that by end of this session, participants should be able to develop visualizations, understand and communicate epidemiological information, and develop hypothesis based on the visual representations of spatial data sets.
\(Jacques\) \(Bertin\), a french cartographer published a book \(Semiologie\) \(Graphique\), which brought forward concept of graphic variables. These variables highlight methods using which different information can be obtained from the data. The seven graphic variables as mentioned in \(Semiology\) \(of\) \(graphics\) (English version) are as under:-
The location of a spatial object varies with changing projections. It is important to understand that the transformation of CRS from geographic to projected, from raster to vector should be done with caution. The method of choice should be guided by the research objectives and need to preserve shape or location.
The value of a variable is related to the darkness/ lightness of a symbol. It is usually used for variables on an interval or ratio scale. Traditionally, higher values are depicted darker.
Color can be considered as most complex graphic variable. It represents varied sensations to different viewers. In addition, human eye is not equally sensitive across the spectrum of colors. Further, cultural affiliations, traditional representations and understandings provide challanges. A golden rule recommended is to first use color-blind friendly palettes. In the domain of Public Health, it is important to consider the color and its depiction during visualization to avoid ambiguous communication. For example: Color choice in vaccination coverage visualization as compared to color choice in obesity.
In proportional symbol maps, the size of a symbol represents the difference in quantum of observations. The underlying type of data, classification adopted, transformations, and outlier management among others should be explored during visualizations.
Similar to size, choice of shape of the symbol can provide clarity or increase confusion, if not considered carefully. Different shapes can be provided to distinguish different types of variables. The same can be considered to distinguish sub-types of a variable. For example, dotted lines and solid lines can be used to distinguish roads from rivers but can also be used to distinguish main roads from side-roads. Understanding of datasets and variables help in deciding correct shapes for representation of data.
The arrangement of symbols can also be used to represent differences. For example: In a dot map representing dengue occurrence data, space between the dots represent disease clustering, if any. Again, the interpretation of a variable needs careful consideration, as the case occurrence may be higher just because of underlying population figures and not otherwise.
Understanding the orientation of a pattern may provide critical insights into disease epidemiology. For eg. occurrence of a disease along with the course of a river may indicate risk factors associated with disease.
The vector or raster map is central to map visualizations.
A short and clear title must be provided. In addition, subtitles can be added, if required.
There are two options for displaying information related to distance
and scale in a map.
1. Display of a scale bar.
2. Use of x and y axes of the display as labeling axes.
Similar to symbolism of scale, direction can be displayed as a symbol or can be implicit in the labeled axes. Conventionally, true North is provided as a symbol in maps.
Not all maps require a legend. When present, due consideration to Graphic variables should be given.
The data sources, source of geospatial files should be provided.
The first step in any map making should be based on careful understanding of Co-ordinate Reference Systems. For beginners, it is recommended that all map making and spatial manipulations should be undertaken using ESPG 4326.
# Setting crs to 4326
india <- india |>
st_transform(crs = 4326)
Now, lets first create a map of India as per component of maps learned so far.
india |>
ggplot() +
geom_sf()
india |> ggplot()+
geom_sf()+
labs(title = "India map",
subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")
india |> ggplot()+
geom_sf()+
labs(title = "India map",
subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")+
ggspatial::annotation_scale(location = "bl") #bl/ br/ tr/ tl
india |> ggplot()+
geom_sf()+
labs(title = "India map",
subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")+
ggspatial::annotation_scale(location = "bl")+
ggspatial::annotation_north_arrow(location = "br")
india |> ggplot()+
geom_sf()+
labs(title = "India map",
subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")+
ggspatial::annotation_scale(location = "bl")+
ggspatial::annotation_north_arrow(location = "tr")+
xlab("Longitude")+
ylab("Latitude")
india |> ggplot()+
geom_sf()+
labs(title = "India map",
subtitle = "Spatial Epidemiology Workshop")+
ggspatial::annotation_scale(location = "bl")+
ggspatial::annotation_north_arrow(location = "tr")+
xlab("Longitude")+
ylab("Latitude")+
geom_text(aes(x = 87.5,
y = 10.3,
label = "Credits: IAPSMCON 2024"),
color = "black",
check_overlap = T,
size = 3)
Addition of labels and legends is optional in map making. The following code can be used to add labels to the map
#adding abbreviations
india$abbr <- c("LA","AP","AR", "AS","BR", "CH","CT", "DL",
"GA", "GJ","HR", "HP","JH", "KA","KL", "MP",
"MH","MN","ML", "MZ","NL", "OR","PY", "PB",
"RJ", "SK","TN", "TR","UP", "UT","WB", "TG",
"JK", "LD","AN", "DDDN")
## labeling abbreviations
india |> ggplot()+
geom_sf()+
labs(title = "India map",
subtitle = "Spatial Epidemiology Workshop IAPSMCON 2024")+
ggspatial::annotation_scale(location = "bl")+
ggspatial::annotation_north_arrow(location = "tr")+
xlab("Longitude")+
ylab("Latitude")+
geom_text(aes(x = 87.5, y = 10.3, label = "Credits: IAPSMCON 2024"),
color = "black", check_overlap = T, size = 2)+
geom_sf_label(aes(label = abbr))
Kindly note: For the purpose of clarity on a particular aspect of the map, we will avoid placing all components of a map across the session. This is important to understand that when communicating maps/ figures, all components should be present.
The simplest method to visualize disease occurrence patterns is creation of dot maps. They help us understand spatial dependencies in disease occurrence. The spatial clustering of a disease is suggestive of factors present/ absent in locations with high disease occurrence as compared to locations with low disease occurrence. Thus, point maps provides visual tool for providing estimates of Spatial Clustering. Important limitation of these dot maps is the inability to visualize independent events as the number of observations increase or multiple events are observed at the same location. To overcome the same methods such as smoothing, rasterisation and dot density maps is recommended.
Let us create a dot map of John Snow data. To do the same, we need to
convert data.frame into an sf object using the
st_as_af() function.
snow_deaths_sf <- snow_deaths |>
st_as_sf(coords = c('x', 'y'),
crs = 4326)
snow_deaths_sf |>
ggplot() +
geom_sf()
Now, we can add additional layers such as the water pumps and the streets of Soho, London.
snow_streets_sf <- snow_streets |>
st_as_sf(coords = c('x', 'y'), crs = 4326) |>
group_by(street) |>
summarize(n = mean(n)) |>
st_cast('LINESTRING')
snow_streets_sf |>
ggplot() +
geom_sf()
Let us add the streets of Soho, London as a new layer in our previously created deaths map.
ggplot() +
geom_sf(data = snow_streets_sf) +
geom_sf(data = snow_deaths_sf,
color = 'red',
alpha = 0.5,
size = 1)
We can now add a new layer of water pumps on the above map.
snow_pumps_sf <- snow_pumps |>
st_as_sf(coords = c('x', 'y'),
crs = 4326)
ggplot() +
geom_sf(data = snow_streets_sf) +
geom_sf(data = snow_deaths_sf, color = 'red', alpha = 0.2, size = 3) +
geom_sf(data = snow_pumps_sf , shape = 22, size = 1, fill = 'blue', color = 'blue') +
geom_sf_label(data = snow_pumps_sf,
aes(label = label),
nudge_x = 0.025,
nudge_y = -0.5)
The term “choropleth” is derived from Greek words \(Khoros\) meaning “place” and \(plethein\) meaning “full”. In choropleth maps, aggregated value for a given area forms the basis of visual representation. The three major challenges in interpretation of choropleth maps arise from large area polygons, Modified Unit Area Problem, and presence of skewed data. Deliberation on these issues, creation of cartograms, data transformation methods such Box-Cox transformations and multi-level analysis are recommended to overcome these limitations.
Lets summarise state wise total covid 19 cases and merge it with india map to plot a choropleth map as demonstration.
df <- covid |>
group_by(state) |>
summarise(cases = sum(cases))
india <- left_join(india, df,
by = c("abbr" = "state"))
india |>
ggplot()+
geom_sf(aes(fill = cases)) +
scale_fill_gradient(low = "white", high = "red") +
labs(fill = "COVID-19 cases",
caption = "For demonstration only",
title = "India COVID-19 cases (Subset data)",
subtitle = "For demonstration only")
Think!! Why such maps are not good visualizations?
Many a times it becomes difficult to get a shape file of the stud area. Various sources of shape files is available online (eg. DataMeet, SEDAC, HGL, DIVA-GIS, Bhuvan, Survey of India, etc.). Leaflet package helps us to use web maps from open sources such as Open Street Map, ESRI maps, cartoDB, etc.
It is expected that by end of the session, participants will be able to understand as well as apply varied methods for manipulation of spatial data sets.
Simple features are open standard objects which have been endorsed by Open Geospatial Consortium. They represent heirarchial data model for 17 geometry types.
During this session and in subsequent sessions on Spatial
Epidemiology, we shall be using sf package. This is because
sf package has inbuilt capabilities of handling 7 most common
data models of simple features. Further,R spatial ecosystem is more
efficient with use of sf package compared to previous
packages.
Shape file format data can be imported into the R ecosystem using
st_read() or read_sf() function. Both are
similar in function with subtle differences. read_sf() chooses a few
tidyverse defaults such as silent (st_read gives a short report),
returns a spatial tibble (st_read returns a data frame), does not
convert strings to factors (st_read converts to factors by default) and
accepts list columns as input. For other formats st_as_sf() contains a
family of conversion functions.
To know the crs of a spatial dataset, use st_crs()
function.
india |>
st_crs()
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## ENSEMBLE["World Geodetic System 1984 ensemble",
## MEMBER["World Geodetic System 1984 (Transit)"],
## MEMBER["World Geodetic System 1984 (G730)"],
## MEMBER["World Geodetic System 1984 (G873)"],
## MEMBER["World Geodetic System 1984 (G1150)"],
## MEMBER["World Geodetic System 1984 (G1674)"],
## MEMBER["World Geodetic System 1984 (G1762)"],
## MEMBER["World Geodetic System 1984 (G2139)"],
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[2.0]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
When we have two or more related spatial data sets, and the CRS are
different, we can transform the CRS using st_transform()
function.
india <- india |>
st_transform(crs = 4326)
The process of selecting only those locations which have a specified
attribute is known as sub-setting and the output is subset. It is done
to understand spatial distribution of a given variable and thus identify
underlying patterns. Sf objects have inherent data frame class.
Hence, filter() function can be used for obtaining spatial
subsets.
mh_subset <- india |>
filter(NAME_1 == "Maharashtra")
mh_subset |>
ggplot() +
geom_sf() +
labs(title = "Subset map of Maharashtra")
Topology is defined as spatial relationship between the objects. These include relationships such as common boundaries, intersections, one within the other, crossings, dis junctions, etc.
mh_subset |>
st_intersects(india)
## Sparse geometry binary predicate list of length 1, where the predicate
## was `intersects'
## 1: 7, 9, 10, 14, 16, 17, 32, 36
Similarly, st_within() is used to find the relationship
for those strictly within boundaries, st_disjoint() is
opposite of st_intersect(), st_touches() is
for those objects which are touching each other (eg. polygons) and
st_within_distance() for the objects within a buffer
distance from the said object.
Presence of a unique key is pre-requisite to joining of datasets. The
same concept is utilised for overlay or joining of two or more spatial
datasets with location coordinates as unique key. st_join()
performs spatial joins using location coordinates as unique key. By
default, left join is undertaken. in is performed.
Lets learn it through an example. We shall add a new variable to Maharashtra subset and add the same in India file using spatial overlay.
mh_subset <- mh_subset |>
mutate(check_variable = "The states sharing border with Maharashtra subset data")
india_joined <- india |>
st_join(mh_subset)
india_joined |>
ggplot() +
geom_sf(aes(fill = ifelse(
check_variable == "The states sharing border with Maharashtra subset data",
"red", "white"
))) +
theme(legend.position = "none")
Aggregation of spatial data is a process of condensing the dataset to
obtain stratified summary measures. The same can be done using
summarise() function.
india_joined |>
st_drop_geometry() |>
group_by(check_variable) |>
summarise(mean_pop_density = mean(pop_density_india),
.groups = "keep")
To determine area of a polygon, st_area() function is used.
india |>
st_area()
## Units: [m^2]
## [1] 164871219452 163066121108 82163787556 78559330387 94431499839
## [6] 118238781 135799676380 1504310841 3708836029 185709212363
## [11] 44081446376 55538115194 80226215548 192071260093 37853176465
## [16] 308776245994 308259630387 22368416342 22471908102 21187513837
## [21] 16631116132 156080143451 549465348 50274014378 342516183864
## [26] 7152632353 130728371017 10472109552 241157134272 53705294257
## [31] 85488226543 112699854048 52883743670 7528052264 665633067
## [36] 1050420948
We can transform geometries with unary and binary transformations. Unary transformations are based on a single geometry in isolation and includes measures such as creation of buffers and centroids. Binary operations modify the shape of one geometry based on the other and includes measures such as clipping and geometry unions.
The centroid based processes identify the center of a sf object and
is similar to measure of central tendency. There are many variations in
ways to define centre of an sf object, however, use of geographical
centre remains most common. st_centroid() provides centroid
values.
india_point <- india |>
st_centroid()
india_point
india <- cbind(india,
st_coordinates(st_centroid(india)))
india
buffer_bihar <- india_point |>
filter(NAME_1 == "Bihar") |>
st_buffer(dist = 100000)
ggplot() +
geom_sf(data = india) +
geom_sf(data = buffer_bihar,
fill = "red",
alpha = 0.5)
Though st_write() function helps in saving spatial
datasets as shapefiles, it is recommended that RDS data format be used
as it uses less space in the system and is computationally faster to
load subsequently.
pop_density
## class : RasterLayer
## dimensions : 180, 360, 64800 (nrow, ncol, ncell)
## resolution : 1, 1 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : pop_density.tif
## names : pop_density
Similar to area calculations in a vector dataset, we would like to
know about the resolution of the raster data. The res()
function provides dimensions of each of the cell of the raster grid in
metres (By default, UTM projection is in meters).
res(pop_density)
## [1] 1 1
To extract value for a specified location
terra::extract() function can be used.
To subset based on extent, the bounding box details can be extracted
using st_bbox() function.
Understanding of map algebra makes raster processing faster. There are four subclasses of raster based operations as under:-
These operations act independently for each cell of the raster.
Also known a kernel, moving window or filter operations for raster data, they choose a central cell and its neighbourhood (such as surrounding eight cells, for example). The focal operation acts similar to spatial data aggregation and results in an output of aggregated values for each combination of central cell and its neighbourhood. raster::focal() function can be used for obtaining focal values by defining the matrix for central cell and its neighbourhood.
The principle behind zonal operations is same as that of focal operations. However, unlike in focal operations, zonal operations can go beyond a rectangular matrix of neighbourhood and can provide aggregated values for a pre-defined shape or distance or area, etc. The raster::zonal() function provides zonal statistics.
These operations are applied on the entire dataset and are used to summarise the characteristics of the given raster data such as mean, median, IQR, range etc of the datasets.
Two rasters can be merged using raster::merge() function
in general. However, depending upon the raster characteristics, more
refined methods are available such as from packages landsat, satellite,
etc. Merging rasters require same resolution and crs of the rasters as
prerequisites.
pop_density_india <- exactextractr::exact_extract(pop_density,
india,
"mean",
append_cols = "NAME_1")
##
|
| | 0%
|
|== | 3%
|
|==== | 6%
|
|====== | 8%
|
|======== | 11%
|
|========== | 14%
|
|============ | 17%
|
|============== | 19%
|
|================ | 22%
|
|================== | 25%
|
|=================== | 28%
|
|===================== | 31%
|
|======================= | 33%
|
|========================= | 36%
|
|=========================== | 39%
|
|============================= | 42%
|
|=============================== | 44%
|
|================================= | 47%
|
|=================================== | 50%
|
|===================================== | 53%
|
|======================================= | 56%
|
|========================================= | 58%
|
|=========================================== | 61%
|
|============================================= | 64%
|
|=============================================== | 67%
|
|================================================= | 69%
|
|=================================================== | 72%
|
|==================================================== | 75%
|
|====================================================== | 78%
|
|======================================================== | 81%
|
|========================================================== | 83%
|
|============================================================ | 86%
|
|============================================================== | 89%
|
|================================================================ | 92%
|
|================================================================== | 94%
|
|==================================================================== | 97%
|
|======================================================================| 100%
india <- left_join(india, pop_density_india)
india <- india |>
rename(pop_density_india = mean)
india |>
ggplot() +
geom_sf(aes(fill = pop_density_india)) +
scale_fill_gradient(low = "green", high = "red") +
labs(title = "Distribution of population density in India",
caption = "Source: SEDAC")
The descriptive measures in routine data analysis includes measures of central tendency (e.g. mean, median, mode) and measures of dispersion such as (e.g. SD, range, IQR). The choice of the descriptive measure chosen to describe the study variable depends upon the type of data and on the scale of measurement. While describing spatial datasets, these known measures of describing a study variable are complemented with additional summary measures to describe spatial data.
By the end of the session, students should be able to understand methods which are used to describe spatial distribution of study variable. The students are also expected to understand the difference in concept as well as methods in describing global estimates as compared to local estimates for presence of spatial clustering, if any.
It is also known as “Random Spatial Distribution” or “Independent Spatial Distribution”. When there is no spatial dependence, an event is expected to have CSR. Lack of CSR indicates spatial clustering.
A spatial process is said to be stationary when the dependence between measurements of the same variable across space is the same for all locations in an area. When the event occurrence is in CSR or in a regular pattern, the underlying process is said to be stationary.
If the dependence in a stationary process is affected by distance, but this is the same in all directions, the process is considered to be isotropic. When the stationary process is affected by distance and is not same in all directions, the process is considered to be anisotropic.
Knox defined a cluster as “a geographically and/or temporally bounded group of occurrences of sufficient size and concentration to be unlikely to have occurred by chance”.
First-order effects describe large-scale variations in the mean of the outcome of interest due to location or other explanatory variables, while second-order effects describe small-scale variations.
Global clustering methods are used to assess whether clustering is apparent throughout the study region but do not identify the location of clusters. The null hypothesis for global clustering methods is simply that ‘no clustering exists’ (i.e. random spatial dispersion) across the study region.
Global measures of spatial association assume that the spatial process under investigation is stationary. Under this assumption, which is rarely met, global tests of association run the risk of obscuring local effects. As a result, significant local clustering may not be detected and conversely, large non-clustered areas within a study area may be ignored. Local statistics overcome this problem by scanning the entire dataset, but only measuring dependence in limited portions of the study area, the bounds of which have to be specified.Local (specific) methods of cluster detection define the locations and extent of clusters. They are further divided into focused and non-focused tests.
Local measures of spatial clustering are divided into those primarily designed for aggregated data, and those for point data, although this distinction is not rigid and most can be used with either type of data. An important concept in understanding calculations behind local estimates of spatial clustering when point data is under consideration is the scan circles. These scan circles can be defined in terms of geographic distance (e.g. Openshaw’s method), number of cases (e.g. Besag and Newell’s method), and population size (e.g. Turnbull’s method and Kulldorff’s scan statistic). It is important to consider the optimal size of the scan circle while calculating local estimates of spatial clustering because a large scan circle is likely to identify spurious clusters whereas a small scan circle is likely to miss large clusters.
To define a neighbor, two commonly used approaches are based on adjacency and distance.
Other complex neighborhoods can be defined on basis of spatial relationships such as percentage of common boundaries.
Commonly used approaches based on adjacency to define neighbors are:-
Polygons are said to be neighbors or adjacent when they share a common border.
Polygons are adjacent if they share a border or corner
Also called spatial lags.
Also called adjacent neighbors
Also called neighbors-of-neighbors
In case of point data, all the events within a specified distance/ radius are said to be in neighborhood.
Many tests for clustering use Monte Carlo simulation in order to determine the statistical significance of the cluster (i.e. does the observed spatial pattern differ significantly from the null hypothesis of complete spatial randomness). This involves first calculating the test statistic using the observed data, and then re-calculating it using a specified number (e.g. 99, 499, 999) of simulated data sets (or permutations). The latter is used to generate the expected distribution of the test statistic under the null hypothesis. The likelihood of obtaining the value for the test statistic derived from the observed data is then calculated, and expressed as the p-value.
Auto-correlation statistics for aggregated data provide an estimate of the degree of spatial similarity observed among neighboring values of an attribute over a study area.
Fundamental to all auto-correlation statistics is the \(weights\) \(matrix\), used to define the spatial relationships of the regions so that those that are close in space are given greater weight in the calculation than those that are distant.
A Moran’s I of zero indicates the null hypothesis of no clustering or presence of CSR. A positive Moran’s I indicates positive spatial autocorrelation (i.e. clustering of areas of similar attribute values), while a negative coefficient indicates negative spatial autocorrelation (i.e. that neighboring areas tend to have dissimilar attribute values).
Moran’s I is approximately normally distributed and the significance of Moran’s I can be assessed using Monte Carlo randomization.
Also called as Local Indicators of Spatial Association (LISA).
The local Moran test detects local spatial autocorrelation in aggregated data by decomposing Moran’s I statistic into contributions for each area within a study region.
It detect clusters of either similar or dissimilar disease frequency values around a given observation.
The Moran’s scatter plot provides an indication of the stability of the spatial association throughout the data.
Horizontal axis: Observed values.
Vertical axis: Weighted average of neighbouring values.
Four types of association:- low–high, high–high, high–low and low–low.
india_moran <- india |>
filter(NAME_1 != "Andaman & Nicobar Island") |>
filter(NAME_1 != "Lakshadweep")
# Creating a neighbor list from polygon list
neighbor_list_q <- india_moran |>
poly2nb(queen = T)
# Neighbors list object summary
summary(neighbor_list_q)
## Neighbour list object:
## Number of regions: 34
## Number of nonzero links: 136
## Percentage nonzero weights: 11.76471
## Average number of links: 4
## Link number distribution:
##
## 1 2 3 4 5 6 7 9
## 2 8 6 4 6 4 3 1
## 2 least connected regions:
## 19 26 with 1 link
## 1 most connected region:
## 29 with 9 links
# Structure of neighbors list: Visualization
plot(india_moran$geometry,
border="black"); plot(neighbor_list_q,
st_coordinates(st_centroid(india_moran)),
pch = 19, cex = 0.6, add = TRUE, col= "red")
# Spatial weights
neighbor_weights_q_w <-
neighbor_list_q %>%
nb2listw(style = "W")
neighbor_weights_q_w
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 34
## Number of nonzero links: 136
## Percentage nonzero weights: 11.76471
## Average number of links: 4
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 34 1156 34 19.43254 151.3604
# Moran's I
moran_values <- moran.test(india_moran$pop_density_india,
listw = neighbor_weights_q_w,
na.action=na.omit)
moran_values
##
## Moran I test under randomisation
##
## data: india_moran$pop_density_india
## weights: neighbor_weights_q_w
##
## Moran I statistic standard deviate = 2.4164, p-value = 0.007837
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.207391989 -0.030303030 0.009676153
# Local Moran's I
# Create weights object
queen_w <- queen_weights(india_moran)
summary(queen_w)
# Calculate LISA
lisa <- local_moran(queen_w,
india_moran["pop_density_india"],
permutations = 999,
significance_cutoff = 0.1,
seed = 5197,
cpu_threads = 6)
lisa_value <- lisa_values(lisa)
lisa_pvalue <- lisa_pvalues(lisa)
lisa_quad <- lisa_clusters(lisa)
shp <- cbind(india_moran,
lisa_value,
lisa_pvalue,
lisa_quad)
# Create Local Moran's Map
lisa_colors <- lisa_colors(lisa)
lisa_labels <- lisa_labels(lisa)
lisa_clusters <- lisa_clusters(lisa)
shp$lisa_quad <- factor(lisa_quad,
levels = c(0,1,2,3,4),
labels = c("Not significant",
"High-High",
"Low-Low",
"Low-High",
"High-Low"))
lisa_map <- ggplot(shp)+
geom_sf(aes(fill = lisa_quad))+
scale_fill_manual(values = lisa_colors,
name = "LISA Quadrant")+
labs(title = "LISA map")+
ggspatial::annotation_north_arrow(height = unit(0.5, "cm"),
width = unit(0.5, "cm"),
location = "tl")+
ggspatial::annotation_scale(height = unit(0.15, "cm"),
location = "br")
lisa_map
There are multiple statistical tests which can be used to measure spatial clustering of point data. It is based on either the locations, distance or buffer matrix for global estimates and scan circles for local associations.
Focused tests are used to establish inference regarding role of a source (eg. factory, pond, etc) in causation of disease. It is hypothesized that the disease occurrence is higher among population groups closer to the source as compared to those staying at a far away distance from the source.
Epidemiological disease investigations should include an assessment of the spatial variation of disease risk, as this may provide important clues leading to causal explanations. Spatial variation estimation methods broadly includes:
Smoothing methods.
Interpolation.
When we undertake classic linear regression, one of the assumptions include that the independent variables should be independent of each other. However, in case the variables are related spatially, the assumption is voilated. It is therefore recommended that at the outset, in the exploratory phase of any study, we should look for spatial correlation. In case, it is not present, the researcher can proceed with non spatial regression models, however, if present, Geographically Weighted Regression (GWR) should be undertaken.
If you feel like your data has spatial dependence, then the following steps should be carried out:
Inspect your data, construct an appropriate spatial neighborhood, and assign weights.
Run an OLS regression with your variables of interest. This regression assumes spatial independence.
Inspect results and plot your residuals.
Compute the Moran’s I value for the residuals.
If you find significant spatial correlation, do a GWR, else proceed with non spatial models.
It is expected that the locations where the exposure is high should have higher leukemia rates. In this case, based on our previous understanding, it seems obvious, however, many times we may not be aware of underlying spatial associations and ignore the spatial aspect. Lets see what happens!!
df_gwr |>
ggplot()+
geom_point(
aes(x = PEXPOSURE, y = Z, color = AREANAME)
) +
theme(legend.position = "none") +
labs(x = "Exposure",
y = "Leukemia rate",
title = "Scatter plot: Association of Exposure and Leukemia rate") +
geom_smooth(aes(x = PEXPOSURE, y = Z),
method = lm)
model <- lm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME,
data = df_gwr)
summary(model)
##
## Call:
## lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = df_gwr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7417 -0.3957 -0.0326 0.3353 4.1398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.51728 0.15856 -3.262 0.00124 **
## PEXPOSURE 0.04884 0.03506 1.393 0.16480
## PCTAGE65P 3.95089 0.60550 6.525 3.22e-10 ***
## PCTOWNHOME -0.56004 0.17031 -3.288 0.00114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6571 on 277 degrees of freedom
## Multiple R-squared: 0.1932, Adjusted R-squared: 0.1844
## F-statistic: 22.1 on 3 and 277 DF, p-value: 7.306e-13
These results suggest that PEXPOSURE is not significant, but census variables are (home ownership and percent oldies). It’s also useful to glance at R-squared estimates. It gives a sense of how well model fits the data… it tells us how much variation we’ve explained out of the total variation in the model.
In the model above, the residuals should not be spatially related. Lets have a look!
df_gwr <- df_gwr |>
mutate(residuals = model$residuals)
# Transform df_gwr to sf object
df_gwr <- st_as_sf(df_gwr,
coords = c(lon = "X",
lat = "Y"))
Using neighbor list from polygon list and Spatial weights
neighbor_weights_q_w <- spData::listw_NY
df_gwr |>
pull(residuals) |>
moran.test(listw = neighbor_weights_q_w)
##
## Moran I test under randomisation
##
## data: pull(df_gwr, residuals)
## weights: neighbor_weights_q_w
##
## Moran I statistic standard deviate = 2.4457, p-value = 0.007229
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.083090278 -0.003571429 0.001255603
model <- lagsarlm(formula=Z~PEXPOSURE+PCTAGE65P+PCTOWNHOME,data=df_gwr,listw=neighbor_weights_q_w,quiet=T)
summary(model)
##
## Call:lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = df_gwr,
## listw = neighbor_weights_q_w, quiet = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.586752 -0.391580 -0.022469 0.338017 4.029430
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.514495 0.156154 -3.2948 0.000985
## PEXPOSURE 0.047627 0.034509 1.3801 0.167542
## PCTAGE65P 3.648198 0.599046 6.0900 1.129e-09
## PCTOWNHOME -0.414601 0.169554 -2.4453 0.014475
##
## Rho: 0.038893, LR test value: 6.9683, p-value: 0.0082967
## Asymptotic standard error: 0.015053
## z-value: 2.5837, p-value: 0.0097755
## Wald statistic: 6.6754, p-value: 0.0097755
##
## Log likelihood: -275.2447 for lag model
## ML residual variance (sigma squared): 0.41166, (sigma: 0.6416)
## Number of observations: 281
## Number of parameters estimated: 6
## AIC: 562.49, (AIC for lm: 567.46)
## LM test for residual autocorrelation
## test value: 1.4633, p-value: 0.22641
As expected, spatial auto-regressive parameter \((\rho)\) is highly significant (p = .008). Test for residual autocorrelation is not significant, which suggests that we cannot reject the null of randomly distributed residuals.
df_gwr |>
pull(residuals) |>
moran.test(listw = neighbor_weights_q_w)
##
## Moran I test under randomisation
##
## data: pull(df_gwr, residuals)
## weights: neighbor_weights_q_w
##
## Moran I statistic standard deviate = 2.4457, p-value = 0.007229
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.083090278 -0.003571429 0.001255603