GIS concepts

Basic Definitions

Spatial data is a term we use to describe any data which includes information on the locations and shapes of geographic features and the relationships between them, usually stored as coordinates and topology. Spatial data can be used to describe, for example:

Locations of points of interest (e.g. water bodies, households, health facilities) Discrete events and their locations (e.g. malaria cases, crime spots, floods) Continuous surfaces (e.g. elevation, temperature, rainfall) Areas with counts or rates aggregated from individual-level data (e.g. disease prevalence rates, population census, etc.) Spatial data can often be categorised into vector data or raster data.

Vector data is a representation of the world using points, lines, and polygons. This data is created by digitizing the base data, and it stores information in x, y coordinates. Vector data is one of two primary types of spatial data in geographic information systems (GIS) – the other being raster data. Example: sea, lake, land parcels, countries and their administrative regions, etc.

Points: Each individual point is defined by a single x, y coordinate. There can be many points in a vector point file. Examples of point data include: Health facility locations, household clusters.

Lines: Lines are composed of many (at least 2) vertices, or points, that are connected. For instance, a road or a stream may be represented by a line. This line is composed of a series of segments, each “bend” in the road or stream represents a vertex that has defined x, y location.

Polygons: A polygon consists of 3 or more vertices that are connected and “closed”. Thus the outlines of plot boundaries, lakes, oceans, and states or countries are often represented by polygons. Occasionally, a polygon can have a hole in the middle of it (like a doughnut), this is something to be aware of.

Raster data consists of a matrix of cells (or pixels) organized into rows and columns (or a grid) where each cell contains a value representing information, such as temperature. Rasters are digital aerial photographs, imagery from satellites, digital pictures, or even scanned maps.

Both types of data structures can either hold discrete or continuous information and can be layered upon each other and interact. An example of discrete data would be data defining the spatial extent of water bodies, built up areas, forests, or locations of health facilities. In contrast, continuous spatial data would be used to record quantities, such as elevation, temperature, population, etc.

Analysis of spatial data supports explanation of certain phenomena and solving problems which are influenced by components which vary in space, or both space and time. Two important properties of geospatial data are, for example, that:

Observations in geographic space tend to be dependent on other geographic factors. For example, variations in disease rates within a country may be due to variations in population density, age, socioeconomic factors, disease vector densities, temperature or a combination of these and other factors.

Observations that are geographically close to each other tend to be more alike than those which are further apart.

Its important to have layers of spatial data all be on the same coordinate reference system.

Coordinate Reference System

When dealing with spatial data we often are dealing with topology which allows us to describe the information geographically. A coordinate reference system (CRS) refers to the way in which this spatial data that represent the earth’s surface (which is round / 3 dimensional) are flattened so that you can “Draw” them on a 2-dimensional surface (e.g. your computer screens or a piece of paper). There are many ways which we can ‘flatten’ the earth, each using a different (sometimes) mathematical approach to performing the flattening resulting in different coordinate system grids: most commonly the Geographic Coordinate System (GCS) and Projected Coordinate System (PCS). These approaches to flattening the data are specifically designed to optimize the accuracy of the data in terms of length and area and normally are stored in the CRS part of the shapefile in R. This is especially important when dealing with countries that are further away from the equator. Here is a fun little video to showcase how coordinate reference systems can really change/distort the reality of the actual geography! Geographic Coordinate System A geographic coordinate system is a reference system used to pin point locations onto the surface of the Earth. The locations in this system are defined by two values: latitude and longitude. These values are measured as angular distance from two planes which cross the center of the Earth - the plane defined by the Equator (for measuring the latitude coordinate), and the plane defined by the Prime Meridian (for measuring the longitude coordinate).

A Geographic Coordinate System (GCS) is defined by an ellipsoid, geoid and datum.

The use of the ellipsoid in GCS comes from the fact that the Earth is not a perfect sphere. In particular, the Earth’s equatorial axis is around 21 km longer than the prime meridian axis. The geoid represents the heterogeneity of the Earth’s surface resulting from the variations in the gravitational pull. The datum represents the choice of alignment of ellipsoid (or sphere) and the geoid, to complete the ensemble of the GCS.

It is important to know which GCS is associated with a given spatial file, because changing the choice of the GCS (i.e. the choice of ellipsoid, geoid or datum) can change the values of the coordinates measured for the same locations. Therefore, if our GCS is not properly defined, it could lead to misplacement of point locations on the map.

Projected Coordinate System The Projected Coordinate System (PCS) is used to represent the Earth’s coordinates on a flat (map) surface. The PCS is represented by a grid, with locations defined by the Cartesian coordinates (with x and y axis). In order to transform coordinates from the GCS to PCS, mathematical transformations need to be applied, hereby referred to as projections.

Vector data in R using sf Vector data are composed of discrete geometric locations (x,y values) known as vertices that define the “shape” of the spatial object. The organization of the vertices determines the type of vector that you are working with: point, line or polygon. Typically vector data is stored in shapefiles and within R can be called in using several different packages, most common being sp and sf. For the purpose of this material we will focus of teaching you the package sf as it is intended to succeed and replace R packages sp, rgeos and the vector parts of rgdal packages.

## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
## Loading required package: ggplot2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Registered S3 method overwritten by 'malariaAtlas':
##   method           from   
##   autoplot.default ggplot2

The sp package (spatial) provides classes and methods for spatial (vector) data; the classes document where the spatial location information resides, for 2D or 3D data. Utility functions are provided, e.g. for plotting data as maps, spatial selection, as well as methods for retrieving coordinates, for subsetting, print, summary, etc.

The sf package (simple features = points, lines, polygons and their respective ‘multi’ versions) is the new kid on the block with further functions to work with simple features, a standardized way to encode spatial vector data. It binds to the packages ‘GDAL’ for reading and writing data, to ‘GEOS’ for geometrical operations, and to ‘PROJ’ for projection conversions and datum transformations.

For the time being, it is best to know and use both the sp and the sf packages, as discussed in this post. However, we focus on the sf package. for the following reasons:

sf ensures fast reading and writing of data sf provides enhanced plotting performance sf objects can be treated as data frames in most operations sf functions can be combined using %>% operator and works well with the tidyverse collection of R packages. sf function names are relatively consistent and intuitive (all begin with st_) However, in some cases we need to transform sf objects to sp objects or vice versa. In that case, a simple transformation to the desired class is necessary: To sp object <- as(object, Class = “Spatial”)

To sf object_sf = st_as_sf(object_sp, “sf”)

A word of advice: be flexible in the usage of sf and sp. Sometimes it may be hard to explain why functions work for one data type and do not for the other. But since transformation is quite easy, time is better spent on analyzing your data than on wondering why operations do not work.

Shapefiles: Points, Lines, and Polygons Geospatial data in vector format are often stored in a shapefile format. Because the structure of points, lines, and polygons are different, each individual shapefile can only contain one vector type (all points, all lines or all polygons). You will not find a mixture of point, line and polygon objects in a single shapefile.

Objects stored in a shapefile often have a set of associated attributes that describe the data. For example, a line shapefile that contains the locations of streams, might contain the associated stream name, stream “order” and other information about each stream line object.

Shapefiles often have many file types associated to it. Each of these files continains valuable information. The most important file is the .shp file which is the main file that stores the feature geometries. .shx is an index file to connect the feature geometry and .dbf is a dBASE file that stores all the attribute information (like a table of information). These three files are required for plotting vector data. Often times you might also get additional useful files such as the .prj which stores the coordination system information.

Importing shapefiles into R Shapefiles can be called in to R using the function st_read(). Similarly to read_csv() we include a filepath to a shapefile. In this instance we would load the part of the shapefile that ends with .shp

Polygons

## Reading layer `TZ_admin1' from data source 
##   `C:\Users\Allan\OneDrive\Documents\R training\ammnet-hackathon\03_mapping-r\data\shapefiles\TZ_admin1.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 36 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 29.3414 ymin: -11.7612 xmax: 40.4432 ymax: -0.9844
## Geodetic CRS:  WGS 84

You’ll notice that when you load the shapefile in, there will be a set of information explaining the features of the shapefile. The first sentence shows that you have loaded a ESRI shapefile, it contains 36 features (which are polygons in this case) and 17 columns of information stored as a data tabll. It mentions also there is a spatial extent (called bounding box) and the coordinate reference system (CRS).

You can also get this information when you simply call the sf object

## Simple feature collection with 36 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 29.3414 ymin: -11.7612 xmax: 40.4432 ymax: -0.9844
## Geodetic CRS:  WGS 84
## First 10 features:
##    iso admn_lv   name_0     id_0  type_0          name_1      id_1     type_1
## 1  TZA       1 Tanzania 10001003 Country          Arusha  10316307     Region
## 2  TZA       1 Tanzania 10001003 Country        Morogoro  10313017     Region
## 3  TZA       1 Tanzania 10001003 Country Lake Tanganyika 604100824 Water Body
## 4  TZA       1 Tanzania 10001003 Country   Lake Victoria 604100823 Water Body
## 5  TZA       1 Tanzania 10001003 Country           Lindi  10313079     Region
## 6  TZA       1 Tanzania 10001003 Country         Manyara  10313760     Region
## 7  TZA       1 Tanzania 10001003 Country            Mara  10314956     Region
## 8  TZA       1 Tanzania 10001003 Country           Mbeya  10314026     Region
## 9  TZA       1 Tanzania 10001003 Country Mjini Magharibi  10313694     Region
## 10 TZA       1 Tanzania 10001003 Country          Mtwara  10316433     Region
##    name_2 id_2 type_2 name_3 id_3 type_3            source cntry_l
## 1      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 2      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 3      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 4      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 5      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 6      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 7      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 8      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 9      NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
## 10     NA   NA     NA     NA   NA     NA Tanzania NBS 2020   TZA_1
##                          geometry
## 1  MULTIPOLYGON (((35.2613 -1....
## 2  MULTIPOLYGON (((38.2698 -6....
## 3  MULTIPOLYGON (((31.2069 -8....
## 4  MULTIPOLYGON (((31.7822 -0....
## 5  MULTIPOLYGON (((39.5821 -9....
## 6  MULTIPOLYGON (((35.3626 -3....
## 7  MULTIPOLYGON (((33.3328 -2....
## 8  MULTIPOLYGON (((33.031 -6.8...
## 9  MULTIPOLYGON (((39.3401 -6....
## 10 MULTIPOLYGON (((40.4279 -10...

The snapshot above would likely be similar to what you would see in your console. You’ll notice that it would display in your environment as a table, but the difference between this and any other table is that it includes additional information in the metadata indicating its spatial features. In this case it is important to read in and check the metadata for the type of spatial information you have loaded.

Point data Now that we have in some polygon information, let’s try load a different type of vector data - points. Often times in public health space we would receive data that might represent points on a map (e.g. cases at a health facility, bed nets used in a village). This style of data would include GPS coordinates of longitude and latitude to help us know geographically where they are located.

We are going to pull some prepared MIS data into R. For alternatives I would suggest checking out the malaria prevalence data using the malariaAtlas package in R.

## Rows: 436 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (9): country, country_id, continent, site_name, rural_urban, method, r...
## dbl  (15): id, site_id, latitude, longitude, month_start, year_start, month_...
## lgl   (2): pcr_type, source_id3
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Make the gps data into a point simple feature

You’ll notice that when you brought in the data it does not display the spatial information and came in as a table, so then we may want to convert these into spatial information using the st_as_sf() function. The data we’ve pulled is all the public available data from the DHS program for Tanzania, we’ve pre-processed this specifically for prevalence information only. It includes the latitude and longitude information. When converting the table to a simple feature it must have complete information (i.e. no missing coordinates), we would recommend if you get data from elsewhere make sure to check there is no missingness. Note that we set the projection for the spatial object using the crs command; crs 4326 is the standard WGS 84 CRS.

Plotting the shapefiles in ggplot2 Now that we have some shapefiles let’s try make a plot of them both. We can use the geom called geom_sf() with the ggplot funtions to make a plot. This was explored in the previous hackathon on data visualization

We now have a plot of the map of Tanzania and its regions. Like other ggplots we can continue to add points to this map. #Plot tanzania and prevalence using shapefiles

Note that when you move the data into the geom_sf() it must include a mapping= aes() argument. If you don’t have any specific information you want to display and just want the map, you can say you would like it to map to the geometry .

Joining data to shapefiles We often want to join data to shapefiles to enable the creation of maps and to analyse spatial data. You can use the join functions in tidysverse to join sf data to tables. Just note that the sf data must go first to automatically be recognised as sf. Else you would need to reset it using st_as_sf() .

We’re gonna bring in population data by Region to join to the regional shapefile. However, before we can join the data, the names for the regions don’t match. In order for a correct join the key column must be exactly the same in both datasets. So we must make a key column called name_1 to match the shapefile. We’ll be using the tricks we learnt previous hackathons like the function str_to_sentence() from the stringr package of tidyverse.

## Rows: 31 Columns: 59
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): ADM0_EN, ADM0_PCODE, ADM1_EN, ADM1_PCODE
## dbl (55): Year, T_TL, T_00_04, T_05_09, T_10_14, T_15_19, T_20_24, T_25_29, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#check if names in the columns match

## 
## FALSE  TRUE 
##     5    31

Now we can join the data to the current shapefile. Don’t worry that they currently don’t match! we have a plan ;)

Extracting polygon names for point data coming back to our prevalence data, often times for decision making we might want data to be summarised to regional forms. Currently the point prevalence doesn’t have any information about which region they belong to. So we can using the st_join . By default this spatial join will look for where the geometries of the spatial data’s intersect using st_interesect . It also by default uses a left join which is why the points come first.

## Spherical geometry (s2) switched off
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar

from this we can try and calculate the mean prevalence in each region

## although coordinates are longitude/latitude, st_union assumes that they are
## planar
## Joining with `by = join_by(name_1)`

Buffers Sometimes you might want to create a buffer around the spatial information you have. Let’s try this on the point data. we’ll make a 20km buffer around each spatial point. We can use the st_buffer() to create this.

## dist is assumed to be in decimal degrees (arc_degrees).

Buffers are especially useful when you combine them with rasters and want to extract a summarised value of pixels for a small area.

Note that you’ll have gotten a warning that buffers don’t work correctly with longitude/latitude data. Remember the coordinate reference systems - these are important! the unit for the CRS we use WGS84 is decimal degrees and assumes an ellipsoidal (round) world still. However, the maps we create are flat. For best results you should think about changing the coordinate reference system to UTM (Universal Transverse Mercator) that assumed a flat plan and uses units of meters.

Shapefile projections Let’s look at our current CRS. You can view the CRS using the command st_crs()

## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         MEMBER["World Geodetic System 1984 (G2296)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

If these projections say don’t match or aren’t in the correct coordinate system for our analysis we can change the projection using the command st_transform(). # Change the projection to UTM zone 35S

Making publication style maps for publishing maps typically you’ll find that maps require a bit more information, such as north compass, scale bars and appropriate legends. We can use ggspatial package for that.

We might want to add labels of the regions on as well, we can do this using ggreppel

Writing out plots and shapefiles Once we’ve completed out image we might want to save it. We can use ggsave() to do so. Note that if you don’t specific the plot object it will save the last image you created in the plot window

## Saving 7 x 5 in image

Interactive maps using tmap So far we’ve explored the use of the sf package with ggplot as well as some additions like ggspatial. Another useful package that can help build interactive maps in tmap. The tmappackage is a powerful tool for creating thematic maps in R. It provides an intuitive and flexible way to visualize spatial data, similar to how ggplot2 works for general data visualization.

Like ggplot, the package tmap also building maps using layers. You start with tm_shape() to define the data, then add layers with various tm_*() functions.

Let’s try recreate our first map of tanzania regions:

Similarly if we wanted to showcase data in each polygon we could use this for the population data we summarised:

## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "pretty"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n', 'palette' (rename to 'values'),
##   'colorNA' (rename to 'value.na'), 'textNA' (rename to 'label.na') to
##   'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'

we might want to further customise things like putting the legend outside and add labels to the regions:

## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "pretty"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n', 'palette' (rename to 'values'),
##   'colorNA' (rename to 'value.na'), 'textNA' (rename to 'label.na') to
##   'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'

but the best part of tmap is that it can make your maps interactive. To switch to interactive mode:

## ℹ tmap mode set to "view".
## 
## 
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## 
## [v3->v4] `tm_polygons()`: instead of `style = "pretty"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n', 'palette' (rename to 'values'),
##   'colorNA' (rename to 'value.na'), 'textNA' (rename to 'label.na') to
##   'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'