Wednesday 21st August 2019

Last updated : 2025-12-09

1 Introduction

Space and place are recognised as an important facet of epidemiology and health data science to model exposures and communicate geographically varying results. Understanding spatial data formats and how it can be used in health data science is a useful skill. This workshop will introduce the main spatial data types and how to handle use them R. You will use Open Source software and Open Data to explore and visualise spatial variations in road traffic accidents in Scotland.

1.1 Software

To perform this exploratory data analysis we will be using R (https://www.r-project.org/) and R Studio (https://www.rstudio.com/products/rstudio/#Desktop). R and R Studio are platform independent suite of software which allow us to build geospatial models, perform statistical analysis and produce reproducible research in one environment. You will need to download and install these software on your machine to continue with this tutorial.

1.2 R and Spatial Data Libraries

There are hundreds of R packages related to spatial data functionality. However, most typically use three main packages sp, sf and raster. For this tutorial we will be usig the sf package (which references the sp package) as it has been written with tidy data (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) principles in mind:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table

spatial packages in R

You can fin a useful cheatsheet for the sf library here.

1.2.1 Load the R libraries into your environment

If you want to run through the examples in this introduction you will need to load the following libraries in a new R script and downloaded the project folder as detailed in the Project Setup section. A more detailed example of how we can use spatial data in relation to injuries can be followed later.

Please remember to run the DataAquisition.R script

pkgs = c("ggplot2", "DT", "tidyverse", "sf", "raster", "tmap", "raster", "reshape2",
    "gtools", "gstat", "RColorBrewer", "classInt")
# check installed
new.packages <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")

# load libraries
lapply(pkgs, library, character.only = T)

# load scotland shape for intro
scotland <- st_read("./Raw Data/Spatial Data/National Boundary/unzipped/scotland_ol_2001.shp",
    quiet = TRUE)

# check libraries loaded correctly
plot(st_geometry(scotland), axes = T, graticule = T)

2 GIS Data Types

There are two main types of spatial data: Vector and Raster. We briefly describe them below but for a more comprehensive overview of spatial data in R please refer to Chapter 2 in GeoComputation with R by Robin Lovelace, Jakub Nowosad, Jannes Muenchow found here .

2.1 Vector Data

Vector data are a series of points, lines and polygons which represent geographic features found on the earths surface. Points are the building blocks of vector data, a polyline is made up of two or more points (also known as nodes) and a polygon made up of series of points which have the same start and end point to enclose an area. Multi- points/lines/polygons are groups of alike features with a geometry collection representing a multitype feature.

Vector Data Types - GeoCompuation in R Lovelace et. al 2019

Along with a reference to a spatial location Vector data has the capabilities to store attribute information associated with a spatial location. In R terms think of a DataFrame where one or more columns references a spatial location either via a coordinate location, linkage field or spatial object type.

Typically, vector data are stored in a database format with the most common format called Shapefile - a set of files based on the Microsoft Access Database format. We will be using Shapefiles and another format GeoPackage in this workshop.

2.2 Raster Data

Raster Data are different to Vector Data in that the underlying data model is a matrix of equally spaced cells with a single value attached to the cell. The Raster data format is very efficient data format which makes it ideal for storing and processing continuous phenomena such as elevation, temperature or satellite data. Raster data can also be classified into discrete categorical groupings, typically to represent phenomena of a similar type.

Typically, raster data is stored in a image format such as .tif, .jpg, .png or GIS raster formats such as .adf. We will be generating raster data as part of this workshop and saving it as a .tif file.

# read in raster data
srtm_raster <- raster("./Raw Data/Spatial Data/Raster/unzipped/uk_srtm/w001001.adf")
# map raster
plot(srtm_raster)

srtm

2.3 Combining Vector and Raster Data

There are occasions when you want to combine raster and vector data, for example a raster background map for some vector data or to constrain one dataset by the extent of another (known as clipping or cookie cutting). To demonstrate this we can crop the raster we loaded to the extent of Scotland:

# Depending on the power of your laptop this may take some time!

srtm_cropped = crop(srtm_raster, as(scotland, "Spatial"))
# reproject to WGS 84
srtm_cropped = projectRaster(srtm_cropped, crs = "+init=epsg:4326")
# plot in wgs84 coord system as tidier
plot(srtm_cropped, axes = T)

srtm cropped

We can also extract data from the raster and convert it into a vector format, for example contouring the terrain model:

contours <- st_as_sf(rasterToContour(srtm_cropped))
# plot(srtm_cropped)
plot(st_geometry(contours), graticule = TRUE, col = topo.colors(10), axes = TRUE)

contours

This function has extracted height information contained in the raster and created a DataFrame:

glimpse(contours)
head(contours)

We can also add reference points to a raster background:

p1 = st_point(c(-2.8, 56.3))
# plot terrain
plot(srtm_cropped)
# add point
plot(st_geometry(p1), add = TRUE, col = "Red", cex = 1, axes = T, graticule = T)
# add label
text(x = -2.8, y = 56.3, "St. Andrews", cex = 0.5, pos = 4)

St Andrews

3 Using R as a GIS

Displaying spatial data in R is often the output some other analysis which has been conducted. One of the most powerful aspects of GIS is the ability to combine spatial and non spatial data to gain new understanding and insight. The rest of this workshop will describe how we can use R as a GIS to perform some basic analysis of road traffic accident data.

3.1 Aims

The aim is to demonstrate how we can use R to handle open data to model exposure to road traffic accidents using small area geographies. Specifically, we want to explore four different models of exposure and map the outputs to visualise the differences. These models can be broadly split into two categories:

Basic injury rates
- Injury rate by population
- Injury rate per km of road
Adjusted injury rates
- Injury rate adjusted by population demographics
- Injury rate adjusted by road type

3.2 Project Setup

Once you have installed R and R Studio we need to acquire the data that will allow us to explore the aims above. Download the r-project zip file from here and unzip.

In the scripts folder you will find a script (DataAquisition.R) to download the data and pre-process it. Run the DataAquisition.R file to setup directories and download and extract the data into the right folders for this tutorial.

3.3 Data Preparation

The first step is to load the data into R for analysis. Create a new script (Ctrl+Shift+N on Windows, Command+Shift+N on Mac) - and use the following code to load your data into R.

3.3.1 Load the R libraries into your environment

pkgs = c("ggplot2", "DT", "tidyverse", "sf", "raster", "tmap", "raster", "reshape2",
    "gtools", "gstat", "RColorBrewer", "classInt")
# check installed
new.packages <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")


# load libraries
lapply(pkgs, library, character.only = T)

To check the packages have loaded properly draw a quick map of Scotland:

# read in shapefile
scotland <- st_read("./Raw Data/Spatial Data/National Boundary/unzipped/scotland_ol_2001.shp",
    quiet = TRUE)
# plot map
plot(st_geometry(scotland), graticule = TRUE, key.pos = NULL, axes = TRUE)

We have loaded this data using the sf::st_read() function, this means that the spatial has been loaded using an R friendly tidyverse data format - so we can use regular R functions on the (spatial) DataFrame.

glimpse(scotland)

## Rows: 116
## Columns: 3
## $ name     <chr> "Scotland", "Scotland", "Scotland", "Scotland", "Scotland", "…
## $ label    <chr> "Scotland", "Scotland", "Scotland", "Scotland", "Scotland", "…
## $ geometry <POLYGON [m]> POLYGON ((251271 968408, 25..., POLYGON ((148736.7 83…

head(scotland)

## Simple feature collection with 6 features and 2 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 95459.88 ymin: 530297 xmax: 452609 ymax: 1195119
## Projected CRS: OSGB36 / British National Grid
##       name    label                       geometry
## 1 Scotland Scotland POLYGON ((251271 968408, 25...
## 2 Scotland Scotland POLYGON ((148736.7 830542.9...
## 3 Scotland Scotland POLYGON ((148348 931954, 14...
## 4 Scotland Scotland POLYGON ((432850 1180913, 4...
## 5 Scotland Scotland POLYGON ((130685 659791, 13...
## 6 Scotland Scotland POLYGON ((143907.3 671573.7...

3.4 Creating simple risk maps in R

The aim of this next section is to show you how to use the spatial libraries in R to create some simple maps of accident rates and risk using open data for Scotland. We will be using: - Stats19 data - OS Open Roads Data - Scottish Data Zones and census data.

3.5 Linking spatial and non spatial data

A lot of data doesn’t contain an explicit spatial reference (e.g. coordinates) but instead contains linking fields. These linking fields allow you to link data to a pre-defined location or boundary. As health data scientists the vast majority of our work relates to improving the health and wellbeing of a population, therefore it is important to be able to link populations to a geographic location (disclosure controls permitting) when we want to model exposure. For this exercise we will be linking 2011 census age and gender data to small area geographies known as Data Zones in Scotland (LSOA’s in England and Wales). The linkage uses a common reference code found in both the spatial data and the population statistics.

## Load 2017 Stats 19 data

# load stats 19 accident into a data frame
Stats19.accidents <- read.csv(file = "./Raw Data/Stats 19/unzipped/Acc.csv", fileEncoding = "UTF-8-BOM")

# load stats 19 casualty data into a dataframe
Stats19.casualties <- read.csv(file = "./Raw Data/Stats 19/unzipped/Cas.csv", fileEncoding = "UTF-8-BOM")

## Load census data
census.data <- read.csv("./Raw Data/Census/unzipped/Data_AGE_SEX_UNIT.csv", fileEncoding = "UTF-8-BOM")
glimpse(census.data)

## Rows: 6,976
## Columns: 78
## $ CDU_ID                                                    <int> 346437, 3464…
## $ GEO_CODE                                                  <chr> "S01006506",…
## $ GEO_LABEL                                                 <chr> "Culter - 01…
## $ GEO_TYPE                                                  <chr> "Lower Super…
## $ GEO_TYP2                                                  <chr> "LSOADZ", "L…
## $ Age...Age.30.to.34...Sex...Total..Sex...Unit...Persons    <int> 69, 46, 67, …
## $ Age...Age.30.to.34...Sex...Males...Unit...Persons         <int> 29, 21, 34, …
## $ Age...Age.30.to.34...Sex...Females...Unit...Persons       <int> 40, 25, 33, …
## $ Age...Age.40.to.44...Sex...Total..Sex...Unit...Persons    <int> 62, 62, 44, …
## $ Age...Age.40.to.44...Sex...Males...Unit...Persons         <int> 33, 29, 23, …
## $ Age...Age.40.to.44...Sex...Females...Unit...Persons       <int> 29, 33, 21, …
## $ Age...Age.35.to.39...Sex...Total..Sex...Unit...Persons    <int> 71, 49, 60, …
## $ Age...Age.35.to.39...Sex...Males...Unit...Persons         <int> 37, 20, 32, …
## $ Age...Age.35.to.39...Sex...Females...Unit...Persons       <int> 34, 29, 28, …
## $ Age...Age.45.to.49...Sex...Total..Sex...Unit...Persons    <int> 80, 66, 68, …
## $ Age...Age.45.to.49...Sex...Males...Unit...Persons         <int> 36, 30, 34, …
## $ Age...Age.45.to.49...Sex...Females...Unit...Persons       <int> 44, 36, 34, …
## $ Age...Age.5.to.9...Sex...Total..Sex...Unit...Persons      <int> 30, 32, 27, …
## $ Age...Age.5.to.9...Sex...Males...Unit...Persons           <int> 13, 16, 15, …
## $ Age...Age.5.to.9...Sex...Females...Unit...Persons         <int> 17, 16, 12, …
## $ Age...Age.50.to.54...Sex...Total..Sex...Unit...Persons    <int> 66, 55, 50, …
## $ Age...Age.50.to.54...Sex...Males...Unit...Persons         <int> 32, 24, 28, …
## $ Age...Age.50.to.54...Sex...Females...Unit...Persons       <int> 34, 31, 22, …
## $ Age...Age.55.to.59...Sex...Total..Sex...Unit...Persons    <int> 62, 67, 54, …
## $ Age...Age.55.to.59...Sex...Males...Unit...Persons         <int> 42, 36, 27, …
## $ Age...Age.55.to.59...Sex...Females...Unit...Persons       <int> 20, 31, 27, …
## $ Age...Age.65.to.69...Sex...Total..Sex...Unit...Persons    <int> 38, 49, 23, …
## $ Age...Age.65.to.69...Sex...Males...Unit...Persons         <int> 16, 26, 13, …
## $ Age...Age.65.to.69...Sex...Females...Unit...Persons       <int> 22, 23, 10, …
## $ Age...Age.70.to.74...Sex...Total..Sex...Unit...Persons    <int> 29, 45, 10, …
## $ Age...Age.70.to.74...Sex...Males...Unit...Persons         <int> 13, 18, 6, 1…
## $ Age...Age.70.to.74...Sex...Females...Unit...Persons       <int> 16, 27, 4, 8…
## $ Age...Age.75.to.79...Sex...Total..Sex...Unit...Persons    <int> 22, 31, 11, …
## $ Age...Age.75.to.79...Sex...Males...Unit...Persons         <int> 4, 14, 3, 7,…
## $ Age...Age.75.to.79...Sex...Females...Unit...Persons       <int> 18, 17, 8, 1…
## $ Age...Age.80.to.84...Sex...Total..Sex...Unit...Persons    <int> 22, 17, 5, 2…
## $ Age...Age.80.to.84...Sex...Males...Unit...Persons         <int> 8, 7, 0, 10,…
## $ Age...Age.80.to.84...Sex...Females...Unit...Persons       <int> 14, 10, 5, 1…
## $ Age...Age.90.to.94...Sex...Total..Sex...Unit...Persons    <int> 4, 0, 1, 6, …
## $ Age...Age.90.to.94...Sex...Males...Unit...Persons         <int> 2, 0, 0, 2, …
## $ Age...Age.90.to.94...Sex...Females...Unit...Persons       <int> 2, 0, 1, 4, …
## $ Age...Total..Age...Sex...Total..Sex...Unit...Persons      <int> 872, 836, 64…
## $ Age...Total..Age...Sex...Males...Unit...Persons           <int> 416, 398, 32…
## $ Age...Total..Age...Sex...Females...Unit...Persons         <int> 456, 438, 31…
## $ Age...Age.0.to.4...Sex...Total..Sex...Unit...Persons      <int> 62, 38, 36, …
## $ Age...Age.0.to.4...Sex...Males...Unit...Persons           <int> 22, 20, 21, …
## $ Age...Age.0.to.4...Sex...Females...Unit...Persons         <int> 40, 18, 15, …
## $ Age...Age.15...Sex...Total..Sex...Unit...Persons          <int> 2, 13, 3, 4,…
## $ Age...Age.15...Sex...Males...Unit...Persons               <int> 1, 9, 2, 2, …
## $ Age...Age.15...Sex...Females...Unit...Persons             <int> 1, 4, 1, 2, …
## $ Age...Age.20.to.24...Sex...Total..Sex...Unit...Persons    <int> 48, 37, 46, …
## $ Age...Age.20.to.24...Sex...Males...Unit...Persons         <int> 24, 16, 19, …
## $ Age...Age.20.to.24...Sex...Females...Unit...Persons       <int> 24, 21, 27, …
## $ Age...Age.25.to.29...Sex...Total..Sex...Unit...Persons    <int> 83, 33, 54, …
## $ Age...Age.25.to.29...Sex...Males...Unit...Persons         <int> 42, 14, 27, …
## $ Age...Age.25.to.29...Sex...Females...Unit...Persons       <int> 41, 19, 27, …
## $ Age...Age.60.to.64...Sex...Total..Sex...Unit...Persons    <int> 53, 88, 37, …
## $ Age...Age.60.to.64...Sex...Males...Unit...Persons         <int> 23, 43, 20, …
## $ Age...Age.60.to.64...Sex...Females...Unit...Persons       <int> 30, 45, 17, …
## $ Age...Age.85.to.89...Sex...Total..Sex...Unit...Persons    <int> 9, 7, 5, 5, …
## $ Age...Age.85.to.89...Sex...Males...Unit...Persons         <int> 5, 4, 1, 0, …
## $ Age...Age.85.to.89...Sex...Females...Unit...Persons       <int> 4, 3, 4, 5, …
## $ Age...Age.10.to.11...Sex...Total..Sex...Unit...Persons    <int> 8, 25, 11, 9…
## $ Age...Age.10.to.11...Sex...Males...Unit...Persons         <int> 3, 18, 7, 5,…
## $ Age...Age.10.to.11...Sex...Females...Unit...Persons       <int> 5, 7, 4, 4, …
## $ Age...Age.12.to.14...Sex...Total..Sex...Unit...Persons    <int> 21, 32, 14, …
## $ Age...Age.12.to.14...Sex...Males...Unit...Persons         <int> 14, 16, 8, 9…
## $ Age...Age.12.to.14...Sex...Females...Unit...Persons       <int> 7, 16, 6, 5,…
## $ Age...Age.16.to.17...Sex...Total..Sex...Unit...Persons    <int> 15, 24, 10, …
## $ Age...Age.16.to.17...Sex...Males...Unit...Persons         <int> 8, 11, 4, 4,…
## $ Age...Age.16.to.17...Sex...Females...Unit...Persons       <int> 7, 13, 6, 3,…
## $ Age...Age.18.to.19...Sex...Total..Sex...Unit...Persons    <int> 16, 19, 7, 1…
## $ Age...Age.18.to.19...Sex...Males...Unit...Persons         <int> 9, 6, 5, 8, …
## $ Age...Age.18.to.19...Sex...Females...Unit...Persons       <int> 7, 13, 2, 3,…
## $ Age...Age.95.and.over...Sex...Total..Sex...Unit...Persons <int> 0, 1, 0, 1, …
## $ Age...Age.95.and.over...Sex...Males...Unit...Persons      <int> 0, 0, 0, 0, …
## $ Age...Age.95.and.over...Sex...Females...Unit...Persons    <int> 0, 1, 0, 1, …
## $ X                                                         <lgl> NA, NA, NA, …

# Load Spatial Data

# load spatial data - may take some time depending on your laptop
scottish.datazones <- st_read("./Raw Data/Spatial Data/Data Zones 2011/unzipped/SG_DataZone_Bdry_2011.shp")

## Reading layer `SG_DataZone_Bdry_2011' from data source 
##   `/Users/richfry/Library/CloudStorage/OneDrive-SwanseaUniversity/HDR Summer School 2019/Raw Data/Spatial Data/Data Zones 2011/unzipped/SG_DataZone_Bdry_2011.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 6976 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 5547 ymin: 530252.8 xmax: 470288.5 ymax: 1220302
## Projected CRS: OSGB36 / British National Grid

scottish.datazones <- st_set_crs(scottish.datazones, 27700)

# to speed up rendering we need to simplify the geometry shapes - this should
# be used with caution as it could effect analysis for point in polygons - you
# can see the impact by changing the dTolernce value to 5000 - remember to
# reload the data from source once you have finished!
scottish.datazones <- st_simplify(scottish.datazones, preserveTopology = T, dTolerance = 100)

plot(st_geometry(scottish.datazones), col = NA, axes = T, graticule = T)

glimpse(scottish.datazones)

## Rows: 6,976
## Columns: 10
## $ DataZone   <chr> "S01006506", "S01006507", "S01006508", "S01006509", "S01006…
## $ Name       <chr> "Culter - 01", "Culter - 02", "Culter - 03", "Culter - 04",…
## $ TotPop2011 <int> 872, 836, 643, 580, 644, 751, 519, 648, 1043, 762, 540, 101…
## $ ResPop2011 <int> 852, 836, 643, 580, 577, 749, 496, 568, 940, 762, 467, 1014…
## $ HHCnt2011  <int> 424, 364, 340, 274, 256, 315, 170, 230, 339, 271, 185, 403,…
## $ StdAreaHa  <dbl> 438.880218, 22.349739, 27.019476, 9.625426, 18.007657, 40.0…
## $ StdAreaKm2 <dbl> 4.388801, 0.223498, 0.270194, 0.096254, 0.180076, 0.400487,…
## $ Shape_Leng <dbl> 11801.872, 2900.406, 3468.762, 1647.461, 3026.111, 4300.089…
## $ Shape_Area <dbl> 4388802.12, 221746.84, 270194.75, 96254.26, 180076.58, 4004…
## $ geometry   <POLYGON [m]> POLYGON ((383761.2 800775.3..., POLYGON ((383878.3 …

The glimpse function lets us quickly compare the datazones and census data - they should contain the same number of rows. Notice there is a column in each table which contains a unique identifier for each row.

# get cols for females and join census code back on
census.females <- census.data %>%
    as_tibble() %>%
    dplyr::select(contains("Females", ignore.case = F)) %>%
    cbind(., census.data$GEO_CODE)
# rename columns to something sensible
names(census.females) <- gsub("Age...Age.", "", names(census.females))
names(census.females) <- gsub("...Sex...Females...Unit...Persons", "", names(census.females))
# reorder
census.females <- census.females[, mixedsort(names(census.females))]
# rename last two cols
names(census.females)[length(names(census.females)) - 1] <- "TotalAge"
names(census.females)[length(names(census.females))] <- "DataZone"


# get cols for males and join census codes back on
census.males <- census.data %>%
    as_tibble() %>%
    dplyr::select(contains("Males", ignore.case = F)) %>%
    cbind(., census.data$GEO_CODE)
# rename columns to something sensible
names(census.males) <- gsub("Age...Age.", "", names(census.males))
names(census.males) <- gsub("...Sex...Males...Unit...Persons", "", names(census.males))
# reorder
census.males <- census.males[, mixedsort(names(census.males))]
# rename last two cols
names(census.males)[length(names(census.males)) - 1] <- "TotalAge"
names(census.males)[length(names(census.males))] <- "DataZone"


# get cols for all and join census codes back on
census.totals <- census.data %>%
    as_tibble() %>%
    dplyr::select(contains("Total", ignore.case = F)) %>%
    cbind(., census.data$GEO_CODE)
# rename columns to something sensible
names(census.totals) <- gsub("Age...Age.", "", names(census.totals))
names(census.totals) <- gsub("...Sex...Total..Sex...Unit...Persons", "", names(census.totals))
# reorder
census.totals <- census.totals[, mixedsort(names(census.totals))]
# rename last two cols
names(census.totals)[length(names(census.totals)) - 1] <- "TotalAgeMen"
names(census.totals)[length(names(census.totals)) - 2] <- "TotalAgeFemale"
names(census.totals)[length(names(census.totals)) - 3] <- "TotalAgeAll"
names(census.totals)[length(names(census.totals))] <- "DataZone"

3.6 Creating Road Summary Data

The UK has a number of road datasets which are available to us - we are using OS OpenRoads which details every road in Britain. First we need to load the roads data, and then clip the roads to each DataZone to allow us to create rates by areas. The following code shows you how to do it in R - you may want to skip to the next section - it takes a long time!

#### Warning this is a complex geometry operation and may take a long time -
#### you can skip to the load csv option if necessary### load roads for
#### Scotland
scottish.roads <- st_read(dsn = "./Raw Data/Spatial Data/OS Open Roads/Roads.gpkg",
    layer = "Roads")

## st_intersection to get the roads by DataZones
scottish.roads.datazones <- st_intersection(scottish.roads, scottish.datazones)
# add length field
scottish.roads.datazones$length <- st_length(scottish.roads.datazones)
# round length
scottish.roads.datazones$length <- round(scottish.roads.datazones$length, 2)

# check the intersection as worked
plot(st_geometry(scottish.datazones[scottish.datazones$DataZone == "S01006507", ]),
    col = "grey")
plot(st_geometry(scottish.roads), col = "Blue", add = T)
plot(st_geometry(scottish.roads.datazones[scottish.roads.datazones$DataZone == "S01006507",
    ]), col = "green", add = T)

# Group by datazone
roads.summary <- scottish.roads.datazones[, c("DataZone", "function.", "length")] %>%
    dcast(DataZone ~ function., value.var = "length", fun.aggregate = sum)

# save to csv
write.csv(roads.summary, file = "./Raw Data/Spatial Data/OS Open Roads/road_summary.csv")

Clipped Roads

# load road summary
scottish.roads <- read.csv(file = "./Raw Data/Spatial Data/OS Open Roads/road_summary.csv")

# create total column
scottish.roads <- scottish.roads %>%
    mutate(Total_km = rowSums(.[, 3:10], na.rm = T))
scottish.roads$Total_km <- round(scottish.roads$Total_km/1000, 2)
# replace NAs with 0
scottish.roads[is.na(scottish.roads)] <- 0

head(scottish.roads)

##   X  DataZone A.Road B.Road Local.Access.Road Local.Road Minor.Road Motorway
## 1 1 S01006506 620.34   0.00           1983.81    2413.74    4108.77        0
## 2 2 S01006507   0.00 371.28            312.27    2880.03       0.00        0
## 3 3 S01006508 432.39 541.05              0.00    1901.84     245.18        0
## 4 4 S01006509   0.00   0.00              0.00    1944.00       0.00        0
## 5 5 S01006510 176.64   0.00              0.00    2238.98      34.81        0
## 6 6 S01006511 636.06   0.00            575.59    2601.16       0.94        0
##   Restricted.Local.Access.Road Secondary.Access.Road Total_km
## 1                      4737.05                 42.72    13.91
## 2                        53.47                  0.00     3.62
## 3                       394.32                  0.00     3.51
## 4                         0.00                 85.14     2.03
## 5                        50.55                 38.15     2.54
## 6                      1002.07                  0.00     4.82

You can use head(*dataframename*) or View(*dataframename*) to view and explore the data in RStudio.

The next step is to discard unwanted columns and merge the Stats 19 dataframes to make the data more manageable with the aim of creating counts by small area geographies. As you can see there is a lot of information contained in Stats 19 - you may want to consider how you could use some of the extra information in due course.

# remove unwanted accident data

cols.to.keep = c("Accident_Index", "Location_Easting_OSGR", "Location_Northing_OSGR",
    "Accident_Severity", "Date", "Day_of_Week", "Time", "X1st_Road_Class", "LSOA_of_Accident_Location")


Stat19.accidents.filtered <- Stats19.accidents[, cols.to.keep]


cols.to.keep = c("Accident_Index", "Age_of_Casualty", "Sex_of_Casualty")

Stats19.casualties.filtered <- Stats19.casualties[, cols.to.keep]

# join data frames on 'Accident_Index'
Stats19.final <- left_join(Stat19.accidents.filtered, Stats19.casualties.filtered)

# filter out non Scottish LSOA codes by matching discarding those that begin
# with E or W
Stats19.final <- Stats19.final[substr(Stats19.final$LSOA_of_Accident_Location, 1,
    1) != "E", ]
Stats19.final <- Stats19.final[substr(Stats19.final$LSOA_of_Accident_Location, 1,
    1) != "W", ]

# Make Stats19 Spatial
Stats19.final <- st_as_sf(Stats19.final, coords = c("Location_Easting_OSGR", "Location_Northing_OSGR"),
    crs = "+init=epsg:27700", na.fail = F)

# quick plot to check the accidents are where we hope - i.e. Scotland (mainly)
plot(st_geometry(Stats19.final), axes = T, graticule = T, cex = 0.1, main = "Scotland - road accident locations")

As you can see from the data - the LSOA code (DataZone) for Scotland is empty. We can use a spatial join to fill in the missing information - where the location of the accident is compared to the boundary of the data zones.

# Bug fix to make sure the coord systems are the same
st_crs(Stats19.final) <- st_crs(scottish.datazones)

# perform spatial join
Stats19.final <- st_join(Stats19.final, scottish.datazones["DataZone"])

# remove null values
Stats19.final <- na.omit(Stats19.final)

# plot to check
plot(st_geometry(scotland), border = "green", graticule = T, axes = T)
plot(st_geometry(Stats19.final), col = "blue", cex = 0.1, add = T)

We are going to do the same to the census data so that we have a more manageable dataset - in this case we will group counts of people into males, females, children and adults.

# Create age groups by summing columns accross rows - females
census.final.females <- census.females %>%
    mutate(children.female = rowSums(.[, 1:6]), adults.female = rowSums(.[, 7:23])) %>%
    .[, c("DataZone", "children.female", "adults.female")]
# Create age groups by summing columns accross rows - males
census.final.males <- census.males %>%
    mutate(children.male = rowSums(.[, 1:6]), adults.male = rowSums(.[, 7:23])) %>%
    .[, c("DataZone", "children.male", "adults.male")]

census.final <- left_join(census.final.females, census.final.males)
census.final <- census.final %>%
    mutate(Total = rowSums(.[, 2:5]))

4 Creating Rates

A principal method in epidemiology is to describe distributions of disease or injury and identify populations at risk. One of the main methods you can use to estimate to explore those populations at risk is to create a ratio - for example:

\[ Injury Rate = \frac {number\ of\ injuries}{exposed\ population} \]

In this example we will be using road accident data published by the UK Government (Stats 19), population data in the form of census data and road network data published by the Ordnance Survey to explore the spatial distribution of the risk of being involved in a road traffic accident.

4.1 Basic Rates

The first step in creating a basic rate is to create counts of accidents by a small area geography - in this case a DataZone. As we have attached a DataZone code, using a spatial join, to the Stats19 data we can create a summary count by grouping on LSOA code as follows:

# create a count of accidents by DataZone
stats19.datazone <- Stats19.final %>%
    group_by(DataZone) %>%
    dplyr::summarise(n_accidents = n())

In this example we have grouped the data simply using the DataZone of the accident, you may want to consider how you can use some of the other data fields in the Stats19 to create a more refined accident rate for example accident severity, age of casualty or weather conditions.

We can now join the datasets together to create a single table from which we can calculate injury rates:

# remove geom from stats19
st_geometry(stats19.datazone) <- NULL

# join to datazone shapes
injury.rates <- left_join(scottish.datazones, scottish.roads)

# join stats19
injury.rates <- left_join(injury.rates, stats19.datazone)

# join census
injury.rates <- left_join(injury.rates, census.final)

injury.rates[is.na(injury.rates)] <- 0

This results in a final table of data as follows:

# cut down for html display
datatable(head(injury.rates, rownames = F))

We can map this using some simple code in R to show the distribution of accidents by DataZone:

plot(injury.rates["n_accidents"], axes = T, border = NA, graticule = T, main = "Number of Accidents by DataZone")

## Point Interpolation

We can convert the point data to a raster surface to model the variation in accidents. We can use some simple spatial modelling Inverse Distance Weighting to interpolate the point data to a raster using the Stats19 data. Each point is compared to its neighbours to estimate the values in the space inbetween. In this example we use Accident Severity to spatially model the variation in accident severity accross Scotland.

library(gstat)
library(RColorBrewer)
# convert to Spatial
tmp <- as(Stats19.final, "Spatial")
# set CRS
proj4string(tmp) <- CRS("+init=epsg:27700")
# create raster grid
r = raster(as(extent(tmp), "SpatialPolygons"), ncol = 40, nrow = 80)
# fit a model - ~1 means intercept only - IDW method uses all points but
# weights those nearer more highly
gs = gstat(formula = Accident_Severity ~ 1, locations = tmp)
# match coordinate systems and perform interpolation
crs(r) = crs(tmp)
r_idw = interpolate(r, gs)

## [inverse distance weighted interpolation]

# set up colour scheme
my.pal <- brewer.pal(n = 9, name = "RdYlBu")

# plot results
plot(r_idw, col = my.pal, main = "IDW Accident Severity")
plot(scotland, add = T, col = NA, border = "grey", lwd = 0.5)

The maps so far tell us where the accidents have occurred but doesn’t tell us anything about the population exposed. We can estimate the rate of accidents by population to see if some areas of Scotland are at higher risk than others from the data table we have created. First we will create a basic exposure rate with population as the denominator. To accomplish this we simply need to create a ratio for each DataZone by dividing the number of accidents by the population exposed.

# create rates per 1000 population
injury.rates$basic.rate.pop <- (injury.rates$n_accidents/injury.rates$ResPop2011) *
    1000

# map rates to show spatial distribution
plot(injury.rates["basic.rate.pop"], axes = T, border = NA, graticule = T, main = "Basic Rate")

We can bin the data into deciles we can see much more clearly any changes in the spatial distribution. The option below uses the kmeans classification which is widely used in GIS.

## set up a colour scheme for use in the maps you can view all schemes by
## typing: display.brewer.all()
cols = brewer.pal(5, "YlOrRd")
pal <- colorRampPalette(cols)

# map rates to show spatial distribution
plot(injury.rates["basic.rate.pop"], breaks = "kmeans", pal = pal(5), nbreaks = 5,
    axes = T, border = NA, graticule = T, main = "Basic Rate - kmeans")

This is a very “crude” measure of injury risk - a simple way of refining it would be to be calculate a rate per km of road within an LSOA. Using the open roads data we can estimate a rate per kilometre of road - this may give us an indication of whether a road in an area is particularly dangerous.

# create road rates
injury.rates$road.rate = injury.rates$n_accidents/injury.rates$Total_km

plot(injury.rates["road.rate"], breaks = "kmeans", nbreaks = 5, axes = T, border = NA,
    pal = pal(5), graticule = T, main = "Rate per KM of road")

As you can see the size of the different geographic areas and sparesness of roads in some areas of Scotlands means that there is not much variation displayed as the higher rates are found in smaller DataZones witin Glasgow and Edinburgh. We touch on using interactive mapping later in the tutorial which facilitates exploring this more.

4.2 Refined Rates

The previous section demonstrated how we can create basic rates of accidents per head of population or rates of accidents per km of road. These give an initial overview of the spatial distribution of accidents, however, they don’t really tell us anything about the accidents, people or roads involved and sometimes mask interesting patterns. We can create a new adjusted risk model where we more precisely define the numerators and denominators:

\[AR_i=\sum{\frac{injuries_{ir}}{total_r * basic\ rate}}\]

Where \(AR\) = adjusted rate, \(_i\) = geographical unit, \(_r\) = adjusment variable (e.g. Road Type, Age, Sex)

Using this formula we can substitute in a variety of variables depending on what we want to explore. For example to look at road classification we could adjust our rates by road class.

Stats 19 has six road classifications which detail where the accident took place:

Code	Label
1	Motorway
2	A(M)
3	A
4	B
5	C
6	Unclassified

These broadly map onto our road types categories extracted from the ITN data - we can aggregate them so we have matching groups.

# regroup stats19 roads so that they reflect OS OpenRoads Classification create
# counts by road type per DataZone
stats19.datazone.road <- Stats19.final %>%
    group_by(DataZone, X1st_Road_Class) %>%
    summarise(n_accidents = n())
# drop geom col
stats19.datazone.road$geometry <- NULL
# crosstab so road type is cols
stats19.datazone.road <- stats19.datazone.road %>%
    dcast(DataZone ~ X1st_Road_Class, value.var = "n_accidents")
# fill NAs with 0
stats19.datazone.road[is.na(stats19.datazone.road)] <- 0
# Group Road types together using table above
stats19.datazone.road <- stats19.datazone.road %>%
    mutate(n_motorway = rowSums(.[, c("2", "3")]), n_A_Roads = .[, "3"], n_B_Roads = .[,
        "4"], n_C_Roads = rowSums(.[, c("5", "6")]))

datatable(head(stats19.datazone.road))

Now we have created the counts by road type we can create rates by road classification as an adjusted rate.

# join tables
road.class.rates <- left_join(injury.rates, stats19.datazone.road)
# Fill NA
road.class.rates[is.na(road.class.rates)] <- 0

# #create Basic Rate for calcs road.class.rates$BasicRate <-
# rowSums(road.class.rates[,5:8]) / rowSums(road.class.rates[,2:4])

# create rates
road.class.rates$A_Road_Rate <- road.class.rates$n_A_Roads/(road.class.rates$A.Road *
    road.class.rates$road.rate)

road.class.rates$B_Road_Rate <- road.class.rates$n_B_Roads/(road.class.rates$B.Road *
    road.class.rates$road.rate)

road.class.rates$C_Road_Rate <- road.class.rates$n_C_Roads/(road.class.rates$Local.Road *
    road.class.rates$road.rate)

# change inf to 0
road.class.rates[] <- lapply(road.class.rates, function(i) if (is.numeric(i)) ifelse(is.infinite(i),
    0, i) else i)

# zero out NA
road.class.rates[is.na(road.class.rates)] <- 0

# simplify geom
road.class.rates <- st_simplify(road.class.rates, preserveTopology = T, dTolerance = 100)


plot(road.class.rates[, c("A_Road_Rate", "B_Road_Rate", "C_Road_Rate")], breaks = "kmeans",
    nbreaks = 5, pal = pal(5), axes = T, border = NA, graticule = T)

This in just one way of creating multiple maps - check out ggplot and the geom_sf functionality if you want more flexibilty!

5 Web Mapping

One the main developments in GIS in the past decade is the use of the internet to deliver data and display maps. There are number of packages in R which allow us to make interactive maps to display the outputs of our analysis. In this example we will use the tmap package which can be used to produce interactive maps - its very simple to use and embeds into a R Markdown:

# Bug in the 2.3 version of the code with patch not released to the main CRAN
# repo yet. If you have problems run this code and restart R

# install.packages('devtools')

# library(devtools) install.packages('tmap', version=2.2)

# library(tmap)
tmap_options(check.and.fix = TRUE)
tmap_mode("view")
tm_shape(road.class.rates) + tm_fill(col = "C_Road_Rate")

Try mapping the other rates and see how the maps change - this will give us an idea of the spatial distribution of road traffic accidents adjusted for road type.

Can you create interactive maps for all of the maps we have produced?

6 Further Work

This web page has gone through the very basic process of creating rates of exposure by small geographic areas. The data we have used is freely available to download and re-use, and contains a wealth of information that you could use to perform your own data exploration. For example:

Rates by sex
Rates by age group
Rates by vehicle type
Rates by accident severity

Think about how you can use the data supplied here and other data available on sites like http://www.data.gov.uk such as traffic counts, travel to work areas and data like running routes to create adjusted risk models. Have a go!

7 Useful Links/ Examples

7.1 R as a GIS

Robin Lovelace, Jakub Nowosad and Jannes Muenchow - Geocomputation in R: https://bookdown.org/robinlovelace/geocompr/
Chris Brunsdon’s and Lex Comber - Spatial Analysis in R : https://uk.sagepub.com/en-gb/eur/an-introduction-to-r-for-spatial-analysis-and-mapping/book241031
Geocomputation - A Practical Primer : https://uk.sagepub.com/en-gb/eur/geocomputation/book241023

7.2 R - Statistics

Discovering Statistics Using R - https://uk.sagepub.com/en-gb/eur/discovering-statistics-using-r/book236067

7.3 Stats19 Examples

Robin Lovelace BikeR github page - using similar data and principles for bike related exposure - https://github.com/Robinlovelace/bikeR
http://routerisks.co.uk/ - how risky is a route
R Stats19 Package: https://cran.r-project.org/web/packages/stats19/vignettes/blog.html
https://richfry.github.io - Data visualisation of Stats19 data (early version!)

8 Contact

Email:	Rich Fry	Sarah Rodgers
Twitter:	@richfry	@geographysarah

HDR UK Summer School 2019

Spatial Data Handling and Mapping Using R