1 Preamble

1.1 List of Acronyms

CRAN \(\hspace{2.0em}\)Comprehensive R Archive Network
SIR \(\hspace{3.0em}\)standardised incidence ratio
SIDS \(\hspace{2.0em}\)sudden infant death syndrome

1.2 Aim

Have you ever wanted to test a new spatial model or other methodology but could not find a decent spatial data set? Or perhaps you went to the trouble of trying to obtain (or collect!) some confidential data, only to find that the journal you want to publish in requires that all data you analyse be made publicly available. Many researchers resort to using simulated data for this reason. However, simulated data has its own drawbacks. Perhaps the main drawback is that it can be incredibly difficult to simulate realistic spatial data. There are so many variables/features that you may want (or need) to control. For example, in epidemiological studies, it is typical to have to control things like:

  • population counts
  • observed cases (incidence rates):
    • mean number of cases
    • variance of number of cases
    • extremes
    • number of zero cases
  • spatial covariate(s):
    • magnitude (effect size)
    • interaction (multi-collinearity)
    • spatial autocorrelation of each covariate with itself
  • residual spatial variation leftover after accounting for spatial covariates
  • offset parameters
  • sample size

It only takes one of these variables to be poorly simulated for the whole data set to become unrealistic, and small changes to one variable can have a large impact on the other variables. For example, suppose you have a real data set of observed cases which you are using as a guideline to simulate cases. You determine that a gamma distribution will generate a random variate with a very similar mean and variance to that real data set. However, the gamma distribution has a very long tail, and about 1 in 100 simulated cases will be very large, perhaps an order of magnitude larger than the sum of the real cases. So you place a constraint on these simulated values by truncating the distribution. Now 1% of your simulated cases are truncated, which may start to appear as a feature, especially if they happen to be situated close together spatially (a cluster of similarly large values). Now you need to find a way of removing this artificial feature. And so the problem goes on, with each “adjustment” creating more unrealistic features.
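
To make the truncation example concrete, here is a minimal sketch of the problem, using purely hypothetical target moments (not taken from any real data set): capping a long-tailed gamma variate creates an artificial spike at the cap.

# Minimal sketch of the truncation problem, using hypothetical target moments
set.seed(42)
target.mean <- 10
target.var  <- 100
shape <- target.mean^2 / target.var   # Gamma parameters by method of moments
rate  <- target.mean / target.var

sim.cases <- rgamma(1000, shape = shape, rate = rate)

# Truncate the long right tail at an arbitrary cap
cap <- 50
sim.cases.trunc <- pmin(sim.cases, cap)

# A fraction of values now sit exactly at the cap: an artificial feature
mean(sim.cases.trunc == cap)
hist(sim.cases.trunc, breaks = 50)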

However, what researchers may not be aware of is that there are a number of publicly available spatial data sets in health research! The purpose of this document is to provide researchers in this field with a practical guide on acquiring these spatial data sets, including where the data is located and R code for downloading/reading in the data.

I plan to update this document periodically. If you use this guide and find any mistakes or have any suggestions of open data sets in health research, please contact me at earl.w.duncan@gmail.com.

1.3 General R Libraries

Many R packages (libraries) are used throughout this document. Packages specific to a particular data set are loaded as required. The following is a list of general packages that need to be installed and loaded:

library(rgdal)          # For readOGR()
library(rgeos)          # For unionSpatialPolygons(), gIntersection(), gBuffer()
library(maptools)       # For unionSpatialPolygons()
library(ggplot2)        # For fortify(), ggplot()
library(readxl)         # For read_excel()
library(magrittr)       # For the pipe operator %>%
library(scales)         # For rescale()
library(dplyr)          # For inner_join(), bind_rows(), between(), mutate()
library(gridExtra)      # For grid.arrange()
library(tidyr)          # For gather()

1.4 Colour Schemes

This document includes plots of the data, using ggplot2. For each type of data set, consistent colour schemes are used. Many of the colours chosen are taken from colorbrewer2.org. These colour schemes are explained below.

The following constants are used to implement the colour schemes described above in producing the plots.

# Fill colours for observed and expected cases
Fill.colours.OE <- grey(c(seq(1, 0, -0.25), 0))

# Fill colours for the spatial covariate(s)
Fill.colours.x <- c("#4dac26", "#b8e186", "#f7f7f7", "#f1b6da", "#d01c8b")

# Fill colours and values for SIR
Fill.colours.SIR <- c("#2c7bb6", "#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c", "#d7191c")
Fill.values.SIR <- seq(-3, 3, length = 7)

# Breaks.SIR <- c(0.25, 0.5, 1/1.5, 1, 1.5, 2, 4)
Breaks.SIR <- c(0.25, 0.5, 1, 2, 4, 8)

# Fill colours for log-SIR
Fill.colours.log.SIR <- c("#5e3c99", "#b2abd2", "#f7f7f7", "#fdb863", "#e66101")

1.5 User-Defined Functions

# Truncate values so they lie between specified min and max
Clip <- function(val, min, max){
    val <- val %>%
        replace(., which(. > max), max) %>%
        replace(., which(. < min), min)
    return(val)
}
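
For example, a quick check that values outside the bounds are pulled back to the specified limits:

# Values below 0.125 or above 8 are truncated to those limits
Clip(c(0.01, 0.5, 20), 0.125, 8)
## [1] 0.125 0.500 8.000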

2 Well-Analysed Data Sets

2.1 Scottish Lip Cancer

The Scottish Lip Cancer data is a well-known data set amongst epidemiologists. This data set was compiled by Kemp et al. (1985) and first analysed by D. Clayton and Kaldor (1987). It has subsequently been analysed by N. E. Breslow and Clayton (1993), Leroux, Lei, and Breslow (2000), Spiegelhalter et al. (2002), Duncan, White, and Mengersen (2017), and many others.

Brief summary of the data:

  • Lip cancer incidence
  • 56 areas (counties in Scotland)
  • registered during the 6 years from 1975 to 1980
  • 1 spatial covariate (sun exposure)
  • observed counts are highly autocorrelated

Early versions of this data set did not include a spatial covariate. Moreover, the spatial neighbours listed in N. E. Breslow and Clayton (1993) appear to contain many mistakes, e.g. area 8 has neighbour 53, but area 53 does not list 8 as a neighbour, and these two areas appear to be separated by three areas (not adjacent). Some, but not all, of these inconsistencies may be explained if they were using a very different shapefile (with different area IDs).
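
If you want to check a published neighbour list against the shapefile itself, a sketch along these lines may help (using the spdep package, and assuming the shapefile has been read in as map, as in Section 2.1.1 below; note that the area IDs may still differ between sources):

library(spdep)   # For poly2nb(), is.symmetric.nb()

# Derive contiguity-based neighbours directly from the polygons
nb <- poly2nb(map)

# Contiguity-based neighbour lists should always be symmetric
is.symmetric.nb(nb)

# Inspect the neighbours of, say, area 8 to compare against a published list
nb[[8]]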

2.1.1 The Shapefile

Originally I acquired the shapefile for this dataset from http://www.stat.osu.edu/~pfc/teaching/5012_spatial_statistics/datasets/Scottish_shapefile.RData in 2017.

# Import the shapefile
map <- readOGR("Hidden Data/Shapefile_Scotland_v2.shp", verbose = FALSE)
plot(map)

Unfortunately, due to link rot, it is no longer available.

Both the data and shapefile are available from the CARBayesdata package. However, the shapefile contains no projection data and comes in the format of two separate lists (not a spatial polygons object), so it is not really user-friendly.

Another option is to download it from https://geodacenter.github.io/data-and-lab/scotlip/. This is a zip file (the direct link to the zip file is https://s3.amazonaws.com/geoda/data/scotlip.zip). Once unzipped, the shapefile and data can be read in as follows. Note that I saved the data in a subfolder “Downloaded Data” within the working directory, and removed the first nested folder after unzipping.

# Import the shapefile
map <- readOGR("Downloaded Data/scotlip/scotlip.shp", verbose = FALSE)

# Quick plot
plot(map)

The geometry appears to be over-simplified; the islands are too small, and the edges are very jagged. However, the data should match this geometry correctly. (I will update this if I find a better solution.)

From the shapefile, we merge all the polygons together into a single multi-polygon object, which we will use to plot a border around our maps. Without this border as a separate layer, you can only plot either all boundaries (including those between areas) or no boundaries at all.

# Create border map from full map
N <- length(map)
map.border <- unionSpatialPolygons(map, IDs = rep(1, N))
map.border <- fortify(map.border)

# Fortify
map.df <- fortify(map)

# Make shapefile dataframe IDs numeric
map.df$id <- as.numeric(map.df$id)
if(!all(range(map.df$id) == c(1, N))){
    map.df$id <- as.numeric(map.df$id) + 1  # Need to add 1 so range is 1:N
    print(range(map.df$id))
}
## [1]  1 56
stopifnot(all(unique(map.df$id) == 1:N))

2.1.2 The Data

The data can also be found in the CRAN packages mdhglm, CARBayesdata, and probably several others. For consistency, we will use the data included in the same download as the shapefile.

# Import data
data <- read_excel("Downloaded Data/scotlip/scotlip.xlsx") %>% 
    data.frame %>%
    subset(select = -c(1:4, 6:7, 9))
names(data) <- c("area.ID", "observed", "expected", "x")

# Re-order records to match shapefile IDs
data <- data[match(as.numeric(paste(map$DISTRICT)), as.numeric(data$area.ID)), ]

head(data)
##   area.ID observed expected  x
## 1       1        9     1.38 16
## 2       2       39     8.66 16
## 3       3       11     3.04 10
## 4       4        9     2.53 24
## 5       5       15     4.26 10
## 6       6        8     2.40 24

The records in the data already matched the shapefile, but it is good practice to have this code anyway.

Note that N. E. Breslow and Clayton (1993) (and most subsequent analyses of this data, including Duncan, White, and Mengersen (2017)) divide the covariate by 10 to increase the stability of the estimates. It is worth modifying this value now so that this is reflected in the plots (as well as any modelling).

# Modify covariate
data$x <- data$x / 10

We also need to check that the sum of the observed equals the sum of the expected counts. Otherwise any subsequent analysis would yield erroneous results.

# Check sum of observed = sum of expected
sum(data$observed)
## [1] 536
sum(data$expected)
## [1] 536.01

These are close enough. We now add the “raw” (i.e. observed) SIR, which is just the observed counts divided by the expected counts, to the data.frame.

data$raw <- data$observed / data$expected

Here are some example plots of the data. Note that I clip the SIR values, which are plotted using the log scale (so zeros would be problematic, and you don’t want extremely large values obscuring lower values either).

# Truncate raw SIR values (for plotting purposes)
raw.clipped <- Clip(data$raw, 0.125, 8)

# Truncate log-raw SIR values (for plotting purposes)
log.raw.clipped <- Clip(log(data$raw), -3, 3)

# Dataframe of the variables we need to append to the shapefile
Append <- data.frame(
    id = 1:N,
    obs = data$observed,
    exp = data$expected,
    x = data$x,
    raw.SIR = raw.clipped,
    log.raw.SIR = log.raw.clipped
)

# Fill values for observed and expected values
maxes <- c(max(data$observed), max(data$expected))
oe.min.max <- min(maxes)
oe.max.max <- max(maxes)
Fill.values.OE <- c(seq(0, oe.min.max, length = 5), oe.max.max)

# Fill values for covariate(s)
Lims <- Append$x %>% range %>% abs
Lims <- Lims[Lims %>% which.max]
Fill.values.x <- seq(-Lims, Lims, length = 5)

# Fill values for log-raw SIR
Lims <- Append$log.raw.SIR %>% range %>% abs
Lims <- Lims[Lims %>% which.max]
Fill.values.log.SIR <- seq(-Lims, Lims, length = 5)

# Rescaled fill values
Fill.values.obs.r <-     rescale(Fill.values.OE, from = range(Append$obs))
Fill.values.exp.r <-     rescale(Fill.values.OE, from = range(Append$exp))
Fill.values.x.r <-       rescale(Fill.values.x, from = range(Append$x))
Fill.values.SIR.r <-     rescale(Fill.values.SIR, from = range(log2(Append$raw.SIR)))
Fill.values.log.SIR.r <- rescale(Fill.values.log.SIR, from = range(Append$log.raw.SIR))

# Merge values with shapefile dataframe
dat <- inner_join(map.df, Append, by = "id")

# Set up basic components of the plot
gg.base <- ggplot(data = NULL, aes(x = long, y = lat, group = group)) +
    scale_x_continuous(expand = c(0.01, 0.01)) +
    scale_y_continuous(expand = c(0.01, 0.01)) +
    theme_void() +
    theme(legend.position = "bottom") +
    guides(fill = guide_colourbar(barwidth = 8)) +
    coord_fixed() # Keeps x-y ratio consistent

# Add borders and fill colours; customise legends
gg.obs <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = obs), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.OE, 
        values = Fill.values.obs.r,
        breaks = seq(0, 40, 10)) +
    ggtitle("Observed")

gg.exp <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = exp), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.OE, 
        values = Fill.values.exp.r,
        breaks = seq(0, 80, 20)) +
    ggtitle("Expected")

gg.x1 <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = x), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.x, 
        values = Fill.values.x.r,
        breaks = seq(-3, 3, 1)) +
    ggtitle("Covariate")

gg.SIR.raw <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = log2(raw.SIR)), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.SIR, 
        values = Fill.values.SIR.r,
        limits = range(log2(Append$raw.SIR)),
        breaks = log2(Breaks.SIR), 
        labels = Breaks.SIR) +
    ggtitle("Raw SIR")

gg.log.SIR <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = log.raw.SIR), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.log.SIR, 
        values = Fill.values.log.SIR.r,
        # limits = range(Append$log.raw.SIR),
        breaks = seq(-5, 5, 1), 
        labels = seq(-5, 5, 1)) +
    ggtitle("log-Raw SIR")

# Render the plots
grid.arrange(
    gg.obs, gg.exp, gg.x1, gg.log.SIR, gg.SIR.raw, 
    nrow = 1
)

2.2 North Carolina SIDS

2.2.1 Overview

The North Carolina sudden infant death syndrome (SIDS) data was first presented by Atkinson (1978). It contained the annual number of SIDS and death rate per 1000 live births by county and year (1974-75, 1975-76, 1976-77, and 1977-78), with aggregated counts and corresponding rates for the four-year period, 1974-78. This data was augmented a number of different times over the years:

  • Symons, Grimson, and Yuan (1983) added the number of live births by county for the four-year period, 1974-78. (This was an important addition, as it represents the population-year-at-risk.)
  • N. Cressie and Read (1985) added a list of adjacent neighbours (less useful since these can easily be generated from a shapefile; see the sketch after this list)
  • N. Cressie and Read (1989, 700) added:
    • spatial coordinates with respect to an arbitrary location; and
    • a list of distance-based neighbours, namely those within a 30-mile radius
  • N. Cressie and Chan (1989) added:
    • the number of SIDS and number of live births for the 5-year period, 1979-84; and
    • the number of non-white live births for each period, 1974-78 and 1979-84.
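
As noted above, adjacency neighbours (and a spatial weights matrix) can be generated directly from a shapefile. A hedged sketch using spdep, assuming the sids shapefile from spData has been read in as map (as in Section 2.2.2 below):

library(spdep)   # For poly2nb(), nb2mat()

# Contiguity-based neighbours from the polygons
nb <- poly2nb(map)

# Binary (0/1) spatial weights matrix
W <- nb2mat(nb, style = "B", zero.policy = TRUE)
dim(W)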

The augmented data is reprinted in N. A. C. Cressie (1993).

The following animation illustrates how this data set has been augmented over time.


This data set, in its various forms, has been analysed by many, including Atkinson (1978), Symons, Grimson, and Yuan (1983), N. Cressie and Read (1985), N. Cressie and Read (1989), N. Cressie and Chan (1989), N. A. C. Cressie (1993), and Banerjee, Carlin, and Gelfand (2014) (see Banerjee, Carlin, and Gelfand (2014) for a list of other authors who have analysed this data). But most of the early analyses were exploratory in nature. Here is a brief summary of the methods and findings:

  • “SIDS is more prevalent in the eastern part of the state, possibly due to the fact that eastern counties are among the poorest counties in the state and usually have the highest infant mortality rates” (Atkinson 1978, 6).
  • Exploratory analysis suggested geographical clustering (large-scale variation) (Symons, Grimson, and Yuan 1983) and spatial dependence (small-scale variation) (N. Cressie and Read 1985) which could be due to some surrogate variables such as race (N. Cressie and Chan 1989) (hence the addition of the covariate non-white live births by N. Cressie and Chan (1989)).
  • The data has been analysed:
    • using a two-component Poisson mixture model (Symons, Grimson, and Yuan 1983) (1974-79 only).
    • by first removing large scale variation using the Freeman-Tukey transformation (N. Cressie and Read 1985, 334) then
      • exploring the small scale variation through interactions (N. Cressie and Read 1985).
      • fitting the transformed data using an auto-Gaussian model for each period separately (the covariate is also transformed) (N. Cressie and Chan 1989; N. A. C. Cressie 1993, Ch. 7).
  • The Freeman-Tukey (square-root) transformation:
    • is a variance-controlling transformation (N. Cressie and Read 1989; N. A. C. Cressie 1993, 245)
    • removes heteroskedastic variation (N. A. C. Cressie 1993, 248, 394)
    • remove dependence of the variance on the mean (N. A. C. Cressie 1993, 395)
    • makes the data approximately normal (so an auto-Gaussian model on the transformed data can be used) (N. A. C. Cressie 1993, ch. 7)
  • Using an auto-Poisson model and Markov random field has been suggested (N. Cressie and Read 1985, 337; N. Cressie and Read 1989; N. A. C. Cressie 1993, 428–9) but generally considered too difficult at the time (N. Cressie and Read 1989, 706).
  • County 4 (Anson County) was removed before model fitting for being unusual (extreme) (N. Cressie and Chan 1989, 398; N. A. C. Cressie 1993, 553)

Discrepancies:

  • The second period of years for which data has been collected is 1979-1984 (which is a 5-year period). This is stated in N. Cressie and Chan (1989) and N. A. C. Cressie (1993). However, Banerjee, Carlin, and Gelfand (2014) refers to this period as 1979 to 1983. This is likely an error in Banerjee, Carlin, and Gelfand (2014).

Brief summary of the data:

  • SIDS incidence
  • collected over 2 time intervals:
    • aggregated from 1974 to 1978; and
    • aggregated from 1979 to 1984
  • 100 areas (counties in North Carolina, United States)
  • 1 spatial covariate (number of non-white births)

The data set and corresponding shapefile are both available from the spData R package (formerly in spdep).

2.2.2 The Shapefile

Note: the instructions given in Banerjee, Carlin, and Gelfand (2014) for importing the shapefile and data set are outdated and use deprecated functions. See https://r-spatial.github.io/spdep/articles/sids.html for revised instructions. This page also provides information on discrepancies between the different versions of the data set and shapefile used in various published analyses.
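
As an aside, readOGR() comes from the rgdal package, which has since been retired from CRAN. If it is unavailable, a hedged alternative sketch using the sf package is shown below; the remainder of this document assumes sp-class objects, so the result is converted back with as().

library(sf)

# Read the shapefile with sf instead of rgdal
map_sf <- st_read(system.file("shapes/sids.shp", package = "spData")[1], quiet = TRUE)

# Convert back to a SpatialPolygonsDataFrame for the sp-based code below
map <- as(map_sf, "Spatial")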

# Import the shapefile from the spData package
map <- readOGR(system.file("shapes/sids.shp", package = "spData")[1])

# Add a projection
proj4string(map) <- CRS("+proj=longlat +datum=NAD27")

# Quick plot
plot(map)

# Fortify for plotting
map.df <- fortify(map)

# Make shapefile dataframe IDs numeric
N <- length(map)
map.df$id <- as.numeric(map.df$id)
if(!all(range(map.df$id) == c(1, N))){
    map.df$id <- as.numeric(map.df$id) + 1  # Need to add 1 so range is 1:N
}
stopifnot(all(unique(map.df$id) == 1:N))

Normally, this would be fine. However, the shapefile IDs, map$CRESS_ID, were not ordered, so these IDs need to be updated (otherwise every map will be wrong).

for(i in 1:length(map.df$id)){
    map.df$id[i] <- map$CRESS_ID[map.df$id[i]]
}

Create the border layer:

# Create border map from full map
map.border <- unionSpatialPolygons(map, IDs = rep(1, N))

# Quick plot
plot(map.border)

2.2.3 The Data

The data are contained within the shapefile. To extract this data, simply convert the map to a data.frame.

# View data contained within shapefile
data <- as.data.frame(map)
head(data)
##   CNTY_ID  AREA PERIMETER CNTY_        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
## 0    1825 0.114     1.442  1825        Ashe 37009  37009        5  1091     1      10  1364     0
## 1    1827 0.061     1.231  1827   Alleghany 37005  37005        3   487     0      10   542     3
## 2    1828 0.143     1.630  1828       Surry 37171  37171       86  3188     5     208  3616     6
## 3    1831 0.070     2.968  1831   Currituck 37053  37053       27   508     1     123   830     2
## 4    1832 0.153     2.206  1832 Northampton 37131  37131       66  1421     9    1066  1606     3
## 5    1833 0.097     1.670  1833    Hertford 37091  37091       46  1452     7     954  1838     5
##   NWBIR79 east north      x       y       lon      lat L_id M_id
## 0      19  164   176 -81.67 4052.29 -81.48594 36.43940    1    2
## 1      12  183   182 -50.06 4059.70 -81.14061 36.52443    1    2
## 2     260  204   174 -16.14 4043.76 -80.75312 36.40033    1    2
## 3     145  461   182 406.01 4035.10 -76.04892 36.45655    1    4
## 4    1197  385   176 281.10 4029.75 -77.44057 36.38799    1    4
## 5    1237  411   176 323.77 4028.10 -76.96474 36.38189    1    4

There are a lot of unnecessary variables here. We don’t even need the longitude and latitude (they are still contained within the map). Here is a succinct version:

data <- data[,-c(1:7, 15:22)]
head(data)
##   CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79
## 0        5  1091     1      10  1364     0      19
## 1        3   487     0      10   542     3      12
## 2       86  3188     5     208  3616     6     260
## 3       27   508     1     123   830     2     145
## 4       66  1421     9    1066  1606     3    1197
## 5       46  1452     7     954  1838     5    1237

Before we add any new variables, we should convert this to long format. A function like gather() could be used, but it isn’t really designed for combining multiple key-value pairs. The easiest solution is to manually append the variables from the different years. However, rbind() does not work when the column names differ, so we first give both data frames the same column names.

dat.1 <- data.frame(data[,c(1:4)], year = "1974-78")
names(dat.1) <- c("area.ID", "pop", "observed", "x", "year")
dat.2 <- data.frame(data[,c(1, 5:7)], year = "1979-84")
names(dat.2) <- names(dat.1)
data <- rbind(dat.1, dat.2)
rm(dat.1, dat.2)
head(data)
##   area.ID  pop observed    x    year
## 0       5 1091        1   10 1974-78
## 1       3  487        0   10 1974-78
## 2      86 3188        5  208 1974-78
## 3      27  508        1  123 1974-78
## 4      66 1421        9 1066 1974-78
## 5      46 1452        7  954 1974-78

I have renamed the number of births as pop since it represents the population-year-at-risk in the context of SIDS.

To create the expected counts, we rescale this population-year-at-risk so it has the same sum as the observed counts. This is called internal standardisation (Pascutto et al. 2000).

data$expected <- c(
    data$pop[1:100] * sum(data$observed[1:100]) / sum(data$pop[1:100]),
    data$pop[101:200] * sum(data$observed[101:200]) / sum(data$pop[101:200])
)
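
An equivalent, more general way to do this is to group by period with dplyr (a sketch; it should reproduce the hard-coded indices above):

data <- data %>%
    group_by(year) %>%
    mutate(expected = pop * sum(observed) / sum(pop)) %>%
    ungroup() %>%
    as.data.frame()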

Add the raw SIR, i.e. the observed divided by expected counts.

data$raw <- data$observed / data$expected

Example plots for both time periods:

# Truncate raw SIR values (for plotting purposes)
raw.clipped <- Clip(data$raw, 0.125, 8)

# Truncate log-raw SIR values (for plotting purposes)
log.raw.clipped <- Clip(log(data$raw), -3, 3)

# Dataframe of the variables we need to append to the shapefile
Append <- data.frame(
    id = data$area.ID,
    obs = data$observed,
    exp = data$expected,
    x = data$x,
    raw.SIR = raw.clipped,
    log.raw.SIR = log.raw.clipped,
    year = data$year
)

# Fill values for observed and expected values
maxes <- c(max(Append$obs), max(Append$exp))
oe.min.max <- min(maxes)
oe.max.max <- max(maxes)
Fill.values.OE <- c(seq(0, oe.min.max, length = 5), oe.max.max)

# Fill values for covariate(s)
Lims <- Append$x %>% range %>% abs
Lims <- Lims[Lims %>% which.max]
Fill.values.x <- seq(-Lims, Lims, length = 5)

# Fill values for log-raw SIR
Lims <- Append$log.raw.SIR %>% range %>% abs
Lims <- Lims[Lims %>% which.max]
Fill.values.log.SIR <- seq(-Lims, Lims, length = 5)

# Rescaled fill values
Fill.values.obs.r <-     rescale(Fill.values.OE, from = range(Append$obs))
Fill.values.exp.r <-     rescale(Fill.values.OE, from = range(Append$exp))
Fill.values.x.r <-       rescale(Fill.values.x, from = range(Append$x))
Fill.values.SIR.r <-     rescale(Fill.values.SIR, from = range(log2(Append$raw.SIR)))
Fill.values.log.SIR.r <- rescale(Fill.values.log.SIR, from = range(Append$log.raw.SIR))

# Merge values with shapefile dataframe
dat <- inner_join(map.df, Append, by = "id")

# Set up basic components of the plot
gg.base <- ggplot(data = NULL, aes(x = long, y = lat, group = group)) +
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    scale_x_continuous(expand = c(0.05, 0.05)) +
    scale_y_continuous(expand = c(0.05, 0.05)) +
    theme_void() +
    guides(fill = guide_colourbar(barheight = 7)) +
    coord_fixed() + # Keeps x-y ratio consistent
    facet_grid( ~ year)

# Add borders and fill colours; customise legends
gg.obs <- gg.base + 
    geom_polygon(data = dat, aes(fill = obs), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.OE, 
        values = Fill.values.obs.r,
        breaks = seq(0, 60, 20)) +
    ggtitle("Observed")

gg.exp <- gg.base + 
    geom_polygon(data = dat, aes(fill = exp), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.OE, 
        values = Fill.values.exp.r,
        breaks = seq(0, 60, 20)) +
    ggtitle("Expected")

gg.x1 <- gg.base + 
    geom_polygon(data = dat, aes(fill = x), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.x, 
        values = Fill.values.x.r,
        breaks = seq(-10000, 10000, 5000),
        labels = paste0(seq(-10, 10, 5), "K")) +
    ggtitle("Covariate")

gg.SIR.raw <- gg.base + 
    geom_polygon(data = dat, aes(fill = log2(raw.SIR)), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.SIR, 
        values = Fill.values.SIR.r,
        limits = range(log2(Append$raw.SIR)),
        breaks = log2(Breaks.SIR), 
        labels = Breaks.SIR) +
    ggtitle("Raw SIR")

gg.log.SIR <- gg.base + 
    geom_polygon(data = dat, aes(fill = log.raw.SIR), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.log.SIR, 
        values = Fill.values.log.SIR.r,
        # limits = range(Append$log.raw.SIR),
        breaks = seq(-5, 5, 1), 
        labels = seq(-5, 5, 1)) +
    ggtitle("log-Raw SIR")

# Render the plots
grid.arrange(
    gg.obs, gg.exp, gg.x1, gg.log.SIR, gg.SIR.raw, 
    ncol = 1
)

2.2.4 Bonus Maps

Note that the maps above do not look very similar to the maps in the cited papers. This is because those maps were showing different variables. Here I attempt to recreate their maps and explain what they are showing.

2.2.4.1 Symons et al. (1983)

The first map in Symons, Grimson, and Yuan (1983) (Figure 2) shows the results of their simulation study. They used three different criteria to try to identify areas as “high” or “low” risk. For their third criterion, the 15 areas identified as “high” risk correspond to 15 out of 17 of the areas with the highest raw SIR (but they actually based the criterion on the death rate per 1000, which coincides with the raw SIR at least for the top 20). The other two criteria were more sporadic. I have simply hard-coded the IDs of these high-risk areas below:

high.risk.1 <- c(85, 5, 16, 98, 94)
high.risk.2 <- c(85, 5, 16, 6, 86, 96, 98, 61, 94, 12)
high.risk.3 <- c(85, 5, 44, 16, 6, 86, 59, 28, 96, 98, 61, 94, 65, 12, 92)
high.risk.top.20 <- c(85, 5, 44, 16, 6, 86, 59, 28, 96, 98, 58, 9, 61, 94, 65, 12, 92, 97, 49, 74)

These three risk categories are nested, and this is how they represent them on the map.

dat <- matrix(99, N, 4)
dat[high.risk.1, 1] <- 1
dat[high.risk.2, 2] <- 2
dat[high.risk.3, 3] <- 3
dat[high.risk.top.20, 4] <- 4

# Dataframe of the variables we need to append to the shapefile
Append <- data.frame(
    id = data$area.ID,
    col = apply(dat, 1, min) %>% as.character
)

# Merge values with shapefile dataframe
dat <- inner_join(map.df, Append, by = "id")

# Plot
ggplot(data = dat, aes(x = long, y = lat, group = group)) +
    geom_polygon(data = dat, aes(fill = col), color = "black") +
    scale_fill_manual("", values = grey(seq(0, 1, length = 5)),
        labels = c(
            "High risk counties by criteria 1, 2, and 3",
            "High risk counties by criteria 2 and 3",
            "High risk counties by criteria 3",
            "Remaining counties of the 20 with largest SIDS rate",
            "Remaining 80 counties"
        )) +
    scale_x_continuous(expand = c(0.05, 0.05)) +
    scale_y_continuous(expand = c(0.05, 0.05)) +
    theme_void() +
    theme(legend.position = "bottom") +
    guides(fill = guide_legend(nrow = 5)) +
    coord_fixed()

The second map in Symons, Grimson, and Yuan (1983) (Figure 3) shows the results from performing a “Monte Carlo regionalisation test”. As far as I can understand, this is a crude method of smoothing, tweaking the results from the first map above by taking into account whether adjacent areas are considered high or low. They emphasise 3 cases where the results changed: Swain county in the west, which increased from normal-risk to high-risk; and Rockingham and Scotland counties, which were previously high-risk (under at least one criterion) and were downgraded to normal-risk.

high.risk <- c(high.risk.top.20[c(1:12, 14)], 72)
dat <- rep("Low", N)
dat[high.risk] <- "High"

# Dataframe of the variables we need to append to the shapefile
Append <- data.frame(
    id = data$area.ID,
    col = dat
)

# Merge values with shapefile dataframe
dat <- inner_join(map.df, Append, by = "id")

# Plot
ggplot(data = dat, aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = col), color = "black") +
    scale_fill_manual("", values = grey(c(0.25, 1)),
        labels = c(
            "High incidence of SIDS",
            "Low incdience of SIDS"
        )) +
    scale_x_continuous(expand = c(0.05, 0.05)) +
    scale_y_continuous(expand = c(0.05, 0.05)) +
    theme_void() +
    theme(legend.position = "bottom") +
    guides(fill = guide_legend(nrow = 5)) +
    coord_fixed()

Essentially, they are trying to highlight areas with a high raw SIR. (They do not really show which areas are “low”, however, as they make no distinction between “average” and “low”.)

2.2.4.2 Cressie and Read (1985)

N. Cressie and Read (1985, 335–6) create a probability map. They assume that the observed counts, \(y_i\), come from a Poisson distribution with Poisson rate equal to the expected counts, \(E_i\),

\[ E_i = n_i \sum_{i=1}^N{y_i} \bigg/ \sum_{i=1}^N{n_i} ~, \] under the hypothesis of homogeneity. Then they define a quantity (referred to as the index of deviation in the later work N. Cressie and Read (1989)), denoted \(\rho_i\), as

\[ \rho_i = \begin{cases} p(x\geq y_i) & \mbox{if } y_i \geq E_i \mbox{ (high incidence)}\\ p(x<y_i) & \mbox{if } y_i < E_i \mbox{ (low incidence)}. \\ \end{cases} \] This quantity indicates whether the incidence is high or low with respect to the number of expected cases. Practically, \(p(x<y_i)\) means computing the left-tail probability of a Poisson distribution up to \(y_i\) with Poisson rate \(E_i\), which is performed using the ppois() function. Similarly for \(p(x\geq y_i)\), but using the right-tail probability.

dat <- data[which(data$year == "1974-78"),]
y <- dat$observed
E <- dat$expected

rho <- rep(NA, N)
for(i in 1:N){
    if(y[i] >= E[i]){
        rho[i] <- ppois(y[i], lambda = E[i], lower.tail = FALSE)
    }else{
        rho[i] <- ppois(y[i], lambda = E[i])
    }
}

Small values of \(\rho_i\) indicate that the SIDS rate for area \(i\) is unusually high (when \(y_i \geq E_i\)) or low (when \(y_i < E_i\)). N. Cressie and Read (1985) suggest using a cut-off of 0.05 for delineating small values.

# Determine which areas have unusually high or low SIDS rate
unusual.rate <- rep(NA, N)
unusual.rate[which(rho < 0.05 & y >= E)] <- "high"
unusual.rate[which(rho < 0.05 & y < E)] <- "low"

The resulting map:

The areas identified as low are the same as shown in N. Cressie and Read (1985) (Figure 1). However, I have 7 additional areas identified as high, whereas they only had 10 areas. These are the additional areas and their corresponding \(\rho\) values:

extra <- c(9, 28, 44, 58, 59, 65, 92)
rho[extra]
## [1] 0.04883597 0.01972804 0.01660806 0.04980039 0.03349278 0.03931612 0.04302268
plot(which(rho < 0.05 & y >= E), rho[which(rho < 0.05 & y >= E)])
lines(extra, rho[extra], type = "p", col = "red")

Based on this plot and the text pertaining to Figure 3 in N. Cressie and Read (1989), it seems they may have restricted the map to show only up to 10 areas with high incidence (those with the lowest values of \(\rho\)) and inadvertently forgotten to mention this in the text! (Or it could be a mistake.)

Here is what the map looks like in N. Cressie and Read (1985) (without the additional areas):

2.2.4.3 Cressie and Read (1989)

N. Cressie and Read (1989) create 2 exploratory maps. The first map shows a modified raw SIDS rate,

\[ W_i = (y_i + 1) / n_i \times 1000, \] showing only the 10 highest and 10 lowest values.

This is similar to the raw SIR, with 3 differences:

  • 1 is added to all observed counts;
  • the denominator is the population-year-at-risk, rather than expected counts (which yields smaller values than the raw SIR since the population is about 3700 on average); and
  • consequently, it is multiplied by 1000.

dat <- data[which(data$year == "1974-78"),]
W <- (dat$observed + 1) / dat$pop * 1000

These values behave similarly to the raw SIR (N. Cressie and Read (1989) describe them as “raw death rates”), but are not as easily interpretable. (Also, all the areas with a zero raw SIR have non-zero \(W\) values.) They have a mean of about 2.8 and a range of 0.75 to 10.19.
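
A quick check against the summary figures quoted above:

round(mean(W), 1)    # Should be about 2.8
round(range(W), 2)   # Should be about 0.75 to 10.19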

Here is a re-creation of the map from N. Cressie and Read (1989) (Figure 1), showing the areas with the 10 highest and lowest \(W_i\) values:

N. Cressie and Read (1989, 703) make an important statement: “This type of map that shows raw death rates can be misleading, since counties with small \(n_i'\text{s}\) typically are far more variable than those with large \(n_i'\text{s}\)…”. This point is reiterated on p. 707.

The second map in N. Cressie and Read (1989) (Figure 3) is a probability map, very similar to the one in N. Cressie and Read (1985). The difference is that instead of showing areas that satisfy \(\rho_i < 0.05\), they show the 10 most unusually high counts and the 10 most unusually low counts.

The areas with high SIDS correspond fairly closely with my map of high raw SIRs, but not the areas with low SIDS. However, neither this plot nor the raw SIRs take into account spatial autocorrelation, as noted by N. Cressie and Read (1989, 705).

2.2.4.4 Cressie (1993)

N. A. C. Cressie (1993, 248) states: “…a choropleth map [of Freeman-Tukey-transformed SIDS rates] could be drawn that could be used to search for spatial patterns. The choropleth map of unsmoothed (transformed) rates gives a poor picture due to large heteroskedastic variation.” This is reiterated on page 394. (The Freeman-Tukey transformation is discussed below.)

N. A. C. Cressie (1993) also discusses the probability map created in N. Cressie and Read (1985). Neither map is actually drawn.

2.2.4.5 Banerjee et al. (2014)

Banerjee, Carlin, and Gelfand (2014) propose transforming the observed cases and the covariate (non-white birth rates) using the Freeman-Tukey transformation. E.g.

# Freeman-Tukey transformation
freeman.tukey <- function(x, y){
    return(sqrt(1000) * (sqrt(x/y) + sqrt((x + 1)/y)))
}

# Add transformed observed cases and covariate
data$observed.FT <- freeman.tukey(data$observed, data$pop)
data$x.FT <-  freeman.tukey(data$x, data$pop)

(Aside: note that the Freeman-Tukey transformed values are not necessarily integers, so an auto-Poisson model is not appropriate for these transformed counts.)

There is a map in Banerjee, Carlin, and Gelfand (2014) (Figure 4.5a) titled “raw Freeman-Tukey transformed SIDS rates”. This map shows the Freeman-Tukey transformed observed counts (for 1979-84 only), which is in effect a type of rate that is an alternative to the raw SIR (it is not depicting the Freeman-Tukey transformed raw SIR). Their map (and the corresponding map for the period 1974-78) are reproduced below:

By changing the colour scheme to a continuous one, and adjusting the cut-offs so that dark blue corresponds to 1.5 instead of 2 and dark red corresponds to 5 instead of 3.5, we can see that these maps are similar in pattern to the raw SIR, but do not have such a straightforward interpretation (and exclude auto-Poisson models).

2.3 New York Leukemia Data

2.3.1 Overview

This data set comprises leukemia incidence in eight upstate New York counties. The data was first presented and analysed by Turnbull et al. (1990), and subsequently analysed by L. A. Waller et al. (1992), L. A. Waller et al. (1994), L. Waller and Gotway (2004), and others. The data represent the number of leukemia cases from 1978-1982 per census tract in a region of upstate New York spanning eight counties (L. Waller and Gotway 2004), namely Broome, Tioga, Tompkins, Cortland, Chenango, Cayuga, Onondaga, and Madison. There were a total of 592 leukemia cases among 1,057,673 people at risk.

The cartographic systems, in order from smallest to largest, are census blocks, census block groups, then census tracts - the latter being relatively permanent statistical subdivisions of a county (https://current360.com/research-101-census-tracts-vs-census-block-groups/). The cases in this data set were originally georeferenced to census block groups, except in Broome County, where they were georeferenced to the larger census tracts. This was due to “limitations in the Cancer Registry geocoded leukemia data” (L. A. Waller et al. 1992).

In addition, “some of the cases could not be georeferenced to [census block groups or census tracts]. These cases were then allocated proportionally among the block groups, so that some of the resulting disease counts are not necessarily integers” (L. Waller and Gotway 2004).

In L. Waller and Gotway (2004), the numbers of cases per block group were aggregated to census tracts.


Brief summary of the data:

  • leukemia incidence
  • cases aggregated over 1978-1982
  • 281 areas (census tracts) covering 8 counties of upstate New York, United States
  • 2 spatial covariates (exposure potential; percentage of the population aged 65 or more)

The data as presented in L. Waller and Gotway (2004), and corresponding shapefile, are available from the spData R package. See ?spData::nydata for help.

2.3.2 The Shapefile

# Import the shapefile from the spData package
map <- readOGR(system.file("shapes/NY8_utm18.shp", package = "spData")[1])

# Simplify geometry
map <- rmapshaper::ms_simplify(map, keep = 0.05, keep_shapes = TRUE) # Destroys order

# Quick plot
plot(map)

# Fortify for plotting
map.df <- fortify(map)

# Make shapefile dataframe IDs numeric
N <- length(map)
map.df$id <- as.numeric(map.df$id)
if(!all(range(map.df$id) == c(1, N))){
    map.df$id <- as.numeric(map.df$id) + 1  # Need to add 1 so range is 1:N
}
stopifnot(all(unique(map.df$id) == 1:N))

Note: do not use the newer version, readOGR(system.file("shapes/NY8_bna_utm18.gpkg", package = "spData")[1]) as it does not compute the spatial weights matrix correctly when using nb2mat and poly2nb.

To create the map border layer, we must first fix some of the topology. Trying to call unionSpatialPolygons() now will result in a TopologyException error. We fix this problem using the gBuffer() function.

# Fix topology issues then create border layer
map.border <- gBuffer(map, byid = TRUE, width = 0) %>%
    unionSpatialPolygons(IDs = rep(1, N))

# Quick plot
plot(map.border)

2.3.3 The Data

There are two ways to extract the data, either from the shapefile, or directly from the spData package.

# Extract data from the R package
data <- spData::nydata

# OR
# Extract data from the shapefile
data <- as.data.frame(map)
head(data, 3)
##          AREANAME     AREAKEY        X        Y POP8 TRACTCAS  PROPCAS PCTOWNHOME PCTAGE65P        Z
## 1 Binghamton city 36007000100 4.069397 -67.3533 3540     3.08 0.000870  0.3277311 0.1466102  0.14197
## 2 Binghamton city 36007000200 4.639371 -66.8619 3560     4.08 0.001146  0.4268293 0.2351124  0.35555
## 3 Binghamton city 36007000300 5.709063 -66.9775 3739     1.09 0.000292  0.3377396 0.1380048 -0.58165
##    AVGIDIST PEXPOSURE   Cases       Xm       Ym   Xshift  Yshift
## 1 0.2373852  3.167099 3.08284 4069.397 -67353.3 423391.0 4661502
## 2 0.2087413  3.038511 4.08331 4639.371 -66861.9 423961.0 4661993
## 3 0.1708548  2.838229 1.08750 5709.063 -66977.5 425030.6 4661878

The second method contains more precise values of the fractional cases (given by the Cases column), so we will use this.

There are a lot of unnecessary variables here. Here is a succinct version:

data <- data[,-c(1, 3:4, 6:8, 10:11, 14:17)]
head(data)
##       AREAKEY POP8 PCTAGE65P PEXPOSURE   Cases
## 1 36007000100 3540 0.1466102  3.167099 3.08284
## 2 36007000200 3560 0.2351124  3.038511 4.08331
## 3 36007000300 3739 0.1380048  2.838229 1.08750
## 4 36007000400 2784 0.1188937  2.643366 1.06515
## 5 36007000500 2571 0.1415791  2.758587 3.06017
## 6 36007000600 2729 0.1410773  2.848411 1.06386

A data dictionary is provided from the help file, ?spData::nydata. The main variables are:

  • POP8: population (from 1980 census).
  • PCTAGE65P: percentage of the population aged 65 or more (the risk of most leukemias increases with age)
  • PEXPOSURE: potential exposure measured as the inverse distance, say \(d\), between each census tract centroid and the nearest TCE site, then transformed via \(\log(100 \times d)\) (a small worked example follows this list).
  • Cases: number of leukemia cases.
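
As a small worked example of the PEXPOSURE transform above (purely illustrative numbers): a census tract whose centroid is 10 distance units from the nearest TCE site has inverse distance \(d = 0.1\), giving an exposure of \(\log(100 \times 0.1) = \log(10) \approx 2.3\), which is in the same ballpark as the values shown in the data above.

d <- 1 / 10      # Inverse distance to the nearest TCE site (illustrative)
log(100 * d)     # PEXPOSURE, approximately 2.30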

For leukemia, the entire population is considered to be at-risk. To create the expected counts, we rescale this population-at-risk so it has the same sum as the observed counts. This is called internal standardisation (Pascutto et al. 2000). Subsequently, we add the raw SIR, i.e. the observed divided by expected counts.

data$expected <- data$POP8 * sum(data$Cases) / sum(data$POP8)
data$raw <- data$Cases / data$expected

Let us rename some of these variables, and change the area IDs to be sequentially numbered from 1 to \(N\).

# Remove population at risk now we have raw SIR
data$POP8 <- NULL

# Rename variables
names(data)[1:4] <- c("area.ID", "age", "exposure", "observed")

# Renumber areas
data$area.ID <- 1:N

Example plots:

# Truncate raw SIR values (for plotting purposes)
raw.clipped <- Clip(data$raw, 0.125, 8)

# Truncate log-raw SIR values (for plotting purposes)
log.raw.clipped <- Clip(log(data$raw), -3, 3)

# Dataframe of the variables we need to append to the shapefile
Append <- data.frame(
    id = data$area.ID,
    obs = data$observed,
    exp = data$expected,
    x1 = data$exposure,
    x2 = data$age,
    raw.SIR = raw.clipped,
    log.raw.SIR = log.raw.clipped
)

# Fill values for observed and expected values
maxes <- c(max(Append$obs), max(Append$exp))
oe.min.max <- min(maxes)
oe.max.max <- max(maxes)
Fill.values.OE <- c(seq(0, oe.min.max, length = 5), oe.max.max)

# Fill values for covariate(s) 
Fill.values.x1 <- c(0, max(Append$x1))
Fill.values.x2 <- c(0, max(Append$x2))

# Fill values for log-raw SIR
Lims <- Append$log.raw.SIR %>% range %>% abs
Lims <- Lims[Lims %>% which.max]
Fill.values.log.SIR <- seq(-Lims, Lims, length = 5)

# Rescaled fill values
Fill.values.obs.r <-     rescale(Fill.values.OE, from = range(Append$obs))
Fill.values.exp.r <-     rescale(Fill.values.OE, from = range(Append$exp))
Fill.values.x1.r <-      rescale(Fill.values.x1, from = range(Append$x1))
Fill.values.x2.r <-      rescale(Fill.values.x2, from = range(Append$x2))
Fill.values.SIR.r <-     rescale(Fill.values.SIR, from = range(log2(Append$raw.SIR)))
Fill.values.log.SIR.r <- rescale(Fill.values.log.SIR, from = range(Append$log.raw.SIR))

# Merge values with shapefile dataframe
dat <- inner_join(map.df, Append, by = "id")

# Set up basic components of the plot
gg.base <- ggplot(data = NULL, aes(x = long, y = lat, group = group)) +
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    scale_x_continuous(expand = c(0.05, 0.05)) +
    scale_y_continuous(expand = c(0.05, 0.05)) +
    theme_void() +
    theme(legend.position = "bottom") +
    guides(fill = guide_colourbar(barwidth = 9)) +
    coord_fixed() # Keeps x-y ratio consistent

# Add borders and fill colours; customise legends
gg.obs <- gg.base + 
    geom_polygon(data = dat, aes(fill = obs), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.OE, 
        values = Fill.values.obs.r,
        breaks = seq(0, 10, 2)) +
    ggtitle("Observed")

gg.exp <- gg.base + 
    geom_polygon(data = dat, aes(fill = exp), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.OE, 
        values = Fill.values.exp.r,
        breaks = seq(0, 10, 2)) +
    ggtitle("Expected")

gg.x1 <- gg.base + 
    geom_polygon(data = dat, aes(fill = x1), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.x[c(3, 5)], 
        values = Fill.values.x1.r) +
    ggtitle("Exposure")

gg.x2 <- gg.base + 
    geom_polygon(data = dat, aes(fill = x2), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.x[c(3, 5)], 
        values = Fill.values.x2.r) +
    ggtitle("Percent Age 65+")

gg.SIR.raw <- gg.base + 
    geom_polygon(data = dat, aes(fill = log2(raw.SIR)), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.SIR, 
        values = Fill.values.SIR.r,
        limits = range(log2(Append$raw.SIR)),
        breaks = log2(Breaks.SIR), 
        labels = Breaks.SIR) +
    ggtitle("Raw SIR")

gg.log.SIR <- gg.base + 
    geom_polygon(data = dat, aes(fill = log.raw.SIR), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.log.SIR, 
        values = Fill.values.log.SIR.r,
        breaks = seq(-5, 5, 1), 
        labels = seq(-5, 5, 1)) +
    ggtitle("log-Raw SIR")

# Render the plots
grid.arrange(
    gg.obs, gg.exp, gg.x1, gg.x2, gg.log.SIR, gg.SIR.raw, 
    nrow = 2
)

2.3.4 Notes

Recall that some of the observed cases are not integers because of limitations in how the cases were georeferenced (L. Waller and Gotway 2004). If you wish to model this data using a Poisson likelihood or otherwise assume discrete cases, you need to deal with these fractional cases. One option is to round. However, there are many cases near zero which would be rounded to zero (and this would decrease the total number of cases from 592 to 573). A better solution is to first add a small amount to each observed case before rounding.

y.old <- data$observed
y.new <- round(data$observed + 0.205)
plot(y.old, y.new)

The number 0.205 was chosen so that the new total number of cases is essentially unchanged (now 593).
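
A quick check of the totals before and after rounding:

sum(y.old)   # Should be (close to) the 592 quoted in the overview
sum(y.new)   # Essentially unchanged after rounding with the offset (593)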

Another issue is the sparsity and number of areas. There are 281 areas in this data set, but few cases. There may be more areas than desired for an analysis and/or too few cases. One option is to combine areas into larger areas. Alternatively, some areas could be dropped.

2.3.4.1 Adding Cartographic Systems

First, to either merge or drop areas, we need information about the larger cartographic systems. The shapefile contains 64 unique area names, AREANAME, as well as the AREAKEY variable. The former does not appear to be very useful when mapped. However, AREAKEY contains useful information for this purpose.

  • The first 5 digits uniquely identify counties
  • The first 8 digits uniquely identify county subdivisions (I’m unsure if this is the correct terminology, but it represents areas nested within counties - see https://www2.census.gov/geo/pdfs/reference/geodiagram.pdf for hierarchy of census geographic entities).

2.3.4.2 Merging areas

This example illustrates merging all the census tracts into county subdivisions:

ID <- substring(map$AREAKEY, 1, 8)

# Create new map of county subdivisions
map.2 <- gBuffer(map, byid = TRUE, width = 0) %>%
    unionSpatialPolygons(IDs = ID)

# Quick plot
plot(map.2)

This new map has 31 areas.
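
If you merge areas like this, the data should be aggregated to the same IDs. A sketch (observed and expected counts can simply be summed; covariates such as age and exposure would instead need something like population-weighted averages):

# Aggregate counts to the 31 county subdivisions
data.2 <- aggregate(
    data[, c("observed", "expected")],
    by = list(area.ID = ID),
    FUN = sum
)
head(data.2)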

2.3.4.3 Dropping areas

This example illustrates dropping all the census tracts except those in the 3 northern-most counties (Cayuga, Onondaga, and Madison):

ID <- substring(map$AREAKEY, 1, 5) %>% unique

# Create new map of county subdivisions
map.2 <- map[substring(map$AREAKEY, 1, 5) %in% ID[c(2, 5:6)],]

# Quick plot
plot(map.2)

These three counties comprise 176 areas (census tracts).

3 Less Well-known Data Sets

3.1 Pennsylvania Lung Cancer

This data set comprises lung cancer incidence by county in Pennsylvania, United States, 2002. Also included is the census-based population and the proportion of smokers in each county.

This data set is described and analysed in Moraga (2018).


Brief summary of the data:

  • Lung cancer incidence
  • 67 areas (counties in Pennsylvania)
  • cases recorded in 2002
  • 1 spatial covariate (proportion of smokers)
  • stratified by race (2 levels), gender (2 levels), and broad age groups (4 levels)
  • not much significant spatial variation or deviation in raw SIR from 1, unless data are stratified

The data and shapefile are available from the SpatialEpi R package.

3.1.1 The Shapefile

# Load the R library
library(SpatialEpi)

# Load the data into the global environment
data(pennLC)

# Create the shapefile from the SpatialEpi package
map <- pennLC$spatial.polygon
plot(map)

Create the border layer:

# Create border map from full map
N <- length(map)
map.border <- unionSpatialPolygons(map, IDs = rep(1, N))
map.border <- fortify(map.border)

# Fortify
map.df <- fortify(map)

Here, I leave the area IDs as strings (names of the counties).

3.1.2 The Data

The pennLC object is a list with four elements:

  • geo: the longitude and latitude of the centroid of each county.
  • data: the observed cases and population stratified by race, gender, and age.
  • smoking: proportion of smokers by county.
  • spatial.polygon: a SpatialPolygons object (used to create the map above).

The data of interest are the combination of the second and third elements. We could combine these using a left join.

data <- left_join(pennLC$data, pennLC$smoking, by = "county")
head(data)
##   county cases population race gender      age smoking
## 1  adams     0       1492    o      f Under.40   0.234
## 2  adams     0        365    o      f    40.59   0.234
## 3  adams     1         68    o      f    60.69   0.234
## 4  adams     0         73    o      f      70+   0.234
## 5  adams     0      23351    w      f Under.40   0.234
## 6  adams     5      12136    w      f    40.59   0.234

However, the variable smoking should not be interpreted as strata-specific – it is only a county-wide statistic. To avoid confusion, we could leave the strata-specific data set pennLC$data as is, and add the aggregated cases and population to the smoking data set for a county-wide data set.

# Aggregate observed cases by county
y <- aggregate(pennLC$data$cases, by = list(county = pennLC$data$county), FUN = sum)
pop <- aggregate(pennLC$data$population, by = list(county = pennLC$data$county), FUN = sum)

# Add to smoking
data <- data.frame(
    pennLC$smoking, 
    observed = y$x, 
    pop = pop$x
)

Now we can compute and add the expected counts and raw SIR

data$expected <- data$pop / sum(data$pop) * sum(data$observed)
data$raw <- data$observed / data$expected

# Remove population-at-risk
data$pop <- NULL

Similarly, for the stratified data:

data.stratified <- pennLC$data
data.stratified$expected <- data.stratified$population / 
    sum(data.stratified$population) * sum(data.stratified$cases)
data.stratified$raw <- data.stratified$cases / data.stratified$expected

# Set raw-SIR to zero if expected cases are zero
data.stratified$raw[which(data.stratified$expected == 0)] <- 0

Here are some example plots of the unstratified data:

Here are some example plots of the stratified data:

data <-  data.stratified

# Merge strata into a single indicator variable (factor)
levels(data$age) <- c("40-59", "60-69", "70+", "<40")
strata <- paste(
        as.character(data$race), 
        as.character(data$gender), 
        as.character(data$age)
    ) %>%
    factor()
strata <- factor(strata, levels = levels(strata)[c(
    1, 5, 2, 6, 3, 7, 4, 8,
    9, 13, 10, 14, 11, 15, 12, 16
)])
data$strata <- strata

# Truncate raw SIR values (for plotting purposes)
raw.clipped <- Clip(data$raw, 0.125, 8)

# Truncate log-raw SIR values (for plotting purposes)
log.raw.clipped <- Clip(log(data$raw), -3, 3)

# Dataframe of the variables we need to append to the shapefile
Append <- data.frame(
    id = data$county,
    obs = data$cases,
    exp = data$expected,
    strata = data$strata,
    raw.SIR = raw.clipped,
    log.raw.SIR = log.raw.clipped
)
Append$id <- as.character(Append$id)

# Fill values for observed and expected values
maxes <- c(max(data$cases), max(data$expected))
oe.min.max <- min(maxes)
oe.max.max <- max(maxes)
Fill.values.OE <- c(seq(0, oe.min.max, length = 5), oe.max.max)

# Fill values for log-raw SIR
Lims <- Append$log.raw.SIR %>% range %>% abs
Lims <- Lims[Lims %>% which.max]
Fill.values.log.SIR <- seq(-Lims, Lims, length = 5)

# Rescaled fill values
Fill.values.obs.r <-     rescale(Fill.values.OE, from = range(Append$obs))
Fill.values.exp.r <-     rescale(Fill.values.OE, from = range(Append$exp))
Fill.values.SIR.r <-     rescale(Fill.values.SIR, from = range(log2(Append$raw.SIR)))
Fill.values.log.SIR.r <- rescale(Fill.values.log.SIR, from = range(Append$log.raw.SIR))

# Merge values with shapefile dataframe
dat <- inner_join(map.df, Append, by = "id")

# Set up basic components of the plot
gg.base <- ggplot(data = NULL, aes(x = long, y = lat, group = group)) +
    scale_x_continuous(expand = c(0.05, 0.05)) +
    scale_y_continuous(expand = c(0.05, 0.05)) +
    theme_void() +
    theme(
        legend.position = "bottom",
          strip.text = element_text(angle = 90)
    ) +
    guides(fill = guide_colourbar(barwidth = 8)) +
    coord_map() + # Keeps x-y ratio consistent
    facet_grid(strata ~ .)

# Add borders and fill colours; customise legends
gg.obs <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = obs), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.OE, 
        values = Fill.values.obs.r,
        breaks = seq(0, 1000, 100)) +
    ggtitle("Observed") +
    theme(strip.text = element_text(colour = "white"))

gg.exp <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = exp), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.OE, 
        values = Fill.values.exp.r,
        breaks = seq(0, 1000, 100)) +
    ggtitle("Expected") +
    theme(strip.text = element_text(colour = "white"))

gg.log.SIR <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = log.raw.SIR), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.log.SIR, 
        values = Fill.values.log.SIR.r,
        # limits = range(Append$log.raw.SIR),
        breaks = seq(-6, 6, 2), 
        labels = seq(-6, 6, 2)) +
    ggtitle("log-Raw SIR") +
    theme(strip.text = element_text(colour = "white"))

gg.SIR.raw <- gg.base + 
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = log2(raw.SIR)), color = NA) +
    scale_fill_gradientn("", 
        colours = Fill.colours.SIR, 
        values = Fill.values.SIR.r,
        limits = range(log2(Append$raw.SIR)),
        breaks = log2(Breaks.SIR), 
        labels = Breaks.SIR) +
    ggtitle("Raw SIR") 

# Render the plots
grid.arrange(
    gg.obs, gg.exp, gg.log.SIR, gg.SIR.raw, 
    nrow = 1
)

Note that the colours denoting SIR are relative to the entire data set, that is, all strata. This is why the maps for the “<40” age group appear to be blue for all areas; lung cancer incidence is much lower for this stratum compared to any of the other strata. If we subset the data for a particular stratum, and recalculate the expected counts and SIR for that particular stratum, the colours will adjust accordingly. Alternatively, instead of subsetting the data for a particular stratum, we can compute the stratum-specific expected counts and SIR and still plot all strata.

# Compute the expected counts and SIR for each stratum
for(i in seq_along(levels(data$strata))){
    r <- which(data$strata == levels(data$strata)[i])
    data[r,]$expected <- data[r,]$population / sum(data[r,]$population) * sum(data[r,]$cases)
    data[r,]$raw <- data[r,]$cases / data[r,]$expected
}

# Set SIR to zero when expected is zero
data$raw[which(data$expected == 0)] <- 0

Now we create the plots as before (except the colour scheme for the observed and expected counts is capped at 100; see the sketch below). By plotting the stratum-specific expected counts and raw SIR, we see quite different patterns emerge. The difference now is that the SIR in each map should be interpreted relative to the other areas in that map, not relative to the other maps. That is, each row of plots should be interpreted individually.
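
Here is a minimal sketch of how the capped observed-count map might be re-created. It assumes Append (and hence dat) has been rebuilt from the updated data and merged with the shapefile exactly as before; the cap is applied by squishing values above 100 to the top of the colour scale rather than dropping them.

# Sketch only: observed counts with the colour scheme capped at 100.
# Assumes gg.base, map.border, dat, and Fill.colours.OE exist as defined above.
gg.obs.strata <- gg.base +
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(data = dat, aes(fill = obs), color = NA) +
    scale_fill_gradientn("",
        colours = Fill.colours.OE,
        limits = c(0, 100),       # cap the colour scheme at 100
        oob = scales::squish,     # values above 100 take the top colour
        breaks = seq(0, 100, 25)) +
    ggtitle("Observed") +
    theme(strip.text = element_text(colour = "white"))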

We can also compute the expected counts and SIR for any combination of strata, e.g. for each age group (regardless of race or gender), or for each age group after subsetting to whites only. Here is an example where we compute the age group-specific SIR after subsetting the data to males only (and pooling over race):

data <- data[which(data$gender == "m"),]

# Compute the expected counts and SIR for each age group
for(i in seq_along(levels(data$age))){
    r <- which(data$age == levels(data$age)[i])
    data[r,]$expected <- data[r,]$population / sum(data[r,]$population) * sum(data[r,]$cases)
    data[r,]$raw <- data[r,]$cases / data[r,]$expected
}

# Set SIR to zero when expected is zero
data$raw[which(data$expected == 0)] <- 0

3.2 500 Cities

The 500 Cities Project (https://www.cdc.gov/500Cities/) is a collaboration between the Centers for Disease Control and Prevention (CDC), the CDC Foundation, and the Robert Wood Johnson Foundation. The project’s aim is to make publicly available estimates for many health outcomes and risk factors, including arthritis, high cholesterol, diabetes, chronic kidney disease, and coronary heart disease, covering the largest 500 cities in the United States at a small area level (census tracts).

The project also conveniently provides a shapefile for the associated areas.

This is a very rich data set. However, there are two potential drawbacks:

  1. The data are model-based estimates. As a result, the values are typically spatially smooth (autocorrelated).
  2. The existence of previous analyses for the same study region and variables of interest is unlikely. This limits the ability to validate the results.

Brief summary of the data:

  • Numerous health-related variables to analyse
  • Numerous study regions to choose from
  • Values represent model-based estimates

3.2.1 The Shapefile

The shapefile can be downloaded from the 500 Cities Project website.

  1. Go to https://chronicdata.cdc.gov/browse?category=500+Cities.
  2. Scroll down and click on “500 Cities: Census Tract Boundaries”.
  3. Click the download button (direct link).
  4. Unzip the files to a suitable directory. Note that I saved the data in a subfolder “Downloaded Data” within the working directory, and removed the first nested folder after unzipping.
# Import the shapefile
map <- readOGR("Downloaded Data/500 Cities/500Cities_Tracts_Clip.shp", verbose = FALSE)

This covers many disjoint regions. To get a sense of how these disjoint regions look, you can view the data on the website: go to https://www.cdc.gov/500cities/ and scroll down to the Local Data Widget. Select all states and DC and click Go. When the map first loads, you will just see coloured dots denoting the selected measure (arthritis by default). As you zoom in, this will eventually change to small area polygons (census tracts). Here is a screenshot showing the two levels of detail for arthritis:

For simplicity, we will focus only on the city of Baltimore, Maryland (pictured above). The following code demonstrates how to subset the shapefile for this region:

# Subset the map for Baltimore only
map <- map[map$PlaceName == "Baltimore",]

# Quick plot
plot(map)

Other regions can be defined in a similar manner. Have a look at unique(map$PlaceName) for a list of region names.

Create the border layer:

# Create border map from full map
N <- length(map)
map.border <- unionSpatialPolygons(map, IDs = rep(1, N))
map.border <- fortify(map.border)

# Fortify
map.df <- fortify(map)

# Make shapefile dataframe IDs numeric
map.df$id <- as.numeric(map.df$id)

3.2.2 The Data

The data are available in CSV format:

  1. Go to https://chronicdata.cdc.gov/browse?category=500+Cities.
  2. Scroll down and click on “500 Cities: Census Tract-level Data (GIS Friendly Format), 2019 release”.
  3. Up the top, click Export, then click CSV for Excel. Save the file (direct link).
  4. Note that I have saved the data in a subfolder “Downloaded Data” within the working directory.
data <- read.csv("Downloaded Data/500 Cities/500_Cities__Census_Tract-level_Data__GIS_Friendly_Format___2019_release.csv")

This data.frame has 27210 rows and 63 columns. A data dictionary can be found on the webpage in step 2 above.
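
As a quick sanity check, the dimensions can be verified:

# Check the dimensions of the full data set (27210 rows and 63 columns, as stated above)
dim(data)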

The following code will subset this data to match the subsetted shapefile (i.e. for Baltimore only):

data <- subset(data, PlaceName == "Baltimore")

Note that the map has 200 areas but the subsetted data contains only 199 records. Before we can merge the data with the shapefile, we need a common key. The fortified shapefile IDs, map.df$id, are five-digit values which do not appear in the data; they are taken from the row.names of the shapefile. We can append these to the data before merging, but first we need to determine which record is missing.

missing.record <- which(!(unique(map$tract2010) %in% data$TractFIPS))
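
One way to construct the common key, shown here only as a sketch, is to match each record to its shapefile row via the tract FIPS code (this assumes tract2010 and TractFIPS are directly comparable, e.g. both numeric). Matching avoids relying on the data and the shapefile sharing the same row order, and the unmatched area identified above simply receives no data.

# Common key: for each data record, find the matching shapefile row via the
# tract FIPS code and take that row's name (i.e. the fortified ID)
id <- as.numeric(row.names(map))[match(data$TractFIPS, map$tract2010)]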

For illustration, let’s assume asthma prevalence is the health outcome of interest and use smoking prevalence as a risk factor (covariate). As might be expected, these two variables are quite highly correlated, and thus smoking should be a fairly useful predictor, even though it has been aggregated to the small-area level (i.e. it is not an individual-level covariate).

cor(data$CSMOKING_CrudePrev, data$CASTHMA_CrudePrev)
## [1] 0.8778579
plot(data$CSMOKING_CrudePrev, data$CASTHMA_CrudePrev)
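
As an aside, the spatial smoothness of these model-based estimates (drawback 1 noted at the start of Section 3.2) can be checked empirically. Here is a minimal sketch of a Moran’s I test; it assumes the spdep package (not among the general libraries loaded earlier) and matches polygons to data rows via the tract FIPS code as above.

# Sketch only: Moran's I for the asthma estimates
library(spdep)

# Align the shapefile polygons to the data rows via the tract FIPS code
map.sub <- map[match(data$TractFIPS, map$tract2010), ]

# First-order adjacency weights; zero.policy allows areas with no neighbours
W <- nb2listw(poly2nb(map.sub), zero.policy = TRUE)
moran.test(data$CASTHMA_CrudePrev, W, zero.policy = TRUE)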

The asthma and smoking variables both relate to adults aged 18 years or more, and the population variable is based on the 2010 census (whether this includes those under 18 is unclear). However, for illustration, suppose this is a reasonable approximation of the population-at-risk. From this, we can compute and add the expected counts and raw SIR as usual.

# population-at-risk
pop <- gsub(",", "", data$Population2010) %>% 
    as.numeric

# Set up data of interest
data <- data.frame(
    observed = data$CASTHMA_CrudePrev, 
    x = data$CSMOKING_CrudePrev,
    pop = pop
)

data$expected <- data$pop / sum(data$pop) * sum(data$observed)
data$raw <- data$observed / data$expected

# Remove population-at-risk
data$pop <- NULL

Here are some example plots of the data:
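
One such plot, shown here as a minimal sketch, maps the raw asthma SIR. It assumes the id key constructed earlier, and relies on the reduced data frame above preserving the original row order so that the key can simply be re-attached before merging with the fortified shapefile.

# Re-attach the common key and merge with the fortified shapefile
data$id <- id
dat <- inner_join(map.df, data, by = "id")

# Map of the raw SIR, using a diverging colour scale centred on SIR = 1
ggplot(dat, aes(x = long, y = lat, group = group)) +
    geom_polygon(data = map.border, fill = "grey30", color = "black", size = 0.8) +
    geom_polygon(aes(fill = raw), color = NA) +
    scale_fill_gradient2("Raw SIR", low = "blue", mid = "white", high = "red",
        midpoint = 1) +
    theme_void() +
    theme(legend.position = "bottom") +
    coord_map()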

4 Miscellaneous

List of potentially useful websites:

If you have any suggestions for open data sets, please contact me at earl.w.duncan@gmail.com.

References

Atkinson, D. 1978. Epidemiology of Sudden Infant Death in North Carolina: Do Cases Tend to Cluster? PHSB Studies, No. 16. Raleigh, North Carolina: N. C. Department of Human Resources, Division of Health Services, Public Health Statistics Branch.

Banerjee, S., B. P. Carlin, and A. E. Gelfand. 2014. Hierarchical Modeling and Analysis for Spatial Data. 2nd ed. Boca Raton: CRC Press/Chapman & Hall. Monographs on Statistics and Applied Probability 135.

Breslow, N. E., and D. G. Clayton. 1993. “Approximate inference in generalised linear mixed models.” Journal of the American Statistical Association 88 (421): 9–25. doi:10.1080/01621459.1993.10594284.

Clayton, D., and J. Kaldor. 1987. “Empirical Bayes estimates of age-standardized relative risks for use in disease mapping.” Biometrics 43 (3): 671–81. doi:10.2307/2532003.

Cressie, N. A. C. 1993. Statistics for Spatial Data. 2nd ed. New York: Wiley.

Cressie, N., and N. H. Chan. 1989. “Spatial modelling of regional variables.” Journal of the American Statistical Association 84 (406): 393–401. https://niasra.uow.edu.au/cei/UOW232444.html.

Cressie, N., and R. C. Read. 1985. “Do sudden infant deaths come in clusters?” Statistics and Decisions, Supplement Issue 2: 333–49. https://niasra.uow.edu.au/cei/UOW232444.html.

———. 1989. “Spatial data analysis of regional counts.” Biometrical Journal 31 (6): 699–719. doi:10.1002/bimj.4710310607.

Duncan, E. W., N. M. White, and K. Mengersen. 2017. “Spatial smoothing in Bayesian models: a comparison of weights matrix specifications and their impact on inference.” International Journal of Health Geographics 16 (1): 47. doi:10.1186/s12942-017-0120-x.

Kemp, I., P. Boyle, M. Smans, and C. Muir. 1985. Atlas of cancer in Scotland, 1975-1980. Incidence and Epidemiologic Perspective, IARC Scientific Publication 72. Lyon: International Agency for Research on Cancer.

Leroux, B. G., X. Lei, and N. Breslow. 2000. “Estimation of disease rates in small areas: a new mixed model for spatial dependence.” Statistical Models in Epidemiology, the Environment, and Clinical Trials 116: 179–91. doi:10.1007/978-1-4612-1284-3_4.

Moraga, P. 2018. “Small area disease risk estimation and visualization using R.” The R Journal 10 (1): 495–506. doi:10.32614/RJ-2018-036.

Pascutto, C., J. C. Wakefield, N. G. Best, S. Richardson, L. Bernardinelli, A. Staines, and P. Elliot. 2000. “Statistical issues in the analysis of disease mapping data.” Statistics in Medicine 19 (17-18): 2493–2519. doi:10.1002/1097-0258(20000915/30)19:17/18<2493::aid-sim584>3.0.co;2-d.

Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde. 2002. “Bayesian measures of model complexity and fit (with discussion).” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4): 583–640. doi:10.1111/1467-9868.00353.

Symons, M. J., R. C. Grimson, and Y. C. Yuan. 1983. “Clustering of rare events.” Biometrics 39 (1): 193–205.

Turnbull, B. W., E. J. Iwano, W. S. Burnett, H. L. Howe, and L. C. Clark. 1990. “Monitoring for clusters of disease: application to leukemia incidence in upstate New York.” American Journal of Epidemiology 132 (Supp 1): S136–S143. doi:10.1093/oxfordjournals.aje.a115775.

Waller, L. A., B. W. Turnbull, L. C. Clark, and P. Nasca. 1992. “Chronic disease surveillance and testing of clustering of disease and exposure: Application to leukemia incidence and TCE-contaminated dumpsites in upstate New York.” Environmetrics 3 (3): 281–300. doi:10.1002/env.3170030303.

———. 1994. “Spatial Pattern Analysis to Detect Rare Disease Clusters.” In Case Studies in Biometry, edited by N. Lange, L. Ryan, L. Billard, D. Brillinger, L. Conquest, and J. Greenhouse, 3–23. New York: John Wiley & Sons.

Waller, L., and C. Gotway. 2004. Applied Spatial Statistics for Public Health Data. New York: John Wiley & Sons.


  1. These distance-based neighbours “correspond almost precisely to another set of neighbours defined by those counties sharing a boundary with the county under consideration” (i.e. first-order adjacency neighbours) (N. Cressie and Read 1989, 700). Note that this list of neighbours includes the IDs of the area in question. By definition, an area is not a neighbour of itself, so these IDs need to be removed before any modelling.
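
For example, if the neighbour IDs were stored as a list with one integer vector per area (a hypothetical structure named nb.list here; adapt to however the neighbour data were actually read in), the self-IDs could be removed as follows:

# Remove each area's own ID from its vector of neighbour IDs
# (nb.list is a hypothetical list of neighbour ID vectors, in area order)
nb.list <- lapply(seq_along(nb.list), function(i) setdiff(nb.list[[i]], i))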