Exploring the neighbourhood-level correlates of Covid-19 deaths in London

Tutorial for using a difference across spatial boundaries method

Richard Harris

October 2020

Abstract

This is a companion document to the paper, Exploring the neighbourhood-level correlates of Covid-19 deaths in London using a difference across spatial boundaries method, accepted for publication in the journal Health & Place. It includes the data used in the paper and shows how to reproduce the models found in it. A note on using connected graphs to visualise co-linearity is included.

Context

The paper to which this is a companion document explores neighbourhood-level correlates of the Covid-19 deaths in London during the initial wave of the pandemic within the UK. In examining these, the paper focuses on localised differences in the number of deaths, putting forward an innovative method of analysis that looks at the differences between places that share a border. Specifically, a difference across spatial boundaries method is advanced to consider whether a higher number of deaths in one neighbourhood, when compared to its neighbours, is related to other differences between those contiguous locations. It is also used to map localised ‘hot spots’ of the disease and to look for spatial variation in the regression coefficients. This tutorial provides the data and shows how to reproduce the analyses. A draft version of the paper is available here.

Loading the data

Assuming R has been installed on your computer (optionally – but recommended – with R Studio as the interface), the data from the paper can be downloaded and read-in using the fast read function in the data.table package, and manipulated using the tidyverse set of packages for data science in R. The following chunk of code will load the packages, first installing them if required.

while(!eval(require(data.table))) install.packages("data.table")
while(!eval(require(tidyverse))) install.packages("tidyverse")
if(!file.exists("london-data.csv")) download.file("https://www.dropbox.com/s/1s4qxmm3896ge7i/london_all.csv?dl=1", "london-data.csv")
df <- fread("london-data.csv") %>%
  as_tibble(.)

About the data

The data contain all the variables used in the paper. Each is a measure of difference, not the observed value for a specific location alone. Where $\text{lag} = 1$, the difference is between an MSOA (Middle Level Super Output Areas – a census zone) and a first order neighbour (i.e. a place it shares a boundary with). Where $\text{lag} = 2$, it is the difference between an MSOA and a second order neighbour (the neighbour of a neighbour), and so forth. The data are for two periods. The first is from March 1 to April 17, 2020. The second is from April 18 to May 31.

Most of the paper focuses on the differences between contiguous neighbours (i.e. $\text{lag} = 1$), in the first period ($ = 1 $). To subset the data accordingly:

dfsub <- df %>%
  filter(lag == 1, period == 1)

It is instructive to view the top of the data and note their structure:

dfsub %>%
  arrange(MSOA11CD1) %>%
  print(n = 6)

## # A tibble: 2,808 x 66
##       d MSOA11CD1 MSOA11CD2  Adults NCOVIDRATE carehomes density TWENTIES
##   <int> <chr>     <chr>       <dbl>      <dbl>     <dbl>   <dbl>    <dbl>
## 1     1 E02000003 E02000002  1.81       -0.846    -1.17    0.272    0.275
## 2     0 E02000004 E02000007 -1.23        0.748     1.35   -0.591    0.502
## 3     3 E02000004 E02000011 -0.0103     -0.865     0.165  -0.340    0.435
## 4     2 E02000004 E02000484 -1.07        1.09      2.32   -0.357    0.507
## 5     6 E02000004 E02000489 -0.954       0.569     2.49   -0.193    0.849
## 6     2 E02000005 E02000003 -0.746      -0.996     0       0.717   -0.261
## # ... with 2,802 more rows, and 58 more variables: THIRTIES <dbl>,
## #   FORTIES <dbl>, FIFTIES <dbl>, SIXTIES <dbl>, SEVENTIES <dbl>,
## #   EIGHTYplus <dbl>, WBRI <dbl>, WOTH <dbl>, INDIAN <dbl>, PAKISTANI <dbl>,
## #   BANG <dbl>, CHINESE <dbl>, AOTH <dbl>, BAFR <dbl>, BCRB <dbl>, BOTH <dbl>,
## #   MIXED <dbl>, EU <dbl>, EUROTH <dbl>, AFRICAN <dbl>, ASIAN <dbl>, RoW <dbl>,
## #   GROUP1 <dbl>, GROUP2 <dbl>, GROUP3 <dbl>, GROUP4 <dbl>, GROUP5 <dbl>,
## #   GROUP6 <dbl>, GROUP7 <dbl>, GROUP8 <dbl>, HPRICE <dbl>, INCOME <dbl>,
## #   EDU <dbl>, HEALTH <dbl>, HOUSING <dbl>, DETACHED <dbl>, SEMI <dbl>,
## #   TERRACED <dbl>, FLATS <dbl>, HHLD1 <dbl>, HHLD2 <dbl>, HHLD3 <dbl>,
## #   HHLD4 <dbl>, HHLD5 <dbl>, HHLD6 <dbl>, HHLD7 <dbl>, HHLD8 <dbl>,
## #   HIGHOCC <dbl>, SHARED3plus <dbl>, LAU210NM <chr>, LAU110NM <chr>,
## #   NUTS306NM <chr>, NUTS106NM <chr>, TCITY15NM <chr>, lag <int>, w <dbl>,
## #   period <int>, dates <chr>

Here, the MSOA, E02000003 had d = 1 more Covid-19 deaths than its neighbour, E02000004, possibly because more people live in the former (this can be seen by the number of Adults variable being positive). The definitions of the variables can be found in Table 1 of the paper. Note that the table describes the ‘raw’ variables. Remember that, in the data, it is the difference in their value for neighbouring MSOAs that is recorded. This is true for all the variables: the dependent variable, the predictor variables, and the controls. To assist in those models’ interpretation, each of the raw variables, except the dependent variable, is scaled prior to calculating the differences (converted to a z-score, with a mean of zero for London).

It may be seen in the output above that MSOA, E02000004 appears four times in the data column MSOA11CD1. That is because it has a higher number of deaths than four of its neighbours. If we sort on another column, MSOA11CD2, …

dfsub %>%
  arrange(MSOA11CD2) %>%
  select(d, MSOA11CD1, MSOA11CD2) %>%
  print(n = 8)

## # A tibble: 2,808 x 3
##       d MSOA11CD1 MSOA11CD2
##   <int> <chr>     <chr>    
## 1     3 E02000192 E02000001
## 2     1 E02000003 E02000002
## 3     8 E02000468 E02000002
## 4     3 E02000474 E02000002
## 5     1 E02000752 E02000002
## 6     3 E02000763 E02000002
## 7    13 E02000769 E02000002
## 8     2 E02000005 E02000003
## # ... with 2,800 more rows

… we find that MSOA, E02000007 appears six times. That is because it has fewer deaths than six of its neighbours.

The repetition of MSOAs in these two columns arises because most MSOAs have more than one neighbour to which they can be compared and their differences calculated. This induces a group structure in the data which means the observations are not independent of each other (because observations share neighbours). The group structure is accommodated by using a (cross-classified) multilevel model with two higher levels – one for MSOA11CD1, the other for MSOA11CD2.

Fitting the multilevel model

The model is fitted using the lme4 package. First, the package is installed and/or required:

while(!eval(require(lme4))) install.packages("lme4")
while(!eval(require(stargazer))) install.packages("stargazer")
# The extra package, stargazer, produces better-formatted tables

To reproduce the results in Table 2 of the paper, one of the variables is selected and regressed against the differences in the number of deaths (the dependent variable), with the inclusion of some control variables (described in the paper). For example, if TWENTIES is selected,

mlm <- glmer(d ~ TWENTIES +
                   I(poly(Adults,2)) + carehomes + NCOVIDRATE + density +
                   (1 | MSOA11CD1) + (1 | MSOA11CD2),
                  family = "poisson", data = dfsub,
             glmerControl(optimizer = "nlminbwrap"))
stargazer(mlm, type = "text", no.space = TRUE)

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                  d             
## -----------------------------------------------
## TWENTIES                     -0.106***         
##                               (0.023)          
## I(poly(Adults, 2))1          10.338***         
##                               (1.152)          
## I(poly(Adults, 2))2            1.166           
##                               (1.051)          
## carehomes                    0.189***          
##                               (0.019)          
## NCOVIDRATE                   0.058***          
##                               (0.020)          
## density                       0.050**          
##                               (0.022)          
## Constant                     0.689***          
##                               (0.030)          
## -----------------------------------------------
## Observations                   2,808           
## Log Likelihood              -5,643.960         
## Akaike Inf. Crit.           11,305.920         
## Bayesian Inf. Crit.         11,359.380         
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

This gives an estimated effect size of -0.106, as in the paper. The same process can be used with all the other variables included in Table 2 of the paper.

The reason for using a Poisson model is because of the distribution of the response variable, d, the difference in the number of deaths between two MSOAs. Because of the way the data are constructed, those differences can only be positive integers or zero (negative values are discarded to avoid repetition, the paper explains why). To view their distribution:

while(!eval(require(ggplot2))) install.packages("ggplot2")
dfsub %>%
  ggplot(aes(x = d)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, colour = "white") +
  geom_density(colour = "red", lwd = 1) +
  theme_minimal()

It can also be treated as a negative binomial distribution, This is useful for checking the results if the Poisson model fails to converge but generally makes little difference other than taking substantially longer to compute, as the following shows.

# Do not run this unless you like twiddling your thumbs
system.time(glmer(d ~ TWENTIES +
               poly(Adults,2) + carehomes + NCOVIDRATE + density +
               (1 | MSOA11CD1) + (1 | MSOA11CD2),
               data = dfsub))

##    user  system elapsed 
##    0.44    0.01    0.70

system.time(glmer.nb(d ~ TWENTIES +
               poly(Adults,2) + carehomes + NCOVIDRATE + density +
               (1 | MSOA11CD1) + (1 | MSOA11CD2),
               data = dfsub))

##    user  system elapsed 
##  211.51    0.33  220.31

Visualising the co-linearity between variables

A feature of the variables used in the study is that they are strongly correlated with each other because ethnicity, socio-economic disadvantage, household size and overcrowding, and age are not independent of each other. One model recorded in the paper has TWENTIES, EIGHTYplus, BCRB, INCOME, and HHLD8 as the predictor variables. Those variables are correlated with the response variable, d but also with each other and other variables in the data. To visualise the inter-correlations, connected graphs are used. The function to do so, which is really just a wrapper around the R package, igraph, is available as a source file (script),

if(!file.exists("connected-graphs.R"))
download.file("https://www.dropbox.com/s/ht5k403adce4h6f/corr_graph.R?dl=1", "connected-graphs.R")
source("connected-graphs.R")

To reproduce Figure 2 in the paper, showing some of the larger correlations ($r \geq{0.5}$) between the five variables and the rest, use:

dfsub %>%
  select(8:57) %>%
  corr_graph(., use.only = c("TWENTIES", "EIGHTYplus",
                                "BCRB", "INCOME",
                                "HHLD8"),
           vertex.cex = 1, set.seed = 105, height = 4.5, width = 8.5,
           ncol = 2, nrow = 1, margin = 0.2, line.col = rep(grey(0.3),2))

More about using igraph to Visualize multicollinearity can be read here (where the data used are similar to but not the same as those used here).

Identifying the localised ‘hot spots’ of Covid-19 in London

Begin with a model with no predictor variables other than the control variables. It is a random intercepts model:

mlm <- glmer(d ~ I(poly(Adults,2)) + carehomes + NCOVIDRATE + density +
               (1 | MSOA11CD1) + (1 | MSOA11CD2),
             family = "poisson", data = dfsub,
             glmerControl(optimizer = "nlminbwrap"))

Of interest are the MSOAs that consistently have a number of deaths that is higher than their neighbours, conditional on the control variables. To identify those MSOAs, the residuals for the MSOA11CD1 level of the model need to be extracted.

resids <- ranef(mlm, whichel = "MSOA11CD1")

In the paper, those that are significantly greater than zero (at a 95% confidence) are used as the basis to identify MSOAs as localised ‘hot spots’ of Covid-19:

as_tibble(resids) %>%
    filter(condval - 1.96 * condsd > 0) %>%
    arrange(desc(condval)) %>%
    print(n=5)

## # A tibble: 136 x 5
##   grpvar    term        grp       condval condsd
##   <chr>     <fct>       <fct>       <dbl>  <dbl>
## 1 MSOA11CD1 (Intercept) E02000117    1.83  0.144
## 2 MSOA11CD1 (Intercept) E02000715    1.79  0.149
## 3 MSOA11CD1 (Intercept) E02000573    1.67  0.127
## 4 MSOA11CD1 (Intercept) E02000280    1.56  0.134
## 5 MSOA11CD1 (Intercept) E02000379    1.45  0.148
## # ... with 131 more rows

Foremost amongst them appears to be an MSOA in Harlesden, Brent:

dfsub %>%
  filter(MSOA11CD1 == "E02000117") %>%
  select(LAU210NM, LAU110NM, MSOA11CD1, MSOA11CD2, d)

## # A tibble: 5 x 5
##   LAU210NM  LAU110NM MSOA11CD1 MSOA11CD2     d
##   <chr>     <chr>    <chr>     <chr>     <int>
## 1 Harlesden Brent    E02000117 E02000106    16
## 2 Harlesden Brent    E02000117 E02000113    20
## 3 Harlesden Brent    E02000117 E02000116    22
## 4 Harlesden Brent    E02000117 E02000119    17
## 5 Harlesden Brent    E02000117 E02000123    16

However, to this point the hot spots are defined only in relation to a contiguous, i.e. first-order, neighbour. In the paper, second-order, third-order, fourth-order, and so on to the tenth-order differences are considered. In the code chunk below the process of identifying significantly positive residuals is repeated for the first-, second-, and third-order neighbours. The only reason for stopping at the third, here, is because of the computational time involved as the order becomes larger. The principle remains the same, however. The inclusion of the parameter $\text{calc.derivs = FALSE}$ speeds up the computation but turns-off some tests that the model has converged (for further information, see here).

results <- df %>%
  filter(period == 1, lag < 4) %>%
  split(.$lag) %>%
  map(function(x) {
    mlm <- glmer(d ~ I(poly(Adults,2)) + carehomes + NCOVIDRATE + density +
               (1 | MSOA11CD1) + (1 | MSOA11CD2),
             family = "poisson", data = x,
             glmerControl(optimizer = "nlminbwrap", calc.derivs = FALSE))
    ranef(mlm, whichel = "MSOA11CD1") %>%
      as_tibble(.) %>%
      filter(condval - 1.96 * condsd > 0) %>%
      select(grp, condval) %>%
      rename(MSOA11CD = grp)
  })

These sets of residuals (one set each for the first, second, and third-order neighbours) are then matched against a list of all MSOAs in London, where those that do not have a significant and positive value are set to zero. Next, the values are summed together to produce an overall index value.

if(!file.exists("msoas-list.csv")) download.file("https://www.dropbox.com/s/m0emqdxgfjav9vh/msoa_NUTS_lookup.csv?dl=1", "msoas-list.csv")
msoas <- read_csv("msoas-list.csv") %>%
  filter(NUTS106NM == "London") %>%
  select(MSOA11CD, LAU210NM, LAU110NM)

results <- results %>%
  map(~ left_join(msoas, ., by = "MSOA11CD")) %>%
  map(~ replace_na(., list(condval = 0))) %>%
  as_tibble(.) %>%
  mutate(MSOA11CD = `1`$MSOA11CD,
         Index = 0.984 * `1`$condval +
                 0.935 * `2`$condval +
                 0.857 * `3`$condval) %>%
  select(MSOA11CD, Index)

Given a suitable boundary file, the index values can then be mapped, with higher values indicating greater localised hot spots. For handling the spatial data, the sf package is used. The map is plotted using ggplot2.

while(!eval(require(sf))) install.packages("sf")
if(!file.exists("london.shp")) {   
  download.file("https://github.com/profrichharris/covid19/blob/master/london.zip?raw=true",
                url = "london-map.zip", mode="wb")
  unzip("london-map.zip")
}
london <- read_sf("london.shp") %>%
  left_join(., results) %>%
  filter(LAD11NM != "City of London")
boroughs <- aggregate(london, by = list(london$LAD11NM), length) %>%
  filter(Group.1 != "City of London")

while(!eval(require(ggsflabel))) install.packages("ggsflabel")
# This package helps to prevent overlapping labels in the plot produced below
ggplot() +
  geom_sf(data = london, aes(fill = Index), colour = "transparent") +
  scale_fill_gradient(low = "white", high = "black", name = "Index") +
  geom_sf_label_repel(data = boroughs, aes(label = Group.1),
                      label.padding = unit(0.1, "lines"), size =2.5) +
  labs(x = "", y = "")

The output is a map similar to Figure 4 in the paper but not the same because, to recall, the code above stopped at the third-order neighbours. To reproduce the figure exactly, the following code can be used but it will take much longer to run. To reproduce Figure 5, the code needs to be changed so that $\text{period}==2$.

# This will take a while to run. I suggest that you don't.
results <- df %>%
  filter(period == 1) %>%
  split(.$lag) %>%
  map(function(x) {
    mlm <- glmer(d ~ I(poly(Adults,2)) + carehomes + NCOVIDRATE + density +
               (1 | MSOA11CD1) + (1 | MSOA11CD2),
             family = "poisson", data = x,
             glmerControl(optimizer = "nlminbwrap", calc.derivs = FALSE))
    ranef(mlm, whichel = "MSOA11CD1") %>%
      as_tibble(.) %>%
      filter(condval - 1.96 * condsd > 0) %>%
      select(grp, condval) %>%
      rename(MSOA11CD = grp)
  })

results <- results %>%
  map(~ left_join(msoas, ., by = "MSOA11CD")) %>%
  map(~ replace_na(., list(condval = 0))) %>%
  as_tibble(.) %>%
  mutate(MSOA11CD = `1`$MSOA11CD,
         Index = 0.984 * `1`$condval + 0.935 * `2`$condval +
                 0.857 * `3`$condval + 0.753 * `4`$condval +
                 0.629 * `5`$condval + 0.493 * `6`$condval +
                 0.354 * `7`$condval + 0.222 * `8`$condval +  
                 0.109 * `9`$condval + 0.030 * `10`$condval) %>%
  select(MSOA11CD, Index)

london <- read_sf("london.shp") %>%
  left_join(., results) %>%
  filter(LAD11NM != "City of London")

ggplot() +
  geom_sf(data = london, aes(fill = Index), colour = "transparent") +
  scale_fill_gradient(low = "white", high = "black", name = "Index") +
  geom_sf_label_repel(data = boroughs, aes(label = Group.1),
                      label.padding = unit(0.1, "lines"), size =2.5) +
  labs(x = "", y = "")

Spatial variation in the effect sizes

As well as mapping the localised hot spots, the paper also looks at geographical variation in the relationships between some of the predictor variables and the numbers of Covid-19 deaths, identifying where the effect of the predictor is significantly greater than average. The method for looking at the latter (the geographical variation in the relationships) is an extension of the former (the hot spots analysis, above; although their order is reversed in the paper).

Suppose we are interested in the geographical variation in the INCOME variable for a model that also includes the variables TWENTIES, EIGHTYplus, BCRB, and HHLD8, as well as the control variables. The code chunk below is similar to the chunk above except it now specifies a random slopes model not just a random intercepts one.

# Again, this will take a while to run. I suggest that you don't.
results <- df %>%
  filter(period == 1) %>%
  mutate(INCOME = -1 * INCOME) %>%
  mutate(TWENTIES = -1 * TWENTIES) %>%
  # The reason for taking the negative of these variables is so each
  # predictor variable is positively related with the dependent variable
  split(.$lag) %>%
  map(function(x) {
    mlm <- glmer(d ~ TWENTIES + EIGHTYplus + BCRB + INCOME + HHLD8 +
                    poly(Adults,2) + carehomes + NCOVIDRATE + density +
                    (INCOME | MSOA11CD1) + (1 | MSOA11CD2),
                    family = "poisson", data = x,
                    glmerControl(optimizer = "nloptwrap", calc.derivs = FALSE))
  
    ranef(mlm, whichel = "MSOA11CD1") %>%
      as_tibble(.) %>%
      filter(term == "INCOME") %>%
      filter(condval - 1.96 * condsd > 0) %>%
      select(grp, condval) %>%
      rename(MSOA11CD = grp)
  })

results <- results %>%
  map(~ left_join(msoas, ., by = "MSOA11CD")) %>%
  map(~ replace_na(., list(condval = 0))) %>%
  as_tibble(.) %>%
  mutate(MSOA11CD = `1`$MSOA11CD,
         INCOME = 0.984 * `1`$condval + 0.935 * `2`$condval +
                  0.857 * `3`$condval + 0.753 * `4`$condval +
                  0.629 * `5`$condval + 0.493 * `6`$condval +
                  0.354 * `7`$condval + 0.222 * `8`$condval +  
                  0.109 * `9`$condval + 0.030 * `10`$condval) %>%
  select(MSOA11CD, INCOME)

The places where lower incomes appear to have a disproportionate impact on the number of deaths, locally, are found in Hackney, Islington, and Croydon, amongst other places.

left_join(msoas, results, by = "MSOA11CD") %>%
  arrange(desc(INCOME)) %>%
  print(n = 6)

## # A tibble: 983 x 4
##   MSOA11CD  LAU210NM                    LAU110NM   INCOME
##   <chr>     <chr>                       <chr>       <dbl>
## 1 E02000347 Lordship                    Hackney     0.245
## 2 E02000558 Finsbury Park               Islington   0.238
## 3 E02000200 South Norwood               Croydon     0.234
## 4 E02000231 Coulsdon West               Croydon     0.234
## 5 E02000935 Roehampton and Putney Heath Wandsworth  0.209
## 6 E02000929 Latchmere                   Wandsworth  0.197
## # ... with 977 more rows

In the paper, the process above is undertaken for each of the five variables in the model, finally combining the results to give a composite score (shown in Figure 3 of the paper).