Context

The paper to which this is a companion document explores neighbourhood-level infection rates of Covid-19 in English major towns and cities over 7 weeks of the pandemic. That paper draws on the work of The Doreen Lawrence Review – a report on the disproportionate impact of Covid-19 on Black, Asian and minority ethnic communities in the UK – developing an index of exposure, measuring which ethnic groups have been most exposed to Covid-19 infected residential neighbourhoods during the first and second waves of the pandemic in England. The index is based on a Bayesian Poisson model with a random intercept in the linear predictor, allowing for extra-Poisson variation at neighbourhood and town/city scales. This permits within-city differences to be decoupled from broader regional trends in the disease. This document provides the data used in the paper and shows how to reproduce some of its analysis.

Loading the data

Assuming R has been installed on your computer (optionally – but recommended – with R Studio as the interface), the data from the paper can be downloaded and read-in as an R Workspace. The following chunk of code will also load the tidyverse packages used to view and manipulate the data, first installing them if required.

while(!eval(require(tidyverse))) install.packages("tidyverse")
connection <- url("https://www.dropbox.com/s/9x0mrgoqqeo0bew/ASAP.RData?dl=1")
load(connection)
close(connection)
rm(connection)

There are 5 objects in the workspace.

ls()
## [1] "expanded.model.coefficients" "expanded.model.results"     
## [3] "model.data"                  "reduced.model.coefficients" 
## [5] "reduced.model.results"

These are:

Object Name Description
model.data A tibble (data frame) of all the data used in the models of infection rates
reduced.model.coefficients A tibble of the beta values (effect sizes) for the control variables in each week of the study
reduced.model.results A tibble of how much each MSOA neighbourhood deviates from the national and local log rate of Covid infections for each week of the study, conditional on the control variables
expanded.model.coefficients As for the reduced model but with additional predictor variables
expanded.model.results As for the reduced model but conditional also on the additional predictor variables

For non-users of R, the contents of these objects are available as csv files from here.

About the Data

The data are contained in a 187207 rows by 37 columns data frame or ‘tibble’:

glimpse(model.data)
## Rows: 187,207
## Columns: 37
## $ MSOA11CD           <chr> "E02000001", "E02000002", "E02000003", "E0200000...
## $ MSOA11NM           <chr> "City of London", "Marks Gate", "Chadwell Heath ...
## $ COVID              <dbl> 4, 0, 0, 0, 0, 5, 4, 4, 0, 0, 4, 0, 0, 0, 0, 0, ...
## $ date               <chr> "2020-03-21", "2020-03-21", "2020-03-21", "2020-...
## $ year               <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ covidweek          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TCITY15NM          <chr> "London", "London", "London", "London", "London"...
## $ LAU110NM           <chr> "City of London", "Barking and Dagenham", "Barki...
## $ PLACENM            <chr> "City of London & Westminster", "Barking and Dag...
## $ Adults             <dbl> 8100, 5388, 8281, 5171, 7232, 6964, 8950, 8125, ...
## $ Population         <dbl> 9721, 7735, 11174, 6687, 10432, 10048, 12662, 11...
## $ Adults18to21       <dbl> 5.234568, 5.994803, 5.482430, 6.787855, 6.305310...
## $ Adults22to35       <dbl> 30.48148, 27.82108, 30.09298, 27.71224, 30.12998...
## $ Adults40to49       <dbl> 14.50617, 19.54343, 17.95677, 15.18082, 20.39546...
## $ Adults50to59       <dbl> 16.53086, 15.49740, 15.36046, 16.74724, 15.81858...
## $ Adults60to69       <dbl> 12.654321, 9.372680, 10.699191, 11.564494, 8.697...
## $ Adults70to79       <dbl> 8.888889, 7.442465, 6.943606, 8.586347, 5.710730...
## $ Adults80plus       <dbl> 5.3209877, 6.9227914, 4.5767419, 7.3099981, 3.65...
## $ carehomes          <dbl> 0.000000000, 0.011878248, 0.000000000, 0.0263005...
## $ log.MeanHouseprice <dbl> 13.73873, 12.68594, 12.65726, 12.73955, 12.65086...
## $ houseprice         <dbl> 0.62583739, -0.42695290, -0.45562988, -0.3733465...
## $ NDom               <int> 7451, 2937, 4100, 2366, 3637, 3673, 4705, 4297, ...
## $ density            <dbl> 1.304657, 2.633640, 2.725366, 2.826289, 2.868298...
## $ Education          <dbl> 1.0, 12.5, 10.0, 40.0, 32.0, 25.0, 40.0, 6.7, 24...
## $ Health             <dbl> 1.9, 12.5, 10.0, 16.7, 4.0, 7.5, 14.3, 11.7, 8.0...
## $ Transport          <dbl> 0.9, 1.5, 0.8, 0.3, 0.8, 3.8, 7.1, 2.7, 2.8, 2.9...
## $ WBRI               <dbl> 4243, 3914, 4754, 4721, 5028, 5246, 7131, 4386, ...
## $ WOTH               <dbl> 1373, 398, 598, 200, 529, 530, 824, 687, 629, 44...
## $ BCRB               <dbl> 46, 302, 518, 141, 298, 169, 279, 273, 183, 123,...
## $ BAFR               <dbl> 98, 709, 936, 464, 1017, 1682, 1529, 1185, 1357,...
## $ BOTH               <dbl> 49, 122, 164, 44, 130, 146, 202, 123, 101, 89, 8...
## $ BANG               <dbl> 232, 147, 489, 60, 264, 138, 201, 235, 213, 230,...
## $ INDIAN             <dbl> 216, 270, 812, 47, 301, 71, 189, 263, 93, 168, 8...
## $ PAKISTANI          <dbl> 16, 178, 562, 41, 239, 98, 212, 364, 80, 63, 920...
## $ AOTH               <dbl> 213, 181, 345, 127, 205, 176, 224, 293, 196, 173...
## $ CHINESE            <dbl> 263, 44, 76, 38, 41, 43, 57, 65, 56, 48, 93, 45,...
## $ MIXD               <dbl> 289, 330, 433, 186, 313, 307, 478, 366, 280, 202...

The data are ‘long’ because they are in a stacked format. Each row represents a week of the pandemic and an MSOA (Census neighbourhood) in the study region. There are 59 weeks of Covid data, multiplied by 3173 MSOAs, giving the 187207 rows. Much of the data is duplicated across rows because they are time invariant. For example, the adult population of neighbourhood E02000001 is taken to be constant for each week of the study and is recorded as such in the data, as the following confirms:

model.data %>%
  filter(MSOA11CD == "E02000001") %>%
  summarise(min = min(Adults), max = max(Adults), IQR = IQR(Adults), n = n())
## # A tibble: 1 x 4
##     min   max   IQR     n
##   <dbl> <dbl> <dbl> <int>
## 1  8100  8100     0    59

The definition of the variables are below (click on hyperlinks for underlying sources)

Variable Names Definition
MSOA11CD, MSOA11NM MSOA11 Code and House of Commons Library MSOA Name
COVID Number of Covid cases (7-day sum)
date, week, year, covidweek Date stamps
TCITY15NM Name of Major Town or City
LAU110NM, PLACENM Local authority and ‘Place’ name
Adults, Population Mid-2019 population estimates
Adults18to21, Adults22to35, … Estimated percentage of adults aged 18-21, 22-35, etc.
carehomebeds Number of carehome beds per adult population
log.MeanHouseprice Log of trimmed mean houseprice, Jan 2017 - Feb 2020
houseprice The above, mean centred by PLACENM
NDom, Density Number of domestic delivery points, Population divided by NDom
Education, Health, Transport % Residents in key worker jobs
WBRI, WOTH, … 2011 Census count of White British population, White Other, etc.

Guide to ethnicity abbreviations:

WBRI = White British, WOTH = White Other, BCRB = Black Caribbean, BAFR = Black African, BOTH = Black Other, BANG = Bangladeshi, AOTH = Asian Other; MIXD = Mixed ethnicity

The model residuals

The Poisson model that is fitted in the paper has the form

\(\text{log(}\lambda_{it}\text{)}=\beta_{0t}+\beta_{1t}x_1+...+\text{ log(}P_i\text{)}+\varepsilon_{it}+\delta_{ijt}\)

(see Equation 4 of the paper) where \(\varepsilon_{it}\) represents the differences between places, conditional on the control and any predictor variables and relative to the overall infection rate for the study region at time \(t\), and \(\delta_{ijt}\) represents the differences within places.

The places with the highest rates of infection each week, conditional on the control variables, are

reduced.model.results %>%
  select(PLACENM, covidweek, regional) %>%
  unique(.) %>%
  group_by(covidweek) %>%
  filter(regional == max(regional)) %>%
  print(n = 5) # change to n = Inf if you want to see all weeks
## # A tibble: 61 x 3
## # Groups:   covidweek [61]
##   PLACENM    covidweek regional
##   <chr>          <dbl>    <dbl>
## 1 Brent              1     5.00
## 2 Southwark          2     2.34
## 3 Brent              3     1.25
## 4 Sunderland         4     1.22
## 5 Gateshead          5     1.01
## # ... with 56 more rows

The MSOAs with the highest rates of infection each week, conditional on the control variables and having allowed for the place effects (therefore also the broad geography of the disease), are

reduced.model.results %>%
  group_by(covidweek) %>%
  filter(local == max(local)) %>%
  select(MSOA11CD, MSOA11NM, PLACENM, covidweek, local) %>%
  print(n = 5) # change to n = Inf if you want to see all weeks
## # A tibble: 61 x 5
## # Groups:   covidweek [61]
##   MSOA11CD  MSOA11NM              PLACENM        covidweek local
##   <chr>     <chr>                 <chr>              <dbl> <dbl>
## 1 E02001071 Levenshulme North     Manchester             1  6.55
## 2 E02003572 Bitterne South        Southampton            2  2.83
## 3 E02002964 Townsend & Eaton Park Stoke-on-Trent         3  2.29
## 4 E02003270 Lewsey South          Luton                  4  1.79
## 5 E02001772 West Park             South Shields          5  2.37
## # ... with 56 more rows

We will return to how to fit this model later.

Calculating the Index of Exposure

The index is based on the Index of Dissimilarity, well-known in segregation research. That index can be written as

\(I_k = \sum_{i=1}^{N}\frac{n_{i(k)}}{n_{+(k)}}p_i\)

where \(I_k\) is the index value for ethnic group, \(k\), of which there are \(n_{i(k)}\) in neighbourhood, \(i\), and \(n_{+(k)}\) for all \(N\) neighbourhoods within the study. The remaining value, \(p_i\), may be interpreted probabilistically. For a segregation index, it is the probability of selecting a member of a second ethnic group, \(l\) from the same neighbourhood that the member of ethnic group \(k\) is living in and is equal to \({n_{i(l)}/}{n_{+(l)}}\). However, for the index of exposure that is used in the paper, \(p_i\) is the probability of selecting, from a Normal distribution, a value lower than the MSOA’s deviance from its expected infection rate for the week.

\(p_i=P(z<\varepsilon_i+\delta_{ij})\) or \(p_i=P(z<\varepsilon_i)\)

where \(z\) represents quantiles from a standard Normal distribution and giving two versions of the index, the reason for which is explained in the paper.

Using the first version, the index may be calculated for various ethnic groups each week and plotted in a way akin to Figure 2 of the paper using the code chunk:

while(!eval(require(ggplot2))) install.packages("ggplot2")
reduced.model.results %>%
  mutate(p = pnorm(local + regional)) %>%
  select(MSOA11CD, covidweek, p) %>%
  # Link the model results to the ethnicity data:
  left_join(.,
            model.data %>%
              select(MSOA11CD, covidweek, BAFR, BCRB, BANG, INDIAN,
                     PAKISTANI, CHINESE, WBRI, WOTH),
            by = c("MSOA11CD", "covidweek")) %>%
  select(-MSOA11CD) %>%
  # Calculate the index value for each ethnic group each week:
  group_by(covidweek) %>%
  mutate(across(!p, ~ p * . / sum(.))) %>%
  summarise(across(!p, sum)) %>%
  # Change to long format and plot:
  pivot_longer(2:ncol(.), "ethnicity") %>%
  ggplot(., aes(x = covidweek, y = value)) +
    geom_hline(yintercept = 0.5, col = "dark grey") +
    geom_line(aes(colour = ethnicity, linetype = ethnicity)) +
    ylim(c(0.2,0.9))

For comparison, the results when using the second version of the index, which are based on localised differences between MSOAs and control for the broader geographical pattern of the disease each week are,

reduced.model.results %>%
  # Just using the localised differences:
  mutate(p = pnorm(local)) %>%
  select(MSOA11CD, covidweek, p) %>%
  left_join(.,
            model.data %>%
              select(MSOA11CD, covidweek, BAFR, BCRB, BANG, INDIAN,
                     PAKISTANI, CHINESE, WBRI, WOTH),
            by = c("MSOA11CD", "covidweek")) %>%
  select(-MSOA11CD) %>%
  group_by(covidweek) %>%
  mutate(across(!p, ~ p * . / sum(.))) %>%
  summarise(across(!p, sum)) %>%
  pivot_longer(2:ncol(.), "ethnicity") %>%
  ggplot(., aes(x = covidweek, y = value)) +
    geom_hline(yintercept = 0.5, col = "dark grey") +
    geom_line(aes(colour = ethnicity, linetype = ethnicity)) +
    ylim(c(0.2,0.9))

Fitting the model

The following code fits the Bayesian model used in the paper. It is strongly recommended that you do not run it just out of curiosity as you will be waiting a while for the results. It assumes that the [brms] (https://cran.r-project.org/web/packages/brms/index.html) and tidybayes packages have been installed.

To fit the model for the first week of the data only:

## It is recommended that you do not run this!
require(brms)
require(tidybayes)

# Select the week:
mydata <- model.data %>%
  filter(covidweek == 1)

# Fit the model:
m1 <- brm(COVID ~ carehomes + Adults18to21 + Adults22to35 +
            (1|MSOA11CD) + (1|PLACENM) + offset(log(Adults)),
      family=poisson(),
      # Change to iter=1000 if you are interested in just trying the code:
      data=mydata, iter=8000, cores=4,
      control = list(adapt_delta = 0.99, max_treedepth = 15))

# Calculate the mean of the drawn values for the place effect (delta in the model)
place_effects <- spread_draws(m1,r_PLACENM[i,Intercept]) %>%
    rename(PLACENM = i) %>%
    select(PLACENM, r_PLACENM) %>% 
    group_by(PLACENM) %>% 
    summarise(regional = mean(r_PLACENM))
# This is just to tidy up the names of the places in place$PLACENM
place_effects$PLACENM <- gsub("\\.", " ", place_effects$PLACENM)

# Calculate the mean for the local effect (epsilon in the model)
local_effects <-spread_draws(m1,r_MSOA11CD[i,Intercept]) %>%
    rename(MSOA11CD = i) %>%
    select(MSOA11CD, r_MSOA11CD) %>% 
    group_by(MSOA11CD) %>% 
    summarise(local = mean(r_MSOA11CD))

# Bring the results together
results <- mydata %>%
    select(MSOA11CD, PLACENM, date, covidweek) %>%
    inner_join(local_effects, by = "MSOA11CD") %>%
    inner_join(place_effects, by = "PLACENM") 

results %>%
  print(n = 6)

# The fixed effects in the model may be of interest
fixef(m1)

The process then needs to be repeated for every week of the study - for example, by looping through each week of the study and saving the results in a list. If you decide to try it, be warned, it will take many hours!!!