Abstract
This is a companion document to the paper, Measuring the exposure of Black, Asian and other ethnic groups to Covid-infected neighbourhoods in English towns and cities, accepted for publication in the journal Applied Spatial Analysis and Policy. It includes the data used in the paper and shows how to reproduce some of the analysis found in it. A preprint version of the paper can be downloaded from Research Square.The paper to which this is a companion document explores neighbourhood-level infection rates of Covid-19 in English major towns and cities over 7 weeks of the pandemic. That paper draws on the work of The Doreen Lawrence Review – a report on the disproportionate impact of Covid-19 on Black, Asian and minority ethnic communities in the UK – developing an index of exposure, measuring which ethnic groups have been most exposed to Covid-19 infected residential neighbourhoods during the first and second waves of the pandemic in England. The index is based on a Bayesian Poisson model with a random intercept in the linear predictor, allowing for extra-Poisson variation at neighbourhood and town/city scales. This permits within-city differences to be decoupled from broader regional trends in the disease. This document provides the data used in the paper and shows how to reproduce some of its analysis.
Assuming R has been installed on your computer (optionally – but recommended – with R Studio as the interface), the data from the paper can be downloaded and read-in as an R Workspace. The following chunk of code will also load the tidyverse packages used to view and manipulate the data, first installing them if required.
while(!eval(require(tidyverse))) install.packages("tidyverse")
connection <- url("https://www.dropbox.com/s/9x0mrgoqqeo0bew/ASAP.RData?dl=1")
load(connection)
close(connection)
rm(connection)
There are 5 objects in the workspace.
ls()
## [1] "expanded.model.coefficients" "expanded.model.results"
## [3] "model.data" "reduced.model.coefficients"
## [5] "reduced.model.results"
These are:
| Object Name | Description |
|---|---|
| model.data | A tibble (data frame) of all the data used in the models of infection rates |
| reduced.model.coefficients | A tibble of the beta values (effect sizes) for the control variables in each week of the study |
| reduced.model.results | A tibble of how much each MSOA neighbourhood deviates from the national and local log rate of Covid infections for each week of the study, conditional on the control variables |
| expanded.model.coefficients | As for the reduced model but with additional predictor variables |
| expanded.model.results | As for the reduced model but conditional also on the additional predictor variables |
For non-users of R, the contents of these objects are available as csv files from here.
The data are contained in a 187207 rows by 37 columns data frame or ‘tibble’:
glimpse(model.data)
## Rows: 187,207
## Columns: 37
## $ MSOA11CD <chr> "E02000001", "E02000002", "E02000003", "E0200000...
## $ MSOA11NM <chr> "City of London", "Marks Gate", "Chadwell Heath ...
## $ COVID <dbl> 4, 0, 0, 0, 0, 5, 4, 4, 0, 0, 4, 0, 0, 0, 0, 0, ...
## $ date <chr> "2020-03-21", "2020-03-21", "2020-03-21", "2020-...
## $ year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, ...
## $ covidweek <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TCITY15NM <chr> "London", "London", "London", "London", "London"...
## $ LAU110NM <chr> "City of London", "Barking and Dagenham", "Barki...
## $ PLACENM <chr> "City of London & Westminster", "Barking and Dag...
## $ Adults <dbl> 8100, 5388, 8281, 5171, 7232, 6964, 8950, 8125, ...
## $ Population <dbl> 9721, 7735, 11174, 6687, 10432, 10048, 12662, 11...
## $ Adults18to21 <dbl> 5.234568, 5.994803, 5.482430, 6.787855, 6.305310...
## $ Adults22to35 <dbl> 30.48148, 27.82108, 30.09298, 27.71224, 30.12998...
## $ Adults40to49 <dbl> 14.50617, 19.54343, 17.95677, 15.18082, 20.39546...
## $ Adults50to59 <dbl> 16.53086, 15.49740, 15.36046, 16.74724, 15.81858...
## $ Adults60to69 <dbl> 12.654321, 9.372680, 10.699191, 11.564494, 8.697...
## $ Adults70to79 <dbl> 8.888889, 7.442465, 6.943606, 8.586347, 5.710730...
## $ Adults80plus <dbl> 5.3209877, 6.9227914, 4.5767419, 7.3099981, 3.65...
## $ carehomes <dbl> 0.000000000, 0.011878248, 0.000000000, 0.0263005...
## $ log.MeanHouseprice <dbl> 13.73873, 12.68594, 12.65726, 12.73955, 12.65086...
## $ houseprice <dbl> 0.62583739, -0.42695290, -0.45562988, -0.3733465...
## $ NDom <int> 7451, 2937, 4100, 2366, 3637, 3673, 4705, 4297, ...
## $ density <dbl> 1.304657, 2.633640, 2.725366, 2.826289, 2.868298...
## $ Education <dbl> 1.0, 12.5, 10.0, 40.0, 32.0, 25.0, 40.0, 6.7, 24...
## $ Health <dbl> 1.9, 12.5, 10.0, 16.7, 4.0, 7.5, 14.3, 11.7, 8.0...
## $ Transport <dbl> 0.9, 1.5, 0.8, 0.3, 0.8, 3.8, 7.1, 2.7, 2.8, 2.9...
## $ WBRI <dbl> 4243, 3914, 4754, 4721, 5028, 5246, 7131, 4386, ...
## $ WOTH <dbl> 1373, 398, 598, 200, 529, 530, 824, 687, 629, 44...
## $ BCRB <dbl> 46, 302, 518, 141, 298, 169, 279, 273, 183, 123,...
## $ BAFR <dbl> 98, 709, 936, 464, 1017, 1682, 1529, 1185, 1357,...
## $ BOTH <dbl> 49, 122, 164, 44, 130, 146, 202, 123, 101, 89, 8...
## $ BANG <dbl> 232, 147, 489, 60, 264, 138, 201, 235, 213, 230,...
## $ INDIAN <dbl> 216, 270, 812, 47, 301, 71, 189, 263, 93, 168, 8...
## $ PAKISTANI <dbl> 16, 178, 562, 41, 239, 98, 212, 364, 80, 63, 920...
## $ AOTH <dbl> 213, 181, 345, 127, 205, 176, 224, 293, 196, 173...
## $ CHINESE <dbl> 263, 44, 76, 38, 41, 43, 57, 65, 56, 48, 93, 45,...
## $ MIXD <dbl> 289, 330, 433, 186, 313, 307, 478, 366, 280, 202...
The data are ‘long’ because they are in a stacked format. Each row represents a week of the pandemic and an MSOA (Census neighbourhood) in the study region. There are 59 weeks of Covid data, multiplied by 3173 MSOAs, giving the 187207 rows. Much of the data is duplicated across rows because they are time invariant. For example, the adult population of neighbourhood E02000001 is taken to be constant for each week of the study and is recorded as such in the data, as the following confirms:
model.data %>%
filter(MSOA11CD == "E02000001") %>%
summarise(min = min(Adults), max = max(Adults), IQR = IQR(Adults), n = n())
## # A tibble: 1 x 4
## min max IQR n
## <dbl> <dbl> <dbl> <int>
## 1 8100 8100 0 59
The definition of the variables are below (click on hyperlinks for underlying sources)
| Variable Names | Definition |
|---|---|
| MSOA11CD, MSOA11NM | MSOA11 Code and House of Commons Library MSOA Name |
| COVID | Number of Covid cases (7-day sum) |
| date, week, year, covidweek | Date stamps |
| TCITY15NM | Name of Major Town or City |
| LAU110NM, PLACENM | Local authority and ‘Place’ name |
| Adults, Population | Mid-2019 population estimates |
| Adults18to21, Adults22to35, … | Estimated percentage of adults aged 18-21, 22-35, etc. |
| carehomebeds | Number of carehome beds per adult population |
| log.MeanHouseprice | Log of trimmed mean houseprice, Jan 2017 - Feb 2020 |
| houseprice | The above, mean centred by PLACENM |
| NDom, Density | Number of domestic delivery points, Population divided by NDom |
| Education, Health, Transport | % Residents in key worker jobs |
| WBRI, WOTH, … | 2011 Census count of White British population, White Other, etc. |
Guide to ethnicity abbreviations:
WBRI = White British, WOTH = White Other, BCRB = Black Caribbean, BAFR = Black African, BOTH = Black Other, BANG = Bangladeshi, AOTH = Asian Other; MIXD = Mixed ethnicity
The Poisson model that is fitted in the paper has the form
\(\text{log(}\lambda_{it}\text{)}=\beta_{0t}+\beta_{1t}x_1+...+\text{ log(}P_i\text{)}+\varepsilon_{it}+\delta_{ijt}\)
(see Equation 4 of the paper) where \(\varepsilon_{it}\) represents the differences between places, conditional on the control and any predictor variables and relative to the overall infection rate for the study region at time \(t\), and \(\delta_{ijt}\) represents the differences within places.
The places with the highest rates of infection each week, conditional on the control variables, are
reduced.model.results %>%
select(PLACENM, covidweek, regional) %>%
unique(.) %>%
group_by(covidweek) %>%
filter(regional == max(regional)) %>%
print(n = 5) # change to n = Inf if you want to see all weeks
## # A tibble: 61 x 3
## # Groups: covidweek [61]
## PLACENM covidweek regional
## <chr> <dbl> <dbl>
## 1 Brent 1 5.00
## 2 Southwark 2 2.34
## 3 Brent 3 1.25
## 4 Sunderland 4 1.22
## 5 Gateshead 5 1.01
## # ... with 56 more rows
The MSOAs with the highest rates of infection each week, conditional on the control variables and having allowed for the place effects (therefore also the broad geography of the disease), are
reduced.model.results %>%
group_by(covidweek) %>%
filter(local == max(local)) %>%
select(MSOA11CD, MSOA11NM, PLACENM, covidweek, local) %>%
print(n = 5) # change to n = Inf if you want to see all weeks
## # A tibble: 61 x 5
## # Groups: covidweek [61]
## MSOA11CD MSOA11NM PLACENM covidweek local
## <chr> <chr> <chr> <dbl> <dbl>
## 1 E02001071 Levenshulme North Manchester 1 6.55
## 2 E02003572 Bitterne South Southampton 2 2.83
## 3 E02002964 Townsend & Eaton Park Stoke-on-Trent 3 2.29
## 4 E02003270 Lewsey South Luton 4 1.79
## 5 E02001772 West Park South Shields 5 2.37
## # ... with 56 more rows
We will return to how to fit this model later.
The index is based on the Index of Dissimilarity, well-known in segregation research. That index can be written as
\(I_k = \sum_{i=1}^{N}\frac{n_{i(k)}}{n_{+(k)}}p_i\)
where \(I_k\) is the index value for ethnic group, \(k\), of which there are \(n_{i(k)}\) in neighbourhood, \(i\), and \(n_{+(k)}\) for all \(N\) neighbourhoods within the study. The remaining value, \(p_i\), may be interpreted probabilistically. For a segregation index, it is the probability of selecting a member of a second ethnic group, \(l\) from the same neighbourhood that the member of ethnic group \(k\) is living in and is equal to \({n_{i(l)}/}{n_{+(l)}}\). However, for the index of exposure that is used in the paper, \(p_i\) is the probability of selecting, from a Normal distribution, a value lower than the MSOA’s deviance from its expected infection rate for the week.
\(p_i=P(z<\varepsilon_i+\delta_{ij})\) or \(p_i=P(z<\varepsilon_i)\)
where \(z\) represents quantiles from a standard Normal distribution and giving two versions of the index, the reason for which is explained in the paper.
Using the first version, the index may be calculated for various ethnic groups each week and plotted in a way akin to Figure 2 of the paper using the code chunk:
while(!eval(require(ggplot2))) install.packages("ggplot2")
reduced.model.results %>%
mutate(p = pnorm(local + regional)) %>%
select(MSOA11CD, covidweek, p) %>%
# Link the model results to the ethnicity data:
left_join(.,
model.data %>%
select(MSOA11CD, covidweek, BAFR, BCRB, BANG, INDIAN,
PAKISTANI, CHINESE, WBRI, WOTH),
by = c("MSOA11CD", "covidweek")) %>%
select(-MSOA11CD) %>%
# Calculate the index value for each ethnic group each week:
group_by(covidweek) %>%
mutate(across(!p, ~ p * . / sum(.))) %>%
summarise(across(!p, sum)) %>%
# Change to long format and plot:
pivot_longer(2:ncol(.), "ethnicity") %>%
ggplot(., aes(x = covidweek, y = value)) +
geom_hline(yintercept = 0.5, col = "dark grey") +
geom_line(aes(colour = ethnicity, linetype = ethnicity)) +
ylim(c(0.2,0.9))
For comparison, the results when using the second version of the index, which are based on localised differences between MSOAs and control for the broader geographical pattern of the disease each week are,
reduced.model.results %>%
# Just using the localised differences:
mutate(p = pnorm(local)) %>%
select(MSOA11CD, covidweek, p) %>%
left_join(.,
model.data %>%
select(MSOA11CD, covidweek, BAFR, BCRB, BANG, INDIAN,
PAKISTANI, CHINESE, WBRI, WOTH),
by = c("MSOA11CD", "covidweek")) %>%
select(-MSOA11CD) %>%
group_by(covidweek) %>%
mutate(across(!p, ~ p * . / sum(.))) %>%
summarise(across(!p, sum)) %>%
pivot_longer(2:ncol(.), "ethnicity") %>%
ggplot(., aes(x = covidweek, y = value)) +
geom_hline(yintercept = 0.5, col = "dark grey") +
geom_line(aes(colour = ethnicity, linetype = ethnicity)) +
ylim(c(0.2,0.9))
The following code fits the Bayesian model used in the paper. It is strongly recommended that you do not run it just out of curiosity as you will be waiting a while for the results. It assumes that the [brms] (https://cran.r-project.org/web/packages/brms/index.html) and tidybayes packages have been installed.
To fit the model for the first week of the data only:
## It is recommended that you do not run this!
require(brms)
require(tidybayes)
# Select the week:
mydata <- model.data %>%
filter(covidweek == 1)
# Fit the model:
m1 <- brm(COVID ~ carehomes + Adults18to21 + Adults22to35 +
(1|MSOA11CD) + (1|PLACENM) + offset(log(Adults)),
family=poisson(),
# Change to iter=1000 if you are interested in just trying the code:
data=mydata, iter=8000, cores=4,
control = list(adapt_delta = 0.99, max_treedepth = 15))
# Calculate the mean of the drawn values for the place effect (delta in the model)
place_effects <- spread_draws(m1,r_PLACENM[i,Intercept]) %>%
rename(PLACENM = i) %>%
select(PLACENM, r_PLACENM) %>%
group_by(PLACENM) %>%
summarise(regional = mean(r_PLACENM))
# This is just to tidy up the names of the places in place$PLACENM
place_effects$PLACENM <- gsub("\\.", " ", place_effects$PLACENM)
# Calculate the mean for the local effect (epsilon in the model)
local_effects <-spread_draws(m1,r_MSOA11CD[i,Intercept]) %>%
rename(MSOA11CD = i) %>%
select(MSOA11CD, r_MSOA11CD) %>%
group_by(MSOA11CD) %>%
summarise(local = mean(r_MSOA11CD))
# Bring the results together
results <- mydata %>%
select(MSOA11CD, PLACENM, date, covidweek) %>%
inner_join(local_effects, by = "MSOA11CD") %>%
inner_join(place_effects, by = "PLACENM")
results %>%
print(n = 6)
# The fixed effects in the model may be of interest
fixef(m1)
The process then needs to be repeated for every week of the study - for example, by looping through each week of the study and saving the results in a list. If you decide to try it, be warned, it will take many hours!!!