Note: this is a work in progress. There is more to be added. Last updated Mon 29 August 2022
In their paper, Area level deprivation and monthly COVID-19 cases: The impact of government policy in England, Morrissey, Spooner, Salter & Shaddick (2021) show that area level deprivation is significantly associated with monthly COVID-19 cases in England – a finding that is substantiated by other research. Here that association is explored further with weekly and regional data over a longer period of 111 weeks. For much of the pandemic, if the most deprived neighbourhoods took on the attributes of the most affluent or their populations then there would have been fewer COVID cases. However, since the end of last year, and with the emergence of the Omicron variant, that situation has reversed – deprived neighbourhoods have now had fewer cases overall. Furthermore, for all of the pandemic, the differences between regions (and between sub-regions) are much greater than the differences between deprivation deciles are: regional (and sub-regional) effects exceed deprivation ones. However, this is not to say that deprivation does not matter because the geography of deprivation in England is related to the regional geography of COVID. In fact, a greater concentration of deprived neighbourhoods at a sub-regional scale is often related to higher COVID rates – much more that a greater concentration of affluent neighbourhoods is. The results are fully reproducible by using the code below.
The data are the 7-day rolling sum of COVID data, covering the period from 2020-03-07 to 2022-04-16. The ‘raw’ data are from the The Government Portal1 although I have reformatted them and matched in some other data for ease of use. A comment concerning a key limitation of the data is left until the end but basically boils down to the data being least reliable where there are fewest COVID cases; perhaps less of a problem, here, because we are interested in the opposite.
We begin by loading and, where necessary, installing the packages that will be needed.
while(!require(tidyverse)) install.packages("tidyverse")
while(!require(lme4)) install.packages("lme4")
while(!require(ggplot2)) install.packages("ggplot2")
while(!require(zoo)) install.packages("zoo")
while(!require(fastDummies)) install.packages("fastDummies")
With those in place we can now…
The following code chunk loads the data and prepares it for the modelling, most usefully by converting it into long format. More will be explained about the data as we proceed and as the need arises.
df <- read_csv("https://www.dropbox.com/s/baqkwvsb0ah3ahh/covid_data.csv?dl=1") %>%
mutate(IMD_decile = floor(10*percent_rank(1/IMD)) + 1) %>%
mutate(IMD_decile = ifelse(IMD_decile > 10, 10, IMD_decile)) %>%
mutate(across(starts_with("age"), ~ 100 * . / `All Ages`)) %>%
mutate(carebeds = round(1000 * carebeds / `Adults`, 1)) %>%
pivot_longer(., cols = c(starts_with("2020-"), starts_with("2021-"), starts_with("2022-")),
names_to = "week",
values_to = "cases")
The models to be fitted are similar to those used by Harris and Brunsdon (2021) in our paper looking at the exposure of Black, Asian and other ethnic groups to COVID-infected neighbourhoods in English towns and cities. Changes are modifications to the control variables and, more substantively, to the higher-level random parameter of the model (see below). Where that higher level previously represented the geographical differences between English major towns and cities, here it represents the differences between deprivation deciles (i.e. between groups of neighbourhoods that are in the first 10% most deprived, the next 10%, the 10% after that, and so forth to the wealthiest 10%). Unlike the previous study, rural areas are not excluded. In addition to the change in the model specification, a change in the estimation procedure is employed too: it is estimated using lme4 and not as a Bayesian model (with the brms package). This is simply for computational speed – it is much faster this way. However, if you did want to use brms, you could do so by basing it on this tutorial.
The model is a Poisson model where \(\lambda_{it}\) is the number of reported cases of COVID in neighbourhood \(i\) in week \(t\):
\(\text{log(}\lambda_{it}\text{)}=\sum_k\beta_{(k)it}x_{(k)it}+\text{log(}P_i\text{)}+\varepsilon_{it}+\delta_{ijt}\)
Here \(\varepsilon_{it}\) represents the differences between places that can be attributed to the deprivation measure, conditional on the \(k\) control variables, \(\sum_k\beta_{kit}x_{kit}\), and relative to the overall infection rate for the study region at time \(t\); \(\delta_{ijt}\) represents the differences between deprivation deciles – loosely, ‘the deprivation effect’ – where the deprivation score, re-coded into deciles, is from the 2019 English Index of Multiple Deprivation;2 neighbourhoods are 2011 Census Middle Super Output Areas (MSOAs). Rearranging the model and knowing that \(P_i\) is the mid-2020 ONS estimated population of a neighbourhood shows that it is modelling the log of the infection rate each week:
\(\text{log(}\lambda_{it}\text{)} - \text{log(}P_t\text{)} =\sum_k\beta_{(k)it}x_{(k)it}+\varepsilon_{it}+\delta_{ijt}\)
\(\text{log}(\frac{\lambda_{it}}{P_i})=\sum_k\beta_{(k)it}x_{(k)it}+\varepsilon_{it}+\delta_{ijt}\)
Neighbourhood populations and their age structures are held at their mid-2020 estimates for the duration of the study.
Before fitting the model, we can speed-up the process of fitting by changing some of the defaults (See here for further information). The chief cost of doing so is that a check on reliable model convergence is omitted. In practice, a lack of model convergence occurs when there are few substantial differences between places because cases are so low everywhere. In such circumstances, it does not really matter if the model fails to reliably measure what slight differences there may be because they are effectively zero anyway.
nlopt <- function(par, fn, lower, upper, control) {
.nloptr <<- res <- nloptr(par, fn, lb = lower, ub = upper,
opts = list(algorithm = "NLOPT_LN_BOBYQA",
maxeval = 100000, xtol_abs = 1e-6, ftol_abs = 1e-6))
list(par = res$solution,
fval = res$objective,
conv = if (res$status > 0) 0 else res$status,
message = res$message
)
}
Having changed the defaults, a relatively simple model is fitted for each of the 111 weeks in turn (so it will result in 111 separate models). Unsurprisingly, this takes some time, as you can see from the code chunk below. If you are replicating the analysis fully, be prepared to be patient – make yourself a cup of tea and relax for half an hour. If you prefer, however, you can jump forward to the next code chunk which will load-in the models generated at this stage.
# This will take some time to run
# If you prefer, jump on to the next code chunk to load the resulting models
start <- Sys.time()
mlm <- lapply(unique(df$week), function(x) {
df %>%
filter(week == x) %>%
mutate(`age22-34` = `age22-24` + `age25-29` + `age30-34`) %>%
mutate(across(starts_with("age") | matches("carebeds") | matches("AandE"),
~ as.numeric(scale(.)))) %>%
glmer(cases ~ `age5-11` + `age12-17` + `age18-21` + `age22-34` +
carebeds + AandE +
(1|MSOA11CD) + (1|IMD_decile) + offset(log(`All Ages`)),
family=poisson(), data = .,
control = glmerControl(optimizer = "nloptwrap", calc.derivs = FALSE))
})
end <- Sys.time()
end - start
## Time difference of 43.75229 mins
[Or]
# If you didn't run the code chunk above then
# load in the preprocessed models
if(!exists("mlm")) {
connection <- url("https://www.dropbox.com/s/6lny8rne0b4k765/mlm.RData?dl=1")
load(connection)
close(connection)
}
Each of the models contains six control variables:
the percentages of the population per MSOA aged (a) 5-11 years, (b) 12-17, (c) 18-21 and (d) 22-34;
the number of carehome beds per adult population;3, and
whether the MSOA contains an Accident and Emergency hospital.4
Five of these six variables are included to control for four distinct types of setting where infection rates have been unusually high for at least some of the pandemic but where the inhabitants of those settings do not necessarily reflect the population characteristics of the wider neighbourhood. These are schools, care homes, student halls of residence and hospitals. The remaining variable, the percentage of the adult population aged 22 to 35, is included because of a period, after the first national lockdown, when infection cases were growing among younger adults.
As Harris and Brunsdon (2021) explain, the purpose of this model is not to explain the variations in the infection rates but to reveal them; here, particularly those that are between the deprivation deciles. It is essentially a null model – one that partitions, between the levels of the model, the variance around the average of the response variable, and does so without seeking to explain that variance statistically. It isn’t actually a null model because it also contains the control variables but those are just there to allow for a few distinct circumstances that are unusual and would otherwise inflate the more typical number of cases in their MSOA. Conceptually they can be regarded as making a statistical correction to the COVID data as opposed to trying to explain the causes of those ‘corrected’ values. Separate models are fitted for each week because there is no reason to assume that the effects that the variables control for would be the same throughout the study period. For an illustration, compare the effects for the week to 2020-08-22 (week 25 of the reported results) with those for the week to 2020-10-10 (week 32). Note how the return of students to University appears to have raised the effect of having populations aged 18 to 21:
fixef(mlm[[25]])
## (Intercept) `age5-11` `age12-17` `age18-21` `age22-34` carebeds
## -9.594440873 0.128560000 0.213486425 0.062614762 0.301705426 0.049155751
## AandE
## -0.003251845
fixef(mlm[[32]])
## (Intercept) `age5-11` `age12-17` `age18-21` `age22-34` carebeds
## -6.94460203 -0.09730137 0.03458307 0.23089447 0.06539847 0.01878886
## AandE
## -0.02626719
The random effect, \(\delta_{ijt}\), is a modelled estimate of how much each deprivation decile departs from the average log COVID rate for neighbourhoods that week. For example, if we look at week 50 of testing, for that week the conditional rate rises with each deprivation decile – the least deprived MSOAs (grp 10) have below average infections, rising to the most deprived MSOAs (grp 1), which have most.
ranef(mlm[[50]], whichel = "IMD_decile") %>%
as_tibble() %>%
mutate(ymin = condval - 1.96 * condsd, ymax = condval + 1.96 * condsd) %>%
ggplot(., aes(x = grp, y = condval, ymin = ymin, ymax = ymax)) +
geom_errorbar() +
geom_hline(yintercept = 0, linetype = "dotted") +
xlab("IMD decile")
Fig. 1: ‘Deprivation effects’ – the (residual) differences between the IMD deciles
The same finding can be illustrated by converting those differences to the predicted infection rates:
intercept <- fixef(mlm[[50]])[1]
ranef(mlm[[50]], whichel = "IMD_decile") %>%
as_tibble() %>%
mutate(rate = exp(condval + intercept) * 100000,
ratemin = exp(condval - 1.96 * condsd + intercept) * 100000,
ratemax = exp(condval + 1.96 * condsd + intercept) * 100000) %>%
ggplot(., aes(x = grp, y = rate, ymin = ratemin, ymax = ratemax)) +
geom_errorbar() +
geom_hline(yintercept = exp(intercept) * 100000, linetype = "dotted") +
xlab("IMD decile")
Fig. 2: The differences converted back into rates
Both Figures 1 and 2 are the differences for a single week. The rates for all the weeks can be gathered together as follows:
results <- lapply(1:length(mlm), function(i){
x <- mlm[[i]]
intercept <- fixef(x)[1]
ranef(x, whichel = "IMD_decile") %>%
as_tibble() %>%
mutate(ymin = condval - 1.96 * condsd,
ymax = condval + 1.96 * condsd,
rate = exp(condval + fixef(x)[1]) * 100000,
ratemin = exp(condval - 1.96 * condsd + intercept) * 100000,
ratemax = exp(condval + 1.96 * condsd + intercept) * 100000,
week = i,
date = unique(df$week)[i],
)
})
results <- do.call(rbind, results)
The weekly differences can then be plotted, here comparing the neighbourhoods amongst the top 20% most deprived (‘High’) with those in the bottom 20% (‘Low’). A non-linear scale has been used in the chart because the of the way that the rates of COVID shot up with the omicron variant.
results %>%
filter(grp == 1 | grp == 2 | grp == 9 | grp == 10) %>%
mutate(IMD = ifelse(grp == 1, "Highest 10%", "Lowest 10%")) %>%
mutate(IMD = ifelse(grp == 2, "Highest 20%", IMD)) %>%
mutate(IMD = ifelse(grp == 9, "Lowest 20%", IMD)) %>%
mutate(IMD = factor(IMD,
levels = c("Highest 10%", "Highest 20%", "Lowest 20%", "Lowest 10%"))) %>%
ggplot(aes(x = week, y = rate, col = IMD, ymin = ratemin, ymax = ratemax)) +
geom_errorbar(size = 0.5, col = "black") +
geom_line(size = 1, alpha = 0.5) +
scale_colour_manual(values = c("#880808", "#AA4A44", "#89CFF0", "#00008B")) +
scale_y_continuous(trans = "sqrt")
Fig. 3: The rate difference between low and high IMD neighbourhoods each week
The following code is just an extension of that above to add some ‘polish’ to the chart to make it more interpretable.
# This is used to improve the x-axis labelling
ticks <- results %>%
mutate(date = substr(date, 1, 7)) %>%
select(date, week) %>%
group_by(date) %>%
summarise(week = median(week)) %>%
ungroup(.) %>%
mutate(date = as.yearmon(date)) %>%
mutate(date = paste(substr(date, 1, 3), substr(date, 7, 8)))
# This is used to plot the lockdown periods on the chart
dates <- results %>%
select(date) %>%
unique() %>%
mutate(date = as.Date(date))
l1start <- which.min(abs(as.Date("2020-03-26") - dates$date))
l1end <- which.min(abs(as.Date("2020-06-14") - dates$date))
l2start <- which.min(abs(as.Date("2020-11-05") - dates$date))
l2end <- which.min(abs(as.Date("2020-12-01") - dates$date))
l3start <- which.min(abs(as.Date("2021-01-06") - dates$date))
l3end <- which.min(abs(as.Date("2021-03-28") - dates$date))
lockdowns <- data.frame(xmin = c(l1start, l2start, l3start),
xmax = c(l1end, l2end, l3end))
# This is the improved chart
results %>%
filter(grp == 1 | grp == 2 | grp == 9 | grp == 10) %>%
mutate(IMD = ifelse(grp == 1, "Highest 10%", "Lowest 10%")) %>%
mutate(IMD = ifelse(grp == 2, "Highest 20%", IMD)) %>%
mutate(IMD = ifelse(grp == 9, "Lowest 20%", IMD)) %>%
mutate(IMD = factor(IMD,
levels = c("Highest 10%", "Highest 20%", "Lowest 20%", "Lowest 10%"))) %>%
ggplot(aes(x = week, y = rate, col = IMD, ymin = ratemin, ymax = ratemax)) +
geom_errorbar(size = 0.5, col = "black") +
geom_line(alpha = 0.5, aes(size = IMD)) +
scale_size_manual(values = c(1, 0.5, 0.5, 1)) +
scale_colour_manual(values = c("#880808", "#AA4A44", "#89CFF0", "#00008B")) +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax, ymin=0,
ymax=max(results$ratemax)), color="transparent", fill="grey", alpha=0.3) +
theme_light() +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0)) +
scale_y_continuous(trans = "sqrt")
Fig. 4: A prettier version of Figure 4 with the lockdown periods shaded in grey5
What we can see in Figure 4 is that the ‘deprivation effects’ – a term that I am applying rather loosely to describe the differences between the deprivation deciles of MSOAs and their populations – are evident for many weeks of the pandemic. It is not always that the most deprived neighbourhoods have higher expected rates than the least; sometimes there appears to be a lag whereby the rate rises in the affluent neighbourhoods before also rising in the least.
The differences between the deprivation deciles raise a question: how many fewer cases of COVID-19 would there be in the most deprived neighbourhoods, if they were to experience the same (perhaps beneficial) effects that the most affluent neighbourhoods do?. With reference to Figure 4, the same question is asking, how many fewer cases would there be in the deprived neighbourhoods if they were represented by the blue instead of the red lines.
To show how this can be estimated, we will return to week 50 as an example. We begin by extracting that week’s model from the rest:
x <- mlm[[50]]
Next we consider the deprivation effects from that model:
rf <- ranef(x, whichel = "IMD_decile") %>%
as_tibble()
rf %>% print(width=Inf, length=Inf)
## # A tibble: 10 × 5
## grpvar term grp condval condsd
## <chr> <fct> <fct> <dbl> <dbl>
## 1 IMD_decile (Intercept) 1 0.392 0.0217
## 2 IMD_decile (Intercept) 2 0.260 0.0220
## 3 IMD_decile (Intercept) 3 0.196 0.0223
## 4 IMD_decile (Intercept) 4 0.145 0.0226
## 5 IMD_decile (Intercept) 5 0.0412 0.0230
## 6 IMD_decile (Intercept) 6 -0.0235 0.0232
## 7 IMD_decile (Intercept) 7 -0.0636 0.0234
## 8 IMD_decile (Intercept) 8 -0.106 0.0236
## 9 IMD_decile (Intercept) 9 -0.168 0.0237
## 10 IMD_decile (Intercept) 10 -0.320 0.0244
Here we can see what Figure 1 previously showed – that the effects for deprivation deciles 1 and 2 (the top 20% most deprived) raise the rate above the average for that week (they aren’t alone in this but they do have most effect), whereas those for deprivation deciles 9 and 10 (the 20% least deprived), lower it (ditto). For the thought exercise, What we can do is substitute group 1’s value of 0.392 with group 10’s value of -0.32, and substitute group 2’s value of 0.26 with group 9’s, -0.168.
To achieve that, we need to know which IMD decile each observation in the model belongs to. That information is contained in,
IMD_decile <- slot(x, "frame")$IMD_decile
head(IMD_decile)
## [1] 2 4 4 3 2 2
The random effect, at the deprivation decile level, applicable to each observation is,
rf$condval[IMD_decile] %>% head()
## [1] 0.2598075 0.1445361 0.1445361 0.1961091 0.2598075 0.2598075
This shows that the first MSOA in the data is in the top 20% most deprived (but not the top 10%), the second is in the top 40% (but not 30%) and so forth. This information can be added to the MSOA level effect and to the fixed part of the model to give a predicted rate that is inclusive of the deprivation effect:
predict1 <- predict(x, re.form = ~ (1 | MSOA11CD)) + rf$condval[IMD_decile]
head(predict1)
## 1 2 3 4 5 6
## 2.474465 3.199153 1.991170 2.636329 2.894351 3.174117
The result is, in fact, just a long-winded way of obtaining
predict(x, re.form = ~ (1 | MSOA11CD) + (1 | IMD_decile)) %>% head()
## 1 2 3 4 5 6
## 2.474465 3.199153 1.991170 2.636329 2.894351 3.174117
The reason for demonstrating the long-winded way is because it is useful for substituting the effects of a deprivation group with those for another, to mix things around in accordance to our ‘what if?’. Below, the effects of being in the most affluent neighbourhoods are swapped into the most deprived:
IMD_decile_switch <- IMD_decile
IMD_decile_switch[IMD_decile == 1] <- 10
IMD_decile_switch[IMD_decile == 2] <- 9
predict2 <- predict(x, re.form = ~ (1 | MSOA11CD)) + rf$condval[IMD_decile_switch]
head(predict2)
## 1 2 3 4 5 6
## 2.047030 3.199153 1.991170 2.636329 2.466916 2.746682
There is likely a more elegant way of doing this using the \(\text{newdata}\) parameter in the \(\text{predict}\) function but, since the above works, I shall stick with it. Now we can look at the result of the substitutions. The actual number of cases in week 50, in the most deprived neighbourhoods, was
sum(slot(x, "frame")$cases[IMD_decile <= 2])
## [1] 22741
As estimated by the model it is,
sum(exp(predict1[IMD_decile <= 2]))
## [1] 22725.97
which is reassuringly close. But, if we give those neighbourhoods the benefit of being affluent then the number drops to,
sum(exp(predict2[IMD_decile <= 2]))
## [1] 12869.76
which is a 43.4 percent reduction.
Of course, that is for a single week. The following code chunk calculates for every week:
predictions <- lapply(1:length(mlm), function(i) {
x <- mlm[[i]]
IMD_decile <- slot(x, "frame")$IMD_decile
cases <- slot(x, "frame")$cases
rf <- ranef(x, whichel = "IMD_decile") %>%
as_tibble()
predict1 <- predict(x, re.form = ~ (1 | MSOA11CD) + (1 | IMD_decile))
IMD_decile_switch <- IMD_decile
IMD_decile_switch[IMD_decile == 1] <- 10
IMD_decile_switch[IMD_decile == 2] <- 9
predict2 <- predict(x, re.form = ~ (1 | MSOA11CD)) + rf$condval[IMD_decile_switch]
data.frame(IMD_decile = IMD_decile,
cases = cases,
cases1 = exp(predict1),
cases2 = exp(predict2)) %>%
as_tibble() %>%
filter(IMD_decile <= 2) %>%
group_by(IMD_decile) %>%
summarise(across(starts_with("cases"), sum)) %>%
ungroup(.) %>%
mutate(week = i, date = unique(df$week)[i])
})
predictions <- do.call(rbind, predictions)
Here again are the results for week 50:
predictions %>%
filter(week == 50) %>%
summarise(across(starts_with("cases"), sum))
## # A tibble: 1 × 3
## cases cases1 cases2
## <dbl> <dbl> <dbl>
## 1 22741 22726. 12870.
And here are the estimated totals across all weeks:
predictions %>%
summarise(across(starts_with("cases"), sum))
## # A tibble: 1 × 3
## cases cases1 cases2
## <dbl> <dbl> <dbl>
## 1 3752062 3745445. 4091472.
All things being equal, if the most deprived neighbourhoods had whatever it is about the people or the places that characterizes the most affluent then they would be expected to have had -9 per cent fewer cases. Surprisingly, the sign is negative, which means that being an affluent neighbourhood would, in fact, increase cases! This is unexpected but, as Figure 5 shows, it is due to omicron. For much of the pandemic, affluence cuts cases. However, not presently. In previous waves there has been some evidence of a socio-economic lag: rates rise in affluent neighbourhoods and then, later, in deprived ones (see Figure 4). Whether this will be true for Omnicron remains to be seen.
predictions %>%
group_by(week) %>%
summarise(across(starts_with("cases"), sum)) %>%
ungroup() %>%
mutate(across(starts_with("cases"), cumsum)) %>%
rename("actual" = cases, "predicted" = cases2) %>%
pivot_longer(., cols = c("actual", "predicted"),
names_to = "scenario",
values_to = "cumsum") %>%
mutate(cumsum = cumsum / 1000) %>%
ggplot(., aes(x = week, y = cumsum, col = scenario)) +
geom_line() +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax, ymin=0,
ymax=3000), color="transparent", fill="grey", alpha=0.3) +
theme_light() +
ylab("Cumulative sum of cases (1000s)") +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0))
Fig. 5: The actual and predicted cumulative sum of COVID cases in the most deprived English neighbourhoods, where the prediction asks ‘what if those deprived neighbourhoods took on the effects of being affluent?’
A problem with the national level of analysis undertaken so far is that it conceals sub-regional differences. The only geography that is in the model is indirect, working through the deprivation deciles. That means that a poor neighbourhood in the North West of England, for example, is treated as equivalent to a poor neighbourhood in London, say. That may or may not be a valid equivalence in usual times but there is an added issue when it comes to looking at COVID because of the spatial transmission of the disease: on at least two occasions during the pandemic (at the beginning and around Christmas 2020) there has been evidence that London has been the ‘driver’ – or, at least, the early warning – of rising infection rates. On other occasions the pandemic has been concentrated in other towns and cities. This is evident in the following charts which are from the Harris and Brunsdon paper.
Fig. 6: Indicating how much individual towns and cities in the English regions deviate from their expected log infection rate each week of the Harris & Brunsdon study
To address the possibility of regional differences, in the following code chunk the models are extended so that the deprivation deciles are linked to regions at the higher level of the model. This means that deprivation decile 1 in the North West of England is treated as a different group to the same decile in London, and so forth for other regions. First, the new group labels are created,
df <- unite(df, "IMD_region", IMD_decile, regionName, sep = "_", remove = FALSE)
Now the models are fitted. As before, you may prefer to use the preprocessed models.
# As before, this will take some time to run
# If you prefer, jump on to the next code chunk to load the resulting models
mlm2 <- lapply(unique(df$week), function(x) {
df %>%
filter(week == x) %>%
mutate(`age22-34` = `age22-24` + `age25-29` + `age30-34`) %>%
mutate(across(starts_with("age") | matches("carebeds") | matches("AandE"),
~ as.numeric(scale(.)))) %>%
glmer(cases ~ `age5-11` + `age12-17` + `age18-21` + `age22-34` +
carebeds + AandE +
(1|MSOA11CD) + (1|IMD_region) + offset(log(`All Ages`)),
family=poisson(), data = .,
control = glmerControl(optimizer = "nloptwrap", calc.derivs = FALSE))
})
[Or]
# If you didn't run the code chunk above then
# load in the preprocessed models
if(!exists("mlm2")) {
connection <- url("https://www.dropbox.com/s/iipeqfi4c2rszqb/mlm2.RData?dl=1")
load(connection)
close(connection)
}
Having fitted the models, we can look at the now regionalised deprivation effects. First, we extract the residual effect of being in each of the groups (here obtained without confidence intervals, which would otherwise overload the charts).
condvals <- do.call(rbind,
lapply(1:length(mlm2), function(i) {
ranef(mlm2[[i]], whichel = "IMD_region") %>%
as_tibble() %>%
separate(grp, c("IMD_decile", "Region"), sep = "_") %>%
mutate(IMD_decile = factor(IMD_decile, levels = 1:10)) %>%
mutate(week = i)
})
)
Next, we can plot those effects. The resulting chart is rather cluttered but it is possible to see, for example, the early and Christmas 2020 peaks of the pandemic in London and the South East, some of the later peaks in the North East and North West, as well as the typically lower infection rate in the South West.
condvals %>%
ggplot(., aes(x = week, y = condval, col = IMD_decile)) +
geom_line(alpha = 0.5) +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax, ymin=-1.5,
ymax=2), color="transparent", fill="grey", alpha=0.3) +
geom_hline(yintercept = 0, linetype = "dotted") +
theme_light() +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0)) +
facet_wrap(~ Region)
Fig. 7: The average deprivation effect each week by region
Figure 8 attempts to make things clearer by focusing on just the most and least deprived neighbourhoods. Generally they track each other reasonably well but there are some differences; for example, in Yorkshire and The Humber where the most deprived neighbourhoods have higher rates of infection after the first lockdown period and towards the end of the third, but lower rates at the beginning of the Omicron phase.
condvals %>%
filter(IMD_decile == 1 | IMD_decile == 2 | IMD_decile == 9 | IMD_decile == 10) %>%
mutate(IMD = ifelse(IMD_decile == 1, "Highest 10%", "Lowest 10%")) %>%
mutate(IMD = ifelse(IMD_decile == 2, "Highest 20%", IMD)) %>%
mutate(IMD = ifelse(IMD_decile == 9, "Lowest 20%", IMD)) %>%
mutate(IMD = factor(IMD,
levels = c("Highest 10%", "Highest 20%", "Lowest 20%", "Lowest 10%"))) %>%
ggplot(., aes(x = week, y = condval, col = IMD, size = IMD)) +
geom_line(alpha = 0.5) +
scale_size_manual(values = c(1, 0.5, 0.5, 1)) +
scale_colour_manual(values = c("#880808", "#AA4A44", "#89CFF0", "#00008B")) +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax, ymin=-1.5,
ymax=2), color="transparent", fill="grey", alpha=0.3) +
geom_hline(yintercept = 0, linetype = "dotted") +
theme_light() +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0)) +
facet_wrap(~ Region)
Fig. 8: The average deprivation effect each week by region for the most and least deprived neighbourhoods
Figure 8 suggests that there are not typically huge difference between the most and least deprived in each region. To explore this further, we will do what we did previously and flip the deprivation effects from the most affluent to the most deprived and see what consequence that has on the predicted numbers of cases in the poorest neighbourhoods.
Stage 1 is to make the predictions:
regional_predictions <- lapply(1:length(mlm2), function(i) {
x <- mlm2[[i]]
IMD_decile <- slot(x, "frame")$IMD_region
cases <- slot(x, "frame")$cases
rf <- ranef(x, whichel = "IMD_region") %>%
as_tibble()
predict1 <- predict(x, re.form = ~ (1 | MSOA11CD) + (1 | IMD_region))
regions <- unique(df$regionName)
IMD_decile_switch <- IMD_decile
for(j in 1:length(regions)) {
IMD_decile_switch[IMD_decile == paste("1", regions[j], sep = "_")] <- paste("10", regions[j], sep = "_")
IMD_decile_switch[IMD_decile == paste("2", regions[j], sep = "_")] <- paste("9", regions[j], sep = "_")
}
predict2 <- predict(x, re.form = ~ (1 | MSOA11CD)) + rf$condval[IMD_decile_switch]
data.frame(IMD_decile = IMD_decile,
cases = cases,
cases1 = exp(predict1),
cases2 = exp(predict2)) %>%
as_tibble() %>%
separate(IMD_decile, c("IMD_decile", "Region"), "_") %>%
mutate(IMD_decile = as.numeric(IMD_decile)) %>%
filter(IMD_decile <= 2) %>%
group_by(IMD_decile, Region) %>%
summarise(across(starts_with("cases"), sum), .groups = "keep") %>%
ungroup(.) %>%
mutate(week = i, date = unique(df$week)[i])
})
regional_predictions <- do.call(rbind, regional_predictions)
Stage 2 is to calculate the cumulative sum of cases that actually occurred and might have occurred if the poorest neighbourhoods benefited or otherwise from the same effects as the richest. What we discover in each region reflects the national case. For much of the pandemic, deprived neighbourhoods would have fewer cases of COVID if they had the same attributes as the most wealthy, particularly during and following the third lockdown. However, the omicron variant has changed that, doing so soonest in London and the West Midlands, closely followed by other regions – see Figure 9.
cumsums <- regional_predictions %>%
select(week, cases, cases2, Region) %>%
group_by(week, Region) %>%
summarise(across(starts_with("cases"), sum), .groups = "keep") %>%
ungroup() %>%
group_by(Region) %>%
mutate(across(starts_with("cases"), cumsum)) %>%
ungroup() %>%
rename("actual" = cases, "predicted" = cases2) %>%
pivot_longer(., cols = c("actual", "predicted"),
names_to = "scenario",
values_to = "cumsum") %>%
mutate(cumsum = cumsum / 1000) %>%
split(., .$Region)
mx <- sapply(cumsums, function(x) max(x$cumsum))
plots <- lapply(1:length(cumsums), function(i) {
plotdata <- cumsums[[i]]
ggplot(plotdata, aes(x = week, y = cumsum, col = scenario)) +
geom_line() +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax,
ymin=0, ymax = mx[i]),
color="transparent", fill="grey", alpha=0.3) +
theme_light() +
ylab("Cumulative sum of cases (1000s)") +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0)) +
ggtitle(unique(regional_predictions$Region)[i])
})
print(plots)
Fig. 9: The actual and predicted cumulative sum of COVID cases in the most deprived English neighbourhoods, where the prediction asks ‘what if those deprived neighbourhoods took on the effects of being affluent?’ in each region
Despite previous evidence that deprived places and their populations are more likely to be exposed to and infected by COVID-19, the evidence for that, here, is not strong – for some periods of the pandemic, perhaps, but not throughout it. In particular, the Omicron variant has left affluent neighbourhoods with greater rates of infection than deprived ones.
Looking back at Figure 8, it suggests that the greater differences are between regions and not between deprivation deciles. This finds some support in a paper by Griffith et al. which found that regional inequality in COVID-19 mortality [not infections] declined from an initial peak in April, before increasing again in June/July 2020.
We can consider whether and when regional differences matter more than those due to the deprivation deciles by extending the model to incorporate more levels of analysis. In the first model the two levels were MSOA-neighbourhoods (of which they are 6789)6 and the deprivation decile to which each MSOA belonged (10 groups). In the second, they were the MSOA and the combined IMD_Region codes (90 groups: 10 deciles \(\times\) 9 regions). In the third, below, there are four levels, which are the MSOAs, the IMD_Region codes, the IMD deciles and the Regions.
# As previously, jump on to the next code chunk to load the resulting models
mlm3 <- lapply(unique(df$week), function(x) {
df %>%
filter(week == x) %>%
mutate(`age22-34` = `age22-24` + `age25-29` + `age30-34`) %>%
mutate(across(starts_with("age") | matches("carebeds") | matches("AandE"),
~ as.numeric(scale(.)))) %>%
glmer(cases ~ `age5-11` + `age12-17` + `age18-21` + `age22-34` +
carebeds + AandE +
(1|MSOA11CD) + (1|IMD_region) + (1|regionName) + (1|IMD_decile) +
offset(log(`All Ages`)),
family=poisson(), data = .,
control = glmerControl(optimizer = "nloptwrap", calc.derivs = FALSE))
})
[Or]
# If you didn't run the code chunk above then
# load in the preprocessed models
if(!exists("mlm3")) {
connection <- url("https://www.dropbox.com/s/xdz8k9165u04u5w/mlm3.RData?dl=1")
load(connection)
close(connection)
}
Having fitted the model, of interest is at what level the differences between the infections rates are greatest. To explore this, first we need to calculate what percentage of the total variance is due to each level:
variances <- do.call(rbind,
lapply(1:length(mlm3), function(i) {
VarCorr(mlm3[[i]]) %>%
as_tibble() %>%
mutate(pcnt = round(vcov / sum(vcov) * 100,2), week = i) %>%
select(week, grp, vcov, pcnt)
}))
Then we can plot those percentages:
variances %>%
mutate(grp = factor(grp, levels = c("MSOA11CD", "IMD_decile", "IMD_region", "regionName"))) %>%
ggplot(., aes(x = week, y = pcnt, col = grp)) +
geom_line(size = 1, alpha = 0.5) +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax,
ymin=0, ymax = max(variances$pcnt)),
color="transparent", fill="grey", alpha=0.3) +
theme_light() +
ylab("Percentage of variance") +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0))
Fig 10. Percentage of the variance in the infection rates due to each level of the model each week
It’s not surprising to find that most of the variation is at the MSOA level – some MSOAs ‘get lucky’, others do not each week. However, if we excluded that level from the chart it is clear that regional effects are much greater than the deprivation deciles ones:
variances %>%
mutate(grp = factor(grp, levels = c("MSOA11CD", "IMD_decile", "IMD_region", "regionName"))) %>%
filter(grp != "MSOA11CD") %>%
ggplot(., aes(x = week, y = pcnt, col = grp)) +
geom_line(size = 1, alpha = 0.5) +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax,
ymin=0, ymax = max(variances$pcnt)),
color="transparent", fill="grey", alpha=0.3) +
theme_light() +
ylab("Percentage of variance") +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0))
Fig 11. As Figure 10 but with the MSOA level omitted
Does Figure 11 mean that deprivation does not matter? It isn’t that simple. To appreciate why, it is important to consider the geography of those deprivation deciles. Richer and poorer neighbourhoods are neither randomly nor uniformally distributed across the English regions. In the North East, 38% of neighbourhoods are in deprivation deciles 1 or 2, as are 35% of MSOAs in the North West. In the South East and South West, it is only 7% and 8%, respectively.
round(apply(proportions(table(df$IMD_decile, df$regionName), 2), 2, cumsum), 2)
##
## East Midlands East of England London North East North West South East
## 1 0.08 0.03 0.02 0.21 0.23 0.03
## 2 0.19 0.08 0.15 0.38 0.35 0.07
## 3 0.27 0.15 0.33 0.52 0.46 0.14
## 4 0.36 0.24 0.47 0.62 0.56 0.21
## 5 0.47 0.36 0.60 0.69 0.63 0.29
## 6 0.57 0.48 0.71 0.77 0.70 0.40
## 7 0.66 0.61 0.78 0.84 0.78 0.51
## 8 0.77 0.73 0.86 0.89 0.85 0.64
## 9 0.89 0.88 0.94 0.95 0.94 0.78
## 10 1.00 1.00 1.00 1.00 1.00 1.00
##
## South West West Midlands Yorkshire and The Humber
## 1 0.03 0.17 0.19
## 2 0.09 0.30 0.31
## 3 0.17 0.39 0.42
## 4 0.27 0.50 0.51
## 5 0.40 0.59 0.59
## 6 0.54 0.68 0.68
## 7 0.68 0.78 0.77
## 8 0.81 0.87 0.86
## 9 0.93 0.94 0.92
## 10 1.00 1.00 1.00
Table 1. Cumulative percentage of MSOAs in each deprivation decile in each region
The point is that the regional geography is closely entwined with the geography of neighbourhood deprivation and that the regions with the highest COVID rates for most weeks tend to be those with more of the deprived neighbourhoods, as Table 2 shows.
sort(table(sapply(mlm3, function(x) {
ranef(x, whichel = "regionName") %>%
as_tibble() %>%
filter(condval - 1.96*condsd > 0) %>%
arrange(desc(grp)) %>%
slice_head(n = 1) %>%
select(grp) %>%
pull(.)
})), decreasing = TRUE)
##
## North West Yorkshire and The Humber North East
## 23 20 19
## South West London East Midlands
## 17 15 12
## South East West Midlands East of England
## 3 2 0
Table 2. The number of weeks each region has had the highest modelled COVID infection rate
Let’s try a further model, looking at how the concentration of deprivation or of affluence in each week is related to the COVID rate of places across the period of the study. These ‘places’ are defined at a sub-regional scale based on the ONS 2015 defintiion of Major Town or City. Where that City is London, its Boroughs are used as the areal unit instead because it is so big, . For (rural) areas that are outside of a major towns or city, the local authority is named. In some cases the more de facto population-based boundary of a city or major town spans beyond its actual administrative boundary, as when Bristol spills into and, in these data, truncates the boundary of South Gloucestershire.7.
The first thing to do is calculate the proportion of the Adult population in each place that are in deprivation decile 1, deprivation 2, deprivation 3, etc.:
df <- df %>%
filter(week == min(week)) %>%
select(PLACE, IMD_decile, Adults) %>%
dummy_cols(select_columns = "IMD_decile", remove_selected_columns = TRUE) %>%
group_by(PLACE) %>%
summarise(across(starts_with("IMD_decile"), weighted.mean, w = Adults)) %>%
ungroup() %>%
left_join(df, ., by = "PLACE")
Having done that, a model is fitted with three levels (three random variables) – MSOAs, IMD deciles, PLACEs and regions – and with the proportion of Adults in each deprivation decile include as fixed effects.
# Jump on to the next code chunk to load the resulting models
mlm4 <- lapply(unique(df$week), function(x) {
df %>%
filter(week == x) %>%
mutate(`age22-34` = `age22-24` + `age25-29` + `age30-34`) %>%
mutate(across(starts_with("age") | matches("carebeds") | matches("AandE"),
~ as.numeric(scale(.)))) %>%
glmer(cases ~ `age5-11` + `age12-17` + `age18-21` + `age22-34` +
carebeds + AandE +
IMD_decile_1 + IMD_decile_2 + IMD_decile_3 + IMD_decile_4 + IMD_decile_5 +
IMD_decile_7 + IMD_decile_8 + IMD_decile_9 + IMD_decile_10 +
(1|MSOA11CD) + (1|IMD_decile) + (1|PLACE) + (1|regionName) +
offset(log(`All Ages`)),
family=poisson(), data = .,
control = glmerControl(optimizer = "nloptwrap", calc.derivs = FALSE))
})
[Or]
# To load in the preprocessed models
if(!exists("mlm4")) {
connection <- url("https://www.dropbox.com/s/d7hc6yy3ug2erof/mlm4.RData?dl=1")
load(connection)
close(connection)
}
As before, we can look at what level the differences between the infections rates are greatest, omitting the MSOA level. It is often but not always the differences between the places that have more variation than the differences between regions but both exceed the differences between deprivation deciles.
variances <- do.call(rbind,
lapply(1:length(mlm4), function(i) {
VarCorr(mlm4[[i]]) %>%
as_tibble() %>%
mutate(pcnt = round(vcov / sum(vcov) * 100,2), week = i) %>%
select(week, grp, vcov, pcnt)
}))
variances %>%
mutate(grp = factor(grp, levels = c("MSOA11CD", "IMD_decile", "PLACE", "regionName"))) %>%
filter(grp != "MSOA11CD") %>%
ggplot(., aes(x = week, y = pcnt, col = grp)) +
geom_line(size = 1, alpha = 0.5) +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax,
ymin=0, ymax = max(variances$pcnt)),
color="transparent", fill="grey", alpha=0.3) +
theme_light() +
ylab("Percentage of variance") +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0))
Fig 12. Percentage of the variance in the infection rates due to each level of the model each week
At this point, the evidence that deprivation matters to COVID rates at a neighbourhood scale is weak. However, let’s look at the fixed effect for each IMD decile (6 is the baseline) and highlight, with a red dot, every instance where the estimated effect is significantly greater than zero. Just by ‘eyeballing’ Figure 13, below, it is evident that higher concentrations of neighbourhood deprivation are related to greater COVID-rates, more so than greater concentrations of affluence are.
coeffs <- lapply(1:length(mlm4), function(i) {
x <- mlm4[[i]]
data.frame(summary(x)$coefficients[,1:2]) %>%
rownames_to_column(var = "rowname") %>%
as_tibble() %>%
filter(substr(rowname, 1, 3) == "IMD") %>%
separate(., rowname, into = c(NA, NA, "IMD"), sep = "_") %>%
rename(beta = 2, se = 3) %>%
mutate(upr = beta + 1.96 * se, lwr = beta - 1.96 * se, week = i)
})
coeffs <- do.call(rbind, coeffs)
coeffs %>%
filter(week > 2) %>%
mutate(IMD = factor(IMD, levels = 1:10)) %>%
ggplot(., aes(x = week, y = beta, ymin = lwr, ymax = upr)) +
geom_errorbar() +
geom_point(data = coeffs %>% filter(week > 2 & lwr > 0), colour = "red", shape = 20, size = 5) +
geom_hline(yintercept = 0) +
facet_wrap(~ IMD) +
geom_rect(data=lockdowns, inherit.aes=FALSE, aes(xmin=xmin, xmax=xmax,
ymin=-2.5, ymax = 3.5),
color="transparent", fill="grey", alpha=0.3) +
theme_light() +
theme(axis.text.x = element_text(angle = 90),
panel.grid.major.x = element_blank()) +
scale_x_continuous(name = NULL, breaks = ticks$week, labels = ticks$date,
limits = c(1, max(results$week)+2),
expand = c(0,0))
Fig 13. The estimated effects of the deprivation deciles on the deprivation rates each week
There is one key limitation of the data used in this work — they are undercounts of the true number of COVID-19 cases in English neighbourhoods. This undercount arises for two reasons.
First, not everyone who has COVID will be tested for a positive diagnosis. Limitations in the testing regime are described in the Harris & Brunsdon paper and are most severe earlier in the pandemic.
Second, where a neighbourhood has less than three cases in a week, that number is treated as zero even th]ough it could really be one or two. That second type of undercount adds up, although, unsurprisingly it affects the largest cities most (because they have more MSOAs to be undercounted) in weeks when COVID is not especially prevalent.8:
Fig. 14: Undercount at the local authority level when summing up from MSOAs each week
Of the two sources of undercount, the number who have had COVID but are not tested is not known. For the second (which is likely the more trivial), an allowance has been made by adjusting the zero numbers upwards each week for MSOAs that had cases in either of the weeks before of after the zero number was registered, and/or by adding a case to randomly selected ‘zero case’ MSOAs, with their probability of selection equal to in how many of the 111 weeks they had registered three or more cases. The logic of all this is that small adjustments are made to the zero case numbers in any week for MSOAs that, (a) in the previous week had, (b) in the next week will have, or (c) over the course of the pandemic, most frequently have had more than three COVID cases so are presumably at greatest risk. The advantage of this adjustment procedure is that the MSOA level counts now add up to what they should equal at the local authority level and above. The disadvantage is that some of the numbers are now estimates at the MSOA level. Recall, however, that some were estimates already by virtue of setting an unknown number of ones and twos to zero. In practice, the adjustments are small and do not occur at all in the times and places where COVID is most prevalent, which are the ones most of interest.
Contrary to what might be expected from previous work, if those living in the most deprived English neighbourhoods were to take on the characteristics of the most affluent neighbourhoods and their populations, the would have more not fewer cases. This is likely related to the social circumstances under which the Omicron variant has spread. Without the same level of lockdown as in previous waves of the pandemic, the more affluent – or, at least, the neighbourhoods in which they live – have become more exposed to the disease and experienced more cases than deprived neighbourhoods. This is a reversal of what often happened earlier in the pandemic. Nevertheless, it would be premature to argue that deprivation does not matter. It is true that ‘deprivation effects’ (as measured by a random component in a multilevel model) matter much less than regional and sub-regional effects do. However, therein lies a clue. The geography of deprivation is closely related to the regional geography of England and there are often strong regional and sub-regional differences in the COVID rates throughout the pandemic. Fitting a regression model that includes the concentration of deprived and other neighbourhood types shows that a higher proportion of the adult population living in the most deprived neighbourhoods increases COVID-rates in English sub-regions more often than does having a higher proportion in the most affluent ones.
More accurately, it is from the population weighted mean average IMD score calculated for each MSOA from LSOA data↩︎
The carehome data are sourced from here (although I am actually using a slightly earlier version of the data)↩︎
A useful record of the timings of lockdown in England is available here↩︎
the City of London and the Isles of Scilly are omitted throughout↩︎
A * in the place name in the data indicates that the place extends beyond its own administrative boundary (e.g. Bristol); a # indicates a local authority that has been bisected by one or more others (e.g. South Gloucestershire)↩︎
when it is more prevalent lots of neighbourhoods have more than two cases so the numbers don’t need to be censored to zero↩︎