Exploring some of the area level correlates of Covid-19 deaths in London

Using a spatial difference-in-difference approach

Richard Harris

14 May 2020

Abstract

This short and exploratory paper sets out a method of analysis known as spatial difference-in-difference to look at some of the area level correlates of higher Covid-19 death rates in London. It shows that, in particular, having higher percentages of those in their eighties or above raises the death rate compared to neighbouring locations but other factors include the percentage of the population that is Black Caribbean and the percentage that have never worked or are long-term unemployed. There is, however, a lot of unexplained variation (spatial heterogeneity). Local ‘hot spots’ (and ‘cold spots’) of the death rates are mapped - defined as places with a conditionally higher (or lower) death rate when compared to their neighbours.

Introduction

Note: this is not peered reviewed and is exploratory only. It should be treated as such and with caution. Last revised May 18, 2020 to correct an error in identifying the ‘cold spots’ and to add some additional observations about the method of analysis

This short paper explores some of the area level correlates of Covid-19 death rates in London (excluding the City of London). To do this, it combines data on the number of deaths between March 1 and April 17,2020 per Middle Super Output Area (MSOA) released by the Office of National Statistics, divided by the mid-2018 population estimates of the number of adults usually resident in each MSOA to form an estimated death rate per 1000 population (this is not age standardised).

A spatial difference-in-difference approach is then used to estimate whether a difference in the death rate across contiguous MSOAs (those that share a border) is associated with differences in the ages of their population, their ethnic composition (in the 2011 Census), the value and types of houses that have been sold in them since Jan 2017 (from Land Registry data), their deprivation rank (from 2019 English Indices of Deprivation data), population density, and/or with differences in the percentages of their populations belonging to each of the classes in the National Statistics Socio-Economic Classification (also from Census data).

The main motivation for the spatial difference-in-difference approach is geographical curiosity – it is interesting when neighbouring places have different outcomes. In principle the method also allows for unobserved/unmodelled effects that are common across neighbours (by the process of subtraction, they are ‘subtracted out’).

The analysis is intended to be exploratory. It is reproducible and set out below.

Loading and about the data

The data set can be loaded, as below

require(tidyverse)
df <- read_csv("https://www.dropbox.com/s/iqdze8arxzmxeaa/london.csv?dl=1")
df %>% print(n = 2, width = Inf)

## # A tibble: 2,800 x 46
##   MSOA11CD1 rate1 MSOA11CD2 rate2       d LAU210CD   LAU210NM       LAU110CD
##   <chr>     <dbl> <chr>     <dbl>   <dbl> <chr>      <chr>          <chr>   
## 1 E02000002 0.369 E02000003 0.364 0.00470 UKI2101024 Chadwell Heath UKI2101 
## 2 E02000004 1.77  E02000007 1.29  0.483   UKI2101025 Eastbrook      UKI2101 
##   LAU110NM             NUTS306CD NUTS306NM                          NUTS106CD
##   <chr>                <chr>     <chr>                              <chr>    
## 1 Barking and Dagenham UKI21     Outer London - East and North East UKI      
## 2 Barking and Dagenham UKI21     Outer London - East and North East UKI      
##   NUTS106NM Population Twenties Thirties Fourties Fifties Sixties Seventies
##   <chr>          <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>     <dbl>
## 1 London         -3331    -1.38    -3.61    0.928   -1.14   -1.30   -0.0789
## 2 London         -3537     2.52    -1.37   -1.35     1.79    2.33    2.24  
##   Eightyplus TCITY15CD TCITY15NM permille decile Detached  Semi Terraced Flats
##        <dbl> <chr>     <chr>        <dbl>  <dbl>    <dbl> <dbl>    <dbl> <dbl>
## 1       1.87 J01000055 London          19      0   -0.268 11.6      1.03 -12.4
## 2       2.15 J01000055 London          64      1    0.558  7.71     8.53 -16.8
##    WBRI  INDIAN PAKISTANI   BANG   CHNE   BAFR   BCRB   MIXD Group1 Group2
##   <dbl>   <dbl>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1  10.4 -4.10      -2.97  -2.70  -0.107   1.15 -0.699  0.560  -2.35  -3.46
## 2  16.7 -0.0474    -0.452 -0.599  0.126 -11.6   0.358 -0.483   1.06   3.77
##   Group3  Group4 Group5 Group6 Group7 Group8 IMD_rank
##    <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>
## 1  -3.33 -0.0522 -0.259   4.29   3.50   1.58   -4502.
## 2   4.88  0.904   0.350  -3.00  -2.75  -3.23    6483.
## # ... with 2,798 more rows

The structure of the data is as follows.

Variables MSOA11CD1 and MSOA11CD2 identify two, neighbouring MSOAs - that is, two that share a border. In the first case, it is E02000001, with a Covid-19 death rate (rate1) of 0.414 per 1000 population, and E02000193, with 0.262 (rate2). The difference, d, between those rates is 0.151. What the spatial difference-in-difference approach asks is whether the difference in the death rates is associated with any other differences in the populations of these two MSOAs – for example, is it related to there being 7.64 percentage points fewer residents aged in their Twenties in E02000001 than in E02000193, or 15.8 percentage points more in National Statistics Socio-economic Classification Group1 (Higher managerial; administrative and professional occupations) in E02000001.

Later the variables are explained more fully. For now, note that with the exception of rate1, rate2 (and also the non-numeric variables), each measures a (spatial) difference across the boundary of two contiguous (neighbouring) MSOAs: it is the value of the underlying variable for MSOA11CD1 minus the value for MSOA11CD2. Note also that because E02000001, for example, has more than one neighbour with which it shares a border so E02000001 appears more than once under the variable MSOA11CD1 (as might E02000193 under the variable MSOA11CD2). We will address this induced grouping effect in the modelling procedure. Finally, note also that in the same way that E02000001 shares a border with E02000193 so, of course, does E02000193 share a border with E02000001, which means that both could have appeared later in the table of the data with the positions reversed and with all the variables having ‘the same’ values but with their signs reversed (positive values would become negative, negative values would become positive). Since this is simply a duplication of the information, it is removed. That is done by only considering the cases where \(d > 0\).

A likely problem with the data is that d is highly skewed:

require(ggplot2)
df %>%
  ggplot(., aes(x = d)) +
  geom_histogram()

df %>%
  ggplot(., aes(sample = d)) +
  stat_qq() +
  stat_qq_line()

This is not well corrected with \(log(d)\) (\(\sqrt{d}\) does better) so a Box-Cox transformation will be applied to it:

require(MASS)
bcx <- boxcox(d ~ 1, data = df, plotit = FALSE)
lambda <- bcx$x[which.max(bcx$y)]
df <- df %>%
  mutate(dT = (d^lambda - 1) / lambda)

Fitting the model

We may suspect, based on what we know about Covid-19, that ares with higher percentages of eighty year olds and above (and possibly containing a care home or a retirement village) are likely to have higher death rates than a neighbouring area with fewer of the elderly. We could test this by fitting a model of the sort,

require(broom)
ols <- lm(dT ~ Eightyplus, data = df)
tidy(ols)

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -0.904     0.0141    -64.4  0.      
## 2 Eightyplus    0.0707    0.0103      6.89 6.87e-12

… and, yes, there is some evidence that the conjecture is true. However, what this has not done is address the grouping effect induced in the data by considering each MSOA and all of its neighbours. To do so, we will fit a simple multilevel model,

require(lme4)
mlm <- lmer(dT ~ Eightyplus + (1 | MSOA11CD1) + (1 | MSOA11CD2), data = df)
tidy(mlm) %>%
  # 'slice out' the random effects from the results table
  # leave the fixed effects
  slice(1:2)

## # A tibble: 2 x 5
##   term        estimate std.error statistic group
##   <chr>          <dbl>     <dbl>     <dbl> <chr>
## 1 (Intercept)  -1.31      0.0310    -42.4  fixed
## 2 Eightyplus    0.0958    0.0126      7.62 fixed

Because of the Box-Cox transformation, the regression coefficients do not have a ready interpretation other than to confirm that a higher percentage of the resident population who are in their eighties or above is associated with a higher death rate. However, the model fits the data poorly. The pseudo R-squared value is,

cor(df$dT, predict(mlm, re.form=NA))^2

## [1] 0.01668312

Let’s now expand the model to include a greater range of predictors. These are:

Measures of the difference in the areas’ affluence / deprivation:

decile: a measure of the difference in the (trimmed) mean average house price for each area, where that average is grouped into deciles (the difference is the difference between those decile scores)
IMD_rank: the difference in the average IMD rank for each area (lower rank = more deprivation). Calculated from the 2019 English Indices of Deprivation.

Measures of the difference in the areas’ age groups:

Twenties: the difference in the percentage of the population aged in their 20s (from mid 2018 estimates)
Thirties: the difference in the percentage of the population aged in their 30s
Sixties: the difference in the percentage of the population aged in their 60s
Seventies: the difference in the percentage of the population aged in their 70s
Eightyplus: the difference in the percentage of the population aged in their 80s or above

(Other age groups are omitted)

Measures of the difference in the areas’ housing stock:

Terraced: the difference in the percentage of properties sold since 2017 that are terraced properties
Flats: the difference in the percentage of properties sold that are flats or apartments
Detached: the difference in the percentage of properties sold that are detached

(Semi-detached properties are omitted)

Measures of the difference in the areas’ ethnic composition:

INDIAN: The difference in the percentage of the population that is Indian
PAKISTANI: The difference in the percentage of the population that is Pakistani
BANG: The difference in the percentage of the population that is Bangladeshi
CHNE: The difference in the percentage of the population that is Chinese
BAFR: The difference in the percentage of the population that is Black African
BCRB: The difference in the percentage of the population that is Black Caribbean
MIXED: The difference in the percentage of the population that is of a Mixed ethnicity

(The White British and some other ethnicities are omitted)

Measures of the difference in the areas’ socio-economic composition (‘class’):

Group1: The difference in the percentage of the population aged 16 and over in National Statistics Socio-Economic Classification (NS-Sec) 1: Higher managerial; administrative and professional
Group2: The difference in the percentage of the population in NS-Sec 2: Lower managerial; administrative and professional occupations
Group4: The difference in the percentage of the population in NS-Sec 4: Small employers and own account workers
Group5: The difference in the percentage of the population in NS-Sec 5: Lower supervisory and technical occupations
Group6: The difference in the percentage of the population in NS-Sec 6: Semi-routine occupations
Group7: The difference in the percentage of the population in NS-Sec 7: Routine occupations
Group8: The difference in the percentage of the population in NS-Sec 8: Never worked and long-term unemployed

(Group3: The difference in the percentage of the population in NS-Sec 3: Intermediate occupations is omitted)

A proxy for the difference in the areas’ population density

Population: the difference in the population in each area. This is a rough approximation to the difference in their population density.

Most of these variables are percentage point differences so can range, in principle, from \(-100\) to \(100\). The exceptions are the decile, IMD_rank and Population variables. Rather than rescale all the variables, for rough comparability with the rest, the three exceptions are rescaled to be of the same broad order of magnitude:

rescale <- function(x) {
  100 * (x - min(x)) / (max(x) - min(x))
}
df2 <- df %>%
  mutate_at(c("decile","IMD_rank","Population"), rescale)

Now fitting the model, isolating only the statistically significant variables (\(|t| > 1.96\)) and ordering these in decreasing effect size, gives:

mlm <- lmer(dT ~ 1 + (1 | MSOA11CD1) + (1 | MSOA11CD2) +
              decile + IMD_rank + 
              Twenties + Thirties + Sixties + Seventies + Eightyplus +
              Terraced + Flats + Detached +
              INDIAN + PAKISTANI + BANG + CHNE + BAFR + BCRB + MIXD +
              Group1 + Group2 + Group4 + Group5 + Group6 + Group7 + Group8 + 
              Population,
              data = df2)
tidy(mlm) %>%
  # 'slice out' the overall intercept and random effects
  slice(2:26) %>%
  dplyr::select(-group) %>%
  filter(abs(statistic) > 1.96) %>%
  arrange(desc(estimate)) %>%
  print(n=Inf)

## # A tibble: 14 x 4
##    term       estimate std.error statistic
##    <chr>         <dbl>     <dbl>     <dbl>
##  1 Eightyplus  0.203     0.0240       8.48
##  2 BCRB        0.0538    0.00842      6.39
##  3 Group8      0.0399    0.0145       2.76
##  4 INDIAN      0.0154    0.00371      4.16
##  5 Terraced    0.00598   0.00180      3.33
##  6 Flats       0.00595   0.00187      3.18
##  7 BAFR       -0.0117    0.00591     -1.99
##  8 IMD_rank   -0.0126    0.00343     -3.67
##  9 Group1     -0.0207    0.00944     -2.19
## 10 Twenties   -0.0226    0.00644     -3.51
## 11 PAKISTANI  -0.0294    0.00769     -3.83
## 12 Group6     -0.0516    0.0209      -2.47
## 13 Seventies  -0.0664    0.0286      -2.32
## 14 MIXD       -0.108     0.0218      -4.93

The strongest correlates with an increased death rate are greater percentages of those in their eighties and above, Black Caribbeans and those who have never worked or are long-term unemployed. The strongest correlates on a decreased rate are greater percentages of a Mixed ethnicity population, those aged in their seventies, and of those in semi-routine occupations.

The finding about Black Caribbeans supports those published by the Institute for Fiscal Studies and summarised here. The increased death rate within more deprived areas has also been reported by the Office of National Statistics. In the current model it can be seen that in addition to the effect of those who have never worked or are long term unemployed, places with a lower IMD rank (i.e. less deprivation) have a lowered death rate.

That an increased percentage of seventy year olds is associated with a decrease in the death rate is surprising but consistent across nearly all of London: if we add in dummy variables to represent each of London’s Boroughs (LAU110NM) the effect remains negative in nearly every one (see below, dropping the statistical insignificant variables). Care is therefore needed to avoid the ecological fallacy – the (incorrect) assumption that a statistical relationship measured at an area level necessarily applies to individuals living within those areas. Here, it does not follow that because greater percentages of people in their seventies are associated with lower death rates so being in your seventies reduces the risk: this clearly contrary to the medical and other statistical evidence. The ecological fallacy arises because there is something else about the areas in which those people are living that decreases the death rate (it may be due to the characteristics of other people who live there, for example).

mlm <- lmer(dT ~ 1 + (1 | MSOA11CD1) + (1 | MSOA11CD2) +
              IMD_rank + 
              Twenties + Eightyplus +
              Terraced + Flats +
              INDIAN + PAKISTANI + BAFR + BCRB + MIXD +
              Group1 + Group6 + Group8 +
              Seventies +
              Seventies:LAU110NM,
              data = df2)
results <- tidy(mlm) %>%
  slice(15:46) # These are the results related to the Seventies variable
baseline <- as(results[1,"estimate"], "numeric")
results %>%
  slice(-1) %>%
  mutate(plus_base = baseline + estimate) %>%
  mutate(is.negative = ifelse(plus_base < 0, T, F)) %>%
  dplyr::select(term, estimate, plus_base, is.negative) %>%
  print(n=Inf, width=Inf)

## # A tibble: 31 x 4
##    term                                     estimate plus_base is.negative
##    <chr>                                       <dbl>     <dbl> <lgl>      
##  1 Seventies:LAU110NMBarnet                 -0.0611   -0.0312  TRUE       
##  2 Seventies:LAU110NMBexley                 -0.105    -0.0754  TRUE       
##  3 Seventies:LAU110NMBrent                  -0.0914   -0.0615  TRUE       
##  4 Seventies:LAU110NMBromley                -0.142    -0.112   TRUE       
##  5 Seventies:LAU110NMCamden                 -0.0662   -0.0363  TRUE       
##  6 Seventies:LAU110NMCroydon                -0.0967   -0.0668  TRUE       
##  7 Seventies:LAU110NMEaling                 -0.00431   0.0256  FALSE      
##  8 Seventies:LAU110NMEnfield                -0.0812   -0.0514  TRUE       
##  9 Seventies:LAU110NMGreenwich              -0.0963   -0.0664  TRUE       
## 10 Seventies:LAU110NMHackney                -0.0776   -0.0477  TRUE       
## 11 Seventies:LAU110NMHammersmith and Fulham -0.0723   -0.0424  TRUE       
## 12 Seventies:LAU110NMHaringey               -0.0782   -0.0483  TRUE       
## 13 Seventies:LAU110NMHarrow                 -0.123    -0.0931  TRUE       
## 14 Seventies:LAU110NMHavering               -0.0766   -0.0468  TRUE       
## 15 Seventies:LAU110NMHillingdon             -0.0975   -0.0676  TRUE       
## 16 Seventies:LAU110NMHounslow               -0.0497   -0.0198  TRUE       
## 17 Seventies:LAU110NMIslington              -0.0762   -0.0463  TRUE       
## 18 Seventies:LAU110NMKensington and Chelsea -0.106    -0.0765  TRUE       
## 19 Seventies:LAU110NMKingston upon Thames   -0.0715   -0.0417  TRUE       
## 20 Seventies:LAU110NMLambeth                -0.0756   -0.0458  TRUE       
## 21 Seventies:LAU110NMLewisham               -0.124    -0.0938  TRUE       
## 22 Seventies:LAU110NMMerton                 -0.0729   -0.0430  TRUE       
## 23 Seventies:LAU110NMNewham                 -0.0650   -0.0351  TRUE       
## 24 Seventies:LAU110NMRedbridge              -0.0244    0.00547 FALSE      
## 25 Seventies:LAU110NMRichmond upon Thames   -0.0281    0.00175 FALSE      
## 26 Seventies:LAU110NMSouthwark              -0.0899   -0.0600  TRUE       
## 27 Seventies:LAU110NMSutton                 -0.0543   -0.0244  TRUE       
## 28 Seventies:LAU110NMTower Hamlets          -0.105    -0.0748  TRUE       
## 29 Seventies:LAU110NMWaltham Forest          0.00612   0.0360  FALSE      
## 30 Seventies:LAU110NMWandsworth             -0.0522   -0.0224  TRUE       
## 31 Seventies:LAU110NMWestminster            -0.161    -0.131   TRUE

Looking at the random effects

Refitting the above model but without the Borough level interaction between LAU110NM and Seventies variable shows that the predictor variables still explain only a low proportion of the differences in the death rates.

mlm <- lmer(dT ~ 1 + (1 | MSOA11CD1) + (1 | MSOA11CD2) +
              IMD_rank + 
              Twenties + Seventies + Eightyplus +
              Terraced + Flats +
              INDIAN + PAKISTANI + BAFR + BCRB + MIXD +
              Group1 + Group6 + Group8 +
              Seventies,
              data = df2)
# The pseudo R-squared value
cor(df$dT, predict(mlm, re.form=NA))^2

## [1] 0.04762505

The unexplained differences in the Covid-19 death rates of neighbouring MSOAs is of geographical interest because it invites the question, why are there places that are located next to each other but have different level of death rate?. Those places that are much higher than their neighbours (conditional on the predictor variables) can be identified from the model from the way that the group effect for MSOA11CD1 is modelled (as a random effect within the model):

rr1 <- ranef(mlm, whichel = "MSOA11CD1")
rr1 <- as_tibble(rr1) %>%
  dplyr::select(grp, condval, condsd) %>%
  # Keep only those that are significantly above zero
  filter(condval - 1.96 * condsd > 0) %>%
  rename(MSOA11CD = grp)
rr1 %>%
  arrange(desc(condval)) %>%
  print(n=3)

## # A tibble: 255 x 3
##   MSOA11CD  condval condsd
##   <fct>       <dbl>  <dbl>
## 1 E02000715    2.31  0.145
## 2 E02000098    2.08  0.136
## 3 E02000573    1.99  0.122
## # ... with 252 more rows

We can see, for example, that there is an MSOA, E02000715, which has a death rate that is much higher than its neighbours. That MSOA is in Newham.

df %>%
  filter(MSOA11CD1 == "E02000715") %>%
  dplyr::select(MSOA11CD1, rate1, rate2, d, LAU210NM, LAU110NM) %>%
  print(n=Inf)

## # A tibble: 5 x 6
##   MSOA11CD1 rate1 rate2     d LAU210NM      LAU110NM
##   <chr>     <dbl> <dbl> <dbl> <chr>         <chr>   
## 1 E02000715  4.07 1.72   2.35 Little Ilford Newham  
## 2 E02000715  4.07 0.489  3.58 Little Ilford Newham  
## 3 E02000715  4.07 0      4.07 Little Ilford Newham  
## 4 E02000715  4.07 1.42   2.65 Little Ilford Newham  
## 5 E02000715  4.07 0.943  3.13 Little Ilford Newham

The information can be used to map some localised ‘hot spots’ of Covid-19 deaths (although it also raises a problem, which is discussed below):

require(sf)
download.file("https://github.com/profrichharris/covid19/blob/master/london.zip?raw=true", "map.zip", mode="wb")
unzip("map.zip")
map <- read_sf("london.shp") %>%
  left_join(., rr1, by="MSOA11CD")
submap <- map %>%
  filter(!is.na(condval))
ggplot() +
  geom_sf(data = map) +
  geom_sf(data = submap, aes(fill = condval)) +
  scale_fill_gradient(low="yellow", high="red") +
  labs(fill = "Localised hot spot")

In much the same way, the random effects for MSOA11CD2 can be used to identify localised ‘cold spots’:

map <- map %>%
  dplyr::select(-condval, -condsd)
rr2 <- ranef(mlm, whichel = "MSOA11CD2")
rr2 <- as_tibble(rr2) %>%
  dplyr::select(grp, condval, condsd) %>%
  filter(condval - 1.96 * condsd > 0) %>%
  rename(MSOA11CD = grp)
map <- left_join(map, rr2, by="MSOA11CD")
submap <- map %>%
  filter(!is.na(condval))
ggplot() +
  geom_sf(data = map) +
  geom_sf(data = submap, aes(fill = condval)) +
  scale_fill_gradient(low="dark blue", high="light blue") +
  labs(fill = "Localised cold spot")

Conclusion

This papers has outlined the use of a spatial difference-in-difference method to explore some of the area-level correlates of increased Covid-19 death rates in London using a spatial difference-in-difference approach. It appears to support other analysis showing the higher death rate in more deprived areas, here especially indicated by those places with higher percentages of those who are long-term unemployed or who have never worked but also by those with a higher Index of Multiple Deprivation ranking. The higher death rate amongst Black Caribbeans populations (or, at least, where they are living) can also be inferred. A curiosity is that a higher percentage of those aged in their seventies appears to be associated with a lowered death rate. This emphasises the danger of the ecological fallacy when working with area data.

Some further observations

A bit more about the structure of the model

It is instructive to reflect a little more on the structure and the random part of the model. Consider a null model:

null <- lmer(dT ~ 1 + (1 | MSOA11CD1) + (1 | MSOA11CD2), data = df2)
tidy(null)

## # A tibble: 4 x 5
##   term                     estimate std.error statistic group    
##   <chr>                       <dbl>     <dbl>     <dbl> <chr>    
## 1 (Intercept)                -1.30     0.0315     -41.2 fixed    
## 2 sd_(Intercept).MSOA11CD2    0.517   NA           NA   MSOA11CD2
## 3 sd_(Intercept).MSOA11CD1    0.715   NA           NA   MSOA11CD1
## 4 sd_Observation.Residual     0.237   NA           NA   Residual

What this is modelling is that there is a difference in the death rate between places that share a border but sometimes those differences are greater or less than expected, where the expected value (on the transformed scale) is the intercept. Where they differ from expectation, it could be because some locations have a much higher death rate than their neighbours (which is what the random effect for MSOA11CD1 captures) or it could be because some location have a much lower death rate (which is the random effect for MSOA11CD2). The residuals appear to be broadly Normal:

rr1 <- ranef(null, whichel = "MSOA11CD1")
rr1 <- as_tibble(rr1) %>%
  rename(MSOA11CD = grp) %>%
  dplyr::select(MSOA11CD, condval, condsd)
rr1 %>%
  ggplot(., aes(sample = condval)) +
  stat_qq() +
  stat_qq_line()

rr2 <- ranef(null, whichel = "MSOA11CD2")
rr2 <- as_tibble(rr2) %>%
  rename(MSOA11CD = grp) %>%
  dplyr::select(MSOA11CD, condval, condsd)
rr2 %>%
  ggplot(., aes(sample = condval)) +
  stat_qq() +
  stat_qq_line()

The overall and modelled difference between any location and its neighbours is the sum of the two random effects plus the intercept:

# -- The below is just for matching the data from df2 to the map
# -- recognising that they are formatted differently (they have different 'shapes')

rate1 <- df %>%
  dplyr::select(MSOA11CD1, rate1) %>%
  unique(.)

rate2 <- df %>%
  dplyr::select(MSOA11CD2, rate2) %>%
  unique(.)

msoas <- map %>%
  left_join(., rate1, by = c("MSOA11CD" = "MSOA11CD1")) %>%
  left_join(., rate2, by = c("MSOA11CD" = "MSOA11CD2")) %>%
  mutate(rate = ifelse(is.na(rate1), rate2, rate1)) %>%
  dplyr::select(-rate1, -rate2) %>%
  filter(!is.na(rate))

## -- Now to the modelled estimates:

intercept <- pull(tidy(null)[1,"estimate"])

msoas <- msoas %>%
  dplyr::select(MSOA11CD, rate) %>%
  left_join(., rr1, by = "MSOA11CD") %>%
  left_join(., rr2, by = "MSOA11CD") %>%
  replace_na(., list(condval.x = 0, condval.x = 0)) %>%
  mutate(modelled.dT = condval.x + condval.y + intercept) %>%
  dplyr::select(MSOA11CD, modelled.dT, rate, geometry)

Now consider the spatial lag of the death rate at each location.

require(spdep)
nb <- poly2nb(msoas, snap=1)
listw <- nb2listw(nb, style="W")
lag.rate <- lag.listw(listw, msoas$rate)

The difference between the death rate at each location and its spatial lag is,

df3 <- msoas %>%
  st_drop_geometry(.) %>%
  mutate(from_lag.d = rate - lag.rate)

and keeping only the positive differences,

df3 <- df3 %>%
  filter(from_lag.d > 0)

Finally, applying the same Box-Cox transformation to the difference that was used in the outset of this paper, and removing some missing values,

df3 <- df3 %>%
  mutate(from_lag.dT = (from_lag.d^lambda - 1) / lambda) %>%
  filter(!is.na(modelled.dT))

… what we find is that the modelled difference between the death rate at a location and its neighbours is related to the difference between the death rate at that location and the spatial lag (for its neighbours). This is as it should be as, in both cases, the value at a location is being compared with the values at surrounding locations.

df3 %>%
  ggplot(aes(x = from_lag.dT, y = modelled.dT)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

The correlation is 0.381.

Better distinguishing the localised ‘hot’ and ‘cold’ spots

A potential problem with identifying the ‘hot spots’ and ‘cold spots’ is that they are relative: one place may appear to have a conditionally lower death rate, relative to its neighbour(s), only because the neighbour(s) have a much higher rate. Or, it may appear to be higher only because the neighbour(s) are lower. But it is also possible for one to be distinctly high at the same time that a neighbour is distinctly low (or vice versa).

The following attempts to less ambiguously identify the hot and cold spots although it is a work-in-progressing and need some finessing:

rr1 <- ranef(mlm, whichel = "MSOA11CD1") %>%
  as_tibble(.) %>%
  rename(MSOA11CD = grp) %>%
  dplyr::select(MSOA11CD, condval, condsd)
rr2 <- ranef(mlm, whichel = "MSOA11CD2") %>%
  as_tibble(.) %>%
  rename(MSOA11CD = grp) %>%
  dplyr::select(MSOA11CD, condval, condsd)

rr <- df2 %>%
  dplyr::select(MSOA11CD1, MSOA11CD2) %>%
  left_join(., rr1, by = c("MSOA11CD1" = "MSOA11CD")) %>%
  left_join(., rr2, by = c("MSOA11CD2" = "MSOA11CD")) %>%
  filter(!(is.na(condval.x) & is.na(condval.y))) %>%
  replace_na(list(condval.x = 0, condval.y = 0))
  
hot <- rr %>%
  group_by(MSOA11CD1) %>%
  summarise(condval.x = sum(condval.x), condval.y = sum(condval.y)) %>%
  filter(condval.x > condval.y) %>%
  dplyr::select(MSOA11CD1)

cold <- rr %>%
  group_by(MSOA11CD2) %>%
  summarise(condval.x = sum(condval.x), condval.y = sum(condval.y)) %>%
  filter(condval.y > condval.x) %>%
  dplyr::select(MSOA11CD2)

rr1 <- rr1 %>%
  inner_join(., hot, by = c("MSOA11CD" = "MSOA11CD1")) %>%
  filter(condval - 1.96*condsd > 0)

rr2 <- rr2 %>%
  inner_join(., cold, by = c("MSOA11CD" = "MSOA11CD2")) %>%
  filter(condval - 1.96*condsd > 0)

map <- map %>%
  dplyr::select(-condval, -condsd)
map <- left_join(map, rr1, by="MSOA11CD")
submap <- map %>%
  filter(!is.na(condval))
ggplot() +
  geom_sf(data = map) +
  geom_sf(data = submap, aes(fill = condval)) +
  scale_fill_gradient(low="yellow", high="red") +
  labs(fill = "Localised hot spot")

map <- map %>%
  dplyr::select(-condval, -condsd)
map <- left_join(map, rr2, by="MSOA11CD")
submap <- map %>%
  filter(!is.na(condval))
ggplot() +
  geom_sf(data = map) +
  geom_sf(data = submap, aes(fill = condval)) +
  scale_fill_gradient(low="dark blue", high="light blue") +
  labs(fill = "Localised cold spot")

Addressing the problem of ‘invisibility’ in the map

You can sometimes get a better visualisation of the geography with a simple transformation using a cartogram approach that balances visibility and distortion, enlarging the smallest areas but with not too much distortion across the map (see, also, here). As one of the ‘hot spots’ is also one of the smallest areas, this may be a useful approach to try. Applying one such method (which takes a while to run – be patient!):

require(cartogram)
map <- map %>%
  dplyr::select(-condval, -condsd)
areas <- st_area(map)
areas <- as.numeric(st_area(map))
map$w <- sqrt(areas)
# I am sure the conversion between sp and sf objects and back again isn't
# really necessary but since it works...
map <- as(map, "Spatial")
carto <- cartogram_cont(map, weight = "w", maxSizeError = 1.1, prepare = "none")
carto <- st_as_sf(carto)

carto <- left_join(carto, rr1, by="MSOA11CD")
submap <- carto %>%
  filter(!is.na(condval))
ggplot() +
  geom_sf(data = carto) +
  geom_sf(data = submap, aes(fill = condval)) +
  scale_fill_gradient(low="yellow", high="red") +
  labs(fill = "Localised hot spot")

carto <- carto %>%
  dplyr::select(-condval, -condsd)
carto <- left_join(carto, rr2, by="MSOA11CD")
submap <- carto %>%
  filter(!is.na(condval))
ggplot() +
  geom_sf(data = carto) +
  geom_sf(data = submap, aes(fill = condval)) +
  scale_fill_gradient(low="dark blue", high="light blue") +
  labs(fill = "Localised cold spot")