Abstract
This short and exploratory paper sets out a method of analysis known as spatial difference-in-difference to look at some of the area level correlates of higher Covid-19 death rates in London. It shows that, in particular, having higher percentages of those in their eighties or above raises the death rate compared to neighbouring locations but other factors include the percentage of the population that is Black Caribbean and the percentage that have never worked or are long-term unemployed. There is, however, a lot of unexplained variation (spatial heterogeneity). Local ‘hot spots’ (and ‘cold spots’) of the death rates are mapped - defined as places with a conditionally higher (or lower) death rate when compared to their neighbours.Note: this is not peered reviewed and is exploratory only. It should be treated as such and with caution. Last revised May 18, 2020 to correct an error in identifying the ‘cold spots’ and to add some additional observations about the method of analysis
This short paper explores some of the area level correlates of Covid-19 death rates in London (excluding the City of London). To do this, it combines data on the number of deaths between March 1 and April 17,2020 per Middle Super Output Area (MSOA) released by the Office of National Statistics, divided by the mid-2018 population estimates of the number of adults usually resident in each MSOA to form an estimated death rate per 1000 population (this is not age standardised).
A spatial difference-in-difference approach is then used to estimate whether a difference in the death rate across contiguous MSOAs (those that share a border) is associated with differences in the ages of their population, their ethnic composition (in the 2011 Census), the value and types of houses that have been sold in them since Jan 2017 (from Land Registry data), their deprivation rank (from 2019 English Indices of Deprivation data), population density, and/or with differences in the percentages of their populations belonging to each of the classes in the National Statistics Socio-Economic Classification (also from Census data).
The main motivation for the spatial difference-in-difference approach is geographical curiosity – it is interesting when neighbouring places have different outcomes. In principle the method also allows for unobserved/unmodelled effects that are common across neighbours (by the process of subtraction, they are ‘subtracted out’).
The analysis is intended to be exploratory. It is reproducible and set out below.
The data set can be loaded, as below
require(tidyverse)
df <- read_csv("https://www.dropbox.com/s/iqdze8arxzmxeaa/london.csv?dl=1")
df %>% print(n = 2, width = Inf)
## # A tibble: 2,800 x 46
## MSOA11CD1 rate1 MSOA11CD2 rate2 d LAU210CD LAU210NM LAU110CD
## <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 E02000002 0.369 E02000003 0.364 0.00470 UKI2101024 Chadwell Heath UKI2101
## 2 E02000004 1.77 E02000007 1.29 0.483 UKI2101025 Eastbrook UKI2101
## LAU110NM NUTS306CD NUTS306NM NUTS106CD
## <chr> <chr> <chr> <chr>
## 1 Barking and Dagenham UKI21 Outer London - East and North East UKI
## 2 Barking and Dagenham UKI21 Outer London - East and North East UKI
## NUTS106NM Population Twenties Thirties Fourties Fifties Sixties Seventies
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 London -3331 -1.38 -3.61 0.928 -1.14 -1.30 -0.0789
## 2 London -3537 2.52 -1.37 -1.35 1.79 2.33 2.24
## Eightyplus TCITY15CD TCITY15NM permille decile Detached Semi Terraced Flats
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.87 J01000055 London 19 0 -0.268 11.6 1.03 -12.4
## 2 2.15 J01000055 London 64 1 0.558 7.71 8.53 -16.8
## WBRI INDIAN PAKISTANI BANG CHNE BAFR BCRB MIXD Group1 Group2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 -4.10 -2.97 -2.70 -0.107 1.15 -0.699 0.560 -2.35 -3.46
## 2 16.7 -0.0474 -0.452 -0.599 0.126 -11.6 0.358 -0.483 1.06 3.77
## Group3 Group4 Group5 Group6 Group7 Group8 IMD_rank
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -3.33 -0.0522 -0.259 4.29 3.50 1.58 -4502.
## 2 4.88 0.904 0.350 -3.00 -2.75 -3.23 6483.
## # ... with 2,798 more rows
The structure of the data is as follows.
Variables MSOA11CD1 and MSOA11CD2 identify two, neighbouring MSOAs - that is, two that share a border. In the first case, it is E02000001, with a Covid-19 death rate (rate1) of 0.414 per 1000 population, and E02000193, with 0.262 (rate2). The difference, d, between those rates is 0.151. What the spatial difference-in-difference approach asks is whether the difference in the death rates is associated with any other differences in the populations of these two MSOAs – for example, is it related to there being 7.64 percentage points fewer residents aged in their Twenties in E02000001 than in E02000193, or 15.8 percentage points more in National Statistics Socio-economic Classification Group1 (Higher managerial; administrative and professional occupations) in E02000001.
Later the variables are explained more fully. For now, note that with the exception of rate1, rate2 (and also the non-numeric variables), each measures a (spatial) difference across the boundary of two contiguous (neighbouring) MSOAs: it is the value of the underlying variable for MSOA11CD1 minus the value for MSOA11CD2. Note also that because E02000001, for example, has more than one neighbour with which it shares a border so E02000001 appears more than once under the variable MSOA11CD1 (as might E02000193 under the variable MSOA11CD2). We will address this induced grouping effect in the modelling procedure. Finally, note also that in the same way that E02000001 shares a border with E02000193 so, of course, does E02000193 share a border with E02000001, which means that both could have appeared later in the table of the data with the positions reversed and with all the variables having ‘the same’ values but with their signs reversed (positive values would become negative, negative values would become positive). Since this is simply a duplication of the information, it is removed. That is done by only considering the cases where \(d > 0\).
A likely problem with the data is that d is highly skewed:
require(ggplot2)
df %>%
ggplot(., aes(x = d)) +
geom_histogram()
df %>%
ggplot(., aes(sample = d)) +
stat_qq() +
stat_qq_line()
This is not well corrected with \(log(d)\) (\(\sqrt{d}\) does better) so a Box-Cox transformation will be applied to it:
require(MASS)
bcx <- boxcox(d ~ 1, data = df, plotit = FALSE)
lambda <- bcx$x[which.max(bcx$y)]
df <- df %>%
mutate(dT = (d^lambda - 1) / lambda)
We may suspect, based on what we know about Covid-19, that ares with higher percentages of eighty year olds and above (and possibly containing a care home or a retirement village) are likely to have higher death rates than a neighbouring area with fewer of the elderly. We could test this by fitting a model of the sort,
require(broom)
ols <- lm(dT ~ Eightyplus, data = df)
tidy(ols)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.904 0.0141 -64.4 0.
## 2 Eightyplus 0.0707 0.0103 6.89 6.87e-12
… and, yes, there is some evidence that the conjecture is true. However, what this has not done is address the grouping effect induced in the data by considering each MSOA and all of its neighbours. To do so, we will fit a simple multilevel model,
require(lme4)
mlm <- lmer(dT ~ Eightyplus + (1 | MSOA11CD1) + (1 | MSOA11CD2), data = df)
tidy(mlm) %>%
# 'slice out' the random effects from the results table
# leave the fixed effects
slice(1:2)
## # A tibble: 2 x 5
## term estimate std.error statistic group
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) -1.31 0.0310 -42.4 fixed
## 2 Eightyplus 0.0958 0.0126 7.62 fixed
Because of the Box-Cox transformation, the regression coefficients do not have a ready interpretation other than to confirm that a higher percentage of the resident population who are in their eighties or above is associated with a higher death rate. However, the model fits the data poorly. The pseudo R-squared value is,
cor(df$dT, predict(mlm, re.form=NA))^2
## [1] 0.01668312
Let’s now expand the model to include a greater range of predictors. These are:
Measures of the difference in the areas’ affluence / deprivation:
Measures of the difference in the areas’ age groups:
(Other age groups are omitted)
Measures of the difference in the areas’ housing stock:
(Semi-detached properties are omitted)
Measures of the difference in the areas’ ethnic composition:
(The White British and some other ethnicities are omitted)
Measures of the difference in the areas’ socio-economic composition (‘class’):
(Group3: The difference in the percentage of the population in NS-Sec 3: Intermediate occupations is omitted)
A proxy for the difference in the areas’ population density
Most of these variables are percentage point differences so can range, in principle, from \(-100\) to \(100\). The exceptions are the decile, IMD_rank and Population variables. Rather than rescale all the variables, for rough comparability with the rest, the three exceptions are rescaled to be of the same broad order of magnitude:
rescale <- function(x) {
100 * (x - min(x)) / (max(x) - min(x))
}
df2 <- df %>%
mutate_at(c("decile","IMD_rank","Population"), rescale)
Now fitting the model, isolating only the statistically significant variables (\(|t| > 1.96\)) and ordering these in decreasing effect size, gives:
mlm <- lmer(dT ~ 1 + (1 | MSOA11CD1) + (1 | MSOA11CD2) +
decile + IMD_rank +
Twenties + Thirties + Sixties + Seventies + Eightyplus +
Terraced + Flats + Detached +
INDIAN + PAKISTANI + BANG + CHNE + BAFR + BCRB + MIXD +
Group1 + Group2 + Group4 + Group5 + Group6 + Group7 + Group8 +
Population,
data = df2)
tidy(mlm) %>%
# 'slice out' the overall intercept and random effects
slice(2:26) %>%
dplyr::select(-group) %>%
filter(abs(statistic) > 1.96) %>%
arrange(desc(estimate)) %>%
print(n=Inf)
## # A tibble: 14 x 4
## term estimate std.error statistic
## <chr> <dbl> <dbl> <dbl>
## 1 Eightyplus 0.203 0.0240 8.48
## 2 BCRB 0.0538 0.00842 6.39
## 3 Group8 0.0399 0.0145 2.76
## 4 INDIAN 0.0154 0.00371 4.16
## 5 Terraced 0.00598 0.00180 3.33
## 6 Flats 0.00595 0.00187 3.18
## 7 BAFR -0.0117 0.00591 -1.99
## 8 IMD_rank -0.0126 0.00343 -3.67
## 9 Group1 -0.0207 0.00944 -2.19
## 10 Twenties -0.0226 0.00644 -3.51
## 11 PAKISTANI -0.0294 0.00769 -3.83
## 12 Group6 -0.0516 0.0209 -2.47
## 13 Seventies -0.0664 0.0286 -2.32
## 14 MIXD -0.108 0.0218 -4.93
The strongest correlates with an increased death rate are greater percentages of those in their eighties and above, Black Caribbeans and those who have never worked or are long-term unemployed. The strongest correlates on a decreased rate are greater percentages of a Mixed ethnicity population, those aged in their seventies, and of those in semi-routine occupations.
The finding about Black Caribbeans supports those published by the Institute for Fiscal Studies and summarised here. The increased death rate within more deprived areas has also been reported by the Office of National Statistics. In the current model it can be seen that in addition to the effect of those who have never worked or are long term unemployed, places with a lower IMD rank (i.e. less deprivation) have a lowered death rate.
That an increased percentage of seventy year olds is associated with a decrease in the death rate is surprising but consistent across nearly all of London: if we add in dummy variables to represent each of London’s Boroughs (LAU110NM) the effect remains negative in nearly every one (see below, dropping the statistical insignificant variables). Care is therefore needed to avoid the ecological fallacy – the (incorrect) assumption that a statistical relationship measured at an area level necessarily applies to individuals living within those areas. Here, it does not follow that because greater percentages of people in their seventies are associated with lower death rates so being in your seventies reduces the risk: this clearly contrary to the medical and other statistical evidence. The ecological fallacy arises because there is something else about the areas in which those people are living that decreases the death rate (it may be due to the characteristics of other people who live there, for example).
mlm <- lmer(dT ~ 1 + (1 | MSOA11CD1) + (1 | MSOA11CD2) +
IMD_rank +
Twenties + Eightyplus +
Terraced + Flats +
INDIAN + PAKISTANI + BAFR + BCRB + MIXD +
Group1 + Group6 + Group8 +
Seventies +
Seventies:LAU110NM,
data = df2)
results <- tidy(mlm) %>%
slice(15:46) # These are the results related to the Seventies variable
baseline <- as(results[1,"estimate"], "numeric")
results %>%
slice(-1) %>%
mutate(plus_base = baseline + estimate) %>%
mutate(is.negative = ifelse(plus_base < 0, T, F)) %>%
dplyr::select(term, estimate, plus_base, is.negative) %>%
print(n=Inf, width=Inf)
## # A tibble: 31 x 4
## term estimate plus_base is.negative
## <chr> <dbl> <dbl> <lgl>
## 1 Seventies:LAU110NMBarnet -0.0611 -0.0312 TRUE
## 2 Seventies:LAU110NMBexley -0.105 -0.0754 TRUE
## 3 Seventies:LAU110NMBrent -0.0914 -0.0615 TRUE
## 4 Seventies:LAU110NMBromley -0.142 -0.112 TRUE
## 5 Seventies:LAU110NMCamden -0.0662 -0.0363 TRUE
## 6 Seventies:LAU110NMCroydon -0.0967 -0.0668 TRUE
## 7 Seventies:LAU110NMEaling -0.00431 0.0256 FALSE
## 8 Seventies:LAU110NMEnfield -0.0812 -0.0514 TRUE
## 9 Seventies:LAU110NMGreenwich -0.0963 -0.0664 TRUE
## 10 Seventies:LAU110NMHackney -0.0776 -0.0477 TRUE
## 11 Seventies:LAU110NMHammersmith and Fulham -0.0723 -0.0424 TRUE
## 12 Seventies:LAU110NMHaringey -0.0782 -0.0483 TRUE
## 13 Seventies:LAU110NMHarrow -0.123 -0.0931 TRUE
## 14 Seventies:LAU110NMHavering -0.0766 -0.0468 TRUE
## 15 Seventies:LAU110NMHillingdon -0.0975 -0.0676 TRUE
## 16 Seventies:LAU110NMHounslow -0.0497 -0.0198 TRUE
## 17 Seventies:LAU110NMIslington -0.0762 -0.0463 TRUE
## 18 Seventies:LAU110NMKensington and Chelsea -0.106 -0.0765 TRUE
## 19 Seventies:LAU110NMKingston upon Thames -0.0715 -0.0417 TRUE
## 20 Seventies:LAU110NMLambeth -0.0756 -0.0458 TRUE
## 21 Seventies:LAU110NMLewisham -0.124 -0.0938 TRUE
## 22 Seventies:LAU110NMMerton -0.0729 -0.0430 TRUE
## 23 Seventies:LAU110NMNewham -0.0650 -0.0351 TRUE
## 24 Seventies:LAU110NMRedbridge -0.0244 0.00547 FALSE
## 25 Seventies:LAU110NMRichmond upon Thames -0.0281 0.00175 FALSE
## 26 Seventies:LAU110NMSouthwark -0.0899 -0.0600 TRUE
## 27 Seventies:LAU110NMSutton -0.0543 -0.0244 TRUE
## 28 Seventies:LAU110NMTower Hamlets -0.105 -0.0748 TRUE
## 29 Seventies:LAU110NMWaltham Forest 0.00612 0.0360 FALSE
## 30 Seventies:LAU110NMWandsworth -0.0522 -0.0224 TRUE
## 31 Seventies:LAU110NMWestminster -0.161 -0.131 TRUE
Refitting the above model but without the Borough level interaction between LAU110NM and Seventies variable shows that the predictor variables still explain only a low proportion of the differences in the death rates.
mlm <- lmer(dT ~ 1 + (1 | MSOA11CD1) + (1 | MSOA11CD2) +
IMD_rank +
Twenties + Seventies + Eightyplus +
Terraced + Flats +
INDIAN + PAKISTANI + BAFR + BCRB + MIXD +
Group1 + Group6 + Group8 +
Seventies,
data = df2)
# The pseudo R-squared value
cor(df$dT, predict(mlm, re.form=NA))^2
## [1] 0.04762505
The unexplained differences in the Covid-19 death rates of neighbouring MSOAs is of geographical interest because it invites the question, why are there places that are located next to each other but have different level of death rate?. Those places that are much higher than their neighbours (conditional on the predictor variables) can be identified from the model from the way that the group effect for MSOA11CD1 is modelled (as a random effect within the model):
rr1 <- ranef(mlm, whichel = "MSOA11CD1")
rr1 <- as_tibble(rr1) %>%
dplyr::select(grp, condval, condsd) %>%
# Keep only those that are significantly above zero
filter(condval - 1.96 * condsd > 0) %>%
rename(MSOA11CD = grp)
rr1 %>%
arrange(desc(condval)) %>%
print(n=3)
## # A tibble: 255 x 3
## MSOA11CD condval condsd
## <fct> <dbl> <dbl>
## 1 E02000715 2.31 0.145
## 2 E02000098 2.08 0.136
## 3 E02000573 1.99 0.122
## # ... with 252 more rows
We can see, for example, that there is an MSOA, E02000715, which has a death rate that is much higher than its neighbours. That MSOA is in Newham.
df %>%
filter(MSOA11CD1 == "E02000715") %>%
dplyr::select(MSOA11CD1, rate1, rate2, d, LAU210NM, LAU110NM) %>%
print(n=Inf)
## # A tibble: 5 x 6
## MSOA11CD1 rate1 rate2 d LAU210NM LAU110NM
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 E02000715 4.07 1.72 2.35 Little Ilford Newham
## 2 E02000715 4.07 0.489 3.58 Little Ilford Newham
## 3 E02000715 4.07 0 4.07 Little Ilford Newham
## 4 E02000715 4.07 1.42 2.65 Little Ilford Newham
## 5 E02000715 4.07 0.943 3.13 Little Ilford Newham
The information can be used to map some localised ‘hot spots’ of Covid-19 deaths (although it also raises a problem, which is discussed below):
require(sf)
download.file("https://github.com/profrichharris/covid19/blob/master/london.zip?raw=true", "map.zip", mode="wb")
unzip("map.zip")
map <- read_sf("london.shp") %>%
left_join(., rr1, by="MSOA11CD")
submap <- map %>%
filter(!is.na(condval))
ggplot() +
geom_sf(data = map) +
geom_sf(data = submap, aes(fill = condval)) +
scale_fill_gradient(low="yellow", high="red") +
labs(fill = "Localised hot spot")
In much the same way, the random effects for MSOA11CD2 can be used to identify localised ‘cold spots’:
map <- map %>%
dplyr::select(-condval, -condsd)
rr2 <- ranef(mlm, whichel = "MSOA11CD2")
rr2 <- as_tibble(rr2) %>%
dplyr::select(grp, condval, condsd) %>%
filter(condval - 1.96 * condsd > 0) %>%
rename(MSOA11CD = grp)
map <- left_join(map, rr2, by="MSOA11CD")
submap <- map %>%
filter(!is.na(condval))
ggplot() +
geom_sf(data = map) +
geom_sf(data = submap, aes(fill = condval)) +
scale_fill_gradient(low="dark blue", high="light blue") +
labs(fill = "Localised cold spot")
This papers has outlined the use of a spatial difference-in-difference method to explore some of the area-level correlates of increased Covid-19 death rates in London using a spatial difference-in-difference approach. It appears to support other analysis showing the higher death rate in more deprived areas, here especially indicated by those places with higher percentages of those who are long-term unemployed or who have never worked but also by those with a higher Index of Multiple Deprivation ranking. The higher death rate amongst Black Caribbeans populations (or, at least, where they are living) can also be inferred. A curiosity is that a higher percentage of those aged in their seventies appears to be associated with a lowered death rate. This emphasises the danger of the ecological fallacy when working with area data.
It is instructive to reflect a little more on the structure and the random part of the model. Consider a null model:
null <- lmer(dT ~ 1 + (1 | MSOA11CD1) + (1 | MSOA11CD2), data = df2)
tidy(null)
## # A tibble: 4 x 5
## term estimate std.error statistic group
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) -1.30 0.0315 -41.2 fixed
## 2 sd_(Intercept).MSOA11CD2 0.517 NA NA MSOA11CD2
## 3 sd_(Intercept).MSOA11CD1 0.715 NA NA MSOA11CD1
## 4 sd_Observation.Residual 0.237 NA NA Residual
What this is modelling is that there is a difference in the death rate between places that share a border but sometimes those differences are greater or less than expected, where the expected value (on the transformed scale) is the intercept. Where they differ from expectation, it could be because some locations have a much higher death rate than their neighbours (which is what the random effect for MSOA11CD1 captures) or it could be because some location have a much lower death rate (which is the random effect for MSOA11CD2). The residuals appear to be broadly Normal:
rr1 <- ranef(null, whichel = "MSOA11CD1")
rr1 <- as_tibble(rr1) %>%
rename(MSOA11CD = grp) %>%
dplyr::select(MSOA11CD, condval, condsd)
rr1 %>%
ggplot(., aes(sample = condval)) +
stat_qq() +
stat_qq_line()
rr2 <- ranef(null, whichel = "MSOA11CD2")
rr2 <- as_tibble(rr2) %>%
rename(MSOA11CD = grp) %>%
dplyr::select(MSOA11CD, condval, condsd)
rr2 %>%
ggplot(., aes(sample = condval)) +
stat_qq() +
stat_qq_line()
The overall and modelled difference between any location and its neighbours is the sum of the two random effects plus the intercept:
# -- The below is just for matching the data from df2 to the map
# -- recognising that they are formatted differently (they have different 'shapes')
rate1 <- df %>%
dplyr::select(MSOA11CD1, rate1) %>%
unique(.)
rate2 <- df %>%
dplyr::select(MSOA11CD2, rate2) %>%
unique(.)
msoas <- map %>%
left_join(., rate1, by = c("MSOA11CD" = "MSOA11CD1")) %>%
left_join(., rate2, by = c("MSOA11CD" = "MSOA11CD2")) %>%
mutate(rate = ifelse(is.na(rate1), rate2, rate1)) %>%
dplyr::select(-rate1, -rate2) %>%
filter(!is.na(rate))
## -- Now to the modelled estimates:
intercept <- pull(tidy(null)[1,"estimate"])
msoas <- msoas %>%
dplyr::select(MSOA11CD, rate) %>%
left_join(., rr1, by = "MSOA11CD") %>%
left_join(., rr2, by = "MSOA11CD") %>%
replace_na(., list(condval.x = 0, condval.x = 0)) %>%
mutate(modelled.dT = condval.x + condval.y + intercept) %>%
dplyr::select(MSOA11CD, modelled.dT, rate, geometry)
Now consider the spatial lag of the death rate at each location.
require(spdep)
nb <- poly2nb(msoas, snap=1)
listw <- nb2listw(nb, style="W")
lag.rate <- lag.listw(listw, msoas$rate)
The difference between the death rate at each location and its spatial lag is,
df3 <- msoas %>%
st_drop_geometry(.) %>%
mutate(from_lag.d = rate - lag.rate)
and keeping only the positive differences,
df3 <- df3 %>%
filter(from_lag.d > 0)
Finally, applying the same Box-Cox transformation to the difference that was used in the outset of this paper, and removing some missing values,
df3 <- df3 %>%
mutate(from_lag.dT = (from_lag.d^lambda - 1) / lambda) %>%
filter(!is.na(modelled.dT))
… what we find is that the modelled difference between the death rate at a location and its neighbours is related to the difference between the death rate at that location and the spatial lag (for its neighbours). This is as it should be as, in both cases, the value at a location is being compared with the values at surrounding locations.
df3 %>%
ggplot(aes(x = from_lag.dT, y = modelled.dT)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
The correlation is 0.381.
A potential problem with identifying the ‘hot spots’ and ‘cold spots’ is that they are relative: one place may appear to have a conditionally lower death rate, relative to its neighbour(s), only because the neighbour(s) have a much higher rate. Or, it may appear to be higher only because the neighbour(s) are lower. But it is also possible for one to be distinctly high at the same time that a neighbour is distinctly low (or vice versa).
The following attempts to less ambiguously identify the hot and cold spots although it is a work-in-progressing and need some finessing:
rr1 <- ranef(mlm, whichel = "MSOA11CD1") %>%
as_tibble(.) %>%
rename(MSOA11CD = grp) %>%
dplyr::select(MSOA11CD, condval, condsd)
rr2 <- ranef(mlm, whichel = "MSOA11CD2") %>%
as_tibble(.) %>%
rename(MSOA11CD = grp) %>%
dplyr::select(MSOA11CD, condval, condsd)
rr <- df2 %>%
dplyr::select(MSOA11CD1, MSOA11CD2) %>%
left_join(., rr1, by = c("MSOA11CD1" = "MSOA11CD")) %>%
left_join(., rr2, by = c("MSOA11CD2" = "MSOA11CD")) %>%
filter(!(is.na(condval.x) & is.na(condval.y))) %>%
replace_na(list(condval.x = 0, condval.y = 0))
hot <- rr %>%
group_by(MSOA11CD1) %>%
summarise(condval.x = sum(condval.x), condval.y = sum(condval.y)) %>%
filter(condval.x > condval.y) %>%
dplyr::select(MSOA11CD1)
cold <- rr %>%
group_by(MSOA11CD2) %>%
summarise(condval.x = sum(condval.x), condval.y = sum(condval.y)) %>%
filter(condval.y > condval.x) %>%
dplyr::select(MSOA11CD2)
rr1 <- rr1 %>%
inner_join(., hot, by = c("MSOA11CD" = "MSOA11CD1")) %>%
filter(condval - 1.96*condsd > 0)
rr2 <- rr2 %>%
inner_join(., cold, by = c("MSOA11CD" = "MSOA11CD2")) %>%
filter(condval - 1.96*condsd > 0)
map <- map %>%
dplyr::select(-condval, -condsd)
map <- left_join(map, rr1, by="MSOA11CD")
submap <- map %>%
filter(!is.na(condval))
ggplot() +
geom_sf(data = map) +
geom_sf(data = submap, aes(fill = condval)) +
scale_fill_gradient(low="yellow", high="red") +
labs(fill = "Localised hot spot")
map <- map %>%
dplyr::select(-condval, -condsd)
map <- left_join(map, rr2, by="MSOA11CD")
submap <- map %>%
filter(!is.na(condval))
ggplot() +
geom_sf(data = map) +
geom_sf(data = submap, aes(fill = condval)) +
scale_fill_gradient(low="dark blue", high="light blue") +
labs(fill = "Localised cold spot")
You can sometimes get a better visualisation of the geography with a simple transformation using a cartogram approach that balances visibility and distortion, enlarging the smallest areas but with not too much distortion across the map (see, also, here). As one of the ‘hot spots’ is also one of the smallest areas, this may be a useful approach to try. Applying one such method (which takes a while to run – be patient!):
require(cartogram)
map <- map %>%
dplyr::select(-condval, -condsd)
areas <- st_area(map)
areas <- as.numeric(st_area(map))
map$w <- sqrt(areas)
# I am sure the conversion between sp and sf objects and back again isn't
# really necessary but since it works...
map <- as(map, "Spatial")
carto <- cartogram_cont(map, weight = "w", maxSizeError = 1.1, prepare = "none")
carto <- st_as_sf(carto)
carto <- left_join(carto, rr1, by="MSOA11CD")
submap <- carto %>%
filter(!is.na(condval))
ggplot() +
geom_sf(data = carto) +
geom_sf(data = submap, aes(fill = condval)) +
scale_fill_gradient(low="yellow", high="red") +
labs(fill = "Localised hot spot")
carto <- carto %>%
dplyr::select(-condval, -condsd)
carto <- left_join(carto, rr2, by="MSOA11CD")
submap <- carto %>%
filter(!is.na(condval))
ggplot() +
geom_sf(data = carto) +
geom_sf(data = submap, aes(fill = condval)) +
scale_fill_gradient(low="dark blue", high="light blue") +
labs(fill = "Localised cold spot")