Your third session focused on Spatial Weights and Neighbours, where you were introduced to Moran’s I - a useful method for testing for relationships between geographic proximity and non-geographic mutual variables. Moran’s I is a measure for spatial autocorrelation. Among other things, it is useful in determining whether a non-geographic attribute (i.e. house prices) cluster together when examined in a geographic context. Another way of looking at spatial autocorrelation is testing whether Tobler’s first law of Geography is apparent in the data that you are examining both statistically and visually: “everything is related to everything else, but near things are more related than distant things”.

When examining spatial data, there will be situations where you’d want to go one step further and test whether a correlative relationship between two non-geographic attributes strengthens or weakens across a geographic expanse. Enters Geographically Weighted Regression (GWR)!

This brief asynchronous tutorial looks at one approach to perfoming GWR in R using one of the first R packages to do so, spgwr. This package is related to the sp range and was created by Roger Bivand years ago; however, it is still used in a variety of situations today. As of writing this document, it was last updated in August 2020. Learning how it works provides further insights into the inner workings of R coding and equips you with a further tool that can be blended into the spatial analyses performed in your research.

The contents herein are based on materials prepared by Adam Dennett (UCL, Centre of Advanced Spatial Analysis), who borrowed ideas from Bristol’s very own Richard Harris. An original tutorial can be found here: https://rpubs.com/adam_dennett/44975.

Preparation

First we need to load in our packages, set our working directories, and read in our data.

#Note: In order to install the packages, please remove the hatch in front of line 29. Once you've installed your packages, re-introduce the hatch to avoid conflicts.

#install.packages(c("spgwr","ggplot2","maptools","sf","tidyverse"))

#The lapply command is used here to read in multiple libraries.
lapply(c("spgwr","ggplot2","maptools","sf","tidyverse"), require, character.only = TRUE)

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE

#Note: In order to set your own working directory, remove the hatch symbol in front of line 35. You haven't to do this is you're in the Cloud.
#setwd("Your file path")

During this tutorial, we will be exploring Bristol’s Index of Multiple Deprivation Scores for 2019. This, and similar shapefile datasets, can be downloaded from: https://opendata.bristol.gov.uk/pages/homepage/

Let’s read it into Rstudio as an object called ‘IDD’.

IDD <- read_sf("deprivation-in-bristol-2019.shp")

#review:
plot(IDD)

## Warning: plotting the first 9 out of 58 attributes; use max.plot = 58 to plot
## all

This method works best with points data, so let’s create a copy of this shapefile that has been converted into ‘points’ data and create columns that present our \(x\) and \(y\) coordinates; these will be called ‘coords_x’ and ‘coords_y’.

IDD_points <- st_centroid(IDD)

## Warning in st_centroid.sf(IDD): st_centroid assumes attributes are constant over
## geometries of x

## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data

plot(IDD_points)

## Warning: plotting the first 9 out of 58 attributes; use max.plot = 58 to plot
## all

IDD_points$coords_x <- unlist(map(IDD_points$geometry,1))
IDD_points$coords_y <- unlist(map(IDD_points$geometry,2))

The regression

We can perform simple linear regressions in R using the lm argument. Let’s have a go at generating a regression that compares deprivation scores for recorded secondary-school level education with health, income, and living environment deprivation scores.

model1 <- lm(IDD_points$education_s ~ IDD_points$health_depr + IDD_points$living_envi + IDD_points$income_depr)

summary(model1)

## 
## Call:
## lm(formula = IDD_points$education_s ~ IDD_points$health_depr + 
##     IDD_points$living_envi + IDD_points$income_depr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.267  -5.252   0.127   6.088  30.298 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            17.15967    2.21126   7.760 1.96e-13 ***
## IDD_points$health_depr  9.61035    1.62473   5.915 1.05e-08 ***
## IDD_points$living_envi -0.51398    0.06457  -7.961 5.39e-14 ***
## IDD_points$income_depr  1.00910    0.08875  11.370  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.58 on 259 degrees of freedom
## Multiple R-squared:  0.8063, Adjusted R-squared:  0.804 
## F-statistic: 359.3 on 3 and 259 DF,  p-value: < 2.2e-16

From these results, it is apparent that we have a significant model (p<0.01) with an R-squared of around 80% and positive relationships between all of our selected independent variables apart from ‘Living environment’.

Checking the catch-all plot associated with this model (standardised residuals plotted again with the fitted values), we can see that as there is a weak positive correlation in the cloud of points, our model is correctly specified.

plot(model1,which=3)

At this stage, we can actually try to plot the residuals to determine whether a pattern is apparent:

resids <- residuals(model1)
colours <- c("dark blue", "blue", "red", "dark red")
map.resids <- SpatialPointsDataFrame(data = data.frame(resids), 
                                     coords = cbind(IDD_points$coords_x,IDD_points$coords_y))

spplot(map.resids,cuts=quantile(resids), col.regions=colours, cex = 1)

We can already begin to see the geographic clustering of reds and blues, which suggests that the strengths of the linear regressions seem to correspond with the geography of Bristol. For example, towards the city centre, it can be said that secondary education deprivation scores are less likely to be affected by health, living environment, and income deprivation.

Because of this patterning, we are justified to perform geographically weighted regression (GWR). Here, this functionality is introduced by the sp package. First we will need to calculate our Kernel Bandwidths - these will help R to accurately model coefficients of variation (\(CV\)) variables as they change between geographies, where these are essentially the standard deviations divided by mean averages and multiplied by 100:\[\frac{\sigma}{\mu}\times 100\]

GWRbandwidth <- gwr.sel(education_s ~ health_depr + living_envi + income_depr, data = IDD_points, coords = cbind(IDD_points$coords_x,IDD_points$coords_y),
                        adapt = TRUE)

## Adaptive q: 0.381966 CV score: 26574.29 
## Adaptive q: 0.618034 CV score: 28088.19 
## Adaptive q: 0.236068 CV score: 24855.54 
## Adaptive q: 0.145898 CV score: 23158.88 
## Adaptive q: 0.09016994 CV score: 21405.23 
## Adaptive q: 0.05572809 CV score: 19463.71 
## Adaptive q: 0.03444185 CV score: 18063.57 
## Adaptive q: 0.02128624 CV score: 17279.63 
## Adaptive q: 0.01315562 CV score: 17564.49 
## Adaptive q: 0.02116161 CV score: 17278.68 
## Adaptive q: 0.02051137 CV score: 17277.39 
## Adaptive q: 0.01770172 CV score: 17343.3 
## Adaptive q: 0.02069969 CV score: 17277.09 
## Adaptive q: 0.02074038 CV score: 17277.1 
## Adaptive q: 0.020659 CV score: 17277.11 
## Adaptive q: 0.02069969 CV score: 17277.09

Next, we’ll use our \(CV\) values to construct a model for our GWR.

gwr.model = gwr(education_s ~ health_depr + living_envi + income_depr,
                data = IDD_points, coords = cbind(IDD_points$coords_x,IDD_points$coords_y),
                adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)

gwr.model

## Call:
## gwr(formula = education_s ~ health_depr + living_envi + income_depr, 
##     data = IDD_points, coords = cbind(IDD_points$coords_x, IDD_points$coords_y), 
##     adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.02069969 (about 5 of 263 data points)
## Summary of GWR coefficient estimates at data points:
##                     Min.     1st Qu.      Median     3rd Qu.        Max.
## X.Intercept. -17.8442805   6.3315512  14.7639117  20.2172783  47.0849942
## health_depr  -10.8223289   1.8006615   6.0687378  11.0840532  23.6771532
## living_envi   -1.5472047  -0.5266383  -0.2390002  -0.0806260   1.0101059
## income_depr    0.0075443   0.6855147   0.9353442   1.2499103   1.9762628
##               Global
## X.Intercept. 17.1597
## health_depr   9.6103
## living_envi  -0.5140
## income_depr   1.0091
## Number of data points: 263 
## Effective number of parameters (residual: 2traceS - traceS'S): 98.65256 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 164.3474 
## Sigma (residual: 2traceS - traceS'S): 7.299771 
## Effective number of parameters (model: traceS): 72.4685 
## Effective degrees of freedom (model: traceS): 190.5315 
## Sigma (model: traceS): 6.77965 
## Sigma (ML): 5.770492 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 1873.288 
## AIC (GWR p. 96, eq. 4.22): 1740.78 
## Residual sum of squares: 8757.525 
## Quasi-global R2: 0.9415163

This output reveals how the coefficients vary across Bristol’s 263 lower super output areas. Note that the global coefficients are exactly the same as the coefficients in the earlier lm model. If we look at correlations between secondary education and health we can see a min value of ~-10 (1 unit change in health deprivation resulting in a drop in secondary education deprivation scores) and ~+23 (1 unit change in health deprivation resulting in an increase of secondary education scores). Based on the 2nd, 3rd, and 4th quartile values, we can also say that for 75% of the lower super output areas, as secondary education deprivation scores increase by 1 point, health deprivation scores will incrase bween 1.8 and 11.08 point.

Coefficient ranges can also be seen for the other variables and they suggest some interesting spatial patterning. To explore this we can plot the GWR coefficients for different variables. Firstly we can attach the coefficients to our original dataframe - this can be achieved simply as the coefficients for each ward appear in the same order in our spatial points dataframe as they do in the original dataframe.

results <- as.data.frame(gwr.model$SDF)
head(results)

##       sum.w X.Intercept. health_depr living_envi income_depr X.Intercept._se
## 1  8.167282     12.24709  23.6771532  -0.2150906   0.7539932        7.561131
## 2  7.995472     25.68684   4.0054323  -0.1755817   0.9353442        9.420238
## 3  9.658137     27.66191   4.7606382  -0.4033728   0.9422035        7.896153
## 4 14.175512     25.13220  -0.6901757  -0.6434314   1.2998871        5.435238
## 5 14.910976     18.40079  11.0374214  -0.3083813   0.8739612        3.559039
## 6 14.216417     13.23541   3.7723979  -0.3515657   0.5252005        5.269834
##   health_depr_se living_envi_se income_depr_se     gwr.e      pred  pred.se
## 1       4.806427      0.2712373      0.2954933 -3.063906 48.898906 2.761749
## 2       6.753864      0.3846495      0.3124850  7.024636 57.859364 2.508070
## 3       5.658343      0.2858744      0.2696263 -2.779528 65.655528 3.084475
## 4       3.210532      0.1652988      0.1716769 16.738206 37.355794 1.802527
## 5       3.198920      0.1070606      0.1889791 -1.291435 53.924435 3.224016
## 6       2.345281      0.1421975      0.1304188 -2.485567  4.165567 2.279695
##     localR2 X.Intercept._se_EDF health_depr_se_EDF living_envi_se_EDF
## 1 0.9367276            8.141204           5.175166          0.2920460
## 2 0.8037031           10.142938           7.272006          0.4141589
## 3 0.8556576            8.501929           6.092439          0.3078060
## 4 0.8811079            5.852218           3.456838          0.1779802
## 5 0.8367627            3.832081           3.444334          0.1152740
## 6 0.8328346            5.674124           2.525205          0.1531066
##   income_depr_se_EDF pred.se.1   coord.x  coord.y
## 1          0.3181629  2.973624 -2.682757 51.49035
## 2          0.3364581  2.700484 -2.625674 51.41092
## 3          0.2903114  3.321110 -2.630318 51.40562
## 4          0.1848475  1.940813 -2.617060 51.42550
## 5          0.2034771  3.471355 -2.531669 51.43507
## 6          0.1404242  2.454589 -2.600220 51.45769

Mapping

Now that we have this data, we can extract specific coeffecient values for actual mapping:

IDD_points$coefLivingEnvironment <- results$living_envi
IDD_points$coefIncome <- results$income_depr
IDD_points$coefHealth <- results$health_depr

#At this point, let's ensure that the our data is dataframe in order for it to used by ggplot2

IDD_points_ggplot <- as.data.frame(IDD_points)

We can now plot the results using ggplot2:

ggplot()+
  geom_sf(data = IDD, fill = NA, lwd = 0.25)+coord_sf(crs = st_crs(4326))+
  geom_point(aes(x = coords_x, y = coords_y, color = coefLivingEnvironment), data = IDD_points_ggplot)+
               scale_color_gradient2(low = "blue", mid = "white", high = "red")+
  labs(x = "Longitude", y = "Latitude")

ggplot()+
  geom_sf(data = IDD, fill = NA, lwd = 0.25)+coord_sf(crs = st_crs(4326))+
  geom_point(aes(x = coords_x, y = coords_y, color = coefIncome), data = IDD_points_ggplot)+
               scale_color_gradient2(low = "blue", mid = "white", high = "red")+
  labs(x = "Longitude", y = "Latitude")

ggplot()+
  geom_sf(data = IDD, fill = NA, lwd = 0.25)+coord_sf(crs = st_crs(4326))+
  geom_point(aes(x = coords_x, y = coords_y, color = coefHealth), data = IDD_points_ggplot)+
               scale_color_gradient2(low = "blue", mid = "white", high = "red")+
  labs(x = "Longitude", y = "Latitude")

These plots visually display the distributions of geographically weighted regressions for comparisons between secondary education and health, living environment, and income.

Although this analysis has been conducted purely for demonstration purposes, some interesting observations can be made such as:

A mostly negative relationship exists between poor secondary education and living environments - this best exemplified towards the city centre. However, a positive correlation is apparent towards the northern end of the Bristol. This distribution is interesting as it raises questions around whether the quality of housing has a direct influence on the the performance of secondary aged students.
Although a mostly positive correlation is noted with regards to secondary education and income, the intensification of these trends are apparent towards the northern end of the region and towards the south. The northern trend is consistent with the previous point - lower incomes will result in access to lower quality places to live. However, this is contested by the values recorded just south of the city centre.
Secondary education deprivation and poor health present a positive relationship in most parts of Bristol. However, a negative relationship is apparent just towards the south of the city centre. It potentially addresses the anomaly observed in the previous point.

Richard Timmerman, November 2020 …

SP-based Geographically Weight Regression in R

Richard Timmerman

26/11/2020

Preparation

The regression

Mapping