Suggested citation:
Mendez C. (2020). Geographically weighted regression models: A tutorial using the spgwr package in R. R Studio/RPubs. Available at https://rpubs.com/quarcs-lab/tutorial-gwr1
This work is licensed under the Creative Commons Attribution-Share Alike 4.0 International License.
Acknowledgment:
Material adapted from multiple sources, in particular the course materials of Guy Lansley & James Cheshire (2016).
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse) # Modern data science workflow
library(sf)
library(sp)
library(rgdal)
library(rgeos)
library(tmap)
library(tmaptools)
library(spgwr)
library(grid)
library(gridExtra)
# Change the presentation of decimal numbers to 4 and avoid scientific notation
options(prompt="R> ", digits=4, scipen=999)
GWR is a nonparametric multivariate model which can indicate where non-stationarity may take place across space (the study area).
It can be used to identify how (locally weighted) regression coefficients may vary across space (the study area).
All used datasets can be obtained from Guy Lansley & James Cheshire (2016).
## Observations: 749
## Variables: 5
## $ OA <fct> E00004120, E00004121, E00004122, E00004123, E00004124, …
## $ White_British <dbl> 42.36, 47.20, 40.68, 49.66, 51.14, 41.42, 48.54, 48.68,…
## $ Low_Occupancy <dbl> 6.2937, 5.9322, 2.9126, 0.9259, 2.0000, 3.9326, 5.5556,…
## $ Unemployed <dbl> 1.8939, 2.6882, 1.2121, 2.8037, 3.8168, 3.8462, 4.5455,…
## $ Qualification <dbl> 73.63, 69.90, 67.58, 60.78, 65.99, 74.21, 62.45, 60.35,…
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/Carlos/GitHub/00-QuaRCS/tutorial-gwr1", layer: "Camden_oa11"
## with 749 features
## It has 1 fields
Using the sf
package
## Observations: 749
## Variables: 2
## $ OA11CD <chr> "E00004527", "E00004525", "E00004522", "E00004287", "E000042…
## $ geometry <POLYGON [m]> POLYGON ((530648 181230, 53..., POLYGON ((530511 181…
Transform it into a sf
object
Let us run a linear model to understand the global relationship between our variables in our study area.
##
## Call:
## lm(formula = Qualification ~ Unemployed + White_British, data = Census.Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.31 -8.01 1.01 8.96 38.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.8670 2.3357 20.5 <0.0000000000000002 ***
## Unemployed -3.2946 0.1903 -17.3 <0.0000000000000002 ***
## White_British 0.4109 0.0403 10.2 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.7 on 746 degrees of freedom
## Multiple R-squared: 0.464, Adjusted R-squared: 0.463
## F-statistic: 323 on 2 and 746 DF, p-value: <0.0000000000000002
This model has an adjusted R-squared value of 0.463. So we can assume 46% of the variance can be explained by the model. We can also observe the influences of each of the variables. However, the overall fit of the model and each of the coefficients may vary across space if we consider different parts of our study area. It is, therefore, worth considering the standardized residuals from the model to help us understand and improve our future models.
A residual is the difference between the predicted and observed values for an observation in the model. Models with lower r-squared values would have greater residuals on average as the data would not fit the modeled regression line as well. Standardized residuals are represented as Z-scores where 0 represent the predicted values.
If you plot a linear model (i.e. our model object), R will print out four different plots of which are useful for evaluating the model fit. These are very briefly summarized as:
Residuals vs Fitted: considers the relationship between the actual and the predicted data. The more dispersed the residuals are, the weaker the R2 should be. This can be useful to identify outlines too.The fit also tells us if the residuals are non-linearly distributed.
Normal Q-Q: Demonstrates the extent to which the residuals are normally distributed. Normal residuals should fit the straight line well.
Scale-Location: Shows if the residuals are spread equally across the full range of the predictors. If the values in this chart display a linear positive relationship, it suggests that the residuals spread wider and wider for greater values (this is known as heteroscedasticity).
Residuals vs Leverage: Identifies outliers, high-leverage points and influential observations. This plot is pretty difficult to interpret and there are other means of identifying these values.
More information on these plots is here
If you want to print just one of the plots you can enter which = n within the plot()
function. i.e. plot(model, which = 3)
Map the spatial distribution of the residuals
## class : SpatialPolygonsDataFrame
## features : 6
## extent : 524326, 530660, 181181, 185111 (xmin, xmax, ymin, ymax)
## crs : +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## variables : 6
## names : OA11CD, White_British, Low_Occupancy, Unemployed, Qualification, c..1....14.5939740246415...2....11.496973462005...3....6.9935255223138..
## min values : E00004200, 31.8681318681319, 8.54700854700855, 2.11640211640212, 31.7460317460317, 1.73891618985042
## max values : E00004527, 56.4516129032258, 19.6850393700787, 7.98319327731092, 67.8571428571429, 21.9914592049372
map.resids
## class : SpatialPolygonsDataFrame
## features : 6
## extent : 524326, 530660, 181181, 185111 (xmin, xmax, ymin, ymax)
## crs : +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
## variables : 6
## names : OA11CD, White_British, Low_Occupancy, Unemployed, Qualification, resids
## min values : E00004200, 31.8681318681319, 8.54700854700855, 2.11640211640212, 31.7460317460317, 1.73891618985042
## max values : E00004527, 56.4516129032258, 19.6850393700787, 7.98319327731092, 67.8571428571429, 21.9914592049372
## Observations: 749
## Variables: 7
## $ OA11CD <fct> E00004527, E00004525, E00004522, E00004287, E00004206, …
## $ White_British <dbl> 48.29, 40.94, 44.16, 31.87, 56.45, 39.92, 61.75, 56.36,…
## $ Low_Occupancy <dbl> 12.745, 16.807, 8.547, 12.613, 19.685, 11.765, 7.826, 8…
## $ Unemployed <dbl> 7.5117, 5.9908, 2.1164, 3.2864, 7.9832, 3.5242, 4.7619,…
## $ Qualification <dbl> 35.81, 42.41, 56.48, 67.86, 31.75, 57.20, 64.55, 16.36,…
## $ resids <dbl> 14.594, 11.497, 6.994, 1.739, 9.681, 21.991, 9.610, -4.…
## $ geometry <POLYGON [m]> POLYGON ((530648 181230, 53..., POLYGON ((53051…
No renaming is needed if we use map.resids2
as it is a sf
object
## Variable "resids" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable "resids" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
If there is a geographic pattern in the residuals, it is possible that an unobserved variable may be influencing the dependent variable
GWR is the term introduced by Fotheringham, Charlton and Brunsdon (1997, 2002) to describe a family of regression models in which the coefficients are allowed to vary spatially. GWR uses the coordinates of each sample point or zone centroid, ti, as a target point for a form of spatially weighted least squares regression (for some models the target points can be separately defined, e.g. as grid intersection points, rather than observed data points). (de Smith et al, 2015)
## Adaptive q: 0.382 CV score: 101421
## Adaptive q: 0.618 CV score: 109723
## Adaptive q: 0.2361 CV score: 96876
## Adaptive q: 0.1459 CV score: 94192
## Adaptive q: 0.09017 CV score: 91100
## Adaptive q: 0.05573 CV score: 88243
## Adaptive q: 0.03444 CV score: 85633
## Adaptive q: 0.02129 CV score: 83790
## Adaptive q: 0.01316 CV score: 83096
## Adaptive q: 0.008131 CV score: 84177
## Adaptive q: 0.01535 CV score: 83014
## Adaptive q: 0.01515 CV score: 82957
## Adaptive q: 0.01437 CV score: 82858
## Adaptive q: 0.01441 CV score: 82852
## Adaptive q: 0.01458 CV score: 82833
## Adaptive q: 0.0148 CV score: 82855
## Adaptive q: 0.01462 CV score: 82829
## Adaptive q: 0.01469 CV score: 82824
## Adaptive q: 0.01473 CV score: 82836
## Adaptive q: 0.01469 CV score: 82824
Just to check, let use the data OA.Census2
, which is an sp
object.
When using and sf
object, observation coordinates have to be given
gwr.model = gwr(Qualification ~ Unemployed + White_British,
data = OA.Census,
adapt=GWRbandwidth,
hatmatrix=TRUE,
se.fit=TRUE)
Print results
## Call:
## gwr(formula = Qualification ~ Unemployed + White_British, data = OA.Census,
## adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.01469 (about 11 of 749 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## X.Intercept. 11.082 34.434 45.769 59.754 85.019 47.87
## Unemployed -5.453 -3.283 -2.554 -1.794 0.770 -3.29
## White_British -0.280 0.200 0.378 0.532 0.947 0.41
## Number of data points: 749
## Effective number of parameters (residual: 2traceS - traceS'S): 132.6
## Effective degrees of freedom (residual: 2traceS - traceS'S): 616.4
## Sigma (residual: 2traceS - traceS'S): 9.904
## Effective number of parameters (model: traceS): 94.45
## Effective degrees of freedom (model: traceS): 654.6
## Sigma (model: traceS): 9.61
## Sigma (ML): 8.984
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 5633
## AIC (GWR p. 96, eq. 4.22): 5509
## Residual sum of squares: 60452
## Quasi-global R2: 0.7303
## [1] "sum.w" "X.Intercept." "Unemployed"
## [4] "White_British" "X.Intercept._se" "Unemployed_se"
## [7] "White_British_se" "gwr.e" "pred"
## [10] "pred.se" "localR2" "X.Intercept._se_EDF"
## [13] "Unemployed_se_EDF" "White_British_se_EDF" "pred.se.1"
OA.Census
polygon.The variable names followed by the name of our original data frame (i.e. OA.Census.Unemployed) are the coefficients of the model.
map1 <- tm_shape(gwr.map2) +
tm_fill("White_British",
n = 5,
style = "quantile") +
tm_layout(frame = FALSE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map1
map2 <- tm_shape(gwr.map2) +
tm_fill("White_British.1",
n = 5,
style = "quantile",
title = "WB Coefficient") +
tm_layout(frame = FALSE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map2
## Variable "White_British.1" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
map3 <- tm_shape(gwr.map) +
tm_fill("Unemployed",
n = 5,
style = "quantile") +
tm_layout(frame = FALSE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map3
map4 <- tm_shape(gwr.map) +
tm_fill("Unemployed.1",
n = 5,
style = "quantile",
title = "Ue Coefficient") +
tm_layout(frame = FALSE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map4
## Variable "Unemployed.1" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
# creates a clear grid
grid.newpage()
# assigns the cell size of the grid, in this case 2 by 2
pushViewport(viewport(layout=grid.layout(2,2)))
# prints a map object into a defined cell
print(map1, vp=viewport(layout.pos.col = 1, layout.pos.row =1))
print(map2, vp=viewport(layout.pos.col = 2, layout.pos.row =1))
## Variable "White_British.1" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
print(map3, vp=viewport(layout.pos.col = 1, layout.pos.row =2))
print(map4, vp=viewport(layout.pos.col = 2, layout.pos.row =2))
## Variable "Unemployed.1" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
If you are a member of the QuaRCS lab, you can run this tutorial in R Studio Cloud
END