Suggested citation:

Mendez C. (2020). Geographically weighted regression models: A tutorial using the spgwr package in R. R Studio/RPubs. Available at https://rpubs.com/quarcs-lab/tutorial-gwr1

This work is licensed under the Creative Commons Attribution-Share Alike 4.0 International License. Acknowledgment:

Material adapted from multiple sources, in particular the course materials of Guy Lansley & James Cheshire (2016).

# 1 Libraries

``````knitr::opts_chunk\$set(echo = TRUE)

library(tidyverse)  # Modern data science workflow
library(sf)
library(sp)
library(rgdal)
library(rgeos)
library(tmap)
library(tmaptools)
library(spgwr)
library(grid)
library(gridExtra)

# Change the presentation of decimal numbers to 4 and avoid scientific notation
options(prompt="R> ", digits=4, scipen=999)``````

# 2 Motivation

• GWR is a nonparametric multivariate model which can indicate where non-stationarity may take place across space (the study area).

• It can be used to identify how (locally weighted) regression coefficients may vary across space (the study area).

# 3 Tutorial objectives

• Explore the residuals of a linear model to understand its limitations
• Run a GWR and observe its parameters across space
• Print multiple maps in one graphic using gridExtra() with tmap

# 5 Import datasets

All used datasets can be obtained from Guy Lansley & James Cheshire (2016).

## 5.1 Non-spatial data

``Census.Data <-read.csv("practicaldata.csv")``
``glimpse(Census.Data)``
``````## Observations: 749
## Variables: 5
## \$ OA            <fct> E00004120, E00004121, E00004122, E00004123, E00004124, …
## \$ White_British <dbl> 42.36, 47.20, 40.68, 49.66, 51.14, 41.42, 48.54, 48.68,…
## \$ Low_Occupancy <dbl> 6.2937, 5.9322, 2.9126, 0.9259, 2.0000, 3.9326, 5.5556,…
## \$ Unemployed    <dbl> 1.8939, 2.6882, 1.2121, 2.8037, 3.8168, 3.8462, 4.5455,…
## \$ Qualification <dbl> 73.63, 69.90, 67.58, 60.78, 65.99, 74.21, 62.45, 60.35,…``````

## 5.2 Spatial data

``Output.Areas <- readOGR(".", "Camden_oa11")``
``````## OGR data source with driver: ESRI Shapefile
## Source: "/Users/Carlos/GitHub/00-QuaRCS/tutorial-gwr1", layer: "Camden_oa11"
## with 749 features
## It has 1 fields``````
``````# Run it in console (long output)
glimpse(Output.Areas)``````

Using the `sf` package

``Output.Areas2 <- read_sf("Camden_oa11.shp")``
``glimpse(Output.Areas2)``
``````## Observations: 749
## Variables: 2
## \$ OA11CD   <chr> "E00004527", "E00004525", "E00004522", "E00004287", "E000042…
## \$ geometry <POLYGON [m]> POLYGON ((530648 181230, 53..., POLYGON ((530511 181…``````

# 6 Merge datasets

``OA.Census <- merge(Output.Areas, Census.Data, by.x="OA11CD", by.y="OA")``

Transform it into a `sf` object

``OA.Census2 <- st_as_sf(OA.Census)``

# 7 Run linear model

Let us run a linear model to understand the global relationship between our variables in our study area.

• Dependent variable: The percentage of people with qualifications
• Independent (predictor) variables: the percentages of unemployed economically active adults and White British ethnicity
``````model <- lm(Qualification ~ Unemployed + White_British, data = Census.Data)
summary(model)``````
``````##
## Call:
## lm(formula = Qualification ~ Unemployed + White_British, data = Census.Data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -50.31  -8.01   1.01   8.96  38.05
##
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)
## (Intercept)    47.8670     2.3357    20.5 <0.0000000000000002 ***
## Unemployed     -3.2946     0.1903   -17.3 <0.0000000000000002 ***
## White_British   0.4109     0.0403    10.2 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.7 on 746 degrees of freedom
## Multiple R-squared:  0.464,  Adjusted R-squared:  0.463
## F-statistic:  323 on 2 and 746 DF,  p-value: <0.0000000000000002``````

## 7.1 Model diagnostics

This model has an adjusted R-squared value of 0.463. So we can assume 46% of the variance can be explained by the model. We can also observe the influences of each of the variables. However, the overall fit of the model and each of the coefficients may vary across space if we consider different parts of our study area. It is, therefore, worth considering the standardized residuals from the model to help us understand and improve our future models.

A residual is the difference between the predicted and observed values for an observation in the model. Models with lower r-squared values would have greater residuals on average as the data would not fit the modeled regression line as well. Standardized residuals are represented as Z-scores where 0 represent the predicted values.

If you plot a linear model (i.e. our model object), R will print out four different plots of which are useful for evaluating the model fit. These are very briefly summarized as:

• Residuals vs Fitted: considers the relationship between the actual and the predicted data. The more dispersed the residuals are, the weaker the R2 should be. This can be useful to identify outlines too.The fit also tells us if the residuals are non-linearly distributed.

• Normal Q-Q: Demonstrates the extent to which the residuals are normally distributed. Normal residuals should fit the straight line well.

• Scale-Location: Shows if the residuals are spread equally across the full range of the predictors. If the values in this chart display a linear positive relationship, it suggests that the residuals spread wider and wider for greater values (this is known as heteroscedasticity).

• Residuals vs Leverage: Identifies outliers, high-leverage points and influential observations. This plot is pretty difficult to interpret and there are other means of identifying these values.

``````par(mfrow=c(2,2))