Spatial Analysis: Autocorrelation and Local Regression

Chris Lloyd: 14th April 2016

Introduction

The exercise is based on RStudio, the software used in previous practical exercises. The key focus here is on the analysis of spatial patterning in population variables. The analysis comprises two parts: (i) analysis of spatial clustering in population variables and (ii) analysis of relationships between the variables. These map respectively onto the two themes of Spatial Autocorrelation and Geographically Weighted Regression. In the practical, you will work with data on employment and health status for small areas (lower layer super output areas (LSOAs)) within the North West of England.

Project Tasks

To complete the assignment you will need to:

  1. Conduct an analysis of geographic inequalities in the North West of England.
  2. Assess how variables representing socio-economic status are inter-related.

Collectively, the tasks associated with these themes comprise assignment 7 (20% of total module marks).

The submission deadline is Thursday 12th May 2016 at 5pm. Both electronic submission via VITAL and submission of a paper version to the Student Support Office is required.

Downloading counts for areas

You will download data for lower layer super output areas (LSOAs). These data will be downloaded from the Infuse website.

  1. Visit the website http://infuse.mimas.ac.uk/
  2. On the InFuse homepage, click on 2011 Census data.
  3. From the list of Topics on the left hand side of the screen, select 'Economic Activity'.
  4. Click Select under the first of the 40 topic combinations (Age, Economic activity), then click Next. In the Categories window, tick the box next to Age 16 to 74. Next, expand the required fields (click - next to the fields indicated) and tick the boxes next to All categories: economic activity, Employee > Part-time, Employee > Full-time, Self-employed with employees > Part-time, Self-employed with employees > Full-time, Self-employed without employees > Part-time, Self-employed without employees > Full-time, Unemployed, Long-term sick or disabled.

Click Add. You should then have nine category combinations.

Click Next.

In the Geography window expand Regions (click on -). Expand North West and select Lower Super Output Areas and Data Zones (4497 areas).

Then, click Add followed by Next.

Selected Category Combinations:

  1. .243 - Age : Age 16 to 74 - economic activity : All categories: economic activity - Unit : Persons
  2. .248 - Age : Age 16 to 74 - economic activity : Unemployed - Unit : Persons
  3. .256 - Age : Age 16 to 74 - economic activity : Part-time - Unit : Persons
  4. .260 - Age : Age 16 to 74 - economic activity : Full-time - Unit : Persons
  5. .283 - Age : Age 16 to 74 - economic activity : Long-term sick or disabled - Unit : Persons
  6. .556 - Age : Age 16 to 74 - economic activity : Part-time - Unit : Persons
  7. .557 - Age : Age 16 to 74 - economic activity : Full-time - Unit : Persons
  8. .558 - Age : Age 16 to 74 - economic activity : Part-time - Unit : Persons
  9. .559 - Age : Age 16 to 74 - economic activity : Full-time - Unit : Persons

Click Get the data, followed by Download data.

Save the data to your data directory (e.g., M:\ENVS453\nw).

Importing the area count data into R

Open RStudio. As usual, you must set your working directory. You can set it to whatever location suits you, but the example below uses the M:\ drive of the University managed system.

# For example:
setwd("M:/ENVS453/nw/")

Read in the text file downloaded from Infuse for use later on (making sure your filename is correctly specified within 'read.csv'):

# Read csv file, skipping the first two rows:
nwemploy <- read.csv("Data_AGE_ECOACT_UNIT.csv", header = FALSE, skip = 2)

Note that you are skipping the first two rows of the data and that no header labels are read in.

Now rename the columns (the default is v1, v2,…):

colnames(nwemploy) <- c("ID", "geocode", "label", "type", "typeid", "alleconact", 
    "employeePT", "employeeFT", "selfempwithPT", "selfempwithFT", "selfempnoPT", 
    "selfempnoFT", "unemploy", "LTsickdis")

You need to compute the percentage of people who are unemployed with the sum of employed and unemployed persons as the denominator. The unemployment percentages will be mapped and analysed later.

First, compute the total number of employed people (you will need this later as well as at this stage):

nwemploy$employed <- nwemploy$employeePT + nwemploy$employeeFT + nwemploy$selfempwithPT + 
    nwemploy$selfempwithFT + nwemploy$selfempnoPT + nwemploy$selfempnoFT

Then the total employed and unemployed people:

nwemploy$total <- nwemploy$employed + nwemploy$unemploy

Next, compute the percentage unemployed:

nwemploy$unemployPC <- (nwemploy$unemploy/nwemploy$total) * 100

Now check the file:

head(nwemploy)
##      ID   geocode       label                                    type
## 1 23905 E01004766 Bolton 005A Lower Super Output Areas and Data Zones
## 2 23906 E01004767 Bolton 005B Lower Super Output Areas and Data Zones
## 3 23907 E01004768 Bolton 001A Lower Super Output Areas and Data Zones
## 4 23908 E01004769 Bolton 003A Lower Super Output Areas and Data Zones
## 5 23909 E01004770 Bolton 003B Lower Super Output Areas and Data Zones
## 6 23910 E01004771 Bolton 003C Lower Super Output Areas and Data Zones
##   typeid alleconact employeePT employeeFT selfempwithPT selfempwithFT
## 1 LSOADZ       1147        178        452             4            20
## 2 LSOADZ       1254        164        533             5            17
## 3 LSOADZ       1173        189        504             7            36
## 4 LSOADZ       1161        143        547             9            22
## 5 LSOADZ       1024        155        367             8            34
## 6 LSOADZ       1159        176        464             9            32
##   selfempnoPT selfempnoFT unemploy LTsickdis NA employed total unemployPC
## 1          28          37       57        42 NA      719   776      7.345
## 2          28          54       47        47 NA      801   848      5.542
## 3          30          75       19         7 NA      841   860      2.209
## 4          29          55       24        22 NA      805   829      2.895
## 5          36          46       20        22 NA      646   666      3.003
## 6          28          85       26        25 NA      794   820      3.171

This provides information on the first few rows of the data.

Downloading area data

To map the data you have already worked with you need boundary files (vector polygon data). Area data can be downloaded from the website http://edina.ac.uk/ukborders/, but the area data you need are provided on VITAL > Learning Resources > Practical > NorthWestLSOA.zip.

Download the file, unzip the downloaded data, and place the contents inside your working directory.

If you want to save your workspace, so you can open it up later on, you can use the save.image function.

# This will save all objects currently loaded in the workspace
save.image(file = "nw.RData")

After saving, you can re-load the objects contained in this saved file at any time (e.g. after closing R) using the load command. The .RData file is stored in the working directory defined earlier.

# Remember that you will need to set your working directory before trying to
# load - otherwise R will not know where to find your RData file
load("nw.RData")

Saving R commands in a script

You can copy the commands you enter into the R console into a script and then re-run or modify these commands at a later point. To do this, go to File > New File > R Script. Then go to File > Save As and give the new file a sensible name such as script; the file is then saved as script.R.

Now, as you work through the exercise, you can copy your commands into the script file and then save the script file before ending the session. To run code, you can select code in the script file and then click the Run button.

Importing the LSOA boundaries

As the next step, we need to import the North West LSOAs shapefile (vector polygon boundaries) downloaded earlier.

We will need to make use of functions which are not contained in the base R installation.

You need to load packages including 'maptools', 'sp' and 'rgeos'. These packages include functions for reading and manipulating GIS data.

# Choose the CRAN mirror where the packages are going to be downloaded from
options(repos = c(CRAN = "http://www.stats.bris.ac.uk/R/"))

# Download / install and then load sp
install.packages("sp", depend = TRUE, lib = getwd())
library("sp", lib.loc = getwd())

# Download / install and then load maptools
install.packages("maptools", depend = TRUE, lib = getwd())
library(maptools, lib.loc = getwd())

# Download / install and then load stringr
install.packages("stringr", depend = TRUE, lib = getwd())
library("stringr", lib.loc = getwd())

# Download / install and then load rgeos
install.packages("rgeos", depend = TRUE, lib = getwd())
library("rgeos", lib.loc = getwd())

Now we will read the LSOA data into a type of object called a SpatialPolygonsDataFrame.

nwLSOA <- readShapePoly("NorthWestLSOA.shp")

We can see what the SpatialPolygonsDataFrame looks like using the plot() function.

# A basic plot of the LSOAs SpatialPolygonsDataFrame
plot(nwLSOA)

[Figure: a basic plot of the North West LSOA boundaries]

Now that we have a map of LSOAs in the North West of England, we can consider mapping employment data for areas.

Linking, mapping and analysing employment data

Next we will map the percentage of unemployed people. Firstly, you need to join the employment data to the LSOA areas so that the employment counts and percentages can be mapped:

nwLSOA@data = data.frame(nwLSOA@data, nwemploy[match(nwLSOA@data$geo_code, nwemploy$geocode), 
    ])

Now we'll map the percentage of people who were unemployed in 2011 (unemployPC) and add a legend to the map. At this stage, we need two new libraries - RColorBrewer and classInt.

# Download / install and then load RColorBrewer
install.packages("RColorBrewer", depend = TRUE, lib = getwd())
library("RColorBrewer", lib.loc = getwd())
# Download / install and then load classInt
install.packages("classInt", depend = TRUE, lib = getwd())
library("classInt", lib.loc = getwd())

We select a class interval scheme to display the values, and add a legend.

fj5 <- classIntervals(nwLSOA$unemployPC, n = 5, style = "fisher")
pal <- grey.colors(4, 0.95, 0.55, 2.2)
fj5Colours <- findColours(fj5, pal)
plot(nwLSOA, col = fj5Colours, pch = 19, border = NA)
legend("topleft", fill = attr(fj5Colours, "palette"), legend = names(attr(fj5Colours, 
    "table")), bty = "n")

[Figure: map of unemployment (%) by LSOA, with legend]

Note the element “border = NA” within the plot function - this turns off the zone borders, making the zones easier to see.

To map rates, it is useful to round the values included in the legend; the first line of the code above can be modified to do this. We can also replace “topleft” in the legend call with x and y coordinates to position the legend more precisely (changing the coordinates moves the legend). You will also add a scale bar and a north arrow:

fj5 <- classIntervals(round(nwLSOA$unemployPC, digits = 2), n = 5, style = "fisher")
pal <- grey.colors(4, 0.95, 0.55, 2.2)
fj5Colours <- findColours(fj5, pal)
plot(nwLSOA, col = fj5Colours, pch = 19, border = NA)
legend(x = 210000, y = 477000, fill = attr(fj5Colours, "palette"), legend = names(attr(fj5Colours, 
    "table")), bty = "n")
# Add a scale bar
SpatialPolygonsRescale(layout.scale.bar(), offset = c(235000, 486000), scale = 40000, 
    fill = c("white", "black"), plot.grid = F)
text(235000, 481000, "0", cex = 0.6)
text(255000, 481000, "20", cex = 0.6)
text(275000, 481000, "40km", cex = 0.6)
# Add a North arrow
SpatialPolygonsRescale(layout.north.arrow(1), offset = c(252000, 498000), scale = 20000, 
    plot.grid = F)

[Figure: map of unemployment (%) with rounded legend values, scale bar and north arrow]

The six-figure numbers are eastings and northings; changing them changes the position of the features they refer to. To determine the coordinates of a location, type:

locator()

Click on the plot at the point for which you want locational information and then press the Esc key. The coordinates of the location will then appear in the R console.

The maps should be saved as images (in the plots window in RStudio, go to Export > Save As Image and specify the image format as TIFF).

Note that you can export the nwLSOA file into shapefile format which can then be opened in QGIS or ArcGIS. This is done with:

writeSpatialShape(nwLSOA, "nwlsoa")

This creates the output shapefile 'nwlsoa'.

Creating LSOA centroids

In the following analysis, we need the centroids of the LSOAs, and this is achieved using the coordinates command.

LSOAXY <- SpatialPointsDataFrame(coordinates(nwLSOA), data = as(nwLSOA, "data.frame")[c("geo_code")])

LSOAXY is then joined to the nwemploy data. We link geo_code (in LSOAXY) to geocode in nwemploy:

LSOAemploy <- merge(nwemploy, LSOAXY, by.x = "geocode", by.y = "geo_code")
head(LSOAemploy)
##     geocode    ID       label                                    type
## 1 E01004766 23905 Bolton 005A Lower Super Output Areas and Data Zones
## 2 E01004767 23906 Bolton 005B Lower Super Output Areas and Data Zones
## 3 E01004768 23907 Bolton 001A Lower Super Output Areas and Data Zones
## 4 E01004769 23908 Bolton 003A Lower Super Output Areas and Data Zones
## 5 E01004770 23909 Bolton 003B Lower Super Output Areas and Data Zones
## 6 E01004771 23910 Bolton 003C Lower Super Output Areas and Data Zones
##   typeid alleconact employeePT employeeFT selfempwithPT selfempwithFT
## 1 LSOADZ       1147        178        452             4            20
## 2 LSOADZ       1254        164        533             5            17
## 3 LSOADZ       1173        189        504             7            36
## 4 LSOADZ       1161        143        547             9            22
## 5 LSOADZ       1024        155        367             8            34
## 6 LSOADZ       1159        176        464             9            32
##   selfempnoPT selfempnoFT unemploy LTsickdis NA employed total unemployPC
## 1          28          37       57        42 NA      719   776      7.345
## 2          28          54       47        47 NA      801   848      5.542
## 3          30          75       19         7 NA      841   860      2.209
## 4          29          55       24        22 NA      805   829      2.895
## 5          36          46       20        22 NA      646   666      3.003
## 6          28          85       26        25 NA      794   820      3.171
##   coords.x1 coords.x2
## 1    371009    411545
## 2    371874    411498
## 3    370103    413305
## 4    371920    412767
## 5    372191    412270
## 6    371347    412785

Spatial autocorrelation

One useful way of exploring the spatial patterning in data values (here, unemployment percentages) is to compute a statistic called the Moran's I spatial autocorrelation coefficient. Positive values of I indicate clustering of similar values, negative values of I indicate clustering of dissimilar values and values close to zero indicate an absence of spatial autocorrelation (a 'random' spatial pattern).
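For reference (this is the standard definition of the statistic, included here only for context), Moran's I for a variable x measured over n zones, with spatial weights w_ij between zones i and j, is:

I = \frac{n}{\sum_i \sum_j w_{ij}} \times \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}

where \bar{x} is the mean of x. With row-standardised weights (as used below), \sum_i \sum_j w_{ij} = n, so the leading term equals one.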

First, we need to install and load the package 'spdep':

# Download / install and then load spdep
install.packages("spdep", depend = TRUE, lib = getwd())
library("spdep", lib.loc = getwd())

Next we need to create a list indicating which LSOAs share boundaries with which other LSOAs. This is done using the command 'poly2nb'. This must then be converted into a weights file using 'nb2listw'. The weights file represents the spatial relationships between LSOAs: if an LSOA borders five other LSOAs then each neighbouring LSOA is given a weight of 0.2, since the weights for each set of neighbours of each LSOA sum to one and 1/5 = 0.2. With this weights file we can measure the average difference between unemployment percentages at each LSOA and at the neighbouring LSOAs. A value of I close to one would indicate that the unemployment percentage values at each LSOA tend to be similar to the unemployment percentage values at other LSOAs with which they share a border - thus, the values cluster.

LSOA.nb <- poly2nb(nwLSOA)
LSOA.wt <- nb2listw(LSOA.nb)
LSOA.wt
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 4497 
## Number of nonzero links: 26348 
## Percentage nonzero weights: 0.1303 
## Average number of links: 5.859 
## 
## Weights style: W 
## Weights constants summary:
##      n       nn   S0   S1    S2
## W 4497 20223009 4497 1626 18677
# Summarise the weights values
summary(unlist(LSOA.wt$weights))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0455  0.1430  0.1670  0.1710  0.2000  1.0000

Using the new weights file we can now compute I:

moransUnemploy <- moran.test(nwLSOA$unemployPC, LSOA.wt)
moransUnemploy
## 
##  Moran's I test under randomisation
## 
## data:  nwLSOA$unemployPC  
## weights: LSOA.wt  
## 
## Moran I statistic standard deviate = 64.8, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##         5.805e-01        -2.224e-04         8.029e-05

The Moran's I value of 0.58 indicates a moderate to high level of spatial clustering - that is, neighbouring unemployment percentages tend to be quite similar on average. The reported p-value (< 2.2e-16) indicates that the clustering is highly significant (that is, it is very unlikely to have occurred by chance) - p-values closer to zero indicate a higher significance level.

You should take note of the outputs from the analysis; the key information is the value of I and the significance level.

Values such as 2.2e-16 may not mean much to you. This is scientific notation and is a convenient way of displaying very small or very large values. In the case of 2.2e-16, the decimal point is moved 16 positions to the left, so 2.2e-16 = 0.00000000000000022.
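If it helps, you can ask R to print such a value without scientific notation (a quick check you can run in the console):

# Print the value in full rather than in scientific notation
format(2.2e-16, scientific = FALSE)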

The previous analysis uses contiguity (neighbouring zones). For this next stage, you will compute distances between centroids and take their inverse (so small distances take large values). These inverse distance weights will be row-standardised (each weight is divided by the sum of its row). The exponent in the example is 2 (the term '2' in 1/(LSOA.dists^2)) while a cutoff of 20km is specified. The weight of each zone with respect to itself (the diagonal of the matrix) is set to zero. Both the exponent and cutoff can be varied to assess the impact of changing spatial scales on I.

LSOA.dists <- as.matrix(dist(cbind(LSOAXY$coords.x1, LSOAXY$coords.x2)))
LSOA.dists.inv <- 1/(LSOA.dists^2)
LSOA.dists.inv[LSOA.dists > 20000] <- 0
diag(LSOA.dists.inv) <- 0
LSOA.dists.invR <- LSOA.dists.inv/rowSums(LSOA.dists.inv)
LSOA.dists.invR[1:5, 1:5]
##   0 1 2 3 4
## 0 0 0 0 0 0
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0

The weights matrix is then converted into an object which can be used by the Moran's I tool.

# Convert the weights matrix into a weights list object
LSOA.dists.inv.w <- mat2listw(LSOA.dists.invR)

# Compute Moran's I
moransUnemployID <- moran.test(nwLSOA$unemployPC, LSOA.dists.inv.w)
moransUnemployID
## 
##  Moran's I test under randomisation
## 
## data:  nwLSOA$unemployPC  
## weights: LSOA.dists.inv.w  
## 
## Moran I statistic standard deviate = 78.18, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##         3.549e-01        -2.224e-04         2.064e-05

Try re-running Moran's I for a range of exponents (e.g., change 1/(LSOA.dists^2) to 1/(LSOA.dists^1.5)). A large exponent increases the proportional weight of closer observations - thus it emphasises the nearest neighbours. I would generally be expected to be larger (positive) over small spatial scales than over larger spatial scales. Note the value of I and the significance level for each weighting scheme you try; a sketch for one alternative exponent is given below.
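For example, the weights could be rebuilt with an exponent of 1.5, following the same steps as above (a sketch; the object names here are new):

# Inverse distance weights with an exponent of 1.5 and a 20km cutoff
LSOA.dists.inv15 <- 1/(LSOA.dists^1.5)
LSOA.dists.inv15[LSOA.dists > 20000] <- 0
diag(LSOA.dists.inv15) <- 0
# Row-standardise and convert to a weights list object
LSOA.dists.inv15R <- LSOA.dists.inv15/rowSums(LSOA.dists.inv15)
LSOA.dists.inv15.w <- mat2listw(LSOA.dists.inv15R)
# Re-compute Moran's I with the new weights
moran.test(nwLSOA$unemployPC, LSOA.dists.inv15.w)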

You can compute a Moran scatterplot (standardised data value versus weighted average of neighbouring values) easily:

# Compute a Moran scatterplot
moran.plot(nwLSOA$unemployPC, LSOA.dists.inv.w)

[Figure: Moran scatterplot of unemployment (%)]

This can obviously be derived for different weighting functions.
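For example, the same plot could be produced with the contiguity weights created earlier (a one-line sketch):

# Moran scatterplot using the contiguity-based weights
moran.plot(nwLSOA$unemployPC, LSOA.wt)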

Measuring local spatial autocorrelation

The analysis so far has focused on using global Moran's I - there is a single number result for the whole study area. In this part of the exercise, you will compute a local version of Moran's I. This provides a measure of spatial association for each zone - as in global I, a positive value indicates that values neighbouring a given zone tend to be similar to the value in the zone while a negative value suggests that neighbouring values are dissimilar.

You will compute local Moran's I using the inverse distance weights.

# Compute local Moran's I
lmi <- as.data.frame(localmoran(nwLSOA$unemployPC, LSOA.dists.inv.w))

# Don't worry about the warning message!

LSOAloci <- SpatialPolygonsDataFrame(nwLSOA, lmi, match.ID = FALSE)

# Map the local Moran's I values
fj5 <- classIntervals(round(LSOAloci$Ii, digits = 4), n = 5, style = "fisher")
pal <- grey.colors(4, 0.95, 0.55, 2.2)
fj5Colours <- findColours(fj5, pal)
plot(LSOAloci, col = fj5Colours, pch = 19, border = NA)
legend("bottomright", fill = attr(fj5Colours, "palette"), legend = names(attr(fj5Colours, 
    "table")), bty = "n")

[Figure: map of local Moran's I values for unemployment (%)]

Recall that positive values indicate similarity between neighbours while negative values indicate dissimilarity.

Save the map as an image following the same steps as before.

The zip file you downloaded earlier also contains the boundary of the North West of England. This will be added to a map in the next stage.

nw <- readShapePoly("NorthWest.shp")

It is possible to create maps of significant clusters using a few steps (the new 'nw' file is used to provide an outline for the map):

# Create a local I Cluster Map
quadrant <- vector(mode = "numeric", length = nrow(LSOAloci))

unempMean <- nwLSOA$unemployPC - mean(nwLSOA$unemployPC)
# centres the variable of interest around its mean
C_mI <- LSOAloci$Ii - mean(LSOAloci$Ii)
# centres the local Moran's around the mean

signif <- 0.1
# set a statistical significance level for the local Moran's

quadrant[unempMean > 0 & C_mI > 0] <- 4
# these lines define the high-high, low-low, low-high and high-low
# categories

quadrant[unempMean < 0 & C_mI < 0] <- 1
quadrant[unempMean < 0 & C_mI > 0] <- 2
quadrant[unempMean > 0 & C_mI < 0] <- 3
quadrant[lmi[, 5] > signif] <- 0
# places non-significant Moran's in the category '0'

brks <- c(0, 1, 2, 3, 4)
colors <- c("white", "blue", rgb(0, 0, 1, alpha = 0.4), rgb(1, 0, 0, alpha = 0.4), 
    "red")
plot(nwLSOA, border = NA, col = colors[findInterval(quadrant, brks, all.inside = FALSE)])
plot(nw, add = TRUE)
box()
legend("bottomright", legend = c("insignificant", "low-low", "low-high", "high-low", 
    "high-high"), fill = colors, bty = "n", cex = 0.7, y.intersp = 1, x.intersp = 1)
title("Local I Cluster Map")

[Figure: Local I Cluster Map with the North West outline]

So, low-low indicates small values surrounded by small values, high-high is large values surrounded by large values, low-high is small values surrounded by large values and high-low is large values surrounded by small values.

Exploring relationships between population variables

So far, you have examined the association between data values attached to zones and values of the same property in neighbouring zones. Next, you will assess how different variables are related. The most straightforward approach is to use ordinary least squares (OLS) regression.

OLS is used here to explore how the percentage of unemployed persons in an area relates to the percentage of people who are long-term sick or disabled. To explore this relationship, you will first need to compute the percentage of people who were long-term sick or disabled - that is, long-term sick or disabled divided by total persons by economic activity, multiplied by 100. The new percentage should be called lsdpc and it should be added to the file nwLSOA (thus, nwLSOA$lsdpc).
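A minimal sketch of this step, assuming the column names created when the data were renamed and joined earlier (LTsickdis and alleconact), is:

# Percentage long-term sick or disabled (denominator: all categories of economic activity)
nwLSOA$lsdpc <- (nwLSOA$LTsickdis/nwLSOA$alleconact) * 100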

ols_UnempPSD <- lm(unemployPC ~ lsdpc, data = nwLSOA)
summary(ols_UnempPSD)
## 
## Call:
## lm(formula = unemployPC ~ lsdpc, data = nwLSOA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.019  -1.345  -0.229   0.997  18.305 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.16510    0.06771    17.2   <2e-16 ***
## lsdpc        1.14132    0.00997   114.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.48 on 4495 degrees of freedom
## Multiple R-squared:  0.744,  Adjusted R-squared:  0.744 
## F-statistic: 1.31e+04 on 1 and 4495 DF,  p-value: <2e-16

The estimate (parameter estimate) for lsdpc is the key figure here. It is positive and so indicates a positive relationship between unemployment and long-term sickness or disability (where values of one are high, so (on average) are values of the other). The *** entry at the end of the lsdpc row indicates that the result is statistically significant.

Geographically weighted regression

The OLS results provide a 'global' summary of the relationship between the two variables. Geographically weighted regression (GWR) can be used to explore how relationships differ between regions.

There are several R packages for GWR. These include spgwr, GWmodel and gwrr: http://cran.r-project.org/web/packages/spgwr/index.html, http://cran.r-project.org/web/packages/GWmodel/index.html, http://cran.r-project.org/web/packages/gwrr/index.html

In this exercise, you will use spgwr. To use GWR, it is necessary to select a spatial weighting scheme (a kernel) which defines how much weight is given to each neighbour of each zone. The size of a kernel is indicated by its bandwidth - a small bandwidth means that close neighbours are given large weights while more distant neighbours are given much smaller weights. With a larger bandwidth, the weights are more 'spread out' and there is not a marked decline in weights with increased distance.

The bandwidth can be selected manually or some form of optimisation approach can be used to select one. One possible approach is cross validation. This entails removing each observation in turn and estimating its value using the remaining data. The differences between the true observed values and the estimated values are then computed. This procedure is repeated for multiple bandwidths and the bandwidth which corresponds to the smallest overall error is considered optimal. The bandwidth identified in this way is then used for fitting the GWR model.

library("spgwr")
## NOTE: This package does not constitute approval of GWR
## as a method of spatial analysis; see example(gwr)
# Identify the optimal bandwidth using cross validation; the fixed bandwidth
# Gaussian function is used
bwG <- gwr.sel(unemployPC ~ lsdpc, data = nwLSOA, gweight = gwr.Gauss, verbose = FALSE)
# View the selected bandwidth
bwG
## [1] 2486
# Fit the GWR model using the bandwidth identified in the previous step
gwrG <- gwr(unemployPC ~ lsdpc, data = nwLSOA, bandwidth = bwG, gweight = gwr.Gauss)
# Summarise the GWR outputs
summary(gwrG$SDF)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##      min    max
## x 293942 406087
## y 338994 588517
## Is projected: NA 
## proj4string : [NA]
## Data attributes:
##      sum.w        X.Intercept.         lsdpc            gwr.e        
##  Min.   :  1.0   Min.   :-17.115   Min.   :-1.670   Min.   :-10.234  
##  1st Qu.: 26.2   1st Qu.:  0.791   1st Qu.: 0.959   1st Qu.: -1.163  
##  Median : 43.5   Median :  1.513   Median : 1.073   Median : -0.169  
##  Mean   : 48.1   Mean   :  1.512   Mean   : 1.065   Mean   : -0.072  
##  3rd Qu.: 67.8   3rd Qu.:  2.038   3rd Qu.: 1.172   3rd Qu.:  0.817  
##  Max.   :121.1   Max.   :  6.328   Max.   : 7.708   Max.   : 15.569  
##       pred          localR2     
##  Min.   : 1.22   Min.   :0.106  
##  1st Qu.: 4.19   1st Qu.:0.705  
##  Median : 6.51   Median :0.758  
##  Mean   : 7.72   Mean   :0.747  
##  3rd Qu.:10.51   3rd Qu.:0.809  
##  Max.   :29.49   Max.   :1.000

So, the bandwidth identified by gwr.sel is stored in bwG and this is then passed to the bandwidth argument of gwr. You could instead specify a bandwidth manually (e.g., bandwidth = 20000.0) and avoid the use of gwr.sel, as in the sketch below.
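For example (an illustrative sketch; the object name gwrG20k is new and the 20km value is an arbitrary choice):

# Fit the GWR model with a manually specified fixed bandwidth of 20km
gwrG20k <- gwr(unemployPC ~ lsdpc, data = nwLSOA, bandwidth = 20000, gweight = gwr.Gauss)
summary(gwrG20k$SDF)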

With GWR, regression beta coefficients are provided for every input location and these can be mapped so that, for example, the local relationships between unemployment and long term illness and disability can be explored.

# Now map the coefficients (the local slope values).
spplot(gwrG$SDF, "lsdpc", col = "transparent")

[Figure: map of the local slope (lsdpc) coefficients]


# and the local r square values
spplot(gwrG$SDF, "localR2", col = "transparent")

[Figure: map of the local R-squared values]

The term col=“transparent” prevents the borders being shown as distinct solid lines. Save the maps as images.

Recall that r square indicates the goodness of fit. So, mapped values close to one indicate locations where the model is a good fit.

The weighting scheme (gweight) can be changed; gwr.Gauss is used above with a fixed bandwidth (i.e., the weights decline to almost zero after a fixed distance). Instead you could use gwr.bisquare with an adaptive bandwidth - the kernel is then small where the density of observations is high and large where the density of observations is low. So, in this case, it will be small in urban areas (where there are more LSOAs over a small area) and large in rural areas (where LSOAs are less dense). To clarify, in the fixed case a distance is specified, while in the adaptive case it is the number of nearest neighbours. As an example, 25 nearest neighbours could be specified, and the distance from an urban LSOA to its 25th nearest neighbour will usually be much smaller than the distance from a rural LSOA to its 25th nearest neighbour. So, the weight is zero for the 26th nearest neighbour and beyond, and the weights are spread out further in the rural case than in the urban case. In R, the nearest neighbours are expressed as a fraction of the total data (e.g., in this case, a bandwidth of 0.001104861 = about 4 of 4497 data points, while 1 would represent a neighbourhood containing all other observations).

# Identify the optimal bandwidth using cross validation; an adaptive
# bandwidth bi-square function is used
ad.bw <- gwr.sel(unemployPC ~ lsdpc, data = nwLSOA, adapt = TRUE)
## Adaptive q: 0.382 CV score: 27383 
## Adaptive q: 0.618 CV score: 27517 
## Adaptive q: 0.2361 CV score: 27072 
## Adaptive q: 0.1459 CV score: 26657 
## Adaptive q: 0.09017 CV score: 26191 
## Adaptive q: 0.05573 CV score: 25605 
## Adaptive q: 0.03444 CV score: 24752 
## Adaptive q: 0.02129 CV score: 23592 
## Adaptive q: 0.01316 CV score: 22380 
## Adaptive q: 0.008131 CV score: 21422 
## Adaptive q: 0.005025 CV score: 20640 
## Adaptive q: 0.003106 CV score: 20020 
## Adaptive q: 0.001919 CV score: 19563 
## Adaptive q: 0.001186 CV score: 19312 
## Adaptive q: 0.0007331 CV score: 19620 
## Adaptive q: 0.001354 CV score: 19370 
## Adaptive q: 0.001013 CV score: 19338 
## Adaptive q: 0.001146 CV score: 19304 
## Adaptive q: 0.001105 CV score: 19301 
## Adaptive q: 0.001064 CV score: 19313 
## Adaptive q: 0.001105 CV score: 19301
# View the selected bandwidth
ad.bw
## [1] 0.001105

The figure of 0.001104861 (about 4 of 4497 data points) is very small and it would be more sensible to use a larger value - you should try 0.02 (about 90 of 4497 data points):

# Fit the GWR model using the bandwidth identified in the previous step
gwrAD <- gwr(unemployPC ~ lsdpc, data = nwLSOA, adapt = 0.02, gweight = gwr.bisquare)
# Summarise the GWR outputs
summary(gwrAD$SDF)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##      min    max
## x 293942 406087
## y 338994 588517
## Is projected: NA 
## proj4string : [NA]
## Data attributes:
##      sum.w       X.Intercept.        lsdpc           gwr.e        
##  Min.   :11.0   Min.   :-1.604   Min.   :0.440   Min.   :-10.458  
##  1st Qu.:29.4   1st Qu.: 0.746   1st Qu.:0.933   1st Qu.: -1.130  
##  Median :32.9   Median : 1.369   Median :1.050   Median : -0.148  
##  Mean   :34.9   Mean   : 1.535   Mean   :1.066   Mean   : -0.081  
##  3rd Qu.:38.6   3rd Qu.: 1.997   3rd Qu.:1.195   3rd Qu.:  0.858  
##  Max.   :72.2   Max.   :10.461   Max.   :1.605   Max.   : 13.775  
##       pred          localR2       
##  Min.   : 1.42   Min.   :-0.0245  
##  1st Qu.: 4.15   1st Qu.: 0.7073  
##  Median : 6.43   Median : 0.7603  
##  Mean   : 7.73   Mean   : 0.7485  
##  3rd Qu.:10.53   3rd Qu.: 0.8170  
##  Max.   :31.04   Max.   : 0.9263
# Now map the coefficients (the local slope values).
spplot(gwrAD$SDF, "lsdpc", col = "transparent")

[Figure: map of the local slope (lsdpc) coefficients, adaptive bandwidth]


# and the local r square values
spplot(gwrAD$SDF, "localR2", col = "transparent")

[Figure: map of the local R-squared values, adaptive bandwidth]

Alternative bandwidth values for both approaches can be entered manually (so, as indicated above, there is then no need to use gwr.sel) and you can assess how the results change as the size of the bandwidths is increased or decreased.
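As an illustration (the object name gwrAD05 and the 0.05 value are arbitrary choices, not part of the original exercise):

# Re-fit the adaptive GWR model with a wider neighbourhood (about 5% of the LSOAs)
gwrAD05 <- gwr(unemployPC ~ lsdpc, data = nwLSOA, adapt = 0.05, gweight = gwr.bisquare)
summary(gwrAD05$SDF)
# Map the local slope values for comparison with the earlier maps
spplot(gwrAD05$SDF, "lsdpc", col = "transparent")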

Assessment

Q1. Give the Moran's I values for each of the variables included in your analysis. What do these values suggest to you about clustering (or spatial concentration) of the population of the North West? Provide an interpretation of your findings. You may also wish to make use of Moran scatterplots and/or different weighting schemes.

Q2. Describe the local Moran's I values for unemployment (%) and long-term sick or disabled (%) (and any other variables you choose to add) and interpret your findings. How does the map of local I values relate to the unemployment (%) and long-term sick or disabled (%) maps? Again, you may wish to use different weighting schemes.

Q3. Describe your GWR maps (local slope and local r square) for unemployment (%) against long-term sick or disabled (%) (and any other variables you choose to add) and interpret your findings. Does GWR add information over and above that provided by the global regression model? How do the results vary as the weighting schemes are changed?

You may wish to expand your analysis to include other variables - these can be downloaded from Infuse and they can be linked to your existing data by reading them into R (as for the employment data), relabelling the columns (again, as for the employment data), and then using the merge command to join the files together. E.g.:

mergefile <- merge(nwemploy, newfile, by.x = "geocode", by.y = "geocode")

and then use the match command as before to link the new mergefile to the LSOAs (a sketch of this step is given after the regression example below). You could, for example, explore how housing tenure and employment together are related to long-term sickness and disability. For example, if you computed the percentage of social rented households and named it socrenthhpc, you could use OLS regression to explore this relationship:

ols_combined <- lm(lsdpc ~ unemployPC + socrenthhpc, data = nwLSOA)
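Note that this model assumes the new percentage (socrenthhpc in this example) has already been attached to nwLSOA. Linking the merged file to the LSOAs could follow the earlier match pattern (a sketch using the hypothetical mergefile; columns that are already present will be duplicated with a '.1' suffix):

# Join the merged attribute table to the LSOA polygons, matching on the geography codes
nwLSOA@data = data.frame(nwLSOA@data, mergefile[match(nwLSOA@data$geo_code, mergefile$geocode), ])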

You should provide references to other studies or technical accounts to help support your discussion.

Specific guidance on what should be provided for each question is not given - it is up to you to think of ways of building on the basic analysis.

Your document can be submitted in any format you like - including Rmd/html or Word. But, it is crucial that you include any R code you wrote/adapted to complete your analyses.

Summary

The various analyses that you have carried out during this tutorial will be of assistance when compiling your report. You may have to piece together different parts of the code to achieve this; however, examples of any code that you need to complete your report are contained in this document. You may also wish to make use of the knowledge developed in previous practicals.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.