Chris Lloyd: 14th April 2016
The exercise is based on RStudio, the software used in previous practical exercises. The key focus here is on the analysis of spatial patterning in population variables. The analysis comprises two parts: (i) analysis of spatial clustering in population variables and (ii) analysis of relationships between the variables. These map respectively onto the two themes of Spatial Autocorrelation and Geographically Weighted Regression. In the practical, you will work with data on employment and health status for small areas (lower layer super output areas (LSOAs)) within the North West of England.
To complete the assignment you will need to work through the tasks set out below. Collectively, the tasks associated with these themes comprise assignment 7 (20% of total module marks).
The submission deadline is Thursday 12th May 2016 at 5pm. Both electronic submission via VITAL and submission of a paper version to the Student Support Office is required.
You will download data for lower layer super output areas (LSOAs) from the Infuse website.
Click Add. You should then have nine category combinations.
Click Next.
In the Geography window expand Regions (click on +). Expand North West and select Lower Super Output Areas and Data Zones (4497 areas).
Then, click Add followed by Next.
Your selections should now be listed under Selected Category Combinations.
Click Get the data, followed by Download data.
Save the data to your data directory (e.g., M:\ENVS453\nw).
Open RStudio. As usual, you must set your working directory. You can set it to whatever suits you, but in the example below the M:\ drive of the University-managed system is specified.
# For example:
setwd("M:/ENVS453/nw/")
Read in the text file downloaded from Infuse for use later on (making sure your filename is correctly specified within 'read.csv'):
# Read csv file, skipping the first two rows:
nwemploy <- read.csv("Data_AGE_ECOACT_UNIT.csv", header = FALSE, skip = 2)
Note that you are skipping the first two rows of the data and that no header labels are read in.
Now rename the columns (the defaults are V1, V2, …):
colnames(nwemploy) <- c("ID", "geocode", "label", "type", "typeid", "alleconact",
"employeePT", "employeeFT", "selfempwithPT", "selfempwithFT", "selfempnoPT",
"selfempnoFT", "unemploy", "LTsickdis")
You need to compute the percentage of people who are unemployed with the sum of employed and unemployed persons as the denominator. The unemployment percentages will be mapped and analysed later.
First, compute the total number of employed people (you will need this later as well as at this stage):
nwemploy$employed <- nwemploy$employeePT + nwemploy$employeeFT + nwemploy$selfempwithPT +
nwemploy$selfempwithFT + nwemploy$selfempnoPT + nwemploy$selfempnoFT
Then the total employed and unemployed people:
nwemploy$total <- nwemploy$employed + nwemploy$unemploy
Next, compute the percentage unemployed:
nwemploy$unemployPC <- (nwemploy$unemploy/nwemploy$total) * 100
Now check the file:
head(nwemploy)
## ID geocode label type
## 1 23905 E01004766 Bolton 005A Lower Super Output Areas and Data Zones
## 2 23906 E01004767 Bolton 005B Lower Super Output Areas and Data Zones
## 3 23907 E01004768 Bolton 001A Lower Super Output Areas and Data Zones
## 4 23908 E01004769 Bolton 003A Lower Super Output Areas and Data Zones
## 5 23909 E01004770 Bolton 003B Lower Super Output Areas and Data Zones
## 6 23910 E01004771 Bolton 003C Lower Super Output Areas and Data Zones
## typeid alleconact employeePT employeeFT selfempwithPT selfempwithFT
## 1 LSOADZ 1147 178 452 4 20
## 2 LSOADZ 1254 164 533 5 17
## 3 LSOADZ 1173 189 504 7 36
## 4 LSOADZ 1161 143 547 9 22
## 5 LSOADZ 1024 155 367 8 34
## 6 LSOADZ 1159 176 464 9 32
## selfempnoPT selfempnoFT unemploy LTsickdis NA employed total unemployPC
## 1 28 37 57 42 NA 719 776 7.345
## 2 28 54 47 47 NA 801 848 5.542
## 3 30 75 19 7 NA 841 860 2.209
## 4 29 55 24 22 NA 805 829 2.895
## 5 36 46 20 22 NA 646 666 3.003
## 6 28 85 26 25 NA 794 820 3.171
This provides information on the first few rows of the data.
To map the data you have already worked with you need boundary files (vector polygon data). Area data can be downloaded from the website http://edina.ac.uk/ukborders/, but the area data you need are provided on VITAL > Learning Resources > Practical > NorthWestLSOA.zip.
Download the file, unzip the downloaded data, and place the contents inside your working directory.
If you want to save your workspace, so you can open it up later on, you can use the save.image function.
# This will save all objects currently loaded in the workspace
save.image(file = "nw.RData")
After saving, you can re-load the objects contained in this saved file at any time (e.g. after closing R) using the load command. The .RData file is stored in the working directory defined earlier.
# Remember that you will need to set your working directory before trying to
# load - otherwise R will not know where to find your RData file
load("nw.RData")
You can copy the commands you enter into the R console into a script file and then re-run or modify these commands at a later point. To do this go to File > New File > R Script. Then go to File > Save As and give the new file a sensible name such as script. The file will then be saved as script.R.
Now, as you work through the exercise, you can copy your commands into the script file and then save the script file before ending the session. To run code, you can select code in the script file and then click the Run button.
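You can also run a whole saved script in one go using the source function (a minimal example, assuming the script was saved as script.R in your working directory):
# Run all of the commands contained in the saved script file
source("script.R")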
As the next step, we need to import the North West LSOAs shapefile (vector polygon boundaries) downloaded earlier.
We will need to make use of functions which are not contained in the base R installation.
You need to load packages including 'maptools', 'sp', 'stringr' and 'rgeos'. These packages include functions for reading and manipulating GIS data.
# Choose the CRAN mirror where the packages are going to be downloaded from
options(repos = c(CRAN = "http://www.stats.bris.ac.uk/R/"))
# Download / install and then load sp
install.packages("sp", depend = TRUE, lib = getwd())
library("sp", lib.loc = getwd())
# Download / install and then load maptools
install.packages("maptools", depend = TRUE, lib = getwd())
library(maptools, lib.loc = getwd())
# Download / install and then load stringr
install.packages("stringr", depend = TRUE, lib = getwd())
library("stringr", lib.loc = getwd())
# Download / install and then load rgeos
install.packages("rgeos", depend = TRUE, lib = getwd())
library("rgeos", lib.loc = getwd())
Now we will read the LSOA data into a type of object called a SpatialPolygonsDataFrame.
nwLSOA <- readShapePoly("NorthWestLSOA.shp")
We can see what the SpatialPolygonsDataFrame looks like using the plot() function.
# A basic plot of the LSOAs SpatialPolygonsDataFrame
plot(nwLSOA)
Now that we have a map of LSOAs in the North West of England, we can consider mapping employment data for the areas.
Next we will map the percentage of unemployed people. Firstly, you need to join the employment data to the LSOA areas so that the employment counts and percentages can be mapped:
nwLSOA@data = data.frame(nwLSOA@data, nwemploy[match(nwLSOA@data$geo_code, nwemploy$geocode),
])
Now we'll map the percentage of people who were unemployed in 2011 (unemployPC) and add a legend to the map. At this stage, we need two new libraries - RColorBrewer and classInt.
# Download / install and then load RColorBrewer
install.packages("RColorBrewer", depend = TRUE, lib = getwd())
library("RColorBrewer", lib.loc = getwd())
# Download / install and then load classInt
install.packages("classInt", depend = TRUE, lib = getwd())
library("classInt", lib.loc = getwd())
We select a class interval scheme to display the values, and add a legend.
fj5 <- classIntervals(nwLSOA$unemployPC, n = 5, style = "fisher")
pal <- grey.colors(4, 0.95, 0.55, 2.2)
fj5Colours <- findColours(fj5, pal)
plot(nwLSOA, col = fj5Colours, pch = 19, border = NA)
legend("topleft", fill = attr(fj5Colours, "palette"), legend = names(attr(fj5Colours,
"table")), bty = "n")
Note the argument border = NA within the plot function - this turns off the borders, making the zones easier to see.
To map rates, it is useful to be able to round the values for inclusion in the legend. The first line of the code above can be modified to do this. We can also replace the "topleft" element in the legend call with explicit x and y coordinates to position the legend more precisely (changing the coordinates moves the legend). You will also add a scale bar and a north arrow:
fj5 <- classIntervals(round(nwLSOA$unemployPC, digits = 2), n = 5, style = "fisher")
pal <- grey.colors(4, 0.95, 0.55, 2.2)
fj5Colours <- findColours(fj5, pal)
plot(nwLSOA, col = fj5Colours, pch = 19, border = NA)
legend(x = 210000, y = 477000, fill = attr(fj5Colours, "palette"), legend = names(attr(fj5Colours,
"table")), bty = "n")
# Add a scale bar
SpatialPolygonsRescale(layout.scale.bar(), offset = c(235000, 486000), scale = 40000,
fill = c("white", "black"), plot.grid = F)
text(235000, 481000, "0", cex = 0.6)
text(255000, 481000, "20", cex = 0.6)
text(275000, 481000, "40km", cex = 0.6)
# Add a North arrow
SpatialPolygonsRescale(layout.north.arrow(1), offset = c(252000, 498000), scale = 20000,
plot.grid = F)
The six-figure numbers are eastings and northings; changing them changes the position of the features they refer to. To determine the coordinates of a location, type:
locator()
Click on the plot at the point for which you want locational information, then press the Esc key. The coordinates of the location will then appear in the R console.
The maps should be saved as images (in the plots window in RStudio, go to Export > Save As Image and specify the image format as TIFF).
Note that you can export the nwLSOA file into shapefile format which can then be opened in QGIS or ArcGIS. This is done with:
writeSpatialShape(nwLSOA, "nwlsoa")
This creates the output shapefile 'nwlsoa'.
In the following analysis, we need the centroids of the LSOAs, and this is achieved using the coordinates command.
LSOAXY <- SpatialPointsDataFrame(coordinates(nwLSOA), data = as(nwLSOA, "data.frame")[c("geo_code")])
LSOAXY is then joined to the nwemploy data. We link geo_code (in LSOAXY) to geocode in nwemploy:
LSOAemploy <- merge(nwemploy, LSOAXY, by.x = "geocode", by.y = "geo_code")
head(LSOAemploy)
## geocode ID label type
## 1 E01004766 23905 Bolton 005A Lower Super Output Areas and Data Zones
## 2 E01004767 23906 Bolton 005B Lower Super Output Areas and Data Zones
## 3 E01004768 23907 Bolton 001A Lower Super Output Areas and Data Zones
## 4 E01004769 23908 Bolton 003A Lower Super Output Areas and Data Zones
## 5 E01004770 23909 Bolton 003B Lower Super Output Areas and Data Zones
## 6 E01004771 23910 Bolton 003C Lower Super Output Areas and Data Zones
## typeid alleconact employeePT employeeFT selfempwithPT selfempwithFT
## 1 LSOADZ 1147 178 452 4 20
## 2 LSOADZ 1254 164 533 5 17
## 3 LSOADZ 1173 189 504 7 36
## 4 LSOADZ 1161 143 547 9 22
## 5 LSOADZ 1024 155 367 8 34
## 6 LSOADZ 1159 176 464 9 32
## selfempnoPT selfempnoFT unemploy LTsickdis NA employed total unemployPC
## 1 28 37 57 42 NA 719 776 7.345
## 2 28 54 47 47 NA 801 848 5.542
## 3 30 75 19 7 NA 841 860 2.209
## 4 29 55 24 22 NA 805 829 2.895
## 5 36 46 20 22 NA 646 666 3.003
## 6 28 85 26 25 NA 794 820 3.171
## coords.x1 coords.x2
## 1 371009 411545
## 2 371874 411498
## 3 370103 413305
## 4 371920 412767
## 5 372191 412270
## 6 371347 412785
One useful way of exploring the spatial patterning in data values (here, unemployment percentages) is to compute a statistic called the Moran's I spatial autocorrelation coefficient. Positive values of I indicate that similar values cluster together, negative values of I indicate that neighbouring values tend to be dissimilar, and values close to zero indicate an absence of spatial autocorrelation (a 'random' spatial pattern).
First, we need to install and load the package 'spdep':
# Download / install and then load spdep
install.packages("spdep", depend = TRUE, lib = getwd())
library("spdep", lib.loc = getwd())
Next we need to create a list indicating which LSOAs share boundaries with which other LSOAs. This is done using the command 'poly2nb'. The result must then be converted into a weights file using 'nb2listw'. The weights file represents the spatial relationships between LSOAs: the weights for each set of neighbours sum to one, so if an LSOA borders five other LSOAs then each neighbouring LSOA is given a weight of 1/5 = 0.2. With this weights file we can measure the average difference between unemployment percentages at each LSOA and at the neighbouring LSOAs. A value of I close to one would indicate that the unemployment percentage values at each LSOA tend to be similar to those at other LSOAs with which they share a border - thus, the values cluster.
LSOA.nb <- poly2nb(nwLSOA)
LSOA.wt <- nb2listw(LSOA.nb)
LSOA.wt
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 4497
## Number of nonzero links: 26348
## Percentage nonzero weights: 0.1303
## Average number of links: 5.859
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 4497 20223009 4497 1626 18677
# Summarise the weights values
summary(unlist(LSOA.wt$weights))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0455 0.1430 0.1670 0.1710 0.2000 1.0000
Using the new weights file we can now compute I:
moransUnemploy <- moran.test(nwLSOA$unemployPC, LSOA.wt)
moransUnemploy
##
## Moran's I test under randomisation
##
## data: nwLSOA$unemployPC
## weights: LSOA.wt
##
## Moran I statistic standard deviate = 64.8, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 5.805e-01 -2.224e-04 8.029e-05
The value of 0.58 indicates a moderate-to-strong level of spatial clustering - that is, neighbouring unemployment percentages tend to be quite similar on average. The p-value of less than 2.2e-16 indicates that the clustering is highly significant (that is, it is very unlikely to have occurred by chance) - p-values closer to zero indicate a higher significance level.
You should take note of the outputs from the analysis; the key information is the value of I and the significance level.
Values such as 2.2e-16 may not mean much to you. This is scientific notation and is a convenient way of displaying very small or very large values. In the case of a value such as 1.419e-09, the decimal point is moved 9 positions to the left, so 1.419e-09 = 0.000000001419.
The previous analysis uses contiguity (shared boundaries between zones). For this next stage, you will compute distances between centroids and take their inverse (so small distances take large values). These inverse distance weights will be row-standardised (each weight is divided by its row sum). The exponent in the example is 2 (the '2' in 1/(LSOA.dists^2)) and a cutoff of 20 km is specified. The weights corresponding to within-area distances (the distance from each centroid to itself, i.e. the diagonal of the matrix) are set to zero. Both the exponent and the cutoff can be varied to assess the impact of changing spatial scales on I.
LSOA.dists <- as.matrix(dist(cbind(LSOAXY$coords.x1, LSOAXY$coords.x2)))
LSOA.dists.inv <- 1/(LSOA.dists^2)
LSOA.dists.inv[LSOA.dists > 20000] <- 0
diag(LSOA.dists.inv) <- 0
LSOA.dists.invR <- LSOA.dists.inv/rowSums(LSOA.dists.inv)
LSOA.dists.invR[1:5, 1:5]
## 0 1 2 3 4
## 0 0 0 0 0 0
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
The weights matrix is then converted into an object which can be used by the Moran's I tool.
# Convert the weights matrix into a weights list object
LSOA.dists.inv.w <- mat2listw(LSOA.dists.invR)
# Compute Moran's I
moransUnemployID <- moran.test(nwLSOA$unemployPC, LSOA.dists.inv.w)
moransUnemployID
##
## Moran's I test under randomisation
##
## data: nwLSOA$unemployPC
## weights: LSOA.dists.inv.w
##
## Moran I statistic standard deviate = 78.18, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 3.549e-01 -2.224e-04 2.064e-05
Try re-running Moran's I for a range of exponents (e.g., change 1/(LSOA.dists^2) to 1/(LSOA.dists^1.5)). A large exponent increases the proportional weight given to closer observations - thus it emphasises the nearest neighbours. I would therefore be expected to be larger (more strongly positive) over small spatial scales than over larger spatial scales. Note the value of I and the significance level for each exponent.
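For example, a minimal sketch for an exponent of 1.5 is given below (the object names with the '15' suffix are illustrative; the cutoff and remaining steps are as above):
# Rebuild the inverse distance weights using an exponent of 1.5
LSOA.dists.inv15 <- 1/(LSOA.dists^1.5)
LSOA.dists.inv15[LSOA.dists > 20000] <- 0
diag(LSOA.dists.inv15) <- 0
LSOA.dists.invR15 <- LSOA.dists.inv15/rowSums(LSOA.dists.inv15)
# Convert to a weights list object and recompute Moran's I
LSOA.dists.inv15.w <- mat2listw(LSOA.dists.invR15)
moran.test(nwLSOA$unemployPC, LSOA.dists.inv15.w)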
You can compute a Moran scatterplot (standardised data value versus weighted average of neighbouring values) easily:
# Compute a Moran scatterplot
moran.plot(nwLSOA$unemployPC, LSOA.dists.inv.w)
This can obviously be derived for different weighting functions.
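For example, a sketch using the contiguity-based weights (LSOA.wt) created earlier:
# Moran scatterplot using the shared-boundary (contiguity) weights
moran.plot(nwLSOA$unemployPC, LSOA.wt)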
The analysis so far has focused on using global Moran's I - there is a single number result for the whole study area. In this part of the exercise, you will compute a local version of Moran's I. This provides a measure of spatial association for each zone - as in global I, a positive value indicates that values neighbouring a given zone tend to be similar to the value in the zone while a negative value suggests that neighbouring values are dissimilar.
You will compute local Moran's I using the inverse distance weights.
# Compute local Moran's I
lmi <- as.data.frame(localmoran(nwLSOA$unemployPC, LSOA.dists.inv.w))
# Don't worry about the warning message!
LSOAloci <- SpatialPolygonsDataFrame(nwLSOA, lmi, match.ID = FALSE)
# Map the local Moran's I values
fj5 <- classIntervals(round(LSOAloci$Ii, digits = 4), n = 5, style = "fisher")
pal <- grey.colors(4, 0.95, 0.55, 2.2)
fj5Colours <- findColours(fj5, pal)
plot(LSOAloci, col = fj5Colours, pch = 19, border = NA)
legend("bottomright", fill = attr(fj5Colours, "palette"), legend = names(attr(fj5Colours,
"table")), bty = "n")
Recall that positive values indicate similarity between neighbours while negative values indicate dissimilarity.
Save the map as an image following the same steps as before.
The zip file you downloaded earlier also contains the boundary of the North West of England. This will be added to a map in the next stage.
nw <- readShapePoly("NorthWest.shp")
It is possible to create maps of significant clusters using a few steps (the new 'nw' file is used to provide an outline for the map):
# Create a local I Cluster Map
quadrant <- vector(mode = "numeric", length = nrow(LSOAloci))
unempMean <- nwLSOA$unemployPC - mean(nwLSOA$unemployPC)
# centres the variable of interest around its mean
C_mI <- LSOAloci$Ii - mean(LSOAloci$Ii)
# centres the local Moran's around the mean
signif <- 0.1
# set a statistical significance level for the local Moran's
quadrant[unempMean > 0 & C_mI > 0] <- 4
# these lines define the high-high, low-low, low-high and high-low
# categories
quadrant[unempMean < 0 & C_mI < 0] <- 1
quadrant[unempMean < 0 & C_mI > 0] <- 2
quadrant[unempMean > 0 & C_mI < 0] <- 3
quadrant[lmi[, 5] > signif] <- 0
# places non-significant Moran's in the category '0'
brks <- c(0, 1, 2, 3, 4)
colors <- c("white", "blue", rgb(0, 0, 1, alpha = 0.4), rgb(1, 0, 0, alpha = 0.4),
"red")
plot(nwLSOA, border = NA, col = colors[findInterval(quadrant, brks, all.inside = FALSE)])
plot(nw, add = TRUE)
box()
legend("bottomright", legend = c("insignificant", "low-low", "low-high", "high-low",
"high-high"), fill = colors, bty = "n", cex = 0.7, y.intersp = 1, x.intersp = 1)
title("Local I Cluster Map")
So, low-low indicates small values surrounded by small values, high-high is large values surrounded by large values, low-high is small values surrounded by large values and high-low is large values surrounded by small values.
So far, you have examined the association between data values attached to zones and values of the same property in neighbouring zones. Next, you will assess how different variables are related. The most straightforward approach is to use ordinary least squares (OLS) regression.
OLS is used here to explore how the percentage of unemployed persons in an area relates to the percentage of people who are long-term sick or disabled. To explore this relationship, you will first need to compute the percentage of people who were long-term sick or disabled - that is, the number of long-term sick or disabled persons divided by the total persons by economic activity, multiplied by 100. The new percentage should be called lsdpc and it should be added to the nwLSOA object (thus, nwLSOA$lsdpc).
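A minimal sketch of this calculation is given below; it assumes that the alleconact column (all persons by economic activity) is the appropriate denominator:
# Percentage long-term sick or disabled (assuming alleconact as the denominator)
nwLSOA$lsdpc <- (nwLSOA$LTsickdis/nwLSOA$alleconact) * 100
With lsdpc in place, the OLS model can be fitted: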
ols_UnempPSD <- lm(unemployPC ~ lsdpc, data = nwLSOA)
summary(ols_UnempPSD)
##
## Call:
## lm(formula = unemployPC ~ lsdpc, data = nwLSOA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.019 -1.345 -0.229 0.997 18.305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.16510 0.06771 17.2 <2e-16 ***
## lsdpc 1.14132 0.00997 114.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.48 on 4495 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.744
## F-statistic: 1.31e+04 on 1 and 4495 DF, p-value: <2e-16
The estimate value (parameter estimate) for lsdpc is the key figure here. It is positive and so indicates a positive relationship between unemployment and long-term sickness or disability (where values of one are high, so (on average) are values of the other). The *** entry at the end of the column for lsdpc indicates that the results are statistically significant.
The OLS results provide a 'global' summary of the relationship between the two variables. Geographically weighted regression can be used to explore how relationships differ between regions.
There are several R packages for GWR. These include spgwr, GWmodel and gwrr: http://cran.r-project.org/web/packages/spgwr/index.html, http://cran.r-project.org/web/packages/GWmodel/index.html, http://cran.r-project.org/web/packages/gwrr/index.html
In this exercise, you will use spgwr. To use GWR, it is necessary to select a spatial weighting scheme (a kernel) which defines how much weight is given to each neighbour of each zone. The size of a kernel is indicated by its bandwidth - a small bandwidth means that close neighbours are given large weights while more distant neighbours are given much smaller weights. With a larger bandwidth, the weights are more 'spread out' and there is not a marked decline in weights with increased distance.
The bandwidth can be selected manually, or some form of optimisation approach can be used to select one. One possible approach is cross validation. This entails removing each observation in turn and estimating its value using the remaining data. The differences between the true observed values and the estimated values are then computed. This procedure is completed for multiple bandwidths and the bandwidth which corresponds to the smallest overall error is considered optimal. The bandwidth identified in this way is then used for fitting the GWR model.
library("spgwr")
## NOTE: This package does not constitute approval of GWR
## as a method of spatial analysis; see example(gwr)
# Identify the optimal bandwidth using cross validation; the fixed bandwidth
# Gaussian function is used
bwG <- gwr.sel(unemployPC ~ lsdpc, data = nwLSOA, gweight = gwr.Gauss, verbose = FALSE)
# View the selected bandwidth
bwG
## [1] 2486
# Fit the GWR model using the bandwidth identified in the previous step
gwrG <- gwr(unemployPC ~ lsdpc, data = nwLSOA, bandwidth = bwG, gweight = gwr.Gauss)
# Summarise the GWR outputs
summary(gwrG$SDF)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 293942 406087
## y 338994 588517
## Is projected: NA
## proj4string : [NA]
## Data attributes:
## sum.w X.Intercept. lsdpc gwr.e
## Min. : 1.0 Min. :-17.115 Min. :-1.670 Min. :-10.234
## 1st Qu.: 26.2 1st Qu.: 0.791 1st Qu.: 0.959 1st Qu.: -1.163
## Median : 43.5 Median : 1.513 Median : 1.073 Median : -0.169
## Mean : 48.1 Mean : 1.512 Mean : 1.065 Mean : -0.072
## 3rd Qu.: 67.8 3rd Qu.: 2.038 3rd Qu.: 1.172 3rd Qu.: 0.817
## Max. :121.1 Max. : 6.328 Max. : 7.708 Max. : 15.569
## pred localR2
## Min. : 1.22 Min. :0.106
## 1st Qu.: 4.19 1st Qu.:0.705
## Median : 6.51 Median :0.758
## Mean : 7.72 Mean :0.747
## 3rd Qu.:10.51 3rd Qu.:0.809
## Max. :29.49 Max. :1.000
So, the bandwidth identified by gwr.sel is stored in bwG and this is then passed to the bandwidth argument of gwr. You could instead manually specify a bandwidth (e.g., bandwidth = 20000) and avoid the use of gwr.sel.
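For example, a minimal sketch with a manually specified 20 km fixed bandwidth (the object name gwrG20 is illustrative):
# Fit the GWR model with a manually chosen fixed bandwidth of 20 km
gwrG20 <- gwr(unemployPC ~ lsdpc, data = nwLSOA, bandwidth = 20000, gweight = gwr.Gauss)
summary(gwrG20$SDF)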
With GWR, regression beta coefficients are provided for every input location and these can be mapped so that, for example, the local relationships between unemployment and long term illness and disability can be explored.
# Now map the coefficients (the local slope values).
spplot(gwrG$SDF, "lsdpc", col = "transparent")
# and the local r square values
spplot(gwrG$SDF, "localR2", col = "transparent")
The argument col = "transparent" prevents the zone borders being shown as distinct solid lines. Save the maps as images.
Recall that r square indicates the goodness of fit. So, mapped values close to one indicate locations where the model is a good fit.
The weighting scheme (gweight) can be changed. gwr.Gauss, as used above with a fixed bandwidth, is a fixed kernel (weights decline to almost zero beyond a fixed distance). Instead you could use gwr.bisquare with an adaptive bandwidth - the kernel is then small where the density of observations is high and large where the density of observations is low. So, in this case, it will be small in urban areas (where there are more LSOAs over a small area) and large in rural areas (where LSOAs are less dense). To clarify, for the fixed kernel a distance is specified, while for the adaptive kernel it is the number of nearest neighbours that is specified. As an example, if 25 nearest neighbours were specified, the distance from an urban LSOA to its 25th nearest neighbour will usually be much smaller than the distance from a rural LSOA to its 25th nearest neighbour. The weight is zero for the 26th nearest neighbour and beyond, and the weights are spread out further in the rural case than in the urban case. In R, the number of nearest neighbours is expressed as a fraction of the total data (e.g., in this case, a bandwidth of 0.001104861 = about 4 of 4497 data points, while 1 would represent a neighbourhood containing all other observations).
# Identify the optimal bandwidth using cross validation; an adaptive
# bandwidth bi-square function is used
ad.bw <- gwr.sel(unemployPC ~ lsdpc, data = nwLSOA, adapt = TRUE)
## Adaptive q: 0.382 CV score: 27383
## Adaptive q: 0.618 CV score: 27517
## Adaptive q: 0.2361 CV score: 27072
## Adaptive q: 0.1459 CV score: 26657
## Adaptive q: 0.09017 CV score: 26191
## Adaptive q: 0.05573 CV score: 25605
## Adaptive q: 0.03444 CV score: 24752
## Adaptive q: 0.02129 CV score: 23592
## Adaptive q: 0.01316 CV score: 22380
## Adaptive q: 0.008131 CV score: 21422
## Adaptive q: 0.005025 CV score: 20640
## Adaptive q: 0.003106 CV score: 20020
## Adaptive q: 0.001919 CV score: 19563
## Adaptive q: 0.001186 CV score: 19312
## Adaptive q: 0.0007331 CV score: 19620
## Adaptive q: 0.001354 CV score: 19370
## Adaptive q: 0.001013 CV score: 19338
## Adaptive q: 0.001146 CV score: 19304
## Adaptive q: 0.001105 CV score: 19301
## Adaptive q: 0.001064 CV score: 19313
## Adaptive q: 0.001105 CV score: 19301
# View the selected bandwidth
ad.bw
## [1] 0.001105
The figure of 0.001104861 (about 4 of 4497 data points) is very small and it would be more sensible to use a larger value - you should try 0.02 (about 90 of 4497 data points):
# Fit the GWR model using the bandwidth identified in the previous step
gwrAD <- gwr(unemployPC ~ lsdpc, data = nwLSOA, adapt = 0.02, gweight = gwr.bisquare)
# Summarise the GWR outputs
summary(gwrAD$SDF)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 293942 406087
## y 338994 588517
## Is projected: NA
## proj4string : [NA]
## Data attributes:
## sum.w X.Intercept. lsdpc gwr.e
## Min. :11.0 Min. :-1.604 Min. :0.440 Min. :-10.458
## 1st Qu.:29.4 1st Qu.: 0.746 1st Qu.:0.933 1st Qu.: -1.130
## Median :32.9 Median : 1.369 Median :1.050 Median : -0.148
## Mean :34.9 Mean : 1.535 Mean :1.066 Mean : -0.081
## 3rd Qu.:38.6 3rd Qu.: 1.997 3rd Qu.:1.195 3rd Qu.: 0.858
## Max. :72.2 Max. :10.461 Max. :1.605 Max. : 13.775
## pred localR2
## Min. : 1.42 Min. :-0.0245
## 1st Qu.: 4.15 1st Qu.: 0.7073
## Median : 6.43 Median : 0.7603
## Mean : 7.73 Mean : 0.7485
## 3rd Qu.:10.53 3rd Qu.: 0.8170
## Max. :31.04 Max. : 0.9263
# Now map the coefficients (the local slope values).
spplot(gwrAD$SDF, "lsdpc", col = "transparent")
# and the local r square values
spplot(gwrAD$SDF, "localR2", col = "transparent")
Alternative bandwidth values for both approaches can be entered manually (so, as indicated above, there is then no need to use gwr.sel) and you can assess how the results change as the size of the bandwidths is increased or decreased.
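For example, a minimal sketch with a larger adaptive bandwidth (the value of 0.05, roughly 225 of the 4497 LSOAs, and the object name gwrAD05 are illustrative):
# Fit the adaptive GWR model with a larger neighbourhood fraction
gwrAD05 <- gwr(unemployPC ~ lsdpc, data = nwLSOA, adapt = 0.05, gweight = gwr.bisquare)
# Map the resulting local slope values
spplot(gwrAD05$SDF, "lsdpc", col = "transparent")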
Q1. Give the Moran's I values for each of the variables included in your analysis. What do these values suggest to you about clustering (or spatial concentration) of the population of the North West? Provide an interpretation of your findings. You may also wish to make use of Moran scatterplots and/or different weighting schemes.
Q2. Describe the local Moran's I values for unemployment (%) and long-term sick or disabled (%) (and any other variables you choose to add) and interpret your findings. How does the map of local I values relate to the unemployment (%) and long-term sick or disabled (%) maps? Again, you may wish to use different weighting schemes.
Q3. Describe your GWR maps (local slope and local r square) for unemployment (%) against long-term sick or disabled (%) (and any other variables you choose to add) and interpret your findings. Does GWR add information over and above that provided by the global regression model? How do results vary as the weighting schemes are changed?
You may wish to expand your analysis to include other variables - these can be downloaded from Infuse and they can be linked to your existing data by reading them into R (as for the employment data), relabelling the columns (again, as for the employment data), and then using the merge command to join the files together. E.g.:
mergefile <- merge(nwemploy, newfile, by.x = "geocode", by.y = "geocode")
and then use the match command as before to link the new mergefile to the LSOAs. You could, for example, explore how housing tenure and employment together are related to long-term sickness and disability. For example, if you computed the percentage of social rented households and named it socrenthhpc, you could use OLS regression to explore this relationship:
ols_combined <- lm(lsdpc ~ unemployPC + socrenthhpc, data = nwLSOA)
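The 'match' step mentioned above could look like the following sketch (it assumes mergefile contains a geocode column along with the new socrenthhpc percentage):
# Join the merged table to the LSOA polygons, as before
nwLSOA@data <- data.frame(nwLSOA@data, mergefile[match(nwLSOA@data$geo_code, mergefile$geocode), ])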
You should provide references to other studies or technical accounts to help support your discussion.
Specific guidance on what is provided for each question is not provided - it is up to you to think of ways of building on the basic analysis.
Your document can be submitted in any format you like - including Rmd/html or Word. But, it is crucial that you include any R code you wrote/adapted to complete your analyses.
The various analyses that you have prepared during this tutorial will be of assistance when compiling your report. You may have to piece together different parts of the code to achieve this; however, examples of the code you need to complete your report are contained in this document. You may also wish to make use of the knowledge developed in previous practicals.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.