This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.

The thinking behind this tutorial

These practicals are designed to have an explanatory text, together with code examples. Note that all code examples have a light yellow background, and are boxed. Output from R is also shown, and text output is also boxed. Graphical output is shown ‘in line’. The idea is to copy the text in the boxes into a running version of R - you can use the output to see whether you are on the right track. Also, don’t be afraid to experiment - it may not always work, but, for example, playing around with some of the function parameters changes some aspects of the analysis, or of the graphical presentation. Also, it is generally assumed that the code you type in from the boxes is actually entered in the same order as it appears here. Some of the later boxes depend on what was done earlier, and so skipping ahead might lead to errors. The help facility (putting a question mark in front of a command, for example ?plot, and hitting return) is also a good way to discover things…

To get started with this tutorial, first load the GWmodel package and some other helper packages:

library(GWmodel)
library(raster) 
library(rgdal)
library(rgeos)
library(RColorBrewer)
library(grDevices)

Review of Summary Statistics

Summary statistics are basic statistics used to summarise a large data set. For example, when looking at a data set of, say, 10000 house prices, you might wish to calculate the mean of these, to obtain an idea of what a typical house price might be. Similarly, you may use a standard deviation to see the extent to which house prices spread around this mean. Finally (and perhaps a bit more obscure) you can use the skewness to measure the symmetry of the distribution (i.e. is there a long upper or lower tail to the distribution, or are values fairly evenly distributed around the mean?). Generally these are useful – although not comprehensive – techniques for what is sometimes called ‘data reduction’ (Ehrenberg 1981). They are useful in that, with a small number of quantities, it is possible to summarise not only typical values, but also distributional properties of variables of interest in very large data sets.

Geographically weighted summary statistics are similar, but they apply summary statistics using a moving window, so that the above characteristics can be mapped as you move through different geographical regions in a data set. Thus, you can see whether the mean house price in Dublin is different to that in Maynooth (OK, so you probably know it is if you live near those places, but what about Barnet and Edgware, on the outskirts of London, for example?). Although mean levels are most often considered in this way, it is also possible to use moving windows to estimate geographical variations in the standard deviation - to see whether house prices are more variable in some places than others - or in the local skewness - to see whether the lop-sidedness of the house price distribution changes from place to place.
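
To fix ideas, these (non-geographical) summary statistics are easy to compute in base R. The sketch below uses a simulated variable x purely for illustration - the name and the values are not part of the data used later in this practical:

# A simulated, positively skewed 'house price'-like variable, for illustration only
set.seed(123)
x <- rlnorm(1000, meanlog = 12, sdlog = 0.4)
mean(x)                                        # overall level
sd(x)                                          # spread
sum((x - mean(x))^3) / (length(x) * sd(x)^3)   # moment-based skewness (using the sample sd)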

Also, correlation is a useful bivariate summary statistic, as it measures the degree to which two variables are associated. The most commonly used measure of this is the Pearson correlation coefficient. Again, the idea with a geographically weighted correlation is to use a moving window approach to see whether this degree of association varies geographically - for example, in some places floor area may be strongly correlated with house price, but in others less so, if, say, being of historical or cultural interest makes a smaller house more valuable than it would otherwise be.
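
Again as a non-geographical reference point, the Pearson correlation between two variables can be computed with R’s cor function. The floor.area and price variables below are simulated purely for illustration:

# Simulated floor area and price, for illustration only
set.seed(456)
floor.area <- runif(1000, 40, 200)
price <- 2000 * floor.area + rnorm(1000, sd = 40000)
cor(floor.area, price)   # Pearson correlation coefficient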

The mean, standard deviation, skewness and correlation coefficient all have robust equivalents: the median, the interquartile range, the quantile imbalance and Spearman’s rank correlation, respectively. These are robust in the sense that they are based on the sorted order of values. For the univariate summary statistics, if we sort a variable in ascending order, let the halfway point be \(Q_2\), the first quarter point be \(Q_1\) and the third quarter point be \(Q_3\). Then the median is \(Q_2\), the interquartile range is \(Q_3 - Q_1\) and the quantile imbalance is \[ \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} \] The last one may be less familiar, but it basically measures the difference between the gap from the first quartile to the median and the gap from the median to the third quartile - leading to a measure of lop-sidedness. These measures are seen as robust because one or two very high or very low values don’t ‘throw’ the summary statistics, since they won’t alter the values of \(Q_1\), \(Q_2\) or \(Q_3\). Again, these can be geographically weighted - see Brunsdon, Fotheringham, and Charlton (2002).
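
As a sketch of how these robust statistics are obtained from the quartiles, they can be computed via R’s quantile function - here applied to the illustrative variable x created earlier:

# Robust summaries based on the quartiles of x
Q <- quantile(x, probs = c(0.25, 0.5, 0.75))
Q[2]                                       # median (Q2)
Q[3] - Q[1]                                # interquartile range (Q3 - Q1)
(Q[3] - 2 * Q[2] + Q[1]) / (Q[3] - Q[1])   # quantile imbalance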

Finally, Spearman’s rank correlation coefficient is a robust version of the Pearson coefficient. To compute this, each value of each variable is replaced by its rank - the smallest value of the first variable is replaced by 1, the next smallest by 2 and so on. The same is then done for the second variable, and the Pearson correlation coefficient is then computed for these rank-based variables.
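
This rank-based definition can be checked directly in R - cor with method = "spearman" should give the same value as the Pearson coefficient computed on the ranks (using the illustrative variables above):

# Spearman correlation, computed two ways - the values should agree
cor(floor.area, price, method = "spearman")
cor(rank(floor.area), rank(price))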

The summary statistics that are geographically weighted are listed below:

Statistic              What it Measures   Robust?   Bivariate or Univariate
Mean                   Overall Level      No        Univariate
Standard Deviation     Spread             No        Univariate
Skewness               Asymmetry          No        Univariate
Median                 Overall Level      Yes       Univariate
Interquartile Range    Spread             Yes       Univariate
Quantile Imbalance     Asymmetry          Yes       Univariate
Pearson Correlation    Association        No        Bivariate
Spearman Correlation   Association        Yes       Bivariate

The data

In this practical – and some of the others – you will be working with some of Myroslava’s Human Impact data. Thanks to Myroslava for preparing this!

tr.data <- read.csv("points.csv")
head(tr.data)
  ID       F2      F3 humanimpact lc_code fieldsize
1  2 -4.03750 37.5042           5       1         0
2  6 13.02920 56.6125         100       4         3
3  8 27.31250 53.2042          90       5         4
4 10 18.73750 46.0792         100       4         4
5 42 -6.79583 40.6042          75       5         2
6 56 11.76250 44.0958          15       5         3

From this, it can be seen that the longitude and latitude are recorded (columns F2 and F3) along with some other information including the human impact score humanimpact. Here, it is useful to create a SpatialPointsDataFrame from the human impact data. These are R objects that are quite similar to shapefiles in ArcGIS - they link a list of spatial objects (points, lines or polygons) to a data frame, so that each point (or line or polygon) has a number of attributes associated with it. The attributes are listed in a data frame, and each row of the data frame is linked to one of the spatial objects. This is done using the SpatialPointsDataFrame function, which takes a two-column matrix (or data frame) of coordinates as its first argument - to specify the locations of the points - and a data frame as its second argument - to specify the attributes for each point. Below, a SpatialPointsDataFrame containing the human impact information is created.

impact <- SpatialPointsDataFrame(tr.data[,2:3],data.frame(humanimpact=tr.data[,4]))
proj4string(impact) <- CRS("+init=epsg:4326")
head(impact)
  humanimpact
1           5
2         100
3          90
4         100
5          75
6          15

The locations of the GeoWiki points can be seen by plotting:

plot(impact,pch=16,cex=0.8)
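
As a preview of how the geographically weighted versions of these statistics can be obtained, the sketch below applies GWmodel’s gwss function to the impact object. The choice of an adaptive bandwidth of 50 nearest neighbours is arbitrary here - it is just something to experiment with, not a recommendation:

# Geographically weighted summary statistics of humanimpact
# (adaptive bandwidth of 50 nearest neighbours chosen arbitrarily)
gw.ss <- gwss(impact, vars = "humanimpact", bw = 50, kernel = "bisquare",
              adaptive = TRUE, longlat = TRUE, quantile = TRUE)
head(gw.ss$SDF@data)   # local means, SDs, skewness, medians, IQRs, etc.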