Preamble

To get started with this tutorial, first load the GWmodel package and some other helper packages:

library(GWmodel)
library(raster) 
library(rgdal)
library(rgeos)
library(RColorBrewer)
library(grDevices)
library(tmap)

Review of Summary Statistics

Summary statistics are basic statistics used to summarise a large data set. For example, when looking at a data set of, say, 10000 house prices, you might wish to calculate the mean of these, to obtain an idea of what a typical house price might be. Similarly, you may use the standard deviation to see the extent to which house prices spread around this mean. Finally (and perhaps a bit more obscure) you can use the skewness to measure the symmetry of the distribution (i.e. is there a long upper or lower tail, or are values fairly evenly distributed around the mean?). Generally these are useful – although not comprehensive – techniques for what is sometimes called ‘data reduction’ [@Ehr81]. They are useful in that, with a small number of quantities, it is possible to summarise not only typical values but also distributional properties of variables of interest in very large data sets.
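
As a quick illustration, the following base-R sketch computes these three statistics for some simulated house prices (the simulated data and the moment-based skewness formula are illustrative assumptions, not part of the GWmodel workflow):

x <- rlnorm(10000, meanlog=12, sdlog=0.5)   # simulated house prices
mean(x)                                     # typical level
sd(x)                                       # spread about the mean
mean((x - mean(x))^3) / sd(x)^3             # moment-based skewness (approximate)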

Geographically weighted summary statistics are similar, but they apply summary statistics within a moving window, so that the above characteristics can be mapped as the window moves through different geographical regions of a data set. Thus, you can see whether the mean house price in Beijing is different to that in Xi'an. Although mean levels are most often considered in this way, it is also possible to use moving windows to estimate geographical variations in the standard deviation - to see whether house prices are more variable in some places than others - or in the local skewness - to see whether the lop-sidedness of the house price distribution changes from place to place.

Also, correlation is a useful bivariate summary statistic, as it measures the degree to which two variables are associated. The most commonly used measure of this is the Pearson Correlation Coefficient. Again, the idea with a geographically weighted correlation is to use a moving window approach to see whether this degree of association varies geographically - for example in some places house size may be strongly correlated with house price, but in others less so, if, say, being of historical or cultural interest might make a smaller house more valuable than it would otherwise be.

Finally, the mean, standard deviation, skewness and correlation coefficient all have robust equivalents - the median, interquartile range, quantile imbalance and Spearman’s correlation coefficient. These are robust in the sense that they are based on the sorted order of values. For the univariate summary statistics, sort a variable in ascending order and let the halfway point be \(Q_2\), the first quarter point \(Q_1\) and the third quarter point \(Q_3\). Then the median is \(Q_2\), the interquartile range is \(Q_3 - Q_1\) and the quantile imbalance is \[ \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} \] The last one may be less familiar, but it essentially compares the difference between the first quartile and the median with the difference between the median and the third quartile - leading to a measure of lop-sidedness. The measures are seen as robust because one or two very high or very low values don’t ‘throw’ the summary statistics, since they won’t alter the values of \(Q_1\), \(Q_2\) or \(Q_3\). Again, these can be geographically weighted - see @Brun02.
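
Continuing with the simulated x from the sketch above, these robust statistics can be computed from the quartiles using base R's quantile function:

q <- quantile(x, probs=c(0.25, 0.5, 0.75))   # Q1, Q2 (median), Q3
q[2]                                         # median
q[3] - q[1]                                  # interquartile range
(q[3] - 2*q[2] + q[1]) / (q[3] - q[1])       # quantile imbalance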

Spearman’s rank correlation coefficient is a robust version of the Pearson correlation coefficient. To compute this, each value of each variable is replaced by its rank - the smallest value of the first variable is replaced by 1, the next smallest by 2 and so on. The same is then done for the second variable, and then the Pearson correlation coefficient is computed for these rank-based variables.
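
A short sketch confirming this equivalence, again using the simulated x from above (y is an arbitrary simulated companion variable):

y <- x + rnorm(length(x), sd=sd(x))   # simulated second variable
cor(x, y, method='spearman')          # Spearman's rank correlation
cor(rank(x), rank(y))                 # identical: Pearson on the ranks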

The summary statistics that are currently geographically weighted in GWmodel are listed below:

Statistic              What it Measures   Robust?   Bivariate or Univariate?
Mean                   Overall Level      No        Univariate
Standard Deviation     Spread             No        Univariate
Skewness               Asymmetry          No        Univariate
Median                 Overall Level      Yes       Univariate
Interquartile Range    Spread             Yes       Univariate
Quantile Imbalance     Asymmetry          Yes       Univariate
Pearson Correlation    Association        No        Bivariate
Spearman Correlation   Association        Yes       Bivariate

The data

In this practical you will be working with the IIASA GeoWiki Human Impact data.

tr.data <- read.csv("points.csv")
head(tr.data)
  ID       F2      F3 humanimpact lc_code fieldsize
1  2 -4.03750 37.5042           5       1         0
2  6 13.02920 56.6125         100       4         3
3  8 27.31250 53.2042          90       5         4
4 10 18.73750 46.0792         100       4         4
5 42 -6.79583 40.6042          75       5         2
6 56 11.76250 44.0958          15       5         3

From this, it can be seen that the longitude and latitude are recorded (columns F2 and F3) along with some other information, including the human impact score humanimpact. Here, it is useful to create a SpatialPointsDataFrame from the data. These are R objects quite similar to shapefiles in ArcGIS - they link a list of spatial objects (points, lines or polygons) to a data frame, so that each point (or line or polygon) has a number of attributes associated with it. The attributes are listed in a data frame, and each row of the data frame is linked to one of the spatial objects. This is done using the SpatialPointsDataFrame function, which takes a two-column matrix (or data frame) of coordinates as its first argument, to specify the locations of the points, and a data frame as its second argument, to specify the attributes for each point. It will be useful to create a SpatialPointsDataFrame containing the human impact information.

impact <- SpatialPointsDataFrame(tr.data[,2:3], 
                                 data.frame(humanimpact=tr.data[,4]), 
                                 proj4string=CRS("+init=epsg:4326"))
head(impact)
  humanimpact
1           5
2         100
3          90
4         100
5          75
6          15

The location of the GeoWiki points can be seen by plotting:

plot(impact,pch=16,cex=0.8)

We can also show these on a backdrop of European and surrounding country borders - actually a SpatialPolygonsDataFrame object that can be read in as a data item from the tmap package. This is a neat way to create maps in R. tmap works similarly to ggplot2 - essentially maps are built up from a series of map elements, which are either specifications of shapefiles, directives to map variables in the shapefiles to colours, sizes or other attributes, directives to modify overall map styles, or decorations (compasses and so on). The following code accesses the Europe SpatialPolygonsDataFrame (supplied as a data item by tmap), sets up a quick thematic map (qtm) as a base, adds a new shape object via tm_shape (the SpatialPointsDataFrame called impact), adds a layer of dots from this shape, and finally adds a compass:

data(Europe)
qtm(Europe) + tm_shape(impact) + tm_dots() + tm_compass()

The overall distribution of impact scores can be seen as a histogram:

hist(impact$humanimpact,col='navy')

or a box-plot:

boxplot(impact$humanimpact,col='navy',horizontal=TRUE)

Perhaps here a box-plot is unhelpful - the histogram shows that the distribution is bimodal – effectively it is U-shaped, with higher concentrations at impact scores 0 and 100. Box-plots are mostly designed to pick out features in distributions with a centrally placed mode, possibly with imbalanced tails and some outliers. A lesson here is that even in exploratory analysis, care must be taken to choose an appropriate visualisation or data summary tool. A better visualisation here might involve splitting the data into ‘below 40’ and ‘above 40’ subsets and box-plotting these separately.

impact$hi_or_low <- ifelse(impact$humanimpact < 40,'low','high')
boxplot(humanimpact~hi_or_low,data=impact,col='indianred',horizontal=TRUE)

This illustrates the utility of a programmable data analysis and visualisation tool such as R - the visualisation is something of a one-off, but is useful here. R gives the flexibility to do this, whereas most GUI-based systems would not.

We can also look at the spatial pattern in the human impact variable. This is done by telling the tm_dots layer specification to map the colour of the dots on to one of the variables in impact - in this case the humanimpact variable. To start with we will use the default colour mapping, which just maps the range of values onto a yellow to red range. The dots are also set to be slightly larger than the default size:

qtm(Europe) + tm_shape(impact) + 
  tm_dots(col='humanimpact',title='Human Impact',size=0.06) + 
  tm_compass()

Exploring Spatial Pattern with Geographically Weighted Summary Statistics

So far we have a reasonable impression of the human impact process - but at this point only the raw data has been investigated. To better identify local trends, geographically weighted summary statistics are more helpful. GWmodel provides a number of geographically weighted methods - each of these operates on a SpatialPointsDataFrame object (or possibly a SpatialPolygonsDataFrame). These are calculated below - first, distances between the points are computed (using gw.dist with the longlat=TRUE option to obtain great circle distances) and stored in d. Next, the gwss function computes a local summary statistics object containing a number of geographically weighted summary statistics. Here, the combination of bw=20 and adaptive=TRUE creates a set of local summary statistics for the human impact data, where the bandwidth (bw) at each point is the distance to its 20th nearest neighbour - so the window always covers the 20 nearest points.

d <- gw.dist(coordinates(impact),coordinates(impact),longlat=TRUE)
localstats1 <- gwss(impact,vars=c("humanimpact"),bw=20,dMat=d,adaptive = TRUE)

… and this may be plotted in the same way as the raw data:

qtm(Europe) + 
  tm_shape(localstats1$SDF) + 
  tm_dots(col='humanimpact_LM',title='GW Mean',size=0.06) + 
  tm_compass()

The localstats1 object has an SDF component, which is itself a SpatialPointsDataFrame. Variables within this can then be accessed via the $ operator. Variables created include the GW mean (variable name plus _LM), the GW standard deviation (variable name plus _LSD) and the GW skewness (variable name plus _LSKe). A full guide to the variable names is below:

Variable Name      Statistic Referred to          Comments
X_LM               GW mean
X_LSD              GW standard deviation
X_Lvar             GW variance                    GW standard deviation squared
X_LSKe             GW skewness
X_LCV              GW coefficient of variation    GW standard deviation divided by GW mean
Cov_X.Y            GW covariance                  Not discussed here
Corr_X.Y           GW Pearson correlation
Spearman_rho_X.Y   GW Spearman correlation

Note that X and Y should be replaced by the names of the actual variables being investigated, and that the bivariate statistics are provided only if two variable names (as a character vector) are supplied to gwss.
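
To check the names actually created in any given run, you can list the columns of the SDF directly, for example:

names(localstats1$SDF)   # column names of the local summary statistics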

Given this, the GW-standard deviation for the Human Impact variable can now also be drawn:

qtm(Europe) + 
  tm_shape(localstats1$SDF) + 
  tm_dots(col='humanimpact_LSD',title='GW Std. Deviation',size=0.06) + 
  tm_compass()

This shows some areas have a more geographically uniform pattern of human impact - for example southern England and northern France, and parts of north Africa.

Now look at GW-skewness - note that since this statistic changes sign, a divergent colour scheme is automatically chosen to colour-map the dots. This is not great, as it uses a red-green scale unsuitable for many colour-blind people:

qtm(Europe) + 
  tm_shape(localstats1$SDF) + 
  tm_dots(col='humanimpact_LSKe',title='GW Skew',size=0.06) + 
  tm_compass()

There are a number of ways to rectify this - probably the simplest is to add a map style modifier, tm_style_col_blind, that modifies any colour schemes on the map to avoid the red-green problem.

qtm(Europe) + 
  tm_shape(localstats1$SDF) + 
  tm_dots(col='humanimpact_LSKe',title='GW Skew',size=0.06) + 
  tm_compass() + 
  tm_style_col_blind()

In northern Norway, there is a tendency towards blue(ish) points, suggesting some regions have a few relatively highly impacted areas compared to their neighbours; but most striking are perhaps some of the red (negative) values in North Africa, suggesting a few regions of very low impact in comparison to their neighbours. Information like this may be helpful in identifying sudden land use change.

GW Correlation

Finally, we will look at geographically weighted correlation. To do this, we will make use of the night lighting tiff file - reading the values at the locations of the human impact scores via the raster R package. First, read in the raster data from the tiff file:

night.lights <- raster("night_lights.tif")

This can be plotted as well - the following specifies the night.lights raster as a shape object, plots it via tm_raster, then specifies impact as a shape object and adds a dot layer based on this. It may take a little while to plot:

tm_shape(night.lights) + tm_raster(title='Night Light') + tm_shape(impact) + tm_dots()

Next, extract the pixel grey intensities of night lighting. A few impact points fall outside the raster area - for now these are set to zero, because they give gwss indigestion if left as NA values!

ntlights <- extract(night.lights,impact)
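sum(is.na(ntlights))   # optional check: how many points fall outside the raster extent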
ntlights[is.na(ntlights)] <- 0

Next we re-create the impact SpatialPointsDataFrame adding the sampled ntlights variable, and re-run gwss including both variables, so that geographically weighted correlations are computed.

impact <- SpatialPointsDataFrame(tr.data[,2:3], 
                                 data.frame(humanimpact=tr.data[,4],ntlights=ntlights), 
                                 proj4string=CRS("+init=epsg:4326"))
localstats2 <- gwss(impact,vars=c("humanimpact","ntlights"),bw=20,dMat=d,adaptive = TRUE)

Now, the local correlations may be mapped:

qtm(Europe) + 
  tm_shape(localstats2$SDF) + 
  tm_dots(col='Corr_humanimpact.ntlights',title='Correlation',size=0.06) + 
  tm_compass() + 
  tm_style_col_blind()

Saving Images

Having created these images you may well want to save them for use in an article or report. There are at least two ways of doing this. One is simply to cut and paste them into a Word document: on the window where the map is drawn, just right-click, select ‘Copy as Metafile’ and paste into an open Word document. Alternatively, you can use pdf, a function that redirects the current graphics output into a pdf file whose name is specified; after you have drawn the image, enter dev.off() to close the file. For example, try

pdf('corrmap.pdf',h=11,w=11) # also sets height and width
qtm(Europe) + 
  tm_shape(localstats2$SDF) + 
  tm_dots(col='Corr_humanimpact.ntlights',title='Correlation',size=0.06) + 
  tm_compass() + 
  tm_style_col_blind()
dev.off()

You should then find a file called corrmap.pdf in your current working directory. Opening this will show the image you have just created. You can also insert this into documents. On the whole, the second approach is better if you want to use the images in any system other than MS Office - a notorious mangler of graphical images. However, if you don’t expect to share the images any further, the cut-and-paste option is possibly simpler.
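
If you would rather have a bitmap image (for a web page, say), the same device-based approach works with base R's png function - a sketch, with the file name and pixel dimensions chosen arbitrarily:

png('corrmap.png', width=800, height=800)  # dimensions in pixels
qtm(Europe) + 
  tm_shape(localstats2$SDF) + 
  tm_dots(col='Corr_humanimpact.ntlights',title='Correlation',size=0.06) + 
  tm_compass() + 
  tm_style_col_blind()
dev.off()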

Further Suggestions:

  1. Try producing GW summary statistics maps with alternative bandwidths. You can still use the same methods to draw the maps, but just specify bw=50 (say) for a 50 nearest neighbour adaptive bandwidth when you call gwss.
  2. Try producing a map of the GW-Spearman’s correlation coefficient. Its name in the data frame is listed earlier in this practical. A starter sketch covering both suggestions follows this list.
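
As a possible starting point for both suggestions (a sketch only - the object name localstats3 is arbitrary, and the Spearman column name follows the variable-name table given earlier):

localstats3 <- gwss(impact,vars=c("humanimpact","ntlights"),bw=50,dMat=d,adaptive=TRUE)
qtm(Europe) + 
  tm_shape(localstats3$SDF) + 
  tm_dots(col='Spearman_rho_humanimpact.ntlights',title='GW Spearman',size=0.06) + 
  tm_compass() + 
  tm_style_col_blind()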