This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.

The thinking behind this tutorial

These practicals are designed to have an explanatory text, together with code examples. Note that all code examples have a light yellow background, and are boxed. Output from R is also shown, and text output is also boxed. Graphical output is shown ‘in line’. The idea is to copy the text in the boxes into a running version of R - you can use the output to see whether you are on the right track. Also, don’t be afraid to experiment - it may not always work, but, for example, playing around with some of the function parameters changes some aspects of the analysis, or of the graphical presentation. Also, it is generally assumed that the code you type in from the boxes is actually entered in the same order as it appears here. Some of the later boxes depend on what was done earlier, and so skipping ahead might lead to errors. The help facility (putting a question mark in front of a command, for example ?plot, and hitting return) is also a good way to discover things…

To get started with this tutorial, first load the GWmodel package and some other helper packages:

library(GWmodel)
library(raster) 
library(rgdal)
library(rgeos)
library(RColorBrewer)
library(grDevices)

Review of Summary Statistics

Summary statistics are basic statistics used to summarise a large data set. For example, when looking at a data set of, say, 10000 house prices, you might wish to calculate the mean of these, to obtain an idea of what a typical house price might be. Similarly, you may use a standard deviation to see the extent to which house prices spread around this mean. Finally (and perhaps a bit more obscure) you can use the skewness to measure the symmetry of the distribution (i.e. is there a long upper or lower tail to the distribution, or are values fairly evenly distributed around the mean?). Generally these are useful – although not comprehensive – techniques for what is sometimes called ‘data reduction’ (Ehrenberg 1981). They are useful in that, with a small number of quantities, it is possible to summarise not only typical values, but also distributional properties of variables of interest in very large data sets.

Geographically weighted summary statistics are similar, but they apply summary statistics using a moving window, so that the above characteristics can be mapped as you move through different geographical regions in a data set. Thus, you can see whether the mean house price in Dublin is different to that in Maynooth (OK, so you probably know it is if you live near those places, but what about Barnet and Edgware, on the outskirts of London, for example?). Although mean levels are most often considered in this way, it is also possible to use moving windows to estimate geographical variations in the standard deviation - to see whether house prices are more variable in some places than others - or in the local skewness - to see whether the lop-sidedness of the house price distribution changes from place to place.
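
To fix ideas, these (non-geographical) summary statistics are easy to compute in base R. The sketch below uses a simulated variable x purely for illustration - the name and the values are not part of the data used later in this practical:

# A simulated, positively skewed 'house price'-like variable, for illustration only
set.seed(123)
x <- rlnorm(1000, meanlog = 12, sdlog = 0.4)
mean(x)                                        # overall level
sd(x)                                          # spread
sum((x - mean(x))^3) / (length(x) * sd(x)^3)   # moment-based skewness (using the sample sd)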

Also, correlation is a useful bivariate summary statistic, as it measures the degree to which two variables are associated. The most commonly used measure of this is the Pearson correlation coefficient. Again, the idea with a geographically weighted correlation is to use a moving window approach to see whether this degree of association varies geographically - for example, in some places floor area may be strongly correlated with house price, but in others less so, if, say, being of historical or cultural interest makes a smaller house more valuable than it would otherwise be.
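
Again as a non-geographical reference point, the Pearson correlation between two variables can be computed with R’s cor function. The floor.area and price variables below are simulated purely for illustration:

# Simulated floor area and price, for illustration only
set.seed(456)
floor.area <- runif(1000, 40, 200)
price <- 2000 * floor.area + rnorm(1000, sd = 40000)
cor(floor.area, price)   # Pearson correlation coefficient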

The mean, standard deviation, skewness and correlation coefficient all have robust equivalents: the median, the interquartile range, the quantile imbalance and Spearman’s rank correlation, respectively. These are robust in the sense that they are based on the sorted order of values. For the univariate summary statistics, if we sort a variable in ascending order, let the halfway point be \(Q_2\), the first quarter point be \(Q_1\) and the third quarter point be \(Q_3\). Then the median is \(Q_2\), the interquartile range is \(Q_3 - Q_1\) and the quantile imbalance is \[ \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} \] The last one may be less familiar, but it basically measures the difference between the gap from the first quartile to the median and the gap from the median to the third quartile - leading to a measure of lop-sidedness. These measures are seen as robust because one or two very high or very low values don’t ‘throw’ the summary statistics, since they won’t alter the values of \(Q_1\), \(Q_2\) or \(Q_3\). Again, these can be geographically weighted - see Brunsdon, Fotheringham, and Charlton (2002).
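
As a sketch of how these robust statistics are obtained from the quartiles, they can be computed via R’s quantile function - here applied to the illustrative variable x created earlier:

# Robust summaries based on the quartiles of x
Q <- quantile(x, probs = c(0.25, 0.5, 0.75))
Q[2]                                       # median (Q2)
Q[3] - Q[1]                                # interquartile range (Q3 - Q1)
(Q[3] - 2 * Q[2] + Q[1]) / (Q[3] - Q[1])   # quantile imbalance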

Finally, Spearman’s rank correlation coefficient is a robust version of the Pearson coefficient. To compute this, each value of each variable is replaced by its rank - the smallest value of the first variable is replaced by 1, the next smallest by 2 and so on. The same is then done for the second variable, and the Pearson correlation coefficient is then computed for these rank-based variables.
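
This rank-based definition can be checked directly in R - cor with method = "spearman" should give the same value as the Pearson coefficient computed on the ranks (using the illustrative variables above):

# Spearman correlation, computed two ways - the values should agree
cor(floor.area, price, method = "spearman")
cor(rank(floor.area), rank(price))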

The summary statistics that are geographically weighted are listed below:

Statistic              What it Measures   Robust?   Bivariate or Univariate
Mean                   Overall Level      No        Univariate
Standard Deviation     Spread             No        Univariate
Skewness               Asymmetry          No        Univariate
Median                 Overall Level      Yes       Univariate
Interquartile Range    Spread             Yes       Univariate
Quantile Imbalance     Asymmetry          Yes       Univariate
Pearson Correlation    Association        No        Bivariate
Spearman Correlation   Association        Yes       Bivariate

The data

In this practical – and some of the others – you will be working with some of Myroslava’s Human Impact data. Thanks to Myroslava for preparing this!

tr.data <- read.csv("points.csv")
head(tr.data)
  ID       F2      F3 humanimpact lc_code fieldsize
1  2 -4.03750 37.5042           5       1         0
2  6 13.02920 56.6125         100       4         3
3  8 27.31250 53.2042          90       5         4
4 10 18.73750 46.0792         100       4         4
5 42 -6.79583 40.6042          75       5         2
6 56 11.76250 44.0958          15       5         3

From this, it can be seen that the longitude and latitude are recorded (columns F2 and F3) along with some other information including the human impact score humanimpact. Here, it is useful to create a SpatialPointsDataFrame from the human impact data. These are R objects that are quite similar to shapefiles in ArcGIS - they link a list of spatial objects (points, lines or polygons) to a data frame, so that each point (or line or polygon) has a number of attributes associated with it. The attributes are listed in a data frame, and each row of the data frame is linked to one of the spatial objects. This is done using the SpatialPointsDataFrame function, which takes a two-column matrix (or data frame) of coordinates as its first argument - to specify the locations of the points - and a data frame as its second argument - to specify the attributes for each point. Below, a SpatialPointsDataFrame containing the human impact information is created.

impact <- SpatialPointsDataFrame(tr.data[,2:3],data.frame(humanimpact=tr.data[,4]))
proj4string(impact) <- CRS("+init=epsg:4326")
head(impact)
  humanimpact
1           5
2         100
3          90
4         100
5          75
6          15

The locations of the GeoWiki points can be seen by plotting:

plot(impact,pch=16,cex=0.8)
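
As a preview of how the geographically weighted versions of these statistics can be obtained, the sketch below applies GWmodel’s gwss function to the impact object. The choice of an adaptive bandwidth of 50 nearest neighbours is arbitrary here - it is just something to experiment with, not a recommendation:

# Geographically weighted summary statistics of humanimpact
# (adaptive bandwidth of 50 nearest neighbours chosen arbitrarily)
gw.ss <- gwss(impact, vars = "humanimpact", bw = 50, kernel = "bisquare",
              adaptive = TRUE, longlat = TRUE, quantile = TRUE)
head(gw.ss$SDF@data)   # local means, SDs, skewness, medians, IQRs, etc.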