This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
These practicals are designed to have explanatory text together with code examples. Note that all code examples have a light yellow background and are boxed. Output from R is also shown, and text output is also boxed. Graphical output is shown 'in line'. The idea is to copy the text in the boxes into a running version of R - you can use the output to see whether you are on the right track. Also, don't be afraid to experiment - it may not always work, but, for example, playing around with some of the function parameters changes some aspects of the analysis, or of the graphical presentation. It is generally assumed that the code you type in from the boxes is entered in the same order as it appears here; some of the later boxes depend on what was done earlier, so skipping ahead might lead to errors. The help facility (putting a question mark in front of a command, for example `?plot`, and hitting return) is also a good way to discover things…
To get started with this tutorial, first load the `GWmodel` package.
library(GWmodel)
Summary statistics are basic statistics used to summarise a large data set. For example, when looking at a data set of, say, 10000 house prices, you might wish to calculate the mean of these, to obtain an idea of what a typical house price might be. Similarly, you may use a standard deviation to see the extent to which house prices spread around this mean. Finally (and perhaps a bit more obscure) you can use the skewness to measure the symmetry of the distribution (ie is there a long upper or lower tail to the distribution, or are values fairly evenly distributed around the mean?). Generally these are useful – although not comprehensive – techniques for what is sometimes called 'data reduction' (Ehrenberg 1981). They are useful in that with a small number of quantities, it is possible to summarise not only typical values, but also distributional properties of variables of interest in very large data sets. Geographically weighted summary statistics are similar, but they apply summary statistics using a moving window, so that the above characteristics can be mapped as the window moves through different geographical regions in a data set. Thus, you can see whether the mean house price in Dublin is different to that in Maynooth (OK, so you probably know it is if you live near those places, but what about Barnet and Edgware, on the outskirts of London, for example?). And although mean levels are often considered in that way, it is also possible to use moving windows to estimate geographical variations in standard deviation - to see whether house prices are more variable in some places than others - or in local skewness - to see whether the lop-sidedness of the house price distribution changes from place to place.
Also, correlation is a useful bivariate summary statistic, as it measures the degree to which two variables are associated. The most commonly used measure of this is the Pearson correlation coefficient. Again, the idea with a geographically weighted correlation is to use a moving window approach to see whether this degree of association varies geographically - for example, in some places floor area may be strongly correlated with house price, but in others less so, if, say, historical or cultural interest makes a smaller house more valuable than it would otherwise be.
Finally, the mean, standard deviation, skewness and correlation coefficient all have robust equivalents: the median, interquartile range, quantile imbalance and Spearman's rank correlation. These are robust in the sense that they are based on the sorted order of values. For the univariate summary statistics, if we sort a variable in ascending order, let the halfway point be \(Q_2\), the first quarter point be \(Q_1\) and the third quarter point be \(Q_3\). Then the median is \(Q_2\), the interquartile range is \(Q_3 - Q_1\) and the quantile imbalance is \[ \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} \] The last one may be less familiar, but it essentially compares the gap between the first quartile and the median with the gap between the median and the third quartile - leading to a measure of lop-sidedness. These measures are seen as robust because one or two very high or very low values don't 'throw' the summary statistics, since they won't alter the values of \(Q_1\), \(Q_2\) or \(Q_3\). Again, these can be geographically weighted - see Brunsdon, Fotheringham, and Charlton (2002).
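To make this concrete, here is a minimal sketch of computing these three robust statistics for an arbitrary numeric vector in base R (the vector `x` here is just made-up illustrative data):

x <- c(3, 5, 7, 8, 9, 14, 40)               # made-up data with one high outlier
q <- quantile(x, probs=c(0.25, 0.5, 0.75))  # Q1, Q2 (the median) and Q3
q[2]                                        # median
q[3] - q[1]                                 # interquartile range
(q[3] - 2*q[2] + q[1]) / (q[3] - q[1])      # quantile imbalance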
In particular, the Spearman's rank correlation coefficient is a robust version of the Pearson coefficient. To compute it, each value of each variable is replaced by its rank - the smallest value of the first variable is replaced by 1, the next smallest by 2, and so on. The same is then done for the second variable - and then the Pearson correlation coefficient is computed for these rank-based variables.
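As a quick sketch of this equivalence (with made-up data; ties would need the usual midrank treatment, which `rank` applies by default):

x <- c(10, 20, 30, 100)
y <- c(1, 3, 2, 50)
cor(rank(x), rank(y))            # Pearson correlation of the ranks...
cor(x, y, method='spearman')     # ...agrees with the built-in Spearman option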
The summary statistics that are geographically weighted are listed below:
Statistic | What it Measures | Robust? | Bivariate or Univariate |
---|---|---|---|
Mean | Overall Level | No | Univariate |
Standard Deviation | Spread | No | Univariate |
Skewness | Asymmetry | No | Univariate |
Median | Overall Level | Yes | Univariate |
Interquartile Range | Spread | Yes | Univariate |
Quantile Imbalance | Asymmetry | Yes | Univariate |
Pearson Correlation | Association | No | Bivariate |
Spearman Correlation | Association | Yes | Bivariate |
In this practical you will be working with house price data, obtained from the Nationwide Building Society (NBS) in England - this is a sample of houses sold in 1991 with mortgages arranged by NBS. The data is in a data frame called `ewhp`. You will make some minor alterations, so this is copied to a new data frame `houses`.
data(EWHP)
houses <- ewhp
head(houses)
Easting Northing PurPrice BldIntWr BldPostW Bld60s Bld70s Bld80s TypDetch TypSemiD
1 599500 142200 65000 0 0 0 0 1 0 1
2 575400 167200 45000 0 0 0 0 0 0 0
3 530300 177300 50000 1 0 0 0 0 0 0
4 524100 170300 105000 0 0 0 0 0 0 0
5 426900 514600 175000 0 0 0 0 1 1 0
6 508000 190400 250000 0 1 0 0 0 1 0
TypFlat FlrArea
1 0 78.94786
2 1 94.36591
3 0 41.33153
4 0 92.87983
5 0 200.52756
6 0 148.60773
From this, it can be seen that each house's purchase price (`PurPrice`) and location (`Easting`, `Northing`) are recorded together with a number of other attributes of the house - for example its floor area in square metres (`FlrArea`).
Also, house prices are divided by 1000 here, since this tends to make graphs, map keys and so on less cluttered - as well as avoiding some numerical rounding errors when doing some fairly complex calculations:
houses$PurPrice <- houses$PurPrice / 1000
Looking at the relation of floor area and purchase price can be done using the standard `plot` command in R:
plot(houses$FlrArea,houses$PurPrice, xlab='Floor Area',ylab='Purchase Price (1000\'s UK Pounds)')
It's a useful plot, given my earlier comments - you can see that there are a few quite expensive houses with small floor areas, and also some quite large houses with relatively low prices. We can also look at some summary statistics - for price:
mean(houses$PurPrice)
[1] 67.347
sd(houses$PurPrice)
[1] 38.75477
That gives the mean and standard deviation - however, the skewness doesn't come as standard, but you can get it from the enigmatically-named R library `e1071`:
library(e1071)
skewness(houses$PurPrice)
[1] 2.441269
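If you are curious, essentially the same value can be recovered directly from the definition - the third central moment divided by the cubed standard deviation (a sketch; `skewness` in `e1071` offers several `type` options with slightly different finite-sample adjustments):

x <- houses$PurPrice
mean((x - mean(x))^3) / sd(x)^3   # third central moment over the cubed sample sd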
From this value of around 2.4 it can be seen that this distribution is strongly positively skewed - suggesting a large upper tail - that is, a central group of typically-priced houses together with a long tail of relatively expensive ones. A quick histogram command should verify this:
hist(houses$PurPrice,xlab='Purchase Price',main='Housing Cost Distribution')
Reality Check: if you have just noticed how cheap housing is in London and are thinking of relocating there, don't forget this is data for 1991 :-)
Now, look at these quantities for the floor areas:
mean(houses$FlrArea)
[1] 100.9567
sd(houses$FlrArea)
[1] 38.08914
skewness(houses$FlrArea)
[1] 1.453568
In particular, note the skewness again. Skewness is a dimensionless quantity, so it doesn't depend on whether floor areas are in square metres or square feet, and similarly the one for house prices wouldn't change if, for example, the prices were converted to Euros. This is unlike means and standard deviations - these take on the units of the data - so a mean house price is in UK pounds here, and so on. A consequence is that it is possible to compare the skewness of two different data sets - they are both on the same (dimensionless) scale. For floor area the skewness is about 1.5 - this is still positively skewed, but less so than the prices. Again, a histogram verifies this:
hist(houses$FlrArea,xlab='Floor Area (Sq. Meters)',main='Housing Size')
Finally, consider the correlation coefficient between the price and floor area: this can be computed using the `cor` function. By default this computes the Pearson correlation:
cor(houses$PurPrice,houses$FlrArea)
[1] 0.6452189
This suggests a reasonable degree of positive association (ie more floor space is associated with higher prices). We can also compute the Spearman’s rank correlation:
cor(houses$PurPrice,houses$FlrArea,method='spearman')
[1] 0.5445533
This is a slightly lower value - but the interpretation is pretty much the same: essentially there is a medium-strength positive association between the variables. As before, this can be verified graphically:
plot(houses$FlrArea,houses$PurPrice,xlab='Floor Area (Square Metres)',ylab='Purchase Price (1000\'s UK Pounds)',pch=16)
Having examined the aspatial aspects of the houses, it is now time to look at the geographical aspects of their distribution - in particular the geographically weighted versions of some of these statistics. `GWmodel` provides a number of geographically weighted methods - each of these operates on a `SpatialPointsDataFrame` object (or possibly a `SpatialPolygonsDataFrame`). These are R objects that are quite similar to shapefiles in ArcGIS - they link a list of spatial objects (points, lines or polygons) to a data frame, so that each point (or line or polygon) has a number of attributes associated with it. The attributes are listed in a data frame, and each row of the data frame is linked to one of the spatial objects. Here, we want to create a `SpatialPointsDataFrame` from the house price data. This is done using the `SpatialPointsDataFrame` function; it takes a two-column matrix (or data frame) of coordinates as its first argument, to specify the locations of the points, and a data frame as the second argument, to specify the attributes for each point. Here we store the resultant spatial object in a new variable called `houses.spdf`:
houses.spdf <- SpatialPointsDataFrame(houses[,1:2],houses)
plot(houses.spdf)
Note that plotting a `SpatialPointsDataFrame` results in showing the locations. This map also shows that the locations of the houses alone are not easy to interpret, and that having some kind of backdrop map may be helpful. Here we can load the border around England and Wales as a `SpatialPolygons` object and plot this as well as the points where the houses are. The `SpatialPolygons` object `ewoutline` can be loaded via the `EWOutline` data entry:
data(EWOutline)
plot(ewoutline)
plot(houses.spdf,add=T,pch=16)
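As an aside, `houses.spdf` as created above has no coordinate reference system attached. This doesn't matter for this practical, but if you later want to overlay the data on other maps you could attach one - a hedged sketch, assuming (from the Easting/Northing columns) that the coordinates are on the British National Grid, EPSG:27700:

library(sp)   # sp is loaded with GWmodel anyway; shown here for completeness
# Assumption: coordinates are British National Grid (EPSG:27700);
# the EPSG lookup below requires the rgdal package to be installed
proj4string(houses.spdf) <- CRS("+init=epsg:27700")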
Now, you can compute the geographically weighted summary statistics. The function to do this is called `gwss` and can be called with arguments giving the `SpatialPointsDataFrame`, the variables for which you want to compute the statistics, and the bandwidth for the moving window. Here, use a 50km bandwidth. All of the statistics discussed above are returned in a single object - here stored in a variable called `localstats1`.
localstats1 <- gwss(houses.spdf,vars=c("PurPrice","FlrArea"),bw=50000)
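Under the hood, each local statistic is just an ordinary statistic computed with distance-decay weights. As a rough illustration (a sketch, not GWmodel's actual code), a geographically weighted mean at the first house's location, using a 50km bisquare kernel, can be computed by hand - it should approximately reproduce the first `PurPrice_LM` value shown below:

u <- houses$Easting[1]; v <- houses$Northing[1]
d <- sqrt((houses$Easting - u)^2 + (houses$Northing - v)^2)  # distances to the focal point
w <- ifelse(d < 50000, (1 - (d/50000)^2)^2, 0)               # bisquare weights, zero beyond 50km
sum(w * houses$PurPrice) / sum(w)                            # weighted mean of price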
This object has a number of components. The most important one is probably a spatial data frame containing the local summary statistics for each data point location. These are stored in `localstats1$SDF` - this is itself a `SpatialPointsDataFrame` - to access just the data frame component of it, use the expression `data.frame(localstats1$SDF)` - and to see just the first few observations, enter
head(data.frame(localstats1$SDF))
PurPrice_LM FlrArea_LM PurPrice_LSD FlrArea_LSD PurPrice_LVar FlrArea_LVar
1 61.47969 101.44210 30.38428 38.77779 923.2044 1503.717
2 62.57194 96.72913 32.68179 39.75198 1068.0997 1580.220
3 86.36999 94.38158 40.72227 37.59890 1658.3034 1413.677
4 87.81406 94.92197 41.53786 37.66060 1725.3942 1418.321
5 67.74730 117.02072 53.57232 43.03838 2869.9934 1852.302
6 89.51224 95.36838 49.55524 38.41052 2455.7214 1475.368
PurPrice_LSKe FlrArea_LSKe PurPrice_LCV FlrArea_LCV Cov_PurPrice.FlrArea
1 1.465978 1.328137 0.4942166 0.3822653 858.5513
2 2.498220 1.791065 0.5223075 0.4109618 1078.6943
3 1.953608 1.687729 0.4714863 0.3983711 1104.6068
4 2.036613 1.656645 0.4730207 0.3967532 1139.5739
5 1.238753 1.058190 0.7907669 0.3677843 2459.2045
6 2.380058 1.581146 0.5536141 0.4027595 1482.3767
Corr_PurPrice.FlrArea Spearman_rho_PurPrice.FlrArea Easting Northing
1 0.6969353 0.5158254 599500 142200
2 0.8063441 0.7290336 575400 167200
3 0.7145582 0.6176430 530300 177300
4 0.7218727 0.6293769 524100 170300
5 0.9531620 0.8160508 426900 514600
6 0.7715712 0.6828798 508000 190400
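Since `localstats1$SDF` behaves like any other `SpatialPointsDataFrame`, you can also work with its columns directly - for example, a quick sketch to find where the local Pearson correlation is strongest:

i <- which.max(localstats1$SDF$Corr_PurPrice.FlrArea)  # index of the largest local correlation
coordinates(localstats1$SDF)[i,]                       # its Easting and Northing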
Also, you can just enter
localstats1
***********************************************************************
* Package GWmodel *
***********************************************************************
***********************Calibration information*************************
Local summary statistics calculated for variables:
PurPrice FlrArea
Number of summary points: 519
Kernel function: bisquare
Summary points: the same locations as observations are used.
Fixed bandwidth: 50000
Distance metric: Euclidean distance metric is used.
************************Local Summary Statistics:**********************
Summary information for Local means:
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LM 37.50 52.34 61.95 85.91 112.4
FlrArea_LM 79.70 94.69 98.58 105.20 145.7
Summary information for local standard deviation :
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LSD 0.00 20.20 29.98 41.52 87.52
FlrArea_LSD 0.00 31.54 36.97 38.68 69.26
Summary information for local variance :
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LVar 0.0 408.2 898.9 1724.0 7659
FlrArea_LVar 0.0 995.1 1367.0 1496.0 4796
Summary information for Local skewness:
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LSKe -1.4640 1.1230 1.7420 2.1590 5.775
FlrArea_LSKe -1.5030 0.8058 1.4510 1.6660 5.775
Summary information for localized coefficient of variation:
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LCV 0.0000 0.3874 0.4650 0.5174 0.8298
FlrArea_LCV 0.0000 0.3102 0.3630 0.3975 0.5126
Summary information for localized Covariance and Correlation between these variables:
Min. 1st Qu. Median 3rd Qu. Max.
Cov_PurPrice.FlrArea -4.033e+01 4.763e+02 9.130e+02 1.209e+03 6320
Corr_PurPrice.FlrArea -4.247e-02 6.512e-01 7.306e-01 8.067e-01 1
Spearman_rho_PurPrice.FlrArea 6.199e-03 5.807e-01 6.417e-01 7.331e-01 1
************************************************************************
to see an overview, in the style of the old `GWR 3.0` program, for those who remember it. A guide to the variable names is below:
Variable Name | Statistic Referred to | Comments |
---|---|---|
X_LM | GW mean | |
X_LSD | GW Standard deviation | |
X_LVar | GW Variance | GW Standard deviation squared |
X_LSKe | GW Skewness | |
X_LCV | GW Coefficient of variation | GW Standard deviation divided by GW mean |
Cov_X.Y | GW Covariance | Not discussed here |
Corr_X.Y | GW Pearson Correlation | |
Spearman_rho_X.Y | GW Spearman Correlation | |
Note that `X` and `Y` should be replaced by the names of the actual variables being investigated.
Next, enter a small R function definition to produce a map of the local geographically weighted summary statistic of your choice. Firstly, load the `RColorBrewer` package - a useful tool for designing map colour palettes.
library(RColorBrewer)
Now type in the function definition. Really, this is just a short R program to draw a map - if you've programmed before, it's an R function definition. If not, think of it as a command that tells R how to draw a map rather than actually drawing the map. The map drawing itself will come later.
quick.map <- function(spdf,var,legend.title,main.title) {
  x <- spdf@data[,var]                                    # extract the variable to be mapped
  cut.vals <- pretty(x)                                   # choose 'pretty' cut points
  x.cut <- cut(x,cut.vals)                                # band each value between cut points
  cut.levels <- levels(x.cut)
  cut.band <- match(x.cut,cut.levels)                     # index number of each band
  colors <- rev(brewer.pal(length(cut.levels),'YlOrRd'))  # one colour per band
  par(mar=c(1,1,1,1))
  plot(ewoutline,col='olivedrab',bg='lightblue1')         # backdrop map
  title(main.title)
  plot(spdf,add=TRUE,col=colors[cut.band],pch=16)         # points coloured by band
  legend('topleft',cut.levels,col=colors,pch=16,bty='n',title=legend.title)
}
In short, the function does the following things:

1. Extracts the chosen variable from the data frame of the `gwss` object.
2. Computes a set of 'pretty' cut points spanning the values of that variable.
3. Assigns each value to a band between consecutive cut points.
4. Converts the bands to index numbers.
5. Creates a colour palette with one colour per band.
6. Draws the England and Wales outline (in `olivedrab`, on a light blue background) and adds the main title.
7. Plots the house locations on top, coloured according to band.
8. Adds a legend linking the colours to the bands.

The function is called by entering
quick.map(gwss.object,variable.name,legend.title,main.map.title)
The advantage of defining a function is that the entire map can now be drawn using a single command for each variable, rather than having to repeat those 8 steps each time. Here is an example - looking at the locally weighted mean as calculated above. The result is a fairly good caricature of the early 1990s UK housing market…
quick.map(localstats1$SDF,"PurPrice_LM","1000's Uk Pounds","Geographically Weighted Mean")
Now, look at the locally weighted standard deviation and skewness. A new command here is `par(mfrow=c(1,2))` - this sets an operating parameter (that's the `par` bit) for the graphics engine in R - it tells it to use multiple figures (that's the `mf`) and to insert these row-wise (that's the `row`), and that they should be arranged as 1 row and 2 columns, ie two figures side by side - that's the `c(1,2)`.
par(mfrow=c(1,2))
quick.map(localstats1$SDF,"PurPrice_LSKe","Skewness Level","Local Skewness")
quick.map(localstats1$SDF,"PurPrice_LSD","1000's Pounds","Local Standard Deviation")
Comparing figures side-by-side like this makes comparison easier - an idea strongly advocated by Tufte (1983) - see http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0000hv for example. Here, you can see the differences in spatial pattern between local skewness and standard deviation very clearly; the two patterns differ notably. Towards the western side of London and in the south east, the standard deviation tends to be greater - suggesting house prices are relatively more variable there. Further up the country, variability is generally lower, although there is somewhat more in the north east.
Skewness seems to be largest in south Wales and the south west of England - suggesting these places have a more notable 'upper tail', where there is a small number of very expensive houses. One possible explanation is that these places are popular choices for second 'holiday' homes, where there is a distinct housing submarket whose prices are driven up by relatively affluent buyers. Although such houses are perhaps less likely to be purchased with a mortgage arrangement, houses near to them will see similar rises in price. However, this is only a hypothesis, and would require further investigation - but it demonstrates the value of exploratory work: it has identified a pattern not easily found by basic approaches, together with a possible explanation, perhaps unearthing a process that might otherwise have gone undiscovered.
We can also map the geographically weighted correlation between the variables `PurPrice` and `FlrArea`. On a practical note, you may need to close the current map window with the double display format, to return to a single map window. This new map allows the exploration of geographical variation in the strength of association between the two variables. The only new R idea here is the `expression(rho)` term. R allows mathematical expressions to appear in graphics as well as text - the \(\rho\) in this case is the usual mathematical symbol for the Pearson correlation.
par(mfrow=c(1,1)) # Reset from two-panel display back to single
quick.map(localstats1$SDF,"Corr_PurPrice.FlrArea",expression(rho),"Geographically Weighted Pearson Correlation")
Here the strongest linkage is in the north east - and parts of East Anglia - suggesting that in these places property size is very strongly linked to price. It is generally lower in the south east - possibly other factors are more important here, such as whether the property has a garage, nearness to facilities and so on. Note that `gwss` also provides the Spearman correlation, as described above. A pair of maps is a useful way to compare the two kinds of correlation:
par(mfrow=c(1,2))
quick.map(localstats1$SDF,"Corr_PurPrice.FlrArea",expression(rho),"Geographically Weighted Pearson Correlation")
quick.map(localstats1$SDF,"Spearman_rho_PurPrice.FlrArea",expression(rho),"Geographically Weighted Spearman Correlation")
This time, the ‘small multiples’ principle does a good job of demonstrating that the two kinds of local correlation show quite a similar story.
As discussed earlier, there are quantile-based versions of the geographically weighted summary statistics, and `gwss` can supply these as well. In addition to the earlier summary statistics, this adds some extra ones:
Variable Name | Statistic Referred to |
---|---|
X_Median | GW Median |
X_IQR | GW Interquartile Range |
X_QI | GW Quantile Imbalance |
To obtain these, just add the option `quantile=TRUE` when you call the function. Here, this is done, and a map of geographically weighted medians is produced:
par(mfrow=c(1,1))
localstats2 <- gwss(houses.spdf,vars=c("PurPrice","FlrArea"),bw=50000,quantile=TRUE)
quick.map(localstats2$SDF,"PurPrice_Median","1000\'s UK Pounds","Geographically Weighted Median House Price")
Small multiples might also be a good way to compare this to the geographically weighted mean:
par(mfrow=c(1,2))
quick.map(localstats2$SDF,"PurPrice_Median","1000\'s UK Pounds","Geographically Weighted Median House Price")
quick.map(localstats2$SDF,"PurPrice_LM","1000\'s UK Pounds","Geographically Weighted Mean House Price")
The side-by-side comparison also works well here - probably Tufte's best idea (runner-up being his thoughts on PowerPoint: http://www.edwardtufte.com/tufte/powerpoint ). Here we see similar patterns, but with lower median values than means in north east England. Also, more generally, medians are lower than means (look at the legends).
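One way to look at this mean-median gap more directly (a sketch, not part of the original practical) is to add a difference column to the results object and re-use `quick.map` - positive values flag places with a pronounced upper tail:

localstats2$SDF$PurPrice_MMDiff <- localstats2$SDF$PurPrice_LM - localstats2$SDF$PurPrice_Median
par(mfrow=c(1,1))
quick.map(localstats2$SDF,"PurPrice_MMDiff","1000's UK Pounds","GW Mean Minus GW Median")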
Finally, look at the other two robust geographically weighted summary statistics, the interquartile range and the quantile imbalance.
par(mfrow=c(1,2))
quick.map(localstats2$SDF,"PurPrice_IQR","1000\'s UK Pounds","Geographically Weighted Interquartile Range")
quick.map(localstats2$SDF,"PurPrice_QI","1000\'s UK Pounds","Geographically Weighted Quantile Imbalance")
The IQRs are quite similar to the standard deviations. The quantile imbalance is perhaps rather different - although the extreme levels in the north east might be associated with the notably low median house prices there compared to the means.
Having created these images you may well want to save them for use in an article or report. There are at least two ways of doing this. One is simply to cut and paste them into a Word document: on the window where the map is drawn, just right-click, select 'Copy as Metafile' and paste into an open Word document. Alternatively, you can use `dev.copy2pdf`, a function that copies the current window into a PDF file whose name is specified. For example, try
quick.map(localstats2$SDF,"PurPrice_Median","1000\'s UK Pounds","Geographically Weighted Median House Price")
dev.copy2pdf(file='median.pdf')
You should find a file called `median.pdf` in your current working directory. Opening this will show the image you have just created. You can also insert this into documents. On the whole, the second approach is better if you want to use the images in any system other than MS Office - although if you don't expect to share the images any further, the cut-and-paste option is possibly more convenient.
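A variant on this (also standard R) is to open a PDF device before drawing, so the figure is written straight to file at a size you control - the file name and dimensions here are just examples:

pdf('median2.pdf', width=6, height=8)  # open a PDF device (size in inches)
quick.map(localstats2$SDF,"PurPrice_Median","1000's UK Pounds","Geographically Weighted Median House Price")
dev.off()                              # close the device to finish writing the file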
Further Suggestions: Try producing local statistics maps with alternative bandwidths. You can still use `quick.map` to draw the maps, but just specify `bw=100000` (say) for a 100km bandwidth when you call `gwss`. Also, try using adaptive bandwidths - put `adaptive=TRUE` in the call to `gwss`, and set `bw` to a number of nearest neighbours. This means the window alters in size so that it always contains a fixed number of points, rather than having a fixed radius - so in denser urban areas the radius tends to be smaller, and more rapid spatial variation can be identified. As an example, for the nearest 50 neighbours, the command is `localstats3 <- gwss(houses.spdf,vars=c("PurPrice","FlrArea"),bw=50,adaptive=TRUE)`. You can then go on to plot local statistics with `quick.map`, as sketched below.
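For instance, the following sketch computes the adaptive-bandwidth statistics and maps the resulting local mean (the map titles are just suggestions):

localstats3 <- gwss(houses.spdf,vars=c("PurPrice","FlrArea"),bw=50,adaptive=TRUE)
par(mfrow=c(1,1))
quick.map(localstats3$SDF,"PurPrice_LM","1000's UK Pounds","GW Mean (Adaptive: 50 Nearest Neighbours)")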
Brunsdon, C., A.S. Fotheringham, and M. Charlton. 2002. “Geographically Weighted Summary Statistics.” Computers, Environment and Urban Systems 26 (6): 501–24.
Ehrenberg, A. 1981. Data Reduction. Chichester: John Wiley.
Tufte, E. 1983. The Visual Display of Quantitative Information. Graphics Press.