This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
These practicals are designed to have explanatory text together with code examples. Note that all code examples have a light yellow background and are boxed. Output from R is also shown, and text output is also boxed. Graphical output is shown 'in line'. The idea is to copy the text in the boxes into a running version of R - you can use the output to see whether you are on the right track. Also, don't be afraid to experiment - it may not always work, but, for example, playing around with some of the function parameters changes some aspects of the analysis, or of the graphical presentation. It is generally assumed that the code you type in from the boxes is entered in the same order as it appears here; some of the later boxes depend on what was done earlier, so skipping ahead might lead to errors. The help facility (putting a question mark in front of a command, for example `?plot`, and hitting return) is also a good way to discover things…
To get started with this tutorial, first load the `GWmodel` package.
library(GWmodel)
Summary statistics are basic statistics used to summarise a large data set. For example, when looking at a data set of, say, 10000 house prices, you might wish to calculate the mean of these, to obtain an idea of what a typical house price might be. Similarly, you may use a standard deviation to see the extent to which house prices spread around this mean. Finally (and perhaps a bit more obscure) you can use the skewness to measure the symmetry of the distribution (ie is there a long upper or lower tail to the distribution, or are values fairly evenly distributed around the mean?). Generally these are useful – although not comprehensive – techniques for what is sometimes called 'data reduction' (Ehrenberg 1981). They are useful in that with a small number of quantities, it is possible to summarise not only typical values, but also distributional properties of variables of interest in very large data sets. Geographically weighted summary statistics are similar, but they apply summary statistics using a moving window, so that the above characteristics can be mapped as the window moves through different geographical regions in a data set. Thus, you can see whether the mean house price in Dublin is different to that in Maynooth (OK, so you probably know it is if you live near those places, but what about Barnet and Edgware, on the outskirts of London, for example?). And although mean levels are often considered in that way, it is also possible to use moving windows to estimate geographical variations in standard deviation - to see whether house prices are more variable in some places than others - or in local skewness - to see whether the lop-sidedness of the house price distribution changes from place to place.
Also, correlation is a useful bivariate summary statistic, as it measures the degree to which two variables are associated. The most commonly used measure of this is the Pearson correlation coefficient. Again, the idea with a geographically weighted correlation is to use a moving window approach to see whether this degree of association varies geographically - for example, in some places floor area may be strongly correlated with house price, but in others less so, if, say, historical or cultural interest makes a smaller house more valuable than it would otherwise be.
Finally, the mean, standard deviation, skewness and correlation coefficient all have robust equivalents: the median, interquartile range, quantile imbalance and Spearman's rank correlation. These are robust in the sense that they are based on the sorted order of values. For the univariate summary statistics, if we sort a variable in ascending order, let the halfway point be \(Q_2\), the first quarter point be \(Q_1\) and the third quarter point be \(Q_3\). Then the median is \(Q_2\), the interquartile range is \(Q_3 - Q_1\) and the quantile imbalance is \[ \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} \] The last one may be less familiar, but it essentially compares the gap between the first quartile and the median with the gap between the median and the third quartile - leading to a measure of lop-sidedness. These measures are seen as robust because one or two very high or very low values don't 'throw' the summary statistics, since they won't alter the values of \(Q_1\), \(Q_2\) or \(Q_3\). Again, these can be geographically weighted - see Brunsdon, Fotheringham, and Charlton (2002).
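To make this concrete, here is a minimal sketch of computing these three robust statistics for an arbitrary numeric vector in base R (the vector `x` here is just made-up illustrative data):

x <- c(3, 5, 7, 8, 9, 14, 40)               # made-up data with one high outlier
q <- quantile(x, probs=c(0.25, 0.5, 0.75))  # Q1, Q2 (the median) and Q3
q[2]                                        # median
q[3] - q[1]                                 # interquartile range
(q[3] - 2*q[2] + q[1]) / (q[3] - q[1])      # quantile imbalance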
In particular, the Spearman's rank correlation coefficient is a robust version of the Pearson coefficient. To compute it, each value of each variable is replaced by its rank - the smallest value of the first variable is replaced by 1, the next smallest by 2, and so on. The same is then done for the second variable - and then the Pearson correlation coefficient is computed for these rank-based variables.
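As a quick sketch of this equivalence (with made-up data; ties would need the usual midrank treatment, which `rank` applies by default):

x <- c(10, 20, 30, 100)
y <- c(1, 3, 2, 50)
cor(rank(x), rank(y))            # Pearson correlation of the ranks...
cor(x, y, method='spearman')     # ...agrees with the built-in Spearman option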
The summary statistics that are geographically weighted are listed below:
Statistic | What it Measures | Robust? | Bivariate or Univariate |
---|---|---|---|
Mean | Overall Level | No | Univariate |
Standard Deviation | Spread | No | Univariate |
Skewness | Asymmetry | No | Univariate |
Median | Overall Level | Yes | Univariate |
Interquartile Range | Spread | Yes | Univariate |
Quantile Imbalance | Asymmetry | Yes | Univariate |
Pearson Correlation | Association | No | Bivariate |
Spearman Correlation | Association | Yes | Bivariate |
In this practical you will be working with house price data, obtained from the Nationwide Building Society (NBS) in England - this is a sample of houses sold in 1991 with mortgages arranged by NBS. The data is in a data frame called `ewhp`. You will make some minor alterations, so this is copied to a new data frame `houses`.
data(EWHP)
houses <- ewhp
head(houses)
Easting Northing PurPrice BldIntWr BldPostW Bld60s Bld70s Bld80s TypDetch TypSemiD
1 599500 142200 65000 0 0 0 0 1 0 1
2 575400 167200 45000 0 0 0 0 0 0 0
3 530300 177300 50000 1 0 0 0 0 0 0
4 524100 170300 105000 0 0 0 0 0 0 0
5 426900 514600 175000 0 0 0 0 1 1 0
6 508000 190400 250000 0 1 0 0 0 1 0
TypFlat FlrArea
1 0 78.94786
2 1 94.36591
3 0 41.33153
4 0 92.87983
5 0 200.52756
6 0 148.60773
From this, it can be seen that each house's purchase price (`PurPrice`) and location (`Easting`, `Northing`) are recorded together with a number of other attributes of the house - for example its floor area in square metres (`FlrArea`).
Also, house prices are divided by 1000 here, since this tends to make graphs, map keys and so on less cluttered - as well as avoiding some numerical rounding errors when doing some fairly complex calculations:
houses$PurPrice <- houses$PurPrice / 1000
Looking at the relation of floor area and purchase price can be done using the standard `plot` command in R:
plot(houses$FlrArea,houses$PurPrice, xlab='Floor Area',ylab='Purchase Price (1000\'s UK Pounds)')
It's a useful plot, given my earlier comments - you can see that there are a few quite expensive houses with small floor areas, and also some quite large houses with relatively low prices. We can also look at some summary statistics - for price:
mean(houses$PurPrice)
[1] 67.347
sd(houses$PurPrice)
[1] 38.75477
That gives the mean and standard deviation - however, the skewness doesn't come as standard, but you can get it from the enigmatically-named R library `e1071`:
library(e1071)
skewness(houses$PurPrice)
[1] 2.441269
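If you are curious, essentially the same value can be recovered directly from the definition - the third central moment divided by the cubed standard deviation (a sketch; `skewness` in `e1071` offers several `type` options with slightly different finite-sample adjustments):

x <- houses$PurPrice
mean((x - mean(x))^3) / sd(x)^3   # third central moment over the cubed sample sd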
From this value of around 2.4 it can be seen that this distribution is strongly positively skewed - suggesting a large upper tail - that is, a central group of typically-priced houses together with a long tail of relatively expensive ones. A quick histogram command should verify this:
hist(houses$PurPrice,xlab='Purchase Price',main='Housing Cost Distribution')
Reality Check: if you have just noticed how cheap housing is in London and are thinking of relocating there, don't forget this is data for 1991 :-)
Now, look at these quantities for the floor areas:
mean(houses$FlrArea)
[1] 100.9567
sd(houses$FlrArea)
[1] 38.08914
skewness(houses$FlrArea)
[1] 1.453568
In particular, note the skewness again. Skewness is a dimensionless quantity, so it doesn't depend on whether floor areas are in square metres or square feet, and similarly the one for house prices wouldn't change if, for example, the prices were converted to Euros. This is unlike means and standard deviations - these take on the units of the data - so a mean house price is in UK pounds here, and so on. A consequence is that it is possible to compare the skewness of two different data sets - they are both on the same (dimensionless) scale. For floor area the skewness is about 1.5 - this is still positively skewed, but less so than the prices. Again, a histogram verifies this:
hist(houses$FlrArea,xlab='Floor Area (Sq. Meters)',main='Housing Size')
Finally, consider the correlation coefficient between the price and floor area: this can be computed using the `cor` function. By default this computes the Pearson correlation:
cor(houses$PurPrice,houses$FlrArea)
[1] 0.6452189
This suggests a reasonable degree of positive association (ie more floor space is associated with higher prices). We can also compute the Spearman’s rank correlation:
cor(houses$PurPrice,houses$FlrArea,method='spearman')
[1] 0.5445533
This is a slightly lower value - but the interpretation is pretty much the same: essentially there is a medium-strength positive association between the variables. As before, this can be verified graphically:
plot(houses$FlrArea,houses$PurPrice,xlab='Floor Area (Square Metres)',ylab='Purchase Price (1000\'s UK Pounds)',pch=16)
Having examined the aspatial aspects of the houses, it is now time to look at the geographical aspects of their distribution - in particular the geographically weighted versions of some of these statistics. `GWmodel` provides a number of geographically weighted methods - each of these operates on a `SpatialPointsDataFrame` object (or possibly a `SpatialPolygonsDataFrame`). These are R objects that are quite similar to shapefiles in ArcGIS - they link a list of spatial objects (points, lines or polygons) to a data frame, so that each point (or line or polygon) has a number of attributes associated with it. The attributes are listed in a data frame, and each row of the data frame is linked to one of the spatial objects. Here, we want to create a `SpatialPointsDataFrame` from the house price data. This is done using the `SpatialPointsDataFrame` function; it takes a two-column matrix (or data frame) of coordinates as its first argument, to specify the locations of the points, and a data frame as the second argument, to specify the attributes for each point. Here we store the resultant spatial object in a new variable called `houses.spdf`:
houses.spdf <- SpatialPointsDataFrame(houses[,1:2],houses)
plot(houses.spdf)
Note that plotting a `SpatialPointsDataFrame` results in showing the locations. This map also shows that the locations of the houses alone are not easy to interpret, and that having some kind of backdrop map may be helpful. Here we can load the border around England and Wales as a `SpatialPolygons` object and plot this as well as the points where the houses are. The `SpatialPolygons` object `ewoutline` can be loaded via the `EWOutline` data entry:
data(EWOutline)
plot(ewoutline)
plot(houses.spdf,add=T,pch=16)
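As an aside, `houses.spdf` as created above has no coordinate reference system attached. This doesn't matter for this practical, but if you later want to overlay the data on other maps you could attach one - a hedged sketch, assuming (from the Easting/Northing columns) that the coordinates are on the British National Grid, EPSG:27700:

library(sp)   # sp is loaded with GWmodel anyway; shown here for completeness
# Assumption: coordinates are British National Grid (EPSG:27700);
# the EPSG lookup below requires the rgdal package to be installed
proj4string(houses.spdf) <- CRS("+init=epsg:27700")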
Now, you can compute the geographically weighted summary statistics. The function to do this is called `gwss` and can be called with arguments giving the `SpatialPointsDataFrame`, the variables for which you want to compute the statistics, and the bandwidth for the moving window. Here, use a 50km bandwidth. All of the statistics discussed above are returned in a single object - here stored in a variable called `localstats1`.
localstats1 <- gwss(houses.spdf,vars=c("PurPrice","FlrArea"),bw=50000)
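Under the hood, each local statistic is just an ordinary statistic computed with distance-decay weights. As a rough illustration (a sketch, not GWmodel's actual code), a geographically weighted mean at the first house's location, using a 50km bisquare kernel, can be computed by hand - it should approximately reproduce the first `PurPrice_LM` value shown below:

u <- houses$Easting[1]; v <- houses$Northing[1]
d <- sqrt((houses$Easting - u)^2 + (houses$Northing - v)^2)  # distances to the focal point
w <- ifelse(d < 50000, (1 - (d/50000)^2)^2, 0)               # bisquare weights, zero beyond 50km
sum(w * houses$PurPrice) / sum(w)                            # weighted mean of price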
This object has a number of components. The most important one is probably a spatial data frame containing the local summary statistics for each data point location. These are stored in `localstats1$SDF` - this is itself a `SpatialPointsDataFrame` - to access just the data frame component of it, use the expression `data.frame(localstats1$SDF)` - and to see just the first few observations, enter
head(data.frame(localstats1$SDF))
PurPrice_LM FlrArea_LM PurPrice_LSD FlrArea_LSD PurPrice_LVar FlrArea_LVar
1 61.47969 101.44210 30.38428 38.77779 923.2044 1503.717
2 62.57194 96.72913 32.68179 39.75198 1068.0997 1580.220
3 86.36999 94.38158 40.72227 37.59890 1658.3034 1413.677
4 87.81406 94.92197 41.53786 37.66060 1725.3942 1418.321
5 67.74730 117.02072 53.57232 43.03838 2869.9934 1852.302
6 89.51224 95.36838 49.55524 38.41052 2455.7214 1475.368
PurPrice_LSKe FlrArea_LSKe PurPrice_LCV FlrArea_LCV Cov_PurPrice.FlrArea
1 1.465978 1.328137 0.4942166 0.3822653 858.5513
2 2.498220 1.791065 0.5223075 0.4109618 1078.6943
3 1.953608 1.687729 0.4714863 0.3983711 1104.6068
4 2.036613 1.656645 0.4730207 0.3967532 1139.5739
5 1.238753 1.058190 0.7907669 0.3677843 2459.2045
6 2.380058 1.581146 0.5536141 0.4027595 1482.3767
Corr_PurPrice.FlrArea Spearman_rho_PurPrice.FlrArea Easting Northing
1 0.6969353 0.5158254 599500 142200
2 0.8063441 0.7290336 575400 167200
3 0.7145582 0.6176430 530300 177300
4 0.7218727 0.6293769 524100 170300
5 0.9531620 0.8160508 426900 514600
6 0.7715712 0.6828798 508000 190400
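Since `localstats1$SDF` behaves like any other `SpatialPointsDataFrame`, you can also work with its columns directly - for example, a quick sketch to find where the local Pearson correlation is strongest:

i <- which.max(localstats1$SDF$Corr_PurPrice.FlrArea)  # index of the largest local correlation
coordinates(localstats1$SDF)[i,]                       # its Easting and Northing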
Also, you can just enter
localstats1
***********************************************************************
* Package GWmodel *
***********************************************************************
***********************Calibration information*************************
Local summary statistics calculated for variables:
PurPrice FlrArea
Number of summary points: 519
Kernel function: bisquare
Summary points: the same locations as observations are used.
Fixed bandwidth: 50000
Distance metric: Euclidean distance metric is used.
************************Local Summary Statistics:**********************
Summary information for Local means:
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LM 37.50 52.34 61.95 85.91 112.4
FlrArea_LM 79.70 94.69 98.58 105.20 145.7
Summary information for local standard deviation :
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LSD 0.00 20.20 29.98 41.52 87.52
FlrArea_LSD 0.00 31.54 36.97 38.68 69.26
Summary information for local variance :
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LVar 0.0 408.2 898.9 1724.0 7659
FlrArea_LVar 0.0 995.1 1367.0 1496.0 4796
Summary information for Local skewness:
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LSKe -1.4640 1.1230 1.7420 2.1590 5.775
FlrArea_LSKe -1.5030 0.8058 1.4510 1.6660 5.775
Summary information for localized coefficient of variation:
Min. 1st Qu. Median 3rd Qu. Max.
PurPrice_LCV 0.0000 0.3874 0.4650 0.5174 0.8298
FlrArea_LCV 0.0000 0.3102 0.3630 0.3975 0.5126
Summary information for localized Covariance and Correlation between these variables:
Min. 1st Qu. Median 3rd Qu. Max.
Cov_PurPrice.FlrArea -4.033e+01 4.763e+02 9.130e+02 1.209e+03 6320
Corr_PurPrice.FlrArea -4.247e-02 6.512e-01 7.306e-01 8.067e-01 1
Spearman_rho_PurPrice.FlrArea 6.199e-03 5.807e-01 6.417e-01 7.331e-01 1
************************************************************************
to see an overview, in the style of the old `GWR 3.0` program, for those who remember it. A guide to the variable names is below:
Variable Name | Statistic Referred to | Comments |
---|---|---|
X_LM | GW mean | |
X_LSD | GW Standard deviation | |
X_LVar | GW Variance | GW Standard deviation squared |
X_LSKe | GW Skewness | |
X_LCV | GW Coefficient of variation | GW Standard deviation divided by GW mean |
Cov_X.Y | GW Covariance | Not discussed here |
Corr_X.Y | GW Pearson Correlation | |
Spearman_rho_X.Y | GW Spearman Correlation | |
Note that `X` and `Y` should be replaced by the names of the actual variables being investigated.
Next, enter a small R function definition to produce a map of the local geographically weighted summary statistic of your choice. Firstly, load the `RColorBrewer` package - a useful tool for designing map colour palettes.
library(RColorBrewer)
Now type in the function definition. Really, this is just a short R program to draw a map - if you've programmed before, it's an R function definition. If not, think of it as a command that tells R how to draw a map rather than actually drawing the map. The map drawing itself will come later.
quick.map <- function(spdf,var,legend.title,main.title) {
  x <- spdf@data[,var]                                    # extract the variable to be mapped
  cut.vals <- pretty(x)                                   # choose 'pretty' cut points
  x.cut <- cut(x,cut.vals)                                # band each value between cut points
  cut.levels <- levels(x.cut)
  cut.band <- match(x.cut,cut.levels)                     # index number of each band
  colors <- rev(brewer.pal(length(cut.levels),'YlOrRd'))  # one colour per band
  par(mar=c(1,1,1,1))
  plot(ewoutline,col='olivedrab',bg='lightblue1')         # backdrop map
  title(main.title)
  plot(spdf,add=TRUE,col=colors[cut.band],pch=16)         # points coloured by band
  legend('topleft',cut.levels,col=colors,pch=16,bty='n',title=legend.title)
}
In short, the function does the following things:

1. Extracts the chosen variable from the data frame of the `gwss` object.
2. Computes a set of 'pretty' cut points spanning the values of that variable.
3. Assigns each value to a band between consecutive cut points.
4. Converts the bands to index numbers.
5. Creates a colour palette with one colour per band.
6. Draws the England and Wales outline (in `olivedrab`, on a light blue background) and adds the main title.
7. Plots the house locations on top, coloured according to band.
8. Adds a legend linking the colours to the bands.

The function is called by entering
quick.map(gwss.object,variable.name,legend.title,main.map.title)
The advantage of defining a function is that the entire map can now be drawn using a single command for each variable, rather than having to repeat those 8 steps each time. Here is an example - looking at the locally weighted mean as calculated above. The result is a fairly good caricature of the early 1990s UK housing market…
quick.map(localstats1$SDF,"PurPrice_LM","1000's Uk Pounds","Geographically Weighted Mean")
Now, look at the locally weighted standard deviation and skewness. A new command here is `par(mfrow=c(1,2))` - this sets an operating parameter (that's the `par` bit) for the graphics engine in R - it tells it to use multiple figures (that's the `mf`) and to insert these row-wise (that's the `row`), and that they should be arranged as 1 row and 2 columns, ie two figures side by side - that's the `c(1,2)`.
par(mfrow=c(1,2))
quick.map(localstats1$SDF,"PurPrice_LSKe","Skewness Level","Local Skewness")
quick.map(localstats1$SDF,"PurPrice_LSD","1000's Pounds","Local Standard Deviation")
Comparing figures side-by-side like this makes comparison easier - an idea strongly advocated by Tufte (1983) - see http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0000hv for example. Here, you can see the differences in spatial pattern between local skewness and standard deviation very clearly; the two patterns differ notably. Towards the western side of London and in the south east, the standard deviation tends to be greater - suggesting house prices are relatively more variable there. Further up the country, variability is generally lower, although there is somewhat more in the north east.
Skewness seems to be largest in south Wales and the south west of England - suggesting these places have a more notable 'upper tail', where there is a small number of very expensive houses. One possible explanation is that these places are popular choices for second 'holiday' homes, where there is a distinct housing submarket whose prices are driven up by relatively affluent buyers. Although such houses are perhaps less likely to be purchased with a mortgage arrangement, houses near to them will see similar rises in price. However, this is only a hypothesis, and would require further investigation - but it demonstrates the value of exploratory work: it has identified a pattern not easily found by basic approaches, together with a possible explanation, perhaps unearthing a process that might otherwise have gone undiscovered.
We can also map the geographically weighted correlation between the variables `PurPrice` and `FlrArea`. On a practical note, you may need to close the current map window with the double display format, to return to a single map window. This new map allows the exploration of geographical variation in the strength of association between the two variables. The only new R idea here is the `expression(rho)` term. R allows mathematical expressions to appear in graphics as well as text - the \(\rho\) in this case is the usual mathematical symbol for the Pearson correlation.
par(mfrow=c(1,1)) # Reset from two-panel display back to single
quick.map(localstats1$SDF,"Corr_PurPrice.FlrArea",expression(rho),"Geographically Weighted Pearson Correlation")
Here the strongest linkage is in the north east - and parts of East Anglia - suggesting that in these places property size is very strongly linked to price. It is generally lower in the south east - possibly other factors are more important here, such as whether the property has a garage, nearness to facilities and so on. Note that `gwss` also provides the Spearman correlation, as described above. A pair of maps is a useful way to compare the two kinds of correlation:
par(mfrow=c(1,2))
quick.map(localstats1$SDF,"Corr_PurPrice.FlrArea",expression(rho),"Geographically Weighted Pearson Correlation")
quick.map(localstats1$SDF,"Spearman_rho_PurPrice.FlrArea",expression(rho),"Geographically Weighted Spearman Correlation")
This time, the ‘small multiples’ principle does a good job of demonstrating that the two kinds of local correlation show quite a similar story.
As discussed earlier, there are quantile-based versions of the geographically weighted summary statistics, and `gwss` can supply these as well. In addition to the earlier summary statistics, this adds some extra ones:
Variable Name | Statistic Referred to |
---|---|
X_Median | GW Median |
X_IQR | GW Interquartile Range |
X_QI | GW Quantile Imbalance |
To obtain these, just add the option `quantile=TRUE` when you call the function. Here, this is done, and a map of geographically weighted medians is produced:
par(mfrow=c(1,1))
localstats2 <- gwss(houses.spdf,vars=c("PurPrice","FlrArea"),bw=50000,quantile=TRUE)
quick.map(localstats2$SDF,"PurPrice_Median","1000\'s UK Pounds","Geographically Weighted Median House Price")
Small multiples might also be a good way to compare this to the geographically weighted mean:
par(mfrow=c(1,2))
quick.map(localstats2$SDF,"PurPrice_Median","1000\'s UK Pounds","Geographically Weighted Median House Price")
quick.map(localstats2$SDF,"PurPrice_LM","1000\'s UK Pounds","Geographically Weighted Mean House Price")
The side-by-side comparison also works well here - probably Tufte's best idea (runner-up being his thoughts on PowerPoint: http://www.edwardtufte.com/tufte/powerpoint ). Here we see similar patterns, but with lower median values than means in north east England. Also, more generally, medians are lower than means (look at the legends).
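One way to look at this mean-median gap more directly (a sketch, not part of the original practical) is to add a difference column to the results object and re-use `quick.map` - positive values flag places with a pronounced upper tail:

localstats2$SDF$PurPrice_MMDiff <- localstats2$SDF$PurPrice_LM - localstats2$SDF$PurPrice_Median
par(mfrow=c(1,1))
quick.map(localstats2$SDF,"PurPrice_MMDiff","1000's UK Pounds","GW Mean Minus GW Median")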
Finally, look at the other two robust geographically weighted summary statistics, the interquartile range and the quantile imbalance.
par(mfrow=c(1,2))
quick.map(localstats2$SDF,"PurPrice_IQR","1000\'s UK Pounds","Geographically Weighted Interquartile Range")
quick.map(localstats2$SDF,"PurPrice_QI","1000\'s UK Pounds","Geographically Weighted Quantile Imbalance")
The IQRs are quite similar to the standard deviations. The quantile imbalance is perhaps rather different - although the extreme levels in the north east might be associated with the notably low median house prices there compared to the means.
Having created these images you may well want to save them for use in an article or report. There are at least two ways of doing this. One is simply to cut and paste them into a Word document: on the window where the map is drawn, just right-click, select 'Copy as Metafile' and paste into an open Word document. Alternatively, you can use `dev.copy2pdf`, a function that copies the current window into a PDF file whose name is specified. For example, try
quick.map(localstats2$SDF,"PurPrice_Median","1000\'s UK Pounds","Geographically Weighted Median House Price")
dev.copy2pdf(file='median.pdf')
You should find a file called `median.pdf` in your current working directory. Opening this will show the image you have just created. You can also insert this into documents. On the whole, the second approach is better if you want to use the images in any system other than MS Office - although if you don't expect to share the images any further, the cut-and-paste option is possibly more convenient.
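A variant on this (also standard R) is to open a PDF device before drawing, so the figure is written straight to file at a size you control - the file name and dimensions here are just examples:

pdf('median2.pdf', width=6, height=8)  # open a PDF device (size in inches)
quick.map(localstats2$SDF,"PurPrice_Median","1000's UK Pounds","Geographically Weighted Median House Price")
dev.off()                              # close the device to finish writing the file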
Further Suggestions: Try producing local statistics maps with alternative bandwidths. You can still use `quick.map` to draw the maps, but just specify `bw=100000` (say) for a 100km bandwidth when you call `gwss`. Also, try using adaptive bandwidths - put `adaptive=TRUE` in the call to `gwss`, and set `bw` to a number of nearest neighbours. This means the window alters in size so that it always contains a fixed number of points, rather than having a fixed radius - so in denser urban areas the radius tends to be smaller, and more rapid spatial variation can be identified. As an example, for the nearest 50 neighbours, the command is `localstats3 <- gwss(houses.spdf,vars=c("PurPrice","FlrArea"),bw=50,adaptive=TRUE)`. You can then go on to plot local statistics with `quick.map`, as sketched below.
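For instance, the following sketch computes the adaptive-bandwidth statistics and maps the resulting local mean (the map titles are just suggestions):

localstats3 <- gwss(houses.spdf,vars=c("PurPrice","FlrArea"),bw=50,adaptive=TRUE)
par(mfrow=c(1,1))
quick.map(localstats3$SDF,"PurPrice_LM","1000's UK Pounds","GW Mean (Adaptive: 50 Nearest Neighbours)")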
Brunsdon, C., A.S. Fotheringham, and M. Charlton. 2002. “Geographically Weighted Summary Statistics.” Computers, Environment and Urban Systems 26 (6): 501–24.
Ehrenberg, A. 1981. Data Reduction. Chichester: John Wiley.
Tufte, E. 1983. The Visual Display of Quantitative Information. Graphics Press.