## The thinking behind this tutorial

These practicals combine explanatory text with code examples. All code examples are boxed and have a light yellow background. Text output from R is also boxed, and graphical output is shown ‘in line’. The idea is to copy the text in the code boxes into a running R session; you can compare your results with the output shown here to see whether you are on the right track. Don’t be afraid to experiment. It may not always work, but playing around with some of the function parameters will, for example, change aspects of the analysis or of the graphical presentation. It is assumed that you enter the code from the boxes in the same order as it appears here: some of the later boxes depend on what was done earlier, so skipping ahead might lead to errors. The help facility (putting a question mark in front of a command, for example ?plot, and hitting return) is also a good way to discover things.

To get started with this tutorial, first load the GWmodel package.

library(GWmodel)

## Review of Summary Statistics

Summary statistics are basic statistics used to summarise a large data set. For example, when looking at a data set of, say, 10000 house prices, you might calculate the mean to obtain an idea of what a typical house price might be. Similarly, you may use the standard deviation to see the extent to which house prices spread around this mean. Finally (and perhaps a bit more obscure) you can use the skewness to measure the symmetry of the distribution (i.e. is there a long upper or lower tail, or are values fairly evenly distributed around the mean?). Generally these are useful, although not comprehensive, techniques for what is sometimes called ‘data reduction’ (Ehrenberg 1981). They are useful in that, with a small number of quantities, it is possible to summarise not only typical values but also distributional properties of variables of interest in very large data sets.

Geographically weighted summary statistics are similar, but they apply summary statistics using a moving window, so that the above characteristics can be mapped as you move through different geographical regions in a data set. Thus, you can see whether the mean house price in Dublin is different to that in Maynooth (OK, so you probably know it is if you live near those places, but what about Barnet and Edgware, on the outskirts of London, for example?). Although mean levels are most often considered in this way, it is also possible to use moving windows to estimate geographical variations in the standard deviation, to see whether house prices are more variable in some places than others, or in local skewness, to see whether the lop-sidedness of the house price distribution changes from place to place.
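As a small illustration of the three global summary statistics described above, the sketch below computes them in base R for a handful of made-up prices (in thousands of pounds). The `skewness` helper is hypothetical and uses one common moment-based definition; other definitions differ slightly in how they correct for sample size.

```r
# Purchase prices in thousands of pounds (illustrative values only)
prices <- c(45, 50, 65, 105, 175, 250)

# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
skewness <- function(x) {
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

price_mean <- mean(prices)      # typical level
price_sd   <- sd(prices)        # spread around the mean
price_skew <- skewness(prices)  # positive values indicate a long upper tail

print(c(mean = price_mean, sd = price_sd, skewness = price_skew))
```

Here the skewness comes out positive, reflecting the long upper tail created by the two most expensive houses.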

Correlation is a useful bivariate summary statistic, as it measures the degree to which two variables are associated. The most commonly used measure of this is the Pearson correlation coefficient. Again, the idea with a geographically weighted correlation is to use a moving window approach to see whether this degree of association varies geographically. For example, in some places floor area may be strongly correlated with house price, but in others less so if, say, being of historical or cultural interest makes a smaller house more valuable than it would otherwise be.
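To make the moving window idea concrete, here is a minimal base R sketch of a locally weighted mean and a locally weighted Pearson correlation at a single location. It is not the GWmodel implementation: the Gaussian kernel, the `centre` location, the `bandwidth` value and the six data points are all assumptions chosen purely for illustration.

```r
# Illustrative data: six houses (coordinates in metres, prices in 1000s of pounds)
easting  <- c(599500, 575400, 530300, 524100, 426900, 508000)
northing <- c(142200, 167200, 177300, 170300, 514600, 190400)
price    <- c(65, 45, 50, 105, 175, 250)
flr_area <- c(78.9, 94.4, 41.3, 92.9, 200.5, 148.6)

# Assumed window: a Gaussian kernel centred on one location
centre    <- c(525000, 175000)  # hypothetical location of interest
bandwidth <- 50000              # hypothetical bandwidth, in metres

dist_to_centre <- sqrt((easting - centre[1])^2 + (northing - centre[2])^2)
w <- exp(-0.5 * (dist_to_centre / bandwidth)^2)  # nearer houses get larger weights

# Locally weighted mean price
local_mean <- sum(w * price) / sum(w)

# Locally weighted Pearson correlation; cov.wt() rescales the weights to sum to 1
local_fit <- cov.wt(cbind(flr_area = flr_area, price = price), wt = w, cor = TRUE)
local_cor <- local_fit$cor["flr_area", "price"]

print(c(local_mean = local_mean, local_cor = local_cor))
```

Repeating this calculation with `centre` moved across a grid of locations would give one local value per grid point, which is what makes the statistic mappable.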

The mean, standard deviation and skewness all have robust equivalents: the median, interquartile range and quantile imbalance respectively. These are robust in the sense that they are based on the sorted order of values. For the univariate summary statistics, if we sort a variable in ascending order, let the halfway point be $Q_2$, the first quarter point be $Q_1$ and the third quarter point be $Q_3$. Then the median is $Q_2$, the interquartile range is $Q_3 - Q_1$ and the quantile imbalance is $\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$. The last one may be less familiar, but it basically compares the gap between the median and the third quartile, $Q_3 - Q_2$, with the gap between the first quartile and the median, $Q_2 - Q_1$, and scales the difference by the interquartile range, leading to a measure of lop-sidedness. The measures are seen as robust because one or two very high or very low values don’t ‘throw’ the summary statistics, since they won’t alter the values of $Q_1$, $Q_2$ or $Q_3$. Again, these can be geographically weighted - see Brunsdon, Fotheringham, and Charlton (2002).
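The robustness claim can be checked with a short base R sketch. The `quantile_imbalance` and `robust_summary` helpers are hypothetical names introduced here, the prices are made up, and `quantile()` is used with its default interpolation rule, which is only one of several conventions for locating quartiles.

```r
# Hypothetical helper: quantile imbalance from the three quartiles
quantile_imbalance <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
  (q[3] - 2 * q[2] + q[1]) / (q[3] - q[1])
}

# Hypothetical helper: the three robust univariate summaries together
robust_summary <- function(x) {
  c(median = median(x), iqr = IQR(x), imbalance = quantile_imbalance(x))
}

prices         <- c(45, 50, 65, 105, 175, 250)   # illustrative, in 1000s of pounds
prices_outlier <- c(45, 50, 65, 105, 175, 2500)  # largest value inflated tenfold

# The robust summaries are identical for both vectors...
print(rbind(original = robust_summary(prices),
            outlier  = robust_summary(prices_outlier)))

# ...whereas the mean is pulled strongly upwards by the single extreme value
print(c(mean_original = mean(prices), mean_outlier = mean(prices_outlier)))
```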

Finally, Spearman’s rank correlation coefficient is a robust version of the Pearson coefficient. To compute it, each value of each variable is replaced by its rank: the smallest value of the first variable is replaced by 1, the next smallest by 2, and so on. The same is then done for the second variable, and the Pearson correlation coefficient is computed for these rank-based variables.
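The rank-based recipe can be verified directly in base R. The sketch below, using made-up floor areas and prices, computes the Pearson coefficient on the ranks and compares it with the `method = "spearman"` option of `cor()`.

```r
# Illustrative values: floor area (square metres) and price (1000s of pounds)
flr_area <- c(78.9, 94.4, 41.3, 92.9, 200.5, 148.6)
price    <- c(65, 45, 50, 105, 175, 250)

pearson          <- cor(flr_area, price)                       # ordinary Pearson
spearman_by_hand <- cor(rank(flr_area), rank(price))           # Pearson on ranks
spearman_builtin <- cor(flr_area, price, method = "spearman")  # built-in option

print(c(pearson = pearson,
        spearman_by_hand = spearman_by_hand,
        spearman_builtin = spearman_builtin))
```

The two Spearman values agree, confirming that the built-in option is the Pearson coefficient applied to ranks.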

The summary statistics that are geographically weighted are listed below:

| Statistic | What it Measures | Robust? | Bivariate or Univariate |
|---|---|---|---|
| Mean | Overall Level | No | Univariate |
| Skewness | Asymmetry | No | Univariate |
| Median | Overall Level | Yes | Univariate |
| Quantile Imbalance | Asymmetry | Yes | Univariate |
| Pearson Correlation | Association | No | Bivariate |
| Spearman Correlation | Association | Yes | Bivariate |

## The data

In this practical you will be working with house price data obtained from the Nationwide Building Society (NBS) in England. This is a sample of houses sold in 1991 with mortgages arranged by NBS. The data is in a data frame called ewhp. You will make some minor alterations, so this is copied to a new data frame, houses.

data(EWHP)
houses <- ewhp
head(houses)
  Easting Northing PurPrice BldIntWr BldPostW Bld60s Bld70s Bld80s TypDetch TypSemiD
1  599500   142200    65000        0        0      0      0      1        0        1
2  575400   167200    45000        0        0      0      0      0        0        0
3  530300   177300    50000        1        0      0      0      0        0        0
4  524100   170300   105000        0        0      0      0      0        0        0
5  426900   514600   175000        0        0      0      0      1        1        0
6  508000   190400   250000        0        1      0      0      0        1        0
TypFlat   FlrArea
1       0  78.94786
2       1  94.36591
3       0  41.33153
4       0  92.87983
5       0 200.52756
6       0 148.60773

From this, it can be seen that each house’s purchase price (PurPrice) and location (Easting, Northing) are recorded together with a number of other attributes of the house, for example its floor area in square metres (FlrArea).

House prices can also be divided by 1000, since this tends to make graphs, map keys and so on less cluttered, as well as avoiding some numerical rounding errors when doing some fairly complex calculations:

houses$PurPrice <- houses$PurPrice / 1000

Looking at the relation of floor area and purchase price can be done using the standard plot command in R:

plot(houses$FlrArea,houses$PurPrice, xlab='Floor Area',ylab='Purchase Price (1000\'s UK Pounds)')