dplyr
The aim here is to provide an approach to geographically weighted approaches that fits neatly into the ‘tidy data’ framework provided by the dplyr
package and other related ones. Initially, this requires the tidyverse
group of packages to be loaded from the library:
The way this is achieved is to store the geographical weighting as a sparse matrix - but essentially in data frame (or tibble) format. Suppose we have a source data set \(S\) (with each record having a geographical location) and a target data set \(T\) (also with each record having a geographical location. Sometimes the source and target data sets are the same. Then there will be a distance between each element in \(S\) and each element in \(T\), giving \(n(S) \times n(D)\) distances. These can be stored in a three column data frame of the form:
Source data object ID | Target data object ID | Distance |
---|---|---|
SID001 | TID001 | 1.5km |
SID001 | TID002 | 6.4km |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
If \(S\) and \(T\) are both large data sets, then this may be a problem. However, here, the weights (which depend on the distances) are of more importance. The weights are a function of distance, as well as some paramter to control the window width. If we use distance-based weights to begin with, the parameter will be called \(h\) and will represent the radius within which the weights are non-zero.
Some examples are
Name | Formula when \(d<h\) | Formula when \(d \ge h\) | Slope when \(d \nearrow h\) |
---|---|---|---|
Box Car | \(1\) | \(0\) | \(-\infty\) |
Linear | \(1 - \frac{d}{h}\) | \(0\) | \(-\frac{1}{h}\) |
BiSquare | \(\left(1 - (\frac{d}{h})^2\right) ^2\) | \(0\) | \(0\) |
BiCubic | \(\left(1 - (\frac{d}{h})^3\right) ^3\) | \(0\) | \(0\) |
CosSquare | \(\cos^2 \left( \frac{\pi d}{2h} \right)\) | \(0\) | \(0\) |
and their graphs are as below when \(h = 1\)
As seen in the table, not all of these are smooth when \(d=h\) - in particular the Box Car and Linear weighting functions are not. However, these are all weighting schemes for which the weight is zero once the the distance between geographical objects exceeds \(d\). This is where a sparse storage method is useful. These weights will be used for weighted sums, means and so on, and anything with a zero weight is excluded. Thus, to save space, the tabular form suggested above, but with weights rather than distances - and more specifically only non-zero weights will be used. Thus the table will be:
Source data object ID | Target data object ID | Weight |
---|---|---|
SID001 | TID001 | 0.41 |
SID001 | TID002 | 0.03 |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
with only rows for which the weight is non zero included. Omitted source/target ID combinations are assumed to have a weight of zero.
Here is some R code to create functions for these weighting schemes:
bisquare <- function(x,h) ifelse(x<h,(1 - (x/h)^2)^2,0)
bicube <- function(x,h) ifelse(x<h,(1 - (x/h)^3)^3,0)
boxcar <- function(x,h) ifelse(x<h,1,0)
linear <- function(x,h) ifelse(x<h,1-x/h,0)
cossquare <- function(x,h) ifelse(x < h, cos(x*pi/(h*2))^2, 0)
Next, here is a helper function to create a sparse weight table, given two ‘simplefeatures’ (sf
) objects. You will need to specify the ID variables and the name of each of the objects, and the winsow size \(h\). Don’t worry too much about how it works for now. The function wt_tbl
will be treated as a method (here for example it is just defined for sf
objects, but later a matrix method will be defined).
wt_tbl <- function(x,...) UseMethod("wt_tbl")
wt_tbl.sf <- function(sfx,sfy,idx,idy,
wt=wt,dmax=Inf,wfun = bisquare, h, ...) {
idx <- enquo(idx)
idy <- enquo(idy)
wt =enquo(wt)
dmat <- st_distance(sfx,sfy,...) %>% units::drop_units()
colnames(dmat) <- sfy %>% select(!!idy) %>%
st_set_geometry(NULL) %>% .[[1]]
rownames(dmat) <- sfx %>% select(!!idx) %>%
st_set_geometry(NULL) %>% .[[1]]
res <- dmat %>% as_tibble(rownames=quo_name(idx)) %>%
gather(key=!!idy,value= !! wt, - !! idx) %>% filter(!! wt <= dmax)
if (missing(h)) {
kfun <- partial(wfun,h=dmax) }
else {
kfun <- partial(wfun,h=h)
}
res %>% mutate(!! wt := kfun(!! wt))
}
Next, it can be demonstrated. Read in some data relating to house prices and convert it to an sf
object.
library(sf)
hp <- read_csv('hpdemo.csv') %>%
rename(house_ID=ID) %>%
mutate(price = price/1000, house_ID=sprintf('H%04d',house_ID)) %>%
st_as_sf(coords=c('east','north'),crs=27700)
head(hp)
Simple feature collection with 6 features and 3 fields
geometry type: POINT
dimension: XY
bbox: xmin: 514600 ymin: 169200 xmax: 547900 ymax: 189600
epsg (SRID): 27700
proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
# A tibble: 6 x 4
house_ID price fl_area geometry
<chr> <dbl> <int> <POINT [m]>
1 H0001 107 50 (523800 179700)
2 H0002 55.5 66 (533200 170900)
3 H0003 103 90 (514600 175800)
4 H0004 187 125 (516000 171000)
5 H0005 43 50 (533700 169200)
6 H0006 70.0 95 (547900 189600)
A straighforward map is now possible:
library(ggplot2)
library(ggspatial)
ggplot(hp,aes(col=price)) +
annotation_map_tile(type='osm',zoomin=-1) + geom_sf() +
scale_color_viridis_c(trans='log10')
It is then possible to generate a grid of hexagons that cover the data relating to house sales. Here the house sale locations are plotted over the hexagons.
gl_hexes <- st_buffer(hp,500) %>%
st_make_grid(cellsize=c(2000,2000),square=FALSE) %>%
st_sf() %>% mutate(hex_ID=sprintf('Hex%04d',row_number()))
ggplot(gl_hexes) +
annotation_map_tile(type='osm',zoom=10) + geom_sf(alpha=0.66) + geom_sf(data=hp)
Now, the wt_tbl
function can be used to create the sparse weight representation. Here, the value for \(h\) (effectively dmax
) is set at 10km. As the units are metres, this is specified as 10,000.
# A tibble: 6 x 3
house_ID hex_ID wt
<chr> <chr> <dbl>
1 H0026 Hex0001 0.000586
2 H0033 Hex0001 0.0371
3 H0041 Hex0001 0.0160
4 H0045 Hex0001 0.110
5 H0067 Hex0001 0.672
6 H0109 Hex0001 0.341
Using left joins, and group_by
operations it is possible to compute weighted summary statistics and other similar objects. Suppose here we wish to compute weighted mean house prices on a hexagon-by-hexagon basis. Firstly, we left_join
the data from hp
to the weights. In this case, we don’t want the geometry
column carried over (which it is by default even when it is not specified by select
) so this is stripped by the st_set_geometry
function.
Joining, by = "house_ID"
# A tibble: 6 x 4
house_ID hex_ID wt price
<chr> <chr> <dbl> <dbl>
1 H0026 Hex0001 0.000586 67.5
2 H0033 Hex0001 0.0371 92
3 H0041 Hex0001 0.0160 94
4 H0045 Hex0001 0.110 142.
5 H0067 Hex0001 0.672 52
6 H0109 Hex0001 0.341 148
Now, for any given hex_ID
we can select out the associated set of house prices and weights, for a geographically weighted summary statistic. The full list contains any house within a 10km radius of the hexagon labelled Hex0026
.
# A tibble: 6 x 4
house_ID hex_ID wt price
<chr> <chr> <dbl> <dbl>
1 H0004 Hex0026 0.148 187
2 H0009 Hex0026 0.193 104
3 H0011 Hex0026 0.139 88
4 H0016 Hex0026 0.0136 185
5 H0017 Hex0026 0.471 152
6 H0020 Hex0026 0.0620 135
More generally, if we group by hex_ID
we can calculate a weighted summary for all houses within the window surrounding each hexagon:
gwm_prices <- w_prices %>% group_by(hex_ID) %>% summarise(gwm_price=sum(wt*price)/sum(wt))
head(gwm_prices)
# A tibble: 6 x 2
hex_ID gwm_price
<chr> <dbl>
1 Hex0001 115.
2 Hex0002 117.
3 Hex0003 92.4
4 Hex0004 118.
5 Hex0005 103.
6 Hex0006 99.3
This is a list of geographically weighted mean house prices, for each hexagon. We can now re-join this to the gl_hexes
object and map it:
gl_hexes2 <- gl_hexes %>% left_join(gwm_prices)
ggplot(gl_hexes2,aes(fill=gwm_price)) +
annotation_map_tile(type='osm',zoom=10) +
geom_sf(alpha=0.5) +
scale_fill_viridis_c(trans='log10')