Geographical Weighting and dplyr

Introduction

The aim here is to provide an approach to geographically weighted approaches that fits neatly into the ‘tidy data’ framework provided by the dplyr package and other related ones. Initially, this requires the tidyverse group of packages to be loaded from the library:

library(tidyverse)

The way this is achieved is to store the geographical weighting as a sparse matrix - but essentially in data frame (or tibble) format. Suppose we have a source data set \(S\) (with each record having a geographical location) and a target data set \(T\) (also with each record having a geographical location. Sometimes the source and target data sets are the same. Then there will be a distance between each element in \(S\) and each element in \(T\), giving \(n(S) \times n(D)\) distances. These can be stored in a three column data frame of the form:

Source data object ID	Target data object ID	Distance
SID001	TID001	1.5km
SID001	TID002	6.4km
\(\vdots\)	\(\vdots\)	\(\vdots\)

If \(S\) and \(T\) are both large data sets, then this may be a problem. However, here, the weights (which depend on the distances) are of more importance. The weights are a function of distance, as well as some paramter to control the window width. If we use distance-based weights to begin with, the parameter will be called \(h\) and will represent the radius within which the weights are non-zero.

Some examples are

Name	Formula when \(d<h\)	Formula when \(d \ge h\)	Slope when \(d \nearrow h\)
Box Car	\(1\)	\(0\)	\(-\infty\)
Linear	\(1 - \frac{d}{h}\)	\(0\)	\(-\frac{1}{h}\)
BiSquare	\(\left(1 - (\frac{d}{h})^2\right) ^2\)	\(0\)	\(0\)
BiCubic	\(\left(1 - (\frac{d}{h})^3\right) ^3\)	\(0\)	\(0\)
CosSquare	\(\cos^2 \left( \frac{\pi d}{2h} \right)\)	\(0\)	\(0\)

and their graphs are as below when \(h = 1\)

As seen in the table, not all of these are smooth when \(d=h\) - in particular the Box Car and Linear weighting functions are not. However, these are all weighting schemes for which the weight is zero once the the distance between geographical objects exceeds \(d\). This is where a sparse storage method is useful. These weights will be used for weighted sums, means and so on, and anything with a zero weight is excluded. Thus, to save space, the tabular form suggested above, but with weights rather than distances - and more specifically only non-zero weights will be used. Thus the table will be:

Source data object ID	Target data object ID	Weight
SID001	TID001	0.41
SID001	TID002	0.03
\(\vdots\)	\(\vdots\)	\(\vdots\)

with only rows for which the weight is non zero included. Omitted source/target ID combinations are assumed to have a weight of zero.

Here is some R code to create functions for these weighting schemes:

bisquare <- function(x,h) ifelse(x<h,(1 - (x/h)^2)^2,0)

bicube <- function(x,h) ifelse(x<h,(1 - (x/h)^3)^3,0)

boxcar <- function(x,h) ifelse(x<h,1,0)

linear <- function(x,h) ifelse(x<h,1-x/h,0)

cossquare <- function(x,h) ifelse(x < h, cos(x*pi/(h*2))^2, 0)

Next, here is a helper function to create a sparse weight table, given two ‘simplefeatures’ (sf) objects. You will need to specify the ID variables and the name of each of the objects, and the winsow size \(h\). Don’t worry too much about how it works for now. The function wt_tbl will be treated as a method (here for example it is just defined for sf objects, but later a matrix method will be defined).

wt_tbl <- function(x,...) UseMethod("wt_tbl") 

wt_tbl.sf <- function(sfx,sfy,idx,idy,
                           wt=wt,dmax=Inf,wfun = bisquare, h, ...) {
  idx <- enquo(idx)
  idy <- enquo(idy)
  wt =enquo(wt)
  dmat <- st_distance(sfx,sfy,...) %>% units::drop_units()
  colnames(dmat) <- sfy %>% select(!!idy) %>% 
    st_set_geometry(NULL) %>% .[[1]]
  rownames(dmat) <- sfx %>% select(!!idx) %>%
    st_set_geometry(NULL) %>% .[[1]]
  res <- dmat %>% as_tibble(rownames=quo_name(idx)) %>% 
    gather(key=!!idy,value= !! wt, - !! idx) %>% filter(!! wt <= dmax)
  if (missing(h)) {
    kfun <- partial(wfun,h=dmax) }
  else {
    kfun <- partial(wfun,h=h)
  }
  res %>% mutate(!! wt := kfun(!! wt))
}

Next, it can be demonstrated. Read in some data relating to house prices and convert it to an sf object.

library(sf)

hp <- read_csv('hpdemo.csv') %>%
  rename(house_ID=ID) %>%
  mutate(price = price/1000, house_ID=sprintf('H%04d',house_ID)) %>%
  st_as_sf(coords=c('east','north'),crs=27700) 
head(hp)

Simple feature collection with 6 features and 3 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 514600 ymin: 169200 xmax: 547900 ymax: 189600
epsg (SRID):    27700
proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
# A tibble: 6 x 4
  house_ID price fl_area        geometry
  <chr>    <dbl>   <int>     <POINT [m]>
1 H0001    107        50 (523800 179700)
2 H0002     55.5      66 (533200 170900)
3 H0003    103        90 (514600 175800)
4 H0004    187       125 (516000 171000)
5 H0005     43        50 (533700 169200)
6 H0006     70.0      95 (547900 189600)

A straighforward map is now possible:

library(ggplot2)
library(ggspatial)
ggplot(hp,aes(col=price)) + 
  annotation_map_tile(type='osm',zoomin=-1) + geom_sf() +
  scale_color_viridis_c(trans='log10')

It is then possible to generate a grid of hexagons that cover the data relating to house sales. Here the house sale locations are plotted over the hexagons.

gl_hexes <- st_buffer(hp,500) %>%
  st_make_grid(cellsize=c(2000,2000),square=FALSE) %>% 
  st_sf() %>% mutate(hex_ID=sprintf('Hex%04d',row_number()))

ggplot(gl_hexes) +  
  annotation_map_tile(type='osm',zoom=10) + geom_sf(alpha=0.66) + geom_sf(data=hp)

Now, the wt_tbl function can be used to create the sparse weight representation. Here, the value for \(h\) (effectively dmax) is set at 10km. As the units are metres, this is specified as 10,000.

weights <- wt_tbl(hp,gl_hexes,house_ID,hex_ID,dmax=10000)
head(weights)

# A tibble: 6 x 3
  house_ID hex_ID        wt
  <chr>    <chr>      <dbl>
1 H0026    Hex0001 0.000586
2 H0033    Hex0001 0.0371  
3 H0041    Hex0001 0.0160  
4 H0045    Hex0001 0.110   
5 H0067    Hex0001 0.672   
6 H0109    Hex0001 0.341

Using left joins, and group_by operations it is possible to compute weighted summary statistics and other similar objects. Suppose here we wish to compute weighted mean house prices on a hexagon-by-hexagon basis. Firstly, we left_join the data from hp to the weights. In this case, we don’t want the geometry column carried over (which it is by default even when it is not specified by select) so this is stripped by the st_set_geometry function.

w_prices <- weights %>%
  left_join(hp %>% 
    select(house_ID,price) %>% 
    st_set_geometry(NULL))

Joining, by = "house_ID"

head(w_prices)

# A tibble: 6 x 4
  house_ID hex_ID        wt price
  <chr>    <chr>      <dbl> <dbl>
1 H0026    Hex0001 0.000586  67.5
2 H0033    Hex0001 0.0371    92  
3 H0041    Hex0001 0.0160    94  
4 H0045    Hex0001 0.110    142. 
5 H0067    Hex0001 0.672     52  
6 H0109    Hex0001 0.341    148

Now, for any given hex_ID we can select out the associated set of house prices and weights, for a geographically weighted summary statistic. The full list contains any house within a 10km radius of the hexagon labelled Hex0026.

w_prices %>% filter(hex_ID=="Hex0026") %>% head()

# A tibble: 6 x 4
  house_ID hex_ID      wt price
  <chr>    <chr>    <dbl> <dbl>
1 H0004    Hex0026 0.148    187
2 H0009    Hex0026 0.193    104
3 H0011    Hex0026 0.139     88
4 H0016    Hex0026 0.0136   185
5 H0017    Hex0026 0.471    152
6 H0020    Hex0026 0.0620   135

More generally, if we group by hex_ID we can calculate a weighted summary for all houses within the window surrounding each hexagon:

gwm_prices <- w_prices %>% group_by(hex_ID) %>% summarise(gwm_price=sum(wt*price)/sum(wt))
head(gwm_prices)

# A tibble: 6 x 2
  hex_ID  gwm_price
  <chr>       <dbl>
1 Hex0001     115. 
2 Hex0002     117. 
3 Hex0003      92.4
4 Hex0004     118. 
5 Hex0005     103. 
6 Hex0006      99.3

This is a list of geographically weighted mean house prices, for each hexagon. We can now re-join this to the gl_hexes object and map it:

gl_hexes2 <- gl_hexes %>% left_join(gwm_prices)
ggplot(gl_hexes2,aes(fill=gwm_price)) + 
  annotation_map_tile(type='osm',zoom=10) + 
  geom_sf(alpha=0.5) + 
  scale_fill_viridis_c(trans='log10')

Geographical Weighting and `dplyr`

Chris Brunsdon

11 December 2018

Introduction