Having seen a talk about spatial interaction models, in this session you will see how to fit one using R. The data you will work with relates to Groningen.
There are a number of census units in and around Groningen.
In addition, there are a number of stores, together with information relating to their floor areas.
A third data set relates to the administrative areas, but as a set of points - this is useful to compute ‘crow fly’ distances between the administrative areas and the shops. These data sets are called:
gron_pts - label points for the administrative areas
gron_polys - polygons for the administrative areas
gron_shops - point locations for the stores

All the data can be downloaded and read as below:
### Load some useful packages
library(tidyverse)
library(sf)
library(tmap)
options(stringsAsFactors = FALSE)
### First the polys for the admin areas
### Download the zipfile (of the shapefile collection)
download.file("https://www.dropbox.com/s/8yz4r73v9zmugu7/groningen_polys.zip?dl=1","groningen_polys.zip")
### Unzip it
unzip("groningen_polys.zip")
### Read it in
gron_polys <- st_read(".","groningen_polys")
### Repeat for the shops
download.file("https://www.dropbox.com/s/t6mjbycebknzjd4/shops_pts.zip?dl=1","shops_pts.zip")
unzip("shops_pts.zip")
gron_shops <- st_read(".","shops_pts")
### Repeat for the admin area label points
download.file("https://www.dropbox.com/s/frriyfou3ujj998/groningen_pts.zip?dl=1","groningen_pts.zip")
unzip("groningen_pts.zip")
gron_pts <- st_read(".","groningen_pts")
This code downloads each zipped shapefile into your working folder, unzips it, and then reads it into an R object - for each of the three data sets in turn.
You can now use tmap commands to view them. The code below will create a basic plot of the areas and the shop locations:
tm_shape(gron_polys) + tm_borders(col='indianred') +
tm_shape(gron_shops) +
tm_dots(size='Floorspace',col='dodgerblue') +
tm_layout(legend.position=c("right","bottom"))
We can also list the data in the objects - here is gron_shops:
gron_shops
and gron_pts
gron_pts
The variables here are the population (AANT_INW) and the potential spending for each administrative unit (Money) - the latter is a simple estimate of €20 per person. The polygons object contains the same information - it just uses polygon geometry instead of points.
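Since Money is described as a flat €20 per person, a quick sanity check is possible - an optional aside, and the exact construction of the column is an assumption here, so a FALSE would just mean the estimate was built slightly differently:

### If Money really is 20 euros per person, this should return TRUE
all(gron_pts$Money == 20 * gron_pts$AANT_INW)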
Finally, a useful tool for viewing data is to draw the map on an interactive backdrop. This can be done by setting tmap_mode to view instead of plot:
tmap_mode('view')
tm_shape(gron_polys) + tm_borders(col='indianred', lwd=2)
The data we currently have tells you where the stores are, and where the population is located. It also tells you how much spending power is in each place, and how attractive each store is (measured by floor space). To predict interaction, the final piece in the jigsaw is the distance between the stores and the population. In broad terms, the interaction between place \(i\) and store \(j\) (denoted here as \(I_{ij}\)) is modelled as
\[ I_{ij} = f(P_i,A_j,d_{ij}) \]
where \(P_i\) is the productiveness in place \(i\), \(A_j\) is the attractiveness of store \(j\) and \(d_{ij}\) is the distance between the two. Here \(P_i\) is measured with the Money item, and \(A_j\) with the Floorspace item. So we need to compute the distances. One way of doing this is to extract the point coordinates from gron_pts and gron_shops and compute the straight line (crow fly) distances between each pair, using Pythagoras’ theorem. The following function does this.
get_dists <- function(pts1,pts2) {
  ### Extract the coordinates of each point set as an n x 2 matrix
  c1 <- st_coordinates(pts1)
  c2 <- st_coordinates(pts2)
  ### Pythagoras: squared x-differences plus squared y-differences, for every pair
  res <- outer(c1[,1],c2[,1],'-')^2 + outer(c1[,2],c2[,2],'-')^2
  return(sqrt(res))
}
This defines a new R function that takes a pair of point sf objects and computes a distance matrix. At this stage we don't have the distances actually computed - we have just created a function to do the job. If you can't see how this works immediately, don't worry too much - and skip the following part - but if you are interested, the details are below:
st_coordinates extracts the coordinates from each point object, storing them in an \(n \times 2\) matrix.
outer takes each column of the two matrices in turn and creates a matrix of pairwise differences for the \(x\) and \(y\) coordinates respectively - see the small demonstration below.
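If outer is unfamiliar, here is a tiny demonstration with made-up numbers - outer(a, b, '-') returns a matrix whose entry in row \(k\), column \(l\) is a[k] - b[l]:

### Toy x-coordinates for three 'shops' and two 'areas' (made-up numbers)
x1 <- c(0, 3, 6)
x2 <- c(1, 5)
### All pairwise differences: row k, column l holds x1[k] - x2[l]
outer(x1, x2, '-')
##      [,1] [,2]
## [1,]   -1   -5
## [2,]    2   -2
## [3,]    5    1

Squaring and adding two such matrices (one for \(x\), one for \(y\)) and taking the square root is exactly the Pythagoras calculation inside get_dists.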
Now we have a function to calculate distances - so let's actually do this. The following computes the distance matrix for gron_shops and gron_pts, and gives names to the appropriate rows and columns of the matrix. Note the division of the distances by 1000 - this converts them from metres to kilometres, which is a more appropriate scale for this model.
### Compute the distances
dmat <- get_dists(gron_shops,gron_pts)/1000
### Name the rows and columns nicely
rownames(dmat) <- gron_shops$shopcode
colnames(dmat) <- gron_pts$BU_CODE
# Print the top left hand corner
dmat[1:4,1:4]
## BU00140000 BU00140001 BU00140002 BU00140003
## AR000802617 2.6211707 2.1282047 2.6632298 2.3322780
## AR000935740 0.9472168 0.4911816 1.1519708 0.7369386
## AR000818293 0.3018875 0.3090520 0.7729367 0.4043062
## AR000809760 0.8388541 0.9243705 0.2415260 1.4010140
Now we have a distance matrix, but to merge it with the rest of the information it would be better to have it in tidy format - essentially a data frame with three columns: the store ID, the administrative area ID, and the distance between the two. This is useful because it may then be used as a look-up table, for a given store or a given administrative area, to find distances. To do this, first convert the matrix into a data frame, and add a variable column based on the row names - this gives the shop code. Here we use tidyverse-style pipelining to 'mutate' the data frame.
dmat %>% as_data_frame %>% mutate(shopcode=rownames(dmat)) -> dist_tab
head(dist_tab)
This is still not in the format mentioned earlier, as each origin has its own column. To re-shape the data into three columns the gather command is needed. This takes data spread over several columns (here all of the origins) and ‘gathers’ them into a single column. The column names get a new variable, and the distance values are stacked.
This is shown below.
dist_tab %>% gather(key=BU_CODE,value=Dist,-shopcode) -> dist_tab
head(dist_tab)
Note that this approach creates two new columns - one holding the old column names, and one holding the values. These are named by the key and value parameters. The final -shopcode parameter tells gather not to include this column in the gathering process, but to carry it forward into the modified data frame.
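As a small illustration (with a made-up table, not the workshop data), this is what gather does to a wide-format data frame:

### A made-up wide table: one row per shop, one column per area
toy <- data.frame(A1=c(1.2,0.7), A2=c(0.4,2.2), shopcode=c("S1","S2"))
toy %>% gather(key=BU_CODE, value=Dist, -shopcode)
##   shopcode BU_CODE Dist
## 1       S1      A1  1.2
## 2       S2      A1  0.7
## 3       S1      A2  0.4
## 4       S2      A2  2.2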
Next, we want to join the attractiveness and productiveness variables onto this list, using the database-style join provided by the left_join function. First, join the attractiveness information from gron_shops:
dist_tab %>% left_join(gron_shops) %>% select(-geometry) -> dist_tab
head(dist_tab)
NOTE: You may get an error here if you have recently been using the MASS package - this also has a function called select that does a different thing. When more than one loaded package has a function with the same name, the default one called may be the wrong one. Here you need the dplyr function select, not the MASS one - and this could have caused the error. To make this unambiguous, use the following instead:
dist_tab %>% left_join(gron_shops) %>% dplyr::select(-geometry) -> dist_tab
head(dist_tab)
Using dplyr::select instead of just select makes it clear which select you mean. The select(-geometry) is used because we want to create a straightforward data frame here, not a geographical object. We can add the administrative area information in the same way.
dist_tab %>% left_join(gron_pts) %>% select(-geometry) -> dist_tab
head(dist_tab)
So now dist_tab has a series of rows corresponding to each \(i,j\) pair, with the values of \(d_{ij}\), \(P_i\) and \(A_j\) among the variables:

Dist - the distance \(d_{ij}\)
Money - the productiveness \(P_i\)
Floorspace - the attractiveness \(A_j\)

We can now actually carry out the modelling. As an aside, the above work is a typical example of what some data scientists term data munging - basically not doing analysis, but re-shaping data into a format that will be helpful for the analysis. In the practical here we worked through this step by step, but in practice one might specify the entire process as a single pipeline:
dmat %>% as_data_frame %>%
mutate(shopcode=rownames(dmat)) %>%
gather(key=BU_CODE,value=Dist,-shopcode) %>%
left_join(gron_shops) %>% select(-geometry) %>%
left_join(gron_pts) %>% select(-geometry) -> dist_tab
Before, we noted that the spatial interaction model takes the form \(I_{ij} = f(P_i,A_j,d_{ij})\) - here we will use a specific, multiplicative form:
\[
I_{ij} = \textsf{const}_i \times A_j^\alpha g(d_{ij})
\] where \(g\) is a decaying function of distance. Note that the productiveness does not appear explicitly here - in fact it features as a constraint: for each administrative area, the total money going to all of the stores must add up to the total money for that area. This is achieved by multiplying the attractiveness and distance-deterrence terms by a constant for each area - the constant re-scales the amount spent so that the total for the area equals the variable Money for that area.
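To make this concrete (a small derivation, using the notation above): summing over all stores \(j\) and requiring \(\sum_j I_{ij} = P_i\) fixes the constant for area \(i\) as

\[ \textsf{const}_i = \frac{P_i}{\sum_j A_j^\alpha g(d_{ij})} \]

so each store receives the area's Money multiplied by its share of the combined attractiveness and distance-deterrence term - exactly the rescaling implemented later via Flow_Factor/sum(Flow_Factor).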
The form that \(g\) takes here will be that of exponential decay. Often this is written as \(e^{-\textsf{const} \times d_{ij}}\) but here we will use the form
\[ g(d_{ij}) = 2^{-d_{ij}/h} \] where \(h\) is the distance half-life. That is, for every \(h\) kilometres between the store and the home location, the desirability of making the trip halves. Although this can be written in the more conventional exponential form as \(e^{-\log(2)d_{ij}/h}\), the parameter \(h\) is more easily interpreted. Thus the full model is
\[ I_{ij} = \textsf{const}_i A_j^\alpha 2^{-d_{ij}/h} \]
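As a quick numerical check of the half-life idea (using an illustrative \(h\) of 0.2km), each additional \(h\) kilometres halves \(g\):

### g at 0, h, 2h and 3h kilometres, with h = 0.2
2^(-c(0, 0.2, 0.4, 0.6)/0.2)
## [1] 1.000 0.500 0.250 0.125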
Having arrived at the equation, we now need to compute it and add a column to the dist_tab data frame to hold the modelled \(I_{ij}\) values. These will be the estimated expenditure of the population of area \(i\) at store \(j\). As a starting point, assume a fairly strong distance deterrence with a distance half-life of 200m. We can change this later.
The first part of the model (before applying the production constraint) is to compute \(A_j^\alpha 2^{-d_{ij}/h}\). \(\alpha\) here allows for extra attractiveness of the larger stores. In this case it is assumed to be quite close to 1 - say 1.05 - so that larger stores have a slightly disproportionate increase in attractiveness, but the effect is fairly weak.
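To get a feel for how weak this effect is, compare two illustrative stores, one with double the floor space of the other:

### Attractiveness ratio of a 2000 m^2 store to a 1000 m^2 store
(2000/1000)^1.05
## [1] 2.07053

So doubling the floor space gives about 2.07 times the attractiveness, rather than exactly double.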
dist_tab %>%
mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/0.2)) -> sim_tab
head(sim_tab)
This has the Flow_Factor variable proportional to the cash flow between \(i\) and \(j\), but as yet it hasn't been scaled. To do this, we first rescale the flow factors for each administrative area so that they sum to one. Then each is multiplied by the Money variable for that area, so that the total modelled flow of cash equals the total money coming from that area. This is done using the sum function: if the sim_tab data frame is first grouped by the administrative unit code (using group_by(BU_CODE)), then the sum for each row applies to the values for its specific BU_CODE, rather than to the entire column. Thus
sim_tab %>% group_by(BU_CODE) %>%
mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) ->
sim_tab
head(sim_tab)
Thus we now have a data frame whose rows are the details of each \(i\) and \(j\) combination, including the modelled cash flow. We can first check that the constraint has worked:
sim_tab %>% group_by(BU_CODE) %>% summarise(Out_Flow = sum(Cash_Flow), Money=first(Money))
Here the summarise function creates a summary data frame, where statistics such as the sum or mean are computed for each group. Thus Out_Flow is the sum of all of the cash flows coming out of each administrative area. Note that the Money variable is the same for every member of a group in sim_tab, so taking the first is just a way of summarising several repeated values with a single instance of that value. The output shows that the summed modelled cash flow from each administrative area is the same as the Money for that area. The constraint has worked. Phew!
Now we are happy with the model, we may want to see what happens for different half-life distances. As with the data munging, we can express the modelling process as a single pipeline. But to save time we can also put it into a function, with the half-life as an argument.
fit_sim <- function(h) {
dist_tab %>%
mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/h)) %>%
group_by(BU_CODE) %>%
mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) -> res
return(res)
}
head(fit_sim(0.75))
We now have a data frame sim_tab that tabulates the results of the spatial interaction model. The next step is to investigate the geographical patterns associated with this. Suppose we are mainly concerned with the AH/Albert Heijn group of stores. One starting point might be to discover the total cash flows predicted into each of these stores. This is a two-stage process: first filter the data to these stores (based on text patterns in the shop name), and then group by store and sum using the summarise function. The pipeline is as below:
sim_tab %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(shopcode) %>%
summarise(InFlow=sum(Cash_Flow)) -> inflow_tab
inflow_tab
Here grepl provides a logical grep search - grep takes its name from the ed editor command g/re/p ('globally search for a regular expression and print') and is a tool for text matching based on character patterns. More details are here: https://en.wikipedia.org/wiki/Grep - but for now, it is enough to note that 'AH|Heijn' picks out shop names with either 'AH' or 'Heijn' in them, as the quick check below shows.
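The pattern can be tested quickly on a few made-up shop names:

### Made-up shop names to illustrate the pattern matching
grepl('AH|Heijn', c("AH Centrum", "Albert Heijn", "Jumbo"))
## [1]  TRUE  TRUE FALSE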
However, although inflow_tab tabulates the influx of cash, it doesn't show a map. To get this, first join it to the gron_shops geographical object and then use tmap to create a map.
gron_shops %>% left_join(inflow_tab) -> inflow_pts
tm_shape(inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7)
This depends on the distance half-life. Once again we can encapsulate all of this in a function with \(h\) as the argument, making it easier to explore:
map_inflow <- function(h) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(shopcode) %>%
summarise(InFlow=sum(Cash_Flow)) -> inflow_tab
gron_shops %>% left_join(inflow_tab) -> inflow_pts
tm_shape(inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7)
}
map_inflow(0.5)
The above shows the effect if \(h\) is 500m - now check it for 1km:
map_inflow(1.0)
In a nutshell, as the distance half-life gets larger, people are willing to travel further to stores, and the more attractive larger stores get a greater share of the money - people will travel to a more favourable store even when it is a long way away.
The idea of a catchment area here is to identify, for each administrative unit, the most attractive store. This is done by considering the cash flow out of each unit to each of the stores, and seeing which store gets the greatest share of the total outflow from the unit. In R (and in tidyverse terms) this is done using the rank function. This takes a variable and, for each row, returns the rank of that row's value among all of the rows. The ranks are in ascending order, so for \(n\) items the largest has the value \(n\).
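A minimal illustration with made-up values (the sign flip shown here is explained next):

x <- c(10, 30, 20)
rank(x)    ### ascending - the largest value gets rank 3
## [1] 1 3 2
rank(-x)   ### sign flipped - the largest value now gets rank 1
## [1] 3 1 2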
To find the ranks of the outflow shares on an administrative unit by administrative unit basis, group_by is a helper function again. In this case it causes the rank function to return the relative ranks of the flow factors (\(A_j^\alpha 2^{-d_{ij}/h}\)) within each group. Finally, it would be useful to have the ranks in descending order - you are interested in the maximum in each group, and rather than trying to find ranks of \(n\) when you don't know what \(n\) is, it is easier to find ranks of 1. Changing the sign of the variable given to the rank function achieves this - essentially it flips the ordering. Check the results here:
sim_tab %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor)) -> rank_tab
rank_tab
The interactions with the highest rank for each administrative unit can then be found by filtering out those for which R is equal to 1.
rank_tab %>% filter(R==1) -> rank_tab
head(rank_tab)
There is now a unique row for each administrative unit, and the shop appearing in that row is the one to which most money flows - this determines which catchment area the administrative unit is in. To map these, first join this table to the gron_polys object:
gron_polys %>% left_join(rank_tab) -> catch_polys
This can then be mapped, using the shopcode as a colouring variable - making the catchments visible:
tm_shape(catch_polys) + tm_polygons(col='shopcode')
A further useful technique is to create dissolved catchment polygons. This essentially involves removing the boundaries between administrative unit polygons assigned to the same store. It is done by applying a group_by operation to a simple features object (here, catch_polys) and then a summarise - this merges all of the polygons in the same group. Here there is no summary variable - we just do this to make the merge:
catch_polys %>% group_by(shopcode) %>% summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')
As before - all of this could be made into a small number of larger pipelines and encapsulated in a function of \(h\):
map_catchments <- function(h) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor)) %>%
filter(R==1) -> rank_tab
gron_polys %>%
left_join(rank_tab) %>%
group_by(shopcode) %>%
summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')
}
map_catchments(0.1)
Now you can again experiment with different \(h\) values - for example 750m:
map_catchments(0.75)
Suppose now a new store is proposed, with a floor space of 5000 square metres - a fairly large store. How can the effect of this be allowed for? The main way to see the impact is to add the new store to the gron_shops spatial layer, and re-run the model with this extra store. The result can then be compared to the original model. To see this, first download a new spatial data object called proposed:
download.file("https://www.dropbox.com/s/qjzxl0cr7ehtrti/proposed.zip?dl=1","proposed.zip")
unzip("proposed.zip")
proposed <- st_read(".","proposed")
## Reading layer `proposed' from data source `/Users/geoaco/Desktop/my_docs_mac/leeds_work/research/groningen/Groningen/GroningenRworkshop/Session 6' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 3 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 235494.2 ymin: 580729.9 xmax: 235494.2 ymax: 580729.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
Then, have a look at its location - the proposed store is in red.
tm_shape(gron_shops) + tm_dots(col='dodgerblue',size='Floorspace') + tm_shape(proposed) + tm_dots(col='indianred',size='Floorspace')
To re-run the model, using a function is again helpful. Below is one that encapsulates all of the steps in producing a model, given a shops object, an administrative areas object and \(h\). The function looks intimidating, but really it is just a combination of the operations done above - the data munging and the model fitting.
new_model <- function(admin,shops,h) {
  ### Distance matrix (in km) between stores and admin areas
  dmat <- get_dists(shops,admin)/1000
  rownames(dmat) <- shops$shopcode
  colnames(dmat) <- admin$BU_CODE
  ### Data munging: reshape to tidy format and join the attributes
  dmat %>% as_data_frame %>%
    mutate(shopcode=rownames(dmat)) %>%
    gather(key=BU_CODE,value=Dist,-shopcode) %>%
    left_join(shops) %>% select(-geometry) %>%
    left_join(admin) %>% select(-geometry) -> dist_tab
  ### Model fitting: flow factors, then the production constraint
  dist_tab %>%
    mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/h)) %>%
    group_by(BU_CODE) %>%
    mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) -> res
  return(res)
}
test <- new_model(gron_pts,gron_shops,0.3)
head(test)
Here the function takes the inputs and returns a data frame with the fitted model values, in the format of sim_tab used earlier.
To re-fit the model with the new store, first create a new scenario by combining the proposed store with the existing ones in Groningen:
all_shops <- rbind(proposed,gron_shops)
Then update the simulation table:
new_sim_tab <- new_model(gron_pts,all_shops,0.2)
head(new_sim_tab)
We can now compute the inflows to all of the stores.
new_sim_tab %>% filter(grepl('AH|Heijn|New Store',shop)) %>%
group_by(shopcode) %>%
summarise(InFlow=sum(Cash_Flow)) -> new_inflow_tab
head(new_inflow_tab)
… and map the new inflows:
all_shops %>% left_join(new_inflow_tab) -> new_inflow_pts
tm_shape(new_inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7)
If we join the original inflow_tab to the new_inflow_tab, using the shop code as a merge key, we obtain a table with new and old cash inflow estimates, called InFlow.x and InFlow.y respectively.
new_inflow_tab %>% left_join(inflow_tab,by="shopcode") ->
diff_inflow_tab
diff_inflow_tab
The proposed store doesn't have an original inflow, so its InFlow.y value is missing (NA). Here it makes sense to set this to zero:
diff_inflow_tab$InFlow.y[is.na(diff_inflow_tab$InFlow.y)] <- 0
We can then compute the differences:
diff_inflow_tab %>% mutate(Diff=InFlow.x - InFlow.y) -> diff_inflow_tab
diff_inflow_tab
Finally, to map this with proportional symbols - since some values are positive and others negative - create an absolute-value-of-change column, and then map it:
diff_inflow_tab %>% mutate(AbsDiff=abs(Diff)) -> diff_inflow_tab
all_shops %>% left_join(diff_inflow_tab) -> diff_inflow_pts
tm_shape(diff_inflow_pts) + tm_dots(size='AbsDiff',col='Diff',alpha=0.7) + tm_style_col_blind()
This helps to identify the stores that will experience the most impact if a new store is opened as proposed.
As an exercise, try to create a catchment map for the revised store layout.
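If you get stuck, below is one possible sketch - it simply glues new_model together with the catchment mapping steps above, and assumes all_shops and the 'AH|Heijn|New Store' name pattern from earlier are still in your workspace:

map_catchments_new <- function(h) {
  ### Fit the model for the revised store layout
  new_model(gron_pts, all_shops, h) %>%
    filter(grepl('AH|Heijn|New Store', shop)) %>%
    group_by(BU_CODE) %>%
    mutate(R = rank(-Flow_Factor)) %>%
    filter(R == 1) -> rank_tab
  ### Join to the polygons and dissolve into catchments
  gron_polys %>%
    left_join(rank_tab) %>%
    group_by(shopcode) %>%
    summarise -> catch_polys
  tm_shape(catch_polys) + tm_polygons(col = 'shopcode')
}
map_catchments_new(0.5)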
Most of the basic kinds of analysis are covered in the first 5 sections - here are a couple of fancier ideas that may be useful. Each of these is shown as a larger pipeline (or sequence of pipelines) wrapped in a function - it may be an interesting exercise to look over the function listings and figure out how they work.
The first of these generalises the idea of catchments to find second choice stores - even if one of the stores gets the largest share of cash flow from an administrative unit, it won’t necessarily get all of the cash. This function demonstrates how to map second choice catchment areas:
map_catchments_2 <- function(h) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor)) %>%
filter(R==2) -> rank_tab
gron_polys %>%
left_join(rank_tab) %>%
group_by(shopcode) %>%
summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')
}
map_catchments_2(0.5)
More generally, you can map \(k\)th choice catchments for any rank:
map_catchments_k <- function(h,k=1) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor)) %>%
filter(R==k) -> rank_tab
gron_polys %>%
left_join(rank_tab) %>%
group_by(shopcode) %>%
summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')
}
map_catchments_k(0.5,3)
Another variant is mapping catchments in a fuzzier way. Here again, note that although each administrative unit has a 'most favoured' store, the share that this store gets of the total cash flow varies. This function maps the standard catchment areas, but varies the intensity of shading to show how dominant the most favoured store actually is:
map_catchments_fuzz <- function(h) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor), Share=Flow_Factor/sum(Flow_Factor)) %>%
filter(R==1) -> rank_tab
gron_polys %>%
left_join(rank_tab) -> catch_polys1
catch_polys1 %>% group_by(shopcode) %>% summarise -> catch_polys2
tm_shape(catch_polys1) + tm_polygons(col='Share') + tm_shape(catch_polys2) + tm_borders(col='darkblue')
}
map_catchments_fuzz(0.75)
Here, places nearest to the boundaries between catchments tend to have far lower shares.