1. Introduction

Having seen a talk about spatial interaction models, in this session you will see how to fit one using R. The data you will work with relates to Groningen. There are a number of census units in and around Groningen:

In addition, there are a number of stores, together with information relating to their floor areas.

A third data set relates to the administrative areas, but as a set of points - this is useful to compute ‘crow fly’ distances between the administrative areas and the shops. These data sets are called:

2. Getting the data

All the data can be obtained from the file link in the schedule - the files grongingen_pts.zip, groningen_polys.zip and shops_pts.zip should be copied over. They can then be unzipped in R.

### Load some useful packages

library(tidyverse)
library(sf)
library(tmap)
options(stringsAsFactors = FALSE)

### First the polys for the admion areas


### Unzip it
unzip("groningen_polys.zip")
### Read it in
gron_polys <- st_read(".","groningen_polys")


### Repeat for the shops

unzip("shops_pts.zip")
gron_shops <- st_read(".","shops_pts")


#### Repeat for the admin areae label points
unzip("groningen_pts.zip")
gron_pts <- st_read(".","groningen_pts")

This code downloads the shapefiles into your working folder, unzips them, and then reads the files into a R objects - for each of those objects in turn.

You can now use tmap commnds to view them. THe code below will create a basic plot of the areas and the shop locations:

tm_shape(gron_polys) + tm_borders(col='indianred') + 
  tm_shape(gron_shops) +
  tm_dots(size='Floorspace',col='dodgerblue') +
  tm_layout(legend.position=c("right","bottom"))

We can also list the data in the objects - here are gron_shops:

gron_shops
## Simple feature collection with 55 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 229396.2 ymin: 578512.6 xmax: 237610.4 ymax: 585365.4
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## First 20 features:
##                                              shop    shopcode Floorspace
## 1                                    AH de Wijert AR000802617       1617
## 2                            AH to go - Groningen AR000935740        313
## 3                                    Albert Heijn AR000818293       1242
## 4                                    Albert Heijn AR000809760      45000
## 5                               Albert Heijn 1060 AR000805380       1007
## 6                               Albert Heijn 1093 AR000800472       1997
## 7                               Albert Heijn 1243 AR000806759        968
## 8                               Albert Heijn B.V. AR000925140       1075
## 9                               Albert Heijn B.V. AR000952195       1020
## 10                              Albert Heijn B.V. AR000928876      25478
## 11                     Albert Heijn BV Supermarkt AR000807891       1234
## 12                     Albert Heijn BV Supermarkt AR000802690       2601
## 13                     Albert Heijn BV Supermarkt AR000804383       1224
## 14                     Albert Heijn BV Supermarkt AR000801845       1655
## 15                     Albert Heijn BV Supermarkt AR000800362       2451
## 16     Albert Heijn Supermarkt F. v.d. Heide B.V. AR000801783       1611
## 17   Albert Heijn Supermarkt H.J. v.d. Heide B.V. AR000923136        587
## 18                    Aldi Drachten BV Supermarkt AR000913738       1155
## 19  Aldi Drachten BV Supermarkt (09 Groningen I ) AR000801270        804
## 20 Aldi Drachten BV Supermarkt (19 Groningen III) AR000801929        762
##                         geometry
## 1  POINT (233908.537 579461.836)
## 2  POINT (233676.262 581123.793)
## 3  POINT (233589.003 581776.368)
## 4  POINT (234473.769 582256.525)
## 5  POINT (233050.067 578512.555)
## 6  POINT (232841.819 583560.236)
## 7  POINT (233383.717 581733.449)
## 8  POINT (232282.014 583303.129)
## 9  POINT (233505.174 582641.248)
## 10 POINT (231731.778 584343.977)
## 11   POINT (233956.801 583452.4)
## 12 POINT (234682.313 579657.421)
## 13 POINT (234025.217 581620.656)
## 14 POINT (233790.316 582140.082)
## 15 POINT (231240.363 582681.696)
## 16 POINT (235975.156 582954.418)
## 17 POINT (237216.701 583396.861)
## 18 POINT (233539.907 580368.509)
## 19 POINT (233513.457 582916.442)
## 20 POINT (234096.467 578847.938)

and gron_pts

gron_pts
## Simple feature collection with 108 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 227645.8 ymin: 578286.9 xmax: 241791 ymax: 585952.8
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## First 20 features:
##       BU_CODE AANT_INW  Money                       geometry
## 1  BU00140000     4440  88800 POINT (233655.736099999 582...
## 2  BU00140001     6600 132000 POINT (233834.593699999 581...
## 3  BU00140002     3920  78400 POINT (234292.167199999 582...
## 4  BU00140003     1765  35300 POINT (233195.739500001 581...
## 5  BU00140004       10    200 POINT (233043.1413 582503.1...
## 6  BU00140005     5510 110200 POINT (233458.532000002 582...
## 7  BU00140007        0      0 POINT (234243.446400002 582...
## 8  BU00140008      385   7700 POINT (233644.9516 581186.7...
## 9  BU00140100     1500  30000 POINT (234872.4652 581396.8...
## 10 BU00140101     5240 104800 POINT (234504.669799998 581...
## 11 BU00140102     1560  31200 POINT (234377.353500001 580...
## 12 BU00140103     4530  90600 POINT (233820.244800001 580...
## 13 BU00140104     2240  44800 POINT (233359.370999999 580...
## 14 BU00140105      885  17700 POINT (233171.4256 581236.7...
## 15 BU00140106     3265  65300 POINT (232672.585000001 581...
## 16 BU00140107     1270  25400 POINT (232867.578899998 580...
## 17 BU00140108       20    400 POINT (232223.331900001 580...
## 18 BU00140109        0      0 POINT (232885.599100001 580...
## 19 BU00140200     2845  56900 POINT (232537.562399998 582...
## 20 BU00140201     3875  77500 POINT (232918.379000001 582...

The variables here arew the population AANT_INW - and the potential spending for each administrative unit Money - this is a simple estimate of €20 per person. The polygons object contains the same information - it just uses polygon geometry instead of point.

Finally, a useful tool for viewing data is to draw the map on an interactive backdrop. This can be done by setting tmap_mode to view instead of plot:

tmap_mode('view')
tm_shape(gron_polys) + tm_borders(col='indianred', lwd=2)

3. Thinking about distances

The data we currently have tells you where the stores are, and where the population is located. It also tells you how much spending power is in each place, and how attractive each store is (measured by floor space). To predict interaction the final peice in the jigsaw is the distance between the stores and the population. In broad terms, the interaction between place \(i\) and store \(j\) (annotated here is \(I_{ij}\)) is modelled as

\[ I_{ij} = f(P_i,A_j,d_{ij}) \]

where \(P_i\) is the productiveness in place \(i\), \(A_j\) is the attractiveness of store \(j\) and \(d_{ij}\) is the distance between the two. Here \(P_i\) is measured with the Money item, and \(A_j\) with the Floorspace item. So we need to compute the distances. One way of doing this is to extract the point coordinates from gron_pts and gron_shops and compute the straight line (crow fly) distances between each pair, using Pythagoras’ theorem. The following function does this.

get_dists <- function(pts1,pts2) {
  c1 <- st_coordinates(pts1)
  c2 <- st_coordinates(pts2)
  res <- outer(c1[,1],c2[,1],'-')^2 + outer(c1[,2],c2[,2],'-')^2
  return(sqrt(res))
}

This defines a new R function that takes a pair of point sf objects andf computes a distance matrix. At this stage we don’t have the distances acttually computed, we have just created a function to do the job. If you can’t see how this works immediately, don’t worry too much - and skip the following part - but if you are interested the details are between the lines below:


  1. the st_coordinates functiions extract the coordinates from a pair of point objects, and stores each in an \(n \times 2\) matrix.
  2. The outer functions take each column of the two matrices in turn and creates a difference matrix for the \(x\) and \(y\) coordinates respectively.
  3. This are squared and added up foir the \(x\) and \(y\) difference matrices.
  4. The resultant matrix is ‘square-rooted’ to get the distance matrix.

Now we have a function to calculate distance - so lets actually do this. The following computes the distance matrix for gron_shops and gron_pts, and gives names to the appropriate rows and columns in the matrix. Note the division of the distances by 1000. This makes the distances in kilometres, not metres, which is more appropriate for this model.

### Compute the distances
dmat <- get_dists(gron_shops,gron_pts)/1000

### Name the rows and columns nicely
rownames(dmat) <- gron_shops$shopcode
colnames(dmat) <- gron_pts$BU_CODE

# Print the top left hand corner
dmat[1:4,1:4]
##             BU00140000 BU00140001 BU00140002 BU00140003
## AR000802617  2.6211707  2.1282047  2.6632298  2.3322780
## AR000935740  0.9472168  0.4911816  1.1519708  0.7369386
## AR000818293  0.3018875  0.3090520  0.7729367  0.4043062
## AR000809760  0.8388541  0.9243705  0.2415260  1.4010140

Now we have a distance matrix, but to merge it with the rest of the information, it would be better to have this in tidy format - essentially in the format of a bata frame, with three columns - the store ID, the administrative area ID and the distance between the two. This is useful because it may then be used as a look-up table for a given store, or a given administrative area, to find distances. To do this, first convert the matrix object into a data frame, and add a variable column based on the row name - this gives the shop code. Here we use tidyverse-style pipelining to ‘mutate’ the data frame.

dmat %>% as_data_frame %>% mutate(shopcode=rownames(dmat)) -> dist_tab
head(dist_tab)
## # A tibble: 6 x 109
##   BU00… BU00… BU00… BU00… BU00… BU00… BU00… BU001… BU00… BU00… BU00… BU00…
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.62  2.13  2.66  2.33  3.16  3.09  3.08  1.74   2.16  1.85  1.42  1.27 
## 2 0.947 0.491 1.15  0.737 1.52  1.42  1.51  0.0703 1.23  0.833 0.773 0.419
## 3 0.302 0.309 0.773 0.404 0.909 0.758 0.996 0.592  1.34  1.08  1.26  1.07 
## 4 0.839 0.924 0.242 1.40  1.45  1.05  0.355 1.35   0.948 1.05  1.46  1.66 
## 5 3.61  3.17  3.79  3.17  3.99  4.03  4.19  2.74   3.41  3.06  2.64  2.35 
## 6 1.70  2.21  2.06  1.91  1.08  1.21  1.74  2.51   2.97  2.88  3.16  2.99 
## # ... with 97 more variables: BU00140104 <dbl>, BU00140105 <dbl>,
## #   BU00140106 <dbl>, BU00140107 <dbl>, BU00140108 <dbl>, BU00140109
## #   <dbl>, BU00140200 <dbl>, BU00140201 <dbl>, BU00140202 <dbl>,
## #   BU00140203 <dbl>, BU00140300 <dbl>, BU00140301 <dbl>, BU00140302
## #   <dbl>, BU00140400 <dbl>, BU00140401 <dbl>, BU00140402 <dbl>,
## #   BU00140403 <dbl>, BU00140404 <dbl>, BU00140500 <dbl>, BU00140501
## #   <dbl>, BU00140502 <dbl>, BU00140503 <dbl>, BU00140504 <dbl>,
## #   BU00140505 <dbl>, BU00140506 <dbl>, BU00140507 <dbl>, BU00140508
## #   <dbl>, BU00140509 <dbl>, BU00140510 <dbl>, BU00140511 <dbl>,
## #   BU00140600 <dbl>, BU00140601 <dbl>, BU00140602 <dbl>, BU00140603
## #   <dbl>, BU00140604 <dbl>, BU00140605 <dbl>, BU00140606 <dbl>,
## #   BU00140700 <dbl>, BU00140701 <dbl>, BU00140702 <dbl>, BU00140703
## #   <dbl>, BU00140704 <dbl>, BU00140705 <dbl>, BU00140800 <dbl>,
## #   BU00140801 <dbl>, BU00140802 <dbl>, BU00140803 <dbl>, BU00140804
## #   <dbl>, BU00140805 <dbl>, BU00140806 <dbl>, BU00140807 <dbl>,
## #   BU00140808 <dbl>, BU00140809 <dbl>, BU00140810 <dbl>, BU00140811
## #   <dbl>, BU00140812 <dbl>, BU00140900 <dbl>, BU00140901 <dbl>,
## #   BU00140902 <dbl>, BU00140903 <dbl>, BU00140904 <dbl>, BU00140905
## #   <dbl>, BU00140906 <dbl>, BU00140907 <dbl>, BU00141000 <dbl>,
## #   BU00141001 <dbl>, BU00141002 <dbl>, BU00141003 <dbl>, BU00141004
## #   <dbl>, BU00141005 <dbl>, BU00141100 <dbl>, BU00141101 <dbl>,
## #   BU00141102 <dbl>, BU00141103 <dbl>, BU00141104 <dbl>, BU00141105
## #   <dbl>, BU00141106 <dbl>, BU00141107 <dbl>, BU00141200 <dbl>,
## #   BU00141201 <dbl>, BU00141202 <dbl>, BU00141203 <dbl>, BU00141204
## #   <dbl>, BU00141205 <dbl>, BU00141206 <dbl>, BU00141207 <dbl>,
## #   BU00141208 <dbl>, BU00141209 <dbl>, BU00141210 <dbl>, BU00141211
## #   <dbl>, BU00141300 <dbl>, BU00141301 <dbl>, BU00141302 <dbl>,
## #   BU00141400 <dbl>, BU00141401 <dbl>, BU00141402 <dbl>, shopcode <chr>

This is still not in the format mentioned earlier, as each origin has its own column. To re-shape the data into three columns the gather command is needed. This takes data spread over several columns (here all of the origins) and ‘gathers’ them into a single column. The column names get a new variable, and the distance values are stacked.

This is shown below.

dist_tab %>% gather(key=BU_CODE,value=Dist,-shopcode) -> dist_tab
head(dist_tab)
## # A tibble: 6 x 3
##   shopcode    BU_CODE     Dist
##   <chr>       <chr>      <dbl>
## 1 AR000802617 BU00140000 2.62 
## 2 AR000935740 BU00140000 0.947
## 3 AR000818293 BU00140000 0.302
## 4 AR000809760 BU00140000 0.839
## 5 AR000805380 BU00140000 3.61 
## 6 AR000800472 BU00140000 1.70

Note that this approach creates two new columns - the one with the old column name, and the one with the values. These are the key and value parameters. The final -shopcode parameters tells gather not to include this column in the gathering process, but to carry it forward into the modified data frame.

Next, we want to join the attrativeness and productiveness variables onto this list, using database join-type commands that are provided by the left_join function. First off, join the attractiveness information from gron_shops :

dist_tab %>% left_join(gron_shops) %>% select(-geometry) -> dist_tab
head(dist_tab)
## # A tibble: 6 x 5
##   shopcode    BU_CODE     Dist shop                 Floorspace
##   <chr>       <chr>      <dbl> <chr>                     <int>
## 1 AR000802617 BU00140000 2.62  AH de Wijert               1617
## 2 AR000935740 BU00140000 0.947 AH to go - Groningen        313
## 3 AR000818293 BU00140000 0.302 Albert Heijn               1242
## 4 AR000809760 BU00140000 0.839 Albert Heijn              45000
## 5 AR000805380 BU00140000 3.61  Albert Heijn 1060          1007
## 6 AR000800472 BU00140000 1.70  Albert Heijn 1093          1997

NOTE: You may get an error here if you have recently been using theMASS package - this also has a function called select that does a different thing. When more than one package has been loaded with the same nameed function, tyhe default one called may be the wrong one. Here you need the dplyr function select, not the MASS one - and this could have caused the error. To make this unambigous, use the following instead:

dist_tab %>% left_join(gron_shops) %>% dplyr::select(-geometry) -> dist_tab
head(dist_tab)

Another approach is to use unloadNameSpace which effectively unloads a package - so entering

unloadNamespace("MASS")

can also fix this.


The select(-geometry) is used here as we want to create a straight forward data frame here, not a geographical object. We can also add the administrative area information in the same way.

dist_tab %>% left_join(gron_pts) %>% select(-geometry) -> dist_tab
head(dist_tab)
## # A tibble: 6 x 7
##   shopcode    BU_CODE     Dist shop                 Floorspace AANT… Money
##   <chr>       <chr>      <dbl> <chr>                     <int> <dbl> <dbl>
## 1 AR000802617 BU00140000 2.62  AH de Wijert               1617  4440 88800
## 2 AR000935740 BU00140000 0.947 AH to go - Groningen        313  4440 88800
## 3 AR000818293 BU00140000 0.302 Albert Heijn               1242  4440 88800
## 4 AR000809760 BU00140000 0.839 Albert Heijn              45000  4440 88800
## 5 AR000805380 BU00140000 3.61  Albert Heijn 1060          1007  4440 88800
## 6 AR000800472 BU00140000 1.70  Albert Heijn 1093          1997  4440 88800

So now dist_tab has a series of rows corresponding to each \(i,j\) pair, with the values of \(d_{ij}\) \(P_i\) and \(A_j\) among the variables -

We can now actually carry out the modelling. As an aside, the above work is a typical example of what some data scientists term data mungeing - basically not doing analysis, but re-shaping data into a format that will be helpful for the analysis. In the practical here we worked through this step-by-step, but in practice, one might specify the entire process as a single pipeline:

dmat %>% as_data_frame %>% 
  mutate(shopcode=rownames(dmat)) %>% 
  gather(key=BU_CODE,value=Dist,-shopcode) %>%
  left_join(gron_shops) %>% select(-geometry) %>%
  left_join(gron_pts) %>% select(-geometry) -> dist_tab

3. Fitting the model

Before we noted that the spatial interaction model takes the form \(I_{ij} = f(P_i,A_j,d_{ij})\) - here, we will use a specific, multiplicative format:

\[ I_{ij} = \textsf{const}_j \times A_i^\alpha g(d_{ij}) \] where \(g\) is a decaying function of distance. Note that the productiveness does not appear explicitly here - in fact it features as a constraint - essentially for each administrative area, the total money going to all of the stores must add up to the total money for that area. This is achieved by multiplying the attractiveness and distance-deterrence terms by a constant for each area - the constant re-scales the amount spent so that the total for the area is equal to the variable Money for that area.

The form that \(g\) takes here will be that of exponential decay. Often this is written as \(e^{-\textsf{const} \times d_{ij}}\) but here we will use the form

\[ g(d_{ij}) = 2^{-d_{ij}/h} \] where \(h\) is the distance half-life. That is, for every \(h\) kilometers between the store and the home location, the desirability of making the trip halves. Although this can be written in more conventional exponential form as \(e^{-\log(2)d_{ij}/h}\), the parameter \(h\) is more easily interpreted here. Thus the full model is

\[ I_{ij} = \textsf{const}_j A_i^\alpha 2^{-d_{ij}/h} \]

Having arrived at the equation we now need to compute this and add a column to the dist_tab data frame to model the \(I_{ij}\) values. These will be the estimated expenditure for the population of area \(j\) at store \(i\). For a starting point, assume a fairly strong distance deterrence with a distance half-life of 200m. We can change this later.

The first part of the model (before applying the production constraint) is to compute \(A_i^\alpha 2^{-d_ij/h}\). \(\alpha\) here allows for extra attractiveness of the larger stores. In this case it is assumed to be quite close to 1 - say 1.05 - so that larger stores have a slightly disproportionate increase in attractiveness, but the effect is fairly weak.

dist_tab %>% 
  mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/0.2)) -> sim_tab
head(sim_tab)
## # A tibble: 6 x 8
##   shopcode    BU_CODE     Dist shop         Floorsp… AANT_… Money Flow_Fa…
##   <chr>       <chr>      <dbl> <chr>           <int>  <dbl> <dbl>    <dbl>
## 1 AR000802617 BU00140000 2.62  AH de Wijert     1617   4440 88800  2.65e⁻¹
## 2 AR000935740 BU00140000 0.947 AH to go - …      313   4440 88800  1.57e⁺¹
## 3 AR000818293 BU00140000 0.302 Albert Heijn     1242   4440 88800  6.23e⁺²
## 4 AR000809760 BU00140000 0.839 Albert Heijn    45000   4440 88800  4.20e⁺³
## 5 AR000805380 BU00140000 3.61  Albert Heij…     1007   4440 88800  5.25e⁻³
## 6 AR000800472 BU00140000 1.70  Albert Heij…     1997   4440 88800  8.14e⁺⁰

This has the Flow_Factor variable proportional to the cash flow between \(i\) and \(j\) but as yet it hasn’t been scaled. To do this, we first need to rescale the flow factors for each administrative area so that they sum to one. Then, they are each multiplied by the Money variable for that area, so that the total modelled flow of cash equals the total money coming from that area. This is done by using the sum function. If the sim_tab data frame is first grouped by the administrative unit code, (using group_by(BU_CODE)) then the sum function for each row applies to the sum of values for its specific BU_CODE - rather than the sum for the entire column. Thus

sim_tab %>% group_by(BU_CODE) %>% 
  mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) ->
  sim_tab
head(sim_tab)
## # A tibble: 6 x 9
## # Groups: BU_CODE [1]
##   shopcode    BU_CODE     Dist shop    Floor… AANT_… Money Flow_F… Cash_F…
##   <chr>       <chr>      <dbl> <chr>    <int>  <dbl> <dbl>   <dbl>   <dbl>
## 1 AR000802617 BU00140000 2.62  AH de …   1617   4440 88800 2.65e⁻¹ 2.47e⁺⁰
## 2 AR000935740 BU00140000 0.947 AH to …    313   4440 88800 1.57e⁺¹ 1.45e⁺²
## 3 AR000818293 BU00140000 0.302 Albert…   1242   4440 88800 6.23e⁺² 5.79e⁺³
## 4 AR000809760 BU00140000 0.839 Albert…  45000   4440 88800 4.20e⁺³ 3.90e⁺⁴
## 5 AR000805380 BU00140000 3.61  Albert…   1007   4440 88800 5.25e⁻³ 4.88e⁻²
## 6 AR000800472 BU00140000 1.70  Albert…   1997   4440 88800 8.14e⁺⁰ 7.57e⁺¹

Thus we now have a data frame whose rows are the details of each \(i\) and \(j\) combination, including the modelled cash flow. We can first check that the constraint has worked:

sim_tab %>% group_by(BU_CODE) %>% summarise(Out_Flow = sum(Cash_Flow), Money=first(Money))
## # A tibble: 108 x 3
##    BU_CODE    Out_Flow  Money
##    <chr>         <dbl>  <dbl>
##  1 BU00140000    88800  88800
##  2 BU00140001   132000 132000
##  3 BU00140002    78400  78400
##  4 BU00140003    35300  35300
##  5 BU00140004      200    200
##  6 BU00140005   110200 110200
##  7 BU00140007        0      0
##  8 BU00140008     7700   7700
##  9 BU00140100    30000  30000
## 10 BU00140101   104800 104800
## # ... with 98 more rows

Here the summarise function creates a summary data frame, where statistics such as the sum or mean are computed for each group. Thus Out_Flow is the sum of all of the cash flows coming out of each administrative area. Note that the Money variable is the same for each member of the group in sim_tab so taking the first is just a way of summarising several repeated values with a single incidence of this value. Thus the output shows that the summed modelled cash flows from each adminstrative area is the same as the Money for that area. The constraint has worked. Phew!

Now we are happy with the model we may want to see what happens for different half-life distances. As with the data mungeing we can express the modelling process as a single pipeline. But to save time we can also put it into a function, with the half-life as an argument.

fit_sim <- function(h) {
  dist_tab %>% 
    mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/h))  %>% 
    group_by(BU_CODE) %>% 
    mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) -> res
  return(res)
}
head(fit_sim(0.75))
## # A tibble: 6 x 9
## # Groups: BU_CODE [1]
##   shopcode    BU_CODE     Dist shop     Floor… AANT_… Money Flow_F… Cash_…
##   <chr>       <chr>      <dbl> <chr>     <int>  <dbl> <dbl>   <dbl>  <dbl>
## 1 AR000802617 BU00140000 2.62  AH de W…   1617   4440 88800   208   3.02e²
## 2 AR000935740 BU00140000 0.947 AH to g…    313   4440 88800   174   2.53e²
## 3 AR000818293 BU00140000 0.302 Albert …   1242   4440 88800  1342   1.96e³
## 4 AR000809760 BU00140000 0.839 Albert …  45000   4440 88800 35414   5.16e⁴
## 5 AR000805380 BU00140000 3.61  Albert …   1007   4440 88800    50.6 7.38e¹
## 6 AR000800472 BU00140000 1.70  Albert …   1997   4440 88800   608   8.87e²

4. Mapping the Results

We now have a data frame sim_tab that tabulates the results of the Spatial Interaction Model. The next step is to investigate the geographical patterns associated with this. Suppose we are mainly concerned with the AH/Albert Heijn group of stores. One starting point might be to discover the total cash flows predicted into each of the stores - this is a two-stage process - firstly filter out these stores (based on text patterns in the shop name) - and then group by the stores and sum over them using the summerise function. The pipeline is as below:

sim_tab %>% filter(grepl('AH|Heijn',shop)) %>% 
  group_by(shopcode) %>% 
  summarise(InFlow=sum(Cash_Flow)) -> inflow_tab
inflow_tab
## # A tibble: 18 x 2
##    shopcode    InFlow
##    <chr>        <dbl>
##  1 AR000800362 125347
##  2 AR000800472  93925
##  3 AR000801783  67387
##  4 AR000801845  44432
##  5 AR000802353 137264
##  6 AR000802617  83345
##  7 AR000802690 163673
##  8 AR000804383  36782
##  9 AR000805380 102691
## 10 AR000806759  39424
## 11 AR000807891  99728
## 12 AR000809760 691538
## 13 AR000818293  42593
## 14 AR000923136  49486
## 15 AR000925140  45218
## 16 AR000928876 187409
## 17 AR000935740  11206
## 18 AR000952195  36642

Here grepl provides a ogical grep search - grep is an acronym for generalised regular expression parser - it is a tool for text matching, based on character patterns. More details are here: https://en.wikipedia.org/wiki/Grep - but for now, it is enough to note that ‘AH|Heijn’ picks out shop names with either ‘AH’ or ‘Heijn’ in them. However, although this tabulates the influx of cash, it doesn’t show a map. To get this, firstly join this table to the gron_shops geographical object and then use tmap to create a map.

gron_shops %>% left_join(inflow_tab) -> inflow_pts
tm_shape(inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7)

This depends on the distance half life. Once again we can ecapsulate all of this in a function with \(h\) as the argument, making it easier to explore:

map_inflow <- function(h) {
  fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>% 
  group_by(shopcode) %>% 
  summarise(InFlow=sum(Cash_Flow)) -> inflow_tab
  gron_shops %>% left_join(inflow_tab) -> inflow_pts
  tm_shape(inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7)
}

map_inflow(0.5)

The above shows the effect if \(h\) is 500m - now check it for 1km:

map_inflow(1.0)

In a nutshell, as the distance half life gets larger, people are willing to travel further to stores, and the more attractive larger stores get a greater share of the money - people will travel to a more favourable store even when it is a long way away.

5. Catchment Areas

The idea of a catchment area here is to identify, for each administrative unit, what is the most attractive store. This is done by considering the cash flow out of each unit to each of the stores and seeing which store gets the greatest share of the total outflow from the unit. In R (and in tidyverse terms) this is done using the rank function. Rank takes a variable and for each row, returns the rank of the value for that row, out of all of the rows. The ranks are in ascending order, so for \(n\) items the largest has the value \(n\).

To find the ranks of the outflow shares on an administrative unit by adminstrative unit basis, group_by is a helper function again. In this case it causes the rank function to return the relative ranks of the flow factors (\(A_i^\alpha 2^{-d_{ij}/h}\)) within each group. Finally, it would by useful to have ranks in descending order - you are interested in the maximum in each group, and so rather than trying to find ranks of \(n\) when you don’t know what \(n\) is, it is more useful to find ranks of 1. Changing the sign of the variable given to the rank function achieves this - essentially it flips the orders of the variables. Check the results here:

sim_tab  %>% filter(grepl('AH|Heijn',shop)) %>% 
  group_by(BU_CODE) %>% 
  mutate(R=rank(-Flow_Factor)) -> rank_tab
rank_tab
## # A tibble: 1,944 x 10
## # Groups: BU_CODE [108]
##    shopcode    BU_CO…  Dist shop   Floo… AANT… Money Flow_F… Cash_F…     R
##    <chr>       <chr>  <dbl> <chr>  <int> <dbl> <dbl>   <dbl>   <dbl> <dbl>
##  1 AR000802617 BU001… 2.62  AH de…  1617  4440 88800 2.65e⁻¹ 2.47e⁺⁰ 15.0 
##  2 AR000935740 BU001… 0.947 AH to…   313  4440 88800 1.57e⁺¹ 1.45e⁺²  7.00
##  3 AR000818293 BU001… 0.302 Alber…  1242  4440 88800 6.23e⁺² 5.79e⁺³  3.00
##  4 AR000809760 BU001… 0.839 Alber… 45000  4440 88800 4.20e⁺³ 3.90e⁺⁴  1.00
##  5 AR000805380 BU001… 3.61  Alber…  1007  4440 88800 5.25e⁻³ 4.88e⁻² 17.0 
##  6 AR000800472 BU001… 1.70  Alber…  1997  4440 88800 8.14e⁺⁰ 7.57e⁺¹  9.00
##  7 AR000806759 BU001… 0.433 Alber…   968  4440 88800 3.04e⁺² 2.83e⁺³  4.00
##  8 AR000925140 BU001… 1.85  Alber…  1075  4440 88800 2.54e⁺⁰ 2.36e⁺¹ 10.0 
##  9 AR000952195 BU001… 0.590 Alber…  1020  4440 88800 1.87e⁺² 1.73e⁺³  6.00
## 10 AR000928876 BU001… 2.98  Alber… 25478  4440 88800 1.39e⁺⁰ 1.29e⁺¹ 11.0 
## # ... with 1,934 more rows

The interactions with the highest rank for each administrative unit can then be found by filtering out those for which R is equal to 1.

rank_tab %>% filter(R==1) -> rank_tab
head(rank_tab)
## # A tibble: 6 x 10
## # Groups: BU_CODE [6]
##   shopcode    BU_CO…  Dist shop    Floor… AANT_…  Money Flow… Cash_…     R
##   <chr>       <chr>  <dbl> <chr>    <int>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1 AR000809760 BU001… 0.839 Albert…  45000 4440    88800  4200 3.90e⁴  1.00
## 2 AR000809760 BU001… 0.924 Albert…  45000 6600   132000  3123 5.53e⁴  1.00
## 3 AR000809760 BU001… 0.242 Albert…  45000 3920    78400 33292 7.34e⁴  1.00
## 4 AR000806759 BU001… 0.195 Albert…    968 1765    35300   695 6.69e³  1.00
## 5 AR000809760 BU001… 1.45  Albert…  45000   10.0    200   502 3.59e¹  1.00
## 6 AR000809760 BU001… 1.05  Albert…  45000 5510   110200  2023 3.67e⁴  1.00

There is now a unique row for each administrative unit, and the shop also appearing in that row is the one where most money flows - and this states which catchment area the administrative unit is in. To map these, firstly join this table to the groningen_polys object:

gron_polys %>% left_join(rank_tab) -> catch_polys

This can then be mapped, using the shopcode as a colouring variable - making the catchments visible:

tm_shape(catch_polys) + tm_polygons(col='shopcode')

A further useful technique would be to create dissolved catchment polygons. This essentially involves removing the boundaries between administrative unit polygons assigned to the same store. This is done by applying a group_by operation to a simple features object (here, catch_polys) and then a summarise - this merges all of the polygons in the same group. Here there is no summary variable, we just do this to make the merge:

catch_polys %>% group_by(shopcode) %>% summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')

As before - all of this could be made into a small number of larger pipelines and encapsulated in a function of \(h\):

map_catchments <- function(h) {
  fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>% 
    group_by(BU_CODE) %>% 
    mutate(R=rank(-Flow_Factor)) %>%
    filter(R==1) -> rank_tab
  gron_polys %>% 
    left_join(rank_tab) %>% 
    group_by(shopcode) %>% 
    summarise -> catch_polys
  tm_shape(catch_polys) + tm_polygons(col='shopcode')
}

map_catchments(0.1)

Now you can again experiment with different \(h\) values - for example 750m:

map_catchments(0.75)

6. Assessing the Impact of a Proposed Store

Suppose now a new store was proposed with a floor space of 5000 square meters - therefore a fairly large store. How could the effect of this be allowed for? The main way to see the impact is to add the new store to the gron_shops spatial layer, and to re-run the model with this extra store. This can then be compared to the original model. To see this, firstly download a new spatial data object proposed:

download.file("https://www.dropbox.com/s/qjzxl0cr7ehtrti/proposed.zip?dl=1","proposed.zip")
unzip("proposed.zip")
proposed <- st_read(".","proposed")
## Reading layer `proposed' from data source `/Users/chrisbrunsdon/Dropbox/Groningen/Session 6' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 235494.2 ymin: 580729.9 xmax: 235494.2 ymax: 580729.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs

Then, have a look at its location - the proposed store is in red.

tm_shape(gron_shops) + tm_dots(col='dodgerblue',size='Floorspace') + tm_shape(proposed) + tm_dots(col='indianred',size='Floorspace')

To re-run the model, again using a function is useful. Below is a function that encapsulates all of the steps in producing a model, from the provision of a shops object and an administrative area object as well as \(h\). The function looks intimidating, but really it is just a combination of the operations done above - the data munging and model fitting.

new_model <- function(admin,shops,h) {
  
  dmat <- get_dists(shops,admin)/1000
  rownames(dmat) <- shops$shopcode
  colnames(dmat) <- admin$BU_CODE
  
  dmat %>% as_data_frame %>% 
  mutate(shopcode=rownames(dmat)) %>% 
  gather(key=BU_CODE,value=Dist,-shopcode) %>%
  left_join(shops) %>% select(-geometry) %>%
  left_join(admin) %>% select(-geometry) -> dist_tab
  
  dist_tab %>% 
  mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/h))  %>% 
  group_by(BU_CODE) %>% 
  mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) -> res
  
  return(res) 
}

test <- new_model(gron_pts,gron_shops,0.3)
head(test)
## # A tibble: 6 x 9
## # Groups: BU_CODE [1]
##   shopcode    BU_CODE     Dist shop     Floor… AANT_… Money Flow_F… Cash_…
##   <chr>       <chr>      <dbl> <chr>     <int>  <dbl> <dbl>   <dbl>  <dbl>
## 1 AR000802617 BU00140000 2.62  AH de W…   1617   4440 88800 5.48e⁺⁰ 2.47e¹
## 2 AR000935740 BU00140000 0.947 AH to g…    313   4440 88800 4.68e⁺¹ 2.10e²
## 3 AR000818293 BU00140000 0.302 Albert …   1242   4440 88800 8.83e⁺² 3.97e³
## 4 AR000809760 BU00140000 0.839 Albert …  45000   4440 88800 1.11e⁺⁴ 4.98e⁴
## 5 AR000805380 BU00140000 3.61  Albert …   1007   4440 88800 3.40e⁻¹ 1.53e⁰
## 6 AR000800472 BU00140000 1.70  Albert …   1997   4440 88800 5.78e⁺¹ 2.60e²

Here the function takes the inputs and returns a data frame with the fitted model values, in the format of sim_tab used earlier.

To re-fit thew model with a new store, firstly create a new scenario by combining the new proposed store with the existing ones in Groningen:

all_shops <- rbind(proposed,gron_shops)

Then update the simulation table:

new_sim_tab <- new_model(gron_pts,all_shops,0.2)
head(new_sim_tab)
## # A tibble: 6 x 9
## # Groups: BU_CODE [1]
##   shopcode    BU_CODE     Dist shop    Floor… AANT_… Money Flow_F… Cash_F…
##   <chr>       <chr>      <dbl> <chr>    <dbl>  <dbl> <dbl>   <dbl>   <dbl>
## 1 ARxxxxxxxxx BU00140000 2.28  New St…   5000   4440 88800 2.88e⁺⁰ 2.67e⁺¹
## 2 AR000802617 BU00140000 2.62  AH de …   1617   4440 88800 2.65e⁻¹ 2.47e⁺⁰
## 3 AR000935740 BU00140000 0.947 AH to …    313   4440 88800 1.57e⁺¹ 1.45e⁺²
## 4 AR000818293 BU00140000 0.302 Albert…   1242   4440 88800 6.23e⁺² 5.79e⁺³
## 5 AR000809760 BU00140000 0.839 Albert…  45000   4440 88800 4.20e⁺³ 3.90e⁺⁴
## 6 AR000805380 BU00140000 3.61  Albert…   1007   4440 88800 5.25e⁻³ 4.88e⁻²

We can now compute the inflows to all of the stores.

new_sim_tab %>% filter(grepl('AH|Heijn|New Store',shop)) %>% 
  group_by(shopcode) %>% 
  summarise(InFlow=sum(Cash_Flow)) -> new_inflow_tab
head(new_inflow_tab)
## # A tibble: 6 x 2
##   shopcode    InFlow
##   <chr>        <dbl>
## 1 AR000800362 125346
## 2 AR000800472  93923
## 3 AR000801783  66115
## 4 AR000801845  44208
## 5 AR000802353 137259
## 6 AR000802617  82637

… and map the new inflows:

all_shops %>% left_join(new_inflow_tab) -> new_inflow_pts
tm_shape(new_inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7) 

If we join the original inflow_tab to the new_inflow_tab using the shop code as a merge key we obtain a table with new and old cash inflow estimates, respectively called inflow.x and inflow.y.

new_inflow_tab %>% left_join(inflow_tab,by="shopcode") -> 
  diff_inflow_tab
diff_inflow_tab
## # A tibble: 19 x 3
##    shopcode    InFlow.x InFlow.y
##    <chr>          <dbl>    <dbl>
##  1 AR000800362   125346   125347
##  2 AR000800472    93923    93925
##  3 AR000801783    66115    67387
##  4 AR000801845    44208    44432
##  5 AR000802353   137259   137264
##  6 AR000802617    82637    83345
##  7 AR000802690   154633   163673
##  8 AR000804383    35830    36782
##  9 AR000805380   102655   102691
## 10 AR000806759    39300    39424
## 11 AR000807891    99717    99728
## 12 AR000809760   675622   691538
## 13 AR000818293    42347    42593
## 14 AR000923136    46870    49486
## 15 AR000925140    45217    45218
## 16 AR000928876   187409   187409
## 17 AR000935740    11006    11206
## 18 AR000952195    36620    36642
## 19 ARxxxxxxxxx    57940       NA

The final shopcode is the proposed store, and so doesn’t have an original inflow. Here it makes sense to set this as zero:

diff_inflow_tab$InFlow.y[19] <- 0

We can then compute the differences:

diff_inflow_tab %>% mutate(Diff=InFlow.x - InFlow.y) -> diff_inflow_tab
diff_inflow_tab
## # A tibble: 19 x 4
##    shopcode    InFlow.x InFlow.y       Diff
##    <chr>          <dbl>    <dbl>      <dbl>
##  1 AR000800362   125346   125347 -    0.913
##  2 AR000800472    93923    93925 -    1.87 
##  3 AR000801783    66115    67387 - 1272    
##  4 AR000801845    44208    44432 -  224    
##  5 AR000802353   137259   137264 -    4.05 
##  6 AR000802617    82637    83345 -  708    
##  7 AR000802690   154633   163673 - 9040    
##  8 AR000804383    35830    36782 -  952    
##  9 AR000805380   102655   102691 -   35.3  
## 10 AR000806759    39300    39424 -  123    
## 11 AR000807891    99717    99728 -   10.4  
## 12 AR000809760   675622   691538 -15915    
## 13 AR000818293    42347    42593 -  246    
## 14 AR000923136    46870    49486 - 2617    
## 15 AR000925140    45217    45218 -    0.724
## 16 AR000928876   187409   187409 -    0.563
## 17 AR000935740    11006    11206 -  200    
## 18 AR000952195    36620    36642 -   22.5  
## 19 ARxxxxxxxxx    57940        0  57940

Finally to map this as proportional symbols, since some values are positive and others are negative, create an absolute value of change column - and then map this

diff_inflow_tab %>% mutate(AbsDiff=abs(Diff)) -> diff_inflow_tab
all_shops %>% left_join(diff_inflow_tab) -> diff_inflow_pts
tm_shape(diff_inflow_pts) + tm_dots(size='AbsDiff',col='Diff',alpha=0.7) + tm_style_col_blind()

This helps to identify the stores that will experience the most impact if a new store is opened as proposed.

As an exercise, try to create a catchment map for the revised store layout.

7. Showing Off

Most of the basic kinds of analysis are covered in the first 5 sections - here are a couple of fancy ideas that may be useful. Each one of these is shown as a larger pipeline (or sequence of pipelines) in a function - it may be an interesting exercise to look over the function listing and figure how it works.

The first of these generalises the idea of catchments to find second choice stores - even if one of the stores gets the largest share of cash flow from an administrative unit, it won’t necessarily get all of the cash. This function demonstrates how to map second choice catchment areas:

map_catchments_2 <- function(h) {
  fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>% 
    group_by(BU_CODE) %>% 
    mutate(R=rank(-Flow_Factor)) %>%
    filter(R==2) -> rank_tab
  gron_polys %>% 
    left_join(rank_tab) %>% 
    group_by(shopcode) %>% 
    summarise -> catch_polys
  tm_shape(catch_polys) + tm_polygons(col='shopcode')
}

map_catchments_2(0.5)

More generally, you can map \(k\)th choice catchments for any rank:

map_catchments_k <- function(h,k=1) {
  fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>% 
    group_by(BU_CODE) %>% 
    mutate(R=rank(-Flow_Factor)) %>%
    filter(R==k) -> rank_tab
  gron_polys %>% 
    left_join(rank_tab) %>% 
    group_by(shopcode) %>% 
    summarise -> catch_polys
  tm_shape(catch_polys) + tm_polygons(col='shopcode')
}

map_catchments_k(0.5,3)

Another variant is mapping catchments in a more fuzzy way. Here again it is noted that although for each administrative unit there is a ‘most favoured’ store, the share that this store gets of the total cash flow varies. This function maps the standard catchment areas, but varies intensity of shading to show how dominant the most favoured store actually is:

map_catchments_fuzz <- function(h) {
  fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>% 
    group_by(BU_CODE) %>% 
    mutate(R=rank(-Flow_Factor), Share=Flow_Factor/sum(Flow_Factor)) %>%
    filter(R==1) -> rank_tab
  gron_polys %>% 
    left_join(rank_tab)  -> catch_polys1
  catch_polys1 %>% group_by(shopcode) %>% summarise -> catch_polys2
  tm_shape(catch_polys1) + tm_polygons(col='Share') + tm_shape(catch_polys2) + tm_borders(col='darkblue')
}

map_catchments_fuzz(0.75)

Here, places nearest to the boundaries between catchments tend to have far lower shares.