Having seen a talk about spatial interaction models, in this session you will see how to fit one using R. The data you will work with relates to Groningen. There are a number of census units in and around Groningen:
In addition, there are a number of stores, together with information relating to their floor areas.
A third data set relates to the administrative areas, but as a set of points - this is useful to compute ‘crow fly’ distances between the administrative areas and the shops. These data sets are called:
gron_pts - label points for the administrative areasgron_polys - polygons for the administtrative areasgron_shops - point locatrions for the storesAll the data can be obtained from the file link in the schedule - the files grongingen_pts.zip, groningen_polys.zip and shops_pts.zip should be copied over. They can then be unzipped in R.
### Load some useful packages
library(tidyverse)
library(sf)
library(tmap)
options(stringsAsFactors = FALSE)
### First the polys for the admion areas
### Unzip it
unzip("groningen_polys.zip")
### Read it in
gron_polys <- st_read(".","groningen_polys")
### Repeat for the shops
unzip("shops_pts.zip")
gron_shops <- st_read(".","shops_pts")
#### Repeat for the admin areae label points
unzip("groningen_pts.zip")
gron_pts <- st_read(".","groningen_pts")
This code downloads the shapefiles into your working folder, unzips them, and then reads the files into a R objects - for each of those objects in turn.
You can now use tmap commnds to view them. THe code below will create a basic plot of the areas and the shop locations:
tm_shape(gron_polys) + tm_borders(col='indianred') +
tm_shape(gron_shops) +
tm_dots(size='Floorspace',col='dodgerblue') +
tm_layout(legend.position=c("right","bottom"))
We can also list the data in the objects - here are gron_shops:
gron_shops
## Simple feature collection with 55 features and 3 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 229396.2 ymin: 578512.6 xmax: 237610.4 ymax: 585365.4
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## First 20 features:
## shop shopcode Floorspace
## 1 AH de Wijert AR000802617 1617
## 2 AH to go - Groningen AR000935740 313
## 3 Albert Heijn AR000818293 1242
## 4 Albert Heijn AR000809760 45000
## 5 Albert Heijn 1060 AR000805380 1007
## 6 Albert Heijn 1093 AR000800472 1997
## 7 Albert Heijn 1243 AR000806759 968
## 8 Albert Heijn B.V. AR000925140 1075
## 9 Albert Heijn B.V. AR000952195 1020
## 10 Albert Heijn B.V. AR000928876 25478
## 11 Albert Heijn BV Supermarkt AR000807891 1234
## 12 Albert Heijn BV Supermarkt AR000802690 2601
## 13 Albert Heijn BV Supermarkt AR000804383 1224
## 14 Albert Heijn BV Supermarkt AR000801845 1655
## 15 Albert Heijn BV Supermarkt AR000800362 2451
## 16 Albert Heijn Supermarkt F. v.d. Heide B.V. AR000801783 1611
## 17 Albert Heijn Supermarkt H.J. v.d. Heide B.V. AR000923136 587
## 18 Aldi Drachten BV Supermarkt AR000913738 1155
## 19 Aldi Drachten BV Supermarkt (09 Groningen I ) AR000801270 804
## 20 Aldi Drachten BV Supermarkt (19 Groningen III) AR000801929 762
## geometry
## 1 POINT (233908.537 579461.836)
## 2 POINT (233676.262 581123.793)
## 3 POINT (233589.003 581776.368)
## 4 POINT (234473.769 582256.525)
## 5 POINT (233050.067 578512.555)
## 6 POINT (232841.819 583560.236)
## 7 POINT (233383.717 581733.449)
## 8 POINT (232282.014 583303.129)
## 9 POINT (233505.174 582641.248)
## 10 POINT (231731.778 584343.977)
## 11 POINT (233956.801 583452.4)
## 12 POINT (234682.313 579657.421)
## 13 POINT (234025.217 581620.656)
## 14 POINT (233790.316 582140.082)
## 15 POINT (231240.363 582681.696)
## 16 POINT (235975.156 582954.418)
## 17 POINT (237216.701 583396.861)
## 18 POINT (233539.907 580368.509)
## 19 POINT (233513.457 582916.442)
## 20 POINT (234096.467 578847.938)
and gron_pts
gron_pts
## Simple feature collection with 108 features and 3 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 227645.8 ymin: 578286.9 xmax: 241791 ymax: 585952.8
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## First 20 features:
## BU_CODE AANT_INW Money geometry
## 1 BU00140000 4440 88800 POINT (233655.736099999 582...
## 2 BU00140001 6600 132000 POINT (233834.593699999 581...
## 3 BU00140002 3920 78400 POINT (234292.167199999 582...
## 4 BU00140003 1765 35300 POINT (233195.739500001 581...
## 5 BU00140004 10 200 POINT (233043.1413 582503.1...
## 6 BU00140005 5510 110200 POINT (233458.532000002 582...
## 7 BU00140007 0 0 POINT (234243.446400002 582...
## 8 BU00140008 385 7700 POINT (233644.9516 581186.7...
## 9 BU00140100 1500 30000 POINT (234872.4652 581396.8...
## 10 BU00140101 5240 104800 POINT (234504.669799998 581...
## 11 BU00140102 1560 31200 POINT (234377.353500001 580...
## 12 BU00140103 4530 90600 POINT (233820.244800001 580...
## 13 BU00140104 2240 44800 POINT (233359.370999999 580...
## 14 BU00140105 885 17700 POINT (233171.4256 581236.7...
## 15 BU00140106 3265 65300 POINT (232672.585000001 581...
## 16 BU00140107 1270 25400 POINT (232867.578899998 580...
## 17 BU00140108 20 400 POINT (232223.331900001 580...
## 18 BU00140109 0 0 POINT (232885.599100001 580...
## 19 BU00140200 2845 56900 POINT (232537.562399998 582...
## 20 BU00140201 3875 77500 POINT (232918.379000001 582...
The variables here arew the population AANT_INW - and the potential spending for each administrative unit Money - this is a simple estimate of €20 per person. The polygons object contains the same information - it just uses polygon geometry instead of point.
Finally, a useful tool for viewing data is to draw the map on an interactive backdrop. This can be done by setting tmap_mode to view instead of plot:
tmap_mode('view')
tm_shape(gron_polys) + tm_borders(col='indianred', lwd=2)
The data we currently have tells you where the stores are, and where the population is located. It also tells you how much spending power is in each place, and how attractive each store is (measured by floor space). To predict interaction the final peice in the jigsaw is the distance between the stores and the population. In broad terms, the interaction between place \(i\) and store \(j\) (annotated here is \(I_{ij}\)) is modelled as
\[ I_{ij} = f(P_i,A_j,d_{ij}) \]
where \(P_i\) is the productiveness in place \(i\), \(A_j\) is the attractiveness of store \(j\) and \(d_{ij}\) is the distance between the two. Here \(P_i\) is measured with the Money item, and \(A_j\) with the Floorspace item. So we need to compute the distances. One way of doing this is to extract the point coordinates from gron_pts and gron_shops and compute the straight line (crow fly) distances between each pair, using Pythagoras’ theorem. The following function does this.
get_dists <- function(pts1,pts2) {
c1 <- st_coordinates(pts1)
c2 <- st_coordinates(pts2)
res <- outer(c1[,1],c2[,1],'-')^2 + outer(c1[,2],c2[,2],'-')^2
return(sqrt(res))
}
This defines a new R function that takes a pair of point sf objects andf computes a distance matrix. At this stage we don’t have the distances acttually computed, we have just created a function to do the job. If you can’t see how this works immediately, don’t worry too much - and skip the following part - but if you are interested the details are between the lines below:
st_coordinates functiions extract the coordinates from a pair of point objects, and stores each in an \(n \times 2\) matrix.outer functions take each column of the two matrices in turn and creates a difference matrix for the \(x\) and \(y\) coordinates respectively.Now we have a function to calculate distance - so lets actually do this. The following computes the distance matrix for gron_shops and gron_pts, and gives names to the appropriate rows and columns in the matrix. Note the division of the distances by 1000. This makes the distances in kilometres, not metres, which is more appropriate for this model.
### Compute the distances
dmat <- get_dists(gron_shops,gron_pts)/1000
### Name the rows and columns nicely
rownames(dmat) <- gron_shops$shopcode
colnames(dmat) <- gron_pts$BU_CODE
# Print the top left hand corner
dmat[1:4,1:4]
## BU00140000 BU00140001 BU00140002 BU00140003
## AR000802617 2.6211707 2.1282047 2.6632298 2.3322780
## AR000935740 0.9472168 0.4911816 1.1519708 0.7369386
## AR000818293 0.3018875 0.3090520 0.7729367 0.4043062
## AR000809760 0.8388541 0.9243705 0.2415260 1.4010140
Now we have a distance matrix, but to merge it with the rest of the information, it would be better to have this in tidy format - essentially in the format of a bata frame, with three columns - the store ID, the administrative area ID and the distance between the two. This is useful because it may then be used as a look-up table for a given store, or a given administrative area, to find distances. To do this, first convert the matrix object into a data frame, and add a variable column based on the row name - this gives the shop code. Here we use tidyverse-style pipelining to ‘mutate’ the data frame.
dmat %>% as_data_frame %>% mutate(shopcode=rownames(dmat)) -> dist_tab
head(dist_tab)
## # A tibble: 6 x 109
## BU00… BU00… BU00… BU00… BU00… BU00… BU00… BU001… BU00… BU00… BU00… BU00…
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.62 2.13 2.66 2.33 3.16 3.09 3.08 1.74 2.16 1.85 1.42 1.27
## 2 0.947 0.491 1.15 0.737 1.52 1.42 1.51 0.0703 1.23 0.833 0.773 0.419
## 3 0.302 0.309 0.773 0.404 0.909 0.758 0.996 0.592 1.34 1.08 1.26 1.07
## 4 0.839 0.924 0.242 1.40 1.45 1.05 0.355 1.35 0.948 1.05 1.46 1.66
## 5 3.61 3.17 3.79 3.17 3.99 4.03 4.19 2.74 3.41 3.06 2.64 2.35
## 6 1.70 2.21 2.06 1.91 1.08 1.21 1.74 2.51 2.97 2.88 3.16 2.99
## # ... with 97 more variables: BU00140104 <dbl>, BU00140105 <dbl>,
## # BU00140106 <dbl>, BU00140107 <dbl>, BU00140108 <dbl>, BU00140109
## # <dbl>, BU00140200 <dbl>, BU00140201 <dbl>, BU00140202 <dbl>,
## # BU00140203 <dbl>, BU00140300 <dbl>, BU00140301 <dbl>, BU00140302
## # <dbl>, BU00140400 <dbl>, BU00140401 <dbl>, BU00140402 <dbl>,
## # BU00140403 <dbl>, BU00140404 <dbl>, BU00140500 <dbl>, BU00140501
## # <dbl>, BU00140502 <dbl>, BU00140503 <dbl>, BU00140504 <dbl>,
## # BU00140505 <dbl>, BU00140506 <dbl>, BU00140507 <dbl>, BU00140508
## # <dbl>, BU00140509 <dbl>, BU00140510 <dbl>, BU00140511 <dbl>,
## # BU00140600 <dbl>, BU00140601 <dbl>, BU00140602 <dbl>, BU00140603
## # <dbl>, BU00140604 <dbl>, BU00140605 <dbl>, BU00140606 <dbl>,
## # BU00140700 <dbl>, BU00140701 <dbl>, BU00140702 <dbl>, BU00140703
## # <dbl>, BU00140704 <dbl>, BU00140705 <dbl>, BU00140800 <dbl>,
## # BU00140801 <dbl>, BU00140802 <dbl>, BU00140803 <dbl>, BU00140804
## # <dbl>, BU00140805 <dbl>, BU00140806 <dbl>, BU00140807 <dbl>,
## # BU00140808 <dbl>, BU00140809 <dbl>, BU00140810 <dbl>, BU00140811
## # <dbl>, BU00140812 <dbl>, BU00140900 <dbl>, BU00140901 <dbl>,
## # BU00140902 <dbl>, BU00140903 <dbl>, BU00140904 <dbl>, BU00140905
## # <dbl>, BU00140906 <dbl>, BU00140907 <dbl>, BU00141000 <dbl>,
## # BU00141001 <dbl>, BU00141002 <dbl>, BU00141003 <dbl>, BU00141004
## # <dbl>, BU00141005 <dbl>, BU00141100 <dbl>, BU00141101 <dbl>,
## # BU00141102 <dbl>, BU00141103 <dbl>, BU00141104 <dbl>, BU00141105
## # <dbl>, BU00141106 <dbl>, BU00141107 <dbl>, BU00141200 <dbl>,
## # BU00141201 <dbl>, BU00141202 <dbl>, BU00141203 <dbl>, BU00141204
## # <dbl>, BU00141205 <dbl>, BU00141206 <dbl>, BU00141207 <dbl>,
## # BU00141208 <dbl>, BU00141209 <dbl>, BU00141210 <dbl>, BU00141211
## # <dbl>, BU00141300 <dbl>, BU00141301 <dbl>, BU00141302 <dbl>,
## # BU00141400 <dbl>, BU00141401 <dbl>, BU00141402 <dbl>, shopcode <chr>
This is still not in the format mentioned earlier, as each origin has its own column. To re-shape the data into three columns the gather command is needed. This takes data spread over several columns (here all of the origins) and ‘gathers’ them into a single column. The column names get a new variable, and the distance values are stacked.
This is shown below.
dist_tab %>% gather(key=BU_CODE,value=Dist,-shopcode) -> dist_tab
head(dist_tab)
## # A tibble: 6 x 3
## shopcode BU_CODE Dist
## <chr> <chr> <dbl>
## 1 AR000802617 BU00140000 2.62
## 2 AR000935740 BU00140000 0.947
## 3 AR000818293 BU00140000 0.302
## 4 AR000809760 BU00140000 0.839
## 5 AR000805380 BU00140000 3.61
## 6 AR000800472 BU00140000 1.70
Note that this approach creates two new columns - the one with the old column name, and the one with the values. These are the key and value parameters. The final -shopcode parameters tells gather not to include this column in the gathering process, but to carry it forward into the modified data frame.
Next, we want to join the attrativeness and productiveness variables onto this list, using database join-type commands that are provided by the left_join function. First off, join the attractiveness information from gron_shops :
dist_tab %>% left_join(gron_shops) %>% select(-geometry) -> dist_tab
head(dist_tab)
## # A tibble: 6 x 5
## shopcode BU_CODE Dist shop Floorspace
## <chr> <chr> <dbl> <chr> <int>
## 1 AR000802617 BU00140000 2.62 AH de Wijert 1617
## 2 AR000935740 BU00140000 0.947 AH to go - Groningen 313
## 3 AR000818293 BU00140000 0.302 Albert Heijn 1242
## 4 AR000809760 BU00140000 0.839 Albert Heijn 45000
## 5 AR000805380 BU00140000 3.61 Albert Heijn 1060 1007
## 6 AR000800472 BU00140000 1.70 Albert Heijn 1093 1997
NOTE: You may get an error here if you have recently been using theMASS package - this also has a function called select that does a different thing. When more than one package has been loaded with the same nameed function, tyhe default one called may be the wrong one. Here you need the dplyr function select, not the MASS one - and this could have caused the error. To make this unambigous, use the following instead:
dist_tab %>% left_join(gron_shops) %>% dplyr::select(-geometry) -> dist_tab
head(dist_tab)
dplyr::select instead of just select, this makes it clear which select you mean.Another approach is to use unloadNameSpace which effectively unloads a package - so entering
unloadNamespace("MASS")
can also fix this.
The select(-geometry) is used here as we want to create a straight forward data frame here, not a geographical object. We can also add the administrative area information in the same way.
dist_tab %>% left_join(gron_pts) %>% select(-geometry) -> dist_tab
head(dist_tab)
## # A tibble: 6 x 7
## shopcode BU_CODE Dist shop Floorspace AANT… Money
## <chr> <chr> <dbl> <chr> <int> <dbl> <dbl>
## 1 AR000802617 BU00140000 2.62 AH de Wijert 1617 4440 88800
## 2 AR000935740 BU00140000 0.947 AH to go - Groningen 313 4440 88800
## 3 AR000818293 BU00140000 0.302 Albert Heijn 1242 4440 88800
## 4 AR000809760 BU00140000 0.839 Albert Heijn 45000 4440 88800
## 5 AR000805380 BU00140000 3.61 Albert Heijn 1060 1007 4440 88800
## 6 AR000800472 BU00140000 1.70 Albert Heijn 1093 1997 4440 88800
So now dist_tab has a series of rows corresponding to each \(i,j\) pair, with the values of \(d_{ij}\) \(P_i\) and \(A_j\) among the variables -
DistMoneyFloorspaceWe can now actually carry out the modelling. As an aside, the above work is a typical example of what some data scientists term data mungeing - basically not doing analysis, but re-shaping data into a format that will be helpful for the analysis. In the practical here we worked through this step-by-step, but in practice, one might specify the entire process as a single pipeline:
dmat %>% as_data_frame %>%
mutate(shopcode=rownames(dmat)) %>%
gather(key=BU_CODE,value=Dist,-shopcode) %>%
left_join(gron_shops) %>% select(-geometry) %>%
left_join(gron_pts) %>% select(-geometry) -> dist_tab
Before we noted that the spatial interaction model takes the form \(I_{ij} = f(P_i,A_j,d_{ij})\) - here, we will use a specific, multiplicative format:
\[
I_{ij} = \textsf{const}_j \times A_i^\alpha g(d_{ij})
\] where \(g\) is a decaying function of distance. Note that the productiveness does not appear explicitly here - in fact it features as a constraint - essentially for each administrative area, the total money going to all of the stores must add up to the total money for that area. This is achieved by multiplying the attractiveness and distance-deterrence terms by a constant for each area - the constant re-scales the amount spent so that the total for the area is equal to the variable Money for that area.
The form that \(g\) takes here will be that of exponential decay. Often this is written as \(e^{-\textsf{const} \times d_{ij}}\) but here we will use the form
\[ g(d_{ij}) = 2^{-d_{ij}/h} \] where \(h\) is the distance half-life. That is, for every \(h\) kilometers between the store and the home location, the desirability of making the trip halves. Although this can be written in more conventional exponential form as \(e^{-\log(2)d_{ij}/h}\), the parameter \(h\) is more easily interpreted here. Thus the full model is
\[ I_{ij} = \textsf{const}_j A_i^\alpha 2^{-d_{ij}/h} \]
Having arrived at the equation we now need to compute this and add a column to the dist_tab data frame to model the \(I_{ij}\) values. These will be the estimated expenditure for the population of area \(j\) at store \(i\). For a starting point, assume a fairly strong distance deterrence with a distance half-life of 200m. We can change this later.
The first part of the model (before applying the production constraint) is to compute \(A_i^\alpha 2^{-d_ij/h}\). \(\alpha\) here allows for extra attractiveness of the larger stores. In this case it is assumed to be quite close to 1 - say 1.05 - so that larger stores have a slightly disproportionate increase in attractiveness, but the effect is fairly weak.
dist_tab %>%
mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/0.2)) -> sim_tab
head(sim_tab)
## # A tibble: 6 x 8
## shopcode BU_CODE Dist shop Floorsp… AANT_… Money Flow_Fa…
## <chr> <chr> <dbl> <chr> <int> <dbl> <dbl> <dbl>
## 1 AR000802617 BU00140000 2.62 AH de Wijert 1617 4440 88800 2.65e⁻¹
## 2 AR000935740 BU00140000 0.947 AH to go - … 313 4440 88800 1.57e⁺¹
## 3 AR000818293 BU00140000 0.302 Albert Heijn 1242 4440 88800 6.23e⁺²
## 4 AR000809760 BU00140000 0.839 Albert Heijn 45000 4440 88800 4.20e⁺³
## 5 AR000805380 BU00140000 3.61 Albert Heij… 1007 4440 88800 5.25e⁻³
## 6 AR000800472 BU00140000 1.70 Albert Heij… 1997 4440 88800 8.14e⁺⁰
This has the Flow_Factor variable proportional to the cash flow between \(i\) and \(j\) but as yet it hasn’t been scaled. To do this, we first need to rescale the flow factors for each administrative area so that they sum to one. Then, they are each multiplied by the Money variable for that area, so that the total modelled flow of cash equals the total money coming from that area. This is done by using the sum function. If the sim_tab data frame is first grouped by the administrative unit code, (using group_by(BU_CODE)) then the sum function for each row applies to the sum of values for its specific BU_CODE - rather than the sum for the entire column. Thus
sim_tab %>% group_by(BU_CODE) %>%
mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) ->
sim_tab
head(sim_tab)
## # A tibble: 6 x 9
## # Groups: BU_CODE [1]
## shopcode BU_CODE Dist shop Floor… AANT_… Money Flow_F… Cash_F…
## <chr> <chr> <dbl> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AR000802617 BU00140000 2.62 AH de … 1617 4440 88800 2.65e⁻¹ 2.47e⁺⁰
## 2 AR000935740 BU00140000 0.947 AH to … 313 4440 88800 1.57e⁺¹ 1.45e⁺²
## 3 AR000818293 BU00140000 0.302 Albert… 1242 4440 88800 6.23e⁺² 5.79e⁺³
## 4 AR000809760 BU00140000 0.839 Albert… 45000 4440 88800 4.20e⁺³ 3.90e⁺⁴
## 5 AR000805380 BU00140000 3.61 Albert… 1007 4440 88800 5.25e⁻³ 4.88e⁻²
## 6 AR000800472 BU00140000 1.70 Albert… 1997 4440 88800 8.14e⁺⁰ 7.57e⁺¹
Thus we now have a data frame whose rows are the details of each \(i\) and \(j\) combination, including the modelled cash flow. We can first check that the constraint has worked:
sim_tab %>% group_by(BU_CODE) %>% summarise(Out_Flow = sum(Cash_Flow), Money=first(Money))
## # A tibble: 108 x 3
## BU_CODE Out_Flow Money
## <chr> <dbl> <dbl>
## 1 BU00140000 88800 88800
## 2 BU00140001 132000 132000
## 3 BU00140002 78400 78400
## 4 BU00140003 35300 35300
## 5 BU00140004 200 200
## 6 BU00140005 110200 110200
## 7 BU00140007 0 0
## 8 BU00140008 7700 7700
## 9 BU00140100 30000 30000
## 10 BU00140101 104800 104800
## # ... with 98 more rows
Here the summarise function creates a summary data frame, where statistics such as the sum or mean are computed for each group. Thus Out_Flow is the sum of all of the cash flows coming out of each administrative area. Note that the Money variable is the same for each member of the group in sim_tab so taking the first is just a way of summarising several repeated values with a single incidence of this value. Thus the output shows that the summed modelled cash flows from each adminstrative area is the same as the Money for that area. The constraint has worked. Phew!
Now we are happy with the model we may want to see what happens for different half-life distances. As with the data mungeing we can express the modelling process as a single pipeline. But to save time we can also put it into a function, with the half-life as an argument.
fit_sim <- function(h) {
dist_tab %>%
mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/h)) %>%
group_by(BU_CODE) %>%
mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) -> res
return(res)
}
head(fit_sim(0.75))
## # A tibble: 6 x 9
## # Groups: BU_CODE [1]
## shopcode BU_CODE Dist shop Floor… AANT_… Money Flow_F… Cash_…
## <chr> <chr> <dbl> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AR000802617 BU00140000 2.62 AH de W… 1617 4440 88800 208 3.02e²
## 2 AR000935740 BU00140000 0.947 AH to g… 313 4440 88800 174 2.53e²
## 3 AR000818293 BU00140000 0.302 Albert … 1242 4440 88800 1342 1.96e³
## 4 AR000809760 BU00140000 0.839 Albert … 45000 4440 88800 35414 5.16e⁴
## 5 AR000805380 BU00140000 3.61 Albert … 1007 4440 88800 50.6 7.38e¹
## 6 AR000800472 BU00140000 1.70 Albert … 1997 4440 88800 608 8.87e²
We now have a data frame sim_tab that tabulates the results of the Spatial Interaction Model. The next step is to investigate the geographical patterns associated with this. Suppose we are mainly concerned with the AH/Albert Heijn group of stores. One starting point might be to discover the total cash flows predicted into each of the stores - this is a two-stage process - firstly filter out these stores (based on text patterns in the shop name) - and then group by the stores and sum over them using the summerise function. The pipeline is as below:
sim_tab %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(shopcode) %>%
summarise(InFlow=sum(Cash_Flow)) -> inflow_tab
inflow_tab
## # A tibble: 18 x 2
## shopcode InFlow
## <chr> <dbl>
## 1 AR000800362 125347
## 2 AR000800472 93925
## 3 AR000801783 67387
## 4 AR000801845 44432
## 5 AR000802353 137264
## 6 AR000802617 83345
## 7 AR000802690 163673
## 8 AR000804383 36782
## 9 AR000805380 102691
## 10 AR000806759 39424
## 11 AR000807891 99728
## 12 AR000809760 691538
## 13 AR000818293 42593
## 14 AR000923136 49486
## 15 AR000925140 45218
## 16 AR000928876 187409
## 17 AR000935740 11206
## 18 AR000952195 36642
Here grepl provides a ogical grep search - grep is an acronym for generalised regular expression parser - it is a tool for text matching, based on character patterns. More details are here: https://en.wikipedia.org/wiki/Grep - but for now, it is enough to note that ‘AH|Heijn’ picks out shop names with either ‘AH’ or ‘Heijn’ in them. However, although this tabulates the influx of cash, it doesn’t show a map. To get this, firstly join this table to the gron_shops geographical object and then use tmap to create a map.
gron_shops %>% left_join(inflow_tab) -> inflow_pts
tm_shape(inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7)
This depends on the distance half life. Once again we can ecapsulate all of this in a function with \(h\) as the argument, making it easier to explore:
map_inflow <- function(h) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(shopcode) %>%
summarise(InFlow=sum(Cash_Flow)) -> inflow_tab
gron_shops %>% left_join(inflow_tab) -> inflow_pts
tm_shape(inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7)
}
map_inflow(0.5)
The above shows the effect if \(h\) is 500m - now check it for 1km:
map_inflow(1.0)
In a nutshell, as the distance half life gets larger, people are willing to travel further to stores, and the more attractive larger stores get a greater share of the money - people will travel to a more favourable store even when it is a long way away.
The idea of a catchment area here is to identify, for each administrative unit, what is the most attractive store. This is done by considering the cash flow out of each unit to each of the stores and seeing which store gets the greatest share of the total outflow from the unit. In R (and in tidyverse terms) this is done using the rank function. Rank takes a variable and for each row, returns the rank of the value for that row, out of all of the rows. The ranks are in ascending order, so for \(n\) items the largest has the value \(n\).
To find the ranks of the outflow shares on an administrative unit by adminstrative unit basis, group_by is a helper function again. In this case it causes the rank function to return the relative ranks of the flow factors (\(A_i^\alpha 2^{-d_{ij}/h}\)) within each group. Finally, it would by useful to have ranks in descending order - you are interested in the maximum in each group, and so rather than trying to find ranks of \(n\) when you don’t know what \(n\) is, it is more useful to find ranks of 1. Changing the sign of the variable given to the rank function achieves this - essentially it flips the orders of the variables. Check the results here:
sim_tab %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor)) -> rank_tab
rank_tab
## # A tibble: 1,944 x 10
## # Groups: BU_CODE [108]
## shopcode BU_CO… Dist shop Floo… AANT… Money Flow_F… Cash_F… R
## <chr> <chr> <dbl> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AR000802617 BU001… 2.62 AH de… 1617 4440 88800 2.65e⁻¹ 2.47e⁺⁰ 15.0
## 2 AR000935740 BU001… 0.947 AH to… 313 4440 88800 1.57e⁺¹ 1.45e⁺² 7.00
## 3 AR000818293 BU001… 0.302 Alber… 1242 4440 88800 6.23e⁺² 5.79e⁺³ 3.00
## 4 AR000809760 BU001… 0.839 Alber… 45000 4440 88800 4.20e⁺³ 3.90e⁺⁴ 1.00
## 5 AR000805380 BU001… 3.61 Alber… 1007 4440 88800 5.25e⁻³ 4.88e⁻² 17.0
## 6 AR000800472 BU001… 1.70 Alber… 1997 4440 88800 8.14e⁺⁰ 7.57e⁺¹ 9.00
## 7 AR000806759 BU001… 0.433 Alber… 968 4440 88800 3.04e⁺² 2.83e⁺³ 4.00
## 8 AR000925140 BU001… 1.85 Alber… 1075 4440 88800 2.54e⁺⁰ 2.36e⁺¹ 10.0
## 9 AR000952195 BU001… 0.590 Alber… 1020 4440 88800 1.87e⁺² 1.73e⁺³ 6.00
## 10 AR000928876 BU001… 2.98 Alber… 25478 4440 88800 1.39e⁺⁰ 1.29e⁺¹ 11.0
## # ... with 1,934 more rows
The interactions with the highest rank for each administrative unit can then be found by filtering out those for which R is equal to 1.
rank_tab %>% filter(R==1) -> rank_tab
head(rank_tab)
## # A tibble: 6 x 10
## # Groups: BU_CODE [6]
## shopcode BU_CO… Dist shop Floor… AANT_… Money Flow… Cash_… R
## <chr> <chr> <dbl> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AR000809760 BU001… 0.839 Albert… 45000 4440 88800 4200 3.90e⁴ 1.00
## 2 AR000809760 BU001… 0.924 Albert… 45000 6600 132000 3123 5.53e⁴ 1.00
## 3 AR000809760 BU001… 0.242 Albert… 45000 3920 78400 33292 7.34e⁴ 1.00
## 4 AR000806759 BU001… 0.195 Albert… 968 1765 35300 695 6.69e³ 1.00
## 5 AR000809760 BU001… 1.45 Albert… 45000 10.0 200 502 3.59e¹ 1.00
## 6 AR000809760 BU001… 1.05 Albert… 45000 5510 110200 2023 3.67e⁴ 1.00
There is now a unique row for each administrative unit, and the shop also appearing in that row is the one where most money flows - and this states which catchment area the administrative unit is in. To map these, firstly join this table to the groningen_polys object:
gron_polys %>% left_join(rank_tab) -> catch_polys
This can then be mapped, using the shopcode as a colouring variable - making the catchments visible:
tm_shape(catch_polys) + tm_polygons(col='shopcode')
A further useful technique would be to create dissolved catchment polygons. This essentially involves removing the boundaries between administrative unit polygons assigned to the same store. This is done by applying a group_by operation to a simple features object (here, catch_polys) and then a summarise - this merges all of the polygons in the same group. Here there is no summary variable, we just do this to make the merge:
catch_polys %>% group_by(shopcode) %>% summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')
As before - all of this could be made into a small number of larger pipelines and encapsulated in a function of \(h\):
map_catchments <- function(h) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor)) %>%
filter(R==1) -> rank_tab
gron_polys %>%
left_join(rank_tab) %>%
group_by(shopcode) %>%
summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')
}
map_catchments(0.1)
Now you can again experiment with different \(h\) values - for example 750m:
map_catchments(0.75)
Suppose now a new store was proposed with a floor space of 5000 square meters - therefore a fairly large store. How could the effect of this be allowed for? The main way to see the impact is to add the new store to the gron_shops spatial layer, and to re-run the model with this extra store. This can then be compared to the original model. To see this, firstly download a new spatial data object proposed:
download.file("https://www.dropbox.com/s/qjzxl0cr7ehtrti/proposed.zip?dl=1","proposed.zip")
unzip("proposed.zip")
proposed <- st_read(".","proposed")
## Reading layer `proposed' from data source `/Users/chrisbrunsdon/Dropbox/Groningen/Session 6' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 3 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 235494.2 ymin: 580729.9 xmax: 235494.2 ymax: 580729.9
## epsg (SRID): NA
## proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
Then, have a look at its location - the proposed store is in red.
tm_shape(gron_shops) + tm_dots(col='dodgerblue',size='Floorspace') + tm_shape(proposed) + tm_dots(col='indianred',size='Floorspace')
To re-run the model, again using a function is useful. Below is a function that encapsulates all of the steps in producing a model, from the provision of a shops object and an administrative area object as well as \(h\). The function looks intimidating, but really it is just a combination of the operations done above - the data munging and model fitting.
new_model <- function(admin,shops,h) {
dmat <- get_dists(shops,admin)/1000
rownames(dmat) <- shops$shopcode
colnames(dmat) <- admin$BU_CODE
dmat %>% as_data_frame %>%
mutate(shopcode=rownames(dmat)) %>%
gather(key=BU_CODE,value=Dist,-shopcode) %>%
left_join(shops) %>% select(-geometry) %>%
left_join(admin) %>% select(-geometry) -> dist_tab
dist_tab %>%
mutate(Flow_Factor=(Floorspace^1.05)*2^(-Dist/h)) %>%
group_by(BU_CODE) %>%
mutate(Cash_Flow = Money*Flow_Factor/sum(Flow_Factor)) -> res
return(res)
}
test <- new_model(gron_pts,gron_shops,0.3)
head(test)
## # A tibble: 6 x 9
## # Groups: BU_CODE [1]
## shopcode BU_CODE Dist shop Floor… AANT_… Money Flow_F… Cash_…
## <chr> <chr> <dbl> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AR000802617 BU00140000 2.62 AH de W… 1617 4440 88800 5.48e⁺⁰ 2.47e¹
## 2 AR000935740 BU00140000 0.947 AH to g… 313 4440 88800 4.68e⁺¹ 2.10e²
## 3 AR000818293 BU00140000 0.302 Albert … 1242 4440 88800 8.83e⁺² 3.97e³
## 4 AR000809760 BU00140000 0.839 Albert … 45000 4440 88800 1.11e⁺⁴ 4.98e⁴
## 5 AR000805380 BU00140000 3.61 Albert … 1007 4440 88800 3.40e⁻¹ 1.53e⁰
## 6 AR000800472 BU00140000 1.70 Albert … 1997 4440 88800 5.78e⁺¹ 2.60e²
Here the function takes the inputs and returns a data frame with the fitted model values, in the format of sim_tab used earlier.
To re-fit thew model with a new store, firstly create a new scenario by combining the new proposed store with the existing ones in Groningen:
all_shops <- rbind(proposed,gron_shops)
Then update the simulation table:
new_sim_tab <- new_model(gron_pts,all_shops,0.2)
head(new_sim_tab)
## # A tibble: 6 x 9
## # Groups: BU_CODE [1]
## shopcode BU_CODE Dist shop Floor… AANT_… Money Flow_F… Cash_F…
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARxxxxxxxxx BU00140000 2.28 New St… 5000 4440 88800 2.88e⁺⁰ 2.67e⁺¹
## 2 AR000802617 BU00140000 2.62 AH de … 1617 4440 88800 2.65e⁻¹ 2.47e⁺⁰
## 3 AR000935740 BU00140000 0.947 AH to … 313 4440 88800 1.57e⁺¹ 1.45e⁺²
## 4 AR000818293 BU00140000 0.302 Albert… 1242 4440 88800 6.23e⁺² 5.79e⁺³
## 5 AR000809760 BU00140000 0.839 Albert… 45000 4440 88800 4.20e⁺³ 3.90e⁺⁴
## 6 AR000805380 BU00140000 3.61 Albert… 1007 4440 88800 5.25e⁻³ 4.88e⁻²
We can now compute the inflows to all of the stores.
new_sim_tab %>% filter(grepl('AH|Heijn|New Store',shop)) %>%
group_by(shopcode) %>%
summarise(InFlow=sum(Cash_Flow)) -> new_inflow_tab
head(new_inflow_tab)
## # A tibble: 6 x 2
## shopcode InFlow
## <chr> <dbl>
## 1 AR000800362 125346
## 2 AR000800472 93923
## 3 AR000801783 66115
## 4 AR000801845 44208
## 5 AR000802353 137259
## 6 AR000802617 82637
… and map the new inflows:
all_shops %>% left_join(new_inflow_tab) -> new_inflow_pts
tm_shape(new_inflow_pts) + tm_dots(size='InFlow',col='darkgreen',alpha=0.7)
If we join the original inflow_tab to the new_inflow_tab using the shop code as a merge key we obtain a table with new and old cash inflow estimates, respectively called inflow.x and inflow.y.
new_inflow_tab %>% left_join(inflow_tab,by="shopcode") ->
diff_inflow_tab
diff_inflow_tab
## # A tibble: 19 x 3
## shopcode InFlow.x InFlow.y
## <chr> <dbl> <dbl>
## 1 AR000800362 125346 125347
## 2 AR000800472 93923 93925
## 3 AR000801783 66115 67387
## 4 AR000801845 44208 44432
## 5 AR000802353 137259 137264
## 6 AR000802617 82637 83345
## 7 AR000802690 154633 163673
## 8 AR000804383 35830 36782
## 9 AR000805380 102655 102691
## 10 AR000806759 39300 39424
## 11 AR000807891 99717 99728
## 12 AR000809760 675622 691538
## 13 AR000818293 42347 42593
## 14 AR000923136 46870 49486
## 15 AR000925140 45217 45218
## 16 AR000928876 187409 187409
## 17 AR000935740 11006 11206
## 18 AR000952195 36620 36642
## 19 ARxxxxxxxxx 57940 NA
The final shopcode is the proposed store, and so doesn’t have an original inflow. Here it makes sense to set this as zero:
diff_inflow_tab$InFlow.y[19] <- 0
We can then compute the differences:
diff_inflow_tab %>% mutate(Diff=InFlow.x - InFlow.y) -> diff_inflow_tab
diff_inflow_tab
## # A tibble: 19 x 4
## shopcode InFlow.x InFlow.y Diff
## <chr> <dbl> <dbl> <dbl>
## 1 AR000800362 125346 125347 - 0.913
## 2 AR000800472 93923 93925 - 1.87
## 3 AR000801783 66115 67387 - 1272
## 4 AR000801845 44208 44432 - 224
## 5 AR000802353 137259 137264 - 4.05
## 6 AR000802617 82637 83345 - 708
## 7 AR000802690 154633 163673 - 9040
## 8 AR000804383 35830 36782 - 952
## 9 AR000805380 102655 102691 - 35.3
## 10 AR000806759 39300 39424 - 123
## 11 AR000807891 99717 99728 - 10.4
## 12 AR000809760 675622 691538 -15915
## 13 AR000818293 42347 42593 - 246
## 14 AR000923136 46870 49486 - 2617
## 15 AR000925140 45217 45218 - 0.724
## 16 AR000928876 187409 187409 - 0.563
## 17 AR000935740 11006 11206 - 200
## 18 AR000952195 36620 36642 - 22.5
## 19 ARxxxxxxxxx 57940 0 57940
Finally to map this as proportional symbols, since some values are positive and others are negative, create an absolute value of change column - and then map this
diff_inflow_tab %>% mutate(AbsDiff=abs(Diff)) -> diff_inflow_tab
all_shops %>% left_join(diff_inflow_tab) -> diff_inflow_pts
tm_shape(diff_inflow_pts) + tm_dots(size='AbsDiff',col='Diff',alpha=0.7) + tm_style_col_blind()
This helps to identify the stores that will experience the most impact if a new store is opened as proposed.
As an exercise, try to create a catchment map for the revised store layout.
Most of the basic kinds of analysis are covered in the first 5 sections - here are a couple of fancy ideas that may be useful. Each one of these is shown as a larger pipeline (or sequence of pipelines) in a function - it may be an interesting exercise to look over the function listing and figure how it works.
The first of these generalises the idea of catchments to find second choice stores - even if one of the stores gets the largest share of cash flow from an administrative unit, it won’t necessarily get all of the cash. This function demonstrates how to map second choice catchment areas:
map_catchments_2 <- function(h) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor)) %>%
filter(R==2) -> rank_tab
gron_polys %>%
left_join(rank_tab) %>%
group_by(shopcode) %>%
summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')
}
map_catchments_2(0.5)
More generally, you can map \(k\)th choice catchments for any rank:
map_catchments_k <- function(h,k=1) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor)) %>%
filter(R==k) -> rank_tab
gron_polys %>%
left_join(rank_tab) %>%
group_by(shopcode) %>%
summarise -> catch_polys
tm_shape(catch_polys) + tm_polygons(col='shopcode')
}
map_catchments_k(0.5,3)
Another variant is mapping catchments in a more fuzzy way. Here again it is noted that although for each administrative unit there is a ‘most favoured’ store, the share that this store gets of the total cash flow varies. This function maps the standard catchment areas, but varies intensity of shading to show how dominant the most favoured store actually is:
map_catchments_fuzz <- function(h) {
fit_sim(h) %>% filter(grepl('AH|Heijn',shop)) %>%
group_by(BU_CODE) %>%
mutate(R=rank(-Flow_Factor), Share=Flow_Factor/sum(Flow_Factor)) %>%
filter(R==1) -> rank_tab
gron_polys %>%
left_join(rank_tab) -> catch_polys1
catch_polys1 %>% group_by(shopcode) %>% summarise -> catch_polys2
tm_shape(catch_polys1) + tm_polygons(col='Share') + tm_shape(catch_polys2) + tm_borders(col='darkblue')
}
map_catchments_fuzz(0.75)
Here, places nearest to the boundaries between catchments tend to have far lower shares.