Different Types of SIM Model

Chris Brunsdon

2018-04-12

Introduction

This vignette assumes you have already worked through the basic introduction, which explains the purpose of the simR package and its basic use. The type of spatial interaction model (SIM) shown there was the doubly constrained model, in which the fitted interactions (in this case journeys to work) were constrained so that the total trips to each destination agreed with the observed totals, as did the total trips from each origin. The model was written as

\[ \hat{T}_{ij} = kA_iB_j\exp\left(-\frac{c_{ij}}{\gamma}\right) \]

In this case, apart from calibrating a general distance decay effect (expressed in the same units as the ‘cost’ of transit from zone \(i\) to zone \(j\)), characteristics of the origin and destination zones were not considered. Instead, each zone was given an attraction ‘balancing factor’ \(B_j\) and a production ‘balancing factor’ \(A_i\). These represent, respectively, the relative ability of a zone to attract journeys to work, or to produce them, for a given cost, and they are necessary to create a model capable of meeting the double constraints mentioned earlier. These constraints can be stated algebraically as

\[ kB_j \sum_{i} A_i \exp\left(-\frac{c_{ij}}{\gamma}\right) = \sum_i T_{ij} \] and

\[ kA_i \sum_{j} B_j \exp\left(-\frac{c_{ij}}{\gamma}\right) = \sum_j T_{ij} \]
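simR estimates these balancing factors internally, but to make the constraints concrete, the sketch below shows how, for a fixed \(\gamma\), values of \(\{A_i\}\) and \(\{B_j\}\) satisfying both constraints can be found by iterative proportional fitting. This is not simR’s actual implementation - the function and argument names here are hypothetical - and the overall constant \(k\) is absorbed into the factors, since they are only identified up to scale.

# cost: matrix of costs c_ij; orig_tot, dest_tot: observed row and
# column sums of the trip matrix; gamma: fixed distance decay parameter
balance_factors <- function(cost, orig_tot, dest_tot, gamma, tol = 1e-8) {
  decay <- exp(-cost / gamma)
  A <- rep(1, nrow(cost))
  B <- rep(1, ncol(cost))
  repeat {
    A_new <- as.vector(orig_tot / (decay %*% B))        # enforce origin totals
    B_new <- as.vector(dest_tot / (t(decay) %*% A_new)) # enforce destination totals
    if (max(abs(A_new - A), abs(B_new - B)) < tol) break
    A <- A_new
    B <- B_new
  }
  list(A = A_new, B = B_new)  # fitted trips are then A[i] * B[j] * decay[i, j]
}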

However, it is also possible to drop either of these constraints. Suppose, for example, that the attractiveness of zones is taken to be a function of some other variable - say the number of people employed by large companies (total employees \(>\) 500) in an area. This will not be the same as the number of people travelling to a place to work in the East Midlands data set, for at least two reasons:

  1. Not all people work for a large-scale employer
  2. Some employees may come from outside of this region

However, the attractiveness is still a function of this quantity - it seems reasonable that an area with more large-scale employers is likely to attract more journeys to work. If this quantity is called \(E_j\) for destination zone \(j\), one could say

\[ B_j = f(E_j) \] Typically, for a multiplicative model, an exponential link function is used, which guarantees that the factor remains positive:

\[ B_j = \exp(\alpha E_j) \] Thus the new model is

\[ \hat{T}_{ij} = kA_i\exp(\alpha E_j)\exp\left(-\frac{c_{ij}}{\gamma}\right) \]

And since the \(E_j\)’s are fixed quantities, only the totals for trips from the origins can now be constrained. However, an advantage of relaxing the destination constraint is that, once the model is calibrated, ‘what if’ questions can be answered by changing the \(E_j\) values - for example, if a new company opened in zone \(j\), offering 600 new jobs, how might the travel-to-work patterns change, assuming (at least initially) that the working population of each zone does not change? A model of this kind is called an origin constrained SIM - or sometimes a production constrained SIM. More generally, suppose there are several attractiveness measures - say \(m\) of them, indexed \(1 \cdots m\) - then the model becomes

\[ \hat{T}_{ij} = kA_i\exp\left(\sum_m \alpha_m E_{mj} \right)\exp\left(-\frac{c_{ij}}{\gamma}\right) \]

Here the quantities to be calibrated are the balancing terms for the origins \(\{A_i\}\), the coefficients \(\{\alpha_m\}\), and \(\gamma\).
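Taking logarithms shows that the model is log-linear in its parameters:

\[ \log \hat{T}_{ij} = \log k + \log A_i + \sum_m \alpha_m E_{mj} - \frac{c_{ij}}{\gamma} \]

One consequence - noted as background only, since simR’s internal fitting method is not documented in this vignette - is that a model of this form can also be calibrated as a Poisson regression, with the origin zone entered as a factor to supply the \(\log A_i\) terms. A minimal sketch in base R, using the data sets introduced in the next section (sim_glm is a hypothetical name, not a simR object):

library(simR)
library(tidyverse)
data(eastmid_ttw)
data(eastmid_empl)
# Join the attractiveness variables onto each origin-destination pair,
# then fit a Poisson log-linear model. The From factor plays the role
# of the log A_i terms, the four employment coefficients are the
# alpha_m, and gamma is minus the reciprocal of the Dist coefficient.
eastmid_ttw %>%
  left_join(eastmid_empl, by = 'To') %>%
  glm(Count ~ From + Micro + Small + Medium + Large + Dist,
      family = poisson, data = .) -> sim_glm
-1 / coef(sim_glm)['Dist']   # estimate of gamma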

Calibrating an Origin Constrained SIM using simR

This is done using the orig_constrained_sim function. It is similar to doubly_constrained_sim, but also requires a data frame giving the attractiveness variables for the destinations. Rather than asking for the attractiveness variables to be specified, the function assumes that every variable in this data frame (apart from the key) is to be used in the model. The attractiveness data frame has \(m+1\) columns: \(m\) of these are the variables \(E_{1.}\) to \(E_{m.}\), and the final column is a key linking each row of variables to the corresponding destination in the interaction data frame. The key should have the same column name as the destination column there, and the place names should match exactly. Here, the eastmid_empl data frame can be used:

library(simR)
library(tidyverse)
data(eastmid_ttw)
data(eastmid_empl)
eastmid_empl
#> # A tibble: 40 x 5
#>    To           Micro Small Medium Large
#>  * <chr>        <int> <int>  <int> <int>
#>  1 Amber Valley  3935   415     95    10
#>  2 Ashfield      2475   315     80    15
#>  3 Bassetlaw     3590   390     55     0
#>  4 Blaby         3865   380     70     5
#>  5 Bolsover      2035   250     50    10
#>  6 Boston        1885   260     45    10
#>  7 Broxtowe      3080   265     35     5
#>  8 Charnwood     6840   660    105    15
#>  9 Chesterfield  2840   390     90    10
#> 10 Corby         1965   225     50     5
#> # ... with 30 more rows
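Since orig_constrained_sim links the two data frames by place name, a quick sanity check (not required by simR, but useful) is that every destination in the interaction data has a matching row of employment variables:

# destinations in the travel-to-work data with no employment row -
# this should return an empty character vector
setdiff(unique(eastmid_ttw$To), eastmid_empl$To)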

This has four columns (apart from the name of the area in the column To) - these are

Column Name   Description
-----------   -------------------------------
Micro         Number of employees 9 or less
Small         Number of employees 10 to 49
Medium        Number of employees 50 to 249
Large         Number of employees 250 or more

For each Local Authority area, each variable gives a count of the number of people employed in companies of the four size categories¹. These four variables therefore provide a set of attractiveness measures.

A SIM object can then be created:

eastmid_ttw %>% orig_constrained_sim(From,To,eastmid_empl,Dist,Count) -> sim_o_const
sim_o_const
#> Spatial Interaction Model
#> =========================
#> 
#> Variables
#> ---------
#> orig =  From 
#> dest =  To 
#> cost =  Dist 
#> intr =  Count 
#> ---------------------
#> 
#> Model Type: Origin Constrained 
#> Cost Denominator Parameter   8.36 
#> Cost Half Life               5.80 
#> ---------------------
#> 
#> AIC                        266361.5 
#> ====================================

The cost half life is quite similar to that in the doubly constrained model - around 5.8 km.
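Here the ‘cost half life’ is the cost at which the distance decay term halves: setting \(\exp(-c/\gamma) = 1/2\) gives \(c = \gamma \log 2\), which agrees with the two printed values:

# half life = gamma * log(2)
8.36 * log(2)
#> [1] 5.79471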

The destination scores - the \(\exp\left(\sum_m \alpha_m E_{mj} \right)\) values for each area \(j\) - may be extracted via the dest_scores function. This gives a data frame with the destination name (dest) and the associated score (dest_score) as columns:

sim_o_const %>% dest_scores -> dest_sc
dest_sc 
#> # A tibble: 40 x 2
#>    dest         dest_score
#>    <chr>             <dbl>
#>  1 Amber Valley       2.80
#>  2 Ashfield           2.23
#>  3 Bassetlaw          1.94
#>  4 Blaby              2.22
#>  5 Bolsover           1.66
#>  6 Boston             1.57
#>  7 Broxtowe           1.51
#>  8 Charnwood          3.25
#>  9 Chesterfield       2.57
#> 10 Corby              1.72
#> # ... with 30 more rows
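Since each score is exactly \(\exp\left(\sum_m \alpha_m E_{mj}\right)\), its logarithm is linear in the employment counts. If the \(\alpha_m\) coefficients are not exposed directly by simR (an assumption here - alpha_fit below is an ordinary lm object, not part of the package), they can be recovered by regressing the log scores on the attractiveness variables; this is one route to the ‘what if’ calculations described earlier:

# log(dest_score) is an exact linear function of the four employment
# counts, so this regression should fit essentially perfectly
dest_sc %>%
  left_join(eastmid_empl, by = c('dest' = 'To')) %>%
  lm(log(dest_score) ~ Micro + Small + Medium + Large, data = .) ->
  alpha_fit
coef(alpha_fit)
# a zone gaining delta extra jobs in (say) the Large category would
# then have its attraction score multiplied by exp(alpha_Large * delta)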

As in the basic vignette, these can be mapped using the eastmid_sf simple features object.

library(tmap)
library(sf)
eastmid_sf %>% left_join(dest_sc ,by=c('name'='dest')) -> eastmid_dest_sc
legend_style <- tm_legend(legend.position=c('right','bottom'),
                          legend.text.size=0.4)
tm_shape(eastmid_dest_sc) + tm_polygons(col='dest_score', title='Attraction') +
  legend_style

Fitted values from this kind of model can also be added to a data frame:

eastmid_ttw %>% add_sim_fitted(sim_o_const) -> eastmid_ttw_fitted_oc
eastmid_ttw_fitted_oc
#> # A tibble: 1,600 x 5
#>    From         To           Count  Dist    Fitted
#>    <chr>        <chr>        <int> <dbl>     <dbl>
#>  1 Amber Valley Amber Valley 23506   0   20550    
#>  2 Ashfield     Amber Valley  2568  15.6  3007    
#>  3 Bassetlaw    Amber Valley    96  47.9   120    
#>  4 Blaby        Amber Valley    20  53.5    26.6  
#>  5 Bolsover     Amber Valley  2513  24.3   806    
#>  6 Boston       Amber Valley     3  97.1     0.327
#>  7 Broxtowe     Amber Valley  1733  14.9  2236    
#>  8 Charnwood    Amber Valley    96  39.0   228    
#>  9 Chesterfield Amber Valley   820  25.6   911    
#> 10 Corby        Amber Valley     0  78.6     1.64 
#> # ... with 1,590 more rows
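As a quick overall summary of how closely the fitted flows track the observed ones (not part of the workflow above, just a convenient by-product of add_sim_fitted), the correlation between the two columns can be computed directly:

eastmid_ttw_fitted_oc %>% summarise(r = cor(Count, Fitted))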

and check that, although the constraint that the fitted totals for each destination agree with the observed trips has been relaxed, the constraint on the origins still holds:

eastmid_ttw_fitted_oc %>% group_by(From) %>% 
  summarise(Count=sum(Count),Fitted=sum(Fitted))
#> # A tibble: 40 x 3
#>    From         Count Fitted
#>    <chr>        <int>  <dbl>
#>  1 Amber Valley 46901  46901
#>  2 Ashfield     45088  45088
#>  3 Bassetlaw    33861  33861
#>  4 Blaby        36611  36611
#>  5 Bolsover     25685  25685
#>  6 Boston       24253  24253
#>  7 Broxtowe     42554  42554
#>  8 Charnwood    62885  62885
#>  9 Chesterfield 35153  35153
#> 10 Corby        24906  24906
#> # ... with 30 more rows

… but it no longer does on the destinations.

eastmid_ttw_fitted_oc %>% group_by(To) %>% 
  summarise(Count=sum(Count),Fitted=sum(Fitted))
#> # A tibble: 40 x 3
#>    To           Count Fitted
#>    <chr>        <int>  <dbl>
#>  1 Amber Valley 43359  47810
#>  2 Ashfield     43156  43566
#>  3 Bassetlaw    35376  31114
#>  4 Blaby        40739  33583
#>  5 Bolsover     21819  27114
#>  6 Boston       25053  24118
#>  7 Broxtowe     29693  32165
#>  8 Charnwood    52943  56523
#>  9 Chesterfield 39477  39773
#> 10 Corby        26052  25584
#> # ... with 30 more rows

It is also possible to identify where the model over- or under-predicts by computing the difference between the observed Count and the modelled Fitted trips to each destination, as a percentage of the observed count.

eastmid_ttw_fitted_oc %>% group_by(To) %>% 
  summarise(Count=sum(Count),Fitted=sum(Fitted)) %>%
  transmute(To,Error=100*(Count-Fitted)/Count) -> 
  eastmid_err
eastmid_sf %>% left_join(eastmid_err ,by=c('name'='To')) -> 
  eastmid_err_sf
tm_shape(eastmid_err_sf) + tm_polygons(col='Error',title='Error (%)') +
  legend_style + tm_style_col_blind()

There is a notable spatial trend in the error here: the model underpredicts in the south and overpredicts in the north east of the region, with an anomaly at Lincoln, where it underpredicts again.


  1. Source: Office for National Statistics via NOMIS: https://www.nomisweb.co.uk