This vignette assumes you have already worked through the basic introduction, which broadly explains the purpose of the simR package, and its basic use. The type of spatial interaction model (SIM) shown there was the doubly constrained model, in which the fitted interactions (in this case journeys to work) were constrained so that the total trips to each destination agreed with those for the observed interactions, as did the total trips from each origin. The model was written as
\[ \hat{T_{ij}} = kA_iB_j\exp\left(-\frac{c_{ij}}{\gamma}\right) \]
In this case, apart from calibrating a general distance decay effect (expressed in the same units as the ‘cost’ of transit from zone \(i\) to zone \(j\)), characteristics of the origin and destination zones were not considered. Instead, each zone was given a general attraction ‘balancing factor’ and a single production ‘balancing factor’. These respectively represented the relative ability for zones to respectively attract people, or produce people for the journeys to work, for a given cost. These are necessary to create a model capable of meeting the double constraints mentioned earlier. These constraints can be stated algebraically as
\[ kB_j \sum_{i} A_i \exp\left(-\frac{c_{ij}}{\gamma}\right) = \sum_i T_{ij} \] and
\[ kA_i \sum_{j} B_j \exp\left(-\frac{c_{ij}}{\gamma}\right) = \sum_j T_{ij} \]
However, it is also possible to drop either of these constraints. Suppose, for example the attractiveness of zones was considered to be a function of some other variables - for example the number of people employed by large companies (total employees \(>\) 500) in an area. This will not be the same as the number of people travelling to a place to work from the East Midlands data set for at least two reasons:
However, the attractiveness is still a function of this quantity - it seems reasonable that an area with more large-scale employers is likely to attract more journeys to work. If this quantity is called \(E_i\) for zone \(i\) one could say
\[ A_i = f(E_i) \] Typically fpr a multiplicative model an exponential-function based link is used:
\[ A_i = \exp(\alpha E_i) \] Thus the new model is
\[ \hat{T_{ij}} = k\exp(\alpha E_i) B_j\exp\left(-\frac{c_{ij}}{\gamma}\right) \]
And since the \(E_i\)’s are fixed quantities only the sums from the origins can be constrained. However, an advantage of the constraint relaxation is that once the model is calibrated, we can answer ‘what if’ questions based on changing the \(E_i\) values - for example if a new company opened in zone \(i\), offering 600 new jobs, how might this change the travel-to-work patterns, assuming (at least initially) that the working population of each zone does not change. A model of this kind is called an origin constrained SIM - or possibly a production constrained SIM. More broadly suppose there are several attractiveness measures - say \(1 \cdots m\) - then overall the model is
\[ \hat{T_{ij}} = k\exp\left(\sum_m \alpha_m E_{mi} \right) B_j\exp\left(-\frac{c_{ij}}{\gamma}\right) \]
Here the quantities to be calibrated are the balancing terms for the origins \(\{B_j\}\) and \(\{\alpha_m \}\) and \(\gamma\).
simRThis is done using the orig_constrained_sim function. This is similar to the doubly_constrained_sim model, but also requires a data frame giving the attractiveness variables for the destination. Rather than specify the variables used to measure the attractiveness, this function just presumes that all of the variables included are to be used in the model. The attractiveness data frame has \(m+1\) columns - \(m\) of these are the variables \(E_{1.}\) to \(E_{m.}\) and the final column is a key to link each row of variables with the corresponding destination in the interaction data frame. This should have the same column name as that, and also the place names should match exactly. Here, the eastmid_empl data frame can be used:
library(simR)
library(tidyverse)
data(eastmid_ttw)
data(eastmid_empl)
eastmid_empl
#> # A tibble: 40 x 5
#> To Micro Small Medium Large
#> * <chr> <int> <int> <int> <int>
#> 1 Amber Valley 3935 415 95 10
#> 2 Ashfield 2475 315 80 15
#> 3 Bassetlaw 3590 390 55 0
#> 4 Blaby 3865 380 70 5
#> 5 Bolsover 2035 250 50 10
#> 6 Boston 1885 260 45 10
#> 7 Broxtowe 3080 265 35 5
#> 8 Charnwood 6840 660 105 15
#> 9 Chesterfield 2840 390 90 10
#> 10 Corby 1965 225 50 5
#> # ... with 30 more rowsThis has four columns (apart from the name of the area in the column To) - these are
| Column Name | Description |
|---|---|
Micro |
Number of employees 9 or less |
Small |
Number of employees 10 to 49 |
Medium |
Number of employees 50 to 249 |
Large |
Number of employees 250 or more |
For each Local Authority area, each variable gives a count of the number of people employed in companies of the four size categories1. These four variables therefore provide a set of attractiveness measures.
A SIM object can then be created:
eastmid_ttw %>% orig_constrained_sim(From,To,eastmid_empl,Dist,Count) -> sim_o_const
sim_o_const
#> Spatial Interaction Model
#> =========================
#>
#> Variables
#> ---------
#> orig = From
#> dest = To
#> cost = Dist
#> intr = Count
#> ---------------------
#>
#> Model Type: Origin Constrained
#> Cost Denominator Parameter 8.36
#> Cost Half Life 5.80
#> ---------------------
#>
#> AIC 266361.5
#> ====================================The cost half life is quite similar to that before - around 5.8km.
The destination scores - the \(\exp\left(\sum_m \alpha_m E_{mi} \right)\) valures for each area \(i\) may be extracted via the dest_scores function. This gives a data frame with the destination name (dest) and the associated score (dest_score) as columns:
sim_o_const %>% dest_scores -> dest_sc
dest_sc
#> # A tibble: 40 x 2
#> dest dest_score
#> <chr> <dbl>
#> 1 Amber Valley 2.80
#> 2 Ashfield 2.23
#> 3 Bassetlaw 1.94
#> 4 Blaby 2.22
#> 5 Bolsover 1.66
#> 6 Boston 1.57
#> 7 Broxtowe 1.51
#> 8 Charnwood 3.25
#> 9 Chesterfield 2.57
#> 10 Corby 1.72
#> # ... with 30 more rowsAs in the basic vignette, these can be mapped via the eastmid_sf feature.
library(tmap)
library(sf)
eastmid_sf %>% left_join(dest_sc ,by=c('name'='dest')) -> eastmid_dest_sc
legend_style <- tm_legend(legend.position=c('right','bottom'),
legend.text.size=0.4)
tm_shape(eastmid_dest_sc) + tm_polygons(col='dest_score', title='Attraction') +
legend_styleWe can add fitted values to a data frame using this kind of model as well:
eastmid_ttw %>% add_sim_fitted(sim_o_const) -> eastmid_ttw_fitted_oc
eastmid_ttw_fitted_oc
#> # A tibble: 1,600 x 5
#> From To Count Dist Fitted
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 Amber Valley Amber Valley 23506 0 20550
#> 2 Ashfield Amber Valley 2568 15.6 3007
#> 3 Bassetlaw Amber Valley 96 47.9 120
#> 4 Blaby Amber Valley 20 53.5 26.6
#> 5 Bolsover Amber Valley 2513 24.3 806
#> 6 Boston Amber Valley 3 97.1 0.327
#> 7 Broxtowe Amber Valley 1733 14.9 2236
#> 8 Charnwood Amber Valley 96 39.0 228
#> 9 Chesterfield Amber Valley 820 25.6 911
#> 10 Corby Amber Valley 0 78.6 1.64
#> # ... with 1,590 more rowsand check that although we relaxed the constraint on the totals for each destination agreeing with the observed trips, the dual constraint on the origins still holds:
eastmid_ttw_fitted_oc %>% group_by(From) %>%
summarise(Count=sum(Count),Fitted=sum(Fitted))
#> # A tibble: 40 x 3
#> From Count Fitted
#> <chr> <int> <dbl>
#> 1 Amber Valley 46901 46901
#> 2 Ashfield 45088 45088
#> 3 Bassetlaw 33861 33861
#> 4 Blaby 36611 36611
#> 5 Bolsover 25685 25685
#> 6 Boston 24253 24253
#> 7 Broxtowe 42554 42554
#> 8 Charnwood 62885 62885
#> 9 Chesterfield 35153 35153
#> 10 Corby 24906 24906
#> # ... with 30 more rows… but it no longer does on the destinations.
eastmid_ttw_fitted_oc %>% group_by(To) %>%
summarise(Count=sum(Count),Fitted=sum(Fitted))
#> # A tibble: 40 x 3
#> To Count Fitted
#> <chr> <int> <dbl>
#> 1 Amber Valley 43359 47810
#> 2 Ashfield 43156 43566
#> 3 Bassetlaw 35376 31114
#> 4 Blaby 40739 33583
#> 5 Bolsover 21819 27114
#> 6 Boston 25053 24118
#> 7 Broxtowe 29693 32165
#> 8 Charnwood 52943 56523
#> 9 Chesterfield 39477 39773
#> 10 Corby 26052 25584
#> # ... with 30 more rowsIt is also possible to identify whether the model is over- or under-fitting by computing the difference between the observed Count and the modelled Fitted trips to each destination as a percentage of the count.
eastmid_ttw_fitted_oc %>% group_by(To) %>%
summarise(Count=sum(Count),Fitted=sum(Fitted)) %>%
transmute(To,Error=100*(Count-Fitted)/Count) ->
eastmid_err
eastmid_sf %>% left_join(eastmid_err ,by=c('name'='To')) ->
eastmid_err_sf
tm_shape(eastmid_err_sf) + tm_polygons(col='Error',title='Error (%)') +
legend_style + tm_style_col_blind()There is a notable trend in the error here - in particular the model underpredicts in the south, and overpredicts in the north east of the region (with a notable spatial anomaly at Lincoln - where it underpredicts again).
Source: Office for National Statistics via NOMIS: https://www.nomisweb.co.uk↩