Recap
So, last time we learned all about the unconstrained spatial interaction model; how we can use it to estimate flows in a system using values for origin emissiveness or destination attractiveness; how we can tweak the estimates the model produces through adjusting either the parameters associated with the predictor variables, or though using different predictor variables or updating their values; and how we can improve the fits of the model further by calibrating the parameters through using a Poisson regression model.
We saw that even after calibration, our model still only explained around 60% of the variation in the flows that we observed in our system, so can we do any better? Well yes, yes we can!
Constrained Models
If we return to Alan Wilson’s 1971 paper, he introduces a full family of spatial interaction models of which the unconstrained model is just the start. And indeed since then, there have been all number of incremental advances and alternatives (such as Stewart Fotheringham’s Competing Destinations models, Pooler’s production/attraction/cost relaxed models, Stillwell’s origin/destination parameter specific models and mine and Alan’s own multi-level model (to name just a few).
In this session we will explore the rest of Wilson’s family - the Production (origin) Constrained Model; the Attraction (destination) constrained model; and the Doubly Constrained Model.
We will see how we can, again, use a Poisson regression model in R to calibrate these models and how, once calibrated, we can use the models in different contexts, such as Land Use Transportation Interaction (LUTI) modelling, retail modelling and migration modelling.
1. Production and Attraction Constrained Models
Right, let’s get ourselves set up again…
Open up R Studio and set up a new project working directory. Open a new R script and save it into your new working directory.
Check the file is in your working directory and R can see it…
getwd()
list.files(getwd())
Library some packages (the four at the bottom you probably won’t need):
library(sp)
library(reshape2)
library(geojsonio)
library(rgdal)
library(downloader)
library(maptools)
library(dplyr)
library(broom)
library(stplanr)
library(ggplot2)
library(MASS)
library(sf)
library(tmap)
library(tmaptools)
library(stringr)
And check that you have our little test commuting data matrix from last time - cdatasub
- sitting in your global environment. If you don’t, then go back to the last practical here and follow the steps from the top until you have cdatasub in your global environment. It should look something like this:
cdatasubmat <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "Total", margins=c("Orig", "Dest"))
cdatasubmat
Wilson’s real contribution to the field was in noticing that the unconstrained gravity model was sub-optimal as it did not make use of all of the available data in the system we are studying.
If we recall the estimates from our unconstrained model, none of the estimates summed to the observed in and out-flow totals:
cdatasubmat2
Our estimates did sum to the grand total of flows, but this is because we were really fitting a ‘total constrained’ model which used \(k\) - our constant of proportionality - to ensure everything sort of added up (to within 1 commuter).
Where we have a full flow matrix to calibrate parameters, we can incorporate the row (origin) totals, column (destination) totals or both origina and destination totals to constrain our flow estimates to these known values.
There are various reasons for wanting to do this, for example:
If We are interested in flows of money into businesses or customers into shops, might have information on the amount of disposable income and shopping habits of the people living in different areas from loyalty card data. This is known information about our origins and so we could constrain our spatial interaction model to this known information - we can make the assumption that this level of disposable income remains the same. We can then use other information about the attractiveness of places these people might like to shop in (store size, variety / specialism of goods etc.), to estimate how much money a new store opening in the area might make, or if a new out-of-town shopping centre opens, how much it might affect the business of shops in the town centre. This is what is known in the literature as the ‘retail model’ and is perhaps the most common example of a Production (orign) Constrained Spatial Interaction Model
We might be interested in understanding the impact of a large new employer in an area on the flows of traffic in the vicinity or on the demand for new worker accommodation nearby. A good example of where this might be the case is with large new infrastructure developments like new airports. For example, before the go-ahead for the new third runway at Heathrow was given, one option being considered was a new runway in the Thames Estuary. If a new airport was built here, what would be the potential impact on transport flows in the area and where might workers commute from? This sort of scenario could be tested with an Attraction (destination) Constrained Spatial Interaction Model where the number of new jobs in a destination is known (as well as jobs in the surrounding area) and the model could be used to estimate where the workers will be drawn from (and their likely travel-to-work patterns). This model is exactly the sort of model Land Use Transport Interaction (LUTI) model that was constructed by the Mechanicity Team in CASA - details here if you are interested…
We might be interested in understanding the changing patterns of commuting or migration over time. Data from the Census allows us to know an accurate snap-shot of migrating and commuting patterns every 10 years. In these full data matrices, we know both the numbers of commuters/migrants leaving origins and arriving at destinations as well as the interactions between them. If we constrain our model estimates to this known information at origin and destination, we can examine various things, including:
- the ways that the patterns of commuting/migration differ from the model predictions - where we might get more migrant/commuter flows than we would expect
- how the model parameters vary over time - for example how does distance / cost of travel affect flows over time? Are people prepared to travel further or less far than before?
2. Production-constrained Model
- \[T_{ij} = A_i O_i W_j^\alpha d_{ij}^-\beta \]
where
- \[O_{i} = \sum_j T_{ij}\]
and
- \[ A_i = \frac{1}{\sum_j W_j^\alpha d_{ij}^-\beta}\]
In the production-constrained model, \(O_i\) does not have a parameter as it is a known constraint. \(A_i\) is known as a balancing factor and is a vector of values which relate to each origin, \(i\), which do the equivalent job to \(k\) in the unconstrained/total constrained model but ensure that flow estimates from each origin sum to the know totals, \(O_i\) rather than just the overall total.
Now at this point, we could calculate all of the \(O_i\)s and the \(A_i\)s by hand for our sample system and then set about guessing/estimating the parameter values for the rest of the model, but as you might have already suspected from last time, we can use R and glm
to make it really easy and do all of that for us - woo hoo!
We set about re-specifying the Production-Constrained model as a Poisson regression model in exactly the same way as we did before. We need to take logs of the right-hand side of equation and assume that these are logarithmially linked to the Poisson distributed mean (\(\lambda_{ij}\)) of the \(T_{ij}\) variable. As such, Equation (1) becomes:
- \[\lambda_{ij} = exp(k + \mu_{i} + \alpha \ln W_j - \beta \ln d_{ij})\]
In Equation (4) \(\mu_{i}\) is the equivalent of the vector of balancing factors \(A_i\), but in regression / log-linear modelling terminology can also be described as either dummy variables or fixed effects. In practical terms, what this means is that in our regression model, \(\mu_i\) is modelled as a categorical predictor and therefore in the Poisson regression model, we don’t use the numeric values of \(O_i\), we use a categorical identifier for the origin. In terms of the example table above, for Barking and Dagenham we wouldn’t use 5675 as we would if we were fitting Equation 1, we would just use ‘Barking and Dagenham’. You might have noticed that \(k\) makes an appearance again - this is because now we have a regression equation again, we need an intercept value - this is estimated by fitting the model.
So, let’s give this model a whirl…
#run a production constrained SIM
prodSim <- glm(Total ~ Orig+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(prodSim)
Call:
glm(formula = Total ~ Orig + log(wj2_destsal) + log(dist), family = poisson(link = "log"),
data = cdatasub, na.action = na.exclude)
Deviance Residuals:
Min 1Q Median 3Q Max
-71.314 -17.901 -6.669 9.587 58.776
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.99911 0.11380 70.29 <2e-16 ***
OrigBarnet 0.98222 0.01505 65.24 <2e-16 ***
OrigBexley 0.95836 0.01565 61.25 <2e-16 ***
OrigBrent 0.47875 0.01575 30.39 <2e-16 ***
OrigBromley 1.48752 0.01540 96.59 <2e-16 ***
OrigCamden -1.01944 0.01949 -52.30 <2e-16 ***
OrigCity of London -3.29114 0.05398 -60.97 <2e-16 ***
log(wj2_destsal) 2.04693 0.01015 201.76 <2e-16 ***
log(dist) -2.24218 0.01138 -197.08 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 176846 on 41 degrees of freedom
Residual deviance: 28783 on 33 degrees of freedom
AIC: 29115
Number of Fisher Scoring iterations: 5
So, what do we have?
Well, there are the elements of the model output that should be familiar from the unconstrained model:
the \(k\) parameter, or the intercept for the model - which in this case is 7.99911
the \(\alpha\) parameter related to the destination attractiveness: 2.04693
the \(\beta\) distance decay parameter: -2.24218
We can see from the standard outputs from the model that all of the explanatory variables are statistically significant (***) and the z-scores indicate that the destination salary is having the most influence on the model, with distance following closely behind.
And then we have a series of parameters which are the vector of \(\mu_i\) values associated with our origin constraints. The eagle eyed among you will notice that we are missing a parameter for Barking and Dagenham. No, the model hasn’t just taken a dislike to the home of Ford Motorcar production and right wing populism in the UK, it’s because dummy parameter values are presented in relation to a reference variable and in this case the default variable is the first one in our list - Barking and Dagenham.
The way that we can interpret this is that the City of London is around 3.3 times less likely to be an origin for commuters than the reference category (Barking and Dagenham), which by default is set to 0.
Now of course, it might be that Barking and Dagenham is not the most intuitive contrasting variable to use. Fortunately we can change the The particular way in which the model choses this variable using the contrasts
option. I won’t dwell much on contrasts, but if you would like to learn more, there are lots of good webpages out there including this one and you should consult the R help system - search for ?contrasts
. For now, let’s just swap the contrasting variable from the first to the last Origin and see what happens to the parameter values:
#set the contrasts
options(contrasts=c('contr.SAS','contr.SAS'))
prodSim2 <- glm(Total ~ Orig+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(prodSim2)
Call:
glm(formula = Total ~ Orig + log(wj2_destsal) + log(dist), family = poisson(link = "log"),
data = cdatasub, na.action = na.exclude)
Deviance Residuals:
Min 1Q Median 3Q Max
-71.314 -17.901 -6.669 9.587 58.776
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.70796 0.11931 39.46 <2e-16 ***
OrigBarking and Dagenham 3.29114 0.05398 60.97 <2e-16 ***
OrigBarnet 4.27337 0.05240 81.55 <2e-16 ***
OrigBexley 4.24950 0.05294 80.28 <2e-16 ***
OrigBrent 3.76989 0.05251 71.79 <2e-16 ***
OrigBromley 4.77866 0.05310 90.00 <2e-16 ***
OrigCamden 2.27171 0.05319 42.71 <2e-16 ***
log(wj2_destsal) 2.04693 0.01015 201.76 <2e-16 ***
log(dist) -2.24218 0.01138 -197.08 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 176846 on 41 degrees of freedom
Residual deviance: 28783 on 33 degrees of freedom
AIC: 29115
Number of Fisher Scoring iterations: 5
Notice how the intercept value has changed and the dummy parameters for each origin now relate to how they compare to the City of London. As we could expect for this area (which is a commuting destination rather than an origin) all other origins in our system send between 2 and 4.25 times as many commuters.
The default contrast option is R is to use ‘Treatment’ or dummy coding, so if you would like to switch the contrasts back to treatment, it’s simple to do:
options(contrasts=c('contr.treatment','contr.treatment'))
2.1 Model Estimates
Now at this point you will be wanting to know what affect the constraints have had on the estimates produced by the model, so let’s plug the parameters back into Equation 4 and take a look…
First run the model again using OrigCodeNew instead of Orig… (note that because of the code order, the model runs like the )
prodSim <- glm(Total ~ OrigCodeNew+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
summary(prodSim)
Call:
glm(formula = Total ~ OrigCodeNew + log(wj2_destsal) + log(dist),
family = poisson(link = "log"), data = cdatasub, na.action = na.exclude)
Deviance Residuals:
Min 1Q Median 3Q Max
-71.314 -17.901 -6.669 9.587 58.776
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.70796 0.11931 39.46 <2e-16 ***
OrigCodeNewE09000002 3.29114 0.05398 60.97 <2e-16 ***
OrigCodeNewE09000003 4.27337 0.05240 81.55 <2e-16 ***
OrigCodeNewE09000004 4.24950 0.05294 80.28 <2e-16 ***
OrigCodeNewE09000005 3.76989 0.05251 71.79 <2e-16 ***
OrigCodeNewE09000006 4.77866 0.05310 90.00 <2e-16 ***
OrigCodeNewE09000007 2.27171 0.05319 42.71 <2e-16 ***
log(wj2_destsal) 2.04693 0.01015 201.76 <2e-16 ***
log(dist) -2.24218 0.01138 -197.08 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 176846 on 41 degrees of freedom
Residual deviance: 28783 on 33 degrees of freedom
AIC: 29115
Number of Fisher Scoring iterations: 5
Now Create some \(O_i\) and \(D_j\) columns and store the total in and out flow matrix margins in them. Note, in the syntax below, I use the forward-pipe or %>% operator. This is probably something a bit new and warrents a bit of explaining. If you’d like to know more about how pipes can be used in R, visit the magrittr vignette page here - for now, just think of %>% as a bit like saying “then…” in your code
#create some Oi and Dj columns in the dataframe and store row and column totals in them:
#to create O_i, take cdatasub ...then... group by origcodenew ...then... summarise by calculating the sum of Total
O_i <- cdatasub %>% group_by(OrigCodeNew) %>% summarise(O_i = sum(Total))
cdatasub$O_i <- O_i$O_i[match(cdatasub$OrigCodeNew,O_i$OrigCodeNew)]
D_j <- cdatasub %>% group_by(DestCodeNew) %>% summarise(D_j = sum(Total))
cdatasub$D_j <- D_j$D_j[match(cdatasub$DestCodeNew,D_j$DestCodeNew)]
Now fish the coefficients out of the prodsim glm object
#You can do this in a number of ways, for example this prints everything:
prodSim_out <- tidy(prodSim)
prodSim_out
Or, more usefully, pull out the parameter values for \(\mu_i\) and store them back in the dataframe along with \(O_i\) and \(D_j\)
# or you can just pull out the coefficients and put them into an object
coefs <- as.data.frame(prodSim$coefficients)
#then once you have done this, you can join them back into the dataframe using a regular expression to match the bits of the identifier that you need - *note, this bit of code below took me about 2 hours to figure out!*
cdatasub$mu_i <- coefs$`prodSim$coefficients`[match(cdatasub$OrigCodeNew,sub(".*OrigCodeNew","", rownames(coefs)))]
#now, where we have missing values for our reference mu_i variable, fill those with 1s
cdatasub$mu_i <- ifelse(is.na(cdatasub$mu_i),0,cdatasub$mu_i)
head(cdatasub)
OK, now we can save our parameter values into some variables…
k <- prodSim$coefficients[1]
mu_i <- prodSim$coefficients[2:7]
alpha <- prodSim$coefficients[8]
beta <- prodSim$coefficients[9]
And we’re ready to generate our estimates:
cdatasub$prodsimest1 <- exp(k+(cdatasub$mu_i)+(alpha*log(cdatasub$wj2_destsal))+(beta*log(cdatasub$dist)))
Now of course we could also have done this the easy way, but again, I think it helps doing it the hard way, especially if we want to play with the parameters by hand or adjust any of the input values when running some what-if scenarios.
Here’s the easy way again for those who can’t remember:
cdatasub$prodsimFitted <- fitted(prodSim)
head(cdatasub)
2.2 Assessing the model output
So what do the outputs from our Production Constrained Model look like? How has the goodness-of-fit improved and how can we start to use this a bit like a retail model and assess the likely impacts of changing destination attractiveness etc.?
2.2.1 The flow matrix
#first round the estimates
cdatasub$prodsimFitted <- round(fitted(prodSim),0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat3 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "prodsimFitted", margins=c("Orig", "Dest"))
cdatasubmat3
And compared with the original observed data?
cdatasubmat
Here it is very easy to see the Origin Constraints working. The sum across all destinations for each origin in the estimated matrix is exactly the same as the same sum across the observed matrix - \(\sum_j T_{ij} = \sum_j \lambda_{ij} = O_i\), but clearly, the same is not true when you sum across all origins for each destination - \(\sum_i T_{ij} \neq \sum_i \lambda_{ij} \neq D_j\)
2.2.2 How do the fits compare with the unconstrained model from last time?
#use the functions from the last practical to calculate some goodness-of-fit statistics
CalcRSquared(cdatasub$Total,cdatasub$prodsimFitted)
[1] 0.8175094
CalcRMSE(cdatasub$Total,cdatasub$prodsimFitted)
[1] 1382.781
Clearly by constraining our model estimates to known origin totals, the fit of the model has improved quite considerably - from around 0.67 in the unconstrained model to around 0.82 in this model. The RMSE has also dropped quite noticably.
2.2.3 A ‘what if…’ scenario
Now that we have calibrated our parameters and produced some estimates, we can start to play around with some what-if scenarios.
For example - What if the government invested loads of money into a new Car Plant in Barking and Dagenham and as a result, average wages increased from a mere £16,200 to £25,000. A far fetched scenario, but one that could make a good experiment.
First create create a new variable with these altered salaries:
cdatasub$wj3_destsalScenario <- cdatasub$wj2_destsal
cdatasub$wj3_destsalScenario <- ifelse(cdatasub$wj3_destsalScenario == 16200,25000,cdatasub$wj3_destsalScenario)
Now let’s plug these new values into the model and see how this changes the flows in the system…
cdatasub$prodsimest2 <- exp(k+(cdatasub$mu_i)+(alpha*log(cdatasub$wj3_destsalScenario))+(beta*log(cdatasub$dist)))
cdatasub$prodsimest2 <- round(cdatasub$prodsimest2,0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat4 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "prodsimest2", margins=c("Orig", "Dest"))
cdatasubmat4
You will notice that by increasing the average salary in Barking and Dagenham, we’ve increased flows into Barking and Dagenham, but have not reduced the flows into other zones - the original constraints are still working on the other zones. One way to get around this, now that we have calibrated our parameters, is to return to the multiplicative model in Equation 1 and run this model after calculating our own \(A_i\) balancing factors.
#calculate some new wj^alpha and dij^beta values
wj2_alpha <- cdatasub$wj2_destsal^alpha
dist_beta <- cdatasub$dist^beta
#calculate the first stage of the Ai values
cdatasub$Ai1 <- wj2_alpha*dist_beta
#now do the sum over all js bit
A_i <- cdatasub %>% group_by(OrigCodeNew) %>% summarise(A_i = sum(Ai1))
#now divide in to 1
A_i[,2] <- 1/A_i[,2]
#and write the A_i values back into the data frame
cdatasub$A_i <- A_i$A_i[match(cdatasub$OrigCodeNew,A_i$OrigCodeNew)]
So that is it for calculating your \(A_i\) values. Now you have these, it’s very simple to plug everything back into Equation 1 and generate some estimates…
#To check everything works, recreate the original estimates
cdatasub$prodsimest3 <- cdatasub$A_i*cdatasub$O_i*wj2_alpha*dist_beta
You should see that your new estimates are exactly the same as your first estimates. If they’re not, then something has gone wrong. Now we have this though, we can keep messing around with some new estimates and keep the constraints. Remember, though, that you will need to recalculate \(A_i\) each time you want to create a new set of estimates. Let’s try with our new values for the destination salary in Barking and Dagenham
#calculate some new wj^alpha and dij^beta values
wj3_alpha <- cdatasub$wj3_destsalScenario^alpha
#calculate the first stage of the Ai values
cdatasub$Ai1 <- wj3_alpha*dist_beta
#now do the sum over all js bit
A_i <- cdatasub %>% group_by(OrigCodeNew) %>% summarise(A_i = sum(Ai1))
#now divide in to 1
A_i[,2] <- 1/A_i[,2]
#and write the A_i values back into the data frame
cdatasub$A_i <- A_i$A_i[match(cdatasub$OrigCodeNew,A_i$OrigCodeNew)]
Now we have some new \(A_i\)s, let’s generate some new scenario flow estimates…
#To check everything works, recreate the original estimates
cdatasub$prodsimest4_scenario <- cdatasub$A_i*cdatasub$O_i*wj3_alpha*dist_beta
cdatasub$prodsimest4_scenario <- round(cdatasub$prodsimest4_scenario,0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat5 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "prodsimest4_scenario", margins=c("Orig", "Dest"))
cdatasubmat5
There are a number of things to note here. Firstly, we flows into Barking and Dagenham have virtually doubled, while flows into other Boroughs have reduced.
Secondly, Barking and Dagenham was a poor estimate anyway - it model was very much over estimating flows into this Borough. Increasing the salary into this borough has significantly increased flows, so this indicates that there are probably lots of other things that might be discouraging people from working in this borough.
Thirdly, Our origin constraints are now holding again.
3. Attraction-Constrained Model
The attraction constrained Model is virtually the same as the Production constrained model:
- \[T_{ij} = D_j B_j V_i^\mu d_{ij}^-\beta \]
where
- \[D_{j} = \sum_i T_{ij}\]
and
- \[ B_j = \frac{1}{\sum_i V_i^\mu d_{ij}^-\beta}\]
I won’t dwell on the attraction constrained model, except to say that it can be run in R as you would expect:
- \[\lambda_{ij} = exp(k + \mu \ln V_i + \alpha_{i} - \beta \ln d_{ij})\]
or in R:
attrSim <- glm(Total ~ DestCodeNew+log(vi1_origpop)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
summary(attrSim)
Call:
glm(formula = Total ~ DestCodeNew + log(vi1_origpop) + log(dist),
family = poisson(link = "log"), data = cdatasub, na.action = na.exclude)
Deviance Residuals:
Min 1Q Median 3Q Max
-39.914 -21.614 -5.529 2.778 51.241
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.923776 0.134636 14.29 <2e-16 ***
DestCodeNewE09000002 -3.583256 0.038940 -92.02 <2e-16 ***
DestCodeNewE09000003 -1.224440 0.012171 -100.61 <2e-16 ***
DestCodeNewE09000004 -1.925497 0.018018 -106.87 <2e-16 ***
DestCodeNewE09000005 -1.540332 0.012743 -120.88 <2e-16 ***
DestCodeNewE09000006 -1.096912 0.015221 -72.06 <2e-16 ***
DestCodeNewE09000007 -0.299522 0.007840 -38.20 <2e-16 ***
log(vi1_origpop) 1.557941 0.011372 137.00 <2e-16 ***
log(dist) -1.203269 0.007017 -171.47 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 176846 on 41 degrees of freedom
Residual deviance: 23500 on 33 degrees of freedom
AIC: 23832
Number of Fisher Scoring iterations: 6
we can examine how the constraints hold for desintations this time:
#first round the estimates
cdatasub$attrsimFitted <- round(fitted(attrSim),0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat6 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "attrsimFitted", margins=c("Orig", "Dest"))
cdatasubmat6
compared to…
cdatasubmat
and we can test the goodness-of-fit in exactly the same way as before:
#use the functions from the last practical to calculate some goodness-of-fit statistics
CalcRSquared(cdatasub$Total,cdatasub$attrsimFitted)
[1] 0.8443004
CalcRMSE(cdatasub$Total,cdatasub$attrsimFitted)
[1] 1310.26
OK, that’s where I’ll leave singly constrained models for now. There are, of course, plenty of things you could try out. For example:
You could try mapping the coefficients or the residual values from the model to see if there is any patterning in either the over or under prediction of flows.
You could try running your own version of a LUTI model by first calibrating the model parameters and plugging these into a multiplicative version of the model, adjusting the destination constraints to see which origins are likely to generate more trips.
4. Doubly Constrained Model
Now, the model in the family you have all been waiting for - the big boss, the daddy, the doubly constrained model!
Let’s begin with the formula:
- \[T_{ij} = A_i O_i B_j D_j d_{ij}^-\beta \]
where
\[O_{i} = \sum_j T_{ij}\]
\[D_{j} = \sum_i T_{ij}\]
and
\[A_i = \frac{1}{\sum_j B_j D_j d_{ij}^-\beta}\]
\[B_j = \frac{1}{\sum_i A_i O_i d_{ij}^-\beta}\]
Now, the astute will have noticed that the calculation of \(A_i\) relies on knowing \(B_j\) and the calculation of \(B_j\) relies on knowing \(A_i\). A conundrum!! If I don’t know \(A_i\) how can I calcuate \(B_j\) and then in turn \(A_i\) and then \(B_j\) ad infinitum???!!
Well, I wrestled with that for a while until I came across this paper by Martyn Senior where he sketches out a very useful algorithm for iteratively arriving at values for \(A_i\) and \(B_j\) by setting each to equal 1 initially and then continuing to calculate each in turn until the difference between each value is small enough not to matter.
We will return to this later, but for now, we will once again use the awesome power of R to deal with all of this difficulty for us!
We can run the doubly constrained model in exactly the same way as we ran the singly constrained models:
- \[\lambda_{ij} = exp(k + \mu_i + \alpha_{i} - \beta \ln d_{ij})\]
or in R…
#run a production constrained SIM
doubSim <- glm(Total ~ Orig+Dest+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(doubSim)
Call:
glm(formula = Total ~ Orig + Dest + log(dist), family = poisson(link = "log"),
data = cdatasub, na.action = na.exclude)
Deviance Residuals:
Min 1Q Median 3Q Max
-30.9202 -7.0946 -0.8203 7.7356 17.1904
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 28.15423 0.14329 196.48 <2e-16 ***
OrigBarnet 0.42341 0.01743 24.29 <2e-16 ***
OrigBexley 1.00050 0.01635 61.19 <2e-16 ***
OrigBrent -0.19467 0.01881 -10.35 <2e-16 ***
OrigBromley 1.60132 0.01547 103.53 <2e-16 ***
OrigCamden -1.42027 0.02296 -61.86 <2e-16 ***
OrigCity of London -4.25918 0.05568 -76.50 <2e-16 ***
DestBarnet 2.96200 0.04139 71.56 <2e-16 ***
DestBexley 1.45898 0.04313 33.83 <2e-16 ***
DestBrent 2.39091 0.04129 57.90 <2e-16 ***
DestBromley 2.65156 0.04100 64.67 <2e-16 ***
DestCamden 3.55784 0.03995 89.06 <2e-16 ***
DestCity of London 4.10797 0.03977 103.30 <2e-16 ***
log(dist) -2.50306 0.01457 -171.83 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 176845.8 on 41 degrees of freedom
Residual deviance: 5296.4 on 28 degrees of freedom
AIC: 5638.8
Number of Fisher Scoring iterations: 5
And the various flows and goodness-of-fit statistics?
#then round the estimates
cdatasub$doubsimFitted <- round(fitted(doubSim),0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat7 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "doubsimFitted", margins=c("Orig", "Dest"))
cdatasubmat7
compared to…
cdatasubmat
and we can test the goodness-of-fit in exactly the same way as before:
#use the functions from the last practical to calculate some goodness-of-fit statistics
CalcRSquared(cdatasub$Total,cdatasub$doubsimFitted)
[1] 0.9824559
CalcRMSE(cdatasub$Total,cdatasub$doubsimFitted)
[1] 440.484
So the goodness of fit has shot up and we can clearly see the origin and destination constraints working, and for most sets of flows, the model is now producing some good estimates. However, there are still some errors in the flows, particularly for estimates between Barking and Dagenham and Bexley or Barnet and Camden.
Is there anything more we can do? Yes, of course there is.
4.1 Tweaking our Model
4.1.1 Distance Decay
Now, all of the way through these practicals, we have assumed that the distance decay parameter follows a negative power law. Well, it doesn’t need to.
In Wilson’s original paper, he generalised the distance decay parameter to:
\[f(d_{ij})\]
Where \(f\) represents some function of distance describing the rate at which the flow interactions change as distance increases. Lots of people have written about this, including Tayor (1971) and more recently Robin Lovelace in a transport context, here.
For the inverse power law that we have been using is one possible function of distance, the other common one that is used is the negative exponential function:
\[exp(-\beta d_{ij})\]
We can get a feel for how different distance decay parameters work by plotting some sample data:
xdistance <- seq(1,20,by=1)
InvPower2 <- xdistance^-2
NegExp0.3 <- exp(-0.3*xdistance)
df <- cbind(InvPower2,NegExp0.3)
meltdf <- melt(df)
ggplot(meltdf,aes(Var1,value, colour = Var2)) + geom_line()

Here you can see quite clearly that the inverse power function has a far more rapid distance decay effect than the negative exponential function. In real life, what this means is that if the observed interactions drop off very rapidly with distance, then they might be more likely to follow an inverse power law. This might be the case when looking at trips to the local convenience store by walking, for example. On the other hand, if the effect of distance is less severe (for example migration across the country for a new job) then the negative exponential funtion might be more appropriate.
There is no hard and fast rule as to which function to pick, it will just come down to which fits the data better…
As Tayor Oshan points out in his excellent Primer what this means in our Poisson regression model is that we simply substitute \(-\beta \ln d_{ij}\) for \(-\beta d_{ij}\) in our model:
#run a production constrained SIM
doubSim1 <- glm(Total ~ Orig+Dest+dist, na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(doubSim1)
Call:
glm(formula = Total ~ Orig + Dest + dist, family = poisson(link = "log"),
data = cdatasub, na.action = na.exclude)
Deviance Residuals:
Min 1Q Median 3Q Max
-27.6694 -5.7499 -0.4354 6.4668 14.4928
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.840e+00 4.368e-02 156.607 < 2e-16 ***
OrigBarnet 4.209e-01 1.768e-02 23.800 < 2e-16 ***
OrigBexley 1.025e+00 1.640e-02 62.482 < 2e-16 ***
OrigBrent -1.227e-01 1.876e-02 -6.541 6.13e-11 ***
OrigBromley 1.702e+00 1.554e-02 109.554 < 2e-16 ***
OrigCamden -8.486e-01 2.088e-02 -40.637 < 2e-16 ***
OrigCity of London -3.882e+00 5.502e-02 -70.556 < 2e-16 ***
DestBarnet 3.129e+00 4.162e-02 75.160 < 2e-16 ***
DestBexley 1.341e+00 4.320e-02 31.049 < 2e-16 ***
DestBrent 2.627e+00 4.151e-02 63.292 < 2e-16 ***
DestBromley 2.590e+00 4.100e-02 63.181 < 2e-16 ***
DestCamden 3.714e+00 4.013e-02 92.558 < 2e-16 ***
DestCity of London 4.017e+00 3.974e-02 101.083 < 2e-16 ***
dist -1.746e-04 1.090e-06 -160.241 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 176846 on 41 degrees of freedom
Residual deviance: 3402 on 28 degrees of freedom
AIC: 3744.4
Number of Fisher Scoring iterations: 5
cdatasub$doubsimFitted1 <- round(fitted(doubSim1),0)
CalcRSquared(cdatasub$Total,cdatasub$doubsimFitted1)
[1] 0.9842882
CalcRMSE(cdatasub$Total,cdatasub$doubsimFitted1)
[1] 413.444
So, we can see that using a negative exponential function in our model actually improves the fit and
4.1.2 Bunging some more variables in
Yes, the nice thing about doing all of this in a regression modelling framework is we can just keep adding predictor variables into the mix and seeing whether they have an effect.
You can’t add origin or destination specific predictors into a doubly constrained model like this, however, you could add some interaction predictors. For example, instead of modelling total flows, we could try and model motorbike commuters using information on car and underground commuters:
kitchensinkSIM <- glm(Motobike ~ Orig+Dest+dist+CarDrive+Underground, na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(kitchensinkSIM)
Call:
glm(formula = Motobike ~ Orig + Dest + dist + CarDrive + Underground,
family = poisson(link = "log"), data = cdatasub, na.action = na.exclude)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.3608 -1.4674 -0.0085 0.7353 3.0146
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.563e+00 3.030e-01 8.457 < 2e-16 ***
OrigBarnet 3.215e-01 1.929e-01 1.666 0.09564 .
OrigBexley 1.019e+00 1.319e-01 7.722 1.15e-14 ***
OrigBrent -2.825e-01 1.609e-01 -1.756 0.07909 .
OrigBromley 1.172e+00 1.421e-01 8.247 < 2e-16 ***
OrigCamden -1.418e-01 2.396e-01 -0.592 0.55403
OrigCity of London -2.081e+01 1.398e+03 -0.015 0.98812
DestBarnet 1.584e+00 3.350e-01 4.728 2.27e-06 ***
DestBexley 7.608e-01 2.842e-01 2.677 0.00744 **
DestBrent 1.236e+00 3.274e-01 3.775 0.00016 ***
DestBromley 1.267e+00 3.243e-01 3.907 9.34e-05 ***
DestCamden 3.231e+00 2.664e-01 12.126 < 2e-16 ***
DestCity of London 3.435e+00 2.385e-01 14.404 < 2e-16 ***
dist -1.129e-04 1.457e-05 -7.748 9.31e-15 ***
CarDrive 2.305e-04 7.135e-05 3.231 0.00123 **
Underground -9.759e-05 3.982e-05 -2.451 0.01426 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2890.39 on 41 degrees of freedom
Residual deviance: 104.87 on 26 degrees of freedom
AIC: 273.27
Number of Fisher Scoring iterations: 16
So now it gets quite interesting. Some of the dummy / constraint origins become statistically insignificant when car and tube commuters are added into the mix.
How can we interpret this?
Well, the kind of things that might influence commuting by motorbike patterns (lack of access to public transport, distance from workplace etc.) that also influence travel to work by car, aren’t applicable to centrally-located Camden with lots of public transport links. Camden is just a proxy for these factors (being centrally-located Camden with lots of public transport links), but of-course doesn’t capture the subtle variation in access to public transport and distance that car travel does. Camden’s influence is confounded by these better explanatory variables and becomes insignificant.
The parameter values give an indication of exactly how much of a change in commuting flows by motorcycle you might expect either for beginning or ending in a borough or for a one person change in people travelling by Car or Tube.
If you would like some more useful information on how to interpret the parameters (logged or otherwise) the emerge from a Poisson Regression model, again, Taylor Oshan’s primer is an excellent place to turn.
4.2 From Poisson Regression back to Entropy
As with the earlier models, I have shown you how you can plug the parameter estimates back into Wilson’s entropy maximising multiplicative models in order to generate estimates and tweak things still further.
If you remember from Equations 11 and 12 above, the key to the doubly constrained models is the \(A_i\) and \(B_j\) balancing factors and as they rely on each other, they need to be calculated iteratively. We can do this using Senior’s algorthm also mentioned earlier.
Here is the algorithm in R… (I wrote this code way back in 2012, so it’s a bit ropey, but it works…) To run it, hightlight the whole thing and run it all at once.
After it works, you can spend some time working out what is going on. I find the best solution is to try and run elements of the code individually to see what is happening…
##########################################################################
#This block of code will calculate balancing factors for an entropy
#maximising model (doubly constrained)
#set beta to the appropriate value according to whether exponential or power
if(tail(names(coef(doubSim)),1)=="dist"){
cdatasub$beta <- coef(doubSim)["dist"]
disdecay = 0
} else {
cdatasub$beta <- coef(doubSim)["log(dist)"]
disdecay = 1
}
#Create some new Ai and Bj columns and fill them with starting values
cdatasub$Ai <- 1
cdatasub$Bj <- 1
cdatasub$OldAi <- 10
cdatasub$OldBj <- 10
cdatasub$diff <- abs((cdatasub$OldAi-cdatasub$Ai)/cdatasub$OldAi)
#create convergence and iteration variables and give them initial values
cnvg = 1
its = 0
#This is a while-loop which will calculate Orig and Dest balancing
#factors until the specified convergence criteria is met
while(cnvg > 0.001){
print(paste0("iteration ", its))
its = its + 1 #increment the iteration counter by 1
#First some initial calculations for Ai...
if(disdecay==0){
cdatasub$Ai <- (cdatasub$Bj*cdatasub$D_j*exp(cdatasub$dist*cdatasub$beta))
} else {
cdatasub$Ai <- (cdatasub$Bj*cdatasub$D_j*exp(log(cdatasub$dist)*cdatasub$beta))
}
#aggregate the results by your Origs and store in a new dataframe
AiBF <- aggregate(Ai ~ Orig, data = cdatasub, sum)
#now divide by 1
AiBF$Ai <- 1/AiBF$Ai
#and replace the initial values with the new balancing factors
updates = AiBF[match(cdatasub$Orig,AiBF$Orig),"Ai"]
cdatasub$Ai = ifelse(!is.na(updates), updates, cdatasub$Ai)
#now, if not the first iteration, calculate the difference between the new Ai values and the old Ai values and once done, overwrite the old Ai values with the new ones.
if(its==1){
cdatasub$OldAi <- cdatasub$Ai
} else {
cdatasub$diff <- abs((cdatasub$OldAi-cdatasub$Ai)/cdatasub$OldAi)
cdatasub$OldAi <- cdatasub$Ai
}
#Now some similar calculations for Bj...
if(disdecay==0){
cdatasub$Bj <- (cdatasub$Ai*cdatasub$O_i*exp(cdatasub$dist*cdatasub$beta))
} else {
cdatasub$Bj <- (cdatasub$Ai*cdatasub$O_i*exp(log(cdatasub$dist)*cdatasub$beta))
}
#aggregate the results by your Dests and store in a new dataframe
BjBF <- aggregate(Bj ~ Dest, data = cdatasub, sum)
#now divide by 1
BjBF$Bj <- 1/BjBF$Bj
#and replace the initial values by the balancing factor
updates = BjBF[match(cdatasub$Dest,BjBF$Dest),"Bj"]
cdatasub$Bj = ifelse(!is.na(updates), updates, cdatasub$Bj)
#now, if not the first iteration, calculate the difference between the new Bj values and the old Bj values and once done, overwrite the old Bj values with the new ones.
if(its==1){
cdatasub$OldBj <- cdatasub$Bj
} else {
cdatasub$diff <- abs((cdatasub$OldBj-cdatasub$Bj)/cdatasub$OldBj)
cdatasub$OldBj <- cdatasub$Bj
}
#overwrite the convergence variable with
cnvg = sum(cdatasub$diff)
}
[1] "iteration 0"
[1] "iteration 1"
[1] "iteration 2"
[1] "iteration 3"
[1] "iteration 4"
[1] "iteration 5"
[1] "iteration 6"
[1] "iteration 7"
[1] "iteration 8"
So, we’ve calculated our \(A_i\) and \(B_j\) values using out iterative routine - now we can plug them back into our model, to prove, once again, that the Poisson Model is exactly the same as the multiplicative Entropy Maximising Model…
########################################################################
#Now create some SIM estimates
if(disdecay==0){
cdatasub$SIM_Estimates <- (cdatasub$O_i*cdatasub$Ai*cdatasub$D_j*cdatasub$Bj*exp(cdatasub$dist*cdatasub$beta))
} else{
cdatasub$SIM_Estimates_pow <- (cdatasub$O_i*cdatasub$Ai*cdatasub$D_j*cdatasub$Bj*exp(log(cdatasub$dist)*cdatasub$beta))
}
########################################################################
cdatasub$SIM_Estimates <- round(cdatasub$SIM_Estimates,0)
cdatasubmat8 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "SIM_Estimates", margins=c("Orig", "Dest"))
cdatasubmat8
Are the the same? Yes!! Woo hoo.
5. Conclusions, further notes and ideas for additional activities
Hopefully you have now seen how it is extremely straight-forward to run and calibrate Wilson’s full family of Spatial Interaction Models in R using GLM and Poisson Regression.
5.1 Some Further Notes
Now might be the time to mention that despite everything I’ve shown you, there has been some discussion in the literature as to whether the Poisson Model is actually a misspecification, especially for modelling migration flows. If you have the stomach for it, this paper by Congdon goes into a lot of detail.
The issue is a thing called ‘overdispersion’ which, translated, essentially relates to the model not being able to capture all of the things that could be explaining the flows in the independent variables that are supplied to the model. The details are tedious and only really intelligible to those with a statistics background. If you want a starter, try here, but in practical terms, we can get around this problem by fitting a very similar sort of regression model called the negative binomial regression model.
If you wish, you can read up and experiment with this model - you can fit it in exactly the same way as the glm
model but using a function called glm.nb
which is part of the mass
package. The negative binomial model has an extra parameter in the model for overdispersion. You you do try this, you will almost certainly discover that your results barely change - but hell, you might keep a pedantic reviewer at bay if you submit this to a journal (not that I’m speaking from experience or anything).
5.2 Further Activities
- Testing these models out on the whole of London and for different years
- You’ve been playing around with just a small 7 borough sample, why not try the full London system. +You can also try and download some similar data from the 2011 Census from Wicid - see if using \(Oi\) and \(Dj\) totals and the parameters you calibrated on the 2001 data, whether you can get reasonable estimates of the 2011 flows. +How have the model parameters changed between 2001 and 2011 - what does this mean
- Visualising your flow estimates
- try using the methods in
stplanr
and leaflet
from the last practical to visualise some of your flow estimates or flow residuals…
---
title: 'Dr Ds Idiots Guide to Spatial Interaction Modelling for Dummies - Part 2:
  Constrained Models'
output:
  html_notebook: default
  html_document: default
---

<div style="width:100px; height=100px">
![](C:\\Users\\Adam\\Dropbox\\Lectures\\SpatialInteractionModelling\\casa_480.png)
</div>

###Recap

So, [last time](https://rpubs.com/adam_dennett/257231) we learned all about the unconstrained spatial interaction model; how we can use it to estimate flows in a system using values for origin emissiveness or destination attractiveness; how we can tweak the estimates the model produces through adjusting either the parameters associated with the predictor variables, or though using different predictor variables or updating their values; and how we can improve the fits of the model further by calibrating the parameters through using a Poisson regression model. 

We saw that even after calibration, our model still only explained around 60% of the variation in the flows that we observed in our system, so can we do any better? Well yes, yes we can!

##Constrained Models

If we return to [Alan Wilson's 1971 paper](http://journals.sagepub.com/doi/abs/10.1068/a030001), he introduces a full *family* of spatial interaction models of which the unconstrained model is just the start. And indeed since then, there have been all number of incremental advances and alternatives (such as [Stewart Fotheringham's Competing Destinations models](https://www.researchgate.net/publication/23537117_A_New_Set_of_Spatial-Interaction_Models_The_Theory_of_Competing_Destinations), [Pooler's production/attraction/cost relaxed models](http://journals.sagepub.com/doi/abs/10.1177/030913259401800102), [Stillwell's origin/destination parameter specific models](http://journals.sagepub.com/doi/pdf/10.1068/a101187) and [mine and Alan's own multi-level model](http://journals.sagepub.com/doi/pdf/10.1068/a45398) (to name just a few).

In this session we will explore the rest of Wilson's family - the Production (origin) Constrained Model; the Attraction (destination) constrained model; and the Doubly Constrained Model. 

We will see how we can, again, use a Poisson regression model in R to calibrate these models and how, once calibrated, we can use the models in different contexts, such as Land Use Transportation Interaction (LUTI) modelling, retail modelling and migration modelling. 

##1. Production and Attraction Constrained Models

Right, let's get ourselves set up again...

Open up R Studio and set up a new project working directory. Open a new R script and save it into your new working directory. 

Check the file is in your working directory and R can see it...

```{r, echo=TRUE, message=FALSE, warning=FALSE, eval=FALSE}
getwd()
list.files(getwd())
```

Library some packages (the four at the bottom you probably won't need):

```{r, echo=TRUE, message=FALSE, warning=FALSE}
library(sp)
library(reshape2)
library(geojsonio)
library(rgdal)
library(downloader)
library(maptools)
library(dplyr)
library(broom) 
library(stplanr)
library(ggplot2)
library(MASS)
library(sf)
library(tmap)
library(tmaptools)
library(stringr)
```

And check that you have our little test commuting data matrix from last time - `cdatasub` - sitting in your global environment. If you don't, then [go back to the last practical here](https://rpubs.com/adam_dennett/255314) and follow the steps from the top until you have cdatasub in your global environment. It should look something like this:

```{r, include=FALSE}
#Fetch a GeoJson of some district-level boundaries from the ONS Geoportal. First add the URL to an object
ug <- "http://geoportal.statistics.gov.uk/datasets/8edafbe3276d4b56aec60991cbddda50_2.geojson"

#now fetch the data using the downloader package - create a temporary directory to put the data into
dir.create("./tmp")
downloader::download(url = ug, destfile = "./tmp/EW.GeoJSON")
#unpack the file into a new spatialPolygons dataframe
EW <- readOGR(dsn = "./tmp/EW.GeoJSON", layer = "OGRGeoJSON")

London <- EW[grep("^E09",EW@data$lad15cd),]

BNG = "+init=epsg:27700"
LondonBNG <- spTransform(London, BNG)
#now, order by borough code - *this step will be imporant later on*
LondonBNG <- LondonBNG[order(LondonBNG$lad15cd),]
#now use spDists to generate a big distance matrix of all distances between boroughs in London
dist <- spDists(LondonBNG)
#melt this matrix into a list of origin/destination pairs using melt. Melt in in the reshape2 package. Reshape2, dplyr and ggplot, together, are some of the best packages in R, so if you are not familiar with them, get googling and your life will be much better!
distPair <- melt(dist)

#read in your London Commuting Data
cdata <- read.csv("https://www.dropbox.com/s/7c1fi1txbvhdqby/LondonCommuting2001.csv?raw=1")
#read in a lookup table for translating between old borough codes and new borough codes
CodeLookup <- read.csv("https://www.dropbox.com/s/h8mpvnepdkwa1ac/CodeLookup.csv?raw=1")
#read in some population and income data
popincome <- read.csv("https://www.dropbox.com/s/84z22a4wo3x2p86/popincome.csv?raw=1")

#now merge these supplimentary data into your flow data dataframe
cdata$OrigCodeNew <- CodeLookup$NewCode[match(cdata$OrigCode, CodeLookup$OldCode)]
cdata$DestCodeNew <- CodeLookup$NewCode[match(cdata$DestCode, CodeLookup$OldCode)]
cdata$vi1_origpop <- popincome$pop[match(cdata$OrigCodeNew, popincome$code)]
cdata$vi2_origsal <- popincome$med_income[match(cdata$OrigCodeNew, popincome$code)]
cdata$wj1_destpop <- popincome$pop[match(cdata$DestCodeNew, popincome$code)]
cdata$wj2_destsal <- popincome$med_income[match(cdata$DestCodeNew, popincome$code)]

#Data needs to be ordered by borough code, if it's not, we will run into problems when we try to merge our distance data back in later, so to make sure, we can arrange by orign and then destination using dplyr's 'arrange' function

cdata <- arrange(cdata, OrigCodeNew, DestCodeNew)
cdata$TotalNoIntra <- ifelse(cdata$OrigCode == cdata$DestCode,0,cdata$Total)
cdata$offset <- ifelse(cdata$OrigCode == cdata$DestCode,0.0000000001,1)
# now add the distance column into the dataframe
cdata$dist <- distPair$value
#We'll just use the first 7 boroughs by code, so first, create a vector of these 7 to match with our data
toMatch<-c("00AA", "00AB", "00AC", "00AD", "00AE", "00AF", "00AG")
#subset the data by the 7 sample boroughs
#first the origins
cdatasub <- cdata[grep(paste(toMatch,collapse = "|"), cdata$OrigCode),]
#then the destinations
cdatasub <- cdatasub[grep(paste(toMatch,collapse = "|"), cdata$DestCode),]
#now chop out the intra-borough flows
cdatasub <- cdatasub[cdatasub$OrigCode!=cdatasub$DestCode,]
#now unfortunately if you look at the file, for some reason the grep process has left a lot of empty data cells in the dataframe, so let's just chop out everything after the 7*7 - 7 (42) pairs we are interested in...
cdatasub <- cdatasub[1:42,]
#now re-order so that OrigCodeNew, DestCodeNew and TotalNoIntra are the first three columns *note that you have to be explicit about the select function in the dplyr package as MASS also has a 'select' function and R will try and use this by default. We can be explict by using the syntax package::function
cdatasub <- dplyr::select(cdatasub, OrigCodeNew, DestCodeNew, Total, everything())

uncosim <- glm(Total ~ log(vi1_origpop)+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)

cdatasub$fitted <- fitted(uncosim)

cdatasub$fitted <- round(cdatasub$fitted,0)
sum(cdatasub$fitted)
#turn it into a little matrix and have a look at your handy work
cdatasubmat2 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "fitted", margins=c("Orig", "Dest"))
cdatasubmat2

CalcRSquared <- function(observed,estimated){
  r <- cor(observed,estimated)
  R2 <- r^2
  R2
}

CalcRMSE <- function(observed,estimated){
  res <- (observed - estimated)^2
  RMSE <- round(sqrt(mean(res)),3)
  RMSE
}
```


```{r, echo=TRUE, message=FALSE, warning=FALSE}
cdatasubmat <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "Total", margins=c("Orig", "Dest"))
cdatasubmat
```


Wilson's real contribution to the field was in noticing that the unconstrained gravity model was sub-optimal as it did not make use of all of the available data in the system we are studying. 

If we recall the estimates from our unconstrained model, none of the estimates summed to the observed in and out-flow totals:

```{r}
cdatasubmat2
```



Our estimates did sum to the grand total of flows, but this is because we were really fitting a 'total constrained' model which used $k$ - our constant of proportionality - to ensure everything sort of added up (to within 1 commuter). 

Where we have a full flow matrix to calibrate parameters, we can incorporate the row (origin) totals, column (destination) totals or both origina and destination totals to *constrain* our flow estimates to these known values. 

There are various reasons for wanting to do this, for example:

a) If We are interested in flows of money into businesses or customers into shops, might have information on the amount of disposable income and shopping habits of the people living in different areas from loyalty card data. This is known information about our origins and so we could *constrain*  our spatial interaction model to this known information - we can make the assumption that this level of disposable income remains the same. We can then use other information about the attractiveness of places these people might like to shop in (store size, variety / specialism of goods etc.), to estimate how much money a new store opening in the area might make, or if a new out-of-town shopping centre opens, how much it might affect the business of shops in the town centre. This is what is known in the literature as the 'retail model' and is perhaps the most common example of a **Production (orign) Constrained Spatial Interaction Model**

b) We might be interested in understanding the impact of a large new employer in an area on the flows of traffic in the vicinity or on the demand for new worker accommodation nearby. A good example of where this might be the case is with large new infrastructure developments like new airports. For example, before the go-ahead for the new third runway at Heathrow was given, one option being considered was a new runway in the Thames Estuary. If a new airport was built here, what would be the potential impact on transport flows in the area and where might workers commute from? This sort of scenario could be tested with an **Attraction (destination) Constrained Spatial Interaction Model** where the number of new jobs in a destination is known (as well as jobs in the surrounding area) and the model could be used to estimate where the workers will be drawn from (and their likely travel-to-work patterns). This model is exactly the sort of model Land Use Transport Interaction (LUTI) model that was constructed by the Mechanicity Team in CASA - details [here](http://www.mechanicity.info/research/land-use-transport-interaction-modelling/#transport) if you are interested...

c) We might be interested in understanding the changing patterns of commuting or migration over time. Data from the Census allows us to know an accurate snap-shot of migrating and commuting patterns every 10 years. In these full data matrices, we know both the numbers of commuters/migrants leaving origins and arriving at destinations as well as the interactions between them. If we constrain our model estimates to this known information at origin and destination, we can examine various things, including:

     i) the ways that the patterns of commuting/migration differ from the model predictions - where we might get more migrant/commuter flows than we would expect
     ii) how the model parameters vary over time - for example how does distance / cost of travel affect flows over time? Are people prepared to travel further or less far than before?


##2. Production-constrained Model

(1) $$T_{ij} = A_i O_i W_j^\alpha d_{ij}^-\beta $$

where

(2) $$O_{i} = \sum_j T_{ij}$$

and 

(3) $$ A_i = \frac{1}{\sum_j W_j^\alpha d_{ij}^-\beta}$$

In the production-constrained model, $O_i$ does not have a parameter as it is a known constraint. $A_i$ is known as a *balancing factor* and is a vector of values which relate to each origin, $i$, which do the equivalent job to $k$ in the unconstrained/total constrained model but ensure that flow estimates from each origin sum to the know totals, $O_i$ rather than just the overall total.

Now at this point, we could calculate all of the $O_i$s and the $A_i$s by hand for our sample system and then set about guessing/estimating the parameter values for the rest of the model, but as you might have already suspected from last time, we can use R and `glm` to make it really easy and do all of that for us - woo hoo!

We set about re-specifying the Production-Constrained model as a Poisson regression model in exactly the same way as we did before. We need to take logs of the right-hand side of equation and assume that these are logarithmially linked to the Poisson distributed mean ($\lambda_{ij}$) of the $T_{ij}$ variable. As such, Equation (1) becomes:

(4) $$\lambda_{ij} = exp(k + \mu_{i} + \alpha \ln W_j - \beta \ln d_{ij})$$

In Equation (4) $\mu_{i}$ is the equivalent of the vector of balancing factors $A_i$, but in regression / log-linear modelling terminology can also be described as either **dummy variables** or **fixed effects**. In practical terms, what this means is that in our regression model, $\mu_i$ is modelled as a [categorical predictor](https://en.wikipedia.org/wiki/Categorical_variable) and therefore in the Poisson regression model, we don't use the numeric values of $O_i$, we use a categorical identifier for the origin. In terms of the example table above, for Barking and Dagenham we wouldn't use 5675 as we would if we were fitting Equation 1, we would just use 'Barking and Dagenham'. You might have noticed that $k$ makes an appearance again - this is because now we have a regression equation again, we need an intercept value - this is estimated by fitting the model. 

So, let's give this model a whirl...

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#run a production constrained SIM
prodSim <- glm(Total ~ Orig+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(prodSim)
```

So, what do we have?

Well, there are the elements of the model output that should be familiar from the unconstrained model:

the $k$ parameter, or the intercept for the model - which in this case is 7.99911

the $\alpha$ parameter related to the destination attractiveness: 2.04693

the $\beta$ distance decay parameter: -2.24218

We can see from the standard outputs from the model that all of the explanatory variables are statistically significant (***) and the z-scores indicate that the destination salary is having the most influence on the model, with distance following closely behind. 

And then we have a series of parameters which are the vector of $\mu_i$ values associated with our origin constraints. The eagle eyed among you will notice that we are missing a parameter for Barking and Dagenham. No, the model hasn't just taken a dislike to the home of Ford Motorcar production and right wing populism in the UK, it's because dummy parameter values are presented in relation to a reference variable and in this case the default variable is the first one in our list - Barking and Dagenham.

The way that we can interpret this is that the City of London is around 3.3 times less likely to be an origin for commuters than the reference category (Barking and Dagenham), which by default is set to 0.

Now of course, it might be that Barking and Dagenham is not the most intuitive contrasting variable to use. Fortunately we can change the The particular way in which the model choses this variable using the `contrasts` option. I won't dwell much on contrasts, but if you would like to learn more, there are lots of good webpages out there including [this one](http://statistics.ats.ucla.edu/stat/r/modules/dummy_vars.htm) and you should consult the R help system - search for `?contrasts`. For now, let's just swap the contrasting variable from the first to the last Origin and see what happens to the parameter values:


```{r, echo=TRUE, message=FALSE, warning=FALSE}
#set the contrasts
options(contrasts=c('contr.SAS','contr.SAS'))

prodSim2 <- glm(Total ~ Orig+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(prodSim2)

```


Notice how the intercept value has changed and the dummy parameters for each origin now relate to how they compare to the City of London. As we could expect for this area (which is a commuting destination rather than an origin) all other origins in our system send between 2 and 4.25 times as many commuters. 

The default contrast option is R is to use 'Treatment' or dummy coding, so if you would like to switch the contrasts back to treatment, it's simple to do:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
options(contrasts=c('contr.treatment','contr.treatment'))
```

###2.1 Model Estimates

Now at this point you will be wanting to know what affect the constraints have had on the estimates produced by the model, so let's plug the parameters back into Equation 4 and take a look...

First run the model again using OrigCodeNew instead of Orig... (note that because of the code order, the model runs like the )
```{r, echo=TRUE, message=FALSE,warning=FALSE}
prodSim <- glm(Total ~ OrigCodeNew+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
summary(prodSim)
```

Now Create some $O_i$ and $D_j$ columns and store the total in and out flow matrix margins in them. *Note, in the syntax below, I use the forward-pipe or %>% operator. This is probably something a bit new and warrents a bit of explaining. If you'd like to know more about how pipes can be used in R, visit the [magrittr vignette page here](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html) - for now, just think of %>% as a bit like saying "then..." in your code*

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#create some Oi and Dj columns in the dataframe and store row and column totals in them:
#to create O_i, take cdatasub ...then... group by origcodenew ...then... summarise by calculating the sum of Total
O_i <- cdatasub %>% group_by(OrigCodeNew) %>% summarise(O_i = sum(Total))
cdatasub$O_i <- O_i$O_i[match(cdatasub$OrigCodeNew,O_i$OrigCodeNew)]
D_j <- cdatasub %>% group_by(DestCodeNew) %>% summarise(D_j = sum(Total))
cdatasub$D_j <- D_j$D_j[match(cdatasub$DestCodeNew,D_j$DestCodeNew)]
```

Now fish the coefficients out of the prodsim glm object 

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#You can do this in a number of ways, for example this prints everything:
prodSim_out <- tidy(prodSim)
prodSim_out
```

Or, more usefully, pull out the parameter values for $\mu_i$ and store them back in the dataframe along with $O_i$ and $D_j$

```{r, echo=TRUE, message=FALSE, warning=FALSE}
# or you can just pull out the coefficients and put them into an object
coefs <- as.data.frame(prodSim$coefficients)
#then once you have done this, you can join them back into the dataframe using a regular expression to match the bits of the identifier that you need - *note, this bit of code below took me about 2 hours to figure out!*
cdatasub$mu_i <- coefs$`prodSim$coefficients`[match(cdatasub$OrigCodeNew,sub(".*OrigCodeNew","", rownames(coefs)))]
#now, where we have missing values for our reference mu_i variable, fill those with 1s
cdatasub$mu_i <- ifelse(is.na(cdatasub$mu_i),0,cdatasub$mu_i)
head(cdatasub)
```

OK, now we can save our parameter values into some variables...

```{r, echo=TRUE, message=FALSE, warning=FALSE}
k <- prodSim$coefficients[1]
mu_i <- prodSim$coefficients[2:7]
alpha <- prodSim$coefficients[8]
beta <- prodSim$coefficients[9]
```

And we're ready to generate our estimates:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
cdatasub$prodsimest1 <- exp(k+(cdatasub$mu_i)+(alpha*log(cdatasub$wj2_destsal))+(beta*log(cdatasub$dist)))
```

Now of course we could also have done this the easy way, but again, I think it helps doing it the hard way, especially if we want to play with the parameters by hand or adjust any of the input values when running some what-if scenarios. 

Here's the easy way again for those who can't remember:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
cdatasub$prodsimFitted <- fitted(prodSim)

head(cdatasub)
```

###2.2 Assessing the model output

So what do the outputs from our Production Constrained Model look like? How has the goodness-of-fit improved and how can we start to use this a bit like a retail model and assess the likely impacts of changing destination attractiveness etc.?

####2.2.1 The flow matrix


```{r, echo=TRUE, message=FALSE, warning=FALSE}
#first round the estimates
cdatasub$prodsimFitted <- round(fitted(prodSim),0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat3 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "prodsimFitted", margins=c("Orig", "Dest"))
cdatasubmat3
```

And compared with the original observed data?

```{r}
cdatasubmat
```

Here it is very easy to see the Origin Constraints working. The sum across all destinations for each origin in the estimated matrix is exactly the same as the same sum across the observed matrix - $\sum_j T_{ij} = \sum_j \lambda_{ij} = O_i$, but clearly, the same is not true when you sum across all origins for each destination - $\sum_i T_{ij} \neq \sum_i \lambda_{ij} \neq D_j$

####2.2.2 How do the fits compare with the unconstrained model from last time?

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#use the functions from the last practical to calculate some goodness-of-fit statistics
CalcRSquared(cdatasub$Total,cdatasub$prodsimFitted)
```
```{r, echo=TRUE, message=FALSE, warning=FALSE}
CalcRMSE(cdatasub$Total,cdatasub$prodsimFitted)
```

Clearly by constraining our model estimates to known origin totals, the fit of the model has improved quite considerably - from around 0.67 in the unconstrained model to around 0.82 in this model. The RMSE has also dropped quite noticably. 

####2.2.3 A 'what if...' scenario

Now that we have calibrated our parameters and produced some estimates, we can start to play around with some what-if scenarios.

For example - What if the government invested loads of money into a new Car Plant in Barking and Dagenham and as a result, average wages increased from a mere £16,200 to £25,000. A far fetched scenario, but one that could make a good experiment. 

First create create a new variable with these altered salaries:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
cdatasub$wj3_destsalScenario <- cdatasub$wj2_destsal
cdatasub$wj3_destsalScenario <- ifelse(cdatasub$wj3_destsalScenario == 16200,25000,cdatasub$wj3_destsalScenario)
```

Now let's plug these new values into the model and see how this changes the flows in the system...


```{r, echo=TRUE, message=FALSE, warning=FALSE}
cdatasub$prodsimest2 <- exp(k+(cdatasub$mu_i)+(alpha*log(cdatasub$wj3_destsalScenario))+(beta*log(cdatasub$dist)))

cdatasub$prodsimest2 <- round(cdatasub$prodsimest2,0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat4 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "prodsimest2", margins=c("Orig", "Dest"))
cdatasubmat4
```

You will notice that by increasing the average salary in Barking and Dagenham, we've increased flows into Barking and Dagenham, but have not reduced the flows into other zones - the original constraints are still working on the other zones. One way to get around this, now that we have calibrated our parameters, is to return to the multiplicative model in Equation 1 and run this model after calculating our own $A_i$ balancing factors. 

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#calculate some new wj^alpha and dij^beta values
wj2_alpha <- cdatasub$wj2_destsal^alpha
dist_beta <- cdatasub$dist^beta
#calculate the first stage of the Ai values
cdatasub$Ai1 <- wj2_alpha*dist_beta
#now do the sum over all js bit
A_i <- cdatasub %>% group_by(OrigCodeNew) %>% summarise(A_i = sum(Ai1))
#now divide in to 1
A_i[,2] <- 1/A_i[,2]
#and write the A_i values back into the data frame
cdatasub$A_i <- A_i$A_i[match(cdatasub$OrigCodeNew,A_i$OrigCodeNew)]
```

So that is it for calculating your $A_i$ values. Now you have these, it's very simple to plug everything back into Equation 1 and generate some estimates...

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#To check everything works, recreate the original estimates
cdatasub$prodsimest3 <- cdatasub$A_i*cdatasub$O_i*wj2_alpha*dist_beta
```

You should see that your new estimates are exactly the same as your first estimates. If they're not, then something has gone wrong. Now we have this though, we can keep messing around with some new estimates and keep the constraints. 
Remember, though, that you will need to recalculate $A_i$ each time you want to create a new set of estimates. Let's try with our new values for the destination salary in Barking and Dagenham

```{r}
#calculate some new wj^alpha and dij^beta values
wj3_alpha <- cdatasub$wj3_destsalScenario^alpha
#calculate the first stage of the Ai values
cdatasub$Ai1 <- wj3_alpha*dist_beta
#now do the sum over all js bit
A_i <- cdatasub %>% group_by(OrigCodeNew) %>% summarise(A_i = sum(Ai1))
#now divide in to 1
A_i[,2] <- 1/A_i[,2]
#and write the A_i values back into the data frame
cdatasub$A_i <- A_i$A_i[match(cdatasub$OrigCodeNew,A_i$OrigCodeNew)]
```

Now we have some new $A_i$s, let's generate some new scenario flow estimates...

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#To check everything works, recreate the original estimates
cdatasub$prodsimest4_scenario <- cdatasub$A_i*cdatasub$O_i*wj3_alpha*dist_beta
```

```{r}
cdatasub$prodsimest4_scenario <- round(cdatasub$prodsimest4_scenario,0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat5 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "prodsimest4_scenario", margins=c("Orig", "Dest"))
cdatasubmat5
```


There are a number of things to note here. Firstly, we flows into Barking and Dagenham have virtually doubled, while flows into other Boroughs have reduced. 

Secondly, Barking and Dagenham was a poor estimate anyway - it model was very much over estimating flows into this Borough. Increasing the salary into this borough has significantly increased flows, so this indicates that there are probably lots of other things that might be discouraging people from working in this borough. 

Thirdly, Our origin constraints are now holding again. 

##3. Attraction-Constrained Model

The attraction constrained Model is virtually the same as the Production constrained model:

(5) $$T_{ij} = D_j B_j V_i^\mu d_{ij}^-\beta $$

where

(6) $$D_{j} = \sum_i T_{ij}$$

and 

(7) $$ B_j = \frac{1}{\sum_i V_i^\mu d_{ij}^-\beta}$$


I won't dwell on the attraction constrained model, except to say that it can be run in R as you would expect:

(8) $$\lambda_{ij} = exp(k + \mu \ln V_i + \alpha_{i} - \beta \ln d_{ij})$$

or in R:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
attrSim <- glm(Total ~ DestCodeNew+log(vi1_origpop)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
summary(attrSim)
```

we can examine how the constraints hold for desintations this time:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#first round the estimates
cdatasub$attrsimFitted <- round(fitted(attrSim),0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat6 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "attrsimFitted", margins=c("Orig", "Dest"))
cdatasubmat6
```

compared to...

```{r}
cdatasubmat
```


and we can test the goodness-of-fit in exactly the same way as before:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#use the functions from the last practical to calculate some goodness-of-fit statistics
CalcRSquared(cdatasub$Total,cdatasub$attrsimFitted)
```

```{r, echo=TRUE, message=FALSE, warning=FALSE}
CalcRMSE(cdatasub$Total,cdatasub$attrsimFitted)

```

OK, that's where I'll leave singly constrained models for now. There are, of course, plenty of things you could try out. For example:

1. You could try mapping the coefficients or the residual values from the model to see if there is any patterning in either the over or under prediction of flows.

2. You could try running your own version of a LUTI model by first calibrating the model parameters and plugging these into a multiplicative version of the model, adjusting the destination constraints to see which origins are likely to generate more trips.

##4. Doubly Constrained Model

Now, the model in the family you have all been waiting for - the big boss, the daddy, the **doubly constrained model!**

Let's begin with the formula:

(9) $$T_{ij} = A_i O_i B_j D_j d_{ij}^-\beta $$

where

(10) $$O_{i} = \sum_j T_{ij}$$

(11) $$D_{j} = \sum_i T_{ij}$$

and 

(12) $$A_i = \frac{1}{\sum_j B_j D_j d_{ij}^-\beta}$$

(13) $$B_j = \frac{1}{\sum_i A_i O_i d_{ij}^-\beta}$$


Now, the astute will have noticed that the calculation of $A_i$ relies on knowing $B_j$ and the calculation of $B_j$ relies on knowing $A_i$. A conundrum!! If I don't know $A_i$ how can I calcuate $B_j$ and then in turn $A_i$ and then $B_j$ *ad infinitum*???!!

Well, I wrestled with that for a while until I came across [this paper by Martyn Senior](http://journals.sagepub.com/doi/abs/10.1177/030913257900300218) where he sketches out a very useful algorithm for iteratively arriving at values for $A_i$ and $B_j$ by setting each to equal 1 initially and then continuing to calculate each in turn until the difference between each value is small enough not to matter. 

We will return to this later, but for now, we will once again use the awesome power of R to deal with all of this difficulty for us!

We can run the doubly constrained model in exactly the same way as we ran the singly constrained models:

(14) $$\lambda_{ij} = exp(k + \mu_i + \alpha_{i} - \beta \ln d_{ij})$$

or in R...

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#run a production constrained SIM
doubSim <- glm(Total ~ Orig+Dest+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(doubSim)
```

And the various flows and goodness-of-fit statistics?

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#then round the estimates
cdatasub$doubsimFitted <- round(fitted(doubSim),0)
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat7 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "doubsimFitted", margins=c("Orig", "Dest"))
cdatasubmat7
```

compared to...

```{r}
cdatasubmat
```


and we can test the goodness-of-fit in exactly the same way as before:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#use the functions from the last practical to calculate some goodness-of-fit statistics
CalcRSquared(cdatasub$Total,cdatasub$doubsimFitted)
```

```{r, echo=TRUE, message=FALSE, warning=FALSE}
CalcRMSE(cdatasub$Total,cdatasub$doubsimFitted)
```

So the goodness of fit has shot up and we can clearly see the origin and destination constraints working, and for most sets of flows, the model is now producing some good estimates. However, there are still some errors in the flows, particularly for estimates between Barking and Dagenham and Bexley or Barnet and Camden. 

Is there anything more we can do? Yes, of course there is. 

###4.1 Tweaking our Model

####4.1.1 Distance Decay

Now, all of the way through these practicals, we have assumed that the distance decay parameter follows a negative power law. Well, it doesn't need to. 

In [Wilson's original paper](http://journals.sagepub.com/doi/abs/10.1068/a030001), he generalised the distance decay parameter to: 

$$f(d_{ij})$$ 

Where $f$ represents some function of distance describing the rate at which the flow interactions change as distance increases. Lots of people have written about this, including [Tayor (1971)](http://onlinelibrary.wiley.com/doi/10.1111/j.1538-4632.1971.tb00364.x/full) and more recently Robin Lovelace in a transport context, [here](https://www.slideshare.net/ITSLeeds/estimating-distance-decay-for-the-national-propensity-to-cycle-tool).

For the inverse power law that we have been using is one possible function of distance, the other common one that is used is the negative exponential function:

$$exp(-\beta d_{ij})$$

We can get a feel for how different distance decay parameters work by plotting some sample data:

```{r}
xdistance <- seq(1,20,by=1)
InvPower2 <- xdistance^-2
NegExp0.3 <- exp(-0.3*xdistance)

df <- cbind(InvPower2,NegExp0.3)
meltdf <- melt(df)
ggplot(meltdf,aes(Var1,value, colour = Var2)) + geom_line()

```

Here you can see quite clearly that the inverse power function has a far more rapid distance decay effect than the negative exponential function. In real life, what this means is that if the observed interactions drop off very rapidly with distance, then they might be more likely to follow an inverse power law. This might be the case when looking at trips to the local convenience store by walking, for example. On the other hand, if the effect of distance is less severe (for example migration across the country for a new job) then the negative exponential funtion might be more appropriate. 

There is no hard and fast rule as to which function to pick, it will just come down to which fits the data better...

As [Tayor Oshan points out in his excellent Primer](http://openjournals.wu.ac.at/region/paper_175/175.html) what this means in our Poisson regression model is that we simply substitute $-\beta \ln d_{ij}$ for $-\beta d_{ij}$ in our model:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
#run a production constrained SIM
doubSim1 <- glm(Total ~ Orig+Dest+dist, na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(doubSim1)
```

```{r}
cdatasub$doubsimFitted1 <- round(fitted(doubSim1),0)
```


```{r, echo=TRUE, message=FALSE, warning=FALSE}
CalcRSquared(cdatasub$Total,cdatasub$doubsimFitted1)
CalcRMSE(cdatasub$Total,cdatasub$doubsimFitted1)
```

So, we can see that using a negative exponential function in our model actually improves the fit and 

####4.1.2 Bunging some more variables in

Yes, the nice thing about doing all of this in a regression modelling framework is we can just keep adding predictor variables into the mix and seeing whether they have an effect. 

You can't add origin or destination specific predictors into a doubly constrained model like this, however, you could add some interaction predictors. For example, instead of modelling total flows, we could try and model motorbike commuters using information on car and underground commuters:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
kitchensinkSIM <- glm(Motobike ~ Orig+Dest+dist+CarDrive+Underground, na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
#let's have a look at it's summary...
summary(kitchensinkSIM)
```

So now it gets quite interesting. Some of the dummy / constraint origins become statistically insignificant when car and tube commuters are added into the mix. 

How can we interpret this?

Well, the kind of things that might influence commuting by motorbike patterns (lack of access to public transport, distance from workplace etc.) that also influence travel to work by car, aren't applicable to centrally-located Camden with lots of public transport links. Camden is just a proxy for these factors (being centrally-located Camden with lots of public transport links), but of-course doesn't capture the subtle variation in access to public transport and distance that car travel does. Camden's influence is [confounded](https://en.wikipedia.org/wiki/Confounding) by these better explanatory variables and becomes insignificant. 

The parameter values give an indication of exactly how much of a change in commuting flows by motorcycle you might expect either for beginning or ending in a borough or for a one person change in people travelling by Car or Tube. 

If you would like some more useful information on how to interpret the parameters (logged or otherwise) the emerge from a Poisson Regression model, again, [Taylor Oshan's primer](http://openjournals.wu.ac.at/region/paper_175/175.html) is an excellent place to turn. 

###4.2 From Poisson Regression back to Entropy

As with the earlier models, I have shown you how you can plug the parameter estimates back into Wilson's entropy maximising multiplicative models in order to generate estimates and tweak things still further. 

If you remember from Equations 11 and 12 above, the key to the doubly constrained models is the $A_i$ and $B_j$ balancing factors and as they rely on each other, they need to be calculated iteratively. We can do this using [Senior's algorthm ](http://journals.sagepub.com/doi/abs/10.1177/030913257900300218) also mentioned earlier. 

Here is the algorithm in R... (I wrote this code way back in 2012, so it's a bit ropey, but it works...) To run it, hightlight the whole thing and run it all at once. 

After it works, you can spend some time working out what is going on. I find the best solution is to try and run elements of the code individually to see what is happening...


```{r, echo=TRUE, message=FALSE, warning=FALSE}
##########################################################################
#This block of code will calculate balancing factors for an entropy
#maximising model (doubly constrained)

#set beta to the appropriate value according to whether exponential or power
if(tail(names(coef(doubSim)),1)=="dist"){
  cdatasub$beta <- coef(doubSim)["dist"]
  disdecay = 0
} else {
  cdatasub$beta <- coef(doubSim)["log(dist)"]
  disdecay = 1
}

#Create some new Ai and Bj columns and fill them with starting values
cdatasub$Ai <- 1
cdatasub$Bj <- 1
cdatasub$OldAi <- 10
cdatasub$OldBj <- 10
cdatasub$diff <- abs((cdatasub$OldAi-cdatasub$Ai)/cdatasub$OldAi)

#create convergence and iteration variables and give them initial values
cnvg = 1
its = 0

#This is a while-loop which will calculate Orig and Dest balancing
#factors until the specified convergence criteria is met
while(cnvg > 0.001){
  print(paste0("iteration ", its))
  its = its + 1 #increment the iteration counter by 1
  #First some initial calculations for Ai...
  if(disdecay==0){
    cdatasub$Ai <- (cdatasub$Bj*cdatasub$D_j*exp(cdatasub$dist*cdatasub$beta))
  } else {
    cdatasub$Ai <- (cdatasub$Bj*cdatasub$D_j*exp(log(cdatasub$dist)*cdatasub$beta))
  }  
  #aggregate the results by your Origs and store in a new dataframe
  AiBF <- aggregate(Ai ~ Orig, data = cdatasub, sum)
  #now divide by 1
  AiBF$Ai <- 1/AiBF$Ai 
  #and replace the initial values with the new balancing factors
  updates = AiBF[match(cdatasub$Orig,AiBF$Orig),"Ai"]
  cdatasub$Ai = ifelse(!is.na(updates), updates, cdatasub$Ai)
  #now, if not the first iteration, calculate the difference between  the new Ai values and the old Ai values and once done, overwrite the old Ai values with the new ones. 
  if(its==1){
    cdatasub$OldAi <- cdatasub$Ai    
  } else {
    cdatasub$diff <- abs((cdatasub$OldAi-cdatasub$Ai)/cdatasub$OldAi)    
    cdatasub$OldAi <- cdatasub$Ai
  }
  
  #Now some similar calculations for Bj...
  if(disdecay==0){
    cdatasub$Bj <- (cdatasub$Ai*cdatasub$O_i*exp(cdatasub$dist*cdatasub$beta))
  } else {
    cdatasub$Bj <- (cdatasub$Ai*cdatasub$O_i*exp(log(cdatasub$dist)*cdatasub$beta))
  }
  #aggregate the results by your Dests and store in a new dataframe
  BjBF <- aggregate(Bj ~ Dest, data = cdatasub, sum)
  #now divide by 1
  BjBF$Bj <- 1/BjBF$Bj  
  #and replace the initial values by the balancing factor
  updates = BjBF[match(cdatasub$Dest,BjBF$Dest),"Bj"]
  cdatasub$Bj = ifelse(!is.na(updates), updates, cdatasub$Bj)
#now, if not the first iteration, calculate the difference between the new Bj values and the old Bj values and once done, overwrite the old Bj values with the new ones.
  if(its==1){
    cdatasub$OldBj <- cdatasub$Bj
  } else {
    cdatasub$diff <- abs((cdatasub$OldBj-cdatasub$Bj)/cdatasub$OldBj)    
    cdatasub$OldBj <- cdatasub$Bj
  } 
  #overwrite the convergence variable with 
  cnvg = sum(cdatasub$diff)
}
```

So, we've calculated our $A_i$ and $B_j$ values using out iterative routine - now we can plug them back into our model, to prove, once again, that the Poisson Model is exactly the same as the multiplicative Entropy Maximising Model...

```{r}
########################################################################
#Now create some SIM estimates
if(disdecay==0){
  cdatasub$SIM_Estimates <- (cdatasub$O_i*cdatasub$Ai*cdatasub$D_j*cdatasub$Bj*exp(cdatasub$dist*cdatasub$beta))
} else{
  cdatasub$SIM_Estimates_pow <- (cdatasub$O_i*cdatasub$Ai*cdatasub$D_j*cdatasub$Bj*exp(log(cdatasub$dist)*cdatasub$beta))
}
########################################################################
```


```{r, echo=TRUE, message=FALSE, warning=FALSE}
cdatasub$SIM_Estimates <- round(cdatasub$SIM_Estimates,0)
cdatasubmat8 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "SIM_Estimates", margins=c("Orig", "Dest"))
cdatasubmat8
```

Are the the same? Yes!! Woo hoo.

##5. Conclusions, further notes and ideas for additional activities

Hopefully you have now seen how it is extremely straight-forward to run and calibrate Wilson's full family of Spatial Interaction Models in R using GLM and Poisson Regression. 

###5.1 Some Further Notes

Now might be the time to mention that despite everything I've shown you, there has been some discussion in the literature as to whether the Poisson Model is actually a misspecification, especially for modelling migration flows. If you have the stomach for it, [this paper by Congdon goes into a lot of detail](http://journals.sagepub.com/doi/abs/10.1068/a251481).

The issue is a thing called 'overdispersion' which, translated, essentially relates to the model not being able to capture all of the things that could be explaining the flows in the independent variables that are supplied to the model. The details are tedious and only really intelligible to those with a statistics background. If you want a starter, [try here](https://en.wikipedia.org/wiki/Overdispersion), but in practical terms, we can get around this problem by fitting a very similar sort of regression model called the *negative binomial* regression model. 

If you wish, you can read up and experiment with this model - you can fit it in exactly the same way as the `glm` model but using a function called `glm.nb` which is part of the `mass` package. The negative binomial model has an extra parameter in the model for overdispersion. You you do try this, you will almost certainly discover that your results barely change - but hell, you might keep a pedantic reviewer at bay if you submit this to a journal (not that I'm speaking from experience or anything). 

###And some more comments

Another thing to note is that the example we used here had quite neat data. You will almost certainly run into problems if you have sparse data or predictors with 0s in them. If this happens, then you might need to either drop some rows in your data (if populated with 0s) or substitute 0s for very small numbers, much less than 1, but greater than 0 (this is because you can't take the log of 0)

And another thing to note is that our flow data and our predictors were all in and around the same order or magnitude. If you suddenly get data that (such as population masses at origins and destinations) that are an order of magnitude different (i.e. populations about ten times larger in different locations) then the model estimates might be biased. Fortunately, there are packages available to help us with these problems as well. The [`robustbase` package](https://cran.r-project.org/web/packages/robustbase/index.html) features a function called `glmrob` which will deal with this issues (again, your results probably won't change much, but worth knowing).

###5.2 Further Activities

1. Testing these models out on the whole of London and for different years
    + You've been playing around with just a small 7 borough sample, why not try the full London system. 
    +You can also try and download some similar data from the 2011 Census from [Wicid](http://wicid.ukdataservice.ac.uk/) - see if using $Oi$ and $Dj$ totals and the parameters you calibrated on the 2001 data, whether you can get reasonable estimates of the 2011 flows.
    +How have the model parameters changed between 2001 and 2011 - what does this mean


2. Visualising your flow estimates
    + try using the methods in `stplanr` and `leaflet` from [the last practical](https://rpubs.com/adam_dennett/257231) to visualise some of your flow estimates or flow residuals...


