Getting Started: Setting up R and Getting Your Data in Order

I will assume that you already have R and RStudio installed on your computer and that you already have a little bit of knowledge about both. If you don’t, download R and then RStudio from https://www.r-project.org/ and https://www.rstudio.com/.

Open up RStudio and set up a new project working directory. Open a new R script and save it into your new working directory.

Check that the file is in your working directory and that R can see it…

getwd()
list.files(getwd())

Before we go any further, let’s first get a few packages that will come in handy for wrangling our data, carrying out some analysis and producing some visualisations.

If you don’t have the packages already, install them using this script:

install.packages(c("tidyverse", "sp", "MASS", "reshape2", "geojsonio", "rgdal", "downloader", "maptools", "dplyr", "broom", "stplanr", "ggplot2", "leaflet", "sf", "tmap"))

Now ‘library’ them into your working environment:

library(tidyverse)
library(sp)
library(MASS)
library(reshape2)
library(geojsonio)
library(rgdal)
library(downloader)
library(maptools)
library(broom) 
library(stplanr)
library(leaflet)
library(sf)
library(tmap)

Check that all of these packages have been libraried successfully and that you don’t get any error messages - if they haven’t libraried successfully, then we may run into some problems later on. Possible problems that I have seen already are:

1. rgdal can cause some problems on Macs - if this happens, try updating your versions of R and RStudio and re-installing the packages. If you are still having problems, try the solutions suggested on Stack Overflow.

2. If you are using a computer that uses Chinese characters in the file system, you might need to download the zip file of the package and install it into your package directory manually (a sketch of how to do this is shown below).
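For the manual route, something along the following lines should work - note that the file path and package zip file named here are just placeholders for whatever you have actually downloaded:

#install a package from a locally downloaded zip file rather than from CRAN
#the path below is a placeholder - point it at the zip file you downloaded ("win.binary" is for Windows binary zips)
install.packages("C:/Downloads/rgdal.zip", repos = NULL, type = "win.binary")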

Setting up Some Spatial Data

As the name suggests, to run a spatial interaction model, you are going to need some spatial data and some data on interactions (flows). Let’s start with some spatial data:

#Fetch a GeoJSON of some district-level boundaries from the ONS Geoportal and read it straight in as an 'sp' SpatialPolygonsDataFrame
EW <- geojson_read("http://geoportal.statistics.gov.uk/datasets/8edafbe3276d4b56aec60991cbddda50_2.geojson", what = "sp")

Check your boundaries have downloaded OK…

plot(EW)

#and have a quick look at the top of the data file
head(EW@data)
##   objectid   lad15cd              lad15nm lad15nmw st_areashape st_lengthshape
## 1        1 E06000001           Hartlepool              98441686       65270.30
## 2        2 E06000002        Middlesbrough              54553581       41055.85
## 3        3 E06000003 Redcar and Cleveland             253890938      101208.78
## 4        4 E06000004     Stockton-on-Tees             209730825      108085.16
## 5        5 E06000005           Darlington             197475682      107206.32
## 6        6 E06000006               Halton              90321511       60716.86

OK great, now let’s just chop out London for the time being

#pull out London using grep and the regex anchor for 'start of the string' (^) to look for the bit of the district code that relates to London (E09) in the 'lad15cd' column in the data slot of our SpatialPolygonsDataFrame
London <- EW[grep("^E09",EW@data$lad15cd),]
#plot it
plot(London)

#and have a look under the bonnet
summary(London)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##          min        max
## x -0.5103559  0.3340441
## y 51.2867587 51.6918770
## Is projected: FALSE 
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Data attributes:
##     objectid     lad15cd            lad15nm            lad15nmw        
##  Min.   :294   Length:33          Length:33          Length:33         
##  1st Qu.:302   Class :character   Class :character   Class :character  
##  Median :310   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :310                                                           
##  3rd Qu.:318                                                           
##  Max.   :326                                                           
##   st_areashape       st_lengthshape 
##  Min.   :  3149379   Min.   : 9602  
##  1st Qu.: 27249341   1st Qu.:29025  
##  Median : 38578079   Median :37404  
##  Mean   : 48324152   Mean   :38991  
##  3rd Qu.: 56585441   3rd Qu.:45769  
##  Max.   :150134859   Max.   :76164

Calculating a distance matrix

Now that we have a nice clean set of London boundaries, let’s extract some spatial information from them. Of course, the important spatial information for spatial interaction models relates to the cost of interaction between places, and this is very frequently represented by distance…

If you look at the summary above, you’ll notice that the boundaries we have are not in British National Grid - the bit that says proj4string tells me that we are in WGS84 or using latitude and longitude coordinates. We need to change this to British National Grid so our distances are in metres and not decimal degrees, then do everything we need to do to generate a distance matrix.

#first transform to British National Grid (BNG) using spTransform - this is important so that distances are calculated in metres rather than decimal degrees
BNG = "+init=epsg:27700"
LondonBNG <- spTransform(London, BNG)
#now, order by borough code - *this step will be important later on*
LondonBNG <- LondonBNG[order(LondonBNG$lad15cd),]
#now use spDists to generate a big distance matrix of all distances between boroughs in London
dist <- spDists(LondonBNG)
#melt this matrix into a list of origin/destination pairs using melt. Melt is in the reshape2 package. Reshape2, dplyr and ggplot2, together, are some of the best packages in R, so if you are not familiar with them, get googling and your life will be much better!
distPair <- melt(dist)
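As a quick (optional) sanity check, with 33 London boroughs the distance matrix should be 33 x 33 and the melted pair list should contain 33 * 33 = 1089 rows, with melt’s default Var1, Var2 and value columns:

#optional sanity check on the distance matrix and the melted origin/destination pair list
dim(dist)
nrow(distPair)
head(distPair)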

Flow Data

The data we are going to use to test our spatial interaction models is commuting data from the 2001 England and Wales Census. The Census includes a question which asks for your home address and another which asks for the location of your usual place of work, as well as the method of transport you mainly use for your commute. From this, estimates of commuter flows by transport type can be generated for the whole country.

In this exercise, to save time, I have already downloaded some sample data which records the place (Borough) of residence and place (Borough) of work for all people living in London at the time of the 2001 Census. Borough level is quite coarse, but it will suffice for demonstration. If you would like to download your own commuting or migration flow data, then you should visit the Census Support Flow Data Service, WICID - https://wicid.ukdataservice.ac.uk/ - where you can download flows for a huge range of geographies from the 1981, 1991, 2001 and 2011 Censuses.

As well as flow data, I have also collated some additional data on income, the number of jobs in each borough in 2001 and the total population - we will use these as destination attractiveness / mass term and origin emissiveness / mass term proxies in the models which follow.

#read in your London Commuting Data
cdata <- read.csv("https://www.dropbox.com/s/7c1fi1txbvhdqby/LondonCommuting2001.csv?raw=1")
#read in a lookup table for translating between old borough codes and new borough codes
CodeLookup <- read.csv("https://www.dropbox.com/s/h8mpvnepdkwa1ac/CodeLookup.csv?raw=1")
#read in some population and income data
popincome <- read.csv("https://www.dropbox.com/s/84z22a4wo3x2p86/popincome.csv?raw=1")

#now merge these supplementary data into your flow data dataframe
cdata$OrigCodeNew <- CodeLookup$NewCode[match(cdata$OrigCode, CodeLookup$OldCode)]
cdata$DestCodeNew <- CodeLookup$NewCode[match(cdata$DestCode, CodeLookup$OldCode)]
cdata$vi1_origpop <- popincome$pop[match(cdata$OrigCodeNew, popincome$code)]
cdata$vi2_origsal <- popincome$med_income[match(cdata$OrigCodeNew, popincome$code)]
cdata$wj1_destpop <- popincome$pop[match(cdata$DestCodeNew, popincome$code)]
cdata$wj2_destsal <- popincome$med_income[match(cdata$DestCodeNew, popincome$code)]

#The data needs to be ordered by borough code - if it isn't, we will run into problems when we try to merge our distance data back in later. To make sure, we can arrange by origin and then destination using dplyr's 'arrange' function

cdata <- arrange(cdata, OrigCodeNew, DestCodeNew)

Now, to finish, we need to add in the distance data we generated earlier and create a new column of total flows which excludes flows that occur within boroughs (we could keep these within-borough, or intra-borough, flows in, but they can cause problems, so for now we will just exclude them).

#First create a new total column which sets the intra-borough flows to 0, and an 'offset' column which sets them to a very, very small number instead (for reasons you will see later...)
cdata$TotalNoIntra <- ifelse(cdata$OrigCode == cdata$DestCode,0,cdata$Total)
cdata$offset <- ifelse(cdata$OrigCode == cdata$DestCode,0.0000000001,1)
# now add the distance column into the dataframe
cdata$dist <- distPair$value

Let’s have a quick look at what your spangly new data looks like:

head(cdata)
##             Orig OrigCode                 Dest DestCode Total WorksFromHome
## 1 City of London     00AA       City of London     00AA  2059           432
## 2 City of London     00AA Barking and Dagenham     00AB     6             0
## 3 City of London     00AA               Barnet     00AC    14             0
## 4 City of London     00AA               Bexley     00AD     0             0
## 5 City of London     00AA                Brent     00AE    16             0
## 6 City of London     00AA              Bromley     00AF     0             0
##   Underground Train Bus Taxi CarDrive CarPass Motobike Bicycle Walk Other
## 1         120    53  50   31       39       0        3      18 1272    41
## 2           3     3   0    0        0       0        0       0    0     0
## 3          11     0   0    0        0       0        0       0    3     0
## 4           0     0   0    0        0       0        0       0    0     0
## 5          10     0   3    0        0       0        0       0    3     0
## 6           0     0   0    0        0       0        0       0    0     0
##   OrigCodeNew DestCodeNew vi1_origpop vi2_origsal wj1_destpop wj2_destsal
## 1   E09000001   E09000001       12000       38300       12000       38300
## 2   E09000001   E09000002       12000       38300       56000       16200
## 3   E09000001   E09000003       12000       38300      159000       18700
## 4   E09000001   E09000004       12000       38300      112000       18300
## 5   E09000001   E09000005       12000       38300      127000       16500
## 6   E09000001   E09000006       12000       38300      164000       19100
##   TotalNoIntra offset     dist
## 1            0  1e-10     0.00
## 2            6  1e+00 15995.23
## 3           14  1e+00 13935.67
## 4            0  1e+00 17359.91
## 5           16  1e+00 13119.80
## 6            0  1e+00 18740.82

Very nice, but to make this demonstration even easier, let’s just select a small subset of these flows (we can come back to the whole dataset later on…)

#We'll just use the first 7 boroughs by code, so first, create a vector of these 7 to match with our data
toMatch<-c("00AA", "00AB", "00AC", "00AD", "00AE", "00AF", "00AG")
#subset the data by the 7 sample boroughs
#first the origins
cdatasub <- cdata[grep(paste(toMatch,collapse = "|"), cdata$OrigCode),]
#then the destinations - note that this time we need to grep on cdatasub (not cdata), otherwise the row indices won't match and we will end up with rows full of NAs
cdatasub <- cdatasub[grep(paste(toMatch,collapse = "|"), cdatasub$DestCode),]
#now chop out the intra-borough flows, which leaves the 7*7 - 7 (42) origin-destination pairs we are interested in
cdatasub <- cdatasub[cdatasub$OrigCode!=cdatasub$DestCode,]
#now re-order so that OrigCodeNew, DestCodeNew and Total are the first three columns. *Note that you have to be explicit about using the select function from the dplyr package, as MASS also has a 'select' function and, because MASS was loaded after dplyr, R will use that one by default. We can be explicit by using the package::function syntax
cdatasub <- dplyr::select(cdatasub, OrigCodeNew, DestCodeNew, Total, everything())

And this is what those flows look like on a map - quick and dirty style…

#convert the London zones to an SF object
LondonBNGSF <- st_as_sf(LondonBNG) %>% subset(select = -objectid)
#use the od2line function from Robin Lovelace's excellent stplanr package
travel_network <- od2line(flow = cdatasub, zones = LondonBNGSF)
#and set the line widths to some sensible value according to the flow
w <- cdatasub$Total / max(cdatasub$Total) * 10
#now plot it...
plot(travel_network, lwd = w)

tmap_mode("view")
qtm(travel_network, lines.lwd = "Total", scale = 10)

And as a matrix…

#now we can create a pivot table to turn the paired list into a matrix (and compute the margins as well)
cdatasubmat <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "Total", margins=c("Orig", "Dest"))
cdatasubmat
##                   Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham                    0    194     96   178      66   1500
## 2               Barnet                   96      0     34  5467      76  12080
## 3               Bexley                  362    132      0   144    4998   2470
## 4                Brent                   40   6124     28     0      66   8105
## 5              Bromley                  134    162   3199   201       0   3780
## 6               Camden                   36   1496     32  1350      60      0
## 7       City of London                    6     14      0    16       0    335
## 8                (all)                  674   8122   3389  7356    5266  28270
##   City of London (all)
## 1           3641  5675
## 2           7709 25462
## 3           6580 14686
## 4           4145 18508
## 5           9855 17331
## 6           8795 11769
## 7              0   371
## 8          40725 93802

OK, we’ve set everything up, now it’s…

Modellin’ Time!

In explaining how to run and calibrate spatial interaction models in R, I will adopt the notation used by Taylor Oshan in his excellent primer for running spatial interaction models in Python. The paper is well worth a read and can be found here: http://openjournals.wu.ac.at/region/paper_175/175.html

Below is the classic multiplicative gravity model:

  1. \[T_{ij} = k \frac{V_i^\mu W_j^\alpha}{d_{ij}^\beta}\]

This gravity model can be written in the form more familiar from Wilson’s 1971 paper - http://journals.sagepub.com/doi/abs/10.1068/a030001

  2. \[T_{ij} = k V_i^\mu W_j^\alpha d_{ij}^{-\beta}\]

This model just says that the flows between an origin and destination are proportional to the product of the mass of the origin and destination and inversely proportional to the distance between them.

As origin and destination masses increase, flows increase, but as distance increases, flows decrease, and vice versa.

\(k\) is a constant of proportionality and leads to this particular model being more accurately described as a ‘total constrained’ model, as all of the flows estimated by the model will sum to the total of the observed flows used to calibrate the parameters, where:

  3. \[k = \frac{T}{\sum_i \sum_j V_i^\mu W_j^\alpha d_{ij}^{-\beta}}\]

and \(T\) is the sum of our matrix of observed flows or:

  4. \[T = \sum_i \sum_j T_{ij}\]

In English, this is just the sum of all observed flows divided by the sum of all of the other elements in the model.

Estimating Model Parameters

Now, it’s perfectly possible to produce some flow estimates by plugging arbitrary or expected values into our parameters. The parameters relate to the scaling effect / importance of the variables they are associated with. Most simply, where the effects of origin and destination attributes on flows scale in a linear fashion (i.e. for a 1 unit increase in, say, population at the origin, we might expect a 1 unit increase in flows of people from that origin, or for a halving in average salary at the destination, we might expect a halving of commuters), \(\mu\) = 1 and \(\alpha\) = 1. In Newton’s original gravity equation, \(\beta\) = -2, with the influence of distance on flows following a power law: at a distance of 1 unit, flows are scaled by 1^-2 (1); at a distance of 2 units, by 2^-2 (0.25 or 1/4); at 3 units, by 3^-2 (0.111), and so on.
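If that power-law decay feels a little abstract, this quick bit of arithmetic reproduces the scaling factors quoted above:

#the distance decay factors implied by a power function with an exponent of -2
d <- 1:3
round(d^-2, 3)
## [1] 1.000 0.250 0.111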

Let’s see if these parameters are a fair first guess (we’ll use the whole dataset in order to get a less messy picture)…

#First plot the commuter flows against distance and then fit a model line with a ^-2 parameter
plot1 <- ggplot(cdata) + 
  geom_point(aes(x = dist, y = Total)) +
  stat_function(aes(x = dist), fun=function(x)x^-2, geom="line", colour="red")

plot1

#now, what about the origin and destination data...
plot2 <- ggplot(cdata) + 
  geom_point(aes(x = vi1_origpop, y = Total)) +
  stat_function(aes(x = vi1_origpop), fun=function(x)x^1, geom="line", colour="red")

plot2

plot3 <- ggplot(cdata) + 
  geom_point(aes(x = wj2_destsal, y = Total)) +
  stat_function(aes(x = wj2_destsal), fun=function(x)x^1, geom="line", colour="red")

plot3

OK, so it looks like we’re not far off (well, destination salary doesn’t look too promising as a predictor, but we’ll see how we get on…), so let’s see what flow estimates with these starting parameters look like.

#set up some variables to hold our parameter values in:
mu <- 1
alpha <- 1
beta <- -2
k <- 1
T2 <- sum(cdatasub$Total)

Now let’s create some flow estimates using Equation 2 above… Begin by applying the parameters to the variables:

vi1_mu <- cdatasub$vi1_origpop^mu
wj2_alpha <- cdatasub$wj2_destsal^alpha
dist_beta <- cdatasub$dist^beta
T1 <- vi1_mu*wj2_alpha*dist_beta
k <- T2/sum(T1)

Then, just as in Equation 2 above, multiply everything together to get your flow estimates:

#run the model and store all of the new flow estimates in a new column in the dataframe
cdatasub$unconstrainedEst1 <- round(k*vi1_mu*wj2_alpha*dist_beta,0)
#check that the sum of these estimates makes sense
sum(cdatasub$unconstrainedEst1)
## [1] 93802
#turn it into a little matrix and have a look at your handiwork
cdatasubmat1 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "unconstrainedEst1", margins=c("Orig", "Dest"))
cdatasubmat1
##                   Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham                    0    265   1902   190     428    435
## 2               Barnet                  651      0    518  7344     453   6843
## 3               Bexley                 3367    373      0   317    2475    675
## 4                Brent                  422   6648    399     0     419   6627
## 5              Bromley                 1063    458   3472   467       0    881
## 6               Camden                  641   4105    562  4392     523      0
## 7       City of London                  121    184    116   183     104   1148
## 8                (all)                 6265  12033   6969 12893    4402  16609
##   City of London (all)
## 1           1335  4555
## 2           4995 20804
## 3           2267  9474
## 4           4501 19016
## 5           2849  9190
## 6          18684 28907
## 7              0  1856
## 8          34631 93802

How do the flow estimates compare with the original flows?

cdatasubmat
##                   Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham                    0    194     96   178      66   1500
## 2               Barnet                   96      0     34  5467      76  12080
## 3               Bexley                  362    132      0   144    4998   2470
## 4                Brent                   40   6124     28     0      66   8105
## 5              Bromley                  134    162   3199   201       0   3780
## 6               Camden                   36   1496     32  1350      60      0
## 7       City of London                    6     14      0    16       0    335
## 8                (all)                  674   8122   3389  7356    5266  28270
##   City of London (all)
## 1           3641  5675
## 2           7709 25462
## 3           6580 14686
## 4           4145 18508
## 5           9855 17331
## 6           8795 11769
## 7              0   371
## 8          40725 93802

How good is my model?

So, looking at the two little matrices above you can see that in some cases the flow estimates aren’t too bad (Barking and Dagenham to Barnet, for example), but in others they are pretty rubbish (Camden to the City of London, anyone?). Whilst it’s OK to eyeball small flow matrices like this, with much larger matrices we need another solution…

Testing the “goodness-of-fit”.

Yes, that’s what it’s called - I know, it doesn’t sound correct, but goodness-of-fit is the correct term for checking how well your model estimates match up with your observed flows.

So how do we do it?

Well… there are a number of ways, but perhaps the two most common are to look at the coefficient of determination (\(r^2\)) or the Root Mean Squared Error (RMSE). You’ve probably come across \(r^2\) before if you have fitted a linear regression model, but you may not have come across RMSE. There are other methods and they all do more or less the same thing, which is essentially to compare the modelled estimates with the real data. \(r^2\) is popular as it is quite intuitive and can be compared across models. RMSE is less intuitive, but some argue it is better for comparing changes to the same model. Here we’ll do both…

R-Squared

\(r^2\) is the square of the correlation coefficient, \(r\), between the observed and estimated flows.

For our sample data, we can calculate this very easily using a little function

CalcRSquared <- function(observed,estimated){
  r <- cor(observed,estimated)
  R2 <- r^2
  R2
}
CalcRSquared(cdatasub$Total,cdatasub$unconstrainedEst1)
## [1] 0.5032696

Using this function we get a value of around 0.50, or 50%. This tells us that our model accounts for about 50% of the variation of flows in the system. Not bad, but not brilliant either.

Root Mean Squared Error (RMSE)

We can use a similar simple function to calculate the RMSE for our data

CalcRMSE <- function(observed,estimated){
  res <- (observed - estimated)^2
  RMSE <- round(sqrt(mean(res)),3)
  RMSE
}
CalcRMSE(cdatasub$Total,cdatasub$unconstrainedEst1)
## [1] 2503.352

The figure that is produced by the RMSE calculation is far less intuitive than the \(r^2\) value and this is mainly because it very much depends on things like the units the data are in and the volume of data. It can’t be used to compare different models run using different data sets. However, it is good for assessing whether changes to the model result in improvements. The closer to 0 the RMSE value, the better the model.

So how can we start to improve our fit…?

Improving our model: 1 - Calibrating parameters

(this bit might take a little while, but stick with it)

Now, the model we have run above is probably the most simple spatial interaction model we could have run and the results aren’t terrible, but they’re not great either.

One way that we can improve the fit of the model is by calibrating the parameters on the flow data that we have.

The traditional way that this has been done computationally is by using the goodness-of-fit statistics. If you have the requisite programming skills, you can write a computer algorithm that iteratively adjusts each parameter, runs the model, checks the goodness-of-fit and then starts all over again until the goodness-of-fit statistic is maximised.

This is partly why spatial interaction modelling was the preserve of specialists for so long, as acquiring the requisite skills to write such computer programmes can be challenging!
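To make the idea concrete, here is a minimal sketch of that brute-force approach (purely illustrative - the grid of candidate values and the choice to hold \(\mu\) and \(\alpha\) at 1 are arbitrary, and this is not the approach we will actually use):

#a minimal sketch of brute-force calibration: hold mu and alpha at 1 and search a grid of
#candidate beta values, keeping whichever one maximises r-squared
betaGrid <- seq(-3, -0.5, by = 0.1)
fits <- sapply(betaGrid, function(b){
  est <- cdatasub$vi1_origpop * cdatasub$wj2_destsal * cdatasub$dist^b
  est <- est * sum(cdatasub$Total) / sum(est) #rescale so the estimates sum to the observed total
  CalcRSquared(cdatasub$Total, est)
})
betaGrid[which.max(fits)]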

However, since the early days of spatial interaction modelling, a number of useful developments have occurred…

The mathematically minded among you may have noticed that if you take the logarithms of both sides of Equation 2, you end up with the following equation:

  5. \[\ln T_{ij} = k + \mu\ln V_i + \alpha\ln W_j - \beta \ln d_{ij}\]

Those of you who have played around with regression models in the past will realise that this is exactly that - a regression model.

And if you have played around with regression models you will be aware that there are various pieces of software available to run regressions (such as R) and calibrate the parameters for us, so we don’t have to be expert programmers to do this - yay!
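For example, the log-normal version of Equation 5 can be fitted in a single line with lm(). The sketch below is shown purely for comparison (the object name is just a suggestion) - note that we have to drop the zero flows before taking logs, which hints at one of the problems with this specification:

#a log-normal (log-linear) fit of Equation 5 using lm(), for illustration only -
#the zero flows have to be dropped first because log(0) is undefined
lognormSIM <- lm(log(Total) ~ log(vi1_origpop) + log(wj2_destsal) + log(dist),
                 data = cdatasub[cdatasub$Total > 0, ])
summary(lognormSIM)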

Now, there are a couple of papers that are worth reading at this point. Perhaps the best is by Flowerdew and Aitkin (1982), titled “A METHOD OF FITTING THE GRAVITY MODEL BASED ON THE POISSON DISTRIBUTION” - the paper can be found here: http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9787.1982.tb00744.x/abstract

One of the key points that Flowerdew and Aitkin make is that the model in Equation 5 (known as a log-normal model) has various problems associated with it which mean that the estimates produced might not be reliable. If you’d like to know more about these, read the paper (and also Wilson’s 1971 paper), but at this point it is worth just knowing that the way around many of these issues is to re-specify the model, not as a log-normal regression, but as a Poisson or negative binomial regression model.

Poisson regression

The main theory (for non-experts like me anyway) behind the Poisson regression model is that the sorts of flows that spatial interaction models deal with (such as migration or commuting flows) relate to non-negative integer counts (you can’t have negative people moving between places and you can’t - normally, if they are alive - have fractions of people moving either).

As such, the continuous (normal) probability distributions which underpin standard regression models don’t hold. However, discrete probability distributions such as the Poisson distribution and the negative binomial distribution (of which the Poisson distribution is a special case - wikipedia it) do hold, and so we can use these associations to model our flows.

At this point, it’s probably worth looking at what a Poisson distribution looks like compared to a normal distribution, if you are not familiar.

Here’s a normal distribution:

#histogram of 3000 random draws from a normal distribution with a mean of 75 and a standard deviation of 5
qplot(rnorm(3000, mean = 75, sd = 5))

Now here’s a Poisson distribution with the same mean:

qplot(rpois(3000, lambda = 75)) + stat_bin(binwidth = 1)

Looks kind of similar, doesn’t it! The thing with the Poisson distribution is, when the mean (\(\lambda\) - lambda) changes, so does the distribution. As the mean gets smaller (and this is often the case with flow data where small flows are very likely - have a look at the ‘Total’ column in your cdata dataframe, lots of small numbers aren’t there?) the distribution starts to look a lot more like a skewed or log-normal distribution. The key thing is it’s not - it’s a Poisson distribution. Here’s a similar frequency distribution with a small mean:

#what about a lambda of 0.5?
qplot(rpois(3000, lambda = 0.5)) + stat_bin(binwidth = 1)

As far as we’re concerned, what this means is that if we are interested in all flows between all origins and destinations in our system, these flows will have a mean value of \(\lambda_{ij}\) and this will dictate the distribution. Here’s what the distribution of our flows looks like:

qplot(cdata$Total) + geom_histogram()

Mmmm, Poissony!

So, what does all of this mean for our spatial interaction model?

Well the main thing it means is that Equation 5, for most sorts of spatial interaction models where we are modelling flows of people or whole things, is not correct.

By logging both sides of the equation in Equation 5, we are trying to get a situation where our \(T_{ij}\) flows can be modelled using the values of our other variables, such as distance, with a straight line a bit like this:

qplot(log(dist), log(Total), data=cdata) + geom_smooth(method = lm)

If you compare this graph with the graph above (the first scatter plot we drew), it’s exactly the same data, but clearly by logging both the total and distance, we can get a bit closer to being able to fit a model estimate using a straight line.

What the Poisson distribution means is that the \(y\) variable in our model is not logged as in the graph above, but it can still be modelled using something like the blue line - I hope that sort of makes sense. If not, don’t worry, just take it from me that this is good news.

The Poisson Regression Spatial Interaction Model

So, we can now re-specify Equation 5 as a Poisson regression model. Instead of our dependent variable being \(\ln T_{ij}\), our dependent variable is now the mean of our Poisson distribution, \(\lambda_{ij}\), and the model becomes:

  6. \[\lambda_{ij} = \exp(k + \mu\ln V_i + \alpha\ln W_j - \beta \ln d_{ij})\]

What this model says is that \(\lambda_{ij}\) (our dependent variable - the estimate of \(T_{ij}\)) is logarithmically linked to (or modelled by) a linear combination of the logged independent variables in the model.

Now we have Equation 6 at our disposal, we can use a Poisson regression model to produce estimates of \(k\), \(\mu\), \(\alpha\) and \(\beta\) - or put another way, we can use the regression model to calibrate our parameters.

So, let’s have a go at doing it!!

It’s very straightforward to run a Poisson regression model in R using the glm (Generalised Linear Models) function. In practical terms, running a GLM is no different from running a standard regression model using lm. If you want to find out more about glm, try the R help system (?glm) or google it to find the function details. If you delve far enough into the depths of what glm does, you will find that the parameters are calibrated through an ‘iteratively re-weighted least squares’ algorithm. This algorithm does exactly the sort of job I described earlier: it fits lots of lines to the data, continually adjusting the parameters until the error between the observed and expected values, as judged by some goodness-of-fit measure, is minimised.

These sorts of algorithms have been around for years and are very well established so it makes sense to make use of them rather than trying to re-invent the wheel ourselves. So here we go…

#run the unconstrained model
uncosim <- glm(Total ~ log(vi1_origpop)+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)

It’s as simple as that - it runs in a matter of milliseconds. You should be able to see how the glm R code corresponds to Equation 6.

Total = \(T_{ij}\) = \(\lambda_{ij}\)

~ means ‘is modelled by’

log(vi1_origpop) = \(\ln V_i\)

log(wj2_destsal) = \(\ln W_j\)

log(dist) = \(\ln d_{ij}\)

family = poisson(link = "log") means that we are using a Poisson regression model where the left-hand side of the model equation is logarithmically linked to the variables on the right-hand side (log is the default, and by far the most common, link function for a Poisson model).
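If you want to see that log link in action once the model has been fitted, a quick (optional) check is that the linear predictor returned by the model is simply the log of the fitted means:

#optional check: with a log link, the linear predictor equals the log of the fitted mean
all.equal(predict(uncosim, type = "link"), log(fitted(uncosim)))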

So what comes out of the other end?

Well, we can use the summary() function to have a look at the model parameters:

#examine the output of the unconstrained model
summary(uncosim) 
## 
## Call:
## glm(formula = Total ~ log(vi1_origpop) + log(wj2_destsal) + log(dist), 
##     family = poisson(link = "log"), data = cdatasub, na.action = na.exclude)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -55.37  -31.27  -16.07   12.66   62.84  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -15.808424   0.176536  -89.55   <2e-16 ***
## log(vi1_origpop)   1.755752   0.012483  140.65   <2e-16 ***
## log(wj2_destsal)   1.647197   0.008773  187.75   <2e-16 ***
## log(dist)         -1.407922   0.006707 -209.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 176846  on 41  degrees of freedom
## Residual deviance:  46086  on 38  degrees of freedom
## AIC: 46408
## 
## Number of Fisher Scoring iterations: 6

We can see from the summary that the Poisson regression has calibrated all 4 parameters for us and these appear under the ‘Estimate’ column:

\(k\) (intercept) = -15.808424

\(\mu\) = 1.755752

\(\alpha\) = 1.647197

and \(\beta\) = -1.407922

We can also see from the other outputs that all variables are highly significant (***), with the z-scores revealing that distance has the most influence on the model (as we might have expected from the scatter plots we produced earlier which showed that distance had by far the strongest correlation with commuting flows).
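As a small aside, the broom package we loaded at the start will return the same coefficient table as a tidy data frame, which can be handy if you want to work with the parameters programmatically:

#the model coefficients as a tidy data frame, courtesy of the broom package
broom::tidy(uncosim)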

These parameters are not too far away from our initial guesses of \(\mu\) = 1, \(\alpha\) = 1 and \(\beta\) = -2, but how do the estimates compare?

One way to calculate the estimates is to plug all of the parameters back into Equation 6 like this:

#first assign the parameter values from the model to the appropriate variables
k <- uncosim$coefficients[1]
mu <- uncosim$coefficients[2]
alpha <- uncosim$coefficients[3]
beta <- -uncosim$coefficients[4]

#now plug everything back into the Equation 6 model... (be careful with the signs here - we flipped beta to positive when we assigned it above, so it needs a minus sign in front of it in the equation)
cdatasub$unconstrainedEst2 <- exp(k+(mu*log(cdatasub$vi1_origpop))+(alpha*log(cdatasub$wj2_destsal))-(beta*log(cdatasub$dist)))

#which is exactly the same as this...
cdatasub$unconstrainedEst2 <- (exp(k)*exp(mu*log(cdatasub$vi1_origpop))*exp(alpha*log(cdatasub$wj2_destsal))*exp(-beta*log(cdatasub$dist)))

#and of course, being R, there is an even easier way of doing this...
cdatasub$fitted <- fitted(uncosim)
#round the estimates so they are whole numbers of commuters
cdatasub$unconstrainedEst2 <- round(cdatasub$unconstrainedEst2,0)
sum(cdatasub$unconstrainedEst2)
## [1] 93803
#turn it into a little matrix and have a look at your handiwork
cdatasubmat2 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "unconstrainedEst2", margins=c("Orig", "Dest"))
cdatasubmat2
##                   Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham                    0    206    808   145     294    308
## 2               Barnet                 1015      0    968  5683     919   6422
## 3               Bexley                 2231    542      0   430    2099    869
## 4                Brent                  591   4707    636     0     686   4957
## 5              Bromley                 1480    937   3820   844       0   1567
## 6               Camden                  623   2635    637  2455     630      0
## 7       City of London                   20     32     22    28      22    121
## 8                (all)                 5960   9059   6891  9585    4650  14244
##   City of London (all)
## 1           1264  3025
## 2           9587 24594
## 3           3803  9974
## 4           7034 18611
## 5           6670 15318
## 6          15056 22036
## 7              0   245
## 8          43414 93803

And the $1,000,000 question - has calibrating the parameters improved the model…?

CalcRSquared(cdatasub$Total,cdatasub$unconstrainedEst2)
## [1] 0.6725599
CalcRMSE(cdatasub$Total,cdatasub$unconstrainedEst2)
## [1] 1892.615

Yes indeedy do!!

The \(r^2\) has improved from around 0.50 to 0.67 and the RMSE has dropped from 2503.35 to 1892.62, so by calibrating our parameters using the Poisson regression model we have markedly improved our model fit.

But we can do even better. So far we have just been playing with the unconstrained model; by adding constraints into the model we can both improve our fit further AND start to do cool things like estimate transport trip distributions from known information about people leaving an area, or estimate the amount of money a shop is going to make from the available money that people in the surrounding area have to spend, or guess the number of migrants travelling between specific countries where we only know how many people in total leave one country and arrive in another.

We’ll do all of this in part 2 of these sessions.

Extension ideas for this session:

  1. Have a play around with inputting different parameter values and see what happens to the flow estimates - for example, what happens if you change the frictional effect of distance by increasing and decreasing the negative value of the \(\beta\) parameter? Try some values of -2,-3,-4. What do positive values do to the flows?

  2. What happens to the flow estimates if you adjust the other parameters in the model or remove \(k\)?

  3. Try running the model on the whole London system - although you may run into problems if you don’t remove the intra-borough flows. This is because for intra-borough flows the distance value is 0, and you can’t take the log of 0. To fix this, you can either change all of the zero distance values to something very small (like 1 or 0.5) or, alternatively, remove all rows in the data where origin = destination, using a line of code like the snippet shown after this list.

  4. What might you expect to happen to the flows of people between, say, Camden and Brent if all of a sudden loads of really well paid jobs appeared in Brent and the average salary doubled? How might this impact the other Boroughs in the system?
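For extension idea 3, removing the intra-borough flows from the full dataset is a one-liner along these lines (the new object name is just a suggestion):

#drop the intra-borough (origin = destination) flows from the full London dataset
cdata_noIntra <- cdata[cdata$OrigCode != cdata$DestCode, ]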