I will assume that you already have R and RStudio installed on your computer and that you have a little bit of knowledge of both. If you don't, then download R and then RStudio from https://www.r-project.org/ and https://www.rstudio.com/
Open up RStudio and set up a new project working directory. Open a new R script and save it into your new working directory.
Check the file is in your working directory and R can see it…
getwd()
list.files(getwd())
Before we go any further, let’s first get a few packages that will come in handy for wrangling our data, carrying out some analysis and producing some visualisations.
If you don’t have the packages already, install them using this script:
install.packages(c("tidyverse", "sp", "MASS", "reshape2", "geojsonio", "rgdal", "downloader", "maptools", "dplyr", "broom", "stplanr", "ggplot2", "leaflet", "sf", "tmap"))
Now ‘library’ them into your working environment:
library(tidyverse)
library(sp)
library(MASS)
library(reshape2)
library(geojsonio)
library(rgdal)
library(downloader)
library(maptools)
library(broom)
library(stplanr)
library(leaflet)
library(sf)
library(tmap)
Check that all of these packages have been libraried successfully and you don't get any error messages - if they haven't, then we may run into some problems later on. Possible problems that I have seen already are:
1. rgdal can cause some problems on Macs - if this happens, try updating your versions of R and RStudio and re-installing the packages. If you are still having problems, try the solutions suggested on Stack Overflow, here:
2. If you are using a computer that uses Chinese characters in the file system, you might need to download the zipfile of the package and install it into your package directory manually - see the sketch below.
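For reference, a manual install from a downloaded archive looks something like this (the file path is just a placeholder - point it at wherever you saved the file, and use type = "source" if you have downloaded a .tar.gz on a Mac or Linux machine):
#install a package from a locally downloaded zip rather than from CRAN - the path below is just a placeholder
install.packages("C:/Downloads/rgdal.zip", repos = NULL, type = "win.binary")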
As the name suggests, to run a spatial interaction model, you are going to need some spatial data and some data on interactions (flows). Let’s start with some spatial data:
#Fetch a GeoJSON of some district-level boundaries from the ONS Geoportal and read it straight into a SpatialPolygonsDataFrame
EW <- geojson_read("http://geoportal.statistics.gov.uk/datasets/8edafbe3276d4b56aec60991cbddda50_2.geojson", what = "sp")
Check your boundaries have downloaded OK…
plot(EW)
#and have a quick look at the top of the data file
head(EW@data)
## objectid lad15cd lad15nm lad15nmw st_areashape st_lengthshape
## 1 1 E06000001 Hartlepool 98441686 65270.30
## 2 2 E06000002 Middlesbrough 54553581 41055.85
## 3 3 E06000003 Redcar and Cleveland 253890938 101208.78
## 4 4 E06000004 Stockton-on-Tees 209730825 108085.16
## 5 5 E06000005 Darlington 197475682 107206.32
## 6 6 E06000006 Halton 90321511 60716.86
OK great, now let's just chop out London for the time being
#pull out London using grep and the regex anchor for 'start of the string' (^) to look for the bit of the district code that relates to London (E09) in the 'lad15cd' column in the data slot of our spatial polygons dataframe
London <- EW[grep("^E09",EW@data$lad15cd),]
#plot it
plot(London)
#and have a look under the bonnet
summary(London)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x -0.5103559 0.3340441
## y 51.2867587 51.6918770
## Is projected: FALSE
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Data attributes:
## objectid lad15cd lad15nm lad15nmw
## Min. :294 Length:33 Length:33 Length:33
## 1st Qu.:302 Class :character Class :character Class :character
## Median :310 Mode :character Mode :character Mode :character
## Mean :310
## 3rd Qu.:318
## Max. :326
## st_areashape st_lengthshape
## Min. : 3149379 Min. : 9602
## 1st Qu.: 27249341 1st Qu.:29025
## Median : 38578079 Median :37404
## Mean : 48324152 Mean :38991
## 3rd Qu.: 56585441 3rd Qu.:45769
## Max. :150134859 Max. :76164
Now we have a nice clean set of London Boundaries, let’s extract some spatial data. Of course the important spatial data for spatial interaction models relates to the cost of interaction between places and this is very frequently represented through distance…
If you look at the summary above, you’ll notice that the boundaries we have are not in British National Grid - the bit that says proj4string tells me that we are in WGS84 or using latitude and longitude coordinates. We need to change this to British National Grid so our distances are in metres and not decimal degrees, then do everything we need to do to generate a distance matrix.
#first transform to British National Grid using spTransform - this will be important for calculating distances in metres
BNG = "+init=epsg:27700"
LondonBNG <- spTransform(London, BNG)
#now, order by borough code - *this step will be important later on*
LondonBNG <- LondonBNG[order(LondonBNG$lad15cd),]
#now use spDists to generate a big distance matrix of all distances between boroughs in London
dist <- spDists(LondonBNG)
#melt this matrix into a list of origin/destination pairs using melt. Melt is in the reshape2 package. Reshape2, dplyr and ggplot, together, are some of the best packages in R, so if you are not familiar with them, get googling and your life will be much better!
distPair <- melt(dist)
The data we are going to use to test our spatial interaction models with is commuting data from the 2001 England and Wales Census. In the Census, there is a question which asks your home address and another which asks the location of your usual place of work, as well as the method of transport that you mainly use to conduct your commute. From this, estimates of commuter flows by transportation type for the whole country can be generated.
In this exercise, to save time, I have already downloaded some sample data which records the place (Borough) of residence and place (Borough) of work for all people living in London at the time of the 2001 Census. Borough level is quite coarse, but it will suffice for demonstration. If you would like to download your own commuting or migration flow data, then you should visit the Census Support Flow Data Service called wicid - https://wicid.ukdataservice.ac.uk/ - here you can download flows for a huge range of geographies from the 1981, 1991, 2001 and 2011 Censuses.
As well as flow data, I have also collated some additional data on income, the number of jobs in each borough in 2001 and the total population - we will use these as destination attractiveness / mass term and origin emissiveness / mass term proxies in the models which follow.
#read in your London Commuting Data
cdata <- read.csv("https://www.dropbox.com/s/7c1fi1txbvhdqby/LondonCommuting2001.csv?raw=1")
#read in a lookup table for translating between old borough codes and new borough codes
CodeLookup <- read.csv("https://www.dropbox.com/s/h8mpvnepdkwa1ac/CodeLookup.csv?raw=1")
#read in some population and income data
popincome <- read.csv("https://www.dropbox.com/s/84z22a4wo3x2p86/popincome.csv?raw=1")
#now merge these supplementary data into your flow dataframe
cdata$OrigCodeNew <- CodeLookup$NewCode[match(cdata$OrigCode, CodeLookup$OldCode)]
cdata$DestCodeNew <- CodeLookup$NewCode[match(cdata$DestCode, CodeLookup$OldCode)]
cdata$vi1_origpop <- popincome$pop[match(cdata$OrigCodeNew, popincome$code)]
cdata$vi2_origsal <- popincome$med_income[match(cdata$OrigCodeNew, popincome$code)]
cdata$wj1_destpop <- popincome$pop[match(cdata$DestCodeNew, popincome$code)]
cdata$wj2_destsal <- popincome$med_income[match(cdata$DestCodeNew, popincome$code)]
#The data needs to be ordered by borough code - if it's not, we will run into problems when we try to merge our distance data back in later. To make sure, we can arrange by origin and then destination using dplyr's 'arrange' function
cdata <- arrange(cdata, OrigCodeNew, DestCodeNew)
Now to finish, we need to add in our distance data that we generated earlier and create a new column of total flows which excludes flows that occur within boroughs (we could keep the within-borough (intra-borough) flows in, but they can cause problems so for now we will just exclude them).
#First create a new total column which sets the intra-borough flows to 0, and an offset column which instead sets them to a very, very small number (for reasons you will see later...)
cdata$TotalNoIntra <- ifelse(cdata$OrigCode == cdata$DestCode,0,cdata$Total)
cdata$offset <- ifelse(cdata$OrigCode == cdata$DestCode,0.0000000001,1)
# now add the distance column into the dataframe
cdata$dist <- distPair$value
Let’s have a quick look at what your spangly new data looks like:
head(cdata)
## Orig OrigCode Dest DestCode Total WorksFromHome
## 1 City of London 00AA City of London 00AA 2059 432
## 2 City of London 00AA Barking and Dagenham 00AB 6 0
## 3 City of London 00AA Barnet 00AC 14 0
## 4 City of London 00AA Bexley 00AD 0 0
## 5 City of London 00AA Brent 00AE 16 0
## 6 City of London 00AA Bromley 00AF 0 0
## Underground Train Bus Taxi CarDrive CarPass Motobike Bicycle Walk Other
## 1 120 53 50 31 39 0 3 18 1272 41
## 2 3 3 0 0 0 0 0 0 0 0
## 3 11 0 0 0 0 0 0 0 3 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 10 0 3 0 0 0 0 0 3 0
## 6 0 0 0 0 0 0 0 0 0 0
## OrigCodeNew DestCodeNew vi1_origpop vi2_origsal wj1_destpop wj2_destsal
## 1 E09000001 E09000001 12000 38300 12000 38300
## 2 E09000001 E09000002 12000 38300 56000 16200
## 3 E09000001 E09000003 12000 38300 159000 18700
## 4 E09000001 E09000004 12000 38300 112000 18300
## 5 E09000001 E09000005 12000 38300 127000 16500
## 6 E09000001 E09000006 12000 38300 164000 19100
## TotalNoIntra offset dist
## 1 0 1e-10 0.00
## 2 6 1e+00 15995.23
## 3 14 1e+00 13935.67
## 4 0 1e+00 17359.91
## 5 16 1e+00 13119.80
## 6 0 1e+00 18740.82
Very nice, but to make this demonstration even easier, let’s just select a small subset of these flows (we can come back to the whole dataset later on…)
#We'll just use the first 7 boroughs by code, so first, create a vector of these 7 to match with our data
toMatch<-c("00AA", "00AB", "00AC", "00AD", "00AE", "00AF", "00AG")
#subset the data by the 7 sample boroughs
#first the origins
cdatasub <- cdata[grep(paste(toMatch,collapse = "|"), cdata$OrigCode),]
#then the destinations
cdatasub <- cdatasub[grep(paste(toMatch,collapse = "|"), cdatasub$DestCode),]
#now chop out the intra-borough flows
cdatasub <- cdatasub[cdatasub$OrigCode!=cdatasub$DestCode,]
#we should now be left with just the 7*7 - 7 (42) origin-destination pairs we are interested in - trim to these just to be safe
cdatasub <- cdatasub[1:42,]
#now re-order so that OrigCodeNew, DestCodeNew and Total are the first three columns. *Note that you have to be explicit about the select function in the dplyr package as MASS also has a 'select' function and R will try and use this by default. We can be explicit by using the syntax package::function
cdatasub <- dplyr::select(cdatasub, OrigCodeNew, DestCodeNew, Total, everything())
And this is what those flows look like on a map - quick and dirty style…
#convert the London zones to an SF object
LondonBNGSF <- st_as_sf(LondonBNG) %>% subset(select = -objectid)
#use the od2line function from Robin Lovelace's excellent stplanr package
travel_network <- od2line(flow = cdatasub, zones = LondonBNGSF)
#and set the line widths to some sensible value according to the flow
w <- cdatasub$Total / max(cdatasub$Total) * 10
#now plot it...
plot(travel_network, lwd = w)
tmap_mode("view")
qtm(travel_network, lines.lwd = "Total", scale = 10)
And as a matrix…
#now we can create pivot table to turn paired list into matrix (and compute the margins as well)
cdatasubmat <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "Total", margins=c("Orig", "Dest"))
cdatasubmat
## Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham 0 194 96 178 66 1500
## 2 Barnet 96 0 34 5467 76 12080
## 3 Bexley 362 132 0 144 4998 2470
## 4 Brent 40 6124 28 0 66 8105
## 5 Bromley 134 162 3199 201 0 3780
## 6 Camden 36 1496 32 1350 60 0
## 7 City of London 6 14 0 16 0 335
## 8 (all) 674 8122 3389 7356 5266 28270
## City of London (all)
## 1 3641 5675
## 2 7709 25462
## 3 6580 14686
## 4 4145 18508
## 5 9855 17331
## 6 8795 11769
## 7 0 371
## 8 40725 93802
OK, we’ve set everything up, now it’s…
In explaining how to run and calibrate spatial interaction models in R, I will adopt the notation used by Taylor Oshan in his excellent primer for running spatial interaction models in Python. The paper is well worth a read and can be found here: http://openjournals.wu.ac.at/region/paper_175/175.html
Below is the classic multiplicative gravity model, directly analogous to Newton's law of gravitation:
\[ T_{ij} = k \frac{V_i W_j}{d_{ij}^2} \qquad (1) \]
This gravity model can be written in a more general form, more familiar from Wilson's 1971 paper - http://journals.sagepub.com/doi/abs/10.1068/a030001 - in which the exponents become parameters to be estimated:
\[ T_{ij} = k V_i^{\mu} W_j^{\alpha} d_{ij}^{\beta} \qquad (2) \]
This model just says that the flows between an origin and destination are proportional to the product of the mass of the origin and destination and inversely proportional to the distance between them.
As origin and destination masses increase, flows increase, but as distance increases, flows decrease, and vice versa.
where \(T_{ij}\) is the transition or flow, \(T\), between origin \(i\) (always the rows in a matrix) and destination \(j\) (always the columns in a matrix). If you are not overly familiar with matrix notation, the \(i\) and \(j\) are just generic indexes to allow us to refer to any cell in the matrix more generally.
\(V\) is a vector (a 1 dimensional matrix - or, if you like, a single line of numbers) of origin attributes which relate to the emissiveness of all origins in the dataset, \(i\) - in our sample dataset, we have a vector of origin populations (which I have called vi1_origpop) and a vector of origin average salaries (which I have called vi2_origsal) in 2001
\(W\) is a vector of destination attributes relating to the attractiveness of all destinations in the dataset, \(j\) - in our sample dataset, we have a vector of destination populations (which I have called wj1_destpop) and a vector of destination average salaries (which I have called wj2_destsal) in 2001
\(d\) is a matrix of costs relating to the flows between \(i\) and \(j\) - in our case the cost is distance and it is called ‘dist’ in our dataset.
\(k\), \(\mu\), \(\alpha\) and \(\beta\) are all model parameters to be estimated
\(k\) is a constant of proportionality and leads to this particular model being more accurately described as a 'total constrained' model, as all flows estimated by the model will sum to the total of any observed flow data used to calibrate the parameters, where:
\[ k = \frac{T}{\sum_i \sum_j V_i^{\mu} W_j^{\alpha} d_{ij}^{\beta}} \qquad (3) \]
and \(T\) is the sum of our matrix of observed flows, or:
\[ T = \sum_i \sum_j T_{ij} \qquad (4) \]
In English, this is just the sum of all observed flows divided by the sum of all of the other elements in the model.
Now, it's perfectly possible to produce some flow estimates by plugging some arbitrary or expected values into our parameters. The parameters relate to the scaling effect / importance of the variables they are associated with. Most simply, where the effects of origin and destination attributes on flows scale in a linear fashion (i.e. for a 1 unit increase in, say, population at the origin, we might expect a 1 unit increase in flows of people from that origin, or for a halving in average salary at the destination, we might expect a halving of commuters), \(\mu\) = 1 and \(\alpha\) = 1. In Newton's original gravity equation, \(\beta\) = -2, so the influence of distance on flows follows an inverse power law: at a distance of 1 unit the distance term is 1^-2 (1), at 2 units it is 2^-2 (0.25, so a quarter of the flows), at 3 units it is 3^-2 (about 0.111), and so on.
Let’s see if these parameters are a fair first guess (we’ll use the whole dataset in order to get a less messy picture)…
#First plot the commuter flows against distance and then fit a model line with a ^-2 parameter
plot1 <- ggplot(cdata) +
geom_point(aes(x = dist, y = Total)) +
stat_function(aes(x = dist), fun=function(x)x^-2, geom="line", colour="red")
plot1
#now, what about the origin and destination data...
plot2 <- ggplot(cdata) +
geom_point(aes(x = vi1_origpop, y = Total)) +
stat_function(aes(x = vi1_origpop), fun=function(x)x^1, geom="line", colour="red")
plot2
plot3 <- ggplot(cdata) +
geom_point(aes(x = wj2_destsal, y = Total)) +
stat_function(aes(x = wj2_destsal), fun=function(x)x^1, geom="line", colour="red")
plot3
OK, so it looks like we’re not far off (well, destination salary doesn’t look too promising as a predictor, but we’ll see how we get on…), so let’s see what flow estimates with these starting parameters look like.
#set up some variables to hold our parameter values in:
mu <- 1
alpha <- 1
beta <- -2
k <- 1
T2 <- sum(cdatasub$Total)
Now let’s create some flow estimates using Equation 2 above… Begin by applying the parameters to the variables:
vi1_mu <- cdatasub$vi1_origpop^mu
wj2_alpha <- cdatasub$wj2_destsal^alpha
dist_beta <- cdatasub$dist^beta
T1 <- vi1_mu*wj2_alpha*dist_beta
k <- T2/sum(T1)
Then, just as in Equation 2 above, multiply everything together to get your flow estimates:
#run the model and store all of the new flow estimates in a new column in the dataframe
cdatasub$unconstrainedEst1 <- round(k*vi1_mu*wj2_alpha*dist_beta,0)
#check that the sum of these estimates makes sense
sum(cdatasub$unconstrainedEst1)
## [1] 93802
#turn it into a little matrix and have a look at your handy work
cdatasubmat1 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "unconstrainedEst1", margins=c("Orig", "Dest"))
cdatasubmat1
## Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham 0 265 1902 190 428 435
## 2 Barnet 651 0 518 7344 453 6843
## 3 Bexley 3367 373 0 317 2475 675
## 4 Brent 422 6648 399 0 419 6627
## 5 Bromley 1063 458 3472 467 0 881
## 6 Camden 641 4105 562 4392 523 0
## 7 City of London 121 184 116 183 104 1148
## 8 (all) 6265 12033 6969 12893 4402 16609
## City of London (all)
## 1 1335 4555
## 2 4995 20804
## 3 2267 9474
## 4 4501 19016
## 5 2849 9190
## 6 18684 28907
## 7 0 1856
## 8 34631 93802
How do the flow estimates compare with the original flows?
cdatasubmat
## Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham 0 194 96 178 66 1500
## 2 Barnet 96 0 34 5467 76 12080
## 3 Bexley 362 132 0 144 4998 2470
## 4 Brent 40 6124 28 0 66 8105
## 5 Bromley 134 162 3199 201 0 3780
## 6 Camden 36 1496 32 1350 60 0
## 7 City of London 6 14 0 16 0 335
## 8 (all) 674 8122 3389 7356 5266 28270
## City of London (all)
## 1 3641 5675
## 2 7709 25462
## 3 6580 14686
## 4 4145 18508
## 5 9855 17331
## 6 8795 11769
## 7 0 371
## 8 40725 93802
So, looking at the two little matrices above you can see that in some cases the flow estimates aren't too bad (Barking and Dagenham to Barnet, for example), but in others they are pretty rubbish (Camden to the City of London, anyone?). Whilst it's OK to eyeball small flow matrices like this, with much larger matrices we need another solution…
Yes, that’s what it’s called - I know, it doesn’t sound correct, but goodness-of-fit is the correct term for checking how well your model estimates match up with your observed flows.
So how do we do it?
Well… there are a number of ways, but perhaps the two most common are to look at the coefficient of determination (\(r^2\)) or the Root Mean Squared Error (RMSE). You've probably come across \(r^2\) before if you have fitted a linear regression model, but you may not have come across RMSE. There are other methods and they all do more or less the same thing, which is essentially to compare the modelled estimates with the real data. \(r^2\) is popular as it is quite intuitive and can be compared across models. RMSE is less intuitive, but some argue it is better for comparing changes to the same model. Here we'll do both…
\(r^2\) is the square of the correlation coefficient, \(r\)
For our sample data, we can calculate this very easily using a little function
CalcRSquared <- function(observed,estimated){
r <- cor(observed,estimated)
R2 <- r^2
R2
}
CalcRSquared(cdatasub$Total,cdatasub$unconstrainedEst1)
## [1] 0.5032696
Using this function we get a value of around 0.50, or 50%. This tells us that our model accounts for about 50% of the variation of flows in the system. Not bad, but not brilliant either.
We can use a similar simple function to calculate the RMSE for our data
CalcRMSE <- function(observed,estimated){
res <- (observed - estimated)^2
RMSE <- round(sqrt(mean(res)),3)
RMSE
}
CalcRMSE(cdatasub$Total,cdatasub$unconstrainedEst1)
## [1] 2503.352
The figure that is produced by the RMSE calculation is far less intuitive than the \(r^2\) value and this is mainly because it very much depends on things like the units the data are in and the volume of data. It can’t be used to compare different models run using different data sets. However, it is good for assessing whether changes to the model result in improvements. The closer to 0 the RMSE value, the better the model.
So how can we start to improve our fit…?
Now, the model we have run above is probably the most simple spatial interaction model we could have run and the results aren’t terrible, but they’re not great either.
One way that we can improve the fit of the model is by calibrating the parameters on the flow data that we have.
The traditional way that this has been done computationally is by using the goodness-of-fit statistics. If you have the requisite programming skills, you can write a computer algorithm that iteratively adjusts each parameter, runs the model, checks the goodness-of-fit and then starts all over again until the goodness-of-fit statistic is maximised.
This is partly why spatial interaction modelling was the preserve of specialists for so long as acquiring the requisite skills to write such computer programmes can be challenging!
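To give you a flavour of the kind of iterative search involved, here is a minimal sketch (not part of the main workflow, and the function name is just for illustration) that uses R's built-in optim() optimiser to search for the \(\mu\), \(\alpha\) and \(\beta\) values that minimise the RMSE of the total-constrained model we ran above:
#a sketch of a brute-force calibration: optim() repeatedly tweaks the parameters,
#re-runs the model and keeps going until the RMSE stops improving
calibrate <- function(params, data){
  mu <- params[1]
  alpha <- params[2]
  beta <- params[3]
  #estimate the flows with the current parameters, balancing with k as before
  T1 <- (data$vi1_origpop^mu) * (data$wj2_destsal^alpha) * (data$dist^beta)
  k <- sum(data$Total) / sum(T1)
  #return the goodness-of-fit measure to be minimised
  CalcRMSE(data$Total, k * T1)
}
#start the search from our initial guesses of mu = 1, alpha = 1, beta = -2
optim(c(1, 1, -2), calibrate, data = cdatasub)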
However, since the early days of spatial interaction modelling, a number of useful developments have occurred…
The mathematically minded among you may have noticed that if you take the logarithms of both sides of Equation 2, you end up with the following equation:
\[ \ln T_{ij} = k + \mu \ln V_i + \alpha \ln W_j + \beta \ln d_{ij} \qquad (5) \]
(note that the \(k\) here is, strictly, the log of the \(k\) in Equation 2 - it becomes the intercept term in the regression we are about to run)
Those of you who have played around with regression models in the past will realise that this is exactly that - a regression model.
And if you have played around with regression models you will be aware that there are various pieces of software available to run regressions (such as R) and calibrate the parameters for us, so we don’t have to be expert programmers to do this - yay!
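Just to make that concrete, here is a quick sketch of what that log-log regression might look like using lm() - lognormsim is just an arbitrary name, and note that we have to drop the zero flows because log(0) is undefined (a hint at some of the problems with this specification that are discussed below):
#a sketch of the log-normal (log-log OLS) version of the model - zero flows are dropped
lognormsim <- lm(log(Total) ~ log(vi1_origpop) + log(wj2_destsal) + log(dist), data = subset(cdatasub, Total > 0))
summary(lognormsim)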
Now, there are a couple of papers that are worth reading at this point. Perhaps the best is by Flowerdew and Aitkin (1982), titled “A METHOD OF FITTING THE GRAVITY MODEL BASED ON THE POISSON DISTRIBUTION” - the paper can be found here: http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9787.1982.tb00744.x/abstract
One of the key points that Flowerdew and Aitkin make is that the model in Equation 5 (known as a log-normal model) has various problems associated with it which mean that the estimates produced might not be reliable. If you’d like to know more about these, read the paper (and also Wilson’s 1971 paper), but at this point it is worth just knowing that the way around many of these issues is to re-specify the model, not as a log-normal regression, but as a Poisson or negative binomial regression model.
The main theory (for non-experts like me anyway) behind the Poisson regression model is that the sorts of flows that spatial interaction models deal with (such as migration or commuting flows) relate to non-negative integer counts (you can’t have negative people moving between places and you can’t - normally, if they are alive - have fractions of people moving either).
As such, the continuous (normal) probability distributions which underpin standard regression models don't hold. However, discrete probability distributions such as the Poisson distribution and the negative binomial distribution (of which the Poisson distribution is a special case - wikipedia it) do hold, and so we can use these to model our flows.
At this point, it's probably worth you looking at what a Poisson distribution looks like compared to a normal distribution, if you are not familiar.
Here’s a normal distribution:
#histogram with a count of 3000, a mean of 75 and a standard deviation of 5
qplot(rnorm(0:3000,mean=75,sd=5))
Now here’s a Poisson distribution with the same mean:
qplot(rpois(0:3000,lambda=75)) + stat_bin(binwidth = 1)
Looks kind of similar, doesn't it! The thing with the Poisson distribution is, when the mean (\(\lambda\) - lambda) changes, so does the distribution. As the mean gets smaller (and this is often the case with flow data where small flows are very likely - have a look at the 'Total' column in your cdata dataframe, lots of small numbers aren't there?) the distribution starts to look a lot more like a skewed or log-normal distribution. The key thing is it's not - it's a Poisson distribution. Here's a similar frequency distribution with a small mean:
#what about a lambda of 0.5?
qplot(rpois(0:3000,0.5)) + stat_bin(binwidth = 1)
As far as we're concerned, what this means is that if we are interested in all flows between all origins and destinations in our system, these flows will have a mean value of \(\lambda_{ij}\) and this will dictate the distribution. Here's what the distribution of our flows looks like:
qplot(cdata$Total) + geom_histogram()
So, what does all of this mean for our spatial interaction model?
Well the main thing it means is that Equation 5, for most sorts of spatial interaction models where we are modelling flows of people or whole things, is not correct.
By logging both sides of the equation to get Equation 5, we are trying to get to a situation where our \(T_{ij}\) flows can be modelled from the values of our other variables, such as distance, using a straight line a bit like this:
qplot(log(dist), log(Total), data=cdata) + geom_smooth(method = lm)
If you compare this graph with the graph above (the first scatter plot we drew), it’s exactly the same data, but clearly by logging both the total and distance, we can get a bit closer to being able to fit a model estimate using a straight line.
What the Poisson distribution means is that the \(y\) variable in our model is not logged as in the graph above, but it can still be modelled using something like the blue line - I hope that sort of makes sense. If not, don’t worry, just take it from me that this is good news.
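If that is hard to picture, here is a little sketch (not part of the main workflow) of the same idea using ggplot's geom_smooth: the \(y\) variable (Total) is left un-logged and a Poisson model with a log link fits the curve, with the zero-distance intra-borough flows dropped so that log(dist) is defined:
#a sketch: Total left un-logged, with a Poisson GLM (log link) curve fitted through it
ggplot(subset(cdata, dist > 0), aes(x = log(dist), y = Total)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = poisson(link = "log")))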
### The Poisson Regression Spatial Interaction Model
So, we can now re-specify Equation 5 as a Poisson regression model. Instead of our dependent variable being \(\ln T_{ij}\), it is now the mean of our Poisson distribution, \(\lambda_{ij}\), and the model becomes:
\[ \lambda_{ij} = \exp(k + \mu \ln V_i + \alpha \ln W_j + \beta \ln d_{ij}) \qquad (6) \]
What this model says is that \(\lambda_{ij}\) (our dependent variable - the estimate of \(T_{ij}\)) is logarithmically linked to (or modelled by) a linear combination of the logged independent variables in the model.
Now we have Equation 6 at our disposal, we can use a Poisson regression model to produce estimates of \(k\), \(\mu\), \(\alpha\) and \(\beta\) - or put another way, we can use the regression model to calibrate our parameters.
So, let’s have a go at doing it!!
It's very straightforward to run a Poisson regression model in R using the glm (Generalised Linear Models) function. In practical terms, running a GLM model is no different to running a standard regression model using lm. If you want to find out more about glm, try the R help system ?glm or google it to find the function details. If you delve far enough into the depths of what glm does, you will find that the parameters are calibrated through an 'iteratively re-weighted least squares' algorithm. This algorithm does exactly the sort of job I described earlier: it fits lots of lines to the data, continually adjusting the parameters and checking the error between the observed and expected values, until the goodness-of-fit can't be improved any further.
These sorts of algorithms have been around for years and are very well established so it makes sense to make use of them rather than trying to re-invent the wheel ourselves. So here we go…
#run the unconstrained model
uncosim <- glm(Total ~ log(vi1_origpop)+log(wj2_destsal)+log(dist), na.action = na.exclude, family = poisson(link = "log"), data = cdatasub)
It's as simple as that - it runs in a matter of milliseconds. You should be able to see how the glm R code corresponds to Equation 6.
Total = \(T_{ij}\) = \(\lambda_{ij}\)
~ means ‘is modelled by’
log(vi1_origpop) = \(\ln V_i\)
log(wj2_destsal) = \(\ln W_j\)
log(dist) = \(\ln d_{ij}\)
family = poisson(link = "log") means that we are using a Poisson regression model (the link is always log with a Poisson model) where the left-hand side of the model equation is logarithmically linked to the variables on the right-hand side.
So what comes out of the other end?
Well, we can use the summary() function to have a look at the model parameters:
#run the unconstrained model
summary(uncosim)
##
## Call:
## glm(formula = Total ~ log(vi1_origpop) + log(wj2_destsal) + log(dist),
## family = poisson(link = "log"), data = cdatasub, na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -55.37 -31.27 -16.07 12.66 62.84
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.808424 0.176536 -89.55 <2e-16 ***
## log(vi1_origpop) 1.755752 0.012483 140.65 <2e-16 ***
## log(wj2_destsal) 1.647197 0.008773 187.75 <2e-16 ***
## log(dist) -1.407922 0.006707 -209.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 176846 on 41 degrees of freedom
## Residual deviance: 46086 on 38 degrees of freedom
## AIC: 46408
##
## Number of Fisher Scoring iterations: 6
We can see from the summary that the Poisson regression has calibrated all 4 parameters for us and these appear under the ‘estimate’ column:
\(k\) (intercept) = -15.808424
\(\mu\) = 1.755752
\(\alpha\) = 1.647197
and \(\beta\) = -1.407922
We can also see from the other outputs that all variables are highly significant (***), with the z-scores revealing that distance has the most influence on the model (as we might have expected from the scatter plots we produced earlier which showed that distance had by far the strongest correlation with commuting flows).
These parameters are not too far away from our initial guesses of \(\mu\) = 1, \(\alpha\) = 1 and \(\beta\) = -2, but how do the estimates compare?
One way to calculate the estimates is to plug all of the parameters back into Equation 6 like this:
#first assign the parameter values from the model to the appropriate variables
k <- uncosim$coefficients[1]
mu <- uncosim$coefficients[2]
alpha <- uncosim$coefficients[3]
beta <- -uncosim$coefficients[4]
#now plug everything back into the Equation 6 model... (be careful with the signs here - the log(dist) coefficient is negative, and because we flipped its sign when extracting it above we now subtract beta*log(dist), which amounts to the same thing)
cdatasub$unconstrainedEst2 <- exp(k+(mu*log(cdatasub$vi1_origpop))+(alpha*log(cdatasub$wj2_destsal))-(beta*log(cdatasub$dist)))
#which is exactly the same as this...
cdatasub$unconstrainedEst2 <- (exp(k)*exp(mu*log(cdatasub$vi1_origpop))*exp(alpha*log(cdatasub$wj2_destsal))*exp(-beta*log(cdatasub$dist)))
#and of course, being R, there is an even easier way of doing this...
cdatasub$fitted <- fitted(uncosim)
#round the new flow estimates stored in the dataframe
cdatasub$unconstrainedEst2 <- round(cdatasub$unconstrainedEst2,0)
sum(cdatasub$unconstrainedEst2)
## [1] 93803
#turn it into a little matrix and have a look at your handy work
cdatasubmat2 <- dcast(cdatasub, Orig ~ Dest, sum, value.var = "unconstrainedEst2", margins=c("Orig", "Dest"))
cdatasubmat2
## Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1 Barking and Dagenham 0 206 808 145 294 308
## 2 Barnet 1015 0 968 5683 919 6422
## 3 Bexley 2231 542 0 430 2099 869
## 4 Brent 591 4707 636 0 686 4957
## 5 Bromley 1480 937 3820 844 0 1567
## 6 Camden 623 2635 637 2455 630 0
## 7 City of London 20 32 22 28 22 121
## 8 (all) 5960 9059 6891 9585 4650 14244
## City of London (all)
## 1 1264 3025
## 2 9587 24594
## 3 3803 9974
## 4 7034 18611
## 5 6670 15318
## 6 15056 22036
## 7 0 245
## 8 43414 93803
And the $1,000,000 question - has calibrating the parameters improved the model…?
CalcRSquared(cdatasub$Total,cdatasub$unconstrainedEst2)
## [1] 0.6725599
CalcRMSE(cdatasub$Total,cdatasub$unconstrainedEst2)
## [1] 1892.615
The \(r^2\) has improved from around 0.50 to 0.67 and the RMSE has reduced from 2503.35 to 1892.62, so by calibrating our parameters using the Poisson regression model, we have markedly improved our model fit.
But we can do even better. We have just been playing with the unconstrained model. By adding constraints into the model we can both improve our fit further AND start to do cool things like estimate transport trip distributions from known information about people leaving an area, or estimate the amount of money a shop is going to make from the money that people in the surrounding area have available to spend, or guess the number of migrants travelling between specific countries where we only know how many people in total leave one country and arrive in another.
We’ll do all of this in part 2 of these sessions.
Have a play around with inputting different parameter values and see what happens to the flow estimates - for example, what happens if you change the frictional effect of distance by increasing and decreasing the negative value of the \(\beta\) parameter? Try some values of -2,-3,-4. What do positive values do to the flows?
What happens to the flow estimates if you adjust the other parameters in the model or remove \(k\)?
Try running the model on the whole London system - although you may run into problems if you don’t remove the intra-borough flows. This is because for intra-flows, the distance value is 0 and you can’t take the log of 0. To fix this, you can either change all of the zero distance values to something very small (like 1 or 0.5) or, alternatively, remove all rows in the data where origin = destination with some code like this:
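For example, something along these lines (just a sketch - the new object name is arbitrary):
#keep only the inter-borough flows, i.e. drop the rows where origin and destination are the same
cdata_noIntra <- cdata[cdata$OrigCodeNew != cdata$DestCodeNew, ]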
What might you expect to happen to the flows of people between, say, Camden and Brent if all of a sudden loads of really well paid jobs appeared in Brent and the average salary doubled? How might this impact the other Boroughs in the system?
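If you want to explore that last question with the model rather than just in your head, here is a sketch of one way to do it (the object and column names used for the scenario are just for illustration): copy the data, double the destination salary for Brent, and ask the calibrated unconstrained model for new estimates using predict():
#a sketch: double the destination salary for Brent and re-estimate the flows
scenario <- cdatasub
scenario$wj2_destsal[scenario$Dest == "Brent"] <- scenario$wj2_destsal[scenario$Dest == "Brent"] * 2
scenario$scenarioEst <- predict(uncosim, newdata = scenario, type = "response")
#compare the modelled flows into Brent before and after the salary change
sum(fitted(uncosim)[cdatasub$Dest == "Brent"])
sum(scenario$scenarioEst[scenario$Dest == "Brent"])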