In this exercise, you will explore migration in England between 2000 and 2001. You will do this using two kinds of data:
Flow data represent movements of people, information, resources, etc between places. For some background to the kinds of flow data we will be working with, see:
http://census.ukdataservice.ac.uk/get-data/flow-data.aspx
You will write a short executive report (max 500 words) that will answer the following questions:
In addition to answering these two questions in your report, you should present as a minimum the following information:
Overall, your report should not have more than 10 maps or charts. All maps must be generated within R. You should make use of the knowledge you have developed in previous practical exercises. This may enable you explore the data in different ways or to produce better maps which follow standard cartographic conventions.
We will first download some interaction data from the CIDER (Centre for Interaction Data Estimation and Research) website.
Next, we will download 2001 district boundaries (as a shapefile) for England.
Our first task in R will be to import the flow data.
Open R. As usual, you must set your working directory.
# For example:
setwd("M:/ENVS257/Migration/")
If you open the csv file in a text editor such as Wordpad or Notepad, you'll see there are four header lines. We don't want to keep any of these. The csv file can be read in as follows.
# Read csv file, skipping the first four rows:
mig2001 <- read.csv("wicid_output.csv", header = FALSE, skip = 4)
Next, preview the file by looking at the first five rows.
head(mig2001)
## V1 V2 V3 V4 V5
## 1 00AA 00AA 154 7181 7181
## 2 00AA 00AB 0 7181 163929
## 3 00AA 00AC 9 7181 314559
## 4 00AA 00AD 3 7181 218310
## 5 00AA 00AE 9 7181 263463
## 6 00AA 00AF 15 7181 295526
The data labels are as follows:
We will rename these to something more meaningful:
colnames(mig2001) <- c("Origin", "Destination", "FlowOD", "PersonsO", "personsD")
Check that this has worked by again looking at the first five rows of the data.
head(mig2001)
## Origin Destination FlowOD PersonsO personsD
## 1 00AA 00AA 154 7181 7181
## 2 00AA 00AB 0 7181 163929
## 3 00AA 00AC 9 7181 314559
## 4 00AA 00AD 3 7181 218310
## 5 00AA 00AE 9 7181 263463
## 6 00AA 00AF 15 7181 295526
Next, we will summarise the variables.
summary(mig2001)
## Origin Destination FlowOD PersonsO
## 00AA : 409 00AA : 409 Min. : 0 Min. : 2136
## 00AB : 409 00AB : 409 1st Qu.: 0 1st Qu.: 85906
## 00AC : 409 00AC : 409 Median : 3 Median : 112970
## 00AD : 409 00AD : 409 Mean : 37 Mean : 143728
## 00AE : 409 00AE : 409 3rd Qu.: 9 3rd Qu.: 165693
## 00AF : 409 00AF : 409 Max. :127999 Max. :1681152
## (Other):164829 (Other):164829 NA's :2 NA's :2
## personsD
## Min. : 2136
## 1st Qu.: 85906
## Median : 112970
## Mean : 143728
## 3rd Qu.: 165693
## Max. :1681152
## NA's :2
The two NA's indicate rows at the end of the data which were textual information included in the downloaded file. You don't need to worry about these for now.
Note that the miniumum, 1st quartile and median FlowOD values are 0, 0, and 3 respectively. In 2001, Census data were modified using a small adjustment procedure which randomly converted all values of 0, 1, 2 or 3 to values of 0 or 3. So, there are no values of 1 or 2. This procedure was intended to help protect confidentiality of individuals. But, it means that smaller flows are of limited value…
If you want to save your workspace, so you can open it up later on, you can use the save.image function.
# This will save all objects currently loaded in the workspace
save.image(file = "myfile.RData")
After saving, you can re-load the objects contained in this saved file at any time (e.g. after closing R) using the load command. The .RData file is stored in the working directory defined earlier.
# Remember that you will need to set your working directory before trying to
# load - otherwise R will not know where to find you RData file
load("myfile.RData")
The flow data file includes values for the whole of Britain and for Northern Ireland, but we will generating maps which refer only to England.
We will also be working with migration data for areas. These data can be downloaded from CASWEB (http://casweb.mimas.ac.uk/; 2001 districts and unitary authorities Table KS024), but they have already been downloaded for you and are available through VITAL (See Practicals directory).
# Read csv file:
mig2001area <- read.csv("distmig2001.csv", header = TRUE)
Preview the file by looking at the first five rows.
head(mig2001area)
## Zone.Code Zone.Name ks0240001 ks0240002 ks0240003 ks0240004
## 1 00AA City of London 7185 1652 1065 308
## 2 00AB Barking and Dagenham 163944 16241 6971 789
## 3 00AC Barnet 314564 42902 18560 5576
## 4 00AD Bexley 218307 18376 7873 642
## 5 00AE Brent 263464 38356 14793 5633
## 6 00AF Bromley 295532 29449 12590 1711
## ks0240005 ks0240006 ks0240007
## 1 122 157 1022
## 2 1637 6844 7064
## 3 3598 15168 20311
## 4 1337 8524 8727
## 5 4732 13198 18317
## 6 2025 13123 13266
As the next step, we need to import the England districts shapefile downloaded earlier.
As in previous practicals, we will need to make use of functions which are not contained in the base R installation.
You need to load 'maptools' and 'rgeos'; these packages include functions for reading and manipulating GIS data.
# Choose the CRAN mirror where the packages are going to be downloaded from
options(repos = c(CRAN = "http://www.stats.bris.ac.uk/R/"))
# Download / install and then load sp
install.packages("sp", depend = TRUE, lib = getwd())
library("sp", lib.loc = getwd())
# Download / install and then load maptools
install.packages("maptools", depend = TRUE, lib = getwd())
library(maptools, lib.loc = getwd())
# Download / install and then load stringr
install.packages("stringr", depend = TRUE, lib = getwd())
library("stringr", lib.loc = getwd())
# Download / install and then load rgeos
install.packages("rgeos", depend = TRUE, lib = getwd())
library("rgeos", lib.loc = getwd())
Now we will read in the districts data into a type of object called a SpatialPolygonsDataFrame.
Districts <- readShapePoly("England_dt_2001_area.shp")
We can see what the SpatialPolygonsDataFrame looks like using the plot() function.
# A basic plot of the Districts SpatialPolygonsDataFrame
plot(Districts)
Now we have a map of districts in England we can consider mapping migration flows between districts.
First, we need to derive some information from the Districts shapefile. We do this using SpatialPointsDataFrame. To map flows between zones we also need the centroids of the districts, and this is achieved using the coordinates command.
DistrictsXY <- SpatialPointsDataFrame(coordinates(Districts), data = as(Districts,
"data.frame")[c("LABEL")])
DistrictsXY is then joined to the mig2001 data. We need to join these two files twice - once linking LABEL (in DistrictsXY) to Origin in Mig2001 and once to Destination in Mig2001. First we'll join to Destination and then the output from this will be joined to Origin.
DistMig <- merge(mig2001, DistrictsXY, by.x = "Destination", by.y = "LABEL")
head(DistMig)
## Destination Origin FlowOD PersonsO personsD coords.x1 coords.x2
## 1 00AA 33UF 0 98375 7181 532464 181220
## 2 00AA 00PM 0 90940 7181 532464 181220
## 3 00AA 32UC 0 130453 7181 532464 181220
## 4 00AA 31UG 0 47839 7181 532464 181220
## 5 00AA 20UB 0 53683 7181 532464 181220
## 6 00AA 00JA 0 156059 7181 532464 181220
colnames(DistMig) <- c("Destination", "Origin", "FlowOD", "PersonsO", "PersonsD",
"DestinationX", "DestinationY")
head(DistMig)
## Destination Origin FlowOD PersonsO PersonsD DestinationX DestinationY
## 1 00AA 33UF 0 98375 7181 532464 181220
## 2 00AA 00PM 0 90940 7181 532464 181220
## 3 00AA 32UC 0 130453 7181 532464 181220
## 4 00AA 31UG 0 47839 7181 532464 181220
## 5 00AA 20UB 0 53683 7181 532464 181220
## 6 00AA 00JA 0 156059 7181 532464 181220
DistMigF <- merge(DistMig, DistrictsXY, by.x = "Origin", by.y = "LABEL")
colnames(DistMigF) <- c("Origin", "Destination", "FlowOD", "PersonsO", "PersonsD",
"DestinationX", "DestinationY", "OriginX", "OriginY")
head(DistMigF)
## Origin Destination FlowOD PersonsO PersonsD DestinationX DestinationY
## 1 00AA 00MS 0 7181 217450 442363 113232
## 2 00AA 00FN 0 7181 279921 459075 304818
## 3 00AA 00JA 3 7181 156059 518388 302630
## 4 00AA 20UD 0 7181 85074 412505 548294
## 5 00AA 43UB 3 7181 121936 512049 163233
## 6 00AA 22UH 3 7181 120884 548555 203312
## OriginX OriginY
## 1 532464 181220
## 2 532464 181220
## 3 532464 181220
## 4 532464 181220
## 5 532464 181220
## 6 532464 181220
Now we will calculate in- and -out migration rates and map them. You have already downloaded a district shapefile, but we will need a cleaned version of this for the next stage of the analysis. The cleaned file is available through VITAL (Exercise 4 directory - download and unzip the file 'DistrictsUN.zip' into your working directory). The shapefile is imported as before.
DistrictsUN <- readShapePoly("DistrictsUN.shp")
DistrictsUN@data = data.frame(DistrictsUN@data, mig2001area[match(DistrictsUN@data$LABEL,
mig2001area$Zone.Code), ])
The data labels are as follows:
Now we'll map the number of people who are migrants (ks0240002) and add a legend to the map. At this stage, we need two new libraries - RColorBrewer and classInt.
# Download / install and then load RColorBrewer
install.packages("RColorBrewer", depend = TRUE, lib = getwd())
library("RColorBrewer", lib.loc = getwd())
# Download / install and then load RColorBrewer
install.packages("classInt", depend = TRUE, lib = getwd())
library("classInt", lib.loc = getwd())
We select class interval scheme to display the values, and add a legend.
fj5 <- classIntervals(DistrictsUN$ks0240002, n = 5, style = "fisher")
pal <- grey.colors(4, 0.95, 0.55, 2.2)
fj5Colours <- findColours(fj5, pal)
plot(DistrictsUN, col = fj5Colours, pch = 19)
legend("topleft", fill = attr(fj5Colours, "palette"), legend = names(attr(fj5Colours,
"table")), bty = "n")
Clearly, this map isn't of much use for exploring migration - we need to compute rates. As an example, in-migration rates are computed with:
DistrictsUN@data$inMig <- (DistrictsUN@data$ks0240003 + DistrictsUN@data$ks0240004)/DistrictsUN@data$ks0240001
Map the in-migration rates. Next, compute the out-migration rates (with ks0240007 as the numerator). You could also try mapping internal flows as a rate.
To map rates, it's useful to be able to round the values for inclusion in the legend. The first line of the code above can be modified to do this; we can also change the “topleft” element under the legend to more precisely position the legend (obviosuly, changing the x and y coordinates moves the legend); you will also add a scale bar and a north arrow:
fj5 <- classIntervals(round(DistrictsUN$inMig, digits = 2), n = 5, style = "fisher")
pal <- grey.colors(4, 0.95, 0.55, 2.2)
fj5Colours <- findColours(fj5, pal)
plot(DistrictsUN, col = fj5Colours, pch = 19)
legend(x = -25000, y = 386096, fill = attr(fj5Colours, "palette"), legend = names(attr(fj5Colours,
"table")), bty = "n")
# Add a scale bar
SpatialPolygonsRescale(layout.scale.bar(), offset = c(0, 1e+05), scale = 1e+05,
fill = c("white", "black"), plot.grid = F)
text(0, 125000, "0", cex = 0.6)
text(50000, 125000, "50", cex = 0.6)
text(1e+05, 125000, "100km", cex = 0.6)
# Add a North arrow
SpatialPolygonsRescale(layout.north.arrow(1), offset = c(80000, 410000), scale = 50000,
plot.grid = F)
Now we will map the flow data on top of the Districts. One way of doing this is to map the flows between districts as lines, with thicknesses indicating the size of the flows. Including all flows produces a very confusing map.
The internal flows (people who migrated within the same district) and small inter-district flows make it very difficult to see what is going on. So, we need to remove the internal flows and cut out smaller flows to give a clearer picture of migration patterns:
# Retain rows with X & Y coordinates > 0
DistMigF2 <- DistMigF[which(DistMigF$OriginX > 0 & DistMigF$DestinationX > 0 &
DistMigF$OriginY > 0 & DistMigF$DestinationY > 0), ]
# Compute the distances between the Origins and Destinations
DistMigF2$Dist <- sqrt((DistMigF2$OriginX - DistMigF2$DestinationX)^2 + (DistMigF2$OriginY -
DistMigF2$DestinationY)^2)
# Retain rows where Dist > 0
DistMigFz <- DistMigF2[which(DistMigF2$Dist > 0), ]
# Plot the districts and then map the flows
plot(Districts)
for (i in 1:length(DistMigFz[, 1])) {
lines(DistMigFz[i, c(6, 8)], DistMigFz[i, c(7, 9)], lwd = 0.2 + DistMigFz[i,
3]/100, col = "blue")
}
The map is still not very helpful, so we will refine it to exclude all flows of <= 100, and we'll reduce the impact of the largest flows through dividing by 1000, rather than 100.
# Retain rows where FlowOD > 100
DistMigCut <- DistMigFz[which(DistMigFz$FlowOD > 100), ]
# Plot the districts and then map the flows
plot(Districts)
for (i in 1:length(DistMigCut[, 1])) {
lines(DistMigCut[i, c(6, 8)], DistMigCut[i, c(7, 9)], lwd = 0.2 + DistMigCut[i,
3]/1000, col = "blue")
}
# Check all seems ok
summary(DistMigCut)
## Origin Destination FlowOD PersonsO
## 15UE : 114 15UF : 96 Min. : 101 Min. : 24443
## 15UF : 96 00HG : 69 1st Qu.: 139 1st Qu.: 92502
## 35UC : 72 15UE : 66 Median : 227 Median :140024
## 00CN : 62 15UC : 57 Mean : 612 Mean :183972
## 00HG : 48 35UC : 56 3rd Qu.: 443 3rd Qu.:238638
## 00MW : 45 00MR : 54 Max. :22076 Max. :977085
## (Other):3214 (Other):3253
## PersonsD DestinationX DestinationY OriginX
## Min. : 7181 Min. :132023 Min. : 17580 Min. :132023
## 1st Qu.: 92502 1st Qu.:388454 1st Qu.:152000 1st Qu.:388211
## Median :134248 Median :459163 Median :189463 Median :455413
## Mean :179012 Mean :441777 Mean :236789 Mean :439108
## 3rd Qu.:218344 3rd Qu.:526281 3rd Qu.:326032 3rd Qu.:526129
## Max. :977085 Max. :647811 Max. :638991 Max. :647811
##
## OriginY Dist
## Min. : 17580 Min. : 124
## 1st Qu.:155359 1st Qu.: 14155
## Median :189643 Median : 22232
## Mean :238852 Mean : 34166
## 3rd Qu.:329548 3rd Qu.: 35563
## Max. :638991 Max. :408510
##
That's better! It is now possible to discern some major trends in migration across England.
Try some other flow cutoffs (where DistMigCutTest is the new (output) file and DistMigFz is the input). For example, mapping flows of greater than 1000 people:
DistMigCutTest <- DistMigFz[which(DistMigFz$FlowOD > 1000), ]
plot(Districts)
for (i in 1:length(DistMigCutTest[, 1])) {
lines(DistMigCutTest[i, c(6, 8)], DistMigCutTest[i, c(7, 9)], lwd = 0.2 +
DistMigCutTest[i, 3]/1000, col = "blue")
}
Next, we are going to explore how the distance between districts and their populations are related to the size of migration flows. We might expect flows to be larger between districts which are close together. Also flows to districts which have large populations (and, by implications, more housing) may also be large. The relationships between variables such as flow size and distance can be analysed using regression. The regression approach you may be familiar with from Excel, for example, is called ordinary least squares (OLS) regression. OLS regression is used to explore how a set of variables ('independent variables', for example, population size of districts and distance between districts) are related to a single variable (for example, flows between districts)
OLS is used here to explore how flows relate to distance, origin populations, and destination populations.
We run the OLS model with intra-district (internal) flows removed. Research on migration flows suggests that we should transform the variables using logs (that is, they are put on a log scale; zeros can't be logged, so we add 0.5 to each value):
lreg_dist <- lm(log(FlowOD + 0.5) ~ log(Dist + 0.5) + log(PersonsO + 0.5) +
log(PersonsD + 0.5), data = DistMigFz)
summary(lreg_dist)
##
## Call:
## lm(formula = log(FlowOD + 0.5) ~ log(Dist + 0.5) + log(PersonsO +
## 0.5) + log(PersonsD + 0.5), data = DistMigFz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.532 -1.031 0.142 0.992 6.414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94920 0.11775 -8.06 7.6e-16 ***
## log(Dist + 0.5) -1.01691 0.00458 -222.25 < 2e-16 ***
## log(PersonsO + 0.5) 0.65657 0.00561 116.94 < 2e-16 ***
## log(PersonsD + 0.5) 0.59166 0.00561 105.38 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.32 on 151706 degrees of freedom
## Multiple R-squared: 0.375, Adjusted R-squared: 0.375
## F-statistic: 3.04e+04 on 3 and 151706 DF, p-value: <2e-16
The estimate values (parameter estimates) for Dist, PersonsO and PersonsD are the key figures here.
The value for Dist is negative, while the values for PersonsO and PersonsD are positive. This indicates that Dist (distance) has a negative effect - as distance increases, flows decrease. This fits with what we'd expect. The positive values for PersonsO and PersonsD also correspond to what we'd expect - migration flows are likely to be larger FROM districts with large populations, and they are likely to be larger TO districts with large populations. In short, there is a push and pull factor in terms of population size. The three astrices (***) at the end of the columns indicate that all of the parameter estimates are statistically significant.
But, there is a problem with using OLS… Flows are counts which are unlikely to have a normal distribution - there isn't an equal proportion of very small and very large flows, with the bulk in the middle range. Instead, there are many small flows and few large flows. OLS is designed to work with data which have a normal distribution, and so its application to count data like migration flows is not sensible… An alternative approach is Poisson regression, which is appropriate for flow data. We can apply it using the glm function. Again, we exclude internal flows.
preg_distgtzero <- glm(FlowOD ~ Dist + PersonsO + PersonsD, data = DistMigFz,
family = poisson())
summary(preg_distgtzero)
##
## Call:
## glm(formula = FlowOD ~ Dist + PersonsO + PersonsD, family = poisson(),
## data = DistMigFz)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -24.2 -3.7 -0.7 2.4 377.5
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.19e+00 1.36e-03 3810 <2e-16 ***
## Dist -2.11e-05 1.08e-08 -1940 <2e-16 ***
## PersonsO 1.64e-06 4.24e-09 386 <2e-16 ***
## PersonsD 1.43e-06 4.44e-09 323 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 19039624 on 151709 degrees of freedom
## Residual deviance: 10947342 on 151706 degrees of freedom
## AIC: 11353920
##
## Number of Fisher Scoring iterations: 7
As for OLS, the parameter estimates for Dist are negative and those for PersonsO and PersonsD are positive. Values such as 2.11e-05 may not mean much to you. This is scientific notation and is a conveninet way of displaying very small or very large values. In the case of -2.11e-05, this means that the decimal place is moved 5 positions to the left, so -2.11e-05 = -0.0000211. The parameter estimate for PersonsO is larger than the value for PersonsD. This suggests that the number of persons at the origin has a larger positive effect on flow size than does the number of persons at the destination. This might suggest movement away from areas with larger populations. Have a think about whether than surprises you, given what you know about population change in the UK…
There are still problems with applying Poisson regression but it is conceptually an improvement on OLS regression.
We know that data on small flows are unreliable because of the small count adjustment procedure detailed previously. So, try re-running the Poisson regression model with different (small) flow thresholds (following the steps outlined for mapping flows). Use thresholds of 5 and 10 and compare results to those using all data (that is, a threshold of zero).
The various analysis that you have prepared during this tutorial will be of assistance when compiling your report. You may have to piece together different parts of the code to achieve this, however, examples of any code that you need to complete your report are contained in this document. However, you may wish to make use of the knowledge developed in previous practicals.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.