Learning Outcomes
1. Understand what geodemographics are
2. Know why geodemographics are useful
3. Understand what index scores mean
4. Be able to apply index scores in a case study setting
In this practical you will be using some new commands. Have a look at the helpsheet and glossary posted on the Discussion Board on Vital which will explain the commands. If you have any more you want added, post them to the thread.
In this exercise you take on the role of an owner of a direct marketing consultancy that specialises in helping small businesses to target potential service users within their local area. Your most recent client is a group of IT trainers who are organising an Internet training campaign for people who haven’t used the Internet before at a local school. The campaign involves delivery of an information pamphlet to local residents and holding a training session at one of the local schools.
The trainers have two problems:
You have been asked to write a short executive report (max 500 words) that will answer the following questions:
In addition to answering these two questions in your report, you have been asked to present as a minimum the following information:
Overall, your report should not have more than 10 maps or charts. All maps must be generated within R; and all maps should adopt different (but appropriate) colour schemes to those illustrated in this practical.
Around 22% of the UK population have not used the Internet. Internet non-use rates vary across geodemographic groups. Using the Office for National Statistics Output Area Classification (OAC), we can see how these patterns vary across different types of neighbourhood (see Figure 1 and Table 1).
Figure 1: Percentage different from the national average for Internet non use
| OAC Group | Index Score |
|---|---|
| 1a: Farming communities | 121.57 |
| 1b: Rural tenants | 102.74 |
| 1c: Ageing rural dwellers | 119.55 |
| 2a: Students around campus | 52.33 |
| 2b: Inner city students | 30.70 |
| 2c: Comfortable cosmopolitan | 59.17 |
| 2d: Aspiring and affluent | 44.15 |
| 3a: Ethnic family life | 84.06 |
| 3b: Endeavouring ethnic mix | 76.34 |
| 3c: Ethnic dynamics | 93.11 |
| 3d: Aspirational techies | 84.38 |
| 4a: Rented family living | 90.22 |
| 4b: Challenged Asian terraces | 75.72 |
| 4c: Asian traits | 87.40 |
| 5a: Urban professionals and families | 79.14 |
| 5b: Ageing urban living | 105.20 |
| 6a: Suburban achievers | 97.73 |
| 6b: Semi-detached suburbia | 106.59 |
| 7a: Challenged diversity | 112.09 |
| 7b: Constrained flat dwellers | 139.81 |
| 7c: White communities | 129.61 |
| 7d: Ageing city dwellers | 188.37 |
| 8a: Industrious communities | 109.98 |
| 8b: Challenged terraced workers | 99.46 |
| 8c: Hard pressed ageing workers | 124.80 |
| 8d: Migration and churn | 99.25 |
Interpreting these index scores is a relatively straight forward task. A score of 100 indicates an OAC Group where Internet non-use is the same proportion as the national average. A score of 200 would indicate Internet non-use at a rate twice the national average, and a score of 50 half the national average. Index scores formatted in this way are commonly used in geodemographic analysis.
For this exercise you will use three sets of data: population estimates, the Output Area Classification and the locations of schools.
You first need to setup a project in RStudio, and then load a number of packages that you will need later in the practical. When you restart the practical on different machines, you will need to run these lines of code again. A good way of doing this is including these at the top of your script, then they can be easily run again.
#Load libaries
library(sp)
library(maptools)
library(stringr)
library(rgeos)
library(classInt)
library(RColorBrewer)
The shapefiles for Leeds can be downloaded from the UK Data Service, and we are going to download two sets of data - firstly the Output Areas (OAs), and secondaly, the wards (CAS).
Therefore, first, we need to download the spatial data. Usefully, the Output Area data is avalaible with the OAC classification already joined, so we do not need to merge the two data sets ourselves.
This is the Output Area boundaries for Leeds, but we also need to download the Census Ward. Repeat the process, selecting ‘English Census Merged Wards, 2011’ instead. Remember to name your zip files sensibly. The download facility will call both files ‘BoundaryData.zip’ which is not particually helpful!
Extract the files, and move all the files starting with the name england_oa_2011Polygon and england_cmwd_2011Polygon to your working folder. There should be 5 files for each data set - make sure you move them all. Then read in the data using these commands:
#Get 2011 OA for Leeds
OA <- readShapeSpatial("england_oac_2011",proj4string=CRS("+init=epsg:27700"))
#Get CAS wards for Leeds
CAS <- readShapeSpatial("england_cmwd_2011Polygon",proj4string=CRS("+init=epsg:27700"))
#Plot of Leeds
plot(OA)
We then will take a quick look at both the geometry and the attributes of the newly created object OA (using head(OA@data)). In the data slot, you will see that there are a variety of attributes about each Output Area in Leeds including an area id (code), the geodemographic super group, group and subgroup (oac_super_, oac_group and oac_sub_gr), a name for the geodemographic group (label) and the names of the groups and sub groups.
#First six rows of data
head(OA@data)
gid code label name oac_super_
0 132913 E00057132 E08000035E02002390E01011345E00057132 <NA> 8
1 138532 E00057142 E08000035E02002390E01011343E00057142 <NA> 8
2 136971 E00057159 E08000035E02002389E01011346E00057159 <NA> 8
3 138638 E00057160 E08000035E02002390E01011345E00057160 <NA> 8
4 137325 E00057346 E08000035E02002346E01032502E00057346 <NA> 8
5 137101 E00057347 E08000035E02002346E01032502E00057347 <NA> 8
oac_group oac_sub_gr oac_super0 oac_group_
0 8d 8d2 Hard-Pressed Living Migration and Churn
1 8d 8d2 Hard-Pressed Living Migration and Churn
2 8d 8d2 Hard-Pressed Living Migration and Churn
3 8d 8d2 Hard-Pressed Living Migration and Churn
4 8d 8d2 Hard-Pressed Living Migration and Churn
5 8d 8d2 Hard-Pressed Living Migration and Churn
oac_sub_g0
0 Hard-Pressed Ethnic Mix
1 Hard-Pressed Ethnic Mix
2 Hard-Pressed Ethnic Mix
3 Hard-Pressed Ethnic Mix
4 Hard-Pressed Ethnic Mix
5 Hard-Pressed Ethnic Mix
We will need the total population for each OA, which we will add now, following a simliar process to that used in the previous practical.
The mid-year population estimates for 2013 are available from http://www.ons.gov.uk. Navigate to Data, and search for SAPE15DT10c. The only entry should be “Supporting information - SAPE15DT10c - Census Output Area Estimates in the Yorkshire and the Humber region of England, Mid-2013 (ZIP 3845Kb)”.
Download this file and unzip it. When you save this file as a CSV, make sure you update the formatting in Microsoft Excel so the values over 1000 do not have a comma in them (you can do this by setting the number format to ‘General’). Otherwise R will give you an error message (Error in rowSums) when you try to perform the calculations later on.
Use the read.csv function to read in the data from the file SAPE15DT10c-mid-2013-coa-unformatted-syoa-estimates-yorkshire-and-the-humber.csv into a variable called pop2013. Remember to add the code to your script.
It should look like this once you have read it in:
head(pop2013)
OA11CD LSOA11CD All.Ages X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
1 E00037088 E01007317 381 6 6 12 11 14 4 6 6 6 9 6 5 6
2 E00037089 E01007317 323 5 3 2 5 5 2 6 4 4 3 3 2 4
3 E00037090 E01007317 269 5 4 0 5 4 4 0 5 4 2 2 5 4
4 E00037091 E01007317 237 2 1 4 1 2 0 3 1 2 1 1 1 1
5 E00037112 E01007317 293 6 3 6 5 6 2 4 5 2 5 5 4 3
6 E00037099 E01007318 386 7 14 11 6 6 11 7 6 3 6 6 7 4
X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
1 2 5 7 8 6 7 5 8 5 5 8 3 4 8 7 5 7 8
2 3 6 4 2 2 4 5 3 3 3 0 4 7 3 2 4 3 4
3 3 5 3 3 6 3 1 7 3 3 3 2 3 6 3 1 0 3
4 2 2 4 1 1 3 0 2 3 1 3 1 2 2 2 1 2 2
5 6 3 2 4 6 4 6 3 4 4 3 1 7 6 1 4 2 3
6 3 4 3 3 2 1 4 3 3 3 8 4 6 9 7 11 11 11
X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48
1 6 4 6 3 3 2 3 4 2 3 6 10 3 5 6 5 6 4
2 4 2 2 4 2 0 2 7 1 3 3 4 6 5 3 5 6 3
3 3 2 1 6 3 2 4 5 5 5 3 4 4 4 5 3 3 5
4 3 3 1 1 1 2 1 2 2 2 2 4 2 3 0 3 3 4
5 3 1 2 1 3 3 3 3 3 3 2 3 6 3 3 4 5 8
6 6 13 10 12 10 5 7 9 5 6 7 8 5 6 10 4 6 11
X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61 X62 X63 X64 X65 X66
1 2 8 4 3 5 4 6 2 4 2 1 3 2 3 6 4 3 3
2 5 6 3 3 1 6 1 3 3 2 7 5 4 8 10 9 8 5
3 5 3 4 6 3 4 6 4 4 5 1 7 2 4 1 5 4 4
4 5 4 3 4 0 2 5 4 4 5 3 5 4 4 2 3 3 11
5 4 6 1 3 3 4 3 4 4 2 6 3 6 4 4 5 4 6
6 3 4 7 1 4 1 5 3 1 2 2 3 1 2 1 5 0 3
X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80 X81 X82 X83 X84
1 2 2 3 1 0 0 2 2 2 0 0 3 3 0 1 2 1 2
2 3 6 7 7 5 3 7 3 1 2 2 5 4 0 2 0 1 0
3 3 1 1 6 2 4 3 1 3 0 2 0 0 0 0 0 2 0
4 5 4 10 7 3 2 0 3 3 5 3 2 8 4 5 4 2 0
5 1 3 6 1 1 2 3 5 0 1 2 2 3 2 1 2 1 1
6 1 0 1 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0
X85 X86 X87 X88 X89 X90
1 1 0 1 1 0 1
2 1 1 1 1 0 0
3 0 0 0 0 0 0
4 0 0 0 0 2 1
5 0 0 0 0 0 0
6 0 0 0 0 1 0
Now we will tidy up the data (pop2013) and then join this to the spatial data (OA). Previously we used the merge() function, but this does not always work with spatial data, so instead, we will use the match() function.
#keep columns we want
pop2013 <- pop2013[,1:3]
#rename columns
colnames(pop2013)[3] <- "TotalPop"
#join to spatial data
OA@data = data.frame(OA@data, pop2013[match(OA@data[,"code"], pop2013[,"OA11CD"]),])
The first six lines of OA@data should look like this:
head(OA@data)
gid code label name oac_super_
0 132913 E00057132 E08000035E02002390E01011345E00057132 <NA> 8
1 138532 E00057142 E08000035E02002390E01011343E00057142 <NA> 8
2 136971 E00057159 E08000035E02002389E01011346E00057159 <NA> 8
3 138638 E00057160 E08000035E02002390E01011345E00057160 <NA> 8
4 137325 E00057346 E08000035E02002346E01032502E00057346 <NA> 8
5 137101 E00057347 E08000035E02002346E01032502E00057347 <NA> 8
oac_group oac_sub_gr oac_super0 oac_group_
0 8d 8d2 Hard-Pressed Living Migration and Churn
1 8d 8d2 Hard-Pressed Living Migration and Churn
2 8d 8d2 Hard-Pressed Living Migration and Churn
3 8d 8d2 Hard-Pressed Living Migration and Churn
4 8d 8d2 Hard-Pressed Living Migration and Churn
5 8d 8d2 Hard-Pressed Living Migration and Churn
oac_sub_g0 OA11CD LSOA11CD TotalPop
0 Hard-Pressed Ethnic Mix E00057132 E01011345 305
1 Hard-Pressed Ethnic Mix E00057142 E01011343 201
2 Hard-Pressed Ethnic Mix E00057159 E01011346 325
3 Hard-Pressed Ethnic Mix E00057160 E01011345 295
4 Hard-Pressed Ethnic Mix E00057346 E01032502 350
5 Hard-Pressed Ethnic Mix E00057347 E01032502 284
The next dataset you will download is the locations of all schools from the Department for Education website:
Note this is a daily snapshot of the Departmet for Education database, so the exact file name and contents will vary. However the process to read in the data should be the same.
Save the CSV file into your working directory and read it in. We also need to trim down the data set to just include the school name, type, status and its location:
#read in CSV
schools <- read.csv("edubase20150208.csv",header=TRUE)
#select the columns we need
schools <- subset(schools,select=c("EstablishmentName","TypeOfEstablishment..name.","EstablishmentStatus..name.","Easting","Northing"))
#remane columns
colnames(schools) <- c("SchoolName","SchoolType","SchoolStatus","Easting","Northing")
Again, the data should look like this:
head(schools)
SchoolName SchoolType
1 Sir John Cass's Foundation Primary School Voluntary Aided School
2 City of London School for Girls Other Independent School
3 St Paul's Cathedral School Other Independent School
4 City of London School Other Independent School
5 Sherborne Nursery School LA Nursery School
6 Thomas Coram Centre LA Nursery School
SchoolStatus Easting Northing
1 Open 533498 181201
2 Open 532301 181746
3 Open 532160 181151
4 Open 531981 180844
5 Closed 528515 184869
6 Open 530464 182403
You will notice that the schools data frame has the coordinates of each school in it - the eastings and the northings. The coverage of the school data can be explored using a basic plot of the x,y coordinates.
plot(schools$Easting,schools$Northing)
If RStudio stops working when you try to plot the data, then it may be struggling with the data. To solve this, load the Edubase CSV file in Excel, and delete the columns you don’t need in there, before reading it into RStudio.
You will see that the schools locations have a coverage of the whole of England & Wales; and as such, it is necessary to extract only those schools that reside within Leeds. We can do this using a point in polygon operation - i.e. identify those schools (points) which lie within the Leeds area (polygon). However, before we can do this, we must first convert the schools data frame into a SpatialPointsDataFrame. This is because at the moment R doesn’t know that is is spatial data - it just thinks is a table of data.
The function SpatialPointsDataFrame has a number of parameters - coords which accepts a list of the x and y coordinates, data that accepts the data frame containing attributes for each of the coordinates, and finally, a CRS string that shows the coordinate reference system for the given set of points - there are codes for lots of different referencing systems; however here we are using the one which corresponds to British National Grid.
#Remove those schools without Easting or Northing
schools <- subset(schools,Easting !="" | Northing != "")
#Create a unique ID for each schools
schools$schools_ID <- 1:nrow(schools)
#Create the SpatialPointsDataFrame
schools_SP <- SpatialPointsDataFrame(coords=c(schools[4],schools[5]),data=data.frame(schools$SchoolName,schools$schools_ID),proj4string=CRS("+init=epsg:27700"))
#Show the results
plot(schools_SP)
At first glace, this doesn’t look too different to the earlier plot - however, look in the Environment window (top-right) and you will see that schools_SP has a type of Large SpatialPointsDataFrame where as schools is just listed under Data which means R treats it as a normal data frame.
The next stage of analysis is to perform a point in polygon operation using the over() function - this examines two spatial objects and returns a different set of attributes depending whether the objects are points, lines or polygons. In the case of a point and a polygon, the returned object is a data frame of the attributes of the polygon in which the point resides. The returned data frame therefore has the same number of rows as the input spatial point data frame.
# point in polygon - returns a dataframe of the attributes of the polygons that the point is within.
o <- over(schools_SP,OA)
#Many of these will be NA values - because most schools are not in Leeds!
head(o)
gid code label name oac_super_ oac_group oac_sub_gr oac_super0
1 NA <NA> <NA> <NA> NA <NA> <NA> <NA>
2 NA <NA> <NA> <NA> NA <NA> <NA> <NA>
3 NA <NA> <NA> <NA> NA <NA> <NA> <NA>
4 NA <NA> <NA> <NA> NA <NA> <NA> <NA>
5 NA <NA> <NA> <NA> NA <NA> <NA> <NA>
6 NA <NA> <NA> <NA> NA <NA> <NA> <NA>
oac_group_ oac_sub_g0 OA11CD LSOA11CD TotalPop
1 <NA> <NA> <NA> <NA> NA
2 <NA> <NA> <NA> <NA> NA
3 <NA> <NA> <NA> <NA> NA
4 <NA> <NA> <NA> <NA> NA
5 <NA> <NA> <NA> <NA> NA
6 <NA> <NA> <NA> <NA> NA
#Add the attributes back onto the schools_SP SpatialPointsDataFrame
schools_SP@data <- cbind(schools_SP@data,o)
#Use the NA values to remove those points now within Leeds
schools_SP_Leeds <- schools_SP[!is.na(schools_SP@data$label),]
#Map your results
plot(schools_SP_Leeds)
#View the data slot of the results
head(schools_SP_Leeds@data)
schools.SchoolName schools.schools_ID gid
7798 Hunslet Nursery School 7798 138649
7799 The Armley Park Centre 7799 138361
7800 The Craven Road Centre 7800 138165
7801 Guiseley Infant and Nursery School 7801 137461
7802 Rawdon Littlemoor Primary School 7802 138635
7803 Yeadon South View Junior School 7803 137013
code label name oac_super_
7798 E00057830 E08000035E02006876E01011468E00057830 <NA> 7
7799 E00056874 E08000035E02002400E01011284E00056874 <NA> 4
7800 E00169615 E08000035E02002384E01011670E00169615 <NA> 2
7801 E00056776 E08000035E02002337E01011279E00056776 <NA> 5
7802 E00056761 E08000035E02002343E01011278E00056761 <NA> 5
7803 E00056823 E08000035E02002340E01011276E00056823 <NA> 5
oac_group oac_sub_gr oac_super0
7798 7c 7c2 Constrained City Dwellers
7799 4a 4a2 Multicultural Metropolitans
7800 2a 2a1 Cosmopolitans
7801 5b 5b2 Urbanites
7802 5a 5a1 Urbanites
7803 5b 5b3 Urbanites
oac_group_ oac_sub_g0
7798 White Communities Constrained Young Families
7799 Rented Family Living Social Renting New Arrivals
7800 Students Around Campus Student Communal Living
7801 Ageing Urban Living Communal Retirement
7802 Urban Professionals and Families White Professionals
7803 Ageing Urban Living Self-Sufficient Retirement
OA11CD LSOA11CD TotalPop
7798 E00057830 E01011468 192
7799 E00056874 E01011284 458
7800 E00169615 E01011670 870
7801 E00056776 E01011279 311
7802 E00056761 E01011278 251
7803 E00056823 E01011276 315
An estimated population of Internet non-users in each OA can be calculated using the index scores that were given in Table 1. The aim is to adjust the national rate of Internet non-use up or down from 22%* based on the OAC Group characteristics of the area, and then calculate an estimated population count for the number of people who haven’t used the Internet.
The 22% figure is based on the Oxford Internet Survey: Cultures of the Internet: The Internet in Britian (http://oxis.oii.ox.ac.uk/wp-content/uploads/2014/11/OxIS-2013.pdf), p4, “In 2013, 78% of the UK population said that they use the Internet.”
You now need to create a new data frame object called OAC_non_users_rates that will contain these rates. You can create a data frame using the data.frame() function, which will accept a series of lists as their input. For example…
#Character strings use ""
forename <- c("Alex","John","Peter","Eric")
surname <- c("Smith","Peters","Jones","Mount")
#Numbers do not use '""'
age <- c(10,15,67,50)
#Create a data frame
people <- data.frame(forename,surname,age)
#This should look like this if printed to the terminal
people
forename surname age
1 Alex Smith 10
2 John Peters 15
3 Peter Jones 67
4 Eric Mount 50
You now need to create a lookup data frame for the index scores that contains: * Id value 1 - 26 - column called ID (this will be used later for mapping!) - column called ID * OAC Group Code (e.g. 1a, 1b …) - column called oac_names * Index score for Internet non-use (see Table 1) - column called non-users
If you have done this correctly, this will look like
ID oac_names non_users_index
1 1 1a 121.57
2 2 1b 102.74
3 3 1c 119.55
4 4 2a 52.33
5 5 2b 30.70
6 6 2c 59.17
7 7 2d 44.15
8 8 3a 84.06
9 9 3b 76.34
10 10 3c 93.11
11 11 3d 84.38
12 12 4a 90.22
13 13 4b 75.72
14 14 4c 87.40
15 15 5a 79.14
16 16 5b 105.20
17 17 6a 97.73
18 18 6b 106.59
19 19 7a 112.09
20 20 7b 139.81
21 21 7c 129.61
22 22 7d 188.37
23 23 8a 109.98
24 24 8b 99.46
25 25 8c 124.80
26 26 8d 99.25
The next stage is to join this data onto the data slot of the OA variable. This code uses the match() function to examine which rows in the OAC_non_users_rates match with those in the OA@data. The matches are done using the OAC group name columns in both objects. If you print the first few rows of the data slot on the OA object you will see that this now contains the data you calculated above.
#Join non-user rathes onto OA
OA@data = data.frame(OA@data, OAC_non_users_rates[match(OA@data[,"oac_group"], OAC_non_users_rates[,"oac_names"]),])
#check output
head(OA@data)
gid code label name oac_super_
0 132913 E00057132 E08000035E02002390E01011345E00057132 <NA> 8
1 138532 E00057142 E08000035E02002390E01011343E00057142 <NA> 8
2 136971 E00057159 E08000035E02002389E01011346E00057159 <NA> 8
3 138638 E00057160 E08000035E02002390E01011345E00057160 <NA> 8
4 137325 E00057346 E08000035E02002346E01032502E00057346 <NA> 8
5 137101 E00057347 E08000035E02002346E01032502E00057347 <NA> 8
oac_group oac_sub_gr oac_super0 oac_group_
0 8d 8d2 Hard-Pressed Living Migration and Churn
1 8d 8d2 Hard-Pressed Living Migration and Churn
2 8d 8d2 Hard-Pressed Living Migration and Churn
3 8d 8d2 Hard-Pressed Living Migration and Churn
4 8d 8d2 Hard-Pressed Living Migration and Churn
5 8d 8d2 Hard-Pressed Living Migration and Churn
oac_sub_g0 OA11CD LSOA11CD TotalPop ID oac_names
0 Hard-Pressed Ethnic Mix E00057132 E01011345 305 26 8d
1 Hard-Pressed Ethnic Mix E00057142 E01011343 201 26 8d
2 Hard-Pressed Ethnic Mix E00057159 E01011346 325 26 8d
3 Hard-Pressed Ethnic Mix E00057160 E01011345 295 26 8d
4 Hard-Pressed Ethnic Mix E00057346 E01032502 350 26 8d
5 Hard-Pressed Ethnic Mix E00057347 E01032502 284 26 8d
non_users_index
0 99.25
1 99.25
2 99.25
3 99.25
4 99.25
5 99.25
You now need to calculate a new column in the OA object that transforms the index scores into a predicted rate of people who have not used the Internet in these areas. As noted earlier, the national average is around 22% of the population which we will adjust depending on the OAC group that the area is assigned.
You now need to calculate a new column in the OA object that transforms the index scores into a predicted rate of people without Internet access in these areas.
To do this, we can use the following equation which transforms the index score into a predicted rate.
\[ \frac{\text{OAC Group Index Score}} {100} * 22 \]
If you remember from your previous work, you can use the $ to refer to columns within your data, and also use them to create a new column. Thus…
OA@data$non_user_rate <- OA@data$non_users_index / 100 * 22
You now need create another column in your OA@data called non_user_count which is the number of people who are estimated to have not used the Internet. For this calculation, you should use the number of people in the Output Area (i.e. TotalPop). If this has been done correctly, your OA@data should look like…
head(OA@data)
gid code label name oac_super_
0 132913 E00057132 E08000035E02002390E01011345E00057132 <NA> 8
1 138532 E00057142 E08000035E02002390E01011343E00057142 <NA> 8
2 136971 E00057159 E08000035E02002389E01011346E00057159 <NA> 8
3 138638 E00057160 E08000035E02002390E01011345E00057160 <NA> 8
4 137325 E00057346 E08000035E02002346E01032502E00057346 <NA> 8
5 137101 E00057347 E08000035E02002346E01032502E00057347 <NA> 8
oac_group oac_sub_gr oac_super0 oac_group_
0 8d 8d2 Hard-Pressed Living Migration and Churn
1 8d 8d2 Hard-Pressed Living Migration and Churn
2 8d 8d2 Hard-Pressed Living Migration and Churn
3 8d 8d2 Hard-Pressed Living Migration and Churn
4 8d 8d2 Hard-Pressed Living Migration and Churn
5 8d 8d2 Hard-Pressed Living Migration and Churn
oac_sub_g0 OA11CD LSOA11CD TotalPop ID oac_names
0 Hard-Pressed Ethnic Mix E00057132 E01011345 305 26 8d
1 Hard-Pressed Ethnic Mix E00057142 E01011343 201 26 8d
2 Hard-Pressed Ethnic Mix E00057159 E01011346 325 26 8d
3 Hard-Pressed Ethnic Mix E00057160 E01011345 295 26 8d
4 Hard-Pressed Ethnic Mix E00057346 E01032502 350 26 8d
5 Hard-Pressed Ethnic Mix E00057347 E01032502 284 26 8d
non_users_index non_user_rate non_user_count
0 99.25 21.84 66.60
1 99.25 21.84 43.89
2 99.25 21.84 70.96
3 99.25 21.84 64.41
4 99.25 21.84 76.42
5 99.25 21.84 62.01
To illustrate the geography of OAC in Leeds we will first map this categorical variable. Earlier you created a variable called ID in OA@data. We can use these 1-26 scores to assign different colours from a list to each of the OA.
#Define a set of colours, one for each of the OAC groups
my_colour <- c("#00688B","#33A1C9","#87CEEB","#CDBE70","#FFEC8B","#6E8B3D","#A2CD5A","#CAFF70","#8B4C39","#CD7054","#FA8072","#BC8F8F","#848484","#B7B7B7","#DCDCDC","#5D478B","#9F79EE","#CD69C9","#CD96CD","#8B7E66","#CDBA96","#fb8072","#cc4c02","#fe9929","#fed98e","#ffffd4")
#Create a basic OAC choropleth map
plot(OA, col=my_colour[OA@data$oac_group], axes=FALSE,border = NA)
#Plot the wards on top of the OA
plot(CAS, axes=FALSE,border = "#6B6B6B",add=TRUE)
#Add on text labels for the wards
text(coordinates(CAS)[,1],coordinates(CAS)[,2],labels=CAS@data$name, cex=.7)
#Add the legend (the oac_names object was created earlier)
legend(x=412971, y=439516, legend=levels(OA@data$oac_group), fill=my_colour, bty="n", cex=.6, ncol=2)
#Add North Arrow
SpatialPolygonsRescale(layout.north.arrow(2), offset= c(444282,423158), scale = 2000, plot.grid=F)
#Add Scale Bar
SpatialPolygonsRescale(layout.scale.bar(), offset= c(436020,423158), scale= 5000, fill= c("white", "black"), plot.grid= F)
#Add text to scale bar
text(436020,424158,"0km", cex=.6)
text(436020 + 2500,424158,"2.5km", cex=.6)
text(436020 + 5000,424158,"5km", cex=.6)
#Add a title
title("OAC Group Map of Leeds")
This map is not very attractive or easy to interpret as it has far too many categories. As such, you now need to recreate it, however, with the more aggregate OAC Supergroups rather than Groups. We will first create a new column called ID2, which corresponds to the number at the start of the OAC Group code. For example, Groups 6c and 6a would be in SuperGroup 6. We will use the substr() function which can be used to extract part of a character vector. We also use the function as.numeric() to turn the number (which was extracted from a character string) into a numeric value.
#The numbers 1,1 refer to the start and end point in the string to extract
OA@data$ID2 <- as.numeric(substr(OA@data$oac_group,1,1))
For your report you will need to include an OAC SuperGroup map of Leeds with the location of the schools on it - remember, you can use add=TRUE to overplot new layers (see previous example). For example, you could add points onto the previous map as follows. The pch= parameter specifies the symbol type - there are 25 different symbols which can be specified by changing the number 19 to 1 through 25.
plot(schools_SP_Leeds,pch=19,cex=.4,col="#00688B")
You won’t just be able to replicate the OAC Group code entirely (shown above); you will for example need to specify a list object containing a set of just seven colours to use for the choropleth, and also a new set of names for the SuperGroups which are:
Also, when you plot this the legend will still be split over 2 columns. To change it back to 1 column, change the ncol=2 parameter in the legend function to ncol=1.
There is a lot to do here, so do things one step at a time. Also remember to write your code in a script so it is easier to go back and edit when it doesn’t work as expected!
You might aim to create a map which looks a little like the following…
Earlier you calculated the estimated frequency of people who would not have accessed the Internet within each Output Area. You can now map both the percentage rates of the target group (non_user_rate) and the number of people (non_user_count).
Ok, we will first map the percentage rates.
#set breaks for 6 categories and then add the values to a single list - style = fisher is jenks
breaks <- classIntervals(OA@data$non_user_rate,n=6,style="fisher")
#classIntervals returns and object with a number of different slots - however, we only need the break values; thus, we overwrite the object breaks as follows
breaks<- breaks$brks
#Select six colours from the pallet Blues - remember you can see all the pallets by running display.brewer.all()
my_colour <- brewer.pal(6, "Blues")
#Create a basic choropleth map
plot(OA, col=my_colour[findInterval(OA@data$non_user_rate, breaks,all.inside=TRUE)], axes=FALSE,border = NA)
#Plot the wards on top of the OA
plot(CAS, axes=FALSE,border = "#6B6B6B",add=TRUE)
#Add the legend
legend(x=412144, y=430519, legend=leglabs(breaks,between=" to "), fill=my_colour, bty="n",cex=.7)
#Add North Arrow
SpatialPolygonsRescale(layout.north.arrow(2), offset= c(444282,423158), scale = 2000, plot.grid=F)
#Add Scale Bar
SpatialPolygonsRescale(layout.scale.bar(), offset= c(436020,423158), scale= 5000, fill= c("white", "black"), plot.grid= F)
#Add text to scale bar
text(436020,424158,"0km", cex=.6)
text(436020 + 2500,424158,"2.5km", cex=.6)
text(436020 + 5000,424158,"5km", cex=.6)
#Add a title
title("Estimate rate of non Internet use in Leeds")
You now need to create another map for the non_user_count column; which should look something like…
Remember to check your legend. More than 2 or 3 decimal places it too many, so we can round the breaks cut off values. Try these different options by replacing the line breaks<- breaks$brks with one of those below, and see which you think is most appropiate:
breaks <- round(breaks$brks) #round at integer (whole numbers)
breaks <- round(breaks$brks, 1) #round at 0.1
breaks <- round(breaks$brks*2)/2 #easy way to round at 0.5
The are many different ways in which school catchment areas could be estimated. In this example you will use a simple buffer method that assumes people using a school predominantly reside within a euclidean distance approximating a ten minute walk. You will need to calculate this based on an estimated distance that the average person would walk in ten minutes. At an average of around 3mph this is 1608m.
The rgeos package has a function called gBuffer() that can be used to create buffers around points, lines or polygon objects. In the following example we create a new SpatialPolygons object called schools_SP_Leeds_Buffers. This then needs to be converted into a SpatialPolygonsDataFrame object by joining the @data from schools_SP_Leeds onto schools_SP_Leeds_Buffers.
# calculate buffers
schools_SP_Leeds_Buffers <- gBuffer(schools_SP_Leeds, width=1608, byid=TRUE)
#Convert schools_SP_Leeds_Buffers into a SpatialPolygonsDataFrame (rather than SpatialPolygons) by joining the data of the GP_SP_Leeds SpatialPolygonsDataFrame
schools_SP_Leeds_Buffers <- SpatialPolygonsDataFrame(schools_SP_Leeds_Buffers,schools_SP_Leeds@data)
We can then plot the schools and catchment areas on a map. You will see an addition of a par() parameter in this plot which we can use to change lots of attributes about the plot window - in this case the background colour.
#Change the plot window background colour
par(bg = "#696969")
#cas wards
plot(CAS, axes=FALSE,col="#6E7B8B",border = "#CAE1FF")
#school locations
plot(schools_SP_Leeds,pch=19,cex=.4,col="#5CACEE",add=TRUE)
#catchment buffers
plot(schools_SP_Leeds_Buffers, axes=FALSE,col=NA,border = "#F0F0F0",add=TRUE)
Once you have plotted the map with a different background colour, you need to run this code to reset the background colour to the white we usually have:
#Reset background colour
par(bg = "#FFFFFF")
To examine the characteristics of the estimated catchment areas you will now complete a further point in polygon operation that will examine the characteristics of the Output Areas that lie under the buffer zonal geography. However, before we can do this, we first need to convert the Output Area data into a set of points. Fortunately, there is a function (gCentroid) which is part of the rgeos package which will do this for us. We can also then copy over the attribute data from OA.
#load library
library(rgeos)
#extract centers of each OA
OA_Points_centriods <- gCentroid(OA, byid=TRUE)
#add the eastings and northings to the OA@data data frame
OA@data$easting <- OA_Points_centriods@coords[,1]
OA@data$northing <- OA_Points_centriods@coords[,2]
#turn these into coordinates
coords=c(OA@data[20],OA@data[21])
#recreate the OA_Points spatial data frame with the data, including non_user_count and TotalPop
OA_Points <- SpatialPointsDataFrame(coords,data=data.frame(OA@data$non_user_count,OA@data$TotalPop),proj4string=CRS("+init=epsg:27700"))
#Plot a map
plot(OA_Points)
If you get an error message saying undefined columns selected, then check which columns the eastings and northings are stored in. In my one, they are columns [20] and [21] but in yours this may be different.
We now need to implement another point in polygon operation, however, because we are interested in calculating some values for each catchment area, we first need to add an extra parameter to the over() function - returnList=TRUE. This returns a list of data frames, where each element of the list is a separate data frame and refers to a catchment area (i.e. school), and the values those OA which are within this zone. This is a little different from the lists you have created previously which were just lists of character strings.
#Create point in polygon list
o <- over(schools_SP_Leeds_Buffers,OA_Points,returnList=TRUE)
#View length of the list
length(o)
[1] 684
#If we examine the object o, we will see also see that this comprises a list of dataframes. The summary function tells you about an object - head, is used to wrap around the function so only the first six elements are shown
head(summary(o))
Length Class Mode
[1,] "2" "data.frame" "list"
[2,] "2" "data.frame" "list"
[3,] "2" "data.frame" "list"
[4,] "2" "data.frame" "list"
[5,] "2" "data.frame" "list"
[6,] "2" "data.frame" "list"
#View an item from the list (in this case sixth)
o[[6]]
OA.data.non_user_count OA.data.TotalPop
67 88.68 311
68 85.26 299
79 70.72 248
82 71.00 249
83 71.29 250
84 63.02 221
119 56.89 260
146 63.67 291
159 50.33 230
160 53.00 226
161 58.31 241
162 72.90 315
163 46.31 266
164 41.61 239
165 63.03 362
249 71.52 305
250 70.11 299
251 57.69 246
262 53.00 226
314 68.47 292
315 76.68 327
316 60.97 260
317 49.95 213
483 73.31 303
485 67.99 281
523 84.44 349
524 73.80 305
586 51.29 212
594 62.44 203
643 79.26 338
720 48.37 209
791 74.99 324
802 76.14 329
803 62.03 268
882 71.38 410
883 43.00 247
884 43.70 251
887 85.14 489
890 50.49 290
891 36.04 207
892 47.01 270
893 41.44 238
894 48.58 279
987 82.37 300
988 73.58 268
989 74.68 272
1149 72.93 334
1198 68.71 293
1244 79.35 289
1270 69.46 253
1404 40.10 171
1413 66.19 286
1474 93.33 398
1475 66.13 282
1481 98.49 420
1553 48.54 207
1554 58.62 250
1604 44.22 254
1622 50.84 292
2154 62.78 292
2208 74.89 343
2209 48.04 220
2218 71.40 327
2327 109.41 264
2338 59.95 274
The above results are the values from those OA which are within the catchment area for the sixth school in the schools_SP_Leeds_Buffers object; i.e.
schools_SP_Leeds_Buffers@data[6,]
schools.SchoolName schools.schools_ID gid code
7803 Yeadon South View Junior School 7803 137013 E00056823
label name oac_super_ oac_group
7803 E08000035E02002340E01011276E00056823 <NA> 5 5b
oac_sub_gr oac_super0 oac_group_ oac_sub_g0
7803 5b3 Urbanites Ageing Urban Living Self-Sufficient Retirement
OA11CD LSOA11CD TotalPop
7803 E00056823 E01011276 315
We can illustrate this by plotting the results for a single school.
#Plot the buffer for the school
plot(schools_SP_Leeds_Buffers[6,],col="#D4D4D4")
#get a list of those OA centroids that are within this buffer using the rownames - make them numeric
row_o <- as.numeric(rownames(o[[6]]))
#Plot all the OA centroids
plot(OA_Points,pch=19,cex=.5,col="#7F7F7F",add=TRUE)
#Plot those OA centroids which are inside the buffer - i.e. we use the row_o to select the rows
plot(OA_Points[row_o,],pch=19,cex=.5,col="#FF4500",add=TRUE)
The next stage is to look at the results of the point in polygon analysis and calculate the characteristics for each school catchment. The first stage is to use the lapply() function to apply a function across the list. In this case, for each column in the list elements (data frames), we calculate the sum (i.e. colSums). We create a new object called Internet_non_use that is a list containing the estimated non user population and the total population.
Internet_non_use <- lapply(o, colSums)
#View the first six items from the list
Internet_non_use[1:6]
[[1]]
OA.data.non_user_count OA.data.TotalPop
2642 11621
[[2]]
OA.data.non_user_count OA.data.TotalPop
6719 36467
[[3]]
OA.data.non_user_count OA.data.TotalPop
6942 48427
[[4]]
OA.data.non_user_count OA.data.TotalPop
3532 15751
[[5]]
OA.data.non_user_count OA.data.TotalPop
3403 15006
[[6]]
OA.data.non_user_count OA.data.TotalPop
4199 18267
In order to join this data back onto your schools records (schools_SP_Leeds) we need to collapse the list back into a single data frame which we can do by using the following code.
#Collapse list
school_non_use_Data <- data.frame(matrix(unlist(Internet_non_use), nrow=length(Internet_non_use), byrow=T))
#Change the column names to something sensible
colnames(school_non_use_Data) <- c("Internet_non_users","Total_Pop")
#This should look like
head(school_non_use_Data)
Internet_non_users Total_Pop
1 2642 11621
2 6719 36467
3 6942 48427
4 3532 15751
5 3403 15006
6 4199 18267
We can now join these records back onto the schools_SP_Leeds spatial data frame object and calculate a relative rate (i.e percentage of non Internet users in each OA).
#Join - the ordering has not been changed so this is valid
schools_SP_Leeds@data <- cbind(schools_SP_Leeds@data,school_non_use_Data)
#Relative rate
schools_SP_Leeds@data$relative_rate <- schools_SP_Leeds@data$Internet_non_users / schools_SP_Leeds@data$Total_Pop * 100
#show data
head(schools_SP_Leeds@data)
schools.SchoolName schools.schools_ID gid
7798 Hunslet Nursery School 7798 138649
7799 The Armley Park Centre 7799 138361
7800 The Craven Road Centre 7800 138165
7801 Guiseley Infant and Nursery School 7801 137461
7802 Rawdon Littlemoor Primary School 7802 138635
7803 Yeadon South View Junior School 7803 137013
code label name oac_super_
7798 E00057830 E08000035E02006876E01011468E00057830 <NA> 7
7799 E00056874 E08000035E02002400E01011284E00056874 <NA> 4
7800 E00169615 E08000035E02002384E01011670E00169615 <NA> 2
7801 E00056776 E08000035E02002337E01011279E00056776 <NA> 5
7802 E00056761 E08000035E02002343E01011278E00056761 <NA> 5
7803 E00056823 E08000035E02002340E01011276E00056823 <NA> 5
oac_group oac_sub_gr oac_super0
7798 7c 7c2 Constrained City Dwellers
7799 4a 4a2 Multicultural Metropolitans
7800 2a 2a1 Cosmopolitans
7801 5b 5b2 Urbanites
7802 5a 5a1 Urbanites
7803 5b 5b3 Urbanites
oac_group_ oac_sub_g0
7798 White Communities Constrained Young Families
7799 Rented Family Living Social Renting New Arrivals
7800 Students Around Campus Student Communal Living
7801 Ageing Urban Living Communal Retirement
7802 Urban Professionals and Families White Professionals
7803 Ageing Urban Living Self-Sufficient Retirement
OA11CD LSOA11CD TotalPop Internet_non_users Total_Pop
7798 E00057830 E01011468 192 2642 11621
7799 E00056874 E01011284 458 6719 36467
7800 E00169615 E01011670 870 6942 48427
7801 E00056776 E01011279 311 3532 15751
7802 E00056761 E01011278 251 3403 15006
7803 E00056823 E01011276 315 4199 18267
relative_rate
7798 22.74
7799 18.42
7800 14.33
7801 22.42
7802 22.68
7803 22.99
And create a map…this is the number of non users.
#cas wards
plot(CAS, axes=FALSE,col="#6E7B8B",border = "#AAAAAA")
#Generate a set of breaks and colours for the absolute population at risk
breaks <- classIntervals(schools_SP_Leeds@data$Internet_non_users,n=5,style="fisher")
breaks<- breaks$brks
my_colour <- brewer.pal(5, "OrRd")
#Plot the schools, colouring them by the breaks
plot(schools_SP_Leeds,pch=15,cex=.8,col=my_colour[findInterval(schools_SP_Leeds@data$Internet_non_users, breaks,all.inside=TRUE)],add=TRUE)
#Add legend
legend(x=412144, y=429519, legend=leglabs(breaks,between=" to "), fill=my_colour, bty="n",cex=.7)
In your report, you will need to think whether the rate of non use or the number of non users are most appropriate for use in your selection of training session location. A relative rate map might look something like…
The various analysis that you have prepared during this tutorial will be of assistance when compiling your report. You may have to piece together different parts of the code to achieve this, however, examples of any code that you need to complete your report are contained in this document.
As a final helper, if you want to display an ordered list you can do so using the order() function.
#An example sorting a data frame by the easting column
head(schools[order(schools$Easting,decreasing = TRUE),])
#An example sorting a spatial points data frame by the SchoolName
schools_SP_Leeds_reordered <- schools_SP_Leeds[order(schools_SP_Leeds@data$schools.SchoolName,decreasing = TRUE),]
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/deed.en. Based upon Geodemographics and Social Marketing by Nick Bearman.