Geodemographics

Learning Outcomes
1. Understand what geodemographics are
2. Know why geodemographics are useful
3. Understand what index scores mean
4. Be able to apply index scores in a case study setting

In this practical you will be using some new commands. Have a look at the helpsheet and glossary posted on the Discussion Board on Vital, which will explain the commands. If there are any more you would like added, post them to the thread.

1. Introduction

In this exercise you take on the role of the owner of a direct marketing consultancy that specialises in helping small businesses to target potential service users within their local area. Your most recent clients are a group of IT trainers who are organising an Internet training campaign, based at a local school, for people who haven't used the Internet before. The campaign involves delivering an information pamphlet to local residents and holding a training session at one of the local schools.

The trainers have two problems:

  1. They want to know which of the schools would be the most appropriate location for the training session
  2. They want to promote the sessions with a local leaflet campaign; however, they only have enough money to produce 5,000 leaflets and want the best return on this investment

Project Tasks

You have been asked to write a short executive report (max 500 words) that will answer the following questions:

  1. Identify the top five schools that would make the most appropriate location for the training sessions, embedding the reasons for these choices in the evidence generated by your data analysis.
  2. Identify a ranked list of the top 10 Output Areas that you would target with the leafleting campaign. This would use only about 400 of the leaflets (about 40 households per Output Area * 10 OAs = 400), so describe which areas you would target with the remaining leaflets. Remember to embed the reasons for these choices in the evidence generated by your data analysis.

In addition to answering these two questions in your report, you have been asked to present as a minimum the following information:

  1. An OAC SuperGroup map of Leeds that also details the locations of the schools
  2. An estimated school catchment map overlaying OAC SuperGroups
  3. A population map that identifies those areas containing people who may benefit most from the Internet training (e.g. a map of the number of non-users for each school catchment area on top of the OAC SuperGroups)
  4. Maps showing the estimated number of people who don’t use the Internet and rate of non Internet use within Leeds
  5. A map showing the schools selected as most suitable for the training sessions

Overall, your report should contain no more than 10 maps or charts. All maps must be generated within R, and all maps should adopt colour schemes that are different from (but as appropriate as) those illustrated in this practical.

Getting good marks!

  1. Get the basics right - add legends, titles, scale bars, north arrows etc.
  2. Follow the standard GIS marking criteria!
  3. Ensure that your report clearly identifies your answers to Q1 and Q2
  4. Make sure your answers are clear, and supported by evidence generated by your analysis
  5. Think about the data you are putting on the maps - e.g. if you are plotting points, what is the most appropriate background map?
  6. Choose a single style for your maps, i.e. so they look like they are part of a set
  7. Do not print out the R code in your report
  8. Make sure you include all the analysis required (see above)

2. Population of Internet non-users

Around 22% of the UK population have not used the Internet. Internet non-use rates vary across geodemographic groups. Using the Office for National Statistics Output Area Classification (OAC), we can see how these patterns vary across different types of neighbourhood (see Figure 1 and Table 1).

Figure 1: Percentage difference from the national average for Internet non-use

Table 1: Propensity for Internet non-use
OAC Group Index Score
1a: Farming communities 121.57
1b: Rural tenants 102.74
1c: Ageing rural dwellers 119.55
2a: Students around campus 52.33
2b: Inner city students 30.70
2c: Comfortable cosmopolitan 59.17
2d: Aspiring and affluent 44.15
3a: Ethnic family life 84.06
3b: Endeavouring ethnic mix 76.34
3c: Ethnic dynamics 93.11
3d: Aspirational techies 84.38
4a: Rented family living 90.22
4b: Challenged Asian terraces 75.72
4c: Asian traits 87.40
5a: Urban professionals and families 79.14
5b: Ageing urban living 105.20
6a: Suburban achievers 97.73
6b: Semi-detached suburbia 106.59
7a: Challenged diversity 112.09
7b: Constrained flat dwellers 139.81
7c: White communities 129.61
7d: Ageing city dwellers 188.37
8a: Industrious communities 109.98
8b: Challenged terraced workers 99.46
8c: Hard pressed ageing workers 124.80
8d: Migration and churn 99.25

Interpreting these index scores is relatively straightforward. A score of 100 indicates an OAC Group where the proportion of Internet non-use matches the national average. A score of 200 indicates Internet non-use at twice the national average, and a score of 50 at half the national average. Index scores formatted in this way are commonly used in geodemographic analysis.
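To make this concrete, here is a quick sketch in R of converting an index score into an estimated local rate (this anticipates the formula used later in the practical, and uses the 22% national figure quoted above):

#An index of 100 means the national average rate (around 22% non-use)
national_rate <- 22
index_7d <- 188.37 #OAC Group 7d, from Table 1
index_7d / 100 * national_rate #about 41.4% estimated non-use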

3. Obtaining the Data

For this exercise you will use three sets of data: population estimates, the Output Area Classification and the locations of schools.

You first need to set up a project in RStudio, and then load a number of packages that will be needed later in the practical. When you restart the practical on a different machine, you will need to run these lines of code again. A good way of doing this is to include them at the top of your script, so they can easily be run again.

#Load libraries
library(sp)
library(maptools)
library(stringr)
library(rgeos)
library(classInt)
library(RColorBrewer)

The shapefiles for Leeds can be downloaded from the UK Data Service. We are going to download two sets of data - firstly the Output Areas (OAs), and secondly the census merged wards (CAS).

First, we need to download the spatial data. Usefully, the Output Area data is available with the OAC classification already joined, so we do not need to merge the two data sets ourselves.

  • Go to http://census.edina.ac.uk/ and select Boundary Data Selector.
  • Then set Country to England, Geography to Census, dates to 2011 and later, and click ‘Find’.
  • Select ‘English Census Output Areas with OAC, 2011’ and click ‘List Areas’.
  • Select ‘Leeds’ from the list and click ‘Extract Boundary Data’.
  • Click ‘Go to Bookmark Facility’ and you should be able to download the data after a 10 to 30 second wait.

These are the Output Area boundaries for Leeds, but we also need the ward boundaries. Repeat the process, selecting ‘English Census Merged Wards, 2011’ instead. Remember to name your zip files sensibly - the download facility will call both files ‘BoundaryData.zip’, which is not particularly helpful!

Extract the files, and move all the files whose names start with england_oac_2011 and england_cmwd_2011Polygon to your working folder. There should be 5 files for each data set - make sure you move them all. Then read in the data using these commands:

#Get 2011 OA for Leeds
OA <- readShapeSpatial("england_oac_2011",proj4string=CRS("+init=epsg:27700"))

#Get CAS wards for Leeds
CAS <- readShapeSpatial("england_cmwd_2011Polygon",proj4string=CRS("+init=epsg:27700"))

#Plot of Leeds
plot(OA)

We will then take a quick look at both the geometry and the attributes of the newly created object OA (using head(OA@data)). In the data slot, you will see a variety of attributes for each Output Area in Leeds, including an area id (code), the geodemographic super group, group and subgroup codes (oac_super_, oac_group and oac_sub_gr), a concatenated code of the nested geographies the OA sits within (label), and the names of the super groups, groups and sub groups (oac_super0, oac_group_ and oac_sub_g0).

#First six rows of data
head(OA@data)
     gid      code                                label name oac_super_
0 132913 E00057132 E08000035E02002390E01011345E00057132 <NA>          8
1 138532 E00057142 E08000035E02002390E01011343E00057142 <NA>          8
2 136971 E00057159 E08000035E02002389E01011346E00057159 <NA>          8
3 138638 E00057160 E08000035E02002390E01011345E00057160 <NA>          8
4 137325 E00057346 E08000035E02002346E01032502E00057346 <NA>          8
5 137101 E00057347 E08000035E02002346E01032502E00057347 <NA>          8
  oac_group oac_sub_gr          oac_super0          oac_group_
0        8d        8d2 Hard-Pressed Living Migration and Churn
1        8d        8d2 Hard-Pressed Living Migration and Churn
2        8d        8d2 Hard-Pressed Living Migration and Churn
3        8d        8d2 Hard-Pressed Living Migration and Churn
4        8d        8d2 Hard-Pressed Living Migration and Churn
5        8d        8d2 Hard-Pressed Living Migration and Churn
               oac_sub_g0
0 Hard-Pressed Ethnic Mix
1 Hard-Pressed Ethnic Mix
2 Hard-Pressed Ethnic Mix
3 Hard-Pressed Ethnic Mix
4 Hard-Pressed Ethnic Mix
5 Hard-Pressed Ethnic Mix

We will need the total population for each OA, which we will add now, following a similar process to that used in the previous practical.

The mid-year population estimates for 2013 are available from http://www.ons.gov.uk. Navigate to Data, and search for SAPE15DT10c. The only entry should be “Supporting information - SAPE15DT10c - Census Output Area Estimates in the Yorkshire and the Humber region of England, Mid-2013 (ZIP 3845Kb)”.

Download this file and unzip it. When you save the spreadsheet as a CSV, make sure you update the formatting in Microsoft Excel so that values over 1,000 do not contain commas (you can do this by setting the number format to ‘General’). Otherwise R will give you an error message (Error in rowSums) when you try to perform the calculations later on.

Use the read.csv function to read in the data from the file SAPE15DT10c-mid-2013-coa-unformatted-syoa-estimates-yorkshire-and-the-humber.csv into a variable called pop2013. Remember to add the code to your script.
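For example, a minimal sketch (adjust the file name to match your download):

#Read in the 2013 population estimates
pop2013 <- read.csv("SAPE15DT10c-mid-2013-coa-unformatted-syoa-estimates-yorkshire-and-the-humber.csv", header=TRUE)
#An alternative to reformatting in Excel: strip any commas and convert to numbers, e.g. for the All.Ages column
#pop2013$All.Ages <- as.numeric(gsub(",", "", pop2013$All.Ages))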

It should look like this once you have read it in:

head(pop2013)
     OA11CD  LSOA11CD All.Ages X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
1 E00037088 E01007317      381  6  6 12 11 14  4  6  6  6  9   6   5   6
2 E00037089 E01007317      323  5  3  2  5  5  2  6  4  4  3   3   2   4
3 E00037090 E01007317      269  5  4  0  5  4  4  0  5  4  2   2   5   4
4 E00037091 E01007317      237  2  1  4  1  2  0  3  1  2  1   1   1   1
5 E00037112 E01007317      293  6  3  6  5  6  2  4  5  2  5   5   4   3
6 E00037099 E01007318      386  7 14 11  6  6 11  7  6  3  6   6   7   4
  X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
1   2   5   7   8   6   7   5   8   5   5   8   3   4   8   7   5   7   8
2   3   6   4   2   2   4   5   3   3   3   0   4   7   3   2   4   3   4
3   3   5   3   3   6   3   1   7   3   3   3   2   3   6   3   1   0   3
4   2   2   4   1   1   3   0   2   3   1   3   1   2   2   2   1   2   2
5   6   3   2   4   6   4   6   3   4   4   3   1   7   6   1   4   2   3
6   3   4   3   3   2   1   4   3   3   3   8   4   6   9   7  11  11  11
  X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48
1   6   4   6   3   3   2   3   4   2   3   6  10   3   5   6   5   6   4
2   4   2   2   4   2   0   2   7   1   3   3   4   6   5   3   5   6   3
3   3   2   1   6   3   2   4   5   5   5   3   4   4   4   5   3   3   5
4   3   3   1   1   1   2   1   2   2   2   2   4   2   3   0   3   3   4
5   3   1   2   1   3   3   3   3   3   3   2   3   6   3   3   4   5   8
6   6  13  10  12  10   5   7   9   5   6   7   8   5   6  10   4   6  11
  X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 X61 X62 X63 X64 X65 X66
1   2   8   4   3   5   4   6   2   4   2   1   3   2   3   6   4   3   3
2   5   6   3   3   1   6   1   3   3   2   7   5   4   8  10   9   8   5
3   5   3   4   6   3   4   6   4   4   5   1   7   2   4   1   5   4   4
4   5   4   3   4   0   2   5   4   4   5   3   5   4   4   2   3   3  11
5   4   6   1   3   3   4   3   4   4   2   6   3   6   4   4   5   4   6
6   3   4   7   1   4   1   5   3   1   2   2   3   1   2   1   5   0   3
  X67 X68 X69 X70 X71 X72 X73 X74 X75 X76 X77 X78 X79 X80 X81 X82 X83 X84
1   2   2   3   1   0   0   2   2   2   0   0   3   3   0   1   2   1   2
2   3   6   7   7   5   3   7   3   1   2   2   5   4   0   2   0   1   0
3   3   1   1   6   2   4   3   1   3   0   2   0   0   0   0   0   2   0
4   5   4  10   7   3   2   0   3   3   5   3   2   8   4   5   4   2   0
5   1   3   6   1   1   2   3   5   0   1   2   2   3   2   1   2   1   1
6   1   0   1   1   0   1   1   0   0   1   1   0   0   0   0   0   0   0
  X85 X86 X87 X88 X89 X90
1   1   0   1   1   0   1
2   1   1   1   1   0   0
3   0   0   0   0   0   0
4   0   0   0   0   2   1
5   0   0   0   0   0   0
6   0   0   0   0   1   0

Now we will tidy up the data (pop2013) and then join this to the spatial data (OA). Previously we used the merge() function, but this does not always work with spatial data, so instead, we will use the match() function.

#keep columns we want
pop2013 <- pop2013[,1:3]
#rename columns
colnames(pop2013)[3] <- "TotalPop"
#join to spatial data
OA@data = data.frame(OA@data, pop2013[match(OA@data[,"code"], pop2013[,"OA11CD"]),])

The first six lines of OA@data should look like this:

head(OA@data)
     gid      code                                label name oac_super_
0 132913 E00057132 E08000035E02002390E01011345E00057132 <NA>          8
1 138532 E00057142 E08000035E02002390E01011343E00057142 <NA>          8
2 136971 E00057159 E08000035E02002389E01011346E00057159 <NA>          8
3 138638 E00057160 E08000035E02002390E01011345E00057160 <NA>          8
4 137325 E00057346 E08000035E02002346E01032502E00057346 <NA>          8
5 137101 E00057347 E08000035E02002346E01032502E00057347 <NA>          8
  oac_group oac_sub_gr          oac_super0          oac_group_
0        8d        8d2 Hard-Pressed Living Migration and Churn
1        8d        8d2 Hard-Pressed Living Migration and Churn
2        8d        8d2 Hard-Pressed Living Migration and Churn
3        8d        8d2 Hard-Pressed Living Migration and Churn
4        8d        8d2 Hard-Pressed Living Migration and Churn
5        8d        8d2 Hard-Pressed Living Migration and Churn
               oac_sub_g0    OA11CD  LSOA11CD TotalPop
0 Hard-Pressed Ethnic Mix E00057132 E01011345      305
1 Hard-Pressed Ethnic Mix E00057142 E01011343      201
2 Hard-Pressed Ethnic Mix E00057159 E01011346      325
3 Hard-Pressed Ethnic Mix E00057160 E01011345      295
4 Hard-Pressed Ethnic Mix E00057346 E01032502      350
5 Hard-Pressed Ethnic Mix E00057347 E01032502      284

The next data set you will download is the locations of all schools, taken from the Department for Education’s EduBase website.

Note this is a daily snapshot of the Department for Education database, so the exact file name and contents will vary; however, the process for reading in the data remains the same.

Save the CSV file into your working directory and read it in. We also need to trim the data set down to just the school name, type, status and location:

#read in CSV
schools <- read.csv("edubase20150208.csv",header=TRUE)
#select the columns we need
schools <- subset(schools,select=c("EstablishmentName","TypeOfEstablishment..name.","EstablishmentStatus..name.","Easting","Northing"))
#rename columns
colnames(schools) <- c("SchoolName","SchoolType","SchoolStatus","Easting","Northing")

Again, the data should look like this:

head(schools)
                                 SchoolName               SchoolType
1 Sir John Cass's Foundation Primary School   Voluntary Aided School
2           City of London School for Girls Other Independent School
3                St Paul's Cathedral School Other Independent School
4                     City of London School Other Independent School
5                  Sherborne Nursery School        LA Nursery School
6                       Thomas Coram Centre        LA Nursery School
  SchoolStatus Easting Northing
1         Open  533498   181201
2         Open  532301   181746
3         Open  532160   181151
4         Open  531981   180844
5       Closed  528515   184869
6         Open  530464   182403

4. Creating a spatial points data frame for schools within Leeds

You will notice that the schools data frame has the coordinates of each school in it - the eastings and the northings. The coverage of the school data can be explored using a basic plot of the x,y coordinates.

plot(schools$Easting,schools$Northing)

If RStudio stops working when you try to plot the data, it may be struggling with the size of the file. To solve this, load the EduBase CSV file in Excel and delete the columns you don’t need before reading it into RStudio.

You will see that the school locations cover the whole of England and Wales; as such, it is necessary to extract only those schools that lie within Leeds. We can do this using a point in polygon operation - i.e. identify those schools (points) which lie within the Leeds area (polygons). However, before we can do this, we must first convert the schools data frame into a SpatialPointsDataFrame. This is because at the moment R doesn’t know that this is spatial data - it just thinks it is a table.

The function SpatialPointsDataFrame has a number of parameters: coords, which accepts the x and y coordinates; data, which accepts the data frame containing attributes for each of the coordinates; and finally a CRS string giving the coordinate reference system for the points. There are codes for many different referencing systems; here we use EPSG:27700, which corresponds to British National Grid.

#Remove those schools without an Easting or a Northing
schools <- subset(schools, Easting != "" & Northing != "")

#Create a unique ID for each school
schools$schools_ID <- 1:nrow(schools)

#Create the SpatialPointsDataFrame
schools_SP <- SpatialPointsDataFrame(coords=schools[,c("Easting","Northing")],data=data.frame(schools$SchoolName,schools$schools_ID),proj4string=CRS("+init=epsg:27700"))

#Show the results
plot(schools_SP)

At first glance, this doesn’t look too different from the earlier plot - however, look in the Environment window (top right) and you will see that schools_SP has the type Large SpatialPointsDataFrame, whereas schools is just listed under Data, which means R treats it as a normal data frame.

Point in polygon - Create a spatially derived subset of the schools

The next stage of analysis is to perform a point in polygon operation using the over() function - this examines two spatial objects and returns a set of attributes that depends on whether the objects are points, lines or polygons. In the case of points and polygons, the returned object is a data frame of the attributes of the polygon in which each point resides. The returned data frame therefore has the same number of rows as the input spatial points data frame.

# point in polygon - returns a dataframe of the attributes of the polygons that the point is within.
o <- over(schools_SP,OA)
#Many of these will be NA values - because most schools are not in Leeds!
head(o)
  gid code label name oac_super_ oac_group oac_sub_gr oac_super0
1  NA <NA>  <NA> <NA>         NA      <NA>       <NA>       <NA>
2  NA <NA>  <NA> <NA>         NA      <NA>       <NA>       <NA>
3  NA <NA>  <NA> <NA>         NA      <NA>       <NA>       <NA>
4  NA <NA>  <NA> <NA>         NA      <NA>       <NA>       <NA>
5  NA <NA>  <NA> <NA>         NA      <NA>       <NA>       <NA>
6  NA <NA>  <NA> <NA>         NA      <NA>       <NA>       <NA>
  oac_group_ oac_sub_g0 OA11CD LSOA11CD TotalPop
1       <NA>       <NA>   <NA>     <NA>       NA
2       <NA>       <NA>   <NA>     <NA>       NA
3       <NA>       <NA>   <NA>     <NA>       NA
4       <NA>       <NA>   <NA>     <NA>       NA
5       <NA>       <NA>   <NA>     <NA>       NA
6       <NA>       <NA>   <NA>     <NA>       NA
#Add the attributes back onto the schools_SP SpatialPointsDataFrame
schools_SP@data <- cbind(schools_SP@data,o)
#Use the NA values to remove those points not within Leeds
schools_SP_Leeds <- schools_SP[!is.na(schools_SP@data$label),]
#Map your results
plot(schools_SP_Leeds)

#View the data slot of the results
head(schools_SP_Leeds@data)
                     schools.SchoolName schools.schools_ID    gid
7798             Hunslet Nursery School               7798 138649
7799             The Armley Park Centre               7799 138361
7800             The Craven Road Centre               7800 138165
7801 Guiseley Infant and Nursery School               7801 137461
7802   Rawdon Littlemoor Primary School               7802 138635
7803    Yeadon South View Junior School               7803 137013
          code                                label name oac_super_
7798 E00057830 E08000035E02006876E01011468E00057830 <NA>          7
7799 E00056874 E08000035E02002400E01011284E00056874 <NA>          4
7800 E00169615 E08000035E02002384E01011670E00169615 <NA>          2
7801 E00056776 E08000035E02002337E01011279E00056776 <NA>          5
7802 E00056761 E08000035E02002343E01011278E00056761 <NA>          5
7803 E00056823 E08000035E02002340E01011276E00056823 <NA>          5
     oac_group oac_sub_gr                  oac_super0
7798        7c        7c2   Constrained City Dwellers
7799        4a        4a2 Multicultural Metropolitans
7800        2a        2a1               Cosmopolitans
7801        5b        5b2                   Urbanites
7802        5a        5a1                   Urbanites
7803        5b        5b3                   Urbanites
                           oac_group_                  oac_sub_g0
7798                White Communities  Constrained Young Families
7799             Rented Family Living Social Renting New Arrivals
7800           Students Around Campus     Student Communal Living
7801              Ageing Urban Living         Communal Retirement
7802 Urban Professionals and Families         White Professionals
7803              Ageing Urban Living  Self-Sufficient Retirement
        OA11CD  LSOA11CD TotalPop
7798 E00057830 E01011468      192
7799 E00056874 E01011284      458
7800 E00169615 E01011670      870
7801 E00056776 E01011279      311
7802 E00056761 E01011278      251
7803 E00056823 E01011276      315

Estimating the population of Internet non-users

An estimated population of Internet non-users in each OA can be calculated using the index scores given in Table 1. The aim is to adjust the national rate of Internet non-use (22%) up or down based on the OAC Group characteristics of the area, and then calculate an estimated count of the number of people who haven’t used the Internet.

The 22% figure is based on the Oxford Internet Survey, Cultures of the Internet: The Internet in Britain (http://oxis.oii.ox.ac.uk/wp-content/uploads/2014/11/OxIS-2013.pdf), p. 4: “In 2013, 78% of the UK population said that they use the Internet.”

You now need to create a new data frame object called OAC_non_users_rates that will contain these rates. You can create a data frame using the data.frame() function, which accepts a series of vectors as its input. For example…

#Character strings use ""
forename <- c("Alex","John","Peter","Eric")
surname <- c("Smith","Peters","Jones","Mount")
#Numbers do not use quotes
age <- c(10,15,67,50)
#Create a data frame
people <- data.frame(forename,surname,age)
#This should look like this if printed to the terminal
people
  forename surname age
1     Alex   Smith  10
2     John  Peters  15
3    Peter   Jones  67
4     Eric   Mount  50

You now need to create a lookup data frame for the index scores that contains:

  • An ID value from 1 to 26 - column called ID (this will be used later for mapping!)
  • The OAC Group code (e.g. 1a, 1b, …) - column called oac_names
  • The index score for Internet non-use (see Table 1) - column called non_users_index

If you have done this correctly, it will look like this:

   ID oac_names non_users_index
1   1        1a          121.57
2   2        1b          102.74
3   3        1c          119.55
4   4        2a           52.33
5   5        2b           30.70
6   6        2c           59.17
7   7        2d           44.15
8   8        3a           84.06
9   9        3b           76.34
10 10        3c           93.11
11 11        3d           84.38
12 12        4a           90.22
13 13        4b           75.72
14 14        4c           87.40
15 15        5a           79.14
16 16        5b          105.20
17 17        6a           97.73
18 18        6b          106.59
19 19        7a          112.09
20 20        7b          139.81
21 21        7c          129.61
22 22        7d          188.37
23 23        8a          109.98
24 24        8b           99.46
25 25        8c          124.80
26 26        8d           99.25

The next stage is to join these data onto the data slot of the OA variable. This code uses the match() function to find which rows of OAC_non_users_rates match those in OA@data; the matches are made using the OAC group code columns in both objects. If you print the first few rows of the data slot of the OA object, you will see that it now contains the columns you created above.

#Join non-user rates onto OA
OA@data = data.frame(OA@data, OAC_non_users_rates[match(OA@data[,"oac_group"], OAC_non_users_rates[,"oac_names"]),])
#check output
head(OA@data)
     gid      code                                label name oac_super_
0 132913 E00057132 E08000035E02002390E01011345E00057132 <NA>          8
1 138532 E00057142 E08000035E02002390E01011343E00057142 <NA>          8
2 136971 E00057159 E08000035E02002389E01011346E00057159 <NA>          8
3 138638 E00057160 E08000035E02002390E01011345E00057160 <NA>          8
4 137325 E00057346 E08000035E02002346E01032502E00057346 <NA>          8
5 137101 E00057347 E08000035E02002346E01032502E00057347 <NA>          8
  oac_group oac_sub_gr          oac_super0          oac_group_
0        8d        8d2 Hard-Pressed Living Migration and Churn
1        8d        8d2 Hard-Pressed Living Migration and Churn
2        8d        8d2 Hard-Pressed Living Migration and Churn
3        8d        8d2 Hard-Pressed Living Migration and Churn
4        8d        8d2 Hard-Pressed Living Migration and Churn
5        8d        8d2 Hard-Pressed Living Migration and Churn
               oac_sub_g0    OA11CD  LSOA11CD TotalPop ID oac_names
0 Hard-Pressed Ethnic Mix E00057132 E01011345      305 26        8d
1 Hard-Pressed Ethnic Mix E00057142 E01011343      201 26        8d
2 Hard-Pressed Ethnic Mix E00057159 E01011346      325 26        8d
3 Hard-Pressed Ethnic Mix E00057160 E01011345      295 26        8d
4 Hard-Pressed Ethnic Mix E00057346 E01032502      350 26        8d
5 Hard-Pressed Ethnic Mix E00057347 E01032502      284 26        8d
  non_users_index
0           99.25
1           99.25
2           99.25
3           99.25
4           99.25
5           99.25

You now need to calculate a new column in the OA object that transforms the index scores into a predicted rate of Internet non-use in each area. As noted earlier, the national average is around 22% of the population, which we adjust up or down depending on the OAC group the area is assigned to.

To do this, we can use the following equation, which transforms the index score into a predicted percentage rate:

\[ \text{predicted rate} = \frac{\text{OAC Group Index Score}}{100} \times 22 \]

If you remember from your previous work, you can use $ to refer to columns within your data, and also to create a new column. Thus…

OA@data$non_user_rate <- OA@data$non_users_index / 100 * 22

You now need to create another column in your OA@data called non_user_count, which is the estimated number of people who have not used the Internet. For this calculation, you should use the total number of people in the Output Area (i.e. TotalPop).
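Following the same pattern as the rate calculation, a minimal sketch (the rate is a percentage, so divide by 100):

OA@data$non_user_count <- OA@data$TotalPop * OA@data$non_user_rate / 100

If this has been done correctly, your OA@data should look like…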

head(OA@data)
     gid      code                                label name oac_super_
0 132913 E00057132 E08000035E02002390E01011345E00057132 <NA>          8
1 138532 E00057142 E08000035E02002390E01011343E00057142 <NA>          8
2 136971 E00057159 E08000035E02002389E01011346E00057159 <NA>          8
3 138638 E00057160 E08000035E02002390E01011345E00057160 <NA>          8
4 137325 E00057346 E08000035E02002346E01032502E00057346 <NA>          8
5 137101 E00057347 E08000035E02002346E01032502E00057347 <NA>          8
  oac_group oac_sub_gr          oac_super0          oac_group_
0        8d        8d2 Hard-Pressed Living Migration and Churn
1        8d        8d2 Hard-Pressed Living Migration and Churn
2        8d        8d2 Hard-Pressed Living Migration and Churn
3        8d        8d2 Hard-Pressed Living Migration and Churn
4        8d        8d2 Hard-Pressed Living Migration and Churn
5        8d        8d2 Hard-Pressed Living Migration and Churn
               oac_sub_g0    OA11CD  LSOA11CD TotalPop ID oac_names
0 Hard-Pressed Ethnic Mix E00057132 E01011345      305 26        8d
1 Hard-Pressed Ethnic Mix E00057142 E01011343      201 26        8d
2 Hard-Pressed Ethnic Mix E00057159 E01011346      325 26        8d
3 Hard-Pressed Ethnic Mix E00057160 E01011345      295 26        8d
4 Hard-Pressed Ethnic Mix E00057346 E01032502      350 26        8d
5 Hard-Pressed Ethnic Mix E00057347 E01032502      284 26        8d
  non_users_index non_user_rate non_user_count
0           99.25         21.84          66.60
1           99.25         21.84          43.89
2           99.25         21.84          70.96
3           99.25         21.84          64.41
4           99.25         21.84          76.42
5           99.25         21.84          62.01

5. Mapping OAC in Leeds

To illustrate the geography of OAC in Leeds we will first map this categorical variable. Earlier you created a variable called ID in OA@data. We can use these 1-26 scores to assign different colours from a list to each OA.

#Define a set of colours, one for each of the OAC groups
my_colour <- c("#00688B","#33A1C9","#87CEEB","#CDBE70","#FFEC8B","#6E8B3D","#A2CD5A","#CAFF70","#8B4C39","#CD7054","#FA8072","#BC8F8F","#848484","#B7B7B7","#DCDCDC","#5D478B","#9F79EE","#CD69C9","#CD96CD","#8B7E66","#CDBA96","#fb8072","#cc4c02","#fe9929","#fed98e","#ffffd4")
#Create a basic OAC choropleth map
plot(OA, col=my_colour[OA@data$oac_group], axes=FALSE,border = NA)
#Plot the wards on top of the OA
plot(CAS, axes=FALSE,border = "#6B6B6B",add=TRUE)
#Add on text labels for the wards
text(coordinates(CAS)[,1],coordinates(CAS)[,2],labels=CAS@data$name, cex=.7)
#Add the legend (the oac_names object was created earlier)
legend(x=412971, y=439516, legend=levels(OA@data$oac_group), fill=my_colour, bty="n", cex=.6, ncol=2)
#Add North Arrow
SpatialPolygonsRescale(layout.north.arrow(2), offset= c(444282,423158), scale = 2000, plot.grid=F)
#Add Scale Bar
SpatialPolygonsRescale(layout.scale.bar(), offset= c(436020,423158), scale= 5000, fill= c("white", "black"), plot.grid= F)
#Add text to scale bar 
text(436020,424158,"0km", cex=.6)
text(436020 + 2500,424158,"2.5km", cex=.6)
text(436020 + 5000,424158,"5km", cex=.6)
#Add a title
title("OAC Group Map of Leeds")

This map is not very attractive or easy to interpret as it has far too many categories. As such, you now need to recreate it using the more aggregate OAC SuperGroups rather than Groups. We will first create a new column called ID2, which corresponds to the number at the start of the OAC Group code; for example, Groups 6a and 6b would be in SuperGroup 6. We will use the substr() function, which extracts part of a character vector, and the as.numeric() function to turn the extracted character into a numeric value.

#The numbers 1,1 refer to the start and end point in the string to extract
OA@data$ID2 <- as.numeric(substr(OA@data$oac_group,1,1))

For your report you will need to include an OAC SuperGroup map of Leeds with the locations of the schools on it - remember, you can use add=TRUE to overplot new layers (see the previous example). For example, you could add points onto the previous map as follows. The pch= parameter specifies the symbol type - there are 25 different symbols, which you can select by changing the value 19 to any number from 1 to 25.

plot(schools_SP_Leeds,pch=19,cex=.4,col="#00688B")

You won’t be able to simply replicate the OAC Group code shown above; you will, for example, need to specify a vector of colours for the SuperGroups (one per SuperGroup), and a new set of names for them, which are:

  1. Rural Residents
  2. Cosmopolitans
  3. Ethnicity Central
  4. Multicultural Metropolitans
  5. Urbanites
  6. Suburbanites
  7. Constrained City Dwellers
  8. Hard-Pressed Living

Also, when you plot this, the legend will still be split over two columns. To change it back to one column, change the ncol=2 parameter in the legend() function to ncol=1.

There is a lot to do here, so do things one step at a time. Also remember to write your code in a script so it is easier to go back and edit when it doesn’t work as expected!
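To get you started, here is a minimal sketch of the core plotting calls (the Dark2 palette and black school symbols are just placeholders - remember you must choose your own colour scheme, and you will still need to add the north arrow, scale bar and title as in the earlier example):

#Eight colours, one per OAC SuperGroup
my_colour <- brewer.pal(8, "Dark2")
#SuperGroup names for the legend
supergroup_names <- c("Rural Residents","Cosmopolitans","Ethnicity Central","Multicultural Metropolitans","Urbanites","Suburbanites","Constrained City Dwellers","Hard-Pressed Living")
#Colour each OA by its SuperGroup number (the ID2 column created above)
plot(OA, col=my_colour[OA@data$ID2], axes=FALSE, border=NA)
#Overplot the schools
plot(schools_SP_Leeds, pch=19, cex=.4, col="#000000", add=TRUE)
#Single-column legend
legend("topleft", legend=supergroup_names, fill=my_colour, bty="n", cex=.6, ncol=1)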

You might aim to create a map which looks a little like the following…

[Example map: OAC SuperGroups in Leeds with school locations]

6. Mapping the estimated at-risk population in Leeds

Earlier you calculated the estimated frequency of people who would not have accessed the Internet within each Output Area. You can now map both the percentage rates of the target group (non_user_rate) and the number of people (non_user_count).

We will first map the percentage rates.

#set breaks for 6 categories - style="fisher" uses the Fisher-Jenks natural breaks method
breaks <- classIntervals(OA@data$non_user_rate,n=6,style="fisher")
#classIntervals returns an object with a number of different slots - we only need the break values, so we overwrite the object breaks as follows
breaks<- breaks$brks
#Select six colours from the palette Blues - remember you can see all the palettes by running display.brewer.all()
my_colour <- brewer.pal(6, "Blues")
#Create a basic choropleth map
plot(OA, col=my_colour[findInterval(OA@data$non_user_rate, breaks,all.inside=TRUE)], axes=FALSE,border = NA)
#Plot the wards on top of the OA
plot(CAS, axes=FALSE,border = "#6B6B6B",add=TRUE)
#Add the legend
legend(x=412144, y=430519, legend=leglabs(breaks,between=" to "), fill=my_colour, bty="n",cex=.7)
#Add North Arrow
SpatialPolygonsRescale(layout.north.arrow(2), offset= c(444282,423158), scale = 2000, plot.grid=F)
#Add Scale Bar
SpatialPolygonsRescale(layout.scale.bar(), offset= c(436020,423158), scale= 5000, fill= c("white", "black"), plot.grid= F)
#Add text to scale bar 
text(436020,424158,"0km", cex=.6)
text(436020 + 2500,424158,"2.5km", cex=.6)
text(436020 + 5000,424158,"5km", cex=.6)
#Add a title
title("Estimate rate of non Internet use in Leeds")

You now need to create another map for the non_user_count column; it should look something like…

[Example map: estimated number of Internet non-users in Leeds]

Remember to check your legend. More than 2 or 3 decimal places is too many, so we can round the break cut-off values. Try these different options by replacing the line breaks <- breaks$brks with one of those below, and see which you think is most appropriate:

breaks <- round(breaks$brks) #round at integer (whole numbers)
breaks <- round(breaks$brks, 1) #round at 0.1
breaks <- round(breaks$brks*2)/2 #easy way to round at 0.5
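Putting these pieces together, a minimal sketch for the count map (the Purples palette and whole-number rounding are just example choices):

#Breaks and colours for the count map
breaks <- classIntervals(OA@data$non_user_count, n=6, style="fisher")
breaks <- round(breaks$brks) #round to whole people
my_colour <- brewer.pal(6, "Purples")
#Choropleth of estimated non-user counts, with the wards on top
plot(OA, col=my_colour[findInterval(OA@data$non_user_count, breaks, all.inside=TRUE)], axes=FALSE, border=NA)
plot(CAS, axes=FALSE, border="#6B6B6B", add=TRUE)
legend(x=412144, y=430519, legend=leglabs(breaks, between=" to "), fill=my_colour, bty="n", cex=.7)
title("Estimated number of Internet non-users in Leeds")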

7. Estimating School IT Training Catchment Areas

There are many different ways in which school catchment areas could be estimated. In this example you will use a simple buffer method that assumes people using a school predominantly reside within a fixed Euclidean distance of it. Here we use 1608m (one mile), which corresponds to roughly a twenty-minute walk at an average speed of around 3mph (4.8km/h).

The rgeos package has a function called gBuffer() that can be used to create buffers around points, lines or polygon objects. In the following example we create a new SpatialPolygons object called schools_SP_Leeds_Buffers. This then needs to be converted into a SpatialPolygonsDataFrame object by joining the @data from schools_SP_Leeds onto schools_SP_Leeds_Buffers.

# calculate buffers
schools_SP_Leeds_Buffers <- gBuffer(schools_SP_Leeds, width=1608, byid=TRUE)
#Convert schools_SP_Leeds_Buffers into a SpatialPolygonsDataFrame (rather than SpatialPolygons) by joining on the data slot of the schools_SP_Leeds SpatialPointsDataFrame
schools_SP_Leeds_Buffers <- SpatialPolygonsDataFrame(schools_SP_Leeds_Buffers,schools_SP_Leeds@data)

We can then plot the schools and catchment areas on a map. You will notice the addition of a par() call before this plot; par() can be used to change many attributes of the plot window - in this case the background colour.

#Change the plot window background colour
par(bg = "#696969")
#cas wards
plot(CAS, axes=FALSE,col="#6E7B8B",border = "#CAE1FF")
#school locations
plot(schools_SP_Leeds,pch=19,cex=.4,col="#5CACEE",add=TRUE)
#catchment buffers
plot(schools_SP_Leeds_Buffers, axes=FALSE,col=NA,border = "#F0F0F0",add=TRUE)

Once you have plotted the map with a different background colour, you need to run this code to reset the background to the white we usually use:

#Reset background colour
par(bg = "#FFFFFF")

Estimating School Catchment Area Characteristics

To examine the characteristics of the estimated catchment areas you will now complete a further point in polygon operation, examining the Output Areas that lie under the buffer geography. However, before we can do this, we first need to convert the Output Area data into a set of points. Fortunately, there is a function, gCentroid(), in the rgeos package which will do this for us. We can then also copy over the attribute data from OA.

#load library
library(rgeos)
#extract the centroid of each OA
OA_Points_centroids <- gCentroid(OA, byid=TRUE)
#add the eastings and northings to the OA@data data frame
OA@data$easting <- OA_Points_centroids@coords[,1]
OA@data$northing <- OA_Points_centroids@coords[,2]
#turn these into coordinates, selecting the columns by name
coords <- OA@data[,c("easting","northing")]
#recreate the OA_Points spatial data frame with the data, including non_user_count and TotalPop
OA_Points <- SpatialPointsDataFrame(coords,data=data.frame(OA@data$non_user_count,OA@data$TotalPop),proj4string=CRS("+init=epsg:27700"))
#Plot a map
plot(OA_Points)

If you get an error message saying undefined columns selected, check that the column names match those you created above. Selecting the columns by name (as here) is safer than using numeric positions such as [20] and [21], which can differ between data sets.

We now need to implement another point in polygon operation; however, because we are interested in calculating some values for each catchment area, we first need to add an extra parameter to the over() function - returnList=TRUE. This returns a list of data frames, where each element of the list refers to one catchment area (i.e. school) and contains the values of the OAs which fall within that zone. This is a little different from the lists you have created previously, which were just lists of character strings.

#Create point in polygon list
o <- over(schools_SP_Leeds_Buffers,OA_Points,returnList=TRUE)
#View length of the list
length(o)
[1] 684
#If we examine the object o, we will also see that it comprises a list of data frames. The summary function tells you about an object - head is wrapped around it so only the first six elements are shown
head(summary(o))
     Length Class        Mode  
[1,] "2"    "data.frame" "list"
[2,] "2"    "data.frame" "list"
[3,] "2"    "data.frame" "list"
[4,] "2"    "data.frame" "list"
[5,] "2"    "data.frame" "list"
[6,] "2"    "data.frame" "list"
#View an item from the list (in this case sixth)
o[[6]]
     OA.data.non_user_count OA.data.TotalPop
67                    88.68              311
68                    85.26              299
79                    70.72              248
82                    71.00              249
83                    71.29              250
84                    63.02              221
119                   56.89              260
146                   63.67              291
159                   50.33              230
160                   53.00              226
161                   58.31              241
162                   72.90              315
163                   46.31              266
164                   41.61              239
165                   63.03              362
249                   71.52              305
250                   70.11              299
251                   57.69              246
262                   53.00              226
314                   68.47              292
315                   76.68              327
316                   60.97              260
317                   49.95              213
483                   73.31              303
485                   67.99              281
523                   84.44              349
524                   73.80              305
586                   51.29              212
594                   62.44              203
643                   79.26              338
720                   48.37              209
791                   74.99              324
802                   76.14              329
803                   62.03              268
882                   71.38              410
883                   43.00              247
884                   43.70              251
887                   85.14              489
890                   50.49              290
891                   36.04              207
892                   47.01              270
893                   41.44              238
894                   48.58              279
987                   82.37              300
988                   73.58              268
989                   74.68              272
1149                  72.93              334
1198                  68.71              293
1244                  79.35              289
1270                  69.46              253
1404                  40.10              171
1413                  66.19              286
1474                  93.33              398
1475                  66.13              282
1481                  98.49              420
1553                  48.54              207
1554                  58.62              250
1604                  44.22              254
1622                  50.84              292
2154                  62.78              292
2208                  74.89              343
2209                  48.04              220
2218                  71.40              327
2327                 109.41              264
2338                  59.95              274

The above results are the values from those OAs which fall within the catchment area of the sixth school in the schools_SP_Leeds_Buffers object; i.e.

schools_SP_Leeds_Buffers@data[6,]
                  schools.SchoolName schools.schools_ID    gid      code
7803 Yeadon South View Junior School               7803 137013 E00056823
                                    label name oac_super_ oac_group
7803 E08000035E02002340E01011276E00056823 <NA>          5        5b
     oac_sub_gr oac_super0          oac_group_                 oac_sub_g0
7803        5b3  Urbanites Ageing Urban Living Self-Sufficient Retirement
        OA11CD  LSOA11CD TotalPop
7803 E00056823 E01011276      315

We can illustrate this by plotting the results for a single school.

#Plot the buffer for the school
plot(schools_SP_Leeds_Buffers[6,],col="#D4D4D4")
#get a list of those OA centroids that are within this buffer using the rownames - make them numeric
row_o <- as.numeric(rownames(o[[6]]))
#Plot all the OA centroids
plot(OA_Points,pch=19,cex=.5,col="#7F7F7F",add=TRUE)
#Plot those OA centroids which are inside the buffer - i.e. we use the row_o to select the rows
plot(OA_Points[row_o,],pch=19,cex=.5,col="#FF4500",add=TRUE)

The next stage is to look at the results of the point in polygon analysis and calculate the characteristics of each school catchment. The first step is to use the lapply() function to apply a function across the list. In this case, for each column in the list elements (data frames), we calculate the sum (i.e. colSums). We create a new object called Internet_non_use - a list containing, for each school, the estimated non-user population and the total population.

Internet_non_use <- lapply(o, colSums)
#View the first six items from the list
Internet_non_use[1:6]
[[1]]
OA.data.non_user_count       OA.data.TotalPop 
                  2642                  11621 

[[2]]
OA.data.non_user_count       OA.data.TotalPop 
                  6719                  36467 

[[3]]
OA.data.non_user_count       OA.data.TotalPop 
                  6942                  48427 

[[4]]
OA.data.non_user_count       OA.data.TotalPop 
                  3532                  15751 

[[5]]
OA.data.non_user_count       OA.data.TotalPop 
                  3403                  15006 

[[6]]
OA.data.non_user_count       OA.data.TotalPop 
                  4199                  18267 

In order to join these data back onto your schools records (schools_SP_Leeds), we need to collapse the list back into a single data frame, which we can do using the following code.

#Collapse list
school_non_use_Data <- data.frame(matrix(unlist(Internet_non_use), nrow=length(Internet_non_use), byrow=T))
#Change the column names to something sensible
colnames(school_non_use_Data) <- c("Internet_non_users","Total_Pop")
#This should look like
head(school_non_use_Data)
  Internet_non_users Total_Pop
1               2642     11621
2               6719     36467
3               6942     48427
4               3532     15751
5               3403     15006
6               4199     18267

We can now join these records back onto the schools_SP_Leeds spatial data frame object and calculate a relative rate (i.e. the percentage of Internet non-users within each school catchment).

#Join - the ordering has not been changed so this is valid
schools_SP_Leeds@data <- cbind(schools_SP_Leeds@data,school_non_use_Data)
#Relative rate
schools_SP_Leeds@data$relative_rate <- schools_SP_Leeds@data$Internet_non_users / schools_SP_Leeds@data$Total_Pop * 100
#show data
head(schools_SP_Leeds@data)
                     schools.SchoolName schools.schools_ID    gid
7798             Hunslet Nursery School               7798 138649
7799             The Armley Park Centre               7799 138361
7800             The Craven Road Centre               7800 138165
7801 Guiseley Infant and Nursery School               7801 137461
7802   Rawdon Littlemoor Primary School               7802 138635
7803    Yeadon South View Junior School               7803 137013
          code                                label name oac_super_
7798 E00057830 E08000035E02006876E01011468E00057830 <NA>          7
7799 E00056874 E08000035E02002400E01011284E00056874 <NA>          4
7800 E00169615 E08000035E02002384E01011670E00169615 <NA>          2
7801 E00056776 E08000035E02002337E01011279E00056776 <NA>          5
7802 E00056761 E08000035E02002343E01011278E00056761 <NA>          5
7803 E00056823 E08000035E02002340E01011276E00056823 <NA>          5
     oac_group oac_sub_gr                  oac_super0
7798        7c        7c2   Constrained City Dwellers
7799        4a        4a2 Multicultural Metropolitans
7800        2a        2a1               Cosmopolitans
7801        5b        5b2                   Urbanites
7802        5a        5a1                   Urbanites
7803        5b        5b3                   Urbanites
                           oac_group_                  oac_sub_g0
7798                White Communities  Constrained Young Families
7799             Rented Family Living Social Renting New Arrivals
7800           Students Around Campus     Student Communal Living
7801              Ageing Urban Living         Communal Retirement
7802 Urban Professionals and Families         White Professionals
7803              Ageing Urban Living  Self-Sufficient Retirement
        OA11CD  LSOA11CD TotalPop Internet_non_users Total_Pop
7798 E00057830 E01011468      192               2642     11621
7799 E00056874 E01011284      458               6719     36467
7800 E00169615 E01011670      870               6942     48427
7801 E00056776 E01011279      311               3532     15751
7802 E00056761 E01011278      251               3403     15006
7803 E00056823 E01011276      315               4199     18267
     relative_rate
7798         22.74
7799         18.42
7800         14.33
7801         22.42
7802         22.68
7803         22.99

And now create a map - this one shows the number of non-users.

#cas wards
plot(CAS, axes=FALSE,col="#6E7B8B",border = "#AAAAAA")
#Generate a set of breaks and colours for the absolute population at risk
breaks <- classIntervals(schools_SP_Leeds@data$Internet_non_users,n=5,style="fisher")
breaks<- breaks$brks
my_colour <- brewer.pal(5, "OrRd")
#Plot the schools, colouring them by the breaks
plot(schools_SP_Leeds,pch=15,cex=.8,col=my_colour[findInterval(schools_SP_Leeds@data$Internet_non_users, breaks,all.inside=TRUE)],add=TRUE)
#Add legend
legend(x=412144, y=429519, legend=leglabs(breaks,between=" to "), fill=my_colour, bty="n",cex=.7)

In your report, you will need to think about whether the rate of non-use or the number of non-users is more appropriate for selecting the training session locations. A relative rate map might look something like…

[Example map: relative rate of Internet non-use by school catchment]

The various analyses that you have prepared during this tutorial will assist you when compiling your report. You may have to piece together different parts of the code to achieve this; however, examples of all the code that you need to complete your report are contained in this document.

As a final helper, if you want to display an ordered list, you can do so using the order() function.

#An example sorting a data frame by the easting column
head(schools[order(schools$Easting,decreasing = TRUE),])
#An example sorting a spatial points data frame by the SchoolName
schools_SP_Leeds_reordered <- schools_SP_Leeds[order(schools_SP_Leeds@data$schools.SchoolName,decreasing = TRUE),]
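A sketch that may help with Q2 - ranking the OAs by their estimated number of non-users (the selected columns are just examples):

#Top ten OAs by estimated number of Internet non-users
top10_OAs <- OA@data[order(OA@data$non_user_count, decreasing=TRUE), c("code","oac_group","non_user_count")][1:10,]
top10_OAs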

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/deed.en. Based upon Geodemographics and Social Marketing by Nick Bearman.