Using Census Data

This week we are going to be working with some population data from the 2011 Census. The learning outcomes for this practical are:
- Understand what Census data can be used for and some of its limitations
- Know how to import Census data into R
- Be able to manage this data in R, including merging data frames and calculating new values
- Be able to create a map in R using this Census data
- Know how to use loops in R to create multiple maps

We are going to be using the Projects and Scripts facilities within RStudio to do this. To begin, start a new project in RStudio. Click File > New Project… and follow the instructions to create a new, empty project. Details are available in the short video at https://stream.liv.ac.uk/4jhqzf4b on how to do this.

When you create a new project, RStudio will automatically set your working directory to the directory you created in the Wizard. To check what the working directory is, we can run:

#check the working directory
getwd()
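
If the working directory is not where you expect, you can change it with setwd(). The path below is just an illustration - replace it with the folder you created for your project.

#change the working directory (example path only - use your own project folder)
setwd("C:/Users/yourname/Documents/my-r-project")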

It will be useful to start writing your code into a script as you go through the practical. This allows you to save your commands and run them again if necessary. To create a new script, click File > New File > R Script. This will open a window at the top of RStudio, where you can type in your commands. Remember that you run commands from here by highlighting the commands you want to run and then clicking ‘Run’, or by pressing Ctrl-Enter. More information on scripts is available in the videos on Vital.

Some preliminaries on reading and manipulating data

Before we begin the practical, there are a couple of things to learn about reading and manipulating data. We will use these skills later on in the practical.

Firstly, what is a CSV file, and how can we get one into R? These files usually have the file extension “.csv” and are typically opened in spreadsheet software. Suppose we want to create a data set that contains a list of people who have come along to an R training workshop. This might look something like this:

Name Age Place
John 20 Liverpool
Helen 10 London
Mia 16 Liverpool
Carl 21 Manchester

Open Notepad and create a new text file, entering the following details as separate lines, with each column in the table separated by a comma, and each row on a new line (please note that the first row contains our desired headings):

Name,Age,Place
John,20,Liverpool
Helen,10,London
Mia,16,Liverpool
Carl,21,Manchester

Remember to add an extra blank line at the end of the file. Save the text file as test.csv in the working directory you created when you created a new project in RStudio. Next we will read in the CSV you created above.

#We will read this into a new data frame called test.
test <- read.csv('test.csv', header=TRUE)
#Now print test
test
   Name Age      Place
1  John  20  Liverpool
2 Helen  10     London
3   Mia  16  Liverpool
4  Carl  21 Manchester

Next, make a copy of test.csv in your working directory, rename the copy test2.csv and open it in Notepad. Add the following to the top of the file: “Here is some example text which we will use to demonstrate how to skip lines” and, on another line, “Here is another line of text”, so it looks as follows:

Here is some example text which we will use to demonstrate how to skip lines
Here is another line of text
Name,Age,Place
John,20,Liverpool
Helen,10,London
Mia,16,Liverpool
Carl,21,Manchester

Save the file and close it.

Next we will import this new .csv into R. Note that the file name changes to test2.csv, and we have to add an extra argument to the read.csv() function which tells R that the data we are interested in start after the first two lines.

test2 <- read.csv('test2.csv', header=TRUE, skip=2)

If you print the object test2 (i.e. type test2 into the console!) you will see it is now the same as test, because you skipped the two junk lines.

Next we will illustrate how we can create a subset of a data frame, add a column and calculate a new value. To create a subset of the test data frame, we can use the subset function - this can select certain rows, columns or a combination of the two.

For example…

#Select the name and age columns
subset(test,select=c("Name","Age"))
   Name Age
1  John  20
2 Helen  10
3   Mia  16
4  Carl  21
#Select rows where the place is Liverpool
subset(test,Place == "Liverpool")
  Name Age     Place
1 John  20 Liverpool
3  Mia  16 Liverpool
#Select the name and age columns, and where Place is Liverpool
subset(test,Place == "Liverpool",select=c("Name","Age"))
  Name Age
1 John  20
3  Mia  16

We will now repeat the last step, but this time store the result as a new object. You will see that nothing prints to the console.

#Select the name and age columns, but only where Place is equal to Liverpool (note that R uses a double equals sign for comparison)
test3 <- subset(test,Place == "Liverpool",select=c("Name","Age"))

We can add a new column called “diff100” to the data frame object test3. We do this using the $ symbol. To begin with, we will set the values of the column to NA (i.e. no value):

test3$diff100 <- NA

If you print test3, you will now see the new column, which will contain the value NA. However, what we actually want to do is find out how different each of the ages is from 100 (i.e. 100 - age). To perform this calculation we need to refer to the Age column in the data frame. We can use the $ again to achieve this. For example…

test3$diff100 <- 100 - test3$Age
#print test3
test3
  Name Age diff100
1 John  20      80
3  Mia  16      84

When printing variables, you can use square brackets to specify either rows, columns or both. For example:

test3[1,1]
[1] John
Levels: Carl Helen John Mia
test3[,1]
[1] John Mia 
Levels: Carl Helen John Mia

Additionally, you can use colnames to give you a list of columns within a data frame:

colnames(test3)
[1] "Name"    "Age"     "diff100"

and you can use:

test3$diff100 <- NULL

to delete specific columns.

OK, using what you have learnt in this section, you are now ready to play with some real data!

A couple of other important things! If you want to see the objects that you have created in the R workspace, you can look in the Environment tab in the top right of the window, or use the ls() function:

#List objects in the workspace
ls()

If you want to save your workspace - i.e. so you can open it up later on (like saving your work!) - you can use the save.image() function. This does the same as clicking Session > Save Workspace As… or clicking the save icon in the Environment tab. As you have created a project, RStudio will also ask you whether you want to save the workspace when you close the program.

#This will save all objects currently loaded in the workspace
save.image(file="myfile.RData")

After saving, you can re-load the objects contained in this saved file at any time (e.g. after closing RStudio) using the load command, using the Open icon in the Environment window, or by re-opening the project. The .RData file is stored in the working directory defined earlier.

#Remember that you will need to set your working directory before trying to load - otherwise R will not know where to find your .RData file
load("myfile.RData")

One other command that may be useful to know is how to access the help files. If you want help on a specific function, type ?, then the function name:

?load

This will provide more information on the function. If you don’t know what a function is called, type ??, then whatever it is that the function does:

??open

This will search the help files for the term you specified (‘open’ in this case).

Making a Map with Census Data

Now we are going to use the same technique as last week to create a map of some data from the 2011 Census. To start with, we need to download the data, and although there are other sources of these data, in this example we will use the http://www.neighbourhood.statistics.gov.uk website.

  • Navigate to the site and select the “Topics” link on the bottom left hand side of the page.
  • Click “Census” -> “2011 Census: Key Statistics”.
  • Then click the radio button that corresponds to “Age Structure, 2011 (KS102EW)” and press the “Next” button at the bottom right hand side of the page.
  • Then click the related radio button for “Download” and press the “Next” button on the bottom right hand side of the page.
  • Select the radio button for the 2011 Statistical Geography Hierarchy (North West).
  • Press the “Next” button on the bottom right hand side of the page, and on the following page, download the Excel file by clicking on the “Microsoft Excel [*.xls]” link.

This will be zipped, and once you have extracted the directory you will find it contains a variety of files. You are interested in the file named “KS102EW_2596_2011SOA_NW.xls”. Copy this to your working directory.

Open this file in Excel, and you can see there are a number of different tabs, covering different geographies. We are interested in the LSOA geographies, so select the LSOA tab and look at the data it contains. You can see that there are a series of lines at the top which do not contain the data or the headings for the columns. As such, we need to read this into R, skipping the unwanted lines and using the row containing “LSOA_CODE” as the column headings.

R can’t import Excel files directly (without additional packages), but it can import CSV files, so make sure you have the LSOA tab open in Excel, and then save this as a CSV file. Excel will tell you that you can only save the active sheet (i.e. the one currently open), which is fine. Follow the instructions and make sure you save the CSV file (using the default name of KS102EW_2596_2011SOA_NW.csv) in your working folder.

Read the file into R using the read.csv() command, storing the result as a variable called pop2011. You will have to work out the command yourself, but you can use the example we discussed earlier as a starting point.
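
If you get stuck, a command along the following lines should work. This is a sketch rather than the definitive answer: the skip value assumes the row containing LSOA_CODE sits below five lines of introductory text in the CSV, so check your own file and adjust skip (and the file name, if you changed it) to match.

#read in the Census data, skipping the introductory lines above the headings (adjust skip to match your file)
pop2011 <- read.csv("KS102EW_2596_2011SOA_NW.csv", header=TRUE, skip=5)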

Then run head(pop2011) to see that the data has been read in correctly.

head(pop2011)
   GOR_CODE   GOR_NAME   LA_CODE LA_NAME MSOA_CODE  MSOA_NAME LSOA_CODE
1 E12000002 North West E06000006  Halton E02002580 Halton 007 E01012367
2 E12000002 North West E06000006  Halton E02002576 Halton 003 E01012368
3 E12000002 North West E06000006  Halton E02002578 Halton 005 E01012369
4 E12000002 North West E06000006  Halton E02002580 Halton 007 E01012370
5 E12000002 North West E06000006  Halton E02002589 Halton 016 E01012371
6 E12000002 North West E06000006  Halton E02002589 Halton 016 E01012372
    LSOA_NAME AREA_METADATA  X DATA_VALUE DATA_VALUE.1 DATA_VALUE.2
1 Halton 007A            NA NA       1591          126          7.9
2 Halton 003A            NA NA       1938          138          7.1
3 Halton 005A            NA NA       1517           89          5.9
4 Halton 007B            NA NA       1743          135          7.7
5 Halton 016A            NA NA       1244           59          4.7
6 Halton 016B            NA NA       1139           41          3.6
  DATA_VALUE.3 DATA_VALUE.4 DATA_VALUE.5 DATA_VALUE.6 DATA_VALUE.7
1           55          3.5           32          2.0           71
2           70          3.6           39          2.0          108
3           56          3.7           24          1.6           90
4           53          3.0           31          1.8           67
5           40          3.2           18          1.4           63
6           31          2.7           16          1.4           36
  DATA_VALUE.8 DATA_VALUE.9 DATA_VALUE.10 DATA_VALUE.11 DATA_VALUE.12
1          4.5           14           0.9            41           2.6
2          5.6           23           1.2            35           1.8
3          5.9           28           1.8            43           2.8
4          3.8           18           1.0            37           2.1
5          5.1           16           1.3            25           2.0
6          3.2           14           1.2            20           1.8
  DATA_VALUE.13 DATA_VALUE.14 DATA_VALUE.15 DATA_VALUE.16 DATA_VALUE.17
1            46           2.9           129           8.1           141
2            42           2.2           124           6.4           145
3            46           3.0            84           5.5            84
4            49           2.8           158           9.1           210
5            25           2.0            72           5.8           110
6            14           1.2            61           5.4            72
  DATA_VALUE.18 DATA_VALUE.19 DATA_VALUE.20 DATA_VALUE.21 DATA_VALUE.22
1           8.9           311          19.5           276          17.3
2           7.5           354          18.3           353          18.2
3           5.5           285          18.8           371          24.5
4          12.0           355          20.4           303          17.4
5           8.8           258          20.7           310          24.9
6           6.3           189          16.6           299          26.3
  DATA_VALUE.23 DATA_VALUE.24 DATA_VALUE.25 DATA_VALUE.26 DATA_VALUE.27
1            74           4.7           136           8.5           102
2           102           5.3           193          10.0           147
3            89           5.9           113           7.4            89
4            89           5.1           122           7.0            66
5           113           9.1            83           6.7            42
6           127          11.2           152          13.3            51
  DATA_VALUE.28 DATA_VALUE.29 DATA_VALUE.30 DATA_VALUE.31 DATA_VALUE.32
1           6.4            26           1.6            11           0.7
2           7.6            40           2.1            25           1.3
3           5.9            18           1.2             8           0.5
4           3.8            33           1.9            17           1.0
5           3.4             9           0.7             1           0.1
6           4.5            14           1.2             2           0.2
  DATA_VALUE.33 DATA_VALUE.34
1          38.4            35
2          40.8            39
3          39.9            41
4          37.1            33
5          40.1            42
6          45.1            49

You can see that many of the variable names are missing or uninformative (DATA_VALUE, DATA_VALUE.1 and so on). The proper names were in the Excel spreadsheet, but spread over the first 5 rows, so they didn’t import into R.

We can rename the columns so that when we run the head command, R lists the correct names. This will also help us refer to the columns later on. Run the code below, which creates a new variable containing the names (newcolnames) and then applies them to the pop2011 data frame. Any column name ending in ‘c’ is a count (of people in that age group in that LSOA) and any column ending in ‘pc’ is a percentage. We have to use ‘pc’ because R doesn’t like symbols (e.g. %) in field names.

#create a new variable which contains the new variable names
newcolnames <- c("AllUsualResidentsc","Age0to4c","Age0to4pc","Age5to7c","Age5to7pc",
"Age8to9c","Age8to9pc","Age10to14c","Age10to14pc","Age15c","Age15pc","Age16to17c",
"Age16to17pc","Age18to19c","Age18to19pc","Age20to24c","Age20to24pc","Age25to29c",
"Age25to29pc","Age30to44c","Age30to44pc","Age45to59c","Age45to59pc","Age60to64c",
"Age60to64pc","Age65to74c","Age65to74pc","Age75to84c","Age75to84pc","Age85to89c",
"Age85to89pc","Age90andOverc","Age90andOverpc","MeanAge","MedianAge")
#apply these to pop2011 data frame
colnames(pop2011)[11:45] <- newcolnames

Now we have the correct column names for the data frame. It is also worth checking that they have been applied to the pop2011 data frame correctly.
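
For example, a quick check could be:

#check the new names have been applied
colnames(pop2011)
#or look at the first few rows again
head(pop2011)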

Later on, we are going to create some maps of the age structure of Liverpool. Currently we have 18 different age groups, which will make many different maps. What we need to do is to aggregate some of the groups together. We want to make 5 maps, showing these categories:
- Percentage of Population 0-14 Years Old
- Percentage of Population 15-29 Years Old
- Percentage of Population 30-44 Years Old
- Percentage of Population 45-64 Years Old
- Percentage of Population 65+ Years Old

There are two ways we can calculate these. The first is to type in each field we want to add manually:

#Add ages together manually to create a new field Age0to14c
pop2011$Age0to14c <- pop2011$Age0to4c + pop2011$Age5to7c + pop2011$Age8to9c + pop2011$Age10to14c

This works; use head(pop2011) to check the output. It should look like this:

head(pop2011)
   GOR_CODE   GOR_NAME   LA_CODE LA_NAME MSOA_CODE  MSOA_NAME LSOA_CODE
1 E12000002 North West E06000006  Halton E02002580 Halton 007 E01012367
2 E12000002 North West E06000006  Halton E02002576 Halton 003 E01012368
3 E12000002 North West E06000006  Halton E02002578 Halton 005 E01012369
4 E12000002 North West E06000006  Halton E02002580 Halton 007 E01012370
5 E12000002 North West E06000006  Halton E02002589 Halton 016 E01012371
6 E12000002 North West E06000006  Halton E02002589 Halton 016 E01012372
    LSOA_NAME AREA_METADATA  X AllUsualResidentsc Age0to4c Age0to4pc
1 Halton 007A            NA NA               1591      126       7.9
2 Halton 003A            NA NA               1938      138       7.1
3 Halton 005A            NA NA               1517       89       5.9
4 Halton 007B            NA NA               1743      135       7.7
5 Halton 016A            NA NA               1244       59       4.7
6 Halton 016B            NA NA               1139       41       3.6
  Age5to7c Age5to7pc Age8to9c Age8to9pc Age10to14c Age10to14pc Age15c
1       55       3.5       32       2.0         71         4.5     14
2       70       3.6       39       2.0        108         5.6     23
3       56       3.7       24       1.6         90         5.9     28
4       53       3.0       31       1.8         67         3.8     18
5       40       3.2       18       1.4         63         5.1     16
6       31       2.7       16       1.4         36         3.2     14
  Age15pc Age16to17c Age16to17pc Age18to19c Age18to19pc Age20to24c
1     0.9         41         2.6         46         2.9        129
2     1.2         35         1.8         42         2.2        124
3     1.8         43         2.8         46         3.0         84
4     1.0         37         2.1         49         2.8        158
5     1.3         25         2.0         25         2.0         72
6     1.2         20         1.8         14         1.2         61
  Age20to24pc Age25to29c Age25to29pc Age30to44c Age30to44pc Age45to59c
1         8.1        141         8.9        311        19.5        276
2         6.4        145         7.5        354        18.3        353
3         5.5         84         5.5        285        18.8        371
4         9.1        210        12.0        355        20.4        303
5         5.8        110         8.8        258        20.7        310
6         5.4         72         6.3        189        16.6        299
  Age45to59pc Age60to64c Age60to64pc Age65to74c Age65to74pc Age75to84c
1        17.3         74         4.7        136         8.5        102
2        18.2        102         5.3        193        10.0        147
3        24.5         89         5.9        113         7.4         89
4        17.4         89         5.1        122         7.0         66
5        24.9        113         9.1         83         6.7         42
6        26.3        127        11.2        152        13.3         51
  Age75to84pc Age85to89c Age85to89pc Age90andOverc Age90andOverpc MeanAge
1         6.4         26         1.6            11            0.7    38.4
2         7.6         40         2.1            25            1.3    40.8
3         5.9         18         1.2             8            0.5    39.9
4         3.8         33         1.9            17            1.0    37.1
5         3.4          9         0.7             1            0.1    40.1
6         4.5         14         1.2             2            0.2    45.1
  MedianAge Age0to14c
1        35       284
2        39       355
3        41       259
4        33       286
5        42       180
6        49       124

Another way is to use rowSums() with a set of column numbers:

#Add together a set of columns, referring to them by column number
pop2011$Age15to29c <- rowSums(pop2011[,c(20,22,24,26,28)])

This approach is a lot easier to type out, particularly if we have a large number of columns to add together (like the final category). You can use colnames(pop2011) to find out which column numbers to use. With this data set we are lucky, because the 30 to 44 years category already exists, so we don’t need to calculate it. Repeat this process for the last two age categories.
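
For reference, here is one possible answer for the last two count categories. This is a sketch that refers to the columns by the names we assigned earlier rather than by column number, so it does not depend on the exact column positions; either approach gives the same result.

#45 to 64 is the 45 to 59 count plus the 60 to 64 count
pop2011$Age45to64c <- pop2011$Age45to59c + pop2011$Age60to64c
#65 and over is the sum of the four oldest count columns
pop2011$Age65plusc <- rowSums(pop2011[,c("Age65to74c","Age75to84c","Age85to89c","Age90andOverc")])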

We also need to calculate the percentages. Here is the code for the first one:

#Calculate the proportion of the population aged 0 to 14
pop2011$Age0to14pc <- pop2011$Age0to14c / pop2011$AllUsualResidentsc

Work out the code for the other ones yourself. The data should look like this:

#Output of joined data
head(pop2011)
   GOR_CODE   GOR_NAME   LA_CODE LA_NAME MSOA_CODE  MSOA_NAME LSOA_CODE
1 E12000002 North West E06000006  Halton E02002580 Halton 007 E01012367
2 E12000002 North West E06000006  Halton E02002576 Halton 003 E01012368
3 E12000002 North West E06000006  Halton E02002578 Halton 005 E01012369
4 E12000002 North West E06000006  Halton E02002580 Halton 007 E01012370
5 E12000002 North West E06000006  Halton E02002589 Halton 016 E01012371
6 E12000002 North West E06000006  Halton E02002589 Halton 016 E01012372
    LSOA_NAME AREA_METADATA  X AllUsualResidentsc Age0to4c Age0to4pc
1 Halton 007A            NA NA               1591      126       7.9
2 Halton 003A            NA NA               1938      138       7.1
3 Halton 005A            NA NA               1517       89       5.9
4 Halton 007B            NA NA               1743      135       7.7
5 Halton 016A            NA NA               1244       59       4.7
6 Halton 016B            NA NA               1139       41       3.6
  Age5to7c Age5to7pc Age8to9c Age8to9pc Age10to14c Age10to14pc Age15c
1       55       3.5       32       2.0         71         4.5     14
2       70       3.6       39       2.0        108         5.6     23
3       56       3.7       24       1.6         90         5.9     28
4       53       3.0       31       1.8         67         3.8     18
5       40       3.2       18       1.4         63         5.1     16
6       31       2.7       16       1.4         36         3.2     14
  Age15pc Age16to17c Age16to17pc Age18to19c Age18to19pc Age20to24c
1     0.9         41         2.6         46         2.9        129
2     1.2         35         1.8         42         2.2        124
3     1.8         43         2.8         46         3.0         84
4     1.0         37         2.1         49         2.8        158
5     1.3         25         2.0         25         2.0         72
6     1.2         20         1.8         14         1.2         61
  Age20to24pc Age25to29c Age25to29pc Age30to44c Age30to44pc Age45to59c
1         8.1        141         8.9        311        19.5        276
2         6.4        145         7.5        354        18.3        353
3         5.5         84         5.5        285        18.8        371
4         9.1        210        12.0        355        20.4        303
5         5.8        110         8.8        258        20.7        310
6         5.4         72         6.3        189        16.6        299
  Age45to59pc Age60to64c Age60to64pc Age65to74c Age65to74pc Age75to84c
1        17.3         74         4.7        136         8.5        102
2        18.2        102         5.3        193        10.0        147
3        24.5         89         5.9        113         7.4         89
4        17.4         89         5.1        122         7.0         66
5        24.9        113         9.1         83         6.7         42
6        26.3        127        11.2        152        13.3         51
  Age75to84pc Age85to89c Age85to89pc Age90andOverc Age90andOverpc MeanAge
1         6.4         26         1.6            11            0.7    38.4
2         7.6         40         2.1            25            1.3    40.8
3         5.9         18         1.2             8            0.5    39.9
4         3.8         33         1.9            17            1.0    37.1
5         3.4          9         0.7             1            0.1    40.1
6         4.5         14         1.2             2            0.2    45.1
  MedianAge Age0to14c Age15to29c Age45to64c Age65plusc Age0to14pc
1        35       284        371        350        275     0.1785
2        39       355        369        455        405     0.1832
3        41       259        285        460        228     0.1707
4        33       286        472        392        238     0.1641
5        42       180        248        423        135     0.1447
6        49       124        181        426        219     0.1089
  Age15to29pc Age45to64pc Age65pluspc
1      0.2332      0.2200      0.1728
2      0.1904      0.2348      0.2090
3      0.1879      0.3032      0.1503
4      0.2708      0.2249      0.1365
5      0.1994      0.3400      0.1085
6      0.1589      0.3740      0.1923
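
For reference, the remaining columns follow the same pattern; here is one possible answer (as with the example above, these values are proportions of all usual residents, matching the output shown).

#Calculate the remaining shares in the same way
pop2011$Age15to29pc <- pop2011$Age15to29c / pop2011$AllUsualResidentsc
pop2011$Age45to64pc <- pop2011$Age45to64c / pop2011$AllUsualResidentsc
pop2011$Age65pluspc <- pop2011$Age65plusc / pop2011$AllUsualResidentsc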

If you wish to, you can also look at the population data for 2013, calculate the change between 2011 and 2013, and map this. Details of this optional exercise are at the end of the handout.

Spatial Data

Now that we have the attribute data (in this case, the percentage of people in each age group in each LSOA), we need to join it to the spatial data. First, therefore, we need to download the spatial data.

  • Go to http://census.edina.ac.uk/ and select Boundary Data Selector.
  • Then set Country to England, Geography to Statistical Building Block, dates to 2011 and later, and click ‘Find’.
  • Select ‘English Lower Layer Super Output Areas, 2011’ and click ‘List Areas’.
  • Select ‘Liverpool’ from the list and click ‘Extract Boundary Data’.
  • Click ‘Go to Bookmark Facility’ and you should be able to download the data after a 10 to 30 second wait.

Extract the files, and move all the files starting with the name england_lsoa_2011Polygon to your working folder. There should be 5 files - make sure you move them all. Then read in the data using these commands:

#load libraries
library(maptools)
Loading required package: sp
Checking rgeos availability: TRUE
#read in shapefile
LSOAs <- readShapeSpatial("england_lsoa_2011Polygon")

As before, we can use the plot command to preview the data. Try plot(LSOAs). We can also apply the same technique with head() to the attribute table. Remember we need to use head(LSOAs@data) because this is a spatial data frame, and its attribute table is stored in the @data slot.
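
For reference, those two commands are:

#preview the LSOA boundaries
plot(LSOAs)
#look at the first few rows of the attribute table
head(LSOAs@data)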

The next stage is to join the attribute data to the spatial data.

#join attribute data to LSOAs
LSOAs@data <- merge(pop2011,LSOAs@data,by.x="LSOA_CODE",by.y="code")

And use the head command to check it has joined correctly…

head(LSOAs@data)
  LSOA_CODE  GOR_CODE   GOR_NAME   LA_CODE   LA_NAME MSOA_CODE
1 E01006512 E12000002 North West E08000012 Liverpool E02001377
2 E01006513 E12000002 North West E08000012 Liverpool E02006932
3 E01006514 E12000002 North West E08000012 Liverpool E02001383
4 E01006515 E12000002 North West E08000012 Liverpool E02001383
5 E01006518 E12000002 North West E08000012 Liverpool E02001390
6 E01006519 E12000002 North West E08000012 Liverpool E02001402
      MSOA_NAME      LSOA_NAME AREA_METADATA  X AllUsualResidentsc
1 Liverpool 031 Liverpool 031A            NA NA               1880
2 Liverpool 060 Liverpool 060A            NA NA               2941
3 Liverpool 037 Liverpool 037A            NA NA               2108
4 Liverpool 037 Liverpool 037B            NA NA               1208
5 Liverpool 044 Liverpool 044A            NA NA               1696
6 Liverpool 056 Liverpool 056A            NA NA               1286
  Age0to4c Age0to4pc Age5to7c Age5to7pc Age8to9c Age8to9pc Age10to14c
1       98       5.2       43       2.3       15       0.8         35
2       23       0.8       11       0.4        6       0.2         12
3       39       1.9       13       0.6       12       0.6         38
4       54       4.5       42       3.5       14       1.2         63
5       92       5.4       45       2.7       25       1.5         92
6       70       5.4       38       3.0       21       1.6         86
  Age10to14pc Age15c Age15pc Age16to17c Age16to17pc Age18to19c Age18to19pc
1         1.9      6     0.3         34         1.8        124         6.6
2         0.4      1     0.0         14         0.5        624        21.2
3         1.8      3     0.1         13         0.6        196         9.3
4         5.2     12     1.0         18         1.5         35         2.9
5         5.4     17     1.0         42         2.5         46         2.7
6         6.7     16     1.2         46         3.6         28         2.2
  Age20to24c Age20to24pc Age25to29c Age25to29pc Age30to44c Age30to44pc
1        615        32.7        332        17.7        355        18.9
2       1366        46.4        393        13.4        279         9.5
3        586        27.8        271        12.9        474        22.5
4        141        11.7        139        11.5        272        22.5
5        120         7.1        108         6.4        320        18.9
6         56         4.4         50         3.9        225        17.5
  Age45to59c Age45to59pc Age60to64c Age60to64pc Age65to74c Age65to74pc
1        158         8.4         31         1.6         17         0.9
2        115         3.9         36         1.2         26         0.9
3        268        12.7         66         3.1         81         3.8
4        178        14.7         58         4.8         80         6.6
5        350        20.6        101         6.0        142         8.4
6        335        26.0         97         7.5        112         8.7
  Age75to84c Age75to84pc Age85to89c Age85to89pc Age90andOverc
1         13         0.7          3         0.2             1
2         23         0.8          7         0.2             5
3         33         1.6          9         0.4             6
4         73         6.0         27         2.2             2
5        145         8.5         34         2.0            17
6         85         6.6         12         0.9             9
  Age90andOverpc MeanAge MedianAge Age0to14c Age15to29c Age45to64c
1            0.1    27.4        24       191       1111        189
2            0.2    25.3        21        52       2398        151
3            0.3    32.8        27       102       1069        334
4            0.2    37.8        33       173        345        236
5            1.0    41.9        42       254        333        451
6            0.7    41.9        45       215        196        432
  Age65plusc Age0to14pc Age15to29pc Age45to64pc Age65pluspc           name
1         34    0.10160      0.5910     0.10053     0.01809 Liverpool 031A
2         61    0.01768      0.8154     0.05134     0.02074 Liverpool 060A
3        129    0.04839      0.5071     0.15844     0.06120 Liverpool 037A
4        182    0.14321      0.2856     0.19536     0.15066 Liverpool 037B
5        338    0.14976      0.1963     0.26592     0.19929 Liverpool 044A
6        218    0.16719      0.1524     0.33593     0.16952 Liverpool 056A
                        label
1 E08000012E02001377E01006512
2 E08000012E02006932E01006513
3 E08000012E02001383E01006514
4 E08000012E02001383E01006515
5 E08000012E02001390E01006518
6 E08000012E02001402E01006519

Making Maps

Now we have the data prepared, we can make some maps. There are many different ways of doing this; we will use one of the simplest, and outline each step as we go along. In this example, we will make a map of the percentage of the population aged 0 to 14 years.

We need to load some more packages containing functions that will help us make some maps.

#load libraries
library("classInt")
library("RColorBrewer")

The first step is to find suitable breakpoints for the data contained in the Age0to14pc column. The continuous data need to be assigned to categories so that different colours can be applied on a choropleth map. There are numerous ways of doing this, such as Fisher-Jenks, standard deviations or equal intervals, as discussed in the lecture. In this example we use the classIntervals() function from the classInt package to find the ranges needed to divide the Age0to14pc values into six categories.

#set breaks for 6 categories and then add the values to a single list - style = fisher is jenks
breaks <- classIntervals(LSOAs@data$Age0to14pc,n=6,style="fisher")

#classIntervals returns an object with a number of different slots - however, we only need the break values; thus, we overwrite the object breaks as follows
breaks <- breaks$brks

We will now use another package called RColorBrewer, which provides a series of colour palettes that are suitable for mapping. You can have a look at the colour palettes online: http://colorbrewer2.org. Each of these palettes is named, and you can see all the available palettes as follows…

display.brewer.all()

We will now select a set of six colours from one of these palettes and store them in a new object called my_colours. If you print this object, you will see it contains six HEX codes for colours within the selected palette.

#Select six colours from the palette YlOrRd
my_colours <- brewer.pal(6, "YlOrRd")
#Print the colour codes to the terminal
my_colours
[1] "#FFFFB2" "#FED976" "#FEB24C" "#FD8D3C" "#F03B20" "#BD0026"

Another function called findInterval() can be used with the “breaks” object created earlier to divide the Age0to14pc column into aggregate groups. To view the intervals, just print the breaks object…

breaks
[1] 0.006619 0.066685 0.117807 0.153015 0.182719 0.222944 0.276984

Before we use these breaks to create a choropleth map, let us first have a look to see what the findInterval() function actually does.

The following code returns a list of numbers. Each number represents a row in the LSOAs@data Age0to14pc column - i.e. there will be the same number of values as there are rows in LSOAs@data.

The function compares the values of the Age0to14pc column against the break points you requested (stored in the “breaks” object) and returns the interval each value falls into as a number from 1 to 6 - i.e. a value in the first interval gets the number 1, and so on. all.inside=TRUE is an optional argument which ensures that the values at the extreme ends of the range are allocated correctly.

#print the intervals
findInterval(LSOAs@data$Age0to14pc, breaks,all.inside=TRUE)
  [1] 2 1 1 3 3 4 1 2 2 2 1 3 3 3 3 4 3 2 4 3 4 3 3 4 3 3 4 3 4 3 5 5 5 5 5
 [36] 2 1 1 6 2 2 5 1 4 3 5 4 3 4 4 3 4 3 4 4 6 4 5 3 4 4 5 5 3 4 5 3 4 3 3
 [71] 3 2 4 2 3 3 3 3 3 3 4 5 6 3 3 4 4 4 5 4 4 4 4 4 3 4 4 3 3 3 4 3 3 4 3
[106] 4 4 3 3 3 2 5 5 4 6 5 5 5 6 5 6 4 4 3 6 2 5 4 5 4 4 4 5 4 2 5 4 5 5 5
[141] 5 5 4 6 5 5 4 6 5 6 1 6 6 4 6 4 3 4 4 3 3 3 3 3 4 3 5 5 2 5 3 5 5 5 5
[176] 4 3 5 3 4 4 4 5 3 5 5 5 5 5 4 4 5 5 4 2 2 1 4 3 3 4 6 5 5 6 6 5 5 5 5
[211] 6 5 4 4 5 4 4 3 1 2 3 4 6 4 6 6 6 6 4 2 4 5 5 5 4 5 5 3 4 4 4 3 4 4 4
[246] 3 3 4 4 4 4 5 5 4 4 2 5 4 3 3 3 3 3 4 4 3 2 2 2 4 3 3 4 5 4 4 5 2 1 1
[281] 1 1 1 1 1 1 1 1 5 1 1 1 2 3 5 2 4 2

We can use this list of numbers from 1 to 6 to select specific colours from the my_colours object you created earlier.

By placing the output of the findInterval() function inside square brackets, we select items from the my_colours list by position. E.g. a number 1 selects the first colour, a 2 the second colour, etc.

#Returns a list of colours
my_colours[findInterval(LSOAs@data$Age0to14pc, breaks,all.inside=TRUE)]
  [1] "#FED976" "#FFFFB2" "#FFFFB2" "#FEB24C" "#FEB24C" "#FD8D3C" "#FFFFB2"
  [8] "#FED976" "#FED976" "#FED976" "#FFFFB2" "#FEB24C" "#FEB24C" "#FEB24C"
 [15] "#FEB24C" "#FD8D3C" "#FEB24C" "#FED976" "#FD8D3C" "#FEB24C" "#FD8D3C"
 [22] "#FEB24C" "#FEB24C" "#FD8D3C" "#FEB24C" "#FEB24C" "#FD8D3C" "#FEB24C"
 [29] "#FD8D3C" "#FEB24C" "#F03B20" "#F03B20" "#F03B20" "#F03B20" "#F03B20"
 [36] "#FED976" "#FFFFB2" "#FFFFB2" "#BD0026" "#FED976" "#FED976" "#F03B20"
 [43] "#FFFFB2" "#FD8D3C" "#FEB24C" "#F03B20" "#FD8D3C" "#FEB24C" "#FD8D3C"
 [50] "#FD8D3C" "#FEB24C" "#FD8D3C" "#FEB24C" "#FD8D3C" "#FD8D3C" "#BD0026"
 [57] "#FD8D3C" "#F03B20" "#FEB24C" "#FD8D3C" "#FD8D3C" "#F03B20" "#F03B20"
 [64] "#FEB24C" "#FD8D3C" "#F03B20" "#FEB24C" "#FD8D3C" "#FEB24C" "#FEB24C"
 [71] "#FEB24C" "#FED976" "#FD8D3C" "#FED976" "#FEB24C" "#FEB24C" "#FEB24C"
 [78] "#FEB24C" "#FEB24C" "#FEB24C" "#FD8D3C" "#F03B20" "#BD0026" "#FEB24C"
 [85] "#FEB24C" "#FD8D3C" "#FD8D3C" "#FD8D3C" "#F03B20" "#FD8D3C" "#FD8D3C"
 [92] "#FD8D3C" "#FD8D3C" "#FD8D3C" "#FEB24C" "#FD8D3C" "#FD8D3C" "#FEB24C"
 [99] "#FEB24C" "#FEB24C" "#FD8D3C" "#FEB24C" "#FEB24C" "#FD8D3C" "#FEB24C"
[106] "#FD8D3C" "#FD8D3C" "#FEB24C" "#FEB24C" "#FEB24C" "#FED976" "#F03B20"
[113] "#F03B20" "#FD8D3C" "#BD0026" "#F03B20" "#F03B20" "#F03B20" "#BD0026"
[120] "#F03B20" "#BD0026" "#FD8D3C" "#FD8D3C" "#FEB24C" "#BD0026" "#FED976"
[127] "#F03B20" "#FD8D3C" "#F03B20" "#FD8D3C" "#FD8D3C" "#FD8D3C" "#F03B20"
[134] "#FD8D3C" "#FED976" "#F03B20" "#FD8D3C" "#F03B20" "#F03B20" "#F03B20"
[141] "#F03B20" "#F03B20" "#FD8D3C" "#BD0026" "#F03B20" "#F03B20" "#FD8D3C"
[148] "#BD0026" "#F03B20" "#BD0026" "#FFFFB2" "#BD0026" "#BD0026" "#FD8D3C"
[155] "#BD0026" "#FD8D3C" "#FEB24C" "#FD8D3C" "#FD8D3C" "#FEB24C" "#FEB24C"
[162] "#FEB24C" "#FEB24C" "#FEB24C" "#FD8D3C" "#FEB24C" "#F03B20" "#F03B20"
[169] "#FED976" "#F03B20" "#FEB24C" "#F03B20" "#F03B20" "#F03B20" "#F03B20"
[176] "#FD8D3C" "#FEB24C" "#F03B20" "#FEB24C" "#FD8D3C" "#FD8D3C" "#FD8D3C"
[183] "#F03B20" "#FEB24C" "#F03B20" "#F03B20" "#F03B20" "#F03B20" "#F03B20"
[190] "#FD8D3C" "#FD8D3C" "#F03B20" "#F03B20" "#FD8D3C" "#FED976" "#FED976"
[197] "#FFFFB2" "#FD8D3C" "#FEB24C" "#FEB24C" "#FD8D3C" "#BD0026" "#F03B20"
[204] "#F03B20" "#BD0026" "#BD0026" "#F03B20" "#F03B20" "#F03B20" "#F03B20"
[211] "#BD0026" "#F03B20" "#FD8D3C" "#FD8D3C" "#F03B20" "#FD8D3C" "#FD8D3C"
[218] "#FEB24C" "#FFFFB2" "#FED976" "#FEB24C" "#FD8D3C" "#BD0026" "#FD8D3C"
[225] "#BD0026" "#BD0026" "#BD0026" "#BD0026" "#FD8D3C" "#FED976" "#FD8D3C"
[232] "#F03B20" "#F03B20" "#F03B20" "#FD8D3C" "#F03B20" "#F03B20" "#FEB24C"
[239] "#FD8D3C" "#FD8D3C" "#FD8D3C" "#FEB24C" "#FD8D3C" "#FD8D3C" "#FD8D3C"
[246] "#FEB24C" "#FEB24C" "#FD8D3C" "#FD8D3C" "#FD8D3C" "#FD8D3C" "#F03B20"
[253] "#F03B20" "#FD8D3C" "#FD8D3C" "#FED976" "#F03B20" "#FD8D3C" "#FEB24C"
[260] "#FEB24C" "#FEB24C" "#FEB24C" "#FEB24C" "#FD8D3C" "#FD8D3C" "#FEB24C"
[267] "#FED976" "#FED976" "#FED976" "#FD8D3C" "#FEB24C" "#FEB24C" "#FD8D3C"
[274] "#F03B20" "#FD8D3C" "#FD8D3C" "#F03B20" "#FED976" "#FFFFB2" "#FFFFB2"
[281] "#FFFFB2" "#FFFFB2" "#FFFFB2" "#FFFFB2" "#FFFFB2" "#FFFFB2" "#FFFFB2"
[288] "#FFFFB2" "#F03B20" "#FFFFB2" "#FFFFB2" "#FFFFB2" "#FED976" "#FEB24C"
[295] "#F03B20" "#FED976" "#FD8D3C" "#FED976"

We can then pull all this together to create a basic choropleth map using the plot() function. The first argument is the SpatialPolygonsDataFrame, followed by col=, which specifies how we want to colour the polygons. If we specified a single colour, e.g. "#FEB24C", all the polygons would be the same colour.

However, because we use the findInterval() function to generate a list of colours, each polygon is coloured with the colour associated with that area’s Age0to14pc value.

The other arguments in the plot() function are axes=FALSE, to remove the axes from the plot, and border = NA, to remove the borders from around the polygons.

#Create a basic choropleth map
plot(LSOAs, col=my_colours[findInterval(LSOAs@data$Age0to14pc, breaks,all.inside=TRUE)], axes=FALSE,border = NA)


It’s also good to add a legend so we know what the colours represent:

#Add the legend
legend(x=329936, y=386593, legend=leglabs(round(breaks,digits = 2),between=" to "), fill=my_colours, bty="n")

Now is an excellent time to put all of the mapping code into a script so we can change it and re-run it easily. Add the code below to a new R Script file, if you haven’t done this already (File > New File > R Script, as described earlier; more information on scripts is available in the videos on Vital).

#set breaks for 6 categories and then add the values to a single list - style = fisher is jenks
breaks <- classIntervals(LSOAs@data$Age0to14pc,n=6,style="fisher")
#classIntervals returns an object with a number of different slots - however, we only need the break values; thus, we overwrite the object breaks as follows
breaks <- breaks$brks
#select 6 colours
my_colours <- brewer.pal(6, "YlOrRd")
#plot map
plot(LSOAs, col=my_colours[findInterval(LSOAs@data$Age0to14pc, breaks,all.inside=TRUE)], axes=FALSE,border = NA)
#Add the legend
legend(x=329936, y=386593, legend=leglabs(round(breaks,digits = 2),between=" to "), fill=my_colours, bty="n")


Try experimenting with the code to alter the number of classes, the classification method (run ?classIntervals for details) and the colour.
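
For example, one possible variation might use five classes, quantile breaks and a different ColorBrewer palette. This is just a sketch - any style listed in ?classIntervals and any palette from display.brewer.all() with enough colours will work.

#five classes, quantile breaks and a blue palette
breaks <- classIntervals(LSOAs@data$Age0to14pc,n=5,style="quantile")
breaks <- breaks$brks
my_colours <- brewer.pal(5, "Blues")
plot(LSOAs, col=my_colours[findInterval(LSOAs@data$Age0to14pc, breaks,all.inside=TRUE)], axes=FALSE,border = NA)
legend(x=329936, y=386593, legend=leglabs(round(breaks,digits = 2),between=" to "), fill=my_colours, bty="n")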

Extra Map Bits (optional exercise)

Along with the legend, there are a few extra bits to add to the map to finish it off:

#Add North Arrow
SpatialPolygonsRescale(layout.north.arrow(2), offset= c(333770,380184), scale = 1000, plot.grid=F)
#Add Scale Bar
SpatialPolygonsRescale(layout.scale.bar(), offset= c(335379,380084), scale= 3000, fill= c("white", "black"), plot.grid= F)
#Add text to scale bar 
text(335379,380606,"0km", cex=.8)
text(335379 + 1500,380606,"1.5km", cex=.8)
text(335379 + 3000,380606,"3km", cex=.8)
#Add a title
title("Percentage Population Aged 0 to 14")

Sometimes the layout of the map in RStudio can be difficult to get right. Everything is positioned with a coordinate (a pair of six-figure numbers, e.g. 335379,380606). If you need to move something, simply change the coordinates. To find the correct coordinates you can use the locator tool: type locator() at the console, then click on the map where you want to place the object. Then click Finish, and locator() will print out the coordinates of the points you clicked. Experiment with this to rearrange the objects on the map.

Exporting to PDF and Looping (optional exercise)

One way of saving the map is to use the Export option in the plot window. We can also do this using code, by adding two lines: pdf(file="image.pdf") before the map code and dev.off() after the map code. Try it now, and R will save the map as a PDF in your working directory.

pdf(file="image.pdf")
  #set breaks for 6 categories and then add the values to a single list - style = fisher is jenks
  breaks <- classIntervals(LSOAs@data$Age0to14pc,n=6,style="fisher")
  #classIntervals returns an object with a number of different slots - however, we only need the break values; thus, we overwrite the object breaks as follows
  breaks <- breaks$brks
  #select 6 colours
  my_colours <- brewer.pal(6, "YlOrRd")
  #plot map
  plot(LSOAs, col=my_colours[findInterval(LSOAs@data$Age0to14pc, breaks,all.inside=TRUE)], axes=FALSE,border = NA)
  #Add the legend
  legend(x=329936, y=386593, legend=leglabs(round(breaks,digits = 2),between=" to "), fill=my_colours, bty="n")
  #Add North Arrow
  SpatialPolygonsRescale(layout.north.arrow(2), offset= c(333770,380184), scale = 1000, plot.grid=F)
  #Add Scale Bar
  SpatialPolygonsRescale(layout.scale.bar(), offset= c(335379,380084), scale= 3000, fill= c("white", "black"), plot.grid= F)
  #Add text to scale bar 
  text(335379,380606,"0km", cex=.8)
  text(335379 + 1500,380606,"1.5km", cex=.8)
  text(335379 + 3000,380606,"3km", cex=.8)
  #Add a title
  title("Percentage Population Aged 0 to 14")
dev.off()

Saving the map using code allows us to create multiple maps very easily. A variable (mapvariables) is used to list which variables should be mapped (using their column numbers), and then the line beginning with for starts a loop. Try running the code, and then change the variables it maps (remember colnames(LSOAs@data) will be useful, and you can map as many variables as you wish).

#setup variable with list of maps to print
mapvariables <- c(50,51,31)
#loop through for each map
for (i in 1:length(mapvariables)) {
  #setup file name
  filename <- paste0("map",colnames(LSOAs@data)[mapvariables[i]],".pdf")
  pdf(file=filename)
  #create map
  var <- LSOAs@data[,mapvariables[i]]
  #set breaks for 6 categories
  breaks <- classIntervals(var,n=6,style="fisher")
  breaks <- breaks$brks
  #select 6 colours
  my_colours <- brewer.pal(6, "YlOrRd")
  #plot map
  plot(LSOAs, col=my_colours[findInterval(var, breaks,all.inside=TRUE)], axes=FALSE,border = NA)
  #Add the legend
  legend(x=329936, y=386593, legend=leglabs(round(breaks,digits = 2),between=" to "), fill=my_colours, bty="n")
  #Add North Arrow
  SpatialPolygonsRescale(layout.north.arrow(2), offset= c(333770,380184), scale = 1000, plot.grid=F)
  #Add Scale Bar
  SpatialPolygonsRescale(layout.scale.bar(), offset= c(335379,380084), scale= 3000, fill= c("white", "black"), plot.grid= F)
  #Add text to scale bar 
  text(335379,380606,"0km", cex=.8)
  text(335379 + 1500,380606,"1.5km", cex=.8)
  text(335379 + 3000,380606,"3km", cex=.8)
  #Add a title
  title(colnames(LSOAs@data)[mapvariables[i]])
  dev.off()
}

Looking at population change (optional exercise)

This section outlines how you can obtain some 2013 population data, read it into R, calculate the change in population and map it.

The mid-year population estimates for 2013 are available from http://www.ons.gov.uk. Navigate to Data, and search for SAPE15DT2. The only entry should be “SAPE15DT2 - Lower Super Output Area Mid-Year Population Estimates, unformatted, Mid-2013 (ZIP 10849Kb)”. Download this file and unzip it. When you save this file as a CSV, make sure you update the formatting in Microsoft Excel so the values over 1000 do not have a comma in them (you can do this by setting the number format to ‘General’). Otherwise R will give you an error message (‘Error in rowSums’) when you try to add them up.

The field names represent the number of people at that age in each LSOA - e.g. field “50” (labelled “X50” in R) represents the number of people 50 years old. This data only provides the count of population (i.e. number of people). You will need to calculate the percentage of people in each age bracket, and then the percentage point change from 2011 to 2013. You can do this in Excel or R - whichever you prefer.

The first stage is to read the data into R, using read.csv(), followed by calculating the percentage of people in each age bracket. Then, if you decide to merge the two data sets in R, you can use the merge() function to do this. Some instructions in the first half of http://www.alex-singleton.com/R-Tutorial-Materials/4-attribute-joins.pdf will help with this.
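
As a rough sketch of the R route: assuming you have read the 2013 estimates into a data frame called pop2013, that it contains an LSOA code column named LSOA_CODE, and that you have calculated an Age0to14pc column from the 2013 counts (all of these names are assumptions - match them to your own file), the join and the change calculation could look like this:

#join the 2011 and 2013 data by LSOA code (column names here are assumptions - adjust to your data)
popchange <- merge(pop2011, pop2013, by="LSOA_CODE", suffixes=c("_2011","_2013"))
#change in the share aged 0 to 14 between 2011 and 2013
popchange$Age0to14chg <- popchange$Age0to14pc_2013 - popchange$Age0to14pc_2011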

Once you have all the data together, creating a map is the same process as you used earlier.

Formative Submission

This work will be formatively assessed - i.e. everyone needs to make a submission, and I will look over the submissions and provide some informal feedback in the lecture on Monday. I will also present a ‘model’ answer, and explain what makes it a model answer. This will provide useful experience for your summative assignments. Maps exported as PDF are very high quality, but can be tricky to import into Word (your mileage may vary with this). You may need to use the export option from RStudio to save the files as PNG. Experiment to see what the best option is, but as long as I can see the map clearly in the submission, I won’t penalise you.

Some points to remember:

  • Include at least 4 maps that show the population of Liverpool in 2011
  • Discuss why different age brackets show different patterns
  • Map and discuss the changes between 2011 and 2013
  • Discuss limitations
  • No need to include R code
  • 500 word limit
  • Use a different colour scheme to the practical example
  • Make sure your maps are not distorted
  • Include a legend

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/deed.en. Freely based upon Detecting Neighbourhood Change by Nick Bearman.