Getting Started With R

Aims of This Session:

The Very Basics

Firstly, you will need to create yourself a project directory in your storage area. Any files that you create during this exercise will be stored here. Do this in Windows.

Next, click on the icon, and the R console (a window) appears. Unlike most (or possibly all) of the software you may be used to, it is not usually controlled by clicking on menu items, forms, or buttons, but by typing text into this window.

The results are generally either graphs or text also printed into this window. R can initially be used as a calculator - enter the following:

6 + 8

Don't worry about the [1] for the moment - just note that R printed out 14 since this is the answer to the sum you typed in. In these tutorials, sometimes I show the results of what you have typed in. This is in the format shown below:

5 * 4
[1] 20

Also note that * is the symbol for multiplication here - the last command asked R to perform the calculation '5 times 4'. Other symbols are - for subtraction and / for division:

12 - 14
[1] -2
6/17
[1] 0.3529

R also has functions like square root, sine, cosine and so on. You can calculate these like this:

sqrt(25)
[1] 5
sqrt(2)
[1] 1.414

the examples above use the square root (sqrt) function. You can also combine these to make more complicated expressions:

sqrt(3 * 3 + 4 * 4)
[1] 5

You can also assign the answers of the calculations to variables, and use them in calculations. You do this as below

price <- 300

Here, the value 300 is stored in the variable price. The <- symbol means put the value on the right into the variable on the left - it is typed with a < followed by a -. This can be used in subsequent calculations. For example, to apply a 20% discount to this price, you could enter the following:

price - price * 0.2
[1] 240

or use intermediate variables

discount <- price * 0.2

price - discount
[1] 240

R can also work with lists of numbers, as well as individual ones. Lists are specified using the c function. Suppose you have a list of house prices listed in an estate agents, specified in thousands of pounds. You could store them in a variable called house.prices like this:

house.prices <- c(120, 150, 212, 99, 199, 299, 159)
house.prices
[1] 120 150 212  99 199 299 159

Note that there is no problem with full stops in the middle of variable names.

You can then apply functions to the lists. For example to take the average of a list, enter:

mean(house.prices)
[1] 176.9

If the house prices are in thousands of pounds, then this tells us that the mean house price is 176.9 thousand pounds. Note here that on your display, the answer may be displayed to more significant digits, so you may have something like 176.8571 as the mean value.

For the next exercise, try entering a larger data set (this time these are household burglaries per 10,000 households over a one month period) for 118 neighbourhoods in St. Helens:

burg.rates <- c(0, 7, 0, 0, 6, 19, 32, 0, 0, 0, 15, 6, 12, 8, 7, 6, 0, 0, 6, 
    0, 7, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 21, 7, 12, 7, 36, 18, 0, 0, 7, 6, 0, 
    0, 0, 0, 0, 13, 22, 0, 0, 0, 7, 12, 7, 5, 11, 0, 0, 13, 13, 0, 6, 15, 6, 
    17, 37, 0, 6, 6, 5, 24, 0, 0, 0, 0, 0, 0, 0, 5, 15, 0, 5, 6, 0, 0, 0, 13, 
    0, 6, 0, 0, 0, 23, 6, 13, 15, 6, 0, 0, 7, 7, 0, 0, 0, 0, 19, 13, 0, 0, 0, 
    6, 9, 0, 0, 0, 0, 0, 5)

Note that if an R expression has obviously not been completed and you hit return, then you can continue to type it on the following line. This carries on until R has worked out that the command is finished. In this case, that happens when the close brackets ) has been typed, and return has been hit afterwards. If you now enter the variable name (burg.rates) you can see all of the values listed:

burg.rates
  [1]  0  7  0  0  6 19 32  0  0  0 15  6 12  8  7  6  0  0  6  0  7  0  0
 [24]  0  0  0  0  0 17  0  0 21  7 12  7 36 18  0  0  7  6  0  0  0  0  0
 [47] 13 22  0  0  0  7 12  7  5 11  0  0 13 13  0  6 15  6 17 37  0  6  6
 [70]  5 24  0  0  0  0  0  0  0  5 15  0  5  6  0  0  0 13  0  6  0  0  0
 [93] 23  6 13 15  6  0  0  7  7  0  0  0  0 19 13  0  0  0  6  9  0  0  0
[116]  0  0  5

In this example, there are 100 different rates and the last command lists them - they are more neatly written out than when you entered them, and perhaps now you can see what the numbers in square brackets are used for. Essentially they show you the position in the list of the first number on each row - thus if a row begins with [24] this implies that the 24nd number in the list is at the start of this printed row. The main idea is to allow you to find positions in the list of higher numbers more easily. Note that if you want to find out more about the basic use of R, a helpful guide can be found here: http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf - in particular, pages up to and including 28 are very useful.

Simple Graphics

It is also possible to draw graphics using the data you have put in to the variables. This draws a histogram of the burglary data:

hist(burg.rates)

plot of chunk unnamed-chunk-13

By selecting the window it is possible to copy and paste the images into other documents, for example into the data encryption packages MS Word or Powerpoint or their open source alternatives.

Generally speaking in R, commands tend to give very basic plots, unless further details are provided. Thus to get a histogram with red bars, enter:

hist(burg.rates, col = "red")

plot of chunk unnamed-chunk-14

and to change the main title, xlab (ie x-label) and ylab (y-label) use:

hist(burg.rates, col = "red", main = "Burglaries per 1000 households", xlab = "Rate", 
    ylab = "Frequency")

plot of chunk unnamed-chunk-15

Now enter another related variable, the median house price (in 1000's of pounds) for a three-bedroom semi-detached house for each of the neighbourhoods.

house.prices <- c(200, 130, 200, 200, 180, 140, 65, 220, 180, 200, 210, 170, 
    180, 160, 180, 130, 240, 180, 170, 230, 150, 200, 200, 210, 220, 180, 200, 
    210, 150, 200, 230, 120, 180, 180, 190, 72, 80, 190, 220, 150, 200, 170, 
    170, 230, 200, 160, 140, 100, 140, 170, 180, 260, 170, 230, 190, 220, 140, 
    220, 120, 96, 210, 170, 180, 140, 150, 67, 200, 230, 140, 230, 83, 170, 
    200, 210, 240, 180, 200, 210, 250, 140, 130, 190, 110, 160, 150, 230, 160, 
    210, 200, 230, 210, 190, 120, 180, 87, 160, 190, 190, 230, 180, 110, 200, 
    250, 180, 200, 130, 180, 190, 190, 230, 210, 210, 150, 190, 210, 200, 210, 
    170)

As before, it is possible to draw a histogram of this variable:

hist(house.prices, col = "lightblue", main = "House Price", xlab = "1000s Pounds", 
    ylab = "Frequency")

plot of chunk unnamed-chunk-17

and also to create a scatter plot of the two variables, to see how median house price relates to burglary rate:

plot(burg.rates, house.prices, main = "Burglary vs. House Price", xlab = "Burglaries (per 1000 households)", 
    ylab = "Median House Price (1000s Pounds)")

plot of chunk unnamed-chunk-18

This shows that there is a relationship between the two quantities, although there is still a fair amount of randomness as well. The points show there is a general tendency for house prices to fall as burglary rate increases, but that there are other factors affecting house price as well.

The Data Frame

In the last section, you looked at two variables called house.prices and burg.rates and graphed their relationship. It is possible to combine the individual variables in a data frame - rather like an internal spreadsheet where all of the relavent data items are stored together as a set of columns. This is similar to the data set storage in SPSS (for those of you who have used that package) where each variable correspoinds to a column and each case (or observation) corresponds to a row. However, while SPSS can only have one data set active at a time, in R you can have several of them.

To create a data frame containing the last two variables enter

hp.data <- data.frame(Burglary = burg.rates, Price = house.prices)

Then type in its name to list it:

hp.data

Just to explain what has happened here: the function data.frame takes all of the variables that you wish to have as columns. The Burglary=burg.rates creates a column in the data frame called Burglary containing the values in the variable burg.rates in the last section. Similarly, it has a column called Price containing the values from house.prices. This new data frame is stored as a new object called hp.data (an object in R is similar to a variable, although it can be more complex - so it can contain more sophisticated things like data frames, not just a list of values). Typing in the name of the object (once it has been created) lists the values in the columns.

You can also enter

fix(hp.data)

to view this data frame in a window, and also edit values in the 'cells'. However, although it might be interesting to try using fix now, try it, but don't actually edit anything. To return to the R command line click on Quit in the data frame window. NOTE: it is important to do this, ortherwise you won't be able to type anything in to R.

You can also describe each column in the data set using the summary function:

summary(hp.data)
    Burglary         Price    
 Min.   : 0.00   Min.   : 65  
 1st Qu.: 0.00   1st Qu.:152  
 Median : 0.00   Median :185  
 Mean   : 5.64   Mean   :179  
 3rd Qu.: 7.00   3rd Qu.:210  
 Max.   :37.00   Max.   :260  

For each column, a number of values are listed:

Item Description
Min. The smallest value in the column
1st. Qu. The first quartile (the value ¼ of the way along a sorted list of values)
Median The median (the value ½ of the way along a sorted list of values)
Mean The average of the column
3rd. Qu. The third quartile (the value ¾ of the way along a sorted list of values)
Max. The largest value in the column

Between these numbers, an impression of the spread of values of each variable can be obtained. In particular it is possible to see that the median house price in St. Helens by neighbourhood ranges from 65,000 pounds to 260,000 pounds and that half of the prices lie between 152,500 pounds and 210,000 pounds. Also it can be seen that since the median measured burglary rate is zero, then at least half of areas had no burglaries in the month when counts were compiled.

Getting data in to R

There are lots of methods for getting data from some external file into R. One way is to enter it in directly, as in the last section, but this isn't practical for large datasets. It is also possible to read data in from text files (this will be done later) and to read it in from a source on the internet. Here, the internet method is demonstrated. In this case the data set is one made publicly available on my public dropbox folder. If you go follow the web link “http://dl.dropbox.com/u/7013997/ENVS257/hpdata.csv” you can see it in its 'raw' form. If you click on the link and then on the open option, the data will open in a read-only spreadsheet. Having seen the data set, close the spreadsheet and return to the R window.

The file is called a csv file - short for Comma Separated Variable - as in its 'raw' form it justs consists of several lines of variables separated by commas. Essentially this file contains the same data as the house price and burglary data you used earlier- although an extra column, with an ID number of each neighbourhood, is also included. You can read a csv file on the internet directly into a variable using the read.csv function in R. Here the contents of the file are read into a data frame object called hp.data2.

hp.data2 <- read.csv("http://dl.dropbox.com/u/7013997/ENVS257/hpdata.csv")

The function takes the URL of the file as an argument, and reads it in to the object on the left of the assignment (<-) symbol. As before, it is possible to list the data frame

hp.data2
     ID Burglary Price
1    21        0   200
2    24        7   130
3    31        0   200
4    32        0   200
5    78        6   180
6    80       19   140
7    81       32    65
8    98        0   220
9   100        0   180
10  101        0   200
11  102       15   210
12  111        6   170
13  112       12   180
14  113        8   160
15  114        7   180
16  115        6   130
17  116        0   240
18  117        0   180
19  118        6   170
20  119        0   230
21  120        7   150
22  121        0   200
23  122        0   200
24  123        0   210
25  124        0   220
26  125        0   180
27  126        0   200
28  127        0   210
29  128       17   150
30  614        0   200
31  620        0   230
32  621       21   120
33  622        7   180
34  624       12   180
35  625        7   190
36  626       36    72
37  627       18    80
38  628        0   190
39  629        0   220
40  764        7   150
41  765        6   200
42  766        0   170
43  767        0   170
44  768        0   230
45  769        0   200
46  833        0   160
47  834       13   140
48  835       22   100
49  836        0   140
50  837        0   170
51  838        0   180
52  839        7   260
53  840       12   170
54  841        7   230
55  842        5   190
56  843       11   220
57  844        0   140
58  845        0   220
59  846       13   120
60  847       13    96
61  848        0   210
62  849        6   170
63  850       15   180
64  851        6   140
65  852       17   150
66  853       37    67
67  854        0   200
68  855        6   230
69  856        6   140
70  857        5   230
71  858       24    83
72  859        0   170
73  860        0   200
74  861        0   210
75  862        0   240
76  863        0   180
77  864        0   200
78  865        0   210
79  866        5   250
80  867       15   140
81  868        0   130
82  869        5   190
83  870        6   110
84  871        0   160
85  872        0   150
86  873        0   230
87  874       13   160
88  875        0   210
89  876        6   200
90  877        0   230
91  878        0   210
92  879        0   190
93  880       23   120
94  881        6   180
95  882       13    87
96  883       15   160
97  884        6   190
98  885        0   190
99  886        0   230
100 887        7   180
101 888        7   110
102 889        0   200
103 890        0   250
104 891        0   180
105 892        0   200
106 893       19   130
107 894       13   180
108 895        0   190
109 896        0   190
110 897        0   230
111 898        6   210
112 899        9   210
113 900        0   150
114 901        0   190
115 902        0   210
116 903        0   200
117 904        0   210
118 905        5   170

and also to summarise it:

summary(hp.data2)
       ID         Burglary         Price    
 Min.   : 21   Min.   : 0.00   Min.   : 65  
 1st Qu.:616   1st Qu.: 0.00   1st Qu.:152  
 Median :846   Median : 0.00   Median :185  
 Mean   :654   Mean   : 5.64   Mean   :179  
 3rd Qu.:876   3rd Qu.: 7.00   3rd Qu.:210  
 Max.   :905   Max.   :37.00   Max.   :260  

Note that the second and third columns of this are the same as before – and also that the first column does not really make sense. Since it is an ID number, it is really just a identifying code, and it doesn't really matter what its minum value, or mean value might be - you might as well compute an average student ID number. For this reason, it is useful to ask R to drop this column from hp.data2 when computing the summary. This is fairly simple. Firstly note that you can use square brackets to pick out individual values in the data frame. For example to select the value in the 15th row of column 2, enter

hp.data2[15, 2]
[1] 7

and you can look at the full data frame to check that this is indeed 7. Also, for the columns, you can replace the column number for its name, provided it is in quotes:

hp.data2[15, "Burglary"]
[1] 7

Also, instead of just specifying a single row, it is possible to specify a range of rows:

hp.data2[10:15, "Burglary"]
[1]  0 15  6 12  8  7

This lists the burglary rates for neighbourhoods 10-15 in the data set.

If you want to specify a full column (ie all of the burglary rates), just leave the part where you would write the column range empty:

hp.data2[ ,"Burglary"]
  [1]  0  7  0  0  6 19 32  0  0  0 15  6 12  8  7  6  0  0  6  0  7  0  0
 [24]  0  0  0  0  0 17  0  0 21  7 12  7 36 18  0  0  7  6  0  0  0  0  0
 [47] 13 22  0  0  0  7 12  7  5 11  0  0 13 13  0  6 15  6 17 37  0  6  6
 [70]  5 24  0  0  0  0  0  0  0  5 15  0  5  6  0  0  0 13  0  6  0  0  0
 [93] 23  6 13 15  6  0  0  7  7  0  0  0  0 19 13  0  0  0  6  9  0  0  0
[116]  0  0  5

You can use a similar approach to select out a row of the data -

hp.data2[12, ]
    ID Burglary Price
12 111        6   170

This gives the ID, Burglary and Price (ie house price) values for the 12th neighbourhood in the list.

Another way of selecting out columns is to use the $ (dollar) approach. Entering

hp.data2$Price
  [1] 200 130 200 200 180 140  65 220 180 200 210 170 180 160 180 130 240
 [18] 180 170 230 150 200 200 210 220 180 200 210 150 200 230 120 180 180
 [35] 190  72  80 190 220 150 200 170 170 230 200 160 140 100 140 170 180
 [52] 260 170 230 190 220 140 220 120  96 210 170 180 140 150  67 200 230
 [69] 140 230  83 170 200 210 240 180 200 210 250 140 130 190 110 160 150
 [86] 230 160 210 200 230 210 190 120 180  87 160 190 190 230 180 110 200
[103] 250 180 200 130 180 190 190 230 210 210 150 190 210 200 210 170

extracts the column called Price and so on. This (and indeed the earlier ways of extracting columns) can be used in graphics commands, and so on. For example, a box-plot of the prices can be obtained by

boxplot(hp.data2$Price, col = "lightgreen")

plot of chunk unnamed-chunk-29

The box-plot is a graphical equivalent of the summaries seen earlier - the end-points of the longer lines show the maximum and minimum values, and the end points of the central box are the first and third quartiles. The line on the box marks the median.

Geographical Information

Until now, although this is geographical data, no maps have been drawn. In this section you will do this. Firstly, you need to load a new package into R - a package is an extra set of functionality that extends what R is capable of doing. Here, the new package is called maptools and it extends R by allowing it to draw maps, and handle geographical information. Firstly, you have to let R know you want to use the package - and install it:

install.packages("maptools", depend = TRUE, lib = getwd())

Packages are loaded via the library function:

library(maptools)
library(maptools, lib.loc = getwd())

However, this just makes R able to handle geographical data, it doesn't actually load any specific data sets. To do this, you'll need to obtain them from somewhere. Data for maps in R can be read in from shapefiles - these are a well known geographical information exchange format. Although the term 'shapefile' is used, in actuality a given data set consists of several files. For this reason, I will use the term 'shapefile set' to mean all of the files needed for a particular geographical data set. The first task is to obtain a shapefle set of the St. Helens neighbourhoods (or Lower Super Output Areas - LSOAs, as they are more formally called). To do this, you will need to put some information into the working directory you created earlier. The exact function you need to type in will depend on the name you have called your working directory. For example, my directory is called 'R Work' and is on the M: drive, so I type

setwd("M:/R work")

your version might well have a longer title. Also note that slashes are indicated with a '/' not '\'.

Later on in the practical, you will need to use some other packages - the following line of R will install them all now:

install.packages(c("rgeos", "RColorBrewer", "maptools", "classInt"), depend = TRUE, 
    lib = getwd())

There is a set of shapefiles for the St. Helens neighbourhoods at the same location as the data set you read in earlier. Since several files are needed, I have bundled these together in a single zipfile. You will now download this to your local folder and subsequrently unzip it. This can all be done via R functions:

download.file("http://dl.dropbox.com/u/7013997/ENVS257/sthel.zip", "sthel.zip")
unzip("sthel.zip")

The first function actually downloads the zip file into your working directory, the second one unzips it, creating the shapefile set. All of the shapefile set files begin with sthel but then have different endings, eg sthel.shp, sthel.dbx and sthel.shx. Now, these can be read in to a SpatialPolygons object.

sthel <- readShapeSpatial("sthel")

the readShapeSpatial function does this, and stores them into another type of object called a SpatialPolygons object. Recall that geographical data can be of type Point, Line or Polygon and that polygons are areas or regions, such as the neighbourhoods (LSOAs) in St. Helens. You can use the plot function to draw the polygons (ie the map of the LSOAs).

plot(sthel)

plot of chunk unnamed-chunk-36

Finally, a SpatialPolygonsDataFrame is the result of joining a SpatialPolygons object up with a data frame. The idea is that each polygon corresponds to one row of the data frame - i.e, the geographical description of the boundaries of each neighbourhood with the data about that neighbourhood. Here a SpatialPolygonsDataFrame is created by joining the sthel SpatialPolygons object to the hp.data2 data frame object. The result is assigned to a SpatialPolygonsDataFrame object called sthel.prices.

sthel.prices <- SpatialPolygonsDataFrame(sthel, hp.data2, match.ID = FALSE)

Finally you will draw a choropleth map of these house prices. In later exercises more will be explained as to how this works, but for now the method will simply be demonstrated. One useful feature of R is that it is possible to write new functions, and therefore extend its capabilities. The following function (called choro) is a simple choropleth map drawing tool:

choro <- function(spdframe, variable) {
    var <- spdframe@data[, variable]
    breaks <- classIntervals(var, n = 6, style = "fisher")$brk
    my_colours <- brewer.pal(6, "Greens")
    plot(spdframe, col = my_colours[findInterval(var, breaks, all.inside = TRUE)], 
        axes = FALSE, border = rgb(0.8, 0.8, 0.8))
    invisible(list(b = breaks, c = my_colours))
}

For now, just copy and paste this into R and hit return. Don't worry about trying to interpret it!

Note that it is important to make sure the upper and lower case letters you type in are exactly the same as the ones above. This is generally the case with R. If a window comes up asking where to install the library from, select one of the UK locations. When it has installed, just enter

library(classInt, lib.loc = getwd())

to load it. Also, load a further package for providing map shading palettes called RColorBrewer.

library(RColorBrewer)

Now it is possible to draw the choropleth map. Here we draw one of the burglary rates.

breaks <- choro(sthel.prices, "Burglary")

plot of chunk unnamed-chunk-42

The map shows the different burglary rates, with dark shading indicating a higher rate. However, this is much easier to interpret if you add a legend to the map. For now, just enter the following. As before, more about how all of the commands work will be disclosed later in the module.

legend(x = 357000, y = 392000, legend = leglabs(breaks$b), fill = breaks$c, 
    bty = "n")

plot of chunk unnamed-chunk-43

You can also add a title to the map:

title("Burglary Rates per 10,000 Homes in St. Helens")

plot of chunk unnamed-chunk-44

As a final useful technique you can copy and paste maps like this into word documents. Select the window with the map on it, and then right click on it. When the menu appears, select 'copy as bitmap'. If you also have MS Word up and running, you can then paste the map into your document.

Working with Open Government Data

The police.uk web site

In December of 2010 the police.uk web site was launched, providing details of geographical locations of crime in England and Wales. As well as providing an interface to allow people to view crimes in their (or other peoples') neighbourhoods, this also provided a means for downloading the crime data. Using this, combined with other data it is possible to obtain crime rate maps for anywhere in England and Wales, for a number of different classes of crime.

Mapping Crime Rates

Compare and Contrast Crime Rates

This section of the practical will examine the use of contemporary data about crime rates and population levels that can be merged to draw maps so that you can investigate hypotheses about geographical patterns of crime, for a number of different crime types, in the Merseyside region.

Looking at Crime Data

Firstly, you can use the police.uk web site to investigate local patterns in crime without making use of any other software. To do this, firstly go to the web site http://www.police.uk – this will show you the start-up page for this website – which, as stated earlier, provides a portal to maps of crime rates in localities in England and Wales – and looks something like this:

To try this out, enter the postcode L69 7ZT into the 'address, postcode or place name' box, and click on the 'search' button. This postcode corresponds to the Roxby Building, near Myrtle Street, on the edge of the campus. When you click on the button it will give you a new page with a number of smaller sections, including one titled 'Crime and ASB in this area'. Note: ASB = Anti-Social Behaviour, which does not always imply a literal criminal offence. This shows a map of the local area with a number of symbols, and a red marker showing the point corresponding approximately to the postcode centroid. If you click on this red marker, a new page with a bigger map appears, looking something like this:

Using the scale bar on the left of the map, you can 'zoom in' on the blue circle around the red marker. You can also ‘drag’ the map in the window (press the left mouse button while moving the mouse when the pointer is inside the map window) to pan around the area. The table on the left of the map gives you counts of the different kinds of crimes that have occurred in the most recent month. Clicking on any of the black circles gives a breakdown of different crime counts in an area. Clicking on the counts in the table makes the map only show points corresponding to the type of crime in the table.

Is the Web Site Enough?

Exploring the above maps gives some interesting insights into local crime. However, it also raises some questions. One issue is that although the circles with counts of crimes are centred on certain points on the map, these do not correspond exactly to where the crimes occurred. This is intentional on the part of the designers of this site, and preserves privacy (for example, the web site does not intend to publicise the exact addresses of houses that were burgled). The points where crimes are mapped either correspond to public places (eg train stations) or to the centre of a group of at least 12 private addresses. However, the choice of mapping technique has faced some criticism because it still uses a point-based approach – giving the impression that a crime (or several crimes) did occur at some very specific location, when in fact it did not – see for example figure below – there is no direct evidence that 12 incidents occured in exactly the location in Myrtle place as implied on the map – this is just the centre of the group of houses and other buildings. See http://bcsmaps.blogspot.com/2011/02/uk-police-maps-x-does-not-mark-spot.html for more discussion.

Using the Data With R

Since there are certain limitations to viewing the data directly on the web site, it is fortunate that another option is to download the raw data and display it in other ways, using R. To do this, note the ‘data’ menu on the map you have just created (on the blue bar above the map) and click on this. This takes you to a new page where data may be downloaded. Next, scroll down until you find the link for ‘Merseyside Police’, then click on the first of the three 'download' links. This gives a csv (comma separated variable) file for the street-level Merseyside police data – each line has details of each crime, including crime type, location and some other features, separated by commas. The download will probably put the file in your Downloads folder. Using folder one you created earlier - move the file named something like 2012-11-merseyside-street.zip to this folder (note: 2012-11 refers to the actual year and month that you download – this can change depending on the actual date when you do the practical – for the rest of these notes be aware that the 2012-11 part of the name may be replaced with another year and month) - and also that some of the maps or graphs you obtain will be slightly different to those on this web site.

Since the file is compressed to speed up the download (that is, zipped) you need to extract it first. Double click on the file, and you will see a new folder window appear containing file 2012-11-merseyside-street.csv. It is a good idea to extract it before using it in R. To do this, click on ‘extract’ in the top of the folder window. This gives a window to choose the folder into which the file should be extracted. Choose the working directory that you were using in the last exercise. Now, the data is in place and you need to start up R. Do this in the same way you did in the last exercise. When R is running enter the commands to set it to run in the directory you keep the data. Remember that to do this, the exact function you need to type in will depend on the name you have called your working directory. Again, from last time, my directory is called 'R Work' and is on the M: drive, so I type

setwd("M:/R work")

your version might well have a longer title. Also note that slashes are indicated with a '/' not '\'.

Now load the GISTools library (as you did before) and then read in the CSV file you downloaded with the crime data. Store it into a data frame called crimes.

require(maptools)
crimes <- read.csv("2012-11-merseyside-street.csv")

You can inspect the data frame in a window using fix(crimes) - but don't edit the values and remember that you have to close the window before typing anything else into R.

You will see that the data consists of a number of columns, each with a heading. Two of these are called Easting and Northing – these are the column headers that give the UK national grid coordinates of each incident (either a crime or an ASB event) in the data you have just downloaded. Another is headed Crime.Type and tells you which type of crime occurred. Note that the national grid references refer to the centres of small groups of households (or possibly public places) mentioned earlier.

At the moment, the data is just in a data.frame object - not any kind of spatial object. To create the spatial object, enter the following:

crime.pts <- SpatialPointsDataFrame(crimes[, 4:5], crimes[, -(4:5)])

This creates a SpatialPointsDataFrame object. The function takes two arguments - the first is columns 4 and 5 from the crimes data frame - these are the easting and northing of the crime location. The second argument is the data frame minus columns 4 and 5 - this is what -(4;5) indicates. These columns provide all the non-geographical data from the data frame. The resulting object crime.pts is a spatial points geographical shape object, whose points are each recorded crime in the data set you download. To see the geographical pattern of these crimes, enter:

plot(crime.pts, pch = ".", col = "darkred")

plot of chunk soddo

as you can see, you have now successfully obtained some crime data from police.uk and read it into R. The option pch='.' tells R to map each point with a full-stop character - and to make the colour of these dark red.

You can also examine the kinds of different crimes recorded.

table(crime.pts$Crime.type)

      Anti-social behaviour                    Burglary 
                       4541                        1190 
  Criminal damage and arson                       Drugs 
                       1657                         728 
                Other crime                 Other theft 
                        293                        1324 
Public disorder and weapons                     Robbery 
                        296                         128 
                Shoplifting               Vehicle crime 
                        746                         913 
              Violent crime 
                        998 

For a categorical column in a data frame (that is, one that attributes each location to a specific category, in this case 'crime type') the table function counts the number of times each category appears in the column. From this, it is possible to identify the relative frequencies with which each kind of crime occurs.

Obtaining a Layer of Lower Super Output areas

The next stage is to obtain some further map data. In particular, the point data you have just drawn still suffers from the 'X does not mark the spot' issue discussed earlier on. The next stage of this practical is to obtain some neighbourhood zone data and to aggregrate the crimes to this. These data are useful for a number of reasons:

  1. Go to the UKBorders web site http://edina.ac.uk/ukborders/ and try to log in with the login via UK federation option.
  2. If it asks you to register, see the following section “If you need to register for UKBorders”.
  3. At the opening window select Boundary Data Selector
  4. At the next window, there will be three choices to make. Set
  5. Then click on Find
  6. In the window below, a number of options will appear. Select 'English Super Output Areas (Lower Layer) 2001 [Within Counties]'.
  7. Click the 'List areas' button
  8. Then, in the window below this, select 'Merseyside'.
  9. Then scroll to the bottom (if needed) and click on 'Extract boundary Data'.
  10. This gives a new screen stating that the boundary data is being created. Eventually a window appears showing how to download it. Click on BoundaryData.zip to download it.
  11. When the next window pops up, click on Save.
  12. Select the folder you are currently working in to save it – name the file MerseyLSOA and save.
  13. At this point all of the GIS files (shapefiles) to describe the boundaries of Merseyside Lower Super Output Areas (LSOAs) are in this folder, but zipped. They can unzipped and then be loaded in to R.

If you need to register for UKBorders

Follow the registration procedure on the web site, and in a short while you should be able to access this site. For the mean time download the file MerseyLSOA.zip from http://dl.dropbox.com/u/7013997/ENVS257/MerseyLSOA.zip and put it in your working folder, and continue from here - or alternatively enter

download.file("http://dl.dropbox.com/u/7013997/ENVS257/MerseyLSOA.zip", "MerseyLSOA.zip")

which will do this via an R function as in the last practical. However, I recommend working through the other steps when you are registered, so that you are familiar with the process. You will probably need it in the future!

Use the map Data

Firstly, unzip the file you have just downloaded.

unzip("MerseyLSOA.zip")

The name of the downloaded shapefile sets actually aren't based on the term MerseyLSOA, but actually on the term england_low_soa_2001. Although this a little confusing (and not really avoidable, because the name comes from the institution supplying the data) the R object that the shape is read into can still have a more helpful name. To read it in, firstly load the maptools library, as in the last exercise:

library(maptools)
mersey.lsoa <- readShapeSpatial("england_low_soa_2001")

It is now possible to draw the crime data on a Merseyside backdrop:

plot(mersey.lsoa, border = "grey")
plot(crime.pts, pch = ".", col = "darkblue", add = TRUE)

plot of chunk unnamed-chunk-51

Note that including add=TRUE tells R to draw the crime points by adding them to an existing map, rather than starting a new map.

Focussing on Antisocial Behaviour

The data you have download includes all crimes. However, now the analysis will focus on reported incidents of antisocial behaviour. To do this, select out the antisocial behaviour points from the full crime shape object.

asb.pts <- crime.pts[crime.pts$Crime.type == "Anti-social behaviour", ]

The above method works similarly to the 'select rows by their number' seen in the first practical. The first thing in the square brackets is a conditional statement - basically it selects out cases for which a condition is true. In this cases, the Crime.type column should contain the attribute “Anti-social behaviour”. The expression after the comma is empty so that all of the rows are included. The result (a point object containing all of the anti-social behaviour incidents) is stored in asb.pts.

Next, find out how many ASBs have occurred in each LSOA. To do this, you need to load the package rgeos - basically a set of tools that help R handle overlay operations, such as finding out where geographical shapes intersect one another, and so on. Next you need to define a function called poly.counts - this counts how many points in a SpatialPointsDataFrame fall into each of a set of polygons in a SpatialPolygonsDataFrame, and creates a new variable. Here, the new variable is called asb.count. The code below carries out all of these operations. Note also that the # comment is placed at the beginning of some lines. This just tells R that the following text is just to explain what the R functions are doing - it doesn't have to be executed.

# This is another R package, allowing GIS overlay operations
library(rgeos)
# This defines a new R function - it counts how many points fall into each
# polygon
poly.counts <- function(pts, polys) colSums(gContains(polys, pts, byid = TRUE))
# The line below actually counts the number of crimes in each LSOA
asb.count <- poly.counts(asb.pts, mersey.lsoa)

Now that you have this information, you can draw a map of this. Before doing so, you need to add the choropleth map drawing code that you used earlier.

library(classInt, lib.loc = getwd())
library(RColorBrewer)
choro <- function(spdframe, variable) {
    var <- spdframe@data[, variable]
    breaks <- classIntervals(var, n = 6, style = "fisher")$brk
    my_colours <- brewer.pal(6, "Greens")
    plot(spdframe, col = my_colours[findInterval(var, breaks, all.inside = TRUE)], 
        axes = FALSE, border = rgb(0.8, 0.8, 0.8))
    invisible(list(b = breaks, c = my_colours))
}

Just cut and paste the above code into R (as with last time). Now, you can draw a choropleth map of asb counts for each LSOA. Firstly, add the asb counts as a column to the mersey.lsoa SpatialPolygonsDataFrame.

# First, add an ASB event count column to the 'mersey.lsoa'
# SpatialPolygonsDataFrame
mersey.lsoa@data$asb.count <- asb.count
# Now draw a choropleth map - use 'choro' the same way as the last
# practical
choro(mersey.lsoa, "asb.count")

plot of chunk unnamed-chunk-55

Computing Rates of ASB Occurrence

The map just drawn shows areas with a high absolute anti-social behaviour incident counts. However, it is more meaningful to consider these quantities per head of population - areas with larger populations would have higher absolute counts even if the chances of someone committing an offence was the same everywhere. Also, in some small communities, it might be that crime is relatively high. Working with crimes per person (ie per head of population) overcomes this issue. To deal with this, you need to download some population data. This comes from the web site: http://www.ons.gov.uk/ons/publications/re-reference-tables.html?edition=tcm%3A77-230902

At the time of writing, these are the most recent population estimates available from the government web site and relate to mid 2010. Although the crimes are from a slightly later date, we will assume that these population estimates are still reasonably accurate. Essentially, we have no choice – they are the most recent that can be obtained. Go to the web site above, and then download the file for the Lower Super Output Areas. This is the first of the two files that you will see. Download it into your working directory.

As before this is a zipped file. Follow the same procedure as before to extract it. Unfortunately, this insn't a csv file, it is stored in Microsoft Excel. You will need to use Excel to extract the data in CSV format before you read it into R. So, firstly open it in Excel and follow this procedure

  1. Go to the tab marked Mid-2010 Persons
  2. Delete rows 1-3
  3. Delete column B
  4. Delete columns D-H
  5. Change the column header for A to be 'LSOA' (no spaces)
  6. Change the column header for B to be 'pop' (no spaces again)
  7. Select coluimn B and select 'General' as the format - the commas between the thousands on the 'pop' column should go away. Save this file under the new name pop.csv - select csv as the format to save the file…

Once it has been saved, read it in to R.

pop <- read.csv("pop.csv")

This is actually the population estimates for the whole of the LSOAs in Great Britain, not just those in Merseyside. The next stage is to extract the estimates just for Merseyside.
Firstly, use the R function match to find the location in pop that each LSOA code in mersey.lsoa appears - these are stored in lsoa.lookup.

lsoa.lookup <- match(mersey.lsoa$zonecode, pop$LSOA)
head(lsoa.lookup)
[1] 6739 7284 6950 7191 7190 7187

Entering head(lsoa.lookup) shows the first few locations in the look up table - it tells you that the first of the LSOAs in mersey.lsoa corresponds to location 6739 in pop. In genereal, head is a useful function for lookup up the first few items in a variable, or part of an object. For example, try

head(mersey.lsoa@data)
   gid geonorth popnorth geoeast           name         label popeast
0  458   381722   383625  342348 Liverpool 058B 04BYE01006739  342928
1  459   383167   384902  323263    Wirral 033B 04CBE01007284  325257
2  461   420336   418219  332592    Sefton 004B 04CAE01006950  334110
3 2023   379256   380601  326022    Wirral 040F 04CBE01007191  327837
4 2024   380279   380955  325469    Wirral 040E 04CBE01007190  326631
5 2025   381092   381843  324075    Wirral 040B 04CBE01007187  325788
   zonecode asb.count
0 E01006739        14
1 E01007284         0
2 E01006950         9
3 E01007191         4
4 E01007190         3
5 E01007187         2

Now, add a column with the appropriate population estimates to the merseyside SpatialPolygonsDataFrame.

mersey.lsoa@data$pop <- pop[lsoa.lookup, "pop"]

You can check this has worked, by using head again:

head(mersey.lsoa@data)
   gid geonorth popnorth geoeast           name         label popeast
0  458   381722   383625  342348 Liverpool 058B 04BYE01006739  342928
1  459   383167   384902  323263    Wirral 033B 04CBE01007284  325257
2  461   420336   418219  332592    Sefton 004B 04CAE01006950  334110
3 2023   379256   380601  326022    Wirral 040F 04CBE01007191  327837
4 2024   380279   380955  325469    Wirral 040E 04CBE01007190  326631
5 2025   381092   381843  324075    Wirral 040B 04CBE01007187  325788
   zonecode asb.count  pop
0 E01006739        14 1684
1 E01007284         0 1588
2 E01006950         9 1591
3 E01007191         4 1323
4 E01007190         3 1402
5 E01007187         2 1392

You should see that a new column, pop, has been added. Now, a further column, that actually contains the rate of ASBs per head of population, can be added.

mersey.lsoa@data$asb.rate <- 10000 * mersey.lsoa@data$asb.count/mersey.lsoa@data$pop

Again, use head to check this:

head(mersey.lsoa@data)
   gid geonorth popnorth geoeast           name         label popeast
0  458   381722   383625  342348 Liverpool 058B 04BYE01006739  342928
1  459   383167   384902  323263    Wirral 033B 04CBE01007284  325257
2  461   420336   418219  332592    Sefton 004B 04CAE01006950  334110
3 2023   379256   380601  326022    Wirral 040F 04CBE01007191  327837
4 2024   380279   380955  325469    Wirral 040E 04CBE01007190  326631
5 2025   381092   381843  324075    Wirral 040B 04CBE01007187  325788
   zonecode asb.count  pop asb.rate
0 E01006739        14 1684    83.14
1 E01007284         0 1588     0.00
2 E01006950         9 1591    56.57
3 E01007191         4 1323    30.23
4 E01007190         3 1402    21.40
5 E01007187         2 1392    14.37

Note the multiplication by 10000 - this means rates are per 10000 heads of population. Now it is possible to produce a choropleth map of the ASB rates in Merseyside.

breaks <- choro(mersey.lsoa, "asb.rate")
legend(x = 347000, y = 420000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)

plot of chunk unnamed-chunk-63

In the leglabs part of the legend function, round is used to round the rate estimates to 1 place of decimal. Without this, the display becomes cluttered. You may need this map for the assignment (as well as some others). For this reason, it is a good idea to create a blank MS Word document, copy this map into it (see the last practical) and save it. There will be a few more maps and graphs that you will use - it is a good idea to use this Word document as a 'scrap book' and cut and paste the maps into your assignments at the appropriate places.

Comparison to Indicators of Deprivation

The Index of Multiple Deprivation is a score used to measure the degree of deprivation in geographical areas, currently produced by the Department for Communities and Local Government (DCLG) (http://www.communities.gov.uk/corporate/ ). Since 2004 the smallest region for which the IMD is produced has been the LSOA – the spatial unit you were working with in the practical. The index of multiple deprivation is a score based on combining scores for a number of ‘Domains’ of deprivation:

The bracketed figures on the right show the relative weighting that each domain is given to derive the overall score – thus, most emphasis is placed on Income and Employment. Each domain score is itself a combination of a number of variables relating to the individual domain. Note that, for this reason, each domain score is not a simple quantity (such as the rates of a specific illness in the health domain) but is itself a combined score based on several underlying variables. Not all of these underlying variables are made publicly available, but the scores for each domain and the overall IMD have been made publicly available.

Download the IMD Spreadsheet

To obtain this data you will need to visit the web site: http://www.communities.gov.uk/publications/corporate/statistics/indices2010 . If you scroll down this web site, you will see a selection of files to download. Scroll down to the link called 'Indices of deprivation 2010: All domains, all sub domains and supplementary indices' - make sure it is exactly this link that you download - there are quite a few links, but this is the only one that will work for the remainder of this practical. Save this file into the folder you are working in. This is a CSV file, and you should be able to read this directly into R, once it has been saved. The name of the CSV file isn't very informative '1871702.csv' - but you will read this into an r data frame object called deprivation - which should be more helpful.

deprivation <- read.csv("1871702.csv")

This file contains some quite long column names (and a few unneccesary columns). The following step picks out the columns you need:

deprivation <- deprivation[, c(1, 8, 10, 12, 14, 16, 18, 20, 22)]

and the next step changes the column names to more compact forms:

colnames(deprivation) <- c("LSOA", "Overall", "Income", "Employment", "Health", 
    "Education", "Housing", "Crime", "Environment")

as before head shows the first few values

head(deprivation)
       LSOA Overall Income Employment Health Education Housing Crime
1 E01000001    6.16   0.01       0.01  -2.11      0.21   32.60 -1.64
2 E01000002    5.59   0.01       0.01  -2.78      0.26   30.26 -1.93
3 E01000003   13.29   0.07       0.05  -0.97      7.16   40.32 -1.21
4 E01000004   11.17   0.04       0.04  -1.04      1.76   37.92 -1.32
5 E01000005   21.36   0.16       0.07   0.07     20.24   34.66 -1.02
6 E01000006   17.08   0.12       0.06  -0.38     13.84   32.81  0.42
  Environment
1       26.28
2       25.73
3       36.48
4       46.72
5       40.94
6       19.39

There are 9 columns - the first is the LSOA code, the second is the overal measure of deprivation, and the next 7 are scores for the individual aspects of deprivation listed above. Note that for these measures, positive values suggest above national average deprivation and negative values suggest below national average deprivation (with respect to levels in England).

As before, this data frame has information for the whole of England, and a look-up table approach is needed to match them to the mersey.lsoa shapefile.

lsoa.lookup <- match(mersey.lsoa$zonecode, deprivation$LSOA)
mersey.lsoa@data <- cbind(mersey.lsoa@data, deprivation[lsoa.lookup, -1])

The first line matches the LSOA code in the mersey.lsoa SpatialPointsDataFrame to the code in the deprivation data frame. The second line uses the cbind function to join several columns (from deprivation) to the existing data part of mersey.lsoa. As usual, head can be used to see the first few lines of the data part of mersey.lsoa:

head(mersey.lsoa@data)
   gid geonorth popnorth geoeast           name         label popeast
0  458   381722   383625  342348 Liverpool 058B 04BYE01006739  342928
1  459   383167   384902  323263    Wirral 033B 04CBE01007284  325257
2  461   420336   418219  332592    Sefton 004B 04CAE01006950  334110
3 2023   379256   380601  326022    Wirral 040F 04CBE01007191  327837
4 2024   380279   380955  325469    Wirral 040E 04CBE01007190  326631
5 2025   381092   381843  324075    Wirral 040B 04CBE01007187  325788
   zonecode asb.count  pop asb.rate Overall Income Employment Health
0 E01006739        14 1684    83.14   66.25   0.45       0.26   1.63
1 E01007284         0 1588     0.00    5.88   0.04       0.06  -0.27
2 E01006950         9 1591    56.57   41.50   0.24       0.22   1.81
3 E01007191         4 1323    30.23    5.73   0.02       0.05  -0.27
4 E01007190         3 1402    21.40    3.40   0.02       0.03  -0.41
5 E01007187         2 1392    14.37    4.71   0.03       0.05  -0.26
  Education Housing Crime Environment
0     72.85   18.36  1.16       51.60
1      2.27   13.38 -1.03        6.42
2     20.45   19.43  0.34       32.29
3      0.92   27.90 -1.71        1.58
4      0.54   20.88 -1.65        1.86
5      0.75   21.64 -1.51        1.18

Also, it is possible to draw maps of these indices. For example, to mapo the overall deprivation, enter:

breaks <- choro(mersey.lsoa, "Overall")
legend(x = 347000, y = 420000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)
title("Overall Deprivation in Merseyside (By LSOA)")

plot of chunk unnamed-chunk-70

Again, this would be a useful map to cut and paste into a scrapbook. Note the previous step can easily be modified to map other aspects of deprivation. Just replace "Overall" for one of the other column titles, and also change the title top be more appropriate. For example, for income deprivation enter:

breaks <- choro(mersey.lsoa, "Income")
legend(x = 347000, y = 420000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)
title("Income Deprivation in Merseyside (By LSOA)")

plot of chunk unnamed-chunk-71

Now that the SpatialPolygonsDataFrame object now has information about anti-social behaviour incidents, and deprivation. This makes it possible to consider the relationship between the two quantities. For example, here is a plot:

plot(mersey.lsoa$Overall, mersey.lsoa$asb.rate, xlab = "Index of Overall Deprivation", 
    ylab = "ASB Rate (per 10,000 people)")
title("Anti-Social Behaviour and Deprivation")

plot of chunk unnamed-chunk-72

It is possible to fit a straight line through the data, to see the relationship between the two quantities:

abline(lm(mersey.lsoa$asb.rate ~ mersey.lsoa$Overall), col = "darkblue", lwd = 2)

plot of chunk unnamed-chunk-73

the abline function draws a line fitted between the two variables (the fitting is done by lm(mersey.lsoa$asb.rate~mersey.lsoa$Overall) where lm stands for Linear Model). This may be another good graph to copy into the scrap book.

You may notice that although the line has a slight upward slope, there is quite a lot of variability around this. One reason for this is that although generally higher levels of deprivation in an area is associated with higher deprivation rates, this is not always the case – for example there are relatively high levels of anti social behaviour in some areas of mid-level deprivation, or in some town centre areas where there are quite a few pubs and night clubs, but which do not score highly in IMD terms. Also, the crime rates here are just for one month – and working with such a small time window adds to the variability in the data – in a given month there may be a spate of incidents in a one neighbourhood, whereas other areas that have a generally high risk may have experienced few incidents in this short period of time. Thus, it might be quite surprising to see a very strong relationship.

It is also possible to compute the correlation coefficient between the two quantities using R. Recall this is a number between -1 and 1 and briefly may be interpreted as follows:

Value Interpretation
-1 A perfect negative correlation – that is, the observed X and Y points all lie on a perfect line, with a downward slope.
–1 to 0 There is some negative connection between X and Y, but that the observed points are scattered around the fitted line - which has a downward slope. The closer to 0 the coefficient is, the more scattered the points are.
0 No correlation – there seems to be no connection between the variables X and Y
0 to 1 There is some positive connection between X and Y, but that the observed points are scattered around the fitted line - which has an upward slope. The closer to 0 the coefficient is, the more scattered the points are.
1 A perfect positive correlation - that is, the observed X and Y points all lie on a perfect line, with an upward slope

Note that the exact values of -1, 0, and 1 are really theoretical milestones - real data is nearly always in one of the two intervals: between -1 and 0 or between 0 and 1.

In R the function cor computes correlations. For example, to compute the correlation between the asb rate and the “Overall” index of deprivation, enter:

cor(mersey.lsoa@data$asb.rate, mersey.lsoa@data$Overall)
[1] 0.5082

which shows that the correlation between the overall index of deprivation and the asb rate per 10,000 people is around 0.5. It is also possible to specify a range of columns to compute the correlation. For example, in mersey.lsoa columnns 13 to 19 contain all of the different components of deprivation. Enter:

cor(mersey.lsoa@data$asb.rate, mersey.lsoa@data[, 13:19])
     Income Employment Health Education Housing  Crime Environment
[1,]  0.477     0.4905  0.518    0.3947  0.1206 0.4535      0.4036

to see how anti-social behaviour correlates to each of the components. From these, it can be seen that the strongest association is between Health and the asb rate, and the weakest is for Housing.

Assignment

This is a summative assignment, so both hard copy and electronic submission will be needed. The assignment is split into two parts:

Looking at geographical patterns

Using the scrap book of maps you produced earlier, cut and paste each of the maps (asb rate, each aspect of deprivation) into your assignment and write a brief (500 word) overview of the geographical patterns you see.

Looking at correlation and models

It is possible to modify the scatter plot graphs produced earlier, by substituting the word Overall for one of Income, Employment, Health, Education, Housing, Crime, Environment - that is the individual aspects of deprivation - in the command to plot the graph and supplying an appropriate title and x-label - for example:

plot(mersey.lsoa$Housing, mersey.lsoa$asb.rate, xlab = "Index of Housing Deprivation", 
    ylab = "ASB Rate (per 10,000 people)")
title("Anti-Social Behaviour and Housing Deprivation")

plot of chunk unnamed-chunk-76

Cut and paste a plot for each of these aspecrts of deprivation into your assignment document. Write a short description of the overall patterns that you see (ie summarising the information in all of the plots - around 400-500 words) - you can also make use of the correlation analysis to describe the rtelationships.

Hints

  1. Describe the geographical patterns in the maps of asb rates and the various indices of deprivation. where ae the largest and smallest values?
  2. Looking at Merseyside in Google earth may provide some help in identifying the different locations on the map.
  3. Make note of which aspects of deprivation have the strongest links to asb rate, and which have the weakest

This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.