Firstly, you will need to create yourself a project directory in your storage area. Any files that you create during this exercise will be stored here. Do this in Windows.
Next, click on the icon, and the R console (a window) appears. Unlike most (or possibly all) of the software you may be used to, it is not usually controlled by clicking on menu items, forms, or buttons, but by typing text into this window.
The results are generally either graphs or text also printed into this window. R can initially be used as a calculator - enter the following:
6 + 8
Don't worry about the [1] for the moment - just note that R printed out 14 since this is the answer to the sum you typed in. In these tutorials, sometimes I show the results of what you have typed in. This is in the format shown below:
5 * 4
[1] 20
Also note that * is the symbol for multiplication here - the last command asked R to perform the calculation '5 times 4'. Other symbols are - for subtraction and / for division:
12 - 14
[1] -2
6/17
[1] 0.3529
R also has functions like square root, sine, cosine and so on. You can calculate these like this:
sqrt(25)
[1] 5
sqrt(2)
[1] 1.414
the examples above use the square root (sqrt) function.
You can also combine these to make more complicated expressions:
sqrt(3 * 3 + 4 * 4)
[1] 5
You can also assign the answers of the calculations to variables, and use them in calculations. You do this as below
price <- 300
Here, the value 300 is stored in the variable price. The <- symbol means put the value on the right into the variable on the left - it is typed with a < followed by a -. This can be used in subsequent calculations. For example, to apply a 20% discount to this price, you could enter the following:
price - price * 0.2
[1] 240
or use intermediate variables
discount <- price * 0.2
price - discount
[1] 240
R can also work with lists of numbers, as well as individual ones. Lists are specified using the c function. Suppose you have a list of house prices listed in an estate agents, specified in thousands of pounds. You could store them in a variable called house.prices like this:
house.prices <- c(120, 150, 212, 99, 199, 299, 159)
house.prices
[1] 120 150 212 99 199 299 159
Note that there is no problem with full stops in the middle of variable names.
You can then apply functions to the lists. For example to take the average of a list, enter:
mean(house.prices)
[1] 176.9
If the house prices are in thousands of pounds, then this tells us that the mean house price is 176.9 thousand pounds. Note here that on your display, the answer may be displayed to more significant digits, so you may have something like 176.8571 as the mean value.
For the next exercise, try entering a larger data set (this time these are household burglaries per 10,000 households over a one month period) for 118 neighbourhoods in St. Helens:
burg.rates <- c(0, 7, 0, 0, 6, 19, 32, 0, 0, 0, 15, 6, 12, 8, 7, 6, 0, 0, 6,
0, 7, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 21, 7, 12, 7, 36, 18, 0, 0, 7, 6, 0,
0, 0, 0, 0, 13, 22, 0, 0, 0, 7, 12, 7, 5, 11, 0, 0, 13, 13, 0, 6, 15, 6,
17, 37, 0, 6, 6, 5, 24, 0, 0, 0, 0, 0, 0, 0, 5, 15, 0, 5, 6, 0, 0, 0, 13,
0, 6, 0, 0, 0, 23, 6, 13, 15, 6, 0, 0, 7, 7, 0, 0, 0, 0, 19, 13, 0, 0, 0,
6, 9, 0, 0, 0, 0, 0, 5)
Note that if an R expression has obviously not been completed and you hit return, then you can continue to type it on the following line. This carries on until R has worked out that the command is finished. In this case, that happens when the close brackets ) has been typed, and return has been hit afterwards. If you now enter the variable name (burg.rates) you can see all of the values listed:
burg.rates
[1] 0 7 0 0 6 19 32 0 0 0 15 6 12 8 7 6 0 0 6 0 7 0 0
[24] 0 0 0 0 0 17 0 0 21 7 12 7 36 18 0 0 7 6 0 0 0 0 0
[47] 13 22 0 0 0 7 12 7 5 11 0 0 13 13 0 6 15 6 17 37 0 6 6
[70] 5 24 0 0 0 0 0 0 0 5 15 0 5 6 0 0 0 13 0 6 0 0 0
[93] 23 6 13 15 6 0 0 7 7 0 0 0 0 19 13 0 0 0 6 9 0 0 0
[116] 0 0 5
In this example, there are 100 different rates and the last command lists them - they are more neatly written out than when you entered them, and perhaps now you can see what the numbers in square brackets are used for. Essentially they show you the position in the list of the first number on each row - thus if a row begins with [24] this implies that the 24nd number in the list is at the start of this printed row. The main idea is to allow you to find positions in the list of higher numbers more easily. Note that if you want to find out more about the basic use of R, a helpful guide can be found here: http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf - in particular, pages up to and including 28 are very useful.
It is also possible to draw graphics using the data you have put in to the variables. This draws a histogram of the burglary data:
hist(burg.rates)
By selecting the window it is possible to copy and paste the images into other documents, for example into the data encryption packages MS Word or Powerpoint or their open source alternatives.
Generally speaking in R, commands tend to give very basic plots, unless further details are provided. Thus to get a histogram with red bars, enter:
hist(burg.rates, col = "red")
and to change the main title, xlab (ie x-label) and ylab (y-label) use:
hist(burg.rates, col = "red", main = "Burglaries per 1000 households", xlab = "Rate",
ylab = "Frequency")
Now enter another related variable, the median house price (in 1000's of pounds) for a three-bedroom semi-detached house for each of the neighbourhoods.
house.prices <- c(200, 130, 200, 200, 180, 140, 65, 220, 180, 200, 210, 170,
180, 160, 180, 130, 240, 180, 170, 230, 150, 200, 200, 210, 220, 180, 200,
210, 150, 200, 230, 120, 180, 180, 190, 72, 80, 190, 220, 150, 200, 170,
170, 230, 200, 160, 140, 100, 140, 170, 180, 260, 170, 230, 190, 220, 140,
220, 120, 96, 210, 170, 180, 140, 150, 67, 200, 230, 140, 230, 83, 170,
200, 210, 240, 180, 200, 210, 250, 140, 130, 190, 110, 160, 150, 230, 160,
210, 200, 230, 210, 190, 120, 180, 87, 160, 190, 190, 230, 180, 110, 200,
250, 180, 200, 130, 180, 190, 190, 230, 210, 210, 150, 190, 210, 200, 210,
170)
As before, it is possible to draw a histogram of this variable:
hist(house.prices, col = "lightblue", main = "House Price", xlab = "1000s Pounds",
ylab = "Frequency")
and also to create a scatter plot of the two variables, to see how median house price relates to burglary rate:
plot(burg.rates, house.prices, main = "Burglary vs. House Price", xlab = "Burglaries (per 1000 households)",
ylab = "Median House Price (1000s Pounds)")
This shows that there is a relationship between the two quantities, although there is still a fair amount of randomness as well. The points show there is a general tendency for house prices to fall as burglary rate increases, but that there are other factors affecting house price as well.
In the last section, you looked at two variables called house.prices and burg.rates and graphed their relationship. It is possible to combine the individual variables in a data frame - rather like an internal spreadsheet where all of the relavent data items are stored together as a set of columns. This is similar to the data set storage in SPSS (for those of you who have used that package) where each variable correspoinds to a column and each case (or observation) corresponds to a row. However, while SPSS can only have one data set active at a time, in R you can have several of them.
To create a data frame containing the last two variables enter
hp.data <- data.frame(Burglary = burg.rates, Price = house.prices)
Then type in its name to list it:
hp.data
Just to explain what has happened here: the function data.frame takes all of the variables that you wish to have as columns. The Burglary=burg.rates creates a column in the data frame called Burglary containing the values in the variable burg.rates in the last section. Similarly, it has a column called Price containing the values from house.prices. This new data frame is stored as a new object called hp.data (an object in R is similar to a variable, although it can be more complex - so it can contain more sophisticated things like data frames, not just a list of values). Typing in the name of the object (once it has been created) lists the values in the columns.
You can also enter
fix(hp.data)
to view this data frame in a window, and also edit values in the 'cells'. However, although it might be interesting to try using fix now, try it, but don't actually edit anything. To return to the R command line click on Quit in the data frame window. NOTE: it is important to do this, ortherwise you won't be able to type anything in to R.
You can also describe each column in the data set using the summary function:
summary(hp.data)
Burglary Price
Min. : 0.00 Min. : 65
1st Qu.: 0.00 1st Qu.:152
Median : 0.00 Median :185
Mean : 5.64 Mean :179
3rd Qu.: 7.00 3rd Qu.:210
Max. :37.00 Max. :260
For each column, a number of values are listed:
| Item | Description |
|---|---|
| Min. | The smallest value in the column |
| 1st. Qu. | The first quartile (the value ¼ of the way along a sorted list of values) |
| Median | The median (the value ½ of the way along a sorted list of values) |
| Mean | The average of the column |
| 3rd. Qu. | The third quartile (the value ¾ of the way along a sorted list of values) |
| Max. | The largest value in the column |
Between these numbers, an impression of the spread of values of each variable can be obtained. In particular it is possible to see that the median house price in St. Helens by neighbourhood ranges from 65,000 pounds to 260,000 pounds and that half of the prices lie between 152,500 pounds and 210,000 pounds. Also it can be seen that since the median measured burglary rate is zero, then at least half of areas had no burglaries in the month when counts were compiled.
There are lots of methods for getting data from some external file into R. One way is to enter it in directly, as in the last section, but this isn't practical for large datasets. It is also possible to read data in from text files (this will be done later) and to read it in from a source on the internet. Here, the internet method is demonstrated. In this case the data set is one made publicly available on my public dropbox folder. If you go follow the web link “http://dl.dropbox.com/u/7013997/ENVS257/hpdata.csv” you can see it in its 'raw' form. If you click on the link and then on the open option, the data will open in a read-only spreadsheet. Having seen the data set, close the spreadsheet and return to the R window.
The file is called a csv file - short for Comma Separated Variable - as in its 'raw' form it justs consists of several lines of variables separated by commas. Essentially this file contains the same data as the house price and burglary data you used earlier- although an extra column, with an ID number of each neighbourhood, is also included. You can read a csv file on the internet directly into a variable using the read.csv function in R. Here the contents of the file are read into a data frame object called hp.data2.
hp.data2 <- read.csv("http://dl.dropbox.com/u/7013997/ENVS257/hpdata.csv")
The function takes the URL of the file as an argument, and reads it in to the object on the left of the assignment (<-) symbol. As before, it is possible to list the data frame
hp.data2
ID Burglary Price
1 21 0 200
2 24 7 130
3 31 0 200
4 32 0 200
5 78 6 180
6 80 19 140
7 81 32 65
8 98 0 220
9 100 0 180
10 101 0 200
11 102 15 210
12 111 6 170
13 112 12 180
14 113 8 160
15 114 7 180
16 115 6 130
17 116 0 240
18 117 0 180
19 118 6 170
20 119 0 230
21 120 7 150
22 121 0 200
23 122 0 200
24 123 0 210
25 124 0 220
26 125 0 180
27 126 0 200
28 127 0 210
29 128 17 150
30 614 0 200
31 620 0 230
32 621 21 120
33 622 7 180
34 624 12 180
35 625 7 190
36 626 36 72
37 627 18 80
38 628 0 190
39 629 0 220
40 764 7 150
41 765 6 200
42 766 0 170
43 767 0 170
44 768 0 230
45 769 0 200
46 833 0 160
47 834 13 140
48 835 22 100
49 836 0 140
50 837 0 170
51 838 0 180
52 839 7 260
53 840 12 170
54 841 7 230
55 842 5 190
56 843 11 220
57 844 0 140
58 845 0 220
59 846 13 120
60 847 13 96
61 848 0 210
62 849 6 170
63 850 15 180
64 851 6 140
65 852 17 150
66 853 37 67
67 854 0 200
68 855 6 230
69 856 6 140
70 857 5 230
71 858 24 83
72 859 0 170
73 860 0 200
74 861 0 210
75 862 0 240
76 863 0 180
77 864 0 200
78 865 0 210
79 866 5 250
80 867 15 140
81 868 0 130
82 869 5 190
83 870 6 110
84 871 0 160
85 872 0 150
86 873 0 230
87 874 13 160
88 875 0 210
89 876 6 200
90 877 0 230
91 878 0 210
92 879 0 190
93 880 23 120
94 881 6 180
95 882 13 87
96 883 15 160
97 884 6 190
98 885 0 190
99 886 0 230
100 887 7 180
101 888 7 110
102 889 0 200
103 890 0 250
104 891 0 180
105 892 0 200
106 893 19 130
107 894 13 180
108 895 0 190
109 896 0 190
110 897 0 230
111 898 6 210
112 899 9 210
113 900 0 150
114 901 0 190
115 902 0 210
116 903 0 200
117 904 0 210
118 905 5 170
and also to summarise it:
summary(hp.data2)
ID Burglary Price
Min. : 21 Min. : 0.00 Min. : 65
1st Qu.:616 1st Qu.: 0.00 1st Qu.:152
Median :846 Median : 0.00 Median :185
Mean :654 Mean : 5.64 Mean :179
3rd Qu.:876 3rd Qu.: 7.00 3rd Qu.:210
Max. :905 Max. :37.00 Max. :260
Note that the second and third columns of this are the same as before – and also that the first column does not really make sense. Since it is an ID number, it is really just a identifying code, and it doesn't really matter what its minum value, or mean value might be - you might as well compute an average student ID number. For this reason, it is useful to ask R to drop this column from hp.data2 when computing the summary. This is fairly simple. Firstly note that you can use square brackets to pick out individual values in the data frame. For example to select the value in the 15th row of column 2, enter
hp.data2[15, 2]
[1] 7
and you can look at the full data frame to check that this is indeed 7. Also, for the columns, you can replace the column number for its name, provided it is in quotes:
hp.data2[15, "Burglary"]
[1] 7
Also, instead of just specifying a single row, it is possible to specify a range of rows:
hp.data2[10:15, "Burglary"]
[1] 0 15 6 12 8 7
This lists the burglary rates for neighbourhoods 10-15 in the data set.
If you want to specify a full column (ie all of the burglary rates), just leave the part where you would write the column range empty:
hp.data2[ ,"Burglary"]
[1] 0 7 0 0 6 19 32 0 0 0 15 6 12 8 7 6 0 0 6 0 7 0 0
[24] 0 0 0 0 0 17 0 0 21 7 12 7 36 18 0 0 7 6 0 0 0 0 0
[47] 13 22 0 0 0 7 12 7 5 11 0 0 13 13 0 6 15 6 17 37 0 6 6
[70] 5 24 0 0 0 0 0 0 0 5 15 0 5 6 0 0 0 13 0 6 0 0 0
[93] 23 6 13 15 6 0 0 7 7 0 0 0 0 19 13 0 0 0 6 9 0 0 0
[116] 0 0 5
You can use a similar approach to select out a row of the data -
hp.data2[12, ]
ID Burglary Price
12 111 6 170
This gives the ID, Burglary and Price (ie house price) values for the 12th neighbourhood in the list.
Another way of selecting out columns is to use the $ (dollar) approach. Entering
hp.data2$Price
[1] 200 130 200 200 180 140 65 220 180 200 210 170 180 160 180 130 240
[18] 180 170 230 150 200 200 210 220 180 200 210 150 200 230 120 180 180
[35] 190 72 80 190 220 150 200 170 170 230 200 160 140 100 140 170 180
[52] 260 170 230 190 220 140 220 120 96 210 170 180 140 150 67 200 230
[69] 140 230 83 170 200 210 240 180 200 210 250 140 130 190 110 160 150
[86] 230 160 210 200 230 210 190 120 180 87 160 190 190 230 180 110 200
[103] 250 180 200 130 180 190 190 230 210 210 150 190 210 200 210 170
extracts the column called Price and so on. This (and indeed the earlier ways of extracting columns) can be used in graphics commands, and so on. For example, a box-plot of the prices can be obtained by
boxplot(hp.data2$Price, col = "lightgreen")
The box-plot is a graphical equivalent of the summaries seen earlier - the end-points of the longer lines show the maximum and minimum values, and the end points of the central box are the first and third quartiles. The line on the box marks the median.
Until now, although this is geographical data, no maps have been drawn. In this section you will do this. Firstly, you need to load a new package into R - a package is an extra set of functionality that extends what R is capable of doing. Here, the new package is called maptools and it extends R by allowing it to draw maps, and handle geographical information. Firstly, you have to let R know you want to use the package - and install it:
install.packages("maptools", depend = TRUE, lib = getwd())
Packages are loaded via the library function:
library(maptools)
library(maptools, lib.loc = getwd())
However, this just makes R able to handle geographical data, it doesn't actually load any specific data sets. To do this, you'll need to obtain them from somewhere. Data for maps in R can be read in from shapefiles - these are a well known geographical information exchange format. Although the term 'shapefile' is used, in actuality a given data set consists of several files. For this reason, I will use the term 'shapefile set' to mean all of the files needed for a particular geographical data set. The first task is to obtain a shapefle set of the St. Helens neighbourhoods (or Lower Super Output Areas - LSOAs, as they are more formally called). To do this, you will need to put some information into the working directory you created earlier. The exact function you need to type in will depend on the name you have called your working directory. For example, my directory is called 'R Work' and is on the M: drive, so I type
setwd("M:/R work")
your version might well have a longer title. Also note that slashes are indicated with a '/' not '\'.
Later on in the practical, you will need to use some other packages - the following line of R will install them all now:
install.packages(c("rgeos", "RColorBrewer", "maptools", "classInt"), depend = TRUE,
lib = getwd())
There is a set of shapefiles for the St. Helens neighbourhoods at the same location as the data set you read in earlier. Since several files are needed, I have bundled these together in a single zipfile. You will now download this to your local folder and subsequrently unzip it. This can all be done via R functions:
download.file("http://dl.dropbox.com/u/7013997/ENVS257/sthel.zip", "sthel.zip")
unzip("sthel.zip")
The first function actually downloads the zip file into your working directory, the second one unzips it, creating the shapefile set. All of the shapefile set files begin with sthel but then have different endings, eg sthel.shp, sthel.dbx and sthel.shx.
Now, these can be read in to a SpatialPolygons object.
sthel <- readShapeSpatial("sthel")
the readShapeSpatial function does this, and stores them into another type of object called a SpatialPolygons object. Recall that geographical data can be of type Point, Line or Polygon and that polygons are areas or regions, such as the neighbourhoods (LSOAs) in St. Helens. You can use the plot function to draw the polygons (ie the map of the LSOAs).
plot(sthel)
Finally, a SpatialPolygonsDataFrame is the result of joining a SpatialPolygons object up with a data frame. The idea is that each polygon corresponds to one row of the data frame - i.e, the geographical description of the boundaries of each neighbourhood with the data about that neighbourhood. Here a SpatialPolygonsDataFrame is created by joining the sthel SpatialPolygons object to the hp.data2 data frame object. The result is assigned to a SpatialPolygonsDataFrame object called sthel.prices.
sthel.prices <- SpatialPolygonsDataFrame(sthel, hp.data2, match.ID = FALSE)
Finally you will draw a choropleth map of these house prices. In later exercises more will be explained as to how this works, but for now the method will simply be demonstrated. One useful feature of R is that it is possible to write new functions, and therefore extend its capabilities. The following function (called choro) is a simple choropleth map drawing tool:
choro <- function(spdframe, variable) {
var <- spdframe@data[, variable]
breaks <- classIntervals(var, n = 6, style = "fisher")$brk
my_colours <- brewer.pal(6, "Greens")
plot(spdframe, col = my_colours[findInterval(var, breaks, all.inside = TRUE)],
axes = FALSE, border = rgb(0.8, 0.8, 0.8))
invisible(list(b = breaks, c = my_colours))
}
For now, just copy and paste this into R and hit return. Don't worry about trying to interpret it!
Note that it is important to make sure the upper and lower case letters you type in are exactly the same as the ones above. This is generally the case with R. If a window comes up asking where to install the library from, select one of the UK locations. When it has installed, just enter
library(classInt, lib.loc = getwd())
to load it. Also, load a further package for providing map shading palettes called RColorBrewer.
library(RColorBrewer)
Now it is possible to draw the choropleth map. Here we draw one of the burglary rates.
breaks <- choro(sthel.prices, "Burglary")
The map shows the different burglary rates, with dark shading indicating a higher rate. However, this is much easier to interpret if you add a legend to the map. For now, just enter the following. As before, more about how all of the commands work will be disclosed later in the module.
legend(x = 357000, y = 392000, legend = leglabs(breaks$b), fill = breaks$c,
bty = "n")
You can also add a title to the map:
title("Burglary Rates per 10,000 Homes in St. Helens")
As a final useful technique you can copy and paste maps like this into word documents. Select the window with the map on it, and then right click on it. When the menu appears, select 'copy as bitmap'. If you also have MS Word up and running, you can then paste the map into your document.
In December of 2010 the police.uk web site was launched, providing details of geographical locations of crime in England and Wales. As well as providing an interface to allow people to view crimes in their (or other peoples') neighbourhoods, this also provided a means for downloading the crime data. Using this, combined with other data it is possible to obtain crime rate maps for anywhere in England and Wales, for a number of different classes of crime.
This section of the practical will examine the use of contemporary data about crime rates and population levels that can be merged to draw maps so that you can investigate hypotheses about geographical patterns of crime, for a number of different crime types, in the Merseyside region.
Firstly, you can use the police.uk web site to investigate local patterns in crime without making use of any other software. To do this, firstly go to the web site http://www.police.uk – this will show you the start-up page for this website – which, as stated earlier, provides a portal to maps of crime rates in localities in England and Wales – and looks something like this:
To try this out, enter the postcode L69 7ZT into the 'address, postcode or place name' box, and click on the 'search' button. This postcode corresponds to the Roxby Building, near Myrtle Street, on the edge of the campus. When you click on the button it will give you a new page with a number of smaller sections,
including one titled 'Crime and ASB in this area'. Note: ASB = Anti-Social Behaviour, which does not always imply a literal criminal offence. This shows a map of the local area with a number of symbols, and a red marker showing the point corresponding approximately to the postcode centroid.
If you click on this red marker, a new page with a bigger map appears, looking something like this:
Using the scale bar on the left of the map, you can 'zoom in' on the blue circle around the red marker. You can also ‘drag’ the map in the window (press the left mouse button while moving the mouse when the pointer is inside the map window) to pan around the area. The table on the left of the map gives you counts of the different kinds of crimes that have occurred in the most recent month. Clicking on any of the black circles gives a breakdown of different crime counts in an area. Clicking on the counts in the table makes the map only show points corresponding to the type of crime in the table.
Exploring the above maps gives some interesting insights into local crime. However, it also raises some questions. One issue is that although the circles with counts of crimes are centred on certain points on the map, these do not correspond exactly to where the crimes occurred. This is intentional on the part of the designers of this site, and preserves privacy (for example, the web site does not intend to publicise the exact addresses of houses that were burgled). The points where crimes are mapped either correspond to public places (eg train stations) or to the centre of a group of at least 12 private addresses. However, the choice of mapping technique has faced some criticism because it still uses a point-based approach – giving the impression that a crime (or several crimes) did occur at some very specific location, when in fact it did not – see for example figure below – there is no direct evidence that 12 incidents occured in exactly the location in Myrtle place as implied on the map – this is just the centre of the group of houses and other buildings. See http://bcsmaps.blogspot.com/2011/02/uk-police-maps-x-does-not-mark-spot.html for more discussion.
Since there are certain limitations to viewing the data directly on the web site, it is fortunate that another option is to download the raw data and display it in other ways, using R. To do this, note the ‘data’ menu on the map you have just created (on the blue bar above the map) and click on this. This takes you to a new page where data may be downloaded. Next, scroll down until you find the link for ‘Merseyside Police’, then click on the first of the three 'download' links. This gives a csv (comma separated variable) file for the street-level Merseyside police data – each line has details of each crime, including crime type, location and some other features, separated by commas. The download will probably put the file in your Downloads folder. Using folder one you created earlier - move the file named something like 2012-11-merseyside-street.zip to this folder (note: 2012-11 refers to the actual year and month that you download – this can change depending on the actual date when you do the practical – for the rest of these notes be aware that the 2012-11 part of the name may be replaced with another year and month) - and also that some of the maps or graphs you obtain will be slightly different to those on this web site.
Since the file is compressed to speed up the download (that is, zipped) you need to extract it first. Double click on the file, and you will see a new folder window appear containing file 2012-11-merseyside-street.csv. It is a good idea to extract it before using it in R. To do this, click on ‘extract’ in the top of the folder window. This gives a window to choose the folder into which the file should be extracted. Choose the working directory that you were using in the last exercise.
Now, the data is in place and you need to start up R. Do this in the same way you did in the last exercise. When R is running enter the commands to set it to run in the directory you keep the data. Remember that to do this, the exact function you need to type in will depend on the name you have called your working directory. Again, from last time, my directory is called 'R Work' and is on the M: drive, so I type
setwd("M:/R work")
your version might well have a longer title. Also note that slashes are indicated with a '/' not '\'.
Now load the GISTools library (as you did before) and then read in the CSV file you downloaded with the crime data. Store it into a data frame called crimes.
require(maptools)
crimes <- read.csv("2012-11-merseyside-street.csv")
You can inspect the data frame in a window using fix(crimes) - but don't edit the values and remember that you have to close the window before typing anything else into R.
You will see that the data consists of a number of columns, each with a heading. Two of these are called Easting and Northing – these are the column headers that give the UK national grid coordinates of each incident (either a crime or an ASB event) in the data you have just downloaded. Another is headed Crime.Type and tells you which type of crime occurred. Note that the national grid references refer to the centres of small groups of households (or possibly public places) mentioned earlier.
At the moment, the data is just in a data.frame object - not any kind of spatial object. To create the spatial object, enter the following:
crime.pts <- SpatialPointsDataFrame(crimes[, 4:5], crimes[, -(4:5)])
This creates a SpatialPointsDataFrame object. The function takes two arguments - the first is columns 4 and 5 from the crimes data frame - these are the easting and northing of the crime location. The second argument is the data frame minus columns 4 and 5 - this is what -(4;5) indicates. These columns provide all the non-geographical data from the data frame. The resulting object crime.pts is a spatial points geographical shape object, whose points are each recorded crime in the data set you download. To see the geographical pattern of these crimes, enter:
plot(crime.pts, pch = ".", col = "darkred")
as you can see, you have now successfully obtained some crime data from police.uk and read it into R. The option pch='.' tells R to map each point with a full-stop character - and to make the colour of these dark red.
You can also examine the kinds of different crimes recorded.
table(crime.pts$Crime.type)
Anti-social behaviour Burglary
4541 1190
Criminal damage and arson Drugs
1657 728
Other crime Other theft
293 1324
Public disorder and weapons Robbery
296 128
Shoplifting Vehicle crime
746 913
Violent crime
998
For a categorical column in a data frame (that is, one that attributes each location to a specific category, in this case 'crime type') the table function counts the number of times each category appears in the column. From this, it is possible to identify the relative frequencies with which each kind of crime occurs.
The next stage is to obtain some further map data. In particular, the point data you have just drawn still suffers from the 'X does not mark the spot' issue discussed earlier on. The next stage of this practical is to obtain some neighbourhood zone data and to aggregrate the crimes to this. These data are useful for a number of reasons:
BoundaryData.zip to download it.MerseyLSOA and save.Follow the registration procedure on the web site, and in a short while you should be able to access this site. For the mean time download the file MerseyLSOA.zip from http://dl.dropbox.com/u/7013997/ENVS257/MerseyLSOA.zip and put it in your working folder, and continue from here - or alternatively enter
download.file("http://dl.dropbox.com/u/7013997/ENVS257/MerseyLSOA.zip", "MerseyLSOA.zip")
which will do this via an R function as in the last practical. However, I recommend working through the other steps when you are registered, so that you are familiar with the process. You will probably need it in the future!
Firstly, unzip the file you have just downloaded.
unzip("MerseyLSOA.zip")
The name of the downloaded shapefile sets actually aren't based on the term MerseyLSOA, but actually on the term england_low_soa_2001. Although this a little confusing (and not really avoidable, because the name comes from the institution supplying the data) the R object that the shape is read into can still have a more helpful name. To read it in, firstly load the maptools library, as in the last exercise:
library(maptools)
mersey.lsoa <- readShapeSpatial("england_low_soa_2001")
It is now possible to draw the crime data on a Merseyside backdrop:
plot(mersey.lsoa, border = "grey")
plot(crime.pts, pch = ".", col = "darkblue", add = TRUE)
Note that including add=TRUE tells R to draw the crime points by adding them to an existing map, rather than starting a new map.
The data you have download includes all crimes. However, now the analysis will focus on reported incidents of antisocial behaviour. To do this, select out the antisocial behaviour points from the full crime shape object.
asb.pts <- crime.pts[crime.pts$Crime.type == "Anti-social behaviour", ]
The above method works similarly to the 'select rows by their number' seen in the first practical. The first thing in the square brackets is a conditional statement - basically it selects out cases for which a condition is true. In this cases, the Crime.type column should contain the attribute “Anti-social behaviour”. The expression after the comma is empty so that all of the rows are included. The result (a point object containing all of the anti-social behaviour incidents) is stored in asb.pts.
Next, find out how many ASBs have occurred in each LSOA. To do this, you need to load the package rgeos - basically a set of tools that help R handle overlay operations, such as finding out where geographical shapes intersect one another, and so on. Next you need to define a function called poly.counts - this counts how many points in a SpatialPointsDataFrame fall into each of a set of polygons in a SpatialPolygonsDataFrame, and creates a new variable. Here, the new variable is called asb.count. The code below carries out all of these operations. Note also that the # comment is placed at the beginning of some lines. This just tells R that the following text is just to explain what the R functions are doing - it doesn't have to be executed.
# This is another R package, allowing GIS overlay operations
library(rgeos)
# This defines a new R function - it counts how many points fall into each
# polygon
poly.counts <- function(pts, polys) colSums(gContains(polys, pts, byid = TRUE))
# The line below actually counts the number of crimes in each LSOA
asb.count <- poly.counts(asb.pts, mersey.lsoa)
Now that you have this information, you can draw a map of this. Before doing so, you need to add the choropleth map drawing code that you used earlier.
library(classInt, lib.loc = getwd())
library(RColorBrewer)
choro <- function(spdframe, variable) {
var <- spdframe@data[, variable]
breaks <- classIntervals(var, n = 6, style = "fisher")$brk
my_colours <- brewer.pal(6, "Greens")
plot(spdframe, col = my_colours[findInterval(var, breaks, all.inside = TRUE)],
axes = FALSE, border = rgb(0.8, 0.8, 0.8))
invisible(list(b = breaks, c = my_colours))
}
Just cut and paste the above code into R (as with last time). Now, you can draw a choropleth map of asb counts for each LSOA. Firstly, add the asb counts as a column to the mersey.lsoa SpatialPolygonsDataFrame.
# First, add an ASB event count column to the 'mersey.lsoa'
# SpatialPolygonsDataFrame
mersey.lsoa@data$asb.count <- asb.count
# Now draw a choropleth map - use 'choro' the same way as the last
# practical
choro(mersey.lsoa, "asb.count")
The map just drawn shows areas with a high absolute anti-social behaviour incident counts. However, it is more meaningful to consider these quantities per head of population - areas with larger populations would have higher absolute counts even if the chances of someone committing an offence was the same everywhere. Also, in some small communities, it might be that crime is relatively high. Working with crimes per person (ie per head of population) overcomes this issue. To deal with this, you need to download some population data. This comes from the web site: http://www.ons.gov.uk/ons/publications/re-reference-tables.html?edition=tcm%3A77-230902
At the time of writing, these are the most recent population estimates available from the government web site and relate to mid 2010. Although the crimes are from a slightly later date, we will assume that these population estimates are still reasonably accurate. Essentially, we have no choice – they are the most recent that can be obtained. Go to the web site above, and then download the file for the Lower Super Output Areas. This is the first of the two files that you will see. Download it into your working directory.
As before this is a zipped file. Follow the same procedure as before to extract it. Unfortunately, this insn't a csv file, it is stored in Microsoft Excel. You will need to use Excel to extract the data in CSV format before you read it into R. So, firstly open it in Excel and follow this procedure
pop.csv - select csv as the format to save the file…Once it has been saved, read it in to R.
pop <- read.csv("pop.csv")
This is actually the population estimates for the whole of the LSOAs in Great Britain, not just those in Merseyside. The next stage is to extract the estimates just for Merseyside.
Firstly, use the R function match to find the location in pop that each LSOA code in mersey.lsoa appears - these are stored in lsoa.lookup.
lsoa.lookup <- match(mersey.lsoa$zonecode, pop$LSOA)
head(lsoa.lookup)
[1] 6739 7284 6950 7191 7190 7187
Entering head(lsoa.lookup) shows the first few locations in the look up table - it tells you that the first of the LSOAs in mersey.lsoa corresponds to location 6739 in pop. In genereal, head is a useful function for lookup up the first few items in a variable, or part of an object. For example, try
head(mersey.lsoa@data)
gid geonorth popnorth geoeast name label popeast
0 458 381722 383625 342348 Liverpool 058B 04BYE01006739 342928
1 459 383167 384902 323263 Wirral 033B 04CBE01007284 325257
2 461 420336 418219 332592 Sefton 004B 04CAE01006950 334110
3 2023 379256 380601 326022 Wirral 040F 04CBE01007191 327837
4 2024 380279 380955 325469 Wirral 040E 04CBE01007190 326631
5 2025 381092 381843 324075 Wirral 040B 04CBE01007187 325788
zonecode asb.count
0 E01006739 14
1 E01007284 0
2 E01006950 9
3 E01007191 4
4 E01007190 3
5 E01007187 2
Now, add a column with the appropriate population estimates to the merseyside SpatialPolygonsDataFrame.
mersey.lsoa@data$pop <- pop[lsoa.lookup, "pop"]
You can check this has worked, by using head again:
head(mersey.lsoa@data)
gid geonorth popnorth geoeast name label popeast
0 458 381722 383625 342348 Liverpool 058B 04BYE01006739 342928
1 459 383167 384902 323263 Wirral 033B 04CBE01007284 325257
2 461 420336 418219 332592 Sefton 004B 04CAE01006950 334110
3 2023 379256 380601 326022 Wirral 040F 04CBE01007191 327837
4 2024 380279 380955 325469 Wirral 040E 04CBE01007190 326631
5 2025 381092 381843 324075 Wirral 040B 04CBE01007187 325788
zonecode asb.count pop
0 E01006739 14 1684
1 E01007284 0 1588
2 E01006950 9 1591
3 E01007191 4 1323
4 E01007190 3 1402
5 E01007187 2 1392
You should see that a new column, pop, has been added. Now, a further column, that actually contains the rate of ASBs per head of population, can be added.
mersey.lsoa@data$asb.rate <- 10000 * mersey.lsoa@data$asb.count/mersey.lsoa@data$pop
Again, use head to check this:
head(mersey.lsoa@data)
gid geonorth popnorth geoeast name label popeast
0 458 381722 383625 342348 Liverpool 058B 04BYE01006739 342928
1 459 383167 384902 323263 Wirral 033B 04CBE01007284 325257
2 461 420336 418219 332592 Sefton 004B 04CAE01006950 334110
3 2023 379256 380601 326022 Wirral 040F 04CBE01007191 327837
4 2024 380279 380955 325469 Wirral 040E 04CBE01007190 326631
5 2025 381092 381843 324075 Wirral 040B 04CBE01007187 325788
zonecode asb.count pop asb.rate
0 E01006739 14 1684 83.14
1 E01007284 0 1588 0.00
2 E01006950 9 1591 56.57
3 E01007191 4 1323 30.23
4 E01007190 3 1402 21.40
5 E01007187 2 1392 14.37
Note the multiplication by 10000 - this means rates are per 10000 heads of population. Now it is possible to produce a choropleth map of the ASB rates in Merseyside.
breaks <- choro(mersey.lsoa, "asb.rate")
legend(x = 347000, y = 420000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)
In the leglabs part of the legend function, round is used to round the rate estimates to 1 place of decimal. Without this, the display becomes cluttered.
You may need this map for the assignment (as well as some others). For this reason, it is a good idea to create a blank MS Word document, copy this map into it (see the last practical) and save it. There will be a few more maps and graphs that you will use - it is a good idea to use this Word document as a 'scrap book' and cut and paste the maps into your assignments at the appropriate places.
The Index of Multiple Deprivation is a score used to measure the degree of deprivation in geographical areas, currently produced by the Department for Communities and Local Government (DCLG) (http://www.communities.gov.uk/corporate/ ). Since 2004 the smallest region for which the IMD is produced has been the LSOA – the spatial unit you were working with in the practical. The index of multiple deprivation is a score based on combining scores for a number of ‘Domains’ of deprivation:
The bracketed figures on the right show the relative weighting that each domain is given to derive the overall score – thus, most emphasis is placed on Income and Employment. Each domain score is itself a combination of a number of variables relating to the individual domain. Note that, for this reason, each domain score is not a simple quantity (such as the rates of a specific illness in the health domain) but is itself a combined score based on several underlying variables. Not all of these underlying variables are made publicly available, but the scores for each domain and the overall IMD have been made publicly available.
To obtain this data you will need to visit the web site: http://www.communities.gov.uk/publications/corporate/statistics/indices2010 . If you scroll down this web site, you will see a selection of files to download. Scroll down to the link called 'Indices of deprivation 2010: All domains, all sub domains and supplementary indices' - make sure it is exactly this link that you download - there are quite a few links, but this is the only one that will work for the remainder of this practical. Save this file into the folder you are working in. This is a CSV file, and you should be able to read this directly into R, once it has been saved. The name of the CSV file isn't very informative '1871702.csv' - but you will read this into an r data frame object called deprivation - which should be more helpful.
deprivation <- read.csv("1871702.csv")
This file contains some quite long column names (and a few unneccesary columns). The following step picks out the columns you need:
deprivation <- deprivation[, c(1, 8, 10, 12, 14, 16, 18, 20, 22)]
and the next step changes the column names to more compact forms:
colnames(deprivation) <- c("LSOA", "Overall", "Income", "Employment", "Health",
"Education", "Housing", "Crime", "Environment")
as before head shows the first few values
head(deprivation)
LSOA Overall Income Employment Health Education Housing Crime
1 E01000001 6.16 0.01 0.01 -2.11 0.21 32.60 -1.64
2 E01000002 5.59 0.01 0.01 -2.78 0.26 30.26 -1.93
3 E01000003 13.29 0.07 0.05 -0.97 7.16 40.32 -1.21
4 E01000004 11.17 0.04 0.04 -1.04 1.76 37.92 -1.32
5 E01000005 21.36 0.16 0.07 0.07 20.24 34.66 -1.02
6 E01000006 17.08 0.12 0.06 -0.38 13.84 32.81 0.42
Environment
1 26.28
2 25.73
3 36.48
4 46.72
5 40.94
6 19.39
There are 9 columns - the first is the LSOA code, the second is the overal measure of deprivation, and the next 7 are scores for the individual aspects of deprivation listed above. Note that for these measures, positive values suggest above national average deprivation and negative values suggest below national average deprivation (with respect to levels in England).
As before, this data frame has information for the whole of England, and a look-up table approach is needed to match them to the mersey.lsoa shapefile.
lsoa.lookup <- match(mersey.lsoa$zonecode, deprivation$LSOA)
mersey.lsoa@data <- cbind(mersey.lsoa@data, deprivation[lsoa.lookup, -1])
The first line matches the LSOA code in the mersey.lsoa SpatialPointsDataFrame to the code in the deprivation data frame. The second line uses the cbind function to join several columns (from deprivation) to the existing data part of mersey.lsoa. As usual, head can be used to see the first few lines of the data part of mersey.lsoa:
head(mersey.lsoa@data)
gid geonorth popnorth geoeast name label popeast
0 458 381722 383625 342348 Liverpool 058B 04BYE01006739 342928
1 459 383167 384902 323263 Wirral 033B 04CBE01007284 325257
2 461 420336 418219 332592 Sefton 004B 04CAE01006950 334110
3 2023 379256 380601 326022 Wirral 040F 04CBE01007191 327837
4 2024 380279 380955 325469 Wirral 040E 04CBE01007190 326631
5 2025 381092 381843 324075 Wirral 040B 04CBE01007187 325788
zonecode asb.count pop asb.rate Overall Income Employment Health
0 E01006739 14 1684 83.14 66.25 0.45 0.26 1.63
1 E01007284 0 1588 0.00 5.88 0.04 0.06 -0.27
2 E01006950 9 1591 56.57 41.50 0.24 0.22 1.81
3 E01007191 4 1323 30.23 5.73 0.02 0.05 -0.27
4 E01007190 3 1402 21.40 3.40 0.02 0.03 -0.41
5 E01007187 2 1392 14.37 4.71 0.03 0.05 -0.26
Education Housing Crime Environment
0 72.85 18.36 1.16 51.60
1 2.27 13.38 -1.03 6.42
2 20.45 19.43 0.34 32.29
3 0.92 27.90 -1.71 1.58
4 0.54 20.88 -1.65 1.86
5 0.75 21.64 -1.51 1.18
Also, it is possible to draw maps of these indices. For example, to mapo the overall deprivation, enter:
breaks <- choro(mersey.lsoa, "Overall")
legend(x = 347000, y = 420000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)
title("Overall Deprivation in Merseyside (By LSOA)")
Again, this would be a useful map to cut and paste into a scrapbook. Note the previous step can easily be modified to map other aspects of deprivation. Just replace "Overall" for one of the other column titles, and also change the title top be more appropriate. For example, for income deprivation enter:
breaks <- choro(mersey.lsoa, "Income")
legend(x = 347000, y = 420000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)
title("Income Deprivation in Merseyside (By LSOA)")
Now that the SpatialPolygonsDataFrame object now has information about anti-social behaviour incidents, and deprivation. This makes it possible to consider the relationship between the two quantities. For example, here is a plot:
plot(mersey.lsoa$Overall, mersey.lsoa$asb.rate, xlab = "Index of Overall Deprivation",
ylab = "ASB Rate (per 10,000 people)")
title("Anti-Social Behaviour and Deprivation")
It is possible to fit a straight line through the data, to see the relationship between the two quantities:
abline(lm(mersey.lsoa$asb.rate ~ mersey.lsoa$Overall), col = "darkblue", lwd = 2)
the abline function draws a line fitted between the two variables (the fitting is done by lm(mersey.lsoa$asb.rate~mersey.lsoa$Overall) where lm stands for Linear Model). This may be another good graph to copy into the scrap book.
You may notice that although the line has a slight upward slope, there is quite a lot of variability around this. One reason for this is that although generally higher levels of deprivation in an area is associated with higher deprivation rates, this is not always the case – for example there are relatively high levels of anti social behaviour in some areas of mid-level deprivation, or in some town centre areas where there are quite a few pubs and night clubs, but which do not score highly in IMD terms. Also, the crime rates here are just for one month – and working with such a small time window adds to the variability in the data – in a given month there may be a spate of incidents in a one neighbourhood, whereas other areas that have a generally high risk may have experienced few incidents in this short period of time. Thus, it might be quite surprising to see a very strong relationship.
It is also possible to compute the correlation coefficient between the two quantities using R. Recall this is a number between -1 and 1 and briefly may be interpreted as follows:
| Value | Interpretation |
|---|---|
| -1 | A perfect negative correlation – that is, the observed X and Y points all lie on a perfect line, with a downward slope. |
| –1 to 0 | There is some negative connection between X and Y, but that the observed points are scattered around the fitted line - which has a downward slope. The closer to 0 the coefficient is, the more scattered the points are. |
| 0 | No correlation – there seems to be no connection between the variables X and Y |
| 0 to 1 | There is some positive connection between X and Y, but that the observed points are scattered around the fitted line - which has an upward slope. The closer to 0 the coefficient is, the more scattered the points are. |
| 1 | A perfect positive correlation - that is, the observed X and Y points all lie on a perfect line, with an upward slope |
Note that the exact values of -1, 0, and 1 are really theoretical milestones - real data is nearly always in one of the two intervals: between -1 and 0 or between 0 and 1.
In R the function cor computes correlations. For example, to compute the correlation between the asb rate and the “Overall” index of deprivation, enter:
cor(mersey.lsoa@data$asb.rate, mersey.lsoa@data$Overall)
[1] 0.5082
which shows that the correlation between the overall index of deprivation and the asb rate per 10,000 people is around 0.5. It is also possible to specify a range of columns to compute the correlation. For example, in mersey.lsoa columnns 13 to 19 contain all of the different components of deprivation. Enter:
cor(mersey.lsoa@data$asb.rate, mersey.lsoa@data[, 13:19])
Income Employment Health Education Housing Crime Environment
[1,] 0.477 0.4905 0.518 0.3947 0.1206 0.4535 0.4036
to see how anti-social behaviour correlates to each of the components. From these, it can be seen that the strongest association is between Health and the asb rate, and the weakest is for Housing.
This is a summative assignment, so both hard copy and electronic submission will be needed. The assignment is split into two parts:
Using the scrap book of maps you produced earlier, cut and paste each of the maps (asb rate, each aspect of deprivation) into your assignment and write a brief (500 word) overview of the geographical patterns you see.
It is possible to modify the scatter plot graphs produced earlier, by substituting the word Overall for one of Income, Employment, Health, Education, Housing, Crime, Environment - that is the individual aspects of deprivation - in the command to plot the graph and supplying an appropriate title and x-label - for example:
plot(mersey.lsoa$Housing, mersey.lsoa$asb.rate, xlab = "Index of Housing Deprivation",
ylab = "ASB Rate (per 10,000 people)")
title("Anti-Social Behaviour and Housing Deprivation")
Cut and paste a plot for each of these aspecrts of deprivation into your assignment document. Write a short description of the overall patterns that you see (ie summarising the information in all of the plots - around 400-500 words) - you can also make use of the correlation analysis to describe the rtelationships.
Hints

This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.