Getting Started With R

Aims of This Session:

The Very Basics

Firstly, you will need to create yourself a project folder in your storage area. Any files that you create during this exercise will be stored here. Do this in Windows and remember where you created the folder - you will need this later.

Next, double-click on the R icon (labelled R i386 3.0.1), and the R console (a window) appears. Unlike most (or possibly all) of the software you may be used to, it is not usually controlled by clicking on menu items, forms, or buttons, but by typing text into this window.

The results are generally either graphs or text also printed into this window. R can initially be used as a calculator - enter the following:

6 + 8

Don't worry about the [1] for the moment - just note that R printed out 14 since this is the answer to the sum you typed in. In these tutorials, sometimes I show the results of what you have typed in. This is in the format shown below:

5 * 4
[1] 20

Also note that * is the symbol for multiplication here - the last command asked R to perform the calculation '5 times 4'. Other symbols are - for subtraction and / for division:

12 - 14
[1] -2
6/17
[1] 0.3529

R also has functions like square root, sine, cosine and so on. You can calculate these like this:

sqrt(25)
[1] 5
sqrt(2)
[1] 1.414

the examples above use the square root (sqrt) function. You can also combine these to make more complicated expressions:

sqrt(3 * 3 + 4 * 4)
[1] 5

You can also assign the answers of the calculations to variables, and use them in calculations. You do this as below

price <- 300

Here, the value 300 is stored in the variable price. The <- symbol means put the value on the right into the variable on the left, it is typed with a < followed by a -. This can be used in subsequent calculations. For example, to apply a 20% discount to this price, you could enter the following:

price - price * 0.2
[1] 240

or use intermediate variables:

discount <- price * 0.2

price - discount
[1] 240

R can also work with lists of numbers, as well as individual ones. Lists are specified using the c function. Suppose you have a list of house prices listed in an estate agents, specified in thousands of pounds. You could store them in a variable called house.prices like this:

house.prices <- c(120, 150, 212, 99, 199, 299, 159)
house.prices
[1] 120 150 212  99 199 299 159

Note that there is no problem with full stops in the middle of variable names.

You can then apply functions to the lists. For example to take the average of a list, enter:

mean(house.prices)
[1] 176.9

If the house prices are in thousands of pounds, then this tells us that the mean house price is 176.9 thousand pounds. Note here that on your display, the answer may be displayed to more significant digits, so you may have something like 176.8571 as the mean value.

For the next exercise, try entering a larger data set (this time these are household burglaries per 10,000 households over a one month period) for 118 neighbourhoods in St. Helens:

burg.rates <- c(0, 7, 0, 0, 6, 19, 32, 0, 0, 0, 15, 6, 12, 8, 7, 6, 0, 0, 6, 
    0, 7, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 21, 7, 12, 7, 36, 18, 0, 0, 7, 6, 0, 
    0, 0, 0, 0, 13, 22, 0, 0, 0, 7, 12, 7, 5, 11, 0, 0, 13, 13, 0, 6, 15, 6, 
    17, 37, 0, 6, 6, 5, 24, 0, 0, 0, 0, 0, 0, 0, 5, 15, 0, 5, 6, 0, 0, 0, 13, 
    0, 6, 0, 0, 0, 23, 6, 13, 15, 6, 0, 0, 7, 7, 0, 0, 0, 0, 19, 13, 0, 0, 0, 
    6, 9, 0, 0, 0, 0, 0, 5)

Note that if an R expression has obviously not been completed and you hit return, then you can continue to type it on the following line. This carries on until R has worked out that the command is finished. In this case, that happens when the close brackets ) has been typed, and return has been hit afterwards. If you now enter the variable name (burg.rates) you can see all of the values listed:

burg.rates
  [1]  0  7  0  0  6 19 32  0  0  0 15  6 12  8  7  6  0  0  6  0  7  0  0
 [24]  0  0  0  0  0 17  0  0 21  7 12  7 36 18  0  0  7  6  0  0  0  0  0
 [47] 13 22  0  0  0  7 12  7  5 11  0  0 13 13  0  6 15  6 17 37  0  6  6
 [70]  5 24  0  0  0  0  0  0  0  5 15  0  5  6  0  0  0 13  0  6  0  0  0
 [93] 23  6 13 15  6  0  0  7  7  0  0  0  0 19 13  0  0  0  6  9  0  0  0
[116]  0  0  5

In this example, there are 100 different rates and the last command lists them - they are more neatly written out than when you entered them, and perhaps now you can see what the numbers in square brackets are used for. Essentially they show you the position in the list of the first number on each row - thus if a row begins with [24] this implies that the 24nd number in the list is at the start of this printed row. The main idea is to allow you to find positions in the list of higher numbers more easily. Note that if you want to find out more about the basic use of R, a helpful guide can be found here: http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf - in particular, pages up to and including 28 are very useful.

Simple Graphics

It is also possible to draw graphics using the data you have put in to the variables. This draws a histogram of the burglary data:

hist(burg.rates)

plot of chunk unnamed-chunk-14

By selecting the window it is possible to copy and paste the images into other documents, for example into Word or PowerPoint or their open source alternatives.

Generally speaking in R, commands tend to give very basic plots, unless further details are provided. Thus to get a histogram with red bars, enter:

hist(burg.rates, col = "red")

plot of chunk unnamed-chunk-15

and to change the main title, xlab (ie x-label) and ylab (y-label) use:

You can press the up arrow on the keyboard to show the previous command. You can then alter this as necessary, rather than needing to retype the whole command!

hist(burg.rates, col = "red", main = "Burglaries per 1000 households", xlab = "Rate", 
    ylab = "Frequency")

plot of chunk unnamed-chunk-16

Now enter another related variable, the median house price (in 1000's of pounds) for a three-bedroom semi-detached house for each of the neighbourhoods.

house.prices <- c(200, 130, 200, 200, 180, 140, 65, 220, 180, 200, 210, 170, 
    180, 160, 180, 130, 240, 180, 170, 230, 150, 200, 200, 210, 220, 180, 200, 
    210, 150, 200, 230, 120, 180, 180, 190, 72, 80, 190, 220, 150, 200, 170, 
    170, 230, 200, 160, 140, 100, 140, 170, 180, 260, 170, 230, 190, 220, 140, 
    220, 120, 96, 210, 170, 180, 140, 150, 67, 200, 230, 140, 230, 83, 170, 
    200, 210, 240, 180, 200, 210, 250, 140, 130, 190, 110, 160, 150, 230, 160, 
    210, 200, 230, 210, 190, 120, 180, 87, 160, 190, 190, 230, 180, 110, 200, 
    250, 180, 200, 130, 180, 190, 190, 230, 210, 210, 150, 190, 210, 200, 210, 
    170)

As before, it is possible to draw a histogram of this variable:

hist(house.prices, col = "lightblue", main = "House Price", xlab = "1000s Pounds", 
    ylab = "Frequency")

plot of chunk unnamed-chunk-18

and also to create a scatter plot of the two variables, to see how median house price relates to burglary rate:

plot(burg.rates, house.prices, main = "Burglary vs. House Price", xlab = "Burglaries (per 1000 households)", 
    ylab = "Median House Price (1000s Pounds)")

plot of chunk unnamed-chunk-19

This shows that there is a relationship between the two quantities, although there is still a fair amount of randomness as well. The points show there is a general tendency for house prices to fall as burglary rate increases, but that there are other factors affecting house price as well.

The Data Frame

In the last section, you looked at two variables called house.prices and burg.rates and graphed their relationship. It is possible to combine the individual variables in a data frame - rather like an internal spreadsheet where all of the relevant data items are stored together as a set of columns. This is similar to the data set storage in SPSS (for those of you who have used that package) where each variable corresponds to a column and each case (or observation) corresponds to a row. However, while SPSS can only have one data set active at a time, in R you can have several of them.

To create a data frame containing the last two variables enter

hp.data <- data.frame(Burglary = burg.rates, Price = house.prices)

Then type in its name to list it:

hp.data

Just to explain what has happened here: the function data.frame takes all of the variables that you wish to have as columns. The Burglary=burg.rates creates a column in the data frame called Burglary containing the values in the variable burg.rates in the last section. Similarly, it has a column called Price containing the values from house.prices. This new data frame is stored as a new object called hp.data (an object in R is similar to a variable, although it can be more complex - so it can contain more sophisticated things like data frames, not just a list of values). Typing in the name of the object (once it has been created) lists the values in the columns.

You can also enter

fix(hp.data)

to view this data frame in a window, and also edit values in the 'cells'. However, although it might be interesting to try using fix now, try it, but don't actually edit anything. To return to the R command line click on Quit in the data frame window. NOTE: it is important to do this, otherwise you won't be able to type anything in to R.

You can also describe each column in the data set using the summary function:

summary(hp.data)
    Burglary         Price    
 Min.   : 0.00   Min.   : 65  
 1st Qu.: 0.00   1st Qu.:152  
 Median : 0.00   Median :185  
 Mean   : 5.64   Mean   :179  
 3rd Qu.: 7.00   3rd Qu.:210  
 Max.   :37.00   Max.   :260  

For each column, a number of values are listed:

Item Description
Min. The smallest value in the column
1st. Qu. The first quartile (the value ¼ of the way along a sorted list of values)
Median The median (the value ½ of the way along a sorted list of values)
Mean The average of the column
3rd. Qu. The third quartile (the value ¾ of the way along a sorted list of values)
Max. The largest value in the column

Between these numbers, an impression of the spread of values of each variable can be obtained. In particular it is possible to see that the median house price in St. Helens by neighbourhood ranges from 65,000 pounds to 260,000 pounds and that half of the prices lie between 152,500 pounds and 210,000 pounds. Also it can be seen that since the median measured burglary rate is zero, then at least half of areas had no burglaries in the month when counts were compiled.

Now is a good time to remind you to save your data on a regular basis. This is particularly important if you are working on coursework, and want to stop half-way through. R has two different elements you can save - the workspace and the history. The workspace is the most important element, as it contains any data frames or other objects you have created - like the hp.data object we created earlier. To do this, click File > Save Workspace. Choose somewhere to save it (your working folder is a good place) and give it a name. To load these in a new session, click File > Load Workspace and select your file. The same process works for the History.

Getting data in to R

There are lots of methods for getting data from some external file into R. One way is to enter it in directly, as in the last section, but this isn't practical for large datasets. It is also possible to read data in from text files (this will be done later) and to read it in from a source on the internet. Here, the internet method is demonstrated. In this case the data set is one made publicly available on my public dropbox folder. If you go follow the web link “http://dl.dropbox.com/u/7013997/ENVS257/hpdata.csv”, you can see it in its 'raw' form. If you click on the link and then on the 'Open' option, the data will open in a read-only spreadsheet. Having seen the data set, close the spreadsheet and return to the R window.

The file is called a csv file - short for Comma Separated Values - as in its 'raw' form it justs consists of several lines of variables separated by commas. Essentially this file contains the same data as the house price and burglary data you used earlier - although an extra column, with an ID number of each neighbourhood, is also included. You can read a csv file on the internet directly into a variable using the read.csv function in R. Here the contents of the file are read into a data frame object called hp.data2.

hp.data2 <- read.csv("http://dl.dropbox.com/u/7013997/ENVS257/hpdata.csv")

The function takes the URL of the file as an argument, and reads it in to the object on the left of the assignment (<-) symbol. As before, it is possible to list the data frame

hp.data2
     ID Burglary Price
1    21        0   200
2    24        7   130
3    31        0   200
4    32        0   200
5    78        6   180
6    80       19   140
7    81       32    65
8    98        0   220
9   100        0   180
10  101        0   200
11  102       15   210
12  111        6   170
13  112       12   180
14  113        8   160
15  114        7   180
16  115        6   130
17  116        0   240
18  117        0   180
19  118        6   170
20  119        0   230
21  120        7   150
22  121        0   200
23  122        0   200
24  123        0   210
25  124        0   220
26  125        0   180
27  126        0   200
28  127        0   210
29  128       17   150
30  614        0   200
31  620        0   230
32  621       21   120
33  622        7   180
34  624       12   180
35  625        7   190
36  626       36    72
37  627       18    80
38  628        0   190
39  629        0   220
40  764        7   150
41  765        6   200
42  766        0   170
43  767        0   170
44  768        0   230
45  769        0   200
46  833        0   160
47  834       13   140
48  835       22   100
49  836        0   140
50  837        0   170
51  838        0   180
52  839        7   260
53  840       12   170
54  841        7   230
55  842        5   190
56  843       11   220
57  844        0   140
58  845        0   220
59  846       13   120
60  847       13    96
61  848        0   210
62  849        6   170
63  850       15   180
64  851        6   140
65  852       17   150
66  853       37    67
67  854        0   200
68  855        6   230
69  856        6   140
70  857        5   230
71  858       24    83
72  859        0   170
73  860        0   200
74  861        0   210
75  862        0   240
76  863        0   180
77  864        0   200
78  865        0   210
79  866        5   250
80  867       15   140
81  868        0   130
82  869        5   190
83  870        6   110
84  871        0   160
85  872        0   150
86  873        0   230
87  874       13   160
88  875        0   210
89  876        6   200
90  877        0   230
91  878        0   210
92  879        0   190
93  880       23   120
94  881        6   180
95  882       13    87
96  883       15   160
97  884        6   190
98  885        0   190
99  886        0   230
100 887        7   180
101 888        7   110
102 889        0   200
103 890        0   250
104 891        0   180
105 892        0   200
106 893       19   130
107 894       13   180
108 895        0   190
109 896        0   190
110 897        0   230
111 898        6   210
112 899        9   210
113 900        0   150
114 901        0   190
115 902        0   210
116 903        0   200
117 904        0   210
118 905        5   170

and also to summarise it:

summary(hp.data2)
       ID         Burglary         Price    
 Min.   : 21   Min.   : 0.00   Min.   : 65  
 1st Qu.:616   1st Qu.: 0.00   1st Qu.:152  
 Median :846   Median : 0.00   Median :185  
 Mean   :654   Mean   : 5.64   Mean   :179  
 3rd Qu.:876   3rd Qu.: 7.00   3rd Qu.:210  
 Max.   :905   Max.   :37.00   Max.   :260  

Note that the second and third columns of this are the same as before – and also that the first column does not really make sense. Since it is an ID number, it is really just a identifying code, and it doesn't really matter what its minum value, or mean value might be - you might as well compute an average student ID number. For this reason, it is useful to ask R to drop this column from hp.data2 when computing the summary. This is fairly simple. Firstly note that you can use square brackets to pick out individual values in the data frame. For example to select the value in the 15th row of column 2, enter

hp.data2[15, 2]
[1] 7

and you can look at the full data frame to check that this is indeed 7. Also, for the columns, you can replace the column number for its name, provided it is in quotes:

hp.data2[15, "Burglary"]
[1] 7

Also, instead of just specifying a single row, it is possible to specify a range of rows:

hp.data2[10:15, "Burglary"]
[1]  0 15  6 12  8  7

This lists the burglary rates for neighbourhoods 10-15 in the data set.

If you want to specify a full column (ie all of the burglary rates), just leave the part where you would write the column range empty:

hp.data2[ ,"Burglary"]
  [1]  0  7  0  0  6 19 32  0  0  0 15  6 12  8  7  6  0  0  6  0  7  0  0
 [24]  0  0  0  0  0 17  0  0 21  7 12  7 36 18  0  0  7  6  0  0  0  0  0
 [47] 13 22  0  0  0  7 12  7  5 11  0  0 13 13  0  6 15  6 17 37  0  6  6
 [70]  5 24  0  0  0  0  0  0  0  5 15  0  5  6  0  0  0 13  0  6  0  0  0
 [93] 23  6 13 15  6  0  0  7  7  0  0  0  0 19 13  0  0  0  6  9  0  0  0
[116]  0  0  5

You can use a similar approach to select out a row of the data -

hp.data2[12, ]
    ID Burglary Price
12 111        6   170

This gives the ID, Burglary and Price (i.e. house price) values for the 12th neighbourhood in the list.

Another way of selecting out columns is to use the $ (dollar) approach. Entering

hp.data2$Price
  [1] 200 130 200 200 180 140  65 220 180 200 210 170 180 160 180 130 240
 [18] 180 170 230 150 200 200 210 220 180 200 210 150 200 230 120 180 180
 [35] 190  72  80 190 220 150 200 170 170 230 200 160 140 100 140 170 180
 [52] 260 170 230 190 220 140 220 120  96 210 170 180 140 150  67 200 230
 [69] 140 230  83 170 200 210 240 180 200 210 250 140 130 190 110 160 150
 [86] 230 160 210 200 230 210 190 120 180  87 160 190 190 230 180 110 200
[103] 250 180 200 130 180 190 190 230 210 210 150 190 210 200 210 170

extracts the column called Price and so on. This (and indeed the earlier ways of extracting columns) can be used in graphics commands, and so on. For example, a box-plot of the prices can be obtained by

boxplot(hp.data2$Price, col = "lightgreen")

plot of chunk unnamed-chunk-30

The box-plot is a graphical equivalent of the summaries seen earlier - the end-points of the longer lines show the maximum and minimum values, and the end points of the central box are the first and third quartiles. The line on the box marks the median.

Geographical Information

Until now, although this is geographical data, no maps have been drawn. In this section you will do this. Firstly, you need to load a new package into R - a package is an extra set of functionality that extends what R is capable of doing. Here, the new package is called maptools and it extends R by allowing it to draw maps, and handle geographical information. The packages are already installed, so all you need to do is to let R know you want to use the package:

library(maptools)

However, this just makes R able to handle geographical data, it doesn't actually load any specific data sets. To do this, you'll need to obtain them from somewhere. Data for maps in R can be read in from shapefiles - these are a well known geographical information exchange format. Although the term 'shapefile' is used, in actuality a given data set consists of several files. For this reason, I will use the term 'shapefile set' to mean all of the files needed for a particular geographical data set. The first task is to obtain a shapefle set of the St. Helens neighbourhoods (or Lower Super Output Areas - LSOAs, as they are more formally called). To do this, you will need to put some information into the working folder you created earlier. The exact function you need to type in will depend on the name you have called your working folder. For example, my folder is called 'R Work' and is on the M: drive, so I type

setwd("M:/R work")

Your version might well have a longer title. Also note that slashes are indicated with a '/' not '\'.

There is a set of shapefiles for the St. Helens neighbourhoods at the same location as the data set you read in earlier. Since several files are needed, I have bundled these together in a single zipfile. You will now download this to your local folder and subsequently unzip it. This can all be done via R functions:

download.file("http://dl.dropbox.com/u/7013997/ENVS257/sthel.zip", "sthel.zip")
unzip("sthel.zip")

The first function actually downloads the zip file into your working folder, the second one unzips it, creating the shapefile set. All of the shapefile set files begin with sthel but then have different endings, eg sthel.shp, sthel.dbx and sthel.shx. Now, these can be read in to a SpatialPolygons object.

sthel <- readShapeSpatial("sthel")

the readShapeSpatial function does this, and stores them into another type of object called a SpatialPolygons object. Recall that geographical data can be of type Point, Line or Polygon and that polygons are areas or regions, such as the neighbourhoods (LSOAs) in St. Helens. You can use the plot function to draw the polygons (i.e. the map of the LSOAs).

plot(sthel)

plot of chunk unnamed-chunk-34

Finally, a SpatialPolygonsDataFrame is the result of joining a SpatialPolygons object up with a data frame. The idea is that each polygon corresponds to one row of the data frame - i.e, the geographical description of the boundaries of each neighbourhood with the data about that neighbourhood. Here a SpatialPolygonsDataFrame is created by joining the sthel SpatialPolygons object to the hp.data2 data frame object. The result is assigned to a SpatialPolygonsDataFrame object called sthel.prices.

sthel.prices <- SpatialPolygonsDataFrame(sthel, hp.data2, match.ID = FALSE)

Finally you will draw a choropleth map of these house prices. In later exercises more will be explained as to how this works, but for now the method will simply be demonstrated. One useful feature of R is that it is possible to write new functions, and therefore extend its capabilities. The following function (called choro) is a simple choropleth map drawing tool:

choro <- function(spdframe, variable) {
    var <- spdframe@data[, variable]
    breaks <- classIntervals(var, n = 6, style = "fisher")$brk
    my_colours <- brewer.pal(6, "Greens")
    plot(spdframe, col = my_colours[findInterval(var, breaks, all.inside = TRUE)], 
        axes = FALSE, border = rgb(0.8, 0.8, 0.8))
    invisible(list(b = breaks, c = my_colours))
}

For now, just copy and paste this into R and hit return. Don't worry about trying to interpret it!

Note that it is important to make sure the upper and lower case letters you type in are exactly the same as the ones above. This is generally the case with R. If a window comes up asking where to install the library from, select one of the UK locations. When it has installed, just enter

library(classInt)

to load it. Also, load a further package for providing map shading palettes called RColorBrewer.

library(RColorBrewer)

Now it is possible to draw the choropleth map. Here we draw one of the burglary rates.

breaks <- choro(sthel.prices, "Burglary")

plot of chunk unnamed-chunk-40

The map shows the different burglary rates, with dark shading indicating a higher rate. However, this is much easier to interpret if you add a legend to the map. For now, just enter the following. As before, more about how all of the commands work will be disclosed later in the module.

legend(x = 357000, y = 392000, legend = leglabs(breaks$b), fill = breaks$c, 
    bty = "n")

plot of chunk unnamed-chunk-41

You can also add a title to the map:

title("Burglary Rates per 10,000 Homes in St. Helens")

plot of chunk unnamed-chunk-42

As a final useful technique you can copy and paste maps like this into Word documents. Select the window with the map on it, and then right click on it. When the menu appears, select 'copy as bitmap'. If you also have Word up and running, you can then paste the map into your document.

Working with Open Government Data

The police.uk web site

In December of 2010 the police.uk web site was launched, providing details of geographical locations of crime in England and Wales. As well as providing an interface to allow people to view crimes in their (or other peoples') neighbourhoods, this also provided a means for downloading the crime data. Using this, combined with other data it is possible to obtain crime rate maps for anywhere in England and Wales, for a number of different classes of crime.

Mapping Crime Rates

Compare and Contrast Crime Rates

This section of the practical will examine the use of contemporary data about crime rates and population levels that can be merged to draw maps so that you can investigate hypotheses about geographical patterns of crime, for a number of different crime types, in the Merseyside region.

Looking at Crime Data

Firstly, you can use the police.uk web site to investigate local patterns in crime without making use of any other software. To do this, firstly go to the web site http://www.police.uk – this will show you the start-up page for this website – which, as stated earlier, provides a portal to maps of crime rates in localities in England and Wales – and looks something like this:

To try this out, click the 'find your neighbourhood' and enter the postcode L69 7ZT into the 'postcode or place name' box. Click on the 'search' button and select 'City Centre, Merseyside Police'. This postcode corresponds to the Roxby Building, near Myrtle Street, on the edge of the campus. When you selected the location it will give you a new page with some information on the policing in the area. Click on 'Explore the crime map' and it will look something like this:

Using the + and - buttons in the top left of the map, you can 'zoom in' on the black markers. Another method is to click on a black marker. You can also ‘drag’ the map in the window (press the left mouse button while moving the mouse when the pointer is inside the map window) to pan around the area. Continue to zoom in, clicking on the black markers. When the map does not zoom in any further, it will show a popup window with a list of crimes that took place in the last month. Note: Anti-Social Behaviour (ASB) is often listed, which does not always imply a literal criminal offence.

Is the Web Site Enough?

Exploring the above maps gives some interesting insights into local crime. However, it also raises some questions. One issue is that although the circles with counts of crimes are centred on certain points on the map, these do not correspond exactly to where the crimes occurred. This is intentional on the part of the designers of this site, and preserves privacy (for example, the web site does not intend to publicise the exact addresses of houses that were burgled). The points where crimes are mapped either correspond to public places (e.g. train stations) or to the centre of a group of at least 12 private addresses. However, the choice of mapping technique has faced some criticism because it still uses a point-based approach – giving the impression that a crime (or several crimes) did occur at some very specific location, when in fact it did not – see for example figure below – there is no direct evidence that 10 incidents occurred in exactly the location in Myrtle Parade as implied on the map – this is just the centre of the group of houses and other buildings. See http://bcsmaps.blogspot.com/2011/02/uk-police-maps-x-does-not-mark-spot.html for more discussion.

Using the Data With R

Since there are certain limitations to viewing the data directly on the web site, it is fortunate that another option is to download the raw data and display it in other ways, using R. To do this, click on the 'options' menu in the top right hand corner of the webpage (blue background with three white lines) and click on 'Use the data'. This takes you to a new page where data may be downloaded. Next, click 'Downloads', scroll down and select ‘Merseyside Police’, then click on 'Generate file'. Click the orange 'Download Now' button. This gives a CSV (comma separated values) file for the street-level Merseyside police data – each line has details of each crime, including crime type, location and some other features, separated by commas. The download will probably put the file in your Downloads folder. Using folder one you created earlier - move the zip file to this folder.

Since the file is compressed to speed up the download (that is, zipped) you need to extract it first. Double click on the file, and you will see a new folder window appear containing file 2013-11-merseyside-street.csv. It is a good idea to extract it before using it in R. To do this, click on ‘extract’ in the top of the folder window. This gives a window to choose the folder into which the file should be extracted. Choose the working folder that you were using in the last exercise.

Note: 2013-11 refers to the actual year and month that you download – this can change depending on the actual date when you do the practical – for the rest of these notes be aware that the 2013-11 part of the name may be replaced with another year and month) - and also that some of the maps or graphs you obtain will be slightly different to those on this web site.

Now, the data is in place and you need to start up R. Do this in the same way you did in the last exercise. When R is running enter the commands to set it to run in the folder you keep the data. Remember that to do this, the exact function you need to type in will depend on the name you have called your working folder. Again, from last time, my folder is called 'R Work' and is on the M: drive, so I type

setwd("M:/R work")

your version might well have a longer title. Also note that slashes are indicated with a '/' not '\'.

Now load the maptools library (as you did before) and then read in the CSV file you downloaded with the crime data. Store it into a data frame called crimes.

require(maptools)
crimes <- read.csv("2013-11-merseyside-street.csv")

You can inspect the data frame in a window using fix(crimes) - but don't edit the values and remember that you have to close the window before typing anything else into R.

You will see that the data consists of a number of columns, each with a heading. Two of these are called Longitude and Latitude – these are the column headers that give the coordinates of each incident (either a crime or an ASB event) in the data you have just downloaded. Another is headed Crime.Type and tells you which type of crime occurred. Note that the grid references refer to the centres of small groups of households (or possibly public places) mentioned earlier.

At the moment, the data is just in a data.frame object - not any kind of spatial object. To create the spatial object, enter the following:

coords <- cbind(Longitude = as.numeric(as.character(crimes$Longitude)), Latitude = as.numeric(as.character(crimes$Latitude)))
crime.pts <- SpatialPointsDataFrame(coords, crimes[, -(5:6)], proj4string = CRS("+init=epsg:4326"))

This creates a SpatialPointsDataFrame object. This first line prepares the coordinates into a form that the SpatialPointsDataFrame can use. The SpatialPointsDataFrame function on the second line takes three arguments - the first is coordinates, created in the line above. The second argument is the data frame minus columns 5 and 6 - this is what -(5:6) indicates. These columns provide all the non-geographical data from the data frame. The resulting object crime.pts is a spatial points geographical shape object, whose points are each recorded crime in the data set you download. To see the geographical pattern of these crimes, enter:

plot(crime.pts, pch = ".", col = "darkred")

plot of chunk soddo

as you can see, you have now successfully obtained some crime data from police.uk and read it into R. The option pch='.' tells R to map each point with a full-stop character - and to make the colour of these dark red.

You can also examine the kinds of different crimes recorded.

table(crime.pts$Crime.type)

       Anti-social behaviour                Bicycle theft 
                        4309                          151 
                    Burglary    Criminal damage and arson 
                        1200                         1588 
                       Drugs                  Other crime 
                         840                           67 
                 Other theft        Possession of weapons 
                         916                           46 
                Public order                      Robbery 
                         264                           78 
                 Shoplifting        Theft from the person 
                         921                          248 
               Vehicle crime Violence and sexual offences 
                         714                         1147 

For a categorical column in a data frame (that is, one that attributes each location to a specific category, in this case 'crime type') the table function counts the number of times each category appears in the column. From this, it is possible to identify the relative frequencies with which each kind of crime occurs.

Obtaining a Layer of Lower Super Output areas

The next stage is to obtain some further map data. In particular, the point data you have just drawn still suffers from the 'X does not mark the spot' issue discussed earlier on. The next stage of this practical is to obtain some neighbourhood zone data and to aggregate the crimes to this. These data are useful for a number of reasons:

  1. Go to the UKBorders web site http://edina.ac.uk/ukborders/ and try to log in with the login via UK federation option.
  2. If it asks you to register, see the following section “If you need to register for UKBorders”.
  3. At the opening window select Boundary Data Selector
  4. At the next window, there will be three choices to make. Set
  5. Then click on Find
  6. In the window below, a number of options will appear. Select 'English Super Output Areas (Lower Layer) 2001 [Within Counties]'.
  7. Click the 'List areas' button
  8. Then, in the window below this, select 'Merseyside'.
  9. Then scroll to the bottom (if needed) and click on 'Extract boundary Data'.
  10. If this gives you an error message asking you to agree to special conditions, then see the following section “Special Conditions”.
  11. This gives a new screen stating that the boundary data is being created. Eventually a window appears showing how to download it. Click on BoundaryData.zip to download it.
  12. When the next window pops up, click on Save.
  13. Select the folder you are currently working in to save it – name the file MerseyLSOA and save.
  14. At this point all of the GIS files (shapefiles) to describe the boundaries of Merseyside Lower Super Output Areas (LSOAs) are in this folder, but zipped. They can unzipped and then be loaded in to R.

If you need to register for UKBorders

Follow the registration procedure on the web site, and in a short while you should be able to access this site. For the mean time download the file MerseyLSOA.zip from http://dl.dropbox.com/u/7013997/ENVS257/MerseyLSOA.zip and put it in your working folder, and continue from here - or alternatively enter

download.file("http://dl.dropbox.com/u/7013997/ENVS257/MerseyLSOA.zip", "MerseyLSOA.zip")

which will do this via an R function as in the last practical. However, I recommend working through the other steps when you are registered, so that you are familiar with the process. You will probably need it in the future!

If you need to agree to the Special Conditions

To access the data, you need to agree to special conditions form the UK Data Service. It should give you the option to do so, but if it doesn't follow these instructions:

  1. Go to http://ukdataservice.ac.uk/.
  2. Log in with the same details as before.
  3. Once you are at the 'Your account' page, click on the link to 'Special Conditions' on the left hand side.
  4. Accept the special conditions by clicking on the '2001 Census : Digitised Boundary Data (England and Wales) End User Licence Special Conditions' link and choosing accept.
  5. You should now be able to download the data, following the steps above.

Use the map Data

Firstly, unzip the file you have just downloaded.

unzip("MerseyLSOA.zip")

The name of the downloaded shapefile sets actually aren't based on the term MerseyLSOA, but actually on the term england_low_soa_2001. Although this a little confusing (and not really avoidable, because the name comes from the institution supplying the data) the R object that the shape is read into can still have a more helpful name. To read it in run the code below, the first two lines load the maptools and rgdal libraries, as in the last exercise:

library(maptools)
library(rgdal)
mersey.lsoa <- readShapeSpatial("england_low_soa_2001")

The crime data have locations given in latitude and longditude rather than Eastings and Northings, so we need to reprject the data so we can display it with the MereyLSOA data we have just loaded:

crime.pts <- spTransform(crime.pts, CRS("+init=epsg:27700"))

It is now possible to draw the crime data on a Merseyside backdrop:

plot(mersey.lsoa, border = "grey")
plot(crime.pts, pch = ".", col = "darkblue", add = TRUE)

plot of chunk unnamed-chunk-50

Note that including add=TRUE tells R to draw the crime points by adding them to an existing map, rather than starting a new map.

Focusing on Antisocial Behaviour

The data you have download includes all crimes. However, now the analysis will focus on reported incidents of antisocial behaviour. To do this, select out the antisocial behaviour points from the full crime shape object.

asb.pts <- crime.pts[crime.pts$Crime.type == "Anti-social behaviour", ]

The above method works similarly to the 'select rows by their number' seen in the first practical. The first thing in the square brackets is a conditional statement - basically it selects out cases for which a condition is true. In this cases, the Crime.type column should contain the attribute “Anti-social behaviour”. The expression after the comma is empty so that all of the rows are included. The result (a point object containing all of the anti-social behaviour incidents) is stored in asb.pts.

Next, find out how many ASBs have occurred in each LSOA. To do this, you need to load the package rgeos - basically a set of tools that help R handle overlay operations, such as finding out where geographical shapes intersect one another, and so on. Next you need to define a function called poly.counts - this counts how many points in a SpatialPointsDataFrame fall into each of a set of polygons in a SpatialPolygonsDataFrame, and creates a new variable. Here, the new variable is called asb.count. The code below carries out all of these operations. Note also that the # comment is placed at the beginning of some lines. This just tells R that the following text is just to explain what the R functions are doing - it doesn't have to be executed.

# This is another R package, allowing GIS overlay operations
library(rgeos)
# This defines a new R function - it counts how many points fall into each
# polygon
poly.counts <- function(pts, polys) colSums(gContains(polys, pts, byid = TRUE))
# The line below actually counts the number of crimes in each LSOA
asb.count <- poly.counts(asb.pts, mersey.lsoa)
Warning: spgeom1 and spgeom2 have different proj4 strings

Now that you have this information, you can draw a map of this. Before doing so, you need to add the choropleth map drawing code that you used earlier.

library(classInt)
library(RColorBrewer)
choro <- function(spdframe, variable) {
    var <- spdframe@data[, variable]
    breaks <- classIntervals(var, n = 6, style = "fisher")$brk
    my_colours <- brewer.pal(6, "Greens")
    plot(spdframe, col = my_colours[findInterval(var, breaks, all.inside = TRUE)], 
        axes = FALSE, border = rgb(0.8, 0.8, 0.8))
    invisible(list(b = breaks, c = my_colours))
}

Just cut and paste the above code into R (as with last time). Now, you can draw a choropleth map of asb counts for each LSOA. Firstly, add the asb counts as a column to the mersey.lsoa SpatialPolygonsDataFrame.

# First, add an ASB event count column to the 'mersey.lsoa'
# SpatialPolygonsDataFrame
mersey.lsoa@data$asb.count <- asb.count
# Now draw a choropleth map - use 'choro' the same way as the last practical
choro(mersey.lsoa, "asb.count")

plot of chunk unnamed-chunk-54

Computing Rates of ASB Occurrence

The map just drawn shows areas with a high absolute anti-social behaviour incident counts. However, it is more meaningful to consider these quantities per head of population - areas with larger populations would have higher absolute counts even if the chances of someone committing an offence was the same everywhere. Also, in some small communities, it might be that crime is relatively high. Working with crimes per person (i.e. per head of population) overcomes this issue. To deal with this, you need to download some population data. This comes from the web site: http://www.ons.gov.uk/ons/publications/re-reference-tables.html?edition=tcm%3A77-320861

At the time of writing, the most recent population estimates available from the government web site come from the 2011 census. Go to the web site above, and then download the file for the Lower Super Output Area Population Estimates, unformatted. This is the second of the five files that you will see. Download it into your working folder.

As before this is a zipped file. Follow the same procedure as before to extract it. Unfortunately, this isn't a csv file, it is stored in Microsoft Excel. You will need to use Excel to extract the data in CSV format before you read it into R. So, firstly open it in Excel and follow this procedure:

  1. Go to the tab marked Mid-2012 Persons
  2. Delete column B (Remember to completly delete the column, not just delete the values in it. To do this, right click on the column header and click 'Delete')
  3. Delete columns C to CO
  4. Change the column header for A to be 'lsoa' (no spaces)
  5. Change the column header for B to be 'pop' (no spaces again)
  6. Select coluimn B and select 'General' as the format - the commas between the thousands on the 'pop' column should go away. Save this file under the new name pop.csv - select csv as the format to save the file.

Once it has been saved, read it in to R.

pop <- read.csv("pop.csv")

This is actually the population estimates for the whole of the LSOAs in Great Britain, not just those in Merseyside. The next stage is to extract the estimates just for Merseyside.
Firstly, use the R function match to find the location in pop that each LSOA code in mersey.lsoa appears - these are stored in lsoa.lookup.

lsoa.lookup <- match(mersey.lsoa$zonecode, pop$lsoa)
head(lsoa.lookup)
[1] 5619 6107 5657 6144 6143 6140

Entering head(lsoa.lookup) shows the first few locations in the look up table - it tells you that the first of the LSOAs in mersey.lsoa corresponds to location 6739 in pop. In genereal, head is a useful function for lookup up the first few items in a variable, or part of an object. For example, try

head(mersey.lsoa@data)
   gid geonorth popnorth geoeast           name         label popeast
0  458   381722   383625  342348 Liverpool 058B 04BYE01006739  342928
1  459   383167   384902  323263    Wirral 033B 04CBE01007284  325257
2  461   420336   418219  332592    Sefton 004B 04CAE01006950  334110
3 2023   379256   380601  326022    Wirral 040F 04CBE01007191  327837
4 2024   380279   380955  325469    Wirral 040E 04CBE01007190  326631
5 2025   381092   381843  324075    Wirral 040B 04CBE01007187  325788
   zonecode asb.count
0 E01006739        12
1 E01007284         0
2 E01006950         1
3 E01007191         1
4 E01007190         3
5 E01007187         1

Now, add a column with the appropriate population estimates to the merseyside SpatialPolygonsDataFrame.

mersey.lsoa@data$pop <- pop[lsoa.lookup, "pop"]

You can check this has worked, by using head again:

head(mersey.lsoa@data)
   gid geonorth popnorth geoeast           name         label popeast
0  458   381722   383625  342348 Liverpool 058B 04BYE01006739  342928
1  459   383167   384902  323263    Wirral 033B 04CBE01007284  325257
2  461   420336   418219  332592    Sefton 004B 04CAE01006950  334110
3 2023   379256   380601  326022    Wirral 040F 04CBE01007191  327837
4 2024   380279   380955  325469    Wirral 040E 04CBE01007190  326631
5 2025   381092   381843  324075    Wirral 040B 04CBE01007187  325788
   zonecode asb.count  pop
0 E01006739        12 1731
1 E01007284         0 1638
2 E01006950         1 1515
3 E01007191         1 1354
4 E01007190         3 1480
5 E01007187         1 1447

You should see that a new column, pop, has been added. Now, a further column, that actually contains the rate of ASBs per head of population, can be added.

mersey.lsoa@data$asb.rate <- 10000 * (mersey.lsoa@data$asb.count/mersey.lsoa@data$pop)

If you get an error message along the lines of:

Warning message:
In Ops.factor(10000 * mersey.lsoa@data$asb.count, mersey.lsoa@data$pop) :
  / not meaningful for factors

It means you probably missed out step 6 above. R is trying to intrepret the population as a string (i.e. text) rather than a number. To tell R that is is a number, add the function as.numeric() around mersey.lsoa@data$pop so the line reads: mersey.lsoa@data$asb.rate <- 10000 * (mersey.lsoa@data$asb.count / as.numeric(mersey.lsoa@data$pop)). This should solve the problem.

Again, use head to check this:

head(mersey.lsoa@data)
   gid geonorth popnorth geoeast           name         label popeast
0  458   381722   383625  342348 Liverpool 058B 04BYE01006739  342928
1  459   383167   384902  323263    Wirral 033B 04CBE01007284  325257
2  461   420336   418219  332592    Sefton 004B 04CAE01006950  334110
3 2023   379256   380601  326022    Wirral 040F 04CBE01007191  327837
4 2024   380279   380955  325469    Wirral 040E 04CBE01007190  326631
5 2025   381092   381843  324075    Wirral 040B 04CBE01007187  325788
   zonecode asb.count  pop asb.rate
0 E01006739        12 1731   69.324
1 E01007284         0 1638    0.000
2 E01006950         1 1515    6.601
3 E01007191         1 1354    7.386
4 E01007190         3 1480   20.270
5 E01007187         1 1447    6.911

Note the multiplication by 10000 - this means rates are per 10000 heads of population. Now it is possible to produce a choropleth map of the ASB rates in Merseyside.

breaks <- choro(mersey.lsoa, "asb.rate")
Warning: var has missing values, omitted in finding classes
legend(x = 347000, y = 425000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)

plot of chunk unnamed-chunk-63

In the leglabs part of the legend function, round is used to round the rate estimates to 1 place of decimal. Without this, the display becomes cluttered. You may need this map for the assignment (as well as some others). For this reason, it is a good idea to create a blank Word document, copy this map into it (see the last practical) and save it. There will be a few more maps and graphs that you will use - it is a good idea to use this Word document as a 'scrap book' and cut and paste the maps into your assignments at the appropriate places.

Comparison to Indicators of Deprivation

The Index of Multiple Deprivation is a score used to measure the degree of deprivation in geographical areas, currently produced by the Department for Communities and Local Government (DCLG) (http://www.communities.gov.uk/corporate/ ). Since 2004 the smallest region for which the IMD is produced has been the LSOA – the spatial unit you were working with in the practical. The index of multiple deprivation is a score based on combining scores for a number of ‘Domains’ of deprivation:

The bracketed figures on the right show the relative weighting that each domain is given to derive the overall score – thus, most emphasis is placed on Income and Employment. Each domain score is itself a combination of a number of variables relating to the individual domain. Note that, for this reason, each domain score is not a simple quantity (such as the rates of a specific illness in the health domain) but is itself a combined score based on several underlying variables. Not all of these underlying variables are made publicly available, but the scores for each domain and the overall IMD have been made publicly available.

Download the IMD Spreadsheet

To obtain this data you will need to visit the web site: http://www.communities.gov.uk/publications/corporate/statistics/indices2010 . If you scroll down this web site, you will see a selection of files to download. Scroll down to the link called 'English indices of deprivation 2010: All domains, all sub domains and supplementary indices' - make sure it is exactly this link that you download - there are quite a few links, but this is the only one that will work for the remainder of this practical. Save this file into the folder you are working in. This is a CSV file, and you should be able to read this directly into R, once it has been saved. The name of the CSV file isn't very informative '1871702.csv' - but you will read this into an R data frame object called deprivation - which should be more helpful.

deprivation <- read.csv("1871702.csv")

This file contains some quite long column names (and a few unnecessary columns). The following step picks out the columns you need:

deprivation <- deprivation[, c(1, 8, 10, 12, 14, 16, 18, 20, 22)]

and the next step changes the column names to more compact forms:

colnames(deprivation) <- c("LSOA", "Overall", "Income", "Employment", "Health", 
    "Education", "Housing", "Crime", "Environment")

as before head shows the first few values

head(deprivation)
       LSOA Overall Income Employment Health Education Housing Crime
1 E01000001    6.16   0.01       0.01  -2.11      0.21   32.60 -1.64
2 E01000002    5.59   0.01       0.01  -2.78      0.26   30.26 -1.93
3 E01000003   13.29   0.07       0.05  -0.97      7.16   40.32 -1.21
4 E01000004   11.17   0.04       0.04  -1.04      1.76   37.92 -1.32
5 E01000005   21.36   0.16       0.07   0.07     20.24   34.66 -1.02
6 E01000006   17.08   0.12       0.06  -0.38     13.84   32.81  0.42
  Environment
1       26.28
2       25.73
3       36.48
4       46.72
5       40.94
6       19.39

There are 9 columns - the first is the LSOA code, the second is the overall measure of deprivation, and the next 7 are scores for the individual aspects of deprivation listed above. Note that for these measures, positive values suggest above national average deprivation and negative values suggest below national average deprivation (with respect to levels in England).

As before, this data frame has information for the whole of England, and a look-up table approach is needed to match them to the mersey.lsoa shapefile.

lsoa.lookup <- match(mersey.lsoa$zonecode, deprivation$LSOA)
mersey.lsoa@data <- cbind(mersey.lsoa@data, deprivation[lsoa.lookup, -1])

The first line matches the LSOA code in the mersey.lsoa SpatialPointsDataFrame to the code in the deprivation data frame. The second line uses the cbind function to join several columns (from deprivation) to the existing data part of mersey.lsoa. As usual, head can be used to see the first few lines of the data part of mersey.lsoa:

head(mersey.lsoa@data)
   gid geonorth popnorth geoeast           name         label popeast
0  458   381722   383625  342348 Liverpool 058B 04BYE01006739  342928
1  459   383167   384902  323263    Wirral 033B 04CBE01007284  325257
2  461   420336   418219  332592    Sefton 004B 04CAE01006950  334110
3 2023   379256   380601  326022    Wirral 040F 04CBE01007191  327837
4 2024   380279   380955  325469    Wirral 040E 04CBE01007190  326631
5 2025   381092   381843  324075    Wirral 040B 04CBE01007187  325788
   zonecode asb.count  pop asb.rate Overall Income Employment Health
0 E01006739        12 1731   69.324   66.25   0.45       0.26   1.63
1 E01007284         0 1638    0.000    5.88   0.04       0.06  -0.27
2 E01006950         1 1515    6.601   41.50   0.24       0.22   1.81
3 E01007191         1 1354    7.386    5.73   0.02       0.05  -0.27
4 E01007190         3 1480   20.270    3.40   0.02       0.03  -0.41
5 E01007187         1 1447    6.911    4.71   0.03       0.05  -0.26
  Education Housing Crime Environment
0     72.85   18.36  1.16       51.60
1      2.27   13.38 -1.03        6.42
2     20.45   19.43  0.34       32.29
3      0.92   27.90 -1.71        1.58
4      0.54   20.88 -1.65        1.86
5      0.75   21.64 -1.51        1.18

Also, it is possible to draw maps of these indices. For example, to mapo the overall deprivation, enter:

breaks <- choro(mersey.lsoa, "Overall")
legend(x = 347000, y = 425000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)
title("Overall Deprivation in Merseyside (By LSOA)")

plot of chunk unnamed-chunk-70

Again, this would be a useful map to cut and paste into a scrapbook. Note the previous step can easily be modified to map other aspects of deprivation. Just replace "Overall" for one of the other column titles, and also change the title top be more appropriate. For example, for income deprivation enter:

breaks <- choro(mersey.lsoa, "Income")
legend(x = 347000, y = 425000, legend = leglabs(round(breaks$b, 1)), fill = breaks$c)
title("Income Deprivation in Merseyside (By LSOA)")

plot of chunk unnamed-chunk-71

Now that the SpatialPolygonsDataFrame object now has information about anti-social behaviour incidents, and deprivation. This makes it possible to consider the relationship between the two quantities. For example, here is a plot:

plot(mersey.lsoa$Overall, mersey.lsoa$asb.rate, xlab = "Index of Overall Deprivation", 
    ylab = "ASB Rate (per 10,000 people)")
title("Anti-Social Behaviour and Deprivation")

plot of chunk unnamed-chunk-72

It is possible to fit a straight line through the data, to see the relationship between the two quantities:

abline(lm(mersey.lsoa$asb.rate ~ mersey.lsoa$Overall), col = "darkblue", lwd = 2)

plot of chunk unnamed-chunk-73

the abline function draws a line fitted between the two variables (the fitting is done by lm(mersey.lsoa$asb.rate~mersey.lsoa$Overall) where lm stands for Linear Model). This may be another good graph to copy into the scrap book.

You may notice that although the line has a slight upward slope, there is quite a lot of variability around this. One reason for this is that although generally higher levels of deprivation in an area is associated with higher deprivation rates, this is not always the case – for example there are relatively high levels of anti social behaviour in some areas of mid-level deprivation, or in some town centre areas where there are quite a few pubs and night clubs, but which do not score highly in IMD terms.

Also, the crime rates here are just for one month – and working with such a small time window adds to the variability in the data – in a given month there may be a spate of incidents in a one neighbourhood, whereas other areas that have a generally high risk may have experienced few incidents in this short period of time. Thus, it might be quite surprising to see a very strong relationship.

It is also possible to compute the correlation coefficient between the two quantities using R. Recall this is a number between -1 and 1 and briefly may be interpreted as follows:

Value Interpretation
-1 A perfect negative correlation – that is, the observed X and Y points all lie on a perfect line, with a downward slope.
–1 to 0 There is some negative connection between X and Y, but that the observed points are scattered around the fitted line - which has a downward slope. The closer to 0 the coefficient is, the more scattered the points are.
0 No correlation – there seems to be no connection between the variables X and Y
0 to 1 There is some positive connection between X and Y, but that the observed points are scattered around the fitted line - which has an upward slope. The closer to 0 the coefficient is, the more scattered the points are.
1 A perfect positive correlation - that is, the observed X and Y points all lie on a perfect line, with an upward slope

Note that the exact values of -1, 0, and 1 are really theoretical milestones - real data is nearly always in one of the two intervals: between -1 and 0 or between 0 and 1.

In R the function cor computes correlations. For example, to compute the correlation between the asb rate and the “Overall” index of deprivation, enter:

cor(mersey.lsoa@data$asb.rate, mersey.lsoa@data$Overall, use = "complete.obs")
[1] 0.4951

which shows that the correlation between the overall index of deprivation and the asb rate per 10,000 people is around 0.5. It is also possible to specify a range of columns to compute the correlation. For example, in mersey.lsoa columnns 13 to 19 contain all of the different components of deprivation. Enter:

cor(mersey.lsoa@data$asb.rate, mersey.lsoa@data[, 13:19], use = "complete.obs")
     Income Employment Health Education Housing  Crime Environment
[1,] 0.4736     0.4965 0.4848    0.4059 0.05131 0.4095      0.3767

to see how anti-social behaviour correlates to each of the components. From these, it can be seen that the strongest association is between Employment and the asb rate, and the weakest is for Housing. The use = "complete.obs" statement in the cor function is required because the LSOA boundaries changed slightly between the 2012 IMD data and the 2011 census data, so there are a few missing variables. The additional statement tells R to only look at the data where the variables are complete.

Assignment

This is a summative assignment, so only an electronic submission will be needed. The total word count is 500 words. The assignment is split into two parts:

1. Looking at geographical patterns

Using the scrap book of maps you produced earlier, cut and paste each of the maps (ASB rate, each aspect of deprivation) into your assignment and write a brief overview of the geographical patterns you see. You can either write one description for each map, or group them together to write a description - do which ever you think is best.

2. Looking at correlations and models

It is possible to modify the scatter plot graphs produced earlier, by substituting the word Overall for one of Income, Employment, Health, Education, Housing, Crime, Environment - that is the individual aspects of deprivation - in the command to plot the graph and supplying an appropriate title and x-label - for example:

plot(mersey.lsoa$Housing, mersey.lsoa$asb.rate, xlab = "Index of Housing Deprivation", 
    ylab = "ASB Rate (per 10,000 people)")
title("Anti-Social Behaviour and Housing Deprivation")

plot of chunk unnamed-chunk-76

Cut and paste a plot any two of these aspects of deprivation into your assignment document. Write a short description of the overall patterns that you see (i.e. summarising the information in the plots) - you can also make use of the correlation analysis to describe the relationships. For example, you could include scatter plots of the IMD domains which have the weakest and strongest correlation with ASB, and in the text you could say why you think this may be.

Hints

  1. Describe the geographical patterns in the maps of ASB rates and the various indices of deprivation. Where are the largest and smallest values?
  2. Looking at Merseyside in Google Earth or Google Maps may provide some help in identifying the different locations on the map.
  3. Make note of which aspects of deprivation have the strongest links to ASB rate, and which have the weakest.

Getting good marks!

  1. Get the basics right - make sure your graphs and maps have titles, legends and axis labels.
  2. Make sure you answer both parts 1 and 2 - geographical patterns and correlations.
  3. Do not put R code in your report.
  4. The word count is 500 words, and please don't include any appendices.

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/deed.en. Based upon Getting Started with R by Chris Brunsdon.