Getting Started With R

Aims of This Session:

The Very Basics

Firstly, you will need to create a project directory in your storage area, using Windows. Any files that you create during this exercise will be stored here.

Next, click on the R icon, and the R console (a window) appears. Unlike most (or possibly all) of the software you may be used to, it is not usually controlled by clicking on menu items, forms, or buttons, but by typing text into this window.

The results are generally either graphs or text also printed into this window. R can initially be used as a calculator - enter the following:

6 + 8

Don't worry about the [1] for the moment - just note that R printed out 14 since this is the answer to the sum you typed in. In these tutorials, sometimes I show the results of what you have typed in. This is in the format shown below:

5 * 4
[1] 20

Also note that * is the symbol for multiplication here - the last command asked R to perform the calculation '5 times 4'. Other symbols are - for subtraction and / for division:

12 - 14
[1] -2
6/17
[1] 0.3529

R also has functions like square root, sine, cosine and so on. You can calculate these like this:

sqrt(25)
[1] 5
sqrt(2)
[1] 1.414

The examples above use the square root (sqrt) function. You can also combine these to make more complicated expressions:

sqrt(3 * 3 + 4 * 4)
[1] 5
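
As a further sketch of combining operators and functions (the particular expressions and variable names here are illustrative and not part of the original exercise): powers are written with ^, brackets control the order of evaluation, and the trigonometric functions take angles in radians.

```r
# Powers use the ^ symbol
two.to.ten <- 2^10
two.to.ten

# Brackets control the order of evaluation
half.sum <- (6 + 8) / 2
half.sum

# Trigonometric functions expect radians; pi is built in to R
sine.right.angle <- sin(pi / 2)
sine.right.angle
```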

You can also assign the results of calculations to variables, and use them in later calculations. You do this as below:

price <- 300

Here, the value 300 is stored in the variable price. The <- symbol means 'put the value on the right into the variable on the left'; it is typed with a < followed by a -. The variable can then be used in subsequent calculations. For example, to apply a 20% discount to this price, you could enter the following:

price - price * 0.2
[1] 240

or use intermediate variables:

discount <- price * 0.2

price - discount
[1] 240
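
The same idea extends to storing the discount rate itself in a variable, so that it can be changed in one place. This is only a small sketch; the names rate and sale.price are illustrative and do not appear elsewhere in the tutorial.

```r
price <- 300
rate <- 0.2                          # hypothetical discount rate (20%)
sale.price <- price - price * rate   # same calculation as above, using rate
sale.price

# Changing rate and repeating the calculation gives a new answer
rate <- 0.1
sale.price <- price - price * rate
sale.price
```

Note that assigning a new value to rate does not update sale.price automatically; the calculation has to be entered again.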

R can also work with lists of numbers, as well as individual ones (R itself calls such a list of numbers a vector). Lists are specified using the c function. Suppose you have a list of house prices displayed in an estate agent's, specified in thousands of pounds. You could store them in a variable called house.prices like this:

house.prices <- c(120, 150, 212, 99, 199, 299, 159)
house.prices
[1] 120 150 212  99 199 299 159

Note that there is no problem with full stops in the middle of variable names.

You can then apply functions to the lists. For example, to take the average of a list, enter:

mean(house.prices)
[1] 176.9

If the house prices are in thousands of pounds, then this tells us that the mean house price is 176.9 thousand pounds. Note here that on your display, the answer may be displayed to more significant digits, so you may have something like 176.8571 as the mean value.
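
Other summary functions work in the same way as mean. The sketch below uses the seven prices entered above; length, min, max and median are all standard R functions, and arithmetic on a list is applied to every value in it. The variable names on the left of each <- are illustrative.

```r
house.prices <- c(120, 150, 212, 99, 199, 299, 159)

n.houses <- length(house.prices)   # how many prices there are
cheapest <- min(house.prices)      # smallest value
dearest <- max(house.prices)       # largest value
middle <- median(house.prices)     # middle value when sorted

# Arithmetic applies to every element: convert thousands of pounds to pounds
house.prices.pounds <- house.prices * 1000
house.prices.pounds
```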

For the next exercise, try entering a larger data set (this time these are household burglaries per 10,000 households over a one month period) for 118 neighbourhoods in St. Helens:

burg.rates <- c(0, 7, 0, 0, 6, 19, 32, 0, 0, 0, 15, 6, 12, 8, 7, 6, 0, 0, 6, 
    0, 7, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 21, 7, 12, 7, 36, 18, 0, 0, 7, 6, 0, 
    0, 0, 0, 0, 13, 22, 0, 0, 0, 7, 12, 7, 5, 11, 0, 0, 13, 13, 0, 6, 15, 6, 
    17, 37, 0, 6, 6, 5, 24, 0, 0, 0, 0, 0, 0, 0, 5, 15, 0, 5, 6, 0, 0, 0, 13, 
    0, 6, 0, 0, 0, 23, 6, 13, 15, 6, 0, 0, 7, 7, 0, 0, 0, 0, 19, 13, 0, 0, 0, 
    6, 9, 0, 0, 0, 0, 0, 5)

Note that if an R expression has obviously not been completed and you hit return, then you can continue to type it on the following line. This carries on until R has worked out that the command is finished. In this case, that happens when the closing bracket ) has been typed, and return has been hit afterwards. If you now enter the variable name (burg.rates) you can see all of the values listed:

burg.rates
  [1]  0  7  0  0  6 19 32  0  0  0 15  6 12  8  7  6  0  0  6  0  7  0  0
 [24]  0  0  0  0  0 17  0  0 21  7 12  7 36 18  0  0  7  6  0  0  0  0  0
 [47] 13 22  0  0  0  7 12  7  5 11  0  0 13 13  0  6 15  6 17 37  0  6  6
 [70]  5 24  0  0  0  0  0  0  0  5 15  0  5  6  0  0  0 13  0  6  0  0  0
 [93] 23  6 13 15  6  0  0  7  7  0  0  0  0 19 13  0  0  0  6  9  0  0  0
[116]  0  0  5

In this example, there are 118 rates and the last command lists them - they are more neatly written out than when you entered them, and perhaps now you can see what the numbers in square brackets are used for. Essentially they show you the position in the list of the first number on each row - thus if a row begins with [24] this implies that the 24th number in the list is at the start of this printed row. The main idea is to allow you to find values further along the list more easily. If you want to find out more about the basic use of R, a helpful guide can be found here: http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf - in particular, pages up to and including 28 are very useful.
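
Positions in a list can also be used directly, by putting the position in square brackets after the variable name. The sketch below is illustrative only: it uses just the first 23 rates (the first printed row of the listing), stored under the made-up name first.row, and the standard which function to report the positions of values meeting a condition.

```r
# The first 23 burglary rates (the first printed row of the listing)
first.row <- c(0, 7, 0, 0, 6, 19, 32, 0, 0, 0, 15, 6, 12, 8, 7, 6, 0, 0, 6,
    0, 7, 0, 0)

# The value in position 7
seventh <- first.row[7]
seventh

# The positions of all rates greater than 15
high.positions <- which(first.row > 15)
high.positions
```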

Simple Graphics

It is also possible to draw graphics using the data you have put into the variables. The following draws a histogram of the burglary data:

hist(burg.rates)

By selecting the window it is possible to copy and paste the images into other documents, for example into word processing or presentation packages such as MS Word or PowerPoint, or their open source alternatives.

Generally speaking in R, commands tend to give very basic plots, unless further details are provided. Thus to get a histogram with red bars, enter:

hist(burg.rates, col = "red")

and to change the main title, xlab (i.e. x-label) and ylab (y-label) use:

hist(burg.rates, col = "red", main = "Burglaries per 10,000 households", xlab = "Rate", 
    ylab = "Frequency")
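
Another detail that can be supplied is the set of bin boundaries, through the breaks argument of hist. This is a minimal sketch: it uses only the first 23 rates to stay short, the name rates is made up, and the 5-unit bin width is an arbitrary choice for illustration.

```r
# The first 23 burglary rates only, to keep the example short
rates <- c(0, 7, 0, 0, 6, 19, 32, 0, 0, 0, 15, 6, 12, 8, 7, 6, 0, 0, 6,
    0, 7, 0, 0)

# Bins run from 0 to 40 in steps of 5
h <- hist(rates, breaks = seq(0, 40, by = 5), col = "red",
    main = "Burglaries per 10,000 households", xlab = "Rate", ylab = "Frequency")

# hist also returns what it has drawn, including the count in each bin
h$counts
```

Assigning the result of hist to a variable (h here) is optional; the plot is drawn either way.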

Now enter another related variable, the median house price (in 1000s of pounds) for a three-bedroom semi-detached house for each of the neighbourhoods. Note that this replaces the short list of seven prices stored in house.prices earlier.

house.prices <- c(200, 130, 200, 200, 180, 140, 65, 220, 180, 200, 210, 170, 
    180, 160, 180, 130, 240, 180, 170, 230, 150, 200, 200, 210, 220, 180, 200, 
    210, 150, 200, 230, 120, 180, 180, 190, 72, 80, 190, 220, 150, 200, 170, 
    170, 230, 200, 160, 140, 100, 140, 170, 180, 260, 170, 230, 190, 220, 140, 
    220, 120, 96, 210, 170, 180, 140, 150, 67, 200, 230, 140, 230, 83, 170, 
    200, 210, 240, 180, 200, 210, 250, 140, 130, 190, 110, 160, 150, 230, 160, 
    210, 200, 230, 210, 190, 120, 180, 87, 160, 190, 190, 230, 180, 110, 200, 
    250, 180, 200, 130, 180, 190, 190, 230, 210, 210, 150, 190, 210, 200, 210, 
    170)

As before, it is possible to draw a histogram of this variable:

hist(house.prices, col = "lightblue", main = "House Price", xlab = "1000s Pounds", 
    ylab = "Frequency")

and also to create a scatter plot of the two variables, to see how median house price relates to burglary rate:

plot(burg.rates, house.prices, main = "Burglary vs. House Price", xlab = "Burglaries (per 10,000 households)", 
    ylab = "Median House Price (1000s Pounds)")

This shows that there is a relationship between the two quantities, although there is still a fair amount of randomness as well. The points show there is a general tendency for house prices to fall as burglary rate increases, but that there are other factors affecting house price as well.
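
The strength of such a tendency can be put into a single number with the cor function, which gives a correlation between -1 and 1 (negative when one quantity tends to fall as the other rises). A fitted straight line can also be added to a scatter plot with lm and abline. The sketch below uses only the first ten neighbourhoods, under the made-up names burg and price, so its value will differ from that for the full data set.

```r
# First ten neighbourhoods only
burg <- c(0, 7, 0, 0, 6, 19, 32, 0, 0, 0)
price <- c(200, 130, 200, 200, 180, 140, 65, 220, 180, 200)

# Correlation between burglary rate and house price
r <- cor(burg, price)
r

# Scatter plot with a fitted straight line drawn over it
plot(burg, price, xlab = "Burglaries (per 10,000 households)",
    ylab = "Median House Price (1000s Pounds)")
abline(lm(price ~ burg), col = "red")
```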

The Data Frame

In the last section, you looked at two variables called house.prices and burg.rates and graphed their relationship. It is possible to combine the individual variables in a data frame - rather like an internal spreadsheet where all of the relevant data items are stored together as a set of columns. This is similar to the data set storage in SPSS (for those of you who have used that package) where each variable corresponds to a column and each case (or observation) corresponds to a row. However, while SPSS can only have one data set active at a time, in R you can have several of them.

To create a data frame containing the last two variables, enter:

hp.data <- data.frame(Burglary = burg.rates, Price = house.prices)

Then type in its name to list it:

hp.data

Just to explain what has happened here: the function data.frame takes all of the variables that you wish to have as columns. The Burglary=burg.rates part creates a column in the data frame called Burglary, containing the values in the variable burg.rates from the last section. Similarly, the data frame has a column called Price containing the values from house.prices. This new data frame is stored as a new object called hp.data (an object in R is similar to a variable, although it can be more complex - so it can contain more sophisticated things like data frames, not just a list of values). Typing in the name of the object (once it has been created) lists the values in the columns.
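
For a long data frame it is often more convenient to inspect it than to list all of it. The sketch below builds a five-row data frame (called small here, a made-up name) in the same way and uses the standard functions names, nrow, ncol, head and str.

```r
# A small data frame built from the first five neighbourhoods
small <- data.frame(Burglary = c(0, 7, 0, 0, 6), Price = c(200, 130, 200, 200, 180))

names(small)     # the column names
nrow(small)      # the number of rows (cases)
ncol(small)      # the number of columns (variables)
head(small, 3)   # just the first three rows
str(small)       # a compact description of each column
```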

You can also enter

fix(hp.data)

to view this data frame in a window, and also edit values in the 'cells'. It is worth trying fix now to see how it works, but don't actually edit anything. To return to the R command line, click on Quit in the data frame window. NOTE: it is important to do this, otherwise you won't be able to type anything in to R.

You can also describe each column in the data set using the summary function:

summary(hp.data)
    Burglary         Price    
 Min.   : 0.00   Min.   : 65  
 1st Qu.: 0.00   1st Qu.:152  
 Median : 0.00   Median :185  
 Mean   : 5.64   Mean   :179  
 3rd Qu.: 7.00   3rd Qu.:210  
 Max.   :37.00   Max.   :260  

For each column, a number of values are listed:

Item      Description
Min.      The smallest value in the column
1st Qu.   The first quartile (the value ¼ of the way along a sorted list of values)
Median    The median (the value ½ of the way along a sorted list of values)
Mean      The average of the column
3rd Qu.   The third quartile (the value ¾ of the way along a sorted list of values)
Max.      The largest value in the column

Between these numbers, an impression of the spread of values of each variable can be obtained. In particular it is possible to see that the median house price in St. Helens by neighbourhood ranges from 65,000 pounds to 260,000 pounds and that half of the prices lie between 152,500 pounds and 210,000 pounds. Also it can be seen that since the median measured burglary rate is zero, at least half of the areas had no burglaries in the month when counts were compiled.
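
The quartiles can also be computed on their own with the standard quantile function. As a sketch, here it is applied to the seven prices from the very first house price example (stored under the made-up name seven.prices); R interpolates between neighbouring values when a quartile falls between two of them.

```r
seven.prices <- c(120, 150, 212, 99, 199, 299, 159)

# First quartile, median and third quartile
q <- quantile(seven.prices, c(0.25, 0.5, 0.75))
q

# The distance between the first and third quartiles
spread <- IQR(seven.prices)
spread
```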

Getting Data into R

There are lots of methods for getting data from some external file into R. One way is to enter it in directly, as in the last section, but this isn't practical for large data sets. It is also possible to read data in from text files (this will be done later) and to read it in from a source on the internet. Here, the internet method is demonstrated. In this case the data set is one made publicly available in my public Dropbox folder. If you follow the web link “http://dl.dropbox.com/u/7013997/ENVS257/hpdata.csv” you can see it in its 'raw' form. If you click on the link and then on the open option, the data will open in a read-only spreadsheet. Having seen the data set, close the spreadsheet and return to the R window.

The file is called a csv file - short for Comma Separated Values - as in its 'raw' form it just consists of several lines of values separated by commas. Essentially this file contains the same data as the house price and burglary data you used earlier, although an extra column, with an ID number for each neighbourhood, is also included. You can read a csv file on the internet directly into a variable using the read.csv function in R. Here the contents of the file are read into a data frame object called hp.data2.

hp.data2 <- read.csv("http://dl.dropbox.com/u/7013997/ENVS257/hpdata.csv")

The function takes the URL of the file as an argument, and reads it in to the object on the left of the assignment (<-) symbol. As before, it is possible to list the data frame:

hp.data2
     ID Burglary Price
1    21        0   200
2    24        7   130
3    31        0   200
4    32        0   200
5    78        6   180
6    80       19   140
7    81       32    65
8    98        0   220
9   100        0   180
10  101        0   200
11  102       15   210
12  111        6   170
13  112       12   180
14  113        8   160
15  114        7   180
16  115        6   130
17  116        0   240
18  117        0   180
19  118        6   170
20  119        0   230
21  120        7   150
22  121        0   200
23  122        0   200
24  123        0   210
25  124        0   220
26  125        0   180
27  126        0   200
28  127        0   210
29  128       17   150
30  614        0   200
31  620        0   230
32  621       21   120
33  622        7   180
34  624       12   180
35  625        7   190
36  626       36    72
37  627       18    80
38  628        0   190
39  629        0   220
40  764        7   150
41  765        6   200
42  766        0   170
43  767        0   170
44  768        0   230
45  769        0   200
46  833        0   160
47  834       13   140
48  835       22   100
49  836        0   140
50  837        0   170
51  838        0   180
52  839        7   260
53  840       12   170
54  841        7   230
55  842        5   190
56  843       11   220
57  844        0   140
58  845        0   220
59  846       13   120
60  847       13    96
61  848        0   210
62  849        6   170
63  850       15   180
64  851        6   140
65  852       17   150
66  853       37    67
67  854        0   200
68  855        6   230
69  856        6   140
70  857        5   230
71  858       24    83
72  859        0   170
73  860        0   200
74  861        0   210
75  862        0   240
76  863        0   180
77  864        0   200
78  865        0   210
79  866        5   250
80  867       15   140
81  868        0   130
82  869        5   190
83  870        6   110
84  871        0   160
85  872        0   150
86  873        0   230
87  874       13   160
88  875        0   210
89  876        6   200
90  877        0   230
91  878        0   210
92  879        0   190
93  880       23   120
94  881        6   180
95  882       13    87
96  883       15   160
97  884        6   190
98  885        0   190
99  886        0   230
100 887        7   180
101 888        7   110
102 889        0   200
103 890        0   250
104 891        0   180
105 892        0   200
106 893       19   130
107 894       13   180
108 895        0   190
109 896        0   190
110 897        0   230
111 898        6   210
112 899        9   210
113 900        0   150
114 901        0   190
115 902        0   210
116 903        0   200
117 904        0   210
118 905        5   170

and also to summarise it:

summary(hp.data2)
       ID         Burglary         Price    
 Min.   : 21   Min.   : 0.00   Min.   : 65  
 1st Qu.:616   1st Qu.: 0.00   1st Qu.:152  
 Median :846   Median : 0.00   Median :185  
 Mean   :654   Mean   : 5.64   Mean   :179  
 3rd Qu.:876   3rd Qu.: 7.00   3rd Qu.:210  
 Max.   :905   Max.   :37.00   Max.   :260  
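
To see what read.csv does with raw comma-separated text, without needing an internet connection, the text can be supplied directly through the function's text argument. This is a sketch under an assumption: the header line and layout below are inferred from the listing of hp.data2, not copied from the actual file, and only the first five rows are included. The names raw and first.rows are made up.

```r
# Assumed raw form of the first few lines of the csv file
raw <- "ID,Burglary,Price
21,0,200
24,7,130
31,0,200
32,0,200
78,6,180"

# Read the text as if it were a file; the first line supplies the column names
first.rows <- read.csv(text = raw)
first.rows
summary(first.rows)
```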

Note that the Burglary and Price columns of this summary are the same as before - and also that the ID column does not really make sense. Since it is an ID number, it is really just an identifying code, and it doesn't really matter what its minimum value or mean value might be - you might as well compute an average student ID number. For this reason, it is useful to ask R to drop this column from hp.data2 when computing the summary. This is fairly simple. Firstly, note that you can use square brackets to pick out individual values in the data frame. For example, to select the value in the 15th row of column 2, enter

hp.data2[15, 2]
[1] 7

and you can look at the full data frame to check that this is indeed 7. Also, for the columns, you can replace the column number with its name, provided it is in quotes:

hp.data2[15, "Burglary"]
[1] 7

Also, instead of just specifying a single row, it is possible to specify a range of rows:

hp.data2[10:15, "Burglary"]
[1]  0 15  6 12  8  7

This lists the burglary rates for neighbourhoods 10-15 in the data set.

If you want to specify a full column (i.e. all of the burglary rates), just leave the part where you would write the row range empty:

hp.data2[ ,"Burglary"]
  [1]  0  7  0  0  6 19 32  0  0  0 15  6 12  8  7  6  0  0  6  0  7  0  0
 [24]  0  0  0  0  0 17  0  0 21  7 12  7 36 18  0  0  7  6  0  0  0  0  0
 [47] 13 22  0  0  0  7 12  7  5 11  0  0 13 13  0  6 15  6 17 37  0  6  6
 [70]  5 24  0  0  0  0  0  0  0  5 15  0  5  6  0  0  0 13  0  6  0  0  0
 [93] 23  6 13 15  6  0  0  7  7  0  0  0  0 19 13  0  0  0  6  9  0  0  0
[116]  0  0  5

You can use a similar approach to select out a row of the data:

hp.data2[12, ]
    ID Burglary Price
12 111        6   170

This gives the ID, Burglary and Price (i.e. house price) values for the 12th neighbourhood in the list.
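
Putting these pieces together, the ID column can be left out of a summary by selecting only the columns that are wanted. The sketch below uses a five-row stand-in for hp.data2 (called hp.small, a made-up name) so that it can be run on its own; the same square bracket expressions should work on the full data frame.

```r
# A five-row stand-in for hp.data2
hp.small <- data.frame(ID = c(21, 24, 31, 32, 78),
    Burglary = c(0, 7, 0, 0, 6),
    Price = c(200, 130, 200, 200, 180))

# Keep all rows, but only the named columns
no.id <- hp.small[, c("Burglary", "Price")]
summary(no.id)

# Alternatively, a negative column number drops that column
also.no.id <- hp.small[, -1]
```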

Another way of selecting out columns is to use the $ (dollar) approach. Entering

hp.data2$Price
  [1] 200 130 200 200 180 140  65 220 180 200 210 170 180 160 180 130 240
 [18] 180 170 230 150 200 200 210 220 180 200 210 150 200 230 120 180 180
 [35] 190  72  80 190 220 150 200 170 170 230 200 160 140 100 140 170 180
 [52] 260 170 230 190 220 140 220 120  96 210 170 180 140 150  67 200 230
 [69] 140 230  83 170 200 210 240 180 200 210 250 140 130 190 110 160 150
 [86] 230 160 210 200 230 210 190 120 180  87 160 190 190 230 180 110 200
[103] 250 180 200 130 180 190 190 230 210 210 150 190 210 200 210 170

extracts the column called Price. This (and indeed the earlier ways of extracting columns) can be used in graphics commands and elsewhere. For example, a box-plot of the prices can be obtained by

boxplot(hp.data2$Price, col = "lightgreen")

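The box-plot is a graphical equivalent of the summaries seen earlier - the end points of the central box are the first and third quartiles, and the line across the box marks the median. The longer lines (whiskers) extend towards the minimum and maximum values; by default, R draws any value lying more than 1.5 times the box length beyond the box as a separate point rather than extending the whisker to it.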
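
The numbers behind a box-plot can be inspected with the standard boxplot.stats function. The prices below are illustrative values chosen to show a point being drawn separately, not the full data set, and the names prices and bs are made up. Note that boxplot.stats places the box edges at 'hinges', which are close to, but not always exactly equal to, the quartiles reported by summary.

```r
# Illustrative prices in thousands of pounds (not the full data set)
prices <- c(65, 150, 170, 180, 190, 200, 210, 220, 260)

bs <- boxplot.stats(prices)
bs$stats   # lower whisker end, lower box edge, median, upper box edge, upper whisker end
bs$out     # values drawn as separate points beyond the whiskers

boxplot(prices, col = "lightgreen")
```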

Geographical Information

Until now, although this is geographical data, no maps have been drawn. In this section you will do this. Firstly, you need to load a new package into R - a package is an extra set of functionality that extends what R is capable of doing. Here, the new package is called maptools and it extends R by allowing it to draw maps, and handle geographical information. Before it can be used, the package has to be installed:

install.packages("maptools", dependencies = TRUE, lib = getwd())

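Packages are loaded via the library function. Use whichever of the two forms below applies to you:

# If maptools is in R's default library:
library(maptools)
# If you installed it into the working directory, as above:
library(maptools, lib.loc = getwd())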

However, this just makes R able to handle geographical data; it doesn't actually load any specific data sets. To do this, you'll need to obtain them from somewhere. Data for maps in R can be read in from shapefiles - these are a well known geographical information exchange format. Although the term 'shapefile' is used, in actuality a given data set consists of several files. For this reason, I will use the term 'shapefile set' to mean all of the files needed for a particular geographical data set. The first task is to obtain a shapefile set of the St. Helens neighbourhoods (or Lower Super Output Areas - LSOAs, as they are more formally called). To do this, you will need to put some information into the working directory you created earlier, which means telling R where that directory is. The exact command you need to type in will depend on the name you have given your working directory. For example, my directory is called 'R work' and is on the M: drive, so I type

setwd("M:/R work")

Your version might well have a longer name. Also note that the folders in the path are separated with a '/' not a '\'.
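
You can check which directory R is currently working in with getwd, and see what it contains with list.files. In the sketch below, tempdir() (a temporary folder that R always provides) stands in for a folder such as "M:/R work" so that the example runs anywhere; old.dir is a made-up name.

```r
old.dir <- getwd()    # remember where R is currently working

setwd(tempdir())      # tempdir() stands in for your own project directory
getwd()               # confirm the change
list.files()          # list the files in the new working directory

setwd(old.dir)        # go back to where you started
```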

Later on in the practical, you will need to use some other packages - the following line of R will install them all now:

install.packages(c("rgeos", "RColorBrewer", "maptools", "classInt"), dependencies = TRUE, 
    lib = getwd())

There is a set of shapefiles for the St. Helens neighbourhoods at the same location as the data set you read in earlier. Since several files are needed, I have bundled these together in a single zipfile. You will now download this to your local folder and subsequently unzip it. This can all be done via R functions:

download.file("http://dl.dropbox.com/u/7013997/ENVS257/sthel.zip", "sthel.zip")
unzip("sthel.zip")

The first function downloads the zip file into your working directory; the second one unzips it, creating the shapefile set. All of the files in the shapefile set begin with sthel but have different endings, e.g. sthel.shp, sthel.dbf and sthel.shx. Now, these can be read in to a SpatialPolygons object.

sthel <- readShapeSpatial("sthel")

The readShapeSpatial function does this, and stores them into another type of object called a SpatialPolygons object. Recall that geographical data can be of type Point, Line or Polygon and that polygons are areas or regions, such as the neighbourhoods (LSOAs) in St. Helens. You can use the plot function to draw the polygons (i.e. the map of the LSOAs).

plot(sthel)

A SpatialPolygonsDataFrame is the result of joining a SpatialPolygons object up with a data frame. The idea is that each polygon corresponds to one row of the data frame - i.e. the geographical description of the boundaries of each neighbourhood is paired with the data about that neighbourhood. Here a SpatialPolygonsDataFrame is created by joining the sthel SpatialPolygons object to the hp.data2 data frame object (with match.ID = FALSE, the rows are matched to the polygons in order). The result is assigned to a SpatialPolygonsDataFrame object called sthel.prices.

sthel.prices <- SpatialPolygonsDataFrame(sthel, hp.data2, match.ID = FALSE)

Finally, you will draw a choropleth map of these data. In later exercises more will be explained as to how this works, but for now the method will simply be demonstrated. One useful feature of R is that it is possible to write new functions, and therefore extend its capabilities. The following function (called choro) is a simple choropleth map drawing tool:

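choro <- function(spdframe, variable) {
    # Extract the column to be mapped from the attached data frame
    var <- spdframe@data[, variable]
    # Divide the values into 6 classes (classIntervals is from classInt)
    breaks <- classIntervals(var, n = 6, style = "fisher")$brk
    # A palette of 6 greens (brewer.pal is from RColorBrewer)
    my_colours <- brewer.pal(6, "Greens")
    # Shade each polygon according to the class its value falls in
    plot(spdframe, col = my_colours[findInterval(var, breaks, all.inside = TRUE)], 
        axes = FALSE, border = rgb(0.8, 0.8, 0.8))
    # Return the breaks and colours invisibly, for use in a legend
    invisible(list(b = breaks, c = my_colours))
}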

For now, just copy and paste this into R and hit return. Don't worry about trying to interpret it!
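
For the curious, the step that decides the shade of each area can be tried on its own with the standard findInterval function. This is only a sketch of that one step: the values, the class boundaries and the colour names below are made up for illustration, and stand in for the ones that classIntervals and brewer.pal would produce.

```r
# Hypothetical values to be mapped, and hypothetical class boundaries
var <- c(0, 5, 12, 37, 40)
breaks <- c(0, 5, 10, 20, 40)

# One colour per class (4 classes lie between 5 boundaries)
my_colours <- c("white", "lightgreen", "green", "darkgreen")

# For each value, find which class it falls in; all.inside = TRUE keeps
# a value equal to the top boundary in the last class
class.index <- findInterval(var, breaks, all.inside = TRUE)
class.index

# Look up the colour for each value
my_colours[class.index]
```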

Note that it is important to make sure the upper and lower case letters you type in are exactly the same as the ones in the choro function above, since R is case sensitive. The function relies on the classInt package, which was among those installed earlier. If a window came up asking where to install the package from, select one of the UK locations. Once it has installed, just enter

library(classInt, lib.loc = getwd())

to load it. Also, load a further package for providing map shading palettes called RColorBrewer.

library(RColorBrewer)

Now it is possible to draw the choropleth map. Here we draw one of the burglary rates.

breaks <- choro(sthel.prices, "Burglary")

The map shows the different burglary rates, with dark shading indicating a higher rate. However, this is much easier to interpret if you add a legend to the map. For now, just enter the following. As before, more about how all of the commands work will be explained later in the module.

legend(x = 357000, y = 392000, legend = leglabs(breaks$b), fill = breaks$c, 
    bty = "n")

You can also add a title to the map:

title("Burglary Rates per 10,000 Homes in St. Helens")

As a final useful technique, you can copy and paste maps like this into Word documents. Select the window with the map on it, and then right click on it. When the menu appears, select 'copy as bitmap'. If you also have MS Word up and running, you can then paste the map into your document.

Working with Open Government Data

The police.uk web site

In December of 2010 the police.uk web site was launched, providing details of geographical locations of crime in England and Wales. As well as providing an interface to allow people to view crimes in their (or other people's) neighbourhoods, this also provided a means for downloading the crime data. Using this, combined with other data, it is possible to obtain crime rate maps for anywhere in England and Wales, for a number of different classes of crime.

Mapping Crime Rates

Compare and Contrast Crime Rates

This section of the practical will examine the use of contemporary data about crime rates and population levels that can be merged to draw maps so that you can investigate hypotheses about geographical patterns of crime, for a number of different crime types, in the Merseyside region.

Looking at Crime Data

Firstly, you can use the police.uk web site to investigate local patterns in crime without making use of any other software. To do this, go to the web site http://www.police.uk. This will show you the start-up page for the website, which, as stated earlier, provides a portal to maps of crime rates in localities in England and Wales, and looks something like this: