After today you will know (1) how to download and install R and RStudio and (2) how to knit an R markdown file to generate and publish HTML and (3) some of the fundamentals of working with R in RStudio including a brief introduction to spatial data using ggplot2.
I designed this course for graduate students in geography and related disciplines. It's suitable for students with a knowledge of introductory inferential statistics through linear regression. The purpose of this course is to provide an understanding of statistical methods used to describe, analyze, and model spatial data. I assume that this is your first course in spatial data analysis. Focus is on application and away from statistical theory, but I will present some mathematical formalism at times. Topics will include exploratory spatial data analysis, analysis and models for lattice data, geostatistical data and point pattern data.
(1) To learn how to apply spatial statistical methods and models to geographical data, (2) to learn how to use R for analyzing geographical data, and (3) to learn how to interpret the results of a spatial data model. Much of the in-class time will be spent working through the R code.
You are responsible for working through the material (either in class or outside), handing in assignments on time, and preparing and presenting a final project to the class. It is your responsibility to locate the relevant data for your project. You need to choose a dataset for the term project no later than March 13.
Grades are determined by a term project worth 75 percent and a few homework problem sets collectively worth 25 percent.
Students are expected to uphold the Academic Honor Code published in The Florida State University Bulletin and the Student Handbook. The Academic Honor System of The Florida State University is based on the premise that each student has the responsibility (1) to uphold the highest standards of academic integrity in the student's own work, (2) to refuse to tolerate violations of academic integrity in the university community, and (3) to foster a high sense of integrity and social responsibility on the part of the university community.
Students with disabilities needing academic accommodation should: (1) register with and provide documentation to the Student Disability Resource Center; (2) bring a letter indicating the need for accommodation and what type. This should be done during the first week of classes.
This syllabus is a guide for the course and is subject to change with advanced notice.
Anselin, L., I. Syabri, and Y. Kho, 2004: GeoDa: An Introduction to Spatial Data Analysis, Spatial Analysis Laboratory, Department of Agricultural and Consumer Economics, University of Illinois, Urbana-Champaign.
Anselin, L., 2005: Exploring Spatial Data with GeoDa: A Workbook, Spatial Analysis Laboratory, Center for Spatially Integrated Social Science.
Anselin, L., 2005: Spatial Regression Analysis in R, Spatial Analysis Laboratory, Center for Spatially Integrated Social Science.
Baddeley, A., and R. Turner, 2005: spatstat: An R Package for Analyzing Spatial Point Patterns, Journal of Statistical Software, v12.
Cressie, N. A. C., 1993: Statistics for Spatial Data, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Inc., New York. A mathematical treatment of spatial data analysis.
Cressie, N. A. C., and C. K. Wikle, 2011: Statistics for Spatio-Temporal Data, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Inc., New York. A mathematical treatment of space-time statistics with an emphasis on Bayesian models.
Diggle, P. J., 2003: Statistical Analysis of Spatial Point Patterns, Second Edition, Arnold Publishers. An introduction to the concepts and methods of statistical analysis of spatial point patterns.
Fotherhingham, A. S., C. Brunsdon, and M. Charlton, 2000: Quantitative Geography: Perspectives on Spatial Data Analysis, SAGE Publications, London. A survey of spatial data analysis from the perspective of modern geography.
Haining, R., 2003: Spatial Data Analysis: Theory and Practice, Cambridge University Press. A confluence of geographic information science and applied spatial statistics.
Illian, J., A. Penttinen, H. Stoyan, and D. Stoyan, 2008: Statistical Analysis and Modeling of Spatial Point Patterns, Wiley Series in Statistics in Practice, John Wiley & Sons, Inc., New York. A mathematical treatment of spatial point processes.
Ripley, B. D., 1981: Spatial Statistics, Wiley, New York. A reference book on spatial data analysis with emphasis on point pattern analysis.
Wickham, H., 2009: ggplot2: Elegant Graphics for Data Analysis, Springer UserR! Series, Springer, New York. An introduction to the ggplot package for graphics.
Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct.
Papers in quantitative geography should describe the results and provide a clear enough protocol to allow repetition and extension.
Time is right. Tools are now available to create a scientific report that is automatically generated and that integrates text and code. Everything is dynamic, code gets executed at generation time and numbers are updated if data changes, etc.
Makes you more efficient at research. You can make adjustments based on reviewer's comments with ease. Labor saving.
Allows others to build on what you've done. Research moves forward.
Makes collaboration easier. Helps you build communities of like-minded researchers.
Code sharing is associated with research impact. Promotion & tenure.
No need to worry about a research audit.
R is a free software environment for statistical computing and graphics. It compiles and runs on Linux platforms, Windows and MacOS
The knitr package is an engine for dynamic report generation with R.
RStudio is an open-source integrated development environment (IDE) for R. You can run it on your desktop (Windows, Mac, or Linux) or over the web using RStudio Server.
R is an open-source statistical environment for data analysis and statistical computing. It is currently maintained by the R core-development team; an international group of volunteer developers.
Excellent graphing functions. Powerful and extensible syntax. Thousands of functions.
Limited graphical user interface (GUI) meaning it is harder to learn at the outset. No commercial support. It's a programming language so you need to appreciate syntax issues.
Common metaphors for working with computers are: browsers, iTunes, Excel, Word, etc. R is nothing like these. It is easy to forget commands. There are no visual cues: A blank screen is intimidating. R is not easy to learn.
But once you learn it you can take advantage of the availability of new, cutting edge applications in emerging statistical fields. For geographers this is particularly important for applied spatial statistics.
Also, an increasing number of scientists are reporting results in the context of R, and it's important to know what they are talking about. Most of the world's leading statisticians use and contribute to R.
R does GIS without the cost of a license manager.
R helps you create reproducible research.
You will use RStudio to create answer sheets for your homework and exams. You can use it to take notes during class.
install.packages("knitr")
Statistics is the analysis and modeling of data. The most useful command for quickly inputting small bits of data is the c() function. This function combines (concatenates) items together. As an example, consider a set of hypothetical annual land falling hurricane counts over a ten-year period.
2 3 0 3 1 0 0 1 2 1
To enter these into R, type or copy/paste into the R console.
counts = c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
counts
## [1] 2 3 0 3 1 0 0 1 2 1
With R Studio, the R console is the lower, left window (default).
Notice a few things. You assigned the values to an object called counts. The assignment operator is an equal sign (=). Values do not print. They are assigned to an object name. They are printed by typing the object name as we did on the second line. Finally, the values when printed are prefaced with a [1]. This indicates that the object is a vector and the first entry in the vector is a value of 2 (The number immediately to the right of [1]). More on this later.
You can save some typing by using the arrow keys to retrieve previous commands. Each command is stored in a history file and the up arrow key will move backwards through the history file and the down arrow forwards. The left and right arrow keys will work as expected.
Once the data are stored in a variable, you can use functions on them.
sum(counts) # total number of hurricanes making landfall
## [1] 13
length(counts) # length of data vector
## [1] 10
sum(counts)/length(counts) # avg no. of hurricanes
## [1] 1.3
Other functions include, sort(), min(), max(), range(), diff(), and cumsum(). Try these on the landfall counts. What does range() do? What does diff() do?
The mean of a set of numbers defined as: \[ \bar x = \frac{x_1 + x_2 + \cdots + x_n}{n} \] The mean function does this for you on your set of counts by typing
mean(counts)
## [1] 1.3
The count data is stored in R as a vector. This means that R keeps track of the order that the data were entered. There is a first element, a second element, and so on. This is good for several reasons.
Our data vector of counts has a natural order - year 1, year 2, etc. We don't want to mix these. We would like to be able to make changes to the data item by item instead of entering the entire data again. Vectors are math objects so that math operations can be performed on them.
Let's see how these concepts apply to our landfall counts. Suppose 'counts' contain the annual landfall count from the first decade of a longer record. We want to keep track of counts over other decades. This could be done by the following, example.
counts.d1 = counts # make a copy
counts.d2 = c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1) # enter a new set of counts
Note the two different object names. The period is only used as punctuation. You can't use an _ (underscore) to punctuate names as the underscore is an assignment operator.
Most functions operate on each element of the data vector at the same time.
counts.d1 + counts.d2
## [1] 2 8 4 5 4 0 3 4 4 2
The first year of the first decade is added from the first year of the second decade and so on.
What happens if you apply the c() function to these two vectors? Try it.
c(counts.d1, counts.d2)
If you are interested in each year's count as a difference from the decade mean, you type:
counts.d1 - mean(counts.d1)
## [1] 0.7 1.7 -1.3 1.7 -0.3 -1.3 -1.3 -0.3 0.7 -0.3
In this case a single number (the mean of the first decade) is subtracted from a vector. The result is from subtracting the number from each entry in the data vector. This is an example of data recycling. R repeats values from one vector so that the vector lengths match. Here the mean is repeated 10 times.
Suppose you are interested in the variance of the set of landfall counts. The formula is given by: \[ \hbox{var}(x) = \frac{(x_1 - \bar x)^2 + (x_2 - \bar x)^2 + \cdots + (x_n - \bar x)^2}{n-1} \]
Although the var() function compute this, here you see how you could do this directly using the vectorization of functions. The key is to find the squared differences and then add up the values.
The key is to find the squared differences and then add them up.
x = counts.d1
xbar = mean(x)
x - xbar
## [1] 0.7 1.7 -1.3 1.7 -0.3 -1.3 -1.3 -0.3 0.7 -0.3
(x - xbar)^2
## [1] 0.49 2.89 1.69 2.89 0.09 1.69 1.69 0.09 0.49 0.09
sum((x - xbar)^2)
## [1] 12.1
n = length(x)
n
## [1] 10
sum((x - xbar)^2)/(n - 1)
## [1] 1.344
To verify type
var(x)
## [1] 1.344
One restriction on vectors is that all the values must have the same type. This can be numeric, as in counts, character strings, as in
simpsons = c("Homer", "Marge", "Bart", "Lisa", "Maggie")
simpsons
## [1] "Homer" "Marge" "Bart" "Lisa" "Maggie"
Note that character strings are made with matching quotes, either double, “, or single, '.
If you mix types the values will be coerced into a common type, which is usually a character. Arithmetic operations do not work on characters.
Returning to the land falling hurricane counts.
counts.d1 = c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1) # counts from the first decade
counts.d2 = c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1) # counts from the second decade
Now suppose the U.S. National Hurricane Center (NHC) reanalyzes a storm, and that the 6th year of the 2nd decade is a 1 rather than a 0 for the number of landfalls. In this case you can type:
counts.d2[6] = 1 # assign the 6 year of the decade a value of 1 landfall
The assignment to the 6th entry in the vector counts.d2 is done by referencing the 6th entry of the vector with square brackets [].
It's important to keep this in mind: Parentheses () are used for functions and square brackets [] are used to extract values from vectors (and arrays, lists, etc).
counts.d2 #print out the values
## [1] 0 5 4 2 3 1 3 3 2 1
counts.d2[2] # print the number of landfalls during year 2 of the second decade
## [1] 5
counts.d2[4] # 4th year count
## [1] 2
counts.d2[-4] # all but the 4th year
## [1] 0 5 4 3 1 3 3 2 1
counts.d2[c(1, 3, 5, 7, 9)] # print the counts from the odd years
## [1] 0 4 3 3 2
Sometimes numbers have some structure or pattern. Take, for instance, the integers 1 through 99. To enter these into an R session one by one would be a waste of time.
The colon operator is used for creating simple sequences.
1:100
rev(1:100)
100:1
It is usually desirable to specify either the step size and the starting and ending points or the starting and ending points and the length of the sequence. The seq() function allows us to do this.
seq(1, 9, by = 2)
## [1] 1 3 5 7 9
seq(1, 10, by = 2)
## [1] 1 3 5 7 9
seq(1, 9, length = 5)
## [1] 1 3 5 7 9
When a vector of repeated values is desired, the rep() function is used. The simplest usage is to repeat the first argument a specified number of times.
rep(1, 10)
## [1] 1 1 1 1 1 1 1 1 1 1
rep(1:3, 3)
## [1] 1 2 3 1 2 3 1 2 3
More complicated patterns can be repeated by specifying pairs of equal-sized vectors. In this case, each element of the first vector is repeated the corresponding number of times specified by the element in the second vector.
rep(c("long", "short"), c(1, 2))
## [1] "long" "short" "short"
To find the maximum number of landfalls in the first decade, type:
max(counts.d1)
## [1] 3
Which years had the maximum?
counts.d1 == 3
## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Notice the double equals signs (==). This tests each value in counts.decade1 to see if it is equal to 3. The 2nd and 4th values are equal to 3 so TRUEs are returned. Think of this as asking R a question. Is the value equal to 3? R answers all at once with a vector of TRUE's and FALSE's.
Now the question is – how do you get the vector element corresponding to the TRUE values? That is, which years have 3 landfalls?
which(counts.d1 == 3)
## [1] 2 4
Note that the function which.max() can be used to get the first maximum.
which.max(counts.d1)
## [1] 2
You might also want to know the total number of landfalls in each decade and the number of years in a decade without a landfall. Or how about the ratio of the mean number of landfalls over the two decades.
sum(counts.d1)
## [1] 13
sum(counts.d2)
## [1] 24
Here you stack to function calls on a single line by using the semicolon ;.
sum(counts.d1 == 0)
## [1] 3
sum(counts.d2 == 0)
## [1] 1
mean(counts.d2)/mean(counts.d1)
## [1] 1.846
So there is 85% more landfalls during the second decade. Is this statistically significant?
Before moving on, it is good practice to clean up your working directory. This is done using the rm (remove) function.
rm(counts.d1, counts.d2)
Remove all objects.
rm(list = ls())
Your problem sets and term project will be done using knitr. For your problem sets you will email me the web link to your answers. The link will by a web page generated from RStudio and knitr.
Here are the steps.
Material is taken from http://spatial.ly/2013/12/introduction-spatial-data-ggplot2/
The data are in a zipped file called spatialggplot on my website. Download the file to your current working directory and unzip it.
download.file("http://myweb.fsu.edu/jelsner/data/spatialggplot.zip", "spatialggplot.zip",
mode = "wb")
unzip("spatialggplot.zip")
One of the most important steps in handling spatial data with R is the ability to read in shapefiles. There are a number of ways to do this. The most simple is readShapePoly() in the maptools package.
library(maptools) # load the package
## Error: there is no package called 'maptools'
sport = readShapePoly("london_sport.shp") # read in the shapefile
## Error: could not find function "readShapePoly"
This method works fine, but it is no longer considered best practice since it doesn't load in the spatial referencing information etc associated with the shapefile.
A more powerful way to read in geographical data is to use the rgdal function readOGR(), which automatically extracts this information. This is R's interface to the 'Geospatial Abstraction Library (GDAL)' which is used by other open source GIS packages such as QGIS and enables R to handle a broader range of spatial data formats.
library(rgdal)
## Error: there is no package called 'rgdal'
sport = readOGR(dsn = ".", "london_sport")
## Error: could not find function "readOGR"
In the code above dsn is an argument of the function readOGR(). R functions have a default order of functions, so dsn = does not actually need to be typed: readOGR(”.“, "london_sport”) works the same, but it is good to remember the meaning of each argument when beginning to use R, so we sometimes include argument names when it is relevant.
Here, dsn stands for 'data source name' which is the folder containing the spatial data — this was pre-specified when you set your working directory. The next argument is a character string, text identifying the file required. There is no need to add a file extension.
The file contains the borough population and the percentage of the population engaging in sporting activities.
All shapefiles have an attribute table. This is loaded with readOGR() and can be treated in a similar way to an R data frame.
The geometry of spatial data is hidden unless you print the object (using the print() method). Let's take a look at the headings of sport by typing
names(sport)
## Error: object 'sport' not found
head(sport@data)
## Error: object 'sport' not found
The data contained in spatial data are kept in a 'slot' that can be accessed using the @ symbol: sport@data. This is useful if you do not wish to work with the spatial components of the data at all times.
Spatial objects in R contain a variety of additional information. Use the summary() method to get some additional information about the data object.
summary(sport)
## Error: object 'sport' not found
In the above output proj4string represents the coordinate reference system (CRS) used on the data. In this file it has been incorrectly specified so we can change it with the following.
proj4string(sport) = CRS("+init=epsg:27700")
## Error: could not find function "CRS"
The warning says that you are changing the coordinate reference system, not reprojecting the data. Epsg:27700 is the code for British National Grid. If we wanted to reproject the data into something like WGS84 for latitude and longitude you could type
sport.wgs84 = spTransform(sport, CRS("+init=epsg:4326"))
## Error: could not find function "spTransform"
The different epsg codes are a bit of hassle to remember but you can find them all at http://spatialreference.org.
The ggplot2 package is an implementation of the Grammar of Graphics (Wilkinson 2005) - a general scheme for data visualization that breaks up graphs into semantic components such as scales and layers.
As we saw last semester the functions in ggplot2 can serve as a replacement for the base graphics in R and contains a number of default options that match good visualisation practice.
The maps we produce will not be that meaningful—the focus here is on sound visualisation with R and not sound analysis (this will come).
The ggplot2 package is one of the best documented. The full documentation for it can be found online and it is recommended you test out the examples on your own machines and play with them: http://docs.ggplot2.org/current/ .
Good examples of graphs can also be found on the website http://cookbook-r.com.
library(ggplot2)
## Error: there is no package called 'ggplot2'
The basic plot() function requires no data preparation but additional effort in color selection/adding the map key etc. ggplot() (from the ggplot2 package) require some additional steps to format the spatial data but select colors and add keys etc automatically.
As a first attempt with ggplot2 we can create a scatter plot with the attribute data in the sport object created above.
p = ggplot(sport@data, aes(Partic_Per, Pop_2001))
## Error: could not find function "ggplot"
This defines a ggplot object where you say where you want the input data to come from. sport@data is actually a data frame contained within the wider spatial object sport (the @ enables you to access the attribute table of the sport shapefile).
The characters inside the aes argument refer to the parts of that data frame you wish to use (the variables Partic_Per and Pop_2001). This has to happen within the brackets of aes(), which means, roughly speaking 'aesthetics that vary'.
If you just type p and hit enter you get the error No layers in plot. This is because you have not told ggplot what you want to do with the data. We do this by adding so-called 'geoms', in this case geom_point().
p + geom_point()
## Error: object 'p' not found
Within the parentheses you can alter the nature of the points. Try something like
p + geom_point(color = "red", size = 2)
## Error: object 'p' not found
Experiment.
If you want to scale the points by borough population and color them by sports participation this is also fairly easy by adding another aes() argument.
p + geom_point(aes(colour = Partic_Per, size = Pop_2001))
## Error: object 'p' not found
The real power of ggplot2 lies in its ability to add layers to a plot. In this case we can add text to the plot.
p + geom_point(aes(colour = Partic_Per, size = Pop_2001)) + geom_text(size = 2,
aes(label = name))
## Error: object 'p' not found
This idea of layers (or geoms) is different from the standard plot functions in R, but you will find that each of the functions does a lot of clever stuff to make plotting much easier (see the documentation for a full list).
The following steps will create a map to show the percentage of the population in each London Borough who regularly participate in sports activities.
To get the shapefiles into a format that can be plotted we have to use the fortify() function. Spatial objects have a number of slots containing the various items of data (polygon geometry, projection, attribute information) associated with a shapefile.
Slots can be thought of as shelves within the data object that contain the different kinds of information. The 'polygons' slot contains the geometry of the polygons in the form of the XY coordinates used to draw the polygon outline.
We need to extract them as a data frame. The fortify() function was written specifically for this purpose. For this to work, either gpclib or rgeos packages must be installed.
sport.f = fortify(sport, region = "ons_label")
## Error: could not find function "fortify"
This step creates a data frame out of the spatial information.
We add the attribute information back using the merge() function (this performs a data join).
sport.f = merge(sport.f, sport@data, by.x = "id", by.y = "ons_label")
## Error: object 'sport.f' not found
head(sport.f[, 1:8])
## Error: object 'sport.f' not found
You should see a large data frame containing the latitude and longitude (they are actually Easting and Northing as the data are in British National Grid format) coordinates alongside the attribute information associated with each London Borough.
It is now straightforward to produce a map using all the built in tools (such as setting the breaks in the data) that ggplot2 has to offer.
Map = ggplot(sport.f, aes(long, lat, group = group, fill = Partic_Per)) + geom_polygon() +
coord_equal() + labs(x = "Easting (m)", y = "Northing (m)", fill = "% Sport Partic.") +
ggtitle("London Sports Participation")
## Error: could not find function "ggplot"
Map
## function (f, ...)
## {
## f <- match.fun(f)
## mapply(FUN = f, ..., SIMPLIFY = FALSE)
## }
## <bytecode: 0x0000000007a3a640>
## <environment: namespace:base>
The group=group points ggplot() to the group column added by fortify() and it identifies the groups of coordinates that pertain to individual polygons (in this case London Boroughs).
The default colors are nice but we may wish to produce the map in black and white.
Map + scale_fill_gradient(low = "white", high = "black")
## Error: could not find function "scale_fill_gradient"
Use ggsave() to save the plot to a file outside of R (e.g. ggsave(“my_map.pdf”) will save the map as a pdf, with default settings. For a larger map, you could try the following:
ggsave("my_large_plot.png", scale = 1, dpi = 400)
## Error: could not find function "ggsave"
ggmap is a package that uses the ggplot2 syntax as a template to create maps with image tiles from the likes of Google and OpenStreetMap.
library(ggmap) # you may have to use install.packages to install it first
## Error: there is no package called 'ggmap'
The sport object (SpatialPolygonsDataFrame) is in British National Grid but the ggmap image tiles are in WGS84. We therefore need to use the sport.wgs84 object created in the reprojection operation earlier.
The first job is to calculate the bounding box (bb for short) of the sport.wgs84 object to identify the geographic extent of the image tiles that we need.
b = bbox(sport.wgs84)
## Error: could not find function "bbox"
b[1, ] = (b[1, ] - mean(b[1, ])) * 1.05 + mean(b[1, ])
## Error: object 'b' not found
b[2, ] = (b[2, ] - mean(b[2, ])) * 1.05 + mean(b[2, ])
## Error: object 'b' not found
# scale longitude and latitude (increase bb by 5% for plot) replace 1.05
# with 1.xx for an xx% increase in the plot size
This is then fed into the get_map() function as the location argument. The syntax below contains 2 functions. ggmap is required to produce the plot and provides the base map data.
lnd.b1 = ggmap(get_map(location = b))
## Error: could not find function "ggmap"
In much the same way as we did above we can then layer the plot with different geoms.
First fortify the sport.wgs84 object and then merge with the required attribute data (we already did this step to create the sport.f object).
sport.wgs84.f = fortify(sport.wgs84, region = "ons_label")
## Error: could not find function "fortify"
sport.wgs84.f = merge(sport.wgs84.f, sport.wgs84@data, by.x = "id", by.y = "ons_label")
## Error: object 'sport.wgs84.f' not found
We can now overlay this on our base map.
lnd.b1 + geom_polygon(data = sport.wgs84.f, aes(x = long, y = lat, group = group,
fill = Partic_Per), alpha = 0.5)
## Error: object 'lnd.b1' not found
The code above contains a lot of parameters. Use the ggplot2 help pages to find out what they are. The resulting map looks okay, but it would be improved with a simpler base map in black and white. A design firm called stamen provide the tiles we need and they can be brought into the plot with the get_map function:
lnd.b2 = ggmap(get_map(location = b, source = "stamen", maptype = "toner", crop = TRUE))
## Error: could not find function "ggmap"
We can then produce the plot as before.
lnd.b2 + geom_polygon(data = sport.wgs84.f, aes(x = long, y = lat, group = group,
fill = Partic_Per), alpha = 0.5)
## Error: object 'lnd.b2' not found
Finally, if we want to increase the detail of the base map, get_map has a zoom parameter.
lnd.b3 = ggmap(get_map(location = b, source = "stamen", maptype = "toner", crop = TRUE,
zoom = 11))
## Error: could not find function "ggmap"
lnd.b3 + geom_polygon(data = sport.wgs84.f, aes(x = long, y = lat, group = group,
fill = Partic_Per), alpha = 0.5)
## Error: object 'lnd.b3' not found
This section builds on the previous information on plotting and highlights more advanced spatial functions from the rgeos package. We look at joining new datasets to our data - an attribute join - spatial joins, whereby data is added to the target layer depending on the location of the origins and clipping.
To reaffirm our starting point, let's re-plot the only spatial dataset in our workspace, and count the number of polygons.
library(rgdal)
## Error: there is no package called 'rgdal'
lnd = readOGR(dsn = ".", "london_sport")
## Error: could not find function "readOGR"
plot(lnd)
## Error: object 'lnd' not found
nrow(lnd)
## Error: object 'lnd' not found
Because we are using borough-level data, and boroughs are official administrative zones, there is more data available at this level. We will use the example of crime data to illustrate this data availability, and join this with the current spatial dataset. As before, we can download and import the data from within R:
download.file("http://data.london.gov.uk/datafiles/crime-community-safety/mps-recordedcrime-borough.csv",
destfile = "mps-recordedcrime-borough.csv")
crimeDat = read.csv("mps-recordedcrime-borough.csv", fileEncoding = "UCS-2LE")
head(crimeDat)
## Month MajorText MinorText
## 1 201104 Violence Against The Person Common Assault
## 2 201104 Burglary Burglary In A Dwelling
## 3 201104 Other Notifiable Offences Other Notifiable
## 4 201104 Robbery Personal Property
## 5 201104 Theft & Handling Handling Stolen Goods
## 6 201104 Theft & Handling Theft/Taking Of Pedal Cycle
## CrimeCount Spatial_DistrictName
## 1 81 Kensington and Chelsea
## 2 78 Kensington and Chelsea
## 3 12 Kensington and Chelsea
## 4 41 Kensington and Chelsea
## 5 3 Kensington and Chelsea
## 6 59 Kensington and Chelsea
summary(crimeDat$MajorText)
## Burglary Criminal Damage
## 1543 3125
## Drugs Fraud & Forgery
## 1925 1581
## Other Notifiable Offences Robbery
## 1374 1517
## Sexual Offences Theft & Handling
## 1557 6264
## Violence Against The Person
## 4885
Initially, the read.csv command flags an error: open the raw .csv file in a text editor such as Notepad to find the problem and correct it.
Alternatively, you can work out what the file encoding is and use the correct argument (this is not recommended - simpler just to edit the text file in most cases).
crimeTheft = subset(crimeDat, MajorText == "Theft & Handling")
head(crimeTheft, 2)
## Month MajorText MinorText CrimeCount
## 5 201104 Theft & Handling Handling Stolen Goods 3
## 6 201104 Theft & Handling Theft/Taking Of Pedal Cycle 59
## Spatial_DistrictName
## 5 Kensington and Chelsea
## 6 Kensington and Chelsea
Total the crimes
crimeAg = aggregate(CrimeCount ~ Spatial_DistrictName, FUN = "sum", data = crimeTheft)
head(crimeAg, 2)
## Spatial_DistrictName CrimeCount
## 1 Barking and Dagenham 12222
## 2 Barnet 19821
Now that we have crime data at the borough level, the challenge is to join it by name. This is not always straightforward. Let us see which names in the crime data match the spatial data:
lnd$name %in% crimeAg$Spatial_DistrictName
## Error: object 'lnd' not found
lnd$name[which(!lnd$name %in% crimeAg$Spatial_DistrictName)]
## Error: object 'lnd' not found
The first line of code above shows that all but one of the borough names matches; the second tells us that it is City of London that is named differently in the crime data.
levels(crimeAg$Spatial_DistrictName)
## [1] "Barking and Dagenham" "Barnet"
## [3] "Bexley" "Brent"
## [5] "Bromley" "Camden"
## [7] "Croydon" "Ealing"
## [9] "Enfield" "Greenwich"
## [11] "Hackney" "Hammersmith and Fulham"
## [13] "Haringey" "Harrow"
## [15] "Havering" "Hillingdon"
## [17] "Hounslow" "Islington"
## [19] "Kensington and Chelsea" "Kingston upon Thames"
## [21] "Lambeth" "Lewisham"
## [23] "Merton" "Newham"
## [25] "NULL" "Redbridge"
## [27] "Richmond upon Thames" "Southwark"
## [29] "Sutton" "Tower Hamlets"
## [31] "Waltham Forest" "Wandsworth"
## [33] "Westminster"
levels(crimeAg$Spatial_DistrictName)[25] = as.character(lnd$name[which(!lnd$name %in%
crimeAg$Spatial_DistrictName)])
## Error: object 'lnd' not found
lnd$name %in% crimeAg$Spatial_DistrictName
## Error: object 'lnd' not found
Now all names match. The above code first identified the row with the faulty name and then renamed the level to match the lnd dataset. Note that we could not rename the variable directly, as it is stored as a factor.
We are now ready to join the datasets. It is recommended to use the join() function in the plyr package but the merge function could equally be used.
library(plyr)
## Error: there is no package called 'plyr'
head(lnd$name)
## Error: object 'lnd' not found
head(crimeAg$Spatial_DistrictName) # the variables to join
## [1] Barking and Dagenham Barnet Bexley
## [4] Brent Bromley Camden
## 33 Levels: Barking and Dagenham Barnet Bexley Brent Bromley ... Westminster
crimeAg = rename(crimeAg, replace = c(Spatial_DistrictName = "name"))
## Error: could not find function "rename"
head(join(lnd@data, crimeAg)) #test to see if it works
## Error: could not find function "join"
lnd@data = join(lnd@data, crimeAg)
## Error: could not find function "join"
Plot crime counts
spplot(lnd, "CrimeCount")
## Error: could not find function "spplot"
lnd.f = fortify(lnd, region = "ons_label")
## Error: could not find function "fortify"
lnd.f = merge(lnd.f, lnd@data, by.x = "id", by.y = "ons_label")
## Error: object 'lnd.f' not found
Map = ggplot(lnd.f, aes(long, lat, group = group, fill = CrimeCount)) + geom_polygon() +
coord_equal() + labs(x = "Easting (m)", y = "Northing (m)", fill = "Number of Thefts")
## Error: could not find function "ggplot"
In addition to joining by zone name, it is also possible to do spatial joins. There are three varieties: many-to-one –where the values of many intersecting objects contribute to a new variable in the main table - one-to-many, or one-to-one. Because boroughs in London are quite large, we will conduct a many-to-one spatial join.
We will be using Tube Stations as the spatial data to join, with the aim of finding out which and how many stations are found in each London borough.
download.file("http://www.personal.leeds.ac.uk/~georl/egs/lnd-stns.zip", "lnd-stns.zip")
unzip("lnd-stns.zip")
library(rgdal)
## Error: there is no package called 'rgdal'
stations = readOGR(dsn = ".", layer = "lnd-stns", p4s = "+init=epsg:27700")
## Error: could not find function "readOGR"
proj4string(stations) # this is the full geographical detail.
## Error: could not find function "proj4string"
proj4string(lnd)
## Error: could not find function "proj4string"
bbox(stations)
## Error: could not find function "bbox"
bbox(lnd)
## Error: could not find function "bbox"
The above code loads the data, but shows that there are problems with it: the Coordinate Reference System (CRS) differs from that of our shapefile. Although OSGB 1936 (or EPSG 27700) is the 'correct' CRS for the UK, we will convert the stations dataset into lat-long coordinates, as this is a more common CRS and enables easy base map creation.
stations = spTransform(stations, CRSobj = CRS(proj4string(lnd)))
## Error: could not find function "spTransform"
plot(lnd)
## Error: object 'lnd' not found
points(stations[sample(1:nrow(stations), size = 500), ])
## Error: object 'stations' not found
Now we can clearly see that the stations overlay the boroughs. The problem is that the stations dataset is far more extensive than London borough dataset; we need clipping.
There are a number of functions that we can use to clip the points so that only those falling within London boroughs are retained.
We will use gIntersects(), for clipping although we could equally use gContains(), gWithin() and other g… functions - see rgeos help pages by typing ?gOverlaps or other functions for more. gIntersects will output information for each point, telling us which polygon it interacts with (i.e. the polygon it is in).
int = gIntersects(stations, lnd, byid = TRUE) # find which stations intersect
## Error: could not find function "gIntersects"
class(int) # it's outputed a matrix
## Error: object 'int' not found
dim(int) # with 33 rows (one for each zone) and 2532 cols (the points)
## Error: object 'int' not found
For the purposes of clipping, we are only interested in whether the point intersects with any of the boroughs. This is where the function apply() comes into play.
clipped = apply(int == FALSE, MARGIN = 2, all)
## Error: object 'int' not found
plot(stations[which(clipped), ]) # shows all stations we DO NOT want
## Error: object 'stations' not found
stations.cl = stations[which(!clipped), ] # use ! to select the invers
## Error: object 'stations' not found
points(stations.cl, col = "green") # check that it's worked
## Error: object 'stations.cl' not found
The first line looks at each column (MARGIN = 2, we would use MARGIN = 1 for row-by-row analysis) and report back whether all of the values are false. This creates the inverse selection that we want, hence the use of ! to invert it.
We test that the function works on a new object (often a good idea, to avoid overwriting useful data) with plots.