This is a work-in-progress chapter.
This tutorial teaches the basics of point pattern analysis in R. It is influenced by the chapter on Spatial Point Pattern Analysis in Applied Spatial Data Analysis with R [@bivand_applied_2013] and an online tutorial on Point Pattern Analyis by Robert Hijmans.
We will use the sp package for this rather than the newer sf package, as point pattern analysis is more established for the former Spatial
class system than the latter’s sf
classes. We will also use raster as it has concise and well-designed functions for spatial data:
pkgs = c(
"sp",
"tmap",
"raster",
"mapview",
"dismo",
"gstat"
)
i = pkgs[!pkgs %in% installed.packages()]
if(length(i) > 0)
install.packages(i)
lapply(pkgs, library, character.only = TRUE)
## Loading required package: leaflet
## [[1]]
## [1] "sp" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## [[2]]
## [1] "tmap" "sp" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "raster" "tmap" "sp" "stats" "graphics"
## [6] "grDevices" "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "mapview" "leaflet" "raster" "tmap" "sp"
## [6] "stats" "graphics" "grDevices" "utils" "datasets"
## [11] "methods" "base"
##
## [[5]]
## [1] "dismo" "mapview" "leaflet" "raster" "tmap"
## [6] "sp" "stats" "graphics" "grDevices" "utils"
## [11] "datasets" "methods" "base"
##
## [[6]]
## [1] "gstat" "dismo" "mapview" "leaflet" "raster"
## [6] "tmap" "sp" "stats" "graphics" "grDevices"
## [11] "utils" "datasets" "methods" "base"
This tutorial assumes you have downloaded the relevant data onto your hard-disk, e.g. with the following commands:
if(!dir.exists("data")) {
dir.create("data")
f1 = "https://github.com/Robinlovelace/Creating-maps-in-R/raw/master/data/cycle_hire.geojson"
f2 = "https://github.com/Robinlovelace/Creating-maps-in-R/raw/master/data/cycle_hire-osm.geojson"
f3 = "https://github.com/Robinlovelace/Creating-maps-in-R/raw/master/data/lnd.geojson"
download.file(f1, "data/cycle_hire.geojson")
download.file(f2, "data/cycle_hire-osm.geojson")
download.file(f3, "data/lnd.geojson")
}
The input datasets, cunningly stored in the data/
directory, are on ‘Boris Bikes’ docking stations around London, and can be read-in and visualised with the following commands:
lnd = rgdal::readOGR("data/lnd.geojson")
## OGR data source with driver: GeoJSON
## Source: "data/lnd.geojson", layer: "OGRGeoJSON"
## with 33 features
## It has 7 fields
cycle_hire = rgdal::readOGR("data/cycle_hire.geojson")
## OGR data source with driver: GeoJSON
## Source: "data/cycle_hire.geojson", layer: "OGRGeoJSON"
## with 742 features
## It has 5 fields
cycle_hire_osm = rgdal::readOGR("data/cycle_hire-osm.geojson")
## OGR data source with driver: GeoJSON
## Source: "data/cycle_hire-osm.geojson", layer: "OGRGeoJSON"
## with 532 features
## It has 5 fields
library(sp) # to plot spatial data
plot(cycle_hire)
points(cycle_hire_osm, col = "red")
plot(lnd, add = TRUE)
It is immediately clear that the two datasets on cycle hire points are closely related (they have a high degree of spatial correlation) and have a distinctive pattern. cycle_hire
represents official data on cycle parking, and will be the main point dataset analysed. cycle_hire_osm
is the community contributed dataset on cycle hire locations, downloaded from OpenStreetMap. Both sets of points overlay some of London’s 33 boroughs, the central ones, and seem to follow the River Thames, especially along the north bank of the river. But how to describe that information quantitatively, and extrapolate the values from the plotted location to other areas?
It is the purpose of this tutorial to provide the know-how to answer such questions, that should be extensible to many applications that involve point data.
A basic statistic to compute on points within a polygon is the number of points per polygon, and the related statistic of point density. Let’s first compute that for London overall, before doing a zone-by-zone breakdown:
nrow(cycle_hire)
## [1] 742
lnd_area = sum(area(lnd)) / 1e6
The results show that there are 742 cycle hire points and that London covers an area of just over one and a half thousand square kilometres (1 km2 = 1000000 m2 = 1e6 m2 in scientific notation). That represents on average roughly one cycle parking rental space per 2 square kilometers, or half a rental point per square kilometer, as revealed by the results of the calculation below:
nrow(cycle_hire) / lnd_area
## [1] 0.4714415
This is not a good indicator of the density of the bike hire scheme overall, because they are so concentrated in central London. A more representative result can be found by calculating the average point density within the extent of the bike hire scheme. We can coerce the bounding box (or extent in raster terminology) of the stations into a polygon whose area can be measured with the following commands:
bb_hire = as(extent(cycle_hire), "SpatialPolygons")
crs(bb_hire) = crs(lnd)
c_area = area(bb_hire) / 1e6
crs()
calls?A useful level of analysis at which to analyse the geographical distribution of points is the zone-level. We can aggregate the points per zone and provide summary statistics. Starting with the number of points per polygon, this would calculated as follows:
crs(lnd) = crs(cycle_hire)
cycle_hire_ag = aggregate(cycle_hire["id"], lnd, FUN = "length")
Based on an analysis of the cycle_hire_ag
:
Find the average density of cycle hire points per zone in London.
A problem with the zonal representation of point density is that the results are dependent on the somewhat arbitrary shapes and sizes of the zones in which the points are aggregated. To overcome this problem we can create a raster representation of the points:
r = raster(bb_hire, ncol = 16, nrow = 10)
rc = rasterize(cycle_hire@coords, r, fun = "count")
plot(rc)
points(cycle_hire)
plot(lnd, add = TRUE)
This is already very useful. The results show that there are 5 clusters of cycle parking with much higher density than the surrounding areas. We can visualise these clusters using a static plot as follows:
plot(rc)
plot(rc > 12)
More useful, in terms of characterising the geographical characteristics of each cluster, would be to plot these 5 clusters interactively. Do this with mapview:
library(mapview)
mapview(rc > 12) +
mapview(cycle_hire)
The resulting interactive plot draws attention to the areas of high point density, such as the area surrounding Victoria station, illustrated below.
Another important characteristic of point patterns is the distances between points, which can be calculated using raster’s dist()
function:
d = spDists(cycle_hire, longlat = TRUE)
dm = as.matrix(d)
dm[1:3, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.000000 6.913473 1.966592 0.7700464 5.164896
## [2,] 6.913473 0.000000 8.205219 6.3051094 2.916743
## [3,] 1.966592 8.205219 0.000000 2.7062514 5.915129
The results show the distance, in km, form every point to every other. The dm
object is known as a distance matrix: note the diagonal of zero values. This distance matrix is very useful for various types of analysis, a couple of which we’ll explore below.
To find the minimum distance of each point to every other, we can use the apply
function, for each row, and then select the top 5:
diag(dm) = NA
dmin = apply(X = dm, MARGIN = 1, FUN = min, na.rm = TRUE)
sel_isolated = order(dmin, decreasing = TRUE)[1:5]
qtm(cycle_hire, col = "grey", main = "Isolated points") +
qtm(cycle_hire[sel_isolated,], symbols.col = "red", symbols.size = 2) + tm_scale_bar()
Another plot that is useful is that of the ‘G function’ for exploring the extent to which points cluster or separate compared with what would be expected from a random distribution [@bivand_applied_2013]:
distance = sort(unique(round(dmin, digits = 3)))
Gd = sapply(distance, function(x) sum(dmin < x))
Gd = Gd / length(dmin)
plot(distance, Gd)
Spatial interpolation refers to methods of estimating the value of something in one place, based on measurements taken elsewhere. It depends on spatial autocorrelation, defined in Waldo Tobler’s ‘first law of Geography’ as follows [@miller_toblers_2004]:
Everything is related to everything else, but near things are more related than distant thing
Building on the example of cycle hire points in London, we can ask the question: what is the expected number of bikes for a stand in location x, given knowledge of the existing data.
Thus spatial interpolation requires a dependent variable, which is summarised numerically and visually below:
summary(cycle_hire$nbikes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 3.0 11.0 12.2 19.0 51.0
tm_shape(cycle_hire) +
tm_symbols(col = "nbikes", palette = "YlOrRd", alpha = 0.6, style = "quantile")
There is a clear spatial pattern to this: there are more bikes parked in the outer docking stations. We can say that verbally, but how to we represent that on the map?
A first port of call would be to rasterise the result, using the the raster representation of the study area contained in the object r
to find the mean per cell:
rnbikes = rasterize(cycle_hire, r, field = "nbikes", fun = mean)
plot(rnbikes)
What about estimating the values of cells outside the current network area? We can use raster’s focal()
function to estimate that.
w = matrix(1, nc = 9, nr = 9)
r_interp1 = focal(x = rnbikes, w = w, fun = mean, NAonly = TRUE, na.rm = TRUE, pad = TRUE)
plot(r_interp1)
points(cycle_hire)
w
in the above code block. What difference does the size make?The raster cell method of spatial interpolation was fun, but not that sophisticated or spatially precise, with a resolution of around 1 km.
The next simplest solution is to break the area up into pieces and assign the value of the entire area to the value of the point it contains:
library(dismo)
v = voronoi(cycle_hire)
v = intersect(v, r)
## Warning in intersect(x, y): non identical CRS
tm_shape(v) +
tm_fill("nbikes", palette = "YlOrRd", style = "quantile") +
qtm(cycle_hire, symbols.size = 0.2)
gstat provides a number of functions for spatial prediction and interpolation using a range of models. The most basic of these, and a workhorse for spatial interpolation is Inverse Distance Weighting (IDW):
library(gstat)
gs = gstat(formula = nbikes~1, locations = cycle_hire)
crs(r) = crs(lnd)
r_idw = interpolate(r, gs)
## [inverse distance weighted interpolation]
plot(r_idw)
nbikes
variable?