The following is taken from a vignette in the gstat package. gstat contains “geostatistical” functions, which can perform spatial (or, spatio-temporal) interpolations. This is particularly useful when you have data for certain points in a large area, and want some idea of how the values for that data being measured, might vary elsewhere.
In having tried to understand how work with spatial and spatio-temporal (hereafter denoted “S/T”) data in R, I’ve realized there’s a better way. First, for those looking to start, since gstat has functions that deal with spatial and S/T data, understanding how those structures work would probably be very useful when you start dealing with this package.
At least for me, there were certain aspects of this tutorial which weren’t obvious when I read through it, and in the end, took a while before I felt like I had a comfortable sense of what was happening. So, I thought I could elaborate on parts which weren’t immediately obvious to me. Along the way, to help myself, I thought to code it how I’ve become used to. Hence, the plots are all re-done in ggplot2 (with the originals alongside for comparison), and the pipe operator %>%
can really help clarify steps in involved operations.
library(sp)
library(tidyr)
library(ggplot2)
library(magrittr, warn.conflicts = FALSE)
Basically, the Meuse dataset contains measurements for concentrations of different elements, over an area in the Netherlands. Run ?meuse for more info.
data(meuse)
class(meuse)
## [1] "data.frame"
str(meuse)
## 'data.frame': 155 obs. of 14 variables:
## $ x : num 181072 181025 181165 181298 181307 ...
## $ y : num 333611 333558 333537 333484 333330 ...
## $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## $ copper : num 85 81 68 81 48 61 31 29 37 24 ...
## $ lead : num 299 277 199 116 117 137 132 150 133 80 ...
## $ zinc : num 1022 1141 640 257 269 ...
## $ elev : num 7.91 6.98 7.8 7.66 7.48 ...
## $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
If you’ll notice, there are two columns x
and y
in the dataset. These are the coordinates of the location which that row of the dataframe corresponds to (if the numbers seem large, don’t worry, it’s just Rijksdriehoek (RDH) coordinates, used in the Netherlands.)
Now, while it’s easy to include spatial information, like coordinates, just in columns of a dataframe, if someone else looks at the data, it might not be obvious what all the variables in the columns represent. So, one way to deal with this is to more concretely identify the observations with the corresponding location. Here, the result is going to be a SpatialPointsDataFrame
. This assignment can be done like normally, except that the formula notation is used:
coordinates(meuse) <- ~ x + y
class(meuse)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
Once that assignment is performed, then the class of the object changes accordingly.
Behind the scenes, this new Spatial*DataFrame is actually an S4 object (i.e., that’s the system of OO-programming used for it in R) :
str(meuse)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
## ..@ data :'data.frame': 155 obs. of 12 variables:
## .. ..$ cadmium: num [1:155] 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## .. ..$ copper : num [1:155] 85 81 68 81 48 61 31 29 37 24 ...
## .. ..$ lead : num [1:155] 299 277 199 116 117 137 132 150 133 80 ...
## .. ..$ zinc : num [1:155] 1022 1141 640 257 269 ...
## .. ..$ elev : num [1:155] 7.91 6.98 7.8 7.66 7.48 ...
## .. ..$ dist : num [1:155] 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## .. ..$ om : num [1:155] 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## .. ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## .. ..$ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## .. ..$ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## .. ..$ dist.m : num [1:155] 50 30 150 270 380 470 240 120 240 420 ...
## ..@ coords.nrs : int [1:2] 1 2
## ..@ coords : num [1:155, 1:2] 181072 181025 181165 181298 181307 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:155] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "x" "y"
## ..@ bbox : num [1:2, 1:2] 178605 329714 181390 333611
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:2] "x" "y"
## .. .. ..$ : chr [1:2] "min" "max"
## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
## .. .. ..@ projargs: chr NA
While accessing elements of such an objects “slots” is discouraged, the sp package comes with useful helper functions that can be used, such as bbox and coordinates. The summary function also has slightly different output:
meuse %>% coordinates %>% head
## x y
## 1 181072 333611
## 2 181025 333558
## 3 181165 333537
## 4 181298 333484
## 5 181307 333330
## 6 181390 333260
meuse %>% bbox
## min max
## x 178605 181390
## y 329714 333611
summary(meuse)
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## x 178605 181390
## y 329714 333611
## Is projected: NA
## proj4string : [NA]
## Number of points: 155
## Data attributes:
## cadmium copper lead zinc
## Min. : 0.200 Min. : 14.00 Min. : 37.0 Min. : 113.0
## 1st Qu.: 0.800 1st Qu.: 23.00 1st Qu.: 72.5 1st Qu.: 198.0
## Median : 2.100 Median : 31.00 Median :123.0 Median : 326.0
## Mean : 3.246 Mean : 40.32 Mean :153.4 Mean : 469.7
## 3rd Qu.: 3.850 3rd Qu.: 49.50 3rd Qu.:207.0 3rd Qu.: 674.5
## Max. :18.100 Max. :128.00 Max. :654.0 Max. :1839.0
##
## elev dist om ffreq soil lime
## Min. : 5.180 Min. :0.00000 Min. : 1.000 1:84 1:97 0:111
## 1st Qu.: 7.546 1st Qu.:0.07569 1st Qu.: 5.300 2:48 2:46 1: 44
## Median : 8.180 Median :0.21184 Median : 6.900 3:23 3:12
## Mean : 8.165 Mean :0.24002 Mean : 7.478
## 3rd Qu.: 8.955 3rd Qu.:0.36407 3rd Qu.: 9.000
## Max. :10.520 Max. :0.88039 Max. :17.000
## NA's :2
## landuse dist.m
## W :50 Min. : 10.0
## Ah :39 1st Qu.: 80.0
## Am :22 Median : 270.0
## Fw :10 Mean : 290.3
## Ab : 8 3rd Qu.: 450.0
## (Other):25 Max. :1000.0
## NA's : 1
Typically, to just access the data, I coerce the object to a dataframe with as.data.frame, which can be a lot quicker and cleaner than alternatives:
meuse_df <- cbind( attr(meuse, "data"), meuse@coords) # just coerce to df
The sp package comes with special built-in graphing functions, such as bubble
:
# bubble chart
bubble(meuse, "zinc", col = c("#00ff0088", "#00ff0088"),
main="zinc concentrations (ppm)")
The equivalent can also be made using ggplot
:
# I think the blue stands out better against the white background
meuse %>% as.data.frame %>%
ggplot(aes(x, y)) + geom_point(aes(size=zinc), color="blue", alpha=3/4) +
ggtitle("Zinc Concentration (ppm)") + coord_equal() + theme_bw()
## Project the data from Rijksdriehoek (RDH) (Netherlands topographical) map
## coordinates to Google Map coordinates; RDH coordinates have an EPSG code of
## 28992 and Google map coordinates have an EPSG code of 3857
# But from the documentation of proj4string: Note that only “+proj=longlat” is
# accepted for geographical coordinates, which must be ordered (eastings,
# northings). So use sp
# plan: convert rdh to longlat, then assign longlat, then transform to rdh
# TODO: incorporate this into a post on using ggmap with spatial data.
library(rgdal)
## rgdal: version: 1.0-7, (SVN revision 559)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.11.3, released 2015/09/16
## Path to GDAL shared files: /usr/local/Cellar/gdal/1.11.3_1/share/gdal
## Loaded PROJ.4 runtime: Rel. 4.9.2, 08 September 2015, [PJ_VERSION: 492]
## Path to PROJ.4 shared files: (autodetected)
## Linking to sp version: 1.1-1
ESPG <- make_EPSG()
ESPG[which(ESPG$code == 28992), ]
## code note
## 4083 28992 # Amersfoort / RD New
## prj4
## 4083 +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957388243134,0.343987817378283,-1.87740163998045,4.0725 +units=m +no_defs
rdh_proj <- ESPG[which(ESPG$code == 28992), "prj4"]
#proj4string(meuse) = "+proj=longlat +datum=WGS84"
Along with the meuse
dataset is one called meuse.grid
. Later on in the interpolation, it’s used as locations to predict concentrations for. At first, it’s just a regular dataframe like meuse was:
data(meuse.grid)
summary(meuse.grid)
## x y part.a part.b
## Min. :178460 Min. :329620 Min. :0.0000 Min. :0.0000
## 1st Qu.:179420 1st Qu.:330460 1st Qu.:0.0000 1st Qu.:0.0000
## Median :179980 Median :331220 Median :0.0000 Median :1.0000
## Mean :179985 Mean :331348 Mean :0.3986 Mean :0.6014
## 3rd Qu.:180580 3rd Qu.:332140 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :181540 Max. :333740 Max. :1.0000 Max. :1.0000
## dist soil ffreq
## Min. :0.0000 1:1665 1: 779
## 1st Qu.:0.1193 2:1084 2:1335
## Median :0.2715 3: 354 3: 989
## Mean :0.2971
## 3rd Qu.:0.4402
## Max. :0.9926
meuse.grid %>% str
## 'data.frame': 3103 obs. of 7 variables:
## $ x : num 181180 181140 181180 181220 181100 ...
## $ y : num 333740 333700 333700 333700 333660 ...
## $ part.a: num 1 1 1 1 1 1 1 1 1 1 ...
## $ part.b: num 0 0 0 0 0 0 0 0 0 0 ...
## $ dist : num 0 0 0.0122 0.0435 0 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
meuse.grid %>% class
## [1] "data.frame"
To better see the gridded nature of the data, we can just plot it:
# this is clearly gridded over the region of interest
meuse.grid %>% as.data.frame %>%
ggplot(aes(x, y)) + geom_point(size=1) + coord_equal()
# to compare, recall the bubble plot above; those points were what there were
# values for. this is much more sparse
meuse %>% as.data.frame %>%
ggplot(aes(x, y)) + geom_point(size=1) + coord_equal()
These two plots pretty much summarize our interpolation problem: given values at the locations in the latter plot, we want to interpolate over all values in the former plot.
And just as before, we specify that the x
and y
columns are actually coordinates for the observations. Here, though, we can also manually specify that meuse.grid actually cotains a grid of points. Although this might not appear to change anything if you only inspect the class, the attributes of the object do change (to see that, just check the attributes before and after identifying it as gridded).
coordinates(meuse.grid) = ~x+y
gridded(meuse.grid) = TRUE
meuse.grid %>% class
## [1] "SpatialPixelsDataFrame"
## attr(,"package")
## [1] "sp"
More plotting
image(meuse.grid["dist"])
title("distance to river (red=0)")
# ggplot version
meuse.grid %>% as.data.frame %>%
ggplot(aes(x, y)) + geom_tile(aes(fill=dist)) +
scale_fill_gradient(low = "red", high="yellow") + coord_equal() + theme_bw() +
ggtitle("Distance to River")
Alternatively, instead of using tiles, one could go Seurat-style and call geom_point
with small size (but, just note that scale_color_gradient
goes with points, and scale_fill_gradient
with tiles).
To recap up to this point: we have values at some points, and want to interpolate over an entire grid. In this case, we can use gstat’s kriging functions. In particular, we’ll just start off with the simple krige
for now.
library(gstat)
## Warning: package 'gstat' was built under R version 3.1.3
zinc.idw <- krige(zinc ~ 1, meuse, meuse.grid)
## [inverse distance weighted interpolation]
zinc.idw %>% class
## [1] "SpatialPixelsDataFrame"
## attr(,"package")
## [1] "sp"
zinc.idw %>% as.data.frame %>% head
## x y var1.pred var1.var
## 1 181180 333740 633.6864 NA
## 2 181140 333700 712.5450 NA
## 3 181180 333700 654.1617 NA
## 4 181220 333700 604.4422 NA
## 5 181100 333660 857.2558 NA
## 6 181140 333660 755.5061 NA
Here, there are a couple things to note. First, the function takes a “formula” argument. Since we want to interpolate for values of zinc
, we would use “ordinary”, or “simple”, kriging, in which case we use the notation “[variable] ~ 1”. The second argument is the where the values of that variable being interpolated, come from. The third is the region of interest, such as a grid of spatial locations we want estimated predictions for.
The result of the kriging is a data frame with coordinates (x
and y
), predicted values of the variable (var1.pred
), and variance of the predictions (var1.var
). (Aside: I’m not really sure why in this example, there are NA
’s for the prediction variance; I think it’s because a variogram wasn’t supplied to form the predictions from. However, I’m almost not clear how there can be predictions without that variogram, but I haven’t studied much of the theory behind this yet).
These results, again, can be graphed with the sp package’s functions, or otherwise with ggplot2
:
spplot(zinc.idw["var1.pred"], main="zinc inverse distance weighted interpolations")
#same spplot with ggplot
library(scales)
## Warning: package 'scales' was built under R version 3.1.3
zinc.idw %>% as.data.frame %>%
ggplot(aes(x=x, y=y, fill=var1.pred)) + geom_tile() + theme_bw() +
coord_equal() + scale_fill_gradient(low = "red", high="yellow") +
ggtitle("zinc inverse distance weighted interpolations") +
scale_x_continuous(labels=comma) + scale_y_continuous(labels=comma)
One advantage of ggplot2 in this case is the amount of control over the color scheme (as well as other aspects of the plot). In the example above, I stuck to the red-to-yellow scale used earlier. Although, note that here, red doesn’t represent 0
anymore.
# graphical check of hypothesis from above graphs
plot(log(zinc) ~ sqrt(dist), data=meuse, pch=16, cex=.5)
abline(lm(log(zinc) ~ sqrt(dist), meuse))
# or with ggplot:
# meuse %>% as.data.frame %>%
# ggplot(aes(sqrt(dist), log(zinc))) + geom_point() +
# geom_smooth(method="lm", se=FALSE)
As alluded to earlier, it’s often helpful when performing kriging to also have a variogram (or, semi-variogram) model fit to the data. For an excellent introduction to variograms, see Allison Lassiter’s website.
Basically, while we did interpolate using a relationship for zinc, we might want to explore how `log(zinc)`` varies over space. For this, we can plot a variogram. First, the code, then more explanation about what the code does.
# inspect variation of log(zinc) by distance (i.e., from the river)
lzn.vgm <- variogram(log(zinc)~1, meuse) # calculates sample variogram values
lzn.fit <- fit.variogram(lzn.vgm, model=vgm(1, "Sph", 900, 1)) # fit model
plot(lzn.vgm, lzn.fit) # plot the sample values, along with the fit model
In the first line, we merely calculate a sample variogram. This involves several things, as can be seen by inspecting the actual object:
lzn.vgm
## np dist gamma dir.hor dir.ver id
## 1 57 79.29244 0.1234479 0 0 var1
## 2 299 163.97367 0.2162185 0 0 var1
## 3 419 267.36483 0.3027859 0 0 var1
## 4 457 372.73542 0.4121448 0 0 var1
## 5 547 478.47670 0.4634128 0 0 var1
## 6 533 585.34058 0.5646933 0 0 var1
## 7 574 693.14526 0.5689683 0 0 var1
## 8 564 796.18365 0.6186769 0 0 var1
## 9 589 903.14650 0.6471479 0 0 var1
## 10 543 1011.29177 0.6915705 0 0 var1
## 11 500 1117.86235 0.7033984 0 0 var1
## 12 477 1221.32810 0.6038770 0 0 var1
## 13 452 1329.16407 0.6517158 0 0 var1
## 14 457 1437.25620 0.5665318 0 0 var1
## 15 415 1543.20248 0.5748227 0 0 var1
lzn.vgm %>% class
## [1] "gstatVariogram" "data.frame"
lzn.fit %>% class
## [1] "variogramModel" "data.frame"
The first column, np, says how many point pairs were within distance “dist” (if those numbers look like a lot, recall that although meuse has only 155 rows, there are 155 * 154 / 2 = 11,935 point pairs; see the plot above with the points graphed). If we plot this object itself, we just get the sample variogram, without any fit to it (try it!).
To perform a fit, we call the fit.variogram
function, and pass it two parameters: a variogram object, and a model we want to fit the data to. With the model specified, the function would find the optimal (in some sense) parameters for that model to fit the data.
In this tutorial, a spherical model is used. The book Applied Spatial Data Analysis with R (ASDAR) has the complete list of variogram models one can use. And while the functional forms of those models aren’t included, a more graphical/qualitative display of characteristics for different variogram models, is available by calling the function: show.vgms()
.
Now if you plot the variogram and the fit, you (surprise!) get both together.
But we might not like that model. So we could try to see how log(zinc)
varies with the square root of distance. This time, we’ll try an exponential model. Otherwise, everything is pretty much the same as before:
# inspect variation of log(zinc) by square root of distance
lznr.vgm <- variogram(log(zinc) ~ sqrt(dist), meuse)
lznr.fit <- fit.variogram(lznr.vgm, model=vgm(1, "Exp", 300, 1))
lznr.fit %>% class
## [1] "variogramModel" "data.frame"
plot(lznr.vgm, lznr.fit)
question: how do kriging results vary if no model specified? question: how does kriging happen when no projection specified? This seems to be opposed to the meuse tutorial. note: here, interpolation done on gridded SPDF, but this time, result is another SPDF (unlike when not specifying vgm model). Also, there are values for var1.var in the output (which seems to be variance of the prediction).
lzn.kriged <- krige(log(zinc) ~ 1, meuse, meuse.grid, model=lzn.fit)
## [using ordinary kriging]
# sp plotting
spplot(lzn.kriged["var1.pred"])
# kriging results in ggplot
lzn.kriged %>% as.data.frame %>%
ggplot(aes(x=x, y=y)) + geom_tile(aes(fill=var1.pred)) +
coord_equal() + scale_fill_gradient(low = "red", high="yellow") +
scale_x_continuous(labels=comma) + scale_y_continuous(labels=comma) +
theme_bw()
lzn.condsim <- krige(log(zinc)~1, meuse, meuse.grid, model=lzn.fit,
nmax=30, nsim=4)
## drawing 4 GLS realisations of beta...
## [using conditional Gaussian simulation]
# sp plotting
spplot(lzn.condsim, main="three conditional simulations")
# with ggplot2. (no need to call components with "@" or "attr(., "data"), e.g.)
#lzn_cond_df <- cbind(attr(lzn.condsim, "data"), attr(lzn.condsim, "coords"))
lzn.condsim %>% as.data.frame %>%
gather(sim, value, sim1:sim4) %>%
ggplot(aes(x=x, y=y)) + geom_tile(aes(fill=value)) +
facet_grid(.~sim) + coord_fixed(ratio = 1) +
scale_x_continuous(labels=comma) + scale_y_continuous(labels=comma) +
scale_fill_gradient(low = "red", high="yellow") +
ggtitle("Three conditional simulations") + theme_bw()