Geospatial statistics

A very interesting field of statistics is the one referred to Geospatial data.
All the studies in this field are based on the fact that:

Data closer in space are more correlated each other

(The closer the values are in space, the close the values they have)

There are three types of geospatial data:

1. Geostatistical data

In this case we have a regular enough Domain D (contains a rectangle).

In this domain we have done some observations of a random variable (Z) in some sites (s1, s2, …).

GOAL: Kriging

The objective with geostatistical data is to predict the value of the random variable (Z) is unoserved places (s0).

Stationarity

In order to make predictions, we have to assume that there is a certain degree of stationarity.

So that certain things in the past will remain the same in the future.

Levels of statonarity:

  1. Strongly stationary

    if \(\left( Z_{s1}, \ldots, Z_{sn} \right)^T \text{ and } \left( Z_{s1+h}, \ldots, Z_{sn+h} \right)^T\) have the same joint distribution

  2. Weakly stationary (second order stationarity)

    \[ \begin{align*}(i) & \quad E[Z_s] = m, \quad \text{for all } s \in D; \\(ii) & \quad \text{Cov}(Z_{si}, Z_{sj}) = E[(Z_{si} - m)(Z_{sj} - m)] = C(h), \quad \text{for all } s_i, s_j \text{ in } D, \text{ where } h = s_i - s_j.\end{align*} \]

  3. Intrinsic stationary

    \[ \begin{align*}(i) & \quad E[Z_s] = m, \quad \text{for all } s \in D; \\(ii) & \quad \text{Var}(Z_{si} - Z_{sj}) = E[(Z_{si} - Z_{sj})^2] = 2\gamma(h), \quad \text{for all } s_i, s_j \text{ in } D, \text{ where } h = s_i - s_j.\end{align*} \] If the semivariogram is intrinsic stationary, then:

    \[ \gamma(h) = C(0)-C(h) \] An intrinsic stationary process is isotropic if it depends only on the length of h \(||\mathbf{h}||\) and not on the direction.

    \[ \text{Var}(Z_{si} - Z_{sj}) = 2\gamma(||h||) \] To investigate isotropy, directional variograms are usually employed.

    If the field is not isotropic, it is anisotropic.

    There are 2 kind of anisotropy:

    1. geometric anisotropy: in this case we can rescale the domain and recover isotropy.
    2. zonal anisotropy

Geostatistical data in R

library(sp)           ## Data management
library(lattice)      ## Data management
library(gstat)        ## Geostatistics (essential package)
library(geoR)         ## Geostatistics (just for some plots)
## --------------------------------------------------------------
##  Analysis of Geostatistical Data
##  For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
##  geoR version 1.9-4 (built on 2024-02-14) is now loaded
## --------------------------------------------------------------
## Functions for graphics 
v.f <- function(x, ...){100-cov.spatial(x, ...)}
v.f.est<-function(x,C0, ...){C0-cov.spatial(x, ...)}

Coordinates management

# load the data
data(meuse)

!!! Be sure to convert the coordinates in UTM coordinates (planar, meter) !!!

And tell to the program the columns (x,y) of the coordinates by doing:

coordinates(meuse) <- c('x','y') # to say what are the coloumns of the coordinates

For a better visualization I will use the following data:

# just for a better visualization of the river Meuse:
# we load river meuse data
data(meuse.riv)
meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
meuse.sr <- SpatialPolygons(meuse.lst)
# grid for prediction
data(meuse.grid)
is(meuse.grid)
## [1] "data.frame" "list"       "oldClass"   "vector"
coordinates(meuse.grid) <- c('x','y')
meuse.grid <- as(meuse.grid, 'SpatialPixelsDataFrame')
# plot all together
image(meuse.grid, col = "lightgrey")
plot(meuse.sr, col = "grey", add = TRUE)
plot(meuse, add = TRUE)
title('meuse river geostatistical data')

# sampling locations are crosses
# we visualize the river
# domain D is grey area (subset of R^2 because we used UTM coordinates)

EDA

spplot(meuse, "zinc", do.log = T, colorkey = TRUE)  # better than bubble

Empirical semivariogram (estimation of spatial correlation)

N.B.: The notation “~1” stands for a single constant predictor (hp: spatially constant mean)

svgm <- variogram(log(zinc) ~ 1, meuse) # with all the defaults of the function
plot(svgm, main = 'Sample Variogram', pch=19)

Observations:

  1. Is there any asymptote? If yes, it is compatible with stationarity.

  2. Behaviour near zero? Linear model could fit well; also a quadratic might work.

How to check for anisotropy or isotropy ?

You have to plot the semivariogram in 4 directions and see if the graph changes or not.
To set the different directions we can change the alpha parameter.

plot(variogram(log(zinc) ~ 1, meuse, alpha = c(0, 45, 90, 135)), pch=19)

Variogram model fitting

It changes depending if we are assuing anisotropicity or isotropicity, because the models we have to use are different.

show.vgms() # TO SEE EXAMPLES OF THE MODELS

vgm(sill, model, range, nugget)

How can we fit the model we want to our empirical variogram ?

By using the fit.variogram() function:

v <- variogram(log(zinc) ~ 1, meuse)  # empirical semivariogram

v.fit <- fit.variogram(v, vgm(1, "Sph", 800, 1))   # fit of the empirical and theoretical

plot(v, v.fit, pch = 19) # always do this!

How to manage anisotropy?

Kriging in R

a) Ordinary Kriging

In Ordinary Kringing we are making the assumption that the mean is constant in all the locations of the domain.

I have to create a point S_0, which will be an unobserved location where I will di the prediction.ù

s0.new = data.frame(x=179180, y=330100) # UTM coordinates OF THE NEW LOCATION 
coordinates(s0.new) = c('x','y')        # set the coordinates of the dataframe

# plot all together
image(meuse.grid, col = "lightgrey")
plot(meuse.sr, col = "grey", add = TRUE)
plot(meuse, add = TRUE)
plot(s0.new, add = TRUE, col='red', lwd = 2)
title('meuse river geostatistical data')

In order to make the predictions, I have to create a gstat object that contains all the information.

  • formula, so the random variable

  • dataset

  • model of the field

g.tr <- gstat(formula = log(zinc) ~ 1, data = meuse, model = v.fit)

predict(g.tr, s0.new)
## [using ordinary kriging]
##        coordinates var1.pred  var1.var
## 1 (179180, 330100)  5.293158 0.1433444
# Create a gstat object setting a spherical (residual) variogram
# gstat(g.obj, id, formula, data, model, set,...)
g.tr <- gstat(formula = log(zinc) ~ 1, data = meuse, model = v.fit) # v.fit is the variogram model we use
g.tr
## data:
## var1 : formula = log(zinc)`~`1 ; data dim = 155 x 12
## variograms:
##         model      psill    range
## var1[1]   Nug 0.05065923   0.0000
## var1[2]   Sph 0.59060463 896.9976

While if I want a prediction on the mean (BEST LINEAR UNBIASED ESTIMATOR = BLUE)

predict(g.tr, meuse[1,], BLUE = TRUE)
## [generalized least squares trend estimation]
##        coordinates var1.pred   var1.var
## 1 (181072, 333611)  6.053541 0.03981776

I can do the same thing for all the points of the domain and obtain an entire predicted field.

N.B.: Remember that kringing is a interpolation, so the prediction in a olready observed location will return the value I already have.

Note that the variability for the predictions in the sites where I already have the observations is NULL, because I’m interpolating.

# Making the prediction in all the points
lz.ok <- predict(g.tr, meuse.grid, BLUE = FALSE)
## [using ordinary kriging]
# Plotting the predictions
spplot(lz.ok["var1.pred"], main = "Ordinary kriging - Prediction")

# Plotting the variability of my predictions (Variance)
spplot(lz.ok["var1.var"], main = "Ordinary kriging - Variance", col.regions = colorRampPalette(c("white", "red"))(20))

b) Universal Kriging

In this case I’m not assuming a constant mean anymore. It is too restrictive!

I’m saying that the stationarity is made on the residuals and not on the mean, in fact there is non-stationarity (Universal Krigin).

# non stationary formula (intercept is automatically inside)
meuse.gstat <- gstat(id = 'zinc', formula = log(zinc) ~ sqrt(dist), 
                     data = meuse, nmax = 50, model = v.fit, set = list(gls=1))
# set = list(gls=1) means that I want the estimation to be done with GLS

Now I plot the empirical semivariogram of the residuals:

In fact note that the sill is much smaller thant the on of the Ordinary kriging.

v.gls <- variogram(meuse.gstat) 
# difference with before (before we were just giving a formula, now we give the non-stationary object)
plot(v.gls) 

Now I fit the best semivariogram model to my empirical semivariogram:

v.gls.fit <- fit.variogram(v.gls, vgm(0.25, "Sph", 800, 0.8))
v.gls.fit
##   model      psill   range
## 1   Nug 0.07980791   0.000
## 2   Sph 0.14917202 863.959
plot(v.gls, v.gls.fit, pch = 19)

Then I fit my data with the theoretical model:

# Update gstat object with variogram model (there is: model = v.gls.fit)
meuse.gstat <- gstat(id = 'zinc', formula = log(zinc) ~ sqrt(dist),
                     data = meuse, nmax = 50, model = v.gls.fit, set = list(gls=1))
meuse.gstat
## data:
## zinc : formula = log(zinc)`~`sqrt(dist) ; data dim = 155 x 12 nmax = 50
## variograms:
##         model      psill   range
## zinc[1]   Nug 0.07980791   0.000
## zinc[2]   Sph 0.14917202 863.959
## set gls = 1;

I create an unobserved location:

s0.new
## SpatialPoints:
##           x      y
## [1,] 179180 330100
## Coordinate Reference System (CRS) arguments: NA
## I have to define also the covariate in s_0
s0.vec <- as.vector(slot(s0.new,'coords'))
# distance to the river: calculate the distance between s0 and each point of
# the river, then select the minimum
s0.dist <- min(rowSums(scale(meuse.riv,s0.vec)^2)) 
## I'm creating a new variable (distance) transforming the raw variables (coordinates)
s0.new <- as.data.frame(c(s0.new,s0.dist))
names(s0.new) <- c('x','y','dist')
coordinates(s0.new) <- c('x','y')
s0.new <- as(s0.new, 'SpatialPointsDataFrame')
s0.new
##        coordinates       dist
## 1 (179180, 330100) 0.02006051

Function “predict” uses the residual variogram stored in the gstat object to make the prediction.

predict(meuse.gstat, s0.new)
## [using universal kriging]
##        coordinates zinc.pred  zinc.var
## 1 (179180, 330100)  6.493177 0.1761844

While the prediction of the mean:

# this gives the estimate of x(s_0)'*beta
# becaause there is a trend component.
predict(meuse.gstat, s0.new, BLUE = TRUE)
## [generalized least squares trend estimation]
##        coordinates zinc.pred   zinc.var
## 1 (179180, 330100)  6.706755 0.04637467

As before, I can make predictions in all the domain:

lz.uk <- predict(meuse.gstat, meuse.grid, BLUE=FALSE)
## [using universal kriging]

And I can estimate the mean in all the domain:

lz.uk.BLUE <- predict(meuse.gstat, meuse.grid, BLUE=TRUE)
## [generalized least squares trend estimation]

Here we can see the differences between the Ordinary Kriging and the Universal Kriging:

library(gridExtra)

# Plot the maps side by side with larger size
grid.arrange(
  spplot(lz.ok[,1], main = 'Ordinary Kriging (PREDICTION), gstat'),
  spplot(lz.uk[,1], main = 'Universal Kriging (PREDICTION), gstat'),
  nrow = 2
)

While in the following graph we can see the mean in the domain.
Please note that in the Universal kriging the mean is changing as we can see from the graph.

I can use the mean to make PREDICTIONs in the future.

spplot(lz.uk.BLUE[,1], main = 'Universal Kriging (PREDICTION) - drift (MEAN) , gstat')

The variance assoiciated with the mean is the following:

spplot(lz.uk[,2], main = 'Universal Kriging - Variance')

Is the drift important to explain the variability of the response varibale Z_s ?

We have to compare the Variogram of the residuals the the on the the data.

plot(v$dist,v$gamma,xlab='distance',ylab='semivariance',pch=19,col='skyblue1',ylim=c(0,0.8))
curve(v.f.est(x, C0=v.fit[2,2]+v.fit[1,2], cov.pars=rbind(c(v.fit[2,2], v.fit[2,3]),c(v.fit[1,2], v.fit[1,3])), 
              cov.model = c("spherical","pure.nugget")), from = 0.0001, to = 1600,
              xlab = "distance", ylab = expression(gamma(h)),
      main = "Variogram model",add=TRUE,col='skyblue1',lwd=2, ylim=c(0,110)) # variance of z estimated from observations (0.6)

points(v.gls$dist,v.gls$gamma,xlab='distance',ylab='semivariance',pch=19,col='steelblue',ylim=c(0,0.8))
curve(v.f.est(x, C0=v.gls.fit[2,2]+v.gls.fit[1,2], 
              cov.pars=rbind(c(v.gls.fit[2,2], v.gls.fit[2,3]),c(v.gls.fit[1,2], v.gls.fit[1,3])), 
              cov.model = c("spherical","pure.nugget")), from = 0.0001, to = 1600,
      xlab = "distance", ylab = expression(gamma(h)),
      main = "Variogram model",add=TRUE,col='steelblue',lwd=2, ylim=c(0,110)) # variance of the residuals (0.2) [iterative algorithm]

The variance explained by the drift is what there is in between the two curves.

So the larger is the difference between the sill in the two cases, the more the regressors are important in explaining the variability.

If there is not much difference, we can go with the simpler model (Stationary model).