A very interesting field of statistics is the one referred to
Geospatial data.
All the studies in this field are based on the fact that:
Data closer in space are more correlated each other
(The closer the values are in space, the close the values they have)
There are three types of geospatial data:
Geostatistical data
Lattice data
Point patterns
In this case we have a regular enough Domain D (contains a rectangle).
In this domain we have done some observations of a random variable (Z) in some sites (s1, s2, …).
The objective with geostatistical data is to predict the value of the random variable (Z) is unoserved places (s0).
In order to make predictions, we have to assume that there is a certain degree of stationarity.
So that certain things in the past will remain the same in the future.
Levels of statonarity:
Strongly stationary
if \(\left( Z_{s1}, \ldots, Z_{sn} \right)^T \text{ and } \left( Z_{s1+h}, \ldots, Z_{sn+h} \right)^T\) have the same joint distribution
Weakly stationary (second order stationarity)
\[ \begin{align*}(i) & \quad E[Z_s] = m, \quad \text{for all } s \in D; \\(ii) & \quad \text{Cov}(Z_{si}, Z_{sj}) = E[(Z_{si} - m)(Z_{sj} - m)] = C(h), \quad \text{for all } s_i, s_j \text{ in } D, \text{ where } h = s_i - s_j.\end{align*} \]
Intrinsic stationary
\[ \begin{align*}(i) & \quad E[Z_s] = m, \quad \text{for all } s \in D; \\(ii) & \quad \text{Var}(Z_{si} - Z_{sj}) = E[(Z_{si} - Z_{sj})^2] = 2\gamma(h), \quad \text{for all } s_i, s_j \text{ in } D, \text{ where } h = s_i - s_j.\end{align*} \] If the semivariogram is intrinsic stationary, then:
\[ \gamma(h) = C(0)-C(h) \] An intrinsic stationary process is isotropic if it depends only on the length of h \(||\mathbf{h}||\) and not on the direction.
\[ \text{Var}(Z_{si} - Z_{sj}) = 2\gamma(||h||) \] To investigate isotropy, directional variograms are usually employed.
If the field is not isotropic, it is anisotropic.
There are 2 kind of anisotropy:
library(sp) ## Data management
library(lattice) ## Data management
library(gstat) ## Geostatistics (essential package)
library(geoR) ## Geostatistics (just for some plots)
## --------------------------------------------------------------
## Analysis of Geostatistical Data
## For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
## geoR version 1.9-4 (built on 2024-02-14) is now loaded
## --------------------------------------------------------------
## Functions for graphics
v.f <- function(x, ...){100-cov.spatial(x, ...)}
v.f.est<-function(x,C0, ...){C0-cov.spatial(x, ...)}
# load the data
data(meuse)
And tell to the program the columns (x,y) of the coordinates by doing:
coordinates(meuse) <- c('x','y') # to say what are the coloumns of the coordinates
For a better visualization I will use the following data:
# just for a better visualization of the river Meuse:
# we load river meuse data
data(meuse.riv)
meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
meuse.sr <- SpatialPolygons(meuse.lst)
# grid for prediction
data(meuse.grid)
is(meuse.grid)
## [1] "data.frame" "list" "oldClass" "vector"
coordinates(meuse.grid) <- c('x','y')
meuse.grid <- as(meuse.grid, 'SpatialPixelsDataFrame')
# plot all together
image(meuse.grid, col = "lightgrey")
plot(meuse.sr, col = "grey", add = TRUE)
plot(meuse, add = TRUE)
title('meuse river geostatistical data')
# sampling locations are crosses
# we visualize the river
# domain D is grey area (subset of R^2 because we used UTM coordinates)
spplot(meuse, "zinc", do.log = T, colorkey = TRUE) # better than bubble
N.B.: The notation “~1” stands for a single constant predictor (hp: spatially constant mean)
svgm <- variogram(log(zinc) ~ 1, meuse) # with all the defaults of the function
plot(svgm, main = 'Sample Variogram', pch=19)
Observations:
Is there any asymptote? If yes, it is compatible with stationarity.
Behaviour near zero? Linear model could fit well; also a quadratic might work.
You have to plot the semivariogram in 4 directions and see if the
graph changes or not.
To set the different directions we can change the alpha parameter.
plot(variogram(log(zinc) ~ 1, meuse, alpha = c(0, 45, 90, 135)), pch=19)
It changes depending if we are assuing anisotropicity or isotropicity, because the models we have to use are different.
show.vgms() # TO SEE EXAMPLES OF THE MODELS
vgm(sill, model, range, nugget)
How can we fit the model we want to our empirical variogram ?
By using the fit.variogram() function:
v <- variogram(log(zinc) ~ 1, meuse) # empirical semivariogram
v.fit <- fit.variogram(v, vgm(1, "Sph", 800, 1)) # fit of the empirical and theoretical
plot(v, v.fit, pch = 19) # always do this!
In Ordinary Kringing we are making the assumption that the mean is constant in all the locations of the domain.
I have to create a point S_0, which will be an unobserved location where I will di the prediction.ù
s0.new = data.frame(x=179180, y=330100) # UTM coordinates OF THE NEW LOCATION
coordinates(s0.new) = c('x','y') # set the coordinates of the dataframe
# plot all together
image(meuse.grid, col = "lightgrey")
plot(meuse.sr, col = "grey", add = TRUE)
plot(meuse, add = TRUE)
plot(s0.new, add = TRUE, col='red', lwd = 2)
title('meuse river geostatistical data')
In order to make the predictions, I have to create a gstat object that contains all the information.
formula, so the random variable
dataset
model of the field
g.tr <- gstat(formula = log(zinc) ~ 1, data = meuse, model = v.fit)
predict(g.tr, s0.new)
## [using ordinary kriging]
## coordinates var1.pred var1.var
## 1 (179180, 330100) 5.293158 0.1433444
# Create a gstat object setting a spherical (residual) variogram
# gstat(g.obj, id, formula, data, model, set,...)
g.tr <- gstat(formula = log(zinc) ~ 1, data = meuse, model = v.fit) # v.fit is the variogram model we use
g.tr
## data:
## var1 : formula = log(zinc)`~`1 ; data dim = 155 x 12
## variograms:
## model psill range
## var1[1] Nug 0.05065923 0.0000
## var1[2] Sph 0.59060463 896.9976
While if I want a prediction on the mean (BEST LINEAR UNBIASED ESTIMATOR = BLUE)
predict(g.tr, meuse[1,], BLUE = TRUE)
## [generalized least squares trend estimation]
## coordinates var1.pred var1.var
## 1 (181072, 333611) 6.053541 0.03981776
I can do the same thing for all the points of the domain and obtain an entire predicted field.
N.B.: Remember that kringing is a interpolation, so the prediction in a olready observed location will return the value I already have.
Note that the variability for the predictions in the sites where I
already have the observations is NULL, because I’m
interpolating.
# Making the prediction in all the points
lz.ok <- predict(g.tr, meuse.grid, BLUE = FALSE)
## [using ordinary kriging]
# Plotting the predictions
spplot(lz.ok["var1.pred"], main = "Ordinary kriging - Prediction")
# Plotting the variability of my predictions (Variance)
spplot(lz.ok["var1.var"], main = "Ordinary kriging - Variance", col.regions = colorRampPalette(c("white", "red"))(20))
In this case I’m not assuming a constant mean anymore. It is too restrictive!
I’m saying that the stationarity is made on the residuals and not on the mean, in fact there is non-stationarity (Universal Krigin).
# non stationary formula (intercept is automatically inside)
meuse.gstat <- gstat(id = 'zinc', formula = log(zinc) ~ sqrt(dist),
data = meuse, nmax = 50, model = v.fit, set = list(gls=1))
# set = list(gls=1) means that I want the estimation to be done with GLS
Now I plot the empirical semivariogram of the residuals:
In fact note that the sill is much smaller thant the on of the Ordinary kriging.
v.gls <- variogram(meuse.gstat)
# difference with before (before we were just giving a formula, now we give the non-stationary object)
plot(v.gls)
Now I fit the best semivariogram model to my empirical semivariogram:
v.gls.fit <- fit.variogram(v.gls, vgm(0.25, "Sph", 800, 0.8))
v.gls.fit
## model psill range
## 1 Nug 0.07980791 0.000
## 2 Sph 0.14917202 863.959
plot(v.gls, v.gls.fit, pch = 19)
Then I fit my data with the theoretical model:
# Update gstat object with variogram model (there is: model = v.gls.fit)
meuse.gstat <- gstat(id = 'zinc', formula = log(zinc) ~ sqrt(dist),
data = meuse, nmax = 50, model = v.gls.fit, set = list(gls=1))
meuse.gstat
## data:
## zinc : formula = log(zinc)`~`sqrt(dist) ; data dim = 155 x 12 nmax = 50
## variograms:
## model psill range
## zinc[1] Nug 0.07980791 0.000
## zinc[2] Sph 0.14917202 863.959
## set gls = 1;
I create an unobserved location:
s0.new
## SpatialPoints:
## x y
## [1,] 179180 330100
## Coordinate Reference System (CRS) arguments: NA
## I have to define also the covariate in s_0
s0.vec <- as.vector(slot(s0.new,'coords'))
# distance to the river: calculate the distance between s0 and each point of
# the river, then select the minimum
s0.dist <- min(rowSums(scale(meuse.riv,s0.vec)^2))
## I'm creating a new variable (distance) transforming the raw variables (coordinates)
s0.new <- as.data.frame(c(s0.new,s0.dist))
names(s0.new) <- c('x','y','dist')
coordinates(s0.new) <- c('x','y')
s0.new <- as(s0.new, 'SpatialPointsDataFrame')
s0.new
## coordinates dist
## 1 (179180, 330100) 0.02006051
Function “predict” uses the residual variogram stored in the gstat object to make the prediction.
predict(meuse.gstat, s0.new)
## [using universal kriging]
## coordinates zinc.pred zinc.var
## 1 (179180, 330100) 6.493177 0.1761844
While the prediction of the mean:
# this gives the estimate of x(s_0)'*beta
# becaause there is a trend component.
predict(meuse.gstat, s0.new, BLUE = TRUE)
## [generalized least squares trend estimation]
## coordinates zinc.pred zinc.var
## 1 (179180, 330100) 6.706755 0.04637467
As before, I can make predictions in all the domain:
lz.uk <- predict(meuse.gstat, meuse.grid, BLUE=FALSE)
## [using universal kriging]
And I can estimate the mean in all the domain:
lz.uk.BLUE <- predict(meuse.gstat, meuse.grid, BLUE=TRUE)
## [generalized least squares trend estimation]
Here we can see the differences between the Ordinary Kriging and the Universal Kriging:
library(gridExtra)
# Plot the maps side by side with larger size
grid.arrange(
spplot(lz.ok[,1], main = 'Ordinary Kriging (PREDICTION), gstat'),
spplot(lz.uk[,1], main = 'Universal Kriging (PREDICTION), gstat'),
nrow = 2
)
While in the following graph we can see the mean in the domain.
Please note that in the Universal kriging the mean is
changing as we can see from the graph.
I can use the mean to make PREDICTIONs in the
future.
spplot(lz.uk.BLUE[,1], main = 'Universal Kriging (PREDICTION) - drift (MEAN) , gstat')
The variance assoiciated with the mean is the following:
spplot(lz.uk[,2], main = 'Universal Kriging - Variance')
Is the drift important to explain the variability of the response varibale Z_s ?
We have to compare the Variogram of the residuals the the on the the data.
plot(v$dist,v$gamma,xlab='distance',ylab='semivariance',pch=19,col='skyblue1',ylim=c(0,0.8))
curve(v.f.est(x, C0=v.fit[2,2]+v.fit[1,2], cov.pars=rbind(c(v.fit[2,2], v.fit[2,3]),c(v.fit[1,2], v.fit[1,3])),
cov.model = c("spherical","pure.nugget")), from = 0.0001, to = 1600,
xlab = "distance", ylab = expression(gamma(h)),
main = "Variogram model",add=TRUE,col='skyblue1',lwd=2, ylim=c(0,110)) # variance of z estimated from observations (0.6)
points(v.gls$dist,v.gls$gamma,xlab='distance',ylab='semivariance',pch=19,col='steelblue',ylim=c(0,0.8))
curve(v.f.est(x, C0=v.gls.fit[2,2]+v.gls.fit[1,2],
cov.pars=rbind(c(v.gls.fit[2,2], v.gls.fit[2,3]),c(v.gls.fit[1,2], v.gls.fit[1,3])),
cov.model = c("spherical","pure.nugget")), from = 0.0001, to = 1600,
xlab = "distance", ylab = expression(gamma(h)),
main = "Variogram model",add=TRUE,col='steelblue',lwd=2, ylim=c(0,110)) # variance of the residuals (0.2) [iterative algorithm]
The variance explained by the drift is what there is in between the two curves.
So the larger is the difference between the sill in the two cases, the more the regressors are important in explaining the variability.
If there is not much difference, we can go with the simpler model (Stationary model).