Fitting a variogram model in R

Prologue

This is a demonstration of the calculation of the experimantal variogram and fitting theoretical variogram using the packate geoR in R. Note This document is written in R Markdown. This is a very convenient way to built up a workflow in R.

rm(list=ls()) # remove all variables from worspace

setwd("~/Cloud_Jena/Share/V_Gestat_2022_cloud/V09_Demos") # set your own here
filename = "ev15.csv"
a<-read.csv(filename)
a=na.omit(a)

Explore data

Get an overview of the measurement locations

plot(a$x,a$y)

Check for Skew

hist(a$TF_mm)

This is a skewed distribution, need to transform.

Use the transformation \(T_2\), which is the very commonly used so called Box-Cox Transformation:

\(T_2 = \begin{cases} \frac{x^{\lambda}-1}{\lambda} & , \lambda \ne 0 \\ ln(x) &, \lambda = 0 \end{cases}\)

shown in class. Change \(\lambda\) stepwise (by 0.25) within a suitable interval. Calculate an evaluation metric, such as
\(d_{\lambda} = \frac{\bar{x}_{\lambda} - q_{0.5,\lambda}}{IQR_{\lambda}}\) (shown in class)

for each step and find, which \(\lambda\) provides a good transformation result.

For this I am using inbuilt functions for determining the artithmetic mean (mean()), the median (median()) and the interquartile range (iqr()). I am implementing everything in a loop.

l = seq(-3,3,by=0.25) # an array containing all lambdas I am going to test
d=array(data=NA,dim=length(l)) # initialize d's (my metric for testing the quality of the transformation)

for (i in 1:length(l)) { # iterate through all elements i in l
if (l[i] != 0) { # only in case l is not equal to 0
tTF = (a$TF_mm^l[i]-1)/l[i] 
} else { # if l is equal to zero -> log transform
  tTF = log(a$TF_mm) 
  }

d[i] = (mean(tTF) - median(tTF))/IQR(tTF) # calculate the d for each transformed set
}

The best transformation is the one, where d is small, or minimal. We can find this by plotting or searching for the smallest value closest to 0 in d. The best way to find a value nearest to zero in an array is by taking the square of all the entried in the array and finding the smallest value.

plot(l,d^2)

(l[which(d^2==min(d^2))]) # lambda for the smallest d

## [1] 0

Observation: For the resolution chosen here, indeed \(\lambda\) = 0 seems to be the best transformation. In other words, a log transform is a very good choice.

Note: A number of packages offer automatic Box-Cox Transformations and searches for the optimal \(\lambda\), just search the web and find the one you like.

We will go with a log transformation:

a$tTF = log(a$TF)
hist(a$tTF)

Check for trend

We do this best for the transformed data.

par(mfrow = c(1,2))
plot(a$x,a$tTF)
plot(a$y,a$tTF)

There may be a trend, especially in x direction. To help us verify, let us do a linear model and check whether the parameters are sigificantly different from zero. R provides the

tx <- lm(a$tTF ~ a$x)
summary(tx)

## 
## Call:
## lm(formula = a$tTF ~ a$x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20703 -0.22484  0.02388  0.29766  1.20061 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.351e+03  4.695e+02  -5.007 9.19e-07 ***
## a$x          3.607e-03  7.199e-04   5.010 9.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4715 on 316 degrees of freedom
## Multiple R-squared:  0.07359,    Adjusted R-squared:  0.07066 
## F-statistic:  25.1 on 1 and 316 DF,  p-value: 9.072e-07

ty <- lm(a$tTF ~ a$y)
summary(ty)

## 
## Call:
## lm(formula = a$tTF ~ a$y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31142 -0.25525  0.02498  0.30162  1.00398 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.759e+04  6.892e+03  -4.003 7.80e-05 ***
## a$y          4.779e-03  1.194e-03   4.003 7.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4779 on 316 degrees of freedom
## Multiple R-squared:  0.04827,    Adjusted R-squared:  0.04525 
## F-statistic: 16.03 on 1 and 316 DF,  p-value: 7.793e-05

There are significant trends in x and y direction. This violates condition one of the requirements for interpolation by krigin (intrinsic hypthesis), e.g. that the expected value is not a function of the coordinates. We therfore have to account for this. We do this within the next step, when estimating the empirical semi-variogram and kriging.

Geostatistical Analysis

As an example, I use the well-established package gstat. The package relies on another one (sp), which provides many functions for handling spatial data. Please make sure that those two package are installed before proceeding.

library(sp) # spatial data handling
library(gstat) # geostatistical interpolation

## Warning: package 'gstat' was built under R version 4.0.5

coordinates(a) <- ~x+y

summary(a) # overview of the dataset

## Object of class SpatialPointsDataFrame
## Coordinates:
##         min       max
## x  652163.7  652299.9
## y 5772960.4 5773038.8
## Is projected: NA 
## proj4string : [NA]
## Number of points: 318
## Data attributes:
##      TF_mm             tTF         
##  Min.   : 1.060   Min.   :0.05827  
##  1st Qu.: 2.560   1st Qu.:0.94001  
##  Median : 3.360   Median :1.21194  
##  Mean   : 3.814   Mean   :1.22223  
##  3rd Qu.: 4.690   3rd Qu.:1.54543  
##  Max.   :10.260   Max.   :2.32825

plot(a) # basic plots of the dataset

Empirical variogram

We will try the two available estimators in this package: The classical Matheron and the robust Cressie and Hawkins (1980) variogram estimate. (see WO2007, eq. 6.1).

Remember, we need to address the trend. This is done by adding the coordinates x and y to the variable to be assessed below. The general command ()

# Set the upper boundaries of the lag distance classes to be used
bu = c(0.1, 0.5, seq(1,10,1), seq(12,40,2))

v.emp.mat <-variogram(tTF~x+y, boundaries = bu, a) 
v.emp.mod<-variogram(tTF~x+y, boundaries = bu, a, cressie = TRUE) 

par(mfrow=c(1,2))
plot(v.emp.mat)

plot(v.emp.mod)

The robust variogram looks slightly more compact, especially around the expected correlation length. This is typical in most situations.

Fit variogram model

The function for fitting a variogram in this package is called ´fit.variogram´.

Note: The parameters in this package are called as follows (see help page of variofit)

nugget (Intercept) - nugget, corresponds to the un-correlated variance at very small h, e.g. near the origin
psill - partial sill, approaches the sample variance
range - range, careful in some models this does not correspond to the correlation length

The help page of ´fit.variogram´ also links through to help page of ´vgm()´ which includes a comprehensive list of the possible variogram models. You can also view those when typing ´vgm()´ at the prompt.

As an input to the model, we select the variogram with the modulus estimator.

1st candidate model: Nugget

v.fit.nug <-  fit.variogram(v.emp.mod, model = vgm("Nug"))
(v.fit.nug)

plot(v.emp.mod, v.fit.nug)

This one is not a good model. This becomes obvious from comparing with the experimental variogram.

2nd candidate model: Spherical

v.fit.sph <-  fit.variogram(v.emp.mod, model = vgm("Sph"))
(v.fit.sph)

plot(v.emp.mod, v.fit.sph)

We note the more reasonable sill and range and visually better fit.The output of the model gives us the parameters of the spherical variogram model. The nugget is 0, the partial sill is 0.207 and the range is 9.56 m.

3rd candidate model: Expponential

v.fit.exp <-  fit.variogram(v.emp.mod, model = vgm("Exp"))
(v.fit.exp)

plot(v.emp.mod, v.fit.exp)

This also looks like a reasonable fit. The nugget is zero, the sill is a bit higher compared to the spherical with 0.228 and the range parameter is 6 m. Remember that the effective range does not correspond to the range paramter for this model type. The effective range is approximately three times the range paramter, e.g. 18 m (see lecture notes). In other words, this model predicts a longer correlation length. From the perspective of kringing this is desirable, but is it also correct?

Model selection

Fist off: I am explicitely not including a Gaussian Model as a potential candidate for reasons mentioned in the lecture. Also, just by visual inspection the nugget model is a horrible fit, and we will not consider it further.

The spherical and exponential model both look reasonable. We can decide on the better model, be comparing the fitting errors (or residuals): The sum of squared error is calculated for each of the fitted models and returned in the model structure in the attribute named “SSErr”. The smaller this number the better the fit (0 indicates a perfect fit). We can check it using the command ´attr´ which extracts information on attributes:

(sse.sph <- attr(v.fit.sph,"SSErr"))

## [1] 0.1544063

(sse.exp <- attr(v.fit.exp,"SSErr"))

## [1] 0.2225237

The exponential and spherical both “look” ok, but the spherical has a better fit. It also has a better fit on the short lags. Thus, unless we have options to cross-validate, we would chose the spherical models. If cross-validation is possible, we could also go ahead with both and select the one yieling a better kriging result.

Kriging

With the fitted semivariogram model under our belt, we can go ahead with spatial interpolation. This would most likely be done in GIS, but it is possible in R. Here is a quick look at how it can be done. Note: the plots are not georeferenced and partly have no scale. They are just meant to give you an impression of how the data look. Please do better.

Target grid

We first define the target grid, e.g. all the locations where we would like to come up with kriging estimates. I chose a grid spanning from the smallest to largest x and y coordinates (rounded to full meters) and a stepsize of 0.2 m

xdir=seq(floor(min(a$x)),ceiling(max(a$x)), by=0.2) # x cooridinates of the future grid
ydir=seq(floor(min(a$y)),ceiling(max(a$y)), by=0.2) # y cooridinates of the future grid
HH.grid <- expand.grid(x=xdir, y=ydir) # creating the grid
gridded(HH.grid) = ~x+y # adding coordinates
summary(HH.grid)

## Object of class SpatialPixels
## Coordinates:
##         min       max
## x  652162.9  652300.1
## y 5772959.9 5773039.1
## Is projected: NA 
## proj4string : [NA]
## Number of points: 271656
## Grid attributes:
##   cellcentre.offset cellsize cells.dim
## x            652163      0.2       686
## y           5772960      0.2       396

Interpolation

I use the command krige() from the gstat package. There are many more specialized kriging options. The general use is as follows

result <- krige(formula, observations, target grid, theoretical variogram)

Result is a spatial object containig the kriging results

formula is the estimation formula, simplest form is z ~ 1 (in our case tTF~1). Thankfully, with gstat package we can also account the spatial trend here. This is done by including the coordinates to the model formula as follows: tTF ~ x+y . Advantage: We do not have to remove and re-introduce the trend. This way of kriging goes by the name of “universal kriging”.

target grid is a spatial object (created above)

theoretical variogram is the variogram object created above, when selecting the best fit for the variogram

Kriging can take a while, have patience.

K = krige(tTF ~ x+y, a,HH.grid, v.fit.sph)

## [using universal kriging]

First look at the data

#plot raw kriging predictions
spplot(K["var1.pred"], main = "raw ordinary kriging predictions")

spplot(K["var1.var"], main = "raw ordinary kriging variance predictions")

Remember: We are kriging the transformed data, so resulting numbers will not be throughfall just yet. Also, there are some areas with a very bad kringing prediction, where the kriging variance is large. I am chosing to mask all areas where the kriging variance exceeds 85% of the maximum kriging variance. We take care of backtransforming and masking in the same step. For this, I temporarily transfer the result of the kriging (spatial object) to a normal dataframe, so that I can access the variables more comfortably.

N = as.data.frame(K) # transfer spatial object to data frame
N$kTF = exp(N$var1.pred) # back transformation.
kvar.thresh = max(N$var1.var)*0.85 # setting my threshold to 90% of max
N$kTF[which(N$var1.var>kvar.thresh)]=NA # set all predicted values NA if prediction variance exceeds the threshold

gridded(N) = ~x+y # back transfer data frame to spatial structure 

# plot again
spplot(N["kTF"], main = "ordinary kriging predictions")

image(N["kTF"], main = "ordinary kriging predictions with measurement points")
points(a$x,a$y)

Referencing the software

The packages of R are open access. There are plenty of them, so go out and explore. If you use the package for your work, please cite it. This is good practice for two reasons:

You acknowledge the author’s work
Your methods are well documented, anybody can check which methods you used precisely, if you indicate package and command used.

Here is how to find the citation information for a package:

citation(package = "gstat")

## 
## To cite package gstat in publications use:
## 
##   Pebesma, E.J., 2004. Multivariable geostatistics in S: the gstat
##   package. Computers & Geosciences, 30: 683-691.
## 
##   Benedikt Gr<U+00E4>ler, Edzer Pebesma and Gerard Heuvelink, 2016.
##   Spatio-Temporal Interpolation using gstat.  The R Journal 8(1),
##   204-218
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.