Prologue

This is a demonstration of the calculation of the experimental variogram and the fitting of a theoretical variogram using the package geoR in R. Note: This document is written in R Markdown, which is a very convenient way to build up a workflow in R.

setwd("~/Cloud_Jena/Geostat_2021/InClass_Vario") # working directory containing the data file
filename = "ev15.csv"
a <- read.csv(filename) # read the measurements into a data frame
a <- na.omit(a) # drop rows with missing values

Explore data

Get an overview of the measurement locations

plot(a$x,a$y)

Check for Skew

hist(a$TF_mm)

This is a skewed distribution; we need to transform the data.

Use the transformation \(T_2\), the very commonly used so-called Box-Cox transformation:

\(T_2 = \begin{cases} \frac{x^{\lambda}-1}{\lambda} & , \lambda \ne 0 \\ \ln(x) & , \lambda = 0 \end{cases}\)

shown in class. Change \(\lambda\) stepwise (by 0.25) within a suitable interval. Calculate an evaluation metric, such as
\(d_{\lambda} = \frac{\bar{x}_{\lambda} - q_{0.5,\lambda}}{IQR_{\lambda}}\) (shown in class)

for each step and find which \(\lambda\) provides a good transformation result.

For this I am using the built-in functions for the arithmetic mean (mean()), the median (median()) and the interquartile range (IQR()). I am implementing everything in a loop.

l = seq(-3,3,by=0.25) # an array containing all lambdas I am going to test
d = array(data=NA, dim=length(l)) # initialize d's (my metric for testing the quality of the transformation)

for (i in 1:length(l)) { # iterate through all elements i in l
  if (l[i] != 0) { # only in case l[i] is not equal to 0
    tTF = (a$TF_mm^l[i]-1)/l[i]
  } else { # if l[i] is equal to zero -> log transform
    tTF = log(a$TF_mm)
  }
  d[i] = (mean(tTF) - median(tTF))/IQR(tTF) # calculate d for each transformed set
}

The best transformation is the one where d is small, ideally minimal. We can find it by plotting, or by searching for the value in d closest to 0. A simple way to find the value nearest to zero in an array is to square all entries and look for the smallest squared value.

plot(l,d^2)

(l[which(d^2==min(d^2))]) # lambda for the smallest d
## [1] 0
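Equivalently, the index of the smallest squared value can be obtained directly with which.min(); a minimal one-liner (not used for the result above):

l[which.min(d^2)] # returns the lambda whose squared d is smallest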

Observation: For the resolution chosen here, indeed \(\lambda\) = 0 seems to be the best transformation. In other words, a log transform is a very good choice.

Note: A number of packages offer automatic Box-Cox transformations and search for the optimal \(\lambda\); just search the web and find the one you like.
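For example, the MASS package (shipped with R) provides boxcox(), which profiles the log-likelihood over a grid of \(\lambda\) values. A minimal sketch, assuming the data frame a from above (shown for illustration only, not part of the analysis):

library(MASS) # provides boxcox()
bc <- boxcox(TF_mm ~ 1, data = a, lambda = seq(-3, 3, by = 0.25)) # profile log-likelihood over a lambda grid
bc$x[which.max(bc$y)] # lambda with the highest profile log-likelihood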

We will go with a log transformation:

a$tTF = log(a$TF_mm) # log-transform (using the full column name TF_mm)
hist(a$tTF)

Check for trend

This is best done on the transformed data.

par(mfrow = c(1,2))
plot(a$x,a$tTF)
plot(a$y,a$tTF)

There may be a trend, especially in the x direction. To verify this, let us fit a linear model and check whether the parameters are significant.

tx <- lm(a$tTF ~ a$x)
summary(tx)
## 
## Call:
## lm(formula = a$tTF ~ a$x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20703 -0.22484  0.02388  0.29766  1.20061 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.351e+03  4.695e+02  -5.007 9.19e-07 ***
## a$x          3.607e-03  7.199e-04   5.010 9.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4715 on 316 degrees of freedom
## Multiple R-squared:  0.07359,    Adjusted R-squared:  0.07066 
## F-statistic:  25.1 on 1 and 316 DF,  p-value: 9.072e-07
ty <- lm(a$tTF ~ a$y)
summary(ty)
## 
## Call:
## lm(formula = a$tTF ~ a$y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31142 -0.25525  0.02498  0.30162  1.00398 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.759e+04  6.892e+03  -4.003 7.80e-05 ***
## a$y          4.779e-03  1.194e-03   4.003 7.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4779 on 316 degrees of freedom
## Multiple R-squared:  0.04827,    Adjusted R-squared:  0.04525 
## F-statistic: 16.03 on 1 and 316 DF,  p-value: 7.793e-05

There are significant trends in the x and y directions, and we have to take care of them. We do this in the next step, when estimating the empirical semi-variogram.

Geostatistical Analysis

As an example, I use the classical package geoR. There are many, many more. I like geoR for its straightforwardness, and it is a good package for starters. It is almost always possible to see what a method actually does, and it is well documented. geoR has the basic functions for deriving the variogram and detrending the data.

The next steps require the package geoR to be installed.
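If it is not available yet, it can be installed from CRAN once:

# install.packages("geoR") # run once if the package is not installed yet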

library(geoR) # activate package
## --------------------------------------------------------------
##  Analysis of Geostatistical Data
##  For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
##  geoR version 1.8-1 (built on 2020-02-08) is now loaded
## --------------------------------------------------------------
gtf = as.geodata(a, coords.col = 1:2, data.col=4) # translate the dataframe to a geodata object, use transformed tf
summary(gtf) # overview of the dataset
## Number of data points: 318 
## 
## Coordinates summary
##            x       y
## min 652163.7 5772960
## max 652299.9 5773039
## 
## Distance summary
##      min      max 
##   0.1000 137.9171 
## 
## Data summary
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## 0.05826891 0.94000726 1.21194097 1.22222963 1.54543258 2.32825284
plot(gtf) # basic plots of the dataset

Empirical variogram

Remember to address the trend. We will try the two estimators available in this package: the classical Matheron estimator and the Cressie-Hawkins estimator (called "modulus"), which is bias-corrected.

# define the class midpoints (lag bins)
br = c(0.1, 0.5, seq(1,10,1), seq(12,40,2))
gtf.variog.mat<-variog(gtf,uvec = br, trend="1st", pairs.min = 30) # trend as a first-order polynomial (surface) = linear
## variog: computing omnidirectional variogram
gtf.variog.mod<-variog(gtf,uvec = br, trend="1st", pairs.min = 30, estimator.type = "modulus") # trend as a first-order polynomial (surface) = linear
## variog: computing omnidirectional variogram
par(mfrow=c(2,2))
plot(gtf.variog.mat)
plot(gtf.variog.mod)

The robust variogram looks slightly more compact, especially around the expected correlation length. This is typical in most situations.

Fit variogram model

The function for fitting a variogram in this package is called variofit.

Note: The parameters in this package are called as follows (see help page of variofit)

  • tausq (\(\tau^2\)) - nugget (the "intercept"); corresponds to the uncorrelated variance at very small h, i.e. near the origin
  • sigmasq (\(\sigma^2\)) - partial sill, approaches the sample variance
  • phi (\(\phi\)) - range parameter; careful, in some models this does not correspond to the correlation length

Output delivers the “practical range”, which is what we called “effective range” in the lecture.
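For example, for the exponential model the correlation at lag \(h\) is \(e^{-h/\phi}\); setting this to 0.05 gives the practical range \(h_p = -\phi \, \ln(0.05) \approx 3\phi\), whereas for the spherical model the practical range coincides with \(\phi\).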

The help page of variofit also links through to help page of cov.spatial, which includes a comprehensive list of the possible variogram models.

As an input to the model, we select the variogram with the modulus estimator.

Now is a good time to think about the weights. By default, weights are given according to the number of pairs in each bin (see the help page), and we go with that. Try playing around with different weight options and check the result; it can affect the fit at small lags (a sketch of how to change the weights is shown below). It is desirable that the model fits the small lags well, since those are most important for the subsequent interpolation.
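For instance, Cressie's weights can be requested explicitly; a minimal sketch (for illustration only, the fits below all use the default "npairs" weights):

ini.par = as.matrix(cbind(c(0.18,0.2,0.22,0.24), c(2,4,6,8))) # same grid of initial values as used below
vario.sph.cw <- variofit(gtf.variog.mod, ini.cov.pars = ini.par, cov.model = "sph", weights = "cressie") # Cressie's weights instead of the default "npairs"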

1st candidate model: Linear (with nugget)

We are required to include initial estimates of the model parameters \(\sigma^2\) (partial sill) and \(\phi\) (range). You can give a range of values or a single value. Estimate them by eyeballing the experimental variogram: the partial sill is roughly 0.22, the range roughly 6 m.

ini.par = as.matrix(cbind(c(0.18,0.2,0.22,0.24), c(2,4,6,8)))

vario.linear <- variofit(gtf.variog.mod, ini.cov.pars = ini.par, cov.model="linear")
## variofit: covariance model used is linear 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim 
## variofit: searching for best initial value ... selected values:
##               sigmasq phi   tausq kappa
## initial.value "0.24"  "2"   "0"   "0.5"
## status        "est"   "est" "est" "fix"
## loss value: 223.872943496328
(vario.linear)
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: linear
## parameter estimates:
##   tausq sigmasq     phi 
##  0.1688  0.0013  1.0000 
## Practical Range with cor=0.05 for asymptotic range: Inf
## 
## variofit: minimised weighted sum of squares = 22.997
#plot(gtf.variog.mat)
#lines(vario.linear)

This is not a good model. This becomes obvious from comparing the estimated nugget, sill, and range with the values expected from the experimental variogram; neither compares well. Also, we make a mental note of the minimised weighted sum of squares: this number should be as small as possible, and the smaller it is, the better the fit.

2nd candidate model: Spherical

ini.par = as.matrix(cbind(c(0.18,0.2,0.22,0.24), c(2,4,6,8)))

vario.sph <- variofit(gtf.variog.mod, ini.cov.pars = ini.par, cov.model="sph")
## variofit: covariance model used is spherical 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim 
## variofit: searching for best initial value ... selected values:
##               sigmasq phi   tausq kappa
## initial.value "0.2"   "8"   "0"   "0.5"
## status        "est"   "est" "est" "fix"
## loss value: 14.9544555066347
(vario.sph)
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: spherical
## parameter estimates:
##   tausq sigmasq     phi 
##  0.0000  0.2074  7.9635 
## Practical Range with cor=0.05 for asymptotic range: 7.963519
## 
## variofit: minimised weighted sum of squares = 13.9611

We note the much more reasonable sill and range, and also a much better fit according to the minimised weighted sum of squares.

3rd candidate model: Exponential

Here we have to be careful with the initial values. The effective range (which we are eyeballing) does not correspond to the actual range parameter, here called \(\phi\). In the lecture we stated that \(a_e \approx 3a\). This means the eyeballed correlation lengths should be divided by 3.

ini.par = as.matrix(cbind(c(0.18,0.2,0.22,0.24), c(2,4,6,8)/3))

vario.exp <- variofit(gtf.variog.mod, ini.cov.pars = ini.par, cov.model="exp")
## variofit: covariance model used is exponential 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim 
## variofit: searching for best initial value ... selected values:
##               sigmasq phi    tausq kappa
## initial.value "0.2"   "2.67" "0"   "0.5"
## status        "est"   "est"  "est" "fix"
## loss value: 16.2456500545636
(vario.exp)
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: exponential
## parameter estimates:
##   tausq sigmasq     phi 
##  0.0000  0.2089  3.3544 
## Practical Range with cor=0.05 for asymptotic range: 10.04889
## 
## variofit: minimised weighted sum of squares = 14.9518

Note: Here we need to check the practical range: about 10.0 m, which is somewhat larger than for the spherical model.

Besides that, the parameters are reasonable, but the minimised weighted sum of squares is a little worse compared to the spherical model.

Model selection

First off: I am explicitly not including a Gaussian model as a potential candidate, for reasons mentioned in the lecture.

According to the weighted sums of squares, the spherical model gives the best fit, followed closely by the exponential and, at some distance, the linear.

Let's compare the models visually:

plot(gtf.variog.mat)
lines(vario.linear)
lines(vario.sph, lty=3)
lines(vario.exp, lty=4)
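To tell the three curves apart, a legend can be added, for example:

legend("bottomright", legend = c("linear", "spherical", "exponential"), lty = c(1, 3, 4)) # line types match the lines() calls above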

Obviously the linear model is not a good fit, so we will not consider it further.

The exponential and spherical models both "look" ok, but the spherical has the better fit. Its correlation length is smaller, and it fits the short lags better. Unless we have the option to cross-validate, we would for those reasons choose the spherical. If cross-validation is possible, we could also go ahead with both and select the one yielding the better kriging result.
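If cross-validation is an option, geoR provides xvalid() for leave-one-out cross-validation. A minimal sketch, assuming the fitted models from above (check the help page of xvalid for how the trend is handled):

xv.sph <- xvalid(gtf, model = vario.sph) # leave-one-out cross-validation with the spherical model
xv.exp <- xvalid(gtf, model = vario.exp) # the same with the exponential model
c(mean(xv.sph$error^2), mean(xv.exp$error^2)) # compare mean squared cross-validation errors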

Some ideas for playing around:

  • Try out reasonable alternative models and see how they fit.
  • Check whether changing the weights affects the result.

Referencing the software

The packages of R are openly available. There are plenty of them, so go out and explore. If you use a package for your work, please cite it. This is good practice for two reasons:

  • You acknowledge the author’s work
  • Your methods are well documented; anybody can check precisely which methods you used.

Here is how to find the citation information for a package:

citation(package = "geoR")
## 
## To cite package 'geoR' in publications use:
## 
##   Paulo J. Ribeiro Jr, Peter J. Diggle, Martin Schlather, Roger Bivand
##   and Brian Ripley (2020). geoR: Analysis of Geostatistical Data. R
##   package version 1.8-1. https://CRAN.R-project.org/package=geoR
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {geoR: Analysis of Geostatistical Data},
##     author = {Paulo J. {Ribeiro Jr} and Peter J. Diggle and Martin Schlather and Roger Bivand and Brian Ripley},
##     year = {2020},
##     note = {R package version 1.8-1},
##     url = {https://CRAN.R-project.org/package=geoR},
##   }
## 
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.