This is a demonstration of the calculation of the experimental variogram and the fitting of a theoretical variogram using the package geoR in R. Note: This document is written in R Markdown, which is a very convenient way to build up a workflow in R.
setwd("~/Cloud_Jena/Geostat_2021/InClass_Vario")
filename = "ev15.csv"
a<-read.csv(filename)
a=na.omit(a)
plot(a$x,a$y)
hist(a$TF_mm)
This is a skewed distribution, so we need to transform the data.
Use the transformation \(T_2\), the commonly used so-called Box-Cox transformation:
\(T_2 = \begin{cases} \frac{x^{\lambda}-1}{\lambda}, & \lambda \ne 0 \\ \ln(x), & \lambda = 0 \end{cases}\)
shown in class. Change \(\lambda\) stepwise (by 0.25) within a suitable interval. Calculate an evaluation metric, such as
\(d_{\lambda} = \frac{\bar{x}_{\lambda} - q_{0.5,\lambda}}{IQR_{\lambda}}\) (shown in class)
for each step and find which \(\lambda\) provides a good transformation result.
For this I am using the built-in functions for the arithmetic mean (mean()), the median (median()) and the interquartile range (IQR()). I am implementing everything in a loop.
l = seq(-3,3,by=0.25) # an array containing all lambdas I am going to test
d=array(data=NA,dim=length(l)) # initialize d's (my metric for testing the quality of the transformation)
for (i in 1:length(l)) { # iterate through all elements i in l
  if (l[i] != 0) { # only in case l[i] is not equal to 0
    tTF = (a$TF_mm^l[i]-1)/l[i]
  } else { # if l[i] is equal to zero -> log transform
    tTF = log(a$TF_mm)
  }
  d[i] = (mean(tTF) - median(tTF))/IQR(tTF) # calculate the d for each transformed set
}
The best transformation is the one where d is small, ideally minimal. We can find it by plotting d or by searching for the value closest to 0. A simple way to find the value nearest to zero in an array is to square all the entries and look for the smallest one.
plot(l,d^2)
(l[which(d^2==min(d^2))]) # lambda for the smallest d
## [1] 0
Observation: For the resolution chosen here, indeed \(\lambda\) = 0 seems to be the best transformation. In other words, a log transform is a very good choice.
Note: A number of packages offer automatic Box-Cox Transformations and searches for the optimal \(\lambda\), just search the web and find the one you like.
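For example, a quick check with the boxcox() function from the MASS package could look like the sketch below (not part of the original workflow; the object name bc is just illustrative, and boxcox() selects \(\lambda\) by maximising a profile log-likelihood rather than the d metric used above):
library(MASS) # boxcox() lives in the MASS package
bc <- boxcox(lm(TF_mm ~ 1, data = a), lambda = seq(-3, 3, by = 0.25)) # profile log-likelihood over lambda
bc$x[which.max(bc$y)] # lambda with the highest likelihood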
We will go with a log transformation:
a$tTF = log(a$TF_mm)
hist(a$tTF)
Next we check for a spatial trend. We do this best with the transformed data.
par(mfrow = c(1,2))
plot(a$x,a$tTF)
plot(a$y,a$tTF)
There may be a trend, especially in the x direction. To verify this, let us fit a linear model and check whether the parameters are significant.
tx <- lm(a$tTF ~ a$x)
summary(tx)
##
## Call:
## lm(formula = a$tTF ~ a$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20703 -0.22484 0.02388 0.29766 1.20061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.351e+03 4.695e+02 -5.007 9.19e-07 ***
## a$x 3.607e-03 7.199e-04 5.010 9.07e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4715 on 316 degrees of freedom
## Multiple R-squared: 0.07359, Adjusted R-squared: 0.07066
## F-statistic: 25.1 on 1 and 316 DF, p-value: 9.072e-07
ty <- lm(a$tTF ~ a$y)
summary(ty)
##
## Call:
## lm(formula = a$tTF ~ a$y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31142 -0.25525 0.02498 0.30162 1.00398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.759e+04 6.892e+03 -4.003 7.80e-05 ***
## a$y 4.779e-03 1.194e-03 4.003 7.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4779 on 316 degrees of freedom
## Multiple R-squared: 0.04827, Adjusted R-squared: 0.04525
## F-statistic: 16.03 on 1 and 316 DF, p-value: 7.793e-05
There are significant trends in both the x and y directions, and we have to take care of them. We do this in the next step, when estimating the empirical semi-variogram.
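As a quick cross-check (a sketch beyond the workflow above; the object name txy is just illustrative), both coordinates can also be tested jointly in one trend-surface model:
txy <- lm(tTF ~ x + y, data = a) # joint first-order trend surface in x and y
summary(txy) # both coordinate terms should remain significant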
As an example, I use the classical package geoR. There are many, many more. I like geoR for its straightforwardness, and it is a good package for starters. It is almost always possible to see what a method actually does, and it is well documented. geoR has the basic functions for deriving the variogram and detrending the data.
The next steps require the package geoR to be installed.
library(geoR) # activate package
## --------------------------------------------------------------
## Analysis of Geostatistical Data
## For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
## geoR version 1.8-1 (built on 2020-02-08) is now loaded
## --------------------------------------------------------------
gtf = as.geodata(a, coords.col = 1:2, data.col=4) # translate the dataframe to a geodata object, use transformed tf
summary(gtf) # overview of the dataset
## Number of data points: 318
##
## Coordinates summary
## x y
## min 652163.7 5772960
## max 652299.9 5773039
##
## Distance summary
## min max
## 0.1000 137.9171
##
## Data summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05826891 0.94000726 1.21194097 1.22222963 1.54543258 2.32825284
plot(gtf) # basic plots of the dataset
Remember to address the trend. We will try the two estimators available in this package: the classical Matheron estimator and the Cressie-Hawkins estimator (called "modulus" in geoR), which is bias corrected.
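For orientation (quoted from memory; the exact definitions are given on the help page of variog), the classical Matheron estimator is \(\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[z(x_i+h)-z(x_i)\right]^2\), while the Cressie-Hawkins (modulus) estimator raises the mean of the square-root absolute differences to the fourth power and divides by a small-sample bias-correction term, roughly \(\hat{\gamma}(h) = \frac{1}{2} \left[\frac{1}{N(h)} \sum_{i=1}^{N(h)} |z(x_i+h)-z(x_i)|^{1/2}\right]^4 \Big/ \left(0.457 + \frac{0.494}{N(h)}\right)\).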
# define the class midpoints for the binning
br = c(0.1, 0.5, seq(1,10,1), seq(12,40,2))
gtf.variog.mat<-variog(gtf,uvec = br, trend="1st", pairs.min = 30) # trend as a first-order polynomial (surface) = linear
## variog: computing omnidirectional variogram
gtf.variog.mod<-variog(gtf,uvec = br, trend="1st", pairs.min = 30, estimator.type = "modulus") # trend as a first-order polynomial (surface) = linear
## variog: computing omnidirectional variogram
par(mfrow=c(2,2))
plot(gtf.variog.mat)
plot(gtf.variog.mod)
The robust variogram looks slightly more compact, especially around the expected correlation length. This is typical in most situations.
The function for fitting a variogram in this package is called “variofit”.
Note: The parameters in this package are named as follows (see the help page of variofit): tausq is the nugget, sigmasq the partial sill, phi the range parameter, and kappa an additional smoothness parameter that is fixed for most models.
Output delivers the “practical range”, which is what we called “effective range” in the lecture.
The help page of variofit also links to the help page of cov.spatial, which includes a comprehensive list of the available variogram models.
As an input to the model, we select the variogram with the modulus estimator.
Now is a good time to think about the weights. By default, weights are assigned according to the number of pairs in each bin (see the help page), and we go with that. Try playing around with different weight options and check the result; it can affect the fit at the small lags (a sketch is given under the ideas for playing around at the end). It is desirable that the model fits the small lags well, since those will be most important for the subsequent interpolation.
We are required to provide initial estimates of the model parameters \(\sigma^2\) (partial sill) and \(\phi\) (range). You can give a single value or several candidates. Estimate them by eyeballing the experimental variogram: the partial sill is roughly 0.22 and the range roughly 6 m.
ini.par = as.matrix(cbind(c(0.18,0.2,0.22,0.24), c(2,4,6,8)))
vario.linear <- variofit(gtf.variog.mod, ini.cov.pars = ini.par, cov.model="linear")
## variofit: covariance model used is linear
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "0.24" "2" "0" "0.5"
## status "est" "est" "est" "fix"
## loss value: 223.872943496328
(vario.linear)
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: linear
## parameter estimates:
## tausq sigmasq phi
## 0.1688 0.0013 1.0000
## Practical Range with cor=0.05 for asymptotic range: Inf
##
## variofit: minimised weighted sum of squares = 22.997
#plot(gtf.variog.mat)
#lines(vario.linear)
This is not a good model. This becomes obvious from comparing the estimated nugget, sill and range with the values expected from the experimental variogram: the fitted nugget (about 0.17) takes up nearly all of the eyeballed sill of about 0.22, and the range does not compare well either. Also, we make a mental note of the minimised weighted sum of squares. This number should be as small as possible: the smaller the minimised weighted sum of squares, the better the fit.
ini.par = as.matrix(cbind(c(0.18,0.2,0.22,0.24), c(2,4,6,8)))
vario.sph <- variofit(gtf.variog.mod, ini.cov.pars = ini.par, cov.model="sph")
## variofit: covariance model used is spherical
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "0.2" "8" "0" "0.5"
## status "est" "est" "est" "fix"
## loss value: 14.9544555066347
(vario.sph)
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: spherical
## parameter estimates:
## tausq sigmasq phi
## 0.0000 0.2074 7.9635
## Practical Range with cor=0.05 for asymptotic range: 7.963519
##
## variofit: minimised weighted sum of squares = 13.9611
We note the more reasonable sill and range. The fit is also much better, according to the minimised weighted sum of squares.
Here we have to be careful with the initial values. The effective range (which we are eyeballing) does not correspond to the actual range parameter, here called \(\phi\). In the lecture we stated that \(a_e \approx 3a\). This means the eyeballed correlation lengths should be divided by 3.
ini.par = as.matrix(cbind(c(0.18,0.2,0.22,0.24), c(2,4,6,8)/3))
vario.exp <- variofit(gtf.variog.mod, ini.cov.pars = ini.par, cov.model="exp")
## variofit: covariance model used is exponential
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "0.2" "2.67" "0" "0.5"
## status "est" "est" "est" "fix"
## loss value: 16.2456500545636
(vario.exp)
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: exponential
## parameter estimates:
## tausq sigmasq phi
## 0.0000 0.2089 3.3544
## Practical Range with cor=0.05 for asymptotic range: 10.04889
##
## variofit: minimised weighted sum of squares = 14.9518
Note: Here we need to check the practical range: about 10.0 m (\(\approx 3\phi\), with \(\phi = 3.35\)), which is somewhat larger than for the spherical model.
Apart from that, the parameters are reasonable, but the minimised weighted sum of squares is a little worse than for the spherical model.
First off: I am explicitly not including a Gaussian model as a potential candidate, for reasons mentioned in the lecture.
According to the weighted sums of squares, the spherical model gives the best fit, followed closely by the exponential and, much further behind, the linear.
Let's compare the models visually:
plot(gtf.variog.mat)
lines(vario.linear)
lines(vario.sph, lty=3)
lines(vario.exp, lty=4)
Obviously the linear model is not a good fit, so we will not consider it further.
The exponential and spherical both “look” OK, but the spherical has the better fit. Its correlation length is smaller, and it also fits better at the short lags. Unless we have the option to cross-validate, we would for those reasons choose the spherical. If cross-validation is possible, we could also go ahead with both and select the one yielding the better kriging result.
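A minimal sketch of such a cross-validation with geoR's xvalid() (the object names xv.sph and xv.exp are illustrative; I am assuming the default leave-one-out settings are appropriate here and that the first-order trend is carried through, so check the xvalid help page before relying on it):
xv.sph <- xvalid(gtf, model = vario.sph) # leave-one-out cross-validation with the spherical model
xv.exp <- xvalid(gtf, model = vario.exp) # same for the exponential model
mean(xv.sph$error^2) # compare mean squared prediction errors
mean(xv.exp$error^2)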
Some ideas for playing around:
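One idea, already mentioned above in the discussion of the weights: refit the best model with a different weighting scheme and compare. A minimal sketch, assuming the objects from above are still in the workspace ("cressie" and "equal" are the alternative weight options listed on the variofit help page):
ini.par <- cbind(c(0.18,0.2,0.22,0.24), c(2,4,6,8)) # same eyeballed starting values as for the spherical fit
vario.sph.cressie <- variofit(gtf.variog.mod, ini.cov.pars = ini.par, cov.model="sph", weights = "cressie")
plot(gtf.variog.mod)
lines(vario.sph) # fit with the default "npairs" weights, from above
lines(vario.sph.cressie, lty=2) # refit with Cressie weights, for comparison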
R packages are open source. There are plenty of them, so go out and explore. If you use a package for your work, please cite it; this is good practice.
Here is how to find the citation information for a package:
citation(package = "geoR")
##
## To cite package 'geoR' in publications use:
##
## Paulo J. Ribeiro Jr, Peter J. Diggle, Martin Schlather, Roger Bivand
## and Brian Ripley (2020). geoR: Analysis of Geostatistical Data. R
## package version 1.8-1. https://CRAN.R-project.org/package=geoR
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {geoR: Analysis of Geostatistical Data},
## author = {Paulo J. {Ribeiro Jr} and Peter J. Diggle and Martin Schlather and Roger Bivand and Brian Ripley},
## year = {2020},
## note = {R package version 1.8-1},
## url = {https://CRAN.R-project.org/package=geoR},
## }
##
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.