1 Introduction.

Spatial interpolation is undertaken to estimate spatial variation in risk of continuous variables. Kriging is among the most common methods used in spatial interpolation of risk of continuous spatial data variables in epidemiology. The present document is an effort to provide introduction on kriging using geoR package in R.

Loading data.

Parana data set is an inbuilt geodata data set in geoR library. The data refers to average rainfall over different years for the period May-June (dry-season). It was collected at 143 recording stations throughout Parana State, Brasil. We will be using the inbuilt data set for illustrative purposes.

It is important to note that if data set is available in formats other than “geodata”, the same needs to be converted into geodata format for carrying out analysis using geoR library. For the same, as.geodata function can be used by specifying longitude, latitude and attribute arguments in the function. as.geodata is the default method which converts a matrix or a data-frame to an object of the class “geodata”. Objects of the class “geodata” are lists with two obligatory components: coords and data. Optional components are allowed and a typical example is a vector or matrix with co-variate(s) values.

suppressMessages(library(geoR))

2 Understanding data.

2.1 Determination of maximum distance for variogram estimation.

Summary measures.

summary(parana)
## Number of data points: 143 
## 
## Coordinates summary
##         east    north
## min 150.1220  70.3600
## max 768.5087 461.9681
## 
## Distance summary
##      min      max 
##   1.0000 619.4925 
## 
## Borders summary
##         east    north
## min 137.9873  46.7695
## max 798.6256 507.9295
## 
## Data summary
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 162.7700 234.1900 269.9200 274.4106 318.2300 413.7000 
## 
## Other elements in the geodata object
## [1] "loci.paper"

As a thumb rule, the maximum distance for the variogram is taken as half of the maximum distance between two data points in the data set. Therefore, the maximum distance for the variogram for the present data should be 309.74625.

2.2 Estimating spatial trend and normality of data set.

Is it important to look for trends in the data set before kriging. The plot.geodata function produces a 2*2 display with the following plots:
The first indicates the spatial locations assign different colors to data in different quartiles.
The next two shows data against the X and Y coordinates.
The last is an histogram of the data values or optionally, a 3-D plot with spatial locations and associated data values.

Looking at second and third images of fig 2.1 , there seems to be a linear trend in the data values and Y coord as well as X Coord respectively. Also, the distribution of the data in the histograms does not appear normal in fourth image of fig 2.1.

After adjusting for first order trends, as seen in third and fourth image of fig 2.2, there seems to be trend present between residuals and X coord, and the distribution of residuals doesn’t seem to be normal respectively.

After adjusting for second order trends, as seen in fig 2.3, though some clustering of residuals still seems to be present, the linear trend is no longer obvious and distribution also is near normal. Hence, we should take further analysis with adjustment of second order trend. However, for illustration we shall plot with and without trend variograms subsequently.

plot(parana)
Visualisation of data without adjusting trends.

Figure 2.1: Visualisation of data without adjusting trends.

plot(parana, trend = "1st")
Visualisation of data without adjusting trends.

Figure 2.2: Visualisation of data without adjusting trends.

plot(parana, trend = "2nd")
Visualisation of data without adjusting trends.

Figure 2.3: Visualisation of data without adjusting trends.

3 Fitting a variogram model to the data set.

3.1 Creating and visualizing variogram.

The variog function in geoR library computes sample (empirical) variograms with options for the classical or robust estimators. Output can be returned as a binned variogram, a variogram cloud or a smoothed variogram. Data transformation (Box-Cox) is allowed. “Trends” can be specified and are fitted by ordinary least squares in which case the variograms are computed using the residuals.

The plot.variogram function in geoR library is used to visualize the variogram. plot.variog4 plot directional variograms computed by the function variog4. The omnidirectional variogram can be also included in the plot.

Fig 3.1 plots a cloud variogram.
Fig 3.2 plots a binned variogram.
Fig 3.3 plots a binned variogram wherein the bin details have been specified manually. It is important to note that when bin values are specified manually, for better visualization or otherwise, it should be ensured that the minimum number of pairs in each bin should not be less than 30.
Fig 3.4 and Fig 3.5 plots variograms with first and second order effects respectively.

plot(variog(parana, option = "cloud", max.dist = 309.75))
## variog: computing omnidirectional variogram
Variogram cloud.

Figure 3.1: Variogram cloud.

plot(variog(parana, max.dist = 309.75))
## variog: computing omnidirectional variogram
Binned variogram.

Figure 3.2: Binned variogram.

vario_manual <- variog(parana,
                       max.dist=309.75,
                       uvec=seq(0.01,309.75,30)) 
## variog: computing omnidirectional variogram
print(paste("minimum pairs in a single bin were", min(vario_manual$n)))
## [1] "minimum pairs in a single bin were 47"
plot(vario_manual)
Binned variogram with bins defined manually

Figure 3.3: Binned variogram with bins defined manually

plot(variog(parana, trend = "1st", max.dist = 309.75))
## variog: computing omnidirectional variogram
Variogram with first order trend.

Figure 3.4: Variogram with first order trend.

vario_model <- variog(parana, trend = "2nd", max.dist = 309.75)
## variog: computing omnidirectional variogram
plot(vario_model)
Variogram with second order trend.

Figure 3.5: Variogram with second order trend.

3.2 Determination of directional effects/ Anisotropy.

Fig 3.6 plots directional variograms to look for Anisotropy or directional effects. This is important to determine because of the assumptions of stationarity and isotropy while performing krigering.

plot(variog4(parana, trend = "2nd", max.dist = 309.75), omnidirectional = T)
## variog: computing variogram for direction = 0 degrees (0 radians)
##         tolerance angle = 22.5 degrees (0.393 radians)
## variog: computing variogram for direction = 45 degrees (0.785 radians)
##         tolerance angle = 22.5 degrees (0.393 radians)
## variog: computing variogram for direction = 90 degrees (1.571 radians)
##         tolerance angle = 22.5 degrees (0.393 radians)
## variog: computing variogram for direction = 135 degrees (2.356 radians)
##         tolerance angle = 22.5 degrees (0.393 radians)
## variog: computing omnidirectional variogram
Directional Variograms alongwith omnidirectional variogram.

Figure 3.6: Directional Variograms alongwith omnidirectional variogram.

3.3 Fitting covariance models on a variogram.

The variofit function estimate covariance parameters by fitting a parametric model to an empirical variogram. Variogram models can be fitted by using weighted or ordinary least squares.

Since using plot.geodata function, we found that the spatial trend and distribution was best after adjusting for second order trend (Refer section 2.2), we will use the variogram as shown in Fig 3.5.

Further, to determine the best fit model, multiple covariance models shall be fitted.

exp_fit_fix <- variofit(vario_model, cov.model = "exp", fix.nugget = T)
## variofit: covariance model used is exponential 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim
## Warning in variofit(vario_model, cov.model = "exp", fix.nugget = T): initial
## values not provided - running the default search
## variofit: searching for best initial value ... selected values:
##               sigmasq  phi     tausq kappa
## initial.value "750.12" "47.65" "0"   "0.5"
## status        "est"    "est"   "fix" "fix"
## loss value: 54425015.7671772
exp_fit_nofix <- variofit(vario_model, cov.model = "exp", fix.nugget = F)
## variofit: covariance model used is exponential 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim
## Warning in variofit(vario_model, cov.model = "exp", fix.nugget = F): initial
## values not provided - running the default search
## variofit: searching for best initial value ... selected values:
##               sigmasq  phi    tausq    kappa
## initial.value "375.06" "95.3" "375.06" "0.5"
## status        "est"    "est"  "est"    "fix"
## loss value: 17587991.976428
sph_fit_fix <- variofit(vario_model, cov.model = "sph", fix.nugget = T)
## variofit: covariance model used is spherical 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim
## Warning in variofit(vario_model, cov.model = "sph", fix.nugget = T): initial
## values not provided - running the default search
## variofit: searching for best initial value ... selected values:
##               sigmasq  phi   tausq kappa
## initial.value "750.12" "0"   "0"   "0.5"
## status        "est"    "est" "fix" "fix"
## loss value: 73732944.7195672
sph_fit_nofix <- variofit(vario_model, cov.model = "sph", fix.nugget = F)
## variofit: covariance model used is spherical 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim
## Warning in variofit(vario_model, cov.model = "sph", fix.nugget = F): initial
## values not provided - running the default search
## variofit: searching for best initial value ... selected values:
##               sigmasq  phi   tausq   kappa
## initial.value "562.59" "0"   "75.01" "0.5"
## status        "est"    "est" "est"   "fix"
## loss value: 24472921.5722333

3.4 Visualisation of model fit.

plot(vario_model,pch=16)
lines(exp_fit_fix,col="green",lwd=4, lty = 1)
lines(exp_fit_nofix,col="red",lwd=4, lty = 2)
lines(sph_fit_fix,col="blue",lwd=4, lty = 3) 
lines(sph_fit_nofix,col="black",lwd=4, lty = 4) 

3.5 Selecting best fit model.

The resulting model with least SSQ value is considered as best fit. Calculating SSQ for each model fit:

(exp_SSQ_fix <- summary(exp_fit_fix)$sum.of.squares)
##    value 
## 19917447
(exp_SSQ_nofix <- summary(exp_fit_nofix)$sum.of.squares)
##   value 
## 8156736
(sph_SSQ_fix <- summary(sph_fit_fix)$sum.of.squares)
##    value 
## 19917447
(sph_SSQ_nofix <- summary(sph_fit_nofix)$sum.of.squares)
##    value 
## 19917447
which.min(list(exp_SSQ_fix,exp_SSQ_nofix,
          sph_SSQ_fix, sph_SSQ_nofix))
## [1] 2

Hence, exponential model without fixed nugget should be considered as best fit model.

4 Creation of a prediction grid.

To perform kriging using the best fit model, a prediction grid is to be created. Hence creating a 100 by 100 grid.

prediction_grid <- expand.grid(seq(0, 800, length.out = 100), seq(0, 800, length.out = 100))
plot(prediction_grid)

5 Kriging.

krig.conv function performs spatial prediction for fixed covariance parameters using global neighbourhood. Options available implement the following types of kriging: SK (simple kriging), OK (ordinary kriging), KTE (external trend kriging) and UK (universal kriging).

krig_rain <- krige.conv(parana, loc=prediction_grid,
                       krige=krige.control(obj.model=exp_fit_nofix))
## krige.conv: results will be returned only for prediction locations inside the borders
## krige.conv: model with mean given by a 2nd order polynomial on the coordinates
## krige.conv: Kriging performed using global neighbourhood

6 Visualization.

image(krig_rain,col=heat.colors(64))

7 Examination of prediction errors.

7.1 Performing cross validation of results.

xvalid is function to perform model validation by comparing observed and values predicted by kriging.
Options include:
(i) leaving-one-out cross-validation where each data location is removed from the data set and the variable at this location is predicted using the remaining locations, for a given model. This can be computed for all or a subset of the data locations;
(ii) external validation can be performed by using validation locations other than data locations.

xv = xvalid(parana, model = exp_fit_nofix)
## xvalid: number of data locations       = 143
## xvalid: number of validation locations = 143
## xvalid: performing cross-validation at location ... 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 
## xvalid: end of cross-validation

7.2 Visualisation of cross validated results.

library(tidyverse)
## -- Attaching packages ----------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.1
## v tidyr   1.1.1     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
xv1 <- data.frame(xv$data, xv$predicted)
ggplot(data = xv1, aes(x = xv.data, y = xv.predicted))+
  geom_point()+
  geom_smooth(method = "lm", color = "red")+
  xlab("Observed rainfall")+
  ylab("Predicted rainfall")+
    labs(title = "Linear regression model for predicted rainfall")
## `geom_smooth()` using formula 'y ~ x'
Linear regression model visualisation.

Figure 7.1: Linear regression model visualisation.

Fig 7.1 represents line of best fit for predicted and observed values based on the following equation:- \[ \operatorname{xv.predicted} = \alpha + \beta_{1}(\operatorname{xv.data}) + \epsilon \]

7.3 Calculating mean squared and mean absolute errors.

mean(xv$error^2)
## [1] 614.1851
mean(abs(xv$error))
## [1] 19.21942

Therefore we may conclude that on an average, an error of 19.2194244inches of mean rainfall can be expected at a given location.


  1. Ph.D. Scholar, AMCHSS, SCTIMST

  2. Professor, AMCHSS, SCTIMST