Spatial interpolation is undertaken to estimate spatial variation in risk of continuous variables. Kriging is among the most common methods used in spatial interpolation of risk of continuous spatial data variables in epidemiology. The present document is an effort to provide introduction on kriging using geoR package in R.
Parana data set is an inbuilt geodata data set in geoR library. The data refers to average rainfall over different years for the period May-June (dry-season). It was collected at 143 recording stations throughout Parana State, Brasil. We will be using the inbuilt data set for illustrative purposes.
It is important to note that if data set is available in formats other than “geodata”, the same needs to be converted into geodata format for carrying out analysis using geoR library. For the same, as.geodata function can be used by specifying longitude, latitude and attribute arguments in the function. as.geodata is the default method which converts a matrix or a data-frame to an object of the class “geodata”. Objects of the class “geodata” are lists with two obligatory components: coords and data. Optional components are allowed and a typical example is a vector or matrix with co-variate(s) values.
suppressMessages(library(geoR))
Summary measures.
summary(parana)
## Number of data points: 143
##
## Coordinates summary
## east north
## min 150.1220 70.3600
## max 768.5087 461.9681
##
## Distance summary
## min max
## 1.0000 619.4925
##
## Borders summary
## east north
## min 137.9873 46.7695
## max 798.6256 507.9295
##
## Data summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 162.7700 234.1900 269.9200 274.4106 318.2300 413.7000
##
## Other elements in the geodata object
## [1] "loci.paper"
As a thumb rule, the maximum distance for the variogram is taken as half of the maximum distance between two data points in the data set. Therefore, the maximum distance for the variogram for the present data should be 309.74625.
Is it important to look for trends in the data set before kriging. The plot.geodata function produces a 2*2 display with the following plots:
The first indicates the spatial locations assign different colors to data in different quartiles.
The next two shows data against the X and Y coordinates.
The last is an histogram of the data values or optionally, a 3-D plot with spatial locations and associated data values.
Looking at second and third images of fig 2.1 , there seems to be a linear trend in the data values and Y coord as well as X Coord respectively. Also, the distribution of the data in the histograms does not appear normal in fourth image of fig 2.1.
After adjusting for first order trends, as seen in third and fourth image of fig 2.2, there seems to be trend present between residuals and X coord, and the distribution of residuals doesn’t seem to be normal respectively.
After adjusting for second order trends, as seen in fig 2.3, though some clustering of residuals still seems to be present, the linear trend is no longer obvious and distribution also is near normal. Hence, we should take further analysis with adjustment of second order trend. However, for illustration we shall plot with and without trend variograms subsequently.
plot(parana)
Figure 2.1: Visualisation of data without adjusting trends.
plot(parana, trend = "1st")
Figure 2.2: Visualisation of data without adjusting trends.
plot(parana, trend = "2nd")
Figure 2.3: Visualisation of data without adjusting trends.
The variog function in geoR library computes sample (empirical) variograms with options for the classical or robust estimators. Output can be returned as a binned variogram, a variogram cloud or a smoothed variogram. Data transformation (Box-Cox) is allowed. “Trends” can be specified and are fitted by ordinary least squares in which case the variograms are computed using the residuals.
The plot.variogram function in geoR library is used to visualize the variogram. plot.variog4 plot directional variograms computed by the function variog4. The omnidirectional variogram can be also included in the plot.
Fig 3.1 plots a cloud variogram.
Fig 3.2 plots a binned variogram.
Fig 3.3 plots a binned variogram wherein the bin details have been specified manually. It is important to note that when bin values are specified manually, for better visualization or otherwise, it should be ensured that the minimum number of pairs in each bin should not be less than 30.
Fig 3.4 and Fig 3.5 plots variograms with first and second order effects respectively.
plot(variog(parana, option = "cloud", max.dist = 309.75))
## variog: computing omnidirectional variogram
Figure 3.1: Variogram cloud.
plot(variog(parana, max.dist = 309.75))
## variog: computing omnidirectional variogram
Figure 3.2: Binned variogram.
vario_manual <- variog(parana,
max.dist=309.75,
uvec=seq(0.01,309.75,30))
## variog: computing omnidirectional variogram
print(paste("minimum pairs in a single bin were", min(vario_manual$n)))
## [1] "minimum pairs in a single bin were 47"
plot(vario_manual)
Figure 3.3: Binned variogram with bins defined manually
plot(variog(parana, trend = "1st", max.dist = 309.75))
## variog: computing omnidirectional variogram
Figure 3.4: Variogram with first order trend.
vario_model <- variog(parana, trend = "2nd", max.dist = 309.75)
## variog: computing omnidirectional variogram
plot(vario_model)
Figure 3.5: Variogram with second order trend.
Fig 3.6 plots directional variograms to look for Anisotropy or directional effects. This is important to determine because of the assumptions of stationarity and isotropy while performing krigering.
plot(variog4(parana, trend = "2nd", max.dist = 309.75), omnidirectional = T)
## variog: computing variogram for direction = 0 degrees (0 radians)
## tolerance angle = 22.5 degrees (0.393 radians)
## variog: computing variogram for direction = 45 degrees (0.785 radians)
## tolerance angle = 22.5 degrees (0.393 radians)
## variog: computing variogram for direction = 90 degrees (1.571 radians)
## tolerance angle = 22.5 degrees (0.393 radians)
## variog: computing variogram for direction = 135 degrees (2.356 radians)
## tolerance angle = 22.5 degrees (0.393 radians)
## variog: computing omnidirectional variogram
Figure 3.6: Directional Variograms alongwith omnidirectional variogram.
The variofit function estimate covariance parameters by fitting a parametric model to an empirical variogram. Variogram models can be fitted by using weighted or ordinary least squares.
Since using plot.geodata function, we found that the spatial trend and distribution was best after adjusting for second order trend (Refer section 2.2), we will use the variogram as shown in Fig 3.5.
Further, to determine the best fit model, multiple covariance models shall be fitted.
exp_fit_fix <- variofit(vario_model, cov.model = "exp", fix.nugget = T)
## variofit: covariance model used is exponential
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## Warning in variofit(vario_model, cov.model = "exp", fix.nugget = T): initial
## values not provided - running the default search
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "750.12" "47.65" "0" "0.5"
## status "est" "est" "fix" "fix"
## loss value: 54425015.7671772
exp_fit_nofix <- variofit(vario_model, cov.model = "exp", fix.nugget = F)
## variofit: covariance model used is exponential
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## Warning in variofit(vario_model, cov.model = "exp", fix.nugget = F): initial
## values not provided - running the default search
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "375.06" "95.3" "375.06" "0.5"
## status "est" "est" "est" "fix"
## loss value: 17587991.976428
sph_fit_fix <- variofit(vario_model, cov.model = "sph", fix.nugget = T)
## variofit: covariance model used is spherical
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## Warning in variofit(vario_model, cov.model = "sph", fix.nugget = T): initial
## values not provided - running the default search
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "750.12" "0" "0" "0.5"
## status "est" "est" "fix" "fix"
## loss value: 73732944.7195672
sph_fit_nofix <- variofit(vario_model, cov.model = "sph", fix.nugget = F)
## variofit: covariance model used is spherical
## variofit: weights used: npairs
## variofit: minimisation function used: optim
## Warning in variofit(vario_model, cov.model = "sph", fix.nugget = F): initial
## values not provided - running the default search
## variofit: searching for best initial value ... selected values:
## sigmasq phi tausq kappa
## initial.value "562.59" "0" "75.01" "0.5"
## status "est" "est" "est" "fix"
## loss value: 24472921.5722333
plot(vario_model,pch=16)
lines(exp_fit_fix,col="green",lwd=4, lty = 1)
lines(exp_fit_nofix,col="red",lwd=4, lty = 2)
lines(sph_fit_fix,col="blue",lwd=4, lty = 3)
lines(sph_fit_nofix,col="black",lwd=4, lty = 4)
The resulting model with least SSQ value is considered as best fit. Calculating SSQ for each model fit:
(exp_SSQ_fix <- summary(exp_fit_fix)$sum.of.squares)
## value
## 19917447
(exp_SSQ_nofix <- summary(exp_fit_nofix)$sum.of.squares)
## value
## 8156736
(sph_SSQ_fix <- summary(sph_fit_fix)$sum.of.squares)
## value
## 19917447
(sph_SSQ_nofix <- summary(sph_fit_nofix)$sum.of.squares)
## value
## 19917447
which.min(list(exp_SSQ_fix,exp_SSQ_nofix,
sph_SSQ_fix, sph_SSQ_nofix))
## [1] 2
Hence, exponential model without fixed nugget should be considered as best fit model.
To perform kriging using the best fit model, a prediction grid is to be created. Hence creating a 100 by 100 grid.
prediction_grid <- expand.grid(seq(0, 800, length.out = 100), seq(0, 800, length.out = 100))
plot(prediction_grid)
krig.conv function performs spatial prediction for fixed covariance parameters using global neighbourhood. Options available implement the following types of kriging: SK (simple kriging), OK (ordinary kriging), KTE (external trend kriging) and UK (universal kriging).
krig_rain <- krige.conv(parana, loc=prediction_grid,
krige=krige.control(obj.model=exp_fit_nofix))
## krige.conv: results will be returned only for prediction locations inside the borders
## krige.conv: model with mean given by a 2nd order polynomial on the coordinates
## krige.conv: Kriging performed using global neighbourhood
image(krig_rain,col=heat.colors(64))
xvalid is function to perform model validation by comparing observed and values predicted by kriging.
Options include:
(i) leaving-one-out cross-validation where each data location is removed from the data set and the variable at this location is predicted using the remaining locations, for a given model. This can be computed for all or a subset of the data locations;
(ii) external validation can be performed by using validation locations other than data locations.
xv = xvalid(parana, model = exp_fit_nofix)
## xvalid: number of data locations = 143
## xvalid: number of validation locations = 143
## xvalid: performing cross-validation at location ... 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
## xvalid: end of cross-validation
library(tidyverse)
## -- Attaching packages ----------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
xv1 <- data.frame(xv$data, xv$predicted)
ggplot(data = xv1, aes(x = xv.data, y = xv.predicted))+
geom_point()+
geom_smooth(method = "lm", color = "red")+
xlab("Observed rainfall")+
ylab("Predicted rainfall")+
labs(title = "Linear regression model for predicted rainfall")
## `geom_smooth()` using formula 'y ~ x'
Figure 7.1: Linear regression model visualisation.
Fig 7.1 represents line of best fit for predicted and observed values based on the following equation:- \[ \operatorname{xv.predicted} = \alpha + \beta_{1}(\operatorname{xv.data}) + \epsilon \]
mean(xv$error^2)
## [1] 614.1851
mean(abs(xv$error))
## [1] 19.21942
Therefore we may conclude that on an average, an error of 19.2194244inches of mean rainfall can be expected at a given location.