var(1:6)
[1] 3.5
sum((1:6 - mean(1:6)) ^ 2) / (6 - 1)
[1] 3.5
sd(1:6) # = sqrt(var(1:6))
[1] 1.870829
The variance of a population (a set of values) \(X\) is defined as
\[ Var(X) = \frac{1}{n} \sum_{i=1}^{n} (x_{i}-\mu)^2 \]
where \(\mu\) is the average value of \(X\). In turn, the standard deviation is the square root of the variance.
But usually we want the variance of real-world sample data, which by definition is a subset of of all possible observations that could be made. For technical reasons, to eliminate bias arising from our limited sample the variance is calculated as
\[ Var(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i}-\mu)^2 \]
Such sample variance can be calculated by R’s var
function:
var(1:6)
[1] 3.5
sum((1:6 - mean(1:6)) ^ 2) / (6 - 1)
[1] 3.5
sd(1:6) # = sqrt(var(1:6))
[1] 1.870829
In spatial statistics, the theoretical variogram is a function describing the degree of spatial dependence of a spatial random field or stochastic process.
With real-world data we generally calculate the empirical semivariogram \(\hat{\gamma}\), defined as half the average squared difference between the values separated at distance \(h\):
\[ \hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{(i,j) \in{N(h)}}^{} |z_{i}-z_{j}|^2 \]
In other words, each pair of points separated by \(h\) form the set of points \(N(h)\); the number of these points in this bin is \(|N(h)|\). Then for each pair of points \(i,j\) the square of the difference in the observation is calculated: \(|z_{i}-z_{j}|^2\). These squared differences are added together and normalized by the number of point pairs, and divided by 2 for the semivariogram value at this “lag” (distance of separation).
In practice, we group the lags into bins (groups of numerically similar separation distances). For more details, see Wikipedia.
The following parameters are often used to describe variograms:
sill: The limit (asymptote) of the semivariogram at the lag goes to infinity.
range: The distance at which the sill is attained, or difference of the semivariogram from the sill value becomes negligible.
By Asim Biswas & Bing Cheng Si - https://www.intechopen.com/chapters/39857, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=116105041
Consider the meuse
dataset that is included in the sp
package:
library(sp)
data(meuse)
head(meuse[1:6]) # the first six columns are the most relevant for us
x y cadmium copper lead zinc
1 181072 333611 11.7 85 299 1022
2 181025 333558 8.6 81 277 1141
3 181165 333537 6.5 68 199 640
4 181298 333484 2.6 81 116 257
5 181307 333330 2.8 48 117 269
6 181390 333260 3.0 61 137 281
This dataset gives locations and topsoil heavy metal concentrations (see ?meuse
) and has 155 rows and 14 columns.
Let’s say we’re interested in the spatial distribution of zinc concentrations:
library(ggplot2)
ggplot(meuse, aes(x, y, color = zinc)) + geom_point()
Note also that the zinc data are not normally distributed.
ggplot(meuse, aes(zinc)) + geom_histogram(bins = 30)
Many statistical approaches assume data normality, however, and so we likely want to correct for this by modeling the logarithm of the data, which is (relatively) normal:
ggplot(meuse, aes(log(zinc))) + geom_histogram(bins = 30)
What is the spatial dependence of the meuse
zinc data? In other words, how much do we expect two random samples to vary in their zinc concentrations as a function of their location?
Intuitively, and by looking at the spatial plot above, it seems like samples taken far apart will vary more than samples taken close to each other. We’d like to quantify this.
Like most real-world datasets, meuse
does not have data at every possible sampling point, and so we first compute the variance between observations at different distances from each other, and then fit an empirical semivariogram.
library(gstat)
# gstat::variogram needs to know which columns have the
# location information. Because we're using a formula when
# calling it, below, we first set the "coordinates" of the
# meuse object; that is, we define which columns contain
# this location information
coordinates(meuse) = ~x+y
# Calculate the sample variogram
<- variogram(log(zinc) ~ 1, data = meuse)
v print(v)
np dist gamma dir.hor dir.ver id
1 57 79.29244 0.1234479 0 0 var1
2 299 163.97367 0.2162185 0 0 var1
3 419 267.36483 0.3027859 0 0 var1
4 457 372.73542 0.4121448 0 0 var1
5 547 478.47670 0.4634128 0 0 var1
6 533 585.34058 0.5646933 0 0 var1
7 574 693.14526 0.5689683 0 0 var1
8 564 796.18365 0.6186769 0 0 var1
9 589 903.14650 0.6471479 0 0 var1
10 543 1011.29177 0.6915705 0 0 var1
11 500 1117.86235 0.7033984 0 0 var1
12 477 1221.32810 0.6038770 0 0 var1
13 452 1329.16407 0.6517158 0 0 var1
14 457 1437.25620 0.5665318 0 0 var1
15 415 1543.20248 0.5748227 0 0 var1
What variogram()
returns is a data frame with np
, the number of points at each separation distance (or lag) dist
, and the variogram value gamma
at that lag.
plot(v)
It looks like these data have a nugget of ~0.1, a sill of ~0.6, and a range of ~1000. But to get precise values for these parameters, we need to fit a model to our empirical data.
Different theoretical semivariogram models make different assumptions and have different fitted forms. We can visualize some of these:
show.vgms(models = c("Exp", # Exponential model
"Sph", # Spherical model
"Gau", # Gaussian model
"Nug"), # Nugget model
nugget = 0.1)
Note the “Nug” (nugget) option, which models no spatial dependency at all.
Based on our knowledge of the physical processes at play, and/or examining the empirical data, we choose a model to fit:
# fit.variogram() fits a theoretical variogram model,
# generated by vgm(), to our empirical variogram data
<- fit.variogram(v, vgm("Sph"))
v.fit
print(v.fit)
model psill range
1 Nug 0.05065971 0.0000
2 Sph 0.59060511 897.0011
# Did the model-fitting process converge successfully?
# This is not guaranteed for nonlinear models.
attributes(v.fit)$singular
[1] FALSE
We see that the model-fitting process was successful, producing estimates of 0.051 for the nugget, 0.591 for the sill, and 897 for the range (the distance at which the observations are no longer spatially correlated).
Plot this fitted model against our observational data:
# For convenience, use show.vgms() to calculate the
# predicted values for our fitted model
<-show.vgms(min = min(v$dist),
fitted max = max(v$dist),
models ="Sph",
range = v.fit$range[2],
sill = v.fit$psill[2],
nugget = v.fit$psill[1],
plot = FALSE)
# Now we can easily plot this against the observed data
ggplot(v, aes(dist, gamma)) +
geom_point() +
geom_line(data = fitted,
aes(distance, semivariance),
linetype = 2)
The simple example above assumed our data were isotropic: the spatial dependency was the same regardless of direction. But in many cases this will not be true — the data will be anisotopic and so non-uniform depending on direction — and this needs to be accounted for.
Similarly, the example above doesn’t account for any other factors, but often we will first fit a model (for example, of how soil respiration is related to temperature and moisture) and then do a spatial analysis of the model residuals. The nlme
package is useful for this.
All code for this presentation is available at https://github.com/bpbond/basicstats.