In geostatistics, the variogram and the covariogram are two fundamental tools for measuring spatial dependence.
In theoretical statistics (using the true population mean \(\mu\)), they are strict mathematical inverses of each other. However, in applied data analysis (using the sample mean \(\bar{Y}\) and finite datasets), this strict inverse relationship breaks down.
This document explains exactly why this happens, both algebraically and numerically.
Let \(Y(s)\) be a stationary spatial process.
The theoretical variogram is defined as: \[ \gamma(t) = \frac{1}{2} E\left[(Y(s_i) - Y(s_j))^2\right] \]
The theoretical covariogram is defined as: \[ C(t) = E\left[(Y(s_i) - \mu)(Y(s_j) - \mu)\right] \] where \(\mu\) is the true population mean.
If we expand the variogram:
\[ \begin{aligned} \gamma(t) &= \frac{1}{2} E\left[(Y(s_i) - \mu - (Y(s_j) - \mu))^2\right] \\ &= \frac{1}{2} E\left[(Y(s_i) - \mu)^2 + (Y(s_j) - \mu)^2 - 2(Y(s_i) - \mu)(Y(s_j) - \mu)\right] \\ &= \frac{1}{2} \left( C(0) + C(0) - 2C(t) \right) \\ &= C(0) - C(t) \end{aligned} \]
Rearranging gives the theoretical inverse: \[ \boxed{C(t) = C(0) - \gamma(t)} \]
In theory, if you know the variogram and the total variance \(C(0)\), you can perfectly recover the covariogram. They are mirror images of each other.
When working with real data, we do not know \(\mu\). We must estimate it using the sample mean \(\bar{Y}\).
Our empirical variogram is calculated as:
\[ \hat{\gamma}(t_k) = \frac{1}{2N_k} \sum_{(s_i,s_j) \in N(t_k)} (Y(s_i) - Y(s_j))^2 \]
Our empirical covariogram is calculated as:
\[ \hat{C}(t_k) = \frac{1}{N_k} \sum_{(s_i,s_j) \in N(t_k)} (Y(s_i) - \bar{Y})(Y(s_j) - \bar{Y}) \]
Where: - \(t_k\) is the lag distance for bin \(k\) - \(N_k\) is the number of pairs in bin \(k\) - \(\bar{Y}\) is the sample mean
If we expand the empirical variogram using \(\bar{Y}\):
\[ \begin{aligned} \hat{\gamma}(t_k) &= \frac{1}{2N_k} \sum \left( [Y(s_i) - \bar{Y}] - [Y(s_j) - \bar{Y}] \right)^2 \\ &= \frac{1}{2N_k} \sum \left( (Y(s_i) - \bar{Y})^2 + (Y(s_j) - \bar{Y})^2 - 2(Y(s_i) - \bar{Y})(Y(s_j) - \bar{Y}) \right) \\ &= \hat{C}_{biased}(0) - \hat{C}(t_k) \end{aligned} \]
Notice that this equation strictly uses \(\hat{C}_{biased}(0)\), which is the sample variance divided by \(N\) (not \(N-1\)):
\[ \hat{C}_{biased}(0) = \frac{1}{N} \sum_{i=1}^N (Y(s_i) - \bar{Y})^2 \]
To get the empirical inverse, you would need to compute:
\[ \hat{C}(t_k) = \hat{C}(0) - \hat{\gamma}(t_k) \]
But how do you obtain \(\hat{C}(0)\)?
| Method | Formula | Why it breaks the strict inverse |
|---|---|---|
| 1. Biased Variance | \(\frac{1}{N}\sum (Y-\bar{Y})^2\) | Mathematically works perfectly! But this is a biased estimate of the true variance (it is too small on average). Statisticians avoid using it. |
| 2. Unbiased Variance | \(\frac{1}{N-1}\sum (Y-\bar{Y})^2\) | This is the standard, “correct” estimate. However, if you plug this into \(\hat{C}(0) - \hat{\gamma}(t_k)\), the subtraction will not equal your actual \(\hat{C}(t_k)\). |
| 3. Model Sill | Fit a theoretical model (e.g., Spherical) to \(\hat{\gamma}(t_k)\) and use the model’s Sill. | This is common practice in geostatistics. However, the model’s Sill is a smoothed approximation. Subtracting raw empirical \(\hat{\gamma}(t_k)\) from this modeled Sill will not perfectly match the raw \(\hat{C}(t_k)\). |
In real-world data, the variogram almost never goes to zero at lag \(t=0\). It jumps to a value called the nugget (\(c_0\)) due to:
Measurement error
Micro-scale variability (variation occurring at distances smaller than the minimum sampling interval)
The theoretical inverse says: \(C(0) = \text{Sill}\) (total variance).
The empirical variogram says: \(\hat{\gamma}(0) = \text{Nugget}\).
If you try to force \(\hat{C}(0) = \hat{C}(0) - \hat{\gamma}(0)\), it implies:
\[ \text{Nugget} = 0 \]
Unless your data has absolutely zero measurement error (which never happens in reality), the empirical inverse is mathematically impossible at \(t=0\) and won’t perfectly hold for other lags either. The nugget affects the entire shape of the variogram.
Let’s simulate a spatial dataset to see this breakdown visually.
This approach directly creates spatially correlated data without
using krige().
# Load necessary libraries
library(gstat)
library(sp)
library(ggplot2)
library(reshape2)
# Set seed for reproducibility
set.seed(123)
# 1. Create a 10x10 grid
grd <- expand.grid(x = 1:10, y = 1:10)
# 2. Calculate distance matrix between all points
dist_mat <- as.matrix(dist(grd))
# 3. Create covariance matrix using an exponential model
# C(h) = sill * exp(-h/range)
sill <- 10
range_val <- 5
nugget <- 2
# Calculate covariance matrix
C_mat <- sill * exp(-dist_mat / range_val)
# Add nugget to the diagonal (measurement error)
diag(C_mat) <- diag(C_mat) + nugget
# 4. Generate spatially correlated data using Cholesky decomposition
L <- chol(C_mat)
z <- L %*% rnorm(nrow(grd))
# 5. Create spatial points data frame
sim_data <- grd
sim_data$z <- as.vector(z)
# Convert to SpatialPointsDataFrame for gstat functions
coordinates(sim_data) <- ~ x + y
# 6. Check the first few values
head(sim_data)
## coordinates z
## 1 (1, 1) 5.602239
## 2 (2, 1) 4.796642
## 3 (3, 1) 7.041265
## 4 (4, 1) 3.432879
## 5 (5, 1) 3.745613
## 6 (6, 1) 5.078993
# 7. Calculate the Empirical Variogram
v_emp <- variogram(z ~ 1, data = sim_data, cutoff = 8)
# 8. Calculate the Empirical Covariogram manually
# Since variogram(..., covariogram = TRUE) can be tricky, we'll compute it directly
# Get the coordinates and values
coords <- coordinates(sim_data)
z_vals <- sim_data$z
mean_z <- mean(z_vals)
# Function to compute covariogram for a given lag
compute_covariogram <- function(coords, z_vals, lag_tolerance = 0.5) {
n <- nrow(coords)
lags <- unique(v_emp$dist) # Use same lags as variogram
results <- data.frame(lag = lags, cov = NA, np = NA)
for (i in 1:length(lags)) {
lag_val <- lags[i]
pairs <- c()
cov_sum <- 0
# Find all pairs within the lag tolerance
for (j in 1:(n-1)) {
for (k in (j+1):n) {
dist_ij <- sqrt(sum((coords[j,] - coords[k,])^2))
if (abs(dist_ij - lag_val) <= lag_tolerance) {
cov_sum <- cov_sum + (z_vals[j] - mean_z) * (z_vals[k] - mean_z)
pairs <- c(pairs, j)
}
}
}
results$np[i] <- length(pairs)
if (length(pairs) > 0) {
results$cov[i] <- cov_sum / length(pairs)
} else {
results$cov[i] <- NA
}
}
return(results)
}
# Compute covariogram using the same lags as the variogram
c_emp <- compute_covariogram(coords, z_vals)
# 9. Estimate C(0) using three different methods
N <- length(sim_data$z)
C0_biased <- sum((sim_data$z - mean_z)^2) / N
C0_unbiased <- sum((sim_data$z - mean_z)^2) / (N - 1)
C0_model <- sill + nugget # Total variance from our model = 12
# 10. Display the different C(0) estimates
data.frame(
Method = c("Biased", "Unbiased", "Model"),
C0 = c(C0_biased, C0_unbiased, C0_model)
)
## Method C0
## 1 Biased 4.930681
## 2 Unbiased 4.980486
## 3 Model 12.000000
# 11. Merge the data into a single dataframe for plotting
# Make sure both data frames have the same length
v_emp_subset <- v_emp[v_emp$dist %in% c_emp$lag, ]
df_plot <- data.frame(
lag = v_emp_subset$dist,
gamma = v_emp_subset$gamma,
C_empirical = c_emp$cov,
C_inverse_biased = C0_biased - v_emp_subset$gamma,
C_inverse_unbiased = C0_unbiased - v_emp_subset$gamma,
C_inverse_model = C0_model - v_emp_subset$gamma
)
# Remove any rows with NA values
df_plot <- na.omit(df_plot)
# 12. Print the first few rows to see the numerical differences
knitr::kable(head(df_plot, 8), digits = 3,
caption = "Comparison of True Covariogram vs. Inverses at Different Lags")
| lag | gamma | C_empirical | C_inverse_biased | C_inverse_unbiased | C_inverse_model |
|---|---|---|---|---|---|
| 1.000 | 3.264 | 1.197 | 1.667 | 1.717 | 8.736 |
| 1.414 | 3.672 | 1.197 | 1.258 | 1.308 | 8.328 |
| 2.000 | 4.593 | 0.026 | 0.338 | 0.387 | 7.407 |
| 2.236 | 4.744 | 0.026 | 0.186 | 0.236 | 7.256 |
| 3.036 | 4.724 | -0.002 | 0.206 | 0.256 | 7.276 |
| 3.606 | 4.571 | -0.047 | 0.359 | 0.409 | 7.429 |
| 4.116 | 5.007 | -0.214 | -0.077 | -0.027 | 6.993 |
| 4.472 | 4.916 | -0.214 | 0.015 | 0.065 | 7.084 |
# Reshape data for easier plotting
df_long <- melt(df_plot, id.vars = "lag",
measure.vars = c("C_empirical", "C_inverse_biased",
"C_inverse_unbiased", "C_inverse_model"),
variable.name = "Method", value.name = "Covariance")
# Create the plot
ggplot(df_long, aes(x = lag, y = Covariance, color = Method, linetype = Method)) +
geom_line(size = 1.2) +
geom_point(size = 2.5) +
labs(title = "Empirical Covariogram vs. Inverses Derived from the Variogram",
subtitle = "Only the biased C(0) perfectly recovers the empirical covariogram.",
x = "Lag Distance (t_k)",
y = "Covariance C(t_k)",
caption = "Black: True empirical covariogram\nBlue: Inverse using biased C(0)\nRed: Inverse using unbiased C(0)\nGreen: Inverse using modeled C(0)") +
theme_minimal(base_size = 14) +
scale_color_manual(values = c("C_empirical" = "black",
"C_inverse_biased" = "blue",
"C_inverse_unbiased" = "red",
"C_inverse_model" = "darkgreen")) +
theme(legend.position = "bottom",
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))
| Line Color | Method | Behavior | Match? |
|---|---|---|---|
| Black | Empirical Covariogram (True \(\hat{C}(t_k)\)) | This is the gold standard calculated directly from data. | N/A |
| Blue | Inverse using Biased \(C(0)\) | Perfectly overlaps the black line! | ✅ Yes |
| Red | Inverse using Unbiased \(C(0)\) | Systematically shifted upward from the black line. | ❌ No |
| Green | Inverse using Model Sill | Smooth approximation; deviates from the noisy empirical black line. | ❌ No |
The biased inverse works mathematically because it uses exactly the same \(C(0)\) that falls out of the variogram expansion. However, it uses a statistically flawed estimate of the variance.
The unbiased inverse fails because it uses a different \(C(0)\) value. This is the standard “correct” variance estimate, but it breaks the arithmetic relationship.
The model inverse fails because it uses a smoothed theoretical Sill rather than the raw empirical variance. This is the most common practice in geostatistics, but it introduces discrepancies.
In real-world geostatistical practice:
Choose ONE function to model empirically (almost always the variogram).
Fit a theoretical model (Spherical, Exponential, Gaussian, etc.) to your empirical variogram.
Derive the covariogram from your fitted model using: \[ C(t) = \text{Sill} - \gamma_{model}(t) \]
Do NOT try to force a direct arithmetic inverse using raw empirical values and an arbitrarily chosen \(C(0)\).
| Context | Relationship | Status |
|---|---|---|
| Theoretical World (using \(\mu\)) | \(C(t) = C(0) - \gamma(t)\) | Strict Inverse (Perfect) |
| Empirical World (using \(\bar{Y}\) and biased \(C(0)\)) | \(\hat{C}(t_k) = \hat{C}_{biased}(0) - \hat{\gamma}(t_k)\) | Strict Inverse (But uses a statistically flawed \(C(0)\)) |
| Empirical World (using unbiased or modeled \(C(0)\)) | \(\hat{C}(t_k) \neq \hat{C}(0) - \hat{\gamma}(t_k)\) | Not an Inverse (The standard practice in real life) |
The takeaway: In practice, you should choose one function (usually the variogram) to model empirically. If you need the other, derive it from a fitted theoretical model (like Spherical or Exponential) rather than trying to force a direct arithmetic inverse with raw data.