Introduction

This document provides a comprehensive explanation of variogram analysis using the geoR package in R. We cover both the empirical variogram calculation (variog()) and theoretical model fitting (variofit()), with detailed explanations of all parameters and outputs.

A variogram is a fundamental tool in geostatistics that measures how spatial correlation changes with distance. It helps us understand: - How similar nearby observations are - The range of spatial dependence - The amount of unexplained variation (nugget effect)


The variog() Function

Function Syntax

variog(coords, data, uvec, ...)

Parameters: - coords: A matrix or data frame with x,y coordinates (2 columns) - data: The variable being analyzed (here, DBH - Diameter at Breast Height) - uvec: A vector defining the distance bins at which to calculate the variogram

Your Code

vario.DBH <- variog(coords = coords, data = DBH, uvec = (seq(0, max.dist, length = bins)))

This creates distance bins from 0 to the maximum distance between points, divided into bins number of intervals.

The Semivariance Formula

The empirical semivariance at distance h is calculated as:

\[\gamma(h) = \frac{1}{2n(h)} \sum_{i=1}^{n(h)} [z(s_i) - z(s_i + h)]^2\]

Where: - \(n(h)\) = number of pairs at distance h - \(z(s_i)\) = value at location i - \(z(s_i + h)\) = value at location i + distance h


Understanding Distance in the Variogram

What Does u Represent?

u represents the distance BETWEEN PAIRS OF POINTS (trees), not distance from some origin point.

Specifically, for every possible pair of trees in your dataset: - Tree A at coordinates (x₁, y₁) - Tree B at coordinates (x₂, y₂) - Distance = √[(x₁ - x₂)² + (y₁ - y₂)²]

The variogram bins these pairwise distances into groups (your 50 bins), and calculates the semivariance for all pairs whose distance falls into each bin.

Example with 3 Trees

Tree X Y DBH
A 0 0 25
B 10 0 30
C 20 0 35

Pairwise distances: - A-B distance = 10 units - A-C distance = 20 units
- B-C distance = 10 units

What happens in variog(): 1. Calculates all pairwise distances (thousands of pairs in your case) 2. Sorts them into bins (0-1.8, 1.8-3.6, 3.6-5.4, …) 3. For each bin, calculates semivariance using all pairs in that bin

So when u = 10.8 in your output, it means: > “The average semivariance for all pairs of trees that are approximately 10.8 units apart”

Important Implications

  1. Units are whatever your coordinates are

    • Meters → u = 10.8 means 10.8 meters
    • Feet → 10.8 feet
    • UTM → 10.8 meters
  2. The spatial structure is about tree-to-tree relationships

    • Trees ~1.8 units apart: semivariance = 525 (moderate similarity)
    • Trees ~50 units apart: semivariance = 985 (much less similarity)
    • Trees ~88 units apart: semivariance = 1081 (essentially no similarity)
  3. The max.dist = 89.07 tells you that the farthest apart any two trees in your dataset are is ~89 units.

  4. The bins are equally spaced: uvec = seq(0, max.dist, length = bins) creates 50 equally spaced bins from 0 to ~89 units.


Interpreting Your variog() Results

Your Output Structure

Component Description
$u Distance bin midpoints
$v Semivariance values
$n Number of pairs per bin
$sd Standard deviation of semivariance estimates
$var.mark Variance of the data
$beta.ols OLS estimate of the mean
$max.dist Maximum distance between points
$estimator.type ‘classical’ (Matheron’s estimator)
$n.data Sample size (1955 trees)
$direction ‘omnidirectional’

$u - Distance Bins (Midpoints)

These are the midpoints of each distance bin (50 bins in your case): - First bin: ~0 (near zero distance) - Last bin: ~88.17 units - Bin width: ~1.80 units (constant spacing)

# Example of first few distance bins
u_values <- c(0.000000, 1.799424, 3.598849, 5.398273, 7.197697)
data.frame(
  Bin = 1:5,
  Distance = u_values,
  Interpretation = c("Near-zero distance", "~1.8 units apart", "~3.6 units apart", 
                     "~5.4 units apart", "~7.2 units apart")
)
##   Bin Distance     Interpretation
## 1   1 0.000000 Near-zero distance
## 2   2 1.799424   ~1.8 units apart
## 3   3 3.598849   ~3.6 units apart
## 4   4 5.398273   ~5.4 units apart
## 5   5 7.197697   ~7.2 units apart

$v - Semivariance Values

These are the calculated semivariance values for each distance bin.

Key observations from your data:

# Example of semivariance progression
v_data <- data.frame(
  Distance = c(0.0, 1.8, 3.6, 5.4, 7.2, 50.4, 55.8, 57.6, 59.4, 88.2),
  Semivariance = c(370.98, 525.57, 749.24, 873.83, 873.15, 984.96, 995.50, 1018.29, 1026.65, 1081.79)
)
knitr::kable(v_data, caption = "Semivariance at Selected Distances")
Semivariance at Selected Distances
Distance Semivariance
0.0 370.98
1.8 525.57
3.6 749.24
5.4 873.83
7.2 873.15
50.4 984.96
55.8 995.50
57.6 1018.29
59.4 1026.65
88.2 1081.79
  • Starting value: ~371 (at near-zero distance) - this is close to the nugget effect
  • Increasing trend: The values generally increase from ~371 to ~1,043
  • Sill: The variogram appears to reach a sill around ~1,000-1,080 units
  • Range: The distance where the variogram first reaches the sill appears to be around 50-60 distance units

$n - Number of Pairs

Number of point pairs used to calculate each semivariance value:

# Example of pair counts
n_data <- data.frame(
  Bin = c(1, 2, 10, 20, 30, 40, 50),
  Pairs = c(127, 1134, 8552, 14443, 18213, 19822, 19621),
  Interpretation = c("Few pairs at near-zero distance", "Good sampling", "Good sampling", 
                     "Good sampling", "Good sampling", "Good sampling", "Good sampling")
)
knitr::kable(n_data, caption = "Number of Pairs per Bin")
Number of Pairs per Bin
Bin Pairs Interpretation
1 127 Few pairs at near-zero distance
2 1134 Good sampling
10 8552 Good sampling
20 14443 Good sampling
30 18213 Good sampling
40 19822 Good sampling
50 19621 Good sampling
  • First bin: Only 127 pairs (near-zero distances, few points close together)
  • Middle bins: 15,000-20,000 pairs (good sampling)
  • Last bins: ~19,600 pairs

$sd - Standard Deviation

Standard deviation of the semivariance estimates: - Values range from ~912 to ~1,930 - Higher uncertainty at small and large distances - Most reliable estimates in the middle distances (more pairs)


$var.mark - Variance of Data

  • 1052.68 = total variance of DBH measurements

$beta.ols - OLS Mean

  • 36.77 = Ordinary Least Squares estimate of the mean DBH

Other Parameters

  • $max.dist: 89.07 units = maximum distance between any two points
  • $estimator.type: “classical” = using the method-of-moments estimator (Matheron’s estimator)
  • $n.data: 1955 trees measured
  • $direction: “omnidirectional” = calculates variogram in all directions (no directional anisotropy)

What Your Variogram Tells You

Spatial Structure

Semivariance increases with distance → Spatial correlation decreases with distance
  • DBH values are positively correlated at short distances
  • Trees close together have similar DBH
  • Correlation disappears at distances > ~50-60 units

Key Parameters

1. Nugget Effect (~371)

  • The intercept at near-zero distance (~371) represents:
    • Measurement error
    • Micro-scale variability (variation at scales smaller than your minimum sampling distance)
    • Approximately 35% of total variance (371/1052.68)

2. Sill (~1,000-1,080)

  • The plateau reached at large distances
  • Represents the total variance of the process
  • Approximately 95-103% of the data variance

3. Range (~50-60 units)

  • Distance at which spatial correlation effectively disappears
  • Beyond this distance, DBH values are spatially independent
  • Important: This is an eyeballed estimate; formal model fitting is needed for precision

Practical Interpretation

This variogram suggests: - Strong spatial structure in DBH (clear increasing pattern) - Moderate nugget (some unexplained small-scale variation) - Effective range of about 50-60 units - this is the maximum distance at which spatial interpolation is useful - The spatial pattern is isotropic (same in all directions)


The variofit() Function

What is variofit()?

variofit() fits a theoretical variogram model to your empirical (experimental) variogram. While variog() calculates the observed semivariance at different distances, variofit() finds the best-fitting mathematical model that describes the spatial structure.

Function Syntax

variofit(vario, ini.cov.pars, cov.model, minimisation.function, weights, ...)

Your Code

fit.DBH <- variofit(
  vario.DBH, 
  ini.cov.pars = c(600, 200/-log(0.05)), 
  cov.model = "exponential", 
  minimisation.function = "nls", 
  weights = "equal"
)

Parameter Explanations

1. vario.DBH - The Empirical Variogram

This is the output from your variog() function. It contains: - The distance bins ($u) - The semivariance values ($v) - Number of pairs per bin ($n) - Other metadata

This is the observed data that the model will try to fit.

2. ini.cov.pars = c(600, 200/-log(0.05)) - Initial Parameters

This provides starting values for the optimization algorithm. It’s a vector of two numbers:

First Value: Partial Sill = 600

The partial sill (also called the structured variance) is the amount of variance in your data that is spatially structured - meaning it’s explained by the distance between points.

Where did 600 come from?

From your empirical variogram: - Estimated Sill ≈ 1050 (from maximum semivariance) - Estimated Nugget ≈ 371 (from first bin) - Estimated Partial Sill = Sill - Nugget = 1050 - 371 = 679

So 600 is a rounded-down estimate of the partial sill.

Visualizing the Components:

Semivariance
    ↑
    |   SILL (Total Variance)
    |   ←---------------------→
    |   |                     |
    |   |   PARTIAL SILL      |  ← Your 600 is here
    |   |   (Structured)      |
    |   |                     |
    |   NUGGET                |
    |   (Unstructured)        |
    |   ↓                     |
    |---|---------------------|---→ Distance
    0   |                     |
        ←--- RANGE ----------→
Second Value: Range Parameter = 200/-log(0.05)

This formula gives you a range parameter such that the practical range (where correlation drops to 5%) is 200 units.

  • -log(0.05) = 2.9957
  • 200 / 2.9957 = 66.76

For the exponential model: - Range parameter: φ = 66.76 - Practical range (95% sill): 3 × φ = 200.28 ✅

Interpretation: You’re telling the algorithm: > “I think the spatial correlation extends to about 200 units, so start with a range parameter that gives a practical range of 200.”

3. cov.model = “exponential” - Covariance Model

This specifies the theoretical model to fit.

Exponential Model Formula:

\[\gamma(h) = c_0 + c_1 \times (1 - e^{-h/\phi})\]

Where: - c₀ = Nugget (micro-scale variation + measurement error) - c₁ = Partial sill (spatially structured variance) - φ = Range parameter (controls decay rate) - h = Distance

Available Models:
Model Formula Characteristics
Spherical \(\gamma(h) = c_0 + c_1 \times (1.5(h/a) - 0.5(h/a)^3)\) for h ≤ a Reaches sill exactly at range a
Exponential \(\gamma(h) = c_0 + c_1 \times (1 - e^{-h/\phi})\) Approaches sill asymptotically
Gaussian \(\gamma(h) = c_0 + c_1 \times (1 - e^{-(h/\phi)^2})\) Smooth, parabolic near origin
Matérn More complex with smoothness parameter κ Very flexible

Why choose exponential? - Your empirical variogram shows a gradual leveling off, not an abrupt sill - Exponential models are good for ecological data where correlation decays smoothly - It has a practical range of about 3φ (unlike spherical which has an exact range)

4. minimisation.function = “nls” - Optimization Method

This specifies how the model is fitted to the data.

Options: - "nls" (Non-Linear Least Squares): Minimizes the sum of squared residuals - Advantages: Fast, stable, good for well-behaved data - Disadvantages: Assumes constant variance, sensitive to outliers

  • "optim": Uses R’s general-purpose optimization function
    • Advantages: More flexible, handles constraints
    • Disadvantages: Can be slower, may not converge
  • "nlminb": Uses R’s constrained optimization
    • Advantages: Handles parameter bounds well
    • Disadvantages: May be slower

With weights = "equal", "nls" minimizes:

\[\sum_{i=1}^{n} (\hat{\gamma}(h_i) - \gamma(h_i))^2\]

5. weights = “equal” - Weighting Scheme

This determines how much influence each bin has on the fitted model.

Weighting Description When to Use
"equal" All bins weighted equally When all bins have similar number of pairs
"npairs" Weight by number of pairs (n) Recommended - gives more weight to bins with more pairs
"cressie" Cressie’s weighting (n/h²) Accounts for both pair count and distance

Your choice of "equal" means all 50 distance bins contribute equally to the fit, even bins with few pairs.

Recommendation: Usually weights = "npairs" is better because bins with more pairs have more reliable estimates.


Understanding ini.cov.pars() Parameters

The Basic Structure

ini.cov.pars takes a vector of two numbers:

ini.cov.pars = c(partial_sill, range_parameter)

1. First Value: Partial Sill

The partial sill is the spatially structured variance (Sill - Nugget).

How to estimate it:

# From your empirical variogram
nugget_est <- vario.DBH$v[1]  # ~371
sill_est <- mean(tail(vario.DBH$v, 10))  # ~1050
partial_sill_est <- sill_est - nugget_est  # ~679

# Use a rounded value
partial_sill <- 600  # or 679 or 700

2. Second Value: Range Parameter

The range parameter controls how quickly correlation decays with distance. The interpretation depends on the model!

Model-Specific Interpretations

Model Range Parameter Relationship to Practical Range For Practical Range = 200
Exponential φ Practical = 3 × φ φ = 66.76 ✅
Spherical a Practical = a a = 200 ❌
Gaussian φ Practical = φ × √(-log(0.05)) φ = 115.47 ❌
Matérn (κ=0.5) φ Practical ≈ 3 × φ φ = 66.76 ✅
Matérn (κ=1.0) φ Practical ≈ 3.9 × φ φ = 51.28 ❌
Matérn (κ=1.5) φ Practical ≈ 4.5 × φ φ = 44.44 ❌

Visual Representation

Semivariance
    ↑
    |   ←--- SILL (Total Variance = Nugget + Partial Sill) ---→
    |   |                                                    |
    |   |   ←--- PARTIAL SILL (First Value) ---→             |
    |   |                                                    |
    |   NUGGET                                               |
    |   (Not in ini.cov.pars)                                |
    |   ↓                                                    |
    |---|--------------------------|------------------------|---→ Distance
    0   |                          |
        ←-- RANGE PARAMETER ------→
        (Second Value)

Examples for Different Models

Exponential Model

# If you think practical range is ~60 units
# For exponential: practical range = 3 × φ
# So φ = 60/3 = 20

ini.cov.pars = c(600, 20)  # partial sill = 600, range parameter = 20

# Or using your formula (practical range = 200)
ini.cov.pars = c(600, 200/-log(0.05))  # ≈ c(600, 66.76)

Spherical Model

# For spherical, the range parameter IS the practical range
# If you think range is ~60 units:

ini.cov.pars = c(600, 60)  # partial sill = 600, range = 60

# Or for practical range of 200:
ini.cov.pars = c(600, 200)  # Note: NOT 200/-log(0.05)!

Gaussian Model

# For Gaussian, practical range = φ × sqrt(-log(0.05))
# If practical range = 60, then φ = 60/sqrt(-log(0.05)) ≈ 34.6

ini.cov.pars = c(600, 34.6)  # partial sill = 600, range parameter = 34.6

# Or for practical range of 200:
ini.cov.pars = c(600, 200/sqrt(-log(0.05)))  # ≈ c(600, 115.47)

Quick Reference Table for Your Data

Based on your empirical variogram showing range ≈ 50-60 units:

Model Partial Sill (1st value) Range Parameter (2nd value) Practical Range
Exponential 600-700 55/3 ≈ 18.3 55
Spherical 600-700 55 55
Gaussian 600-700 55/sqrt(-log(0.05)) ≈ 31.8 55

Helper Function

# Function to get initial range parameter for different models
get_initial_range <- function(model, practical_range = 200, kappa = NULL) {
  switch(model,
    exponential = practical_range / 3,
    spherical = practical_range,
    gaussian = practical_range / sqrt(-log(0.05)),
    matern = {
      if (is.null(kappa)) stop("kappa must be specified for Matérn model")
      if (kappa == 0.5) practical_range / 3
      else if (kappa == 1.0) practical_range / 3.9
      else if (kappa == 1.5) practical_range / 4.5
      else stop("kappa must be 0.5, 1.0, or 1.5")
    },
    stop("Unknown model")
  )
}

# Usage examples
ini.range <- get_initial_range("exponential", 200)  # 66.76
ini.range <- get_initial_range("spherical", 200)    # 200
ini.range <- get_initial_range("gaussian", 200)     # 115.47
ini.range <- get_initial_range("matern", 200, kappa = 1.5)  # 44.44

Important Notes

  1. Nugget is NOT in ini.cov.pars - The nugget is estimated automatically by variofit() unless you fix it with the fix.nugget parameter.

  2. Partial Sill should be positive - It represents variance, so it must be > 0.

  3. Range parameter should be positive - It represents distance, so it must be > 0.

  4. Initial values don’t need to be perfect - They just need to be in the right ballpark to help the optimization algorithm converge.

  5. The model will adjust these values - The optimization will find the best-fitting values regardless of your initial guess.


Model Comparison

Fitting Multiple Models

# Estimate initial parameters from your data
nugget_est <- vario.DBH$v[1]  # ~371
sill_est <- mean(tail(vario.DBH$v, 10))  # ~1050
partial_sill_est <- sill_est - nugget_est  # ~679
range_est <- 55  # Where it levels off

# Fit different models with correct initial parameters
# Exponential
fit.exp <- variofit(
  vario.DBH,
  ini.cov.pars = c(partial_sill_est, range_est/3),
  cov.model = "exponential",
  weights = "npairs"
)

# Spherical
fit.sph <- variofit(
  vario.DBH,
  ini.cov.pars = c(partial_sill_est, range_est),
  cov.model = "spherical",
  weights = "npairs"
)

# Gaussian
phi_gau <- range_est / sqrt(-log(0.05))
fit.gau <- variofit(
  vario.DBH,
  ini.cov.pars = c(partial_sill_est, phi_gau),
  cov.model = "gaussian",
  weights = "npairs"
)

# Compare fit quality (lower RSS = better fit)
rss_values <- data.frame(
  Model = c("Exponential", "Spherical", "Gaussian"),
  RSS = c(fit.exp$sum.of.squares, fit.sph$sum.of.squares, fit.gau$sum.of.squares),
  Nugget = c(fit.exp$nugget, fit.sph$nugget, fit.gau$nugget),
  Partial_Sill = c(fit.exp$cov.pars[1], fit.sph$cov.pars[1], fit.gau$cov.pars[1]),
  Range = c(fit.exp$cov.pars[2], fit.sph$cov.pars[2], fit.gau$cov.pars[2])
)
knitr::kable(rss_values, caption = "Model Comparison")

Visual Comparison

# Plot all models
plot(vario.DBH, main = "Model Comparison for DBH")
lines(fit.exp, col = "red", lwd = 2)
lines(fit.sph, col = "blue", lwd = 2)
lines(fit.gau, col = "green", lwd = 2)
legend("bottomright", 
       legend = c("Empirical", "Exponential", "Spherical", "Gaussian"),
       col = c("black", "red", "blue", "green"),
       pch = c(1, NA, NA, NA),
       lty = c(NA, 1, 1, 1),
       lwd = c(NA, 2, 2, 2))

Complete Workflow Example

Here’s a complete workflow for your analysis:

# 1. Load required packages
library(geoR)

# 2. Calculate empirical variogram
max.dist <- max(dist(coords))
bins <- 50

vario.DBH <- variog(
  coords = coords,
  data = DBH,
  uvec = seq(0, max.dist, length = bins)
)

# 3. View empirical variogram
plot(vario.DBH, main = "Empirical Variogram of DBH")

# 4. Estimate initial parameters visually
nugget_est <- vario.DBH$v[1]  # ~371
sill_est <- mean(tail(vario.DBH$v, 10))  # ~1050
partial_sill_est <- sill_est - nugget_est  # ~679
range_est <- 55  # Where it levels off

# 5. Fit models with correct initial parameters
# Exponential
fit.exp <- variofit(
  vario.DBH,
  ini.cov.pars = c(partial_sill_est, range_est/3),
  cov.model = "exponential",
  weights = "npairs"
)

# Spherical
fit.sph <- variofit(
  vario.DBH,
  ini.cov.pars = c(partial_sill_est, range_est),
  cov.model = "spherical",
  weights = "npairs"
)

# 6. Compare models
summary(fit.exp)
summary(fit.sph)

# 7. Select best model (based on RSS)
if (fit.exp$sum.of.squares < fit.sph$sum.of.squares) {
  best_model <- fit.exp
  best_name <- "Exponential"
} else {
  best_model <- fit.sph
  best_name <- "Spherical"
}

cat("Best model:", best_name, "\n")
cat("RSS:", best_model$sum.of.squares, "\n")
cat("Nugget:", best_model$nugget, "\n")
cat("Partial Sill:", best_model$cov.pars[1], "\n")
cat("Range:", best_model$cov.pars[2], "\n")

# 8. Plot final model
plot(vario.DBH, main = paste("DBH Variogram -", best_name, "Model"))
lines(best_model, col = "red", lwd = 2)
legend("bottomright", 
       legend = c("Empirical", paste("Fitted", best_name)),
       col = c("black", "red"),
       pch = c(1, NA),
       lty = c(NA, 1),
       lwd = c(NA, 2))

# 9. Extract parameters for kriging
nugget <- best_model$nugget
partial_sill <- best_model$cov.pars[1]
range_param <- best_model$cov.pars[2]
total_sill <- nugget + partial_sill

cat("\nFinal Model Parameters:\n")
cat("Nugget:", nugget, "\n")
cat("Partial Sill:", partial_sill, "\n")
cat("Total Sill:", total_sill, "\n")
cat("Range Parameter:", range_param, "\n")
if (best_name == "Exponential") {
  cat("Practical Range:", 3 * range_param, "\n")
} else if (best_name == "Spherical") {
  cat("Practical Range:", range_param, "\n")
}

Cross-validation

# 10. Perform kriging cross-validation
kcv <- krige.cv(
  geodata = list(coords = coords, data = DBH),
  model = best_model,
  nsim = 100
)

# View cross-validation results
summary(kcv)

# Plot cross-validation
par(mfrow = c(2, 2))
plot(kcv$predicted, kcv$observed, 
     xlab = "Predicted", ylab = "Observed",
     main = "Cross-validation")
abline(0, 1, col = "red")
hist(kcv$residuals, main = "Residuals", xlab = "Residual")
qqnorm(kcv$residuals, main = "Normal Q-Q Plot")
qqline(kcv$residuals, col = "red")