(a.)
Y(s) = 2 * s1 * Z(s) + s2 + 1, Z(s) ~ second-order stationary, E[Z(s)] = 0, Cov(Z(s), Z(s’)) = exp(-||s - s’||) E[Y(s)] = E[2 * s1 * Z(s) + s2 + 1] = 2 * s1 * 0 + (s2 + 1) = s2 + 1 Cov(Y(s), Y(s’)) = Cov(2 * s1 * Z(s), 2 * s1’ * Z(s’)) = 4 * s1 * s1’ * exp(-||s - s’||)
Answer: Mean function: E[Y(s)] = s2 + 1
Covariance function: Cov(Y(s), Y(s’)) = 4 * s1 * s1’ * exp(-||s -
s’||)
(b.)
Var(Z(s)) = C(s,s) = A_µ(0) = 1, so correlation = covariance.
Let r = ||s - s’||. For 0 <= r <= a,
(1 - r/a)^µ = 0.05.
Solve for r:
1 - r/a = 0.05^(1/µ)
r = a * (1 - 0.05^(1/µ)).
Answer: Effective range = a * (1 - 0.05^(1/µ))
(c.)
Given semivariogram: \(\gamma(h) = 1 - \frac{1}{3(1 + h^2/2)^2}\) for \(h > 0\), and \(\gamma(0) = 0\).
Nugget \(c_0\): \(c_0 = \lim_{h \to 0^+} \gamma(h) = \lim_{h \to 0^+} \left( 1 - \frac{1}{3(1 + h^2/2)^2} \right) = 1 - \frac{1}{3} = 2/3\).
Total Sill \(C(0)\): \(C(0) = \lim_{h \to \infty} \gamma(h) = \lim_{h \to \infty} \left( 1 - \frac{1}{3(1 + h^2/2)^2} \right) = 1 - 0 = 1\).
Partial Sill \(c_1\): \(c_1 = C(0) - c_0 = 1 - 2/3 = 1/3\).
Covariance Function \(C(h)\): \(C(h) = C(0) - \gamma(h) = 1 - (1 - \frac{1}{3(1 + h^2/2)^2}) = \frac{1}{3(1 + h^2/2)^2}\) for \(h > 0\). \(C(0) = 1\). So, \(C(h) = \begin{cases} 1, & \text{if } h = 0, \\ \frac{1}{3(1 + h^2/2)^2}, & \text{if } h > 0. \end{cases}\)
Correlation Function \(R(h)\): \(R(h) = C(h) / C(0) = C(h) / 1 = C(h)\).
Effective Range \(h_0\): Set \(R(h_0) = 0.05\) for \(h_0 > 0\). \(\frac{1}{3(1 + h_0^2/2)^2} = 0.05\) \((1 + h_0^2/2)^2 = \frac{1}{3 \times 0.05} = \frac{1}{0.15} = \frac{100}{15} = \frac{20}{3}\) \(1 + h_0^2/2 = \sqrt{\frac{20}{3}}\) \(h_0^2/2 = \sqrt{\frac{20}{3}} - 1\) \(h_0^2 = 2 \left( \sqrt{\frac{20}{3}} - 1 \right)\) \(h_0 = \sqrt{2 \left( \sqrt{\frac{20}{3}} - 1 \right)}\)
Answer: Covariance \(C(h) = \begin{cases} 1, & \text{if } h = 0, \\ \frac{1}{3(1 + h^2/2)^2}, & \text{if } h > 0. \end{cases}\) Nugget = 2/3 Partial Sill = 1/3 Effective Range = \(\sqrt{2 \left( \sqrt{20/3} - 1 \right)}\) (approx 1.7788)
(d.)
library(ggplot2)
library(patchwork)
library(viridis)
library(sf)
library(sp)
library(gstat)
library(spdep)
library(nlme)
library(SpatialEpi)
library(spatialreg)
library(fields)
set.seed(5226)
n <- 41^2
x.grid <- seq(0, 1, by=1/40)
xy.grid <- expand.grid(x.grid, x.grid)
colnames(xy.grid) <- c("s1", "s2")
pair.dist <- rdist(xy.grid)
For Y (·) process in (a):
# 1. Define Covariance Matrix for the base process Z(s)
Cov.mat_a <- exp(-pair.dist)
# 2. Simulate the base Gaussian process Z(s) (mean 0)
Z_a <- t(chol(Cov.mat_a)) %*% rnorm(n)
# 3. Apply the transformation to get Y(s)
Y_sim_a <- 2 * xy.grid$s1 * Z_a + xy.grid$s2 + 1
# 4. Prepare data for plotting
plot.dat_a <- data.frame(Y = Y_sim_a,
x = xy.grid$s1,
y = xy.grid$s2)
plot.sf_a <- st_as_sf(plot.dat_a, coords = c("x", "y"))
# 5. Plot
ggplot(plot.sf_a) +
geom_sf(aes(color = Y), size = 2) +
scale_color_gradient(low = "blue", high = "orange") +
theme_bw() +
ggtitle("Simulation of Process (a): Y(s) = 2*s1*Z(s) + s2 + 1")
For Z(·) process in (b):
# 1. Define Covariance Matrix
mu_param <- 2
a_param <- 1
r <- pair.dist / a_param
term <- 1 - r
term[term < 0] <- 0
Cov.mat_b <- term^mu_param
# 2. Simulate the Gaussian process Z(s) (mean 0)
Z_b <- t(chol(Cov.mat_b)) %*% rnorm(n)
# 3. Prepare data for plotting
plot.dat_b <- data.frame(Z = Z_b,
x = xy.grid$s1,
y = xy.grid$s2)
plot.sf_b <- st_as_sf(plot.dat_b, coords = c("x", "y"))
# 4. Plot
ggplot(plot.sf_b) +
geom_sf(aes(color = Z), size = 2) +
scale_color_gradient(low = "blue", high = "orange") +
theme_bw() +
ggtitle("Simulation of Process (b): C(h) = (1-h)^2_+, mu=2, a=1")
For process in (c):
# 1. Define Covariance Matrix
Cov.mat_c <- 1 / (3 * (1 + pair.dist^2 / 2)^2)
diag(Cov.mat_c) <- 1
# 2. Simulate the Gaussian process Z(s) (mean 0)
Z_c <- t(chol(Cov.mat_c)) %*% rnorm(n)
# 3. Prepare data for plotting
plot.dat_c <- data.frame(Z = Z_c,
x = xy.grid$s1,
y = xy.grid$s2)
plot.sf_c <- st_as_sf(plot.dat_c, coords = c("x", "y"))
# 4. Plot
ggplot(plot.sf_c) +
geom_sf(aes(color = Z), size = 2) +
scale_color_gradient(low = "blue", high = "orange") +
theme_bw() +
ggtitle("Simulation of Process (c): Generalized Cauchy")
Regarding the squared exponential Gaussian process: Yes, when first
trying to simulate \(Z(\cdot)\) without
a nugget effect (as shown in code chunk 5a), a coding error was
encountered. The error message produced is:
Error in chol.default(Cov.mat_d1) : the leading minor of order [number] is not positive definite
This error occurs because the squared exponential covariance function
\(\exp(-h^2)\) approaches 1 extremely
fast as the distance \(h \to 0\). This
creates a covariance matrix that is “ill-conditioned” or “numerically
singular” (i.e., its eigenvalues are too close to zero), even though it
is theoretically positive definite. The chol() function
fails because it cannot handle this numerical instability. Yes, after
adding a small nugget \(c_0 = 10^{-6}\)
to the diagonal entries of the covariance matrix (as shown in code chunk
5b), the coding error did persist; it was resolved. The
chol() function executed successfully, and the simulation
was produced. This small nugget adds a tiny amount of variance to each
location, which is enough to make the matrix numerically stable and
strictly positive definite, allowing the Cholesky decomposition to
succeed.
(a.)
library(sf)
library(ggplot2)
file_path <- "C:/Users/dydsw/Desktop/Homework 2/data/juraZn.csv"
jura_data <- read.csv(file_path, header = TRUE)
jura_sf <- st_as_sf(jura_data, coords = c("Xloc", "Yloc"))
ggplot(data = jura_sf) +
geom_sf(aes(color = Zn), size = 2) +
scale_color_gradient(low = "blue", high = "orange", name = "Zn Conc.") +
theme_bw() +
ggtitle("Zinc Concentration at 259 Locations in Jura")
(b.)
library(gstat)
# 2. Calculate the semivariogram cloud
vgm_cloud <- variogram(Zn ~ 1, data = jura_sf, cloud = TRUE)
# 3. Plot the cloud
plot(vgm_cloud,
main = "Semivariogram Cloud for Jura Zinc Data",
xlab = "Distance (h)",
ylab = "Semivariance (gamma)")
(c.)
# 1. Create the logical condition for filtering
outlier_condition <- vgm_cloud$gamma > 14000 & vgm_cloud$dist < 0.25
# 2. Subset the variogram cloud data frame
outlier_pairs <- na.omit(vgm_cloud[outlier_condition, ])
# 3. Print the identified pairs to the console
print("Identified outlier pairs (gamma > 14000 & dist < 0.25):")
## [1] "Identified outlier pairs (gamma > 14000 & dist < 0.25):"
print(outlier_pairs)
## dist gamma dir.hor dir.ver id left right
## 2417 0.2210204 14125.44 0 0 var1 94 40
## 12192 0.2292706 15368.55 0 0 var1 208 40
## 12333 0.2194949 15417.68 0 0 var1 209 40
(d.)
# 1. Calculate the variogram using default options
vgm_default <- variogram(Zn ~ 1, data = jura_sf)
# 2. Plot the default variogram
plot(vgm_default,
main = "Default Empirical Semivariogram",
xlab = "Distance (h)",
ylab = "Semivariance (gamma)")
print("Default Variogram Bins:")
## [1] "Default Variogram Bins:"
print(vgm_default)
## np dist gamma dir.hor dir.ver id
## 1 342 0.05811439 304.1953 0 0 var1
## 2 461 0.23422428 589.8861 0 0 var1
## 3 831 0.37321886 709.0710 0 0 var1
## 4 931 0.51183712 707.5196 0 0 var1
## 5 1022 0.67333526 792.2940 0 0 var1
## 6 1284 0.81491201 783.7029 0 0 var1
## 7 1251 0.97279054 724.8374 0 0 var1
## 8 1663 1.10785333 930.4749 0 0 var1
## 9 1582 1.26463821 960.3067 0 0 var1
## 10 1807 1.40859854 924.0324 0 0 var1
## 11 1738 1.55417387 949.2298 0 0 var1
## 12 1793 1.71473991 878.4417 0 0 var1
## 13 1664 1.85265247 845.2936 0 0 var1
## 14 1560 2.00985981 920.8212 0 0 var1
## 15 1490 2.14546221 795.3116 0 0 var1
# 1. Define the specific binning parameters
my_cutoff <- 5.5
my_width <- my_cutoff / 22
# 2. Calculate the variogram with these parameters
vgm_specific <- variogram(Zn ~ 1, data = jura_sf,
cutoff = my_cutoff,
width = my_width)
# 3. Plot the specific variogram
plot(vgm_specific,
main = "Specific Semivariogram (22 bins to 5.5)",
xlab = "Distance (h)",
ylab = "Semivariance (gamma)")
When comparing the two plots to fit a model, especially at short
distances, I would prefer to use the first empirical
semivariogram (the default one). Here is the reasoning: There
is a critical trade-off between resolution (how many points we
plot) and stability (how reliable those points are). 1.
The Problem with the Second Plot (22 bins): The second
plot has a very small bin width (width = 0.25). While this
gives us high resolution (many points), it means each point’s value (the
average gamma) is calculated from fewer
pairs of data. This makes the plot highly susceptible to noise
and “local irregularities” (as seen in the standard answer). The result
is a “spiky” or “noisy” plot that is difficult to fit a smooth model to.
2. The Benefit of the First Plot (Default): The first
(default) plot uses wider bins. Each point’s value is an average based
on many more pairs of data points. This makes the
resulting trend more statistically stable and reliable.
When fitting a model for short distances (i.e., trying to determine the
nugget/sill), we need to see the true underlying trend, not
just random noise. The first plot, while having fewer points, provides a
“smoother” and more reliable estimate of this trend, making it a safer
and easier choice for fitting a model.
(e.)
# 1. Load gstat (if not already loaded)
library(gstat)
# 2. Define the directions (in degrees)
directions_to_check <- c(0, 45, 90, 135)
# 3. Calculate the directional variograms
vgm_directional <- variogram(Zn ~ 1,
data = jura_sf,
alpha = directions_to_check)
# 4. Plot the results
plot(vgm_directional,
main = "Directional Semivariograms (N, NE, E, SE)",
as.table = TRUE)
Upon reviewing the four plots, we are evaluating if the spatial structure is the same in all directions. The plots generated for the N, NE, E, and SE directions show noticeably different behaviors. For instance, the sill (the plateau height) and the range (the distance at which the plateau is reached) do not appear to be consistent across all four graphs. The slope of the variogram at short distances also seems to vary by direction. Because the spatial dependence structure (the shape of the semivariogram) is clearly not the same in all directions, the assumption of isotropy is inappropriate. This suggests that the data exhibits anisotropy (i.e., spatial correlation is direction-dependent).
(a.)
library(gstat)
vgm_log_emp <- variogram(log(Zn) ~ 1, data = jura_sf)
plot(vgm_log_emp,
main = "Empirical Variogram for log(Zn)")
(b.)
# 1. Define the "starting guess" model as specified
vgm_guess <- vgm(psill = 0.1, model = "Sph", range = 2.0, nugget = 0.1)
# 2. Fit the model
vgm_model_fit <- fit.variogram(vgm_log_emp, model = vgm_guess)
# 3. Report the fitted values (as requested by the hint)
fitted_nugget <- vgm_model_fit$psill[1]
fitted_psill <- vgm_model_fit$psill[2]
fitted_range <- vgm_model_fit$range[2]
print("--- Fitted Model Parameters ---")
## [1] "--- Fitted Model Parameters ---"
print(paste("Fitted Nugget:", fitted_nugget))
## [1] "Fitted Nugget: 0.030281914592556"
print(paste("Fitted Partial Sill:", fitted_psill))
## [1] "Fitted Partial Sill: 0.128333666570966"
print(paste("Fitted Range:", fitted_range))
## [1] "Fitted Range: 0.76172390295124"
# 4. Report the weighted sum of squares of error (WSSE)
wsse <- attr(vgm_model_fit, "SSE")
print(paste("Weighted Sum of Squares of Error (WSSE):", wsse))
## [1] "Weighted Sum of Squares of Error (WSSE): 3.12297624985944"
# 5. Plot both the empirical points and the fitted line
plot(vgm_log_emp,
model = vgm_model_fit,
main = "Empirical Variogram and Fitted Spherical Model for log(Zn)")
(c.)
# 1. Define the "starting guess" model WITHOUT a nugget
vgm_guess_C <- vgm(psill = 0.1, model = "Sph", range = 2.0)
# 2. Fit the "no-nugget" model
vgm_model_fit_C <- fit.variogram(vgm_log_emp, model = vgm_guess_C)
# 3. Report the fitted values for this new model
fitted_psill_C <- vgm_model_fit_C$psill[2]
fitted_range_C <- vgm_model_fit_C$range[2]
print("--- Fitted 'No-Nugget' Model Parameters ---")
## [1] "--- Fitted 'No-Nugget' Model Parameters ---"
print(paste("Fitted Partial Sill:", fitted_psill_C))
## [1] "Fitted Partial Sill: NA"
print(paste("Fitted Range:", fitted_range_C))
## [1] "Fitted Range: NA"
print(paste("Fitted Nugget (should be 0):", vgm_model_fit_C$psill[1]))
## [1] "Fitted Nugget (should be 0): 0.136097287439836"
# 4. Report the Weighted Sum of Squares of Error (WSSE)
wsse_C <- attr(vgm_model_fit_C, "SSE")
print(paste("Weighted Sum of Squares of Error (WSSE) for this model:", wsse_C))
## [1] "Weighted Sum of Squares of Error (WSSE) for this model: 24.810157275739"
We can now compare the two models based on their goodness-of-fit. The primary metric for this comparison is the Weighted Sum of Squares of Error (WSSE). This value represents the total discrepancy between the empirical variogram points (the dots) and the fitted model (the line). A lower WSSE indicates a better fit. * The model from (b) (which included a nugget) resulted in a WSSE of approximately 3.123. * The model from (c) (which was forced to have no nugget) resulted in a WSSE of approximately 24.810. Since 3.123 is substantially smaller than 24.810, it is clear that the model from (b) provides a significantly better fit to the data. This strongly implies that the data does have a nugget effect (a non-zero intercept), and forcing the model to pass through the origin (as in (c)) makes the fit much worse. Therefore, the model from (b) (with the nugget) is the better model.
(d.)
# 1. Define the "starting guess" for the Exponential model
vgm_guess_Exp <- vgm(psill = 0.1, model = "Exp", range = 2.0, nugget = 0.1)
# 2. Fit the Exponential model
vgm_model_fit_Exp <- fit.variogram(vgm_log_emp, model = vgm_guess_Exp)
# 3. Report the Weighted Sum of Squares of Error (WSSE)
wsse_Exp <- attr(vgm_model_fit_Exp, "SSE")
print("--- Exponential Model ---")
## [1] "--- Exponential Model ---"
print(paste("Fitted Model Parameters (Exp):",
"Nugget:", vgm_model_fit_Exp$psill[1],
"PSill:", vgm_model_fit_Exp$psill[2],
"Range:", vgm_model_fit_Exp$range[2]))
## [1] "Fitted Model Parameters (Exp): Nugget: 0.0221348813611945 PSill: 0.144010273228354 Range: 0.341424609027228"
print(paste("Exponential Model WSSE:", wsse_Exp))
## [1] "Exponential Model WSSE: 2.77789143332655"
# 4. Plot the Exponential model fit
plot(vgm_log_emp,
model = vgm_model_fit_Exp,
main = "Empirical Variogram and Fitted Exponential Model")
# 1. Define the "starting guess" for the Matérn model
vgm_guess_Mat <- vgm(psill = 0.1, model = "Mat", range = 2.0, nugget = 0.1, kappa = 1)
# 2. Fit the Matérn model
vgm_model_fit_Mat <- fit.variogram(vgm_log_emp, model = vgm_guess_Mat)
# 3. Report the Weighted Sum of Squares of Error (WSSE)
wsse_Mat <- attr(vgm_model_fit_Mat, "SSE")
print("--- Matérn (kappa=1) Model ---")
## [1] "--- Matérn (kappa=1) Model ---"
print(paste("Fitted Model Parameters (Mat, k=1):",
"Nugget:", vgm_model_fit_Mat$psill[1],
"PSill:", vgm_model_fit_Mat$psill[2],
"Range:", vgm_model_fit_Mat$range[2]))
## [1] "Fitted Model Parameters (Mat, k=1): Nugget: 0.0358274893034872 PSill: 0.127074630345843 Range: 0.216727036640958"
print(paste("Matérn (k=1) Model WSSE:", wsse_Mat))
## [1] "Matérn (k=1) Model WSSE: 2.52380913581046"
# 4. Plot the Matérn model fit
plot(vgm_log_emp,
model = vgm_model_fit_Mat,
main = "Empirical Variogram and Fitted Matérn (k=1) Model")
We have now fitted three different parametric models to the empirical variogram and can compare their goodness-of-fit using the Weighted Sum of Squares of Error (WSSE). A lower WSSE indicates a better fit. The WSSE values for the three models are: 1. Spherical Model (from b): WSSE \(\approx\) 3.123 2. Exponential Model (from d): WSSE \(\approx\) 2.778 3. Matérn (k=1) Model (from d): WSSE \(\approx\) 2.524 Comparing these three error values, the Matérn (k=1) model produced the smallest WSSE (2.524). Therefore, among the three options, the Matérn model with kappa=1 is the best model as it provides the closest fit to the empirical data.
(e.)
library(sf)
library(gstat)
locs_file_path <- "C:/Users/dydsw/Desktop/Homework 2/data/juralocs.csv"
locs_data <- read.csv(locs_file_path, header = TRUE)
jura_locs_sf <- st_as_sf(locs_data, coords = c("Xloc", "Yloc"))
kriging_results <- krige(log(Zn) ~ 1,
locations = jura_sf,
newdata = jura_locs_sf,
model = vgm_model_fit_Mat)
## [using ordinary kriging]
kriging_results$std_dev <- sqrt(kriging_results$var1.var)
print("--- Kriging Prediction Results ---")
## [1] "--- Kriging Prediction Results ---"
print("Locations of the 3 new points with their predicted log(Zn) mean and std. dev.:")
## [1] "Locations of the 3 new points with their predicted log(Zn) mean and std. dev.:"
print(kriging_results)
## Simple feature collection with 3 features and 3 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 2.8938 ymin: 1.0972 xmax: 3.941 ymax: 4.7876
## CRS: NA
## var1.pred var1.var geometry std_dev
## 1 4.393334 0.07582320 POINT (3.2892 4.7876) 0.2753601
## 2 4.110294 0.09710073 POINT (3.941 1.0972) 0.3116099
## 3 4.269853 0.11488585 POINT (2.8938 2.244) 0.3389481
Based on the kriging output, the predicted means for the logarithm of zinc concentrations at the three new locations are 4.3933, 4.1103, and 4.2699. The corresponding prediction standard deviations are 0.2754, 0.3116, and 0.3389.
(f.)
library(sf)
library(gstat)
library(ggplot2)
library(viridis)
library(patchwork)
grid_file_path <- "C:/Users/dydsw/Desktop/Homework 2/data/juragrid.csv"
grid_data <- read.csv(grid_file_path, header = TRUE)
jura_grid_sf <- st_as_sf(grid_data, coords = c("Xloc", "Yloc"))
kriging_grid_results <- krige(log(Zn) ~ 1,
locations = jura_sf,
newdata = jura_grid_sf,
model = vgm_model_fit_Mat)
## [using ordinary kriging]
# --- Plot 1: Predicted Means ---
g1 <- ggplot() +
geom_sf(data = kriging_grid_results, aes(color = var1.pred)) +
scale_color_viridis(name = "Pred. log(Zn)") +
theme_bw() +
ggtitle("Predicted Mean of log(Zn)")
# --- Plot 2: Predicted Variances + Observed Data ---
g2 <- ggplot() +
# Layer 1: The grid of prediction variances
geom_sf(data = kriging_grid_results, aes(color = var1.var)) +
scale_color_viridis(name = "Variance", option = "plasma") +
# Layer 2: The original 259 data points (jura_sf)
geom_sf(data = jura_sf, color = "black", shape = 3, size = 1) +
theme_bw() +
ggtitle("Prediction Variance with Observed Data (+)")
# --- Combine Plots ---
print("Displaying plots side-by-side:")
## [1] "Displaying plots side-by-side:"
g1 + g2