Homework2 - STAT 406

library(sp)
library(ggplot2)
library(gstat)
library(dplyr)
library(maptools)
library(tidyr)

Question 1) Exploration of the data and sampling scheme

We continue with the analysis of the cobalt (Co) data in the Jura mountains. The full data set (259 locations) is in juraCofull.txt. The Xloc and Yloc coordinates are in km; soil cobalt concentration is measured in mg/kg (i.e. ppm). These data will be used for all parts of this HW assignment. Loosely related questions are grouped together.

a) Plot the data in a way that illustrates where soil samples were taken and how Co concentration varies over the study area. Your answer is the plot.

Co.points<-read.table("juraCoFull.txt",header = TRUE)
Co.points.sp<-Co.points
coordinates(Co.points.sp) <- c('Xloc','Yloc')

Co.points%>%ggplot(aes(Xloc,Yloc))+
  geom_point(aes(size=Co))+
  coord_equal() +
  theme_bw()

b) Soil samples appear to have been taken on a grid with some “extra” locations. What feature(s) of this choice of locations are good when the study goal is to describe how Co concentrations vary across the study area. Briefly explain your choice of feature(s).

So, it adds lot of points different distances and specially short distances to estimate the nugget. Start with grid samplign and go back and add some extra points.

c) What feature(s) of this choice of locations are good when the study goal is to estimate the empirical variogram? Briefly explain your choice of feature(s).

Having some extra point is the feature that adds some new locations to the grid sampling is excatly what makesthis samppling different and thna just grid sampling and suitable for etimation of SV.

Question 2) Exploration of the spatial pattern

a) Plot the semivariogram cloud based on all 259 locations. Your answer is the plot.

Co.vc <- variogram(Co~1, Co.points.sp, cloud=T)
Co.vc%>%ggplot(
    aes(x = dist, 
        y = gamma )
)+ geom_point(show.legend = FALSE) #+ geom_smooth()

b) Does this plot (from part a) show evidence of non-zero spatial correlation? If there is spatial correlation, is this positive or negative? Briefly explain your answers.

The zero sptial corrolation is a falt line which is not the case here. Moran Test

c) If a pair of points has a semi-variance of 120, what is the difference in Co concentrations between those two points?

\(\gamma(\hat{h})=\frac{1}{2N(h)} \Sigma[Z(s_i)-Z(s_j)]^2\) for \(\gamma(\hat{h})=120\) => \([Z(s_i)-Z(s_j)]^2=240\)

=>\(\sqrt(240)=15.4\)

d) There are two points in the semivariogram cloud (so two pairs of data values) that have large semivariance (approx. 120 for each pair) and moderate distance (circa 0.5 km and 0.85 km). Identify the observations (by observation number) that produce these points in the cloud.

e) Plot the empirical semivariogram using the Matheron estimator and default choices of cutoff and number of bins. Your answer is the plot.

variogram(Co~1, Co.points.sp)%>%ggplot(
    aes(x = dist, 
        y = gamma )
)+ geom_point(show.legend = FALSE)

f) Plot the empirical semivariogram using the Matheron estimator with 22 bins from 0 to 5.5 km. You will use one of these plots (this one or the one in part 2e) to fit a semivariogram model. Which will be more useful? Briefly explain your choice.

variogram(Co~1, Co.points.sp,cutoff=5.5,width=0.25)%>%ggplot(
    aes(x = dist, 
        y = gamma )
)+ geom_point(show.legend = FALSE)

g) In fact, there is a reason avoid the plot in part 2f. What is that?

Few number points estimated in some 5.5 km bins.

h) Estimate the empirical semivariogram using the Cressie-Hawkins estimator and default choices of cutoff and number of bins. Overlay the Matheron estimates and the Cressie-Hawkins estimates. Which estimator do you believe is more appropriate to use for with data? Briefly explain your choice. Suggestion: you may want to look at a histogram or boxplot of the Co values. Note: I believe there is more than one right answer for this question. Your explanation is more important than your choice.

rbind(cbind(variogram(Co~1, Co.points.sp,cressie=TRUE),Method="Cressie"),
      cbind(variogram(Co~1, Co.points.sp),Method="Matheron"))%>%group_by(Method)%>%
ggplot( aes(x = dist, y = gamma,color=Method))+ geom_point(show.legend = TRUE,size=2)

hist(Co.points$Co)

i) You now decide to evaluate whether the assumption of isotropy is appropriate. Would it be a good idea to estimate and compare four directional semivariograms, in the directions of 0° (N), 90° (E), 180° (S), and 270° (W)? Briefly explain why or why not. no in these directions there is isotropic

plot(variogram(Co~1, Co.points.sp,cutoff=1.75,width=0.35,map=T))

bull eyes j) Use your choice of appropriate tool to examine anisotropy in the data. Tell me what you tool(s) you used, your decision (isotropic or not), and the rationale for your decision.

plot(variogram(Co~1, Co.points.sp,alpha=c(0,45,225)))

Question 3) Modeling the spatial pattern

a) Fit an Exponential with nugget variogram model with starting values of partial sill = 15, range=1.5 and nugget=0.2. What are the parameters of the fitted model? Do you have any concerns about the fit?

Co.v<-variogram(Co~1, Co.points.sp,cressie=TRUE)
Co.vm <- fit.variogram(Co.v, vgm(15, 'Exp', 1.5, 0.2))
Co.vm

##   model    psill range
## 1   Nug 2.988969   0.0
## 2   Exp 0.000000   1.5

It’s flat line. bad staring values

b) Fit an Exponential with nugget variogram model with starting values of partial sill = 15, range=1.5 and nugget=0. (Note: I do want you to fit a model with a nugget; just start the fit at a value of nugget=0). What are the parameters of the fitted model? Do you have any concerns about the fit? Briefly explain why or why not.

Co.v<-variogram(Co~1, Co.points.sp,cressie=TRUE)
Co.vm <- fit.variogram(Co.v, vgm(15, 'Exp', 1.5,0))
Co.vm

##   model    psill       range
## 1   Nug 1.436511  0.00000000
## 2   Exp 8.285953 -0.04772473

range is negative and it’s starting value c) What are some reasonable starting values? Briefly justify your choices.

d) Use those starting values to fit an Exponential with nugget model, what are the estimated parameters? Is that fit reasonable?

Co.v<-variogram(Co~1, Co.points.sp,cressie=TRUE)
Co.vm <- fit.variogram(Co.v, vgm(15, 'Exp', 1, nugget=1))
Co.vm

##   model    psill     range
## 1   Nug  0.00000 0.0000000
## 2   Exp 17.10777 0.6995487

e) Fit two more semivariogram (SV) models: Spherical with nugget and Matern, k=1, with nugget. Plot all three models with the empirical semivariances. Visually, which of the three models is the most appropriate?

Co.vm.sph <- fit.variogram(Co.v,vgm(14, 'Sph', 1.5, 1))
Co.vm.mat <- fit.variogram(Co.v, vgm(16, 'Mat', 1, 1, k=1))

# plot the empirical binned variogram
with(Co.v, plot(dist, gamma,cex=1.3,pch=19))    
lines(variogramLine(Co.vm, maxdist=2), lwd=2)
lines(variogramLine(Co.vm.sph, maxdist=2), col=2, lwd=2)
lines(variogramLine(Co.vm.mat, maxdist=2), col=4, lwd=2)
legend('bottomright', bty='n', lty=1, col=c(1,2,4), lwd=2,
legend=c('Exponential','Spherical','Matern, k=1') )

f) Report the weighted SS for each of the three SV models. Which of these is the best fitting?

Exponential

## [1] 46036.28

Spherical

## [1] 27193.96

Matern, k=1

## [1] 27814.99

g) Fit the Spherical model without nugget. Is the fit of this model similar to the fit of the model with a nugget? Briefly explain your choice.

Co.vm.sph.NN <- fit.variogram(Co.v,vgm(14, 'Sph', 1.5))
with(Co.v, plot(dist, gamma,cex=1.3,pch=19))    
lines(variogramLine(Co.vm.sph.NN, maxdist=2), lwd=2)
lines(variogramLine(Co.vm.sph, maxdist=2), col=2, lwd=2)
legend('bottomright', bty='n', lty=1, col=c(1,2,4), lwd=2,
legend=c('Spherical without nugget','Spherical with nugget') )

attr(Co.vm.sph.NN, 'SSErr')

## [1] 34581.57

similar

h) Your goal is to predict cobalt concentrations at unmeasured locations. Based on everything you have already done, choose a SV model to use in kriging. Briefly explain your choice.

I would the spherical model with nugget because of the lowest

i) Use the estimated parameters for the Spherical with nugget model to estimate the variance of the cobalt concentrations (i.e. the sill). Compare this to the sample variance of the Co concentrations. What is a reasonable explanation why these two values are somewhat different?

print(Co.vm.sph)

##   model      psill    range
## 1   Nug  0.3323523 0.000000
## 2   Sph 14.4420116 1.221055

var(Co.points[,3])

## [1] 12.78811

Question 4) Mapping cobalt concentrations

The data in juragrid.csv are a fine grid of locations throughout the Jura. The data in juralocs.csv are three locations that are especially important to the investigators.

a) use the fitted spherical with nugget SV model to predict Co concentrations at the three locations in juralocs.csv. Report the predictions and their standard deviations.

pred.loc<-read.csv("juralocs.csv",header = TRUE)
pred.grid<-read.csv("juragrid.csv",header = TRUE)
## Get the points ready
pred.loc.sp<-pred.loc
coordinates(pred.loc.sp)<-c('Xlog','Yloc')
##krige
Co.krig <- krige(Co ~ 1, Co.points.sp, pred.loc.sp, Co.vm.sph)

## [using ordinary kriging]

cbind(Co.krig@coords,predictions=Co.krig@data$var1.pred,sd=sqrt(Co.krig@data$var1.var))

##        Xlog   Yloc predictions       sd
## [1,] 3.2892 4.7876    5.335775 1.709616
## [2,] 3.9410 1.0972    9.753077 1.976894
## [3,] 2.8938 2.2440    8.792380 2.173261

b) use the fitted spherical with nugget SV model to predict Co concentrations at the grid locations. Plot the predicted concentrations. Your answer is the plot.

## get the grid ready
coordinates(pred.grid) <- c('Xloc','Yloc')  
gridded(pred.grid) <- T  
##krige
Co.krig.grid <- krige(Co ~ 1, Co.points.sp, pred.grid, Co.vm.sph)

## [using ordinary kriging]

spplot(Co.krig.grid, 'var1.pred')

c) Cobalt concentrations in soil are actually not precisely measured. The laboratory that measures cobalt concentrations reports a measurement variance of 0.25 ppm2 (i.e., a sd of 0.5 ppm). Fit a Spherical with nugget and measurement error = 0.25 SV model. What are the estimated parameters of this model?

fit.variogram(Co.v, vgm(14, 'Sph', 1,1,Err=0.25))

## Warning in fit.variogram(Co.v, vgm(14, "Sph", 1, 1, Err = 0.25)): Warning:
## singular model in variogram fit

##   model psill range
## 1   Err  0.25     0
## 2   Nug  1.00     0
## 3   Sph 14.00     1

d) Use the Spherical + nugget + meas. error model to predict Co concentrations at the grid locations. How different are these predictions from the predictions in question 4b (non-measurement-error model)? What, if anything, does this tell you anything about the grid of locations?

Co.krig.grid <- krige(Co ~ 1, Co.points.sp, pred.grid,
                      fit.variogram(Co.v, vgm(14, 'Sph', 1,nugget = 1,Err=0.25)))

## [using ordinary kriging]

spplot(Co.krig.grid, 'var1.pred')

e) Use cross-validation to estimate the root Mean Square Error of Prediction (rMSEP, the square-root of the MSEP). What is that value?

Co.cv <- krige.cv(Co ~ 1, Co.points.sp, Co.vm.sph)
sqrt(mean(Co.cv$residual^2))

## [1] 2.104366

sd(Co.points[,3])

## [1] 3.576046

f) Compare the rMSEP to the sample standard deviation of the Co concentrations. You should see that the rMSEP is smaller. In regression, the rMSEP for predicting an observation at a new X is always larger than the error standard deviation. What is a likely explanation for why kriging predictions show the opposite relationship?