Spatial- BroomsBarn

Research Proposal

The data I am working with was found at http://www.kriging.com/datasets/. Given on page 10 in the supproting file (http://www.kriging.com/books/Chapter1_PG2000_CS.pdf) is the following data description

This is a soil science application, data supplied by Dick Webster - co-author of Webster & Oliver’s excellent ‘Geostatistics for Environmental Scientists’. Brooms Barn is an agricultural experimental station in East Anglia (UK) which hosts several fields within its area. The data set includes Potassium (K mg/l), Phosphorus (P mg/l) and pH levels in the soil. Over 400 samples were collected on a regular grid at 40 metres spacing.

The data file consists of 434 samples and the following fields for each sample

East and North position on the sampling grid - this is not in metres but in grid spacing, i.e. 1 unit of distance equals 40 metres;
K - potassium value in the soil, mg/l;
log10 K - logarithms to the base 10 of K values;
pH - universal measurements for acidity (or lack of) in the soil;
P - phosphorus levels in the soil, mg/l;
log10 P - logarithm to the base 10 for P values.

In the data file, the researchers identified right skewness of potassium (K) and phosphorus (P) as they accounted for this by logging both variables. In my analysis I will address this separately and determine the need to do a transformation on these variables and explore any outliers that may be causing the major skewness.

After exploring the data, I will do an analysis of the acidity (pH) values using bayesian estimation. My analysis will include:

Developing the “best” model by determining which predictors should be included in the model as well as comparing the spatial and non-spatial model.
Fitting the model on a training set with Bayesian methods.
Prdicting the pH at the testing set locations and comparing these to the actual values.

Data Exploration

##       East           North             K             log10K      
##  Min.   : 1.00   Min.   : 1.00   Min.   :-9.00   Min.   :-9.000  
##  1st Qu.: 6.00   1st Qu.:10.00   1st Qu.:20.00   1st Qu.: 1.301  
##  Median :11.00   Median :17.00   Median :25.00   Median : 1.398  
##  Mean   :10.33   Mean   :16.83   Mean   :26.23   Mean   : 1.375  
##  3rd Qu.:14.00   3rd Qu.:25.00   3rd Qu.:30.00   3rd Qu.: 1.477  
##  Max.   :18.00   Max.   :31.00   Max.   :96.00   Max.   : 1.982  
##        pH              P              log10P       
##  Min.   :5.500   Min.   :-9.000   Min.   :-9.0000  
##  1st Qu.:7.400   1st Qu.: 2.100   1st Qu.: 0.3222  
##  Median :8.000   Median : 3.500   Median : 0.5441  
##  Mean   :7.722   Mean   : 4.801   Mean   : 0.5017  
##  3rd Qu.:8.200   3rd Qu.: 5.700   3rd Qu.: 0.7559  
##  Max.   :8.600   Max.   :49.000   Max.   : 1.6902

Observing the correlation chart below, we determine two outliers as well as right skewness in both the K and P values. We also have reason to beleive there is left skewness in our response variable (pH).

Using the “plotyly” package to determine where the outliers are, we identify two outliers where \(log10P = -9\) and remove them from the dataset.

Exploring the transformation of both the explanatory variables (K and P) as well as our response variable (pH) we determine that logging K and P is appropriate and using BoxCox transformation determine \((\lambda \approx1.85)\) and that squaring pH would be an appropriate transformation. The final correlation plot to be used in my model is shown below.

To continue the analysis, I split my dataset into a training set (303 observations) and testing set (130 observations).

Visualizing the pH levels of the training dataset on the model scale (i.e. \(pH^2\)) and the residuals of the initial linear model as shown below, there appears to be spatial dependance among the data.

Comparing the 2D interpolated plots of potassium and phosphorus as well there appears to be a realtion between the potassium levels and the pH.

##Model Selection

##          log10K log10P
## 1  ( 1 ) " "    "*"   
## 2  ( 1 ) "*"    "*"

## [1] 1312.67 1302.92

## [1] 1320.10 1314.06

##      log10K log10P
## [1,] ""     "*"   
## [2,] "*"    "*"

## [1] 23116.59 22423.19

According to AIC, BIC, and PRESS, we conclude that including both postassium (K) and phosphorus (P) as predictors is appropriate. Therefore, our non-spatial model is given by: \[\textbf{Y(s)}=\beta_0+\beta_1\textbf{x}_1\textbf{(s)}+\beta_2\textbf{x}_2\textbf{(s)}+\epsilon\textbf{(s)}\] where \(\textbf{Y(s)}\) is \(pH^2\), \(\textbf{x}_1\textbf{(s)}\) is \(log10K\), and \(\textbf{x}_2\textbf{(s)}\) is \(log10P\) the at location s, with \(\epsilon\textbf{(s)}\) independent of location s with mean zero and variance \(\sigma_\epsilon^2\).

##Spatial vs NonSpatial Model Since there is reasoning to beleive there is spatial dependance in the residual plot above, we use the variogram methods and fit the data with an “exponential” covariance function.

## variog: computing omnidirectional variogram

## variog: computing omnidirectional variogram

## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: exponential
## parameter estimates:
##   tausq sigmasq     phi 
##  2.2162 72.3953  2.7017 
## Practical Range with cor=0.05 for asymptotic range: 8.093433
## 
## variofit: minimised weighted sum of squares = 250.6173

Next, I compared the bayesian non-spatial model to the spatial bayesian model with noninformative priors to formally determine the existence of spatial dependance.

##                   value    value
## bar.D       1603.958441 296.8442
## D.bar.Omega 1599.959233 100.0698
## pD             3.999209 196.7744
## DIC         1607.957650 493.6185

##      MSPR.nsp  MSPR.sp
## [1,] 73.11827 21.03755

Based on our DIC and MSPR, we determine spatial dependance when modeling pH. Recall from above, \(Y(s)=pH^2\) at location s, and let \(X(s)=(1,X_1(s),X_2(s))^T\) and \(\beta=(\beta_0, \beta_1, \beta_2)^T\). Thus, my final model is given by:

\[\text{data model: } Y(s)|\beta,w(s),\sigma_e^2 \text{ ind } \sim Gaussian(X(s)^T\beta+w(s),\sigma_e^2)\] \[\text{process model: } w(s)|\sigma_w^2, \phi \sim Gaussian(0, C(h;\sigma_w^2, \phi))\] \[\text{parameter model: } p(\beta)p(\sigma_w^2)p(\phi)p(\sigma_e^2)\] where \(C(h;\sigma_w^2, \phi)=\sigma_w^2e^{-\phi||h||}, p(\beta) \sim Gaussian(0,10^6I), p(\sigma_w^2)\) and \(p(\sigma_e^2)\) are usually inverse Gamma with hyperparameter \(a=2\) and \(b\) the estimated \(\sigma_w^2=63.58\) and \(\sigma_e^2=7.09\) respectively from variogram methods above, and \(p(\phi)\) is a uniform prior within \((-log(0.05)/d_max,-log(0.01)/d_min),\) where \(d_max\) and \(d_min\) are maximum and minimum distance accross all pairs of locations in the training set.

##Bayesian Estimation Continuing to use the training dataset, I ran three adaptive MCMC chains with different initial values. Observing the plots below, we determine the three chains converge to the same distribution and use the gelman plots to determine the burn-in period to be 15,000 iterations.

## Potential scale reduction factors:
## 
##          Point est. Upper C.I.
## sigma.sq       1.01       1.02
## tau.sq         1.03       1.10
## phi            1.00       1.00
## 
## Multivariate psrf
## 
## 1.03

##            2.5%    25%    50%    75%   97.5%
## sigma.sq 53.506 66.857 77.205 91.667 143.717
## tau.sq    0.462  0.968  1.582  2.508   5.608
## phi       0.150  0.252  0.306  0.359   0.464

##                2.5%    25%    50%    75%  97.5%
## (Intercept)  53.463 60.738 64.101 67.719 74.457
## Xlog10K     -12.172 -8.008 -5.763 -3.415  1.147
## Xlog10P       5.407  7.345  8.291  9.287 11.121

##Prediction

Observing the predicted vs actual plot and residuals above we can see that our accuracy has increased and continue to look into the patterns. The typical pH range of soil is between 5.5 and 7. Observing the sampling grid we observe many acidic plots. However, where the residual range appears to be the highest there are three feilds or plots in the top that are more alkaline than acidic and fall within the desired soil pH range. This could be due to limestone being installed in these few plots causing the pH to decrease for ideal planting conditions.

Spatial- BroomsBarn

Emily Robinson

April 30, 2018

Research Proposal

Data Exploration