The data I am working with was found at http://www.kriging.com/datasets/. Given on page 10 in the supproting file (http://www.kriging.com/books/Chapter1_PG2000_CS.pdf) is the following data description
This is a soil science application, data supplied by Dick Webster - co-author of Webster & Oliver’s excellent ‘Geostatistics for Environmental Scientists’. Brooms Barn is an agricultural experimental station in East Anglia (UK) which hosts several fields within its area. The data set includes Potassium (K mg/l), Phosphorus (P mg/l) and pH levels in the soil. Over 400 samples were collected on a regular grid at 40 metres spacing.
The data file consists of 434 samples and the following fields for each sample
In the data file, the researchers identified right skewness of potassium (K) and phosphorus (P) as they accounted for this by logging both variables. In my analysis I will address this separately and determine the need to do a transformation on these variables and explore any outliers that may be causing the major skewness.
After exploring the data, I will do an analysis of the acidity (pH) values using bayesian estimation. My analysis will include:
## East North K log10K
## Min. : 1.00 Min. : 1.00 Min. :-9.00 Min. :-9.000
## 1st Qu.: 6.00 1st Qu.:10.00 1st Qu.:20.00 1st Qu.: 1.301
## Median :11.00 Median :17.00 Median :25.00 Median : 1.398
## Mean :10.33 Mean :16.83 Mean :26.23 Mean : 1.375
## 3rd Qu.:14.00 3rd Qu.:25.00 3rd Qu.:30.00 3rd Qu.: 1.477
## Max. :18.00 Max. :31.00 Max. :96.00 Max. : 1.982
## pH P log10P
## Min. :5.500 Min. :-9.000 Min. :-9.0000
## 1st Qu.:7.400 1st Qu.: 2.100 1st Qu.: 0.3222
## Median :8.000 Median : 3.500 Median : 0.5441
## Mean :7.722 Mean : 4.801 Mean : 0.5017
## 3rd Qu.:8.200 3rd Qu.: 5.700 3rd Qu.: 0.7559
## Max. :8.600 Max. :49.000 Max. : 1.6902
Observing the correlation chart below, we determine two outliers as well as right skewness in both the K and P values. We also have reason to beleive there is left skewness in our response variable (pH).
Using the “plotyly” package to determine where the outliers are, we identify two outliers where \(log10P = -9\) and remove them from the dataset.
Exploring the transformation of both the explanatory variables (K and P) as well as our response variable (pH) we determine that logging K and P is appropriate and using BoxCox transformation determine \((\lambda \approx1.85)\) and that squaring pH would be an appropriate transformation. The final correlation plot to be used in my model is shown below.
To continue the analysis, I split my dataset into a training set (303 observations) and testing set (130 observations).
Visualizing the pH levels of the training dataset on the model scale (i.e. \(pH^2\)) and the residuals of the initial linear model as shown below, there appears to be spatial dependance among the data.
Comparing the 2D interpolated plots of potassium and phosphorus as well there appears to be a realtion between the potassium levels and the pH.
##Model Selection
## log10K log10P
## 1 ( 1 ) " " "*"
## 2 ( 1 ) "*" "*"
## [1] 1312.67 1302.92
## [1] 1320.10 1314.06
## log10K log10P
## [1,] "" "*"
## [2,] "*" "*"
## [1] 23116.59 22423.19
According to AIC, BIC, and PRESS, we conclude that including both postassium (K) and phosphorus (P) as predictors is appropriate. Therefore, our non-spatial model is given by: \[\textbf{Y(s)}=\beta_0+\beta_1\textbf{x}_1\textbf{(s)}+\beta_2\textbf{x}_2\textbf{(s)}+\epsilon\textbf{(s)}\] where \(\textbf{Y(s)}\) is \(pH^2\), \(\textbf{x}_1\textbf{(s)}\) is \(log10K\), and \(\textbf{x}_2\textbf{(s)}\) is \(log10P\) the at location s, with \(\epsilon\textbf{(s)}\) independent of location s with mean zero and variance \(\sigma_\epsilon^2\).
##Spatial vs NonSpatial Model Since there is reasoning to beleive there is spatial dependance in the residual plot above, we use the variogram methods and fit the data with an “exponential” covariance function.
## variog: computing omnidirectional variogram
## variog: computing omnidirectional variogram
## variofit: model parameters estimated by WLS (weighted least squares):
## covariance model is: exponential
## parameter estimates:
## tausq sigmasq phi
## 2.2162 72.3953 2.7017
## Practical Range with cor=0.05 for asymptotic range: 8.093433
##
## variofit: minimised weighted sum of squares = 250.6173
Next, I compared the bayesian non-spatial model to the spatial bayesian model with noninformative priors to formally determine the existence of spatial dependance.
## value value
## bar.D 1603.958441 296.8442
## D.bar.Omega 1599.959233 100.0698
## pD 3.999209 196.7744
## DIC 1607.957650 493.6185
## MSPR.nsp MSPR.sp
## [1,] 73.11827 21.03755
Based on our DIC and MSPR, we determine spatial dependance when modeling pH. Recall from above, \(Y(s)=pH^2\) at location s, and let \(X(s)=(1,X_1(s),X_2(s))^T\) and \(\beta=(\beta_0, \beta_1, \beta_2)^T\). Thus, my final model is given by:
\[\text{data model: } Y(s)|\beta,w(s),\sigma_e^2 \text{ ind } \sim Gaussian(X(s)^T\beta+w(s),\sigma_e^2)\] \[\text{process model: } w(s)|\sigma_w^2, \phi \sim Gaussian(0, C(h;\sigma_w^2, \phi))\] \[\text{parameter model: } p(\beta)p(\sigma_w^2)p(\phi)p(\sigma_e^2)\] where \(C(h;\sigma_w^2, \phi)=\sigma_w^2e^{-\phi||h||}, p(\beta) \sim Gaussian(0,10^6I), p(\sigma_w^2)\) and \(p(\sigma_e^2)\) are usually inverse Gamma with hyperparameter \(a=2\) and \(b\) the estimated \(\sigma_w^2=63.58\) and \(\sigma_e^2=7.09\) respectively from variogram methods above, and \(p(\phi)\) is a uniform prior within \((-log(0.05)/d_max,-log(0.01)/d_min),\) where \(d_max\) and \(d_min\) are maximum and minimum distance accross all pairs of locations in the training set.
##Bayesian Estimation Continuing to use the training dataset, I ran three adaptive MCMC chains with different initial values. Observing the plots below, we determine the three chains converge to the same distribution and use the gelman plots to determine the burn-in period to be 15,000 iterations.
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## sigma.sq 1.01 1.02
## tau.sq 1.03 1.10
## phi 1.00 1.00
##
## Multivariate psrf
##
## 1.03
## 2.5% 25% 50% 75% 97.5%
## sigma.sq 53.506 66.857 77.205 91.667 143.717
## tau.sq 0.462 0.968 1.582 2.508 5.608
## phi 0.150 0.252 0.306 0.359 0.464
## 2.5% 25% 50% 75% 97.5%
## (Intercept) 53.463 60.738 64.101 67.719 74.457
## Xlog10K -12.172 -8.008 -5.763 -3.415 1.147
## Xlog10P 5.407 7.345 8.291 9.287 11.121
##Prediction
Observing the predicted vs actual plot and residuals above we can see that our accuracy has increased and continue to look into the patterns. The typical pH range of soil is between 5.5 and 7. Observing the sampling grid we observe many acidic plots. However, where the residual range appears to be the highest there are three feilds or plots in the top that are more alkaline than acidic and fall within the desired soil pH range. This could be due to limestone being installed in these few plots causing the pH to decrease for ideal planting conditions.