Mapping soil functional properties using multilevel models

Tor-Gunnar Vågen (ICRAF), Leigh Winowiecki (CIAT), Lulseged T. Desta (CIAT) and Jerome E. Tondoh (CIAT)

Required packages:

Introduction

In this document we present examples of R functions and scripts that may be used to generate maps of soil functional properties based on for example remote sensing data, climate surfaces and/or digital elevation models and their derivatives. The data used come from the Africa Soil Information Service (AfSIS) project, which consists of a set of sites stratified across sub-Saharan Africa based on Koeppen-Geiger climate zones[1]. The methodology used was developed for land health assessment as is referred to as the Land Degradation Surveillance Framework (LDSF)[2,3]. A copy of the field guide can be accessed through this link (requires registration).

We demonstrate a prediction model for soil organic carbon (SOC), using linear mixed effects (lme) or multilevel modeling approaches. Such models are well suited for developing local maps of soil properties. We use surface reflectance calibrated Landsat ETM+ reflectance from the GLS2010 library as predictors. The example data are from sites in Tanzania, Mozambique and Mali, respectively, but have been made anonymous in order to make this tutorial publicly available.

The code presented here is intended as a starting point. Other approaches, such as gradient boosting models (gbm) and bayesian additive regression trees (BART) may be interesting to explore as well. There is no such thing as “the right model” as models are essentially wrong and the selection of model depends on the application as well. You are encouraged to use this tutorial as a starting point and we hope it is of help in developing similar models and maps using your own data.

Example data

The example dataset used in this exercise can be downloaded through the Dataverse site here as well.

Data citation: Vågen, Tor-G; Winowiecki, Leigh A; Tondoh, Jerome E; Desta, Lulseged T, 2013-01-01, “Africa Soil Information Service (AfSIS) - Soil Health Mapping”, http://hdl.handle.net/1902.1/19793 v2012.

setwd("~/Desktop/afsisData"); #Change this to where you saved the data
dat <- read.table("afsisSoil.csv", sep=",", header=T);

To see the content and the structure of the data, you may issue the following commands:

head(dat)
##   Country Site Cluster Plot DepthTop DepthBottom  SOC elev  map mfi pci
## 1       1    1       2    7        0          20 2937 1099 1299 233  18
## 2       1    1       2    7        0          20 2937 1099 1299 233  18
## 3       1    1       1    2        0          20 3043 1067 1299 233  18
## 4       1    1       1    2        0          20 3043 1067 1299 233  18
## 5       1    1       1    3        0          20 3312 1064 1299 233  18
## 6       1    1       1    3        0          20 3312 1064 1299 233  18
##   slope flowaccum SRFIBL SRFIGL SRFIRL SRFINA SRFIMB SRFIMC PBI  PVI
## 1    13        27    319    632    927   2300   2487   1497 681 1192
## 2    13        27    319    632    927   2300   2487   1497 681 1192
## 3     2         0    319    661    975   2251   2711   1567 680 1173
## 4     2         0    319    661    975   2251   2711   1567 680 1173
## 5     4        26    232    429    642   1860   1815   1037 540 1168
## 6     4        26    232    429    642   1860   1815   1037 540 1168
str(dat)
## 'data.frame':    6630 obs. of  21 variables:
##  $ Country    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Site       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Cluster    : int  2 2 1 1 1 1 1 1 1 1 ...
##  $ Plot       : int  7 7 2 2 3 3 4 4 5 5 ...
##  $ DepthTop   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DepthBottom: int  20 20 20 20 20 20 20 20 20 20 ...
##  $ SOC        : int  2937 2937 3043 3043 3312 3312 2159 2159 3558 3558 ...
##  $ elev       : int  1099 1099 1067 1067 1064 1064 1094 1094 1087 1087 ...
##  $ map        : num  1299 1299 1299 1299 1299 ...
##  $ mfi        : num  233 233 233 233 233 233 233 233 233 233 ...
##  $ pci        : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ slope      : int  13 13 2 2 4 4 11 11 2 2 ...
##  $ flowaccum  : int  27 27 0 0 26 26 0 0 0 0 ...
##  $ SRFIBL     : int  319 319 319 319 232 232 348 348 261 261 ...
##  $ SRFIGL     : int  632 632 661 661 429 429 545 545 400 400 ...
##  $ SRFIRL     : int  927 927 975 975 642 642 666 666 666 666 ...
##  $ SRFINA     : int  2300 2300 2251 2251 1860 1860 2398 2398 1909 1909 ...
##  $ SRFIMB     : int  2487 2487 2711 2711 1815 1815 2064 2064 1865 1865 ...
##  $ SRFIMC     : int  1497 1497 1567 1567 1037 1037 1060 1060 1060 1060 ...
##  $ PBI        : int  681 681 680 680 540 540 652 652 554 554 ...
##  $ PVI        : int  1192 1192 1173 1173 1168 1168 1262 1262 1172 1172 ...

The variables are:

Let's take a look at the SOC data by Site…

library(lattice)
bwplot(Site ~ SOC, dat, box.ratio = 0.05, fill = "brown")

plot of chunk unnamed-chunk-1

As is evident from the plot above, there is great variation in SOC between these sites. Note that the SOC values are concentrations multiplied by 100 as we will be producing the final map in 16-bit format.

Model development (lme)

Before proceeding it is often a good idea to look at the distribution of the variable we are modeling to determine if any transformations are required.

library(lattice)
densityplot(~SOC, dat)

plot of chunk unnamed-chunk-2

library(nlme)
soc.lme1 <- lme(log(SOC) ~ SRFIGL + SRFIRL + SRFINA + SRFIMB + SRFIMC + PBI + 
    PVI + map, random = ~1 | Site/Cluster, dat)
summary(soc.lme1)
## Linear mixed-effects model fit by REML
##  Data: dat 
##    AIC  BIC logLik
##   4629 4710  -2302
## 
## Random effects:
##  Formula: ~1 | Site
##         (Intercept)
## StdDev:      0.6534
## 
##  Formula: ~1 | Cluster %in% Site
##         (Intercept) Residual
## StdDev:      0.2618    0.301
## 
## Fixed effects: log(SOC) ~ SRFIGL + SRFIRL + SRFINA + SRFIMB + SRFIMC + PBI +      PVI + map 
##              Value Std.Error   DF t-value p-value
## (Intercept) 11.811    13.021 5960   0.907  0.3644
## SRFIGL       0.000     0.000 5960   6.035  0.0000
## SRFIRL       0.001     0.004 5960   0.265  0.7909
## SRFINA       0.003     0.004 5960   0.786  0.4319
## SRFIMB       0.000     0.000 5960   3.893  0.0001
## SRFIMC       0.000     0.000 5960  -5.652  0.0000
## PBI         -0.012     0.014 5960  -0.851  0.3950
## PVI         -0.004     0.014 5960  -0.286  0.7749
## map          0.000     0.000 5960  -1.231  0.2182
##  Correlation: 
##        (Intr) SRFIGL SRFIRL SRFINA SRFIMB SRFIMC PBI    PVI   
## SRFIGL -0.009                                                 
## SRFIRL -0.684 -0.005                                          
## SRFINA  0.726 -0.006  0.004                                   
## SRFIMB -0.005  0.056  0.025  0.017                            
## SRFIMC -0.014 -0.006 -0.015 -0.027 -0.794                     
## PBI    -0.089 -0.003 -0.665 -0.749 -0.033  0.029              
## PVI    -0.999  0.008  0.721 -0.690  0.006  0.013  0.038       
## map    -0.012  0.014 -0.006 -0.006 -0.004 -0.006  0.008  0.000
## 
## Standardized Within-Group Residuals:
##     Min      Q1     Med      Q3     Max 
## -4.8010 -0.5373 -0.0103  0.5335  5.2811 
## 
## Number of Observations: 6630
## Number of Groups: 
##              Site Cluster %in% Site 
##                42               662

The model summary above indicates that we may want to simplify the model. Mean annual rainfall does not contribute much to the model when we include Landsat reflectance. After a few iterations with adding/removing variables from the model and testing each model we end up with:

library(nlme)
soc.lme2 <- lme(log(SOC) ~ SRFIGL + SRFINA + SRFIMB + SRFIMC + PVI, random = ~1 | 
    Site/Cluster, dat)
summary(soc.lme2)
## Linear mixed-effects model fit by REML
##  Data: dat 
##    AIC  BIC logLik
##   4592 4654  -2287
## 
## Random effects:
##  Formula: ~1 | Site
##         (Intercept)
## StdDev:       0.625
## 
##  Formula: ~1 | Cluster %in% Site
##         (Intercept) Residual
## StdDev:      0.2623    0.301
## 
## Fixed effects: log(SOC) ~ SRFIGL + SRFINA + SRFIMB + SRFIMC + PVI 
##             Value Std.Error   DF t-value p-value
## (Intercept) 5.410    0.3314 5963  16.322   0e+00
## SRFIGL      0.000    0.0001 5963   6.046   0e+00
## SRFINA      0.000    0.0001 5963  -6.904   0e+00
## SRFIMB      0.000    0.0000 5963   3.842   1e-04
## SRFIMC      0.000    0.0001 5963  -5.613   0e+00
## PVI         0.002    0.0003 5963   6.262   0e+00
##  Correlation: 
##        (Intr) SRFIGL SRFINA SRFIMB SRFIMC
## SRFIGL -0.682                            
## SRFINA  0.888 -0.777                     
## SRFIMB -0.158  0.057 -0.283              
## SRFIMC -0.252 -0.005 -0.113 -0.794       
## PVI    -0.950  0.729 -0.952  0.133  0.283
## 
## Standardized Within-Group Residuals:
##     Min      Q1     Med      Q3     Max 
## -4.7859 -0.5358 -0.0112  0.5346  5.2935 
## 
## Number of Observations: 6630
## Number of Groups: 
##              Site Cluster %in% Site 
##                42               662

The residuals from soc.lme2 look reasonable as well:

plot(soc.lme2)

plot of chunk unnamed-chunk-4


# Measured versus predicted SOC (back-transforming the predicted values):
library(car)
## Loading required package: MASS
## Loading required package: nnet
scatterplot(exp(predict(soc.lme2)) ~ SOC, xlab = "Measured SOC", ylab = "Predicted SOC", 
    xlim = c(0, 8000), ylim = c(0, 8000), smooth = F, dat)

plot of chunk unnamed-chunk-4

The above model has an adjusted R-squared of 0.85. As shown in the summary variance is highest between sites, but also relatively high within many of the sites (between sampling clusters).

Mappping SOC

For this exercise, we show an example of how to use the model developed earlier (soc.lme2) to predict SOC from satellite imagery. The images used are from the GLS2005 archive from sites in Mali, Tanzania and Mozambique. We have clipped the individual sites to reduce the size of the dataset. The images are in the original zip archive downloaded earlier.

Note that the bands must be named consistent with the covariates in the model and that we create a “dummy” band for each site which contains the site ID so that the model uses site-specific coefficients in the predictions of SOC.

library(raster);
## Loading required package: sp
## raster 2.0-41 (21-December-2012)
## Attaching package: 'raster'
## The following object(s) are masked from 'package:MASS':
## 
## area, select
## The following object(s) are masked from 'package:nlme':
## 
## getData
setwd("~/Desktop/afsisData"); #Change this to where you saved the data
rast1 <- stack(raster("1/SRFIGL.tif"),raster("1/SRFINA.tif"),raster("1/SRFIMB.tif"),raster("1/SRFIMC.tif"),raster("1/PVI.tif"));
site <- raster(rast1,1);
names(site) <- 'Site';
site[site>0] <- 7;
rast1 <- stack(rast1,site);

rast2<- stack(raster("2/SRFIGL.tif"),raster("2/SRFINA.tif"),raster("2/SRFIMB.tif"),raster("2/SRFIMC.tif"),raster("2/PVI.tif"));
site <- raster(rast2,1);
names(site) <- 'Site';
site[site>0] <- 39;
rast2 <- stack(rast2,site);

rast3<- stack(raster("3/SRFIGL.tif"),raster("3/SRFINA.tif"),raster("3/SRFIMB.tif"),raster("3/SRFIMC.tif"),raster("3/PVI.tif"));
site <- raster(rast3,1);
names(site) <- 'Site';
site[site>0] <- 1;
rast3 <- stack(rast3,site);

Let's apply soc.lme2 to the imagery and display the results.

mapSOC1 <- exp(predict(rast1, soc.lme2, level=1));
mapSOC2 <- exp(predict(rast2, soc.lme2, level=1));
mapSOC3 <- exp(predict(rast3, soc.lme2, level=1));


col=colorRampPalette(c("yellow","white","wheat","black"))(255); #You may want to play with these!
zlim=c(0,3500) #Or an appropriate stretch for the site you are modeling

plot(mapSOC1, col=col, main="Map of SOC (g/kg) for Site 1 (Mali)", xlab="Longitude", ylab="Latitude", zlim=zlim);

plot of chunk generateMap

plot(mapSOC2, col=col, main="Map of SOC (g/kg) for Site 2 (Mozambique)", xlab="Longitude", ylab="Latitude", zlim=zlim);

plot of chunk generateMap

plot(mapSOC3, col=col, main="Map of SOC (g/kg) for Site 3 (Tanzania)", xlab="Longitude", ylab="Latitude", zlim=zlim);

plot of chunk generateMap

Note that the model we use above is the “1-level” model, which means that we are using the “site-specific” model coefficients from soc.lme2. Given the size of the training dataset this model should give us reasonable predictions, but it is useful to compare the distribution of predicted SOC from the map surface and the measured values used to develop the model.

mapSOC.val1 <- getValues(mapSOC1)
mapSOC.val2 <- getValues(mapSOC2)
mapSOC.val3 <- getValues(mapSOC3)

plot(density(na.omit(mapSOC.val1)), col = "orange", lwd = 2, ylim = c(0, 0.008), 
    xlim = c(0, 3000), main = "")
# lines(density(na.omit(dat[dat$Site==7,]$SOC)), col='orange', lwd=2,
# lty=2);

lines(density(na.omit(mapSOC.val2)), col = "red", lwd = 2, main = "")

lines(density(na.omit(mapSOC.val3)), col = "black", lwd = 2, main = "")

legend("topright", bty = "n", c("Site 1", "Site 2", "Site 3"), lty = c(1, 2, 
    2), col = c("orange", "red", "black"), lwd = c(2, 2, 2))

plot of chunk generateSummary

Compare the above density plots to the boxplots for the measured values by site. The corresponding sites in the boxplot are:

References

[1] Kottek, M., Grieser, J., Beck, C., Rudolf, B., & Rubel, F. (2006). World Map of the Köppen-Geiger climate classification updated. Meteorologische Zeitschrift, 15(3), 259–263. doi:10.1127/0941-2948/2006/0130

[2] Vågen, T.-G., Davey, F., & Shepherd, K. D. (2012). Land health surveillance: Mapping soil carbon in Kenyan rangelands. In P. K. R. Nair & D. Garrity (Eds.), Agroforestry - The Future of Global Land Use. Springer.

[3] Vågen, T.-G., Shepherd, K. D., & Walsh, M. G. (2006). Sensing landscape level change in soil fertility following deforestation and conversion in the highlands of Madagascar using Vis-NIR spectroscopy. Geoderma, 133(3-4), 281–294. doi:10.1016/j.geoderma.2005.07.014

[4] Vågen, T.-G., Winowiecki, L. A., Tamene Desta, L., & Tondoh, J. E. (2010). The Land Degradation Surveillance Framework (LDSF) - Field Guide. 14 p. Nairobi, Kenya: World Agroforestry Centre.