Soil pH predictions through Google Earth Engine and R

This is an example of how to predict soil pH in GEE through R at a 30 m resolution. The province to be mapped is Otjozondjupa in Nambia (~1e5 km2). It has a subtropical steppe climate as it lies on a plateau (>1000 m elevation). The soil data is freely available through https://www.data.isric.org and the covariate data came from worldclim (https://developers.google.com/earth-engine/datasets). Many more datasets are available such as DEM, Landsat, Sentinel, etc…..

Requirements - Earth Engine account - earthengine API installed - rgee set up (https://cran.r-project.org/web/packages/rgee/vi gnettes/rgee01.html)

load libraries needed

library(sf) #spatial points to manipulate in r
library(tidyverse) #manipulate and tidy data
library(yardstick) #model evaluation
library(ggpubr) #extra for plotting
library(rgee) #GEE for R

#Initialize GEE in R
ee_Initialize()

## ── rgee 1.1.5 ─────────────────────────────────────── earthengine-api 0.1.323 ── 
##  ✔ user: not_defined
##  ✔ Initializing Google Earth Engine:
 ✔ Initializing Google Earth Engine:  DONE!
## 
 ✔ Earth Engine account: users/trevanflynn 
## ────────────────────────────────────────────────────────────────────────────────

label response (pH), scales (resolution and tile computation size) and load boundaries of the province.

#labels
label = "phaq_val_1" #response (how it is in the database)
scale = 30 #resolution of output

#Get Otjozondjupa province in Namibia (FeatureCollection to Geometry)
worldProvince = ee$FeatureCollection("FAO/GAUL_SIMPLIFIED_500m/2015/level1"); #admin of all provinces 
filterProvince = ee$Filter$eq('ADM1_NAME', 'Otjozondjupa'); #select province
region = worldProvince$filter(filterProvince); #filter to area geometry

#load WoSIS data of pH (point observations = FeatureCollection)
collection = ee$FeatureCollection("projects/tre-projects/assets/WoSIS_pH_H20")$
  select(label)$#select pH 
  filterBounds(region) #clip to region

#check data by plotting
Map$centerObject(region, 6) #zoom to province
Map$addLayer(region) #province

Map$addLayer(collection) #soil observations

#note, you can zoom and change background map.

Obtain covariate image (climatic data - 19 bioclim variabes) and sample data from WoSIS collection. Similar to extract in terra or stars package. We then add x and y to add neighborhood into the covariates.

Note, there are many more covariates available, this is just to see how it works. If an image collection is made to represent the scorpan factors (Mcbratney et al. 2003), convert to an image with ImageCollection$toBands().

#image  = coStack$toBands()
#get worldclim bioclim variables (already in image format)
coStack = ee$Image("WORLDCLIM/V1/BIO")$#get climate data
  clip(region)$#clip to region
  addBands(ee$Image$pixelLonLat())#add x and y as covariates

#resample at resolution desired
coStack = coStack$
  reproject(crs = "EPSG: 3857", scale = scale)$ #sample to 30 m
  resample(mode = 'bilinear') #set resampling mode (can be nearest neighbor or bicubic as well)

#sample covariates at observations
samples = coStack$sampleRegions(
  collection = collection, #collection with response
  properties = list(label), #response
  scale = scale, #resolution
  geometries = T)$ #keep locations
  randomColumn() #add random column for making training/test set

#split data
training = samples$filter(ee$Filter$gt('random',0.2)) # 80% training
validation = samples$filter(ee$Filter$lte('random',0.2)) # 20% testing

#predictor variable names
bands = coStack$bandNames()

Regression classifier, must build the model and set the mode to “REGRESSION” before training the model. For gradient tree boosting (GTB) you only need to specify number of trees to grow.

Default values for GTB: -shrinkage = 0.005 (lower = less over fitting but slower) -samplingRate = 0.7 (bag fraction for each tree) -maxNodes = Inf (max nodes in each tree) -loss = “LeastAbsoluteDeviation” (objective function) -seed = 0 (replicate results)

#build and train
rfReg= ee$Classifier$smileGradientTreeBoost( #model
  numberOfTrees = 500, #500 trees
  shrinkage = 0.1,
  loss = "LeastSquares")$#higher shrinkage = lower time needed (overfits easier)
  setOutputMode("REGRESSION")$ #(default = classification)
  train(features = training,
        classProperty = label,
        inputProperties = bands)

#predict trained RF model over covariate image
pHmap = coStack$classify(rfReg, "predictions")

#plot results (do not need to recenter image) 
Map$addLayer(pHmap, 
             list(min = 7, max = 9, palette = c("blue", "red",
                                              "yellow", "green",
                                              "grey"))) #color

May have to zoom into the image for it to render the map.

Evaluate predictions by predicting over the “validation” set (easiest to convert to sf if small area). Can also sample the predictions with the “validation” set using pHmap$sameRegions(…). Both produce the same results.

#set metrics (a bunch in the yardstick library).
met = metric_set(rmse, rsq, ccc) #lins coefficient, R2 and RMSE 

#predict and convert to sf object (can take some time)
eval = validation$classify(rfReg, "predictions")$ #predict
  select(label, "predictions")%>% #select response and predictions (lower computation)
  ee_as_sf(maxFeatures = 1e13) #set maxFeatures if a large amount of data. Can add arg "via" to use drive and gcs for larger tables.

## Number of features: Calculating ...
Number of features: 55

#print evaluation results
eval %>%
  met(phaq_val_1, predictions)

## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.818
## 2 rsq     standard       0.527
## 3 ccc     standard       0.683

#plot reg line and other stats
#R2 and regline plot
ggplot(data = eval, aes(y = phaq_val_1, x = predictions))+
  geom_point()+
  geom_smooth(method = 'lm', se = F)+
  stat_cor(label.y= 9.5, aes(label = after_stat(rr.label)))+
  stat_regline_equation(label.y = 9)+
  labs(x = "Predictions", y = "Measured", title = "pH")+
  theme_bw()

#covariate importance for top 10
varImp = rfReg$explain()$get("importance")$getInfo()%>% #get importance
  as_tibble()%>%#make tibble
  pivot_longer(cols = c(1:ncol(.)))%>% #long df for plotting
  arrange(desc(value))%>% #sort decreasing values
  slice(1:10) #get top 10

#plot
ggplot(data = varImp, aes(x = reorder(name, -value), y = value))+
  geom_bar(stat ="identity", color = 'black', fill = 'blue')+
  labs(x = "Variable", y = "Importance", title = "Covariate importance")+
  theme_bw()

Not bad results, add more/different covariates and try different models to improve accuracy.

Can get some statistics (both global and local) straight from GEE, however, this can be very slow. Easier to do after the raster is saved or converted into memory. But you can calculate statistics on validation data if you want (not suggested as sf is faster and easier).

#stats on validation set
pred = validation$classify(rfReg, "predictions")$
  select(c(label, "predictions"))

#Pearson's correlation coefficient
r = pred$reduceColumns(
  reducer = ee$Reducer$pearsonsCorrelation(), #reduce to 1 number
  selectors = list(label, "predictions")
)

#coefficient of determination
r2 = ee$Number(r$get("correlation"))$ #get correlation
  pow(2) #square it

#print
print(r2$getInfo())

## [1] 0.5273032

RMSE is a bit more complicated and you need to write a function to get residuals for each row and then another function to calculate RMSE.

#get residuals
addResid <- function(feature) {
  res <- ee$Number(feature$get(label))$ #subtract observed from predicted
    subtract(ee$Number(feature$get("predictions")))
  feature$set(list(res = res)) #create new feature in featureCollection
}

#apply function to FeatureCollection for residuals
res = pred$map(addResid)

#calculate RMSE
rmse = ee$Array(res$aggregate_array("res"))$
  pow(2)$ #square it
  reduce("mean", list(0))$ #get mean for the residuals
  sqrt()

#print RMSE
print(rmse$getInfo())

## [1] 0.8183916

To save or for further analysis, convert to stars object or export to google drive. Converting to stars will load into memory but can take a long time if resolution is high.

#Export
task = ee_image_to_drive(
  image = pHmap,
  description = "pHmap", #name of image
  folder = "rgee_backup", #folder to store in 
  scale = scale,
  region = region
)

#to send to GEE  - task$start() to save to drive or cloud
##END

Soil pH predictions through Google Earth Engine and R

Trevan Flynn

2023-05-29