This is a short introduction to the process of using predictive machine learning (ML) models in R. We will not be going into optimisation, which is an important part of ML but differs for every model, so it is beyond the scope of this short intro. For more information on predictive models see "Applied Predictive Modeling" by Kuhn & Johnson (2013). I will attach the book in an email.
Things to know about machine learning:
1) ML has nothing to do with CAUSATION, it is only CORRELATION!!!
2) A lot of the assumptions that linear models have do not apply to most ML algorithms.
Residuals do not have to have a normal distribution (unless regression kriging).
Data does not need to be uncorrelated (i.e., does not need a randomized sample design).
3) However, models often improve if you do obey the usual rules. Most importantly, we want our independent variables to be uncorrelated (basically impossible in practice though).
4) We want our models to generalize so they can be used for many things, not overfit.
5) There are always errors, which we also need to model.
6) Machine learning can only learn from information we provide.
7) The independent variables, or predictors, are referred to as covariates because they are treated differently to predictors in standard statistical models.
For example, spectral data from satellites actually depend on the soil, vegetation, human activity, etc., but we use them as predictors when, for standard methods, the relationship would be modelled the other way around.
ML packages in R?
There are many packages that can be used for ML in R for different algorithms, but there are two packages through which most models can be run and which standardise everything.
1) The caret package - has the most models combined into one package and is the easiest to use and put into functions.
2) The tidymodels package - is the successor of the caret package and consists of many packages. It is more difficult to learn and is still in the development phase. However, there are more options, such as multi-output models, better pre-processing and other nice features.
There are other packages such as RWeka, which has some nice algorithms (especially for unsupervised learning) but is not used as much any more. For deep learning the keras R package is the best, and if you want to integrate with Google Earth Engine (GEE) the rgee package and rafikisol (my package :)) are great. For this tutorial, we will use a package known as partykit, which runs decision trees (dt) (we will discuss these when we get to them).
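To give a feel for how these wrapper packages standardise things, here is a minimal sketch (not run in this tutorial) of fitting the same kind of model with caret and with tidymodels; `mydata` and the response `y` are hypothetical placeholders.

#hypothetical data frame `mydata` with a numeric response y and some predictor columns
library(caret)
#caret: one train() call, with the model named by a string
fitCaret = train(y ~ ., data = mydata, method = "rf",
                 trControl = trainControl(method = "cv", number = 5))

library(tidymodels)
#tidymodels: specify the model, set the engine, then fit
fitTidy = rand_forest(mode = "regression") %>%
  set_engine("ranger") %>%
  fit(y ~ ., data = mydata)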
Spatial packages
Since we deal with spatial data a lot in the environmental sciences, we need spatial packages. For points, lines and polygons the best is the sf package (all packages use this format now), and for raster images the stars and terra packages are the best. Stars can be used for a lot more things and aligns with the sf package better (same author) than the terra package does, but it is a bit tricky at first. Either is great though.
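As a quick, hedged illustration (the file name is just a placeholder), the same raster can be read with either package, and in recent versions of stars a terra SpatRaster can be converted with st_as_stars() if a function needs the stars format:

library(stars)
library(terra)
#read the same (placeholder) raster with either package
rStars = read_stars("some_raster.tif") #stars object
rTerra = rast("some_raster.tif")       #terra SpatRaster
#convert terra to stars when a function needs the stars format (recent stars versions)
rStars2 = st_as_stars(rTerra)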
Spectral packages
soilspec, plantspec and aqp (some spectral functions) are commonly used, and there are a lot more. Data pre-processing is a big step with spectral data, and the choice of package sometimes depends on which instrument is used and which file format the instrument puts out. A good book is "Soil Spectral Inference with R" by Wadoux et al. (2020), but I do not have access to that one and, unfortunately, have no data to run an example with that type of data.
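I cannot run a spectral example here, but as a hedged sketch of what the pre-processing step typically looks like, something like the prospectr package (not used elsewhere in this tutorial) can smooth and scatter-correct the spectra; `spectra` is a hypothetical matrix with one row per sample and one column per wavelength.

library(prospectr)
#Savitzky-Golay filter: first derivative (m = 1), 2nd order polynomial, 11-point window
sg = savitzkyGolay(spectra, m = 1, p = 2, w = 11)
#standard normal variate to correct for scatter effects
snv = standardNormalVariate(sg)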
General steps:
1) Load packages, data and check the data distribution.
2) Gather or load covariates and combine them with the measured data.
Covariates are something we could go into in depth, but see McBratney et al. (2003) for the "scorpan" factors, which, depending on scale, are the best framework we have so far.
3) Split the data into training and testing sets (usually 80:20 or 70:30 depending on the algorithm used).
Any data sampling method can be used (e.g., random, stratified, cLHS, kriging range distance, and more); it just depends what you think is best.
4) Choose the model and train it on the 80/70% training set.
5) In R we generally predict on the testing set first (before spatial predictions) so we can see the accuracy and adjust the model until we are satisfied.
With GEE we can predict on the covariates right away because it doesn't matter there, and it's actually good to see the result visually to help with decision making.
6) The most useful evaluation statistics for regression models are Lin's concordance correlation coefficient (CCC) and RMSE. We don't care about the R2 that much because we want a 1:1 line and R2 doesn't give us much information on that (see the small CCC sketch after this list).
7) Once satisfied, we predict the model over the covariates to make a map, which can be used in other GIS systems or for further analysis in R.
The raster image is easier to manipulate and do almost anything with in R once you get used to it, but QGIS works just fine and runs off Python, so it is probably easier for most people.
8) Since we also model errors, the spatial errors or uncertainty of the models need to be displayed as well. Common ways to display these spatially are the standard deviation, standard error or prediction range limit. These maps are great for seeing where the model performed well and can give insight into why; they hold a lot of information and are one of the most important parts. This is easiest with regression kriging because kriging gives the variance automatically.
9) Since the uncertainties are also modelled, we must use sensitivity analysis to see how well we modelled them. We usually do this through the prediction interval coverage probability (PICP); however, it is a parametric approach so we need normal residuals for it to be accurate.
10) Then we are finished and we can publish and make millions of $.
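As a small sketch of why CCC is preferred over R2, here is how Lin's concordance and RMSE can be computed by hand; `obs` and `pred` are hypothetical vectors of observed and predicted values (later on we just use the yardstick versions).

#Lin's concordance: penalises deviation from the 1:1 line, not just scatter
cccManual = function(obs, pred){
  2 * cov(obs, pred) / (var(obs) + var(pred) + (mean(obs) - mean(pred))^2)
}
#RMSE: average size of the prediction errors, in the units of the response
rmseManual = function(obs, pred) sqrt(mean((obs - pred)^2))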
This tutorial will be on predicting SOC% in a catchment (317 km2) in eastern South Africa known as the Midlands, which has a semi-humid environment and shale parent material with granite dykes scattered through the catchment (Wiese et al., 2016; Flynn et al., 2022). For covariates, we will just use a couple of topographic covariates at a 30 m resolution derived from the SRTM data, to keep the file size low (for sharing). You can download the DEM with the geodata package really easily and derive these simple covariates in R. We will also use pipe commands ("%>%") because they keep the code simple to read and follow Python logic more than base R logic.
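As a hedged sketch of how such covariates could be derived (geodata's elevation_3s() serves ~90 m SRTM tiles rather than the 30 m data used here, and the coordinates below are only an approximate, hypothetical location for the catchment), something like this would work:

library(geodata)
library(terra)
#download an SRTM tile covering (roughly) the study area
dem = elevation_3s(lon = 30, lat = -29.5, path = tempdir())
#derive the terrain covariates used below and stack them with the DEM
covs = c(dem, terrain(dem, v = c("slope", "aspect", "TPI", "TRI", "roughness", "flowdir")))
names(covs)[1] = "DEM"
#write out for use with stars
writeRaster(covs, "Terrain_covariates.tif", overwrite = TRUE)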
Step 1: Packages, data and get summary statistics
##packages - the fewer packages used the better
library(sf) #shapefiles
library(stars) #raster images
library(tidymodels) #everything needed for ML and evaluation statistics, but it has bugs and won't run partykit even though it says it does
library(tidyverse) #a collection of packages to clean and manipulate data
library(partykit) #we have to run our decision tree directly from this package
library(feedlot) #lots of nice features that we are putting into this package
##spatial data-----------------------------------------------------------------
#SOC point data
df = read_sf("profiles.shp")
plot(df[, 2])
#Load terrain covariates (raster image)
cov = read_stars("Terrain_covariates.tif")
plot(cov)
#summary statistics
sumStat(data = df, columns = 2)
## # A tibble: 1 × 11
## # Groups: variable [1]
## variable min mean median max sd q25 q75 skewness kurtosis CV
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SOC 2.21 9.09 8.09 22.4 4.39 5.74 12.0 0.777 -0.0772 1.36
Now we have all the data needed to build, train and predict with the model. So we will extract the covariate values at each observation and then split into 80% training and 20% testing.
#First, extract the covariates at each observation (the points and covariates have to be in the same coordinate system)
dat = cov %>%
  st_extract(df) %>% #extract covariate values at each point
  st_as_sf() %>% #turn back into points
  st_join(df[, 2]) %>% #join with the SOC data
  na.omit() #remove any rows with NA values
##now we can split the data into training and testing; everything will have the same name if done this way, which makes everything much easier.
#We will use a stratified split to get a good distribution of the data
spl = initial_split(data = dat, prop = 0.8, strata = "SOC", breaks = 5)
Note: this function creates a split object, which can be passed to other functions and is important when doing some advanced things in tidymodels, but it is not the actual split data.
##now split data
#get 80% training data
tr = training(spl)
#get 20% testing data
ts = testing(spl)
#let's see the covariate distributions in the training and testing sets
densDistro(training = tr, testing = ts, columns = 1:7)
#then we can see the distribution of SOC in the two sets
densDistro(training = tr, testing = ts, type = "boxplot", columns = 8)
The distributions are pretty similar for the covariates and SOC, so I would be OK with running the models now. We will run a decision tree known as a conditional inference tree (ctree). Decision trees are an ML algorithm that tries to split the dependent variable (e.g., SOC in our case) using the covariates until no more splits can be made or a user-defined criterion is met. I won't go into what conditional inference actually is, but this method is nice because it splits the tree into more homogeneous groups based on statistics that are commonly known, such as a Bonferroni post hoc test. It also displays the ctree nicely, so we can see how it made each decision and how the covariates helped split the SOC (covariate importance).
Also, we can now drop the spatial aspect of the data because we have the covariates and SOC together, and it also makes the model run faster. ML is not actually a spatial model, but we can add covariates like longitude and latitude, which gives a pseudo-spatial model, or regression krige, which turns it spatial. Basically, ML does not care what predictions it makes in geographic space; it only cares about correlation. For now we will just keep the terrain covariates and skip the other parts.
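If you did want the pseudo-spatial version, a minimal sketch (not used further here) is to bind the coordinates on as extra columns before dropping the geometry:

#add X and Y as covariates, then drop the geometry as before
trXY = tr %>%
  cbind(st_coordinates(.)) %>%
  st_drop_geometry()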
#let's drop the spatial aspect
ntr = tr %>% st_drop_geometry()
names(ntr) #see there is no more geometry column and it is now a data frame
## [1] "DEM" "slope" "aspect" "TPI" "TRI" "roughness"
## [7] "flowdir" "SOC"
#do the same for the test set
nts = ts %>% st_drop_geometry()
#let's set some controls for the model so we know how the dt will split. We will use the Bonferroni test with a 90% confidence level for each split
trC = ctree_control(testtype = "Bonferroni", mincriterion = 0.90)
#we will make and fit a model using the ctree function in the partykit package. The full stop after ~ indicates to use all other columns as covariates
set.seed(135) #we set the seed just so the results are the same
mod = ctree(SOC ~., data = ntr, control = trC)
#now we can plot the model to see what it did
plot(mod)
This is why I like the ctree: it shows which covariates split the data the best (variable importance), which values it split on, and the distribution of SOC at each terminal node (last split). I often use this to find significant differences and all sorts of things, and you can clearly see the inner workings of the model. I actually use them mostly for summary statistics, as they tend to overfit and are not that good for actual regression, but they form the basis for better models like random forest, stochastic gradient boosting and cubist.
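As a hedged aside (not part of the main workflow here), partykit also provides cforest(), which averages many conditional inference trees into a conditional random forest; a minimal sketch would be:

#fit a conditional random forest on the same training data (ntree kept small here)
set.seed(135)
rfMod = cforest(SOC ~ ., data = ntr, ntree = 100)
#predict on the test set the same way as with the single tree
rfPred = predict(rfMod, newdata = nts)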
Now we can evaluate the model and see how it performed. With this model and number of covariates, I wouldn't expect much really.
#predict on the testing set and add the predictions to it
nts$pred = predict(mod, nts)
#set evaluation statistics
mets = metric_set(ccc, rmse, rpiq)
#show results
nts %>%
mets(SOC, pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 ccc standard 0.451
## 2 rmse standard 3.78
## 3 rpiq standard 1.57
As expected, not the best model, but that's OK, this is just an example. Now we can predict over the covariates to see it spatially.
#predict over covariates
socMap = cov %>%
split()%>% #stars can be tricky but we need to split the array to predict
predict(., mod) #predict
#plot the predictions
plotStar(socMap, breaks = 7)
Just from the predictions you can see why a dt might be a good classifier (what it is meant to be); they actually perform really well for classification with imbalanced datasets. But we can see there is lower SOC on the eastern side of the catchment (lower elevation) and more in the west (higher elevation).
Now, we can calculate the uncertainties. For this, we will use bootstrapping and take the results from 25 bootstraps to derive the lower limits, upper limits, and prediction limit range. We do this in two parts (2 functions) because more can be done with the 25 bootstrapped maps than this one uncertainty measure. These functions will be in the feedlot package, since partykit won't work with tidymodels.
This can be done in tidymodels alone, but it is actually more complicated than just doing a for loop, which is also similar to how you would do it in Python.
#function to predict over the covariates for each bootstrap
#returns a single stars object with each band a bootstrap prediction
bootStars = function(form, data, covar, boots, ...){
  #make a list to put the maps into
  maps = list()
  for(i in 1:boots){
    #sample the same number of points with replacement
    idx = sample.int(nrow(data), nrow(data), replace = TRUE)
    #train a model for each bootstrap (extra arguments such as control are passed on to ctree)
    mods = ctree(as.formula(form), data = data[idx, ], ...)
    #predict over the covariates
    maps[[i]] = predict(split(covar), mods)
  }
  #combine into 1 stars object
  m = do.call("c", maps) %>% merge()
  return(m)
}
#function to calculate upper and lower limits and the prediction range
#returns a single stars object with those three calculations
bootUncert = function(bootStars, limit){
  #get the mean across bootstraps
  ave = st_apply(bootStars, 1:2, mean)
  #get the standard deviation across bootstraps
  sdMap = st_apply(bootStars, 1:2, sd)
  #scale by the z-score for the chosen limit to get the interval half-width
  se = sdMap * qnorm(limit)
  #get the upper limit
  upper = ave + se
  #get the lower limit
  lower = ave - se
  #get the prediction range
  rang = upper - lower
  #combine the stars objects
  uncert = c(lower, rang, upper)
  #set names
  names(uncert) = c("lower", "range", "upper")
  return(merge(uncert))
}
#run bootstrap predictions
bootMaps = bootStars(SOC ~., data = ntr, covar = cov, boots = 25, control = trC)
plot(bootMaps)
You can see how the different predictions look. This is actually similar to how random forest works, as it takes the average of the predictions in much the same way. Let's see what the average prediction looks like.
#Calculate the mean
avePreds = st_apply(bootMaps, 1:2, mean)
#plot
plotStar(avePreds)
You can clearly see that taking the average of a lot of decision trees gives a much more robust model. We basically just made a conditional inference random forest, although more generally this is called an ensemble of decision trees. That is why we keep the uncertainty predictions in two parts: you can do cool stuff like this with the individual bootstrap maps. If we instead fit each new tree to the residuals of the previous trees and add the trees together, we have essentially made a gradient tree boosting model (see the sketch below). That is why starting with a dt model is best; there are many things to try and do with them.
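As a rough, hedged sketch of that boosting idea with ctree as the base learner (a toy version only, not a replacement for a proper gradient boosting package; the learning rate and number of rounds are arbitrary):

boostTrees = function(form, data, rounds = 10, lr = 0.1, control = ctree_control()){
  resp = all.vars(as.formula(form))[1]            #name of the response
  pred = rep(mean(data[[resp]]), nrow(data))      #start from the mean
  trees = list()
  for(i in 1:rounds){
    data$.res = data[[resp]] - pred               #current residuals
    #fit the next tree to the residuals, using all covariates except the response
    trees[[i]] = ctree(as.formula(paste(".res ~ . -", resp)), data = data, control = control)
    pred = pred + lr * predict(trees[[i]], newdata = data)  #add a damped correction
  }
  list(trees = trees, fitted = pred)
}

In practice you would use a dedicated boosting package, but the loop shows why everything builds on the humble dt.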
Now we run the bootUncert function.
uncert = bootUncert(bootStars = bootMaps, limit = 0.95)
#plot uncertainties
plot(uncert)
We have now calculated and made all the maps we need to make our million $. Now let's see how well we defined our prediction intervals with the PICP sensitivity analysis.
#run PICP (part of our package)
picp = picpCalc(data = nts, response = nts$SOC, pred = nts$pred)
#plot line
picp[[1]]
So, even though our model sucked, we did really well with estimating our uncertainties! That is great because the uncertainties show us where the model needs improvement, such as covariates that will help in the higher-altitude region in the west. These places are actually depressions at the higher-altitude locations, so some covariates are probably correlated and the model confused itself where the uncertainty is high.
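For reference, here is a hedged sketch of the idea behind PICP (not how picpCalc is implemented): for a range of confidence levels, build intervals around the predictions and check what proportion of the observed values actually falls inside. Here `bootPred` is a hypothetical matrix of bootstrap predictions at the test points (rows = observations, columns = bootstraps).

picpSketch = function(obs, bootPred, levels = seq(0.05, 0.95, by = 0.05)){
  mu = rowMeans(bootPred)          #mean prediction per observation
  s  = apply(bootPred, 1, sd)      #bootstrap standard deviation per observation
  sapply(levels, function(cl){
    z = qnorm(1 - (1 - cl) / 2)    #two-sided z-score for this confidence level
    mean(obs >= mu - z * s & obs <= mu + z * s)  #proportion inside the interval
  })
}
#a well-calibrated model gives PICP values close to the confidence levels themselves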
Hope this helps and made sense!
References
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer, New York. https://doi.org/10.1007/978-1-4614-6849-3