Introduction

With the preliminary infestation scoring done, it’s now time to explore some predictive modeling. My go-to modeling tool is the random forest, so that seems like a suitable place to start. We can test out alternative modeling approaches in the future, but RF is pretty robust. This will be a considerably shorter, though hopefully more interesting, R Markdown document.

Reading in the data

First things first: there are a few datasets that I’ll be working with here. I’ll start by reading in the infestation-scored plot data generated in the previous Markdown document, which contains all of my dependent variables. Then, I’ll read in the raster data for all of my independent variables. For what it’s worth, I’m starting with three main categories of predictor data:

1. 2021 Landsat 8 bands and derived spectral indices
2. 2014–2021 Landsat 8 difference bands and indices
3. DEM-derived topographic metrics

library(raster)
## Warning: package 'raster' was built under R version 4.0.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.0.3
library(exactextractr)
## Warning: package 'exactextractr' was built under R version 4.0.5
library(Rcpp)
## Warning: package 'Rcpp' was built under R version 4.0.5
library(VSURF)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(rfUtilities)
library(ranger)
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(rmarkdown)

# set random seed
set.seed(5757)

# define directory structure
main.dir <- "S:\\ursa\\campbell\\bwa\\data\\"
ls.dir <- paste0(main.dir, "landsat\\gee_20211013\\")
dem.dir <- paste0(main.dir, "dem\\")
fld.dir <- paste0(main.dir, "field_data\\field_data\\processed_data\\download_202110121039\\")

# read in field plot data
df <- read.csv(paste0(fld.dir, "a03_csv\\plot_level_summaries_20211015.csv"))

# convert to spatial points data frame
spdf <- SpatialPointsDataFrame(coords = df[,c("utm.e", "utm.n")],
                               data = df,
                               proj4string = CRS("+init=epsg:26912"))

# list predictor rasters
ls.2021.rasts <- list.files(path = paste0(ls.dir, "l8_2021\\a05_product\\"),
                            pattern = "\\.tif$",
                            full.names = T)
ls.diff.rasts <- list.files(path = paste0(ls.dir, "l8_diff_2014_2021\\"),
                            pattern = "\\.tif$",
                            full.names = T)
dem.rasts <- list.files(path = dem.dir,
                        pattern = "\\.tif$",
                        full.names = T)

# create predictor stack
ls.2021.stack <- stack(ls.2021.rasts)
ls.diff.stack <- stack(ls.diff.rasts)
dem.stack <- stack(dem.rasts)
pred.stack <- stack(ls.2021.stack, ls.diff.stack, dem.stack)

So, as you can see below, we have four dependent variables and 49 independent variables:

# list dependent variables
for (i in colnames(df)[(ncol(df)-3):ncol(df)]){
  print(i)
}
## [1] "sympt.score.bwa"
## [1] "sympt.score.bwa.ba"
## [1] "sympt.score.bwa.sce"
## [1] "sympt.score.bwa.ba.sce"
# list independent variables
for (i in names(pred.stack)){
  print(i)
}
## [1] "l8_2021_b1"
## [1] "l8_2021_b2"
## [1] "l8_2021_b3"
## [1] "l8_2021_b4"
## [1] "l8_2021_b5"
## [1] "l8_2021_b6"
## [1] "l8_2021_b7"
## [1] "l8_2021_evi"
## [1] "l8_2021_msavi"
## [1] "l8_2021_nbr"
## [1] "l8_2021_nbr2"
## [1] "l8_2021_ndmi"
## [1] "l8_2021_ndvi"
## [1] "l8_2021_nirv"
## [1] "l8_2021_savi"
## [1] "l8_2021_tcb"
## [1] "l8_2021_tcg"
## [1] "l8_2021_tcw"
## [1] "l8_diff_2014_2021_b1"
## [1] "l8_diff_2014_2021_b2"
## [1] "l8_diff_2014_2021_b3"
## [1] "l8_diff_2014_2021_b4"
## [1] "l8_diff_2014_2021_b5"
## [1] "l8_diff_2014_2021_b6"
## [1] "l8_diff_2014_2021_b7"
## [1] "l8_diff_2014_2021_evi"
## [1] "l8_diff_2014_2021_msavi"
## [1] "l8_diff_2014_2021_nbr"
## [1] "l8_diff_2014_2021_nbr2"
## [1] "l8_diff_2014_2021_ndmi"
## [1] "l8_diff_2014_2021_ndvi"
## [1] "l8_diff_2014_2021_nirv"
## [1] "l8_diff_2014_2021_savi"
## [1] "l8_diff_2014_2021_tcb"
## [1] "l8_diff_2014_2021_tcg"
## [1] "l8_diff_2014_2021_tcw"
## [1] "aspect_cos"
## [1] "aspect_sin"
## [1] "curvature"
## [1] "elevation"
## [1] "slope"
## [1] "slope_aspect_cos"
## [1] "slope_aspect_sin"
## [1] "tpi_10"
## [1] "tpi_20"
## [1] "tpi_30"
## [1] "tpi_40"
## [1] "tpi_50"
## [1] "twi"

One of the great benefits of random forests is that, unlike ordinary least squares regression, you can have more predictors than data points. Since we have only 31 plots and 49 potential predictors, that’s great news…
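To make that concrete, here’s a toy illustration (purely synthetic, not part of the analysis): OLS goes rank-deficient when predictors outnumber observations, while a random forest fits without complaint.

# toy data: 31 rows, 1 response + 49 noise predictors
demo.df <- as.data.frame(matrix(rnorm(31 * 50), nrow = 31))
colnames(demo.df)[1] <- "y"

# OLS can't estimate 49 coefficients from 31 rows; the excess come back NA
coef(lm(y ~ ., data = demo.df))

# the random forest fits fine (though on pure noise it won't explain much)
randomForest(y ~ ., data = demo.df)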

Create a modeling data frame

The next step is to link the predictors with the outcome variables. To do this, I’ll extract the pixel values within a 15 m buffer around each plot point and take the mean of those within-buffer pixels for each predictor layer. Or, more concisely put, I’ll get a zonal mean for each plot for each predictor variable. I’ll then merge the results into a single modeling data frame.

# extract pixel values from the stack within 15m plot buffers
buffs <- buffer(spdf, 15, dissolve = F)
pred.df <- exact_extract(x = pred.stack, 
                         y = buffs,
                         fun = "mean",
                         progress = F)
colnames(pred.df) <- unlist(lapply(colnames(pred.df), function(x) strsplit(x, "\\.")[[1]][2]))

# create modeling data frame
mod.df <- df[,c("sympt.score.bwa",
                "sympt.score.bwa.ba",
                "sympt.score.bwa.sce",
                "sympt.score.bwa.ba.sce")]
mod.df <- cbind(mod.df, pred.df)
paged_table(mod.df)

Model #1: Unweighted infestation scores

Now onto the fun part! Let’s see if we can model infestation… When running random forests, I always use the “VSURF” tool (Variable Selection Using Random Forests). It’s amazing… It distills the big batch of predictor variables down to a streamlined set of the most relevant/strongest predictors, sort of like stepwise regression without all the baggage. Under the hood, VSURF winnows the variables in three steps (thresholding, interpretation, and prediction); I’m using the final, prediction-step variable set here (varselect.pred).

#-----------------------------MODEL #1: UNWEIGHTED-----------------------------#

# determine which columns are the dependent (y) and independent (x) variables
y.col <- 1
x.cols <- 5:ncol(mod.df)

# perform VSURF analysis to determine optimal predictors
v <- VSURF(x = mod.df[,x.cols],
           y = mod.df[,y.col],
           parallel = TRUE,
           ncores = 24,
           RFimplem = "ranger",
           verbose = F)
x.cols <- x.cols[v$varselect.pred]

# run the optimized random forest
rf <- randomForest(x = mod.df[,x.cols],
                   y = mod.df[,y.col],
                   importance = T)
rf
## 
## Call:
##  randomForest(x = mod.df[, x.cols], y = mod.df[, y.col], importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.005937601
##                     % Var explained: 61.62
# plot the results
par(mar = c(5,5,1,1))
par(las = 1)
ax.min <- min(c(mod.df[,y.col], rf$predicted))
ax.max <- max(c(mod.df[,y.col], rf$predicted))
plot(rf$predicted ~ mod.df[,y.col],
     xlab = "Measured Score",
     ylab = "Modeled Score",
     xlim = c(ax.min, ax.max),
     ylim = c(ax.min, ax.max),
     pch = 16)
grid()
lines(x = c(-1,1), y = c(-1,1), lty = 2)
mod.lm <- lm(rf$predicted ~ mod.df[,y.col])
abline(mod.lm, lwd = 2, col = "red")
legend("topleft", legend = c("1:1 line", "Regression line"), lwd = c(1,2), 
       lty = c(2,1), col = c("black", "red"))
text.x <- par("usr")[1] + (par("usr")[2] - par("usr")[1]) * 0.8
text.y <- par("usr")[3] + (par("usr")[4] - par("usr")[3]) * 0.2
eq <- paste0("y = ", round(coef(mod.lm)[2],3), "x + ", round(coef(mod.lm)[1], 3))
r2 <- paste0("r2 = ", round(summary(mod.lm)$adj.r.squared, 3))
text(x = text.x, y = text.y, labels = paste0(eq, "\n", r2), col = "red")

# plot the variable importance
varImpPlot(rf)
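
For the record, it’s easy to check exactly which predictors VSURF retained for a given model. A quick sketch, assuming v, x.cols, and mod.df are still in scope from the chunk above:

# names of the predictors retained at VSURF's prediction step
colnames(mod.df)[x.cols]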

Model #2: BA-weighted infestation scores

Let’s try again, but with the basal area-weighted infestation scores…

#----------------------------MODEL #2: BA-WEIGHTED-----------------------------#

# determine which columns are the dependent (y) and independent (x) variables
y.col <- 2
x.cols <- 5:ncol(mod.df)

# perform VSURF analysis to determine optimal predictors
v <- VSURF(x = mod.df[,x.cols],
           y = mod.df[,y.col],
           parallel = TRUE,
           ncores = 24,
           RFimplem = "ranger",
           verbose = F)
x.cols <- x.cols[v$varselect.pred]

# run the optimized random forest
rf <- randomForest(x = mod.df[,x.cols],
                   y = mod.df[,y.col],
                   importance = T)
rf
## 
## Call:
##  randomForest(x = mod.df[, x.cols], y = mod.df[, y.col], importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.009703609
##                     % Var explained: 53.98
# plot the results
par(mar = c(5,5,1,1))
par(las = 1)
ax.min <- min(c(mod.df[,y.col], rf$predicted))
ax.max <- max(c(mod.df[,y.col], rf$predicted))
plot(rf$predicted ~ mod.df[,y.col],
     xlab = "Measured Score",
     ylab = "Modeled Score",
     xlim = c(ax.min, ax.max),
     ylim = c(ax.min, ax.max),
     pch = 16)
grid()
lines(x = c(-1,1), y = c(-1,1), lty = 2)
mod.lm <- lm(rf$predicted ~ mod.df[,y.col])
abline(mod.lm, lwd = 2, col = "red")
legend("topleft", legend = c("1:1 line", "Regression line"), lwd = c(1,2), 
       lty = c(2,1), col = c("black", "red"))
text.x <- par("usr")[1] + (par("usr")[2] - par("usr")[1]) * 0.8
text.y <- par("usr")[3] + (par("usr")[4] - par("usr")[3]) * 0.2
eq <- paste0("y = ", round(coef(mod.lm)[2],3), "x + ", round(coef(mod.lm)[1], 3))
r2 <- paste0("r2 = ", round(summary(mod.lm)$adj.r.squared, 3))
text(x = text.x, y = text.y, labels = paste0(eq, "\n", r2), col = "red")

# plot the variable importance
varImpPlot(rf)

Model #3: SCE-weighted infestation scores

Let’s try again, but with the SCE-weighted infestation scores…

#----------------------------MODEL #3: SCE-WEIGHTED----------------------------#

# determine which columns are the dependent (y) and independent (x) variables
y.col <- 3
x.cols <- 5:ncol(mod.df)

# perform VSURF analysis to determine optimal predictors
v <- VSURF(x = mod.df[,x.cols],
           y = mod.df[,y.col],
           parallel = TRUE,
           ncores = 24,
           RFimplem = "ranger",
           verbose = F)
x.cols <- x.cols[v$varselect.pred]

# run the optimized random forest
rf <- randomForest(x = mod.df[,x.cols],
                   y = mod.df[,y.col],
                   importance = T)
rf
## 
## Call:
##  randomForest(x = mod.df[, x.cols], y = mod.df[, y.col], importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.004576417
##                     % Var explained: 52.42
# plot the results
par(mar = c(5,5,1,1))
par(las = 1)
ax.min <- min(c(mod.df[,y.col], rf$predicted))
ax.max <- max(c(mod.df[,y.col], rf$predicted))
plot(rf$predicted ~ mod.df[,y.col],
     xlab = "Measured Score",
     ylab = "Modeled Score",
     xlim = c(ax.min, ax.max),
     ylim = c(ax.min, ax.max),
     pch = 16)
grid()
lines(x = c(-1,1), y = c(-1,1), lty = 2)
mod.lm <- lm(rf$predicted ~ mod.df[,y.col])
abline(mod.lm, lwd = 2, col = "red")
legend("topleft", legend = c("1:1 line", "Regression line"), lwd = c(1,2), 
       lty = c(2,1), col = c("black", "red"))
text.x <- par("usr")[1] + (par("usr")[2] - par("usr")[1]) * 0.8
text.y <- par("usr")[3] + (par("usr")[4] - par("usr")[3]) * 0.2
eq <- paste0("y = ", round(coef(mod.lm)[2],3), "x + ", round(coef(mod.lm)[1], 3))
r2 <- paste0("r2 = ", round(summary(mod.lm)$adj.r.squared, 3))
text(x = text.x, y = text.y, labels = paste0(eq, "\n", r2), col = "red")

# plot the variable importance
varImpPlot(rf)

Model #4: BA- and SCE-weighted infestation scores

Let’s try one more time, but with the BA- and SCE-weighted infestation scores…

#------------------------MODEL #4: BA- AND SCE-WEIGHTED------------------------#

# determine which columns are the dependent (y) and independent (x) variables
y.col <- 4
x.cols <- 5:ncol(mod.df)

# perform VSURF analysis to determine optimal predictors
v <- VSURF(x = mod.df[,x.cols],
           y = mod.df[,y.col],
           parallel = TRUE,
           ncores = 24,
           RFimplem = "ranger",
           verbose = F)
x.cols <- x.cols[v$varselect.pred]

# run the optimized random forest
rf <- randomForest(x = mod.df[,x.cols],
                   y = mod.df[,y.col],
                   importance = T)
rf
## 
## Call:
##  randomForest(x = mod.df[, x.cols], y = mod.df[, y.col], importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.009130174
##                     % Var explained: 49.05
# plot the results
par(mar = c(5,5,1,1))
par(las = 1)
ax.min <- min(c(mod.df[,y.col], rf$predicted))
ax.max <- max(c(mod.df[,y.col], rf$predicted))
plot(rf$predicted ~ mod.df[,y.col],
     xlab = "Measured Score",
     ylab = "Modeled Score",
     xlim = c(ax.min, ax.max),
     ylim = c(ax.min, ax.max),
     pch = 16)
grid()
lines(x = c(-1,1), y = c(-1,1), lty = 2)
mod.lm <- lm(rf$predicted ~ mod.df[,y.col])
abline(mod.lm, lwd = 2, col = "red")
legend("topleft", legend = c("1:1 line", "Regression line"), lwd = c(1,2), 
       lty = c(2,1), col = c("black", "red"))
text.x <- par("usr")[1] + (par("usr")[2] - par("usr")[1]) * 0.8
text.y <- par("usr")[3] + (par("usr")[4] - par("usr")[3]) * 0.2
eq <- paste0("y = ", round(coef(mod.lm)[2],3), "x + ", round(coef(mod.lm)[1], 3))
r2 <- paste0("r2 = ", round(summary(mod.lm)$adj.r.squared, 3))
text(x = text.x, y = text.y, labels = paste0(eq, "\n", r2), col = "red")

# plot the variable importance
varImpPlot(rf)

Discussion

So, based on the preliminary data, it looks like we can explain about 50% of the variance in BWA score. Not bad, honestly, particularly considering (a) how subtle the symptoms are, and (b) how few plots we have. Generally, it seems like the unweighted scores are being modeled best, which is somewhat interesting. The two basal area-weighted models might actually be stronger if not for that one outlier plot… I need to look more into which plot that is, and why its modeled score is so strangely high. In terms of the predictor variables being selected, the shorter-wavelength visible difference bands are really strong predictors, particularly bands 1 (coastal aerosol) and 3 (green). Topographic predictors (elevation, topographic position index, and slope) seem pretty strong as well. The 2021 Landsat imagery also shows up, with SAVI, Tasseled Cap greenness, and EVI coming through.
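
As a first step toward tracking down that outlier, the out-of-bag residuals can be pulled straight from the model object. A minimal sketch, refitting the BA-weighted model (#2), since rf is currently holding Model #4:

# refit model #2 (BA-weighted) so we have its out-of-bag predictions in hand
y.col <- 2
v.ba <- VSURF(x = mod.df[,5:ncol(mod.df)], y = mod.df[,y.col],
              parallel = TRUE, ncores = 24, RFimplem = "ranger", verbose = F)
rf.ba <- randomForest(x = mod.df[,(5:ncol(mod.df))[v.ba$varselect.pred]],
                      y = mod.df[,y.col])

# flag the plot whose modeled score most overshoots its measured score
resid.ba <- rf.ba$predicted - mod.df[,y.col]
df[which.max(resid.ba), ]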

I’ll just leave it there for now. Again, very preliminary, but let’s discuss sometime soon.