EMS4D TLS vs MLS Random Forest Exploration

Introduction

In our last meeting, we discussed having me do some random forest exploration of the North Kaibab and Fish Lake fuel plot and lidar data. Eugenia sent the data along on 4/17, so I took a quick dive into it to see what we might be able to find. The data appeared to have already been subset to a set of selected variables from a variable importance-based selection procedure. Furthermore, in Carlos and Eugenia’s code, there were some additional variable selection steps taken for each response variable, further refining the set of predictor variables for each response and creating tailored, parsimonious models for each. Note that for this data exploration, I did not replicate the variable selection approach C & E used. So, the resulting models may contain some extraneous predictor variables, and thus may have suboptimal predictive capacity. However, the comparison between modeling approaches still provides an “apples to apples” basis for evaluating how different modeling strategies yield different performance metrics.

So, let’s jump in! First, I’ll load the necessary libraries and set the random seed for reproducibility.

Show Code
# load libraries
library(ranger)
library(knitr)
library(colorspace)
library(RColorBrewer)
library(dplyr)

# set random seed
set.seed(5757)

Next, I’ll subset the data frames to pick out columns associated with predictors (x), response variables (y), and plot identifiers.

Show Code
# read in the data
df_mls <- read.csv("S:/ursa2/campbell/ems4d/data/from_eugenia/globdata_modeling_MJB_mls.csv")
df_tls <- read.csv("S:/ursa2/campbell/ems4d/data/from_eugenia/globdata_modeling_MJB_tls.csv")

# remove rows with nas
df_mls <- df_mls[complete.cases(df_mls),]
df_tls <- df_tls[complete.cases(df_tls),]

# split into predictor (x), response (y), and id data
df_mls_id <- df_mls[,1]
df_mls_x <- df_mls[,2:58]
df_mls_y <- df_mls[,59:ncol(df_mls)]

df_tls_id <- df_tls[,1]
df_tls_x <- df_tls[,2:58]
df_tls_y <- df_tls[,59:ncol(df_tls)]

Here, I’ll create a function for plotting and saving predicted vs. observed JPG plots. Note that I’m saving all of these plots to JPG as I iterate through the various model architectures and response variables. So, they won’t be seen in this document, since there are so many of them, but I’ll send them along with all of the other outputs when I share this document.

Show Code
# define predicted-observed plotting function
plot_pred_obs <- function(prd, obs, var_name, save = F, file_name = NULL){
  
  # if the plot is being saved, open jpeg plotter
  if (save == T){
    jpeg(file_name, width = 5, height = 5, units = "in", res = 300, quality = 100)
  }

  # set up plot
  par(mar = c(5,5,2,1), las = 1)
  
  # get axis limits
  ax_lim <- range(c(prd, obs))
  
  # plot the data
  plot(ax_lim, ax_lim, type = "n", ylab = "Predicted", xlab = "Observed",
       main = var_name)
  grid()
  lines(c(-1e6,1e6), c(-1e6,1e6))
  points(prd ~ obs, pch = 16, col = rgb(0,0,0,0.5))
  
  # add regression line
  mod <- lm(prd ~ obs)
  lines(ax_lim, predict(mod, newdata = list(obs = ax_lim)), col = 2, lwd = 2)
  
  # add performance metrics
  a <- coef(mod)[2] |> round(2)
  b <- coef(mod)[1] |> round(2)
  eq <- bquote(y==.(a)*x+.(b))
  r2 <- summary(mod)$adj.r.squared |> round(2)
  r2 <- bquote(R^2==.(r2))
  rmse <- sqrt(mean((prd - obs)^2)) |> round(2)
  perc_rmse <- (100 * sqrt(mean((prd - obs)^2)) / mean(obs)) |> round(2)
  rmse <- bquote(RMSE==.(rmse)~"("*.(perc_rmse)*"%)")
  bias <- mean(prd - obs) |> round(2)
  perc_bias <- (100 * mean(prd - obs) / mean(obs)) |> round(2)
  if (bias >= 0) bias <- bquote(Bias==+.(abs(bias))~"("*+.(abs(perc_bias))*"%)")
  if (bias < 0) bias <- bquote(Bias==-.(abs(bias))~"("*-.(abs(perc_bias))*"%)")
  legend("topleft",
         legend = c(eq, r2, rmse, bias),
         bty = "n", x.intersp = 0, text.col = 2)
    
  # if the plot is being saved, close jpeg plotter
  if (save == T) dev.off()
  
}

Model Type #1: Full Model

This is the simplest model architecture – the full model. In this case, all of the data are used as training, and the performance metrics come from the out-of-bag estimates. Although random forests tend to be fairly robust to overfitting, out-of-bag estimates can sometimes provide an overly optimistic view of true model performance, given that there’s no truly independent test of performance.
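As a quick point of reference on what “out-of-bag” means in practice, here is a minimal sketch (assuming the libraries and data frames loaded above) contrasting ranger’s out-of-bag predictions with refit, in-sample predictions for a single response variable. The in-sample fit looks inflated because every tree has seen every plot.

# minimal sketch: out-of-bag vs. in-sample predictions for one MLS response
y1 <- df_mls_y[, 1]
mod_demo <- ranger(x = df_mls_x, y = y1)

# out-of-bag: each plot is predicted only by trees that did not train on it
prd_oob <- mod_demo$predictions

# in-sample: every tree has seen every plot, so the fit looks overly optimistic
prd_in <- predict(mod_demo, df_mls_x)$predictions

summary(lm(prd_oob ~ y1))$adj.r.squared
summary(lm(prd_in ~ y1))$adj.r.squared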

MLS Data

First, I’ll run the full models on the MLS data.

Show Code
# loop through response variables
for (y in 1:ncol(df_mls_y)){
  
  # create modeling data frame
  df_mod <- cbind(df_mls_y[y], df_mls_x)
  
  # build model
  mod <- ranger(
    x = df_mod[,2:ncol(df_mod)],
    y = df_mod[,1]
  )
  
  # get performance metrics
  prd <- mod$predictions
  obs <- df_mod[,1]
  r2 <- summary(lm(prd ~ obs))$adj.r.squared
  rmse <- sqrt(mean((prd - obs) ^ 2))
  perc_rmse <- 100 * rmse / mean(obs)
  bias <- mean(prd - obs)
  perc_bias <- 100 * bias / mean(obs)
  
  # compile metrics
  if (y == 1){
    df_perf_mls_rf_full <- data.frame(
      response = colnames(df_mod)[1],
      r2 = r2,
      rmse = rmse,
      perc_rmse = perc_rmse,
      bias = bias,
      perc_bias = perc_bias
    )
  } else {
    df_perf_mls_rf_full <- rbind(
      df_perf_mls_rf_full, 
      data.frame(
        response = colnames(df_mod)[1],
        r2 = r2,
        rmse = rmse,
        perc_rmse = perc_rmse,
        bias = bias,
        perc_bias = perc_bias
      )
    )
  }
  
  # write predicted vs. observed table to csv
  df_pred_obs_mls_rf_full <- data.frame(
    plot_id = df_mls_id,
    observed = obs,
    predicted = prd
  )
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/pred_obs/mls_rf_full"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".csv"))
  write.csv(df_pred_obs_mls_rf_full, file_name, row.names = F)
  
  # plot predicted vs. observed
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/plots/mls_rf_full"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".jpg"))
  plot_pred_obs(prd, obs, var_name, T, file_name)

}

# sort performance metrics by name
df_perf_mls_rf_full <- df_perf_mls_rf_full[order(df_perf_mls_rf_full$response),]
row.names(df_perf_mls_rf_full) <- 1:nrow(df_perf_mls_rf_full)

# write out performance metrics csv
write.csv(
  df_perf_mls_rf_full,
  "S:/ursa2/campbell/ems4d/data/from_eugenia/perf_mets/mls_rf_full.csv",
  row.names = F
)

# print table
kable(
  df_perf_mls_rf_full, 
  caption = "Table 1. Performance Metrics of Full Random Forest Models on MLS Data",
  digits = 2,
  col.names = c("Response Variable", "R2", "RMSE", "%RMSE", "Bias", "%Bias")
)
Table 1. Performance Metrics of Full Random Forest Models on MLS Data
Response Variable R2 RMSE %RMSE Bias %Bias
AGB_tr_dead_mgha 0.59 21.21 60.06 1.09 3.09
AGB_tr_live_mgha 0.54 31.26 60.78 -0.93 -1.81
AGB_tr_mgha 0.68 40.03 46.14 0.35 0.40
Basal.Area 0.71 8.49 40.89 0.66 3.17
br_d_mgha 0.61 4.77 62.79 0.36 4.79
br_l_mgha 0.63 5.52 52.43 -0.27 -2.55
branches_mgha 0.72 8.01 44.20 0.57 3.17
c_vol_ratio 0.37 0.20 66.20 0.00 0.03
CBD_d 0.21 0.06 104.19 0.00 -0.15
CBD_l 0.10 0.07 98.61 0.00 -6.24
CBD_tot 0.26 0.09 71.08 -0.01 -5.52
CBH_dead 0.54 0.73 46.55 0.01 0.53
CBH_live 0.31 1.39 65.05 -0.06 -2.95
CWD 0.13 36.51 132.62 3.20 11.62
D_surf 0.25 18.09 96.87 0.26 1.37
DFF 0.19 16.32 100.74 0.09 0.53
duff_dep 0.26 0.81 112.86 0.01 1.51
fol_d_mgha 0.39 1.30 77.34 0.08 4.92
fol_l_mgha 0.58 1.65 63.29 -0.05 -1.89
foliage_mgha 0.65 2.14 49.94 0.05 1.17
height 0.66 2.62 36.42 -0.16 -2.18
hr1000_mgha 0.12 33.03 137.99 2.72 11.36
HS 0.00 4.04 103.43 0.14 3.68
litt_dep 0.48 1.00 63.35 -0.03 -2.10
Maximum.Height 0.76 4.69 29.00 -0.10 -0.63
mean_dleng 0.28 1.55 72.63 0.09 4.36
SDB 0.60 4.77 63.06 0.19 2.52
SDS 0.63 7.44 51.61 0.09 0.64
SS 0.38 4.84 92.33 0.07 1.29
SSd 0.09 1.93 112.89 0.00 0.14
tot_herb -0.02 1.59 189.40 0.10 12.25
tot_shb 0.03 3.77 122.90 0.12 4.05
tot_surf 0.22 19.71 87.29 0.60 2.64
tot_wood 0.69 39.28 43.75 0.32 0.35
total_AGB 0.69 47.54 43.50 2.36 2.16
Tree.Density 0.31 1499.28 105.27 74.10 5.20

TLS Data

Next, I’ll run the models on the TLS data.

Show Code
# loop through response variables
for (y in 1:ncol(df_tls_y)){
  
  # create modeling data frame
  df_mod <- cbind(df_tls_y[y], df_tls_x)
  
  # build model
  mod <- ranger(
    x = df_mod[,2:ncol(df_mod)],
    y = df_mod[,1]
  )
  
  # get performance metrics
  prd <- mod$predictions
  obs <- df_mod[,1]
  r2 <- summary(lm(prd ~ obs))$adj.r.squared
  rmse <- sqrt(mean((prd - obs) ^ 2))
  perc_rmse <- 100 * rmse / mean(obs)
  bias <- mean(prd - obs)
  perc_bias <- 100 * bias / mean(obs)
  
  # compile metrics
  if (y == 1){
    df_perf_tls_rf_full <- data.frame(
      response = colnames(df_mod)[1],
      r2 = r2,
      rmse = rmse,
      perc_rmse = perc_rmse,
      bias = bias,
      perc_bias = perc_bias
    )
  } else {
    df_perf_tls_rf_full <- rbind(
      df_perf_tls_rf_full, 
      data.frame(
        response = colnames(df_mod)[1],
        r2 = r2,
        rmse = rmse,
        perc_rmse = perc_rmse,
        bias = bias,
        perc_bias = perc_bias
      )
    )
  }
  
  # write predicted vs. observed table to csv
  df_pred_obs_tls_rf_full <- data.frame(
    plot_id = df_tls_id,
    observed = obs,
    predicted = prd
  )
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/pred_obs/tls_rf_full"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".csv"))
  write.csv(df_pred_obs_tls_rf_full, file_name, row.names = F)
  
  # plot predicted vs. observed
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/plots/tls_rf_full"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".jpg"))
  plot_pred_obs(prd, obs, var_name, T, file_name)

}

# sort performance metrics by name
df_perf_tls_rf_full <- df_perf_tls_rf_full[order(df_perf_tls_rf_full$response),]
row.names(df_perf_tls_rf_full) <- 1:nrow(df_perf_tls_rf_full)

# write out performance metrics csv
write.csv(
  df_perf_tls_rf_full, 
  "S:/ursa2/campbell/ems4d/data/from_eugenia/perf_mets/tls_rf_full.csv",
  row.names = F
)

# print table
kable(
  df_perf_tls_rf_full,
  caption = "Table 2. Performance Metrics of Full Random Forest Models on TLS Data",
  digits = 2,
  col.names = c("Response Variable", "R2", "RMSE", "%RMSE", "Bias", "%Bias")
)
Table 2. Performance Metrics of Full Random Forest Models on TLS Data
Response Variable R2 RMSE %RMSE Bias %Bias
AGB_tr_dead_mgha 0.49 25.18 62.14 0.62 1.53
AGB_tr_live_mgha 0.47 50.26 72.62 -1.24 -1.80
AGB_tr_mgha 0.53 66.34 60.46 -0.13 -0.12
Basal.Area 0.70 9.95 40.86 -0.40 -1.64
br_d_mgha 0.51 5.45 63.25 -0.11 -1.24
br_l_mgha 0.51 9.33 66.71 -0.01 -0.07
branches_mgha 0.61 12.32 54.50 -0.02 -0.08
c_vol_ratio 0.33 0.19 63.09 0.00 -0.07
CBD_d 0.27 0.05 92.43 0.00 -0.27
CBD_l 0.39 0.06 77.30 0.00 -1.32
CBD_tot 0.45 0.08 58.44 -0.01 -4.33
CBH_dead 0.53 0.73 44.06 -0.03 -2.08
CBH_live 0.33 1.47 62.53 -0.05 -2.21
CWD 0.08 37.06 132.22 4.26 15.21
D_surf 0.33 15.92 82.01 0.18 0.93
DFF 0.25 14.45 86.54 0.32 1.95
duff_dep 0.53 0.69 80.03 0.02 2.06
fol_d_mgha 0.29 1.55 78.35 0.04 2.25
fol_l_mgha 0.39 2.94 84.21 0.02 0.55
foliage_mgha 0.38 4.06 74.12 0.11 1.96
height 0.62 2.95 37.87 -0.02 -0.22
hr1000_mgha 0.05 33.46 138.16 3.97 16.40
HS 0.07 3.75 98.07 0.28 7.22
litt_dep 0.22 1.18 72.38 0.05 2.90
Maximum.Height 0.84 3.97 22.73 -0.01 -0.07
mean_dleng 0.34 1.41 62.61 0.00 0.13
SDB 0.50 5.53 64.29 0.03 0.31
SDS 0.40 10.56 65.79 0.01 0.09
SS 0.10 8.85 144.73 0.03 0.56
SSd 0.05 2.75 139.50 0.13 6.63
tot_herb -0.01 1.49 186.39 0.06 8.00
tot_shb 0.02 3.85 127.30 0.36 11.92
tot_surf 0.34 16.94 72.91 0.58 2.51
tot_wood 0.52 66.81 59.28 2.25 2.00
total_AGB 0.56 71.36 53.69 -0.58 -0.44
Tree.Density 0.12 2231.80 135.38 127.81 7.75

MLS vs TLS Comparison

Now I’ll compare the performance of MLS and TLS using the full models’ performance metrics. Note that in the resulting plot, the variables are ranked from best (top) to worst (bottom) performing, based on the average of their R2 and %RMSE ranks across both MLS and TLS.

Show Code
# merge the performance metrics data
df_perf_rf_full <- merge(
  df_perf_mls_rf_full,
  df_perf_tls_rf_full,
  by = "response",
  suffixes = c("_mls", "_tls")
)

# get ranks and sort data.frame
rank_r2_mls <- rank(df_perf_rf_full$r2_mls * -1)
rank_r2_tls <- rank(df_perf_rf_full$r2_tls * -1)
rank_perc_rmse_mls <- rank(df_perf_rf_full$perc_rmse_mls)
rank_perc_rmse_tls <- rank(df_perf_rf_full$perc_rmse_tls)
rank_mean <- cbind(rank_r2_mls,
  rank_r2_tls,
  rank_perc_rmse_mls,
  rank_perc_rmse_tls
) |>
  rowMeans()
df_perf_rf_full$rank_mean <- rank_mean
df_perf_rf_full <- df_perf_rf_full[order(df_perf_rf_full$rank_mean, decreasing = T),]

# set up plot
par(mfrow = c(1,2), las = 1, mar = c(5,5,2,1))

#------------------------r2

# plot the data
dotchart(
  x = df_perf_rf_full$r2_mls,
  labels = df_perf_rf_full$response,
  xlim = range(c(df_perf_rf_full[,c("r2_mls", "r2_tls")])),
  pch = "",
  main = "R2"
)
points(
  x = df_perf_rf_full$r2_mls,
  y = 1:nrow(df_perf_rf_full),
  pch = 16,
  col = 2
)
points(
  x = df_perf_rf_full$r2_tls,
  y = 1:nrow(df_perf_rf_full),
  pch = 16,
  col = 4
)

#------------------------rmse

# plot the data
dotchart(
  x = df_perf_rf_full$perc_rmse_mls,
  labels = df_perf_rf_full$response,
  xlim = range(c(df_perf_rf_full[,c("perc_rmse_mls", "perc_rmse_tls")])),
  pch = "",
  main = "%RMSE"
)
points(
  x = df_perf_rf_full$perc_rmse_mls,
  y = 1:nrow(df_perf_rf_full),
  pch = 16,
  col = 2
)
points(
  x = df_perf_rf_full$perc_rmse_tls,
  y = 1:nrow(df_perf_rf_full),
  pch = 16,
  col = 4
)

# add legend
legend(
  "topright",
  legend = c("MLS", "TLS"),
  pch = 16,
  col = c(2, 4)
)

Figure 1. MLS vs. TLS Full RF Model Performance Comparison

Model Type #2: Leave One Out Cross-Validation

This is a somewhat more robust assessment approach than the full model’s, since each plot is iteratively held out for testing while a model is built from the remaining plots and used to predict the held-out plot. The compiled predictions are then compared to the observations as a whole, and performance metrics are computed from the compiled data.
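Conceptually, the LOOCV loops below boil down to a small helper like this sketch (loocv_rf is a hypothetical function name, shown for a single response vector and predictor data frame; the full loops below are what actually generated the results).

# sketch of leave-one-out cross-validation for a random forest
loocv_rf <- function(x_df, y_vec){
  prd <- numeric(length(y_vec))
  for (i in seq_along(y_vec)){
    # fit on all plots except plot i
    mod_i <- ranger(x = x_df[-i, , drop = FALSE], y = y_vec[-i])
    # predict the single held-out plot
    prd[i] <- predict(mod_i, x_df[i, , drop = FALSE])$predictions
  }
  prd
}

# example usage (hypothetical): LOOCV predictions for the first MLS response
# prd_loo <- loocv_rf(df_mls_x, df_mls_y[, 1])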

MLS Data

First, I’ll run the LOOCV models on the MLS data.

Show Code
# loop through response variables
for (y in 1:ncol(df_mls_y)){
  
  # create modeling data frame
  df_mod <- cbind(df_mls_y[y], df_mls_x)
  
  # create empty vectors for predictions and observations
  prd <- c()
  obs <- c()
  
  # loocv loop
  for (i in 1:nrow(df_mod)){
    
    # split into training and validation
    df_mod_train <- df_mod[-i,]
    df_mod_valid <- df_mod[i,]
    
    # build model
    mod <- ranger(
      x = df_mod_train[,2:ncol(df_mod_train)],
      y = df_mod_train[,1]
    )
    
    # make predictions
    prd_fold <- predict(mod, df_mod_valid)$predictions
    obs_fold <- df_mod_valid[,1]
    
    # compile predictions
    prd <- c(prd, prd_fold)
    obs <- c(obs, obs_fold)
  
  }
  
  # get performance metrics
  r2 <- summary(lm(prd ~ obs))$adj.r.squared
  rmse <- sqrt(mean((prd - obs) ^ 2))
  perc_rmse <- 100 * rmse / mean(obs)
  bias <- mean(prd - obs)
  perc_bias <- 100 * bias / mean(obs)
  
  # compile metrics
  if (y == 1){
    df_perf_mls_rf_loo <- data.frame(
      response = colnames(df_mod)[1],
      r2 = r2,
      rmse = rmse,
      perc_rmse = perc_rmse,
      bias = bias,
      perc_bias = perc_bias
    )
  } else {
    df_perf_mls_rf_loo <- rbind(
      df_perf_mls_rf_loo, 
      data.frame(
        response = colnames(df_mod)[1],
        r2 = r2,
        rmse = rmse,
        perc_rmse = perc_rmse,
        bias = bias,
        perc_bias = perc_bias
      )
    )
  }
  
  # write predicted vs. observed table to csv
  df_pred_obs_mls_rf_loo <- data.frame(
    plot_id = df_mls_id,
    observed = obs,
    predicted = prd
  )
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/pred_obs/mls_rf_loo"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".csv"))
  write.csv(df_pred_obs_mls_rf_loo, file_name, row.names = F)
  
  # plot predicted vs. observed
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/plots/mls_rf_loo"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".jpg"))
  plot_pred_obs(prd, obs, var_name, T, file_name)

}

# sort performance metrics by name
df_perf_mls_rf_loo <- df_perf_mls_rf_loo[order(df_perf_mls_rf_loo$response),]
row.names(df_perf_mls_rf_loo) <- 1:nrow(df_perf_mls_rf_loo)

# write out performance metrics csv
write.csv(
  df_perf_mls_rf_loo,
  "S:/ursa2/campbell/ems4d/data/from_eugenia/perf_mets/mls_rf_loo.csv",
  row.names = F
)

# print table
kable(
  df_perf_mls_rf_loo, 
  caption = "Table 3. Performance Metrics of Leave One Out Cross-Validated Random Forest Models on MLS data",
  digits = 2,
  col.names = c("Response Variable", "R2", "RMSE", "%RMSE", "Bias", "%Bias")
)
Table 3. Performance Metrics of Leave One Out Cross-Validated Random Forest Models on MLS data
Response Variable R2 RMSE %RMSE Bias %Bias
AGB_tr_dead_mgha 0.58 21.44 60.71 1.26 3.56
AGB_tr_live_mgha 0.54 31.30 60.87 -0.85 -1.65
AGB_tr_mgha 0.68 40.05 46.17 1.19 1.37
Basal.Area 0.73 8.18 39.40 0.43 2.09
br_d_mgha 0.58 4.92 64.74 0.30 3.97
br_l_mgha 0.62 5.55 52.70 -0.17 -1.60
branches_mgha 0.72 8.02 44.22 0.58 3.18
c_vol_ratio 0.38 0.19 65.78 0.00 -0.35
CBD_d 0.18 0.06 106.16 0.00 1.58
CBD_l 0.10 0.07 98.52 0.00 -5.12
CBD_tot 0.31 0.08 69.18 -0.01 -6.06
CBH_dead 0.55 0.73 46.13 -0.01 -0.51
CBH_live 0.31 1.39 65.29 -0.06 -2.83
CWD 0.13 36.47 132.45 2.86 10.40
D_surf 0.27 17.70 94.78 0.31 1.69
DFF 0.19 16.32 100.75 0.31 1.90
duff_dep 0.25 0.82 114.06 0.03 4.80
fol_d_mgha 0.39 1.30 77.46 0.07 4.41
fol_l_mgha 0.58 1.65 63.30 -0.03 -1.10
foliage_mgha 0.64 2.16 50.44 0.03 0.78
height 0.65 2.63 36.60 -0.12 -1.62
hr1000_mgha 0.10 33.56 140.20 2.94 12.30
HS -0.01 4.08 104.51 0.16 4.16
litt_dep 0.49 0.99 62.65 -0.02 -1.48
Maximum.Height 0.77 4.52 27.96 -0.02 -0.10
mean_dleng 0.28 1.55 72.51 0.08 3.66
SDB 0.58 4.91 64.90 0.30 3.97
SDS 0.65 7.24 50.19 0.15 1.04
SS 0.38 4.85 92.61 0.13 2.39
SSd 0.11 1.91 111.45 0.02 0.98
tot_herb -0.02 1.57 186.29 0.07 7.87
tot_shb 0.03 3.75 122.42 0.10 3.11
tot_surf 0.23 19.67 87.10 0.45 2.02
tot_wood 0.69 39.65 44.16 1.29 1.43
total_AGB 0.71 45.71 41.82 1.31 1.20
Tree.Density 0.32 1489.90 104.61 81.71 5.74

TLS Data

Next, I’ll run the LOOCV models on the TLS data.

Show Code
# loop through response variables
for (y in 1:ncol(df_tls_y)){
  
  # create modeling data frame
  df_mod <- cbind(df_tls_y[y], df_tls_x)
  
  # create empty vectors for predictions and observations
  prd <- c()
  obs <- c()
  
  # loocv loop
  for (i in 1:nrow(df_mod)){
    
    # split into training and validation
    df_mod_train <- df_mod[-i,]
    df_mod_valid <- df_mod[i,]
    
    # build model
    mod <- ranger(
      x = df_mod_train[,2:ncol(df_mod_train)],
      y = df_mod_train[,1]
    )
    
    # make predictions
    prd_fold <- predict(mod, df_mod_valid)$predictions
    obs_fold <- df_mod_valid[,1]
    
    # compile predictions
    prd <- c(prd, prd_fold)
    obs <- c(obs, obs_fold)
  
  }
  
  # get performance metrics
  r2 <- summary(lm(prd ~ obs))$adj.r.squared
  rmse <- sqrt(mean((prd - obs) ^ 2))
  perc_rmse <- 100 * rmse / mean(obs)
  bias <- mean(prd - obs)
  perc_bias <- 100 * bias / mean(obs)
  
  # compile metrics
  if (y == 1){
    df_perf_tls_rf_loo <- data.frame(
      response = colnames(df_mod)[1],
      r2 = r2,
      rmse = rmse,
      perc_rmse = perc_rmse,
      bias = bias,
      perc_bias = perc_bias
    )
  } else {
    df_perf_tls_rf_loo <- rbind(
      df_perf_tls_rf_loo, 
      data.frame(
        response = colnames(df_mod)[1],
        r2 = r2,
        rmse = rmse,
        perc_rmse = perc_rmse,
        bias = bias,
        perc_bias = perc_bias
      )
    )
  }
  
  # write predicted vs. observed table to csv
  df_pred_obs_tls_rf_loo <- data.frame(
    plot_id = df_tls_id,
    observed = obs,
    predicted = prd
  )
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/pred_obs/tls_rf_loo"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".csv"))
  write.csv(df_pred_obs_tls_rf_loo, file_name, row.names = F)
  
  # plot predicted vs. observed
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/plots/tls_rf_loo"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".jpg"))
  plot_pred_obs(prd, obs, var_name, T, file_name)

}

# sort performance metrics by name
df_perf_tls_rf_loo <- df_perf_tls_rf_loo[order(df_perf_tls_rf_loo$response),]
row.names(df_perf_tls_rf_loo) <- 1:nrow(df_perf_tls_rf_loo)

# write out performance metrics csv
write.csv(
  df_perf_tls_rf_loo,
  "S:/ursa2/campbell/ems4d/data/from_eugenia/perf_mets/tls_rf_loo.csv",
  row.names = F
)

# print table
kable(
  df_perf_tls_rf_loo, 
  caption = "Table 4. Performance Metrics of Leave One Out Cross-Validated Random Forest Models on TLS data",
  digits = 2,
  col.names = c("Response Variable", "R2", "RMSE", "%RMSE", "Bias", "%Bias")
)
Table 4. Performance Metrics of Leave One Out Cross-Validated Random Forest Models on TLS data
Response Variable R2 RMSE %RMSE Bias %Bias
AGB_tr_dead_mgha 0.48 25.54 63.04 -0.24 -0.59
AGB_tr_live_mgha 0.48 49.58 71.64 -0.57 -0.82
AGB_tr_mgha 0.54 65.46 59.66 1.65 1.51
Basal.Area 0.69 10.06 41.31 -0.12 -0.48
br_d_mgha 0.51 5.44 63.16 -0.07 -0.79
br_l_mgha 0.53 9.16 65.46 -0.18 -1.30
branches_mgha 0.61 12.32 54.49 -0.04 -0.17
c_vol_ratio 0.31 0.19 64.02 0.00 1.30
CBD_d 0.26 0.05 93.22 0.00 1.47
CBD_l 0.37 0.06 78.44 0.00 -0.03
CBD_tot 0.44 0.08 58.34 0.00 -2.32
CBH_dead 0.54 0.72 43.67 -0.03 -1.86
CBH_live 0.33 1.47 62.61 -0.05 -2.14
CWD 0.07 36.78 131.20 3.68 13.12
D_surf 0.33 16.01 82.47 0.41 2.10
DFF 0.25 14.48 86.71 0.21 1.26
duff_dep 0.56 0.67 78.00 0.00 0.43
fol_d_mgha 0.30 1.54 77.66 0.02 0.98
fol_l_mgha 0.40 2.91 83.49 0.00 0.03
foliage_mgha 0.40 4.01 73.34 0.06 1.04
height 0.62 2.94 37.74 -0.11 -1.41
hr1000_mgha 0.04 33.98 140.28 3.56 14.71
HS 0.09 3.70 96.80 0.23 6.02
litt_dep 0.25 1.14 70.38 0.03 1.69
Maximum.Height 0.83 4.02 23.01 0.07 0.42
mean_dleng 0.36 1.39 61.50 0.00 0.05
SDB 0.50 5.51 64.08 -0.07 -0.87
SDS 0.42 10.40 64.80 0.06 0.35
SS 0.08 8.95 146.51 0.06 0.96
SSd 0.09 2.66 134.65 0.06 2.83
tot_herb -0.01 1.48 186.29 0.09 11.54
tot_shb 0.03 3.75 124.03 0.29 9.72
tot_surf 0.32 17.18 73.91 0.49 2.12
tot_wood 0.53 66.34 58.86 0.46 0.41
total_AGB 0.57 70.29 52.88 1.89 1.42
Tree.Density 0.13 2192.81 133.01 108.55 6.58

MLS vs TLS Comparison

Now I’ll compare the performance of MLS and TLS using the LOOCV models’ performance metrics. Note that in the resulting plot, the variables are ranked from best (top) to worst (bottom) performing, based on the average of their R2 and %RMSE ranks across both MLS and TLS.

Show Code

# merge the performance metrics data
df_perf_rf_loo <- merge(
  df_perf_mls_rf_loo,
  df_perf_tls_rf_loo,
  by = "response",
  suffixes = c("_mls", "_tls")
)

# get ranks and sort data.frame
rank_r2_mls <- rank(df_perf_rf_loo$r2_mls * -1)
rank_r2_tls <- rank(df_perf_rf_loo$r2_tls * -1)
rank_perc_rmse_mls <- rank(df_perf_rf_loo$perc_rmse_mls)
rank_perc_rmse_tls <- rank(df_perf_rf_loo$perc_rmse_tls)
rank_mean <- cbind(
  rank_r2_mls,
  rank_r2_tls,
  rank_perc_rmse_mls,
  rank_perc_rmse_tls
) |>
  rowMeans()
df_perf_rf_loo$rank_mean <- rank_mean
df_perf_rf_loo <- df_perf_rf_loo[order(df_perf_rf_loo$rank_mean, decreasing = T),]

# set up plot
par(mfrow = c(1,2), las = 1, mar = c(5,5,2,1))

#------------------------r2

# plot the data
dotchart(
  x = df_perf_rf_loo$r2_mls,
  labels = df_perf_rf_loo$response,
  xlim = range(c(df_perf_rf_loo[,c("r2_mls", "r2_tls")])),
  pch = "",
  main = "R2"
)
points(
  x = df_perf_rf_loo$r2_mls,
  y = 1:nrow(df_perf_rf_loo),
  pch = 16,
  col = 2
)
points(
  x = df_perf_rf_loo$r2_tls,
  y = 1:nrow(df_perf_rf_loo),
  pch = 16,
  col = 4
)

#------------------------rmse

# plot the data
dotchart(
  x = df_perf_rf_loo$perc_rmse_mls,
  labels = df_perf_rf_loo$response,
  xlim = range(c(df_perf_rf_loo[,c("perc_rmse_mls", "perc_rmse_tls")])),
  pch = "",
  main = "%RMSE"
)
points(
  x = df_perf_rf_loo$perc_rmse_mls,
  y = 1:nrow(df_perf_rf_loo),
  pch = 16,
  col = 2
)
points(
  x = df_perf_rf_loo$perc_rmse_tls,
  y = 1:nrow(df_perf_rf_loo),
  pch = 16,
  col = 4
)

# add legend
legend(
  "topright",
  legend = c("MLS", "TLS"),
  pch = 16,
  col = c(2, 4)
)

Figure 2. MLS vs. TLS LOOCV RF Model Performance Comparison

Model Type #3: Site-to-Site (2-Fold) Cross-Validation

This third model architecture is what I would describe as the most robust so far, since it builds models using data from one site (either NK or FL) and applies them to the prediction of the other, with model performance metrics based on the combined predicted versus observed values that result from both folds. It provides a robust measure of extrapolative capacity in a different study area/ecosystem – that is, if you were to apply these models to the prediction of fuel structure outside of NK or FL, how well would they perform?
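The site split itself just keys off the plot ID prefixes. Here is a minimal sketch of the 2-fold site-to-site scheme for a single MLS response (assuming the objects loaded above); the full loops below apply the same idea to every response variable.

# sketch: derive site membership from plot ids and run 2-fold site-to-site CV
site <- ifelse(grepl("NK", df_mls_id), "NK", "FL")
y1 <- df_mls_y[, 1]
prd <- numeric(length(y1))

for (s in c("NK", "FL")){
  train <- site != s   # train on the other site
  test  <- site == s   # predict the held-out site
  mod_s <- ranger(x = df_mls_x[train, ], y = y1[train])
  prd[test] <- predict(mod_s, df_mls_x[test, ])$predictions
}

# performance is then computed on the pooled cross-site predictions
sqrt(mean((prd - y1)^2))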

MLS Data

First, I’ll apply the site CV modeling approach to the MLS data.

Show Code
# loop through response variables
for (y in 1:ncol(df_mls_y)){
  
  # create modeling data frame
  df_mod <- cbind(df_mls_y[y], df_mls_x)
  
  # create empty vectors for predictions and observations
  prd <- c()
  obs <- c()
  ids <- c()
  
  # get site ids
  rows_fl <- grep("FL", df_mls_id)
  rows_nk <- grep("NK", df_mls_id)
  rows_list <- list(rows_fl, rows_nk)
  
  # sitecv loop
  for (i in 1:2){
    
    # split into training and validation
    df_mod_train <- df_mod[-rows_list[[i]],]
    df_mod_valid <- df_mod[rows_list[[i]],]
    
    # build model
    mod <- ranger(
      x = df_mod_train[,2:ncol(df_mod_train)],
      y = df_mod_train[,1]
    )
    
    # make predictions
    prd_fold <- predict(mod, df_mod_valid)$predictions
    obs_fold <- df_mod_valid[,1]
    
    # compile predictions
    prd <- c(prd, prd_fold)
    obs <- c(obs, obs_fold)
    ids <- c(ids, df_mls_id[rows_list[[i]]])
  
  }
  
  # get performance metrics
  r2 <- summary(lm(prd ~ obs))$adj.r.squared
  rmse <- sqrt(mean((prd - obs) ^ 2))
  perc_rmse <- 100 * rmse / mean(obs)
  bias <- mean(prd - obs)
  perc_bias <- 100 * bias / mean(obs)
  
  # compile metrics
  if (y == 1){
    df_perf_mls_rf_sitecv <- data.frame(
      response = colnames(df_mod)[1],
      r2 = r2,
      rmse = rmse,
      perc_rmse = perc_rmse,
      bias = bias,
      perc_bias = perc_bias
    )
  } else {
    df_perf_mls_rf_sitecv <- rbind(
      df_perf_mls_rf_sitecv, 
      data.frame(
        response = colnames(df_mod)[1],
        r2 = r2,
        rmse = rmse,
        perc_rmse = perc_rmse,
        bias = bias,
        perc_bias = perc_bias
      )
    )
  }
  
  # write predicted vs. observed table to csv
  df_pred_obs_mls_rf_sitecv <- data.frame(
    plot_id = ids,
    observed = obs,
    predicted = prd
  )
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/pred_obs/mls_rf_sitecv"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".csv"))
  write.csv(df_pred_obs_mls_rf_sitecv, file_name, row.names = F)

  # plot predicted vs. observed
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/plots/mls_rf_sitecv"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".jpg"))
  plot_pred_obs(prd, obs, var_name, T, file_name)

}

# sort performance metrics by name
df_perf_mls_rf_sitecv <- df_perf_mls_rf_sitecv[order(df_perf_mls_rf_sitecv$response),]
row.names(df_perf_mls_rf_sitecv) <- 1:nrow(df_perf_mls_rf_sitecv)

# write out performance metrics csv
write.csv(
  df_perf_mls_rf_sitecv,
  "S:/ursa2/campbell/ems4d/data/from_eugenia/perf_mets/mls_rf_sitecv.csv",
  row.names = F
)

# print table
kable(
  df_perf_mls_rf_sitecv,
  caption = "Table 5. Performance Metrics of Site-to-Site Cross-Validated Random Forest Models on MLS Data",
  digits = 2,
  col.names = c("Response Variable", "R2", "RMSE", "%RMSE", "Bias", "%Bias")
)
Table 5. Performance Metrics of Site-to-Site Cross-Validated Random Forest Models on MLS Data
Response Variable R2 RMSE %RMSE Bias %Bias
AGB_tr_dead_mgha 0.39 26.05 73.74 3.43 9.72
AGB_tr_live_mgha 0.57 31.40 61.06 -6.61 -12.86
AGB_tr_mgha 0.67 41.93 48.34 2.41 2.78
Basal.Area 0.71 8.85 42.59 0.25 1.22
br_d_mgha 0.21 7.09 93.24 1.12 14.70
br_l_mgha 0.62 5.73 54.46 -0.88 -8.36
branches_mgha 0.52 10.54 58.11 0.61 3.39
c_vol_ratio 0.30 0.21 72.63 0.06 21.02
CBD_d 0.06 0.06 116.70 0.01 27.52
CBD_l 0.04 0.07 103.66 -0.01 -19.09
CBD_tot 0.21 0.09 73.62 0.00 -0.61
CBH_dead 0.39 0.87 55.07 0.09 5.88
CBH_live 0.17 1.56 73.40 -0.19 -8.95
CWD 0.00 41.40 150.35 8.93 32.45
D_surf 0.03 22.11 118.43 3.10 16.59
DFF 0.04 18.54 114.48 2.23 13.79
duff_dep 0.28 0.80 111.43 0.02 2.22
fol_d_mgha 0.26 1.43 85.03 0.28 16.45
fol_l_mgha 0.34 2.09 80.43 -0.40 -15.29
foliage_mgha 0.49 2.60 60.75 -0.11 -2.49
height 0.50 3.17 44.11 0.12 1.69
hr1000_mgha 0.00 37.62 157.17 9.44 39.45
HS 0.02 4.91 125.71 0.20 5.05
litt_dep 0.14 1.32 83.66 -0.01 -0.56
Maximum.Height 0.71 5.19 32.05 0.55 3.42
mean_dleng 0.27 1.59 74.83 0.43 20.26
SDB 0.20 7.12 94.20 1.19 15.67
SDS 0.15 12.00 83.25 2.04 14.11
SS 0.09 6.15 117.36 1.29 24.64
SSd -0.03 2.44 142.55 0.57 33.12
tot_herb 0.06 1.39 165.31 -0.02 -1.88
tot_shb 0.02 4.78 155.92 0.54 17.60
tot_surf 0.00 24.26 107.46 2.48 10.97
tot_wood 0.66 42.85 47.73 1.74 1.94
total_AGB 0.62 53.74 49.17 0.83 0.76
Tree.Density 0.20 1645.98 115.57 358.81 25.19

TLS Data

Next, I’ll apply the site CV modeling approach to the TLS data.

Show Code
# loop through response variables
for (y in 1:ncol(df_tls_y)){
  
  # create modeling data frame
  df_mod <- cbind(df_tls_y[y], df_tls_x)
  
  # create empty vectors for predictions and observations
  prd <- c()
  obs <- c()
  ids <- c()
  
  # get site ids
  rows_fl <- grep("FL", df_tls_id)
  rows_nk <- grep("NK", df_tls_id)
  rows_list <- list(rows_fl, rows_nk)
  
  # sitecv loop
  for (i in 1:2){
    
    # split into training and validation
    df_mod_train <- df_mod[-rows_list[[i]],]
    df_mod_valid <- df_mod[rows_list[[i]],]
    
    # build model
    mod <- ranger(
      x = df_mod_train[,2:ncol(df_mod_train)],
      y = df_mod_train[,1]
    )
    
    # make predictions
    prd_fold <- predict(mod, df_mod_valid)$predictions
    obs_fold <- df_mod_valid[,1]
    
    # compile predictions
    prd <- c(prd, prd_fold)
    obs <- c(obs, obs_fold)
    ids <- c(ids, df_tls_id[rows_list[[i]]])

  }
  
  # get performance metrics
  r2 <- summary(lm(prd ~ obs))$adj.r.squared
  rmse <- sqrt(mean((prd - obs) ^ 2))
  perc_rmse <- 100 * rmse / mean(obs)
  bias <- mean(prd - obs)
  perc_bias <- 100 * bias / mean(obs)
  
  # compile metrics
  if (y == 1){
    df_perf_tls_rf_sitecv <- data.frame(
      response = colnames(df_mod)[1],
      r2 = r2,
      rmse = rmse,
      perc_rmse = perc_rmse,
      bias = bias,
      perc_bias = perc_bias
    )
  } else {
    df_perf_tls_rf_sitecv <- rbind(
      df_perf_tls_rf_sitecv, 
      data.frame(
        response = colnames(df_mod)[1],
        r2 = r2,
        rmse = rmse,
        perc_rmse = perc_rmse,
        bias = bias,
        perc_bias = perc_bias
      )
    )
  }
  
  # write predicted vs. observed table to csv
  df_pred_obs_tls_rf_sitecv <- data.frame(
    plot_id = ids,
    observed = obs,
    predicted = prd
  )
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/pred_obs/tls_rf_sitecv"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".csv"))
  write.csv(df_pred_obs_tls_rf_sitecv, file_name, row.names = F)

  # plot predicted vs. observed
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/plots/tls_rf_sitecv"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".jpg"))
  plot_pred_obs(prd, obs, var_name, T, file_name)

}

# sort performance metrics by name
df_perf_tls_rf_sitecv <- df_perf_tls_rf_sitecv[order(df_perf_tls_rf_sitecv$response),]
row.names(df_perf_tls_rf_sitecv) <- 1:nrow(df_perf_tls_rf_sitecv)

# write out performance metrics csv
write.csv(
  df_perf_tls_rf_sitecv,
  "S:/ursa2/campbell/ems4d/data/from_eugenia/perf_mets/tls_rf_sitecv.csv",
  row.names = F
)

# print table
kable(
  df_perf_tls_rf_sitecv,
  caption = "Table 6. Performance Metrics of Site-to-Site Cross-Validated Random Forest Models on TLS Data",
  digits = 2,
  col.names = c("Response Variable", "R2", "RMSE", "%RMSE", "Bias", "%Bias")
)
Table 6. Performance Metrics of Site-to-Site Cross-Validated Random Forest Models on TLS Data
Response Variable R2 RMSE %RMSE Bias %Bias
AGB_tr_dead_mgha 0.37 29.08 71.75 7.56 18.67
AGB_tr_live_mgha 0.39 55.82 80.66 -11.12 -16.07
AGB_tr_mgha 0.45 73.81 67.27 -8.46 -7.71
Basal.Area 0.60 12.13 49.83 -1.20 -4.95
br_d_mgha 0.31 6.82 79.17 2.06 23.87
br_l_mgha 0.51 9.86 70.51 -1.52 -10.84
branches_mgha 0.52 14.02 62.01 -0.11 -0.50
c_vol_ratio 0.22 0.23 74.84 0.09 30.06
CBD_d 0.10 0.06 107.67 0.02 29.28
CBD_l 0.13 0.07 93.22 -0.01 -16.43
CBD_tot 0.20 0.09 69.89 0.00 -3.56
CBH_dead 0.34 0.93 56.09 0.21 12.60
CBH_live 0.20 1.62 68.93 -0.17 -7.14
CWD -0.03 42.50 151.62 9.44 33.69
D_surf -0.02 24.98 128.67 5.37 27.66
DFF -0.02 20.75 124.24 3.34 20.03
duff_dep 0.44 0.77 88.91 0.13 15.45
fol_d_mgha 0.13 1.76 88.51 0.28 13.97
fol_l_mgha 0.17 3.56 101.93 -0.94 -27.07
foliage_mgha 0.15 4.85 88.70 -0.89 -16.30
height 0.51 3.34 42.94 0.04 0.56
hr1000_mgha -0.03 36.93 152.45 7.70 31.81
HS -0.03 4.48 117.30 0.38 10.06
litt_dep -0.01 1.69 103.87 0.26 16.25
Maximum.Height 0.69 5.65 32.38 0.06 0.32
mean_dleng 0.35 1.48 65.66 0.48 21.14
SDB 0.34 6.70 77.89 2.18 25.34
SDS 0.34 11.42 71.12 2.66 16.58
SS 0.15 8.60 140.65 1.28 20.93
SSd 0.01 2.83 143.42 0.33 16.64
tot_herb 0.09 1.32 165.49 0.13 16.02
tot_shb 0.00 4.80 158.73 0.69 22.96
tot_surf -0.03 27.91 120.12 5.68 24.44
tot_wood 0.43 75.80 67.26 -10.04 -8.91
total_AGB 0.38 85.35 64.21 -8.26 -6.22
Tree.Density 0.19 2086.54 126.57 311.32 18.88

MLS vs TLS Comparison

Now I’ll compare the performance of MLS and TLS using the site CV models’ performance metrics. Note that in the resulting plot, the variables are ranked from best (top) to worst (bottom) performing, based on the average of their R2 and %RMSE ranks across both MLS and TLS.

Show Code
# merge the performance metrics data
df_perf_rf_sitecv <- merge(
  df_perf_mls_rf_sitecv,
  df_perf_tls_rf_sitecv,
  by = "response",
  suffixes = c("_mls", "_tls")
)

# get ranks and sort data.frame
rank_r2_mls <- rank(df_perf_rf_sitecv$r2_mls * -1)
rank_r2_tls <- rank(df_perf_rf_sitecv$r2_tls * -1)
rank_perc_rmse_mls <- rank(df_perf_rf_sitecv$perc_rmse_mls)
rank_perc_rmse_tls <- rank(df_perf_rf_sitecv$perc_rmse_tls)
rank_mean <- cbind(
  rank_r2_mls,
  rank_r2_tls,
  rank_perc_rmse_mls,
  rank_perc_rmse_tls
) |>
  rowMeans()
df_perf_rf_sitecv$rank_mean <- rank_mean
df_perf_rf_sitecv <- df_perf_rf_sitecv[order(df_perf_rf_sitecv$rank_mean, decreasing = T),]

# set up plot
par(mfrow = c(1,2), las = 1, mar = c(5,5,2,1))

#------------------------r2

# plot the data
dotchart(
  x = df_perf_rf_sitecv$r2_mls,
  labels = df_perf_rf_sitecv$response,
  xlim = range(c(df_perf_rf_sitecv[,c("r2_mls", "r2_tls")])),
  pch = "",
  main = "R2"
)
points(
  x = df_perf_rf_sitecv$r2_mls,
  y = 1:nrow(df_perf_rf_sitecv),
  pch = 16,
  col = 2
)
points(
  x = df_perf_rf_sitecv$r2_tls,
  y = 1:nrow(df_perf_rf_sitecv),
  pch = 16,
  col = 4
)

#------------------------rmse

# plot the data
dotchart(
  x = df_perf_rf_sitecv$perc_rmse_mls,
  labels = df_perf_rf_sitecv$response,
  xlim = range(c(df_perf_rf_sitecv[,c("perc_rmse_mls", "perc_rmse_tls")])),
  pch = "",
  main = "%RMSE"
)
points(
  x = df_perf_rf_sitecv$perc_rmse_mls,
  y = 1:nrow(df_perf_rf_sitecv),
  pch = 16,
  col = 2
)
points(
  x = df_perf_rf_sitecv$perc_rmse_tls,
  y = 1:nrow(df_perf_rf_sitecv),
  pch = 16,
  col = 4
)

# add legend
legend(
  "topright",
  legend = c("MLS", "TLS"),
  pch = 16,
  col = c(2, 4)
)

Figure 3. MLS vs. TLS Site CV RF Model Performance Comparison

Model Type #4: Bootstrapping

The fourth approach is the one Carlos and Eugenia were using – a 500-iteration bootstrapping approach, whereby on each iteration a sample drawn with replacement, equal in size to ~70% of the total dataset, is used to train a model, which is then used to make predictions on the remaining data. Given that each validation plot is predicted multiple times across iterations, the mean of its predictions is used as the final prediction for comparison to the observed fuel structure data.
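For clarity, a single iteration of this scheme looks like the sketch below (again assuming the MLS objects from above). One subtlety worth noting: because the training rows are drawn with replacement, the held-out set on any given iteration typically contains closer to half of the plots (roughly e^-0.7, or ~50%, of rows are never drawn) rather than 30%.

# sketch of a single bootstrap iteration
n <- nrow(df_mls_x)
train_rows <- sample(n, round(0.7 * n), replace = TRUE)  # draw with replacement
test_rows  <- setdiff(1:n, train_rows)                   # plots never drawn

mod_b <- ranger(x = df_mls_x[train_rows, ], y = df_mls_y[train_rows, 1])
prd_b <- predict(mod_b, df_mls_x[test_rows, ])$predictions

# across 500 iterations, each plot's final prediction is the mean of the
# predictions it received whenever it landed in the held-out set
length(test_rows) / n  # typically ~0.5 rather than 0.3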

MLS Data

First, I’ll run the bootstrapping approach on the MLS data.

Show Code
# loop through response variables
for (y in 1:ncol(df_mls_y)){
  
  # create modeling data frame
  df_mod <- cbind(df_mls_y[y], df_mls_x)

  # bootstrap loop
  for (i in 1:500){
    
    # split into training and validation
    train_rows <- sample(nrow(df_mod), round(0.7 * nrow(df_mod)), T)
    test_rows <- (1:nrow(df_mod))[-train_rows]
    test_ids <- df_mls_id[test_rows]
    df_mod_train <- df_mod[train_rows,]
    df_mod_valid <- df_mod[test_rows,]
    
    # build model
    mod <- ranger(
      x = df_mod_train[,2:ncol(df_mod_train)],
      y = df_mod_train[,1]
    )
    
    # make predictions
    prd_iter <- predict(mod, df_mod_valid)$predictions
    obs_iter <- df_mod_valid[,1]
    
    # compile predictions
    if (i == 1){
      df_pred_obs <- data.frame(
        plot_id = test_ids,
        predicted = prd_iter,
        observed = obs_iter
      )
    } else {
      df_pred_obs <- rbind(
        df_pred_obs, 
        data.frame(plot_id = test_ids,
                   predicted = prd_iter,
                   observed = obs_iter
        )
      )
    }
  
  }
  
  # get mean of predicted and observed values by plot
  df_pred_obs_summ <- df_pred_obs |>
    group_by(plot_id) |>
    summarize(
      observed = mean(observed),
      predicted = mean(predicted)
    ) |>
    as.data.frame()
  
  # get performance metrics
  prd <- df_pred_obs_summ$predicted
  obs <- df_pred_obs_summ$observed
  r2 <- summary(lm(prd ~ obs))$adj.r.squared
  rmse <- sqrt(mean((prd - obs) ^ 2))
  perc_rmse <- 100 * rmse / mean(obs)
  bias <- mean(prd - obs)
  perc_bias <- 100 * bias / mean(obs)
  
  # compile metrics
  if (y == 1){
    df_perf_mls_rf_boot <- data.frame(
      response = colnames(df_mod)[1],
      r2 = r2,
      rmse = rmse,
      perc_rmse = perc_rmse,
      bias = bias,
      perc_bias = perc_bias
    )
  } else {
    df_perf_mls_rf_boot <- rbind(
      df_perf_mls_rf_boot,
      data.frame(
        response = colnames(df_mod)[1],
        r2 = r2,
        rmse = rmse,
        perc_rmse = perc_rmse,
        bias = bias,
        perc_bias = perc_bias
      )
    )
  }
  
  # write predicted vs. observed table to csv
  df_pred_obs_mls_rf_boot <- data.frame(
    plot_id = df_pred_obs_summ$plot_id,
    observed = obs,
    predicted = prd
  )
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/pred_obs/mls_rf_boot"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".csv"))
  write.csv(df_pred_obs_mls_rf_boot, file_name, row.names = F)

  # plot predicted vs. observed
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/plots/mls_rf_boot"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".jpg"))
  plot_pred_obs(prd, obs, var_name, T, file_name)

}

# sort performance metrics by name
df_perf_mls_rf_boot <- df_perf_mls_rf_boot[order(df_perf_mls_rf_boot$response),]
row.names(df_perf_mls_rf_boot) <- 1:nrow(df_perf_mls_rf_boot)

# write out performance metrics csv
write.csv(
  df_perf_mls_rf_boot,
  "S:/ursa2/campbell/ems4d/data/from_eugenia/perf_mets/mls_rf_boot.csv",
  row.names = F
)

# print table
kable(
  df_perf_mls_rf_boot,
  caption = "Table 7. Performance Metrics of Bootstrapped Random Forest Models on MLS Data",
  digits = 2,
  col.names = c("Response Variable", "R2", "RMSE", "%RMSE", "Bias", "%Bias")
)
Table 7. Performance Metrics of Bootstrapped Random Forest Models on MLS Data
Response Variable R2 RMSE %RMSE Bias %Bias
AGB_tr_dead_mgha 0.60 20.91 59.19 0.74 2.09
AGB_tr_live_mgha 0.56 30.91 60.10 -1.37 -2.66
AGB_tr_mgha 0.71 38.16 43.99 0.36 0.41
Basal.Area 0.74 8.12 39.08 0.37 1.78
br_d_mgha 0.59 4.86 63.94 0.32 4.16
br_l_mgha 0.63 5.54 52.60 -0.12 -1.14
branches_mgha 0.73 7.97 43.96 0.27 1.48
c_vol_ratio 0.37 0.20 66.12 0.01 4.30
CBD_d 0.13 0.06 109.29 0.00 6.31
CBD_l 0.02 0.07 102.54 0.00 -4.05
CBD_tot 0.23 0.09 72.57 0.00 -0.48
CBH_dead 0.56 0.72 45.47 -0.01 -0.48
CBH_live 0.28 1.41 66.08 -0.07 -3.47
CWD 0.08 37.47 136.10 2.77 10.04
D_surf 0.30 17.13 91.74 0.08 0.43
DFF 0.22 15.64 96.57 0.05 0.32
duff_dep 0.31 0.78 107.49 0.02 3.25
fol_d_mgha 0.42 1.25 74.51 0.06 3.33
fol_l_mgha 0.54 1.76 67.68 -0.09 -3.41
foliage_mgha 0.64 2.18 50.93 -0.01 -0.21
height 0.67 2.59 36.08 -0.08 -1.16
hr1000_mgha 0.08 34.08 142.40 3.68 15.36
HS -0.02 3.99 102.04 0.15 3.84
litt_dep 0.49 1.02 64.79 -0.05 -3.31
Maximum.Height 0.78 4.47 27.66 -0.08 -0.52
mean_dleng 0.32 1.48 69.33 0.12 5.51
SDB 0.60 4.76 62.95 0.22 2.97
SDS 0.64 7.42 51.48 0.03 0.22
SS 0.37 4.88 93.22 0.15 2.85
SSd 0.10 1.89 110.66 0.07 4.30
tot_herb 0.00 1.49 177.24 0.06 7.58
tot_shb 0.02 3.71 121.02 0.15 4.76
tot_surf 0.25 19.01 84.19 0.25 1.12
tot_wood 0.71 38.87 43.29 0.70 0.78
total_AGB 0.72 45.64 41.76 0.92 0.84
Tree.Density 0.30 1499.89 105.31 54.42 3.82

TLS Data

Next, I’ll run the bootstrapping approach on the TLS data.

Show Code
# loop through response variables
for (y in 1:ncol(df_tls_y)){
  
  # create modeling data frame
  df_mod <- cbind(df_tls_y[y], df_tls_x)

  # bootstrap loop
  for (i in 1:500){
    
    # split into training and validation
    train_rows <- sample(nrow(df_mod), round(0.7 * nrow(df_mod)), T)
    test_rows <- (1:nrow(df_mod))[-train_rows]
    test_ids <- df_tls_id[test_rows]
    df_mod_train <- df_mod[train_rows,]
    df_mod_valid <- df_mod[test_rows,]
    
    # build model
    mod <- ranger(
      x = df_mod_train[,2:ncol(df_mod_train)],
      y = df_mod_train[,1]
    )
    
    # make predictions
    prd_iter <- predict(mod, df_mod_valid)$predictions
    obs_iter <- df_mod_valid[,1]
    
    # compile predictions
    if (i == 1){
      df_pred_obs <- data.frame(
        plot_id = test_ids,
        predicted = prd_iter,
        observed = obs_iter
      )
    } else {
      df_pred_obs <- rbind(
        df_pred_obs, 
        data.frame(plot_id = test_ids,
                   predicted = prd_iter,
                   observed = obs_iter
        )
      )
    }
  
  }
  
  # get mean of predicted and observed values by plot
  df_pred_obs_summ <- df_pred_obs |>
    group_by(plot_id) |>
    summarize(
      observed = mean(observed),
      predicted = mean(predicted)
    ) |>
    as.data.frame()
  
  # get performance metrics
  prd <- df_pred_obs_summ$predicted
  obs <- df_pred_obs_summ$observed
  r2 <- summary(lm(prd ~ obs))$adj.r.squared
  rmse <- sqrt(mean((prd - obs) ^ 2))
  perc_rmse <- 100 * rmse / mean(obs)
  bias <- mean(prd - obs)
  perc_bias <- 100 * bias / mean(obs)
  
  # compile metrics
  if (y == 1){
    df_perf_tls_rf_boot <- data.frame(
      response = colnames(df_mod)[1],
      r2 = r2,
      rmse = rmse,
      perc_rmse = perc_rmse,
      bias = bias,
      perc_bias = perc_bias
    )
  } else {
    df_perf_tls_rf_boot <- rbind(
      df_perf_tls_rf_boot,
      data.frame(
        response = colnames(df_mod)[1],
        r2 = r2,
        rmse = rmse,
        perc_rmse = perc_rmse,
        bias = bias,
        perc_bias = perc_bias
      )
    )
  }
  
  # write predicted vs. observed table to csv
  df_pred_obs_tls_rf_boot <- data.frame(
    plot_id = df_pred_obs_summ$plot_id,
    observed = obs,
    predicted = prd
  )
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/pred_obs/tls_rf_boot"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".csv"))
  write.csv(df_pred_obs_tls_rf_boot, file_name, row.names = F)

  # plot predicted vs. observed
  var_name <- colnames(df_mod)[1]
  dir_name <- "S:/ursa2/campbell/ems4d/data/from_eugenia/plots/tls_rf_boot"
  dir.create(dir_name, showWarnings = F)
  file_name <- gsub("\\.", "_", var_name) |> tolower()
  file_name <- file.path(dir_name, paste0(file_name, ".jpg"))
  plot_pred_obs(prd, obs, var_name, T, file_name)

}

# sort performance metrics by name
df_perf_tls_rf_boot <- df_perf_tls_rf_boot[order(df_perf_tls_rf_boot$response),]
row.names(df_perf_tls_rf_boot) <- 1:nrow(df_perf_tls_rf_boot)

# write out performance metrics csv
write.csv(
  df_perf_tls_rf_boot,
  "S:/ursa2/campbell/ems4d/data/from_eugenia/perf_mets/tls_rf_boot.csv",
  row.names = F
)

# print table
kable(
  df_perf_tls_rf_boot,
  caption = "Table 8. Performance Metrics of Bootstrapped Random Forest Models on TLS Data",
  digits = 2,
  col.names = c("Response Variable", "R2", "RMSE", "%RMSE", "Bias", "%Bias")
)
Table 8. Performance Metrics of Bootstrapped Random Forest Models on TLS Data
Response Variable R2 RMSE %RMSE Bias %Bias
AGB_tr_dead_mgha 0.47 25.99 64.14 0.70 1.72
AGB_tr_live_mgha 0.48 49.89 72.10 1.11 1.61
AGB_tr_mgha 0.52 67.08 61.13 1.49 1.36
Basal.Area 0.68 10.38 42.65 0.17 0.69
br_d_mgha 0.51 5.49 63.71 0.08 0.90
br_l_mgha 0.53 9.31 66.53 0.03 0.24
branches_mgha 0.60 12.72 56.29 0.14 0.60
c_vol_ratio 0.32 0.19 63.57 0.01 3.43
CBD_d 0.22 0.05 96.62 0.00 3.83
CBD_l 0.27 0.06 85.00 0.00 3.61
CBD_tot 0.46 0.08 59.83 0.00 2.01
CBH_dead 0.53 0.73 44.01 -0.02 -1.34
CBH_live 0.30 1.49 63.32 -0.01 -0.43
CWD 0.06 36.79 131.25 4.13 14.74
D_surf 0.27 16.60 85.48 0.41 2.14
DFF 0.22 14.69 87.97 0.23 1.40
duff_dep 0.50 0.72 83.83 0.03 3.68
fol_d_mgha 0.28 1.57 78.95 0.06 3.00
fol_l_mgha 0.42 2.87 82.35 0.04 1.25
foliage_mgha 0.40 4.00 73.01 0.13 2.30
height 0.63 2.93 37.69 -0.05 -0.65
hr1000_mgha 0.03 33.35 137.67 3.76 15.51
HS 0.07 3.71 96.98 0.23 5.93
litt_dep 0.24 1.15 70.51 0.04 2.43
Maximum.Height 0.82 4.26 24.38 0.23 1.31
mean_dleng 0.37 1.39 61.45 0.04 1.64
SDB 0.50 5.58 64.82 0.04 0.46
SDS 0.43 10.35 64.49 0.15 0.93
SS 0.10 8.79 143.89 0.13 2.05
SSd 0.07 2.67 135.57 0.06 3.02
tot_herb 0.02 1.40 176.24 0.06 7.76
tot_shb 0.02 3.73 123.18 0.29 9.45
tot_surf 0.27 17.79 76.54 0.70 2.99
tot_wood 0.52 67.88 60.23 2.13 1.89
total_AGB 0.56 72.39 54.46 2.09 1.57
Tree.Density 0.16 2118.66 128.51 127.68 7.74

MLS vs TLS Comparison

Now I’ll compare the performance of MLS and TLS using the bootstrapped models’ performance metrics. Note that in the resulting plot, the variables are ranked from best (top) to worst (bottom) performing, based on the average of their R2 and %RMSE ranks across both MLS and TLS.

Show Code
# merge the performance metrics data
df_perf_rf_boot <- merge(
  df_perf_mls_rf_boot,
  df_perf_tls_rf_boot,
  by = "response",
  suffixes = c("_mls", "_tls")
)

# get ranks and sort data.frame
rank_r2_mls <- rank(df_perf_rf_boot$r2_mls * -1)
rank_r2_tls <- rank(df_perf_rf_boot$r2_tls * -1)
rank_perc_rmse_mls <- rank(df_perf_rf_boot$perc_rmse_mls)
rank_perc_rmse_tls <- rank(df_perf_rf_boot$perc_rmse_tls)
rank_mean <- cbind(
  rank_r2_mls,
  rank_r2_tls,
  rank_perc_rmse_mls,
  rank_perc_rmse_tls
) |>
  rowMeans()
df_perf_rf_boot$rank_mean <- rank_mean
df_perf_rf_boot <- df_perf_rf_boot[order(df_perf_rf_boot$rank_mean, decreasing = T),]

# set up plot
par(mfrow = c(1,2), las = 1, mar = c(5,5,2,1))

#------------------------r2

# plot the data
dotchart(
  x = df_perf_rf_boot$r2_mls,
  labels = df_perf_rf_boot$response,
  xlim = range(c(df_perf_rf_boot[,c("r2_mls", "r2_tls")])),
  pch = "",
  main = "R2"
)
points(
  x = df_perf_rf_boot$r2_mls,
  y = 1:nrow(df_perf_rf_boot),
  pch = 16,
  col = 2
)
points(
  x = df_perf_rf_boot$r2_tls,
  y = 1:nrow(df_perf_rf_boot),
  pch = 16,
  col = 4
)

#------------------------rmse

# plot the data
dotchart(
  x = df_perf_rf_boot$perc_rmse_mls,
  labels = df_perf_rf_boot$response,
  xlim = range(c(df_perf_rf_boot[,c("perc_rmse_mls", "perc_rmse_tls")])),
  pch = "",
  main = "%RMSE"
)
points(
  x = df_perf_rf_boot$perc_rmse_mls,
  y = 1:nrow(df_perf_rf_boot),
  pch = 16,
  col = 2
)
points(
  x = df_perf_rf_boot$perc_rmse_tls,
  y = 1:nrow(df_perf_rf_boot),
  pch = 16,
  col = 4
)

# add legend
legend(
  "topright",
  legend = c("MLS", "TLS"),
  pch = 16,
  col = c(2, 4)
)

Figure 4. MLS vs. TLS Bootstrapped RF Model Performance Comparison

Compare Four Model Types

I thought it would be interesting to take a look at how the four different modeling approaches compare. To do this, I will focus on two performance metrics: R2 and %RMSE. For each response variable, I’ll take the mean R2 and %RMSE across all four models, and then compare each model to that mean value to see how much better or worse it is than the average. I’ll plot the results out as density plots, where vertical dotted lines represent each model’s median difference from the mean.
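In compact form, the comparison boils down to something like the following sketch, shown here for R2 on the MLS side only (the full code below does the same for both metrics and both data types, and builds the density plots). The rows of the four performance data frames align because each was sorted by response name above.

# sketch: difference of each model's R2 from the four-model mean, per response
r2_mat <- cbind(
  full   = df_perf_mls_rf_full$r2,
  loocv  = df_perf_mls_rf_loo$r2,
  sitecv = df_perf_mls_rf_sitecv$r2,
  boot   = df_perf_mls_rf_boot$r2
)
r2_diff <- r2_mat - rowMeans(r2_mat)  # positive = better than the 4-model average
apply(r2_diff, 2, median)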

MLS Data

First, I’ll compare the MLS models.

Show Code
# convert results dfs to matrices and simplify
mat_perf_mls_rf_full <- as.matrix(df_perf_mls_rf_full[,c("r2", "perc_rmse")])
mat_perf_mls_rf_loo <- as.matrix(df_perf_mls_rf_loo[,c("r2", "perc_rmse")])
mat_perf_mls_rf_sitecv <- as.matrix(df_perf_mls_rf_sitecv[,c("r2", "perc_rmse")])
mat_perf_mls_rf_boot <- as.matrix(df_perf_mls_rf_boot[,c("r2", "perc_rmse")])

# get performance metric means by response variable
mat_mls_mean <- (
  mat_perf_mls_rf_full +
    mat_perf_mls_rf_loo + 
    mat_perf_mls_rf_sitecv +
    mat_perf_mls_rf_boot
  ) / 4

# get individual model differences to mean performance across models
mat_diff_mls_rf_full <- mat_perf_mls_rf_full - mat_mls_mean
mat_diff_mls_rf_loo <- mat_perf_mls_rf_loo - mat_mls_mean
mat_diff_mls_rf_sitecv <- mat_perf_mls_rf_sitecv - mat_mls_mean
mat_diff_mls_rf_boot <- mat_perf_mls_rf_boot - mat_mls_mean

# define kernel density plotting function
kd_plot <- function(d, col, new, med, xlab){
  if (new == T){
    plot(x = xlim, y = ylim, bty = "n", xlab = NA, xaxt = "n", ylab = NA, 
         yaxt = "n", type = "n")
    abline(v = 0, col = "lightgray", lwd = 2)
    axis(1)
    mtext(xlab, 1, 3)
  }
  x <- d$x
  y <- d$y
  poly_col <- adjust_transparency(col, 0.25)
  polygon(c(min(x), x, max(x)), c(0, y, 0), col = poly_col,  border = NA)
  lines(y ~ x, lwd = 2, col = col)
  abline(v = med, lty = 3, col = col)
}

# set up plot
par(mfrow = c(2,1), las = 1, mar = c(5,5,1,1))

#-------------------r2

# get kernel densities
dens_mls_r2_rf_full <- density(
  mat_diff_mls_rf_full[,1],
  bw = 0.02,
  from = min(mat_diff_mls_rf_full[,1]),
  to = max(mat_diff_mls_rf_full[,1])
)
dens_mls_r2_rf_loo <- density(
  mat_diff_mls_rf_loo[,1], 
  bw = 0.02,
  from = min(mat_diff_mls_rf_loo[,1]),
  to = max(mat_diff_mls_rf_loo[,1])
)
dens_mls_r2_rf_sitecv <- density(
  mat_diff_mls_rf_sitecv[,1], 
  bw = 0.02,
  from = min(mat_diff_mls_rf_sitecv[,1]),
  to = max(mat_diff_mls_rf_sitecv[,1])
)
dens_mls_r2_rf_boot <- density(
  mat_diff_mls_rf_boot[,1], 
  bw = 0.02,
  from = min(mat_diff_mls_rf_boot[,1]),
  to = max(mat_diff_mls_rf_boot[,1])
)

# get axis limits
xlim <- range(
  c(
    dens_mls_r2_rf_full$x,
    dens_mls_r2_rf_loo$x,
    dens_mls_r2_rf_sitecv$x,
    dens_mls_r2_rf_boot$x
  )
)
ylim <- range(
  c(
    dens_mls_r2_rf_full$y,
    dens_mls_r2_rf_loo$y,
    dens_mls_r2_rf_sitecv$y,
    dens_mls_r2_rf_boot$y
  )
)

# define colors
cols <- brewer.pal(4, "Dark2")

# plot them out
kd_plot(dens_mls_r2_rf_full, cols[1], T, median(mat_diff_mls_rf_full[,1]), "R2 Difference")
kd_plot(dens_mls_r2_rf_loo, cols[2], F, median(mat_diff_mls_rf_loo[,1]))
kd_plot(dens_mls_r2_rf_sitecv, cols[3], F, median(mat_diff_mls_rf_sitecv[,1]))
kd_plot(dens_mls_r2_rf_boot, cols[4], F, median(mat_diff_mls_rf_boot[,1]))

# add legend
legend(
  "topleft",
  legend = c("Full", "LOOCV", "Site CV", "Boot"),
  fill = adjust_transparency(cols, 0.25), 
  border = cols, 
  bty = "n"
)

#-------------------rmse

# get kernel densities (column 2 = %RMSE differences; wider bandwidth for this scale)
dens_mls_rmse_rf_full <- density(
  mat_diff_mls_rf_full[,2],
  bw = 2,
  from = min(mat_diff_mls_rf_full[,2]),
  to = max(mat_diff_mls_rf_full[,2])
)
dens_mls_rmse_rf_loo <- density(
  mat_diff_mls_rf_loo[,2],
  bw = 2,
  from = min(mat_diff_mls_rf_loo[,2]),
  to = max(mat_diff_mls_rf_loo[,2])
)
dens_mls_rmse_rf_sitecv <- density(
  mat_diff_mls_rf_sitecv[,2],
  bw = 2,
  from = min(mat_diff_mls_rf_sitecv[,2]),
  to = max(mat_diff_mls_rf_sitecv[,2])
)
dens_mls_rmse_rf_boot <- density(
  mat_diff_mls_rf_boot[,2],
  bw = 2,
  from = min(mat_diff_mls_rf_boot[,2]),
  to = max(mat_diff_mls_rf_boot[,2])
)

# get axis limits
xlim <- range(
  c(
    dens_mls_rmse_rf_full$x,
    dens_mls_rmse_rf_loo$x,
    dens_mls_rmse_rf_sitecv$x,
    dens_mls_rmse_rf_boot$x
  )
)
ylim <- range(
  c(
    dens_mls_rmse_rf_full$y,
    dens_mls_rmse_rf_loo$y,
    dens_mls_rmse_rf_sitecv$y,
    dens_mls_rmse_rf_boot$y
  )
)

# define colors
cols <- brewer.pal(4, "Dark2")

# plot them out
kd_plot(dens_mls_rmse_rf_full, cols[1], T, median(mat_diff_mls_rf_full[,2]), "%RMSE Difference")
kd_plot(dens_mls_rmse_rf_loo, cols[2], F, median(mat_diff_mls_rf_loo[,2]))
kd_plot(dens_mls_rmse_rf_sitecv, cols[3], F, median(mat_diff_mls_rf_sitecv[,2]))
kd_plot(dens_mls_rmse_rf_boot, cols[4], F, median(mat_diff_mls_rf_boot[,2]))

Figure 5. MLS Model Type Performance Comparison

TLS Data

Next, I’ll compare the TLS models.

Show Code
# convert results dfs to matrices and simplify
mat_perf_tls_rf_full <- as.matrix(df_perf_tls_rf_full[,c("r2", "perc_rmse")])
mat_perf_tls_rf_loo <- as.matrix(df_perf_tls_rf_loo[,c("r2", "perc_rmse")])
mat_perf_tls_rf_sitecv <- as.matrix(df_perf_tls_rf_sitecv[,c("r2", "perc_rmse")])
mat_perf_tls_rf_boot <- as.matrix(df_perf_tls_rf_boot[,c("r2", "perc_rmse")])

# get performance metric means by response variable
mat_tls_mean <- (
  mat_perf_tls_rf_full +
    mat_perf_tls_rf_loo + 
    mat_perf_tls_rf_sitecv +
    mat_perf_tls_rf_boot
  ) / 4

# get individual model differences to mean performance across models
mat_diff_tls_rf_full <- mat_perf_tls_rf_full - mat_tls_mean
mat_diff_tls_rf_loo <- mat_perf_tls_rf_loo - mat_tls_mean
mat_diff_tls_rf_sitecv <- mat_perf_tls_rf_sitecv - mat_tls_mean
mat_diff_tls_rf_boot <- mat_perf_tls_rf_boot - mat_tls_mean

# define kernel density plotting function
kd_plot <- function(d, col, new, med, xlab){
  if (new == T){
    plot(x = xlim, y = ylim, bty = "n", xlab = NA, xaxt = "n", ylab = NA, 
         yaxt = "n", type = "n")
    abline(v = 0, col = "lightgray", lwd = 2)
    axis(1)
    mtext(xlab, 1, 3)
  }
  x <- d$x
  y <- d$y
  poly_col <- adjust_transparency(col, 0.25)
  polygon(c(min(x), x, max(x)), c(0, y, 0), col = poly_col,  border = NA)
  lines(y ~ x, lwd = 2, col = col)
  abline(v = med, lty = 3, col = col)
}

# set up plot
par(mfrow = c(2,1), las = 1, mar = c(5,5,1,1))

#-------------------r2

# get kernel densities
dens_tls_r2_rf_full <- density(
  mat_diff_tls_rf_full[,1],
  bw = 0.02,
  from = min(mat_diff_tls_rf_full[,1]),
  to = max(mat_diff_tls_rf_full[,1])
)
dens_tls_r2_rf_loo <- density(
  mat_diff_tls_rf_loo[,1], 
  bw = 0.02,
  from = min(mat_diff_tls_rf_loo[,1]),
  to = max(mat_diff_tls_rf_loo[,1])
)
dens_tls_r2_rf_sitecv <- density(
  mat_diff_tls_rf_sitecv[,1], 
  bw = 0.02,
  from = min(mat_diff_tls_rf_sitecv[,1]),
  to = max(mat_diff_tls_rf_sitecv[,1])
)
dens_tls_r2_rf_boot <- density(
  mat_diff_tls_rf_boot[,1], 
  bw = 0.02,
  from = min(mat_diff_tls_rf_boot[,1]),
  to = max(mat_diff_tls_rf_boot[,1])
)

# get axis limits
xlim <- range(
  c(
    dens_tls_r2_rf_full$x,
    dens_tls_r2_rf_loo$x,
    dens_tls_r2_rf_sitecv$x,
    dens_tls_r2_rf_boot$x
  )
)
ylim <- range(
  c(
    dens_tls_r2_rf_full$y,
    dens_tls_r2_rf_loo$y,
    dens_tls_r2_rf_sitecv$y,
    dens_tls_r2_rf_boot$y
  )
)

# define colors
cols <- brewer.pal(4, "Dark2")

# plot them out
kd_plot(dens_tls_r2_rf_full, cols[1], T, median(mat_diff_tls_rf_full[,1]), "R2 Difference")
kd_plot(dens_tls_r2_rf_loo, cols[2], F, median(mat_diff_tls_rf_loo[,1]))
kd_plot(dens_tls_r2_rf_sitecv, cols[3], F, median(mat_diff_tls_rf_sitecv[,1]))
kd_plot(dens_tls_r2_rf_boot, cols[4], F, median(mat_diff_tls_rf_boot[,1]))

# add legend
legend(
  "topleft",
  legend = c("Full", "LOOCV", "Site CV", "Boot"),
  fill = adjust_transparency(cols, 0.25), 
  border = cols, 
  bty = "n"
)

#-------------------rmse

# get kernel densities (column 2 = %RMSE differences; wider bandwidth for this scale)
dens_tls_rmse_rf_full <- density(
  mat_diff_tls_rf_full[,2],
  bw = 2,
  from = min(mat_diff_tls_rf_full[,2]),
  to = max(mat_diff_tls_rf_full[,2])
)
dens_tls_rmse_rf_loo <- density(
  mat_diff_tls_rf_loo[,2],
  bw = 2,
  from = min(mat_diff_tls_rf_loo[,2]),
  to = max(mat_diff_tls_rf_loo[,2])
)
dens_tls_rmse_rf_sitecv <- density(
  mat_diff_tls_rf_sitecv[,2],
  bw = 2,
  from = min(mat_diff_tls_rf_sitecv[,2]),
  to = max(mat_diff_tls_rf_sitecv[,2])
)
dens_tls_rmse_rf_boot <- density(
  mat_diff_tls_rf_boot[,2],
  bw = 2,
  from = min(mat_diff_tls_rf_boot[,2]),
  to = max(mat_diff_tls_rf_boot[,2])
)

# get axis limits
xlim <- range(
  c(
    dens_tls_rmse_rf_full$x,
    dens_tls_rmse_rf_loo$x,
    dens_tls_rmse_rf_sitecv$x,
    dens_tls_rmse_rf_boot$x
  )
)
ylim <- range(
  c(
    dens_tls_rmse_rf_full$y,
    dens_tls_rmse_rf_loo$y,
    dens_tls_rmse_rf_sitecv$y,
    dens_tls_rmse_rf_boot$y
  )
)

# define colors
cols <- brewer.pal(4, "Dark2")

# plot them out
kd_plot(dens_tls_rmse_rf_full, cols[1], T, median(mat_diff_tls_rf_full[,2]), "%RMSE Difference")
kd_plot(dens_tls_rmse_rf_loo, cols[2], F, median(mat_diff_tls_rf_loo[,2]))
kd_plot(dens_tls_rmse_rf_sitecv, cols[3], F, median(mat_diff_tls_rf_sitecv[,2]))
kd_plot(dens_tls_rmse_rf_boot, cols[4], F, median(mat_diff_tls_rf_boot[,2]))

Figure 6. TLS Model Type Performance Comparison

UPDATE 5/2/2024: A Quick Glimpse at Bias Correction

In our last meeting, Carlos discussed concerns about the bias in the predictions – the general trend of overestimation of low values and underestimation of high values across the models. Indeed, this is quite a common problem in machine learning models, particularly ensemble models like random forests. If you look at the %Bias figures from the various tables above, the bias is generally low, at least in the full, LOOCV, and bootstrapped models. As one might expect, the site-to-site models demonstrate considerably higher absolute bias. But, for the sake of a relatively simple example, let’s take a look at the full model type to examine the bias issue more closely. Specifically, I’ll focus on the MLS-driven total_AGB full model as a demonstrative case.

Show Code
# define response variable
y_name <- "total_AGB"
y <- which(colnames(df_mls_y) == y_name)
  
# create modeling data frame
df_mod <- cbind(df_mls_y[y], df_mls_x)

# build model
mod <- ranger(
  x = df_mod[,2:ncol(df_mod)],
  y = df_mod[,1]
)
  
# get predictions and observations
prd <- mod$predictions
obs <- df_mod[,1]

# plot it out
plot_pred_obs(prd, obs, y_name, F)

As can be seen, the model performed quite nicely. On average, across all observations, the model is relatively unbiased. However, the slope of the regression line comparing predictions to observations is clearly less than 1. Again, this is expected behavior – models that leave some variance unexplained almost always exhibit a prediction-observation slope of less than 1. Think of the extreme case: a model that is unable to explain any variance in the observations can do no better than guess around the mean of the observations, and such a model would have a prediction-observation slope of 0.
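To make the extreme-case intuition concrete, here is a small simulation (purely illustrative, not tied to the fuel data): an idealized model that predicts only the “explainable” part of the observations produces a predicted-vs-observed slope that shrinks toward 0 as the unexplained noise grows.

# sketch: slope of predictions regressed on observations shrinks as noise grows
set.seed(1)
signal <- rnorm(1000)
for (noise_sd in c(0.1, 1, 3)){
  obs_sim <- signal + rnorm(1000, sd = noise_sd)  # observations = signal + unexplained noise
  prd_sim <- signal                               # idealized model captures the signal only
  cat("noise sd:", noise_sd,
      " slope:", round(coef(lm(prd_sim ~ obs_sim))[2], 2), "\n")
}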

That being said, random forests (and other ensemble learners) are more prone to this behavior as a result of a few main factors:

  1. The fact that RF relies on bagging means that every decision tree is being built on only a sample of the dataset. Assuming observations of some fuel metric are fairly normally distributed, this means that extreme high and low values will be sampled less frequently in the construction of decision trees. Thus, predictions will tend towards the less extreme values.

  2. The fact that RF takes the mean among a series of decision tree predictions to produce its final predictions. Means will naturally act to moderate predictions.

  3. The fact that RF cannot extrapolate. It can only make predictions within the range of observations, basically ensuring that the highest observed value will always be underestimated (see the quick demonstration below).
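To demonstrate that third point, here is a quick toy example (illustrative only, not the fuel data) showing that a random forest trained on a simple linear relationship cannot predict beyond the range of the responses it was trained on.

# sketch: a random forest cannot extrapolate beyond its training response range
set.seed(1)
x_toy <- data.frame(x = runif(200, 0, 10))
y_toy <- 2 * x_toy$x + rnorm(200, sd = 0.5)  # responses range roughly 0-20

mod_toy <- ranger(x = x_toy, y = y_toy)

# ask for predictions well outside the training domain
x_new <- data.frame(x = c(15, 20, 50))
predict(mod_toy, x_new)$predictions  # all stay near max(y_toy) (~20), not 30/40/100
range(y_toy)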

In the hopes of mitigating RF’s tendency to underestimate high values and overestimate low values, we can explore a bias correction approach. There are many, but one that I (and others) have found to be useful and intuitive is empirical distribution matching. It basically compares the distribution of predictions to the distribution of observations, and adjusts predictions to more closely align with observations.

Here’s a quick example.

Show Code
# get quantiles of predictions and observations
qs <- seq(0,1,0.01)
q_prd <- quantile(prd, qs)
q_obs <- quantile(obs, qs)

# plot them out
par(mar = c(5,5,1,1), las = 1)
plot(
  x = qs,
  y = q_obs,
  type = "n",
  xlab = "Quantile",
  ylab = "total_AGB"
)
grid()
lines(q_obs ~ qs, lwd = 2, col = 2)
lines(q_prd ~ qs, lwd = 2, col = 4)
legend(
  "topleft",
  legend = c("Observations", "Predictions"),
  lwd = 2,
  col = c(2, 4)
)

Really, the two match up quite nicely (since I’ve chosen a pretty well-performing model). But, clearly at the high end, the predictions tend to be low in comparison to the observations. We can leverage the offset between these two lines to adjust the predictions to more closely match the distribution of observations.

Show Code
# get difference between predicted and observed quantiles
q_diff <- q_obs - q_prd

# correct bias
bias_corr <- approx(q_prd, q_diff, prd, rule = 2)$y
prd_corr <- prd + bias_corr

# plot it out
plot_pred_obs(prd_corr, obs, y_name, F)

As you can see, this bias correction approach yields a new prediction-observation relationship that has a slope closer to 1 and a bias that is functionally zero. However, it did so at the expense of other model performance metrics. R2 decreased and RMSE increased.
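For completeness, the before/after metrics can be checked directly from the objects above (a quick sketch using the same formulas as elsewhere in this document).

# quick check: metrics before and after bias correction
metrics <- function(p, o) c(
  r2   = summary(lm(p ~ o))$adj.r.squared,
  rmse = sqrt(mean((p - o)^2)),
  bias = mean(p - o)
)
round(rbind(uncorrected = metrics(prd, obs),
            corrected   = metrics(prd_corr, obs)), 2)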

So, if the ability to make extreme predictions well and capture the overarching trend in high versus low values is of key importance, bias correction may be a useful tool. If, however, model performance is central, and extreme values are less important, then accepting the innately biased predictions may be the best path forward.
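Finally, if this approach proves useful, the correction above could be wrapped into a small reusable helper along these lines (a sketch; quantile_match is a hypothetical name, not something from Carlos and Eugenia’s code). It learns the quantile-to-quantile offsets from one set of predictions and observations and applies them to any new predictions.

# sketch: empirical distribution matching as a reusable bias-correction helper
quantile_match <- function(prd_train, obs_train, prd_new, qs = seq(0, 1, 0.01)){
  q_prd <- quantile(prd_train, qs)              # quantiles of the predictions
  q_obs <- quantile(obs_train, qs)              # quantiles of the observations
  offset <- approx(q_prd, q_obs - q_prd,        # interpolate the offset at each new prediction
                   xout = prd_new, rule = 2)$y
  prd_new + offset
}

# example usage (hypothetical): correct the same predictions used to fit the mapping
# prd_corr2 <- quantile_match(prd, obs, prd)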