2-gsbe

Author

bbl

Background and science questions

(short background)

We ask four questions of increasing specificity:

  1. Does SPEI – incorporating the statistical deviations of precipitation and evaporative demand in a single index – add Rs predictive information beyond that of climate, LAI, vegetation, and remotely-derived productivity?

  2. Does the previous year’s SPEI value have Rs predictive value? I.e., is there a memory effect?

  3. Specifically for Rs observations in a non-drought year following one or more drought years, does the previous year’s SPEI have predictive value?

  4. Is there a ‘Birch effect’ in the observational record, in which non-drought Rs is higher than it otherwise would be when preceded by a drought?

0. Prep work and base models

Code
dat <- read_csv("srdb_joined_data.csv", show_col_types = FALSE)

message(nrow(dat), " rows of data")
4479 rows of data
Code
dat %>% 
    filter(is.na(Rs_growingseason) | (Rs_growingseason > 0 & Rs_growingseason < 20)) %>% 
    filter(is.na(Rs_annual) | (Rs_annual > 0 & Rs_annual < 4000)) %>% 
    filter(is.na(Rh_annual) | (Rh_annual > 0 & Rh_annual < 2000)) %>% 
    # Most SRDB records are from forests (~3000), then grasslands (~760), 
    # shrubland (~230), wetland (~210), and desert (~80). We add the next
    # one, savanna (~30), because it might be particularly vulnerable to 
    # a Birch effect, but group everything else into "Other"
    mutate(Ecosystem_type = if_else(Ecosystem_type %in% c("Desert", "Forest", "Grassland",
                                                          "Savanna", "Shrubland", "Wetland"),
                                    true = Ecosystem_type, 
                                    false = "Other",
                                    missing = "Other"),
           # vivid::pdpVars() below needs things as factors not char
           Ecosystem_type = as.factor(Ecosystem_type),
           Soil_type_name = as.factor(Soil_type_name)) ->
    dat_filtered

message("After filtering, ", nrow(dat_filtered), " rows of data")
After filtering, 4444 rows of data

Dependent variable distribution

We have three potential dependent variables:

  • Rs_annual (annual soil respiration)

  • Rh_annual (annual heterotrophic respiration)

  • Rs_growingseason (mean growing season soil respiration)

Their distributions are expected to be non-normal, which is a problem for linear models and not ideal for anything. Evaluate possible transformations.

Code
dat_filtered %>% 
    select(Rs_annual, Rh_annual, Rs_growingseason) %>% 
    pivot_longer(everything()) %>% 
    filter(!is.na(value), value > 0) %>% 
    rename(none = value) %>% 
    mutate(log = log(none), sqrt = sqrt(none)) %>% 
    
    pivot_longer(-name, names_to = "Transformation") %>% 
    ggplot(aes(value, color = Transformation)) + 
    geom_density() +
    facet_wrap(~ name + Transformation, scales = "free")

Code
dat_filtered %>% 
    mutate(sqrt_Rs_annual = sqrt(Rs_annual),
           sqrt_Rh_annual = sqrt(Rh_annual),
           sqrt_Rs_growingseason = sqrt(Rs_growingseason)) ->
    dat_filtered

Conclusion: we have a transformation winner: sqrt()!

Random Forest model: performance

We are using the ranger package.

Wright, M. N. and Ziegler, A.: Ranger: A fast implementation of random forests for high dimensional data in C++ and R, J. Stat. Softw., 77, https://doi.org/10.18637/jss.v077.i01, 2017.

Code
# Fit random forest model and test against validation data
# The independent variable is in the first column
fit_and_test_rf <- function(x_train, x_val) {
    f <- as.formula(paste(colnames(x_train)[1], "~ ."))
    rf <- ranger::ranger(formula = f, data = x_train, importance = "impurity")
    # use importance = "permutation" for importance_pvalues()
    
    if(!is.null(x_val)) {
        preds <- predict(rf, data = x_val)$predictions
        obs <- pull(x_val[1])
        validation_r2 <- r2(preds = preds, obs = obs)
    } else {
        validation_r2 <- NA_real_
    }

    return(list(model = rf,
                training_r2 = rf$r.squared, 
                importance = importance(rf), 
                validation_r2 = validation_r2))
}

# Do a k-fold cross-validation
# x = data; f = function to call; k = number of folds; ... = params for model
do_k_fold <- function(x, f, k = K_FOLD, quiet = TRUE, ...) {
    if(!all(complete.cases(x))) {
        stop("x should not have any NAs at this stage!")
    }
    
    groups <- sample.int(n = k, size = nrow(x), replace = TRUE)
    training_r2 <- rep.int(NA_real_, k)
    validation_r2 <- rep.int(NA_real_, k)
    for(i in seq_len(k)) {
        x_val <- x[groups == i,]
        x_train <- x[groups != i,]
        if(!quiet) {
            message("\tk-fold ", i, ":")
            message("\t\tTraining is ", nrow(x_train), " x ", ncol(x_train))
            message("\t\tValidation is ", nrow(x_val), " x ", ncol(x_val))
        }

        out <- f(x_train = x_train, x_val = x_val, ...)
        training_r2[i] <- out$training_r2
        validation_r2[i] <- out$validation_r2
    }
    tibble(k = seq_len(k), 
           training_r2 = training_r2,
           validation_r2 = validation_r2)
} # do_k_fold

# Base test dataset (all three d.v.'s)
dat_filtered %>% 
    select(sqrt_Rs_annual, sqrt_Rh_annual, sqrt_Rs_growingseason,
           Tair, Tsoil_lev1, Tsoil_lev2,
           Precip, VWC_lev1, VWC_lev2, 
           LAI_high, LAI_low, Ecosystem_type,
           starts_with("SPEI"),
           MODIS_NPP, MODIS_GPP,
           Veg_type_hi, Veg_type_low, Soil_type_name) ->
    dat_base_all

# Base test dataset without MODIS or SPEI data
dat_base_all %>% 
    select(-MODIS_NPP, -MODIS_GPP, -starts_with("SPEI")) ->
    dat_base_no_modispei

# Base test dataset (Rs_annual)
dat_base_no_modispei %>% 
    select(-sqrt_Rh_annual, -sqrt_Rs_growingseason) %>% 
    filter(complete.cases(.)) -> 
    dat_base_Rs_annual

message("The test complete-cases dataset has ", dim(dat_base_Rs_annual)[1],
        " rows and ", dim(dat_base_Rs_annual)[2], " columns")
The test complete-cases dataset has 3158 rows and 13 columns
Code
message("Full-data model performance:")
Full-data model performance:
Code
rf_all_data <- ranger(sqrt_Rs_annual ~ ., data = dat_base_Rs_annual, importance = "impurity")
print(rf_all_data)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_base_Rs_annual, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3158 
Number of independent variables:  12 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       25.88067 
R squared (OOB):                  0.6556625 
Code
imps <- sort(rf_all_data$variable.importance, decreasing = TRUE)

# Top three variables...
pdpVars(data = dat_base_Rs_annual, 
        fit = rf_all_data, 
        response = "sqrt_Rs_annual",
        vars = names(imps)[1:3])

Code
# ...and the next three
pdpVars(data = dat_base_Rs_annual, 
        fit = rf_all_data, 
        response = "sqrt_Rs_annual",
        vars = names(imps)[4:6])

Code
# k-fold cross-validation for all dependent variables
message("k-fold cross-validation:")
k-fold cross-validation:
Code
dvs <- c("sqrt_Rs_annual", "sqrt_Rh_annual", "sqrt_Rs_growingseason")
preds <- setdiff(colnames(dat_base_no_modispei), dvs)
kf_out <- list()
for(dv in dvs) {
    # Construct dataset and do the k-fold
    x <- dat_base_no_modispei[c(dv, preds)]
    x <- x[complete.cases(x),]
    kf_out[[dv]] <- do_k_fold(x, fit_and_test_rf) %>% mutate(dv = dv, n = nrow(x))
}
bind_rows(kf_out) %>% 
    group_by(dv) %>% 
    summarise(across(everything(), mean)) %>% 
    mutate(k = K_FOLD) -> # just so it's "10" or whatever as expected
    kf_out 
print(kf_out)
# A tibble: 3 × 5
  dv                        k training_r2 validation_r2     n
  <chr>                 <dbl>       <dbl>         <dbl> <dbl>
1 sqrt_Rh_annual           10       0.516         0.513   677
2 sqrt_Rs_annual           10       0.640         0.641  3158
3 sqrt_Rs_growingseason    10       0.558         0.555  1380

Side note: model performance is linearly related to dataset size! Huh.

Code
ggplot(kf_out, aes(n, validation_r2)) + geom_point() + geom_smooth(method = lm, formula = y~x)

How important are the MODIS data?

The MODIS data only start in 2001 (see gsbe-data-prep output), so requiring them in a model cuts off a lot of early data – on the order of 20% of the dataset. Are they needed?

Code
dat_base_all %>% 
    select(-sqrt_Rh_annual, -sqrt_Rs_growingseason, -starts_with("SPEI")) %>% 
    filter(complete.cases(.)) -> 
    dat_base_test_modis

rf_modis <- ranger(sqrt_Rs_annual ~ ., data = dat_base_test_modis, importance = "impurity")
print(rf_modis)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_base_test_modis, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      2299 
Number of independent variables:  14 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       25.45516 
R squared (OOB):                  0.6472434 
Code
message("Model *with* MODIS data: OOB R2 = ", round(rf_modis$r.squared, 4))
Model *with* MODIS data: OOB R2 = 0.6472
Code
message("Model *without* MODIS data: OOB R2 = ", round(rf_all_data$r.squared, 4))
Model *without* MODIS data: OOB R2 = 0.6557

Conclusion: it doesn’t seem like they’re needed.

Roads not taken

Test spatial Random Forest

Spatial autocorrelation wouldn’t change the fit of the model, but it could bias the variable importance metrics. Evaluate.

Code
# This needs the newest version 1.1.5, not yet on CRAN
# remotes::install_github(repo = "blasbenito/spatialRF",
#   ref = "main", force = TRUE, quiet = TRUE)
stopifnot(packageVersion("spatialRF") >= "1.1.5")

dat_base_no_modispei %>% 
    bind_cols(select(dat_filtered, Longitude, Latitude)) %>% 
    select(-sqrt_Rh_annual, -sqrt_Rs_growingseason) %>% 
    filter(complete.cases(.)) -> 
    dat_spatial_test

# Compute distance between all pairs of points
dm <- geodist(select(dat_spatial_test, Longitude, Latitude), measure = "geodesic")
dat_spatial_test$Longitude <- dat_spatial_test$Latitude <- NULL
predictors <- setdiff(colnames(dat_spatial_test), colnames(dat_spatial_test)[1])
spatialRF::rf(data = dat_spatial_test, 
              dependent.variable.name = colnames(dat_spatial_test)[1],
              predictor.variable.names = predictors,
              distance.matrix = dm)

Model type
  - Fitted with:                     ranger()
  - Response variable:               sqrt_Rs_annual

Random forest parameters
  - Type:                            Regression
  - Number of trees:                 500
  - Sample size:                     3158
  - Number of predictors:            12
  - Mtry:                            3
  - Minimum node size:               5


Model performance 
  - R squared (oob):                  0.65592
  - R squared (cor(obs, pred)^2):     0.8884252
  - Pseudo R squared (cor(obs, pred)):0.9425631
  - RMSE (oob):                       5.085402
  - RMSE:                             3.0042
  - Normalized RMSE:                  0.2898722

Model residuals 
  - Stats: 
             ┌────────┬────────┬────────┬───────┬────────┬───────┐
             │ Min.   │ 1st Q. │ Median │ Mean  │ 3rd Q. │ Max.  │
             ├────────┼────────┼────────┼───────┼────────┼───────┤
             │ -16.00 │  -1.60 │  -0.05 │ -0.01 │   1.56 │ 15.93 │
             └────────┴────────┴────────┴───────┴────────┴───────┘
  - Normality: 
      - Shapiro-Wilks W: 0.972 
      - p-value        : 0 
      - Interpretation : Residuals are not normal 

  - Spatial autocorrelation: 
             ┌───────────┬───────────┬─────────┬──────────────────┐
             │  Distance │ Moran's I │ P value │ Interpretation   │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │       0.0 │    -0.041 │   0.003 │ Negative spatial │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 3331584.0 │    -0.001 │   0.870 │ No spatial       │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 6663168.0 │    -0.000 │   0.907 │ No spatial       │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 9994752.0 │     0.000 │   0.221 │ No spatial       │
             │           │           │         │ correlation      │
             └───────────┴───────────┴─────────┴──────────────────┘

Variable importance: 
                        ┌────────────────┬────────────┐
                        │ Variable       │ Importance │
                        ├────────────────┼────────────┤
                        │ Tair           │      8.197 │
                        ├────────────────┼────────────┤
                        │ Tsoil_lev1     │      7.372 │
                        ├────────────────┼────────────┤
                        │ Tsoil_lev2     │      7.086 │
                        ├────────────────┼────────────┤
                        │ Precip         │      6.649 │
                        ├────────────────┼────────────┤
                        │ LAI_high       │      5.359 │
                        ├────────────────┼────────────┤
                        │ VWC_lev2       │      5.110 │
                        ├────────────────┼────────────┤
                        │ VWC_lev1       │      5.109 │
                        ├────────────────┼────────────┤
                        │ LAI_low        │      4.944 │
                        ├────────────────┼────────────┤
                        │ Veg_type_hi    │      4.124 │
                        ├────────────────┼────────────┤
                        │ Ecosystem_type │      3.910 │
                        ├────────────────┼────────────┤
                        │ Soil_type_name │      3.453 │
                        ├────────────────┼────────────┤
                        │ Veg_type_low   │      3.397 │
                        └────────────────┴────────────┘
Model type
  - Fitted with:                     ranger()
  - Response variable:               sqrt_Rs_annual

Random forest parameters
  - Type:                            Regression
  - Number of trees:                 500
  - Sample size:                     3158
  - Number of predictors:            12
  - Mtry:                            3
  - Minimum node size:               5


Model performance 
  - R squared (oob):                  0.65592
  - R squared (cor(obs, pred)^2):     0.8884252
  - Pseudo R squared (cor(obs, pred)):0.9425631
  - RMSE (oob):                       5.085402
  - RMSE:                             3.0042
  - Normalized RMSE:                  0.2898722

Model residuals 
  - Stats: 
             ┌────────┬────────┬────────┬───────┬────────┬───────┐
             │ Min.   │ 1st Q. │ Median │ Mean  │ 3rd Q. │ Max.  │
             ├────────┼────────┼────────┼───────┼────────┼───────┤
             │ -16.00 │  -1.60 │  -0.05 │ -0.01 │   1.56 │ 15.93 │
             └────────┴────────┴────────┴───────┴────────┴───────┘
  - Normality: 
      - Shapiro-Wilks W: 0.972 
      - p-value        : 0 
      - Interpretation : Residuals are not normal 

  - Spatial autocorrelation: 
             ┌───────────┬───────────┬─────────┬──────────────────┐
             │  Distance │ Moran's I │ P value │ Interpretation   │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │       0.0 │    -0.041 │   0.003 │ Negative spatial │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 3331584.0 │    -0.001 │   0.870 │ No spatial       │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 6663168.0 │    -0.000 │   0.907 │ No spatial       │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 9994752.0 │     0.000 │   0.221 │ No spatial       │
             │           │           │         │ correlation      │
             └───────────┴───────────┴─────────┴──────────────────┘

Variable importance: 
                        ┌────────────────┬────────────┐
                        │ Variable       │ Importance │
                        ├────────────────┼────────────┤
                        │ Tair           │      8.197 │
                        ├────────────────┼────────────┤
                        │ Tsoil_lev1     │      7.372 │
                        ├────────────────┼────────────┤
                        │ Tsoil_lev2     │      7.086 │
                        ├────────────────┼────────────┤
                        │ Precip         │      6.649 │
                        ├────────────────┼────────────┤
                        │ LAI_high       │      5.359 │
                        ├────────────────┼────────────┤
                        │ VWC_lev2       │      5.110 │
                        ├────────────────┼────────────┤
                        │ VWC_lev1       │      5.109 │
                        ├────────────────┼────────────┤
                        │ LAI_low        │      4.944 │
                        ├────────────────┼────────────┤
                        │ Veg_type_hi    │      4.124 │
                        ├────────────────┼────────────┤
                        │ Ecosystem_type │      3.910 │
                        ├────────────────┼────────────┤
                        │ Soil_type_name │      3.453 │
                        ├────────────────┼────────────┤
                        │ Veg_type_low   │      3.397 │
                        └────────────────┴────────────┘

Conclusion: no strong evidence of spatial autocorrelation.

Test XGBoost

Is there any reason to think that this popular algorithm would out-perform our Random Forest approach, and/or give different answers?

Code
fit_and_test_xgboost <- function(x_train, x_val) {
    m_xgb <- xgboost(x_train[-1], x_train[1])
    pred_train <- predict(m_xgb, newdata = x_train)
    pred_val <- predict(m_xgb, newdata = x_val)
    tibble(training_r2 = r2(pred_train, pull(x_train[1])),
           validation_r2 = r2(pred_val, pull(x_val[1])))
}

m_xgb <- xgboost(dat_base_Rs_annual[-1], dat_base_Rs_annual[1])
message("Variable importance of XGBoost model:")
Variable importance of XGBoost model:
Code
as_tibble(xgb.importance(m_xgb))
# A tibble: 12 × 4
   Feature          Gain  Cover Frequency
   <chr>           <dbl>  <dbl>     <dbl>
 1 Precip         0.207  0.152     0.154 
 2 Tair           0.157  0.137     0.212 
 3 LAI_low        0.119  0.116     0.0936
 4 LAI_high       0.105  0.127     0.101 
 5 Ecosystem_type 0.0916 0.0399    0.0644
 6 Tsoil_lev1     0.0663 0.0845    0.0828
 7 VWC_lev1       0.0539 0.113     0.116 
 8 Soil_type_name 0.0513 0.0273    0.0364
 9 Veg_type_hi    0.0433 0.0194    0.0226
10 VWC_lev2       0.0412 0.0915    0.0624
11 Tsoil_lev2     0.0355 0.0682    0.0278
12 Veg_type_low   0.0294 0.0241    0.0270
Code
message("k-fold cross-validation results:")
k-fold cross-validation results:
Code
do_k_fold(dat_base_Rs_annual, fit_and_test_xgboost) %>% 
    summary()
       k          training_r2     validation_r2   
 Min.   : 1.00   Min.   :0.9089   Min.   :0.5884  
 1st Qu.: 3.25   1st Qu.:0.9109   1st Qu.:0.6111  
 Median : 5.50   Median :0.9117   Median :0.6400  
 Mean   : 5.50   Mean   :0.9122   Mean   :0.6317  
 3rd Qu.: 7.75   3rd Qu.:0.9129   3rd Qu.:0.6494  
 Max.   :10.00   Max.   :0.9175   Max.   :0.6835  

Seems like xgboost is over-fitting to the training data. Note however that

Unlike random forests, GBMs can have high variability in accuracy dependent on their hyperparameter settings (Probst, Bischl, and Boulesteix 2018) So tuning can require much more strategy than a random forest model.

Conclusion: performance and variable importance seem similar to ranger.

Test neural network

We use the neuralnet package for this. I experimented with the hidden layer topology and threshold settings to find a configuration that reliably converges with these data. Note: fitting this is slow.

Code
library(neuralnet)

Attaching package: 'neuralnet'
The following object is masked from 'package:dplyr':

    compute
Code
fit_and_test_nn <- function(x_train, x_val) {
    # Algorithm is not guaranteed to converge; if this happens, returns NAs
    out <- tibble(training_r2 = NA_real_, validation_r2 = NA_real_)
    try({
        m_nn <- neuralnet(sqrt_Rs_annual ~ ., data = x_train,
                          hidden = c(12, 6), rep = 2, threshold = 0.4)
        pred_train <- predict(m_nn, newdata = x_train)
        pred_val <- predict(m_nn, newdata = x_val)
        out <- tibble(training_r2 = r2(pred_train, pull(x_train[1])),
                      validation_r2 = r2(pred_val, pull(x_val[1])))
    })
    return(out)
}

# Rescale the variables and change factors to numeric
x <- dat_base_Rs_annual
nums <- sapply(x, is.numeric)
for(col in names(nums)[nums]) x[col] <- scale(x[col])
fcts <- sapply(x[-1], is.factor)
for(col in names(fcts)[fcts]) x[col] <- as.numeric(pull(x[col]))

# Overall model
m_nn <- neuralnet(sqrt_Rs_annual ~ ., data = x, hidden = c(12, 6), rep = 2, threshold = 0.4)
message("R2 of two-layer model: ", round(r2(predict(m_nn, newdata = x), x$sqrt_Rs_annual), 3))
R2 of two-layer model: 0.544
Code
m_nn <- neuralnet(sqrt_Rs_annual ~ ., data = x, hidden = c(24, 12, 6), rep = 2, threshold = 0.4)
message("R2 of three-layer model: ", round(r2(predict(m_nn, newdata = x), x$sqrt_Rs_annual), 3))
R2 of three-layer model: 0.825
Code
message("k-fold cross-validation results:")
k-fold cross-validation results:
Code
do_k_fold(x, fit_and_test_nn, k = 10) %>% 
    summary()
Error in if (ncol.matrix < rep) { : argument is of length zero
Error in if (ncol.matrix < rep) { : argument is of length zero
       k          training_r2     validation_r2   
 Min.   : 1.00   Min.   :0.5697   Min.   :0.2616  
 1st Qu.: 3.25   1st Qu.:0.5871   1st Qu.:0.3473  
 Median : 5.50   Median :0.5980   Median :0.3677  
 Mean   : 5.50   Mean   :0.5945   Mean   :0.3811  
 3rd Qu.: 7.75   3rd Qu.:0.6033   3rd Qu.:0.4257  
 Max.   :10.00   Max.   :0.6110   Max.   :0.4971  
                 NA's   :2        NA's   :2       

With a complex topology (e.g., three hidden layers) the training R2 can be cranked up to 0.9+, but it exhibits severe overfitting.

Conclusion: I’m not an expert here! But, no evidence that this will be a magic bullet.

Linear regression model: performance

Code
# Fit random forest model and test against validation data
# The dependent variable is in the first column
fit_and_test_lm <- function(x_train, x_val, m) {
    pred_train <- predict(m, newdata = x_train)
    pred_val <- predict(m, newdata = x_val)
    tibble(training_r2 = r2(pred_train, pull(x_train[1])),
           validation_r2 = r2(pred_val, pull(x_val[1])))
}

message("Fitting Rs_annual linear model...")
Fitting Rs_annual linear model...
Code
m_Rs <- lm(sqrt_Rs_annual ~ Tair + Tsoil_lev1 + Tsoil_lev2 +
            VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 ^ 2 + I(VWC_lev2 ^ 2) +
            Ecosystem_type + LAI_high * LAI_low + 
            Veg_type_hi + Veg_type_low + Soil_type_name,
        data = dat_base_Rs_annual)
m_Rs_reduced <- MASS::stepAIC(m_Rs, trace = FALSE)
summary(m_Rs_reduced)

Call:
lm(formula = sqrt_Rs_annual ~ Tsoil_lev1 + Tsoil_lev2 + VWC_lev1 + 
    I(VWC_lev1^2) + VWC_lev2 + Ecosystem_type + LAI_high + LAI_low + 
    Veg_type_hi + Veg_type_low + Soil_type_name + LAI_high:LAI_low, 
    data = dat_base_Rs_annual)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.788  -4.468  -0.424   4.446  31.339 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     6.73744    2.28535   2.948 0.003221 ** 
Tsoil_lev1                     -5.56752    2.03423  -2.737 0.006237 ** 
Tsoil_lev2                      5.97704    2.02496   2.952 0.003184 ** 
VWC_lev1                       70.47764   11.76958   5.988 2.36e-09 ***
I(VWC_lev1^2)                 -29.49790   16.63618  -1.773 0.076306 .  
VWC_lev2                      -31.40560   12.35927  -2.541 0.011100 *  
Ecosystem_typeForest            5.06871    1.37258   3.693 0.000226 ***
Ecosystem_typeGrassland         6.61035    1.34959   4.898 1.02e-06 ***
Ecosystem_typeOther             5.98881    1.90823   3.138 0.001714 ** 
Ecosystem_typeSavanna           3.37860    1.99292   1.695 0.090118 .  
Ecosystem_typeShrubland         1.38935    1.43300   0.970 0.332349    
Ecosystem_typeWetland          -0.55350    1.47870  -0.374 0.708196    
LAI_high                        0.80447    0.20722   3.882 0.000106 ***
LAI_low                         1.26311    0.37379   3.379 0.000736 ***
Veg_type_hi                    -0.08159    0.01799  -4.534 6.00e-06 ***
Veg_type_low                   -0.09372    0.03168  -2.958 0.003121 ** 
Soil_type_nameReserved          3.66922    0.98904   3.710 0.000211 ***
Soil_type_nameSand              1.70314    0.52928   3.218 0.001305 ** 
Soil_type_nameSandy clay loam  -6.96492    1.70349  -4.089 4.45e-05 ***
Soil_type_nameSandy loam        1.51591    0.38876   3.899 9.85e-05 ***
Soil_type_nameSilt clay loam   -6.15584    2.21432  -2.780 0.005468 ** 
Soil_type_nameSilt loam        -2.07288    0.45658  -4.540 5.84e-06 ***
LAI_high:LAI_low               -0.26208    0.09273  -2.826 0.004738 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.422 on 3135 degrees of freedom
Multiple R-squared:  0.2723,    Adjusted R-squared:  0.2672 
F-statistic: 53.32 on 22 and 3135 DF,  p-value: < 2.2e-16
Code
par(mfrow = c(2, 2))
plot(m_Rs_reduced)

Code
message("k-fold cross-validation results:")
k-fold cross-validation results:
Code
lm_rhs_formula <- as.character(formula(m_Rs))[3]
kf_out <- list()
for(dv in dvs) {
    # Pull out the rhs of the formula and paste dv (dependent variable) in front
    f <- as.formula(paste(dv, "~", lm_rhs_formula))
    # Construct dataset and call lm() once to give us a base model
    x <- dat_base_no_modispei[c(dv, preds)]
    x <- x[complete.cases(x),]
    m <- lm(f, data = x)
    kf_out[[dv]] <- do_k_fold(x, f = fit_and_test_lm, m = m) %>% 
        mutate(dv = dv, n = nrow(x))
}
bind_rows(kf_out) %>% 
    group_by(dv) %>% 
    summarise(across(everything(), mean)) %>% 
    mutate(k = K_FOLD) # just so it's "10" or whatever as expected
# A tibble: 3 × 5
  dv                        k training_r2 validation_r2     n
  <chr>                 <dbl>       <dbl>         <dbl> <dbl>
1 sqrt_Rh_annual           10       0.157         0.145   677
2 sqrt_Rs_annual           10       0.272         0.269  3158
3 sqrt_Rs_growingseason    10       0.245         0.242  1380
Code
# Directly compare lm and rf predictions
tibble(Ecosystem_type = dat_base_Rs_annual$Ecosystem_type,
       lm_prediction = predict(m_Rs_reduced),
       rf_prediction = predict(rf_all_data, data = dat_base_Rs_annual)$predictions) %>% 
    ggplot(aes(lm_prediction, rf_prediction)) + 
    geom_point(alpha = 0.25) + 
    geom_abline() + 
    geom_smooth(method = "lm", formula = y ~ x) +
    ggtitle("Linear model versus Random Forest predictions") + 
    facet_wrap(~Ecosystem_type, scales = "free")

Conclusions:

  • Linear model performance is less than half of Random Forest

  • Variable importance is similar

1. Predictive value of SPEI

Code
dat_base_all %>% 
    select(-sqrt_Rh_annual, -sqrt_Rs_growingseason,
           -MODIS_NPP, -MODIS_GPP,
           -starts_with("SPEI24")) %>% 
    filter(complete.cases(.)) -> 
    dat_test_spei

# remove the "_y1" columns
yr1cols <- grep("_y1$", colnames(dat_test_spei))
dat_test_spei_current <- dat_test_spei[-yr1cols]

message("Testing Random Forest...")
Testing Random Forest...
Code
rf_spei <- ranger(sqrt_Rs_annual ~ ., data = dat_test_spei_current, importance = "impurity")
print(rf_spei)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_test_spei_current, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3158 
Number of independent variables:  13 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       25.35987 
R squared (OOB):                  0.6625917 
Code
sort(desc(importance(rf_spei)))
        Precip           Tair     Tsoil_lev1     Tsoil_lev2       LAI_high 
    -29019.611     -25021.905     -22390.646     -22284.996     -19480.540 
      VWC_lev1       VWC_lev2        LAI_low         SPEI12 Ecosystem_type 
    -17007.844     -16424.424     -16367.969     -15101.086     -13277.015 
  Veg_type_low    Veg_type_hi Soil_type_name 
     -7324.625      -7042.327      -5787.961 
Code
message("Model *with* SPEI data: OOB R2 = ", round(rf_spei$r.squared, 4))
Model *with* SPEI data: OOB R2 = 0.6626
Code
message("Model *without* SPEI data: OOB R2 = ", round(rf_all_data$r.squared, 4))
Model *without* SPEI data: OOB R2 = 0.6557
Code
message("Testing linear model...")
Testing linear model...
Code
m_Rs_spei <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + Tsoil_lev2 +
            VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 ^ 2 + I(VWC_lev2 ^ 2)) * 
               SPEI12 +
            Ecosystem_type + LAI_high * LAI_low + 
            Veg_type_hi + Veg_type_low + Soil_type_name,
        data = dat_test_spei_current)
m_Rs_spei_reduced <- MASS::stepAIC(m_Rs_spei, trace = FALSE)
summary(m_Rs_spei_reduced)

Call:
lm(formula = sqrt_Rs_annual ~ Tsoil_lev1 + Tsoil_lev2 + VWC_lev1 + 
    I(VWC_lev1^2) + VWC_lev2 + I(VWC_lev2^2) + SPEI12 + Ecosystem_type + 
    LAI_high + LAI_low + Veg_type_hi + Veg_type_low + Soil_type_name + 
    Tsoil_lev1:SPEI12 + VWC_lev1:SPEI12 + I(VWC_lev1^2):SPEI12 + 
    VWC_lev2:SPEI12 + I(VWC_lev2^2):SPEI12 + LAI_high:LAI_low, 
    data = dat_test_spei_current)

Residuals:
     Min       1Q   Median       3Q      Max 
-27.6119  -4.4274  -0.3507   4.4067  31.0208 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      7.28488    2.44960   2.974 0.002963 ** 
Tsoil_lev1                      -5.90778    2.09487  -2.820 0.004831 ** 
Tsoil_lev2                       6.32684    2.08545   3.034 0.002435 ** 
VWC_lev1                        76.03435   26.89406   2.827 0.004726 ** 
I(VWC_lev1^2)                  -22.66215   58.06661  -0.390 0.696357    
VWC_lev2                       -43.25009   29.46660  -1.468 0.142268    
I(VWC_lev2^2)                    2.60088   58.51991   0.044 0.964553    
SPEI12                           4.53003    1.63848   2.765 0.005730 ** 
Ecosystem_typeForest             5.64401    1.42897   3.950 8.00e-05 ***
Ecosystem_typeGrassland          7.20759    1.41205   5.104 3.52e-07 ***
Ecosystem_typeOther              6.63257    1.92770   3.441 0.000588 ***
Ecosystem_typeSavanna            4.53945    2.06551   2.198 0.028041 *  
Ecosystem_typeShrubland          2.15146    1.48690   1.447 0.148014    
Ecosystem_typeWetland            0.06634    1.53643   0.043 0.965563    
LAI_high                         0.74828    0.20746   3.607 0.000315 ***
LAI_low                          1.17528    0.37331   3.148 0.001658 ** 
Veg_type_hi                     -0.08151    0.01796  -4.539 5.87e-06 ***
Veg_type_low                    -0.09506    0.03181  -2.988 0.002828 ** 
Soil_type_nameReserved           3.36876    1.00349   3.357 0.000797 ***
Soil_type_nameSand               1.77210    0.53438   3.316 0.000923 ***
Soil_type_nameSandy clay loam   -7.61293    1.76515  -4.313 1.66e-05 ***
Soil_type_nameSandy loam         1.46363    0.38809   3.771 0.000165 ***
Soil_type_nameSilt clay loam    -6.49154    2.21259  -2.934 0.003372 ** 
Soil_type_nameSilt loam         -2.30605    0.46588  -4.950 7.82e-07 ***
Tsoil_lev1:SPEI12                0.03803    0.02574   1.477 0.139704    
VWC_lev1:SPEI12                 84.84272   33.54559   2.529 0.011482 *  
I(VWC_lev1^2):SPEI12          -107.42447   68.06831  -1.578 0.114624    
VWC_lev2:SPEI12               -114.46421   37.91129  -3.019 0.002554 ** 
I(VWC_lev2^2):SPEI12           143.42328   72.76120   1.971 0.048795 *  
LAI_high:LAI_low                -0.24093    0.09269  -2.599 0.009388 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.396 on 3128 degrees of freedom
Multiple R-squared:  0.2789,    Adjusted R-squared:  0.2722 
F-statistic: 41.72 on 29 and 3128 DF,  p-value: < 2.2e-16
Code
message("Model *with* SPEI data: training R2 = ", lm_r2(m_Rs_spei_reduced))
Model *with* SPEI data: training R2 = 0.2722
Code
message("Model *without* SPEI data: training R2 = ", lm_r2(m_Rs_reduced))
Model *without* SPEI data: training R2 = 0.2672
Code
message("ANOVA model comparison p-value: ", 
        anova_p_twomodels(m_Rs_spei_reduced, m_Rs_reduced))
ANOVA model comparison p-value: 2e-04

Conclusions:

  • SPEI impact on Random Forest model is…unclear. Small performance improvement.

  • SPEI does have effects in the linear model: significant interactions with many variables, significant difference between models with and without SPEI, and model R2 jumps modestly.

2. Predictive value of previous-year SPEI

Code
# Use the dat_test_spei dataset from above

message("Testing Random Forest...")
Testing Random Forest...
Code
rf_spei_y1 <- ranger(sqrt_Rs_annual ~ ., data = dat_test_spei, importance = "impurity")
print(rf_spei_y1)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_test_spei, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3158 
Number of independent variables:  14 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       25.79525 
R squared (OOB):                  0.656799 
Code
sort(desc(importance(rf_spei_y1)))
        Precip           Tair     Tsoil_lev1     Tsoil_lev2       LAI_high 
    -27331.254     -23552.644     -20992.323     -20536.254     -18321.323 
      VWC_lev1        LAI_low       VWC_lev2         SPEI12      SPEI12_y1 
    -16584.434     -15975.424     -14985.556     -13836.156     -13428.450 
Ecosystem_type    Veg_type_hi   Veg_type_low Soil_type_name 
    -12567.269      -6664.550      -6591.165      -5389.239 
Code
message("Model *with* previous-year SPEI data: OOB R2 = ", round(rf_spei_y1$r.squared, 4))
Model *with* previous-year SPEI data: OOB R2 = 0.6568
Code
message("Model *without* previous-year SPEI data: OOB R2 = ", round(rf_all_data$r.squared, 4))
Model *without* previous-year SPEI data: OOB R2 = 0.6557
Code
message("Testing linear model...")
Testing linear model...
Code
m_Rs_spei_y1 <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + Tsoil_lev2 +
            VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 ^ 2 + I(VWC_lev2 ^ 2)) * 
               (SPEI12 + SPEI12_y1) +
            Ecosystem_type + LAI_high * LAI_low + 
            Veg_type_hi + Veg_type_low + Soil_type_name,
        data = dat_test_spei)
m_Rs_spei_y1_reduced <- MASS::stepAIC(m_Rs_spei_y1, trace = FALSE)
summary(m_Rs_spei_y1_reduced)

Call:
lm(formula = sqrt_Rs_annual ~ Tair + Tsoil_lev1 + Tsoil_lev2 + 
    VWC_lev1 + I(VWC_lev1^2) + VWC_lev2 + I(VWC_lev2^2) + SPEI12 + 
    SPEI12_y1 + Ecosystem_type + LAI_high + LAI_low + Veg_type_hi + 
    Veg_type_low + Soil_type_name + Tair:SPEI12_y1 + Tsoil_lev1:SPEI12 + 
    Tsoil_lev1:SPEI12_y1 + Tsoil_lev2:SPEI12 + Tsoil_lev2:SPEI12_y1 + 
    VWC_lev1:SPEI12 + VWC_lev1:SPEI12_y1 + I(VWC_lev1^2):SPEI12 + 
    I(VWC_lev1^2):SPEI12_y1 + VWC_lev2:SPEI12 + VWC_lev2:SPEI12_y1 + 
    I(VWC_lev2^2):SPEI12 + I(VWC_lev2^2):SPEI12_y1 + LAI_high:LAI_low, 
    data = dat_test_spei)

Residuals:
     Min       1Q   Median       3Q      Max 
-28.1490  -4.3946  -0.3988   4.3654  31.1586 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      8.90489    2.60665   3.416 0.000643 ***
Tair                            -0.03678    0.10846  -0.339 0.734562    
Tsoil_lev1                      -6.81589    2.15505  -3.163 0.001578 ** 
Tsoil_lev2                       7.26929    2.15717   3.370 0.000761 ***
VWC_lev1                        64.98001   28.88067   2.250 0.024522 *  
I(VWC_lev1^2)                   -7.31571   63.85980  -0.115 0.908802    
VWC_lev2                       -40.28533   32.47372  -1.241 0.214865    
I(VWC_lev2^2)                   -1.50635   66.24597  -0.023 0.981860    
SPEI12                           3.14251    1.95940   1.604 0.108856    
SPEI12_y1                        1.90560    2.15729   0.883 0.377126    
Ecosystem_typeForest             5.77271    1.44790   3.987 6.85e-05 ***
Ecosystem_typeGrassland          7.39496    1.43130   5.167 2.53e-07 ***
Ecosystem_typeOther              6.73859    1.94566   3.463 0.000541 ***
Ecosystem_typeSavanna            4.58197    2.07907   2.204 0.027607 *  
Ecosystem_typeShrubland          2.22037    1.50387   1.476 0.139928    
Ecosystem_typeWetland            0.19021    1.55330   0.122 0.902548    
LAI_high                         0.67803    0.20998   3.229 0.001255 ** 
LAI_low                          1.06722    0.37764   2.826 0.004743 ** 
Veg_type_hi                     -0.07967    0.01803  -4.419 1.03e-05 ***
Veg_type_low                    -0.10163    0.03218  -3.158 0.001603 ** 
Soil_type_nameReserved           3.45716    1.01135   3.418 0.000638 ***
Soil_type_nameSand               1.71939    0.53954   3.187 0.001453 ** 
Soil_type_nameSandy clay loam   -8.24303    1.80150  -4.576 4.93e-06 ***
Soil_type_nameSandy loam         1.51467    0.39259   3.858 0.000117 ***
Soil_type_nameSilt clay loam    -6.38633    2.23660  -2.855 0.004327 ** 
Soil_type_nameSilt loam         -2.20622    0.47510  -4.644 3.56e-06 ***
Tair:SPEI12_y1                   0.26756    0.12522   2.137 0.032698 *  
Tsoil_lev1:SPEI12                6.73741    3.02653   2.226 0.026078 *  
Tsoil_lev1:SPEI12_y1            -9.60122    3.19177  -3.008 0.002650 ** 
Tsoil_lev2:SPEI12               -6.67791    3.01507  -2.215 0.026843 *  
Tsoil_lev2:SPEI12_y1             9.27559    3.18800   2.910 0.003645 ** 
VWC_lev1:SPEI12                143.03279   38.15720   3.749 0.000181 ***
VWC_lev1:SPEI12_y1            -113.68765   36.31005  -3.131 0.001758 ** 
I(VWC_lev1^2):SPEI12          -202.14885   76.76836  -2.633 0.008499 ** 
I(VWC_lev1^2):SPEI12_y1        179.62292   75.45999   2.380 0.017355 *  
VWC_lev2:SPEI12               -168.93925   42.00558  -4.022 5.91e-05 ***
VWC_lev2:SPEI12_y1             111.34668   39.97510   2.785 0.005378 ** 
I(VWC_lev2^2):SPEI12           233.35001   80.92731   2.883 0.003960 ** 
I(VWC_lev2^2):SPEI12_y1       -175.07581   78.39066  -2.233 0.025595 *  
LAI_high:LAI_low                -0.21775    0.09359  -2.327 0.020047 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.381 on 3118 degrees of freedom
Multiple R-squared:  0.2841,    Adjusted R-squared:  0.2751 
F-statistic: 31.72 on 39 and 3118 DF,  p-value: < 2.2e-16
Code
message("Model *with* previous-year SPEI data: training R2 = ", lm_r2(m_Rs_spei_y1_reduced))
Model *with* previous-year SPEI data: training R2 = 0.2751
Code
message("Model *without* previous-year SPEI data: training R2 = ", lm_r2(m_Rs_spei_reduced))
Model *without* previous-year SPEI data: training R2 = 0.2722
Code
message("ANOVA model comparison p-value: ", 
        anova_p_twomodels(m_Rs_spei_y1_reduced, m_Rs_spei_reduced))
ANOVA model comparison p-value: 0.0127

BUT there’s a problem with the above. Fit a model with random previous-year SPEI data:

Code
dat_test_spei %>% 
    mutate(SPEI12_y1 = runif(n())) ->
    dat_test_spei_random

m_spei_random1 <- lm(formula(m_Rs_spei_y1), data = dat_test_spei_random)
m_spei_random1_reduced <- MASS::stepAIC(m_spei_random1, trace = FALSE)
summary(m_spei_random1_reduced)

Call:
lm(formula = sqrt_Rs_annual ~ Tsoil_lev1 + Tsoil_lev2 + VWC_lev1 + 
    I(VWC_lev1^2) + VWC_lev2 + I(VWC_lev2^2) + SPEI12 + SPEI12_y1 + 
    Ecosystem_type + LAI_high + LAI_low + Veg_type_hi + Veg_type_low + 
    Soil_type_name + Tsoil_lev1:SPEI12 + Tsoil_lev1:SPEI12_y1 + 
    VWC_lev1:SPEI12 + I(VWC_lev1^2):SPEI12 + VWC_lev2:SPEI12 + 
    I(VWC_lev2^2):SPEI12 + LAI_high:LAI_low, data = dat_test_spei_random)

Residuals:
     Min       1Q   Median       3Q      Max 
-27.8913  -4.4316  -0.3219   4.3899  30.0235 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      6.50654    2.49099   2.612 0.009044 ** 
Tsoil_lev1                      -5.75669    2.09344  -2.750 0.005996 ** 
Tsoil_lev2                       6.26037    2.08371   3.004 0.002682 ** 
VWC_lev1                        77.33641   26.87068   2.878 0.004028 ** 
I(VWC_lev1^2)                  -24.08819   58.01004  -0.415 0.677993    
VWC_lev2                       -43.65878   29.43819  -1.483 0.138158    
I(VWC_lev2^2)                    3.07804   58.46171   0.053 0.958014    
SPEI12                           4.60167    1.63702   2.811 0.004969 ** 
SPEI12_y1                        1.30761    0.93441   1.399 0.161794    
Ecosystem_typeForest             5.57853    1.42776   3.907 9.54e-05 ***
Ecosystem_typeGrassland          7.15632    1.41072   5.073 4.15e-07 ***
Ecosystem_typeOther              6.68406    1.92656   3.469 0.000529 ***
Ecosystem_typeSavanna            4.46583    2.06363   2.164 0.030534 *  
Ecosystem_typeShrubland          2.08820    1.48571   1.406 0.159966    
Ecosystem_typeWetland            0.01770    1.53498   0.012 0.990800    
LAI_high                         0.73094    0.20750   3.523 0.000433 ***
LAI_low                          1.14092    0.37341   3.055 0.002267 ** 
Veg_type_hi                     -0.08131    0.01794  -4.532 6.06e-06 ***
Veg_type_low                    -0.09437    0.03178  -2.969 0.003009 ** 
Soil_type_nameReserved           3.44393    1.00304   3.433 0.000604 ***
Soil_type_nameSand               1.84724    0.53447   3.456 0.000555 ***
Soil_type_nameSandy clay loam   -7.62233    1.76422  -4.320 1.61e-05 ***
Soil_type_nameSandy loam         1.47272    0.38779   3.798 0.000149 ***
Soil_type_nameSilt clay loam    -7.20182    2.22425  -3.238 0.001217 ** 
Soil_type_nameSilt loam         -2.33269    0.46584  -5.007 5.82e-07 ***
Tsoil_lev1:SPEI12                0.03794    0.02572   1.475 0.140277    
Tsoil_lev1:SPEI12_y1            -0.15909    0.06451  -2.466 0.013717 *  
VWC_lev1:SPEI12                 85.60554   33.51613   2.554 0.010691 *  
I(VWC_lev1^2):SPEI12          -107.99634   68.00534  -1.588 0.112375    
VWC_lev2:SPEI12               -115.46663   37.87696  -3.048 0.002319 ** 
I(VWC_lev2^2):SPEI12           144.12502   72.69311   1.983 0.047494 *  
LAI_high:LAI_low                -0.23217    0.09268  -2.505 0.012296 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.388 on 3126 degrees of freedom
Multiple R-squared:  0.2808,    Adjusted R-squared:  0.2737 
F-statistic: 39.38 on 31 and 3126 DF,  p-value: < 2.2e-16

…and some of the SPEI_yr1 terms might still be significant, just by chance!

Use an ANOVA to compare the (reduced) models with and without previous-year SPEI:

Code
m_spei_random2 <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + Tsoil_lev2 +
            VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 ^ 2 + I(VWC_lev2 ^ 2)) * 
               (SPEI12) +
            Ecosystem_type + LAI_high * LAI_low + 
            Veg_type_hi + Veg_type_low + Soil_type_name,
        data = dat_test_spei_random)
m_spei_random2_reduced <- MASS::stepAIC(m_spei_random2, trace = FALSE)

anova(m_spei_random1_reduced, m_spei_random2_reduced)
Analysis of Variance Table

Model 1: sqrt_Rs_annual ~ Tsoil_lev1 + Tsoil_lev2 + VWC_lev1 + I(VWC_lev1^2) + 
    VWC_lev2 + I(VWC_lev2^2) + SPEI12 + SPEI12_y1 + Ecosystem_type + 
    LAI_high + LAI_low + Veg_type_hi + Veg_type_low + Soil_type_name + 
    Tsoil_lev1:SPEI12 + Tsoil_lev1:SPEI12_y1 + VWC_lev1:SPEI12 + 
    I(VWC_lev1^2):SPEI12 + VWC_lev2:SPEI12 + I(VWC_lev2^2):SPEI12 + 
    LAI_high:LAI_low
Model 2: sqrt_Rs_annual ~ Tsoil_lev1 + Tsoil_lev2 + VWC_lev1 + I(VWC_lev1^2) + 
    VWC_lev2 + I(VWC_lev2^2) + SPEI12 + Ecosystem_type + LAI_high + 
    LAI_low + Veg_type_hi + Veg_type_low + Soil_type_name + Tsoil_lev1:SPEI12 + 
    VWC_lev1:SPEI12 + I(VWC_lev1^2):SPEI12 + VWC_lev2:SPEI12 + 
    I(VWC_lev2^2):SPEI12 + LAI_high:LAI_low
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1   3126 170645                              
2   3128 171103 -2   -458.06 4.1956 0.01515 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions:

  • SPEI_y1 impact on Random Forest model is…unclear. No performance improvement.

  • SPEI_y1 has significant interactions with many variables in the linear model, and model R2 increases slightly.

  • Need to be careful to include null model testing.

3. Value of previous-year SPEI after drought

For post-drought (PBE TRUE) observations, is SPEI_yr1 significant?

For this we compute a “potential Birch Effect” flag PBE: TRUE when (i) the current year has SPEI >=0 (non-drought) and (ii) the previous year had SPEI <= 1 (drought).

Code
dat_test_spei %>% 
    mutate(PBE = SPEI12 >= 0 & SPEI12_y1 <= -1) %>% 
    filter(PBE) ->
    dat_test_pbe

m_pbe <- lm(formula = formula(m_Rs_spei_y1), data = dat_test_pbe)
m_pbe_reduced <- MASS::stepAIC(m_pbe, trace = FALSE)
summary(m_pbe_reduced)

Null test: is previous-year SPEI significant if conditions were similar, i.e. no drought the previous year?

Code
dat_test_spei %>% 
    filter(SPEI12 > 0, SPEI12 < 0.5, SPEI12_y1 > 0, SPEI12_y1 < 0.5) ->
    dat_test_no_spei_change

m_no_change <- lm(formula = formula(m_Rs_spei_y1), data = dat_test_no_spei_change)
m_no_change_reduced <- MASS::stepAIC(m_no_change, trace = FALSE)
summary(m_no_change_reduced)

Null test: random SPEI_yr1.

Code
dat_test_spei %>% 
    mutate(SPEI12_y1 = runif(n())) ->
    dat_test_random_speiy1

m_random <- lm(formula = formula(m_Rs_spei_y1), data = dat_test_random_speiy1)
m_random_reduced <- MASS::stepAIC(m_random, trace = FALSE)
summary(m_random_reduced)

Call:
lm(formula = sqrt_Rs_annual ~ Tsoil_lev1 + Tsoil_lev2 + VWC_lev1 + 
    I(VWC_lev1^2) + VWC_lev2 + I(VWC_lev2^2) + SPEI12 + SPEI12_y1 + 
    Ecosystem_type + LAI_high + LAI_low + Veg_type_hi + Veg_type_low + 
    Soil_type_name + Tsoil_lev1:SPEI12 + Tsoil_lev1:SPEI12_y1 + 
    Tsoil_lev2:SPEI12_y1 + VWC_lev1:SPEI12 + I(VWC_lev1^2):SPEI12 + 
    I(VWC_lev1^2):SPEI12_y1 + VWC_lev2:SPEI12 + I(VWC_lev2^2):SPEI12 + 
    LAI_high:LAI_low, data = dat_test_random_speiy1)

Residuals:
     Min       1Q   Median       3Q      Max 
-28.0365  -4.3908  -0.3336   4.3581  30.7994 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      6.53118    2.56177   2.549 0.010836 *  
Tsoil_lev1                      -0.33898    3.48592  -0.097 0.922541    
Tsoil_lev2                       0.80255    3.46880   0.231 0.817049    
VWC_lev1                        72.33870   26.92071   2.687 0.007246 ** 
I(VWC_lev1^2)                  -23.12181   58.24722  -0.397 0.691424    
VWC_lev2                       -39.91519   29.47085  -1.354 0.175708    
I(VWC_lev2^2)                   -2.42269   58.52607  -0.041 0.966984    
SPEI12                           4.93215    1.64890   2.991 0.002801 ** 
SPEI12_y1                        0.87495    1.60490   0.545 0.585672    
Ecosystem_typeForest             6.06739    1.43889   4.217 2.55e-05 ***
Ecosystem_typeGrassland          7.68954    1.42519   5.395 7.35e-08 ***
Ecosystem_typeOther              7.03500    1.93298   3.639 0.000278 ***
Ecosystem_typeSavanna            5.10868    2.07623   2.461 0.013926 *  
Ecosystem_typeShrubland          2.57695    1.49694   1.721 0.085263 .  
Ecosystem_typeWetland            0.44849    1.54564   0.290 0.771710    
LAI_high                         0.74425    0.20736   3.589 0.000337 ***
LAI_low                          1.16990    0.37316   3.135 0.001734 ** 
Veg_type_hi                     -0.08087    0.01796  -4.502 6.97e-06 ***
Veg_type_low                    -0.09478    0.03180  -2.981 0.002897 ** 
Soil_type_nameReserved           3.32859    1.00403   3.315 0.000926 ***
Soil_type_nameSand               1.76625    0.53402   3.307 0.000952 ***
Soil_type_nameSandy clay loam   -7.49782    1.76622  -4.245 2.25e-05 ***
Soil_type_nameSandy loam         1.49391    0.38792   3.851 0.000120 ***
Soil_type_nameSilt clay loam    -6.43160    2.21158  -2.908 0.003661 ** 
Soil_type_nameSilt loam         -2.27781    0.46571  -4.891 1.05e-06 ***
Tsoil_lev1:SPEI12                0.03797    0.02573   1.476 0.140150    
Tsoil_lev1:SPEI12_y1           -11.22511    5.54943  -2.023 0.043184 *  
Tsoil_lev2:SPEI12_y1            11.12999    5.51752   2.017 0.043759 *  
VWC_lev1:SPEI12                 90.28915   33.60249   2.687 0.007248 ** 
I(VWC_lev1^2):SPEI12          -113.23488   68.09787  -1.663 0.096448 .  
I(VWC_lev1^2):SPEI12_y1         11.87780    7.24126   1.640 0.101045    
VWC_lev2:SPEI12               -122.32768   38.04091  -3.216 0.001315 ** 
I(VWC_lev2^2):SPEI12           152.71021   72.88325   2.095 0.036227 *  
LAI_high:LAI_low                -0.23901    0.09265  -2.580 0.009932 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.39 on 3124 degrees of freedom
Multiple R-squared:  0.281, Adjusted R-squared:  0.2734 
F-statistic:    37 on 33 and 3124 DF,  p-value: < 2.2e-16

4. Testing for a Birch effect

Linear model: PBE importance and interactions

Code
dat_filtered %>% 
    mutate(PBE = SPEI12_y1 <= -1 & SPEI12 > 0,
           Ecosystem_type_number = as.numeric(Ecosystem_type)) %>% 
    select(sqrt_Rs_annual, 
           Ecosystem_type, Ecosystem_type_number, 
           SPEI12, PBE, 
           Tair, Tsoil_lev1, Tsoil_lev2,
           Precip, VWC_lev1, VWC_lev2, 
           Veg_type_hi, Veg_type_low, LAI_high, LAI_low,
           Soil_type_number) %>% 
    filter(complete.cases(.)) ->
    dat_linear

message("Fitting Rs_annual linear model with PBE (SPEI12, -1.0) effect...")
m_Rs <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + Tsoil_lev2 +
            VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 ^ 2 + I(VWC_lev2 ^ 2)) * PBE +
            Ecosystem_type_number + LAI_high * LAI_low + 
            Veg_type_hi + Veg_type_low + Soil_type_number,
        data = dat_linear)
m_Rs_reduced <- MASS::stepAIC(m_Rs, trace = FALSE)
summary(m_Rs_reduced)
par(mfrow = c(2, 2))
plot(m_Rs_reduced)

car::Anova(m_Rs_reduced, type = "III")

message("Fitting Rs_annual linear model without PBE effect...")
m_Rs_no_PBE <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + Tsoil_lev2 +
            VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 ^ 2 + I(VWC_lev2 ^ 2)) +
            Ecosystem_type_number + LAI_high * LAI_low + 
                Veg_type_hi + Veg_type_low + Soil_type_number,
        data = dat_linear)

tibble(model_with_PBE = predict(m_Rs),
       model_without_PBE = predict(m_Rs_no_PBE),
       PBE = dat_linear$PBE, 
       Ecosystem_type = dat_linear$Ecosystem_type) %>% 
    ggplot(aes(model_without_PBE, model_with_PBE)) + 
    geom_point(aes(color = PBE)) + 
    geom_abline() + 
    facet_wrap(~Ecosystem_type) +
    ggtitle("Linear model: effect of including PBE on sqrt_Rs_annual")

# Extract the significant interactions between PBE and other variables 
message("Interactive PBE effects:")
tidy(m_Rs_reduced) %>% 
    mutate(model = "Rs_annual") %>% 
    filter(grepl("PBE", term)) %>% 
    separate(term, into = c("interaction", "PBE"), fill = "left", sep = ":") %>% 
    replace_na(list(interaction = "(none)")) %>% 
    select(model, interaction, estimate, p.value) %>% 
    mutate(estimate = round(estimate, 4), p.value = round(p.value, 4)) ->
    PBE_interactions

DT::datatable(PBE_interactions)

# Summary table of significant effect directions
PBE_interactions %>% 
    mutate(signif = p.value < 0.05, 
           effect = case_when(signif & estimate < 0 ~ "Negative",
                              signif & estimate >= 0 ~ "Positive",
                              .default = "")) %>% 
    select(model, interaction, effect) %>%
    pivot_wider(values_from = effect, names_from = model, values_fill = "") %>% 
    knitr::kable(caption = "PBE: significant effect directions")

Conclusions:

  • PBE (potential Birch effect, i.e. having a strong drought the year before a well-watered year) has a strong statistical effect on the temperature and moisture drivers of soil respiration.

  • But, it’s inconsistent—there’s no clean ‘Birch effect.’

PBE sensitivity test

Code
fit_and_test_lm <- function(x_train, x_val) {
    f <- as.formula(paste(colnames(x_train)[1], "~ (Tair + Tsoil_lev1 + Tsoil_lev2 +",
                          "VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 ^ 2 + I(VWC_lev2 ^ 2))",
                          "* PBE + Ecosystem_type + LAI_high * LAI_low +",
                          "Veg_type_hi + Veg_type_low + Soil_type_name"))

    m <- lm(formula = f, data = x_train)
    m_reduced <- MASS::stepAIC(m, trace = FALSE)
    
    if(!is.null(x_val)) {
        preds <- predict(m_reduced, newdata = x_val)
        obs <- pull(x_train[1])
        validation_r2 <- rs(preds = preds, obs = obs)
    } else {
        validation_r2 <- NA_real_
    }

    list(model = m_reduced,
         training_r2 = glance(m_reduced)$r.squared,
         validation_r2 = validation_r2,
         importance = tidy(m_reduced))
}

# These are the three most common fluxes recorded in SRDB
#iv_set <- c("Rs_annual")#, "Rh_annual", "Rs_growingseason")
iv_set <- c("sqrt_Rs_annual", "sqrt_Rh_annual", "sqrt_Rs_growingseason")

# The following SPEI values are from Table 1 of Keune et al. (2025)
# https://www.nature.com/articles/s41597-025-04896-y
# and correspond to "severely dry", "moderately dry", and "mildly dry"
# (there are no "extremely dry"=-2 in the dataset)
spei_cutoffs <- c(-1.5, -1.0, 0.0)

# What SPEI window should be used for the cutoffs above?
spei_windows <- c(12, 24) # months

# Model independent variables
predictors <- c("Ecosystem_type",
                "SPEI12",
                "Tair", "Precip",
                "Tsoil_lev1", "Tsoil_lev2", "VWC_lev1", "VWC_lev2", 
                "Veg_type_hi", "Veg_type_low", 
                "LAI_high", "LAI_low", "Soil_type_name")

do_full_analysis <- function(x, iv_set, predictors, spei_windows, spei_cutoffs) {
    rf_results <- list()
    lm_results <- list()
    SPEI_cols <- colnames(x)[grep("^SPEI", colnames(x))]
    for(iv in iv_set) {
        
        for(w in spei_windows) {
            spei_past_col <- paste0("SPEI", w, "_y1")
            
            for(spei_cut in spei_cutoffs) {
                # Remove other potential independent variables
                this_x <- x[c(iv, union(predictors, SPEI_cols))]
                # Complete cases only
                this_x <- this_x[complete.cases(this_x),]
                
                message("---------------------------------------------")
                message("iv = ", iv)
                message("spei_cut = ", spei_cut, " window = ", w)
                #message("\tn = ", nrow(this_x))
                
                # identify observations not currently in a drought but that
                # WERE in a drought the previous year
                # PBE = potential Birch effect
                this_x$PBE = this_x$SPEI12 > 0 & this_x[spei_past_col] <= spei_cut
                #message("\tPBE = ", sum(this_x$PBE))
                
                # Drop non-predictors entirely now
                this_x <- this_x[c(iv, predictors, "PBE")]
                
                # Fit models and save results
                rf_out <- fit_and_test_rf(this_x, NULL)
                rf_results[[paste(iv, w, spei_cut)]] <- 
                    tibble(model = "rf",
                           spei_window = w,
                           spei_cut = spei_cut,
                           iv = iv,
                           predictor = names(rf_out$importance),
                           n = nrow(this_x),
                           pbe_n = sum(this_x$PBE),
                           importance = rf_out$importance,
                           training_r2 = rf_out$training_r2)
                
                lm_out <- fit_and_test_lm(this_x, NULL)
                lm_results[[paste(iv, w, spei_cut)]] <- 
                    tibble(model = "lm",
                           spei_window = w,
                           spei_cut = spei_cut,
                           iv = iv,
                           predictor = lm_out$importance$term,
                           importance = lm_out$importance$p.value,
                           n = nrow(this_x),
                           pbe_n = sum(this_x$PBE),
                           training_r2 = lm_out$training_r2)
 
            }
        }
    }
    bind_rows(bind_rows(rf_results), bind_rows(lm_results))
}

out <- do_full_analysis(dat_filtered,
                        iv_set, predictors,
                        spei_windows = c(12,24),
                        spei_cutoffs = spei_cutoffs)

# Compute means across the SPEI windows, drought definitions, and i.v.'s
out %>% 
    group_by(model, spei_window, spei_cut, iv, n, pbe_n) %>% 
    summarise(r2 = round(mean(training_r2), 3), .groups = "drop") %>% 
    mutate(spei_cut = as.factor(spei_cut),
           spei_window = paste(spei_window, "months")) ->
    out_smry

# Table: number (and % of total) of TRUE PBE entries; how big are the datasets?
out_smry %>%
    # these numbers are the same for each model type; pick one
    filter(model == "rf") %>% 
    mutate(pbe_n = paste0(pbe_n, " (", round(pbe_n / n * 100, 0), "%)")) %>% 
    select(spei_window, spei_cut, iv, pbe_n) %>% 
    pivot_wider(names_from = "iv", values_from = "pbe_n") %>% 
    knitr::kable(caption = "How often do 'PBE events' occur?", align = "r")

Conclusions:

  • PBE events (low SPEI the previous year, followed by non-drought conditions) occur in 0-20% of the data, depending on criteria.

Random Forest sensitivity results

Code
# Table: Training R2 values
out_smry %>%
    filter(model == "rf") %>% 
    select(spei_window, spei_cut, iv, r2) %>% 
    pivot_wider(names_from = "iv", values_from = "r2") %>% 
    knitr::kable(caption = "RF training (OOB) R2")

# Table: rank of PBE variable
out %>% 
    filter(model == "rf") %>% 
    group_by(spei_window, spei_cut, iv) %>% 
    mutate(rank = min_rank(desc(importance))) ->
    out_ranked

out_ranked %>% 
    filter(predictor == "PBE") %>% 
    select(spei_window, spei_cut, iv, rank) %>% 
    pivot_wider(names_from = "iv", values_from = "rank") %>%
    knitr::kable(caption = "Rank of PBE variable in RF model")

out_ranked %>% 
    group_by(iv, predictor) %>% summarise(rank = mean(rank)) %>% 
    ggplot(aes(iv, predictor, fill = rank)) + 
    geom_tile() + 
    xlab("Independent variable") +
    ggtitle("Predictor ranks in RF model")

Conclusions:

  • The values used for SPEI window (12/24 months) and drought level (0, -1, -1.5) don’t make much difference.

  • RF model explains 52-65% of OOB variability

  • Precip and Tair are uniformly important; LAI and soil temp/VWC medium importance; PBE (drought effect from previous year) not important

Linear model sensitivity results

Code
# Table: Training R2 values
out_smry %>%
    filter(model == "lm") %>% 
    select(spei_window, spei_cut, iv, r2) %>% 
    pivot_wider(names_from = "iv", values_from = "r2") %>% 
    knitr::kable(caption = "lm training adjusted R2")

# Rank of significant variables
# We're using the p-value for a crude approach
out %>% 
    filter(model == "lm", importance < 0.05, grepl("PBE", predictor)) %>% 
    group_by(iv, predictor) %>% 
    mutate(n_signif = n()) %>% 
    ggplot(aes(iv, predictor, fill = n_signif)) + 
    geom_tile() + 
    xlab("Independent variable") +
    ggtitle("PBE significance counts in linear model")

message("PBE is significant by itself in a few models:")
out %>% 
    filter(predictor=="PBETRUE", model=="lm", importance < 0.05) %>% 
    select(-model, -training_r2)

Conclusions:

  • The values used for SPEI window (12/24 months) and drought level (0, -1, -1.5) don’t make much difference.

  • The linear model explains 16-27% of training data variability.

  • PBE significantly interacts with Tair (esp. in predicting Rs_annual), Tsoil, and VWC.

Conditional inference Random Forest

Recall the default splitting rule during random forests tree building consists of selecting, out of all splits of the (randomly selected mtry) candidate variables, the split that minimizes the Gini impurity (in the case of classification) and the SSE (in case of regression). However, Strobl et al. (2007) illustrated that these default splitting rules favor the selection of features with many possible splits (e.g., continuous variables or categorical variables with many categories) over variables with fewer splits (the extreme case being binary variables, which have only one possible split). Conditional inference trees (Hothorn, Hornik, and Zeileis 2006) implement an alternative splitting mechanism that helps to reduce this variable selection bias. However, ensembling conditional inference trees has yet to be proven superior with regards to predictive accuracy and they take a lot longer to train.

From https://bradleyboehmke.github.io/HOML/random-forest.html.

Our PBE flag is a binary variable, so is its importance being under-valued by ranger? Test a conditional inference tree-based forest:

Code
set.seed(123)
training <- sample(nrow(test), 0.9 * nrow(test), replace = FALSE)
m_ci <- cforest(sqrt_Rs_annual ~ ., data = test[training,])
message("Variable importance of CI forest:")
sort(varimp(m_ci), decreasing = TRUE)

obs <- test[-training,]$sqrt_Rs_annual
preds <- predict(m_ci, newdata = test[-training,])
message("R2 against validation data = ", round(r2(preds = preds, obs = obs), 3))

Conclusion: no evidence that conditional inference random forest produces different results.

Summary

Summary

Reproducibility

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] spatialRF_1.1.5   geodist_0.1.1     xgboost_3.1.2.1   party_1.3-18     
 [5] strucchange_1.5-4 sandwich_3.1-1    zoo_1.8-14        modeltools_0.2-24
 [9] mvtnorm_1.3-3     car_3.1-3         carData_3.0-5     broom_1.0.10     
[13] vivid_0.3.0       ranger_0.17.0     arrow_21.0.0.1    lubridate_1.9.4  
[17] readr_2.1.5       ggplot2_4.0.1     tidyr_1.3.1       dplyr_1.1.4      

loaded via a namespace (and not attached):
 [1] gridExtra_2.3      rlang_1.1.6        magrittr_2.0.4     multcomp_1.4-29   
 [5] matrixStats_1.5.0  compiler_4.5.2     mgcv_1.9-3         vctrs_0.6.5       
 [9] stringr_1.6.0      pkgconfig_2.0.3    crayon_1.5.3       fastmap_1.2.0     
[13] backports_1.5.0    labeling_0.4.3     utf8_1.2.6         ca_0.71.1         
[17] promises_1.3.3     DendSer_1.0.2      rmarkdown_2.29     tzdb_0.5.0        
[21] purrr_1.2.0        bit_4.6.0          xfun_0.52          jsonlite_2.0.0    
[25] later_1.4.2        parallel_4.5.2     cluster_2.1.8.1    R6_2.6.1          
[29] stringi_1.8.7      coin_1.4-3         RColorBrewer_1.1-3 GGally_2.4.0      
[33] Rcpp_1.1.0         assertthat_0.2.1   iterators_1.0.14   knitr_1.50        
[37] flashlight_1.0.0   httpuv_1.6.16      Matrix_1.7-4       splines_4.5.2     
[41] igraph_2.1.4       timechange_0.3.0   tidyselect_1.2.1   rstudioapi_0.17.1 
[45] abind_1.4-8        yaml_2.3.10        viridis_0.6.5      TSP_1.2.6         
[49] doParallel_1.0.17  codetools_0.2-20   lattice_0.22-7     tibble_3.3.0      
[53] shiny_1.11.1       withr_3.0.2        S7_0.2.1           evaluate_1.0.3    
[57] survival_3.8-3     ggstats_0.11.0     huxtable_5.8.0     pillar_1.11.1     
[61] foreach_1.5.2      generics_0.1.4     vroom_1.6.5        hms_1.1.3         
[65] scales_1.4.0       xtable_1.8-4       glue_1.8.0         tools_4.5.2       
[69] ggnewscale_0.5.2   data.table_1.17.2  registry_0.5-1     seriation_1.5.8   
[73] libcoin_1.0-10     colorspace_2.1-2   nlme_3.1-168       patchwork_1.3.2   
[77] Formula_1.2-5      cli_3.6.5          gclus_1.3.3        fansi_1.0.7       
[81] viridisLite_0.4.2  gtable_0.3.6       condvis2_0.1.2     digest_0.6.39     
[85] TH.data_1.1-5      htmlwidgets_1.6.4  farver_2.1.2       htmltools_0.5.9   
[89] lifecycle_1.0.4    mime_0.13          bit64_4.6.0-1      MASS_7.3-65