2-gsbe

Author

bbl

Background and science questions

After the rewetting of dry soils, for example precipitation following drought, a transient ‘Birch effect’ can occur due to he interaction of substrate concentration and soil moisture (Moyano et al. 2013). This can constitute a significant proportion of ecosystem soil CO2 efflux in arid and mesic ecosystems (Birch, 1958; Fierer and Schimel, 2002; Borken and Matzner, 2009).

(short background)

We ask four questions of increasing specificity:

  1. Does SPEI – incorporating the statistical deviations of precipitation and evaporative demand in a single index – add Rs predictive information beyond that of climate, LAI, vegetation, and remotely-derived productivity?

  2. Does the previous year’s SPEI value have Rs predictive value? I.e., is there a memory effect?

  3. Specifically for Rs observations in a non-drought year following one or more drought years, does the previous year’s SPEI have predictive value?

  4. Is there a ‘Birch effect’ in the observational record, in which non-drought Rs is higher than it otherwise would be when preceded by a drought?

0. Prep work and base models

Code
dat <- read_parquet("srdb_joined_data.parquet")

message(nrow(dat), " rows of data")
4484 rows of data
Code
dat %>% 
    filter(is.na(Rs_growingseason) | (Rs_growingseason > 0 & Rs_growingseason < 20)) %>% 
    filter(is.na(Rs_annual) | (Rs_annual > 0 & Rs_annual < 4000)) %>% 
    filter(is.na(Rh_annual) | (Rh_annual > 0 & Rh_annual < 2000)) %>% 
    # Most SRDB records are from forests (~3000), then grasslands (~760), 
    # shrubland (~230), wetland (~210), and desert (~80). We add the next
    # one, savanna (~30), because it might be particularly vulnerable to 
    # a Birch effect, but group everything else into "Other"
    mutate(Ecosystem_type = if_else(Ecosystem_type %in% c("Desert", "Forest", "Grassland",
                                                          "Savanna", "Shrubland", "Wetland"),
                                    true = Ecosystem_type, 
                                    false = "Other",
                                    missing = "Other")) %>% 
    # vivid::pdpVars() and xgb need things as factors not char
    mutate(Ecosystem_type = as.factor(Ecosystem_type),
           Ecosystem_type_ECMWF = as.factor(Ecosystem_type_ECMWF),
           Soil_type_name = as.factor(Soil_type_name)) ->
    dat_filtered

message("After filtering, ", nrow(dat_filtered), " rows of data")
After filtering, 4447 rows of data

Collinearity

Tair, Tsoil_lev1, and Tsoil_lev2 are highly correlated with each other (see 1-gsbe-data-prep), so drop level 2.

VWC_lev1 and VWC_lev2 are highly correlated with each other (see 1-gsbe-data-prep), so drop level 2.

Dependent variable distribution

We have three potential dependent variables:

  • Rs_annual (annual soil respiration)

  • Rh_annual (annual heterotrophic respiration)

  • Rs_growingseason (mean growing season soil respiration)

Their distributions are expected to be non-normal, which is a problem for linear models and not ideal for anything. Evaluate possible transformations.

Code
dat_filtered %>% 
    select(Rs_annual, Rh_annual, Rs_growingseason) %>% 
    pivot_longer(everything()) %>% 
    filter(!is.na(value), value > 0) %>% 
    rename(none = value) %>% 
    mutate(log = log(none), sqrt = sqrt(none)) %>% 
    
    pivot_longer(-name, names_to = "Transformation") %>% 
    ggplot(aes(value, color = Transformation)) + 
    geom_density() +
    facet_wrap(~ name + Transformation, scales = "free")

Code
dat_filtered %>% 
    mutate(sqrt_Rs_annual = sqrt(Rs_annual),
           sqrt_Rh_annual = sqrt(Rh_annual),
           sqrt_Rs_growingseason = sqrt(Rs_growingseason)) ->
    dat_filtered

Conclusion: we have a transformation winner: sqrt()!

Random Forest model: performance

We are using the ranger package.

Wright, M. N. and Ziegler, A.: Ranger: A fast implementation of random forests for high dimensional data in C++ and R, J. Stat. Softw., 77, https://doi.org/10.18637/jss.v077.i01, 2017.

Code
# Fit random forest model and test against validation data
# The independent variable is in the first column
fit_and_test_rf <- function(x_train, x_val) {
    f <- as.formula(paste(colnames(x_train)[1], "~ ."))
    rf <- ranger(formula = f, data = x_train, importance = "impurity")
    # use importance = "permutation" for importance_pvalues()
    training_rmse <- rmse(predict(rf, data = x_train)$predictions, pull(x_train[1]))
        
    if(!is.null(x_val)) {
        val_preds <- predict(rf, data = x_val)$predictions
        val_obs <- pull(x_val[1])
        validation_r2 <- r2(preds = val_preds, obs = val_obs)
        validation_rmse <- rmse(preds = val_preds, obs = val_obs)
    } else {
        validation_r2 <- NA_real_
        validation_rmse <- NA_real_
    }
    
    tibble(training_r2 = rf$r.squared, 
           training_rmse = training_rmse,
           validation_r2 = validation_r2,
           validation_rmse = validation_rmse,
           # PBE_importance isn't used until section 4's sensitivity test
           pbe_importance = rank(importance(rf))["PBE"])
}

# Base test dataset (all three d.v.'s)
dat_filtered %>% 
    select(sqrt_Rs_annual, sqrt_Rh_annual, sqrt_Rs_growingseason,
           # Tair, Tsoil_lev1, and Tsoil_lev2 are *highly* correlated
           # with each other (see 1-gsbe-data-prep), so drop level 2
           Tair, Tsoil_lev1, #Tsoil_lev2,
           # VWC_lev1 and VWC_lev2 are *highly* correlated with each
           # other (see 1-gsbe-data-prep), so drop level 2
           Precip, VWC_lev1, #VWC_lev2, 
           # see the "Which ecosystem type..." section below; here we drop the
           # SRDB "Ecosystem_type" field and use our remapped ECMWF vegetation
           # type instead; see more information in the `1-gsbe-data-prep` script
           Ecosystem_type = Ecosystem_type_ECMWF,
           starts_with("SPEI"),
           MODIS_NPP, MODIS_GPP,
           LAI_high, LAI_low,
           Soil_type_name) ->
    dat_base_all

# Base test dataset without MODIS or SPEI data
dat_base_all %>% 
    select(-MODIS_NPP, -MODIS_GPP, -starts_with("SPEI")) ->
    dat_base_no_modispei

# Base test dataset (Rs_annual only)
dat_base_no_modispei %>% 
    select(-sqrt_Rh_annual, -sqrt_Rs_growingseason) %>% 
    filter(complete.cases(.)) -> 
    dat_base_Rs_annual

message("The test complete-cases dataset has ", dim(dat_base_Rs_annual)[1],
        " rows and ", dim(dat_base_Rs_annual)[2], " columns")
The test complete-cases dataset has 3149 rows and 9 columns
Code
message("Full-data model performance:")
Full-data model performance:
Code
rf_all_data <- ranger(sqrt_Rs_annual ~ ., data = dat_base_Rs_annual, importance = "impurity")
print(rf_all_data)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_base_Rs_annual, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3149 
Number of independent variables:  8 
Mtry:                             2 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       26.5752 
R squared (OOB):                  0.6457764 
Code
message("Random forest predictor importance and Altmann p-values:")
Random forest predictor importance and Altmann p-values:
Code
imps <- sort(rf_all_data$variable.importance, decreasing = TRUE)
importance_pvalues(rf_all_data, "altmann", 
                   num.permutations = 100,
                   formula = sqrt_Rs_annual ~ ., 
                   data = dat_base_Rs_annual)
               importance     pvalue
Tair            35833.742 0.00990099
Tsoil_lev1      33530.962 0.00990099
Precip          38413.330 0.00990099
VWC_lev1        27113.398 0.00990099
Ecosystem_type  16244.304 0.00990099
LAI_high        28311.283 0.00990099
LAI_low         24493.905 0.00990099
Soil_type_name   8324.521 0.00990099
Code
# Top three variables...
pdpVars(data = dat_base_Rs_annual, 
        fit = rf_all_data, 
        response = "sqrt_Rs_annual",
        vars = names(imps)[1:3])

Code
# ...and the next three
pdpVars(data = dat_base_Rs_annual, 
        fit = rf_all_data, 
        response = "sqrt_Rs_annual",
        vars = names(imps)[4:6])

Code
# k-fold cross-validation for all dependent variables
message("k-fold cross-validation:")
k-fold cross-validation:
Code
dvs <- c("sqrt_Rs_annual", "sqrt_Rh_annual", "sqrt_Rs_growingseason")
preds <- setdiff(colnames(dat_base_no_modispei), dvs)
kf_out <- list()
for(dv in dvs) {
    # Construct dataset and do the k-fold
    x <- dat_base_no_modispei[c(dv, preds)]
    x <- x[complete.cases(x),]
    do_k_fold(x, fit_and_test_rf) %>% 
        select(-pbe_importance) %>%  # not used here
        mutate(dv = dv, n = nrow(x)) ->
        kf_out[[dv]]
}
bind_rows(kf_out) %>% 
    group_by(dv) %>% 
    summarise(across(everything(), \(x) mean(x, digits = 3)), .groups = "drop") %>% 
    mutate(k = K_FOLD) -> # just so it's "10" or whatever as expected
    kf_out 
print(kf_out)
# A tibble: 3 × 7
  dv             k training_r2 training_rmse validation_r2 validation_rmse     n
  <chr>      <dbl>       <dbl>         <dbl>         <dbl>           <dbl> <dbl>
1 sqrt_Rh_a…     5       0.478         2.51          0.475           4.57    674
2 sqrt_Rs_a…     5       0.612         3.15          0.617           5.35   3149
3 sqrt_Rs_g…     5       0.547         0.243         0.550           0.419  1377

Side note: model performance is linearly related to dataset size! Huh.

Code
ggplot(kf_out, aes(n, validation_r2)) + 
    geom_point() + 
    geom_smooth(method = lm, formula = y~x) +
    ggtitle("Validation R2 for Rs_annual, Rh_annual, and Rs_growingseason")

Roads not taken

How important are the MODIS data?

The MODIS data only start in 2001 (see gsbe-data-prep output), so requiring them in a model cuts off a lot of early data – on the order of 20% of the dataset. Are they needed?

Code
dat_base_all %>% 
    select(-sqrt_Rh_annual, -sqrt_Rs_growingseason, -starts_with("SPEI")) %>% 
    filter(complete.cases(.)) -> 
    dat_base_test_modis

rf_modis <- ranger(sqrt_Rs_annual ~ ., data = dat_base_test_modis, importance = "impurity")
print(rf_modis)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_base_test_modis, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      2302 
Number of independent variables:  10 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       26.58191 
R squared (OOB):                  0.6301313 
Code
message("Model *with* MODIS data: OOB R2 = ", round(rf_modis$r.squared, 4))
Model *with* MODIS data: OOB R2 = 0.6301
Code
message("Model *without* MODIS data: OOB R2 = ", round(rf_all_data$r.squared, 4))
Model *without* MODIS data: OOB R2 = 0.6458
Code
importance_pvalues(rf_modis, "altmann", 
                   num.permutations = 100,
                   formula = sqrt_Rs_annual ~ ., 
                   data = dat_base_test_modis)
               importance     pvalue
Tair            18408.669 0.00990099
Tsoil_lev1      18200.650 0.00990099
Precip          21869.315 0.00990099
VWC_lev1        13250.520 0.24752475
Ecosystem_type   7219.131 0.00990099
MODIS_NPP       16818.743 0.00990099
MODIS_GPP       22431.263 0.00990099
LAI_high        15302.982 0.00990099
LAI_low         13432.407 0.00990099
Soil_type_name   5193.199 0.00990099

Conclusion: it doesn’t seem like they’re needed.

Which ecosystem type should we use?

Code
# Make datasets with each ecosystem type and find out!
dat_filtered %>%
    select(sqrt_Rs_annual, Tair, Tsoil_lev1, VWC_lev1, SPEI12, 
           Ecosystem_type, LAI_high, LAI_low, Soil_type_name) %>% 
    filter(complete.cases(.)) ->
    dat_srdb_eco
dat_filtered %>%
    select(sqrt_Rs_annual, Tair, Tsoil_lev1, VWC_lev1, SPEI12, 
           Ecosystem_type_ECMWF, LAI_high, LAI_low, Soil_type_name) %>% 
    filter(complete.cases(.)) ->
    dat_ecmwf_eco

# Fit rf with each ecosystem type and compare
rf_eco1 <- ranger(sqrt_Rs_annual ~ ., data = dat_srdb_eco)
rf_eco2 <- ranger(sqrt_Rs_annual ~ ., data = dat_ecmwf_eco)
message("rf model with SRDB ecosystem type: OOB R2 = ", round(rf_eco1$r.squared, 4))
rf model with SRDB ecosystem type: OOB R2 = 0.6403
Code
message("rf model with ECMWF ecosystem type: OOB R2 = ", round(rf_eco2$r.squared, 4))
rf model with ECMWF ecosystem type: OOB R2 = 0.6343
Code
# Fit lm with each ecosystem type and compare
lm_eco1 <- lm(sqrt_Rs_annual ~ Tair + Tsoil_lev1 + 
               VWC_lev1 + I(VWC_lev1 ^ 2) * SPEI12 + 
               Ecosystem_type + 
               LAI_high * LAI_low + 
               Soil_type_name,
           data = dat_srdb_eco)
lm_eco1_reduced <- MASS::stepAIC(lm_eco1, trace = FALSE)
lm_eco2 <- lm(sqrt_Rs_annual ~ Tair + Tsoil_lev1 + 
               VWC_lev1 + I(VWC_lev1 ^ 2) * SPEI12 + 
               Ecosystem_type_ECMWF + 
               LAI_high * LAI_low + 
               Soil_type_name,
           data = dat_ecmwf_eco)
lm_eco2_reduced <- MASS::stepAIC(lm_eco2, trace = FALSE)

message("lm model with SRDB ecosystem type: OOB R2 = ", lm_r2(lm_eco1_reduced))
lm model with SRDB ecosystem type: OOB R2 = 0.2667
Code
message("lm model with ECMWF ecosystem type: OOB R2 = ", lm_r2(lm_eco2_reduced))
lm model with ECMWF ecosystem type: OOB R2 = 0.28

Conclusion: it doesn’t seem to make much difference.

Test spatial Random Forest

Spatial autocorrelation wouldn’t change the fit of the model, but it could bias the variable importance metrics. Evaluate.

Code
# This needs the newest version 1.1.5, not yet on CRAN
# remotes::install_github(repo = "blasbenito/spatialRF",
#   ref = "main", force = TRUE, quiet = TRUE)
stopifnot(packageVersion("spatialRF") >= "1.1.5")

dat_base_no_modispei %>% 
    bind_cols(select(dat_filtered, Longitude, Latitude)) %>% 
    select(-sqrt_Rh_annual, -sqrt_Rs_growingseason) %>% 
    filter(complete.cases(.)) -> 
    dat_spatial_test

# Compute distance between all pairs of points
dm <- geodist(select(dat_spatial_test, Longitude, Latitude), measure = "geodesic")
dat_spatial_test$Longitude <- dat_spatial_test$Latitude <- NULL
predictors <- setdiff(colnames(dat_spatial_test), colnames(dat_spatial_test)[1])
spatialRF::rf(data = dat_spatial_test, 
              dependent.variable.name = colnames(dat_spatial_test)[1],
              predictor.variable.names = predictors,
              distance.matrix = dm)

Model type
  - Fitted with:                     ranger()
  - Response variable:               sqrt_Rs_annual

Random forest parameters
  - Type:                            Regression
  - Number of trees:                 500
  - Sample size:                     3149
  - Number of predictors:            8
  - Mtry:                            2
  - Minimum node size:               5


Model performance 
  - R squared (oob):                  0.6480875
  - R squared (cor(obs, pred)^2):     0.8772381
  - Pseudo R squared (cor(obs, pred)):0.9366099
  - RMSE (oob):                       5.138269
  - RMSE:                             3.1434
  - Normalized RMSE:                  0.3014814

Model residuals 
  - Stats: 
             ┌────────┬────────┬────────┬───────┬────────┬───────┐
             │ Min.   │ 1st Q. │ Median │ Mean  │ 3rd Q. │ Max.  │
             ├────────┼────────┼────────┼───────┼────────┼───────┤
             │ -16.00 │  -1.64 │  -0.08 │ -0.01 │   1.62 │ 19.38 │
             └────────┴────────┴────────┴───────┴────────┴───────┘
  - Normality: 
      - Shapiro-Wilks W: 0.967 
      - p-value        : 0 
      - Interpretation : Residuals are not normal 

  - Spatial autocorrelation: 
             ┌───────────┬───────────┬─────────┬──────────────────┐
             │  Distance │ Moran's I │ P value │ Interpretation   │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │       0.0 │    -0.036 │   0.010 │ Negative spatial │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 3331584.0 │    -0.000 │   0.942 │ No spatial       │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 6663168.0 │     0.000 │   0.562 │ No spatial       │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 9994752.0 │     0.000 │   0.139 │ No spatial       │
             │           │           │         │ correlation      │
             └───────────┴───────────┴─────────┴──────────────────┘

Variable importance: 
                        ┌────────────────┬────────────┐
                        │ Variable       │ Importance │
                        ├────────────────┼────────────┤
                        │ Tair           │      8.221 │
                        ├────────────────┼────────────┤
                        │ Tsoil_lev1     │      7.870 │
                        ├────────────────┼────────────┤
                        │ Precip         │      7.078 │
                        ├────────────────┼────────────┤
                        │ LAI_high       │      6.067 │
                        ├────────────────┼────────────┤
                        │ LAI_low        │      5.638 │
                        ├────────────────┼────────────┤
                        │ VWC_lev1       │      5.367 │
                        ├────────────────┼────────────┤
                        │ Ecosystem_type │      4.699 │
                        ├────────────────┼────────────┤
                        │ Soil_type_name │      4.376 │
                        └────────────────┴────────────┘
Model type
  - Fitted with:                     ranger()
  - Response variable:               sqrt_Rs_annual

Random forest parameters
  - Type:                            Regression
  - Number of trees:                 500
  - Sample size:                     3149
  - Number of predictors:            8
  - Mtry:                            2
  - Minimum node size:               5


Model performance 
  - R squared (oob):                  0.6480875
  - R squared (cor(obs, pred)^2):     0.8772381
  - Pseudo R squared (cor(obs, pred)):0.9366099
  - RMSE (oob):                       5.138269
  - RMSE:                             3.1434
  - Normalized RMSE:                  0.3014814

Model residuals 
  - Stats: 
             ┌────────┬────────┬────────┬───────┬────────┬───────┐
             │ Min.   │ 1st Q. │ Median │ Mean  │ 3rd Q. │ Max.  │
             ├────────┼────────┼────────┼───────┼────────┼───────┤
             │ -16.00 │  -1.64 │  -0.08 │ -0.01 │   1.62 │ 19.38 │
             └────────┴────────┴────────┴───────┴────────┴───────┘
  - Normality: 
      - Shapiro-Wilks W: 0.967 
      - p-value        : 0 
      - Interpretation : Residuals are not normal 

  - Spatial autocorrelation: 
             ┌───────────┬───────────┬─────────┬──────────────────┐
             │  Distance │ Moran's I │ P value │ Interpretation   │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │       0.0 │    -0.036 │   0.010 │ Negative spatial │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 3331584.0 │    -0.000 │   0.942 │ No spatial       │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 6663168.0 │     0.000 │   0.562 │ No spatial       │
             │           │           │         │ correlation      │
             ├───────────┼───────────┼─────────┼──────────────────┤
             │ 9994752.0 │     0.000 │   0.139 │ No spatial       │
             │           │           │         │ correlation      │
             └───────────┴───────────┴─────────┴──────────────────┘

Variable importance: 
                        ┌────────────────┬────────────┐
                        │ Variable       │ Importance │
                        ├────────────────┼────────────┤
                        │ Tair           │      8.221 │
                        ├────────────────┼────────────┤
                        │ Tsoil_lev1     │      7.870 │
                        ├────────────────┼────────────┤
                        │ Precip         │      7.078 │
                        ├────────────────┼────────────┤
                        │ LAI_high       │      6.067 │
                        ├────────────────┼────────────┤
                        │ LAI_low        │      5.638 │
                        ├────────────────┼────────────┤
                        │ VWC_lev1       │      5.367 │
                        ├────────────────┼────────────┤
                        │ Ecosystem_type │      4.699 │
                        ├────────────────┼────────────┤
                        │ Soil_type_name │      4.376 │
                        └────────────────┴────────────┘

Conclusion: no strong evidence of spatial autocorrelation.

Test XGBoost

Is there any reason to think that this popular algorithm would out-perform our Random Forest approach, and/or give different answers?

Code
fit_and_test_xgboost <- function(x_train, x_val) {
    m_xgb <- xgboost(x_train[-1], x_train[1])
    pred_train <- predict(m_xgb, newdata = x_train)
    pred_val <- predict(m_xgb, newdata = x_val)
    tibble(training_r2 = r2(pred_train, pull(x_train[1])),
           validation_r2 = r2(pred_val, pull(x_val[1])))
}

m_xgb <- xgboost(dat_base_Rs_annual[-1], y = dat_base_Rs_annual[1])
message("Variable importance of XGBoost model:")
Variable importance of XGBoost model:
Code
as_tibble(xgb.importance(m_xgb))
# A tibble: 8 × 4
  Feature          Gain  Cover Frequency
  <chr>           <dbl>  <dbl>     <dbl>
1 Precip         0.206  0.147     0.169 
2 Tair           0.184  0.175     0.228 
3 Ecosystem_type 0.145  0.0594    0.0968
4 LAI_low        0.122  0.160     0.108 
5 LAI_high       0.118  0.112     0.103 
6 Tsoil_lev1     0.0864 0.163     0.113 
7 VWC_lev1       0.0846 0.156     0.141 
8 Soil_type_name 0.0545 0.0279    0.0421
Code
message("k-fold cross-validation results:")
k-fold cross-validation results:
Code
do_k_fold(dat_base_Rs_annual, fit_and_test_xgboost) %>% 
    summary()
       k      training_r2     validation_r2   
 Min.   :1   Min.   :0.9098   Min.   :0.5752  
 1st Qu.:2   1st Qu.:0.9106   1st Qu.:0.5915  
 Median :3   Median :0.9115   Median :0.6040  
 Mean   :3   Mean   :0.9123   Mean   :0.6047  
 3rd Qu.:4   3rd Qu.:0.9136   3rd Qu.:0.6080  
 Max.   :5   Max.   :0.9159   Max.   :0.6446  

Seems like xgboost is over-fitting to the training data. Note however that

Unlike random forests, GBMs can have high variability in accuracy dependent on their hyperparameter settings (Probst, Bischl, and Boulesteix 2018) So tuning can require much more strategy than a random forest model.

Conclusion: performance and variable importance seem similar to ranger.

Test neural network

We use the neuralnet package for this. I experimented with the hidden layer topology and threshold settings to find a configuration that reliably converges with these data. Note: fitting this is slow.

Code
# eval:false above b/c this is SO SLOW

library(neuralnet)

fit_and_test_nn <- function(x_train, x_val) {
    # Algorithm is not guaranteed to converge; if this happens, returns NAs
    out <- tibble(training_r2 = NA_real_, validation_r2 = NA_real_)
    try({
        m_nn <- neuralnet(sqrt_Rs_annual ~ ., data = x_train,
                          hidden = c(12, 6), rep = 2, threshold = 0.4)
        pred_train <- predict(m_nn, newdata = x_train)
        pred_val <- predict(m_nn, newdata = x_val)
        out <- tibble(training_r2 = r2(pred_train, pull(x_train[1])),
                      validation_r2 = r2(pred_val, pull(x_val[1])))
    })
    return(out)
}

# Rescale the variables and change factors to numeric
x <- dat_base_Rs_annual
nums <- sapply(x, is.numeric)
for(col in names(nums)[nums]) x[col] <- scale(x[col])
fcts <- sapply(x[-1], is.factor)
for(col in names(fcts)[fcts]) x[col] <- as.numeric(pull(x[col]))
#saveRDS(x, "nn_topology/x.RDS")

# Overall model
m_nn <- neuralnet(sqrt_Rs_annual ~ ., data = x, hidden = c(12, 6), rep = 2, threshold = 0.4)
message("R2 of two-layer model: ", round(r2(predict(m_nn, newdata = x), x$sqrt_Rs_annual), 3))
m_nn <- neuralnet(sqrt_Rs_annual ~ ., data = x, hidden = c(24, 12, 6), rep = 2, threshold = 0.4)
message("R2 of three-layer model: ", round(r2(predict(m_nn, newdata = x), x$sqrt_Rs_annual), 3))

message("k-fold cross-validation results:")
do_k_fold(x, fit_and_test_nn) %>% 
    summary()

With a complex topology (e.g., three hidden layers) the training R2 can be cranked up to 0.9+, but it exhibits severe overfitting.

Conclusion: I’m not an expert here! But, no evidence that this will be a magic bullet.

Linear regression model: performance

Code
# Fit linear model and test against validation data
# The dependent variable is in the first column
fit_and_test_lm <- function(x_train, x_val, m) {
    pred_train <- predict(m, newdata = x_train)
    pred_val <- predict(m, newdata = x_val)
    tibble(training_r2 = r2(pred_train, pull(x_train[1])),
           training_rmse = rmse(pred_train, pull(x_train[1])),
           validation_r2 = r2(pred_val, pull(x_val[1])),
           validation_rmse = rmse(pred_val, pull(x_val[1])))
}

message("Fitting Rs_annual linear model...")
Fitting Rs_annual linear model...
Code
m_Rs <- lm(sqrt_Rs_annual ~ Tair + Tsoil_lev1 + 
               VWC_lev1 + I(VWC_lev1 ^ 2) + 
               Ecosystem_type + 
               LAI_high * LAI_low + 
               Soil_type_name,
           data = dat_base_Rs_annual)
m_Rs_reduced <- MASS::stepAIC(m_Rs, trace = FALSE)
summary(m_Rs_reduced)

Call:
lm(formula = sqrt_Rs_annual ~ Tsoil_lev1 + VWC_lev1 + I(VWC_lev1^2) + 
    Ecosystem_type + LAI_high + LAI_low + Soil_type_name + LAI_high:LAI_low, 
    data = dat_base_Rs_annual)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.916  -4.473  -0.322   4.170  34.344 

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                                3.29055    1.73067   1.901 0.057353
Tsoil_lev1                                 0.39334    0.02590  15.186  < 2e-16
VWC_lev1                                  61.24039    9.46821   6.468 1.15e-10
I(VWC_lev1^2)                            -59.11099   14.43095  -4.096 4.31e-05
Ecosystem_typeDeciduous needleleaf trees  -3.59422    1.30483  -2.755 0.005911
Ecosystem_typeDeciduous shrubs            -1.32495    1.64748  -0.804 0.421325
Ecosystem_typeDesert                      -4.28414    1.41180  -3.035 0.002429
Ecosystem_typeEvergreen broadleaf trees    3.58101    0.84927   4.217 2.55e-05
Ecosystem_typeEvergreen needleleaf trees   3.68384    0.56436   6.527 7.78e-11
Ecosystem_typeEvergreen shrubs            -0.88913    1.68514  -0.528 0.597793
Ecosystem_typeGrass                        3.49954    0.71859   4.870 1.17e-06
Ecosystem_typeInterrupted forest           0.57167    0.51198   1.117 0.264254
Ecosystem_typeMixed forest/woodland        2.35768    0.62505   3.772 0.000165
Ecosystem_typeTundra                       3.48860    2.61382   1.335 0.182079
Ecosystem_typeWetland                     -4.04637    0.75338  -5.371 8.40e-08
LAI_high                                   1.21717    0.21508   5.659 1.66e-08
LAI_low                                    2.23251    0.38131   5.855 5.27e-09
Soil_type_nameReserved                     4.27210    0.99058   4.313 1.66e-05
Soil_type_nameSand                         2.33084    0.50112   4.651 3.44e-06
Soil_type_nameSandy clay loam             -6.98630    1.65029  -4.233 2.37e-05
Soil_type_nameSandy loam                   1.43758    0.38491   3.735 0.000191
Soil_type_nameSilt clay loam              -6.20399    2.22468  -2.789 0.005324
Soil_type_nameSilt loam                   -2.01886    0.46108  -4.379 1.23e-05
LAI_high:LAI_low                          -0.55453    0.09595  -5.780 8.22e-09
                                            
(Intercept)                              .  
Tsoil_lev1                               ***
VWC_lev1                                 ***
I(VWC_lev1^2)                            ***
Ecosystem_typeDeciduous needleleaf trees ** 
Ecosystem_typeDeciduous shrubs              
Ecosystem_typeDesert                     ** 
Ecosystem_typeEvergreen broadleaf trees  ***
Ecosystem_typeEvergreen needleleaf trees ***
Ecosystem_typeEvergreen shrubs              
Ecosystem_typeGrass                      ***
Ecosystem_typeInterrupted forest            
Ecosystem_typeMixed forest/woodland      ***
Ecosystem_typeTundra                        
Ecosystem_typeWetland                    ***
LAI_high                                 ***
LAI_low                                  ***
Soil_type_nameReserved                   ***
Soil_type_nameSand                       ***
Soil_type_nameSandy clay loam            ***
Soil_type_nameSandy loam                 ***
Soil_type_nameSilt clay loam             ** 
Soil_type_nameSilt loam                  ***
LAI_high:LAI_low                         ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.367 on 3125 degrees of freedom
Multiple R-squared:  0.282, Adjusted R-squared:  0.2767 
F-statistic: 53.35 on 23 and 3125 DF,  p-value: < 2.2e-16
Code
par(mfrow = c(2, 2))
plot(m_Rs_reduced)

Code
message("k-fold cross-validation results:")
k-fold cross-validation results:
Code
lm_rhs_formula <- as.character(formula(m_Rs))[3]
kf_out <- list()
for(dv in dvs) {
    # Pull out the rhs of the formula and paste iv (independent variable) in front
    f <- as.formula(paste(dv, "~", lm_rhs_formula))
    # Construct dataset and call lm() once to give us a base model
    x <- dat_base_no_modispei[c(dv, preds)]
    x <- x[complete.cases(x),]
    m <- lm(f, data = x)
    kf_out[[dv]] <- do_k_fold(x, f = fit_and_test_lm, m = m) %>% 
        mutate(dv = dv, n = nrow(x))
}
bind_rows(kf_out) %>% 
    group_by(dv) %>% 
    summarise(across(everything(), mean), .groups = "keep") %>% 
    mutate(across(everything(), \(x) round(x, digits = 3))) %>% 
    ungroup() %>% 
    mutate(k = K_FOLD) # just so it's "10" or whatever as expected
# A tibble: 3 × 7
  dv             k training_r2 training_rmse validation_r2 validation_rmse     n
  <chr>      <dbl>       <dbl>         <dbl>         <dbl>           <dbl> <dbl>
1 sqrt_Rh_a…     5       0.184         5.72          0.184           5.70    674
2 sqrt_Rs_a…     5       0.282         7.34          0.283           7.32   3149
3 sqrt_Rs_g…     5       0.212         0.555         0.208           0.555  1377
Code
# Directly compare lm and rf predictions
tibble(Ecosystem_type = dat_base_Rs_annual$Ecosystem_type,
       lm_prediction = predict(m_Rs_reduced),
       rf_prediction = predict(rf_all_data, data = dat_base_Rs_annual)$predictions) %>% 
    ggplot(aes(lm_prediction, rf_prediction)) + 
    geom_point(alpha = 0.25) + 
    geom_abline() + 
    geom_smooth(method = "lm", formula = y ~ x) +
    ggtitle("Linear model versus Random Forest predictions") + 
    facet_wrap(~Ecosystem_type, scales = "free")

Conclusions:

  • Linear model performance is less than half of Random Forest

  • Variable importance is similar

1. Predictive value of SPEI

Code
dat_base_all %>% 
    select(-sqrt_Rh_annual, -sqrt_Rs_growingseason,
           -MODIS_NPP, -MODIS_GPP,
           -starts_with("SPEI24")) %>% 
    filter(complete.cases(.)) -> 
    dat_test_spei

# remove the "_y1" columns
yr1cols <- grep("_y1$", colnames(dat_test_spei))
dat_test_spei_current <- dat_test_spei[-yr1cols]

message("Testing Random Forest...")
Testing Random Forest...
Code
rf_spei <- ranger(sqrt_Rs_annual ~ ., data = dat_test_spei_current, importance = "impurity")
print(rf_spei)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_test_spei_current, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3149 
Number of independent variables:  9 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       27.23921 
R squared (OOB):                  0.6369256 
Code
importance_pvalues(rf_spei, "altmann", 
                   num.permutations = 100,
                   formula = sqrt_Rs_annual ~ ., 
                   data = dat_test_spei_current)
               importance     pvalue
Tair            34456.338 0.00990099
Tsoil_lev1      31797.262 0.00990099
Precip          38466.133 0.00990099
VWC_lev1        23565.394 0.00990099
Ecosystem_type  15234.931 0.00990099
SPEI12          17739.063 1.00000000
LAI_high        26657.377 0.00990099
LAI_low         22915.313 0.00990099
Soil_type_name   7100.369 0.00990099
Code
message("Model *with* SPEI data: OOB R2 = ", round(rf_spei$r.squared, 4))
Model *with* SPEI data: OOB R2 = 0.6369
Code
message("Model *without* SPEI data: OOB R2 = ", round(rf_all_data$r.squared, 4))
Model *without* SPEI data: OOB R2 = 0.6458
Code
message("Testing linear model...")
Testing linear model...
Code
m_Rs_spei <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + 
                                      VWC_lev1 + I(VWC_lev1 ^ 2)) * 
                    SPEI12 +
                    Ecosystem_type + 
                    LAI_high * LAI_low + 
                    Soil_type_name,
                data = dat_test_spei_current)
m_Rs_spei_reduced <- MASS::stepAIC(m_Rs_spei, trace = FALSE)
car::Anova(m_Rs_spei_reduced, type = "III")
Anova Table (Type III tests)

Response: sqrt_Rs_annual
                     Sum Sq   Df F value    Pr(>F)    
(Intercept)             269    1  4.9891 0.0255776 *  
Tair                     16    1  0.3058 0.5803008    
Tsoil_lev1              447    1  8.2845 0.0040257 ** 
VWC_lev1               1976    1 36.6328 1.596e-09 ***
I(VWC_lev1^2)           730    1 13.5299 0.0002388 ***
SPEI12                   33    1  0.6073 0.4358599    
Ecosystem_type        12139   11 20.4539 < 2.2e-16 ***
LAI_high               1538    1 28.4989 1.005e-07 ***
LAI_low                1788    1 33.1363 9.426e-09 ***
Soil_type_name         5808    6 17.9407 < 2.2e-16 ***
Tair:SPEI12             130    1  2.4101 0.1206562    
Tsoil_lev1:SPEI12       169    1  3.1360 0.0766812 .  
VWC_lev1:SPEI12         197    1  3.6519 0.0560966 .  
I(VWC_lev1^2):SPEI12    168    1  3.1093 0.0779424 .  
LAI_high:LAI_low       1727    1 32.0116 1.671e-08 ***
Residuals            168283 3119                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
message("Model *with* SPEI data: training R2 = ", lm_r2(m_Rs_spei_reduced))
Model *with* SPEI data: training R2 = 0.2808
Code
message("Model *without* SPEI data: training R2 = ", lm_r2(m_Rs_reduced))
Model *without* SPEI data: training R2 = 0.2767
Code
message("ANOVA model comparison p-value: ", 
        anova_p_twomodels(m_Rs_spei_reduced, m_Rs_reduced))
ANOVA model comparison p-value: 5e-04

Conclusions:

  • SPEI impact on Random Forest model is small, but there is a performance improvement.

  • SPEI does have effects in the linear model: significant interactions with many variables, significant difference between models with and without SPEI, and model R2 jumps modestly.

2. Predictive value of previous-year SPEI

Code
# Use the dat_test_spei dataset from above

message("Testing Random Forest...")
Testing Random Forest...
Code
rf_spei_y1 <- ranger(sqrt_Rs_annual ~ ., data = dat_test_spei, importance = "impurity")
print(rf_spei_y1)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_test_spei, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3149 
Number of independent variables:  10 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       27.74289 
R squared (OOB):                  0.6302121 
Code
message("Model *with* previous-year SPEI data: OOB R2 = ", round(rf_spei_y1$r.squared, 4))
Model *with* previous-year SPEI data: OOB R2 = 0.6302
Code
message("Model *without* previous-year SPEI data: OOB R2 = ", round(rf_all_data$r.squared, 4))
Model *without* previous-year SPEI data: OOB R2 = 0.6458
Code
importance_pvalues(rf_spei_y1, "altmann", 
                   num.permutations = 100,
                   formula = sqrt_Rs_annual ~ ., 
                   data = dat_test_spei)
               importance     pvalue
Tair            31129.691 0.00990099
Tsoil_lev1      29804.574 0.00990099
Precip          36168.183 0.00990099
VWC_lev1        22056.429 0.00990099
Ecosystem_type  14850.373 0.00990099
SPEI12          15577.862 1.00000000
SPEI12_y1       15641.912 1.00000000
LAI_high        24866.545 0.00990099
LAI_low         21334.936 0.00990099
Soil_type_name   6798.547 0.00990099
Code
message("Testing linear model...")
Testing linear model...
Code
m_Rs_spei_y1 <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + 
                                         VWC_lev1 + I(VWC_lev1 ^ 2)) * 
                       (SPEI12 + SPEI12_y1) +
                       Ecosystem_type + 
                       LAI_high * LAI_low + 
                       Soil_type_name,
                   data = dat_test_spei)
m_Rs_spei_y1_reduced <- MASS::stepAIC(m_Rs_spei_y1, trace = FALSE)
car::Anova(m_Rs_spei_y1_reduced, type = "III")
Anova Table (Type III tests)

Response: sqrt_Rs_annual
                     Sum Sq   Df F value    Pr(>F)    
(Intercept)             269    1  4.9891 0.0255776 *  
Tair                     16    1  0.3058 0.5803008    
Tsoil_lev1              447    1  8.2845 0.0040257 ** 
VWC_lev1               1976    1 36.6328 1.596e-09 ***
I(VWC_lev1^2)           730    1 13.5299 0.0002388 ***
SPEI12                   33    1  0.6073 0.4358599    
Ecosystem_type        12139   11 20.4539 < 2.2e-16 ***
LAI_high               1538    1 28.4989 1.005e-07 ***
LAI_low                1788    1 33.1363 9.426e-09 ***
Soil_type_name         5808    6 17.9407 < 2.2e-16 ***
Tair:SPEI12             130    1  2.4101 0.1206562    
Tsoil_lev1:SPEI12       169    1  3.1360 0.0766812 .  
VWC_lev1:SPEI12         197    1  3.6519 0.0560966 .  
I(VWC_lev1^2):SPEI12    168    1  3.1093 0.0779424 .  
LAI_high:LAI_low       1727    1 32.0116 1.671e-08 ***
Residuals            168283 3119                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
message("Model *with* previous-year SPEI data: training R2 = ", lm_r2(m_Rs_spei_y1_reduced))
Model *with* previous-year SPEI data: training R2 = 0.2808
Code
message("Model *without* previous-year SPEI data: training R2 = ", lm_r2(m_Rs_spei_reduced))
Model *without* previous-year SPEI data: training R2 = 0.2808
Code
message("ANOVA model comparison p-value: ", 
        anova_p_twomodels(m_Rs_spei_y1_reduced, m_Rs_spei_reduced))
ANOVA model comparison p-value: NA

BUT there’s a problem with the above. Fit a model with random previous-year SPEI data:

Code
dat_test_spei %>% 
    mutate(SPEI12_y1 = runif(n())) ->
    dat_test_spei_random

m_spei_random1 <- lm(formula(m_Rs_spei_y1), data = dat_test_spei_random)
m_spei_random1_reduced <- MASS::stepAIC(m_spei_random1, trace = FALSE)
car::Anova(m_spei_random1_reduced, type = "III")
Anova Table (Type III tests)

Response: sqrt_Rs_annual
                     Sum Sq   Df F value    Pr(>F)    
(Intercept)             353    1  6.5525 0.0105207 *  
Tair                     16    1  0.3060 0.5801554    
Tsoil_lev1              446    1  8.2719 0.0040536 ** 
VWC_lev1               1968    1 36.5202 1.690e-09 ***
I(VWC_lev1^2)           720    1 13.3577 0.0002616 ***
SPEI12                   33    1  0.6056 0.4364921    
SPEI12_y1               302    1  5.6134 0.0178845 *  
Ecosystem_type        12199   11 20.5844 < 2.2e-16 ***
LAI_high               1504    1 27.9208 1.351e-07 ***
LAI_low                1761    1 32.6932 1.181e-08 ***
Soil_type_name         5825    6 18.0200 < 2.2e-16 ***
Tair:SPEI12             139    1  2.5778 0.1084704    
Tsoil_lev1:SPEI12       178    1  3.3119 0.0688761 .  
VWC_lev1:SPEI12         196    1  3.6356 0.0566491 .  
I(VWC_lev1^2):SPEI12    166    1  3.0859 0.0790727 .  
LAI_high:LAI_low       1691    1 31.3793 2.307e-08 ***
Residuals            167981 3118                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

…and some of the SPEI_yr1 terms might still be significant, just by chance!

Use an ANOVA to compare the (reduced) models with and without previous-year SPEI:

Code
m_spei_random2 <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + 
                                           VWC_lev1 + I(VWC_lev1 ^ 2)) * 
                         (SPEI12) +
                         Ecosystem_type + 
                         LAI_high * LAI_low + 
                         Soil_type_name,
                     data = dat_test_spei_random)
m_spei_random2_reduced <- MASS::stepAIC(m_spei_random2, trace = FALSE)

anova(m_spei_random1_reduced, m_spei_random2_reduced)
Analysis of Variance Table

Model 1: sqrt_Rs_annual ~ Tair + Tsoil_lev1 + VWC_lev1 + I(VWC_lev1^2) + 
    SPEI12 + SPEI12_y1 + Ecosystem_type + LAI_high + LAI_low + 
    Soil_type_name + Tair:SPEI12 + Tsoil_lev1:SPEI12 + VWC_lev1:SPEI12 + 
    I(VWC_lev1^2):SPEI12 + LAI_high:LAI_low
Model 2: sqrt_Rs_annual ~ (Tair + Tsoil_lev1 + VWC_lev1 + I(VWC_lev1^2)) * 
    (SPEI12) + Ecosystem_type + LAI_high * LAI_low + Soil_type_name
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1   3118 167981                              
2   3119 168283 -1   -302.42 5.6134 0.01788 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions:

  • SPEI_y1 impact on Random Forest model is…unclear. No performance improvement.

  • SPEI_y1 has significant interactions with many variables in the linear model, model comparison ANOVA is significant, and model R2 increases slightly.

  • Need to be careful to include null model testing.

3. Value of previous-year SPEI after drought

Similar to the previous section but focused on post-drought observations (was SPEI <= -1 the previous year but now, in year of Rs measurement, SPEI is >= 0).

Code
dat_test_spei %>% 
    filter(SPEI12 >= 0 & SPEI12_y1 <= -1) ->
    dat_test_post_drought

There are 90 Rs data points that meet these post-drought criteria.

Code
message("Linear model:")
Linear model:
Code
m_pd <- lm(formula = formula(m_Rs_spei), data = dat_test_post_drought)
m_pd_reduced <- MASS::stepAIC(m_pd, trace = FALSE)
m_pd_y1 <- lm(formula = formula(m_Rs_spei_y1), data = dat_test_post_drought)
m_pd_y1_reduced <- MASS::stepAIC(m_pd_y1, trace = FALSE)

message("Post-drought model *with* previous-year SPEI: training R2 = ", lm_r2(m_pd_y1_reduced))
Post-drought model *with* previous-year SPEI: training R2 = 0.732
Code
message("Post-drought model *without* previous-year SPEI: training R2 = ", lm_r2(m_pd_reduced))
Post-drought model *without* previous-year SPEI: training R2 = 0.7102
Code
message("ANOVA model comparison p-value: ", 
        anova_p_twomodels(m_pd_y1_reduced, m_pd_reduced))
ANOVA model comparison p-value: 0.0861
Code
tibble(sqrt_Rs_annual = dat_test_post_drought$sqrt_Rs_annual, 
       WITHOUT_y1_SPEI = predict(m_pd),
       WITH_y1_SPEI = predict(m_pd_y1_reduced),
       Ecosystem_type = dat_test_post_drought$Ecosystem_type) %>% 
    pivot_longer(starts_with("WITH")) %>% 
    ggplot(aes(sqrt_Rs_annual, value, color = name)) +
    geom_point() + geom_abline() + 
    geom_smooth(method = lm, formula = y~x, se = FALSE, linetype = 2) +
    xlab("Observed Rs_annual") + ylab("Predicted Rs_annual")

Null test: random SPEI_y1.

Code
dat_test_post_drought %>% 
    mutate(SPEI12_y1 = runif(n())) ->
    dat_test_post_drought_random

m_pd_random <- lm(formula = formula(m_pd_y1), data = dat_test_post_drought_random)
m_pd_random_reduced <- MASS::stepAIC(m_pd_random, trace = FALSE)

message("Post-drought model without previous-year SPEI: training R2 = ", lm_r2(m_pd_reduced))
Post-drought model without previous-year SPEI: training R2 = 0.7102
Code
message("Post-drought model with *random* previous-year SPEI: training R2 = ", lm_r2(m_pd_random_reduced))
Post-drought model with *random* previous-year SPEI: training R2 = 0.7163
Code
message("ANOVA model comparison p-value: ", 
        anova_p_twomodels(m_pd_reduced, m_pd_random_reduced))
ANOVA model comparison p-value: 0.1943

Conclusions: for post-drought observations, previous-year SPEI1 has a big effect on prediction.

4. Testing for a Birch effect

Are post-drought observations ‘different’?

Define a PBE (“potential Birch effect”) flag using the post-drought conditions above: PBE <= -1 the previous year and PBE >0 = the year of Rs measurement.

PBE importance and interactions

Code
dat_base_all %>% 
    select(-MODIS_NPP, -MODIS_GPP) -> 
    dat_pbe_base

# First test is using SPEI12, with a <=-1 to >0 as post-drought
dat_pbe_base %>% 
    mutate(PBE = SPEI12_y1 <= -1 & SPEI12 > 0) %>% 
    dplyr::select(-sqrt_Rh_annual, -sqrt_Rs_growingseason, -starts_with("SPEI")) %>% 
    filter(complete.cases(.)) ->
    dat_pbe
dat_pbe %>% 
    select(-PBE) -> 
    dat_no_pbe

message("Fitting RF model with PBE (SPEI12, -1.0) effect...")
Fitting RF model with PBE (SPEI12, -1.0) effect...
Code
rf_pbe <- ranger(sqrt_Rs_annual ~ ., data = dat_pbe, importance = "impurity")
print(rf_pbe)
Ranger result

Call:
 ranger(sqrt_Rs_annual ~ ., data = dat_pbe, importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3149 
Number of independent variables:  9 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       26.36269 
R squared (OOB):                  0.6486089 
Code
importance_pvalues(rf_pbe, "altmann", 
                   num.permutations = 100,
                   formula = sqrt_Rs_annual ~ ., 
                   data = dat_pbe)
               importance     pvalue
Tair            35487.187 0.00990099
Tsoil_lev1      33762.672 0.00990099
Precip          41330.514 0.00990099
VWC_lev1        25342.458 0.00990099
Ecosystem_type  15796.492 0.00990099
LAI_high        28839.705 0.00990099
LAI_low         24266.648 0.00990099
Soil_type_name   8020.506 0.00990099
PBE              1002.656 0.95049505
Code
rf_no_pbe <- ranger(sqrt_Rs_annual ~ ., data = dat_no_pbe, importance = "impurity")

message("Model *with* PBE flag: OOB R2 = ", round(rf_pbe$r.squared, 4))
Model *with* PBE flag: OOB R2 = 0.6486
Code
message("Model *without* PBE flag: OOB R2 = ", round(rf_no_pbe$r.squared, 4))
Model *without* PBE flag: OOB R2 = 0.6473
Code
message("Fitting linear model with PBE (SPEI12, -1.0) effect...")
Fitting linear model with PBE (SPEI12, -1.0) effect...
Code
m_Rs_PBE <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 +
                                     VWC_lev1 + I(VWC_lev1 ^ 2)) * PBE +
                   Ecosystem_type +
                   LAI_high * LAI_low + 
                   Soil_type_name,
               data = dat_pbe)
m_Rs_PBE_reduced <- MASS::stepAIC(m_Rs_PBE, trace = FALSE)
car::Anova(m_Rs_PBE_reduced, type = "III")
Anova Table (Type III tests)

Response: sqrt_Rs_annual
                 Sum Sq   Df  F value    Pr(>F)    
(Intercept)         219    1   4.0465   0.04435 *  
Tsoil_lev1        11893    1 219.3273 < 2.2e-16 ***
VWC_lev1           2170    1  40.0180 2.875e-10 ***
I(VWC_lev1^2)       848    1  15.6379 7.840e-05 ***
PBE                 209    1   3.8495   0.04985 *  
Ecosystem_type    12115   11  20.3105 < 2.2e-16 ***
LAI_high           1789    1  32.9836 1.019e-08 ***
LAI_low            1900    1  35.0332 3.593e-09 ***
Soil_type_name     5527    6  16.9870 < 2.2e-16 ***
Tsoil_lev1:PBE      236    1   4.3537   0.03701 *  
LAI_high:LAI_low   1842    1  33.9695 6.169e-09 ***
Residuals        169343 3123                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
message("Fitting Rs_annual linear model without PBE effect...")
Fitting Rs_annual linear model without PBE effect...
Code
m_Rs_no_PBE <- lm(sqrt_Rs_annual ~ (Tair + Tsoil_lev1 +
                                        VWC_lev1 + I(VWC_lev1 ^ 2)) +
                      Ecosystem_type + 
                      LAI_high * LAI_low + 
                      Soil_type_name,
                  data = dat_no_pbe)
m_Rs_no_PBE_reduced <- MASS::stepAIC(m_Rs_no_PBE, trace = FALSE)

message("Linear model without PBE flag: training R2 = ", lm_r2(m_Rs_no_PBE_reduced))
Linear model without PBE flag: training R2 = 0.2767
Code
message("Linear model with PBE flag: training R2 = ", lm_r2(m_Rs_PBE_reduced))
Linear model with PBE flag: training R2 = 0.2772
Code
message("ANOVA model comparison p-value: ", 
        anova_p_twomodels(m_Rs_PBE_reduced, m_Rs_no_PBE_reduced))
ANOVA model comparison p-value: 0.1093
Code
tibble(model_with_PBE = predict(m_Rs_PBE),
       model_without_PBE = predict(m_Rs_no_PBE),
       PBE = dat_pbe$PBE, 
       Ecosystem_type = dat_pbe$Ecosystem_type) %>% 
    ggplot(aes(model_without_PBE, model_with_PBE)) + 
    geom_point(aes(color = PBE)) + 
    geom_abline() + 
    facet_wrap(~Ecosystem_type) +
    ggtitle("Linear model: effect of including PBE on sqrt_Rs_annual")

Code
# Extract the significant interactions between PBE and other variables 
message("Interactive PBE effects:")
Interactive PBE effects:
Code
tidy(m_Rs_PBE_reduced) %>% 
    mutate(model = "Rs_annual") %>% 
    filter(grepl("PBE", term)) %>% 
    separate(term, into = c("interaction", "PBE"), fill = "left", sep = ":") %>% 
    replace_na(list(interaction = "(none)")) %>% 
    select(model, interaction, estimate, p.value) %>% 
    mutate(estimate = round(estimate, 4), p.value = round(p.value, 4)) ->
    PBE_interactions

DT::datatable(PBE_interactions)
Code
# Summary table of significant effect directions
PBE_interactions %>% 
    mutate(signif = p.value < 0.05, 
           effect = case_when(signif & estimate < 0 ~ "Negative",
                              signif & estimate >= 0 ~ "Positive",
                              .default = "")) %>% 
    select(model, interaction, effect) %>%
    pivot_wider(values_from = effect, names_from = model, values_fill = "") %>% 
    knitr::kable(caption = "PBE: significant effect directions")
PBE: significant effect directions
interaction Rs_annual
(none) Negative
Tsoil_lev1 Positive

Conclusions:

  • PBE (potential Birch effect: a drought the year before a well-watered year in which Rs was measured) has low importance in a RF model, although its Altmann p-value is marginally significant.

  • In a linear model, PBE has a strong effect on the temperature and moisture drivers of soil respiration.

  • But, it’s inconsistent–there’s no clean, unidirectional ‘Birch effect.’

PBE sensitivity test

Are the above results due to particular choices about what constitutes a drought (the SPEI cutoff value) and/or the SPEI window (12 or 24 months)?

Here we test the cutoff SPEI values from Table 1 of Keune et al. (2025) corresponding to “severely dry,” “moderately dry,” and “mildly dry” conditions.

Code
# These are the three most common fluxes recorded in SRDB
dv_set <- c("sqrt_Rs_annual", "sqrt_Rh_annual", "sqrt_Rs_growingseason")

# The following SPEI values are from Table 1 of Keune et al. (2025)
# https://www.nature.com/articles/s41597-025-04896-y
# and correspond to "severely dry", "moderately dry", and "mildly dry"
# (there are no "extremely dry"=-2.33 values in the dataset)
spei_cutoffs <- c(-1.65, -1.28, -0.84)

# What SPEI window should be used for the cutoffs above?
spei_windows <- c(12, 24) # months

# Model independent variables
predictors <- c("Ecosystem_type", 
                "SPEI12",
                "Tair", "Precip",
                "Tsoil_lev1", "VWC_lev1", 
                "LAI_high", "LAI_low", "Soil_type_name")

# Linear model right hand side
lm_rhs_formula <- as.character(formula(m_Rs_PBE))[3]

test_lm_pbe <- function(x_train, x_val, formula) {
    m <- lm(formula = formula, data = x_train)
    pred_train <- predict(m, newdata = x_train)
    # If data sizes are small (as happens with Rh_annual and spei_cutoff -1.5)
    # there might be a mismatch in available data, causing prediction of
    # validation data to fail
    try(pred_val <- predict(m, newdata = x_val))
    if(exists("pred_val")) {
        validation_r2 <- r2(pred_val, pull(x_val[1]))
        validation_rmse <- rmse(pred_val, pull(x_val[1]))
    } else {
        validation_r2 <- validation_rmse <- NA_real_
    }
    
    # Reduce the model and identify and significant PBE terms remaining
    m_reduced <- MASS::stepAIC(m, trace = FALSE)
    pbe_terms <- tidy(m_reduced) %>% filter(grepl("PBE", term))
    n_pbe_terms <- nrow(pbe_terms)
    n_signif_pbe_terms <- sum(pbe_terms$p.value < 0.05)
    
    # Re-fit model with no PBE term
    x_no_pbe <- x_train
    x_no_pbe$PBE <- 1
    m_no_pbe <- lm(formula = formula, data = x_no_pbe)

    tibble(n_pbe_terms = n_pbe_terms,
           n_signif_pbe_terms = n_signif_pbe_terms,
           pbe_anova = anova_p_twomodels(m, m_no_pbe),
           training_r2 = r2(pred_train, pull(x_train[1])),
           training_rmse = rmse(pred_train, pull(x_train[1])),
           validation_r2 = validation_r2,
           validation_rmse = validation_rmse)
}

do_sensitivity_analysis <- function(x, dv_set, predictors, spei_windows, spei_cutoffs) {
    rf_results <- list()
    lm_results <- list()
    SPEI_cols <- colnames(x)[grep("^SPEI", colnames(x))]
    for(dv in dv_set) {
        
        for(w in spei_windows) {
            spei_past_col <- paste0("SPEI", w, "_y1")
            
            for(spei_cut in spei_cutoffs) {
                # Remove other potential independent variables
                this_x <- x[c(dv, union(predictors, SPEI_cols))]
                # Complete cases only
                this_x <- this_x[complete.cases(this_x),]

                message("dv = ", dv, " spei_cut = ",
                        spei_cut, " window = ", w, " n = ", nrow(this_x))

                # Identify observations not currently in a drought but that
                # WERE in a drought the previous year
                # PBE = potential Birch effect
                this_x$PBE = this_x$SPEI12 > 0 & this_x[spei_past_col] <= spei_cut

                # Drop non-predictors entirely now
                this_x <- this_x[c(dv, predictors, "PBE")]
                
                # Construct formula
                f <- as.formula(paste(dv, "~", lm_rhs_formula))

                # Do the rf k-fold
                rf_out <- do_k_fold(this_x, fit_and_test_rf)
                rf_out$model <- "rf"
                # Do the lm k-fold; do_k_fold will pass f on to fit_and_test_lm
                lm_out <- do_k_fold(this_x, test_lm_pbe, formula = f)
                lm_out$model <- "lm"
                info_cols <- tibble(spei_window = w,
                                    spei_cut = spei_cut,
                                    dv = dv,
                                    n = nrow(this_x),
                                    n_pbe_true = sum(this_x$PBE))
                lm_results[[paste(dv, w, spei_cut)]] <- 
                    bind_cols(info_cols, bind_rows(rf_out, lm_out))
            }
        }
    }
    bind_rows(lm_results)
}

out <- do_sensitivity_analysis(dat_pbe_base,
                               dv_set, predictors,
                               spei_windows = c(12, 24),
                               spei_cutoffs = spei_cutoffs)
dv = sqrt_Rs_annual spei_cut = -1.65 window = 12 n = 3149
Warning in predict.lm(m, newdata = x_val): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
dv = sqrt_Rs_annual spei_cut = -1.28 window = 12 n = 3149
dv = sqrt_Rs_annual spei_cut = -0.84 window = 12 n = 3149
dv = sqrt_Rs_annual spei_cut = -1.65 window = 24 n = 3149
dv = sqrt_Rs_annual spei_cut = -1.28 window = 24 n = 3149
dv = sqrt_Rs_annual spei_cut = -0.84 window = 24 n = 3149
dv = sqrt_Rh_annual spei_cut = -1.65 window = 12 n = 674
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor Ecosystem_type has new levels Deciduous shrubs
Warning in predict.lm(m, newdata = x_val): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor Ecosystem_type has new levels Desert
dv = sqrt_Rh_annual spei_cut = -1.28 window = 12 n = 674
Warning in predict.lm(m, newdata = x_val): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor Ecosystem_type has new levels Deciduous shrubs
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor Ecosystem_type has new levels Desert
Warning in predict.lm(m, newdata = x_val): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
dv = sqrt_Rh_annual spei_cut = -0.84 window = 12 n = 674
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor Ecosystem_type has new levels Desert
dv = sqrt_Rh_annual spei_cut = -1.65 window = 24 n = 674
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor Ecosystem_type has new levels Desert
dv = sqrt_Rh_annual spei_cut = -1.28 window = 24 n = 674
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor Ecosystem_type has new levels Desert
dv = sqrt_Rh_annual spei_cut = -0.84 window = 24 n = 674
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor Ecosystem_type has new levels Desert
dv = sqrt_Rs_growingseason spei_cut = -1.65 window = 12 n = 1377
Warning in predict.lm(m, newdata = x_val): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
Warning in predict.lm(m, newdata = x_val): prediction from rank-deficient fit;
attr(*, "non-estim") has doubtful cases
dv = sqrt_Rs_growingseason spei_cut = -1.28 window = 12 n = 1377
dv = sqrt_Rs_growingseason spei_cut = -0.84 window = 12 n = 1377
dv = sqrt_Rs_growingseason spei_cut = -1.65 window = 24 n = 1377
dv = sqrt_Rs_growingseason spei_cut = -1.28 window = 24 n = 1377
dv = sqrt_Rs_growingseason spei_cut = -0.84 window = 24 n = 1377
Code
# Compute means across the SPEI windows, drought definitions, and d.v.'s
out %>% 
    group_by(model, dv, spei_window, spei_cut, n, n_pbe_true) %>% 
    summarise(n_pbe_terms = mean(n_pbe_terms),
              n_signif_pbe_terms = mean(n_signif_pbe_terms),
              pbe_anova = mean(pbe_anova),
              pbe_importance = mean(pbe_importance),
              validation_r2 = round(mean(validation_r2, na.rm = TRUE), digits = 3),
              .groups = "drop") %>% 
    mutate(spei_cut = as.factor(spei_cut),
           spei_window = paste(spei_window, "months")) ->
    out_smry

DT::datatable(out_smry)
Code
ggplot(out_smry %>% filter(validation_r2 > 0),
       aes(spei_window, spei_cut, fill = validation_r2)) +
    geom_tile() + 
    facet_grid(model ~ dv) +
    ggtitle("k-fold mean R2 by model type and dependent variable")

Code
ggplot(out_smry %>% filter(model == "lm"),
       aes(spei_window, spei_cut, fill = pbe_anova < 0.05)) +
    geom_tile() + 
    facet_wrap(~dv) +
    ggtitle("lm two-model ANOVA: effect of including PBE")

Code
# Table: number (and % of total) of TRUE PBE entries; how big are the datasets?
out_smry %>%
    # these numbers are the same for each model type; pick one
    filter(model == "lm") %>% 
    mutate(pbe_true = paste0(n_pbe_true, " (", round(n_pbe_true / n * 100, 0), "%)")) %>% 
    select(dv, spei_window, spei_cut, pbe_true) %>% 
    pivot_wider(names_from = "dv", values_from = "pbe_true") %>% 
    knitr::kable(caption = "How many 'PBE events' are present in the data?", align = "r")
How many ‘PBE events’ are present in the data?
spei_window spei_cut sqrt_Rh_annual sqrt_Rs_annual sqrt_Rs_growingseason
12 months -1.65 1 (0%) 7 (0%) 2 (0%)
12 months -1.28 6 (1%) 26 (1%) 31 (2%)
12 months -0.84 24 (4%) 118 (4%) 65 (5%)
24 months -1.65 13 (2%) 47 (1%) 21 (2%)
24 months -1.28 27 (4%) 114 (4%) 66 (5%)
24 months -0.84 51 (8%) 227 (7%) 116 (8%)

Conclusions:

  • PBE events (low SPEI the previous year, followed by non-drought conditions) occur in 0-10% of the data, depending on criteria.

  • In a linear model, an ANOVA comparing models with and without PBE is almost always significant for Rs_annual and Rh_annual w/ 12-month SPEI; the exceptions are at the most extreme drought cutoff value.

Conditional inference Random Forest

Recall the default splitting rule during random forests tree building consists of selecting, out of all splits of the (randomly selected mtry) candidate variables, the split that minimizes the Gini impurity (in the case of classification) and the SSE (in case of regression). However, Strobl et al. (2007) illustrated that these default splitting rules favor the selection of features with many possible splits (e.g., continuous variables or categorical variables with many categories) over variables with fewer splits (the extreme case being binary variables, which have only one possible split). Conditional inference trees (Hothorn, Hornik, and Zeileis 2006) implement an alternative splitting mechanism that helps to reduce this variable selection bias. However, ensembling conditional inference trees has yet to be proven superior with regards to predictive accuracy and they take a lot longer to train.

From https://bradleyboehmke.github.io/HOML/random-forest.html.

Our PBE flag is a binary variable, so is its importance being under-valued by ranger? Test a conditional inference tree-based forest (note, this is slow):

Code
m_ci <- cforest(sqrt_Rs_annual ~ ., data = dat_pbe)
print(m_ci)

     Random Forest using Conditional Inference Trees

Number of trees:  500 

Response:  sqrt_Rs_annual 
Inputs:  Tair, Tsoil_lev1, Precip, VWC_lev1, Ecosystem_type, LAI_high, LAI_low, Soil_type_name, PBE 
Number of observations:  3149 
Code
message("Variable importance of CI forest:")
Variable importance of CI forest:
Code
sort(varimp(m_ci), decreasing = TRUE)
Ecosystem_type           Tair         Precip     Tsoil_lev1 Soil_type_name 
    24.8809359     22.4291690     19.4179743     18.7521006     11.9591633 
      LAI_high        LAI_low       VWC_lev1            PBE 
    10.0762005      8.2712181      7.6400167      0.1943363 
Code
# Fit c.i. random forest model and test against validation data
# The independent variable is in the first column
fit_and_test_ci_rf <- function(x_train, x_val) {
    f <- as.formula(paste(colnames(x_train)[1], "~ ."))
    rf <- cforest(formula = f, data = x_train)
    train_preds <- predict(rf, newdata = NULL)
    training_r2 <- r2(train_preds, pull(x_train[1]))
    training_rmse <- rmse(train_preds, pull(x_train[1]))
    
    if(!is.null(x_val)) {
        val_preds <- predict(rf, newdata = x_val)
        val_obs <- pull(x_val[1])
        validation_r2 <- r2(preds = val_preds, obs = val_obs)
        validation_rmse <- rmse(preds = val_preds, obs = val_obs)
    } else {
        validation_r2 <- NA_real_
        validation_rmse <- NA_real_
    }
    
    tibble(training_r2 = training_r2, 
           training_rmse = training_rmse,
           validation_r2 = validation_r2,
           validation_rmse = validation_rmse)
}

do_k_fold(dat_pbe, fit_and_test_ci_rf) %>% 
    summarise(across(everything(), \(x) mean(x, digits = 3))) %>%
    mutate(k = K_FOLD)
# A tibble: 1 × 5
      k training_r2 training_rmse validation_r2 validation_rmse
  <dbl>       <dbl>         <dbl>         <dbl>           <dbl>
1     5       0.631          5.26         0.487            6.19

Conclusion: no evidence that conditional inference random forest produces different results.

Summary

Summary

Reproducibility

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] spatialRF_1.1.5   geodist_0.1.1     xgboost_3.1.2.1   party_1.3-18     
 [5] strucchange_1.5-4 sandwich_3.1-1    zoo_1.8-14        modeltools_0.2-24
 [9] mvtnorm_1.3-3     car_3.1-3         carData_3.0-5     broom_1.0.10     
[13] vivid_0.3.0       ranger_0.17.0     arrow_21.0.0.1    lubridate_1.9.4  
[17] readr_2.1.5       ggplot2_4.0.1     tidyr_1.3.1       dplyr_1.1.4      

loaded via a namespace (and not attached):
 [1] gridExtra_2.3      rlang_1.1.6        magrittr_2.0.4     multcomp_1.4-29   
 [5] matrixStats_1.5.0  compiler_4.5.2     mgcv_1.9-3         vctrs_0.6.5       
 [9] stringr_1.6.0      pkgconfig_2.0.3    crayon_1.5.3       fastmap_1.2.0     
[13] backports_1.5.0    labeling_0.4.3     ca_0.71.1          utf8_1.2.6        
[17] promises_1.3.3     DendSer_1.0.2      rmarkdown_2.29     tzdb_0.5.0        
[21] purrr_1.2.0        bit_4.6.0          xfun_0.52          cachem_1.1.0      
[25] jsonlite_2.0.0     later_1.4.2        parallel_4.5.2     cluster_2.1.8.1   
[29] R6_2.6.1           bslib_0.9.0        stringi_1.8.7      coin_1.4-3        
[33] RColorBrewer_1.1-3 GGally_2.4.0       jquerylib_0.1.4    Rcpp_1.1.0        
[37] assertthat_0.2.1   iterators_1.0.14   knitr_1.50         flashlight_1.0.0  
[41] httpuv_1.6.16      Matrix_1.7-4       splines_4.5.2      igraph_2.1.4      
[45] timechange_0.3.0   tidyselect_1.2.1   rstudioapi_0.17.1  abind_1.4-8       
[49] yaml_2.3.10        viridis_0.6.5      TSP_1.2.6          doParallel_1.0.17 
[53] codetools_0.2-20   lattice_0.22-7     tibble_3.3.0       shiny_1.11.1      
[57] withr_3.0.2        S7_0.2.1           evaluate_1.0.3     survival_3.8-3    
[61] ggstats_0.11.0     huxtable_5.8.0     pillar_1.11.1      DT_0.34.0         
[65] foreach_1.5.2      generics_0.1.4     hms_1.1.3          scales_1.4.0      
[69] xtable_1.8-4       glue_1.8.0         tools_4.5.2        ggnewscale_0.5.2  
[73] data.table_1.17.2  registry_0.5-1     crosstalk_1.2.1    seriation_1.5.8   
[77] libcoin_1.0-10     colorspace_2.1-2   nlme_3.1-168       patchwork_1.3.2   
[81] Formula_1.2-5      cli_3.6.5          gclus_1.3.3        fansi_1.0.7       
[85] viridisLite_0.4.2  gtable_0.3.6       condvis2_0.1.2     sass_0.4.10       
[89] digest_0.6.39      TH.data_1.1-5      htmlwidgets_1.6.4  farver_2.1.2      
[93] htmltools_0.5.9    lifecycle_1.0.4    mime_0.13          bit64_4.6.0-1     
[97] MASS_7.3-65