Does SPEI – incorporating the statistical deviations of precipitation and evaporative demand in a single index – add Rs predictive information beyond that of climate, LAI, vegetation, and remotely-derived productivity?
Does the previous year’s SPEI value have Rs predictive value? I.e., is there a memory effect?
Specifically for Rs observations in a non-drought year following one or more drought years, does the previous year’s SPEI have predictive value?
Is there a ‘Birch effect’ in the observational record, in which non-drought Rs is higher than it otherwise would be when preceded by a drought?
0. Prep work and base models
Code
dat <- read_csv("srdb_joined_data.csv", show_col_types = FALSE)
message(nrow(dat), " rows of data")
4479 rows of data
Code
dat %>%
  filter(is.na(Rs_growingseason) | (Rs_growingseason > 0 & Rs_growingseason < 20)) %>%
  filter(is.na(Rs_annual) | (Rs_annual > 0 & Rs_annual < 4000)) %>%
  filter(is.na(Rh_annual) | (Rh_annual > 0 & Rh_annual < 2000)) %>%
  # Most SRDB records are from forests (~3000), then grasslands (~760),
  # shrubland (~230), wetland (~210), and desert (~80). We add the next
  # one, savanna (~30), because it might be particularly vulnerable to
  # a Birch effect, but group everything else into "Other"
  mutate(Ecosystem_type = if_else(Ecosystem_type %in% c("Desert", "Forest", "Grassland",
                                                        "Savanna", "Shrubland", "Wetland"),
                                  true = Ecosystem_type, false = "Other",
                                  missing = "Other"),
         # vivid::pdpVars() below needs things as factors not char
         Ecosystem_type = as.factor(Ecosystem_type),
         Soil_type_name = as.factor(Soil_type_name)) ->
  dat_filtered

message("After filtering, ", nrow(dat_filtered), " rows of data")
After filtering, 4444 rows of data
Dependent variable distribution
We have three potential dependent variables:
Rs_annual (annual soil respiration)
Rh_annual (annual heterotrophic respiration)
Rs_growingseason (mean growing season soil respiration)
Their distributions are expected to be non-normal (right-skewed), which is a problem for linear models and not ideal for the other approaches either. Evaluate possible transformations.
Code
dat_filtered %>%
  select(Rs_annual, Rh_annual, Rs_growingseason) %>%
  pivot_longer(everything()) %>%
  filter(!is.na(value), value > 0) %>%
  rename(none = value) %>%
  mutate(log = log(none), sqrt = sqrt(none)) %>%
  pivot_longer(-name, names_to = "Transformation") %>%
  ggplot(aes(value, color = Transformation)) +
  geom_density() +
  facet_wrap(~ name + Transformation, scales = "free")
Conclusion: we have a transformation winner: sqrt()!
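The "winner" above is a visual judgment from the density plots. As a cross-check, a minimal sketch of a numeric skewness comparison follows; the skewness() helper here is ad hoc and not part of the analysis above.

# Sketch only: skewness of each candidate transformation
# (values near zero indicate a more symmetric distribution)
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

dat_filtered %>%
  select(Rs_annual, Rh_annual, Rs_growingseason) %>%
  pivot_longer(everything()) %>%
  filter(!is.na(value), value > 0) %>%
  group_by(name) %>%
  summarise(none = skewness(value),
            log = skewness(log(value)),
            sqrt = skewness(sqrt(value)))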
Random Forest model: performance
We are using the ranger package.
Wright, M. N. and Ziegler, A.: Ranger: A fast implementation of random forests for high dimensional data in C++ and R, J. Stat. Softw., 77, https://doi.org/10.18637/jss.v077.i01, 2017.
Code
# Fit random forest model and test against validation data
# The dependent variable is in the first column
fit_and_test_rf <- function(x_train, x_val) {
  f <- as.formula(paste(colnames(x_train)[1], "~ ."))
  rf <- ranger::ranger(formula = f, data = x_train, importance = "impurity")
  # use importance = "permutation" for importance_pvalues()

  if(!is.null(x_val)) {
    preds <- predict(rf, data = x_val)$predictions
    obs <- pull(x_val[1])
    validation_r2 <- r2(preds = preds, obs = obs)
  } else {
    validation_r2 <- NA_real_
  }

  return(list(model = rf,
              training_r2 = rf$r.squared,
              importance = importance(rf),
              validation_r2 = validation_r2))
}

# Do a k-fold cross-validation
# x = data; f = function to call; k = number of folds; ... = params for model
do_k_fold <- function(x, f, k = K_FOLD, quiet = TRUE, ...) {
  if(!all(complete.cases(x))) {
    stop("x should not have any NAs at this stage!")
  }
  groups <- sample.int(n = k, size = nrow(x), replace = TRUE)
  training_r2 <- rep.int(NA_real_, k)
  validation_r2 <- rep.int(NA_real_, k)
  for(i in seq_len(k)) {
    x_val <- x[groups == i,]
    x_train <- x[groups != i,]
    if(!quiet) {
      message("\tk-fold ", i, ":")
      message("\t\tTraining is ", nrow(x_train), " x ", ncol(x_train))
      message("\t\tValidation is ", nrow(x_val), " x ", ncol(x_val))
    }
    out <- f(x_train = x_train, x_val = x_val, ...)
    training_r2[i] <- out$training_r2
    validation_r2[i] <- out$validation_r2
  }
  tibble(k = seq_len(k),
         training_r2 = training_r2,
         validation_r2 = validation_r2)
} # do_k_fold

# Base test dataset (all three d.v.'s)
dat_filtered %>%
  select(sqrt_Rs_annual, sqrt_Rh_annual, sqrt_Rs_growingseason,
         Tair, Tsoil_lev1, Tsoil_lev2, Precip, VWC_lev1, VWC_lev2,
         LAI_high, LAI_low, Ecosystem_type,
         starts_with("SPEI"), MODIS_NPP, MODIS_GPP,
         Veg_type_hi, Veg_type_low, Soil_type_name) ->
  dat_base_all

# Base test dataset without MODIS or SPEI data
dat_base_all %>%
  select(-MODIS_NPP, -MODIS_GPP, -starts_with("SPEI")) ->
  dat_base_no_modispei

# Base test dataset (Rs_annual)
dat_base_no_modispei %>%
  select(-sqrt_Rh_annual, -sqrt_Rs_growingseason) %>%
  filter(complete.cases(.)) ->
  dat_base_Rs_annual

message("The test complete-cases dataset has ", dim(dat_base_Rs_annual)[1],
        " rows and ", dim(dat_base_Rs_annual)[2], " columns")
The test complete-cases dataset has 3158 rows and 13 columns
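The r2() helper and the K_FOLD constant used above are defined earlier in the workflow and not shown in this section; plausible stand-ins, consistent with how they are used here, would be:

# Assumed stand-ins; the originals are defined elsewhere in the workflow
K_FOLD <- 10  # matches the k = 10 reported in the cross-validation tables below

# R2 of predictions against observations (1 - SSE/SST)
r2 <- function(preds, obs) {
  1 - sum((obs - preds)^2) / sum((obs - mean(obs))^2)
}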
Code
message("Full-data model performance:")
Full-data model performance:
Code
rf_all_data <- ranger(sqrt_Rs_annual ~ ., data = dat_base_Rs_annual, importance = "impurity")
print(rf_all_data)
Ranger result
Call:
ranger(sqrt_Rs_annual ~ ., data = dat_base_Rs_annual, importance = "impurity")
Type: Regression
Number of trees: 500
Sample size: 3158
Number of independent variables: 12
Mtry: 3
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 25.88067
R squared (OOB): 0.6556625
Code
imps <- sort(rf_all_data$variable.importance, decreasing = TRUE)

# Top three variables...
pdpVars(data = dat_base_Rs_annual, fit = rf_all_data, response = "sqrt_Rs_annual",
        vars = names(imps)[1:3])
Code
# ...and the next three
pdpVars(data = dat_base_Rs_annual, fit = rf_all_data, response = "sqrt_Rs_annual",
        vars = names(imps)[4:6])
Code
# k-fold cross-validation for all dependent variables
message("k-fold cross-validation:")
k-fold cross-validation:
Code
dvs <- c("sqrt_Rs_annual", "sqrt_Rh_annual", "sqrt_Rs_growingseason")
preds <- setdiff(colnames(dat_base_no_modispei), dvs)

kf_out <- list()
for(dv in dvs) {
  # Construct dataset and do the k-fold
  x <- dat_base_no_modispei[c(dv, preds)]
  x <- x[complete.cases(x),]
  kf_out[[dv]] <- do_k_fold(x, fit_and_test_rf) %>%
    mutate(dv = dv, n = nrow(x))
}

bind_rows(kf_out) %>%
  group_by(dv) %>%
  summarise(across(everything(), mean)) %>%
  mutate(k = K_FOLD) -> # just so it's "10" or whatever as expected
  kf_out

print(kf_out)
# A tibble: 3 × 5
dv k training_r2 validation_r2 n
<chr> <dbl> <dbl> <dbl> <dbl>
1 sqrt_Rh_annual 10 0.516 0.513 677
2 sqrt_Rs_annual 10 0.640 0.641 3158
3 sqrt_Rs_growingseason 10 0.558 0.555 1380
Side note: validation performance appears to scale roughly linearly with dataset size! Huh.
Code
ggplot(kf_out, aes(n, validation_r2)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x)
How important are the MODIS data?
The MODIS data only start in 2001 (see gsbe-data-prep output), so requiring them in a model cuts off a lot of early data – on the order of 20% of the dataset. Are they needed?
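The with/without comparison itself isn't reproduced here; a minimal sketch of how it could be run with the helpers defined above (do_k_fold() and fit_and_test_rf()), restricting both fits to the rows where MODIS data exist so the comparison is fair, might look like the following. The dat_modis_test name is just for illustration.

# Sketch only: compare RF skill with vs. without the MODIS columns,
# on the same (MODIS-era, complete-cases) rows
dat_base_all %>%
  select(-sqrt_Rh_annual, -sqrt_Rs_growingseason, -starts_with("SPEI")) %>%
  filter(complete.cases(.)) ->
  dat_modis_test

kf_with <- do_k_fold(dat_modis_test, fit_and_test_rf)
kf_without <- do_k_fold(select(dat_modis_test, -MODIS_NPP, -MODIS_GPP), fit_and_test_rf)

message("With MODIS: mean validation R2 = ", round(mean(kf_with$validation_r2), 3))
message("Without MODIS: mean validation R2 = ", round(mean(kf_without$validation_r2), 3))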
Test xgboost
k-fold cross-validation summary for the xgboost fit:
k training_r2 validation_r2
Min. : 1.00 Min. :0.9089 Min. :0.5884
1st Qu.: 3.25 1st Qu.:0.9109 1st Qu.:0.6111
Median : 5.50 Median :0.9117 Median :0.6400
Mean : 5.50 Mean :0.9122 Mean :0.6317
3rd Qu.: 7.75 3rd Qu.:0.9129 3rd Qu.:0.6494
Max. :10.00 Max. :0.9175 Max. :0.6835
Seems like xgboost is over-fitting to the training data (training R2 ~0.91 versus validation R2 ~0.63). Note however that:
Unlike random forests, GBMs can have high variability in accuracy dependent on their hyperparameter settings (Probst, Bischl, and Boulesteix 2018). So tuning can require much more strategy than a random forest model.
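To make that concrete, here is a sketch of what a small tuning grid could look like with xgb.cv() on the same complete-cases Rs_annual data. The grid values are arbitrary illustrations, not the settings used above.

# Sketch only: a small xgboost hyperparameter grid, scored by 5-fold CV RMSE
library(xgboost)

X <- model.matrix(sqrt_Rs_annual ~ . - 1, data = dat_base_Rs_annual)
y <- dat_base_Rs_annual$sqrt_Rs_annual
dtrain <- xgb.DMatrix(data = X, label = y)

grid <- expand.grid(eta = c(0.05, 0.1, 0.3), max_depth = c(3, 6, 9))
grid$test_rmse <- NA_real_

for(i in seq_len(nrow(grid))) {
  cv <- xgb.cv(params = list(objective = "reg:squarederror",
                             eta = grid$eta[i],
                             max_depth = grid$max_depth[i],
                             subsample = 0.8),
               data = dtrain, nrounds = 200, nfold = 5,
               early_stopping_rounds = 10, verbose = FALSE)
  grid$test_rmse[i] <- min(cv$evaluation_log$test_rmse_mean)
}

grid[order(grid$test_rmse), ]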
Conclusion: performance and variable importance seem similar to ranger.
Test neural network
We use the neuralnet package for this. I experimented with the hidden layer topology and threshold settings to find a configuration that reliably converges with these data. Note: fitting this is slow.
Code
library(neuralnet)
Attaching package: 'neuralnet'
The following object is masked from 'package:dplyr':
compute
Code
fit_and_test_nn <- function(x_train, x_val) {
  # Algorithm is not guaranteed to converge; if this happens, returns NAs
  out <- tibble(training_r2 = NA_real_, validation_r2 = NA_real_)
  try({
    m_nn <- neuralnet(sqrt_Rs_annual ~ ., data = x_train,
                      hidden = c(12, 6), rep = 2, threshold = 0.4)
    pred_train <- predict(m_nn, newdata = x_train)
    pred_val <- predict(m_nn, newdata = x_val)
    out <- tibble(training_r2 = r2(pred_train, pull(x_train[1])),
                  validation_r2 = r2(pred_val, pull(x_val[1])))
  })
  return(out)
}

# Rescale the variables and change factors to numeric
x <- dat_base_Rs_annual
nums <- sapply(x, is.numeric)
for(col in names(nums)[nums]) x[col] <- scale(x[col])
fcts <- sapply(x[-1], is.factor)
for(col in names(fcts)[fcts]) x[col] <- as.numeric(pull(x[col]))

# Overall model
m_nn <- neuralnet(sqrt_Rs_annual ~ ., data = x, hidden = c(12, 6), rep = 2, threshold = 0.4)
message("R2 of two-layer model: ",
        round(r2(predict(m_nn, newdata = x), x$sqrt_Rs_annual), 3))
Error in if (ncol.matrix < rep) { : argument is of length zero
Error in if (ncol.matrix < rep) { : argument is of length zero
k training_r2 validation_r2
Min. : 1.00 Min. :0.5697 Min. :0.2616
1st Qu.: 3.25 1st Qu.:0.5871 1st Qu.:0.3473
Median : 5.50 Median :0.5980 Median :0.3677
Mean : 5.50 Mean :0.5945 Mean :0.3811
3rd Qu.: 7.75 3rd Qu.:0.6033 3rd Qu.:0.4257
Max. :10.00 Max. :0.6110 Max. :0.4971
NA's :2 NA's :2
With a complex topology (e.g., three hidden layers) the training R2 can be cranked up to 0.9+, but it exhibits severe overfitting.
Conclusion: I'm not an expert here! But there's no evidence that this will be a magic bullet.
Linear regression model: performance
Code
# Fit linear model and test against validation data
# The dependent variable is in the first column
fit_and_test_lm <- function(x_train, x_val, m) {
  pred_train <- predict(m, newdata = x_train)
  pred_val <- predict(m, newdata = x_val)
  tibble(training_r2 = r2(pred_train, pull(x_train[1])),
         validation_r2 = r2(pred_val, pull(x_val[1])))
}

message("Fitting Rs_annual linear model...")
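The code that fits the base m_Rs model (and its MASS::stepAIC-reduced version m_Rs_reduced, used below) is not shown here. A sketch of what a comparable fit might look like follows; the exact formula is an assumption, loosely patterned on the PBE formula used in section 3, and is not necessarily the one used for the results below.

# Sketch only: an assumed specification for the base Rs_annual linear model
m_Rs <- lm(sqrt_Rs_annual ~ Tair + Tsoil_lev1 + Tsoil_lev2 +
             VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 + I(VWC_lev2 ^ 2) +
             Ecosystem_type + LAI_high * LAI_low +
             Veg_type_hi + Veg_type_low + Soil_type_name,
           data = dat_base_Rs_annual)
m_Rs_reduced <- MASS::stepAIC(m_Rs, trace = FALSE)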
Code
lm_rhs_formula <- as.character(formula(m_Rs))[3]

kf_out <- list()
for(dv in dvs) {
  # Pull out the rhs of the formula and paste dv (dependent variable) in front
  f <- as.formula(paste(dv, "~", lm_rhs_formula))

  # Construct dataset and call lm() once to give us a base model
  x <- dat_base_no_modispei[c(dv, preds)]
  x <- x[complete.cases(x),]
  m <- lm(f, data = x)

  kf_out[[dv]] <- do_k_fold(x, f = fit_and_test_lm, m = m) %>%
    mutate(dv = dv, n = nrow(x))
}

bind_rows(kf_out) %>%
  group_by(dv) %>%
  summarise(across(everything(), mean)) %>%
  mutate(k = K_FOLD) # just so it's "10" or whatever as expected
# A tibble: 3 × 5
dv k training_r2 validation_r2 n
<chr> <dbl> <dbl> <dbl> <dbl>
1 sqrt_Rh_annual 10 0.157 0.145 677
2 sqrt_Rs_annual 10 0.272 0.269 3158
3 sqrt_Rs_growingseason 10 0.245 0.242 1380
Code
# Directly compare lm and rf predictions
tibble(Ecosystem_type = dat_base_Rs_annual$Ecosystem_type,
       lm_prediction = predict(m_Rs_reduced),
       rf_prediction = predict(rf_all_data, data = dat_base_Rs_annual)$predictions) %>%
  ggplot(aes(lm_prediction, rf_prediction)) +
  geom_point(alpha = 0.25) +
  geom_abline() +
  geom_smooth(method = "lm", formula = y ~ x) +
  ggtitle("Linear model versus Random Forest predictions") +
  facet_wrap(~Ecosystem_type, scales = "free")
Conclusions:
Linear model performance is less than half that of the Random Forest
Variable importance is similar
1. Predictive value of SPEI
Code
dat_base_all %>%
  select(-sqrt_Rh_annual, -sqrt_Rs_growingseason,
         -MODIS_NPP, -MODIS_GPP,
         -starts_with("SPEI24")) %>%
  filter(complete.cases(.)) ->
  dat_test_spei

# remove the "_y1" columns
yr1cols <- grep("_y1$", colnames(dat_test_spei))
dat_test_spei_current <- dat_test_spei[-yr1cols]

message("Testing Random Forest...")
Testing Random Forest...
Code
rf_spei <- ranger(sqrt_Rs_annual ~ ., data = dat_test_spei_current, importance = "impurity")
print(rf_spei)
Ranger result
Call:
ranger(sqrt_Rs_annual ~ ., data = dat_test_spei_current, importance = "impurity")
Type: Regression
Number of trees: 500
Sample size: 3158
Number of independent variables: 13
Mtry: 3
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 25.35987
R squared (OOB): 0.6625917
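The linear-model comparison below uses m_Rs_spei_reduced (a stepAIC-reduced model that includes SPEI terms) and two small helpers, lm_r2() and anova_p_twomodels(), all fit or defined in code not shown here. Plausible stand-ins for the two helpers, consistent with how they are used, might be:

# Assumed stand-ins for helpers defined elsewhere in the workflow
lm_r2 <- function(m) round(summary(m)$r.squared, 4)  # could equally be adj.r.squared

# p-value of an F-test comparing two nested linear models
anova_p_twomodels <- function(m1, m2) {
  signif(anova(m1, m2)[["Pr(>F)"]][2], 2)
}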
message("Model *with* SPEI data: training R2 = ", lm_r2(m_Rs_spei_reduced))
Model *with* SPEI data: training R2 = 0.2722
Code
message("Model *without* SPEI data: training R2 = ", lm_r2(m_Rs_reduced))
Model *without* SPEI data: training R2 = 0.2672
Code
message("ANOVA model comparison p-value: ", anova_p_twomodels(m_Rs_spei_reduced, m_Rs_reduced))
ANOVA model comparison p-value: 2e-04
Conclusions:
SPEI impact on Random Forest model is…unclear. Small performance improvement.
SPEI does have effects in the linear model: significant interactions with many variables, a significant difference between models with and without SPEI, and a modest increase in model R2.
2. Predictive value of previous-year SPEI
Code
# Use the dat_test_spei dataset from above
message("Testing Random Forest...")
Testing Random Forest...
Code
rf_spei_y1 <- ranger(sqrt_Rs_annual ~ ., data = dat_test_spei, importance = "impurity")
print(rf_spei_y1)
Ranger result
Call:
ranger(sqrt_Rs_annual ~ ., data = dat_test_spei, importance = "impurity")
Type: Regression
Number of trees: 500
Sample size: 3158
Number of independent variables: 14
Mtry: 3
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 25.79525
R squared (OOB): 0.656799
SPEI_y1 impact on Random Forest model is…unclear. No performance improvement.
SPEI_y1 has significant interactions with many variables in the linear model, and model R2 increases slightly.
Need to be careful to include null model testing.
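One way to do that, sketched below assuming the previous-year column is named SPEI12_y1: shuffle that column so any real signal is destroyed, refit, and see whether the full model's OOB R2 falls outside the null spread.

# Sketch only: permutation null for the previous-year SPEI predictor
set.seed(123)
null_r2 <- replicate(20, {
  x <- dat_test_spei
  x$SPEI12_y1 <- sample(x$SPEI12_y1)   # destroy any real SPEI_y1 signal
  ranger(sqrt_Rs_annual ~ ., data = x)$r.squared
})

message("Observed OOB R2: ", round(rf_spei_y1$r.squared, 3))
message("Null OOB R2 range: ", round(min(null_r2), 3), " to ", round(max(null_r2), 3))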
3. Value of previous-year SPEI after drought
For post-drought (PBE TRUE) observations, does the previous year's SPEI (SPEI_y1) have predictive value?
For this we compute a "potential Birch Effect" flag PBE: TRUE when (i) the current year's SPEI12 is > 0 (non-drought) and (ii) the previous year's SPEI was at or below a drought cutoff (0, -1, or -1.5; the spei_cutoffs tested below).
PBE (potential Birch effect, i.e. having a strong drought the year before a well-watered year) has a strong statistical effect on the temperature and moisture drivers of soil respiration.
But the effect is inconsistent: there is no clean 'Birch effect.'
PBE sensitivity test
Code
fit_and_test_lm <- function(x_train, x_val) {
  f <- as.formula(paste(colnames(x_train)[1],
                        "~ (Tair + Tsoil_lev1 + Tsoil_lev2 +",
                        "VWC_lev1 + I(VWC_lev1 ^ 2) + VWC_lev2 + I(VWC_lev2 ^ 2))",
                        "* PBE + Ecosystem_type + LAI_high * LAI_low +",
                        "Veg_type_hi + Veg_type_low + Soil_type_name"))
  m <- lm(formula = f, data = x_train)
  m_reduced <- MASS::stepAIC(m, trace = FALSE)

  if(!is.null(x_val)) {
    preds <- predict(m_reduced, newdata = x_val)
    obs <- pull(x_val[1])
    validation_r2 <- r2(preds = preds, obs = obs)
  } else {
    validation_r2 <- NA_real_
  }

  list(model = m_reduced,
       training_r2 = glance(m_reduced)$r.squared,
       validation_r2 = validation_r2,
       importance = tidy(m_reduced))
}

# These are the three most common fluxes recorded in SRDB
# iv_set <- c("Rs_annual") #, "Rh_annual", "Rs_growingseason")
iv_set <- c("sqrt_Rs_annual", "sqrt_Rh_annual", "sqrt_Rs_growingseason")

# The following SPEI values are from Table 1 of Keune et al. (2025)
# https://www.nature.com/articles/s41597-025-04896-y
# and correspond to "severely dry", "moderately dry", and "mildly dry"
# (there are no "extremely dry"=-2 in the dataset)
spei_cutoffs <- c(-1.5, -1.0, 0.0)

# What SPEI window should be used for the cutoffs above?
spei_windows <- c(12, 24) # months

# Model independent variables
predictors <- c("Ecosystem_type", "SPEI12",
                "Tair", "Precip",
                "Tsoil_lev1", "Tsoil_lev2",
                "VWC_lev1", "VWC_lev2",
                "Veg_type_hi", "Veg_type_low",
                "LAI_high", "LAI_low",
                "Soil_type_name")

do_full_analysis <- function(x, iv_set, predictors, spei_windows, spei_cutoffs) {
  rf_results <- list()
  lm_results <- list()
  SPEI_cols <- colnames(x)[grep("^SPEI", colnames(x))]

  for(iv in iv_set) {
    for(w in spei_windows) {
      spei_past_col <- paste0("SPEI", w, "_y1")
      for(spei_cut in spei_cutoffs) {
        # Remove other potential independent variables
        this_x <- x[c(iv, union(predictors, SPEI_cols))]
        # Complete cases only
        this_x <- this_x[complete.cases(this_x),]

        message("---------------------------------------------")
        message("iv = ", iv)
        message("spei_cut = ", spei_cut, " window = ", w)
        #message("\tn = ", nrow(this_x))

        # identify observations not currently in a drought but that
        # WERE in a drought the previous year
        # PBE = potential Birch effect
        this_x$PBE = this_x$SPEI12 > 0 & this_x[[spei_past_col]] <= spei_cut
        #message("\tPBE = ", sum(this_x$PBE))

        # Drop non-predictors entirely now
        this_x <- this_x[c(iv, predictors, "PBE")]

        # Fit models and save results
        rf_out <- fit_and_test_rf(this_x, NULL)
        rf_results[[paste(iv, w, spei_cut)]] <- tibble(model = "rf",
                                                       spei_window = w,
                                                       spei_cut = spei_cut,
                                                       iv = iv,
                                                       predictor = names(rf_out$importance),
                                                       n = nrow(this_x),
                                                       pbe_n = sum(this_x$PBE),
                                                       importance = rf_out$importance,
                                                       training_r2 = rf_out$training_r2)
        lm_out <- fit_and_test_lm(this_x, NULL)
        lm_results[[paste(iv, w, spei_cut)]] <- tibble(model = "lm",
                                                       spei_window = w,
                                                       spei_cut = spei_cut,
                                                       iv = iv,
                                                       predictor = lm_out$importance$term,
                                                       importance = lm_out$importance$p.value,
                                                       n = nrow(this_x),
                                                       pbe_n = sum(this_x$PBE),
                                                       training_r2 = lm_out$training_r2)
      }
    }
  }
  bind_rows(bind_rows(rf_results), bind_rows(lm_results))
}

out <- do_full_analysis(dat_filtered, iv_set, predictors,
                        spei_windows = c(12, 24),
                        spei_cutoffs = spei_cutoffs)

# Compute means across the SPEI windows, drought definitions, and i.v.'s
out %>%
  group_by(model, spei_window, spei_cut, iv, n, pbe_n) %>%
  summarise(r2 = round(mean(training_r2), 3), .groups = "drop") %>%
  mutate(spei_cut = as.factor(spei_cut),
         spei_window = paste(spei_window, "months")) ->
  out_smry

# Table: number (and % of total) of TRUE PBE entries; how big are the datasets?
out_smry %>%
  # these numbers are the same for each model type; pick one
  filter(model == "rf") %>%
  mutate(pbe_n = paste0(pbe_n, " (", round(pbe_n / n * 100, 0), "%)")) %>%
  select(spei_window, spei_cut, iv, pbe_n) %>%
  pivot_wider(names_from = "iv", values_from = "pbe_n") %>%
  knitr::kable(caption = "How often do 'PBE events' occur?", align = "r")
Conclusions:
PBE events (low SPEI the previous year, followed by non-drought conditions) occur in 0-20% of the data, depending on criteria.
Random Forest sensitivity results
Code
# Table: Training R2 values
out_smry %>%
  filter(model == "rf") %>%
  select(spei_window, spei_cut, iv, r2) %>%
  pivot_wider(names_from = "iv", values_from = "r2") %>%
  knitr::kable(caption = "RF training (OOB) R2")

# Table: rank of PBE variable
out %>%
  filter(model == "rf") %>%
  group_by(spei_window, spei_cut, iv) %>%
  mutate(rank = min_rank(desc(importance))) ->
  out_ranked

out_ranked %>%
  filter(predictor == "PBE") %>%
  select(spei_window, spei_cut, iv, rank) %>%
  pivot_wider(names_from = "iv", values_from = "rank") %>%
  knitr::kable(caption = "Rank of PBE variable in RF model")

out_ranked %>%
  group_by(iv, predictor) %>%
  summarise(rank = mean(rank)) %>%
  ggplot(aes(iv, predictor, fill = rank)) +
  geom_tile() +
  xlab("Independent variable") +
  ggtitle("Predictor ranks in RF model")
Conclusions:
The values used for SPEI window (12/24 months) and drought level (0, -1, -1.5) don’t make much difference.
RF model explains 52-65% of OOB variability
Precip and Tair are uniformly important; LAI and soil temp/VWC medium importance; PBE (drought effect from previous year) not important
Linear model sensitivity results
Code
# Table: Training R2 values
out_smry %>%
  filter(model == "lm") %>%
  select(spei_window, spei_cut, iv, r2) %>%
  pivot_wider(names_from = "iv", values_from = "r2") %>%
  knitr::kable(caption = "lm training adjusted R2")

# Rank of significant variables
# We're using the p-value for a crude approach
out %>%
  filter(model == "lm", importance < 0.05, grepl("PBE", predictor)) %>%
  group_by(iv, predictor) %>%
  mutate(n_signif = n()) %>%
  ggplot(aes(iv, predictor, fill = n_signif)) +
  geom_tile() +
  xlab("Independent variable") +
  ggtitle("PBE significance counts in linear model")

message("PBE is significant by itself in a few models:")

out %>%
  filter(predictor == "PBETRUE", model == "lm", importance < 0.05) %>%
  select(-model, -training_r2)
Conclusions:
The values used for SPEI window (12/24 months) and drought level (0, -1, -1.5) don’t make much difference.
The linear model explains 16-27% of training data variability.
PBE significantly interacts with Tair (esp. in predicting Rs_annual), Tsoil, and VWC.
Conditional inference Random Forest
Recall that the default splitting rule during random forest tree building consists of selecting, out of all splits of the (randomly selected mtry) candidate variables, the split that minimizes the Gini impurity (in the case of classification) or the SSE (in the case of regression). However, Strobl et al. (2007) illustrated that these default splitting rules favor the selection of features with many possible splits (e.g., continuous variables or categorical variables with many categories) over variables with fewer splits (the extreme case being binary variables, which have only one possible split). Conditional inference trees (Hothorn, Hornik, and Zeileis 2006) implement an alternative splitting mechanism that helps to reduce this variable selection bias. However, ensembling conditional inference trees has yet to be proven superior with regard to predictive accuracy, and they take a lot longer to train.
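No conditional inference forest is fit above; a minimal sketch using partykit::cforest on the same complete-cases Rs_annual dataset (illustrative only, and slow relative to ranger) might look like:

# Sketch only: conditional inference forest on the Rs_annual dataset
library(partykit)

cif <- cforest(sqrt_Rs_annual ~ ., data = dat_base_Rs_annual, ntree = 100)

# Out-of-bag predictions and R2 (r2() helper as defined above)
oob_preds <- predict(cif, OOB = TRUE)
message("OOB R2: ", round(r2(oob_preds, dat_base_Rs_annual$sqrt_Rs_annual), 3))

# Permutation-based variable importance (this can take a while)
sort(varimp(cif), decreasing = TRUE)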