Introducing the problem

We need to select a subset of covariates that best predict our parameter values. Generally speaking, there are two classes of methods to do this: model-based approaches and ranking approaches.

An entire area of machine learning “feature selection” is built around combining these two approaches into what are called wrapper-based methods. The standard introduction is Guyon and Elisseeff (2003).

Model-based approaches can be computationally inefficient. Ranking approaches tend to select large subsets of redundant variables, and they do not necessarily test predictive power. The idea is that we can use each one to solve the problems of the other.

An example of this would be using tree-based feature importance to rank the candidate covariates, and then running forward stepwise selection in which the candidates are added to the model in that ranked order.

Wrapper-based methods are useful outside of pure ML applications…

An advantage of wrapper-based methods is that you can decouple the variable ranking from the model used to test the performance.

That is, you can use some hugely complicated, black-box model like xgboost to rank the covariates, but you can test the predictive power of a simpler model in a forward stepwise selection process.
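The decoupling described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in our analysis: absolute Pearson correlation stands in for the xgboost importance ranking, ordinary least squares is the simple inner model, and the names `rank_covariates`, `cv_rmse` and `forward_by_rank` and the synthetic data are hypothetical.

```python
import numpy as np

def rank_covariates(X, y):
    """Rank columns of X from most to least relevant.

    Assumption: absolute Pearson correlation is a stand-in for the
    xgboost feature importance mentioned in the text.
    """
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return [int(j) for j in np.argsort(scores)[::-1]]

def cv_rmse(X, y, cols, k=5):
    """k-fold cross-validated RMSE of a least squares fit on columns `cols`."""
    n = len(y)
    sq_err = 0.0
    for test in np.array_split(np.arange(n), k):
        train = np.setdiff1d(np.arange(n), test)
        A_train = np.column_stack([np.ones(len(train)), X[train][:, cols]])
        A_test = np.column_stack([np.ones(len(test)), X[test][:, cols]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        sq_err += np.sum((y[test] - A_test @ coef) ** 2)
    return float(np.sqrt(sq_err / n))

def forward_by_rank(X, y, k=5):
    """Add covariates in ranked order and record the RMSE after each step."""
    order = rank_covariates(X, y)
    path = [(order[:m], cv_rmse(X, y, order[:m], k))
            for m in range(1, len(order) + 1)]
    return order, path

# Synthetic data (hypothetical): only columns 0 and 3 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

order, path = forward_by_rank(X, y)
for cols, rmse in path:
    print(cols, round(rmse, 3))
```

The ranking function and the inner model are independent pieces here, so either could be swapped out (for example a boosted-tree ranking with a linear inner model) without touching the other.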

Wrapper-based ML methods in hydrology

1. Laimighofer et al. (2022)

Laimighofer et al. (2022) used three different covariate ranking procedures and stepwise selection on six different models to select a subset of covariates in low-flow models for ungauged catchments.

We implemented this with one variable ranking method (feature importance via xgboost) and one model for prediction (regression with xgboost). The issue is that using xgboost to rank the covariates leads to the selection of correlated (but not truly redundant) covariates, and there aren’t always obvious drops in the RMSE like the ones we see here.

2. & 3. Galelli and Castelletti (2013) and Alsahaf et al. (2022)

Galelli and Castelletti (2013) developed an algorithm that uses wrapper-based methods to incrementally build a covariate subset for hydrological modeling, using the top-ranked covariates individually evaluated in the inner model. An iterative process allows extra constraints to be placed on which covariates are used as candidates in the stepwise selection, for example choosing only non-correlated covariates. The algorithm also has an automatic stopping criterion and tends to select smaller subsets of covariates than Laimighofer et al. (2022).
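The general shape of such an iterative procedure can be sketched as follows. This is a loose illustration of the idea described above, not the published algorithm: correlation with the current residuals stands in for the tree-based ranking, least squares is the inner model, the scores are in-sample rather than cross-validated, and the function names and the parameters `top_p`, `max_corr` and `tol` are hypothetical.

```python
import numpy as np

def ols_predict(X_fit, y_fit, X_eval):
    """Fit least squares with an intercept and predict on X_eval."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_eval)), X_eval]) @ coef

def r2(y, pred):
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

def iterative_selection(X, y, top_p=3, max_corr=0.8, tol=0.01):
    selected, best_r2 = [], 0.0
    residual = y - y.mean()
    while len(selected) < X.shape[1]:
        # Constraint: skip candidates strongly correlated with a selected one.
        candidates = [
            j for j in range(X.shape[1])
            if j not in selected
            and all(abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) < max_corr
                    for s in selected)
        ]
        if not candidates:
            break
        # Rank candidates against the residuals of the current model.
        score = {j: abs(np.corrcoef(X[:, j], residual)[0, 1]) for j in candidates}
        top = sorted(candidates, key=score.get, reverse=True)[:top_p]
        # Evaluate each top-ranked candidate on its own in the inner model.
        # (With a linear ranker and a linear model this repeats the ranking;
        # with a tree-based ranker the two steps would differ.)
        single = {j: r2(residual, ols_predict(X[:, [j]], residual, X[:, [j]]))
                  for j in top}
        best = max(top, key=single.get)
        # Refit with the enlarged subset; stop when the gain is below tol.
        cols = selected + [best]
        pred = ols_predict(X[:, cols], y, X[:, cols])
        new_r2 = r2(y, pred)
        if new_r2 - best_r2 < tol:
            break
        selected, best_r2, residual = cols, new_r2, y - pred
    return selected, best_r2

# Synthetic data (hypothetical): column 2 is a near-copy of column 0.
rng = np.random.default_rng(1)
n = 300
x0, x1 = rng.normal(size=n), rng.normal(size=n)
x2 = x0 + 0.2 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2, rng.normal(size=n), rng.normal(size=n)])
y = 2.0 * x0 + x1 + 0.3 * rng.normal(size=n)

selected, final_r2 = iterative_selection(X, y)
print(selected, round(final_r2, 3))
```

In this toy setup the correlation constraint keeps only one of the two near-duplicate columns, and the tolerance stops the loop before any pure-noise column is added.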

Alsahaf et al. (2022) adapted the algorithm of Galelli and Castelletti (2013) to use xgboost (the xgboost paper was published in 2016).

What we did this summer

The covariate analysis we did this summer attempted to answer the question “is there any relationship at all between the covariates and the parameter at duration \(d\)?”

We went for maximum flexibility: xgboost to rank the covariates and xgboost as the inner “wrapped” model. We did this for each parameter at each duration.

We found that the GEV parameters are associated with different (often non-overlapping) sets of covariates.

An idea for a publication to support the regional model

Our regional model is a (i) latent Gaussian model for extremes with a (ii) nonstandard parameterization. We have talked about (iii) placing different functional forms on different parameters (i.e. linear relationships on \(\beta\) and \(\xi\) but a more complicated relationship on the median flood \(\eta\)), and if we add in our results from the summer, then each of these parameters would (iv) depend on a different set of covariates.

Implementing (ii) to (iv) is potentially a paper’s worth of work, but on top of this we aim to do all of this for the dGEV.

Idea: using the algorithms we have already implemented, run a small study where we do covariate selection using linear models to predict \(\beta\) and \(\xi\) and test several different linear / nonlinear models to predict \(\eta\). This can support our chosen covariate + GEV parameter relationships in our latent Gaussian model.

Additionally, we can repeat this study on each duration and have a small write-up of how these GEV parameters might change with duration. This part is a little outside our latent Gaussian model, since we probably won’t implement anything to do with it (i.e. we’ll have the same covariates operating on each duration in the big model), but it would be interesting.

References

Alsahaf, Ahmad, Nicolai Petkov, Vikram Shenoy, and George Azzopardi. 2022. “A Framework for Feature Selection Through Boosting.” Expert Systems with Applications 187: 115895.
Galelli, Stefano, and A Castelletti. 2013. “Tree-Based Iterative Input Variable Selection for Hydrological Modeling.” Water Resources Research 49 (7): 4295–4310.
Guyon, Isabelle, and André Elisseeff. 2003. “An Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3 (Mar): 1157–82.
Laimighofer, Johannes, et al. 2022. “Parsimonious Statistical Learning Models for Low-Flow Estimation.” Hydrology and Earth System Sciences 26 (1): 129–48.