Figure 1: Maps of study sites showing monitored and unmonitored lakes
Linear modelling was used to determine the amount of variation in TLI3 explained by FENZ and geochemistry predictor variables. Three separate multiple linear regression models were initially fitted using TLI3 as the response variable and the following predictors:
1. FENZ variables only.
2. Geochemistry variables only.
3. Significant FENZ predictor variables identified in 1, plus the geochemistry variables.
All predictor variables were centred and scaled before the analyses (by subtracting the overall mean from each observation and dividing the result by the overall standard deviation) to allow direct comparison of regression coefficients and inference about the relative sizes of effects. Predictor variables were transformed using the Yeo-Johnson transformation, a power transformation very similar to the Box-Cox transformation that also accepts zero and negative values. Missing values were imputed via bagging, by fitting a bagged tree model for each predictor as a function of all the others.
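The transformation and standardisation described above can be sketched as follows. This is a minimal illustration only, not the code used in the study: the predictor values and the value of lambda are hypothetical (lambda is normally estimated from the data), the order of operations (transform first, then centre and scale) is an assumption, and bagged-tree imputation is not shown.

```python
import math
import statistics

def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of a single value y with parameter lam."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)

def standardise(values):
    """Centre on the mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical predictor containing zero and negative values,
# which the Box-Cox transformation could not handle.
raw = [-2.0, -0.5, 0.0, 0.3, 1.2, 4.0, 9.5]
LAMBDA = 0.5  # assumed for illustration; normally estimated from the data

transformed = [yeo_johnson(v, LAMBDA) for v in raw]
scaled = standardise(transformed)
```

After these two steps the predictor has mean zero and unit standard deviation, so its regression coefficient is expressed per standard deviation of the transformed variable.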
Models were selected using a step-wise algorithm based on Akaike’s information criterion (AIC). The step-wise selection minimises model AIC to find the most parsimonious model, i.e. the one with the fewest predictors that does not compromise model performance. Models were validated by inspecting the residuals, and collinearity was checked using variance inflation factors.
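As an illustration of AIC-based selection and the collinearity check, the sketch below uses backward elimination on simulated data. The direction of the step-wise search is not stated in the text, so backward elimination is an assumption, as are the predictor names x1 to x3 and the simulated values. AIC is computed for a Gaussian linear model up to an additive constant, as n log(RSS/n) + 2k.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols_rss(columns, y):
    """Fit y ~ intercept + columns by least squares; return the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][a] * beta[a] for a in range(p))) ** 2 for i in range(n))

def aic(columns, y):
    """AIC of a Gaussian linear model, up to an additive constant."""
    n = len(y)
    k = len(columns) + 1  # slopes plus intercept
    return n * math.log(ols_rss(columns, y) / n) + 2 * k

def backward_stepwise(data, y):
    """Drop one predictor at a time while doing so lowers AIC."""
    selected = list(data)
    best = aic([data[v] for v in selected], y)
    while selected:
        candidates = [(aic([data[v] for v in selected if v != drop], y), drop)
                      for drop in selected]
        cand_aic, drop = min(candidates)
        if cand_aic >= best:
            break
        best = cand_aic
        selected.remove(drop)
    return selected, best

def vif(data, name):
    """Variance inflation factor: 1 / (1 - R^2) from regressing one predictor on the others."""
    target = data[name]
    mean = sum(target) / len(target)
    tss = sum((t - mean) ** 2 for t in target)
    r2 = 1.0 - ols_rss([data[v] for v in data if v != name], target) / tss
    return 1.0 / (1.0 - r2)

# Simulated data: x1 and x2 drive the response, x3 is pure noise.
random.seed(1)
n = 60
data = {name: [random.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2.0 * a - 1.5 * b + random.gauss(0, 0.5) for a, b in zip(data["x1"], data["x2"])]

selected, best_aic = backward_stepwise(data, y)
vifs = {name: vif(data, name) for name in data}
```

Here `selected` holds the predictors retained after elimination and `vifs` gives one variance inflation factor per predictor; values close to 1 indicate little collinearity.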
Results of the final linear models were presented as coefficient plots, which show regression coefficients graphically with 95% confidence intervals around the mean estimates. Coefficient plots show predictor significance (whether the confidence interval excludes zero), the degree of uncertainty (the width of the interval), the direction of the effect (whether it is positive or negative) and the effect magnitude, given by the absolute size of the coefficient. Coefficients whose intervals overlap the zero line, shown as a dotted vertical line, are unlikely to be significant.
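The quantities read off a coefficient plot can be tabulated as in the sketch below. The predictor names, estimates and standard errors are hypothetical placeholders, not results from this study, and the interval uses a normal approximation (about 1.96 standard errors) rather than the exact t quantile of a fitted linear model.

```python
from statistics import NormalDist

def coefficient_intervals(estimates, std_errors, level=0.95):
    """Return one row per coefficient with its interval and a zero-overlap flag."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    rows = []
    for name, est in estimates.items():
        half_width = z * std_errors[name]
        lower, upper = est - half_width, est + half_width
        rows.append({
            "name": name,
            "estimate": est,
            "lower": lower,
            "upper": upper,
            # Significant when the interval does not cover zero.
            "significant": lower > 0 or upper < 0,
        })
    # Largest absolute effect first, as coefficient plots are often ordered.
    return sorted(rows, key=lambda r: abs(r["estimate"]), reverse=True)

# Hypothetical standardised coefficients and standard errors.
estimates = {"predictor_a": 0.42, "predictor_b": -0.15, "predictor_c": -0.30}
std_errors = {"predictor_a": 0.10, "predictor_b": 0.12, "predictor_c": 0.08}

rows = coefficient_intervals(estimates, std_errors)
for r in rows:
    flag = "*" if r["significant"] else " "
    print(f'{r["name"]:<12} {r["estimate"]:+.2f} [{r["lower"]:+.2f}, {r["upper"]:+.2f}] {flag}')
```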
The performance of the final model with selected predictors was evaluated through five-fold cross-validation (Kuhn 2008). This procedure randomly partitions the data into five subsets. One subset at a time is then used for testing the model, while the remaining sets are used to build the model. This reduces the bias in performance estimation since the testing and training data sets are independent of each other. Cross-validation was repeated three times to capture variability in the model performance and the results were averaged.
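The repeated five-fold procedure can be sketched as follows. The single-predictor model and the simulated data are stand-ins for the final model and the lake data, and root-mean-square error is assumed as the performance metric, since the text does not name one.

```python
import math
import random

def repeated_kfold(n, k=5, repeats=3, seed=0):
    """Yield (repeat, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for rep in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for each repeat
        folds = [idx[i::k] for i in range(k)]
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield rep, train, test

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def cv_rmse(x, y, k=5, repeats=3, seed=0):
    """Average test-set RMSE over all folds and repeats, plus the per-fold scores."""
    scores = []
    for _, train, test in repeated_kfold(len(x), k, repeats, seed):
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        mse = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in test) / len(test)
        scores.append(math.sqrt(mse))
    return sum(scores) / len(scores), scores

# Simulated data standing in for one predictor and the response.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(50)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.5) for xi in x]

mean_rmse, scores = cv_rmse(x, y)
```

With five folds and three repeats, `scores` contains fifteen test-set errors; their spread indicates how much the performance estimate varies between partitions.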
TLI3 was predicted for unmonitored lakes using each of the three models (i.e. FENZ model, Geochem model and FENZ + Geochem model). Missing predictor values were imputed using bagging.
Predicted TLI3 was compared with spot TLI3 values calculated for unmonitored lakes as: