Lake380 surface sediment geochemistry modelling

Author

javier.atalah@cawthron.org.nz

Published

September 19, 2022

Code

library(caret)
library(tidyverse, quietly = T)
library(janitor)
library(visreg)
library(ggpubr)
library(broom)
library(knitr)
library(lares)
library(ggfortify)
library(sjPlot)
library(rnaturalearth)

lawa_cols <- c("#20a7ad", "#85bb5b", "#ffa827", "#ff8400", "#e85129")
theme_set(theme_minimal())
source('functions/theme_javier.R')

# read data -----
pred_selected_num_t <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/select_predictor_numeric_transformed.csv',
    show_col_types = FALSE
  )

lm_dat <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/lm_dat_non_colinear.csv',
    show_col_types = FALSE
  )

lake_names <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/selected_metadata.csv',
    show_col_types = FALSE
  ) %>%
  select(Code, Lake) %>%
  clean_names()

metadata <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all_selected.csv',
    show_col_types = FALSE
  )

unmon_lakes <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/unmon_lakes.csv')

mon_lakes <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/mon_lakes.csv')

map_data <-
bind_rows(Monitored = mon_lakes, Unmonitored = unmon_lakes, .id = 'Monitored') %>%
  mutate(code = factor(code)) %>%
  left_join(metadata %>% dplyr::select(Code, Latitude, Longitude) %>% rename(code = "Code"),
            by = 'code')

1 Background

This notebook described the linear modelling step of the sediment geochemical data of Lakes380.

See the data exploration notebook for the details on initial data exploration and quality assurance following a standard protocol (Zuur et al., 2010).

For a details on the regression analysis see Zuur and Ieno 2016 A protocol for conducting and presenting results of regression‐type analyses.

2 Methods

2.1 Study sites

Code

tabyl(map_data, Monitored) %>% 
  kable(digits = 2, caption = "Number of lakes of monitored and unmonitored lakes")

Table 1: Number of minitored and unmonitored lakes
Monitored	n	percent
No	76	0.43
Yes	101	0.57

Code

nz_med <- ne_countries(country = 'new zealand', scale = "medium", returnclass = "sf")
map_fig <- 
ggplot(nz_med) +
  geom_sf() +
  coord_sf(
    xlim = c(166, 178.8),
    ylim = c(-47.35,-34.35),
    clip = 'on'
  ) +
  geom_point(data = map_data, aes(x = Longitude, y = Latitude, color = Monitored)) +
  scale_x_continuous(breaks = seq(166, 178, by = 4)) +
  scale_y_continuous(breaks = seq(-48,-36, by = 4)) +
  scale_color_discrete(name = NULL) +
  labs(x = "Longitude", y = "Latitude") +
  theme_bw(base_size = 16) +
  theme(
    legend.position = c(.8, .2),
    legend.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_blank()
  )



map_fig

Figure 1: Maps of study sites showing monitored and unmonitored lakes

Linear modelling was used to determine the amount of variation of TLI3 explained FENZ and geochemistry predictor variables. Three separate multiple regression linear models were initially fitted using TLI3 as response variables and the following predictors :

FENZ variable only.
Geochemistry variables only
Significant FENZ predictors variables identified in 1. plus the geochemistry variables.

All predictor variables were centred and scaled before the analyses (by subtracting the overall mean from each observation and dividing the result by the overall standard deviation), to allow direct comparison of regression coefficients and inference about the relative sizes of effects. Predictor variables were transformed to using the Yeo-Johnson transformation, which is very similar to the Box-Cox (i.e. a power transformation), but does allow negative values. Missing values were imputed via bagging by fitting a bagged tree model for each predictor (as a function of all the others).

Models were selected using a step-wise algorithm based on the Akaike’s information criterion (AIC). The step-wise selection minimise model AIC values to find the most parsimonious model, i.e. with the minimal number of predictors without affecting model performance. Models were validated by inspecting the residuals and co-linearity was check using variance inflation factor.

Results of final linear models was represented as coefficient plots that shows regression coefficients graphically, with 95% credible confidence intervals around mean estimates. Coefficient plots show predcitorsd significance (the confidence intervals do not cover zero), the degree of uncertainty (the width of the intervals), the direction of the effect (whether is a positive or a negative effect) and the effect magnitude given by the absolute size of the coefficient. Coefficients that overlap with the zero-line, shown as a dotted vertical line, are likely to be insignificant.

The performance of the final model with selected predictors was evaluated through five-fold cross-validation (Kuhn 2008). This procedure randomly partitions the data into five subsets. One subset at a time is then used for testing the model, while the remaining sets are used to build the model. This reduces the bias in performance estimation since the testing and training data sets are independent of each other. Cross-validation was repeated three times to capture variability in the model performance and the results were averaged.

TLI3 was predicted for unmonitored lakes using each of the three models (i.e. FENZ model, Geochem model and FENZ + Geochem model). Predictor variables missing values were imputed using bagging.

Predicted TLI3 was compared with spot TLI3 values calculated for unmonitored lakes as :

Spot.TLI.TP = 0.218 + 2.92 * log10(WCS.Bul.TP * 1000)

Spot.TLI.TN = -3.61 + 3.01 * log10(WCS.Bul.TN * 1000)

Spot.TLI.Chla = 2.22 + 2.54 * log10(WCS.Chla * 1000)

spot_TLI3 = (Spot.TLI.TP + Spot.TLI.TN + Spot.TLI.Chla) / 3

3 FENZ model

Code

pred_fenz <- 
  select(lm_dat, lake_area:sampling_depth) %>% 
  names() %>% 
  as_tibble()

The initial FENZ model was fitted with 10 predictors listed in Table 2

Code

pred_fenz %>% 
kable()

Table 2: List of predictor variables used in the initial FENZ linear model
value
lake_area
lk_elev
dec_solrad
sum_wind
cat_hard
cat_peat
cat_phos
cat_calc
cat_psize
sampling_depth

3.1 Final FENZ model summary

Code

set.seed(123)

trControl <- 
  trainControl(
  method = "repeatedcv",
  repeats = 3,
  number = 5,
  savePredictions = 'all'
)

lm_model_fenz <- caret::train(
  tli3 ~ .,
  data = select(lm_dat, lake_area:sampling_depth, tli3),
  method = "lmStepAIC",
  trControl = trControl,
  trace = F
)

tab_model(lm_model_fenz$finalModel)

Table 3: summary table of the FENZ linear model
	outcome
Predictors	Estimates	CI	p
(Intercept)	4.09	3.94 – 4.24	<0.001
lk elev	-0.48	-0.71 – -0.24	<0.001
dec solrad	0.57	0.38 – 0.75	<0.001
sum wind	-0.27	-0.43 – -0.10	0.002
cat hard	0.24	-0.04 – 0.52	0.094
cat phos	0.49	0.18 – 0.80	0.002
cat calc	0.47	0.29 – 0.65	<0.001
cat psize	-0.29	-0.58 – -0.00	0.047
sampling depth	-0.70	-0.88 – -0.51	<0.001
Observations	101
R² / R² adjusted	0.722 / 0.698

3.2 Observed versus predicted TLI3 values using the full dataset

Code

aug_lm_dat_fenz <- 
  augment(lm_model_fenz$finalModel, data = lm_dat) %>% 
  mutate(TLI3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) 


obs_pred_fenz <- 
ggplot(aug_lm_dat_fenz, aes(y = .fitted, x = tli3)) +
  geom_point(aes(color = TLI3_cat), alpha = 0.5, size = 3) +
  geom_abline(slope = 1,
              intercept = 0,
              lty = 2) +
  tune::coord_obs_pred() +
  stat_cor(aes(label = paste(..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~")),
           p.accuracy = 0.001,
           r.accuracy = 0.01) +
  labs(y = 'Predicted TLI3', x = 'Observed TLI3') +
  scale_color_manual(values = lawa_cols, name = "Predicter trophic level") +
  theme_minimal()

obs_pred_fenz

Figure 2: Observed versus predicted values for the FENZ model

4 Geochem model

Code

pred_geo <- 
  select(lm_dat, sampling_depth:rse_toc_res_p) %>% 
  names() %>% 
  as_tibble()

The initial Geochem model was fitted with 22 predictors listed in Table 4

Code

pred_geo %>% 
kable()

Table 4: List of predictor variables used in the initial Geochem linear model
value
sampling_depth
pf_total_p
pf_bd_p_percent
pf_na_oh_r_p_percent
pf_org_p_percent
pf_res_p_percent
ar_p_dw
dry_bd_dw
bul_s_om
bul_s_carbonates
bul_s_aluminium
bul_s_calcium
bul_s_lead
bul_s_copper
bul_s_cadmium
bul_s_zinc
gs_63
trr_mn_fe
tn_tp
rse_na_oh_al_bd_fe
rse_bd_fe_toc
rse_toc_res_p

4.1 Predictor variables correlation

Code

corr_cross(lm_dat %>% select(sampling_depth:rse_toc_res_p),
           max_pvalue = 0.1, 
           top = 30 
)

Figure 3: Ranked cross-correlations of predictor variables

4.2 Model summary

Code

set.seed(123)

trControl <- 
  trainControl(
  method = "repeatedcv",
  repeats = 3,
  number = 5,
  savePredictions = 'all'
)

lm_model_geo <- caret::train(
  tli3 ~ .,
  data = select(lm_dat, sampling_depth:tli3),
  method = "lmStepAIC",
  metric = "RMSE",
  scope = list(lower = ~ pf_total_p),
  trControl = trControl,
  trace = F
)

tab_model(lm_model_geo$finalModel)

Table 5: summary table of the Geochem linear model
	outcome
Predictors	Estimates	CI	p
(Intercept)	4.07	3.92 – 4.22	<0.001
sampling depth	-0.72	-0.91 – -0.53	<0.001
pf total p	0.16	-0.10 – 0.42	0.233
pf bd p percent	0.40	0.20 – 0.59	<0.001
pf res p percent	0.26	0.02 – 0.50	0.035
bul s om	0.32	-0.08 – 0.71	0.112
bul s calcium	-0.12	-0.29 – 0.05	0.152
bul s lead	-0.19	-0.40 – 0.02	0.081
bul s cadmium	0.35	0.17 – 0.53	<0.001
gs 63	0.46	0.12 – 0.81	0.010
trr mn fe	-0.13	-0.31 – 0.05	0.146
tn tp	-0.28	-0.57 – 0.02	0.066
rse bd fe toc	-0.32	-0.55 – -0.09	0.007
Observations	101
R² / R² adjusted	0.732 / 0.696

The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.

4.3 Coefficient plots

Code

plot_model(lm_model_geo$finalModel,title = "") +
  theme_minimal()

Figure 4: Coefficient plot of the Geochem linear model

4.4 Variable importance

Code

varImp(lm_model_geo$finalModel) %>% 
  arrange(desc(Overall)) %>% 
  kable(digits = 2)

Table 6: Variable importance fro the Geochem model
	Overall
sampling_depth	7.61
pf_bd_p_percent	4.03
bul_s_cadmium	3.85
rse_bd_fe_toc	2.76
gs_63	2.64
pf_res_p_percent	2.14
tn_tp	1.86
bul_s_lead	1.77
bul_s_om	1.61
trr_mn_fe	1.47
bul_s_calcium	1.44
pf_total_p	1.20

4.5 Model validation

Code

ggarrange(plotlist = plot_model(lm_model_geo$finalModel,type = 'diag'))

Figure 5: Diagnostic plots for the GEochem linear model

4.6 Observed versus predicted TLI3 values using the full dataset

Code

aug_lm_dat <- 
  augment(lm_model_geo$finalModel, data = lm_dat) %>% 
  mutate(TLI3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) 


obs_pred_geo <- 
  ggplot(aug_lm_dat,aes(y = .fitted, x = tli3)) +
  geom_point(aes(color = TLI3_cat), alpha = 0.5, size = 3) +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  tune::coord_obs_pred() +
    stat_cor(aes(label = paste(..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~")),
           p.accuracy = 0.001,
           r.accuracy = 0.01) +
  labs(y = 'Predicted TLI3', x = 'Observed TLI3') +
  scale_color_manual(values = lawa_cols, name = "Predicted trophic level") +
  theme_minimal()

obs_pred_geo

Figure 6: Observed versus predicted values for the Geochem linear model

4.7 Predicting TLI3 for unmonitored lakes

Code

metadata_lm_model <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all_selected.csv', show_col_types = FALSE) %>%
  clean_names() %>%
  select(code, all_of(predictors(lm_model_geo$finalModel)))


spot_tli3_dat <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all.csv', show_col_types = FALSE) %>%
  select(Code, WCS.Bul.TP, WCS.Bul.TN, WCS.Chla) %>%
  rename(code = "Code") %>%
  mutate(
    Spot.TLI.TP = 0.218 + 2.92 * log10(WCS.Bul.TP * 1000),
    Spot.TLI.TN = -3.61 + 3.01 * log10(WCS.Bul.TN * 1000),
    Spot.TLI.Chla = 2.22 + 2.54 * log10(WCS.Chla * 1000),
    spot_TLI3 = (Spot.TLI.TP +  Spot.TLI.TN + Spot.TLI.Chla) / 3
  )

geochem_unmon_dat <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all.csv',
    show_col_types = FALSE
  ) %>%
  rename(code = "Code") %>%
  filter(code %in% unmon_lakes$code) %>%
  left_join(metadata_lm_model, by = "code") %>%
  left_join(spot_tli3_dat %>% select(code, spot_TLI3), by = 'code') %>%
  clean_names() %>%
  select(code, spot_tli3, all_of(predictors(lm_model_geo$finalModel)))

# transform predictors
unmon_preds_para <-
  preProcess(geochem_unmon_dat %>% select(-code),
             method = c("center", "scale", "YeoJohnson",  "bagImpute"))
pred_t_unmon <- 
  predict(unmon_preds_para, geochem_unmon_dat)

unmon_preds <- 
  augment(lm_model_geo$finalModel, newdata = pred_t_unmon) %>% 
  select(-spot_tli3) %>% 
  left_join(geochem_unmon_dat %>% select(code, spot_tli3), by = 'code') %>% 
  mutate(pred_tli3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 10),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) %>%
  left_join(lake_names, by = 'code') 

pred_tli3_vs_spot_tli3_plot <-
  ggplot(unmon_preds, aes(y = .fitted, x = spot_tli3)) +
  
  geom_point(alpha = 0.5, aes(color = pred_tli3_cat), size = 3) +
  geom_abline(slope = 1,
              intercept = 0,
              lty = 2) +
  tune::coord_obs_pred() +
  # geom_smooth(method = lm) +
  stat_fit_glance(
    method = "lm",
    label.y = "top",
    label.x = "left",
    method.args = list(formula = y ~ x),
    mapping = aes(label = sprintf(
      'r^2~"="~%.3f~~italic(P)~"="~%.2g',
      stat(r.squared),
      stat(p.value)
    )),
    parse = TRUE,
    size = 3
  ) +
  labs(y = 'Predicted TLI3', x = 'Spot TLI3') +
  scale_color_manual(values = lawa_cols, name = "Trophic level") +
  theme_minimal()

pred_tli3_vs_spot_tli3_plot

Figure 7: Observed versus predicted TLI3 values for unmonitored lakes based on the the Geochem model

4.8 Cross-validation

Code

resam_preds <-
  lm_model_geo$pred %>%
  separate(Resample, c("Fold", "Rep")) %>%
  mutate(res = obs - pred) %>%
  left_join(.,
            lm_dat %>%
              rownames_to_column('rowIndex') %>%
              mutate(rowIndex = as.numeric(rowIndex)), by = 'rowIndex') %>% 
  as_tibble()  %>% 
  mutate(TLI3_cat = cut(
    tli3,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  ),
  code = as.character(code)) %>% 
  left_join(lake_names, by = 'code')


ggplot(resam_preds,aes(pred, obs)) +
  geom_point(alpha = 0.5, aes(color = TLI3_cat)) +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  facet_grid(Fold~Rep) +
  stat_cor(aes(label = paste(
    ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
  )),
  p.accuracy = 0.001,
  r.accuracy = 0.01, size = 3) +
  geom_abline(slope = 1, intercept = 0, lty = 2, color = 1)  +
  labs(x = "Observed TLI3", y = "Predicted TLI3", title = "Test dataset") +
  scale_color_manual(values = lawa_cols, name = "Trophic level") +
  tune::coord_obs_pred() +
  theme_minimal()

Figure 8: Observed versus predicted values for the Geochem model using 5-fold cross validation

5 FENZ + Geochem model

Code

pred_geo_plus <- 
  lm_dat %>% select(lake_area:sampling_depth, predictors(lm_model_geo$finalModel)) %>% 
  names() %>% 
  as_tibble()

The initial FENZ + Geochem model was fitted with 21 predictors listed in Table 7

Code

pred_geo_plus %>% 
kable()

Table 7: List of predictor variables used in the initial Geochem linear model
value
lake_area
lk_elev
dec_solrad
sum_wind
cat_hard
cat_peat
cat_phos
cat_calc
cat_psize
sampling_depth
pf_total_p
pf_bd_p_percent
pf_res_p_percent
bul_s_om
bul_s_calcium
bul_s_lead
bul_s_cadmium
gs_63
trr_mn_fe
tn_tp
rse_bd_fe_toc

5.1 Predictor variables correlation

Code

corr_cross(lm_dat %>% select(lake_area:sampling_depth, predictors(lm_model_geo$finalModel)),
           max_pvalue = 0.1, 
           top = 30 
)

Figure 9: Ranked cross-correlations of predictor variables

5.2 Model summary

Code

set.seed(123)

trControl <- 
  trainControl(
  method = "repeatedcv",
  repeats = 3,
  number = 5,
  savePredictions = 'all'
)

lm_model_geo_plus_fenz <- caret::train(
  tli3 ~ .,
  data = select(
    lm_dat,
    lake_area:sampling_depth,
    tli3,
    predictors(lm_model_geo$finalModel)
  ),
  method = "lmStepAIC",
  metric = "RMSE",
  scope = list(lower = ~ pf_total_p),
  trControl = trControl,
  trace = F
)

tab_model(lm_model_geo_plus_fenz$finalModel)

Table 8: summary table of the Geochem + FENZ linear model
	outcome
Predictors	Estimates	CI	p
(Intercept)	4.07	3.95 – 4.20	<0.001
lk elev	-0.37	-0.54 – -0.21	<0.001
dec solrad	0.29	0.13 – 0.46	0.001
sum wind	-0.15	-0.31 – 0.01	0.061
cat calc	0.24	0.08 – 0.40	0.003
sampling depth	-0.63	-0.80 – -0.47	<0.001
pf total p	0.19	0.02 – 0.36	0.032
pf bd p percent	0.31	0.15 – 0.47	<0.001
bul s cadmium	0.13	-0.02 – 0.28	0.088
gs 63	0.21	-0.01 – 0.43	0.063
trr mn fe	-0.10	-0.26 – 0.05	0.176
rse bd fe toc	-0.23	-0.42 – -0.05	0.015
Observations	101
R² / R² adjusted	0.805 / 0.780

The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.

5.3 Coefficient plots

Code

plot_model(lm_model_geo_plus_fenz$finalModel,title = "")

Figure 10: Coefficient plot of the Geochem + FENZ linear model

5.4 Variable importance

Code

varImp(lm_model_geo_plus_fenz$finalModel) %>% 
  arrange(desc(Overall)) %>% 
  kable(digits = 2)

Table 9: Variable importance fro the Geochem + FENZ model
	Overall
sampling_depth	7.68
lk_elev	4.51
pf_bd_p_percent	3.82
dec_solrad	3.54
cat_calc	3.04
rse_bd_fe_toc	2.49
pf_total_p	2.18
sum_wind	1.90
gs_63	1.88
bul_s_cadmium	1.73
trr_mn_fe	1.36

5.5 Model validation

Code

ggarrange(plotlist = plot_model(lm_model_geo_plus_fenz$finalModel,type = 'diag'))

Figure 11: Diagnostic plots for the Geochem + FENZ linear model

5.6 Observed versus predicted TLI3 values using the full dataset

Code

aug_lm_dat_geo_plus <- 
  augment(lm_model_geo_plus_fenz$finalModel, data = lm_dat) %>% 
  mutate(TLI3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) 


obs_pred_geo_plus <-
  ggplot(aug_lm_dat_geo_plus, aes(y = .fitted, x = tli3)) +
  geom_point(aes(color = TLI3_cat), alpha = 0.5, size = 3) +
  geom_abline(slope = 1,
              intercept = 0,
              lty = 2) +
  stat_cor(aes(label = paste(
    ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
  )),
  p.accuracy = 0.001,
  r.accuracy = 0.01) +
  labs(y = 'Predicted TLI3', x = 'Observed TLI3') +
  scale_color_manual(values = lawa_cols, name = "Predicted trophic level") +
  tune::coord_obs_pred() +
  theme_minimal()

obs_pred_geo_plus

Figure 12: Observed versus predicted values for the Geochem + FENZ linear model

5.7 Predicting TLI3 for unmonitored lakes

Code

metadata_lm_model <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all_selected.csv', show_col_types = FALSE) %>%
  clean_names() %>%
  select(code, all_of(predictors(lm_model_geo_plus_fenz$finalModel)))


spot_tli3_dat <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all.csv', show_col_types = FALSE) %>%
  dplyr::select(Code, WCS.Bul.TP, WCS.Bul.TN, WCS.Chla) %>%
  rename(code = "Code") %>%
  mutate(
    Spot.TLI.TP = 0.218 + 2.92 * log10(WCS.Bul.TP * 1000),
    Spot.TLI.TN = -3.61 + 3.01 * log10(WCS.Bul.TN * 1000),
    Spot.TLI.Chla = 2.22 + 2.54 * log10(WCS.Chla * 1000),
    spot_TLI3 = (Spot.TLI.TP +  Spot.TLI.TN + Spot.TLI.Chla) / 3
  )


geochem_unmon_dat <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all.csv', show_col_types = FALSE) %>% 
  rename(code = "Code") %>% 
  filter(code %in% unmon_lakes$code) %>% 
  left_join(metadata_lm_model, by = "code") %>% 
  left_join(spot_tli3_dat %>% select(code, spot_TLI3), by = 'code') %>% 
  clean_names() %>% 
  select(code, spot_tli3, all_of(predictors(lm_model_geo_plus_fenz$finalModel)))

# transform predictors
unmon_preds_para <-
  preProcess(geochem_unmon_dat %>% select(-code),
             method = c("center", "scale", "YeoJohnson",  "bagImpute"))
pred_t_unmon <- 
  predict(unmon_preds_para, geochem_unmon_dat)

unmon_preds <- 
  augment(lm_model_geo_plus_fenz$finalModel, newdata = pred_t_unmon) %>% 
  select(-spot_tli3) %>% 
  left_join(geochem_unmon_dat %>% select(code, spot_tli3), by = 'code') %>% 
  mutate(pred_tli3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) %>%
  left_join(lake_names, by = 'code')

pred_tli3_vs_spot_tli3_plot1 <-
  ggplot(unmon_preds, aes(y = .fitted, x = spot_tli3)) +
  geom_point(alpha = 0.5, aes(color = pred_tli3_cat), size = 3) +
  geom_abline(slope = 1,
              intercept = 0,
              lty = 2) +
 stat_cor(aes(label = paste(
    ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
  )),
  p.accuracy = 0.001,
  r.accuracy = 0.01) +
  tune::coord_obs_pred() +
  labs(y = 'Predicted TLI3', x = 'Spot TLI3') +
  scale_color_manual(values = lawa_cols, name = "Trophic level") +
  theme_minimal()

pred_tli3_vs_spot_tli3_plot1

Figure 13: Observed versus predicted TLI3 values for unmonitored lakes based on the the Geochem + FENZ model

5.8 Cross-validation

Code

resam_preds <-
  lm_model_geo_plus_fenz$pred %>%
  separate(Resample, c("Fold", "Rep")) %>%
  mutate(res = obs - pred) %>%
  left_join(.,
            lm_dat %>%
              rownames_to_column('rowIndex') %>%
              mutate(rowIndex = as.numeric(rowIndex)), by = 'rowIndex') %>% 
  as_tibble()  %>% 
  mutate(TLI3_cat = cut(
    tli3,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  ),
  code = as.character(code)) %>% 
  left_join(lake_names, by = 'code')


ggplot(resam_preds,aes(pred, obs)) +
  geom_point(alpha = 0.5, aes(color = TLI3_cat)) +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  facet_grid(Fold~Rep) +
   stat_cor(aes(label = paste(
    ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
  )),
  p.accuracy = 0.001,
  r.accuracy = 0.01, size = 3) +
  geom_abline(slope = 1, intercept = 0, lty = 2, color = 1)  +
  labs(x = "Observed TLI3", y = "Predicted TLI3", title = "Test dataset") +
  scale_color_manual(values = lawa_cols, name = "Trophic level") +
  tune::coord_obs_pred() +
  theme_minimal()

Figure 14: Observed versus predicted values for the Geochem + FENZ model using 5-fold cross validation

6 Comparing the two models and paper outputs

6.1 Model summaries

Code

tab_model(
  lm_model_geo$finalModel,
  lm_model_fenz$finalModel,
  lm_model_geo_plus_fenz$finalModel,
  dv.labels = c(
    "Geochem",
    "FENZ",
    "Geochem + FENZ"
  )
)

Table 10: summary table of the Geochem, FENZ, and Geochem + FENZ linear models
	Geochem			FENZ			Geochem + FENZ
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	4.07	3.92 – 4.22	<0.001	4.09	3.94 – 4.24	<0.001	4.07	3.95 – 4.20	<0.001
sampling depth	-0.72	-0.91 – -0.53	<0.001	-0.70	-0.88 – -0.51	<0.001	-0.63	-0.80 – -0.47	<0.001
pf total p	0.16	-0.10 – 0.42	0.233				0.19	0.02 – 0.36	0.032
pf bd p percent	0.40	0.20 – 0.59	<0.001				0.31	0.15 – 0.47	<0.001
pf res p percent	0.26	0.02 – 0.50	0.035
bul s om	0.32	-0.08 – 0.71	0.112
bul s calcium	-0.12	-0.29 – 0.05	0.152
bul s lead	-0.19	-0.40 – 0.02	0.081
bul s cadmium	0.35	0.17 – 0.53	<0.001				0.13	-0.02 – 0.28	0.088
gs 63	0.46	0.12 – 0.81	0.010				0.21	-0.01 – 0.43	0.063
trr mn fe	-0.13	-0.31 – 0.05	0.146				-0.10	-0.26 – 0.05	0.176
tn tp	-0.28	-0.57 – 0.02	0.066
rse bd fe toc	-0.32	-0.55 – -0.09	0.007				-0.23	-0.42 – -0.05	0.015
lk elev				-0.48	-0.71 – -0.24	<0.001	-0.37	-0.54 – -0.21	<0.001
dec solrad				0.57	0.38 – 0.75	<0.001	0.29	0.13 – 0.46	0.001
sum wind				-0.27	-0.43 – -0.10	0.002	-0.15	-0.31 – 0.01	0.061
cat hard				0.24	-0.04 – 0.52	0.094
cat phos				0.49	0.18 – 0.80	0.002
cat calc				0.47	0.29 – 0.65	<0.001	0.24	0.08 – 0.40	0.003
cat psize				-0.29	-0.58 – -0.00	0.047
Observations	101			101			101
R² / R² adjusted	0.732 / 0.696			0.722 / 0.698			0.805 / 0.780

6.2 Cross-validation summary stats

Code

mods <-
  resamples(
    list(
      model_geo = lm_model_geo,
      model_fenz = lm_model_fenz,
      geochem_plus_fenz = lm_model_geo_plus_fenz
    )
  )

cv_res <- summary(mods, metric = c('Rsquare', 'RMSE'))

cv_summary_table <- 
bind_rows(Rsq = as_tibble(cv_res$statistics$Rsquare), RMSE = as_tibble(cv_res$statistics$RMSE), .id = 'Metric') %>% 
  select(-`NA's`) %>% 
  mutate(Model = rep(c('Geochem', 'FENZ', 'Geochem + FENZ'),2), .before = everything()) %>%
  mutate_if(is.numeric, ~signif(., 2))
  
cv_summary_table %>% 
  kable()

Model	Metric	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Geochem	Rsq	0.47	0.55	0.63	0.62	0.66	0.80
FENZ	Rsq	0.53	0.62	0.67	0.67	0.73	0.77
Geochem + FENZ	Rsq	0.62	0.69	0.72	0.73	0.76	0.89
Geochem	RMSE	0.75	0.78	0.85	0.88	0.91	1.20
FENZ	RMSE	0.68	0.77	0.83	0.82	0.90	0.97
Geochem + FENZ	RMSE	0.52	0.68	0.72	0.75	0.84	0.98

Figure 15: Cross-validated R-square values for the Geochem and Geochem + FENZ linear models based on repeated 5-fold cross validation

Code

ggplot(mods, metric = "Rsquared") + 
  labs(y = "R-squared") +
  theme_minimal()

Figure 16: Cross-validated R-square values for the Geochem and Geochem + FENZ linear models based on repeated 5-fold cross validation

6.3 T- test to compare cross-valdation performance of the three models

Code

compare_models(lm_model_fenz, lm_model_geo_plus_fenz) %>%
  tidy() %>%
  mutate_if(is.numeric, ~ signif(., 2)) %>% 
  kable(caption = "FENZ vs Geo + FENZ")
compare_models(lm_model_geo, lm_model_geo_plus_fenz) %>% 
  tidy() %>% 
   mutate_if(is.numeric, ~ signif(., 2)) %>% 
  kable(caption = "GEO vs Geo + FENZ")
compare_models(lm_model_fenz, lm_model_geo) %>% 
  tidy() %>% 
   mutate_if(is.numeric, ~ signif(., 2)) %>% 
  kable(caption = "FENZ vs Geo")

Table 11: Results of t-test comparing the performace the Geochem and Geochem + FENZ linear models

(a) FENZ vs Geo + FENZ
estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
0.072	2.8	0.015	14	0.017	0.13	One Sample t-test	two.sided

(b) GEO vs Geo + FENZ
estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
0.12	5.2	0.00013	14	0.073	0.18	One Sample t-test	two.sided

(c) FENZ vs Geo
estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.052	-1.5	0.14	14	-0.12	0.02	One Sample t-test	two.sided

6.4 Summary statistics of final moedls

Code

bind_rows(Geo = lm_model_geo$finalModel %>% glance(),
FENZ = lm_model_fenz$finalModel %>% glance(),
`Geo + FENZ` = lm_model_geo_plus_fenz$finalModel %>% glance(), .id = "Model") %>% 
  mutate_if(is.numeric, ~signif(., 2)) %>% 
  kable()

Model	r.squared	adj.r.squared	sigma	statistic	df	logLik	AIC	BIC	deviance	df.residual	nobs
Geo	0.73	0.70	0.75	20	12	-110	240	280	50	88	100
FENZ	0.72	0.70	0.75	30	8	-110	240	260	52	92	100
Geo + FENZ	0.80	0.78	0.64	33	11	-92	210	240	36	89	100