Lake380 surface sediment geochemistry modelling

Author
Published

September 19, 2022

Code
library(caret)
library(tidyverse, quietly = T)
library(janitor)
library(visreg)
library(ggpubr)
library(broom)
library(knitr)
library(lares)
library(ggfortify)
library(sjPlot)
library(rnaturalearth)

lawa_cols <- c("#20a7ad", "#85bb5b", "#ffa827", "#ff8400", "#e85129")
theme_set(theme_minimal())
source('functions/theme_javier.R')

# read data -----
pred_selected_num_t <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/select_predictor_numeric_transformed.csv',
    show_col_types = FALSE
  )

lm_dat <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/lm_dat_non_colinear.csv',
    show_col_types = FALSE
  )

lake_names <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/selected_metadata.csv',
    show_col_types = FALSE
  ) %>%
  select(Code, Lake) %>%
  clean_names()

metadata <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all_selected.csv',
    show_col_types = FALSE
  )

unmon_lakes <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/unmon_lakes.csv')

mon_lakes <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/mon_lakes.csv')

map_data <-
bind_rows(Monitored = mon_lakes, Unmonitored = unmon_lakes, .id = 'Monitored') %>%
  mutate(code = factor(code)) %>%
  left_join(metadata %>% dplyr::select(Code, Latitude, Longitude) %>% rename(code = "Code"),
            by = 'code')

1 Background

This notebook described the linear modelling step of the sediment geochemical data of Lakes380.

See the data exploration notebook for the details on initial data exploration and quality assurance following a standard protocol (Zuur et al., 2010).

For a details on the regression analysis see Zuur and Ieno 2016 A protocol for conducting and presenting results of regression‐type analyses.

2 Methods

2.1 Study sites

Code
tabyl(map_data, Monitored) %>% 
  kable(digits = 2, caption = "Number of lakes of monitored and unmonitored lakes")
Table 1: Number of minitored and unmonitored lakes
Monitored n percent
No 76 0.43
Yes 101 0.57
Code
nz_med <- ne_countries(country = 'new zealand', scale = "medium", returnclass = "sf")
map_fig <- 
ggplot(nz_med) +
  geom_sf() +
  coord_sf(
    xlim = c(166, 178.8),
    ylim = c(-47.35,-34.35),
    clip = 'on'
  ) +
  geom_point(data = map_data, aes(x = Longitude, y = Latitude, color = Monitored)) +
  scale_x_continuous(breaks = seq(166, 178, by = 4)) +
  scale_y_continuous(breaks = seq(-48,-36, by = 4)) +
  scale_color_discrete(name = NULL) +
  labs(x = "Longitude", y = "Latitude") +
  theme_bw(base_size = 16) +
  theme(
    legend.position = c(.8, .2),
    legend.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.background = element_blank()
  )



map_fig

Figure 1: Maps of study sites showing monitored and unmonitored lakes

Linear modelling was used to determine the amount of variation of TLI3 explained FENZ and geochemistry predictor variables. Three separate multiple regression linear models were initially fitted using TLI3 as response variables and the following predictors :

  1. FENZ variable only.

  2. Geochemistry variables only

  3. Significant FENZ predictors variables identified in 1. plus the geochemistry variables.

All predictor variables were centred and scaled before the analyses (by subtracting the overall mean from each observation and dividing the result by the overall standard deviation), to allow direct comparison of regression coefficients and inference about the relative sizes of effects. Predictor variables were transformed to using the Yeo-Johnson transformation, which is very similar to the Box-Cox (i.e. a power transformation), but does allow negative values. Missing values were imputed via bagging by fitting a bagged tree model for each predictor (as a function of all the others).

Models were selected using a step-wise algorithm based on the Akaike’s information criterion (AIC). The step-wise selection minimise model AIC values to find the most parsimonious model, i.e. with the minimal number of predictors without affecting model performance. Models were validated by inspecting the residuals and co-linearity was check using variance inflation factor.

Results of final linear models was represented as coefficient plots that shows regression coefficients graphically, with 95% credible confidence intervals around mean estimates. Coefficient plots show predcitorsd significance (the confidence intervals do not cover zero), the degree of uncertainty (the width of the intervals), the direction of the effect (whether is a positive or a negative effect) and the effect magnitude given by the absolute size of the coefficient. Coefficients that overlap with the zero-line, shown as a dotted vertical line, are likely to be insignificant.

The performance of the final model with selected predictors was evaluated through five-fold cross-validation (Kuhn 2008). This procedure randomly partitions the data into five subsets. One subset at a time is then used for testing the model, while the remaining sets are used to build the model. This reduces the bias in performance estimation since the testing and training data sets are independent of each other. Cross-validation was repeated three times to capture variability in the model performance and the results were averaged.

TLI3 was predicted for unmonitored lakes using each of the three models (i.e. FENZ model, Geochem model and FENZ + Geochem model). Predictor variables missing values were imputed using bagging.

Predicted TLI3 was compared with spot TLI3 values calculated for unmonitored lakes as :

Spot.TLI.TP = 0.218 + 2.92 * log10(WCS.Bul.TP * 1000)

Spot.TLI.TN = -3.61 + 3.01 * log10(WCS.Bul.TN * 1000)

Spot.TLI.Chla = 2.22 + 2.54 * log10(WCS.Chla * 1000)

spot_TLI3 = (Spot.TLI.TP + Spot.TLI.TN + Spot.TLI.Chla) / 3

3 FENZ model

Code
pred_fenz <- 
  select(lm_dat, lake_area:sampling_depth) %>% 
  names() %>% 
  as_tibble() 

The initial FENZ model was fitted with 10 predictors listed in Table 2

Code
pred_fenz %>% 
kable()
Table 2: List of predictor variables used in the initial FENZ linear model
value
lake_area
lk_elev
dec_solrad
sum_wind
cat_hard
cat_peat
cat_phos
cat_calc
cat_psize
sampling_depth

3.1 Final FENZ model summary

Code
set.seed(123)

trControl <- 
  trainControl(
  method = "repeatedcv",
  repeats = 3,
  number = 5,
  savePredictions = 'all'
)

lm_model_fenz <- caret::train(
  tli3 ~ .,
  data = select(lm_dat, lake_area:sampling_depth, tli3),
  method = "lmStepAIC",
  trControl = trControl,
  trace = F
)

tab_model(lm_model_fenz$finalModel)
Table 3: summary table of the FENZ linear model
  outcome
Predictors Estimates CI p
(Intercept) 4.09 3.94 – 4.24 <0.001
lk elev -0.48 -0.71 – -0.24 <0.001
dec solrad 0.57 0.38 – 0.75 <0.001
sum wind -0.27 -0.43 – -0.10 0.002
cat hard 0.24 -0.04 – 0.52 0.094
cat phos 0.49 0.18 – 0.80 0.002
cat calc 0.47 0.29 – 0.65 <0.001
cat psize -0.29 -0.58 – -0.00 0.047
sampling depth -0.70 -0.88 – -0.51 <0.001
Observations 101
R2 / R2 adjusted 0.722 / 0.698

3.2 Observed versus predicted TLI3 values using the full dataset

Code
aug_lm_dat_fenz <- 
  augment(lm_model_fenz$finalModel, data = lm_dat) %>% 
  mutate(TLI3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) 


obs_pred_fenz <- 
ggplot(aug_lm_dat_fenz, aes(y = .fitted, x = tli3)) +
  geom_point(aes(color = TLI3_cat), alpha = 0.5, size = 3) +
  geom_abline(slope = 1,
              intercept = 0,
              lty = 2) +
  tune::coord_obs_pred() +
  stat_cor(aes(label = paste(..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~")),
           p.accuracy = 0.001,
           r.accuracy = 0.01) +
  labs(y = 'Predicted TLI3', x = 'Observed TLI3') +
  scale_color_manual(values = lawa_cols, name = "Predicter trophic level") +
  theme_minimal()

obs_pred_fenz

Figure 2: Observed versus predicted values for the FENZ model

4 Geochem model

Code
pred_geo <- 
  select(lm_dat, sampling_depth:rse_toc_res_p) %>% 
  names() %>% 
  as_tibble() 

The initial Geochem model was fitted with 22 predictors listed in Table 4

Code
pred_geo %>% 
kable()
Table 4: List of predictor variables used in the initial Geochem linear model
value
sampling_depth
pf_total_p
pf_bd_p_percent
pf_na_oh_r_p_percent
pf_org_p_percent
pf_res_p_percent
ar_p_dw
dry_bd_dw
bul_s_om
bul_s_carbonates
bul_s_aluminium
bul_s_calcium
bul_s_lead
bul_s_copper
bul_s_cadmium
bul_s_zinc
gs_63
trr_mn_fe
tn_tp
rse_na_oh_al_bd_fe
rse_bd_fe_toc
rse_toc_res_p

4.1 Predictor variables correlation

Code
corr_cross(lm_dat %>% select(sampling_depth:rse_toc_res_p),
           max_pvalue = 0.1, 
           top = 30 
)

Figure 3: Ranked cross-correlations of predictor variables

4.2 Model summary

Code
set.seed(123)

trControl <- 
  trainControl(
  method = "repeatedcv",
  repeats = 3,
  number = 5,
  savePredictions = 'all'
)

lm_model_geo <- caret::train(
  tli3 ~ .,
  data = select(lm_dat, sampling_depth:tli3),
  method = "lmStepAIC",
  metric = "RMSE",
  scope = list(lower = ~ pf_total_p),
  trControl = trControl,
  trace = F
)

tab_model(lm_model_geo$finalModel)
Table 5: summary table of the Geochem linear model
  outcome
Predictors Estimates CI p
(Intercept) 4.07 3.92 – 4.22 <0.001
sampling depth -0.72 -0.91 – -0.53 <0.001
pf total p 0.16 -0.10 – 0.42 0.233
pf bd p percent 0.40 0.20 – 0.59 <0.001
pf res p percent 0.26 0.02 – 0.50 0.035
bul s om 0.32 -0.08 – 0.71 0.112
bul s calcium -0.12 -0.29 – 0.05 0.152
bul s lead -0.19 -0.40 – 0.02 0.081
bul s cadmium 0.35 0.17 – 0.53 <0.001
gs 63 0.46 0.12 – 0.81 0.010
trr mn fe -0.13 -0.31 – 0.05 0.146
tn tp -0.28 -0.57 – 0.02 0.066
rse bd fe toc -0.32 -0.55 – -0.09 0.007
Observations 101
R2 / R2 adjusted 0.732 / 0.696

The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.

4.3 Coefficient plots

Code
plot_model(lm_model_geo$finalModel,title = "") +
  theme_minimal()

Figure 4: Coefficient plot of the Geochem linear model

4.4 Variable importance

Code
varImp(lm_model_geo$finalModel) %>% 
  arrange(desc(Overall)) %>% 
  kable(digits = 2)
Table 6: Variable importance fro the Geochem model
Overall
sampling_depth 7.61
pf_bd_p_percent 4.03
bul_s_cadmium 3.85
rse_bd_fe_toc 2.76
gs_63 2.64
pf_res_p_percent 2.14
tn_tp 1.86
bul_s_lead 1.77
bul_s_om 1.61
trr_mn_fe 1.47
bul_s_calcium 1.44
pf_total_p 1.20

4.5 Model validation

Code
ggarrange(plotlist = plot_model(lm_model_geo$finalModel,type = 'diag'))

Figure 5: Diagnostic plots for the GEochem linear model

4.6 Observed versus predicted TLI3 values using the full dataset

Code
aug_lm_dat <- 
  augment(lm_model_geo$finalModel, data = lm_dat) %>% 
  mutate(TLI3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) 


obs_pred_geo <- 
  ggplot(aug_lm_dat,aes(y = .fitted, x = tli3)) +
  geom_point(aes(color = TLI3_cat), alpha = 0.5, size = 3) +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  tune::coord_obs_pred() +
    stat_cor(aes(label = paste(..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~")),
           p.accuracy = 0.001,
           r.accuracy = 0.01) +
  labs(y = 'Predicted TLI3', x = 'Observed TLI3') +
  scale_color_manual(values = lawa_cols, name = "Predicted trophic level") +
  theme_minimal()

obs_pred_geo

Figure 6: Observed versus predicted values for the Geochem linear model

4.7 Predicting TLI3 for unmonitored lakes

Code
metadata_lm_model <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all_selected.csv', show_col_types = FALSE) %>%
  clean_names() %>%
  select(code, all_of(predictors(lm_model_geo$finalModel)))


spot_tli3_dat <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all.csv', show_col_types = FALSE) %>%
  select(Code, WCS.Bul.TP, WCS.Bul.TN, WCS.Chla) %>%
  rename(code = "Code") %>%
  mutate(
    Spot.TLI.TP = 0.218 + 2.92 * log10(WCS.Bul.TP * 1000),
    Spot.TLI.TN = -3.61 + 3.01 * log10(WCS.Bul.TN * 1000),
    Spot.TLI.Chla = 2.22 + 2.54 * log10(WCS.Chla * 1000),
    spot_TLI3 = (Spot.TLI.TP +  Spot.TLI.TN + Spot.TLI.Chla) / 3
  )

geochem_unmon_dat <-
  read_csv(
    'C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all.csv',
    show_col_types = FALSE
  ) %>%
  rename(code = "Code") %>%
  filter(code %in% unmon_lakes$code) %>%
  left_join(metadata_lm_model, by = "code") %>%
  left_join(spot_tli3_dat %>% select(code, spot_TLI3), by = 'code') %>%
  clean_names() %>%
  select(code, spot_tli3, all_of(predictors(lm_model_geo$finalModel)))

# transform predictors
unmon_preds_para <-
  preProcess(geochem_unmon_dat %>% select(-code),
             method = c("center", "scale", "YeoJohnson",  "bagImpute"))
pred_t_unmon <- 
  predict(unmon_preds_para, geochem_unmon_dat)

unmon_preds <- 
  augment(lm_model_geo$finalModel, newdata = pred_t_unmon) %>% 
  select(-spot_tli3) %>% 
  left_join(geochem_unmon_dat %>% select(code, spot_tli3), by = 'code') %>% 
  mutate(pred_tli3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 10),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) %>%
  left_join(lake_names, by = 'code') 

pred_tli3_vs_spot_tli3_plot <-
  ggplot(unmon_preds, aes(y = .fitted, x = spot_tli3)) +
  
  geom_point(alpha = 0.5, aes(color = pred_tli3_cat), size = 3) +
  geom_abline(slope = 1,
              intercept = 0,
              lty = 2) +
  tune::coord_obs_pred() +
  # geom_smooth(method = lm) +
  stat_fit_glance(
    method = "lm",
    label.y = "top",
    label.x = "left",
    method.args = list(formula = y ~ x),
    mapping = aes(label = sprintf(
      'r^2~"="~%.3f~~italic(P)~"="~%.2g',
      stat(r.squared),
      stat(p.value)
    )),
    parse = TRUE,
    size = 3
  ) +
  labs(y = 'Predicted TLI3', x = 'Spot TLI3') +
  scale_color_manual(values = lawa_cols, name = "Trophic level") +
  theme_minimal()

pred_tli3_vs_spot_tli3_plot

Figure 7: Observed versus predicted TLI3 values for unmonitored lakes based on the the Geochem model

4.8 Cross-validation

Code
resam_preds <-
  lm_model_geo$pred %>%
  separate(Resample, c("Fold", "Rep")) %>%
  mutate(res = obs - pred) %>%
  left_join(.,
            lm_dat %>%
              rownames_to_column('rowIndex') %>%
              mutate(rowIndex = as.numeric(rowIndex)), by = 'rowIndex') %>% 
  as_tibble()  %>% 
  mutate(TLI3_cat = cut(
    tli3,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  ),
  code = as.character(code)) %>% 
  left_join(lake_names, by = 'code')


ggplot(resam_preds,aes(pred, obs)) +
  geom_point(alpha = 0.5, aes(color = TLI3_cat)) +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  facet_grid(Fold~Rep) +
  stat_cor(aes(label = paste(
    ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
  )),
  p.accuracy = 0.001,
  r.accuracy = 0.01, size = 3) +
  geom_abline(slope = 1, intercept = 0, lty = 2, color = 1)  +
  labs(x = "Observed TLI3", y = "Predicted TLI3", title = "Test dataset") +
  scale_color_manual(values = lawa_cols, name = "Trophic level") +
  tune::coord_obs_pred() +
  theme_minimal()

Figure 8: Observed versus predicted values for the Geochem model using 5-fold cross validation

5 FENZ + Geochem model

Code
pred_geo_plus <- 
  lm_dat %>% select(lake_area:sampling_depth, predictors(lm_model_geo$finalModel)) %>% 
  names() %>% 
  as_tibble() 

The initial FENZ + Geochem model was fitted with 21 predictors listed in Table 7

Code
pred_geo_plus %>% 
kable()
Table 7: List of predictor variables used in the initial Geochem linear model
value
lake_area
lk_elev
dec_solrad
sum_wind
cat_hard
cat_peat
cat_phos
cat_calc
cat_psize
sampling_depth
pf_total_p
pf_bd_p_percent
pf_res_p_percent
bul_s_om
bul_s_calcium
bul_s_lead
bul_s_cadmium
gs_63
trr_mn_fe
tn_tp
rse_bd_fe_toc

5.1 Predictor variables correlation

Code
corr_cross(lm_dat %>% select(lake_area:sampling_depth, predictors(lm_model_geo$finalModel)),
           max_pvalue = 0.1, 
           top = 30 
)

Figure 9: Ranked cross-correlations of predictor variables

5.2 Model summary

Code
set.seed(123)

trControl <- 
  trainControl(
  method = "repeatedcv",
  repeats = 3,
  number = 5,
  savePredictions = 'all'
)

lm_model_geo_plus_fenz <- caret::train(
  tli3 ~ .,
  data = select(
    lm_dat,
    lake_area:sampling_depth,
    tli3,
    predictors(lm_model_geo$finalModel)
  ),
  method = "lmStepAIC",
  metric = "RMSE",
  scope = list(lower = ~ pf_total_p),
  trControl = trControl,
  trace = F
)

tab_model(lm_model_geo_plus_fenz$finalModel)
Table 8: summary table of the Geochem + FENZ linear model
  outcome
Predictors Estimates CI p
(Intercept) 4.07 3.95 – 4.20 <0.001
lk elev -0.37 -0.54 – -0.21 <0.001
dec solrad 0.29 0.13 – 0.46 0.001
sum wind -0.15 -0.31 – 0.01 0.061
cat calc 0.24 0.08 – 0.40 0.003
sampling depth -0.63 -0.80 – -0.47 <0.001
pf total p 0.19 0.02 – 0.36 0.032
pf bd p percent 0.31 0.15 – 0.47 <0.001
bul s cadmium 0.13 -0.02 – 0.28 0.088
gs 63 0.21 -0.01 – 0.43 0.063
trr mn fe -0.10 -0.26 – 0.05 0.176
rse bd fe toc -0.23 -0.42 – -0.05 0.015
Observations 101
R2 / R2 adjusted 0.805 / 0.780

The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.

5.3 Coefficient plots

Code
plot_model(lm_model_geo_plus_fenz$finalModel,title = "")

Figure 10: Coefficient plot of the Geochem + FENZ linear model

5.4 Variable importance

Code
varImp(lm_model_geo_plus_fenz$finalModel) %>% 
  arrange(desc(Overall)) %>% 
  kable(digits = 2)
Table 9: Variable importance fro the Geochem + FENZ model
Overall
sampling_depth 7.68
lk_elev 4.51
pf_bd_p_percent 3.82
dec_solrad 3.54
cat_calc 3.04
rse_bd_fe_toc 2.49
pf_total_p 2.18
sum_wind 1.90
gs_63 1.88
bul_s_cadmium 1.73
trr_mn_fe 1.36

5.5 Model validation

Code
ggarrange(plotlist = plot_model(lm_model_geo_plus_fenz$finalModel,type = 'diag'))

Figure 11: Diagnostic plots for the Geochem + FENZ linear model

5.6 Observed versus predicted TLI3 values using the full dataset

Code
aug_lm_dat_geo_plus <- 
  augment(lm_model_geo_plus_fenz$finalModel, data = lm_dat) %>% 
  mutate(TLI3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) 


obs_pred_geo_plus <-
  ggplot(aug_lm_dat_geo_plus, aes(y = .fitted, x = tli3)) +
  geom_point(aes(color = TLI3_cat), alpha = 0.5, size = 3) +
  geom_abline(slope = 1,
              intercept = 0,
              lty = 2) +
  stat_cor(aes(label = paste(
    ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
  )),
  p.accuracy = 0.001,
  r.accuracy = 0.01) +
  labs(y = 'Predicted TLI3', x = 'Observed TLI3') +
  scale_color_manual(values = lawa_cols, name = "Predicted trophic level") +
  tune::coord_obs_pred() +
  theme_minimal()

obs_pred_geo_plus

Figure 12: Observed versus predicted values for the Geochem + FENZ linear model

5.7 Predicting TLI3 for unmonitored lakes

Code
metadata_lm_model <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all_selected.csv', show_col_types = FALSE) %>%
  clean_names() %>%
  select(code, all_of(predictors(lm_model_geo_plus_fenz$finalModel)))


spot_tli3_dat <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all.csv', show_col_types = FALSE) %>%
  dplyr::select(Code, WCS.Bul.TP, WCS.Bul.TN, WCS.Chla) %>%
  rename(code = "Code") %>%
  mutate(
    Spot.TLI.TP = 0.218 + 2.92 * log10(WCS.Bul.TP * 1000),
    Spot.TLI.TN = -3.61 + 3.01 * log10(WCS.Bul.TN * 1000),
    Spot.TLI.Chla = 2.22 + 2.54 * log10(WCS.Chla * 1000),
    spot_TLI3 = (Spot.TLI.TP +  Spot.TLI.TN + Spot.TLI.Chla) / 3
  )


geochem_unmon_dat <- 
  read_csv('C:/Users/javiera/OneDrive - Cawthron/Critical step 1-1-3/Data Analysis/sediment_geochem/clean_data/geochem_data_all.csv', show_col_types = FALSE) %>% 
  rename(code = "Code") %>% 
  filter(code %in% unmon_lakes$code) %>% 
  left_join(metadata_lm_model, by = "code") %>% 
  left_join(spot_tli3_dat %>% select(code, spot_TLI3), by = 'code') %>% 
  clean_names() %>% 
  select(code, spot_tli3, all_of(predictors(lm_model_geo_plus_fenz$finalModel)))

# transform predictors
unmon_preds_para <-
  preProcess(geochem_unmon_dat %>% select(-code),
             method = c("center", "scale", "YeoJohnson",  "bagImpute"))
pred_t_unmon <- 
  predict(unmon_preds_para, geochem_unmon_dat)

unmon_preds <- 
  augment(lm_model_geo_plus_fenz$finalModel, newdata = pred_t_unmon) %>% 
  select(-spot_tli3) %>% 
  left_join(geochem_unmon_dat %>% select(code, spot_tli3), by = 'code') %>% 
  mutate(pred_tli3_cat = cut(
    .fitted,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  )) %>%
  left_join(lake_names, by = 'code')

pred_tli3_vs_spot_tli3_plot1 <-
  ggplot(unmon_preds, aes(y = .fitted, x = spot_tli3)) +
  geom_point(alpha = 0.5, aes(color = pred_tli3_cat), size = 3) +
  geom_abline(slope = 1,
              intercept = 0,
              lty = 2) +
 stat_cor(aes(label = paste(
    ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
  )),
  p.accuracy = 0.001,
  r.accuracy = 0.01) +
  tune::coord_obs_pred() +
  labs(y = 'Predicted TLI3', x = 'Spot TLI3') +
  scale_color_manual(values = lawa_cols, name = "Trophic level") +
  theme_minimal()

pred_tli3_vs_spot_tli3_plot1

Figure 13: Observed versus predicted TLI3 values for unmonitored lakes based on the the Geochem + FENZ model

5.8 Cross-validation

Code
resam_preds <-
  lm_model_geo_plus_fenz$pred %>%
  separate(Resample, c("Fold", "Rep")) %>%
  mutate(res = obs - pred) %>%
  left_join(.,
            lm_dat %>%
              rownames_to_column('rowIndex') %>%
              mutate(rowIndex = as.numeric(rowIndex)), by = 'rowIndex') %>% 
  as_tibble()  %>% 
  mutate(TLI3_cat = cut(
    tli3,
    breaks = c(0, 2, 3, 4, 5, 8),
    labels = c(
      "Microtrophic",
      "Oligotrophic",
      "Mesotrophic",
      "Eutrophic",
      "Hypertrophic"
    )
  ),
  code = as.character(code)) %>% 
  left_join(lake_names, by = 'code')


ggplot(resam_preds,aes(pred, obs)) +
  geom_point(alpha = 0.5, aes(color = TLI3_cat)) +
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  facet_grid(Fold~Rep) +
   stat_cor(aes(label = paste(
    ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
  )),
  p.accuracy = 0.001,
  r.accuracy = 0.01, size = 3) +
  geom_abline(slope = 1, intercept = 0, lty = 2, color = 1)  +
  labs(x = "Observed TLI3", y = "Predicted TLI3", title = "Test dataset") +
  scale_color_manual(values = lawa_cols, name = "Trophic level") +
  tune::coord_obs_pred() +
  theme_minimal()

Figure 14: Observed versus predicted values for the Geochem + FENZ model using 5-fold cross validation

6 Comparing the two models and paper outputs

6.1 Model summaries

Code
tab_model(
  lm_model_geo$finalModel,
  lm_model_fenz$finalModel,
  lm_model_geo_plus_fenz$finalModel,
  dv.labels = c(
    "Geochem",
    "FENZ",
    "Geochem + FENZ"
  )
)
Table 10: summary table of the Geochem, FENZ, and Geochem + FENZ linear models
  Geochem FENZ Geochem + FENZ
Predictors Estimates CI p Estimates CI p Estimates CI p
(Intercept) 4.07 3.92 – 4.22 <0.001 4.09 3.94 – 4.24 <0.001 4.07 3.95 – 4.20 <0.001
sampling depth -0.72 -0.91 – -0.53 <0.001 -0.70 -0.88 – -0.51 <0.001 -0.63 -0.80 – -0.47 <0.001
pf total p 0.16 -0.10 – 0.42 0.233 0.19 0.02 – 0.36 0.032
pf bd p percent 0.40 0.20 – 0.59 <0.001 0.31 0.15 – 0.47 <0.001
pf res p percent 0.26 0.02 – 0.50 0.035
bul s om 0.32 -0.08 – 0.71 0.112
bul s calcium -0.12 -0.29 – 0.05 0.152
bul s lead -0.19 -0.40 – 0.02 0.081
bul s cadmium 0.35 0.17 – 0.53 <0.001 0.13 -0.02 – 0.28 0.088
gs 63 0.46 0.12 – 0.81 0.010 0.21 -0.01 – 0.43 0.063
trr mn fe -0.13 -0.31 – 0.05 0.146 -0.10 -0.26 – 0.05 0.176
tn tp -0.28 -0.57 – 0.02 0.066
rse bd fe toc -0.32 -0.55 – -0.09 0.007 -0.23 -0.42 – -0.05 0.015
lk elev -0.48 -0.71 – -0.24 <0.001 -0.37 -0.54 – -0.21 <0.001
dec solrad 0.57 0.38 – 0.75 <0.001 0.29 0.13 – 0.46 0.001
sum wind -0.27 -0.43 – -0.10 0.002 -0.15 -0.31 – 0.01 0.061
cat hard 0.24 -0.04 – 0.52 0.094
cat phos 0.49 0.18 – 0.80 0.002
cat calc 0.47 0.29 – 0.65 <0.001 0.24 0.08 – 0.40 0.003
cat psize -0.29 -0.58 – -0.00 0.047
Observations 101 101 101
R2 / R2 adjusted 0.732 / 0.696 0.722 / 0.698 0.805 / 0.780

6.2 Cross-validation summary stats

Code
mods <-
  resamples(
    list(
      model_geo = lm_model_geo,
      model_fenz = lm_model_fenz,
      geochem_plus_fenz = lm_model_geo_plus_fenz
    )
  )

cv_res <- summary(mods, metric = c('Rsquare', 'RMSE'))

cv_summary_table <- 
bind_rows(Rsq = as_tibble(cv_res$statistics$Rsquare), RMSE = as_tibble(cv_res$statistics$RMSE), .id = 'Metric') %>% 
  select(-`NA's`) %>% 
  mutate(Model = rep(c('Geochem', 'FENZ', 'Geochem + FENZ'),2), .before = everything()) %>%
  mutate_if(is.numeric, ~signif(., 2))
  
cv_summary_table %>% 
  kable()
Model Metric Min. 1st Qu. Median Mean 3rd Qu. Max.
Geochem Rsq 0.47 0.55 0.63 0.62 0.66 0.80
FENZ Rsq 0.53 0.62 0.67 0.67 0.73 0.77
Geochem + FENZ Rsq 0.62 0.69 0.72 0.73 0.76 0.89
Geochem RMSE 0.75 0.78 0.85 0.88 0.91 1.20
FENZ RMSE 0.68 0.77 0.83 0.82 0.90 0.97
Geochem + FENZ RMSE 0.52 0.68 0.72 0.75 0.84 0.98

Figure 15: Cross-validated R-square values for the Geochem and Geochem + FENZ linear models based on repeated 5-fold cross validation

Code
ggplot(mods, metric = "Rsquared") + 
  labs(y = "R-squared") +
  theme_minimal()

Figure 16: Cross-validated R-square values for the Geochem and Geochem + FENZ linear models based on repeated 5-fold cross validation

6.3 T- test to compare cross-valdation performance of the three models

Code
compare_models(lm_model_fenz, lm_model_geo_plus_fenz) %>%
  tidy() %>%
  mutate_if(is.numeric, ~ signif(., 2)) %>% 
  kable(caption = "FENZ vs Geo + FENZ")
compare_models(lm_model_geo, lm_model_geo_plus_fenz) %>% 
  tidy() %>% 
   mutate_if(is.numeric, ~ signif(., 2)) %>% 
  kable(caption = "GEO vs Geo + FENZ")
compare_models(lm_model_fenz, lm_model_geo) %>% 
  tidy() %>% 
   mutate_if(is.numeric, ~ signif(., 2)) %>% 
  kable(caption = "FENZ vs Geo")

Table 11: Results of t-test comparing the performace the Geochem and Geochem + FENZ linear models

(a) FENZ vs Geo + FENZ
estimate statistic p.value parameter conf.low conf.high method alternative
0.072 2.8 0.015 14 0.017 0.13 One Sample t-test two.sided
(b) GEO vs Geo + FENZ
estimate statistic p.value parameter conf.low conf.high method alternative
0.12 5.2 0.00013 14 0.073 0.18 One Sample t-test two.sided
(c) FENZ vs Geo
estimate statistic p.value parameter conf.low conf.high method alternative
-0.052 -1.5 0.14 14 -0.12 0.02 One Sample t-test two.sided

6.4 Summary statistics of final moedls

Code
bind_rows(Geo = lm_model_geo$finalModel %>% glance(),
FENZ = lm_model_fenz$finalModel %>% glance(),
`Geo + FENZ` = lm_model_geo_plus_fenz$finalModel %>% glance(), .id = "Model") %>% 
  mutate_if(is.numeric, ~signif(., 2)) %>% 
  kable()
Model r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
Geo 0.73 0.70 0.75 20 0 12 -110 240 280 50 88 100
FENZ 0.72 0.70 0.75 30 0 8 -110 240 260 52 92 100
Geo + FENZ 0.80 0.78 0.64 33 0 11 -92 210 240 36 89 100