Drivers of P in lake sediments

Author

Javier Atalah

Published

May 16, 2023

Code
library(tidyverse)
library(sjPlot)
library(ggpubr)
library(ggcorrplot)
library(lares)
library(knitr)
library(janitor)
library(caret)
library(sf)
library(rnaturalearth)
theme_set(theme_minimal())

give_n <- function(x){
   return(c(y = mean(x), label = length(x)))
}

conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")

raw_dat <- read_csv('clean_data/raw_model_data.csv')

This notebook described the modelling to determine drivers of bulk sediment phosphorus in lake sediments. Lake bulk sediment phosphorus refers to the measurement of the amount of phosphorus stored in the sediment of a lake.

The response variable for this analysis is bul_s_phosphorus (g/kg dry wt) is mildly right skewed. Thus, it was square root transformed to improve normality.

There were 210 distinct lakes after removing missing values in the response variable.

Code
ggarrange(
  ggplot(raw_dat, aes(sqrt(sqrt(bul_s_phosphorus)))) +
    geom_histogram(bins = 20, fill = 'gray70') + labs(x = 'Fourth root bul_s_phosphorus'),
  ggplot(raw_dat, aes(bul_s_phosphorus)) +
    geom_histogram(bins = 20, fill = 'gray50')
)

1 Data exploration

1.1 Map of lake sediment phosphorus concentration

Code
nz_high <-
  ne_countries(country = 'new zealand',
               scale = "large",
               returnclass = "sf")

coords_sf <-
  read_csv('clean_data/lakes_coords.csv') %>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

ggplot() +
  geom_sf(data = nz_high) +
  theme_bw() +
  geom_sf(data = coords_sf,
          aes(color = bul_s_phosphorus),
          alpha = .4, size = 3) +
  coord_sf(xlim = c(166, 178.8), ylim = c(-47.35,-34.35)) +
  labs(color = "Phosphorus\n (g/kg dry wt)") +
  scale_color_viridis_c(option = "A", trans = 'log10')

1.2 Predictor variables description and units

Code
read_csv('clean_data/vars_desc.csv') %>% 
  DT::datatable()

1.3 Phosphorus in shallow and deep lakes

The lake was divided into two depth categories: shallow, with depths of 10 meters or less, and deep, with depths greater than 10 meters. Box plots were generated using the ggplot2 package in R software to visualize the distribution of TP concentrations in each depth category. The box plots were constructed by plotting the median, interquartile range (IQR), and whiskers that extended to 1.5 times the IQR. Outliers were plotted as individual points beyond the whiskers. The box plots were used to identify any significant differences in TP concentrations between the two depth categories. To account for non-normal distributions, a Wilcoxon rank-sum test was performed to determine whether there was a significant difference between the mean TP concentrations in the shallow and deep areas of the lake.

1.4

Seasonal effect by lake depth

Seasonality of phosphorus was assessed by comparing concentrations in two distinct time periods: October to April (Summer) and May to September (Winter). To account for non-normal distributions, a Wilcox rank-sum test was performed to determine whether there was a significant difference between the mean TP concentrations in summer and winter.

Code
ggplot(raw_dat, aes(x = lake_depth, bul_s_phosphorus)) +
  geom_boxplot() +
  stat_summary(fun.data = give_n, geom = "text") +
  stat_compare_means()

1.5 Seasonal effect by lake depth

The effect of season was only significant for shallow lakes.

Code
ggplot(raw_dat, aes(x = season, bul_s_phosphorus)) +
  geom_boxplot() +
  stat_summary(fun.data = give_n, geom = "text") +
  stat_compare_means() +
  facet_wrap(~lake_depth)

1.6 Missing values

The bar plot shows the number of missing values per variable in decreasing order.

Code
naniar::gg_miss_var(raw_dat) 

The missing values map (Figure 1) provides a different way of visualising NAs in relation to different observations (i.e. lakes) in the dataset.

Code
finalfit::missing_plot(raw_dat)

Figure 1: Map of missing values

1.7 Data transformation and imputation

All predictor variables were centred and scaled before fitting the model (by subtracting the overall mean from each observation and dividing the result by the overall standard deviation) to allow direct comparison of regression coefficients and inference about the relative sizes of effects. Predictor variables were transformed using the Yeo-Johnson transformation, which is very similar to the Box-Cox (i.e. a power transformation) but does allow negative values.

Missing values were imputed via bagging by fitting a bagged tree model for each predictor (as a function of all the others).

1.8 Predictor variables distribution

Code
raw_dat[-c(1:6)] %>% 
  pivot_longer(cols = everything()) %>% 
  ggplot() +
  geom_histogram(aes(value), bins = 15, na.rm = TRUE) +
  facet_wrap(vars(name), scales = "free") +
  theme_minimal() +
  theme(axis.text = element_text(size = 8), text=element_text(size=7)) 

Figure 2: Histograms of untransformed variables

1.9 Transformed data distribution

Code
dat_t <- read_csv('clean_data/data_t.csv')

dat_t[-c(1:6)] %>% 
  pivot_longer(cols = everything()) %>% 
  ggplot() +
  geom_histogram(aes(value), bins = 15, na.rm = TRUE) +
  facet_wrap(vars(name), scales = "free") +
  theme_minimal() +
  theme(axis.text = element_text(size = 8), text=element_text(size=7)) 

Figure 3: Histograms of transformed variables

1.10 Relationship between bul_s_phosphorus and geochem variables by depth

Removed an outlier trr_mn_fe value of 1.39 for Lake Lockett.

There is little evidence that deep and shallow lakes are responding differently.

Code
dat_t %>%
  # select(lake_depth, contains("bul_s"), contains("trr_")) %>%
  select(where(is.numeric), lake_depth) %>% 
  pivot_longer(cols = -c(bul_s_phosphorus, lake_depth)) %>%
  ggplot(aes(value, bul_s_phosphorus, color = lake_depth)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm) +
  stat_cor(
    aes(label = paste(
      ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
    )),
    p.accuracy = 0.001,
    r.accuracy = 0.01,
    size = 3
  ) +
  facet_wrap( ~ name, scales = "free") +
  scale_color_discrete(name = NULL) +
  theme_minimal()

Figure 4: Relationship of selected predictor variables with bul_s_phosphorus by season

2 Shallow lakes (<10m depth)

2.1 Relationship between bul_s_phosphorus and geochem variables by season

Code
dat_t %>%
  filter(lake_depth == "Shallow") %>% 
  select(where(is.numeric), season) %>% 
  pivot_longer(cols = -c(bul_s_phosphorus, season)) %>%
  ggplot(aes(value, bul_s_phosphorus, color = season)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm) +
  stat_cor(
    aes(label = paste(
      after_stat(rr.label), gsub("p", "P", after_stat(p.label)), sep = "~`,`~"
    )),
    p.accuracy = 0.001,
    r.accuracy = 0.01,
    size = 3
  ) +
  facet_wrap( ~ name, scales = "free") +
  scale_color_discrete(name = NULL) +
  theme_minimal()

Figure 5: Relationship of selected predictor variables with bul_s_phosphorus

2.2 Predictors correlations

The correlogram below shows the degree to which predictor variables considered in the analyses are linearly related. The correlogram is a graphical display of Pearson’s correlation matrix. The method also groups variables using hierarchical cluster analysis. Correlation values between -1 and 1 are shown using a colour scale for each pair of variables. If correlations among pairs of variables are not significant, they are shown in white.

Code
pre_cor <-
  cor(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Shallow") %>% 
             select(-lake_depth),
      method = "pearson",
      use = "complete.obs")

p_mat <- 
  ggcorrplot::cor_pmat(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Shallow") %>% 
             select(-lake_depth),)

ggcorrplot::ggcorrplot(pre_cor, 
                         # p.mat = p_mat, 
                         hc.order = TRUE,
                         type = "lower", 
                         insig = "blank",
                         lab = T,
                         colors = c("#6D9EC1", "white", "#E46726"),
                         lab_size = 3,
                         tl.cex = 10) 

Figure 6: Correlogram of predictor variables

2.3 Ranked cross-correlations

Compute the top 20 cross-correlations.

Code
corr_cross(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Shallow") %>% 
             select(-lake_depth),
           # name of dataset
           max_pvalue = 0.1,
           top = 20,
           quiet = T) +
  labs(title = NULL, subtitle = NULL)

Figure 7: Cross-correlations of predictor variables

2.4 Variance Inflation factor

All VIF values are <10.

Code
diag(solve(cor(
  dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Deep") %>% 
             select(-lake_depth),
))) %>%
  enframe() %>%
  arrange(-value) %>%
  kable(digits = 2)
Table 1: Variance inflation factor table
name value
lake_volume 9.87
residence_time 6.41
catch_flow 5.82
cat_slope 5.61
bul_s_om 4.98
bul_s_sulfur 4.98
high_prod_exotic_grass 4.35
bul_s_iron 4.23
dry_bd_dw 4.15
ac_a 4.04
gs_63 3.87
cat_phos 3.71
max_depth 3.61
rse_bd_fe_toc 3.44
cat_hard 3.40
bul_s_manganese 3.35
sum_wind 3.26
dynamic_sediment_ratio 3.07
lk_elev 2.62
cat_calc 2.50
indi_for 2.47
low_prod_grass 2.15
bul_s_calcium 2.09
bul_s_aluminium 2.01
cat_peat 1.77
bul_s_carbonates 1.60

2.5 Correlation between Phosphorus and Predictors

The correlations in the plot below focus on the response variable bul_s_phosphorus.

Code
corr_var(dat_t %>%
    select(-code,-c(season:date_survey)) %>%
    filter(lake_depth == "Shallow") %>%
    select(-lake_depth),
    bul_s_phosphorus, # name of variable to focus on
         top = 20 # display top 5 correlations
)

Figure 8: Correlations of predictor variables with bul_s_phosphorus

2.6 Linear model for shallow lakes

The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.

The summary table shows that most terms in the final models were highly significant, with a few exceptions that were marginally significant but retain in the final model. The the adjusted R2 value indicate the proportion of the variability in the response variable explained by the model.

Code
shallow_dat <-
  dat_t %>%
  filter(lake_depth == "Shallow") %>%
  select(-code, -date_survey, -lake_depth) %>% 
  mutate(season = fct_rev(season))

lm_s <-
  lm(
    bul_s_phosphorus  ~ . + season * max_depth,
    data = shallow_dat
  )

# models selection based on AIC criteria---
lm_s_sel <-
  MASS::stepAIC(lm_s, trace = F, scope = list(lower = ~ bul_s_iron))

tab_model(lm_s_sel)
  bul s phosphorus
Predictors Estimates CI p
(Intercept) 5.65 5.11 – 6.18 <0.001
season [Summer] 0.61 0.00 – 1.23 0.049
sum wind 0.20 0.05 – 0.36 0.009
cat hard 0.15 -0.02 – 0.32 0.087
residence time 0.47 0.26 – 0.68 <0.001
ac a 0.38 0.19 – 0.58 <0.001
high prod exotic grass 0.36 0.20 – 0.52 <0.001
max depth -0.41 -0.82 – -0.01 0.047
dry bd dw -0.22 -0.44 – -0.01 0.040
bul s om 0.26 0.05 – 0.48 0.018
bul s carbonates -0.13 -0.25 – -0.01 0.028
bul s iron 0.28 0.13 – 0.42 <0.001
bul s aluminium 0.12 -0.02 – 0.26 0.094
season [Summer] * max
depth
0.96 0.44 – 1.47 <0.001
Observations 91
R2 / R2 adjusted 0.646 / 0.586

2.7 Partial effects

The final model is presented as partial effects plots, which show the effect of each predictor variable conditional to others in the model. The partial effects of each predictor are displayed as a linear effect, either negative or positive, relative to the overall mean of the response variable. Partial plots also show standard errors around the line and partial residuals for each observation.

Code
pdp <- plot_model(lm_s_sel, type = 'pred', title = "", show.data = T, auto.label = F)
ggarrange(plotlist = pdp)

2.8 Coefficient plot

Results of linear final model can be represented as coefficient plots in the figure below that shows regression coefficients graphically, with 95% credible confidence intervals around mean estimates. The graphs show what is significant (the confidence intervals do not cover zero), the degree of uncertainty (the width of the intervals), the direction of the effect (whether is a positive or a negative effect) and the effect magnitude given by the absolute size of the coefficient. Coefficients that overlap with the zero-line, shown as a dotted vertical line, are likely to be insignificant.

Code
plot_model(lm_s_sel, type = 'est')

2.9 Variable importance

Code
caret::varImp(lm_s_sel) %>%
  as_tibble(rownames = 'var') %>% 
  ggplot(aes(x = reorder(var, Overall ), y = Overall )) +
  geom_point() +
  geom_segment(aes(
    x = var,
    xend = var,
    y = 0,
    yend = Overall
  )) +
  ylab("Variable importance") +
  xlab("Predictor") +
  coord_flip()

2.10 Diagnostic plots

Linear regression is based on the assumptions of normality, independence and absence of residual patterns, such heterogeneity of variance or non-linear patterns. Violation of these assumptions may result in biased parameter estimates and type I errors. To validate the fitted model, residuals were inspected by plotting them against fitted values, as Q-Q plot to check for normality. The VIF values for the predictors retained in the final model are <5. The diagnostic residuals plots presented below show that the final model met the assumptions. Thus, the model is considered suitable for this type of that and can be used for inference.

Code
diag_plots <- plot_model(lm_s_sel, type = 'diag')
ggarrange(plotlist = diag_plots)

3 Deep lakes (>10 m depth)

Relationship between bul_s_phosphorus and geochem variables by season

Code
dat_t %>%
  filter(lake_depth == "Deep") %>% 
  select(where(is.numeric), season, -code) %>% 
  pivot_longer(cols = -c(bul_s_phosphorus, season)) %>%
  ggplot(aes(value, bul_s_phosphorus, color = season)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm) +
  stat_cor(
    aes(label = paste(
      after_stat(rr.label), gsub("p", "P", after_stat(p.label)), sep = "~`,`~"
    )),
    p.accuracy = 0.001,
    r.accuracy = 0.01,
    size = 3
  ) +
  facet_wrap( ~ name, scales = "free") +
  scale_color_discrete(name = NULL) +
  theme_minimal()

Figure 9: Relationship of selected predictor variables with bul_s_phosphorus

3.1 Predictors correlations

The correlogram below shows the degree to which predictor variables considered in the analyses are linearly related. The correlogram is a graphical display of the Pearson’s correlation matrix. The method also groups variables using hierarchical cluster analysis. Correlation values between -1 and 1 are shown for each pair of variables represented using a colour scale. If correlations among pairs of variables are not significant, they are shown in white.

Code
pre_cor <-
  cor(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Deep") %>% 
             select(-lake_depth),
      method = "pearson",
      use = "complete.obs")

p_mat <- 
  ggcorrplot::cor_pmat(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Deep") %>% 
             select(-lake_depth))

ggcorrplot::ggcorrplot(pre_cor, 
                         # p.mat = p_mat, 
                         hc.order = TRUE,
                         type = "lower", 
                         insig = "blank",
                         lab = T,
                         colors = c("#6D9EC1", "white", "#E46726"),
                         lab_size = 3,
                         tl.cex = 10) 

Figure 10: Correlogram of predictor variables

3.2 Ranked cross-correlations

Compute the top 20 cross-correlations.

Code
corr_cross(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Deep") %>% 
             select(-lake_depth),
           # name of dataset
           max_pvalue = 0.1,
           top = 20,
           quiet = T) +
  labs(title = NULL, subtitle = NULL)

Figure 11: Cross-correlations of predictor variables

3.3 Variance Inflation factor

All VIF values are <10.

Code
diag(solve(cor(
  dat_t %>%
    select(-code, -c(bul_s_phosphorus:date_survey)) %>%
    filter(lake_depth == "Deep") %>%
    select(-lake_depth),
))) %>%
  enframe() %>%
  arrange(-value) %>%
  kable(digits = 2)
Table 2: Variance inflation factor table
name value
lake_volume 9.87
residence_time 6.41
catch_flow 5.82
cat_slope 5.61
bul_s_om 4.98
bul_s_sulfur 4.98
high_prod_exotic_grass 4.35
bul_s_iron 4.23
dry_bd_dw 4.15
ac_a 4.04
gs_63 3.87
cat_phos 3.71
max_depth 3.61
rse_bd_fe_toc 3.44
cat_hard 3.40
bul_s_manganese 3.35
sum_wind 3.26
dynamic_sediment_ratio 3.07
lk_elev 2.62
cat_calc 2.50
indi_for 2.47
low_prod_grass 2.15
bul_s_calcium 2.09
bul_s_aluminium 2.01
cat_peat 1.77
bul_s_carbonates 1.60

3.4 Correlation of Buls Phorphorus against predictors

The correlations in the plot below focus on the response variable bul_s_phosphorus.

Code
corr_var(
  dat_t %>%
    select(-code,-c(season:date_survey)) %>%
    filter(lake_depth == "Deep") %>%
    select(-lake_depth),
  bul_s_phosphorus,
  # name of variable to focus on
  top = 20 # display top 5 correlations
)

Figure 12: Correlations of predictor variables with bul_s_phosphorus

The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.

3.5 Linear models for deep lakes

Code
deep_dat <-
  dat_t %>%
  filter(lake_depth == "Deep") %>%
  select(-c(1:2, -date_survey))

lm_d <-
  lm(bul_s_phosphorus  ~ .,
     data = deep_dat)

# models selection based on AIC criteria---
lm_d_sel <- MASS::stepAIC(lm_d, trace = F, scope = list(lower = ~ bul_s_iron))

tab_model(lm_d_sel)
  bul s phosphorus
Predictors Estimates CI p
(Intercept) 6.05 5.92 – 6.18 <0.001
cat phos 0.13 -0.02 – 0.27 0.091
cat calc 0.17 0.03 – 0.30 0.018
high prod exotic grass -0.16 -0.32 – 0.01 0.064
indi for -0.09 -0.22 – 0.03 0.151
dry bd dw -0.23 -0.41 – -0.05 0.014
bul s iron 0.19 0.01 – 0.38 0.039
bul s manganese 0.20 0.04 – 0.37 0.017
bul s sulfur 0.23 0.03 – 0.43 0.025
Observations 119
R2 / R2 adjusted 0.411 / 0.368

3.6 Partial effects

Code
pdp <-
  plot_model(
    lm_d_sel,
    type = 'pred',
    title = "",
    show.data = T,
    auto.label = F
  )
ggarrange(plotlist = pdp)

3.7 Coefficient plot

Code
plot_model(lm_d_sel, type = 'est')

3.8 Variable importance

Code
caret::varImp(lm_d_sel) %>%
  as_tibble(rownames = 'var') %>% 
  ggplot(aes(x = reorder(var, Overall ), y = Overall )) +
  geom_point() +
  geom_segment(aes(
    x = var,
    xend = var,
    y = 0,
    yend = Overall
  )) +
  ylab("Variable importance") +
  xlab("Predictor") +
  coord_flip()

3.9 Diagnostic plots

Code
diag_plots <- plot_model(lm_d_sel, type = 'diag')
ggarrange(plotlist = diag_plots)

4 All lakes

4.1 Global linear model (all lakes)

Model fitted with all data, i.e. shallow and deep lakes

Code
all_dat <-
  dat_t %>%
  select(-c(1:2, max_depth, bul_s_om, date_survey))

lm_a <-
  lm(
    bul_s_phosphorus  ~ .,
    data = all_dat
  )

# models selection based on AIC criteria---
lm_a_sel <- MASS::stepAIC(lm_a, trace = F, scope = list(lower = ~ bul_s_iron))

tab_model(lm_a_sel)
  bul s phosphorus
Predictors Estimates CI p
(Intercept) 6.02 5.93 – 6.12 <0.001
cat peat -0.13 -0.23 – -0.03 0.012
residence time 0.18 0.04 – 0.31 0.009
ac a 0.14 0.01 – 0.27 0.040
cat calc 0.17 0.07 – 0.28 0.001
indi for -0.12 -0.22 – -0.02 0.022
dry bd dw -0.37 -0.47 – -0.27 <0.001
bul s carbonates -0.07 -0.18 – 0.03 0.156
bul s iron 0.33 0.22 – 0.44 <0.001
Observations 210
R2 / R2 adjusted 0.347 / 0.321

4.2 Coefficient plot

Code
plot_model(lm_a_sel, type = 'est')

4.3 Partial effects

Code
pdp <-
  plot_model(
    lm_a_sel,
    type = 'pred',
    title = "",
    show.data = T,
    auto.label = F
  )
ggarrange(plotlist = pdp)

4.4 Diagnostic plot

Code
diag_plots <- plot_model(lm_a_sel, type = 'diag')
ggarrange(plotlist = diag_plots)