Drivers of P in lake sediments

Author

Javier Atalah

Published

May 16, 2023

Code

library(tidyverse)
library(sjPlot)
library(ggpubr)
library(ggcorrplot)
library(lares)
library(knitr)
library(janitor)
library(caret)
library(sf)
library(rnaturalearth)
theme_set(theme_minimal())

give_n <- function(x){
   return(c(y = mean(x), label = length(x)))
}

conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")

raw_dat <- read_csv('clean_data/raw_model_data.csv')

This notebook described the modelling to determine drivers of bulk sediment phosphorus in lake sediments. Lake bulk sediment phosphorus refers to the measurement of the amount of phosphorus stored in the sediment of a lake.

The response variable for this analysis is bul_s_phosphorus (g/kg dry wt) is mildly right skewed. Thus, it was square root transformed to improve normality.

There were 210 distinct lakes after removing missing values in the response variable.

Code

ggarrange(
  ggplot(raw_dat, aes(sqrt(sqrt(bul_s_phosphorus)))) +
    geom_histogram(bins = 20, fill = 'gray70') + labs(x = 'Fourth root bul_s_phosphorus'),
  ggplot(raw_dat, aes(bul_s_phosphorus)) +
    geom_histogram(bins = 20, fill = 'gray50')
)

1 Data exploration

1.1 Map of lake sediment phosphorus concentration

Code

nz_high <-
  ne_countries(country = 'new zealand',
               scale = "large",
               returnclass = "sf")

coords_sf <-
  read_csv('clean_data/lakes_coords.csv') %>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

ggplot() +
  geom_sf(data = nz_high) +
  theme_bw() +
  geom_sf(data = coords_sf,
          aes(color = bul_s_phosphorus),
          alpha = .4, size = 3) +
  coord_sf(xlim = c(166, 178.8), ylim = c(-47.35,-34.35)) +
  labs(color = "Phosphorus\n (g/kg dry wt)") +
  scale_color_viridis_c(option = "A", trans = 'log10')

1.2 Predictor variables description and units

Code

read_csv('clean_data/vars_desc.csv') %>% 
  DT::datatable()

1.3 Phosphorus in shallow and deep lakes

The lake was divided into two depth categories: shallow, with depths of 10 meters or less, and deep, with depths greater than 10 meters. Box plots were generated using the ggplot2 package in R software to visualize the distribution of TP concentrations in each depth category. The box plots were constructed by plotting the median, interquartile range (IQR), and whiskers that extended to 1.5 times the IQR. Outliers were plotted as individual points beyond the whiskers. The box plots were used to identify any significant differences in TP concentrations between the two depth categories. To account for non-normal distributions, a Wilcoxon rank-sum test was performed to determine whether there was a significant difference between the mean TP concentrations in the shallow and deep areas of the lake.

1.4

Seasonal effect by lake depth

Seasonality of phosphorus was assessed by comparing concentrations in two distinct time periods: October to April (Summer) and May to September (Winter). To account for non-normal distributions, a Wilcox rank-sum test was performed to determine whether there was a significant difference between the mean TP concentrations in summer and winter.

Code

ggplot(raw_dat, aes(x = lake_depth, bul_s_phosphorus)) +
  geom_boxplot() +
  stat_summary(fun.data = give_n, geom = "text") +
  stat_compare_means()

1.5 Seasonal effect by lake depth

The effect of season was only significant for shallow lakes.

Code

ggplot(raw_dat, aes(x = season, bul_s_phosphorus)) +
  geom_boxplot() +
  stat_summary(fun.data = give_n, geom = "text") +
  stat_compare_means() +
  facet_wrap(~lake_depth)

1.6 Missing values

The bar plot shows the number of missing values per variable in decreasing order.

Code

naniar::gg_miss_var(raw_dat)

The missing values map (Figure 1) provides a different way of visualising NAs in relation to different observations (i.e. lakes) in the dataset.

Code

finalfit::missing_plot(raw_dat)

1.7 Data transformation and imputation

All predictor variables were centred and scaled before fitting the model (by subtracting the overall mean from each observation and dividing the result by the overall standard deviation) to allow direct comparison of regression coefficients and inference about the relative sizes of effects. Predictor variables were transformed using the Yeo-Johnson transformation, which is very similar to the Box-Cox (i.e. a power transformation) but does allow negative values.

Missing values were imputed via bagging by fitting a bagged tree model for each predictor (as a function of all the others).

1.8 Predictor variables distribution

Code

raw_dat[-c(1:6)] %>% 
  pivot_longer(cols = everything()) %>% 
  ggplot() +
  geom_histogram(aes(value), bins = 15, na.rm = TRUE) +
  facet_wrap(vars(name), scales = "free") +
  theme_minimal() +
  theme(axis.text = element_text(size = 8), text=element_text(size=7))

Figure 2: Histograms of untransformed variables

1.9 Transformed data distribution

Code

dat_t <- read_csv('clean_data/data_t.csv')

dat_t[-c(1:6)] %>% 
  pivot_longer(cols = everything()) %>% 
  ggplot() +
  geom_histogram(aes(value), bins = 15, na.rm = TRUE) +
  facet_wrap(vars(name), scales = "free") +
  theme_minimal() +
  theme(axis.text = element_text(size = 8), text=element_text(size=7))

Figure 3: Histograms of transformed variables

1.10 Relationship between bul_s_phosphorus and geochem variables by depth

Removed an outlier trr_mn_fe value of 1.39 for Lake Lockett.

There is little evidence that deep and shallow lakes are responding differently.

Code

dat_t %>%
  # select(lake_depth, contains("bul_s"), contains("trr_")) %>%
  select(where(is.numeric), lake_depth) %>% 
  pivot_longer(cols = -c(bul_s_phosphorus, lake_depth)) %>%
  ggplot(aes(value, bul_s_phosphorus, color = lake_depth)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm) +
  stat_cor(
    aes(label = paste(
      ..rr.label.., gsub("p", "P", ..p.label..), sep = "~`,`~"
    )),
    p.accuracy = 0.001,
    r.accuracy = 0.01,
    size = 3
  ) +
  facet_wrap( ~ name, scales = "free") +
  scale_color_discrete(name = NULL) +
  theme_minimal()

Figure 4: Relationship of selected predictor variables with bul_s_phosphorus by season

2 Shallow lakes (<10m depth)

2.1 Relationship between bul_s_phosphorus and geochem variables by season

Code

dat_t %>%
  filter(lake_depth == "Shallow") %>% 
  select(where(is.numeric), season) %>% 
  pivot_longer(cols = -c(bul_s_phosphorus, season)) %>%
  ggplot(aes(value, bul_s_phosphorus, color = season)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm) +
  stat_cor(
    aes(label = paste(
      after_stat(rr.label), gsub("p", "P", after_stat(p.label)), sep = "~`,`~"
    )),
    p.accuracy = 0.001,
    r.accuracy = 0.01,
    size = 3
  ) +
  facet_wrap( ~ name, scales = "free") +
  scale_color_discrete(name = NULL) +
  theme_minimal()

Figure 5: Relationship of selected predictor variables with bul_s_phosphorus

2.2 Predictors correlations

The correlogram below shows the degree to which predictor variables considered in the analyses are linearly related. The correlogram is a graphical display of Pearson’s correlation matrix. The method also groups variables using hierarchical cluster analysis. Correlation values between -1 and 1 are shown using a colour scale for each pair of variables. If correlations among pairs of variables are not significant, they are shown in white.

Code

pre_cor <-
  cor(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Shallow") %>% 
             select(-lake_depth),
      method = "pearson",
      use = "complete.obs")

p_mat <- 
  ggcorrplot::cor_pmat(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Shallow") %>% 
             select(-lake_depth),)

ggcorrplot::ggcorrplot(pre_cor, 
                         # p.mat = p_mat, 
                         hc.order = TRUE,
                         type = "lower", 
                         insig = "blank",
                         lab = T,
                         colors = c("#6D9EC1", "white", "#E46726"),
                         lab_size = 3,
                         tl.cex = 10)

Figure 6: Correlogram of predictor variables

2.3 Ranked cross-correlations

Compute the top 20 cross-correlations.

Code

corr_cross(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Shallow") %>% 
             select(-lake_depth),
           # name of dataset
           max_pvalue = 0.1,
           top = 20,
           quiet = T) +
  labs(title = NULL, subtitle = NULL)

Figure 7: Cross-correlations of predictor variables

2.4 Variance Inflation factor

All VIF values are <10.

Code

diag(solve(cor(
  dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Deep") %>% 
             select(-lake_depth),
))) %>%
  enframe() %>%
  arrange(-value) %>%
  kable(digits = 2)

Table 1: Variance inflation factor table
name	value
lake_volume	9.87
residence_time	6.41
catch_flow	5.82
cat_slope	5.61
bul_s_om	4.98
bul_s_sulfur	4.98
high_prod_exotic_grass	4.35
bul_s_iron	4.23
dry_bd_dw	4.15
ac_a	4.04
gs_63	3.87
cat_phos	3.71
max_depth	3.61
rse_bd_fe_toc	3.44
cat_hard	3.40
bul_s_manganese	3.35
sum_wind	3.26
dynamic_sediment_ratio	3.07
lk_elev	2.62
cat_calc	2.50
indi_for	2.47
low_prod_grass	2.15
bul_s_calcium	2.09
bul_s_aluminium	2.01
cat_peat	1.77
bul_s_carbonates	1.60

2.5 Correlation between Phosphorus and Predictors

The correlations in the plot below focus on the response variable bul_s_phosphorus.

Code

corr_var(dat_t %>%
    select(-code,-c(season:date_survey)) %>%
    filter(lake_depth == "Shallow") %>%
    select(-lake_depth),
    bul_s_phosphorus, # name of variable to focus on
         top = 20 # display top 5 correlations
)

Figure 8: Correlations of predictor variables with bul_s_phosphorus

2.6 Linear model for shallow lakes

The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.

The summary table shows that most terms in the final models were highly significant, with a few exceptions that were marginally significant but retain in the final model. The the adjusted R² value indicate the proportion of the variability in the response variable explained by the model.

Code

shallow_dat <-
  dat_t %>%
  filter(lake_depth == "Shallow") %>%
  select(-code, -date_survey, -lake_depth) %>% 
  mutate(season = fct_rev(season))

lm_s <-
  lm(
    bul_s_phosphorus  ~ . + season * max_depth,
    data = shallow_dat
  )

# models selection based on AIC criteria---
lm_s_sel <-
  MASS::stepAIC(lm_s, trace = F, scope = list(lower = ~ bul_s_iron))

tab_model(lm_s_sel)

	bul s phosphorus
Predictors	Estimates	CI	p
(Intercept)	5.65	5.11 – 6.18	<0.001
season [Summer]	0.61	0.00 – 1.23	0.049
sum wind	0.20	0.05 – 0.36	0.009
cat hard	0.15	-0.02 – 0.32	0.087
residence time	0.47	0.26 – 0.68	<0.001
ac a	0.38	0.19 – 0.58	<0.001
high prod exotic grass	0.36	0.20 – 0.52	<0.001
max depth	-0.41	-0.82 – -0.01	0.047
dry bd dw	-0.22	-0.44 – -0.01	0.040
bul s om	0.26	0.05 – 0.48	0.018
bul s carbonates	-0.13	-0.25 – -0.01	0.028
bul s iron	0.28	0.13 – 0.42	<0.001
bul s aluminium	0.12	-0.02 – 0.26	0.094
season [Summer] * max depth	0.96	0.44 – 1.47	<0.001
Observations	91
R² / R² adjusted	0.646 / 0.586

2.7 Partial effects

The final model is presented as partial effects plots, which show the effect of each predictor variable conditional to others in the model. The partial effects of each predictor are displayed as a linear effect, either negative or positive, relative to the overall mean of the response variable. Partial plots also show standard errors around the line and partial residuals for each observation.

Code

pdp <- plot_model(lm_s_sel, type = 'pred', title = "", show.data = T, auto.label = F)
ggarrange(plotlist = pdp)

2.8 Coefficient plot

Results of linear final model can be represented as coefficient plots in the figure below that shows regression coefficients graphically, with 95% credible confidence intervals around mean estimates. The graphs show what is significant (the confidence intervals do not cover zero), the degree of uncertainty (the width of the intervals), the direction of the effect (whether is a positive or a negative effect) and the effect magnitude given by the absolute size of the coefficient. Coefficients that overlap with the zero-line, shown as a dotted vertical line, are likely to be insignificant.

Code

plot_model(lm_s_sel, type = 'est')

2.9 Variable importance

Code

caret::varImp(lm_s_sel) %>%
  as_tibble(rownames = 'var') %>% 
  ggplot(aes(x = reorder(var, Overall ), y = Overall )) +
  geom_point() +
  geom_segment(aes(
    x = var,
    xend = var,
    y = 0,
    yend = Overall
  )) +
  ylab("Variable importance") +
  xlab("Predictor") +
  coord_flip()

2.10 Diagnostic plots

Linear regression is based on the assumptions of normality, independence and absence of residual patterns, such heterogeneity of variance or non-linear patterns. Violation of these assumptions may result in biased parameter estimates and type I errors. To validate the fitted model, residuals were inspected by plotting them against fitted values, as Q-Q plot to check for normality. The VIF values for the predictors retained in the final model are <5. The diagnostic residuals plots presented below show that the final model met the assumptions. Thus, the model is considered suitable for this type of that and can be used for inference.

Code

diag_plots <- plot_model(lm_s_sel, type = 'diag')
ggarrange(plotlist = diag_plots)

3 Deep lakes (>10 m depth)

Relationship between bul_s_phosphorus and geochem variables by season

Code

dat_t %>%
  filter(lake_depth == "Deep") %>% 
  select(where(is.numeric), season, -code) %>% 
  pivot_longer(cols = -c(bul_s_phosphorus, season)) %>%
  ggplot(aes(value, bul_s_phosphorus, color = season)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = lm) +
  stat_cor(
    aes(label = paste(
      after_stat(rr.label), gsub("p", "P", after_stat(p.label)), sep = "~`,`~"
    )),
    p.accuracy = 0.001,
    r.accuracy = 0.01,
    size = 3
  ) +
  facet_wrap( ~ name, scales = "free") +
  scale_color_discrete(name = NULL) +
  theme_minimal()

Figure 9: Relationship of selected predictor variables with bul_s_phosphorus

3.1 Predictors correlations

The correlogram below shows the degree to which predictor variables considered in the analyses are linearly related. The correlogram is a graphical display of the Pearson’s correlation matrix. The method also groups variables using hierarchical cluster analysis. Correlation values between -1 and 1 are shown for each pair of variables represented using a colour scale. If correlations among pairs of variables are not significant, they are shown in white.

Code

pre_cor <-
  cor(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Deep") %>% 
             select(-lake_depth),
      method = "pearson",
      use = "complete.obs")

p_mat <- 
  ggcorrplot::cor_pmat(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Deep") %>% 
             select(-lake_depth))

ggcorrplot::ggcorrplot(pre_cor, 
                         # p.mat = p_mat, 
                         hc.order = TRUE,
                         type = "lower", 
                         insig = "blank",
                         lab = T,
                         colors = c("#6D9EC1", "white", "#E46726"),
                         lab_size = 3,
                         tl.cex = 10)

Figure 10: Correlogram of predictor variables

3.2 Ranked cross-correlations

Compute the top 20 cross-correlations.

Code

corr_cross(dat_t %>% 
             select(-code, -c(bul_s_phosphorus:date_survey)) %>% 
             filter(lake_depth == "Deep") %>% 
             select(-lake_depth),
           # name of dataset
           max_pvalue = 0.1,
           top = 20,
           quiet = T) +
  labs(title = NULL, subtitle = NULL)

Figure 11: Cross-correlations of predictor variables

3.3 Variance Inflation factor

All VIF values are <10.

Code

diag(solve(cor(
  dat_t %>%
    select(-code, -c(bul_s_phosphorus:date_survey)) %>%
    filter(lake_depth == "Deep") %>%
    select(-lake_depth),
))) %>%
  enframe() %>%
  arrange(-value) %>%
  kable(digits = 2)

Table 2: Variance inflation factor table
name	value
lake_volume	9.87
residence_time	6.41
catch_flow	5.82
cat_slope	5.61
bul_s_om	4.98
bul_s_sulfur	4.98
high_prod_exotic_grass	4.35
bul_s_iron	4.23
dry_bd_dw	4.15
ac_a	4.04
gs_63	3.87
cat_phos	3.71
max_depth	3.61
rse_bd_fe_toc	3.44
cat_hard	3.40
bul_s_manganese	3.35
sum_wind	3.26
dynamic_sediment_ratio	3.07
lk_elev	2.62
cat_calc	2.50
indi_for	2.47
low_prod_grass	2.15
bul_s_calcium	2.09
bul_s_aluminium	2.01
cat_peat	1.77
bul_s_carbonates	1.60

3.4 Correlation of Buls Phorphorus against predictors

The correlations in the plot below focus on the response variable bul_s_phosphorus.

Code

corr_var(
  dat_t %>%
    select(-code,-c(season:date_survey)) %>%
    filter(lake_depth == "Deep") %>%
    select(-lake_depth),
  bul_s_phosphorus,
  # name of variable to focus on
  top = 20 # display top 5 correlations
)

Figure 12: Correlations of predictor variables with bul_s_phosphorus

The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.

3.5 Linear models for deep lakes

Code

deep_dat <-
  dat_t %>%
  filter(lake_depth == "Deep") %>%
  select(-c(1:2, -date_survey))

lm_d <-
  lm(bul_s_phosphorus  ~ .,
     data = deep_dat)

# models selection based on AIC criteria---
lm_d_sel <- MASS::stepAIC(lm_d, trace = F, scope = list(lower = ~ bul_s_iron))

tab_model(lm_d_sel)

	bul s phosphorus
Predictors	Estimates	CI	p
(Intercept)	6.05	5.92 – 6.18	<0.001
cat phos	0.13	-0.02 – 0.27	0.091
cat calc	0.17	0.03 – 0.30	0.018
high prod exotic grass	-0.16	-0.32 – 0.01	0.064
indi for	-0.09	-0.22 – 0.03	0.151
dry bd dw	-0.23	-0.41 – -0.05	0.014
bul s iron	0.19	0.01 – 0.38	0.039
bul s manganese	0.20	0.04 – 0.37	0.017
bul s sulfur	0.23	0.03 – 0.43	0.025
Observations	119
R² / R² adjusted	0.411 / 0.368

3.6 Partial effects

Code

pdp <-
  plot_model(
    lm_d_sel,
    type = 'pred',
    title = "",
    show.data = T,
    auto.label = F
  )
ggarrange(plotlist = pdp)

3.7 Coefficient plot

Code

plot_model(lm_d_sel, type = 'est')

3.8 Variable importance

Code

caret::varImp(lm_d_sel) %>%
  as_tibble(rownames = 'var') %>% 
  ggplot(aes(x = reorder(var, Overall ), y = Overall )) +
  geom_point() +
  geom_segment(aes(
    x = var,
    xend = var,
    y = 0,
    yend = Overall
  )) +
  ylab("Variable importance") +
  xlab("Predictor") +
  coord_flip()

3.9 Diagnostic plots

Code

diag_plots <- plot_model(lm_d_sel, type = 'diag')
ggarrange(plotlist = diag_plots)

4 All lakes

4.1 Global linear model (all lakes)

Model fitted with all data, i.e. shallow and deep lakes

Code

all_dat <-
  dat_t %>%
  select(-c(1:2, max_depth, bul_s_om, date_survey))

lm_a <-
  lm(
    bul_s_phosphorus  ~ .,
    data = all_dat
  )

# models selection based on AIC criteria---
lm_a_sel <- MASS::stepAIC(lm_a, trace = F, scope = list(lower = ~ bul_s_iron))

tab_model(lm_a_sel)

	bul s phosphorus
Predictors	Estimates	CI	p
(Intercept)	6.02	5.93 – 6.12	<0.001
cat peat	-0.13	-0.23 – -0.03	0.012
residence time	0.18	0.04 – 0.31	0.009
ac a	0.14	0.01 – 0.27	0.040
cat calc	0.17	0.07 – 0.28	0.001
indi for	-0.12	-0.22 – -0.02	0.022
dry bd dw	-0.37	-0.47 – -0.27	<0.001
bul s carbonates	-0.07	-0.18 – 0.03	0.156
bul s iron	0.33	0.22 – 0.44	<0.001
Observations	210
R² / R² adjusted	0.347 / 0.321

4.2 Coefficient plot

Code

plot_model(lm_a_sel, type = 'est')

4.3 Partial effects

Code

pdp <-
  plot_model(
    lm_a_sel,
    type = 'pred',
    title = "",
    show.data = T,
    auto.label = F
  )
ggarrange(plotlist = pdp)

4.4 Diagnostic plot

Code

diag_plots <- plot_model(lm_a_sel, type = 'diag')
ggarrange(plotlist = diag_plots)