This notebook described the modelling to determine drivers of bulk sediment phosphorus in lake sediments. Lake bulk sediment phosphorus refers to the measurement of the amount of phosphorus stored in the sediment of a lake.
The response variable for this analysis is bul_s_phosphorus (g/kg dry wt) is mildly right skewed. Thus, it was square root transformed to improve normality.
There were 210 distinct lakes after removing missing values in the response variable.
Code
ggarrange(ggplot(raw_dat, aes(sqrt(sqrt(bul_s_phosphorus)))) +geom_histogram(bins =20, fill ='gray70') +labs(x ='Fourth root bul_s_phosphorus'),ggplot(raw_dat, aes(bul_s_phosphorus)) +geom_histogram(bins =20, fill ='gray50'))
The lake was divided into two depth categories: shallow, with depths of 10 meters or less, and deep, with depths greater than 10 meters. Box plots were generated using the ggplot2 package in R software to visualize the distribution of TP concentrations in each depth category. The box plots were constructed by plotting the median, interquartile range (IQR), and whiskers that extended to 1.5 times the IQR. Outliers were plotted as individual points beyond the whiskers. The box plots were used to identify any significant differences in TP concentrations between the two depth categories. To account for non-normal distributions, a Wilcoxon rank-sum test was performed to determine whether there was a significant difference between the mean TP concentrations in the shallow and deep areas of the lake.
1.4
Seasonal effect by lake depth
Seasonality of phosphorus was assessed by comparing concentrations in two distinct time periods: October to April (Summer) and May to September (Winter). To account for non-normal distributions, a Wilcox rank-sum test was performed to determine whether there was a significant difference between the mean TP concentrations in summer and winter.
The bar plot shows the number of missing values per variable in decreasing order.
Code
naniar::gg_miss_var(raw_dat)
The missing values map (Figure 1) provides a different way of visualising NAs in relation to different observations (i.e. lakes) in the dataset.
Code
finalfit::missing_plot(raw_dat)
Figure 1: Map of missing values
1.7 Data transformation and imputation
All predictor variables were centred and scaled before fitting the model (by subtracting the overall mean from each observation and dividing the result by the overall standard deviation) to allow direct comparison of regression coefficients and inference about the relative sizes of effects. Predictor variables were transformed using the Yeo-Johnson transformation, which is very similar to the Box-Cox (i.e. a power transformation) but does allow negative values.
Missing values were imputed via bagging by fitting a bagged tree model for each predictor (as a function of all the others).
Figure 5: Relationship of selected predictor variables with bul_s_phosphorus
2.2 Predictors correlations
The correlogram below shows the degree to which predictor variables considered in the analyses are linearly related. The correlogram is a graphical display of Pearson’s correlation matrix. The method also groups variables using hierarchical cluster analysis. Correlation values between -1 and 1 are shown using a colour scale for each pair of variables. If correlations among pairs of variables are not significant, they are shown in white.
The correlations in the plot below focus on the response variable bul_s_phosphorus.
Code
corr_var(dat_t %>%select(-code,-c(season:date_survey)) %>%filter(lake_depth =="Shallow") %>%select(-lake_depth), bul_s_phosphorus, # name of variable to focus ontop =20# display top 5 correlations)
Figure 8: Correlations of predictor variables with bul_s_phosphorus
2.6 Linear model for shallow lakes
The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.
The summary table shows that most terms in the final models were highly significant, with a few exceptions that were marginally significant but retain in the final model. The the adjusted R2 value indicate the proportion of the variability in the response variable explained by the model.
The final model is presented as partial effects plots, which show the effect of each predictor variable conditional to others in the model. The partial effects of each predictor are displayed as a linear effect, either negative or positive, relative to the overall mean of the response variable. Partial plots also show standard errors around the line and partial residuals for each observation.
Code
pdp <-plot_model(lm_s_sel, type ='pred', title ="", show.data = T, auto.label = F)ggarrange(plotlist = pdp)
2.8 Coefficient plot
Results of linear final model can be represented as coefficient plots in the figure below that shows regression coefficients graphically, with 95% credible confidence intervals around mean estimates. The graphs show what is significant (the confidence intervals do not cover zero), the degree of uncertainty (the width of the intervals), the direction of the effect (whether is a positive or a negative effect) and the effect magnitude given by the absolute size of the coefficient. Coefficients that overlap with the zero-line, shown as a dotted vertical line, are likely to be insignificant.
Linear regression is based on the assumptions of normality, independence and absence of residual patterns, such heterogeneity of variance or non-linear patterns. Violation of these assumptions may result in biased parameter estimates and type I errors. To validate the fitted model, residuals were inspected by plotting them against fitted values, as Q-Q plot to check for normality. The VIF values for the predictors retained in the final model are <5. The diagnostic residuals plots presented below show that the final model met the assumptions. Thus, the model is considered suitable for this type of that and can be used for inference.
Code
diag_plots <-plot_model(lm_s_sel, type ='diag')ggarrange(plotlist = diag_plots)
3 Deep lakes (>10 m depth)
Relationship between bul_s_phosphorus and geochem variables by season
Figure 9: Relationship of selected predictor variables with bul_s_phosphorus
3.1 Predictors correlations
The correlogram below shows the degree to which predictor variables considered in the analyses are linearly related. The correlogram is a graphical display of the Pearson’s correlation matrix. The method also groups variables using hierarchical cluster analysis. Correlation values between -1 and 1 are shown for each pair of variables represented using a colour scale. If correlations among pairs of variables are not significant, they are shown in white.
3.4 Correlation of Buls Phorphorus against predictors
The correlations in the plot below focus on the response variable bul_s_phosphorus.
Code
corr_var( dat_t %>%select(-code,-c(season:date_survey)) %>%filter(lake_depth =="Deep") %>%select(-lake_depth), bul_s_phosphorus,# name of variable to focus ontop =20# display top 5 correlations)
Figure 12: Correlations of predictor variables with bul_s_phosphorus
The table shows the step-wise selected final model summary with associated estimates, 95% confidence intervals and p-values.