library(mgcv)
library(here)
library(dplyr)
library(tidyverse)
library(nlme)
library(ggplot2)
library(GGally)
library(ggpubr)
library(GGally)
library(gridExtra)
library(ggeffects)
library(car)
library(ggResidpanel)
library(patchwork)
vul <- read.csv(here("Week_09", "Data", "AlaWai_DiscreteBottle.csv"))
# convert to factors for categorical data
vul = vul %>%
mutate(Site = as.factor(Site)) %>%
mutate(Season = as.factor(Season)) %>%
mutate(SampleDepth = as.factor(SampleDepth)) %>%
drop_na()
Start by analyzing general spatial and temporal patterns of V. vulnificus abundance. Plot how abundance varies between sites, seasons, and depths.
ggpairs(vul, columns = c("Site", "vvhA", "Season", "SampleDepth"),
mapping = aes(color = Season))
Construct an appropriate model that tests whether there are effects of these predictors, and interactions between them. Figure out how the response variable (gene copies per mL) should be modeled (i.e., whether to transform or not, whether to use a non-normal distribution). Note that the response variable is a continuous number — qPCR does not actually count discrete gene copies, it quantifies a continuous number based on PCR cycles that is directly proportional to gene copy concentraiton. What are the magnitudes of the modeled effects? What are your interpretations of the results so far?
vulmod = lm(log(vvhA) ~ Site*Season*SampleDepth, # vvhA (gene copies per mL) as response variable w log transform due to exponential growth, site/season/depth as predictors and all interactions between them
data = vul)
resid_panel(vulmod, plots = c('resid', 'qq', 'lev', 'hist'))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggResidpanel package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in plot_lev(model = model, type = type, smoother = smoother, theme = theme, : Observations with a leverage value of 1 are not included
## in the residuals versus leverage plot.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggResidpanel package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Anova(vulmod)
## Anova Table (Type II tests)
##
## Response: log(vvhA)
## Sum Sq Df F value Pr(>F)
## Site 179.372 7 12.8217 3.266e-11 ***
## Season 131.176 1 65.6361 3.076e-12 ***
## SampleDepth 59.119 2 14.7905 2.956e-06 ***
## Site:Season 39.139 7 2.7977 0.01128 *
## Site:SampleDepth 24.301 14 0.8685 0.59454
## Season:SampleDepth 0.345 2 0.0863 0.91742
## Site:Season:SampleDepth 23.147 14 0.8273 0.63803
## Residuals 173.873 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The residual panel follows a normal distribution, the Q-Q plot is fairly linear, and both residual plots are homoscedastic. Site, season, and sample depth have extremely small p-values which shows us they are signficant predictors, and site is the strongest predictor with the highest sum sq value. The interaction site:season shows that the effect of season depends on the location of the sample.
The authors measured many potential predictors of V. vulnificus, but only ended up using this set when performing model selection: Rainfall_5Day, AirTemp, O2Conc, Salinity, WaterTemp, Turbidity, Chlorophyll, NOx, Silicate, POC, TotalP, Tyrosine.like, HIX, and VisibleHumic.like. The reason is that there is substantial intercorrelation among the full set of predictors, and this set of variables are either relatively easy to measure, or hypothesized to be particularly important. Recreating the authors’ analysis of predictor correlation is outside the scope of this assignment, but you may want to explore it if you are interested.
Use scatterplots of V. vulnificus concentration vs. the predictors to consider which should be transformed for use in a linear model / generalized linear model. Recall that transformation of predictors is not about normality (predictors are not assumed to follow any distribution). Rather, transformation is useful for achieving linear relationships (when using linear models), and for avoiding extreme values that have a large leverage on the modeled relationship.
# potential predictors of V. vulnificus
predictors = c("Rainfall_5Day", "AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "Chlorophyll", "NOx", "Silicate", "POC", "TotalP", "Tyrosine.like", "HIX", "VisibleHumic.like")
vullong = vul %>%
dplyr::select(vvhA, all_of(predictors)) %>% #problem with other libraries loaded
pivot_longer(cols = all_of(predictors),
names_to = "predictor",
values_to = "value")
#scatterplots
ggplot(vullong, aes(x = value, y = log(vvhA))) +
geom_point(alpha = 0.4) + #opacity
facet_wrap(~ predictor, scales = "free_x") +
labs(title = "V. vulnificus vs Environmental predictors",
x = "Predictor value", y = "V. vulnificus abundance")
Candidates for transformation in a lm/glm: Chlorophyll, NOx (very skewed to the left with high leverage to the right), POC, Rainfall_5Day, Silicate, Tyrosine.like
Start by making single-predictor models for each of the 14 predictors listed above. Which predictors, on their own, best explain V. vulnificus concentration? If you make an AIC table of the 14 models, what do the Akaike weights look like? What does this mean? Note that in order to compare models by AIC each model needs to use the same set of samples (rows) this means you should remove any rows that include NAs for any of the 14 predictors, or for the response variable, before fitting any of the models. What is the downside of using only single-predictor models in this context?
# transform data
transformedpredictors = c("Chlorophyll", "NOx", "POC", "Rainfall_5Day", "Silicate", "Tyrosine.like")
other_predictors = c("AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "TotalP", "HIX", "VisibleHumic.like")
vultrans = vul %>%
dplyr::select(vvhA, all_of(transformedpredictors), all_of(other_predictors)) %>% #problems w libraries
mutate(vvhA_log = log(vvhA), #transform bc of exponential growth
across(all_of(transformedpredictors), ~log(.x), #log transform predictors
.names = "log_{.col}")) %>% #new column name for transformed predictors
pivot_longer(cols = c(starts_with("log_"), all_of(other_predictors)),
names_to = "predictor_name",
values_to = "value")
Now make one big model that contains all 14 predictors. Do marginal
null hypothesis tests on the predictors. Which seem important for
explaining V. vulnificus? Collectively, how much variation in V.
vulnificus concentration can be explained? What is a potential downside
of this approach to inference (i.e., one big model with marginal
tests)?
Now construct all possible models containing these 14 predictors (you
can ignore interactions…and recall that there is an R function that
automates this process). Which predictors consistently occur in the
most-supported models? What are the sums of Akaike weights for each
predictor? Plot the fitted relationships from the best model. How do
these results compare to the other approaches? Finally, attempt an
interpretation of the results. What mechanisms could underlie the strong
correlations identified by the analysis?