Load libraries

library(mgcv)
library(here)
library(dplyr)
library(tidyverse)
library(nlme)
library(ggplot2)
library(GGally)
library(ggpubr)
library(GGally)
library(gridExtra)
library(ggeffects)
library(car)
library(ggResidpanel)
library(patchwork)
vul <- read.csv(here("Week_09", "Data", "AlaWai_DiscreteBottle.csv"))

Manipulate data

# convert to factors for categorical data
vul = vul %>%
  mutate(Site = as.factor(Site)) %>%
  mutate(Season = as.factor(Season)) %>%
  mutate(SampleDepth = as.factor(SampleDepth)) %>%
  drop_na()

V. vulnificus abundance

Start by analyzing general spatial and temporal patterns of V. vulnificus abundance. Plot how abundance varies between sites, seasons, and depths.

Plot

ggpairs(vul, columns = c("Site", "vvhA", "Season", "SampleDepth"),
        mapping = aes(color = Season))

Construct an appropriate model that tests whether there are effects of these predictors, and interactions between them. Figure out how the response variable (gene copies per mL) should be modeled (i.e., whether to transform or not, whether to use a non-normal distribution). Note that the response variable is a continuous number — qPCR does not actually count discrete gene copies, it quantifies a continuous number based on PCR cycles that is directly proportional to gene copy concentraiton. What are the magnitudes of the modeled effects? What are your interpretations of the results so far?

Model

vulmod = lm(log(vvhA) ~ Site*Season*SampleDepth, # vvhA (gene copies per mL) as response variable w log transform due to exponential growth, site/season/depth as predictors and all interactions between them
           data = vul)

Magnitudes of modeled effects?

resid_panel(vulmod, plots = c('resid', 'qq', 'lev', 'hist'))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggResidpanel package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in plot_lev(model = model, type = type, smoother = smoother, theme = theme, : Observations with a leverage value of 1 are not included
##               in the residuals versus leverage plot.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggResidpanel package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Anova(vulmod)
## Anova Table (Type II tests)
## 
## Response: log(vvhA)
##                          Sum Sq Df F value    Pr(>F)    
## Site                    179.372  7 12.8217 3.266e-11 ***
## Season                  131.176  1 65.6361 3.076e-12 ***
## SampleDepth              59.119  2 14.7905 2.956e-06 ***
## Site:Season              39.139  7  2.7977   0.01128 *  
## Site:SampleDepth         24.301 14  0.8685   0.59454    
## Season:SampleDepth        0.345  2  0.0863   0.91742    
## Site:Season:SampleDepth  23.147 14  0.8273   0.63803    
## Residuals               173.873 87                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual panel follows a normal distribution, the Q-Q plot is fairly linear, and both residual plots are homoscedastic. Site, season, and sample depth have extremely small p-values which shows us they are signficant predictors, and site is the strongest predictor with the highest sum sq value. The interaction site:season shows that the effect of season depends on the location of the sample.

Scatterplots

The authors measured many potential predictors of V. vulnificus, but only ended up using this set when performing model selection: Rainfall_5Day, AirTemp, O2Conc, Salinity, WaterTemp, Turbidity, Chlorophyll, NOx, Silicate, POC, TotalP, Tyrosine.like, HIX, and VisibleHumic.like. The reason is that there is substantial intercorrelation among the full set of predictors, and this set of variables are either relatively easy to measure, or hypothesized to be particularly important. Recreating the authors’ analysis of predictor correlation is outside the scope of this assignment, but you may want to explore it if you are interested.

Use scatterplots of V. vulnificus concentration vs. the predictors to consider which should be transformed for use in a linear model / generalized linear model. Recall that transformation of predictors is not about normality (predictors are not assumed to follow any distribution). Rather, transformation is useful for achieving linear relationships (when using linear models), and for avoiding extreme values that have a large leverage on the modeled relationship.

# potential predictors of V. vulnificus
predictors = c("Rainfall_5Day", "AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "Chlorophyll", "NOx", "Silicate", "POC", "TotalP", "Tyrosine.like", "HIX", "VisibleHumic.like")

vullong = vul %>%
  dplyr::select(vvhA, all_of(predictors)) %>% #problem with other libraries loaded
  pivot_longer(cols = all_of(predictors), 
               names_to = "predictor", 
               values_to = "value")

#scatterplots
ggplot(vullong, aes(x = value, y = log(vvhA))) +
  geom_point(alpha = 0.4) + #opacity
  facet_wrap(~ predictor, scales = "free_x") +
  labs(title = "V. vulnificus vs Environmental predictors",
       x = "Predictor value", y = "V. vulnificus abundance") 

Candidates for transformation in a lm/glm: Chlorophyll, NOx (very skewed to the left with high leverage to the right), POC, Rainfall_5Day, Silicate, Tyrosine.like

Single-predictor models

Start by making single-predictor models for each of the 14 predictors listed above. Which predictors, on their own, best explain V. vulnificus concentration? If you make an AIC table of the 14 models, what do the Akaike weights look like? What does this mean? Note that in order to compare models by AIC each model needs to use the same set of samples (rows) this means you should remove any rows that include NAs for any of the 14 predictors, or for the response variable, before fitting any of the models. What is the downside of using only single-predictor models in this context?

# transform data
transformedpredictors = c("Chlorophyll", "NOx", "POC", "Rainfall_5Day", "Silicate", "Tyrosine.like")
other_predictors = c("AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "TotalP", "HIX", "VisibleHumic.like")

vultrans = vul %>%
  dplyr::select(vvhA, all_of(transformedpredictors), all_of(other_predictors)) %>% #problems w libraries
  mutate(vvhA_log = log(vvhA), #transform bc of exponential growth
    across(all_of(transformedpredictors), ~log(.x), #log transform predictors
           .names = "log_{.col}")) %>% #new column name for transformed predictors
  pivot_longer(cols = c(starts_with("log_"), all_of(other_predictors)), 
               names_to = "predictor_name", 
               values_to = "value")

Big model

Now make one big model that contains all 14 predictors. Do marginal null hypothesis tests on the predictors. Which seem important for explaining V. vulnificus? Collectively, how much variation in V. vulnificus concentration can be explained? What is a potential downside of this approach to inference (i.e., one big model with marginal tests)?
Now construct all possible models containing these 14 predictors (you can ignore interactions…and recall that there is an R function that automates this process). Which predictors consistently occur in the most-supported models? What are the sums of Akaike weights for each predictor? Plot the fitted relationships from the best model. How do these results compare to the other approaches? Finally, attempt an interpretation of the results. What mechanisms could underlie the strong correlations identified by the analysis?