Load libraries

library(mgcv)
library(here)
library(dplyr)
library(tidyverse)
library(nlme)
library(MuMIn)
library(ggplot2)
library(GGally)
library(ggpubr)
library(GGally)
library(gridExtra)
library(ggeffects)
library(car)
library(ggResidpanel)
library(patchwork)
vul <- read.csv(here("Week_09", "Data", "AlaWai_DiscreteBottle.csv"))

Manipulate data

# convert to factors for categorical data
vul = vul %>%
  mutate(Site = as.factor(Site)) %>%
  mutate(Season = as.factor(Season)) %>%
  mutate(SampleDepth = as.factor(SampleDepth)) %>%
  drop_na()

V. vulnificus abundance

Start by analyzing general spatial and temporal patterns of V. vulnificus abundance. Plot how abundance varies between sites, seasons, and depths.

Plot

ggpairs(vul, columns = c("Site", "vvhA", "Season", "SampleDepth"),
        mapping = aes(color = Season))

Construct an appropriate model that tests whether there are effects of these predictors, and interactions between them. Figure out how the response variable (gene copies per mL) should be modeled (i.e., whether to transform or not, whether to use a non-normal distribution). Note that the response variable is a continuous number — qPCR does not actually count discrete gene copies, it quantifies a continuous number based on PCR cycles that is directly proportional to gene copy concentraiton. What are the magnitudes of the modeled effects? What are your interpretations of the results so far?

Model

vulmod = lm(log(vvhA) ~ Site*Season*SampleDepth, # vvhA (gene copies per mL) as response variable w log transform due to exponential growth, site/season/depth as predictors and all interactions between them
           data = vul)

Magnitudes of modeled effects?

resid_panel(vulmod, plots = c('resid', 'qq', 'lev', 'hist'))

Anova(vulmod)

## Anova Table (Type II tests)
## 
## Response: log(vvhA)
##                          Sum Sq Df F value    Pr(>F)    
## Site                    179.372  7 12.8217 3.266e-11 ***
## Season                  131.176  1 65.6361 3.076e-12 ***
## SampleDepth              59.119  2 14.7905 2.956e-06 ***
## Site:Season              39.139  7  2.7977   0.01128 *  
## Site:SampleDepth         24.301 14  0.8685   0.59454    
## Season:SampleDepth        0.345  2  0.0863   0.91742    
## Site:Season:SampleDepth  23.147 14  0.8273   0.63803    
## Residuals               173.873 87                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual panel follows a normal distribution, the Q-Q plot is fairly linear, and both residual plots are homoscedastic. Site, season, and sample depth have extremely small p-values which shows us they are signficant predictors, and site is the strongest predictor with the highest sum sq value. The interaction site:season shows that the effect of season depends on the location of the sample.

Scatterplots

The authors measured many potential predictors of V. vulnificus, but only ended up using this set when performing model selection: Rainfall_5Day, AirTemp, O2Conc, Salinity, WaterTemp, Turbidity, Chlorophyll, NOx, Silicate, POC, TotalP, Tyrosine.like, HIX, and VisibleHumic.like. The reason is that there is substantial intercorrelation among the full set of predictors, and this set of variables are either relatively easy to measure, or hypothesized to be particularly important. Recreating the authors’ analysis of predictor correlation is outside the scope of this assignment, but you may want to explore it if you are interested.

Use scatterplots of V. vulnificus concentration vs. the predictors to consider which should be transformed for use in a linear model / generalized linear model. Recall that transformation of predictors is not about normality (predictors are not assumed to follow any distribution). Rather, transformation is useful for achieving linear relationships (when using linear models), and for avoiding extreme values that have a large leverage on the modeled relationship.

# potential predictors of V. vulnificus
predictors = c("Rainfall_5Day", "AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "Chlorophyll", "NOx", "Silicate", "POC", "TotalP", "Tyrosine.like", "HIX", "VisibleHumic.like")

vullong = vul %>%
  dplyr::select(vvhA, all_of(predictors)) %>% #problem with other libraries loaded
  pivot_longer(cols = all_of(predictors), 
               names_to = "predictor", 
               values_to = "value")

#scatterplots
ggplot(vullong, aes(x = value, y = log(vvhA))) +
  geom_point(alpha = 0.4) + #opacity
  facet_wrap(~ predictor, scales = "free_x") +
  labs(title = "V. vulnificus vs Environmental predictors",
       x = "Predictor value", y = "V. vulnificus abundance")

Candidates for transformation in a lm/glm: Chlorophyll, NOx (very skewed to the left with high leverage to the right), POC, Rainfall_5Day, Silicate, Tyrosine.like

Single-predictor models

Start by making single-predictor models for each of the 14 predictors listed above. Which predictors, on their own, best explain V. vulnificus concentration? If you make an AIC table of the 14 models, what do the Akaike weights look like? What does this mean? Note that in order to compare models by AIC each model needs to use the same set of samples (rows) this means you should remove any rows that include NAs for any of the 14 predictors, or for the response variable, before fitting any of the models. What is the downside of using only single-predictor models in this context?

# transform data
transformedpredictors = c("Chlorophyll", "NOx", "POC", "Rainfall_5Day", "Silicate", "Tyrosine.like")
other_predictors = c("AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "TotalP", "HIX", "VisibleHumic.like")

vultrans = vul %>%
  dplyr::select(vvhA, all_of(transformedpredictors), all_of(other_predictors)) %>% #problems w libraries
  mutate(logvvhA = log(vvhA), #transform bc of exponential growth
    across(all_of(transformedpredictors), ~log(.x), #log transform predictors
           .names = "log_{.col}")) #new column name for transformed predictors


allpredictors = c("log_Chlorophyll", "log_NOx", "log_POC", "log_Rainfall_5Day", "log_Silicate", "log_Tyrosine.like", "AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "TotalP", "HIX", "VisibleHumic.like")

#store models in lsit
smod <- list()

for (p in allpredictors) {
  vvhAresponse = paste("logvvhA ~", p) #response variable vvha and list of all predicors
  predictorform = as.formula(vvhAresponse)
  smod[[p]] = lm(predictorform, data = vultrans) #store linear model in list w predictor name
}

AIC table

smodaic = model.sel(smod)
print(smodaic)

## Model selection table 
##                      (Int) log_Chl log_NOx log_POC log_Rnf_5Dy log_Slc
## log_Silicate       -1.7810                                        1.22
## VisibleHumic.like  -0.1789                                            
## log_NOx             1.4590           1.016                            
## log_POC            -3.0200                   1.285                    
## AirTemp           -11.9900                                            
## log_Tyrosine.like   7.8030                                            
## WaterTemp         -12.4800                                            
## Salinity            8.7740                                            
## TotalP              0.8958                                            
## log_Rainfall_5Day   2.3360                              0.4848        
## HIX                 1.0660                                            
## Turbidity           1.7050                                            
## log_Chlorophyll     2.4740  0.1802                                    
## O2Conc              2.9570                                            
##                   log_Tyr.lik    ArT      O2C     Sln    WtT    Trb  TtP    HIX
## log_Silicate                                                                   
## VisibleHumic.like                                                              
## log_NOx                                                                        
## log_POC                                                                        
## AirTemp                       0.5569                                           
## log_Tyrosine.like       2.365                                                  
## WaterTemp                                             0.5566                   
## Salinity                                      -0.2123                          
## TotalP                                                              1.89       
## log_Rainfall_5Day                                                              
## HIX                                                                      0.6255
## Turbidity                                                    0.2618            
## log_Chlorophyll                                                                
## O2Conc                               -0.05973                                  
##                   VsH.lik df   logLik  AICc  delta weight
## log_Silicate               3 -244.889 496.0   0.00      1
## VisibleHumic.like   28.18  3 -258.638 523.5  27.50      0
## log_NOx                    3 -271.168 548.5  52.56      0
## log_POC                    3 -273.554 553.3  57.33      0
## AirTemp                    3 -277.252 560.7  64.73      0
## log_Tyrosine.like          3 -280.310 566.8  70.84      0
## WaterTemp                  3 -281.036 568.3  72.29      0
## Salinity                   3 -282.313 570.8  74.85      0
## TotalP                     3 -285.833 577.8  81.89      0
## log_Rainfall_5Day          3 -288.718 583.6  87.66      0
## HIX                        3 -292.821 591.8  95.86      0
## Turbidity                  3 -295.026 596.2 100.27      0
## log_Chlorophyll            3 -298.856 603.9 107.93      0
## O2Conc                     3 -299.959 606.1 110.14      0
## Models ranked by AICc(x)

The Akaike weight of 1 for silicates shows overwhelming support from the model that Silicates is the best predictor. The AIC value is also the lowest for silicate, which acts as further evidence that it is the best predictor. However, this single predictor model does not show us any interactions between these predictors, only their standalone effects on the model.

Big model

Now make one big model that contains all 14 predictors. Do marginal null hypothesis tests on the predictors. Which seem important for explaining V. vulnificus? Collectively, how much variation in V. vulnificus concentration can be explained? What is a potential downside of this approach to inference (i.e., one big model with marginal tests)?

big.model = lm(logvvhA ~ log_Chlorophyll + log_NOx + log_POC + log_Rainfall_5Day + log_Silicate + log_Tyrosine.like + AirTemp + O2Conc + Salinity + WaterTemp + Turbidity + TotalP + HIX + VisibleHumic.like, data = vultrans)

Anova(big.model)

## Anova Table (Type II tests)
## 
## Response: logvvhA
##                    Sum Sq  Df F value    Pr(>F)    
## log_Chlorophyll     0.033   1  0.0283  0.866707    
## log_NOx             0.448   1  0.3805  0.538485    
## log_POC             0.980   1  0.8317  0.363622    
## log_Rainfall_5Day  46.384   1 39.3813 5.746e-09 ***
## log_Silicate       81.627   1 69.3045 1.566e-13 ***
## log_Tyrosine.like   0.982   1  0.8335  0.363088    
## AirTemp            21.043   1 17.8660 4.648e-05 ***
## O2Conc              3.073   1  2.6090  0.108884    
## Salinity           11.764   1  9.9878  0.001995 ** 
## WaterTemp          25.656   1 21.7825 8.015e-06 ***
## Turbidity           3.793   1  3.2200  0.075262 .  
## TotalP              2.578   1  2.1892  0.141601    
## HIX                 0.574   1  0.4870  0.486617    
## VisibleHumic.like   0.064   1  0.0541  0.816534    
## Residuals         141.337 120                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(big.model)

## 
## Call:
## lm(formula = logvvhA ~ log_Chlorophyll + log_NOx + log_POC + 
##     log_Rainfall_5Day + log_Silicate + log_Tyrosine.like + AirTemp + 
##     O2Conc + Salinity + WaterTemp + Turbidity + TotalP + HIX + 
##     VisibleHumic.like, data = vultrans)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73140 -0.64079  0.02841  0.71414  2.63501 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.36774    2.34732  -0.583  0.56120    
## log_Chlorophyll    0.01304    0.07755   0.168  0.86671    
## log_NOx            0.09517    0.15428   0.617  0.53848    
## log_POC            0.17277    0.18945   0.912  0.36362    
## log_Rainfall_5Day  0.41022    0.06537   6.275 5.75e-09 ***
## log_Silicate       1.43393    0.17225   8.325 1.57e-13 ***
## log_Tyrosine.like  0.42732    0.46806   0.913  0.36309    
## AirTemp            0.43403    0.10268   4.227 4.65e-05 ***
## O2Conc            -0.06196    0.03836  -1.615  0.10888    
## Salinity           0.13502    0.04272   3.160  0.00199 ** 
## WaterTemp         -0.61253    0.13124  -4.667 8.02e-06 ***
## Turbidity          0.08731    0.04865   1.794  0.07526 .  
## TotalP             0.42627    0.28810   1.480  0.14160    
## HIX               -0.15701    0.22499  -0.698  0.48662    
## VisibleHumic.like  1.45841    6.27232   0.233  0.81653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.085 on 120 degrees of freedom
## Multiple R-squared:  0.791,  Adjusted R-squared:  0.7666 
## F-statistic: 32.43 on 14 and 120 DF,  p-value: < 2.2e-16

It appears that silicate continues to be the best predictor given that it has the highest chi square value. Rainfall, water temperature, and air temperature are the next strongest predictors in explaining the abundance of V. vulnificus. 79.1% of the variation can be explained by the 14 predictors given the R-squared value. A potential downside of this type of model is that the effect of one predictor can depend on others; for example, air temperature and water temperature could depend on each other, and we’d need to make multiple models to see this.

All possible models

Now construct all possible models containing these 14 predictors (you can ignore interactions…and recall that there is an R function that automates this process). Which predictors consistently occur in the most-supported models? What are the sums of Akaike weights for each predictor? Plot the fitted relationships from the best model. How do these results compare to the other approaches? Finally, attempt an interpretation of the results. What mechanisms could underlie the strong correlations identified by the analysis?

big.model = lm(logvvhA ~ log_Chlorophyll + log_NOx + log_POC + log_Rainfall_5Day + log_Silicate + log_Tyrosine.like + AirTemp + O2Conc + Salinity + WaterTemp + Turbidity + TotalP + HIX + VisibleHumic.like, data = vultrans, family = binomial, na.action = na.pass)

sole.dredge = dredge(big.model, extra = "R^2") # The dredge() function will make all possible models that are subsets of the big.model
# extra=’R^2’ so that it will include a psuedo-R2 as part of the calculation for each model
Anova(big.model)

## Anova Table (Type II tests)
## 
## Response: logvvhA
##                    Sum Sq  Df F value    Pr(>F)    
## log_Chlorophyll     0.033   1  0.0283  0.866707    
## log_NOx             0.448   1  0.3805  0.538485    
## log_POC             0.980   1  0.8317  0.363622    
## log_Rainfall_5Day  46.384   1 39.3813 5.746e-09 ***
## log_Silicate       81.627   1 69.3045 1.566e-13 ***
## log_Tyrosine.like   0.982   1  0.8335  0.363088    
## AirTemp            21.043   1 17.8660 4.648e-05 ***
## O2Conc              3.073   1  2.6090  0.108884    
## Salinity           11.764   1  9.9878  0.001995 ** 
## WaterTemp          25.656   1 21.7825 8.015e-06 ***
## Turbidity           3.793   1  3.2200  0.075262 .  
## TotalP              2.578   1  2.1892  0.141601    
## HIX                 0.574   1  0.4870  0.486617    
## VisibleHumic.like   0.064   1  0.0541  0.816534    
## Residuals         141.337 120                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(big.model)

## 
## Call:
## lm(formula = logvvhA ~ log_Chlorophyll + log_NOx + log_POC + 
##     log_Rainfall_5Day + log_Silicate + log_Tyrosine.like + AirTemp + 
##     O2Conc + Salinity + WaterTemp + Turbidity + TotalP + HIX + 
##     VisibleHumic.like, data = vultrans, na.action = na.pass, 
##     family = binomial)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73140 -0.64079  0.02841  0.71414  2.63501 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.36774    2.34732  -0.583  0.56120    
## log_Chlorophyll    0.01304    0.07755   0.168  0.86671    
## log_NOx            0.09517    0.15428   0.617  0.53848    
## log_POC            0.17277    0.18945   0.912  0.36362    
## log_Rainfall_5Day  0.41022    0.06537   6.275 5.75e-09 ***
## log_Silicate       1.43393    0.17225   8.325 1.57e-13 ***
## log_Tyrosine.like  0.42732    0.46806   0.913  0.36309    
## AirTemp            0.43403    0.10268   4.227 4.65e-05 ***
## O2Conc            -0.06196    0.03836  -1.615  0.10888    
## Salinity           0.13502    0.04272   3.160  0.00199 ** 
## WaterTemp         -0.61253    0.13124  -4.667 8.02e-06 ***
## Turbidity          0.08731    0.04865   1.794  0.07526 .  
## TotalP             0.42627    0.28810   1.480  0.14160    
## HIX               -0.15701    0.22499  -0.698  0.48662    
## VisibleHumic.like  1.45841    6.27232   0.233  0.81653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.085 on 120 degrees of freedom
## Multiple R-squared:  0.791,  Adjusted R-squared:  0.7666 
## F-statistic: 32.43 on 14 and 120 DF,  p-value: < 2.2e-16

Plot for best model

bestmodel = lm(logvvhA ~ log_Rainfall_5Day + log_Silicate + AirTemp + Salinity + WaterTemp, data = vultrans, family = binomial, na.action = na.pass) # we have to set ‘na.action = na.pass’, which is necessary for the building-allmodels function.

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'family' will be disregarded

beffects = plot(ggeffect(bestmodel))

grid.arrange(grobs = beffects, add.data = TRUE, jitter = 0.05, show.title = FALSE)

Within both the single predictor and big models, Silicate has consistently been the best predictor along with rainfall, air temperature, water temperature, and salinity.

Week 9 Predictions