library(mgcv)
library(here)
library(dplyr)
library(tidyverse)
library(nlme)
library(MuMIn)
library(ggplot2)
library(GGally)
library(ggpubr)
library(GGally)
library(gridExtra)
library(ggeffects)
library(car)
library(ggResidpanel)
library(patchwork)
vul <- read.csv(here("Week_09", "Data", "AlaWai_DiscreteBottle.csv"))
# convert to factors for categorical data
vul = vul %>%
mutate(Site = as.factor(Site)) %>%
mutate(Season = as.factor(Season)) %>%
mutate(SampleDepth = as.factor(SampleDepth)) %>%
drop_na()
Start by analyzing general spatial and temporal patterns of V. vulnificus abundance. Plot how abundance varies between sites, seasons, and depths.
ggpairs(vul, columns = c("Site", "vvhA", "Season", "SampleDepth"),
mapping = aes(color = Season))
Construct an appropriate model that tests whether there are effects of these predictors, and interactions between them. Figure out how the response variable (gene copies per mL) should be modeled (i.e., whether to transform or not, whether to use a non-normal distribution). Note that the response variable is a continuous number — qPCR does not actually count discrete gene copies, it quantifies a continuous number based on PCR cycles that is directly proportional to gene copy concentraiton. What are the magnitudes of the modeled effects? What are your interpretations of the results so far?
vulmod = lm(log(vvhA) ~ Site*Season*SampleDepth, # vvhA (gene copies per mL) as response variable w log transform due to exponential growth, site/season/depth as predictors and all interactions between them
data = vul)
resid_panel(vulmod, plots = c('resid', 'qq', 'lev', 'hist'))
Anova(vulmod)
## Anova Table (Type II tests)
##
## Response: log(vvhA)
## Sum Sq Df F value Pr(>F)
## Site 179.372 7 12.8217 3.266e-11 ***
## Season 131.176 1 65.6361 3.076e-12 ***
## SampleDepth 59.119 2 14.7905 2.956e-06 ***
## Site:Season 39.139 7 2.7977 0.01128 *
## Site:SampleDepth 24.301 14 0.8685 0.59454
## Season:SampleDepth 0.345 2 0.0863 0.91742
## Site:Season:SampleDepth 23.147 14 0.8273 0.63803
## Residuals 173.873 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The residual panel follows a normal distribution, the Q-Q plot is fairly linear, and both residual plots are homoscedastic. Site, season, and sample depth have extremely small p-values which shows us they are signficant predictors, and site is the strongest predictor with the highest sum sq value. The interaction site:season shows that the effect of season depends on the location of the sample.
The authors measured many potential predictors of V. vulnificus, but only ended up using this set when performing model selection: Rainfall_5Day, AirTemp, O2Conc, Salinity, WaterTemp, Turbidity, Chlorophyll, NOx, Silicate, POC, TotalP, Tyrosine.like, HIX, and VisibleHumic.like. The reason is that there is substantial intercorrelation among the full set of predictors, and this set of variables are either relatively easy to measure, or hypothesized to be particularly important. Recreating the authors’ analysis of predictor correlation is outside the scope of this assignment, but you may want to explore it if you are interested.
Use scatterplots of V. vulnificus concentration vs. the predictors to consider which should be transformed for use in a linear model / generalized linear model. Recall that transformation of predictors is not about normality (predictors are not assumed to follow any distribution). Rather, transformation is useful for achieving linear relationships (when using linear models), and for avoiding extreme values that have a large leverage on the modeled relationship.
# potential predictors of V. vulnificus
predictors = c("Rainfall_5Day", "AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "Chlorophyll", "NOx", "Silicate", "POC", "TotalP", "Tyrosine.like", "HIX", "VisibleHumic.like")
vullong = vul %>%
dplyr::select(vvhA, all_of(predictors)) %>% #problem with other libraries loaded
pivot_longer(cols = all_of(predictors),
names_to = "predictor",
values_to = "value")
#scatterplots
ggplot(vullong, aes(x = value, y = log(vvhA))) +
geom_point(alpha = 0.4) + #opacity
facet_wrap(~ predictor, scales = "free_x") +
labs(title = "V. vulnificus vs Environmental predictors",
x = "Predictor value", y = "V. vulnificus abundance")
Candidates for transformation in a lm/glm: Chlorophyll, NOx (very skewed to the left with high leverage to the right), POC, Rainfall_5Day, Silicate, Tyrosine.like
Start by making single-predictor models for each of the 14 predictors listed above. Which predictors, on their own, best explain V. vulnificus concentration? If you make an AIC table of the 14 models, what do the Akaike weights look like? What does this mean? Note that in order to compare models by AIC each model needs to use the same set of samples (rows) this means you should remove any rows that include NAs for any of the 14 predictors, or for the response variable, before fitting any of the models. What is the downside of using only single-predictor models in this context?
# transform data
transformedpredictors = c("Chlorophyll", "NOx", "POC", "Rainfall_5Day", "Silicate", "Tyrosine.like")
other_predictors = c("AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "TotalP", "HIX", "VisibleHumic.like")
vultrans = vul %>%
dplyr::select(vvhA, all_of(transformedpredictors), all_of(other_predictors)) %>% #problems w libraries
mutate(logvvhA = log(vvhA), #transform bc of exponential growth
across(all_of(transformedpredictors), ~log(.x), #log transform predictors
.names = "log_{.col}")) #new column name for transformed predictors
allpredictors = c("log_Chlorophyll", "log_NOx", "log_POC", "log_Rainfall_5Day", "log_Silicate", "log_Tyrosine.like", "AirTemp", "O2Conc", "Salinity", "WaterTemp", "Turbidity", "TotalP", "HIX", "VisibleHumic.like")
#store models in lsit
smod <- list()
for (p in allpredictors) {
vvhAresponse = paste("logvvhA ~", p) #response variable vvha and list of all predicors
predictorform = as.formula(vvhAresponse)
smod[[p]] = lm(predictorform, data = vultrans) #store linear model in list w predictor name
}
smodaic = model.sel(smod)
print(smodaic)
## Model selection table
## (Int) log_Chl log_NOx log_POC log_Rnf_5Dy log_Slc
## log_Silicate -1.7810 1.22
## VisibleHumic.like -0.1789
## log_NOx 1.4590 1.016
## log_POC -3.0200 1.285
## AirTemp -11.9900
## log_Tyrosine.like 7.8030
## WaterTemp -12.4800
## Salinity 8.7740
## TotalP 0.8958
## log_Rainfall_5Day 2.3360 0.4848
## HIX 1.0660
## Turbidity 1.7050
## log_Chlorophyll 2.4740 0.1802
## O2Conc 2.9570
## log_Tyr.lik ArT O2C Sln WtT Trb TtP HIX
## log_Silicate
## VisibleHumic.like
## log_NOx
## log_POC
## AirTemp 0.5569
## log_Tyrosine.like 2.365
## WaterTemp 0.5566
## Salinity -0.2123
## TotalP 1.89
## log_Rainfall_5Day
## HIX 0.6255
## Turbidity 0.2618
## log_Chlorophyll
## O2Conc -0.05973
## VsH.lik df logLik AICc delta weight
## log_Silicate 3 -244.889 496.0 0.00 1
## VisibleHumic.like 28.18 3 -258.638 523.5 27.50 0
## log_NOx 3 -271.168 548.5 52.56 0
## log_POC 3 -273.554 553.3 57.33 0
## AirTemp 3 -277.252 560.7 64.73 0
## log_Tyrosine.like 3 -280.310 566.8 70.84 0
## WaterTemp 3 -281.036 568.3 72.29 0
## Salinity 3 -282.313 570.8 74.85 0
## TotalP 3 -285.833 577.8 81.89 0
## log_Rainfall_5Day 3 -288.718 583.6 87.66 0
## HIX 3 -292.821 591.8 95.86 0
## Turbidity 3 -295.026 596.2 100.27 0
## log_Chlorophyll 3 -298.856 603.9 107.93 0
## O2Conc 3 -299.959 606.1 110.14 0
## Models ranked by AICc(x)
The Akaike weight of 1 for silicates shows overwhelming support from the model that Silicates is the best predictor. The AIC value is also the lowest for silicate, which acts as further evidence that it is the best predictor. However, this single predictor model does not show us any interactions between these predictors, only their standalone effects on the model.
Now make one big model that contains all 14 predictors. Do marginal null hypothesis tests on the predictors. Which seem important for explaining V. vulnificus? Collectively, how much variation in V. vulnificus concentration can be explained? What is a potential downside of this approach to inference (i.e., one big model with marginal tests)?
big.model = lm(logvvhA ~ log_Chlorophyll + log_NOx + log_POC + log_Rainfall_5Day + log_Silicate + log_Tyrosine.like + AirTemp + O2Conc + Salinity + WaterTemp + Turbidity + TotalP + HIX + VisibleHumic.like, data = vultrans)
Anova(big.model)
## Anova Table (Type II tests)
##
## Response: logvvhA
## Sum Sq Df F value Pr(>F)
## log_Chlorophyll 0.033 1 0.0283 0.866707
## log_NOx 0.448 1 0.3805 0.538485
## log_POC 0.980 1 0.8317 0.363622
## log_Rainfall_5Day 46.384 1 39.3813 5.746e-09 ***
## log_Silicate 81.627 1 69.3045 1.566e-13 ***
## log_Tyrosine.like 0.982 1 0.8335 0.363088
## AirTemp 21.043 1 17.8660 4.648e-05 ***
## O2Conc 3.073 1 2.6090 0.108884
## Salinity 11.764 1 9.9878 0.001995 **
## WaterTemp 25.656 1 21.7825 8.015e-06 ***
## Turbidity 3.793 1 3.2200 0.075262 .
## TotalP 2.578 1 2.1892 0.141601
## HIX 0.574 1 0.4870 0.486617
## VisibleHumic.like 0.064 1 0.0541 0.816534
## Residuals 141.337 120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(big.model)
##
## Call:
## lm(formula = logvvhA ~ log_Chlorophyll + log_NOx + log_POC +
## log_Rainfall_5Day + log_Silicate + log_Tyrosine.like + AirTemp +
## O2Conc + Salinity + WaterTemp + Turbidity + TotalP + HIX +
## VisibleHumic.like, data = vultrans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73140 -0.64079 0.02841 0.71414 2.63501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.36774 2.34732 -0.583 0.56120
## log_Chlorophyll 0.01304 0.07755 0.168 0.86671
## log_NOx 0.09517 0.15428 0.617 0.53848
## log_POC 0.17277 0.18945 0.912 0.36362
## log_Rainfall_5Day 0.41022 0.06537 6.275 5.75e-09 ***
## log_Silicate 1.43393 0.17225 8.325 1.57e-13 ***
## log_Tyrosine.like 0.42732 0.46806 0.913 0.36309
## AirTemp 0.43403 0.10268 4.227 4.65e-05 ***
## O2Conc -0.06196 0.03836 -1.615 0.10888
## Salinity 0.13502 0.04272 3.160 0.00199 **
## WaterTemp -0.61253 0.13124 -4.667 8.02e-06 ***
## Turbidity 0.08731 0.04865 1.794 0.07526 .
## TotalP 0.42627 0.28810 1.480 0.14160
## HIX -0.15701 0.22499 -0.698 0.48662
## VisibleHumic.like 1.45841 6.27232 0.233 0.81653
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.085 on 120 degrees of freedom
## Multiple R-squared: 0.791, Adjusted R-squared: 0.7666
## F-statistic: 32.43 on 14 and 120 DF, p-value: < 2.2e-16
It appears that silicate continues to be the best predictor given that it has the highest chi square value. Rainfall, water temperature, and air temperature are the next strongest predictors in explaining the abundance of V. vulnificus. 79.1% of the variation can be explained by the 14 predictors given the R-squared value. A potential downside of this type of model is that the effect of one predictor can depend on others; for example, air temperature and water temperature could depend on each other, and we’d need to make multiple models to see this.
Now construct all possible models containing these 14 predictors (you can ignore interactions…and recall that there is an R function that automates this process). Which predictors consistently occur in the most-supported models? What are the sums of Akaike weights for each predictor? Plot the fitted relationships from the best model. How do these results compare to the other approaches? Finally, attempt an interpretation of the results. What mechanisms could underlie the strong correlations identified by the analysis?
big.model = lm(logvvhA ~ log_Chlorophyll + log_NOx + log_POC + log_Rainfall_5Day + log_Silicate + log_Tyrosine.like + AirTemp + O2Conc + Salinity + WaterTemp + Turbidity + TotalP + HIX + VisibleHumic.like, data = vultrans, family = binomial, na.action = na.pass)
sole.dredge = dredge(big.model, extra = "R^2") # The dredge() function will make all possible models that are subsets of the big.model
# extra=’R^2’ so that it will include a psuedo-R2 as part of the calculation for each model
Anova(big.model)
## Anova Table (Type II tests)
##
## Response: logvvhA
## Sum Sq Df F value Pr(>F)
## log_Chlorophyll 0.033 1 0.0283 0.866707
## log_NOx 0.448 1 0.3805 0.538485
## log_POC 0.980 1 0.8317 0.363622
## log_Rainfall_5Day 46.384 1 39.3813 5.746e-09 ***
## log_Silicate 81.627 1 69.3045 1.566e-13 ***
## log_Tyrosine.like 0.982 1 0.8335 0.363088
## AirTemp 21.043 1 17.8660 4.648e-05 ***
## O2Conc 3.073 1 2.6090 0.108884
## Salinity 11.764 1 9.9878 0.001995 **
## WaterTemp 25.656 1 21.7825 8.015e-06 ***
## Turbidity 3.793 1 3.2200 0.075262 .
## TotalP 2.578 1 2.1892 0.141601
## HIX 0.574 1 0.4870 0.486617
## VisibleHumic.like 0.064 1 0.0541 0.816534
## Residuals 141.337 120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(big.model)
##
## Call:
## lm(formula = logvvhA ~ log_Chlorophyll + log_NOx + log_POC +
## log_Rainfall_5Day + log_Silicate + log_Tyrosine.like + AirTemp +
## O2Conc + Salinity + WaterTemp + Turbidity + TotalP + HIX +
## VisibleHumic.like, data = vultrans, na.action = na.pass,
## family = binomial)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73140 -0.64079 0.02841 0.71414 2.63501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.36774 2.34732 -0.583 0.56120
## log_Chlorophyll 0.01304 0.07755 0.168 0.86671
## log_NOx 0.09517 0.15428 0.617 0.53848
## log_POC 0.17277 0.18945 0.912 0.36362
## log_Rainfall_5Day 0.41022 0.06537 6.275 5.75e-09 ***
## log_Silicate 1.43393 0.17225 8.325 1.57e-13 ***
## log_Tyrosine.like 0.42732 0.46806 0.913 0.36309
## AirTemp 0.43403 0.10268 4.227 4.65e-05 ***
## O2Conc -0.06196 0.03836 -1.615 0.10888
## Salinity 0.13502 0.04272 3.160 0.00199 **
## WaterTemp -0.61253 0.13124 -4.667 8.02e-06 ***
## Turbidity 0.08731 0.04865 1.794 0.07526 .
## TotalP 0.42627 0.28810 1.480 0.14160
## HIX -0.15701 0.22499 -0.698 0.48662
## VisibleHumic.like 1.45841 6.27232 0.233 0.81653
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.085 on 120 degrees of freedom
## Multiple R-squared: 0.791, Adjusted R-squared: 0.7666
## F-statistic: 32.43 on 14 and 120 DF, p-value: < 2.2e-16
bestmodel = lm(logvvhA ~ log_Rainfall_5Day + log_Silicate + AirTemp + Salinity + WaterTemp, data = vultrans, family = binomial, na.action = na.pass) # we have to set ‘na.action = na.pass’, which is necessary for the building-allmodels function.
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'family' will be disregarded
beffects = plot(ggeffect(bestmodel))
grid.arrange(grobs = beffects, add.data = TRUE, jitter = 0.05, show.title = FALSE)
Within both the single predictor and big models, Silicate has consistently been the best predictor along with rainfall, air temperature, water temperature, and salinity.