Beginning in 1988, the Hawaiʻi Ocean Time series program has taken near-monthly samples of ocean biogeochemistry and physics at Station ALOHA, which is 100 km north of Oʻahu. Here we will use some of these data to learn about generalized additive models. The data are retrieved from the HOT-DOGS website. We will focus on flow cytometry counts of microorganisms from 20062022. The researchers distinguish four groups of microbes using flow cytometry: the cyanobacterium Prochlorococcus (pro), the cyanobacterium Synechococcus (syn), a diverse group of heterotrophic bacteria (hbact), and a diverse group of small eukaryotes collectively referred to as picoeukaryotic phytoplankton (picoeuk). We will focus only on pro, hbact, and picoeuk, and explore how the niches of these three groups differ from one another.
# load libraries
library(dplyr)
library(tidyverse)
library(mgcv) # gam funct
library(gamair)
library(lubridate) # for editing dates
library(here)
library(GGally)
library(MuMIn)
# data
mlds <- read.csv(here("Week_10", "Data", "hot_mlds.csv"))
cyto <- read.csv(here("Week_10", "Data", "HOT_cyto_counts_edit.csv"))
# "botid" (bottle ID), "date" (date, in mmddyy format), "press" (pressure in decibars) "chl" (fluorometric chlorophyll a concentration, in micrograms L-1), "hbact" (heterotrophic bacteria concentration, in 105 cells mL-1), "pro" (Prochlorococcus concentration, in 105 cells mL-1), "syn" (Synechococcus concentration, in 105 cells mL-1), "picoeuk" (photosynthetic picoeukaryote concentration, in 105 cells mL-1), and "cruise" (cruise ID). Note that pressure in decibars is approximately equal to depth in meters (i.e., depth of the sampling device is measured using a pressure sensor).
# create day of the year predictor
cytodf <- cyto %>%
mutate(date = mdy(date),
day = yday(date))
For each of the three groups of microbes, fit a 2D smoother that characterizes how abundance changes with depth and with day of the year (i.e., from day 1 to day 365 or 366). When fitting the 2D smoother for each type of microbe, consider how the response variable should be modeled (transformed or not, normal or non-normal). You can see the probability distributions available for the gam() function in package mgcv by looking at the help file titled ‘family.mgcv’. Consider whether the basis dimension needs to be increased beyond the default value.
use ggpairs()
ggpairs(cytodf, columns = c('botid', 'date', 'press', 'chl', 'cruise' ), lower = list(continuous = wrap("points", alpha = 0.3, bg = 'blue', pch = 21), combo = wrap("facethist", bins = 10, col = ' green3'))) + theme(axis.text.y = element_text(size = 6), axis.text.x = element_text(size = 6))
Appears that botid and cruise, date and botid, and cruise and date are highly correlated. Can we use just one of these as the date predictor?
ggplot(cytodf,
aes(x = pro)) +
geom_histogram(bins = 50) + theme_light()
ggplot(cytodf,
aes(x = hbact)) +
geom_histogram(bins = 50) + theme_light()
ggplot(cytodf,
aes(x = picoeuk)) +
geom_histogram(bins = 50) + theme_light()
For each of the three groups of microbes, fit a 2D smoother that characterizes how abundance changes with depth and with day of the year (i.e., from day 1 to day 365 or 366). When fitting the 2D smoother for each type of microbe, consider how the response variable should be modeled (transformed or not, normal or non-normal). You can see the probability distributions available for the gam() function in package mgcv by looking at the help file titled ‘family.mgcv’. Consider whether the basis dimension needs to be increased beyond the default value.
pro_gam <- gam(pro ~ s(day, press), data = cytodf)
pro_log_gam <- gam(log(pro + 1e-4) ~ s(day,press), data = cytodf) # log transform data bc of skew
hbact_gam <- gam(hbact ~ s(day, press), data = cytodf)
pico_gam <- gam(picoeuk ~ s(day, press), data = cytodf)
pico_log_gam <- gam(log(picoeuk + 1e-4) ~ s(day,press), data = cytodf) # log transform bc ofskew
pico_log_gamk <- gam(log(picoeuk + 1e-4) ~ s(day,press, k = 50), data = cytodf) # log transform bc ofskew
summary(pro_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## pro ~ s(day, press)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.33514 0.01069 124.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(day,press) 27.5 28.86 221.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.84 Deviance explained = 84.4%
## GCV = 0.14245 Scale est. = 0.13911 n = 1217
The smooth term reports ‘edf’, which is the estimated degrees of freedom. This number is using the wiggliness of the smooth to say “if this smoother were written as a parametric equation, approximately how many parameters would that equation have”
summary(pro_log_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(pro + 1e-04) ~ s(day, press)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.36597 0.01545 -23.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(day,press) 24.84 27.99 334.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.885 Deviance explained = 88.8%
## GCV = 0.29688 Scale est. = 0.29057 n = 1217
summary(hbact_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## hbact ~ s(day, press)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88022 0.01811 214.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(day,press) 22.85 26.88 123.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.732 Deviance explained = 73.7%
## GCV = 0.40709 Scale est. = 0.39911 n = 1217
summary(pico_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## picoeuk ~ s(day, press)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009935 0.000161 61.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(day,press) 23.52 27.3 24.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.362 Deviance explained = 37.4%
## GCV = 3.218e-05 Scale est. = 3.1531e-05 n = 1217
summary(pico_log_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(picoeuk + 1e-04) ~ s(day, press)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.90191 0.01723 -284.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(day,press) 25.49 28.27 62.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.593 Deviance explained = 60.2%
## GCV = 0.36944 Scale est. = 0.3614 n = 1217
We observe that higher deviance is explained for the log transformed GAMs.
gam.check(pro_gam)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 2.946778e-07 .
## The Hessian was positive definite.
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(day,press) 29.0 27.5 1.06 1
gam.check(pro_log_gam)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 1.953061e-05 .
## The Hessian was positive definite.
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(day,press) 29.0 24.8 1.04 0.87
gam.check(hbact_gam)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 6 iterations.
## The RMS GCV score gradient at convergence was 2.095162e-06 .
## The Hessian was positive definite.
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(day,press) 29.0 22.8 0.96 0.035 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
k’ = 29, the basis dimension, is observed to be too low. The meaning of the basis dimension depends on the kind of smoother used, but it effectively determines the upper bound on the complexity/wiggliness of the smoother. Appears to be regularly distributed
gam.check(pico_gam)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 7.945724e-08 .
## The Hessian was positive definite.
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(day,press) 29.0 23.5 1.04 0.88
Skewed to the left, requires a transformation. Low p-value (k-index<1) may indicate that k is too low as well.
gam.check(pico_log_gam)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 6.684561e-06 .
## The Hessian was positive definite.
## Model rank = 30 / 30
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(day,press) 29.0 25.5 1.04 0.8
gam.check(pico_log_gamk)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradient at convergence was 1.061938e-05 .
## The Hessian was positive definite.
## Model rank = 50 / 50
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(day,press) 49.0 33.7 1.05 0.95
Plot the fitted smoother in a way that is visually appealing. Finally, figure out how to test whether the relationship between abundance and depth changes over time or not. What are your interpretations of the results so far?
plot(pro_log_gam)
plot(hbact_gam)
plot(pico_log_gam)
plot(pico_log_gamk)
Now let’s compare the niches of the three groups to each other. Use a GAM including all groups simultaneously to simultaneously test three questions: (a) Do the different kinds of microbes have different mean abundances? (b) Do the different kinds of microbes have different average depth distributions (i.e., averaging over time)? (c) Do the different kinds of microbes have different average seasonal dynamics (i.e., averaging over depths)?
To fit GAM(s) including all three groups simultaneously you will need to convert the data to ‘long’ format, where there is a column that contains all the concentrations of all three types of microbes, and a second column that codes which microbe was counted in that row, as well as additional columns for the other model predictors. You can convert to long format by hand using a spreadsheet, or you can use a helpful function called pivot_longer() in the package ‘tidyr’.
Note that the s() function for smoothers can fit interactions, where different groups in a categorical predictor have smoothers with different shapes. This is specified like s(continuous_predictor, by = categorical_predictor). Now that you have fit models to test questions (a)-(c), make appropriate plots and perform appropriate hypothesis tests. How do you intepret the results?
cyto_long <- cytodf %>%
pivot_longer(cols = c(pro, hbact, picoeuk),
names_to = "group",
values_to = "abundance") %>%
mutate(log_abundance = log(abundance + 1e-4))
cyto_long$group <- as.factor(cyto_long$group)
#a GAM including all groups simultaneously
gam_all <- gam(log_abundance ~ group + #for mean abundance
s(press, by = group) + #for depth
s(day, by = group), #for season
data = cyto_long)
summary(gam_all)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log_abundance ~ group + s(press, by = group) + s(day, by = group)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.29933 0.01495 86.89 <2e-16 ***
## grouppicoeuk -6.20124 0.02115 -293.24 <2e-16 ***
## grouppro -1.66530 0.02115 -78.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(press):grouphbact 2.932 3.630 108.249 <2e-16 ***
## s(press):grouppicoeuk 8.329 8.784 216.866 <2e-16 ***
## s(press):grouppro 4.720 5.655 1684.365 <2e-16 ***
## s(day):grouphbact 2.255 2.800 2.212 0.0724 .
## s(day):grouppicoeuk 8.646 8.963 19.294 <2e-16 ***
## s(day):grouppro 5.833 6.982 30.040 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.966 Deviance explained = 96.7%
## GCV = 0.27482 Scale est. = 0.27213 n = 3651
From the parametric coefficient results, we can observe that the microbes have different mean abundances. Given the edf values, we are also able to observe picoeukaryotic phytoplankton is highly complex while heterotrophic bacteria is less complex. The range of values indicates different average depth distributions. We also observe different seasonal dynamics; however, heterotrophic bacteria does not appear to follow a seasonal pattern given the p-value of 0.0724.
gam.check(gam_all)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 14 iterations.
## The RMS GCV score gradient at convergence was 4.920004e-06 .
## The Hessian was not positive definite.
## Model rank = 57 / 57
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(press):grouphbact 9.00 2.93 0.91 <2e-16 ***
## s(press):grouppicoeuk 9.00 8.33 0.91 <2e-16 ***
## s(press):grouppro 9.00 4.72 0.91 <2e-16 ***
## s(day):grouphbact 9.00 2.25 0.82 <2e-16 ***
## s(day):grouppicoeuk 9.00 8.65 0.82 <2e-16 ***
## s(day):grouppro 9.00 5.83 0.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using only data from the top 45 meters, how does the concentration of each group of microbes vary with (a) mixed layer depth and (b) Chl a concentration? Test whether the three types of microbes exhibit different relationships with the two predictors. Use appropriate GAM(s), hypothesis tests, and smoother plots to assess these questions.
# Merge data
mldsdf <- mlds %>%
mutate(date = mdy(date))
mergeddf <- merge(cyto_long, mldsdf, by = "date", all = TRUE) %>%
filter(press <= 45)
# GAM
gam_mixed <- gam(log_abundance ~ group + s(mean, by = group) +
s(log(chl), by = group),
data = mergeddf)
summary(gam_mixed)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log_abundance ~ group + s(mean, by = group) + s(log(chl), by = group)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.55332 0.02545 61.04 <2e-16 ***
## grouppicoeuk -6.31603 0.03599 -175.50 <2e-16 ***
## grouppro -0.87879 0.03599 -24.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(mean):grouphbact 3.120 3.889 2.689 0.0418 *
## s(mean):grouppicoeuk 6.606 7.704 2.548 0.0131 *
## s(mean):grouppro 1.000 1.000 1.218 0.2710
## s(log(chl)):grouphbact 1.000 1.000 2.437 0.1199
## s(log(chl)):grouppicoeuk 6.557 7.615 5.913 2.06e-06 ***
## s(log(chl)):grouppro 1.630 2.049 1.275 0.2722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.993 Deviance explained = 99.4%
## GCV = 0.059199 Scale est. = 0.053752 n = 249
gam.check(gam_mixed)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 16 iterations.
## The RMS GCV score gradient at convergence was 2.473004e-08 .
## The Hessian was positive definite.
## Model rank = 57 / 57
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(mean):grouphbact 9.00 3.12 0.98 0.41
## s(mean):grouppicoeuk 9.00 6.61 0.98 0.33
## s(mean):grouppro 9.00 1.00 0.98 0.34
## s(log(chl)):grouphbact 9.00 1.00 1.07 0.88
## s(log(chl)):grouppicoeuk 9.00 6.56 1.07 0.88
## s(log(chl)):grouppro 9.00 1.63 1.07 0.84
This is likely overfitting and requires a penalty term
par(mfrow = c(2,3))
plot(gam_mixed)
gam_mixed_simple <- gam(log_abundance ~ group + s(mean) +
s(log(chl)),
data = mergeddf)
summary(gam_mixed_simple)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log_abundance ~ group + s(mean) + s(log(chl))
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.55332 0.03015 51.51 <2e-16 ***
## grouppicoeuk -6.31603 0.04264 -148.11 <2e-16 ***
## grouppro -0.87879 0.04264 -20.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(mean) 3.198 3.971 1.605 0.18893
## s(log(chl)) 2.321 2.948 4.188 0.00696 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.99 Deviance explained = 99.1%
## GCV = 0.078145 Scale est. = 0.075471 n = 249
AICc(gam_mixed)
## [1] 7.817445
AICc(gam_mixed_simple)
## [1] 74.42416
The concentration of each group of microbes varies with mixed layer depth and Chl a concentration.