Load libraries and data

Beginning in 1988, the Hawaiʻi Ocean Time series program has taken near-monthly samples of ocean biogeochemistry and physics at Station ALOHA, which is 100 km north of Oʻahu. Here we will use some of these data to learn about generalized additive models. The data are retrieved from the HOT-DOGS website. We will focus on flow cytometry counts of microorganisms from 20062022. The researchers distinguish four groups of microbes using flow cytometry: the cyanobacterium Prochlorococcus (pro), the cyanobacterium Synechococcus (syn), a diverse group of heterotrophic bacteria (hbact), and a diverse group of small eukaryotes collectively referred to as picoeukaryotic phytoplankton (picoeuk). We will focus only on pro, hbact, and picoeuk, and explore how the niches of these three groups differ from one another.

# load libraries
library(dplyr)
library(tidyverse)
library(mgcv) # gam funct
library(gamair)
library(lubridate) # for editing dates
library(here)
library(GGally)
library(MuMIn)

# data
mlds <- read.csv(here("Week_10", "Data", "hot_mlds.csv"))
cyto <- read.csv(here("Week_10", "Data", "HOT_cyto_counts_edit.csv"))
# "botid" (bottle ID), "date" (date, in mmddyy format), "press" (pressure in decibars) "chl" (fluorometric chlorophyll a concentration, in micrograms L-1), "hbact" (heterotrophic bacteria concentration, in 105 cells mL-1), "pro" (Prochlorococcus concentration, in 105 cells mL-1), "syn" (Synechococcus concentration, in 105 cells mL-1), "picoeuk" (photosynthetic picoeukaryote concentration, in 105 cells mL-1), and "cruise" (cruise ID). Note that pressure in decibars is approximately equal to depth in meters (i.e., depth of the sampling device is measured using a pressure sensor).

# create day of the year predictor
cytodf <- cyto %>%
  mutate(date = mdy(date), 
         day = yday(date))

2D Smoothers - Abundance over Days and Depth

For each of the three groups of microbes, fit a 2D smoother that characterizes how abundance changes with depth and with day of the year (i.e., from day 1 to day 365 or 366). When fitting the 2D smoother for each type of microbe, consider how the response variable should be modeled (transformed or not, normal or non-normal). You can see the probability distributions available for the gam() function in package mgcv by looking at the help file titled ‘family.mgcv’. Consider whether the basis dimension needs to be increased beyond the default value.

Look at correlation among the predictors

use ggpairs()

ggpairs(cytodf, columns = c('botid', 'date', 'press', 'chl', 'cruise' ), lower = list(continuous = wrap("points", alpha = 0.3, bg = 'blue', pch = 21), combo = wrap("facethist", bins = 10, col = ' green3'))) + theme(axis.text.y = element_text(size = 6), axis.text.x = element_text(size = 6))

Appears that botid and cruise, date and botid, and cruise and date are highly correlated. Can we use just one of these as the date predictor?

2D models

Prochlorococcus

ggplot(cytodf,
       aes(x = pro)) +
  geom_histogram(bins = 50) + theme_light()

Heterotrophic bacteria

ggplot(cytodf,
       aes(x = hbact)) +
  geom_histogram(bins = 50) + theme_light()

Picoeukaryotic phytoplankton

ggplot(cytodf,
       aes(x = picoeuk)) +
  geom_histogram(bins = 50) + theme_light()

Set up GAMs

For each of the three groups of microbes, fit a 2D smoother that characterizes how abundance changes with depth and with day of the year (i.e., from day 1 to day 365 or 366). When fitting the 2D smoother for each type of microbe, consider how the response variable should be modeled (transformed or not, normal or non-normal). You can see the probability distributions available for the gam() function in package mgcv by looking at the help file titled ‘family.mgcv’. Consider whether the basis dimension needs to be increased beyond the default value.

pro_gam <- gam(pro ~ s(day, press), data = cytodf)
pro_log_gam <- gam(log(pro + 1e-4) ~ s(day,press), data = cytodf) # log transform data bc of skew
hbact_gam <- gam(hbact ~ s(day, press), data = cytodf)
pico_gam <- gam(picoeuk ~ s(day, press), data = cytodf)
pico_log_gam <- gam(log(picoeuk + 1e-4) ~ s(day,press), data = cytodf) # log transform bc ofskew
pico_log_gamk <- gam(log(picoeuk + 1e-4) ~ s(day,press, k = 50), data = cytodf) # log transform bc ofskew

Inspect the models

Prochlorococcus

summary(pro_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## pro ~ s(day, press)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.33514    0.01069   124.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df     F p-value    
## s(day,press) 27.5  28.86 221.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.84   Deviance explained = 84.4%
## GCV = 0.14245  Scale est. = 0.13911   n = 1217

The smooth term reports ‘edf’, which is the estimated degrees of freedom. This number is using the wiggliness of the smooth to say “if this smoother were written as a parametric equation, approximately how many parameters would that equation have”

log Prochlorococcus

summary(pro_log_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(pro + 1e-04) ~ s(day, press)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.36597    0.01545  -23.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(day,press) 24.84  27.99 334.4  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.885   Deviance explained = 88.8%
## GCV = 0.29688  Scale est. = 0.29057   n = 1217

Heterotrophic bacteria

summary(hbact_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## hbact ~ s(day, press)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88022    0.01811   214.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(day,press) 22.85  26.88 123.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.732   Deviance explained = 73.7%
## GCV = 0.40709  Scale est. = 0.39911   n = 1217

Picoeukaryotic phytoplankton

summary(pico_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## picoeuk ~ s(day, press)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.009935   0.000161   61.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(day,press) 23.52   27.3 24.89  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.362   Deviance explained = 37.4%
## GCV = 3.218e-05  Scale est. = 3.1531e-05  n = 1217

log Picoeukaryotic phytoplankton

summary(pico_log_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(picoeuk + 1e-04) ~ s(day, press)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.90191    0.01723  -284.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(day,press) 25.49  28.27 62.42  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.593   Deviance explained = 60.2%
## GCV = 0.36944  Scale est. = 0.3614    n = 1217

Residual diagnostics

We observe that higher deviance is explained for the log transformed GAMs.

Prochlorococcus

gam.check(pro_gam)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 2.946778e-07 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(day,press) 29.0 27.5    1.06       1

log Prochlorococcus

gam.check(pro_log_gam)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 1.953061e-05 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(day,press) 29.0 24.8    1.04    0.87

Heterotrophic bacteria

gam.check(hbact_gam)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 6 iterations.
## The RMS GCV score gradient at convergence was 2.095162e-06 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value  
## s(day,press) 29.0 22.8    0.96   0.035 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

k’ = 29, the basis dimension, is observed to be too low. The meaning of the basis dimension depends on the kind of smoother used, but it effectively determines the upper bound on the complexity/wiggliness of the smoother. Appears to be regularly distributed

Picoeukaryotic phytoplankton

gam.check(pico_gam)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 7.945724e-08 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(day,press) 29.0 23.5    1.04    0.88

Skewed to the left, requires a transformation. Low p-value (k-index<1) may indicate that k is too low as well.

log Picoeukaryotic phytoplankton

gam.check(pico_log_gam)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 6.684561e-06 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(day,press) 29.0 25.5    1.04     0.8

log and k Picoeukaryotic phytoplankton

gam.check(pico_log_gamk)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradient at convergence was 1.061938e-05 .
## The Hessian was positive definite.
## Model rank =  50 / 50 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(day,press) 49.0 33.7    1.05    0.95

Plots for fitted smoother

Plot the fitted smoother in a way that is visually appealing. Finally, figure out how to test whether the relationship between abundance and depth changes over time or not. What are your interpretations of the results so far?

log Prochlorococcus

plot(pro_log_gam)

heterotrophic bateria

plot(hbact_gam)

log Picoeukaryotic phytoplankton

plot(pico_log_gam)

log and k Picoeukaryotic phytoplankton

plot(pico_log_gamk)

GAM for all groups

Now let’s compare the niches of the three groups to each other. Use a GAM including all groups simultaneously to simultaneously test three questions: (a) Do the different kinds of microbes have different mean abundances? (b) Do the different kinds of microbes have different average depth distributions (i.e., averaging over time)? (c) Do the different kinds of microbes have different average seasonal dynamics (i.e., averaging over depths)?

To fit GAM(s) including all three groups simultaneously you will need to convert the data to ‘long’ format, where there is a column that contains all the concentrations of all three types of microbes, and a second column that codes which microbe was counted in that row, as well as additional columns for the other model predictors. You can convert to long format by hand using a spreadsheet, or you can use a helpful function called pivot_longer() in the package ‘tidyr’.

Note that the s() function for smoothers can fit interactions, where different groups in a categorical predictor have smoothers with different shapes. This is specified like s(continuous_predictor, by = categorical_predictor). Now that you have fit models to test questions (a)-(c), make appropriate plots and perform appropriate hypothesis tests. How do you intepret the results?

Pivot long and fit model

cyto_long <- cytodf %>%
  pivot_longer(cols = c(pro, hbact, picoeuk),
               names_to = "group",
               values_to = "abundance") %>%
  mutate(log_abundance = log(abundance + 1e-4))
cyto_long$group <- as.factor(cyto_long$group)

#a GAM including all groups simultaneously
gam_all <- gam(log_abundance ~ group + #for mean abundance  
                 s(press, by = group) + #for depth 
                 s(day, by = group), #for season 
               data = cyto_long)

summary(gam_all)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log_abundance ~ group + s(press, by = group) + s(day, by = group)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.29933    0.01495   86.89   <2e-16 ***
## grouppicoeuk -6.20124    0.02115 -293.24   <2e-16 ***
## grouppro     -1.66530    0.02115  -78.75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                         edf Ref.df        F p-value    
## s(press):grouphbact   2.932  3.630  108.249  <2e-16 ***
## s(press):grouppicoeuk 8.329  8.784  216.866  <2e-16 ***
## s(press):grouppro     4.720  5.655 1684.365  <2e-16 ***
## s(day):grouphbact     2.255  2.800    2.212  0.0724 .  
## s(day):grouppicoeuk   8.646  8.963   19.294  <2e-16 ***
## s(day):grouppro       5.833  6.982   30.040  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.966   Deviance explained = 96.7%
## GCV = 0.27482  Scale est. = 0.27213   n = 3651

From the parametric coefficient results, we can observe that the microbes have different mean abundances. Given the edf values, we are also able to observe picoeukaryotic phytoplankton is highly complex while heterotrophic bacteria is less complex. The range of values indicates different average depth distributions. We also observe different seasonal dynamics; however, heterotrophic bacteria does not appear to follow a seasonal pattern given the p-value of 0.0724.

GAM check

gam.check(gam_all)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 14 iterations.
## The RMS GCV score gradient at convergence was 4.920004e-06 .
## The Hessian was not positive definite.
## Model rank =  57 / 57 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                         k'  edf k-index p-value    
## s(press):grouphbact   9.00 2.93    0.91  <2e-16 ***
## s(press):grouppicoeuk 9.00 8.33    0.91  <2e-16 ***
## s(press):grouppro     9.00 4.72    0.91  <2e-16 ***
## s(day):grouphbact     9.00 2.25    0.82  <2e-16 ***
## s(day):grouppicoeuk   9.00 8.65    0.82  <2e-16 ***
## s(day):grouppro       9.00 5.83    0.82  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GAM mixed

  1. Finally, let’s investigate how abundances of the three groups at shallower depths correlate with mixed layer depth (an index of stratification) and chlorophyll a concentration. The second attached file, hot_mlds.csv, contains the average mixed layer depth for each HOT cruise. You’ll need to merge the information in this file with the dataset you have been analyzing.

Using only data from the top 45 meters, how does the concentration of each group of microbes vary with (a) mixed layer depth and (b) Chl a concentration? Test whether the three types of microbes exhibit different relationships with the two predictors. Use appropriate GAM(s), hypothesis tests, and smoother plots to assess these questions.

# Merge data
mldsdf <- mlds %>%
  mutate(date = mdy(date))

mergeddf <- merge(cyto_long, mldsdf, by = "date", all = TRUE) %>%
  filter(press <= 45)

# GAM
gam_mixed <- gam(log_abundance ~ group + s(mean, by = group) + 
                   s(log(chl), by = group), 
                   data = mergeddf)

summary(gam_mixed)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log_abundance ~ group + s(mean, by = group) + s(log(chl), by = group)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.55332    0.02545   61.04   <2e-16 ***
## grouppicoeuk -6.31603    0.03599 -175.50   <2e-16 ***
## grouppro     -0.87879    0.03599  -24.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                            edf Ref.df     F  p-value    
## s(mean):grouphbact       3.120  3.889 2.689   0.0418 *  
## s(mean):grouppicoeuk     6.606  7.704 2.548   0.0131 *  
## s(mean):grouppro         1.000  1.000 1.218   0.2710    
## s(log(chl)):grouphbact   1.000  1.000 2.437   0.1199    
## s(log(chl)):grouppicoeuk 6.557  7.615 5.913 2.06e-06 ***
## s(log(chl)):grouppro     1.630  2.049 1.275   0.2722    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.993   Deviance explained = 99.4%
## GCV = 0.059199  Scale est. = 0.053752  n = 249
gam.check(gam_mixed)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 16 iterations.
## The RMS GCV score gradient at convergence was 2.473004e-08 .
## The Hessian was positive definite.
## Model rank =  57 / 57 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                            k'  edf k-index p-value
## s(mean):grouphbact       9.00 3.12    0.98    0.41
## s(mean):grouppicoeuk     9.00 6.61    0.98    0.33
## s(mean):grouppro         9.00 1.00    0.98    0.34
## s(log(chl)):grouphbact   9.00 1.00    1.07    0.88
## s(log(chl)):grouppicoeuk 9.00 6.56    1.07    0.88
## s(log(chl)):grouppro     9.00 1.63    1.07    0.84

This is likely overfitting and requires a penalty term

Mixed gam plots

par(mfrow = c(2,3))
plot(gam_mixed)

Compare GAMs

gam_mixed_simple <- gam(log_abundance ~ group + s(mean) +
                          s(log(chl)),
                        data = mergeddf)
summary(gam_mixed_simple)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log_abundance ~ group + s(mean) + s(log(chl))
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.55332    0.03015   51.51   <2e-16 ***
## grouppicoeuk -6.31603    0.04264 -148.11   <2e-16 ***
## grouppro     -0.87879    0.04264  -20.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df     F p-value   
## s(mean)     3.198  3.971 1.605 0.18893   
## s(log(chl)) 2.321  2.948 4.188 0.00696 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.99   Deviance explained = 99.1%
## GCV = 0.078145  Scale est. = 0.075471  n = 249
AICc(gam_mixed)
## [1] 7.817445
AICc(gam_mixed_simple)
## [1] 74.42416

The concentration of each group of microbes varies with mixed layer depth and Chl a concentration.