Diatoms GoA SEM 2020-22

~~~

Recap

An investigation of diatom succession and community composition in the Gulf of Aqaba/ Eilat over 2 seasons (2020-22). Data collected by routine sampling (bi-weekly to daily) of seawater from st. A, filtering several L on polycarbonate 0.8 um filter (in duplicates).

Filters were counted in transect method (100-300 screens) by scanning electron microscope (SEM). Counts of individual diatoms were translated to concentration (individuals mL^-1).

Total concentration

Concentration based on size-fraction of 2 catergories: larger or smaller than 50 um.

Designated dataset for community analysis was compiled, by averaging duplicates at each time point, and log-transforming concentration (after addition of +1 to avoid log 0).

Heatmap showing log-concentration of each species (by date)

Heatmap showing log-concentration of each species (by clusters based on similarity of abundance)

To quantify similarities in community composition over the different time-points, Bray-Curtis method was applied. The red line indicates 70% similarities (which is the threshold for clustering)

To look for patterns across the entire time points, I applied cluster analysis to see how the communities are grouped without interference. For deciding number of clusters, 70% similarity was set as threshold (resulted in 11 clusters).

Grey lines show the main environmental parameters and how they are driving the differences between clusters.

Inferring ordination analysis

To better characterize the 3 communities that emerged from the ordination data, further analysis was applied to describe each community individually, and also to detect differences between these communities.
The first anlaysis is analysis of similarity (ANOSIM). ANOSIM statistic compares the mean of ranked dissimilarities between groups to the mean of ranked dissimilarities within groups. An R value close to “1.0” suggests dissimilarity between groups while an R value close to “0” suggests an even distribution of high and low ranks within and between groups (credit: GUSTA ME).

The table below shows the 5 most abundant species for each cluster (values here represent averaged log concentration).

Species richness for each cluster

SIMPER analysis cannot perform on clusters with one time point (as some appear in the new grouping)

Diatom community temporal dynamics

This analysis follows an article describing the R package codyn by Hallett et al. 2016: “Many traditional measurements of community structure, such as diversity indices and rank-abundance curves, represent ‘snapshots in time’ that poorly capture community dynamics. The development of analogous temporal metrics, such as species turnover and rank shifts, has highlighted how much communities can vary over time.
…Species turnover represents a temporal analogue to species richness. The function turnover calculates three metrics of species turnover: total turnover, appearances and disappearances. The default metric ‘total’ refers to total turnover, which calculates the proportion of species that differ between time points as:

The turnover function includes the option to calculate only the proportion of species that appear or only those that disappear. This allows the detection of differences in the time points in which many species appear vs. when species drop from the system, even while the total turnover value in both scenarios may be similar.”

In this panel, the upper graph shows the total turnover (color code for different depths at summer), and the lower graphs include results of surface (right) and dcm (left) for both components of total turnover: species appearance (orange) and species disappearance (green).

(from the article:)“Mean rank shifts represent a temporal analogue of species rank-abundance distributions and indicate the degree of species reordering between two time points. This metric is calculated by the rank_shift() function as:

where N is the number of species in common in both time points, t is the time point and R_i,t is the relative rank of species i at time t.”

The results of the mean rank shift calculation indicate that the degree of species reordering were stable during winter and spring of 2021. In the summer of 2021 (June and August) there was an increase in the degree of species reordering. Another stable period was recorded in the following autumn and winter, until March 2022, where there was another several peaks in the degree of species reordering.

Model selection to predict abundance of diatom species based on environmental parameters

As part of an investigation of diatom succession and community composition in the Gulf of Aqaba/ Eilat over 2 seasons (2020-22), and based on community analysis, several key species were detected. Each individual species was tested in a generalized linear model (glm) based on count data. The example here follows the abundance of diatoms of Thalassiosiraceae family in the GoA. According to the density plot below, Poisson-based glm was initially applied:

Due to different volumes that were filtered at each time point, the Poisson model was corrected by using the “offset” function. Based on NMDS data, and following ecological relevance (and removal of parameters due to inflation), 3 environmental parameters were tested in the model: Water temp., inorganic nitrogen levels (in seawater), and day-length.

initial_model_thal = glm(thal_count~temp+TIN+day_length,
                    +offset(log(vol_ml_3000)),family="poisson",
                    na.action = "na.fail",data=glmdata)

## 
## Call:
## glm(formula = thal_count ~ temp + TIN + day_length, family = "poisson", 
##     data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -98.512  -47.360  -19.223    0.573  163.610  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 40.885436   0.095649  427.45   <2e-16 ***
## temp        -1.382606   0.003911 -353.48   <2e-16 ***
## TIN         -0.672306   0.006131 -109.65   <2e-16 ***
## day_length  -0.168728   0.002317  -72.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 402681  on 44  degrees of freedom
## Residual deviance: 145121  on 41  degrees of freedom
## AIC: 146103
## 
## Number of Fisher Scoring iterations: 6

Going over the summary, z and p values, and the null and residual deviances, hinted that the model is biased. High residual deviance implied overdisperssion of the residuals. This was validated with the following test:

distest<-dispersiontest(initial_model_thal,trafo=1,alternative ="greater")

## 
##  Overdispersion test
## 
## data:  initial_model_thal
## z = 2.8296, p-value = 0.002331
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 1533.841

An alternative glm that accounts for overdisperssion is negative binomial-based glm.

nbin_thal <- glm.nb(thal_count~temp*TIN*day_length,
                        +offset(log(vol_ml_3000)),maxit=1000,
                  na.action = "na.fail",data=glmdata)

## 
## Call:
## glm.nb(formula = thal_count ~ temp * TIN * day_length, data = glmdata, 
##     weights = +offset(log(vol_ml_3000)), na.action = "na.fail", 
##     maxit = 1000, init.theta = 1.221015633, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3871  -1.7954  -0.6076   0.3484   4.2276  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)         129.0280    43.8212   2.944  0.00324 **
## temp                 -5.1474     1.8318  -2.810  0.00495 **
## TIN                  98.4504   105.2701   0.935  0.34968   
## day_length           -7.5699     3.4952  -2.166  0.03032 * 
## temp:TIN             -4.7105     4.6684  -1.009  0.31296   
## temp:day_length       0.3158     0.1464   2.157  0.03098 * 
## TIN:day_length       -8.0391     8.5804  -0.937  0.34880   
## temp:TIN:day_length   0.3830     0.3802   1.007  0.31371   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.221) family taken to be 1)
## 
##     Null deviance: 453.61  on 44  degrees of freedom
## Residual deviance: 136.43  on 37  degrees of freedom
## AIC: 1871.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.221 
##           Std. Err.:  0.141 
## 
##  2 x log-likelihood:  -1853.364

Thus, this model was selected to predict the abundances of main diatom species in the GoA, in response to 3 environmental parameters (mentioned above).

Residuals diagnostics

Validating this model by examination of residuals (this part is taken from RPub by Ben Horvath):

Deviance residuals:

The deviance residuals distribution seems ok.

Next, a comparison of the raw counts to the model’s prediction. The predicted values from a model will appear similar to the raw counts if the model is well-fit.

### yhat<-plot(density(glmdata$thal_count), main='raw counts vs. model prediction (counts_hat)')
### lines(density(predict(nbin_thal, type='response')), col='red')

It seems that the model (red line) fits the raw counts (black line).

Verifying there is no pattern among the residuals:

Final assessment to visualize if negative binomial is the right choise:

Following residuals diagnostics, it seems that negative binomial fits the count data.
Next, to decide the best predictors, dredge function was used for model selection:

selection_info_thal = dredge(nbin_thal)

## Fixed term is "(Intercept)"

The top model on the list best fits the data. Here is the model’s summary:

best_model_thal = get.models(selection_info_thal,subset = NA)[[1]]

## 
## Call:
## glm.nb(formula = thal_count ~ day_length + temp + day_length:temp + 
##     1, data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail", 
##     maxit = 1000, init.theta = 1.198892036, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3180  -1.4084  -0.5520   0.3022   4.4061  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     136.96326   16.87141   8.118 4.74e-16 ***
## day_length       -8.31563    1.37858  -6.032 1.62e-09 ***
## temp             -5.68197    0.71907  -7.902 2.75e-15 ***
## day_length:temp   0.36271    0.05879   6.169 6.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.1989) family taken to be 1)
## 
##     Null deviance: 445.44  on 44  degrees of freedom
## Residual deviance: 136.63  on 41  degrees of freedom
## AIC: 1866
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.199 
##           Std. Err.:  0.138 
## 
##  2 x log-likelihood:  -1856.034

Visualising the best model. on the left is the predicted values of Thalassiosira counts based on water temperature. Different line colors correspond to 3 levels of day-length: short (red), intermediate (blue), and long (green). Light-color around the trendline represents 95% CI of each prediction. The right panel shows the raw counts as function of water temp..

Another option is to use an average of several models.

supported_models_thal = selection_info_thal[selection_info_thal$delta < 2,]
avg_model_thal = model.avg(supported_models_thal,fit=TRUE)

## 
## Call:
## model.avg(object = get.models(object = supported_models_thal, 
##     subset = NA))
## 
## Component model call: 
## glm.nb(formula = thal_count ~ <2 unique rhs>, data = glmdata, weights = 
##      +offset(log(vol_ml_3000)), na.action = na.fail, maxit = 1000, 
##      init.theta = <2 unique values>, link = log)
## 
## Component models: 
##      df  logLik    AICc delta weight
## 124   5 -928.02 1867.57  0.00   0.63
## 1234  6 -927.20 1868.61  1.04   0.37
## 
## Term codes: 
##      day_length            temp             TIN day_length:temp 
##               1               2               3               4 
## 
## Model-averaged coefficients:  
## (full average) 
##                  Estimate Std. Error Adjusted SE z value Pr(>|z|)    
## (Intercept)     138.31133   16.95242    17.46721   7.918   <2e-16 ***
## day_length       -8.38864    1.37980     1.42194   5.899   <2e-16 ***
## temp             -5.73701    0.72210     0.74405   7.711   <2e-16 ***
## day_length:temp   0.36584    0.05885     0.06065   6.032   <2e-16 ***
## TIN              -0.12334    0.22153     0.22486   0.549    0.583    
##  
## (conditional average) 
##                  Estimate Std. Error Adjusted SE z value Pr(>|z|)    
## (Intercept)     138.31133   16.95242    17.46721   7.918   <2e-16 ***
## day_length       -8.38864    1.37980     1.42194   5.899   <2e-16 ***
## temp             -5.73701    0.72210     0.74405   7.711   <2e-16 ***
## day_length:temp   0.36584    0.05885     0.06065   6.032   <2e-16 ***
## TIN              -0.33089    0.25096     0.25878   1.279    0.201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can compare both models and test which one shows higher correlation with the data.

best_model_cor_thal = cor(predict(best_model_thal),glmdata$thal_count)
avg_model_cor_thal = cor(predict(avg_model_thal),glmdata$thal_count)

## [1] 0.608157

## [1] 0.6073354

It seems that both models are similar, with a slightly higher correlation of the best model approach.
Testing the power of the model (the probability that the detected correlation is real)

pwr_test<-pwr.r.test(r=best_model_cor_thal,n=nrow(glmdata))

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.608157
##       sig.level = 0.05
##           power = 0.9960602
##     alternative = two.sided

Sampling effort was sufficient, power is fairly high.

Chaetoceros hyalochaete

## 
## Call:
## glm.nb(formula = hyalo_count ~ day_length + temp + TIN + temp:TIN + 
##     1, data = glmdata, weights = +offset(log(vol_ml_1000)), na.action = "na.fail", 
##     maxit = 1000, init.theta = 0.8467493425, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9385  -2.0514  -0.9607   0.6610   3.7307  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.30421    2.28639   3.632 0.000281 ***
## day_length  -0.37951    0.07503  -5.058 4.23e-07 ***
## temp         0.03568    0.09637   0.370 0.711197    
## TIN         15.18361    6.19254   2.452 0.014210 *  
## temp:TIN    -0.64778    0.29023  -2.232 0.025619 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.8467) family taken to be 1)
## 
##     Null deviance: 289.69  on 44  degrees of freedom
## Residual deviance: 196.36  on 40  degrees of freedom
## AIC: 2046.6
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.8467 
##           Std. Err.:  0.0823 
## 
##  2 x log-likelihood:  -2034.6040

best_model_cor_hyalo = cor(predict(best_model_hyalo),glmdata$hyalo_count)

## [1] 0.6834654

pwr_test_hyalo<-pwr.r.test(r=best_model_cor_hyalo,n=nrow(glmdata))

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.6834654
##       sig.level = 0.05
##           power = 0.9997696
##     alternative = two.sided

Cylindrotheca closterium

## 
## Call:
## model.avg(object = get.models(object = supported_models_closter, 
##     subset = NA))
## 
## Component model call: 
## glm.nb(formula = closter_count ~ <5 unique rhs>, data = glmdata, 
##      weights = +offset(log(vol_ml_1000)), na.action = na.fail, maxit = 1000, 
##      init.theta = <5 unique values>, link = log)
## 
## Component models: 
##         df   logLik    AICc delta weight
## 123456   8 -1260.47 2540.94  0.00   0.28
## 236      5 -1264.81 2541.16  0.22   0.25
## 1234567  9 -1259.36 2541.87  0.92   0.18
## 1236     6 -1263.91 2542.04  1.10   0.16
## 12346    7 -1262.70 2542.44  1.49   0.13
## 
## Term codes: 
##          day_length                temp                 TIN     day_length:temp 
##                   1                   2                   3                   4 
##      day_length:TIN            temp:TIN day_length:temp:TIN 
##                   5                   6                   7 
## 
## Model-averaged coefficients:  
## (full average) 
##                      Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept)          50.67164   46.86114    47.12644   1.075    0.282
## day_length           -3.46433    3.74330     3.76437   0.920    0.357
## temp                 -1.76077    1.86605     1.87699   0.938    0.348
## TIN                 -73.95099   69.72191    70.31221   1.052    0.293
## day_length:temp       0.13462    0.14932     0.15019   0.896    0.370
## day_length:TIN        2.73386    5.70668     5.75393   0.475    0.635
## temp:TIN              3.06002    2.96350     2.99083   1.023    0.306
## day_length:temp:TIN  -0.09649    0.24271     0.24488   0.394    0.694
##  
## (conditional average) 
##                     Estimate Std. Error Adjusted SE z value Pr(>|z|)  
## (Intercept)          50.6716    46.8611     47.1264   1.075   0.2823  
## day_length           -4.6219     3.6530      3.6817   1.255   0.2093  
## temp                 -1.7608     1.8660      1.8770   0.938   0.3482  
## TIN                 -73.9510    69.7219     70.3122   1.052   0.2929  
## day_length:temp       0.2289     0.1278      0.1295   1.768   0.0770 .
## day_length:TIN        6.0009     7.2027      7.2847   0.824   0.4101  
## temp:TIN              3.0600     2.9635      2.9908   1.023   0.3062  
## day_length:temp:TIN  -0.5476     0.2954      0.3054   1.793   0.0729 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

avg_model_cor_closter = cor(predict(avg_model_closter),glmdata$closter_count)

## [1] 0.5065935

pwr_test_closter<-pwr.r.test(r=avg_model_cor_closter,n=nrow(glmdata))

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.5065935
##       sig.level = 0.05
##           power = 0.9546288
##     alternative = two.sided

Pseudo-nitzschia

## 
## Call:
## glm.nb(formula = nitzs_count ~ day_length + temp + TIN + day_length:temp + 
##     day_length:TIN + 1, data = glmdata, weights = +offset(log(vol_ml_1000)), 
##     na.action = "na.fail", maxit = 1000, init.theta = 1.450997028, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.5075  -1.8994  -0.1475   0.7776   4.9782  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -159.85551   21.52677  -7.426 1.12e-13 ***
## day_length        13.11350    1.70470   7.693 1.44e-14 ***
## temp               6.66153    0.83020   8.024 1.02e-15 ***
## TIN               18.40902    5.87388   3.134  0.00172 ** 
## day_length:temp   -0.52936    0.06593  -8.029 9.83e-16 ***
## day_length:TIN    -1.37125    0.46513  -2.948  0.00320 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.451) family taken to be 1)
## 
##     Null deviance: 332.88  on 44  degrees of freedom
## Residual deviance: 186.32  on 39  degrees of freedom
## AIC: 2295.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.451 
##           Std. Err.:  0.145 
## 
##  2 x log-likelihood:  -2281.139

best_model_cor_nitzs = cor(predict(best_model_nitzs),glmdata$nitzs_count)

## [1] 0.6371158

pwr_test_nitzs<-pwr.r.test(r=best_model_cor_nitzs,n=nrow(glmdata))

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.6371158
##       sig.level = 0.05
##           power = 0.9984912
##     alternative = two.sided

Pennate (20-50 um size fraction)

## 
## Call:
## glm.nb(formula = pennM_count ~ day_length + temp + TIN + day_length:temp + 
##     day_length:TIN + 1, data = glmdata, weights = +offset(log(vol_ml_3000)), 
##     na.action = "na.fail", maxit = 1000, init.theta = 2.402524449, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7032  -1.2078  -0.4224   0.7815   3.6492  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -82.64727   19.87249  -4.159 3.20e-05 ***
## day_length        6.91384    1.57266   4.396 1.10e-05 ***
## temp              3.42043    0.76563   4.467 7.91e-06 ***
## TIN              11.84556    5.43614   2.179   0.0293 *  
## day_length:temp  -0.26737    0.06075  -4.401 1.08e-05 ***
## day_length:TIN   -0.96011    0.43079  -2.229   0.0258 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.4025) family taken to be 1)
## 
##     Null deviance: 182.82  on 44  degrees of freedom
## Residual deviance: 129.64  on 39  degrees of freedom
## AIC: 1523.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.403 
##           Std. Err.:  0.293 
## 
##  2 x log-likelihood:  -1509.148

best_model_cor_pennm = cor(predict(best_model_pennm),glmdata$pennM_count)

## [1] 0.3920523

Correlation is not strong (0.39). Testing the power level of the analysis:

pwr_test_pennm<-pwr.r.test(r=best_model_cor_pennm,n=nrow(glmdata))

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.3920523
##       sig.level = 0.05
##           power = 0.7734919
##     alternative = two.sided

Power is lower than 90%. Another test will show the required sampling effort to reach power level of 90%:

pwr_test_pennm_samp_effort<-pwr.r.test(r=best_model_cor_pennm,power=0.9)

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 63.40017
##               r = 0.3920523
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided

For this group of medium size pennates, 64 time points are required to obtain 90% power level.

Pennate (<20 um size fraction)

## 
## Call:
## glm.nb(formula = pennS_count ~ day_length + temp + TIN + day_length:temp + 
##     1, data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail", 
##     maxit = 1000, init.theta = 3.650265167, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0249  -1.3677  -0.4250   0.2146   4.2866  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -10.00486    9.49900  -1.053   0.2922    
## day_length        1.79359    0.77262   2.321   0.0203 *  
## temp              0.61903    0.40404   1.532   0.1255    
## TIN              -0.68199    0.14515  -4.699 2.62e-06 ***
## day_length:temp  -0.07066    0.03290  -2.148   0.0317 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.6503) family taken to be 1)
## 
##     Null deviance: 178.79  on 44  degrees of freedom
## Residual deviance: 126.85  on 40  degrees of freedom
## AIC: 1604.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.650 
##           Std. Err.:  0.453 
## 
##  2 x log-likelihood:  -1592.734

best_model_cor_penns = cor(predict(best_model_penns),glmdata$pennS_count)

## [1] 0.3403893

Correlation is not strong (0.34). Testing the power level of the analysis:

pwr_test_penns<-pwr.r.test(r=best_model_cor_penns,n=nrow(glmdata))

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.3403893
##       sig.level = 0.05
##           power = 0.6404924
##     alternative = two.sided