~~~

 

Model selection to predict abundance of diatom species based on environmental parameters

As part of an investigation of diatom succession and community composition in the Gulf of Aqaba/ Eilat over 2 seasons (2020-22), and based on community analysis, several key species were detected. Each individual species was tested in a generalized linear model (glm) based on count data. The example here follows the abundance of diatoms of Thalassiosiraceae family in the GoA. According to the density plot below, Poisson-based glm was initially applied:

 

 

Due to different volumes that were filtered at each time point, the Poisson model was corrected by using the “offset” function. Based on NMDS data, and following ecological relevance (and removal of parameters due to inflation), 3 environmental parameters were tested in the model: Water temp., inorganic nitrogen levels (in seawater), and day-length.

 

initial_model_thal = glm(thal_count~temp+TIN+day_length,
                    +offset(log(vol_ml_3000)),family="poisson",
                    na.action = "na.fail",data=glmdata)
## 
## Call:
## glm(formula = thal_count ~ temp + TIN + day_length, family = "poisson", 
##     data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -98.512  -47.360  -19.223    0.573  163.610  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 40.885436   0.095649  427.45   <2e-16 ***
## temp        -1.382606   0.003911 -353.48   <2e-16 ***
## TIN         -0.672306   0.006131 -109.65   <2e-16 ***
## day_length  -0.168728   0.002317  -72.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 402681  on 44  degrees of freedom
## Residual deviance: 145121  on 41  degrees of freedom
## AIC: 146103
## 
## Number of Fisher Scoring iterations: 6

 

Going over the summary, z and p values, and the null and residual deviances, hinted that the model is biased. High residual deviance implied overdisperssion of the residuals. This was validated with the following test:

 

distest<-dispersiontest(initial_model_thal,trafo=1,alternative ="greater")
## 
##  Overdispersion test
## 
## data:  initial_model_thal
## z = 2.8296, p-value = 0.002331
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 1533.841

 

An alternative glm that accounts for overdisperssion is negative binomial-based glm.

 

nbin_thal <- glm.nb(thal_count~temp*TIN*day_length,
                        +offset(log(vol_ml_3000)),maxit=1000,
                  na.action = "na.fail",data=glmdata)
## 
## Call:
## glm.nb(formula = thal_count ~ temp * TIN * day_length, data = glmdata, 
##     weights = +offset(log(vol_ml_3000)), na.action = "na.fail", 
##     maxit = 1000, init.theta = 1.221015633, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3871  -1.7954  -0.6076   0.3484   4.2276  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)         129.0280    43.8212   2.944  0.00324 **
## temp                 -5.1474     1.8318  -2.810  0.00495 **
## TIN                  98.4504   105.2701   0.935  0.34968   
## day_length           -7.5699     3.4952  -2.166  0.03032 * 
## temp:TIN             -4.7105     4.6684  -1.009  0.31296   
## temp:day_length       0.3158     0.1464   2.157  0.03098 * 
## TIN:day_length       -8.0391     8.5804  -0.937  0.34880   
## temp:TIN:day_length   0.3830     0.3802   1.007  0.31371   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.221) family taken to be 1)
## 
##     Null deviance: 453.61  on 44  degrees of freedom
## Residual deviance: 136.43  on 37  degrees of freedom
## AIC: 1871.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.221 
##           Std. Err.:  0.141 
## 
##  2 x log-likelihood:  -1853.364

 

Thus, this model was selected to predict the abundances of main diatom species in the GoA, in response to 3 environmental parameters (mentioned above).

 

Residuals diagnostics

Validating this model by examination of residuals (this part is taken from RPub by Ben Horvath):

Deviance residuals:

The deviance residuals distribution seems ok.

Next, a comparison of the raw counts to the model’s prediction. The predicted values from a model will appear similar to the raw counts if the model is well-fit.

 

### yhat<-plot(density(glmdata$thal_count), main='raw counts vs. model prediction (counts_hat)')
### lines(density(predict(nbin_thal, type='response')), col='red')

 

It seems that the model (red line) fits the raw counts (black line).

Verifying there is no pattern among the residuals:

 

Final assessment to visualize if negative binomial is the right choise:

 

Following residuals diagnostics, it seems that negative binomial fits the count data.
Next, to decide the best predictors, dredge function was used for model selection:

 

selection_info_thal = dredge(nbin_thal)
## Fixed term is "(Intercept)"

 

 

The top model on the list best fits the data. Here is the model’s summary:

 

best_model_thal = get.models(selection_info_thal,subset = NA)[[1]]
## 
## Call:
## glm.nb(formula = thal_count ~ day_length + temp + day_length:temp + 
##     1, data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail", 
##     maxit = 1000, init.theta = 1.198892036, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3180  -1.4084  -0.5520   0.3022   4.4061  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     136.96326   16.87141   8.118 4.74e-16 ***
## day_length       -8.31563    1.37858  -6.032 1.62e-09 ***
## temp             -5.68197    0.71907  -7.902 2.75e-15 ***
## day_length:temp   0.36271    0.05879   6.169 6.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.1989) family taken to be 1)
## 
##     Null deviance: 445.44  on 44  degrees of freedom
## Residual deviance: 136.63  on 41  degrees of freedom
## AIC: 1866
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.199 
##           Std. Err.:  0.138 
## 
##  2 x log-likelihood:  -1856.034

 

Visualising the best model. on the left is the predicted values of Thalassiosira counts based on water temperature. Different line colors correspond to 3 levels of day-length: short (red), intermediate (blue), and long (green). Light-color around the trendline represents 95% CI of each prediction. The right panel shows the raw counts as function of water temp..

 

 

Another option is to use an average of several models.

 

supported_models_thal = selection_info_thal[selection_info_thal$delta < 2,]
avg_model_thal = model.avg(supported_models_thal,fit=TRUE)
## 
## Call:
## model.avg(object = get.models(object = supported_models_thal, 
##     subset = NA))
## 
## Component model call: 
## glm.nb(formula = thal_count ~ <2 unique rhs>, data = glmdata, weights = 
##      +offset(log(vol_ml_3000)), na.action = na.fail, maxit = 1000, 
##      init.theta = <2 unique values>, link = log)
## 
## Component models: 
##      df  logLik    AICc delta weight
## 124   5 -928.02 1867.57  0.00   0.63
## 1234  6 -927.20 1868.61  1.04   0.37
## 
## Term codes: 
##      day_length            temp             TIN day_length:temp 
##               1               2               3               4 
## 
## Model-averaged coefficients:  
## (full average) 
##                  Estimate Std. Error Adjusted SE z value Pr(>|z|)    
## (Intercept)     138.31133   16.95242    17.46721   7.918   <2e-16 ***
## day_length       -8.38864    1.37980     1.42194   5.899   <2e-16 ***
## temp             -5.73701    0.72210     0.74405   7.711   <2e-16 ***
## day_length:temp   0.36584    0.05885     0.06065   6.032   <2e-16 ***
## TIN              -0.12334    0.22153     0.22486   0.549    0.583    
##  
## (conditional average) 
##                  Estimate Std. Error Adjusted SE z value Pr(>|z|)    
## (Intercept)     138.31133   16.95242    17.46721   7.918   <2e-16 ***
## day_length       -8.38864    1.37980     1.42194   5.899   <2e-16 ***
## temp             -5.73701    0.72210     0.74405   7.711   <2e-16 ***
## day_length:temp   0.36584    0.05885     0.06065   6.032   <2e-16 ***
## TIN              -0.33089    0.25096     0.25878   1.279    0.201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 

We can compare both models and test which one shows higher correlation with the data.

 

best_model_cor_thal = cor(predict(best_model_thal),glmdata$thal_count)
avg_model_cor_thal = cor(predict(avg_model_thal),glmdata$thal_count)
## [1] 0.608157
## [1] 0.6073354

 

It seems that both models are similar, with a slightly higher correlation of the best model approach.
Testing the power of the model (the probability that the detected correlation is real)

 

pwr_test<-pwr.r.test(r=best_model_cor_thal,n=nrow(glmdata))
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.608157
##       sig.level = 0.05
##           power = 0.9960602
##     alternative = two.sided
Sampling effort was sufficient, power is fairly high.

 

Chaetoceros hyalochaete

 

## 
## Call:
## glm.nb(formula = hyalo_count ~ day_length + temp + TIN + temp:TIN + 
##     1, data = glmdata, weights = +offset(log(vol_ml_1000)), na.action = "na.fail", 
##     maxit = 1000, init.theta = 0.8467493425, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9385  -2.0514  -0.9607   0.6610   3.7307  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.30421    2.28639   3.632 0.000281 ***
## day_length  -0.37951    0.07503  -5.058 4.23e-07 ***
## temp         0.03568    0.09637   0.370 0.711197    
## TIN         15.18361    6.19254   2.452 0.014210 *  
## temp:TIN    -0.64778    0.29023  -2.232 0.025619 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.8467) family taken to be 1)
## 
##     Null deviance: 289.69  on 44  degrees of freedom
## Residual deviance: 196.36  on 40  degrees of freedom
## AIC: 2046.6
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.8467 
##           Std. Err.:  0.0823 
## 
##  2 x log-likelihood:  -2034.6040
best_model_cor_hyalo = cor(predict(best_model_hyalo),glmdata$hyalo_count)
## [1] 0.6834654
pwr_test_hyalo<-pwr.r.test(r=best_model_cor_hyalo,n=nrow(glmdata))
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.6834654
##       sig.level = 0.05
##           power = 0.9997696
##     alternative = two.sided

 

Cylindrotheca closterium

 

## 
## Call:
## model.avg(object = get.models(object = supported_models_closter, 
##     subset = NA))
## 
## Component model call: 
## glm.nb(formula = closter_count ~ <5 unique rhs>, data = glmdata, 
##      weights = +offset(log(vol_ml_1000)), na.action = na.fail, maxit = 1000, 
##      init.theta = <5 unique values>, link = log)
## 
## Component models: 
##         df   logLik    AICc delta weight
## 123456   8 -1260.47 2540.94  0.00   0.28
## 236      5 -1264.81 2541.16  0.22   0.25
## 1234567  9 -1259.36 2541.87  0.92   0.18
## 1236     6 -1263.91 2542.04  1.10   0.16
## 12346    7 -1262.70 2542.44  1.49   0.13
## 
## Term codes: 
##          day_length                temp                 TIN     day_length:temp 
##                   1                   2                   3                   4 
##      day_length:TIN            temp:TIN day_length:temp:TIN 
##                   5                   6                   7 
## 
## Model-averaged coefficients:  
## (full average) 
##                      Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept)          50.67164   46.86114    47.12644   1.075    0.282
## day_length           -3.46433    3.74330     3.76437   0.920    0.357
## temp                 -1.76077    1.86605     1.87699   0.938    0.348
## TIN                 -73.95099   69.72191    70.31221   1.052    0.293
## day_length:temp       0.13462    0.14932     0.15019   0.896    0.370
## day_length:TIN        2.73386    5.70668     5.75393   0.475    0.635
## temp:TIN              3.06002    2.96350     2.99083   1.023    0.306
## day_length:temp:TIN  -0.09649    0.24271     0.24488   0.394    0.694
##  
## (conditional average) 
##                     Estimate Std. Error Adjusted SE z value Pr(>|z|)  
## (Intercept)          50.6716    46.8611     47.1264   1.075   0.2823  
## day_length           -4.6219     3.6530      3.6817   1.255   0.2093  
## temp                 -1.7608     1.8660      1.8770   0.938   0.3482  
## TIN                 -73.9510    69.7219     70.3122   1.052   0.2929  
## day_length:temp       0.2289     0.1278      0.1295   1.768   0.0770 .
## day_length:TIN        6.0009     7.2027      7.2847   0.824   0.4101  
## temp:TIN              3.0600     2.9635      2.9908   1.023   0.3062  
## day_length:temp:TIN  -0.5476     0.2954      0.3054   1.793   0.0729 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
avg_model_cor_closter = cor(predict(avg_model_closter),glmdata$closter_count)
## [1] 0.5065935
pwr_test_closter<-pwr.r.test(r=avg_model_cor_closter,n=nrow(glmdata))
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.5065935
##       sig.level = 0.05
##           power = 0.9546288
##     alternative = two.sided

 

Pseudo-nitzschia

 

## 
## Call:
## glm.nb(formula = nitzs_count ~ day_length + temp + TIN + day_length:temp + 
##     day_length:TIN + 1, data = glmdata, weights = +offset(log(vol_ml_1000)), 
##     na.action = "na.fail", maxit = 1000, init.theta = 1.450997028, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.5075  -1.8994  -0.1475   0.7776   4.9782  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -159.85551   21.52677  -7.426 1.12e-13 ***
## day_length        13.11350    1.70470   7.693 1.44e-14 ***
## temp               6.66153    0.83020   8.024 1.02e-15 ***
## TIN               18.40902    5.87388   3.134  0.00172 ** 
## day_length:temp   -0.52936    0.06593  -8.029 9.83e-16 ***
## day_length:TIN    -1.37125    0.46513  -2.948  0.00320 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.451) family taken to be 1)
## 
##     Null deviance: 332.88  on 44  degrees of freedom
## Residual deviance: 186.32  on 39  degrees of freedom
## AIC: 2295.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.451 
##           Std. Err.:  0.145 
## 
##  2 x log-likelihood:  -2281.139
best_model_cor_nitzs = cor(predict(best_model_nitzs),glmdata$nitzs_count)
## [1] 0.6371158
pwr_test_nitzs<-pwr.r.test(r=best_model_cor_nitzs,n=nrow(glmdata))
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.6371158
##       sig.level = 0.05
##           power = 0.9984912
##     alternative = two.sided

 

Pennate (20-50 um size fraction)

 

## 
## Call:
## glm.nb(formula = pennM_count ~ day_length + temp + TIN + day_length:temp + 
##     day_length:TIN + 1, data = glmdata, weights = +offset(log(vol_ml_3000)), 
##     na.action = "na.fail", maxit = 1000, init.theta = 2.402524449, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7032  -1.2078  -0.4224   0.7815   3.6492  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -82.64727   19.87249  -4.159 3.20e-05 ***
## day_length        6.91384    1.57266   4.396 1.10e-05 ***
## temp              3.42043    0.76563   4.467 7.91e-06 ***
## TIN              11.84556    5.43614   2.179   0.0293 *  
## day_length:temp  -0.26737    0.06075  -4.401 1.08e-05 ***
## day_length:TIN   -0.96011    0.43079  -2.229   0.0258 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.4025) family taken to be 1)
## 
##     Null deviance: 182.82  on 44  degrees of freedom
## Residual deviance: 129.64  on 39  degrees of freedom
## AIC: 1523.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.403 
##           Std. Err.:  0.293 
## 
##  2 x log-likelihood:  -1509.148
best_model_cor_pennm = cor(predict(best_model_pennm),glmdata$pennM_count)
## [1] 0.3920523

Correlation is not strong (0.39). Testing the power level of the analysis:

pwr_test_pennm<-pwr.r.test(r=best_model_cor_pennm,n=nrow(glmdata))
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.3920523
##       sig.level = 0.05
##           power = 0.7734919
##     alternative = two.sided

Power is lower than 90%. Another test will show the required sampling effort to reach power level of 90%:

pwr_test_pennm_samp_effort<-pwr.r.test(r=best_model_cor_pennm,power=0.9)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 63.40017
##               r = 0.3920523
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
For this group of medium size pennates, 64 time points are required to obtain 90% power level.

 

Pennate (<20 um size fraction)

 

## 
## Call:
## glm.nb(formula = pennS_count ~ day_length + temp + TIN + day_length:temp + 
##     1, data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail", 
##     maxit = 1000, init.theta = 3.650265167, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0249  -1.3677  -0.4250   0.2146   4.2866  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -10.00486    9.49900  -1.053   0.2922    
## day_length        1.79359    0.77262   2.321   0.0203 *  
## temp              0.61903    0.40404   1.532   0.1255    
## TIN              -0.68199    0.14515  -4.699 2.62e-06 ***
## day_length:temp  -0.07066    0.03290  -2.148   0.0317 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.6503) family taken to be 1)
## 
##     Null deviance: 178.79  on 44  degrees of freedom
## Residual deviance: 126.85  on 40  degrees of freedom
## AIC: 1604.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.650 
##           Std. Err.:  0.453 
## 
##  2 x log-likelihood:  -1592.734
best_model_cor_penns = cor(predict(best_model_penns),glmdata$pennS_count)
## [1] 0.3403893

Correlation is not strong (0.34). Testing the power level of the analysis:

pwr_test_penns<-pwr.r.test(r=best_model_cor_penns,n=nrow(glmdata))
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 45
##               r = 0.3403893
##       sig.level = 0.05
##           power = 0.6404924
##     alternative = two.sided

Power is lower than 90%. Another test will show the required sampling effort to reach power level of 90%:

pwr_test_penns_samp_effort<-pwr.r.test(r=best_model_cor_penns,power=0.9)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 85.73737
##               r = 0.3403893
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
For this group of small size pennates, 86 time points are required to obtain 90% power level (almost double sampling effort).