Due to different volumes that were filtered at each time point, the Poisson model was corrected by using the “offset” function. Based on NMDS data, and following ecological relevance (and removal of parameters due to inflation), 3 environmental parameters were tested in the model: Water temp., inorganic nitrogen levels (in seawater), and day-length.
initial_model_thal = glm(thal_count~temp+TIN+day_length,
+offset(log(vol_ml_3000)),family="poisson",
na.action = "na.fail",data=glmdata)
##
## Call:
## glm(formula = thal_count ~ temp + TIN + day_length, family = "poisson",
## data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -98.512 -47.360 -19.223 0.573 163.610
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 40.885436 0.095649 427.45 <2e-16 ***
## temp -1.382606 0.003911 -353.48 <2e-16 ***
## TIN -0.672306 0.006131 -109.65 <2e-16 ***
## day_length -0.168728 0.002317 -72.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 402681 on 44 degrees of freedom
## Residual deviance: 145121 on 41 degrees of freedom
## AIC: 146103
##
## Number of Fisher Scoring iterations: 6
Going over the summary, z and p values, and the null and residual deviances, hinted that the model is biased. High residual deviance implied overdisperssion of the residuals. This was validated with the following test:
distest<-dispersiontest(initial_model_thal,trafo=1,alternative ="greater")
##
## Overdispersion test
##
## data: initial_model_thal
## z = 2.8296, p-value = 0.002331
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 1533.841
An alternative glm that accounts for overdisperssion is negative binomial-based glm.
nbin_thal <- glm.nb(thal_count~temp*TIN*day_length,
+offset(log(vol_ml_3000)),maxit=1000,
na.action = "na.fail",data=glmdata)
##
## Call:
## glm.nb(formula = thal_count ~ temp * TIN * day_length, data = glmdata,
## weights = +offset(log(vol_ml_3000)), na.action = "na.fail",
## maxit = 1000, init.theta = 1.221015633, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3871 -1.7954 -0.6076 0.3484 4.2276
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 129.0280 43.8212 2.944 0.00324 **
## temp -5.1474 1.8318 -2.810 0.00495 **
## TIN 98.4504 105.2701 0.935 0.34968
## day_length -7.5699 3.4952 -2.166 0.03032 *
## temp:TIN -4.7105 4.6684 -1.009 0.31296
## temp:day_length 0.3158 0.1464 2.157 0.03098 *
## TIN:day_length -8.0391 8.5804 -0.937 0.34880
## temp:TIN:day_length 0.3830 0.3802 1.007 0.31371
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.221) family taken to be 1)
##
## Null deviance: 453.61 on 44 degrees of freedom
## Residual deviance: 136.43 on 37 degrees of freedom
## AIC: 1871.4
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.221
## Std. Err.: 0.141
##
## 2 x log-likelihood: -1853.364
Thus, this model was selected to predict the abundances of main diatom species in the GoA, in response to 3 environmental parameters (mentioned above).
Validating this model by examination of residuals (this part is taken from RPub by Ben Horvath):
Deviance residuals:
The deviance residuals distribution seems ok.
### yhat<-plot(density(glmdata$thal_count), main='raw counts vs. model prediction (counts_hat)')
### lines(density(predict(nbin_thal, type='response')), col='red')
It seems that the model (red line) fits the raw counts (black line).
Verifying there is no pattern among the residuals:
Final assessment to visualize if negative binomial is the right choise:
Following residuals diagnostics, it seems that negative binomial fits the count data.
selection_info_thal = dredge(nbin_thal)
## Fixed term is "(Intercept)"
The top model on the list best fits the data. Here is the model’s summary:
best_model_thal = get.models(selection_info_thal,subset = NA)[[1]]
##
## Call:
## glm.nb(formula = thal_count ~ day_length + temp + day_length:temp +
## 1, data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail",
## maxit = 1000, init.theta = 1.198892036, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3180 -1.4084 -0.5520 0.3022 4.4061
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 136.96326 16.87141 8.118 4.74e-16 ***
## day_length -8.31563 1.37858 -6.032 1.62e-09 ***
## temp -5.68197 0.71907 -7.902 2.75e-15 ***
## day_length:temp 0.36271 0.05879 6.169 6.87e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.1989) family taken to be 1)
##
## Null deviance: 445.44 on 44 degrees of freedom
## Residual deviance: 136.63 on 41 degrees of freedom
## AIC: 1866
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.199
## Std. Err.: 0.138
##
## 2 x log-likelihood: -1856.034
Visualising the best model. on the left is the predicted values of Thalassiosira counts based on water temperature. Different line colors correspond to 3 levels of day-length: short (red), intermediate (blue), and long (green). Light-color around the trendline represents 95% CI of each prediction. The right panel shows the raw counts as function of water temp..
Another option is to use an average of several models.
supported_models_thal = selection_info_thal[selection_info_thal$delta < 2,]
avg_model_thal = model.avg(supported_models_thal,fit=TRUE)
##
## Call:
## model.avg(object = get.models(object = supported_models_thal,
## subset = NA))
##
## Component model call:
## glm.nb(formula = thal_count ~ <2 unique rhs>, data = glmdata, weights =
## +offset(log(vol_ml_3000)), na.action = na.fail, maxit = 1000,
## init.theta = <2 unique values>, link = log)
##
## Component models:
## df logLik AICc delta weight
## 124 5 -928.02 1867.57 0.00 0.63
## 1234 6 -927.20 1868.61 1.04 0.37
##
## Term codes:
## day_length temp TIN day_length:temp
## 1 2 3 4
##
## Model-averaged coefficients:
## (full average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 138.31133 16.95242 17.46721 7.918 <2e-16 ***
## day_length -8.38864 1.37980 1.42194 5.899 <2e-16 ***
## temp -5.73701 0.72210 0.74405 7.711 <2e-16 ***
## day_length:temp 0.36584 0.05885 0.06065 6.032 <2e-16 ***
## TIN -0.12334 0.22153 0.22486 0.549 0.583
##
## (conditional average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 138.31133 16.95242 17.46721 7.918 <2e-16 ***
## day_length -8.38864 1.37980 1.42194 5.899 <2e-16 ***
## temp -5.73701 0.72210 0.74405 7.711 <2e-16 ***
## day_length:temp 0.36584 0.05885 0.06065 6.032 <2e-16 ***
## TIN -0.33089 0.25096 0.25878 1.279 0.201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can compare both models and test which one shows higher correlation with the data.
best_model_cor_thal = cor(predict(best_model_thal),glmdata$thal_count)
avg_model_cor_thal = cor(predict(avg_model_thal),glmdata$thal_count)
## [1] 0.608157
## [1] 0.6073354
It seems that both models are similar, with a slightly higher correlation of the best model approach.
pwr_test<-pwr.r.test(r=best_model_cor_thal,n=nrow(glmdata))
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 45
## r = 0.608157
## sig.level = 0.05
## power = 0.9960602
## alternative = two.sided
Sampling effort was sufficient, power is fairly high.
##
## Call:
## glm.nb(formula = hyalo_count ~ day_length + temp + TIN + temp:TIN +
## 1, data = glmdata, weights = +offset(log(vol_ml_1000)), na.action = "na.fail",
## maxit = 1000, init.theta = 0.8467493425, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9385 -2.0514 -0.9607 0.6610 3.7307
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.30421 2.28639 3.632 0.000281 ***
## day_length -0.37951 0.07503 -5.058 4.23e-07 ***
## temp 0.03568 0.09637 0.370 0.711197
## TIN 15.18361 6.19254 2.452 0.014210 *
## temp:TIN -0.64778 0.29023 -2.232 0.025619 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.8467) family taken to be 1)
##
## Null deviance: 289.69 on 44 degrees of freedom
## Residual deviance: 196.36 on 40 degrees of freedom
## AIC: 2046.6
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.8467
## Std. Err.: 0.0823
##
## 2 x log-likelihood: -2034.6040
best_model_cor_hyalo = cor(predict(best_model_hyalo),glmdata$hyalo_count)
## [1] 0.6834654
pwr_test_hyalo<-pwr.r.test(r=best_model_cor_hyalo,n=nrow(glmdata))
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 45
## r = 0.6834654
## sig.level = 0.05
## power = 0.9997696
## alternative = two.sided
##
## Call:
## model.avg(object = get.models(object = supported_models_closter,
## subset = NA))
##
## Component model call:
## glm.nb(formula = closter_count ~ <5 unique rhs>, data = glmdata,
## weights = +offset(log(vol_ml_1000)), na.action = na.fail, maxit = 1000,
## init.theta = <5 unique values>, link = log)
##
## Component models:
## df logLik AICc delta weight
## 123456 8 -1260.47 2540.94 0.00 0.28
## 236 5 -1264.81 2541.16 0.22 0.25
## 1234567 9 -1259.36 2541.87 0.92 0.18
## 1236 6 -1263.91 2542.04 1.10 0.16
## 12346 7 -1262.70 2542.44 1.49 0.13
##
## Term codes:
## day_length temp TIN day_length:temp
## 1 2 3 4
## day_length:TIN temp:TIN day_length:temp:TIN
## 5 6 7
##
## Model-averaged coefficients:
## (full average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 50.67164 46.86114 47.12644 1.075 0.282
## day_length -3.46433 3.74330 3.76437 0.920 0.357
## temp -1.76077 1.86605 1.87699 0.938 0.348
## TIN -73.95099 69.72191 70.31221 1.052 0.293
## day_length:temp 0.13462 0.14932 0.15019 0.896 0.370
## day_length:TIN 2.73386 5.70668 5.75393 0.475 0.635
## temp:TIN 3.06002 2.96350 2.99083 1.023 0.306
## day_length:temp:TIN -0.09649 0.24271 0.24488 0.394 0.694
##
## (conditional average)
## Estimate Std. Error Adjusted SE z value Pr(>|z|)
## (Intercept) 50.6716 46.8611 47.1264 1.075 0.2823
## day_length -4.6219 3.6530 3.6817 1.255 0.2093
## temp -1.7608 1.8660 1.8770 0.938 0.3482
## TIN -73.9510 69.7219 70.3122 1.052 0.2929
## day_length:temp 0.2289 0.1278 0.1295 1.768 0.0770 .
## day_length:TIN 6.0009 7.2027 7.2847 0.824 0.4101
## temp:TIN 3.0600 2.9635 2.9908 1.023 0.3062
## day_length:temp:TIN -0.5476 0.2954 0.3054 1.793 0.0729 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
avg_model_cor_closter = cor(predict(avg_model_closter),glmdata$closter_count)
## [1] 0.5065935
pwr_test_closter<-pwr.r.test(r=avg_model_cor_closter,n=nrow(glmdata))
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 45
## r = 0.5065935
## sig.level = 0.05
## power = 0.9546288
## alternative = two.sided
##
## Call:
## glm.nb(formula = nitzs_count ~ day_length + temp + TIN + day_length:temp +
## day_length:TIN + 1, data = glmdata, weights = +offset(log(vol_ml_1000)),
## na.action = "na.fail", maxit = 1000, init.theta = 1.450997028,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.5075 -1.8994 -0.1475 0.7776 4.9782
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -159.85551 21.52677 -7.426 1.12e-13 ***
## day_length 13.11350 1.70470 7.693 1.44e-14 ***
## temp 6.66153 0.83020 8.024 1.02e-15 ***
## TIN 18.40902 5.87388 3.134 0.00172 **
## day_length:temp -0.52936 0.06593 -8.029 9.83e-16 ***
## day_length:TIN -1.37125 0.46513 -2.948 0.00320 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.451) family taken to be 1)
##
## Null deviance: 332.88 on 44 degrees of freedom
## Residual deviance: 186.32 on 39 degrees of freedom
## AIC: 2295.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.451
## Std. Err.: 0.145
##
## 2 x log-likelihood: -2281.139
best_model_cor_nitzs = cor(predict(best_model_nitzs),glmdata$nitzs_count)
## [1] 0.6371158
pwr_test_nitzs<-pwr.r.test(r=best_model_cor_nitzs,n=nrow(glmdata))
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 45
## r = 0.6371158
## sig.level = 0.05
## power = 0.9984912
## alternative = two.sided
##
## Call:
## glm.nb(formula = pennM_count ~ day_length + temp + TIN + day_length:temp +
## day_length:TIN + 1, data = glmdata, weights = +offset(log(vol_ml_3000)),
## na.action = "na.fail", maxit = 1000, init.theta = 2.402524449,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7032 -1.2078 -0.4224 0.7815 3.6492
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -82.64727 19.87249 -4.159 3.20e-05 ***
## day_length 6.91384 1.57266 4.396 1.10e-05 ***
## temp 3.42043 0.76563 4.467 7.91e-06 ***
## TIN 11.84556 5.43614 2.179 0.0293 *
## day_length:temp -0.26737 0.06075 -4.401 1.08e-05 ***
## day_length:TIN -0.96011 0.43079 -2.229 0.0258 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.4025) family taken to be 1)
##
## Null deviance: 182.82 on 44 degrees of freedom
## Residual deviance: 129.64 on 39 degrees of freedom
## AIC: 1523.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.403
## Std. Err.: 0.293
##
## 2 x log-likelihood: -1509.148
best_model_cor_pennm = cor(predict(best_model_pennm),glmdata$pennM_count)
## [1] 0.3920523
Correlation is not strong (0.39). Testing the power level of the analysis:
pwr_test_pennm<-pwr.r.test(r=best_model_cor_pennm,n=nrow(glmdata))
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 45
## r = 0.3920523
## sig.level = 0.05
## power = 0.7734919
## alternative = two.sided
Power is lower than 90%. Another test will show the required sampling effort to reach power level of 90%:
pwr_test_pennm_samp_effort<-pwr.r.test(r=best_model_cor_pennm,power=0.9)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 63.40017
## r = 0.3920523
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
For this group of medium size pennates, 64 time points are required to
obtain 90% power level.
##
## Call:
## glm.nb(formula = pennS_count ~ day_length + temp + TIN + day_length:temp +
## 1, data = glmdata, weights = +offset(log(vol_ml_3000)), na.action = "na.fail",
## maxit = 1000, init.theta = 3.650265167, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0249 -1.3677 -0.4250 0.2146 4.2866
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.00486 9.49900 -1.053 0.2922
## day_length 1.79359 0.77262 2.321 0.0203 *
## temp 0.61903 0.40404 1.532 0.1255
## TIN -0.68199 0.14515 -4.699 2.62e-06 ***
## day_length:temp -0.07066 0.03290 -2.148 0.0317 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.6503) family taken to be 1)
##
## Null deviance: 178.79 on 44 degrees of freedom
## Residual deviance: 126.85 on 40 degrees of freedom
## AIC: 1604.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.650
## Std. Err.: 0.453
##
## 2 x log-likelihood: -1592.734
best_model_cor_penns = cor(predict(best_model_penns),glmdata$pennS_count)
## [1] 0.3403893
Correlation is not strong (0.34). Testing the power level of the analysis:
pwr_test_penns<-pwr.r.test(r=best_model_cor_penns,n=nrow(glmdata))
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 45
## r = 0.3403893
## sig.level = 0.05
## power = 0.6404924
## alternative = two.sided
Power is lower than 90%. Another test will show the required sampling effort to reach power level of 90%:
pwr_test_penns_samp_effort<-pwr.r.test(r=best_model_cor_penns,power=0.9)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 85.73737
## r = 0.3403893
## sig.level = 0.05
## power = 0.9
## alternative = two.sided
For this group of small size pennates, 86 time points are required to
obtain 90% power level (almost double sampling effort).