I want you to help me to justify which is the best choice for calibration dataset. There are three possibilities: (1) we can based on all harvested (group of) species (2) we work only on harvested fishes (3) make a regression using only species which are represented as a single species compartment.
This is because the input data for regression has a very wide variation; hence, choosing appropriate data to build the model can enhance the quality of model.
The df_top data set , which include SPPR for 5 species which are most commonly presented in the models as a single species compartment, can be served as the validation dataset. In this case, we will removed them from the total dataset
Now create a training data set that does not include the above species
df$Group <- as.character(df$Group)
df_top$Group <- as.character(df_top$Group)
# get rows of validation data set in df
idx <- NULL
system.time(for (i in 1:nrow(df)){
for (j in 1:nrow(df_top)){
if (df$ID[i]==df_top$ID[j] & df$Group[i]==df_top$Group[j]){
x <- which(df$ID== df$ID[j] & df$Group==df$Group[j])
idx <- c(idx, x)
}
}
})## user system elapsed
## 12.027 0.027 12.048
# remove validation data from df
df_cal <- df[-idx,]
df_all <- df
df <- df_cal # this data frame consist of all type of speciesdf_all_harvested_cor <- df_all_harvested[,which(colnames(df_all_harvested) %in%
c("SPPR", "TL", "TE", "PB", "QB", "PQ","NPP", "FCI", "TST"))]
chart.Correlation(as.matrix(df_all_harvested_cor), histogram=TRUE, pch=19) Figure 1. Correlation of \(\log_{10}\text(SPPR)\) versus predictors for all harvested species.
Table 1. Summary data for all harvested species
datatable(sapply(df_all_harvested_cor, summary))#datatable(round(cor(df_all_harvested_cor), digits=2))df_fish_cor <- df_fish[,which(colnames(df_fish) %in%
c("SPPR", "TL", "TE", "PB", "QB", "PQ","NPP", "FCI", "TST"))]
chart.Correlation(as.matrix(df_fish_cor), histogram=TRUE, pch=19) Figure 2. Correlation of \(\log_{10}\text(SPPR)\) versus predictors for harvested fish species.
Table 2. Summary data for all fish species
datatable(sapply(df_fish_cor, summary))#datatable(round(cor(df_all_harvested_cor), digits=2))df_single_cor <- df_single[,which(colnames(df_single) %in%
c("SPPR", "TL", "TE", "PB", "QB", "PQ","NPP", "FCI", "TST"))]
chart.Correlation(as.matrix(df_single_cor), histogram=TRUE, pch=19) Figure 3. Correlation of \(\log_{10}\text(SPPR)\) versus predictors for single species compartments.
Table 3. Summary data for all single species compartment
datatable(sapply(df_single_cor, summary))#datatable(round(cor(df_all_harvested_cor), digits=2))From Fig 1, Fig 2 and Fig 3, it is clear that \(\log_{10}\text(SPPR)\) is possitively correlated with TL and negatively correlated with PB and PQ at lower extent.
Table 4. Summary of models with all harvested species
datatable(Final_output[[1]])Table 5. Summary of models with only harvested fishes
datatable(Final_output[[2]])Table 6. Summary of models with single species compartments
datatable(Final_output[[3]])So the models constructed with calibration dataset consisting of only single species compartment gives best results in terms of cross-validated mean squared residuals and root means squred error of prediction. TL is the best predictor and the improvement of model will increase only slightly when more predictors added.
# Model with TL as the only predictor
fit <- lm(SPPR~TL, data=df_single)
summary(fit)##
## Call:
## lm(formula = SPPR ~ TL, data = df_single)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18399 -0.43445 -0.08065 0.33063 2.32600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.59298 0.15940 -3.72 0.000224 ***
## TL 0.89675 0.04411 20.33 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5693 on 460 degrees of freedom
## Multiple R-squared: 0.4732, Adjusted R-squared: 0.4721
## F-statistic: 413.2 on 1 and 460 DF, p-value: < 2.2e-16
# Model with TL and type of species
fit1 <- lm(SPPR~TL+TYPE, data=df_single)
summary(fit1)##
## Call:
## lm(formula = SPPR ~ TL + TYPE, data = df_single)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19296 -0.40678 -0.09338 0.32063 2.28614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.73113 0.27100 -2.698 0.00724 **
## TL 0.86868 0.04763 18.236 < 2e-16 ***
## TYPECP -0.14195 0.31005 -0.458 0.64730
## TYPECR 0.16389 0.32333 0.507 0.61248
## TYPECRB -0.10599 0.32513 -0.326 0.74457
## TYPEF 0.25941 0.26537 0.978 0.32883
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5658 on 456 degrees of freedom
## Multiple R-squared: 0.4841, Adjusted R-squared: 0.4784
## F-statistic: 85.57 on 5 and 456 DF, p-value: < 2.2e-16
# Model with TL and Ecosystem type
fit2 <- lm(SPPR~TL + Ecosystem.type, data=df_single)
summary(fit2)##
## Call:
## lm(formula = SPPR ~ TL + Ecosystem.type, data = df_single)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18866 -0.41957 -0.06312 0.33177 2.09946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6677743 0.1658188 -4.027 6.62e-05 ***
## TL 0.9165614 0.0455675 20.114 < 2e-16 ***
## Ecosystem.typeChanel/Strait -0.0609074 0.1010844 -0.603 0.547115
## Ecosystem.typeCoastal lagoon 0.8045864 0.2134850 3.769 0.000186 ***
## Ecosystem.typeContinental shelf 0.0002651 0.0922701 0.003 0.997709
## Ecosystem.typeEstuary -0.3182764 0.4048779 -0.786 0.432216
## Ecosystem.typeOpen ocean -0.0253030 0.1054355 -0.240 0.810450
## Ecosystem.typeUpwelling 0.0325622 0.0997813 0.326 0.744321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5619 on 454 degrees of freedom
## Multiple R-squared: 0.4935, Adjusted R-squared: 0.4857
## F-statistic: 63.19 on 7 and 454 DF, p-value: < 2.2e-16
So the equation represents the relationship between \(\log_{10}\text(SPPR)\) and TL will be:
\[\log_{10}{SPPR}=-0.59 +0.90 \times~ TL\]
Now we make the prediction of SPPR for some most harvested species according to FAO’s statistic based on our new equation and TL values obtained from Fishbase data base. Then we will compared our results with the result from empirical approach.
df_fishbase <- read.xlsx(file="fishbase.xlsx", 1)
df_fishbase <- cbind(df_fishbase, predict(fit, newdata=df_fishbase, interval = "prediction"))
df_fishbase$empirical.equation <- df_fishbase$TL-1 #Assuming 10% of TE
df_fishbase <- cbind(df_fishbase[,1:3], round(df_fishbase[,4:ncol(df_fishbase)], digits=2))
datatable(df_fishbase)ggplot(data=df_fishbase, aes(x=TL, y=fit))+
geom_line(colour="blue")+
geom_line(aes(y=lwr), linetype="dashed", colour="blue")+
geom_line(aes(y=upr), linetype="dashed", colour="blue")+
geom_point(aes(y=empirical.equation, colour="red"))+
scale_colour_discrete("", labels="SPPR from empirical equation")+
geom_vline(xintercept = 4.1)The empirical equation gives the lower results from our regression equation for species with trophic level smaller than 4.1. However, the values from empirical equation still fall within the 95% confident interval for the means SPPR estimated from our linear regression equation.
cutoff <- 4/(nrow(df_single)-length(fit$coefficients)-2) # cooks.distance cutoff
influential_obs_rownums <- which(cooks.distance(fit) > cutoff) # caputure influential observations
influential_obs_rownums <- as.numeric(influential_obs_rownums)
df_new <- df_single[-influential_obs_rownums,]
fit_new <- lm(SPPR~TL, data=df_new)
summary(fit_new)##
## Call:
## lm(formula = SPPR ~ TL, data = df_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11724 -0.37608 -0.05681 0.32361 1.66465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.52857 0.14382 -3.675 0.000267 ***
## TL 0.86396 0.03998 21.610 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4829 on 436 degrees of freedom
## Multiple R-squared: 0.5172, Adjusted R-squared: 0.5161
## F-statistic: 467 on 1 and 436 DF, p-value: < 2.2e-16
Figure for model diagnostic of original model
par(mfrow=c(2,2))
plot(fit)Overall model diagnostic
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)##
## Call:
## lm(formula = SPPR ~ TL, data = df_single)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18399 -0.43445 -0.08065 0.33063 2.32600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.59298 0.15940 -3.72 0.000224 ***
## TL 0.89675 0.04411 20.33 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5693 on 460 degrees of freedom
## Multiple R-squared: 0.4732, Adjusted R-squared: 0.4721
## F-statistic: 413.2 on 1 and 460 DF, p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit)
##
## Value p-value Decision
## Global Stat 81.4167 1.110e-16 Assumptions NOT satisfied!
## Skewness 59.2110 1.421e-14 Assumptions NOT satisfied!
## Kurtosis 16.3314 5.318e-05 Assumptions NOT satisfied!
## Link Function 0.2674 6.051e-01 Assumptions acceptable.
## Heteroscedasticity 5.6070 1.789e-02 Assumptions NOT satisfied!
df_single$fit <- fit$fitted.valuesggplot(data=df_single, aes(x=fit, y=SPPR))+
geom_point(alpha=0.2, colour="black")+
geom_smooth(aes(x=fit, y=SPPR), colour="black")+
geom_line(aes(x=SPPR, y=SPPR), colour="blue")+
scale_x_continuous(limits=c(0.5,4)) +
scale_y_continuous("log10(SPPR)", limits=c(-1,9))## `geom_smooth()` using method = 'loess'
## Warning: Removed 26 rows containing missing values (geom_path).
Figure xxx. Obverved vs fitted values
ggplot(data=df_single,aes(x=fit, y=SPPR-fit)) +
geom_point(alpha=0.2,color="black") +
geom_smooth(aes(x=fit, y=SPPR-fit),color="black")+
scale_y_continuous("Residuals")## `geom_smooth()` using method = 'loess'
Figure xxx. Residuals vs fitted values
So the variation in data set is quite big? How is the model quality?
Plot meanSPPR vs meanTL of different groups.
Overall test for model assumption after removing the outliers
gvmodel <- gvlma(fit_new)
summary(gvmodel)##
## Call:
## lm(formula = SPPR ~ TL, data = df_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11724 -0.37608 -0.05681 0.32361 1.66465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.52857 0.14382 -3.675 0.000267 ***
## TL 0.86396 0.03998 21.610 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4829 on 436 degrees of freedom
## Multiple R-squared: 0.5172, Adjusted R-squared: 0.5161
## F-statistic: 467 on 1 and 436 DF, p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit_new)
##
## Value p-value Decision
## Global Stat 22.835397 1.366e-04 Assumptions NOT satisfied!
## Skewness 20.793358 5.116e-06 Assumptions NOT satisfied!
## Kurtosis 0.007421 9.313e-01 Assumptions acceptable.
## Link Function 0.054609 8.152e-01 Assumptions acceptable.
## Heteroscedasticity 1.980009 1.594e-01 Assumptions acceptable.
Hence, removing influential observations improve model fit slightly, and make model more satisfies the assumptions.
What are the most important factors for impact of fishery on primary production deprivation in marine ecosystems?
Human become more and more dependent on protein source from fishery and aquaculture. As such, impacts of human’s exploitation on marine ecosystem has received more attention. One of the popular approach to assess the human impact on marine ecosystem is via the amount of net primary production (NPP) deprived of through harvest of one unit of target species (SPPR). In this sense, the empirical approach which allows quantifying SPPR by using mean transfer efficiency of the ecosystem and the trophic level of harvested species are widely used. However, the marine ecosystems are complex and simplifying complex food web to a parallel linear food chain might not be adequate. This study uses a newly developed calculation framework which allows calculating SPPR directly from the food web flow matrix in combination with available published food webs in Ecosbase database to investigate what is the most important factors determining the SPPR. Interestingly, SPPR is more correlated with parameters associated with individual (group of) species, e.g., trophic level (TL), specific ingestion rate (P/B) and specific growth rate (P/Q) rather than ones used to characterizing the ecosystem properties such as mean transfer efficiency or Finn’s cycling index. Moreover, TL is the most important predictors explaining the variation of SPPR. We calculated SPPR based on TL for 44 most harvested species in FAO’s statistic and saw that the empirical equation mostly underestimates SPPR of these species; however, it is still within the confident interval.
Human’s dependence on marine biomass. Different approach to assess the impact of human on marine ecosystem through harvest of marine biomass. Aim of study: (1) investigate the important factors determing the amount of NPP required to produced harvested biomass. (2) predict the SPPR for some most harvested fish species.
Introduce the food web modelling and new framework to calculate SPPR from food web flow matrix Statistical analysis
Descriptive analysis of data obtain from the Ecobase database and calculated data.
Comparison with empirical approach