1. Data preparation
- 2. Calibration dataset with all harvested species
2. Correlation analysis
3. Variable selection
4. Construction of model
5. Comparison with empirical equation.
6. Analysis of influential observation
7. Paper outline
- 7.1. Abstract
- 7.2. Paper outline

1. Data preparation

I want you to help me to justify which is the best choice for calibration dataset. There are three possibilities: (1) we can based on all harvested (group of) species (2) we work only on harvested fishes (3) make a regression using only species which are represented as a single species compartment.

This is because the input data for regression has a very wide variation; hence, choosing appropriate data to build the model can enhance the quality of model.

The df_top data set , which include SPPR for 5 species which are most commonly presented in the models as a single species compartment, can be served as the validation dataset. In this case, we will removed them from the total dataset

Now create a training data set that does not include the above species

df$Group <- as.character(df$Group)
df_top$Group <- as.character(df_top$Group)
# get rows of validation data set in df
idx <- NULL
system.time(for (i in 1:nrow(df)){
    for (j in 1:nrow(df_top)){
        if (df$ID[i]==df_top$ID[j] & df$Group[i]==df_top$Group[j]){
            x <- which(df$ID== df$ID[j] & df$Group==df$Group[j])
            idx <- c(idx, x)
        }
    }
})

##    user  system elapsed 
##  12.027   0.027  12.048

# remove validation data from df
df_cal <- df[-idx,]
df_all <- df                    
df <- df_cal                  # this data frame consist of all type of species

Now we create different options for making calibration dataset.*

2. Calibration dataset with all harvested species

2. Correlation analysis

df_all_harvested_cor <- df_all_harvested[,which(colnames(df_all_harvested) %in% 
                                    c("SPPR", "TL", "TE", "PB", "QB", "PQ","NPP", "FCI", "TST"))]

chart.Correlation(as.matrix(df_all_harvested_cor), histogram=TRUE, pch=19)

Figure 1. Correlation of \(\log_{10}\text(SPPR)\) versus predictors for all harvested species.

Table 1. Summary data for all harvested species

datatable(sapply(df_all_harvested_cor, summary))

#datatable(round(cor(df_all_harvested_cor), digits=2))

df_fish_cor <- df_fish[,which(colnames(df_fish) %in% 
                                    c("SPPR",  "TL", "TE", "PB", "QB", "PQ","NPP", "FCI", "TST"))]

chart.Correlation(as.matrix(df_fish_cor), histogram=TRUE, pch=19)

Figure 2. Correlation of \(\log_{10}\text(SPPR)\) versus predictors for harvested fish species.

Table 2. Summary data for all fish species

datatable(sapply(df_fish_cor, summary))

#datatable(round(cor(df_all_harvested_cor), digits=2))

df_single_cor <- df_single[,which(colnames(df_single) %in% 
                                    c("SPPR", "TL", "TE", "PB", "QB", "PQ","NPP", "FCI", "TST"))]

chart.Correlation(as.matrix(df_single_cor), histogram=TRUE, pch=19)

Figure 3. Correlation of \(\log_{10}\text(SPPR)\) versus predictors for single species compartments.

Table 3. Summary data for all single species compartment

datatable(sapply(df_single_cor, summary))

#datatable(round(cor(df_all_harvested_cor), digits=2))

From Fig 1, Fig 2 and Fig 3, it is clear that \(\log_{10}\text(SPPR)\) is possitively correlated with TL and negatively correlated with PB and PQ at lower extent.

3. Variable selection

Table 4. Summary of models with all harvested species

datatable(Final_output[[1]])

Table 5. Summary of models with only harvested fishes

datatable(Final_output[[2]])

Table 6. Summary of models with single species compartments

datatable(Final_output[[3]])

So the models constructed with calibration dataset consisting of only single species compartment gives best results in terms of cross-validated mean squared residuals and root means squred error of prediction. TL is the best predictor and the improvement of model will increase only slightly when more predictors added.

4. Construction of model

# Model with TL as the only predictor
fit <- lm(SPPR~TL, data=df_single)
summary(fit)

## 
## Call:
## lm(formula = SPPR ~ TL, data = df_single)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18399 -0.43445 -0.08065  0.33063  2.32600 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.59298    0.15940   -3.72 0.000224 ***
## TL           0.89675    0.04411   20.33  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5693 on 460 degrees of freedom
## Multiple R-squared:  0.4732, Adjusted R-squared:  0.4721 
## F-statistic: 413.2 on 1 and 460 DF,  p-value: < 2.2e-16

# Model with TL and type of species
fit1 <- lm(SPPR~TL+TYPE, data=df_single)
summary(fit1)

## 
## Call:
## lm(formula = SPPR ~ TL + TYPE, data = df_single)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19296 -0.40678 -0.09338  0.32063  2.28614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.73113    0.27100  -2.698  0.00724 ** 
## TL           0.86868    0.04763  18.236  < 2e-16 ***
## TYPECP      -0.14195    0.31005  -0.458  0.64730    
## TYPECR       0.16389    0.32333   0.507  0.61248    
## TYPECRB     -0.10599    0.32513  -0.326  0.74457    
## TYPEF        0.25941    0.26537   0.978  0.32883    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5658 on 456 degrees of freedom
## Multiple R-squared:  0.4841, Adjusted R-squared:  0.4784 
## F-statistic: 85.57 on 5 and 456 DF,  p-value: < 2.2e-16

# Model with TL and Ecosystem type 
fit2 <- lm(SPPR~TL + Ecosystem.type, data=df_single)
summary(fit2)

## 
## Call:
## lm(formula = SPPR ~ TL + Ecosystem.type, data = df_single)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18866 -0.41957 -0.06312  0.33177  2.09946 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.6677743  0.1658188  -4.027 6.62e-05 ***
## TL                               0.9165614  0.0455675  20.114  < 2e-16 ***
## Ecosystem.typeChanel/Strait     -0.0609074  0.1010844  -0.603 0.547115    
## Ecosystem.typeCoastal lagoon     0.8045864  0.2134850   3.769 0.000186 ***
## Ecosystem.typeContinental shelf  0.0002651  0.0922701   0.003 0.997709    
## Ecosystem.typeEstuary           -0.3182764  0.4048779  -0.786 0.432216    
## Ecosystem.typeOpen ocean        -0.0253030  0.1054355  -0.240 0.810450    
## Ecosystem.typeUpwelling          0.0325622  0.0997813   0.326 0.744321    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5619 on 454 degrees of freedom
## Multiple R-squared:  0.4935, Adjusted R-squared:  0.4857 
## F-statistic: 63.19 on 7 and 454 DF,  p-value: < 2.2e-16

So the equation represents the relationship between \(\log_{10}\text(SPPR)\) and TL will be:

\[\log_{10}{SPPR}=-0.59 +0.90 \times~ TL\]

5. Comparison with empirical equation.

Now we make the prediction of SPPR for some most harvested species according to FAO’s statistic based on our new equation and TL values obtained from Fishbase data base. Then we will compared our results with the result from empirical approach.

df_fishbase <- read.xlsx(file="fishbase.xlsx", 1)
df_fishbase <- cbind(df_fishbase, predict(fit, newdata=df_fishbase, interval = "prediction"))
df_fishbase$empirical.equation <- df_fishbase$TL-1  #Assuming 10% of TE
df_fishbase <- cbind(df_fishbase[,1:3], round(df_fishbase[,4:ncol(df_fishbase)], digits=2))
datatable(df_fishbase)

ggplot(data=df_fishbase, aes(x=TL, y=fit))+
  geom_line(colour="blue")+
  geom_line(aes(y=lwr), linetype="dashed", colour="blue")+
  geom_line(aes(y=upr), linetype="dashed", colour="blue")+
  geom_point(aes(y=empirical.equation, colour="red"))+
  scale_colour_discrete("", labels="SPPR from empirical equation")+
  geom_vline(xintercept = 4.1)

The empirical equation gives the lower results from our regression equation for species with trophic level smaller than 4.1. However, the values from empirical equation still fall within the 95% confident interval for the means SPPR estimated from our linear regression equation.

6. Analysis of influential observation

cutoff <- 4/(nrow(df_single)-length(fit$coefficients)-2)  # cooks.distance cutoff
influential_obs_rownums <- which(cooks.distance(fit) > cutoff)  # caputure influential observations 
influential_obs_rownums <- as.numeric(influential_obs_rownums)

df_new <- df_single[-influential_obs_rownums,]
fit_new <- lm(SPPR~TL, data=df_new)    
summary(fit_new)

## 
## Call:
## lm(formula = SPPR ~ TL, data = df_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11724 -0.37608 -0.05681  0.32361  1.66465 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.52857    0.14382  -3.675 0.000267 ***
## TL           0.86396    0.03998  21.610  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4829 on 436 degrees of freedom
## Multiple R-squared:  0.5172, Adjusted R-squared:  0.5161 
## F-statistic:   467 on 1 and 436 DF,  p-value: < 2.2e-16

Figure for model diagnostic of original model

par(mfrow=c(2,2))
plot(fit)

Overall model diagnostic

library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)

## 
## Call:
## lm(formula = SPPR ~ TL, data = df_single)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18399 -0.43445 -0.08065  0.33063  2.32600 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.59298    0.15940   -3.72 0.000224 ***
## TL           0.89675    0.04411   20.33  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5693 on 460 degrees of freedom
## Multiple R-squared:  0.4732, Adjusted R-squared:  0.4721 
## F-statistic: 413.2 on 1 and 460 DF,  p-value: < 2.2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = fit) 
## 
##                      Value   p-value                   Decision
## Global Stat        81.4167 1.110e-16 Assumptions NOT satisfied!
## Skewness           59.2110 1.421e-14 Assumptions NOT satisfied!
## Kurtosis           16.3314 5.318e-05 Assumptions NOT satisfied!
## Link Function       0.2674 6.051e-01    Assumptions acceptable.
## Heteroscedasticity  5.6070 1.789e-02 Assumptions NOT satisfied!

df_single$fit <- fit$fitted.values

ggplot(data=df_single, aes(x=fit, y=SPPR))+
  geom_point(alpha=0.2, colour="black")+
  geom_smooth(aes(x=fit, y=SPPR), colour="black")+
  geom_line(aes(x=SPPR, y=SPPR), colour="blue")+
  scale_x_continuous(limits=c(0.5,4)) +
  scale_y_continuous("log10(SPPR)", limits=c(-1,9))

## `geom_smooth()` using method = 'loess'

## Warning: Removed 26 rows containing missing values (geom_path).

Figure xxx. Obverved vs fitted values

ggplot(data=df_single,aes(x=fit, y=SPPR-fit)) +
  geom_point(alpha=0.2,color="black") +
  geom_smooth(aes(x=fit, y=SPPR-fit),color="black")+
  scale_y_continuous("Residuals")

## `geom_smooth()` using method = 'loess'

Figure xxx. Residuals vs fitted values

So the variation in data set is quite big? How is the model quality?

Plot meanSPPR vs meanTL of different groups.

Overall test for model assumption after removing the outliers

gvmodel <- gvlma(fit_new)
summary(gvmodel)

## 
## Call:
## lm(formula = SPPR ~ TL, data = df_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11724 -0.37608 -0.05681  0.32361  1.66465 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.52857    0.14382  -3.675 0.000267 ***
## TL           0.86396    0.03998  21.610  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4829 on 436 degrees of freedom
## Multiple R-squared:  0.5172, Adjusted R-squared:  0.5161 
## F-statistic:   467 on 1 and 436 DF,  p-value: < 2.2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = fit_new) 
## 
##                        Value   p-value                   Decision
## Global Stat        22.835397 1.366e-04 Assumptions NOT satisfied!
## Skewness           20.793358 5.116e-06 Assumptions NOT satisfied!
## Kurtosis            0.007421 9.313e-01    Assumptions acceptable.
## Link Function       0.054609 8.152e-01    Assumptions acceptable.
## Heteroscedasticity  1.980009 1.594e-01    Assumptions acceptable.

Hence, removing influential observations improve model fit slightly, and make model more satisfies the assumptions.

7. Paper outline

What are the most important factors for impact of fishery on primary production deprivation in marine ecosystems?

7.1. Abstract

Human become more and more dependent on protein source from fishery and aquaculture. As such, impacts of human’s exploitation on marine ecosystem has received more attention. One of the popular approach to assess the human impact on marine ecosystem is via the amount of net primary production (NPP) deprived of through harvest of one unit of target species (SPPR). In this sense, the empirical approach which allows quantifying SPPR by using mean transfer efficiency of the ecosystem and the trophic level of harvested species are widely used. However, the marine ecosystems are complex and simplifying complex food web to a parallel linear food chain might not be adequate. This study uses a newly developed calculation framework which allows calculating SPPR directly from the food web flow matrix in combination with available published food webs in Ecosbase database to investigate what is the most important factors determining the SPPR. Interestingly, SPPR is more correlated with parameters associated with individual (group of) species, e.g., trophic level (TL), specific ingestion rate (P/B) and specific growth rate (P/Q) rather than ones used to characterizing the ecosystem properties such as mean transfer efficiency or Finn’s cycling index. Moreover, TL is the most important predictors explaining the variation of SPPR. We calculated SPPR based on TL for 44 most harvested species in FAO’s statistic and saw that the empirical equation mostly underestimates SPPR of these species; however, it is still within the confident interval.

7.2. Paper outline

Introduction

Human’s dependence on marine biomass. Different approach to assess the impact of human on marine ecosystem through harvest of marine biomass. Aim of study: (1) investigate the important factors determing the amount of NPP required to produced harvested biomass. (2) predict the SPPR for some most harvested fish species.

Material and method

Introduce the food web modelling and new framework to calculate SPPR from food web flow matrix Statistical analysis

Result and discussion

Descriptive analysis of data obtain from the Ecobase database and calculated data.
Correlation analysis
Regression analysis
Comparison with empirical approach

Code for all options

LDA

Nov 24, 2016