Introduction

Coregonus hoyi, often known as the bloater, is in the family Salmonidae and a species of freshwater whitefish . It is silver in colour, herring-like fish, 25.5 centimetres long. It is found in the Great Lakes and also in Lake Nipigon, inhabits underwater slopes.

Methodology

The task is to analyze the egg depositions of Lake Huron Bloasters by using different analysis methods. The final goal is to choose the best model from different set of models and give relevant forecasts of egg depositions for the next 5 years.

Setup

We load the following packages

library(TSA)
## Warning: package 'TSA' was built under R version 3.5.3
library(tseries)
library(lmtest)
library(dLagM)
## Warning: package 'dLagM' was built under R version 3.5.3
## Warning: package 'nardl' was built under R version 3.5.3
## Warning: package 'dynlm' was built under R version 3.5.3
library(fUnitRoots)
## Warning: package 'fUnitRoots' was built under R version 3.5.3
## Warning: package 'timeSeries' was built under R version 3.5.3
## Warning: package 'fBasics' was built under R version 3.5.3
library(FitAR)
## Warning: package 'FitAR' was built under R version 3.5.3
## Warning: package 'leaps' was built under R version 3.5.3
## Warning: package 'ltsa' was built under R version 3.5.2
## Warning: package 'bestglm' was built under R version 3.5.3
library(forecast)

Read Data

egg_data1 <- read.csv("eggs.csv", header = TRUE)
head(egg_data1)
##   year   eggs
## 1 1981 0.0402
## 2 1982 0.0602
## 3 1983 0.1205
## 4 1984 0.1807
## 5 1985 0.7229
## 6 1986 0.5321
class(egg_data1)
## [1] "data.frame"

Analyzing the data

In the below code, we are converting the data to a time series data to perform the analysis. A random walk graph is generated using the plot function and the following points are observed:

  - Trend : We see a upward trend.
  - Behaviour : The series appears to be auto regressive due to multiple succeeding points.
  - Seasonality : We see no seasonality in this graph.
  - Change in Variance: Change in variance is found.
  - Change in point: Year 1990 shows there is a sudden jump in the Deposition of Eggs
egg_data2 <-ts(as.vector(egg_data1$eggs), start= 1981, end=1996)

class(egg_data2)
## [1] "ts"
plot(egg_data2,type='o',ylab='Random Walk')

acf(egg_data2)

pacf(egg_data2)

adf.test(egg_data2)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  egg_data2
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary
plot(y= egg_data2, x = zlag(egg_data2), ylab = 'Egg Depositions', xlab = 'Previous Year Egg Depositions')

y = egg_data2
x = zlag(egg_data2)
index = 2:length(x)
cor(y[index],x[index])
## [1] 0.7445657

Transformed value

data.transform = BoxCox.ar(egg_data2, method = "yule-walker")
## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

data.transform$ci
## [1] 0.1 0.8
lambda = 0.45 # The mid point of the interval 
data.bc = ((egg_data2^lambda)-1)/lambda
plot(data.bc, type = "o", ylab = "Egg Depositions (in millions)", main = "Transformed - Egg depositions between 1981 and 1996")

adf.test(data.bc)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data.bc
## Dickey-Fuller = -1.6769, Lag order = 2, p-value = 0.6955
## alternative hypothesis: stationary
#Let's calculate the first difference and plot the first differenced series
diff.egg_data2 = diff(data.bc)
plot(diff.egg_data2, type = "o")

adf.test(diff.egg_data2)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff.egg_data2
## Dickey-Fuller = -3.6798, Lag order = 2, p-value = 0.0443
## alternative hypothesis: stationary
acf(diff.egg_data2)

pacf(diff.egg_data2)

Model Fitting

eacf(diff.egg_data2, ar.max = 3, ma.max = 3)
## AR/MA
##   0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o
res = armasubsets(y = diff.egg_data2, nar = 3, nma = 6, y.name = 'test', ar.method = 'yw')
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 4 linear dependencies found
## Reordering variables and trying again:
plot(res)

# ARIMA(0,1,1) - CSS
model011css = arima(diff.egg_data2, order = c(0,1,1), method = 'CSS')
coeftest(model011css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ma1 -1.093338   0.056532  -19.34 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,1) - ML
model011ml = arima(diff.egg_data2, order = c(0,1,1), method = 'ML')
coeftest(model011ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ma1 -0.99994    0.66766 -1.4977   0.1342
# ARIMA(1,1,0) - CSS
model110css = arima(diff.egg_data2, order = c(1,1,0), method = 'CSS')
coeftest(model110css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)  
## ar1 -0.40609    0.24460 -1.6602  0.09687 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(1,1,0) - ML
model110ml = arima(diff.egg_data2, order = c(1,1,0), method = 'ML')
coeftest(model110ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1 -0.38056    0.23493 -1.6199   0.1053
# ARIMA (1,1,1) - CSS
model111css = arima(diff.egg_data2, order = c(1,1,1), method = 'CSS')
coeftest(model111css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.030947   0.290460  0.1065    0.9152    
## ma1 -1.021947   0.201926 -5.0610 4.171e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA (1,1,1) - ML
model111ml = arima(diff.egg_data2, order = c(1,1,1), method = 'ML')
coeftest(model111ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)   
## ar1  0.11474    0.26992  0.4251 0.670764   
## ma1 -1.00000    0.33205 -3.0116 0.002599 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,3) - CSS
model013css = arima(diff.egg_data2, order = c(0,1,3), method = 'CSS')
coeftest(model013css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -1.10117    0.29725 -3.7045 0.0002118 ***
## ma2 -0.11909    0.33351 -0.3571 0.7210194    
## ma3  0.14673    0.34631  0.4237 0.6717891    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,3) - ML
model013ml = arima(diff.egg_data2, order = c(0,1,3), method = 'ML')
coeftest(model013ml)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ma1 -0.852252   0.536615 -1.5882   0.1122
## ma2 -0.102618   0.286176 -0.3586   0.7199
## ma3 -0.045118   0.512224 -0.0881   0.9298
# ARIMA(3,1,3) - CSS
model313css = arima(diff.egg_data2, order = c(3,1,3), method = 'CSS')
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model313css)
## Warning in sqrt(diag(se)): NaNs produced
## 
## z test of coefficients:
## 
##       Estimate Std. Error   z value Pr(>|z|)    
## ar1 -1.2016973  0.0067545 -177.9105  < 2e-16 ***
## ar2 -0.8077002  0.0231155  -34.9419  < 2e-16 ***
## ar3 -0.1124583  0.0998924   -1.1258  0.26025    
## ma1  1.9195706         NA        NA       NA    
## ma2  1.9992451         NA        NA       NA    
## ma3 -1.3935177  0.6294151   -2.2140  0.02683 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(3,1,3) - ML
model313ml = arima(diff.egg_data2, order = c(3,1,3), method = 'ML')
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model313ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)   
## ar1  0.35427    0.42161  0.8403 0.400749   
## ar2 -0.67968    0.26271 -2.5872 0.009675 **
## ar3 -0.41455    0.32151 -1.2894 0.197269   
## ma1 -1.31542    0.60118 -2.1881 0.028666 * 
## ma2  1.42062    1.09000  1.3033 0.192465   
## ma3 -0.59339    0.60515 -0.9806 0.326809   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,5) - CSS
model015css = arima(diff.egg_data2, order = c(0,1,5), method = 'CSS')
coeftest(model015css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)   
## ma1 -1.016594   0.375033 -2.7107 0.006715 **
## ma2  0.013365   0.364525  0.0367 0.970752   
## ma3 -0.753976   0.239229 -3.1517 0.001623 **
## ma4  1.192805   0.411640  2.8977 0.003759 **
## ma5 -0.538537   0.506024 -1.0643 0.287215   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(0,1,5) - ML
model015ml = arima(diff.egg_data2, order = c(0,1,5), method = 'ML')
coeftest(model015ml)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ma1 -0.824906   0.739712 -1.1152   0.2648
## ma2  0.038341   0.314834  0.1218   0.9031
## ma3 -0.732093   0.840193 -0.8713   0.3836
## ma4  1.065250   0.965775  1.1030   0.2700
## ma5 -0.089888   0.354698 -0.2534   0.7999
# ARIMA (0,1,2) - CSS
model012css = arima(diff.egg_data2, order = c(0,1,2), method = 'CSS')
coeftest(model012css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ma1 -1.017026   0.274754 -3.7016 0.0002143 ***
## ma2 -0.087178   0.308221 -0.2828 0.7772974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA (0,1,2) - ML
model012ml = arima(diff.egg_data2, order = c(0,1,2), method = 'ML')
coeftest(model012ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)  
## ma1 -0.88465    0.42637 -2.0748   0.0380 *
## ma2 -0.11535    0.25484 -0.4527   0.6508  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(1,1,2) - CSS
model112css = arima(diff.egg_data2, order = c(1,1,2), method = 'CSS')
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1
coeftest(model112css)
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value  Pr(>|z|)    
## ar1  0.329019   0.010368  31.7337 < 2.2e-16 ***
## ma1 -2.494136   0.107804 -23.1359 < 2.2e-16 ***
## ma2  1.189979   0.159874   7.4432 9.825e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(1,1,2) - ML
model112ml = arima(diff.egg_data2, order = c(1,1,2), method = 'ML')
coeftest(model112ml)
## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.022041   0.984412  0.0224   0.9821
## ma1 -0.904711   0.996659 -0.9077   0.3640
## ma2 -0.095271   0.937432 -0.1016   0.9191

Residual Analysis

residual.analysis <- function(model, std = TRUE){
  library(TSA)
  library(FitAR)
  if (std == TRUE){
    res.model = rstandard(model)
  }else{
    res.model = residuals(model)
  }
  par(mfrow=c(3,2))
  plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
  abline(h=0)
  hist(res.model,main="Histogram of standardised residuals")
  qqnorm(res.model,main="QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  acf(res.model,main="ACF of standardised residuals")
  print(shapiro.test(res.model))
  k=0
  LBQPlot(res.model, lag.max = length(model$residuals)-1 , StartLag = k + 1, k = 0, SquaredQ = FALSE)
  par(mfrow=c(1,1))
}



residual.analysis(model = model011css)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9093, p-value = 0.1321
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model011ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.95773, p-value = 0.653
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model110css)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.96159, p-value = 0.7199
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model110ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.95896, p-value = 0.6743
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model111css)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.95293, p-value = 0.5717
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model111ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.94804, p-value = 0.4941
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model013css)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.89171, p-value = 0.07118
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model013ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.94279, p-value = 0.4188
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model313css)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.93626, p-value = 0.3377
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model313ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.95724, p-value = 0.6445
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model015css)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.90275, p-value = 0.1049
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model015ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.96794, p-value = 0.8265
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model012css)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.92679, p-value = 0.2442
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model012ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.94768, p-value = 0.4887
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model112css)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9373, p-value = 0.3497
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model112ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9475, p-value = 0.486
## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

sort.score <- function(x, score = c("bic", "aic")){
  if (score == "aic"){
    x[with(x, order(AIC)),]
  } else if (score == "bic") {
    x[with(x, order(BIC)),]
  } else {
    warning('score = "x" only accepts valid arguments ("aic","bic")')
  }
}


sort.score(AIC(model110ml, model011ml, model111ml, model013ml, model313ml, model015ml, model012ml, model112ml), score = "aic")
##            df      AIC
## model011ml  2 23.12446
## model012ml  3 24.93160
## model111ml  3 24.94140
## model013ml  4 26.92313
## model112ml  4 26.93109
## model110ml  2 27.05798
## model015ml  6 27.39690
## model313ml  7 29.34726
sort.score(BIC(model110ml, model011ml, model111ml, model013ml, model313ml, model015ml, model012ml, model112ml), score = "bic")
##            df      BIC
## model011ml  2 24.40258
## model012ml  3 26.84878
## model111ml  3 26.85857
## model110ml  2 28.33609
## model013ml  4 29.47936
## model112ml  4 29.48732
## model015ml  6 31.23124
## model313ml  7 33.82067

Predicting and forecasting values

predict(model011ml,n.ahead = 5,newxreg = NULL,se.fit=TRUE)
## $pred
## Time Series:
## Start = 1997 
## End = 2001 
## Frequency = 1 
## [1] 0.1148626 0.1148626 0.1148626 0.1148626 0.1148626
## 
## $se
## Time Series:
## Start = 1997 
## End = 2001 
## Frequency = 1 
## [1] 0.4491633 0.4491633 0.4491633 0.4491633 0.4491633
fit=Arima(egg_data2,c(0,1,1))
plot(forecast(fit,h=5))

fit
## Series: egg_data2 
## ARIMA(0,1,1) 
## 
## Coefficients:
##          ma1
##       0.0304
## s.e.  0.2545
## 
## sigma^2 estimated as 0.1885:  log likelihood=-8.25
## AIC=20.51   AICc=21.51   BIC=21.92

Conclusion

We analysed the egg depositions of Lake Huron Bloasters by using various analysis methods. We found the best fitting model for the data to be ARIMA(0,1,1). We started with understanding the properties of the data, converting it into a time series, checking the correlation, transforming it and differencing it once before modelling. After each step the adf test was conducted to make sure that the time series is stationary in order to fit the best model. Certain models were identified through EACF, AIC and BIC functions. These models were tested through the coefficient test and then the residuals were checked for normality as well. ARIMA(0,1,1) had the lowest values for AIC and BIC and also had normality behavious during the residual analysis. Further, predictions were made using the ARIMA(0,1,1) model for the next 5 years for the Egg Depositions.