Introduction

Coregonus hoyi, often known as the bloater, is in the family Salmonidae and a species of freshwater whitefish . It is silver in colour, herring-like fish, 25.5 centimetres long. It is found in the Great Lakes and also in Lake Nipigon, inhabits underwater slopes.

Methodology

The task is to analyze the egg depositions of Lake Huron Bloasters by using different analysis methods. The final goal is to choose the best model from different set of models and give relevant forecasts of egg depositions for the next 5 years.

Setup

We load the following packages

library(TSA)

## Warning: package 'TSA' was built under R version 3.5.3

library(tseries)
library(lmtest)
library(dLagM)

## Warning: package 'dLagM' was built under R version 3.5.3

## Warning: package 'nardl' was built under R version 3.5.3

## Warning: package 'dynlm' was built under R version 3.5.3

library(fUnitRoots)

## Warning: package 'fUnitRoots' was built under R version 3.5.3

## Warning: package 'timeSeries' was built under R version 3.5.3

## Warning: package 'fBasics' was built under R version 3.5.3

library(FitAR)

## Warning: package 'FitAR' was built under R version 3.5.3

## Warning: package 'leaps' was built under R version 3.5.3

## Warning: package 'ltsa' was built under R version 3.5.2

## Warning: package 'bestglm' was built under R version 3.5.3

library(forecast)

Read Data

Below, we are importing the data set using base R read.csv() function, displaying the first few observations of the dataset using head() function.
We then check the class which comes out to be data frame, we further need to convert the data frame to a time series data.
The data contains two columns year and eggs(in millions).

egg_data1 <- read.csv("eggs.csv", header = TRUE)
head(egg_data1)

##   year   eggs
## 1 1981 0.0402
## 2 1982 0.0602
## 3 1983 0.1205
## 4 1984 0.1807
## 5 1985 0.7229
## 6 1986 0.5321

class(egg_data1)

## [1] "data.frame"

Analyzing the data

In the below code, we are converting the data to a time series data to perform the analysis. A random walk graph is generated using the plot function and the following points are observed:

  - Trend : We see a upward trend.
  - Behaviour : The series appears to be auto regressive due to multiple succeeding points.
  - Seasonality : We see no seasonality in this graph.
  - Change in Variance: Change in variance is found.
  - Change in point: Year 1990 shows there is a sudden jump in the Deposition of Eggs

egg_data2 <-ts(as.vector(egg_data1$eggs), start= 1981, end=1996)

class(egg_data2)

## [1] "ts"

plot(egg_data2,type='o',ylab='Random Walk')

The acf and pacf are returning one significant lag.

acf(egg_data2)

pacf(egg_data2)

The null hypothesis states that the series is non stationary whereas the alternate hypothesis states that the series is stationary. The adf test shows that the p-value is 0.5469 which is greater than 0.05, hence we cannot reject the null hypothesis and the series is non stationary.

adf.test(egg_data2)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  egg_data2
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary

In the scatterplot, we observe a positive upward trend. This shows that there is a strong correlation between egg depositions in an year as compared to the previous year.
We then find the correlation to be around 74% which justifies the upward trend of the scatterplot.

plot(y= egg_data2, x = zlag(egg_data2), ylab = 'Egg Depositions', xlab = 'Previous Year Egg Depositions')

y = egg_data2
x = zlag(egg_data2)
index = 2:length(x)
cor(y[index],x[index])

## [1] 0.7445657

Transformed value

We transform the data and plot it to understand the difference after transformation.
The lambda value comes out to be 0.45 and the new plot is not much different from the non transformed one.
From the adf test, the null hypothesis states that the series is non stationary whereas the alternate hypothesis states that the series is stationary. The p-value is 0.6955 which is greater than 0.05, hence we cannot reject the null hypothesis and the series is still non stationary.

data.transform = BoxCox.ar(egg_data2, method = "yule-walker")

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

data.transform$ci

## [1] 0.1 0.8

lambda = 0.45 # The mid point of the interval 
data.bc = ((egg_data2^lambda)-1)/lambda
plot(data.bc, type = "o", ylab = "Egg Depositions (in millions)", main = "Transformed - Egg depositions between 1981 and 1996")

adf.test(data.bc)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  data.bc
## Dickey-Fuller = -1.6769, Lag order = 2, p-value = 0.6955
## alternative hypothesis: stationary

We calculate the first difference of the transformed series and then determine if the series becomes stationary or not.
We see a change in the plot and need to check the stationarity.

#Let's calculate the first difference and plot the first differenced series
diff.egg_data2 = diff(data.bc)
plot(diff.egg_data2, type = "o")

The adf test for the differenced series is done. The null hypothesis states that the series is non stationary whereas the alternate hypothesis states that the series is stationary. The p-value is 0.0443 which is less than 0.05, hence we can reject the null hypothesis and the series is stationary now.
Also the acf and pacf do not show any significant lags.
We have a stationary series now and we can proceed ahead with fitting the models.

adf.test(diff.egg_data2)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff.egg_data2
## Dickey-Fuller = -3.6798, Lag order = 2, p-value = 0.0443
## alternative hypothesis: stationary

acf(diff.egg_data2)

pacf(diff.egg_data2)

Model Fitting

Although the acf and pacf functions do not show any white noise
We confirm existence of white noise by seeing that both p and q are equal to 0.
ARIMA(0,1,0), ARIMA(0,1,1), ARIMA(1,1,0) could be the models to test and further process. We limit the ar and ma values at 3 because we see an error after that.

eacf(diff.egg_data2, ar.max = 3, ma.max = 3)

## AR/MA
##   0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o

We plot the BIC table for different ar and ma values.
In the BIC table shaded columns correspond to AR(1), AR(2), AR(3), MA(1), MA(2), MA(3) and MA(5) coefficients.
Models {ARIMA(1,1,1), ARIMA(1,1,2), ARIMA(1,1,3),ARIMA(1,1,5), ARIMA(2,1,1), ARIMA(2,1,2),ARIMA(2,1,3), ARIMA(2,1,5), ARIMA(3,1,1), ARIMA(3,1,2), ARIMA(3,1,3),ARIMA(3,1,5), ARIMA(2,1,0), ARIMA(3,1,0),ARIMA(0,1,2), ARIMA(0,1,3),ARIMA(0,1,5), ARIMA(0,1,0), ARIMA(0,1,1), ARIMA(1,1,0)} are all the possible cases of models.

res = armasubsets(y = diff.egg_data2, nar = 3, nma = 6, y.name = 'test', ar.method = 'yw')

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 4 linear dependencies found

## Reordering variables and trying again:

plot(res)

We do the coefficient test on most of the cases but leave a few as they were large models.

# ARIMA(0,1,1) - CSS
model011css = arima(diff.egg_data2, order = c(0,1,1), method = 'CSS')
coeftest(model011css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ma1 -1.093338   0.056532  -19.34 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(0,1,1) - ML
model011ml = arima(diff.egg_data2, order = c(0,1,1), method = 'ML')
coeftest(model011ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ma1 -0.99994    0.66766 -1.4977   0.1342

# ARIMA(1,1,0) - CSS
model110css = arima(diff.egg_data2, order = c(1,1,0), method = 'CSS')
coeftest(model110css)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)  
## ar1 -0.40609    0.24460 -1.6602  0.09687 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(1,1,0) - ML
model110ml = arima(diff.egg_data2, order = c(1,1,0), method = 'ML')
coeftest(model110ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)
## ar1 -0.38056    0.23493 -1.6199   0.1053

# ARIMA (1,1,1) - CSS
model111css = arima(diff.egg_data2, order = c(1,1,1), method = 'CSS')
coeftest(model111css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.030947   0.290460  0.1065    0.9152    
## ma1 -1.021947   0.201926 -5.0610 4.171e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA (1,1,1) - ML
model111ml = arima(diff.egg_data2, order = c(1,1,1), method = 'ML')
coeftest(model111ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)   
## ar1  0.11474    0.26992  0.4251 0.670764   
## ma1 -1.00000    0.33205 -3.0116 0.002599 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(0,1,3) - CSS
model013css = arima(diff.egg_data2, order = c(0,1,3), method = 'CSS')
coeftest(model013css)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -1.10117    0.29725 -3.7045 0.0002118 ***
## ma2 -0.11909    0.33351 -0.3571 0.7210194    
## ma3  0.14673    0.34631  0.4237 0.6717891    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(0,1,3) - ML
model013ml = arima(diff.egg_data2, order = c(0,1,3), method = 'ML')
coeftest(model013ml)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ma1 -0.852252   0.536615 -1.5882   0.1122
## ma2 -0.102618   0.286176 -0.3586   0.7199
## ma3 -0.045118   0.512224 -0.0881   0.9298

# ARIMA(3,1,3) - CSS
model313css = arima(diff.egg_data2, order = c(3,1,3), method = 'CSS')

## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1

coeftest(model313css)

## Warning in sqrt(diag(se)): NaNs produced

## 
## z test of coefficients:
## 
##       Estimate Std. Error   z value Pr(>|z|)    
## ar1 -1.2016973  0.0067545 -177.9105  < 2e-16 ***
## ar2 -0.8077002  0.0231155  -34.9419  < 2e-16 ***
## ar3 -0.1124583  0.0998924   -1.1258  0.26025    
## ma1  1.9195706         NA        NA       NA    
## ma2  1.9992451         NA        NA       NA    
## ma3 -1.3935177  0.6294151   -2.2140  0.02683 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(3,1,3) - ML
model313ml = arima(diff.egg_data2, order = c(3,1,3), method = 'ML')

## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1

coeftest(model313ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)   
## ar1  0.35427    0.42161  0.8403 0.400749   
## ar2 -0.67968    0.26271 -2.5872 0.009675 **
## ar3 -0.41455    0.32151 -1.2894 0.197269   
## ma1 -1.31542    0.60118 -2.1881 0.028666 * 
## ma2  1.42062    1.09000  1.3033 0.192465   
## ma3 -0.59339    0.60515 -0.9806 0.326809   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(0,1,5) - CSS
model015css = arima(diff.egg_data2, order = c(0,1,5), method = 'CSS')
coeftest(model015css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)   
## ma1 -1.016594   0.375033 -2.7107 0.006715 **
## ma2  0.013365   0.364525  0.0367 0.970752   
## ma3 -0.753976   0.239229 -3.1517 0.001623 **
## ma4  1.192805   0.411640  2.8977 0.003759 **
## ma5 -0.538537   0.506024 -1.0643 0.287215   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(0,1,5) - ML
model015ml = arima(diff.egg_data2, order = c(0,1,5), method = 'ML')
coeftest(model015ml)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ma1 -0.824906   0.739712 -1.1152   0.2648
## ma2  0.038341   0.314834  0.1218   0.9031
## ma3 -0.732093   0.840193 -0.8713   0.3836
## ma4  1.065250   0.965775  1.1030   0.2700
## ma5 -0.089888   0.354698 -0.2534   0.7999

# ARIMA (0,1,2) - CSS
model012css = arima(diff.egg_data2, order = c(0,1,2), method = 'CSS')
coeftest(model012css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ma1 -1.017026   0.274754 -3.7016 0.0002143 ***
## ma2 -0.087178   0.308221 -0.2828 0.7772974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA (0,1,2) - ML
model012ml = arima(diff.egg_data2, order = c(0,1,2), method = 'ML')
coeftest(model012ml)

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)  
## ma1 -0.88465    0.42637 -2.0748   0.0380 *
## ma2 -0.11535    0.25484 -0.4527   0.6508  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(1,1,2) - CSS
model112css = arima(diff.egg_data2, order = c(1,1,2), method = 'CSS')

## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg =
## xreg, : possible convergence problem: optim gave code = 1

coeftest(model112css)

## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value  Pr(>|z|)    
## ar1  0.329019   0.010368  31.7337 < 2.2e-16 ***
## ma1 -2.494136   0.107804 -23.1359 < 2.2e-16 ***
## ma2  1.189979   0.159874   7.4432 9.825e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# ARIMA(1,1,2) - ML
model112ml = arima(diff.egg_data2, order = c(1,1,2), method = 'ML')
coeftest(model112ml)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value Pr(>|z|)
## ar1  0.022041   0.984412  0.0224   0.9821
## ma1 -0.904711   0.996659 -0.9077   0.3640
## ma2 -0.095271   0.937432 -0.1016   0.9191

Residual Analysis

Residual analysis does show that model_011_css, model_011_ml, model_110_css appear to be normal as compared to others.Looking at p- values we can consider those where p- value is greater than 0.05 for normality cases.

residual.analysis <- function(model, std = TRUE){
  library(TSA)
  library(FitAR)
  if (std == TRUE){
    res.model = rstandard(model)
  }else{
    res.model = residuals(model)
  }
  par(mfrow=c(3,2))
  plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
  abline(h=0)
  hist(res.model,main="Histogram of standardised residuals")
  qqnorm(res.model,main="QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  acf(res.model,main="ACF of standardised residuals")
  print(shapiro.test(res.model))
  k=0
  LBQPlot(res.model, lag.max = length(model$residuals)-1 , StartLag = k + 1, k = 0, SquaredQ = FALSE)
  par(mfrow=c(1,1))
}



residual.analysis(model = model011css)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9093, p-value = 0.1321

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model011ml)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.95773, p-value = 0.653

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model110css)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.96159, p-value = 0.7199

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model110ml)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.95896, p-value = 0.6743

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model111css)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.95293, p-value = 0.5717

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model111ml)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.94804, p-value = 0.4941

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model013css)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.89171, p-value = 0.07118

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model013ml)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.94279, p-value = 0.4188

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model313css)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.93626, p-value = 0.3377

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model313ml)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.95724, p-value = 0.6445

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model015css)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.90275, p-value = 0.1049

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model015ml)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.96794, p-value = 0.8265

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model012css)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.92679, p-value = 0.2442

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model012ml)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.94768, p-value = 0.4887

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model112css)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9373, p-value = 0.3497

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

residual.analysis(model = model112ml)

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.9475, p-value = 0.486

## Warning in (ra^2)/(n - (1:lag.max)): longer object length is not a multiple
## of shorter object length

From the AIC and BIC values we see that ARIMA(0,1,1) is the best model.

sort.score <- function(x, score = c("bic", "aic")){
  if (score == "aic"){
    x[with(x, order(AIC)),]
  } else if (score == "bic") {
    x[with(x, order(BIC)),]
  } else {
    warning('score = "x" only accepts valid arguments ("aic","bic")')
  }
}


sort.score(AIC(model110ml, model011ml, model111ml, model013ml, model313ml, model015ml, model012ml, model112ml), score = "aic")

##            df      AIC
## model011ml  2 23.12446
## model012ml  3 24.93160
## model111ml  3 24.94140
## model013ml  4 26.92313
## model112ml  4 26.93109
## model110ml  2 27.05798
## model015ml  6 27.39690
## model313ml  7 29.34726

sort.score(BIC(model110ml, model011ml, model111ml, model013ml, model313ml, model015ml, model012ml, model112ml), score = "bic")

##            df      BIC
## model011ml  2 24.40258
## model012ml  3 26.84878
## model111ml  3 26.85857
## model110ml  2 28.33609
## model013ml  4 29.47936
## model112ml  4 29.48732
## model015ml  6 31.23124
## model313ml  7 33.82067

Predicting and forecasting values

We use the predict function to predict values of Egg Depositions (in millions) over the next 5 years.

predict(model011ml,n.ahead = 5,newxreg = NULL,se.fit=TRUE)

## $pred
## Time Series:
## Start = 1997 
## End = 2001 
## Frequency = 1 
## [1] 0.1148626 0.1148626 0.1148626 0.1148626 0.1148626
## 
## $se
## Time Series:
## Start = 1997 
## End = 2001 
## Frequency = 1 
## [1] 0.4491633 0.4491633 0.4491633 0.4491633 0.4491633

fit=Arima(egg_data2,c(0,1,1))
plot(forecast(fit,h=5))

fit

## Series: egg_data2 
## ARIMA(0,1,1) 
## 
## Coefficients:
##          ma1
##       0.0304
## s.e.  0.2545
## 
## sigma^2 estimated as 0.1885:  log likelihood=-8.25
## AIC=20.51   AICc=21.51   BIC=21.92

Conclusion

We analysed the egg depositions of Lake Huron Bloasters by using various analysis methods. We found the best fitting model for the data to be ARIMA(0,1,1). We started with understanding the properties of the data, converting it into a time series, checking the correlation, transforming it and differencing it once before modelling. After each step the adf test was conducted to make sure that the time series is stationary in order to fit the best model. Certain models were identified through EACF, AIC and BIC functions. These models were tested through the coefficient test and then the residuals were checked for normality as well. ARIMA(0,1,1) had the lowest values for AIC and BIC and also had normality behavious during the residual analysis. Further, predictions were made using the ARIMA(0,1,1) model for the next 5 years for the Egg Depositions.

MATH1318 Semester 1, 2019

Assignment 2: Forecasts of Egg Depositions for the next 5 years

Ria Talwar, student ID - s3729618