Introduction

The dataset we will analyse in this assignment represents egg depositions (in millions) of age between years 1981 and 1996.

Data Set

The eggs series series is given by eggs.csv which contains of 2 variables and 16 obsevations.

Data Pre-processing

Preliminaries

In this project, we used the following R packages.

library(TSA)
library(fUnitRoots)
library(lmtest)
library(TSA)
library(fUnitRoots)
library(lmtest)
library(tseries)
library(FitAR)
library(forecast)

Data Cleaning and Transformation

We read the egg depositions (in millions) of age-3 datasets between from 1981 to 1996. We would later convert the data frame to time series data, we purposely to preparing the data for visulisation and looking for the pattern or trend.

eggs <- read.csv("C:/RMIT/MATH1318 TimeSeriesAnalysis/Assignment2/eggs.csv",header=TRUE)

head(eggs)
##   year   eggs
## 1 1981 0.0402
## 2 1982 0.0602
## 3 1983 0.1205
## 4 1984 0.1807
## 5 1985 0.7229
## 6 1986 0.5321
class(eggs)
## [1] "data.frame"
eggs.ts <- ts(eggs$eggs,start = 1981,frequency=1)
class(eggs.ts)
## [1] "ts"
eggs.ts.raw = eggs.ts
par(mfrow=c(1,2))
plot(eggs.ts,type='o',ylab='Egg depositions of age-3 ', main= "Time series plot for egg depositions")

plot(y=eggs.ts,x=zlag(eggs.ts),ylab='Egg depositions of age-3', xlab='Previous Year the egg deposition', main= "Scatter plot of egg depositions")

Figure 1 depicts a series in which there is an obvious upward trend over time: This time series display trend, the relation between plots, auto regressive behaviour. It appears that there is a drastic shift in the series, that slowly decays and eventually returns to previous levels. This may indicate an intervention model. It seems that succeeding measurements related to one anothers as the values that are neighbors in time tend to be similar in size. This is more clearly seen in the following scatter plot of neighboring pair:

In the scatter plot we see an upward trend in the plot-low values tend to be followed by low values in the next year, middle-sized values by middle-sized values, and high values by high values, which apparent strong relation. The amount of correlation between neighboring the egg depositions values is 0.75.

y = eggs.ts             # Read the eggs data into y
x = zlag(eggs.ts)       # Generate first lag of The eggs data
index = 2:length(x)      # Create an index to get rid of the first NA value in x
cor(y[index],x[index])   # Calculate correlation between numerical values in x and y
## [1] 0.7445657

Regression Approach

Analysis of trend

Now, we try to fit the regression line to the egg depositions data.

model.eggs.ln = lm(eggs.ts~time(eggs.ts)) # label the linear trend model as model.Ozone.ln
summary(model.eggs.ln)
## 
## Call:
## lm(formula = eggs.ts ~ time(eggs.ts))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4048 -0.2768 -0.1933  0.2536  1.1857 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -165.98275   49.58836  -3.347  0.00479 **
## time(eggs.ts)    0.08387    0.02494   3.363  0.00464 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4598 on 14 degrees of freedom
## Multiple R-squared:  0.4469, Adjusted R-squared:  0.4074 
## F-statistic: 11.31 on 1 and 14 DF,  p-value: 0.004642
t = time(eggs.ts)
t2 = t^2
model.eggs.qa = lm(eggs.ts~ t + t2) # label the quadratic trend model as model1
summary(model.eggs.qa)
## 
## Call:
## lm(formula = eggs.ts ~ t + t2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50896 -0.25523 -0.02701  0.16615  0.96322 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -4.647e+04  2.141e+04  -2.170   0.0491 *
## t            4.665e+01  2.153e+01   2.166   0.0494 *
## t2          -1.171e-02  5.415e-03  -2.163   0.0498 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4092 on 13 degrees of freedom
## Multiple R-squared:  0.5932, Adjusted R-squared:  0.5306 
## F-statistic: 9.479 on 2 and 13 DF,  p-value: 0.00289

Estimates of slope and intercept are Beta 1 = 0.08387 and Beta 0 = -16598275, respectively. The slope is statistically significant at 5% significance level, R-squared = 0.4469. The trend line is plotted over the time series in the following plot:

Estimates of slope and intercept are Beta 2 = -1.171e-02,Beta 1 = 4.665e+01 and Beta 0 = -4.647e+04, respectively. The slope is statistically significant at 5% significance level, R-squared = 0.5932. The fitted quadratic trend is shown below:

par(mfrow=c(1,2))
plot(ts(fitted(model.eggs.ln)), ylim = c(min(c(fitted(model.eggs.ln),as.vector(eggs.ts))),max(c(fitted(model.eggs.ln),as.vector(eggs.ts)))),ylab='y' ,main = "Fitted linear model", type="l",lty=2,col="red")
lines(as.vector(eggs.ts),type="o")

plot(ts(fitted(model.eggs.qa)), ylim = c(min(c(fitted(model.eggs.qa),as.vector(eggs.ts))),max(c(fitted(model.eggs.qa),as.vector(eggs.ts)))),ylab='y' ,main = "Fitted quadratic model", type="l",lty=2,col="red")
lines(as.vector(eggs.ts),type="o")

Linear model is seem does not work well, it does not capture the trend and it does not capture all the point and auto correlation and quadratic model is seem to be better, even does not capture all point.

Diagnostic check

We will do diagnostic check by observed the residual standard deviation and coefficeient of determination. To compare between linaer and quadratic model, the result shown the goodness of fit of the trend R-squared of quadratic model about 59% higher than linear model about 45% and the values of R-squared close to 0 that implies a not satisfactory fit. The residual standard error as 0.4092 lower than 0.4598.

Residual Analysis

par(mfrow=c(1,2))
res.model.eggs.ln = rstudent(model.eggs.ln)
plot(y = res.model.eggs.ln, x = as.vector(time(eggs.ts)),xlab = 'Time', ylab='Standardized Residuals',main = "standardized residuals linear",type='o',col="blue")
abline(h=0)

res.model.eggs.qa = rstudent(model.eggs.qa)
plot(y = res.model.eggs.qa, x = as.vector(time(eggs.ts)),xlab = 'Time', ylab='Standardized Residuals',main = "standardized residuals quadratic",type='o',col="blue")
abline(h=0)

We can check the normality of residuals with a histrogram. The following display a frequency histrogram of the standardised residuals from the both model.

Histogram of residuals

par(mfrow=c(1,2))
hist(rstudent(model.eggs.ln),xlab='Standardized Residuals',main = "Histrogram of Linear model")
hist(rstudent(model.eggs.qa),xlab='Standardized Residuals',main = "Histrogram of Quadratic model")

The distributes of linear model appear right-skew while The distributes of quadratic is somewhat symmetric and tail off at both the high and low ends as normal dirtrbution does.

Normal Q-Q Plot

The follwing plot shows the QQ normal scores plot for standardised residuals from the both model for the thickness of Ozone layer series. The straight-line pattern supports the assumption of a normal distributed stochastic component in this model.

par(mfrow=c(1,2))
y = rstudent(model.eggs.ln)
qqnorm(y,main = "Normal Q-Q Plot (Linear model)")
qqline(y, col = 2, lwd = 1, lty = 2)

y = rstudent(model.eggs.qa)
qqnorm(y,main ="Normal Q-Q Plot (Quadratic model)")
qqline(y, col = 2, lwd = 1, lty = 2)

The Q-Q plot for linear model normality assumption does not hold for the egg depositions of age-3, In addition we can check normality assumption of the stocahstic component by various hypothesis test and one of these test is Shapiro-Wilk test that calculates the correlation between the residuals confirm this inference with a p-value less than 0.05.

The Q-Q plot for quadratic model can considered as a normal distribution, In addition we can check normality assumption of the stocahstic component by various hypothesis test and one of these test is Shapiro-Wilk test that calculates the correlation between the residuals and the corresponding normal quantiles.

Shapiro-Wilk normality test

Linear Model

y = rstudent(model.eggs.ln)
shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.7726, p-value = 0.001205

Qudratic Model

y = rstudent(model.eggs.qa)
shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.87948, p-value = 0.03809

The result we get the p-value of 0.001205 for linear model that we reject the null hypothesis and p-value of 0.03809 for quadratic model. So we conclude not to reject the null hypothesis that the stochastic component of quadratic model are normally distributed.

ACF of standardized residuals

par(mfrow=c(1,2))
acf(rstudent(model.eggs.ln))
acf(rstudent(model.eggs.qa))

Both ACF plot confirm the smoothness of time series plot as we have correlation values higher than the confidence bound for one lags. The models has not captured everthing from time series, therefore regression model is not that good, it can fit well sometime but we should be careful on it.

Stationarity through differenceing

plot(eggs.ts,type='o',ylab='Egg depositions of age-3 ', main= "Time series plot for egg depositions (in millions) of age-3")

In the time series plot there is trend, no seasonality and no changing variance and succeeding observations imply the existence of autoregressive behavior.

par(mfrow=c(1,2)) # To plot both ACF and PACF in the same panel of plots
acf(eggs.ts)
pacf(eggs.ts)

Slowly decaying pattern in ACF and very high first correlation in PACF implies the existence of trend and nonstationarity.

adf.test(eggs.ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  eggs.ts
## Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
## alternative hypothesis: stationary

With a p-value of 0.5469, we cannot reject the null hypothesis stating that the series is non-stationary.

par(mfrow=c(1,2))
plot(eggs.ts,type='o',ylab='egg depositions (millions)')
qqnorm(eggs.ts)
qqline(eggs.ts, col = 2)

shapiro.test(eggs.ts)
## 
##  Shapiro-Wilk normality test
## 
## data:  eggs.ts
## W = 0.94201, p-value = 0.3744

With a p-value of 0.3744, we cannot reject the null hypothesis stating that the series is non-stationary.

Transformations

BoxCox Transformation

Let's first apply the box-Cox transformation.

eggs.ts.transform$ci
## [1] 0.1 0.8

Mid-point of interval is 0.35. So we will take lamda = 0.35.

eggs.ts = log(eggs.ts)
plot(eggs.ts,type='o',ylab='Time series plot of log of egg depositions (millions)')

diff.eggs.ts = diff(eggs.ts, differences = 1) # Be careful I'm using log-transformed series
plot(diff.eggs.ts,type='o',ylab='QThe first difference of egg depositions')

order = ar(diff(diff.eggs.ts))$order
adfTest(diff.eggs.ts, lags = order,  title = NULL,description = NULL)
## 
## Title:
##  Augmented Dickey-Fuller Test
## 
## Test Results:
##   PARAMETER:
##     Lag Order: 4
##   STATISTIC:
##     Dickey-Fuller: -1.2696
##   P VALUE:
##     0.2049 
## 
## Description:
##  Fri Sep 18 23:56:29 2020 by user: Napat

With a p-value of 0.2049, we fail reject the null hypothesis stating that the series is non-stationary; hence, we conclude that the first differencing does not change the series staionary.

diff.eggs.ts = diff(eggs.ts, differences = 2) # Be careful I'm using log-transformed series
plot(diff.eggs.ts,type='o',ylab='QThe second difference of egg depositions')

order = ar(diff(diff.eggs.ts))$order
adfTest(diff.eggs.ts, lags = order,  title = NULL,description = NULL)
## 
## Title:
##  Augmented Dickey-Fuller Test
## 
## Test Results:
##   PARAMETER:
##     Lag Order: 1
##   STATISTIC:
##     Dickey-Fuller: -3.167
##   P VALUE:
##     0.01 
## 
## Description:
##  Fri Sep 18 23:56:29 2020 by user: Napat

With a p-value of 0.01, we reject the null hypothesis stating that the series is stationary.

par(mfrow=c(1,2))
acf(diff.eggs.ts)
pacf(diff.eggs.ts)

eacf(diff.eggs.ts,ar.max = 3, ma.max = 3)
## AR/MA
##   0 1 2 3
## 0 o o o o
## 1 o o o o
## 2 o o o o
## 3 o o o o
res = armasubsets(y=diff.eggs.ts,nar=3,nma=3,y.name='test',ar.method='ols')
plot(res)

The final set of possible models is {ARIMA(2,2,1), ARIMA(0,2,1), ARIMA(1,2,1)}

In the BIC table shaded columns correspond to AR(2) coefficients and MA(1) effect effect supported by BICs. So, from here we can include ARIMA(2,2,1), ARIMA(0,2,1), ARIMA(1,2,1) models. In conclusion the set of candidate models is {{ARIMA(2,2,1), ARIMA(0,2,1), ARIMA(1,2,1);lambda = 0.35}}.

# ARIMA(0,2,1)
model_021_css = arima(diff.eggs.ts,order=c(0,2,1),method='CSS')
coeftest(model_021_css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.94420    0.11004 -8.5804 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_021_ml = arima(diff.eggs.ts,order=c(0,2,1),method='ML')
coeftest(model_021_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.99992    0.20623 -4.8486 1.243e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(2,2,1)
model_221_css = arima(diff.eggs.ts,order=c(2,2,1),method='CSS')
coeftest(model_021_css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ma1 -0.94420    0.11004 -8.5804 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_221_ml = arima(diff.eggs.ts,order=c(2,2,1),method='ML')
coeftest(model_221_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.74193    0.28366 -2.6156  0.008907 ** 
## ar2 -0.21240    0.28732 -0.7393  0.459754    
## ma1 -1.00000    0.25273 -3.9568 7.597e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ARIMA(1,2,1)
model_121_css = arima(diff.eggs.ts,order=c(1,2,1),method='CSS')
coeftest(model_121_css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.71989    0.23763 -3.0295   0.00245 ** 
## ma1 -0.83390    0.14380 -5.7989 6.675e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_121_ml = arima(diff.eggs.ts,order=c(1,2,1),method='ML')
coeftest(model_121_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.59166    0.20862 -2.8361  0.004567 ** 
## ma1 -1.00000    0.23171 -4.3157 1.591e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sort.score <- function(x, score = c("bic", "aic")){
  if (score == "aic"){
    x[with(x, order(AIC)),]
  } else if (score == "bic") {
    x[with(x, order(BIC)),]
  } else {
    warning('score = "x" only accepts valid arguments ("aic","bic")')
  }
}
sort.score(AIC(model_221_ml,model_021_ml,model_121_ml), score = "aic")
##              df      AIC
## model_121_ml  3 42.90014
## model_221_ml  4 44.39491
## model_021_ml  2 46.29932
sort.score(BIC(model_221_ml,model_021_ml,model_121_ml), score = "bic" )
##              df      BIC
## model_121_ml  3 44.35486
## model_221_ml  4 46.33454
## model_021_ml  2 47.26914

The ARIMA(1,2,1) model is the best one according to both AIC and BIC I'll try over-fitting with ARIMA(1,2,2) models. In ARIMA(1,2,1) model AR(1) is insignificant according to MLE and significant according to CSS. However, this model was not promising in terms of AIC and BIC.

# ARIMA(1,2,2)
model_122_css = arima(diff.eggs.ts,order=c(1,2,2),method='CSS')
coeftest(model_122_css)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.80325    0.18796 -4.2736 1.923e-05 ***
## ma1 -0.49668    0.37057 -1.3403    0.1801    
## ma2 -0.36007    0.38463 -0.9362    0.3492    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_122_ml = arima(diff.eggs.ts,order=c(1,2,2),method='ML')
coeftest(model_122_ml)
## 
## z test of coefficients:
## 
##     Estimate Std. Error z value  Pr(>|z|)    
## ar1 -0.27549    0.27532 -1.0006  0.317018    
## ma1 -1.91744    0.36500 -5.2533 1.494e-07 ***
## ma2  0.99999    0.36335  2.7521  0.005921 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In ARIMA(1,2,2) model, AR(1) coefficient goes insignificant. So I'll stop at ARIM(1,2,1) model. We will go on with residual analysis of the models with sgnificant coefficients.

residual.analysis <- function(model, std = TRUE){
  library(TSA)
  library(FitAR)
  if (std == TRUE){
    res.model = rstandard(model)
  }else{
    res.model = residuals(model)
  }
  par(mfrow=c(3,2))
  plot(res.model,type='o',ylab='Standardised residuals', main="Time series plot of standardised residuals")
  abline(h=0)
  hist(res.model,main="Histogram of standardised residuals")
  qqnorm(res.model,main="QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  acf(res.model,main="ACF of standardised residuals")
  print(shapiro.test(res.model))
  k=0
  LBQPlot(res.model, lag.max = length(model$residuals)-1 , StartLag = k + 1, k = 0, SquaredQ = FALSE)
  par(mfrow=c(1,1))
}

residual.analysis(model = model_121_ml)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.96345, p-value = 0.7793

par(mfrow=c(1,1))

There is no problem with the diagnostic plots.

Now We applied log transformtion and second difference. To take them back:

log.data = log(eggs.ts.raw)
log.data.diff2.back = diffinv(diff.eggs.ts, differences = 2, xi = data.matrix(log.data[1:2]))
log.data.diff2.back = exp(log.data.diff2.back)

Forecasting

library(forecast)
# In the forecasts differencing will already been taken back! 
# We need to specify the lambda of Box-Cox transformation:
fit = Arima(eggs.ts.raw,c(1,2,1), lambda = 0.35) 
plot(forecast(fit,h=10)) 

In this study we found the best model for predict of egg depositions of age 3 is ARIM(1,2,1) model. In conclusion, statistically significant evidence was found to the result of analysis shown the prediction of egg depositions (in millions) of age 3 continue incresing in next five years.