1. Introduction

When an oxygen molecule is split into single atoms, these atoms combine with the nearby oxygen molecule to form an ozone. The oxygen molecules are split by the sun’s rays and most of the ozone is present in the Stratosphere. The thickness of this layer is measured in dobson unit and is it is very important for preventing the harmful radiations coming from the sun and the stars. Ozone helps in capturing the low wavelength radiations and also safeguards us from the harmful ultraviolet rays. The dataset given is the thickness of the ozone layer in dobson unit from 1927 to 2016. This data can be used for forecasting the ozone thickness for the upcoming years.

Aim

  • To analyze the given dataset and find the best fitting model using model-building strategy followed by a prediction for the next 5 years.
  • Propose a set of possible ARIMA(p, d, q) models using all suitable model specification tools.
  • To demonstrate the use of each model specification tool such as ACF-PACF, EACF, BIC table clearly with comments to back up choices of (p, d, q).

2. Method

3. Data Exploration and Manipulation

3.1. Dataset Description

The given dataset represents yearly changes in the ozone thickness from 1927 to 2016 in Dobson units. A negative value in the dataset represents a decrease in the thickness and a positive value represents an increase in the thickness The dataset contains 89 observations and 1 feature variable.

3.2. Importing packages

We will be using functions from the TSA package for modeling and predictions. The TSA package contains R functions and datasets from the book “Time Series Analysis with Applications in R”. The tseries package contains functions for performing unit root test which is handy in diagnostic checking and model specification.

library(TSA) 
## Warning: package 'TSA' was built under R version 4.0.5
## 
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
library(tseries)
## Warning: package 'tseries' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
defaultW <- getOption("warn") 
options(warn = -1) 

3.3. Reading the data

The working directory is set using the setwd() function and the data is read using R’s read.csv() function with header parameter set to be FALSE.

setwd("C:/Users/Lenovo/Documents/RMIT/Courses/Time Series/Assignment")
ozone <- read.csv("data1.csv", header = FALSE)
head(ozone)

3.4.Converting the data into time series

Information such as the structure of the dataset, data type etc, can be determined using the following functions:
structure -> str()
Data type -> class()

str(ozone)
## 'data.frame':    90 obs. of  1 variable:
##  $ V1: num  1.351 0.761 -1.269 -1.464 -0.979 ...
class(ozone)
## [1] "data.frame"
class(ozone$V1)
## [1] "numeric"

The output tells us that the dataset is a dataframe with 90 observations and 1 feature/variable. The feature V1 is a numeric datatype.
For better clarity, we can rename this variable to “Ozone Thickness”.

names(ozone)[1]<-paste("Ozone Thickness")
head(ozone)

For time series analysis, the expected datatype is a “Time Series”. Since, we do not have a time series datatype by default, we should convert the datatype into a time series one. Here, we are using the ts() function from the TSA package to convert the data into time series.

Syntax: ts(df$col, start=xx, end=yy, frequency=z)

The data is recorded yearly from 1927 till 2016 and thus the parameters start and end are 1927 and 2016 respectively. I want to start my analysis by working on the annual data and thus frequency is taken as 1.

ozoneTs <- ts(as.vector(ozone$`Ozone Thickness`), start = 1927, end = 2016, frequency = 1)
ozoneTs
## Time Series:
## Start = 1927 
## End = 2016 
## Frequency = 1 
##  [1]   1.35118436   0.76053242  -1.26855735  -1.46368717  -0.97920302
##  [6]   1.50856746   1.62991681   1.83242333  -0.83968364  -1.09611566
## [11]  -2.67457473  -2.78716606  -2.97317944  -0.23495548   0.97067000
## [16]   1.23652307   2.23062331   0.35671637  -2.12028099  -2.66812477
## [21]  -0.64702795   0.99608564   1.83817742   1.89922697  -0.55488121
## [26]  -1.40387419  -3.39178762   0.05777194   3.40717183   1.31488379
## [31]  -0.12882457  -2.51580137  -3.06205664  -3.33637179  -2.66332198
## [36]  -0.67958655  -2.11660422  -2.36318997  -5.36156537  -3.03103458
## [41]  -2.28838624  -1.06438684  -1.68813570  -3.16974819  -3.65647649
## [46]  -1.25151090  -1.08431732  -0.44863234  -0.17636387  -2.64954530
## [51]  -1.28317654  -4.29289634  -3.24282341  -3.60135297  -2.57288652
## [56]  -5.00586059  -6.68548244  -4.58870210  -4.32654629  -2.29370761
## [61]  -2.26456266  -2.27184846  -2.66440082  -3.79556478  -7.65843185
## [66] -10.17433972  -4.21230497  -2.82287161  -1.36776491  -4.43997062
## [71]  -3.78323838  -5.85304107  -8.55125744  -8.06501289  -7.75975806
## [76]  -6.65633206  -6.62708203  -7.83548356  -8.84424264  -7.67352209
## [81]  -7.05582939  -4.69497353  -7.12712128  -9.58954985 -10.19222042
## [86]  -9.33224686 -10.80567444 -10.86096923 -11.57941376  -9.30284452
time(ozoneTs)
## Time Series:
## Start = 1927 
## End = 2016 
## Frequency = 1 
##  [1] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
## [16] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
## [31] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
## [46] 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
## [61] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## [76] 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016

The output clearly shows us that the starting and ending points are 1927 and 2016 respectively with a frequency of 1. Visualizing the time points using the time() function also shows the range and frequency of time stamps.

4. Chapter 1: Analaysis of deterministic trend models for model fitting

4.1. Plotting the data

The time series data is plotted using the plot() function and is analyzed for the following 5 valid points:

  1. Trend
  2. Seasonality
  3. Change in Variance
  4. Intervention
  5. Behavior
plot(ozoneTs, type = "o", xlab="Year", ylab="Ozone Thickness",
     main="Change in the ozone thickness from 1927 to 2016")

Analysis of the 5 Valid points

  1. Trend: There is a clear downward trend in the time series plot.
  2. Seasonality: Overall we can see that there are no clear repeating pattern in the time series plot. This is because of the mid section of the plot from 1960 to 1980 giving us an understanding that there is no seasonality in the model. But however at the beginning of the plot there are signs of pattern which makes us think if this model can exhibit the characteristics of seasonality. At this point we cannot confirm about the seasonality of the time series. Thus, there’s a question mark for the seasonality characteristics of the series. 3. Change in Variance: We can say there is a slight change in variance. If you observe carefully there are small fluctuations between 1960 to 1970, but there are higher jumps from 1980 to 1990. This indicates change in variance in the data.
  3. Intervention Point: Clearly, there is no intervention point. Change point occurs when there is a significant/drastic change in the characteristics of the plot. Over here, there is no such behavior in the data indicating the absence of an intervention point.
  4. Behavior: This can be a mixture of Auto Regressive (AR) and Moving average (MA) behavior. Upward and downward flow / fluctuations denote the MA characteristics in the plot but there is also a random walk behavior denoting the presence of an AR characteristic. Thus, this can be estimated to be an ARMA model with dominant MA characteristics.

4.2. Impact of previous year(s) on the current year

The impact of previous year(s) on the current year can be estimated by using Pearson correlation coefficient. It is the correlation between present year and previous year(s) data and is calculated on dividing the covariance of the two variables by the product of their standard deviations. The result is normalized and is always between -1 and 1. Negative value explains the negative correlation between the two and positive value explains the positive correlation between the two.

Let us start by determining the pearson’s correlation between present year Y[t] and its previous year Y[t-1] followed by years Y[t-2], Y[t-3] etc. until we get a correlation value less than 0.60.

This pearson’s correlation is followed by a correlation plot which helps us visually understand the relationship between the two.

4.2.1. First Lag

PEARSON CORRELATION OF Y[t] and Y[t-1]

y = ozoneTs
x = zlag(ozoneTs)
index = 2:length(x)
cor(y[index], x[index])
## [1] 0.8700381

Inference:

  • The correlation of 0.823607 suggest that Y[t] and Y[t-1] are positively correlated. Thus, significant amount of information can be extracted from the past values (lag 1).
  • This correlation can be understood by plotting first lag along the x axis and the actual value along the y axis.

CORRELATION PLOT OF Y[t] and Y[t-1]

plot(y = ozoneTs, x = zlag(ozoneTs),ylab="Ozone Thickness",
     xlab="ozone thickness in the previous year", 
     main="Scatter plot to show the autocorrelation between Y[t] and Y[t-1]")

Inference:

  • As Y[t] increases, Y[t-1] also increases which explains the positive correlation between the two.
  • Thus, significant amount of information can be extracted using the past value (lag 1).

4.2.2. Second Lag

PEARSON CORRELATION OF Y[t] and Y[t-2]

y = ozoneTs
x = zlag(zlag(ozoneTs))
index = 3:length(x)
cor(y[index], x[index])
## [1] 0.7198518

Inference:

  • The correlation of 0.6567297 suggest that Y[t] and Y[t-2] are positively correlated but the correlation is not stronger.
  • This correlation can be understood by plotting second lag along the x axis and the actual value along the y axis.

CORRELATION PLOT OF Y[t] and Y[t-2]

plot(y = ozoneTs, x = zlag(zlag(ozoneTs)),ylab="Ozone Thickness",
     xlab="ozone thickness one year before the previous year", 
     main="Scatter plot to show the autocorrelation between Y[t] and Y[t-2]")

Inference:

  • There is still a positive correlation between Y[t] and Y[t-2]. Higher values of Y[t] has higher values of Y[t-2] explaining their correlation.
  • This however is not stronger compared to the first lag.

4.2.3. Third Lag

PEARSON CORRELATION OF Y[t] and Y[t-3]

y = ozoneTs
x = zlag(zlag(zlag(ozoneTs)))
index = 4:length(x)
cor(y[index], x[index])
## [1] 0.592762

Inference:

  • The correlation of 0.5368246 suggest that Y[t] and Y[t-3] are positively correlated but the correlation is not stronger.
  • This correlation can be understood by plotting second lag along the x axis and the actual value along the y axis.

CORRELATION PLOT OF Y[t] and Y[t-3]

plot(y = ozoneTs, x = zlag(zlag(ozoneTs)),ylab="Ozone Thickness",
     xlab="ozone thickness two years before the previous year", 
     main="Scatter plot to show the autocorrelation between Y[t] and Y[t-3]")

Inference:

  • There is a weak positive correlation between Y[t] and Y[t-3].
  • Not all higher values of Y[t] have higher values of Y[t-3] explaining their weak correlation.

4.3. Behaviour of the Time series model

From the time series plot it is clear that it is an ARMA model with both AR and MA characteristics. However, diagnostics checking is done using ACF and PACF to confirm the behavior of the model.

ACF -> Autocorrelation Function
PACF -> Partial Autocorrelation Function

4.3.1. ACF Plot

Auto correlation can be considered as a relationship between a variable’s current value and its past values. An autocorrelation of +1 explains positive correlation (an increase seen in one causes a proportionate increase in the other). An autocorrelation of -1 explains negative correlation (a decrease seen in one causes a proportionate decrease in the other).

acf(ozoneTs, main = "ACF plot of the Ozone thickness series")

Inference:
1. There is a clear decaying pattern in the ACF plot which confirms the presence of trend in the model.
2. Slight wave like pattern in the ACF plot says there can be slight seasonal behavior in the model.
3. There are multiple statistically significant lags explaining high order MA characteristics, but pattern denotes that the series is non-stationary.

4.3.2. PACF Plot

  • Partial autocorrelation is an autocorrelation which is conditional in nature.
  • This considers all the intermediate values between the two time series value.
pacf(ozoneTs, main = "PACF plot of the Ozone thickness series")

Inference:
1. Unlike ACF, PACF is less susceptible to explain a pattern.
2. Only one lag is statistically significant explaining the AR characteristics of the model.

4.4. Model Estimation

Deterministic Trend Models

We are still not sure about the characteristics of the model. Fitting a regression model is one way of estimation. From the ACF plot it is proved that the model follows a pattern and thus is non-stationary. Regression analysis can be used to model non constant mean trend.

Regression Trend Models:

  • Linear
  • Quadratic
  • Seasonal
  • Cosine

4.4.1. Linear trend Model

First, get the time stamps of the time series

t <- time(ozoneTs)
t
## Time Series:
## Start = 1927 
## End = 2016 
## Frequency = 1 
##  [1] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
## [16] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
## [31] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
## [46] 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
## [61] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## [76] 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016

4.4.1.1. Modeling

  • The regression model is determined by using the lm() function and the model is fitted using the abline() function.
model1 <- lm(ozoneTs ~ t)

plot(ozoneTs, type="o", ylab="y", main= "Fitted Linear curve to Ozone Thickness series")
abline(model1)

summary(model1)
## 
## Call:
## lm(formula = ozoneTs ~ t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7165 -1.6687  0.0275  1.4726  4.7940 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 213.720155  16.257158   13.15   <2e-16 ***
## t            -0.110029   0.008245  -13.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6655 
## F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16

4.4.1.2. Summary of the model

  1. The coefficient of t is -0.110029, this means there is a 0.110029 decrement for each time period.
  2. Pr(>|t|) for t is < 0.05, therefore there is significant information in t and this component can contibute for modeling.
  3. R-squared value for the model is 0.6655. Ideally we would like to have R-squared value between 0.75 and 0.85. Over here 0.6655 says that the model is not good enough but gives a decent fit.
  4. The overall p-value is less than 0.05 suggesting that this trend model is good for fitting and is capable of capturing a significant amount of information.

4.4.1.3. Residual Analysis of the linear trend model

  • Time series plot of standardized residuals: Ideally we want to see a random white noise series of the standardized residuals which suggests that the fitted trend model has captured majority of the observations.

  • Histogram Plot:
    Histogram gives us the frequency of standardized residuals. Ideally we require a symmetric distribution of standardized residuals to strongly support the trend fit.

  • Q-Q plot:
    Ideally we want to see all the dots stick to the reference line on the QQ plot of standardized residuals.

  • Shapiro-wilk Test:
    Shapiro wilk test is done to test the normality of standardized residuals.
    Ho: Data are normally distributed
    HA: Data are not normally distributed

    If p < 0.05, then we reject the null hypothesis and say that the data is not normal. Whereas, if p > 0.05 we fail to reject the null hypothesis and say that the data is normally distributed.

  • ACF & PACF Plot:
    We expect to see all bars in an ACF and PACF plot of standardized residuals under the horizontal dashed confidence limits. This explains that all the data has been captured and there is no information left in the residuals for capturing.

par(mfrow=c(3,2))

# TS plot
plot(y=rstudent(model1),x=as.vector(time(ozoneTs)), xlab='Time'
,
ylab='Standardized Residuals'
,type='o'
, main = "Time series plot of standardized resid
fitted to ozone thickness series.
")

# Histogram

hist(rstudent(model1),xlab='Standardized Residuals'
, main = "Histogram of standardised resi
fitted to the ozone thickness series.
")

# QQ plot
y= rstudent(model1)
qqnorm(y, main = "QQ plot of standardised residuals for the linear model
fitted to the ozone thickness series.
")
qqline(y, col = 2, lwd = 1, lty = 2)


# ACF
acf(rstudent(model1), main = "ACF of standardized residuals
")

# PACF
pacf(rstudent(model1), main = "PACF of standardized residuals
")

par(mfrow=c(1,1))

# Shapiro wilk test
y = rstudent(model1)
shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.98733, p-value = 0.5372

Summary of residual analysis:

  1. There is still slight trend in the time series plot with changing variance.
  2. Surely there is no hints of seasonality and intervention point.
  3. Follows a mixed AR and MA behavior. (ARMA)
  4. Thus, the time series plot of the standardized residuals does not support that the trend model captures majority of the observation.
  5. The histogram plot of standardized residuals almost shows a symmetric distribution. This is a right skewed distribution.
  6. Thus, histogram plot does not support that the trend model captures majority of the observation.
  7. Not all the dots stick to the reference line suggesting that the model is not strong enough to capture all the observations.
  8. p < 0.05 confirms that the data is not normally distributed.
  9. Only one lag is statistically significant in both ACF and PACG and this explains that the trend model captures information but is still not the best as there is a lag which is significant.

4.4.2. Quadratic Model

4.4.2.1. Modeling

  • Over here, t2 is calculated by squaring t which is used as a parameter for a quadratic model.
t2 <- t^2
model2 <- lm(ozoneTs ~ t + t2)

plot(ts(fitted(model2)), ylab='y'
, main = "Fitted Quadratic curve to Ozone thickness series.
"
,
ylim = c(min(c(fitted(model2), as.vector(ozoneTs))) ,
max(c(fitted(model2), as.vector(ozoneTs)))
) )
lines(as.vector(ozoneTs),type="o")

summary(model2)
## 
## Call:
## lm(formula = ozoneTs ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1062 -1.2846 -0.0055  1.3379  4.2325 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
## t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
## t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared:  0.7391, Adjusted R-squared:  0.7331 
## F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16

4.4.2.2. Summary of the model

  1. The coefficient of t is 5.924e+00, this means there is a 5.924e+00 increment for each time period (t).
  2. The coefficient of t2 is -1.530e-03, this means there is a 1.530e-03 decrement for each time period (t2).
  3. Pr(>|t|) for t and t2 is < 0.05, therefore there is significant information in t and t2.
  4. R-squared value for the model is 0.7331. Ideally we would like to have R-squared value between 0.75 and 0.85. Over here 0.7331 says that the model is good enough for fitting and is obviously better than the linear trend model.
  5. The overall p-value is less than 0.05 suggesting that this trend model is good for fitting and is capable of capturing a significant amount of information.

4.4.2.3. Residual Analysis of the quadratic trend model

  • lets perform residual analysis on the quadratic trend model.
par(mfrow=c(3,2))

# TS plot
plot(y=rstudent(model2),x=as.vector(time(ozoneTs)), xlab='Time'
,
ylab='Standardized Residuals'
,type='o'
, main = "Time series plot of standardized resid
fitted to ozone thickness series.
")

# Histogram
hist(rstudent(model2),xlab='Standardized Residuals'
, main = "Histogram of standardised resi
fitted to the ozone thickness series.
")

# QQ plot
y= rstudent(model2)
qqnorm(y, main = "QQ plot of standardised residuals for the Quadratic model
fitted to the ozone thickness series.
")
qqline(y, col = 2, lwd = 1, lty = 2)


# ACF
acf(rstudent(model2), main = "ACF of standardized residuals
")

# PACF
pacf(rstudent(model2), main = "PACF of standardized residuals
")

par(mfrow=c(1,1))

# Shapiro wilk test
y = rstudent(model2)
shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.98889, p-value = 0.6493

Summary of Residual Analysis of Quadratic trend

  1. This model looks much better than the previous one with reduced trend.
  2. There is very hints of seasonality and intervention point.
  3. Follows a mixed AR and MA behavior. (ARMA)
  4. Thus, the time series plot of the standardized residuals does support that the trend model captures majority of the observation but of course this is not the best fit. This is just better fit compared to the linear trend model.
  5. The histogram plot of standardized residuals almost shows a symmetric distribution. This is a right skewed distribution.
  6. Thus, histogram plot does not support that the trend model is a good fit. This is however better than the linear model.
  7. Almost all the dots stick to the reference line suggesting that the model is good enough for fitting but still not the best.
  8. Shapiro-wilk test gives a result of p = 0.008989 and p < 0.05 confirms that the data is not normally distributed.
  9. Only one lag is statistically significant in both ACF and PACF and this explains that the trend model captures information but is still not the best as there is a lag which is significant.
  10. overall, this trend model is better than linear trend model.

Cyclic Trend Models:

Initially we started with the annual record of the ozone thickness by assuming frequency to be 1. But while visualizing the time series plot we found that there can be slight seasonal behavior in the series, especially at the beginning. Thus, it is always a good practice to fit seasonal and cosine models for model estimation.

Here I am defining a separate object ozoneTs1 with that ozone data converted as a time series with frequency determined using an ACF plot of ozone dataframe.

acf(ozone)

From the ACF plot of ozone we can see that there is a pattern and the presence of peak for every 6th lag. Hence, we assume that the pattern occurs for every 6 years and therefore frequency is set to 6.

par(mfrow=c(3,1))
ozoneTs1 <- ts(as.vector(matrix(ozone$`Ozone Thickness`, nrow = 15, ncol = 6)), frequency = 6)
time(ozoneTs1)
## Time Series:
## Start = c(1, 1) 
## End = c(15, 6) 
## Frequency = 6 
##  [1]  1.000000  1.166667  1.333333  1.500000  1.666667  1.833333  2.000000
##  [8]  2.166667  2.333333  2.500000  2.666667  2.833333  3.000000  3.166667
## [15]  3.333333  3.500000  3.666667  3.833333  4.000000  4.166667  4.333333
## [22]  4.500000  4.666667  4.833333  5.000000  5.166667  5.333333  5.500000
## [29]  5.666667  5.833333  6.000000  6.166667  6.333333  6.500000  6.666667
## [36]  6.833333  7.000000  7.166667  7.333333  7.500000  7.666667  7.833333
## [43]  8.000000  8.166667  8.333333  8.500000  8.666667  8.833333  9.000000
## [50]  9.166667  9.333333  9.500000  9.666667  9.833333 10.000000 10.166667
## [57] 10.333333 10.500000 10.666667 10.833333 11.000000 11.166667 11.333333
## [64] 11.500000 11.666667 11.833333 12.000000 12.166667 12.333333 12.500000
## [71] 12.666667 12.833333 13.000000 13.166667 13.333333 13.500000 13.666667
## [78] 13.833333 14.000000 14.166667 14.333333 14.500000 14.666667 14.833333
## [85] 15.000000 15.166667 15.333333 15.500000 15.666667 15.833333
plot(ozoneTs1, main="Half Yearly Ozone Thickness", type="o")
acf(ozoneTs1)
pacf(ozoneTs1)

par(mfrow=c(1,1))

Analysis of Time series, ACF and PACF plots:

  1. Trend: There is a still trend in the time series plot.
  2. Seasonality: Slight seasonal behavior in the time series plot.
  3. Change in Variance: Clear change in variance.
  4. Intervention Point: No intervention or changing point.
  5. Behavior: This can be a mixture of Auto Regressive (AR) and Moving average (MA) behavior. Upward and downward flow / fluctuations denote the MA characteristics in the plot but there is also a random walk behavior denoting the presence of an AR characteristic. Thus, this can be estimated to be an ARMA model.
  6. ACF plot shows wave like pattern above the confidence interval, explaining the seasonal behavior in the model and also non-stationarity.
  7. Only one lag is above the significance level suggesting it to be an AR(1) series.

4.4.3. Seasonal Model

4.4.3.1. Modeling

Let’s fit a seasonal regression model by initially capturing the seasons in the time series and storing them in a object month. followed by the usage of regression function lm() with the parameter month. - 1. Here, -1 is added to remove the intercept component in the time series. The model is the fitted and summary are analyzed.

month.=season(ozoneTs1)
model3=lm(ozoneTs1 ~ month.-1) 

plot(ts(fitted(model3)), ylab='y'
, main = "Fitted seaonal model to ozone thickness series.",
ylim = c(min(c(fitted(model3), as.vector(ozoneTs1))) ,
max(c(fitted(model3), as.vector(ozoneTs1)))
), col = "red" )
lines(as.vector(ozoneTs1),type="o")

summary(model3)
## 
## Call:
## lm(formula = ozoneTs1 ~ month. - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4755 -1.6640  0.5134  2.3864  6.5111 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## month.Season-1  -2.8943     0.9318  -3.106 0.002585 ** 
## month.Season-2  -3.1722     0.9318  -3.404 0.001019 ** 
## month.Season-3  -3.6586     0.9318  -3.926 0.000176 ***
## month.Season-4  -3.1478     0.9318  -3.378 0.001108 ** 
## month.Season-5  -3.1039     0.9318  -3.331 0.001287 ** 
## month.Season-6  -3.2367     0.9318  -3.474 0.000814 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.609 on 84 degrees of freedom
## Multiple R-squared:  0.4589, Adjusted R-squared:  0.4202 
## F-statistic: 11.87 on 6 and 84 DF,  p-value: 1.325e-09

4.4.3.2. Summary of the model

  1. The coefficient for seasons 1 to 6 have a p-value less than 0.05 explaining that all the seasonal components contribute to the model fit.
  2. R-squared value for the model is 0.4202. Ideally we would like to have R-squared value between 0.75 and 0.85. Over here 0.4202 says that the model is not good enough for fitting.
  3. The overall p-value is less than 0.05 suggesting that this trend model is significant and can be used for fitting.
  4. This model does not exhibit better characteristics than the quadratic trend model. However, residual analysis can be performed for diagnostic checking.

4.4.3.3. Residual Analysis of the Seasonal trend model

  • lets perform residual analysis on the Seasonal trend model.
par(mfrow=c(3,2))

# TS plot
plot(y=rstudent(model3),x=as.vector(time(ozoneTs1)), xlab='Time'
,
ylab='Standardized Residuals'
,type='o'
, main = "Time series plot of standardized resid
fitted to ozone thickness series.
")

# Histogram

hist(rstudent(model3),xlab='Standardized Residuals'
, main = "Histogram of standardised resi
fitted to the ozone thickness series.
")

# QQ plot
y= rstudent(model3)
qqnorm(y, main = "QQ plot of standardised residuals for the Seasonal model
fitted to the ozone thickness series.
")
qqline(y, col = 2, lwd = 1, lty = 2)


# ACF
acf(rstudent(model3), main = "ACF of standardized residuals
")

# PACF
pacf(rstudent(model3), main = "PACF of standardized residuals
")

par(mfrow=c(1,1))

# Shapiro wilk test
y = rstudent(model3)
shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.95486, p-value = 0.003382

Summary of Residual Analysis:

  1. There is still trend in the time series plot of the standardized residuals.
  2. Seasonality is present with changing variance.
  3. Absence of intervention point with ARMA behavior.
  4. The time series plot of standardized residuals suggests that the model has not captured majority of the information.Thus, this trend model will give a very bad fit.
  5. Histogram plot does not indicate the presence of a symmetric distribution.
  6. Majority of the dots are away from the reference line indicating about the weakness of the trend model fit.
  7. Shapiro-wilk test gives a result of p = 1.295e-11 and p < 0.05 confirms that the data is not normally distributed.
  8. ACF and PACF both have multiple lags above the confidence interval. This proves that a lot of information is not captured and thus, this model is not a good fit.

4.4.4. Cosine Trend:

The cosine trend model can also be fitted for model estimation.

  • Initially the harmonics are determined by using the harmonic(x,m) function where “x”is the time series and “m” is the number of pairs of a harmonic functions to be created.
  • The a dataframe is created using the time series data and the harmonics.
  • The regression function lm() is then used to model the cosine behavior.

4.4.4.1. Modeling

  • Lets fit the cosine model for the given time series data and calculate its summary.
har. <- harmonic(ozoneTs1,1)
data. <- data.frame(ozoneTs1, har.)
model4 <- lm(ozoneTs1 ~ cos.2.pi.t. + sin.2.pi.t. , data = data.)

plot(ts(fitted(model4)), ylab='y'
, main = "Fitted cosine wave to ozone thickness series.
", ylim = c(min(c(fitted(model4), as.vector(ozoneTs1))) ,
max(c(fitted(model4), as.vector(ozoneTs1)))), col = "red" )
lines(as.vector(ozoneTs1),type="o")

summary(model4)
## 
## Call:
## lm(formula = ozoneTs1 ~ cos.2.pi.t. + sin.2.pi.t., data = data.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4280 -1.6519  0.5365  2.3087  6.5586 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.2023     0.3743  -8.555 3.65e-13 ***
## cos.2.pi.t.   0.1434     0.5293   0.271    0.787    
## sin.2.pi.t.  -0.1415     0.5293  -0.267    0.790    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.551 on 87 degrees of freedom
## Multiple R-squared:  0.001663,   Adjusted R-squared:  -0.02129 
## F-statistic: 0.07245 on 2 and 87 DF,  p-value: 0.9302

4.4.4.2. Summary of the model

  1. The coefficient for cosine and sine terms have their p-values greater than 0.05 suggesting that they are not significant and is weak in capturing information.
  2. R-squared value for the model is -0.02129. Ideally we would like to have R-squared value between 0.75 and 0.85. Over here -0.02129 says that the model is bad for fitting.
  3. The overall p-value is greater than 0.05 suggesting that this trend model is not significant and is not good for fitting.

4.4.4.3. Residual Analysis of the cosine trend model

  • Lets perform residual analysis on the cosine trend model.
par(mfrow=c(3,2))

# TS plot
plot(y=rstudent(model4),x=as.vector(time(ozoneTs1)), xlab='Time'
,
ylab='Standardized Residuals'
,type='o'
, main = "Time series plot of standardized resid
fitted to ozone thickness series.
")

# Histogram

hist(rstudent(model4),xlab='Standardized Residuals'
, main = "Histogram of standardised resi
fitted to the ozone thickness series.
")

# QQ plot
y= rstudent(model3)
qqnorm(y, main = "QQ plot of standardised residuals for the Cosine model
fitted to the ozone thickness series.
")
qqline(y, col = 2, lwd = 1, lty = 2)


# ACF
acf(rstudent(model4), main = "ACF of standardized residuals
")

# PACF
pacf(rstudent(model4), main = "PACF of standardized residuals
")

par(mfrow=c(1,1))

# Shapiro wilk test
y = rstudent(model4)
shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.95193, p-value = 0.002206

Summary of Residual Analysis:
1. There is no trend in the time series plot of the standardized residuals. 2. Seasonality is present with changing variance. 3. Absence of intervention point with ARMA behavior. 4. The time series plot of standardized residuals suggests that the model has not captured majority of the information.Thus, we this trend model will give a very bad fit. 5. Histogram plot does not indicate the presence of a symmetric distribution. 6. Majority of the dots are away from the reference line indicating about the weakness of the trend model fit. 7. Shapiro-wilk test gives a result of p = 1.287e-11 and p < 0.05 confirms that the data is not normally distributed. 8. ACF and PACF both have multiple lags above the confidence interval. This proves that a lot of information is not captured and thus, this model is not a good fit.

4.5. Optimum model

By performing multiple residual analysis on different trend models we can now conclude that all the trend models fail to capture the information. There is no strong evidence to prove that a deterministic trend model is good fit. However among the four trend models Quadratic trend model shows better characteristics for a model fit compared to the other three. Hence, Quadratic trend model is considered to be the optimum among the four. This model can now be used for forecasting.

4.6. Prediction

In my research for finding the best fitting model, I have concluded that the quadratic model gives the best fit compared to the others based on a number of criteria. Here, I am using a predict() function to predict the changes in the ozone thickness from 2017 till 2022.

h<- 5
t <- time(ozoneTs)
t2 <- t^2
aheadTimes <- data.frame(t = seq(2017, 2017 + h - 1, 1),
                        t2 = seq(2017, 2017 + h - 1, 1)^2)
forecast <- predict(model2, newdata = aheadTimes, interval = "prediction")

plot(ozoneTs, xlim= c(1927,2017+h), ylim = c(-15,10), ylab = "Ozone Thickness series"
, main = "Forecasts from the quadratic model fitted to the Ozone Thickness series.
")
lines(ts(as.vector(forecast[,3]), start = 2017), col="blue", type="l")
lines(ts(as.vector(forecast[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecast[,2]), start = 2017), col="blue", type="l")
legend("topleft", lty=1, pch=1, col=c("black", "blue", "red"), text.width = 18, 
       c("Data", "5% forecast limits", "Forecasts"))

4.7. Chapter 1 Summary

The data is initially converted into time series and checked for its 5 valid points followed by which pearson correlation of the data with its lags are computed. The ACF-PACF plot explains about the seasonal nature of the series and also suggests the series to be non-stationary. Right now we cannot tell the trend behavior of the model. Assuming it to be deterministic, we fit regression models-linear, quadratic, seasonal and cosine to the series and analyze on the summary followed by performing a residual analysis on the four models. After analysis it is clear that the quadratic model gives the best fit and helps in capturing majority of the information for prediction. Using quadratic model the forecast for the next five years is determined. After deterministic trend modeling and residual analysis we can strongly say that the model shows weak deterministic trend and follows a stochastic trend.

5. Chapter 2: Analaysis of stochastic trend models for model fitting

Propose a set of possible ARIMA(p, d, q) models using all suitable model specification tools. To demonstrate the use of each model specification tools such as ACF-PACF, EACF, BIC table clearly with comments to back up choices of (p, d, q).

In the previous task we did not find any suitable model which can fit the given time series data and also the data is non-stationary. Thus, it does not follow a deterministic trend and is assumed to be follow a stochastic trend.

Let’s revisit the dataset by visualizing the time series plot and analyzing the 5 valid points:

plot(ozoneTs, xlab="Year", ylab="Ozone Thickness",
     main="Change in the ozone thickness from 1927 to 2016", type="o")

5.1. Analyzing the 5 valid points

  1. Trend: There is a clear downward trend in the time series plot.
  2. Seasonality: We can observe that there can be a slight seasonal behavior for the series.
  3. Change in Variance: We can say there is a slight change in variance. If you observe carefully there are small fluctuations between 1960 to 1970, but there are higher jumps from 1980 to 1990. This indicates change in variance in the data.
  4. Intervention Point: Clearly, there is no intervention point. Change point occurs when there is a significant/drastic change in the characteristics of the plot. Over here, there is no such behavior in the data indicating the absence of an intervention point.
  5. Behavior: This can be a mixture of Auto Regressive (AR) and Moving average (MA) behavior. Upward and downward flow / fluctuations denote the MA characteristics in the plot but there is also a random walk behavior denoting the presence of an AR characteristic. Thus, this can be estimated to be an ARMA model.

5.2. Visualizing ACF and PACF Plot:

The ACF and PACF are initially visualized to check the stationarity in the time series data.

par(mfrow=c(1,2))
acf(ozoneTs, main = "ACF plot of ozone thickness series")
pacf(ozoneTs,  main = "PACF plot of ozone thickness series")

par(mfrow=c(1,1))

Inference:

  1. Clearly there is a pattern in the ACF plot which confirms the non-stationarity in the model.
  2. Wave like pattern in the ACF plot suggests seasonality.
  3. Mixed AR and MA characteristics wit MA being the dominant among the two.

5.3. Unit root tests

Unit root tests are performed on the data to confirm the non-stationarity of the time series.

5.3.1. ADF test

adf.test(ozoneTs, alternative = c("stationary"))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ozoneTs
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary

Inference:
The ADF test gives a p-value of 0.0867. p > 0.05 suggests that the model fails to reject the Null Hypothesis and is non-stationary.

5.4. Differencing and Transformation

  • The model is confirmed to have a non-stationary behavior by looking at the ACF plot and by running the adf test.
  • ACF plot clearly shows a decaying pattern which strongly suggests the non-stationarity in the series.
  • Thus, the series is assumed to be non-stationary. * The non-stationarity can be solved by performing differencing on the data.
  • Changing variance in the model can be stabilized by applying transformation.

5.4.1. Tranformation

The Slight changing variance in the series shows us an unstable characteristics of the model which can be made stable by performing transformation on the series. We can observe that there are negative values in our series and Transformation cannot be applied to negative data. Thus, first these negative values are handled by adding absolute minimum value of the data and 1. This will make the series positive and now the data is ready for transformation.

5.4.1.1. Log Transformation

  • Initially log transformation is applied on the series to stabilize the data.
ozoneTsLog <- log(ozoneTs + abs(min(ozoneTs)) + 1) 
plot(ozoneTsLog, type="o", ylab = "Simulated Series",
     main = "Time series plot of the Log transformed simulated series")

Inference:
We can see that the there is still unstable variances in the ts plot.

5.4.1.2. Box-Cox Transformation

  • Now, Box Cox transformation is applied on the series to stabilize the data.
BC <- BoxCox.ar(ozoneTs + abs(min(ozoneTs)) + 1, lambda = seq(-2,2,0.01))

BC$ci
## [1] 0.90 1.58
lambda <- BC$lambda[which(max(BC$loglike) == BC$loglike)]
lambda
## [1] 1.22
ozoneT_data_prep <- ozoneTs + abs(min(ozoneTs)) + 1
ozoneT_data_bc <- (ozoneT_data_prep^lambda-1)/lambda

plot(ozoneT_data_bc, type="o", ylab = "Simulated Series",
     main = "Time series plot of the BC-transformed simulated series")

Inference:

We can see that there is no significant difference in the actual series and the transformed series. But this is however better than the log transformation.

5.4.2. Differencing

Stabilizing the time series using Box-Cox is better than log transformation. But the lambda value of 1.22 is almost close to 1 saying that there is not much difference in the transformed series from the actual series. Thus, there is no point in using this transformed series and the raw series is used for further processing and model estimation. The actual series is now differenced to remove the trend and make it stationary.

First differencing:

The first differencing is performed by using the diff() function of the time series data.

ozoneTsDiff1 <- diff(ozoneTs, differences = 1)
plot(ozoneTsDiff1, ylab='Peak demand', xlab='Time', type='o', main = "Differenced model")

par(mfrow=c(1,2))
acf(ozoneTsDiff1, main = "ACF Plot of differenced series")
pacf(ozoneTsDiff1, main = "PACF Plot of differenced series")

par(mfrow=c(1,1))

adf.test(ozoneTsDiff1, alternative = c("stationary"))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ozoneTsDiff1
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
pp.test(ozoneTsDiff1, alternative = c("stationary"))
## 
##  Phillips-Perron Unit Root Test
## 
## data:  ozoneTsDiff1
## Dickey-Fuller Z(alpha) = -70.244, Truncation lag parameter = 3, p-value
## = 0.01
## alternative hypothesis: stationary
options(warn = defaultW)

Inference:

  1. The ACF plot does not have the decaying pattern which it had earlier explaining that the series is not stationary.
  2. The ADF test gives a p-value to 0.01 which is less than 0.05 suggesting stationarity in the series, but this p value is close to the alpha value. Thus, another unit root test is taken to confirm the stationarity.
  3. The PP test also gives a p-value of 0.01 explaining the stationarity in the model.

5.5. Model Estimation

The possible models can be estimated using the following:

  • EACF
  • BIC Table
  • ACF and PACF Plot

5.5.1. Extended Autocorrelation Function

Row names corresponds to the order of p and column names corresponds to the order of q. EACF fits all the possible models and fits “o” for all the feasible models and “x” for all the non-feasible models.

  • Look for the vertex or top-left “o”.
  • Corresponding p and q gives the order of the models.
  • Also look for the neighbor models for determining the p and q values.
eacf(ozoneTsDiff1)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x o o o x o o x o  o  o  o 
## 1 x o x o o o o o o x o  o  o  o 
## 2 x o x o o o x o o x o  o  o  o 
## 3 x o x o o x o o o o o  o  o  o 
## 4 x o o x o x o o o o o  o  o  o 
## 5 x x x x o x o o o o o  o  o  o 
## 6 o o o x x o o o o o o  o  o  o 
## 7 o o o x o o o o o o o  o  o  o

Models From EACF:

ARIMA(p,d,q)
Since the model is first differenced we take d = 1.

The top-left “o” is at the first row and first column (0,0). Thus, we consider this as the vertex for estimating p and q values. The nerighbors of (0,0) are also considered for model estimation. The possible models from EACF are as follows-
1. ARIMA(0,1,0)
2. ARIMA(0,1,1)
3. ARIMA(1,1,1)

5.5.2. Bayesian Information Criterion (BIC)

In BIC all the possible models of p and q are fitted to the series and a residual analysis is performed on the result. This is then summarized as a BIC table. Initially, the number AR and MA orders are set to be 10.

BIC <- armasubsets(y = ozoneTsDiff1, nar = 10, nma = 10,
                   y.name = 'Ozone Thickness', ar.method = "ols")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 4 linear dependencies found
## Reordering variables and trying again:
plot(BIC)

The BIC table with nar and nma set to 10 gives models with high order values. In the above BIC table order value of p is 10 and the order value of q are 2 and 3. Since 10 is a high order for modeling, we would like to change the nar and nma value for parsimony models.

BIC <- armasubsets(y = ozoneTsDiff1, nar = 5, nma = 5, y.name = 'Ozone Thickness', 
                   ar.method = "ols")
plot(BIC)

The above BIC table gives p orders of 3 and q orders of 2 which are good order values for modeling. Using these p and q values the following models are estimated-

ARIMA (3,1,2)

5.5.3. ACF and PACF Plots

After estimating the models using EACF and BIC table, ACF and PACF plots can be used to estimate additional models.

par(mfrow=c(1,2))
acf(ozoneTsDiff1, main = "ACF Plot for estimating the q value")
pacf(ozoneTsDiff1, main = "ACF Plot for estimating the p value")

par(mfrow=c(1,1))
  • The ACF plot shows two significant values crossing the confidence interval. Thus, q value can be 2 and since the second peak or value is only slightly significant we can also consider q to be 1.
  • The ACF plot can be interpreted as a pattern suggesting the q value to be 0.
  • Thus, the possible values of q from the ACF plots are 0, 1 and 2.
  • The PACF plot clearly shows no pattern and there are 3 significant values. Lag 5 is slightly insignificant but is surrounded by significant values and thus the possible orders of p are 3 and 4. If we want to consider only lags below 5 we observe that there are two significant values with the second only slightly significant suggesting the possibilities of p values being 1 and 2.

Possible p and q values:
p: 1,2,3,4
q: 0,1,2

  • Estimated models from ACF and PACF Plots-
    1. ARIMA(1,1,0)
    2. ARIMA(2,1,0)
    3. ARIMA(3,1,0)
    4. ARIMA(4,1,0)
    5. ARIMA(1,1,1)
    6. ARIMA(2,1,1)
    7. ARIMA(3,1,1)
    8. ARIMA(4,1,1)
    9. ARIMA(1,1,2)
    10. ARIMA(2,1,2)
    11. ARIMA(3,1,2)
    12. ARIMA(4,1,2)

5.6. Chapter 2 Summary

The time series is first analyzed for the 5 valid points and is observed that is has trend, seasonal behavior, change in variance, no intervention point and follows a mixed AR and MA behavior (i.e) ARMA characteristics. This is followed by visualizing the ACF and PACF plots of the series. The ACF plot shows a wave like pattern confirming the seasonal behavior of the series and also hinting the non-stationarity of the series. This Stationarity check can be made by performing unit root tests on the series. Unit-root test clearly suggests that the series in non-stationary and differencing has to be performed before estimating the models. Prior to differencing there is an important step of transformation which stabilizes by reducing the change in variance. However, after applying the log and box-cox transformation we do not see any significant change in the time series and also the lambda value is close to 1. Hence, actual raw time series can be used for further processing and estimation. The differenced time series is now stationary which can be verified by visualizing the ACF and PACF plots and also by performing unit-root tests on the differenced series. After making the series stationary the p and q orders of the model are estimated by using EACF, BIC table and ACF-PACF plots. Thus, the estimated models are as follows-

  1. ARIMA(0,1,0)
  2. ARIMA(0,1,1)
  3. ARIMA(1,1,0)
  4. ARIMA(1,1,1)
  5. ARIMA(1,1,2)
  6. ARIMA(3,1,2)
  7. ARIMA(2,1,0)
  8. ARIMA(2,1,1)
  9. ARIMA(2,1,2)
  10. ARIMA(3,1,0)
  11. ARIMA(3,1,1)
  12. ARIMA(4,1,1)
  13. ARIMA(4,1,0)
  14. ARIMA(4,1,2)

6. Conclusion

It can be concluded that the given ozone thickness data fails to exhibit deterministic trend behavior. This is explained in the chapter 1 where different regression models are fitted and diagnostic checking is done to determine the best model among the four. We found that all the four models were very weak at capturing the data and among the four quadratic model was the best with R Squared value close to 0.75 and exhibiting better residual analysis results. This quadratic model is then used to predict the ozone thickness for the next 5 years (2017-2021). Since the model fails to show any deterministic trend this time series is assumed to follow a stochastic trend. The non-stationarity of the model is confirmed using multiple unit-root tests and ACF-PACF plots and Using model specification tools such as ACF-PACF, EACF and BIC table appropriate (p,d,q) values are estimated.

7. References

[1] Ozone. (n.d.). Wikipedia. http://www.arctic.uoguelph.ca/cpe/environments/climate/climate_future/ozone/ozone_what.htm.

[2] Rob J Hyndman, George Athanasopoulos. (2018, May). Forecasting: Principles and Practice. https://otexts.com/fpp2/

[3] Tim Smith. (2020, May 10). Autocorrelation. https://www.investopedia.com/terms/a/autocorrelation.asp#:~:text=Autocorrelation%20represents%20the%20degree%20of,value%20and%20its%20past%20values.