When an oxygen molecule is split into single atoms, these atoms combine with the nearby oxygen molecule to form an ozone. The oxygen molecules are split by the sun’s rays and most of the ozone is present in the Stratosphere. The thickness of this layer is measured in dobson unit and is it is very important for preventing the harmful radiations coming from the sun and the stars. Ozone helps in capturing the low wavelength radiations and also safeguards us from the harmful ultraviolet rays. The dataset given is the thickness of the ozone layer in dobson unit from 1927 to 2016. This data can be used for forecasting the ozone thickness for the upcoming years.
The given dataset represents yearly changes in the ozone thickness from 1927 to 2016 in Dobson units. A negative value in the dataset represents a decrease in the thickness and a positive value represents an increase in the thickness The dataset contains 89 observations and 1 feature variable.
We will be using functions from the TSA package for modeling and predictions. The TSA package contains R functions and datasets from the book “Time Series Analysis with Applications in R”. The tseries package contains functions for performing unit root test which is handy in diagnostic checking and model specification.
library(TSA)
## Warning: package 'TSA' was built under R version 4.0.5
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
library(tseries)
## Warning: package 'tseries' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
defaultW <- getOption("warn")
options(warn = -1)
The working directory is set using the setwd() function and the data is read using R’s read.csv() function with header parameter set to be FALSE.
setwd("C:/Users/Lenovo/Documents/RMIT/Courses/Time Series/Assignment")
ozone <- read.csv("data1.csv", header = FALSE)
head(ozone)
Information such as the structure of the dataset, data type etc, can be determined using the following functions:
structure -> str()
Data type -> class()
str(ozone)
## 'data.frame': 90 obs. of 1 variable:
## $ V1: num 1.351 0.761 -1.269 -1.464 -0.979 ...
class(ozone)
## [1] "data.frame"
class(ozone$V1)
## [1] "numeric"
The output tells us that the dataset is a dataframe with 90 observations and 1 feature/variable. The feature V1 is a numeric datatype.
For better clarity, we can rename this variable to “Ozone Thickness”.
names(ozone)[1]<-paste("Ozone Thickness")
head(ozone)
For time series analysis, the expected datatype is a “Time Series”. Since, we do not have a time series datatype by default, we should convert the datatype into a time series one. Here, we are using the ts() function from the TSA package to convert the data into time series.
Syntax: ts(df$col, start=xx, end=yy, frequency=z)
The data is recorded yearly from 1927 till 2016 and thus the parameters start and end are 1927 and 2016 respectively. I want to start my analysis by working on the annual data and thus frequency is taken as 1.
ozoneTs <- ts(as.vector(ozone$`Ozone Thickness`), start = 1927, end = 2016, frequency = 1)
ozoneTs
## Time Series:
## Start = 1927
## End = 2016
## Frequency = 1
## [1] 1.35118436 0.76053242 -1.26855735 -1.46368717 -0.97920302
## [6] 1.50856746 1.62991681 1.83242333 -0.83968364 -1.09611566
## [11] -2.67457473 -2.78716606 -2.97317944 -0.23495548 0.97067000
## [16] 1.23652307 2.23062331 0.35671637 -2.12028099 -2.66812477
## [21] -0.64702795 0.99608564 1.83817742 1.89922697 -0.55488121
## [26] -1.40387419 -3.39178762 0.05777194 3.40717183 1.31488379
## [31] -0.12882457 -2.51580137 -3.06205664 -3.33637179 -2.66332198
## [36] -0.67958655 -2.11660422 -2.36318997 -5.36156537 -3.03103458
## [41] -2.28838624 -1.06438684 -1.68813570 -3.16974819 -3.65647649
## [46] -1.25151090 -1.08431732 -0.44863234 -0.17636387 -2.64954530
## [51] -1.28317654 -4.29289634 -3.24282341 -3.60135297 -2.57288652
## [56] -5.00586059 -6.68548244 -4.58870210 -4.32654629 -2.29370761
## [61] -2.26456266 -2.27184846 -2.66440082 -3.79556478 -7.65843185
## [66] -10.17433972 -4.21230497 -2.82287161 -1.36776491 -4.43997062
## [71] -3.78323838 -5.85304107 -8.55125744 -8.06501289 -7.75975806
## [76] -6.65633206 -6.62708203 -7.83548356 -8.84424264 -7.67352209
## [81] -7.05582939 -4.69497353 -7.12712128 -9.58954985 -10.19222042
## [86] -9.33224686 -10.80567444 -10.86096923 -11.57941376 -9.30284452
time(ozoneTs)
## Time Series:
## Start = 1927
## End = 2016
## Frequency = 1
## [1] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
## [16] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
## [31] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
## [46] 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
## [61] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## [76] 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
The output clearly shows us that the starting and ending points are 1927 and 2016 respectively with a frequency of 1. Visualizing the time points using the time() function also shows the range and frequency of time stamps.
The time series data is plotted using the plot() function and is analyzed for the following 5 valid points:
plot(ozoneTs, type = "o", xlab="Year", ylab="Ozone Thickness",
main="Change in the ozone thickness from 1927 to 2016")
Analysis of the 5 Valid points
The impact of previous year(s) on the current year can be estimated by using Pearson correlation coefficient. It is the correlation between present year and previous year(s) data and is calculated on dividing the covariance of the two variables by the product of their standard deviations. The result is normalized and is always between -1 and 1. Negative value explains the negative correlation between the two and positive value explains the positive correlation between the two.
Let us start by determining the pearson’s correlation between present year Y[t] and its previous year Y[t-1] followed by years Y[t-2], Y[t-3] etc. until we get a correlation value less than 0.60.
This pearson’s correlation is followed by a correlation plot which helps us visually understand the relationship between the two.
PEARSON CORRELATION OF Y[t] and Y[t-1]
y = ozoneTs
x = zlag(ozoneTs)
index = 2:length(x)
cor(y[index], x[index])
## [1] 0.8700381
Inference:
CORRELATION PLOT OF Y[t] and Y[t-1]
plot(y = ozoneTs, x = zlag(ozoneTs),ylab="Ozone Thickness",
xlab="ozone thickness in the previous year",
main="Scatter plot to show the autocorrelation between Y[t] and Y[t-1]")
Inference:
PEARSON CORRELATION OF Y[t] and Y[t-2]
y = ozoneTs
x = zlag(zlag(ozoneTs))
index = 3:length(x)
cor(y[index], x[index])
## [1] 0.7198518
Inference:
CORRELATION PLOT OF Y[t] and Y[t-2]
plot(y = ozoneTs, x = zlag(zlag(ozoneTs)),ylab="Ozone Thickness",
xlab="ozone thickness one year before the previous year",
main="Scatter plot to show the autocorrelation between Y[t] and Y[t-2]")
Inference:
PEARSON CORRELATION OF Y[t] and Y[t-3]
y = ozoneTs
x = zlag(zlag(zlag(ozoneTs)))
index = 4:length(x)
cor(y[index], x[index])
## [1] 0.592762
Inference:
CORRELATION PLOT OF Y[t] and Y[t-3]
plot(y = ozoneTs, x = zlag(zlag(ozoneTs)),ylab="Ozone Thickness",
xlab="ozone thickness two years before the previous year",
main="Scatter plot to show the autocorrelation between Y[t] and Y[t-3]")
Inference:
From the time series plot it is clear that it is an ARMA model with both AR and MA characteristics. However, diagnostics checking is done using ACF and PACF to confirm the behavior of the model.
ACF -> Autocorrelation Function
PACF -> Partial Autocorrelation Function
Auto correlation can be considered as a relationship between a variable’s current value and its past values. An autocorrelation of +1 explains positive correlation (an increase seen in one causes a proportionate increase in the other). An autocorrelation of -1 explains negative correlation (a decrease seen in one causes a proportionate decrease in the other).
acf(ozoneTs, main = "ACF plot of the Ozone thickness series")
Inference:
1. There is a clear decaying pattern in the ACF plot which confirms the presence of trend in the model.
2. Slight wave like pattern in the ACF plot says there can be slight seasonal behavior in the model.
3. There are multiple statistically significant lags explaining high order MA characteristics, but pattern denotes that the series is non-stationary.
pacf(ozoneTs, main = "PACF plot of the Ozone thickness series")
Inference:
1. Unlike ACF, PACF is less susceptible to explain a pattern.
2. Only one lag is statistically significant explaining the AR characteristics of the model.
Deterministic Trend Models
We are still not sure about the characteristics of the model. Fitting a regression model is one way of estimation. From the ACF plot it is proved that the model follows a pattern and thus is non-stationary. Regression analysis can be used to model non constant mean trend.
Regression Trend Models:
First, get the time stamps of the time series
t <- time(ozoneTs)
t
## Time Series:
## Start = 1927
## End = 2016
## Frequency = 1
## [1] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
## [16] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
## [31] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
## [46] 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
## [61] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## [76] 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
model1 <- lm(ozoneTs ~ t)
plot(ozoneTs, type="o", ylab="y", main= "Fitted Linear curve to Ozone Thickness series")
abline(model1)
summary(model1)
##
## Call:
## lm(formula = ozoneTs ~ t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7165 -1.6687 0.0275 1.4726 4.7940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.720155 16.257158 13.15 <2e-16 ***
## t -0.110029 0.008245 -13.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared: 0.6693, Adjusted R-squared: 0.6655
## F-statistic: 178.1 on 1 and 88 DF, p-value: < 2.2e-16
Time series plot of standardized residuals: Ideally we want to see a random white noise series of the standardized residuals which suggests that the fitted trend model has captured majority of the observations.
Histogram Plot:
Histogram gives us the frequency of standardized residuals. Ideally we require a symmetric distribution of standardized residuals to strongly support the trend fit.
Q-Q plot:
Ideally we want to see all the dots stick to the reference line on the QQ plot of standardized residuals.
Shapiro-wilk Test:
Shapiro wilk test is done to test the normality of standardized residuals.
Ho: Data are normally distributed
HA: Data are not normally distributed
If p < 0.05, then we reject the null hypothesis and say that the data is not normal. Whereas, if p > 0.05 we fail to reject the null hypothesis and say that the data is normally distributed.
ACF & PACF Plot:
We expect to see all bars in an ACF and PACF plot of standardized residuals under the horizontal dashed confidence limits. This explains that all the data has been captured and there is no information left in the residuals for capturing.
par(mfrow=c(3,2))
# TS plot
plot(y=rstudent(model1),x=as.vector(time(ozoneTs)), xlab='Time'
,
ylab='Standardized Residuals'
,type='o'
, main = "Time series plot of standardized resid
fitted to ozone thickness series.
")
# Histogram
hist(rstudent(model1),xlab='Standardized Residuals'
, main = "Histogram of standardised resi
fitted to the ozone thickness series.
")
# QQ plot
y= rstudent(model1)
qqnorm(y, main = "QQ plot of standardised residuals for the linear model
fitted to the ozone thickness series.
")
qqline(y, col = 2, lwd = 1, lty = 2)
# ACF
acf(rstudent(model1), main = "ACF of standardized residuals
")
# PACF
pacf(rstudent(model1), main = "PACF of standardized residuals
")
par(mfrow=c(1,1))
# Shapiro wilk test
y = rstudent(model1)
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.98733, p-value = 0.5372
Summary of residual analysis:
t2 <- t^2
model2 <- lm(ozoneTs ~ t + t2)
plot(ts(fitted(model2)), ylab='y'
, main = "Fitted Quadratic curve to Ozone thickness series.
"
,
ylim = c(min(c(fitted(model2), as.vector(ozoneTs))) ,
max(c(fitted(model2), as.vector(ozoneTs)))
) )
lines(as.vector(ozoneTs),type="o")
summary(model2)
##
## Call:
## lm(formula = ozoneTs ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1062 -1.2846 -0.0055 1.3379 4.2325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.733e+03 1.232e+03 -4.654 1.16e-05 ***
## t 5.924e+00 1.250e+00 4.739 8.30e-06 ***
## t2 -1.530e-03 3.170e-04 -4.827 5.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared: 0.7391, Adjusted R-squared: 0.7331
## F-statistic: 123.3 on 2 and 87 DF, p-value: < 2.2e-16
par(mfrow=c(3,2))
# TS plot
plot(y=rstudent(model2),x=as.vector(time(ozoneTs)), xlab='Time'
,
ylab='Standardized Residuals'
,type='o'
, main = "Time series plot of standardized resid
fitted to ozone thickness series.
")
# Histogram
hist(rstudent(model2),xlab='Standardized Residuals'
, main = "Histogram of standardised resi
fitted to the ozone thickness series.
")
# QQ plot
y= rstudent(model2)
qqnorm(y, main = "QQ plot of standardised residuals for the Quadratic model
fitted to the ozone thickness series.
")
qqline(y, col = 2, lwd = 1, lty = 2)
# ACF
acf(rstudent(model2), main = "ACF of standardized residuals
")
# PACF
pacf(rstudent(model2), main = "PACF of standardized residuals
")
par(mfrow=c(1,1))
# Shapiro wilk test
y = rstudent(model2)
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.98889, p-value = 0.6493
Summary of Residual Analysis of Quadratic trend
Cyclic Trend Models:
Initially we started with the annual record of the ozone thickness by assuming frequency to be 1. But while visualizing the time series plot we found that there can be slight seasonal behavior in the series, especially at the beginning. Thus, it is always a good practice to fit seasonal and cosine models for model estimation.
Here I am defining a separate object ozoneTs1 with that ozone data converted as a time series with frequency determined using an ACF plot of ozone dataframe.
acf(ozone)
From the ACF plot of ozone we can see that there is a pattern and the presence of peak for every 6th lag. Hence, we assume that the pattern occurs for every 6 years and therefore frequency is set to 6.
par(mfrow=c(3,1))
ozoneTs1 <- ts(as.vector(matrix(ozone$`Ozone Thickness`, nrow = 15, ncol = 6)), frequency = 6)
time(ozoneTs1)
## Time Series:
## Start = c(1, 1)
## End = c(15, 6)
## Frequency = 6
## [1] 1.000000 1.166667 1.333333 1.500000 1.666667 1.833333 2.000000
## [8] 2.166667 2.333333 2.500000 2.666667 2.833333 3.000000 3.166667
## [15] 3.333333 3.500000 3.666667 3.833333 4.000000 4.166667 4.333333
## [22] 4.500000 4.666667 4.833333 5.000000 5.166667 5.333333 5.500000
## [29] 5.666667 5.833333 6.000000 6.166667 6.333333 6.500000 6.666667
## [36] 6.833333 7.000000 7.166667 7.333333 7.500000 7.666667 7.833333
## [43] 8.000000 8.166667 8.333333 8.500000 8.666667 8.833333 9.000000
## [50] 9.166667 9.333333 9.500000 9.666667 9.833333 10.000000 10.166667
## [57] 10.333333 10.500000 10.666667 10.833333 11.000000 11.166667 11.333333
## [64] 11.500000 11.666667 11.833333 12.000000 12.166667 12.333333 12.500000
## [71] 12.666667 12.833333 13.000000 13.166667 13.333333 13.500000 13.666667
## [78] 13.833333 14.000000 14.166667 14.333333 14.500000 14.666667 14.833333
## [85] 15.000000 15.166667 15.333333 15.500000 15.666667 15.833333
plot(ozoneTs1, main="Half Yearly Ozone Thickness", type="o")
acf(ozoneTs1)
pacf(ozoneTs1)
par(mfrow=c(1,1))
Analysis of Time series, ACF and PACF plots:
Let’s fit a seasonal regression model by initially capturing the seasons in the time series and storing them in a object month. followed by the usage of regression function lm() with the parameter month. - 1. Here, -1 is added to remove the intercept component in the time series. The model is the fitted and summary are analyzed.
month.=season(ozoneTs1)
model3=lm(ozoneTs1 ~ month.-1)
plot(ts(fitted(model3)), ylab='y'
, main = "Fitted seaonal model to ozone thickness series.",
ylim = c(min(c(fitted(model3), as.vector(ozoneTs1))) ,
max(c(fitted(model3), as.vector(ozoneTs1)))
), col = "red" )
lines(as.vector(ozoneTs1),type="o")
summary(model3)
##
## Call:
## lm(formula = ozoneTs1 ~ month. - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4755 -1.6640 0.5134 2.3864 6.5111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## month.Season-1 -2.8943 0.9318 -3.106 0.002585 **
## month.Season-2 -3.1722 0.9318 -3.404 0.001019 **
## month.Season-3 -3.6586 0.9318 -3.926 0.000176 ***
## month.Season-4 -3.1478 0.9318 -3.378 0.001108 **
## month.Season-5 -3.1039 0.9318 -3.331 0.001287 **
## month.Season-6 -3.2367 0.9318 -3.474 0.000814 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.609 on 84 degrees of freedom
## Multiple R-squared: 0.4589, Adjusted R-squared: 0.4202
## F-statistic: 11.87 on 6 and 84 DF, p-value: 1.325e-09
par(mfrow=c(3,2))
# TS plot
plot(y=rstudent(model3),x=as.vector(time(ozoneTs1)), xlab='Time'
,
ylab='Standardized Residuals'
,type='o'
, main = "Time series plot of standardized resid
fitted to ozone thickness series.
")
# Histogram
hist(rstudent(model3),xlab='Standardized Residuals'
, main = "Histogram of standardised resi
fitted to the ozone thickness series.
")
# QQ plot
y= rstudent(model3)
qqnorm(y, main = "QQ plot of standardised residuals for the Seasonal model
fitted to the ozone thickness series.
")
qqline(y, col = 2, lwd = 1, lty = 2)
# ACF
acf(rstudent(model3), main = "ACF of standardized residuals
")
# PACF
pacf(rstudent(model3), main = "PACF of standardized residuals
")
par(mfrow=c(1,1))
# Shapiro wilk test
y = rstudent(model3)
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.95486, p-value = 0.003382
Summary of Residual Analysis:
The cosine trend model can also be fitted for model estimation.
har. <- harmonic(ozoneTs1,1)
data. <- data.frame(ozoneTs1, har.)
model4 <- lm(ozoneTs1 ~ cos.2.pi.t. + sin.2.pi.t. , data = data.)
plot(ts(fitted(model4)), ylab='y'
, main = "Fitted cosine wave to ozone thickness series.
", ylim = c(min(c(fitted(model4), as.vector(ozoneTs1))) ,
max(c(fitted(model4), as.vector(ozoneTs1)))), col = "red" )
lines(as.vector(ozoneTs1),type="o")
summary(model4)
##
## Call:
## lm(formula = ozoneTs1 ~ cos.2.pi.t. + sin.2.pi.t., data = data.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4280 -1.6519 0.5365 2.3087 6.5586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.2023 0.3743 -8.555 3.65e-13 ***
## cos.2.pi.t. 0.1434 0.5293 0.271 0.787
## sin.2.pi.t. -0.1415 0.5293 -0.267 0.790
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.551 on 87 degrees of freedom
## Multiple R-squared: 0.001663, Adjusted R-squared: -0.02129
## F-statistic: 0.07245 on 2 and 87 DF, p-value: 0.9302
par(mfrow=c(3,2))
# TS plot
plot(y=rstudent(model4),x=as.vector(time(ozoneTs1)), xlab='Time'
,
ylab='Standardized Residuals'
,type='o'
, main = "Time series plot of standardized resid
fitted to ozone thickness series.
")
# Histogram
hist(rstudent(model4),xlab='Standardized Residuals'
, main = "Histogram of standardised resi
fitted to the ozone thickness series.
")
# QQ plot
y= rstudent(model3)
qqnorm(y, main = "QQ plot of standardised residuals for the Cosine model
fitted to the ozone thickness series.
")
qqline(y, col = 2, lwd = 1, lty = 2)
# ACF
acf(rstudent(model4), main = "ACF of standardized residuals
")
# PACF
pacf(rstudent(model4), main = "PACF of standardized residuals
")
par(mfrow=c(1,1))
# Shapiro wilk test
y = rstudent(model4)
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.95193, p-value = 0.002206
Summary of Residual Analysis:
1. There is no trend in the time series plot of the standardized residuals. 2. Seasonality is present with changing variance. 3. Absence of intervention point with ARMA behavior. 4. The time series plot of standardized residuals suggests that the model has not captured majority of the information.Thus, we this trend model will give a very bad fit. 5. Histogram plot does not indicate the presence of a symmetric distribution. 6. Majority of the dots are away from the reference line indicating about the weakness of the trend model fit. 7. Shapiro-wilk test gives a result of p = 1.287e-11 and p < 0.05 confirms that the data is not normally distributed. 8. ACF and PACF both have multiple lags above the confidence interval. This proves that a lot of information is not captured and thus, this model is not a good fit.
By performing multiple residual analysis on different trend models we can now conclude that all the trend models fail to capture the information. There is no strong evidence to prove that a deterministic trend model is good fit. However among the four trend models Quadratic trend model shows better characteristics for a model fit compared to the other three. Hence, Quadratic trend model is considered to be the optimum among the four. This model can now be used for forecasting.
In my research for finding the best fitting model, I have concluded that the quadratic model gives the best fit compared to the others based on a number of criteria. Here, I am using a predict() function to predict the changes in the ozone thickness from 2017 till 2022.
h<- 5
t <- time(ozoneTs)
t2 <- t^2
aheadTimes <- data.frame(t = seq(2017, 2017 + h - 1, 1),
t2 = seq(2017, 2017 + h - 1, 1)^2)
forecast <- predict(model2, newdata = aheadTimes, interval = "prediction")
plot(ozoneTs, xlim= c(1927,2017+h), ylim = c(-15,10), ylab = "Ozone Thickness series"
, main = "Forecasts from the quadratic model fitted to the Ozone Thickness series.
")
lines(ts(as.vector(forecast[,3]), start = 2017), col="blue", type="l")
lines(ts(as.vector(forecast[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecast[,2]), start = 2017), col="blue", type="l")
legend("topleft", lty=1, pch=1, col=c("black", "blue", "red"), text.width = 18,
c("Data", "5% forecast limits", "Forecasts"))
The data is initially converted into time series and checked for its 5 valid points followed by which pearson correlation of the data with its lags are computed. The ACF-PACF plot explains about the seasonal nature of the series and also suggests the series to be non-stationary. Right now we cannot tell the trend behavior of the model. Assuming it to be deterministic, we fit regression models-linear, quadratic, seasonal and cosine to the series and analyze on the summary followed by performing a residual analysis on the four models. After analysis it is clear that the quadratic model gives the best fit and helps in capturing majority of the information for prediction. Using quadratic model the forecast for the next five years is determined. After deterministic trend modeling and residual analysis we can strongly say that the model shows weak deterministic trend and follows a stochastic trend.
Propose a set of possible ARIMA(p, d, q) models using all suitable model specification tools. To demonstrate the use of each model specification tools such as ACF-PACF, EACF, BIC table clearly with comments to back up choices of (p, d, q).
In the previous task we did not find any suitable model which can fit the given time series data and also the data is non-stationary. Thus, it does not follow a deterministic trend and is assumed to be follow a stochastic trend.
Let’s revisit the dataset by visualizing the time series plot and analyzing the 5 valid points:
plot(ozoneTs, xlab="Year", ylab="Ozone Thickness",
main="Change in the ozone thickness from 1927 to 2016", type="o")
The ACF and PACF are initially visualized to check the stationarity in the time series data.
par(mfrow=c(1,2))
acf(ozoneTs, main = "ACF plot of ozone thickness series")
pacf(ozoneTs, main = "PACF plot of ozone thickness series")
par(mfrow=c(1,1))
Inference:
Unit root tests are performed on the data to confirm the non-stationarity of the time series.
adf.test(ozoneTs, alternative = c("stationary"))
##
## Augmented Dickey-Fuller Test
##
## data: ozoneTs
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary
Inference:
The ADF test gives a p-value of 0.0867. p > 0.05 suggests that the model fails to reject the Null Hypothesis and is non-stationary.
The Slight changing variance in the series shows us an unstable characteristics of the model which can be made stable by performing transformation on the series. We can observe that there are negative values in our series and Transformation cannot be applied to negative data. Thus, first these negative values are handled by adding absolute minimum value of the data and 1. This will make the series positive and now the data is ready for transformation.
ozoneTsLog <- log(ozoneTs + abs(min(ozoneTs)) + 1)
plot(ozoneTsLog, type="o", ylab = "Simulated Series",
main = "Time series plot of the Log transformed simulated series")
Inference:
We can see that the there is still unstable variances in the ts plot.
BC <- BoxCox.ar(ozoneTs + abs(min(ozoneTs)) + 1, lambda = seq(-2,2,0.01))
BC$ci
## [1] 0.90 1.58
lambda <- BC$lambda[which(max(BC$loglike) == BC$loglike)]
lambda
## [1] 1.22
ozoneT_data_prep <- ozoneTs + abs(min(ozoneTs)) + 1
ozoneT_data_bc <- (ozoneT_data_prep^lambda-1)/lambda
plot(ozoneT_data_bc, type="o", ylab = "Simulated Series",
main = "Time series plot of the BC-transformed simulated series")
Inference:
We can see that there is no significant difference in the actual series and the transformed series. But this is however better than the log transformation.
Stabilizing the time series using Box-Cox is better than log transformation. But the lambda value of 1.22 is almost close to 1 saying that there is not much difference in the transformed series from the actual series. Thus, there is no point in using this transformed series and the raw series is used for further processing and model estimation. The actual series is now differenced to remove the trend and make it stationary.
First differencing:
The first differencing is performed by using the diff() function of the time series data.
ozoneTsDiff1 <- diff(ozoneTs, differences = 1)
plot(ozoneTsDiff1, ylab='Peak demand', xlab='Time', type='o', main = "Differenced model")
par(mfrow=c(1,2))
acf(ozoneTsDiff1, main = "ACF Plot of differenced series")
pacf(ozoneTsDiff1, main = "PACF Plot of differenced series")
par(mfrow=c(1,1))
adf.test(ozoneTsDiff1, alternative = c("stationary"))
##
## Augmented Dickey-Fuller Test
##
## data: ozoneTsDiff1
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
pp.test(ozoneTsDiff1, alternative = c("stationary"))
##
## Phillips-Perron Unit Root Test
##
## data: ozoneTsDiff1
## Dickey-Fuller Z(alpha) = -70.244, Truncation lag parameter = 3, p-value
## = 0.01
## alternative hypothesis: stationary
options(warn = defaultW)
Inference:
The possible models can be estimated using the following:
Row names corresponds to the order of p and column names corresponds to the order of q. EACF fits all the possible models and fits “o” for all the feasible models and “x” for all the non-feasible models.
eacf(ozoneTsDiff1)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x o o o x o o x o o o o
## 1 x o x o o o o o o x o o o o
## 2 x o x o o o x o o x o o o o
## 3 x o x o o x o o o o o o o o
## 4 x o o x o x o o o o o o o o
## 5 x x x x o x o o o o o o o o
## 6 o o o x x o o o o o o o o o
## 7 o o o x o o o o o o o o o o
Models From EACF:
ARIMA(p,d,q)
Since the model is first differenced we take d = 1.
The top-left “o” is at the first row and first column (0,0). Thus, we consider this as the vertex for estimating p and q values. The nerighbors of (0,0) are also considered for model estimation. The possible models from EACF are as follows-
1. ARIMA(0,1,0)
2. ARIMA(0,1,1)
3. ARIMA(1,1,1)
In BIC all the possible models of p and q are fitted to the series and a residual analysis is performed on the result. This is then summarized as a BIC table. Initially, the number AR and MA orders are set to be 10.
BIC <- armasubsets(y = ozoneTsDiff1, nar = 10, nma = 10,
y.name = 'Ozone Thickness', ar.method = "ols")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 4 linear dependencies found
## Reordering variables and trying again:
plot(BIC)
The BIC table with nar and nma set to 10 gives models with high order values. In the above BIC table order value of p is 10 and the order value of q are 2 and 3. Since 10 is a high order for modeling, we would like to change the nar and nma value for parsimony models.
BIC <- armasubsets(y = ozoneTsDiff1, nar = 5, nma = 5, y.name = 'Ozone Thickness',
ar.method = "ols")
plot(BIC)
The above BIC table gives p orders of 3 and q orders of 2 which are good order values for modeling. Using these p and q values the following models are estimated-
ARIMA (3,1,2)
After estimating the models using EACF and BIC table, ACF and PACF plots can be used to estimate additional models.
par(mfrow=c(1,2))
acf(ozoneTsDiff1, main = "ACF Plot for estimating the q value")
pacf(ozoneTsDiff1, main = "ACF Plot for estimating the p value")
par(mfrow=c(1,1))
Possible p and q values:
p: 1,2,3,4
q: 0,1,2
The time series is first analyzed for the 5 valid points and is observed that is has trend, seasonal behavior, change in variance, no intervention point and follows a mixed AR and MA behavior (i.e) ARMA characteristics. This is followed by visualizing the ACF and PACF plots of the series. The ACF plot shows a wave like pattern confirming the seasonal behavior of the series and also hinting the non-stationarity of the series. This Stationarity check can be made by performing unit root tests on the series. Unit-root test clearly suggests that the series in non-stationary and differencing has to be performed before estimating the models. Prior to differencing there is an important step of transformation which stabilizes by reducing the change in variance. However, after applying the log and box-cox transformation we do not see any significant change in the time series and also the lambda value is close to 1. Hence, actual raw time series can be used for further processing and estimation. The differenced time series is now stationary which can be verified by visualizing the ACF and PACF plots and also by performing unit-root tests on the differenced series. After making the series stationary the p and q orders of the model are estimated by using EACF, BIC table and ACF-PACF plots. Thus, the estimated models are as follows-
It can be concluded that the given ozone thickness data fails to exhibit deterministic trend behavior. This is explained in the chapter 1 where different regression models are fitted and diagnostic checking is done to determine the best model among the four. We found that all the four models were very weak at capturing the data and among the four quadratic model was the best with R Squared value close to 0.75 and exhibiting better residual analysis results. This quadratic model is then used to predict the ozone thickness for the next 5 years (2017-2021). Since the model fails to show any deterministic trend this time series is assumed to follow a stochastic trend. The non-stationarity of the model is confirmed using multiple unit-root tests and ACF-PACF plots and Using model specification tools such as ACF-PACF, EACF and BIC table appropriate (p,d,q) values are estimated.
[1] Ozone. (n.d.). Wikipedia. http://www.arctic.uoguelph.ca/cpe/environments/climate/climate_future/ozone/ozone_what.htm.
[2] Rob J Hyndman, George Athanasopoulos. (2018, May). Forecasting: Principles and Practice. https://otexts.com/fpp2/
[3] Tim Smith. (2020, May 10). Autocorrelation. https://www.investopedia.com/terms/a/autocorrelation.asp#:~:text=Autocorrelation%20represents%20the%20degree%20of,value%20and%20its%20past%20values.