Time Series Analysis Of Ozone Layer Thickness From 1927-2016

Introduction

The objective of this assignment is to find the best fitting model among linear, quadratic, cyclic,seasonal and cosine and then predict the values of ozone layer thickness for the next five years using the best fitting model.

Data

The ozone dataset is a csv file which contains 90 observations of the ozone layer thickness from 1927 to 2016. The dataset includes both negative and positive values. Negative value means decrease in the ozone layer thickness while postive value means increase in the ozone layer thickness.

Task 1

Required Packages

library(TSA)
library(readr)
library(tseries)
library(knitr)

Reading Ozone Dataset

#reading the dataset
ozone <- read.csv("C:/Users/mafza/Downloads/TIMESERIES/assignmnt_1/ozone.csv", header= FALSE)
rownames(ozone) <- seq(from=1927, to=2016)
#checking the class of object
class(ozone)

## [1] "data.frame"

Descriptive Analysis

After reading the ozone dataset, i converted the datatype of our ozone dataset from dataframe to timeseries.

# Converting it  to the Time series object
ozone_ts <- ts(as.vector(ozone), start=1927, end=2016, frequency = 1) 
#checking the class of converted object
class(ozone_ts)

## [1] "ts"

plot(ozone_ts,type='o',ylab='ozone', main='Time series plot of Ozone Layer Series')

Five Charateristics

There is a decreasing or downwards trend.
There might be a slight variation not obvious.
Behavior (both MA & AR)
No change point.
No intervention.
No seasonality.

ScatterPlot of Ozone series and its first lag

To find the impact of previous year’s ozone layer thickness on the next year’s, we need to calculate the value of pearson correlaton between ozone series and its first lag.

y = ozone_ts           
x = zlag(ozone_ts)        
index = 2:length(x)           
cor(y[index],x[index])

## [1] 0.8700381

plot(y[index],x[index],ylab='Ozone series', xlab='First lag of ozone series', main = 'Scatter plot of ozone series and its first lag')

Interpretation of ScatterPlot

The pearson correlation value = 0.87 and the scatterplot indicates that there is strong correlation between ozone series and its first lag.

Fiting Regression Models

1.Fitting Linear Model

# Fitting the linear model

t <- time(ozone_ts)
t

## Time Series:
## Start = 1927 
## End = 2016 
## Frequency = 1 
##  [1] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
## [16] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
## [31] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
## [46] 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
## [61] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## [76] 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016

linear_model<- lm(ozone_ts ~ t) # label the model as linear_model
summary(linear_model)

## 
## Call:
## lm(formula = ozone_ts ~ t)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7165 -1.6687  0.0275  1.4726  4.7940 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 213.720155  16.257158   13.15   <2e-16 ***
## t            -0.110029   0.008245  -13.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6655 
## F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16

plot(ozone_ts,ylab='ozone',xlab='Year',type='o', main = "Fitted linear model to the ozone series")
abline(linear_model)

Summary Statistics of Linear Model

The significance of t is significant, in other words, its p-value is less than alpha level 0.05.
R- squared value = 0.6693 which indicates there is 66.93% variance.
The overall general p-value is also less than 0.05 which means my linear model is significant.

Residual Analysis of Linear Model

# Residual analysis of linear model
plot(y=rstudent(linear_model),x=as.vector(time(ozone_ts)), xlab='Time',
     ylab='Standardized Residuals',type='o', main = "Residual plot")

plot(y=rstudent(linear_model),x=as.vector(time(ozone_ts)), xlab='Time',
     ylab='Standardized Residuals',type='p', main = "standardized residuals vs fitted.")
abline(h=0)

hist(rstudent(linear_model),xlab='Standardized Residuals', main = "Histogram of standardised residuals.")

y = rstudent(linear_model)
qqnorm(y, main = "QQ plot of standardised residuals.")
qqline(y, col = 2, lwd = 1, lty = 2)

acf(rstudent(linear_model), main = "ACF of standardized residuals.")

y = rstudent(linear_model)
shapiro.test(y)

## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.98733, p-value = 0.5372

Interpretation of Residual Analysis of Linear Model

Residual Plot: The reisdual plot shows that the residuals are between the bound of -3 to +3.There are also some spkies which is not desirable in a residual plot.The residuals are hanging around the mean level.
Residual vs Fitted Plot: The residual vs fitted plot shows that there are some residuals outside the bound of -2 to 2. Also the residuals are not spread evenly.
Histogram: we want to see a symmetrical distribution in the histogram plot of standardised residuals. And in our case the histogram looks nearly symmetrical.
qqnorm: in qqnorm plot the residuals follow the normal distribution except the tails, as most of the residuals lie closely in a normal fitted line.
shapiro wilk test: Ho: data are normally distributed. Thus p-value = 0.5372 greater than 0.05 indicates that we are failed to reject the null hypothesis, hence the data are normally distributed.
acf plot: in acf plot we want all the lags under the confidence interval line, in our case there 3 significant lags which means there is autocorrelation left in residuals, in other words there is valuable information left in residuals which our linear model is unable to capture.

2. Fitting Quadratic Model

# Fit the quadratic model

t= time(ozone_ts)
t2 = t^2 
quadratic_model = lm(ozone_ts ~ t + t2) # label the model as quadratic_model
summary(quadratic_model)

## 
## Call:
## lm(formula = ozone_ts ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1062 -1.2846 -0.0055  1.3379  4.2325 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
## t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
## t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared:  0.7391, Adjusted R-squared:  0.7331 
## F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16

plot(ts(fitted(quadratic_model)), ylim = c(min(c(fitted(quadratic_model), as.vector(ozone_ts))), max(c(fitted(quadratic_model),as.vector(ozone_ts)))),
     ylab='y' , main = "Fitted quadratic curve to ozone series", type="l",lty=2,col="red")
lines(as.vector(ozone_ts),type="o")

Summary Statistics of Quadratic Model

The significance of T and T^2 are significant, in other words, their p-vavlue is less than alpha level 0.05.
R- squared value = 0.7391 , which indicates there is 73.91% variance.
The overall general p-value is also less than 0.05 which means my quadratic model is significant.

Residual Analysis of Quadratic Model

# Residual analysis of Quadratic model
plot(y=rstudent(quadratic_model),x=as.vector(time(ozone_ts)), xlab='Time',
     ylab='Standardized Residuals',type='o', main = "Residual plot.")

plot(y=rstudent(quadratic_model),x=as.vector(time(ozone_ts)), xlab='Time',
     ylab='Standardized Residuals',type='p', main = "standardized residuals vs fitted")
abline(h=0)

hist(rstudent(quadratic_model),xlab='Standardized Residuals', main = "Histogram of standardised residuals.")

y = rstudent(quadratic_model)
qqnorm(y, main = "QQ plot of standardised residuals.")
qqline(y, col = 2, lwd = 1, lty = 2)

acf(rstudent(quadratic_model), main = "ACF of standardized residuals.")

y = rstudent(quadratic_model)
shapiro.test(y)

## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.98889, p-value = 0.6493

Interpretation of Residual Analysis of Quadratic Model

Residual Plot: The reisdual plot shows that the residuals are between the bound of -3 to +3.There are also some spkies which is not desirable in a residual plot. The residuals are hanging around the mean level.
Residual vs Fitted Plot: The residual vs fitted plot shows that there are some residuals outside the bound of -2 to +2. Also the residuals are not spread evenly.
Histogram: we want to see a symmetrical distribution in the histogram plot of standardised residuals. And in our case the histogram looks left skewed..
qqnorm: in qqnorm plot the residuals follow the normal distribution except the tails, as most of the residuals lie closely in a normal fitted line.
shapiro wilk test: Ho: data are normally distributed. Thus p-value = 0.6493 greater than 0.05 indicates that we are failed to reject the null hypothesis, hence the data are normally distributed.
acf plot: in acf plot we want all the lags under the confidence interval line, in our case there more than 2 significant lags which means there is autocorrelation left in residuals, in other words there is valuable information left in residuals which our quadratic model is unable to capture.

3. Fitting Harmonic Model

# Fit the Harmonic Model


har. <- harmonic(ozone_ts, 0.45) 
harmonic_model <- lm(ozone_ts ~ har.)
summary(harmonic_model)

## 
## Call:
## lm(formula = ozone_ts ~ har.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3520 -1.8905  0.4837  2.3643  6.4248 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.970e+00  4.790e-01  -6.199 1.79e-08 ***
## har.cos(2*pi*t)         NA         NA      NA       NA    
## har.sin(2*pi*t)  5.462e+11  7.105e+11   0.769    0.444    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.522 on 88 degrees of freedom
## Multiple R-squared:  0.006672,   Adjusted R-squared:  -0.004616 
## F-statistic: 0.5911 on 1 and 88 DF,  p-value: 0.4441

plot(ts(fitted(harmonic_model)), ylab='y', main = "Fitted cosine wave to ozone series.",
     ylim = c(min(c(fitted(harmonic_model), as.vector(ozone_ts))) ,
              max(c(fitted(harmonic_model), as.vector(ozone_ts)))
     ), col = "green" )
lines(as.vector(ozone_ts),type="o")

Summary Statistics of Harmonic Model

The significance of har.sin(2pit) and har.cos(2pit) are insignificant, in other words, their p-value is greater than alpha level 0.05.
R- squared value = 0.006672 , which indicates there is 0.006% variance.
The overall general p-value = 0.4441 which is greater than 0.05 which means my quadratic model is insignificant.

Residual Analysis of Harmonic Model

# Residual analysis of Harmonic model
plot(y=rstudent(harmonic_model),x=as.vector(time(ozone_ts)), xlab='Time',
     ylab='Standardized Residuals',type='o', main = "Time series plot of standardized residuals.")

plot(y=rstudent(harmonic_model),x=as.vector(time(ozone_ts)), xlab='Time',
     ylab='Standardized Residuals',type='p', main = "standardized residuals vs fitted.")
abline(h=0)

hist(rstudent(harmonic_model),xlab='Standardized Residuals', main = "Histogram of standardised residuals.")

y = rstudent(harmonic_model)
qqnorm(y, main = "QQ plot of standardised residuals.")
qqline(y, col = 2, lwd = 1, lty = 2)

acf(rstudent(harmonic_model), main = "ACF of standardized residuals.")

y = rstudent(harmonic_model)
shapiro.test(y)

## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.95856, p-value = 0.005875

Interpretation of Residual Analysis of Harmonic Model

Residual Plot: The reisdual plot shows that the residuals are between the bound of -3 to +3.There is also a downward trend which is not desirable in a residual plot.
Residual vs Fitted Plot: The residual vs fitted plot shows that there are some residuals outside the bound of -2 to +2. Also the residuals are not spread evenly.
Histogram: we want to see a symmetrical distribution in the histogram plot of standardized residuals. And in our case the histogram looks skewed.
qqnorm: in qqnorm plot the residuals doesn’t follow the normal distribution , as most of the residuals doesn’t lie closely in a normal fitted line.
shapiro wilk test: Ho: data are normally distributed. Thus p-value = 0.005875 which is less than than 0.05 indicates that we reject the null hypothesis, hence the data are not normally distributed.
acf plot: in acf plot we want all the lags under the confidence interval line, in our case there is a wave pattern which means there is autocorrelation left in residuals, in other words there is valuable information left in residuals which our harmonic model is unable to capture.

Best Fitting Model

From the summary statistics of all the models we can conclude that quadratic model is the best best model, as its R-squared value = 0.7391 is greater as compared to linear model which has R-sqaured value = 0.6693 which means quadratic model has higher variation than linear model. Also standard error = 1.815 of quadratic model is less than as compared to linear model which has higher standard error value = 2.032. So quadratic model is the best model as it has higher variation and less residual standard error.

Forecasting Using Quadratic Model

#forecasting quadratic model

h <- 4

t <- time(ozone_ts)
t2 <- t^2

aheadTimes <- data.frame(t = seq(2017, 2017+h, 1),
                         t2 = seq(2017, 2017+h, 1)^2)

forecasting_quadratic_model <- predict(quadratic_model, newdata = aheadTimes, interval = "prediction")

print(forecasting_quadratic_model)

##         fit       lwr       upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701

plot(ozone_ts, xlim= c(1927,2021) ,ylim = c(-20,10),
     ylab = "Ozone-Layer series",
     main = "Forecasts from the quadratic model fitted to the Ozone series.")
lines(ts(as.vector(forecasting_quadratic_model[,3]), start = 2017), col="blue", type="l")
lines(ts(as.vector(forecasting_quadratic_model[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecasting_quadratic_model[,2]), start = 2017), col="blue", type="l")
legend("topleft", lty=1, pch=1, col=c("black","blue","red"),
       text.width = 18,
       c("Data","5% forecast limits", "Forecasts"))

The graph indicates that there is a decrease in ozone layer thickness predicted in the next five years.

Conclusion of Task-1

It was observed that quadratic model is the best best model, as its has approximately 74% variation but during residual analysis, in acf plots,we want all the lags under the confidence interval line but there more than 2 significant lags which means there is autocorrelation left in residuals, in other words there is valuable information left in residuals which our quadratic model is unable to capture.
High correlation in the lags tells that the trend in the series is not deterministic but the trend in the series is stochastic because when we fit the model it not stationary.

Task 2

Propose a set of possible ARIMA(p,d,q) models

plot(ozone_ts,type='o',ylab='ozone', main='Time series plot of ozone layer series')

ACF & PACF Plots

par(mfrow=c(1,2))
acf(ozone_ts,main="ACF plot for Ozone Layer series.")
pacf(ozone_ts,main="PACF plot for Ozone Layer series.")

There is a slowly decaying pattern in acf plot and very high first correlation in pacf plot which indicates that there is a trend in our series and our series is non-stationary.

Dickey-Fuller Unit-Root Test (ADF)

adf.test(ozone_ts, alternative = c("stationary"))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ozone_ts
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary

Ho: Non-stationary Ha: Stationary Our p-value = 0.0867 which is greater than 0.05, so we fail to reject our null hypothesis and hence the series is non-stationary.

Box-Cox Transformation

ozone_ts_posi = ozone_ts + abs(min(ozone_ts))+1
BC = BoxCox.ar(ozone_ts_posi )

BC$ci

## [1] 0.9 1.5

lambda <- BC$lambda[which(max(BC$loglike) == BC$loglike)]
lambda

## [1] 1.2

ozone_bc = (ozone_ts_posi^lambda-1)/lambda
plot(ozone_bc,type='o',ylab='Simulated series',  main = "Box Cox transformed series.")

qqnorm(ozone_bc)
qqline(ozone_bc, col = 4)

shapiro.test(ozone_bc)

## 
##  Shapiro-Wilk normality test
## 
## data:  ozone_bc
## W = 0.96644, p-value = 0.01995

Interpretation of Box-Cox Transformation

In the QQ plot the tails of the distribution is far from the normality.
The p-value = 0.01995 of Shapiro test is also less than 0.05. Hence, we have enough evidence to reject the normality hypothesis.
Also the value of lambda = 1.2 which concludes that Box-Cox transformation did not significantly help to improve normality of the observations.
we will do differencing on the original series.

Differencing on the original series

dif = diff(ozone_ts)
par(mfrow=c(1,1))
plot(dif,type='o',ylab='Average weekly earnings', main = "Time series plot of the first differenced ozone series.")

adf.test(dif)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  dif
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

Ho: Non-stationary Ha: Stationary Our p-value = 0.01 which is less than 0.05, so we reject our null hypothesis and hence after the first differencing on the original series, the series has become stationary.

par(mfrow=c(1,2))
acf(dif,main="ACF of first difference.")
pacf(dif,main="PACF of first difference.")

In acf there is one significant lag so the value of q=1. There are also significant lags at lag 7 and at lag 10 but i will not considered them as there were many insignificant lags before them.
In pacf there 2 significant lags and one lag appears at lag 6 so so i will include both, so the possible value are p=2 and p=3. So the possible sets of arima models from acf and pacf are ARIMA(2,1,1), ARIMA(3,1,1)

EACF PLOT

eacf(dif)

## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x o o o x o o x o  o  o  o 
## 1 x o x o o o o o o x o  o  o  o 
## 2 x o x o o o x o o x o  o  o  o 
## 3 x o x o o x o o o o o  o  o  o 
## 4 x o o x o x o o o o o  o  o  o 
## 5 x x x x o x o o o o o  o  o  o 
## 6 o o o x x o o o o o o  o  o  o 
## 7 o o o x o o o o o o o  o  o  o

The top-left “o” can be found at p=0 and q=0 which shows a clear vertex. The neibouring ARMA models are (p=0,q=1) and (p=1, q=1). So the possible sets of arima models from eacf are ARIMA(0,1,0), ARIMA(0,1,1), ARIMA(1,1,1)

BIC Table

I am setting the values of nar and nma parameters as 6 because in acf, pacf and eacf plots all the ARMA values are under 6.

res2 = armasubsets(dif , nar=6 , nma=6, y.name='ozone', ar.method='ols')
plot(res2)

From the bic table AR(3) and AR(4) and MA(2). So the possible arima models from bic table are ARIMA(3,1,2), ARIMA(4,1,2)

Conclusion of Task 2 .

So the possible set of arima models from acf, pacf, eacf and bic are:

ARIMA(2,1,1)
ARIMA(3,1,1)
ARIMA(0,1,0)
ARIMA(0,1,1)
ARIMA(1,1,1)
ARIMA(3,1,2)
ARIMA(4,1,2)