Introduction

My aim is to analyze the data by using the analysis methods covered in the first two modules of MATH1318 Time Series Analysis course in this semester. My final goal is to find the best fitting trend model to this dataset and give predictions of yearly changes for the next 5 years.

setwd("~/Downloads")
x<-c("TSA", "tidyverse", "dplyr")
lapply(x, require, character.only = TRUE)

## Loading required package: TSA

## Loading required package: leaps

## Loading required package: locfit

## locfit 1.5-9.1    2013-03-22

## Loading required package: mgcv

## Loading required package: nlme

## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.

## Loading required package: tseries

## 
## Attaching package: 'TSA'

## The following objects are masked from 'package:stats':
## 
##     acf, arima

## The following object is masked from 'package:utils':
## 
##     tar

## Loading required package: tidyverse

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## collapse(): dplyr, nlme
## filter():   dplyr, stats
## lag():      dplyr, stats
## spec():     readr, TSA

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE

Visualisation of time series

data <- read.csv("DATA.csv", header = FALSE)
rownames(data) <- seq(from = 1927, to = 2016)
data.ts <- ts(as.vector(data), start = 1927, end = 2016)
plot(data.ts, ylab = "Thickness(Dobson units)", type= "o", main = "Annual changes in the thickness of Ozone layer from 1927 to 2016")

We observe considerable variation in thickness of Ozone layer over the years from this time series plot. The plot showed that the data has a downward trend in which the Ozone layer has been becoming thinner over the period. For further analysis, we will investigate whether the consecutive years are related in some way. The following code chunk generates a scatter plot to find out the relationship between pairs of consecutive values:

The plots show a upward trend and general tendency very clearly. In the scatter plot we can see that there are so many points overlap the others, classified by values. For instance, the middle-sized values tend to follow by middle-sized values, and high values by high values. The amount of correlation between neighboring abundance values is 0.8700381 which is very high:

cor(y[index],x[index])

## [1] 0.8700381

Therefore, we can make a solid conclusion that, there is an autocorrelation in the data due to the appearance of succeeding points.

A Model-Building Strategy

plot(data.ts, ylab = "Thickness(Dobson units)", type= "o", main = "Annual changes in the thickness of Ozone layer from 1927 to 2016")

From this visulisation above we can refer that there is a moving average in the data therefore the data is stationary. Also the plot shows interventions in 1962, 1992…So we can refer it to white noise.

1- Model selection

Linear trend model

model1 = lm(data.ts~time(data.ts)) 
summary(model1)

## 
## Call:
## lm(formula = data.ts ~ time(data.ts))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7165 -1.6687  0.0275  1.4726  4.7940 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   213.720155  16.257158   13.15   <2e-16 ***
## time(data.ts)  -0.110029   0.008245  -13.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6655 
## F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16

Estimates of slope and intercept are: β̂1= -0.110029 and β̂0= 213.720155 respectively,
Here the slope is statistically significant at 5% significance level. But the intercept seems too large. Adjusted R-squared stands at 0.6655 ( still < 0.7) which is not quite significant enough.

plot(data.ts,type='o',ylab='y',main = "Time series plot for simulated Ozone layer's thickness")
abline(model1)

Also, we can see that the trend line is plotted over the time series in the plot above does not capture the series very well in the data.

Therefore, it seems to us that the linear regression model applied for this data is not the best one.

Quadratic trend model

t = time(data.ts)
t2=t^2
model2 = lm(data.ts~t+t2) # label the model as model1 
summary(model2)

## 
## Call:
## lm(formula = data.ts ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1062 -1.2846 -0.0055  1.3379  4.2325 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
## t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
## t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared:  0.7391, Adjusted R-squared:  0.7331 
## F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16

Again, the estimated slopes and intercept are significant, and the value of estimated intercept is reasonable.

R-squared in Quadratic model == ‘0.7331’ >> R-squared in linear model == ‘0.6655’

plot(ts(fitted(model2)), ylim = c(min(c(fitted(model2), as.vector(data.ts))), max(c(fitted(model2),as.vector(data.ts)))),ylab='y' , main = "Fitted quadratic curve to Ozone thickness data")
lines(as.vector(data.ts),type="o")

Quadratic trend line captures the data very well.

Residuals of selected model

Model 2- The following chunk generates a time series plot for the standardized residuals of the yearly layer thickness:

plot(y=rstudent(model2),x=as.vector(time(data.ts)), xlab='Time', ylab='Standardized Residuals',type='o', main = "Time series plot of residuals model 2")

The residuals seems to have a variation around mean level. Now we will use histogram to check the Normality of it.

Histogram:

hist(rstudent(model2),xlab='Standardized Residuals')

y = rstudent(model2)

Q-Q plot:

y = rstudent(model2)
qqnorm(y, main = "Normal Q-Q plot of model 2")
qqline(y,col=2,lwd=1,lty=2)

Shapiro-Wilk test:

y = rstudent(model2)
shapiro.test(y)

## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.98889, p-value = 0.6493

Model 1- Residual plot:

plot(y=rstudent(model1),x=as.vector(time(data.ts)), xlab='Time', ylab='Standardized Residuals',type='o', main = "Time series plot of residuals model 1")

Q-Q plot:

y = rstudent(model1)
qqnorm(y, main = "Normal Q-Q plot of model 1")
qqline(y,col=2,lwd=1,lty=2)

Histogram:

hist(rstudent(model1),xlab='Standardized Residuals of model 1')

y = rstudent(model1)

Shapiro-Wilk test:

y = rstudent(model1)
shapiro.test(y)

## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.98733, p-value = 0.5372

Compare all these visualisation, we can conclude that model 2 with quadratic trend is much better than linear one because in model 2:

The residuals tends to distribute around mean level more than model1’s.
Q-Q line is much more straight than model1’s and the points are getting closer to the quantile line in comparison with model 1.
P-value of model 2 is bigger than model 1 then we have more evidence to not reject the null hypothesis that the stochastic component of model 2 is normally distributed.

Autocorrelation

Model 2:

acf(rstudent(model2), main = "ACF of standardized residuals model 2")

Model 1:

acf(rstudent(model1), main = "ACF of standardized residuals of model 1")

In both ACF plots created by two models,we have correlation values higher than the confidence bound at several lags. This is not what we expect from a white noise process.

After all, it seems to me that model 2 with quadratic trend played better for the given time series data.

Forecasts

t = time(data.ts)
t2 = t^2
model2 = lm(data.ts~t+t2)
Year = c(2017, 2018, 2019, 2020, 2021)
new = data.frame(t = Year, t2 = Year^2, 1)

forecasts = predict(model2, new, interval = "prediction")

forecasts

##         fit       lwr       upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701

Plot forcasts

plot(data.ts, xlim = c(1927,2021), ylim = c(-16,4), ylab = "Thickness of Ozone ", main = "Thickness of Ozone data along with forecasts")
lines(ts(as.vector(forecasts[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecasts[,2]), start = 2017), col="blue", type="l") 
lines(ts(as.vector(forecasts[,3]), start = 2017), col="blue", type="l") 
legend("bottomleft", lty=1, pch=1, col=c("black","blue","red"), text.width = 25 ,c("Data","5% forecast limits", "Forecasts"))

Conclusion:

For this data, I have been applying two models: linear and quadratic trends. But there are quite a few weaknesses and limits with those two models. Therefore we can say the quadratic model is better than linear one but not the best fitted one.

Time series assignment 1

Mai Thi Nguyen