Introduction

The aim of the investigation is to find the best fitting trend model to a provided Ozone time series data and to make predictions of yearly changes for the next 5 years. To answer the question, the data was firstly analysed through visualisation. Three trend models that seemed the most appropriate were identified and fitted to the time series. These three models were be further diagnosed to understand how well they fitted the data and if the assumptions of the models were satisfied.

Data

The dataset data1.csv for analysis represents yearly changes in the thickness of Ozone layer from 1927 to 2016 in Dobson units. A negative value in the dataset means a decrease in the thickness and a positive value - an increase.

The following code chunk imports the dataset as well as displays its first observations.

rr data1 <- read_csv(1.csv, col_names = F)

Parsed with column specification:
cols(
  X1 = col_double()
)

rr head(data1)

Checking the class of data1 object showed that the data was read as a data frame, so it was converted to a time series ozone.

rr class(data1)

[1] \tbl_df\     \tbl\        \data.frame\

rr ozone <- ts(as.vector(data1), start = 1927, end = 2016) head(ozone)

[1]  1.3511844  0.7605324 -1.2685573 -1.4636872 -0.9792030  1.5085675

Exploring and Visualising

The plot of Ozone time series shows that there is a general downward trend in the changes of ozone layer thickness over the observed time. There is no seasonality, changing variance in the data is not very obvious, and there can be multiple change points. The series seems to have an autoregressive behaviour, each next point follows the previous one. There is also an oscillation-like pattern in the data.

The observed downward trend in ozone layer thickness changes is explained by the increase in emissions of ozone-depleting substances over the years.

rr plot(ozone, type = , main = series plot for Ozone series, ylab = Units)

For the analysis, it is important to see whether or not consecutive years are related in some way. The following code chunk generates a scatter plot and calculates a correlation coefficient to investigate the relationship between pairs of consecutive changes in ozone layer thickness.

rr y = ozone x = zlag(y) plot(y = y, x = x, main = layer thickness changes vs previous year’s changes, ylab = , xlab = Year DU)

rr index <- 2:length(y) cor(y[index], x[index])

[1] 0.8700381

From a scatter plot we can tell that there is a strong positive linear relationship between two consecutive years’ values, the correlation coefficient (\(r = 0.87\)) confirms that.

Fitting trend models

After visualising the Ozone time series, three trend models were identified: linear, quadratic and a combination of cosine and quadratic trends seemed the most suitable for the investigation. The method used to fit these models was the classical regression approach, estimating unknown parameters in a linear model according to the criterion of least squares.

Fitting a linear trend

The deterministic linear trend model is expressed as follows:

\[\mu_t = \beta_0 + \beta_1t\] The following code chunk generates the constant and slope estimates using the ordinary least squares method.

rr t <- time(ozone) model1 <- lm(ozone~t) summary(model1)


Call:
lm(formula = ozone ~ t)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7165 -1.6687  0.0275  1.4726  4.7940 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 213.720155  16.257158   13.15   <2e-16 ***
t            -0.110029   0.008245  -13.34   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.032 on 88 degrees of freedom
Multiple R-squared:  0.6693,    Adjusted R-squared:  0.6655 
F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16

The estimated model is: \(\mu_t = 213.720155 - 0.110029t\). If the stochastic component in our time series is normally distributed white noise, model summary reports that the constant and slope estimates of linear trend are statistically significant at 5% level. According to coefficient of determination, about 67% of the variation in the Ozone series is explained by a linear time trend.

The following code superimposes the estimated line of best fit to our time series. The plot shows that although the linear model captures the general trend, it is not approximating the data well at all.

rr plot(ozone, type = , main = series plot for Ozone series, ylab = Units) abline(model1, col = )

Residual analysis

If the fitted model is reasonably suitable, the residuals should behave roughly like the true stochastic component, and assumptions behind the stochastic component can be checked by looking at the residuals.

The following code generates standardised residuals for the linear trend model:

rr residuals1 <- rstudent(model1)

Firstly, plotting standardised residuals over time. From looking at the generated plot, it is possible to say that there is major departure from randomness of the residuals, a clear pattern can be observed, so fitted linear model was not successful at capturing the autocorrelation effect in our time series. When we plot standardised residuals against fitted values, we can see that the points are not very evenly distributed, and larger negative residuals associate with larger negative/positive fitted values. There seems to be slight curvilinear relationship between standardised residuals and fitted values.

rr plot(y = residuals1, x = as.vector(time(ozone)), xlab = ‘Time’, ylab = ‘Standardized Residuals’, type = ‘o’) abline(h = 0, col = )

rr plot(y = residuals1, x = fitted(model1), ylab = ‘Standardized Residuals’,xlab = ‘Fitted Trend Line Values’, type = ‘p’)

The following code chunk produces a Q-Q Plot and a Shapiro-Wilk test to test the normality assumption. The straight line pattern is not obvious from a Q-Q Plot, but given the reported test statistics and \(p\)-value = 0.54, we fail to reject the \(H_0\) that the stochastic component of linear model is normally distributed.

rr qqnorm(residuals1) qqline(residuals1, col = 2, lwd = 1, lty = 2)

rr shapiro.test(residuals1)


    Shapiro-Wilk normality test

data:  residuals1
W = 0.98733, p-value = 0.5372

The following code generates the sample autocorrelation function for standardized residuals from the linear trend model of the Ozone series:

rr acf(residuals1, main = for standardised residuals)

The lag 1, lag 7 and lag 8 autocorrelations exceed two standard errors above zero. There is a clear pattern in the sample ACF graph, so we can infer that the stochastic component of the Ozone time series is not white noise.

Fitting a quadratic trend

The deterministic quadratic trend model is expressed as follows:

\[\mu_t = \beta_0 + \beta_1t + \beta_2t^2\] The following code chunck fits a quadratic trend mode to the Ozone data.

rr t2 <- t^2 model2 <- lm(ozone ~ t + t2) summary(model2)


Call:
lm(formula = ozone ~ t + t2)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1062 -1.2846 -0.0055  1.3379  4.2325 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.815 on 87 degrees of freedom
Multiple R-squared:  0.7391,    Adjusted R-squared:  0.7331 
F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16

The estimated model is: \(\mu_t = -5733.23764 + 5.9246t - 0.00153t^2\). According to the adjusted coefficient of determination, about 73% of the variation in the Ozone series is explained by a quadratic time trend.

The following code superimposes the estimated quadratic curve to our time series. The plot shows that the quadratic trend captures the general trend much better than the linear one but still does not do a good job at approximating the data.

rr plot(ts(fitted(model2)), x = as.vector(time(ozone)), ylim = c(min(c(fitted(model2), as.vector(ozone))), max(c(fitted(model2),as.vector(ozone)))), ylab = ‘Dobson Units’, xlab = , main = quadratic curve to Ozone series, type = ,col = ) lines(ozone, type = )

Residual Analysis

From the plot of standardised residuals over time, we observe a very similar picture to the linear trend model. Residuals are not random, quadratic model was not successful at capturing the autocorrelation effect in the series. When we plot standardised residuals against fitted values, we can see that the points are not evenly distributed and tend to gather towards the right side of the rectangle as fitted values approach 0.

rr residuals2 <- rstudent(model2) plot(y = residuals2, x = as.vector(time(ozone)), xlab = ‘Time’, ylab = ‘Standardized Residuals’, type=‘o’) abline(h = 0, col = )

rr plot(y = residuals2, x = fitted(model2), ylab = ‘Standardized Residuals’, xlab = ‘Fitted Trend Line Values’, type = ‘p’)

The following code chunk produces a Q-Q Plot and a Shapiro-Wilk test to check the normality assumption. Given the straight line pattern from a Q-Q Plot and the reported test statistics with \(p\)-value = 0.65, we fail to reject the \(H_0\) that the stochastic component of quadratic model is normally distributed.

rr qqnorm(residuals2) qqline(residuals2, col = 2, lwd = 1, lty = 2)

rr shapiro.test(residuals2)


    Shapiro-Wilk normality test

data:  residuals2
W = 0.98889, p-value = 0.6493

The following code generates the sample autocorrelation function for standardized residuals from the quadratic trend model of the Ozone series:

rr acf(rstudent(model2), main = for standardised residuals)

As in the linear model case, after fitting a quadratic trend, some autocorreation remains in the residuals from this model. Lag 1 and lag 7 autocorrelations exceed two standard errors above zero, while lag 3, 4, 10 - below zero. After examining the residuals, we can infer that the stochastic component in our time series does not behave like white noise.

Forecasting with quadratic trend model

Although non of the fitted trend models are suitable for prediction purposes based on the residual analysis, predictions of yearly changes in the ozone layer thickness for the next 5 years were attempted. The model used for making predictions was the quadratic trend because all the terms were found statistically significant at 5% level unlike the combination trend, and it has an acceptable value of adjusted \(R^2\).

The following code chunk makes predictions for next 5 time points. The last level of the Ozone time series is for year 2016. So we are predicting changes for years 2017-2021.

rr pred <- seq(from = 2017, by = 1, length.out = 5) pred2 <- pred^2 new <- data.frame(t = pred, t2 = pred2) forecasts <- predict(model2, newdata = new, interval = ) forecasts

        fit       lwr       upr
1 -10.34387 -14.13556 -6.552180
2 -10.59469 -14.40282 -6.786548
3 -10.84856 -14.67434 -7.022786
4 -11.10550 -14.95015 -7.260851
5 -11.36550 -15.23030 -7.500701

Adding generated forecasts and their prediction intervals to the time series plot:

rr plot(ozone, xlim = c(1927, 2022), ylim = c(-16, 5)) lines(ts(as.vector(forecasts[,1]), start = 2017), col=, type=) r lines(ts(as.vector(forecasts[,2]), start = 2017), col=, type=) lines(ts(as.vector(forecasts[,3]), start = 2017), col=, type=) r legend(, lty=1, pch=1, col=c(,,), text.width = 22, c(,% forecast limits, ))

Conclusion

To find the best fitting trend model to the Ozone time series, three most promising models were identified. They were: linear, quadratic and a combination of cosine and quadratic trends. All three models were estimated and fitted to the data using ordinary regression approach. The process was followed by analysing the behaviour of standardised residuals from these models. The residuals investigation showed that although all three models managed to capture the general trend in the series, there was always some autocorrelation remaining in the residuals.

The quadratic trend was decided to be the “best fitting” in terms of significance of coefficients and its adjusted \(R^2\) value. It was further used to make predictions of yearly changes in the ozone layer thickness for next 5 years. These forecasts cannot be considered correct and cannot be trusted.

To sum up, fitting a deterministic trend to the Ozone time series did not result into a quality model that can be used for further prediction. Fitting ARIMA models can be the direction for future investigation.



---
title: "MATH1318 Assignment 1"
author: "Anna Krinochkina s3712761"
subtitle: Fitting trend models to predict yearly changes of Ozone layer thickness
output:
  html_notebook: default
---

## Introduction

The aim of the investigation is to find the best fitting trend model to a provided Ozone time series data and to make predictions of yearly changes for the next 5 years. To answer the question, the data was firstly analysed through visualisation. Three trend models that seemed the most appropriate were identified and fitted to the time series. These three models were be further diagnosed to understand how well they fitted the data and if the assumptions of the models were satisfied. 

```{r, echo=FALSE, message=FALSE}
library(readr)
install.packages("FSAdata")
library(FSAdata)
```

## Data 

The dataset `data1.csv` for analysis represents yearly changes in the thickness of Ozone layer from 1927 to 2016 in Dobson units. A negative value in the dataset means a decrease in the thickness and a positive value - an increase.

The following code chunk imports the dataset as well as displays its first observations.

```{r}
data1 <- read_csv("data1.csv", col_names = F)
head(data1)
```
 
Checking the class of `data1` object showed that the data was read as a data frame, so it was converted to a time series `ozone`.

```{r}
class(data1)
ozone <- ts(as.vector(data1), start = 1927, end = 2016)
head(ozone)
```

## Exploring and Visualising

The plot of Ozone time series shows that there is a general downward trend in the changes of ozone layer thickness over the observed time. There is no seasonality, changing variance in the data is not very obvious, and there can be multiple change points. The series seems to have an autoregressive behaviour, each next point follows the previous one. There is also an oscillation-like pattern in the data.

The observed downward trend in ozone layer thickness changes is explained by the increase in emissions of ozone-depleting substances over the years. 

```{r}
plot(ozone, type = "o", main = "Time series plot for Ozone series", ylab = "Dobson Units")
```

For the analysis, it is important to see whether or not consecutive years are related in some way. The following code chunk generates a scatter plot and calculates a correlation coefficient to investigate the relationship between pairs of consecutive changes in ozone layer thickness.

```{r}
y = ozone
x = zlag(y)
plot(y = y, x = x, main = "Ozone layer thickness changes vs previous year's changes", ylab = "DU", xlab = "Previous Year DU")
index <- 2:length(y)
cor(y[index], x[index])
```

From a scatter plot we can tell that there is a strong positive linear relationship between two consecutive years' values, the correlation coefficient ($r = 0.87$) confirms that.

## Fitting trend models

After visualising the Ozone time series, three trend models were identified: linear, quadratic and a combination of cosine and quadratic trends seemed the most suitable for the investigation. The method used to fit these models was the classical regression approach, estimating unknown parameters in a linear model according to the criterion of least squares. 

### Fitting a linear trend

The deterministic linear trend model is expressed as follows:

$$\mu_t = \beta_0 + \beta_1t$$
The following code chunk generates the constant and slope estimates using the ordinary least squares method.

```{r}
t <- time(ozone)
model1 <- lm(ozone~t)
summary(model1)
```

The estimated model is: $\mu_t = 213.720155 - 0.110029t$. If the stochastic component in our time series is normally distributed white noise, model summary reports that the constant and slope estimates of linear trend are statistically significant at 5% level. According to coefficient of determination, about 67% of the variation in the Ozone series is explained by a linear time trend. 

The following code superimposes the estimated line of best fit to our time series. The plot shows that although the linear model captures the general trend, it is not approximating the data well at all.

```{r}
plot(ozone, type = "o", main = "Time series plot for Ozone series", ylab = "Dobson Units")
abline(model1, col = "red")
```

#### Residual analysis

If the fitted model is reasonably suitable, the residuals should behave roughly like the true stochastic component, and assumptions behind the stochastic component can be checked by looking at the residuals.

The following code generates standardised residuals for the linear trend model:

```{r}
residuals1 <- rstudent(model1)
```

Firstly, plotting standardised residuals over time. From looking at the generated plot, it is possible to say that there is major departure from randomness of the residuals, a clear pattern can be observed, so fitted linear model was not successful at capturing the autocorrelation effect in our time series. When we plot standardised residuals against fitted values, we can see that the points are not very evenly distributed, and larger negative residuals associate with larger negative/positive fitted values. There seems to be slight curvilinear relationship between standardised residuals and fitted values.

```{r}
plot(y = residuals1, x = as.vector(time(ozone)), xlab = 'Time', ylab = 'Standardized Residuals', type = 'o')
abline(h = 0, col = "red")
plot(y = residuals1, x = fitted(model1), ylab = 'Standardized Residuals',xlab = 'Fitted Trend Line Values', type = 'p')
```

The following code chunk produces a Q-Q Plot and a Shapiro-Wilk test to test the normality assumption. The straight line pattern is not obvious from a Q-Q Plot, but given the reported test statistics and $p$-value = 0.54, we fail to reject the $H_0$ that the stochastic component of linear model is normally distributed.

```{r}
qqnorm(residuals1)
qqline(residuals1, col = 2, lwd = 1, lty = 2)
shapiro.test(residuals1)
```

The following code generates the sample autocorrelation function for standardized residuals from the linear trend model of the Ozone series:

```{r}
acf(residuals1, main = "ACF for standardised residuals")
```

The lag 1, lag 7 and lag 8 autocorrelations exceed two standard errors above zero. There is a clear pattern in the sample ACF graph, so we can infer that the stochastic component of the Ozone time series is not white noise.

### Fitting a quadratic trend

The deterministic quadratic trend model is expressed as follows:

$$\mu_t = \beta_0 + \beta_1t + \beta_2t^2$$
The following code chunck fits a quadratic trend mode to the Ozone data.

```{r}
t2 <- t^2
model2 <- lm(ozone ~ t + t2)
summary(model2)
```

The estimated model is: $\mu_t = -5733.23764 + 5.9246t - 0.00153t^2$. According to the adjusted coefficient of determination, about 73% of the variation in the Ozone series is explained by a quadratic time trend. 

The following code superimposes the estimated quadratic curve to our time series. The plot shows that the quadratic trend captures the general trend much better than the linear one but still does not do a good job at approximating the data.

```{r}
plot(ts(fitted(model2)), x = as.vector(time(ozone)), ylim = c(min(c(fitted(model2), as.vector(ozone))),
                                                              max(c(fitted(model2),as.vector(ozone)))),
     ylab = 'Dobson Units', xlab = "Time", main = "Fitted quadratic curve to Ozone series", type = "l",col = "red")
lines(ozone, type = "o")
```

#### Residual Analysis

From the plot of standardised residuals over time, we observe a very similar picture to the linear trend model. Residuals are not random, quadratic model was not successful at capturing the autocorrelation effect in the series. When we plot standardised residuals against fitted values, we can see that the points are not evenly distributed and tend to gather towards the right side of the rectangle as fitted values approach 0.

```{r}
residuals2 <- rstudent(model2)
plot(y = residuals2, x = as.vector(time(ozone)), xlab = 'Time', ylab = 'Standardized Residuals', type='o')
abline(h = 0, col = "red")
plot(y = residuals2, x = fitted(model2), ylab = 'Standardized Residuals', xlab = 'Fitted Trend Line Values', type = 'p')
```

The following code chunk produces a Q-Q Plot and a Shapiro-Wilk test to check the normality assumption. Given the straight line pattern from a Q-Q Plot and the reported test statistics with $p$-value = 0.65, we fail to reject the $H_0$ that the stochastic component of quadratic model is normally distributed.

```{r}
qqnorm(residuals2)
qqline(residuals2, col = 2, lwd = 1, lty = 2)
shapiro.test(residuals2)
```

The following code generates the sample autocorrelation function for standardized residuals from the quadratic trend model of the Ozone series:

```{r}
acf(rstudent(model2), main = "ACF for standardised residuals")
```

As in the linear model case, after fitting a quadratic trend, some autocorreation remains in the residuals from this model. Lag 1 and lag 7 autocorrelations exceed two standard errors above zero, while lag 3, 4, 10 - below zero. After examining the residuals, we can infer that the stochastic component in our time series does not behave like white noise.

### Fitting a linear combination of trends

The combination of cosine and quadratic trends was attempted to fit the Ozone time series. To fit a harmonic component of the model to the Ozone time series, a new time series `ozone_new` was created. The frequency was roughly determined by looking at the time series plot of the data. Experimenting with different values showed that frequency of 8 gave the best results in terms of $R^2$. 

```{r}
ozone_new <- ts(as.vector(data1), frequency = 8)
t_new <- time(ozone_new)
t2_new <- t_new^2
har <- harmonic(ozone_new, 1)
model3 <- lm(ozone_new ~ har + t_new + t2_new)
summary(model3)
```

The estimated model is: $\mu_t = -0.48916 -0.27177cos(2\pi t) - 0.586972sin(2\pi t) + 0.36777t - 0.09544t^2$. Model summary reports that only the sine and quadratic terms are statistically significant at 5% level. According to the reported adjusted $R^2$, about 74.5% of the variation in the Ozone series is explained by a combination of cosine and quadratic time trends. 

The following code superimposes the estimated curve to our time series. The plot shows that this model captures the general trend as well as tries to roughly approximate the data.

```{r}
plot(ts(fitted(model3), start = 1927, end = 2016), ylim = range(c(fitted(model3), ozone)), ylab='Dobson Units', xlab = "Time", 
     main = "Fitted curve to Ozone series", type = "l",col = "red")
lines(ozone,type="o")
```

#### Residual analysis

Although not all coefficients of the combination trend were found significant, we still will proceed to analyse the standardised residuals to have a better understanding. From the plot of standardised residuals over time, we observe a very similar picture to the previous two models. Residuals are not random. When we plot standardised residuals against fitted values, we can see that the points are not evenly distributed in the rectangle, though changing variance is not obvious.

```{r}
residuals3 <- rstudent(model3)
plot(y = residuals3, x = as.vector(time(ozone)), xlab = 'Time', ylab = 'Standardized Residuals', type = 'o')
abline(h = 0, col = "red")
plot(y = residuals3, x = fitted(model3), ylab = 'Standardized Residuals', xlab = 'Fitted Trend Line Values', type = 'p')
```

The following code chunk generates a Q-Q Plot and a Shapiro-Wilk test to check the normality assumption. Given the straight line pattern from a Q-Q Plot and the reported test statistics with $p$-value = 0.49, we fail to reject the $H_0$ that the stochastic component of a combination model is normally distributed.

```{r}
qqnorm(residuals3)
qqline(residuals3, col = 2, lwd = 1, lty = 2)
shapiro.test(residuals3)
```

As in two previous cases, after fitting a combination trend, some autocorreation still remains in the residuals from this model. We have correlation values higher than the confidence bounds at several lags. After analysing various residual plots, we can infer that the stochastic component in our time series does not behave like white noise.

```{r}
acf(rstudent(model3), main = "ACF for standardised residuals")
```

## Forecasting with quadratic trend model

Although non of the fitted trend models are suitable for prediction purposes based on the residual analysis, predictions of yearly changes in the ozone layer thickness for the next 5 years were attempted. The model used for making predictions was the quadratic trend because all the terms were found statistically significant at 5% level unlike the combination trend, and it has an acceptable value of adjusted $R^2$.

The following code chunk makes predictions for next 5 time points. The last level of the Ozone time series is for year 2016. So we are predicting changes for years 2017-2021.

```{r}
pred <- seq(from = 2017, by = 1, length.out = 5) 
pred2 <- pred^2
new <- data.frame(t = pred, t2 = pred2)
forecasts <- predict(model2, newdata = new, interval = "prediction")
forecasts
```

Adding generated forecasts and their prediction intervals to the time series plot:

```{r}
plot(ozone, xlim = c(1927, 2022), ylim = c(-16, 5))
lines(ts(as.vector(forecasts[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecasts[,2]), start = 2017), col="blue", type="l")
lines(ts(as.vector(forecasts[,3]), start = 2017), col="blue", type="l")
legend("topright", lty=1, pch=1, col=c("black","blue","red"), text.width = 22, c("Data","5% forecast limits", "Forecasts"))
```

## Conclusion

To find the best fitting trend model to the Ozone time series, three most promising models were identified. They were: linear, quadratic and a combination of cosine and quadratic trends. All three models were estimated and fitted to the data using ordinary regression approach. The process was followed by analysing the behaviour of standardised residuals from these models. The residuals investigation showed that although all three models managed to capture the general trend in the series, there was always some autocorrelation remaining in the residuals. 

The quadratic trend was decided to be the "best fitting" in terms of significance of coefficients and its adjusted $R^2$ value. It was further used to make predictions of yearly changes in the ozone layer thickness for next 5 years. These forecasts cannot be considered correct and cannot be trusted.

To sum up, fitting a deterministic trend to the Ozone time series did not result into a quality model that can be used for further prediction. Fitting ARIMA models can be the direction for future investigation.



<br>
<br>