# Load libraries and read the data
library(dplyr)
library(TSA)
library(magrittr)
Ozone <- read.csv('data1.csv', header =FALSE)
rownames(Ozone) <- seq(from=1927, to=2016)
head(Ozone)
Ozone <- ts(as.vector(Ozone), start = 1927, end=2016)#convert df to a ts object
plot(Ozone,type='o',ylab='Ozone Layer Thickness (DU)',
main="Time series plot of Ozone layer thickness series") #create a time series plot

- Trend : we do see a downward trend in the plot.
- Seasonality : cannot conclude at this stage.
- Behaviour: Appears to be Auto Regressive.
- Intervention: We may want to explore if something major happened around year 1992-93
- Changing Variance: more or less variance remains the same
I. Trying to fit a linear model
model_lm = lm(Ozone ~time(Ozone)) #define the linear regression model
plot(Ozone,type= 'o',ylab='y')
abline(model_lm) #add the fitted least square line to the model

summary(model_lm)
Call:
lm(formula = Ozone ~ time(Ozone))
Residuals:
Min 1Q Median 3Q Max
-4.7165 -1.6687 0.0275 1.4726 4.7940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 213.720155 16.257158 13.15 <2e-16 ***
time(Ozone) -0.110029 0.008245 -13.34 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.032 on 88 degrees of freedom
Multiple R-squared: 0.6693, Adjusted R-squared: 0.6655
F-statistic: 178.1 on 1 and 88 DF, p-value: < 2.2e-16
- The coefficient of both intercept and linear component is statistically significant.
- coefficient of determination R2 appears to be fine at 0.6655, not too high or too low.
Assessing the quality of the linear model
Residual Analysis:
residual_model_lm = rstudent(model_lm) #rstudent() computes standardised residual
plot(y=residual_model_lm, x=as.vector(time(Ozone)),xlab = 'Time', ylab='Standardized Residuals',type = 'o')

- There is somewhat departure from randomness in the above residual plot.
Next, we look at the standardized residuals versus the corresponding trend estimate, or fitted value
plot(y=rstudent(model_lm), x=fitted(model_lm),
xlab ="Fitted Trend Value", ylab ='Standardized Residuals')

- There is no obvious trend in the residuals, however we do see variation with different fitted trend values.
Check the Normality of residuals with a histogram.
hist(rstudent(model_lm), xlab = "Standardized Residuals",
main = "Histogram of Standardized Residuals of Model")

- The plot appears somewhat symmetric and tails off at the low and high end reciprocating the behaviour of a normal distribution.
Check normality assumption with the quantile-quantile (QQ) plot
z = rstudent(model_lm)
qqnorm(z) #produce a qq plot of the residuals
qqline(z , col = 3 , lwd = 1)

- For a normaly distributed data the QQ plot aprrox like a straight line but in the above plot, the line doesnt capture all the points. The straight-line pattern here doesnt fully captures the assumption of normally distributed stochastic component in the model.
Hypothesis test: Shapiro-Wilk test is used to check the normality assumption of the stochastic component
s = rstudent(model_lm)
shapiro.test(s)
Shapiro-Wilk normality test
data: s
W = 0.98733, p-value = 0.5372
- We get the p-value of 0.5372.Hence, we fail to reject the null hypothesis i.e. the stochastic component of this model is normally distributed.
ACF(Autocorrelation function) for analysis of Time Series
##Find the significance autorcorrelation in standardised Residuals
acf(rstudent(model_lm), main="ACF of standardized residuals")

- There is some trend and some autrocorrelation left in the Residuals as deterministic model cannot completely capture it.
II. Trying to fit a Quadratic model
t=time(Ozone) #linear component
t2=t^2 #take square of time for the quadratic component
model_quad = lm(Ozone ~ t+t2) #create a quadratic model
plot(ts(fitted(model_quad)),ylim = c(min(c(fitted(model_quad),as.vector(Ozone))),
max(c(fitted(model_quad),as.vector(Ozone)))),ylab='y' ,
main = "Fitted quadratic curve to Ozone data")
lines(as.vector(Ozone),type='o')

summary(model_quad)
Call:
lm(formula = Ozone ~ t + t2)
Residuals:
Min 1Q Median 3Q Max
-5.1062 -1.2846 -0.0055 1.3379 4.2325
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.733e+03 1.232e+03 -4.654 1.16e-05 ***
t 5.924e+00 1.250e+00 4.739 8.30e-06 ***
t2 -1.530e-03 3.170e-04 -4.827 5.87e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.815 on 87 degrees of freedom
Multiple R-squared: 0.7391, Adjusted R-squared: 0.7331
F-statistic: 123.3 on 2 and 87 DF, p-value: < 2.2e-16
- The coefficient of quadratic component is statistically significant.
- coefficient of determination R2 (0.7331) is better than that of the linear model(0.6655).
Assessing the quality of the Quadratic model
Residual Analysis:
residual_model_qd = rstudent(model_quad)
plot(y=residual_model_qd, x=as.vector(time(Ozone)),xlab = 'Time', ylab='Standardized Residuals',type='o')

- There is departure from randomness in the above residual plot. The plot is smooth at multiple place, so it cant be concluded as a white noise.
Next, we look at the standardized residuals versus the corresponding trend estimate, or fitted value
plot(y=rstudent(model_quad), x=fitted(model_quad),
xlab ="Fitted Trend Value", ylab ='Standardized Residuals')

- We do see variation with different fitted trend values.
Check the Normality of residuals with a histogram
hist(rstudent(model_quad), xlab = "Standardized Residuals",
main = "Histogram of Standardized Residuals of Model")

- The plot is not symmetric and so the residuals doesnt fulfill the normality assumption.
Check normality assumption with the quantile-quantile (QQ) plot
q = rstudent(model_quad)
qqnorm(q)
qqline(q , col = 2 , lwd = 1)

- For a normaly distributed data the QQ plot aprrox like a straight line. In the above plot, the line almost captures all the points, better than a linear model. The straight-line pattern here does supports the assumption of normally distributed stochastic component in the model.
Hypothesis test: Shapiro-Wilk test is used to check the normality assumption of the stochastic component
s=rstudent(model_quad)
shapiro.test(s)
Shapiro-Wilk normality test
data: s
W = 0.98889, p-value = 0.6493
- We get the p-value of 0.6493. Hence, we fail to reject the null hypothesis i.e. the stochastic component of this model is normally distributed.
ACF(Autocorrelation function) for analysis of Time Series
##Find the significance autorcorrelation in standardised Residuals
acf(rstudent(model_quad), main="ACF of standardized residuals")

Forecasting
t = c(2017,2018,2019,2020,2021) # create a time vector for next five years
t2 = t^2
new = data.frame(t,t2) #create a new data frame with t and t2
forecasts = predict(model_quad,new, interval = "prediction")
plot(Ozone,ylab= "Ozone Layer Thickness(DU)",ylim = c(-15,3),xlim=c(1927,2021),main="Time series plot of Ozone layer thickness series")
# convert forecasts to time series object starting from the first
# time steps-ahead to be able to use plot function
lines(ts(as.vector(forecasts[,1]), start = 2016), col="red", type="l",lwd=2)
lines(ts(as.vector(forecasts[,2]), start = 2016), col="blue", type="l",lwd=2)
lines(ts(as.vector(forecasts[,3]), start = 2016), col="blue", type="l",lwd=2)
legend("bottomleft", lty=1, pch=1,bty = "n", col=c("black","blue","red"), text.width = 18,
c("Data","5% forecast limit", "Forecasts"))

- The quadratic model fitted to original ozone layer thickness series showing forecast for another five years(2017 to 2021).
Conclusion
- A linear and qudratic model was analysed for suitability for the provided data set.
- Neither models appears to be the best fit for the given stochastic trend.
- According to multiple R2 74% of the variation in the Ozone layer thickness data is explained by the quadratic model against 67% by the linear model.
- Also, model fitting using the quadratic model captured the trend better than the linear model.
- Hence, quadratic model was selected and used for forecasting.
---
title: "MATH1318 Semester 1, 2019"
author: "Ahmad Hasnain(S3712538)"
subtitle: Forecasting Ozone Layer thicknes based on data from 1927 to 2016
output:
  html_notebook: default
  pdf_document: default
---
```{r, fig.width=10, fig.height=7}
# Load libraries and read the data 
library(dplyr)
library(TSA)
library(magrittr)
Ozone <- read.csv('data1.csv', header =FALSE)
rownames(Ozone) <- seq(from=1927, to=2016)
head(Ozone)
Ozone <- ts(as.vector(Ozone), start = 1927, end=2016)#convert df to a ts object
plot(Ozone,type='o',ylab='Ozone Layer Thickness (DU)',
main="Time series plot of Ozone layer thickness series") #create a time series plot

```

*  Observations:
1. Trend :  we do see a downward trend in the plot. 
2. Seasonality :  cannot conclude at this stage.
3. Behaviour: Appears to be Auto Regressive. 
4. Intervention: We may want to explore if something major happened around year 1992-93
5. Changing Variance: more or less variance remains the same 

## I. Trying to fit a linear model
```{r}
model_lm = lm(Ozone ~time(Ozone)) #define the linear regression model
plot(Ozone,type= 'o',ylab='y')
abline(model_lm) #add the fitted least square line to the model
summary(model_lm) 
```
* Observations:
1. The coefficient of both intercept and linear component is statistically significant.
2. *coefficient of determination* R^2^ appears to be fine at 0.6655, not too high or too low. 

### Assessing the quality of the linear model 
##### Residual Analysis: 

```{r}
residual_model_lm = rstudent(model_lm) #rstudent() computes standardised residual
plot(y=residual_model_lm, x=as.vector(time(Ozone)),xlab = 'Time', ylab='Standardized Residuals',type = 'o')
```
* There is somewhat departure from randomness in the above residual plot. 

##### Next, we look at the standardized residuals versus the corresponding trend estimate, or fitted value 

```{r}
plot(y=rstudent(model_lm), x=fitted(model_lm),
xlab ="Fitted Trend Value", ylab ='Standardized Residuals')
```

* There is no obvious trend in the residuals, however we do see variation with different fitted trend values. 

### Check the Normality of residuals with a histogram.

```{r}
hist(rstudent(model_lm), xlab = "Standardized Residuals", 
main = "Histogram of Standardized Residuals of Model")
```
* The plot appears somewhat symmetric and tails off at the low and high end reciprocating the behaviour of a normal distribution.  

### Check normality assumption with the quantile-quantile (QQ) plot

```{r}
z = rstudent(model_lm)
qqnorm(z) #produce a qq plot of the residuals 
qqline(z , col = 3 , lwd = 1) 
```

* For a normaly distributed data the QQ plot aprrox like a straight line but in the above plot, the line doesnt capture all the points. The straight-line pattern here doesnt fully captures the assumption of normally distributed stochastic component in the model. 


### Hypothesis test: *Shapiro-Wilk* test is used to check the normality assumption of the stochastic component 

```{r}
s = rstudent(model_lm)
shapiro.test(s)
```
* We get the p-value of 0.5372.Hence, we fail to reject the null hypothesis i.e.
the stochastic component of this model is normally distributed.

### ACF(Autocorrelation function) for analysis of Time Series
```{r}
##Find the significance autorcorrelation in standardised Residuals
acf(rstudent(model_lm), main="ACF of standardized residuals")
```
* There is some trend and some autrocorrelation left in the Residuals as deterministic model cannot completely capture it.

## II. Trying to fit a Quadratic model

```{r}
t=time(Ozone) #linear component 
t2=t^2 #take square of time for the quadratic component 
model_quad = lm(Ozone ~ t+t2) #create a quadratic model 

plot(ts(fitted(model_quad)),ylim = c(min(c(fitted(model_quad),as.vector(Ozone))),
max(c(fitted(model_quad),as.vector(Ozone)))),ylab='y' ,
main = "Fitted quadratic curve to Ozone data")
lines(as.vector(Ozone),type='o')

summary(model_quad)
```
* Observations:
1. The coefficient of quadratic component is statistically significant.
2. *coefficient of determination* R^2^ (0.7331) is better than that of the linear model(0.6655).

### Assessing the quality of the Quadratic model 
##### Residual Analysis: 
```{r}
residual_model_qd = rstudent(model_quad)
plot(y=residual_model_qd, x=as.vector(time(Ozone)),xlab = 'Time', ylab='Standardized Residuals',type='o')
```
* There is departure from randomness in the above residual plot. The plot is smooth at multiple place, so it cant be concluded as a white noise. 

##### Next, we look at the standardized residuals versus the corresponding trend estimate, or fitted value 

```{r}
plot(y=rstudent(model_quad), x=fitted(model_quad),
xlab ="Fitted Trend Value", ylab ='Standardized Residuals')
```
*  We do see variation with different fitted trend values. 

### Check the Normality of residuals with a histogram

```{r}
hist(rstudent(model_quad), xlab = "Standardized Residuals", 
main = "Histogram of Standardized Residuals of Model")
```

* The plot is not symmetric and so the residuals doesnt fulfill the normality assumption.


### Check normality assumption with the quantile-quantile (QQ) plot
```{r}
q = rstudent(model_quad)
qqnorm(q)
qqline(q , col = 2 , lwd = 1)
```
* For a normaly distributed data the QQ plot aprrox like a straight line. In the above plot, the line almost captures all the points, better than a linear model. The straight-line pattern here does supports the assumption of normally distributed stochastic component in the model. 


### Hypothesis test: *Shapiro-Wilk* test is used to check the normality assumption of the stochastic component 

```{r}
s=rstudent(model_quad)
shapiro.test(s)
```
* We get the p-value of 0.6493. Hence, we fail to reject the null hypothesis i.e. the stochastic component of this model is normally distributed.

### ACF(Autocorrelation function) for analysis of Time Series
```{r}
##Find the significance autorcorrelation in standardised Residuals
acf(rstudent(model_quad), main="ACF of standardized residuals")
```
* We have correlation values higher than the confidence bound at several lags. This doesnt appear to be a white noise. 

* __Conclusion__: 
    + Neither linear or quadratic models seems to be the best fit for the given   data as it appears to be a stochastic trend. 
    + Based on the model fitting and R^2^ values quadratic model seems to be a better fit for forecasting for the given data. 

###Forecasting 

```{r}
t = c(2017,2018,2019,2020,2021) # create a time vector for next five years 
t2 = t^2 
new = data.frame(t,t2) #create a new data frame with t and t2 
forecasts = predict(model_quad,new, interval = "prediction")
plot(Ozone,ylab= "Ozone Layer Thickness(DU)",ylim = c(-15,3),xlim=c(1927,2021),main="Time series plot of Ozone layer thickness series")
# convert forecasts to time series object starting from the first 
# time steps-ahead to be able to use plot function
lines(ts(as.vector(forecasts[,1]), start = 2016), col="red", type="l",lwd=2)
lines(ts(as.vector(forecasts[,2]), start = 2016), col="blue", type="l",lwd=2)
lines(ts(as.vector(forecasts[,3]), start = 2016), col="blue", type="l",lwd=2)
legend("bottomleft", lty=1, pch=1,bty = "n", col=c("black","blue","red"), text.width = 18,
       c("Data","5% forecast limit", "Forecasts"))
```
* The quadratic model fitted to original ozone layer thickness series showing forecast for another five years(2017 to 2021).

###Conclusion

- A linear and qudratic model was analysed for suitability for the provided data set. 
- Neither models appears to be the best fit for the given stochastic trend. 
- According to multiple  R^2^ 74% of the variation in the Ozone layer thickness data is explained by the quadratic model against 67% by the linear model. 
- Also, model fitting using the quadratic model captured the trend better than the linear model. 
- Hence, quadratic model was selected and used for forecasting. 


###### End of report 
<br>
<br>
