Introduction

An Ozone is a naturally occuring molecule that is comprised of three oxygen atoms, with the chemical formula of O3. The Ozone layer on the other hand is a term commonly used to refer to the high concentration of ozone that is situated in the stratosphere - between 15km to 30km above the surface of the earth across the globe. It also protects life on earth by absorbing harmful ultraviolet-B (UV-B) radiation from the sun. Prolonged exposure to UV-B radiation can cause skin cancer, immune system suppression, and genetic damage to humans and animals while also causing a lower yield of agricultural crops.

This report aims to explore and model the thickness of the Ozone layer by applying Time-Series analysis techniques onto the dataset in order to find and fit the best model to predict the changes in the thickness of the Ozone layer in the future. This can be useful information to raise awareness about the importance of the ozone layer, as well as the current and potential future state of the Ozone layer. This should stand as a stark reminder to every living creature on this earth - that there is only ONE Ozone layer and it needs to be protected; there will be no second chances.

Scope

The scope of this report consists of two approaches; a regression approach and a time-series approach.

Library

The libraries that will be used for this exploration will be “dplyr”, “readr”, “TSA”, and “tseries”.

Dataset Setup

The dataset shows the yearly changes in the thickness of Ozone layer from 1927 to 2016 in Dobson units. A negative value in the dataset represents a decrease in the thickness and a positive value represents an increase in the thickness.

The graph shows a generally downward trend. There is some seasonality in the graph. The graph displays little to no changing variance. The graph mostly reflects a Moving Average (MA), with certain instances of Autoregressive points (AR). Lastly, the graph shows no impactful change points.

Regression Approach

Fitting a Linear model

While the linear model captures the direction of the trend, the start and end datapoints are constantly under the line.

## 
## Call:
## lm(formula = ozoneTS ~ time(ozoneTS))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7165 -1.6687  0.0275  1.4726  4.7940 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   213.720155  16.257158   13.15   <2e-16 ***
## time(ozoneTS)  -0.110029   0.008245  -13.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6655 
## F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16

Residual Analysis for Linear model

The standardised residuals graph shows that there is still a trend within the residuals. The histogram shows that the model is a relatively good fit with some data falling outside the normal curve. The QQ plot shows that the data fits the normal line relatively well in the middle but trails off at the beginning and end points. The ACF plot shows that there are correlational values in the first lag. The Shapiro-Wilk test failed to reject the null hypothesis that the residuals are normally distributed. Hence, the linear trend model is not necessarily a good model for this dataset.

## 
##  Shapiro-Wilk normality test
## 
## data:  res.linearModel
## W = 0.98733, p-value = 0.5372

Fitting a Quadratic model

The quadratic model graph shows that it has a much better fit than the linear model. The increased R-squared value from figure 3.0 also reinforces this statement.

## 
## Call:
## lm(formula = ozoneTS ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1062 -1.2846 -0.0055  1.3379  4.2325 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
## t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
## t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared:  0.7391, Adjusted R-squared:  0.7331 
## F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16

Residual Analysis for Quadratic model

The standardised residuals graph shows that the mean sits close to 0 while the variance seems constant. The histogram shows good symmetry but with some data falling outside the normal curve. The QQ plot shows that the data fits the normal line better than the linear model as the beginning and end points of the linear model drifts further away from the normal line than the quadratic model. The ACF plot shows significant correlation at lag 1, lag 3 and lag 4 – this indicates possible trends within the residuals. The Shapiro-Wilk test failed to reject the null hypothesis that the residuals are normally distributed. Hence, the quadratic model is a better fit than the linear model.

## 
##  Shapiro-Wilk normality test
## 
## data:  res.quadModel
## W = 0.98889, p-value = 0.6493

Fitting a Cyclical model

The cyclical model graph does not seem to fit the data well. It is akin to that of a linear model, except is has fluctuations on its graph. It mainly captures a few points in the centre of the dataset, but completely misses the beginning and end points. Both figure 4.1 and 4.2 show signs that this model is not a suitable fit for this data.

## 
## Call:
## lm(formula = ozoneCyclic ~ ozCyclicSeason - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4755 -1.6640  0.5134  2.3864  6.5111 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## ozCyclicSeasonSeason-1  -2.8943     0.9318  -3.106 0.002585 ** 
## ozCyclicSeasonSeason-2  -3.1722     0.9318  -3.404 0.001019 ** 
## ozCyclicSeasonSeason-3  -3.6586     0.9318  -3.926 0.000176 ***
## ozCyclicSeasonSeason-4  -3.1478     0.9318  -3.378 0.001108 ** 
## ozCyclicSeasonSeason-5  -3.1039     0.9318  -3.331 0.001287 ** 
## ozCyclicSeasonSeason-6  -3.2367     0.9318  -3.474 0.000814 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.609 on 84 degrees of freedom
## Multiple R-squared:  0.4589, Adjusted R-squared:  0.4202 
## F-statistic: 11.87 on 6 and 84 DF,  p-value: 1.325e-09

## 
## Call:
## lm(formula = ozoneCyclic ~ season(ozoneCyclic) - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4755 -1.6640  0.5134  2.3864  6.5111 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## season(ozoneCyclic)Season-1  -2.8943     0.9318  -3.106 0.002585 ** 
## season(ozoneCyclic)Season-2  -3.1722     0.9318  -3.404 0.001019 ** 
## season(ozoneCyclic)Season-3  -3.6586     0.9318  -3.926 0.000176 ***
## season(ozoneCyclic)Season-4  -3.1478     0.9318  -3.378 0.001108 ** 
## season(ozoneCyclic)Season-5  -3.1039     0.9318  -3.331 0.001287 ** 
## season(ozoneCyclic)Season-6  -3.2367     0.9318  -3.474 0.000814 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.609 on 84 degrees of freedom
## Multiple R-squared:  0.4589, Adjusted R-squared:  0.4202 
## F-statistic: 11.87 on 6 and 84 DF,  p-value: 1.325e-09

Residual Analysis for Cyclical model

From the residuals, the histogram looks to be skewed, the QQ plot shows a worse fit than both the linear and quadratic model and the Shapiro-Wilk test with a p-value of 0.002206 rejects normality. With this, the quadratic model is still a better performing model.

## 
##  Shapiro-Wilk normality test
## 
## data:  res.cycModel
## W = 0.95486, p-value = 0.003382

Fitting a Cosine model

The cosine trend model graph provides an improved fit to the data compared to the cyclical model. It captures more of the data, but the beginning and end points are still missed by this model.

## 
## Call:
## lm(formula = ozoneCyclic ~ har. + t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4075 -1.2271  0.0685  1.3942  4.5511 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.49707    0.72907  -0.682    0.497    
## har.cos(2*pi*t)  0.03759    0.27139   0.139    0.890    
## har.sin(2*pi*t) -0.33543    0.27159  -1.235    0.220    
## t                0.26535    0.19766   1.342    0.183    
## t2              -0.05513    0.01144  -4.818 6.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.82 on 85 degrees of freedom
## Multiple R-squared:  0.7438, Adjusted R-squared:  0.7317 
## F-statistic: 61.69 on 4 and 85 DF,  p-value: < 2.2e-16

Residual Analysis for Cosine model

From the residuals, the histogram seems symmetrical and the QQ plot seems to fit the normal line better than the cyclical model, but not as well as the quadratic model. The Shapiro-Wilk test presents a p-value of 0.602 thus not rejecting the hypothesis of normality. As the coefficients of this model are not significant, this is still a less superior model than the quadratic model thus far.

## 
##  Shapiro-Wilk normality test
## 
## data:  res.cosModel
## W = 0.9885, p-value = 0.6206

Fitting a combined Cyclical and Quadratic model

The combined trend model shows to be the best fit thus far, as it has the shape similar to that of the cosine graph, but it conforms more to the shape of the data.

## 
## Call:
## lm(formula = cycQuad ~ season(cycQuad) - 1 + t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3458 -1.3076  0.0912  1.3598  4.7563 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## season(cycQuad)Season-1 -5.736e+03  1.255e+03  -4.571 1.70e-05 ***
## season(cycQuad)Season-2 -5.736e+03  1.255e+03  -4.571 1.70e-05 ***
## season(cycQuad)Season-3 -5.737e+03  1.255e+03  -4.572 1.69e-05 ***
## season(cycQuad)Season-4 -5.736e+03  1.255e+03  -4.571 1.70e-05 ***
## season(cycQuad)Season-5 -5.736e+03  1.255e+03  -4.571 1.70e-05 ***
## season(cycQuad)Season-6 -5.736e+03  1.255e+03  -4.571 1.70e-05 ***
## t                        5.927e+00  1.273e+00   4.656 1.23e-05 ***
## t2                      -1.531e-03  3.229e-04  -4.742 8.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.849 on 82 degrees of freedom
## Multiple R-squared:  0.8614, Adjusted R-squared:  0.8479 
## F-statistic:  63.7 on 8 and 82 DF,  p-value: < 2.2e-16

Residual Analysis for Combined model

From the residuals, the histogram looks relatively normally distributed but the QQ plot seems to deviate slightly from the normal line. However, it must be said that the QQ plot for this model performs better than the QQ plot of the cosine or cyclical models. Furthermore, the Shapiro-Wilk test presents a p-value of 0.705 thus not rejecting the hypothesis of normality.

## 
##  Shapiro-Wilk normality test
## 
## data:  res.cycQuadModel
## W = 0.98964, p-value = 0.705

Models Comparison

In order to select the best model among all the models fitted thus far, an AIC and BIC likelihood criterions would be applied, and the model with the lowest value will be selected.

##              df      BIC
## linearModel   3 394.5231
## quadModel     4 377.6635
## cycModel      7 511.7097
## cosModel      6 385.0420
## cycQuadModel  9 398.1314
##              df      AIC
## linearModel   3 387.0237
## quadModel     4 367.6643
## cycModel      7 494.2110
## cosModel      6 370.0431
## cycQuadModel  9 375.6332

Prediction for the next 5 years

Among all the models that were fitted onto the dataset, the quadratic model seems to be the best model for this prediction. While the combined model provided a better multiple R-squared value and well distributed residuals but based on the results of the AIC and BIC performed in section 2.6 above and the principle of parsimony – or Occam’s razor, the quadratic model is simpler and therefore should be the model to be selected for prediction.

##    Year Prediction   LowerCI   UpperCI
## 91 2017  -10.34387 -14.13556 -6.552180
## 92 2018  -10.59469 -14.40282 -6.786548
## 93 2019  -10.84856 -14.67434 -7.022786
## 94 2020  -11.10550 -14.95015 -7.260851
## 95 2021  -11.36550 -15.23030 -7.500701

Part 2

Time-series ACF and PACF plot 1

The gradual decrease in the ACF plot along with a high initial PACF point confirms the existence of a trend within the dataset. Additionally, the pattern generated by the ACF plot alludes to the presence of seasonality within the dataset.

Check for stationarity part 1

The Augmented Dickey-Fuller (ADF) Test resulted with a p-value of 0.0867. This p-value is greater than 0.05 which means that it fails to reject the null hypothesis that the data is non-stationary.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ozoneTS
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary

Box Cox Transformation

From the graph, the confidence interval of lambda includes 1 – more specifically, the range of lambda is between 0.9 to 1.3. With that, it is safe to assume lambda to be 1 and that it is unnecessary to perform a power or log transformation onto the dataset.

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## Warning in arima0(x, order = c(i, 0L, 0L), include.mean = demean): possible
## convergence problem: optim gave code = 1

## [1] 0.9 1.3

Differencing to remove trend

From the plotted first difference, the data appears to have stabilised. In order to test for this, a check for stationarity should be performed next.

Check for stationarity part 2

In order to check for stationarity again, the ADF test was utilized once more. This time, it resulted with a p-value of less than 0.01. This p-value is less than 0.05, and therefore rejects the null hypothesis of non-stationarity and confirm statistically significant evidence that the time-series is stationary after the first differencing.

## Warning in adf.test(ozoneDiff): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ozoneDiff
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

Time-series ACF and PACF plot part 2

An oscillating pattern can be noticed in the ACF plot, along with one significant lag and one insignificant lag. The PACF plot contains 3 significant lags – thus the following ARIMA models can be proposed: - ARIMA (3,1,1) - ARIMA (3,1,2) - ARIMA (4,1,1) - ARIMA (4,1,2)

EACF plot

The top left vertex for the EACF model is p = 0 and q = 3. With this, an MA(3) model is a possibility. Along with the surrounding models, the potential models according to the EACF would be: - ARIMA (0,1,3) - ARIMA (0,1,4) - ARIMA (1,1,3) - ARIMA (1,1,4)

## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x o o o x o o x o  o  o  o 
## 1 x o x o o o o o o x o  o  o  o 
## 2 x o x o o o x o o x o  o  o  o 
## 3 x o x o o x o o o o o  o  o  o 
## 4 x o o x o x o o o o o  o  o  o 
## 5 x x x x o x o o o o o  o  o  o 
## 6 o o o x x o o o o o o  o  o  o 
## 7 o o o x o o o o o o o  o  o  o

Bayesian Information Criterion table

From the BIC table, the shaded columns corresponds to AR(3) and MA(2). While MA(2) isn’t fully supported, it can still be placed into consideration as it is the best model. The BIC table allows the proposition of ARIMA (3,1,2).