Introduction

In this report, we will investigate the ozone layer thickness changes that occurred from 1927 to 2016. The data given has yearly ozone layer thickness change in Dobson units. We will start with loading the data and analyzing the time series plot. In the first part, we will find the best fitting model and make predictions for the next five years. In the second part, we will apply different model specification tools to derive a set of possible ARIMA models.

Data Analysis

# Load the csv data
df<-read.csv("data1.csv", header = FALSE)
rownames(df) <- seq(from=1927, to=2016)
colnames(df) <- c("Ozone_thickness")

#Check the class of the df
class(df)
## [1] "data.frame"
# Convert data frame to the TS object
dfTs <- ts(as.vector(df), start=1927, end=2016)

# Plot the TS
plot(dfTs,type='o',ylab='Ozone Layer thickness', main="Time Series Plot of Ozone Layer Thickness")

Figure 1: Time Series Plot of Ozone Layer Thickness

  1. Trend - the plot has a downward tendency.

  2. Seasonality - the data is yearly data, no seasonality is present.

  3. Changing variance - there is a small change in variance present.

  4. Behaviour - The graph is fluctuating, but the points succeed each other, therefore mostly AR behaviour.

  5. Change point - there is no change point.

## [1] 0.8700381

The correlation between this year last year is high, which confirms that the time series is auto-regressive.

Figure 2: Correlation between consecutive years

We can observe an outlier around (-10, -5). This could be a changing point. We see on the previous graph that there was a drop in the ozone layer thickness in 1992.

Task I

In this part, we will find the best fitting model among the linear, quadratic, cosine, cyclical or seasonal trend models by implementing the model-building strategy. Afterwards, we will use the best fitting model to predict yearly Ozone layer thickness changes for the next five years.

Linear model

Model fitting

## 
## Call:
## lm(formula = dfTs ~ time(dfTs))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7165 -1.6687  0.0275  1.4726  4.7940 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 213.720155  16.257158   13.15   <2e-16 ***
## time(dfTs)   -0.110029   0.008245  -13.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6655 
## F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16
  • Hypothesis test for both coefficients was statistically significant as both p-values < 0.005.

    • Hypotheses for intercept were: 𝐻0 α = 0 𝐻1 α ≠ 0 The intercept value is 213.72 and can not be interpreted.
    • Hypotheses for slope were: 𝐻0 ẞ = 0 𝐻1 ẞ ≠ 0 The slope is -0.11 which indicates that the increment per year is negative.
  • Residual standard error is 2.032. It indicates the distance distribution of the actual data relative to the linear model.

  • The F-test for the linear regression had the following statistical hypotheses:

    • 𝐻0: The data do not fit the linear regression model
    • 𝐻1: The data fit the linear regression model The p-value < 0.001 therefore there is statistically significant evidence that the data fits a linear regression model.
  • Adjusted R-squared is 0.6655, thus this linear model accounts for 67% of variation in the data.

Figure 3: Linear model of ozone layer thickness change

The linear model is fitted to the data and captures the decreasing trend, but it doesn’t capture the fluctuations present.

Residuals analysis

For a well-fitted model, there should be no information left in residuals and the graph should resemble a white noise graph.

Figure 4: Linear model residuals

The plot looks similar to a white noise plot. The decreasing trend is not present anymore. Potentially, there is a quadratic trend.

Normality check

Histogram

Figure 5: Standardized residuals distribution

The distribution appears to be close to normal.

QQ Plot

Figure 6: Linear model residuals QQ plot

Standardised residuals are closely aligned with the line in the middle but deviate on the sides. The plot indicates that it is not a perfectly normal distribution. Further tests need to be performed.

Shapiro Wilk test
## 
##  Shapiro-Wilk normality test
## 
## data:  resMod
## W = 0.98733, p-value = 0.5372

Hypotheses are: - 𝐻0 Data is normally distributed - 𝐻1 Data is not normally distributed

This test was statistically insignificant, thus we confirm the normality of residuals.

Residuals autocorrelation

Figure 7: Autocorrelation function of linear model residuals

The bars are significant at 1, 7, 8 lags. There is still autocorrelation in residuals, therefore the trend in the series is not deterministic, but stochastic.

Quadratic model

## 
## Call:
## lm(formula = dfTs ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1062 -1.2846 -0.0055  1.3379  4.2325 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
## t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
## t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared:  0.7391, Adjusted R-squared:  0.7331 
## F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16
  • Hypothesis test for all the coefficients was statistically significant as all the p-values < 0.005, therefore all the coefficient contribute to the model.

  • Residual standard error is 1.815, which is less than for linear model. It indicates the data points are close to the quadratic model.

  • The F-test for the quadratic model was significant indicating that the data fits a quadratic model.

  • Adjusted R-squared is 0.7331, thus the quadratic model accounts for 73% of the variation in the data. This indicates that the quadratic model is a better fit than the linear model.

Figure 8: Quadratic model of ozone layer thickness change

We see that the quadratic model fits the trend better, but the data fluctuation are uncaptured.

Residuals analysis

Figure 9: Quadratic model residuals

The above plot is more random and resembles stationary white noise series, with the mean close to zero.

Normality check

Histogram

Figure 10: Quadratic model standardized residuals distribution

Although the distribution is left-skewed it is still close to normal.

QQ Plot

Figure 11: Quadratic model residuals QQ plot

Similar to the linear model the middle part closely follows the line, while there is a deviation from normality on the sides.

## 
##  Shapiro-Wilk normality test
## 
## data:  resModQ
## W = 0.98889, p-value = 0.6493

This test was once again statistically insignificant, thus we confirm the normality of residuals.

Residuals autocorrelation

Figure 12: Autocorrelation function of quadratic model residuals

There are several statistically significant lags in the model. There is autocorrelation in the residuals, which should not be present in the white noise series.

Harmonic model

## 
## Call:
## lm(formula = dfTs ~ har.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3520 -1.8905  0.4837  2.3643  6.4248 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.970e+00  4.790e-01  -6.199 1.79e-08 ***
## har.cos(2*pi*t)         NA         NA      NA       NA    
## har.sin(2*pi*t)  5.462e+11  7.105e+11   0.769    0.444    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.522 on 88 degrees of freedom
## Multiple R-squared:  0.006672,   Adjusted R-squared:  -0.004616 
## F-statistic: 0.5911 on 1 and 88 DF,  p-value: 0.4441

Both sin and cos coefficient is insignificant, therefore the harmonic model doesn’t fit the data. Additionally, the Adjusted R-squared shows that the model captures 0%. This is expected as the data doesn’t show seasonality.

Cyclicity check

Since the data is yearly, there is no seasonality present. However, we observe repeating fluctuation every few years. To check for cyclicity we plot the year number. We try multiple cycle frequencies: 5,6,7,8. However, none points at cyclicity in the time series.

Figure 13: Ozone layer thickness change cyclicity

Forecasting

Out of the three models we fitted to the data the quadratic model performed the best, thus we will use it for forecasting.

##         fit       lwr       upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701

Figure 14: Ozone layer thickness change forecast

The ozone layer model is predicted to decrease in the next 5 years. With the confidence interval, 95% Ozone layer thickness should be in the interval (-5, -15).

Result

Three models were fitted to the data - linear, harmonic and quadratic. Quadratic models performed the best capturing 74% of the variation in the data. However, the model’s residuals still have autocorrelation present indicating that the model doesn’t capture all the information. We have observed that the series is non-stationary with a clear decreasing trend. To make the series stationary we will use differencing in the next part.

Task II

In this section we will use model specification tools such as ACF-PACF, EACF, BIC table to propose a set of all possible ARIMA models for the ozone layer thickness data.

As we have seen in the previous section the data is non-stationary with a clear downward trend. Both AR and MA behaviour present resulting in succeeding points as well as fluctuation.

ACF and PACF plot

Figure 15: ACF and PACF plots before differencing

ACF shows a decaying pattern. This can be explained by a time series being non-stationary, thus we can not get the value of q from the ACF plot. PACF has 2 significant lags: 1, 4. Given the fourth lag is close to the limit p can be 1, 2, 3 or 4.

Augmented Dickey-Fuller test

To confirm if data is stationary or not we perform the Dickey-Fuller test. For this test hypothesis are: - H0 : time series is non-stationary; - H1 : time series is stationary.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  dfTs
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary

As expected, the p-value is greater than the significance level, therefore null hypothesis can not be rejected: the time series is non-stationary.

Box Cox transformation

Figure 16: Box Cox transformation

The optimal value of lambda is 1, therefore we proceed to differencing without applying any transformation.

Differencing

To make the time series stationary we take a difference.

Figure 17: First difference of Ozone layer thickness change

We repeat Dickey-Fuller Test to confirm that the data became stationary after the transformation.

## Warning in adf.test(dfDif): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  dfDif
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

As expected p-value < 0.05, therefore null hypothesis is rejected.

ACF and PACF

We perform ACF and PACF test on the stationary data.

Figure 18: ACF and PACF of the first difference

ACF has 3 significant lags: 3, 7, 10.

PACF has 3 significant lags: 3, 4, 6.

There is a number of possible models with ACF and PACF both having significant lags: - ARIMA(3,1,3) - ARIMA(3,1,7) - ARIMA(3,1,10) - ARIMA(4,1,3) - ARIMA(4,1,7) - ARIMA(4,1,10) - ARIMA(6,1,3) - ARIMA(6,1,7) - ARIMA(6,1,10)

Having 9 models from ACF PACF we will not consider models with q = 10, p = 6.

EACF

## AR/MA
##    0 1 2 3 4 5 6 7 8 9 10
## 0  o o x o o o x o o x o 
## 1  x o x o o o o o o x o 
## 2  x o x o o o x o o x o 
## 3  x o x o o x o o o o o 
## 4  x o o x o x o o o o o 
## 5  x x x x o x o o o o o 
## 6  o o o x x o o o o o o 
## 7  o o o x o o o o o o o 
## 8  x x o x o o o o o o o 
## 9  o x o x o o o o o o o 
## 10 x o x o x o o o o o o

Top left O is at p = 0, q = 0. However, it is interrupted by X, thus we select p = 0, q = 3 as the start point. Possible models are: - ARIMA(0,1,3), - ARIMA(0,1,4), - ARIMA(1,1,3), - ARIMA(1,1,4).

Figure 19: BIC

The above table has the highest BIC for p = 3 and q = 2. Thus possible models are ARIMA(3,1,2). The third row has p = 4. It is supported by EACF table and PACF plot, therefore we will add ARIMA(4,1,2) as a possible model.

Result

Based on all the applied model specification tools possible models are: - ARIMA(3,1,3); - ARIMA(3,1,7); - ARIMA(4,1,3); - ARIMA(4,1,7); - ARIMA(0,1,3); - ARIMA(0,1,4); - ARIMA(1,1,3); - ARIMA(1,1,4); - ARIMA(3,1,2); - ARIMA(4,1,2).

Conclusion

Ozone layer thickness data is a stochastic time series with a downwards trend, minimal variance, no change point and mixed MA/AR behaviour and no seasonality or cyclicity. The quadratic model appeared to be the best fit for the given time series. In the second part, we found a set of suitable ARIMA models. To choose the best model out of the set further testing needs to be performed.

References