In this report, we will investigate the ozone layer thickness changes that occurred from 1927 to 2016. The data given has yearly ozone layer thickness change in Dobson units. We will start with loading the data and analyzing the time series plot. In the first part, we will find the best fitting model and make predictions for the next five years. In the second part, we will apply different model specification tools to derive a set of possible ARIMA models.
# Load the csv data
df<-read.csv("data1.csv", header = FALSE)
rownames(df) <- seq(from=1927, to=2016)
colnames(df) <- c("Ozone_thickness")
#Check the class of the df
class(df)
## [1] "data.frame"
# Convert data frame to the TS object
dfTs <- ts(as.vector(df), start=1927, end=2016)
# Plot the TS
plot(dfTs,type='o',ylab='Ozone Layer thickness', main="Time Series Plot of Ozone Layer Thickness")
Figure 1: Time Series Plot of Ozone Layer Thickness
Trend - the plot has a downward tendency.
Seasonality - the data is yearly data, no seasonality is present.
Changing variance - there is a small change in variance present.
Behaviour - The graph is fluctuating, but the points succeed each other, therefore mostly AR behaviour.
Change point - there is no change point.
## [1] 0.8700381
The correlation between this year last year is high, which confirms that the time series is auto-regressive.
Figure 2: Correlation between consecutive years
We can observe an outlier around (-10, -5). This could be a changing point. We see on the previous graph that there was a drop in the ozone layer thickness in 1992.
In this part, we will find the best fitting model among the linear, quadratic, cosine, cyclical or seasonal trend models by implementing the model-building strategy. Afterwards, we will use the best fitting model to predict yearly Ozone layer thickness changes for the next five years.
##
## Call:
## lm(formula = dfTs ~ time(dfTs))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7165 -1.6687 0.0275 1.4726 4.7940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.720155 16.257158 13.15 <2e-16 ***
## time(dfTs) -0.110029 0.008245 -13.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared: 0.6693, Adjusted R-squared: 0.6655
## F-statistic: 178.1 on 1 and 88 DF, p-value: < 2.2e-16
Hypothesis test for both coefficients was statistically significant as both p-values < 0.005.
Residual standard error is 2.032. It indicates the distance distribution of the actual data relative to the linear model.
The F-test for the linear regression had the following statistical hypotheses:
Adjusted R-squared is 0.6655, thus this linear model accounts for 67% of variation in the data.
Figure 3: Linear model of ozone layer thickness change
The linear model is fitted to the data and captures the decreasing trend, but it doesn’t capture the fluctuations present.
For a well-fitted model, there should be no information left in residuals and the graph should resemble a white noise graph.
Figure 4: Linear model residuals
The plot looks similar to a white noise plot. The decreasing trend is not present anymore. Potentially, there is a quadratic trend.
Figure 5: Standardized residuals distribution
The distribution appears to be close to normal.
Figure 6: Linear model residuals QQ plot
Standardised residuals are closely aligned with the line in the middle but deviate on the sides. The plot indicates that it is not a perfectly normal distribution. Further tests need to be performed.
##
## Shapiro-Wilk normality test
##
## data: resMod
## W = 0.98733, p-value = 0.5372
Hypotheses are: - 𝐻0 Data is normally distributed - 𝐻1 Data is not normally distributed
This test was statistically insignificant, thus we confirm the normality of residuals.
Figure 7: Autocorrelation function of linear model residuals
The bars are significant at 1, 7, 8 lags. There is still autocorrelation in residuals, therefore the trend in the series is not deterministic, but stochastic.
##
## Call:
## lm(formula = dfTs ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1062 -1.2846 -0.0055 1.3379 4.2325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.733e+03 1.232e+03 -4.654 1.16e-05 ***
## t 5.924e+00 1.250e+00 4.739 8.30e-06 ***
## t2 -1.530e-03 3.170e-04 -4.827 5.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared: 0.7391, Adjusted R-squared: 0.7331
## F-statistic: 123.3 on 2 and 87 DF, p-value: < 2.2e-16
Hypothesis test for all the coefficients was statistically significant as all the p-values < 0.005, therefore all the coefficient contribute to the model.
Residual standard error is 1.815, which is less than for linear model. It indicates the data points are close to the quadratic model.
The F-test for the quadratic model was significant indicating that the data fits a quadratic model.
Adjusted R-squared is 0.7331, thus the quadratic model accounts for 73% of the variation in the data. This indicates that the quadratic model is a better fit than the linear model.
Figure 8: Quadratic model of ozone layer thickness change
We see that the quadratic model fits the trend better, but the data fluctuation are uncaptured.
Figure 9: Quadratic model residuals
The above plot is more random and resembles stationary white noise series, with the mean close to zero.
Figure 10: Quadratic model standardized residuals distribution
Although the distribution is left-skewed it is still close to normal.
Figure 11: Quadratic model residuals QQ plot
Similar to the linear model the middle part closely follows the line, while there is a deviation from normality on the sides.
##
## Shapiro-Wilk normality test
##
## data: resModQ
## W = 0.98889, p-value = 0.6493
This test was once again statistically insignificant, thus we confirm the normality of residuals.
Figure 12: Autocorrelation function of quadratic model residuals
There are several statistically significant lags in the model. There is autocorrelation in the residuals, which should not be present in the white noise series.
##
## Call:
## lm(formula = dfTs ~ har.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3520 -1.8905 0.4837 2.3643 6.4248
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.970e+00 4.790e-01 -6.199 1.79e-08 ***
## har.cos(2*pi*t) NA NA NA NA
## har.sin(2*pi*t) 5.462e+11 7.105e+11 0.769 0.444
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.522 on 88 degrees of freedom
## Multiple R-squared: 0.006672, Adjusted R-squared: -0.004616
## F-statistic: 0.5911 on 1 and 88 DF, p-value: 0.4441
Both sin and cos coefficient is insignificant, therefore the harmonic model doesn’t fit the data. Additionally, the Adjusted R-squared shows that the model captures 0%. This is expected as the data doesn’t show seasonality.
Since the data is yearly, there is no seasonality present. However, we observe repeating fluctuation every few years. To check for cyclicity we plot the year number. We try multiple cycle frequencies: 5,6,7,8. However, none points at cyclicity in the time series.
Figure 13: Ozone layer thickness change cyclicity
Out of the three models we fitted to the data the quadratic model performed the best, thus we will use it for forecasting.
## fit lwr upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701
Figure 14: Ozone layer thickness change forecast
The ozone layer model is predicted to decrease in the next 5 years. With the confidence interval, 95% Ozone layer thickness should be in the interval (-5, -15).
Three models were fitted to the data - linear, harmonic and quadratic. Quadratic models performed the best capturing 74% of the variation in the data. However, the model’s residuals still have autocorrelation present indicating that the model doesn’t capture all the information. We have observed that the series is non-stationary with a clear decreasing trend. To make the series stationary we will use differencing in the next part.
In this section we will use model specification tools such as ACF-PACF, EACF, BIC table to propose a set of all possible ARIMA models for the ozone layer thickness data.
As we have seen in the previous section the data is non-stationary with a clear downward trend. Both AR and MA behaviour present resulting in succeeding points as well as fluctuation.
Figure 15: ACF and PACF plots before differencing
ACF shows a decaying pattern. This can be explained by a time series being non-stationary, thus we can not get the value of q from the ACF plot. PACF has 2 significant lags: 1, 4. Given the fourth lag is close to the limit p can be 1, 2, 3 or 4.
To confirm if data is stationary or not we perform the Dickey-Fuller test. For this test hypothesis are: - H0 : time series is non-stationary; - H1 : time series is stationary.
##
## Augmented Dickey-Fuller Test
##
## data: dfTs
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary
As expected, the p-value is greater than the significance level, therefore null hypothesis can not be rejected: the time series is non-stationary.
Figure 16: Box Cox transformation
The optimal value of lambda is 1, therefore we proceed to differencing without applying any transformation.
To make the time series stationary we take a difference.
Figure 17: First difference of Ozone layer thickness change
We repeat Dickey-Fuller Test to confirm that the data became stationary after the transformation.
## Warning in adf.test(dfDif): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dfDif
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
As expected p-value < 0.05, therefore null hypothesis is rejected.
We perform ACF and PACF test on the stationary data.
Figure 18: ACF and PACF of the first difference
ACF has 3 significant lags: 3, 7, 10.
PACF has 3 significant lags: 3, 4, 6.
There is a number of possible models with ACF and PACF both having significant lags: - ARIMA(3,1,3) - ARIMA(3,1,7) - ARIMA(3,1,10) - ARIMA(4,1,3) - ARIMA(4,1,7) - ARIMA(4,1,10) - ARIMA(6,1,3) - ARIMA(6,1,7) - ARIMA(6,1,10)
Having 9 models from ACF PACF we will not consider models with q = 10, p = 6.
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10
## 0 o o x o o o x o o x o
## 1 x o x o o o o o o x o
## 2 x o x o o o x o o x o
## 3 x o x o o x o o o o o
## 4 x o o x o x o o o o o
## 5 x x x x o x o o o o o
## 6 o o o x x o o o o o o
## 7 o o o x o o o o o o o
## 8 x x o x o o o o o o o
## 9 o x o x o o o o o o o
## 10 x o x o x o o o o o o
Top left O is at p = 0, q = 0. However, it is interrupted by X, thus we select p = 0, q = 3 as the start point. Possible models are: - ARIMA(0,1,3), - ARIMA(0,1,4), - ARIMA(1,1,3), - ARIMA(1,1,4).
Figure 19: BIC
The above table has the highest BIC for p = 3 and q = 2. Thus possible models are ARIMA(3,1,2). The third row has p = 4. It is supported by EACF table and PACF plot, therefore we will add ARIMA(4,1,2) as a possible model.
Based on all the applied model specification tools possible models are: - ARIMA(3,1,3); - ARIMA(3,1,7); - ARIMA(4,1,3); - ARIMA(4,1,7); - ARIMA(0,1,3); - ARIMA(0,1,4); - ARIMA(1,1,3); - ARIMA(1,1,4); - ARIMA(3,1,2); - ARIMA(4,1,2).
Ozone layer thickness data is a stochastic time series with a downwards trend, minimal variance, no change point and mixed MA/AR behaviour and no seasonality or cyclicity. The quadratic model appeared to be the best fit for the given time series. In the second part, we found a set of suitable ARIMA models. To choose the best model out of the set further testing needs to be performed.