S3759607 TS

Introduction

The dataset is about the yearly changes in the thickness of Ozone layer 1927 to 2016 in Dobson units. A negative value in the dataset represents a decrease in the thickness and a positive value represents an increase in the thickness. This report is to find the best model for this time series.

We are treating the data as a deterministic trend in Task 1 while treating it as a stochastic trend in Task 2.

Load required packages

library(TSA)
library(tseries)

Read the dataset

Read this dataset into Rstudio from original CSV file. This dataframe is then converted to time series object. After plotting this time series data, we can see the following characteristics: - There is a downward trend in the data - Slight repetitive patterns over time - Some changing variation through time - Moving average and autoregressive behaviours are obvious - No intervention point noted

data <- read.csv("data1.csv", header = FALSE)
class(data)

## [1] "data.frame"

data_ts <- ts(data, start = 1927, end = 2016)
plot(data_ts, type = 'o', ylab = 'Yearly Changes in Dobson units', main = 'Time Series Plot of Yearly Changes in the Thickness of Ozone Layer')

Linear trend model

Suppose this plot is a linear time trend. With least-squares regression a linear trend model is fitted. According to the regression output, the estimates of intercept and slope are 213.72 and -0.11. They are both statistically significant. Adjusted R-squared is 0.6655 which is a bit low.

The residual points look quite randomly spreaded.Histogram and qqplot shows that the residuals are not perfect normal distribution but Shapiro-Wilk test (p-value = 0.5372) failed to reject the null hypothesis that the stochastic component of model1 is normally distributed. The independence test (p value = 9.7e-06) reject the null hypothesis that the stochastic component is independent. Sample autocorrelation function shows that there is correlation at lag 1. Therefore, linear trend model is not good enough.

model1 = lm(data_ts~time(data_ts))
summary(model1)   # Significant. Adjusted R-squared:0.6655.

## 
## Call:
## lm(formula = data_ts ~ time(data_ts))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7165 -1.6687  0.0275  1.4726  4.7940 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   213.720155  16.257158   13.15   <2e-16 ***
## time(data_ts)  -0.110029   0.008245  -13.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6655 
## F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16

plot(data_ts, type='o', ylab='y', main='Time Series Plot for Yearly Changes with linear model')
abline(model1)

res.model1 = rstudent(model1)
plot(y = res.model1, x = as.vector(time(data_ts)),xlab = 'Time', ylab='Standardized Residuals of model1',type='p')

hist(res.model1, xlab = 'Standardized Residuals') # Not very normal

qqnorm(res.model1)
qqline(res.model1, col=2, lwd=1, lty=2)     # Not normally distributed enough

shapiro.test(res.model1)  #p-value = 0.5372. fail to reject normal distribution hypothosis

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model1
## W = 0.98733, p-value = 0.5372

runs(res.model1)         # p-value = 9.7e-06, reject independentce hypothesis

## $pvalue
## [1] 9.7e-06
## 
## $observed.runs
## [1] 25
## 
## $expected.runs
## [1] 45.97778
## 
## $n1
## [1] 44
## 
## $n2
## [1] 46
## 
## $k
## [1] 0

acf(res.model1)          # Autocorrelation at lag 1

Quadratic trend model

Again with the least squares approach, a quadratic time trend is fitted to the ozone thickness time series. Based on the output, this model is significant and the R-squared (0.7331) is better than that of liner model.

The model lines looks like a better fit than linear model and residual points look quite randomly spreaded. In terms of the residuals, histogram and qqplot are not very encouraging although the normal distribution hypothesis is not rejected by shapiro test (p-value = 0.6493). Also, the independence test (p value = 6.89e-05) reject the null hypothesis that the stochastic component is independent. Lastly, ACF indicates that there is autocorrelation at lag 1, lag 3 and lag4. Therefore, we conclude that the quadratic model is better than linear model but we still move on to check other models.

t = time(data_ts)
t2 = t^2
model2 = lm(data_ts~t+t2)
summary(model2)

## 
## Call:
## lm(formula = data_ts ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1062 -1.2846 -0.0055  1.3379  4.2325 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
## t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
## t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared:  0.7391, Adjusted R-squared:  0.7331 
## F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16

plot(ts(fitted(model2)), ylim=c(min(c(fitted(model2), as.vector(data_ts))), max(c(fitted(model2), as.vector(data_ts)))), ylab='y', main = "Fitted Quadratic Curve to Ozone Thickness data")
lines(as.vector(data_ts),type="o")

res.model2 = rstudent(model2)
plot(y = res.model2, x = as.vector(time(data_ts)),xlab = 'Time', ylab='Studentized Residuals of model2',type='p')
abline(h=0)

hist(res.model2, xlab = 'Standardized Residuals')

qqnorm(res.model2)
qqline(res.model2, col=2, lwd=1, lty=2)   # Not normally distributed

shapiro.test(res.model2)   # p-value = 0.6493

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model2
## W = 0.98889, p-value = 0.6493

runs(res.model2)   # p-value = 6.89e-05

## $pvalue
## [1] 6.89e-05
## 
## $observed.runs
## [1] 27
## 
## $expected.runs
## [1] 46
## 
## $n1
## [1] 45
## 
## $n2
## [1] 45
## 
## $k
## [1] 0

acf(res.model2)   # autocorrelation at lag 1, lag 3 and lag4

Cyclical model

As there is a slight repetitive pattern noted in the time series plot and the data is yearly. Cyclical trend model is fitted. First, we assume the pattern happens every 6 years. By plotting the cyclical model against original plots, it’s easy to see that the model has no trend and does not capture the data well. R-squared is only 0.4202.

Then we combine the cyclical model and quadratic trend model. The combined model has an adjusted R-squared of 0.8517, which is much better than the cyclical model alone.

The combination model lines looks like a better fit than cyclical model and residual points look quite randomly spreaded. Histogram and qqplot shows normal distribution and this is confirmed by shapiro test (p-value = 0.3578). However, the independence test (p value = 6.89e-05) reject the null hypothesis that the stochastic component is independent. ACF indicates that there is autocorrelation at lag 1, lag 3 and lag4.

# Assume the yearly data repeat itself every 6 years.
data_sea6 = ts(as.vector(matrix(data$V1, nrow = 15, ncol = 6)), frequency = 6)
plot(data_sea6, type = 'o')

seas = season(data_sea6)
model3 = lm(data_sea6~seas-1)
summary(model3)   # Coefficients are statistically significant at 5% confidence level. R-squared: 0.4202

## 
## Call:
## lm(formula = data_sea6 ~ seas - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4755 -1.6640  0.5134  2.3864  6.5111 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## seasSeason-1  -2.8943     0.9318  -3.106 0.002585 ** 
## seasSeason-2  -3.1722     0.9318  -3.404 0.001019 ** 
## seasSeason-3  -3.6586     0.9318  -3.926 0.000176 ***
## seasSeason-4  -3.1478     0.9318  -3.378 0.001108 ** 
## seasSeason-5  -3.1039     0.9318  -3.331 0.001287 ** 
## seasSeason-6  -3.2367     0.9318  -3.474 0.000814 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.609 on 84 degrees of freedom
## Multiple R-squared:  0.4589, Adjusted R-squared:  0.4202 
## F-statistic: 11.87 on 6 and 84 DF,  p-value: 1.325e-09

plot(ts(fitted(model3)),ylim=c(min(c(fitted(model3), as.vector(data_sea6))), max(c(fitted(model3), as.vector(data_sea6)))), ylab='y', main = "Fitted Cyclical model to Ozone Thickness data")
lines(as.vector(data_sea6), type = 'o')     # model3 does not capture the data

# Quadratic model is introduced.
t = time(data_sea6)
t2 = t^2
model3.1 = lm(data_sea6~seas+t+t2-1)
summary(model3.1)   #Adjusted R-squared:0.8517

## 
## Call:
## lm(formula = data_sea6 ~ seas + t + t2 - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3458 -1.3076  0.0912  1.3598  4.7563 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## seasSeason-1 -0.46232    0.85049  -0.544    0.588    
## seasSeason-2 -0.63599    0.85462  -0.744    0.459    
## seasSeason-3 -1.01500    0.85834  -1.183    0.240    
## seasSeason-4 -0.39389    0.86166  -0.457    0.649    
## seasSeason-5 -0.23650    0.86458  -0.274    0.785    
## seasSeason-6 -0.25280    0.86710  -0.292    0.771    
## t             0.26561    0.20079   1.323    0.190    
## t2           -0.05512    0.01162  -4.742 8.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.849 on 82 degrees of freedom
## Multiple R-squared:  0.8614, Adjusted R-squared:  0.8479 
## F-statistic:  63.7 on 8 and 82 DF,  p-value: < 2.2e-16

plot(ts(fitted(model3.1)),ylim=c(min(c(fitted(model3.1), as.vector(data_sea6))), max(c(fitted(model3.1), as.vector(data_sea6)))), ylab='y', main = "Fitted Cyclical Curve to Ozone Thickness data (6)")
lines(as.vector(data_sea6), type = 'o')

res.model3.1 = rstudent(model3.1)
plot(y = res.model3.1, x = as.vector(time(data_sea6)),xlab = 'Time', ylab='Studentized Residuals of model3.1',type='p')
abline(h=0)

hist(res.model3.1, xlab = 'Standardized Residuals')

qqnorm(res.model3.1)
qqline(res.model3.1, col=2, lwd=1, lty=2)    # normal distribution

shapiro.test(res.model3.1)    #p-value = 0.3578

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model3.1
## W = 0.98964, p-value = 0.705

runs(res.model3.1)     #p-value = 6.98e-05

## $pvalue
## [1] 6.98e-05
## 
## $observed.runs
## [1] 27
## 
## $expected.runs
## [1] 45.97778
## 
## $n1
## [1] 44
## 
## $n2
## [1] 46
## 
## $k
## [1] 0

acf(res.model3.1)     # autocorrelation at lag 1, lag 3 and lag4

This might be because we set the wrong frequency. Let’s change the frequency to 7 and try the cyclic-quadratic model again. This model has adjusted R-squared as 0.8538, which is slightly better than before.

Residual check shows normal distribution on qqplot but not on histogram. Shapiro test (p-value = 3.658e-06) is signigicant so reject the normal distribution hypothesis. However, the independence test (p value = p-value = 0.000372) reject the null hypothesis that the stochastic component is independent. ACF indicates that there is autocorrelation at lag 1, lag 3 and lag4. The residuals are neither normally distributed nor independent. Therefore, we decide to set the frequency as 6 still.

data_sea7 = ts(as.vector(matrix(data$V1, ncol = 7)), frequency = 7)
seas = season(data_sea7)
t = time(data_sea7)
t2 = t^2
model3.2 = lm(data_sea7~seas+t+t2-1)
summary(model3.2)   #Adjusted R-squared:  0.796

## 
## Call:
## lm(formula = data_sea7 ~ seas + t + t2 - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4476 -1.2920 -0.2133  1.2420  9.4603 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## seasMonday     0.68549    1.02569   0.668  0.50581   
## seasTuesday    0.03107    1.03047   0.030  0.97602   
## seasWednesday -0.76639    1.03479  -0.741  0.46104   
## seasThursday  -0.58604    1.03866  -0.564  0.57414   
## seasFriday    -0.32709    1.04207  -0.314  0.75441   
## seasSaturday   0.11011    1.04504   0.105  0.91634   
## seasSunday     1.20552    1.04757   1.151  0.25317   
## t              0.06475    0.27012   0.240  0.81114   
## t2            -0.05318    0.01773  -2.999  0.00358 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.13 on 82 degrees of freedom
## Multiple R-squared:  0.8161, Adjusted R-squared:  0.796 
## F-statistic: 40.44 on 9 and 82 DF,  p-value: < 2.2e-16

plot(ts(fitted(model3.2)),ylim=c(min(c(fitted(model3.2), as.vector(data_sea7))), max(c(fitted(model3.2), as.vector(data_sea7)))), ylab='y', main = "Fitted Cyclical Curve to Ozone Thickness data (7)")
lines(as.vector(data_sea7), type = 'o')

res.model3.2 = rstudent(model3.2)
plot(y = res.model3.2, x = as.vector(time(data_sea7)),xlab = 'Time', ylab='Studentized Residuals of model3.2',type='p')
abline(h=0)

hist(res.model3.2, xlab = 'Standardized Residuals')

qqnorm(res.model3.2)
qqline(res.model3.2, col=2, lwd=1, lty=2)    # normal distribution

shapiro.test(res.model3.2)    #p-value = 3.658e-06

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model3.2
## W = 0.90006, p-value = 3.658e-06

runs(res.model3.2)     #p-value = 0.000372

## $pvalue
## [1] 0.000372
## 
## $observed.runs
## [1] 29
## 
## $expected.runs
## [1] 46.05495
## 
## $n1
## [1] 50
## 
## $n2
## [1] 41
## 
## $k
## [1] 0

acf(res.model3.2)     # autocorrelation at lag 1, lag 3 and lag4

Cosine trend

Lastly, we try the cosine trend model combined with quadratic model. Based on the summary of this model, the coefficients are not significant and adjusted R-squared is 0.7317. Residual check: Histogram and qqplot shows normal distribution and this is confirmed by shapiro test (p-value = 0.6206). However, the independence test (p value = 0.000424) reject the null hypothesis that the stochastic component is independent. ACF indicates that there is autocorrelation at lag 1, lag 3 and lag4.

har.=harmonic(data_sea6, 1)
t = time(data_sea6)
t2 = t^2
model4=lm(data_sea6~har.+t+t2)
summary(model4)  # Adjusted R-squared:  0.7317

## 
## Call:
## lm(formula = data_sea6 ~ har. + t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4075 -1.2271  0.0685  1.3942  4.5511 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.49707    0.72907  -0.682    0.497    
## har.cos(2*pi*t)  0.03759    0.27139   0.139    0.890    
## har.sin(2*pi*t) -0.33543    0.27159  -1.235    0.220    
## t                0.26535    0.19766   1.342    0.183    
## t2              -0.05513    0.01144  -4.818 6.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.82 on 85 degrees of freedom
## Multiple R-squared:  0.7438, Adjusted R-squared:  0.7317 
## F-statistic: 61.69 on 4 and 85 DF,  p-value: < 2.2e-16

plot(ts(fitted(model4), frequency = 6), ylab='y', type = 'l', ylim=range(c(fitted(model4), data_sea6)), main = "Fitted Cosine trend model to Ozone Thickness data")
points(data_sea6)

res.model4 = rstudent(model4)
hist(res.model4, xlab = 'Standardized Residuals')

qqnorm(res.model4)
qqline(res.model4, col=2, lwd=1, lty=2)

shapiro.test(res.model4)   #p-value = 0.6206

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model4
## W = 0.9885, p-value = 0.6206

runs(res.model4)   #p-value = 0.000424

## $pvalue
## [1] 0.000424
## 
## $observed.runs
## [1] 29
## 
## $expected.runs
## [1] 45.91111
## 
## $n1
## [1] 43
## 
## $n2
## [1] 47
## 
## $k
## [1] 0

acf(res.model4)  # autocorrelation at lag 1, lag 3 and lag4

Prediction

All in all, no perfect model is found so far. Autocorrelations are found in the residuals of all models explored. Cosine model has insignificant coefficients. Quadratic-cyclic combined model has a higher R-squared that of linear and cyclic model. Between quadratic-cyclic trend model and quadratic trend model, we use the quadratic trend model for prediction based on the principle of parsimony. Based on the output using predict() function, the yearly changes for the next five years with 95% confidence are: -10.34387, -10.59469, -10.84856, -11.10550 and -11.36550.

t = time(data_ts)
t2 = t^2
# Predict the data for the next five years
forecasts <- predict.lm(model2, data.frame(t=c(2017,2018,2019,2020,2021), t2=c(2017^2,2018^2,2019^2,2020^2,2021^2)), interval = "prediction" )
print(forecasts)

##         fit       lwr       upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701

Propose a set of possible ARIMA(p, d, q) models

From the time series plot in task1, downward trend, slight seasonality and some changing variations are in the plot. Moving averages and autoregressive behaviours are in the plot but no obvious intervention point.

After checking the ACF and PACF, the waving pattern in ACF and very high first correlation spike at lag1 in PACF implies the likelihood of AR(1) model. However, the ACF does has some decaying pattern which indicates the existence of trend. Therefore, ADF test is performed. ADF test shows p-value is 0.086 > 0.05 so we fail to reject the non-stationarity hypothesis. Then we do the following: - Transformation: lambda = 1 and do not perform transformation - Differencing: after applying the the first difference the series became detrended and stationary. This is confirmed by ACF and ADF test. There are significant autocorrelation at lag3 on the ACF, lag3 and lag4 on PACF. Also, there are cosine waves patterns at ACF and PCF. The patterns are tailing off, so we conclude p=1,2,3,4 and q=1,2,3.

Then EACF check is performed. The upper left hand vertex of this polygon indicates p=0 and q=0 - A random walk model. This is in line with the AR(1) model assumption previously before differencing. By checking EACF, we also include IMA(1,1), ARIMA(1,1,1), ARIMA(2,1,1), ARIMA(3,1,1) and ARIMA(3,1,3). Here the “1” in the middle comes from the first differencing. BIC table shows models containing lags 10 of the observed time series and lag 11 of the error process. Since lag 10 and 11 are too large, and the 3rd row has models containing lag 3 of the time series.

Therefore, considering principle of parsimony the conclusion is the candidate models are AR(1), IMA(1,1), ARIMA(1,1,1) and ARIMA(2,1,1). We can also check ARIMA(3,1,3) and ARIMA(3,1,1) if they don’t work.

par(mfrow=c(1,2))
acf(data_ts)
pacf(data_ts)

#par(mfrow=c(1,1))

# Apply ADF test to check the stationarity

adf.test(data_ts)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  data_ts
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary

# With a p-value of 0.0867, we cannot reject the null hypothesis stating that the series is non-stationary.

# Box Cox Transformation to stabilize variance
data_positive = data_ts+abs(min(data_ts))+0.1   # make sure all values are positive
data.transform = BoxCox.ar(data_positive)
data.transform$ci   # lambda is between 0.9 and 1.3, so we take lambda=1 and do not perform transformation

## [1] 0.9 1.3

# Differencing to de-trend
data.diff = diff(data_ts)
plot(data.diff, type='o',ylab='Yearly Change of Ozone Thickness')

# Check the stationarity of differenced series with ADF test
adf.test(data.diff)    # With a p-value of 0.01, we reject the null hypothesis stating that

## 
##  Augmented Dickey-Fuller Test
## 
## data:  data.diff
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

# the series is non-stationary; hence, we conclude that the first differencing make the series staionary.

# Specify the orders using ACf, PACF, EACF and BIC over the differenced series
par(mfrow=c(1,2)) 
acf(data.diff)
pacf(data.diff)

par(mfrow=c(1,1))    
# There are significant autocorrelation at lag3 on the ACF, lag3 lag4 on PACF. Also, there are cosine waves patterns at ACF and PCF. The patterns are tailing off. so p=1,2,3,4 and q=1,2,3.
eacf(data.diff) #We look for the upper left hand vertex of this polygon of 0's

## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x o o o x o o x o  o  o  o 
## 1 x o x o o o o o o x o  o  o  o 
## 2 x o x o o o x o o x o  o  o  o 
## 3 x o x o o x o o o o o  o  o  o 
## 4 x o o x o x o o o o o  o  o  o 
## 5 x x x x o x o o o o o  o  o  o 
## 6 o o o x x o o o o o o  o  o  o 
## 7 o o o x o o o o o o o  o  o  o

# It indicates p=0; q=0. A random walk model. This is in line with the AR(1) model previously.
# p = 0; q = 1. ARIMA(0,1,1)
# p = 1; q = 1. ARIMA(1,1,1)
# p = 2; q = 1. ARIMA(2,1,1)
# p = 3; q = 1. ARIMA(3,1,1)
# p = 3; q = 3. ARIMA(3,1,3)

# Display the BIC table
res = armasubsets(y=data.diff,nar=14,nma=14,y.name='test',ar.method='ols')

## Reordering variables and trying again:

plot(res)

# Top row (best model/lowest BIC): model containing lags 10 of the observed time series and lag 11 of the error process. The BIC table shaded columns correspond to AR(10) and MA(11).