Task 1

Introduction

In this part, I will conduct the Model-Building Strategy to find the best fitting model for the data of the recorded thickness of Ozone layer from 1927 to 2016 in Dobson units. After this, the prediction of yearly changes for the next 5 years will be done for the benefit of the environment.

Model specification

Through the plot of Yearly changes in the thickness of Ozone layer, we can easily see that there is a downward trend from 1927 to 2016. But the changing of variation is not obvious so as to its repeated pattern. Additionally, it can’t be seen that there is any seasonality. Generally, this model can be an AR model due to some hanging observations as well as the MA model due to many fluctuations. However, in this part, I will only focus on Model-Building Strategy and prediction.

## Parsed with column specification:
## cols(
##   X1 = col_double()
## )
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
## Warning: Setting row names on a tibble is deprecated.

  • Regarding to the scatter plot of yearly changes in the thickness of Ozone layer and the correlation of 0.8700381, I find that neighboring values are very closely related.

## [1] 0.8700381

Model fitting

After comparing 3 models fitting, I decide to choose the quadratic model due to the highest coefficient of determination (0.7331). Also, the median of residuals is nearly approaching to 0 (-0.0055). Overall, it is statistically significant that both of the parameters (t&t2) are relative to the response variable which means Ozone layer is explained by the quadratic time trend.

## 
## Call:
## lm(formula = ozone ~ time(ozone))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7165 -1.6687  0.0275  1.4726  4.7940 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 213.720155  16.257158   13.15   <2e-16 ***
## time(ozone)  -0.110029   0.008245  -13.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6655 
## F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = ozone ~ cos(2 * pi * f * t) + sin(2 * pi * f * t) - 
##     1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3520 -1.8906  0.4837  2.3643  6.4248 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## cos(2 * pi * f * t) -2.970e+00  4.790e-01  -6.199 1.79e-08 ***
## sin(2 * pi * f * t)  5.462e+11  7.105e+11   0.769    0.444    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.522 on 88 degrees of freedom
## Multiple R-squared:  0.4601, Adjusted R-squared:  0.4479 
## F-statistic:  37.5 on 2 and 88 DF,  p-value: 1.663e-12
## 
## Call:
## lm(formula = ozone ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1062 -1.2846 -0.0055  1.3379  4.2325 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
## t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
## t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared:  0.7391, Adjusted R-squared:  0.7331 
## F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16

Model diagnostics

In this part, I am going to test the quality of this model with some plots and tests. As for the time series plot of residuals, it can be seen that all observations scatter evenly around x=0 and no pattern in this plot.

In the fitted trend plot, there is almost the similarity of variation for all observations over the x-axis.

Finally, by testing the normality with histogram, there is a closely symmetric distribution with some exception. However, Shapiro test and qqplot indicate that the residuals is normal. Also, the residuals is independent since p-valie of the runs test is 6.89e-05. So, we can say the quadratic model is quite well fitting the data.

## 
##  Shapiro-Wilk normality test
## 
## data:  k
## W = 0.98889, p-value = 0.6493
## $pvalue
## [1] 6.89e-05
## 
## $observed.runs
## [1] 27
## 
## $expected.runs
## [1] 46
## 
## $n1
## [1] 45
## 
## $n2
## [1] 45
## 
## $k
## [1] 0

Prediction

##         fit       lwr       upr
## 1 -5727.315 -8173.596 -3281.034
## 2 -5721.396 -8165.194 -3277.597
## 3 -5715.480 -8156.797 -3274.162
## 4 -5709.566 -8148.404 -3270.729
## 5 -5703.656 -8140.015 -3267.298

Conclusion

To sum up, there is a quadratic trend for yearly changes in the thickness of Ozone layer and the neighboring values are very relative to each other. Also, with some normality and independent tests, we can trust this model and give a precise prediction for the future.

Task2

Introduction

In this part, I will see the Ozone data as a stochastic trend and try to make it a stationary trend with differencing and transforming in order to analyze it easier.

## [1] "ts"

Data Transformation

## [1] 0.90 1.58

## 
##  Shapiro-Wilk normality test
## 
## data:  BC.ozone
## W = 0.96644, p-value = 0.01995

## 
##  Shapiro-Wilk normality test
## 
## data:  diff.BC.ozone
## W = 0.98321, p-value = 0.3061
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ozone
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary
## Warning in adf.test(diff.BC.ozone): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff.BC.ozone
## Dickey-Fuller = -7.2426, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x x o o x o o x o  o  o  o 
## 1 x o x o o o x o o x o  o  o  o 
## 2 o o x o o o x o o x o  o  o  o 
## 3 x o x o o x o o o o o  o  o  o 
## 4 x o o x o x o o o o o  o  o  o 
## 5 x x x x o x o o o o o  o  o  o 
## 6 o o o x o o o o o o o  o  o  o 
## 7 x o o x o o o o o o o  o  o  o
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 8 linear dependencies found
## Reordering variables and trying again:

Conclusion

The set of candidate models is {ARIMA(1,1,3), ARIMA(2,1,3), and ARIMA(3,1,3)}

Appendix

library(readr)
library(TSA)
ozone = read_csv("Downloads/data1.csv", col_names = FALSE)
class(ozone)
rownames(ozone) = seq(from=1927, to=2016)
ozone = ts(as.vector(ozone),start=1927, end=2016) 
# DATA VS TIME
plot(ozone,type='o',ylab='Yearly changes in the thickness of Ozone layer',main='Time series plot of \nyearly changes in the thickness of Ozone layer')

# Scatter plot
plot(y=ozone,x=zlag(ozone),ylab='changes in the thickness of Ozone layer', xlab='Previous changes in \nthe thickness of Ozone layer',     main="Scatter plot of yearly changes in \nthe thickness of Ozone layer")

# Correlation
y = ozone           
x = zlag(ozone)
index =2:length(x)
cor(y[index],x[index])

# Linear model
model1 = lm(ozone~time(ozone)) 
summary(model1)

# Cosine model
f=1
t = time(ozone)
model1_2=lm(ozone ~ cos(2*pi*f*t)+sin(2*pi*f*t)-1)
summary(model1_2)

# Quadratic model
t = time(ozone)
t2 = t^2
model1_3 = lm(ozone~t+t2) 
summary(model1_3)

plot(ts(fitted(model1_3)), 
     ylim = c(min(c(fitted(model1_3),as.vector(ozone))),
              max(c(fitted(model1_3),as.vector(ozone)))),
     ylab='y',main ="Fitted quadratic curve to Ozone layer data")
lines(as.vector(ozone),type="o")

res.model1_3=rstudent(model1_3)
plot(y = res.model1_3, x = as.vector(time(ozone)),xlab = 'Time', 
     ylab='Standardized Residuals',type='p',
     main ="Time series plot of residuals")

plot(y=res.model1_3,x=as.vector(fitted(model1_3)),
     xlab='Fitted Trend Values', ylab='Standardized Residuals'
     ,type='n', main ="Time series plot of standardised residuals")
points(y=rstudent(model1_3),x=as.vector(fitted(model1_3)))

# Normality test
hist(res.model1_3,xlab='Standardized Residuals')

k=res.model1_3
qqnorm(k)
qqline(k, col = 2, lwd = 1, lty = 2)

shapiro.test(k)

runs(k)


values = data.frame(t = seq(1,5,1))
values['t2']=values['t']**2
forecasts = predict(model1_3, values, interval ="prediction")
forecasts
library(readr)
library(TSA)
library(tseries)
class(ozone)

par(mfrow=c(1,2)) 
acf(ozone)
pacf(ozone)
par(mfrow=c(1,1)) 

# Box-cox Transformation
ozone_positive = ozone + abs(min(ozone))+1
ozone_transform = BoxCox.ar(ozone_positive, 
                            lambda = seq(-2,2, 0.01))
ozone_transform$ci
lambda = 1.2

BC.ozone = (ozone_positive^lambda-1)/lambda
qqnorm(BC.ozone)
qqline(BC.ozone, col = 2, lwd = 1, lty = 2)
shapiro.test(BC.ozone)

# Differencing Box-cox transformed data
diff.BC.ozone = diff(BC.ozone,differences=1)
plot(diff.BC.ozone,type='o',
     ylab='Yearly changes in the thickness of Ozone layer')
par(mfrow=c(1,2))
acf(diff.BC.ozone)
pacf(diff.BC.ozone)
par(mfrow=c(1,1))

# There is a tail off from lag3 both in ACF and PACF.
# So we can consider ARIMA(3,1,3) model

qqnorm(diff.BC.ozone)
qqline(diff.BC.ozone, col = 2, lwd = 1, lty = 2)
shapiro.test(diff.BC.ozone)


adf.test(ozone)
adf.test(diff.BC.ozone)
eacf(diff.BC.ozone)
# Which indicates that the set of candidate models are ARIMA(1,1,3), ARIMA(2,1,3), and ARIMA(3,1,3)

ozone_BIC = armasubsets(y=diff.BC.ozone,nar=14,nma=14,y.name='test',
                        ar.method='ols')
plot(ozone_BIC)

# In the BIC table, shaded columns correspond to AR(11), and MA(10)  coefficients 
# So, from here we can include ARIMA(11,1,3) model in the set of 
# candidate model.