In this part, I will conduct the Model-Building Strategy to find the best fitting model for the data of the recorded thickness of Ozone layer from 1927 to 2016 in Dobson units. After this, the prediction of yearly changes for the next 5 years will be done for the benefit of the environment.
Through the plot of Yearly changes in the thickness of Ozone layer, we can easily see that there is a downward trend from 1927 to 2016. But the changing of variation is not obvious so as to its repeated pattern. Additionally, it can’t be seen that there is any seasonality. Generally, this model can be an AR model due to some hanging observations as well as the MA model due to many fluctuations. However, in this part, I will only focus on Model-Building Strategy and prediction.
## Parsed with column specification:
## cols(
## X1 = col_double()
## )
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
## Warning: Setting row names on a tibble is deprecated.
## [1] 0.8700381
After comparing 3 models fitting, I decide to choose the quadratic model due to the highest coefficient of determination (0.7331). Also, the median of residuals is nearly approaching to 0 (-0.0055). Overall, it is statistically significant that both of the parameters (t&t2) are relative to the response variable which means Ozone layer is explained by the quadratic time trend.
##
## Call:
## lm(formula = ozone ~ time(ozone))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7165 -1.6687 0.0275 1.4726 4.7940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.720155 16.257158 13.15 <2e-16 ***
## time(ozone) -0.110029 0.008245 -13.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared: 0.6693, Adjusted R-squared: 0.6655
## F-statistic: 178.1 on 1 and 88 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = ozone ~ cos(2 * pi * f * t) + sin(2 * pi * f * t) -
## 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3520 -1.8906 0.4837 2.3643 6.4248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## cos(2 * pi * f * t) -2.970e+00 4.790e-01 -6.199 1.79e-08 ***
## sin(2 * pi * f * t) 5.462e+11 7.105e+11 0.769 0.444
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.522 on 88 degrees of freedom
## Multiple R-squared: 0.4601, Adjusted R-squared: 0.4479
## F-statistic: 37.5 on 2 and 88 DF, p-value: 1.663e-12
##
## Call:
## lm(formula = ozone ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1062 -1.2846 -0.0055 1.3379 4.2325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.733e+03 1.232e+03 -4.654 1.16e-05 ***
## t 5.924e+00 1.250e+00 4.739 8.30e-06 ***
## t2 -1.530e-03 3.170e-04 -4.827 5.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared: 0.7391, Adjusted R-squared: 0.7331
## F-statistic: 123.3 on 2 and 87 DF, p-value: < 2.2e-16
In this part, I am going to test the quality of this model with some plots and tests. As for the time series plot of residuals, it can be seen that all observations scatter evenly around x=0 and no pattern in this plot.
In the fitted trend plot, there is almost the similarity of variation for all observations over the x-axis.
Finally, by testing the normality with histogram, there is a closely symmetric distribution with some exception. However, Shapiro test and qqplot indicate that the residuals is normal. Also, the residuals is independent since p-valie of the runs test is 6.89e-05. So, we can say the quadratic model is quite well fitting the data.
##
## Shapiro-Wilk normality test
##
## data: k
## W = 0.98889, p-value = 0.6493
## $pvalue
## [1] 6.89e-05
##
## $observed.runs
## [1] 27
##
## $expected.runs
## [1] 46
##
## $n1
## [1] 45
##
## $n2
## [1] 45
##
## $k
## [1] 0
## fit lwr upr
## 1 -5727.315 -8173.596 -3281.034
## 2 -5721.396 -8165.194 -3277.597
## 3 -5715.480 -8156.797 -3274.162
## 4 -5709.566 -8148.404 -3270.729
## 5 -5703.656 -8140.015 -3267.298
To sum up, there is a quadratic trend for yearly changes in the thickness of Ozone layer and the neighboring values are very relative to each other. Also, with some normality and independent tests, we can trust this model and give a precise prediction for the future.
In this part, I will see the Ozone data as a stochastic trend and try to make it a stationary trend with differencing and transforming in order to analyze it easier.
## [1] "ts"
## [1] 0.90 1.58
##
## Shapiro-Wilk normality test
##
## data: BC.ozone
## W = 0.96644, p-value = 0.01995
##
## Shapiro-Wilk normality test
##
## data: diff.BC.ozone
## W = 0.98321, p-value = 0.3061
##
## Augmented Dickey-Fuller Test
##
## data: ozone
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary
## Warning in adf.test(diff.BC.ozone): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff.BC.ozone
## Dickey-Fuller = -7.2426, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x x o o x o o x o o o o
## 1 x o x o o o x o o x o o o o
## 2 o o x o o o x o o x o o o o
## 3 x o x o o x o o o o o o o o
## 4 x o o x o x o o o o o o o o
## 5 x x x x o x o o o o o o o o
## 6 o o o x o o o o o o o o o o
## 7 x o o x o o o o o o o o o o
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 8 linear dependencies found
## Reordering variables and trying again:
The set of candidate models is {ARIMA(1,1,3), ARIMA(2,1,3), and ARIMA(3,1,3)}
library(readr)
library(TSA)
ozone = read_csv("Downloads/data1.csv", col_names = FALSE)
class(ozone)
rownames(ozone) = seq(from=1927, to=2016)
ozone = ts(as.vector(ozone),start=1927, end=2016)
# DATA VS TIME
plot(ozone,type='o',ylab='Yearly changes in the thickness of Ozone layer',main='Time series plot of \nyearly changes in the thickness of Ozone layer')
# Scatter plot
plot(y=ozone,x=zlag(ozone),ylab='changes in the thickness of Ozone layer', xlab='Previous changes in \nthe thickness of Ozone layer', main="Scatter plot of yearly changes in \nthe thickness of Ozone layer")
# Correlation
y = ozone
x = zlag(ozone)
index =2:length(x)
cor(y[index],x[index])
# Linear model
model1 = lm(ozone~time(ozone))
summary(model1)
# Cosine model
f=1
t = time(ozone)
model1_2=lm(ozone ~ cos(2*pi*f*t)+sin(2*pi*f*t)-1)
summary(model1_2)
# Quadratic model
t = time(ozone)
t2 = t^2
model1_3 = lm(ozone~t+t2)
summary(model1_3)
plot(ts(fitted(model1_3)),
ylim = c(min(c(fitted(model1_3),as.vector(ozone))),
max(c(fitted(model1_3),as.vector(ozone)))),
ylab='y',main ="Fitted quadratic curve to Ozone layer data")
lines(as.vector(ozone),type="o")
res.model1_3=rstudent(model1_3)
plot(y = res.model1_3, x = as.vector(time(ozone)),xlab = 'Time',
ylab='Standardized Residuals',type='p',
main ="Time series plot of residuals")
plot(y=res.model1_3,x=as.vector(fitted(model1_3)),
xlab='Fitted Trend Values', ylab='Standardized Residuals'
,type='n', main ="Time series plot of standardised residuals")
points(y=rstudent(model1_3),x=as.vector(fitted(model1_3)))
# Normality test
hist(res.model1_3,xlab='Standardized Residuals')
k=res.model1_3
qqnorm(k)
qqline(k, col = 2, lwd = 1, lty = 2)
shapiro.test(k)
runs(k)
values = data.frame(t = seq(1,5,1))
values['t2']=values['t']**2
forecasts = predict(model1_3, values, interval ="prediction")
forecasts
library(readr)
library(TSA)
library(tseries)
class(ozone)
par(mfrow=c(1,2))
acf(ozone)
pacf(ozone)
par(mfrow=c(1,1))
# Box-cox Transformation
ozone_positive = ozone + abs(min(ozone))+1
ozone_transform = BoxCox.ar(ozone_positive,
lambda = seq(-2,2, 0.01))
ozone_transform$ci
lambda = 1.2
BC.ozone = (ozone_positive^lambda-1)/lambda
qqnorm(BC.ozone)
qqline(BC.ozone, col = 2, lwd = 1, lty = 2)
shapiro.test(BC.ozone)
# Differencing Box-cox transformed data
diff.BC.ozone = diff(BC.ozone,differences=1)
plot(diff.BC.ozone,type='o',
ylab='Yearly changes in the thickness of Ozone layer')
par(mfrow=c(1,2))
acf(diff.BC.ozone)
pacf(diff.BC.ozone)
par(mfrow=c(1,1))
# There is a tail off from lag3 both in ACF and PACF.
# So we can consider ARIMA(3,1,3) model
qqnorm(diff.BC.ozone)
qqline(diff.BC.ozone, col = 2, lwd = 1, lty = 2)
shapiro.test(diff.BC.ozone)
adf.test(ozone)
adf.test(diff.BC.ozone)
eacf(diff.BC.ozone)
# Which indicates that the set of candidate models are ARIMA(1,1,3), ARIMA(2,1,3), and ARIMA(3,1,3)
ozone_BIC = armasubsets(y=diff.BC.ozone,nar=14,nma=14,y.name='test',
ar.method='ols')
plot(ozone_BIC)
# In the BIC table, shaded columns correspond to AR(11), and MA(10) coefficients
# So, from here we can include ARIMA(11,1,3) model in the set of
# candidate model.