A data set which contains the information of ozone layer thickness from 1927 to 2016 is used to analyse the trend. Objective of the project is to identify a best fitting model for the data and to give predictions for the next five years using the identified model. For identifying best models, linear, quadratic, cosine, cyclical or seasonal trend models are considered and the model which best fits the data set will be selected.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'TSA'
## The following object is masked from 'package:readr':
##
## spec
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.Arima TSA
## plot.Arima TSA
## [1] "ts"
Trend : Trend is vissible in the plot. There is a downward trend which implies as time increases Ozone Layer Thickness is getting reduced.
Changing Variance : Eventhough there is slight change of variance at some points, overall we can say there is no changing variance.
Intervention : We cannot observe any drastic change at a single point and hence can say there is no intervention point
Behavior/Autocorrelation : Plot has successive points as well as fluctuation. So it can be an ARIMA model which will be checked at later patter of this report
Expression for linear trend model : Y = aX+c Estimating slope (a) and intercept function (c)
model1=lm(Ozone.data1~time(Ozone.data1))
summary(model1)
##
## Call:
## lm(formula = Ozone.data1 ~ time(Ozone.data1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7165 -1.6687 0.0275 1.4726 4.7940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.720155 16.257158 13.15 <2e-16 ***
## time(Ozone.data1) -0.110029 0.008245 -13.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared: 0.6693, Adjusted R-squared: 0.6655
## F-statistic: 178.1 on 1 and 88 DF, p-value: < 2.2e-16
Slope & Intercept are satistically signifant at 5% significance level. But the R-squared value is 0.67, and hence this is not a good fitting model.Linear trend line is as follows
plot(Ozone.data1,type="o",ylab="Ozone Layer Thickness", main =
"Fitted Linear Trend")
legend("bottomleft",lty=1, bty = "n" ,text.width = 14, col=c("black","blue"),
c("Ozone Layer Thickness Changes", "Linear Model"))
abline(model1, col="blue")
Trend line is not able capture most of the values and hence linear trend is not a good fitting model.
res.model1=rstudent(model1)
plot(y = res.model1, x = as.vector(time(Ozone.data1)),
xlab = 'Time', ylab='Standardized Residuals',
type='p', main = "Standardized Residuals vs Time")
abline(h=0)
# REsiduals vs Fitted
plot(y=res.model1,x=fitted(model1),
xlab='Fitted Trend Values', ylab='Standardized Residuals',
type='p',
main = "Standardized Residuals vs Fitted values")
abline(h=0, col="red")
From the plots we can say residuals are correlated because all the points are randomly distributed.
hist(rstudent(model1),xlab="Standardized Residuals", main="Histogram of Standardized Residuals")
Looking at the histogram, we can say its normally distributed.
qqplot=rstudent(model1)
qqnorm(qqplot)
qqline(qqplot,col=2,lwd=1, lty=2)
At the start and at the end data deviates from the line and hence we can say the data is not normally distributed
st=rstudent(model1)
shapiro.test(st)
##
## Shapiro-Wilk normality test
##
## data: st
## W = 0.98733, p-value = 0.5372
P>0.5 and hence we fail to reject null hypothesis which means stochaistic component of data doesn’t have a normal distribution.
acf(rstudent(model1), main="ACF of Standardized Residuals")
From the ACF plots, we can observe significant lags. Hence correlation exists and we can infer that stochaistic component of series is not white noise.
Y = a1X + a2X^2 + c
# Obtaining summary of quadratic model
t=time(Ozone.data1)
t2=t^2
model2=lm(Ozone.data1~t+t2)
summary(model2)
##
## Call:
## lm(formula = Ozone.data1 ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1062 -1.2846 -0.0055 1.3379 4.2325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.733e+03 1.232e+03 -4.654 1.16e-05 ***
## t 5.924e+00 1.250e+00 4.739 8.30e-06 ***
## t2 -1.530e-03 3.170e-04 -4.827 5.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared: 0.7391, Adjusted R-squared: 0.7331
## F-statistic: 123.3 on 2 and 87 DF, p-value: < 2.2e-16
Slope & Intercept are satistically signifant at 5% significance level. R-squared value is 0.72, which means 73% of the variation can be explained by quadratic model.Hnce this seems to be a good fitting model when compared to Linear model
Fitted Quadratic Plot
plot(ts(fitted(model2)),ylim=c(min(c(fitted(model2),
as.vector(Ozone.data1))),max(c(fitted(model2),
as.vector(Ozone.data1)))),ylab="Ozone Layer Thickness",
main="Fitted Quadratic Curve",col="blue")
legend("bottomleft",lty=1, bty = "n" ,text.width = 14, col=c("black","blue"),
c("Ozone Layer Thickness Changes", "Linear Model"))
lines(as.vector(Ozone.data1),type="o")
Plot shows that quadratic model has a better fit when compared to Linear Model
res.model2=rstudent(model2)
plot(y = res.model2, x = as.vector(time(Ozone.data1)),
xlab = 'Time', ylab='Standardized Residuals',
type='p', main = "Standardized Residuals vs Time")
abline(h=0)
# Residuals vs Fitted
plot(y=res.model2,x=fitted(model2),
xlab='Fitted Trend Values',
ylab='Standardized Residuals',
type='p',
main = "Standardized Residuals vs Fitted values")
abline(h=0, col="red")
From the plots we can say residuals are correlated because all the points are randomly distributed.
hist(rstudent(model2),xlab="Standardized Residuals", main="Histogram of Standardized Residuals")
Looking at the histogram, we can say residuals are not normally distributed.
qqplot=rstudent(model2)
qqnorm(qqplot)
qqline(qqplot,col=2,lwd=1, lty=2)
Only few points deviate from the straight line and hence we can say residuals are normally distributed
st=rstudent(model2)
shapiro.test(st)
##
## Shapiro-Wilk normality test
##
## data: st
## W = 0.98889, p-value = 0.6493
P>0.5 and hence we fail to reject null hypothesis which means stochaistic component of data doesn’t have a normal distribution.
acf(rstudent(model2), main="ACF of Standardized Residuals")
From the ACF plots, we can observe significant lags. Hence correlation exists and we can infer that stochaistic copmonent of series is not white noise.
There is no seasonality in the data and hence these two models are not considered for fitting.
From all the plots and hypothesis testing we can say that quadratic model is a better fitting model when compared with linear model. But we can’t say this as the best model as they do not satisy the the assumption of residuals and normality. For this we need to differentiate the data and then fit the model.
Hence the quadratic model is a better fit, this model is considered to predict the ozone layer thickness for next five years.
t= c(2017,2018,2019,2020,2021)
t2=t^2
x= data.frame(t,t2)
forecasts = predict(model2, x, interval = "prediction")
print(forecasts)
## fit lwr upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701
plot(Ozone.data1,type='o',
ylab='Changes in the thickness of ozone layer',
xlim=c(1927,2022),
ylim=c(-15,10))
lines(ts(as.vector(forecasts[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecasts[,2]), start = 2017), col="blue", type="l")
lines(ts(as.vector(forecasts[,3]), start = 2017), col="blue", type="l")
legend("topright", lty=1, pch=1, col=c("black","blue","red"), text.width = 21,
c("Data","5% forecast limits", "Forecasts"))
While testing different models, quadratic model was found as a better fit and the same was used to forecast ozone layer thickness for the next five years with 5% upper and lower limit.From the forecast plot we can clearly identify a downward trend which indicates that there will be a reduction in the thickness of ozone layer in the forcasted years.
Objective of this part of the report is to predict ARIMA model for the ozone layer thickness dataset using various model specification tools.
From the time series plot from Part A from the plot that the,
1)There is a downward trend,
2)No seasonality in the plot,
3)No changing variance
4)No intervention in the plot
5)Plot has successive points as well as fluctuation so it has Moving Aveage and Auto Regressive behaviour.
# Checking the stationarity of data using QQ Plot
qqnorm(Ozone.data1)
qqline(Ozone.data1, col =2, lwd =1, lty =2)
# From QQ plot we can say that distribution is not normal and shapiro test is used proove the same
#shapiro wilk test
shapiro.test(Ozone.data1)
##
## Shapiro-Wilk normality test
##
## data: Ozone.data1
## W = 0.95605, p-value = 0.004031
#Since the p value is less than 0.05, we cannot say that data is normal
par(mfrow=c(1,2))
acf(Ozone.data1)
pacf(Ozone.data1)
## We can observe a Slowly decaying pattern in ACF and very high first correlation in PACF which shows the existence of trend and nonstationarity which needs to be removed before predicting the model.
#Applyin ADF test for checking nonstationarity
adf.test(Ozone.data1)
##
## Augmented Dickey-Fuller Test
##
## data: Ozone.data1
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary
#p>0.05 we fail to reject the null hypothesis which is that the series is non-stationary.
#Applying BocCox transformation
data2.trans = BoxCox.ar(Ozone.data1+ abs(min(Ozone.data1))+1)
data2.trans$ci
## [1] 0.9 1.5
lambda = 1.2
# mid point of th interval obtained from plot
Ozone.data1 = Ozone.data1 + abs(min(Ozone.data1))
# For Box Cox transformation all values need to be positive
BC.data = (Ozone.data1^lambda-1)/lambda
qqnorm(BC.data) # To check the normality
qqline(BC.data, col = 2)
shapiro.test(BC.data)
##
## Shapiro-Wilk normality test
##
## data: BC.data
## W = 0.96696, p-value = 0.02169
# From the QQ plot we can interpret that the distribution is not normal.
# Shapiro test also gives enough evidence to reject the normality hypothesis
# We can say that distribution is not normal even after box-cox transformation.
# Hence we need to do diffrencing
# Applying first difference and plotting graph to check whether it helped to reduce non stationary.
diff.BC.d1 = diff(Ozone.data1)
par(mfrow=c(1,1))
plot(diff.BC.d1,type='o',ylab='Quarterly earnings ')
adf.test(diff.BC.d1)
## Warning in adf.test(diff.BC.d1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff.BC.d1
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
# Based on the p-value we can reject the null hypothesis
# of and can infer that first differencing makes the series stationary.
par(mfrow=c(1,2)) # To plot both ACF and PACF in the same panel of plots
acf(diff.BC.d1)
pacf(diff.BC.d1)
From the ACF and PACF plot we can interpret p and q values. In ACF plot there are three significant lags, so q=3 and in PACf also number of significant lags is three, so q=3. We have done one differencing, so d=1. So the ARIMA model from PACF is (3,1,1)
eacf(diff.BC.d1)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x o o o x o o x o o o o
## 1 x o x o o o o o o x o o o o
## 2 x o x o o o x o o x o o o o
## 3 x o x o o x o o o o o o o o
## 4 x o o x o x o o o o o o o o
## 5 x x x x o x o o o o o o o o
## 6 o o o x x o o o o o o o o o
## 7 o o o x o o o o o o o o o o
# We have limited the order of pa & q to a maximum of five to reduce errors.
# From the output of eacf model we can consider following set of ARIMA models
# ARIMA(0,1,0)
# ARIMA(0,1,1)
# ARIMA(1,1,1)
par(mfrow=c(1,1))
res2 = armasubsets(y=diff.BC.d1,nar=5,nma=5,y.name='test',ar.method='ols')
plot(res2)
# In the BIC table,test.lag represents AR values and error-lag represents MA values.
# So values will be AR(3) & MA(2)
# ARIMA models
#ARIMA (3,1,2)
By using different model specification tools proposed ARIMA models are as follows:
ARIMA(0,1,0) ARIMA(0,1,1) ARIMA(1,1,1) ARIMA(3,1,1) ARIMA(3,1,2)