ozlayer <- read.csv("/Users/shubhamchougule/Downloads/data1.csv", header = FALSE)
#Attach the year data from 1927 to 2016
rownames(ozlayer) <- seq(from=1927, to=2016)
#Giving title to the column
colnames(ozlayer) <- c("Thickness")
ozlayer1 <- ts(as.vector(ozlayer), start=1927, end=2016,frequency = 2)
ozlayer <- ts(as.vector(ozlayer), start=1927, end=2016,frequency = 1)
plot(ozlayer,type='o',xlab= "Year", ylab='Thickness Difference')
From the above Graph of the Yearly thickness difference in the ozone layer, we can see that there is a change in the thickness of the ozone layer negatively that is a downward fall. No repeating pattern observed that is no trend observed. There is no seasonality in the graph. There is no intervention observed. There is no change in the varience. No auto correlation observed.
plot(y=ozlayer,x=zlag(ozlayer),ylab='Thickness', xlab='Last Year Thickness')
The given plot shows their is there is quite change in the thickness negatively.
#Calculating the correlation
b = ozlayer
a = zlag(ozlayer)
index = 2:length(a)
cor(b[index],a[index])
## [1] 0.8700381
The correlation of the data obtained above is very high between the years.
model1=lm(formula = ozlayer ~ time(ozlayer))
summary(model1)
##
## Call:
## lm(formula = ozlayer ~ time(ozlayer))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7165 -1.6687 0.0275 1.4726 4.7940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.720155 16.257158 13.15 <2e-16 ***
## time(ozlayer) -0.110029 0.008245 -13.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared: 0.6693, Adjusted R-squared: 0.6655
## F-statistic: 178.1 on 1 and 88 DF, p-value: < 2.2e-16
From the summary, we see that amount of change in the thickness per year is significant.
#Plot model 1
plot(ozlayer,type='o',ylab='y',main = "Linear Model")
abline(model1, col = 'red', lty=1)
In the above graph, there is a linear model red line passing, but the fit observed is not good. Most of the points lie above the line and very less below which are not distributed equally. Hence liner model does not fit ozone data.
plot(y=rstudent(model1),x=as.vector(time(ozlayer)), xlab='Time',
ylab='Residuals',type='o',
main = "Time series Residual plot for Linear Model")
The residual plot for the linear model looks good.
hist(rstudent(model1),xlab='Standardized Residuals', freq = FALSE, ylim = c(0,0.6),col='light blue'
,main='Standardized Residuals of Linear Model')
curve(dnorm(x, mean=mean(rstudent(model1)), sd=sd(rstudent(model1))), col="red", lwd=2, add=TRUE, yaxt="n")
Curve is a bit tend towards the left.
qqnorm(rstudent(model1), main="QQ plot of linear Model")
qqline(rstudent(model1), col = 2, lwd = 1, lty = 2)
#Shapiro-Wilk normality test
shapiro.test(rstudent(model1))
##
## Shapiro-Wilk normality test
##
## data: rstudent(model1)
## W = 0.98733, p-value = 0.5372
Shapiro-Wilk test shows a p-value of 0.5372 since the null hypothesis is not rejected which states that the stochastic component of the linear model is normally distributed.
t= time(ozlayer)
t2 = t^2
Model2 = lm(ozlayer~t+t2)
summary(Model2)
##
## Call:
## lm(formula = ozlayer ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1062 -1.2846 -0.0055 1.3379 4.2325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.733e+03 1.232e+03 -4.654 1.16e-05 ***
## t 5.924e+00 1.250e+00 4.739 8.30e-06 ***
## t2 -1.530e-03 3.170e-04 -4.827 5.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared: 0.7391, Adjusted R-squared: 0.7331
## F-statistic: 123.3 on 2 and 87 DF, p-value: < 2.2e-16
Looking at the summary we see that the quadratic model is significant and also the R-squared value is good.
#Plot the Quadratic model
plot(ts(fitted(Model2)), ylim = c(min(c(fitted(Model2), as.vector(ozlayer))), max(c(fitted(Model2),as.vector(ozlayer)))),ylab='y' , main = "Quadratic Model")
lines(as.vector(ozlayer),type="o")
The quadratic curve observed in the graph seems to have a good fit for the model. Points are equally distributed on both sides.
plot(y=rstudent(Model2),x=as.vector(time(ozlayer)), xlab='Time',
ylab='Residuals ',type='o',
main = "Time series Residual plot for Quadratic Model")
If compare the quadratic residual plot with the liner plot we find that the quadratic models look better than a linear model.
hist(rstudent(Model2),xlab='Standardized Residuals ', freq = FALSE, ylim = c(0,0.6),col = "light blue"
,main='Standardized Residuals of Quadratic Model')
curve(dnorm(x, mean=mean(rstudent(Model2)), sd=sd(rstudent(Model2))), col="red", lwd=2, add=TRUE, yaxt="n")
From the graph, we can see that the quadratic fits perfectly as compared to the linear graph.
qqnorm(rstudent(Model2), main="QQ plot of Quadratic Model")
qqline(rstudent(Model2), col = 2, lwd = 1, lty = 2)
From the plots, we can see that the quadratic plot fits the normal fit.
shapiro.test(rstudent(Model2))
##
## Shapiro-Wilk normality test
##
## data: rstudent(Model2)
## W = 0.98889, p-value = 0.6493
Shapiro-Wilk test shows a p-value of 0.6493 since the null hypothesis is not rejected which states that the stochastic component of the linear model is normally distributed.
har.=harmonic(ozlayer,0.45)
model3=lm(ozlayer~har.)
summary(model3)
##
## Call:
## lm(formula = ozlayer ~ har.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3520 -1.8906 0.4837 2.3643 6.4248
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.970e+00 4.790e-01 -6.199 1.79e-08 ***
## har.cos(2*pi*t) NA NA NA NA
## har.sin(2*pi*t) 5.462e+11 7.105e+11 0.769 0.444
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.522 on 88 degrees of freedom
## Multiple R-squared: 0.006672, Adjusted R-squared: -0.004616
## F-statistic: 0.5911 on 1 and 88 DF, p-value: 0.4441
The summary gives a clear view that the harmonic model is not significant to the dataset as the p-value is greater than 0.05. Hence harmonic is not a good fit.
month.=season(ozlayer1) # period added to improve table display and this line sets up indicators
model4=lm(ozlayer1~month.-1) # -1 removes the intercept term
summary(model4)
##
## Call:
## lm(formula = ozlayer1 ~ month. - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3605 -1.7335 0.5545 2.3792 6.6261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## month.Season-1 -3.2189 0.3682 -8.742 1.72e-15 ***
## month.Season-2 -3.1169 0.3703 -8.418 1.26e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.493 on 177 degrees of freedom
## Multiple R-squared: 0.4542, Adjusted R-squared: 0.448
## F-statistic: 73.64 on 2 and 177 DF, p-value: < 2.2e-16
From the summary, we see that the p-value is significant and seasonality model can fit the dataset.Â
res.model4 = rstudent(model4)
par(mfrow=c(2,2))
plot(y = res.model4, x = as.vector(time(ozlayer1)),xlab = 'Time', ylab='Standardized Residuals',type='l',main = "Standardised residuals from seasonal model.")
points(y=res.model4,x=time(ozlayer1), pch=as.vector(season(ozlayer1)))
hist(res.model4,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
qqnorm(y=res.model4, main = "QQ plot of standardised residuals.")
qqline(y=res.model4, col = 2, lwd = 1, lty = 2)
shapiro.test(res.model4)
##
## Shapiro-Wilk normality test
##
## data: res.model4
## W = 0.95603, p-value = 2.223e-05
From the graphs, we see that there is changing variance in the model. The histogram is plotted is tend towards the left. Qqplot doesn’t have a good fit. Shapiro-Wilk test shows a p-value of less than 0.5 and the null hypothesis is rejected which states that the seasonal model is not normally distributed.
Carefully considering and exploring the linear model, quadratic model, Harmonic model, and Seasonal model we can confirm that the Quadratic model is the best fit for the time series. Hence we can use the Quadratic model for forecasting.
t = time(ozlayer)
t2 = t^2
Year = c(2017, 2018, 2019, 2020, 2021)
forecasts = predict(Model2, data.frame(t = Year, t2 = Year^2, 1), interval = "prediction")
forecasts
## fit lwr upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701
plot(ozlayer, xlim = c(1927,2021), ylim = c(-16,4), ylab = "Thickness of ozone Layer ", main = "Thickness of ozone Layer along with forecasts")
lines(ts(as.vector(forecasts[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecasts[,2]), start = 2017), col="blue", type="l")
lines(ts(as.vector(forecasts[,3]), start = 2017), col="blue", type="l")
legend("topright", lty=1, pch=1, col=c("black","blue","red"), text.width = 25 ,c("Data","5% forecast limits", "Forecasts"))
ozlayerAcf <- acf(ozlayer,plot = FALSE)
plot(ozlayerAcf, main = "Time Series ACF")
ozlayerPacf <- pacf(ozlayer,plot = FALSE)
plot(ozlayerPacf, main = "Time Series PACF")
We see a slow decaying pattern in ACF which implies the existence of trend and nonstationarity. There is no trend observed in the PACF plot. Hence, non-stationarity is assumed.
#ADF check
adf.test(ozlayer)
##
## Augmented Dickey-Fuller Test
##
## data: ozlayer
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary
From ADF we get the p-value which is greater than 0.05 that is 0.0867 and hence null hypothesis cannot be rejected. This states that the series is non-stationary.
#Converting the data to linear because most of the values are negative and cannot perform boxcox transformation
ozlayer<- ozlayer+12.57941376
ozlayer.transform = BoxCox.ar(ozlayer)
ozlayer.transform$ci
## [1] 0.9 1.5
#check the converted data
lambda = 1.2
ozlayer.BC = (ozlayer^lambda-1)/lambda
qqnorm(ozlayer.BC)
qqline(ozlayer.BC)
shapiro.test(ozlayer.BC)
##
## Shapiro-Wilk normality test
##
## data: ozlayer.BC
## W = 0.96644, p-value = 0.01995
#Transformed data
ozlayer.BC.Diff = diff(ozlayer.BC)
plot(ozlayer.BC.Diff,type='o')
adf.test(ozlayer.BC.Diff,k=0)
## Warning in adf.test(ozlayer.BC.Diff, k = 0): p-value smaller than printed p-
## value
##
## Augmented Dickey-Fuller Test
##
## data: ozlayer.BC.Diff
## Dickey-Fuller = -8.6491, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
pp.test(ozlayer.BC.Diff)
## Warning in pp.test(ozlayer.BC.Diff): p-value smaller than printed p-value
##
## Phillips-Perron Unit Root Test
##
## data: ozlayer.BC.Diff
## Dickey-Fuller Z(alpha) = -69.696, Truncation lag parameter = 3, p-value
## = 0.01
## alternative hypothesis: stationary
From the above tests, we can see that the p-value is less than 0.1 and no trend in the data. We can say that the data is stationary now.
#Final Data plot
plot(ozlayer.BC.Diff,ylab='ozlayer layer thickness',xlab='Year',type='o', main = "Time series plot of ozlayer layer thickness")
ozlayerdiffacf <- acf(ozlayer.BC.Diff,plot = FALSE)
plot(ozlayerdiffacf, main = "ACF of Differencing Ozone Thickness")
ACF shows that there are spicks at 3,4,7 and 10. But considering the first spicks 3 and 4 as they are one after another and the biggest one. So MA(q) is MA(3) and MA(4).
ozlayerdiffpacf <- pacf(ozlayer.BC.Diff,plot = FALSE)
plot(ozlayerdiffpacf, main = "PACF of Differencing Ozone Thickness")
PACF shows that there are spicks at 3,4 and 6. But considering the first spicks 3 and 4 as they are one after another as and biggest one. So AR(p) are AR(3) and AR(4).
eacf(ozlayer.BC.Diff,ar.max = 6, ma.max = 6)
## AR/MA
## 0 1 2 3 4 5 6
## 0 o o x x o o x
## 1 x o x o o o x
## 2 o o x o o o x
## 3 x o x o o x o
## 4 x o o x o x o
## 5 x x x x o x o
## 6 o o o x o o o
In EACF there is no clear vertex we can take the row corresponding to p=1 as vertex and  ARIMA(1,1,3), ARIMA(2,1,3) and ARIMA(1,1,4) are in the set of possible models.
#BIC table
plot(res)
In the BIC table, shaded columns correspond to AR(10) to AR(7) and MA(10) coefficients and from here we can include ARIMA(10,1,10) and ARIMA(7,2,10) are the set of possible models.
In conclusion the autocorrelation function, partial autocorellation function ,extended autocorellation function and Bayesian information criterion gives the values of p,q and differncings d to form set of candidate models {ARIMA(3,1,3),ARIMA(4,1,4),ARIMA(1,1,4),ARIMA(2,1,3), ARIMA(1,1,3),ARIMA(10,1,10), ARIMA(7,2,10)}.