Assignment1 Time Series Analysis

Task1

Importing dataset

ozlayer <- read.csv("/Users/shubhamchougule/Downloads/data1.csv", header = FALSE)
#Attach the year data from 1927 to 2016
rownames(ozlayer) <- seq(from=1927, to=2016)
#Giving title to the column
colnames(ozlayer) <- c("Thickness")
ozlayer1 <- ts(as.vector(ozlayer), start=1927, end=2016,frequency = 2) 

Changing the dataset to time series

ozlayer <- ts(as.vector(ozlayer), start=1927, end=2016,frequency = 1) 

Plot the time series dataset

plot(ozlayer,type='o',xlab= "Year", ylab='Thickness Difference')

From the above Graph of the Yearly thickness difference in the ozone layer, we can see that there is a change in the thickness of the ozone layer negatively that is a downward fall. No repeating pattern observed that is no trend observed. There is no seasonality in the graph. There is no intervention observed. There is no change in the varience. No auto correlation observed.

Comparision of the thikness of ozone to the pervious year thikness

plot(y=ozlayer,x=zlag(ozlayer),ylab='Thickness', xlab='Last Year Thickness')

The given plot shows their is there is quite change in the thickness negatively.

#Calculating the correlation
b = ozlayer
a = zlag(ozlayer)
index = 2:length(a)
cor(b[index],a[index])
## [1] 0.8700381

The correlation of the data obtained above is very high between the years.

Investigate the best fitting model

Model 1 to fit the Linear Regression.

model1=lm(formula = ozlayer ~ time(ozlayer))
summary(model1)
## 
## Call:
## lm(formula = ozlayer ~ time(ozlayer))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7165 -1.6687  0.0275  1.4726  4.7940 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   213.720155  16.257158   13.15   <2e-16 ***
## time(ozlayer)  -0.110029   0.008245  -13.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared:  0.6693, Adjusted R-squared:  0.6655 
## F-statistic: 178.1 on 1 and 88 DF,  p-value: < 2.2e-16

From the summary, we see that amount of change in the thickness per year is significant.

#Plot model 1
plot(ozlayer,type='o',ylab='y',main = "Linear Model")
abline(model1, col = 'red', lty=1)

In the above graph, there is a linear model red line passing, but the fit observed is not good. Most of the points lie above the line and very less below which are not distributed equally. Hence liner model does not fit ozone data.

Residual Analysis for Linear Model

Time Series Residual Plot for the model

plot(y=rstudent(model1),x=as.vector(time(ozlayer)), xlab='Time',
     ylab='Residuals',type='o', 
     main = "Time series Residual plot for Linear Model")

The residual plot for the linear model looks good.

Histograms for the selected Models

hist(rstudent(model1),xlab='Standardized Residuals', freq = FALSE, ylim = c(0,0.6),col='light blue'
     ,main='Standardized Residuals of Linear Model')
curve(dnorm(x, mean=mean(rstudent(model1)), sd=sd(rstudent(model1))), col="red", lwd=2, add=TRUE, yaxt="n")

Curve is a bit tend towards the left.

Q-Q plots of the selected models

qqnorm(rstudent(model1), main="QQ plot of linear Model")
qqline(rstudent(model1), col = 2, lwd = 1, lty = 2)

#Shapiro-Wilk normality test
shapiro.test(rstudent(model1))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstudent(model1)
## W = 0.98733, p-value = 0.5372

Shapiro-Wilk test shows a p-value of 0.5372 since the null hypothesis is not rejected which states that the stochastic component of the linear model is normally distributed.

Model2 to fit for Quadratic

t= time(ozlayer)
t2 = t^2
Model2 = lm(ozlayer~t+t2)
summary(Model2)
## 
## Call:
## lm(formula = ozlayer ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1062 -1.2846 -0.0055  1.3379  4.2325 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.733e+03  1.232e+03  -4.654 1.16e-05 ***
## t            5.924e+00  1.250e+00   4.739 8.30e-06 ***
## t2          -1.530e-03  3.170e-04  -4.827 5.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared:  0.7391, Adjusted R-squared:  0.7331 
## F-statistic: 123.3 on 2 and 87 DF,  p-value: < 2.2e-16

Looking at the summary we see that the quadratic model is significant and also the R-squared value is good.

#Plot the Quadratic model
plot(ts(fitted(Model2)), ylim = c(min(c(fitted(Model2), as.vector(ozlayer))), max(c(fitted(Model2),as.vector(ozlayer)))),ylab='y' , main = "Quadratic Model")
lines(as.vector(ozlayer),type="o")

The quadratic curve observed in the graph seems to have a good fit for the model. Points are equally distributed on both sides.

Residual Analysis of selected quadratic Model

Time Series Plots for the model

plot(y=rstudent(Model2),x=as.vector(time(ozlayer)), xlab='Time',
     ylab='Residuals ',type='o', 
     main = "Time series Residual plot for  Quadratic Model")

If compare the quadratic residual plot with the liner plot we find that the quadratic models look better than a linear model.

Histograms for the selected Model

hist(rstudent(Model2),xlab='Standardized Residuals ', freq = FALSE, ylim = c(0,0.6),col = "light blue"
     ,main='Standardized Residuals of Quadratic Model')
curve(dnorm(x, mean=mean(rstudent(Model2)), sd=sd(rstudent(Model2))), col="red", lwd=2, add=TRUE, yaxt="n")

From the graph, we can see that the quadratic fits perfectly as compared to the linear graph.

Q-Q plots of the selected models

qqnorm(rstudent(Model2), main="QQ plot of Quadratic Model")
qqline(rstudent(Model2), col = 2, lwd = 1, lty = 2)

From the plots, we can see that the quadratic plot fits the normal fit.

shapiro.test(rstudent(Model2))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstudent(Model2)
## W = 0.98889, p-value = 0.6493

Shapiro-Wilk test shows a p-value of 0.6493 since the null hypothesis is not rejected which states that the stochastic component of the linear model is normally distributed.

Model 3 to fit Harmonic model

har.=harmonic(ozlayer,0.45)
model3=lm(ozlayer~har.)
summary(model3)
## 
## Call:
## lm(formula = ozlayer ~ har.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3520 -1.8906  0.4837  2.3643  6.4248 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.970e+00  4.790e-01  -6.199 1.79e-08 ***
## har.cos(2*pi*t)         NA         NA      NA       NA    
## har.sin(2*pi*t)  5.462e+11  7.105e+11   0.769    0.444    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.522 on 88 degrees of freedom
## Multiple R-squared:  0.006672,   Adjusted R-squared:  -0.004616 
## F-statistic: 0.5911 on 1 and 88 DF,  p-value: 0.4441

The summary gives a clear view that the harmonic model is not significant to the dataset as the p-value is greater than 0.05. Hence harmonic is not a good fit.

Model 4 to fit Seasonal model

month.=season(ozlayer1) # period added to improve table display and this line sets up indicators
model4=lm(ozlayer1~month.-1) # -1 removes the intercept term
summary(model4)
## 
## Call:
## lm(formula = ozlayer1 ~ month. - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3605 -1.7335  0.5545  2.3792  6.6261 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## month.Season-1  -3.2189     0.3682  -8.742 1.72e-15 ***
## month.Season-2  -3.1169     0.3703  -8.418 1.26e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.493 on 177 degrees of freedom
## Multiple R-squared:  0.4542, Adjusted R-squared:  0.448 
## F-statistic: 73.64 on 2 and 177 DF,  p-value: < 2.2e-16

From the summary, we see that the p-value is significant and seasonality model can fit the dataset. 

Residual analysis of the seasonal model

res.model4 = rstudent(model4)
par(mfrow=c(2,2))
plot(y = res.model4, x = as.vector(time(ozlayer1)),xlab = 'Time', ylab='Standardized Residuals',type='l',main = "Standardised residuals from seasonal model.")
points(y=res.model4,x=time(ozlayer1), pch=as.vector(season(ozlayer1)))
hist(res.model4,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
qqnorm(y=res.model4, main = "QQ plot of standardised residuals.")
qqline(y=res.model4, col = 2, lwd = 1, lty = 2)
shapiro.test(res.model4)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.model4
## W = 0.95603, p-value = 2.223e-05

From the graphs, we see that there is changing variance in the model. The histogram is plotted is tend towards the left. Qqplot doesn’t have a good fit. Shapiro-Wilk test shows a p-value of less than 0.5 and the null hypothesis is rejected which states that the seasonal model is not normally distributed.

Interpretaion

Carefully considering and exploring the linear model, quadratic model, Harmonic model, and Seasonal model we can confirm that the Quadratic model is the best fit for the time series. Hence we can use the Quadratic model for forecasting.

Forecast

t = time(ozlayer)
t2 = t^2

Year = c(2017, 2018, 2019, 2020, 2021)

forecasts = predict(Model2, data.frame(t = Year, t2 = Year^2, 1), interval = "prediction")

forecasts
##         fit       lwr       upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701
plot(ozlayer, xlim = c(1927,2021), ylim = c(-16,4), ylab = "Thickness of ozone Layer ", main = "Thickness of ozone Layer along with forecasts")
lines(ts(as.vector(forecasts[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecasts[,2]), start = 2017), col="blue", type="l") 
lines(ts(as.vector(forecasts[,3]), start = 2017), col="blue", type="l") 
legend("topright", lty=1, pch=1, col=c("black","blue","red"), text.width = 25 ,c("Data","5% forecast limits", "Forecasts"))

Task2

ACF and PACF plots

ozlayerAcf <- acf(ozlayer,plot = FALSE)
plot(ozlayerAcf, main = "Time Series ACF")

ozlayerPacf <- pacf(ozlayer,plot = FALSE)
plot(ozlayerPacf, main = "Time Series PACF")

We see a slow decaying pattern in ACF which implies the existence of trend and nonstationarity. There is no trend observed in the PACF plot. Hence, non-stationarity is assumed.

#ADF check
adf.test(ozlayer)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ozlayer
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary

From ADF we get the p-value which is greater than 0.05 that is 0.0867 and hence null hypothesis cannot be rejected. This states that the series is non-stationary.

#Converting the data to linear because most of the values are negative and cannot perform boxcox transformation
ozlayer<- ozlayer+12.57941376
ozlayer.transform = BoxCox.ar(ozlayer)

ozlayer.transform$ci
## [1] 0.9 1.5
#check the converted data
lambda = 1.2
ozlayer.BC = (ozlayer^lambda-1)/lambda
qqnorm(ozlayer.BC)
qqline(ozlayer.BC)

shapiro.test(ozlayer.BC)
## 
##  Shapiro-Wilk normality test
## 
## data:  ozlayer.BC
## W = 0.96644, p-value = 0.01995
#Transformed data
ozlayer.BC.Diff = diff(ozlayer.BC)
plot(ozlayer.BC.Diff,type='o')

adf.test(ozlayer.BC.Diff,k=0)
## Warning in adf.test(ozlayer.BC.Diff, k = 0): p-value smaller than printed p-
## value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ozlayer.BC.Diff
## Dickey-Fuller = -8.6491, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
pp.test(ozlayer.BC.Diff)
## Warning in pp.test(ozlayer.BC.Diff): p-value smaller than printed p-value
## 
##  Phillips-Perron Unit Root Test
## 
## data:  ozlayer.BC.Diff
## Dickey-Fuller Z(alpha) = -69.696, Truncation lag parameter = 3, p-value
## = 0.01
## alternative hypothesis: stationary

From the above tests, we can see that the p-value is less than 0.1 and no trend in the data. We can say that the data is stationary now.

#Final Data plot
plot(ozlayer.BC.Diff,ylab='ozlayer layer thickness',xlab='Year',type='o', main = "Time series plot of ozlayer layer thickness")

ACF and PACF plot of Differencing

ozlayerdiffacf <- acf(ozlayer.BC.Diff,plot = FALSE)
plot(ozlayerdiffacf, main = "ACF of Differencing  Ozone Thickness")

ACF shows that there are spicks at 3,4,7 and 10. But considering the first spicks 3 and 4 as they are one after another and the biggest one. So MA(q) is MA(3) and MA(4).

ozlayerdiffpacf <- pacf(ozlayer.BC.Diff,plot = FALSE)
plot(ozlayerdiffpacf, main = "PACF of Differencing  Ozone Thickness")

PACF shows that there are spicks at 3,4 and 6. But considering the first spicks 3 and 4 as they are one after another as and biggest one. So AR(p) are AR(3) and AR(4).

EACF plot of differencing

eacf(ozlayer.BC.Diff,ar.max = 6, ma.max = 6)
## AR/MA
##   0 1 2 3 4 5 6
## 0 o o x x o o x
## 1 x o x o o o x
## 2 o o x o o o x
## 3 x o x o o x o
## 4 x o o x o x o
## 5 x x x x o x o
## 6 o o o x o o o

In EACF there is no clear vertex we can take the row corresponding to p=1 as vertex and  ARIMA(1,1,3), ARIMA(2,1,3) and ARIMA(1,1,4) are in the set of possible models.

#BIC table

plot(res)

In the BIC table, shaded columns correspond to AR(10) to AR(7) and MA(10) coefficients and from here we can include ARIMA(10,1,10) and ARIMA(7,2,10) are the set of possible models.

Conclusion

In conclusion the autocorrelation function, partial autocorellation function ,extended autocorellation function and Bayesian information criterion gives the values of p,q and differncings d to form set of candidate models {ARIMA(3,1,3),ARIMA(4,1,4),ARIMA(1,1,4),ARIMA(2,1,3), ARIMA(1,1,3),ARIMA(10,1,10), ARIMA(7,2,10)}.