The objective of this assignment is to find the best fitting model among linear, quadratic, cyclic,seasonal and cosine and then predict the values of ozone layer thickness for the next five years using the best fitting model.
The ozone dataset is a csv file which contains 90 observations of the ozone layer thickness from 1927 to 2016. The dataset includes both negative and positive values. Negative value means decrease in the ozone layer thickness while postive value means increase in the ozone layer thickness.
library(TSA)
library(readr)
library(tseries)
library(knitr)
#reading the dataset
ozone <- read.csv("C:/Users/mafza/Downloads/TIMESERIES/assignmnt_1/ozone.csv", header= FALSE)
rownames(ozone) <- seq(from=1927, to=2016)
#checking the class of object
class(ozone)
## [1] "data.frame"
After reading the ozone dataset, i converted the datatype of our ozone dataset from dataframe to timeseries.
# Converting it to the Time series object
ozone_ts <- ts(as.vector(ozone), start=1927, end=2016, frequency = 1)
#checking the class of converted object
class(ozone_ts)
## [1] "ts"
plot(ozone_ts,type='o',ylab='ozone', main='Time series plot of Ozone Layer Series')
To find the impact of previous year’s ozone layer thickness on the next year’s, we need to calculate the value of pearson correlaton between ozone series and its first lag.
y = ozone_ts
x = zlag(ozone_ts)
index = 2:length(x)
cor(y[index],x[index])
## [1] 0.8700381
plot(y[index],x[index],ylab='Ozone series', xlab='First lag of ozone series', main = 'Scatter plot of ozone series and its first lag')
The pearson correlation value = 0.87 and the scatterplot indicates that there is strong correlation between ozone series and its first lag.
# Fitting the linear model
t <- time(ozone_ts)
t
## Time Series:
## Start = 1927
## End = 2016
## Frequency = 1
## [1] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
## [16] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
## [31] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
## [46] 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
## [61] 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## [76] 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
linear_model<- lm(ozone_ts ~ t) # label the model as linear_model
summary(linear_model)
##
## Call:
## lm(formula = ozone_ts ~ t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7165 -1.6687 0.0275 1.4726 4.7940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.720155 16.257158 13.15 <2e-16 ***
## t -0.110029 0.008245 -13.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.032 on 88 degrees of freedom
## Multiple R-squared: 0.6693, Adjusted R-squared: 0.6655
## F-statistic: 178.1 on 1 and 88 DF, p-value: < 2.2e-16
plot(ozone_ts,ylab='ozone',xlab='Year',type='o', main = "Fitted linear model to the ozone series")
abline(linear_model)
# Residual analysis of linear model
plot(y=rstudent(linear_model),x=as.vector(time(ozone_ts)), xlab='Time',
ylab='Standardized Residuals',type='o', main = "Residual plot")
plot(y=rstudent(linear_model),x=as.vector(time(ozone_ts)), xlab='Time',
ylab='Standardized Residuals',type='p', main = "standardized residuals vs fitted.")
abline(h=0)
hist(rstudent(linear_model),xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
y = rstudent(linear_model)
qqnorm(y, main = "QQ plot of standardised residuals.")
qqline(y, col = 2, lwd = 1, lty = 2)
acf(rstudent(linear_model), main = "ACF of standardized residuals.")
y = rstudent(linear_model)
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.98733, p-value = 0.5372
Residual Plot: The reisdual plot shows that the residuals are between the bound of -3 to +3.There are also some spkies which is not desirable in a residual plot.The residuals are hanging around the mean level.
Residual vs Fitted Plot: The residual vs fitted plot shows that there are some residuals outside the bound of -2 to 2. Also the residuals are not spread evenly.
Histogram: we want to see a symmetrical distribution in the histogram plot of standardised residuals. And in our case the histogram looks nearly symmetrical.
qqnorm: in qqnorm plot the residuals follow the normal distribution except the tails, as most of the residuals lie closely in a normal fitted line.
shapiro wilk test: Ho: data are normally distributed. Thus p-value = 0.5372 greater than 0.05 indicates that we are failed to reject the null hypothesis, hence the data are normally distributed.
acf plot: in acf plot we want all the lags under the confidence interval line, in our case there 3 significant lags which means there is autocorrelation left in residuals, in other words there is valuable information left in residuals which our linear model is unable to capture.
# Fit the quadratic model
t= time(ozone_ts)
t2 = t^2
quadratic_model = lm(ozone_ts ~ t + t2) # label the model as quadratic_model
summary(quadratic_model)
##
## Call:
## lm(formula = ozone_ts ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1062 -1.2846 -0.0055 1.3379 4.2325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.733e+03 1.232e+03 -4.654 1.16e-05 ***
## t 5.924e+00 1.250e+00 4.739 8.30e-06 ***
## t2 -1.530e-03 3.170e-04 -4.827 5.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.815 on 87 degrees of freedom
## Multiple R-squared: 0.7391, Adjusted R-squared: 0.7331
## F-statistic: 123.3 on 2 and 87 DF, p-value: < 2.2e-16
plot(ts(fitted(quadratic_model)), ylim = c(min(c(fitted(quadratic_model), as.vector(ozone_ts))), max(c(fitted(quadratic_model),as.vector(ozone_ts)))),
ylab='y' , main = "Fitted quadratic curve to ozone series", type="l",lty=2,col="red")
lines(as.vector(ozone_ts),type="o")
# Residual analysis of Quadratic model
plot(y=rstudent(quadratic_model),x=as.vector(time(ozone_ts)), xlab='Time',
ylab='Standardized Residuals',type='o', main = "Residual plot.")
plot(y=rstudent(quadratic_model),x=as.vector(time(ozone_ts)), xlab='Time',
ylab='Standardized Residuals',type='p', main = "standardized residuals vs fitted")
abline(h=0)
hist(rstudent(quadratic_model),xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
y = rstudent(quadratic_model)
qqnorm(y, main = "QQ plot of standardised residuals.")
qqline(y, col = 2, lwd = 1, lty = 2)
acf(rstudent(quadratic_model), main = "ACF of standardized residuals.")
y = rstudent(quadratic_model)
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.98889, p-value = 0.6493
Residual Plot: The reisdual plot shows that the residuals are between the bound of -3 to +3.There are also some spkies which is not desirable in a residual plot. The residuals are hanging around the mean level.
Residual vs Fitted Plot: The residual vs fitted plot shows that there are some residuals outside the bound of -2 to +2. Also the residuals are not spread evenly.
Histogram: we want to see a symmetrical distribution in the histogram plot of standardised residuals. And in our case the histogram looks left skewed..
qqnorm: in qqnorm plot the residuals follow the normal distribution except the tails, as most of the residuals lie closely in a normal fitted line.
shapiro wilk test: Ho: data are normally distributed. Thus p-value = 0.6493 greater than 0.05 indicates that we are failed to reject the null hypothesis, hence the data are normally distributed.
acf plot: in acf plot we want all the lags under the confidence interval line, in our case there more than 2 significant lags which means there is autocorrelation left in residuals, in other words there is valuable information left in residuals which our quadratic model is unable to capture.
# Fit the Harmonic Model
har. <- harmonic(ozone_ts, 0.45)
harmonic_model <- lm(ozone_ts ~ har.)
summary(harmonic_model)
##
## Call:
## lm(formula = ozone_ts ~ har.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3520 -1.8905 0.4837 2.3643 6.4248
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.970e+00 4.790e-01 -6.199 1.79e-08 ***
## har.cos(2*pi*t) NA NA NA NA
## har.sin(2*pi*t) 5.462e+11 7.105e+11 0.769 0.444
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.522 on 88 degrees of freedom
## Multiple R-squared: 0.006672, Adjusted R-squared: -0.004616
## F-statistic: 0.5911 on 1 and 88 DF, p-value: 0.4441
plot(ts(fitted(harmonic_model)), ylab='y', main = "Fitted cosine wave to ozone series.",
ylim = c(min(c(fitted(harmonic_model), as.vector(ozone_ts))) ,
max(c(fitted(harmonic_model), as.vector(ozone_ts)))
), col = "green" )
lines(as.vector(ozone_ts),type="o")
# Residual analysis of Harmonic model
plot(y=rstudent(harmonic_model),x=as.vector(time(ozone_ts)), xlab='Time',
ylab='Standardized Residuals',type='o', main = "Time series plot of standardized residuals.")
plot(y=rstudent(harmonic_model),x=as.vector(time(ozone_ts)), xlab='Time',
ylab='Standardized Residuals',type='p', main = "standardized residuals vs fitted.")
abline(h=0)
hist(rstudent(harmonic_model),xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
y = rstudent(harmonic_model)
qqnorm(y, main = "QQ plot of standardised residuals.")
qqline(y, col = 2, lwd = 1, lty = 2)
acf(rstudent(harmonic_model), main = "ACF of standardized residuals.")
y = rstudent(harmonic_model)
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.95856, p-value = 0.005875
Residual Plot: The reisdual plot shows that the residuals are between the bound of -3 to +3.There is also a downward trend which is not desirable in a residual plot.
Residual vs Fitted Plot: The residual vs fitted plot shows that there are some residuals outside the bound of -2 to +2. Also the residuals are not spread evenly.
Histogram: we want to see a symmetrical distribution in the histogram plot of standardized residuals. And in our case the histogram looks skewed.
qqnorm: in qqnorm plot the residuals doesn’t follow the normal distribution , as most of the residuals doesn’t lie closely in a normal fitted line.
shapiro wilk test: Ho: data are normally distributed. Thus p-value = 0.005875 which is less than than 0.05 indicates that we reject the null hypothesis, hence the data are not normally distributed.
acf plot: in acf plot we want all the lags under the confidence interval line, in our case there is a wave pattern which means there is autocorrelation left in residuals, in other words there is valuable information left in residuals which our harmonic model is unable to capture.
From the summary statistics of all the models we can conclude that quadratic model is the best best model, as its R-squared value = 0.7391 is greater as compared to linear model which has R-sqaured value = 0.6693 which means quadratic model has higher variation than linear model. Also standard error = 1.815 of quadratic model is less than as compared to linear model which has higher standard error value = 2.032. So quadratic model is the best model as it has higher variation and less residual standard error.
#forecasting quadratic model
h <- 4
t <- time(ozone_ts)
t2 <- t^2
aheadTimes <- data.frame(t = seq(2017, 2017+h, 1),
t2 = seq(2017, 2017+h, 1)^2)
forecasting_quadratic_model <- predict(quadratic_model, newdata = aheadTimes, interval = "prediction")
print(forecasting_quadratic_model)
## fit lwr upr
## 1 -10.34387 -14.13556 -6.552180
## 2 -10.59469 -14.40282 -6.786548
## 3 -10.84856 -14.67434 -7.022786
## 4 -11.10550 -14.95015 -7.260851
## 5 -11.36550 -15.23030 -7.500701
plot(ozone_ts, xlim= c(1927,2021) ,ylim = c(-20,10),
ylab = "Ozone-Layer series",
main = "Forecasts from the quadratic model fitted to the Ozone series.")
lines(ts(as.vector(forecasting_quadratic_model[,3]), start = 2017), col="blue", type="l")
lines(ts(as.vector(forecasting_quadratic_model[,1]), start = 2017), col="red", type="l")
lines(ts(as.vector(forecasting_quadratic_model[,2]), start = 2017), col="blue", type="l")
legend("topleft", lty=1, pch=1, col=c("black","blue","red"),
text.width = 18,
c("Data","5% forecast limits", "Forecasts"))
The graph indicates that there is a decrease in ozone layer thickness predicted in the next five years.
plot(ozone_ts,type='o',ylab='ozone', main='Time series plot of ozone layer series')
par(mfrow=c(1,2))
acf(ozone_ts,main="ACF plot for Ozone Layer series.")
pacf(ozone_ts,main="PACF plot for Ozone Layer series.")
There is a slowly decaying pattern in acf plot and very high first correlation in pacf plot which indicates that there is a trend in our series and our series is non-stationary.
adf.test(ozone_ts, alternative = c("stationary"))
##
## Augmented Dickey-Fuller Test
##
## data: ozone_ts
## Dickey-Fuller = -3.2376, Lag order = 4, p-value = 0.0867
## alternative hypothesis: stationary
Ho: Non-stationary Ha: Stationary Our p-value = 0.0867 which is greater than 0.05, so we fail to reject our null hypothesis and hence the series is non-stationary.
ozone_ts_posi = ozone_ts + abs(min(ozone_ts))+1
BC = BoxCox.ar(ozone_ts_posi )
BC$ci
## [1] 0.9 1.5
lambda <- BC$lambda[which(max(BC$loglike) == BC$loglike)]
lambda
## [1] 1.2
ozone_bc = (ozone_ts_posi^lambda-1)/lambda
plot(ozone_bc,type='o',ylab='Simulated series', main = "Box Cox transformed series.")
qqnorm(ozone_bc)
qqline(ozone_bc, col = 4)
shapiro.test(ozone_bc)
##
## Shapiro-Wilk normality test
##
## data: ozone_bc
## W = 0.96644, p-value = 0.01995
dif = diff(ozone_ts)
par(mfrow=c(1,1))
plot(dif,type='o',ylab='Average weekly earnings', main = "Time series plot of the first differenced ozone series.")
adf.test(dif)
##
## Augmented Dickey-Fuller Test
##
## data: dif
## Dickey-Fuller = -7.1568, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
Ho: Non-stationary Ha: Stationary Our p-value = 0.01 which is less than 0.05, so we reject our null hypothesis and hence after the first differencing on the original series, the series has become stationary.
par(mfrow=c(1,2))
acf(dif,main="ACF of first difference.")
pacf(dif,main="PACF of first difference.")
eacf(dif)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o x o o o x o o x o o o o
## 1 x o x o o o o o o x o o o o
## 2 x o x o o o x o o x o o o o
## 3 x o x o o x o o o o o o o o
## 4 x o o x o x o o o o o o o o
## 5 x x x x o x o o o o o o o o
## 6 o o o x x o o o o o o o o o
## 7 o o o x o o o o o o o o o o
The top-left “o” can be found at p=0 and q=0 which shows a clear vertex. The neibouring ARMA models are (p=0,q=1) and (p=1, q=1). So the possible sets of arima models from eacf are ARIMA(0,1,0), ARIMA(0,1,1), ARIMA(1,1,1)
I am setting the values of nar and nma parameters as 6 because in acf, pacf and eacf plots all the ARMA values are under 6.
res2 = armasubsets(dif , nar=6 , nma=6, y.name='ozone', ar.method='ols')
plot(res2)
From the bic table AR(3) and AR(4) and MA(2). So the possible arima models from bic table are ARIMA(3,1,2), ARIMA(4,1,2)
So the possible set of arima models from acf, pacf, eacf and bic are: