Importing the Libraries
library(forecast)
library(tseries)
library(TSA)
Reading the file
setwd("E:/Great Lakes/Time Series/Assignment")
TSA1 <- read.csv("beer.csv",header = TRUE)
attach(TSA1)
Check for the Frequency using periodgram
p <- periodogram(OzBeer)
dd <- data.frame(freq=p$freq,spec=p$spec)
order <- dd[order(-dd$spec),]
top2 <- head(order,2)
tim <- 1/top2$freq
The spikes are at the frequency 0.01 and 0.25, if we consider 0.01 as the frequency then the equivalent time is (1/0.01 = 72), which is the entire dataset.
The spike at the frequency 0.25 and its equivalent time is (1/0.25 = 4). This makes more sensible as the data given is quaterly beer sales data.
Using the periodgram identified the frequency as ‘4’
Creating a Time series using the frequency which was identified using the periodgram
timeseries <- ts(TSA1,frequency = 4)
Checking the data wheteher its a Timeseries data or not
is.ts(timeseries)
## [1] TRUE
Plotting the Time series
plot.ts(timeseries,col = "blue",main = "Sales data time series with time stamp")
abline(reg=lm(timeseries~ time(timeseries)),col="red")
There is a linear trend in the time series as the time passes by and also the seasonality is getting repeated for every 4 quarters. The magnitude of the seasonal conponent is almost constant.
To test for stationary, doing KPSS test
kpss.test(diff(log(timeseries)))
## Warning in kpss.test(diff(log(timeseries))): p-value greater than printed
## p-value
##
## KPSS Test for Level Stationarity
##
## data: diff(log(timeseries))
## KPSS Level = 0.02596, Truncation lag parameter = 1, p-value = 0.1
The P-value is 0.1, as the p- value is greater than 0.05, donot reject the null hypothesis,hence mathematically data is stationary.
Running ARIMA with different p and q values As there is trend and seasonality in the data, we can use HOLT WINTER’s Method to forecast the time series data.
As the magnitude of the seasonality pattern is constant, we can use HOLT WINTER’s Additive method.
Running Holt Winter’s additive method by taking logarithm on the timeseries data
fit <- hw(timeseries,seasonal = "additive")
plot.ts(timeseries,col = "blue",main = "Sales data time series with time stamp")
lines(fitted(fit),col="red")
Interpereting the results
fit$model
## Holt-Winters' additive method
##
## Call:
## hw(y = timeseries, seasonal = "additive")
##
## Smoothing parameters:
## alpha = 0.0395
## beta = 0.0395
## gamma = 0.1854
##
## Initial states:
## l = 258.2137
## b = 0.0239
## s=55.077 -23.9983 -37.4121 6.3335
##
## sigma: 9.738
##
## AIC AICc BIC
## 653.6687 656.5719 674.1587
From the above results, we are able to find the overall smoothing parmater (alpha), trend smoothing parameter(beta) and seasonal smoothing parameter(gamma). The initial values of level, trend and seasonality is interpreted.
The variance part for the error term is also identified as sigma = 0.0299
The AIC for additvie model is -179.6.
Forecasting for the next 2 years
forecast(fit,8)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 19 Q1 463.6812 451.2014 476.1609 444.5951 482.7672
## 19 Q2 416.0852 403.5665 428.6039 396.9395 435.2309
## 19 Q3 438.6314 426.0255 451.2373 419.3523 457.9105
## 19 Q4 536.0253 523.2657 548.7849 516.5112 555.5394
## 20 Q1 484.0115 470.3853 497.6376 463.1721 504.8509
## 20 Q2 436.4155 422.4713 450.3597 415.0897 457.7413
## 20 Q3 458.9617 444.5959 473.3275 436.9911 480.9323
## 20 Q4 556.3556 541.4571 571.2541 533.5704 579.1409
The forescasted values will show the point forecast values and also the range with low and high values as 80% and 95% confidence intervals respectively.
plot(forecast(fit,8))
The plot shows the forecasted values, the forecasted values are blurred this indicates that the forecasted values will vary within the blur part.
Decomposing the model into ‘Level’, ‘Trend’ and ‘Seasonality’
states <- fit$model$states[,1:3]
colnames(states) <- cbind('Level','Trend','Seasonality')
plot(states,col="blue",main="Decomposing the Forecast")
Auto Correlation plots
pacf <- pacf(diff(log(timeseries)))
From PACF we identified the spike comes to zero after the first lag in time. Hence the AR(p) value may be 1
acf <- acf(diff(log(timeseries)))
From ACF plot we identied the MA(q) process value will be 1 or 2 or 3 or 4
Running the Seasonal ARIMA model with different p and q values.
fit <- arima(log(timeseries),c(0,1,2),seasonal = list(order = c(0,1,2),period = 4))
fit1 <- arima(log(timeseries),c(0,1,1),seasonal = list(order = c(0,1,1),period = 4))
fit2 <- arima(log(timeseries),c(1,1,0),seasonal = list(order = c(1,1,0),period = 4))
fit3 <- arima(log(timeseries),c(1,1,1),seasonal = list(order = c(1,1,1),period = 4))
fit4 <- arima(log(timeseries),c(1,1,2),seasonal = list(order = c(1,1,2),period = 4))
Checking for AIC values, the least AIC value will be considered as the best fit model.
fit
##
## Call:
## arima(x = log(timeseries), order = c(0, 1, 2), seasonal = list(order = c(0,
## 1, 2), period = 4))
##
## Coefficients:
## ma1 ma2 sma1 sma2
## -1.0651 0.3171 -0.7356 0.0226
## s.e. 0.1189 0.1220 0.1768 0.1590
##
## sigma^2 estimated as 0.0009847: log likelihood = 134.73, aic = -261.46
fit1
##
## Call:
## arima(x = log(timeseries), order = c(0, 1, 1), seasonal = list(order = c(0,
## 1, 1), period = 4))
##
## Coefficients:
## ma1 sma1
## -0.8076 -0.6669
## s.e. 0.0692 0.1297
##
## sigma^2 estimated as 0.001087: log likelihood = 131.49, aic = -258.99
fit2
##
## Call:
## arima(x = log(timeseries), order = c(1, 1, 0), seasonal = list(order = c(1,
## 1, 0), period = 4))
##
## Coefficients:
## ar1 sar1
## -0.6875 -0.4075
## s.e. 0.0983 0.1185
##
## sigma^2 estimated as 0.001545: log likelihood = 121.17, aic = -238.35
fit3
##
## Call:
## arima(x = log(timeseries), order = c(1, 1, 1), seasonal = list(order = c(1,
## 1, 1), period = 4))
##
## Coefficients:
## ar1 ma1 sar1 sma1
## -0.3666 -0.7077 0.0120 -0.6707
## s.e. 0.1449 0.1041 0.1987 0.1603
##
## sigma^2 estimated as 0.0009943: log likelihood = 134.55, aic = -261.1
fit4
##
## Call:
## arima(x = log(timeseries), order = c(1, 1, 2), seasonal = list(order = c(1,
## 1, 2), period = 4))
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1 sma2
## -0.1574 -0.9141 0.1799 -0.8770 0.2945 -0.7051
## s.e. 0.3873 0.3655 0.3056 0.1282 0.1993 0.1648
##
## sigma^2 estimated as 0.0009298: log likelihood = 135.34, aic = -258.68
When p value(Auto regressive term) is 0 and the q value (error term ) is 2, the AIC value -261.46 is at the least. Hence considering this model as the best fit for the quaterly beer sales data.
Fitting the residuals, so that the spikes are below the significant dotted line.
acf(residuals(fit))
pacf(residuals(fit))
As there is no spikes above the significant line, the model has fitted correctly.
Evaluating Residuals
Ljung-Box test is carried out on residuals to see that after fitting the model what remains is actually the residuals. Null Hypothesis: Data is independently distributed Alternative Hypothesis: Data shows serial correlation
Box.test(residuals(fit),lag=4,type="Ljung")
##
## Box-Ljung test
##
## data: residuals(fit)
## X-squared = 0.15927, df = 4, p-value = 0.997
P value is >0.05 so donot reject the null hypothesis,henceforth “Data is independently distributed”
Forecasting
pred <- predict(fit, n.ahead = 8)
ts.plot(timeseries,2.718^pred$pred, lty = c(1,3),ylabel ="beer",main="Forecasted Series")
## Warning in xy.coords(x = matrix(rep.int(tx, k), ncol = k), y = x, log =
## log, : NAs introduced by coercion
## Warning in xy.coords(x, y): NAs introduced by coercion