TIME SERIES FORECASTING ON QUATERLY BEER SALES DATA

Importing the Libraries

library(forecast)
library(tseries)
library(TSA)

Reading the file

setwd("E:/Great Lakes/Time Series/Assignment")
TSA1 <- read.csv("beer.csv",header = TRUE)
attach(TSA1)

Check for the Frequency using periodgram

p <- periodogram(OzBeer)

dd <- data.frame(freq=p$freq,spec=p$spec)
order <- dd[order(-dd$spec),]
top2 <- head(order,2)
tim <- 1/top2$freq

The spikes are at the frequency 0.01 and 0.25, if we consider 0.01 as the frequency then the equivalent time is (1/0.01 = 72), which is the entire dataset.

The spike at the frequency 0.25 and its equivalent time is (1/0.25 = 4). This makes more sensible as the data given is quaterly beer sales data.

Using the periodgram identified the frequency as ‘4’

Creating a Time series using the frequency which was identified using the periodgram

timeseries <- ts(TSA1,frequency = 4)

Checking the data wheteher its a Timeseries data or not

is.ts(timeseries)

## [1] TRUE

Plotting the Time series

plot.ts(timeseries,col = "blue",main = "Sales data time series with time stamp")
abline(reg=lm(timeseries~ time(timeseries)),col="red")

There is a linear trend in the time series as the time passes by and also the seasonality is getting repeated for every 4 quarters. The magnitude of the seasonal conponent is almost constant.

To test for stationary, doing KPSS test

kpss.test(diff(log(timeseries)))

## Warning in kpss.test(diff(log(timeseries))): p-value greater than printed
## p-value

## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(log(timeseries))
## KPSS Level = 0.02596, Truncation lag parameter = 1, p-value = 0.1

The P-value is 0.1, as the p- value is greater than 0.05, donot reject the null hypothesis,hence mathematically data is stationary.

Running ARIMA with different p and q values As there is trend and seasonality in the data, we can use HOLT WINTER’s Method to forecast the time series data.

As the magnitude of the seasonality pattern is constant, we can use HOLT WINTER’s Additive method.

Running Holt Winter’s additive method by taking logarithm on the timeseries data

fit <- hw(timeseries,seasonal = "additive")
plot.ts(timeseries,col = "blue",main = "Sales data time series with time stamp")
lines(fitted(fit),col="red")

Interpereting the results

fit$model

## Holt-Winters' additive method 
## 
## Call:
##  hw(y = timeseries, seasonal = "additive") 
## 
##   Smoothing parameters:
##     alpha = 0.0395 
##     beta  = 0.0395 
##     gamma = 0.1854 
## 
##   Initial states:
##     l = 258.2137 
##     b = 0.0239 
##     s=55.077 -23.9983 -37.4121 6.3335
## 
##   sigma:  9.738
## 
##      AIC     AICc      BIC 
## 653.6687 656.5719 674.1587

From the above results, we are able to find the overall smoothing parmater (alpha), trend smoothing parameter(beta) and seasonal smoothing parameter(gamma). The initial values of level, trend and seasonality is interpreted.

The variance part for the error term is also identified as sigma = 0.0299

The AIC for additvie model is -179.6.

Forecasting for the next 2 years

forecast(fit,8)

##       Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 19 Q1       463.6812 451.2014 476.1609 444.5951 482.7672
## 19 Q2       416.0852 403.5665 428.6039 396.9395 435.2309
## 19 Q3       438.6314 426.0255 451.2373 419.3523 457.9105
## 19 Q4       536.0253 523.2657 548.7849 516.5112 555.5394
## 20 Q1       484.0115 470.3853 497.6376 463.1721 504.8509
## 20 Q2       436.4155 422.4713 450.3597 415.0897 457.7413
## 20 Q3       458.9617 444.5959 473.3275 436.9911 480.9323
## 20 Q4       556.3556 541.4571 571.2541 533.5704 579.1409

The forescasted values will show the point forecast values and also the range with low and high values as 80% and 95% confidence intervals respectively.

plot(forecast(fit,8))

The plot shows the forecasted values, the forecasted values are blurred this indicates that the forecasted values will vary within the blur part.

Decomposing the model into ‘Level’, ‘Trend’ and ‘Seasonality’

states <- fit$model$states[,1:3]
colnames(states) <- cbind('Level','Trend','Seasonality')
plot(states,col="blue",main="Decomposing the Forecast")

Auto Correlation plots

pacf <- pacf(diff(log(timeseries)))

From PACF we identified the spike comes to zero after the first lag in time. Hence the AR(p) value may be 1

acf <- acf(diff(log(timeseries)))

From ACF plot we identied the MA(q) process value will be 1 or 2 or 3 or 4

Running the Seasonal ARIMA model with different p and q values.

fit <- arima(log(timeseries),c(0,1,2),seasonal = list(order = c(0,1,2),period = 4))
fit1 <- arima(log(timeseries),c(0,1,1),seasonal = list(order = c(0,1,1),period = 4))
fit2 <- arima(log(timeseries),c(1,1,0),seasonal = list(order = c(1,1,0),period = 4))
fit3 <- arima(log(timeseries),c(1,1,1),seasonal = list(order = c(1,1,1),period = 4))
fit4 <- arima(log(timeseries),c(1,1,2),seasonal = list(order = c(1,1,2),period = 4))

Checking for AIC values, the least AIC value will be considered as the best fit model.

fit

## 
## Call:
## arima(x = log(timeseries), order = c(0, 1, 2), seasonal = list(order = c(0, 
##     1, 2), period = 4))
## 
## Coefficients:
##           ma1     ma2     sma1    sma2
##       -1.0651  0.3171  -0.7356  0.0226
## s.e.   0.1189  0.1220   0.1768  0.1590
## 
## sigma^2 estimated as 0.0009847:  log likelihood = 134.73,  aic = -261.46

fit1

## 
## Call:
## arima(x = log(timeseries), order = c(0, 1, 1), seasonal = list(order = c(0, 
##     1, 1), period = 4))
## 
## Coefficients:
##           ma1     sma1
##       -0.8076  -0.6669
## s.e.   0.0692   0.1297
## 
## sigma^2 estimated as 0.001087:  log likelihood = 131.49,  aic = -258.99

fit2

## 
## Call:
## arima(x = log(timeseries), order = c(1, 1, 0), seasonal = list(order = c(1, 
##     1, 0), period = 4))
## 
## Coefficients:
##           ar1     sar1
##       -0.6875  -0.4075
## s.e.   0.0983   0.1185
## 
## sigma^2 estimated as 0.001545:  log likelihood = 121.17,  aic = -238.35

fit3

## 
## Call:
## arima(x = log(timeseries), order = c(1, 1, 1), seasonal = list(order = c(1, 
##     1, 1), period = 4))
## 
## Coefficients:
##           ar1      ma1    sar1     sma1
##       -0.3666  -0.7077  0.0120  -0.6707
## s.e.   0.1449   0.1041  0.1987   0.1603
## 
## sigma^2 estimated as 0.0009943:  log likelihood = 134.55,  aic = -261.1

fit4

## 
## Call:
## arima(x = log(timeseries), order = c(1, 1, 2), seasonal = list(order = c(1, 
##     1, 2), period = 4))
## 
## Coefficients:
##           ar1      ma1     ma2     sar1    sma1     sma2
##       -0.1574  -0.9141  0.1799  -0.8770  0.2945  -0.7051
## s.e.   0.3873   0.3655  0.3056   0.1282  0.1993   0.1648
## 
## sigma^2 estimated as 0.0009298:  log likelihood = 135.34,  aic = -258.68

When p value(Auto regressive term) is 0 and the q value (error term ) is 2, the AIC value -261.46 is at the least. Hence considering this model as the best fit for the quaterly beer sales data.

Fitting the residuals, so that the spikes are below the significant dotted line.

acf(residuals(fit))

pacf(residuals(fit))

As there is no spikes above the significant line, the model has fitted correctly.

Evaluating Residuals

Ljung-Box test is carried out on residuals to see that after fitting the model what remains is actually the residuals. Null Hypothesis: Data is independently distributed Alternative Hypothesis: Data shows serial correlation

Box.test(residuals(fit),lag=4,type="Ljung")

## 
##  Box-Ljung test
## 
## data:  residuals(fit)
## X-squared = 0.15927, df = 4, p-value = 0.997

P value is >0.05 so donot reject the null hypothesis,henceforth “Data is independently distributed”

Forecasting

pred <- predict(fit, n.ahead = 8)
ts.plot(timeseries,2.718^pred$pred, lty = c(1,3),ylabel ="beer",main="Forecasted Series")

## Warning in xy.coords(x = matrix(rep.int(tx, k), ncol = k), y = x, log =
## log, : NAs introduced by coercion

## Warning in xy.coords(x, y): NAs introduced by coercion