Using the ARIMA method, model the quarterly beer sales data and predict the next 2 years.

Load the dataset

beer.sales <- read.csv("D:/PGP BA-BI Course Materials/TIME SERIES FORECASTING/GA/beer.csv")

The beer sales data is read into R and saved in the object ‘beer.sales’. The next step is to store the data in a time series object using the ts() function.

Plotting of Periodogram - to check for frequency

library(TSA)
periodogram(beer.sales)

The above periodogram plot is used to identify the dominant frequencies (or periods) of a time series. From this plot, the dominant frequency is identified as 0.25, so the corresponding period is calculated as 1/0.25 = 4 observations. Therefore, we can conclude that the behaviour in the series is quarterly.
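
The same conclusion can be read off the returned object programmatically. A minimal sketch, assuming TSA's periodogram() returns a list with $freq and $spec components (as its documentation describes) and accepts plot = FALSE:

p <- periodogram(beer.sales, plot = FALSE)  # compute without re-plotting
dom.freq <- p$freq[which.max(p$spec)]       # frequency of the largest peak, ~0.25
1/dom.freq                                  # implied period, ~4 observations per cycle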

Converting to a time series object

beersales.ts <- ts(beer.sales, frequency = 4, start = c(2000,1))
beersales.ts
##       Qtr1  Qtr2  Qtr3  Qtr4
## 2000 284.4 212.8 226.9 308.4
## 2001 262.0 227.9 236.1 320.4
## 2002 271.9 232.8 237.0 313.4
## 2003 261.4 226.8 249.9 314.3
## 2004 286.1 226.5 260.4 311.4
## 2005 294.7 232.6 257.2 339.2
## 2006 279.1 249.8 269.8 345.7
## 2007 293.8 254.7 277.5 363.4
## 2008 313.4 272.8 300.1 369.5
## 2009 330.8 287.8 305.9 386.1
## 2010 335.2 288.0 308.3 402.3
## 2011 352.8 316.1 324.9 404.8
## 2012 393.0 318.9 327.0 442.3
## 2013 383.1 331.6 361.4 445.9
## 2014 386.6 357.2 373.6 466.2
## 2015 409.6 369.8 378.6 487.0
## 2016 419.2 376.7 392.8 506.1
## 2017 458.4 387.4 426.9 525.0

Assuming the data begins in the year 2000, the start is set to c(2000, 1), i.e. the first quarter of 2000, and the frequency of 4 tells ts() that there are four observations per year.
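
As a quick sanity check on the conversion, the attributes of the new object can be inspected with base R functions (nothing assumed beyond the object created above):

frequency(beersales.ts)  # 4 observations per cycle (quarterly)
start(beersales.ts)      # c(2000, 1): first quarter of 2000
end(beersales.ts)        # c(2017, 4): last quarter of 2017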

Exploration of Time Series

plot(beersales.ts, col = "blue", main = "Time Series of Beer sales")
abline(reg = lm(beersales.ts~time(beersales.ts)))

We can see from the plot that there is a roughly linear upward trend and that a seasonal pattern exists in the data.
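
A complementary way to see both components separately is a classical decomposition; a minimal sketch using base R's decompose():

beersales.decomp <- decompose(beersales.ts)  # additive decomposition by default
plot(beersales.decomp)                       # panels: observed, trend, seasonal, random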

plot(aggregate(beersales.ts, FUN = mean), col = "blue", main = "Plot of Annual Mean of Sales",
     xlab = "Years", ylab = "Mean of Sales")

The above plot aggregates the quarterly values into an annual mean and displays the year-on-year trend. It clearly shows that beer sales have been increasing consistently.

boxplot(beersales.ts ~ cycle(beersales.ts), main = "Boxplot distribution of sales by quarter",
        xlab = "Quarter", ylab = "Beer Sales")

The above boxplot across each quarter gives a sense of the seasonal effect. Both the mean and the variance in the 4th quarter are comparatively higher than in the other quarters.
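
Another view of the same effect is a seasonal subseries plot; base R's monthplot() also handles quarterly series, drawing one subseries per quarter along with its mean:

monthplot(beersales.ts, main = "Seasonal Subseries Plot of Beer Sales",
          xlab = "Quarter", ylab = "Beer Sales")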

Test for Stationarity

kpss.test(beersales.ts)
## Warning in kpss.test(beersales.ts): p-value smaller than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  beersales.ts
## KPSS Level = 3.0456, Truncation lag parameter = 1, p-value = 0.01

The above test rejects the null hypothesis of level stationarity (p-value < 0.05), validating that the given data is not stationary. Hence the data is differenced to remove trend and seasonality before fitting the ARIMA model.
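
As a cross-check, the Augmented Dickey-Fuller test from the tseries package (attached as a dependency of TSA) tests the opposite null hypothesis, namely that the series contains a unit root; a sketch:

adf.test(beersales.ts)
# Failing to reject the unit-root null here would agree with the KPSS
# conclusion that the series is non-stationary.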

Identifying the trend and seasonal differences required for stationarity

library(forecast)
nsdiffs(beersales.ts)
## [1] 1
ndiffs(beersales.ts)
## [1] 1

The above results show that one regular (trend) difference and one seasonal difference are required to make the series stationary, i.e. d = 1 and D = 1.
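
For illustration, both differences can be applied together and re-tested (a sketch; the check in the next step uses only the regular difference of the log series):

stationary.ts <- diff(diff(log(beersales.ts), lag = 4))  # seasonal, then regular difference
kpss.test(stationary.ts)                                 # expect p-value > 0.05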

Test for stationarity after differencing

kpss.test(diff(log(beersales.ts)))
## Warning in kpss.test(diff(log(beersales.ts))): p-value greater than printed
## p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(log(beersales.ts))
## KPSS Level = 0.02596, Truncation lag parameter = 1, p-value = 0.1

After the log transformation and first differencing, the time series is stationary; this is validated by the KPSS test failing to reject level stationarity, with a p-value > 0.05.

Determining the optimal parameters p and q from the ACF and PACF plots

acf(diff(log(beersales.ts)),main = "ACF Plot")

In the ACF plot, the spikes that cross the blue dashed lines are the significant lags, which have an influence on the current value.

pacf(diff(log(beersales.ts)),main = "PACF Plot")

From the PACF plot, we can see that the partial autocorrelation cuts off after lag 1 and is approximately 0 beyond that point.
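
As an additional order-selection aid, TSA provides the extended ACF table; a sketch (eacf() prints a grid of 'o'/'x' symbols, and the top-left 'o' suggests candidate (p, q) orders):

eacf(diff(log(beersales.ts)))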

Fitting ARIMA model

Based on the information from the above plots, an ARIMA model is fitted to the data. Models are built with different p and q values, and the one with the lowest AIC value is selected; a sketch of this search is given below.
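
A minimal sketch of that iterative search (the grid bounds 0:2 and 0:4 are illustrative assumptions, not taken from this report, and the seasonal order is held fixed at the values used below):

best.aic <- Inf
best.order <- NULL
for (p in 0:2) for (q in 0:4) {
  m <- try(arima(log(beersales.ts), order = c(p, 1, q),
                 seasonal = list(order = c(0, 1, 4), period = 4)),
           silent = TRUE)
  if (!inherits(m, "try-error") && m$aic < best.aic) {
    best.aic <- m$aic          # keep the lowest AIC seen so far
    best.order <- c(p, 1, q)   # and the order that produced it
  }
}
best.order
best.aic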

fit <- arima(log(beersales.ts),c(0,1,4),seasonal = list(order = c(0,1,4), period = 4))
fit
## 
## Call:
## arima(x = log(beersales.ts), order = c(0, 1, 4), seasonal = list(order = c(0, 
##     1, 4), period = 4))
## 
## Coefficients:
##           ma1     ma2     ma3      ma4     sma1     sma2     sma3    sma4
##       -1.0589  0.3444  0.0623  -0.1118  -0.7020  -0.0005  -0.0627  0.1292
## s.e.   0.1442  0.2148  0.2224   0.1486   0.1775   0.1658   0.2159  0.1729
## 
## sigma^2 estimated as 0.0009623:  log likelihood = 135.43,  aic = -254.85

From the above output, we can see the estimated moving-average (ma) and seasonal moving-average (sma) coefficients together with their standard errors. sigma^2 is the estimated variance of the noise component. The AIC value of -254.85 is the lowest value found in the iterative search.
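
A rough significance check divides each coefficient by its standard error; ratios beyond about +/- 2 indicate significance at roughly the 5% level (a sketch using the standard components of a fitted arima object):

z <- fit$coef / sqrt(diag(fit$var.coef))  # approximate z-statistics
round(z, 2)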

Plotting the ACF and PACF of the residuals

acf(residuals(fit), main = "ACF of Residuals")

pacf(residuals(fit), main = "PACF of Residuals")

Evaluating residuals

Box.test(residuals(fit),lag = 4, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  residuals(fit)
## X-squared = 0.80238, df = 4, p-value = 0.9381

The above Ljung-Box test is carried out on the residuals to check that what remains after fitting the model is white noise. With a p-value > 0.05, we fail to reject the null hypothesis that the residuals are independently distributed, so the model has adequately captured the structure in the data.
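
Base R also offers a consolidated diagnostic for fitted arima objects; a sketch (tsdiag() plots the standardized residuals, their ACF, and Ljung-Box p-values at several lags):

tsdiag(fit)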

Forecasting

pred <- predict(fit,n.ahead = 2*4)
pred
## $pred
##          Qtr1     Qtr2     Qtr3     Qtr4
## 2018 6.152115 6.024324 6.080016 6.310361
## 2019 6.193566 6.066473 6.120546 6.352873
## 
## $se
##            Qtr1       Qtr2       Qtr3       Qtr4
## 2018 0.03102103 0.03107475 0.03231204 0.03406568
## 2019 0.03788073 0.03848232 0.03975074 0.04112375
ts.plot(beersales.ts,exp(pred$pred),log = "y",lty = c(1,3), col = c('blue','red'),xlab = "Years",ylab = "Beer Sales", 
        main = "Forecasted Series")