We have monthly call volume data from January 1999 through March 2004. We'll forecast the call volume for the next 6 months.

Load the data
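The load step itself isn't shown in the original output; a minimal sketch, assuming the data sit in a CSV file (the file name call_data.csv here is hypothetical):

# Hypothetical load step: adjust the path/file name to wherever the
# call-volume data live.
call <- read.csv("call_data.csv", stringsAsFactors = FALSE)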

head(call)
##   Market_1 YEAR
## 1     3750 1999
## 2     3846 1999
## 3     3894 1999
## 4     4010 1999
## 5     4147 1999
## 6     4335 1999

We subset just the call volume column and convert it to a time series object.

call_volume <- call[, 1]
class(call_volume)
## [1] "integer"
call_ts <- ts(call_volume, frequency = 12, start = c(1999))
class(call_ts)
## [1] "ts"
call_ts
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 1999  3750  3846  3894  4010  4147  4335  4554  4744  4885  5020  5208
## 2000  5574  5828  5942  6139  6244  6274  6347  6399  6443  6647  6732
## 2001  6982  6995  7163  7193  7341  7449  7534  7545  7590  7704  7934
## 2002  8223  8538  8777  8981  9074  9321  9582  9754 10103 10414 10455
## 2003 10602 10713 10771 11015 11040 11244 11270 11489 11657 11731 11656
## 2004 11759 11859 12141                                                
##        Dec
## 1999  5379
## 2000  6914
## 2001  8007
## 2002 10515
## 2003 11549
## 2004
plot(call_ts, xlab = "Time", ylab = "Call Volume")

[Figure: call volume time series, 1999-2004]
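As an optional check that is not part of the original analysis, a classical decomposition separates the trend and seasonal components and helps judge how strong each is before we start differencing:

# Optional: classical decomposition of the monthly series into trend,
# seasonal, and remainder components.
plot(decompose(call_ts))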

We'll build an ARIMA model and use it to forecast. Before that, we check the correlogram and the values of the autocorrelation and partial autocorrelation functions.

acf(call_ts, lag.max=20)

[Figure: ACF of call_ts]

acf(call_ts, lag.max=20, plot=FALSE)
## 
## Autocorrelations of series 'call_ts', by lag
## 
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 
##  1.000  0.954  0.909  0.862  0.816  0.768  0.718  0.668  0.620  0.572 
## 0.8333 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 
##  0.524  0.478  0.432  0.388  0.345  0.303  0.261  0.220  0.177  0.136 
## 1.6667 
##  0.098
pacf(call_ts, lag.max=20)

[Figure: PACF of call_ts]

pacf(call_ts, lag.max=20, plot=FALSE)
## 
## Partial autocorrelations of series 'call_ts', by lag
## 
## 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 
##  0.954 -0.014 -0.040 -0.017 -0.051 -0.046 -0.024 -0.018 -0.016 -0.037 
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667 
## -0.011 -0.035 -0.006 -0.016 -0.027 -0.022 -0.037 -0.050 -0.014 -0.010

Since this is a non-stationary time series, we first make it stationary by differencing.

call_ts_diff1 <- diff(call_ts, differences = 1)
plot.ts(call_ts_diff1)

[Figure: first differences of call_ts]

The time series of first differences (above) does not appear to be stationary in mean, so we difference the series a second time to see whether that gives us a stationary series.

call_ts_diff2 <- diff(call_ts, differences = 2)
plot.ts(call_ts_diff2)

[Figure: second differences of call_ts]

The time series of second differences (above) does appear to be stationary in mean and variance, as the level of the series stays roughly constant over time, and the variance of the series appears roughly constant over time.
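As a cross-check on this visual judgment (not done in the original analysis), forecast::ndiffs() estimates the required differencing order with a unit-root test; note that it may suggest a smaller d than the eyeball inspection above.

# Optional unit-root-based estimate of the differencing order (KPSS test by
# default); this may suggest d = 1 rather than d = 2.
library(forecast)
ndiffs(call_ts)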

The next step is to find an appropriate ARIMA model. To do this, we examine the autocorrelation and partial autocorrelation of the differenced (now stationary) series.

acf(call_ts_diff2, lag.max=20)

[Figure: ACF of call_ts_diff2]

acf(call_ts_diff2, lag.max=20, plot=FALSE)
## 
## Autocorrelations of series 'call_ts_diff2', by lag
## 
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 
##  1.000 -0.465  0.111 -0.070 -0.067  0.031 -0.088  0.109  0.034 -0.027 
## 0.8333 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 
## -0.150  0.219 -0.089  0.029 -0.148  0.083  0.052 -0.139  0.066  0.125 
## 1.6667 
## -0.246
pacf(call_ts_diff2, lag.max=20)

[Figure: PACF of call_ts_diff2]

pacf(call_ts_diff2, lag.max=20, plot=FALSE)
## 
## Partial autocorrelations of series 'call_ts_diff2', by lag
## 
## 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 
## -0.465 -0.135 -0.097 -0.170 -0.108 -0.172 -0.040  0.071  0.023 -0.222 
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667 
##  0.085  0.098  0.037 -0.201 -0.099  0.071 -0.068 -0.150  0.084 -0.236

From the differencing order and the ACF and PACF values above, a reasonable candidate is ARIMA(1,2,1). We can also call auto.arima(), both with its default criterion and with the BIC criterion, to suggest a model.

library(forecast)
auto.arima(call_ts)
## Series: call_ts 
## ARIMA(1,1,0) with drift         
## 
## Coefficients:
##         ar1   drift
##       0.237  135.87
## s.e.  0.126   15.01
## 
## sigma^2 estimated as 8213:  log likelihood=-361
## AIC=728   AICc=728.4   BIC=734.4
auto.arima(call_ts,ic = "bic")
## Series: call_ts 
## ARIMA(0,1,0) with drift         
## 
## Coefficients:
##        drift
##       135.34
## s.e.   11.84
## 
## sigma^2 estimated as 8694:  log likelihood=-362.7
## AIC=729.4   AICc=729.6   BIC=733.7
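By default auto.arima() uses a stepwise search with likelihood approximations; if runtime is not a concern, a more exhaustive search can be requested (optional, not part of the original run):

# Optional: disable the stepwise shortcut and the likelihood approximation
# for a slower but more thorough model search.
auto.arima(call_ts, stepwise = FALSE, approximation = FALSE)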

We fit the ARIMA(1,2,1) model identified above first.

call_ts_arima <- Arima(call_ts, order = c(1, 2, 1))
call_ts_arima
## Series: call_ts 
## ARIMA(1,2,1)                    
## 
## Coefficients:
##         ar1     ma1
##       0.258  -1.000
## s.e.  0.128   0.052
## 
## sigma^2 estimated as 8215:  log likelihood=-363.3
## AIC=732.6   AICc=733   BIC=738.9

We’ll forecast for the next 6 months from this model.

call_forecasts <- forecast(call_ts_arima, h = 6)
call_forecasts
##          Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Apr 2004          12315 12197 12432 12135 12494
## May 2004          12460 12271 12650 12170 12750
## Jun 2004          12599 12351 12846 12220 12977
## Jul 2004          12735 12439 13032 12282 13189
## Aug 2004          12871 12531 13211 12351 13391
## Sep 2004          13007 12628 13387 12427 13588
plot(call_forecasts, xlab = "Time", ylab = "Call Volume")

[Figure: forecasts from ARIMA(1,2,1)]

We now investigate whether the forecast errors of the ARIMA model are normally distributed with mean zero and constant variance, and whether successive forecast errors are correlated.
We make a correlogram of the forecast errors for our ARIMA(1,2,1) model and perform the Ljung-Box test for lags 1-20.

acf(call_forecasts$residuals, lag.max=20)

[Figure: ACF of ARIMA(1,2,1) residuals]

Box.test(call_forecasts$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  call_forecasts$residuals
## X-squared = 22.03, df = 20, p-value = 0.339
plot.ts(call_forecasts$residuals)

[Figure: time plot of ARIMA(1,2,1) residuals]
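Strictly speaking, when the Ljung-Box test is applied to residuals from a fitted ARIMA(p,d,q) model, the degrees of freedom should be reduced by p + q. Box.test() supports this via its fitdf argument; this refinement was not used in the test above.

# Ljung-Box test adjusted for the 2 estimated parameters of ARIMA(1,2,1).
Box.test(call_forecasts$residuals, lag = 20, type = "Ljung-Box", fitdf = 2)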

To plot a histogram of the forecast errors, with an overlaid normal curve that has mean zero and the same standard deviation as the forecast errors, we define the following function.

plotForecastErrors <- function(forecasterrors)
{
  # make a histogram of the forecast errors:
  mybinsize <- IQR(forecasterrors)/4
  mysd <- sd(forecasterrors)
  mymin <- min(forecasterrors) - mysd*5
  mymax <- max(forecasterrors) + mysd*3
  # generate normally distributed data with mean 0 and standard deviation mysd
  mynorm <- rnorm(10000, mean=0, sd=mysd)
  mymin2 <- min(mynorm)
  mymax2 <- max(mynorm)
  if (mymin2 < mymin) { mymin <- mymin2 }
  if (mymax2 > mymax) { mymax <- mymax2 }
  # make a red histogram of the forecast errors, with the normally distributed data overlaid:
  mybins <- seq(mymin, mymax, mybinsize)
  hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
  # freq=FALSE ensures the area under the histogram = 1
  # compute the histogram of the simulated normal data without plotting it
  myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
  # plot the normal curve as a blue line on top of the histogram of forecast errors:
  points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}

plotForecastErrors(call_forecasts$residuals)

[Figure: histogram of ARIMA(1,2,1) forecast errors with overlaid normal curve]
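Recent versions of the forecast package (8.0 and later) bundle these residual diagnostics into a single call, checkresiduals(), which produces the residual time plot, its ACF, a histogram, and the Ljung-Box test together; a sketch, assuming such a version is installed:

# One-call residual diagnostics (requires forecast >= 8.0).
checkresiduals(call_ts_arima)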

We check whether an ARIMA(1,2,0) model can improve on this.

call_ts_arima2 <- Arima(call_ts, order = c(1, 2, 0))
call_ts_arima2
## Series: call_ts 
## ARIMA(1,2,0)                    
## 
## Coefficients:
##          ar1
##       -0.478
## s.e.   0.114
## 
## sigma^2 estimated as 10070:  log likelihood=-367.8
## AIC=739.6   AICc=739.8   BIC=743.9

We forecast the next 6 months with this model.

call_forecasts2 <- forecast(call_ts_arima2, h = 6)
call_forecasts2
##          Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Apr 2004          12336 12207 12465 12139 12533
## May 2004          12573 12338 12807 12214 12931
## Jun 2004          12789 12415 13164 12217 13362
## Jul 2004          13015 12486 13545 12205 13826
## Aug 2004          13237 12533 13941 12160 14314
## Sep 2004          13461 12568 14354 12095 14827
plot(call_forecasts2, xlab = "Time", ylab = "Call Volume")

[Figure: forecasts from ARIMA(1,2,0)]

We make a correlogram of the forecast errors for our ARIMA(1,2,0) model, perform the Ljung-Box test for lags 1-20, and plot the histogram of forecast errors.

acf(call_forecasts2$residuals, lag.max=20)

[Figure: ACF of ARIMA(1,2,0) residuals]

Box.test(call_forecasts2$residuals, lag=20, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  call_forecasts2$residuals
## X-squared = 22.52, df = 20, p-value = 0.3128
plot.ts(call_forecasts2$residuals)

[Figure: time plot of ARIMA(1,2,0) residuals]

plotForecastErrors(call_forecasts2$residuals)

[Figure: histogram of ARIMA(1,2,0) forecast errors with overlaid normal curve]

The histograms of the forecast errors show that they are roughly normally distributed and that their mean appears close to zero. Therefore, it is plausible that the forecast errors are normally distributed with mean zero and constant variance.

Since successive forecast errors do not appear to be correlated, and the forecast errors appear to be normally distributed with mean zero and constant variance, both ARIMA(1,2,1) and ARIMA(1,2,0) seem to provide adequate predictive models for the call volume over the next 6 months. Of the two, ARIMA(1,2,1) has the lower AIC (732.6 vs. 739.6), so it is the slightly better fit.
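If we want a more direct comparison of the two candidates than the residual diagnostics alone, the information criteria and in-sample accuracy measures can be placed side by side (this comparison is an addition, not part of the original write-up):

# Compare the two fitted models: lower AIC and smaller in-sample errors
# indicate a better fit.
AIC(call_ts_arima, call_ts_arima2)
accuracy(call_ts_arima)   # ARIMA(1,2,1)
accuracy(call_ts_arima2)  # ARIMA(1,2,0)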