We have the data of calls from the year 1999 till 2004 March. We’ll forecast the call volume for the next 6 months.
Load the data
head(call)
## Market_1 YEAR
## 1 3750 1999
## 2 3846 1999
## 3 3894 1999
## 4 4010 1999
## 5 4147 1999
## 6 4335 1999
We subset just the call volume values and convert it to time series data.
call_volume <- call[, 1]
class(call_volume)
## [1] "integer"
call_ts <- ts(call_volume, frequency = 12, start = c(1999))
class(call_ts)
## [1] "ts"
call_ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## 1999 3750 3846 3894 4010 4147 4335 4554 4744 4885 5020 5208
## 2000 5574 5828 5942 6139 6244 6274 6347 6399 6443 6647 6732
## 2001 6982 6995 7163 7193 7341 7449 7534 7545 7590 7704 7934
## 2002 8223 8538 8777 8981 9074 9321 9582 9754 10103 10414 10455
## 2003 10602 10713 10771 11015 11040 11244 11270 11489 11657 11731 11656
## 2004 11759 11859 12141
## Dec
## 1999 5379
## 2000 6914
## 2001 8007
## 2002 10515
## 2003 11549
## 2004
plot(call_ts, xlab = "Time", ylab = "Call Volume")
We build ARIMA model and try to forecast. Before that we check correlogram and values on auto-correlation and partial auto-correlation function.
acf(call_ts, lag.max=20)
acf(call_ts, lag.max=20, plot=FALSE)
##
## Autocorrelations of series 'call_ts', by lag
##
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500
## 1.000 0.954 0.909 0.862 0.816 0.768 0.718 0.668 0.620 0.572
## 0.8333 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833
## 0.524 0.478 0.432 0.388 0.345 0.303 0.261 0.220 0.177 0.136
## 1.6667
## 0.098
pacf(call_ts, lag.max=20)
pacf(call_ts, lag.max=20, plot=FALSE)
##
## Partial autocorrelations of series 'call_ts', by lag
##
## 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
## 0.954 -0.014 -0.040 -0.017 -0.051 -0.046 -0.024 -0.018 -0.016 -0.037
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
## -0.011 -0.035 -0.006 -0.016 -0.027 -0.022 -0.037 -0.050 -0.014 -0.010
Since it’s a non-stationary time series, we first make it to stationary by differencing.
call_ts_diff1 <- diff(call_ts, differences = 1)
plot.ts(call_ts_diff1)
The resulting time series of first differences (above) does not appear to be stationary in mean. Therefore, we can difference the time series twice, to see if that gives us a stationary time series.
call_ts_diff2 <- diff(call_ts, differences = 2)
plot.ts(call_ts_diff2)
The time series of second differences (above) does appear to be stationary in mean and variance, as the level of the series stays roughly constant over time, and the variance of the series appears roughly constant over time.
The next step is to find appropriate ARIMA model. To do this, we examine the correlation and partial correlation of the stationary time series.
acf(call_ts_diff2, lag.max=20)
acf(call_ts_diff2, lag.max=20, plot=FALSE)
##
## Autocorrelations of series 'call_ts_diff2', by lag
##
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500
## 1.000 -0.465 0.111 -0.070 -0.067 0.031 -0.088 0.109 0.034 -0.027
## 0.8333 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833
## -0.150 0.219 -0.089 0.029 -0.148 0.083 0.052 -0.139 0.066 0.125
## 1.6667
## -0.246
pacf(call_ts_diff2, lag.max=20)
pacf(call_ts_diff2, lag.max=20, plot=FALSE)
##
## Partial autocorrelations of series 'call_ts_diff2', by lag
##
## 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333
## -0.465 -0.135 -0.097 -0.170 -0.108 -0.172 -0.040 0.071 0.023 -0.222
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667
## 0.085 0.098 0.037 -0.201 -0.099 0.071 -0.068 -0.150 0.084 -0.236
From the above difference, acf and pacf values, we can build an ARIMA model of ARIMA(1,2,1). We can also call auto.arima and auto.arima ‘bic’ criterion for an appropriate model.
library(forecast)
auto.arima(call_ts)
## Series: call_ts
## ARIMA(1,1,0) with drift
##
## Coefficients:
## ar1 drift
## 0.237 135.87
## s.e. 0.126 15.01
##
## sigma^2 estimated as 8213: log likelihood=-361
## AIC=728 AICc=728.4 BIC=734.4
auto.arima(call_ts,ic = "bic")
## Series: call_ts
## ARIMA(0,1,0) with drift
##
## Coefficients:
## drift
## 135.34
## s.e. 11.84
##
## sigma^2 estimated as 8694: log likelihood=-362.7
## AIC=729.4 AICc=729.6 BIC=733.7
We try with ARIMA(1,2,1) model first.
call_ts_arima <- arima(call_ts, order=c(1,2,1))
call_ts_arima
## Series: call_ts
## ARIMA(1,2,1)
##
## Coefficients:
## ar1 ma1
## 0.258 -1.000
## s.e. 0.128 0.052
##
## sigma^2 estimated as 8215: log likelihood=-363.3
## AIC=732.6 AICc=733 BIC=738.9
We’ll forecast for the next 6 months from this model.
call_forecasts <- forecast.Arima(call_ts_arima, h=6)
call_forecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Apr 2004 12315 12197 12432 12135 12494
## May 2004 12460 12271 12650 12170 12750
## Jun 2004 12599 12351 12846 12220 12977
## Jul 2004 12735 12439 13032 12282 13189
## Aug 2004 12871 12531 13211 12351 13391
## Sep 2004 13007 12628 13387 12427 13588
plot.forecast(call_forecasts, xlab = "Time", ylab = "Call Volume")
we investigate whether the forecast errors of an ARIMA model are normally distributed with mean zero and constant variance, and whether there are correlations between successive forecast errors.
We make a correlogram of the forecast errors for our ARIMA(1,2,1) model and perform the Ljung-Box test for lags 1-20.
acf(call_forecasts$residuals, lag.max=20)
Box.test(call_forecasts$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: call_forecasts$residuals
## X-squared = 22.03, df = 20, p-value = 0.339
plot.ts(call_forecasts$residuals)
To Plot histogram of forecast errors, with an overlaid normal curve that has mean zero and the same standard deviation as the distribution of forecast errors, we need to run this function.
plotForecastErrors <- function(forecasterrors)
{
# make a histogram of the forecast errors:
mybinsize <- IQR(forecasterrors)/4
mysd <- sd(forecasterrors)
mymin <- min(forecasterrors) - mysd*5
mymax <- max(forecasterrors) + mysd*3
# generate normally distributed data with mean 0 and standard deviation mysd
mynorm <- rnorm(10000, mean=0, sd=mysd)
mymin2 <- min(mynorm)
mymax2 <- max(mynorm)
if (mymin2 < mymin) { mymin <- mymin2 }
if (mymax2 > mymax) { mymax <- mymax2 }
# make a red histogram of the forecast errors, with the normally distributed data overlaid:
mybins <- seq(mymin, mymax, mybinsize)
hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
# freq=FALSE ensures the area under the histogram = 1
# generate normally distributed data with mean 0 and standard deviation mysd
myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
# plot the normal curve as a blue line on top of the histogram of forecast errors:
points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}
plotForecastErrors(call_forecasts$residuals)
We can check whether we can improve our model with ARIMA(1,2,0) model.
call_ts_arima2 <- arima(call_ts, order=c(1,2,0))
call_ts_arima2
## Series: call_ts
## ARIMA(1,2,0)
##
## Coefficients:
## ar1
## -0.478
## s.e. 0.114
##
## sigma^2 estimated as 10070: log likelihood=-367.8
## AIC=739.6 AICc=739.8 BIC=743.9
Forecast for the next 6 months with this model.
call_forecasts2 <- forecast.Arima(call_ts_arima2, h=6)
call_forecasts2
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Apr 2004 12336 12207 12465 12139 12533
## May 2004 12573 12338 12807 12214 12931
## Jun 2004 12789 12415 13164 12217 13362
## Jul 2004 13015 12486 13545 12205 13826
## Aug 2004 13237 12533 13941 12160 14314
## Sep 2004 13461 12568 14354 12095 14827
plot.forecast(call_forecasts2, xlab = "Time", ylab = "Call Volume")
Correlogram of the forecast errors for our ARIMA(1,2,0) model and perform the Ljung-Box test for lags 1-20 and histogram of forecast errors.
acf(call_forecasts2$residuals, lag.max=20)
Box.test(call_forecasts2$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: call_forecasts2$residuals
## X-squared = 22.52, df = 20, p-value = 0.3128
plot.ts(call_forecasts2$residuals)
plotForecastErrors(call_forecasts2$residuals)
The histograms of the time series shows that the forecast errors are roughly normally distributed and the mean seems to be closed to zero. Therefore, it is plausible that the forecast errors are normally distributed with mean zero and constant variance.
Since successive forecast errors do not seem to be correlated, and the forecast errors seem to be normally distributed with mean zero and constant variance, the ARIMA(1,2,1) and ARIMA(1,2,0) do seem to provide an adequate predictive model for the call volume for the next 6 months.