Forecasting is the first important stage of workforce management (WFM) planning. WFM forecasters predict, among other things, the volume of conversations expected in a future period. Forecasting future contacts is a difficult task because of the uncertainty in the many factors that drive contact volumes, as well as the way contacts are distributed over time.
Time series data is a sequence of observations recorded at consecutive, regularly spaced points in time. There are three broad approaches to time series forecasting.
Time series models - Time series models express the current value of a variable as a function of its own history; no other variables are involved.
Econometric models - Econometric models use other supportive variables to model the variation in the target variable.
Hybrid models - Hybrid models combine the two, using both the variable’s own history and external predictors.
While there is not a single best method for forecasting, there are widely used methods and best practices. We will discuss some of the most commonly used time series models and hybrid approaches in this blog. We will also cover the metrics which can indicate the accuracy of the forecast.
The accuracy() function from the forecast package reports the mean error (ME), root mean square error (RMSE), mean absolute error (MAE), mean percentage error (MPE), mean absolute percentage error (MAPE), mean absolute scaled error (MASE), the first-order autocorrelation of the errors (ACF1) and Theil’s U.
MAPE is a widely used accuracy measure because it is insensitive to the sign of the error and independent of the scale of the data. We will use MAPE to compare the accuracy of the methods discussed below.
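For reference, MAPE is simply the mean of the absolute percentage errors. A minimal sketch of that calculation (the function name and arguments here are purely illustrative):
# MAPE = mean(|actual - predicted| / |actual|) * 100
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100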
Loading the data
HistoricalData <- read.csv("historicalData.csv")
The data we will use is historical monthly volume data from Jan 2019 to Dec 2022 from a hypothetical contact center for a single channel.
HistoricalData$Period <- as.Date(HistoricalData$Period,format = "%d/%m/%Y")
HistoricalData$Email <- as.numeric(HistoricalData$Email)
HistoricalData$Phone <- as.numeric(HistoricalData$Phone)
str(HistoricalData)
## 'data.frame': 48 obs. of 3 variables:
## $ Period: Date, format: "2019-01-01" "2019-01-02" ...
## $ Email : num 12322 10160 9827 8986 9798 ...
## $ Phone : num 5544 4368 4422 3863 4507 ...
Our goal is to forecast the email contacts for the next 12 months. The phone contacts will be used as an additional variable when we discuss hybrid models.
Let’s define the Email column as a time series object. The ts() function from the stats package does just that.
data_ts <- ts(HistoricalData[,2], frequency = 12) # 12 since monthly data
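Note that we do not pass a start argument here, so plots and forecasts later in this post label the periods generically (e.g. "Jan 4" for January of the fourth year). If calendar labels are preferred, the start can be supplied explicitly; a small variation, assuming the data begins in January 2019 as described above (data_ts_dated is just an illustrative name; the rest of the post keeps data_ts as defined):
data_ts_dated <- ts(HistoricalData$Email, start = c(2019, 1), frequency = 12) # labelled Jan 2019 onwards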
Time series plots help us identify outliers in the data. Outliers need to be handled before modeling.
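As an illustrative sketch of how this could be done (we do not apply it to our data here), the forecast package, loaded below via fpp2, provides tsoutliers() to flag suspected outliers and tsclean() to replace them:
tsoutliers(data_ts)               # indices and suggested replacement values
data_ts_clean <- tsclean(data_ts) # copy of the series with outliers (and missing values) replaced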
To plot the time series data, we will use the fpp2 package.
# install.packages("fpp2")
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## -- Attaching packages ---------------------------------------------- fpp2 2.4 --
## v ggplot2 3.3.5 v fma 2.4
## v forecast 8.15 v expsmooth 2.3
##
autoplot(data_ts) # time series plot
ggseasonplot(data_ts,main = "Season Plot") # time series plot for each season
ggseasonplot(data_ts, polar = T,main = "Polar Season Plot") # polar season plot
It is important to test your models on historical data. We will therefore divide our historical data into train and test samples. The true test data is, of course, unknown and lies in the future; however, testing on out-of-sample data gives you confidence that the model will also hold up against the future reality.
train <- subset(data_ts, end = length(data_ts) - 12) # 2019 through 2021
test <- subset(data_ts, start = length(data_ts) - 12 + 1) # 2022 data
The naive() function from the forecast package can be used for naive forecasts. A naive forecast simply extends the last observed value over the selected forecast horizon.
fc_naive <- naive(train, 12) # forecast 12 periods ahead
autoplot(fc_naive) # train, forecast together.
We can check the error band and the residuals.
summary(fc_naive)
##
## Forecast method: Naive method
##
## Model Information:
## Call: naive(y = train, h = 12)
##
## Residual sd: 1772.4178
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 6.457143 1772.418 1343.314 -1.167537 12.18943 0.8223115
## ACF1
## Training set -0.08304126
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 4 12548 10276.555 14819.44 9074.1250 16021.87
## Feb 4 12548 9335.692 15760.31 7635.1989 17460.80
## Mar 4 12548 8613.742 16482.26 6531.0721 18564.93
## Apr 4 12548 8005.111 17090.89 5600.2501 19495.75
## May 4 12548 7468.895 17627.10 4780.1794 20315.82
## Jun 4 12548 6984.119 18111.88 4038.7789 21057.22
## Jul 4 12548 6538.322 18557.68 3356.9908 21739.01
## Aug 4 12548 6123.384 18972.62 2722.3978 22373.60
## Sep 4 12548 5733.666 19362.33 2126.3751 22969.62
## Oct 4 12548 5365.061 19730.94 1562.6428 23533.36
## Nov 4 12548 5014.470 20081.53 1026.4602 24069.54
## Dec 4 12548 4679.485 20416.52 514.1441 24581.86
checkresiduals(fc_naive) # Do they look like white noise?
##
## Ljung-Box test
##
## data: Residuals from Naive method
## Q* = 8.5062, df = 7, p-value = 0.2901
##
## Model df: 0. Total lags used: 7
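For a like-for-like comparison with the models that follow, the test-set MAPE of the naive forecast can be computed in the same way as we do later (output not shown here):
accuracy(fc_naive, data_ts)['Test set','MAPE'] # test-set MAPE of the naive forecast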
Instead of extending only the very last value of the training data, a seasonal naive model repeats, for each month, the value observed in the same month of the previous season.
fc_snaive <- snaive(train, 12)
autoplot(fc_snaive) # captures some seasonality in the data
# error band and the residuals
checkresiduals(fc_snaive) # does the residual look like white noise?
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 19.273, df = 7, p-value = 0.007373
##
## Model df: 0. Total lags used: 7
summary(fc_snaive)
##
## Forecast method: Seasonal naive method
##
## Model Information:
## Call: snaive(y = train, h = 12)
##
## Residual sd: 2008.8533
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 861.4167 2008.853 1633.583 6.104395 13.31924 1 0.6081251
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 4 12770 10195.551 15344.45 8832.72 16707.28
## Feb 4 10433 7858.551 13007.45 6495.72 14370.28
## Mar 4 10823 8248.551 13397.45 6885.72 14760.28
## Apr 4 10110 7535.551 12684.45 6172.72 14047.28
## May 4 12970 10395.551 15544.45 9032.72 16907.28
## Jun 4 15306 12731.551 17880.45 11368.72 19243.28
## Jul 4 15637 13062.551 18211.45 11699.72 19574.28
## Aug 4 15923 13348.551 18497.45 11985.72 19860.28
## Sep 4 12102 9527.551 14676.45 8164.72 16039.28
## Oct 4 11747 9172.551 14321.45 7809.72 15684.28
## Nov 4 11358 8783.551 13932.45 7420.72 15295.28
## Dec 4 12548 9973.551 15122.45 8610.72 16485.28
#accuracy
accuracy(fc_snaive, data_ts)['Test set','MAPE']
## [1] 0
The ses(), holt() and hw() functions from the forecast package implement exponential smoothing: simple exponential smoothing, Holt’s method with trend, and the Holt-Winters method with trend and seasonality. To pick among them we need to check whether our data has a trend and/or seasonality. Alternatively, the ets() function from the same package selects an appropriate exponential smoothing model automatically and works well in most cases.
fc_ses <- ses(train, h = 12) # simple exponential smoothing: assumes no trend or seasonality
autoplot(fc_ses) # forecast plot
fc_ses_t <- holt(train, h = 12) # Holt's method: adds a trend component
autoplot(fc_ses_t)
# Exponential smoothing with trend and seasonality (Holt-Winters)
fc_ses_ts_m <- hw(train, seasonal = "multiplicative", h = 12)
fc_ses_ts_a <- hw(train, seasonal = "additive", h = 12)
autoplot(fc_ses_ts_m)
autoplot(fc_ses_ts_a)
Accuracy
accuracy(fc_ses_ts_m, data_ts)['Test set','MAPE']
## [1] 8.407766
The ets() function provides automatic forecasting with exponential smoothing (error, trend, seasonal) state space models for a wide range of time series.
fitets <- ets(train)
fc_ETS <- forecast(fitets,h = 12)
Accuracy
accuracy(fc_ETS, data_ts)['Test set','MAPE']
## [1] 11.98624
Box-Jenkins (ARIMA) methods are popular choices for short-term forecasts. They should not be used too many periods ahead, since the point forecasts eventually flatten out to a constant level.
Box-Jenkins models require a stationary time series. A stationary series has no trend or seasonality, and its mean and variance are constant over time. It is therefore important to remove these components from the time series, in other words to stationarize the data.
By analogy with signal processing, a time series can be decomposed into a DC component and AC components. The DC component is also called the level, and the AC components can further be decomposed into trend, seasonality and noise; a decomposition sketch follows the list below.
Trend - the general long-term change in direction (up/down)
Seasonality - regular increases and decreases tied to the time of day, day of week, month of year, etc.
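A quick way to see these components is a decomposition plot; a minimal sketch using stl() from the stats package (output not shown):
data_ts %>% stl(s.window = "periodic") %>% autoplot() # trend, seasonal and remainder panels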
Stationarity Tests
The Dickey-Fuller test and the augmented Dickey-Fuller (ADF) test are the most commonly used tests for stationarity. The KPSS test, the Phillips-Perron (PP) test and the Ljung-Box white noise test can also be used.
library(tseries)
adf.test(train) # augmented Dickey-Fuller test
##
## Augmented Dickey-Fuller Test
##
## data: train
## Dickey-Fuller = -2.8081, Lag order = 3, p-value = 0.2587
## alternative hypothesis: stationary
Interpretation: since the p-value is not less than 0.05, we fail to reject the null hypothesis of a unit root, which suggests the time series is non-stationary.
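The KPSS test mentioned above makes a useful cross-check because its null hypothesis is the opposite (stationarity), so there a small p-value would point to non-stationarity. A minimal sketch using kpss.test() from the already loaded tseries package (output not shown):
kpss.test(train) # null hypothesis: the series is level-stationary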
Data stationarization - stabilizing the variance
A Box-Cox transformation can stabilize the variance; try various lambda values until the plot looks stable.
# Box-Cox transformations
train %>% BoxCox(lambda = .3) %>% autoplot()
train %>% BoxCox(lambda = -0.7) %>% autoplot()
BoxCox.lambda(train) # the best lambda
## [1] -0.6863084
diff(train) # first differencing to remove the trend
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 -2162 -333 -841 812 444 1663 2637 0 -4994 246 -407
## 2 1394 -2237 2531 -1791 709 1950 -48 415 -1178 -1789 487 401
## 3 2539 -2337 390 -713 2860 2336 331 286 -3821 -355 -389 1190
autoplot(diff(train)) # Plot of differenced data
Autocorrelation function (ACF) plots can tell us whether any trend or seasonality remains.
ggAcf(diff(train))
difftrain <- diff(train, lag = 12) # seasonal (lag-12) differencing
log_train <- log(train) # log transform, another way to stabilize the variance
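As a sanity check (illustrative; the exact statistics depend on the data), the ADF test can be re-run on the differenced series to confirm that they now look stationary:
adf.test(diff(train)) # ADF test after first differencing
adf.test(difftrain)   # ADF test after seasonal (lag-12) differencing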
auto.arima() is a robust function that can handle non-stationary time series data. The algorithm determines which transformations are needed to stationarize the data (for example ordinary and seasonal differencing), applies them, and selects the model orders accordingly.
fit <- auto.arima(train,seasonal = T) # seasonal = F for non-Seasonal
fc_arima<- fit %>% forecast(h = 12)
autoplot(fc_arima)
Accuracy
accuracy(fc_arima, data_ts)['Test set','MAPE']
## [1] 18.73199
We can also define customized ARIMA models by specifying the parameters order = (p,d,q) and seasonal = (P,D,Q). These parameters are identified while stationarizing the series; ACF and PACF plots of the differenced data help suggest them, as in the sketch after the list below.
p - number of lags (Autoregressive part)
d - number of differencing needed
q - number of error lags (Moving average part)
P - number of seasonal lags (seasonal autoregressive part)
D - number of seasonal differences needed
Q - number of seasonal error lags (seasonal moving average part)
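As mentioned above, ACF and PACF plots of the (seasonally) differenced series help suggest these orders; a minimal sketch using the difftrain object defined earlier (spikes in the PACF hint at the AR orders, spikes in the ACF hint at the MA orders):
ggAcf(difftrain)  # significant lags suggest the MA orders q / Q
ggPacf(difftrain) # significant lags suggest the AR orders p / P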
train %>% Arima(order = c(0, 1, 0),seasonal = c(0, 1, 0), include.constant = F) %>% forecast() %>% autoplot()
In this section, we will see how we can combine econometric approaches and time series approaches. We will need a second variable at this stage.
data_ts_all <- ts(HistoricalData[,c(2,3)], frequency = 12)
data_ts_phoneTest <- data_ts_all[,"Phone"][37:48] # to be used later
Let’s plot them together
autoplot(data_ts_all, facets = TRUE) # each series in its own panel
Fit the model
fit <- auto.arima(data_ts_all[,'Email'], xreg = data_ts_all[,'Phone'], stationary = TRUE)
Checking the regression coefficient - a positive coefficient indicates a direct (positive) relationship between the phone and email volumes.
# check the regression coefficient
coefficients(fit)
## sma1 xreg
## 0.5753787 2.2429109
Forecast
fc_dynRegre <- forecast(fit, xreg = data_ts_phoneTest)
# Plot fc with x and y labels
autoplot(fc_dynRegre) + xlab("Month") + ylab("Email")
We can also allow for nonlinear relationships between the variables by adding polynomial terms of the regressor.
xreg <- cbind(Phone = data_ts_all[,'Phone'],
PhoneSquared = data_ts_all[,'Phone']^2,
PhoneCubed = data_ts_all[,'Phone']^3)
# Fit model
fit <- auto.arima(data_ts_all[, "Email"], xreg = xreg)
# Forecast fit 12 month ahead
fc_dynRegre_NL<- forecast(fit,
xreg = cbind(data_ts_phoneTest, data_ts_phoneTest^2, data_ts_phoneTest^3))
## Warning in forecast.forecast_ARIMA(fit, xreg = cbind(data_ts_phoneTest, : xreg
## contains different column names from the xreg used in training. Please check
## that the regressors are in the same order.
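The warning appears because the columns passed to forecast() here are unnamed, while the xreg used in training had names. A small variation that names them consistently (xreg_future is just an illustrative name) avoids the warning:
xreg_future <- cbind(Phone        = data_ts_phoneTest,
                     PhoneSquared = data_ts_phoneTest^2,
                     PhoneCubed   = data_ts_phoneTest^3)
fc_dynRegre_NL <- forecast(fit, xreg = xreg_future) # same model, matching regressor names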
# Set up harmonic regressors of order K
# K must not be greater than period/2
harmonics <- fourier(train, K = 1)
# Fit regression model with ARIMA errors
fit <- auto.arima(train, xreg = harmonics, seasonal = FALSE)
# Forecast the next 12 months
newharmonics <- fourier(train, K = 1, h = 12)
fc_fourier <- forecast(fit, xreg = newharmonics)
# Plot forecasts fc_fourier
autoplot(fc_fourier)
Accuracy
accuracy(fc_fourier, data_ts)['Test set','MAPE']
## [1] 9.635415
TBATS combines the building blocks we have seen so far: Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components. It works well for time series with multiple seasonal patterns and strong trend.
fit <- tbats(train)
# Forecast the series for the next 12 months
fc_TBAT <- forecast(fit, h = 12)
# Plot the forecasts
autoplot(fc_TBAT)
Accuracy
accuracy(fc_TBAT, data_ts)['Test set','MAPE']
## [1] 6.212802
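To keep the comparison in one place, the test-set MAPE values of the models fitted above can be collected into a single table; a small sketch using the forecast objects created in this post (mape_test is just an illustrative helper name):
mape_test <- function(fc) accuracy(fc, data_ts)['Test set', 'MAPE']
data.frame(Model = c("Seasonal naive", "Holt-Winters (multiplicative)", "ETS",
                     "auto.arima", "Fourier + ARIMA errors", "TBATS"),
           MAPE  = c(mape_test(fc_snaive), mape_test(fc_ses_ts_m), mape_test(fc_ETS),
                     mape_test(fc_arima), mape_test(fc_fourier), mape_test(fc_TBAT)))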
As we have seen above, complex approaches do not always give the best result. It is, therefore, necessary to try multiple methods and choose the one that works the best for your problem. Forecasting is a process where models are continually fine-tuned for better accuracy. R can definitely make some of the hard work that comes with this easier.