Reliable Results - DRIVE 2015

Synopsis

On Wednesday, March 11, 2015, I presented a time series forecasting case study in the Analytics and Insight technical session at the 2015 DRIVE/ conference. This document is meant to provide a summary of that talk and provide the R code used to produce the analysis. Save for some exploratory data analysis using Tableau, everything was done in R.

Process for Reliable Results

In the presentation, I outlined four steps to produce reliable results in your forecasting projects. These four steps are:
- Right Data (select the right data)
- Training and Testing Sets (split the data into…) - Right Horizon (the # of future periods that can be forecasted reliably)
- Right Forecasting Model (finding the model with the best fit)

This process can be adapted to any of your predictive modeling projects, though the “right horizon” step can be skipped for predictive modeling projects that are not forecasting into the future. The process of splitting the data into a training and testing set, and building and evaluating models was summarized on this slide from the presentation.

Case Study

In this case study, I was asked to predict the amount of money we would raise through the next 20 quarters for a particular category of giving.

Required R Packages

These packages are used in the analysis. Here they are loaded, suppressing any run-time warnings.

library(forecast)
library(fpp)
library(reshape)
library(ggplot2)

Getting the Data

From our gift database, I extracted all gifts for the past 42 quarters for a specific set of funds. I used Oracle SQL Developer to extract the data. The method you use to extract the data is independent of the ability to perform the analysis.

Exploratory Analysis (Right Data)

Prior to loading the data into R, exploratory analysis on the gift data was done using Tableau. You may recall this graph from the presentation.

Using visual analysis, we saw that for gift levels under $100,000 the general linear trend was smooth. Therefore, we determined that the right data for this project were the subset of gifts under $100,000. Tableau offers the ability to export the quarterly data directly from Tableau, thus a single set of data was exported from Tableau that had the sum of total giving by fiscal year/quarter. The file has two fields, the time period and the amount. The same aggregation could have been performed using SQL on all gifts, or with an Excel Pivot Table.

Loading the Data into R

Preliminary Data Preparation in Excel

The data exported from Tableau is a tab-delimited file and maintains any numerical formatting applied in Tableau, such as currency format. The file was opened with Excel and
- the variables were renamed to timeperiod and amount
- the values for amount were reformatted from currency to a plain number
- the file was saved as a .csv

Importing the Data into R

In R, the path to the directory is set to where the data are. Here is an example using a Mac.

setwd("/Users/username/Documents/DRIVE/DRIVE 2015/Presentation")

Next, the data are loaded into a variable named df. The data class is a data frame, which you can think of as an Excel spreadsheet with rows (observations) and columns (variables).

# the data is another directory down, named "data".
# colClasses explicitly defines the data type for each column in the dataset
# df can be any variable name. I like to use df because it is short for "data frame", 
# which is the type of object class created in this instance  
df <- read.csv("./data/mydata.csv",
                        header = TRUE, colClasses = c("character", "numeric"))

Data Transformations

In order to protect the privacy of the data, an indexing algorithm was applied (code not shown). This changed the data slightly from what was presented at the conference. It does not alter the process or results.

Here are the first six observations (i.e. rows) in the data frame df.

head(df)

##   timeperiod   amount
## 1 FY 2005 Q1 100.0000
## 2 FY 2005 Q2 683.5988
## 3 FY 2005 Q3 251.6945
## 4 FY 2005 Q4 289.5020
## 5 FY 2006 Q1 128.9117
## 6 FY 2006 Q2 624.1277

In order to perform time series analyses in R, the data, which is currently stored as a data frame object, needs to be converted into a time series object.

# frequency = 4 tells R this is a quarterly time series data. 
# start = c(2005, 1) tells R that the first time period is year 2005, quarter 1. 
### in my mind, I'll know that this is fiscal year 2005, quarter 1. R doesn't care.  
my.ts <- ts(df$amount, frequency = 4, start = c(2005,1))

This is what that time series looks like. Notice how this structure differs from the data frame.

# display the time series object
my.ts

##            Qtr1       Qtr2       Qtr3       Qtr4
## 2005  100.00000  683.59882  251.69448  289.50202
## 2006  128.91166  624.12773  190.67689  241.08664
## 2007  103.16789  666.92079  190.86528  322.43431
## 2008  145.38125  711.12315  196.49469  300.45378
## 2009  148.71574  575.74578  195.21167  310.43856
## 2010  127.18824  689.35414  225.94688  301.37530
## 2011   81.76783  731.63781  206.97566  231.63494
## 2012  138.87962  783.20943  287.30762  309.56442
## 2013  194.44760  832.47487  199.22599  358.11528
## 2014  142.06643  949.27251  239.26883  408.16276
## 2015  208.10792 1092.59348

Here, the original data are transformed into a second time series, calculating the running aggregate total by fiscal quarter.

# aggregate giving dataset
agg.ts <- ts(cumsum(my.ts), frequency = 4, start = c(2005,1))

Here is the aggregated data as a time series.

agg.ts

##            Qtr1       Qtr2       Qtr3       Qtr4
## 2005   100.0000   783.5988  1035.2933  1324.7953
## 2006  1453.7070  2077.8347  2268.5116  2509.5982
## 2007  2612.7661  3279.6869  3470.5522  3792.9865
## 2008  3938.3678  4649.4909  4845.9856  5146.4394
## 2009  5295.1551  5870.9009  6066.1126  6376.5511
## 2010  6503.7394  7193.0935  7419.0404  7720.4157
## 2011  7802.1835  8533.8213  8740.7970  8972.4319
## 2012  9111.3115  9894.5210 10181.8286 10491.3930
## 2013 10685.8406 11518.3155 11717.5415 12075.6567
## 2014 12217.7232 13166.9957 13406.2645 13814.4273
## 2015 14022.5352 15115.1287

Training and Test Sets

We now have two time series sets, our running total and our total giving by fiscal year/quarter. Prior to fitting models, each of our data sets need to be split into a training set and a test set, using a 80/20 split.

# running total training and test sets
# first 34 quarters (FY05Q1 - FY13Q2)
training <- window(agg.ts, start = 2005, end = (2013.5 - 0.01))
# last 8 quarters (FY13Q3 - FY15Q2)
testing <- window(agg.ts, start = 2013.5)

# Non aggregate sets
# first 34 quarters (FY05Q1 - FY13Q2)
training.na <- window(my.ts, start = 2005, end = (2013.5 - 0.01))
# last 8 quarters (FY12Q3 - FY15Q2)
testing.na <- window(my.ts, start = 2013.5)

Here is what the non-aggregate training and test sets look like.

training.na

##           Qtr1      Qtr2      Qtr3      Qtr4
## 2005 100.00000 683.59882 251.69448 289.50202
## 2006 128.91166 624.12773 190.67689 241.08664
## 2007 103.16789 666.92079 190.86528 322.43431
## 2008 145.38125 711.12315 196.49469 300.45378
## 2009 148.71574 575.74578 195.21167 310.43856
## 2010 127.18824 689.35414 225.94688 301.37530
## 2011  81.76783 731.63781 206.97566 231.63494
## 2012 138.87962 783.20943 287.30762 309.56442
## 2013 194.44760 832.47487

testing.na

##           Qtr1      Qtr2      Qtr3      Qtr4
## 2013                      199.2260  358.1153
## 2014  142.0664  949.2725  239.2688  408.1628
## 2015  208.1079 1092.5935

Right Horizon

Recall from the presentation that for reliable results, the # of periods that can be reliably forecast should be no bigger than the # of periods in the test set. That was explained in this slide.

If a longer forecast period is desired, the ratio of training/test could be made 75/25 or 70/30. Or, a longer historical period of data could be used. Also, there is nothing preventing us from forecasting further periods, but those will be less reliable. This concept was depicted in this slide.

Finding the Right Model

When I was preparing for this presentation, I used this step-wise method for each model, and then repeated.
1. Fit the model to the training set
2. Make a forecast from the model predicting the next 8 quarters
3. Make a plot visualizing the delta between the forecast and the testing set
4. Calculate the percentage difference between the forecast and testing set and store that in a data frame
5. Repeat for next model

First we will look at steps 1, 2, and 3 for linear regression model and the Seasonal and Trend Decomposition with Loess (STL), because you will recall from the presentation that I showed the linear model first, expressed my dissatisfaction, and then used STL to show why linear regression was not optimal for these data. After completed this step, I will group the steps together for the remaining models, and you will see how these steps are very similar between models.

Linear Regression and Seasonal and Trend Decomposition with Loess (STL)

Fitting the models for Linear Regression and STL

This code fits the models to the training sets.

##### running total data
fit.lm.agg <- tslm(training ~ trend) # linear regression
##### non-running total data
# notice that the Seasonal and Trend Decomposition with Loess was run on the full dataset. I did this to produce the plot that breaks out the seasonal, trend, and remainder components that we saw in the presentation. You'll see it here in a bit.  
fit.stl <- stl(my.ts, s.window = 7) # seasonal and trend decomposition with Loess

Forecasting for Linear and STL

Next, the fitted model is used to forecast the following 8 periods.

f.lm <- forecast(fit.lm.agg, h = 8, level = c(80,95)) # linear regression
f.stl <- forecast(fit.stl, method = "naive") # seasonal and trend decomposition with Loess

Visually Exploring the Fit and Forecast and Linear and STL

At the DRIVE Conference, I showed this plot of the linear regression as an example of visually exploring a prediction. I stated that I was not in love with the results because the forecast was well below the test data for all periods.

# pdf(file = "./images/linear reqression.pdf", width = 9,
# height = 7)
plot(f.lm,
     ylab = "index (normally this would be $)", # y-axis label
     xlab = "Fiscal Year/Quarter", # x-axis label
     lwd = 3, # line width of the data
     las = 1, # tells R to show all axis labels in horizontal orientation
     main = "Linear Forecast Using 80/20 Training/Test split") # plot title
lines(testing, lwd = 2, col = "red") # add actual testing data as red line
lines(fitted(f.lm), col = "blue", lwd = 2) # add forecast predictions as blue line
legend("topleft", # add legend in the top-left corner of the plot
       lty = 1, # lty for line type. lty = 1 indicates solid line
       lwd = 3, # line width for legend
       col = c("black", "red", "blue"), # color of lines, in order
       legend = c("Training Data", "Test data", "Forecast") # names of lines, in order
       )

# dev.off()

During the presentation, I explained that I did further exploratory analysis of the data after the linear regression was returned. This plot was shown.

# pdf(file = "./images/giving categorized by fiscal quarter.pdf", width = 9,
# height = 7)
plot(subset(my.ts, quarter = 2), # plot quarter 2 from the full time series data
     ylim = c(0, max(my.ts)), # set the y-axis scale from 0 to the max value in the data
     las = 1, ylab = "index", xlab = "Fiscal Year", col = "red", lwd = 4,
     main = "Giving Categorized by Fiscal Quarter")
legend("topleft", lty = c(1,1,1,1,2), lwd = 3,
       col = c("blue", "red", "azure4", "purple", "black"),
       legend = c("Q1", "Q2", "Q3", "Q4", "training/test cutoff"))
lines(subset(my.ts, quarter = 1), col = "blue", lwd = 4) # add line for Q1
lines(subset(my.ts, quarter = 3), col = "azure4", lwd = 4) # add line for Q3
lines(subset(my.ts, quarter = 4), col = "purple", lwd = 4) # add line for Q4
abline(v = 2013.2, lwd = 3, col = "black", lty = 2) # add vertical line at x-value 2013.2

# dev.off()

The plot shows a) we have seasonal data and b) there is a sharp increase in our testing set that is present only at the tail of our training set.

Regarding the seasonal data, this is helpful to know, particularly if we need to forecast through the middle of a fiscal year. Recall that for this data, 55% of the annual income comes in the 2nd quarter alone, so taking an annual forecast and splitting it in half isn’t reliable enough.

Regarding the sharp increase in giving at the tail of the data, we will want to explore forecasting models that put a heavier weight on more recent observations, similar to what we do when creating RFM scores. If we have two donors giving the same lifetime amount, we give greater importance to the donor that has given more recently.

A Seasonal and Trend Decomposition with Loess (STL) plot helps visualize the overall trend from the seasonal components. The STL plot decomposes the actual data (top frame) into a seasonal component, trend component, and remainder component. During the presentation, I commented that in this plot we could clearly see a flat period for much of the training set, and starting at the end of the training set (FY 2012) the increase begins, and takes a sharp increase in our testing set.

# pdf(file = "./images/stl decomp.pdf", width = 9, height = 7)
plot(fit.stl, main = "STL Decomposition of Gifts, by Fiscal Year/Quarter",
     lwd = 2)

# dev.off()

Fitting the Other Models

As I explained above, when I did this analysis, I did all the steps for one model, then moved to the next model. Now I’m just going to group the models together for each step.

Here is the code to fit the remaining four models used in the case study. They are all fit to the training set.

fit.hwa <- hw(training.na, seasonal = "additive") # Holt-Winters seasonal additive
fit.hwm <- hw(training.na, seasonal = "multiplicative") # Holt-Winters seasonal multiplicative
fit.ets <- ets(training.na) # ExponenTial Smoothing
fit.arima <- auto.arima(training.na) # ARIMA

Forecasting the Other Models

Just as we did with the linear regression and STL models, the fitted models are used to forecast the following 8 periods.

f.hwa <- forecast(fit.hwa, h = 8) # Holt-Winters seasonal additive
f.hwm <- forecast(fit.hwm, h = 8) # Holt-Winters seasonal multiplicative
f.ets <- forecast(fit.ets, h = 8) # ExponenTial Smoothing
f.arima <- forecast(fit.arima, h = 8) # ARIMA

During the presentation I showed the results for the Holt-Winters’ Additive and Holt-Winters’ Multiplicative forecast. Here is how that plot was constructed.

# pdf(file = "./images/holt winters forecasts.pdf", width = 9, height = 7)
plot(fit.hwa, ylab = "index", las = 1, lwd = 2, xlab = "Fiscal Year", 
     plot.conf = FALSE, fcol = "white",
     main = "Forecasts from Holt-Winters' Seasonal Method",
     ylim = c(0, max(my.ts + .25)))
lines(testing.na, col = "red", lwd = 4) # test actual
lines(fit.hwa$mean, type = "o", col = "darkolivegreen", lwd = 4)
lines(fit.hwm$mean, type = "o", col = "blue", lwd = 4)
legend("topleft", lty = 1, lwd = 4, col = c("black", "red", "darkolivegreen", "blue"),
       legend = c("Training Set", "Test Set", "Holt Winters' Additive",
                  "Holt Winters' Multiplicative"))

# dev.off()

Model Evaluation

Calculating Deltas between Forecast and Actual

Up until now, the observations have been visual. In order to select the best fitting model, I decided to calculate the percentage difference between the actual testing data and the forecast for each model for each period (fy/quarter). The calculation is

Since we’ll be performing that same calculation for all six models, let’s create a function to perform that calculation. The function is stored in a variable diff.calc.

diff.calc <- function(actual, forecast) {
  v.forecast <- as.vector(forecast)
  v.actual <- as.vector(actual)
  ((v.forecast - v.actual)/v.actual)*100
}

We need a place to store all these calculations. Here, we create a one-column data frame explicitly stating the period being measured.

# create a vector with these periods
periods <- c("FY13Q3", "FY13Q4", "FY14Q1", "FY14Q2", "FY14Q3", "FY14Q4",
             "FY15Q1", "FY15Q2")
# create a dataframe named "differences" and put the periods into that 
differences <- data.frame(periods)

This is what our data frame, named differences, looks like at this point.

differences

##   periods
## 1  FY13Q3
## 2  FY13Q4
## 3  FY14Q1
## 4  FY14Q2
## 5  FY14Q3
## 6  FY14Q4
## 7  FY15Q1
## 8  FY15Q2

Next, we use the same process for all the model forecasts, except for one wrinkle. The linear model will follow a special case.
For our other five models, we simply need
- the testing data for each FY/Quarter, stored in the training.na time series created when we split the data into training/test sets, and
- the forecasted values for each FY/Quarter, stored in the model results

In the special case of the linear model, the forecasting results are cumulative running totals. We want the actual amount predicted to be raised for each FY/Quarter. Since this was a linear model, the resulting coefficient slope estimate is the amount being forecast for each and every quarter in the linear model. That value can be extracted from variable containing the results of the model. Above, we declared fit.lm.agg <- tslm(training ~ trend), and the resulting slope can be returned using fit.lm.agg$coefficients[2].

Using the diff.calc helper function, here we calculate the percentage difference between testing actual and forecasted for each of the models. The results are stored in their own column in the data frame differences.

differences$linear.rm <- diff.calc(testing.na, fit.lm.agg$coefficients[2])
differences$stl <- diff.calc(testing.na, f.stl$mean)
differences$hw.additive <- diff.calc(testing.na, f.hwa$mean)
differences$hw.mult <- diff.calc(testing.na, f.hwm$mean)
differences$ets <- diff.calc(testing.na, f.ets$mean)
differences$arima <- diff.calc(testing.na, f.arima$mean)

The resulting data frame now looks like this after rounding to four decimal places.

differences[,2:7] <- round(differences[,2:7],4)
differences

##   periods linear.rm      stl hw.additive  hw.mult      ets    arima
## 1  FY13Q3   64.5786 129.3613     36.5086  20.8799  46.2275  32.6833
## 2  FY13Q4   -8.4420  52.1257     -2.8419  -6.7108  -0.4790 -14.3265
## 3  FY14Q1  130.7958 156.0136     36.2487  -2.9913  45.1548  23.2408
## 4  FY14Q2  -65.4595  15.0980    -18.4185 -13.1890 -19.7947 -15.7957
## 5  FY14Q3   37.0355  90.9765     29.9559   9.8459  21.7556  14.1735
## 6  FY14Q4  -19.6685  33.4726     -5.2041 -10.8384 -12.6819 -22.6652
## 7  FY15Q1   57.5545  74.7696     11.7434 -27.9901  -0.9090 -11.6200
## 8  FY15Q2  -69.9904   0.0000    -25.5520 -18.1273 -30.3156 -26.0320

Evaluating Model Accuracy

In order to compare the six models, we create a graph to visually inspect the percentage differences between actual and forecast for all six models.

In order to create that graph, we need to reshape the data into a structure that is more suitable for plotting. We need to have all our measures in a single column. Using the melt function, our resulting data frame will have a single row (observation) for each period/test type combination.

# melt the differences dataframe
my.melt <- melt(differences, id = c("periods"))
# rename the columns
names(my.melt)[2:3] <- c("model.type", "delta")

Here are the first ten rows of our “melted” data frame.

head(my.melt, 10)

##    periods model.type    delta
## 1   FY13Q3  linear.rm  64.5786
## 2   FY13Q4  linear.rm  -8.4420
## 3   FY14Q1  linear.rm 130.7958
## 4   FY14Q2  linear.rm -65.4595
## 5   FY14Q3  linear.rm  37.0355
## 6   FY14Q4  linear.rm -19.6685
## 7   FY15Q1  linear.rm  57.5545
## 8   FY15Q2  linear.rm -69.9904
## 9   FY13Q3        stl 129.3613
## 10  FY13Q4        stl  52.1257

Finally, we can construct a bar plot for each forecasting model. Upon visualizing the plots, we see that the hw.mult model appears to have the best fit.

# pdf(file = "./images/six models delta.pdf", width = 9, height = 7)
ggplot(data = my.melt, aes(x = periods, y = delta, fill = model.type)) + 
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_wrap(~model.type) + theme(axis.text.x = element_text(hjust = -3, angle = 90))

# dev.off()

Getting the Furture Forecast

Until now, our model selection was based on a forecast of historical periods. Our objective, though, is to forecast future periods. Recall from the presentation this flowchart, which explains that once the optimal model is found, we refit that model on the training and test sets combined.

The model is refit to the 42 quarters spanning FY05Q1 - FY15Q2. A forecast is made for the eight quarters from FY15Q3 - FY17Q2. The model returns the point forecast and the forecasts for the 80% and 95% confidence intervals.

# h = 8 dictates an 8-period forecast  
# level = c(80,95) dictates the confidence intervals returned. These can be
# changed to suit your need.
final.fit <- hw(my.ts, seasonal = "multiplicative",
                level = c(80,95), h = 8)

Here is a table of the resulting forecast.

final.fit

##         Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## 2015 Q3       281.7488 215.7490  347.7487 180.8108  382.6869
## 2015 Q4       429.1645 328.4326  529.8963 275.1084  583.2205
## 2016 Q1       210.5383 160.9132  260.1633 134.6433  286.4332
## 2016 Q2      1156.7555 882.1801 1431.3310 736.8286 1576.6825
## 2016 Q3       314.3420 230.4126  398.2713 185.9830  442.7009
## 2016 Q4       477.4620 348.5323  606.3917 280.2811  674.6429
## 2017 Q1       233.6058 169.6176  297.5940 135.7444  331.4672
## 2017 Q2      1280.2353 923.4482 1637.0224 734.5765 1825.8941

Here is a visualization of the resulting forecast.

# pdf(file = "./images/final forecast.pdf", width = 9, height = 7)
plot(forecast(final.fit), lwd = 3, las = 1, main = "Final Forecast",
     xlab = "Fiscal Year", ylab = "index")
legend("topleft", lty = 1, lwd = 4, col = c("black", "blue"),
       legend = c("Historical Data", "Forecast"))

# dev.off()

From here, we might want to get the totals for the point forecast and the confidence intervals so we can make a business statement that answers the original business question. That can be accomplished either by exporting the results to a .csv which can then be opened in Excel, or calculating the sums right in R.

This is how to export the table.

write.csv(final.fit, "./mytable.csv")

This is how to calculate the sums.

lo.80 <- sum(final.fit$lower[,1])
lo.95 <- sum(final.fit$lower[,2])
hi.80 <- sum(final.fit$upper[,1])
hi.95 <- sum(final.fit$upper[,2])
point <- sum(final.fit$mean)

Answer to the Question

I might choose to make the following statement that answers the business question.

Given the historical data, our organization can reliably forecast with 95% confidence that we will raise between $2664 and $6104 over the next eight fiscal quarters from gifts under $100,000. This amount represents our base ability, and given a pipeline of prospects above $100,000, we most certainly will raise much more than this. At the present time, the data limits our ability to reliably forecast beyond 8 quarters. One year from now this forecast can be revisited, at which time a 12 - 16 period forecast can be provided.

References and Further Learning

Caffo, Brian. Practice Machine Learning. Coursera class. https://www.coursera.org/course/predmachlearn
Hyndman, Rob and George Athanasopoulos. Forecasting: principles and practice. OTexts. https://www.otexts.org/book/fpp