On Wednesday, March 11, 2015, I presented a time series forecasting case study in the Analytics and Insight technical session at the 2015 DRIVE/ conference. This document is meant to provide a summary of that talk and provide the R code used to produce the analysis. Save for some exploratory data analysis using Tableau, everything was done in R.
In the presentation, I outlined four steps to produce reliable results in your forecasting projects. These four steps are:
- Right Data (select the right data)
- Training and Testing Sets (split the data into…) - Right Horizon (the # of future periods that can be forecasted reliably)
- Right Forecasting Model (finding the model with the best fit)
This process can be adapted to any of your predictive modeling projects, though the “right horizon” step can be skipped for predictive modeling projects that are not forecasting into the future. The process of splitting the data into a training and testing set, and building and evaluating models was summarized on this slide from the presentation.
In this case study, I was asked to predict the amount of money we would raise through the next 20 quarters for a particular category of giving.
These packages are used in the analysis. Here they are loaded, suppressing any run-time warnings.
library(forecast)
library(fpp)
library(reshape)
library(ggplot2)
From our gift database, I extracted all gifts for the past 42 quarters for a specific set of funds. I used Oracle SQL Developer to extract the data. The method you use to extract the data is independent of the ability to perform the analysis.
Prior to loading the data into R, exploratory analysis on the gift data was done using Tableau. You may recall this graph from the presentation.
Using visual analysis, we saw that for gift levels under $100,000 the general linear trend was smooth. Therefore, we determined that the right data for this project were the subset of gifts under $100,000. Tableau offers the ability to export the quarterly data directly from Tableau, thus a single set of data was exported from Tableau that had the sum of total giving by fiscal year/quarter. The file has two fields, the time period and the amount. The same aggregation could have been performed using SQL on all gifts, or with an Excel Pivot Table.
The data exported from Tableau is a tab-delimited file and maintains any numerical formatting applied in Tableau, such as currency format. The file was opened with Excel and
- the variables were renamed to timeperiod and amount
- the values for amount were reformatted from currency to a plain number
- the file was saved as a .csv
In R, the path to the directory is set to where the data are. Here is an example using a Mac.
setwd("/Users/username/Documents/DRIVE/DRIVE 2015/Presentation")
Next, the data are loaded into a variable named df. The data class is a data frame, which you can think of as an Excel spreadsheet with rows (observations) and columns (variables).
# the data is another directory down, named "data".
# colClasses explicitly defines the data type for each column in the dataset
# df can be any variable name. I like to use df because it is short for "data frame",
# which is the type of object class created in this instance
df <- read.csv("./data/mydata.csv",
header = TRUE, colClasses = c("character", "numeric"))
In order to protect the privacy of the data, an indexing algorithm was applied (code not shown). This changed the data slightly from what was presented at the conference. It does not alter the process or results.
Here are the first six observations (i.e. rows) in the data frame df.
head(df)
## timeperiod amount
## 1 FY 2005 Q1 100.0000
## 2 FY 2005 Q2 683.5988
## 3 FY 2005 Q3 251.6945
## 4 FY 2005 Q4 289.5020
## 5 FY 2006 Q1 128.9117
## 6 FY 2006 Q2 624.1277
In order to perform time series analyses in R, the data, which is currently stored as a data frame object, needs to be converted into a time series object.
# frequency = 4 tells R this is a quarterly time series data.
# start = c(2005, 1) tells R that the first time period is year 2005, quarter 1.
### in my mind, I'll know that this is fiscal year 2005, quarter 1. R doesn't care.
my.ts <- ts(df$amount, frequency = 4, start = c(2005,1))
This is what that time series looks like. Notice how this structure differs from the data frame.
# display the time series object
my.ts
## Qtr1 Qtr2 Qtr3 Qtr4
## 2005 100.00000 683.59882 251.69448 289.50202
## 2006 128.91166 624.12773 190.67689 241.08664
## 2007 103.16789 666.92079 190.86528 322.43431
## 2008 145.38125 711.12315 196.49469 300.45378
## 2009 148.71574 575.74578 195.21167 310.43856
## 2010 127.18824 689.35414 225.94688 301.37530
## 2011 81.76783 731.63781 206.97566 231.63494
## 2012 138.87962 783.20943 287.30762 309.56442
## 2013 194.44760 832.47487 199.22599 358.11528
## 2014 142.06643 949.27251 239.26883 408.16276
## 2015 208.10792 1092.59348
Here, the original data are transformed into a second time series, calculating the running aggregate total by fiscal quarter.
# aggregate giving dataset
agg.ts <- ts(cumsum(my.ts), frequency = 4, start = c(2005,1))
Here is the aggregated data as a time series.
agg.ts
## Qtr1 Qtr2 Qtr3 Qtr4
## 2005 100.0000 783.5988 1035.2933 1324.7953
## 2006 1453.7070 2077.8347 2268.5116 2509.5982
## 2007 2612.7661 3279.6869 3470.5522 3792.9865
## 2008 3938.3678 4649.4909 4845.9856 5146.4394
## 2009 5295.1551 5870.9009 6066.1126 6376.5511
## 2010 6503.7394 7193.0935 7419.0404 7720.4157
## 2011 7802.1835 8533.8213 8740.7970 8972.4319
## 2012 9111.3115 9894.5210 10181.8286 10491.3930
## 2013 10685.8406 11518.3155 11717.5415 12075.6567
## 2014 12217.7232 13166.9957 13406.2645 13814.4273
## 2015 14022.5352 15115.1287
We now have two time series sets, our running total and our total giving by fiscal year/quarter. Prior to fitting models, each of our data sets need to be split into a training set and a test set, using a 80/20 split.
# running total training and test sets
# first 34 quarters (FY05Q1 - FY13Q2)
training <- window(agg.ts, start = 2005, end = (2013.5 - 0.01))
# last 8 quarters (FY13Q3 - FY15Q2)
testing <- window(agg.ts, start = 2013.5)
# Non aggregate sets
# first 34 quarters (FY05Q1 - FY13Q2)
training.na <- window(my.ts, start = 2005, end = (2013.5 - 0.01))
# last 8 quarters (FY12Q3 - FY15Q2)
testing.na <- window(my.ts, start = 2013.5)
Here is what the non-aggregate training and test sets look like.
training.na
## Qtr1 Qtr2 Qtr3 Qtr4
## 2005 100.00000 683.59882 251.69448 289.50202
## 2006 128.91166 624.12773 190.67689 241.08664
## 2007 103.16789 666.92079 190.86528 322.43431
## 2008 145.38125 711.12315 196.49469 300.45378
## 2009 148.71574 575.74578 195.21167 310.43856
## 2010 127.18824 689.35414 225.94688 301.37530
## 2011 81.76783 731.63781 206.97566 231.63494
## 2012 138.87962 783.20943 287.30762 309.56442
## 2013 194.44760 832.47487
testing.na
## Qtr1 Qtr2 Qtr3 Qtr4
## 2013 199.2260 358.1153
## 2014 142.0664 949.2725 239.2688 408.1628
## 2015 208.1079 1092.5935
Recall from the presentation that for reliable results, the # of periods that can be reliably forecast should be no bigger than the # of periods in the test set. That was explained in this slide.
If a longer forecast period is desired, the ratio of training/test could be made 75/25 or 70/30. Or, a longer historical period of data could be used. Also, there is nothing preventing us from forecasting further periods, but those will be less reliable. This concept was depicted in this slide.
When I was preparing for this presentation, I used this step-wise method for each model, and then repeated.
1. Fit the model to the training set
2. Make a forecast from the model predicting the next 8 quarters
3. Make a plot visualizing the delta between the forecast and the testing set
4. Calculate the percentage difference between the forecast and testing set and store that in a data frame
5. Repeat for next model
First we will look at steps 1, 2, and 3 for linear regression model and the Seasonal and Trend Decomposition with Loess (STL), because you will recall from the presentation that I showed the linear model first, expressed my dissatisfaction, and then used STL to show why linear regression was not optimal for these data. After completed this step, I will group the steps together for the remaining models, and you will see how these steps are very similar between models.
This code fits the models to the training sets.
##### running total data
fit.lm.agg <- tslm(training ~ trend) # linear regression
##### non-running total data
# notice that the Seasonal and Trend Decomposition with Loess was run on the full dataset. I did this to produce the plot that breaks out the seasonal, trend, and remainder components that we saw in the presentation. You'll see it here in a bit.
fit.stl <- stl(my.ts, s.window = 7) # seasonal and trend decomposition with Loess
Next, the fitted model is used to forecast the following 8 periods.
f.lm <- forecast(fit.lm.agg, h = 8, level = c(80,95)) # linear regression
f.stl <- forecast(fit.stl, method = "naive") # seasonal and trend decomposition with Loess
At the DRIVE Conference, I showed this plot of the linear regression as an example of visually exploring a prediction. I stated that I was not in love with the results because the forecast was well below the test data for all periods.
# pdf(file = "./images/linear reqression.pdf", width = 9,
# height = 7)
plot(f.lm,
ylab = "index (normally this would be $)", # y-axis label
xlab = "Fiscal Year/Quarter", # x-axis label
lwd = 3, # line width of the data
las = 1, # tells R to show all axis labels in horizontal orientation
main = "Linear Forecast Using 80/20 Training/Test split") # plot title
lines(testing, lwd = 2, col = "red") # add actual testing data as red line
lines(fitted(f.lm), col = "blue", lwd = 2) # add forecast predictions as blue line
legend("topleft", # add legend in the top-left corner of the plot
lty = 1, # lty for line type. lty = 1 indicates solid line
lwd = 3, # line width for legend
col = c("black", "red", "blue"), # color of lines, in order
legend = c("Training Data", "Test data", "Forecast") # names of lines, in order
)
# dev.off()
During the presentation, I explained that I did further exploratory analysis of the data after the linear regression was returned. This plot was shown.
# pdf(file = "./images/giving categorized by fiscal quarter.pdf", width = 9,
# height = 7)
plot(subset(my.ts, quarter = 2), # plot quarter 2 from the full time series data
ylim = c(0, max(my.ts)), # set the y-axis scale from 0 to the max value in the data
las = 1, ylab = "index", xlab = "Fiscal Year", col = "red", lwd = 4,
main = "Giving Categorized by Fiscal Quarter")
legend("topleft", lty = c(1,1,1,1,2), lwd = 3,
col = c("blue", "red", "azure4", "purple", "black"),
legend = c("Q1", "Q2", "Q3", "Q4", "training/test cutoff"))
lines(subset(my.ts, quarter = 1), col = "blue", lwd = 4) # add line for Q1
lines(subset(my.ts, quarter = 3), col = "azure4", lwd = 4) # add line for Q3
lines(subset(my.ts, quarter = 4), col = "purple", lwd = 4) # add line for Q4
abline(v = 2013.2, lwd = 3, col = "black", lty = 2) # add vertical line at x-value 2013.2
# dev.off()
The plot shows a) we have seasonal data and b) there is a sharp increase in our testing set that is present only at the tail of our training set.
Regarding the seasonal data, this is helpful to know, particularly if we need to forecast through the middle of a fiscal year. Recall that for this data, 55% of the annual income comes in the 2nd quarter alone, so taking an annual forecast and splitting it in half isn’t reliable enough.
Regarding the sharp increase in giving at the tail of the data, we will want to explore forecasting models that put a heavier weight on more recent observations, similar to what we do when creating RFM scores. If we have two donors giving the same lifetime amount, we give greater importance to the donor that has given more recently.
A Seasonal and Trend Decomposition with Loess (STL) plot helps visualize the overall trend from the seasonal components. The STL plot decomposes the actual data (top frame) into a seasonal component, trend component, and remainder component. During the presentation, I commented that in this plot we could clearly see a flat period for much of the training set, and starting at the end of the training set (FY 2012) the increase begins, and takes a sharp increase in our testing set.
# pdf(file = "./images/stl decomp.pdf", width = 9, height = 7)
plot(fit.stl, main = "STL Decomposition of Gifts, by Fiscal Year/Quarter",
lwd = 2)
# dev.off()
As I explained above, when I did this analysis, I did all the steps for one model, then moved to the next model. Now I’m just going to group the models together for each step.
Here is the code to fit the remaining four models used in the case study. They are all fit to the training set.
fit.hwa <- hw(training.na, seasonal = "additive") # Holt-Winters seasonal additive
fit.hwm <- hw(training.na, seasonal = "multiplicative") # Holt-Winters seasonal multiplicative
fit.ets <- ets(training.na) # ExponenTial Smoothing
fit.arima <- auto.arima(training.na) # ARIMA
Just as we did with the linear regression and STL models, the fitted models are used to forecast the following 8 periods.
f.hwa <- forecast(fit.hwa, h = 8) # Holt-Winters seasonal additive
f.hwm <- forecast(fit.hwm, h = 8) # Holt-Winters seasonal multiplicative
f.ets <- forecast(fit.ets, h = 8) # ExponenTial Smoothing
f.arima <- forecast(fit.arima, h = 8) # ARIMA
During the presentation I showed the results for the Holt-Winters’ Additive and Holt-Winters’ Multiplicative forecast. Here is how that plot was constructed.
# pdf(file = "./images/holt winters forecasts.pdf", width = 9, height = 7)
plot(fit.hwa, ylab = "index", las = 1, lwd = 2, xlab = "Fiscal Year",
plot.conf = FALSE, fcol = "white",
main = "Forecasts from Holt-Winters' Seasonal Method",
ylim = c(0, max(my.ts + .25)))
lines(testing.na, col = "red", lwd = 4) # test actual
lines(fit.hwa$mean, type = "o", col = "darkolivegreen", lwd = 4)
lines(fit.hwm$mean, type = "o", col = "blue", lwd = 4)
legend("topleft", lty = 1, lwd = 4, col = c("black", "red", "darkolivegreen", "blue"),
legend = c("Training Set", "Test Set", "Holt Winters' Additive",
"Holt Winters' Multiplicative"))
# dev.off()
Up until now, the observations have been visual. In order to select the best fitting model, I decided to calculate the percentage difference between the actual testing data and the forecast for each model for each period (fy/quarter). The calculation is
Since we’ll be performing that same calculation for all six models, let’s create a function to perform that calculation. The function is stored in a variable diff.calc.
diff.calc <- function(actual, forecast) {
v.forecast <- as.vector(forecast)
v.actual <- as.vector(actual)
((v.forecast - v.actual)/v.actual)*100
}
We need a place to store all these calculations. Here, we create a one-column data frame explicitly stating the period being measured.
# create a vector with these periods
periods <- c("FY13Q3", "FY13Q4", "FY14Q1", "FY14Q2", "FY14Q3", "FY14Q4",
"FY15Q1", "FY15Q2")
# create a dataframe named "differences" and put the periods into that
differences <- data.frame(periods)
This is what our data frame, named differences, looks like at this point.
differences
## periods
## 1 FY13Q3
## 2 FY13Q4
## 3 FY14Q1
## 4 FY14Q2
## 5 FY14Q3
## 6 FY14Q4
## 7 FY15Q1
## 8 FY15Q2
Next, we use the same process for all the model forecasts, except for one wrinkle. The linear model will follow a special case.
For our other five models, we simply need
- the testing data for each FY/Quarter, stored in the training.na time series created when we split the data into training/test sets, and
- the forecasted values for each FY/Quarter, stored in the model results
In the special case of the linear model, the forecasting results are cumulative running totals. We want the actual amount predicted to be raised for each FY/Quarter. Since this was a linear model, the resulting coefficient slope estimate is the amount being forecast for each and every quarter in the linear model. That value can be extracted from variable containing the results of the model. Above, we declared fit.lm.agg <- tslm(training ~ trend), and the resulting slope can be returned using fit.lm.agg$coefficients[2].
Using the diff.calc helper function, here we calculate the percentage difference between testing actual and forecasted for each of the models. The results are stored in their own column in the data frame differences.
differences$linear.rm <- diff.calc(testing.na, fit.lm.agg$coefficients[2])
differences$stl <- diff.calc(testing.na, f.stl$mean)
differences$hw.additive <- diff.calc(testing.na, f.hwa$mean)
differences$hw.mult <- diff.calc(testing.na, f.hwm$mean)
differences$ets <- diff.calc(testing.na, f.ets$mean)
differences$arima <- diff.calc(testing.na, f.arima$mean)
The resulting data frame now looks like this after rounding to four decimal places.
differences[,2:7] <- round(differences[,2:7],4)
differences
## periods linear.rm stl hw.additive hw.mult ets arima
## 1 FY13Q3 64.5786 129.3613 36.5086 20.8799 46.2275 32.6833
## 2 FY13Q4 -8.4420 52.1257 -2.8419 -6.7108 -0.4790 -14.3265
## 3 FY14Q1 130.7958 156.0136 36.2487 -2.9913 45.1548 23.2408
## 4 FY14Q2 -65.4595 15.0980 -18.4185 -13.1890 -19.7947 -15.7957
## 5 FY14Q3 37.0355 90.9765 29.9559 9.8459 21.7556 14.1735
## 6 FY14Q4 -19.6685 33.4726 -5.2041 -10.8384 -12.6819 -22.6652
## 7 FY15Q1 57.5545 74.7696 11.7434 -27.9901 -0.9090 -11.6200
## 8 FY15Q2 -69.9904 0.0000 -25.5520 -18.1273 -30.3156 -26.0320
In order to compare the six models, we create a graph to visually inspect the percentage differences between actual and forecast for all six models.
In order to create that graph, we need to reshape the data into a structure that is more suitable for plotting. We need to have all our measures in a single column. Using the melt function, our resulting data frame will have a single row (observation) for each period/test type combination.
# melt the differences dataframe
my.melt <- melt(differences, id = c("periods"))
# rename the columns
names(my.melt)[2:3] <- c("model.type", "delta")
Here are the first ten rows of our “melted” data frame.
head(my.melt, 10)
## periods model.type delta
## 1 FY13Q3 linear.rm 64.5786
## 2 FY13Q4 linear.rm -8.4420
## 3 FY14Q1 linear.rm 130.7958
## 4 FY14Q2 linear.rm -65.4595
## 5 FY14Q3 linear.rm 37.0355
## 6 FY14Q4 linear.rm -19.6685
## 7 FY15Q1 linear.rm 57.5545
## 8 FY15Q2 linear.rm -69.9904
## 9 FY13Q3 stl 129.3613
## 10 FY13Q4 stl 52.1257
Finally, we can construct a bar plot for each forecasting model. Upon visualizing the plots, we see that the hw.mult model appears to have the best fit.
# pdf(file = "./images/six models delta.pdf", width = 9, height = 7)
ggplot(data = my.melt, aes(x = periods, y = delta, fill = model.type)) +
geom_bar(stat = "identity", position = position_dodge()) +
facet_wrap(~model.type) + theme(axis.text.x = element_text(hjust = -3, angle = 90))
# dev.off()
Until now, our model selection was based on a forecast of historical periods. Our objective, though, is to forecast future periods. Recall from the presentation this flowchart, which explains that once the optimal model is found, we refit that model on the training and test sets combined.
The model is refit to the 42 quarters spanning FY05Q1 - FY15Q2. A forecast is made for the eight quarters from FY15Q3 - FY17Q2. The model returns the point forecast and the forecasts for the 80% and 95% confidence intervals.
# h = 8 dictates an 8-period forecast
# level = c(80,95) dictates the confidence intervals returned. These can be
# changed to suit your need.
final.fit <- hw(my.ts, seasonal = "multiplicative",
level = c(80,95), h = 8)
Here is a table of the resulting forecast.
final.fit
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2015 Q3 281.7488 215.7490 347.7487 180.8108 382.6869
## 2015 Q4 429.1645 328.4326 529.8963 275.1084 583.2205
## 2016 Q1 210.5383 160.9132 260.1633 134.6433 286.4332
## 2016 Q2 1156.7555 882.1801 1431.3310 736.8286 1576.6825
## 2016 Q3 314.3420 230.4126 398.2713 185.9830 442.7009
## 2016 Q4 477.4620 348.5323 606.3917 280.2811 674.6429
## 2017 Q1 233.6058 169.6176 297.5940 135.7444 331.4672
## 2017 Q2 1280.2353 923.4482 1637.0224 734.5765 1825.8941
Here is a visualization of the resulting forecast.
# pdf(file = "./images/final forecast.pdf", width = 9, height = 7)
plot(forecast(final.fit), lwd = 3, las = 1, main = "Final Forecast",
xlab = "Fiscal Year", ylab = "index")
legend("topleft", lty = 1, lwd = 4, col = c("black", "blue"),
legend = c("Historical Data", "Forecast"))
# dev.off()
From here, we might want to get the totals for the point forecast and the confidence intervals so we can make a business statement that answers the original business question. That can be accomplished either by exporting the results to a .csv which can then be opened in Excel, or calculating the sums right in R.
This is how to export the table.
write.csv(final.fit, "./mytable.csv")
This is how to calculate the sums.
lo.80 <- sum(final.fit$lower[,1])
lo.95 <- sum(final.fit$lower[,2])
hi.80 <- sum(final.fit$upper[,1])
hi.95 <- sum(final.fit$upper[,2])
point <- sum(final.fit$mean)
I might choose to make the following statement that answers the business question.
Given the historical data, our organization can reliably forecast with 95% confidence that we will raise between $2664 and $6104 over the next eight fiscal quarters from gifts under $100,000. This amount represents our base ability, and given a pipeline of prospects above $100,000, we most certainly will raise much more than this. At the present time, the data limits our ability to reliably forecast beyond 8 quarters. One year from now this forecast can be revisited, at which time a 12 - 16 period forecast can be provided.
Caffo, Brian. Practice Machine Learning. Coursera class. https://www.coursera.org/course/predmachlearn
Hyndman, Rob and George Athanasopoulos. Forecasting: principles and practice. OTexts. https://www.otexts.org/book/fpp