Forecasting Using Time Series Data

Objectives:

  1. Load data set
  2. Render a plot of data points with a forecast X months in the future
  3. Visual Analysis - Determine if trend, seasonality, and noise exists in the data and apply the correct algorithm
  4. Partition Series
  5. Provide formula for forecasting algorithm
  6. Show forecast error

References

  1. “Practical Time Series Forecasting with R. A Hands-On Guide”, Scmueli, Lichtendahl Jr., Axelrod Schnall Publishers, 2016, Kindle version
  2. “The Book of R. A First Course in Programming and Statistics”, Davies, No Starch Press, 2016, Kindle version
  3. R Help (using “?” prior to desired package, function, etc.)
  4. Stack Overflow, https://stackoverflow.com
  5. R Bloggers, https://www.r-bloggers.com

Libraries

Forecasting Process

The forecasting process includes the following steps:
(1) Define Goal
(2) Get Data
(3) Explore & Visualize Series (iterate between 2 & 3)
(4) Pre-Process Data
(5) Partition Series
(6) Apply Forecasting Algorithm(s) based on Method(s)
(7) Evaluate & Compare Performance (iterate between 6 & 7)
(8) Implement Forecasts/System

1. Define Goal

Determing and clearly defining the forecasting goal based on intended use is key to ensuring useful results. There are essentially two types of goals:
(1) Descriptive modeling - time series analysis - determining seasonal patterns, trends, and relationships to external factors. This is done retrospectively and allows the method to use all of the known, historical data to “predict” what will happen later within history (i.e. within the existing time series data). (2) Predictive modeling - time series forecasting - uses information in series to forecast future values of that series. Forecasting models cannot use future information and is judged by its predictive accuracy rather than by its ability to provde correct casual explanations.

Our goal is to provide the most appropriate Predictive Model that most accurately will estimate a future value using time series data input. As part of this goal, we desire to provide the user an interactive dashboard approach to improve and “buy into” the provided estimate.

2. Get Data

Time Series Components

Time series data represents evidence of systematic and non-systematic behaviors. Systematic evidence includes: Trend, Level, and Seasonality. Non-systematic evidence is called noise. We identify Level, Trend, and Seasonality characteristics in the data to produce accurate future estimates (forecasts). Noise represents error that we seek to also estimate to understand the potential accuracy of the forecast. “Forecasting methods attempt to isolate the systematic part and quantify the noise level.”

Working with a (not so Large to a) Large Data Set

Libaries:

  • readxl
  • zoo
  • forecast

Useful Commands and Functions:

  • list.files() - list files within specified directory
  • read.excel() - reads in MS Excel formatted files
  • file.choose() - opens finder window and allows user to select file and returns name of file
  • getwd() - get working directory
  • read.csv() - reads in .csv formatted files
  library(zoo)
  library(forecast)
  library(knitr)
rgudata <- read.csv("/Users/normanreitter/RCoding/RGU Subscriber Counts - Monthly.csv", header=TRUE, sep=",", stringsAsFactors = default.stringsAsFactors())
      #reading in .csv formatted file
rgudata
##    Month Year RGU.Services.Count
## 1    Sep 2013             148431
## 2    Oct 2013             152357
## 3    Nov 2013             152237
## 4    Dec 2013             152698
## 5    Jan 2014             153186
## 6    Feb 2014             153528
## 7    Mar 2014             154968
## 8    Apr 2014             155759
## 9    May 2014             156365
## 10   Jun 2014             155745
## 11   Jul 2014             155700
## 12   Aug 2014             157507
## 13   Sep 2014             158450
## 14   Oct 2014             159293
## 15   Nov 2014             159001
## 16   Dec 2014             159595
## 17   Jan 2015             161556
## 18   Feb 2015             161988
## 19   Mar 2015             163402
## 20   Apr 2015             164015
## 21   May 2015             164031
## 22   Jun 2015             163709
## 23   Jul 2015             163034
## 24   Aug 2015             164560
## 25   Sep 2015             164562
## 26   Oct 2015             164875
## 27   Nov 2015             164212
## 28   Dec 2015             164285
## 29   Jan 2016             163857
## 30   Feb 2016             169188
## 31   Mar 2016             169859
## 32   Apr 2016             170101
## 33   May 2016             170545
## 34   Jun 2016             169884
## 35   Jul 2016             169401
## 36   Aug 2016             171208
## 37   Sep 2016             170533
## 38   Oct 2016             170063
## 39   Nov 2016             169633
## 40   Dec 2016             169497
## 41   Jan 2017             169467
## 42   Feb 2017             169217
## 43   Mar 2017             169969
## 44   Apr 2017             170049
## 45   May 2017             170755
## 46   Jun 2017             170201
rgu.ts <- ts(rgudata$RGU.Services.Count, start = c(rgudata$Year [1],1), freq=12) #creating a time series object
rgu.ts  #view ts object called "rgu"
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 2013 148431 152357 152237 152698 153186 153528 154968 155759 156365 155745
## 2014 158450 159293 159001 159595 161556 161988 163402 164015 164031 163709
## 2015 164562 164875 164212 164285 163857 169188 169859 170101 170545 169884
## 2016 170533 170063 169633 169497 169467 169217 169969 170049 170755 170201
##         Nov    Dec
## 2013 155700 157507
## 2014 163034 164560
## 2015 169401 171208
## 2016
ma.trailing<-rollmean(rgu.ts, k=12, align="right") #calculating trailing value estimates for each services.count with window 12 months
ma.centered<-ma(rgu.ts, order = 12)  # calculating centered value estimates with window of 12 months
plot(rgu.ts, ylim = c(140000, 180000), ylab = "RGU Service Counts", xlab = "Month-Year", bty="l", xaxt = "n", main = "") #basic plot of ts object
axis(1, at = seq(2013, 2016,1), labels=format(seq(2013, 2016, 1))) # setting x-axis values and labels
lines(ma.centered, lwd = 2, col="red")
lines(ma.trailing, lwd = 2, lty = 2, col = "blue")
legend(2013, 180000, c("RGU Service Counts", "Centered Moving Average (12 mo)", "Trailing Moving Average (12 mo)"), lty=c(1,1,2), lwd= c(1,1,2), bty = "n")

# Partitioning to have a Training Period and a Validation Period
nValid <- 12
nTrain <- length(rgu.ts) - nValid
train.ts<-window(rgu.ts, start = c(2013,1), end = c(2013,nTrain)) # training series
valid.ts<-window(rgu.ts, start = c(2013, nTrain+1), end = c(2013, nTrain+nValid)) # validation series
# Setting up the linear model (to determine trend line) and forecast
rguservice.lm <- tslm(train.ts ~ trend +I(trend^2)) # creating a linear model
rguservice.lm.pred <-forecast(rguservice.lm, h = nValid, level = 0) # creating the forecast
# Plotting it all together
plot(rguservice.lm.pred,ylim = c(140000, 180000), ylab = "RGU Service Counts", xlab = "Month-Year", bty="l", xaxt = "n", main = "")
axis(1, at = seq(2013, 2016,1), labels=format(seq(2013, 2016, 1))) # setting x-axis values and labels
lines(ma.centered, lwd = 2, col="red")
lines(ma.trailing, lwd = 2, lty = 2, col = "blue")
legend(2013, 180000, c("RGU Service Counts", "Centered Moving Average (12 mo)", "Trailing Moving Average (12 mo)"), lty=c(1,1,2), lwd= c(1,1,2), bty = "n")

# Checking out residuals - forecast error - for linear model using a histogram
hist(rguservice.lm.pred$residuals, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "")

# Checking out normality of residuals thorugh Normal Q-Q Plot
qqnorm(rguservice.lm.pred$residuals)

# rguservice.lm.pred$residuals %>% kable()

To work with a large data set, we’ll need to read in the data from an external file. We can use R’s command line functions to import these data sets, as a data frame object. We can also export data frames from R by writing a new file to the computer. In addition, we can save any plots that we create in R as an image file.

rgudata <- read.csv("/Users/normanreitter/RCoding/RGU Subscriber Counts - Monthly.csv", header=TRUE, sep=",", stringsAsFactors = default.stringsAsFactors())
      #reading in .csv formatted file
rgudata
##    Month Year RGU.Services.Count
## 1    Sep 2013             148431
## 2    Oct 2013             152357
## 3    Nov 2013             152237
## 4    Dec 2013             152698
## 5    Jan 2014             153186
## 6    Feb 2014             153528
## 7    Mar 2014             154968
## 8    Apr 2014             155759
## 9    May 2014             156365
## 10   Jun 2014             155745
## 11   Jul 2014             155700
## 12   Aug 2014             157507
## 13   Sep 2014             158450
## 14   Oct 2014             159293
## 15   Nov 2014             159001
## 16   Dec 2014             159595
## 17   Jan 2015             161556
## 18   Feb 2015             161988
## 19   Mar 2015             163402
## 20   Apr 2015             164015
## 21   May 2015             164031
## 22   Jun 2015             163709
## 23   Jul 2015             163034
## 24   Aug 2015             164560
## 25   Sep 2015             164562
## 26   Oct 2015             164875
## 27   Nov 2015             164212
## 28   Dec 2015             164285
## 29   Jan 2016             163857
## 30   Feb 2016             169188
## 31   Mar 2016             169859
## 32   Apr 2016             170101
## 33   May 2016             170545
## 34   Jun 2016             169884
## 35   Jul 2016             169401
## 36   Aug 2016             171208
## 37   Sep 2016             170533
## 38   Oct 2016             170063
## 39   Nov 2016             169633
## 40   Dec 2016             169497
## 41   Jan 2017             169467
## 42   Feb 2017             169217
## 43   Mar 2017             169969
## 44   Apr 2017             170049
## 45   May 2017             170755
## 46   Jun 2017             170201

3. Explore & Visualize Series

“The most basic and informative plot for visualizing a time series is the time plot.”

  library(zoo)
  library(forecast)
rgudata <- read.csv("/Users/normanreitter/RCoding/RGU Subscriber Counts - Monthly.csv", header=TRUE, sep=",", stringsAsFactors = default.stringsAsFactors())
      #reading in .csv formatted file
# Creating a time series object
rgu.ts <- ts(rgudata$RGU.Services.Count, start = c(rgudata$Year [1],1), freq=12) #creating a time series object
rgu.ts  #view ts object called "rgu"
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 2013 148431 152357 152237 152698 153186 153528 154968 155759 156365 155745
## 2014 158450 159293 159001 159595 161556 161988 163402 164015 164031 163709
## 2015 164562 164875 164212 164285 163857 169188 169859 170101 170545 169884
## 2016 170533 170063 169633 169497 169467 169217 169969 170049 170755 170201
##         Nov    Dec
## 2013 155700 157507
## 2014 163034 164560
## 2015 169401 171208
## 2016
# Creating rolling mean and moving average values based on data in time series object
ma.trailing<-rollmean(rgu.ts, k=12, align="right") #calculating trailing value estimates for each services.count with window 12 months
ma.centered<-ma(rgu.ts, order = 12)  # calculating centered value estimates with window of 12 months

#Plotting trailing and centered estimates with the original RGU Service Counts data
plot(rgu.ts, ylim = c(140000, 180000), ylab = "RGU Service Counts", xlab = "Month-Year", bty="l", xaxt = "n", main = "") #basic plot of ts object
axis(1, at = seq(2013, 2016,1), labels=format(seq(2013, 2016, 1))) # setting x-axis values and labels
lines(ma.centered, lwd = 2, col="red")
lines(ma.trailing, lwd = 2, lty = 2, col = "blue")
legend(2013, 180000, c("RGU Service Counts", "Centered Moving Average (12 mo)", "Trailing Moving Average (12 mo)"), lty=c(1,1,2), lwd= c(1,1,2), bty = "n")

#### 4. Pre-Process Data …
#### 5. Partition Series

Set up a Training period and a Validation period based on partitioning time series data. View data with ‘trailing’ and ‘centered’ estimates. Plot a linear model that predicts future outcomes.

  library(zoo)
  library(forecast)
rgudata <- read.csv("/Users/normanreitter/RCoding/RGU Subscriber Counts - Monthly.csv", header=TRUE, sep=",", stringsAsFactors = default.stringsAsFactors())
      #reading in .csv formatted file
# Creating a time series object
rgu.ts <- ts(rgudata$RGU.Services.Count, start = c(rgudata$Year [1],1), freq=12) #creating a time series object
rgu.ts  #view ts object called "rgu"
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 2013 148431 152357 152237 152698 153186 153528 154968 155759 156365 155745
## 2014 158450 159293 159001 159595 161556 161988 163402 164015 164031 163709
## 2015 164562 164875 164212 164285 163857 169188 169859 170101 170545 169884
## 2016 170533 170063 169633 169497 169467 169217 169969 170049 170755 170201
##         Nov    Dec
## 2013 155700 157507
## 2014 163034 164560
## 2015 169401 171208
## 2016
# Creating rolling mean and moving average values based on data in time series object
ma.trailing<-rollmean(rgu.ts, k=12, align="right") #calculating trailing value estimates for each services.count with window 12 months
ma.centered<-ma(rgu.ts, order = 12)  # calculating centered value estimates with window of 12 months

# Partitioning to have a Training Period and a Validation Period
nValid <- 12
nTrain <- length(rgu.ts) - nValid
train.ts<-window(rgu.ts, start = c(2013,1), end = c(2013,nTrain)) # training series
valid.ts<-window(rgu.ts, start = c(2013, nTrain+1), end = c(2013, nTrain+nValid)) # validation series

# Setting up the linear model (to determine trend line) and forecast
rguservice.lm <- tslm(train.ts ~ trend +I(trend^2)) # creating a linear model
rguservice.lm.pred <-forecast(rguservice.lm, h = nValid, level = 0) # creating the forecast
# Plotting it all together
plot(rguservice.lm.pred,ylim = c(140000, 180000), ylab = "RGU Service Counts", xlab = "Month-Year", bty="l", xaxt = "n", main = "")
axis(1, at = seq(2013, 2016,1), labels=format(seq(2013, 2016, 1))) # setting x-axis values and labels
lines(ma.centered, lwd = 2, col="red")
lines(ma.trailing, lwd = 2, lty = 2, col = "blue")
legend(2013, 180000, c("RGU Service Counts", "Centered Moving Average (12 mo)", "Trailing Moving Average (12 mo)"), lty=c(1,1,2), lwd= c(1,1,2), bty = "n")

#### 6. Apply Forecasting Algorithm(s) based on Method(s)

6.1 Trailing Average Forecasting Algorithm

The Trailing Average allows analysts to forecast future occurances by calculating past values withing a window of width w representing a vector of length w of the most recent available values in the series. The k-step-ahead forecast \(Ft+k=(yt+yt-1+yt-w+1)/w\).

Notes: 1. Moving averages are only useful for forecasting time series that lack seasonality and trend. 2. Choosing window width w is important as it is related to the level of “smoothing” that we want to apply to the model. Wider (longer) windows will provide a more global estimate while narrower (shorter) windows provide more emphasis on near-term observations.

6.x Exponential Smoothing

Exponential Smoothing Models where error (Z) is set to additive (A) or multiplicative (M).

No Seasonality
ZNN - no Trend
ZAN - Additive Trend
ZMN - multiplicative Trend
ZAdN - Additive damped
ZMdN - Multiplicative damped

Additive Seasonality
ZNA - No Trend
ZAA - Additive Trend
ZMA - multiplicative Trend
ZAdA - Additive damped
ZMdA - Multiplicative damped

Multiplicative Seasonality
ZNM - No Trend
ZAM - Additive Trend
ZMM - multiplicative Trend
ZAdM - Additive damped
ZMdM - Multiplicative damped

6.x Double Exponential Smoothing
      # plotting three exponential smoothing models, "ANN", "MMN", "MMdN", and prediction (likelihood) cones
rgu.ets.ANN <- ets(rgu.ts, model = "ANN")  # Fit Model 1 to the time series
rgu.ets.MMN <- ets(rgu.ts, model = "MMN", damped = FALSE)  #Fit 2nd model
rgu.ets.MMdN <- ets(rgu.ts, model = "MMN", damped = TRUE)  #Fit 3rd model

# creating forecasts
rgu.ets.ANN.pred<-forecast(rgu.ets.ANN, h = 12, level = c(0.2, 0.4, 0.6,0.8)) # forecast with 20%, 40%, 60%, 80% prediction intervals, h = 12 months
rgu.ets.MMN.pred<-forecast(rgu.ets.MMN, h = 12, level = c(0.2, 0.4, 0.6, 0.8))
rgu.ets.MMdN.pred<-forecast(rgu.ets.MMdN, h = 12, level = c(0.2, 0.4, 0.6,0.8))

# plotting
par(mfrow = c(1,3))  # command sets plot window to show 1 row of 3 plots
plot(rgu.ets.ANN.pred, xlab = "Month/Year", ylab = "Service Counts", ylim = c(min(rgu.ts - 1000), max(rgu.ts) + 20000))
plot(rgu.ets.MMN.pred, xlab = "Month/Year", ylab = "Service Counts", ylim = c(min(rgu.ts - 1000), max(rgu.ts) + 20000))
plot(rgu.ets.MMdN.pred, xlab = "Month/Year", ylab = "Service Counts", ylim = c(min(rgu.ts - 1000), max(rgu.ts) + 20000))

6.x Choosing the Best Algorithm and Method

Checking on how the residuals from the linear model look…

# Partitioning to have a Training Period and a Validation Period
nValid <- 12
nTrain <- length(rgu.ts) - nValid
train.ts<-window(rgu.ts, start = c(2013,1), end = c(2013,nTrain)) # training series
valid.ts<-window(rgu.ts, start = c(2013, nTrain+1), end = c(2013, nTrain+nValid)) # validation series

# Setting up the linear model (to determine trend line) and forecast
rguservice.lm <- tslm(train.ts ~ trend +I(trend^2)) # creating a linear model
rguservice.lm.pred <-forecast(rguservice.lm, h = nValid, level = 0) # creating the forecast

# Checking out residuals - forecast error - for linear model using a histogram
hist(rguservice.lm.pred$residuals, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "")

# Checking out normality of residuals thorugh Normal Q-Q Plot
qqnorm(rguservice.lm.pred$residuals)

# creating a new column to set as a time index (predictor)
# Checking accuracy using residuals
accuracy(rguservice.lm.pred$mean, rgu.ts)
##                 ME     RMSE      MAE       MPE    MAPE      ACF1 Theil's U
## Test set -2691.851 3149.573 2778.963 -1.584668 1.63555 0.7633371  4.598548

Automated Model Selection

  • Use model = “ZZZ”* R uses the Akaike’s Information Criterion (AIC) which combines fit to the training data with a penalty for the number of smoothing paramerts and initial valudes included in a model

R allows us to include an additive or multiplicative error. We can specify an additive trend, a multiplicative trend, or no trend at all. This gives us 18 models to choose from: 2 error types x 3 trend types x 3 seasonality types

Exponential smoothing state space model

Usually a three-character string identifying method using the framework terminology of Hyndman et al. (2002) and Hyndman et al. (2008). The first letter denotes the error type (“A”, “M” or “Z”); the second letter denotes the trend type (“N”,“A”,“M” or “Z”); and the third letter denotes the season type (“N”,“A”,“M” or “Z”). In all cases, “N”=none, “A”=additive, “M”=multiplicative and “Z”=automatically selected. So, for example, “ANN” is simple exponential smoothing with additive errors, “MAM” is multiplicative Holt-Winters’ method with multiplicative errors, and so on.

rgu.best.forecast <- ets(rgu.ts, restrict = FALSE, allow.multiplicative.trend = TRUE) #allowing the ets function to pick the "best" model and display output
rgu.best.forecast
## ETS(A,A,N) 
## 
## Call:
##  ets(y = rgu.ts, restrict = FALSE, allow.multiplicative.trend = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.8281 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 149602.5969 
##     b = 456.8733 
## 
##   sigma:  1149.622
## 
##      AIC     AICc      BIC 
## 834.4589 835.9589 843.6021
plot(forecast(rgu.best.forecast)) # using the forecast function to plot the 'best' exponential smoothing algorithm

### Use Cases #### User Interaction to Identify Trends and Seasonality (1) User selects data and views sytem generated line chart (2) User views line chart and selects from the following Is there visual evidence a trend? Is there visual evidence of seasonality? (3) System selects appropriate algorithm based on user observation and answers
(a) No Trend, No Seasonality -> Exponential Smoothing
(b) Trend, No Seasonality -> Difference to remove Seasonality, then Exponential Smoothing or Double Exponential Smoothing
(c) No Trend, Seasonality -> Difference to remove Trend, then Exponential Smoothing or Double Exponential Smoothing
(d) Trend, Seasonlity -> Twice Difference to remove Trend, then ** Exponential Smoothing** or Double Exponential Smoothing
(4) System produces best algorithm option (5) System renders plot to show different forecasting results (6) User chooses best result (7) System provides estimates for next k-periods and renders plot based on these estimates.