The forecasting process includes the following steps:
(1) Define Goal
(2) Get Data
(3) Explore & Visualize Series (iterate between 2 & 3)
(4) Pre-Process Data
(5) Partition Series
(6) Apply Forecasting Algorithm(s) based on Method(s)
(7) Evaluate & Compare Performance (iterate between 6 & 7)
(8) Implement Forecasts/System
Determing and clearly defining the forecasting goal based on intended use is key to ensuring useful results. There are essentially two types of goals:
(1) Descriptive modeling - time series analysis - determining seasonal patterns, trends, and relationships to external factors. This is done retrospectively and allows the method to use all of the known, historical data to “predict” what will happen later within history (i.e. within the existing time series data). (2) Predictive modeling - time series forecasting - uses information in series to forecast future values of that series. Forecasting models cannot use future information and is judged by its predictive accuracy rather than by its ability to provde correct casual explanations.
Our goal is to provide the most appropriate Predictive Model that most accurately will estimate a future value using time series data input. As part of this goal, we desire to provide the user an interactive dashboard approach to improve and “buy into” the provided estimate.
Time series data represents evidence of systematic and non-systematic behaviors. Systematic evidence includes: Trend, Level, and Seasonality. Non-systematic evidence is called noise. We identify Level, Trend, and Seasonality characteristics in the data to produce accurate future estimates (forecasts). Noise represents error that we seek to also estimate to understand the potential accuracy of the forecast. “Forecasting methods attempt to isolate the systematic part and quantify the noise level.”
Libaries:
Useful Commands and Functions:
library(zoo)
library(forecast)
library(knitr)
rgudata <- read.csv("/Users/normanreitter/RCoding/RGU Subscriber Counts - Monthly.csv", header=TRUE, sep=",", stringsAsFactors = default.stringsAsFactors())
#reading in .csv formatted file
rgudata
## Month Year RGU.Services.Count
## 1 Sep 2013 148431
## 2 Oct 2013 152357
## 3 Nov 2013 152237
## 4 Dec 2013 152698
## 5 Jan 2014 153186
## 6 Feb 2014 153528
## 7 Mar 2014 154968
## 8 Apr 2014 155759
## 9 May 2014 156365
## 10 Jun 2014 155745
## 11 Jul 2014 155700
## 12 Aug 2014 157507
## 13 Sep 2014 158450
## 14 Oct 2014 159293
## 15 Nov 2014 159001
## 16 Dec 2014 159595
## 17 Jan 2015 161556
## 18 Feb 2015 161988
## 19 Mar 2015 163402
## 20 Apr 2015 164015
## 21 May 2015 164031
## 22 Jun 2015 163709
## 23 Jul 2015 163034
## 24 Aug 2015 164560
## 25 Sep 2015 164562
## 26 Oct 2015 164875
## 27 Nov 2015 164212
## 28 Dec 2015 164285
## 29 Jan 2016 163857
## 30 Feb 2016 169188
## 31 Mar 2016 169859
## 32 Apr 2016 170101
## 33 May 2016 170545
## 34 Jun 2016 169884
## 35 Jul 2016 169401
## 36 Aug 2016 171208
## 37 Sep 2016 170533
## 38 Oct 2016 170063
## 39 Nov 2016 169633
## 40 Dec 2016 169497
## 41 Jan 2017 169467
## 42 Feb 2017 169217
## 43 Mar 2017 169969
## 44 Apr 2017 170049
## 45 May 2017 170755
## 46 Jun 2017 170201
rgu.ts <- ts(rgudata$RGU.Services.Count, start = c(rgudata$Year [1],1), freq=12) #creating a time series object
rgu.ts #view ts object called "rgu"
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2013 148431 152357 152237 152698 153186 153528 154968 155759 156365 155745
## 2014 158450 159293 159001 159595 161556 161988 163402 164015 164031 163709
## 2015 164562 164875 164212 164285 163857 169188 169859 170101 170545 169884
## 2016 170533 170063 169633 169497 169467 169217 169969 170049 170755 170201
## Nov Dec
## 2013 155700 157507
## 2014 163034 164560
## 2015 169401 171208
## 2016
ma.trailing<-rollmean(rgu.ts, k=12, align="right") #calculating trailing value estimates for each services.count with window 12 months
ma.centered<-ma(rgu.ts, order = 12) # calculating centered value estimates with window of 12 months
plot(rgu.ts, ylim = c(140000, 180000), ylab = "RGU Service Counts", xlab = "Month-Year", bty="l", xaxt = "n", main = "") #basic plot of ts object
axis(1, at = seq(2013, 2016,1), labels=format(seq(2013, 2016, 1))) # setting x-axis values and labels
lines(ma.centered, lwd = 2, col="red")
lines(ma.trailing, lwd = 2, lty = 2, col = "blue")
legend(2013, 180000, c("RGU Service Counts", "Centered Moving Average (12 mo)", "Trailing Moving Average (12 mo)"), lty=c(1,1,2), lwd= c(1,1,2), bty = "n")
# Partitioning to have a Training Period and a Validation Period
nValid <- 12
nTrain <- length(rgu.ts) - nValid
train.ts<-window(rgu.ts, start = c(2013,1), end = c(2013,nTrain)) # training series
valid.ts<-window(rgu.ts, start = c(2013, nTrain+1), end = c(2013, nTrain+nValid)) # validation series
# Setting up the linear model (to determine trend line) and forecast
rguservice.lm <- tslm(train.ts ~ trend +I(trend^2)) # creating a linear model
rguservice.lm.pred <-forecast(rguservice.lm, h = nValid, level = 0) # creating the forecast
# Plotting it all together
plot(rguservice.lm.pred,ylim = c(140000, 180000), ylab = "RGU Service Counts", xlab = "Month-Year", bty="l", xaxt = "n", main = "")
axis(1, at = seq(2013, 2016,1), labels=format(seq(2013, 2016, 1))) # setting x-axis values and labels
lines(ma.centered, lwd = 2, col="red")
lines(ma.trailing, lwd = 2, lty = 2, col = "blue")
legend(2013, 180000, c("RGU Service Counts", "Centered Moving Average (12 mo)", "Trailing Moving Average (12 mo)"), lty=c(1,1,2), lwd= c(1,1,2), bty = "n")
# Checking out residuals - forecast error - for linear model using a histogram
hist(rguservice.lm.pred$residuals, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "")
# Checking out normality of residuals thorugh Normal Q-Q Plot
qqnorm(rguservice.lm.pred$residuals)
# rguservice.lm.pred$residuals %>% kable()
To work with a large data set, we’ll need to read in the data from an external file. We can use R’s command line functions to import these data sets, as a data frame object. We can also export data frames from R by writing a new file to the computer. In addition, we can save any plots that we create in R as an image file.
rgudata <- read.csv("/Users/normanreitter/RCoding/RGU Subscriber Counts - Monthly.csv", header=TRUE, sep=",", stringsAsFactors = default.stringsAsFactors())
#reading in .csv formatted file
rgudata
## Month Year RGU.Services.Count
## 1 Sep 2013 148431
## 2 Oct 2013 152357
## 3 Nov 2013 152237
## 4 Dec 2013 152698
## 5 Jan 2014 153186
## 6 Feb 2014 153528
## 7 Mar 2014 154968
## 8 Apr 2014 155759
## 9 May 2014 156365
## 10 Jun 2014 155745
## 11 Jul 2014 155700
## 12 Aug 2014 157507
## 13 Sep 2014 158450
## 14 Oct 2014 159293
## 15 Nov 2014 159001
## 16 Dec 2014 159595
## 17 Jan 2015 161556
## 18 Feb 2015 161988
## 19 Mar 2015 163402
## 20 Apr 2015 164015
## 21 May 2015 164031
## 22 Jun 2015 163709
## 23 Jul 2015 163034
## 24 Aug 2015 164560
## 25 Sep 2015 164562
## 26 Oct 2015 164875
## 27 Nov 2015 164212
## 28 Dec 2015 164285
## 29 Jan 2016 163857
## 30 Feb 2016 169188
## 31 Mar 2016 169859
## 32 Apr 2016 170101
## 33 May 2016 170545
## 34 Jun 2016 169884
## 35 Jul 2016 169401
## 36 Aug 2016 171208
## 37 Sep 2016 170533
## 38 Oct 2016 170063
## 39 Nov 2016 169633
## 40 Dec 2016 169497
## 41 Jan 2017 169467
## 42 Feb 2017 169217
## 43 Mar 2017 169969
## 44 Apr 2017 170049
## 45 May 2017 170755
## 46 Jun 2017 170201
“The most basic and informative plot for visualizing a time series is the time plot.”
library(zoo)
library(forecast)
rgudata <- read.csv("/Users/normanreitter/RCoding/RGU Subscriber Counts - Monthly.csv", header=TRUE, sep=",", stringsAsFactors = default.stringsAsFactors())
#reading in .csv formatted file
# Creating a time series object
rgu.ts <- ts(rgudata$RGU.Services.Count, start = c(rgudata$Year [1],1), freq=12) #creating a time series object
rgu.ts #view ts object called "rgu"
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2013 148431 152357 152237 152698 153186 153528 154968 155759 156365 155745
## 2014 158450 159293 159001 159595 161556 161988 163402 164015 164031 163709
## 2015 164562 164875 164212 164285 163857 169188 169859 170101 170545 169884
## 2016 170533 170063 169633 169497 169467 169217 169969 170049 170755 170201
## Nov Dec
## 2013 155700 157507
## 2014 163034 164560
## 2015 169401 171208
## 2016
# Creating rolling mean and moving average values based on data in time series object
ma.trailing<-rollmean(rgu.ts, k=12, align="right") #calculating trailing value estimates for each services.count with window 12 months
ma.centered<-ma(rgu.ts, order = 12) # calculating centered value estimates with window of 12 months
#Plotting trailing and centered estimates with the original RGU Service Counts data
plot(rgu.ts, ylim = c(140000, 180000), ylab = "RGU Service Counts", xlab = "Month-Year", bty="l", xaxt = "n", main = "") #basic plot of ts object
axis(1, at = seq(2013, 2016,1), labels=format(seq(2013, 2016, 1))) # setting x-axis values and labels
lines(ma.centered, lwd = 2, col="red")
lines(ma.trailing, lwd = 2, lty = 2, col = "blue")
legend(2013, 180000, c("RGU Service Counts", "Centered Moving Average (12 mo)", "Trailing Moving Average (12 mo)"), lty=c(1,1,2), lwd= c(1,1,2), bty = "n")
#### 4. Pre-Process Data …
#### 5. Partition Series
Set up a Training period and a Validation period based on partitioning time series data. View data with ‘trailing’ and ‘centered’ estimates. Plot a linear model that predicts future outcomes.
library(zoo)
library(forecast)
rgudata <- read.csv("/Users/normanreitter/RCoding/RGU Subscriber Counts - Monthly.csv", header=TRUE, sep=",", stringsAsFactors = default.stringsAsFactors())
#reading in .csv formatted file
# Creating a time series object
rgu.ts <- ts(rgudata$RGU.Services.Count, start = c(rgudata$Year [1],1), freq=12) #creating a time series object
rgu.ts #view ts object called "rgu"
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2013 148431 152357 152237 152698 153186 153528 154968 155759 156365 155745
## 2014 158450 159293 159001 159595 161556 161988 163402 164015 164031 163709
## 2015 164562 164875 164212 164285 163857 169188 169859 170101 170545 169884
## 2016 170533 170063 169633 169497 169467 169217 169969 170049 170755 170201
## Nov Dec
## 2013 155700 157507
## 2014 163034 164560
## 2015 169401 171208
## 2016
# Creating rolling mean and moving average values based on data in time series object
ma.trailing<-rollmean(rgu.ts, k=12, align="right") #calculating trailing value estimates for each services.count with window 12 months
ma.centered<-ma(rgu.ts, order = 12) # calculating centered value estimates with window of 12 months
# Partitioning to have a Training Period and a Validation Period
nValid <- 12
nTrain <- length(rgu.ts) - nValid
train.ts<-window(rgu.ts, start = c(2013,1), end = c(2013,nTrain)) # training series
valid.ts<-window(rgu.ts, start = c(2013, nTrain+1), end = c(2013, nTrain+nValid)) # validation series
# Setting up the linear model (to determine trend line) and forecast
rguservice.lm <- tslm(train.ts ~ trend +I(trend^2)) # creating a linear model
rguservice.lm.pred <-forecast(rguservice.lm, h = nValid, level = 0) # creating the forecast
# Plotting it all together
plot(rguservice.lm.pred,ylim = c(140000, 180000), ylab = "RGU Service Counts", xlab = "Month-Year", bty="l", xaxt = "n", main = "")
axis(1, at = seq(2013, 2016,1), labels=format(seq(2013, 2016, 1))) # setting x-axis values and labels
lines(ma.centered, lwd = 2, col="red")
lines(ma.trailing, lwd = 2, lty = 2, col = "blue")
legend(2013, 180000, c("RGU Service Counts", "Centered Moving Average (12 mo)", "Trailing Moving Average (12 mo)"), lty=c(1,1,2), lwd= c(1,1,2), bty = "n")
#### 6. Apply Forecasting Algorithm(s) based on Method(s)
The Trailing Average allows analysts to forecast future occurances by calculating past values withing a window of width w representing a vector of length w of the most recent available values in the series. The k-step-ahead forecast \(Ft+k=(yt+yt-1+yt-w+1)/w\).
Notes: 1. Moving averages are only useful for forecasting time series that lack seasonality and trend. 2. Choosing window width w is important as it is related to the level of “smoothing” that we want to apply to the model. Wider (longer) windows will provide a more global estimate while narrower (shorter) windows provide more emphasis on near-term observations.
Exponential Smoothing Models where error (Z) is set to additive (A) or multiplicative (M).
No Seasonality
ZNN - no Trend
ZAN - Additive Trend
ZMN - multiplicative Trend
ZAdN - Additive damped
ZMdN - Multiplicative damped
Additive Seasonality
ZNA - No Trend
ZAA - Additive Trend
ZMA - multiplicative Trend
ZAdA - Additive damped
ZMdA - Multiplicative damped
Multiplicative Seasonality
ZNM - No Trend
ZAM - Additive Trend
ZMM - multiplicative Trend
ZAdM - Additive damped
ZMdM - Multiplicative damped
# plotting three exponential smoothing models, "ANN", "MMN", "MMdN", and prediction (likelihood) cones
rgu.ets.ANN <- ets(rgu.ts, model = "ANN") # Fit Model 1 to the time series
rgu.ets.MMN <- ets(rgu.ts, model = "MMN", damped = FALSE) #Fit 2nd model
rgu.ets.MMdN <- ets(rgu.ts, model = "MMN", damped = TRUE) #Fit 3rd model
# creating forecasts
rgu.ets.ANN.pred<-forecast(rgu.ets.ANN, h = 12, level = c(0.2, 0.4, 0.6,0.8)) # forecast with 20%, 40%, 60%, 80% prediction intervals, h = 12 months
rgu.ets.MMN.pred<-forecast(rgu.ets.MMN, h = 12, level = c(0.2, 0.4, 0.6, 0.8))
rgu.ets.MMdN.pred<-forecast(rgu.ets.MMdN, h = 12, level = c(0.2, 0.4, 0.6,0.8))
# plotting
par(mfrow = c(1,3)) # command sets plot window to show 1 row of 3 plots
plot(rgu.ets.ANN.pred, xlab = "Month/Year", ylab = "Service Counts", ylim = c(min(rgu.ts - 1000), max(rgu.ts) + 20000))
plot(rgu.ets.MMN.pred, xlab = "Month/Year", ylab = "Service Counts", ylim = c(min(rgu.ts - 1000), max(rgu.ts) + 20000))
plot(rgu.ets.MMdN.pred, xlab = "Month/Year", ylab = "Service Counts", ylim = c(min(rgu.ts - 1000), max(rgu.ts) + 20000))
Checking on how the residuals from the linear model look…
# Partitioning to have a Training Period and a Validation Period
nValid <- 12
nTrain <- length(rgu.ts) - nValid
train.ts<-window(rgu.ts, start = c(2013,1), end = c(2013,nTrain)) # training series
valid.ts<-window(rgu.ts, start = c(2013, nTrain+1), end = c(2013, nTrain+nValid)) # validation series
# Setting up the linear model (to determine trend line) and forecast
rguservice.lm <- tslm(train.ts ~ trend +I(trend^2)) # creating a linear model
rguservice.lm.pred <-forecast(rguservice.lm, h = nValid, level = 0) # creating the forecast
# Checking out residuals - forecast error - for linear model using a histogram
hist(rguservice.lm.pred$residuals, ylab = "Frequency", xlab = "Forecast Error", bty = "l", main = "")
# Checking out normality of residuals thorugh Normal Q-Q Plot
qqnorm(rguservice.lm.pred$residuals)
# creating a new column to set as a time index (predictor)
# Checking accuracy using residuals
accuracy(rguservice.lm.pred$mean, rgu.ts)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -2691.851 3149.573 2778.963 -1.584668 1.63555 0.7633371 4.598548
R allows us to include an additive or multiplicative error. We can specify an additive trend, a multiplicative trend, or no trend at all. This gives us 18 models to choose from: 2 error types x 3 trend types x 3 seasonality types
Usually a three-character string identifying method using the framework terminology of Hyndman et al. (2002) and Hyndman et al. (2008). The first letter denotes the error type (“A”, “M” or “Z”); the second letter denotes the trend type (“N”,“A”,“M” or “Z”); and the third letter denotes the season type (“N”,“A”,“M” or “Z”). In all cases, “N”=none, “A”=additive, “M”=multiplicative and “Z”=automatically selected. So, for example, “ANN” is simple exponential smoothing with additive errors, “MAM” is multiplicative Holt-Winters’ method with multiplicative errors, and so on.
rgu.best.forecast <- ets(rgu.ts, restrict = FALSE, allow.multiplicative.trend = TRUE) #allowing the ets function to pick the "best" model and display output
rgu.best.forecast
## ETS(A,A,N)
##
## Call:
## ets(y = rgu.ts, restrict = FALSE, allow.multiplicative.trend = TRUE)
##
## Smoothing parameters:
## alpha = 0.8281
## beta = 1e-04
##
## Initial states:
## l = 149602.5969
## b = 456.8733
##
## sigma: 1149.622
##
## AIC AICc BIC
## 834.4589 835.9589 843.6021
plot(forecast(rgu.best.forecast)) # using the forecast function to plot the 'best' exponential smoothing algorithm
### Use Cases #### User Interaction to Identify Trends and Seasonality (1) User selects data and views sytem generated line chart (2) User views line chart and selects from the following Is there visual evidence a trend? Is there visual evidence of seasonality? (3) System selects appropriate algorithm based on user observation and answers
(a) No Trend, No Seasonality -> Exponential Smoothing
(b) Trend, No Seasonality -> Difference to remove Seasonality, then Exponential Smoothing or Double Exponential Smoothing
(c) No Trend, Seasonality -> Difference to remove Trend, then Exponential Smoothing or Double Exponential Smoothing
(d) Trend, Seasonlity -> Twice Difference to remove Trend, then ** Exponential Smoothing** or Double Exponential Smoothing
(4) System produces best algorithm option (5) System renders plot to show different forecasting results (6) User chooses best result (7) System provides estimates for next k-periods and renders plot based on these estimates.