The objective of this exercise is to determine whether regression or other time series models are suitable for forecasting the FP&A payroll data sets.
Gross Payroll is used as a test case, and the following models (in ascending order of complexity) were used to showcase their efficacy:
- Excel's Linear Regression Model
- Excel's Exponential Smoothing Model (ETS)
- R's Machine Learning Exponential Smoothing Models (ETS)
- R's Machine Learning ARIMA Model
library(ggthemes)
library(tidyverse)
library(magrittr)
library(TTR)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(fpp2)
library(forecast)
library(caTools)
library(reshape2)
library(psych)
url1 <- "https://raw.githubusercontent.com/ssufian/Data_607/master/Payroll%20Gross%20to%20Net%20Report%20(A)%20NEE%20Fin%20Historical%2003%2031%2020-trim.csv"
# Reading CSV file from GitHub
df1 <- read.csv(file=url1, sep=",",na.strings = c("NA"," ",""),strip.white = T, stringsAsFactors = F, header=T)
# Make sure the 1st column is a factor
df1$Cost.Center<- factor(df1$Cost.Center)
#Selecting only the Result row (total row) for Gross Pay
df1 <- slice(df1,10)
#head(df1,1)
df2 <- df1 %>% select(Cost.Center,X1.1.2017:X2.1.2020) # removing other columns
#head(df2,1)
# Replacing unwanted header prefixes (lambda syntax replaces the deprecated funs())
df2 <- df2 %>% rename_at(vars(starts_with("X")), ~ str_replace(., "X", ""))
# renaming columns
df2 <-df2 %>% rename(Cost_Center=Cost.Center)
#head(df2,1)
# reshape from wide to long format
df3 <- melt(df2, id.vars = c("Cost_Center"),variable.name = "Date",value.name="Gross_Payroll")
#Removing "." from date labels using Regex
df3$Date <- gsub("[.]", "/", df3$Date)
head(df3,5)
## Cost_Center Date Gross_Payroll
## 1 FCOE-FP&A 1/1/2017 83697
## 2 FCOE-FP&A 2/1/2017 98139
## 3 FCOE-FP&A 3/1/2017 111554
## 4 FCOE-FP&A 4/1/2017 156649
## 5 FCOE-FP&A 5/1/2017 99361
# convert character to Date
New_Date <- as.Date(df3$Date, "%m/%d/%Y")
df4 <- df3 %>% mutate(date=New_Date) %>% select(c("Cost_Center","date","Gross_Payroll"))
#head(df4,1)
summary(df4$Gross_Payroll)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 83697 226955 297088 289387 381000 462091
glimpse(describe(df4$Gross_Payroll))
## Observations: 1
## Variables: 13
## $ vars <dbl> 1
## $ n <dbl> 38
## $ mean <dbl> 289387.2
## $ sd <dbl> 103551.2
## $ median <dbl> 297087.5
## $ trimmed <dbl> 292590.1
## $ mad <dbl> 108449.2
## $ min <dbl> 83697
## $ max <dbl> 462091
## $ range <dbl> 378394
## $ skew <dbl> -0.2498309
## $ kurtosis <dbl> -0.8427995
## $ se <dbl> 16798.22
# Histogram
hist(df4$Gross_Payroll, xlab = "Payroll", col = "blue", main = "Fig 1: Histogram of FP&A Gross Payroll", probability = TRUE)
s <- sd(df4$Gross_Payroll)
m <- mean(df4$Gross_Payroll)
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 3)
Observation 1:
The Gross Payroll distribution is roughly bell-shaped, with a slight left skew (skew = -0.25) and mild negative kurtosis, as the overlaid normal curve in Fig 1 suggests.
# Interactive time series line plot
fig1 <- plot_ly(df4, x = ~date, y = ~Gross_Payroll, name = 'Gross Payroll', type = 'scatter', mode = 'lines',
                line = list(color = 'rgb(22, 96, 167)', width = 4))
fig1 <- fig1 %>% layout(title = "Fig 2: Interactive Historical Gross Payroll, FP&A",
                        xaxis = list(title = "Year"),
                        yaxis = list(title = "Payroll ($)")) %>% layout(plot_bgcolor = 'rgb(250, 255, 234)')
fig1
Observation 2:
The data exhibit a positive upward trend with a slight uptick in variability starting July 2018.
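To separate that trend from any seasonal component, a quick seasonal-trend decomposition can be run on the full series; a minimal sketch, assuming the df4 dataframe from above (this decomposition step is illustrative and was not part of the original workflow):
# Convert the full 38-month series to a ts object and decompose it
full_ts <- ts(df4$Gross_Payroll, start = c(2017, 1), frequency = 12)
fit_stl <- stl(full_ts, s.window = "periodic")   # trend + seasonal + remainder
plot(fit_stl, main = "STL Decomposition of Gross Payroll")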
Observation 3:
- Excel Simple Linear Regression as shown in Fig 3a
- ANOVA analysis in Table 1
- Statistical Diagnostics in Fig 3b
The simple linear regression has a coefficient of determination \(R^{2}\) of 91%, meaning the model explains 91% of the variance in the data. The slope is also highly significant (high t-statistic, low p-value). Overall, simple linear regression is a viable model for this data set.
All the statistical diagnostics show that the linear regression assumptions are met: homogeneity of variances, normality of residuals, and a linear relationship.
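The same regression can be reproduced in R as a cross-check; a minimal sketch, assuming a numeric month index as the regressor (this is illustrative and not part of the Excel workflow):
# Hypothetical cross-check: reproduce the Excel regression with a month index
t_idx <- seq_len(nrow(df4))
lin_fit <- lm(df4$Gross_Payroll ~ t_idx)
summary(lin_fit)  # compare R-squared and slope t-statistic with Fig 3a / Table 1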
Observation 4:
- Excel Built-in Exponential Smoothing Functions as shown in Fig 4
The Excel ETS module shows a very low SMAPE, which indicates a well-specified model.
Both Excel-based models explain the data set well and are suitable for forecasting. However, as can be seen in the forecasted values, future values are coerced into a linear trajectory, so the models cannot capture any seasonality, if it exists.
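For reference, SMAPE (the error measure Excel's ETS module reports) can be computed directly; a minimal sketch using one common definition of SMAPE (this helper is illustrative, not part of Excel's output):
# Symmetric MAPE: mean of 2|actual - pred| / (|actual| + |pred|), as a percentage
smape <- function(actual, pred) {
  mean(2 * abs(actual - pred) / (abs(actual) + abs(pred))) * 100
}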
Step 1:
# Split into training (first 26 months) and test (last 12 months) sets, roughly 70%:30%
train <- head(df4, 26)
test <- tail(df4, 12)
nrow(train); nrow(test)
## [1] 26
## [1] 12
train
## Cost_Center date Gross_Payroll
## 1 FCOE-FP&A 2017-01-01 83697
## 2 FCOE-FP&A 2017-02-01 98139
## 3 FCOE-FP&A 2017-03-01 111554
## 4 FCOE-FP&A 2017-04-01 156649
## 5 FCOE-FP&A 2017-05-01 99361
## 6 FCOE-FP&A 2017-06-01 174622
## 7 FCOE-FP&A 2017-07-01 196364
## 8 FCOE-FP&A 2017-08-01 227229
## 9 FCOE-FP&A 2017-09-01 233604
## 10 FCOE-FP&A 2017-10-01 228581
## 11 FCOE-FP&A 2017-11-01 225442
## 12 FCOE-FP&A 2017-12-01 222437
## 13 FCOE-FP&A 2018-01-01 239901
## 14 FCOE-FP&A 2018-02-01 226863
## 15 FCOE-FP&A 2018-03-01 252863
## 16 FCOE-FP&A 2018-04-01 274896
## 17 FCOE-FP&A 2018-05-01 295541
## 18 FCOE-FP&A 2018-06-01 277493
## 19 FCOE-FP&A 2018-07-01 298634
## 20 FCOE-FP&A 2018-08-01 315266
## 21 FCOE-FP&A 2018-09-01 277503
## 22 FCOE-FP&A 2018-10-01 322791
## 23 FCOE-FP&A 2018-11-01 308144
## 24 FCOE-FP&A 2018-12-01 321804
## 25 FCOE-FP&A 2019-01-01 378308
## 26 FCOE-FP&A 2019-02-01 312909
dat_ts <- ts(train[, 3], start = c(2017, 1), end = c(2019, 2), frequency = 12)
Next, apply R's built-in forecasting functions to the training data, using the various exponential smoothing techniques.
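For reference, simple exponential smoothing produces each forecast as a weighted average of the latest observation and the previous forecast:
\(\hat{y}_{t+1} = \alpha y_{t} + (1-\alpha)\hat{y}_{t}\), where \(0 < \alpha < 1\) is the smoothing parameter (reported as alpha in the model summaries below). Holt's method adds a trend equation, and Holt-Winters adds a seasonal one.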
#Fit simple exponential smoothing model to data and show summary
fit_ses <- ses(dat_ts, h = 12)
summary(fit_ses)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = dat_ts, h = 12)
##
## Smoothing parameters:
## alpha = 0.7135
##
## Initial states:
## l = 89757.1481
##
## sigma: 31891.98
##
## AIC AICc BIC
## 627.8751 628.9660 631.6494
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 12772.61 30640.82 24929.68 5.177524 12.01798 0.2211096 -0.3524492
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Mar 2019 326714.2 285843.0 367585.4 264207.0 389221.3
## Apr 2019 326714.2 276505.2 376923.2 249926.1 403502.2
## May 2019 326714.2 268650.1 384778.2 237912.9 415515.5
## Jun 2019 326714.2 261737.8 391690.5 227341.5 426086.9
## Jul 2019 326714.2 255493.3 397935.0 217791.3 435637.1
## Aug 2019 326714.2 249753.8 403674.5 209013.4 444414.9
## Sep 2019 326714.2 244413.6 409014.8 200846.3 452582.1
## Oct 2019 326714.2 239399.4 414029.0 193177.7 460250.7
## Nov 2019 326714.2 234657.9 418770.5 185926.2 467502.2
## Dec 2019 326714.2 230148.9 423279.5 179030.3 474398.0
## Jan 2020 326714.2 225841.3 427587.1 172442.4 480986.0
## Feb 2020 326714.2 221710.2 431718.1 166124.5 487303.9
#Plot the forecasted values
plot(fit_ses)
#Fit Holt exponential smoothing model to data and show summary
fit_holt <- holt(dat_ts, h = 12)
summary(fit_holt)
##
## Forecast method: Holt's method
##
## Model Information:
## Holt's method
##
## Call:
## holt(y = dat_ts, h = 12)
##
## Smoothing parameters:
## alpha = 0.2949
## beta = 0.0728
##
## Initial states:
## l = 66161.3251
## b = 18333.2628
##
## sigma: 28892.54
##
## AIC AICc BIC
## 624.4767 627.4767 630.7672
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -6217.673 26577.27 21263.31 -4.13987 10.24616 0.1885914
## ACF1
## Training set -0.04963953
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Mar 2019 347531.9 310504.6 384559.1 290903.5 404160.2
## Apr 2019 354090.3 314638.1 393542.4 293753.4 414427.1
## May 2019 360648.6 317956.1 403341.2 295356.0 425941.3
## Jun 2019 367207.0 320472.5 413941.6 295732.7 438681.4
## Jul 2019 373765.4 322234.8 425296.0 294956.2 452574.7
## Aug 2019 380323.8 323305.8 437341.9 293122.3 467525.4
## Sep 2019 386882.2 323750.4 450014.1 290330.4 483434.1
## Oct 2019 393440.6 323628.9 463252.3 286672.9 500208.4
## Nov 2019 399999.0 322994.3 477003.7 282230.5 517767.5
## Dec 2019 406557.4 321891.4 491223.4 277071.9 536042.9
## Jan 2020 413115.8 320357.7 505873.9 271254.6 554977.0
## Feb 2020 419674.2 318424.8 520923.6 264826.6 574521.8
#Plot the forecasted values
plot(fit_holt)
#Fit Holt-Winters exponential smoothing model to data and show summary
fit_hw <- hw(dat_ts, h = 12, seasonal = "multiplicative")
summary(fit_hw)
##
## Forecast method: Holt-Winters' multiplicative method
##
## Model Information:
## Holt-Winters' multiplicative method
##
## Call:
## hw(y = dat_ts, h = 12, seasonal = "multiplicative")
##
## Smoothing parameters:
## alpha = 0.0055
## beta = 1e-04
## gamma = 2e-04
##
## Initial states:
## l = 109308.052
## b = 9534.5763
## s = 0.9715 0.9599 1.0295 0.9831 1.0938 1.0664
## 1.0002 0.9653 1.0824 0.9391 0.8664 1.0423
##
## sigma: 0.1961
##
## AIC AICc BIC
## 649.9992 726.4992 671.3868
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 118.9601 20843.74 16384.68 -2.763658 9.567314 0.1453211 0.3055741
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Mar 2019 344389.8 257839.8 430939.9 212022.9 476756.7
## Apr 2019 407236.1 304890.3 509582.0 250711.7 563760.6
## May 2019 372382.4 278794.6 465970.3 229252.1 515512.8
## Jun 2019 395381.4 296011.8 494750.9 243408.8 547353.9
## Jul 2019 431743.1 323233.3 540252.8 265791.7 597694.5
## Aug 2019 453257.4 339338.7 567176.1 279033.8 627481.0
## Sep 2019 416789.7 312035.0 521544.3 256581.2 576998.1
## Oct 2019 446272.0 334105.7 558438.4 274728.4 617815.6
## Nov 2019 425252.1 318367.3 532136.9 261786.0 588718.2
## Dec 2019 439628.8 329128.9 550128.7 270633.8 608623.8
## Jan 2020 481586.9 360539.1 602634.7 296460.3 666713.5
## Feb 2020 408593.1 305890.9 511295.2 251523.7 565662.5
#Plot the forecasted values
plot(fit_hw)
#Fit automated exponential smoothing model to data and show summary
fit_auto <- forecast(dat_ts, h = 12)
summary(fit_auto)
##
## Forecast method: ETS(A,Ad,N)
##
## Model Information:
## ETS(A,Ad,N)
##
## Call:
## ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 1e-04
## beta = 1e-04
## phi = 0.9557
##
## Initial states:
## l = 66161.4197
## b = 18333.4909
##
## sigma: 24063.72
##
## AIC AICc BIC
## 615.7575 620.1786 623.3061
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2366.334 21626.47 16862.95 0.05546078 8.037052 0.1495631
## ACF1
## Training set -0.06244725
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Mar 2019 345407.7 314568.8 376246.6 298243.7 392571.7
## Apr 2019 350565.2 319726.3 381404.1 303401.1 397729.2
## May 2019 355494.2 324655.3 386333.1 308330.2 402658.2
## Jun 2019 360204.8 329366.0 391043.7 313040.8 407368.9
## Jul 2019 364706.8 333867.9 395545.7 317542.8 411870.9
## Aug 2019 369009.4 338170.5 399848.3 321845.4 416173.4
## Sep 2019 373121.4 342282.5 403960.3 325957.3 420285.4
## Oct 2019 377051.2 346212.3 407890.1 329887.1 424215.3
## Nov 2019 380806.9 349968.0 411645.9 333642.9 427971.0
## Dec 2019 384396.3 353557.4 415235.2 337232.2 431560.4
## Jan 2020 387826.7 356987.7 418665.6 340662.6 434990.8
## Feb 2020 391105.1 360266.1 421944.0 343941.0 438269.2
#Plot the forecasted values
plot(fit_auto)
arima_model <- auto.arima(dat_ts)
summary(arima_model)
## Series: dat_ts
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## -0.6450 10021.604
## s.e. 0.1822 2012.817
##
## sigma^2 estimated as 745203831: log likelihood=-290.06
## AIC=586.13 AICc=587.27 BIC=589.79
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 846.4095 25675.26 20107.69 0.2596809 9.535563 0.1783418
## ACF1
## Training set -0.09149595
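The selected ARIMA(0,1,1) with drift means the first-differenced series follows an MA(1) process with a constant term:
\(y_{t} - y_{t-1} = c + \varepsilon_{t} + \theta_{1}\varepsilon_{t-1}\), with \(\hat{\theta}_{1} = -0.6450\) and drift \(\hat{c} \approx 10022\) taken from the summary above.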
fcast <- forecast(arima_model,h=12)
plot(fcast)
# MAPE function to compute accuracy on Test Data
mape <- function(actual, pred) {
  mean(abs((actual - pred) / actual)) * 100
}
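A quick sanity check of the helper on toy numbers (illustrative values only):
mape(c(100, 200), c(110, 190))  # (0.10 + 0.05)/2 * 100 = 7.5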
#Accuracy Measures on the various ETS Models on Test Data
# (1) SES Model
df_fc = as.data.frame(fit_ses)
test$simplexp = df_fc$`Point Forecast`
mape(test$Gross_Payroll, test$simplexp)
## [1] 18.21362
# (2) DES Model
df_holt = as.data.frame(fit_holt)
test$holt = df_holt$`Point Forecast`
mape(test$Gross_Payroll, test$holt)
## [1] 7.901171
# (3) TES Model
df_holt_winters = as.data.frame(fit_hw)
test$holt_winters = df_holt_winters$`Point Forecast`
mape(test$Gross_Payroll, test$holt_winters)
## [1] 7.314137
# (4) AET Model
df_aet = as.data.frame(fit_auto)
test$aet = df_aet$`Point Forecast`
mape(test$Gross_Payroll, test$aet)
## [1] 9.141867
# (5) Auto Arima Model
df_arima = as.data.frame(fcast)
test$arima = df_arima$`Point Forecast`
mape(test$Gross_Payroll, test$arima)
## [1] 8.595638
Observation 5:
Accuracy measures based on calculated MAPE on the test data, ranked best (lowest MAPE) first; a cross-check follows the list:
(1) Holt-Winters (TES): 7.314
(2) Holt's (DES): 7.901
(3) ARIMA: 8.596
(4) Automated Exponential Smoothing: 9.142
(5) Simple Exponential Smoothing (SES): 18.213
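These hand-computed MAPEs can be cross-checked with forecast::accuracy(), which reports test-set MAPE alongside RMSE, MAE, and MASE when given the held-out actuals; a minimal sketch for the best model:
# Cross-check the Holt-Winters test-set accuracy measures
accuracy(fit_hw, test$Gross_Payroll)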
Observation 6:
Residual diagnostics for the winning Holt-Winters (TES) model:
# Calculate the mean of the residuals and plot their histogram
fit_hw_res <- residuals(fit_hw)
mean(fit_hw_res)
## [1] -0.008910489
res_hist <- ggplot(data.frame(res = as.numeric(fit_hw_res)), aes(x = res)) +
  geom_histogram(binwidth = .05, colour = "blue", fill = "yellow") +
  ggtitle("Fig 5a: Histogram of Holt-Winters Residuals") + theme_economist()
res_hist
# Q-Q plot to check normality of residuals
qq <- ggplot() + aes(sample = as.numeric(fit_hw_res)) + geom_qq(distribution = qnorm) +
  geom_qq_line(line.p = c(0.25, 0.75), col = "blue") + ylab("Gross Payroll") +
  ggtitle("Fig 5b: QQ Plot of Holt-Winters (TES) Residuals") + theme_economist()
qq
Observation 7:
- The mean of the residuals is very close to zero. This suggests the TES model has extracted more of the "information" in the training data than any other method, qualifying it as the most suitable forecasting model.
- Plotting a histogram of the residuals reveals their distribution and any anomalies for the Holt-Winters method. Here the tails of the distribution are "frayed" and deviate from normality, which is quite common for time series data; see the diagnostic sketch below.
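The forecast package also bundles these residual diagnostics (time plot, ACF, histogram, and a Ljung-Box test) into a single call; a minimal sketch:
# One-call residual diagnostics for the Holt-Winters fit
checkresiduals(fit_hw)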
Although normality of the residuals is not required when fitting an exponential smoothing model, it is required when producing prediction intervals.
NOTE: With extremely non-normal residuals, a transformation of the original data should be performed.
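In that case, the forecast package's Box-Cox utilities can transform the series before refitting; a minimal sketch (the additive-seasonal refit is an illustrative choice, not part of the analysis above):
# Estimate a Box-Cox lambda and refit Holt-Winters on the transformed scale
lam <- BoxCox.lambda(dat_ts)
fit_hw_bc <- hw(dat_ts, h = 12, seasonal = "additive", lambda = lam)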
# add forecast columns (copies of actuals) to the historical dataframe so it can be stacked with test
df5 <- df4 %>% add_column(simplexp = df4$Gross_Payroll, holt = df4$Gross_Payroll, holt_winters = df4$Gross_Payroll, aet = df4$Gross_Payroll, arima = df4$Gross_Payroll)
# stack the two dataframes vertically
merge_vertical <- rbind(df5, test)
fig <- plot_ly(merge_vertical, x = ~date, y = ~simplexp, name = 'SES', type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = ~holt, name = "Holt's (DES)", mode = 'lines+markers')
fig <- fig %>% add_trace(y = ~holt_winters, name = 'Holt-Winters (TES)', mode = 'lines')
fig <- fig %>% add_trace(y = ~aet, name = 'AET', mode = 'lines')
fig <- fig %>% add_trace(y = ~arima, name = 'ARIMA', mode = 'lines')
fig <- fig %>% layout(title = "Fig 6: Interactive Forecasted Gross Payroll, FP&A",
                      xaxis = list(title = "Year"),
                      yaxis = list(title = "Payroll ($)")) %>% layout(plot_bgcolor = 'rgb(250, 255, 234)')
fig
Summary & Conclusion:
Gross Payroll lends itself very well to time series forecasting due to three main factors:
1) It does not exhibit big fluctuations
2) It is "heavily" populated
3) It is approximately normally distributed
Advantages of Modeling in Excel:
1) For such a well-behaved data set, simple linear regression in Excel can do the job of forecasting. The same can be said of Excel's built-in Exponential Smoothing functions.
2) Quick and easy to replicate for one-off analyses
3) Excel is a familiar platform within FP&A, with a short ramp-up time to learn and implement its Data Analysis tools
Disadvantages of Modeling in Excel:
1) Forecasted results are coerced into linear modes, with no ability for the end user to observe any variability into the future; no "wiggles"
2) For repeated processes, data sets have to be re-staged every time due to the built-in nature of Excel's modules
3) Excel's ETS functions are a black box; end users cannot tweak the alpha, beta, and gamma parameters
4) They cannot account for seasonal behavior
Advantages in Machine Learning:
1) Accuracy can be increased because the end user can tweak the alpha, beta, and gamma parameters (see the sketch after this list); in addition, the libraries come equipped with self-tuning hyperparameters in the models themselves
2) For repeated processes, data can be staged, loaded, and wrangled into tidy format automatically via code; new data can be uploaded easily
3) Interactive visualizations help end users better understand the behavior of the data
4) Several models can be created at once for comparative analysis, so the best-suited model can be selected
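For example, the smoothing parameters can be pinned manually instead of letting the optimizer choose them; a minimal sketch (the parameter values below are illustrative only):
# Holt-Winters with manually specified smoothing parameters
fit_manual <- hw(dat_ts, h = 12, seasonal = "multiplicative",
                 alpha = 0.2, beta = 0.1, gamma = 0.1)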
Disadvantages in Machine Learning:
1) Coding, implementation, and understanding of the libraries can be challenging for beginners; longer ramp-up time to learn and implement the tool kits
2) End users may view these libraries as black boxes and not understand the math behind the outputs