The objective of this exercise is to determine whether regression or other time series models are suitable for forecasting the FP&A payroll data sets.
Gross Payroll is used as a test case, and the following models (in ascending order of complexity) were used to showcase their efficacy:
- Excel's Linear Regression Model
- Excel's Exponential Smoothing Model (ETS)
- R's Machine Learning Exponential Smoothing Models (ETS)
- R's Machine Learning ARIMA Model
library(ggthemes)
library(tidyverse)
library(magrittr)
library(TTR)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(fpp2)
library(forecast)
library(caTools)
library(reshape2)
library(psych)
url1 <- "https://raw.githubusercontent.com/ssufian/Data_607/master/Payroll%20Gross%20to%20Net%20Report%20(A)%20NEE%20Fin%20Historical%2003%2031%2020-trim.csv"
# Reading CSV file from GitHub
df1 <- read.csv(file=url1, sep=",",na.strings = c("NA"," ",""),strip.white = T, stringsAsFactors = F, header=T)
# Make sure the 1st column is a factor
df1$Cost.Center<- factor(df1$Cost.Center)
#Selecting only the Result row (total row) for Gross Pay
df1 <- slice(df1,10)
#head(df1,1)
df2 <- df1 %>% select(Cost.Center,X1.1.2017:X2.1.2020) # removing other columns
#head(df2,1)
# Replacing unwanted header prefixes (lambda syntax replaces the deprecated funs())
df2 <- df2 %>% rename_at(vars(starts_with("X")), ~ str_replace(., "X", ""))
# renaming columns
df2 <-df2 %>% rename(Cost_Center=Cost.Center)
#head(df2,1)
# reshape from wide to long format
df3 <- melt(df2, id.vars = c("Cost_Center"),variable.name = "Date",value.name="Gross_Payroll")
#Removing "." from date labels using Regex
df3$Date <- gsub("[.]", "/", df3$Date)
head(df3,5)
## Cost_Center Date Gross_Payroll
## 1 FCOE-FP&A 1/1/2017 83697
## 2 FCOE-FP&A 2/1/2017 98139
## 3 FCOE-FP&A 3/1/2017 111554
## 4 FCOE-FP&A 4/1/2017 156649
## 5 FCOE-FP&A 5/1/2017 99361
# convert character to Date
New_Date <- as.Date(df3$Date, "%m/%d/%Y")
df4 <- df3 %>% mutate(date=New_Date) %>% select(c("Cost_Center","date","Gross_Payroll"))
#head(df4,1)
summary(df4$Gross_Payroll)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 83697 226955 297088 289387 381000 462091
glimpse(describe(df4$Gross_Payroll))
## Observations: 1
## Variables: 13
## $ vars <dbl> 1
## $ n <dbl> 38
## $ mean <dbl> 289387.2
## $ sd <dbl> 103551.2
## $ median <dbl> 297087.5
## $ trimmed <dbl> 292590.1
## $ mad <dbl> 108449.2
## $ min <dbl> 83697
## $ max <dbl> 462091
## $ range <dbl> 378394
## $ skew <dbl> -0.2498309
## $ kurtosis <dbl> -0.8427995
## $ se <dbl> 16798.22
# Histogram
hist(df4$Gross_Payroll, xlab = "Payroll", col = "blue", main = "Fig 1: Histogram of FP&A Gross Payroll", probability = TRUE)
s <- sd(df4$Gross_Payroll)
m <- mean(df4$Gross_Payroll)
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 3)
Observation 1:
The Gross Payroll distribution is roughly bell-shaped, with a slight left skew (skew = -0.25) and mild negative kurtosis, as the overlaid normal curve in Fig 1 suggests.
# Interactive time series line plot
fig1 <- plot_ly(df4, x = ~date, y = ~Gross_Payroll, name = 'Gross Payroll', type = 'scatter', mode = 'lines',
                line = list(color = 'rgb(22, 96, 167)', width = 4))
fig1 <- fig1 %>% layout(title = "Fig 2: Interactive Historical Gross Payroll, FP&A",
                        xaxis = list(title = "Year"),
                        yaxis = list(title = "Payroll ($)")) %>% layout(plot_bgcolor = 'rgb(250, 255, 234)')
fig1
Observation 2:
The data exhibit a positive upward trend with a slight uptick in variability starting July 2018.
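To separate that trend from any seasonal component, a quick seasonal-trend decomposition can be run on the full series; a minimal sketch, assuming the df4 dataframe from above (this decomposition step is illustrative and was not part of the original workflow):
# Convert the full 38-month series to a ts object and decompose it
full_ts <- ts(df4$Gross_Payroll, start = c(2017, 1), frequency = 12)
fit_stl <- stl(full_ts, s.window = "periodic")   # trend + seasonal + remainder
plot(fit_stl, main = "STL Decomposition of Gross Payroll")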
Observation 3:
- Excel Simple Linear Regression as shown in Fig 3a
- ANOVA analysis in Table 1
- Statistical Diagnostics in Fig 3b
The simple linear regression has a coefficient of determination \(R^{2}\) of 91%, meaning the model explains 91% of the variance in the data. The slope is also highly significant (high t-statistic, low p-value). Overall, simple linear regression is a viable model for this data set.
All the statistical diagnostics show that the linear regression assumptions are met: homogeneity of variances, normality of residuals, and a linear relationship.
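The same regression can be reproduced in R as a cross-check; a minimal sketch, assuming a numeric month index as the regressor (this is illustrative and not part of the Excel workflow):
# Hypothetical cross-check: reproduce the Excel regression with a month index
t_idx <- seq_len(nrow(df4))
lin_fit <- lm(df4$Gross_Payroll ~ t_idx)
summary(lin_fit)  # compare R-squared and slope t-statistic with Fig 3a / Table 1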
Observation 4:
- Excel Built-in Exponential Smoothing Functions as shown in Fig 4
The Excel ETS module shows a very low SMAPE, which indicates a well-specified model.
Both Excel-based models explain the data set well and are suitable for forecasting. However, as can be seen in the forecasted values, future values are coerced into a linear trajectory, so the models cannot capture any seasonality, if it exists.
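For reference, SMAPE (the error measure Excel's ETS module reports) can be computed directly; a minimal sketch using one common definition of SMAPE (this helper is illustrative, not part of Excel's output):
# Symmetric MAPE: mean of 2|actual - pred| / (|actual| + |pred|), as a percentage
smape <- function(actual, pred) {
  mean(2 * abs(actual - pred) / (abs(actual) + abs(pred))) * 100
}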
Step 1:
# Split into training (first 26 months) and test (last 12 months) sets, roughly 70%:30%
train <- head(df4, 26)
test <- tail(df4, 12)
nrow(train); nrow(test)
## [1] 26
## [1] 12
train
## Cost_Center date Gross_Payroll
## 1 FCOE-FP&A 2017-01-01 83697
## 2 FCOE-FP&A 2017-02-01 98139
## 3 FCOE-FP&A 2017-03-01 111554
## 4 FCOE-FP&A 2017-04-01 156649
## 5 FCOE-FP&A 2017-05-01 99361
## 6 FCOE-FP&A 2017-06-01 174622
## 7 FCOE-FP&A 2017-07-01 196364
## 8 FCOE-FP&A 2017-08-01 227229
## 9 FCOE-FP&A 2017-09-01 233604
## 10 FCOE-FP&A 2017-10-01 228581
## 11 FCOE-FP&A 2017-11-01 225442
## 12 FCOE-FP&A 2017-12-01 222437
## 13 FCOE-FP&A 2018-01-01 239901
## 14 FCOE-FP&A 2018-02-01 226863
## 15 FCOE-FP&A 2018-03-01 252863
## 16 FCOE-FP&A 2018-04-01 274896
## 17 FCOE-FP&A 2018-05-01 295541
## 18 FCOE-FP&A 2018-06-01 277493
## 19 FCOE-FP&A 2018-07-01 298634
## 20 FCOE-FP&A 2018-08-01 315266
## 21 FCOE-FP&A 2018-09-01 277503
## 22 FCOE-FP&A 2018-10-01 322791
## 23 FCOE-FP&A 2018-11-01 308144
## 24 FCOE-FP&A 2018-12-01 321804
## 25 FCOE-FP&A 2019-01-01 378308
## 26 FCOE-FP&A 2019-02-01 312909
dat_ts <- ts(train[, 3], start = c(2017, 1), end = c(2019, 2), frequency = 12)
Next, apply R's built-in forecasting functions to the training data, using the various exponential smoothing techniques.
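For reference, simple exponential smoothing produces each forecast as a weighted average of the latest observation and the previous forecast:
\(\hat{y}_{t+1} = \alpha y_{t} + (1-\alpha)\hat{y}_{t}\), where \(0 < \alpha < 1\) is the smoothing parameter (reported as alpha in the model summaries below). Holt's method adds a trend equation, and Holt-Winters adds a seasonal one.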
#Fit simple exponential smoothing model to data and show summary
fit_ses <- ses(dat_ts, h = 12)
summary(fit_ses)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = dat_ts, h = 12)
##
## Smoothing parameters:
## alpha = 0.7135
##
## Initial states:
## l = 89757.1481
##
## sigma: 31891.98
##
## AIC AICc BIC
## 627.8751 628.9660 631.6494
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 12772.61 30640.82 24929.68 5.177524 12.01798 0.2211096 -0.3524492
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Mar 2019 326714.2 285843.0 367585.4 264207.0 389221.3
## Apr 2019 326714.2 276505.2 376923.2 249926.1 403502.2
## May 2019 326714.2 268650.1 384778.2 237912.9 415515.5
## Jun 2019 326714.2 261737.8 391690.5 227341.5 426086.9
## Jul 2019 326714.2 255493.3 397935.0 217791.3 435637.1
## Aug 2019 326714.2 249753.8 403674.5 209013.4 444414.9
## Sep 2019 326714.2 244413.6 409014.8 200846.3 452582.1
## Oct 2019 326714.2 239399.4 414029.0 193177.7 460250.7
## Nov 2019 326714.2 234657.9 418770.5 185926.2 467502.2
## Dec 2019 326714.2 230148.9 423279.5 179030.3 474398.0
## Jan 2020 326714.2 225841.3 427587.1 172442.4 480986.0
## Feb 2020 326714.2 221710.2 431718.1 166124.5 487303.9
#Plot the forecasted values
plot(fit_ses)
#Fit Holt exponential smoothing model to data and show summary
fit_holt <- holt(dat_ts, h = 12)
summary(fit_holt)
##
## Forecast method: Holt's method
##
## Model Information:
## Holt's method
##
## Call:
## holt(y = dat_ts, h = 12)
##
## Smoothing parameters:
## alpha = 0.2949
## beta = 0.0728
##
## Initial states:
## l = 66161.3251
## b = 18333.2628
##
## sigma: 28892.54
##
## AIC AICc BIC
## 624.4767 627.4767 630.7672
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -6217.673 26577.27 21263.31 -4.13987 10.24616 0.1885914
## ACF1
## Training set -0.04963953
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Mar 2019 347531.9 310504.6 384559.1 290903.5 404160.2
## Apr 2019 354090.3 314638.1 393542.4 293753.4 414427.1
## May 2019 360648.6 317956.1 403341.2 295356.0 425941.3
## Jun 2019 367207.0 320472.5 413941.6 295732.7 438681.4
## Jul 2019 373765.4 322234.8 425296.0 294956.2 452574.7
## Aug 2019 380323.8 323305.8 437341.9 293122.3 467525.4
## Sep 2019 386882.2 323750.4 450014.1 290330.4 483434.1
## Oct 2019 393440.6 323628.9 463252.3 286672.9 500208.4
## Nov 2019 399999.0 322994.3 477003.7 282230.5 517767.5
## Dec 2019 406557.4 321891.4 491223.4 277071.9 536042.9
## Jan 2020 413115.8 320357.7 505873.9 271254.6 554977.0
## Feb 2020 419674.2 318424.8 520923.6 264826.6 574521.8
#Plot the forecasted values
plot(fit_holt)
#Fit Holt-Winters exponential smoothing model to data and show summary
fit_hw <- hw(dat_ts, h = 12, seasonal = "multiplicative")
summary(fit_hw)
##
## Forecast method: Holt-Winters' multiplicative method
##
## Model Information:
## Holt-Winters' multiplicative method
##
## Call:
## hw(y = dat_ts, h = 12, seasonal = "multiplicative")
##
## Smoothing parameters:
## alpha = 0.0055
## beta = 1e-04
## gamma = 2e-04
##
## Initial states:
## l = 109308.052
## b = 9534.5763
## s = 0.9715 0.9599 1.0295 0.9831 1.0938 1.0664
## 1.0002 0.9653 1.0824 0.9391 0.8664 1.0423
##
## sigma: 0.1961
##
## AIC AICc BIC
## 649.9992 726.4992 671.3868
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 118.9601 20843.74 16384.68 -2.763658 9.567314 0.1453211 0.3055741
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Mar 2019 344389.8 257839.8 430939.9 212022.9 476756.7
## Apr 2019 407236.1 304890.3 509582.0 250711.7 563760.6
## May 2019 372382.4 278794.6 465970.3 229252.1 515512.8
## Jun 2019 395381.4 296011.8 494750.9 243408.8 547353.9
## Jul 2019 431743.1 323233.3 540252.8 265791.7 597694.5
## Aug 2019 453257.4 339338.7 567176.1 279033.8 627481.0
## Sep 2019 416789.7 312035.0 521544.3 256581.2 576998.1
## Oct 2019 446272.0 334105.7 558438.4 274728.4 617815.6
## Nov 2019 425252.1 318367.3 532136.9 261786.0 588718.2
## Dec 2019 439628.8 329128.9 550128.7 270633.8 608623.8
## Jan 2020 481586.9 360539.1 602634.7 296460.3 666713.5
## Feb 2020 408593.1 305890.9 511295.2 251523.7 565662.5
#Plot the forecasted values
plot(fit_hw)
#Fit automated exponential smoothing model to data and show summary
fit_auto <- forecast(dat_ts, h = 12)
summary(fit_auto)
##
## Forecast method: ETS(A,Ad,N)
##
## Model Information:
## ETS(A,Ad,N)
##
## Call:
## ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 1e-04
## beta = 1e-04
## phi = 0.9557
##
## Initial states:
## l = 66161.4197
## b = 18333.4909
##
## sigma: 24063.72
##
## AIC AICc BIC
## 615.7575 620.1786 623.3061
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2366.334 21626.47 16862.95 0.05546078 8.037052 0.1495631
## ACF1
## Training set -0.06244725
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Mar 2019 345407.7 314568.8 376246.6 298243.7 392571.7
## Apr 2019 350565.2 319726.3 381404.1 303401.1 397729.2
## May 2019 355494.2 324655.3 386333.1 308330.2 402658.2
## Jun 2019 360204.8 329366.0 391043.7 313040.8 407368.9
## Jul 2019 364706.8 333867.9 395545.7 317542.8 411870.9
## Aug 2019 369009.4 338170.5 399848.3 321845.4 416173.4
## Sep 2019 373121.4 342282.5 403960.3 325957.3 420285.4
## Oct 2019 377051.2 346212.3 407890.1 329887.1 424215.3
## Nov 2019 380806.9 349968.0 411645.9 333642.9 427971.0
## Dec 2019 384396.3 353557.4 415235.2 337232.2 431560.4
## Jan 2020 387826.7 356987.7 418665.6 340662.6 434990.8
## Feb 2020 391105.1 360266.1 421944.0 343941.0 438269.2
#Plot the forecasted values
plot(fit_auto)
arima_model <- auto.arima(dat_ts)
summary(arima_model)
## Series: dat_ts
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## -0.6450 10021.604
## s.e. 0.1822 2012.817
##
## sigma^2 estimated as 745203831: log likelihood=-290.06
## AIC=586.13 AICc=587.27 BIC=589.79
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 846.4095 25675.26 20107.69 0.2596809 9.535563 0.1783418
## ACF1
## Training set -0.09149595
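The selected ARIMA(0,1,1) with drift means the first-differenced series follows an MA(1) process with a constant term:
\(y_{t} - y_{t-1} = c + \varepsilon_{t} + \theta_{1}\varepsilon_{t-1}\), with \(\hat{\theta}_{1} = -0.6450\) and drift \(\hat{c} \approx 10022\) taken from the summary above.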
fcast <- forecast(arima_model,h=12)
plot(fcast)
# MAPE function to compute accuracy on Test Data
mape <- function(actual, pred) {
  mean(abs((actual - pred) / actual)) * 100
}
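A quick sanity check of the helper on toy numbers (illustrative values only):
mape(c(100, 200), c(110, 190))  # (0.10 + 0.05)/2 * 100 = 7.5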
#Accuracy Measures on the various ETS Models on Test Data
# (1) SES Model
df_fc = as.data.frame(fit_ses)
test$simplexp = df_fc$`Point Forecast`
mape(test$Gross_Payroll, test$simplexp)
## [1] 18.21362
# (2) DES Model
df_holt = as.data.frame(fit_holt)
test$holt = df_holt$`Point Forecast`
mape(test$Gross_Payroll, test$holt)
## [1] 7.901171
# (3) TES Model
df_holt_winters = as.data.frame(fit_hw)
test$holt_winters = df_holt_winters$`Point Forecast`
mape(test$Gross_Payroll, test$holt_winters)
## [1] 7.314137
# (4) AET Model
df_aet = as.data.frame(fit_auto)
test$aet = df_aet$`Point Forecast`
mape(test$Gross_Payroll, test$aet)
## [1] 9.141867
# (5) Auto Arima Model
df_arima = as.data.frame(fcast)
test$arima = df_arima$`Point Forecast`
mape(test$Gross_Payroll, test$arima)
## [1] 8.595638
Observation 5:
Accuracy measures based on calculated MAPE on the test data, ranked best (lowest MAPE) first; a cross-check follows the list:
(1) Holt-Winters (TES): 7.314
(2) Holt's (DES): 7.901
(3) ARIMA: 8.596
(4) Automated Exponential Smoothing: 9.142
(5) Simple Exponential Smoothing (SES): 18.213
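These hand-computed MAPEs can be cross-checked with forecast::accuracy(), which reports test-set MAPE alongside RMSE, MAE, and MASE when given the held-out actuals; a minimal sketch for the best model:
# Cross-check the Holt-Winters test-set accuracy measures
accuracy(fit_hw, test$Gross_Payroll)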
Observation 6:
Residual diagnostics for the winning Holt-Winters (TES) model:
# Calculate the mean of the residuals and plot their histogram
fit_hw_res <- residuals(fit_hw)
mean(fit_hw_res)
## [1] -0.008910489
res_hist <- ggplot(data.frame(res = as.numeric(fit_hw_res)), aes(x = res)) +
  geom_histogram(binwidth = .05, colour = "blue", fill = "yellow") +
  ggtitle("Fig 5a: Histogram of Holt-Winters Residuals") + theme_economist()
res_hist
# Q-Q plot to check normality of residuals
qq <- ggplot() + aes(sample = as.numeric(fit_hw_res)) + geom_qq(distribution = qnorm) +
  geom_qq_line(line.p = c(0.25, 0.75), col = "blue") + ylab("Gross Payroll") +
  ggtitle("Fig 5b: QQ Plot of Holt-Winters (TES) Residuals") + theme_economist()
qq
Observation 7:
- The mean of the residuals is very close to zero. This suggests the TES model has extracted more of the "information" in the training data than any other method, qualifying it as the most suitable forecasting model.
- Plotting a histogram of the residuals reveals their distribution and any anomalies for the Holt-Winters method. Here the tails of the distribution are "frayed" and deviate from normality, which is quite common for time series data; see the diagnostic sketch below.
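The forecast package also bundles these residual diagnostics (time plot, ACF, histogram, and a Ljung-Box test) into a single call; a minimal sketch:
# One-call residual diagnostics for the Holt-Winters fit
checkresiduals(fit_hw)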
Although normality of the residuals is not required when fitting an exponential smoothing model, it is required when producing prediction intervals.
NOTE: With extremely non-normal residuals, a transformation of the original data should be performed.
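In that case, the forecast package's Box-Cox utilities can transform the series before refitting; a minimal sketch (the additive-seasonal refit is an illustrative choice, not part of the analysis above):
# Estimate a Box-Cox lambda and refit Holt-Winters on the transformed scale
lam <- BoxCox.lambda(dat_ts)
fit_hw_bc <- hw(dat_ts, h = 12, seasonal = "additive", lambda = lam)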
# add forecast columns (copies of actuals) to the historical dataframe so it can be stacked with test
df5 <- df4 %>% add_column(simplexp = df4$Gross_Payroll, holt = df4$Gross_Payroll, holt_winters = df4$Gross_Payroll, aet = df4$Gross_Payroll, arima = df4$Gross_Payroll)
# stack the two dataframes vertically
merge_vertical <- rbind(df5, test)
fig <- plot_ly(merge_vertical, x = ~date, y = ~simplexp, name = 'SES', type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = ~holt, name = "Holt's (DES)", mode = 'lines+markers')
fig <- fig %>% add_trace(y = ~holt_winters, name = 'Holt-Winters (TES)', mode = 'lines')
fig <- fig %>% add_trace(y = ~aet, name = 'AET', mode = 'lines')
fig <- fig %>% add_trace(y = ~arima, name = 'ARIMA', mode = 'lines')
fig <- fig %>% layout(title = "Fig 6: Interactive Forecasted Gross Payroll, FP&A",
                      xaxis = list(title = "Year"),
                      yaxis = list(title = "Payroll ($)")) %>% layout(plot_bgcolor = 'rgb(250, 255, 234)')
fig
Summary & Conclusion:
Gross Payroll lends itself very well to time series forecasting due to three main factors:
1) It does not exhibit big fluctuations
2) It is "heavily" populated
3) It is approximately normally distributed
Advantages of Modeling in Excel:
1) For such a well-behaved data set, simple linear regression in Excel can do the job of forecasting. The same can be said of Excel's built-in Exponential Smoothing functions.
2) Quick and easy to replicate for one-off analyses
3) Excel is a familiar platform within FP&A, with a short ramp-up time to learn and implement its Data Analysis tools
Disadvantages of Modeling in Excel:
1) Forecasted results are coerced into linear modes, with no ability for the end user to observe any variability into the future; no "wiggles"
2) For repeated processes, data sets have to be re-staged every time due to the built-in nature of Excel's modules
3) Excel's ETS functions are a black box; end users cannot tweak the alpha, beta, and gamma parameters
4) They cannot account for seasonal behavior
Advantages in Machine Learning:
1) Accuracy can be increased because the end user can tweak the alpha, beta, and gamma parameters (see the sketch after this list); in addition, the libraries come equipped with self-tuning hyperparameters in the models themselves
2) For repeated processes, data can be staged, loaded, and wrangled into tidy format automatically via code; new data can be uploaded easily
3) Interactive visualizations help end users better understand the behavior of the data
4) Several models can be created at once for comparative analysis, so the best-suited model can be selected
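For example, the smoothing parameters can be pinned manually instead of letting the optimizer choose them; a minimal sketch (the parameter values below are illustrative only):
# Holt-Winters with manually specified smoothing parameters
fit_manual <- hw(dat_ts, h = 12, seasonal = "multiplicative",
                 alpha = 0.2, beta = 0.1, gamma = 0.1)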
Disadvantages in Machine Learning:
1) Coding, implementation, and understanding of the libraries can be challenging for beginners; longer ramp-up time to learn and implement the tool kits
2) End users may view these libraries as black boxes and not understand the math behind the outputs