Organizational Payroll Data - Time Series Modeling

The time-series data represents the monthly payroll actuals for a national nonprofit organization from the start of 2018 to present. The nonprofit organization is experiencing a period of rapid growth in terms of impact and staffing. Afterschool and Summer programs are the main work of the organization with seasonal contract labor supplementing Full-time staff during key periods of the Summer and to a lesser extend in the Spring and Fall. The data set also includes FT staffing levels, which can be leveraged for more dynamic analysis. Forecasting future payroll obligations is significant for budgeting purposes and maintenance of cashflow for the organization.

Reading in and observing Payroll Data

org_payroll <- read.csv('payroll.csv')
str(org_payroll)

## 'data.frame':    47 obs. of  3 variables:
##  $ ï..Month: chr  "2018-1-1" "2018-2-1" "2018-3-1" "2018-4-1" ...
##  $ Payroll : num  72022 94061 136453 94494 96118 ...
##  $ FT_Staff: int  12 15 15 15 15 15 15 15 15 16 ...

org_payroll$ï..Month <- as.Date.character(org_payroll$ï..Month)
str(org_payroll)

## 'data.frame':    47 obs. of  3 variables:
##  $ ï..Month: Date, format: "2018-01-01" "2018-02-01" ...
##  $ Payroll : num  72022 94061 136453 94494 96118 ...
##  $ FT_Staff: int  12 15 15 15 15 15 15 15 15 16 ...

dim(org_payroll)

## [1] 47  3

##Analysis - Data includes 3 variables - Date by Month, Payroll for Month, Number of FT Staff
##Probably too few months
summary(org_payroll)

##     ï..Month             Payroll           FT_Staff     
##  Min.   :2018-01-01   Min.   :  72022   Min.   : 12.00  
##  1st Qu.:2018-12-16   1st Qu.: 150768   1st Qu.: 20.00  
##  Median :2019-12-01   Median : 402029   Median : 41.00  
##  Mean   :2019-12-01   Mean   : 470467   Mean   : 43.57  
##  3rd Qu.:2020-11-16   3rd Qu.: 699195   3rd Qu.: 56.50  
##  Max.   :2021-11-01   Max.   :1964896   Max.   :105.00

#Organization had 12 FT staff members at the open of 2018 and will end 2021 with 105+
#The payroll totals peaked at nearly $2m in Summer 2021, a big leap from Jan 2018s $72k
#The mean payroll for the org is $470k with a median monthly payroll of $402k

Exploring the Linear Relationship between FT Staff and Payroll

#Evaluating the relationship between FT staffing and Monthly payroll shows clear correlation
#It stands to reason that as the number of FT employees increases, so will the payroll burden
#FT_Staff is likely an explanatory variable for Payroll
cor.test(org_payroll$Payroll,org_payroll$FT_Staff)

## 
##  Pearson's product-moment correlation
## 
## data:  org_payroll$Payroll and org_payroll$FT_Staff
## t = 7.9145, df = 45, p-value = 4.592e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6091157 0.8613276
## sample estimates:
##      cor 
## 0.762847

#p-value indicates that correlation does exist between FT Staff and FT staff level
#Correlation of 0.762847
plot(Payroll~FT_Staff, data = org_payroll)
#With a few exceptional outliers, to be further discussed, there appears to be a decently strong linear relationship
#Linear Regression Model
payroll_lm <- lm(Payroll~FT_Staff, data = org_payroll)  
abline(payroll_lm)

summary(payroll_lm)

## 
## Call:
## lm(formula = Payroll ~ FT_Staff, data = org_payroll)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -254453 -120377  -73775  -25005 1020705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5817      68971   0.084    0.933    
## FT_Staff       10663       1347   7.914 4.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 248200 on 45 degrees of freedom
## Multiple R-squared:  0.5819, Adjusted R-squared:  0.5726 
## F-statistic: 62.64 on 1 and 45 DF,  p-value: 4.592e-10

#This model had an adjusted R-Squared valued of 0.5726, which means it explains more than half of the residuals
#FT Staff as an explanatory variable is shown to be significant
#The intercept in this case is relatively meaningless as there would be no business without employees
#The coefficient for the FT_Staff means that for each FT_Staff member added, the monthly payroll will climb by $10,663
#This coefficient likely overstates the impact of each FT employee as there are unaccounted non-ft staff during certain seasons

Futher exploration of data as a Time-Series

#Plotting of both payroll and FT Staff as a time series
org_payroll[-1] %>% ts(start = 2018, frequency = 12) %>% autoplot(facets = TRUE)

payroll_ts <- ts(org_payroll$Payroll, start = 2018, end = c(2021,11), frequency = 12)
autoplot(payroll_ts) + xlab("Time (Months)") + ylab("Monthly Payroll ($)") +ggtitle("Org. Monthly Payroll (2018-2021)")

#Analysis of the TS plot shows a clear increasing trend over the four year period with distint seasonal peaking that also increases in magnitude
ggseasonplot(payroll_ts) + ggtitle("Seasonality plot") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")

#Clearly the payroll level is higher  YOY as well as seasonal pattern of increasing monthly payrolls starting in May, peaking in July, and then receding in September.
#Seasonally, there also appears to be a slight uptick each December
#The seasonal trend reflects the fact that the organization operates Summer and Afterschool programming, with the former being it's primary business driver
#During Summer months, additional contract staff are hired to handle the seasonal workload, and then more FT hiring happens in Nov-Dec

decompose(payroll_ts) %>% autoplot()

#Errors/residuals appear to be increasing with time
#Trend is nearly linear -> indicating a likely additive rather than exponential/multiplicative trend
#Seasonality is clearly visible in two distinct annual spikes, one large spike in the Summer months followed by a smaller end of year peak
#Seasonality looks relatively stable, could be either additive or multiplicative

Learning from the ACF/PCF graphs and Priming for ARIMA

ggtsdisplay(payroll_ts)

#The data continues to show correlation with itself through 13 lags before decaying to almost zero.
#Data does not appear to be stationary and will require transformation

#Making Data Stationary
ndiffs(payroll_ts)

## [1] 1

payroll_tsdiff <- diff(payroll_ts)
ggtsdisplay(payroll_tsdiff)

#PACF looks reltively clean after initial 1 lag spike
payroll_ts_lambda <- BoxCox.lambda(payroll_ts)
payroll_ts_lambda

## [1] 0.4720262

#lambda is about 0.47, close to 0.5 which makes it almost a square root transformation
payroll_ts_bc <- box_cox(payroll_tsdiff, lambda = payroll_ts_lambda)
adf.test(payroll_ts_bc)

## Warning in adf.test(payroll_ts_bc): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  payroll_ts_bc
## Dickey-Fuller = -4.3036, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

#Data is stationary
autoplot(payroll_ts_bc)

ggtsdisplay(payroll_ts_bc)

#Negative ACF at lag 1 indicates likely MA terms in an ARIMA model
#Two or more random, consective jumps in ACF need MA terms to smooth out
#Seasonal terms also appear to be MA as ACF drops rather than decays

Summary of Exploratory Analysis and Modeling Plan

-The linear model fails to capture the seasonal aspect of this data and also would not adjust the level for extra factors like Cost of Living Adjustments to overall pay -I will try 6 modelling methods for comparison SNAIVE, Naive with Drift, ETS(MAN) & ETS(MAA), ARIMA, Dynamic Regression, and TBATS -Simple Methods: SNAIVE - Will capture the seasonal movement, but may understate forecast by not capturing the increasing trend. Drift - To capture the rolling change of the data -ETS model - Will smooth for both the seasonal and trend in data and can test additive and multiplicative approaches -ARIMA - Will use ARIMA (p,d,q)(P,D,Q), based on earlier findings expect d/D = 1, with MA terms. Both trend and seasonality. I will need to make data stationary -Dynamic regression to leverage the FT staff data -TBATS as a comparison tool

Create Validation set

Validation set will include the final 8 month of data times series (March-Nov 2021), then forecast 8 months out (Dec’21-July’22)

#TS of test and train set
payroll_train_ts <- subset(payroll_ts, end = 39)
Payroll_test_ts <- subset(payroll_ts, start = 40)
#DF of test and train set
payroll_train <- org_payroll[c(1:39),c(2:3)]
payroll_test <- org_payroll[,c(2:3)]

Simple Forecasting Methods

pfc_snaive <- snaive(payroll_train_ts, h =16, level = c(95))
pfc_snaive

##          Point Forecast      Lo 95     Hi 95
## Apr 2021       377136.1  -91812.89  846085.2
## May 2021       384165.3  -84783.75  853114.3
## Jun 2021       888267.1  419318.08 1357216.1
## Jul 2021      1056750.7  587801.66 1525699.7
## Aug 2021       721615.1  252666.10 1190564.2
## Sep 2021       402028.5  -66920.52  870977.5
## Oct 2021       442165.8  -26783.27  911114.8
## Nov 2021       480355.1   11406.07  949304.1
## Dec 2021       728870.0  259921.00 1197819.1
## Jan 2022       585523.4  116574.37 1054472.4
## Feb 2022       568735.8   99786.76 1037684.8
## Mar 2022       638212.5  169263.43 1107161.5
## Apr 2022       377136.1 -286057.93 1040330.2
## May 2022       384165.3 -279028.79 1047359.4
## Jun 2022       888267.1  225073.04 1551461.2
## Jul 2022      1056750.7  393556.62 1719944.8

autoplot(payroll_train_ts) + autolayer(pfc_snaive) + ggtitle("SNAIVE Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")

# Follows same pattern as 2021 data at some point, so July 22 hits the same peak.  Likely understates for the general YOY upward trend
checkresiduals(pfc_snaive)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 8.3383, df = 8, p-value = 0.4011
## 
## Model df: 0.   Total lags used: 8

#No aCF points through 12 lags overtake critical values, but fails Ljung-Box test, so residuals are not white noise
#In addition, appears to be positive bias in residuals and non normal distribution among residuals
accuracy(pfc_snaive, payroll_ts)

##                    ME     RMSE      MAE      MPE     MAPE     MASE       ACF1
## Training set 192980.8 239264.1 199430.5 44.72651 45.62029 1.000000  0.2399762
## Test set     422320.0 499055.3 445972.9 40.07652 43.05678 2.236233 -0.1960970
##              Theil's U
## Training set        NA
## Test set     0.9695998

pfc_drift <- rwf(payroll_train_ts, drift = TRUE, h = 16, level = c(95)) 
pfc_drift

##          Point Forecast      Lo 95   Hi 95
## Apr 2021       653112.2  261662.87 1044562
## May 2021       668012.0  107182.18 1228842
## Jun 2021       682911.7  -12712.02 1378535
## Jul 2021       697811.5 -115404.11 1511027
## Aug 2021       712711.2 -207512.48 1632935
## Sep 2021       727611.0 -292373.64 1747596
## Oct 2021       742510.7 -371935.52 1856957
## Nov 2021       757410.5 -447445.17 1962266
## Dec 2021       772310.2 -519753.51 2064374
## Jan 2022       787210.0 -589469.05 2163889
## Feb 2022       802109.7 -657042.92 2261262
## Mar 2022       817009.5 -722819.29 2356838
## Apr 2022       831909.2 -787066.90 2450885
## May 2022       846809.0 -849999.68 2543618
## Jun 2022       861708.7 -911790.76 2635208
## Jul 2022       876608.5 -972582.18 2725799

autoplot(payroll_train_ts) + autolayer(pfc_drift) + ggtitle("Random Walk w/ Drift Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")

#Seems to correctly match upwards trend, but does not account for seasonal fluctuation
checkresiduals(pfc_drift)

## 
##  Ljung-Box test
## 
## data:  Residuals from Random walk with drift
## Q* = 11.23, df = 7, p-value = 0.1289
## 
## Model df: 1.   Total lags used: 8

#Residuals are better than SNAIVE, but again fails Ljun-Box test - residuals are not white noise
accuracy(pfc_drift, payroll_ts)

##                        ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 4.214640e-12 197077.3 119655.5 -19.65517 43.37800 0.599986
## Test set     3.111191e+05 507709.8 311119.1  22.71896 22.71896 1.560038
##                     ACF1 Theil's U
## Training set 0.002911721        NA
## Test set     0.143966704  1.049076

#The seasonal Naive and Random Walk with Drift models both had very similar model results, but neither are particularly accurate at around $500k in rmse.  I would prefer the rwd model over thelong term given it's match of trend.

ETS Models

#Need a model that can handle additive trend, as well as seasonality
pmod_ets1 <- ets(payroll_train_ts) 
pmod_ets1

## ETS(M,A,M) 
## 
## Call:
##  ets(y = payroll_train_ts) 
## 
##   Smoothing parameters:
##     alpha = 0.3683 
##     beta  = 0.0459 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 75506.8939 
##     b = 14630.9374 
##     s = 0.8216 0.5496 0.5565 0.5677 2.3854 2.2487
##            1.1703 0.6853 0.8046 0.774 0.7309 0.7055
## 
##   sigma:  0.2844
## 
##      AIC     AICc      BIC 
## 1035.046 1064.189 1063.327

#ETS(MAM) model which indicates that residuals/errors and Seasonality are considered multiplicative/exponential, whereas the trend is considered additive/linear
#AICc = 1064.189, alpha = 0.3683 so more balanced weighting across past observations
pfc_ets1 <- forecast(pmod_ets1, h=16, level = c(95))
pfc_ets1

##          Point Forecast       Lo 95   Hi 95
## Apr 2021       713553.7   315828.46 1111279
## May 2021       626971.9   248296.65 1005647
## Jun 2021      1103665.4   381273.98 1826057
## Jul 2021      2183889.8   636498.73 3731281
## Aug 2021      2383787.0   558835.79 4208738
## Sep 2021       583315.7   101934.98 1064696
## Oct 2021       587453.9    66195.99 1108712
## Nov 2021       595665.6    28853.16 1162478
## Dec 2021       913589.1   -16282.93 1843461
## Jan 2022       804259.6   -69154.37 1677674
## Feb 2022       853773.6  -133159.32 1840706
## Mar 2022       926019.6  -210864.41 2062904
## Apr 2022       985232.2  -296761.32 2267226
## May 2022       858344.5  -323089.05 2039778
## Jun 2022      1498802.1  -679461.91 3677066
## Jul 2022      2943120.0 -1565766.68 7452007

autoplot(payroll_train_ts) + autolayer(pfc_ets1) + ggtitle("ETS(MAM) Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")

checkresiduals(pmod_ets1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,M)
## Q* = 19.948, df = 3, p-value = 0.000174
## 
## Model df: 16.   Total lags used: 19

mean(pmod_ets1$residuals)

## [1] 0.0237839

#Residuals are not normally distributed and the variance is not considtent, but the residual are white noise around a mean of 0
accuracy(pfc_ets1$fitted, payroll_train_ts)

##                 ME   RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -12477.53 157834 79560.85 -2.933129 19.47019 0.2138347 0.5216626

pmod_ets2 <- ets(payroll_train_ts, model = "MAA")
pmod_ets2

## ETS(M,A,A) 
## 
## Call:
##  ets(y = payroll_train_ts, model = "MAA") 
## 
##   Smoothing parameters:
##     alpha = 0.096 
##     beta  = 1e-04 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 137359.8783 
##     b = 11321.1327 
##     s = -46514.76 -133500 -119939.7 -148478.5 327614.1 348183.6
##            177553.8 -97001.99 -70785.6 -72668.79 -74256.88 -90205.3
## 
##   sigma:  0.3092
## 
##      AIC     AICc      BIC 
## 1040.961 1070.104 1069.242

#As a comparison I wanted to run an ETS(MAA) model because the decomposition seemed to show the seasonality being more stable across time then te pure data view
#AICc =1070, less than ETS(MAM) model
pfc_ets2 <- forecast(pmod_ets2, h=16, level = c(95))
pfc_ets2

##          Point Forecast    Lo 95     Hi 95
## Apr 2021       581389.5 229055.1  933724.0
## May 2021       566561.7 221388.7  911734.7
## Jun 2021       852481.7 333491.1 1371472.4
## Jul 2021      1034520.9 403476.5 1665565.2
## Aug 2021      1025337.9 396638.5 1654037.2
## Sep 2021       560631.8 202132.4  919131.2
## Oct 2021       600551.8 217468.3  983635.3
## Nov 2021       598382.0 214741.1  982022.8
## Dec 2021       696770.2 254778.4 1138762.0
## Jan 2022       664457.5 238948.9 1089966.1
## Feb 2022       691786.8 248653.6 1134920.0
## Mar 2022       704768.8 252148.6 1157389.0
## Apr 2022       718024.8 255740.5 1180309.1
## May 2022       703196.9 247175.9 1159217.9
## Jun 2022       989117.0 366558.1 1611675.8
## Jul 2022      1171156.1 439221.0 1903091.2

autoplot(payroll_train_ts) + autolayer(pfc_ets2) + ggtitle("ETS(MAA) Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")

checkresiduals(pmod_ets2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,A)
## Q* = 20.164, df = 3, p-value = 0.0001569
## 
## Model df: 16.   Total lags used: 19

mean(pmod_ets2$residuals)

## [1] 0.03782912

#Residuals are white noise, slightly more positive bias than ETSMod1
accuracy(pfc_ets2$fitted, payroll_train_ts)

##               ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set 16700.4 97440.46 67305.26 -3.264913 21.12148 0.2786451 0.9104373

autoplot(payroll_train_ts, series = "Data") + autolayer(fitted.values(pfc_ets2), series = "ETS(MAA)") + autolayer(fitted.values(pfc_ets1), series = "ETS(MAM)")

#The ETS(MAM) model edges the ETS(MAA) model in terms of efficiency between complexity and correctness, with a lower AICc score. But the latter model much improves the RMSE from $157k to $97k, so it seems more tightly fit to the test data set.  Visually this can be seen with the July 2021 forecast in the ETS(MAM) increasing to high above the actual peak.  I believe that the ETS(MAA) model is a more accurate model.

ARIMA Models

ggtsdisplay(payroll_train_ts)

adf.test(payroll_train_ts)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  payroll_train_ts
## Dickey-Fuller = -2.926, Lag order = 3, p-value = 0.2108
## alternative hypothesis: stationary

#Training data is not stationary and will need to be transformed to be modeled
ndiffs(payroll_train_ts)

## [1] 1

#Anticipating a d, D of 1
ggtsdisplay(diff(payroll_train_ts))

#Still not stationary, need a boxcox transformation
train_lambda <- BoxCox.lambda(payroll_ts_bc)

## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.

train_lambda

## [1] 1.670373

#Requires an almost full log transformation at lambda = 0.014
train_bc <- box_cox(diff(payroll_train_ts), lambda = train_lambda)
ggtsdisplay(train_bc)

adf.test(train_bc)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_bc
## Dickey-Fuller = -3.9808, Lag order = 3, p-value = 0.02106
## alternative hypothesis: stationary

#Appears stationary but with a negative bias
#Anticipating ARIMA(0,1,2)(0,1,2)(12)

pmod_arima <- auto.arima(payroll_train_ts, lambda = train_lambda)
pmod_arima

## Series: payroll_train_ts 
## ARIMA(0,0,0)(0,1,0)[12] with drift 
## Box Cox transformation: lambda= 1.670373 
## 
## Coefficients:
##          drift
##       90504964
## s.e.  18631203
## 
## sigma^2 estimated as 1.425e+18:  log likelihood=-602.11
## AIC=1208.23   AICc=1208.73   BIC=1210.82

#Small AICc, but cannot compare to ETS models
#ARIMA(0,0,1)(0,1,0)[12] which indicates that the trend was terms were not differenced with a MA, while their is seasonality reflected by differencing but not MA or AR terms.
pfc_arima <- forecast(pmod_arima, h=16)
pfc_arima

##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Apr 2021       550106.0  288948.6  744802.5 -29757.20  834942.8
## May 2021       555579.4  297323.2  749275.5  32833.81  839087.6
## Jun 2021       995683.5  841789.1 1135022.3 752189.45 1204298.0
## Jul 2021      1153261.4 1015570.6 1280701.6 937381.12 1344818.6
## Aug 2021       843272.5  667775.4  997009.1 561087.56 1072224.3
## Sep 2021       569626.1  318367.9  760808.0  94048.95  849786.5
## Oct 2021       601846.9  364635.0  787533.3 179740.08  874650.0
## Nov 2021       633250.0  407659.2  813917.3 244114.04  899285.3
## Dec 2021       849812.9  675413.8 1002858.7 569660.53 1077796.8
## Jan 2022       722727.6  522874.5  890665.0 392874.94  971393.8
## Feb 2022       708192.0  504726.5  878054.6 370729.34  959503.6
## Mar 2022       768867.7  579390.2  931017.3 460003.13 1009541.3
## Apr 2022       692164.6  378680.3  928992.8  83249.26 1039008.1
## May 2022       696861.7  385690.5  932852.4 101422.81 1042589.5
## Jun 2022      1095827.9  889137.5 1279069.9 765549.11 1369223.5
## Jul 2022      1244633.4 1057525.5 1414490.6 948928.95 1499054.9

#The forecasted values seem to be a bit high, exceeding all but two prior months right away in the first predictor month.
autoplot(payroll_train_ts) + autolayer(pfc_arima) + ggtitle("ARIMA Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")

#Shows very substantial growth by Summer 2022
checkresiduals

## function (object, lag, df = NULL, test, plot = TRUE, ...) 
## {
##     showtest <- TRUE
##     if (missing(test)) {
##         if (is.element("lm", class(object))) {
##             test <- "BG"
##         }
##         else {
##             test <- "LB"
##         }
##         showtest <- TRUE
##     }
##     else if (test != FALSE) {
##         test <- match.arg(test, c("LB", "BG"))
##         showtest <- TRUE
##     }
##     else {
##         showtest <- FALSE
##     }
##     if (is.element("ts", class(object)) | is.element("numeric", 
##         class(object))) {
##         residuals <- object
##         object <- list(method = "Missing")
##     }
##     else {
##         residuals <- residuals(object)
##     }
##     if (length(residuals) == 0L) {
##         stop("No residuals found")
##     }
##     if ("ar" %in% class(object)) {
##         method <- paste("AR(", object$order, ")", sep = "")
##     }
##     else if (!is.null(object$method)) {
##         method <- object$method
##     }
##     else if ("HoltWinters" %in% class(object)) {
##         method <- "HoltWinters"
##     }
##     else if ("StructTS" %in% class(object)) {
##         method <- "StructTS"
##     }
##     else {
##         method <- try(as.character(object), silent = TRUE)
##         if ("try-error" %in% class(method)) {
##             method <- "Missing"
##         }
##         else if (length(method) > 1 | base::nchar(method[1]) > 
##             50) {
##             method <- "Missing"
##         }
##     }
##     if (method == "Missing") {
##         main <- "Residuals"
##     }
##     else {
##         main <- paste("Residuals from", method)
##     }
##     if (plot) {
##         suppressWarnings(ggtsdisplay(residuals, plot.type = "histogram", 
##             main = main, ...))
##     }
##     if (is.element("forecast", class(object))) {
##         object <- object$model
##     }
##     if (is.null(object) | !showtest) {
##         return(invisible())
##     }
##     freq <- frequency(residuals)
##     if (grepl("STL \\+ ", method)) {
##         warning("The fitted degrees of freedom is based on the model used for the seasonally adjusted data.")
##     }
##     df <- modeldf(object)
##     if (missing(lag)) {
##         lag <- ifelse(freq > 1, 2 * freq, 10)
##         lag <- min(lag, round(length(residuals)/5))
##         lag <- max(df + 3, lag)
##     }
##     if (!is.null(df)) {
##         if (test == "BG") {
##             BGtest <- lmtest::bgtest(object, order = lag)
##             BGtest$data.name <- main
##             return(BGtest)
##         }
##         else {
##             LBtest <- Box.test(zoo::na.approx(residuals), fitdf = df, 
##                 lag = lag, type = "Ljung")
##             LBtest$method <- "Ljung-Box test"
##             LBtest$data.name <- main
##             names(LBtest$statistic) <- "Q*"
##             print(LBtest)
##             cat(paste("Model df: ", df, ".   Total lags used: ", 
##                 lag, "\n\n", sep = ""))
##             return(invisible(LBtest))
##         }
##     }
## }
## <bytecode: 0x000000001ff1f4e0>
## <environment: namespace:forecast>

#Residuals are not whitenoise
accuracy(pfc_arima$fitted, payroll_train_ts)

##                 ME     RMSE      MAE       MPE    MAPE      ACF1 Theil's U
## Test set -21074.44 137695.2 87641.28 -23.48771 31.8799 0.4000376  1.146927

#RMSE is similar to ETS(MAM) model in the test set

#Ultimately, the residuals are not white noise for the ARIMA model which suggests not all information is being factored in forecast.  In addition the standard error range (RMSE) is beat by the ETS(MAA) model, and intuitively this growth seems too exponential.  It might be a good aggressive scenario to present but I don't think the most likely.

Dynamic Regression

The data set for the nonprofit organization includes not only monthly payroll values but also the number of FT staff members employed in a given period. I have confirmed that there is strong correlation between the # of FT staff members and monthly payroll. Dynamic regression will allow us to leverage this relationship to make a linear forecast, and then layer in the time factors on trend.

pmod_dr <- auto.arima(payroll_train[,"Payroll"], xreg = payroll_train[,"FT_Staff"], lambda = train_lambda)
pmod_dr

## Series: payroll_train[, "Payroll"] 
## Regression with ARIMA(0,0,0) errors 
## Box Cox transformation: lambda= 1.670373 
## 
## Coefficients:
##           xreg
##       44067483
## s.e.   5573803
## 
## sigma^2 estimated as 1.763e+18:  log likelihood=-874.1
## AIC=1752.2   AICc=1752.53   BIC=1755.52

#x-reg not significant, asserts that 10,218k per FT employee for each additional staff
pfc_dr <- forecast(pmod_dr, xreg = rep(105, 16))
#Assumes that 105 employees is mean for forecast period
pfc_dr

##    Point Forecast    Lo 80   Hi 80    Lo 95   Hi 95
## 40       831205.1 631681.3 1002617 506755.7 1085763
## 41       831205.1 631681.3 1002617 506755.7 1085763
## 42       831205.1 631681.3 1002617 506755.7 1085763
## 43       831205.1 631681.3 1002617 506755.7 1085763
## 44       831205.1 631681.3 1002617 506755.7 1085763
## 45       831205.1 631681.3 1002617 506755.7 1085763
## 46       831205.1 631681.3 1002617 506755.7 1085763
## 47       831205.1 631681.3 1002617 506755.7 1085763
## 48       831205.1 631681.3 1002617 506755.7 1085763
## 49       831205.1 631681.3 1002617 506755.7 1085763
## 50       831205.1 631681.3 1002617 506755.7 1085763
## 51       831205.1 631681.3 1002617 506755.7 1085763
## 52       831205.1 631681.3 1002617 506755.7 1085763
## 53       831205.1 631681.3 1002617 506755.7 1085763
## 54       831205.1 631681.3 1002617 506755.7 1085763
## 55       831205.1 631681.3 1002617 506755.7 1085763

#Levels off at 2.1m which is way too high for future months
pfc_dr_ts <- ts(pfc_dr$mean, start = c(2021,3), frequency = 12)
checkresiduals(pfc_dr)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,0,0) errors
## Q* = 17.679, df = 7, p-value = 0.0135
## 
## Model df: 1.   Total lags used: 8

#Residuals fail Ljung-Box test, not white noise
autoplot(payroll_train_ts) + autolayer(pfc_dr_ts)

accuracy(pfc_dr_ts,payroll_train_ts)

##                 ME     RMSE      MAE       MPE     MAPE
## Test set -192992.7 192992.7 192992.7 -30.23956 30.23956

#The Mean Average Error is very high at 1.2m, so this model is not effective at all.  Does not capture seasonal effect and seems to impacted by forecast period.

TBATS

TBATS is a fairly automated modeling technique in R that uses trigonometric terms for seasonality to increase with period, can use box.cox to transform data, has ARMA terms, and accounts for trending. Fourier terms and box cox transformations are automatic.

pmod_tbats <- tbats(payroll_train_ts)
pmod_tbats

## TBATS(0.698, {0,0}, 1, {<12,3>})
## 
## Call: tbats(y = payroll_train_ts)
## 
## Parameters
##   Lambda: 0.698214
##   Alpha: -0.02647514
##   Beta: 0.05670958
##   Damping Parameter: 1
##   Gamma-1 Values: -0.009914156
##   Gamma-2 Values: 0.01509178
## 
## Seed States:
##             [,1]
## [1,]  5437.35732
## [2,]   -60.61336
## [3,] -3039.16166
## [4,]  3011.17098
## [5,] -1621.31320
## [6,]   303.33302
## [7,]   559.95641
## [8,] -1303.52072
## attr(,"lambda")
## [1] 0.6982144
## 
## Sigma: 1699.509
## AIC: 1043.819

#TBATS(0.698, {0,0}, 1, {<12,3>})
#Box cox transformation was a little less than a square root
#No ARMA terms (p,q)
#Damping employed
#Frequency is monthly (12) with 3 fourier terms selected
pfc_tbats <- forecast(pmod_tbats, h = 16, level = c(95))
pfc_tbats

##          Point Forecast     Lo 95     Hi 95
## Apr 2021       691796.9  507474.8  892372.2
## May 2021       672488.8  489761.7  871682.3
## Jun 2021       944641.1  739963.3 1163715.9
## Jul 2021      1302306.0 1074895.6 1542410.0
## Aug 2021      1174502.9  954376.9 1407877.7
## Sep 2021       717722.5  530268.5  921345.7
## Oct 2021       539137.5  368364.4  728185.6
## Nov 2021       687562.9  501048.5  890854.1
## Dec 2021       797993.4  601121.9 1010814.7
## Jan 2022       731450.4  539488.4  940058.7
## Feb 2022       726622.3  533600.8  936607.7
## Mar 2022       800442.7  598311.9 1019373.4
## Apr 2022       760880.3  559371.5  980039.6
## May 2022       741004.1  539128.5  961124.9
## Jun 2022      1020321.9  793500.7 1263547.2
## Jul 2022      1385493.2 1132141.3 1653698.1

autoplot(pfc_tbats)

autoplot(payroll_train_ts) + autolayer(pfc_tbats) + ggtitle("TBATS Forecast of 2022 Payroll") +  xlab("Time (Months)") + ylab("Monthly Payroll ($)")

#Model looks very good, increasing seasonality but not overly pronounced
accuracy(pfc_tbats$fitted, payroll_ts)

##                ME     RMSE      MAE      MPE   MAPE        ACF1 Theil's U
## Test set -544.774 86484.57 62001.89 1.992759 19.819 0.001596921 0.5635915

#Smallest RMSE and MAE of all forecast models

Conclusion

-The simple forecasting methods were not game for accounting both the trend and seasonal components in the data -ARIMA models proved not a good fit for this data and no terms were even included with TBATS, residuals retained autocorrelation and error measures were weaker than exponential smoothing models -Dynamic regression probably required more stationary data in order to make useful predictions, as it stands the forecasts were very unrealistic during off months. -TBATS produced the best forecast plot and error metrics, but I ultimately prefer the ETS models -The ETS(MAA) has a slighlty higher AICc number but improves on the ETS(MAM) in terms of errors and visual plotting. Intuitively the ETS(MAA) seems to better match my observation of the seasonal effect in the data set -Final note on the data - ideally for monthly data we would have had more than 4 years off which to build our forecasts, especially given the seasonal nature. Also the validation set should be less than 20% of the training set which is not the case in my analysis.

#Final plot
autoplot(payroll_ts) + autolayer(pfc_ets2) + xlab("Time (months)") + ylab("Monthly Payroll ($)") + ggtitle("Forecast of Org Payroll thru July 2022 - ETS(MAA)")

#Final forecasts
pfc_ets2

##          Point Forecast    Lo 95     Hi 95
## Apr 2021       581389.5 229055.1  933724.0
## May 2021       566561.7 221388.7  911734.7
## Jun 2021       852481.7 333491.1 1371472.4
## Jul 2021      1034520.9 403476.5 1665565.2
## Aug 2021      1025337.9 396638.5 1654037.2
## Sep 2021       560631.8 202132.4  919131.2
## Oct 2021       600551.8 217468.3  983635.3
## Nov 2021       598382.0 214741.1  982022.8
## Dec 2021       696770.2 254778.4 1138762.0
## Jan 2022       664457.5 238948.9 1089966.1
## Feb 2022       691786.8 248653.6 1134920.0
## Mar 2022       704768.8 252148.6 1157389.0
## Apr 2022       718024.8 255740.5 1180309.1
## May 2022       703196.9 247175.9 1159217.9
## Jun 2022       989117.0 366558.1 1611675.8
## Jul 2022      1171156.1 439221.0 1903091.2

Predictive Analytics - Hwk 3

Thomas Laffey

12/12/2021