The time-series data represents monthly payroll actuals for a national nonprofit organization from the start of 2018 to the present. The organization is experiencing a period of rapid growth in both impact and staffing. Afterschool and Summer programs are its main work, with seasonal contract labor supplementing full-time (FT) staff during key periods of the Summer and, to a lesser extent, in the Spring and Fall. The data set also includes FT staffing levels, which can be leveraged for more dynamic analysis. Forecasting future payroll obligations matters for budgeting and for maintaining the organization's cash flow.
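The analysis below assumes a handful of packages are attached up front; a minimal setup sketch (package names inferred from the functions used later in this document):
library(forecast)  #snaive, rwf, ets, auto.arima, tbats, BoxCox, ndiffs, accuracy, ggtsdisplay, ggseasonplot
library(ggplot2)   #autoplot/autolayer generics plus ggtitle, xlab, ylab
library(tseries)   #adf.test
library(magrittr)  #the %>% pipe used in a few places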
org_payroll <- read.csv('payroll.csv', fileEncoding = "UTF-8-BOM")
str(org_payroll)
## 'data.frame': 47 obs. of 3 variables:
## $ Month   : chr "2018-1-1" "2018-2-1" "2018-3-1" "2018-4-1" ...
## $ Payroll : num 72022 94061 136453 94494 96118 ...
## $ FT_Staff: int 12 15 15 15 15 15 15 15 15 16 ...
org_payroll$Month <- as.Date(org_payroll$Month)
str(org_payroll)
## 'data.frame': 47 obs. of 3 variables:
## $ Month   : Date, format: "2018-01-01" "2018-02-01" ...
## $ Payroll : num 72022 94061 136453 94494 96118 ...
## $ FT_Staff: int 12 15 15 15 15 15 15 15 15 16 ...
dim(org_payroll)
## [1] 47 3
##Analysis - Data includes 3 variables: Date (by month), Payroll for the month, and number of FT staff
##At 47 monthly observations, this is a fairly short history for seasonal modeling
summary(org_payroll)
## Month Payroll FT_Staff
## Min. :2018-01-01 Min. : 72022 Min. : 12.00
## 1st Qu.:2018-12-16 1st Qu.: 150768 1st Qu.: 20.00
## Median :2019-12-01 Median : 402029 Median : 41.00
## Mean :2019-12-01 Mean : 470467 Mean : 43.57
## 3rd Qu.:2020-11-16 3rd Qu.: 699195 3rd Qu.: 56.50
## Max. :2021-11-01 Max. :1964896 Max. :105.00
#Organization had 12 FT staff members at the open of 2018 and will end 2021 with 105+
#The payroll totals peaked at nearly $2m in Summer 2021, a big leap from January 2018's $72k
#The mean payroll for the org is $470k with a median monthly payroll of $402k
#Evaluating the relationship between FT staffing and Monthly payroll shows clear correlation
#It stands to reason that as the number of FT employees increases, so will the payroll burden
#FT_Staff is likely an explanatory variable for Payroll
cor.test(org_payroll$Payroll,org_payroll$FT_Staff)
##
## Pearson's product-moment correlation
##
## data: org_payroll$Payroll and org_payroll$FT_Staff
## t = 7.9145, df = 45, p-value = 4.592e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6091157 0.8613276
## sample estimates:
## cor
## 0.762847
#The p-value indicates that a correlation does exist between FT staff level and monthly payroll
#Correlation of 0.762847
plot(Payroll~FT_Staff, data = org_payroll)
#With a few exceptional outliers, to be further discussed, there appears to be a decently strong linear relationship
#Linear Regression Model
payroll_lm <- lm(Payroll~FT_Staff, data = org_payroll)
abline(payroll_lm)
summary(payroll_lm)
##
## Call:
## lm(formula = Payroll ~ FT_Staff, data = org_payroll)
##
## Residuals:
## Min 1Q Median 3Q Max
## -254453 -120377 -73775 -25005 1020705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5817 68971 0.084 0.933
## FT_Staff 10663 1347 7.914 4.59e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 248200 on 45 degrees of freedom
## Multiple R-squared: 0.5819, Adjusted R-squared: 0.5726
## F-statistic: 62.64 on 1 and 45 DF, p-value: 4.592e-10
#This model has an adjusted R-squared of 0.5726, which means it explains more than half of the variance in monthly payroll
#FT Staff as an explanatory variable is shown to be significant
#The intercept in this case is relatively meaningless as there would be no business without employees
#The FT_Staff coefficient means that for each FT staff member added, monthly payroll is predicted to climb by $10,663
#This coefficient likely overstates the impact of each FT employee, as there is unaccounted-for non-FT contract labor during certain seasons
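As an illustration of how the fitted line can be used, a quick sketch of a prediction at a hypothetical staffing level of 80 FT employees (the 80 is illustrative only, chosen to sit inside the observed 12-105 range):
predict(payroll_lm, newdata = data.frame(FT_Staff = 80), interval = "prediction", level = 0.95)
#Point estimate is roughly 5817 + 10663*80, about $859k/month, with a wide interval given the large residual SE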
#Plotting of both payroll and FT Staff as a time series
org_payroll[-1] %>% ts(start = 2018, frequency = 12) %>% autoplot(facets = TRUE)
payroll_ts <- ts(org_payroll$Payroll, start = 2018, end = c(2021,11), frequency = 12)
autoplot(payroll_ts) + xlab("Time (Months)") + ylab("Monthly Payroll ($)") +ggtitle("Org. Monthly Payroll (2018-2021)")
#Analysis of the TS plot shows a clear increasing trend over the four-year period with distinct seasonal peaking that also increases in magnitude
ggseasonplot(payroll_ts) + ggtitle("Seasonality plot") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")
#The payroll level is clearly higher YOY, with a seasonal pattern of monthly payroll rising from May, peaking in July, and receding by September
#Seasonally, there also appears to be a slight uptick each December
#The seasonal trend reflects the fact that the organization operates Summer and Afterschool programming, with the former being its primary business driver
#During Summer months, additional contract staff are hired to handle the seasonal workload, and then more FT hiring happens in Nov-Dec
decompose(payroll_ts) %>% autoplot()
#Errors/residuals appear to be increasing with time
#Trend is nearly linear -> indicating a likely additive rather than exponential/multiplicative trend
#Seasonality is clearly visible in two distinct annual spikes, one large spike in the Summer months followed by a smaller end of year peak
#Seasonality looks relatively stable, could be either additive or multiplicative
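Since the seasonal component could plausibly be multiplicative, a quick comparison sketch using the same classical decomposition with type = "multiplicative":
decompose(payroll_ts, type = "multiplicative") %>% autoplot()
#If the seasonal panel looks more stable here than in the additive version, multiplicative seasonality (an "M" seasonal component in ETS) is the better description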
ggtsdisplay(payroll_ts)
#The ACF shows significant autocorrelation through roughly 13 lags before decaying toward zero
#The data does not appear to be stationary and will require differencing/transformation
#Making Data Stationary
ndiffs(payroll_ts)
## [1] 1
payroll_tsdiff <- diff(payroll_ts)
ggtsdisplay(payroll_tsdiff)
#PACF looks relatively clean after the initial spike at lag 1
payroll_ts_lambda <- BoxCox.lambda(payroll_ts)
payroll_ts_lambda
## [1] 0.4720262
#lambda is about 0.47, close to 0.5 which makes it almost a square root transformation
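To see what that transformation does, a small sketch overlaying the Box-Cox transformed series with a plain square root (the square root series is included purely for comparison):
autoplot(BoxCox(payroll_ts, lambda = payroll_ts_lambda)) +
  autolayer(sqrt(payroll_ts), series = "sqrt(Payroll)") +
  ylab("Transformed payroll") + ggtitle("Box-Cox (lambda ~ 0.47) vs. square root")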
payroll_ts_bc <- BoxCox(payroll_tsdiff, lambda = payroll_ts_lambda)
adf.test(payroll_ts_bc)
## Warning in adf.test(payroll_ts_bc): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: payroll_ts_bc
## Dickey-Fuller = -4.3036, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
#Data is stationary
autoplot(payroll_ts_bc)
ggtsdisplay(payroll_ts_bc)
#Negative ACF at lag 1 indicates likely MA terms in an ARIMA model
#Two or more random, consecutive spikes in the ACF need MA terms to smooth out
#Seasonal terms also appear to be MA as ACF drops rather than decays
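As a quick sanity check on that reading, a sketch fitting a hand-specified model with the MA orders anticipated later in this analysis (the orders are an assumption drawn from the ACF, not a tuned model):
pmod_manual <- Arima(payroll_ts, order = c(0, 1, 2), seasonal = c(0, 1, 2),
                     lambda = payroll_ts_lambda)
summary(pmod_manual)
checkresiduals(pmod_manual)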
-The linear model fails to capture the seasonal aspect of this data and also would not adjust the level for extra factors like cost-of-living adjustments to overall pay
-I will try 6 modeling methods for comparison: SNAIVE, Naive with Drift, ETS(MAM) & ETS(MAA), ARIMA, Dynamic Regression, and TBATS
-Simple methods: SNAIVE will capture the seasonal movement, but may understate the forecast by not capturing the increasing trend; Drift will capture the rolling change of the data
-ETS models will smooth for both the seasonal and trend components and can test additive and multiplicative approaches
-ARIMA will use ARIMA(p,d,q)(P,D,Q); based on earlier findings I expect d/D = 1 with MA terms, covering both trend and seasonality. I will need to make the data stationary
-Dynamic regression will leverage the FT staff data
-TBATS will serve as a comparison tool
The validation set will be the final 8 months of the time series (April-November 2021); forecasts will then extend 8 months beyond the data (December 2021-July 2022), a 16-month horizon in total.
#TS of test and train set
payroll_train_ts <- subset(payroll_ts, end = 39)
Payroll_test_ts <- subset(payroll_ts, start = 40)
#DF of test and train set
payroll_train <- org_payroll[c(1:39), c(2:3)]
payroll_test <- org_payroll[c(40:47), c(2:3)]
pfc_snaive <- snaive(payroll_train_ts, h =16, level = c(95))
pfc_snaive
## Point Forecast Lo 95 Hi 95
## Apr 2021 377136.1 -91812.89 846085.2
## May 2021 384165.3 -84783.75 853114.3
## Jun 2021 888267.1 419318.08 1357216.1
## Jul 2021 1056750.7 587801.66 1525699.7
## Aug 2021 721615.1 252666.10 1190564.2
## Sep 2021 402028.5 -66920.52 870977.5
## Oct 2021 442165.8 -26783.27 911114.8
## Nov 2021 480355.1 11406.07 949304.1
## Dec 2021 728870.0 259921.00 1197819.1
## Jan 2022 585523.4 116574.37 1054472.4
## Feb 2022 568735.8 99786.76 1037684.8
## Mar 2022 638212.5 169263.43 1107161.5
## Apr 2022 377136.1 -286057.93 1040330.2
## May 2022 384165.3 -279028.79 1047359.4
## Jun 2022 888267.1 225073.04 1551461.2
## Jul 2022 1056750.7 393556.62 1719944.8
autoplot(payroll_train_ts) + autolayer(pfc_snaive) + ggtitle("SNAIVE Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")
# Repeats the last observed year, so the July 2021 and July 2022 forecasts both hit the July 2020 peak; this likely understates the general YOY upward trend
checkresiduals(pfc_snaive)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 8.3383, df = 8, p-value = 0.4011
##
## Model df: 0. Total lags used: 8
#No ACF spikes through 12 lags exceed the critical values, and the Ljung-Box test (p = 0.40) does not reject white noise residuals
#However, the residuals show a positive bias and a non-normal distribution
accuracy(pfc_snaive, payroll_ts)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 192980.8 239264.1 199430.5 44.72651 45.62029 1.000000 0.2399762
## Test set 422320.0 499055.3 445972.9 40.07652 43.05678 2.236233 -0.1960970
## Theil's U
## Training set NA
## Test set 0.9695998
pfc_drift <- rwf(payroll_train_ts, drift = TRUE, h = 16, level = c(95))
pfc_drift
## Point Forecast Lo 95 Hi 95
## Apr 2021 653112.2 261662.87 1044562
## May 2021 668012.0 107182.18 1228842
## Jun 2021 682911.7 -12712.02 1378535
## Jul 2021 697811.5 -115404.11 1511027
## Aug 2021 712711.2 -207512.48 1632935
## Sep 2021 727611.0 -292373.64 1747596
## Oct 2021 742510.7 -371935.52 1856957
## Nov 2021 757410.5 -447445.17 1962266
## Dec 2021 772310.2 -519753.51 2064374
## Jan 2022 787210.0 -589469.05 2163889
## Feb 2022 802109.7 -657042.92 2261262
## Mar 2022 817009.5 -722819.29 2356838
## Apr 2022 831909.2 -787066.90 2450885
## May 2022 846809.0 -849999.68 2543618
## Jun 2022 861708.7 -911790.76 2635208
## Jul 2022 876608.5 -972582.18 2725799
autoplot(payroll_train_ts) + autolayer(pfc_drift) + ggtitle("Random Walk w/ Drift Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")
#Seems to correctly match upwards trend, but does not account for seasonal fluctuation
checkresiduals(pfc_drift)
##
## Ljung-Box test
##
## data: Residuals from Random walk with drift
## Q* = 11.23, df = 7, p-value = 0.1289
##
## Model df: 1. Total lags used: 8
#The Ljung-Box test again does not reject white noise (p = 0.13), but the drift model ignores the seasonal pattern entirely
accuracy(pfc_drift, payroll_ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 4.214640e-12 197077.3 119655.5 -19.65517 43.37800 0.599986
## Test set 3.111191e+05 507709.8 311119.1 22.71896 22.71896 1.560038
## ACF1 Theil's U
## Training set 0.002911721 NA
## Test set 0.143966704 1.049076
#The seasonal naive and random walk with drift models both had very similar results, but neither is particularly accurate, at around $500k RMSE on the test set. I would prefer the drift model over the long term given its match of the trend.
#Need a model that can handle additive trend, as well as seasonality
pmod_ets1 <- ets(payroll_train_ts)
pmod_ets1
## ETS(M,A,M)
##
## Call:
## ets(y = payroll_train_ts)
##
## Smoothing parameters:
## alpha = 0.3683
## beta = 0.0459
## gamma = 1e-04
##
## Initial states:
## l = 75506.8939
## b = 14630.9374
## s = 0.8216 0.5496 0.5565 0.5677 2.3854 2.2487
## 1.1703 0.6853 0.8046 0.774 0.7309 0.7055
##
## sigma: 0.2844
##
## AIC AICc BIC
## 1035.046 1064.189 1063.327
#ETS(MAM) model which indicates that residuals/errors and Seasonality are considered multiplicative/exponential, whereas the trend is considered additive/linear
#AICc = 1064.189; alpha = 0.3683 is moderate, so weight is spread fairly evenly between recent and older observations
pfc_ets1 <- forecast(pmod_ets1, h=16, level = c(95))
pfc_ets1
## Point Forecast Lo 95 Hi 95
## Apr 2021 713553.7 315828.46 1111279
## May 2021 626971.9 248296.65 1005647
## Jun 2021 1103665.4 381273.98 1826057
## Jul 2021 2183889.8 636498.73 3731281
## Aug 2021 2383787.0 558835.79 4208738
## Sep 2021 583315.7 101934.98 1064696
## Oct 2021 587453.9 66195.99 1108712
## Nov 2021 595665.6 28853.16 1162478
## Dec 2021 913589.1 -16282.93 1843461
## Jan 2022 804259.6 -69154.37 1677674
## Feb 2022 853773.6 -133159.32 1840706
## Mar 2022 926019.6 -210864.41 2062904
## Apr 2022 985232.2 -296761.32 2267226
## May 2022 858344.5 -323089.05 2039778
## Jun 2022 1498802.1 -679461.91 3677066
## Jul 2022 2943120.0 -1565766.68 7452007
autoplot(payroll_train_ts) + autolayer(pfc_ets1) + ggtitle("ETS(MAM) Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")
checkresiduals(pmod_ets1)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,A,M)
## Q* = 19.948, df = 3, p-value = 0.000174
##
## Model df: 16. Total lags used: 19
mean(pmod_ets1$residuals)
## [1] 0.0237839
#Residuals are not normally distributed and the variance is not consistent; the Ljung-Box test (p < 0.001) also rejects white noise, though the residual mean is near zero
accuracy(pfc_ets1$fitted, payroll_train_ts)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -12477.53 157834 79560.85 -2.933129 19.47019 0.2138347 0.5216626
pmod_ets2 <- ets(payroll_train_ts, model = "MAA")
pmod_ets2
## ETS(M,A,A)
##
## Call:
## ets(y = payroll_train_ts, model = "MAA")
##
## Smoothing parameters:
## alpha = 0.096
## beta = 1e-04
## gamma = 1e-04
##
## Initial states:
## l = 137359.8783
## b = 11321.1327
## s = -46514.76 -133500 -119939.7 -148478.5 327614.1 348183.6
## 177553.8 -97001.99 -70785.6 -72668.79 -74256.88 -90205.3
##
## sigma: 0.3092
##
## AIC AICc BIC
## 1040.961 1070.104 1069.242
#As a comparison I wanted to run an ETS(MAA) model because the decomposition seemed to show the seasonality being more stable across time than the raw data view suggests
#AICc = 1070, slightly higher than the ETS(MAM) model's 1064
pfc_ets2 <- forecast(pmod_ets2, h=16, level = c(95))
pfc_ets2
## Point Forecast Lo 95 Hi 95
## Apr 2021 581389.5 229055.1 933724.0
## May 2021 566561.7 221388.7 911734.7
## Jun 2021 852481.7 333491.1 1371472.4
## Jul 2021 1034520.9 403476.5 1665565.2
## Aug 2021 1025337.9 396638.5 1654037.2
## Sep 2021 560631.8 202132.4 919131.2
## Oct 2021 600551.8 217468.3 983635.3
## Nov 2021 598382.0 214741.1 982022.8
## Dec 2021 696770.2 254778.4 1138762.0
## Jan 2022 664457.5 238948.9 1089966.1
## Feb 2022 691786.8 248653.6 1134920.0
## Mar 2022 704768.8 252148.6 1157389.0
## Apr 2022 718024.8 255740.5 1180309.1
## May 2022 703196.9 247175.9 1159217.9
## Jun 2022 989117.0 366558.1 1611675.8
## Jul 2022 1171156.1 439221.0 1903091.2
autoplot(payroll_train_ts) + autolayer(pfc_ets2) + ggtitle("ETS(MAA) Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")
checkresiduals(pmod_ets2)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,A,A)
## Q* = 20.164, df = 3, p-value = 0.0001569
##
## Model df: 16. Total lags used: 19
mean(pmod_ets2$residuals)
## [1] 0.03782912
#The Ljung-Box test again rejects white noise (p < 0.001); these residuals show a slightly more positive bias than the ETS(MAM) model's
accuracy(pfc_ets2$fitted, payroll_train_ts)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 16700.4 97440.46 67305.26 -3.264913 21.12148 0.2786451 0.9104373
autoplot(payroll_train_ts, series = "Data") + autolayer(fitted.values(pfc_ets2), series = "ETS(MAA)") + autolayer(fitted.values(pfc_ets1), series = "ETS(MAM)")
#The ETS(MAM) model edges out the ETS(MAA) model on the balance of complexity and fit, with a lower AICc. But the latter much improves the RMSE of the fitted values, from $157k to $97k, so it is more tightly fit to the training data. Visually this can be seen in the July 2021 point forecast, where the ETS(MAM) rises far above the actual peak. I believe the ETS(MAA) model is the more accurate of the two.
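For a comparison on the held-out months rather than the fitted values, a sketch scoring both forecast objects against the full series (accuracy() lines the test rows up with April-November 2021):
accuracy(pfc_ets1, payroll_ts)  #ETS(M,A,M): training- and test-set errors
accuracy(pfc_ets2, payroll_ts)  #ETS(M,A,A): training- and test-set errors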
ggtsdisplay(payroll_train_ts)
adf.test(payroll_train_ts)
##
## Augmented Dickey-Fuller Test
##
## data: payroll_train_ts
## Dickey-Fuller = -2.926, Lag order = 3, p-value = 0.2108
## alternative hypothesis: stationary
#Training data is not stationary and will need to be transformed to be modeled
ndiffs(payroll_train_ts)
## [1] 1
#Anticipating a d, D of 1
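A quick companion check on whether a seasonal difference is also suggested (with only about three full seasons in the training set, the result should be read cautiously):
nsdiffs(payroll_train_ts)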
ggtsdisplay(diff(payroll_train_ts))
#Still not stationary, need a boxcox transformation
train_lambda <- BoxCox.lambda(payroll_ts_bc)
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
train_lambda
## [1] 1.670373
#Note the lambda was computed on the already-transformed differenced series (which contains negative values, hence the warning); it comes out to about 1.67, not a log-like transformation
train_bc <- BoxCox(diff(payroll_train_ts), lambda = train_lambda)
ggtsdisplay(train_bc)
adf.test(train_bc)
##
## Augmented Dickey-Fuller Test
##
## data: train_bc
## Dickey-Fuller = -3.9808, Lag order = 3, p-value = 0.02106
## alternative hypothesis: stationary
#Appears stationary but with a negative bias
#Anticipating ARIMA(0,1,2)(0,1,2)[12]
pmod_arima <- auto.arima(payroll_train_ts, lambda = train_lambda)
pmod_arima
## Series: payroll_train_ts
## ARIMA(0,0,0)(0,1,0)[12] with drift
## Box Cox transformation: lambda= 1.670373
##
## Coefficients:
## drift
## 90504964
## s.e. 18631203
##
## sigma^2 estimated as 1.425e+18: log likelihood=-602.11
## AIC=1208.23 AICc=1208.73 BIC=1210.82
#Small AICc, but cannot compare to ETS models
#auto.arima selected ARIMA(0,0,0)(0,1,0)[12] with drift: no non-seasonal AR, differencing, or MA terms; seasonality is handled by a single seasonal difference and the trend by the drift term
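auto.arima's default stepwise search can miss candidate models; a sketch of a fuller search using its stepwise and approximation arguments (it may well land on the same simple specification):
pmod_arima_full <- auto.arima(payroll_train_ts, lambda = train_lambda,
                              stepwise = FALSE, approximation = FALSE)
pmod_arima_full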
pfc_arima <- forecast(pmod_arima, h=16)
pfc_arima
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Apr 2021 550106.0 288948.6 744802.5 -29757.20 834942.8
## May 2021 555579.4 297323.2 749275.5 32833.81 839087.6
## Jun 2021 995683.5 841789.1 1135022.3 752189.45 1204298.0
## Jul 2021 1153261.4 1015570.6 1280701.6 937381.12 1344818.6
## Aug 2021 843272.5 667775.4 997009.1 561087.56 1072224.3
## Sep 2021 569626.1 318367.9 760808.0 94048.95 849786.5
## Oct 2021 601846.9 364635.0 787533.3 179740.08 874650.0
## Nov 2021 633250.0 407659.2 813917.3 244114.04 899285.3
## Dec 2021 849812.9 675413.8 1002858.7 569660.53 1077796.8
## Jan 2022 722727.6 522874.5 890665.0 392874.94 971393.8
## Feb 2022 708192.0 504726.5 878054.6 370729.34 959503.6
## Mar 2022 768867.7 579390.2 931017.3 460003.13 1009541.3
## Apr 2022 692164.6 378680.3 928992.8 83249.26 1039008.1
## May 2022 696861.7 385690.5 932852.4 101422.81 1042589.5
## Jun 2022 1095827.9 889137.5 1279069.9 765549.11 1369223.5
## Jul 2022 1244633.4 1057525.5 1414490.6 948928.95 1499054.9
#The forecasted values seem a bit high right from the first forecast month
autoplot(payroll_train_ts) + autolayer(pfc_arima) + ggtitle("ARIMA Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")
#Shows very substantial growth by Summer 2022
checkresiduals(pfc_arima)
#Residuals are not white noise
accuracy(pfc_arima$fitted, payroll_train_ts)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -21074.44 137695.2 87641.28 -23.48771 31.8799 0.4000376 1.146927
#RMSE of the fitted values is similar to (slightly better than) the ETS(MAM) model's
#Ultimately, the residuals are not white noise for the ARIMA model, which suggests not all information is being factored into the forecast. In addition, its RMSE is beaten by the ETS(MAA) model, and intuitively this growth seems too exponential. It might be a good aggressive scenario to present, but I don't think it is the most likely.
The data set for the nonprofit organization includes not only monthly payroll values but also the number of FT staff employed in each period. I have confirmed that there is a strong correlation between the number of FT staff and monthly payroll. Dynamic regression will allow us to leverage this relationship in a regression forecast and then layer the time-series dynamics on top.
pmod_dr <- auto.arima(payroll_train[,"Payroll"], xreg = payroll_train[,"FT_Staff"], lambda = train_lambda)
pmod_dr
## Series: payroll_train[, "Payroll"]
## Regression with ARIMA(0,0,0) errors
## Box Cox transformation: lambda= 1.670373
##
## Coefficients:
## xreg
## 44067483
## s.e. 5573803
##
## sigma^2 estimated as 1.763e+18: log likelihood=-874.1
## AIC=1752.2 AICc=1752.53 BIC=1755.52
#The xreg coefficient is on the Box-Cox scale, so it does not translate directly into dollars per additional FT employee; its estimate is roughly 8 standard errors from zero, so the regressor does appear significant
pfc_dr <- forecast(pmod_dr, xreg = rep(105, 16))
#Assumes a constant 105 FT employees across the forecast period
pfc_dr
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 40 831205.1 631681.3 1002617 506755.7 1085763
## 41 831205.1 631681.3 1002617 506755.7 1085763
## 42 831205.1 631681.3 1002617 506755.7 1085763
## 43 831205.1 631681.3 1002617 506755.7 1085763
## 44 831205.1 631681.3 1002617 506755.7 1085763
## 45 831205.1 631681.3 1002617 506755.7 1085763
## 46 831205.1 631681.3 1002617 506755.7 1085763
## 47 831205.1 631681.3 1002617 506755.7 1085763
## 48 831205.1 631681.3 1002617 506755.7 1085763
## 49 831205.1 631681.3 1002617 506755.7 1085763
## 50 831205.1 631681.3 1002617 506755.7 1085763
## 51 831205.1 631681.3 1002617 506755.7 1085763
## 52 831205.1 631681.3 1002617 506755.7 1085763
## 53 831205.1 631681.3 1002617 506755.7 1085763
## 54 831205.1 631681.3 1002617 506755.7 1085763
## 55 831205.1 631681.3 1002617 506755.7 1085763
#The forecast levels off at a flat ~$831k per month, which ignores the seasonal pattern and sits too high for the off-season months
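The forecast is also only as good as the staffing assumption; a sketch feeding a hypothetical ramp (105 up to 120 FT staff, purely illustrative) into the same model:
staff_ramp <- round(seq(105, 120, length.out = 16))  #hypothetical staffing path for the horizon
pfc_dr_ramp <- forecast(pmod_dr, xreg = staff_ramp)
autoplot(pfc_dr_ramp)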
pfc_dr_ts <- ts(pfc_dr$mean, start = c(2021,3), frequency = 12)
checkresiduals(pfc_dr)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,0,0) errors
## Q* = 17.679, df = 7, p-value = 0.0135
##
## Model df: 1. Total lags used: 8
#Residuals fail Ljung-Box test, not white noise
autoplot(payroll_train_ts) + autolayer(pfc_dr_ts)
accuracy(pfc_dr_ts,payroll_train_ts)
## ME RMSE MAE MPE MAPE
## Test set -192992.7 192992.7 192992.7 -30.23956 30.23956
#The reported MAE of roughly $193k is based on a single month of overlap with the training series, so the comparison is not very informative; more importantly, the forecast is flat, so this model does not capture the seasonal effect and depends entirely on the assumed staffing level for the forecast period.
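Part of the issue is that the regression was fit on the raw data-frame column, which discards the monthly frequency, so auto.arima could not consider seasonal structure; a sketch of the same idea fit on the ts object instead (keeping the same lambda and 105-staff assumption):
xreg_hist   <- cbind(FT_Staff = payroll_train$FT_Staff)  #historical staffing as a named regressor
xreg_future <- cbind(FT_Staff = rep(105, 16))            #assumed staffing over the horizon
pmod_dr2 <- auto.arima(payroll_train_ts, xreg = xreg_hist, lambda = train_lambda)
pfc_dr2  <- forecast(pmod_dr2, xreg = xreg_future)
autoplot(payroll_train_ts) + autolayer(pfc_dr2) + ggtitle("Dynamic Regression (ts-based) Forecast")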
TBATS is a fairly automated modeling technique in R that uses trigonometric (Fourier) terms for seasonality, allowing the seasonal pattern to evolve over time; it can apply a Box-Cox transformation to the data, includes ARMA error terms, and accounts for trend with optional damping. The Fourier terms and Box-Cox transformation are selected automatically.
pmod_tbats <- tbats(payroll_train_ts)
pmod_tbats
## TBATS(0.698, {0,0}, 1, {<12,3>})
##
## Call: tbats(y = payroll_train_ts)
##
## Parameters
## Lambda: 0.698214
## Alpha: -0.02647514
## Beta: 0.05670958
## Damping Parameter: 1
## Gamma-1 Values: -0.009914156
## Gamma-2 Values: 0.01509178
##
## Seed States:
## [,1]
## [1,] 5437.35732
## [2,] -60.61336
## [3,] -3039.16166
## [4,] 3011.17098
## [5,] -1621.31320
## [6,] 303.33302
## [7,] 559.95641
## [8,] -1303.52072
## attr(,"lambda")
## [1] 0.6982144
##
## Sigma: 1699.509
## AIC: 1043.819
#TBATS(0.698, {0,0}, 1, {<12,3>})
#The Box-Cox lambda of about 0.70 implies a milder transformation than a square root
#No ARMA terms (p,q)
#Damping employed
#Frequency is monthly (12) with 3 fourier terms selected
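Since TBATS settled on 3 Fourier pairs, a related sketch is a harmonic regression: auto.arima with fourier() terms as external regressors and the seasonal part switched off (K = 3 mirrors the TBATS choice; reusing train_lambda is an assumption):
pmod_fourier <- auto.arima(payroll_train_ts, xreg = fourier(payroll_train_ts, K = 3),
                           seasonal = FALSE, lambda = train_lambda)
pfc_fourier  <- forecast(pmod_fourier, xreg = fourier(payroll_train_ts, K = 3, h = 16))
autoplot(pfc_fourier)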
pfc_tbats <- forecast(pmod_tbats, h = 16, level = c(95))
pfc_tbats
## Point Forecast Lo 95 Hi 95
## Apr 2021 691796.9 507474.8 892372.2
## May 2021 672488.8 489761.7 871682.3
## Jun 2021 944641.1 739963.3 1163715.9
## Jul 2021 1302306.0 1074895.6 1542410.0
## Aug 2021 1174502.9 954376.9 1407877.7
## Sep 2021 717722.5 530268.5 921345.7
## Oct 2021 539137.5 368364.4 728185.6
## Nov 2021 687562.9 501048.5 890854.1
## Dec 2021 797993.4 601121.9 1010814.7
## Jan 2022 731450.4 539488.4 940058.7
## Feb 2022 726622.3 533600.8 936607.7
## Mar 2022 800442.7 598311.9 1019373.4
## Apr 2022 760880.3 559371.5 980039.6
## May 2022 741004.1 539128.5 961124.9
## Jun 2022 1020321.9 793500.7 1263547.2
## Jul 2022 1385493.2 1132141.3 1653698.1
autoplot(pfc_tbats)
autoplot(payroll_train_ts) + autolayer(pfc_tbats) + ggtitle("TBATS Forecast of 2022 Payroll") + xlab("Time (Months)") + ylab("Monthly Payroll ($)")
#Model looks very good, increasing seasonality but not overly pronounced
accuracy(pfc_tbats$fitted, payroll_ts)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -544.774 86484.57 62001.89 1.992759 19.819 0.001596921 0.5635915
#Smallest RMSE and MAE of all forecast models
-The simple forecasting methods were not capable of accounting for both the trend and seasonal components in the data
-ARIMA models proved not a good fit for this data (TBATS did not select any ARMA terms either); residuals retained autocorrelation and error measures were weaker than the exponential smoothing models
-Dynamic regression probably required more stationary data to make useful predictions; as it stands, the forecasts were very unrealistic during off months
-TBATS produced the best forecast plot and error metrics, but I ultimately prefer the ETS models
-The ETS(MAA) has a slightly higher AICc but improves on the ETS(MAM) in terms of errors and visual plotting. Intuitively the ETS(MAA) seems to better match my observation of the seasonal effect in the data set
-A final note on the data: ideally for monthly data we would have more than four years from which to build forecasts, especially given the seasonal nature. Also, the validation set should be less than 20% of the training set, which is not the case in my analysis
#Final plot
autoplot(payroll_ts) + autolayer(pfc_ets2) + xlab("Time (months)") + ylab("Monthly Payroll ($)") + ggtitle("Forecast of Org Payroll thru July 2022 - ETS(MAA)")
#Final forecasts
pfc_ets2
## Point Forecast Lo 95 Hi 95
## Apr 2021 581389.5 229055.1 933724.0
## May 2021 566561.7 221388.7 911734.7
## Jun 2021 852481.7 333491.1 1371472.4
## Jul 2021 1034520.9 403476.5 1665565.2
## Aug 2021 1025337.9 396638.5 1654037.2
## Sep 2021 560631.8 202132.4 919131.2
## Oct 2021 600551.8 217468.3 983635.3
## Nov 2021 598382.0 214741.1 982022.8
## Dec 2021 696770.2 254778.4 1138762.0
## Jan 2022 664457.5 238948.9 1089966.1
## Feb 2022 691786.8 248653.6 1134920.0
## Mar 2022 704768.8 252148.6 1157389.0
## Apr 2022 718024.8 255740.5 1180309.1
## May 2022 703196.9 247175.9 1159217.9
## Jun 2022 989117.0 366558.1 1611675.8
## Jul 2022 1171156.1 439221.0 1903091.2