Part A: Forecasting ATM Withdrawals
Provided data set is of cash withdrawals of four ATMs from May 1, 2009 through April 30, 2010. In a real business setting, this data could be used to answer, “How much cash should be delivered to each ATM?” This is probably more valuable and actionable than answering, “How much cash do the combined four ATMs require?” Therefore, I will attempt to forecast each individual ATM.
atm <- read_excel('./data/ATM624Data.xlsx', col_names=TRUE,
col_types=c('date', 'text', 'numeric')) %>%
rename(date=DATE, atm=ATM, cash=Cash) %>%
mutate(date = as_date(date)) %>%
na.omit %>%
arrange(date, atm)
head(atm)## # A tibble: 6 x 3
## date atm cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-01 ATM3 0
## 4 2009-05-01 ATM4 777.
## 5 2009-05-02 ATM1 82
## 6 2009-05-02 ATM2 89
Use the dplyr functionality group_map to convert each ATM’s data into a time series, with weekly seasonality:
atm_ts <- atm %>%
group_by(atm) %>%
group_map(~ ts(.x$cash, start=c(2009, 05, 01), frequency=7))
str(atm_ts)## List of 4
## $ : Time-Series [1:362] from 2010 to 2061: 96 82 85 90 99 88 8 104 87 93 ...
## $ : Time-Series [1:363] from 2010 to 2061: 107 89 90 55 79 19 2 103 107 118 ...
## $ : Time-Series [1:365] from 2010 to 2062: 0 0 0 0 0 0 0 0 0 0 ...
## $ : Time-Series [1:365] from 2010 to 2062: 777 524.4 792.8 908.2 52.8 ...
Nice big plots of each series:
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
The first and second seem like unremarkable time series. The third ATM apparently only came online recently. The fourth series has an obvious outlier. Use the tsoultlier function to suggest a correction:
## $index
## [1] 285
##
## $replacements
## [1] 190.9351
Much better!
Examine seasonality:
Transformations
With the exception of ATM3, these series have non-constant variance, both within and between weeks. Use a Box-Cox transformation to stabilize the variance:
boxcox_transform <- function(x) {
lambda <- BoxCox.lambda(x)
print(lambda)
BoxCox(x, lambda)
}
atm_bc <- lapply(atm_ts, boxcox_transform)## [1] 0.1714881
## [1] 0.6846687
## [1] 1.332968
## [1] 0.449771
With the exception of ATM3, these all look fairly stationary. For a little more insight, I subjected each of them to the KPSS unit root test. It suggested the first series is not stationary at the 10 percent level, and second series is not stationary at the 1 percent level:
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.3903
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 2.1544
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The first series test-statistic of 0.3903 is just ‘over’ the 10 percent critical value, i.e., we reject the null hypothesis that the series is stationary at the 10 percent level. The second test-statistic of 2.1698 is well above even the highest critical value. Both will require differencing.
Forecast
There is also the question of seasonal differencing. Below I will examine the ACFs of each plot, and determine the strong autocorrelations at lags of 7, 14, 21, …, indicate seasonal differencing is required for each series (except ATM3).
ATM1 Model
Examine the ACF and PACF plots of the differenced series:
We know from above one that one order of differencing is required. In the ACF plot, lag 2 is the ‘last’ significant lag of the first few lags. This suggests a second order moving average may be appropriate. In the PACF plot, lag 6 is the last significant lag. This suggests that a sixth order autoregressive component may be appropriate. This plot tells us that autocorrelated information is not merely ‘carried over’ from the first lag.
To determine the seasonal order, examine ACF and PACF plots with lag=7. The ACF has one significant (negative) lag at 7, while the PACF has significant but decreasing lags at multiples of 7. This pattern suggests a moving average is appropriate.
Construct our model, and use auto.arima to compare to:
atm_m <- list()
atm_m[[1]] <- Arima(atm_ts[[1]], order=c(6,1,2), seasonal=c(0,1,1), lambda=lambdas[1])
summary(atm_m[[1]])## Series: atm_ts[[1]]
## ARIMA(6,1,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.1714881
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ma1 ma2
## 0.9985 -0.2146 0.1077 -0.0772 0.0232 0.0332 -1.9187 0.9187
## s.e. 0.0876 0.0753 0.0761 0.0752 0.0753 0.0572 0.0678 0.0683
## sma1
## -0.5708
## s.e. 0.0527
##
## sigma^2 estimated as 1.149: log likelihood=-527.99
## AIC=1075.99 AICc=1076.63 BIC=1114.68
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.087414 27.71046 18.17501 -90.83027 110.5185 0.965743
## ACF1
## Training set 0.02180556
m1 <- auto.arima(atm_ts[[1]], d=1, max.p=10, max.q=10, approximation=FALSE,
lambda=lambdas[1]) # trace=TRUEThe algorithm came up with an ARIMA(6,1,0)(1,1,1)[7] model, compared to my ARIMA(6,1,2)(0,1,1)[7] model. My model performs slightly better with AICc=1076.63 versus AICc=1098.94, and RMSE=1.05 versus RMSE=1.09. The residual diagnostics of my model look reasonable—no significant autocorrelation in the residuals, and an approximately normal distribution (although the outliers are troubling).
##
## Ljung-Box test
##
## data: Residuals from ARIMA(6,1,2)(0,1,1)[7]
## Q* = 4.9416, df = 5, p-value = 0.423
##
## Model df: 9. Total lags used: 14
ATM2 model
Briefly the repeating the process as above:
- PACF is strongly significant at lag 6, so try \(p=6\) is worth trying
- ACF peaks at lag 5, so try \(q=5\)
- \(d=1\), as explained above
Seasonal plots:
- ACF has a sharp cutoff at lag 7, while the PACF decays slowly at multiples of 7—indicates an MA term is a good idea
- Seasonal differencing required so \(d=1\)
Compare the my model with auto.arima:
atm_m[[2]] <- Arima(atm_ts[[2]], order=c(6,1,5), seasonal=c(0,1,1), lambda=lambdas[2])
summary(atm_m[[2]])## Series: atm_ts[[2]]
## ARIMA(6,1,5)(0,1,1)[7]
## Box Cox transformation: lambda= 0.6846687
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ma1 ma2
## -0.9025 0.563 0.7533 0.3222 -0.0637 -0.1708 -0.1068 -1.6224
## s.e. 0.3641 0.177 0.1858 0.2104 0.0961 0.1123 0.3577 0.3535
## ma3 ma4 ma5 sma1
## -0.2045 0.7157 0.2217 -0.4870
## s.e. 0.3548 0.0920 0.2165 0.0729
##
## sigma^2 estimated as 56.29: log likelihood=-1220.64
## AIC=2467.29 AICc=2468.35 BIC=2517.62
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.8562703 25.84584 18.02581 -Inf Inf 0.8388484 -0.01832677
m2 <- auto.arima(atm_ts[[2]], d=1, approximation=FALSE, max.p=7, max.q=7,
max.order=20, lambda=lambdas[2]) # trace=TRUEThe automated process selects a substantially different model, ARIMA(3,1,0)(0,1,1)[7]. My model has a substantially better \(AICc\) and \(RMSE\), but it has many more terms. Examine the residuals as the final test:
##
## Ljung-Box test
##
## data: Residuals from ARIMA(6,1,5)(0,1,1)[7]
## Q* = 3.9096, df = 3, p-value = 0.2714
##
## Model df: 12. Total lags used: 15
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,0)(0,1,1)[7]
## Q* = 36.956, df = 10, p-value = 5.76e-05
##
## Model df: 4. Total lags used: 14
Overall, I prefer the residuals of the manual model. The automated model is definitely not white noise, from the results of the Ljung-Box test and the residual ACF plot—whereas my model’s residuals are. I retain my model.
ATM3 model
This third time series has only two non-zero values, at the very end. There’s no point in fitting a complicated model.
Instead, I will consider only the last two values, and fit an ARIMA(0, 0, 0) to predict central tendency:
ATM4 model
Recall from above this series did not require any differencing, according to unit root testing. Interestingly, the PACF and ACF plots are nearly identical. The only significant lags are related to seasonality.
This strongly suggests to me this series may be a ‘seasonal white noise’ series.
The seasonal ACF and PACF plots show a familar pattern: A single significant lag at 7 in the ACF, and the PACF shows significant lags in multiples of 7 though decreasing. This suggests an MA term.
Fit models:
atm_m[[4]] <- Arima(atm_ts[[4]], order=c(0,0,0), seasonal=c(0,1,1), lambda=lambdas[4])
m4 <- auto.arima(atm_ts[[4]], d=1, approximation=FALSE, lambda=lambdas[4])
summary(atm_m[[4]])## Series: atm_ts[[4]]
## ARIMA(0,0,0)(0,1,1)[7]
## Box Cox transformation: lambda= 0.449771
##
## Coefficients:
## sma1
## -0.8698
## s.e. 0.0318
##
## sigma^2 estimated as 166.8: log likelihood=-1428.3
## AIC=2860.61 AICc=2860.64 BIC=2868.37
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 63.23555 339.2674 256.8784 -371.6817 415.354 0.7415309
## ACF1
## Training set 0.09951433
The automated procedure fits a much more complex model: ARIMA(5,1,0)(2,0,0)[7]. However, my seasonal white noise model bests it both in terms of \(AICc\) (mine: 2860.64, automated: 2970.66) and \(RMSE\) (mine: 12.77, automated: 13.94).
The residuals for the manual model look okay—I’d prefer the histogram to look a little more normal. However, these diagnostics are superior to the automated model.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(0,1,1)[7]
## Q* = 20.874, df = 13, p-value = 0.07545
##
## Model df: 1. Total lags used: 14
Forecasting and evaluation
The forecasts of the four models seem pretty reasonable:
grid.arrange(ncol=2, nrow=2,
atm_m[[1]] %>% forecast(h=30) %>% autoplot,
atm_m[[2]] %>% forecast(h=30) %>% autoplot,
atm_m[[3]] %>% forecast(h=30) %>% autoplot,
atm_m[[4]] %>% forecast(h=30) %>% autoplot)These estimates allow us to answer the original business question: How much cash should we allocate to each of the four ATMS? The forecasts for next week, rounded up to be presentable to the boss, are:
## [,1] [,2] [,3] [,4]
## [1,] 87 66 84 315
## [2,] 101 64 84 389
## [3,] 75 8 84 399
## [4,] 4 2 84 69
## [5,] 101 89 84 457
## [6,] 80 91 84 307
## [7,] 86 67 84 512
However, these models need a little bit more work before they are ready for prime time. We need them to account especially for beginnings and ends of months, holidays, and every-other-week pay periods.
Part B: Forecasting Residential Power Load
This data set contains electricity requirements in kilowatt-hours for a residential neighborhood, by month, from 1998 to 2013. The task is to forecast future power requirements.
power <- read_excel('./data/ResidentialCustomerForecastLoad-624.xlsx', col_names=TRUE) %>%
rename(id=CaseSequence, dt=`YYYY-MMM`, kwh=KWH) %>%
mutate(dt=as_date(paste(dt, '01', sep='-'), format='%Y-%b-%d', tz='UTC'))
power_ts <- ts(power$kwh, start=c(1998, 1), frequency=12)
autoplot(power_ts)There is a clear outlier around 2010. It may have been a blackout or an external event. We will replace it with a more reasonable number:
## $index
## [1] 151 192
##
## $replacements
## [1] 7696306 7028762
The tsoutliers function also marks observation 192 as an outlier, the very last month of the series. It does seem unusually high, but not so much that replacement is warrented.
The series exhibits clear seasonality, with electricity consumption peaking around August each year, collapsing to a low in November, then steadily increasing to a local in January, and decreasing again until May. This corresponds to increased use of air conditioning as temperature rises, and heating as temperature falls.
Although the series looks like it has approximately constant variance to my eye, the subseries plot helps us see that in fact variance shifts depending on month. Variance grows with the mean for each month.
Transformations
Given the change in variance, we want to transform the data. The optimal Box-Cox transformation is \(\lambda^* = -0.1978\). The plots now look closer to constant variance:
lambda <- BoxCox.lambda(power_ts)
power_bc <- BoxCox(power_ts, lambda=lambda)
grid.arrange(ncol=2,
autoplot(power_bc),
ggsubseriesplot(power_bc))To handle the increase in mean, difference the data. The ndiffs function suggests a single difference is sufficient; test with the KPSS test:
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.0623
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
As the test statistic is less than the critical values, fail to reject the hypothesis that the data is stationary with \(d=1\).
Modeling
First step is to examine the ACF and PACF functions of the Box-Cox transformed and single differenced series. The ACF crosses the significance line at the second lag, suggesting two MA terms. The PACF also crosses at the second lag, suggesting two AR terms.
Seasonally differenced: Cut off in ACF at first lag, and decreasing lags at mutliples of 12 in the PACF, suggesting an MA term—two actually, since lag 24 is still significant.
Fit the models:
power_m <- Arima(power_ts, order=c(2,1,2), seasonal=c(0,1,2), lambda=lambda)
power_m2 <- auto.arima(power_ts, d=1, lambda=lambda, approximation=FALSE) # trace=TRUE
summary(power_m)## Series: power_ts
## ARIMA(2,1,2)(0,1,2)[12]
## Box Cox transformation: lambda= -0.1977787
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1 sma2
## -0.5254 0.0562 -0.1807 -0.7165 -0.8165 0.1477
## s.e. 0.1927 0.1069 0.1759 0.1708 0.0842 0.0902
##
## sigma^2 estimated as 1.842e-05: log likelihood=723.61
## AIC=-1433.21 AICc=-1432.56 BIC=-1410.9
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 35638.93 645967 456651.1 0.05592155 6.842141 0.7426929
## ACF1
## Training set 0.05854157
I fit a ARIMA(2,1,2)(0,1,2)[12] model, while the algorithm selected a ARIMA(1,1,2)(0,1,2)[12] model. That is, the manual model has one additional autoregressive term. The manual model boasts an \(AICc = -1433.21\) versus the automated’s \(AIC = -1434.92\), and the manual has \(RMSE = 645967\) versus the automated \(RMSE = 647718\).
Both models’ residuals look good, and about the same. I can’t justify the extra autoregressive term in the manual model for this small a gain in performance, so I retain the automated model power_m2.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)(0,1,2)[12]
## Q* = 13.726, df = 18, p-value = 0.7468
##
## Model df: 6. Total lags used: 24
Forecast
Using this second model for forecasting the next 12 months:
To make the point estimates of the forecast presentable to the boss, put them in terms of millions of kilowatt hours, to the third significant digit:
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## 2014 10.40 8.53 7.08 6.13 5.97 8.08 9.45 10.00 8.68 6.12 6.03
## Dec
## 2014 8.01
Extra Credit: Part C: Forecasting Water Pipe Flow
Load data:
pipes1 <- read_excel('./data/Waterflow_Pipe1.xlsx', col_names=TRUE) %>%
rename(dt=`Date Time`, flow=WaterFlow)
pipes2 <- read_excel('./data/Waterflow_Pipe2.xlsx', col_names=TRUE) %>%
rename(dt=`Date Time`, flow=WaterFlow)The data is clearly over a 10 day period. I’m not sure how to extract the precise datetime from this, but since it doesn’t really matter, I will set the date at the standard origin of Jan. 1, 1970. Then I aggregate by hour, the mean of flow for each dataframe.
pipes1_df <- pipes1 %>%
mutate(base = dt - 42300,
base_hour = floor((base - floor(base))*24),
base_day = floor(base)) %>%
mutate(dt = as.POSIXct('1970-01-01') + hours(base_hour) + days(base_day)) %>%
mutate(dt = floor_date(dt, 'hour')) %>%
group_by(dt) %>%
summarise(flow=mean(flow)) %>%
arrange(dt)
pipes2_df <- pipes2 %>%
mutate(base = dt - 42300,
base_hour = floor((base - floor(base))*24),
base_day = floor(base)) %>%
mutate(dt = as.POSIXct('1970-01-01') + hours(base_hour) + days(base_day)) %>%
mutate(dt = floor_date(dt, 'hour')) %>%
group_by(dt) %>%
summarise(flow=mean(flow)) %>%
arrange(dt)Merge the two dataframes, replacing all missing values with 0, and sum the two flows:
pipes_df <- pipes1_df %>%
full_join(pipes2_df, by='dt', suffix=c('_1', '_2')) %>%
replace(is.na(.), 0) %>%
mutate(flow_total = flow_1 + flow_2)
head(pipes_df)## # A tibble: 6 x 4
## dt flow_1 flow_2 flow_total
## <dttm> <dbl> <dbl> <dbl>
## 1 1970-01-01 00:00:00 26.1 0 26.1
## 2 1970-01-01 01:00:00 18.9 30.9 49.8
## 3 1970-01-01 02:00:00 15.2 0 15.2
## 4 1970-01-01 03:00:00 23.1 38.0 61.1
## 5 1970-01-01 04:00:00 15.5 34.0 49.5
## 6 1970-01-01 05:00:00 22.7 0 22.7
After plotting the data, I suspect the strongest seasonality will be hourly, so I create a time series object with frequency=24:
I will be using a Box-Cox transformation with \(\lambda^* = -0.794488\) and first order differencing.
Examine ACF and PACF plots of the transformed and differenced series:
The PACF suggests autoregression of 5, and the ACF a moving average of order 2.
Examine seasonal ACF and PACF plots of the transformed and differenced series. The sharp peak in the ACF at 24, and the same peak followed by decline in the PACF, suggest seasonal differencing and a single moving average term.
Fit the models:
pipes_m <- Arima(pipes_ts, order=c(5, 1, 2), seasonal=c(0, 1, 1), lambda=lambda)
# Don't knit!
# pipes_m2 <- auto.arima(pipes_ts, d=1, D=1, lambda=lambda, seasonal=TRUE,
# approximation=FALSE)
summary(pipes_m)## Series: pipes_ts
## ARIMA(5,1,2)(0,1,1)[24]
## Box Cox transformation: lambda= -0.794488
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2
## -0.6218 -0.1572 0.0555 -0.0428 -0.1518 -0.4638 -0.5127
## s.e. 0.2180 0.0502 0.0509 0.0543 0.0382 0.2188 0.2167
## sma1
## -0.9652
## s.e. 0.0510
##
## sigma^2 estimated as 0.001831: log likelihood=1219.25
## AIC=-2420.49 AICc=-2420.24 BIC=-2379.27
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 3.565151 20.22264 14.2567 -7.474117 40.80451 0.9272938
## ACF1
## Training set 0.1154157
The auto.arima algorithm returns the model ARIMA(5,1,0)(2,1,0)[24]. It removed my second order moving average, and changed the seasonality. My model scores worse \(AICc = -2420.24\) than the automated model \(AICc = -2197.49\). However, my model has a drastically lower \(RMSE = 20.22\) compared to \(32.68\).
The residuals of both models show an extreme outlier in the residuals, which is distorting the results. I looked back and this extreme data point was not present in the original pipes_ts, but is created with the Box-Cox transformation. This should be corrected in any production model!
Nonetheless, as my model’s residuals do not show any significant autocorrelation in the residual ACF plot—and the automated model does—I accept my model for now.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,1,2)(0,1,1)[24]
## Q* = 80.902, df = 40, p-value = 0.0001378
##
## Model df: 8. Total lags used: 48
Forecast for next 7 days:
For the boss:
## Time Series:
## Start = c(2001, 3)
## End = c(2008, 2)
## Frequency = 24
## [1] 34.1 35.2 31.9 29.9 22.8 38.8 30.4 36.2 29.7 35.4 33.6 42.7 29.9 40.4
## [15] 43.6 35.0 34.6 35.0 32.8 38.3 35.6 36.3 27.7 36.1 36.2 35.9 32.2 31.0
## [29] 23.5 37.1 30.9 35.9 29.4 35.5 33.7 42.3 30.0 40.4 43.5 35.0 34.6 35.0
## [43] 32.8 38.3 35.6 36.2 27.7 36.1 36.2 35.9 32.2 31.0 23.5 37.1 30.9 35.9
## [57] 29.4 35.5 33.7 42.3 30.0 40.4 43.5 35.0 34.5 35.0 32.8 38.3 35.6 36.2
## [71] 27.7 36.1 36.2 35.9 32.2 31.0 23.5 37.1 30.9 35.9 29.4 35.4 33.7 42.3
## [85] 30.0 40.3 43.5 34.9 34.5 34.9 32.7 38.3 35.5 36.2 27.7 36.1 36.2 35.9
## [99] 32.2 31.0 23.4 37.1 30.9 35.9 29.4 35.4 33.6 42.2 30.0 40.3 43.4 34.9
## [113] 34.5 34.9 32.7 38.2 35.5 36.2 27.7 36.0 36.2 35.8 32.2 31.0 23.4 37.0
## [127] 30.9 35.8 29.4 35.4 33.6 42.2 30.0 40.3 43.4 34.9 34.5 34.9 32.7 38.2
## [141] 35.5 36.2 27.7 36.0 36.1 35.8 32.2 31.0 23.4 37.0 30.9 35.8 29.4 35.4
## [155] 33.6 42.2 29.9 40.3 43.4 34.9 34.5 34.9 32.7 38.2 35.5 36.1 27.7 36.0
This model needs some work! Likely, the two pipes’ flow should be forecasted seperately, like I did in Part A above.