In this part, we work with a dataset of cash withdrawals from 4 ATMs (“ATM624Data.xlsx”). The data is a daily time series from 2009/5/1 to 2010/4/30, spanning a full year. The target variable is Cash, which gives the aggregate cash withdrawals in hundreds of dollars for the indicated machine (ATM) and date (DATE).
First let’s review the data. We note the following:

- The Cash distributions differ markedly across the 4 machines (see the summary statistics and histograms below).
- ATM1 and ATM2 have a few missing values (3 and 2, respectively).
- ATM4 has one extreme outlier, a value above 10,000, versus a maximum of about 1,712 for its remaining observations.
- ATM3 is zero for all but the last 3 days of the sample.
Regarding the outlier for ATM4, ideally we would investigate it with the source of the data and then decide how to treat it. For instance, if we knew the outlier was explained by a holiday or other special circumstance, we would keep it and note it as an outlier when reviewing the results of our analysis. On the other hand, if we knew it to be a data error, we would either remove it or replace it with an estimate (imputation). In this case we don’t know for sure, so we go with our intuition that it is a data error and treat it as missing data.
Regarding the missing data for ATM1 and ATM2 (as well as the outlier removed for ATM4), the simplest option would be to ignore them, since we’re only missing 2-3 values out of 365 for each machine. However, the ETS and ARIMA model-fitting functions that we use in the modeling phase will throw errors if there are missing data in the interior of the time series. Consequently we need to impute values for the missing data, which we do by taking the average of the values one week before and one week after, i.e., the same weekday in the adjacent weeks. If we knew more about the data and believed that the ATMs are subject to the same drivers of consumer behavior (for instance, if the machines are all situated in the same shopping mall), then we could undertake a more sophisticated imputation method like regression, in which we fit the relationship between one machine and the others, and then use the regression to estimate the missing observation for the first machine as a function of the observations from the other machines. In any case, because there are only 2-3 missing observations out of 365 for each machine, whichever imputation method we choose is not likely to bias the results in any meaningful way.
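For illustration, here is a hypothetical sketch of the regression-based alternative (not used here), assuming the wide-format raw1 data frame defined in the next code chunk, with ATM3 excluded as a predictor given its near-zero history:
# hypothetical sketch (not run): impute missing ATM1 values by regressing
# ATM1 on the other machines over the complete cases
ok <- complete.cases(raw1[ , c("ATM1", "ATM2", "ATM4")])
imp_fit <- lm(ATM1 ~ ATM2 + ATM4, data = raw1[ok, ])
miss <- which(is.na(raw1$ATM1))  # assumes ATM2/ATM4 are observed on these dates
raw1$ATM1[miss] <- predict(imp_fit, newdata = raw1[miss, ])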
After making adjustments for the outlier and missing data, we define our time series for model fitting, which we call cashts. We set the seasonal period to 7, since we expect there may be a weekly “seasonal” pattern, and we start the series at position 6 of the weekly cycle because 2009-05-01 falls on a Friday (counting Sunday as position 1).
# load data and review summary stats
file1 <- "ATM624Data.xlsx"
raw1_long <- read_excel(file1, col_types = c("date", "text", "numeric")) %>%
  mutate(DATE = as_date(DATE), WDAY = wday(DATE, label = TRUE)) %>%
  filter(!is.na(ATM)) %>%  # drop trailing rows with no ATM label
  filter(Cash < 10000)     # remove outlier for ATM4; this also drops rows with
                           # missing Cash, which pivot_wider restores as NA below
raw1 <- pivot_wider(raw1_long, names_from = ATM, values_from = Cash)
summary(raw1)
## DATE WDAY ATM1 ATM2
## Min. :2009-05-01 Sun:52 Min. : 1.00 Min. : 0.00
## 1st Qu.:2009-07-31 Mon:52 1st Qu.: 73.00 1st Qu.: 25.50
## Median :2009-10-30 Tue:52 Median : 91.00 Median : 67.00
## Mean :2009-10-30 Wed:52 Mean : 83.89 Mean : 62.58
## 3rd Qu.:2010-01-29 Thu:52 3rd Qu.:108.00 3rd Qu.: 93.00
## Max. :2010-04-30 Fri:53 Max. :180.00 Max. :147.00
## Sat:52 NA's :3 NA's :2
## ATM3 ATM4
## Min. : 0.0000 Min. : 1.563
## 1st Qu.: 0.0000 1st Qu.: 123.918
## Median : 0.0000 Median : 403.305
## Mean : 0.7206 Mean : 445.346
## 3rd Qu.: 0.0000 3rd Qu.: 704.271
## Max. :96.0000 Max. :1712.075
## NA's :1
ggplot(raw1_long) + geom_histogram(aes(x = Cash)) +
facet_wrap(~ ATM, scales = "free") +
labs(title = "Histogram of Cash by ATM (excluding outlier for ATM4)")
# fix missing data & define ts
# impute each NA as the average of the same weekday in the prior & following weeks
raw1 <- raw1 %>%
  mutate(ATM1 = ifelse(is.na(ATM1), (lag(ATM1, 7) + lead(ATM1, 7))/2, ATM1),
         ATM2 = ifelse(is.na(ATM2), (lag(ATM2, 7) + lead(ATM2, 7))/2, ATM2),
         ATM4 = ifelse(is.na(ATM4), (lag(ATM4, 7) + lead(ATM4, 7))/2, ATM4))
# weekly seasonal period; start at position 6 since 2009-05-01 is a Friday
cashts <- ts(raw1[ , 3:6], start = c(1, 6), frequency = 7)
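As a quick sanity check (a hedged aside), we can confirm that the weekly cycle is aligned as intended, with the first observation landing on a Friday:
wday(raw1$DATE[1], label = TRUE)  # expect Fri
cycle(cashts[ , 'ATM1'])[1]       # expect 6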
Next we review the plots below to better understand the time series. We focus on the ATM1, ATM2, and ATM4 time series, since the ATM3 series doesn’t have enough data (only 3 data points) to draw meaningful conclusions. We note the following observations:

- All three series show a clear day-of-week pattern in the seasonal and subseries plots, with ACF spikes at lag 7 and its multiples.
- None of the series exhibits a pronounced trend.
- ATM4 operates on a much larger scale (maximum around 1,712 versus about 180 for ATM1 and ATM2) and is considerably noisier.
- The Box-Cox lambda estimates (0.26 for ATM1, 0.67 for ATM2, 0.46 for ATM4) suggest mild variance-stabilizing transformations; we compute them for reference but ultimately model the untransformed series.
# plots
ggpairs(as.data.frame(cashts[ , -3]),
title = "Pairs plot: cashts")
autoplot(cashts, facets = TRUE) +
labs(title = "Time plot: cashts",
x = "Week",
y = "$ hundreds")
ggseasonplot(cashts[ , 'ATM1']) +
labs(title = "Seasonal plot: ATM1")
ggseasonplot(cashts[ , 'ATM2']) +
labs(title = "Seasonal plot: ATM2")
#ggseasonplot(cashts[ , 'ATM3']) +
# labs(title = "Seasonal plot: ATM3")
ggseasonplot(cashts[ , 'ATM4']) +
labs(title = "Seasonal plot: ATM4")
ggsubseriesplot(cashts[, 'ATM1']) +
labs(title = "Seasonal subseries plot: ATM1")
ggsubseriesplot(cashts[, 'ATM2']) +
labs(title = "Seasonal subseries plot: ATM2")
#ggsubseriesplot(cashts[, 'ATM3']) +
# labs(title = "Seasonal subseries plot: ATM3")
ggsubseriesplot(cashts[, 'ATM4']) +
labs(title = "Seasonal subseries plot: ATM4")
ggAcf(cashts[ , c(1,2,4)], lag.max = 70) +
labs(title = "Autocorrelation plot: cashts")
# box-cox transformation & plot
cat("Box-Cox lambda parameter for ATM1: ", BoxCox.lambda(cashts[ , 'ATM1']))
## Box-Cox lambda parameter for ATM1: 0.2597339
autoplot(cbind(ATM1 = cashts[ , 'ATM1'], transformed = BoxCox(cashts[ , 'ATM1'], lambda = "auto")),
facets = TRUE) +
labs(title = "Time plot: ATM1 and transformed time series",
x = "Week",
y = "$ hundreds")
cat("Box-Cox lambda parameter for ATM2: ", BoxCox.lambda(cashts[ , 'ATM2']))
## Box-Cox lambda parameter for ATM2: 0.672753
autoplot(cbind(ATM2 = cashts[ , 'ATM2'], transformed = BoxCox(cashts[ , 'ATM2'], lambda = "auto")),
facets = TRUE) +
labs(title = "Time plot: ATM2 and transformed time series",
x = "Week",
y = "$ hundreds")
#cat("Box-Cox lambda parameter for ATM3: ", BoxCox.lambda(cashts[ , 'ATM3']))
#autoplot(cbind(ATM3 = cashts[ , 'ATM3'], transformed = BoxCox(cashts[ , 'ATM3'], lambda = "auto")),
# facets = TRUE) +
# labs(title = "Time plot: ATM3 and transformed time series",
# x = "Week",
# y = "$ hundreds")
cat("Box-Cox lambda parameter for ATM4: ", BoxCox.lambda(cashts[ , 'ATM4']))
## Box-Cox lambda parameter for ATM4: 0.4556867
autoplot(cbind(ATM4 = cashts[ , 'ATM4'], transformed = BoxCox(cashts[ , 'ATM4'], lambda = "auto")),
facets = TRUE) +
labs(title = "Time plot: ATM4 and transformed time series",
x = "Week",
y = "$ hundreds")
Finally we plot the STL decomposition to view the seasonal and trend components of the ATM1, ATM2, and ATM4 time series. (We exclude the ATM3 series since it has only 3 data points.) It is apparent that the trend component is not as significant as the seasonal component, particularly for ATM1 and ATM2. Also we see that the seasonal component varies over time in terms of both amplitude and curve shape.
# STL decomposition
cashts[ , 'ATM1'] %>% stl(s.window = 7) %>% autoplot() +
labs(title = "STL decomposition: ATM1", x = "Week")
cashts[ , 'ATM2'] %>% stl(s.window = 7) %>% autoplot() +
labs(title = "STL decomposition: ATM2", x = "Week")
#cashts[ , 'ATM3'] %>% stl(s.window = 7) %>% autoplot() +
# labs(title = "STL decomposition: ATM3", x = "Week")
cashts[ , 'ATM4'] %>% stl(s.window = 7) %>% autoplot() +
labs(title = "STL decomposition: ATM4", x = "Week")
To model and forecast the ATM1, ATM2, and ATM4 time series, we try 2 approaches:

- an STL decomposition forecast via the stlf function, with ETS or ARIMA used for the seasonally adjusted series; and
- seasonal ETS and ARIMA models fitted directly with the ets and auto.arima functions.
For the ATM3 series, we try a different approach given its very limited data (only 3 usable observations).
We start by fitting an STL model using the stlf function with ETS or ARIMA modeling methods. The approach is detailed in the modeling section of Part B below; in brief, it forecasts the seasonally adjusted time series (using the selected modeling method) and then re-seasonalizes the forecast based on the last seasonal cycle. In other words, it uses a seasonal naive method for the seasonal component of the forecast.
Below we plot the time series and forecasts, with a close-up to examine the forecast period. The forecasts produced by the two models are very close; note, however, that some of the forecasts go negative (e.g., for ATM2), which is impossible for cash withdrawals (a Box-Cox variant that avoids this is sketched after the plots below). In general the STL approach provides a useful approximate forecast, but because the seasonal component is simply carried forward from the last cycle rather than modeled, we cannot rely on it when seasonality is the dominant feature of the series.
atm1ts <- cashts[ , 'ATM1']
atm2ts <- cashts[ , 'ATM2']
atm4ts <- cashts[ , 'ATM4']
# stlf models
atm1stlf.ets <- stlf(atm1ts, method = "ets", h = 31)
atm1stlf.arima <- stlf(atm1ts, method = "arima", h = 31)
atm2stlf.ets <- stlf(atm2ts, method = "ets", h = 31)
atm2stlf.arima <- stlf(atm2ts, method = "arima", h = 31)
atm4stlf.ets <- stlf(atm4ts, method = "ets", h = 31)
atm4stlf.arima <- stlf(atm4ts, method = "arima", h = 31)
# plot ATM1
autoplot(atm1ts) +
autolayer(atm1stlf.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(atm1stlf.arima, PI = FALSE, series = "STLF-ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "STLF forecast: ATM1",
x = "Week",
y = "$ hundreds")
# close-up ATM1
autoplot(atm1ts) +
autolayer(atm1stlf.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(atm1stlf.arima, PI = FALSE, series = "STLF-ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of STLF forecast: ATM1",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58),
ylim = c(0, 120))
# plot ATM2
autoplot(atm2ts) +
autolayer(atm2stlf.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(atm2stlf.arima, PI = FALSE, series = "STLF-ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "STLF forecast: ATM2",
x = "Week",
y = "$ hundreds")
# close-up ATM2
autoplot(atm2ts) +
autolayer(atm2stlf.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(atm2stlf.arima, PI = FALSE, series = "STLF-ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of STLF forecast: ATM2",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58),
ylim = c(0, 120))
# plot ATM4
autoplot(atm4ts) +
autolayer(atm4stlf.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(atm4stlf.arima, PI = FALSE, series = "STLF-ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "STLF forecast: ATM4",
x = "Week",
y = "$ hundreds")
# close-up ATM4
autoplot(atm4ts) +
autolayer(atm4stlf.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(atm4stlf.arima, PI = FALSE, series = "STLF-ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of STLF forecast: ATM4",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58),
ylim = c(0, 800))
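As an aside, if we wanted to keep the STLF approach while avoiding the negative ATM2 forecasts noted above, one option (a sketch, using the same lambda = "auto" mechanism applied in Part B) is to transform inside stlf, since back-transformed point forecasts cannot fall below zero:
# hedged sketch: Box-Cox-transformed STLF keeps point forecasts nonnegative
atm2stlf.bc <- stlf(atm2ts, method = "arima", lambda = "auto", h = 31)
min(atm2stlf.bc$mean)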
Next we fit ETS and ARIMA models to the ATM1, ATM2, and ATM4 time series, including the seasonal component. We do this using the ets and auto.arima functions, and in each case, we show the model summary, plot model diagnostics, and check the model residuals.
# ATM1
atm1.ets <- ets(atm1ts)
atm1.ets
## ETS(A,N,A)
##
## Call:
## ets(y = atm1ts)
##
## Smoothing parameters:
## alpha = 0.0195
## gamma = 0.331
##
## Initial states:
## l = 80.3735
## s = -65.5466 7.5892 18.1514 10.6598 11.7762 5.0861
## 12.2839
##
## sigma: 23.988
##
## AIC AICc BIC
## 4483.965 4484.586 4522.964
autoplot(atm1.ets, main = "Components of ETS(A,N,A) method: ATM1")
checkresiduals(atm1.ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 29.632, df = 5, p-value = 1.743e-05
##
## Model df: 9. Total lags used: 14
# ATM2
atm2.ets <- ets(atm2ts)
atm2.ets
## ETS(A,N,A)
##
## Call:
## ets(y = atm2ts)
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.3634
##
## Initial states:
## l = 72.6123
## s = -60.9385 -41.3568 15.1052 6.3099 28.5868 22.4571
## 29.8363
##
## sigma: 24.9396
##
## AIC AICc BIC
## 4512.363 4512.985 4551.362
autoplot(atm2.ets, main = "Components of ETS(A,N,A) method: ATM2")
checkresiduals(atm2.ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 35.57, df = 5, p-value = 1.158e-06
##
## Model df: 9. Total lags used: 14
# ATM4
atm4.ets <- ets(atm4ts)
atm4.ets
## ETS(A,N,A)
##
## Call:
## ets(y = atm4ts)
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 1e-04
##
## Initial states:
## l = 453.3376
## s = -274.211 -32.2112 27.1532 40.6802 90.0601 50.641
## 97.8877
##
## sigma: 332.8719
##
## AIC AICc BIC
## 6404.013 6404.634 6443.012
autoplot(atm4.ets, main = "Components of ETS(A,N,A) method: ATM4")
checkresiduals(atm4.ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 17.48, df = 5, p-value = 0.003674
##
## Model df: 9. Total lags used: 14
cat("Number of seasonal differences needed for ATM1:", nsdiffs(atm1ts))
## Number of seasonal differences needed for ATM1: 1
cat("Number of differences needed for ATM1:" , ndiffs(diff(atm1ts, lag = 7)))
## Number of differences needed for ATM1: 0
cat("Number of seasonal differences needed for ATM2:", nsdiffs(atm2ts))
## Number of seasonal differences needed for ATM2: 1
cat("Number of differences needed for ATM2:" , ndiffs(diff(atm2ts, lag = 7)))
## Number of differences needed for ATM2: 0
cat("Number of seasonal differences needed for ATM4:", nsdiffs(atm4ts))
## Number of seasonal differences needed for ATM4: 0
cat("Number of differences needed for ATM4:" , ndiffs(atm4ts))
## Number of differences needed for ATM4: 0
# ATM1
atm1.arima <- auto.arima(atm1ts, stepwise = FALSE, approximation = FALSE)
#atm1.arima <- auto.arima(atm1ts)
atm1.arima
## Series: atm1ts
## ARIMA(0,0,1)(1,1,1)[7]
##
## Coefficients:
## ma1 sar1 sma1
## 0.1884 0.1881 -0.7549
## s.e. 0.0548 0.0791 0.0549
##
## sigma^2 estimated as 552.9: log likelihood=-1638.92
## AIC=3285.84 AICc=3285.95 BIC=3301.36
autoplot(atm1.arima, main = "ARIMA characteristic roots: ATM1")
checkresiduals(atm1.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(1,1,1)[7]
## Q* = 15.695, df = 11, p-value = 0.1528
##
## Model df: 3. Total lags used: 14
# ATM2
#atm2.arima <- auto.arima(atm2ts, stepwise = FALSE, approximation = FALSE)
atm2.arima <- auto.arima(atm2ts)
atm2.arima
## Series: atm2ts
## ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4238 -0.913 0.4590 0.8009 -0.7486
## s.e. 0.0547 0.042 0.0848 0.0572 0.0389
##
## sigma^2 estimated as 586.2: log likelihood=-1648.7
## AIC=3309.4 AICc=3309.64 BIC=3332.68
autoplot(atm2.arima, main = "ARIMA characteristic roots: ATM2")
checkresiduals(atm2.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.0631, df = 9, p-value = 0.4315
##
## Model df: 5. Total lags used: 14
# ATM4
atm4.arima <- auto.arima(atm4ts, stepwise = FALSE, approximation = FALSE)
#atm4.arima <- auto.arima(atm4ts)
atm4.arima
## Series: atm4ts
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
##
## Coefficients:
## ar1 sar1 sar2 mean
## 0.0799 0.1494 0.1118 444.4260
## s.e. 0.0523 0.0522 0.0526 26.1069
##
## sigma^2 estimated as 118508: log likelihood=-2648.19
## AIC=5306.38 AICc=5306.55 BIC=5325.88
autoplot(atm4.arima, main = "ARIMA characteristic roots: ATM4")
checkresiduals(atm4.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 14.208, df = 10, p-value = 0.1637
##
## Model df: 4. Total lags used: 14
# best models
bestatm1 <- atm1.arima
bestatm2 <- atm2.arima
bestatm4 <- atm4.arima
Comparing the ETS and ARIMA model output for ATM1, ATM2, and ATM4, it is apparent that the ARIMA models provide a stronger fit to the data for all three series. In particular, the ARIMA residuals pass the Ljung-Box test (p-values of 0.15, 0.43, and 0.16 for ATM1, ATM2, and ATM4), satisfying our assumption of no autocorrelation in the model residuals, whereas the ETS residuals fail the test (p < 0.01 in every case). Hence we select the ARIMA models as our final forecast models for all three time series.
Let’s plot the ATM1, ATM2, and ATM4 time series with forecasts produced by the ETS and ARIMA models. We include a close-up of the forecast period in order to examine and compare the output of the two models. It is interesting to note that the ETS and ARIMA forecasts are very close for the ATM1 and ATM2 series, but they are markedly different for the ATM4 series. In particular, the ETS model predicts a continuing seasonal pattern for ATM4, whereas the ARIMA model predicts a damped oscillatory pattern that seems to converge to a constant. This may arise from the weaker seasonality pattern and greater noise in the ATM4 data.
atm1f.ets <- forecast(atm1.ets, h = 31)
atm1f.arima <- forecast(atm1.arima, h = 31)
atm2f.ets <- forecast(atm2.ets, h = 31)
atm2f.arima <- forecast(atm2.arima, h = 31)
atm4f.ets <- forecast(atm4.ets, h = 31)
atm4f.arima <- forecast(atm4.arima, h = 31)
# plot ATM1
autoplot(atm1ts) +
autolayer(atm1f.ets, PI = FALSE, series = "ETS") +
autolayer(atm1f.arima, PI = FALSE, series = "ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "ETS & ARIMA forecast: ATM1",
subtitle = "ETS(A,N,A) and ARIMA(0,0,1)(1,1,1)[7]",
x = "Week",
y = "$ hundreds")
# close-up ATM1
autoplot(atm1ts) +
autolayer(atm1f.ets, PI = FALSE, series = "ETS") +
autolayer(atm1f.arima, PI = FALSE, series = "ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of ETS & ARIMA forecast: ATM1",
subtitle = "ETS(A,N,A) and ARIMA(0,0,1)(1,1,1)[7]",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58),
ylim = c(0, 120))
# plot ATM2
autoplot(atm2ts) +
autolayer(atm2f.ets, PI = FALSE, series = "ETS") +
autolayer(atm2f.arima, PI = FALSE, series = "ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "ETS & ARIMA forecast: ATM2",
subtitle = "ETS(A,N,A) and ARIMA(2,0,2)(0,1,1)[7]",
x = "Week",
y = "$ hundreds")
# close-up ATM2
autoplot(atm2ts) +
autolayer(atm2f.ets, PI = FALSE, series = "ETS") +
autolayer(atm2f.arima, PI = FALSE, series = "ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of ETS & ARIMA forecast: ATM2",
subtitle = "ETS(A,N,A) and ARIMA(2,0,2)(0,1,1)[7]",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58),
ylim = c(0, 120))
# plot ATM4
autoplot(atm4ts) +
autolayer(atm4f.ets, PI = FALSE, series = "ETS") +
autolayer(atm4f.arima, PI = FALSE, series = "ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "ETS & ARIMA forecast: ATM4",
subtitle = "ETS(A,N,A) and ARIMA(1,0,0)(2,0,0)[7] with constant",
x = "Week",
y = "$ hundreds")
# close-up ATM4
autoplot(atm4ts) +
autolayer(atm4f.ets, PI = FALSE, series = "ETS") +
autolayer(atm4f.arima, PI = FALSE, series = "ARIMA") +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of ETS & ARIMA forecast: ATM4",
subtitle = "ETS(A,N,A) and ARIMA(1,0,0)(2,0,0)[7] with constant",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58),
ylim = c(0, 800))
We noted earlier in the data preparation section that the ATM3 series is zero for the first 362 observations and non-zero for only the last 3 observations. Upon further review, we notice that the non-zero cash withdrawals (96, 82, and 85) on days 363 through 365 are identical to those from ATM1. This would be a good time to investigate the nature of the time series with the source of the data. Questions to ask include:

- Is the match with ATM1 a data error (e.g., values copied from the wrong machine), or a coincidence?
- Do the machines behave independently, or could ATM3 be estimated as a function of the other machines?
- Do the zeros before day 363 represent genuine inactivity (e.g., a machine not yet in service), or missing data?
Absent the answers to these questions, we make our best guess that (a) it’s purely coincidental that ATM1 and ATM3 have identical observations on days 363 through 365, (b) the machines reflect independent observations, so future ATM3 observations cannot be estimated as a function of the forecasts for ATM1, ATM2, and ATM4, and (c) the zero observations for ATM3 prior to day 363 reflect missing data, which are not relevant for forecasting purposes.
With these assumptions, we re-define the ATM3 time series to exclude the zero values, resulting in a 3-day time series. Then we try several simple forecasting methods:

- the average (mean) method;
- the naive method;
- the drift method; and
- simple exponential smoothing (SES).
The forecasts from these simple methods are shown in the plot below. We note that:

- the mean, naive, and SES forecasts are flat and close to one another (roughly 85 to 88); and
- the drift method extrapolates the slope between the first and last observations, about -5.5 per day, which would push the forecast negative well before the end of the 31-day horizon; with only 3 data points there is no basis for projecting such a trend.
Based on these observations, we choose simple exponential smoothing for our final forecast of ATM3. For practical purposes, the choice between the average method, the naive method, and the simple exponential smoothing method is not meaningful in terms of the resulting forecast, as all three methods give estimates that fall within each other’s prediction intervals. However, we expect that the prediction intervals should start out large and should grow over the forecast window, reflecting the increasing forecast uncertainty over time (since we’re forecasting 31 days from a dataset of only 3 observations). Furthermore, since cash withdrawals are inherently positive, we require that the prediction intervals remain positive. Only the simple exponential smoothing method is consistent with these requirements.
# simple methods for ATM3
# define 3-day ts for ATM3 (day 363 falls in week 53, weekday position 4)
atm3ts <- window(cashts[ , 'ATM3'], start = c(53, 4))
atm3f.mean <- meanf(atm3ts, h = 31)
atm3f.naive <- naive(atm3ts, h = 31)
atm3f.drift <- rwf(atm3ts, h = 31, drift = TRUE)
atm3f.ses <- ses(atm3ts, h = 31)
autoplot(atm3ts) +
autolayer(atm3f.mean, series = "Mean", PI = FALSE) +
autolayer(atm3f.naive, series = "Naive", PI = FALSE) +
autolayer(atm3f.drift, series = "Drift", PI = FALSE) +
autolayer(atm3f.ses, series = "SES", PI = FALSE) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Simple forecasts for ATM3",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(53, 58),
ylim = c(70, 100))
# best model forecast
bestatm3f <- atm3f.ses
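To verify the interval behavior described above, here is a short check of the 95% bounds at the first and last forecast steps (the column names assume the standard forecast object structure):
# 95% lower/upper bounds at h = 1 and h = 31 for three of the simple methods
sapply(list(Mean = atm3f.mean, Naive = atm3f.naive, SES = atm3f.ses),
       function(f) c(lo1 = f$lower[1, "95%"], lo31 = f$lower[31, "95%"],
                     hi1 = f$upper[1, "95%"], hi31 = f$upper[31, "95%"]))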
Finally we use our chosen models to forecast daily ATM cash withdrawals over the next month for each machine. We plot the forecasts with 80% and 95% prediction intervals, and save the forecasts to an Excel file.
# forecast 31 days
fatm1 <- forecast(bestatm1, h = 31)
fatm2 <- forecast(bestatm2, h = 31)
fatm4 <- forecast(bestatm4, h = 31)
fatm3 <- bestatm3f
# plot with prediction intervals - ATM1
autoplot(fatm1) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Final forecast: ATM1",
x = "Week",
y = "$ hundreds")
# close-up - ATM1
autoplot(fatm1) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of final forecast: ATM1",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58))#,
#ylim = c(0, 100))
# plot with prediction intervals - ATM2
autoplot(fatm2) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Final forecast: ATM2",
x = "Week",
y = "$ hundreds")
# close-up - ATM2
autoplot(fatm2) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of final forecast: ATM2",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58))#,
#ylim = c(0, 100))
# plot with prediction intervals - ATM3
autoplot(cashts[ , 'ATM3']) + autolayer(fatm3) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Final forecast: ATM3",
x = "Week",
y = "$ hundreds")
# close-up - ATM3
autoplot(cashts[ , 'ATM3']) + autolayer(fatm3) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of final forecast: ATM3",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58))#,
#ylim = c(0, 100))
# plot with prediction intervals - ATM4
autoplot(fatm4) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Final forecast: ATM4",
x = "Week",
y = "$ hundreds")
# close-up - ATM4
autoplot(fatm4) +
geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +
labs(title = "Close-up of final forecast: ATM4",
x = "Week",
y = "$ hundreds") +
coord_cartesian(xlim = c(50, 58))#,
#ylim = c(0, 100))
# forecast df for excel file
end <- raw1$DATE[nrow(raw1)]
fcdates <- end + days(1:31)
cashfc_df <- data.frame(DATE = fcdates,
ATM1 = fatm1$mean,
ATM2 = fatm2$mean,
ATM3 = fatm3$mean,
ATM4 = fatm4$mean)
# save to excel file <<< DONE AT END OF PART C
#write_xlsx(list(ATMForecast = cashfc_df),
# "KBenson_Proj1_Forecasts.xlsx")
In this part, we work with a dataset of residential power usage (“ResidentialCustomerForecastLoad-624.xlsx”). The data is a monthly time series from 1998/01 to 2013/12, spanning 16 years. The target variable is KWH, the residential power consumption in kilowatt-hours.
Reviewing the data, we note the following:

- The KWH series has one missing value, which we impute below as the average of the same month in the prior and following years.
- The minimum value (770,523) is less than one-eighth of the median and sits far below the rest of the distribution; it is likely an outlier, which we flag in the time plot below but retain in the series.

After the adjustment for the missing data, we define our time series for model fitting, which we call powerts.
# load data and review summary stats
file2 <- "ResidentialCustomerForecastLoad-624.xlsx"
raw2 <- read_excel(file2, col_types = c("numeric", "text", "numeric"))
summary(raw2)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
ggplot(raw2) + geom_histogram(aes(x = KWH)) +
labs(title = "Histogram of KWH")
# outlier (series minimum) & missing data index
outlier <- which.min(raw2$KWH)
missing <- which.max(is.na(raw2$KWH))  # index of the first (and only) NA
# fix missing data and define ts: impute the NA as the average of the
# same month in the prior & following years
raw2 <- mutate(raw2, KWH = ifelse(is.na(KWH), (lag(KWH, 12) + lead(KWH, 12))/2, KWH))
powerts <- ts(raw2[['KWH']], start = c(1998,1), frequency = 12)
Reviewing the plots below, we make the following observations:

- There is a strong annual seasonal pattern, confirmed by ACF spikes at lag 12 and its multiples; usage appears highest in the summer and winter months.
- The trend is modest, with a slight upward drift that is most visible after 2010.
- The outlier and the imputed missing value are marked by the dashed red lines in the time plot.
# plots
autoplot(powerts) +
geom_vline(xintercept = 1998 + (c(outlier, missing)-1)/12, lty = 2, col = 2) +
labs(title = "Time plot: powerts",
x = "Year",
y = "KWH")
ggseasonplot(powerts) +
labs(y = "KWH")
#ggseasonplot(powerts, polar = TRUE)
ggsubseriesplot(powerts) +
labs(title = "Seasonal subseries plot: powerts",
y = "KWH")
#gglagplot(powerts)
ggAcf(powerts, lag.max = 120) +
labs(title = "Autocorrelation plot: powerts")
# box-cox transformation & plot
cat("Box-Cox lambda parameter: ", BoxCox.lambda(powerts))
## Box-Cox lambda parameter: 0.1217513
autoplot(cbind(powerts = powerts, transformed = BoxCox(powerts, lambda = "auto")),
facets = TRUE) +
labs(title = "Time plot: powerts and transformed time series",
x = "Year",
y = "KWH")
Finally we plot the STL decomposition to view the seasonal and trend components of the time series. There is a slight increase in the trend line after 2010.
# STL decomposition
stl(powerts, s.window = 13) %>% autoplot() +
labs(title = "STL decomposition: powerts",
x = "Year")
To model and forecast the powerts time series, we try 3 approaches:

- an STL decomposition forecast via the stlf function;
- an ETS model fitted with the ets function; and
- an ARIMA model fitted with the auto.arima function.

As a starting point, we fit an STL model using the stlf function with ETS or ARIMA modeling methods. Procedurally, this approach does the following:

1. decompose the time series with STL into seasonal, trend, and remainder components;
2. forecast the seasonally adjusted (SA) series, i.e. trend plus remainder, using the selected ETS or ARIMA method; and
3. re-seasonalize the forecast by adding back the seasonal component from the last year of data.

While ETS or ARIMA modeling is used for fitting the non-seasonal (SA) time series, the seasonal adjustment to the forecast is equivalent to the “seasonal naive” method of reusing the last year of the seasonal pattern. As such, we expect this approach to give an approximate forecast that does not truly model the seasonal pattern.
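To make the procedure concrete, here is a minimal hand-rolled sketch of what stlf does internally (we use s.window = "periodic" for simplicity; stlf's internal default differs slightly):
# decompose, forecast the seasonally adjusted series, re-seasonalize
dcmp   <- stl(powerts, s.window = "periodic")
sa     <- seasadj(dcmp)                   # trend + remainder
fc_sa  <- forecast(ets(sa), h = 12)       # forecast the SA series
fc_sea <- snaive(seasonal(dcmp), h = 12)  # last cycle of the seasonal component
fc     <- fc_sa$mean + fc_sea$mean        # re-seasonalized point forecast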
Below we plot the time series and forecasts, with a close-up to examine the forecast period. The forecasts produced by the two models are virtually indistinguishable. Note that we also considered versions of the STL ETS and STL ARIMA models with a Box-Cox transformation, but the resulting forecasts included a mid-year kink in the curve, which is not consistent with the time series history. Hence we rejected these model versions.
# stlf models
stlf2.ets <- stlf(powerts, method = "ets", h = 12)
stlf2.arima <- stlf(powerts, method = "arima", h = 12)
#stlf2.ets.bc <- stlf(powerts, method = "ets", lambda = "auto", h = 12)
#stlf2.arima.bc <- stlf(powerts, method = "arima", lambda = "auto", h = 12)
# plot
autoplot(powerts) +
autolayer(stlf2.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(stlf2.arima, PI = FALSE, series = "STLF-ARIMA") +
# autolayer(stlf2.ets.bc, PI = FALSE, series = "STLF-ETS-BC") +
# autolayer(stlf2.arima.bc, PI = FALSE, series = "STLF-ARIMA-BC") +
geom_vline(xintercept = 2014, lty = 3, lwd = 1) +
labs(title = "STLF forecast using ETS & ARIMA",
x = "Year",
y = "KWH")
# close-up
autoplot(powerts) +
autolayer(stlf2.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(stlf2.arima, PI = FALSE, series = "STLF-ARIMA") +
# autolayer(stlf2.ets.bc, PI = FALSE, series = "STLF-ETS-BC") +
# autolayer(stlf2.arima.bc, PI = FALSE, series = "STLF-ARIMA-BC") +
geom_vline(xintercept = 2014, lty = 3, lwd = 1) +
labs(title = "Close-up of STLF forecast using ETS & ARIMA",
x = "Year",
y = "KWH") +
coord_cartesian(xlim = c(2012, 2015),
ylim = c(5e06, 11e06))
Next we fit ETS and ARIMA models to the powerts time series, including the seasonal component. We do this using the ets and auto.arima functions, and in each case, we show the model summary, plot model diagnostics, and check the model residuals.
# ets without box-cox <<< DON'T USE
#fit2.ets <- ets(powerts)
#fit2.ets
#autoplot(fit2.ets)
#checkresiduals(fit2.ets)
# ets with box-cox <<< BEST
fit2.ets <- ets(powerts, lambda = 'auto')
fit2.ets
## ETS(A,N,A)
##
## Call:
## ets(y = powerts, lambda = "auto")
##
## Box-Cox transformation: lambda= 0.1218
##
## Smoothing parameters:
## alpha = 0.0525
## gamma = 1e-04
##
## Initial states:
## l = 46.7427
## s = -0.1683 -1.6746 -0.8476 1.2707 1.798 0.4992
## 0.1981 -1.5378 -1.1848 -0.4614 0.4895 1.6189
##
## sigma: 1.2401
##
## AIC AICc BIC
## 1107.525 1110.252 1156.387
autoplot(fit2.ets)
checkresiduals(fit2.ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 11.321, df = 10, p-value = 0.333
##
## Model df: 14. Total lags used: 24
cat("Number of seasonal differences needed:", nsdiffs(powerts))
## Number of seasonal differences needed: 1
cat("Number of differences needed:" , ndiffs(diff(powerts, lag = 12)))
## Number of differences needed: 0
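For reference, ndiffs defaults to the KPSS unit-root test; an equivalent manual check could look like the following (a sketch, assuming the urca package is installed; we would expect the test statistic to fall below the 5% critical value, consistent with ndiffs returning 0):
library(urca)
# KPSS test on the seasonally differenced series
summary(ur.kpss(diff(powerts, lag = 12)))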
# arima without box-cox <<< BEST
fit2.arima <- auto.arima(powerts, stepwise = FALSE, approximation = FALSE)
fit2.arima
## Series: powerts
## ARIMA(1,0,0)(0,1,2)[12] with drift
##
## Coefficients:
## ar1 sma1 sma2 drift
## 0.2527 -0.8945 0.1295 7695.751
## s.e. 0.0747 0.0816 0.0809 2067.653
##
## sigma^2 estimated as 7.347e+11: log likelihood=-2718.59
## AIC=5447.17 AICc=5447.52 BIC=5463.14
autoplot(fit2.arima)
checkresiduals(fit2.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,2)[12] with drift
## Q* = 11.761, df = 20, p-value = 0.924
##
## Model df: 4. Total lags used: 24
# arima with box-cox <<< DON'T USE
#fit2.arima <- auto.arima(powerts, lambda = "auto", stepwise = FALSE, approximation = FALSE)
#fit2.arima
#autoplot(fit2.arima)
#checkresiduals(fit2.arima)
Now let’s plot the powerts time series with forecasts produced by the ETS and ARIMA models. We include a close-up of the forecast period to compare the output of the two models. As we saw before (for the STL forecasts), the forecasts from the ETS and ARIMA models are almost indistinguishable.
fit2f.ets <- forecast(fit2.ets, h = 12)
fit2f.arima <- forecast(fit2.arima, h = 12)
# plot
autoplot(powerts) +
autolayer(fit2f.ets, PI = FALSE, series = "ETS") +
autolayer(fit2f.arima, PI = FALSE, series = "ARIMA") +
geom_vline(xintercept = 2014, lty = 3, lwd = 1) +
labs(title = "ETS & ARIMA forecast",
subtitle = "ETS(A,N,A) and ARIMA(1,0,0)(0,1,2)[12]",
x = "Year",
y = "KWH")
# close-up
autoplot(powerts) +
autolayer(fit2f.ets, PI = FALSE, series = "ETS") +
autolayer(fit2f.arima, PI = FALSE, series = "ARIMA") +
geom_vline(xintercept = 2014, lty = 3, lwd = 1) +
labs(title = "Close-up of ETS & ARIMA forecast",
subtitle = "ETS(A,N,A) and ARIMA(1,0,0)(0,1,2)[12]",
x = "Year",
y = "KWH") +
coord_cartesian(xlim = c(2012, 2015),
ylim = c(5e06, 11e06))
The forecasts generated by the ETS and ARIMA models are very close. In order to choose between the two models, let’s evaluate the models on a validation set and then select the better-performing model. We do this as follows:

1. split the data into a training set (1998-2009, 144 months) and a validation set (2010-2013, 48 months);
2. re-fit each model specification on the training set; and
3. forecast 48 months ahead and compare the error metrics on the validation set.

We should highlight several points here:

- We hold each model’s structure fixed (ETS(A,N,A) with Box-Cox; ARIMA(1,0,0)(0,1,2)[12] with drift) and re-estimate only the parameters on the training set.
- The Box-Cox lambda is re-estimated on the training data, where it comes out quite different (1.60, versus 0.12 on the full series).
- The validation exercise is used only to choose between the two model classes; the final forecast is produced by the winning model as fitted on the full dataset.
# training set
train2 <- window(powerts, end = c(2009, 12))
# ets(ANA) with box-cox
fit2.ets2 <- ets(train2, model = "ANA", lambda = "auto")
fit2.ets2
## ETS(A,N,A)
##
## Call:
## ets(y = train2, model = "ANA", lambda = "auto")
##
## Box-Cox transformation: lambda= 1.6004
##
## Smoothing parameters:
## alpha = 0.0313
## gamma = 1e-04
##
## Initial states:
## l = 46834503418.7937
## s = -6682876559 -18425292716 -8034710750 15946508441 21155789995 15437056188
## -2279293982 -16193138210 -13494776141 -6430908173 3257915458 15743726449
##
## sigma: 6714723135
##
## AIC AICc BIC
## 7247.665 7251.415 7292.212
# arima(1,0,0)(0,1,2)[12] with drift
fit2.arima2 <- Arima(train2, order = c(1,0,0), seasonal = c(0,1,2),
include.drift = TRUE)
fit2.arima2
## Series: train2
## ARIMA(1,0,0)(0,1,2)[12] with drift
##
## Coefficients:
## ar1 sma1 sma2 drift
## 0.1712 -0.7961 0.0697 3281.088
## s.e. 0.0939 0.1037 0.1154 1692.300
##
## sigma^2 estimated as 2.985e+11: log likelihood=-1934.04
## AIC=3878.08 AICc=3878.55 BIC=3892.49
Reviewing the RMSE, MAE, MAPE, and MASE error metrics on the validation set, we see that the ARIMA model has stronger predictive performance than the ETS model. Hence we select the ARIMA model for our forecast. Note that for forecasting purposes, we use the version of the ARIMA model that was fitted on the entire dataset.
# accuracy metrics
acc2.ets2 <- fit2.ets2 %>% forecast(h = 48) %>% accuracy(powerts)
acc2.arima2 <- fit2.arima2 %>% forecast(h = 48) %>% accuracy(powerts)
acc2.ets2[ , c(2,3,5,6)] %>%
kable(caption = "Error metrics for ETS(A,N,A) with Box-Cox")
| | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| Training set | 512644.1 | 399909.3 | 6.407868 | 0.731715 |
| Test set | 1617775.6 | 1113038.5 | 30.760147 | 2.036529 |
acc2.arima2[ , c(2,3,5,6)] %>%
kable(caption = "Error metrics for ARIMA(1,0,0)(0,1,2)[12] with drift")
| | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| Training set | 515115 | 379925.1 | 6.140061 | 0.6951498 |
| Test set | 1470451 | 946544.4 | 28.486439 | 1.7318946 |
# best model - use version fitted on full dataset
bestfit2 <- fit2.arima
Finally we use our chosen model to produce monthly forecasts of residential power usage in 2014. We plot the forecast with 80% and 95% prediction intervals, and save the forecast to our Excel file.
# forecast 12 months
f2 <- forecast(bestfit2, h = 12)
# plot with prediction intervals
autoplot(powerts) +
autolayer(f2) +
geom_vline(xintercept = 2014, lty = 3, lwd = 1) +
labs(title = "Final forecast from ARIMA(1,0,0)(0,1,2)[12]",
x = "Year",
y = "KWH")
# close-up
autoplot(powerts) +
autolayer(f2) +
geom_vline(xintercept = 2014, lty = 3, lwd = 1) +
labs(title = "Close-up of final forecast from ARIMA(1,0,0)(0,1,2)[12]",
x = "Year",
y = "KWH") +
coord_cartesian(xlim = c(2012, 2015),
ylim = c(5e06, 11e06))
# forecast df for excel file
# (check.names = FALSE preserves the "YYYY-MMM" column name for Excel)
powerfc_df <- data.frame("YYYY-MMM" = paste0("2014-", month.abb),
                         KWH = f2$mean,
                         check.names = FALSE)
# save to excel file <<< DONE AT END OF PART C
#write_xlsx(list(PowerForecast = powerfc_df),
# "KBenson_Proj1_Forecasts.xlsx")
In this part, we work with two datasets of time-stamped waterflow measurements. The first dataset (“Waterflow_Pipe1.xlsx”) includes measurements at various irregular times from 10/23/15 through 11/1/15. The second dataset (“Waterflow_Pipe2.xlsx”) includes regular hourly measurements from 10/23/15 through 12/3/15.
We start by merging the data into a combined dataset. In order to do this, we merge the data along a common hourly time scale as follows:

1. round each timestamp up to the next hour (using ceiling_date);
2. stack the two datasets; and
3. group by hour and average the WaterFlow measurements within each hour.

The effect of this procedure is to combine measurements from either dataset occurring within the same hour (e.g., 12:01AM, 12:15AM, and 12:59AM all map to 1:00AM) and then to take the average of such measurements. The resulting dataset is an hourly time series from 10/23/15 through 12/3/15, which includes two distinct time ranges:

- 10/23/15 through 11/1/15, where each hourly value averages the irregular Pipe 1 measurements with the corresponding hourly Pipe 2 measurement; and
- 11/1/15 through 12/3/15, where only Pipe 2 contributes.
Once the data is merged, we define the time series of waterflow measurements, which we call waterts. Note that we specify a frequency of 24 for the time series, as we expect there may be a daily “seasonal” pattern for the hourly measurements.
Fortunately, there are no missing values and no obvious outliers in the merged time series.
# load & parse data
file3a <- "Waterflow_Pipe1.xlsx"
file3b <- "Waterflow_Pipe2.xlsx"
raw3a <- read_excel(file3a, col_types = c("date", "numeric")) %>%
rename(DateTime = "Date Time") %>%
mutate(Hour = ceiling_date(DateTime, unit = "hour"))
raw3b <- read_excel(file3b, col_types = c("date", "numeric")) %>%
rename(DateTime = "Date Time") %>%
mutate(Hour = ceiling_date(DateTime, unit = "hour"))
raw3 <- rbind(raw3a, raw3b) %>%
group_by(Hour) %>%
summarise(Count = n(), Avg = mean(WaterFlow, na.rm = TRUE))
# define ts
waterts <- ts(raw3$Avg, frequency = 24)
# check for missing data & outliers
summary(waterts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.885 23.777 33.242 35.963 47.098 77.388
ggplot(data.frame(WaterFlow = as.numeric(waterts))) +
  geom_histogram(aes(x = WaterFlow)) +
  labs(title = "Histogram of waterts",
       x = "Waterflow")
Let’s review the data using the plots below. We note several observations:

- Both the mean level and the variance of the series step up noticeably after the first 10-12 days, coinciding with the point where the Pipe 1 measurements end.
- There is no strong hourly (“daily seasonal”) pattern in the seasonal or subseries plots.
- The negative Box-Cox lambda (about -0.64) reflects the variance changing with the level of the series.
# plots
autoplot(waterts) +
labs(title = "Time plot: waterts",
x = "Day",
y = "Waterflow")
ggseasonplot(waterts) +
labs(y = "Waterflow") +
guides(color=guide_legend(title = "Day"))
#ggseasonplot(waterts, polar = TRUE)
ggsubseriesplot(waterts) +
labs(title = "Seasonal subseries plot: waterts",
x = "Hour",
y = "Waterflow")
#gglagplot(waterts)
ggAcf(waterts, lag.max = 120) +
labs(title = "Autocorrelation plot: waterts")
# box-cox transformation & plot
cat("Box-Cox lambda parameter: ", BoxCox.lambda(waterts))
## Box-Cox lambda parameter: -0.64441
autoplot(cbind(waterts = waterts, transformed = BoxCox(waterts, lambda = "auto")),
facets = TRUE) +
labs(title = "Time plot: waterts and transformed time series",
x = "Day",
y = "Waterflow")
Next we plot the STL decomposition to view the seasonal and trend components of the time series. As noted earlier, there is a noticeable increase after the first 10-12 days in both the mean level of the trend component and the variance of the seasonal component.
# STL decomposition
stl(waterts, s.window = 25) %>% autoplot() +
labs(title = "STL decomposition: waterts",
x = "Day")
Finally we consider the issue of whether the time series is stationary. In its raw form, the time series is not stationary: the statistical properties of the data depend on which time period is analyzed. For instance, we observed above that the measurements have a lower mean level and lower variance within the first 10-12 days than over the remaining days.
In order to produce a stationary time series (which can be used, for instance, in an ARIMA model), we make the following adjustments to the data:

1. apply a Box-Cox transformation (with automatically selected lambda) to stabilize the variance; and
2. take one first difference, as indicated by ndiffs (nsdiffs indicates that no seasonal differencing is needed).
We see from the plot below that the resulting time series appears to be stationary. Furthermore, the plot indicates that this may be a good candidate for ARIMA(0,1,1), as the pattern of the ACF (single spike at lag = 1) and the PACF (exponential decay profile) suggest a MA(1) model.
# adjust to make ts stationary & plot
waterts_bc <- BoxCox(waterts, lambda = "auto")
cat("Number of seasonal differences needed:", nsdiffs(waterts_bc))
## Number of seasonal differences needed: 0
cat("Number of differences needed:" , ndiffs(waterts_bc))
## Number of differences needed: 1
waterts_bc %>% diff() %>% ggtsdisplay()
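As a cross-check on the manual identification, we could fit the candidate ARIMA(0,1,1) directly (a sketch; the auto.arima call below arrives at the same model):
# fit the manually identified candidate on the Box-Cox scale
lam <- BoxCox.lambda(waterts)
fit_manual <- Arima(waterts, order = c(0, 1, 1), lambda = lam, biasadj = TRUE)
fit_manual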
To model and forecast the waterts time series, we try our usual 3 approaches:

- an STL decomposition forecast via the stlf function;
- an ETS model fitted with the ets function; and
- an ARIMA model fitted with the auto.arima function.

To start, we plot the STL forecast with ETS and ARIMA methods for the non-seasonal (seasonally adjusted) component, with a Box-Cox transformation. The two forecasts are indistinguishable. However, these forecasts are not appropriate here: the time series does not exhibit a strong seasonal pattern, yet the method re-seasonalizes the forecast using the most recent 24-hour cycle, imposing a daily pattern that the data do not clearly support.
# stlf models
stlf3.ets <- stlf(waterts, lambda = "auto", method = "ets", h = 168)
stlf3.arima <- stlf(waterts, lambda = "auto", method = "arima", h = 168)
# plot
autoplot(waterts) +
autolayer(stlf3.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(stlf3.arima, PI = FALSE, series = "STLF-ARIMA") +
geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) +
labs(title = "STLF forecast using ETS & ARIMA",
x = "Day",
y = "Waterflow")
# close-up
autoplot(waterts) +
autolayer(stlf3.ets, PI = FALSE, series = "STLF-ETS") +
autolayer(stlf3.arima, PI = FALSE, series = "STLF-ARIMA") +
geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) +
labs(title = "Close-up of STLF forecast using ETS & ARIMA",
x = "Day",
y = "Waterflow") +
coord_cartesian(xlim = c(35, 50))
Next we fit ETS and ARIMA models to the time series using the ets and auto.arima functions in R. The results are shown below, including model summaries and diagnostic plots for the residuals. Note that for both models we use the Box-Cox transformation with bias adjustment, since we want the point forecasts to represent the mean (rather than median) of the forecast distribution.
fit3.ets <- ets(waterts, lambda = "auto", biasadj = TRUE)
fit3.ets
## ETS(A,N,N)
##
## Call:
## ets(y = waterts, lambda = "auto", biasadj = TRUE)
##
## Box-Cox transformation: lambda= -0.6444
##
## Smoothing parameters:
## alpha = 0.0164
##
## Initial states:
## l = 1.3516
##
## sigma: 0.0736
##
## AIC AICc BIC
## 1693.762 1693.786 1708.485
autoplot(fit3.ets)
checkresiduals(fit3.ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 45.951, df = 46, p-value = 0.4743
##
## Model df: 2. Total lags used: 48
fit3.arima <- auto.arima(waterts, lambda = "auto", biasadj = TRUE)
fit3.arima
## Series: waterts
## ARIMA(0,1,1)
## Box Cox transformation: lambda= -0.64441
##
## Coefficients:
## ma1
## -0.9825
## s.e. 0.0061
##
## sigma^2 estimated as 0.005418: log likelihood=1187.67
## AIC=-2371.34 AICc=-2371.33 BIC=-2361.53
autoplot(fit3.arima)
checkresiduals(fit3.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 46.043, df = 47, p-value = 0.5122
##
## Model df: 1. Total lags used: 48
Finally, let’s compare 7-day forecasts (7 * 24 = 168 hourly measurements) from the ETS and ARIMA models. In both cases the forecast is flat, equivalent to a simple exponential smoothing forecast. The ETS model forecasts a constant 46.9, whereas the ARIMA model forecasts 47.2.
fit3f.ets <- forecast(fit3.ets, h = 168)
fit3f.arima <- forecast(fit3.arima, h = 168)
autoplot(waterts) +
autolayer(fit3f.ets, series = "ETS", PI = FALSE) +
autolayer(fit3f.arima, series = "ARIMA", PI = FALSE) +
geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) +
labs(title = "ETS & ARIMA forecast",
subtitle = "ETS(A,N,N) and ARIMA(0,1,1)",
x = "Day",
y = "Waterflow")
cat("ETS(A,N,N) forecast:", fit3f.ets$mean[1])
## ETS(A,N,N) forecast: 46.89912
autoplot(waterts) +
autolayer(fit3f.ets, series = "ETS", PI = FALSE) +
autolayer(fit3f.arima, series = "ARIMA", PI = FALSE) +
geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) +
labs(title = "Close-up of ETS & ARIMA forecast",
subtitle = "ETS(A,N,N) and ARIMA(0,1,1)",
x = "Day",
y = "Waterflow") +
coord_cartesian(xlim = c(35, 50))
cat("ARIMA(0,1,1) forecast:", fit3f.arima$mean[1])
## ARIMA(0,1,1) forecast: 47.23156
To choose between the ETS and ARIMA models, we could conduct a validation exercise to test each model’s predictive performance on a validation sample, or use time series cross-validation (a sketch follows below). However, in this case a validation exercise would likely be overkill, since the two models generate almost identical forecasts:

- the ETS(A,N,N) model forecasts a constant 46.9; and
- the ARIMA(0,1,1) model forecasts a constant 47.2, a difference of less than 1%.
So for practical purposes the two forecasts are identical, and it makes no meaningful difference which model we choose. In this case, we select the simpler ETS model as our final forecasting model.
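Had the two forecasts differed materially, time series cross-validation could break the tie. A sketch of how that might look (not run here; it is computationally heavy since the models are re-fitted at every forecast origin, and tsCV returns NA wherever a fit fails):
# one-step-ahead CV errors for each model class
fets <- function(y, h) {
  forecast(ets(y, lambda = BoxCox.lambda(y), biasadj = TRUE), h = h)
}
farima <- function(y, h) {
  forecast(Arima(y, order = c(0, 1, 1), lambda = BoxCox.lambda(y), biasadj = TRUE), h = h)
}
e_ets <- tsCV(waterts, fets, h = 1)
e_arima <- tsCV(waterts, farima, h = 1)
sqrt(c(ETS = mean(e_ets^2, na.rm = TRUE), ARIMA = mean(e_arima^2, na.rm = TRUE)))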
Finally we use our chosen model to produce a 7-day forecast of hourly waterflow measurements. The 7-day forecast period is equivalent to 168 hours, starting at 12/3/15 5PM. We plot the forecast with 80% and 95% prediction intervals, and save the forecast to our Excel file. We note that the prediction intervals are extremely wide, with upper bounds well above the historical range of measurement values. This is a consequence of computing the intervals on the Box-Cox transformed scale and back-transforming them to the original scale.
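To see why the back-transformed upper bounds blow up, recall that the inverse Box-Cox transformation is y = (lambda*w + 1)^(1/lambda); with a negative lambda, y grows without bound as the transformed value w approaches -1/lambda. A small illustration with hypothetical w values:
lam <- BoxCox.lambda(waterts)             # about -0.64
w <- -1/lam - c(0.15, 0.05, 0.01, 0.001)  # transformed values nearing the limit
InvBoxCox(w, lam)                         # explodes on the original scale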
bestfit3 <- fit3.ets
# forecast 168 hours
f3 <- forecast(bestfit3, h = 168)
# plot with prediction intervals
autoplot(waterts) +
autolayer(f3) +
geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) +
labs(title = "Final forecast from ETS(A,N,N)",
x = "Day",
y = "Waterflow")
# close-up
autoplot(waterts) +
autolayer(f3) +
geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) +
labs(title = "Close-up of final forecast from ETS(A,N,N)",
x = "Day",
y = "Waterflow") +
coord_cartesian(xlim = c(35, 50))
# forecast df for excel file
# (check.names = FALSE preserves the "Date Time" column name for Excel)
end <- raw3$Hour[nrow(raw3)]
fctimes <- end + hours(1:168)
waterfc_df <- data.frame("Date Time" = fctimes,
                         WaterFlow = f3$mean,
                         check.names = FALSE)
# save all forecasts to excel file
write_xlsx(list(ATMForecast = cashfc_df,
PowerForecast = powerfc_df,
WaterflowForecast = waterfc_df),
"KBenson_Proj1_Forecasts.xlsx")