First we will load and explore the data. We will also convert the data into a time series object (tsibble). The xlsx data was opened in Excel and exported as a CSV.
atm_raw <- read.csv("https://raw.githubusercontent.com/hdupre/DATA624/main/Project1/ATM624Data.csv")
head(atm_raw)
## DATE ATM Cash
## 1 5/1/2009 12:00:00 AM ATM1 96
## 2 5/1/2009 12:00:00 AM ATM2 107
## 3 5/2/2009 12:00:00 AM ATM1 82
## 4 5/2/2009 12:00:00 AM ATM2 89
## 5 5/3/2009 12:00:00 AM ATM1 85
## 6 5/3/2009 12:00:00 AM ATM2 90
summary(atm_raw)
## DATE ATM Cash
## Length:1474 Length:1474 Min. : 0.0
## Class :character Class :character 1st Qu.: 0.5
## Mode :character Mode :character Median : 73.0
## Mean : 155.6
## 3rd Qu.: 114.0
## Max. :10920.0
## NA's :19
There are some NAs in the data.
There appear to be five records missing the cash value – the other fourteen are from May 2010 (the month to be forecasted) and aren’t tied to an ATM ID. We’ll remove those records and convert to tsibbles for each ATM. Gaps will be filled with the median value for ATM 1, 2 and 4 since they appear to be active nearly every day.
atm_df <- atm_raw
atm_df$ATM[atm_df$ATM==""] <- NA
atm_df <- atm_df %>%
drop_na() %>%
mutate(DATE = as_date(DATE, format="%m/%d/%Y 12:00:00 AM"))
atm1_ts <- atm_df %>%
filter(ATM=="ATM1") %>%
select(-ATM) %>%
as_tsibble() %>%
fill_gaps(Cash = median(Cash))
## Using `DATE` as index variable.
atm2_ts <- atm_df %>%
filter(ATM=="ATM2") %>%
select(-ATM) %>%
as_tsibble() %>%
fill_gaps(Cash = median(Cash))
## Using `DATE` as index variable.
atm3_ts <- atm_df %>%
filter(ATM=="ATM3") %>%
select(-ATM) %>%
mutate(Cash = ifelse(Cash==0,NA,Cash)) %>%
drop_na %>%
as_tsibble()
## Using `DATE` as index variable.
atm4_ts <- atm_df %>%
filter(ATM=="ATM4") %>%
select(-ATM) %>%
as_tsibble()
## Using `DATE` as index variable.
grid.arrange(
atm1_ts %>%
autoplot(Cash)+
labs(title = "ATM1"),
atm2_ts %>%
autoplot(Cash)+
labs(title = "ATM2"),
atm3_ts %>%
autoplot(Cash)+
labs(title = "ATM3"),
atm4_ts %>%
autoplot(Cash)+
labs(title = "ATM4"), nrow=2)
From these series plots we can see some seasonality in the data from ATMs 1 and 2. ATM 3 had no withdrawals until near the end of our time series. ATM 4 has a single, massive spike and it’s difficult to see on this plot whether the variation is seasonal.
atm1_ts %>%
model(classical_decomposition(Cash,type="additive")) %>%
components() %>%
autoplot()
## Warning: Removed 3 row(s) containing missing values (geom_path).
atm1_ts %>%
gg_tsdisplay(Cash, plot_type = "partial")
ATM 1 has strong indications of weekly seasonality. There is no apparent trend nor an increasing or decreasing variance. According to fpp3, a Holt-Winters’ method model may be used for “daily type data” (section 8.3).
We will compare against an auto ARIMA model (as I have found that the auto function generally performs best) and an SNAIVE model. I’m also throwing in a prophet model as it is designed for forecasting daily data with “strong seasonality” but performance may suffer as it prefers data that also has yearly seasonality and holiday effects, according to fpp3 (section 12.2).
fit1 <- atm1_ts %>%
model(
hw=ETS(Cash ~ error("M") + trend("Ad") + season("M")),
arima=ARIMA(Cash, stepwise = FALSE, approx=FALSE),
snaive=SNAIVE(Cash),
prophet = prophet(Cash ~ season(period = 7, order = 3,
type = "multiplicative"))
)
glance(fit1) %>%
arrange(AICc)
## # A tibble: 4 × 13
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 arima 559. -1641. 3290. 3290. 3305. NA NA NA <cpl> <cpl>
## 2 hw 0.137 -2274. 4573. 4574. 4624. 696. 709. 0.221 <NULL> <NULL>
## 3 snaive 774. NA NA NA NA NA NA NA <NULL> <NULL>
## 4 prophet NA NA NA NA NA NA NA NA <NULL> <NULL>
## # … with 2 more variables: sigma <dbl>, changepoints <list>
accuracy(fit1)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 hw Training -0.752 26.4 16.3 -127. 143. 0.922 0.950 0.0810
## 2 arima Training -0.0660 23.3 14.6 -103. 118. 0.825 0.839 -0.00814
## 3 snaive Training -0.0363 27.8 17.7 -96.5 116. 1 1 0.159
## 4 prophet Training 0.0887 28.4 19.5 -140. 157. 1.10 1.02 0.131
The prophet model performed the worst based on RMSE. It is possible the time series is not long enough or that the “period”/“order” configuration isn’t optimal. The auto ARIMA model appears to have the lowest AICc and lowest RMSE. Let’s look at the residuals and features.
fit1 %>%
select(arima) %>%
gg_tsresiduals(lag=36)
Residuals appear to be consistent with white noise.
augment(fit1) %>%
filter(.model=="arima") %>%
features(.innov, ljung_box, lag=36, dof=7)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 38.7 0.108
P-value is large enough to confirm that the residuals are similar to white noise. We will use this model to forecast ATM1 in the results section.
We will take a similar approach to ATM2 since the time series plots appear to share similar features.
atm2_ts %>%
model(classical_decomposition(Cash,type="additive")) %>%
components() %>%
autoplot()
## Warning: Removed 3 row(s) containing missing values (geom_path).
atm2_ts %>%
gg_tsdisplay(Cash, plot_type = "partial")
Much like the previous data, there is obvious weekly seasonality.
fit2 <- atm2_ts %>%
model(
hw=ETS(Cash ~ error("M") + trend("Ad") + season("M")),
arima=ARIMA(Cash, stepwise = FALSE, approx=FALSE),
snaive=SNAIVE(Cash),
prophet = prophet(Cash ~ season(order = 3,
type = "multiplicative"))
)
glance(fit2) %>%
arrange(AICc)
## # A tibble: 4 × 13
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 arima 602. -1653. 3319. 3319. 3342. NA NA NA <cpl> <cpl>
## 2 hw 0.387 -2388. 4801. 4802. 4852. 1171. 1172. 0.499 <NULL> <NULL>
## 3 snaive 910. NA NA NA NA NA NA NA <NULL> <NULL>
## 4 prophet NA NA NA NA NA NA NA NA <NULL> <NULL>
## # … with 2 more variables: sigma <dbl>, changepoints <list>
accuracy(fit2)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 hw Training 2.12 34.0 28.3 -269. 300. 1.36 1.13 0.0222
## 2 arima Training -0.885 24.1 17.0 -Inf Inf 0.819 0.801 -0.00512
## 3 snaive Training 0.0223 30.1 20.8 -Inf Inf 1 1.00 0.0447
## 4 prophet Training 0.656 31.0 23.9 -Inf Inf 1.15 1.03 0.0276
Again the auto ARIMA has the lowest AICc and RMSE. Let’s look at the residuals to ensure the forecasts would be viable.
fit2 %>%
select(arima) %>%
gg_tsresiduals(lag=36)
augment(fit2) %>%
filter(.model=="arima") %>%
features(.innov, ljung_box, lag=24, dof=7)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 22.3 0.174
While the residuals do have a significant spike this appears to be within the tolerances of autocorrelation in white noise. The innovation residuals have a zero mean.
ATM 3 has only three observations – a naive forecast keeps things simple and will likely be accurate enough for the amount of data available.
fit3 <- atm3_ts %>%
model(NAIVE(Cash))
ATM 4’s most obvious issue is the outlier. Googling the cash an ATM can hold comes up with an approximation of $200,000. This ATM had over $1 million withdrawn in one day. Let’s handle the outlier and then take another look at the data.
atm4_ts_decomp <- atm4_ts %>%
model(STL(Cash ~ season(period=7), robust=TRUE)) %>%
components()
atm4_ts_decomp %>%
autoplot()
outliers <- atm4_ts_decomp %>%
filter(
remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
)
outliers
## # A dable: 3 x 7 [1D]
## # Key: .model [1]
## # : Cash = trend + season_7 + remainder
## .model DATE Cash trend season_7 remainder season_adjust
## <chr> <date> <int> <dbl> <dbl> <dbl> <dbl>
## 1 STL(Cash ~ season(per… 2009-09-22 1712 325. -142. 1529. 1854.
## 2 STL(Cash ~ season(per… 2010-01-29 1575 455. 35.3 1084. 1540.
## 3 STL(Cash ~ season(per… 2010-02-09 10920 299. -64.8 10686. 10985.
This methodology of finding outliers from fpp3 (section 13.9) found two more outliers in addition to the large spike. We will remove them and then fill the gaps with a median value.
atm4_ts <- atm4_ts %>%
anti_join(outliers) %>%
fill_gaps(Cash = median(Cash))
## Joining, by = c("DATE", "Cash")
atm4_ts %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
This plot looks much better – roughly stationary, consistent variance indicating no need for a transformation, no real identifiable seasonality. We will try a few different forecasting methods. ARIMA has performed well so far and simple exponential smoothing as it is suitable for data with no clear trend or seasonal pattern. A method with drift won’t appear necessary due to lack of trend, so perhaps a plain naive forecast can provide a sort of baseline.
We will try a stepwise ARIMA model as well, to see if there’s a performance difference.
fit4 <- atm4_ts %>%
model(naive = NAIVE(Cash),
ets = ETS(Cash ~ error("A") + trend("N") + season("N")),
stepwise = ARIMA(Cash),
search = ARIMA(Cash, stepwise=FALSE)
)
glance(fit4) %>%
arrange(AICc)
## # A tibble: 4 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
## 1 search 108934. -2631. 5279. 5279. 5310. NA NA NA <cpl [10]>
## 2 stepwise 110435. -2636. 5280. 5280. 5295. NA NA NA <cpl [14]>
## 3 ets 115515. -3203. 6412. 6412. 6424. 114882. 114982. 285. <NULL>
## 4 naive 213124. NA NA NA NA NA NA NA <NULL>
## # … with 1 more variable: ma_roots <list>
accuracy(fit4)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 naive Training -0.810 461. 377. -459. 519. 1.12 1.05 -0.465
## 2 ets Training -0.0364 339. 285. -539. 572. 0.843 0.770 0.0760
## 3 stepwise Training -0.202 331. 274. -502. 534. 0.811 0.751 0.0837
## 4 search Training 0.0620 327. 272. -477. 510. 0.806 0.742 0.00382
The non-stepwise ARIMA again has the better AICc and RMSE. Now for a residual check.
fit4 %>%
select(search) %>%
gg_tsresiduals(lag=24)
augment(fit4) %>%
filter(.model=="search") %>%
features(.innov, ljung_box, lag=24, dof=7)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 search 23.8 0.124
A significant p-value and a lack of correlation that is consistent with white noise. The zero mean requirement in the innov residuals is met as is the optional “constant variance” requirement. The histogram does not appear normal so perhaps a Box-Cox transformation could have helped.
grid.arrange(
fit1 %>%
select(arima) %>%
forecast(h=31) %>%
autoplot(atm1_ts)+
labs(title = "ATM1"),
fit2 %>%
select(arima) %>%
forecast(h=31) %>%
autoplot(atm2_ts)+
labs(title = "ATM2"),
fit3 %>%
forecast(h=31) %>%
autoplot(atm3_ts)+
labs(title = "ATM3"),
fit4 %>%
select(search) %>%
forecast(h=31) %>%
autoplot(atm4_ts)+
labs(title = "ATM4"),
nrow=2)
forecast1 <- fit1 %>%
select(arima) %>%
forecast(h=31) %>%
as.data.frame %>%
select(DATE, .mean)
forecast2 <- fit2 %>%
select(arima) %>%
forecast(h=31) %>%
as.data.frame %>%
select(DATE, .mean)
forecast3 <- fit3 %>%
forecast(h=31) %>%
as.data.frame %>%
select(DATE, .mean)
forecast4 <- fit4 %>%
select(search) %>%
forecast(h=31) %>%
as.data.frame %>%
select(DATE, .mean)
dataset_names <- list('ATM1' = forecast1, 'ATM2' = forecast2, 'ATM3' = forecast3, 'ATM4' = forecast4)
write.xlsx(dataset_names, file = 'atm_forecasts_2010_05.xlsx')
power_raw <- read.csv("https://raw.githubusercontent.com/hdupre/DATA624/main/Project1/ResidentialCustomerForecastLoad-624.csv")
head(power_raw)
## CaseSequence YYYY.MMM KWH
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
summary(power_raw)
## CaseSequence YYYY.MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
There is one NA on Sept 2008, so that will be dealt with by assigning the mean of the other September values.
power_df <- power_raw
sep2008_mean <- power_df%>%
filter(grepl('Sep', YYYY.MMM)) %>%
drop_na() %>%
summarise(mean(KWH)) %>%
as.numeric()
power_df$KWH[is.na(power_df$KWH)] <- sep2008_mean
power_ts <- power_df %>%
mutate(date = yearmonth(YYYY.MMM)) %>%
select(date,KWH) %>%
as_tsibble(index=date)
power_ts %>%
autoplot() +
labs(title = "KWH")
## Plot variable not specified, automatically selected `.vars = KWH`
The seasonality of these data should be investigated so we know what periods to assign in the models – it doesn’t appear to be yearly. Also, the outlier must be dealt with, and variance seems to “squeeze” smaller in the middle of the plot, so a Box-Cox transformation could help.
power_ts %>%
gg_tsdisplay(KWH, plot_type = "histogram")
There is a pronounced spike every six months, which can be useful when selecting a model.
We will now deal with the outlier and then explore transforming the data.
power_ts_decomp <- power_ts %>%
model(STL(KWH,robust=TRUE)) %>%
components()
power_ts_decomp %>%
autoplot()
outliers <- power_ts_decomp %>%
filter(
remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
)
outliers
## # A dable: 2 x 7 [1M]
## # Key: .model [1]
## # : KWH = trend + season_year + remainder
## .model date KWH trend season_year remainder season_adjust
## <chr> <mth> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 STL(KWH, robust = … 2010 Jul 7.71e5 6.86e6 1277186. -7369513. -506663.
## 2 STL(KWH, robust = … 2013 Dec 9.61e6 7.34e6 -256378. 2526281. 9862682.
July 2010 has the low outlier. The July 2010 outlier seems like it’s 1/10 of its real value, so this may be a recording error we’ll simply multiply it by ten.
The STL decomposition also indicates an upward trend that was more difficult to see before. Methods with trend, such as Holt’s linear trend method, may be useful in that case though we have to remember the seasonality.
power_ts <- power_ts %>%
mutate(KWH = ifelse(KWH<1000000,KWH*10,KWH))
power_ts %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`
These data are ready for forecasting with a Box-Cox transformation applied to KWH.
lambda <- BoxCox.lambda(power_ts$KWH, method = "guerrero")
fit <- power_ts %>%
model(
AAN = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("N")),
additive = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") +
season("A")),
multiplicative = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") +
season("M")),
stepwise = ARIMA(box_cox(KWH,lambda)),
search = ARIMA(box_cox(KWH,lambda), stepwise=FALSE)
)
glance(fit) %>%
arrange(AICc)
## # A tibble: 5 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 search 1.45e-5 757. -1497. -1496. -1472. NA NA NA
## 2 stepwise 1.52e-5 751. -1492. -1492. -1476. NA NA NA
## 3 additive 1.41e-5 576. -1117. -1114. -1062. 1.30e-5 1.35e-5 2.78e-3
## 4 multiplicative 6.48e-7 575. -1116. -1112. -1060. 1.31e-5 1.34e-5 6.11e-4
## 5 AAN 6.77e-5 419. -828. -828. -812. 6.63e-5 6.66e-5 7.08e-3
## # … with 2 more variables: ar_roots <list>, ma_roots <list>
accuracy(fit)
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAN Training 134633. 1335239. 1.14e6 -1.79 17.6 1.87 1.65 0.442
## 2 additive Training 28260. 615101. 4.55e5 -0.298 6.84 0.747 0.759 0.260
## 3 multiplicative Training 33836. 624097. 4.69e5 -0.224 7.05 0.770 0.770 0.261
## 4 stepwise Training 49164. 627621. 4.77e5 0.174 7.25 0.784 0.775 0.100
## 5 search Training 48686. 615191. 4.63e5 0.207 7.02 0.760 0.759 0.108
Once again, the search ARIMA model has the lowest AICc. It also has one of the lowest RMSEs, though the Holt-Winters’ additive method has a lower RMSE.
fit %>%
select(search) %>%
forecast(h=12) %>%
autoplot(power_ts)+
labs(title = "KWH")
forecast <- fit %>%
select(search) %>%
forecast(h=12) %>%
as.data.frame %>%
select(date, .mean)
dataset_names <- list('KWH' = forecast)
write.xlsx(dataset_names, file = 'power_forecasts_2014.xlsx')