Forecast the amount of cash withdrawn from 4 ATMs for May 2010 using historical withdrawal data.
The dataset consists of time series data for daily cash withdrawals from 4 different ATMs between May 1, 2009 and April 30, 2010. There are three variables:
To prepare the data for analysis and forecasting, I made sure that the \(\text{DATE}\) column was in Date format and that the \(\text{Cash}\) column was in numeric format. The dataset \(\text{Cash}\) value was in hundreds of dollars, so I multiplied by 100 to convert to USD. The rows with NAs in the ATM column were dropped since only the 4 ATMs were being considered in this analysis. The tibble was then converted to a tsibble with key being \(\text{ATM}\) and index being \(\text{DATE}\).
# get the data
atm_data <- read_xlsx(link1)
# Convert to tsibble
atm <- atm_data %>%
mutate(
DATE = as.Date(DATE, origin = "1899-12-30"),
Cash = as.numeric(Cash) * 100 # convert from hundreds to USD
) %>%
drop_na(ATM) %>% # drop rows with NAs in ATM column
as_tsibble(key = ATM, index = DATE)
In this section, I explored the data further by looking at the spread of the data by column.
# summarize data
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 50
## Median :2009-10-30 Mode :character Median : 7300
## Mean :2009-10-30 Mean : 15558
## 3rd Qu.:2010-01-29 3rd Qu.: 11400
## Max. :2010-04-30 Max. :1091976
## NA's :5
# Looking at NA Cash values
atm %>%
filter(is.na(Cash))
## # A tsibble: 5 x 3 [1D]
## # Key: ATM [2]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
There are 5 NA \(\text{Cash}\) values in the whole dataset, with 3 belonging to ATM1 and 2 belonging to ATM2. Given that removing these values would result in a gap in the data that is already limited by the dataset being daily values for a period of 1 year, I decided to impute these NAs with the mean value of the corresponding ATM.
# Imputing NA Cash values with the ATM's mean
atm <- atm %>%
group_by(ATM) %>%
mutate(
Cash = ifelse(is.na(Cash),
mean(Cash, na.rm = TRUE),
Cash)
) %>%
ungroup()
# plotting the time series for each ATM
autoplot(atm, Cash) +
facet_wrap(~ATM, scales = "free_y") +
labs(
title = "Daily Cash Withdrawals by ATM",
x = "Date",
y = "Withdrawal (USD)"
) +
theme(
plot.title = element_text(hjust = 0.5, face = 'bold')
)
The time series plots show us:
atm %>%
filter(ATM == "ATM3" & Cash >0)
## # A tsibble: 3 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 9600
## 2 2010-04-29 ATM3 8200
## 3 2010-04-30 ATM3 8500
Looking at the time series plot for ATM 3, it looks like there is one extreme outlier in the time series, with all other dates having zero withdrawals. However, looking at the above table, we can see that there were 3 consecutive days with withdrawals. These are still outliers compared to the rest of the data, but the fact that they occurred consecutively means they are likely not errors and can be easily explained.
atm %>%
filter(ATM == "ATM4" & Cash > 175000)
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 1091976.
The outlier occurred on February 9th, 2010 with a value of $1,091,976. No other withdrawals exceeded $175,000, making this a clear and extreme anomaly. I decided to impute the value with the mean value of the day before and after.
atm <- atm %>%
group_by(ATM) %>%
mutate(Cash = ifelse(ATM == "ATM4" & DATE == as.Date("2010-02-09"),
mean(Cash[ATM == "ATM4" &
DATE %in% c(as.Date("2010-02-08"), as.Date("2010-02-10"))],
na.rm = TRUE),
Cash)) %>%
ungroup()
# box cox
atm_lambda <- atm %>%
features(Cash, features = guerrero)
atm_lambda
## # A tibble: 4 × 2
## ATM lambda_guerrero
## <chr> <dbl>
## 1 ATM1 0.237
## 2 ATM2 0.724
## 3 ATM3 1.45
## 4 ATM4 0.450
However, since the ATM data spans a single year and residual diagnostics later confirm constant variance, a Box–Cox transformation was not applied. I experimented with applying the square root transformation for ATM4, but it did not materially improve residual behavior or model performance. Therefore, the analysis proceeded using the untransformed data to preserve interpretability and maintain consistency across all ATMs.
This section describes the development, training, and evaluation of forecasting models used to predict daily cash withdrawals for each ATM.
I tested 3 different forecasting models per ATM:
# Fit Models
fit_models_atm <- atm %>%
model(
snaive = SNAIVE(Cash ~ lag("week")),
ets = ETS(Cash),
arima = ARIMA(Cash)
)
# Model info
glimpse(fit_models_atm)
## Rows: 4
## Columns: 4
## Key: ATM [4]
## $ ATM <chr> "ATM1", "ATM2", "ATM3", "ATM4"
## $ snaive <model> [SNAIVE], [SNAIVE], [SNAIVE], [SNAIVE]
## $ ets <model> [ETS(A,N,A)], [ETS(A,N,A)], [ETS(A,N,N)], [ETS(M,N,M)]
## $ arima <model> [ARIMA(0,0,1)(0,1,2)[7]], [ARIMA(2,0,2)(0,1,1)[7]], [ARIMA(0,0,…
# Check accuracy for each model - focusing on MAE, RMSE, and MAPE
accuracy(fit_models_atm)
## # A tibble: 12 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 snaive Training -3.63e0 2787. 1.78e3 -96.6 116. 1 1 0.151
## 2 ATM1 ets Training -2.20e1 2384. 1.52e3 -107. 122. 0.853 0.855 0.125
## 3 ATM1 arima Training -7.36e0 2341. 1.47e3 -103. 118. 0.825 0.840 -0.00795
## 4 ATM2 snaive Training 2.23e0 3005. 2.08e3 -Inf Inf 1 1 0.0459
## 5 ATM2 ets Training -7.10e1 2506. 1.79e3 -Inf Inf 0.860 0.834 0.0198
## 6 ATM2 arima Training -8.81e1 2407. 1.70e3 -Inf Inf 0.819 0.801 -0.00507
## 7 ATM3 snaive Training 7.35e1 804. 7.35e1 100 100 1.00 1.00 0.640
## 8 ATM3 ets Training 2.70e1 503. 2.72e1 Inf Inf 0.371 0.625 -0.00717
## 9 ATM3 arima Training 2.71e1 503. 2.71e1 34.6 34.6 0.370 0.625 0.0124
## 10 ATM4 snaive Training -3.70e2 45304. 3.46e4 -392. 447. 1 1 0.0621
## 11 ATM4 ets Training -3.04e3 35425. 2.77e4 -452. 484. 0.800 0.782 0.0574
## 12 ATM4 arima Training 3.16e0 33952. 2.80e4 -507. 540. 0.809 0.749 0.00472
The best model was determined for each ATM by assessing which model has lower MAE, RMSE, and MAPEs.
ATM1: ARIMA achieved the lowest MAE and RMSE, outperforming ETS and SNAIVE. The ARIMA model best captured short-term fluctuations and weekly seasonality. Specifically, ARIMA(0,0,1)(0,1,2)[7] is selected, which is a seasonal moving average model where weekly seasonality is handled by 1 seasonal difference and two seasonal MA terms.
ATM2: ARIMA had the lowest MAE and RMSE. MAPE was infinite due to their being zero-values at some dates, but relative values confirm the advantage of using ARIMA. Specifically, ARIMA(2,0,2)(0,1,1)[7] is selected, which combines both AR and MA components plus a single seasonal MA term. It models both short-term autocorrelation and weekly seasonal fluctuations.
ATM3: The series remained flat for most of the period and then had a sudden spike towards the end of the period. The ETS and ARIMA models have nearly identical MAE and RMSE values. MAPE is infinite for the ETS model, while ARIMA has a finite number. ARIMA was therefore selected as the preferred model, as it handled long zero periods more realistically and yielded stable percentage errors. Specifically, ARIMA(0,0,2) is selected, which is a simple non-seasonal moving average model. The series is stationary and primarily influenced by the average of the last two random shocks. It does not account for seasonality.
ATM4: For ATM4, ARIMA and ETS both performed similarly. The ARIMA model achieved the lowest RMSE and residual autocorrelation, indicating a more accurate fit after accounting for the February 2010 outlier. Specifically, ARIMA(3,0,2)(1,0,0)[7] w/ mean is selected, which effectively captures the short-term fluctuations and the weekly seasonal pattern.
The below residual diagnostics indicate that the model is well-fit.
These diagnostics confirm that the model adequately captures the trend and seasonality, and no further transformations are necessary.
# ATM1
fit_models_atm %>%
filter(ATM == "ATM1") %>%
select(arima) %>%
gg_tsresiduals()
The below residual diagnostics indicate that the model is well-fit.
These diagnostics confirm that the model adequately captures the trend and seasonality, and no further transformations are necessary.
# ATM2
fit_models_atm %>%
filter(ATM == "ATM2") %>%
select(arima) %>%
gg_tsresiduals()
The residual diagnostics for ATM3 indicate that the ARIMA(0,0,2) model provides a reasonable fit given the atypical data structure.
These diagnostics confirms that a box-cox transformation would not be necessary here as the issue is not variance growth, but lack of activity.
# ATM3
fit_models_atm %>%
filter(ATM == "ATM3") %>%
select(arima) %>%
gg_tsresiduals()
The below residual diagnostics for ATM4 indicate that the ARIMA(3,0,2)(1,0,0)[7] w/ mean model provides a reasonable fit, with a few notable points.
Overall, these diagnostics confirm that the model adequately captures the underlying structure of ATM4 withdrawals. As discussed previously, I experimented with applying the square root of the data due to the lambda being approximately 0.5, but the model performance and residual behavior did not improve. I decided that no transformation was necessary because of this.
# ATM4
fit_models_atm %>%
filter(ATM == "ATM4") %>%
select(arima) %>%
gg_tsresiduals()
Once the best models were determined for each ATM, I forecasted the cash withdrawals for each ATM for the month of May.
acc_tables_atm <- accuracy(fit_models_atm)
# select best models
best_models <- acc_tables_atm %>%
group_by(ATM) %>%
arrange(RMSE, MAE, .by_group = TRUE) %>%
slice(1) %>%
ungroup() %>%
select(ATM, .model)
# forecast
future <- new_data(atm, 31)
# Forecast with ALL models
fc_all <- forecast(fit_models_atm, new_data = future)
# Keep only the chosen model per ATM
final_fc <- fc_all %>%
inner_join(best_models, by = c("ATM", ".model"))
# write to excel
final_fc %>%
write_xlsx('May2010_ATM_Forecasts.xlsx')
# Plot forecasts for each ATM
autoplot(final_fc, atm) +
labs(title = "Forecasted Cash Withdrawals for May 2010", y = "USD") +
facet_wrap(~ATM, scales = "free_y") +
xlab("Date") +
ylab("Withdrawal (USD)")+
theme(
plot.title = element_text(hjust=0.5, face='bold')
)
The forecast plots present the 31 day forecasts for May 2010, overlaid on the historical withdrawal data for each ATM. The shaded regions represent the 80% and 95% prediction intervals, showing the expected range of cash withdrawals based on the selected model for each ATM.
For ATM 1 and ATM 2, the ARIMA model was used and produces forecasts that closely follow the historical weekly pattern observed throughout the period. The prediction intervals remain relatively narrow, reflecting stable withdrawal behavior and strong weekly seasonality.
ATM 3 exhibited zero activity for most of the period and then had a sudden spike in April 2010. ARIMA generates a largely flat forecast with wide uncertainty. This is consistent with its historical behavior and lack of seasonal structure. If there was more data available from previous years, we’d be able to see if there was a cyclical trend with this data.
For ATM 4, the ARIMA model was applied. The forecasts show moderately volatile but centered withdrawals around historical averages. The prediction intervals are very wide, indicating higher uncertainty. The reason for this is likely due to the ATM’s more irregular withdrawal behavior. Despite this, the model still captures the overall level and weekly seasonality adequately.
Overall, the forecasts suggest that ATM 1 and ATM 2 are expected to maintain consistent withdrawal volumes through the month of May, while ATM 3 and ATM 4 show greater variability and uncertainty due to their irregular historical behavior.
Using a simple dataset of residential power usage between January 1998 and December 2013, the goal is to model this data and provide a monthly forecast for 2014.
# read data
power <- read_xlsx(link2)
# look at first 5 rows
head(power, 5)
## # A tibble: 5 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
The dataset contains three columns:
To prepare the data for analysis and forecasting, I converted the YYYY-MMM column into a \(\text{date}\) object and renamed to Date, removed the original YYYY-MMM column, and then converted the tibble into a tsibble.
# convert YYYY-MMM column into date format and convert data into tsibble for time series
power <- power %>%
mutate(
Date = yearmonth(`YYYY-MMM`) # convert to yearmonth object
) %>%
select(CaseSequence, Date, KWH) %>%
as_tsibble(index = Date)
In this section, I explored the data further by looking at the spread of the data by column.
summary(power$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
range(power$Date)
## <yearmonth[2]>
## [1] "1998 Jan" "2013 Dec"
power %>%
filter(is.na(KWH))
## # A tsibble: 1 x 3 [1M]
## CaseSequence Date KWH
## <dbl> <mth> <dbl>
## 1 861 2008 Sep NA
# plotting the power time series
autoplot(power, KWH) +
labs(
title = "Residential Power Usage\nBetween January 1998 and December 2013",
x = "Month",
y = "Power (KWH)"
)
We can see that there is 1 NA in the KWH column. Looking at the time series plot, this NA causes a gap in the data. I decided to impute with the mean value for that month (September) across all years.
# Impute missing September values with the average September KWH across all years
power <- power %>%
mutate(
month = month(Date, label = TRUE), # extract month name
KWH = ifelse(
is.na(KWH) & month == "Sep",
mean(KWH[month == "Sep"], na.rm = TRUE), # average KWH of all Septembers
KWH
)
) %>%
select(-month)
# plotting the power time series
autoplot(power, KWH) +
labs(
title = "Residential Power Usage\nBetween January 1998 and December 2013",
x = "Month",
y = "Power (KWH)"
) +
theme(
plot.title = element_text(hjust = 0.5, face = 'bold')
)
We can see after imputing the missing value, the gap is now gone.
This time series plot shows us:
power %>%
filter(KWH<3000000)
## # A tsibble: 1 x 3 [1M]
## CaseSequence Date KWH
## <dbl> <mth> <dbl>
## 1 883 2010 Jul 770523
The outlier occurred in July 2010. The outlier is not 0, it is 770,523 KWH. There is not enough information to explain whether this is a reporting error or a major event (like a power outage). We will consider this outlier in our forecasting.
# box cox
power_lambda <- power %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
power_lambda
## [1] 0.1226627
The lambda is 0.1226627, which is very close to 0. I will apply a box-cox transformation with \(\lambda = 0.1226627\).
# applying box-cox transformation
power$KWH_transformed <- box_cox(power$KWH, power_lambda)
# plot after transformation
autoplot(power, KWH_transformed) +
labs(
title = "Residential Power Usage\nBetween January 1998 and December 2013\nBox-Cox Transformation",
x = "Month",
y = "Power (KWH)"
) +
theme(
plot.title = element_text(hjust = 0.5, face = 'bold')
)
After the Box-Cox transformation, the variance appears slightly smoother, but it wasn’t highly unstable in the first place, so the overall structure didn’t change much. The seasonal structure of the data is preserved. The transformation does not eliminate the outlier, which confirms that the outlier is a genuine anomaly. Given the limited improvement and loss of direct interpretability, I decided to not apply the box-cox transformation to the forecasting model.
This section describes the development, training, and selection of the forecasting models used to predict the monthly residential power usage for 2014.
I tested 3 different forecasting models for this data:
The SNAIVE model was not considered because it does not capture trend + seasonality interaction.
# fit models
fit_models_power <- power %>%
model(
ets = ETS(KWH),
arima = ARIMA(KWH),
stl_ets = decomposition_model(
STL(KWH ~ season(window = "periodic")),
ETS(season_adjust ~ error("A") + trend("A") + season("A"))
)
)
# MODEL INFO
glimpse(fit_models_power)
## Rows: 1
## Columns: 3
## $ ets <model> [ETS(M,N,M)]
## $ arima <model> [ARIMA(0,0,1)(1,1,1)[12] w/ drift]
## $ stl_ets <model> [STL decomposition model]
# Check accuracy for each model - focusing on MAE, RMSE, and MAPE
accuracy(fit_models_power)
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets Training 62242. 826660. 500307. -4.33 12.0 0.729 0.705 0.173
## 2 arima Training -25758. 823919. 489801. -5.52 11.6 0.714 0.702 0.0131
## 3 stl_ets Training 53043. 821514. 504335. -4.46 11.8 0.735 0.700 0.233
Both the STL_ETS and ARIMA produce strong forecasts. Although STL+ETS yielded the smallest RMSE, ARIMA achieved lower average errors (MAE and MAPE) and cleaner residual diagnostics (near-zero autocorrelation). So, ARIMA was selected as the preferred model for forecasting 2014 residential power usage. Specifically, the selected model is ARIMA(0,0,1)(1,1,1)[12] w/ drift.
Once the best model was selected, I forecasted the power usage for 2014.
# ARIMA model
arima_model <- power %>%
model(ARIMA(KWH))
# future 12 months
future_power <- new_data(power,12)
# generate forecasts
fc_power <- forecast(arima_model, new_data = future_power)
# write to excel
fc_power %>%
write_xlsx('2014_Power_Usage_Forecasts.xlsx')
# plot forecasts
autoplot(fc_power, power) +
labs(
title = "Residential Power Usage Forecast for 2014\n(ARIMA Model)",
x = "Month",
y = "Power (KWH)"
) +
theme(
plot.title = element_text(hjust = 0.5, face='bold')
)
fc_power
## # A fable: 12 x 4 [1M]
## # Key: .model [1]
## .model Date
## <chr> <mth>
## 1 ARIMA(KWH) 2014 Jan
## 2 ARIMA(KWH) 2014 Feb
## 3 ARIMA(KWH) 2014 Mar
## 4 ARIMA(KWH) 2014 Apr
## 5 ARIMA(KWH) 2014 May
## 6 ARIMA(KWH) 2014 Jun
## 7 ARIMA(KWH) 2014 Jul
## 8 ARIMA(KWH) 2014 Aug
## 9 ARIMA(KWH) 2014 Sep
## 10 ARIMA(KWH) 2014 Oct
## 11 ARIMA(KWH) 2014 Nov
## 12 ARIMA(KWH) 2014 Dec
## # ℹ 2 more variables: KWH <dist>, .mean <dbl>
The forecast plots show the 12 month forecasts for 2014, overlaid on historical residential power usage data between 1998 and 2013. The shaded regions represent the 80% and 95% prediction intervals, showing the expected range of power usage. The forecasts show a clear seasonal pattern consistent with prior years, with peaks in the winter (January) and late summer (August), consistent with heating/cooling demand cycles. Prediction uncertainty remains moderate and stable across the year. Overall, we can conclude that this is a relatively reliable forecast.