Import Libraries
library(fpp3)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readxl)

Part 1 - ATM Data
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
Load and Inspect Data
atm <- read_excel("/Users/alecmccabe/Downloads/ATM624DATA.xlsx",
                  col_types = c('date', 'text', 'numeric')) %>%
  # convert DATE into date format
  mutate(DATE = as_date(DATE)) %>%
  # convert to tsibble with DATE as index and ATM as key
  as_tsibble(index = DATE, key = ATM)
print(dim(atm))

## [1] 1474    3
summary(atm)

##       DATE                 ATM                 Cash
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5
##  Median :2009-11-01   Mode  :character   Median :   73.0
##  Mean   :2009-10-31                      Mean   :  155.6
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0
##  Max.   :2010-05-14                      Max.   :10919.8
##                                          NA's   :19
The ATM dataset contains 1474 entries. Each entry corresponds to the cash withdrawn from a specific ATM on a specific date. The dates range from May 2009 to May 2010.
colSums(is.na(atm))

## DATE  ATM Cash
##    0   14   19
atm[is.na(atm$ATM),]

There are 14 missing values in the ATM column, but upon inspection they all fall in May 2010, the period we are attempting to forecast. We can remove these rows without affecting the analysis.
atm <- atm %>%
  drop_na(ATM)

atm[is.na(atm$Cash),]

After removing the entries with missing ATM values, there are 5 entries left with missing Cash values, all from June 2009. In later steps we will consider different ways to handle them.
We can overcome the missing values in a few ways:

- First, we can replace the missing values with the mean or median value for that ATM and month.
- Second, we can aggregate the data to a weekly or monthly level. This may not be the best approach, since all of the missing Cash values come from the same 2-week period in June; a monthly aggregation could potentially be appropriate.
- Lastly, we can skip imputation and rely on an ARIMA model, which can inherently handle missing values.
For this project, I will explore options 1 and 3. I will create a new dataframe that imputes the missing Cash values from June, retain the existing dataset to test ARIMA without imputation later, and then compare the results of the two approaches.
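For context on the third option: fable's ARIMA() can be estimated directly on a series that still contains NAs, so no imputation step is strictly required for that model. A minimal sketch (run before the imputation below; the automatic model selection is purely illustrative):

# sketch: fit an ARIMA model on the un-imputed ATM1 series;
# fable's estimation handles the missing observations internally
atm %>%
  filter(ATM == 'ATM1') %>%
  model(arima_na = ARIMA(Cash)) %>%
  report()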
# calculate median June values for ATM1 and ATM2
atm1_median <- median(atm[(atm$ATM == 'ATM1') & (month(atm$DATE) == 6) & (!is.na(atm$Cash)),]$Cash)
atm2_median <- median(atm[(atm$ATM == 'ATM2') & (month(atm$DATE) == 6) & (!is.na(atm$Cash)),]$Cash)

# replace missing values with the medians
atm[(atm$ATM == 'ATM1') & (is.na(atm$Cash)),]$Cash = atm1_median
atm[(atm$ATM == 'ATM2') & (is.na(atm$Cash)),]$Cash = atm2_median

Exploring the Data
atm %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Time Series by ATM")

From the above, we can see that ATMs 1 and 2 have similar series, with continued withdrawal activity across the period. It is difficult to ascertain the behavior of ATM4 because of an extreme outlier, which we will need to address.
ATM 3 differs, mainly in that it exhibits significantly smaller levels of cash withdrawals. In fact, ATM3 has 0 withdrawals until late April 2010, after which there is a large spike. Is this spike an outlier, or was this third ATM newly installed on a busy corner? Regardless, we don't have enough data for this ATM to include it in our analysis, so we will remove it in further steps.
atm <- atm %>%
  filter(ATM != 'ATM3')

For ATM 4, we will need to address the single extreme outlier occurring around March of 2010. We can do this by replacing it with the median of the series.
atm4_max <- max(atm[atm$ATM == 'ATM4',]$Cash)
atm4_median <- median(atm[atm$ATM == 'ATM4',]$Cash)
atm[(atm$ATM == 'ATM4') & (atm$Cash == atm4_max),]$Cash = atm4_median

atm %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Time Series by ATM")

atm %>%
  ACF(Cash) %>%
  autoplot() +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM ACF Plots by ATM")

From the ACF plots above, we can see that ATMs 1 & 2 both exhibit weekly seasonality. We cannot see the same type of seasonality from ATM 4.
Exploring ATM 1
atm1 <- atm %>%
  filter(ATM == 'ATM1')

# ATM1 STL decomposition
atm1 %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "ATM1 with Seasonal Decomposition")

Based on the seasonal decomposition above, we can see that there is no significant trend in the series. The data follows a weekly seasonality. We also see that the remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010.
atm1 |>
  gg_lag(Cash, geom = "point") +
  labs(x = "lag(Cash, k)")

Now looking at the lag plot for ATM 1, we can once again see that this ATM's time series follows a weekly (7-day) seasonality cycle.
# create training set, holding out the last month for comparison
atm1_train <- atm1 %>%
  filter(DATE < '2010-04-01')

# lambda for Box-Cox transformation
atm1_lambda <- atm1_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
atm1_fit <- atm1_train %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # seasonal naive model
    snaive = SNAIVE(Cash),
    # Box-Cox transformed additive ETS model
    additive_ts = ETS(box_cox(Cash, atm1_lambda) ~ error("A") + trend("N") + season("A")),
    # Box-Cox transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash, atm1_lambda) ~ error("M") + trend("N") + season("M")),
    # Box-Cox transformed seasonal naive model
    snaive_ts = SNAIVE(box_cox(Cash, atm1_lambda)),
    # ARIMA model (automatic order selection)
    ARIMA = ARIMA(box_cox(Cash, atm1_lambda))
  )
accuracy(atm1_fit)

For ATM 1, the transformed multiplicative ETS model produced the lowest RMSE.
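The accuracy table is easier to scan when sorted by RMSE; a small convenience sketch:

# rank candidate models by training RMSE, best first
accuracy(atm1_fit) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE)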
atm1_forecast <- atm1_fit %>%
  forecast(h = 31) %>%
  filter(.model == 'multiplicative_ts')

atm1_forecast %>%
  autoplot(atm1, level = NULL) +
  labs(title = "Transformed Multiplicative ETS Forecast")

# report of the transformed multiplicative ETS model
atm1_fit %>% select(multiplicative_ts) %>% report()

## Series: Cash
## Model: ETS(M,N,M)
## Transformation: box_cox(Cash, atm1_lambda)
## Smoothing parameters:
## alpha = 0.03183972
## gamma = 0.2749021
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 11.94849 0.4187292 1.012255 1.149805 1.054564 1.076637 1.048191 1.239819
##
## sigma^2: 0.0533
##
## AIC AICc BIC
## 2657.902 2658.581 2696.043
# residuals
atm1_fit %>%
  select(multiplicative_ts) %>%
  gg_tsresiduals()

# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm1_fit %>%
  select(multiplicative_ts) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 14, dof = 0)

A large p-value here indicates the residuals are indistinguishable from white noise, i.e. the model has captured the available signal.

Exploring ATM 2
atm2 <- atm %>%
  filter(ATM == 'ATM2')

# ATM2 STL decomposition
atm2 %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "ATM2 with Seasonal Decomposition")

Based on the seasonal decomposition above, we can again see that there is no significant trend in the series, and the data follows a weekly seasonality. The remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010.
atm2 |>
  gg_lag(Cash, geom = "point") +
  labs(x = "lag(Cash, k)")

Looking at the lag plot for ATM 2, we once again see a weekly (7-day) seasonality cycle.
# create training set, holding out the last month for comparison
atm2_train <- atm2 %>%
  filter(DATE < '2010-04-01')

# lambda for Box-Cox transformation
atm2_lambda <- atm2_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
atm2_fit <- atm2_train %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # seasonal naive model
    snaive = SNAIVE(Cash),
    # Box-Cox transformed additive ETS model
    additive_ts = ETS(box_cox(Cash, atm2_lambda) ~ error("A") + trend("N") + season("A")),
    # Box-Cox transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash, atm2_lambda) ~ error("M") + trend("N") + season("M")),
    # Box-Cox transformed seasonal naive model
    snaive_ts = SNAIVE(box_cox(Cash, atm2_lambda)),
    # ARIMA model (automatic order selection)
    ARIMA = ARIMA(box_cox(Cash, atm2_lambda))
  )
accuracy(atm2_fit)

For ATM 2, the ARIMA and transformed additive ETS models produced the lowest RMSE values. We can compare their AIC and BIC to determine which model to move forward with.
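The information criteria can be pulled side by side with fabletools' glance(); a sketch for the two leading models:

# compare information criteria for the two candidates
glance(atm2_fit) %>%
  filter(.model %in% c('additive_ts', 'ARIMA')) %>%
  select(.model, AIC, AICc, BIC)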
atm2_fit %>%
  forecast(h = 31) %>%
  filter(.model %in% c('additive_ts', 'ARIMA')) %>%
  autoplot(atm2, level = NULL) +
  labs(title = "ATM2 Forecasts: Transformed Additive ETS vs ARIMA")

# report of the transformed additive ETS model
atm2_fit %>% select(additive_ts) %>% report()

## Series: Cash
## Model: ETS(A,N,A)
## Transformation: box_cox(Cash, atm2_lambda)
## Smoothing parameters:
## alpha = 0.0001000105
## gamma = 0.3788876
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 28.54438 -19.49563 -14.30877 4.824496 1.906367 7.832009 9.026157 10.21537
##
## sigma^2: 83.0954
##
## AIC AICc BIC
## 3439.307 3439.986 3477.448
# report of the ARIMA model
atm2_fit %>% select(ARIMA) %>% report()

## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, atm2_lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4242 -0.9271 0.462 0.8169 -0.7544
## s.e. 0.0503 0.0425 0.081 0.0574 0.0460
##
## sigma^2 estimated as 78.22: log likelihood=-1180.26
## AIC=2372.52 AICc=2372.78 BIC=2395.27
The ARIMA model has significantly lower AIC and BIC values. Considering it also has the lowest RMSE, it is the clear choice for ATM2. (Strictly speaking, information criteria are only comparable within a model class, but the RMSE comparison points the same way.)
atm2_forecast <- atm2_fit %>%
  forecast(h = 31) %>%
  filter(.model == 'ARIMA')

# residuals
atm2_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals()

# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm2_fit %>%
  select(ARIMA) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 14, dof = 0)

Exploring ATM 4
atm4 <- atm %>%
  filter(ATM == 'ATM4')

# ATM4 STL decomposition
atm4 %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "ATM4 with Seasonal Decomposition")

Based on the seasonal decomposition above, we can see that there is no significant trend in the series, and the data follows a weekly seasonality. The remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010.
atm4 |>
  gg_lag(Cash, geom = "point") +
  labs(x = "lag(Cash, k)")

Looking at the lag plot for ATM 4, we again see a weekly (7-day) seasonality cycle, though it is less pronounced than for ATMs 1 and 2.
# create training set, holding out the last month for comparison
atm4_train <- atm4 %>%
  filter(DATE < '2010-04-01')

# lambda for Box-Cox transformation
atm4_lambda <- atm4_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
atm4_fit <- atm4_train %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # seasonal naive model
    snaive = SNAIVE(Cash),
    # Box-Cox transformed additive ETS model
    additive_ts = ETS(box_cox(Cash, atm4_lambda) ~ error("A") + trend("N") + season("A")),
    # Box-Cox transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash, atm4_lambda) ~ error("M") + trend("N") + season("M")),
    # Box-Cox transformed seasonal naive model
    snaive_ts = SNAIVE(box_cox(Cash, atm4_lambda)),
    # ARIMA model (automatic order selection)
    ARIMA = ARIMA(box_cox(Cash, atm4_lambda))
  )
accuracy(atm4_fit)

For ATM 4, the transformed additive ETS model produced the lowest RMSE.
atm4_forecast <- atm4_fit %>%
  forecast(h = 40) %>%
  filter(.model == 'additive_ts')

atm4_forecast %>%
  autoplot(atm4, level = NULL) +
  labs(title = "Transformed Additive ETS Forecast")

While the mean of the forecasted values matches the mean of the last months in the test set, we can see that the chosen model does not do a great job of capturing the true variance of cash withdrawals from ATM 4.
# report of the transformed additive ETS model
atm4_fit %>% select(additive_ts) %>% report()

## Series: Cash
## Model: ETS(A,N,A)
## Transformation: box_cox(Cash, atm4_lambda)
## Smoothing parameters:
## alpha = 0.000100027
## gamma = 0.0001000061
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 23.12951 -10.53245 -0.008230648 0.3562908 2.212736 2.473327 2.258718 3.23961
##
## sigma^2: 92.8782
##
## AIC AICc BIC
## 3476.592 3477.271 3514.734
# residuals
atm4_fit %>%
  select(additive_ts) %>%
  gg_tsresiduals()

# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm4_fit %>%
  select(additive_ts) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 14, dof = 0)

Producing Forecasts for May
forecasts <- bind_rows(atm1_forecast, atm2_forecast, atm4_forecast) %>%
  as.data.frame() %>%
  select(DATE, ATM, .mean) %>%
  rename(Cash = .mean)

forecasts

# plot all forecasts alongside the historical series
forecasts %>%
  as_tsibble(index = DATE, key = ATM) %>%
  autoplot(Cash) +
  autolayer(atm, Cash, colour = "black") +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals")
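Finally, the assignment asks for the forecasts in an Excel-readable file. A minimal sketch using base R's write.csv (the filename is illustrative):

# write the May forecasts to an Excel-readable CSV file
write.csv(forecasts, "ATM624_forecasts_May2010.csv", row.names = FALSE)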