In Part A of Project1, the task is to forecast the cash withdrawals from four different ATM machines for May 2010. The data is provided in a single file, with the variable ‘Cash’ presented in hundreds of dollars. Apart from this variable, the dataset is straightforward. The intentional ambiguity aims to infuse a business-like aura into the task. The report should include an explanation of the process, the techniques employed, those omitted, and the actual forecast. Data will be shared via an Excel file. Please present your findings, visuals, discussions, and R code through an RPubs link, along with the actual .rmd file. Additionally, please provide the forecast in a format compatible with Excel.
Uploading the data
Checking datatype is numeric for the Cash variable.
str(df_atm)
## 'data.frame': 1474 obs. of 3 variables:
## $ DATE: chr "5/1/2009 12:00:00 AM" "5/1/2009 12:00:00 AM" "5/2/2009 12:00:00 AM" "5/2/2009 12:00:00 AM" ...
## $ ATM : chr "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: int 96 107 82 89 85 90 90 55 99 79 ...
Create a tsibble object
df_atm_ts <- df_atm %>%
mutate(DATE = mdy_hms(DATE))%>%
mutate(DATE = as_date(DATE))%>%
as_tsibble(key = ATM, index = DATE)
Plotting the four ATM machines to gain a better understanding of their respective time series.
p1_a <- df_atm_ts %>%
filter(ATM == "ATM1") %>%
autoplot(Cash) +
labs(title= "ATM1")
p2_a <- df_atm_ts %>%
filter(ATM == "ATM2") %>%
autoplot(Cash) +
labs(title= "ATM2")
p3_a <- df_atm_ts %>%
filter(ATM == "ATM3") %>%
autoplot(Cash) +
labs(title= "ATM3")
p4_a <-df_atm_ts %>%
filter(ATM == "ATM4") %>%
autoplot(Cash) +
labs(title= "ATM4")
grid.arrange(p1_a, p2_a, p3_a, p4_a, nrow = 2)
Reviewing the plot, it becomes evident that there is no discernible trend or cycle in the dataset for any of the ATMs. Instead, it resembles white noise and could potentially be standardized, suggesting the utilization of ARIMA models. However, a further examination of the ACF plot is necessary to identify any underlying seasonality.
Analyzing the plots, a notable observation is the abundance of zero values for ATM3. It may be plausible to impute these values from ATM1, ATM2, and/or ATM4. Confidence in this approach would be higher if ATM4 exhibited a similar scale to ATM1 and ATM2. An outlier value is evident in the plot for ATM4, prompting the decision to replace it with the mean of ATM4.
Inspecting for missing data reveals minimal absence in the Cash variables for each ATM. Consequently, for imputation purposes, the mean of ATM1 and ATM2 was chosen.
df_atm_ts %>%
filter(is.na(Cash))
## # A tsibble: 19 x 3 [1D]
## # Key: ATM [3]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2010-05-01 "" NA
## 2 2010-05-02 "" NA
## 3 2010-05-03 "" NA
## 4 2010-05-04 "" NA
## 5 2010-05-05 "" NA
## 6 2010-05-06 "" NA
## 7 2010-05-07 "" NA
## 8 2010-05-08 "" NA
## 9 2010-05-09 "" NA
## 10 2010-05-10 "" NA
## 11 2010-05-11 "" NA
## 12 2010-05-12 "" NA
## 13 2010-05-13 "" NA
## 14 2010-05-14 "" NA
## 15 2009-06-13 "ATM1" NA
## 16 2009-06-16 "ATM1" NA
## 17 2009-06-22 "ATM1" NA
## 18 2009-06-18 "ATM2" NA
## 19 2009-06-24 "ATM2" NA
Examining the histogram plot to assess if the distribution is normalized.
# Gather the data into a long format
df_atm_long <- df_atm_ts%>%
select(!DATE) %>%
rename(value = Cash,
key = ATM)
ggplot(df_atm_long, aes(value)) +
geom_histogram(bins = 20) +
facet_wrap(~key, scales = 'free')+
labs(title = "Histogram of ATM")
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_bin()`).
It’s necessary to observe the distribution after removing the outlier in ATM4. Additionally, I’ve opted to isolate ATM3 to a single month of data.
Imputing mean values for ATM1 and ATM2.
df_atm_1 <- df_atm_ts %>%
filter(ATM == "ATM1") %>%
mutate(Cash = if_else(is.na(Cash), round(mean(Cash, na.rm = TRUE)), Cash))
df_atm_2 <- df_atm_ts %>%
filter(ATM == "ATM2") %>%
mutate(Cash = if_else(is.na(Cash), round(mean(Cash, na.rm = TRUE)), Cash))
Isolating the ATM data to just the month of April 2010.
df_atm_3 <- df_atm_ts %>%
filter(ATM == "ATM3")%>%
filter(DATE > "2010-03-31")
Imputing the mean value for the outlier in ATM4.
df_atm_4 <- df_atm_ts %>%
filter(ATM == "ATM4")%>%
mutate(Cash= if_else(Cash == 10920, round(mean(Cash, na.rm = TRUE)), Cash))
Re-plotting ATM4 to determine if the scale has shifted to a similar scale as ATM1 and ATM2.
df_atm_4%>%
autoplot(Cash) +
labs(y="Cash in hundreds of dollars", title="ATM4 with mean imputation of outlier")
There’s something peculiar about ATM4. Its cash values seem
almost identical to those of ATM1, 2, and 3, but with an additional
zero. The highest value in ATM4 is 170,100, which is highly suspicious
considering that all the other ATMs hover around 15,000 as the highest
daily amount. According to an article on moneyryde.com(https://moneyryde.com/how-much-money-does-an-atm-hold/),
an ATM can hold about 200,000 or more. Does this imply that ATM4 would
need to be replenished with cash daily or weekly? I’ve decided not to
change the scale of the values but will flag this information for
further investigation.
# Plot histogram using ggplot2
ggplot(df_atm_4, aes(x = Cash)) +
geom_histogram(binwidth = 20) +
labs(title = "Histogram of Cash in ATM4", x = "Values", y = "Frequency")
Generating ACF plot to check for seasonality.
For ATM1, a significant spike is observed at certain lags, indicating the presence of seasonality in the data. Notably, there is a significant peak at lag 7, suggesting a weekly seasonality.
df_atm_1 %>%
ACF(Cash)%>%
autoplot()
For ATM2, a significant spike is observed at certain lags,
indicating the presence of seasonality in the data. Notably, there is a
significant peak at lag 7, suggesting a weekly seasonality.
df_atm_2 %>%
ACF(Cash)%>%
autoplot()
For ATM3, there is insufficient information to confirm the
presence of seasonality.
df_atm_3 %>%
ACF(Cash)%>%
autoplot()
For ATM4, a significant spike is observed at certain lags,
indicating the presence of seasonality in the data. Notably, there is a
significant peak at lag 7, suggesting a weekly seasonality.
df_atm_4%>%
ACF(Cash)%>%
autoplot()
For all the ATMs exhibiting seasonality, I initially employed a SNAIVE model to forecast for May 2010. SNAIVE is a suitable choice when dealing with data demonstrating strong seasonal patterns, as it directly forecasts future values based on corresponding values from previous seasonal periods.
However, upon conducting residual diagnostics tests, while the plots of the residuals, ACF, and histogram provided the characteristics I needed, the Portmanteau test of the SNAIVE models yielded significant p-values. A significant p-value typically indicates the presence of autocorrelation in the residuals. Autocorrelation in the residuals suggests that the model has not captured all the information present in the data, and there are still patterns or structures that the model has not explained adequately.
As a result, I opted to utilize the Holt-Winters’ method for seasonality, which yielded improved p-values.
lambda_1 <- df_atm_1 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
fit_atm_1_a <- df_atm_1%>%
model(SNAIVE = SNAIVE(box_cox(Cash, lambda_1)),
ETS = ETS((box_cox(Cash, lambda_1) ~ error("A") + trend("A") +
season("A"))))
fit_atm_1_b <- df_atm_1%>%
model(ETS = ETS((box_cox(Cash, lambda_1) ~ error("A") + trend("A") +
season("A"))))
fit_atm_1_c <- df_atm_1%>%
model(SNAIVE = SNAIVE(box_cox(Cash, lambda_1)))
fc_atm_1 <- fit_atm_1_a %>%
forecast(h= 31)
fc_atm_1%>%
autoplot(df_atm_1, level = 80) +
labs(y = "Cash in hundreds of dollars",
title = "ATM1 Forecast in May 2010")
fit_atm_1_b |> gg_tsresiduals()
fit_atm_1_c |> gg_tsresiduals()
augment(fit_atm_1_a)%>%
features(.innov, ljung_box, lag = 7)
## # A tibble: 2 × 4
## ATM .model lb_stat lb_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM1 ETS 15.3 3.27e- 2
## 2 ATM1 SNAIVE 80.6 1.02e-14
lambda_2 <- df_atm_2%>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
fit_atm_2_a <- df_atm_2%>%
model(SNAIVE = SNAIVE(box_cox(Cash, lambda_2)),
ETS = ETS((box_cox(Cash, lambda_2) ~ error("A") + trend("A") +
season("A"))))
fit_atm_2_b <- df_atm_2%>%
model(ETS = ETS((box_cox(Cash, lambda_2) ~ error("A") + trend("A") +
season("A"))))
fit_atm_2_c <- df_atm_2%>%
model(SNAIVE = SNAIVE(box_cox(Cash, lambda_2)))
fc_atm_2 <- fit_atm_2_a %>%
forecast(h= 31)
fc_atm_2%>%
autoplot(df_atm_2, level = 80) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed gas production with $\\lambda$ = ",
round(lambda_2,2))))
fit_atm_2_c |> gg_tsresiduals()
fit_atm_2_b |> gg_tsresiduals()
augment(fit_atm_2_a)%>%
features(.innov, ljung_box, lag = 7)
## # A tibble: 2 × 4
## ATM .model lb_stat lb_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM2 ETS 22.1 0.00242
## 2 ATM2 SNAIVE 94.2 0
lambda_4 <- df_atm_4 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
fit_atm_4_a <- df_atm_4%>%
model(SNAIVE = SNAIVE(box_cox(Cash, lambda_4)),
ETS = ETS((box_cox(Cash, lambda_4) ~ error("A") + trend("A") +
season("A"))))
fit_atm_4_b <- df_atm_4%>%
model(ETS = ETS((box_cox(Cash, lambda_4) ~ error("A") + trend("A") +
season("A"))))
fit_atm_4_c <- df_atm_4%>%
model(SNAIVE = SNAIVE(box_cox(Cash, lambda_4)))
fc_atm_4 <- fit_atm_4_a %>%
forecast(h= 31)
fc_atm_4%>%
autoplot(df_atm_4, level = 80) +
labs(y = "Cash in hundreds of dollars",
title = "ATM1 Forecast in May 2010")
fit_atm_4_b |> gg_tsresiduals()
fit_atm_4_c |> gg_tsresiduals()
augment(fit_atm_4_a)%>%
features(.innov, ljung_box, lag = 7)
## # A tibble: 2 × 4
## ATM .model lb_stat lb_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM4 ETS 12.4 0.0894
## 2 ATM4 SNAIVE 99.6 0
For ATM3, I opted for the naive model due to the limited availability of information.
fit_atm_3 <- df_atm_3%>%
model(NAIVE(Cash))
fc_atm_3 <- fit_atm_3 %>%
forecast(h= 31)
fc_atm_3%>%
autoplot(df_atm_3, level = 80) +
labs(y = "",
title =
"Transformed gas production with $\\lambda$ = ",
)
fit_atm_3 |> gg_tsresiduals()
augment(fit_atm_3)%>%
features(.innov, ljung_box, lag = 10)
## # A tibble: 1 × 4
## ATM .model lb_stat lb_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM3 NAIVE(Cash) 1.03 1.00