In Part A of Project1, the task is to forecast the cash withdrawals from four different ATM machines for May 2010. The data is provided in a single file, with the variable ‘Cash’ presented in hundreds of dollars. Apart from this variable, the dataset is straightforward. The intentional ambiguity aims to infuse a business-like aura into the task. The report should include an explanation of the process, the techniques employed, those omitted, and the actual forecast. Data will be shared via an Excel file. Please present your findings, visuals, discussions, and R code through an RPubs link, along with the actual .rmd file. Additionally, please provide the forecast in a format compatible with Excel.

Exploratory Data Analysis (EDA)

Uploading the data

Checking datatype is numeric for the Cash variable.

str(df_atm)
## 'data.frame':    1474 obs. of  3 variables:
##  $ DATE: chr  "5/1/2009 12:00:00 AM" "5/1/2009 12:00:00 AM" "5/2/2009 12:00:00 AM" "5/2/2009 12:00:00 AM" ...
##  $ ATM : chr  "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: int  96 107 82 89 85 90 90 55 99 79 ...

Create a tsibble object

df_atm_ts  <- df_atm %>%
  mutate(DATE = mdy_hms(DATE))%>%
  mutate(DATE = as_date(DATE))%>%
  as_tsibble(key = ATM, index = DATE)

Plotting the four ATM machines to gain a better understanding of their respective time series.

p1_a <- df_atm_ts %>%
  filter(ATM == "ATM1") %>%
  autoplot(Cash) +
  labs(title= "ATM1")

p2_a <- df_atm_ts %>%
  filter(ATM == "ATM2") %>%
  autoplot(Cash) +
  labs(title= "ATM2")

p3_a <- df_atm_ts %>%
  filter(ATM == "ATM3") %>%
  autoplot(Cash) +
  labs(title= "ATM3")

p4_a <-df_atm_ts %>%
  filter(ATM == "ATM4") %>%
  autoplot(Cash) +
  labs(title= "ATM4")

grid.arrange(p1_a, p2_a, p3_a, p4_a, nrow = 2)



Reviewing the plot, it becomes evident that there is no discernible trend or cycle in the dataset for any of the ATMs. Instead, it resembles white noise and could potentially be standardized, suggesting the utilization of ARIMA models. However, a further examination of the ACF plot is necessary to identify any underlying seasonality.

Analyzing the plots, a notable observation is the abundance of zero values for ATM3. It may be plausible to impute these values from ATM1, ATM2, and/or ATM4. Confidence in this approach would be higher if ATM4 exhibited a similar scale to ATM1 and ATM2. An outlier value is evident in the plot for ATM4, prompting the decision to replace it with the mean of ATM4.

Inspecting for missing data reveals minimal absence in the Cash variables for each ATM. Consequently, for imputation purposes, the mean of ATM1 and ATM2 was chosen.

df_atm_ts %>%
  filter(is.na(Cash))
## # A tsibble: 19 x 3 [1D]
## # Key:       ATM [3]
##    DATE       ATM     Cash
##    <date>     <chr>  <int>
##  1 2010-05-01 ""        NA
##  2 2010-05-02 ""        NA
##  3 2010-05-03 ""        NA
##  4 2010-05-04 ""        NA
##  5 2010-05-05 ""        NA
##  6 2010-05-06 ""        NA
##  7 2010-05-07 ""        NA
##  8 2010-05-08 ""        NA
##  9 2010-05-09 ""        NA
## 10 2010-05-10 ""        NA
## 11 2010-05-11 ""        NA
## 12 2010-05-12 ""        NA
## 13 2010-05-13 ""        NA
## 14 2010-05-14 ""        NA
## 15 2009-06-13 "ATM1"    NA
## 16 2009-06-16 "ATM1"    NA
## 17 2009-06-22 "ATM1"    NA
## 18 2009-06-18 "ATM2"    NA
## 19 2009-06-24 "ATM2"    NA

Examining the histogram plot to assess if the distribution is normalized.

# Gather the data into a long format
df_atm_long <- df_atm_ts%>%
  select(!DATE) %>%
  rename(value = Cash,
         key = ATM)


ggplot(df_atm_long, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~key, scales = 'free')+
  labs(title = "Histogram of ATM")
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_bin()`).



It’s necessary to observe the distribution after removing the outlier in ATM4. Additionally, I’ve opted to isolate ATM3 to a single month of data.

Imputing mean values for ATM1 and ATM2.

df_atm_1 <- df_atm_ts %>%
  filter(ATM == "ATM1") %>%
  mutate(Cash = if_else(is.na(Cash), round(mean(Cash, na.rm = TRUE)), Cash))
df_atm_2 <- df_atm_ts %>%
  filter(ATM == "ATM2") %>%
  mutate(Cash = if_else(is.na(Cash), round(mean(Cash, na.rm = TRUE)), Cash))

Isolating the ATM data to just the month of April 2010.

df_atm_3 <- df_atm_ts %>%
  filter(ATM == "ATM3")%>%
  filter(DATE > "2010-03-31")

Imputing the mean value for the outlier in ATM4.

df_atm_4 <- df_atm_ts %>%
  filter(ATM == "ATM4")%>%
  mutate(Cash= if_else(Cash == 10920, round(mean(Cash, na.rm = TRUE)), Cash))

Re-plotting ATM4 to determine if the scale has shifted to a similar scale as ATM1 and ATM2.

df_atm_4%>%
  autoplot(Cash) +
  labs(y="Cash in hundreds of dollars", title="ATM4 with mean imputation of outlier")



There’s something peculiar about ATM4. Its cash values seem almost identical to those of ATM1, 2, and 3, but with an additional zero. The highest value in ATM4 is 170,100, which is highly suspicious considering that all the other ATMs hover around 15,000 as the highest daily amount. According to an article on moneyryde.com(https://moneyryde.com/how-much-money-does-an-atm-hold/), an ATM can hold about 200,000 or more. Does this imply that ATM4 would need to be replenished with cash daily or weekly? I’ve decided not to change the scale of the values but will flag this information for further investigation.

# Plot histogram using ggplot2
ggplot(df_atm_4, aes(x = Cash)) +
  geom_histogram(binwidth = 20) +
  labs(title = "Histogram of Cash in ATM4", x = "Values", y = "Frequency")



ACF Plots

Generating ACF plot to check for seasonality.

For ATM1, a significant spike is observed at certain lags, indicating the presence of seasonality in the data. Notably, there is a significant peak at lag 7, suggesting a weekly seasonality.

df_atm_1 %>%
  ACF(Cash)%>%
  autoplot()



For ATM2, a significant spike is observed at certain lags, indicating the presence of seasonality in the data. Notably, there is a significant peak at lag 7, suggesting a weekly seasonality.

df_atm_2 %>%
  ACF(Cash)%>%
  autoplot()



For ATM3, there is insufficient information to confirm the presence of seasonality.

df_atm_3 %>%
  ACF(Cash)%>%
  autoplot()



For ATM4, a significant spike is observed at certain lags, indicating the presence of seasonality in the data. Notably, there is a significant peak at lag 7, suggesting a weekly seasonality.

df_atm_4%>%
  ACF(Cash)%>%
  autoplot()



For all the ATMs exhibiting seasonality, I initially employed a SNAIVE model to forecast for May 2010. SNAIVE is a suitable choice when dealing with data demonstrating strong seasonal patterns, as it directly forecasts future values based on corresponding values from previous seasonal periods.

However, upon conducting residual diagnostics tests, while the plots of the residuals, ACF, and histogram provided the characteristics I needed, the Portmanteau test of the SNAIVE models yielded significant p-values. A significant p-value typically indicates the presence of autocorrelation in the residuals. Autocorrelation in the residuals suggests that the model has not captured all the information present in the data, and there are still patterns or structures that the model has not explained adequately.

As a result, I opted to utilize the Holt-Winters’ method for seasonality, which yielded improved p-values.

ATM1 Forecast and Residual diagnostics

lambda_1 <- df_atm_1 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

fit_atm_1_a <- df_atm_1%>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_1)),
        ETS = ETS((box_cox(Cash, lambda_1) ~ error("A") + trend("A") +
                                                season("A"))))

fit_atm_1_b <- df_atm_1%>%
  model(ETS = ETS((box_cox(Cash, lambda_1) ~ error("A") + trend("A") +
                                                season("A"))))
fit_atm_1_c <- df_atm_1%>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_1)))


fc_atm_1 <- fit_atm_1_a %>%
  forecast(h= 31)

fc_atm_1%>%
  autoplot(df_atm_1, level = 80) +
  labs(y = "Cash in hundreds of dollars",
       title = "ATM1 Forecast in May 2010")

fit_atm_1_b |> gg_tsresiduals()

fit_atm_1_c |> gg_tsresiduals()

augment(fit_atm_1_a)%>%
  features(.innov, ljung_box, lag = 7)
## # A tibble: 2 × 4
##   ATM   .model lb_stat lb_pvalue
##   <chr> <chr>    <dbl>     <dbl>
## 1 ATM1  ETS       15.3  3.27e- 2
## 2 ATM1  SNAIVE    80.6  1.02e-14

ATM2 Forecast and Residual diagnostics

lambda_2 <- df_atm_2%>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

fit_atm_2_a <- df_atm_2%>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_2)),
        ETS = ETS((box_cox(Cash, lambda_2) ~ error("A") + trend("A") +
                                                season("A"))))

fit_atm_2_b <- df_atm_2%>%
  model(ETS = ETS((box_cox(Cash, lambda_2) ~ error("A") + trend("A") +
                                                season("A"))))
fit_atm_2_c <- df_atm_2%>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_2)))

fc_atm_2 <- fit_atm_2_a %>%
  forecast(h= 31)

fc_atm_2%>%
  autoplot(df_atm_2, level = 80) +
  labs(y = "",
       title = latex2exp::TeX(paste0(
         "Transformed gas production with $\\lambda$ = ",
         round(lambda_2,2))))

fit_atm_2_c |> gg_tsresiduals()

fit_atm_2_b |> gg_tsresiduals()

augment(fit_atm_2_a)%>%
  features(.innov, ljung_box, lag = 7)
## # A tibble: 2 × 4
##   ATM   .model lb_stat lb_pvalue
##   <chr> <chr>    <dbl>     <dbl>
## 1 ATM2  ETS       22.1   0.00242
## 2 ATM2  SNAIVE    94.2   0

ATM4 Forecast and Residual diagnostics

lambda_4 <- df_atm_4 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

fit_atm_4_a <- df_atm_4%>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_4)),
        ETS = ETS((box_cox(Cash, lambda_4) ~ error("A") + trend("A") +
                                                season("A"))))

fit_atm_4_b <- df_atm_4%>%
  model(ETS = ETS((box_cox(Cash, lambda_4) ~ error("A") + trend("A") +
                                                season("A"))))
fit_atm_4_c <- df_atm_4%>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_4)))


fc_atm_4 <- fit_atm_4_a %>%
  forecast(h= 31)

fc_atm_4%>%
  autoplot(df_atm_4, level = 80) +
  labs(y = "Cash in hundreds of dollars",
       title = "ATM1 Forecast in May 2010")

fit_atm_4_b  |> gg_tsresiduals()

fit_atm_4_c  |> gg_tsresiduals()

augment(fit_atm_4_a)%>%
  features(.innov, ljung_box, lag = 7)
## # A tibble: 2 × 4
##   ATM   .model lb_stat lb_pvalue
##   <chr> <chr>    <dbl>     <dbl>
## 1 ATM4  ETS       12.4    0.0894
## 2 ATM4  SNAIVE    99.6    0

For ATM3, I opted for the naive model due to the limited availability of information.

ATM3 Forecast and Residual diagnostics

fit_atm_3 <- df_atm_3%>%
  model(NAIVE(Cash))


fc_atm_3 <- fit_atm_3 %>%
  forecast(h= 31)

fc_atm_3%>%
  autoplot(df_atm_3, level = 80) +
  labs(y = "",
       title = 
         "Transformed gas production with $\\lambda$ = ",
         )

fit_atm_3 |> gg_tsresiduals()

augment(fit_atm_3)%>%
  features(.innov, ljung_box, lag = 10)
## # A tibble: 1 × 4
##   ATM   .model      lb_stat lb_pvalue
##   <chr> <chr>         <dbl>     <dbl>
## 1 ATM3  NAIVE(Cash)    1.03      1.00