This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, October 31. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.
# Import required R libraries
library(fpp3)
library(tidyverse)
library(readxl)
library(writexl)
library(seasonal)
library(stringr)
Prompt: In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast which you will put in an Excel readable file.
# Read in data
atm_data_raw <- read_excel("data/ATM624Data.xlsx")
# Properly convert the DATE column to match true input
# https://stackoverflow.com/questions/43230470/how-to-convert-excel-date-format-to-proper-date-in-r
atm_data_raw <- atm_data_raw %>% mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
# Initial output to see data
#head(atm_data_raw)
# Output summary for high level assessment
summary(atm_data_raw)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
# Check dimensions to understand breadth of data
dim(atm_data_raw)
## [1] 1474 3
# 1474 3
#Cash in hundreds
The dataset's dates start at 2009-05-01 and end at 2010-05-14, spanning 379 days, which is 14 more than a single year of 365 days.
The dimensions of the initial dataset indicate 1474 observations, which again is more than a single year's worth of data. The 3 columns are DATE, ATM, and Cash. DATE indicates the specific date, ATM indicates which of the 4 ATMs, and Cash represents the total amount withdrawn for the given date and ATM.
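As a quick sanity check (a sketch, not part of the original output), the observations can be counted per ATM label to see how the 1474 rows break down across the four ATMs and any unlabeled rows:
# Count observations per ATM label (rows with a missing ATM form their own group)
atm_data_raw %>%
  count(ATM)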
# Define as tsibble with DATE as index
atm_data_ts <- atm_data_raw %>%
as_tsibble(index = DATE, key = ATM)
# Output tsibble after 2010-04-30
atm_data_ts %>%
filter(DATE > "2010-04-30")
## # A tsibble: 14 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-05-01 <NA> NA
## 2 2010-05-02 <NA> NA
## 3 2010-05-03 <NA> NA
## 4 2010-05-04 <NA> NA
## 5 2010-05-05 <NA> NA
## 6 2010-05-06 <NA> NA
## 7 2010-05-07 <NA> NA
## 8 2010-05-08 <NA> NA
## 9 2010-05-09 <NA> NA
## 10 2010-05-10 <NA> NA
## 11 2010-05-11 <NA> NA
## 12 2010-05-12 <NA> NA
## 13 2010-05-13 <NA> NA
## 14 2010-05-14 <NA> NA
A closer look at the data shows the 14 observations starting on 2010-05-01 and ending on 2010-05-14 indicate neither an ATM nor a Cash amount, so these 14 observations will be ignored in the upcoming evaluation. Removing them also makes for a cleaner dataset, as the total observation count becomes 1460, which is exactly 365 for each of the 4 ATMs.
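For explicitness, the 14 unlabeled rows could be dropped directly; a minimal sketch (the object name atm_data_ts_clean is illustrative, and the per-ATM filters used below achieve the same effect):
# Keep only rows with a labelled ATM; this drops the 14 trailing placeholder rows
atm_data_ts_clean <- atm_data_ts %>%
  filter(!is.na(ATM))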
atm_data_ts %>%
autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM Withdrawls")
An initial plot of the time series shows the ATM4 data falls in a range greater than the other three ATMs. With the extreme outlier in ATM4, the plot provides little value for comparison across the four ATMs.
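One way to make the comparison readable (a sketch; the faceting and free y-scales are illustrative choices) is to plot each ATM in its own panel:
# Facet by ATM with independent y-axis scales so ATM4's larger range does not dominate
atm_data_ts %>%
  autoplot(Cash) +
  facet_wrap(vars(ATM), scales = "free_y") +
  labs(y = "Cash (in hundreds $USD)", title = "ATM Withdrawals by ATM")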
For forecasting the withdrawal amount for each ATM, I will go through each ATM individually, beginning with ATM1.
# Filter to ATM1 data
atm1_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM1')
# Output summary
summary(atm1_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.89
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
## NA's :3
The summary of data for ATM1 confirms the 365 days (single year) of observations and also indicates 3 missing Cash values.
atm1_data_ts$DATE[is.na(atm1_data_ts$Cash)]
## [1] "2009-06-13" "2009-06-16" "2009-06-22"
The 3 dates with missing Cash values are displayed above. I am not sure of the significance of the missing data occurring within such a small window of time, so I will consider imputation.
atm1_data_ts %>%
autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM1 Withdrawals")
Initial plot of the ATM1 data appears to show a weekly seasonal component along with a dip between October 2009 and January 2010. There does not appear to be a clear increasing or decreasing trend nor a cyclic component.
Given the low volume of missing values, three dates, I will simply impute the data with the median value of the full ATM1 dataset.
# Calculate median value for ATM1
median <- median(atm1_data_ts$Cash, na.rm=TRUE)
# Set NAs to median
atm1_data_ts$Cash[is.na(atm1_data_ts$Cash)] <- median
summary(atm1_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.95
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
Summary output confirms no missing data for ATM1 Cash amounts.
Next, the ATM1 tsibble is updated so that DATE is properly defined as the index.
atm1_data_ts <- atm1_data_ts %>%
mutate(DATE = as_date(DATE)) %>%
update_tsibble(index = DATE)
#atm1_data_ts
To better understand the trend and seasonal nature of the ATM1 dataset, decomposition is performed with a period of 7 to represent the weekly pattern.
# From: https://stats.stackexchange.com/questions/494013/control-the-period-for-daily-time-series-in-tsibbles
atm1_dcmp <- atm1_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm1_dcmp %>% components() %>% autoplot()
The STL decomposition shows the remainder component accounts for the most variation, based on the relative scale of the gray bars on the left. The seasonal panel shows a clear weekly pattern that stays the same throughout, while the remainder panel shows greater variance toward the end of the series, beginning in February 2010. The trend panel confirms no meaningful trend is present in the dataset.
components(atm1_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The seasonally adjusted trendline above confirms that, despite a few outliers through December 2009, the seasonal component captures the weekly nature of the data well until February 2010, after which it no longer describes the dataset properly. Based on the plot, a new weekly pattern appears to emerge in February 2010.
components(atm1_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-16" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The above plot shows that the seasonal adjustment does not account for the weekly pattern over the final months of the ATM1 dataset. Clearly, a shift in the weekly seasonal pattern has occurred.
atm1_data_ts %>% ACF() %>% autoplot()
The ACF of the ATM1 data shows a clear correlation at lags 7, 14, and 21, which is expected for seasonal data following a weekly pattern. High negative correlations are present at lags 2 and 5, which appear to follow the same 7-lag pattern. I have a hunch these correlations are due to the change in the weekly pattern.
atm1_data_ts %>% PACF() %>% autoplot()
The PACF plot re-confirms the correlations at lags 7 and 14. Interestingly, correlation also appears at lags 2 and 5.
atm1_data_ts %>% gg_season(Cash, period = "week") +
labs(y="$USD (in hundreds)", title="ATM Withdrawals")
To better understand the weekly seasonal pattern, the gg_season plot above does show some sort of shift on Tuesday and Thursday.
atm1_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$USD (in hundreds)",
title = "ATM Withdrawals"
)
The gg_subseries plot confirms the shift in February occurs on Tuesday, Wednesday, and Thursday. The other days of the week indicate some variance and outliers, but nothing as dramatic as Tuesday through Thursday.
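To isolate the new pattern, the same seasonal plot can be restricted to the post-shift window (a sketch; the 2010-02-16 cutoff mirrors the one used for modelling below):
# Weekly seasonal plot using only the observations after the apparent mid-February shift
atm1_data_ts %>%
  filter_index("2010-02-16" ~ "2010-04-30") %>%
  gg_season(Cash, period = "week") +
  labs(y = "$USD (in hundreds)", title = "ATM1 Withdrawals after mid-February 2010")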
Before forecasting the data for May 2010, I want to train and test the different model functions for proper evaluation. The models are trained on 292 days, or approximately 80% of the year represented.
# Create training set (assume 1 year of data)
# 292 days for training, approx. 80% of data
train_atm1 <- atm1_data_ts %>%
filter_index("2009-05-01" ~ "2010-02-17")
#train_atm1
# Fit the models
fit_atm1 <- train_atm1 %>%
model(
Naive = NAIVE(Cash),
`Seasonal naive` = SNAIVE(Cash),
`Random walk` = RW(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
fit_atm1 %>% accuracy()
## # A tibble: 7 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Naive Training 0.284 49.9 37.4 -85.0 120. 2.07 1.77 -0.356
## 2 ATM1 Seasonal naive Training 0.203 28.1 18.0 -83.3 103. 1 1 0.0957
## 3 ATM1 Random walk Training 0.284 49.9 37.4 -85.0 120. 2.07 1.77 -0.356
## 4 ATM1 Arima Training 0.179 21.9 14.3 -66.5 82.7 0.790 0.780 0.00341
## 5 ATM1 ETS_Add Training 0.175 21.4 13.7 -73.3 89.7 0.761 0.763 0.0940
## 6 ATM1 ETS_Mult Training 1.98 21.6 13.7 -75.8 90.6 0.760 0.769 0.0956
## 7 ATM1 ETS_Damp Training 1.46 21.3 13.3 -76.4 90.4 0.737 0.759 0.0756
Using the different model functions from the Hyndman textbook, the best performing model based on RMSE on the training set is the ETS model with dampening. Overall, the three ETS models and the ARIMA model all perform well in comparison. The seasonal naive model also performs well with an RMSE slightly higher than the ARIMA model and the three ETS models.
# Generate forecasts for 72 days
fc_atm1 <- fit_atm1 %>% forecast(h = "72 days")
fc_atm1 %>% accuracy(atm1_data_ts)
## # A tibble: 7 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM1 Test -2.20 50.8 37.8 -466. 502. 2.09 1.81 -0.0565
## 2 ETS_Add ATM1 Test -3.25 52.4 38.0 -479. 514. 2.11 1.86 0.00348
## 3 ETS_Damp ATM1 Test -10.4 54.2 40.0 -517. 547. 2.22 1.93 -0.0246
## 4 ETS_Mult ATM1 Test -7.03 52.9 39.2 -494. 527. 2.17 1.88 -0.00808
## 5 Naive ATM1 Test -98.2 104. 98.2 -893. 893. 5.44 3.71 -0.0585
## 6 Random walk ATM1 Test -98.2 104. 98.2 -893. 893. 5.44 3.71 -0.0585
## 7 Seasonal naive ATM1 Test -5.21 64.1 53.7 -460. 509. 2.97 2.28 0.00878
The accuracy of the models on the test dataset shows ARIMA performing the best with an RMSE of 50.77. The three ETS models do perform well also.
# Plot forecasts against actual values
fc_atm1 %>%
autoplot(filter_index(train_atm1, "2010-02-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(atm1_data_ts, "2010-02-10" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Plotting the seven forecasts along with the original dataset shows the seasonal pattern of the forecasts does not match the seasonal pattern of the actual dataset. Based on the observations in the decomposition section, the weekly pattern has changed, and thus training the model on the first 80% of the data actually misses the new pattern found in the data. Further evaluation is needed.
fit_atm1_arima <- train_atm1 %>%
model(ARIMA(Cash))
# Check the residuals of Arima model
fit_atm1_arima %>% gg_tsresiduals()
To further confirm the suspicion of a poor model, I've taken the ARIMA model, the best performing based on RMSE, and displayed its residuals above. The ACF plot of the residuals clearly indicates correlation at lags 8 and 20, which suggests the model isn't quite capturing the correlation structure of the data properly.
Given the decomposition and the model evaluations indicate a change in seasonal pattern sometime in February 2010, I will attempt to train and test on just the data beginning on 2010-02-16. Yes, this certainly reduces the amount of data used to build the model, but I believe may better capture the change in the weekly pattern and thus provide better forecasts for the month of May 2010.
# ATM1 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm1 <- atm1_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-30")
train_atm1 <- atm1_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-14")
# Fit the models
fit_atm1 <- train_atm1 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm1)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM1 Seaso… 575. NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM1 Arima 311. -219. 447. 448. 455. <cpl [9… <cpl [0… NA NA
## 3 ATM1 ETS_A… 327. -281. 582. 586. 602. <NULL> <NULL> 276. 330.
## 4 ATM1 ETS_M… 0.155 -308. 636. 641. 657. <NULL> <NULL> 1046. 1437.
## 5 ATM1 ETS_D… 0.405 -329. 683. 692. 710. <NULL> <NULL> 27524. 41523.
## # … with 1 more variable: MAE <dbl>
Now using only the five models that handle the seasonal nature of the data, the ARIMA model performs best of the four that report an AICc, with a value of 447.73. Note: the Seasonal Naive model does not produce an AICc.
fit_atm1 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Seasonal naive Trai… -5.20 24.3 15.9 -214. 230. 1 1 0.444
## 2 ATM1 Arima Trai… 0.138 16.0 11.9 -97.6 143. 0.753 0.660 -0.127
## 3 ATM1 ETS_Add Trai… -1.19 16.6 11.7 -56.5 68.7 0.739 0.684 0.0722
## 4 ATM1 ETS_Mult Trai… -2.40 32.3 20.8 -125. 145. 1.31 1.33 0.480
## 5 ATM1 ETS_Damp Trai… -10.4 166. 61.7 -13.8 68.2 3.89 6.82 0.412
The ARIMA model performs the best on the training set with an RMSE of 16.05, while the ETS Additive performs almost as well as the ARIMA model. The Seasonal Naive model performs third best and the ETS with dampening is clearly not doing well compared to the other four models.
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm1 <- fit_atm1 %>% forecast(h = "16 days")
fc_atm1 %>% accuracy(new_seasonal_atm1)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM1 Test 7.71 12.8 9.50 40.9 43.2 0.599 0.527 0.0154
## 2 ETS_Add ATM1 Test 1.06 11.8 8.88 5.50 15.3 0.560 0.486 -0.337
## 3 ETS_Damp ATM1 Test 46.6 54.0 46.6 57.3 58.4 2.94 2.22 0.0345
## 4 ETS_Mult ATM1 Test 9.17 15.1 11.0 -2.95 26.2 0.694 0.621 -0.161
## 5 Seasonal naive ATM1 Test 0.125 9.64 7.12 1.26 9.83 0.449 0.397 -0.00791
For the accuracy of the five models on the test set, the Seasonal Naive model actually performs the best with an RMSE of 9.64, while the ARIMA and ETS Additive models also perform well.
# Plot forecasts against actual values
fc_atm1 %>%
autoplot(filter_index(train_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The plot of the test forecasts confirms the models are correctly capturing the seasonal nature of the data near the end of the dataset. Unfortunately, the ARIMA model is forecasting negative values, which of course is not possible.
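A quick check (a sketch) of which forecast days, if any, the ARIMA model sends below zero:
# List the ARIMA point forecasts that fall below zero
fc_atm1 %>%
  filter(.model == "Arima", .mean < 0)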
fit_atm1_arima<- train_atm1 %>%
model(ARIMA(Cash))
report(fit_atm1_arima)
## Series: Cash
## Model: ARIMA(2,0,0)(1,1,0)[7]
##
## Coefficients:
## ar1 ar2 sar1
## 0.5923 0.3165 -0.6292
## s.e. 0.1520 0.1665 0.1431
##
## sigma^2 estimated as 311.1: log likelihood=-219.43
## AIC=446.86 AICc=447.73 BIC=454.59
The definition of the ARIMA model is shown above.
# Check the residuals of ARIMA
fit_atm1_arima %>% gg_tsresiduals()
As above, I’ve displayed the residuals of the ARIMA model to assess the appropriateness of the model. The plot of Innovation residuals appears to follow white noise after the first 10 observations or so. The ACF plot indicates no correlation outside of the confidence intervals.
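As a complementary check on the white-noise claim, a Ljung-Box portmanteau test can be applied to the innovation residuals (a sketch; the lag of 14, covering two weekly cycles, is an illustrative choice):
# Ljung-Box test on the ARIMA innovation residuals; a large p-value is consistent with white noise
fit_atm1_arima %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)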
The dataset did not appear to show increasing variance over the time series, but I wanted to try a Box-Cox transformation to see if a better model could be generated.
# Consider Box-Cox
lambda <- new_seasonal_atm1 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
new_seasonal_atm1 %>%
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM1 withdrawals with $\\lambda$ = ",
round(lambda,2))))
I apply the Box-Cox transformation to see if I can squeeze out slightly better performance on the test data. The resulting lambda value is 0.93.
# Forecast for May 2010
fit_atm1_bc <- train_atm1 %>%
model(
`Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
Arima = ARIMA(box_cox(Cash, lambda)),
ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A"))
)
report(fit_atm1_bc)
## # A tibble: 3 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM1 Seasonal… 323. NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM1 Arima 175. -206. 417. 418. 423. <cpl [1… <cpl [7… NA NA
## 3 ATM1 ETS_Add 182. -264. 548. 552. 568. <NULL> <NULL> 154. 185.
## # … with 1 more variable: MAE <dbl>
The AICc of the ARIMA model with Box-Cox transformation has improved from 447.73 to 417.47. A promising sign.
fit_atm1_bc %>% accuracy()
## # A tibble: 3 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Seasonal naive Training -5.20 24.3 15.9 -214. 230. 1 1 0.444
## 2 ATM1 Arima Training -2.56 16.1 12.3 -105. 127. 0.773 0.661 -0.313
## 3 ATM1 ETS_Add Training -1.11 16.6 12.0 -66.2 77.5 0.755 0.682 0.0836
The accuracy of the models with the Box-Cox transformation shows little to no improvement over the untransformed models on the training set. The RMSE of the ARIMA model with the transformation actually increases slightly, from 16.05 to 16.06, as noted above.
# Generate forecasts for the next 16 days
fc_atm1_bc <- fit_atm1_bc %>% forecast(h = "16 days")
fc_atm1_bc %>% accuracy(new_seasonal_atm1)
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM1 Test -6.25 11.5 9.35 -32.6 36.2 0.589 0.474 -0.360
## 2 ETS_Add ATM1 Test -0.247 11.8 9.24 -24.8 35.4 0.582 0.487 -0.398
## 3 Seasonal naive ATM1 Test -1.05 9.87 7.88 -21.8 29.4 0.497 0.406 -0.0900
The Box-Cox transformation does improve the ARIMA model's RMSE on the test data, from 12.81 to 11.54, while the Seasonal Naive RMSE increases with the transformation, from 9.64 to 9.87. Across all results, with and without transformation, the Seasonal Naive model without transformation performs the best on the test dataset.
# Plot forecasts against actual values
fc_atm1_bc %>%
autoplot(filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot shows the forecasts of the transformed models along with the actual data. Similar to the plot of the models without transformation, the forecasts follow the seasonal pattern of the data.
Given the assignment is to forecast May 2010 from the provided data, I believe building the models on the final two and a half months is the appropriate approach. A clear shift in the weekly pattern occurs in February 2010, and without any additional context or history, I assume that new pattern will persist through the roughly four weeks of May 2010.
Based on the results of the seasonal models with and without the Box-Cox transformation, I've chosen the Seasonal Naive model without transformation to forecast the May 2010 values. The model performed on par with, if not better than, the more complex models, and did not produce any negative values as the ARIMA model did. In the interest of simplicity and good results, the Seasonal Naive model is selected.
# Generate forecasts for 16 days (two weeks) plus the 31 days of May
fc_atm1_final <- fit_atm1 %>% forecast(h = "47 days")
# Plot forecasts against actual values
fc_atm1_final %>%
autoplot(filter_index(train_atm1, "2010-02-16" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm1, "2010-02-16" ~ "2010-05-31"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Forecasts of May 2010 for ATM1.
Prepare forecast fable for input into Excel file for ATM1.
# dataframe for Excel output
atm1_forecast_may2010 <- fc_atm1_final %>%
filter(.model == "Seasonal naive", DATE > "2010-04-30")
atm1_forecast_may2010 <- data.frame(atm1_forecast_may2010) %>%
select(c(DATE, ATM, .mean))%>%
rename(Cash = .mean)
atm1_forecast_may2010
## DATE ATM Cash
## 1 2010-05-01 ATM1 87
## 2 2010-05-02 ATM1 92
## 3 2010-05-03 ATM1 63
## 4 2010-05-04 ATM1 3
## 5 2010-05-05 ATM1 103
## 6 2010-05-06 ATM1 79
## 7 2010-05-07 ATM1 90
## 8 2010-05-08 ATM1 87
## 9 2010-05-09 ATM1 92
## 10 2010-05-10 ATM1 63
## 11 2010-05-11 ATM1 3
## 12 2010-05-12 ATM1 103
## 13 2010-05-13 ATM1 79
## 14 2010-05-14 ATM1 90
## 15 2010-05-15 ATM1 87
## 16 2010-05-16 ATM1 92
## 17 2010-05-17 ATM1 63
## 18 2010-05-18 ATM1 3
## 19 2010-05-19 ATM1 103
## 20 2010-05-20 ATM1 79
## 21 2010-05-21 ATM1 90
## 22 2010-05-22 ATM1 87
## 23 2010-05-23 ATM1 92
## 24 2010-05-24 ATM1 63
## 25 2010-05-25 ATM1 3
## 26 2010-05-26 ATM1 103
## 27 2010-05-27 ATM1 79
## 28 2010-05-28 ATM1 90
## 29 2010-05-29 ATM1 87
## 30 2010-05-30 ATM1 92
## 31 2010-05-31 ATM1 63
ATM1 final forecast values.
Now, for the evaluation and forecasting of the ATM2 time series.
atm2_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM2')
summary(atm2_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 25.50
## Median :2009-10-30 Mode :character Median : 67.00
## Mean :2009-10-30 Mean : 62.58
## 3rd Qu.:2010-01-29 3rd Qu.: 93.00
## Max. :2010-04-30 Max. :147.00
## NA's :2
The ATM2 time series, just as ATM1, includes the dates 2009-05-01 through 2010-04-30, a full year of 365 days. The Cash column indicates two missing values along with a range from 0 to 147.
atm2_data_ts %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM2 Withdrawals")
The initial plot of the ATM2 dataset indicates a weekly seasonal pattern, as expected. The range identified above seems appropriate with no clear outliers. With two values missing, imputation to replace the values is recommended. A slightly decreasing trend does appear across the time series, but no identifiable cyclic component appears relevant.
atm2_data_ts$DATE[is.na(atm2_data_ts$Cash)]
## [1] "2009-06-18" "2009-06-24"
The two missing values also occur in June 2009, just as the missing values of ATM1 occurred.
# Calculate median value for ATM2
median <- median(atm2_data_ts$Cash, na.rm=TRUE)
# Set NAs to median
atm2_data_ts$Cash[is.na(atm2_data_ts$Cash)] <- median
atm2_data_ts %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM2 Withdrawals with Imputed Data")
Given the small volume of missing data, I’ve chosen to impute the two missing values with the median value of the full ATM2 dataset. The plot of the dataset with imputation still appears to contain a weekly seasonal pattern with very slight decreasing trend.
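As an aside, a model-based alternative (not used here) would let a fitted ARIMA fill the gaps instead of a single global median; a minimal sketch using fabletools' interpolate(), where atm2_raw and atm2_interp are illustrative names:
# Sketch of model-based imputation: fit ARIMA on the series with gaps, then interpolate the NAs
# (illustrative only; the median imputation above is what is used going forward)
atm2_raw <- atm_data_ts %>% filter(ATM == 'ATM2')   # this copy still contains the two missing values
atm2_interp <- atm2_raw %>%
  model(ARIMA(Cash)) %>%
  interpolate(atm2_raw)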
atm2_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
) %>%
components() %>% autoplot()
Similar to ATM1, a shift in the seasonal nature of the data appears in the final few months. As applied on the ATM1 dataset, I plan to train and test the models on the final section of the dataset in which the weekly seasonal nature shifts.
atm2_data_ts %>% ACF() %>% autoplot()
As expected, the ACF plot for the ATM2 dataset shows strong correlation at lags 7, 14, and 20. It is interesting to find high negative correlations as well. I believe these correlations also reflect the weekly pattern in the data, offset by the shift in February 2010.
atm2_data_ts %>% PACF() %>% autoplot()
The PACF plot for ATM2 also displays a high correlation at lags 7 and 14. The high negative correlations occur at lags 2 and 5. Again, I believe these lags are due to the change in the weekly seasonal pattern.
atm2_data_ts %>% gg_season(Cash, period = "week") +
labs(y="$ (in hundreds)", title="ATM2 Withdrawals")
The gg_season plot highlights the shift in the weekly pattern between Tuesday and Thursday, similar to that of ATM1.
atm2_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$ (in hundreds)",
title = "ATM2 Withdrawals"
)
The gg_subseries plot also re-confirms the change in the weekly pattern. The adjustment appears on almost every day of the week except Saturday.
atm2_dcmp <- atm2_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
components(atm2_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
The seasonally adjusted trendline again confirms a shift in the weekly pattern starting in February 2010.
components(atm2_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-16" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
Focusing the seasonally adjusted trendline on the final few months, the blue trendline follows the actual dataset, indicating the seasonal adjustment is not properly accounting for this seasonal pattern.
Given the decomposition and the model evaluations indicate a change in seasonal pattern sometime in February 2010, I will attempt to train and test on just the data beginning on 2010-02-16. As mentioned in the section for ATM1, this approach certainly reduces the amount of data used to build the model, but I believe may better capture the change in the weekly pattern and thus provide better forecasts for the month of May 2010.
# ATM2 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm2 <- atm2_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-30")
train_atm2 <- atm2_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-14")
# Fit the models
fit_atm2 <- train_atm2 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm2)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM2 Seaso… 588. NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM2 Arima 364. -224. 452. 452. 456. <cpl [7… <cpl [0… NA NA
## 3 ATM2 ETS_A… 358. -283. 587. 591. 607. <NULL> <NULL> 302. 276.
## 4 ATM2 ETS_M… 0.408 -297. 614. 619. 634. <NULL> <NULL> 4367. 4182.
## 5 ATM2 ETS_D… 6.22 -368. 762. 770. 789. <NULL> <NULL> 56573. 53446.
## # … with 1 more variable: MAE <dbl>
Focusing on the five models that account for a seasonal pattern, the ARIMA model again performs best of the four that report an AICc, attaining the lowest value (approximately 452 per the table above).
fit_atm2 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 Seasonal naive Trai… -1.25 24.0 16.2 -Inf Inf 1 1 -0.191
## 2 ATM2 Arima Trai… -1.46 17.7 11.5 -Inf Inf 0.712 0.737 -0.0891
## 3 ATM2 ETS_Add Trai… -0.724 17.4 12.0 -Inf Inf 0.741 0.723 -0.130
## 4 ATM2 ETS_Mult Trai… -13.1 66.7 31.0 -65.0 90.3 1.92 2.77 0.467
## 5 ATM2 ETS_Damp Trai… 6.79 240. 121. -32.6 217. 7.48 9.98 0.488
The ETS Additive model performs the best on the training set with an RMSE of 17.39, while the ARIMA performs almost as well as the ETS Additive model. The Seasonal Naive model performs third best and again the ETS with dampening is clearly not doing well compared to the other four models.
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm2 <- fit_atm2 %>% forecast(h = "16 days")
fc_atm2 %>% accuracy(new_seasonal_atm2)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM2 Test 4.57 16.1 12.5 0.971 24.5 0.775 0.670 -0.0923
## 2 ETS_Add ATM2 Test 6.59 17.8 12.7 -35.9 70.8 0.784 0.741 -0.00583
## 3 ETS_Damp ATM2 Test -17.6 32.7 23.0 -29.1 36.5 1.42 1.36 0.478
## 4 ETS_Mult ATM2 Test -1.33 12.4 9.87 -1.36 20.5 0.610 0.517 -0.108
## 5 Seasonal naive ATM2 Test 4.06 19.7 14.3 2.45 23.3 0.884 0.820 -0.244
For the accuracy of the five models on the test set, the ETS Multiplicative model performs the best with an RMSE of 12.44, while ARIMA, ETS Additive and Seasonal Naive, also perform well.
# Plot forecasts against actual values
fc_atm2 %>%
autoplot(filter_index(train_atm2, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm2, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM2 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The plot of the test forecasts confirms the models are correctly capturing the seasonal nature of the data near the end of the dataset.
fit_atm2_ets <- train_atm2 %>%
model(ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")))
report(fit_atm2_ets)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000053
## gamma = 0.000100022
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 56.74485 -52.03051 26.58281 12.06053 -13.65817 31.39567 45.81194 -50.16227
##
## sigma^2: 357.7803
##
## AIC AICc BIC
## 586.7609 591.4417 607.3653
The resulting ETS Additive model specification is shown above.
# Check the residuals of the ETS Additive model
fit_atm2_ets %>% gg_tsresiduals()
As with the ARIMA model for ATM1, I've displayed the residuals of the ETS Additive model to assess its appropriateness. The innovation residuals appear to follow white noise. The ACF plot shows correlation only at lag 13; otherwise, no correlations fall outside the confidence bounds.
After considering the RMSE of the ETS Additive model on both the training and test datasets, along with a visual inspection of the forecast values through May 2010, I've decided to use the forecasts from the ETS Additive model. The ETS Multiplicative model did have the best RMSE on the test dataset, but visual inspection of its forecasts shows a growing variance I don't expect in the series over the next month. The Seasonal Naive, ARIMA, and ETS Additive models each show promising results on the test set and follow the seasonal pattern on visual assessment, so I've selected the ETS Additive model based on its accuracy on both the training and test datasets.
# Generate forecasts for 16 days (two weeks) plus the 31 days of May
fc_atm2_final <- fit_atm2 %>% forecast(h = "47 days")
# Plot forecasts against actual values
fc_atm2_final %>%
autoplot(filter_index(train_atm2, "2010-02-16" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm2, "2010-02-16" ~ "2010-05-31"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM2 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Forecasts of May 2010 for ATM2 models.
# dataframe for Excel output
atm2_forecast_may2010 <- fc_atm2_final %>%
filter(.model == "ETS_Add", DATE > "2010-04-30")
atm2_forecast_may2010 <- data.frame(atm2_forecast_may2010) %>%
select(c(DATE, ATM, .mean))%>%
rename(Cash = .mean)
atm2_forecast_may2010
## DATE ATM Cash
## 1 2010-05-01 ATM2 68.799631
## 2 2010-05-02 ATM2 83.321692
## 3 2010-05-03 ATM2 4.712864
## 4 2010-05-04 ATM2 6.579454
## 5 2010-05-05 ATM2 102.551287
## 6 2010-05-06 ATM2 88.134400
## 7 2010-05-07 ATM2 43.081012
## 8 2010-05-08 ATM2 68.799631
## 9 2010-05-09 ATM2 83.321692
## 10 2010-05-10 ATM2 4.712864
## 11 2010-05-11 ATM2 6.579454
## 12 2010-05-12 ATM2 102.551287
## 13 2010-05-13 ATM2 88.134400
## 14 2010-05-14 ATM2 43.081012
## 15 2010-05-15 ATM2 68.799631
## 16 2010-05-16 ATM2 83.321692
## 17 2010-05-17 ATM2 4.712864
## 18 2010-05-18 ATM2 6.579454
## 19 2010-05-19 ATM2 102.551287
## 20 2010-05-20 ATM2 88.134400
## 21 2010-05-21 ATM2 43.081012
## 22 2010-05-22 ATM2 68.799631
## 23 2010-05-23 ATM2 83.321692
## 24 2010-05-24 ATM2 4.712864
## 25 2010-05-25 ATM2 6.579454
## 26 2010-05-26 ATM2 102.551287
## 27 2010-05-27 ATM2 88.134400
## 28 2010-05-28 ATM2 43.081012
## 29 2010-05-29 ATM2 68.799631
## 30 2010-05-30 ATM2 83.321692
## 31 2010-05-31 ATM2 4.712864
ATM2 final forecast values.
Now, let’s take a look at the dataset for ATM3.
atm3_data_ts_all <- atm_data_ts %>%
filter(ATM == 'ATM3')
summary(atm3_data_ts_all)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.0000
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.0000
## Median :2009-10-30 Mode :character Median : 0.0000
## Mean :2009-10-30 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :96.0000
From the summary output, the date range (2009-05-01 to 2010-04-30) matches expectations given the data from ATM1 and ATM2. The Cash column does not contain any missing data but does raise some curiosities: the range is 0 to 96, the median is 0, and the mean is quite low at less than 1.
atm3_data_ts_all %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")
A simple plot of the data clearly shows unexpected information. Only a few days at the end of the dataset contain anything above zero.
atm3_data_ts <- atm3_data_ts_all %>%
filter(ATM == 'ATM3', Cash > 0)
atm3_data_ts
## # A tsibble: 3 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 96
## 2 2010-04-29 ATM3 82
## 3 2010-04-30 ATM3 85
By removing every date with a value equal to 0, only three dates remain - the final three dates of the dataset.
summary(atm3_data_ts)
## DATE ATM Cash
## Min. :2010-04-28 Length:3 Min. :82.00
## 1st Qu.:2010-04-28 Class :character 1st Qu.:83.50
## Median :2010-04-29 Mode :character Median :85.00
## Mean :2010-04-29 Mean :87.67
## 3rd Qu.:2010-04-29 3rd Qu.:90.50
## Max. :2010-04-30 Max. :96.00
For those three observations, the range is 82 to 96 with a mean of 87.67.
atm3_data_ts %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")
The plot confirms that only three dates contain a Cash amount greater than zero.
A decision must be made on how to approach this dataset. Of 365 days, 362 contain a Cash amount of zero, and only the final three dates contain Cash amounts meeting expectations. Without history or context for ATM3, I will assume that ATM3 was broken or otherwise not in use for every day with a Cash amount of zero. It follows from this assumption that ATM3 was fixed or put into service on 2010-04-28, and can therefore be assumed working and active for the month of May 2010.
With so little data to build a model, I will choose the MEAN function based on the three dates containing Cash amounts above zero to make the forecast for May 2010.
fit_atm3 <- atm3_data_ts %>% model(MEAN(Cash))
report(fit_atm3)
## Series: Cash
## Model: MEAN
##
## Mean: 87.6667
## sigma^2: 54.3333
The model returns an expected value of 87.67 as the mean, which was indicated in the summary of the data.
# Forecast for 31 days of May 2010
fc_atm3 <- fit_atm3 %>% forecast(h = "31 days")
fc_atm3 %>%
autoplot(filter_index(atm3_data_ts_all, "2009-05-01" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(atm3_data_ts_all, "2010-04-28" ~ "2010-05-31"),
colour = "black"
) +
labs(y="Cash (in hundreds $USD)", title="ATM3 Withdrawals with Forecast") +
guides(colour = guide_legend(title = "Forecast"))
Not a very interesting plot of the forecasted data in relation to the ATM3 dataset, but given the minimal amount of data, the mean forecast is a constant line for the month of May 2010.
With only three values above zero, I'm assuming the zero days reflect no withdrawal activity while ATM3 still existed; otherwise those days would have been recorded as NA. The Mean model is selected as the only viable model for three observations.
# dataframe for Excel output
atm3_forecast_may2010 <- fc_atm3
atm3_forecast_may2010 <- data.frame(atm3_forecast_may2010) %>%
select(c(DATE, ATM, .mean))%>%
rename(Cash = .mean)
atm3_forecast_may2010
## DATE ATM Cash
## 1 2010-05-01 ATM3 87.66667
## 2 2010-05-02 ATM3 87.66667
## 3 2010-05-03 ATM3 87.66667
## 4 2010-05-04 ATM3 87.66667
## 5 2010-05-05 ATM3 87.66667
## 6 2010-05-06 ATM3 87.66667
## 7 2010-05-07 ATM3 87.66667
## 8 2010-05-08 ATM3 87.66667
## 9 2010-05-09 ATM3 87.66667
## 10 2010-05-10 ATM3 87.66667
## 11 2010-05-11 ATM3 87.66667
## 12 2010-05-12 ATM3 87.66667
## 13 2010-05-13 ATM3 87.66667
## 14 2010-05-14 ATM3 87.66667
## 15 2010-05-15 ATM3 87.66667
## 16 2010-05-16 ATM3 87.66667
## 17 2010-05-17 ATM3 87.66667
## 18 2010-05-18 ATM3 87.66667
## 19 2010-05-19 ATM3 87.66667
## 20 2010-05-20 ATM3 87.66667
## 21 2010-05-21 ATM3 87.66667
## 22 2010-05-22 ATM3 87.66667
## 23 2010-05-23 ATM3 87.66667
## 24 2010-05-24 ATM3 87.66667
## 25 2010-05-25 ATM3 87.66667
## 26 2010-05-26 ATM3 87.66667
## 27 2010-05-27 ATM3 87.66667
## 28 2010-05-28 ATM3 87.66667
## 29 2010-05-29 ATM3 87.66667
## 30 2010-05-30 ATM3 87.66667
## 31 2010-05-31 ATM3 87.66667
ATM3 final forecast values.
Finally, the evaluation and forecast model for the ATM4 dataset.
atm4_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM4')
summary(atm4_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.563
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 124.334
## Median :2009-10-30 Mode :character Median : 403.839
## Mean :2009-10-30 Mean : 474.043
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :10919.762
As expected, the summary for the ATM4 dataset covers the dates 2009-05-01 through 2010-04-30. The Cash column indicates no missing values, with a range of 1.563 to 10919.762 and a mean of 474.043. The minimum is certainly low, but thankfully not zero as seen in ATM3. The maximum is quite high compared to the mean and the third quartile, so one or more outliers may exist in the ATM4 data.
atm4_data_ts %>% autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")
The plot of the ATM4 data shows a clear outlier in February 2010.
atm4_data_ts %>%
filter(Cash > 3000)
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
The day of the outlier turns out to be 2010-02-09.
atm4_dcmp <- atm4_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
Given the weekly seasonal pattern worked for ATM1 and ATM2, let's try decomposition on ATM4. As seen in the plot above, the outlier makes the components nearly impossible to evaluate.
Let's remove the outlier by imputing the median value in its place.
# Calculate median value for ATM4
# Straightforward approach to impute data
median <- median(atm4_data_ts$Cash, na.rm=TRUE)
# Get rid of outlier from ATM4
atm4_data_ts$Cash[atm4_data_ts$Cash > 3000] <- median
atm4_data_ts %>% autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals with Imputed Value")
By removing the outlier, the data does appear more approachable. Overall, I don’t detect a clear trend, perhaps a slight decrease. Given the data represents a single year, I don’t expect to find a cyclic component, but perhaps a weekly seasonal component exists. Let’s try the decomposition again, now excluding the outlier.
atm4_dcmp <- atm4_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
The remainder component plays a larger role than the seasonal component. As identified in ATM1 and ATM2, the trend component has minimal impact. Let's try the seasonally adjusted plot.
components(atm4_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The seasonally-adjusted trendline plot above does not convey any clear weekly pattern.
atm4_data_ts %>% ACF() %>% autoplot()
The ACF plot indicates correlation at lags 7, 14, and 21 as expected. Also, correlation appears at lag 10.
atm4_data_ts %>% PACF() %>% autoplot()
The PACF shows correlation at lags 7, 10, and 19. The negative correlations at lags 10 and 19 suggest one or two secondary patterns exist in the dataset.
atm4_data_ts %>% gg_season(Cash, period = "week") +
labs(y="$ (in hundreds)", title="ATM4 Withdrawals")
I’ll admit the above plot of gg_season does not provide much value except perhaps for some nice colors.
atm4_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$ (in hundreds)",
title = "ATM4 Withdrawals"
)
The gg_subseries plot above does indicate a change in pattern on Tuesday, Wednesday, and Thursday, similar to that found in ATM1 and ATM2, but not quite as drastic and clear.
atm4_data_post0209 <- atm4_data_ts %>%
filter_index("2010-02-10" ~ "2010-04-30")
atm4_data_post0209 %>%
autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")
To better understand the final few months of the ATM4 dataset, the above plot shows perhaps a weekly seasonal pattern does exist but just not as clear as found in ATM1 and ATM2.
atm4_dcmp <- atm4_data_post0209 %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
The decomposition of the final few months of ATM4 does indicate a semblance of weekly pattern, but the remainder values still provide the largest impact compared to the seasonal or trend components.
components(atm4_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-10" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The seasonally adjusted trendline is still too irregular for my comfort. Unfortunately, the assessment doesn’t provide a clear indication of the patterns found in the final few months. Given the goal is to provide a forecast for the month of May 2010, I plan to try the models with seasonality and evaluate the results.
Given I’m struggling to find a viable weekly seasonal pattern through visual inspection, I will apply the different seasonal models through the train and test approach across the entire dataset.
# Create training set (1 year of data)
# 292 days for training, approx. 80% of data
train_atm4 <- atm4_data_ts %>%
filter_index("2009-05-01" ~ "2010-02-17")
# Fit the models
fit_atm4 <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm4)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 ATM4 Seasonal … 2.09e+5 NA NA NA NA <NULL> <NULL> NA
## 2 ATM4 Arima 1.25e+5 -2134. 4274. 4274. 4285. <cpl [7… <cpl [0… NA
## 3 ATM4 ETS_Add 1.05e+5 -2521. 5063. 5063. 5099. <NULL> <NULL> 101739.
## 4 ATM4 ETS_Mult 5.14e-1 -2469. 4958. 4959. 4995. <NULL> <NULL> 103349.
## 5 ATM4 ETS_Damp 4.88e-1 -2478. 4981. 4982. 5029. <NULL> <NULL> 106682.
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
The ARIMA model generates the lowest AICc, approximately 4274, compared to the ETS models.
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal naive Trai… 2.11 456. 347. -366. 422. 1 1 0.0620
## 2 ATM4 Arima Trai… 0.0482 352. 295. -501. 535. 0.850 0.772 0.0819
## 3 ATM4 ETS_Add Trai… -9.31 319. 249. -421. 450. 0.717 0.699 0.110
## 4 ATM4 ETS_Mult Trai… 11.5 321. 251. -412. 444. 0.723 0.705 0.108
## 5 ATM4 ETS_Damp Trai… -18.3 327. 257. -434. 463. 0.741 0.716 0.0931
According to the RMSE, the ETS Additive model produces the most accurate model against the training data at 318.97.
# Generate forecasts for 72 days
fc_atm4 <- fit_atm4 %>% forecast(h = "72 days")
fc_atm4 %>% accuracy(atm4_data_ts)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM4 Test -19.2 319. 266. -777. 808. 0.768 0.699 -0.0357
## 2 ETS_Add ATM4 Test -20.7 386. 319. -918. 960. 0.919 0.847 0.0325
## 3 ETS_Damp ATM4 Test -34.1 400. 330. -1017. 1059. 0.951 0.878 0.00763
## 4 ETS_Mult ATM4 Test -5.37 387. 321. -913. 958. 0.925 0.849 0.00836
## 5 Seasonal naive ATM4 Test -101. 520. 417. -1484. 1530. 1.20 1.14 -0.0989
Now forecasting the models on the test dataset, the ARIMA model performs the best with an RMSE of 318.87.
# Plot forecasts against actual values
fc_atm4 %>%
autoplot(filter_index(train_atm4, "2010-02-15" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(atm4_data_ts, "2010-02-15" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot, focused on the final two and a half months of the data, indicates the models are not following the weekly seasonal pattern. Let me try the truncated training and test approach used for ATM1 and ATM2 to better capture the weekly pattern from February through April 2010.
# ATM4 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm4 <- atm4_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-30")
train_atm4 <- atm4_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-14")
# Fit the models
fit_atm4 <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm4)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 ATM4 Seasonal … 2.09e+5 NA NA NA NA <NULL> <NULL> NA
## 2 ATM4 Arima 1.11e+5 -419. 841. 841. 845. <cpl [0… <cpl [0… NA
## 3 ATM4 ETS_Add 9.95e+4 -447. 913. 918. 934. <NULL> <NULL> 8.41e4
## 4 ATM4 ETS_Mult 5.76e-1 -452. 924. 928. 944. <NULL> <NULL> 4.57e5
## 5 ATM4 ETS_Damp 3.74e+0 -517. 1060. 1068. 1087. <NULL> <NULL> 4.09e6
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
The ARIMA model attains the best AICc value of 841.49, compared to the ETS models.
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal … Train… -1.05e+ 1 453. 349. -524. 578. 1 1 0.0769
## 2 ATM4 Arima Train… 1.94e-10 330. 280. -747. 780. 0.803 0.728 0.0556
## 3 ATM4 ETS_Add Train… -3.45e+ 1 290. 228. -204. 236. 0.652 0.640 0.0543
## 4 ATM4 ETS_Mult Train… -1.75e+ 2 676. 391. -301. 328. 1.12 1.49 0.512
## 5 ATM4 ETS_Damp Train… -1.51e+ 1 2022. 1372. -4.63 506. 3.93 4.46 0.803
The ETS Additive model produces the best RMSE of 289.9164 against the training data.
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm4 <- fit_atm4 %>% forecast(h = "16 days")
fc_atm4 %>% accuracy(new_seasonal_atm4)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM4 Test -135. 278. 229. -924. 937. 0.655 0.613 -0.308
## 2 ETS_Add ATM4 Test -147. 274. 213. -636. 645. 0.610 0.605 0.0327
## 3 ETS_Damp ATM4 Test 616. 690. 616. 670. 670. 1.77 1.52 -0.194
## 4 ETS_Mult ATM4 Test -92.1 242. 189. -618. 632. 0.542 0.534 -0.00529
## 5 Seasonal naive ATM4 Test -244. 449. 370. -1178. 1201. 1.06 0.991 0.219
Forecasting the models for the final 16 days of the dataset, the ETS Multiplicative model achieves the best RMSE at 241.88.
# Plot forecasts against actual values
fc_atm4 %>%
autoplot(filter_index(train_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The plot of the forecasts for the five seasonal models shows the ETS Additive and ETS Multiplicative models following the weekly pattern best. The ARIMA model produces an essentially constant, horizontal line, while the ETS model with dampening immediately starts producing negative values, which is not helpful. Even with the small window shown, the Seasonal Naive forecasts also follow the weekly seasonal pattern.
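To see why the ARIMA forecast is flat, the selected specification can be inspected directly (a sketch); the earlier report table showed empty AR and MA root lists for this fit, which is consistent with a near-constant forecast:
# Inspect the automatically selected ARIMA specification for the truncated ATM4 training set
fit_atm4 %>%
  select(Arima) %>%
  report()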
fit_atm4_ets <- train_atm4 %>%
model(ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")))
report(fit_atm4_ets)
## Series: Cash
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.210569
## gamma = 0.0002234196
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 2222.802 0.8511952 1.166145 1.031802 1.261922 1.229669 1.281946 0.177322
##
## sigma^2: 0.5761
##
## AIC AICc BIC
## 923.6732 928.3540 944.2776
The ETS Multiplicative model specification is shown above.
# Check the residuals of the ETS Multiplicative model
fit_atm4_ets %>% gg_tsresiduals()
Given the results of the test forecasts, the ETS Multiplicative model is chosen. Its residuals appear close to white noise, and the ACF shows no significant correlation, a good sign for the model.
In an attempt to improve the model, the Box-Cox transformation is applied along with the seasonal model functions.
# Consider Box-Cox
lambda <- new_seasonal_atm4 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
new_seasonal_atm4 %>%
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM4 withdrawals with $\\lambda$ = ",
round(lambda,2))))
The Box-Cox transformation of the final few months of the dataset helps clarify the weekly seasonal pattern with a lambda value of 0.3.
fit_atm4_bc <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
Arima = ARIMA(box_cox(Cash, lambda)),
ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
)
report(fit_atm4_bc)
## # A tibble: 4 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM4 Seasonal… 53.7 NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM4 Arima 23.6 -176. 360. 361. 368. <cpl [1… <cpl [0… NA NA
## 3 ATM4 ETS_Add 25.0 -206. 432. 437. 453. <NULL> <NULL> 21.1 18.5
## 4 ATM4 ETS_Mult 0.135 -216. 451. 456. 472. <NULL> <NULL> 30.5 31.0
## # … with 1 more variable: MAE <dbl>
The ARIMA model with transformation performs the best on the training data with an AICc of 360.74.
fit_atm4_bc %>% accuracy()
## # A tibble: 4 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal naive Training -10.5 453. 349. -524. 578. 1 1 0.0769
## 2 ATM4 Arima Training 60.6 293. 239. -218. 257. 0.684 0.646 -0.0113
## 3 ATM4 ETS_Add Training 51.8 298. 218. -126. 158. 0.626 0.658 0.0641
## 4 ATM4 ETS_Mult Training 11.0 370. 283. -310. 348. 0.810 0.816 0.144
The best accuracy on the training set goes to the ARIMA model with an RMSE of 293.02.
# Generate forecasts for the next 16 days
fc_atm4_bc <- fit_atm4_bc %>% forecast(h = "16 days")
fc_atm4_bc %>% accuracy(new_seasonal_atm4)
## # A tibble: 4 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM4 Test -174. 321. 246. -1011. 1025. 0.705 0.708 0.380
## 2 ETS_Add ATM4 Test -207. 341. 275. -737. 747. 0.788 0.752 0.0637
## 3 ETS_Mult ATM4 Test -317. 459. 363. -922. 930. 1.04 1.01 0.401
## 4 Seasonal naive ATM4 Test -616. 823. 705. -2079. 2094. 2.02 1.82 0.433
The forecast of the models for the final 16 days of the dataset shows the ARIMA model with the best-performing RMSE of 320.73.
# Plot forecasts against actual values
fc_atm4_bc %>%
autoplot(filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot of the seasonal models with the Box-Cox transformation clearly shows the forecasts do not outperform those of the models without the transformation.
Overall, the ETS Multiplicative model without transformation performs the best on the final few months of the ATM4 dataset.
# Generate forecasts for 16 days (two weeks) plus the 31 days of May
fc_atm4_final <- fit_atm4 %>% forecast(h = "47 days")
# Plot forecasts against actual values
fc_atm4_final %>%
autoplot(filter_index(train_atm4, "2010-02-16" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm4, "2010-02-16" ~ "2010-05-31"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Forecasts for the ATM4 models.
# dataframe for Excel output
atm4_forecast_may2010 <- fc_atm4_final %>%
filter(.model == "ETS_Mult", DATE > "2010-04-30")
atm4_forecast_may2010 <- data.frame(atm4_forecast_may2010) %>%
select(c(DATE, ATM, .mean))%>%
rename(Cash = .mean)
atm4_forecast_may2010
## DATE ATM Cash
## 1 2010-05-01 ATM4 431.30801
## 2 2010-05-02 ATM4 487.36884
## 3 2010-05-03 ATM4 355.88127
## 4 2010-05-04 ATM4 74.05705
## 5 2010-05-05 ATM4 536.09275
## 6 2010-05-06 ATM4 514.16819
## 7 2010-05-07 ATM4 527.55000
## 8 2010-05-08 ATM4 431.31970
## 9 2010-05-09 ATM4 487.38205
## 10 2010-05-10 ATM4 355.89092
## 11 2010-05-11 ATM4 74.05906
## 12 2010-05-12 ATM4 536.10727
## 13 2010-05-13 ATM4 514.18212
## 14 2010-05-14 ATM4 527.56430
## 15 2010-05-15 ATM4 431.33139
## 16 2010-05-16 ATM4 487.39526
## 17 2010-05-17 ATM4 355.90056
## 18 2010-05-18 ATM4 74.06107
## 19 2010-05-19 ATM4 536.12181
## 20 2010-05-20 ATM4 514.19606
## 21 2010-05-21 ATM4 527.57860
## 22 2010-05-22 ATM4 431.34308
## 23 2010-05-23 ATM4 487.40847
## 24 2010-05-24 ATM4 355.91021
## 25 2010-05-25 ATM4 74.06308
## 26 2010-05-26 ATM4 536.13634
## 27 2010-05-27 ATM4 514.20999
## 28 2010-05-28 ATM4 527.59290
## 29 2010-05-29 ATM4 431.35477
## 30 2010-05-30 ATM4 487.42168
## 31 2010-05-31 ATM4 355.91985
ATM4 final forecast values.
I really wanted to find some underlying relationship between the four different ATMs that would better explain some of the outliers and anomalies and perhaps provide clarity to the shift in the weekly pattern in February 2010. I totaled the daily withdrawals across the four ATMs in the hopes of seeing a clear pattern when combined.
atm_data_all <- bind_rows(atm1_data_ts,
atm2_data_ts,
atm3_data_ts_all,
atm4_data_ts)
atm_data_total_by_day <- atm_data_all %>%
index_by(DATE) %>%
summarize(Cash = sum(Cash))
#atm_data_total_by_day
atm_data_total_by_day %>% autoplot(Cash) +
labs(
y = "Cash (in hundreds $USD)",
title = "ATMs Withdrawals Combined",
subtitle = "Summary contains Imputed Data"
)
Overall, the weekly seasonal pattern still emerges, but combining the data does not reveal any additional insights.
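One way to confirm that weekly pattern in the combined series is a gg_season plot by day of week; a quick sketch using the combined daily totals from above:
# Weekly pattern in the combined series (sketch)
atm_data_total_by_day %>%
  gg_season(Cash, period = "week") +
  labs(y = "Cash (in hundreds $USD)",
       title = "Combined ATM Withdrawals by Day of Week")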
# Seasonally adjusted plot
atm_dcmp <- atm_data_total_by_day %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm_dcmp %>% components() %>% autoplot()
The full STL decomposition repeats the findings from the individual ATMs: the overall trend is weak, the remainder component contributes the most variation, and the weekly seasonal pattern follows. As noted particularly for ATM1 and ATM2, the remainder plot shows greater variance beginning in February 2010.
components(atm_dcmp) %>%
as_tsibble() %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
The seasonally adjusted series echoes the earlier findings; beginning in February 2010, the seasonal adjustment clearly diverges from the actual data.
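To put a rough number on that divergence, one could flag the days with unusually large STL remainders; a minimal sketch using the atm_dcmp decomposition above (the 3-standard-deviation cutoff is an arbitrary choice):
# Flag days with unusually large STL remainders (sketch)
components(atm_dcmp) %>%
  as_tsibble() %>%
  filter(abs(remainder) > 3 * sd(remainder, na.rm = TRUE)) %>%
  select(DATE, remainder)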
Combine the forecasts for the four ATM datasets and write to an Excel file.
# Write to Excel file
atm_forecast_may2010 <- bind_rows(atm1_forecast_may2010,
atm2_forecast_may2010,
atm3_forecast_may2010,
atm4_forecast_may2010
)
write_xlsx(atm_forecast_may2010, "data/ptanofsky_atm_forecast_may2010.xlsx")
Prompt: Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straightforward. Add this to your existing files above.
# Read in data
power_data_raw <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")
# Change column name to 'Month'
names(power_data_raw)[names(power_data_raw) == 'YYYY-MMM'] <- 'Month'
# Display first 5 rows for visual inspection
#head(power_data_raw)
# Display summary for initial assessment
summary(power_data_raw)
## CaseSequence Month KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
The summary of the residential power data indicates 192 months, which is exactly 16 years. The KWH column shows 1 missing value and a range of 770523 to 10655730. The minimum value being so far below the first quartile may indicate an outlier. The CaseSequence column appears to be an integer unique identifier for each month, so it will be ignored moving forward.
# Display dimensions for assessment
# should be 16 years of data 1998-2013
# Yep, 192/12 = 16
dim(power_data_raw)
## [1] 192 3
The dimensions output confirms the summary output, 192 observations with three columns.
power_data_ts <- power_data_raw %>%
mutate(Month = yearmonth(Month)) %>%
mutate(KWH = KWH/1e3) %>% # In thousands
select(-c(CaseSequence)) %>%
as_tsibble(index = Month)
# Output first 5 rows of tsibble
#head(power_data_ts)
# Initial plot of data
power_data_ts %>%
autoplot(KWH) +
labs(title="Residential Power Consumption",
y="KWH (in thousands)")
After converting the raw data to a tsibble with Month as the index, we can see a missing value somewhere in 2008 and an extremely low outlier in mid-2010.
power_data_ts$Month[is.na(power_data_ts$KWH)]
## <yearmonth[1]>
## [1] "2008 Sep"
Month with missing value is “2008 Sep”.
power_data_ts$Month[power_data_ts$KWH < 3000]
## <yearmonth[2]>
## [1] NA "2010 Jul"
Month with outlier is “2010 Jul”.
Given the data is monthly and appears to follow a clear seasonal pattern, I will impute both the missing value and the outlier using the median value for the corresponding calendar month.
First, I will find the median value for all September entries followed by July.
# I want all the September values
sep_data <- power_data_ts %>%
filter(str_detect(Month, "Sep"))
sep_data <- as_tibble(sep_data)
sep_kwh_med <- sep_data %>%
summarise(Median = median(KWH, na.rm = TRUE))
sep_kwh_med
## # A tibble: 1 × 1
## Median
## <dbl>
## 1 7667.
# Median 7666.97
power_data_ts[129, 2] <- sep_kwh_med
Median value for all other September values is 7666.97.
# I want all the July values
jul_data <- power_data_ts %>%
filter(str_detect(Month, "Jul"))
jul_data <- as_tibble(jul_data)
jul_kwh_med <- jul_data %>%
summarise(Median = median(KWH, na.rm = TRUE))
jul_kwh_med
## # A tibble: 1 × 1
## Median
## <dbl>
## 1 7679.
# Median 7678.623
power_data_ts[151, 2] <- jul_kwh_med
Median value for all other July values is 7678.62.
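As an aside, the same imputation could be done in one pass from month-of-year medians, avoiding the hard-coded row indices above; a rough sketch (the 3000 cutoff mirrors the earlier outlier filter, and the result is not used below):
# Impute NAs and the extreme low value from month-of-year medians (sketch)
power_data_imputed <- power_data_ts %>%
  as_tibble() %>%
  mutate(MonthName = str_sub(as.character(Month), 6, 8)) %>%
  group_by(MonthName) %>%
  mutate(MonthMedian = median(KWH[!is.na(KWH) & KWH >= 3000]),
         KWH = if_else(is.na(KWH) | KWH < 3000, MonthMedian, KWH)) %>%
  ungroup() %>%
  select(Month, KWH) %>%
  as_tsibble(index = Month)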
power_data_ts %>% autoplot(KWH) +
labs(title="Residential Power Consumption",
y="KWH (in thousands)")
The above plot shows the residential customer power data with the imputed values. A seasonal pattern clearly exists, along with increasing variance near the end of the plot, so a Box-Cox transformation could be useful. A slight increasing trend also appears, at least for the second half of the plot.
power_dcmp <- power_data_ts %>%
model(stl = STL(KWH ~ trend(window=Inf) + season(period=12, window="periodic")))
power_dcmp %>% components() %>% autoplot()
The decomposition of the data indicates an increasing trend, along with a seasonal pattern and a remainder component that appear nearly equal in weight.
components(power_dcmp) %>%
as_tsibble() %>%
autoplot(KWH, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "KWH (in thousands)", title = "Seasonally Adjusted Trendline")
The plot of the seasonally adjusted data indicates the seasonal pattern does persist through the full dataset with a few exceptions throughout.
power_data_ts %>% ACF() %>% autoplot()
ACF plot shows the highest correlation at lag 12, which is expected for monthly data over the course of several years.
power_data_ts %>% PACF() %>% autoplot()
The PACF plot indicates autocorrelation at lags 1, 2, 5, 11, and 12, an interesting result for monthly data, as I expected the PACF to show its highest spike at lag 12.
power_data_ts %>% gg_season(KWH, period = "year") +
labs(y="KWH (in thousands)", title="Residential Power Consumption")
The gg_season plot certainly shows a yearly pattern with low usage in the spring and fall and higher usage in the summer and winter.
power_data_ts %>%
gg_subseries(period = "year") +
labs(y="KWH (in thousands)", title="Residential Power Consumption")
The gg_subseries plot indicates increasing usage for almost every month over the course of the dataset. This increase is another indicator that a Box-Cox transformation will be necessary.
As performed in Part A, I will use a train and test approach to select the best performing model. The models will be trained on the first 80% of the dataset, in this case 154 months, and evaluated on the final 38 months.
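The split sizes follow directly from the 80% rule; a trivial arithmetic check:
# 80/20 split arithmetic (sketch)
n_total <- nrow(power_data_ts)   # 192 months
n_train <- round(0.8 * n_total)  # 154 months for training
n_test  <- n_total - n_train     # 38 months held out for testing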
# Train on 154 months, which is 12 years and 8 months (80% of the total)
# Total is 192 months (16 years)
train_power_data <- power_data_ts %>%
filter_index("1998-Jan" ~ "2010-Oct")
With the increasing variance in the plots, a Box-Cox transformation is applied to calculate the lambda value.
lambda <- power_data_ts %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
train_power_data %>%
autoplot(box_cox(KWH, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM1 withdrawals with $\\lambda$ = ",
round(lambda,2))))
The Box-Cox transformation does appear to address the increasing variance in the initial plot with a lambda value of -0.2.
# Fit the models
fit_power_bc <- train_power_data %>%
model(
Naive = NAIVE(box_cox(KWH, lambda)),
`Seasonal naive` = SNAIVE(box_cox(KWH, lambda)),
`Random walk` = RW(box_cox(KWH, lambda)),
Arima = ARIMA(box_cox(KWH, lambda)),
ETS_Add = ETS(box_cox(KWH, lambda) ~ error("A") + trend("A") + season("A")),
ETS_Mult = ETS(box_cox(KWH, lambda) ~ error("M") + trend("A") + season("M")),
ETS_Damp = ETS(box_cox(KWH, lambda) ~ error("M") + trend("Ad") + season("M"))
)
report(fit_power_bc)
## # A tibble: 7 × 11
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 Naive 1.21e-3 NA NA NA NA <NULL> <NULL> NA NA
## 2 Seasona… 3.58e-4 NA NA NA NA <NULL> <NULL> NA NA
## 3 Random … 1.21e-3 NA NA NA NA <NULL> <NULL> NA NA
## 4 Arima 2.10e-4 398. -785. -785. -771. <cpl [1… <cpl [1… NA NA
## 5 ETS_Add 2.22e-4 268. -503. -498. -451. <NULL> <NULL> 1.99e-4 2.05e-4
## 6 ETS_Mult 1.33e-5 269. -504. -500. -452. <NULL> <NULL> 1.97e-4 2.04e-4
## 7 ETS_Damp 1.35e-5 268. -501. -496. -446. <NULL> <NULL> 1.99e-4 2.08e-4
## # … with 1 more variable: MAE <dbl>
The best performing model on the training set according to the AICc is the ARIMA model at -785.02.
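For reference, the same ranking can be pulled with glance(); a small sketch (the benchmark models do not report an AICc):
# Rank models by AICc (sketch)
glance(fit_power_bc) %>%
  arrange(AICc) %>%
  select(.model, AICc)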
fit_power_bc %>% accuracy()
## # A tibble: 7 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Naive Training -6.45 1289. 1069. -2.21 17.2 1.92 1.80 0.182
## 2 Seasonal naive Training 69.0 715. 557. 0.478 8.79 1 1 0.267
## 3 Random walk Training -6.45 1289. 1069. -2.21 17.2 1.92 1.80 0.182
## 4 Arima Training 7.10 518. 397. -0.309 6.34 0.713 0.725 0.0885
## 5 ETS_Add Training 14.8 542. 422. -0.447 6.69 0.758 0.758 0.324
## 6 ETS_Mult Training 14.0 539. 414. -0.480 6.55 0.744 0.754 0.311
## 7 ETS_Damp Training 53.2 543. 414. 0.167 6.50 0.744 0.760 0.296
The best RMSE on the training data was attained by the ARIMA model followed by the three ETS models.
# Generate forecasts for 38 months, so we'll see if it picks up the seasonal nature
fc_power_bc <- fit_power_bc %>% forecast(h = "38 months")
fc_power_bc %>% accuracy(power_data_ts)
## # A tibble: 7 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima Test 621. 1034. 757. 7.25 9.42 1.36 1.45 0.229
## 2 ETS_Add Test 633. 1084. 798. 7.33 9.99 1.43 1.52 0.175
## 3 ETS_Damp Test 691. 1107. 820. 8.13 10.2 1.47 1.55 0.172
## 4 ETS_Mult Test 590. 1038. 751. 6.77 9.36 1.35 1.45 0.164
## 5 Naive Test -1348. 2580. 2155. -23.4 32.0 3.87 3.61 0.689
## 6 Random walk Test -1348. 2580. 2155. -23.4 32.0 3.87 3.61 0.689
## 7 Seasonal naive Test 303. 1018. 850. 2.77 11.0 1.53 1.42 0.419
The best RMSE on the 38-month test set is achieved by the Seasonal naive model, followed closely by the ARIMA model and the ETS Multiplicative model.
# Plot forecasts against actual values
fc_power_bc %>%
autoplot(filter_index(power_data_ts, "2010-Nov" ~ "2013-Dec"), level = NULL) +
autolayer(
filter_index(power_data_ts, "2010-Nov" ~ "2013-Dec"),
colour = "black"
) +
labs(
y = "KWH (in thousands)",
title = "Power Forecast"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot of the forecasts overlaid with the actual data indicates several of the models predict the monthly seasonal pattern well.
fit_power_arima <- train_power_data %>%
model(ARIMA(box_cox(KWH, lambda)))
report(fit_power_arima)
## Series: KWH
## Model: ARIMA(1,0,1)(0,1,1)[12] w/ drift
## Transformation: box_cox(KWH, lambda)
##
## Coefficients:
## ar1 ma1 sma1 constant
## -0.4374 0.6839 -0.7247 0.0021
## s.e. 0.1778 0.1394 0.0706 0.0007
##
## sigma^2 estimated as 0.00021: log likelihood=397.73
## AIC=-785.47 AICc=-785.02 BIC=-770.69
The report of the ARIMA model indicates an ARIMA(1,0,1)(0,1,1)[12] with drift. The result indicates a seasonal differencing of 1 along with a seasonal MA term of 1. The non-seasonal (p,d,q) shows an AR term of 1 and an MA term of 1. The seasonal lag is denoted as 12, which is expected for the monthly data.
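The differencing orders chosen by ARIMA can be cross-checked against the unit-root-based features; a quick sketch (results not asserted here):
# Cross-check ARIMA's differencing choices with unit-root features (sketch)
train_power_data %>%
  features(KWH, list(unitroot_ndiffs, unitroot_nsdiffs))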
# Check the residuals of ARIMA
fit_power_arima %>% gg_tsresiduals()
The ARIMA model residuals appear to be white noise and do not show any significant autocorrelation in the ACF plot.
augment(fit_power_arima) %>% features(.innov, ljung_box, lag=24, dof=4)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox(KWH, lambda)) 25.5 0.185
Based on the Ljung-Box test result, the large p-value indicates the residuals are consistent with white noise.
Let’s forecast for the year 2014 with all the models. The forecast will be for 50 months instead of 38 as above in order to capture the data for 2014.
# Generate forecasts for 50 months, which covers the end of the provided time series and the next 12 months of 2014
fc_power_bc14 <- fit_power_bc %>% forecast(h = "50 months")
fc_power_bc14 %>%
autoplot(filter_index(power_data_ts, "2010-Nov" ~ "2014-Dec"), level = NULL) +
autolayer(
filter_index(power_data_ts, "2010-Nov" ~ "2014-Dec"),
colour = "black"
) +
labs(
y = "KWH (in thousands)",
title = "Power Forecast"
) +
guides(colour = guide_legend(title = "Forecast"))
Based on the strong results of the ARIMA model on both the training and test data, I've selected the ARIMA model for the final 2014 forecast values. The Seasonal Naive model did well, but I believe the ARIMA model will better capture the trend and any shifts over the next year.
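One alternative worth noting: the 2014 values below come from the model fit on the training window, and the chosen specification could instead be refit on the full series before forecasting. A sketch of that approach (not used for the output file):
# Refit the selected specification on the full series, then forecast 2014 (sketch)
fit_power_full <- power_data_ts %>%
  model(Arima = ARIMA(box_cox(KWH, lambda)))
fc_power_2014 <- fit_power_full %>% forecast(h = "12 months")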
# Write to excel
# dataframe for Excel output
Residental_Forecast_2014 <- fc_power_bc14 %>%
filter(.model == "Arima", Month > yearmonth("2013 Dec"))
Residental_Forecast_2014 <- data.frame(Residental_Forecast_2014) %>%
select(c(Month, .mean))%>%
rename(KWH = .mean)
# Multiply by 1000 to return to original form
Residental_Forecast_2014$KWH <- Residental_Forecast_2014$KWH * 1e3
Residental_Forecast_2014$Month <- as.character(Residental_Forecast_2014$Month)
Residental_Forecast_2014 <- Residental_Forecast_2014 %>%
mutate(Month = str_replace(Month, " ", "-")) %>%
rename(`YYYY-MMM` = Month)
Case_Seq <- c(seq(from=925, to=936))
Residental_Forecast_2014 <- bind_cols(CaseSequence=Case_Seq, Residental_Forecast_2014)
# Write to Excel file
write_xlsx(Residental_Forecast_2014, "data/ptanofsky_kwh_forecast_2014.xlsx")
Prompt: Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
library(lubridate)
# Read in data
pipe1_data_raw <- read_excel("data/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
pipe2_data_raw <- read_excel("data/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
# From: https://stackoverflow.com/questions/53635818/convert-datetime-from-excel-to-r
pipe1_data_raw$`Date Time` <- as.POSIXct(pipe1_data_raw$`Date Time`,
origin="1899-12-30",
tz="GMT")
pipe2_data_raw$`Date Time` <- as.POSIXct(pipe2_data_raw$`Date Time`,
origin="1899-12-30",
tz="GMT")
#summary(pipe1_data_raw)
#summary(pipe2_data_raw)
#dim(pipe1_data_raw)
#dim(pipe2_data_raw)
# Calculate the average flow per hour for pipe1
pipe1_data_raw$Date <- as.Date(pipe1_data_raw$`Date Time`)
pipe1_data_raw$Time <- hour(pipe1_data_raw$`Date Time`) + 1
pipe1_data_raw <- pipe1_data_raw %>%
group_by(Date, Time) %>%
summarise(WF_AVG = mean(WaterFlow)) %>% ungroup()
pipe1_data_raw$`Date Time` <- with(pipe1_data_raw, ymd_h(paste(Date, Time)))
pipe1_data_raw <- pipe1_data_raw %>%
select(c(`Date Time`, WF_AVG)) %>%
rename(WaterFlow = WF_AVG)
head(pipe1_data_raw)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 26.1
## 2 2015-10-23 02:00:00 18.9
## 3 2015-10-23 03:00:00 15.2
## 4 2015-10-23 04:00:00 23.1
## 5 2015-10-23 05:00:00 15.5
## 6 2015-10-23 06:00:00 22.7
The above output confirms the hourly average for Pipe 1 has been calculated. Note: I rounded each Pipe 1 timestamp up to the next hour, so the Pipe 1 and Pipe 2 datasets line up and both begin with “2015-10-23 01:00:00”.
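For reference, lubridate's ceiling_date() can do the same rounding in one step; a sketch that re-reads the raw file, since pipe1_data_raw was overwritten above (timestamps falling exactly on the hour are treated slightly differently than with hour() + 1):
# Round each timestamp up to the next hour with ceiling_date(), then average (sketch)
pipe1_hourly <- read_excel("data/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric")) %>%
  mutate(Hour = ceiling_date(`Date Time`, unit = "hour")) %>%
  group_by(Hour) %>%
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop")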
# Define as tsibble
pipe1_data_ts <- pipe1_data_raw %>%
as_tsibble(index = `Date Time`)
pipe1_data_ts %>% autoplot()
The above plot is a simple visualization of the Pipe 1 waterflow data.
pipe2_data_ts <- pipe2_data_raw %>%
as_tsibble(index = `Date Time`)
pipe2_data_ts %>%
autoplot(colour = "gray") +
autolayer(pipe1_data_ts, colour = "#0072B2")
The above plot shows the Pipe 1 hourly mean data in blue overlayed on the Pipe 2 provided hourly data in gray.
pipe2_data_ts %>%
filter_index("2015-10-23" ~ "2015-11-01") %>%
autoplot(colour = "gray") +
autolayer(pipe1_data_ts, colour = "#0072B2")
In an attempt to better compare the Pipe 1 and Pipe 2 data, I’ve truncated the plot to only include the hours in which data is available for both pipe waterflows.
Unfortunately, I did not allot myself the time needed to fully tackle Part C. I performed the initial data wrangling above to compute the hourly mean for the Pipe 1 waterflow. The initial plots for visual inspection did not provide any immediate insights, so I did not move forward with any modeling. The discrepancy in date ranges between the provided Pipe 1 and Pipe 2 data definitely requires additional analysis.
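Had more time been available, the natural next steps would have been the stationarity check the prompt asks for, followed by a one-week-ahead forecast; a rough sketch of what that might look like (not run or validated here, and missing hours may need to be imputed after fill_gaps()):
# Stationarity check and one-week forecast for Pipe 2 (sketch, not validated)
pipe2_data_ts %>%
  fill_gaps() %>%
  features(WaterFlow, unitroot_kpss)
pipe2_data_ts %>%
  fill_gaps() %>%
  model(ARIMA(WaterFlow)) %>%
  forecast(h = "7 days")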