Proj1 ATM

PART A -ATM

Load Data

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

getwd()

## [1] "T:/00-624 HH Predictive Analytics"

library(ggplot2)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(fpp2)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## -- Attaching packages ---------------------------------------------- fpp2 2.4 --

## v forecast  8.13     v expsmooth 2.3 
## v fma       2.4

##

library(magrittr)
library(dplyr)
library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v tibble  3.0.4     v purrr   0.3.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x tidyr::extract()      masks magrittr::extract()
## x dplyr::filter()       masks stats::filter()
## x data.table::first()   masks dplyr::first()
## x dplyr::lag()          masks stats::lag()
## x data.table::last()    masks dplyr::last()
## x purrr::set_names()    masks magrittr::set_names()
## x purrr::transpose()    masks data.table::transpose()

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

theme_set(theme_light())

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(forecast)

atm.df = fread("ATM624Data.txt", sep='\t', header=TRUE, stringsAsFactors =FALSE)

atm.df$DATE = as.Date(atm.df$DATE, format='%m/%d/%Y 12:00:00 AM', origin = '1990-01-01')
head(atm.df)

##          DATE  ATM Cash
## 1: 2009-05-01 ATM1   96
## 2: 2009-05-01 ATM2  107
## 3: 2009-05-02 ATM1   82
## 4: 2009-05-02 ATM2   89
## 5: 2009-05-03 ATM1   85
## 6: 2009-05-03 ATM2   90

First the needed library is loaded into the memory. The data are read. There are three columns for this data set, date, ATM (ATM1- ATM4 ), and the cash Withdrawn from each corresponding ATM machine and the date. There are 1474 observations ( date and ATM combo ) in this data set.

The date format from the original database is fixed into standard date format.

EDA - Exploratory Data Analysis

library(tidyverse)
summary (atm.df)

##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10920.0  
##                                          NA's   :19

atm.df2 <- atm.df %>% 
  drop_na () %>% 
  spread (ATM, Cash) 
atm.ts2 <- ts(atm.df2 %>%  select (-DATE) )

Summary of the data shows us that the data ranges from May 1st 2009 to May14th 2010. There are 19 rows that contain missing value in cash withdrawal. The cash withdrawal amount ranges from $0.00 to $ 10920.

In order for us to plot the data, we first dropped the 19 missing values, then spread the columns of ATM and cash.

library(forecast) # to plot ts objects
autoplot(atm.ts2) +
  labs(title = "Cash withdrawn from 4 ATMS",
       subtitle = "May 2009 - April 2010",
       x = "Day") +
  scale_y_continuous("Cash withdrawn (hundreds)", labels = dollar) +
  scale_color_discrete(NULL) +
  theme(legend.position = c(0.1, 0.8))

The daily cash withdrawn money are plotted out by day, between these time Frame. cash withdrawn from each ATM machine are plotted Out on the daily basis.

Due to a huge spike (outlier ) of cash withdrawn at the later stage (around day 250, or early 2010) at ATM #4, the scale of all the rest of the cash withdrawn are all clustered together and make the figure hard to interpret. So the daily cash plot is not very useful. Then we will try other types of data visualization below. To reduce the scaling obscurity, facetted plots are used for each individual ATM machine so that the outlier of cash withdrawn reflected on the Y axies are not obscuring the normal withdrawn.

# convert df back to tidy format for faceted plotting
atm.df2 %>% gather(ATM, Cash, -DATE) %>% 
  # plot
  ggplot(aes(x = DATE, y = Cash, col = ATM)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
  labs(title = "Cash withdrawn from 4 ATMs",
       subtitle = "May 2009 - April 2010",
       x = "Date") +
  scale_y_continuous("Cash withdrawn (hundreds)", labels = dollar)

Yes indeed, the facetted plot clearly shows that ATM1 and ATM2 machine perform consistently throughout all data periods. The daily cash withdrawn lies within $150 each machine all throughout.

However, there is clearly an outlier of cash withdrawn on ATM #4, in the amount exceeding $10,000, around February 2010. Otherwise ATM 4 machine behaves very similar to ATM1 and ATM 2.

ATM3 machine has no cash withdrawn until April 2010, indicating that this is a newly installed ATM machine around that time frame.

atm.df$DateStr = paste0(year(atm.df$DATE), '-', month(atm.df$DATE))

atm.df = atm.df %>% 
  dplyr::filter(!is.na(Cash)) %>%
  dplyr::group_by(DateStr) %>%
  dplyr::summarise(Cash = sum(Cash)) %>%
  dplyr::select(-DateStr)

## `summarise()` ungrouping output (override with `.groups` argument)

atm.ts = ts(atm.df, start=c(2009,5), end=c(2010,4), frequency=12)

autoplot(atm.ts) +
  xlab('Time') +
  ylab('MonthlyCash') +
  ggtitle ('Time series plot showing total monthly cash withdrawn from all four ATMs')

For the modeling, instead of using the daily cash withdrawn, we decided to use the monthly aggregation cash withdrawn. That way, we hope to see a better and more reliable indication of the ATM machine performance.

The missing value from the ATM data frame is filtered out, and then the daily cash amount are grouped together from all four ATM machines are grouped an added together. The aggregated money for each day are further summarized into monthly cash withdrawn, so that it can space out better on the plot. This aggregated cash withdrawn are plotted out as a line graph, starting from May 2009 and end in April 2010. From the line graph, we can see that there is a jump in cash withdrawn around February 2010. The withdrawn amount monthly our somewhat uneven.

calcRMSE = function(modelName, modelFunc)
{
  res = tsCV(atm.ts,modelFunc)
  print(paste('RMSE for model', modelName, '=', sqrt(mean(res^2, na.rm=TRUE))))
}

calcRMSE('Seasonal Naive', snaive)

## [1] "RMSE for model Seasonal Naive = 6445.52777301652"

calcRMSE('Random Walk', rwf)

## [1] "RMSE for model Random Walk = 4085.96603688105"

We applied the seasonal naive model and the random walk model into this time series. We can see that the RMSE for random walk model is 4085, smaller than the RMSE of seasonal naive model ( @ 6445 ). So seasonal not if model is out of the game In terms of modeling for ATM withdrawal data.

func <- function(x, h)
  {forecast(auto.arima(x), h=h)}
calcRMSE('ARIMA', func)

## [1] "RMSE for model ARIMA = 4326.07477846553"

The ARIMA model did not give us a satisfying RSME (at 4326, which is bigger than the random walk model). So we will not use ARIMA model for our forecast.

func  <- function(x, h)
  {forecast(ets(x, model="ZZZ" ), h=h)}
calcRMSE('ETS', func)

## [1] "RMSE for model ETS = 3902.99872575104"

fit  = ets(atm.ts, model = 'ZZZ')
summary(fit )

## ETS(M,N,N) 
## 
## Call:
##  ets(y = atm.ts, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.5293 
## 
##   Initial states:
##     l = 15080.8623 
## 
##   sigma:  0.1853
## 
##      AIC     AICc      BIC 
## 228.5467 231.5467 230.0015 
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE MASE       ACF1
## Training set 494.8631 3420.644 2829.518 0.864177 14.76745  NaN -0.1823195

checkresiduals(fit )

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 1.1929, df = 3, p-value = 0.7547
## 
## Model df: 2.   Total lags used: 5

fcst  = forecast(fit , h=1)
print(fcst )

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## May 2010       18224.17 13897.03 22551.31 11606.38 24841.96

autoplot(fcst , pi=TRUE) +
  ggtitle("Monthly withdrawal forecast from all four ATMs")

We then applied ETS model, or exponential smoothing model on this data. the RMS E for this modeling is 3902, smaller than the random walk model. So we decide to use ETS model for our forecast of ATM withdrawal.

The system has chosen the ETS(M,N,N) as our model, meaning that every term is multiplicative, no trend, no seasonality.

Ljung-Box test of the residual test indicates that the residual is not significant, at P value of 0.75. Therefore no further transformation is needed. Our model can be readily available to use.

We used this model to forecast the cash withdrawn for May 2010, and got a point forecast of 18224. Also the corresponding 80% confidence interval and 95% confidence interval our given as well.

# fwrite (data.frame (fcst), "ATMforcastMay2010.csv", sep=",", col.names=TRUE, row.names = FALSE)

Proj1 ATM

Hui Han

PART A -ATM

Load Data

EDA - Exploratory Data Analysis