We load the data and then use the `skim()` and `vis_dat()` functions to assess the quality of the data; the resulting summaries appear below, along with some observations.
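A minimal sketch of this loading and quality-check step follows; the file name `ATM624Data.xlsx` and the date conversion are assumptions, not confirmed by the output.

```r
# Hedged sketch of the load + quality-check step described above.
library(readxl)   # read the Excel workbook (file name is an assumption)
library(dplyr)
library(skimr)
library(visdat)

atm <- read_excel("ATM624Data.xlsx") %>%
  mutate(DATE = as.Date(DATE))                  # DATE may arrive as POSIXct from Excel

summary(atm)                                    # base-R overview (printed below)
atm_skim <- atm %>% group_by(ATM) %>% skim()    # grouped skim tables (below)
vis_dat(atm)                                    # visual map of column types and NAs
```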
Next we will perform some exploratory data analysis to confirm our initial findings and identify other potential issues.
##       DATE                ATM                 Cash
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5
##  Median :2009-11-01   Mode  :character   Median :   73.0
##  Mean   :2009-10-31                      Mean   :  155.6
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0
##  Max.   :2010-05-14                      Max.   :10919.8
##                                          NA's   :19
| Name                   | atm_skim |
|------------------------|----------|
| Number of rows         | 1474     |
| Number of columns      | 3        |
| Column type frequency: |          |
| Date                   | 1        |
| numeric                | 1        |
| Group variables        | ATM      |
Variable type: Date
| skim_variable | ATM | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| DATE | ATM1 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM2 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM3 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM4 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | NA | 0 | 1 | 2010-05-01 | 2010-05-14 | 2010-05-07 | 14 |
Variable type: numeric
| skim_variable | ATM | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Cash | ATM1 | 3 | 0.99 | 83.89 | 36.66 | 1.00 | 73.00 | 91.00 | 108.00 | 180.00 |
| Cash | ATM2 | 2 | 0.99 | 62.58 | 38.90 | 0.00 | 25.50 | 67.00 | 93.00 | 147.00 |
| Cash | ATM3 | 0 | 1.00 | 0.72 | 7.94 | 0.00 | 0.00 | 0.00 | 0.00 | 96.00 |
| Cash | ATM4 | 0 | 1.00 | 474.04 | 650.94 | 1.56 | 124.33 | 403.84 | 704.51 | 10919.76 |
| Cash | NA | 14 | 0.00 | NaN | NA | NA | NA | NA | NA | NA |
We used a combination of box plots, line charts, and the `ggtsdisplay()` function to perform our EDA; a sketch of the box-plot step follows, along with our key findings.
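A minimal sketch of the box-plot step, assuming the data are still in the long `DATE`/`ATM`/`Cash` form:

```r
# Hedged sketch of the EDA box plots: one box per ATM makes the ATM 4
# outlier (10,919.76) and ATM 3's near-zero activity easy to spot.
library(ggplot2)
library(tidyr)

atm %>%
  drop_na(ATM, Cash) %>%                # ignore unlabeled/missing rows for the plot
  ggplot(aes(x = ATM, y = Cash)) +
  geom_boxplot() +
  labs(title = "Cash Withdrawals by ATM", x = NULL, y = "Cash (hundreds)")
```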
We will address the missing data and outliers in our pre-processing section.
In this section we address the outliers and missing data. Our strategy is to impute each missing value with the average cash withdrawal for that day of the week, computed separately for each ATM.
The dplyr script below calculates weekday averages by ATM and then replaces NA values with the appropriate weekday average. The `skim()` function is run a second time to verify that the preprocessing succeeded: it confirms that there are no remaining missing values and that the outlier for ATM 4 has been imputed. We are now ready to move to the Modeling section.
A plot of the pre-processed time series is set forth below.
atm <- atm %>%
  mutate(wday = wday(DATE, label = TRUE, abbr = TRUE)) %>%   # add day-of-week label
  mutate(Cash = if_else(Cash > 9000, NA_real_, Cash)) %>%    # treat the ATM 4 outlier as missing
  group_by(ATM, wday) %>%
  mutate(wday_avg = mean(Cash, na.rm = TRUE)) %>%            # weekday average per ATM
  arrange(ATM, DATE) %>%
  mutate(Cash = if_else(is.na(Cash), wday_avg, Cash)) %>%    # impute NAs with the weekday average
  ungroup() %>%
  select(DATE, ATM, Cash)

| Name                   | atm_skim2 |
|------------------------|-----------|
| Number of rows         | 1474      |
| Number of columns      | 3         |
| Column type frequency: |           |
| Date                   | 1         |
| numeric                | 1         |
| Group variables        | ATM       |
Variable type: Date
| skim_variable | ATM | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| DATE | ATM1 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM2 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM3 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM4 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | NA | 0 | 1 | 2010-05-01 | 2010-05-14 | 2010-05-07 | 14 |
Variable type: numeric
| skim_variable | ATM | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Cash | ATM1 | 0 | 1 | 83.94 | 36.51 | 1.00 | 73.00 | 91.00 | 108.00 | 180.00 |
| Cash | ATM2 | 0 | 1 | 62.43 | 38.85 | 0.00 | 25.53 | 66.00 | 93.00 | 147.00 |
| Cash | ATM3 | 0 | 1 | 0.72 | 7.94 | 0.00 | 0.00 | 0.00 | 0.00 | 96.00 |
| Cash | ATM4 | 0 | 1 | 445.35 | 350.90 | 1.56 | 124.33 | 403.84 | 704.19 | 1712.07 |
| Cash | NA | 14 | 0 | NaN | NA | NA | NA | NA | NA | NA |
In the Modeling section we will use STL decomposition, exponential smoothing, and ARIMA to identify the best possible model fit. Based upon my work experience, I have determined not to model ATM 3: both the location and the age (time in service) of an ATM are so important to its activity/usage that modeling ATM 3 as an average of the other ATMs would be unproductive.
After completing our preliminary modeling analysis we will use the `tsCV()` function to help assess our modeling results, leveraging RMSE to pick our preferred model(s). We will model each ATM individually and review the model residuals; a sketch of the cross-validation step follows, and a modeling summary comes after it.
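A minimal sketch of the cross-validation step, assuming the weekly `ts` object `atm` built in the forecasting code further below; the helper `arima_fc()` is hypothetical and uses the ATM 1 ARIMA specification purely for illustration:

```r
# Hedged sketch: rolling-origin CV errors for one candidate model via tsCV().
library(forecast)

arima_fc <- function(y, h) {            # hypothetical helper, not the author's code
  forecast(Arima(y, order = c(0, 0, 2), seasonal = c(0, 1, 1)), h = h)
}

e <- tsCV(atm[, 1], arima_fc, h = 1)    # one-step forecast error at every origin
sqrt(mean(e^2, na.rm = TRUE))           # RMSE used to rank candidate models
```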
We used STL decomposition (the `stl()` function) to model ATMs 1, 2, and 4. An analysis of the ACF and PACF charts indicated that 7 (a weekly period) was the appropriate seasonal frequency. Decomposition plots for each of the modeled ATMs are set forth below.
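The forecasts behind the STL column of the RMSE comparison are not printed; a plausible sketch, assuming `forecast::stlf()` with the same weekly window, follows:

```r
# Hedged sketch of an STL-based forecast for ATM 1 (s.window = 7 matches the
# decomposition plots; h = 31 covers the May 2010 horizon).
atm1_stl_fc <- stlf(atm[, 1], s.window = 7, robust = TRUE, h = 31)
autoplot(atm1_stl_fc)
```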
We used the `ets()` function for our exponential smoothing models, again modeling ATMs 1, 2, and 4. The key parameter settings and the fitted models appear in the output below.
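A sketch reconstructing the printed call for ATM 1; sourcing `lambda_1` from `BoxCox.lambda()` is an assumption (the output reports lambda = 0.2454):

```r
# Hedged sketch of the ETS fit implied by the printed call below.
lambda_1 <- BoxCox.lambda(atm[, 1])     # assumption: where lambda_1 comes from
atm1_ets_fit <- ets(atm[, 1], lambda = lambda_1, biasadj = TRUE)
summary(atm1_ets_fit)                   # prints the ETS(A,N,A) summary below
```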
The `auto.arima()` function was used to produce the ARIMA forecasts. The model selected for each ATM appears in the output below.
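A sketch of the call for ATM 1, reusing `lambda_1` from the sketch above; the exact arguments are assumptions beyond what the output confirms:

```r
# Hedged sketch: automated order selection picks the models reported below,
# e.g. ARIMA(0,0,2)(0,1,1)[7] for ATM 1.
atm1_arima_fit <- auto.arima(atm[, 1], lambda = lambda_1, biasadj = TRUE)
checkresiduals(atm1_arima_fit)          # Ljung-Box output shown below
```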
Model accuracy was analyzed for each ATM using RMSE as our accuracy metric. The results are summarized below.
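How the RMSE rows were assembled is not shown; one hedged possibility, using the training-set `accuracy()` measures of the three ATM 1 fits sketched above (the reported numbers could equally come from the `tsCV()` errors), is:

```r
# Hedged sketch only; object names come from the earlier sketches.
c(STL   = accuracy(atm1_stl_fc)["Training set", "RMSE"],
  ETS   = accuracy(atm1_ets_fit)["Training set", "RMSE"],
  ARIMA = accuracy(atm1_arima_fit)["Training set", "RMSE"])
```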
Detailed forecasting information by ATM is set forth below:
temp = atm %>%
  drop_na() %>%           # drop the 14 unlabeled May 2010 rows
  spread(ATM, Cash)       # wide form: one column per ATM
atm = ts(temp %>% select(-DATE), frequency = 7)   # weekly seasonal period

ggtsdisplay(atm[,1], main = "Withdrawals", ylab = "Withdrawals (in hundreds)", xlab = "Week")

ggsubseriesplot(atm[,1]) +
labs(title = "Withdrawals", subtitle ="May 2009 - April 2010",
x = "Days of the Week") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))

atm1_stl_fit <- atm[,1] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "ATM 1 - STL Decomposition", x = "Week")
atm1_stl_fit

## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda_1, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.2454
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.353
##
## Initial states:
## l = 7.5346
## s = -4.5312 0.8387 0.2145 0.8725 1.1761 0.925
## 0.5045
##
## sigma: 1.2737
##
## AIC AICc BIC
## 2340.975 2341.596 2379.973
## Series: .
## ARIMA(0,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2453825
##
## Coefficients:
## ma1 ma2 sma1
## 0.1001 -0.1135 -0.6455
## s.e. 0.0524 0.0527 0.0426
##
## sigma^2 estimated as 1.584: log likelihood=-590.73
## AIC=1189.46 AICc=1189.58 BIC=1204.98
## STL ETS ARIMA
## RMSE: 51.44506 43.61599 37.14315
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 7.9435, df = 11, p-value = 0.7184
##
## Model df: 3. Total lags used: 14
ggsubseriesplot(atm[,2]) +
labs(title = "Withdrawals", subtitle ="May 2009 - April 2010",
x = "Days of the Week") +
scale_y_continuous("Cash Withdrawals",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))

atm2_stl_fit <- atm[,2] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "ATM 2 - STL Decomposition", x = "Week")
atm2_stl_fit

## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda_2, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.6936
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.3967
##
## Initial states:
## l = 25.4405
## s = -13.7016 -10.3688 6.5125 -1.8726 0.3309 10.0456
## 9.054
##
## sigma: 7.4538
##
## AIC AICc BIC
## 3630.720 3631.342 3669.719
## Series: .
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.6935866
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4209 -0.9036 0.4633 0.7935 -0.7169
## s.e. 0.0589 0.0458 0.0888 0.0601 0.0416
##
## sigma^2 estimated as 52.2: log likelihood=-1215.47
## AIC=2442.94 AICc=2443.18 BIC=2466.23
## STL ETS ARIMA
## RMSE: 58.94468 59.40853 46.07943
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,3)(0,1,1)[7] with drift
## Q* = 5.2417, df = 6, p-value = 0.5132
##
## Model df: 8. Total lags used: 14
ggsubseriesplot(atm[,4]) +
labs(title = "ATM 4 Withdrawals", subtitle ="May 2009 - April 2010",
x = "Days of the Week") +
scale_y_continuous("Cash Withdrawals",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))

atm4_stl_fit <- atm[,4] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "ATM 4 - Decomposition", x = "Week")
atm4_stl_fit

## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda_4, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.4533
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.1008
##
## Initial states:
## l = 29.2323
## s = -17.9705 -2.5336 0.6625 4.6494 6.2229 4.8631
## 4.1063
##
## sigma: 13.1943
##
## AIC AICc BIC
## 4047.594 4048.215 4086.593
## Series: .
## ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.4533182
##
## Coefficients:
## ma1 sar1 sar2 mean
## 0.0765 0.2090 0.1990 29.1078
## s.e. 0.0529 0.0517 0.0525 1.2587
##
## sigma^2 estimated as 183.8: log likelihood=-1467.94
## AIC=2945.87 AICc=2946.04 BIC=2965.37
## STL ETS ARIMA
## RMSE: 350.7567 329.0905 287.1895
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Q* = 12.392, df = 10, p-value = 0.2596
##
## Model df: 4. Total lags used: 14
atm_output = function(week){
  # Split each series into train/test at day 3 of the given week
  train1 = window(atm[,1], end = c(week, 3))
  train2 = window(atm[,2], end = c(week, 3))
  train4 = window(atm[,4], end = c(week, 3))
  test1 = window(atm[,1], start = c(week, 4))
  test2 = window(atm[,2], start = c(week, 4))
  test4 = window(atm[,4], start = c(week, 4))
  # Refit the preferred ARIMA specification for each ATM on the training window
  arima.model1 = train1 %>% Arima(order = c(0,0,2), seasonal = c(0,1,1), lambda = lambda_1, biasadj = TRUE)
  arima.model2 = train2 %>% Arima(order = c(3,0,3), seasonal = c(0,1,1), include.drift = TRUE, lambda = lambda_2, biasadj = TRUE)
  arima.model4 = train4 %>% Arima(order = c(0,0,1), seasonal = c(2,0,0), include.mean = TRUE, lambda = lambda_4, biasadj = TRUE)
  # Create the 31-day point forecasts, re-indexed as days 366-396
  ATM_1_Fcst = forecast(arima.model1, h = 31)$mean %>% ts(start = 366, end = 396)
  ATM_2_Fcst = forecast(arima.model2, h = 31)$mean %>% ts(start = 366, end = 396)
  ATM_4_Fcst = forecast(arima.model4, h = 31)$mean %>% ts(start = 366, end = 396)
  # Assemble the May 2010 forecast table
  DATE <- seq(as.Date('2010-05-01'), as.Date('2010-05-31'), 1)
  ATM1 <- c(ATM_1_Fcst)
  ATM2 <- c(ATM_2_Fcst)
  ATM4 <- c(ATM_4_Fcst)
  df <- data.frame(DATE, ATM1, ATM2, ATM4)
  df   # return the forecast data frame
}
myATMFcst <- atm_output(44)
myATMFcst

## DATE ATM1 ATM2 ATM4
## 1 2010-05-01 88.03248 43.73438 513.6667
## 2 2010-05-02 70.27660 29.52705 386.9937
## 3 2010-05-03 76.01683 53.06357 481.4655
## 4 2010-05-04 37.14118 24.18667 661.3958
## 5 2010-05-05 109.88323 42.56347 453.9731
## 6 2010-05-06 101.69941 60.56368 412.7631
## 7 2010-05-07 101.80257 64.93641 578.4927
## 8 2010-05-08 84.34108 38.51018 486.2995
## 9 2010-05-09 74.09145 41.57788 351.5421
## 10 2010-05-10 76.12534 53.16685 442.8794
## 11 2010-05-11 37.21530 17.44156 587.5181
## 12 2010-05-12 110.01498 49.47473 425.1071
## 13 2010-05-13 101.82591 67.45401 451.5255
## 14 2010-05-14 101.92913 56.10449 580.6077
## 15 2010-05-15 84.45562 38.62484 472.1165
## 16 2010-05-16 74.19842 49.50073 424.2479
## 17 2010-05-17 76.23384 49.41827 458.5358
## 18 2010-05-18 37.28942 14.80928 511.2671
## 19 2010-05-19 110.14673 55.76500 450.7936
## 20 2010-05-20 101.95240 68.48927 450.1528
## 21 2010-05-21 102.05569 50.24438 499.2524
## 22 2010-05-22 84.57017 40.80389 464.5069
## 23 2010-05-23 74.30538 53.25603 433.7026
## 24 2010-05-24 76.34235 45.28566 455.3553
## 25 2010-05-25 37.36355 14.22772 486.2518
## 26 2010-05-26 110.27847 60.00312 451.0631
## 27 2010-05-27 102.07890 66.78616 454.9031
## 28 2010-05-28 102.18226 47.09690 482.9237
## 29 2010-05-29 84.68471 43.17946 460.4860
## 30 2010-05-30 74.41235 54.09822 446.9201
## 31 2010-05-31 76.45086 42.13277 456.6020