Data 624 - Project 1

Part A

Assignment Question

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also, please submit the forecast which you will put in an Excel readable file.

Data Management

data_raw <- read_excel ('./data/ATM624Data.xlsx', col_names = TRUE,
                        col_types = c('date','text','numeric'))%>%
  rename(date = DATE, atm = ATM, cash = Cash)%>%
  mutate(date = as.Date(date, origin = "1899-12-30"))

summary(data_raw)

##       date                atm                 cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

After loading the data, we see there are a number of NA’s which need to be dealt with. Generally, dropping the cases is not advisable as the time series should have all the dates. In this case I have adjust that logic as not only the amount of cash is missing but also which ATM the cash is in is missing. Since I can not imply the location it does not make sense to do math manipulation for missing ‘cash’ values.

I will perform the following process:

I will drop any incomplete case on the ‘ATM’ row.
- The N/A’s in the ‘ATM’ column are at the end and look like they are at the start of the month.
- It should leave me with data from 2009-05-01 to 2010-04-30.
Any remaining NA’s in the ‘CASH’ column will have a k-nearest neighbor moving average applied.
- The data is in hundreds of dollars and does not have any decimal points, any decimals generated by this process will be rounded to zero.

atm <- data_raw
atm <- data_raw[complete.cases(data_raw),]
atm$cash <- imputeTS::na_ma(atm$cash, k=5, weighting = "simple")

# as the data is in hundreds of dollars &
atm <- atm %>%
  mutate(across(where(is.numeric), round, 0))

atm <- atm%>%
  spread(atm, cash)

atm_ts <- ts(atm %>% select(-date))
#str(atm_ts)

autoplot(atm_ts)+
  labs(title = "Cash per all ATMs", subtitle = "By week")+
  scale_y_continuous("Withdrawn in Hundreds")

Huge outlier in ATM4 will need processing.
No further outliers are present.

ATM 1

atm1 <- atm_ts[,1]
atm1<-ts(atm1, frequency = 7)
ggtsdisplay(atm1, points = FALSE, plot.type = "histogram")

## Warning: Removed 3 rows containing non-finite values (stat_bin).

Seasonality in the data
Huge lag spikes, will need to understand better after the Box-Cox transform

ATM 2

atm2 <- atm_ts[,2]
atm2<-ts(atm2, frequency = 7)
ggtsdisplay(atm2, points = FALSE, plot.type = "histogram")

## Warning: Removed 2 rows containing non-finite values (stat_bin).

Seasonality in the data
Huge lag spikes, will need to understand better after the Box-Cox transform

ATM 3

atm3 <- atm_ts[,3]
atm3[which(atm3 == 0)]<-NA
atm3<-ts(atm3, frequency = 7)
ggtsdisplay(atm3, points = FALSE, plot.type = "histogram")

## Warning: Removed 362 rows containing non-finite values (stat_bin).

Only 3 observations in the data
- Not sure if there is enough data to accurately predict the needed amount of cash.
May need to use median for prediction

ATM 4

atm4 <- atm_ts[,4]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)
atm4<-ts(atm4, frequency = 7)
ggtsdisplay(atm4, points = FALSE, plot.type = "histogram")

With the outlier brought back down using median the data is in a good stage to move forward
Not sure if we are seeing seasonality

Creating Forecast models

All Models will start with a Box-Cox transform as it is the simplest transform method. Each will be evaluated to see if the transform is beneficial, gets removed, or if other steps are required.

ATM1

atm1_lambda <- BoxCox.lambda(atm1)
ggtsdisplay(atm1  %>% BoxCox(atm1_lambda))

## Warning: Removed 3 rows containing missing values (geom_point).

ur.kpss(atm1  %>% BoxCox(atm1_lambda)) %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.3929 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

We still see regular lag spikes at 7, 14, and 21

ggtsdisplay(diff(atm1, 7), points = FALSE)

ur.kpss(diff(atm1, 7)) %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0259 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Setting the differential at 7 to account for the regular spikes seems to do the trick and reduce the test statistic

ggtsdisplay(atm1  %>% BoxCox(atm1_lambda)%>% diff(7))

## Warning: Removed 6 rows containing missing values (geom_point).

ur.kpss(atm1  %>% BoxCox(atm1_lambda) %>% diff(7)) %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.016 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Success with a Box-Cox and Diff of 7
This leads to an Arima model of 2,0,2 with seasonally of 0,1,1

atm_model1 <- Arima(atm1, order=c(2,0,2), seasonal = c(0,1,1), lambda = atm1_lambda)
summary(atm_model1)

## Series: atm1 
## ARIMA(2,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2081476 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.5362  -0.8226  0.6000  0.7602  -0.6945
## s.e.   0.1342   0.1455  0.1418  0.1770   0.0433
## 
## sigma^2 estimated as 1.244:  log likelihood=-542.33
## AIC=1096.67   AICc=1096.9   BIC=1119.95
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 2.522885 24.94303 16.01159 -85.35837 104.6165 0.9058306
##                    ACF1
## Training set 0.06631228

checkresiduals(atm_model1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 7.3162, df = 9, p-value = 0.6042
## 
## Model df: 5.   Total lags used: 14

atm1_check <- auto.arima(atm1, approximation = FALSE, lambda = atm1_lambda)
summary(atm1_check)

## Series: atm1 
## ARIMA(0,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2081476 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.0989  -0.1107  -0.6482
## s.e.  0.0532   0.0527   0.0425
## 
## sigma^2 estimated as 1.247:  log likelihood=-543.68
## AIC=1095.37   AICc=1095.48   BIC=1110.89
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 2.493867 25.19797 16.12913 -88.12706 107.3975 0.9124807
##                    ACF1
## Training set 0.01895515

checkresiduals(atm1_check)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 8.5811, df = 11, p-value = 0.6605
## 
## Model df: 3.   Total lags used: 14

The Auto Arima model came up with a (0,0,2)(0,1,1)[7]
I chose (2,0,2)(0,1,1)[7]
According to the residuals, both models represent the data well as the residuals are not distinguishable from white noise.
The AICc is lower for the auto model and the P-Value is lower for the manual
I will keep the manual model

ATM2

No Transforms

ggtsdisplay(atm2, points = FALSE, plot.type = "histogram")

## Warning: Removed 2 rows containing non-finite values (stat_bin).

ur.kpss(atm2) %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 2.1698 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Box-Cox Transform

atm2_lambda <- BoxCox.lambda(atm2)

ggtsdisplay(atm2  %>% BoxCox(atm2_lambda))

## Warning: Removed 2 rows containing missing values (geom_point).

ur.kpss(atm2 %>% BoxCox(atm2_lambda)) %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 2.1556 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The Box-Cox transform did not change the data much
- Perhaps I should start with a general diff()

ndiffs(atm2)

## [1] 1

ggtsdisplay(diff(atm2, 7), points = FALSE)

ur.kpss(diff(atm2, 7)) %>% summary

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0168 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The Lag Spike at 7 and the differential check returning 1 I’ll start my model with a (2,1,2)(0,1,1)

atm_model2<- Arima(atm2, order=c(2,1,2), seasonal = c(0,1,1), lambda = atm2_lambda)
summary(atm_model2)

## Series: atm2 
## ARIMA(2,1,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.7164431 
## 
## Coefficients:
##           ar1      ar2      ma1      ma2     sma1
##       -0.0359  -0.1078  -0.9285  -0.0715  -0.6228
## s.e.   0.2651   0.0569   0.2633   0.2631   0.0450
## 
## sigma^2 estimated as 65.42:  log likelihood=-1249.21
## AIC=2510.42   AICc=2510.66   BIC=2533.69
## 
## Training set error measures:
##                     ME    RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.6080112 24.9517 17.33436 -Inf  Inf 0.8461616 -0.01891811

checkresiduals(atm_model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)(0,1,1)[7]
## Q* = 26.39, df = 9, p-value = 0.001763
## 
## Model df: 5.   Total lags used: 14

atm2_check <- auto.arima(atm2, approximation = FALSE, lambda = atm2_lambda)
summary(atm2_check)

## Series: atm2 
## ARIMA(2,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.7164431 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4188  -0.9019  0.4587  0.7886  -0.7187
## s.e.   0.0587   0.0471  0.0884  0.0619   0.0414
## 
## sigma^2 estimated as 62.19:  log likelihood=-1240.08
## AIC=2492.16   AICc=2492.4   BIC=2515.45
## 
## Training set error measures:
##                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf  Inf 0.8250038 -0.01464052

checkresiduals(atm2_check)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
## 
## Model df: 5.   Total lags used: 14

The Auto Arima model came up with a (2,0,2)(0,1,1)[7]
I chose (2,1,2)(0,1,1)[7]
My model has MUCH lower AIC and AICc; slightly larger but not significant P value.
I will still stay with my model

ATM3

I had a hard time figuring out what to do with this model. There is only 3 observations in the dataset. After doing research, I chose a random walk model vs an average model using the non-zero observations.

#atm3_lambda <- BoxCox.lambda(atm3) # lambda for ATM3 will not be required

atm_model3 <- rwf(atm3, h=31)

ATM4

atm4_lambda <- BoxCox.lambda(atm4)

ggtsdisplay(atm4, points = FALSE, plot.type = "histogram")

ndiffs(atm4)

## [1] 0

ggtsdisplay(atm4  %>% diff(7))

The data seems different than ATM 1&2.
- Not sure if we need to go as extreme on this one.
- The plot seems to be already at the white noise level
- I will try (1,0,0)(0,1,0)

atm_model4 <- Arima(atm4, order=c(1,0,0), seasonal = c(0,1,0), lambda = atm4_lambda)
summary(atm_model4)

## Series: atm4 
## ARIMA(1,0,0)(0,1,0)[7] 
## Box Cox transformation: lambda= 0.4524377 
## 
## Coefficients:
##          ar1
##       0.0440
## s.e.  0.0528
## 
## sigma^2 estimated as 297.2:  log likelihood=-1526.79
## AIC=3057.58   AICc=3057.61   BIC=3065.34
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -3.622377 447.8441 339.2516 -371.1099 425.4804 0.9793969
##                    ACF1
## Training set 0.01466724

checkresiduals(atm_model4)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,0)[7]
## Q* = 120.15, df = 13, p-value < 2.2e-16
## 
## Model df: 1.   Total lags used: 14

atm4_check <- auto.arima(atm4, approximation = FALSE, lambda = atm4_lambda)
summary(atm4_check)

## Series: atm4 
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.4524377 
## 
## Coefficients:
##          ar1    sar1    sar2     mean
##       0.0770  0.2091  0.1995  28.9903
## s.e.  0.0526  0.0516  0.0525   1.2615
## 
## sigma^2 estimated as 181.9:  log likelihood=-1466.06
## AIC=2942.11   AICc=2942.28   BIC=2961.61
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 84.40696 351.7723 274.2572 -354.4169 398.8866 0.7917625
##                    ACF1
## Training set 0.01970019

checkresiduals(atm4_check)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.867, df = 10, p-value = 0.1035
## 
## Model df: 4.   Total lags used: 14

I will go with the Auto model for this one. The AIC and AICc are lower.
- Additionally the P-value from mine is significant
- The Mean Error is negative, the RMSE is higher

# imputing the Auto Arima into the list
atm_model5 <- Arima(atm4, order=c(1,0,0), seasonal = c(2,0,0), lambda = atm4_lambda)

Forecasting

atm1_forecast <- forecast(atm_model1, 31, level = 95)
autoplot(atm1_forecast) + 
    labs(title = "ATM1: ARIMA(2,0,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none")

atm2_forecast <- forecast(atm_model2, 31, level = 95)
autoplot(atm2_forecast) + 
    labs(title = "ATM2: ARIMA(2,1,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none")

autoplot(atm_model3) + 
    labs(title = "ATM3: Random Walk", x = "Week", y = NULL) +
    theme(legend.position = "none")

## Warning: Removed 362 row(s) containing missing values (geom_path).

atm3_forecast <- forecast(atm_model3)

atm4_forecast <- forecast(atm_model5, 31, level = 95)
autoplot(atm2_forecast) + 
    labs(title = "ATM4: ARIMA(1,0,0)(2,0,0)", x = "Week", y = NULL) +
    theme(legend.position = "none")

export <- rbind(atm1_forecast$mean, atm2_forecast$mean, atm3_forecast$mean, atm4_forecast$mean)
write.csv(export, "ATM Forecasts.csv")