Part A – ATM Forecast

Instruction

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Data Exploration

I will begin by plotting the cash withdrawn by each ATM.

#Read data
if (!file.exists("ATM624Data.xlsx")) {download.file("https://github.com/isaram/CUNY-SPS/raw/master/DATA624/ATM624Data.xlsx", "ATM624Data.xlsx")}
atm <- read_excel("ATM624Data.xlsx") %>% mutate(DATE = as.Date(DATE, origin = "1899-12-30"), ATM = as.factor(ATM)) %>% data.frame()
atm %>% ggplot(aes(DATE, Cash, color = ATM)) + geom_line() + scale_color_brewer(palette = "Set2") + scale_y_continuous(labels = comma) + facet_wrap(.~ATM, ncol = 2)

The ATM that is utilized at a higher rate than all of the others is ATM #4. There appear to be outliers and missing values in the data. If data from ATM #4 is removed from the dataset the visualization would show a different picture.

#ATM4 removed
atm %>% filter(ATM != "ATM4") %>% filter(!is.na(ATM)) %>% ggplot(aes(DATE, Cash, color = ATM)) + geom_line() + facet_wrap(.~ATM, ncol = 2) + scale_color_brewer(palette = "Set2")

Based on the filtered visualization, it appears that ATM #3 only recently utilized and ATM #1 and #2 appear to be similar to one another with no obvious trend.

#Data plot
atm %>% na.omit() %>% ggplot(aes(ATM, Cash, color = ATM)) + geom_boxplot() + scale_color_brewer(palette = "Set2") + facet_wrap(.~ATM, nrow = 2, scales = "free") 

atm %>% filter(!is.na(ATM)) %>% group_by(ATM) %>% summarise(Minimum = min(Cash, na.rm = T),`1st Qu.` = quantile(Cash, .25, na.rm = T),Mean = mean(Cash, na.rm = T),Median = median(Cash, na.rm = T),`3rd Qu.` = quantile(Cash, .75, na.rm = T), Maximum = max(Cash, na.rm = T),`NA's` = sum(is.na(Cash))) %>%
kable() %>% kable_styling()
ATM Minimum 1st Qu. Mean Median 3rd Qu. Maximum NA’s
ATM1 1.00000 73.0000 83.8867403 91.0000 108.000 180.00 3
ATM2 0.00000 25.5000 62.5785124 67.0000 93.000 147.00 2
ATM3 0.00000 0.0000 0.7205479 0.0000 0.000 96.00 0
ATM4 1.56326 124.3344 474.0433449 403.8393 704.507 10919.76 0

There appears to be a pattern in the time series however fluctuations are very limted. Perhaps there might be a patten to be found by further examining the day of week.

atm %>% na.omit() %>% mutate(`Day of Week` = recode(as.factor(wday(DATE)), "1" = "Sunday", "2" = "Monday", "3" = "Tuesday", "4" = "Wednesday", "5" = "Thursday", "6" = "Friday", "7" = "Saturday"))  %>% ggplot(aes(`Day of Week`, Cash, color = `Day of Week`)) + geom_boxplot() + facet_grid(ATM ~ `Day of Week`, scales = "free") + scale_color_brewer(palette = "Set2") 

There appears to be fluctuations based on the day of the week. There are many variations between the ATMs in which each will be modeled seperately.The data will be cleansed and modeling techniques will be explored. A cross-validation on the models will be examined to arrive at the RMSE. The optimal model would be one that minimizes the RMSE.

Data Visualization and Modeling

ATM #1

#Remove missing observations
atm1 <- atm %>% filter(ATM == "ATM1")
atm1$clean_cash <- c(tsclean(ts(atm1$Cash, start = decimal_date(as.Date(min(atm1$DATE))), frequency = 365)))
#Create time series objects
ggtsplot <- function(ts, title) {grid.arrange(autoplot(ts) +ggtitle(title) +scale_y_continuous(labels = comma) +theme(axis.title = element_blank()),grid.arrange(ggAcf(ts) + ggtitle(element_blank()),ggPacf(ts) + ggtitle(element_blank()), ncol = 2), nrow = 2)}
#Frequency of seven to represent days in a weeks
atm1_ts <- ts(atm1$clean_cash, start = decimal_date(as.Date(min(atm1$DATE))), frequency = 7)
ggtsplot(atm1_ts, "ATM #1")

Based on the results, there appears to be a negative correlation between the 1st and 3rd/5th. There are repeating peaks on every 7th observation.

#STL Decomposition Model
atm1_ts %>% stl(s.window = 7, robust = TRUE) %>% autoplot()

#STL Decomposition using STL and ETS
h <- 31
atm1_stl_ets_fit <- atm1_ts %>% stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm1_stl_ets_fit)


    Ljung-Box test

data:  Residuals from STL +  ETS(A,N,N)
Q* = 13.759, df = 12, p-value = 0.3164

Model df: 2.   Total lags used: 14
# STL Decomposition using STL and ARIMA
atm1_stl_arima_fit <- atm1_ts %>% stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm1_stl_arima_fit)


    Ljung-Box test

data:  Residuals from STL +  ARIMA(0,1,2)
Q* = 8.4479, df = 12, p-value = 0.7492

Model df: 2.   Total lags used: 14
#Holt-Winters (HW)
atm1_hw_fit <- hw(atm1_ts, h = h)
checkresiduals(atm1_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 28.877, df = 3, p-value = 0.000002376

Model df: 11.   Total lags used: 14
#Holt-Winters(HW) with Box Cox Adjustment
atm1_lambda <- BoxCox.lambda(atm1_ts)
atm1_adj_hw_fit <- hw(atm1_ts, h = h, lambda = atm1_lambda)
checkresiduals(atm1_adj_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.824, df = 3, p-value = 0.0002973

Model df: 11.   Total lags used: 14
#ARIMA
atm1_arima_fit <- auto.arima(atm1_ts)
checkresiduals(atm1_arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,1)(0,1,2)[7]
Q* = 15.247, df = 11, p-value = 0.1715

Model df: 3.   Total lags used: 14

All methods appear to be viable candidates, as they performed well and should be considered for the cross-validation stage. In order predicting the sample data the tsCV function will be used to evaluate the models. The objective would be to minimize the RMSE by obtaining the errors from the cross validation process to compute the RMSE.

# Cross Validation
get_rmse <- function(e) {sqrt(mean(e^2, na.rm = TRUE))}
atm1_arima_forecast <- function(x, h) {forecast(Arima(x, order = c(0, 0, 1), seasonal = c(0, 1, 2)), h = h)}
stl_ets_errors <- tsCV(atm1_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm1_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm1_ts, hw, h = h)
adj_hw_errors <- tsCV(atm1_ts, hw, h = h, lambda = atm1_lambda)
arima_errors <- tsCV(atm1_ts, atm1_arima_forecast, h = h)
data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"), RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>% arrange(RMSE) %>% kable() %>% kable_styling()
Model RMSE
ARIMA 29.67002
STL ARIMA 31.80636
STL ETS 31.81959
Adjusted Holt-Winters 34.30067
Holt-Winters 39.49958

Based on the cross validation, it appears that the ARIMA model was the ideal method of job predicting the sample data set which will be used to forecast ATM#1.

ATM #2

#Remove missing observations
atm2 <- atm %>% filter(ATM == "ATM2")
atm2$clean_cash <- c(tsclean(ts(atm2$Cash, start = decimal_date(as.Date(min(atm2$DATE))), frequency = 365)))
atm2_ts <- ts(atm2$clean_cash, start = decimal_date(as.Date(min(atm2$DATE))), frequency = 7)
ggtsplot(atm2_ts, "ATM #2")

#STL Decomposition Model
atm2_ts %>% stl(s.window = 7, robust = TRUE) %>% autoplot()

#STL Decomposition using STL and ETS
atm2_stl_ets_fit <- atm2_ts %>% stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm2_stl_ets_fit)


    Ljung-Box test

data:  Residuals from STL +  ETS(A,N,N)
Q* = 9.2372, df = 12, p-value = 0.6825

Model df: 2.   Total lags used: 14
#STL Decomposition using STL and ARIMA
atm2_stl_arima_fit <- atm2_ts %>% stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm2_stl_arima_fit)


    Ljung-Box test

data:  Residuals from STL +  ARIMA(2,1,2)
Q* = 7.4932, df = 10, p-value = 0.6782

Model df: 4.   Total lags used: 14
#Holt-Winters
atm2_hw_fit <- hw(atm2_ts, h = h)
checkresiduals(atm2_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 37.435, df = 3, p-value = 0.00000003722

Model df: 11.   Total lags used: 14
#Holt-Winters with Box Cox Adjustment
atm2_lambda <- BoxCox.lambda(atm2_ts)
atm2_adj_hw_fit <- hw(atm2_ts, h = h, lambda = atm2_lambda)
checkresiduals(atm2_adj_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 34.776, df = 3, p-value = 0.0000001358

Model df: 11.   Total lags used: 14
#ARIMA
atm2_arima_fit <- auto.arima(atm2_ts)
checkresiduals(atm2_arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
Q* = 10.231, df = 9, p-value = 0.3321

Model df: 5.   Total lags used: 14

All methods appear to be viable candidates, as they performed well and should be considered for the cross-validation stage. In order predicting the sample data, the tsCV function and evaluate the models.

#Cross Validation
atm2_arima_forecast <- function(x, h) {forecast(Arima(x, order = c(2, 0, 2), seasonal = c(0, 1, 1)), h = h)}
stl_ets_errors <- tsCV(atm2_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm2_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm2_ts, hw, h = h)
adj_hw_errors <- tsCV(atm2_ts, hw, h = h, lambda = atm2_lambda)
arima_errors <- tsCV(atm2_ts, atm2_arima_forecast, h = h)
data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>% arrange(RMSE) %>% kable() %>% kable_styling()
Model RMSE
ARIMA 33.52154
STL ARIMA 34.17962
STL ETS 34.96679
Adjusted Holt-Winters 41.53970
Holt-Winters 55.33134

Based on the cross validation, it appears that the ARIMA model was the ideal method of job predicting the sample data set which will be used to forecast ATM#2.

ATM #3

This ATM is unique and the challenge with the visual is that there are only three data points therefore the average will be used.

#Data Cleaning
atm %>% filter(ATM == "ATM3") %>% mutate(nonzero = if_else(Cash == 0, "No", "Yes")) %>% ggplot(aes(DATE, Cash, color = nonzero)) + geom_point() + scale_color_brewer(palette = "Set2") 

#Filter
atm3 <- atm %>% filter(ATM == "ATM3", Cash > 0)
atm3_mean <- mean(atm3$Cash)

ATM #4

#Remove missing observations
atm4 <- atm %>% filter(ATM == "ATM4")
atm4$clean_cash <- c(tsclean(ts(atm4$Cash, start = decimal_date(as.Date(min(atm4$DATE))), frequency = 365)))
atm4_ts <- ts(atm4$clean_cash, start = decimal_date(as.Date(min(atm4$DATE))), frequency = 7)
ggtsplot(atm4_ts, "ATM #4")

# STL Decomposition Model
atm2_ts %>% stl(s.window = 7, robust = TRUE) %>% autoplot()

#STL Decompositon Model using STL and ETS
atm4_stl_ets_fit <- atm4_ts %>% stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm4_stl_ets_fit)


    Ljung-Box test

data:  Residuals from STL +  ETS(A,N,N)
Q* = 30.61, df = 12, p-value = 0.002258

Model df: 2.   Total lags used: 14
#STL Decompositon Model using STL and ARIMA
atm4_stl_arima_fit <- atm4_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm4_stl_arima_fit)


    Ljung-Box test

data:  Residuals from STL +  ARIMA(1,0,1) with non-zero mean
Q* = 21.78, df = 11, p-value = 0.02614

Model df: 3.   Total lags used: 14
#Holt-Winters
atm4_hw_fit <- hw(atm4_ts, h = h)
checkresiduals(atm4_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.279, df = 3, p-value = 0.0003852

Model df: 11.   Total lags used: 14
#Holt-Winters with Box Cox Adjustment
atm4_lambda <- BoxCox.lambda(atm4_ts)
atm4_adj_hw_fit <- hw(atm4_ts, h = h, lambda = atm4_lambda)
checkresiduals(atm4_adj_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.934, df = 3, p-value = 0.0002822

Model df: 11.   Total lags used: 14
#ARIMA
atm4_arima_fit <- auto.arima(atm4_ts)
checkresiduals(atm4_arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,3)(1,0,0)[7] with non-zero mean
Q* = 16.96, df = 9, p-value = 0.04934

Model df: 5.   Total lags used: 14
#Cross Validation
atm4_arima_forecast <- function(x, h) {forecast(Arima(x, order = c(0, 0, 3), seasonal = c(1, 0, 0)), h = h)}
stl_ets_errors <- tsCV(atm4_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm4_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm4_ts, hw, h = h)
adj_hw_errors <- tsCV(atm4_ts, hw, h = h, lambda = atm4_lambda)
arima_errors <- tsCV(atm4_ts, atm4_arima_forecast, h = h)
data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"), RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>% arrange(RMSE) %>% kable() %>% kable_styling()
Model RMSE
ARIMA 346.7434
STL ARIMA 353.7169
STL ETS 366.0626
Adjusted Holt-Winters 388.4804
Holt-Winters 504.2440

The HW model with Box Cox Adjustment preformed very well and the ACF spikes are outside the bands. The STL models did not perform as well as the ARIMA. All models will be cross validated and the tsCV function will be used to evaluate the models to minimize the RMSE. The ARIMA model will be used to forecast the projections of ATM #4.

Conclusion

After testing multiple approaches using cross-validation, the ARIMA models was selected for ATM #1, #2 and #4, as it had the lowest RMSE.

Part B – Forecasting Power

Instruction

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Data Exploration

#Read Data
if (!file.exists("ResidentialCustomerForecastLoad-624.xlsx")) {download.file("https://github.com/isaram/CUNY-SPS/raw/master/DATA624/ResidentialCustomerForecastLoad-624.xlsx", "ResidentialCustomerForecastLoad-624.xlsx")}
raw_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
kwh <- raw_data %>% select(KWH) %>% ts(start = decimal_date(date("1998-01-01")), frequency = 12)
autoplot(kwh, main = "Residential Power Usage (KWH)") + scale_y_continuous(labels = comma)

#Summary
summary(kwh) %>% kable() %>% kable_styling()
KWH
Min. : 770523
1st Qu.: 5429912
Median : 6283324
Mean : 6502475
3rd Qu.: 7620524
Max. :10655730
NA’s :1

Data Visualization and Modeling

There appears to be missing data and outlying points which I will clean. To cleanse the data, tsclean function will be used. Once the data is cleaned, the data will be replotted with ACF and PACF.

kwh <- tsclean(kwh)
ggtsplot <- function(ts, title) {grid.arrange(autoplot(ts) + scale_y_continuous(labels = comma) + ggtitle(title) + theme(axis.title = element_blank()),grid.arrange(ggAcf(ts) + ggtitle(element_blank()),ggPacf(ts) + ggtitle(element_blank()), ncol = 2), nrow = 2)}
ggtsplot(kwh, "Cleaned Residential Power Usage (KWH)")

The data has a seasonal trend and a model that displays seasonality will need to be used.The model to be used is ARIMA or Holt-Winters (HW) model.

h <- 12
lambda <- BoxCox.lambda(kwh)
autoplot(kwh) + autolayer(hw(kwh, h = h), series = "Holt-Winters") + autolayer(hw(kwh, h = h, lambda = lambda), series = "Holt-Winters (Box-Cox Adjusted)") + autolayer(hw(kwh, h = h, lambda = lambda, damped = TRUE), series = "Holt-Winters (Damped & Box-Cox Adjusted)") + autolayer(forecast(auto.arima(kwh), h = h), series = "ARIMA") + facet_wrap(. ~ series, ncol = 2) + scale_color_brewer(palette = "Set2") + scale_y_continuous(labels = comma) + theme(legend.position = "none", axis.title = element_blank())

Since both models represent seasonality well, residuals will be explored to determine which model is best.

#Holt-Winters (HW)
hw_fit <- hw(kwh, h = h)
checkresiduals(hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 61.741, df = 8, p-value = 0.0000000002121

Model df: 16.   Total lags used: 24
#Adjusted Holt-Winters Model
adj_hw_fit <- hw(kwh, h = h, lambda = lambda)
checkresiduals(adj_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 48.973, df = 8, p-value = 0.00000006433

Model df: 16.   Total lags used: 24
# ARIMA Model
arima_fit <- auto.arima(kwh)
checkresiduals(arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(3,0,1)(2,1,0)[12] with drift
Q* = 11.549, df = 17, p-value = 0.8266

Model df: 7.   Total lags used: 24

In the Holt-Winters Model, there are spikes in the ACF which would make this a model that is not suitable. The Adjusted Holt-Winters Model is a bit better than the previous model but there are still ACF spikes outside of the bands. The ARIMA model appears to be the best selection as there is one ACF small spike outside of the threshold, which I will confirm using cross validation.

In order to predict the sample data the tsCV function will be used to evaluate the models in the cross-validation stage. The objective would be to minimize the RMSE by obtaining the errors from the cross validation process to compute the RMSE.

#Cross Validation
get_rmse <- function(e) {sqrt(mean(e^2, na.rm = TRUE))}
arima_forecast <- function(x, h) {forecast(Arima(x, order = c(3, 0, 1), seasonal = c(2, 1, 0), include.drift = TRUE), h = h)}
hw_error <- tsCV(kwh, hw, h = h)
adj_hw_error <- tsCV(kwh, hw, h = h, lambda = lambda)
damped_adj_hw_error <- tsCV(kwh, hw, h = h, lambda = lambda, damped = TRUE)
arima_errors <- tsCV(kwh, arima_forecast, h = h)
data.frame(Model = c("ARIMA", "Holt-Winters", "Adjusted Holt-Winters", "Damped & Adjusted Holt-Winters"), RMSE = c(get_rmse(arima_errors[, h]), get_rmse(hw_error[, h]), get_rmse(adj_hw_error[, h]), get_rmse(damped_adj_hw_error[, h]))) %>% arrange(RMSE) %>% kable() %>% kable_styling()
Model RMSE
ARIMA 659529.8
Holt-Winters 1010721.9
Adjusted Holt-Winters 1373617.3
Damped & Adjusted Holt-Winters 1547751.3

The cross validation results confirm that the ARIMA model is the best selection as it minimized the RMSE. This model will be used in this exercise.

Conclusion

#Power usage forecast 
my_forecast <- forecast(arima_fit, h = h)
autoplot(my_forecast) 

my_forecast %>% kable() %>% kable_styling()
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 2014 9433336 8688058 10178615 8293531 10573141
Feb 2014 8570219 7785677 9354761 7370366 9770072
Mar 2014 6615312 5830157 7400467 5414521 7816102
Apr 2014 6005538 5196071 6815004 4767565 7243510
May 2014 5917569 5108101 6727036 4679595 7155543
Jun 2014 8187387 7377114 8997660 6948182 9426592
Jul 2014 9528779 8717809 10339750 8288507 10769052
Aug 2014 9991953 9180963 10802943 8751650 11232255
Sep 2014 8546843 7735715 9357971 7306330 9787356
Oct 2014 5856327 5045196 6667458 4615809 7096845
Nov 2014 6153245 5342114 6964376 4912727 7393763
Dec 2014 7732110 6920971 8543250 6491580 8972641