In Part A, I want you to forecast how much cash is taken out of four different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this more of a business feel. Explain and demonstrate your process, the techniques used and not used, and your actual forecast. I am giving you the data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link, along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
I will begin by plotting the cash withdrawn by each ATM.
```r
#Load libraries and read data
library(tidyverse)
library(readxl)
library(lubridate)
library(forecast)
library(scales)
library(gridExtra)
library(knitr)
library(kableExtra)

if (!file.exists("ATM624Data.xlsx")) {
  download.file("https://github.com/isaram/CUNY-SPS/raw/master/DATA624/ATM624Data.xlsx",
                "ATM624Data.xlsx")
}

atm <- read_excel("ATM624Data.xlsx") %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"),  # convert Excel serial dates
         ATM = as.factor(ATM)) %>%
  data.frame()
```
```r
#Plot cash withdrawals by ATM
atm %>%
  ggplot(aes(DATE, Cash, color = ATM)) +
  geom_line() +
  scale_color_brewer(palette = "Set2") +
  scale_y_continuous(labels = comma) +
  facet_wrap(. ~ ATM, ncol = 2)
```

ATM #4 is utilized at a much higher rate than the other three. There also appear to be outliers and missing values in the data. If ATM #4 is removed from the dataset, the visualization shows a different picture.
```r
#Plot with ATM4 removed
atm %>%
  filter(ATM != "ATM4", !is.na(ATM)) %>%
  ggplot(aes(DATE, Cash, color = ATM)) +
  geom_line() +
  facet_wrap(. ~ ATM, ncol = 2) +
  scale_color_brewer(palette = "Set2")
```

Based on the filtered visualization, ATM #3 appears to have only recently come into use, while ATM #1 and ATM #2 look similar to one another with no obvious trend.
```r
#Boxplots and summary statistics by ATM
atm %>%
  na.omit() %>%
  ggplot(aes(ATM, Cash, color = ATM)) +
  geom_boxplot() +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(. ~ ATM, nrow = 2, scales = "free")

atm %>%
  filter(!is.na(ATM)) %>%
  group_by(ATM) %>%
  summarise(Minimum = min(Cash, na.rm = TRUE),
            `1st Qu.` = quantile(Cash, .25, na.rm = TRUE),
            Mean = mean(Cash, na.rm = TRUE),
            Median = median(Cash, na.rm = TRUE),
            `3rd Qu.` = quantile(Cash, .75, na.rm = TRUE),
            Maximum = max(Cash, na.rm = TRUE),
            `NA's` = sum(is.na(Cash))) %>%
  kable() %>%
  kable_styling()
```

| ATM | Minimum | 1st Qu. | Mean | Median | 3rd Qu. | Maximum | NA's |
|---|---|---|---|---|---|---|---|
| ATM1 | 1.00000 | 73.0000 | 83.8867403 | 91.0000 | 108.000 | 180.00 | 3 |
| ATM2 | 0.00000 | 25.5000 | 62.5785124 | 67.0000 | 93.000 | 147.00 | 2 |
| ATM3 | 0.00000 | 0.0000 | 0.7205479 | 0.0000 | 0.000 | 96.00 | 0 |
| ATM4 | 1.56326 | 124.3344 | 474.0433449 | 403.8393 | 704.507 | 10919.76 | 0 |
There appears to be a pattern in each series, but the fluctuations are very limited. Perhaps a pattern can be found by further examining the day of the week.
```r
#Boxplots by day of week
atm %>%
  na.omit() %>%
  mutate(`Day of Week` = recode(as.factor(wday(DATE)),
                                "1" = "Sunday", "2" = "Monday", "3" = "Tuesday",
                                "4" = "Wednesday", "5" = "Thursday",
                                "6" = "Friday", "7" = "Saturday")) %>%
  ggplot(aes(`Day of Week`, Cash, color = `Day of Week`)) +
  geom_boxplot() +
  facet_grid(ATM ~ `Day of Week`, scales = "free") +
  scale_color_brewer(palette = "Set2")
```

Cash withdrawals do fluctuate by day of the week, and there is enough variation between the ATMs that each will be modeled separately. The data will be cleaned and several modeling techniques explored, then cross-validation will be used to compute each model's RMSE. The optimal model is the one that minimizes the RMSE.
```r
#Clean ATM #1: impute missing values and outliers with tsclean
atm1 <- atm %>% filter(ATM == "ATM1")
atm1$clean_cash <- c(tsclean(ts(atm1$Cash,
                                start = decimal_date(as.Date(min(atm1$DATE))),
                                frequency = 365)))

#Helper to plot a series together with its ACF and PACF
ggtsplot <- function(ts, title) {
  grid.arrange(autoplot(ts) +
                 ggtitle(title) +
                 scale_y_continuous(labels = comma) +
                 theme(axis.title = element_blank()),
               grid.arrange(ggAcf(ts) + ggtitle(element_blank()),
                            ggPacf(ts) + ggtitle(element_blank()),
                            ncol = 2),
               nrow = 2)
}

#Create time series object with frequency 7 for the days of the week
atm1_ts <- ts(atm1$clean_cash,
              start = decimal_date(as.Date(min(atm1$DATE))),
              frequency = 7)
```
```r
ggtsplot(atm1_ts, "ATM #1")
```

Based on the ACF/PACF, there appears to be negative autocorrelation at lags 1, 3, and 5, with repeating peaks at every 7th lag, consistent with weekly seasonality.
```r
#STL decomposition using ETS
h <- 31
atm1_stl_ets_fit <- atm1_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm1_stl_ets_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ETS(A,N,N)
Q* = 13.759, df = 12, p-value = 0.3164

Model df: 2.   Total lags used: 14
```
```r
#STL decomposition using ARIMA
atm1_stl_arima_fit <- atm1_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm1_stl_arima_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ARIMA(0,1,2)
Q* = 8.4479, df = 12, p-value = 0.7492

Model df: 2.   Total lags used: 14
```
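The unadjusted Holt-Winters residual check below refers to a code chunk that was dropped from the writeup; a minimal reconstruction, assuming the same horizon `h`:

```r
#Holt-Winters (HW)
atm1_hw_fit <- hw(atm1_ts, h = h)  # reconstructed chunk; object name is an assumption
checkresiduals(atm1_hw_fit)
```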
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 28.877, df = 3, p-value = 0.000002376

Model df: 11.   Total lags used: 14
```
```r
#Holt-Winters (HW) with Box-Cox adjustment
atm1_lambda <- BoxCox.lambda(atm1_ts)
atm1_adj_hw_fit <- hw(atm1_ts, h = h, lambda = atm1_lambda)
checkresiduals(atm1_adj_hw_fit)
```
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.824, df = 3, p-value = 0.0002973

Model df: 11.   Total lags used: 14
```
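The final candidate is a seasonal ARIMA chosen by auto.arima; that chunk is likewise missing, so this is a reconstruction (the automatically selected order matches the ARIMA(0,0,1)(0,1,2)[7] in the output below):

```r
#ARIMA selected automatically
atm1_arima_fit <- auto.arima(atm1_ts)  # reconstructed chunk; object name is an assumption
checkresiduals(atm1_arima_fit)
```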
```
Ljung-Box test

data:  Residuals from ARIMA(0,0,1)(0,1,2)[7]
Q* = 15.247, df = 11, p-value = 0.1715

Model df: 3.   Total lags used: 14
```
The STL and ARIMA models pass the Ljung-Box test, while the Holt-Winters residuals show significant autocorrelation; all candidates will nonetheless be compared in the cross-validation stage. The tsCV function will be used to evaluate out-of-sample predictions: the errors it produces are used to compute each model's RMSE, and the objective is to minimize the RMSE.
```r
#Cross-validation
get_rmse <- function(e) {sqrt(mean(e^2, na.rm = TRUE))}
atm1_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(0, 0, 1), seasonal = c(0, 1, 2)), h = h)
}
stl_ets_errors <- tsCV(atm1_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm1_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm1_ts, hw, h = h)
adj_hw_errors <- tsCV(atm1_ts, hw, h = h, lambda = atm1_lambda)
arima_errors <- tsCV(atm1_ts, atm1_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]),
                    get_rmse(stl_arima_errors[, h]),
                    get_rmse(arima_errors[, h]),
                    get_rmse(hw_errors[, h]),
                    get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
```

| Model | RMSE |
|---|---|
| ARIMA | 29.67002 |
| STL ARIMA | 31.80636 |
| STL ETS | 31.81959 |
| Adjusted Holt-Winters | 34.30067 |
| Holt-Winters | 39.49958 |
Based on the cross-validation, the ARIMA model did the best job of predicting the sample data and will be used to forecast ATM #1.
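A sketch of the forecast step (the original chunk is not shown in the writeup; `atm1_forecast` is an assumed name, reusing the `atm1_arima_forecast` helper defined above):

```r
#Forecast May 2010 withdrawals for ATM #1
atm1_forecast <- atm1_arima_forecast(atm1_ts, h)  # 31 daily point forecasts
autoplot(atm1_forecast)
```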
```r
#Clean ATM #2 and create the time series object
atm2 <- atm %>% filter(ATM == "ATM2")
atm2$clean_cash <- c(tsclean(ts(atm2$Cash,
                                start = decimal_date(as.Date(min(atm2$DATE))),
                                frequency = 365)))
atm2_ts <- ts(atm2$clean_cash,
              start = decimal_date(as.Date(min(atm2$DATE))),
              frequency = 7)
ggtsplot(atm2_ts, "ATM #2")
```

```r
#STL decomposition using ETS
atm2_stl_ets_fit <- atm2_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm2_stl_ets_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ETS(A,N,N)
Q* = 9.2372, df = 12, p-value = 0.6825

Model df: 2.   Total lags used: 14
```
```r
#STL decomposition using ARIMA
atm2_stl_arima_fit <- atm2_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm2_stl_arima_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ARIMA(2,1,2)
Q* = 7.4932, df = 10, p-value = 0.6782

Model df: 4.   Total lags used: 14
```
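As with ATM #1, the unadjusted Holt-Winters chunk is missing from the writeup; a reconstruction:

```r
#Holt-Winters (HW)
atm2_hw_fit <- hw(atm2_ts, h = h)  # reconstructed chunk; object name is an assumption
checkresiduals(atm2_hw_fit)
```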
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 37.435, df = 3, p-value = 0.00000003722

Model df: 11.   Total lags used: 14
```
```r
#Holt-Winters with Box-Cox adjustment
atm2_lambda <- BoxCox.lambda(atm2_ts)
atm2_adj_hw_fit <- hw(atm2_ts, h = h, lambda = atm2_lambda)
checkresiduals(atm2_adj_hw_fit)
```
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 34.776, df = 3, p-value = 0.0000001358

Model df: 11.   Total lags used: 14
```
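The auto.arima chunk is likewise missing; a reconstruction matching the ARIMA(2,0,2)(0,1,1)[7] output below:

```r
#ARIMA selected automatically
atm2_arima_fit <- auto.arima(atm2_ts)  # reconstructed chunk; object name is an assumption
checkresiduals(atm2_arima_fit)
```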
```
Ljung-Box test

data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
Q* = 10.231, df = 9, p-value = 0.3321

Model df: 5.   Total lags used: 14
```
As with ATM #1, the STL and ARIMA models pass the Ljung-Box test while the Holt-Winters residuals do not; all candidates will again be evaluated with the tsCV function in the cross-validation stage.
```r
#Cross-validation
atm2_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(2, 0, 2), seasonal = c(0, 1, 1)), h = h)
}
stl_ets_errors <- tsCV(atm2_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm2_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm2_ts, hw, h = h)
adj_hw_errors <- tsCV(atm2_ts, hw, h = h, lambda = atm2_lambda)
arima_errors <- tsCV(atm2_ts, atm2_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]),
                    get_rmse(stl_arima_errors[, h]),
                    get_rmse(arima_errors[, h]),
                    get_rmse(hw_errors[, h]),
                    get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
```

| Model | RMSE |
|---|---|
| ARIMA | 33.52154 |
| STL ARIMA | 34.17962 |
| STL ETS | 34.96679 |
| Adjusted Holt-Winters | 41.53970 |
| Holt-Winters | 55.33134 |
Based on the cross-validation, the ARIMA model did the best job of predicting the sample data and will be used to forecast ATM #2.
ATM #3 is unique: it has only three nonzero observations, all at the very end of the series, so the average of those values will be used as its forecast.
```r
#ATM #3: highlight the nonzero observations
atm %>%
  filter(ATM == "ATM3") %>%
  mutate(nonzero = if_else(Cash == 0, "No", "Yes")) %>%
  ggplot(aes(DATE, Cash, color = nonzero)) +
  geom_point() +
  scale_color_brewer(palette = "Set2")
```
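A sketch of the mean forecast over those nonzero values (the original chunk is not shown; `atm3_forecast` is an assumed name):

```r
#ATM #3: mean forecast over the nonzero observations
atm3_nonzero <- atm %>% filter(ATM == "ATM3", Cash > 0)
atm3_forecast <- meanf(ts(atm3_nonzero$Cash), h = h)  # flat mean forecast for May 2010
```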
```r
#Clean ATM #4 and create the time series object
atm4 <- atm %>% filter(ATM == "ATM4")
atm4$clean_cash <- c(tsclean(ts(atm4$Cash,
                                start = decimal_date(as.Date(min(atm4$DATE))),
                                frequency = 365)))
atm4_ts <- ts(atm4$clean_cash,
              start = decimal_date(as.Date(min(atm4$DATE))),
              frequency = 7)
ggtsplot(atm4_ts, "ATM #4")
```

```r
#STL decomposition using ETS
atm4_stl_ets_fit <- atm4_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm4_stl_ets_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ETS(A,N,N)
Q* = 30.61, df = 12, p-value = 0.002258

Model df: 2.   Total lags used: 14
```
```r
#STL decomposition using ARIMA
atm4_stl_arima_fit <- atm4_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm4_stl_arima_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ARIMA(1,0,1) with non-zero mean
Q* = 21.78, df = 11, p-value = 0.02614

Model df: 3.   Total lags used: 14
```
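The unadjusted Holt-Winters chunk is again missing; a reconstruction:

```r
#Holt-Winters (HW)
atm4_hw_fit <- hw(atm4_ts, h = h)  # reconstructed chunk; object name is an assumption
checkresiduals(atm4_hw_fit)
```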
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.279, df = 3, p-value = 0.0003852

Model df: 11.   Total lags used: 14
```
```r
#Holt-Winters with Box-Cox adjustment
atm4_lambda <- BoxCox.lambda(atm4_ts)
atm4_adj_hw_fit <- hw(atm4_ts, h = h, lambda = atm4_lambda)
checkresiduals(atm4_adj_hw_fit)
```
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.934, df = 3, p-value = 0.0002822

Model df: 11.   Total lags used: 14
```
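A reconstruction of the missing auto.arima chunk, matching the ARIMA(0,0,3)(1,0,0)[7] output below:

```r
#ARIMA selected automatically
atm4_arima_fit <- auto.arima(atm4_ts)  # reconstructed chunk; object name is an assumption
checkresiduals(atm4_arima_fit)
```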
```
Ljung-Box test

data:  Residuals from ARIMA(0,0,3)(1,0,0)[7] with non-zero mean
Q* = 16.96, df = 9, p-value = 0.04934

Model df: 5.   Total lags used: 14
```
```r
#Cross-validation
atm4_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(0, 0, 3), seasonal = c(1, 0, 0)), h = h)
}
stl_ets_errors <- tsCV(atm4_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm4_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm4_ts, hw, h = h)
adj_hw_errors <- tsCV(atm4_ts, hw, h = h, lambda = atm4_lambda)
arima_errors <- tsCV(atm4_ts, atm4_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]),
                    get_rmse(stl_arima_errors[, h]),
                    get_rmse(arima_errors[, h]),
                    get_rmse(hw_errors[, h]),
                    get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
```

| Model | RMSE |
|---|---|
| ARIMA | 346.7434 |
| STL ARIMA | 353.7169 |
| STL ETS | 366.0626 |
| Adjusted Holt-Winters | 388.4804 |
| Holt-Winters | 504.2440 |
None of the models produced ideal residuals for ATM #4: both Holt-Winters fits have ACF spikes outside the bands, and the STL models did not perform as well as the seasonal ARIMA. Cross-validation with tsCV shows that the ARIMA model minimizes the RMSE, so it will be used to forecast ATM #4.
After testing multiple approaches using cross-validation, ARIMA models were selected for ATM #1, #2, and #4, as they had the lowest RMSE; ATM #3 is forecast with the mean of its nonzero observations.
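Finally, the point forecasts are combined and written to an Excel-readable file, as the assignment requires. This is a sketch: `atm2_forecast` and `atm4_forecast` are assumed to follow the same pattern as `atm1_forecast`, the `writexl` package is one convenient export choice, and the output file name is an assumption.

```r
#Write all four May 2010 forecasts to an Excel-readable file
library(writexl)  # assumed export package
atm2_forecast <- atm2_arima_forecast(atm2_ts, h)  # same pattern as ATM #1
atm4_forecast <- atm4_arima_forecast(atm4_ts, h)
may_2010 <- seq(as.Date("2010-05-01"), as.Date("2010-05-31"), by = "day")
write_xlsx(data.frame(DATE = may_2010,
                      ATM1 = as.numeric(atm1_forecast$mean),
                      ATM2 = as.numeric(atm2_forecast$mean),
                      ATM3 = as.numeric(atm3_forecast$mean),
                      ATM4 = as.numeric(atm4_forecast$mean)),
           "ATM624Forecast.xlsx")  # file name is an assumption
```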
Part B consists of a simple dataset of residential power usage from January 1998 to December 2013. The assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
```r
#Read data
if (!file.exists("ResidentialCustomerForecastLoad-624.xlsx")) {
  download.file("https://github.com/isaram/CUNY-SPS/raw/master/DATA624/ResidentialCustomerForecastLoad-624.xlsx",
                "ResidentialCustomerForecastLoad-624.xlsx")
}
raw_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
kwh <- raw_data %>%
  select(KWH) %>%
  ts(start = decimal_date(date("1998-01-01")), frequency = 12)
autoplot(kwh, main = "Residential Power Usage (KWH)") +
  scale_y_continuous(labels = comma)
```

Summary statistics for KWH:

| Statistic | KWH |
|---|---|
| Min. | 770523 |
| 1st Qu. | 5429912 |
| Median | 6283324 |
| Mean | 6502475 |
| 3rd Qu. | 7620524 |
| Max. | 10655730 |
| NA's | 1 |
There appear to be missing data and outlying points. The tsclean function will be used to clean the data, and the cleaned series will then be replotted with its ACF and PACF.
```r
#Clean the series and replot with ACF/PACF
kwh <- tsclean(kwh)
ggtsplot <- function(ts, title) {
  grid.arrange(autoplot(ts) +
                 scale_y_continuous(labels = comma) +
                 ggtitle(title) +
                 theme(axis.title = element_blank()),
               grid.arrange(ggAcf(ts) + ggtitle(element_blank()),
                            ggPacf(ts) + ggtitle(element_blank()),
                            ncol = 2),
               nrow = 2)
}
ggtsplot(kwh, "Cleaned Residential Power Usage (KWH)")
```

The data show a clear seasonal pattern, so a model that captures seasonality is needed; ARIMA and Holt-Winters (HW) models will be compared.
```r
#Compare candidate models visually
h <- 12
lambda <- BoxCox.lambda(kwh)
autoplot(kwh) +
  autolayer(hw(kwh, h = h), series = "Holt-Winters") +
  autolayer(hw(kwh, h = h, lambda = lambda), series = "Holt-Winters (Box-Cox Adjusted)") +
  autolayer(hw(kwh, h = h, lambda = lambda, damped = TRUE), series = "Holt-Winters (Damped & Box-Cox Adjusted)") +
  autolayer(forecast(auto.arima(kwh), h = h), series = "ARIMA") +
  facet_wrap(. ~ series, ncol = 2) +
  scale_color_brewer(palette = "Set2") +
  scale_y_continuous(labels = comma) +
  theme(legend.position = "none", axis.title = element_blank())
```

Since all of these models capture the seasonality well, the residuals will be examined to determine which model is best.
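The residual check for the unadjusted Holt-Winters model below refers to a missing chunk; a reconstruction:

```r
#Holt-Winters model
hw_fit <- hw(kwh, h = h)  # reconstructed chunk; object name is an assumption
checkresiduals(hw_fit)
```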
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 61.741, df = 8, p-value = 0.0000000002121

Model df: 16.   Total lags used: 24
```
```r
#Adjusted Holt-Winters model
adj_hw_fit <- hw(kwh, h = h, lambda = lambda)
checkresiduals(adj_hw_fit)
```
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 48.973, df = 8, p-value = 0.00000006433

Model df: 16.   Total lags used: 24
```
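The ARIMA residual check comes from a missing auto.arima chunk; a reconstruction matching the ARIMA(3,0,1)(2,1,0)[12] with drift shown below:

```r
#ARIMA model selected automatically
kwh_arima_fit <- auto.arima(kwh)  # reconstructed chunk; object name is an assumption
checkresiduals(kwh_arima_fit)
```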
```
Ljung-Box test

data:  Residuals from ARIMA(3,0,1)(2,1,0)[12] with drift
Q* = 11.549, df = 17, p-value = 0.8266

Model df: 7.   Total lags used: 24
```
In the Holt-Winters model there are significant spikes in the residual ACF, which makes it unsuitable. The adjusted Holt-Winters model is somewhat better, but there are still ACF spikes outside of the bands. The ARIMA model appears to be the best selection, as only one small ACF spike falls outside the threshold; this will be confirmed using cross-validation.
As in Part A, the tsCV function will be used to evaluate the models: the cross-validation errors are used to compute each model's RMSE, and the model that minimizes the RMSE will be selected.
```r
#Cross-validation
get_rmse <- function(e) {sqrt(mean(e^2, na.rm = TRUE))}
arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(3, 0, 1), seasonal = c(2, 1, 0), include.drift = TRUE), h = h)
}
hw_error <- tsCV(kwh, hw, h = h)
adj_hw_error <- tsCV(kwh, hw, h = h, lambda = lambda)
damped_adj_hw_error <- tsCV(kwh, hw, h = h, lambda = lambda, damped = TRUE)
arima_errors <- tsCV(kwh, arima_forecast, h = h)

data.frame(Model = c("ARIMA", "Holt-Winters", "Adjusted Holt-Winters", "Damped & Adjusted Holt-Winters"),
           RMSE = c(get_rmse(arima_errors[, h]),
                    get_rmse(hw_error[, h]),
                    get_rmse(adj_hw_error[, h]),
                    get_rmse(damped_adj_hw_error[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
```

| Model | RMSE |
|---|---|
| ARIMA | 659529.8 |
| Holt-Winters | 1010721.9 |
| Adjusted Holt-Winters | 1373617.3 |
| Damped & Adjusted Holt-Winters | 1547751.3 |
The cross-validation results confirm that the ARIMA model is the best selection, as it minimizes the RMSE. It will be used to produce the 2014 forecast.
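The 2014 forecast table below is generated from the ARIMA fit; a sketch reusing the `kwh_arima_fit` object reconstructed above:

```r
#Monthly forecast for 2014
kwh_forecast <- forecast(kwh_arima_fit, h = h)
kwh_forecast  # point forecasts with 80% and 95% intervals
```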
| Month | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| Jan 2014 | 9433336 | 8688058 | 10178615 | 8293531 | 10573141 |
| Feb 2014 | 8570219 | 7785677 | 9354761 | 7370366 | 9770072 |
| Mar 2014 | 6615312 | 5830157 | 7400467 | 5414521 | 7816102 |
| Apr 2014 | 6005538 | 5196071 | 6815004 | 4767565 | 7243510 |
| May 2014 | 5917569 | 5108101 | 6727036 | 4679595 | 7155543 |
| Jun 2014 | 8187387 | 7377114 | 8997660 | 6948182 | 9426592 |
| Jul 2014 | 9528779 | 8717809 | 10339750 | 8288507 | 10769052 |
| Aug 2014 | 9991953 | 9180963 | 10802943 | 8751650 | 11232255 |
| Sep 2014 | 8546843 | 7735715 | 9357971 | 7306330 | 9787356 |
| Oct 2014 | 5856327 | 5045196 | 6667458 | 4615809 | 7096845 |
| Nov 2014 | 6153245 | 5342114 | 6964376 | 4912727 | 7393763 |
| Dec 2014 | 7732110 | 6920971 | 8543250 | 6491580 | 8972641 |