In Part A, I want you to forecast how much cash is taken out of four different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this more of a business feel. Explain and demonstrate your process, the techniques used and not used, and your actual forecast. I am giving you the data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link, along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
I will begin by plotting the cash withdrawn by each ATM.
```r
#Load libraries and read data
library(tidyverse)
library(readxl)
library(lubridate)
library(forecast)
library(scales)
library(gridExtra)
library(knitr)
library(kableExtra)

if (!file.exists("ATM624Data.xlsx")) {
  download.file("https://github.com/isaram/CUNY-SPS/raw/master/DATA624/ATM624Data.xlsx",
                "ATM624Data.xlsx")
}

atm <- read_excel("ATM624Data.xlsx") %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"),  # convert Excel serial dates
         ATM = as.factor(ATM)) %>%
  data.frame()
```
```r
#Plot cash withdrawals by ATM
atm %>%
  ggplot(aes(DATE, Cash, color = ATM)) +
  geom_line() +
  scale_color_brewer(palette = "Set2") +
  scale_y_continuous(labels = comma) +
  facet_wrap(. ~ ATM, ncol = 2)
```

ATM #4 is utilized at a much higher rate than the other three. There also appear to be outliers and missing values in the data. If ATM #4 is removed from the dataset, the visualization shows a different picture.
```r
#Plot with ATM4 removed
atm %>%
  filter(ATM != "ATM4", !is.na(ATM)) %>%
  ggplot(aes(DATE, Cash, color = ATM)) +
  geom_line() +
  facet_wrap(. ~ ATM, ncol = 2) +
  scale_color_brewer(palette = "Set2")
```

Based on the filtered visualization, ATM #3 appears to have only recently come into use, while ATM #1 and ATM #2 look similar to one another with no obvious trend.
```r
#Boxplots and summary statistics by ATM
atm %>%
  na.omit() %>%
  ggplot(aes(ATM, Cash, color = ATM)) +
  geom_boxplot() +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(. ~ ATM, nrow = 2, scales = "free")

atm %>%
  filter(!is.na(ATM)) %>%
  group_by(ATM) %>%
  summarise(Minimum = min(Cash, na.rm = TRUE),
            `1st Qu.` = quantile(Cash, .25, na.rm = TRUE),
            Mean = mean(Cash, na.rm = TRUE),
            Median = median(Cash, na.rm = TRUE),
            `3rd Qu.` = quantile(Cash, .75, na.rm = TRUE),
            Maximum = max(Cash, na.rm = TRUE),
            `NA's` = sum(is.na(Cash))) %>%
  kable() %>%
  kable_styling()
```

| ATM | Minimum | 1st Qu. | Mean | Median | 3rd Qu. | Maximum | NA's |
|---|---|---|---|---|---|---|---|
| ATM1 | 1.00000 | 73.0000 | 83.8867403 | 91.0000 | 108.000 | 180.00 | 3 |
| ATM2 | 0.00000 | 25.5000 | 62.5785124 | 67.0000 | 93.000 | 147.00 | 2 |
| ATM3 | 0.00000 | 0.0000 | 0.7205479 | 0.0000 | 0.000 | 96.00 | 0 |
| ATM4 | 1.56326 | 124.3344 | 474.0433449 | 403.8393 | 704.507 | 10919.76 | 0 |
There appears to be a pattern in each series, but the fluctuations are very limited. Perhaps a pattern can be found by further examining the day of the week.
```r
#Boxplots by day of week
atm %>%
  na.omit() %>%
  mutate(`Day of Week` = recode(as.factor(wday(DATE)),
                                "1" = "Sunday", "2" = "Monday", "3" = "Tuesday",
                                "4" = "Wednesday", "5" = "Thursday",
                                "6" = "Friday", "7" = "Saturday")) %>%
  ggplot(aes(`Day of Week`, Cash, color = `Day of Week`)) +
  geom_boxplot() +
  facet_grid(ATM ~ `Day of Week`, scales = "free") +
  scale_color_brewer(palette = "Set2")
```

Cash withdrawals do fluctuate by day of the week, and there is enough variation between the ATMs that each will be modeled separately. The data will be cleaned and several modeling techniques explored, then cross-validation will be used to compute each model's RMSE. The optimal model is the one that minimizes the RMSE.
```r
#Clean ATM #1: impute missing values and outliers with tsclean
atm1 <- atm %>% filter(ATM == "ATM1")
atm1$clean_cash <- c(tsclean(ts(atm1$Cash,
                                start = decimal_date(as.Date(min(atm1$DATE))),
                                frequency = 365)))

#Helper to plot a series together with its ACF and PACF
ggtsplot <- function(ts, title) {
  grid.arrange(autoplot(ts) +
                 ggtitle(title) +
                 scale_y_continuous(labels = comma) +
                 theme(axis.title = element_blank()),
               grid.arrange(ggAcf(ts) + ggtitle(element_blank()),
                            ggPacf(ts) + ggtitle(element_blank()),
                            ncol = 2),
               nrow = 2)
}

#Create time series object with frequency 7 for the days of the week
atm1_ts <- ts(atm1$clean_cash,
              start = decimal_date(as.Date(min(atm1$DATE))),
              frequency = 7)
```
```r
ggtsplot(atm1_ts, "ATM #1")
```

Based on the ACF/PACF, there appears to be negative autocorrelation at lags 1, 3, and 5, with repeating peaks at every 7th lag, consistent with weekly seasonality.
```r
#STL decomposition using ETS
h <- 31
atm1_stl_ets_fit <- atm1_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm1_stl_ets_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ETS(A,N,N)
Q* = 13.759, df = 12, p-value = 0.3164

Model df: 2.   Total lags used: 14
```
```r
#STL decomposition using ARIMA
atm1_stl_arima_fit <- atm1_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm1_stl_arima_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ARIMA(0,1,2)
Q* = 8.4479, df = 12, p-value = 0.7492

Model df: 2.   Total lags used: 14
```
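The unadjusted Holt-Winters residual check below refers to a code chunk that was dropped from the writeup; a minimal reconstruction, assuming the same horizon `h`:

```r
#Holt-Winters (HW)
atm1_hw_fit <- hw(atm1_ts, h = h)  # reconstructed chunk; object name is an assumption
checkresiduals(atm1_hw_fit)
```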
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 28.877, df = 3, p-value = 0.000002376

Model df: 11.   Total lags used: 14
```
```r
#Holt-Winters (HW) with Box-Cox adjustment
atm1_lambda <- BoxCox.lambda(atm1_ts)
atm1_adj_hw_fit <- hw(atm1_ts, h = h, lambda = atm1_lambda)
checkresiduals(atm1_adj_hw_fit)
```
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.824, df = 3, p-value = 0.0002973

Model df: 11.   Total lags used: 14
```
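The final candidate is a seasonal ARIMA chosen by auto.arima; that chunk is likewise missing, so this is a reconstruction (the automatically selected order matches the ARIMA(0,0,1)(0,1,2)[7] in the output below):

```r
#ARIMA selected automatically
atm1_arima_fit <- auto.arima(atm1_ts)  # reconstructed chunk; object name is an assumption
checkresiduals(atm1_arima_fit)
```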
```
Ljung-Box test

data:  Residuals from ARIMA(0,0,1)(0,1,2)[7]
Q* = 15.247, df = 11, p-value = 0.1715

Model df: 3.   Total lags used: 14
```
The STL and ARIMA models pass the Ljung-Box test, while the Holt-Winters residuals show significant autocorrelation; all candidates will nonetheless be compared in the cross-validation stage. The tsCV function will be used to evaluate out-of-sample predictions: the errors it produces are used to compute each model's RMSE, and the objective is to minimize the RMSE.
```r
#Cross-validation
get_rmse <- function(e) {sqrt(mean(e^2, na.rm = TRUE))}
atm1_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(0, 0, 1), seasonal = c(0, 1, 2)), h = h)
}
stl_ets_errors <- tsCV(atm1_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm1_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm1_ts, hw, h = h)
adj_hw_errors <- tsCV(atm1_ts, hw, h = h, lambda = atm1_lambda)
arima_errors <- tsCV(atm1_ts, atm1_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]),
                    get_rmse(stl_arima_errors[, h]),
                    get_rmse(arima_errors[, h]),
                    get_rmse(hw_errors[, h]),
                    get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
```

| Model | RMSE |
|---|---|
| ARIMA | 29.67002 |
| STL ARIMA | 31.80636 |
| STL ETS | 31.81959 |
| Adjusted Holt-Winters | 34.30067 |
| Holt-Winters | 39.49958 |
Based on the cross-validation, the ARIMA model did the best job of predicting the sample data and will be used to forecast ATM #1.
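A sketch of the forecast step (the original chunk is not shown in the writeup; `atm1_forecast` is an assumed name, reusing the `atm1_arima_forecast` helper defined above):

```r
#Forecast May 2010 withdrawals for ATM #1
atm1_forecast <- atm1_arima_forecast(atm1_ts, h)  # 31 daily point forecasts
autoplot(atm1_forecast)
```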
```r
#Clean ATM #2 and create the time series object
atm2 <- atm %>% filter(ATM == "ATM2")
atm2$clean_cash <- c(tsclean(ts(atm2$Cash,
                                start = decimal_date(as.Date(min(atm2$DATE))),
                                frequency = 365)))
atm2_ts <- ts(atm2$clean_cash,
              start = decimal_date(as.Date(min(atm2$DATE))),
              frequency = 7)
ggtsplot(atm2_ts, "ATM #2")
```

```r
#STL decomposition using ETS
atm2_stl_ets_fit <- atm2_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm2_stl_ets_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ETS(A,N,N)
Q* = 9.2372, df = 12, p-value = 0.6825

Model df: 2.   Total lags used: 14
```
```r
#STL decomposition using ARIMA
atm2_stl_arima_fit <- atm2_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm2_stl_arima_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ARIMA(2,1,2)
Q* = 7.4932, df = 10, p-value = 0.6782

Model df: 4.   Total lags used: 14
```
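As with ATM #1, the unadjusted Holt-Winters chunk is missing from the writeup; a reconstruction:

```r
#Holt-Winters (HW)
atm2_hw_fit <- hw(atm2_ts, h = h)  # reconstructed chunk; object name is an assumption
checkresiduals(atm2_hw_fit)
```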
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 37.435, df = 3, p-value = 0.00000003722

Model df: 11.   Total lags used: 14
```
```r
#Holt-Winters with Box-Cox adjustment
atm2_lambda <- BoxCox.lambda(atm2_ts)
atm2_adj_hw_fit <- hw(atm2_ts, h = h, lambda = atm2_lambda)
checkresiduals(atm2_adj_hw_fit)
```
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 34.776, df = 3, p-value = 0.0000001358

Model df: 11.   Total lags used: 14
```
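The auto.arima chunk is likewise missing; a reconstruction matching the ARIMA(2,0,2)(0,1,1)[7] output below:

```r
#ARIMA selected automatically
atm2_arima_fit <- auto.arima(atm2_ts)  # reconstructed chunk; object name is an assumption
checkresiduals(atm2_arima_fit)
```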
```
Ljung-Box test

data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
Q* = 10.231, df = 9, p-value = 0.3321

Model df: 5.   Total lags used: 14
```
As with ATM #1, the STL and ARIMA models pass the Ljung-Box test while the Holt-Winters residuals do not; all candidates will again be evaluated with the tsCV function in the cross-validation stage.
```r
#Cross-validation
atm2_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(2, 0, 2), seasonal = c(0, 1, 1)), h = h)
}
stl_ets_errors <- tsCV(atm2_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm2_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm2_ts, hw, h = h)
adj_hw_errors <- tsCV(atm2_ts, hw, h = h, lambda = atm2_lambda)
arima_errors <- tsCV(atm2_ts, atm2_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]),
                    get_rmse(stl_arima_errors[, h]),
                    get_rmse(arima_errors[, h]),
                    get_rmse(hw_errors[, h]),
                    get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
```

| Model | RMSE |
|---|---|
| ARIMA | 33.52154 |
| STL ARIMA | 34.17962 |
| STL ETS | 34.96679 |
| Adjusted Holt-Winters | 41.53970 |
| Holt-Winters | 55.33134 |
Based on the cross-validation, the ARIMA model did the best job of predicting the sample data and will be used to forecast ATM #2.
ATM #3 is unique: it has only three nonzero observations, all at the very end of the series, so the average of those values will be used as its forecast.
```r
#ATM #3: highlight the nonzero observations
atm %>%
  filter(ATM == "ATM3") %>%
  mutate(nonzero = if_else(Cash == 0, "No", "Yes")) %>%
  ggplot(aes(DATE, Cash, color = nonzero)) +
  geom_point() +
  scale_color_brewer(palette = "Set2")
```
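A sketch of the mean forecast over those nonzero values (the original chunk is not shown; `atm3_forecast` is an assumed name):

```r
#ATM #3: mean forecast over the nonzero observations
atm3_nonzero <- atm %>% filter(ATM == "ATM3", Cash > 0)
atm3_forecast <- meanf(ts(atm3_nonzero$Cash), h = h)  # flat mean forecast for May 2010
```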
```r
#Clean ATM #4 and create the time series object
atm4 <- atm %>% filter(ATM == "ATM4")
atm4$clean_cash <- c(tsclean(ts(atm4$Cash,
                                start = decimal_date(as.Date(min(atm4$DATE))),
                                frequency = 365)))
atm4_ts <- ts(atm4$clean_cash,
              start = decimal_date(as.Date(min(atm4$DATE))),
              frequency = 7)
ggtsplot(atm4_ts, "ATM #4")
```

```r
#STL decomposition using ETS
atm4_stl_ets_fit <- atm4_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm4_stl_ets_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ETS(A,N,N)
Q* = 30.61, df = 12, p-value = 0.002258

Model df: 2.   Total lags used: 14
```
```r
#STL decomposition using ARIMA
atm4_stl_arima_fit <- atm4_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm4_stl_arima_fit)
```
```
Ljung-Box test

data:  Residuals from STL + ARIMA(1,0,1) with non-zero mean
Q* = 21.78, df = 11, p-value = 0.02614

Model df: 3.   Total lags used: 14
```
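The unadjusted Holt-Winters chunk is again missing; a reconstruction:

```r
#Holt-Winters (HW)
atm4_hw_fit <- hw(atm4_ts, h = h)  # reconstructed chunk; object name is an assumption
checkresiduals(atm4_hw_fit)
```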
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.279, df = 3, p-value = 0.0003852

Model df: 11.   Total lags used: 14
```
```r
#Holt-Winters with Box-Cox adjustment
atm4_lambda <- BoxCox.lambda(atm4_ts)
atm4_adj_hw_fit <- hw(atm4_ts, h = h, lambda = atm4_lambda)
checkresiduals(atm4_adj_hw_fit)
```
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.934, df = 3, p-value = 0.0002822

Model df: 11.   Total lags used: 14
```
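A reconstruction of the missing auto.arima chunk, matching the ARIMA(0,0,3)(1,0,0)[7] output below:

```r
#ARIMA selected automatically
atm4_arima_fit <- auto.arima(atm4_ts)  # reconstructed chunk; object name is an assumption
checkresiduals(atm4_arima_fit)
```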
```
Ljung-Box test

data:  Residuals from ARIMA(0,0,3)(1,0,0)[7] with non-zero mean
Q* = 16.96, df = 9, p-value = 0.04934

Model df: 5.   Total lags used: 14
```
```r
#Cross-validation
atm4_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(0, 0, 3), seasonal = c(1, 0, 0)), h = h)
}
stl_ets_errors <- tsCV(atm4_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm4_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm4_ts, hw, h = h)
adj_hw_errors <- tsCV(atm4_ts, hw, h = h, lambda = atm4_lambda)
arima_errors <- tsCV(atm4_ts, atm4_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]),
                    get_rmse(stl_arima_errors[, h]),
                    get_rmse(arima_errors[, h]),
                    get_rmse(hw_errors[, h]),
                    get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
```

| Model | RMSE |
|---|---|
| ARIMA | 346.7434 |
| STL ARIMA | 353.7169 |
| STL ETS | 366.0626 |
| Adjusted Holt-Winters | 388.4804 |
| Holt-Winters | 504.2440 |
None of the models produced ideal residuals for ATM #4: both Holt-Winters fits have ACF spikes outside the bands, and the STL models did not perform as well as the seasonal ARIMA. Cross-validation with tsCV shows that the ARIMA model minimizes the RMSE, so it will be used to forecast ATM #4.
After testing multiple approaches using cross-validation, ARIMA models were selected for ATM #1, #2, and #4, as they had the lowest RMSE; ATM #3 is forecast with the mean of its nonzero observations.
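Finally, the point forecasts are combined and written to an Excel-readable file, as the assignment requires. This is a sketch: `atm2_forecast` and `atm4_forecast` are assumed to follow the same pattern as `atm1_forecast`, the `writexl` package is one convenient export choice, and the output file name is an assumption.

```r
#Write all four May 2010 forecasts to an Excel-readable file
library(writexl)  # assumed export package
atm2_forecast <- atm2_arima_forecast(atm2_ts, h)  # same pattern as ATM #1
atm4_forecast <- atm4_arima_forecast(atm4_ts, h)
may_2010 <- seq(as.Date("2010-05-01"), as.Date("2010-05-31"), by = "day")
write_xlsx(data.frame(DATE = may_2010,
                      ATM1 = as.numeric(atm1_forecast$mean),
                      ATM2 = as.numeric(atm2_forecast$mean),
                      ATM3 = as.numeric(atm3_forecast$mean),
                      ATM4 = as.numeric(atm4_forecast$mean)),
           "ATM624Forecast.xlsx")  # file name is an assumption
```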
Part B consists of a simple dataset of residential power usage from January 1998 to December 2013. The assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
```r
#Read data
if (!file.exists("ResidentialCustomerForecastLoad-624.xlsx")) {
  download.file("https://github.com/isaram/CUNY-SPS/raw/master/DATA624/ResidentialCustomerForecastLoad-624.xlsx",
                "ResidentialCustomerForecastLoad-624.xlsx")
}
raw_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
kwh <- raw_data %>%
  select(KWH) %>%
  ts(start = decimal_date(date("1998-01-01")), frequency = 12)
autoplot(kwh, main = "Residential Power Usage (KWH)") +
  scale_y_continuous(labels = comma)
```

Summary statistics for KWH:

| Statistic | KWH |
|---|---|
| Min. | 770523 |
| 1st Qu. | 5429912 |
| Median | 6283324 |
| Mean | 6502475 |
| 3rd Qu. | 7620524 |
| Max. | 10655730 |
| NA's | 1 |
There appear to be missing data and outlying points. The tsclean function will be used to clean the data, and the cleaned series will then be replotted with its ACF and PACF.
```r
#Clean the series and replot with ACF/PACF
kwh <- tsclean(kwh)
ggtsplot <- function(ts, title) {
  grid.arrange(autoplot(ts) +
                 scale_y_continuous(labels = comma) +
                 ggtitle(title) +
                 theme(axis.title = element_blank()),
               grid.arrange(ggAcf(ts) + ggtitle(element_blank()),
                            ggPacf(ts) + ggtitle(element_blank()),
                            ncol = 2),
               nrow = 2)
}
ggtsplot(kwh, "Cleaned Residential Power Usage (KWH)")
```

The data show a clear seasonal pattern, so a model that captures seasonality is needed; ARIMA and Holt-Winters (HW) models will be compared.
```r
#Compare candidate models visually
h <- 12
lambda <- BoxCox.lambda(kwh)
autoplot(kwh) +
  autolayer(hw(kwh, h = h), series = "Holt-Winters") +
  autolayer(hw(kwh, h = h, lambda = lambda), series = "Holt-Winters (Box-Cox Adjusted)") +
  autolayer(hw(kwh, h = h, lambda = lambda, damped = TRUE), series = "Holt-Winters (Damped & Box-Cox Adjusted)") +
  autolayer(forecast(auto.arima(kwh), h = h), series = "ARIMA") +
  facet_wrap(. ~ series, ncol = 2) +
  scale_color_brewer(palette = "Set2") +
  scale_y_continuous(labels = comma) +
  theme(legend.position = "none", axis.title = element_blank())
```

Since all of these models capture the seasonality well, the residuals will be examined to determine which model is best.
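The residual check for the unadjusted Holt-Winters model below refers to a missing chunk; a reconstruction:

```r
#Holt-Winters model
hw_fit <- hw(kwh, h = h)  # reconstructed chunk; object name is an assumption
checkresiduals(hw_fit)
```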
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 61.741, df = 8, p-value = 0.0000000002121

Model df: 16.   Total lags used: 24
```
```r
#Adjusted Holt-Winters model
adj_hw_fit <- hw(kwh, h = h, lambda = lambda)
checkresiduals(adj_hw_fit)
```
```
Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 48.973, df = 8, p-value = 0.00000006433

Model df: 16.   Total lags used: 24
```
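The ARIMA residual check comes from a missing auto.arima chunk; a reconstruction matching the ARIMA(3,0,1)(2,1,0)[12] with drift shown below:

```r
#ARIMA model selected automatically
kwh_arima_fit <- auto.arima(kwh)  # reconstructed chunk; object name is an assumption
checkresiduals(kwh_arima_fit)
```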
```
Ljung-Box test

data:  Residuals from ARIMA(3,0,1)(2,1,0)[12] with drift
Q* = 11.549, df = 17, p-value = 0.8266

Model df: 7.   Total lags used: 24
```
In the Holt-Winters model there are significant spikes in the residual ACF, which makes it unsuitable. The adjusted Holt-Winters model is somewhat better, but there are still ACF spikes outside of the bands. The ARIMA model appears to be the best selection, as only one small ACF spike falls outside the threshold; this will be confirmed using cross-validation.
As in Part A, the tsCV function will be used to evaluate the models: the cross-validation errors are used to compute each model's RMSE, and the model that minimizes the RMSE will be selected.
```r
#Cross-validation
get_rmse <- function(e) {sqrt(mean(e^2, na.rm = TRUE))}
arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(3, 0, 1), seasonal = c(2, 1, 0), include.drift = TRUE), h = h)
}
hw_error <- tsCV(kwh, hw, h = h)
adj_hw_error <- tsCV(kwh, hw, h = h, lambda = lambda)
damped_adj_hw_error <- tsCV(kwh, hw, h = h, lambda = lambda, damped = TRUE)
arima_errors <- tsCV(kwh, arima_forecast, h = h)

data.frame(Model = c("ARIMA", "Holt-Winters", "Adjusted Holt-Winters", "Damped & Adjusted Holt-Winters"),
           RMSE = c(get_rmse(arima_errors[, h]),
                    get_rmse(hw_error[, h]),
                    get_rmse(adj_hw_error[, h]),
                    get_rmse(damped_adj_hw_error[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
```

| Model | RMSE |
|---|---|
| ARIMA | 659529.8 |
| Holt-Winters | 1010721.9 |
| Adjusted Holt-Winters | 1373617.3 |
| Damped & Adjusted Holt-Winters | 1547751.3 |
The cross-validation results confirm that the ARIMA model is the best selection, as it minimizes the RMSE. It will be used to produce the 2014 forecast.
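The 2014 forecast table below is generated from the ARIMA fit; a sketch reusing the `kwh_arima_fit` object reconstructed above:

```r
#Monthly forecast for 2014
kwh_forecast <- forecast(kwh_arima_fit, h = h)
kwh_forecast  # point forecasts with 80% and 95% intervals
```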
| Month | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| Jan 2014 | 9433336 | 8688058 | 10178615 | 8293531 | 10573141 |
| Feb 2014 | 8570219 | 7785677 | 9354761 | 7370366 | 9770072 |
| Mar 2014 | 6615312 | 5830157 | 7400467 | 5414521 | 7816102 |
| Apr 2014 | 6005538 | 5196071 | 6815004 | 4767565 | 7243510 |
| May 2014 | 5917569 | 5108101 | 6727036 | 4679595 | 7155543 |
| Jun 2014 | 8187387 | 7377114 | 8997660 | 6948182 | 9426592 |
| Jul 2014 | 9528779 | 8717809 | 10339750 | 8288507 | 10769052 |
| Aug 2014 | 9991953 | 9180963 | 10802943 | 8751650 | 11232255 |
| Sep 2014 | 8546843 | 7735715 | 9357971 | 7306330 | 9787356 |
| Oct 2014 | 5856327 | 5045196 | 6667458 | 4615809 | 7096845 |
| Nov 2014 | 6153245 | 5342114 | 6964376 | 4912727 | 7393763 |
| Dec 2014 | 7732110 | 6920971 | 8543250 | 6491580 | 8972641 |