DATA 624: Project 01
Libraries
Part A
ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
Introduction
In 2012, consumers made 5.8 billion ATM withdrawals totaling $687 billion in value. The number of ATM withdrawals decreased 0.9 percent per year from 2009 to 2012. However, during the same period, the total dollar value of ATM withdrawals increased 2.0 percent, and the average withdrawal value increased from 108 to 118.
We can see how important it is for the bank or ATM provider to forecast cash withdrawal as close as possible in order to optimized their business. Our task is to forecast how much cash is taken out of 4 different ATM machines for May 2010. Data was provided in a excel file. The variable “Cash” is provided in hundreds of dollars.
Load the data
## # A tibble: 6 x 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
## tibble [1,474 x 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: num [1:1474] 39934 39934 39935 39935 39936 ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
Data exploration
Let’s convert the data into a data frame and do some basic transformation along the way
atm_df <- atm %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30"), #Convert date column https://rdrr.io/r/base/as.Date.html
ATM = as.factor(ATM)) %>%
data.frame()
dim(atm_df)## [1] 1474 3
## DATE ATM Cash
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
## DATE ATM Cash
## 1469 2010-04-25 ATM4 542.28060
## 1470 2010-04-26 ATM4 403.83934
## 1471 2010-04-27 ATM4 13.69733
## 1472 2010-04-28 ATM4 348.20106
## 1473 2010-04-29 ATM4 44.24535
## 1474 2010-04-30 ATM4 482.28711
Let’s drop the missing value and check the dimension again.
## [1] 1474 3
## [1] 1455 3
So, we got rid of 19 rows of data.
We are asked to forecast for May 2010, let’s check the max date we have in the dataset by ATM.
## ATM max
## 1 ATM1 2010-04-30
## 2 ATM2 2010-04-30
## 3 ATM3 2010-04-30
## 4 ATM4 2010-04-30
## DATE ATM Cash
## Min. :2009-05-01 ATM1:362 Min. : 0.0
## 1st Qu.:2009-08-01 ATM2:363 1st Qu.: 0.5
## Median :2009-10-31 ATM3:365 Median : 73.0
## Mean :2009-10-30 ATM4:365 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
Let’s visualize the data
atm_df %>%
ggplot(aes(DATE, Cash, color = ATM)) +
geom_line() +
ggtitle("Cash Withdrawal by Date and ATM") +
scale_color_brewer(palette = "Set2") +
scale_y_continuous(labels = comma) +
facet_wrap(.~ATM, ncol = 2) +
theme(legend.title = element_blank(),
legend.position = "bottom",
axis.title = element_blank())## DATE ATM Cash
## 1 2010-04-28 ATM3 96
## 2 2010-04-29 ATM3 82
## 3 2010-04-30 ATM3 85
We can see from above that ATM#1 and #2 cash out were very similar. ATM#3 only used last three days. On the other hand ATM#4 cash out was at much higher rate compared to other three ATMs and there is a outlier spike in ATM#4. Let’s use separate Y-scale for each of the ATM (scales = “free”) and do the same visualization so that we can have a better look despite having a outlier in ATM#4.
atm_df %>%
ggplot(aes(DATE, Cash, color = ATM)) +
geom_line() +
facet_wrap(.~ATM, ncol = 2, scales = "free") +
ggtitle("Cash Withdrawal by Date and ATM") +
scale_color_brewer(palette = "Set2") +
theme(legend.title = element_blank(),
legend.position = "bottom",
axis.title = element_blank())Let’s look at the distributions
atm_df %>%
ggplot(aes(ATM, Cash, color = ATM)) +
geom_boxplot() +
ggtitle("Cash Withdrawal by ATM") +
scale_color_brewer(palette = "Set2") +
facet_wrap(.~ATM, nrow = 2, scales = "free")Let’s look at how was the cash out by weekdays, if there is any pattern.
atm_df_day <- atm_df %>%
mutate(`day_of_week` = recode(as.factor(wday(DATE)), "1" = "Su", "2" = "Mo", "3" = "Tu", "4" = "We", "5" = "Th", "6" = "Fr", "7" = "Sa"))
head(atm_df_day)## DATE ATM Cash day_of_week
## 1 2009-05-01 ATM1 96 Fr
## 2 2009-05-01 ATM2 107 Fr
## 3 2009-05-02 ATM1 82 Sa
## 4 2009-05-02 ATM2 89 Sa
## 5 2009-05-03 ATM1 85 Su
## 6 2009-05-03 ATM2 90 Su
ggplot(data = atm_df_day, mapping = aes(x=day_of_week, y=Cash)) + geom_col() + ggtitle("Cash Withdrawal by ATM and weekdays") + scale_color_brewer(palette = "Set2") + facet_wrap(.~ATM, nrow = 2, scales = "free")Source: https://stackoverflow.com/questions/9216138/find-the-day-of-a-week
ATM 1
I will use frequency = 7 as there are weekly effect.
# data cleaning and convert ATM1 data to time series
atm1 <- atm_df %>%
filter(ATM == "ATM1")
atm1$clean <- c(tsclean(ts(atm1$Cash, start = decimal_date(as.Date(min(atm1$DATE))), frequency = 365)))
ts_atm1 <- ts(atm1$clean, frequency = 7)We see from above that ACF & PACF has peak lag repeat in 7th lag.
Model Run
Holt-Winters
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 16.871, df = 3, p-value = 0.0007513
##
## Model df: 11. Total lags used: 14
This model did fairly well. Let’s see if it does better if we do box cox transformation
Holt-Winters with Box Cox Adjustment
atm1_lambda <- BoxCox.lambda(ts_atm1)
atm1_hw_box_cox <- hw(ts_atm1, h = h, lambda = atm1_lambda)
checkresiduals(atm1_hw_box_cox)##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 16.921, df = 3, p-value = 0.0007336
##
## Model df: 11. Total lags used: 14
STL Decomposition Models
Let’s try few seasonal decomposition models. I will consider s.window = 7 (seasonal window) to account for weekday’s cashout differences.
STL & ETS
h <- 31
atm1_stl_ets <- ts_atm1 %>%
stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm1_stl_ets)## Warning in checkresiduals(atm1_stl_ets): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 7.9258, df = 12, p-value = 0.7909
##
## Model df: 2. Total lags used: 14
STL & ARIMA
atm1_stl_arima <- ts_atm1 %>%
stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm1_stl_arima)## Warning in checkresiduals(atm1_stl_arima): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ARIMA(0,1,2)
## Q* = 5.4248, df = 12, p-value = 0.9423
##
## Model df: 2. Total lags used: 14
ARIMA
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
##
## Model df: 2. Total lags used: 14
This model looks like it is preforming well. Let’s see how all of them stack up.
Model performance
To measure and compare how well the models did, i will use the tsCV function and compare RMSE.
get_rmse <- function(e) {
sqrt(mean(e^2, na.rm = TRUE))
}
atm1_arima_forecast <- function(x, h) {
forecast(Arima(x, order = c(0, 0, 1), seasonal = c(0, 1, 2)), h = h)
}
stl_ets_errors <- tsCV(ts_atm1, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(ts_atm1, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(ts_atm1, hw, h = h)
boxcox_hw_errors <- tsCV(ts_atm1, hw, h = h, lambda = atm1_lambda)
arima_errors <- tsCV(ts_atm1, atm1_arima_forecast, h = h)
data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Box Cox Holt-Winters"),
RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(boxcox_hw_errors[, h]))) %>%
arrange(RMSE) %>%
kable() %>%
kable_styling()| Model | RMSE |
|---|---|
| ARIMA | 35.67540 |
| STL ARIMA | 37.99406 |
| STL ETS | 38.75113 |
| Box Cox Holt-Winters | 45.01814 |
| Holt-Winters | 46.03976 |
ARIMA model has the lowest RMSE.
ATM #2
Let’s repeat above process for ATM2. I will use same frequency = 7 as there are weekly effect.
# data cleaning and convert ATM1 data to time series
atm2 <- atm_df %>%
filter(ATM == "ATM2")
atm2$clean <- c(tsclean(ts(atm2$Cash, start = decimal_date(as.Date(min(atm1$DATE))), frequency = 365)))
ts_atm2 <- ts(atm2$clean, frequency = 7)We see from above that ACF & PACF has peak lag repeat in 7th lag.
Model Run
Holt-Winters
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 36.563, df = 3, p-value = 5.693e-08
##
## Model df: 11. Total lags used: 14
Holt-Winters with Box Cox
atm2_lambda <- BoxCox.lambda(ts_atm2)
atm2_hw_cox_box <- hw(ts_atm2, h = h, lambda = atm2_lambda)
checkresiduals(atm2_hw_cox_box)##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 34.635, df = 3, p-value = 1.455e-07
##
## Model df: 11. Total lags used: 14
STL Decomposition Models
This is very similar to ATM #1.
STL & ETS
atm2_stl_ets <- ts_atm2 %>%
stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm2_stl_ets)## Warning in checkresiduals(atm2_stl_ets): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 14.34, df = 12, p-value = 0.2795
##
## Model df: 2. Total lags used: 14
This model performs much better compared to ATM#1
STL & ARIMA
atm2_stl_arima <- ts_atm2 %>%
stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm2_stl_arima)## Warning in checkresiduals(atm2_stl_arima): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ARIMA(0,1,1)
## Q* = 13.983, df = 13, p-value = 0.375
##
## Model df: 1. Total lags used: 14
ARIMA
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,2)[7]
## Q* = 12.12, df = 8, p-value = 0.1459
##
## Model df: 6. Total lags used: 14
This model looks like it is preforming well. Let’s see how all of them stack up.
Model performance
To measure and compare how well the models did, i will use the tsCV function and compare RMSE.
atm2_arima_forecast <- function(x, h) {
forecast(Arima(x, order = c(2, 0, 2), seasonal = c(0, 1, 1)), h = h)
}
stl_ets_errors <- tsCV(ts_atm2, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(ts_atm2, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(ts_atm2, hw, h = h)
adj_hw_errors <- tsCV(ts_atm2, hw, h = h, lambda = atm2_lambda)
arima_errors <- tsCV(ts_atm2, atm2_arima_forecast, h = h)
data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Box Cox Holt-Winters"),
RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>%
arrange(RMSE) %>%
kable() %>%
kable_styling()| Model | RMSE |
|---|---|
| ARIMA | 38.92183 |
| STL ARIMA | 44.02338 |
| STL ETS | 44.59259 |
| Box Cox Holt-Winters | 47.07522 |
| Holt-Winters | 58.72761 |
ARIMA perform best with lowest RMSE.
ATM #3
This model is quite different from the two proceeding cases. You can see it in the plot:
atm_df %>%
filter(ATM == "ATM3") %>%
mutate(nonzero = if_else(Cash == 0, "No", "Yes")) %>%
ggplot(aes(DATE, Cash, color = nonzero)) +
geom_point() +
ggtitle("ATM 3") +
scale_color_brewer(palette = "Set1") +
theme(axis.title = element_blank(), legend.position = "none")We can see from above that all of the cashout is zero (red line) except last three.
We will be using mean of three cashout as forecast.
ATM #4
Model Run
STL Decomposition Models
This is very similar to ATM #1.
STL & ETS
atm4_stl_ets <- ts_atm4 %>%
stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm4_stl_ets)## Warning in checkresiduals(atm4_stl_ets): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 30.61, df = 12, p-value = 0.002258
##
## Model df: 2. Total lags used: 14
STL & ARIMA
atm4_stl_arima <- ts_atm4 %>%
stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm4_stl_arima)## Warning in checkresiduals(atm4_stl_arima): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.
##
## Ljung-Box test
##
## data: Residuals from STL + ARIMA(1,0,1) with non-zero mean
## Q* = 21.78, df = 11, p-value = 0.02614
##
## Model df: 3. Total lags used: 14
Holt-Winters
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 18.279, df = 3, p-value = 0.0003852
##
## Model df: 11. Total lags used: 14
This method seems to have preformed better than any of the STL models.
Holt-Winters with Box Cox Adjustment
atm4_lambda <- BoxCox.lambda(ts_atm4)
atm4_hw_box_cox <- hw(ts_atm4, h = h, lambda = atm4_lambda)
checkresiduals(atm4_hw_box_cox)##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 18.934, df = 3, p-value = 0.0002822
##
## Model df: 11. Total lags used: 14
Model Performance
atm4_arima_forecast <- function(x, h) {
forecast(Arima(x, order = c(0, 0, 3), seasonal = c(1, 0, 0)), h = h)
}
stl_ets_errors <- tsCV(ts_atm4, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(ts_atm4, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(ts_atm4, hw, h = h)
adj_hw_errors <- tsCV(ts_atm4, hw, h = h, lambda = atm4_lambda)
arima_errors <- tsCV(ts_atm4, atm4_arima_forecast, h = h)
data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Holt-Winters Box Cox"),
RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>%
arrange(RMSE) %>%
kable() %>%
kable_styling()| Model | RMSE |
|---|---|
| ARIMA | 346.7434 |
| STL ARIMA | 353.7169 |
| STL ETS | 366.0626 |
| Holt-Winters Box Cox | 388.4804 |
| Holt-Winters | 504.2440 |
Again, ARIMA has the lowest RMSE.
Summary
Let’s get the forecast of May 2010 in a excel file as instructed in the assignment. We will use ARIMA for AMT1, ATM2 & ATM4 as it gives best performance, we will take mean as forecast for ATM3 as there are only three cashout in the time series.
dates <- seq(ymd("2010-05-01"), ymd("2010-05-31"), by = "1 day")
atm1_forecast <- forecast(atm1_arima, h = h)
atm2_forecast <- forecast(atm2_arima, h = h)
atm3_forecast <- rep(atm3_mean, h)
atm4_forecast <- forecast(atm4_arima, h = h)
df_forecast <- data.frame("DATE" = dates, "ATM" = c("ATM1"), "Cash" = c(atm1_forecast$mean))
df_forecast <- data.frame("DATE" = dates, "ATM" = c("ATM2"), "Cash" = c(atm2_forecast$mean)) %>%
rbind(df_forecast, .)
df_forecast <- data.frame("DATE" = dates, "ATM" = c("ATM3"), "Cash" = atm3_forecast) %>%
rbind(df_forecast, .)
df_forecast <- data.frame("DATE" = dates, "ATM" = c("ATM4"), "Cash" = c(atm4_forecast$mean)) %>%
rbind(df_forecast, .)
write.xlsx(df_forecast, "ATM_Forecasts.xlsx")Part B
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
Load the data
## # A tibble: 6 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
## tibble [192 x 3] (S3: tbl_df/tbl/data.frame)
## $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2020 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.
There is one missing value that we will clean in next step.
Data exploration & Cleaning
Let’s convert the data into a time series and do some basic transformation along the way
Let’s plot the time series using autoplot()
## KWH
## Min. : 770523
## 1st Qu.: 5429912
## Median : 6283324
## Mean : 6502475
## 3rd Qu.: 7620524
## Max. :10655730
## NA's :1
We have missing data and outliers. We will clean the data using `tsclean’ from forecast package.
Here is the details on `tsclean’ https://www.rdocumentation.org/packages/forecast/versions/8.12/topics/tsclean
We can see there is clear seasonality in the time series.
Intial models
We will use Holt-winters or ARIMA as the data has clear seasonality. Let’s plot some of the models.
h <- 12
lambda <- BoxCox.lambda(power_ts)
autoplot(power_ts) +
autolayer(hw(power_ts, h = h), series = "Holt-Winters") +
autolayer(hw(power_ts, h = h, lambda = lambda), series = "Holt-Winters (Box-Cox Adjusted)") +
autolayer(hw(power_ts, h = h, lambda = lambda, damped = TRUE), series = "Holt-Winters (Damped & Box-Cox Adjusted)") +
autolayer(forecast(auto.arima(power_ts), h = h), series = "ARIMA") +
facet_wrap(. ~ series, ncol = 2) +
scale_color_brewer(palette = "Set1") +
scale_y_continuous(labels = comma) +
theme(legend.position = "none", axis.title = element_blank())We can see from above plots that all models were able to account for seasonality. Let’s check the residuals to measure performance.
Holt-Winters Model
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 61.741, df = 8, p-value = 2.121e-10
##
## Model df: 16. Total lags used: 24
We can see spikes in ACF outside the bound.
Holt-Winters Model with Box Cox
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 48.973, df = 8, p-value = 6.433e-08
##
## Model df: 16. Total lags used: 24
Box cox tranformation definately improve ACF, but we can still see some spikes outside ACF bound.
Damped & Adjusted Holt-Winters Model
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' additive method
## Q* = 47.391, df = 7, p-value = 4.682e-08
##
## Model df: 17. Total lags used: 24
The performance is similar compared to last one.
Model performance
To measure and compare how well the models did, i will use the accuracy() function and compare RMSE.
## ME RMSE MAE MPE MAPE MASE
## Training set -7314.218 552020.2 409527.5 -0.7072315 6.308404 0.6866543
## ACF1
## Training set 0.006468207
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 11091.18 589792.8 446682.5 -0.5544273 6.817087 0.7489519 0.29072
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 14739.27 580346.6 448159.9 -0.4823313 6.858504 0.7514292 0.205424
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 65430.94 583270.3 439544.5 0.3231644 6.650032 0.7369837 0.2769445
ARIMA has the lowest RMSE.
Summary
power_forecast <- forecast(arima_fit, h = h)
autoplot(power_forecast) +
ggtitle("Forecasted Residential Power Consumption (KWH)") +
theme(axis.title = element_blank())## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 9433336 8688058 10178615 8293531 10573141
## Feb 2014 8570219 7785677 9354761 7370366 9770072
## Mar 2014 6615312 5830157 7400467 5414521 7816102
## Apr 2014 6005538 5196071 6815004 4767565 7243510
## May 2014 5917569 5108101 6727036 4679595 7155543
## Jun 2014 8187387 7377114 8997660 6948182 9426592
## Jul 2014 9528779 8717809 10339750 8288507 10769052
## Aug 2014 9991953 9180963 10802943 8751650 11232255
## Sep 2014 8546843 7735715 9357971 7306330 9787356
## Oct 2014 5856327 5045196 6667458 4615809 7096845
## Nov 2014 6153245 5342114 6964376 4912727 7393763
## Dec 2014 7732110 6920971 8543250 6491580 8972641
Let’s get monthly forecast for 2014 in a excel file as instructed in the assignment. We will use ARIMA as it gives best performance.
n <- max(power$CaseSequence)
dates <- seq(ymd("2014-01-01"), ymd("2014-12-31"), by = "1 month")
dates <- format((dates), "%Y-%b")
df_forecast <- data.frame( "CaseSequence" = c(n+1,n+2,n+3,n+4,n+5,n+6,n+7,n+8,n+9,n+10,n+11,n+12),`YYYY-MMM` = dates, "KWH" = c(round(power_forecast$mean)))
write.xlsx(df_forecast, "power_Forecasts.xlsx")Part C
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
Load Data
I have renamed the column name and changed the Date columns format in excel so that it’s easy to read in R.
Data exploration
Pipe1 has multiple reading within an hour, I took the mean for multiple recordings within an hour and convert both data as dataframe.
pipe1<-pipe1 %>%
mutate(Date = date(DateTime),
Hour = hour(DateTime) + 1)
pipe1 <- pipe1 %>%
group_by(Date, Hour) %>%
summarise_at(vars(WaterFlow),
list(name = mean)) %>% mutate(DateTime = ymd_h(paste(Date, Hour)))
pipe1 <- data.frame(pipe1) %>% select(DateTime, WaterFlow = name)
pipe2 <- data.frame(pipe2)## DateTime WaterFlow
## 1 2015-10-23 01:00:00 26.10280
## 2 2015-10-23 02:00:00 18.85202
## 3 2015-10-23 03:00:00 15.15857
## 4 2015-10-23 04:00:00 23.07886
## 5 2015-10-23 05:00:00 15.48219
## 6 2015-10-23 06:00:00 22.72539
## DateTime WaterFlow
## 1 2015-10-23 01:00:00 18.81079
## 2 2015-10-23 02:00:00 43.08703
## 3 2015-10-23 03:00:00 37.98770
## 4 2015-10-23 04:00:00 36.12038
## 5 2015-10-23 05:00:00 31.85126
## 6 2015-10-23 06:00:00 28.23809
Let’s join both pipe 1 and pipe2 data to get a one complete data frame with combined hourly reading.
water_df<-full_join(pipe1, pipe2, by = "DateTime", suffix = c("_1", "_2")) %>%
mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>%
mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>%
select(DateTime, WaterFlow)
head(water_df)## DateTime WaterFlow
## 1 2015-10-23 01:00:00 44.91359
## 2 2015-10-23 02:00:00 61.93904
## 3 2015-10-23 03:00:00 53.14627
## 4 2015-10-23 04:00:00 59.19923
## 5 2015-10-23 05:00:00 47.33345
## 6 2015-10-23 06:00:00 50.96348
Let’s convert the dataframe to timeseries and plot it with autoplot().
water_ts<-ts(water_df$WaterFlow, frequency = 24)
autoplot(water_ts, ylab = "WaterFlow") + labs(title = "Hourly waterflow through two pipelines")We can see the data is non-stationary. We will do box-cox transformation and use lag-1 difference.
lambda <- BoxCox.lambda(water_ts)
water_boxcox <- BoxCox(water_ts, lambda)
ggtsdisplay(diff(water_boxcox), points = FALSE,
main = "Differenced Box-Cox transformed water flow")Forecasting
We can see from above that timeseries is stationary but ACF has a big spike. There is no seasonality. I will use an ARIMA(1,1,1) model and plot the residuals.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 59.539, df = 46, p-value = 0.08677
##
## Model df: 2. Total lags used: 48
Let’s check how auto arima perform compare to ARIMA(1,1,1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)(1,0,1)[24]
## Q* = 53.956, df = 44, p-value = 0.1445
##
## Model df: 4. Total lags used: 48
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1731158 16.36306 13.35409 -28.19662 50.25857 0.7511972
## ACF1
## Training set -0.002533543
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2064065 16.31447 13.33287 -29.28277 50.64767 0.7500036
## ACF1
## Training set -0.001115005
Auto arima has slightly lower RMSE. We will move forward with auto arima and plot it.
water_forecast<-forecast(water_auto, 24*7, level = 95)
autoplot(water_forecast) +
labs(title = "WaterFlow Forecasted", x = "Day", y = "Total WaterFlow") Our model didn’t capture much variability.
let’s write the forecast in a excel file.
df_forecast <- data_frame(DateTime = max(water_df$DateTime) +
hours(1:168), WaterFlow = water_forecast$mean)## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.