Libraries

library(tidyverse)
library(ggplot2)
library(plotly)
library(kableExtra)
library(readxl)
library(forecast)
library(gridExtra)
library(scales)
library(lubridate)
library(plyr)
library(openxlsx)

Part A

ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.

Introduction

In 2012, consumers made 5.8 billion ATM withdrawals totaling $687 billion in value. The number of ATM withdrawals decreased 0.9 percent per year from 2009 to 2012. However, during the same period, the total dollar value of ATM withdrawals increased 2.0 percent, and the average withdrawal value increased from $108 to $118.

Source: https://www.atmmarketplace.com/articles/fed-study-fills-in-the-numbers-for-atm-transaction-volumes-and-values/

We can see how important it is for a bank or ATM provider to forecast cash withdrawals as closely as possible in order to optimize its business. Our task is to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data was provided in an Excel file; the variable “Cash” is in hundreds of dollars.

Load the data

atm <- read_excel("ATM624Data.xlsx")
head(atm)
## # A tibble: 6 x 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90
str(atm)
## tibble [1,474 x 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: num [1:1474] 39934 39934 39935 39935 39936 ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...

Data exploration

Let’s convert the data into a data frame and do some basic transformations along the way.

atm_df <- atm %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"), #Convert date column https://rdrr.io/r/base/as.Date.html
         ATM = as.factor(ATM)) %>%
  data.frame()
dim(atm_df)
## [1] 1474    3
head(atm_df)
##         DATE  ATM Cash
## 1 2009-05-01 ATM1   96
## 2 2009-05-01 ATM2  107
## 3 2009-05-02 ATM1   82
## 4 2009-05-02 ATM2   89
## 5 2009-05-03 ATM1   85
## 6 2009-05-03 ATM2   90
tail(atm_df)
##            DATE  ATM      Cash
## 1469 2010-04-25 ATM4 542.28060
## 1470 2010-04-26 ATM4 403.83934
## 1471 2010-04-27 ATM4  13.69733
## 1472 2010-04-28 ATM4 348.20106
## 1473 2010-04-29 ATM4  44.24535
## 1474 2010-04-30 ATM4 482.28711

Let’s drop the missing values and check the dimensions again.

dim(atm_df)
## [1] 1474    3
atm_df <- atm_df %>% drop_na()  #drop missing values
dim(atm_df)
## [1] 1455    3

So, we got rid of 19 rows of data.
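Before moving on, here is a quick sketch (using the raw atm tibble) of where those dropped rows came from; they should be a handful of missing Cash values plus placeholder rows with no ATM label:

# Sketch: which ATMs did the 19 dropped rows belong to?
atm %>%
  filter(is.na(ATM) | is.na(Cash)) %>%
  count(ATM)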

We are asked to forecast May 2010; let’s check the latest date available in the dataset for each ATM.

ddply(atm_df, "ATM", summarize, max = max(DATE))
##    ATM        max
## 1 ATM1 2010-04-30
## 2 ATM2 2010-04-30
## 3 ATM3 2010-04-30
## 4 ATM4 2010-04-30
summary(atm_df)
##       DATE              ATM           Cash        
##  Min.   :2009-05-01   ATM1:362   Min.   :    0.0  
##  1st Qu.:2009-08-01   ATM2:363   1st Qu.:    0.5  
##  Median :2009-10-31   ATM3:365   Median :   73.0  
##  Mean   :2009-10-30   ATM4:365   Mean   :  155.6  
##  3rd Qu.:2010-01-29              3rd Qu.:  114.0  
##  Max.   :2010-04-30              Max.   :10919.8

Let’s visualize the data

atm_df %>%
  ggplot(aes(DATE, Cash, color = ATM)) +
  geom_line() +
  ggtitle("Cash Withdrawal by Date and ATM") +
  scale_color_brewer(palette = "Set2") +
  scale_y_continuous(labels = comma) +
  facet_wrap(.~ATM, ncol = 2) +
  theme(legend.title = element_blank(),
        legend.position = "bottom",
        axis.title = element_blank())

atm_df_atm3 <- atm_df %>%
  filter(ATM == "ATM3", Cash !=0)
atm_df_atm3
##         DATE  ATM Cash
## 1 2010-04-28 ATM3   96
## 2 2010-04-29 ATM3   82
## 3 2010-04-30 ATM3   85

We can see from the plots above that withdrawals from ATM #1 and ATM #2 were very similar. ATM #3 was only used in the last three days. ATM #4, on the other hand, saw much higher withdrawals than the other three ATMs, including one extreme outlier spike. Let’s use a separate y-scale for each ATM (scales = “free”) and repeat the visualization so that we can get a better look despite the outlier in ATM #4.

atm_df %>%
  ggplot(aes(DATE, Cash, color = ATM)) +
  geom_line() +
  facet_wrap(.~ATM, ncol = 2, scales = "free") +
  ggtitle("Cash Withdrawal by Date and ATM") +
  scale_color_brewer(palette = "Set2") +
  theme(legend.title = element_blank(),
        legend.position = "bottom",
        axis.title = element_blank())

Let’s look at the distributions

atm_df %>%
  ggplot(aes(ATM, Cash, color = ATM)) +
  geom_boxplot() +
  ggtitle("Cash Withdrawal by ATM") +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(.~ATM, nrow = 2, scales = "free")

Let’s look at cash withdrawals by day of the week to see if there is any pattern.

atm_df_day <- atm_df %>%
  mutate(`day_of_week` = recode(as.factor(wday(DATE)), "1" = "Su", "2" = "Mo", "3" = "Tu", "4" = "We", "5" = "Th", "6" = "Fr", "7" = "Sa"))
head(atm_df_day)
##         DATE  ATM Cash day_of_week
## 1 2009-05-01 ATM1   96          Fr
## 2 2009-05-01 ATM2  107          Fr
## 3 2009-05-02 ATM1   82          Sa
## 4 2009-05-02 ATM2   89          Sa
## 5 2009-05-03 ATM1   85          Su
## 6 2009-05-03 ATM2   90          Su
ggplot(data = atm_df_day, mapping = aes(x = day_of_week, y = Cash)) +
  geom_col() +
  ggtitle("Cash Withdrawal by ATM and weekdays") +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(.~ATM, nrow = 2, scales = "free")

Source: https://stackoverflow.com/questions/9216138/find-the-day-of-a-week

ATM #1

I will use frequency = 7 since there is a weekly effect.

# data cleaning and convert ATM1 data to time series
atm1 <- atm_df %>% 
  filter(ATM == "ATM1")
atm1$clean <- c(tsclean(ts(atm1$Cash, start = decimal_date(as.Date(min(atm1$DATE))), frequency = 365)))

ts_atm1 <- ts(atm1$clean, frequency = 7)
ggtsdisplay(ts_atm1)

The ACF & PACF above show peaks repeating at every 7th lag, confirming the weekly seasonality.
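To double-check the weekly pattern, a seasonal subseries plot (a sketch; ggsubseriesplot() comes from the forecast package already loaded above) shows the mean withdrawal at each position in the weekly cycle:

# Sketch: mean cash-out by day of the weekly cycle
ggsubseriesplot(ts_atm1) +
  ggtitle("ATM1: weekly subseries plot")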

Model Run

Holt-Winters

h <- 31
atm1_hw <- hw(ts_atm1, h = h)
checkresiduals(atm1_hw)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 16.871, df = 3, p-value = 0.0007513
## 
## Model df: 11.   Total lags used: 14

The Ljung-Box test rejects the null of independent residuals (p < 0.001), so this model leaves autocorrelation in the residuals. Let’s see if it does better with a Box-Cox transformation.

Holt-Winters with Box Cox Adjustment

atm1_lambda <- BoxCox.lambda(ts_atm1)
atm1_hw_box_cox <- hw(ts_atm1, h = h, lambda = atm1_lambda)
checkresiduals(atm1_hw_box_cox)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 16.921, df = 3, p-value = 0.0007336
## 
## Model df: 11.   Total lags used: 14

STL Decomposition Models

The Box-Cox transformation barely moves the Ljung-Box statistic, so let’s try a few seasonal (STL) decomposition models instead. I will set s.window = 7 (the seasonal window) to account for day-of-week differences in withdrawals.

ts_atm1 %>%
  stl(s.window = 7, robust = TRUE) %>%
  autoplot()

STL & ETS

h <- 31
atm1_stl_ets <- ts_atm1 %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm1_stl_ets)
## Warning in checkresiduals(atm1_stl_ets): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 7.9258, df = 12, p-value = 0.7909
## 
## Model df: 2.   Total lags used: 14

STL & ARIMA

atm1_stl_arima <- ts_atm1 %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm1_stl_arima)
## Warning in checkresiduals(atm1_stl_arima): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ARIMA(0,1,2)
## Q* = 5.4248, df = 12, p-value = 0.9423
## 
## Model df: 2.   Total lags used: 14

ARIMA

atm1_arima <- auto.arima(ts_atm1)
checkresiduals(atm1_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
## 
## Model df: 2.   Total lags used: 14

This model looks like it is performing well; the Ljung-Box test no longer rejects white-noise residuals (p = 0.51). Let’s see how all of the models stack up.

Model performance

To measure and compare how well the models did, I will use the tsCV() function and compare RMSE.

get_rmse <- function(e) {
  sqrt(mean(e^2, na.rm = TRUE))
}
atm1_arima_forecast <- function(x, h) {
  # match the auto.arima() spec above: ARIMA(0,0,1)(0,1,1)[7]
  forecast(Arima(x, order = c(0, 0, 1), seasonal = c(0, 1, 1)), h = h)
}
stl_ets_errors <- tsCV(ts_atm1, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(ts_atm1, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(ts_atm1, hw, h = h)
boxcox_hw_errors <- tsCV(ts_atm1, hw, h = h, lambda = atm1_lambda)
arima_errors <- tsCV(ts_atm1, atm1_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Box Cox Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(boxcox_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
Model RMSE
ARIMA 35.67540
STL ARIMA 37.99406
STL ETS 38.75113
Box Cox Holt-Winters 45.01814
Holt-Winters 46.03976

The ARIMA model has the lowest cross-validated RMSE.
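One caveat: the table above scores only the 31-step-ahead errors (column h of each tsCV error matrix). If we instead wanted the RMSE averaged over all 1- to 31-step horizons, it is a one-line change (a sketch):

# Sketch: average CV RMSE over all horizons, not just the 31st column
get_rmse(arima_errors)  # the full error matrix, instead of arima_errors[, h]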

autoplot(forecast(atm1_arima, h=31))

ATM #2

Let’s repeat the above process for ATM #2. I will again use frequency = 7 for the weekly effect.

# data cleaning and convert ATM2 data to time series
atm2 <- atm_df %>% 
  filter(ATM == "ATM2")
atm2$clean <- c(tsclean(ts(atm2$Cash, start = decimal_date(as.Date(min(atm2$DATE))), frequency = 365)))

ts_atm2 <- ts(atm2$clean, frequency = 7)
ggtsdisplay(ts_atm2)

Again, the ACF & PACF show peaks repeating at every 7th lag.

Model Run

Holt-Winters

atm2_hw <- hw(ts_atm2, h = h)
checkresiduals(atm2_hw)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 36.563, df = 3, p-value = 5.693e-08
## 
## Model df: 11.   Total lags used: 14

Holt-Winters with Box Cox

atm2_lambda <- BoxCox.lambda(ts_atm2)
atm2_hw_cox_box <- hw(ts_atm2, h = h, lambda = atm2_lambda)
checkresiduals(atm2_hw_cox_box)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 34.635, df = 3, p-value = 1.455e-07
## 
## Model df: 11.   Total lags used: 14

STL Decomposition Models

ts_atm2 %>%
  stl(s.window = 7, robust = TRUE) %>%
  autoplot()

This is very similar to ATM #1.

STL & ETS

atm2_stl_ets <- ts_atm2 %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm2_stl_ets)
## Warning in checkresiduals(atm2_stl_ets): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 14.34, df = 12, p-value = 0.2795
## 
## Model df: 2.   Total lags used: 14

The residuals here pass the Ljung-Box test (p = 0.28), a big improvement over the two Holt-Winters fits above.

STL & ARIMA

atm2_stl_arima <- ts_atm2 %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm2_stl_arima)
## Warning in checkresiduals(atm2_stl_arima): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ARIMA(0,1,1)
## Q* = 13.983, df = 13, p-value = 0.375
## 
## Model df: 1.   Total lags used: 14

ARIMA

atm2_arima <- auto.arima(ts_atm2)
checkresiduals(atm2_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,2)[7]
## Q* = 12.12, df = 8, p-value = 0.1459
## 
## Model df: 6.   Total lags used: 14

This model looks like it is performing well (p = 0.15). Let’s see how all of the models stack up.

Model performance

To measure and compare how well the models did, I will use the tsCV() function and compare RMSE.

atm2_arima_forecast <- function(x, h) {
  # match the auto.arima() spec above: ARIMA(2,0,2)(0,1,2)[7]
  forecast(Arima(x, order = c(2, 0, 2), seasonal = c(0, 1, 2)), h = h)
}
stl_ets_errors <- tsCV(ts_atm2, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(ts_atm2, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(ts_atm2, hw, h = h)
adj_hw_errors <- tsCV(ts_atm2, hw, h = h, lambda = atm2_lambda)
arima_errors <- tsCV(ts_atm2, atm2_arima_forecast, h = h)
data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Box Cox Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
Model RMSE
ARIMA 38.92183
STL ARIMA 44.02338
STL ETS 44.59259
Box Cox Holt-Winters 47.07522
Holt-Winters 58.72761

ARIMA again performs best, with the lowest cross-validated RMSE.

autoplot(forecast(atm2_arima, h=31))

ATM #3

This series is quite different from the two preceding cases. You can see it in the plot:

atm_df %>%
  filter(ATM == "ATM3") %>%
  mutate(nonzero = if_else(Cash == 0, "No", "Yes")) %>%
  ggplot(aes(DATE, Cash, color = nonzero)) +
  geom_point() +
  ggtitle("ATM 3") +
  scale_color_brewer(palette = "Set1") +
  theme(axis.title = element_blank(), legend.position = "none")

We can see from above that all of the withdrawals are zero (red points) except the last three days.

We will use the mean of those three withdrawals as the forecast.

atm3 <- atm_df %>%
  filter(ATM == "ATM3", Cash > 0)
atm3_mean <- mean(atm3$Cash)
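With only three nonzero observations there is not enough history to fit a model. For reference, a sketch using forecast::meanf(), which returns the same point forecast plus (necessarily very wide) prediction intervals:

# Sketch: mean forecast with prediction intervals for ATM3
atm3_meanf <- meanf(atm3$Cash, h = 31)
atm3_meanf$mean[1]  # identical to atm3_mean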

ATM #4

# data cleaning and convert ATM4 data to time series
atm4 <- atm_df %>%
  filter(ATM == "ATM4")
atm4$clean <- c(tsclean(ts(atm4$Cash, start = decimal_date(as.Date(min(atm4$DATE))), frequency = 365)))

ts_atm4 <- ts(atm4$clean, frequency = 7)

Model Run

STL Decomposition Models

ts_atm4 %>%
  stl(s.window = 7, robust = TRUE) %>%
  autoplot()

This is very similar to ATM #1.

STL & ETS

atm4_stl_ets <- ts_atm4 %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm4_stl_ets)
## Warning in checkresiduals(atm4_stl_ets): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 30.61, df = 12, p-value = 0.002258
## 
## Model df: 2.   Total lags used: 14

STL & ARIMA

atm4_stl_arima <- ts_atm4 %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm4_stl_arima)
## Warning in checkresiduals(atm4_stl_arima): The fitted degrees of freedom is
## based on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ARIMA(1,0,1) with non-zero mean
## Q* = 21.78, df = 11, p-value = 0.02614
## 
## Model df: 3.   Total lags used: 14

Holt-Winters

atm4_hw <- hw(ts_atm4, h = h)
checkresiduals(atm4_hw)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 18.279, df = 3, p-value = 0.0003852
## 
## Model df: 11.   Total lags used: 14

Judging by the Ljung-Box p-value, this method actually leaves more residual autocorrelation than the STL models did; we will let cross-validated RMSE settle the comparison below.

Holt-Winters with Box Cox Adjustment

atm4_lambda <- BoxCox.lambda(ts_atm4)
atm4_hw_box_cox <- hw(ts_atm4, h = h, lambda = atm4_lambda)
checkresiduals(atm4_hw_box_cox)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 18.934, df = 3, p-value = 0.0002822
## 
## Model df: 11.   Total lags used: 14

ARIMA

atm4_arima <- auto.arima(ts_atm4)
checkresiduals(atm4_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,3)(1,0,0)[7] with non-zero mean
## Q* = 16.96, df = 9, p-value = 0.04934
## 
## Model df: 5.   Total lags used: 14

Model performance

atm4_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(0, 0, 3), seasonal = c(1, 0, 0)), h = h)
}
stl_ets_errors <- tsCV(ts_atm4, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(ts_atm4, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(ts_atm4, hw, h = h)
adj_hw_errors <- tsCV(ts_atm4, hw, h = h, lambda = atm4_lambda)
arima_errors <- tsCV(ts_atm4, atm4_arima_forecast, h = h)
data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Holt-Winters Box Cox"),
           RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
Model RMSE
ARIMA 346.7434
STL ARIMA 353.7169
STL ETS 366.0626
Holt-Winters Box Cox 388.4804
Holt-Winters 504.2440

Again, ARIMA has the lowest RMSE.

autoplot(forecast(atm4_arima, h=31))

Summary

Let’s write the May 2010 forecasts to an Excel file as instructed in the assignment. We will use the ARIMA models for ATM1, ATM2 & ATM4 since they give the best performance, and the mean as the forecast for ATM3 since there are only three nonzero withdrawals in its series.

dates <- seq(ymd("2010-05-01"), ymd("2010-05-31"), by = "1 day")
atm1_forecast <- forecast(atm1_arima, h = h)
atm2_forecast <- forecast(atm2_arima, h = h)
atm3_forecast <- rep(atm3_mean, h)
atm4_forecast <- forecast(atm4_arima, h = h)


df_forecast <- data.frame("DATE" = dates, "ATM" = c("ATM1"), "Cash" = c(atm1_forecast$mean))
df_forecast <- data.frame("DATE" = dates, "ATM" = c("ATM2"), "Cash" = c(atm2_forecast$mean)) %>%
  rbind(df_forecast, .)
df_forecast <- data.frame("DATE" = dates, "ATM" = c("ATM3"), "Cash" = atm3_forecast) %>%
  rbind(df_forecast, .)
df_forecast <- data.frame("DATE" = dates, "ATM" = c("ATM4"), "Cash" = c(atm4_forecast$mean)) %>%
  rbind(df_forecast, .)
write.xlsx(df_forecast, "ATM_Forecasts.xlsx")

Part B

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Load the data

power <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(power)
## # A tibble: 6 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147
str(power)
## tibble [192 x 3] (S3: tbl_df/tbl/data.frame)
##  $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY-MMM    : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2020 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(power)
## Warning: Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.

There is one missing value, which we will handle in the next step.
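A quick sketch to locate the missing observation before we handle it:

# Sketch: find the row with the missing KWH reading
power %>% filter(is.na(KWH))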

Data exploration & Cleaning

Let’s convert the data into a time series and do some basic transformations along the way.

power_ts <- power %>%
  select(KWH) %>%
  ts(start = decimal_date(date("1998-01-01")), frequency = 12)

Let’s plot the time series using autoplot()

autoplot(power_ts, ylab = "KWH") +
  scale_y_continuous(labels = comma)

summary(power_ts)
##       KWH          
##  Min.   :  770523  
##  1st Qu.: 5429912  
##  Median : 6283324  
##  Mean   : 6502475  
##  3rd Qu.: 7620524  
##  Max.   :10655730  
##  NA's   :1

We have a missing value and outliers. We will clean the data using tsclean() from the forecast package.

Details on tsclean(): https://www.rdocumentation.org/packages/forecast/versions/8.12/topics/tsclean
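Before overwriting the series, a quick sketch of how much tsclean() actually changes here:

# Sketch: count interpolated NAs and replaced outliers
power_clean <- tsclean(power_ts)
sum(is.na(power_ts))                        # missing values tsclean() will interpolate
sum(power_clean != power_ts, na.rm = TRUE)  # outliers it will replace (NA position excluded)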

power_ts <- tsclean(power_ts)
# let's plot it again
ggtsdisplay(power_ts)

We can see there is clear seasonality in the time series.
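A season plot makes the pattern explicit, with one line per year (a sketch; the series is converted to a plain univariate ts first so that ggseasonplot() accepts it):

# Sketch: season plot of monthly power usage
power_uts <- ts(as.numeric(power_ts), start = c(1998, 1), frequency = 12)
ggseasonplot(power_uts, year.labels = TRUE) +
  ggtitle("Seasonal plot: residential power usage (KWH)")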

Initial models

We will use Holt-Winters or ARIMA models since the data has clear seasonality. Let’s plot some of the candidate forecasts.

h <- 12
lambda <- BoxCox.lambda(power_ts)
autoplot(power_ts) +
  autolayer(hw(power_ts, h = h), series = "Holt-Winters") +
  autolayer(hw(power_ts, h = h, lambda = lambda), series = "Holt-Winters (Box-Cox Adjusted)") +
  autolayer(hw(power_ts, h = h, lambda = lambda, damped = TRUE), series = "Holt-Winters (Damped & Box-Cox Adjusted)") +
  autolayer(forecast(auto.arima(power_ts), h = h), series = "ARIMA") +
  facet_wrap(. ~ series, ncol = 2) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = comma) +
  theme(legend.position = "none", axis.title = element_blank())

We can see from above plots that all models were able to account for seasonality. Let’s check the residuals to measure performance.

Holt-Winters Model

hw_fit <- hw(power_ts, h = h)  # renamed so the object does not mask the hw() function
checkresiduals(hw_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 61.741, df = 8, p-value = 2.121e-10
## 
## Model df: 16.   Total lags used: 24

We can see spikes in the ACF outside the significance bounds.

Holt-Winters Model with Box Cox

hw_box_cox <- hw(power_ts, h = h, lambda = lambda)
checkresiduals(hw_box_cox)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 48.973, df = 8, p-value = 6.433e-08
## 
## Model df: 16.   Total lags used: 24

The Box-Cox transformation definitely improves the ACF, but we can still see some spikes outside the bounds.

Damped & Adjusted Holt-Winters Model

damp_adj_hw <- hw(power_ts, h = h, lambda = lambda, damped = TRUE)
checkresiduals(damp_adj_hw)

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt-Winters' additive method
## Q* = 47.391, df = 7, p-value = 4.682e-08
## 
## Model df: 17.   Total lags used: 24

The performance is similar to the previous model.

ARIMA Model

arima_fit <- auto.arima(power_ts)
checkresiduals(arima_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,1)(2,1,0)[12] with drift
## Q* = 11.549, df = 17, p-value = 0.8266
## 
## Model df: 7.   Total lags used: 24

This seems like the best model so far.

Model performance

To measure and compare how well the models did, I will use the accuracy() function and compare training RMSE.

accuracy(forecast(arima_fit, h=12))
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -7314.218 552020.2 409527.5 -0.7072315 6.308404 0.6866543
##                     ACF1
## Training set 0.006468207
accuracy(hw_fit)  # training-set accuracy; h is irrelevant without a test set
##                    ME     RMSE      MAE        MPE     MAPE      MASE    ACF1
## Training set 11091.18 589792.8 446682.5 -0.5544273 6.817087 0.7489519 0.29072
accuracy(hw_box_cox)
##                    ME     RMSE      MAE        MPE     MAPE      MASE     ACF1
## Training set 14739.27 580346.6 448159.9 -0.4823313 6.858504 0.7514292 0.205424
accuracy(damp_adj_hw)
##                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 65430.94 583270.3 439544.5 0.3231644 6.650032 0.7369837 0.2769445

ARIMA has the lowest training RMSE.
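A caveat: these are training-set errors, which can flatter more flexible models. A fairer check would be time series cross validation in the style of Part A; here is a sketch that refits the ARIMA(3,0,1)(2,1,0)[12]-with-drift specification at each origin (slow to run):

# Sketch: cross-validated 12-step RMSE for the selected ARIMA spec
power_uts <- ts(as.numeric(power_ts), start = c(1998, 1), frequency = 12)
power_arima_fc <- function(x, h) {
  forecast(Arima(x, order = c(3, 0, 1), seasonal = c(2, 1, 0), include.drift = TRUE), h = h)
}
power_cv_errors <- tsCV(power_uts, power_arima_fc, h = 12)
sqrt(mean(power_cv_errors[, 12]^2, na.rm = TRUE))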

Summary

power_forecast <- forecast(arima_fit, h = h)
autoplot(power_forecast) +
  ggtitle("Forecasted Residential Power Consumption (KWH)") +
  theme(axis.title = element_blank())

power_forecast
##          Point Forecast   Lo 80    Hi 80   Lo 95    Hi 95
## Jan 2014        9433336 8688058 10178615 8293531 10573141
## Feb 2014        8570219 7785677  9354761 7370366  9770072
## Mar 2014        6615312 5830157  7400467 5414521  7816102
## Apr 2014        6005538 5196071  6815004 4767565  7243510
## May 2014        5917569 5108101  6727036 4679595  7155543
## Jun 2014        8187387 7377114  8997660 6948182  9426592
## Jul 2014        9528779 8717809 10339750 8288507 10769052
## Aug 2014        9991953 9180963 10802943 8751650 11232255
## Sep 2014        8546843 7735715  9357971 7306330  9787356
## Oct 2014        5856327 5045196  6667458 4615809  7096845
## Nov 2014        6153245 5342114  6964376 4912727  7393763
## Dec 2014        7732110 6920971  8543250 6491580  8972641

Let’s write the monthly forecast for 2014 to an Excel file as instructed in the assignment. We will use the ARIMA model since it gives the best performance.

n <- max(power$CaseSequence)

dates <- seq(ymd("2014-01-01"), ymd("2014-12-31"), by = "1 month")

dates <- format(dates, "%Y-%b")

df_forecast <- data.frame( "CaseSequence" = c(n+1,n+2,n+3,n+4,n+5,n+6,n+7,n+8,n+9,n+10,n+11,n+12),`YYYY-MMM` = dates, "KWH" = c(round(power_forecast$mean)))
write.xlsx(df_forecast, "power_Forecasts.xlsx")

Part C

Part C consists of two data sets. These are simple 2-column sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-ahead forecast and present results via RPubs and .rmd and the forecast in an Excel-readable file.

Load Data

I renamed the columns and changed the date column’s format in Excel so that the files are easy to read into R.

pipe1<-read_excel("Waterflow_Pipe1.xlsx",col_types =c("date", "numeric"))
pipe2<-read_excel("Waterflow_Pipe2.xlsx",col_types =c("date", "numeric"))

colnames(pipe1)= c("DateTime","WaterFlow")
colnames(pipe2)= c("DateTime","WaterFlow")

Data exploration

Pipe 1 has multiple readings within some hours, so I took the mean of the readings within each hour and converted both datasets to data frames.

pipe1<-pipe1 %>% 
  mutate(Date = date(DateTime),
         Hour = hour(DateTime) + 1) 

pipe1 <- pipe1 %>%                                       
  group_by(Date, Hour) %>%                         
  summarise_at(vars(WaterFlow),
               list(name = mean))  %>% mutate(DateTime = ymd_h(paste(Date, Hour)))        

pipe1 <- data.frame(pipe1) %>% select(DateTime, WaterFlow = name)
pipe2 <- data.frame(pipe2)
head(pipe1)
##              DateTime WaterFlow
## 1 2015-10-23 01:00:00  26.10280
## 2 2015-10-23 02:00:00  18.85202
## 3 2015-10-23 03:00:00  15.15857
## 4 2015-10-23 04:00:00  23.07886
## 5 2015-10-23 05:00:00  15.48219
## 6 2015-10-23 06:00:00  22.72539
head(pipe2)
##              DateTime WaterFlow
## 1 2015-10-23 01:00:00  18.81079
## 2 2015-10-23 02:00:00  43.08703
## 3 2015-10-23 03:00:00  37.98770
## 4 2015-10-23 04:00:00  36.12038
## 5 2015-10-23 05:00:00  31.85126
## 6 2015-10-23 06:00:00  28.23809

Let’s join the pipe 1 and pipe 2 data to get one complete data frame with combined hourly readings.

water_df<-full_join(pipe1, pipe2, by = "DateTime", suffix = c("_1", "_2")) %>% 
  mutate(WaterFlow_1 = ifelse(is.na(WaterFlow_1), 0, WaterFlow_1)) %>% 
  mutate(WaterFlow = WaterFlow_1 + WaterFlow_2) %>% 
  select(DateTime, WaterFlow)

head(water_df)
##              DateTime WaterFlow
## 1 2015-10-23 01:00:00  44.91359
## 2 2015-10-23 02:00:00  61.93904
## 3 2015-10-23 03:00:00  53.14627
## 4 2015-10-23 04:00:00  59.19923
## 5 2015-10-23 05:00:00  47.33345
## 6 2015-10-23 06:00:00  50.96348
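Before converting to a time series, a quick sanity check (sketch) that the join produced one complete hourly record per timestamp, with no NAs left over:

# Sketch: completeness checks on the combined hourly series
sum(is.na(water_df$WaterFlow))                   # expect 0
n_distinct(water_df$DateTime) == nrow(water_df)  # expect TRUE: one row per hour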

Let’s convert the data frame to a time series and plot it with autoplot().

water_ts<-ts(water_df$WaterFlow, frequency = 24)
autoplot(water_ts, ylab = "WaterFlow") + labs(title = "Hourly waterflow through two pipelines")

ggtsdisplay(water_ts)

We can see the data is non-stationary. We will apply a Box-Cox transformation and take a lag-1 difference.

lambda <- BoxCox.lambda(water_ts)
water_boxcox <- BoxCox(water_ts, lambda)

ggtsdisplay(diff(water_boxcox), points = FALSE,
            main = "Differenced Box-Cox transformed water flow")

Forecasting

We can see from above that the differenced series is stationary, but the ACF has a large spike at lag 1 and there is no seasonality. I will fit an ARIMA(1,1,1) model and check the residuals.

water_fit <- Arima(water_ts, order = c(1, 1, 1), lambda = lambda)
checkresiduals(water_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 59.539, df = 46, p-value = 0.08677
## 
## Model df: 2.   Total lags used: 48

Let’s check how auto.arima() performs compared to the ARIMA(1,1,1).

water_auto <- auto.arima(water_ts)
checkresiduals(water_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)(1,0,1)[24]
## Q* = 53.956, df = 44, p-value = 0.1445
## 
## Model df: 4.   Total lags used: 48
accuracy(forecast(water_fit, h=24*7))
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.1731158 16.36306 13.35409 -28.19662 50.25857 0.7511972
##                      ACF1
## Training set -0.002533543
accuracy(forecast(water_auto, h = 24 * 7))
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.2064065 16.31447 13.33287 -29.28277 50.64767 0.7500036
##                      ACF1
## Training set -0.001115005

The auto.arima() model has a slightly lower training RMSE. We will move forward with it and plot the forecast.

water_forecast<-forecast(water_auto, 24*7, level = 95)

autoplot(water_forecast) + 
    labs(title = "WaterFlow Forecasted", x = "Day", y = "Total WaterFlow") 

Our model didn’t capture much of the variability; the point forecast settles quickly around the series mean.
Let’s write the forecast to an Excel file.

df_forecast <- tibble(DateTime = max(water_df$DateTime) + hours(1:168),
                      WaterFlow = as.numeric(water_forecast$mean))
write.xlsx(df_forecast, "water_Forecasts.xlsx")