Load packages

library(zoo)
library(forecast)
library(readxl)
library(tidyverse)
library(imputeTS)
library(openxlsx)
library(fpp3)
library(lubridate)

Part A - ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Read in the Data

After reading in the data, I can see the date has imported as a double. I converted this date back to date format in R.

ATM624Data <- 
  read_excel("C:\\Users\\Kim\\Documents\\Data624\\ATM624Data.xlsx")
ATM624Data <- ATM624Data %>% 
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"))

Separate the Data

Before beginning this assignment, I took a look at the data briefly in Excel. I can see that there are duplicate blank entries that do not appear to correspond to any of the ATMs. All four ATMs have datasets that match up to the full year, which can be shown by filtering the data on ATM type and looking at the number of entries (365 entries for 365 days). As such, I will exclude the dates with blank ATM ID as they appear to be in the data erroneously. This exclusion is simply done by separating the datasets into their respective ATM identities using the filter function.

ATM1 <- ATM624Data %>% filter(ATM == "ATM1")
ATM2 <- ATM624Data %>% filter(ATM == "ATM2")
ATM3 <- ATM624Data %>% filter(ATM == "ATM3")
ATM4 <- ATM624Data %>% filter(ATM == "ATM4")

ATM1 Analysis

Convert to tsibble and impute NA values

Next, I will focus on the ATM1 dataset. I turned ATM1 into a tsibble and autoplot it to see the general shape of the data. I can see there are missing values, and confirm this using is.na. Since I can see a seasonal pattern, albeit irregular (i.e. the general up and down behavior appears seasonal, but with varying magnitude), I believe imputing the NA values based on linear interpolation makes the most sense. I can do this through the na_interpolation() function in the imputeTS package.

ATM1 <- ATM1 %>% as_tsibble(index = DATE)
ATM1 %>% autoplot(Cash)

any(is.na(ATM1$Cash))

## [1] TRUE

ATM1$Cash <-na_interpolation(ATM1$Cash, option = "linear")
ATM1 %>% autoplot(Cash)

I looked at both Exponential Smoothing and ARIMA as potential forecasting models for all four datasets. I forecasted 31 days ahead to predict ATM withdrawals in the month of May 2010.

ETS model on the non-transformed dataset

An ETS(A,N,A) model was selected and used for the forecast. Looking at the residuals, the variance has irregular spikes and the ACF shows autocorrelations above the critical value at 3 lags. The p-value resulting from the ljung_box test is under 0.05, and therefore the model is not a good fit the time series is not white noise.

fit_ETS_ATM1 <- ATM1 %>% model(ETS(Cash))
report(fit_ETS_ATM1)

## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.02126635 
##     gamma = 0.3298864 
## 
##   Initial states:
##      l[0]      s[0]    s[-1]    s[-2]    s[-3]    s[-4]    s[-5]   s[-6]
##  77.02956 -55.27235 2.531658 16.15413 4.529656 11.40245 12.92571 7.72875
## 
##   sigma^2:  580.2586
## 
##      AIC     AICc      BIC 
## 4487.018 4487.639 4526.017

fc_ETS_ATM1 <- forecast(fit_ETS_ATM1, h = 31)
fc_ETS_ATM1  %>% autoplot() + autolayer(ATM1)

gg_tsresiduals(fit_ETS_ATM1)

augment(fit_ETS_ATM1) %>% 
  features(.innov, ljung_box, lag = 14)

## # A tibble: 1 × 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 ETS(Cash)    28.9    0.0108

ARIMA model on the non-transformed dataset

For ARIMA, I first performed a unitroot test using the features(,unitroot_nsdiffs) function to see if the model requires seasonal differencing given its strong seasonal pattern. Since ARIMA is performed on stationary data and seasonal data is not stationary, I expect seasonal differencing to be required. This is confirmed, where one order of seasonal differencing is returned by the function. Afterwards, I seasonally difference the data and then use the features(,unitroot_ndiffs) to determine if non-seasonal differencing is also required. Since it returns 0, I then plot the ACF and PACF to determine if the data is stationary. The timeplot appears roughly horizontal with relatively constant variance although there are a few spikes where variance is larger. The ACF is not large and positive, and does not show the characteristic slow decrease in non-stationary data. However, there are spikes in the PACF throughout the plot. Using the KPSS unit test, I can see that the data is stationary after differencing because the p-value is greater than 0.05, which means I do not reject the null hypothesis that the data is stationary. As such, I can fit an ARIMA model to the non-transformed dataset.

ATM1 %>% features(Cash, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

ATM1_nsdiff <- ATM1 %>% mutate(Cash_nsdiff = difference(Cash, 7))
ATM1_nsdiff %>% features(Cash_nsdiff, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

ATM1_nsdiff %>% gg_tsdisplay(Cash_nsdiff,plot_type = "partial")

ATM1_nsdiff %>% features(Cash_nsdiff, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0220         0.1

I looked at the ARIMA model without transformation first, using the ARIMA function with stepwise = FALSE and approximation = FALSE. Although this is more computationally intensive, the dataset is small and the function runs within 30 seconds. The potential for an improved model outweighs the disadvantages in this case. An ARIMA(0,0,1)(1,1,1)[7] model is fit, and this model is used for forecasting. After forecasting, I can see that the variance of the residuals is not very constant, but the autocorrelations in the ACF plot are almost all within the bounds of the critical values except for one. The distribution of the residuals also appear approximately normal. Performing the ljung-box test, I can determine that this is a white noise time series as the p-value is above 0.05, and this model can be considered a good fit.

fit_ARIMA_ATM1 <- ATM1 %>% 
  model(ARIMA(Cash, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM1)

## Series: Cash 
## Model: ARIMA(0,0,1)(1,1,1)[7] 
## 
## Coefficients:
##          ma1    sar1     sma1
##       0.1968  0.1848  -0.7537
## s.e.  0.0546  0.0793   0.0552
## 
## sigma^2 estimated as 555.1:  log likelihood=-1639.63
## AIC=3287.26   AICc=3287.37   BIC=3302.78

fc_ARIMA_ATM1 <- forecast(fit_ARIMA_ATM1, h = 31)
fc_ARIMA_ATM1  %>% autoplot() + autolayer(ATM1)

gg_tsresiduals(fit_ARIMA_ATM1)

augment(fit_ARIMA_ATM1) %>% 
  features(.innov, ljung_box, lag = 14, dof = 3)

## # A tibble: 1 × 3
##   .model                                               lb_stat lb_pvalue
##   <chr>                                                  <dbl>     <dbl>
## 1 ARIMA(Cash, stepwise = FALSE, approximation = FALSE)    15.1     0.178

Box-Cox transformation

However, given the ACF and PACF plots of the seasonally differenced data, I also explored a Box-Cox transformation on the data before modeling to see if the prerequisites for stationary data could be better met. I perform a Box-Cox transformation on the data using the features(, features = guerrero) function to find the optimal lambda for transformation. I then use that lambda to transform the data.

lambda_guerrero <- ATM1 %>% 
  features(Cash, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
ATM1_box_cox <- ATM1 %>% 
  mutate(box_cox = box_cox(Cash,lambda_guerrero)) %>% 
  select(box_cox)

ETS model on the transformed dataset

I attempted an exponential smoothing model again first, where an ETS(A,N,A) model was selected again. Forecasting with the Box-Cox transformed data, I can see in the residuals that the variance is not constant, though this model is a better fit given fewer spikes in the ACF plot that extend beyond the critical values. Additionally, performing the ljung-box test now returns a p-value greater than 0.05, indicating a white noise time series.

fit_ETS_ATM1_box_cox <- ATM1_box_cox %>% model(ETS(box_cox))
report(fit_ETS_ATM1_box_cox)

## Series: box_cox 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000125 
##     gamma = 0.3521274 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]     s[-3]    s[-4]     s[-5]     s[-6]
##  7.943327 -4.537357 0.6094231 1.115404 0.5366943 1.005215 0.5374259 0.7331945
## 
##   sigma^2:  1.8037
## 
##      AIC     AICc      BIC 
## 2379.638 2380.260 2418.637

fc_ETS_ATM1_box_cox <- forecast(fit_ETS_ATM1_box_cox, h = 31)
fc_ETS_ATM1_box_cox  %>% autoplot() + autolayer(ATM1_box_cox)

gg_tsresiduals(fit_ETS_ATM1_box_cox)

augment(fit_ETS_ATM1_box_cox) %>% 
  features(.innov, ljung_box, lag = 14)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 ETS(box_cox)    19.1     0.160

ARIMA model on the transformed dataset

For performing ARIMA on the Box-Cox transformed data, I follow the same process from the above section “ARIMA model on the non-transformed dataset”. Looking at the ACF and PACF plots, I can see the PACF plot shows fewer spikes throughout the plot. The KPSS is greater than 0.05 so the data is determined to be stationary.

ATM1_box_cox %>% features(box_cox, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

ATM1_box_cox_nsdiff <- ATM1_box_cox %>% 
  mutate(box_cox_Cash_nsdiff = difference(box_cox, 7))
ATM1_box_cox_nsdiff %>% 
  features(box_cox_Cash_nsdiff, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

ATM1_box_cox_nsdiff %>% 
  gg_tsdisplay(box_cox_Cash_nsdiff,plot_type = "partial")

ATM1_box_cox_nsdiff %>% 
  features(box_cox_Cash_nsdiff, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0153         0.1

When fitting an ARIMA model to the Box-Cox transformed data, an ARIMA(0,0,2)(0,1,1)[7] model was selected. Looking at the residuals, I can see a wide range of variance, though all autocorrelations are within the bounds of the critical values and the distribution of residuals is approximately normal. The p-value from the ljung-box test is 0.543, indicating a good fit and white noise time series.

fit_ARIMA_ATM1_box_cox <- ATM1_box_cox %>% model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM1_box_cox)

## Series: box_cox 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1126  -0.1094  -0.6418
## s.e.  0.0524   0.0520   0.0432
## 
## sigma^2 estimated as 1.764:  log likelihood=-609.99
## AIC=1227.98   AICc=1228.09   BIC=1243.5

fc_ARIMA_ATM1_box_cox <- forecast(fit_ARIMA_ATM1_box_cox, h = 31)
fc_ARIMA_ATM1_box_cox  %>% autoplot() + autolayer(ATM1_box_cox)

gg_tsresiduals(fit_ARIMA_ATM1_box_cox)

augment(fit_ARIMA_ATM1_box_cox) %>% features(.innov, ljung_box, lag = 14, dof = 3)

## # A tibble: 1 × 3
##   .model                                                  lb_stat lb_pvalue
##   <chr>                                                     <dbl>     <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE)    9.86     0.543

Comparison of models to determine best fit

To determine the best model between the three “good fit” models, ARIMA(0,0,1)(1,1,1)[7] on the non-transformed data, ETS(A,N,A) on the Box-Cox transformed data, and ARIMA(0,0,2)(0,1,1)[7] on the Box-Cox transformed data, I split the data into a training set and test set. The training set is comprised of May 2009 through March 2010. Note that this has to be done separately for the Box-Cox transformed data and the non-transformed data.

ATM1_train <- ATM1 %>% filter(DATE < "2010-04-01")
ATM1_box_cox_train <- ATM1_box_cox %>% filter(DATE < "2010-04-01")

ARIMA(0,0,1)(1,1,1)[7] on the non-transformed data:

fit_ARIMA_ATM1_train <- ATM1_train %>% 
  model(ARIMA(Cash ~ pdq(0,0,1) + PDQ(1,1,1)))
report(fit_ARIMA_ATM1_train)

## Series: Cash 
## Model: ARIMA(0,0,1)(1,1,1)[7] 
## 
## Coefficients:
##          ma1    sar1     sma1
##       0.2026  0.2012  -0.7716
## s.e.  0.0578  0.0872   0.0639
## 
## sigma^2 estimated as 595.8:  log likelihood=-1514
## AIC=3035.99   AICc=3036.11   BIC=3051.16

ETS(A,N,A) on the transformed data:

fit_ETS_ATM1_box_cox_train <- ATM1_box_cox_train %>% 
  model(ETS(box_cox ~ error("A")+trend("N")+season("A")))
report(fit_ETS_ATM1_box_cox_train)

## Series: box_cox 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000537 
##     gamma = 0.3433203 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]     s[-2]     s[-3]     s[-4]    s[-5]     s[-6]
##  7.867257 -4.531842 0.4201177 0.8977109 0.6534645 0.9420963 1.005517 0.6129352
## 
##   sigma^2:  1.9461
## 
##      AIC     AICc      BIC 
## 2181.670 2182.349 2219.811

ARIMA(0,0,2)(0,1,1)[7] on the transformed data:

fit_ARIMA_ATM1_box_cox_train <- ATM1_box_cox_train %>% 
  model(ARIMA(box_cox ~ pdq(0,0,2) + PDQ(0,1,1)))
report(fit_ARIMA_ATM1_box_cox_train)

## Series: box_cox 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1139  -0.1197  -0.6512
## s.e.  0.0547   0.0549   0.0473
## 
## sigma^2 estimated as 1.899:  log likelihood=-571.03
## AIC=1150.05   AICc=1150.18   BIC=1165.22

After creating the models, I can create the forecasts for 30 days and then assess the accuracy based on the test data. In this case, since models are being compared among different “starting” datasets (i.e. transformed and non-transformed), the RMSSE should be used as a comparison. The RMSSE takes into account the scale of the data, which makes it a better for comparison. Looking back at the residuals, all models exhibit heteroscedasticity although it is most prominent in the Box-Cox transformed models. However, the ARIMA(0,0,2)(0,1,1)[7] on the transformed data has the lowest RMSSE, so ultimately it is selected as the best-fit model.

bind_rows(
  fit_ARIMA_ATM1_train %>% forecast(h = 30) %>% 
    accuracy(ATM1) %>% select(.model, MASE,RMSSE),
  fit_ETS_ATM1_box_cox_train %>% forecast(h = 30) %>% 
    accuracy(ATM1_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ATM1_box_cox_train %>% forecast(h = 30) %>% 
    accuracy(ATM1_box_cox) %>% select(.model, MASE,RMSSE),
)

## # A tibble: 3 × 3
##   .model                                                        MASE RMSSE
##   <chr>                                                        <dbl> <dbl>
## 1 "ARIMA(Cash ~ pdq(0, 0, 1) + PDQ(1, 1, 1))"                  0.527 0.448
## 2 "ETS(box_cox ~ error(\"A\") + trend(\"N\") + season(\"A\"))" 0.527 0.419
## 3 "ARIMA(box_cox ~ pdq(0, 0, 2) + PDQ(0, 1, 1))"               0.526 0.418

gg_tsresiduals(fit_ARIMA_ATM1)

gg_tsresiduals(fit_ETS_ATM1_box_cox)

gg_tsresiduals(fit_ARIMA_ATM1_box_cox)

Since the transformed data had the best fit model, the data must be transformed back in order to interpret the predictive values of cash withdrawals from ATM1 for the month of May 2010.

fc_ARIMA_ATM1_box_cox

## # A fable: 31 x 4 [1D]
## # Key:     .model [1]
##    .model                                                  DATE      
##    <chr>                                                   <date>    
##  1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-01
##  2 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-02
##  3 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-03
##  4 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-04
##  5 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-05
##  6 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-06
##  7 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-07
##  8 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-08
##  9 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-09
## 10 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-10
## # ℹ 21 more rows
## # ℹ 2 more variables: box_cox <dist>, .mean <dbl>

fc_ARIMA_ATM1_box_cox  %>% autoplot(level = FALSE) + autolayer(ATM1_box_cox)

ATM1_forecasted <- fc_ARIMA_ATM1_box_cox %>% rename(box_cox_Cash = .mean) %>%
  mutate(Cash = inv_box_cox(box_cox_Cash, lambda_guerrero)) %>% 
  select(DATE,Cash)
ATM1_forecasted %>% autoplot(color = "blue") + autolayer(ATM1) +
  labs(x = "Date", y = "Cash Withdrawal in Hundreds", 
       title = "Cash Withdrawal Prediction from ATM2 using ARIMA(0,0,2)(0,1,1)")

ATM2 Analysis

With the ATM2 dataset, I can see a more regular seasonal pattern. I can also see again that there are NA values. For this dataset, I believe imputing the NA values using seasonal decomposition makes the most sense. To do this, I use the na_seadec() function in the imputeTS package.

ATM2 <- ATM2 %>% as_tsibble(index = DATE)
ATM2 %>% autoplot(Cash)

any(is.na(ATM2$Cash))

## [1] TRUE

ATM2$Cash <- na_seadec(ATM2$Cash, algorithm = "interpolation")
ATM2 %>% autoplot(Cash)

ETS model on the non-transformed dataset

An ETS(A,N,A) model was selected and used for the forecast. Looking at the residuals, the variance does not appear relatively constant and the ACF shows autocorrelations above the critical value at 4 lags, but the distribution of residuals appears roughly normal. The p-value resulting from the ljung_box test is under 0.05, and therefore the model is not a good fit the time series is not white noise.

fit_ETS_ATM2 <- ATM2 %>% model(ETS(Cash))
report(fit_ETS_ATM2)

## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000049 
##     gamma = 0.3594252 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]     s[-3]    s[-4]    s[-5]    s[-6]
##  72.71479 -33.56563 -27.28811 12.45705 0.8474403 4.742114 17.29087 25.51626
## 
##   sigma^2:  648.1015
## 
##      AIC     AICc      BIC 
## 4527.377 4527.998 4566.376

fc_ETS_ATM2 <- forecast(fit_ETS_ATM2, h = 31)
fc_ETS_ATM2  %>% autoplot() + autolayer(ATM2)

gg_tsresiduals(fit_ETS_ATM2)

augment(fit_ETS_ATM2) %>% 
  features(.innov, ljung_box, lag = 14)

## # A tibble: 1 × 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 ETS(Cash)    37.4  0.000649

ARIMA model on the non-transformed dataset

For ARIMA, I can see the dataset requires one order of seasonal differencing. No non-seasonal differencing is required. The variance of the differences is not very constant and the PACF shows spikes throughout the plot. The data is shown to be stationary after differencing, even without transformation, through the KPSS unit test.

ATM2 %>% features(Cash, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

ATM2_nsdiff <- ATM2 %>% mutate(Cash_nsdiff = difference(Cash, 7))
ATM2_nsdiff %>% features(Cash_nsdiff, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

ATM2_nsdiff %>% gg_tsdisplay(Cash_nsdiff,plot_type = "partial")

ATM2_nsdiff %>% features(Cash_nsdiff, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0167         0.1

An ARIMA(2,0,2)(0,1,1)[7] model is fit, and this model is used for forecasting. After forecasting, I can see that the variance of the residuals is not very constant, and the autocorrelations in the ACF plot has two spikes past the critical values. This is past the 5% threshold of tolerance for autocorrelations outside of the critical value bounds, although the distribution of the residuals appear approximately normal. Performing the ljung-box test, I can see that the p-value is above 0.05, and this model is a better fit than the ETS(A,N,A) model.

fit_ARIMA_ATM2 <- ATM2 %>% 
  model(ARIMA(Cash, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM2)

## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4320  -0.9130  0.4773  0.8048  -0.7547
## s.e.   0.0553   0.0407  0.0861  0.0556   0.0381
## 
## sigma^2 estimated as 602.5:  log likelihood=-1653.67
## AIC=3319.33   AICc=3319.57   BIC=3342.61

fc_ARIMA_ATM2 <- forecast(fit_ARIMA_ATM2, h = 31)
fc_ARIMA_ATM2  %>% autoplot() + autolayer(ATM2)

gg_tsresiduals(fit_ARIMA_ATM2)

augment(fit_ARIMA_ATM2) %>% 
  features(.innov, ljung_box, lag = 14, dof = 5)

## # A tibble: 1 × 3
##   .model                                               lb_stat lb_pvalue
##   <chr>                                                  <dbl>     <dbl>
## 1 ARIMA(Cash, stepwise = FALSE, approximation = FALSE)    10.2     0.332

Box-Cox transformation

Again, I also explored a Box-Cox transformation on the data before modeling to see if the prerequisites for stationary data could be better met.

lambda_guerrero <- ATM2 %>% 
  features(Cash, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
ATM2_box_cox <- ATM2 %>% 
  mutate(box_cox = box_cox(Cash,lambda_guerrero)) %>% 
  select(box_cox)

ETS model on the transformed dataset

I attempted an exponential smoothing model again first, where an ETS(A,N,A) model was selected again. Forecasting with the Box-Cox transformed data, I can see in the autocorrelations still show spikes beyond the bounds of the critical values. Additionally, performing the ljung-box test still returns a p-value less than 0.05, indicating this time series is not white noise.

fit_ETS_ATM2_box_cox <- ATM2_box_cox %>% model(ETS(box_cox))
report(fit_ETS_ATM2_box_cox)

## Series: box_cox 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000387 
##     gamma = 0.3837262 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]   s[-2]    s[-3]    s[-4]    s[-5]   s[-6]
##  26.45702 -18.63684 -11.61431 10.6539 2.227776 3.924051 6.477696 6.96773
## 
##   sigma^2:  72.4456
## 
##      AIC     AICc      BIC 
## 3727.585 3728.206 3766.584

fc_ETS_ATM2_box_cox <- forecast(fit_ETS_ATM2_box_cox, h = 31)
fc_ETS_ATM2_box_cox  %>% autoplot() + autolayer(ATM2_box_cox)

gg_tsresiduals(fit_ETS_ATM2_box_cox)

augment(fit_ETS_ATM2_box_cox) %>% 
  features(.innov, ljung_box, lag = 14)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 ETS(box_cox)    33.9   0.00214

ARIMA model on the transformed dataset

Looking at the ACF and PACF plots, I can see the Box-Cox transformed dataset produces plots that are nearly identical in shape to the non-transformed dataset. After differencing, the KPSS unit test shows that this data is stationary and an ARIMA model can be fit.

ATM2_box_cox %>% features(box_cox, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

ATM2_box_cox_nsdiff <- ATM2_box_cox %>% 
  mutate(box_cox_Cash_nsdiff = difference(box_cox, 7))
ATM2_box_cox_nsdiff %>% 
  features(box_cox_Cash_nsdiff, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

ATM2_box_cox_nsdiff %>% 
  gg_tsdisplay(box_cox_Cash_nsdiff,plot_type = "partial")

ATM2_box_cox_nsdiff %>% 
  features(box_cox_Cash_nsdiff, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0162         0.1

An ARIMA(2,0,2)(0,1,1)[7] model was selected for the Box-Cox transformed data, which is unsurprising given the similar shape of the data. The p-value from the ljung-box test is again over 0.05, so this model can be considered a potential model for selection and evaluation.

fit_ARIMA_ATM2_box_cox <- ATM2_box_cox %>% 
  model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM2_box_cox)

## Series: box_cox 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4256  -0.9051  0.4719  0.7967  -0.7271
## s.e.   0.0588   0.0444  0.0888  0.0590   0.0406
## 
## sigma^2 estimated as 68.16:  log likelihood=-1263.33
## AIC=2538.66   AICc=2538.9   BIC=2561.94

fc_ARIMA_ATM2_box_cox <- forecast(fit_ARIMA_ATM2_box_cox, h = 31)
fc_ARIMA_ATM2_box_cox  %>% autoplot() + autolayer(ATM2_box_cox)

gg_tsresiduals(fit_ARIMA_ATM2_box_cox)

augment(fit_ARIMA_ATM2_box_cox) %>% 
  features(.innov, ljung_box, lag = 14, dof = 5)

## # A tibble: 1 × 3
##   .model                                                  lb_stat lb_pvalue
##   <chr>                                                     <dbl>     <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE)    10.7     0.300

Comparison of models to determine best fit

The two models in this assessment are the ARIMA(2,0,2)(0,1,1)[7] on the non-transformed data and ARIMA(2,0,2)(0,1,1)[7] on the transformed data.

ATM2_train <- ATM2 %>% filter(DATE < "2010-04-01")
ATM2_box_cox_train <- ATM2_box_cox %>% filter(DATE < "2010-04-01")

ARIMA(2,0,2)(0,1,1)[7] on the non-transformed data:

fit_ARIMA_ATM2_train <- ATM2_train %>% 
  model(ARIMA(Cash ~ pdq(2,0,2) + PDQ(0,1,1)))
report(fit_ARIMA_ATM2_train)

## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2    sma1
##       -0.4322  -0.9259  0.4799  0.8085  -0.779
## s.e.   0.0484   0.0399  0.0805  0.0537   0.042
## 
## sigma^2 estimated as 626.2:  log likelihood=-1521.62
## AIC=3055.25   AICc=3055.51   BIC=3078.01

ARIMA(2,0,2)(0,1,1)[7] on the transformed data:

fit_ARIMA_ATM2_box_cox_train <- ATM2_box_cox_train %>% 
  model(ARIMA(box_cox ~ pdq(2,0,2) + PDQ(0,1,1)))
report(fit_ARIMA_ATM2_box_cox_train)

## Series: box_cox 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4251  -0.9187  0.4732  0.8028  -0.7474
## s.e.   0.0511   0.0441  0.0819  0.0578   0.0460
## 
## sigma^2 estimated as 71.19:  log likelihood=-1164.73
## AIC=2341.45   AICc=2341.71   BIC=2364.21

From the comparison below, we can see that the model on the Box-Cox transformed data had performed better due to its lower MASE, although RMSSE is the same. When revisiting the residuals plot, we can see similar behavior in both where there are two spikes in the autocorrelations, roughly normal distribution of residuals, and heteroscedasticity when looking at the variance of residuals.

bind_rows(
  fit_ARIMA_ATM2_train %>% forecast(h = 30) %>% 
    accuracy(ATM2) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ATM2_box_cox_train %>% forecast(h = 30) %>% 
    accuracy(ATM2_box_cox) %>% select(.model, MASE,RMSSE)
)

## # A tibble: 2 × 3
##   .model                                        MASE RMSSE
##   <chr>                                        <dbl> <dbl>
## 1 ARIMA(Cash ~ pdq(2, 0, 2) + PDQ(0, 1, 1))    0.767 0.681
## 2 ARIMA(box_cox ~ pdq(2, 0, 2) + PDQ(0, 1, 1)) 0.733 0.681

gg_tsresiduals(fit_ARIMA_ATM2)

gg_tsresiduals(fit_ARIMA_ATM2_box_cox)

Since the transformed data had the best fit model, the data must be transformed back in order to interpret the predictive values of cash withdrawals from ATM2 for the month of May 2010.

fc_ARIMA_ATM2_box_cox

## # A fable: 31 x 4 [1D]
## # Key:     .model [1]
##    .model                                                  DATE      
##    <chr>                                                   <date>    
##  1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-01
##  2 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-02
##  3 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-03
##  4 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-04
##  5 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-05
##  6 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-06
##  7 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-07
##  8 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-08
##  9 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-09
## 10 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-10
## # ℹ 21 more rows
## # ℹ 2 more variables: box_cox <dist>, .mean <dbl>

fc_ARIMA_ATM2_box_cox  %>% autoplot(level = FALSE) + 
  autolayer(ATM2_box_cox)

ATM2_forecasted <- fc_ARIMA_ATM2_box_cox %>% 
  rename(box_cox_Cash = .mean) %>%
  mutate(Cash = inv_box_cox(box_cox_Cash, lambda_guerrero)) %>% 
  select(DATE,Cash)
ATM2_forecasted %>% autoplot(color = "blue") + 
  autolayer(ATM2) +
  labs(x = "Date", y = "Cash Withdrawal in Hundreds", 
       title = "Cash Withdrawal Prediction from ATM2 using ARIMA(2,0,2)(0,1,1)")

ATM3 Analysis

With the ATM3 dataset, I can see that cash withdrawal is zero up until the last few days of April 2010. Predictions given this limited dataset would not be meaningful, as the sample size is too small. If a prediction had to be made, I would suggest using a simple exponential smoothing forecast, as shown below. It is possible that ATM3 is a new ATM, or data collection has somehow failed up until April 2010. Given the data shown, my overall recommendation would be to disregard the ATM3 dataset unless more data can be acquired. Almost nothing meaningful, besides general scale, can be gleaned from a few datapoints.

ATM3 <- ATM3 %>% as_tsibble(index = DATE)
ATM3 %>% autoplot(Cash)

ATM3 %>% filter(Cash != 0)

## # A tsibble: 3 x 3 [1D]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-28 ATM3     96
## 2 2010-04-29 ATM3     82
## 3 2010-04-30 ATM3     85

fit_SES_ATM3 <- ATM3 %>% model(ETS(Cash))
report(fit_SES_ATM3)

## Series: Cash 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.8585628 
## 
##   Initial states:
##        l[0]
##  -0.1269535
## 
##   sigma^2:  25.4128
## 
##      AIC     AICc      BIC 
## 3338.324 3338.391 3350.024

fc_SES_ATM3 <- forecast(fit_SES_ATM3, h = 31)
fc_SES_ATM3 %>% autoplot() + autolayer(ATM3)

ATM3_forecasted <- fc_SES_ATM3 %>% 
  rename(CashDistribution = Cash, Cash = .mean) %>% 
  select(DATE,Cash)

ATM4 Analysis

With the ATM4 dataset, I can see a strong outlier in the data but no NA values. I can see this outlier occurs on 2010-02-09, and has a value of 10920. Typically, ATMs will have the capacity for that type of volume of withdrawal in a day, as 10920 $100 withdrawals amounts to $1,092,000. This is reason to believe that the outlier is an erroneous value. The ATM4 data does not appear to be seasonal, so it would not make sense to impute based on seasonal decomposition. Instead, I used the mean of the data to impute this value given that the dataset appears homoscedastic and the mean appears stable/independent of time when excluding the outlier.

ATM4 <- ATM4 %>% as_tsibble(index = DATE)
ATM4 %>% autoplot(Cash)

any(is.na(ATM4$Cash))

## [1] FALSE

ATM4 %>% filter(Cash == max(Cash))

## # A tsibble: 1 x 3 [1D]
##   DATE       ATM     Cash
##   <date>     <chr>  <dbl>
## 1 2010-02-09 ATM4  10920.

ATM4_mean_wo_outlier <- ATM4 %>% filter(Cash != max(Cash)) 
ATM4_mean_wo_outlier <- mean(ATM4_mean_wo_outlier$Cash)
ATM4$Cash[285] <- ATM4_mean_wo_outlier
ATM4 %>% autoplot(Cash)

ETS model on the non-transformed dataset

An ETS(A,N,A) model was selected and used for the forecast. Looking at the residuals, the variance looks relatively constant and the ACF shows autocorrelations all within the bounds of the critical values. However, the distribution does appear slightly skewed. The p-value resulting from the ljung_box test is greater than 0.05, and therefore the model can be considered a good fit.

fit_ETS_ATM4 <- ATM4 %>% model(ETS(Cash))
report(fit_ETS_ATM4)

## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001072472 
##     gamma = 0.0001000058 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]   s[-5]   s[-6]
##  452.3827 -277.6182 -30.42428 17.26821 44.47028 89.31875 58.8934 98.0918
## 
##   sigma^2:  110735
## 
##      AIC     AICc      BIC 
## 6403.786 6404.408 6442.785

fc_ETS_ATM4 <- forecast(fit_ETS_ATM4, h = 31)
fc_ETS_ATM4  %>% autoplot() + autolayer(ATM4)

gg_tsresiduals(fit_ETS_ATM4)

augment(fit_ETS_ATM4) %>% features(.innov, ljung_box, lag = 14)

## # A tibble: 1 × 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 ETS(Cash)    18.1     0.203

ARIMA model on the non-transformed dataset

For ARIMA, I can see the dataset does not require any differencing as it is already stationary. The variance appears constant, and the ACF is not high and strictly positive; however, the PACF shows two multiple spikes throughout the plot. The data is confirmed to be stationary through the KPSS test.

ATM4 %>% features(Cash, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

ATM4 %>% features(Cash, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

ATM4 %>% gg_tsdisplay(Cash,plot_type = "partial")

ATM4 %>% features(Cash, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.117         0.1

An ARIMA(1,0,0)(2,0,0)[7] w/ mean model is fit to the data, and this model is used for forecasting. After forecasting, I can see that the variance of the residuals is relatively constant, and the autocorrelations in the ACF plot has no spikes past the critical values. The distribution of the residuals appear potentially right skewed. Performing the ljung-box test, I can see that the p-value is slightly above 0.05, and this model can therefore be considered for further evaluation.

fit_ARIMA_ATM4 <- ATM4 %>% 
  model(ARIMA(Cash, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM4)

## Series: Cash 
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##          ar1    sar1    sar2  constant
##       0.0812  0.1486  0.1128  301.3721
## s.e.  0.0523  0.0521  0.0526   17.7422
## 
## sigma^2 estimated as 118443:  log likelihood=-2648.09
## AIC=5306.19   AICc=5306.35   BIC=5325.68

fc_ARIMA_ATM4 <- forecast(fit_ARIMA_ATM4, h = 31)
fc_ARIMA_ATM4  %>% autoplot() + autolayer(ATM4)

gg_tsresiduals(fit_ARIMA_ATM4)

augment(fit_ARIMA_ATM4) %>% 
  features(.innov, ljung_box, lag = 10, dof = 3)

## # A tibble: 1 × 3
##   .model                                               lb_stat lb_pvalue
##   <chr>                                                  <dbl>     <dbl>
## 1 ARIMA(Cash, stepwise = FALSE, approximation = FALSE)    13.4    0.0622

Box-Cox transformation

Again, I also explored a Box-Cox transformation on the data before modeling to see if the prerequisites for stationary data could be better met.

lambda_guerrero <- ATM4 %>% 
  features(Cash, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
ATM4_box_cox <- ATM4 %>% 
  mutate(box_cox = box_cox(Cash,lambda_guerrero)) %>% 
  select(box_cox)

ETS model on the transformed dataset

I attempted an exponential smoothing model again first, where an ETS(M,N,A) model was selected. Forecasting with the Box-Cox transformed data, I can see the autocorrelations do not exceed the bounds of the critical values. The residuals appear fairly constant besides one irregular spike where the variance is higher. Additionally, performing the ljung-box test again returns a p-value greater than 0.05.

fit_ETS_ATM4_box_cox <- ATM4_box_cox %>% model(ETS(box_cox))
report(fit_ETS_ATM4_box_cox)

## Series: box_cox 
## Model: ETS(M,N,A) 
##   Smoothing parameters:
##     alpha = 0.01547922 
##     gamma = 0.1782924 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]       s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  27.73898 -19.73406 -4.699187 -0.02708285 2.157202 6.623173 8.928071 6.751885
## 
##   sigma^2:  0.223
## 
##      AIC     AICc      BIC 
## 4036.699 4037.320 4075.698

fc_ETS_ATM4_box_cox <- forecast(fit_ETS_ATM4_box_cox, h = 31)
fc_ETS_ATM4_box_cox  %>% autoplot() + autolayer(ATM4_box_cox)

gg_tsresiduals(fit_ETS_ATM4_box_cox)

augment(fit_ETS_ATM4_box_cox) %>% 
  features(.innov, ljung_box, lag = 14)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 ETS(box_cox)    14.9     0.385

ARIMA model on the transformed dataset

Looking at the ACF and PACF plots, I can see the Box-Cox transformed dataset has more constant variance after the transformation. However, a greater number of lags exhibit spikes past the critical values in the PACF. The data is again shown to be stationary through the KPSS test.

ATM4_box_cox %>% features(box_cox, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

ATM4_box_cox %>% features(box_cox, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

ATM4_box_cox %>% gg_tsdisplay(box_cox,plot_type = "partial")

ATM4_box_cox %>% features(box_cox, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0771         0.1

An ARIMA(1,0,0)(2,0,0)[7] w/ mean model was selected for the Box-Cox transformed data The variance of residuals is relatively constant, and the ACF shows one spike beyond the critical values. The distribution of residuals appears roughly normal. The p-value from the ljung-box test is again over 0.05, so this model can be considered a potential model for selection and evaluation.

fit_ARIMA_ATM4_box_cox <- ATM4_box_cox %>% 
  model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM4_box_cox)

## Series: box_cox 
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##          ar1    sar1    sar2  constant
##       0.0766  0.2092  0.1990   15.9056
## s.e.  0.0526  0.0516  0.0525    0.6921
## 
## sigma^2 estimated as 183.7:  log likelihood=-1467.91
## AIC=2945.82   AICc=2945.98   BIC=2965.32

fc_ARIMA_ATM4_box_cox <- forecast(fit_ARIMA_ATM4_box_cox, h = 31)
fc_ARIMA_ATM4_box_cox  %>% autoplot() + autolayer(ATM4_box_cox)

gg_tsresiduals(fit_ARIMA_ATM4_box_cox)

augment(fit_ARIMA_ATM4_box_cox) %>% 
  features(.innov, ljung_box, lag = 14, dof = 3)

## # A tibble: 1 × 3
##   .model                                                  lb_stat lb_pvalue
##   <chr>                                                     <dbl>     <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE)    15.8     0.150

Comparison of models to determine best fit

All four models assessed, ETS(A,N,A) and ARIMA(1,0,0)(2,0,0)[7] w/ mean for the non-transformed data, and ETS(M,N,A) and ARIMA(1,0,0)(2,0,0)[7] w/ mean for the transformed data, will be considered for evaluation.

ATM4_train <- ATM4 %>% filter(DATE < "2010-04-01")
ATM4_box_cox_train <- ATM4_box_cox %>% filter(DATE < "2010-04-01")

ETS(A,N,A) on the non-transformed data:

fit_ETS_ATM4_train <- ATM4_train %>% 
  model(ETS(Cash~ error("A")+trend("N")+season("A")))
report(fit_ETS_ATM4_train)

## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000658 
##     gamma = 0.0001188245 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]   s[-3]    s[-4]    s[-5]    s[-6]
##  459.5541 -301.4226 -41.96698 39.36969 33.5328 90.72653 54.21942 125.5412
## 
##   sigma^2:  112403.6
## 
##      AIC     AICc      BIC 
## 5854.611 5855.290 5892.752

ARIMA(1,0,0)(2,0,0)[7] w/ mean on the non-transformed data:

fit_ARIMA_ATM4_train <- ATM4_train %>% 
  model(ARIMA(Cash ~ pdq(1,0,0) + PDQ(2,0,0)))
report(fit_ARIMA_ATM4_train)

## Series: Cash 
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##          ar1    sar1    sar2  constant
##       0.0817  0.1461  0.0977  312.5267
## s.e.  0.0546  0.0545  0.0550   18.8407
## 
## sigma^2 estimated as 122684:  log likelihood=-2436.15
## AIC=4882.31   AICc=4882.49   BIC=4901.38

ETS(M,N,A) on the transformed data:

fit_ETS_ATM4_box_cox_train <- ATM4_box_cox_train %>% 
  model(ETS(box_cox~ error("M")+trend("N")+season("A")))
report(fit_ETS_ATM4_box_cox_train)

## Series: box_cox 
## Model: ETS(M,N,A) 
##   Smoothing parameters:
##     alpha = 0.01370986 
##     gamma = 0.161741 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  25.80506 -17.85373 -4.477105 1.189807 5.956007 5.026289 4.475229 5.683503
## 
##   sigma^2:  0.2302
## 
##      AIC     AICc      BIC 
## 3683.053 3683.732 3721.194

ARIMA(1,0,0)(2,0,0)[7] w/ mean on the transformed data:

fit_ARIMA_ATM4_box_cox_train <- ATM4_box_cox_train %>% 
  model(ARIMA(box_cox ~ pdq(1,0,0) + PDQ(2,0,0)))
report(fit_ARIMA_ATM4_box_cox_train)

## Series: box_cox 
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##          ar1    sar1    sar2  constant
##       0.0791  0.2020  0.1813   16.6784
## s.e.  0.0548  0.0538  0.0548    0.7302
## 
## sigma^2 estimated as 187.8:  log likelihood=-1350.68
## AIC=2711.36   AICc=2711.54   BIC=2730.43

From the comparison below, we can see that the ARIMA(1,0,0)(2,0,0) model on non-transformed data had a lower RMSSE. Looking back at the residuals plots for the two lowest RMSSE models (ARIMA(1,0,0)(2,0,0) model on non-transformed data and ETS(M,N,A) on the transformed data), the level of variance appears visually similar. Neither have spikes past the critical values in the ACF plots, and the distribution of residuals for the ETS(M,N,A) model appears slightly more normal. However, since the RMSSE is lower by more than a couple percent, the ARIMA(1,0,0)(2,0,0) model on non-transformed data was selected.

bind_rows(
  fit_ETS_ATM4_train %>% forecast(h = 30) %>% 
    accuracy(ATM4) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ATM4_train %>% forecast(h = 30) %>% 
    accuracy(ATM4) %>% select(.model, MASE,RMSSE),
  fit_ETS_ATM4_box_cox_train %>% forecast(h = 30) %>% 
    accuracy(ATM4_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ATM4_box_cox_train %>% forecast(h = 30) %>% 
    accuracy(ATM4_box_cox) %>% select(.model, MASE,RMSSE)
)

## # A tibble: 4 × 3
##   .model                                                        MASE RMSSE
##   <chr>                                                        <dbl> <dbl>
## 1 "ETS(Cash ~ error(\"A\") + trend(\"N\") + season(\"A\"))"    0.742 0.672
## 2 "ARIMA(Cash ~ pdq(1, 0, 0) + PDQ(2, 0, 0))"                  0.664 0.592
## 3 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"A\"))" 0.611 0.613
## 4 "ARIMA(box_cox ~ pdq(1, 0, 0) + PDQ(2, 0, 0))"               0.707 0.692

gg_tsresiduals(fit_ARIMA_ATM4)

gg_tsresiduals(fit_ETS_ATM4_box_cox)

Since the non-transformed data had the best fit model, the data forecasted as-is can be used to predict values of cash withdrawals from ATM4 for the month of May 2010.

ATM4_forecasted <- fc_ARIMA_ATM4 %>% 
  rename(CashDistribution = Cash, Cash = .mean) %>% 
  select(DATE,Cash)

ATM4_forecasted %>% autoplot(color = "blue") + autolayer(ATM4) +
  labs(x = "Date", y = "Cash Withdrawal in Hundreds", 
       title = "Cash Withdrawal Prediction from ATM4 using ARIMA(1,0,0)(2,0,0)[7] w/ mean)")

Final remarks

The final model selection for the ATM1, ATM2, and ATM4 forecasts, as well as their corresponding MASE and RMSSE resulting from the training and test set exercise, are listed below. As previously discussed, ATM3 only has three datapoints and a forecast based on this is not recommended. However, a SES forecast was performed, with the values provided in the attached Excel file along with all other forecasts.

bind_rows(
  fit_ARIMA_ATM1_box_cox_train %>% forecast(h = 30) %>% 
    accuracy(ATM1_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ATM2_box_cox_train %>% forecast(h = 30) %>% 
    accuracy(ATM2_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ATM4_train %>% forecast(h = 30) %>% 
    accuracy(ATM4) %>% select(.model, MASE,RMSSE)
)

## # A tibble: 3 × 3
##   .model                                        MASE RMSSE
##   <chr>                                        <dbl> <dbl>
## 1 ARIMA(box_cox ~ pdq(0, 0, 2) + PDQ(0, 1, 1)) 0.526 0.418
## 2 ARIMA(box_cox ~ pdq(2, 0, 2) + PDQ(0, 1, 1)) 0.733 0.681
## 3 ARIMA(Cash ~ pdq(1, 0, 0) + PDQ(2, 0, 0))    0.664 0.592

report(fit_ARIMA_ATM1_box_cox)

## Series: box_cox 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1126  -0.1094  -0.6418
## s.e.  0.0524   0.0520   0.0432
## 
## sigma^2 estimated as 1.764:  log likelihood=-609.99
## AIC=1227.98   AICc=1228.09   BIC=1243.5

report(fit_ARIMA_ATM2_box_cox)

## Series: box_cox 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4256  -0.9051  0.4719  0.7967  -0.7271
## s.e.   0.0588   0.0444  0.0888  0.0590   0.0406
## 
## sigma^2 estimated as 68.16:  log likelihood=-1263.33
## AIC=2538.66   AICc=2538.9   BIC=2561.94

report(fit_SES_ATM3)

## Series: Cash 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.8585628 
## 
##   Initial states:
##        l[0]
##  -0.1269535
## 
##   sigma^2:  25.4128
## 
##      AIC     AICc      BIC 
## 3338.324 3338.391 3350.024

report(fit_ARIMA_ATM4)

## Series: Cash 
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##          ar1    sar1    sar2  constant
##       0.0812  0.1486  0.1128  301.3721
## s.e.  0.0523  0.0521  0.0526   17.7422
## 
## sigma^2 estimated as 118443:  log likelihood=-2648.09
## AIC=5306.19   AICc=5306.35   BIC=5325.68

Data_624_Koon_Project_1_excel <- createWorkbook()
addWorksheet(Data_624_Koon_Project_1_excel, "ATM1")
writeData(Data_624_Koon_Project_1_excel, "ATM1", ATM1_forecasted)
addWorksheet(Data_624_Koon_Project_1_excel, "ATM2")
writeData(Data_624_Koon_Project_1_excel, "ATM2", ATM2_forecasted)
addWorksheet(Data_624_Koon_Project_1_excel, "ATM3")
writeData(Data_624_Koon_Project_1_excel, "ATM3", ATM3_forecasted)
addWorksheet(Data_624_Koon_Project_1_excel, "ATM4")
writeData(Data_624_Koon_Project_1_excel, "ATM4", ATM4_forecasted)

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Read in the Data

After reading in the data, I can see the YearMonth has imported as a character I converted this date back to YearMonth format in R.

ResCust <- read_excel("C:\\Users\\Kim\\Documents\\Data624\\ResidentialCustomerForecastLoad-624.xlsx") # read in the file
ResCust <- ResCust %>% 
  rename(YearMonth= `YYYY-MMM`) %>% 
  mutate(YearMonth = yearmonth(YearMonth)) %>%
  as_tsibble(index = YearMonth) %>%
  select(KWH)

View the data

Taking a look at the data, I can see there are missing values and one potential outlier. When I use gg_season(), I can see that the outlier does not follow the seasonal monthly trend in the data. Given that the data is shown to be seasonal based on a 12 month period, I believe seasonal decomposition would be an appropriate imputation method for both the NA value and outlier. Find_frequency is set to true to estimate the frequency of the time-series automatically.

ResCust %>% autoplot()

ResCust %>% filter(KWH == min(KWH, na.rm = TRUE))

## # A tsibble: 1 x 2 [1M]
##      KWH YearMonth
##    <dbl>     <mth>
## 1 770523  2010 Jul

ResCust %>% gg_season()

ResCust %>% filter(KWH == min(KWH, na.rm = TRUE))

## # A tsibble: 1 x 2 [1M]
##      KWH YearMonth
##    <dbl>     <mth>
## 1 770523  2010 Jul

ResCust$KWH[151] <- NA
ResCust$KWH <- na_seadec(ResCust$KWH, algorithm = "interpolation", find_frequency = TRUE)

ResCust %>% autoplot()

ResCust %>% gg_season()

any(is.na(ResCust$KWH))

## [1] FALSE

ETS model on the non-transformed dataset

An ETS(M,N,M) model was fitted to the data. When looking at the residuals, the variance appears homoscedastic, though there are two spikes above critical values shown in the ACF. The distribution of residuals appears normal. When performing the ljung_box test, the p-value is shown to be above 0.05 and this model can be considered as a potential fit for further analysis.

fit_ETS_ResCust <- ResCust %>% model(ETS(KWH))
report(fit_ETS_ResCust)

## Series: KWH 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1284459 
##     gamma = 0.2213289 
## 
##   Initial states:
##     l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  6189952 0.8933526 0.7641105 0.9346016 1.252045 1.287414 1.252513 1.006389
##      s[-7]     s[-8]     s[-9]    s[-10]   s[-11]
##  0.7643922 0.8073412 0.8729015 0.9913738 1.173565
## 
##   sigma^2:  0.0093
## 
##      AIC     AICc      BIC 
## 6142.152 6144.879 6191.014

fc_ETS_ResCust <- forecast(fit_ETS_ResCust, h = 1)
fc_ETS_ResCust  %>% autoplot() + autolayer(ResCust)

gg_tsresiduals(fit_ETS_ResCust)

augment(fit_ETS_ResCust) %>% 
  features(.innov, ljung_box, lag = 24)

## # A tibble: 1 × 3
##   .model   lb_stat lb_pvalue
##   <chr>      <dbl>     <dbl>
## 1 ETS(KWH)    32.8     0.109

ARIMA model on the non-transformed dataset

For ARIMA, I can see the dataset requires one order of seasonal differencing. No non-seasonal differencing is required. The variance of the differences is relatively constant and the PACF show show a few spikes throughout the plot. Looking at the lags at multiples of seasonality, even the 24th lag is above the critical value. However, the KPSS test shows the data is stationary after the seasonal differencing.

ResCust %>% features(KWH, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

ResCust_nsdiff <- ResCust %>% 
  mutate(KWH_nsdiff = difference(KWH, 12))
ResCust_nsdiff %>% 
  features(KWH_nsdiff, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

ResCust_nsdiff %>% 
  gg_tsdisplay(KWH_nsdiff,plot_type = "partial", lag = 36)

ResCust_nsdiff %>% 
  features(KWH_nsdiff, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.168         0.1

An ARIMA(0,0,3)(2,1,0)[12] w/ drift model is fit, and this model is used for forecasting. After forecasting, I can see that the variance of the residuals not constant, with almost no variance at the beginning of the time series. However, the autocorrelations in the ACF plot are all within the bounds of the critical values and the distribution of residuals appears normal. Performing the ljung-box test, I can see that the p-value is above 0.05 and can also be considered as a potential fit for further analysis.

fit_ARIMA_ResCust <- ResCust %>% 
  model(ARIMA(KWH, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ResCust)

## Series: KWH 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2   constant
##       0.3456  0.0469  0.2383  -0.7019  -0.4150  229135.00
## s.e.  0.0789  0.0886  0.0749   0.0771   0.0789   77965.14
## 
## sigma^2 estimated as 3.869e+11:  log likelihood=-2657.83
## AIC=5329.66   AICc=5330.31   BIC=5352.01

fc_ARIMA_ResCust <- forecast(fit_ARIMA_ResCust, h = 1)
fc_ARIMA_ResCust  %>% autoplot() + autolayer(ResCust)

gg_tsresiduals(fit_ARIMA_ResCust)

augment(fit_ARIMA_ResCust) %>% 
  features(.innov, ljung_box, lag = 24, dof = 5)

## # A tibble: 1 × 3
##   .model                                              lb_stat lb_pvalue
##   <chr>                                                 <dbl>     <dbl>
## 1 ARIMA(KWH, stepwise = FALSE, approximation = FALSE)    14.9     0.732

Box-Cox transformation

Again, I also explored a Box-Cox transformation on the data before modeling to see if the prerequisites for stationary data could be better met.

lambda_guerrero <- ResCust %>% 
  features(KWH, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
ResCust_box_cox <- ResCust %>% 
  mutate(box_cox = box_cox(KWH,lambda_guerrero)) %>% 
  select(box_cox)
ResCust_box_cox %>% autoplot()

ETS model on the transformed dataset

I attempted an exponential smoothing model again first, where an ETS(M,N,M) model was selected again. Forecasting with the Box-Cox transformed data, I can see in the autocorrelations still show spikes beyond the bounds of the critical values. The variance of the residuals is relatively constant and the distribution of residuals is roughly normal. Performing the ljung-box test still returns a p-value greater than 0.05, indicating this model can also be evaluated as a potential model.

fit_ETS_ResCust_box_cox <- ResCust_box_cox %>% model(ETS(box_cox))
report(fit_ETS_ResCust_box_cox)

## Series: box_cox 
## Model: ETS(A,A,A) 
##   Smoothing parameters:
##     alpha = 0.09726989 
##     beta  = 0.0001007045 
##     gamma = 0.0001195169 
## 
##   Initial states:
##      l[0]         b[0]         s[0]        s[-1]        s[-2]       s[-3]
##  4.439695 4.230818e-05 -0.001682906 -0.009258518 -0.004276473 0.005169698
##        s[-4]      s[-5]       s[-6]        s[-7]        s[-8]        s[-9]
##  0.008620482 0.00699773 0.000901787 -0.007986956 -0.006149625 -0.002418816
##       s[-10]      s[-11]
##  0.002752698 0.007330899
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -1195.663 -1192.146 -1140.286

fc_ETS_ResCust_box_cox <- forecast(fit_ETS_ResCust_box_cox, h = 1) %>%
  rename(box_cox_KWH = .mean)
ResCust_box_cox %>% ggplot() + 
  geom_line(aes(x = YearMonth, y = box_cox)) + 
  geom_point(aes(x = fc_ETS_ResCust_box_cox$YearMonth, 
                 y = fc_ETS_ResCust_box_cox$box_cox_KWH), 
             color = "blue")

gg_tsresiduals(fit_ETS_ResCust_box_cox)

augment(fit_ETS_ResCust_box_cox) %>% features(.innov, ljung_box, lag = 24)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 ETS(box_cox)    48.4   0.00224

ARIMA model on the transformed dataset

Looking at the PACF plot, I can see the Box-Cox transformed dataset has PACF spikes that decrease as the lags increase. Looking at the multiples of seasonality, 12, I can see that the spikes are within bounds at these multiples after lag 12. The KPSS test also shows that the data is stationary after differencing.

ResCust_box_cox %>% features(box_cox, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

ResCust_box_cox_nsdiff <- 
  ResCust_box_cox %>% 
  mutate(box_cox_KWH_nsdiff = difference(box_cox, 7))
ResCust_box_cox_nsdiff %>% 
  features(box_cox_KWH_nsdiff, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

ResCust_box_cox_nsdiff %>% 
  gg_tsdisplay(box_cox_KWH_nsdiff,plot_type = "partial", lag = 36)

ResCust_box_cox_nsdiff %>% 
  features(box_cox_KWH_nsdiff, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0330         0.1

An ARIMA(0,0,3)(2,1,0)[12] w/ drift model was selected for the Box-Cox transformed data. Looking at the variance of the residuals, I can see that there is a similar problem where the variance is almost zero for approximately the first year of datapoints. This means the variance is not homoscedastic. However, there is only one spike in the autocorrelations, and the distribution of residuals appear roughly normal. The p-value from the ljung-box test is again over 0.05, so this model can be considered a potential model for selection and evaluation.

fit_ARIMA_ResCust_box_cox <- ResCust_box_cox %>% 
  model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ResCust_box_cox)

## Series: box_cox 
## Model: ARIMA(0,0,3)(0,1,1)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sma1  constant
##       0.3123  0.0899  0.2014  -0.7118     5e-04
## s.e.  0.0763  0.0844  0.0704   0.0640     1e-04
## 
## sigma^2 estimated as 1.017e-05:  log likelihood=790.27
## AIC=-1568.54   AICc=-1568.06   BIC=-1549.38

fc_ARIMA_ResCust_box_cox <- 
  forecast(fit_ARIMA_ResCust_box_cox, h = 1) %>%
    rename(box_cox_KWH = .mean)
ResCust_box_cox %>% ggplot() + 
  geom_line(aes(x = YearMonth, y = box_cox)) + 
  geom_point(aes(x = fc_ARIMA_ResCust_box_cox$YearMonth, 
                 y = fc_ARIMA_ResCust_box_cox$box_cox_KWH), 
             color = "blue")

gg_tsresiduals(fit_ARIMA_ResCust_box_cox)

augment(fit_ARIMA_ResCust_box_cox) %>% features(.innov, ljung_box, lag = 24, dof = 5)

## # A tibble: 1 × 3
##   .model                                                  lb_stat lb_pvalue
##   <chr>                                                     <dbl>     <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE)    23.4     0.221

Comparison of models to determine best fit

All four models assessed, ETS(M,N,M) and ARIMA(0,0,3)(2,1,0)[12] w/ drift for the non-transformed data, and ETS(M,N,M) and ARIMA(0,0,3)(2,1,0)[12] w/ drift for the transformed data, will be considered for evaluation.

ResCust_train <- ResCust %>% 
  filter(YearMonth < yearmonth("2013 Dec"))
ResCust_box_cox_train <- ResCust_box_cox %>% 
  filter(YearMonth < yearmonth("2013 Dec"))

ETS(M,N,M) on the non-transformed data:

fit_ETS_ResCust_train <- ResCust_train %>% 
  model(ETS(KWH ~ error("M")+trend("N")+season("M")))
report(fit_ETS_ResCust_train)

## Series: KWH 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1261529 
##     gamma = 0.2073059 
## 
##   Initial states:
##     l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  6198551 0.8983829 0.7614825 0.9353988 1.233974 1.275716 1.259156 1.003643
##     s[-7]    s[-8]     s[-9]    s[-10]   s[-11]
##  0.768538 0.806859 0.8791735 0.9981168 1.179559
## 
##   sigma^2:  0.0086
## 
##      AIC     AICc      BIC 
## 6095.270 6098.013 6144.054

ARIMA(0,0,3)(2,1,0)[12] w/ drift on the non-transformed data:

fit_ARIMA_ResCust_train <- ResCust_train %>% 
  model(ARIMA(KWH ~ pdq(0,0,3) + PDQ(2,1,0)))
report(fit_ARIMA_ResCust_train)

## Series: KWH 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2   constant
##       0.3471  0.0478  0.2424  -0.6880  -0.4633  208360.75
## s.e.  0.0748  0.0828  0.0680   0.0735   0.0756   74737.56
## 
## sigma^2 estimated as 3.498e+11:  log likelihood=-2634.48
## AIC=5282.96   AICc=5283.61   BIC=5305.27

ETS(M,N,M) on the transformed data:

fit_ETS_ResCust_box_cox_train <- ResCust_box_cox_train %>% 
  model(ETS(box_cox ~ error("M")+trend("N")+season("M")))
report(fit_ETS_ResCust_box_cox_train)

## Series: box_cox 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1051158 
##     gamma = 0.2003773 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]   s[-6]
##  4.440018 0.9991129 0.9981611 0.9995753 1.001659 1.002001 1.001643 1.00014
##      s[-7]     s[-8]     s[-9]   s[-10]   s[-11]
##  0.9981808 0.9986073 0.9992228 1.000239 1.001458
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -1198.950 -1196.207 -1150.166

ARIMA(0,0,3)(2,1,0)[12] w/ drift on the transformed data:

fit_ARIMA_ResCust_box_cox_train <- ResCust_box_cox_train %>% 
  model(ARIMA(box_cox ~ pdq(0,0,3) + PDQ(2,1,0)))
report(fit_ARIMA_ResCust_box_cox_train)

## Series: box_cox 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  constant
##       0.2705  0.0714  0.2264  -0.7161  -0.4085     1e-03
## s.e.  0.0739  0.0804  0.0671   0.0725   0.0752     4e-04
## 
## sigma^2 estimated as 9.725e-06:  log likelihood=791.42
## AIC=-1568.84   AICc=-1568.19   BIC=-1546.53

From the comparison below, we can see that the ARIMA(0,0,3)(2,1,0)[12] w/ drift model on transformed data had a lower RMSSE.

bind_rows(
  fit_ETS_ResCust_train %>% forecast(h = 1) %>% 
    accuracy(ResCust) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ResCust_train %>% forecast(h = 1) %>% 
    accuracy(ResCust) %>% select(.model, MASE,RMSSE),
  fit_ETS_ResCust_box_cox_train %>% forecast(h = 1) %>% 
    accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ResCust_box_cox_train %>% forecast(h = 1) %>% 
    accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE)
)

## # A tibble: 4 × 3
##   .model                                                        MASE RMSSE
##   <chr>                                                        <dbl> <dbl>
## 1 "ETS(KWH ~ error(\"M\") + trend(\"N\") + season(\"M\"))"      4.14  3.21
## 2 "ARIMA(KWH ~ pdq(0, 0, 3) + PDQ(2, 1, 0))"                    4.31  3.34
## 3 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"M\"))"  3.18  2.52
## 4 "ARIMA(box_cox ~ pdq(0, 0, 3) + PDQ(2, 1, 0))"                3.10  2.46

Revisiting the residuals plot, we can see that the scale of the residuals is smaller and more constant in the ETS(M,N,M) model on the transformed data. This model also had the second lowest RMSSE. When splitting the training set at an even earlier date to compare a 12 month forecast in the testing set, the RMSSE only differs between the two models by 0.028, with the ARIMA model still having the lower RMSSE. Looking at the ACF plots, since the ETS model has two spikes above the critical values and the ARIMA model has one, the ARIMA model was ultimately chosen.

gg_tsresiduals(fit_ETS_ResCust_box_cox)

gg_tsresiduals(fit_ARIMA_ResCust_box_cox)

ResCust_box_cox_train2 <- ResCust_box_cox %>% 
  filter(YearMonth <= yearmonth("2012 Dec"))

fit_ETS_ResCust_box_cox_train2 <- ResCust_box_cox_train2 %>% 
  model(ETS(box_cox ~ error("M")+trend("N")+season("M")))
report(fit_ETS_ResCust_box_cox_train2)

## Series: box_cox 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1121282 
##     gamma = 0.1778984 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]   s[-5]    s[-6]
##  4.440014 0.9992006 0.9981364 0.9995244 1.001621 1.002023 1.00166 1.000138
##      s[-7]     s[-8]     s[-9]   s[-10]   s[-11]
##  0.9981581 0.9985882 0.9992599 1.000246 1.001444
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -1134.708 -1131.781 -1086.814

fit_ARIMA_ResCust_box_cox_train2 <- ResCust_box_cox_train2 %>% 
  model(ARIMA(box_cox ~ pdq(0,0,3) + PDQ(2,1,0)))
report(fit_ARIMA_ResCust_box_cox_train2)

## Series: box_cox 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  constant
##       0.2803  0.0780  0.2422  -0.7460  -0.4122     1e-03
## s.e.  0.0755  0.0813  0.0692   0.0752   0.0800     4e-04
## 
## sigma^2 estimated as 9.85e-06:  log likelihood=742.17
## AIC=-1470.34   AICc=-1469.64   BIC=-1448.47

bind_rows(
  fit_ETS_ResCust_box_cox_train2 %>% forecast(h = 12) %>% 
    accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ResCust_box_cox_train2 %>% forecast(h = 12) %>% 
    accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE)
)

## # A tibble: 2 × 3
##   .model                                                        MASE RMSSE
##   <chr>                                                        <dbl> <dbl>
## 1 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"M\"))" 0.808 0.930
## 2 "ARIMA(box_cox ~ pdq(0, 0, 3) + PDQ(2, 1, 0))"               0.857 0.916

Since the transformed data had the best fit model, the data must be transformed back in order to interpret the predictive value of KWH usage for January 2014.

fc_ARIMA_ResCust_box_cox

## # A tsibble: 1 x 4 [1M]
## # Key:       .model [1]
##   .model                                                  YearMonth
##   <chr>                                                       <mth>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE)  2014 Jan
## # ℹ 2 more variables: box_cox <dist>, box_cox_KWH <dbl>

fc_ARIMA_ResCust_box_cox  %>% autoplot(level = FALSE) + 
  autolayer(ResCust_box_cox)

ResCust_forecasted <- fc_ARIMA_ResCust_box_cox %>% 
  mutate(KWH = inv_box_cox(box_cox_KWH, lambda_guerrero)) %>% 
  select(YearMonth,KWH)
ResCust %>% autoplot() + 
  geom_point(aes(x = ResCust_forecasted$YearMonth, 
                 y = ResCust_forecasted$KWH),
             color = "blue") +
  labs(x = "Date", y = "KWH Consumption", 
       title = "Predicted KWH Consumption for January 2014 using ARIMA(0,0,3)(2,1,0)[12] w/ drift")

Final Remarks

The final model selection for the ResCust forecast, as well as its corresponding MASE and RMSSE resulting from the training and test set exercise, are listed below.

bind_rows(
  fit_ARIMA_ResCust_box_cox_train %>% forecast(h = 1) %>% 
    accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_ResCust_box_cox_train2 %>% forecast(h = 12) %>% 
    accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE)
)

## # A tibble: 2 × 3
##   .model                                        MASE RMSSE
##   <chr>                                        <dbl> <dbl>
## 1 ARIMA(box_cox ~ pdq(0, 0, 3) + PDQ(2, 1, 0)) 3.10  2.46 
## 2 ARIMA(box_cox ~ pdq(0, 0, 3) + PDQ(2, 1, 0)) 0.857 0.916

report(fit_ARIMA_ResCust_box_cox)

## Series: box_cox 
## Model: ARIMA(0,0,3)(0,1,1)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sma1  constant
##       0.3123  0.0899  0.2014  -0.7118     5e-04
## s.e.  0.0763  0.0844  0.0704   0.0640     1e-04
## 
## sigma^2 estimated as 1.017e-05:  log likelihood=790.27
## AIC=-1568.54   AICc=-1568.06   BIC=-1549.38

addWorksheet(Data_624_Koon_Project_1_excel, "ResCust")
writeData(Data_624_Koon_Project_1_excel, "ResCust", ResCust_forecasted)

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Read in the Data

After reading in the data, I can see the datetime has imported as a double. I converted this date back to datetime format in R, after combining the two datasets. Then, I mutate the overall dataset to extract Date and hour, and group by these columns. Afterwards, I can summarize the mean for each hour.

Waterflow1 <- read_excel("C:\\Users\\Kim\\Documents\\Data624\\Waterflow_Pipe1.xlsx") # read in the file
Waterflow2 <- read_excel("C:\\Users\\Kim\\Documents\\Data624\\Waterflow_Pipe2.xlsx") # read in the file

waterflow <- rbind(Waterflow1,Waterflow2) %>% 
  mutate(DateTime = convertToDateTime(`Date Time`)) %>% 
  select(DateTime,WaterFlow) %>% 
  mutate(Date = as_date(DateTime), Hour = hour(DateTime)) %>%
  group_by(Date,Hour) %>%
  summarize(meanWaterFlow = mean(WaterFlow), .groups = "drop") %>%
  mutate(DateHour = as.POSIXct(paste(Date, Hour), format = "%Y-%m-%d %H")) %>%
  select(DateHour, meanWaterFlow)

Convert to tsibble

I turned the waterflow data into into a tsibble and autoplot it to see the general shape of the data. Using is.na, I can see there are no NA values. However, using fill_gaps() I can see there is an implicit NA value at 2015-11-01 01:00:00. I used gg_season() to see if there were any typical daily trends, and there does not seem to be. I then graphed the surrounding values to view the local shape and trend. Given this, I will impute the single value based on linear interpolation.

waterflow <- waterflow %>% 
  as_tsibble(index = DateHour) %>%
  fill_gaps()
waterflow %>% autoplot(meanWaterFlow)

any(is.na(waterflow$meanWaterFlow))

## [1] TRUE

waterflow %>% filter(is.na(meanWaterFlow))

## # A tsibble: 1 x 2 [1h] <?>
##   DateHour            meanWaterFlow
##   <dttm>                      <dbl>
## 1 2015-11-01 01:00:00            NA

gg_season(waterflow, period = 24) # no obvious trend by hour

waterflow %>% filter(as_date(DateHour) == '2015-11-01' | 
                       as_date(DateHour) == '2015-10-31') %>% 
  autoplot()

waterflow$meanWaterFlow <-na_interpolation(waterflow$meanWaterFlow, option = "linear")
waterflow %>% filter(as_date(DateHour) == '2015-11-01' | 
                       as_date(DateHour) == '2015-10-31') %>% 
  autoplot()

waterflow %>% ggplot(aes(x = meanWaterFlow)) + geom_histogram()

waterflow %>% autoplot()

Determine Stationarity

I can see the dataset requires one order of non-seasonal differencing. No seasonal differencing is required. The variance of the differences is not constant, as the variance is much smaller in the beginning of the plot and then becomes larger as time passes. After differencing, I check the KPSS test again on the differenced data, and can see the data is still not stationary

waterflow %>% features(meanWaterFlow, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

waterflow %>% features(meanWaterFlow, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1

waterflow %>% features(meanWaterFlow, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      4.08        0.01

waterflow_ndiff <- waterflow %>% 
  mutate(meanWaterFlow_ndiff = difference(meanWaterFlow))
waterflow_ndiff %>% 
  gg_tsdisplay(meanWaterFlow_ndiff,plot_type = "partial")

waterflow_ndiff %>% features(meanWaterFlow, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      4.08        0.01

Box-Cox Transformation

To stabilize the variance, I perform a Box-Cox transformation on the data. I can see that there are a few values that differ significantly from the rest of the dataset, likely due to having an initially low value before the transformation. On the whole, the variance does appear more constant and aligned with the initial week presented in the dataset.

lambda_guerrero <- waterflow %>% 
  features(meanWaterFlow, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
waterflow_box_cox <- waterflow %>% 
  mutate(box_cox = box_cox(meanWaterFlow,lambda_guerrero)) %>% 
  select(box_cox)
waterflow_box_cox %>% autoplot()

waterflow_box_cox %>% ggplot(aes(x = box_cox)) + geom_histogram()

After differencing, I perform the KPSS test on the differenced data of the Box-Cox transformed dataset. I can see that the KPSS is 0.1, which means I do not reject the null hypothesis that the data is stationary.

waterflow_box_cox %>% features(box_cox, unitroot_nsdiffs)

## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

waterflow_box_cox %>% features(box_cox, unitroot_ndiffs)

## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1

waterflow_box_cox %>% features(box_cox, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      1.91        0.01

waterflow_box_cox_ndiff <- waterflow_box_cox %>% mutate(box_cox_meanWaterFlow_nsdiff = difference(box_cox))
waterflow_box_cox_ndiff %>% gg_tsdisplay(box_cox_meanWaterFlow_nsdiff,plot_type = "partial")

waterflow_box_cox_ndiff %>% features(box_cox_meanWaterFlow_nsdiff, unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1   0.00507         0.1

ARIMA model on the transformed dataset

As such, I move on to creating a forecast. A week forward forecast on hourly data means predicting 168 values. I fit an ARIMA(0,1,1) model to the data. Looking at the residuals, I can see that the variance has some spikes but is otherwise fairly constant. There are two autocorrelations in the ACF plot that are above the critical values, which, at 6%, is above the 5% threshold. The distribution of residuals also appears left skewed.

fit_ARIMA_waterflow_box_cox <- 
  waterflow_box_cox %>% 
  model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_waterflow_box_cox)

## Series: box_cox 
## Model: ARIMA(0,1,1) 
## 
## Coefficients:
##           ma1
##       -0.9801
## s.e.   0.0064
## 
## sigma^2 estimated as 0.01033:  log likelihood=867.3
## AIC=-1730.59   AICc=-1730.58   BIC=-1720.78

fc_ARIMA_waterflow_box_cox <- 
  forecast(fit_ARIMA_waterflow_box_cox, h = 168)
fc_ARIMA_waterflow_box_cox  %>% 
  autoplot() + autolayer(waterflow_box_cox)

gg_tsresiduals(fit_ARIMA_waterflow_box_cox)

augment(fit_ARIMA_waterflow_box_cox) %>% 
  features(.innov, ljung_box, lag = 10, dof = 1)

## # A tibble: 1 × 3
##   .model                                                  lb_stat lb_pvalue
##   <chr>                                                     <dbl>     <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE)    8.68     0.467

ETS model on the non-transformed dataset

I can fit an ETS(M,N,N) model to the non-transformed dataset as well, since ETS does not require stationary data. When looking at the residuals, it appears homoscedastic although there are 5 autocorrelations shown above critical values in the ACF plot. The distribution of residuals appears normal. When performing the ljung_box test, the p-value is shown to be above 0.05 and this model can be considered as a potential fit for further analysis.

fit_ETS_waterflow <- waterflow %>% model(ETS(meanWaterFlow))
report(fit_ETS_waterflow)

## Series: meanWaterFlow 
## Model: ETS(M,N,N) 
##   Smoothing parameters:
##     alpha = 0.03724136 
## 
##   Initial states:
##      l[0]
##  20.64244
## 
##   sigma^2:  0.1577
## 
##      AIC     AICc      BIC 
## 12179.54 12179.57 12194.27

fc_ETS_waterflow <- forecast(fit_ETS_waterflow, h = 168)
fc_ETS_waterflow  %>% autoplot() + autolayer(waterflow)

gg_tsresiduals(fit_ETS_waterflow)

augment(fit_ETS_waterflow) %>% 
  features(.innov, ljung_box, lag = 10)

## # A tibble: 1 × 3
##   .model             lb_stat lb_pvalue
##   <chr>                <dbl>     <dbl>
## 1 ETS(meanWaterFlow)    3.89     0.952

ETS model on the transformed dataset

Fitting an ETS(M,N,N) model to the transformed data, I can see that the variance of the residuals is similar to the ARIMA model on the transformed data. Similarly, there are two spikes in the ACF plot and the distribution of residuals is left skewed. When performing the ljung_box test, the p-value is shown to be above 0.05 and this model can be considered as a potential fit for further analysis.

fit_ETS_waterflow_box_cox <- 
  waterflow_box_cox %>% model(ETS(box_cox))
report(fit_ETS_waterflow_box_cox)

## Series: box_cox 
## Model: ETS(M,N,N) 
##   Smoothing parameters:
##     alpha = 0.0207642 
## 
##   Initial states:
##     l[0]
##  1.54057
## 
##   sigma^2:  0.004
## 
##      AIC     AICc      BIC 
## 2328.562 2328.586 2343.292

fc_ETS_waterflow_box_cox <- 
  forecast(fit_ETS_waterflow_box_cox, h = 168) 
fc_ETS_waterflow_box_cox  %>% 
  autoplot() + autolayer(waterflow_box_cox)

gg_tsresiduals(fit_ETS_waterflow_box_cox)

augment(fit_ETS_waterflow_box_cox) %>% 
  features(.innov, ljung_box, lag = 10)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 ETS(box_cox)    8.78     0.553

Comparison of models to determine best fit

All three models assessed, ETS(M,N,N) for the non-transformed data, and ETS(M,N,N) and ARIMA(0,1,1) for the transformed data, will be considered for evaluation.

waterflow_train <- waterflow %>% 
  filter(DateHour <= as_datetime ("2015-11-26 16:00:00"))
waterflow_box_cox_train <- waterflow_box_cox %>% 
  filter(DateHour <= as_datetime ("2015-11-26 16:00:00"))

ETS(M,N,N) on the non-transformed data:

fit_ETS_waterflow_train <- waterflow_train %>% 
  model(ETS(meanWaterFlow ~ error("M")+trend("N")+season("N")))
report(fit_ETS_waterflow_train)

## Series: meanWaterFlow 
## Model: ETS(M,N,N) 
##   Smoothing parameters:
##     alpha = 0.04155465 
## 
##   Initial states:
##      l[0]
##  20.83852
## 
##   sigma^2:  0.1482
## 
##      AIC     AICc      BIC 
## 9842.374 9842.404 9856.535

ETS(M,N,N) on the transformed data:

fit_ETS_waterflow_box_cox_train <- waterflow_box_cox_train %>% 
  model(ETS(box_cox ~ error("M")+trend("N")+season("N")))
report(fit_ETS_waterflow_box_cox_train)

## Series: box_cox 
## Model: ETS(M,N,N) 
##   Smoothing parameters:
##     alpha = 0.0216675 
## 
##   Initial states:
##      l[0]
##  1.541633
## 
##   sigma^2:  0.0037
## 
##      AIC     AICc      BIC 
## 1697.246 1697.275 1711.406

ARIMA(0,1,1) on the transformed data:

fit_ARIMA_waterflow_box_cox_train <- waterflow_box_cox_train %>% 
  model(ARIMA(box_cox ~ pdq(0,1,1)))
report(fit_ARIMA_waterflow_box_cox_train)

## Series: box_cox 
## Model: ARIMA(0,1,1) 
## 
## Coefficients:
##           ma1
##       -0.9793
## s.e.   0.0067
## 
## sigma^2 estimated as 0.009492:  log likelihood=752.17
## AIC=-1500.33   AICc=-1500.32   BIC=-1490.89

As a baseline, I also included a MEAN and NAIVE model, given that the plots of the forecast do not appear complex. Potentially, either of these simple models may provide comparatively good prediction.

fit_MEAN_waterflow_train <- waterflow_train %>% 
  model(MEAN(meanWaterFlow))
report(fit_MEAN_waterflow_train)

## Series: meanWaterFlow 
## Model: MEAN 
## 
## Mean: 35.3898 
## sigma^2: 235.6841

fit_NAIVE_waterflow_train <- waterflow_train %>% 
  model(NAIVE(meanWaterFlow))
report(fit_NAIVE_waterflow_train)

## Series: meanWaterFlow 
## Model: NAIVE 
## 
## sigma^2: 390.7018

From the comparison below, we can see that the ETS(M,N,N) model on the non-transformed data had a lower RMSSE. However, autocorrelations for the ETS(M,N,N) model on the non-transformed data was seen to exceed the critical values at many points. As such, I would suggest either model from the transformed dataset, as both produced the same predictions and therefore have the same residuals.

bind_rows(
  fit_ETS_waterflow_train %>% forecast(h = 168) %>% 
    accuracy(waterflow) %>% select(.model, MASE,RMSSE),
  fit_ETS_waterflow_box_cox_train %>% forecast(h = 168) %>% 
    accuracy(waterflow_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_waterflow_box_cox_train %>% forecast(h = 168) %>% 
    accuracy(waterflow_box_cox) %>% select(.model, MASE,RMSSE),
  fit_MEAN_waterflow_train %>% forecast(h = 168) %>% 
    accuracy(waterflow) %>% select(.model, MASE,RMSSE),
  fit_NAIVE_waterflow_train %>% forecast(h = 168) %>% 
    accuracy(waterflow) %>% select(.model, MASE,RMSSE)
)

## # A tibble: 5 × 3
##   .model                                                              MASE RMSSE
##   <chr>                                                              <dbl> <dbl>
## 1 "ETS(meanWaterFlow ~ error(\"M\") + trend(\"N\") + season(\"N\"))" 0.952 0.887
## 2 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"N\"))"       0.933 0.890
## 3 "ARIMA(box_cox ~ pdq(0, 1, 1))"                                    0.933 0.890
## 4 "MEAN(meanWaterFlow)"                                              0.961 0.902
## 5 "NAIVE(meanWaterFlow)"                                             1.13  1.07

gg_tsresiduals(fit_ARIMA_waterflow_box_cox)

gg_tsresiduals(fit_ETS_waterflow)

gg_tsresiduals(fit_ETS_waterflow_box_cox)

Since the transformed data had the best fit model, the data must be transformed back in order to interpret the one week ahead waterflow forecasts.

fc_ETS_waterflow_box_cox

## # A fable: 168 x 4 [1h] <?>
## # Key:     .model [1]
##    .model       DateHour           
##    <chr>        <dttm>             
##  1 ETS(box_cox) 2015-12-03 17:00:00
##  2 ETS(box_cox) 2015-12-03 18:00:00
##  3 ETS(box_cox) 2015-12-03 19:00:00
##  4 ETS(box_cox) 2015-12-03 20:00:00
##  5 ETS(box_cox) 2015-12-03 21:00:00
##  6 ETS(box_cox) 2015-12-03 22:00:00
##  7 ETS(box_cox) 2015-12-03 23:00:00
##  8 ETS(box_cox) 2015-12-04 00:00:00
##  9 ETS(box_cox) 2015-12-04 01:00:00
## 10 ETS(box_cox) 2015-12-04 02:00:00
## # ℹ 158 more rows
## # ℹ 2 more variables: box_cox <dist>, .mean <dbl>

fc_ETS_waterflow_box_cox  %>% 
  autoplot(level = FALSE) + autolayer(waterflow_box_cox)

waterflow_forecasted <- 
  fc_ETS_waterflow_box_cox %>% 
  rename(box_cox_waterflow = .mean) %>%
  mutate(meanWaterFlow = inv_box_cox(box_cox_waterflow, lambda_guerrero)) %>% 
  select(DateHour,meanWaterFlow)
waterflow_forecasted %>% 
  autoplot(color = "blue") + autolayer(waterflow) +
  labs(x = "Date", y = "Waterflow", 
       title = "One week ahead waterflow predictions")

Final Remarks

The final model selection for the Waterflow forecasts, as well as its corresponding MASE and RMSSE resulting from the training and test set exercise, are listed below. Only the forecasts for the ETS(M,N,N) of the transformed data is shown in the Excel outputs, since they are the same between the ETS and ARIMA model.

bind_rows(
  fit_ETS_waterflow_box_cox_train %>% forecast(h = 168) %>% 
    accuracy(waterflow_box_cox) %>% select(.model, MASE,RMSSE),
  fit_ARIMA_waterflow_box_cox_train %>% forecast(h = 168) %>% 
    accuracy(waterflow_box_cox) %>% select(.model, MASE,RMSSE)
)

## # A tibble: 2 × 3
##   .model                                                        MASE RMSSE
##   <chr>                                                        <dbl> <dbl>
## 1 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"N\"))" 0.933 0.890
## 2 "ARIMA(box_cox ~ pdq(0, 1, 1))"                              0.933 0.890

report(fit_ETS_waterflow_box_cox_train)

## Series: box_cox 
## Model: ETS(M,N,N) 
##   Smoothing parameters:
##     alpha = 0.0216675 
## 
##   Initial states:
##      l[0]
##  1.541633
## 
##   sigma^2:  0.0037
## 
##      AIC     AICc      BIC 
## 1697.246 1697.275 1711.406

report(fit_ARIMA_waterflow_box_cox_train)

## Series: box_cox 
## Model: ARIMA(0,1,1) 
## 
## Coefficients:
##           ma1
##       -0.9793
## s.e.   0.0067
## 
## sigma^2 estimated as 0.009492:  log likelihood=752.17
## AIC=-1500.33   AICc=-1500.32   BIC=-1490.89

addWorksheet(Data_624_Koon_Project_1_excel, "waterflow")
writeData(Data_624_Koon_Project_1_excel, "waterflow", waterflow_forecasted)
saveWorkbook(Data_624_Koon_Project_1_excel,
             "C:\\Users\\Kim\\Documents\\Data624\\DATA_624_Project_1.xlsx", 
             overwrite = TRUE)

Appendix of Mean and Naive model fit

fit_MEAN_waterflow <- waterflow %>% 
  model(MEAN(meanWaterFlow))
report(fit_MEAN_waterflow)

## Series: meanWaterFlow 
## Model: MEAN 
## 
## Mean: 35.9727 
## sigma^2: 245.5421

fc_MEAN_waterflow <- 
  forecast(fit_MEAN_waterflow, h = 168)
fc_MEAN_waterflow  %>% autoplot() + 
  autolayer(waterflow)

gg_tsresiduals(fit_MEAN_waterflow)

fit_NAIVE_waterflow <- waterflow %>% 
  model(NAIVE(meanWaterFlow))
report(fit_NAIVE_waterflow)

## Series: meanWaterFlow 
## Model: NAIVE 
## 
## sigma^2: 418.263

fc_NAIVE_waterflow <- 
  forecast(fit_NAIVE_waterflow, h = 168)
fc_NAIVE_waterflow  %>% autoplot() + 
  autolayer(waterflow)

gg_tsresiduals(fit_NAIVE_waterflow)

fit_MEAN_waterflow_box_cox <- waterflow_box_cox %>% 
  model(MEAN(box_cox))
report(fit_MEAN_waterflow_box_cox)

## Series: box_cox 
## Model: MEAN 
## 
## Mean: 1.5899 
## sigma^2: 0.0107

fc_MEAN_waterflow_box_cox <- 
  forecast(fit_MEAN_waterflow_box_cox, h = 168) 
fc_MEAN_waterflow_box_cox  %>% autoplot() + 
  autolayer(waterflow_box_cox)

gg_tsresiduals(fit_MEAN_waterflow_box_cox)

fit_NAIVE_waterflow_box_cox <- waterflow_box_cox %>% 
  model(NAIVE(box_cox))
report(fit_NAIVE_waterflow_box_cox)

## Series: box_cox 
## Model: NAIVE 
## 
## sigma^2: 0.0204

fc_NAIVE_waterflow_box_cox <- 
  forecast(fit_NAIVE_waterflow_box_cox, h = 168) 
fc_NAIVE_waterflow_box_cox  %>% autoplot() + 
  autolayer(waterflow_box_cox)

gg_tsresiduals(fit_NAIVE_waterflow_box_cox)

Data 624 Project 1

Kim Koon

2024-10-27

Load packages

Part A - ATM Forecast, ATM624Data.xlsx

Read in the Data

Separate the Data

ATM1 Analysis

Convert to tsibble and impute NA values

ETS model on the non-transformed dataset

ARIMA model on the non-transformed dataset

Box-Cox transformation

ETS model on the transformed dataset

ARIMA model on the transformed dataset

Comparison of models to determine best fit

ATM2 Analysis

ETS model on the non-transformed dataset

ARIMA model on the non-transformed dataset

Box-Cox transformation

ETS model on the transformed dataset

ARIMA model on the transformed dataset

Comparison of models to determine best fit

ATM3 Analysis

ATM4 Analysis

ETS model on the non-transformed dataset

ARIMA model on the non-transformed dataset

Box-Cox transformation

ETS model on the transformed dataset

ARIMA model on the transformed dataset

Comparison of models to determine best fit

Final remarks

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Read in the Data

View the data

ETS model on the non-transformed dataset

ARIMA model on the non-transformed dataset

Box-Cox transformation

ETS model on the transformed dataset

ARIMA model on the transformed dataset

Comparison of models to determine best fit

Final Remarks

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Read in the Data

Convert to tsibble

Determine Stationarity

Box-Cox Transformation

ARIMA model on the transformed dataset

ETS model on the non-transformed dataset

ETS model on the transformed dataset

Comparison of models to determine best fit

Final Remarks

Appendix of Mean and Naive model fit