Data 624 PROJECT 1 (Week 9) Forecasting

ATM, Residential Power and Waterpipe Flow

Alexander Ng

Due 10/31/2021

# Conditional Code Evaluation is managed here.

runPartA = TRUE
runPartB = TRUE
runPartC = TRUE
runOncePartC = FALSE  # Used to load and convert XLSX files that cause unsuppressable warnings.

cross_validation_stepsize = 1  # Make this equal 1 for final render but a larger integer while editing.

Part A ATM Cash Withdrawal Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel readable file.

Part A Solution

Executive Summary

We forecast 4 ATM Cash withdrawal time series for each day of May 2010.
Each of the 4 time series exhibited different statistical properties and posed different technical challenges for building a time series model. We adopt a single model selection process that considers 3 model classes: SARIMA with no drift, SARIMA with drift, and ETS. To validate the model choice, we run a 6 month cross validation procedure with 153 consecutive backtests evaluating RMSE and MAE. We select SARIMA with no drift for ATM1 and ATM2, reuse the ATM1 forecast for ATM3, and select ETS on a Box-Cox transformed series for ATM4.

We output the forecast in a single Excel file.

Data Wrangling

We wrangle and clean the input ATM data to a suitable form.

There are data wrangling issues which we fix; the code sketch after this list implements them:

  1. Extra rows of dates lacking any data. These are simply omitted by dropping rows where ATM field is blank.

  2. The tsibble date interval is irregular but needs to be regular in order to apply time series models. Two changes are required to do this:

    • By changing DATE from a datetime to a calendar day, we regularize the tsibble to a daily interval.

    • We use fill_gaps to add missing rows. For ATM1 and ATM2, a few dates exist for which no observations of Cash withdrawal are recorded.
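These steps are implemented in the following chunk, condensed from the Code appendix.

raw_ATM624Data <- read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric"))

# Drop the rows with no ATM identifier, keep an irregular tsibble temporarily
atmdat_tmp = raw_ATM624Data %>% filter(!is.na(ATM)) %>% 
  as_tsibble(key = ATM, index = DATE, regular = FALSE)

# Fill the missing dates and regularize to a daily index
atmdat = atmdat_tmp %>% fill_gaps() %>% mutate(DATE = date(DATE)) %>% 
  as_tsibble(key = ATM, index = DATE)

# Wide form with one column per ATM, used for model fitting
piv = atmdat %>% pivot_wider(names_from = ATM, values_from = Cash)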

We plot the raw time series for each ATM in order to visually explore missing data and outliers.

From the above time series plot, we observe that:

  1. ATM1 and ATM2 show seasonality.
  2. ATM2: Some trend and seasonality may exist.
  3. ATM3 is abnormal. Most of the observations are zero except the last 3 days. Those last three observations are identical to ATM1 observations on the same date as shown in the table below. Therefore, we predict future ATM3 withdrawals based on the ATM1 predictions.
  4. ATM4 contains one outlier which obscures the statistical behavior of the time series since the \(y\) values are squeezed down.
DATE ATM1 ATM2 ATM3 ATM4
2010-04-25 109 89 0 542.28
2010-04-26 74 11 0 403.84
2010-04-27 4 2 0 13.70
2010-04-28 96 107 96 348.20
2010-04-29 82 89 82 44.25
2010-04-30 85 90 85 482.29

Data Imputation

There are only 6 observations with missing or suspicious values for Cash, so we can adopt a simple strategy to impute their values. Due to the absence of any clear trend but strong weekly seasonality, a good imputation strategy is to use the value from the same day of the previous week. This rule works as long as its application is well-defined. It is undefined if data is missing in the first 7 days of the series, since no prior week's value exists. However, that never happens.

More precisely,

\[Cash^{*}(ATM[x], T) = Cash(ATM[x], T- 7)\] where \(T\) is the date where data is imputed, \(ATM[x]\) is the ATM machine. \(T-7\) is 7 calendar days before \(T\).

There are a small number of missing values for ATM1 and ATM2. Three observations in June 2009 have missing Cash for ATM1, and two observations in June 2009 have missing Cash for ATM2. There is one outlier for ATM4.
These observations are listed in the table below.

Outliers and Missing Values

DATE ATM Cash id
2009-06-13 ATM1 NA 44
2009-06-16 ATM1 NA 47
2009-06-22 ATM1 NA 53
2009-06-18 ATM2 NA 414
2009-06-24 ATM2 NA 420
2010-02-09 ATM4 10920 1380

To justify imputing the outlier for ATM4, we observe that ATM4 has $47,400 average daily withdrawals. According to NPR, an ATM can hold at most $200,000 but usually holds tens of thousands in currency. (https://www.npr.org/templates/story/story.php?storyId=125621359) The ATM4 outlier implies a Cash withdrawal of $1,092,000 in one day. This is more than 5 times the maximum cash suggested by the NPR story and is most likely an error.
So we replace the outlier value with the Cash value from that of a week earlier.

The changes above are implemented in the code below.

# Impute each flagged value (indexed by the id column of the table above)
# with the Cash value recorded 7 days earlier.
atmdat[44,  "Cash"] =  96
atmdat[47,  "Cash"] = 145
atmdat[53,  "Cash"] = 106
atmdat[414, "Cash"] = 7
atmdat[420, "Cash"] = 24
atmdat[1380, "Cash"] = 153.2

Previously, we stated that there are strong weekly seasonal effects without showing evidence. To avoid repetition, we show below boxplots grouped by day of the week for each ATM. The weekly seasonal effects are clearly visible in the changes of the medians and IQRs between the days of the week. For example, there is a strong decline on Thursday for all ATMs.
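The boxplots are produced by grouping on the day of week; the chunk below is reproduced from the Code appendix.

atmdat %>% 
  mutate( wDay = lubridate::wday(DATE, label=TRUE)) %>% 
  as_tibble() %>% 
  ggplot() + geom_boxplot(aes(y=Cash, x = wDay) ) + 
  facet_wrap(~ATM, scales = "free") + 
  ggtitle("Withdrawals by Day of Week After Imputation", subtitle="Grouped by ATM")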

We evaluate possible seasonal effects at the weekly and monthly level to estimate the missing values. When we examine the day of week effect at each ATM, we see a dip on Thursday at every ATM. There is strong evidence of weekly seasonality.

Next, we ask whether there is evidence of quarterly or annual seasonality. The boxplots below for each ATM grouped by month show no such evidence. However, we acknowledge there is clearly time variation in the withdrawals but insufficient data to identify longer seasonal periods.

It’s worth rerunning the time series plots after the imputation. We see below the significant change is in the ATM4 plot. Others look virtually unchanged. Without its outlier, the updated ATM4 plot shows the behavior of the Cash withdrawals more clearly. We see significant but constant volatility in the ATM4 Cash withdrawals with right skewness in its distribution.

We examine the pairwise relationships between the four ATM series in the ggpairs plot below.

STL Decomposition for ATM data

We run an STL decomposition to confirm the absence of a deterministic trend and to check the STL residuals before model building.
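The decompositions are produced with fable's STL model; the call for ATM1 (reproduced from the Code appendix) is shown below, and the ATM2 and ATM4 decompositions use analogous calls.

piv %>% model(ATM1 = STL(ATM1)) %>% components() %>% autoplot() + labs(title="STL Decomposition for ATM1")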

The plot below for ATM1 shows no evidence of a deterministic trend but strong evidence of weekly seasonality. The STL residuals look acceptable even though volatility appears to cluster.

The below plot for ATM2 finds no deterministic trend in ATM2, strong evidence of seasonality but some volatility clustering in the residual.

The ATM4 decomposition below confirms weekly seasonality and suggests right skewness which we will correct with a Box-Cox transformation.

Model Building and Evaluation Process

We adopt the same process of model building, evaluation and forecasting for each of the three ATM series. The process begins with considering three candidate classes of time series models:

  • SARIMA with no drift
  • SARIMA with drift
  • ETS (exponential smoothing)

We experimented offline with different orders of each model class in conjunction with the automated search algorithm to seek optimal fit. For each model class, we select one model with chosen order and fit it to the entire dataset.

Next, we compare their May 2010 forecasts and inspect their model residual plots for suitability. Afterwards, we run a time series cross validation process to produce many out-of-sample forecasts for each of the 3 models. The RMSE and MAE of each model are evaluated graphically.
Together with the early assessments of model residuals, the cross validation results enable us to choose the desired model to generate forecasts.

We apply this process to ATM1, ATM2 and ATM4 in succession. The resulting forecasts for all 4 ATMs are exported in one Excel file.

Model Building for ATM1

We explain the model building and evaluation process in detail for ATM1.

Looking at the raw time series ACF plots below, we observe significant autocorrelations and partial autocorrelations at multiple lags.

forecast_horizon = 31

fit_atm1 = piv %>% model( 
  sarima_nd = ARIMA(ATM1 ~ 0 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
  sarima_d = ARIMA(ATM1 ~ 1 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
  ets = ETS( ATM1 ~ error("A") + trend("N") + season("A")  )
    ) 

fc_atm1 = fit_atm1 %>% forecast(h=forecast_horizon) 

The SARIMA model with no drift has order:

\[ARIMA ~ (p=0,d=0,q=1) + (P=0, D=1, Q=2)_{7}\]

The fitted model specification is:

\[(1 - B^7)X_{t} = ( 1 + \theta_1 B)(1 + \Theta_{1}B^7 + \Theta_{2}B^{14} ) \epsilon_{t}\]

The model parameters are listed below.

report(fit_atm1 %>% select(sarima_nd))
## Series: ATM1 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1868  -0.5762  -0.1096
## s.e.  0.0548   0.0507   0.0496
## 
## sigma^2 estimated as 556.4:  log likelihood=-1640.02
## AIC=3288.04   AICc=3288.16   BIC=3303.57

The SARIMA model with drift has the same order but includes a small drift term.

report(fit_atm1 %>% select(sarima_d))
## Series: ATM1 
## Model: ARIMA(0,0,1)(0,1,2)[7] w/ drift 
## 
## Coefficients:
##          ma1     sma1     sma2  constant
##       0.1867  -0.5765  -0.1097   -0.0883
## s.e.  0.0548   0.0507   0.0496    0.4869
## 
## sigma^2 estimated as 557.9:  log likelihood=-1640.01
## AIC=3290.02   AICc=3290.19   BIC=3309.42

The exponential smoothing model has an additive error, no trend and an additive seasonal term with weekly period.

report(fit_atm1 %>% select(ets))
## Series: ATM1 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0206385 
##     gamma = 0.3301984 
## 
##   Initial states:
##      l[0]      s[0]    s[-1]    s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  79.39605 -62.73693 6.754848 13.60636 10.95823 12.94413 10.40877 8.064604
## 
##   sigma^2:  578.8824
## 
##      AIC     AICc      BIC 
## 4486.151 4486.772 4525.150

The forecasts for the month of May 2010 are shown in the plot below for all 3 models. They do not differ sufficiently to allow model selection based on forecast reasonableness. None of the models are entirely satisfactory because they don’t seem to generate a stochastic trend. We observe that, even in the SARIMA model with drift, the drift term is very small. The weekly periodicity seems to be well captured by the model specification.

Looking at their innovation residuals for white noise, we see that SARIMA with no drift produces white-noise-like residuals. The ACF of the innovation residuals stays inside the dashed lines. The histogram looks relatively symmetric although slightly fat-tailed. The time series plot of the residuals looks reasonable.

The SARIMA model with drift shows similar characteristics as that without drift.

The ETS model residuals look inferior. We see persistent autocorrelations at lags 1 and 6 below. On the basis of the model residuals, we can rule out the ETS model from consideration.

Cross Validation for ATM1

The cross validation will create forecasts using a procedure called “evaluation on a rolling forecasting origin.” The procedure is described in section 5.10 of Hyndman and Athanasopoulos (https://otexts.com/fpp3/tscv.html).

Within the full dataset, define \(T_{init}\) as the first date (May 1, 2009) of the full dataset and \(T_{final}\) as the last date (Apr 30, 2010).

We have to define the \(N\) test folds. Each of the \(N=153\) test folds allows us to evaluate prediction accuracy on an out-of-sample period of 30 days. To ensure adequate training data to fit the model, we use at least 50% of the full dataset to train the model. Define \(T_1\) as the origin of the first test fold. We set \(T_1\) to be Oct 29, 2009 - approximately 182 days after \(T_{init}\).

The final test fold has its forecasting origin on \(T_N\) defined as March 31, 2010. \(T_N\) is chosen to ensure the out-of-sample testing period ends by \(T_{final}\). The table below illustrates the organization of the folds which is identical for all ATMs.

Cross Validation Date Ranges

Fold Train Start Train End Test Start Test End
\(T_1\) May 1, 2009 Oct 29, 2009 Oct 30, 2009 Nov 28, 2009
\(T_2\) May 1, 2009 Oct 30, 2009 Oct 31, 2009 Nov 29, 2009
\(T_N\) May 1, 2009 Mar 31, 2010 April 1, 2010 April 30, 2010

The following code implements evaluation on a rolling forecasting origin for 3 models using the stretch_tsibble function call to generate the multi-fold dataset. We will evaluate the 3 models based on RMSE and MAE. For each model we plot the model error rate either \(RMSE\) or \(MAE\) on the \(y\) axis against the forecasting horizon \(h\) on the horizontal axis.

ATM1_stretch =  piv %>%  slice(1:(n()-31)) %>% 
  select( ATM1) %>% 
  stretch_tsibble(.init = 182, .step = cross_validation_stepsize )

bt_atm1_sarima = ATM1_stretch %>% 
  model( 
    sarima_nd = ARIMA(ATM1 ~ 0 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
    sarima_d = ARIMA(ATM1 ~ 1 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
    ets = ETS( ATM1 ~ error("A") + trend("N") + season("A")  )
) %>% forecast( h = 30) 
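The per-horizon RMSE and MAE shown in the plots are computed from these backtest forecasts; the chunk below is reproduced from the Code appendix.

# Index each forecast within its fold by horizon h, then compute accuracy by horizon and model.
bb = bt_atm1_sarima %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_arima = bb %>% fabletools::accuracy( piv, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE ) 

rmse_bt_arima %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) ) + labs(title = "RMSE of Cross Validated Models for ATM1")
rmse_bt_arima %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) ) + labs(title = "MAE of Cross Validated Models for ATM1")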

Based on the above two plots, we conclude that SARIMA with no drift produced slightly better cross-validated errors (RMSE and MAE) than the other two models.

Model Selection for ATM1

On the basis of cross-validation and examining model residuals, we adopt the SARIMA model with no drift (sarima_nd). It produced slightly better cross-validated model errors, is more parsimonious than SARIMA with drift, and beats ETS on the basis of model residuals. We chose this particular specification by using the fable ARIMA model search algorithm to find the orders of the seasonal and non-seasonal terms. However, the selected orders also make intuitive sense.

Model Building for ATM2

forecast_horizon = 31

fit_atm2 = piv %>% model( 
  sarima_nd = ARIMA(ATM2  ) ,
  sarima_d = ARIMA(ATM2 ~ 1 + pdq(2,0,2) + PDQ(0, 1, 1, period = 7 ) ) ,
  ets = ETS( ATM2 ~ error("A") + trend("N") + season("A") )
) 

fc_atm2 = fit_atm2 %>% forecast(h=forecast_horizon) 

The SARIMA model with no drift has order:

\[ARIMA ~ (p=2,d=0,q=2) + (P=0, D=1, Q=1)_{7}\]

The fitted model specification is:

\[(1-\phi_1B-\phi_2B^2)(1 - B^7)X_{t} = ( 1 + \theta_1 B +\theta_2 B^2)(1 + \Theta_{1}B^7 ) \epsilon_{t}\]

The model parameters are listed below.

report(fit_atm2 %>% select(sarima_nd))
## Series: ATM2 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4236  -0.9122  0.4590  0.7992  -0.7487
## s.e.   0.0548   0.0423  0.0848  0.0576   0.0388
## 
## sigma^2 estimated as 586:  log likelihood=-1648.63
## AIC=3309.25   AICc=3309.49   BIC=3332.54

The SARIMA model with drift has the same order but includes a moderate drift term.

report(fit_atm2 %>% select(sarima_d))
## Series: ATM2 
## Model: ARIMA(2,0,2)(0,1,1)[7] w/ drift 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1  constant
##       -0.4247  -0.9146  0.4600  0.7996  -0.7561   -0.7255
## s.e.   0.0525   0.0411  0.0823  0.0565   0.0392    0.7475
## 
## sigma^2 estimated as 585.9:  log likelihood=-1648.17
## AIC=3310.35   AICc=3310.67   BIC=3337.51

The exponential smoothing model has an additive error, no trend and an additive seasonal term with weekly period.

report(fit_atm2 %>% select(ets))
## Series: ATM2 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.003771524 
##     gamma = 0.3621959 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]   s[-2]    s[-3]    s[-4]    s[-5]   s[-6]
##  72.74555 -53.45142 -42.43341 16.7643 6.024302 26.48906 17.09206 29.5151
## 
##   sigma^2:  624.0025
## 
##      AIC     AICc      BIC 
## 4513.546 4514.168 4552.545

The forecasts for the month of May 2010 are shown in the plot below for all 3 models of ATM2. Just as for ATM1, the forecasts look very similar, and they capture the level and seasonality of the prior data. From a forecasting perspective, they are all reasonable candidates.

Looking at their innovation residuals for white noise, we see that SARIMA with no drift produces white-noise-like residuals. The ACF of the innovation residuals stays inside the dashed lines. The histogram looks relatively symmetric although slightly fat-tailed. The time series plot of the residuals shows recurring downward spikes, but these look cyclical rather than periodic.

The SARIMA model with drift shows similar characteristics as that without drift.

The ETS model residuals look inferior for ATM2, consistent with our findings for ATM1. On the basis of the model residuals, we can rule out the ETS model from consideration.

Cross Validation of ATM2

Following the same cross validation procedure used for ATM1, we produce diagnostic plots of the model error in forecasting ATM2 Cash withdrawals. The code chunk is shown below. Although the SARIMA model order and parameters differ between ATM1 and ATM2, the conclusion is similar: a SARIMA model with no drift is preferred to one with drift and to the ETS model.

ATM2_stretch =  piv %>%  
  slice(1:(n()-31)) %>% 
  select( ATM2) %>% 
  stretch_tsibble(.init = 182, .step = cross_validation_stepsize )

bt_atm2_sarima = ATM2_stretch %>% 
  model( 
    sarima_nd = ARIMA(ATM2 ~ 0 +  pdq(2,0,2) + PDQ(0, 1, 1, period = 7 )) ,
    sarima_d = ARIMA(ATM2 ~ 1 +   pdq(2,0,2) + PDQ(0, 1, 1, period = 7 )) ,
    ets = ETS( ATM2 ~ error("A") + trend("N") + season("A")  )
  ) %>% forecast( h = 30) 

Based on the two plots above, on balance we prefer the SARIMA model with no drift, with the order specified above, for ATM2.

Model Building for ATM4

We observe some autocorrelations for ATM4 at lags 7 and 14. Also, we observe more asymmetry in ATM4 than in ATM1 or ATM2.

The Box-Cox transformation uses \(\lambda = 0.399\) obtained by Guerrero's method. Evaluating the time series plot and ACF of the transformed series shows that the skewness is reduced significantly.
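The transformation parameter is estimated with the guerrero feature; the line below is reproduced from the Code appendix and prints the value shown underneath.

# Guerrero's method selects the Box-Cox lambda for ATM4.
(lambda <- piv %>% features( ATM4, features = guerrero) %>% pull(lambda_guerrero) )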

## [1] 0.3988552

forecast_horizon = 31

fit_atm4 = piv %>% model( 
  sarima_nd = ARIMA(box_cox( ATM4, lambda )  ~ 0 + pdq( 0, 0, 1) + PDQ( 2,  0,  0, period = 7 )   ) ,
  sarima_d  = ARIMA( box_cox( ATM4, lambda )  ~ 1 + pdq( 0, 0, 1) + PDQ( 2,  0,  0, period = 7 )   )  ,
  ets       = ETS( box_cox(ATM4, lambda )  ~ error("M") + trend("N") + season("A") )
   ) 

fc_atm4 = fit_atm4 %>% forecast(h=forecast_horizon) 

We report the model parameters below for SARIMA no drift.

## Series: ATM4 
## Model: ARIMA(0,0,1)(2,0,0)[7] 
## Transformation: box_cox(ATM4, lambda) 
## 
## Coefficients:
##          ma1    sar1    sar2
##       0.1383  0.4542  0.4592
## s.e.  0.0530  0.0467  0.0466
## 
## sigma^2 estimated as 119.7:  log likelihood=-1395.67
## AIC=2799.34   AICc=2799.46   BIC=2814.94

We report the model parameters below for SARIMA with drift.

## Series: ATM4 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(ATM4, lambda) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0794  0.2124  0.2098   13.2940
## s.e.  0.0527  0.0516  0.0524    0.5507
## 
## sigma^2 estimated as 100.1:  log likelihood=-1357.1
## AIC=2724.2   AICc=2724.36   BIC=2743.7

We report the model parameters below for ETS.

## Series: ATM4 
## Model: ETS(M,N,A) 
## Transformation: box_cox(ATM4, lambda) 
##   Smoothing parameters:
##     alpha = 0.01092174 
##     gamma = 0.1767566 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]      s[-2]   s[-3]    s[-4]    s[-5]    s[-6]
##  21.73538 -14.38734 -3.441799 -0.4329291 7.38859 3.173384 4.410797 3.289302
## 
##   sigma^2:  0.193
## 
##      AIC     AICc      BIC 
## 3817.797 3818.418 3856.796

If we examine the model residuals, the SARIMA model with no drift produces significant autocorrelations at lags 7 and 14. The SARIMA model with drift produces significant autocorrelation at lag 10. Only the ETS model produces no significant autocorrelations.

Applying the Ljung-Box test, we observe that the model residuals of ETS and SARIMA with drift do not reject the white noise hypothesis, whereas for SARIMA with no drift the null hypothesis is strongly rejected.

augment( fit_atm4 ) %>% features(.innov, ljung_box, lag = 10)
## # A tibble: 3 x 3
##   .model    lb_stat lb_pvalue
##   <chr>       <dbl>     <dbl>
## 1 ets          16.0   0.101  
## 2 sarima_d     15.2   0.125  
## 3 sarima_nd    28.5   0.00147

Cross Validation for ATM4

Lastly, we run the cross-validation procedure as before. The conclusion is that we choose the ETS model over the SARIMA with drift model.

ATM4_stretch = piv %>% slice(1:(n()-31)) %>% 
  select( ATM4 ) %>% 
  stretch_tsibble(.init = 182, .step = cross_validation_stepsize )

bt_atm4 = ATM4_stretch %>% 
  model( 
    sarima_nd = ARIMA( box_cox( ATM4, lambda ) ~ 0 + pdq(0, 0, 1) + PDQ(2, 0, 0, period = 7 ) ),
    sarima_d  = ARIMA( box_cox( ATM4, lambda ) ~ 1 + pdq(0, 0, 1) + PDQ(2, 0, 0, period = 7 ) ),
    ets       = ETS( box_cox( ATM4, lambda ) ~ error("M") + trend("N") + season("A") )
  ) %>% 
  forecast( h = 30 )

The above diagnostic plots of the RMSE and MAE for ATM4 are somewhat in conflict and give very different results from those for ATM1 or ATM2.

Therefore, on balance, while some cross validation evidence is conflicting, ETS seems to be a better predictive model at more forecasting horizons than either SARIMA specification.

Exporting Forecasts

We export the May 2010 daily forecasts for all ATMs in an Excel file called Output_PartA_ATM_ANG_Forecasts.xlsx. The file format, same as the source data file, contains three columns: DATE, ATM and Cash.

unpack_hilo( hilo( fc_atm1 ) ) %>% 
  filter(.model == 'sarima_nd' ) %>% 
  mutate( Cash = .mean, ATM = 'ATM1') %>%
  select( DATE, ATM, Cash) -> atm1_output


unpack_hilo( hilo( fc_atm1 ) ) %>% 
  filter(.model == 'sarima_nd' ) %>% 
  mutate( Cash = .mean, ATM = 'ATM3') %>%
  select( DATE, ATM, Cash) -> atm3_output


unpack_hilo( hilo( fc_atm2 ) ) %>% 
  filter(.model == 'sarima_nd' ) %>% 
  mutate( Cash = .mean, ATM = 'ATM2') %>%
  select( DATE, ATM, Cash) -> atm2_output


unpack_hilo( hilo( fc_atm4 ) ) %>% 
  filter(.model == 'ets' ) %>% 
  mutate( Cash = .mean, ATM = 'ATM4') %>%
  select( DATE, ATM, Cash) -> atm4_output

all_atm_output = rbind( atm1_output, atm2_output, atm3_output, atm4_output)

all_atm_output %>%  openxlsx::write.xlsx( file = "Output_PartA_ATM_ANG_Forecasts.xlsx", overwrite = TRUE )

Part B - Residential Electrical Power Consumption

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours; the rest is straightforward. Add this to your existing files above.

Part B Solution

Data Wrangling

The raw dataset has 3 columns:

  • CaseSequence: a sequential observation number
  • YYYY-MMM: the month of the reading
  • KWH: residential power consumption in kilowatt hours

Two observations require data imputation or evaluation:

Bad Observations

mth CaseSequence KWH
2008 Sep 861 NA
2010 Jul 883 770523

The seasonal graph suggests a strong seasonal pattern, so interpolating a missing or inaccurate monthly reading with its reading from 12 months prior seems acceptable. We do this for both problem data points. Because there is a positive upward trend in the data, imputing a value from the previous year may bias the estimate downward. We accept the biased estimate because the trend is flatter before 2011.
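Both corrections replace the bad KWH value with the reading 12 months earlier; the chunk below is reproduced from the Code appendix.

# Replace the July 2010 outlier (CaseSequence 883) and the missing Sep 2008 value (CaseSequence 861)
# with their readings from 12 months earlier.
rp %>% mutate( KWH = ifelse( CaseSequence == 883 , rp$KWH[ CaseSequence == (883 - 12)], KWH)) %>%
       mutate( KWH = ifelse( CaseSequence == 861 , rp$KWH[ CaseSequence == (861 - 12)], KWH)) -> rpi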

Now we plot the missing value for 2008 in red. We join two red line segments between the missing value at CaseSequence 861 and its two immediate adjacent values on the graph (associated with CaseSequence 860 and 862). Likewise, we join two green line segments at CaseSequence 883.

Lastly, we compare electrical consumption before and after the data imputation in the two plots below. The adjusted chart below in blue looks very reasonable and more tractable for model building than the original unadjusted series.

Exploratory Analysis

We see strong evidence of a positive trend and seasonality in the electricity consumption in the STL decomposition charts. Causes for the positive trend could be an increase in population or economic activity. Causes for seasonality could be the climate, which causes electricity demand to increase in hotter months, or human behavior tied to school calendars or tourism.

Model Building

Following the same model building procedures outlined in Part A, we examine 3 models. Their orders were selected based on the automated model search algorithm in fable.

fit_rpi = rpi %>% model( 
  ets =   ETS( KWH ~ error("M") + trend("A") + season("M")), 
  sarima_d = ARIMA(KWH  ~ 1 + pdq( 0,  0, 4) + PDQ( 2 , 1 , 0 , period = 12) ) ,
  sarima_nd = ARIMA( KWH  ~ 0 + pdq( 0,  0, 4) + PDQ( 2 , 1 , 0 , period = 12) )
 )

Reports of the parameters of the fitted models on the full data series follow below:

## Series: KWH 
## Model: ETS(M,A,M) 
##   Smoothing parameters:
##     alpha = 0.07343375 
##     beta  = 0.001543274 
##     gamma = 0.0001001775 
## 
##   Initial states:
##     l[0]     b[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]
##  6211689 6855.655 0.9597963 0.7463089 0.8687219 1.176741 1.259157 1.199069
##    s[-6]     s[-7]     s[-8]     s[-9]   s[-10]   s[-11]
##  1.00169 0.7713234 0.8070511 0.9153309 1.069367 1.225444
## 
##   sigma^2:  0.009
## 
##      AIC     AICc      BIC 
## 6140.425 6143.943 6195.803
## Series: KWH 
## Model: ARIMA(0,0,4)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3      ma4     sar1     sar2   constant
##       0.3317  0.0359  0.2180  -0.0702  -0.7001  -0.4163  227156.63
## s.e.  0.0780  0.0825  0.0746   0.0788   0.0777   0.0787   71633.02
## 
## sigma^2 estimated as 3.794e+11:  log likelihood=-2655.6
## AIC=5327.19   AICc=5328.04   BIC=5352.74
## Series: KWH 
## Model: ARIMA(0,0,4)(2,1,0)[12] 
## 
## Coefficients:
##          ma1     ma2     ma3      ma4     sar1     sar2
##       0.3758  0.0776  0.2577  -0.0338  -0.6594  -0.3841
## s.e.  0.0785  0.0841  0.0741   0.0778   0.0787   0.0812
## 
## sigma^2 estimated as 3.979e+11:  log likelihood=-2659.9
## AIC=5333.8   AICc=5334.45   BIC=5356.15

The model residuals are examined in the plots below for all 3 models. We observe that the residuals of the ETS model suggest non-stationarity, while those of the SARIMA models appear stationary.

Looking at forecasts, all three models do a similar job in the plot below. They capture the annual periodicity and amplitude of the KWH series over the preceding 2-3 years. Note that the historical data in the plot is truncated in order to see the forecast period more clearly.
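The 2014 forecasts are generated and plotted as follows (reproduced from the Code appendix).

fc_rpi = fit_rpi %>% forecast(h = 12)

fc_rpi %>% 
  autoplot(rpi %>% filter_index("2012-01" ~ . ), level = NULL) + labs(title = "Forecasts of 3 Models on Raw Series")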

Cross Validation

We now cross validate the candidate models on a rolling forecasting origin. This process follows the convention explained in detail in Part A. Because the KWH time series has fewer observations, we split the data at a different starting point. We use Nov 2005 (data point 95) of the raw data as the initial origin of the cross validation. Using the same notation as in Part A, we define:

\(T_1\) as the origin of the first fold, \(T_2\) as the origin of the second fold, and \(T_N\) as the origin of the last fold.

There are \(N=86\) folds in the cross validation exercise with the following example date ranges for the training and test periods.

Fold Train Start Train End Test Start Test End
\(T_1\) Jan 1998 Nov 2005 Dec 2005 Nov 2006
\(T_2\) Jan 1998 Dec 2005 Jan 2006 Dec 2006
\(T_N\) Jan 1998 Dec 2012 Jan 2013 Dec 2013

rpi_stretch =  rpi %>%  slice(1:(n()-12)) %>% 
  select( KWH) %>% stretch_tsibble(.init = 95, .step = cross_validation_stepsize )
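The backtest models are then fitted on each fold and forecast 12 months ahead, with the per-horizon accuracy computed as in Part A; the model fitting code is reproduced from the Code appendix.

bt_rpi = rpi_stretch %>% model( 
   ets         = ETS( KWH ~ error("M") + trend("A") + season("M")), 
   sarima_d    = ARIMA(KWH  ~ 1 + pdq( 0,  0, 4) + PDQ( 2 , 1 , 0 , period = 12) ) ,
   sarima_nd   = ARIMA( KWH  ~ 0 + pdq( 0,  0, 4) + PDQ( 2 , 1 , 0 , period = 12) )
  ) %>% forecast( h = 12)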

Both cross validation error plots show that SARIMA with no drift (sarima_nd) has the lowest error. Its predictive accuracy dominates the other two models across all forecast horizons from \(h = 1 \ldots 12\).

Together with the earlier evidence on the stationarity of its model residuals and the intuitive accuracy of its forecasts, we select the SARIMA model with no drift as our preferred model.

Exporting Forecasts

We export the 2014 forecast at monthly intervals using the sarima_nd model fitted earlier. The output is written to a file Output_PartB_KWH_ANG_Forecasts.xlsx, which uses the next consecutive CaseSequence numbers, starting at 925 for the Jan 2014 forecast and continuing through Dec 2014.

output_fc_partb = unpack_hilo(hilo(fc_rpi %>% filter(.model == 'sarima_nd' ) ) ) %>% 
  select( mth, .mean ) %>%
  mutate( KWH = .mean, CaseSequence = row_number() + 924) %>%
  select(CaseSequence, mth , KWH ) %>% 
  rename( `YYYY-MMM` = mth ) %>% 
  openxlsx::write.xlsx( file = "Output_PartB_KWH_ANG_Forecasts.xlsx",
                        overwrite = TRUE)

Part C BONUS Waterflow in Pipes

Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week forward forecast and present results via RPubs and .rmd and the forecast in an Excel readable file.

Part C Solution

Data Wrangling

One data wrangling issue we resolved is a minor known bug in the readxl library. The read_excel function generates a large number of messages when loading the Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx files. We get around this by converting the original files to cleaner versions that load quietly. The new files, called wfpipe1.xlsx and wfpipe2.xlsx, contain equivalent information and are used henceforth. The timestamp and flow rate variables are renamed to dt and flow1 for Pipe 1 and flow2 for Pipe 2.

For reproducibility, we retain the conversion code and control its execution by a conditional variable runOncePartC.
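The one-time conversion guarded by runOncePartC is reproduced from the Code appendix.

# Run once: read the original workbooks, rename the columns, and write clean copies.
Waterflow_Pipe1 <- read_excel("Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfpipe1 = Waterflow_Pipe1 %>% rename( dt = `Date Time`, flow1 = WaterFlow )
wfpipe1 %>% openxlsx::write.xlsx(file="wfpipe1.xlsx" , overwrite = TRUE )

Waterflow_Pipe2 <- read_excel("Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
wfpipe2 = Waterflow_Pipe2 %>% rename( dt = `Date Time`, flow2 = WaterFlow )
wfpipe2 %>% openxlsx::write.xlsx(file="wfpipe2.xlsx" , overwrite = TRUE )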

Exploratory Data Analysis

Pipe 1

Pipe 1 contains flow readings at irregular time intervals, with the readings starting just after midnight on Oct 23, 2015 and ending at 11:35:43pm on Nov 1, 2015.

We will group the readings into regular hourly intervals and associate the group average with the end of the hour interval. To illustrate using the table below of selected readings, the first 4 readings, colored beige, are assigned to Oct 23, 2015 1am. The next 5 readings, colored blue, taken between 1am and 2am, are assigned to Oct 23, 2015 2am. We also show the last observation of Pipe 1 in yellow.

Waterflow Pipe 1: Initial and Final Observations

dt flow1
2015-10-23 00:24:06 23.370
2015-10-23 00:40:02 28.003
2015-10-23 00:53:51 23.066
2015-10-23 00:55:40 29.973
2015-10-23 01:19:17 5.998
2015-10-23 01:23:58 15.935
2015-10-23 01:50:05 26.617
2015-10-23 01:55:33 33.283
2015-10-23 01:59:15 12.427
2015-10-23 02:51:51 21.833
2015-11-01 23:34:10 16.219
2015-11-01 23:35:43 21.212

Pipe 2

Pipe 2 contains regular hourly readings taken over a longer period of time. The first reading occurs at Oct 23, 2015 1am - coinciding with the first hourly reading of Pipe 1. The last hourly reading occurs on Dec 3, 2015 4pm - over a month after the last Pipe 1 hourly reading.

We show the initial and last reading of Pipe 2 for illustration below.

Waterflow Pipe 2: Initial and Final Readings

dt flow2
2015-10-23 01:00:00 18.811
2015-10-23 02:00:00 43.087
2015-10-23 03:00:00 37.988
2015-10-23 04:00:00 36.120
2015-10-23 05:00:00 31.851
2015-12-03 15:00:00 33.401
2015-12-03 16:00:00 66.681
# This code groups the irregular timestamps into the ending hour of each hourly interval.
# to get the starting hour use floor_date( dt, unit = "hour")
# We record the average of the group readings
# ---------------------------------------------------------------------------
wfpipe1 %>% mutate( dt_aggregate = ceiling_date( dt , unit = "hour")) %>%
   group_by( dt_aggregate ) %>%
   summarize( flow1 = mean(flow1, na.rm = TRUE )) %>%
   rename( dt = dt_aggregate ) -> hwfpipe1

We show the resulting aggregated value for Pipe 1 below. The beige and blue colored rows show average readings corresponding to their same-colored groups in the earlier table.

Hourly Average of water pipe 1 flow

dt flow1
2015-10-23 01:00:00 26.103
2015-10-23 02:00:00 18.852
2015-10-23 03:00:00 15.159

Visual Analysis

We will join the two pipe datasets together by hourly timestamp with a full outer join, then convert the dataset into a tsibble to perform time series analysis. Note that there are 4 empty hourly intervals for Pipe 1. During those hours, no readings were observed, so the hour is blank.
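The join and the fill-down of the 4 empty Pipe 1 hours (restricted to the window where Pipe 1 has data) are implemented as follows, condensed from the Code appendix.

# Full outer join on the hourly timestamp, then index as a tsibble.
hwfpipe1 %>% full_join( wfpipe2 , by = "dt") %>% as_tsibble(index = dt ) %>%
  mutate( id = row_number() ) -> fjpipes

# Fill down only within the first 240 hours, where Pipe 1 has observations.
fjpipes %>% slice(1:240) %>% fill(flow1, .direction = "down") -> fjpipes2

# Copy the filled values back into the full joined tsibble via a left join.
fjpipes %>% left_join(fjpipes2, by = "dt" ) %>% 
  mutate( flow1.x = ifelse((id.x < 240) & is.na(flow1.x), flow1.y, flow1.x)) %>% 
  rename( flow1 = flow1.x, flow2 = flow2.x, id = id.x ) %>% 
  select( dt, flow1, flow2, id ) -> fjpipes3

fjpipes = fjpipes3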

Full Join of both pipes: First and last 3 rows

dt flow1 flow2 id
2015-10-23 01:00:00 26.103 18.811 1
2015-10-23 02:00:00 18.852 43.087 2
2015-10-23 03:00:00 15.159 37.988 3
2015-12-03 14:00:00 NA 42.937 998
2015-12-03 15:00:00 NA 33.401 999
2015-12-03 16:00:00 NA 66.681 1000

Let’s investigate the behavior of the two pipe flow rates.

The time series plot leads us to conclude that Pipe 1's flow rate is much smaller than Pipe 2's.

If the two pipes are related, then either Pipe 1 feeds into Pipe 2 as a tributary or Pipe 2 feeds into Pipe 1 as a source. In either case, we would look for a contemporaneous or lagged relationship in their flow rates. This would be revealed by a pairs plot and the cross correlation function.

Looking at the distributions of flow1 and flow2 restricted to the common period of readings, we see a -0.004 correlation of the contemporaneous flow data. We also see that both pipes have symmetric distributions of flows in the density plots. No pattern emerges in the scatter plot at lag 0.

Looking at the Pipe2 flow, we observe a similar symmetric distribution on the entire sample history.

Looking at the autocorrelations of Pipe 1, we see moderate evidence of negative autocorrelation at 9 and 12 hours. Perhaps these are related to the fact that people use water in the day and stop using it at night.

Pipe 2 shows positive autocorrelations at lags 15 and 24. Curiously, there is no significant 12 hour negative autocorrelation effect. Perhaps Pipe 2 is associated with an industrial usage where periodicity is tied to shift work and the daily calendar but not sleep-wake cycles.

Lastly, we inspect the cross correlations between Pipe 1 and Pipe 2. This shows the relationship between lags and leads of flow 1 versus flow 2.

For example, the lag coefficient \(h=-2\) measures the correlation between the Pipe 1 flow rate lagged by 2 hours and the current Pipe 2 flow rate.

\[CCF[-2]=\mathrm{corr}(\text{flow1}[t-2], \text{flow2}[t])\]

Evidence of a strong cross correlation for \(h< 0\) implies Pipe 1 predicts Pipe 2 flow rate. Evidence of a strong cross correlation for \(h > 0\) implies Pipe 2 predicts Pipe 1 flow rate. If lagged Pipe 1 flow predicts Pipe 2 flow, this could mean Pipe 1 flows into Pipe 2. We see weak cross correlations for all lags \(h\) which suggests a modest linkage between the Pipes’ flows.

fjpipes %>% slice(1:240) %>% 
  CCF( flow1, flow2 , lag_max = 18) %>% 
  autoplot() + labs( title = "Cross Correlations between Flow 1 and Flow 2")

Therefore, we do not attempt to model any relationship between the pipes’ flows.

Model Building

We therefore consider building a time series model for Pipe 1 flow.

The STL decomposition suggests a stochastic trend with some recurring daily seasonality. The residuals of the STL decomposition show autocorrelation. They are not white noise but appear homoskedastic and symmetric.

We see that the KPSS test suggests the flow series \(X_t\) is stationary.

fjpipes %>% slice(1:240) -> flow1dat

flow1dat %>%  features( flow1 , unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.184         0.1
fit_pipe1 = flow1dat %>% 
  model( m1 = ARIMA( flow1 ) , 
         m2 = ARIMA( flow1 ~ pdq(0,1,2) + PDQ(0, 1, 0, period = 12 ) ),
         m3 = ARIMA( flow1 ~ pdq(0,1,2) + PDQ(0, 1, 0, period = 24 ) ),
         )
report(fit_pipe1 %>% select(m1))
## Series: flow1 
## Model: ARIMA(0,0,0) w/ mean 
## 
## Coefficients:
##       constant
##        19.8585
## s.e.    0.2743
## 
## sigma^2 estimated as 18.14:  log likelihood=-687.79
## AIC=1379.59   AICc=1379.64   BIC=1386.55
report(fit_pipe1 %>% select(m2))
## Series: flow1 
## Model: ARIMA(0,1,2)(0,1,0)[12] 
## 
## Coefficients:
##           ma1      ma2
##       -0.9561  -0.0439
## s.e.   0.0734   0.0725
## 
## sigma^2 estimated as 41.63:  log likelihood=-746.98
## AIC=1499.96   AICc=1500.07   BIC=1510.24
report(fit_pipe1 %>% select(m3))
## Series: flow1 
## Model: ARIMA(0,1,2)(0,1,0)[24] 
## 
## Coefficients:
##           ma1      ma2
##       -0.9616  -0.0384
## s.e.   0.0684   0.0669
## 
## sigma^2 estimated as 36.52:  log likelihood=-693.5
## AIC=1392.99   AICc=1393.11   BIC=1403.11

Next, we examine the 3 time series models based on their model residuals.

Now we examine the forecasts.

Model m1, which uses a constant mean with white noise, is not satisfactory; its mean forecast shown above is a constant. The model m3 produces a repeating forecast which looks more plausible.

Based on these considerations, we choose model m3 which has a specification:

\[ARIMA ~ (p=0,d=1,q=2) + (P=0, D=1, Q=0)_{24}\]

We export its forecasts to Output_PartC_Forecast.xlsx.

output_fc_partc = unpack_hilo(hilo(fc_pipe1 %>% filter(.model == 'm3' ) ) ) %>%   
  select( dt, .mean ) %>%
  mutate( flow1 = .mean ) %>%
  select(dt, flow1  ) %>% openxlsx::write.xlsx( file = "Output_PartC_Forecast.xlsx",
                        overwrite = TRUE)

Code

We summarize all the R code used in this project in this appendix for ease of reading.


```r
knitr::opts_chunk$set(echo = FALSE)

# Conditional Code Evaluation is managed here.

runPartA = TRUE
runPartB = TRUE
runPartC = TRUE
runOncePartC = FALSE  # Used to load and convert XLSX files that cause unsuppressable warnings.

cross_validation_stepsize = 1  # Make this equal 1 for final render but a larger integer while editing.

library(fpp3)
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(GGally) # for ggpairs
library(readxl) # to parse Excel workbooks
library(openxlsx) # for writing xlsx files

# CSS styling for code chunks: .code-bg { background-color: lightgray; }
# Part A - Load the data and store it in original skinny datatable and alternative wide pivot table.
# Both forms are tsibbles
raw_ATM624Data <- read_excel("ATM624Data.xlsx", 
    col_types = c("date", "text", "numeric"))

atmdat_tmp = raw_ATM624Data %>%  filter(!is.na(ATM)) %>%  # Some dates are included in file with no ATM or cash data
  as_tsibble(key=ATM, index = DATE, regular = FALSE)

atmdat_tmp %>% fill_gaps() %>% mutate(DATE=date(DATE)) %>% as_tsibble(key=ATM, index=DATE) -> atmdat

atmdat %>% pivot_wider(names_from =ATM, values_from = Cash) -> piv
atmdat %>% autoplot(Cash) + facet_wrap(~ATM, scales = "free_y") + 
  labs(title="Time Series of ATM before Data Cleaning")

piv %>% tail(n=6) %>% kable( digits = 2 ) %>% kable_styling(bootstrap_options = c("hover", "striped"), position="left")
atmdat %>% mutate( id = row_number()) %>% filter(is.na(Cash) | Cash > 10000) %>% 
  kable( digits = 0, caption = "Outliers and Missing Values" ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left") 
atmdat[44,  "Cash"] =  96
atmdat[47,  "Cash"] = 145
atmdat[53,  "Cash"] = 106
atmdat[414, "Cash"] = 7
atmdat[420, "Cash"] = 24
atmdat[1380, "Cash"] = 153.2

atmdat %>% pivot_wider(names_from =ATM, values_from = Cash) -> piv
atmdat %>% 
  mutate( wDay = lubridate::wday(DATE, label=TRUE)) %>% 
  as_tibble() %>% 
  ggplot() + geom_boxplot(aes(y=Cash, x = wDay) ) + 
  facet_wrap(~ATM, scales = "free") + 
  ggtitle("Withdrawals by Day of Week After Imputation",                                                                      subtitle="Grouped by ATM")

atmdat %>% 
  mutate( mon = lubridate::month(DATE, label=TRUE)) %>% 
  as_tibble() %>% 
  ggplot() + geom_boxplot(aes(y=Cash, x = mon) ) + 
  facet_wrap(~ATM, scales = "free") + 
  ggtitle("Cash Withdrawals by Month After Imputation", subtitle="Grouped by ATM")
atmdat %>% autoplot(Cash) + facet_wrap(~ATM, scales = "free_y") 
ggpairs(piv[,2:5], lower= list( continuous=wrap("points", alpha = 0.4, size = 0.2 ) ) )
piv %>% model(ATM1 = STL(ATM1)) %>% components() %>% autoplot() + labs(title="STL Decomposition for ATM1")
piv %>% model(ATM2 = STL(ATM2)) %>% components() %>% autoplot() + labs(title="STL Decomposition for ATM2")
piv %>% model(ATM4 = STL(ATM4)) %>% components() %>% autoplot() + labs(title="STL Decomposition for ATM4")

piv %>% gg_tsdisplay(y=ATM1, plot_type = "partial")

forecast_horizon = 31

fit_atm1 = piv %>% model( 
  sarima_nd = ARIMA(ATM1 ~ 0 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
  sarima_d = ARIMA(ATM1 ~ 1 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
  ets = ETS( ATM1 ~ error("A") + trend("N") + season("A")  )
    ) 

fc_atm1 = fit_atm1 %>% forecast(h=forecast_horizon) 
report(fit_atm1 %>% select(sarima_nd))
report(fit_atm1 %>% select(sarima_d))
report(fit_atm1 %>% select(ets))

fc_atm1 %>% autoplot(piv %>% filter_index( "2010-03-01" ~ .)) + 
  labs(title = paste0("Three model forecasts for ", forecast_horizon, " day forecast"),
       subtitle = "History After March 2010" )

fit_atm1 %>% select(sarima_nd) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA no drift for ATM1")

fit_atm1 %>% select(sarima_d) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA w drift for ATM1")

fit_atm1 %>% select(ets) %>% gg_tsresiduals(type = "innovation") + labs(title="ETS Residuals for ATM1")

ATM1_stretch =  piv %>%  slice(1:(n()-31)) %>% 
  select( ATM1) %>% 
  stretch_tsibble(.init = 182, .step = cross_validation_stepsize )

bt_atm1_sarima = ATM1_stretch %>% 
  model( 
    sarima_nd = ARIMA(ATM1 ~ 0 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
    sarima_d = ARIMA(ATM1 ~ 1 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
    ets = ETS( ATM1 ~ error("A") + trend("N") + season("A")  )
) %>% forecast( h = 30) 


bb = bt_atm1_sarima %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_arima = bb %>% fabletools::accuracy( piv, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE ) 

rmse_bt_arima %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) )  + labs(title = "RMSE of Cross Validated Models for ATM1")
rmse_bt_arima %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) )  + labs(title = "MAE of Cross Validated Models for ATM1")


piv %>% gg_tsdisplay(y=ATM2, plot_type = "partial")

forecast_horizon = 31

fit_atm2 = piv %>% model( 
  sarima_nd = ARIMA(ATM2  ) ,
  sarima_d = ARIMA(ATM2 ~ 1 + pdq(2,0,2) + PDQ(0, 1, 1, period = 7 ) ) ,
  ets = ETS( ATM2 ~ error("A") + trend("N") + season("A") )
) 

fc_atm2 = fit_atm2 %>% forecast(h=forecast_horizon) 
report(fit_atm2 %>% select(sarima_nd))
report(fit_atm2 %>% select(sarima_d))
report(fit_atm2 %>% select(ets))

fc_atm2 %>% autoplot(piv %>% filter_index( "2010-03-01" ~ .)) + 
  labs(title = paste0("Three model forecasts for ", forecast_horizon, " day forecast for ATM2"),
       subtitle = "History After March 2010" )

fit_atm2 %>% select(sarima_nd) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA no drift for ATM2")

fit_atm2 %>% select(sarima_d) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA w drift for ATM2")

fit_atm2 %>% select(ets) %>% gg_tsresiduals(type = "innovation") + labs(title="ETS Residuals for ATM2")

ATM2_stretch =  piv %>%  
  slice(1:(n()-31)) %>% 
  select( ATM2) %>% 
  stretch_tsibble(.init = 182, .step = cross_validation_stepsize )

bt_atm2_sarima = ATM2_stretch %>% 
  model( 
    sarima_nd = ARIMA(ATM2 ~ 0 +  pdq(2,0,2) + PDQ(0, 1, 1, period = 7 )) ,
    sarima_d = ARIMA(ATM2 ~ 1 +   pdq(2,0,2) + PDQ(0, 1, 1, period = 7 )) ,
    ets = ETS( ATM2 ~ error("A") + trend("N") + season("A")  )
  ) %>% forecast( h = 30) 


bb2 = bt_atm2_sarima %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_atm2   = bb2 %>% fabletools::accuracy( piv, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE ) 

rmse_bt_atm2 %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) )  + labs(title = "RMSE of Cross Validated Models for ATM2")
rmse_bt_atm2 %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) )  + labs(title = "MAE of Cross Validated Models for ATM2")


piv %>% gg_tsdisplay(y=ATM4, plot_type = "partial")
(lambda <- piv %>% features( ATM4, features = guerrero) %>% pull(lambda_guerrero) )

piv = piv %>% mutate( bcATM4 =  box_cox(ATM4, lambda) )

piv %>% gg_tsdisplay(y=bcATM4, plot_type = "partial")

forecast_horizon = 31

fit_atm4 = piv %>% model( 
  sarima_nd = ARIMA(box_cox( ATM4, lambda )  ~ 0 + pdq( 0, 0, 1) + PDQ( 2,  0,  0, period = 7 )   ) ,
  sarima_d  = ARIMA( box_cox( ATM4, lambda )  ~ 1 + pdq( 0, 0, 1) + PDQ( 2,  0,  0, period = 7 )   )  ,
  ets       = ETS( box_cox(ATM4, lambda )  ~ error("M") + trend("N") + season("A") )
   ) 

fc_atm4 = fit_atm4 %>% forecast(h=forecast_horizon) 

report(fit_atm4 %>% select(sarima_nd) )
report(fit_atm4 %>% select(sarima_d) )
report(fit_atm4 %>% select(ets) )
fc_atm4 %>% autoplot(piv %>% filter_index("2010-01-01" ~ . ), level = NULL) + labs(title = "ATM4 forecast")

fit_atm4 %>% select(sarima_nd) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA no drift for ATM4")

fit_atm4 %>% select(sarima_d) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA w drift for ATM4")

fit_atm4 %>% select(ets) %>% gg_tsresiduals(type = "innovation") + labs(title="ETS for ATM4")

augment( fit_atm4 ) %>% features(.innov, ljung_box, lag = 10)
ATM4_stretch =  piv %>%  slice(1:(n()-31)) %>% 
  select( ATM4) %>% stretch_tsibble(.init = 182, .step = cross_validation_stepsize )


bt_atm4 = ATM4_stretch %>% model( 
  sarima_nd = ARIMA(box_cox( ATM4, lambda )
                    ~ 0 + pdq( 0, 0, 1) + PDQ( 2,  0,  0, period = 7 )   ) ,
 sarima_d   = ARIMA( box_cox( ATM4, lambda )  ~ 
                       1 + pdq( 0, 0, 1) + PDQ( 2,  0,  0, period = 7 )   )  ,
  ets       = ETS( box_cox(ATM4, lambda )     ~ error("M") + trend("N") + season("A") )
                              )  %>% 
  forecast( h = 30) 


bb4 = bt_atm4  %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_atm4    = bb4 %>% fabletools::accuracy( piv, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE ) 

rmse_bt_atm4 %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) )  + labs(title = "RMSE of Cross Validated Models for ATM4")
rmse_bt_atm4 %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) )  + labs(title = "MAE of Cross Validated Models for ATM4")




unpack_hilo( hilo( fc_atm1 ) ) %>% 
  filter(.model == 'sarima_nd' ) %>% 
  mutate( Cash = .mean, ATM = 'ATM1') %>%
  select( DATE, ATM, Cash) -> atm1_output


unpack_hilo( hilo( fc_atm1 ) ) %>% 
  filter(.model == 'sarima_nd' ) %>% 
  mutate( Cash = .mean, ATM = 'ATM3') %>%
  select( DATE, ATM, Cash) -> atm3_output


unpack_hilo( hilo( fc_atm2 ) ) %>% 
  filter(.model == 'sarima_nd' ) %>% 
  mutate( Cash = .mean, ATM = 'ATM2') %>%
  select( DATE, ATM, Cash) -> atm2_output


unpack_hilo( hilo( fc_atm4 ) ) %>% 
  filter(.model == 'ets' ) %>% 
  mutate( Cash = .mean, ATM = 'ATM4') %>%
  select( DATE, ATM, Cash) -> atm4_output

all_atm_output = rbind( atm1_output, atm2_output, atm3_output, atm4_output)

all_atm_output %>%  openxlsx::write.xlsx( file = "Output_PartA_ATM_ANG_Forecasts.xlsx", overwrite = TRUE )
respower <- read_excel("ResidentialCustomerForecastLoad-624.xlsx",
                                                  col_types = c("numeric", "text", "numeric"))

respower %>% 
  mutate( mth = yearmonth(`YYYY-MMM`)) %>% 
  select( mth, CaseSequence, KWH) %>% 
  as_tsibble(index = mth) -> rp
rp %>% filter(is.na(KWH) | ( KWH/1e6 < 3 ) ) %>% 
  kable(caption ="Bad Observations") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
rp %>% gg_season(KWH/1e6) + labs(y="Gigawatt Hours", title = "Seasonal Pattern for Residential Consumption", subtitle = "Raw Data Before Imputation" )


fix2008 = rp %>% mutate( KWH = ifelse( CaseSequence == 861 ,rp$KWH[ CaseSequence == (861 - 12)], KWH ) ) %>% 
  filter( abs(CaseSequence - 861 ) < 2 )


fix2010 = rp %>% mutate( KWH = ifelse( CaseSequence == 883 ,rp$KWH[ CaseSequence == (883 - 12)], KWH ) ) %>% 
  filter( abs(CaseSequence - 883 ) < 2 )

rp %>% filter(year(mth) > 2006 ) %>% autoplot(KWH) + 
  geom_point(data = fix2008, aes(y=KWH), color="red", alpha = 0.5 ) +
  geom_line(data=fix2008, color = "red", alpha = 0.5) +
  geom_point(data = fix2010, aes(y=KWH), color="green", alpha = 0.5 ) +
  geom_line(data=fix2010, color = "green", alpha = 0.5) +
  labs(title="Imputation for Missing or Outliers Data", subtitle = "Mega Kilowatt Hours")
plotPartB1 = rp %>% autoplot(KWH/1e6) + labs(title = "Unadjusted Residential Gigawatt Usage", y = "gigawatt hours")

# This is the corrected data.
rp %>% mutate( KWH = ifelse( CaseSequence == 883 , rp$KWH[ CaseSequence == (883 - 12)], KWH)) %>%
       mutate( KWH = ifelse( CaseSequence == 861 , rp$KWH[ CaseSequence == (861 - 12)], KWH)) -> rpi

plotPartB2 = rpi %>% autoplot(KWH/1e6, color = "blue") + labs(title="Adjusted Residential Gigawatt Usage", y = "gigawatt hours" )

plot_grid(plotPartB1, plotPartB2, ncol = 1)


rpi %>% model( STL(KWH)) %>% components() %>% autoplot() + labs(title = "STL Decomposition for Residential Electric Consumption")
fit_rpi = rpi %>% model( 
  ets =   ETS( KWH ~ error("M") + trend("A") + season("M")), 
  sarima_d = ARIMA(KWH  ~ 1 + pdq( 0,  0, 4) + PDQ( 2 , 1 , 0 , period = 12) ) ,
  sarima_nd = ARIMA( KWH  ~ 0 + pdq( 0,  0, 4) + PDQ( 2 , 1 , 0 , period = 12) )
 )

report(fit_rpi %>% select(ets))
report(fit_rpi %>% select(sarima_d))
report(fit_rpi %>% select(sarima_nd))


fit_rpi %>% select(ets ) %>% gg_tsresiduals(type = "innovation") + labs(title="Model Residuals ETS(MAM)")
fit_rpi %>% select(sarima_d ) %>% gg_tsresiduals(type = "innovation") + labs(title="Model Residuals SARIMA w Drift")
fit_rpi %>% select(sarima_nd) %>% gg_tsresiduals(type = "innovation") + labs(title="Model Residuals SARIMA No Drift")

fc_rpi = fit_rpi %>% forecast(h = 12)

fc_rpi %>% 
  autoplot(rpi %>% filter_index("2012-01" ~ . ), level = NULL) + labs(title = "Forecasts of 3 Models on Raw Series")


rpi_stretch =  rpi %>%  slice(1:(n()-12)) %>% 
  select( KWH) %>% stretch_tsibble(.init = 95, .step = cross_validation_stepsize )


bt_rpi = rpi_stretch %>% model( 
   ets         = ETS( KWH ~ error("M") + trend("A") + season("M")), 
   sarima_d    = ARIMA(KWH  ~ 1 + pdq( 0,  0, 4) + PDQ( 2 , 1 , 0 , period = 12) ) ,
   sarima_nd   = ARIMA( KWH  ~ 0 + pdq( 0,  0, 4) + PDQ( 2 , 1 , 0 , period = 12) )
                              )  %>%  forecast( h = 12) 

cc = bt_rpi  %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_rpi   = cc %>% fabletools::accuracy( rpi, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE ) 

rmse_bt_rpi %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) )  +
  scale_x_continuous(breaks = c( 1, 2, 4, 6, 8, 10, 12 )) +
 labs(title = "RMSE of Cross Validated Models for KWH")
rmse_bt_rpi %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) )  + 
  scale_x_continuous(breaks = c( 1, 2, 4, 6, 8, 10, 12 )) +
  labs(title = "MAE of Cross Validated Models for KWH")


output_fc_partb = unpack_hilo(hilo(fc_rpi %>% filter(.model == 'sarima_nd' ) ) ) %>% 
  select( mth, .mean ) %>%
  mutate( KWH = .mean, CaseSequence = row_number() + 924) %>%
  select(CaseSequence, mth , KWH ) %>% 
  rename( `YYYY-MMM` = mth ) %>% 
  openxlsx::write.xlsx( file = "Output_PartB_KWH_ANG_Forecasts.xlsx",
                        overwrite = TRUE)


Waterflow_Pipe1 <- read_excel("Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfpipe1 = Waterflow_Pipe1 %>% rename( dt = `Date Time`, flow1 = WaterFlow )
wfpipe1 %>% openxlsx::write.xlsx(file="wfpipe1.xlsx" , overwrite = TRUE )
rm(wfpipe1)

Waterflow_Pipe2 <- read_excel("Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
wfpipe2 = Waterflow_Pipe2 %>% rename( dt = `Date Time`, flow2 = WaterFlow )
wfpipe2 %>% openxlsx::write.xlsx(file="wfpipe2.xlsx" , overwrite = TRUE )
rm(wfpipe2)
# This code is loaded every time.
wfpipe1 = read_excel("wfpipe1.xlsx", col_types = c("date", "numeric"))
wfpipe2 = read_excel("wfpipe2.xlsx", col_types = c("date", "numeric"))

wfpipe1 %>% slice(1:10,( (n()-1):n())  ) %>% 
  kable(caption = "Waterflow Pipe 1:  Initial and Final Observations", digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left") %>%
  row_spec(1:4, background = "tan") %>%
  row_spec(5:9, background = "skyblue") %>%
  row_spec(12, background = "yellow")
wfpipe2 %>% slice(1:5, ( (n()-1) : n() ) ) %>% 
  kable(caption = "Waterflow Pipe 2: Initial and Final Readings", digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left") %>%
  row_spec(1, background = "tan") %>%
  row_spec(7, background = "yellow")

# This code groups the irregular timestamps into the ending hour of each hourly interval.
# to get the starting hour use floor_date( dt, unit = "hour")
# We record the average of the group readings
# ---------------------------------------------------------------------------
wfpipe1 %>% mutate( dt_aggregate = ceiling_date( dt , unit = "hour")) %>%
   group_by( dt_aggregate ) %>%
   summarize( flow1 = mean(flow1, na.rm = TRUE )) %>%
   rename( dt = dt_aggregate ) -> hwfpipe1


hwfpipe1 %>% slice(1:3) %>% kable(caption = "Hourly Average of water pipe 1 flow", digits = 3 ) %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left") %>%
  row_spec(1, background = "tan") %>%
  row_spec(2, background = "skyblue") 

hwfpipe1 %>% full_join( wfpipe2 , by = "dt") %>% as_tsibble(index = dt ) %>%
  mutate( id = row_number() ) -> fjpipes

# Fill down only on the period up to the end of Pipe 1 data 
# Don't fill down due to Pipe 2 data existing but no pipe 1 data. This would be an error
fjpipes %>% slice(1:240) %>% fill(flow1, .direction = c("down")) -> fjpipes2

# Now copy over the filled down values back into the original tsibble
# Trick:  use a left join from original to truncated imputed data frame to fill in values.
# -------------------------------------------------------------------------------------------
fjpipes %>% left_join(fjpipes2, by = "dt" ) %>% 
  mutate( flow1.x = ifelse((id.x < 240) & is.na(flow1.x), flow1.y, flow1.x)) %>% 
  rename( flow1 = flow1.x, flow2 = flow2.x, id = id.x ) %>% 
  select( dt, flow1, flow2, id ) -> fjpipes3

fjpipes = fjpipes3

fjpipes %>% slice(1:3,((n()-2):n())) %>% 
  kable(caption = "Full Join of both pipes: First and last 3 rows", digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

fjpipes %>% autoplot(flow1) + geom_line(aes(x=dt, y = flow2), alpha = 0.5, col = "red" ) +
  labs(title = "Pipe1 and Pipe 2 Flow Rates", x = "Time", y = "Flow Rate")

fjpipes %>% filter( !is.na(flow1)) %>% 
  ggpairs( columns = 2:3 , 
          lower= list( continuous=wrap("points", alpha = 0.6, size = 0.3 ) ) )

fjpipes %>% ggplot(aes(x = flow2)) + geom_density() + labs(title = "Pipe 2 flow")
fjpipes  %>% gg_tsdisplay( flow1, plot_type = c("partial")) + labs(title = "Flow 1 Autocorrelations")
fjpipes  %>% gg_tsdisplay( flow2, plot_type = c("partial")) + labs(title = "Flow 2 Autocorrelations")

fjpipes %>% slice(1:240) %>% 
  CCF( flow1, flow2 , lag_max = 18) %>% 
  autoplot() + labs( title = "Cross Correlations between Flow 1 and Flow 2")


fjpipes %>% slice(1:240) %>% gg_tsdisplay(flow1, plot_type = "partial")

fjpipes %>% slice(1:240) %>% model(STL(flow1)) %>% components() %>% autoplot()

fjpipes %>% slice(1:240) -> flow1dat

flow1dat %>%  features( flow1 , unitroot_kpss)
fit_pipe1 = flow1dat %>% 
  model( m1 = ARIMA( flow1 ) , 
         m2 = ARIMA( flow1 ~ pdq(0,1,2) + PDQ(0, 1, 0, period = 12 ) ),
         m3 = ARIMA( flow1 ~ pdq(0,1,2) + PDQ(0, 1, 0, period = 24 ) ),
         )
report(fit_pipe1 %>% select(m1))
report(fit_pipe1 %>% select(m2))
report(fit_pipe1 %>% select(m3))
fit_pipe1 %>% select(m1 ) %>% gg_tsresiduals()

fit_pipe1 %>% select(m2 ) %>% gg_tsresiduals()

fit_pipe1 %>% select(m3 ) %>% gg_tsresiduals()

fc_pipe1 = fit_pipe1 %>% forecast(h = 240)

fc_pipe1 %>% filter( .model != 'm2' ) %>% autoplot(flow1dat %>% filter(id > 100) , level = NULL )


output_fc_partc = unpack_hilo(hilo(fc_pipe1 %>% filter(.model == 'm3' ) ) ) %>%   
  select( dt, .mean ) %>%
  mutate( flow1 = .mean ) %>%
  select(dt, flow1  ) %>% openxlsx::write.xlsx( file = "Output_PartC_Forecast.xlsx",
                        overwrite = TRUE)
```