# Conditional Code Evaluation is managed here.
runPartA = TRUE
runPartB = TRUE
runPartC = TRUE
runOncePartC = FALSE # Used to load and convert XLSX files that cause unsuppressible warnings.
cross_validation_stepsize = 1 # Make this equal 1 for final render but a larger integer while editing.

In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
We forecast 4 ATM Cash withdrawal time series for each day of May 2010.
Each of the 4 time series exhibited different statistical properties and posed different technical challenges for building a time series model. We adopt a single model selection process that chooses among 3 model classes: SARIMA with no drift, SARIMA with drift, and ETS. To validate the model choice, we run a 6-month cross validation procedure with 153 consecutive backtests, comparing RMSE and MAE. We select the following models for each ATM:

- ATM1: SARIMA(0,0,1)(0,1,2)[7] with no drift
- ATM2: SARIMA(2,0,2)(0,1,1)[7] with no drift
- ATM3: the ATM1 forecast is reused, since ATM3 records withdrawals only on its last three days
- ATM4: ETS(M,N,A) fitted to a Box-Cox transformation of Cash
We output the forecast in a single Excel file.
We wrangle and clean the input ATM data to a suitable form.
There are data wrangling issues which we fix:
Extra rows of dates lacking any data. These are simply omitted by dropping rows where the ATM field is blank.
The tsibble date interval is irregular but needs to be regular in order to apply time series models. Two changes are required to do this:
By changing DATE from a datetime to a calendar day, we regularize the tsibble to a daily interval.
We use fill_gaps to add missing rows. For ATM1 and ATM2, a few dates exist for which no observations of Cash withdrawal are recorded.
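The two changes are condensed in the sketch below (the full data-loading chunk appears in the code appendix); it assumes the raw workbook has already been read into raw_ATM624Data.

```r
# Drop the extra date rows with no ATM data, convert the datetime index to a
# calendar day, declare a regular daily tsibble, and add rows for missing dates.
atmdat <- raw_ATM624Data %>%
  filter(!is.na(ATM)) %>%
  mutate(DATE = lubridate::as_date(DATE)) %>%
  as_tsibble(key = ATM, index = DATE) %>%
  fill_gaps()
```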
We plot the raw time series for each ATM in order to visually explore missing data and outliers.
From the above time series plots, we observe that:

- ATM1 and ATM2 each have a few dates with missing Cash values.
- ATM3 shows zero withdrawals for the entire period except the last three days of April 2010.
- ATM4 has one extreme outlier far above its typical level.

The last six rows of the data, in wide format, are shown below.
| DATE | ATM1 | ATM2 | ATM3 | ATM4 |
|---|---|---|---|---|
| 2010-04-25 | 109 | 89 | 0 | 542.28 |
| 2010-04-26 | 74 | 11 | 0 | 403.84 |
| 2010-04-27 | 4 | 2 | 0 | 13.70 |
| 2010-04-28 | 96 | 107 | 96 | 348.20 |
| 2010-04-29 | 82 | 89 | 82 | 44.25 |
| 2010-04-30 | 85 | 90 | 85 | 482.29 |
There are only 6 observations with missing or suspicious values for Cash, so we can adopt a simple strategy to impute them. Given the absence of any clear trend but the presence of strong weekly seasonality, a good imputation strategy is to use the value from the same day of the previous week. This rule would be undefined only if data were missing in the first 7 days of the series, where no prior week’s value exists; that never happens here.
More precisely,
\[Cash^{*}(ATM[x], T) = Cash(ATM[x], T- 7)\] where \(T\) is the date where data is imputed, \(ATM[x]\) is the ATM machine. \(T-7\) is 7 calendar days before \(T\).
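As an illustration of this rule (a sketch only; the report applies the explicit row assignments shown later), the same imputation could be written with a 7-day lag. The 10,000 threshold for flagging the suspect value mirrors the filter used to list the outliers below.

```r
# Illustrative sketch: replace a missing or implausibly large Cash value with
# the value recorded at the same ATM seven days earlier.
atmdat %>%
  group_by_key() %>%
  mutate(Cash = if_else(is.na(Cash) | Cash > 10000, lag(Cash, 7), Cash)) %>%
  ungroup()
```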
There are a small number of missing values for ATM1 and ATM2. Three observations on June 2009 dates have missing Cash for ATM1, and two observations in June 2009 have missing Cash for ATM2. There is also one outlier for ATM4.
These observations are listed in the table below.
Outliers and Missing Values
| DATE | ATM | Cash | id |
|---|---|---|---|
| 2009-06-13 | ATM1 | NA | 44 |
| 2009-06-16 | ATM1 | NA | 47 |
| 2009-06-22 | ATM1 | NA | 53 |
| 2009-06-18 | ATM2 | NA | 414 |
| 2009-06-24 | ATM2 | NA | 420 |
| 2010-02-09 | ATM4 | 10920 | 1380 |
To justify imputing the outlier for ATM4, we observe that ATM4 averages $47,400 in daily withdrawals. According to NPR, an ATM machine can hold at most $200,000 but usually holds tens of thousands of dollars in currency (https://www.npr.org/templates/story/story.php?storyId=125621359). The ATM4 outlier implies a Cash withdrawal of $1,092,000 in one day, more than 5 times the maximum suggested by the NPR story, and is most likely an error.
So we replace the outlier value with the Cash value from a week earlier.
The changes above are implemented in the code below.
atmdat[44, "Cash"] = 96      # ATM1, 2009-06-13: value from one week earlier
atmdat[47, "Cash"] = 145     # ATM1, 2009-06-16: value from one week earlier
atmdat[53, "Cash"] = 106     # ATM1, 2009-06-22: value from one week earlier
atmdat[414, "Cash"] = 7      # ATM2, 2009-06-18: value from one week earlier
atmdat[420, "Cash"] = 24     # ATM2, 2009-06-24: value from one week earlier
atmdat[1380, "Cash"] = 153.2 # ATM4, 2010-02-09: outlier replaced with value from one week earlier

Previously, we claimed strong weekly seasonal effects without presenting evidence. To avoid repetition, we show below boxplots grouped by day of the week for each ATM. The weekly seasonal effects are clearly visible in the changes of the medians and IQRs between days of the week. For example, there is a strong decline on Thursday for all ATMs.
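The boxplots are produced with the chunk below (the same code appears in the appendix).

```r
# Boxplots of Cash by day of week, one panel per ATM
atmdat %>%
  mutate(wDay = lubridate::wday(DATE, label = TRUE)) %>%
  as_tibble() %>%
  ggplot() +
  geom_boxplot(aes(y = Cash, x = wDay)) +
  facet_wrap(~ATM, scales = "free") +
  ggtitle("Withdrawals by Day of Week After Imputation", subtitle = "Grouped by ATM")
```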
To confirm the seasonal structure used for imputation, we evaluate possible seasonal effects at the weekly and monthly level. Examining the day-of-week effect at each ATM, we see a dip on Thursday at every ATM; there is strong evidence of weekly seasonality.
Next, we ask whether there is evidence of quarterly or annual seasonality.
The boxplots below, grouped by month for each ATM, show no such evidence. There is clearly some time variation in the withdrawals, but insufficient data to identify longer seasonal periods.
It’s worth rerunning the time series plots after the imputation. We see below that the significant change is in the ATM4 plot; the others look virtually unchanged. Without its outlier, the updated ATM4 plot shows the behavior of the Cash withdrawals more clearly: significant but roughly constant volatility, with right skewness in the distribution.
The ggpairs plot below shows the pairwise scatter plots, correlations and marginal distributions of the four ATM series.
We run an STL decomposition to confirm the absence of a deterministic trend and to check the STL residuals before model building.
The below plot for ATM1 shows no evidence of a deterministic trend but strong evidence of weekly seasonality. The STL residuals look useful even if volatility appears to cluster.
The below plot for ATM2 finds no deterministic trend in ATM2, strong evidence of seasonality but some volatility clustering in the residual.
The ATM4 decomposition below confirms weekly seasonality and suggests right skewness which we will correct with a Box-Cox transformation.
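The decompositions are generated with the chunk below (also in the appendix); for a daily tsibble, STL picks up the weekly seasonal component automatically.

```r
# STL decompositions for the three ATMs we model
piv %>% model(ATM1 = STL(ATM1)) %>% components() %>% autoplot() + labs(title = "STL Decomposition for ATM1")
piv %>% model(ATM2 = STL(ATM2)) %>% components() %>% autoplot() + labs(title = "STL Decomposition for ATM2")
piv %>% model(ATM4 = STL(ATM4)) %>% components() %>% autoplot() + labs(title = "STL Decomposition for ATM4")
```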
We adopt the same process of model building, evaluation and forecasting for each of the three ATM series. The process begins with considering three candidate classes of time series models:
- SARIMA with no drift
- SARIMA with drift
- ETS with no trend but weekly seasonality

We experimented offline with different orders of each model class, in conjunction with the automated search algorithm, to seek an optimal fit. For each model class, we select one model with a chosen order and fit it to the entire dataset.
Next, we compare their May 2010 forecasts, inspect their model residual plots for suitability. Afterwards, we run a time series cross validation process to produce many out-of-sample forecasts for each of the 3 models. The RMSE and MAE of each model are graphically evaluated.
Together with the earlier assessments of model residuals, the cross validation results enable us to choose the preferred model to generate forecasts.
We apply this process to ATM1, ATM2 and ATM4 in succession. ATM3 has only three days of non-zero history, so we reuse the ATM1 forecast for it. The resulting forecasts for all 4 ATMs are exported in one Excel file.
We explain the model building and evaluation process in detail for ATM1.
Looking at the raw time series ACF plots below, we observe significant autocorrelations and partial autocorrelations at multiple lags.
forecast_horizon = 31
fit_atm1 = piv %>% model(
sarima_nd = ARIMA(ATM1 ~ 0 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
sarima_d = ARIMA(ATM1 ~ 1 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
ets = ETS( ATM1 ~ error("A") + trend("N") + season("A") )
)
fc_atm1 = fit_atm1 %>% forecast(h=forecast_horizon)

The SARIMA model with no drift has order:
\[ARIMA ~ (p=0,d=0,q=1) + (P=0, D=1, Q=2)_{7}\]
The fitted model specification is:
\[(1 - B^7)X_{t} = ( 1 + \theta_1 B)(1 + \Theta_{1}B^7 + \Theta_{2}B^{14} ) \epsilon_{t}\]
The model parameters are listed below.
report(fit_atm1 %>% select(sarima_nd))

## Series: ATM1
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1868 -0.5762 -0.1096
## s.e. 0.0548 0.0507 0.0496
##
## sigma^2 estimated as 556.4: log likelihood=-1640.02
## AIC=3288.04 AICc=3288.16 BIC=3303.57
The SARIMA model with drift has the same order but includes a small drift term.
report(fit_atm1 %>% select(sarima_d))

## Series: ATM1
## Model: ARIMA(0,0,1)(0,1,2)[7] w/ drift
##
## Coefficients:
## ma1 sma1 sma2 constant
## 0.1867 -0.5765 -0.1097 -0.0883
## s.e. 0.0548 0.0507 0.0496 0.4869
##
## sigma^2 estimated as 557.9: log likelihood=-1640.01
## AIC=3290.02 AICc=3290.19 BIC=3309.42
The exponential smoothing model has an additive error, no trend and an additive seasonal term with weekly period.
report(fit_atm1 %>% select(ets))

## Series: ATM1
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0206385
## gamma = 0.3301984
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 79.39605 -62.73693 6.754848 13.60636 10.95823 12.94413 10.40877 8.064604
##
## sigma^2: 578.8824
##
## AIC AICc BIC
## 4486.151 4486.772 4525.150
The forecasts for the month of May 2010 are shown in the plot below for all 3 models. They do not differ sufficiently to allow model selection based on forecast reasonableness. None of the models are entirely satisfactory because they don’t seem to generate a stochastic trend. We observe that, even in the SARIMA model with drift, the drift term is very small. The weekly periodicity seems to be well captured by the model specification.
Looking at their innovation residuals for white noise, we see that SARIMA with no drift produces white-noise-like residuals: the ACF of the innovation residuals stays inside the dashed lines, the histogram looks relatively symmetric although slightly fat-tailed, and the time series plot of the residuals looks reasonable.
The SARIMA model with drift shows similar characteristics as that without drift.
The ETS model residuals look inferior. We see persistent autocorrelations at lags 1 and 6 below. On the basis of the model residuals, we can rule out the ETS model from consideration.
The cross validation will create forecasts using a procedure called “evaluation on a rolling forecasting origin.” The procedure is described in section 5.10 of Hyndman and Athanasopoulos (https://otexts.com/fpp3/tscv.html).
Within the full dataset, define \(T_{init}\) as the first date (May 1, 2009) of the full dataset and \(T_{final}\) as the last date (Apr 30, 2010).
We have to define the \(N\) test folds. Each of the \(N=153\) test folds allows us to evaluate prediction accuracy on an out-of-sample period of 30 days. To ensure adequate training data to fit the model, we use at least 50% of the full dataset to train the model. Define \(T_1\) as the origin of the first test fold. We set \(T_1\) to be Oct 29, 2009, approximately 182 days after \(T_{init}\).
The final test fold has its forecasting origin on \(T_N\) defined as March 31, 2010. \(T_N\) is chosen to ensure the out-of-sample testing period ends by \(T_{final}\). The table below illustrates the organization of the folds which is identical for all ATMs.
Cross Validation Date Ranges
| Fold | Train Start | Train End | Test Start | Test End |
|---|---|---|---|---|
| \(T_1\) | May 1, 2009 | Oct 29, 2009 | Oct 30, 2009 | Nov 28, 2009 |
| \(T_2\) | May 1, 2009 | Oct 30, 2009 | Oct 31, 2009 | Nov 29, 2009 |
| \(T_N\) | May 1, 2009 | Mar 31, 2010 | April 1, 2010 | April 30, 2010 |
The following code implements evaluation on a rolling forecasting origin for the 3 models, using the stretch_tsibble function to generate the multi-fold dataset. We evaluate the 3 models on RMSE and MAE. For each model we plot the model error (RMSE or MAE) on the \(y\) axis against the forecasting horizon \(h\) on the horizontal axis.
ATM1_stretch = piv %>% slice(1:(n()-31)) %>%
select( ATM1) %>%
stretch_tsibble(.init = 182, .step = cross_validation_stepsize )
bt_atm1_sarima = ATM1_stretch %>%
model(
sarima_nd = ARIMA(ATM1 ~ 0 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
sarima_d = ARIMA(ATM1 ~ 1 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
ets = ETS( ATM1 ~ error("A") + trend("N") + season("A") )
) %>% forecast( h = 30)

Based on the above two plots, we conclude:
- SARIMA with no drift (sarima_nd) performs best across all \(h=1 \ldots 30\) on both the RMSE and MAE plots.
- SARIMA with drift (sarima_d) underperforms slightly compared to its SARIMA cousin with no drift.
- ETS (ets) performs significantly worse than the SARIMA models.

On the basis of cross-validation and examining model residuals, we adopt the SARIMA model with no drift (sarima_nd). It produced slightly better cross-validated model errors, is more parsimonious than SARIMA with drift, and beats ETS on the basis of model residuals. We chose this particular specification by using the fable ARIMA model search algorithm to find the orders of the seasonal and non-seasonal components; the chosen orders also make intuitive sense.
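For reference, the error curves above are computed from the backtest forecasts by indexing each fold's forecasts with their horizon \(h\) and then aggregating accuracy by \(h\) and model (the same chunk appears in the appendix).

```r
# Index each fold's forecasts by horizon, then compute RMSE/MAE by horizon and model
bb = bt_atm1_sarima %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_arima = bb %>% fabletools::accuracy(piv, by = c("h", ".model")) %>% select(h, .model, RMSE, MAE)
rmse_bt_arima %>% ggplot() + geom_line(aes(x = h, y = RMSE, col = .model)) + labs(title = "RMSE of Cross Validated Models for ATM1")
rmse_bt_arima %>% ggplot() + geom_line(aes(x = h, y = MAE, col = .model)) + labs(title = "MAE of Cross Validated Models for ATM1")
```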
forecast_horizon = 31
fit_atm2 = piv %>% model(
sarima_nd = ARIMA(ATM2 ) ,
sarima_d = ARIMA(ATM2 ~ 1 + pdq(2,0,2) + PDQ(0, 1, 1, period = 7 ) ) ,
ets = ETS( ATM2 ~ error("A") + trend("N") + season("A") )
)
fc_atm2 = fit_atm2 %>% forecast(h=forecast_horizon)

The SARIMA model with no drift has order:
\[ARIMA ~ (p=2,d=0,q=2) + (P=0, D=1, Q=1)_{7}\]
The fitted model specification is:
\[(1-\phi_1B-\phi_2B^2)(1 - B^7)X_{t} = ( 1 + \theta_1 B +\theta_2 B^2)(1 + \Theta_{1}B^7 ) \epsilon_{t}\]
The model parameters are listed below.
report(fit_atm2 %>% select(sarima_nd))

## Series: ATM2
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4236 -0.9122 0.4590 0.7992 -0.7487
## s.e. 0.0548 0.0423 0.0848 0.0576 0.0388
##
## sigma^2 estimated as 586: log likelihood=-1648.63
## AIC=3309.25 AICc=3309.49 BIC=3332.54
The SARIMA model with drift has the same order but includes a moderate drift term.
report(fit_atm2 %>% select(sarima_d))

## Series: ATM2
## Model: ARIMA(2,0,2)(0,1,1)[7] w/ drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1 constant
## -0.4247 -0.9146 0.4600 0.7996 -0.7561 -0.7255
## s.e. 0.0525 0.0411 0.0823 0.0565 0.0392 0.7475
##
## sigma^2 estimated as 585.9: log likelihood=-1648.17
## AIC=3310.35 AICc=3310.67 BIC=3337.51
The exponential smoothing model has an additive error, no trend and an additive seasonal term with weekly period.
report(fit_atm2 %>% select(ets))

## Series: ATM2
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.003771524
## gamma = 0.3621959
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 72.74555 -53.45142 -42.43341 16.7643 6.024302 26.48906 17.09206 29.5151
##
## sigma^2: 624.0025
##
## AIC AICc BIC
## 4513.546 4514.168 4552.545
The forecasts for May 2010 are shown in the plot below for all 3 models of ATM2. Just as for ATM1, the forecasts look very similar, and they capture the level and seasonality of the prior data. From a forecasting perspective, all are reasonable candidates.
Looking at their innovation residuals for white noise, we see that SARIMA with no drift produces white-noise-like residuals. The ACF of the innovation residuals stays inside the dashed lines. The histogram looks relatively symmetric although slightly fat-tailed. The time series plot of the residuals shows recurring downward spikes, but they look cyclical rather than periodic.
The SARIMA model with drift shows similar characteristics as that without drift.
The ETS model residuals look inferior for ATM2, consistent with our findings for ATM1. On the basis of the model residuals, we can rule out the ETS model from consideration.
Following the same cross validation procedure used for ATM1, we produce diagnostic plots of the model error in forecasting ATM2 Cash withdrawals. The code chunk is shown below. Although the SARIMA model order and parameters differ between ATM1 and ATM2, the conclusion is similar: a SARIMA model with no drift is preferred over one with drift and over the ETS model.
ATM2_stretch = piv %>%
slice(1:(n()-31)) %>%
select( ATM2) %>%
stretch_tsibble(.init = 182, .step = cross_validation_stepsize )
bt_atm2_sarima = ATM2_stretch %>%
model(
sarima_nd = ARIMA(ATM2 ~ 0 + pdq(2,0,2) + PDQ(0, 1, 1, period = 7 )) ,
sarima_d = ARIMA(ATM2 ~ 1 + pdq(2,0,2) + PDQ(0, 1, 1, period = 7 )) ,
ets = ETS( ATM2 ~ error("A") + trend("N") + season("A") )
) %>% forecast( h = 30)

Based on the two above plots, we conclude that:
- SARIMA with no drift (sarima_nd) performs best in RMSE across all \(h=1 \ldots 30\) but the results are somewhat mixed for MAE.
- ETS beats SARIMA in MAE for \(h\) between 17 and 26.

On balance, we prefer the SARIMA model with no drift of the order specified above for ATM2.
We observe some autocorrelations for ATM4 at lags 7 and 14. We also observe more asymmetry in ATM4 than in ATM1 or ATM2.
The Box-Cox transformation uses \(\lambda = 0.398\), obtained by Guerrero’s method. Evaluating the time series plot and ACF of the transformed series shows that the skewness is significantly reduced.
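The transformation parameter is obtained with the chunk below (same as in the appendix).

```r
# Guerrero's method selects the Box-Cox lambda for ATM4
(lambda <- piv %>% features(ATM4, features = guerrero) %>% pull(lambda_guerrero))
```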
## [1] 0.3988552
forecast_horizon = 31
fit_atm4 = piv %>% model(
sarima_nd = ARIMA(box_cox( ATM4, lambda ) ~ 0 + pdq( 0, 0, 1) + PDQ( 2, 0, 0, period = 7 ) ) ,
sarima_d = ARIMA( box_cox( ATM4, lambda ) ~ 1 + pdq( 0, 0, 1) + PDQ( 2, 0, 0, period = 7 ) ) ,
ets = ETS( box_cox(ATM4, lambda ) ~ error("M") + trend("N") + season("A") )
)
fc_atm4 = fit_atm4 %>% forecast(h=forecast_horizon)

We report the model parameters below for SARIMA with no drift.
## Series: ATM4
## Model: ARIMA(0,0,1)(2,0,0)[7]
## Transformation: box_cox(ATM4, lambda)
##
## Coefficients:
## ma1 sar1 sar2
## 0.1383 0.4542 0.4592
## s.e. 0.0530 0.0467 0.0466
##
## sigma^2 estimated as 119.7: log likelihood=-1395.67
## AIC=2799.34 AICc=2799.46 BIC=2814.94
We report the model parameters below for SARIMA with drift.
## Series: ATM4
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean
## Transformation: box_cox(ATM4, lambda)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.0794 0.2124 0.2098 13.2940
## s.e. 0.0527 0.0516 0.0524 0.5507
##
## sigma^2 estimated as 100.1: log likelihood=-1357.1
## AIC=2724.2 AICc=2724.36 BIC=2743.7
We report the model parameters below for ETS.
## Series: ATM4
## Model: ETS(M,N,A)
## Transformation: box_cox(ATM4, lambda)
## Smoothing parameters:
## alpha = 0.01092174
## gamma = 0.1767566
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 21.73538 -14.38734 -3.441799 -0.4329291 7.38859 3.173384 4.410797 3.289302
##
## sigma^2: 0.193
##
## AIC AICc BIC
## 3817.797 3818.418 3856.796
If we examine the model residuals, the SARIMA model with no drift produces significant autocorrelations at lags 7 and 14. The SARIMA model with drift produces significant autocorrelation at lag 10. Only the ETS model produces no significant autocorrelations.
Applying the Ljung-Box test, we observe that the model residuals of ETS and SARIMA with drift do not reject the white-noise null hypothesis, whereas for SARIMA with no drift the null hypothesis is strongly rejected.
augment( fit_atm4 ) %>% features(.innov, ljung_box, lag = 10)

## # A tibble: 3 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ets 16.0 0.101
## 2 sarima_d 15.2 0.125
## 3 sarima_nd 28.5 0.00147
Lastly, we run the cross-validation procedure as before. The conclusion is that we choose the ETS model over the SARIMA with drift model.
ATM4_stretch = piv %>% slice(1:(n()-31)) %>%
select( ATM4) %>% stretch_tsibble(.init = 182, .step = cross_validation_stepsize )
bt_atm4 = ATM4_stretch %>% model(
sarima_nd = ARIMA(box_cox( ATM4, lambda )
~ 0 + pdq( 0, 0, 1) + PDQ( 2, 0, 0, period = 7 ) ) ,
sarima_d = ARIMA( box_cox( ATM4, lambda ) ~
1 + pdq( 0, 0, 1) + PDQ( 2, 0, 0, period = 7 ) ) ,
ets = ETS( box_cox(ATM4, lambda ) ~ error("M") + trend("N") + season("A") )
) %>%
forecast( h = 30)

The above diagnostic plots of the RMSE and MAE for ATM4 are somewhat in conflict and produce very different results than for ATM1 or ATM2.
Therefore, on balance, while some cross validation evidence is conflicting, ETS seems to be a better predictive model at more forecasting horizons than either SARIMA specification.
We export the May 2010 daily forecasts for all ATMs in an Excel file called Output_PartA_ATM_ANG_Forecasts.xlsx. The file uses the same format as the source data file, with three columns: DATE, ATM and Cash.
unpack_hilo( hilo( fc_atm1 ) ) %>%
filter(.model == 'sarima_nd' ) %>%
mutate( Cash = .mean, ATM = 'ATM1') %>%
select( DATE, ATM, Cash) -> atm1_output
# ATM3 has almost no withdrawal history, so we reuse the ATM1 forecast for it
unpack_hilo( hilo( fc_atm1 ) ) %>%
filter(.model == 'sarima_nd' ) %>%
mutate( Cash = .mean, ATM = 'ATM3') %>%
select( DATE, ATM, Cash) -> atm3_output
unpack_hilo( hilo( fc_atm2 ) ) %>%
filter(.model == 'sarima_nd' ) %>%
mutate( Cash = .mean, ATM = 'ATM2') %>%
select( DATE, ATM, Cash) -> atm2_output
unpack_hilo( hilo( fc_atm4 ) ) %>%
filter(.model == 'ets' ) %>%
mutate( Cash = .mean, ATM = 'ATM4') %>%
select( DATE, ATM, Cash) -> atm4_output
all_atm_output = rbind( atm1_output, atm2_output, atm3_output, atm4_output)
all_atm_output %>% openxlsx::write.xlsx( file = "Output_PartA_ATM_ANG_Forecasts.xlsx", overwrite = TRUE )

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
The raw dataset has 3 columns:

- `YYYY-MMM`: the time point, identified as a month and year, which we rename to `mth`
- `CaseSequence`: an integer index which appears to be a unique consecutive key
- `KWH`: the kilowatt-hour consumption level

Two observations require data imputation or evaluation:
Bad Observations
| mth | CaseSequence | KWH |
|---|---|---|
| 2008 Sep | 861 | NA |
| 2010 Jul | 883 | 770523 |
The seasonal graph suggests a strong seasonality pattern, so interpolating a missing or inaccurate monthly reading with the reading from 12 months earlier seems acceptable. We do this for both problematic data points. Because there is a positive upward trend in the data, imputing a value from the previous year may bias the estimate downward; we accept the biased estimate because the trend is flatter before 2011.
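The two replacements are implemented with the chunk below (same as in the appendix): the readings at CaseSequence 861 and 883 are overwritten with the readings 12 months earlier (CaseSequence 849 and 871).

```r
# Replace the missing 2008 value and the 2010 outlier with the reading from 12 months earlier
rp %>% mutate( KWH = ifelse( CaseSequence == 883, rp$KWH[ CaseSequence == (883 - 12)], KWH)) %>%
  mutate( KWH = ifelse( CaseSequence == 861, rp$KWH[ CaseSequence == (861 - 12)], KWH)) -> rpi
```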
Now we plot the missing value for 2008 in red. We join two red line segments between the missing value at CaseSequence 861 and its two immediate adjacent values on the graph (associated with CaseSequence 860 and 862). Likewise, we join two green line segments at CaseSequence 883.
Lastly, we compare electrical consumption before and after the data imputation in the two plots below. The adjusted chart below, in blue, looks very reasonable and more tractable for model building than the original unadjusted series.
We see strong evidence of a positive trend and seasonality in the electricity consumption in the STL decomposition charts. Causes for the positive trend could be an increase in population or economic activity. Causes for seasonality could be the climate, which increases electricity demand in hotter months, or human behavior driven by school schedules or tourism.
Following the same model building procedure outlined in Part A, we examine 3 models. Their orders were selected using the automated model search algorithm in fable.
fit_rpi = rpi %>% model(
ets = ETS( KWH ~ error("M") + trend("A") + season("M")),
sarima_d = ARIMA(KWH ~ 1 + pdq( 0, 0, 4) + PDQ( 2 , 1 , 0 , period = 12) ) ,
sarima_nd = ARIMA( KWH ~ 0 + pdq( 0, 0, 4) + PDQ( 2 , 1 , 0 , period = 12) )
)

Reports of the parameters of the fitted models on the full data series follow below:
## Series: KWH
## Model: ETS(M,A,M)
## Smoothing parameters:
## alpha = 0.07343375
## beta = 0.001543274
## gamma = 0.0001001775
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 6211689 6855.655 0.9597963 0.7463089 0.8687219 1.176741 1.259157 1.199069
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## 1.00169 0.7713234 0.8070511 0.9153309 1.069367 1.225444
##
## sigma^2: 0.009
##
## AIC AICc BIC
## 6140.425 6143.943 6195.803
## Series: KWH
## Model: ARIMA(0,0,4)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 ma4 sar1 sar2 constant
## 0.3317 0.0359 0.2180 -0.0702 -0.7001 -0.4163 227156.63
## s.e. 0.0780 0.0825 0.0746 0.0788 0.0777 0.0787 71633.02
##
## sigma^2 estimated as 3.794e+11: log likelihood=-2655.6
## AIC=5327.19 AICc=5328.04 BIC=5352.74
## Series: KWH
## Model: ARIMA(0,0,4)(2,1,0)[12]
##
## Coefficients:
## ma1 ma2 ma3 ma4 sar1 sar2
## 0.3758 0.0776 0.2577 -0.0338 -0.6594 -0.3841
## s.e. 0.0785 0.0841 0.0741 0.0778 0.0787 0.0812
##
## sigma^2 estimated as 3.979e+11: log likelihood=-2659.9
## AIC=5333.8 AICc=5334.45 BIC=5356.15
The model residuals are examined in the plots below for all 3 models. We observe that:
The model residuals of the ETS model suggest non-stationarity, while those of the SARIMA models do not.
Looking at forecasts, all three models do a similar job in the plot below: they capture the annual periodicity and amplitude of the KWH series over the preceding 2-3 years. Note that the historical data in the plot is truncated in order to see the forecast period more clearly.
We now cross validate the candidate models on a rolling forecasting origin. This process follows the convention explained in detail in Part A. Because the KWH time series has fewer observations, we split the data at a different starting point. We use Nov 2005 (data point 95) of the raw data as the initial origin of the cross validation. Using the same notation as in Part A, we define:
- \(T_1\) as the origin of the first fold
- \(T_2\) as the origin of the second fold
- \(T_N\) as the origin of the last fold
There are \(N=86\) folds in the cross validation exercise with the following example date ranges for the training and test periods.
| Fold | Train Start | Train End | Test Start | Test End |
|---|---|---|---|---|
| \(T_1\) | Jan 1998 | Nov 2005 | Dec 2005 | Nov 2006 |
| \(T_2\) | Jan 1998 | Dec 2005 | Jan 2006 | Dec 2006 |
| \(T_N\) | Jan 1998 | Dec 2012 | Jan 2013 | Dec 2013 |
rpi_stretch = rpi %>% slice(1:(n()-12)) %>%
select( KWH) %>% stretch_tsibble(.init = 95, .step = cross_validation_stepsize )Both cross validation error plots show that SARIMA with no drift (sarima_nd) has the lowest error. Its predictive accuracy dominates the other two models across all forecast horizons from \(h = 1 \ldots 12\).
Together with the earlier evidence on the stationarity of its model residuals and the intuitive plausibility of its forecasts, we select the SARIMA model with no drift as our preferred model.
We export the 2014 forecast at monthly intervals using the sarima_nd model fitted earlier. The output is written to a file Output_PartB_KWH_ANG_Forecasts.xlsx, continuing the CaseSequence numbering from 925 for the Jan 2014 forecast through Dec 2014.
output_fc_partb = unpack_hilo(hilo(fc_rpi %>% filter(.model == 'sarima_nd' ) ) ) %>%
select( mth, .mean ) %>%
mutate( KWH = .mean, CaseSequence = row_number() + 924) %>%
select(CaseSequence, mth , KWH ) %>%
rename( `YYYY-MMM` = mth ) %>%
openxlsx::write.xlsx( file = "Output_PartB_KWH_ANG_Forecasts.xlsx",
overwrite = TRUE)

Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and the .rmd file, and the forecast in an Excel-readable file.
One data wrangling issue we resolved is a minor known bug in the readxl library: the read_excel function generates a large number of messages when loading the Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx files. We get around this by converting the original files to cleaner versions that load quietly. The new files, called wfpipe1.xlsx and wfpipe2.xlsx, contain equivalent information and are used henceforth. The timestamp and flow rate variables are renamed to dt and flow1 for Pipe 1 and flow2 for Pipe 2.
For reproducibility, we retain the conversion code and control its execution by a conditional variable runOncePartC.
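The one-time conversion is shown below; in the appendix this chunk is guarded by runOncePartC.

```r
# Run once: read the noisy source workbooks, rename the columns, and write clean copies
Waterflow_Pipe1 <- read_excel("Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfpipe1 <- Waterflow_Pipe1 %>% rename(dt = `Date Time`, flow1 = WaterFlow)
wfpipe1 %>% openxlsx::write.xlsx(file = "wfpipe1.xlsx", overwrite = TRUE)

Waterflow_Pipe2 <- read_excel("Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
wfpipe2 <- Waterflow_Pipe2 %>% rename(dt = `Date Time`, flow2 = WaterFlow)
wfpipe2 %>% openxlsx::write.xlsx(file = "wfpipe2.xlsx", overwrite = TRUE)
```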
Pipe 1 contains flow readings at irregular time intervals, starting shortly after midnight on Oct 23, 2015 and ending at 11:35:43pm on Nov 1, 2015.
We group the readings into regular hourly intervals and associate the group average with the end of the hour interval. To illustrate, using the table below of selected readings: the first 4 readings, colored beige, are assigned to Oct 23, 2015 1am; the next 5 readings, colored blue, taken between 1am and 2am, are assigned to Oct 23, 2015 2am. We also show the last observation of Pipe 1 in yellow.
Waterflow Pipe 1: Initial and Final Observations
| dt | flow1 |
|---|---|
| 2015-10-23 00:24:06 | 23.370 |
| 2015-10-23 00:40:02 | 28.003 |
| 2015-10-23 00:53:51 | 23.066 |
| 2015-10-23 00:55:40 | 29.973 |
| 2015-10-23 01:19:17 | 5.998 |
| 2015-10-23 01:23:58 | 15.935 |
| 2015-10-23 01:50:05 | 26.617 |
| 2015-10-23 01:55:33 | 33.283 |
| 2015-10-23 01:59:15 | 12.427 |
| 2015-10-23 02:51:51 | 21.833 |
| 2015-11-01 23:34:10 | 16.219 |
| 2015-11-01 23:35:43 | 21.212 |
Pipe 2 contains regular hourly readings taken over a longer period of time. The first reading occurs at Oct 23, 2015 1am - coinciding with the first hourly reading of Pipe 1. The last hourly reading occurs on Dec 3, 2015 4pm - over a month after the last Pipe 1 hourly reading.
We show the initial and last reading of Pipe 2 for illustration below.
Waterflow Pipe 2: Initial and Final Readings
| dt | flow2 |
|---|---|
| 2015-10-23 01:00:00 | 18.811 |
| 2015-10-23 02:00:00 | 43.087 |
| 2015-10-23 03:00:00 | 37.988 |
| 2015-10-23 04:00:00 | 36.120 |
| 2015-10-23 05:00:00 | 31.851 |
| 2015-12-03 15:00:00 | 33.401 |
| 2015-12-03 16:00:00 | 66.681 |
# This code groups the irregular timestamps into the ending hour of each hourly interval.
# to get the starting hour use floor_date( dt, unit = "hour")
# We record the average of the group readings
# ---------------------------------------------------------------------------
wfpipe1 %>% mutate( dt_aggregate = ceiling_date( dt , unit = "hour")) %>%
group_by( dt_aggregate ) %>%
summarize( flow1 = mean(flow1, na.rm = TRUE )) %>%
rename( dt = dt_aggregate ) -> hwfpipe1

We show the resulting aggregated values for Pipe 1 below. The beige and blue colored rows show average readings corresponding to their same-colored groups in the earlier table.
Hourly Average of water pipe 1 flow
| dt | flow1 |
|---|---|
| 2015-10-23 01:00:00 | 26.103 |
| 2015-10-23 02:00:00 | 18.852 |
| 2015-10-23 03:00:00 | 15.159 |
We join the two pipe datasets by hourly timestamp with a full outer join, then convert the result into a tsibble to perform time series analysis. Note that there are 4 empty hourly intervals for Pipe 1; during those hours, no readings were observed, so the aggregated value is blank.
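The join and conversion to a tsibble are done with the chunk below (same as in the appendix).

```r
# Full outer join of the two hourly series, keyed by the hourly timestamp
hwfpipe1 %>% full_join( wfpipe2, by = "dt") %>%
  as_tsibble(index = dt) %>%
  mutate( id = row_number() ) -> fjpipes
```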
Full Join of both pipes: First and last 3 rows
| dt | flow1 | flow2 | id |
|---|---|---|---|
| 2015-10-23 01:00:00 | 26.103 | 18.811 | 1 |
| 2015-10-23 02:00:00 | 18.852 | 43.087 | 2 |
| 2015-10-23 03:00:00 | 15.159 | 37.988 | 3 |
| 2015-12-03 14:00:00 | NA | 42.937 | 998 |
| 2015-12-03 15:00:00 | NA | 33.401 | 999 |
| 2015-12-03 16:00:00 | NA | 66.681 | 1000 |
Let’s investigate the behavior of the two pipe flow rates.
The time series plot leads us to conclude that Pipe 1 is much smaller than Pipe 2 in its flow rate.
If the two pipes are related, then either Pipe 1 feeds into Pipe 2 as a tributary or Pipe 2 feeds into Pipe 1 as a source. In either case, we would look for a contemporaneous or lagged relationship in their flow rates. This would be investigated using a pairs plot and the cross correlation function.
Looking at the distributions of flow1 and flow2 restricted to the common period of readings, we see a -0.004 correlation between the contemporaneous flows. Both pipes have symmetric flow distributions in the density plots, and no pattern emerges in the scatter plot at lag 0.
Looking at the Pipe2 flow, we observe a similar symmetric distribution on the entire sample history.
Looking at the autocorrelations of Pipe 1, we see moderate evidence of negative autocorrelation at 9 and 12 hours. Perhaps these are related to the fact that people use water in the day and stop using it at night.
Pipe 2 shows positive autocorrelations at lags 15 and 24. Curiously, there is no significant 12-hour negative autocorrelation effect. Perhaps Pipe 2 is associated with industrial usage, where the periodicity is tied to shift work and the daily calendar rather than sleep-wake cycles.
Lastly, we inspect the cross correlations between Pipe 1 and Pipe 2. This shows the relationship between lags and leads of flow 1 versus flow 2.
For example, the lag coefficient \(h=-2\) measures the correlation between the 2 hours lagged flow rate of Pipe 1 to the current Pipe 2 flow rate.
\[CCF[-2]=corr(\text{flow1}[h-2], \text{flow2}[h])\]
Evidence of a strong cross correlation for \(h< 0\) implies Pipe 1 predicts the Pipe 2 flow rate. Evidence of a strong cross correlation for \(h > 0\) implies Pipe 2 predicts the Pipe 1 flow rate. If lagged Pipe 1 flow predicts Pipe 2 flow, this could mean Pipe 1 flows into Pipe 2. We see weak cross correlations at all lags \(h\), which suggests at most a modest linkage between the pipes' flows.
fjpipes %>% slice(1:240) %>%
CCF( flow1, flow2 , lag_max = 18) %>%
autoplot() + labs( title = "Cross Correlations between Flow 1 and Flow 2")

Therefore, we do not attempt to model any relationship between the pipes’ flows.
We therefore consider building a time series model for Pipe 1 flow.
The STL decomposition suggests a stochastic trend with some recurring daily seasonality. The residuals of the STL decomposition show autocorrelation; they are not white noise, but they appear homoskedastic and symmetric.
The KPSS test below suggests that the flow series \(X_t\) is stationary.
fjpipes %>% slice(1:240) -> flow1dat
flow1dat %>% features( flow1 , unitroot_kpss)

## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.184 0.1
fit_pipe1 = flow1dat %>%
model( m1 = ARIMA( flow1 ) ,
m2 = ARIMA( flow1 ~ pdq(0,1,2) + PDQ(0, 1, 0, period = 12 ) ),
m3 = ARIMA( flow1 ~ pdq(0,1,2) + PDQ(0, 1, 0, period = 24 ) ),
)

report(fit_pipe1 %>% select(m1))

## Series: flow1
## Model: ARIMA(0,0,0) w/ mean
##
## Coefficients:
## constant
## 19.8585
## s.e. 0.2743
##
## sigma^2 estimated as 18.14: log likelihood=-687.79
## AIC=1379.59 AICc=1379.64 BIC=1386.55
report(fit_pipe1 %>% select(m2))

## Series: flow1
## Model: ARIMA(0,1,2)(0,1,0)[12]
##
## Coefficients:
## ma1 ma2
## -0.9561 -0.0439
## s.e. 0.0734 0.0725
##
## sigma^2 estimated as 41.63: log likelihood=-746.98
## AIC=1499.96 AICc=1500.07 BIC=1510.24
report(fit_pipe1 %>% select(m3))

## Series: flow1
## Model: ARIMA(0,1,2)(0,1,0)[24]
##
## Coefficients:
## ma1 ma2
## -0.9616 -0.0384
## s.e. 0.0684 0.0669
##
## sigma^2 estimated as 36.52: log likelihood=-693.5
## AIC=1392.99 AICc=1393.11 BIC=1403.11
Next, we examine the 3 time series models based on their model residuals.
Now we examine the forecasts.
Model m1, which is just a constant plus white noise, is not satisfactory: its mean forecast shown above is a constant. Model m3 produces a repeating seasonal forecast that looks more plausible.
Based on these considerations, we choose model m3 which has a specification:
\[ARIMA ~ (p=0,d=1,q=2) + (P=0, D=1, Q=0)_{24}\]
We export its forecasts to Output_PartC_Forecast.xlsx.
output_fc_partb = unpack_hilo(hilo(fc_pipe1 %>% filter(.model == 'm3' ) ) ) %>%
select( dt, .mean ) %>%
mutate( flow1 = .mean ) %>%
select(dt, flow1 ) %>% openxlsx::write.xlsx( file = "Output_PartC_Forecast.xlsx",
overwrite = TRUE)
# Code
We summarize all the R code used in this project in this appendix for ease of reading.
```r
knitr::opts_chunk$set(echo = FALSE)
# Conditional Code Evaluation is managed here.
runPartA = TRUE
runPartB = TRUE
runPartC = TRUE
runOncePartC = FALSE # Used to load and convert XLSX files that cause unsuppressible warnings.
cross_validation_stepsize = 1 # Make this equal 1 for final render but a larger integer while editing.
library(fpp3)
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(GGally) # for ggpairs
library(readxl) # to parse Excel workbooks
library(openxlsx) # for writing xlsx files
.code-bg {
background-color: lightgray;
}
# Part A - Load the data and store it in original skinny datatable and alternative wide pivot table.
# Both forms are tsibbles
raw_ATM624Data <- read_excel("ATM624Data.xlsx",
col_types = c("date", "text", "numeric"))
atmdat_tmp = raw_ATM624Data %>% filter(!is.na(ATM)) %>% # Some dates are included in file with no ATM or cash data
as_tsibble(key=ATM, index = DATE, regular = FALSE)
atmdat_tmp %>% fill_gaps() %>% mutate(DATE=date(DATE)) %>% as_tsibble(key=ATM, index=DATE) -> atmdat
atmdat %>% pivot_wider(names_from =ATM, values_from = Cash) -> piv
atmdat %>% autoplot(Cash) + facet_wrap(~ATM, scales = "free_y") +
labs(title="Time Series of ATM before Data Cleaning")
piv %>% tail(n=6) %>% kable( digits = 2 ) %>% kable_styling(bootstrap_options = c("hover", "striped"), position="left")
atmdat %>% mutate( id = row_number()) %>% filter(is.na(Cash) | Cash > 10000) %>%
kable( digits = 0, caption = "Outliers and Missing Values" ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
atmdat[44, "Cash"] = 96
atmdat[47, "Cash"] = 145
atmdat[53, "Cash"] = 106
atmdat[414, "Cash"] = 7
atmdat[420, "Cash"] = 24
atmdat[1380, "Cash"] = 153.2
atmdat %>% pivot_wider(names_from =ATM, values_from = Cash) -> piv
atmdat %>%
mutate( wDay = lubridate::wday(DATE, label=TRUE)) %>%
as_tibble() %>%
ggplot() + geom_boxplot(aes(y=Cash, x = wDay) ) +
facet_wrap(~ATM, scales = "free") +
ggtitle("Withdrawals by Day of Week After Imputation", subtitle="Grouped by ATM")
atmdat %>%
mutate( mon = lubridate::month(DATE, label=TRUE)) %>%
as_tibble() %>%
ggplot() + geom_boxplot(aes(y=Cash, x = mon) ) +
facet_wrap(~ATM, scales = "free") +
ggtitle("Cash Withdrawals by Month After Imputation", subtitle="Grouped by ATM")
atmdat %>% autoplot(Cash) + facet_wrap(~ATM, scales = "free_y")
ggpairs(piv[,2:5], lower= list( continuous=wrap("points", alpha = 0.4, size = 0.2 ) ) )
piv %>% model(ATM1 = STL(ATM1)) %>% components() %>% autoplot() + labs(title="STL Decomposition for ATM1")
piv %>% model(ATM2 = STL(ATM2)) %>% components() %>% autoplot() + labs(title="STL Decomposition for ATM2")
piv %>% model(ATM4 = STL(ATM4)) %>% components() %>% autoplot() + labs(title="STL Decomposition for ATM4")
piv %>% gg_tsdisplay(y=ATM1, plot_type = "partial")
forecast_horizon = 31
fit_atm1 = piv %>% model(
sarima_nd = ARIMA(ATM1 ~ 0 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
sarima_d = ARIMA(ATM1 ~ 1 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
ets = ETS( ATM1 ~ error("A") + trend("N") + season("A") )
)
fc_atm1 = fit_atm1 %>% forecast(h=forecast_horizon)
report(fit_atm1 %>% select(sarima_nd))
report(fit_atm1 %>% select(sarima_d))
report(fit_atm1 %>% select(ets))
fc_atm1 %>% autoplot(piv %>% filter_index( "2010-03-01" ~ .)) +
labs(title = paste0("Three model forecasts for ", forecast_horizon, " day forecast"),
subtitle = "History After March 2010" )
fit_atm1 %>% select(sarima_nd) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA no drift for ATM1")
fit_atm1 %>% select(sarima_d) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA w drift for ATM1")
fit_atm1 %>% select(ets) %>% gg_tsresiduals(type = "innovation") + labs(title="ETS Residuals for ATM1")
ATM1_stretch = piv %>% slice(1:(n()-31)) %>%
select( ATM1) %>%
stretch_tsibble(.init = 182, .step = cross_validation_stepsize )
bt_atm1_sarima = ATM1_stretch %>%
model(
sarima_nd = ARIMA(ATM1 ~ 0 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
sarima_d = ARIMA(ATM1 ~ 1 + pdq(0,0,1) + PDQ(0, 1, 2, period = 7 ) ) ,
ets = ETS( ATM1 ~ error("A") + trend("N") + season("A") )
) %>% forecast( h = 30)
bb = bt_atm1_sarima %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_arima = bb %>% fabletools::accuracy( piv, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE )
rmse_bt_arima %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) ) + labs(title = "RMSE of Cross Validated Models for ATM1")
rmse_bt_arima %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) ) + labs(title = "MAE of Cross Validated Models for ATM1")
piv %>% gg_tsdisplay(y=ATM2, plot_type = "partial")
forecast_horizon = 31
fit_atm2 = piv %>% model(
sarima_nd = ARIMA(ATM2 ) ,
sarima_d = ARIMA(ATM2 ~ 1 + pdq(2,0,2) + PDQ(0, 1, 1, period = 7 ) ) ,
ets = ETS( ATM2 ~ error("A") + trend("N") + season("A") )
)
fc_atm2 = fit_atm2 %>% forecast(h=forecast_horizon)
report(fit_atm2 %>% select(sarima_nd))
report(fit_atm2 %>% select(sarima_d))
report(fit_atm2 %>% select(ets))
fc_atm2 %>% autoplot(piv %>% filter_index( "2010-03-01" ~ .)) +
labs(title = paste0("Three model forecasts for ", forecast_horizon, " day forecast for ATM2"),
subtitle = "History After March 2010" )
fit_atm2 %>% select(sarima_nd) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA no drift for ATM2")
fit_atm2 %>% select(sarima_d) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA w drift for ATM2")
fit_atm2 %>% select(ets) %>% gg_tsresiduals(type = "innovation") + labs(title="ETS Residuals for ATM2")
ATM2_stretch = piv %>%
slice(1:(n()-31)) %>%
select( ATM2) %>%
stretch_tsibble(.init = 182, .step = cross_validation_stepsize )
bt_atm2_sarima = ATM2_stretch %>%
model(
sarima_nd = ARIMA(ATM2 ~ 0 + pdq(2,0,2) + PDQ(0, 1, 1, period = 7 )) ,
sarima_d = ARIMA(ATM2 ~ 1 + pdq(2,0,2) + PDQ(0, 1, 1, period = 7 )) ,
ets = ETS( ATM2 ~ error("A") + trend("N") + season("A") )
) %>% forecast( h = 30)
bb2 = bt_atm2_sarima %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_atm2 = bb2 %>% fabletools::accuracy( piv, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE )
rmse_bt_atm2 %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) ) + labs(title = "RMSE of Cross Validated Models for ATM2")
rmse_bt_atm2 %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) ) + labs(title = "MAE of Cross Validated Models for ATM2")
piv %>% gg_tsdisplay(y=ATM4, plot_type = "partial")
(lambda <- piv %>% features( ATM4, features = guerrero) %>% pull(lambda_guerrero) )
piv = piv %>% mutate( bcATM4 = box_cox(ATM4, lambda) )
piv %>% gg_tsdisplay(y=bcATM4, plot_type = "partial")
forecast_horizon = 31
fit_atm4 = piv %>% model(
sarima_nd = ARIMA(box_cox( ATM4, lambda ) ~ 0 + pdq( 0, 0, 1) + PDQ( 2, 0, 0, period = 7 ) ) ,
sarima_d = ARIMA( box_cox( ATM4, lambda ) ~ 1 + pdq( 0, 0, 1) + PDQ( 2, 0, 0, period = 7 ) ) ,
ets = ETS( box_cox(ATM4, lambda ) ~ error("M") + trend("N") + season("A") )
)
fc_atm4 = fit_atm4 %>% forecast(h=forecast_horizon)
report(fit_atm4 %>% select(sarima_nd) )
report(fit_atm4 %>% select(sarima_d) )
report(fit_atm4 %>% select(ets) )
fc_atm4 %>% autoplot(piv %>% filter_index("2010-01-01" ~ . ), level = NULL) + labs(title = "ATM4 forecast")
fit_atm4 %>% select(sarima_nd) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA no drift for ATM4")
fit_atm4 %>% select(sarima_d) %>% gg_tsresiduals(type = "innovation") + labs(title="SARIMA w drift for ATM4")
fit_atm4 %>% select(ets) %>% gg_tsresiduals(type = "innovation") + labs(title="ETS for ATM4")
augment( fit_atm4 ) %>% features(.innov, ljung_box, lag = 10)
ATM4_stretch = piv %>% slice(1:(n()-31)) %>%
select( ATM4) %>% stretch_tsibble(.init = 182, .step = cross_validation_stepsize )
bt_atm4 = ATM4_stretch %>% model(
sarima_nd = ARIMA(box_cox( ATM4, lambda )
~ 0 + pdq( 0, 0, 1) + PDQ( 2, 0, 0, period = 7 ) ) ,
sarima_d = ARIMA( box_cox( ATM4, lambda ) ~
1 + pdq( 0, 0, 1) + PDQ( 2, 0, 0, period = 7 ) ) ,
ets = ETS( box_cox(ATM4, lambda ) ~ error("M") + trend("N") + season("A") )
) %>%
forecast( h = 30)
bb4 = bt_atm4 %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_atm4 = bb4 %>% fabletools::accuracy( piv, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE )
rmse_bt_atm4 %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) ) + labs(title = "RMSE of Cross Validated Models for ATM4")
rmse_bt_atm4 %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) ) + labs(title = "MAE of Cross Validated Models for ATM4")
unpack_hilo( hilo( fc_atm1 ) ) %>%
filter(.model == 'sarima_nd' ) %>%
mutate( Cash = .mean, ATM = 'ATM1') %>%
select( DATE, ATM, Cash) -> atm1_output
unpack_hilo( hilo( fc_atm1 ) ) %>%
filter(.model == 'sarima_nd' ) %>%
mutate( Cash = .mean, ATM = 'ATM3') %>%
select( DATE, ATM, Cash) -> atm3_output
unpack_hilo( hilo( fc_atm2 ) ) %>%
filter(.model == 'sarima_nd' ) %>%
mutate( Cash = .mean, ATM = 'ATM2') %>%
select( DATE, ATM, Cash) -> atm2_output
unpack_hilo( hilo( fc_atm4 ) ) %>%
filter(.model == 'ets' ) %>%
mutate( Cash = .mean, ATM = 'ATM4') %>%
select( DATE, ATM, Cash) -> atm4_output
all_atm_output = rbind( atm1_output, atm2_output, atm3_output, atm4_output)
all_atm_output %>% openxlsx::write.xlsx( file = "Output_PartA_ATM_ANG_Forecasts.xlsx", overwrite = TRUE )
respower <- read_excel("ResidentialCustomerForecastLoad-624.xlsx",
col_types = c("numeric", "text", "numeric"))
respower %>%
mutate( mth = yearmonth(`YYYY-MMM`)) %>%
select( mth, CaseSequence, KWH) %>%
as_tsibble(index = mth) -> rp
rp %>% filter(is.na(KWH) | ( KWH/1e6 < 3 ) ) %>%
kable(caption ="Bad Observations") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
rp %>% gg_season(KWH/1e6) + labs(y="Gigawatt Hours", title = "Seasonal Pattern for Residential Consumption", subtitle = "Raw Data Before Imputation" )
fix2008 = rp %>% mutate( KWH = ifelse( CaseSequence == 861 ,rp$KWH[ CaseSequence == (861 - 12)], KWH ) ) %>%
filter( abs(CaseSequence - 861 ) < 2 )
fix2010 = rp %>% mutate( KWH = ifelse( CaseSequence == 883 ,rp$KWH[ CaseSequence == (883 - 12)], KWH ) ) %>%
filter( abs(CaseSequence - 883 ) < 2 )
rp %>% filter(year(mth) > 2006 ) %>% autoplot(KWH) +
geom_point(data = fix2008, aes(y=KWH), color="red", alpha = 0.5 ) +
geom_line(data=fix2008, color = "red", alpha = 0.5) +
geom_point(data = fix2010, aes(y=KWH), color="green", alpha = 0.5 ) +
geom_line(data=fix2010, color = "green", alpha = 0.5) +
labs(title="Imputation for Missing or Outliers Data", subtitle = "Mega Kilowatt Hours")
plotPartB1 = rp %>% autoplot(KWH/1e6) + labs(title = "Unadjusted Residential Gigawatt Usage", y = "gigawatt hours")
# This is the corrected data.
rp %>% mutate( KWH = ifelse( CaseSequence == 883 , rp$KWH[ CaseSequence == (883 - 12)], KWH)) %>%
mutate( KWH = ifelse( CaseSequence == 861 , rp$KWH[ CaseSequence == (861 - 12)], KWH)) -> rpi
plotPartB2 = rpi %>% autoplot(KWH/1e6, color = "blue") + labs(title="Adjusted Residential Gigawatt Usage", y = "gigawatt hours" )
plot_grid(plotPartB1, plotPartB2, ncol = 1)
rpi %>% model( STL(KWH)) %>% components() %>% autoplot() + labs(title = "STL Decomposition for Residential Electric Consumption")
fit_rpi = rpi %>% model(
ets = ETS( KWH ~ error("M") + trend("A") + season("M")),
sarima_d = ARIMA(KWH ~ 1 + pdq( 0, 0, 4) + PDQ( 2 , 1 , 0 , period = 12) ) ,
sarima_nd = ARIMA( KWH ~ 0 + pdq( 0, 0, 4) + PDQ( 2 , 1 , 0 , period = 12) )
)
report(fit_rpi %>% select(ets))
report(fit_rpi %>% select(sarima_d))
report(fit_rpi %>% select(sarima_nd))
fit_rpi %>% select(ets ) %>% gg_tsresiduals(type = "innovation") + labs(title="Model Residuals ETS(MAM)")
fit_rpi %>% select(sarima_d ) %>% gg_tsresiduals(type = "innovation") + labs(title="Model Residuals SARIMA w Drift")
fit_rpi %>% select(sarima_nd) %>% gg_tsresiduals(type = "innovation") + labs(title="Model Residuals SARIMA No Drift")
fc_rpi = fit_rpi %>% forecast(h = 12)
fc_rpi %>%
autoplot(rpi %>% filter_index("2012-01" ~ . ), level = NULL) + labs(title = "Forecasts of 3 Models on Raw Series")
rpi_stretch = rpi %>% slice(1:(n()-12)) %>%
select( KWH) %>% stretch_tsibble(.init = 95, .step = cross_validation_stepsize )
bt_rpi = rpi_stretch %>% model(
ets = ETS( KWH ~ error("M") + trend("A") + season("M")),
sarima_d = ARIMA(KWH ~ 1 + pdq( 0, 0, 4) + PDQ( 2 , 1 , 0 , period = 12) ) ,
sarima_nd = ARIMA( KWH ~ 0 + pdq( 0, 0, 4) + PDQ( 2 , 1 , 0 , period = 12) )
) %>% forecast( h = 12)
cc = bt_rpi %>% group_by(.id, .model) %>% mutate(h = row_number()) %>% ungroup()
rmse_bt_rpi = cc %>% fabletools::accuracy( rpi, by = c("h", ".model")) %>% select(h, .model , RMSE, MAE )
rmse_bt_rpi %>% ggplot() + geom_line(aes(x = h, y = RMSE , col = .model ) ) +
scale_x_continuous(breaks = c( 1, 2, 4, 6, 8, 10, 12 )) +
labs(title = "RMSE of Cross Validated Models for KWH")
rmse_bt_rpi %>% ggplot() + geom_line(aes(x = h, y = MAE , col = .model ) ) +
scale_x_continuous(breaks = c( 1, 2, 4, 6, 8, 10, 12 )) +
labs(title = "MAE of Cross Validated Models for KWH")
output_fc_partb = unpack_hilo(hilo(fc_rpi %>% filter(.model == 'sarima_nd' ) ) ) %>%
select( mth, .mean ) %>%
mutate( KWH = .mean, CaseSequence = row_number() + 924) %>%
select(CaseSequence, mth , KWH ) %>%
rename( `YYYY-MMM` = mth ) %>%
openxlsx::write.xlsx( file = "Output_PartB_KWH_ANG_Forecasts.xlsx",
overwrite = TRUE)
Waterflow_Pipe1 <- read_excel("Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfpipe1 = Waterflow_Pipe1 %>% rename( dt = `Date Time`, flow1 = WaterFlow )
wfpipe1 %>% openxlsx::write.xlsx(file="wfpipe1.xlsx" , overwrite = TRUE )
rm(wfpipe1)
Waterflow_Pipe2 <- read_excel("Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
wfpipe2 = Waterflow_Pipe2 %>% rename( dt = `Date Time`, flow2 = WaterFlow )
wfpipe2 %>% openxlsx::write.xlsx(file="wfpipe2.xlsx" , overwrite = TRUE )
rm(wfpipe2)
# This code is loaded every time.
wfpipe1 = read_excel("wfpipe1.xlsx", col_types = c("date", "numeric"))
wfpipe2 = read_excel("wfpipe2.xlsx", col_types = c("date", "numeric"))
wfpipe1 %>% slice(1:10,( (n()-1):n()) ) %>%
kable(caption = "Waterflow Pipe 1: Initial and Final Observations", digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left") %>%
row_spec(1:4, background = "tan") %>%
row_spec(5:9, background = "skyblue") %>%
row_spec(12, background = "yellow")
wfpipe2 %>% slice(1:5, ( (n()-1) : n() ) ) %>%
kable(caption = "Waterflow Pipe 2: Initial and Final Readings", digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left") %>%
row_spec(1, background = "tan") %>%
row_spec(7, background = "yellow")
# This code groups the irregular timestamps into the ending hour of each hourly interval.
# to get the starting hour use floor_date( dt, unit = "hour")
# We record the average of the group readings
# ---------------------------------------------------------------------------
wfpipe1 %>% mutate( dt_aggregate = ceiling_date( dt , unit = "hour")) %>%
group_by( dt_aggregate ) %>%
summarize( flow1 = mean(flow1, na.rm = TRUE )) %>%
rename( dt = dt_aggregate ) -> hwfpipe1
hwfpipe1 %>% slice(1:3) %>% kable(caption = "Hourly Average of water pipe 1 flow", digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left") %>%
row_spec(1, background = "tan") %>%
row_spec(2, background = "skyblue")
hwfpipe1 %>% full_join( wfpipe2 , by = "dt") %>% as_tsibble(index = dt ) %>%
mutate( id = row_number() ) -> fjpipes
# Fill down only on the period up to the end of Pipe 1 data
# Don't fill down due to Pipe 2 data existing but no pipe 1 data. This would be an error
fjpipes %>% slice(1:240) %>% fill(flow1, .direction = c("down")) -> fjpipes2
# Now copy over the filled down values back into the original tsibble
# Trick: use a left join from original to truncated imputed data frame to fill in values.
# -------------------------------------------------------------------------------------------
fjpipes %>% left_join(fjpipes2, by = "dt" ) %>% mutate( flow1.x = ifelse((id.x < 240) & is.na(flow1.x), flow1.y, flow1.x)) %>% rename( flow1 = flow1.x, flow2 = flow2.x, id = id.x ) %>% select( dt, flow1, flow2, id ) -> fjpipes3
fjpipes = fjpipes3
fjpipes %>% slice(1:3,((n()-2):n())) %>%
kable(caption = "Full Join of both pipes: First and last 3 rows", digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
fjpipes %>% autoplot(flow1) + geom_line(aes(x=dt, y = flow2), alpha = 0.5, col = "red" ) +
labs(title = "Pipe1 and Pipe 2 Flow Rates", x = "Time", y = "Flow Rate")
fjpipes %>% filter( !is.na(flow1)) %>%
ggpairs( columns = 2:3 ,
lower= list( continuous=wrap("points", alpha = 0.6, size = 0.3 ) ) )
fjpipes %>% ggplot(aes(x = flow2)) + geom_density() + labs(title = "Pipe 2 flow")
fjpipes %>% gg_tsdisplay( flow1, plot_type = c("partial")) + labs(title = "Flow 1 Autocorrelations")
fjpipes %>% gg_tsdisplay( flow2, plot_type = c("partial")) + labs(title = "Flow 2 Autocorrelations")
fjpipes %>% slice(1:240) %>%
CCF( flow1, flow2 , lag_max = 18) %>%
autoplot() + labs( title = "Cross Correlations between Flow 1 and Flow 2")
fjpipes %>% slice(1:240) %>% gg_tsdisplay(flow1, plot_type = "partial")
fjpipes %>% slice(1:240) %>% model(STL(flow1)) %>% components() %>% autoplot()
fjpipes %>% slice(1:240) -> flow1dat
flow1dat %>% features( flow1 , unitroot_kpss)
fit_pipe1 = flow1dat %>%
model( m1 = ARIMA( flow1 ) ,
m2 = ARIMA( flow1 ~ pdq(0,1,2) + PDQ(0, 1, 0, period = 12 ) ),
m3 = ARIMA( flow1 ~ pdq(0,1,2) + PDQ(0, 1, 0, period = 24 ) ),
)
report(fit_pipe1 %>% select(m1))
report(fit_pipe1 %>% select(m2))
report(fit_pipe1 %>% select(m3))
fit_pipe1 %>% select(m1 ) %>% gg_tsresiduals()
fit_pipe1 %>% select(m2 ) %>% gg_tsresiduals()
fit_pipe1 %>% select(m3 ) %>% gg_tsresiduals()
fc_pipe1 = fit_pipe1 %>% forecast(h = 240)
fc_pipe1 %>% filter( .model != 'm2' ) %>% autoplot(flow1dat %>% filter(id > 100) , level = NULL )
output_fc_partb = unpack_hilo(hilo(fc_pipe1 %>% filter(.model == 'm3' ) ) ) %>%
select( dt, .mean ) %>%
mutate( flow1 = .mean ) %>%
select(dt, flow1 ) %>% openxlsx::write.xlsx( file = "Output_PartC_Forecast.xlsx",
overwrite = TRUE)
```