library(zoo)
library(forecast)
library(readxl)
library(tidyverse)
library(imputeTS)
library(openxlsx)
library(fpp3)
library(lubridate)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
After reading in the data, I can see the date has imported as a double. I converted this date back to date format in R.
ATM624Data <-
read_excel("C:\\Users\\Kim\\Documents\\Data624\\ATM624Data.xlsx")
ATM624Data <- ATM624Data %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
Before beginning this assignment, I took a look at the data briefly in Excel. I can see that there are duplicate blank entries that do not appear to correspond to any of the ATMs. All four ATMs have datasets that match up to the full year, which can be shown by filtering the data on ATM type and looking at the number of entries (365 entries for 365 days). As such, I will exclude the dates with blank ATM ID as they appear to be in the data erroneously. This exclusion is simply done by separating the datasets into their respective ATM identities using the filter function.
ATM1 <- ATM624Data %>% filter(ATM == "ATM1")
ATM2 <- ATM624Data %>% filter(ATM == "ATM2")
ATM3 <- ATM624Data %>% filter(ATM == "ATM3")
ATM4 <- ATM624Data %>% filter(ATM == "ATM4")
Next, I will focus on the ATM1 dataset. I turned ATM1 into a tsibble and autoplot it to see the general shape of the data. I can see there are missing values, and confirm this using is.na. Since I can see a seasonal pattern, albeit irregular (i.e. the general up and down behavior appears seasonal, but with varying magnitude), I believe imputing the NA values based on linear interpolation makes the most sense. I can do this through the na_interpolation() function in the imputeTS package.
ATM1 <- ATM1 %>% as_tsibble(index = DATE)
ATM1 %>% autoplot(Cash)
any(is.na(ATM1$Cash))
## [1] TRUE
ATM1$Cash <-na_interpolation(ATM1$Cash, option = "linear")
ATM1 %>% autoplot(Cash)
I looked at both Exponential Smoothing and ARIMA as potential forecasting models for all four datasets. I forecasted 31 days ahead to predict ATM withdrawals in the month of May 2010.
An ETS(A,N,A) model was selected and used for the forecast. Looking at the residuals, the variance has irregular spikes and the ACF shows autocorrelations above the critical value at 3 lags. The p-value resulting from the ljung_box test is under 0.05, and therefore the model is not a good fit the time series is not white noise.
fit_ETS_ATM1 <- ATM1 %>% model(ETS(Cash))
report(fit_ETS_ATM1)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.02126635
## gamma = 0.3298864
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 77.02956 -55.27235 2.531658 16.15413 4.529656 11.40245 12.92571 7.72875
##
## sigma^2: 580.2586
##
## AIC AICc BIC
## 4487.018 4487.639 4526.017
fc_ETS_ATM1 <- forecast(fit_ETS_ATM1, h = 31)
fc_ETS_ATM1 %>% autoplot() + autolayer(ATM1)
gg_tsresiduals(fit_ETS_ATM1)
augment(fit_ETS_ATM1) %>%
features(.innov, ljung_box, lag = 14)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(Cash) 28.9 0.0108
For ARIMA, I first performed a unitroot test using the features(,unitroot_nsdiffs) function to see if the model requires seasonal differencing given its strong seasonal pattern. Since ARIMA is performed on stationary data and seasonal data is not stationary, I expect seasonal differencing to be required. This is confirmed, where one order of seasonal differencing is returned by the function. Afterwards, I seasonally difference the data and then use the features(,unitroot_ndiffs) to determine if non-seasonal differencing is also required. Since it returns 0, I then plot the ACF and PACF to determine if the data is stationary. The timeplot appears roughly horizontal with relatively constant variance although there are a few spikes where variance is larger. The ACF is not large and positive, and does not show the characteristic slow decrease in non-stationary data. However, there are spikes in the PACF throughout the plot. Using the KPSS unit test, I can see that the data is stationary after differencing because the p-value is greater than 0.05, which means I do not reject the null hypothesis that the data is stationary. As such, I can fit an ARIMA model to the non-transformed dataset.
ATM1 %>% features(Cash, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
ATM1_nsdiff <- ATM1 %>% mutate(Cash_nsdiff = difference(Cash, 7))
ATM1_nsdiff %>% features(Cash_nsdiff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
ATM1_nsdiff %>% gg_tsdisplay(Cash_nsdiff,plot_type = "partial")
ATM1_nsdiff %>% features(Cash_nsdiff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0220 0.1
I looked at the ARIMA model without transformation first, using the ARIMA function with stepwise = FALSE and approximation = FALSE. Although this is more computationally intensive, the dataset is small and the function runs within 30 seconds. The potential for an improved model outweighs the disadvantages in this case. An ARIMA(0,0,1)(1,1,1)[7] model is fit, and this model is used for forecasting. After forecasting, I can see that the variance of the residuals is not very constant, but the autocorrelations in the ACF plot are almost all within the bounds of the critical values except for one. The distribution of the residuals also appear approximately normal. Performing the ljung-box test, I can determine that this is a white noise time series as the p-value is above 0.05, and this model can be considered a good fit.
fit_ARIMA_ATM1 <- ATM1 %>%
model(ARIMA(Cash, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM1)
## Series: Cash
## Model: ARIMA(0,0,1)(1,1,1)[7]
##
## Coefficients:
## ma1 sar1 sma1
## 0.1968 0.1848 -0.7537
## s.e. 0.0546 0.0793 0.0552
##
## sigma^2 estimated as 555.1: log likelihood=-1639.63
## AIC=3287.26 AICc=3287.37 BIC=3302.78
fc_ARIMA_ATM1 <- forecast(fit_ARIMA_ATM1, h = 31)
fc_ARIMA_ATM1 %>% autoplot() + autolayer(ATM1)
gg_tsresiduals(fit_ARIMA_ATM1)
augment(fit_ARIMA_ATM1) %>%
features(.innov, ljung_box, lag = 14, dof = 3)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Cash, stepwise = FALSE, approximation = FALSE) 15.1 0.178
However, given the ACF and PACF plots of the seasonally differenced data, I also explored a Box-Cox transformation on the data before modeling to see if the prerequisites for stationary data could be better met. I perform a Box-Cox transformation on the data using the features(, features = guerrero) function to find the optimal lambda for transformation. I then use that lambda to transform the data.
lambda_guerrero <- ATM1 %>%
features(Cash, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
ATM1_box_cox <- ATM1 %>%
mutate(box_cox = box_cox(Cash,lambda_guerrero)) %>%
select(box_cox)
I attempted an exponential smoothing model again first, where an ETS(A,N,A) model was selected again. Forecasting with the Box-Cox transformed data, I can see in the residuals that the variance is not constant, though this model is a better fit given fewer spikes in the ACF plot that extend beyond the critical values. Additionally, performing the ljung-box test now returns a p-value greater than 0.05, indicating a white noise time series.
fit_ETS_ATM1_box_cox <- ATM1_box_cox %>% model(ETS(box_cox))
report(fit_ETS_ATM1_box_cox)
## Series: box_cox
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000125
## gamma = 0.3521274
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 7.943327 -4.537357 0.6094231 1.115404 0.5366943 1.005215 0.5374259 0.7331945
##
## sigma^2: 1.8037
##
## AIC AICc BIC
## 2379.638 2380.260 2418.637
fc_ETS_ATM1_box_cox <- forecast(fit_ETS_ATM1_box_cox, h = 31)
fc_ETS_ATM1_box_cox %>% autoplot() + autolayer(ATM1_box_cox)
gg_tsresiduals(fit_ETS_ATM1_box_cox)
augment(fit_ETS_ATM1_box_cox) %>%
features(.innov, ljung_box, lag = 14)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(box_cox) 19.1 0.160
For performing ARIMA on the Box-Cox transformed data, I follow the same process from the above section “ARIMA model on the non-transformed dataset”. Looking at the ACF and PACF plots, I can see the PACF plot shows fewer spikes throughout the plot. The KPSS is greater than 0.05 so the data is determined to be stationary.
ATM1_box_cox %>% features(box_cox, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
ATM1_box_cox_nsdiff <- ATM1_box_cox %>%
mutate(box_cox_Cash_nsdiff = difference(box_cox, 7))
ATM1_box_cox_nsdiff %>%
features(box_cox_Cash_nsdiff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
ATM1_box_cox_nsdiff %>%
gg_tsdisplay(box_cox_Cash_nsdiff,plot_type = "partial")
ATM1_box_cox_nsdiff %>%
features(box_cox_Cash_nsdiff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0153 0.1
When fitting an ARIMA model to the Box-Cox transformed data, an ARIMA(0,0,2)(0,1,1)[7] model was selected. Looking at the residuals, I can see a wide range of variance, though all autocorrelations are within the bounds of the critical values and the distribution of residuals is approximately normal. The p-value from the ljung-box test is 0.543, indicating a good fit and white noise time series.
fit_ARIMA_ATM1_box_cox <- ATM1_box_cox %>% model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM1_box_cox)
## Series: box_cox
## Model: ARIMA(0,0,2)(0,1,1)[7]
##
## Coefficients:
## ma1 ma2 sma1
## 0.1126 -0.1094 -0.6418
## s.e. 0.0524 0.0520 0.0432
##
## sigma^2 estimated as 1.764: log likelihood=-609.99
## AIC=1227.98 AICc=1228.09 BIC=1243.5
fc_ARIMA_ATM1_box_cox <- forecast(fit_ARIMA_ATM1_box_cox, h = 31)
fc_ARIMA_ATM1_box_cox %>% autoplot() + autolayer(ATM1_box_cox)
gg_tsresiduals(fit_ARIMA_ATM1_box_cox)
augment(fit_ARIMA_ATM1_box_cox) %>% features(.innov, ljung_box, lag = 14, dof = 3)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 9.86 0.543
To determine the best model between the three “good fit” models, ARIMA(0,0,1)(1,1,1)[7] on the non-transformed data, ETS(A,N,A) on the Box-Cox transformed data, and ARIMA(0,0,2)(0,1,1)[7] on the Box-Cox transformed data, I split the data into a training set and test set. The training set is comprised of May 2009 through March 2010. Note that this has to be done separately for the Box-Cox transformed data and the non-transformed data.
ATM1_train <- ATM1 %>% filter(DATE < "2010-04-01")
ATM1_box_cox_train <- ATM1_box_cox %>% filter(DATE < "2010-04-01")
ARIMA(0,0,1)(1,1,1)[7] on the non-transformed data:
fit_ARIMA_ATM1_train <- ATM1_train %>%
model(ARIMA(Cash ~ pdq(0,0,1) + PDQ(1,1,1)))
report(fit_ARIMA_ATM1_train)
## Series: Cash
## Model: ARIMA(0,0,1)(1,1,1)[7]
##
## Coefficients:
## ma1 sar1 sma1
## 0.2026 0.2012 -0.7716
## s.e. 0.0578 0.0872 0.0639
##
## sigma^2 estimated as 595.8: log likelihood=-1514
## AIC=3035.99 AICc=3036.11 BIC=3051.16
ETS(A,N,A) on the transformed data:
fit_ETS_ATM1_box_cox_train <- ATM1_box_cox_train %>%
model(ETS(box_cox ~ error("A")+trend("N")+season("A")))
report(fit_ETS_ATM1_box_cox_train)
## Series: box_cox
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000537
## gamma = 0.3433203
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 7.867257 -4.531842 0.4201177 0.8977109 0.6534645 0.9420963 1.005517 0.6129352
##
## sigma^2: 1.9461
##
## AIC AICc BIC
## 2181.670 2182.349 2219.811
ARIMA(0,0,2)(0,1,1)[7] on the transformed data:
fit_ARIMA_ATM1_box_cox_train <- ATM1_box_cox_train %>%
model(ARIMA(box_cox ~ pdq(0,0,2) + PDQ(0,1,1)))
report(fit_ARIMA_ATM1_box_cox_train)
## Series: box_cox
## Model: ARIMA(0,0,2)(0,1,1)[7]
##
## Coefficients:
## ma1 ma2 sma1
## 0.1139 -0.1197 -0.6512
## s.e. 0.0547 0.0549 0.0473
##
## sigma^2 estimated as 1.899: log likelihood=-571.03
## AIC=1150.05 AICc=1150.18 BIC=1165.22
After creating the models, I can create the forecasts for 30 days and then assess the accuracy based on the test data. In this case, since models are being compared among different “starting” datasets (i.e. transformed and non-transformed), the RMSSE should be used as a comparison. The RMSSE takes into account the scale of the data, which makes it a better for comparison. Looking back at the residuals, all models exhibit heteroscedasticity although it is most prominent in the Box-Cox transformed models. However, the ARIMA(0,0,2)(0,1,1)[7] on the transformed data has the lowest RMSSE, so ultimately it is selected as the best-fit model.
bind_rows(
fit_ARIMA_ATM1_train %>% forecast(h = 30) %>%
accuracy(ATM1) %>% select(.model, MASE,RMSSE),
fit_ETS_ATM1_box_cox_train %>% forecast(h = 30) %>%
accuracy(ATM1_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ATM1_box_cox_train %>% forecast(h = 30) %>%
accuracy(ATM1_box_cox) %>% select(.model, MASE,RMSSE),
)
## # A tibble: 3 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 "ARIMA(Cash ~ pdq(0, 0, 1) + PDQ(1, 1, 1))" 0.527 0.448
## 2 "ETS(box_cox ~ error(\"A\") + trend(\"N\") + season(\"A\"))" 0.527 0.419
## 3 "ARIMA(box_cox ~ pdq(0, 0, 2) + PDQ(0, 1, 1))" 0.526 0.418
gg_tsresiduals(fit_ARIMA_ATM1)
gg_tsresiduals(fit_ETS_ATM1_box_cox)
gg_tsresiduals(fit_ARIMA_ATM1_box_cox)
Since the transformed data had the best fit model, the data must be transformed back in order to interpret the predictive values of cash withdrawals from ATM1 for the month of May 2010.
fc_ARIMA_ATM1_box_cox
## # A fable: 31 x 4 [1D]
## # Key: .model [1]
## .model DATE
## <chr> <date>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-01
## 2 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-02
## 3 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-03
## 4 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-04
## 5 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-05
## 6 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-06
## 7 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-07
## 8 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-08
## 9 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-09
## 10 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-10
## # ℹ 21 more rows
## # ℹ 2 more variables: box_cox <dist>, .mean <dbl>
fc_ARIMA_ATM1_box_cox %>% autoplot(level = FALSE) + autolayer(ATM1_box_cox)
ATM1_forecasted <- fc_ARIMA_ATM1_box_cox %>% rename(box_cox_Cash = .mean) %>%
mutate(Cash = inv_box_cox(box_cox_Cash, lambda_guerrero)) %>%
select(DATE,Cash)
ATM1_forecasted %>% autoplot(color = "blue") + autolayer(ATM1) +
labs(x = "Date", y = "Cash Withdrawal in Hundreds",
title = "Cash Withdrawal Prediction from ATM2 using ARIMA(0,0,2)(0,1,1)")
With the ATM2 dataset, I can see a more regular seasonal pattern. I can also see again that there are NA values. For this dataset, I believe imputing the NA values using seasonal decomposition makes the most sense. To do this, I use the na_seadec() function in the imputeTS package.
ATM2 <- ATM2 %>% as_tsibble(index = DATE)
ATM2 %>% autoplot(Cash)
any(is.na(ATM2$Cash))
## [1] TRUE
ATM2$Cash <- na_seadec(ATM2$Cash, algorithm = "interpolation")
ATM2 %>% autoplot(Cash)
An ETS(A,N,A) model was selected and used for the forecast. Looking at the residuals, the variance does not appear relatively constant and the ACF shows autocorrelations above the critical value at 4 lags, but the distribution of residuals appears roughly normal. The p-value resulting from the ljung_box test is under 0.05, and therefore the model is not a good fit the time series is not white noise.
fit_ETS_ATM2 <- ATM2 %>% model(ETS(Cash))
report(fit_ETS_ATM2)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000049
## gamma = 0.3594252
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 72.71479 -33.56563 -27.28811 12.45705 0.8474403 4.742114 17.29087 25.51626
##
## sigma^2: 648.1015
##
## AIC AICc BIC
## 4527.377 4527.998 4566.376
fc_ETS_ATM2 <- forecast(fit_ETS_ATM2, h = 31)
fc_ETS_ATM2 %>% autoplot() + autolayer(ATM2)
gg_tsresiduals(fit_ETS_ATM2)
augment(fit_ETS_ATM2) %>%
features(.innov, ljung_box, lag = 14)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(Cash) 37.4 0.000649
For ARIMA, I can see the dataset requires one order of seasonal differencing. No non-seasonal differencing is required. The variance of the differences is not very constant and the PACF shows spikes throughout the plot. The data is shown to be stationary after differencing, even without transformation, through the KPSS unit test.
ATM2 %>% features(Cash, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
ATM2_nsdiff <- ATM2 %>% mutate(Cash_nsdiff = difference(Cash, 7))
ATM2_nsdiff %>% features(Cash_nsdiff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
ATM2_nsdiff %>% gg_tsdisplay(Cash_nsdiff,plot_type = "partial")
ATM2_nsdiff %>% features(Cash_nsdiff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0167 0.1
An ARIMA(2,0,2)(0,1,1)[7] model is fit, and this model is used for forecasting. After forecasting, I can see that the variance of the residuals is not very constant, and the autocorrelations in the ACF plot has two spikes past the critical values. This is past the 5% threshold of tolerance for autocorrelations outside of the critical value bounds, although the distribution of the residuals appear approximately normal. Performing the ljung-box test, I can see that the p-value is above 0.05, and this model is a better fit than the ETS(A,N,A) model.
fit_ARIMA_ATM2 <- ATM2 %>%
model(ARIMA(Cash, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM2)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4320 -0.9130 0.4773 0.8048 -0.7547
## s.e. 0.0553 0.0407 0.0861 0.0556 0.0381
##
## sigma^2 estimated as 602.5: log likelihood=-1653.67
## AIC=3319.33 AICc=3319.57 BIC=3342.61
fc_ARIMA_ATM2 <- forecast(fit_ARIMA_ATM2, h = 31)
fc_ARIMA_ATM2 %>% autoplot() + autolayer(ATM2)
gg_tsresiduals(fit_ARIMA_ATM2)
augment(fit_ARIMA_ATM2) %>%
features(.innov, ljung_box, lag = 14, dof = 5)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Cash, stepwise = FALSE, approximation = FALSE) 10.2 0.332
Again, I also explored a Box-Cox transformation on the data before modeling to see if the prerequisites for stationary data could be better met.
lambda_guerrero <- ATM2 %>%
features(Cash, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
ATM2_box_cox <- ATM2 %>%
mutate(box_cox = box_cox(Cash,lambda_guerrero)) %>%
select(box_cox)
I attempted an exponential smoothing model again first, where an ETS(A,N,A) model was selected again. Forecasting with the Box-Cox transformed data, I can see in the autocorrelations still show spikes beyond the bounds of the critical values. Additionally, performing the ljung-box test still returns a p-value less than 0.05, indicating this time series is not white noise.
fit_ETS_ATM2_box_cox <- ATM2_box_cox %>% model(ETS(box_cox))
report(fit_ETS_ATM2_box_cox)
## Series: box_cox
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000387
## gamma = 0.3837262
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 26.45702 -18.63684 -11.61431 10.6539 2.227776 3.924051 6.477696 6.96773
##
## sigma^2: 72.4456
##
## AIC AICc BIC
## 3727.585 3728.206 3766.584
fc_ETS_ATM2_box_cox <- forecast(fit_ETS_ATM2_box_cox, h = 31)
fc_ETS_ATM2_box_cox %>% autoplot() + autolayer(ATM2_box_cox)
gg_tsresiduals(fit_ETS_ATM2_box_cox)
augment(fit_ETS_ATM2_box_cox) %>%
features(.innov, ljung_box, lag = 14)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(box_cox) 33.9 0.00214
Looking at the ACF and PACF plots, I can see the Box-Cox transformed dataset produces plots that are nearly identical in shape to the non-transformed dataset. After differencing, the KPSS unit test shows that this data is stationary and an ARIMA model can be fit.
ATM2_box_cox %>% features(box_cox, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
ATM2_box_cox_nsdiff <- ATM2_box_cox %>%
mutate(box_cox_Cash_nsdiff = difference(box_cox, 7))
ATM2_box_cox_nsdiff %>%
features(box_cox_Cash_nsdiff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
ATM2_box_cox_nsdiff %>%
gg_tsdisplay(box_cox_Cash_nsdiff,plot_type = "partial")
ATM2_box_cox_nsdiff %>%
features(box_cox_Cash_nsdiff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0162 0.1
An ARIMA(2,0,2)(0,1,1)[7] model was selected for the Box-Cox transformed data, which is unsurprising given the similar shape of the data. The p-value from the ljung-box test is again over 0.05, so this model can be considered a potential model for selection and evaluation.
fit_ARIMA_ATM2_box_cox <- ATM2_box_cox %>%
model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM2_box_cox)
## Series: box_cox
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4256 -0.9051 0.4719 0.7967 -0.7271
## s.e. 0.0588 0.0444 0.0888 0.0590 0.0406
##
## sigma^2 estimated as 68.16: log likelihood=-1263.33
## AIC=2538.66 AICc=2538.9 BIC=2561.94
fc_ARIMA_ATM2_box_cox <- forecast(fit_ARIMA_ATM2_box_cox, h = 31)
fc_ARIMA_ATM2_box_cox %>% autoplot() + autolayer(ATM2_box_cox)
gg_tsresiduals(fit_ARIMA_ATM2_box_cox)
augment(fit_ARIMA_ATM2_box_cox) %>%
features(.innov, ljung_box, lag = 14, dof = 5)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 10.7 0.300
The two models in this assessment are the ARIMA(2,0,2)(0,1,1)[7] on the non-transformed data and ARIMA(2,0,2)(0,1,1)[7] on the transformed data.
ATM2_train <- ATM2 %>% filter(DATE < "2010-04-01")
ATM2_box_cox_train <- ATM2_box_cox %>% filter(DATE < "2010-04-01")
ARIMA(2,0,2)(0,1,1)[7] on the non-transformed data:
fit_ARIMA_ATM2_train <- ATM2_train %>%
model(ARIMA(Cash ~ pdq(2,0,2) + PDQ(0,1,1)))
report(fit_ARIMA_ATM2_train)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4322 -0.9259 0.4799 0.8085 -0.779
## s.e. 0.0484 0.0399 0.0805 0.0537 0.042
##
## sigma^2 estimated as 626.2: log likelihood=-1521.62
## AIC=3055.25 AICc=3055.51 BIC=3078.01
ARIMA(2,0,2)(0,1,1)[7] on the transformed data:
fit_ARIMA_ATM2_box_cox_train <- ATM2_box_cox_train %>%
model(ARIMA(box_cox ~ pdq(2,0,2) + PDQ(0,1,1)))
report(fit_ARIMA_ATM2_box_cox_train)
## Series: box_cox
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4251 -0.9187 0.4732 0.8028 -0.7474
## s.e. 0.0511 0.0441 0.0819 0.0578 0.0460
##
## sigma^2 estimated as 71.19: log likelihood=-1164.73
## AIC=2341.45 AICc=2341.71 BIC=2364.21
From the comparison below, we can see that the model on the Box-Cox transformed data had performed better due to its lower MASE, although RMSSE is the same. When revisiting the residuals plot, we can see similar behavior in both where there are two spikes in the autocorrelations, roughly normal distribution of residuals, and heteroscedasticity when looking at the variance of residuals.
bind_rows(
fit_ARIMA_ATM2_train %>% forecast(h = 30) %>%
accuracy(ATM2) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ATM2_box_cox_train %>% forecast(h = 30) %>%
accuracy(ATM2_box_cox) %>% select(.model, MASE,RMSSE)
)
## # A tibble: 2 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 ARIMA(Cash ~ pdq(2, 0, 2) + PDQ(0, 1, 1)) 0.767 0.681
## 2 ARIMA(box_cox ~ pdq(2, 0, 2) + PDQ(0, 1, 1)) 0.733 0.681
gg_tsresiduals(fit_ARIMA_ATM2)
gg_tsresiduals(fit_ARIMA_ATM2_box_cox)
Since the transformed data had the best fit model, the data must be transformed back in order to interpret the predictive values of cash withdrawals from ATM2 for the month of May 2010.
fc_ARIMA_ATM2_box_cox
## # A fable: 31 x 4 [1D]
## # Key: .model [1]
## .model DATE
## <chr> <date>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-01
## 2 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-02
## 3 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-03
## 4 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-04
## 5 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-05
## 6 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-06
## 7 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-07
## 8 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-08
## 9 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-09
## 10 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2010-05-10
## # ℹ 21 more rows
## # ℹ 2 more variables: box_cox <dist>, .mean <dbl>
fc_ARIMA_ATM2_box_cox %>% autoplot(level = FALSE) +
autolayer(ATM2_box_cox)
ATM2_forecasted <- fc_ARIMA_ATM2_box_cox %>%
rename(box_cox_Cash = .mean) %>%
mutate(Cash = inv_box_cox(box_cox_Cash, lambda_guerrero)) %>%
select(DATE,Cash)
ATM2_forecasted %>% autoplot(color = "blue") +
autolayer(ATM2) +
labs(x = "Date", y = "Cash Withdrawal in Hundreds",
title = "Cash Withdrawal Prediction from ATM2 using ARIMA(2,0,2)(0,1,1)")
With the ATM3 dataset, I can see that cash withdrawal is zero up until the last few days of April 2010. Predictions given this limited dataset would not be meaningful, as the sample size is too small. If a prediction had to be made, I would suggest using a simple exponential smoothing forecast, as shown below. It is possible that ATM3 is a new ATM, or data collection has somehow failed up until April 2010. Given the data shown, my overall recommendation would be to disregard the ATM3 dataset unless more data can be acquired. Almost nothing meaningful, besides general scale, can be gleaned from a few datapoints.
ATM3 <- ATM3 %>% as_tsibble(index = DATE)
ATM3 %>% autoplot(Cash)
ATM3 %>% filter(Cash != 0)
## # A tsibble: 3 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 96
## 2 2010-04-29 ATM3 82
## 3 2010-04-30 ATM3 85
fit_SES_ATM3 <- ATM3 %>% model(ETS(Cash))
report(fit_SES_ATM3)
## Series: Cash
## Model: ETS(A,N,N)
## Smoothing parameters:
## alpha = 0.8585628
##
## Initial states:
## l[0]
## -0.1269535
##
## sigma^2: 25.4128
##
## AIC AICc BIC
## 3338.324 3338.391 3350.024
fc_SES_ATM3 <- forecast(fit_SES_ATM3, h = 31)
fc_SES_ATM3 %>% autoplot() + autolayer(ATM3)
ATM3_forecasted <- fc_SES_ATM3 %>%
rename(CashDistribution = Cash, Cash = .mean) %>%
select(DATE,Cash)
With the ATM4 dataset, I can see a strong outlier in the data but no NA values. I can see this outlier occurs on 2010-02-09, and has a value of 10920. Typically, ATMs will have the capacity for that type of volume of withdrawal in a day, as 10920 $100 withdrawals amounts to $1,092,000. This is reason to believe that the outlier is an erroneous value. The ATM4 data does not appear to be seasonal, so it would not make sense to impute based on seasonal decomposition. Instead, I used the mean of the data to impute this value given that the dataset appears homoscedastic and the mean appears stable/independent of time when excluding the outlier.
ATM4 <- ATM4 %>% as_tsibble(index = DATE)
ATM4 %>% autoplot(Cash)
any(is.na(ATM4$Cash))
## [1] FALSE
ATM4 %>% filter(Cash == max(Cash))
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
ATM4_mean_wo_outlier <- ATM4 %>% filter(Cash != max(Cash))
ATM4_mean_wo_outlier <- mean(ATM4_mean_wo_outlier$Cash)
ATM4$Cash[285] <- ATM4_mean_wo_outlier
ATM4 %>% autoplot(Cash)
An ETS(A,N,A) model was selected and used for the forecast. Looking at the residuals, the variance looks relatively constant and the ACF shows autocorrelations all within the bounds of the critical values. However, the distribution does appear slightly skewed. The p-value resulting from the ljung_box test is greater than 0.05, and therefore the model can be considered a good fit.
fit_ETS_ATM4 <- ATM4 %>% model(ETS(Cash))
report(fit_ETS_ATM4)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001072472
## gamma = 0.0001000058
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 452.3827 -277.6182 -30.42428 17.26821 44.47028 89.31875 58.8934 98.0918
##
## sigma^2: 110735
##
## AIC AICc BIC
## 6403.786 6404.408 6442.785
fc_ETS_ATM4 <- forecast(fit_ETS_ATM4, h = 31)
fc_ETS_ATM4 %>% autoplot() + autolayer(ATM4)
gg_tsresiduals(fit_ETS_ATM4)
augment(fit_ETS_ATM4) %>% features(.innov, ljung_box, lag = 14)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(Cash) 18.1 0.203
For ARIMA, I can see the dataset does not require any differencing as it is already stationary. The variance appears constant, and the ACF is not high and strictly positive; however, the PACF shows two multiple spikes throughout the plot. The data is confirmed to be stationary through the KPSS test.
ATM4 %>% features(Cash, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
ATM4 %>% features(Cash, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
ATM4 %>% gg_tsdisplay(Cash,plot_type = "partial")
ATM4 %>% features(Cash, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.117 0.1
An ARIMA(1,0,0)(2,0,0)[7] w/ mean model is fit to the data, and this model is used for forecasting. After forecasting, I can see that the variance of the residuals is relatively constant, and the autocorrelations in the ACF plot has no spikes past the critical values. The distribution of the residuals appear potentially right skewed. Performing the ljung-box test, I can see that the p-value is slightly above 0.05, and this model can therefore be considered for further evaluation.
fit_ARIMA_ATM4 <- ATM4 %>%
model(ARIMA(Cash, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM4)
## Series: Cash
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.0812 0.1486 0.1128 301.3721
## s.e. 0.0523 0.0521 0.0526 17.7422
##
## sigma^2 estimated as 118443: log likelihood=-2648.09
## AIC=5306.19 AICc=5306.35 BIC=5325.68
fc_ARIMA_ATM4 <- forecast(fit_ARIMA_ATM4, h = 31)
fc_ARIMA_ATM4 %>% autoplot() + autolayer(ATM4)
gg_tsresiduals(fit_ARIMA_ATM4)
augment(fit_ARIMA_ATM4) %>%
features(.innov, ljung_box, lag = 10, dof = 3)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Cash, stepwise = FALSE, approximation = FALSE) 13.4 0.0622
Again, I also explored a Box-Cox transformation on the data before modeling to see if the prerequisites for stationary data could be better met.
lambda_guerrero <- ATM4 %>%
features(Cash, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
ATM4_box_cox <- ATM4 %>%
mutate(box_cox = box_cox(Cash,lambda_guerrero)) %>%
select(box_cox)
I attempted an exponential smoothing model again first, where an ETS(M,N,A) model was selected. Forecasting with the Box-Cox transformed data, I can see the autocorrelations do not exceed the bounds of the critical values. The residuals appear fairly constant besides one irregular spike where the variance is higher. Additionally, performing the ljung-box test again returns a p-value greater than 0.05.
fit_ETS_ATM4_box_cox <- ATM4_box_cox %>% model(ETS(box_cox))
report(fit_ETS_ATM4_box_cox)
## Series: box_cox
## Model: ETS(M,N,A)
## Smoothing parameters:
## alpha = 0.01547922
## gamma = 0.1782924
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 27.73898 -19.73406 -4.699187 -0.02708285 2.157202 6.623173 8.928071 6.751885
##
## sigma^2: 0.223
##
## AIC AICc BIC
## 4036.699 4037.320 4075.698
fc_ETS_ATM4_box_cox <- forecast(fit_ETS_ATM4_box_cox, h = 31)
fc_ETS_ATM4_box_cox %>% autoplot() + autolayer(ATM4_box_cox)
gg_tsresiduals(fit_ETS_ATM4_box_cox)
augment(fit_ETS_ATM4_box_cox) %>%
features(.innov, ljung_box, lag = 14)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(box_cox) 14.9 0.385
Looking at the ACF and PACF plots, I can see the Box-Cox transformed dataset has more constant variance after the transformation. However, a greater number of lags exhibit spikes past the critical values in the PACF. The data is again shown to be stationary through the KPSS test.
ATM4_box_cox %>% features(box_cox, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
ATM4_box_cox %>% features(box_cox, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
ATM4_box_cox %>% gg_tsdisplay(box_cox,plot_type = "partial")
ATM4_box_cox %>% features(box_cox, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0771 0.1
An ARIMA(1,0,0)(2,0,0)[7] w/ mean model was selected for the Box-Cox transformed data The variance of residuals is relatively constant, and the ACF shows one spike beyond the critical values. The distribution of residuals appears roughly normal. The p-value from the ljung-box test is again over 0.05, so this model can be considered a potential model for selection and evaluation.
fit_ARIMA_ATM4_box_cox <- ATM4_box_cox %>%
model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ATM4_box_cox)
## Series: box_cox
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.0766 0.2092 0.1990 15.9056
## s.e. 0.0526 0.0516 0.0525 0.6921
##
## sigma^2 estimated as 183.7: log likelihood=-1467.91
## AIC=2945.82 AICc=2945.98 BIC=2965.32
fc_ARIMA_ATM4_box_cox <- forecast(fit_ARIMA_ATM4_box_cox, h = 31)
fc_ARIMA_ATM4_box_cox %>% autoplot() + autolayer(ATM4_box_cox)
gg_tsresiduals(fit_ARIMA_ATM4_box_cox)
augment(fit_ARIMA_ATM4_box_cox) %>%
features(.innov, ljung_box, lag = 14, dof = 3)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 15.8 0.150
All four models assessed, ETS(A,N,A) and ARIMA(1,0,0)(2,0,0)[7] w/ mean for the non-transformed data, and ETS(M,N,A) and ARIMA(1,0,0)(2,0,0)[7] w/ mean for the transformed data, will be considered for evaluation.
ATM4_train <- ATM4 %>% filter(DATE < "2010-04-01")
ATM4_box_cox_train <- ATM4_box_cox %>% filter(DATE < "2010-04-01")
ETS(A,N,A) on the non-transformed data:
fit_ETS_ATM4_train <- ATM4_train %>%
model(ETS(Cash~ error("A")+trend("N")+season("A")))
report(fit_ETS_ATM4_train)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000658
## gamma = 0.0001188245
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 459.5541 -301.4226 -41.96698 39.36969 33.5328 90.72653 54.21942 125.5412
##
## sigma^2: 112403.6
##
## AIC AICc BIC
## 5854.611 5855.290 5892.752
ARIMA(1,0,0)(2,0,0)[7] w/ mean on the non-transformed data:
fit_ARIMA_ATM4_train <- ATM4_train %>%
model(ARIMA(Cash ~ pdq(1,0,0) + PDQ(2,0,0)))
report(fit_ARIMA_ATM4_train)
## Series: Cash
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.0817 0.1461 0.0977 312.5267
## s.e. 0.0546 0.0545 0.0550 18.8407
##
## sigma^2 estimated as 122684: log likelihood=-2436.15
## AIC=4882.31 AICc=4882.49 BIC=4901.38
ETS(M,N,A) on the transformed data:
fit_ETS_ATM4_box_cox_train <- ATM4_box_cox_train %>%
model(ETS(box_cox~ error("M")+trend("N")+season("A")))
report(fit_ETS_ATM4_box_cox_train)
## Series: box_cox
## Model: ETS(M,N,A)
## Smoothing parameters:
## alpha = 0.01370986
## gamma = 0.161741
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 25.80506 -17.85373 -4.477105 1.189807 5.956007 5.026289 4.475229 5.683503
##
## sigma^2: 0.2302
##
## AIC AICc BIC
## 3683.053 3683.732 3721.194
ARIMA(1,0,0)(2,0,0)[7] w/ mean on the transformed data:
fit_ARIMA_ATM4_box_cox_train <- ATM4_box_cox_train %>%
model(ARIMA(box_cox ~ pdq(1,0,0) + PDQ(2,0,0)))
report(fit_ARIMA_ATM4_box_cox_train)
## Series: box_cox
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.0791 0.2020 0.1813 16.6784
## s.e. 0.0548 0.0538 0.0548 0.7302
##
## sigma^2 estimated as 187.8: log likelihood=-1350.68
## AIC=2711.36 AICc=2711.54 BIC=2730.43
From the comparison below, we can see that the ARIMA(1,0,0)(2,0,0) model on non-transformed data had a lower RMSSE. Looking back at the residuals plots for the two lowest RMSSE models (ARIMA(1,0,0)(2,0,0) model on non-transformed data and ETS(M,N,A) on the transformed data), the level of variance appears visually similar. Neither have spikes past the critical values in the ACF plots, and the distribution of residuals for the ETS(M,N,A) model appears slightly more normal. However, since the RMSSE is lower by more than a couple percent, the ARIMA(1,0,0)(2,0,0) model on non-transformed data was selected.
bind_rows(
fit_ETS_ATM4_train %>% forecast(h = 30) %>%
accuracy(ATM4) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ATM4_train %>% forecast(h = 30) %>%
accuracy(ATM4) %>% select(.model, MASE,RMSSE),
fit_ETS_ATM4_box_cox_train %>% forecast(h = 30) %>%
accuracy(ATM4_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ATM4_box_cox_train %>% forecast(h = 30) %>%
accuracy(ATM4_box_cox) %>% select(.model, MASE,RMSSE)
)
## # A tibble: 4 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 "ETS(Cash ~ error(\"A\") + trend(\"N\") + season(\"A\"))" 0.742 0.672
## 2 "ARIMA(Cash ~ pdq(1, 0, 0) + PDQ(2, 0, 0))" 0.664 0.592
## 3 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"A\"))" 0.611 0.613
## 4 "ARIMA(box_cox ~ pdq(1, 0, 0) + PDQ(2, 0, 0))" 0.707 0.692
gg_tsresiduals(fit_ARIMA_ATM4)
gg_tsresiduals(fit_ETS_ATM4_box_cox)
Since the non-transformed data had the best fit model, the data forecasted as-is can be used to predict values of cash withdrawals from ATM4 for the month of May 2010.
ATM4_forecasted <- fc_ARIMA_ATM4 %>%
rename(CashDistribution = Cash, Cash = .mean) %>%
select(DATE,Cash)
ATM4_forecasted %>% autoplot(color = "blue") + autolayer(ATM4) +
labs(x = "Date", y = "Cash Withdrawal in Hundreds",
title = "Cash Withdrawal Prediction from ATM4 using ARIMA(1,0,0)(2,0,0)[7] w/ mean)")
The final model selection for the ATM1, ATM2, and ATM4 forecasts, as well as their corresponding MASE and RMSSE resulting from the training and test set exercise, are listed below. As previously discussed, ATM3 only has three datapoints and a forecast based on this is not recommended. However, a SES forecast was performed, with the values provided in the attached Excel file along with all other forecasts.
bind_rows(
fit_ARIMA_ATM1_box_cox_train %>% forecast(h = 30) %>%
accuracy(ATM1_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ATM2_box_cox_train %>% forecast(h = 30) %>%
accuracy(ATM2_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ATM4_train %>% forecast(h = 30) %>%
accuracy(ATM4) %>% select(.model, MASE,RMSSE)
)
## # A tibble: 3 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox ~ pdq(0, 0, 2) + PDQ(0, 1, 1)) 0.526 0.418
## 2 ARIMA(box_cox ~ pdq(2, 0, 2) + PDQ(0, 1, 1)) 0.733 0.681
## 3 ARIMA(Cash ~ pdq(1, 0, 0) + PDQ(2, 0, 0)) 0.664 0.592
report(fit_ARIMA_ATM1_box_cox)
## Series: box_cox
## Model: ARIMA(0,0,2)(0,1,1)[7]
##
## Coefficients:
## ma1 ma2 sma1
## 0.1126 -0.1094 -0.6418
## s.e. 0.0524 0.0520 0.0432
##
## sigma^2 estimated as 1.764: log likelihood=-609.99
## AIC=1227.98 AICc=1228.09 BIC=1243.5
report(fit_ARIMA_ATM2_box_cox)
## Series: box_cox
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4256 -0.9051 0.4719 0.7967 -0.7271
## s.e. 0.0588 0.0444 0.0888 0.0590 0.0406
##
## sigma^2 estimated as 68.16: log likelihood=-1263.33
## AIC=2538.66 AICc=2538.9 BIC=2561.94
report(fit_SES_ATM3)
## Series: Cash
## Model: ETS(A,N,N)
## Smoothing parameters:
## alpha = 0.8585628
##
## Initial states:
## l[0]
## -0.1269535
##
## sigma^2: 25.4128
##
## AIC AICc BIC
## 3338.324 3338.391 3350.024
report(fit_ARIMA_ATM4)
## Series: Cash
## Model: ARIMA(1,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.0812 0.1486 0.1128 301.3721
## s.e. 0.0523 0.0521 0.0526 17.7422
##
## sigma^2 estimated as 118443: log likelihood=-2648.09
## AIC=5306.19 AICc=5306.35 BIC=5325.68
Data_624_Koon_Project_1_excel <- createWorkbook()
addWorksheet(Data_624_Koon_Project_1_excel, "ATM1")
writeData(Data_624_Koon_Project_1_excel, "ATM1", ATM1_forecasted)
addWorksheet(Data_624_Koon_Project_1_excel, "ATM2")
writeData(Data_624_Koon_Project_1_excel, "ATM2", ATM2_forecasted)
addWorksheet(Data_624_Koon_Project_1_excel, "ATM3")
writeData(Data_624_Koon_Project_1_excel, "ATM3", ATM3_forecasted)
addWorksheet(Data_624_Koon_Project_1_excel, "ATM4")
writeData(Data_624_Koon_Project_1_excel, "ATM4", ATM4_forecasted)
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
After reading in the data, I can see the YearMonth has imported as a character I converted this date back to YearMonth format in R.
ResCust <- read_excel("C:\\Users\\Kim\\Documents\\Data624\\ResidentialCustomerForecastLoad-624.xlsx") # read in the file
ResCust <- ResCust %>%
rename(YearMonth= `YYYY-MMM`) %>%
mutate(YearMonth = yearmonth(YearMonth)) %>%
as_tsibble(index = YearMonth) %>%
select(KWH)
Taking a look at the data, I can see there are missing values and one potential outlier. When I use gg_season(), I can see that the outlier does not follow the seasonal monthly trend in the data. Given that the data is shown to be seasonal based on a 12 month period, I believe seasonal decomposition would be an appropriate imputation method for both the NA value and outlier. Find_frequency is set to true to estimate the frequency of the time-series automatically.
ResCust %>% autoplot()
ResCust %>% filter(KWH == min(KWH, na.rm = TRUE))
## # A tsibble: 1 x 2 [1M]
## KWH YearMonth
## <dbl> <mth>
## 1 770523 2010 Jul
ResCust %>% gg_season()
ResCust %>% filter(KWH == min(KWH, na.rm = TRUE))
## # A tsibble: 1 x 2 [1M]
## KWH YearMonth
## <dbl> <mth>
## 1 770523 2010 Jul
ResCust$KWH[151] <- NA
ResCust$KWH <- na_seadec(ResCust$KWH, algorithm = "interpolation", find_frequency = TRUE)
ResCust %>% autoplot()
ResCust %>% gg_season()
any(is.na(ResCust$KWH))
## [1] FALSE
An ETS(M,N,M) model was fitted to the data. When looking at the residuals, the variance appears homoscedastic, though there are two spikes above critical values shown in the ACF. The distribution of residuals appears normal. When performing the ljung_box test, the p-value is shown to be above 0.05 and this model can be considered as a potential fit for further analysis.
fit_ETS_ResCust <- ResCust %>% model(ETS(KWH))
report(fit_ETS_ResCust)
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1284459
## gamma = 0.2213289
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6189952 0.8933526 0.7641105 0.9346016 1.252045 1.287414 1.252513 1.006389
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.7643922 0.8073412 0.8729015 0.9913738 1.173565
##
## sigma^2: 0.0093
##
## AIC AICc BIC
## 6142.152 6144.879 6191.014
fc_ETS_ResCust <- forecast(fit_ETS_ResCust, h = 1)
fc_ETS_ResCust %>% autoplot() + autolayer(ResCust)
gg_tsresiduals(fit_ETS_ResCust)
augment(fit_ETS_ResCust) %>%
features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(KWH) 32.8 0.109
For ARIMA, I can see the dataset requires one order of seasonal differencing. No non-seasonal differencing is required. The variance of the differences is relatively constant and the PACF show show a few spikes throughout the plot. Looking at the lags at multiples of seasonality, even the 24th lag is above the critical value. However, the KPSS test shows the data is stationary after the seasonal differencing.
ResCust %>% features(KWH, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
ResCust_nsdiff <- ResCust %>%
mutate(KWH_nsdiff = difference(KWH, 12))
ResCust_nsdiff %>%
features(KWH_nsdiff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
ResCust_nsdiff %>%
gg_tsdisplay(KWH_nsdiff,plot_type = "partial", lag = 36)
ResCust_nsdiff %>%
features(KWH_nsdiff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.168 0.1
An ARIMA(0,0,3)(2,1,0)[12] w/ drift model is fit, and this model is used for forecasting. After forecasting, I can see that the variance of the residuals not constant, with almost no variance at the beginning of the time series. However, the autocorrelations in the ACF plot are all within the bounds of the critical values and the distribution of residuals appears normal. Performing the ljung-box test, I can see that the p-value is above 0.05 and can also be considered as a potential fit for further analysis.
fit_ARIMA_ResCust <- ResCust %>%
model(ARIMA(KWH, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ResCust)
## Series: KWH
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 constant
## 0.3456 0.0469 0.2383 -0.7019 -0.4150 229135.00
## s.e. 0.0789 0.0886 0.0749 0.0771 0.0789 77965.14
##
## sigma^2 estimated as 3.869e+11: log likelihood=-2657.83
## AIC=5329.66 AICc=5330.31 BIC=5352.01
fc_ARIMA_ResCust <- forecast(fit_ARIMA_ResCust, h = 1)
fc_ARIMA_ResCust %>% autoplot() + autolayer(ResCust)
gg_tsresiduals(fit_ARIMA_ResCust)
augment(fit_ARIMA_ResCust) %>%
features(.innov, ljung_box, lag = 24, dof = 5)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(KWH, stepwise = FALSE, approximation = FALSE) 14.9 0.732
Again, I also explored a Box-Cox transformation on the data before modeling to see if the prerequisites for stationary data could be better met.
lambda_guerrero <- ResCust %>%
features(KWH, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
ResCust_box_cox <- ResCust %>%
mutate(box_cox = box_cox(KWH,lambda_guerrero)) %>%
select(box_cox)
ResCust_box_cox %>% autoplot()
I attempted an exponential smoothing model again first, where an ETS(M,N,M) model was selected again. Forecasting with the Box-Cox transformed data, I can see in the autocorrelations still show spikes beyond the bounds of the critical values. The variance of the residuals is relatively constant and the distribution of residuals is roughly normal. Performing the ljung-box test still returns a p-value greater than 0.05, indicating this model can also be evaluated as a potential model.
fit_ETS_ResCust_box_cox <- ResCust_box_cox %>% model(ETS(box_cox))
report(fit_ETS_ResCust_box_cox)
## Series: box_cox
## Model: ETS(A,A,A)
## Smoothing parameters:
## alpha = 0.09726989
## beta = 0.0001007045
## gamma = 0.0001195169
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3]
## 4.439695 4.230818e-05 -0.001682906 -0.009258518 -0.004276473 0.005169698
## s[-4] s[-5] s[-6] s[-7] s[-8] s[-9]
## 0.008620482 0.00699773 0.000901787 -0.007986956 -0.006149625 -0.002418816
## s[-10] s[-11]
## 0.002752698 0.007330899
##
## sigma^2: 0
##
## AIC AICc BIC
## -1195.663 -1192.146 -1140.286
fc_ETS_ResCust_box_cox <- forecast(fit_ETS_ResCust_box_cox, h = 1) %>%
rename(box_cox_KWH = .mean)
ResCust_box_cox %>% ggplot() +
geom_line(aes(x = YearMonth, y = box_cox)) +
geom_point(aes(x = fc_ETS_ResCust_box_cox$YearMonth,
y = fc_ETS_ResCust_box_cox$box_cox_KWH),
color = "blue")
gg_tsresiduals(fit_ETS_ResCust_box_cox)
augment(fit_ETS_ResCust_box_cox) %>% features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(box_cox) 48.4 0.00224
Looking at the PACF plot, I can see the Box-Cox transformed dataset has PACF spikes that decrease as the lags increase. Looking at the multiples of seasonality, 12, I can see that the spikes are within bounds at these multiples after lag 12. The KPSS test also shows that the data is stationary after differencing.
ResCust_box_cox %>% features(box_cox, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
ResCust_box_cox_nsdiff <-
ResCust_box_cox %>%
mutate(box_cox_KWH_nsdiff = difference(box_cox, 7))
ResCust_box_cox_nsdiff %>%
features(box_cox_KWH_nsdiff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
ResCust_box_cox_nsdiff %>%
gg_tsdisplay(box_cox_KWH_nsdiff,plot_type = "partial", lag = 36)
ResCust_box_cox_nsdiff %>%
features(box_cox_KWH_nsdiff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0330 0.1
An ARIMA(0,0,3)(2,1,0)[12] w/ drift model was selected for the Box-Cox transformed data. Looking at the variance of the residuals, I can see that there is a similar problem where the variance is almost zero for approximately the first year of datapoints. This means the variance is not homoscedastic. However, there is only one spike in the autocorrelations, and the distribution of residuals appear roughly normal. The p-value from the ljung-box test is again over 0.05, so this model can be considered a potential model for selection and evaluation.
fit_ARIMA_ResCust_box_cox <- ResCust_box_cox %>%
model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_ResCust_box_cox)
## Series: box_cox
## Model: ARIMA(0,0,3)(0,1,1)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 sma1 constant
## 0.3123 0.0899 0.2014 -0.7118 5e-04
## s.e. 0.0763 0.0844 0.0704 0.0640 1e-04
##
## sigma^2 estimated as 1.017e-05: log likelihood=790.27
## AIC=-1568.54 AICc=-1568.06 BIC=-1549.38
fc_ARIMA_ResCust_box_cox <-
forecast(fit_ARIMA_ResCust_box_cox, h = 1) %>%
rename(box_cox_KWH = .mean)
ResCust_box_cox %>% ggplot() +
geom_line(aes(x = YearMonth, y = box_cox)) +
geom_point(aes(x = fc_ARIMA_ResCust_box_cox$YearMonth,
y = fc_ARIMA_ResCust_box_cox$box_cox_KWH),
color = "blue")
gg_tsresiduals(fit_ARIMA_ResCust_box_cox)
augment(fit_ARIMA_ResCust_box_cox) %>% features(.innov, ljung_box, lag = 24, dof = 5)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 23.4 0.221
All four models assessed, ETS(M,N,M) and ARIMA(0,0,3)(2,1,0)[12] w/ drift for the non-transformed data, and ETS(M,N,M) and ARIMA(0,0,3)(2,1,0)[12] w/ drift for the transformed data, will be considered for evaluation.
ResCust_train <- ResCust %>%
filter(YearMonth < yearmonth("2013 Dec"))
ResCust_box_cox_train <- ResCust_box_cox %>%
filter(YearMonth < yearmonth("2013 Dec"))
ETS(M,N,M) on the non-transformed data:
fit_ETS_ResCust_train <- ResCust_train %>%
model(ETS(KWH ~ error("M")+trend("N")+season("M")))
report(fit_ETS_ResCust_train)
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1261529
## gamma = 0.2073059
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6198551 0.8983829 0.7614825 0.9353988 1.233974 1.275716 1.259156 1.003643
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.768538 0.806859 0.8791735 0.9981168 1.179559
##
## sigma^2: 0.0086
##
## AIC AICc BIC
## 6095.270 6098.013 6144.054
ARIMA(0,0,3)(2,1,0)[12] w/ drift on the non-transformed data:
fit_ARIMA_ResCust_train <- ResCust_train %>%
model(ARIMA(KWH ~ pdq(0,0,3) + PDQ(2,1,0)))
report(fit_ARIMA_ResCust_train)
## Series: KWH
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 constant
## 0.3471 0.0478 0.2424 -0.6880 -0.4633 208360.75
## s.e. 0.0748 0.0828 0.0680 0.0735 0.0756 74737.56
##
## sigma^2 estimated as 3.498e+11: log likelihood=-2634.48
## AIC=5282.96 AICc=5283.61 BIC=5305.27
ETS(M,N,M) on the transformed data:
fit_ETS_ResCust_box_cox_train <- ResCust_box_cox_train %>%
model(ETS(box_cox ~ error("M")+trend("N")+season("M")))
report(fit_ETS_ResCust_box_cox_train)
## Series: box_cox
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1051158
## gamma = 0.2003773
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 4.440018 0.9991129 0.9981611 0.9995753 1.001659 1.002001 1.001643 1.00014
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.9981808 0.9986073 0.9992228 1.000239 1.001458
##
## sigma^2: 0
##
## AIC AICc BIC
## -1198.950 -1196.207 -1150.166
ARIMA(0,0,3)(2,1,0)[12] w/ drift on the transformed data:
fit_ARIMA_ResCust_box_cox_train <- ResCust_box_cox_train %>%
model(ARIMA(box_cox ~ pdq(0,0,3) + PDQ(2,1,0)))
report(fit_ARIMA_ResCust_box_cox_train)
## Series: box_cox
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 constant
## 0.2705 0.0714 0.2264 -0.7161 -0.4085 1e-03
## s.e. 0.0739 0.0804 0.0671 0.0725 0.0752 4e-04
##
## sigma^2 estimated as 9.725e-06: log likelihood=791.42
## AIC=-1568.84 AICc=-1568.19 BIC=-1546.53
From the comparison below, we can see that the ARIMA(0,0,3)(2,1,0)[12] w/ drift model on transformed data had a lower RMSSE.
bind_rows(
fit_ETS_ResCust_train %>% forecast(h = 1) %>%
accuracy(ResCust) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ResCust_train %>% forecast(h = 1) %>%
accuracy(ResCust) %>% select(.model, MASE,RMSSE),
fit_ETS_ResCust_box_cox_train %>% forecast(h = 1) %>%
accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ResCust_box_cox_train %>% forecast(h = 1) %>%
accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE)
)
## # A tibble: 4 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 "ETS(KWH ~ error(\"M\") + trend(\"N\") + season(\"M\"))" 4.14 3.21
## 2 "ARIMA(KWH ~ pdq(0, 0, 3) + PDQ(2, 1, 0))" 4.31 3.34
## 3 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"M\"))" 3.18 2.52
## 4 "ARIMA(box_cox ~ pdq(0, 0, 3) + PDQ(2, 1, 0))" 3.10 2.46
Revisiting the residuals plot, we can see that the scale of the residuals is smaller and more constant in the ETS(M,N,M) model on the transformed data. This model also had the second lowest RMSSE. When splitting the training set at an even earlier date to compare a 12 month forecast in the testing set, the RMSSE only differs between the two models by 0.028, with the ARIMA model still having the lower RMSSE. Looking at the ACF plots, since the ETS model has two spikes above the critical values and the ARIMA model has one, the ARIMA model was ultimately chosen.
gg_tsresiduals(fit_ETS_ResCust_box_cox)
gg_tsresiduals(fit_ARIMA_ResCust_box_cox)
ResCust_box_cox_train2 <- ResCust_box_cox %>%
filter(YearMonth <= yearmonth("2012 Dec"))
fit_ETS_ResCust_box_cox_train2 <- ResCust_box_cox_train2 %>%
model(ETS(box_cox ~ error("M")+trend("N")+season("M")))
report(fit_ETS_ResCust_box_cox_train2)
## Series: box_cox
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1121282
## gamma = 0.1778984
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 4.440014 0.9992006 0.9981364 0.9995244 1.001621 1.002023 1.00166 1.000138
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.9981581 0.9985882 0.9992599 1.000246 1.001444
##
## sigma^2: 0
##
## AIC AICc BIC
## -1134.708 -1131.781 -1086.814
fit_ARIMA_ResCust_box_cox_train2 <- ResCust_box_cox_train2 %>%
model(ARIMA(box_cox ~ pdq(0,0,3) + PDQ(2,1,0)))
report(fit_ARIMA_ResCust_box_cox_train2)
## Series: box_cox
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 constant
## 0.2803 0.0780 0.2422 -0.7460 -0.4122 1e-03
## s.e. 0.0755 0.0813 0.0692 0.0752 0.0800 4e-04
##
## sigma^2 estimated as 9.85e-06: log likelihood=742.17
## AIC=-1470.34 AICc=-1469.64 BIC=-1448.47
bind_rows(
fit_ETS_ResCust_box_cox_train2 %>% forecast(h = 12) %>%
accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ResCust_box_cox_train2 %>% forecast(h = 12) %>%
accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE)
)
## # A tibble: 2 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"M\"))" 0.808 0.930
## 2 "ARIMA(box_cox ~ pdq(0, 0, 3) + PDQ(2, 1, 0))" 0.857 0.916
Since the transformed data had the best fit model, the data must be transformed back in order to interpret the predictive value of KWH usage for January 2014.
fc_ARIMA_ResCust_box_cox
## # A tsibble: 1 x 4 [1M]
## # Key: .model [1]
## .model YearMonth
## <chr> <mth>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 2014 Jan
## # ℹ 2 more variables: box_cox <dist>, box_cox_KWH <dbl>
fc_ARIMA_ResCust_box_cox %>% autoplot(level = FALSE) +
autolayer(ResCust_box_cox)
ResCust_forecasted <- fc_ARIMA_ResCust_box_cox %>%
mutate(KWH = inv_box_cox(box_cox_KWH, lambda_guerrero)) %>%
select(YearMonth,KWH)
ResCust %>% autoplot() +
geom_point(aes(x = ResCust_forecasted$YearMonth,
y = ResCust_forecasted$KWH),
color = "blue") +
labs(x = "Date", y = "KWH Consumption",
title = "Predicted KWH Consumption for January 2014 using ARIMA(0,0,3)(2,1,0)[12] w/ drift")
The final model selection for the ResCust forecast, as well as its corresponding MASE and RMSSE resulting from the training and test set exercise, are listed below.
bind_rows(
fit_ARIMA_ResCust_box_cox_train %>% forecast(h = 1) %>%
accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_ResCust_box_cox_train2 %>% forecast(h = 12) %>%
accuracy(ResCust_box_cox) %>% select(.model, MASE,RMSSE)
)
## # A tibble: 2 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox ~ pdq(0, 0, 3) + PDQ(2, 1, 0)) 3.10 2.46
## 2 ARIMA(box_cox ~ pdq(0, 0, 3) + PDQ(2, 1, 0)) 0.857 0.916
report(fit_ARIMA_ResCust_box_cox)
## Series: box_cox
## Model: ARIMA(0,0,3)(0,1,1)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 sma1 constant
## 0.3123 0.0899 0.2014 -0.7118 5e-04
## s.e. 0.0763 0.0844 0.0704 0.0640 1e-04
##
## sigma^2 estimated as 1.017e-05: log likelihood=790.27
## AIC=-1568.54 AICc=-1568.06 BIC=-1549.38
addWorksheet(Data_624_Koon_Project_1_excel, "ResCust")
writeData(Data_624_Koon_Project_1_excel, "ResCust", ResCust_forecasted)
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
After reading in the data, I can see the datetime has imported as a double. I converted this date back to datetime format in R, after combining the two datasets. Then, I mutate the overall dataset to extract Date and hour, and group by these columns. Afterwards, I can summarize the mean for each hour.
Waterflow1 <- read_excel("C:\\Users\\Kim\\Documents\\Data624\\Waterflow_Pipe1.xlsx") # read in the file
Waterflow2 <- read_excel("C:\\Users\\Kim\\Documents\\Data624\\Waterflow_Pipe2.xlsx") # read in the file
waterflow <- rbind(Waterflow1,Waterflow2) %>%
mutate(DateTime = convertToDateTime(`Date Time`)) %>%
select(DateTime,WaterFlow) %>%
mutate(Date = as_date(DateTime), Hour = hour(DateTime)) %>%
group_by(Date,Hour) %>%
summarize(meanWaterFlow = mean(WaterFlow), .groups = "drop") %>%
mutate(DateHour = as.POSIXct(paste(Date, Hour), format = "%Y-%m-%d %H")) %>%
select(DateHour, meanWaterFlow)
I turned the waterflow data into into a tsibble and autoplot it to see the general shape of the data. Using is.na, I can see there are no NA values. However, using fill_gaps() I can see there is an implicit NA value at 2015-11-01 01:00:00. I used gg_season() to see if there were any typical daily trends, and there does not seem to be. I then graphed the surrounding values to view the local shape and trend. Given this, I will impute the single value based on linear interpolation.
waterflow <- waterflow %>%
as_tsibble(index = DateHour) %>%
fill_gaps()
waterflow %>% autoplot(meanWaterFlow)
any(is.na(waterflow$meanWaterFlow))
## [1] TRUE
waterflow %>% filter(is.na(meanWaterFlow))
## # A tsibble: 1 x 2 [1h] <?>
## DateHour meanWaterFlow
## <dttm> <dbl>
## 1 2015-11-01 01:00:00 NA
gg_season(waterflow, period = 24) # no obvious trend by hour
waterflow %>% filter(as_date(DateHour) == '2015-11-01' |
as_date(DateHour) == '2015-10-31') %>%
autoplot()
waterflow$meanWaterFlow <-na_interpolation(waterflow$meanWaterFlow, option = "linear")
waterflow %>% filter(as_date(DateHour) == '2015-11-01' |
as_date(DateHour) == '2015-10-31') %>%
autoplot()
waterflow %>% ggplot(aes(x = meanWaterFlow)) + geom_histogram()
waterflow %>% autoplot()
I can see the dataset requires one order of non-seasonal differencing. No seasonal differencing is required. The variance of the differences is not constant, as the variance is much smaller in the beginning of the plot and then becomes larger as time passes. After differencing, I check the KPSS test again on the differenced data, and can see the data is still not stationary
waterflow %>% features(meanWaterFlow, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
waterflow %>% features(meanWaterFlow, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
waterflow %>% features(meanWaterFlow, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 4.08 0.01
waterflow_ndiff <- waterflow %>%
mutate(meanWaterFlow_ndiff = difference(meanWaterFlow))
waterflow_ndiff %>%
gg_tsdisplay(meanWaterFlow_ndiff,plot_type = "partial")
waterflow_ndiff %>% features(meanWaterFlow, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 4.08 0.01
To stabilize the variance, I perform a Box-Cox transformation on the data. I can see that there are a few values that differ significantly from the rest of the dataset, likely due to having an initially low value before the transformation. On the whole, the variance does appear more constant and aligned with the initial week presented in the dataset.
lambda_guerrero <- waterflow %>%
features(meanWaterFlow, features = guerrero)
lambda_guerrero <- lambda_guerrero$lambda_guerrero
waterflow_box_cox <- waterflow %>%
mutate(box_cox = box_cox(meanWaterFlow,lambda_guerrero)) %>%
select(box_cox)
waterflow_box_cox %>% autoplot()
waterflow_box_cox %>% ggplot(aes(x = box_cox)) + geom_histogram()
After differencing, I perform the KPSS test on the differenced data of the Box-Cox transformed dataset. I can see that the KPSS is 0.1, which means I do not reject the null hypothesis that the data is stationary.
waterflow_box_cox %>% features(box_cox, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
waterflow_box_cox %>% features(box_cox, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
waterflow_box_cox %>% features(box_cox, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 1.91 0.01
waterflow_box_cox_ndiff <- waterflow_box_cox %>% mutate(box_cox_meanWaterFlow_nsdiff = difference(box_cox))
waterflow_box_cox_ndiff %>% gg_tsdisplay(box_cox_meanWaterFlow_nsdiff,plot_type = "partial")
waterflow_box_cox_ndiff %>% features(box_cox_meanWaterFlow_nsdiff, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.00507 0.1
As such, I move on to creating a forecast. A week forward forecast on hourly data means predicting 168 values. I fit an ARIMA(0,1,1) model to the data. Looking at the residuals, I can see that the variance has some spikes but is otherwise fairly constant. There are two autocorrelations in the ACF plot that are above the critical values, which, at 6%, is above the 5% threshold. The distribution of residuals also appears left skewed.
fit_ARIMA_waterflow_box_cox <-
waterflow_box_cox %>%
model(ARIMA(box_cox, stepwise = FALSE, approximation = FALSE))
report(fit_ARIMA_waterflow_box_cox)
## Series: box_cox
## Model: ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.9801
## s.e. 0.0064
##
## sigma^2 estimated as 0.01033: log likelihood=867.3
## AIC=-1730.59 AICc=-1730.58 BIC=-1720.78
fc_ARIMA_waterflow_box_cox <-
forecast(fit_ARIMA_waterflow_box_cox, h = 168)
fc_ARIMA_waterflow_box_cox %>%
autoplot() + autolayer(waterflow_box_cox)
gg_tsresiduals(fit_ARIMA_waterflow_box_cox)
augment(fit_ARIMA_waterflow_box_cox) %>%
features(.innov, ljung_box, lag = 10, dof = 1)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox, stepwise = FALSE, approximation = FALSE) 8.68 0.467
I can fit an ETS(M,N,N) model to the non-transformed dataset as well, since ETS does not require stationary data. When looking at the residuals, it appears homoscedastic although there are 5 autocorrelations shown above critical values in the ACF plot. The distribution of residuals appears normal. When performing the ljung_box test, the p-value is shown to be above 0.05 and this model can be considered as a potential fit for further analysis.
fit_ETS_waterflow <- waterflow %>% model(ETS(meanWaterFlow))
report(fit_ETS_waterflow)
## Series: meanWaterFlow
## Model: ETS(M,N,N)
## Smoothing parameters:
## alpha = 0.03724136
##
## Initial states:
## l[0]
## 20.64244
##
## sigma^2: 0.1577
##
## AIC AICc BIC
## 12179.54 12179.57 12194.27
fc_ETS_waterflow <- forecast(fit_ETS_waterflow, h = 168)
fc_ETS_waterflow %>% autoplot() + autolayer(waterflow)
gg_tsresiduals(fit_ETS_waterflow)
augment(fit_ETS_waterflow) %>%
features(.innov, ljung_box, lag = 10)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(meanWaterFlow) 3.89 0.952
Fitting an ETS(M,N,N) model to the transformed data, I can see that the variance of the residuals is similar to the ARIMA model on the transformed data. Similarly, there are two spikes in the ACF plot and the distribution of residuals is left skewed. When performing the ljung_box test, the p-value is shown to be above 0.05 and this model can be considered as a potential fit for further analysis.
fit_ETS_waterflow_box_cox <-
waterflow_box_cox %>% model(ETS(box_cox))
report(fit_ETS_waterflow_box_cox)
## Series: box_cox
## Model: ETS(M,N,N)
## Smoothing parameters:
## alpha = 0.0207642
##
## Initial states:
## l[0]
## 1.54057
##
## sigma^2: 0.004
##
## AIC AICc BIC
## 2328.562 2328.586 2343.292
fc_ETS_waterflow_box_cox <-
forecast(fit_ETS_waterflow_box_cox, h = 168)
fc_ETS_waterflow_box_cox %>%
autoplot() + autolayer(waterflow_box_cox)
gg_tsresiduals(fit_ETS_waterflow_box_cox)
augment(fit_ETS_waterflow_box_cox) %>%
features(.innov, ljung_box, lag = 10)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS(box_cox) 8.78 0.553
All three models assessed, ETS(M,N,N) for the non-transformed data, and ETS(M,N,N) and ARIMA(0,1,1) for the transformed data, will be considered for evaluation.
waterflow_train <- waterflow %>%
filter(DateHour <= as_datetime ("2015-11-26 16:00:00"))
waterflow_box_cox_train <- waterflow_box_cox %>%
filter(DateHour <= as_datetime ("2015-11-26 16:00:00"))
ETS(M,N,N) on the non-transformed data:
fit_ETS_waterflow_train <- waterflow_train %>%
model(ETS(meanWaterFlow ~ error("M")+trend("N")+season("N")))
report(fit_ETS_waterflow_train)
## Series: meanWaterFlow
## Model: ETS(M,N,N)
## Smoothing parameters:
## alpha = 0.04155465
##
## Initial states:
## l[0]
## 20.83852
##
## sigma^2: 0.1482
##
## AIC AICc BIC
## 9842.374 9842.404 9856.535
ETS(M,N,N) on the transformed data:
fit_ETS_waterflow_box_cox_train <- waterflow_box_cox_train %>%
model(ETS(box_cox ~ error("M")+trend("N")+season("N")))
report(fit_ETS_waterflow_box_cox_train)
## Series: box_cox
## Model: ETS(M,N,N)
## Smoothing parameters:
## alpha = 0.0216675
##
## Initial states:
## l[0]
## 1.541633
##
## sigma^2: 0.0037
##
## AIC AICc BIC
## 1697.246 1697.275 1711.406
ARIMA(0,1,1) on the transformed data:
fit_ARIMA_waterflow_box_cox_train <- waterflow_box_cox_train %>%
model(ARIMA(box_cox ~ pdq(0,1,1)))
report(fit_ARIMA_waterflow_box_cox_train)
## Series: box_cox
## Model: ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.9793
## s.e. 0.0067
##
## sigma^2 estimated as 0.009492: log likelihood=752.17
## AIC=-1500.33 AICc=-1500.32 BIC=-1490.89
As a baseline, I also included a MEAN and NAIVE model, given that the plots of the forecast do not appear complex. Potentially, either of these simple models may provide comparatively good prediction.
fit_MEAN_waterflow_train <- waterflow_train %>%
model(MEAN(meanWaterFlow))
report(fit_MEAN_waterflow_train)
## Series: meanWaterFlow
## Model: MEAN
##
## Mean: 35.3898
## sigma^2: 235.6841
fit_NAIVE_waterflow_train <- waterflow_train %>%
model(NAIVE(meanWaterFlow))
report(fit_NAIVE_waterflow_train)
## Series: meanWaterFlow
## Model: NAIVE
##
## sigma^2: 390.7018
From the comparison below, we can see that the ETS(M,N,N) model on the non-transformed data had a lower RMSSE. However, autocorrelations for the ETS(M,N,N) model on the non-transformed data was seen to exceed the critical values at many points. As such, I would suggest either model from the transformed dataset, as both produced the same predictions and therefore have the same residuals.
bind_rows(
fit_ETS_waterflow_train %>% forecast(h = 168) %>%
accuracy(waterflow) %>% select(.model, MASE,RMSSE),
fit_ETS_waterflow_box_cox_train %>% forecast(h = 168) %>%
accuracy(waterflow_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_waterflow_box_cox_train %>% forecast(h = 168) %>%
accuracy(waterflow_box_cox) %>% select(.model, MASE,RMSSE),
fit_MEAN_waterflow_train %>% forecast(h = 168) %>%
accuracy(waterflow) %>% select(.model, MASE,RMSSE),
fit_NAIVE_waterflow_train %>% forecast(h = 168) %>%
accuracy(waterflow) %>% select(.model, MASE,RMSSE)
)
## # A tibble: 5 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 "ETS(meanWaterFlow ~ error(\"M\") + trend(\"N\") + season(\"N\"))" 0.952 0.887
## 2 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"N\"))" 0.933 0.890
## 3 "ARIMA(box_cox ~ pdq(0, 1, 1))" 0.933 0.890
## 4 "MEAN(meanWaterFlow)" 0.961 0.902
## 5 "NAIVE(meanWaterFlow)" 1.13 1.07
gg_tsresiduals(fit_ARIMA_waterflow_box_cox)
gg_tsresiduals(fit_ETS_waterflow)
gg_tsresiduals(fit_ETS_waterflow_box_cox)
Since the transformed data had the best fit model, the data must be transformed back in order to interpret the one week ahead waterflow forecasts.
fc_ETS_waterflow_box_cox
## # A fable: 168 x 4 [1h] <?>
## # Key: .model [1]
## .model DateHour
## <chr> <dttm>
## 1 ETS(box_cox) 2015-12-03 17:00:00
## 2 ETS(box_cox) 2015-12-03 18:00:00
## 3 ETS(box_cox) 2015-12-03 19:00:00
## 4 ETS(box_cox) 2015-12-03 20:00:00
## 5 ETS(box_cox) 2015-12-03 21:00:00
## 6 ETS(box_cox) 2015-12-03 22:00:00
## 7 ETS(box_cox) 2015-12-03 23:00:00
## 8 ETS(box_cox) 2015-12-04 00:00:00
## 9 ETS(box_cox) 2015-12-04 01:00:00
## 10 ETS(box_cox) 2015-12-04 02:00:00
## # ℹ 158 more rows
## # ℹ 2 more variables: box_cox <dist>, .mean <dbl>
fc_ETS_waterflow_box_cox %>%
autoplot(level = FALSE) + autolayer(waterflow_box_cox)
waterflow_forecasted <-
fc_ETS_waterflow_box_cox %>%
rename(box_cox_waterflow = .mean) %>%
mutate(meanWaterFlow = inv_box_cox(box_cox_waterflow, lambda_guerrero)) %>%
select(DateHour,meanWaterFlow)
waterflow_forecasted %>%
autoplot(color = "blue") + autolayer(waterflow) +
labs(x = "Date", y = "Waterflow",
title = "One week ahead waterflow predictions")
The final model selection for the Waterflow forecasts, as well as its corresponding MASE and RMSSE resulting from the training and test set exercise, are listed below. Only the forecasts for the ETS(M,N,N) of the transformed data is shown in the Excel outputs, since they are the same between the ETS and ARIMA model.
bind_rows(
fit_ETS_waterflow_box_cox_train %>% forecast(h = 168) %>%
accuracy(waterflow_box_cox) %>% select(.model, MASE,RMSSE),
fit_ARIMA_waterflow_box_cox_train %>% forecast(h = 168) %>%
accuracy(waterflow_box_cox) %>% select(.model, MASE,RMSSE)
)
## # A tibble: 2 × 3
## .model MASE RMSSE
## <chr> <dbl> <dbl>
## 1 "ETS(box_cox ~ error(\"M\") + trend(\"N\") + season(\"N\"))" 0.933 0.890
## 2 "ARIMA(box_cox ~ pdq(0, 1, 1))" 0.933 0.890
report(fit_ETS_waterflow_box_cox_train)
## Series: box_cox
## Model: ETS(M,N,N)
## Smoothing parameters:
## alpha = 0.0216675
##
## Initial states:
## l[0]
## 1.541633
##
## sigma^2: 0.0037
##
## AIC AICc BIC
## 1697.246 1697.275 1711.406
report(fit_ARIMA_waterflow_box_cox_train)
## Series: box_cox
## Model: ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.9793
## s.e. 0.0067
##
## sigma^2 estimated as 0.009492: log likelihood=752.17
## AIC=-1500.33 AICc=-1500.32 BIC=-1490.89
addWorksheet(Data_624_Koon_Project_1_excel, "waterflow")
writeData(Data_624_Koon_Project_1_excel, "waterflow", waterflow_forecasted)
saveWorkbook(Data_624_Koon_Project_1_excel,
"C:\\Users\\Kim\\Documents\\Data624\\DATA_624_Project_1.xlsx",
overwrite = TRUE)
fit_MEAN_waterflow <- waterflow %>%
model(MEAN(meanWaterFlow))
report(fit_MEAN_waterflow)
## Series: meanWaterFlow
## Model: MEAN
##
## Mean: 35.9727
## sigma^2: 245.5421
fc_MEAN_waterflow <-
forecast(fit_MEAN_waterflow, h = 168)
fc_MEAN_waterflow %>% autoplot() +
autolayer(waterflow)
gg_tsresiduals(fit_MEAN_waterflow)
fit_NAIVE_waterflow <- waterflow %>%
model(NAIVE(meanWaterFlow))
report(fit_NAIVE_waterflow)
## Series: meanWaterFlow
## Model: NAIVE
##
## sigma^2: 418.263
fc_NAIVE_waterflow <-
forecast(fit_NAIVE_waterflow, h = 168)
fc_NAIVE_waterflow %>% autoplot() +
autolayer(waterflow)
gg_tsresiduals(fit_NAIVE_waterflow)
fit_MEAN_waterflow_box_cox <- waterflow_box_cox %>%
model(MEAN(box_cox))
report(fit_MEAN_waterflow_box_cox)
## Series: box_cox
## Model: MEAN
##
## Mean: 1.5899
## sigma^2: 0.0107
fc_MEAN_waterflow_box_cox <-
forecast(fit_MEAN_waterflow_box_cox, h = 168)
fc_MEAN_waterflow_box_cox %>% autoplot() +
autolayer(waterflow_box_cox)
gg_tsresiduals(fit_MEAN_waterflow_box_cox)
fit_NAIVE_waterflow_box_cox <- waterflow_box_cox %>%
model(NAIVE(box_cox))
report(fit_NAIVE_waterflow_box_cox)
## Series: box_cox
## Model: NAIVE
##
## sigma^2: 0.0204
fc_NAIVE_waterflow_box_cox <-
forecast(fit_NAIVE_waterflow_box_cox, h = 168)
fc_NAIVE_waterflow_box_cox %>% autoplot() +
autolayer(waterflow_box_cox)
gg_tsresiduals(fit_NAIVE_waterflow_box_cox)