Project 1

C. Rosemond 10.25.20

library(tidyverse)
library(ggplot2)
library(fpp2)
library(urca)
library(readxl)

Part A

Data pre-processing

Exploration and initial cleaning

atm <- read_xlsx("ATM624Data.xlsx", sheet=1)
#summary(atm)
atm$DATE <- as.Date.numeric(atm$DATE, origin="1899-12-30")
summary(atm)

##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

This dataset consists of 1474 observations across three variables: DATE, which indicates the date representing an individual set of ATM transactions; ATM, which specifies the ATM that hosted that set of transactions; and Cash, which notes the total amount of money, in hundreds of dollars, withdrawn from that ATM on that date. Observations run from May 1, 2009 through May 14, 2010.

atm[is.na(atm$Cash),]

## # A tibble: 19 x 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-06-13 ATM1     NA
##  2 2009-06-16 ATM1     NA
##  3 2009-06-18 ATM2     NA
##  4 2009-06-22 ATM1     NA
##  5 2009-06-24 ATM2     NA
##  6 2010-05-01 <NA>     NA
##  7 2010-05-02 <NA>     NA
##  8 2010-05-03 <NA>     NA
##  9 2010-05-04 <NA>     NA
## 10 2010-05-05 <NA>     NA
## 11 2010-05-06 <NA>     NA
## 12 2010-05-07 <NA>     NA
## 13 2010-05-08 <NA>     NA
## 14 2010-05-09 <NA>     NA
## 15 2010-05-10 <NA>     NA
## 16 2010-05-11 <NA>     NA
## 17 2010-05-12 <NA>     NA
## 18 2010-05-13 <NA>     NA
## 19 2010-05-14 <NA>     NA

Nineteen observations have missing values: five observations missing values for Cash for ATMs 1 or 2 during June 2009; and fourteen observations, running from May 1 through May 14, 2010, with missing values for both ATM and Cash. The first five observations will either have to be removed or imputed with new values prior to modeling, and the latter fourteen observations will ultimately be forecasted and thus dropped beforehand.

atm_clean <- atm %>% drop_na(ATM)
atm_clean %>%
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line() +
  facet_grid(ATM ~ ., scales = "free_y") +
  ggtitle("Cash withdrawn from four ATMs (5/1/09 - 4/30/10)") +
  ylab("Cash (Hundreds of $)") +
  xlab("Date")

The four ATMs vary substantially in pattern, level, and amplitude of cash withdrawals over the course of the series. ATMs 1 and 2 both display clear, possibly weekly seasonality in cash withdrawn, with variation between roughly zero and fifteen thousand dollars. ATM 3 flatlines at zero dollars withdrawn until the last few days of the series, when it jumps to nearly ten thousand dollars withdrawn, which suggests that the machine was offline until that point. By contrast, ATM 4 shows much higher levels of withdrawals--varying between roughly zero and 150 thousand dollars--throughout along with a massive spike to over one million dollars around February 2010.

Missing Cash values must be addressed prior to modeling. All five observations with these values occurred during the same month, suggesting at least one relatively common observed variable (DATE), so the data are at best missing at random. Additionally, since only three variables are available in the data, omitted variables like the geographic location of the ATM could relate, suggesting missingness not at random. This suggestion is speculative, however, so imputation moves forward.

atm1 <- atm_clean %>% filter(ATM == "ATM1") %>% select(-c(DATE,ATM)) %>% ts(frequency=7) %>% tsclean(replace.missing=TRUE)
atm2 <- atm_clean %>% filter(ATM == "ATM2") %>% select(-c(DATE,ATM)) %>% ts(frequency=7) %>% tsclean(replace.missing=TRUE)
sum(is.na(atm1) | is.na(atm2))

## [1] 0

Values are imputed through linear interpolation. In response to seasonality, the data are first decomposed using robust STL, linear interpolation is applied to the seasonally adjusted data, and then the seasonal component is re-added.

atm %>% filter(ATM == "ATM3" & Cash != 0)

## # A tibble: 3 x 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-28 ATM3     96
## 2 2010-04-29 ATM3     82
## 3 2010-04-30 ATM3     85

ATM 3 had only three days of withdrawals during the series period: the final three days, or April 28 through 30, 2010. The Cash values for these days are 96, 82, and 85, respectively. The small amount of data greatly limits the ability to forecast values for this ATM.

atm_clean %>% filter(DATE <= "2010-04-30" & DATE >= "2010-04-28" & ATM != "ATM4") %>% arrange(DATE,ATM)

## # A tibble: 9 x 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-28 ATM1     96
## 2 2010-04-28 ATM2    107
## 3 2010-04-28 ATM3     96
## 4 2010-04-29 ATM1     82
## 5 2010-04-29 ATM2     89
## 6 2010-04-29 ATM3     82
## 7 2010-04-30 ATM1     85
## 8 2010-04-30 ATM2     90
## 9 2010-04-30 ATM3     85

atm3 <- atm_clean %>% filter(ATM == "ATM3" & Cash != 0) %>% select(-c(DATE,ATM)) %>% ts()

The observed values ATM 3 do, however, appear relatively similar in level and amplitude to the values from the same days for ATMs 1 and 2. A look at values across ATMs shows that the three values are identical for both ATMs 1 and 3. Given that ATM 3 has very little observed data to inform modeling, substantial assumptions must be made in order to forecast. The identical values for ATMs 1 and 3 suggest that the same model may be appropriate for both ATM-specific series. This assumption is quite large, so a simple forecast method will be used instead.

atm4 <- atm_clean %>% filter(ATM == "ATM4") %>% select(-c(DATE,ATM)) %>% ts() %>% tsclean()
autoplot(atm4, main="Time plot of ATM 4 (May 1, 2009 - April 30, 2010)") +
  ylab("Cash withdrawn (Hundreds of $)")

Also of note is the spike in withdrawals from ATM 4. It appears to be an outlier relative to all other withdrawals, from any ATM, during the series. The $forecast$ package offers functionality to identify and address outliers through the same process of decomposition and linear interpolation used for missing data imputation. A time plot following interpolation suggests that the spike was indeed an outlier. Replacing that value results in a series that is more consistent in level over time.

Transformation and differencing

Next, setting aside ATM 3, each of the series undergoes a Box-Cox transformation followed by differencing, as necessary. These steps will better prepare the series for eventual modeling.

ATM 1

atm1 %>% ggtsdisplay(main="ATM 1, pre-transformation and differencing")

BoxCox.lambda(atm1)

## [1] 0.1917299

A time plot and pair of autocorrelation plots show that ATM 1 features little to no trend but clear weekly seasonality (~ 7 days). Variance appears relatively consistent throughout, though a Box-Cox transformation is used to confirm. The suggested Box-Cox lambda is approximately 0.19, or roughly 0.20. This and other rough lambda values are used to ease interpretation.

atm1_bc <- atm1 %>% BoxCox(lambda=0.2)
nsdiffs(atm1_bc)

## [1] 1

ndiffs(diff(atm1_bc, lag=7))

## [1] 0

The series needs one seasonal difference. A second, subsequent difference is not suggested.

atm1_bc_d <- atm1_bc %>% diff(lag=7)
atm1_bc_d %>% ggtsdisplay(main="ATM 1, post-Box-Cox and seasonal differencing")

atm1_bc_d %>% ur.kpss() %>% summary()

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0377 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The Box-Cox transformation ($\lambda$ = 0.2) and a seasonal difference (lag = 7) result in a statistically significant stationary series. The unit root test statistic of approximately 0.04 is well within the relevant critical values ($\alpha$ = 0.05) for statistical significance.

ATM 2

atm2 %>% ggtsdisplay(main="ATM 2, pre-transformation and differencing")

BoxCox.lambda(atm2)

## [1] 0.4555632

A time plot and pair of autocorrelation plots reflect that ATM 2 is similar to ATM: its series shows little to no trend but clear weekly seasonality. Variance decreases slightly over the course of the series. The suggested Box-Cox lambda is approximately 0.46, or roughly 0.50--the square-root transformation. The seasonality requires at least one difference.

atm2_bc <- atm2 %>% BoxCox(lambda=0.5)
nsdiffs(atm2_bc)

## [1] 1

ndiffs(diff(atm2_bc, lag=7))

## [1] 0

As with ATM 1, a single, seasonal difference is suggested.

atm2_bc_d <- atm2_bc %>% diff(lag=7) 
atm2_bc_d %>% ggtsdisplay(main="ATM 2, post-Box-Cox and seasonal differencing")

atm2_bc_d %>% ur.kpss() %>% summary()

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0197 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The Box-Cox transformation ($\lambda$ = 0.5) and a seasonal difference (lag = 7) result in a statistically significant stationary series. The unit root test statistic of approximately 0.02 is well within the relevant critical values ($\alpha$ = 0.05) for statistical significance.

ATM 4

#atm4 %>% ggtsdisplay(main="ATM 4, pre-transformation and differencing")
atm4 <- atm %>% filter(ATM == "ATM4") %>% select(-c(DATE,ATM)) %>% ts(frequency=7) %>% tsclean()
atm4 %>% ggtsdisplay(main="ATM 4, pre-transformation and differencing")

BoxCox.lambda(atm4)

## [1] 0.4492823

First, like ATMs 1 and 2, ATM 4 shows weekly seasonality, which was not immediately apparent prior to cleaning. It also appears near-stationary already. The suggested Box-Cox lambda is approximately 0.45, or roughly the square-root transformation. The seasonality will require at least one difference.

atm4_bc <- atm4 %>% BoxCox(lambda=0.5)
nsdiffs(atm4_bc)

## [1] 0

ndiffs(diff(atm4_bc, lag=7))

## [1] 0

ggtsdisplay(atm4_bc)

Despite the seasonality, no difference is suggested. However, after transformation, autocorrelation plots continue to show statistically significant correlations ($\alpha$ = 0.05) on lags 7, 14, and 21. In response, a single, seasonal difference is used.

atm4_bc_d <- atm4_bc %>% diff(lag=7)
atm4_bc_d %>% ggtsdisplay(main="ATM 4, post-Box-Cox and seasonal differencing")

atm4_bc_d %>% ur.kpss() %>% summary()

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0186 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Modeling and evaluation

For ATMs 1, 2, and 4, seasonal ARIMA models are fitted then used for forecasting. Training sets run from week 1 (May 2009 through week 45 (roughly March 2010), leaving the remaining weeks for the respective test sets. For ATM 3, a simple average forecasting method is used.

ATM 1

atm1_train <- window(atm1_bc, end=45)
atm1_bc_d %>% window(end=45) %>% ggtsdisplay(main="ATM 1, post-Box-Cox and seasonal differencing, training")

Based on the autocorrelation plots, the large statistically significant correlation ($\alpha$ = 0.05) at ACF lag 1 suggests a non-seasonal MA(1) component, and the significant correlation at ACF lag 7 suggests a seasonal MA(1) component. With the PACF plot, there is a spike at lag 1, suggesting a non-seasonal AR(1) component, and two large, significant correlations at lags 7 and 14, which suggest a seasonal AR(2) component. Lastly, there is the single seasonal difference, which results in an ARIMA(1,0,1)(2,1,1)[7] model.

atm1_train %>% Arima(order=c(1,0,1), seasonal=c(2,1,1)) %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(2,1,1)[7]
## Q* = 5.2582, df = 9, p-value = 0.8112
## 
## Model df: 5.   Total lags used: 14

After fitting, the ARIMA(1,0,1)(2,1,2)[7] model returns residuals that are within the bounds for statistical significance ($\alpha$ = 0.05) and thus equivalent to white noise, and its Ljung-Box test statistic carries a sufficiently large p-value of approximately 0.81 on 9 degrees of freedom. This model appears sufficient for forecasting.

(atm1_arima1 <- atm1_train %>% Arima(order=c(1,0,1), seasonal=c(2,1,2)))

## Series: . 
## ARIMA(1,0,1)(2,1,2)[7] 
## 
## Coefficients:
##          ar1      ma1    sar1     sar2     sma1    sma2
##       0.7216  -0.3931  0.0629  -0.3746  -0.7199  0.3413
## s.e.  0.1312   0.1812  0.1656   0.1009   0.1645  0.1342
## 
## sigma^2 estimated as 0.2019:  log likelihood=-186.36
## AIC=386.71   AICc=387.09   BIC=412.68

(atm1_arima2 <- atm1_train %>% auto.arima(seasonal=TRUE, approximation=FALSE, stepwise=FALSE))

## Series: . 
## ARIMA(1,0,0)(2,1,2)[7] 
## 
## Coefficients:
##          ar1    sar1     sar2     sma1    sma2
##       0.4101  0.0367  -0.3931  -0.6761  0.3356
## s.e.  0.0544  0.1609   0.1014   0.1590  0.1345
## 
## sigma^2 estimated as 0.204:  log likelihood=-188.43
## AIC=388.85   AICc=389.14   BIC=411.11

Use of auto.arima()--with parameters seasonal as TRUE and approximation and stepwise as FALSE--results in an ARIMA(1,0,0)(2,1,2)[7] model that is identical to the manual one but for the non-seasonal MA component. Selecting a non-seasonal MA(0) component is unclear per the ACF plot of the training data, but regardless, the manual model compares favorably to the auto model based upon Akaike's Information Criterion (AIC) (386.71 versus 388.85) and the AIC corrected for small sample bias (AIC$\ _c$) (387.09 versus 389.14), but not the Bayesian Information Criterion (BIC) (412.68 versus 411.11).

atm1_test <- atm1_bc %>% window(start=46)
atm1_arima1_fc <- atm1_arima1 %>% forecast(h = 50)
atm1_arima2_fc <- atm1_arima2 %>% forecast(h = 50)
accuracy(atm1_arima1_fc, atm1_test)

##                       ME     RMSE       MAE         MPE      MAPE      MASE        ACF1 Theil's U
## Training set  0.01581305 0.439767 0.3260096  -0.2064724  5.612292 0.7865289  0.02851785        NA
## Test set     -0.80858765 2.162254 1.0995383 -53.8869025 57.860427 2.6527400 -0.10337171 0.2112206

accuracy(atm1_arima2_fc, atm1_test)

##                       ME      RMSE       MAE         MPE      MAPE      MASE        ACF1 Theil's U
## Training set  0.01831306 0.4428305 0.3331738  -0.1752014  5.707503 0.8038133 -0.03243407        NA
## Test set     -0.76278747 2.1518665 1.0728958 -53.1662015 57.385691 2.5884625 -0.09613665 0.2107459

Next, both models are evaluated on the test set. This evaluation entails a forecast of the 50 values in that test set and then a comparison with the actual values. Relative to the manual model, the auto model displays lower absolute errors across criteria, though the values are quite close. The auto model, then, will support the eventual forecast.

ATM 2

atm2_train <- window(atm2_bc, end=45)
atm2_bc_d %>% window(end=45) %>% ggtsdisplay(main="ATM 2, post-Box-Cox and seasonal differencing, training")

Based on the autocorrelation plots, the large statistically significant correlation ($\alpha$ = 0.05) at ACF lag 1 suggests a non-seasonal MA(1) component, and the significant correlation at ACF lag 7 suggests a seasonal MA(1) component. With the PACF plot, there is a spike at lag 1, suggesting a non-seasonal AR(1) component, and a large, significant correlations at lag 7, which suggests a seasonal AR(1) component. Lastly, there is the single seasonal difference, which results in an ARIMA(1,0,1)(1,1,1)[7] model.

atm2_train %>% Arima(order=c(1,0,1), seasonal=c(1,1,1)) %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(1,1,1)[7]
## Q* = 35.613, df = 10, p-value = 9.807e-05
## 
## Model df: 4.   Total lags used: 14

Checking the residuals of the initial ARIMA(1,0,1)(1,1,1)[7] model reveals an ACF plot that continues to show a large significant correlation on lag 5, suggesting a non-seasonal MA(5) component is necessary. In response, an ARIMA(1,0,5)(1,1,1)[7] model is fitted to the training data.

atm2_train %>% Arima(order=c(1,0,5), seasonal=c(1,1,1)) %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,5)(1,1,1)[7]
## Q* = 11.484, df = 6, p-value = 0.07453
## 
## Model df: 8.   Total lags used: 14

The ARIMA(1,0,5)(1,1,1)[7] model passes residual checks, including a Ljung-Box test result that is not statistically significant ($\alpha$ = 0.05) with a p-value of approximately 0.07. This statistic is not ideal, but as Hyndman and Athanasopoulos note, it is normal and practical to move forward with a sufficient model.

atm2_train %>% auto.arima(seasonal=TRUE, approximation=FALSE, stepwise=FALSE) %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 25.151, df = 9, p-value = 0.002808
## 
## Model df: 5.   Total lags used: 14

As comparison, the auto model is an ARIMA(2,0,2)(0,1,1)[7]. Unlike the manual model, it does not pass initial residual checks. Its Ljung-Box test statistic is statistically significant ($\alpha$ = 0.05) at p-value of approximately 0.003, and there is also something of a sinusoidal pattern in its autocorrelations. The combination suggests that the residuals are not white noise

(atm2_arima1 <- atm2_train %>% Arima(order=c(1,0,5), seasonal=c(1,1,1)))

## Series: . 
## ARIMA(1,0,5)(1,1,1)[7] 
## 
## Coefficients:
##          ar1      ma1      ma2     ma3     ma4      ma5    sar1     sma1
##       0.6625  -0.4768  -0.2441  0.1497  0.0220  -0.2980  0.1949  -0.6057
## s.e.  0.0976   0.1047   0.0643  0.0650  0.0561   0.0639  0.1358   0.1224
## 
## sigma^2 estimated as 7.275:  log likelihood=-725.79
## AIC=1469.58   AICc=1470.19   BIC=1502.97

(atm2_arima2 <- atm2_train %>% auto.arima(seasonal=TRUE, approximation=FALSE, stepwise=FALSE))

## Series: . 
## ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4744  -0.9109  0.6176  0.8079  -0.6754
## s.e.   0.0623   0.0648  0.0778  0.0785   0.0573
## 
## sigma^2 estimated as 7.434:  log likelihood=-730.65
## AIC=1473.31   AICc=1473.59   BIC=1495.57

As with its ATM 1 auto counterpart, the auto ARIMA(2,0,2)(0,1,1)[7] model for ATM 2 shows slightly worse AIC (1473.31 versus 1469.58) and AIC$\ _c$ (1473.59 versus 1470.19) values than does the manual model. However, it features a slightly lower BIC value (1495.57 versus 1502.97).

atm2_test <- atm2_bc %>% window(start=46)
atm2_arima1_fc <- atm2_arima1 %>% forecast(h = 50)
atm2_arima2_fc <- atm2_arima2 %>% forecast(h = 50)
accuracy(atm2_arima1_fc, atm1_test)

##                      ME     RMSE      MAE       MPE     MAPE      MASE         ACF1 Theil's U
## Training set -0.2154851 2.630885 1.881945      -Inf      Inf 0.7987309 -0.003900967        NA
## Test set     -5.0909161 5.779999 5.090916 -92.94928 92.94928 2.1606749 -0.108128657  1.494979

accuracy(atm2_arima2_fc, atm1_test)

##                      ME     RMSE      MAE       MPE     MAPE      MASE        ACF1 Theil's U
## Training set -0.1219859 2.673070 1.908264      -Inf      Inf 0.8099012  0.09438628        NA
## Test set     -5.0653652 6.883749 5.453049 -84.68315 90.38983 2.3143706 -0.21952832  1.974388

Evaluated on the ATM 2 test data, neither the manual nor the auto model performs particularly well. The manual model shows a lower RMSE (approximately 5.78 versus 6.88) than does the auto model, so the former will be used to forecast.

ATM 3

With only three data points, the ATM 3 series offers limited information for modeling. Thus three simple methods are compared visually.

atm3_mean <- meanf(atm3, h = 14)
atm3_naive <- naive(atm3, h = 14)
atm3_drift <- rwf(atm3, h = 14, drift = TRUE)
autoplot(atm3) +
  autolayer(atm3_mean, series = "Average", PI = FALSE) +
  autolayer(atm3_naive, series = "Naive", PI = FALSE) +
  autolayer(atm3_drift, series = "RWF + Drift", PI = FALSE)

The simple average and naive methods appear similar, with the average resulting in a slightly higher point forecast of approximately 87.7 compared to the naive point forecast of 85. Using the random walk with drift, which slopes downward at a constant rate, is a clear non-starter. Given there is little means for evaluation, and based upon the plot, the simple average method is used for forecasting.

ATM 4

atm4_train <- window(atm4_bc, end=45)
atm4_bc_d %>% window(end=45) %>% ggtsdisplay(main="ATM 4, post-Box-Cox and seasonal differencing, training")

The ATM 4 training data show a large statistically significant correlation ($\alpha$ = 0.05) at seasonal lag 7 of the ACF plot along with exponential decay in the seasonal lags (7, 14, and 21) of the PACF plot. This pattern suggests an MA(1) component. Neither autocorrelation plot displays much else going on. Considering the single seasonal difference results in an ARIMA(0,0,0)(0,1,1)[7] model.

atm4_train %>% Arima(order=c(0,0,0), seasonal=c(0,1,1)) %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(0,1,1)[7]
## Q* = 19.804, df = 13, p-value = 0.1002
## 
## Model df: 1.   Total lags used: 14

Checking the residuals of the initial ARIMA(0,0,0)(0,1,1)[7] model reveals apparent white noise based upon the Ljung-Box test statistic p-value of approximately 0.10 on 13 degrees of freedom.

atm4_train %>% auto.arima(seasonal=TRUE, approximation=FALSE, stepwise=FALSE) %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.804, df = 11, p-value = 0.1485
## 
## Model df: 3.   Total lags used: 14

The auto model, an ARIMA(0,0,0)(2,0,0)[7] model, also passes the residual checks. Its test statistic for the Ljung-Box test is statistically significant ($\alpha$ = 0.05) with a p-value of approximately 0.15 on 11 degrees of freedom. The ACF plot shows most correlations within the bounds, though a couple--on lags 9 and 10--are just significant.

(atm4_arima1 <- atm4_train %>% Arima(order=c(0,0,0), seasonal=c(0,1,1)))

## Series: . 
## ARIMA(0,0,0)(0,1,1)[7] 
## 
## Coefficients:
##          sma1
##       -0.9913
## s.e.   0.1644
## 
## sigma^2 estimated as 286:  log likelihood=-1294.08
## AIC=2592.15   AICc=2592.19   BIC=2599.57

(atm4_arima2 <- atm4_train %>% auto.arima(seasonal=TRUE, approximation=FALSE, stepwise=FALSE))

## Series: . 
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean 
## 
## Coefficients:
##         sar1    sar2     mean
##       0.2119  0.1321  36.3204
## s.e.  0.0570  0.0577   1.5397
## 
## sigma^2 estimated as 328.7:  log likelihood=-1332.62
## AIC=2673.25   AICc=2673.38   BIC=2688.18

Per the information criteria, the manual model outperforms the auto model. Its AIC (2592.15) and BIC (2599.57) are both lower than the same for the auto. Given the difference in differencing across models, comparing the AIC$\ _c$ values on the training set is inappropriate.

atm4_test <- atm4_bc %>% window(start=46)
atm4_arima1_fc <- atm4_arima1 %>% forecast(h = 50)
atm4_arima2_fc <- atm4_arima2 %>% forecast(h = 50)
accuracy(atm4_arima1_fc, atm1_test)

##                       ME     RMSE      MAE        MPE     MAPE      MASE       ACF1 Theil's U
## Training set   0.3863146 16.69024 12.99425  -76.75994 103.7497 0.7126296  0.1092475        NA
## Test set     -29.9271085 31.21481 29.92711 -676.72569 676.7257 1.6412599 -0.0193582  5.069983

accuracy(atm4_arima2_fc, atm1_test)

##                         ME     RMSE      MAE        MPE     MAPE      MASE        ACF1 Theil's U
## Training set   0.002543109 18.04128 15.22090  -91.61932 119.0303 0.8347431  0.08070191        NA
## Test set     -30.289846379 30.34006 30.28985 -653.51579 653.5158 1.6611532 -0.10247579  5.389692

Neither model performs well on the test set. Specifically, the auto model has a smaller RMSE (30.34 versus 31.21), so that model will be used for forecasting.

Forecasting

Each of the selected models informs forecasts for the fourteen dates--observations missing values for ATM and Cash--from May 1 through May 14, 2010. The model-specific forecasts are then added together to create the total amount of cash, in hundreds of dollars, forecasted to be withdrawn daily from the four ATMs.

atm1_fc <- atm1_bc %>% Arima(order = c(1,0,0), seasonal = c(2,1,2)) %>% forecast(h = 14)
atm2_fc <- atm2_bc %>% Arima(order = c(1,0,5), seasonal = c(1,1,1)) %>% forecast(h = 14)
atm3_fc <- meanf(atm3, h = 14)
atm4_fc <- atm4_bc %>% Arima(order = c(0,0,0), seasonal = c(2,0,0)) %>% forecast(h = 14)

After forecasting, the Box-Cox-transformed point forecasts for ATMs 1, 2, and 4 need to be back-transformed prior to summing them. Then, the final 14 point forecasts of each one are kept.

atm1_fc_back <- InvBoxCox(atm1_fc$mean, lambda = 0.2)
atm2_fc_back <- InvBoxCox(atm2_fc$mean, lambda = 0.5)
atm4_fc_back <- InvBoxCox(atm4_fc$mean, lambda = 0.5)

atm_fc <- as.numeric(unclass(atm1_fc_back) + unclass(atm2_fc_back) + unclass(atm3_fc$mean) + unclass(atm4_fc_back))
atm_dates <- atm[is.na(atm$Cash) & is.na(atm$ATM),1]
atm_final <- bind_cols(list(date = atm_dates, forecast = atm_fc))
write_csv(atm_final, "project1_partA_forecast.csv")

summary(atm_final$forecast)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   269.1   498.2   565.7   535.7   615.7   665.6

The final forecasts have a mean amount of cash withdrawn (hundreds of $) of 535.7, a median of 565.7, and a minimum and maximum of 269.1 and 665.6, respectively.

Part B

Data pre-processing

Exploration and initial cleaning

power <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx", sheet = 1)

This dataset consists of 192 observations across three variables: CaseSequence, which indicates the sequence number of the case; YYYY-MMM, which specifies the date of the case period; and KWH, which notes the power consumption, in kilowatt hours. The dates run monthly from January 1998 until December 2013.

summary(power)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

power[is.na(power$KWH),]

## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`   KWH
##          <dbl> <chr>      <dbl>
## 1          861 2008-Sep      NA

Focusing on KWH, values range from a minimum of 770523 to a maximum of 10655730, with a mean and median of 6502475 and 6283324, respectively. There is also one observation with a missing value: case 861 from September 2008. This missing value will need to be imputed.

kwh_ts <- power %>% select(-c(CaseSequence,`YYYY-MMM`)) %>% ts(frequency = 12, start = c(1998,1))
autoplot(kwh_ts, main="Monthly power usage in kilowatts per hour (01/1998 - 12/2013)") +
  ylab("Kilowatts used per hour (KWH)")

An initial time plot reveals several characteristics of the series. First, power usage displays clear seasonality--likely annual or perhaps semi-annual--that requires further investigation. Second, there is a seeming outlier in mid-2010. And third, there is a slight increase in level around 2011, which appears coupled with a slight increase in variance.

ggseasonplot(kwh_ts) +
  ggtitle("Monthly power usage (KWH), by year") +
  ylab("Kilowatts used per hour (KWH)")

Breaking the series down by year reveals a sinusoidal pattern of power usage, with relative peak usage in early winter (January) and mid summer (July and August) and relative troughs in the spring (April and May) and fall (October and November). The aforementioned outlier comes in July 2010. Also, generally, more recent years appear to show higher power usage relative to earlier years.

boxplot(kwh_ts, main="Monthly power usage (KWH), outlier", ylab="Kilowatts used per hour (KWH)")

A simple boxplot shows that the July 2010 value is a statistical outlier. A possible, legitimate reason for the outlier could be power failure, though without subject matter expertise, it is speculation. At any rate, given the singular nature of the outlier relative to the rest of the series, it will be replaced [as will the September 2008 missing value] via linear interpolation.

kwh_ts_clean <- kwh_ts %>% tsclean(replace.missing = TRUE)
autoplot(kwh_ts_clean, main="Monthly power usage (KWH), clean") +
  ylab("Kilowatts used per hour (KWH)")

After imputation of missing values and replacement of the outlier through linear interpolation, the series appears ready for possible transformation and differencing.

Transformation and differencing

The series now undergoes a Box-Cox transformation as well as differencing, as necessary.

BoxCox.lambda(kwh_ts_clean)

## [1] -0.1442665

The suggested Box-Cox lambda is approximately -0.14, so a rough, rounded lambda of -0.20 is used to ease interpretation.

kwh_bc <- kwh_ts_clean %>% BoxCox(lambda = -0.2)
kwh_bc %>% ggtsdisplay(main="Monthly power usage (KWH), post-Box-Cox transformation")

After the Box-Cox transformation, the variance looks relatively consistent across the series, though any change pre-post-transformation is not obvious. An ACF plot suggests that there is annual seasonality along with a clear sinusoidal pattern.

nsdiffs(kwh_bc)

## [1] 1

ndiffs(diff(kwh_bc, lag = 12))

## [1] 0

A seasonal difference appears needed, though a follow-up, non-seasonal one is not.

kwh_bc_d <- kwh_bc %>% diff(lag = 12)
kwh_bc_d %>% ggtsdisplay(main="Monthly power usage (KWH), post-Box-Cox and seasonal differencing")

kwh_bc_d %>% ur.kpss() %>% summary()

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.1043 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Post-seasonal differencing (lag = 12), the series is stationary. Its unit root test-statistic of approximately 0.10 is within the bounds for statistical significance ($\alpha$ = 0.05). The autocorrelation plots continue to show statistically significant spikes ($\alpha$ = 0.05) on the seasonal lags. Regardless, the transformed and differenced series appears ready for modeling.

Modeling and evaluation

The training set consists of roughly 80 percent of the series points, or 156 months (13 years). The test set consists of the remaining 36 months.

kwh_train <- kwh_bc %>% window(end = c(2010,12))
kwh_bc_d %>% window(end = c(2010,12)) %>% ggtsdisplay(main="Monthly power usage (KWH), post-Box-Cox and seasonal differencing, training")

Looking at plots of the transformed and differenced training set, statistically significant spikes ($\alpha$ = 0.05) on the seasonal lags, in both the ACF and PACF plots, suggest seasonal AR(1) and MA(1) components. Further, the significant spikes at lag 1 in both plots suggest non-seasonal AR(1) and MA(1) components. Combined with the seasonal difference, an ARIMA(1,0,1)(1,1,1)[12] model seems appropriate for the series.

kwh_train %>% Arima(order = c(1,0,1), seasonal = c(1,1,1)) %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(1,1,1)[12]
## Q* = 31.654, df = 20, p-value = 0.04713
## 
## Model df: 4.   Total lags used: 24

An ARIMA(1,0,1)(1,1,1)[12] model does not pass the residual checks, with a Ljung-Box test-statistic p-value of approximately 0.05. That is, there is sufficient evidence to suggest the residuals are different from white noise. The residuals look okay per visual checks, with a significant spike at lag 3 of the ACF, and a possibly left skewed distribution.

kwh_train %>% auto.arima(seasonal = TRUE, stepwise = FALSE, approximation = FALSE) %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,0,0)(0,1,1)[12] with drift
## Q* = 24.446, df = 18, p-value = 0.1409
## 
## Model df: 6.   Total lags used: 24

The auto-generated model is an ARIMA(4,0,0)(0,1,1)[12] with drift. Likewise, its residuals pass all checks. The residuals' Ljung-Box test statistic has a p-value of approximately 0.14 on 18 degrees of freedom--still not statistically significant. The residuals also pass the general visual checks, with all correlations within bounds on the ACF, and a somewhat symmetrical distribution.

(kwh_arima1 <- kwh_train %>% Arima(order = c(1,0,1), seasonal = c(1,1,1)))

## Series: . 
## ARIMA(1,0,1)(1,1,1)[12] 
## 
## Coefficients:
##          ar1      ma1     sar1     sma1
##       0.6547  -0.4422  -0.1459  -0.6345
## s.e.  0.2200   0.2665   0.1238   0.1057
## 
## sigma^2 estimated as 1.658e-05:  log likelihood=594.94
## AIC=-1179.88   AICc=-1179.45   BIC=-1165.03

(kwh_arima2 <- kwh_train %>% auto.arima(seasonal = TRUE, stepwise = FALSE, approximation = FALSE))

## Series: . 
## ARIMA(4,0,0)(0,1,1)[12] with drift 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     sma1  drift
##       0.2805  -0.0327  0.2364  -0.2575  -0.7741  0e+00
## s.e.  0.0817   0.0815  0.0814   0.0812   0.0747  1e-04
## 
## sigma^2 estimated as 1.484e-05:  log likelihood=603.91
## AIC=-1193.82   AICc=-1193   BIC=-1173.03

The auto model performs better than the manual one per information criteria. Its AIC (-1193.82 versus -1179.88), AIC$\ _c$ (-1193 versus -1179.45), and BIC (-1173.03 versus -1165.03) are all lower.

kwh_test <- kwh_bc %>% window(start = c(2011,1))
kwh_arima1_fc <- kwh_arima1 %>% forecast(h = 36)
kwh_arima2_fc <- kwh_arima2 %>% forecast(h = 36)
accuracy(kwh_arima1_fc, kwh_test)

##                        ME        RMSE         MAE        MPE       MAPE     MASE      ACF1 Theil's U
## Training set 0.0008674094 0.003857687 0.003112486 0.01810554 0.06510538 0.793626 0.0784383        NA
## Test set     0.0052328567 0.006527952 0.005258685 0.10923591 0.10977622 1.340867 0.1768341 0.7182242

accuracy(kwh_arima2_fc, kwh_test)

##                        ME        RMSE         MAE         MPE       MAPE      MASE      ACF1 Theil's U
## Training set 0.0002962721 0.003623677 0.002956864 0.006162472 0.06185357 0.7539455 0.1141579        NA
## Test set     0.0039016049 0.005513545 0.004320208 0.081427214 0.09018416 1.1015728 0.1432683 0.6067527

Evaluation on the test set returns errors, for both models, that are surprisingly low. The modeling process appears sound. Specifically, the RMSE of the auto model is approximately 0.006 compared with approximately 0.007 for the manual model. The auto model is slightly more useful for forecasting.

Forecasting

Monthly power usage for 2014 is forecasted using the selected ARIMA(4,0,0)(0,1,1)[12] model with drift.

kwh_fc <- kwh_bc %>% Arima(order = c(4,0,0), seasonal = c(0,1,1), include.drift = TRUE) %>% forecast(h = 12)
kwh_fc_back <- InvBoxCox(kwh_fc$mean, lambda = -0.2)

The point forecasts, one for each month of 2014, must then be back-transformed to account for the initial Box-Cox transformation.

kwh_final <- bind_cols(list(date = paste0("2014-", month.abb), forecast = kwh_fc_back))
write_csv(kwh_final, "project1_partB_forecast.csv")
summary(kwh_final$forecast)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 5688699 5929446 7361542 7403671 8404165 9493756

The forecasts range from a minimum of 5688699 KWH to a maximum of 9493756 KWH, with a mean of 7403671 and a median of 7361542.

Source

Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: principles and practice, 2nd edition. OTexts: Melbourne, Australia. OTexts.com/fpp2. Accessed on October 17, 2020.