Part A: ATM cash withdrawals

In this part, we work with a dataset of cash withdrawals from 4 ATMs (“ATM624Data.xlsx”). The data is a daily time series from 2009/5/1 to 2010/4/30, spanning a full year. The target variable is Cash, which gives the aggregate cash withdrawals in hundreds of dollars for the indicated ATM machine (ATM) and date (DATE).

Data preparation

First let’s review the data. We note the following:

  • Outliers: one observation (10920 for ATM4 on 2010/2/9) appears to be an outlier. This is likely a data measurement or recording error, as the value is larger by an order of magnitude compared to the distribution of the remaining ATM4 observations.
  • Missing data: there are 14 rows in the data file without ATM and Cash data, which we exclude. In addition, there are 3 missing Cash values for ATM1 and 2 missing Cash values for ATM2.
  • Zero values: the Cash data for ATM3 is mostly zeros; in fact ATM3 has non-zero Cash data for only the last 3 days starting on 2010/4/28. This may indicate a new ATM machine that came online at the end of the measurement period.

Regarding the outlier for ATM4, ideally we would investigate the outlier with the source of the data and then decide how to treat it. For instance, if we knew the outlier was explained by a holiday or other special circumstance, then we would keep it and note it as an outlier when reviewing the results of our analysis. On the other hand, if we knew the outlier to be a data error, we would either remove it or replace it with an estimate (imputation). However, in this case we don’t know for sure, so we go with our intuition that this is a data error, in which case we treat it as missing data.

Regarding the missing data for ATM1 and ATM2 (as well as the outlier removed for ATM4), the simplest option would be to ignore them, since we're only missing 2-3 values out of 365 for each machine. However, the ETS and ARIMA model-fitting functions that we use in the modeling phase will throw errors if there are missing data in the interior of the time series. Consequently we need to impute values for the missing data, which we do by averaging the observations on the same weekday one week before and one week after the missing date. If we knew more about the data and believed that the ATMs are subject to the same drivers of consumer behavior (for instance, if the machines are all situated in the same shopping mall), then we could undertake a more sophisticated imputation method like regression, in which we fit the relationship between one machine and the other three, and then use the regression to estimate the missing observation for the first machine as a function of the observations from the other three. In any case, because there are only 2-3 missing observations out of 365 for each machine, whichever imputation method we choose is not likely to bias the results in any meaningful way.
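
For illustration, a regression-based imputation along these lines might look like the sketch below (hypothetical and not used here; it assumes the wide raw1 table defined in the next code chunk, and it excludes ATM3 since that series is almost entirely zeros).

# sketch: impute ATM1 from the other machines via regression (not used here)
fit_imp <- lm(ATM1 ~ ATM2 + ATM4, data = raw1)     # fit on complete rows
na_rows <- which(is.na(raw1$ATM1))                 # rows needing imputation
raw1$ATM1[na_rows] <- predict(fit_imp, newdata = raw1[na_rows, ])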

After making adjustments for the outlier and missing data, we define our time series for model fitting, which we call cashts. We set the seasonal period to 7, since we expect there may be a weekly “seasonal” pattern.

# load data and review summary stats
# (packages assumed loaded in a setup chunk: readxl, tidyverse, lubridate, forecast, GGally, knitr, writexl)
file1 <- "ATM624Data.xlsx"
raw1_long <- read_excel(file1, col_types = c("date", "text", "numeric")) %>% 
  mutate(DATE = as_date(DATE), WDAY = wday(DATE, label = TRUE)) %>% 
  filter(!is.na(ATM)) %>% 
  filter(Cash < 10000) # remove outlier for ATM4 (also drops the NA-Cash rows, which pivot_wider below reintroduces as NA)
raw1 <- pivot_wider(raw1_long, names_from = ATM, values_from = Cash)
summary(raw1)
##       DATE             WDAY         ATM1             ATM2       
##  Min.   :2009-05-01   Sun:52   Min.   :  1.00   Min.   :  0.00  
##  1st Qu.:2009-07-31   Mon:52   1st Qu.: 73.00   1st Qu.: 25.50  
##  Median :2009-10-30   Tue:52   Median : 91.00   Median : 67.00  
##  Mean   :2009-10-30   Wed:52   Mean   : 83.89   Mean   : 62.58  
##  3rd Qu.:2010-01-29   Thu:52   3rd Qu.:108.00   3rd Qu.: 93.00  
##  Max.   :2010-04-30   Fri:53   Max.   :180.00   Max.   :147.00  
##                       Sat:52   NA's   :3        NA's   :2       
##       ATM3              ATM4         
##  Min.   : 0.0000   Min.   :   1.563  
##  1st Qu.: 0.0000   1st Qu.: 123.918  
##  Median : 0.0000   Median : 403.305  
##  Mean   : 0.7206   Mean   : 445.346  
##  3rd Qu.: 0.0000   3rd Qu.: 704.271  
##  Max.   :96.0000   Max.   :1712.075  
##                    NA's   :1
ggplot(raw1_long) + geom_histogram(aes(x = Cash)) + 
    facet_wrap(~ ATM, scales = "free") + 
    labs(title = "Histogram of Cash by ATM (excluding outlier for ATM4)")

# fix missing data (average of the same weekday one week before & after) & define ts
raw1 <- raw1 %>% mutate(ATM1 = ifelse(is.na(ATM1), (lag(ATM1, 7) + lead(ATM1, 7))/2, ATM1), 
                        ATM2 = ifelse(is.na(ATM2), (lag(ATM2, 7) + lead(ATM2, 7))/2, ATM2), 
                        ATM4 = ifelse(is.na(ATM4), (lag(ATM4, 7) + lead(ATM4, 7))/2, ATM4))
cashts <- ts(raw1[ , 3:6], start = c(1, 6), frequency = 7)
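
As a quick sanity check (illustrative; output omitted), we can confirm that no interior missing values remain, since the ets and auto.arima fitting functions used later would reject them:

# count remaining NAs per machine after imputation (expect zeros for ATM1/2/4)
colSums(is.na(raw1[ , c("ATM1", "ATM2", "ATM3", "ATM4")]))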

Data exploration

Next we review the plots below to better understand the time series. We focus on the ATM1, ATM2, and ATM4 time series, since the ATM3 series doesn't have enough data (only 3 non-zero observations) to draw meaningful conclusions. We note the following observations:

  • The ATM1, ATM2, and ATM4 series are moderately correlated, with pairwise correlations in the 0.4 to 0.7 range.
  • Other than ATM3, the time series display only a weak trend-cycle. In the case of ATM3, the time series is zero until the last 3 observations, which, as we noted earlier, may be attributable to a new machine coming online.
  • The ATM1, ATM2, and ATM4 series exhibit a strong seasonal pattern, with higher withdrawals typically occurring on Fridays and Saturdays and lower withdrawals occurring on Thursdays. However, note that while this appears to be true in general (particularly for ATM1 and ATM2), the seasonal pattern changes over time. This could be due to weekly patterns that vary over the course of the year (e.g., winter weeks versus summer weeks).
  • Consistent with the seasonal pattern, ATM1, ATM2, and ATM4 time series have statistically significant autocorrelations for 7-day lags and multiples thereof going out past 10 weeks. The autocorrelations for ATM4 are somewhat weaker than those seen for ATM1 and ATM2.
  • The ATM1, ATM2, and ATM4 series appear to have no significant issues with unstable variances. Although the estimated Box-Cox parameters are \(\lambda = \{0.26, 0.67, 0.46\}\) for the three series, obvious problems like increasing variances are not evident in the time plots.
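
For reference, the Box-Cox transformation used throughout is

\[ w_t = \begin{cases} \log(y_t) & \text{if } \lambda = 0, \\ (y_t^{\lambda} - 1)/\lambda & \text{otherwise,} \end{cases} \]

so values of \(\lambda\) near 0 act like a log transform, while \(\lambda = 1\) leaves the shape of the series unchanged.
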
# plots
ggpairs(as.data.frame(cashts[ , -3]), 
        title = "Pairs plot: cashts")

autoplot(cashts, facets = TRUE) + 
  labs(title = "Time plot: cashts", 
       x = "Week", 
       y = "$ hundreds")

ggseasonplot(cashts[ , 'ATM1']) + 
  labs(title = "Seasonal plot: ATM1")

ggseasonplot(cashts[ , 'ATM2']) + 
  labs(title = "Seasonal plot: ATM2")

#ggseasonplot(cashts[ , 'ATM3']) + 
#  labs(title = "Seasonal plot: ATM3")
ggseasonplot(cashts[ , 'ATM4']) + 
  labs(title = "Seasonal plot: ATM4")

ggsubseriesplot(cashts[, 'ATM1']) + 
  labs(title = "Seasonal subseries plot: ATM1")

ggsubseriesplot(cashts[, 'ATM2']) + 
  labs(title = "Seasonal subseries plot: ATM2")

#ggsubseriesplot(cashts[, 'ATM3']) + 
#  labs(title = "Seasonal subseries plot: ATM3")
ggsubseriesplot(cashts[, 'ATM4']) + 
  labs(title = "Seasonal subseries plot: ATM4")

ggAcf(cashts[ , c(1,2,4)], lag.max = 70) + 
    labs(title = "Autocorrelation plot: cashts")

# box-cox transformation & plot
cat("Box-Cox lambda parameter for ATM1: ", BoxCox.lambda(cashts[ , 'ATM1']))
## Box-Cox lambda parameter for ATM1:  0.2597339
autoplot(cbind(ATM1 = cashts[ , 'ATM1'], transformed = BoxCox(cashts[ , 'ATM1'], lambda = "auto")), 
         facets = TRUE) + 
    labs(title = "Time plot: ATM1 and transformed time series", 
         x = "Week", 
         y = "$ hundreds")

cat("Box-Cox lambda parameter for ATM2: ", BoxCox.lambda(cashts[ , 'ATM2']))
## Box-Cox lambda parameter for ATM2:  0.672753
autoplot(cbind(ATM2 = cashts[ , 'ATM2'], transformed = BoxCox(cashts[ , 'ATM2'], lambda = "auto")), 
         facets = TRUE) + 
    labs(title = "Time plot: ATM2 and transformed time series", 
         x = "Week", 
         y = "$ hundreds")

#cat("Box-Cox lambda parameter for ATM3: ", BoxCox.lambda(cashts[ , 'ATM3']))
#autoplot(cbind(ATM3 = cashts[ , 'ATM3'], transformed = BoxCox(cashts[ , 'ATM3'], lambda = "auto")), 
#         facets = TRUE) + 
#    labs(title = "Time plot: ATM3 and transformed time series", 
#         x = "Week", 
#         y = "$ hundreds")
cat("Box-Cox lambda parameter for ATM4: ", BoxCox.lambda(cashts[ , 'ATM4']))
## Box-Cox lambda parameter for ATM4:  0.4556867
autoplot(cbind(ATM4 = cashts[ , 'ATM4'], transformed = BoxCox(cashts[ , 'ATM4'], lambda = "auto")), 
         facets = TRUE) + 
    labs(title = "Time plot: ATM4 and transformed time series", 
         x = "Week", 
         y = "$ hundreds")

Finally we plot the STL decomposition to view the seasonal and trend components of the ATM1, ATM2, and ATM4 time series. (We exclude the ATM3 series since it has only 3 non-zero observations.) It is apparent that the trend component is not as significant as the seasonal component, particularly for ATM1 and ATM2. Also we see that the seasonal component varies over time in terms of both amplitude and curve shape.

# STL decomposition
cashts[ , 'ATM1'] %>% stl(s.window = 7) %>% autoplot() + 
    labs(title = "STL decomposition: ATM1", x = "Week")

cashts[ , 'ATM2'] %>% stl(s.window = 7) %>% autoplot() + 
    labs(title = "STL decomposition: ATM2", x = "Week")

#cashts[ , 'ATM3'] %>% stl(s.window = 7) %>% autoplot() + 
#    labs(title = "STL decomposition: ATM3", x = "Week")
cashts[ , 'ATM4'] %>% stl(s.window = 7) %>% autoplot() + 
    labs(title = "STL decomposition: ATM4", x = "Week")

Modeling

To model and forecast the ATM1, ATM2, and ATM4 time series, we try 2 approaches:

  • Approach #1: fit STL model
  • Approach #2: fit ETS & ARIMA models.

For the ATM3 series, we try a different approach given its very limited usable data.

Approach #1: Fit STL model

We start by fitting an STL model using the stlf function with ETS or ARIMA modeling methods. The approach is detailed in the modeling section of Part B below, but in brief, it’s an estimation methodology that forecasts the seasonally-adjusted time series (using the selected modeling method) and then re-seasonalizes the forecast based on the last seasonal cycle. In other words, it uses a seasonal naive method for the seasonal component of the forecast.

Below we plot the time series and forecasts, with a close-up to examine the forecast period. The forecasts produced by the two models are very close; however, note that some of the forecasts go negative (e.g., for ATM2). The STL approach is useful for producing an approximate forecast, but because it carries the seasonal component forward from the last cycle rather than modeling it, and we saw that the seasonal pattern evolves over time, we cannot rely on this approach.

atm1ts <- cashts[ , 'ATM1']
atm2ts <- cashts[ , 'ATM2']
atm4ts <- cashts[ , 'ATM4']

# stlf models
atm1stlf.ets <- stlf(atm1ts, method = "ets", h = 31) 
atm1stlf.arima <- stlf(atm1ts, method = "arima", h = 31) 
atm2stlf.ets <- stlf(atm2ts, method = "ets", h = 31) 
atm2stlf.arima <- stlf(atm2ts, method = "arima", h = 31) 
atm4stlf.ets <- stlf(atm4ts, method = "ets", h = 31) 
atm4stlf.arima <- stlf(atm4ts, method = "arima", h = 31) 

# plot ATM1
autoplot(atm1ts) + 
    autolayer(atm1stlf.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(atm1stlf.arima, PI = FALSE, series = "STLF-ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "STLF forecast: ATM1", 
         x = "Week", 
         y = "$ hundreds") 

# close-up ATM1
autoplot(atm1ts) + 
    autolayer(atm1stlf.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(atm1stlf.arima, PI = FALSE, series = "STLF-ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Close-up of STLF forecast: ATM1", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58), 
                    ylim = c(0, 120))

# plot ATM2
autoplot(atm2ts) + 
    autolayer(atm2stlf.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(atm2stlf.arima, PI = FALSE, series = "STLF-ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "STLF forecast: ATM2", 
         x = "Week", 
         y = "$ hundreds") 

# close-up ATM2
autoplot(atm2ts) + 
    autolayer(atm2stlf.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(atm2stlf.arima, PI = FALSE, series = "STLF-ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Close-up of STLF forecast: ATM2", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58), 
                    ylim = c(0, 120))

# plot ATM4
autoplot(atm4ts) + 
    autolayer(atm4stlf.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(atm4stlf.arima, PI = FALSE, series = "STLF-ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "STLF forecast: ATM4", 
         x = "Week", 
         y = "$ hundreds") 

# close-up ATM4
autoplot(atm4ts) + 
    autolayer(atm4stlf.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(atm4stlf.arima, PI = FALSE, series = "STLF-ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Close-up of STLF forecast: ATM4", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58), 
                    ylim = c(0, 800))

Approach #2: Fit ETS & ARIMA models

Next we fit ETS and ARIMA models to the ATM1, ATM2, and ATM4 time series, including the seasonal component. We do this using the ets and auto.arima functions, and in each case, we show the model summary, plot model diagnostics, and check the model residuals.

  • ETS(A,N,A) for ATM1, ATM2, and ATM4
    • The best ETS fit for all three series is a model with no trend, additive seasonal component, and additive errors.
    • The ETS(A,N,A) model produces residuals that do not resemble white noise, i.e., they show significant autocorrelations and they don’t pass the Ljung-Box test.
    • Note that we also tried fitting an ETS model with Box-Cox transformation, but the results were the same, i.e., model residuals suffer from significant autocorrelation.
# ATM1
atm1.ets <- ets(atm1ts)
atm1.ets
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm1ts) 
## 
##   Smoothing parameters:
##     alpha = 0.0195 
##     gamma = 0.331 
## 
##   Initial states:
##     l = 80.3735 
##     s = -65.5466 7.5892 18.1514 10.6598 11.7762 5.0861
##            12.2839
## 
##   sigma:  23.988
## 
##      AIC     AICc      BIC 
## 4483.965 4484.586 4522.964
autoplot(atm1.ets, main = "Components of ETS(A,N,A) method: ATM1")

checkresiduals(atm1.ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 29.632, df = 5, p-value = 1.743e-05
## 
## Model df: 9.   Total lags used: 14
# ATM2
atm2.ets <- ets(atm2ts)
atm2.ets
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm2ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.3634 
## 
##   Initial states:
##     l = 72.6123 
##     s = -60.9385 -41.3568 15.1052 6.3099 28.5868 22.4571
##            29.8363
## 
##   sigma:  24.9396
## 
##      AIC     AICc      BIC 
## 4512.363 4512.985 4551.362
autoplot(atm2.ets, main = "Components of ETS(A,N,A) method: ATM2")

checkresiduals(atm2.ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 35.57, df = 5, p-value = 1.158e-06
## 
## Model df: 9.   Total lags used: 14
# ATM4
atm4.ets <- ets(atm4ts)
atm4.ets
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm4ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 453.3376 
##     s = -274.211 -32.2112 27.1532 40.6802 90.0601 50.641
##            97.8877
## 
##   sigma:  332.8719
## 
##      AIC     AICc      BIC 
## 6404.013 6404.634 6443.012
autoplot(atm4.ets, main = "Components of ETS(A,N,A) method: ATM4")

checkresiduals(atm4.ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 17.48, df = 5, p-value = 0.003674
## 
## Model df: 9.   Total lags used: 14
  • ARIMA(0,0,1)(1,1,1)[7] for ATM1
    • The best ARIMA fit for ATM1 is a model with non-seasonal MA(1), seasonal AR(1), and seasonal MA(1) components, together with first-order seasonal differences.
    • The characteristic roots and model residuals are fine; i.e., the inverse roots all lie within the unit circle, and the residuals resemble white noise.
  • ARIMA(2,0,2)(0,1,1)[7] for ATM2
    • The best ARIMA fit for ATM2 is a model with non-seasonal AR(2), non-seasonal MA(2), and seasonal MA(1) components, together with first-order seasonal differences.
    • The characteristic roots and model residuals are fine.
  • ARIMA(1,0,0)(2,0,0)[7] with constant for ATM4
    • The best ARIMA fit for ATM4 is a model with non-seasonal AR(1) and seasonal AR(2) components (no differencing needed).
    • The characteristic roots and model residuals are fine.
cat("Number of seasonal differences needed for ATM1:", nsdiffs(atm1ts))
## Number of seasonal differences needed for ATM1: 1
cat("Number of differences needed for ATM1:" , ndiffs(diff(atm1ts, lag = 7))) 
## Number of differences needed for ATM1: 0
cat("Number of seasonal differences needed for ATM2:", nsdiffs(atm2ts))
## Number of seasonal differences needed for ATM2: 1
cat("Number of differences needed for ATM2:" , ndiffs(diff(atm2ts, lag = 7))) 
## Number of differences needed for ATM2: 0
cat("Number of seasonal differences needed for ATM4:", nsdiffs(atm4ts))
## Number of seasonal differences needed for ATM4: 0
cat("Number of differences needed for ATM4:" , ndiffs(atm4ts)) 
## Number of differences needed for ATM4: 0
# ATM1
atm1.arima <- auto.arima(atm1ts, stepwise = FALSE, approximation = FALSE)
#atm1.arima <- auto.arima(atm1ts)
atm1.arima
## Series: atm1ts 
## ARIMA(0,0,1)(1,1,1)[7] 
## 
## Coefficients:
##          ma1    sar1     sma1
##       0.1884  0.1881  -0.7549
## s.e.  0.0548  0.0791   0.0549
## 
## sigma^2 estimated as 552.9:  log likelihood=-1638.92
## AIC=3285.84   AICc=3285.95   BIC=3301.36
autoplot(atm1.arima, main = "ARIMA characteristic roots: ATM1")

checkresiduals(atm1.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(1,1,1)[7]
## Q* = 15.695, df = 11, p-value = 0.1528
## 
## Model df: 3.   Total lags used: 14
# ATM2
#atm2.arima <- auto.arima(atm2ts, stepwise = FALSE, approximation = FALSE)
atm2.arima <- auto.arima(atm2ts)
atm2.arima
## Series: atm2ts 
## ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1     ar2     ma1     ma2     sma1
##       -0.4238  -0.913  0.4590  0.8009  -0.7486
## s.e.   0.0547   0.042  0.0848  0.0572   0.0389
## 
## sigma^2 estimated as 586.2:  log likelihood=-1648.7
## AIC=3309.4   AICc=3309.64   BIC=3332.68
autoplot(atm2.arima, main = "ARIMA characteristic roots: ATM2")

checkresiduals(atm2.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.0631, df = 9, p-value = 0.4315
## 
## Model df: 5.   Total lags used: 14
# ATM4
atm4.arima <- auto.arima(atm4ts, stepwise = FALSE, approximation = FALSE)
#atm4.arima <- auto.arima(atm4ts)
atm4.arima
## Series: atm4ts 
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean 
## 
## Coefficients:
##          ar1    sar1    sar2      mean
##       0.0799  0.1494  0.1118  444.4260
## s.e.  0.0523  0.0522  0.0526   26.1069
## 
## sigma^2 estimated as 118508:  log likelihood=-2648.19
## AIC=5306.38   AICc=5306.55   BIC=5325.88
autoplot(atm4.arima, main = "ARIMA characteristic roots: ATM4")

checkresiduals(atm4.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 14.208, df = 10, p-value = 0.1637
## 
## Model df: 4.   Total lags used: 14
# best models 
bestatm1 <- atm1.arima
bestatm2 <- atm2.arima
bestatm4 <- atm4.arima

Comparing the ETS and ARIMA model output for ATM1, ATM2, and ATM4, it is apparent that the ARIMA models provide a stronger fit to the data for all series. Specifically, the ARIMA models have better model diagnostics and satisfy our assumption of no autocorrelation in the model residuals. Hence we select the ARIMA models as our final forecast models for all three time series.
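
As an additional cross-check (a sketch only; note that AIC values are not comparable across the ETS and ARIMA model classes), we could compare training-set accuracy, e.g., for ATM1:

# compare in-sample (training) accuracy for ATM1
acc1 <- rbind(accuracy(atm1.ets), accuracy(atm1.arima))
rownames(acc1) <- c("ETS(A,N,A)", "ARIMA(0,0,1)(1,1,1)[7]")
acc1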

Let’s plot the ATM1, ATM2, and ATM4 time series with forecasts produced by the ETS and ARIMA models. We include a close-up of the forecast period in order to examine and compare the output of the two models. It is interesting to note that the ETS and ARIMA forecasts are very close for the ATM1 and ATM2 series, but they are markedly different for the ATM4 series. In particular, the ETS model predicts a continuing seasonal pattern for ATM4, whereas the ARIMA model predicts a damped oscillatory pattern that seems to converge to a constant. This may arise from the weaker seasonality pattern and greater noise in the ATM4 data.

atm1f.ets <- forecast(atm1.ets, h = 31)
atm1f.arima <- forecast(atm1.arima, h = 31)
atm2f.ets <- forecast(atm2.ets, h = 31)
atm2f.arima <- forecast(atm2.arima, h = 31)
atm4f.ets <- forecast(atm4.ets, h = 31)
atm4f.arima <- forecast(atm4.arima, h = 31)

# plot ATM1
autoplot(atm1ts) + 
    autolayer(atm1f.ets, PI = FALSE, series = "ETS") + 
    autolayer(atm1f.arima, PI = FALSE, series = "ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +   
    labs(title = "ETS & ARIMA forecast: ATM1", 
         subtitle = "ETS(A,N,A) and ARIMA(0,0,1)(1,1,1)[7]", 
         x = "Week", 
         y = "$ hundreds") 

# close-up ATM1
autoplot(atm1ts) + 
    autolayer(atm1f.ets, PI = FALSE, series = "ETS") + 
    autolayer(atm1f.arima, PI = FALSE, series = "ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +   
    labs(title = "Close-up of ETS & ARIMA forecast: ATM1", 
         subtitle = "ETS(A,N,A) and ARIMA(0,0,1)(1,1,1)[7]", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58), 
                    ylim = c(0, 120))

# plot ATM2
autoplot(atm2ts) + 
    autolayer(atm2f.ets, PI = FALSE, series = "ETS") + 
    autolayer(atm2f.arima, PI = FALSE, series = "ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +   
    labs(title = "ETS & ARIMA forecast: ATM2", 
         subtitle = "ETS(A,N,A) and ARIMA(2,0,2)(0,1,1)[7]", 
         x = "Week", 
         y = "$ hundreds") 

# close-up ATM2
autoplot(atm2ts) + 
    autolayer(atm2f.ets, PI = FALSE, series = "ETS") + 
    autolayer(atm2f.arima, PI = FALSE, series = "ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +   
    labs(title = "Close-up of ETS & ARIMA forecast: ATM2", 
         subtitle = "ETS(A,N,A) and ARIMA(2,0,2)(0,1,1)[7]", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58), 
                    ylim = c(0, 120))

# plot ATM4
autoplot(atm4ts) + 
    autolayer(atm4f.ets, PI = FALSE, series = "ETS") + 
    autolayer(atm4f.arima, PI = FALSE, series = "ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +   
    labs(title = "ETS & ARIMA forecast: ATM4", 
         subtitle = "ETS(A,N,A) and ARIMA(1,0,0)(2,0,0)[7] with constant", 
         x = "Week", 
         y = "$ hundreds") 

# close-up ATM4
autoplot(atm4ts) + 
    autolayer(atm4f.ets, PI = FALSE, series = "ETS") + 
    autolayer(atm4f.arima, PI = FALSE, series = "ARIMA") + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) +   
    labs(title = "Close-up of ETS & ARIMA forecast: ATM4", 
         subtitle = "ETS(A,N,A) and ARIMA(1,0,0)(2,0,0)[7] with constant", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58), 
                    ylim = c(0, 800))

ATM3 model

We noted earlier in the data preparation section that the ATM3 series is zero for the first 362 observations and non-zero for only the last 3 observations. Upon further review, we notice that the non-zero cash withdrawals (96, 82, and 85) on days 363 through 365 are identical to those from ATM1. This would be a good time to investigate the nature of the time series with the source of the data. Questions to ask include:

  • What is the relationship among the four machines? We saw in the data exploration section that the ATM1, ATM2, and ATM4 series are moderately correlated. Can we estimate future observations for ATM3 on the basis of the forecasts for ATM1, ATM2, and ATM4? For instance, if the four machines were co-located within the same shopping mall, it would be reasonable to assume that their observations are systematically related.
  • Is there a reason the observations for ATM1 and ATM3 are identical on days 363 through 365, or is this coincidental?
  • Specifically for ATM3, why are the first 362 observations zero? Do the zero values contain any information relevant for forecasting, or are they simply missing observations prior to a new machine coming online?

Absent the answers to these questions, we make our best guess that (a) it's purely coincidental that ATM1 and ATM3 have identical observations on days 363 through 365, (b) the machines reflect independent observations, so future ATM3 observations cannot be estimated as a function of the forecasts for ATM1, ATM2, and ATM4, and (c) the zero observations for ATM3 prior to day 363 reflect missing data, which are not relevant for forecasting purposes.

With these assumptions, we re-define the ATM3 time series to exclude the zero values, resulting in a 3-day time series. Then we try several simple forecasting methods:

  • Average method (average the 3 observations)
  • Naive method (take the last observation)
  • Drift method (continue the estimated trend)
  • Simple exponential smoothing method (estimate the weighted average of the 3 observations).

The forecasts from these simple methods are shown in the plot below. We note that:

  • The average method gives a mean forecast of about 88, but its prediction intervals remain constant over the forecast horizon rather than widening.
  • The naive method gives the same forecast (85) as simple exponential smoothing, but its prediction intervals go negative during the forecast period.
  • The drift method produces a precipitous downward trend, which we reject as unrealistic.
  • The simple exponential smoothing method gives the same forecast (85) as the naive method, but its prediction intervals remain positive and grow over the forecast window (verified numerically after the code below).

Based on these observations, we choose simple exponential smoothing for our final forecast of ATM3. For practical purposes, the choice between the average method, the naive method, and the simple exponential smoothing method is not meaningful in terms of the resulting forecast, as all three methods give estimates that fall within each other’s prediction intervals. However, we expect that the prediction intervals should start out large and should grow over the forecast window, reflecting the increasing forecast uncertainty over time (since we’re forecasting 31 days from a dataset of only 3 observations). Furthermore, since cash withdrawals are inherently positive, we require that the prediction intervals remain positive. Only the simple exponential smoothing method is consistent with these requirements.

# simple methods for ATM3
# define 3-day ts for ATM3
atm3ts <- window(cashts[ , 'ATM3'], start = c(53, 4))
atm3f.mean <- meanf(atm3ts, h = 31)
atm3f.naive <- naive(atm3ts, h = 31)
atm3f.drift <- rwf(atm3ts, h = 31, drift = TRUE)
atm3f.ses <- ses(atm3ts, h = 31)

autoplot(atm3ts) + 
  autolayer(atm3f.mean, series = "Mean", PI = FALSE) + 
  autolayer(atm3f.naive, series = "Naive", PI = FALSE) + 
  autolayer(atm3f.drift, series = "Drift", PI = FALSE) + 
  autolayer(atm3f.ses, series = "SES", PI = FALSE) + 
  geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
  labs(title = "Simple forecasts for ATM3", 
         x = "Week", 
         y = "$ hundreds") + 
  coord_cartesian(xlim = c(53, 58), 
                  ylim = c(70, 100))
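
As a quick numeric check of the prediction-interval claims above (illustrative; output omitted):

# lower 95% bounds: naive dips negative while SES stays positive
min(atm3f.naive$lower[ , "95%"])
min(atm3f.ses$lower[ , "95%"])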

# best model forecast 
bestatm3f <- atm3f.ses

Forecast

Finally we use our chosen models to forecast daily ATM cash withdrawals over the next month for each machine. We plot the forecasts with 80% and 95% prediction intervals, and save the forecasts to an Excel file.

# forecast 31 days
fatm1 <- forecast(bestatm1, h = 31)
fatm2 <- forecast(bestatm2, h = 31)
fatm4 <- forecast(bestatm4, h = 31)
fatm3 <- bestatm3f

# plot with prediction intervals - ATM1
autoplot(fatm1) + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Final forecast: ATM1", 
         x = "Week", 
         y = "$ hundreds") 

# close-up - ATM1
autoplot(fatm1) + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Close-up of final forecast: ATM1", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58))  # optional: ylim = c(0, 100)

# plot with prediction intervals - ATM2
autoplot(fatm2) + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Final forecast: ATM2", 
         x = "Week", 
         y = "$ hundreds") 

# close-up - ATM2
autoplot(fatm2) + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Close-up of final forecast: ATM2", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58))  # optional: ylim = c(0, 100)

# plot with prediction intervals - ATM3
autoplot(cashts[ , 'ATM3']) + autolayer(fatm3) + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Final forecast: ATM3", 
         x = "Week", 
         y = "$ hundreds") 

# close-up - ATM3
autoplot(cashts[ , 'ATM3']) + autolayer(fatm3) + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Close-up of final forecast: ATM3", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58))  # optional: ylim = c(0, 100)

# plot with prediction intervals - ATM4
autoplot(fatm4) + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Final forecast: ATM4", 
         x = "Week", 
         y = "$ hundreds") 

# close-up - ATM4
autoplot(fatm4) + 
    geom_vline(xintercept = 53 + 6/7, lty = 3, lwd = 1) + 
    labs(title = "Close-up of final forecast: ATM4", 
         x = "Week", 
         y = "$ hundreds") + 
    coord_cartesian(xlim = c(50, 58))  # optional: ylim = c(0, 100)

# forecast df for excel file
last_date <- raw1$DATE[nrow(raw1)]  # last observed date
fcdates <- last_date + days(1:31)
cashfc_df <- data.frame(DATE = fcdates, 
                        ATM1 = fatm1$mean, 
                        ATM2 = fatm2$mean, 
                        ATM3 = fatm3$mean,
                        ATM4 = fatm4$mean)
# save to excel file <<< DONE AT END OF PART C
#write_xlsx(list(ATMForecast = cashfc_df), 
#           "KBenson_Proj1_Forecasts.xlsx")

Part B: Residential power usage

In this part, we work with a dataset of residential power usage (“ResidentialCustomerForecastLoad-624.xlsx”). The data is a monthly time series from 1998/01 to 2013/12, spanning 16 years. The target variable is KWH, which is the power consumption in kilowatt hours.

Data preparation

Reviewing the data, we note the following:

  • Outliers: one data point (2010-Jul, 770523) appears to be an outlier in the time series. It could be an error in data measurement or data entry, as the value is smaller by an order of magnitude compared to the distribution of the remaining data. If we knew this to be the case, then we could remove the data point or replace it with an estimate. However, without knowing for sure, we decide to keep the data point as is, and bear this in mind when reviewing the results of our analysis.
  • Missing data: there is 1 missing value in the time series for 2008-Sep. Because the model fitting functions we use below (for estimating ETS and ARIMA models) throw errors for missing values, we decide to impute a value for the missing data. We see from the time plot below that the time series is relatively smooth in the neighborhood of 2008-Sep, showing a consistent seasonal pattern. In this case, we impute the missing value by setting it equal to the average of the values from 12 months prior to and 12 months following the missing point (i.e., the average of 2007-Sep and 2009-Sep).

After the adjustment for the missing data, we define our time series for model fitting, which we call powerts.

# load data and review summary stats
file2 <- "ResidentialCustomerForecastLoad-624.xlsx"
raw2 <- read_excel(file2, col_types = c("numeric", "text", "numeric")) 
summary(raw2)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
ggplot(raw2) + geom_histogram(aes(x = KWH)) + 
    labs(title = "Histogram of KWH")

# outlier & missing data index
outlier <- which.min(raw2$KWH)        # row of the 2010-Jul minimum (outlier)
missing <- which.max(is.na(raw2$KWH)) # row of the first (and only) NA

# fix missing data and define ts
raw2 <- mutate(raw2, KWH = ifelse(is.na(KWH), (lag(KWH, 12) + lead(KWH, 12))/2, KWH))
powerts <- ts(raw2[['KWH']], start = c(1998,1), frequency = 12)

Data exploration

Reviewing the plots below, we make the following observations:

  • As noted before, the data has an outlier at 2010-Jul and missing value at 2008-Sep. We decided to keep the outlier, although it may be a measurement / data entry error. The imputed value that we use for 2008-Sep appears consistent with the time series behavior in the neighborhood of the missing value.
  • The time series exhibits a strong seasonal pattern, with peak usage in the winter (heating) and summer (cooling) months.
  • Consistent with the seasonal pattern, the time series has statistically significant autocorrelations for 3, 6, 9, and 12-month lags, as well as multiples thereof.
  • The data show almost no trend, although the period after 2010 perhaps has a higher mean value than the period before 2010.
  • The time series seems to show increasing variance over time, so we will consider a Box-Cox transformation in the modeling below. The suggested parameter for the Box-Cox transformation is \(\lambda = 0.12\), which is almost a log transform.
# plots
autoplot(powerts) + 
    geom_vline(xintercept = 1998 + (c(outlier, missing)-1)/12, lty = 2, col = 2) + 
    labs(title = "Time plot: powerts", 
         x = "Year", 
         y = "KWH")

ggseasonplot(powerts) + 
    labs(y = "KWH")

#ggseasonplot(powerts, polar = TRUE)
ggsubseriesplot(powerts) + 
    labs(title = "Seasonal subseries plot: powerts", 
         y = "KWH")

#gglagplot(powerts)
ggAcf(powerts, lag.max = 120) + 
    labs(title = "Autocorrelation plot: powerts")

# box-cox transformation & plot
cat("Box-Cox lambda parameter: ", BoxCox.lambda(powerts))
## Box-Cox lambda parameter:  0.1217513
autoplot(cbind(powerts = powerts, transformed = BoxCox(powerts, lambda = "auto")), 
         facets = TRUE) + 
    labs(title = "Time plot: powerts and transformed time series", 
         x = "Year", 
         y = "KWH")

Finally we plot the STL decomposition to view the seasonal and trend components of the time series. There is a slight increase in the trend line after 2010.

# STL decomposition
stl(powerts, s.window = 13) %>% autoplot() + 
    labs(title = "STL decomposition: powerts", 
         x = "Year")

Modeling

To model and forecast the powerts time series, we try 3 approaches:

  • Approach #1: fit STL model
  • Approach #2: fit ETS & ARIMA models
  • Approach #3: validate ETS/ARIMA & choose best model

Approach #1: Fit STL model

As a starting point, we fit an STL model using the stlf function with ETS or ARIMA modeling methods. Procedurally, this approach does the following:

  • Apply an STL decomposition to the time series
  • Take the seasonally-adjusted (SA) time series for modeling
  • Fit either an ETS or ARIMA model to the SA time series
  • Forecast the SA time series using the fitted model
  • Re-seasonalize the forecast by adding a seasonal adjustment based on the last year of the seasonal component (from the STL decomposition).

While ETS or ARIMA modeling is used for fitting the non-seasonal (SA) time series, the seasonal adjustment to the forecast is equivalent to the “seasonal naive” method of using the last year of the seasonal pattern. As such, we expect this approach to give an approximate forecast that ignores modeling of the seasonal pattern.
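
For concreteness, the procedure is roughly equivalent to the following manual pipeline (a sketch only; stlf's internal defaults, such as its s.window setting, differ in detail):

# manual sketch of (roughly) what stlf(powerts, method = "ets", h = 12) does
decomp <- stl(powerts, s.window = "periodic")
sa     <- seasadj(decomp)                         # seasonally-adjusted series
fc_sa  <- forecast(ets(sa), h = 12)               # model & forecast the SA series
seas   <- as.numeric(tail(seasonal(decomp), 12))  # last year of the seasonal component
fc     <- fc_sa$mean + seas                       # re-seasonalized point forecast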

Below we plot the time series and forecasts, with a close-up to examine the forecast period. The forecasts produced by the two models are virtually indistinguishable. Note that we also considered versions of the STL ETS and STL ARIMA models with a Box-Cox transformation, but the resulting forecasts included a mid-year kink in the curve, which is not consistent with the time series history. Hence we rejected these model versions.

# stlf models
stlf2.ets <- stlf(powerts, method = "ets", h = 12) 
stlf2.arima <- stlf(powerts, method = "arima", h = 12) 
#stlf2.ets.bc <- stlf(powerts, method = "ets", lambda = "auto", h = 12) 
#stlf2.arima.bc <- stlf(powerts, method = "arima", lambda = "auto", h = 12) 

# plot
autoplot(powerts) + 
    autolayer(stlf2.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(stlf2.arima, PI = FALSE, series = "STLF-ARIMA") + 
#    autolayer(stlf2.ets.bc, PI = FALSE, series = "STLF-ETS-BC") + 
#    autolayer(stlf2.arima.bc, PI = FALSE, series = "STLF-ARIMA-BC") + 
    geom_vline(xintercept = 2014, lty = 3, lwd = 1) + 
    labs(title = "STLF forecast using ETS & ARIMA", 
         x = "Year", 
         y = "KWH") 

# close-up
autoplot(powerts) + 
    autolayer(stlf2.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(stlf2.arima, PI = FALSE, series = "STLF-ARIMA") + 
#    autolayer(stlf2.ets.bc, PI = FALSE, series = "STLF-ETS-BC") + 
#    autolayer(stlf2.arima.bc, PI = FALSE, series = "STLF-ARIMA-BC") + 
    geom_vline(xintercept = 2014, lty = 3, lwd = 1) + 
    labs(title = "Close-up of STLF forecast using ETS & ARIMA", 
         x = "Year", 
         y = "KWH") + 
    coord_cartesian(xlim = c(2012, 2015), 
                    ylim = c(5e06, 11e06))

Approach #2: Fit ETS & ARIMA models

Next we fit ETS and ARIMA models to the powerts time series, including the seasonal component. We do this using the ets and auto.arima functions, and in each case, we show the model summary, plot model diagnostics, and check the model residuals.

  • ETS(A,N,A) with Box-Cox
    • The best ETS fit is a model with no trend, additive seasonal component, and additive errors, where we have applied a Box-Cox transformation to the time series.
    • Without the Box-Cox transformation, the best ETS fit is ETS(M,N,M) – no trend with multiplicative seasonality and multiplicative errors.
    • We considered both model versions, but in the end opted to include the Box-Cox transformation, because the resulting model residuals were better behaved. The model version without the Box-Cox transformation had model residuals that showed statistically significant autocorrelations and that failed the Ljung-Box test.
    • The ETS(A,N,A) model with Box-Cox transformation produces residuals that resemble white noise, i.e., they show no significant autocorrelations and they pass the Ljung-Box test.
# ets without box-cox <<< DON'T USE
#fit2.ets <- ets(powerts)
#fit2.ets
#autoplot(fit2.ets)
#checkresiduals(fit2.ets)

# ets with box-cox <<< BEST
fit2.ets <- ets(powerts, lambda = 'auto')
fit2.ets
## ETS(A,N,A) 
## 
## Call:
##  ets(y = powerts, lambda = "auto") 
## 
##   Box-Cox transformation: lambda= 0.1218 
## 
##   Smoothing parameters:
##     alpha = 0.0525 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 46.7427 
##     s = -0.1683 -1.6746 -0.8476 1.2707 1.798 0.4992
##            0.1981 -1.5378 -1.1848 -0.4614 0.4895 1.6189
## 
##   sigma:  1.2401
## 
##      AIC     AICc      BIC 
## 1107.525 1110.252 1156.387
autoplot(fit2.ets)

checkresiduals(fit2.ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 11.321, df = 10, p-value = 0.333
## 
## Model df: 14.   Total lags used: 24
  • ARIMA(1,0,0)(0,1,2)[12] with drift
    • The best ARIMA fit is a model with non-seasonal AR(1) component, seasonal MA(2) component, and first-order seasonal differences, with drift. This version does not include a Box-Cox transformation.
    • Including the Box-Cox transformation, the best ARIMA fit is ARIMA(0,1,2)(2,0,0)[12], which is a combination of non-seasonal MA(2), first-order non-seasonal differences, and seasonal AR(2).
    • We considered both model versions, but chose the first version without Box-Cox transformation as the preferred alternative. The second version was slightly more complex (total order of \(p+q+P+Q\) = 4 versus 3), had non-seasonal instead of seasonal differences, and produced slightly worse model residuals (with a few significant autocorrelations).
    • The ARIMA(1,0,0)(0,1,2)[12] with drift model (without Box-Cox transformation) generates model residuals that resemble white noise.
cat("Number of seasonal differences needed:", nsdiffs(powerts))
## Number of seasonal differences needed: 1
cat("Number of differences needed:" , ndiffs(diff(powerts, lag = 12))) 
## Number of differences needed: 0
# arima without box-cox <<< BEST
fit2.arima <- auto.arima(powerts, stepwise = FALSE, approximation = FALSE)
fit2.arima
## Series: powerts 
## ARIMA(1,0,0)(0,1,2)[12] with drift 
## 
## Coefficients:
##          ar1     sma1    sma2     drift
##       0.2527  -0.8945  0.1295  7695.751
## s.e.  0.0747   0.0816  0.0809  2067.653
## 
## sigma^2 estimated as 7.347e+11:  log likelihood=-2718.59
## AIC=5447.17   AICc=5447.52   BIC=5463.14
autoplot(fit2.arima)

checkresiduals(fit2.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,2)[12] with drift
## Q* = 11.761, df = 20, p-value = 0.924
## 
## Model df: 4.   Total lags used: 24
# arima with box-cox <<< DON'T USE
#fit2.arima <- auto.arima(powerts, lambda = "auto", stepwise = FALSE, approximation = FALSE)
#fit2.arima
#autoplot(fit2.arima)
#checkresiduals(fit2.arima)

Now let’s plot the powerts time series with forecasts produced by the ETS and ARIMA models. We include a close-up of the forecast period to compare the output of the two models. As we saw before (for the STL forecasts), the forecasts from the ETS and ARIMA models are almost indistinguishable.

fit2f.ets <- forecast(fit2.ets, h = 12)
fit2f.arima <- forecast(fit2.arima, h = 12)

# plot
autoplot(powerts) + 
    autolayer(fit2f.ets, PI = FALSE, series = "ETS") + 
    autolayer(fit2f.arima, PI = FALSE, series = "ARIMA") + 
    geom_vline(xintercept = 2014, lty = 3, lwd = 1) + 
    labs(title = "ETS & ARIMA forecast", 
         subtitle = "ETS(A,N,A) and ARIMA(1,0,0)(0,1,2)[12]", 
         x = "Year", 
         y = "KWH") 

# close-up
autoplot(powerts) + 
    autolayer(fit2f.ets, PI = FALSE, series = "ETS") + 
    autolayer(fit2f.arima, PI = FALSE, series = "ARIMA") + 
    geom_vline(xintercept = 2014, lty = 3, lwd = 1) + 
    labs(title = "Close-up of ETS & ARIMA forecast", 
         subtitle = "ETS(A,N,A) and ARIMA(1,0,0)(0,1,2)[12]", 
         x = "Year", 
         y = "KWH") + 
    coord_cartesian(xlim = c(2012, 2015), 
                    ylim = c(5e06, 11e06))

Approach #3: Validate & select best model

The forecasts generated by the ETS and ARIMA models are very close. In order to choose between the two models, let’s evaluate the models on a validation set and then select the better-performing model. We do this as follows:

  • Split the time series 75/25 into a training set spanning the first 12 years (1998/01 to 2009/12) and a validation set spanning the last 4 years (2010/01 to 2013/12)
  • Fit the chosen ETS and ARIMA model versions on the training data
  • Evaluate the prediction error on the validation data
  • Select the better-performing model as our final model.

We should highlight several points here:

  • First, we could use time series cross-validation as an alternative framework for model evaluation (a brief sketch follows this list). However, given that our task is prediction for the year 2014, we prefer to give greater weight to model performance in recent years (rather than equal weight to performance in all periods). This is the reason we choose a simple 75/25 split of the data with the validation set corresponding to the most recent 4-year period.
  • Second, when fitting the ETS and ARIMA models to the training data, we use the same model versions (ETS(A,N,A) with Box-Cox and ARIMA(1,0,0)(0,1,2)[12] with drift) as we fitted earlier on the entire dataset. We do this because we are testing these specific model varieties, rather than testing general ETS and ARIMA model fitting.
  • Third, once we choose the better-performing model based on the validation set, we will use that model version as fitted to the entire (100%) time series in order to make predictions for 2014.
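
For reference, the time series cross-validation alternative mentioned in the first bullet could be sketched as follows (illustrative only; we did not run it):

# sketch: rolling-origin CV errors for the chosen ARIMA specification
far <- function(x, h) {
  forecast(Arima(x, order = c(1, 0, 0), seasonal = c(0, 1, 2),
                 include.drift = TRUE), h = h)
}
e <- tsCV(powerts, far, h = 12)   # matrix of h-step-ahead errors by origin
sqrt(mean(e^2, na.rm = TRUE))     # overall RMSE across origins and horizons
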
# training set
train2 <- window(powerts, end = c(2009, 12))

# ets(ANA) with box-cox
fit2.ets2 <- ets(train2, model = "ANA", lambda = "auto")
fit2.ets2
## ETS(A,N,A) 
## 
## Call:
##  ets(y = train2, model = "ANA", lambda = "auto") 
## 
##   Box-Cox transformation: lambda= 1.6004 
## 
##   Smoothing parameters:
##     alpha = 0.0313 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 46834503418.7937 
##     s = -6682876559 -18425292716 -8034710750 15946508441 21155789995 15437056188
##            -2279293982 -16193138210 -13494776141 -6430908173 3257915458 15743726449
## 
##   sigma:  6714723135
## 
##      AIC     AICc      BIC 
## 7247.665 7251.415 7292.212
# arima(1,0,0)(0,1,2)[12] with drift
fit2.arima2 <- Arima(train2, order = c(1,0,0), seasonal = c(0,1,2), 
                    include.drift = TRUE)
fit2.arima2
## Series: train2 
## ARIMA(1,0,0)(0,1,2)[12] with drift 
## 
## Coefficients:
##          ar1     sma1    sma2     drift
##       0.1712  -0.7961  0.0697  3281.088
## s.e.  0.0939   0.1037  0.1154  1692.300
## 
## sigma^2 estimated as 2.985e+11:  log likelihood=-1934.04
## AIC=3878.08   AICc=3878.55   BIC=3892.49

Reviewing the RMSE, MAE, MAPE, and MASE error metrics on the validation set, we see that the ARIMA model has stronger predictive performance than the ETS model. Hence we select the ARIMA model for our forecast. Note that for forecasting purposes, we use the version of the ARIMA model that was fitted on the entire dataset.

# accuracy metrics
acc2.ets2 <- fit2.ets2 %>% forecast(h = 48) %>% accuracy(powerts)
acc2.arima2 <- fit2.arima2 %>% forecast(h = 48) %>% accuracy(powerts)
acc2.ets2[ , c(2,3,5,6)] %>% 
    kable(caption = "Error metrics for ETS(A,N,A) with Box-Cox")
Error metrics for ETS(A,N,A) with Box-Cox

                     RMSE        MAE      MAPE     MASE
Training set     512644.1   399909.3  6.407868 0.731715
Test set        1617775.6  1113038.5 30.760147 2.036529
acc2.arima2[ , c(2,3,5,6)] %>% 
    kable(caption = "Error metrics for ARIMA(1,0,0)(0,1,2)[12] with drift")
Error metrics for ARIMA(1,0,0)(0,1,2)[12] with drift

                     RMSE       MAE      MAPE      MASE
Training set       515115  379925.1  6.140061 0.6951498
Test set          1470451  946544.4 28.486439 1.7318946
# best model - use version fitted on full dataset
bestfit2 <- fit2.arima

Forecast

Finally we use our chosen model to produce monthly forecasts of residential power usage in 2014. We plot the forecast with 80% and 95% prediction intervals, and save the forecast to our Excel file.

# forecast 12 months
f2 <- forecast(bestfit2, h = 12)

# plot with prediction intervals
autoplot(powerts) + 
    autolayer(f2) + 
    geom_vline(xintercept = 2014, lty = 3, lwd = 1) + 
    labs(title = "Final forecast from ARIMA(1,0,0)(0,1,2)[12]", 
         x = "Year", 
         y = "KWH") 

# close-up
autoplot(powerts) + 
    autolayer(f2) + 
    geom_vline(xintercept = 2014, lty = 3, lwd = 1) + 
    labs(title = "Close-up of final forecast from ARIMA(1,0,0)(0,1,2)[12]", 
         x = "Year", 
         y = "KWH") + 
    coord_cartesian(xlim = c(2012, 2015), 
                    ylim = c(5e06, 11e06))

# forecast df for excel file
powerfc_df <- data.frame("YYYY-MMM" = paste0("2014-", month.abb), 
                         KWH = f2$mean, 
                         check.names = FALSE) # preserve the "YYYY-MMM" column name
# save to excel file <<< DONE AT END OF PART C
#write_xlsx(list(PowerForecast = powerfc_df), 
#           "KBenson_Proj1_Forecasts.xlsx")

Part C: Waterflow

In this part, we work with two datasets of time-stamped waterflow measurements. The first dataset (“Waterflow_Pipe1.xlsx”) includes measurements at various irregular times from 10/23/15 through 11/1/15. The second dataset (“Waterflow_Pipe2.xlsx”) includes regular hourly measurements from 10/23/15 through 12/3/15.

Data preparation

We start by merging the data into a combined dataset. In order to do this, we merge the data along a common time scale as follows:

  • All time stamps are rounded up to the nearest whole hour (e.g., 12:01AM, 12:15AM, and 12:59AM all round up to 1AM). This step only affects the measurements in the first dataset, as all measurements in the second dataset were taken hourly on the hour (e.g., 1AM, 2AM, etc.).
  • Measurements are then aggregated and grouped by the hour.
  • Measurements within each hour group are summarized by taking the mean.

The effect of this procedure is to combine measurements from either dataset occurring within the same hour (e.g., 12:01AM, 12:15AM, and 12:59AM) and then to take the average of such measurements. The resulting dataset is an hourly time series from 10/23/15 through 12/3/15, which includes two distinct time ranges:

  • From 10/23/15 1AM through 11/2/15 12AM, the data is based on the mean of the aggregate hourly measurements from the first and second datasets
  • From 11/2/15 1AM through 12/3/15 4PM, the data is identical to the measurements from the second dataset.

Once the data is merged, we define the time series of waterflow measurements, which we call waterts. Note that we specify a frequency of 24 for the time series, as we expect there may be a daily “seasonal” pattern for the hourly measurements.

Fortunately, there are no missing data and no outliers in the time series.

# load & parse data
file3a <- "Waterflow_Pipe1.xlsx"
file3b <- "Waterflow_Pipe2.xlsx"
raw3a <- read_excel(file3a, col_types = c("date", "numeric")) %>% 
    rename(DateTime = "Date Time") %>%
    mutate(Hour = ceiling_date(DateTime, unit = "hour"))
raw3b <- read_excel(file3b, col_types = c("date", "numeric")) %>% 
    rename(DateTime = "Date Time") %>%
    mutate(Hour = ceiling_date(DateTime, unit = "hour"))
raw3 <- rbind(raw3a, raw3b) %>% 
    group_by(Hour) %>% 
    summarise(Count = n(), Avg = mean(WaterFlow, na.rm = TRUE))

# define ts
waterts <- ts(raw3$Avg, frequency = 24)

# check for missing data & outliers
summary(waterts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.885  23.777  33.242  35.963  47.098  77.388
ggplot(data.frame(WaterFlow = as.numeric(waterts))) + 
  geom_histogram(aes(x = WaterFlow)) + 
  labs(title = "Histogram of waterts", 
       x = "Waterflow")

Data exploration

Let’s review the data using the plots below. We note several observations:

  • The distribution of measurements has a well-defined range (between 2 and 78, with mean and median in the low-to-mid 30s), and it appears to be slightly right-skewed (the mean exceeds the median)
  • Measurements are extremely “jumpy”, i.e., the time series is not a smooth curve
  • The time series does not have a trend-cycle, although the mean level over the first 10-12 days is lower than the mean level over the remaining days
  • The time series does not have a strong seasonal pattern
  • The measurement data show non-constant variance, with lower variance over the first 10-12 days and higher variance over the remaining days. During the modeling phase, we will consider a Box-Cox transformation to stabilize the variance, which is shown below (for \(\lambda = -0.64\))
  • The time series exhibits strong autocorrelations for lags of up to 4-5 days (96-120 hours).
# plots
autoplot(waterts) + 
  labs(title = "Time plot: waterts", 
       x = "Day", 
       y = "Waterflow")

ggseasonplot(waterts) + 
  labs(y = "Waterflow") + 
  guides(color=guide_legend(title = "Day"))

#ggseasonplot(waterts, polar = TRUE)
ggsubseriesplot(waterts) + 
  labs(title = "Seasonal subseries plot: waterts", 
       x = "Hour", 
       y = "Waterflow")

#gglagplot(waterts)
ggAcf(waterts, lag.max = 120) + 
  labs(title = "Autocorrelation plot: waterts")

# box-cox transformation & plot
cat("Box-Cox lambda parameter: ", BoxCox.lambda(waterts))
## Box-Cox lambda parameter:  -0.64441
autoplot(cbind(waterts = waterts, transformed = BoxCox(waterts, lambda = "auto")), 
         facets = TRUE) + 
    labs(title = "Time plot: waterts and transformed time series", 
         x = "Day", 
         y = "Waterflow")

Next we plot the STL decomposition to view the seasonal and trend components of the time series. As noted earlier, there is a noticeable increase after the first 10-12 days in both the mean level of the trend component and the variance of the seasonal component.

# STL decomposition
stl(waterts, s.window = 25) %>% autoplot() + 
    labs(title = "STL decomposition: waterts", 
         x = "Day")

Finally we consider the issue of whether the time series is stationary. In its raw form, the time series is not stationary since it fails the definition of stationarity, in which the statistical properties of the data should not depend on which time period is analyzed. For instance, we observed above that the measurement data has a lower mean level and lower variance within the first 10-12 days than over the remaining days.

In order to produce a stationary time series (which can be used, for instance, in an ARIMA model), we make the following adjustments to the data:

  • Apply Box-Cox transformation to stabilize the variance (with \(\lambda = -0.64\))
  • Take no seasonal differences
  • Take ordinary first-order differences.

We see from the plot below that the resulting time series appears to be stationary. Furthermore, the plot indicates that this may be a good candidate for ARIMA(0,1,1), as the pattern of the ACF (single spike at lag = 1) and the PACF (exponential decay profile) suggest a MA(1) model.

# adjust to make ts stationary & plot
waterts_bc <- BoxCox(waterts, lambda = "auto")
cat("Number of seasonal differences needed:", nsdiffs(waterts_bc))
## Number of seasonal differences needed: 0
cat("Number of differences needed:" , ndiffs(waterts_bc)) 
## Number of differences needed: 1
waterts_bc %>% diff() %>% ggtsdisplay() 

Modeling

To model and forecast the waterts time series, we try our usual 3 approaches:

  • Approach #1: fit STL model
  • Approach #2: fit ETS & ARIMA models
  • Approach #3: validate ETS/ARIMA & choose best model

Approach #1: Fit STL model

To start, we plot the STL forecast with ETS and ARIMA methods for the non-seasonal (seasonally adjusted) component, with the Box-Cox transformation. We see that the two forecasts are indistinguishable. However, these forecasts are not appropriate: the time series does not exhibit a strong seasonal pattern, yet the STLF approach re-seasonalizes the non-seasonal component using the most recent 24-hour cycle, imposing a pattern that the data do not support.

# stlf models
stlf3.ets <- stlf(waterts, lambda = "auto", method = "ets", h = 168) 
stlf3.arima <- stlf(waterts, lambda = "auto", method = "arima", h = 168) 

# plot
autoplot(waterts) + 
    autolayer(stlf3.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(stlf3.arima, PI = FALSE, series = "STLF-ARIMA") + 
    geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) + 
    labs(title = "STLF forecast using ETS & ARIMA", 
         x = "Day", 
         y = "Waterflow") 

# close-up
autoplot(waterts) + 
    autolayer(stlf3.ets, PI = FALSE, series = "STLF-ETS") + 
    autolayer(stlf3.arima, PI = FALSE, series = "STLF-ARIMA") + 
    geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) + 
    labs(title = "Close-up of STLF forecast using ETS & ARIMA", 
         x = "Day", 
         y = "Waterflow") + 
    coord_cartesian(xlim = c(35, 50))

Approach #2: Fit ETS & ARIMA models

Next we fit ETS and ARIMA models to the time series using the ets and auto.arima functions in R. The results are shown below, including model summaries and diagnostic plots for the residuals. Note that for both models we use the Box-Cox transformation with bias adjustment, since we want the point forecasts to represent the mean (rather than median) of the forecast distribution.
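
For reference, the bias-adjusted back-transformation (see Hyndman & Athanasopoulos, Forecasting: Principles and Practice, 2nd ed.) approximates the forecast mean on the original scale (for \(\lambda \neq 0\)) as

\[ \hat{y}_{T+h|T} = \left(\lambda \hat{w}_{T+h|T} + 1\right)^{1/\lambda} \left[ 1 + \frac{\sigma_h^2 (1-\lambda)}{2\left(\lambda \hat{w}_{T+h|T} + 1\right)^2} \right], \]

where \(\hat{w}_{T+h|T}\) is the point forecast on the transformed scale and \(\sigma_h^2\) is its forecast variance; without the adjustment, the back-transformed point forecast is the median of the forecast distribution.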

  • ETS(A,N,N)
    • This is a simple exponential smoothing model with additive errors.
    • Model residuals appear to be well-behaved and resemble white noise.
fit3.ets <- ets(waterts, lambda = "auto", biasadj = TRUE)
fit3.ets
## ETS(A,N,N) 
## 
## Call:
##  ets(y = waterts, lambda = "auto", biasadj = TRUE) 
## 
##   Box-Cox transformation: lambda= -0.6444 
## 
##   Smoothing parameters:
##     alpha = 0.0164 
## 
##   Initial states:
##     l = 1.3516 
## 
##   sigma:  0.0736
## 
##      AIC     AICc      BIC 
## 1693.762 1693.786 1708.485
autoplot(fit3.ets)

checkresiduals(fit3.ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 45.951, df = 46, p-value = 0.4743
## 
## Model df: 2.   Total lags used: 48
  • ARIMA(0,1,1)
    • This is a MA(1) model with first-order differencing, which is consistent with our observation from the time series display plot earlier.
    • Model residuals mostly resemble white noise; a couple of autocorrelations appear significant, but this may be spurious, as we would expect about 1 in 20 lags to appear significant by random chance.
fit3.arima <- auto.arima(waterts, lambda = "auto", biasadj = TRUE)
fit3.arima
## Series: waterts 
## ARIMA(0,1,1) 
## Box Cox transformation: lambda= -0.64441 
## 
## Coefficients:
##           ma1
##       -0.9825
## s.e.   0.0061
## 
## sigma^2 estimated as 0.005418:  log likelihood=1187.67
## AIC=-2371.34   AICc=-2371.33   BIC=-2361.53
autoplot(fit3.arima)

checkresiduals(fit3.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 46.043, df = 47, p-value = 0.5122
## 
## Model df: 1.   Total lags used: 48

Finally, let’s compare 7-day forecasts (7 * 24 = 168 hourly measurements) from the ETS and ARIMA models. In either case, the forecasts are flat and equivalent to a simple exponential smoothing model. The ETS model forecasts 46.9 whereas the ARIMA model forecasts 47.2.
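
The flat forecasts are expected: an ARIMA(0,1,1) model is algebraically equivalent to simple exponential smoothing with \(\alpha = 1 + \theta_1\). Here \(\theta_1 = -0.9825\) implies \(\alpha \approx 0.018\), which lines up closely with the ETS estimate of \(\alpha = 0.0164\).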

fit3f.ets <- forecast(fit3.ets, h = 168)
fit3f.arima <- forecast(fit3.arima, h = 168)

autoplot(waterts) + 
    autolayer(fit3f.ets, series = "ETS", PI = FALSE) + 
    autolayer(fit3f.arima, series = "ARIMA", PI = FALSE) + 
    geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) + 
    labs(title = "ETS & ARIMA forecast", 
         subtitle = "ETS(A,N,N) and ARIMA(0,1,1)", 
         x = "Day", 
         y = "Waterflow") 

cat("ETS(A,N,N) forecast:", fit3f.ets$mean[1])
## ETS(A,N,N) forecast: 46.89912
autoplot(waterts) + 
    autolayer(fit3f.ets, series = "ETS", PI = FALSE) + 
    autolayer(fit3f.arima, series = "ARIMA", PI = FALSE) + 
    geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) + 
    labs(title = "Close-up of ETS & ARIMA forecast", 
         subtitle = "ETS(A,N,N) and ARIMA(0,1,1)", 
         x = "Day", 
         y = "Waterflow") + 
    coord_cartesian(xlim = c(35, 50))

cat("ARIMA(0,1,1) forecast:", fit3f.arima$mean[1])
## ARIMA(0,1,1) forecast: 47.23156

Approach #3: Validate & select best model

To choose between the ETS and ARIMA models, we could conduct a validation exercise to test each model's predictive performance on a validation sample (or use time series cross-validation). However, in this case, a validation exercise would likely be overkill, since the two models generate almost identical forecasts:

  • The ETS and ARIMA models are equivalent model forms (simple exponential smoothing and ARIMA(0,1,1))
  • Both forecasts are flat with forecast values that are practically identical (46.9 vs. 47.2)
  • The forecast from each model lies well within the prediction interval of the other model, and further, the prediction intervals from both models are indistinguishable.

So for practical purposes the two forecasts are identical, and it makes no meaningful difference which model we choose. In this case, we select the simpler ETS model as our final forecasting model.

Forecast

Finally we use our chosen model to produce a 7-day forecast of hourly waterflow measurements. The 7-day forecast period is equivalent to 168 hours, starting at 12/3/15 5PM. We plot the forecast with 80% and 95% prediction intervals, and save the forecast to our Excel file. We note that the prediction intervals appear extremely wide, with upper bounds well above the historical range of measurement values. This likely arises from back-transforming the intervals from the Box-Cox scale (the intervals are computed by the forecast package, which fpp2 loads).
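
To see why, note that the inverse Box-Cox transform, \(y = (\lambda w + 1)^{1/\lambda}\), is extremely convex near the upper end of the transformed scale when \(\lambda\) is negative, so modest bounds on the transformed scale back-transform to very large values. A small illustration (the input values are hypothetical, chosen near the fitted level of 1.35 on the transformed scale):

# inverse Box-Cox blows up as w approaches -1/lambda (about 1.55 here)
InvBoxCox(c(1.35, 1.45, 1.50), lambda = -0.64441)
# returns roughly 24, 69, and 196 on the original scale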

bestfit3 <- fit3.ets
# forecast 168 hours
f3 <- forecast(bestfit3, h = 168)

# plot with prediction intervals
autoplot(waterts) + 
    autolayer(f3) + 
    geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) + 
    labs(title = "Final forecast from ETS(A,N,N)", 
         x = "Day", 
         y = "Waterflow") 

# close-up
autoplot(waterts) + 
    autolayer(f3) + 
    geom_vline(xintercept = 1 + 1000/24, lty = 3, lwd = 1) + 
    labs(title = "Close-up of final forecast from ETS(A,N,N)", 
         x = "Day", 
         y = "Waterflow") + 
    coord_cartesian(xlim = c(35, 50))

# forecast df for excel file
last_hour <- raw3$Hour[nrow(raw3)]  # last observed hour (12/3/15 4PM)
fctimes <- last_hour + hours(1:168)
waterfc_df <- data.frame("Date Time" = fctimes, 
                         WaterFlow = f3$mean, 
                         check.names = FALSE) # preserve the "Date Time" column name

# save all forecasts to excel file
write_xlsx(list(ATMForecast = cashfc_df, 
                PowerForecast = powerfc_df, 
                WaterflowForecast = waterfc_df), 
           "KBenson_Proj1_Forecasts.xlsx")