PART A

Data Exploration

First we will load and explore the data. We will also convert the data into a time series object (tsibble). The xlsx data was opened in Excel and exported as a CSV.

atm_raw <- read.csv("https://raw.githubusercontent.com/hdupre/DATA624/main/Project1/ATM624Data.csv")

head(atm_raw)
##                   DATE  ATM Cash
## 1 5/1/2009 12:00:00 AM ATM1   96
## 2 5/1/2009 12:00:00 AM ATM2  107
## 3 5/2/2009 12:00:00 AM ATM1   82
## 4 5/2/2009 12:00:00 AM ATM2   89
## 5 5/3/2009 12:00:00 AM ATM1   85
## 6 5/3/2009 12:00:00 AM ATM2   90
summary(atm_raw)
##      DATE               ATM                 Cash        
##  Length:1474        Length:1474        Min.   :    0.0  
##  Class :character   Class :character   1st Qu.:    0.5  
##  Mode  :character   Mode  :character   Median :   73.0  
##                                        Mean   :  155.6  
##                                        3rd Qu.:  114.0  
##                                        Max.   :10920.0  
##                                        NA's   :19

There are some NAs in the data.

There appear to be five records missing the cash value – the other fourteen are from May 2010 (the month to be forecasted) and aren’t tied to an ATM ID. We’ll remove those records and convert to tsibbles for each ATM. Gaps will be filled with the median value for ATM 1, 2 and 4 since they appear to be active nearly every day.

atm_df <- atm_raw

atm_df$ATM[atm_df$ATM==""] <- NA

atm_df <- atm_df %>%
  drop_na() %>%
  mutate(DATE = as_date(DATE, format="%m/%d/%Y 12:00:00 AM"))
  

atm1_ts <- atm_df %>%
  filter(ATM=="ATM1") %>%
  select(-ATM) %>%
  as_tsibble() %>%
  fill_gaps(Cash = median(Cash))
## Using `DATE` as index variable.
atm2_ts <- atm_df %>%
  filter(ATM=="ATM2") %>%
  select(-ATM) %>%
  as_tsibble() %>%
  fill_gaps(Cash = median(Cash))
## Using `DATE` as index variable.
atm3_ts <- atm_df %>%
  filter(ATM=="ATM3") %>%
  select(-ATM) %>%
  mutate(Cash = ifelse(Cash==0,NA,Cash)) %>%
  drop_na %>%
  as_tsibble()
## Using `DATE` as index variable.
atm4_ts <- atm_df %>%
  filter(ATM=="ATM4") %>%
  select(-ATM) %>%
  as_tsibble()
## Using `DATE` as index variable.
grid.arrange(
  atm1_ts %>%
  autoplot(Cash)+
    labs(title = "ATM1"),
  atm2_ts %>%
  autoplot(Cash)+
    labs(title = "ATM2"),
  atm3_ts %>%
  autoplot(Cash)+
    labs(title = "ATM3"),
  atm4_ts %>%
  autoplot(Cash)+
    labs(title = "ATM4"), nrow=2)

From these series plots we can see some seasonality in the data from ATMs 1 and 2. ATM 3 had no withdrawals until near the end of our time series. ATM 4 has a single, massive spike and it’s difficult to see on this plot whether the variation is seasonal.

ATM 1

atm1_ts %>%
  model(classical_decomposition(Cash,type="additive")) %>%
  components() %>%
  autoplot()
## Warning: Removed 3 row(s) containing missing values (geom_path).

atm1_ts %>%
  gg_tsdisplay(Cash, plot_type = "partial")

ATM 1 has strong indications of weekly seasonality. There is no apparent trend nor an increasing or decreasing variance. According to fpp3, a Holt-Winters’ method model may be used for “daily type data” (section 8.3).

We will compare against an auto ARIMA model (as I have found that the auto function generally performs best) and an SNAIVE model. I’m also throwing in a prophet model as it is designed for forecasting daily data with “strong seasonality” but performance may suffer as it prefers data that also has yearly seasonality and holiday effects, according to fpp3 (section 12.2).

fit1 <- atm1_ts %>%
  model(
    hw=ETS(Cash ~ error("M") + trend("Ad") + season("M")),
    arima=ARIMA(Cash, stepwise = FALSE, approx=FALSE),
    snaive=SNAIVE(Cash),
    prophet = prophet(Cash ~ season(period = 7, order = 3,
                                    type = "multiplicative"))
  )


glance(fit1) %>%
  arrange(AICc)
## # A tibble: 4 × 13
##   .model   sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE    MAE ar_roots ma_roots
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <list>   <list>  
## 1 arima   559.     -1641. 3290. 3290. 3305.   NA    NA  NA     <cpl>    <cpl>   
## 2 hw        0.137  -2274. 4573. 4574. 4624.  696.  709.  0.221 <NULL>   <NULL>  
## 3 snaive  774.        NA    NA    NA    NA    NA    NA  NA     <NULL>   <NULL>  
## 4 prophet  NA         NA    NA    NA    NA    NA    NA  NA     <NULL>   <NULL>  
## # … with 2 more variables: sigma <dbl>, changepoints <list>
accuracy(fit1)
## # A tibble: 4 × 10
##   .model  .type         ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>   <chr>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 hw      Training -0.752   26.4  16.3 -127.   143. 0.922 0.950  0.0810 
## 2 arima   Training -0.0660  23.3  14.6 -103.   118. 0.825 0.839 -0.00814
## 3 snaive  Training -0.0363  27.8  17.7  -96.5  116. 1     1      0.159  
## 4 prophet Training  0.0887  28.4  19.5 -140.   157. 1.10  1.02   0.131

The prophet model performed the worst based on RMSE. It is possible the time series is not long enough or that the “period”/“order” configuration isn’t optimal. The auto ARIMA model appears to have the lowest AICc and lowest RMSE. Let’s look at the residuals and features.

fit1 %>%
  select(arima) %>%
  gg_tsresiduals(lag=36)

Residuals appear to be consistent with white noise.

augment(fit1) %>%
  filter(.model=="arima") %>%
  features(.innov, ljung_box, lag=36, dof=7)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 arima     38.7     0.108

P-value is large enough to confirm that the residuals are similar to white noise. We will use this model to forecast ATM1 in the results section.

ATM 2

We will take a similar approach to ATM2 since the time series plots appear to share similar features.

atm2_ts %>%
  model(classical_decomposition(Cash,type="additive")) %>%
  components() %>%
  autoplot()
## Warning: Removed 3 row(s) containing missing values (geom_path).

atm2_ts %>%
  gg_tsdisplay(Cash, plot_type = "partial")

Much like the previous data, there is obvious weekly seasonality.

fit2 <- atm2_ts %>%
  model(
    hw=ETS(Cash ~ error("M") + trend("Ad") + season("M")),
    arima=ARIMA(Cash, stepwise = FALSE, approx=FALSE),
    snaive=SNAIVE(Cash),
    prophet = prophet(Cash ~ season(order = 3,
                                    type = "multiplicative"))
  )


glance(fit2) %>%
  arrange(AICc)
## # A tibble: 4 × 13
##   .model   sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE    MAE ar_roots ma_roots
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <list>   <list>  
## 1 arima   602.     -1653. 3319. 3319. 3342.   NA    NA  NA     <cpl>    <cpl>   
## 2 hw        0.387  -2388. 4801. 4802. 4852. 1171. 1172.  0.499 <NULL>   <NULL>  
## 3 snaive  910.        NA    NA    NA    NA    NA    NA  NA     <NULL>   <NULL>  
## 4 prophet  NA         NA    NA    NA    NA    NA    NA  NA     <NULL>   <NULL>  
## # … with 2 more variables: sigma <dbl>, changepoints <list>
accuracy(fit2)
## # A tibble: 4 × 10
##   .model  .type         ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 hw      Training  2.12    34.0  28.3 -269.  300. 1.36  1.13   0.0222 
## 2 arima   Training -0.885   24.1  17.0 -Inf   Inf  0.819 0.801 -0.00512
## 3 snaive  Training  0.0223  30.1  20.8 -Inf   Inf  1     1.00   0.0447 
## 4 prophet Training  0.656   31.0  23.9 -Inf   Inf  1.15  1.03   0.0276

Again the auto ARIMA has the lowest AICc and RMSE. Let’s look at the residuals to ensure the forecasts would be viable.

fit2 %>%
  select(arima) %>%
  gg_tsresiduals(lag=36)

augment(fit2) %>%
  filter(.model=="arima") %>%
  features(.innov, ljung_box, lag=24, dof=7)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 arima     22.3     0.174

While the residuals do have a significant spike this appears to be within the tolerances of autocorrelation in white noise. The innovation residuals have a zero mean.

ATM 3

ATM 3 has only three observations – a naive forecast keeps things simple and will likely be accurate enough for the amount of data available.

fit3 <- atm3_ts %>%
  model(NAIVE(Cash))

ATM 4

ATM 4’s most obvious issue is the outlier. Googling the cash an ATM can hold comes up with an approximation of $200,000. This ATM had over $1 million withdrawn in one day. Let’s handle the outlier and then take another look at the data.

atm4_ts_decomp <- atm4_ts %>%
  model(STL(Cash ~ season(period=7), robust=TRUE)) %>%
  components()

atm4_ts_decomp %>%
  autoplot()

outliers <- atm4_ts_decomp %>%
  filter(
    remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
    remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
  )

outliers
## # A dable: 3 x 7 [1D]
## # Key:     .model [1]
## # :        Cash = trend + season_7 + remainder
##   .model                 DATE        Cash trend season_7 remainder season_adjust
##   <chr>                  <date>     <int> <dbl>    <dbl>     <dbl>         <dbl>
## 1 STL(Cash ~ season(per… 2009-09-22  1712  325.   -142.      1529.         1854.
## 2 STL(Cash ~ season(per… 2010-01-29  1575  455.     35.3     1084.         1540.
## 3 STL(Cash ~ season(per… 2010-02-09 10920  299.    -64.8    10686.        10985.

This methodology of finding outliers from fpp3 (section 13.9) found two more outliers in addition to the large spike. We will remove them and then fill the gaps with a median value.

atm4_ts <- atm4_ts %>%
  anti_join(outliers) %>%
  fill_gaps(Cash = median(Cash))
## Joining, by = c("DATE", "Cash")
atm4_ts %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`

This plot looks much better – roughly stationary, consistent variance indicating no need for a transformation, no real identifiable seasonality. We will try a few different forecasting methods. ARIMA has performed well so far and simple exponential smoothing as it is suitable for data with no clear trend or seasonal pattern. A method with drift won’t appear necessary due to lack of trend, so perhaps a plain naive forecast can provide a sort of baseline.

We will try a stepwise ARIMA model as well, to see if there’s a performance difference.

fit4 <- atm4_ts %>%
  model(naive = NAIVE(Cash),
        ets = ETS(Cash ~ error("A") + trend("N") + season("N")),
        stepwise = ARIMA(Cash),
        search = ARIMA(Cash, stepwise=FALSE)
  )

glance(fit4) %>%
  arrange(AICc)
## # A tibble: 4 × 11
##   .model    sigma2 log_lik   AIC  AICc   BIC     MSE    AMSE   MAE ar_roots  
##   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl> <list>    
## 1 search   108934.  -2631. 5279. 5279. 5310.     NA      NA    NA  <cpl [10]>
## 2 stepwise 110435.  -2636. 5280. 5280. 5295.     NA      NA    NA  <cpl [14]>
## 3 ets      115515.  -3203. 6412. 6412. 6424. 114882. 114982.  285. <NULL>    
## 4 naive    213124.     NA    NA    NA    NA      NA      NA    NA  <NULL>    
## # … with 1 more variable: ma_roots <list>
accuracy(fit4)
## # A tibble: 4 × 10
##   .model   .type         ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>    <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 naive    Training -0.810   461.  377. -459.  519. 1.12  1.05  -0.465  
## 2 ets      Training -0.0364  339.  285. -539.  572. 0.843 0.770  0.0760 
## 3 stepwise Training -0.202   331.  274. -502.  534. 0.811 0.751  0.0837 
## 4 search   Training  0.0620  327.  272. -477.  510. 0.806 0.742  0.00382

The non-stepwise ARIMA again has the better AICc and RMSE. Now for a residual check.

fit4 %>%
  select(search) %>%
  gg_tsresiduals(lag=24)

augment(fit4) %>%
  filter(.model=="search") %>%
  features(.innov, ljung_box, lag=24, dof=7)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 search    23.8     0.124

A significant p-value and a lack of correlation that is consistent with white noise. The zero mean requirement in the innov residuals is met as is the optional “constant variance” requirement. The histogram does not appear normal so perhaps a Box-Cox transformation could have helped.

Forecast and Results

grid.arrange(
  fit1 %>%
    select(arima) %>%
    forecast(h=31) %>%
    autoplot(atm1_ts)+
    labs(title = "ATM1"),
  fit2 %>%
    select(arima) %>%
    forecast(h=31) %>%
    autoplot(atm2_ts)+
    labs(title = "ATM2"),
  fit3 %>%
    forecast(h=31) %>%
    autoplot(atm3_ts)+
    labs(title = "ATM3"),
  fit4 %>%
    select(search) %>%
    forecast(h=31) %>%
    autoplot(atm4_ts)+
    labs(title = "ATM4"), 
  nrow=2)

forecast1 <- fit1 %>%
  select(arima) %>%
  forecast(h=31) %>%
  as.data.frame %>%
  select(DATE, .mean)

forecast2 <- fit2 %>%
  select(arima) %>%
  forecast(h=31) %>%
  as.data.frame %>%
  select(DATE, .mean)

forecast3 <- fit3 %>%
  forecast(h=31) %>%
  as.data.frame %>%
  select(DATE, .mean)

forecast4 <- fit4 %>%
  select(search) %>%
  forecast(h=31) %>%
  as.data.frame %>%
  select(DATE, .mean)

dataset_names <- list('ATM1' = forecast1, 'ATM2' = forecast2, 'ATM3' = forecast3, 'ATM4' = forecast4)

write.xlsx(dataset_names, file = 'atm_forecasts_2010_05.xlsx')

Part B

Load data, exploration, transformation

power_raw <- read.csv("https://raw.githubusercontent.com/hdupre/DATA624/main/Project1/ResidentialCustomerForecastLoad-624.csv")

head(power_raw)
##   CaseSequence YYYY.MMM     KWH
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147
summary(power_raw)
##   CaseSequence     YYYY.MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

There is one NA on Sept 2008, so that will be dealt with by assigning the mean of the other September values.

power_df <- power_raw

sep2008_mean <- power_df%>%
  filter(grepl('Sep', YYYY.MMM)) %>%
  drop_na() %>%
  summarise(mean(KWH)) %>%
  as.numeric()

power_df$KWH[is.na(power_df$KWH)] <- sep2008_mean

power_ts <- power_df %>%
  mutate(date = yearmonth(YYYY.MMM)) %>%
  select(date,KWH) %>%
  as_tsibble(index=date)

  
power_ts %>%
  autoplot() +
  labs(title = "KWH")
## Plot variable not specified, automatically selected `.vars = KWH`

The seasonality of these data should be investigated so we know what periods to assign in the models – it doesn’t appear to be yearly. Also, the outlier must be dealt with, and variance seems to “squeeze” smaller in the middle of the plot, so a Box-Cox transformation could help.

power_ts %>%
  gg_tsdisplay(KWH, plot_type = "histogram")

There is a pronounced spike every six months, which can be useful when selecting a model.

We will now deal with the outlier and then explore transforming the data.

power_ts_decomp <- power_ts %>%
  model(STL(KWH,robust=TRUE)) %>%
  components()

power_ts_decomp %>%
  autoplot()

outliers <- power_ts_decomp %>%
  filter(
    remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
    remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
  )

outliers
## # A dable: 2 x 7 [1M]
## # Key:     .model [1]
## # :        KWH = trend + season_year + remainder
##   .model                  date    KWH  trend season_year remainder season_adjust
##   <chr>                  <mth>  <dbl>  <dbl>       <dbl>     <dbl>         <dbl>
## 1 STL(KWH, robust = … 2010 Jul 7.71e5 6.86e6    1277186. -7369513.      -506663.
## 2 STL(KWH, robust = … 2013 Dec 9.61e6 7.34e6    -256378.  2526281.      9862682.

July 2010 has the low outlier. The July 2010 outlier seems like it’s 1/10 of its real value, so this may be a recording error we’ll simply multiply it by ten.

The STL decomposition also indicates an upward trend that was more difficult to see before. Methods with trend, such as Holt’s linear trend method, may be useful in that case though we have to remember the seasonality.

power_ts <- power_ts %>%
  mutate(KWH = ifelse(KWH<1000000,KWH*10,KWH))

power_ts %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`

These data are ready for forecasting with a Box-Cox transformation applied to KWH.

Forecast and Results

lambda <- BoxCox.lambda(power_ts$KWH, method = "guerrero")


fit <- power_ts %>%
  model(
    AAN = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("N")),
    additive = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") +
                                                season("A")),
    multiplicative = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") +
                                                season("M")),
    stepwise = ARIMA(box_cox(KWH,lambda)),
    search = ARIMA(box_cox(KWH,lambda), stepwise=FALSE)
  )


glance(fit) %>%
  arrange(AICc)
## # A tibble: 5 × 11
##   .model          sigma2 log_lik    AIC   AICc    BIC      MSE     AMSE      MAE
##   <chr>            <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
## 1 search         1.45e-5    757. -1497. -1496. -1472. NA       NA       NA      
## 2 stepwise       1.52e-5    751. -1492. -1492. -1476. NA       NA       NA      
## 3 additive       1.41e-5    576. -1117. -1114. -1062.  1.30e-5  1.35e-5  2.78e-3
## 4 multiplicative 6.48e-7    575. -1116. -1112. -1060.  1.31e-5  1.34e-5  6.11e-4
## 5 AAN            6.77e-5    419.  -828.  -828.  -812.  6.63e-5  6.66e-5  7.08e-3
## # … with 2 more variables: ar_roots <list>, ma_roots <list>
accuracy(fit)
## # A tibble: 5 × 10
##   .model         .type         ME     RMSE    MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>          <chr>      <dbl>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAN            Training 134633. 1335239. 1.14e6 -1.79  17.6  1.87  1.65  0.442
## 2 additive       Training  28260.  615101. 4.55e5 -0.298  6.84 0.747 0.759 0.260
## 3 multiplicative Training  33836.  624097. 4.69e5 -0.224  7.05 0.770 0.770 0.261
## 4 stepwise       Training  49164.  627621. 4.77e5  0.174  7.25 0.784 0.775 0.100
## 5 search         Training  48686.  615191. 4.63e5  0.207  7.02 0.760 0.759 0.108

Once again, the search ARIMA model has the lowest AICc. It also has one of the lowest RMSEs, though the Holt-Winters’ additive method has a lower RMSE.

fit %>%
    select(search) %>%
    forecast(h=12) %>%
    autoplot(power_ts)+
    labs(title = "KWH")

forecast <- fit %>%
  select(search) %>%
  forecast(h=12) %>%
  as.data.frame %>%
  select(date, .mean)

dataset_names <- list('KWH' = forecast)

write.xlsx(dataset_names, file = 'power_forecasts_2014.xlsx')