ATM Withdrawal Forecasting

Problem Statement

Given historical data of daily withdrawals from 4 different ATMs, forecast the withdrawal amounts for the next month (May 2010).

Data Import and Cleansing

Import

First, we load the data. The given Excel file was converted to a CSV to avoid trouble with how Microsoft stores dates.

# Excel file was saved as CSV to avoid issues with how Microsoft stores dates
rawATM <- read_csv("ATM624Data.csv")

head(rawATM)
## # A tibble: 6 x 3
##   DATE                 ATM    Cash
##   <chr>                <chr> <dbl>
## 1 5/1/2009 12:00:00 AM ATM1     96
## 2 5/1/2009 12:00:00 AM ATM2    107
## 3 5/2/2009 12:00:00 AM ATM1     82
## 4 5/2/2009 12:00:00 AM ATM2     89
## 5 5/3/2009 12:00:00 AM ATM1     85
## 6 5/3/2009 12:00:00 AM ATM2     90

The data was imported as two character columns and a double. Before we begin cleaning, we will copy the data into a new data frame so the original stays as it was imported.

atm <- rawATM

Cleaning

First, we do some basic cleansing: convert the DATE column to a proper date type and rename the columns.

# Change DATE column to a proper date format
atm$DATE <- as.Date(mdy_hms(atm$DATE, tz="EST"))

# Change column names
names(atm) <- c("date","machine","amount")

head(atm)
## # A tibble: 6 x 3
##   date       machine amount
##   <date>     <chr>    <dbl>
## 1 2009-05-01 ATM1        96
## 2 2009-05-01 ATM2       107
## 3 2009-05-02 ATM1        82
## 4 2009-05-02 ATM2        89
## 5 2009-05-03 ATM1        85
## 6 2009-05-03 ATM2        90

Next I check for missing values:

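# Count the NA values in each column and assemble the counts into one row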
missing <- atm %>% filter(is.na(date)) %>% nrow() %>% tibble(date = .)
missing <- atm %>% filter(is.na(machine)) %>% nrow() %>%
  tibble(machine = .) %>% cbind(missing,.)
missing <- atm %>% filter(is.na(amount)) %>% nrow() %>%
  tibble(amount = .) %>% cbind(missing,.)

rownames(missing) <- "NA Values"

kbl(missing) %>% kable_paper("striped", full_width = F, position="left") %>%
  add_header_above(c("Column Name" = 4))
Column Name
            date   machine   amount
NA Values      0        14       19

We have 19 records where no amount is recorded and 14 records where no machine is recorded. We’ll look at which records those are:

atm %>% filter(is.na(amount) | is.na(machine))
## # A tibble: 19 x 3
##    date       machine amount
##    <date>     <chr>    <dbl>
##  1 2009-06-13 ATM1        NA
##  2 2009-06-16 ATM1        NA
##  3 2009-06-18 ATM2        NA
##  4 2009-06-22 ATM1        NA
##  5 2009-06-24 ATM2        NA
##  6 2010-05-01 <NA>        NA
##  7 2010-05-02 <NA>        NA
##  8 2010-05-03 <NA>        NA
##  9 2010-05-04 <NA>        NA
## 10 2010-05-05 <NA>        NA
## 11 2010-05-06 <NA>        NA
## 12 2010-05-07 <NA>        NA
## 13 2010-05-08 <NA>        NA
## 14 2010-05-09 <NA>        NA
## 15 2010-05-10 <NA>        NA
## 16 2010-05-11 <NA>        NA
## 17 2010-05-12 <NA>        NA
## 18 2010-05-13 <NA>        NA
## 19 2010-05-14 <NA>        NA

Looking at the output, the 14 records missing machine are also missing amount, and they fall exactly on the dates we are supposed to forecast (May 2010). So, we’ll remove those rows from the data set.

# Drop rows where both amount and machine are NA
atm <- atm %>% filter(!(is.na(amount) & is.na(machine)))

# Re-check
atm %>% filter(is.na(amount))
## # A tibble: 5 x 3
##   date       machine amount
##   <date>     <chr>    <dbl>
## 1 2009-06-13 ATM1        NA
## 2 2009-06-16 ATM1        NA
## 3 2009-06-18 ATM2        NA
## 4 2009-06-22 ATM1        NA
## 5 2009-06-24 ATM2        NA

What’s left are missing amounts on 5 dates, which we’ll need to impute.

Imputation

For the missing values of amount, we will impute using the amount recorded by the same machine on the same day of the week, one week prior.

First we add the day of the week:

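# wday() returns 1 (Sunday) through 7 (Saturday) by default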
atm$weekday <- atm$date %>% wday()

Now we look up that prior value for each gap:

# Function to return the value of the amount from 7 days previous for the
#   same machine.
lastWeekday <- function(d, m){
  d <- d-7
  i <- atm %>% filter(date == d, machine == m) %>%
    select(amount) %>% as.double()
  return(i)
}

lastWeekday <- Vectorize(lastWeekday)

atm[is.na(atm$amount),"amount"] <- atm %>% filter(is.na(amount)) %>%
  mutate(imputed = lastWeekday(date, machine)) %>% select(imputed)

Next we search for outliers. We’ll start by plotting the data:

atm %>% ggplot(aes(x=date,y=amount,color=machine)) + geom_line() +
  facet_wrap(machine ~ ., scales = "free") +
  theme_light() +
  scale_x_date(date_breaks = "3 months", date_labels = "%b %y") +
  scale_color_discrete(name="Machine") +
  labs(title="Cash Withdrawals by ATM", y="Amount (100's of $)",
       x="Date")

Two issues can be seen in the graphs above. The first is that ATM3 has only a few non-zero data points; forecasting from those data will be nearly impossible to do accurately.

The second issue is that ATM4 appears to have a single, large outlier at date 2010-02-09. Let’s look more closely at that.

days <- c("Sunday","Monday","Tuesday","Wednesday","Thursday",
          "Friday","Saturday")

atm %>% filter(machine == "ATM4") %>% ggplot(aes(x=date, y=amount)) +
  geom_line() + theme_light() +
  theme(axis.text.x = element_blank()) + 
  facet_wrap(ordered(weekday,labels=days) ~ .,scales="free_y") +
  labs(title="ATM4", subtitle="By Day of Week", x=NULL, y= NULL)

We can see that the suspected outlier was recorded on a Tuesday and is well above the other values recorded on a Tuesday (or any other weekday for that matter).

If we look at the data points surrounding it, that outlier value does not look like a part of a localized trend (say, in advance of Valentine’s Day):

atm %>% filter(date <= as.Date('2010-03-09'),
               date >= as.Date('2010-01-09'),
               machine=="ATM4") %>%
  ggplot(aes(x=date,y=amount)) + geom_line() +
  theme_light() +
  scale_x_date(date_breaks = "1 week", date_labels = "%b %d") +
  labs(title="Cash Withdrawals at ATM4", 
       subtitle="01/09/10 - 03/09/10",
       y="Amount (100's of $)",
       x="Date")

So, we assume this is some sort of data entry issue and will replace this value with the mean of the Tuesday amounts for ATM4.

# Mean value withdrawn on Tuesdays at ATM4
replacement <- atm %>% filter(machine=="ATM4",weekday==3) %>% summarize(mean(amount))

# Replace outlier value
atm[which(atm$date==as.Date("2010-02-09") & atm$machine=="ATM4"),"amount"] <- replacement

Splitting

Now that we have the data more or less how we want it in terms of data types, we will split each machine into its own data set and convert each to a weekly time series.

atm1 <- atm %>% filter(machine=="ATM1") %>% select(amount) %>%
  ts(frequency = 7, start=c(strftime(as.Date("2009-05-01"),'%w'),
                            strftime(as.Date("2009-05-01"),'%W')))

atm2 <- atm %>% filter(machine=="ATM2") %>% select(amount) %>%
  ts(frequency = 7, start=c(strftime(as.Date("2009-05-01"),'%w'),
                            strftime(as.Date("2009-05-01"),'%W')))

atm3 <- atm %>% filter(machine=="ATM3") %>% select(amount) %>%
  ts(frequency = 7, start=c(strftime(as.Date("2009-05-01"),'%w'),
                            strftime(as.Date("2009-05-01"),'%W')))

atm4 <- atm %>% filter(machine=="ATM4") %>% select(amount) %>%
  ts(frequency = 7, start=c(strftime(as.Date("2009-05-01"),'%w'),
                            strftime(as.Date("2009-05-01"),'%W')))
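
The four blocks above differ only in the machine name; a small helper function (a sketch, behavior-equivalent to the code above) would reduce the repetition:

# Sketch of a helper equivalent to the four conversions above
make_weekly_ts <- function(m) {
  atm %>% filter(machine == m) %>% select(amount) %>%
    ts(frequency = 7, start=c(strftime(as.Date("2009-05-01"),'%w'),
                              strftime(as.Date("2009-05-01"),'%W')))
}
# e.g., atm1 <- make_weekly_ts("ATM1")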

Exploratory Analysis

First, let’s plot the series for each ATM:

atm1 %>% autoplot() + theme_light() +
  labs(title="Withdrawals from ATM1", y="Amount (100's of $)",
       x="Week")

atm2 %>% autoplot() + theme_light() +
  labs(title="Withdrawals from ATM2", y="Amount (100's of $)",
       x="Week")

atm3 %>% autoplot() + theme_light() +
  labs(title="Withdrawals from ATM3", y="Amount (100's of $)",
       x="Week")

atm4 %>% autoplot() + theme_light() +
  labs(title="Withdrawals from ATM4", y="Amount (100's of $)",
       x="Week")

  • ATM1 appears to have some weekly seasonality.

  • ATM2 also appears to have a similar weekly seasonality.

  • ATM3, as previously discussed, does not have enough data to really forecast well.

  • ATM4, unlike the other machines’ data, appears to be rather random, like white noise.

Modeling

Now we’ll look at models for each ATM, starting with ATM1.

ATM1

First we’ll try to stabilize the series with a Box-Cox transform:

l <- BoxCox.lambda(atm1)
atm1.trans <- BoxCox(atm1,lambda=l)

autoplot(cbind(original = atm1, transformed=atm1.trans), facets=T) +
  theme_light() +
  labs(title="Withdrawals from ATM1", y="Amount (100's of $)",
       x="Week")

Looking at the plots above, the Box-Cox transform stabilizes some of the variance in the data using \(\lambda =\) 0.25.
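
For reference, the Box-Cox transformation used here is

\[
w_t = \begin{cases} \ln(y_t) & \lambda = 0 \\ \left(y_t^\lambda - 1\right)/\lambda & \lambda \neq 0 \end{cases}
\]

so with \(\lambda = 0.25\) the series is pulled toward a fourth-root scale, compressing the larger spikes.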

Next we’ll split the data sets into a training set and a testing set, holding out 60 days for testing and the rest for training the models:

atm1.train <- subset(atm1.trans,end=length(atm1.trans)-60)
atm1.test <- subset(atm1.trans, start=length(atm1.trans)-59)

Next, we’ll look at various models to see how they perform on the training data set.

We’ll begin using a seasonal naive forecast:

atm1.snaive <- snaive(atm1.train,h=31, biasadj=T)

accuracy(atm1.snaive)
##                       ME     RMSE       MAE  MPE MAPE MASE      ACF1
## Training set 0.004699424 1.632284 0.8453411 -Inf  Inf    1 0.1045403
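
Recall the definition of MASE for a seasonal series, with seasonal period \(m = 7\) here:

\[
\text{MASE} = \frac{\text{MAE}}{\dfrac{1}{T-m}\sum_{t=m+1}^{T}\lvert y_t - y_{t-m}\rvert}
\]

The denominator is the in-sample MAE of the seasonal naive method itself.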

The MPE and MAPE are reported as -Inf/Inf, most likely because the transformed series contains zero values, which makes the percentage errors undefined; the RMSE and MASE remain interpretable. A training MASE of exactly 1 is expected here, since the seasonal naive forecast is its own scaling benchmark. Let’s look at the plot:

autoplot(atm1.snaive, PI=F) + theme_light() +
  labs(title="Withdrawals from ATM1",
       subtitle = "Seasonal Naive Model",
       y="Amount (100's of $)",
       x="Week")

As expected, the snaive() model simply repeats each day’s value from the last observed week, carrying the seasonal pattern forward.

Next, we’ll try the Holt-Winters method using additive seasonality:

# Use Holt-Winters method
atm1.hw <- hw(atm1.train, seasonal = "additive", h=31, biasadj = T)

accuracy(atm1.hw)
##                        ME     RMSE       MAE  MPE MAPE      MASE      ACF1
## Training set -0.000598947 1.308741 0.6893755 -Inf  Inf 0.8154998 0.1040176

Our RMSE and MASE accuracy measures both decreased from the seasonal naive method above.

autoplot(atm1.hw, PI=F, series = "Holt-Winters") + theme_light() +
  labs(title="Withdrawals from ATM1",
       subtitle = "Holt-Winters Seasonal Model",
       y="Amount (100's of $)",
       x="Week")

The predictions from Holt-Winters are flatter than the seasonal naive forecast and don’t capture some of the positive spikes seen in the past data. That may be fine, though, as a smoother fit may generalize better on unseen data.

Finally, we’ll try a seasonal ARIMA model:

atm1.arima <- auto.arima(atm1.train)

atm1.arima
## Series: atm1.train 
## ARIMA(0,0,2)(0,1,1)[7] 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1282  -0.0945  -0.8596
## s.e.  0.0579   0.0570   0.0493
## 
## sigma^2 estimated as 1.726:  log likelihood=-507.35
## AIC=1022.7   AICc=1022.84   BIC=1037.49

The auto.arima() function chose an ARIMA\((0,0,2)(0,1,1)_{[7]}\) model. This makes sense: day-to-day withdrawals show little persistent correlation, while the weekly pattern calls for a seasonal difference with a seasonal MA correction.
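
In backshift notation the fitted model is

\[
(1 - B^7)\,y_t = (1 + 0.128B - 0.095B^2)(1 - 0.860B^7)\,\varepsilon_t ,
\]

a seasonal difference with small non-seasonal MA terms and a strong seasonal MA term.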

accuracy(atm1.arima)
##                      ME     RMSE       MAE  MPE MAPE      MASE         ACF1
## Training set 0.02862017 1.291929 0.6996114 -Inf  Inf 0.8276084 -0.001612681

The RMSE is only slightly lower with this ARIMA model and the MASE is actually a bit worse, when compared to the Holt-Winters model.

atm1.pred <- forecast(atm1.arima,h=31)

autoplot(atm1.pred, PI=F) + theme_light() +
  labs(title="Withdrawals from ATM1",
       subtitle = "ARIMA(0,0,2)(0,1,1)[7] Model",
       y="Amount (100's of $)",
       x="Week")

The ARIMA model predictions strongly resemble the Holt-Winters model above. Let’s look at the residuals:

checkresiduals(atm1.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 7.7234, df = 11, p-value = 0.7378
## 
## Model df: 3.   Total lags used: 14

The residuals are not perfectly normal, but the Ljung-Box test confirms there is no significant autocorrelation.

Now, we will evaluate the accuracy measures with the test set:

# Check test set accuracy measures
testMeasures <- cbind(method = "Seasonal Naive", 
                      data.frame(
                        as.list(accuracy(atm1.snaive, atm1.test)["Test set",])
                      ))

testMeasures <- rbind(testMeasures,
  cbind(method = "Holt-Winters", 
                      data.frame(
                        as.list(accuracy(atm1.hw, atm1.test)["Test set",])
                      ))
)

# Apply the training fit to the test data (one-step forecasts, no re-estimation)
atm1.pred <- Arima(atm1.test, model=atm1.arima)

testMeasures <- rbind(testMeasures,
  cbind(method = "ARIMA", 
                      data.frame(
                        as.list(accuracy(atm1.pred)["Training set",])
                      ), Theil.s.U = NA))

testMeasures %>% select(method, RMSE, MASE) %>%
  kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
method            RMSE       MASE
Seasonal Naive    1.2043282  1.1263666
Holt-Winters      2.8380034  2.3113934
ARIMA             0.5211845  0.7523489
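
One caveat on this comparison: re-applying the fitted model to the test set via Arima(model = ...) scores one-step-ahead forecasts, while the Seasonal Naive and Holt-Winters rows are scored on true multi-step forecasts, which flatters the ARIMA numbers. A like-for-like multi-step check could be run as follows (a sketch; it forecasts 60 steps from the training fit rather than refitting):

# Multi-step alternative: score 60-step-ahead forecasts against the test set
accuracy(forecast(atm1.arima, h = 60), atm1.test)["Test set", ]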

Even with that caveat, the ARIMA model generalizes well: it has the best error metrics on the testing data set. So we will use that model for our predicted values for May 2010:

atm1.forecast <- forecast(atm1, model = atm1.arima, h=31, biasadj = T)

atm1.forecast %>%
  kable() %>% kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "100%", height = "200px")
Week      Point Forecast  Lo 80     Hi 80     Lo 95     Hi 95
59.42857 88.54968 86.86618 90.23319 85.97499 91.12438
59.57143 104.26506 102.56778 105.96234 101.66929 106.86083
59.71429 74.26779 72.56307 75.97250 71.66064 76.87493
59.85714 27.14883 25.44411 28.85355 24.54169 29.75597
60.00000 96.76130 95.05659 98.46602 94.15416 99.36844
60.14286 73.60436 71.89964 75.30908 70.99722 76.21150
60.28571 91.53222 89.82750 93.23693 88.92507 94.13936
60.42857 90.03139 88.31036 91.75242 87.39930 92.66348
60.57143 103.59256 101.87126 105.31385 100.96006 106.22505
60.71429 74.26779 72.54634 75.98923 71.63507 76.90050
60.85714 27.14883 25.42739 28.87027 24.51611 29.78155
61.00000 96.76130 95.03986 98.48275 94.12858 99.39402
61.14286 73.60436 71.88292 75.32580 70.97164 76.23708
61.28571 91.53222 89.81077 93.25366 88.89950 94.16493
61.42857 90.03139 88.29379 91.76899 87.37397 92.68882
61.57143 103.59256 101.85469 105.33042 100.93472 106.25039
61.71429 74.26779 72.52978 76.00579 71.60974 76.92584
61.85714 27.14883 25.41082 28.88684 24.49078 29.80688
62.00000 96.76130 95.02330 98.49931 94.10325 99.41935
62.14286 73.60436 71.86635 75.34236 70.94631 76.26241
62.28571 91.53222 89.79421 93.27022 88.87417 94.19027
62.42857 90.03139 88.27738 91.78540 87.34887 92.71392
62.57143 103.59256 101.83828 105.34683 100.90963 106.27548
62.71429 74.26779 72.51337 76.02220 71.58464 76.95093
62.85714 27.14883 25.39442 28.90324 24.46569 29.83197
63.00000 96.76130 95.00689 98.51572 94.07816 99.44445
63.14286 73.60436 71.84995 75.35877 70.92122 76.28750
63.28571 91.53222 89.77780 93.28663 88.84907 94.21536
63.42857 90.03139 88.26113 91.80166 87.32400 92.73878
63.57143 103.59256 101.82203 105.36308 100.88477 106.30034
63.71429 74.26779 72.49712 76.03845 71.55978 76.97579
# Output point forecasts
output <- cbind(date = seq(from=as.Date("2010-05-01"),
                           to=as.Date("2010-05-31"),by=1),
      point_forecast = as.data.frame(atm1.forecast$mean))

write.xlsx(output,"forecast_ATM1.xlsx")

autoplot(atm1.forecast, PI=T) +
  theme_light() +
  labs(title="Withdrawals from ATM1",
       subtitle = "ARIMA(0,0,2)(0,1,1)[7] Forecasts",
       y="Amount (100's of $)",
       x="Week")

ATM2

As we did for ATM1 above, we’ll try to stabilize the series with a Box-Cox transform:

l <- BoxCox.lambda(atm2)
atm2.trans <- BoxCox(atm2,lambda=l)

autoplot(cbind(original = atm2, transformed=atm2.trans), facets=T) +
  theme_light() +
  labs(title="Withdrawals from ATM2", y="Amount (100's of $)",
       x="Week")

There seems to be some stability gained by the transform, so we’ll keep it. As before, we split the data into train and test sets.

atm2.train <- subset(atm2.trans,end=length(atm2.trans)-60)
atm2.test <- subset(atm2.trans, start=length(atm2.trans)-59)

We’ll again begin using a seasonal naive forecast:

atm2.snaive <- snaive(atm2.train,h=31, biasadj=T)

accuracy(atm2.snaive)
##                       ME    RMSE      MAE  MPE MAPE MASE       ACF1
## Training set -0.05817628 8.21813 5.634022 -Inf  Inf    1 0.09936693
autoplot(atm2.snaive, PI = F) +
  theme_light() +
  labs(title="Withdrawals from ATM2",
       subtitle = "Seasonal Naive Model",
       y="Amount (100's of $)",
       x="Week")

Visually, there appears to be a bit of a downward trend in the data, but this Seasonal Naive model doesn’t take that into account. This might not be a good model, so we’ll keep looking.

We’ll try Holt-Winters as our second model:

# Use Holt-Winters method
atm2.hw <- hw(atm2.train, seasonal = "additive", h=31, biasadj = T)

accuracy(atm2.hw)
##                      ME     RMSE     MAE  MPE MAPE      MASE       ACF1
## Training set 0.02198949 6.697468 4.46445 -Inf  Inf 0.7924091 0.02699244

Our chosen error metrics, RMSE and MASE, both decrease compared to the Seasonal Naive model above.

autoplot(atm2.hw, PI = F) +
  theme_light() +
  labs(title="Withdrawals from ATM2",
       subtitle = "Holt-Winters Seasonal Model",
       y="Amount (100's of $)",
       x="Week")

We see that this model picks up some of the downward trend we saw, though not much.

Finally, we’ll try ARIMA on these data:

atm2.arima <- auto.arima(atm2.train)

atm2.arima
## Series: atm2.train 
## ARIMA(0,0,0)(0,1,2)[7] 
## 
## Coefficients:
##          sma1     sma2
##       -0.7271  -0.0828
## s.e.   0.0601   0.0594
## 
## sigma^2 estimated as 46.04:  log likelihood=-995.94
## AIC=1997.88   AICc=1997.97   BIC=2008.98

Interestingly, the auto.arima() function chose a different model than it did for the ATM1 data. Let’s check the accuracy measures:

accuracy(atm2.arima)
##                      ME     RMSE      MAE  MPE MAPE      MASE       ACF1
## Training set -0.4826825 6.684235 4.445472 -Inf  Inf 0.7890405 0.04564369

The ARIMA model has very nearly the same RMSE and MASE as the Holt-Winters model did.

atm2.pred <- forecast(atm2.arima,h=31)

autoplot(atm2.pred, PI=T) + theme_light() +
  labs(title="Withdrawals from ATM2",
       subtitle = "ARIMA(0,0,0)(0,1,2)[7] Model",
       y="Amount (100's of $)",
       x="Week")

We’ll check the accuracy over the testing data set to see how well the models generalize:

# Check test set accuracy measures
testMeasures <- cbind(method = "Seasonal Naive", 
                      data.frame(
                        as.list(accuracy(atm2.snaive, atm2.test)["Test set",])
                      ))

testMeasures <- rbind(testMeasures,
  cbind(method = "Holt-Winters", 
                      data.frame(
                        as.list(accuracy(atm2.hw, atm2.test)["Test set",])
                      ))
)

# Apply the training fit to the test data (one-step forecasts, no re-estimation)
atm2.pred <- Arima(atm2.test, model=atm2.arima)

testMeasures <- rbind(testMeasures,
  cbind(method = "ARIMA", 
                      data.frame(
                        as.list(accuracy(atm2.pred)["Training set",])
                      ), Theil.s.U = NA))

testMeasures %>% select(method, RMSE, MASE) %>%
  kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
method            RMSE       MASE
Seasonal Naive    14.703665  2.0390192
Holt-Winters      12.178901  1.8777154
ARIMA             4.629508   0.7121923

As before, ARIMA does best on data it has not seen (with the same one-step caveat as for ATM1). So, we’ll use this model to make our forecasts:

atm2.forecast <- forecast(atm2, model = atm2.arima, h=31, biasadj = T)

atm2.forecast %>%
  kable() %>% kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "100%", height = "200px")
Week      Point Forecast  Lo 80      Hi 80      Lo 95      Hi 95
59.42857 64.80428 56.108830 73.49972 51.505739 78.10281
59.57143 73.03464 64.339191 81.73008 59.736100 86.33318
59.71429 17.74685 9.051407 26.44230 4.448316 31.04539
59.85714 10.85213 2.156684 19.54758 -2.446407 24.15067
60.00000 93.34042 84.644969 102.03586 80.041878 106.63895
60.14286 81.31553 72.620085 90.01098 68.016994 94.61407
60.28571 67.14308 58.447630 75.83852 53.844538 80.44161
60.42857 65.44402 56.430659 74.45739 51.659272 79.22878
60.57143 71.08524 62.071880 80.09861 57.300492 84.87000
60.71429 18.44272 9.429351 27.45608 4.657964 32.22747
60.85714 11.72797 2.714601 20.74133 -2.056786 25.51272
61.00000 91.90718 82.893816 100.92055 78.122429 105.69193
61.14286 80.67954 71.666178 89.69291 66.894791 94.46430
61.28571 64.36705 55.353684 73.38041 50.582296 78.15180
61.42857 65.44402 56.280326 74.60772 51.429358 79.45869
61.57143 71.08524 61.921546 80.24894 57.070578 85.09991
61.71429 18.44272 9.279018 27.60641 4.428050 32.45738
61.85714 11.72797 2.564268 20.89166 -2.286701 25.74263
62.00000 91.90718 82.743483 101.07088 77.892514 105.92185
62.14286 80.67954 71.515845 89.84324 66.664877 94.69421
62.28571 64.36705 55.203350 73.53075 50.352382 78.38172
62.42857 65.44402 56.132420 74.75563 51.203154 79.68489
62.57143 71.08524 61.773640 80.39685 56.844375 85.32611
62.71429 18.44272 9.131112 27.75432 4.201846 32.68359
62.85714 11.72797 2.416362 21.03957 -2.512904 25.96884
63.00000 91.90718 82.595576 101.21878 77.666311 106.14805
63.14286 80.67954 71.367939 89.99115 66.438673 94.92041
63.28571 64.36705 55.055444 73.67865 50.126179 78.60792
63.42857 65.44402 55.986827 74.90122 50.980488 79.90756
63.57143 71.08524 61.628047 80.54244 56.621709 85.54878
63.71429 18.44272 8.985518 27.89991 3.979180 32.90625
# Output point forecasts
output <- cbind(date = seq(from=as.Date("2010-05-01"),
                           to=as.Date("2010-05-31"),by=1),
      point_forecast = as.data.frame(atm2.forecast$mean))

write.xlsx(output,"forecast_ATM2.xlsx")

autoplot(atm2.forecast, PI=T) +
  theme_light() +
  labs(title="Withdrawals from ATM2",
       subtitle = "ARIMA(0,0,0)(0,1,2)[7] Forecasts",
       y="Amount (100's of $)",
       x="Week")

ATM3

Given that there are only 3 non-zero data points for ATM3, there isn’t a good model choice. The best we could likely do is a naive forecast, and even that I would not be comfortable using in practice:

autoplot(atm3) + autolayer(naive(atm3))
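
If point forecasts must nonetheless be submitted for ATM3, the naive forecast could be exported in the same way as the other machines (a sketch following the export pattern used above; the file name is my own choice):

# Export naive point forecasts for ATM3, mirroring the other machines
atm3.forecast <- naive(atm3, h=31)
output <- cbind(date = seq(from=as.Date("2010-05-01"),
                           to=as.Date("2010-05-31"),by=1),
      point_forecast = as.data.frame(atm3.forecast$mean))
write.xlsx(output,"forecast_ATM3.xlsx")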

ATM4

The data for ATM4, as mentioned above, seemed to have no real seasonal pattern and almost resembled white noise. So, we’ll take a slightly different approach.

First, though, we will check to see if a Box-Cox transformation is useful:

l <- BoxCox.lambda(atm4)
atm4.trans <- BoxCox(atm4,lambda=l)

autoplot(cbind(original = atm4, transformed=atm4.trans), facets=T) +
  theme_light() +
  labs(title="Withdrawals from ATM4", y="Amount (100's of $)",
       x="Week")

The transformation did reduce some of the variance, so we’ll keep it.

Next, we’ll see if there is any evidence of seasonality that we did not see in the overall plot.

ggAcf(atm4.trans)

Surprisingly, there does appear to be some weekly seasonality like ATM1 and ATM2. We’ll split the data and use an STL decomposition to see what that looks like:

atm4.train <- subset(atm4.trans,end=length(atm4.trans)-60)
atm4.test <- subset(atm4.trans, start=length(atm4.trans)-59)
atm4.decomp <- stl(atm4.train, s.window="periodic")
autoplot(atm4.decomp)

The stl() function seemed to tease out the seasonality pretty well. We’ll try to use this decomposition to forecast, and see how well it does:

atm4.stl <- seasadj(atm4.decomp) %>% snaive(h=31)

accuracy(atm4.stl)
##                     ME     RMSE      MAE       MPE    MAPE MASE       ACF1
## Training set 0.1070168 17.97907 14.21384 -34.43799 174.691    1 0.04499377
atm4.stl %>% autoplot() +
  labs(title="Withdrawals from ATM4", 
       subtitle="STL Seasonal Naive Model",
       y="Amount (100's of $)",
       x="Week")

The STL method does OK, but we can probably do better. We’ll try a Holt-Winters model:

atm4.hw <- hw(atm4.train, h = 31, seasonal = "additive", biasadj = T,
              lambda=l)

accuracy(atm4.hw)
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.269489 13.16263 10.47636 -58.62438 82.00934 0.7370535
##                    ACF1
## Training set 0.06498851

The Holt-Winters model does well on both RMSE and MASE. Let’s plot the results and see what the model predictions look like based on the training data:

autoplot(atm4.hw, PI = T) +
  theme_light() +
  labs(title="Withdrawals from ATM4",
       subtitle = "Holt-Winters Seasonal Model",
       y="Amount (100's of $)",
       x="Week")

The predictions look OK, though they come with wide prediction intervals.

Finally, we try an ARIMA model to see if it works better:

atm4.arima <- auto.arima(atm4.train, biasadj = T)

atm4.arima
## Series: atm4.train 
## ARIMA(0,0,1)(2,0,0)[7] with non-zero mean 
## 
## Coefficients:
##          ma1    sar1    sar2     mean
##       0.0883  0.2092  0.1443  30.0424
## s.e.  0.0576  0.0577  0.0586   1.3337
## 
## sigma^2 estimated as 200.8:  log likelihood=-1239.71
## AIC=2489.43   AICc=2489.63   BIC=2508.03

The auto.arima() function chose an ARIMA\((0,0,1)(2,0,0)_{[7]}\) model. We’ll first check the accuracy measures on the training set and then plot them.

accuracy(atm4.arima)

atm4.pred <- forecast(atm4.arima,h=31)

autoplot(atm4.pred, PI=T) + theme_light() +
  labs(title="Withdrawals from ATM4",
       subtitle = "ARIMA(0,0,1)(2,0,0)[7] Model",
       y="Amount (100's of $)",
       x="Week")

Judging from the estimated \(\sigma^2\), the in-sample RMSE is roughly \(\sqrt{200.8}\approx 14.2\): better than the STL Seasonal Naive model but slightly worse than Holt-Winters. Also, looking at the graph above, the predictions don’t look very stable.

We’ll check the residuals next:

checkresiduals(atm4.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Q* = 11.319, df = 10, p-value = 0.3332
## 
## Model df: 4.   Total lags used: 14

The residuals aren’t exactly normal, though there appear to be no significant autocorrelations among them, and the Ljung-Box test confirms the same.

Finally, let’s look at all of the models’ performance with the test data:

# Check test set accuracy measures
testMeasures <- cbind(method = "STL", 
                      data.frame(
                        as.list(accuracy(atm4.stl, atm4.test)["Test set",])
                      ))

testMeasures <- rbind(testMeasures,
  cbind(method = "Holt-Winters", 
                      data.frame(
                        as.list(accuracy(atm4.hw, atm4.test)["Test set",])
                      ))
)

atm4.pred <- Arima(atm4.test,model=atm4.arima,biasadj=T)

testMeasures <- rbind(testMeasures,
  cbind(method = "ARIMA", 
                      data.frame(
                        as.list(accuracy(atm4.pred)["Training set",])
                      ), Theil.s.U = NA))

testMeasures %>% select(method, RMSE, MASE) %>%
  kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
method            RMSE      MASE
STL               21.28007  1.1667665
Holt-Winters      17.59295  0.9856173
ARIMA             12.33961  0.8749154

Surprisingly, the ARIMA model beats the others on our chosen accuracy measures for the test data set. However, the instability of its predictions makes me hesitant to use it.

In conclusion, I would choose the Holt-Winters model here, despite its worse accuracy measures, because it provides the most stable predictions of the three models.

atm4.forecast <- forecast(atm4.trans, model=atm4.hw, h=31,
                          use.initial.values=TRUE,
                          biasadj = T)

atm4.forecast %>%
  kable() %>% kable_styling(bootstrap_options = "striped") %>%
  scroll_box(width = "100%", height = "200px")
Week      Point Forecast  Lo 80      Hi 80     Lo 95      Hi 95
59.42857 26.05965 10.470474 49.50389 5.2010605 65.25192
59.57143 27.51257 11.324439 51.64204 5.7757149 67.78627
59.71429 28.18119 11.699643 52.66995 6.0224496 69.03117
59.85714 27.16074 11.031793 51.32039 5.5456069 67.52032
60.00000 23.22162 8.635555 45.77789 3.9248643 61.11479
60.14286 10.16172 1.927026 25.85077 0.2560098 37.39999
60.28571 28.53184 11.760629 53.51730 6.0073660 70.23031
60.42857 26.05043 10.209555 50.09025 4.9321807 66.30303
60.57143 27.50308 11.052246 52.24166 5.4910460 68.85870
60.71429 28.17157 11.422702 53.27568 5.7313119 70.11364
60.85714 27.15131 10.763731 51.91736 5.2675011 68.58935
61.00000 23.21296 8.401166 46.33875 3.6950086 62.12720
61.14286 10.15619 1.824018 26.26222 0.2059312 38.17594
61.28571 28.52215 11.483438 54.12705 5.7171902 71.32070
61.42857 26.04121 9.953083 50.67825 4.6719199 67.35948
61.57143 27.49359 10.784547 52.84291 5.2150999 69.93647
61.71429 28.16195 11.150264 53.88306 5.4489254 71.20148
61.85714 27.14189 10.500105 52.51602 4.9980256 69.66383
62.00000 23.20431 8.171013 46.90141 3.4733746 63.14521
62.14286 10.15067 1.724389 26.67582 0.1618406 38.95794
62.28571 28.51247 11.210683 54.73854 5.4356753 72.41664
62.42857 26.03200 9.700924 51.26806 4.4200735 68.42159
62.57143 27.48410 10.521203 53.44597 4.9476692 71.01990
62.71429 28.15234 10.882190 54.49225 5.1750833 72.29501
62.85714 27.13246 10.240785 53.11654 4.7369799 70.74407
63.00000 23.19565 7.944979 47.46604 3.2597776 64.16911
63.14286 10.14514 1.628061 27.09168 0.1235512 39.74622
63.28571 28.50279 10.942235 55.35195 5.1626242 73.51843
63.42857 26.02278 9.452956 51.85985 4.1764548 69.48964
63.57143 27.47461 10.262092 54.05102 4.6885653 72.10929
63.71429 28.14273 10.618357 55.10345 4.9095971 73.39452
# Output point forecasts
output <- cbind(date = seq(from=as.Date("2010-05-01"),
                           to=as.Date("2010-05-31"),by=1),
      point_forecast = as.data.frame(InvBoxCox(atm4.forecast$mean,lambda=l)))

write.xlsx(output,"forecast_ATM4.xlsx")

autoplot(atm4.forecast, PI=T) +
  theme_light() +
  labs(title="Withdrawals from ATM4",
       subtitle = "Holt-Winters Forecasts",
       y="Amount (100's of $)",
       x="Week")


Power Usage Forecasting

Problem Statement

Given a dataset of residential power usage from January 1998 through December 2013, model these data and produce a monthly forecast for 2014.

Data Import and Cleansing

Import

# Read in the data
power_raw <- read_xlsx("power.xlsx")

# Create a new data frame, keeping the original data safe
power <- power_raw

Cleaning

As we did with the ATM data, we first do some basic cleansing: rename the columns and convert the date column to a proper date type.

# Assign easier names
colnames(power) <- c("seq","date","kwh")

# Fix the date column so it can be converted
power$date <- paste0(power$date,"-01")
power$date <- ymd(power$date)

head(power)
## # A tibble: 6 x 3
##     seq date           kwh
##   <dbl> <date>       <dbl>
## 1   733 1998-01-01 6862583
## 2   734 1998-02-01 5838198
## 3   735 1998-03-01 5420658
## 4   736 1998-04-01 5010364
## 5   737 1998-05-01 4665377
## 6   738 1998-06-01 6467147

Next, we check if any values are missing:

missing <- power %>% filter(is.na(date)) %>% nrow() %>% tibble(date = .)
missing <- power %>% filter(is.na(seq)) %>% nrow() %>%
  tibble(seq = .) %>% cbind(missing,.)
missing <- power %>% filter(is.na(kwh)) %>% nrow() %>%
  tibble(kwh = .) %>% cbind(missing,.)

rownames(missing) <- "NA Values"

kbl(missing) %>% kable_paper("striped", full_width = F, position="left") %>%
  add_header_above(c("Column Name" = 4))
Column Name
            date   seq   kwh
NA Values      0     0     1

Thankfully, we only have one missing value in the kwh column. Specifically:

power %>% filter(is.na(kwh)) %>% kbl %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
seq   date         kwh
861   2008-09-01   NA

Imputation

For our missing value of kwh, we’ll look at all the values in the same month over the data set to see if the values are stable enough to impute our missing value:

power %>% filter(month(date)==9) %>%
  ggplot(aes(x=year(date),y=kwh)) + geom_col() + theme_light() +
  labs(title="Residential Power Usage", subtitle="September Months",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

The plot above shows that September usage is stable enough across years that the previous year’s September value should be an acceptable replacement. This way, if there is a seasonal pattern, our imputation doesn’t break it.

imputed <- power %>% filter(date == as.Date("2007-09-01")) %>%
    select(kwh) %>% as.numeric()

power[is.na(power$kwh),"kwh"] <- imputed

Next we’ll look for outliers. We’ll start by plotting the raw data:

power %>% ggplot(aes(x=date,y=kwh)) + geom_line() +
  theme_light() +
  labs(title="Residential Power Usage",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

There appears to be a visual inconsistency in the data around mid-2010. We’ll look closer:

power %>% filter(date >= as.Date("2009-07-01"), date <= as.Date("2011-07-01")) %>%
  ggplot(aes(x=date,y=kwh)) + geom_line() +
  theme_light() +
  labs(title="Residential Power Usage", subtitle="(selected date range)",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

The unusually low point in question is July 2010. Let’s compare to other July months to see if it is indeed that unusual:

power %>% filter(month(date)==7) %>%
  ggplot(aes(x=year(date),y=kwh)) + geom_col() + theme_light() +
  labs(title="Residential Power Usage", subtitle="July Months",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

Compared to the other July entries, it is pretty obvious that the 2010 value is likely wrong. Not knowing anything about the data set (where it is from, for example), we can’t say whether there is a legitimate reason for it to be so low.

In fact, the value for July 2010 is about one-tenth the size of the other July entries:

power %>% filter(month(date)==7) %>%
  mutate(kwh = comma(kwh)) %>%
  select(date, kwh) %>%
  kbl() %>% kable_styling(bootstrap_options = "striped", full_width = F) %>%
  row_spec(13, bold = T, color = "white", background = "#D7261E")
date kwh
1998-07-01 8,914,755
1999-07-01 7,444,416
2000-07-01 6,862,662
2001-07-01 7,855,129
2002-07-01 7,039,702
2003-07-01 6,900,676
2004-07-01 7,307,931
2005-07-01 8,337,998
2006-07-01 7,945,564
2007-07-01 6,426,220
2008-07-01 7,643,987
2009-07-01 7,713,260
2010-07-01 770,523
2011-07-01 10,093,343
2012-07-01 8,886,851
2013-07-01 8,415,321

It looks like the intended entry was 7,705,230; the recorded 770,523 appears to have lost a trailing digit. That would put it in line with the previous July value.

So, we will fix that value:

# Restore the missing digit by multiplying by 10
power[which(power$date==as.Date("2010-07-01")),"kwh"] <-
  power[which(power$date==as.Date("2010-07-01")),"kwh"] * 10

Now our data looks a lot better:

power %>% ggplot(aes(x=date,y=kwh)) + geom_line() +
  theme_light() +
  labs(title="Residential Power Usage",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

Modeling

Now we can begin our modeling. First, we turn the data into a formal time series:

power <- ts(power$kwh, frequency = 12, start=c(1998,01))

power %>% autoplot() +
  theme_light() +
  labs(title="Residential Power Usage",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

Obviously there is seasonality here as well as a generally upward trend. We’ll split the series into training and testing sets first:

# Split data. Use 3 years as a test set
power.train <- window(power, end=c(2010,12))
power.test <- window(power, start=c(2011,01))

Next we will see if a Box-Cox transformation is necessary to stabilize variance:

l <- BoxCox.lambda(power)

cbind(original = power, transformed = BoxCox(power, lambda=l)) %>%
  autoplot(facets=T) +
  theme_light() +
  labs(title="Residential Power Usage",
       subtitle="Original vs. Transformed",
       y="kWh", x="year")

The Box-Cox transform didn’t seem to do very much, so we will proceed without it.

As far as models, we will start with a seasonal naive model:

power.snaive <- snaive(power.train, h=12, biasadj = T)

accuracy(power.snaive)
##                    ME     RMSE      MAE       MPE     MAPE MASE      ACF1
## Training set 57974.93 715450.2 558876.6 0.2900295 8.851673    1 0.2682315

The accuracy measures are reasonable as a baseline for other models we’ll fit below. Let’s look at the forecasts provided by the model:

power.snaive %>% autoplot() +
  theme_light() +
  labs(title="Residential Power Usage",
       subtitle="Training data with Seasonal Naive predictions",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

Next we’ll try an exponential smoothing model to see how well it works in comparison:

power.ets <- ets(power.train, model="ZZZ", biasadj=T)

summary(power.ets)
## ETS(M,N,M) 
## 
## Call:
##  ets(y = power.train, model = "ZZZ", biasadj = T) 
## 
##   Smoothing parameters:
##     alpha = 0.0561 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 6197932.6926 
##     s = 0.921 0.7451 0.8944 1.2076 1.2518 1.1998
##            0.9861 0.7702 0.814 0.9345 1.0588 1.2168
## 
##   sigma:  0.0881
## 
##      AIC     AICc      BIC 
## 4923.378 4926.807 4969.126 
## 
## Training set error measures:
##                    ME   RMSE    MAE      MPE     MAPE      MASE      ACF1
## Training set 50934.76 535570 406708 0.147575 6.413358 0.7277242 0.3239843

We see that the ets() function chose a model with multiplicative error and seasonality terms, with no trend term.
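
For reference, the ETS(M,N,M) state space equations are

\[
\begin{aligned}
y_t &= \ell_{t-1}\,s_{t-m}\,(1 + \varepsilon_t) \\
\ell_t &= \ell_{t-1}\,(1 + \alpha\,\varepsilon_t) \\
s_t &= s_{t-m}\,(1 + \gamma\,\varepsilon_t)
\end{aligned}
\]

with \(m = 12\); the near-zero \(\gamma\) means the seasonal pattern is held essentially fixed over the sample.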

power.ets %>% autoplot()

autoplot(power.train) + autolayer(forecast(power.ets, h=12, biasadj=T)) +
  theme_light() +
  labs(title="Residential Power Usage",
       subtitle="Training data with ETS(M,N,M) predictions",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

The ETS model seems to do OK; its predictions for the test period look reasonable.

Next we will try an ARIMA model:

power.arima <- auto.arima(power.train)

summary(power.arima)
## Series: power.train 
## ARIMA(0,0,5)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ma1      ma2     ma3     ma4      ma5     sar1     sar2     drift
##       0.2665  -0.0552  0.2594  -0.053  -0.1339  -0.7438  -0.4444  4818.548
## s.e.  0.0852   0.0866  0.0838   0.083   0.0820   0.0840   0.0819  2233.982
## 
## sigma^2 estimated as 2.787e+11:  log likelihood=-2102.32
## AIC=4222.64   AICc=4223.98   BIC=4249.37
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -9898.183 492916.6 374398.5 -0.6403311 6.032486 0.6699127
##                      ACF1
## Training set -0.002832901

Interestingly, the auto.arima() function chose an ARIMA\((0,0,5)(2,1,0)_{[12]}\) model with drift. Looking at the training set accuracy measures, both RMSE and MASE were better than the ETS model above.

Let’s look at the residuals of this ARIMA model:

checkresiduals(power.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,5)(2,1,0)[12] with drift
## Q* = 11, df = 16, p-value = 0.8095
## 
## Model df: 8.   Total lags used: 24

The plot of the residuals is relatively close to normal. The ACF plot shows no significant autocorrelation (with one small exception on lag 35). The Ljung-Box test has a high p-value, indicating that the residuals are uncorrelated.

Let’s look at the predictions we get from the training data:

power.pred <- forecast(power.arima, h=12, biasadj=T)

power.pred %>% autoplot() +
  theme_light() +
  labs(title="Residential Power Usage",
       subtitle="Training data with ARIMA predictions",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))

The predictions look good and have smaller estimated prediction intervals than the ETS model did.

Now, we will compare accuracy measures when these two models are applied to the test data (which was not used in training the models):

# Check test set accuracy measures

testMeasures <- cbind(method = "Seasonal Naive", 
                      data.frame(as.list(
                        accuracy(power.snaive, power.test)["Test set",])
                        )
                      )

power.pred.ets <- predict(power.ets,h=36, biasadj=T)

testMeasures <- rbind(testMeasures,
  cbind(method = "ETS", 
                      data.frame(as.list(
                        accuracy(power.pred.ets, power.test)["Test set",])
                        )
                      ))

power.pred.arima <- forecast(power.arima,h=36,biasadj=T)

testMeasures <- rbind(testMeasures,
  cbind(method = "ARIMA", 
                      data.frame(
                        as.list(
                          accuracy(power.pred.arima, power.test)["Test set",]
                          )
                      )))

testMeasures %>% select(method, RMSE, MASE) %>%
  kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
method            RMSE     MASE
Seasonal Naive    1398107  2.111974
ETS               1205701  1.667069
ARIMA             1058430  1.392549

The accuracy measures on the test set tell us that the ARIMA model outperforms the other two.

Thus we will use the ARIMA model for our forecast of 2014:

power.forecast <- forecast(power, model = power.arima, h=12, biasadj = T)

power.forecast %>%
  kable() %>% kable_styling(bootstrap_options = "striped")
Month      Point Forecast  Lo 80     Hi 80     Lo 95    Hi 95
Jan 2014 10051232 9374678 10727786 9016532 11085932
Feb 2014 8353735 7653568 9053901 7282922 9424547
Mar 2014 7136993 6435833 7838154 6064662 8209325
Apr 2014 5728649 5005862 6451435 4623242 6834055
May 2014 5488768 4765092 6212445 4382001 6595536
Jun 2014 8101105 7371782 8830428 6985701 9216509
Jul 2014 9428686 8699363 10158010 8313282 10544090
Aug 2014 9911712 9182388 10641035 8796308 11027115
Sep 2014 8405698 7676375 9135021 7290294 9521102
Oct 2014 5762170 5032846 6491493 4646766 6877574
Nov 2014 6055582 5326258 6784905 4940178 7170985
Dec 2014 8243222 7513898 8972545 7127818 9358625
# Output point forecasts
output <- cbind(date = seq(from=as.Date("2014-01-01"),
                           to=as.Date("2014-12-01"),by="month"),
      point_forecast = as.data.frame(power.forecast$mean))

write.xlsx(output,"forecast_Power.xlsx")

autoplot(power.forecast, PI=T) +
  theme_light() +
  labs(title="Residential Power Usage",
       subtitle="Forecasts using ARIMA(0,0,5)(2,1,0)[12] w/ drift",
       y="kWh", x="year") +
  scale_y_continuous(labels=number_format(scale=10^-6, suffix = "M"))