Part A ATM Forecast

library(fabletools)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
library(fable)
library(ggplot2)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tsibble)
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(readxl)
library(dplyr)
library(zoo)
## 
## Attaching package: 'zoo'
## 
## The following object is masked from 'package:tsibble':
## 
##     index
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(feasts)

Instructions

In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.

Data Load

# Loading in provided excel file of ATM data
atm <- readxl::read_excel("C:\\Users\\chung\\Downloads\\ATM624Data.xlsx")

Data Exploration

# Taking a look at the data passed into R studio
head(atm)
## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

I note here that we will have to reformat the DATE column: the imported values do not match the Excel format ("m/d/yyyy h:mm:ss AM/PM") but instead show the Excel serial date, i.e., the number of days since 1899-12-30.

summary(atm)
##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10919.8  
##                                     NA's   :19

I note here that there are 19 NA values in the 'Cash' variable; this is another thing I will have to address.

# Ensuring all four ATMs are accounted for in the data
unique(atm$ATM)
## [1] "ATM1" "ATM2" NA     "ATM3" "ATM4"

Looking into the counts for each ATM, since the NA values could represent missing data for one of the four ATMs.

atm %>%
  group_by(ATM) %>%
  summarise(Count = n())
## # A tibble: 5 × 2
##   ATM   Count
##   <chr> <int>
## 1 ATM1    365
## 2 ATM2    365
## 3 ATM3    365
## 4 ATM4    365
## 5 <NA>     14
atm_na <- atm %>%
  filter(is.na(ATM))
atm_na
## # A tibble: 14 × 3
##     DATE ATM    Cash
##    <dbl> <chr> <dbl>
##  1 40299 <NA>     NA
##  2 40300 <NA>     NA
##  3 40301 <NA>     NA
##  4 40302 <NA>     NA
##  5 40303 <NA>     NA
##  6 40304 <NA>     NA
##  7 40305 <NA>     NA
##  8 40306 <NA>     NA
##  9 40307 <NA>     NA
## 10 40308 <NA>     NA
## 11 40309 <NA>     NA
## 12 40310 <NA>     NA
## 13 40311 <NA>     NA
## 14 40312 <NA>     NA

Exploring the counts for each ATM, there are 365 days' worth of values per machine, so the NA values in the ATM column will also be addressed in data preprocessing. I will drop the observations with NA ATM values, then check whether the date ranges for each ATM are aligned.

# Plotting each ATM to see the general structure of the data

ggplot(atm, aes(DATE, Cash, col = ATM)) +
  geom_line() +
  facet_wrap(~ ATM, scales = "free_y")
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).

# Viewing ATM4's outlier

atm %>%
  filter(ATM == 'ATM4') %>%
  arrange(desc(Cash))
## # A tibble: 365 × 3
##     DATE ATM     Cash
##    <dbl> <chr>  <dbl>
##  1 40218 ATM4  10920.
##  2 40078 ATM4   1712.
##  3 40207 ATM4   1575.
##  4 40060 ATM4   1495.
##  5 39973 ATM4   1484.
##  6 40061 ATM4   1301.
##  7 40237 ATM4   1276.
##  8 40048 ATM4   1246.
##  9 39978 ATM4   1221.
## 10 40085 ATM4   1195.
## # ℹ 355 more rows

There is a clear outlier in ATM4's data: one value of roughly 10,920, which, given that 'Cash' is recorded in hundreds of dollars, would mean over $1,000,000 withdrawn from a single ATM in one day. That is very unlikely, so in preprocessing I will impute a replacement value.

# Exploring ATM3's cash values
atm %>%
  filter(ATM == "ATM3") %>%
  distinct(Cash)
## # A tibble: 4 × 1
##    Cash
##   <dbl>
## 1     0
## 2    96
## 3    82
## 4    85

ATM3 has only four distinct Cash values: every day is 0 except for three days (with values 96, 82, and 85).

Data Preprocessing

Data exploration surfaced three issues to address: the DATE column needs reformatting, the NA values in the ATM and Cash columns need handling, and the outlier in ATM4 needs imputation.

# Reformatting the date
atm <- atm %>%
  mutate(DATE = as.Date(DATE, origin = '1899-12-30'))

# Checking date ranges
atm %>%
  group_by(ATM) %>%
  summarise(
    earliest_date = min(DATE),
    latest_date = max(DATE))
## # A tibble: 5 × 3
##   ATM   earliest_date latest_date
##   <chr> <date>        <date>     
## 1 ATM1  2009-05-01    2010-04-30 
## 2 ATM2  2009-05-01    2010-04-30 
## 3 ATM3  2009-05-01    2010-04-30 
## 4 ATM4  2009-05-01    2010-04-30 
## 5 <NA>  2010-05-01    2010-05-14

The date ranges after reformatting are appropriate: each ATM's range matches the others and represents a year's worth of daily data ending 4/30/2010.

# Omitting the NA ATM value observations from our data

atm <- atm %>%
  filter(!is.na(ATM))

# Rechecking ATM column NA counts
atm %>%
  group_by(ATM) %>%
  summarise(Count = n())
## # A tibble: 4 × 2
##   ATM   Count
##   <chr> <int>
## 1 ATM1    365
## 2 ATM2    365
## 3 ATM3    365
## 4 ATM4    365
# Rechecking Cash NA counts

summary(atm)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1460        Min.   :    0.0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5  
##  Median :2009-10-30   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.6  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10919.8  
##                                          NA's   :5

There are still 5 NA’s in the Cash variable, I will take a closer look.

atm_na <- atm %>%
  filter(is.na(Cash))
atm_na
## # A tibble: 5 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-18 ATM2     NA
## 4 2009-06-22 ATM1     NA
## 5 2009-06-24 ATM2     NA

Given that the NA values are not related to predictor values and the data is not structurally missing, I will impute them. I choose imputation over dropping the observations to preserve the daily structure for the time series conversion, and I will impute each ATM's median to avoid the influence of extreme values.

# Calculating median for ATM1's Cash
median_cash_atm1 <- median(atm$Cash[atm$ATM == "ATM1"], na.rm = TRUE)

# Impute NA values in Cash for ATM1
atm$Cash[is.na(atm$Cash) & atm$ATM == "ATM1"] <- median_cash_atm1

median_cash_atm1
## [1] 91
# Calculating median for ATM2's Cash
median_cash_atm2 <- median(atm$Cash[atm$ATM == "ATM2"], na.rm = TRUE)

# Impute NA values in Cash for ATM2
atm$Cash[is.na(atm$Cash) & atm$ATM == "ATM2"] <- median_cash_atm2

median_cash_atm2
## [1] 67
# Checking that all NA values in Cash variable have been imputed
summary(atm)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1460        Min.   :    0.0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:    1.0  
##  Median :2009-10-30   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.3  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10919.8
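As an aside, the same per-ATM median imputation can be written as a single grouped mutate; a sketch equivalent to the two imputation steps above:

# Equivalent grouped imputation (sketch): replace each ATM's NA Cash
# values with that ATM's own median in one pass.
atm <- atm %>%
  group_by(ATM) %>%
  mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash)) %>%
  ungroup()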

Next, I impute the outlier in ATM4's Cash variable with ATM4's median Cash value.

# Finding the atm data index where the ATM4 outlier is located
global_index <- which(atm$ATM == "ATM4")[which.max(atm$Cash[atm$ATM == "ATM4"])]

global_index
## [1] 1380
# Calculating the median Cash where ATM is ATM4
median_cash_atm4 <- median(atm$Cash[atm$ATM == "ATM4" & seq_along(atm$ATM) != global_index], na.rm = TRUE)

median_cash_atm4
## [1] 403.3049
# Imputing the Cash value at that index
atm$Cash[global_index] <- median_cash_atm4

# Viewing data after imputation
ggplot(atm, aes(DATE, Cash, col = ATM)) +
  geom_line() +
  facet_wrap(~ ATM, scales = "free_y")

Forecasts

Breaking the data into 4 separate time series

atm_ts <- atm %>%
  as_tsibble(index = DATE, key = ATM)

atm1_ts <- atm_ts %>% filter(ATM == "ATM1")
atm2_ts <- atm_ts %>% filter(ATM == "ATM2")
atm3_ts <- atm_ts %>% filter(ATM == "ATM3")
atm4_ts <- atm_ts %>% filter(ATM == "ATM4")

ATM1 Forecast

I will begin by analyzing the ACF/PACF plots and the STL decomposition of the data.

atm1_ts %>%
  gg_tsdisplay(Cash, plot_type = 'partial')

The autocorrelation spikes at lags 7, 14, and 21 indicate weekly seasonality in ATM1's data. There is no apparent trend, and variability is stable throughout.

The STL decomposition below reflects these findings, showing a stable trend component and a weekly seasonal pattern that is allowed to change slowly over time.

atm1_ts %>%
  model(stl = STL(Cash)) %>%
  components() %>%
  autoplot()

ndiffs(atm1_ts$Cash)
## [1] 0

With reasonably stable variance and ndiffs() indicating that no differencing is needed, I will not apply a transformation or differencing.
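As a cross-check on ndiffs (a sketch, not part of the original output), a KPSS unit-root test via feasts should agree:

# Sketch: KPSS unit-root test. A p-value above 0.05 fails to reject
# stationarity, consistent with ndiffs() suggesting no differencing.
atm1_ts %>%
  features(Cash, unitroot_kpss)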

I will opt to train and test a seasonal naive method and an additive Holt-Winters (ETS) model, and compare them with an auto-optimized ARIMA. My inclination is that the additive ETS should perform best given the weekly seasonality and lack of trend.

fit <- atm1_ts %>%
  model(ARIMA(Cash))

report(fit)
## Series: Cash 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1695  -0.5867  -0.0989
## s.e.  0.0553   0.0508   0.0497
## 
## sigma^2 estimated as 559.8:  log likelihood=-1641.12
## AIC=3290.23   AICc=3290.34   BIC=3305.75

The auto ARIMA selects a seasonal ARIMA(0,0,1)(0,1,2)[7]: one non-seasonal moving average term, one seasonal difference, and two seasonal moving average terms at the weekly period.
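In backshift notation, with the reported coefficients, the fitted model is approximately

$$(1 - B^7)\,y_t = (1 + 0.1695B)(1 - 0.5867B^7 - 0.0989B^{14})\,\varepsilon_t$$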

fit <- atm1_ts %>%
  model(ETS(Cash))

report(fit)
## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.01651456 
##     gamma = 0.3270894 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]   s[-3]    s[-4]    s[-5]    s[-6]
##  78.33245 -60.17048 -9.153121 17.72025 6.50369 15.97554 16.44455 12.67958
## 
##   sigma^2:  579.039
## 
##      AIC     AICc      BIC 
## 4486.250 4486.871 4525.249

The auto ETS selects ETS(A,N,A): additive error and additive seasonality, with no trend component.
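For reference, ETS(A,N,A) with weekly period m = 7 corresponds to the innovations state space equations

$$y_t = \ell_{t-1} + s_{t-7} + \varepsilon_t, \qquad \ell_t = \ell_{t-1} + \alpha\,\varepsilon_t, \qquad s_t = s_{t-7} + \gamma\,\varepsilon_t$$

with the estimated smoothing parameters α ≈ 0.0165 and γ ≈ 0.327 reported above.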

atm1_train <- atm1_ts %>%
  filter(DATE < "2010-04-01") # strict inequality so April 1 is not in both sets

atm1_test <- atm1_ts %>%
  filter(DATE >= "2010-04-01")

atm1_fits <- atm1_train %>%
  model(
    SNAIVE = SNAIVE(Cash),
    auto_ETS = ETS(Cash),
    'Auto ARIMA' = ARIMA(Cash) 
  )
atm1_forecasts <- atm1_fits %>%
  forecast(h = 30)

# Plotting forecasts

atm1_forecasts %>%
  autoplot(atm1_test) +
  guides(colour = guide_legend(title = "Forecast"))+
  labs(title= "ATM1 Forecasts: April") +
  xlab("Date") +
  ylab("Cash in Hundreds")

atm1_forecasts %>%
  accuracy(atm1_test)
## # A tibble: 3 × 11
##   .model     ATM   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>      <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Auto ARIMA ATM1  Test  -7.24  12.6 10.0  -85.8  88.9   NaN   NaN -0.340 
## 2 SNAIVE     ATM1  Test  -8.17  16.8 14.5  -69.5  76.6   NaN   NaN -0.0151
## 3 auto_ETS   ATM1  Test  -6.52  12.1  9.55 -64.5  67.8   NaN   NaN -0.325

From the accuracy values, the best model here is the auto ETS, which scored the lowest RMSE and MAE.

# Fitting ETS for final forecast output
atm1_fit_ets <- atm1_ts %>%
  model(ETS = ETS(Cash))

atm1_forecast_ets <- atm1_fit_ets %>%
  forecast(h =30)

atm1_forecast_ets %>% 
  autoplot(atm1_ts) +
  labs(title = "ATM1 ETS: May",
       y = "Cash in Hundreds")

atm1_forecast_results <- as.data.frame(atm1_forecast_ets) %>%
  select(DATE, .mean) %>% # Using mean of forecasts to represent the point estimate projections
  rename(Date = DATE, Cash = .mean) %>%
  mutate(Cash=round(Cash,2))

atm1_forecast_results
##          Date   Cash
## 1  2010-05-01  87.05
## 2  2010-05-02 100.76
## 3  2010-05-03  73.11
## 4  2010-05-04   5.74
## 5  2010-05-05 100.13
## 6  2010-05-06  79.43
## 7  2010-05-07  85.60
## 8  2010-05-08  87.05
## 9  2010-05-09 100.76
## 10 2010-05-10  73.11
## 11 2010-05-11   5.74
## 12 2010-05-12 100.13
## 13 2010-05-13  79.43
## 14 2010-05-14  85.60
## 15 2010-05-15  87.05
## 16 2010-05-16 100.76
## 17 2010-05-17  73.11
## 18 2010-05-18   5.74
## 19 2010-05-19 100.13
## 20 2010-05-20  79.43
## 21 2010-05-21  85.60
## 22 2010-05-22  87.05
## 23 2010-05-23 100.76
## 24 2010-05-24  73.11
## 25 2010-05-25   5.74
## 26 2010-05-26 100.13
## 27 2010-05-27  79.43
## 28 2010-05-28  85.60
## 29 2010-05-29  87.05
## 30 2010-05-30 100.76
write.csv(atm1_forecast_results, file = "atm1_forecast_results.csv")

ATM2 Forecast

Starting with an STL decomposition to observe the components, followed by the time series display with ACF/PACF plots.

atm2_ts %>%
  model(stl = STL(Cash)) %>%
  components() %>%
  autoplot()

gg_tsdisplay(atm2_ts, Cash)

For ATM2 we see similar components to ATM1: on the time plot, stable variance throughout and a relatively stable trend (perhaps with some cyclic behavior), plus weekly seasonality. The ACF plot again spikes at lags 7, 14, and 21.

ndiffs(atm2_ts$Cash)
## [1] 1

The ndiffs() function suggests one order of differencing is required to make the data stationary. Given the weekly seasonality, I will apply a seasonal (lag-7) difference and then recheck.

# Differencing the data for weekly seasonality

atm2_ts <- atm2_ts %>%
  mutate(diff_ATM2 = difference(Cash, lag = 7))

# Checking ndiffs after differencing indicates no additional differencing required
ndiffs(atm2_ts$diff_ATM2)
## [1] 0
fit <- atm2_ts %>%
  model(ARIMA(diff_ATM2))

report(fit)
## Series: diff_ATM2 
## Model: ARIMA(0,0,0)(2,0,0)[7] 
## 
## Coefficients:
##          sar1     sar2
##       -0.5795  -0.1814
## s.e.   0.0520   0.0516
## 
## sigma^2 estimated as 654.4:  log likelihood=-1672.25
## AIC=3350.49   AICc=3350.56   BIC=3362.19
fit <- atm2_ts %>%
  model(ETS(Cash))

report(fit)
## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.000100055 
##     gamma = 0.3627371 
## 
##   Initial states:
##      l[0]      s[0]    s[-1]    s[-2]     s[-3]    s[-4]    s[-5]    s[-6]
##  72.97637 -46.64692 -22.2591 4.814576 0.2230553 20.58732 14.88622 28.39485
## 
##   sigma^2:  644.2689
## 
##      AIC     AICc      BIC 
## 4525.212 4525.834 4564.211

The auto ARIMA on the differenced series suggests a seasonal ARIMA(0,0,0)(2,0,0)[7] is optimal, with two seasonal autoregressive terms.

To compare models here we will have to include both differenced and non-differenced data: the ARIMA model is fit on the differenced series, while the ETS and SNAIVE models are not. Similarly to ATM1, I suspect an ETS model will do well here because of the weekly seasonality and lack of trend.

atm2_train <- atm2_ts %>%
  filter(DATE < "2010-04-01")

atm2_test <- atm2_ts %>%
  filter(DATE >= "2010-04-01")

# Nondifferenced models
atm2_fit_nondiff <- atm2_train %>%
  model(
    SNAIVE = SNAIVE(Cash),
    ETS = ETS(Cash)
  )

# Differenced models
atm2_fit_diff <- atm2_train %>%
  slice(8:n()) %>% # drop the leading NA rows created by the lag-7 difference
  model(
    `Auto ARIMA` = ARIMA(diff_ATM2, stepwise = FALSE, approx = FALSE)
  )

atm2_forecast_nondiff <- atm2_fit_nondiff %>%
  forecast(h = 30)

atm2_forecast_diff <- atm2_fit_diff %>%
  forecast(h = 30)

# Plotting nondifferenced models
atm2_forecast_nondiff %>%
  autoplot(atm2_ts, level = NULL)+
  facet_wrap( ~ .model, scales = "free_y") +
  guides(colour = guide_legend(title = "Forecast"))+
  labs(title= "ATM2 Forecasts: April") +
  xlab("Date") +
  ylab("Cash in Hundreds") 

# Plotting differenced models
atm2_forecast_diff %>%
  autoplot(atm2_ts, level = NULL)+
  facet_wrap( ~ .model, scales = "free_y") +
  guides(colour = guide_legend(title = "Forecast"))+
  labs(title= "ATM2 Forecasts | April") +
  xlab("Date") +
  ylab("Cash")
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).

atm2_forecast_nondiff %>%
  accuracy(atm2_test)
## # A tibble: 2 × 11
##   .model ATM   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ETS    ATM2  Test   9.51  19.4  14.3 -26.3  58.2   NaN   NaN -0.0932
## 2 SNAIVE ATM2  Test  12.9   26.2  17.8  35.1  45.1   NaN   NaN -0.319
atm2_forecast_diff %>%
  accuracy(atm2_test)
## # A tibble: 1 × 11
##   .model     ATM   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>      <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Auto ARIMA ATM2  Test   3.71  20.1  14.2   Inf   Inf   NaN   NaN -0.312

Our auto ETS model had the lowest RMSE, suggesting it was the best fit. The ARIMA(0,0,0)(2,0,0)[7] produced a very close RMSE and a slightly lower MAE, though its accuracy is computed on the differenced scale, so the comparison is only indicative. I will forecast using the auto ETS on the non-differenced data.
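As a cross-check, here is a sketch that keeps all three models on the original Cash scale by letting fable's ARIMA() choose any differencing internally, so the accuracy numbers are directly comparable:

# Sketch: fit all models on the raw Cash series; ARIMA() handles seasonal
# differencing internally, so accuracy() compares like for like.
atm2_fits_same_scale <- atm2_train %>%
  model(
    SNAIVE = SNAIVE(Cash),
    ETS = ETS(Cash),
    `Auto ARIMA` = ARIMA(Cash, stepwise = FALSE, approximation = FALSE)
  )

atm2_fits_same_scale %>%
  forecast(h = 30) %>%
  accuracy(atm2_test)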

atm2_fit_ets <- atm2_ts %>%
  model(ETS = ETS(Cash))

atm2_forecast_ets <- atm2_fit_ets %>%
  forecast(h = 30)

atm2_forecast_ets %>% 
  autoplot(atm2_ts) +
  labs(title = "ATM2 ETS Nondiff: May",
       y = "Cash in Hundreds")

atm2_forecast_results <- as.data.frame(atm2_forecast_ets) %>%
  select(DATE, .mean) %>% # Using mean of forecasts to represent the point estimate projections
  rename(Date = DATE, Cash = .mean) %>%
  mutate(Cash=round(Cash,2))

atm2_forecast_results
##          Date   Cash
## 1  2010-05-01  68.35
## 2  2010-05-02  74.19
## 3  2010-05-03  11.09
## 4  2010-05-04   2.14
## 5  2010-05-05 101.60
## 6  2010-05-06  92.38
## 7  2010-05-07  68.98
## 8  2010-05-08  68.35
## 9  2010-05-09  74.19
## 10 2010-05-10  11.09
## 11 2010-05-11   2.14
## 12 2010-05-12 101.60
## 13 2010-05-13  92.38
## 14 2010-05-14  68.98
## 15 2010-05-15  68.35
## 16 2010-05-16  74.19
## 17 2010-05-17  11.09
## 18 2010-05-18   2.14
## 19 2010-05-19 101.60
## 20 2010-05-20  92.38
## 21 2010-05-21  68.98
## 22 2010-05-22  68.35
## 23 2010-05-23  74.19
## 24 2010-05-24  11.09
## 25 2010-05-25   2.14
## 26 2010-05-26 101.60
## 27 2010-05-27  92.38
## 28 2010-05-28  68.98
## 29 2010-05-29  68.35
## 30 2010-05-30  74.19
write.csv(atm2_forecast_results, file = "atm2_forecast_results.csv")

ATM3 Forecast

The ATM3 data contains only three observations with Cash withdrawn above $0.

atm3_ts %>%
  autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`

Because of the lack of data, our best forecast here is likely the naive model, where each future value equals the most recent observation. Another thought would be a random walk with drift, but with only three meaningful data points any trend estimate is unreliable. Likewise, with three data points, training and testing competing models is not feasible.

atm3_fit <- atm3_ts %>%
  model(NAIVE(Cash))

atm3_forecast <- atm3_fit %>%
  forecast(h = 30)

atm3_forecast %>%
  autoplot(atm3_ts)

My recommendation here is to wait until more data is available for analysis, or attempt to obtain more data in the case that this ATM was out for repair and has data prior to the date range we have at our disposal.
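If we instead read the long run of zeros as the ATM being out of service rather than true zero demand, a simple alternative (a sketch, under that assumption) is to forecast the mean of the non-zero days:

# Sketch: treat the zeros as "not in service" and forecast the mean of
# the three observed non-zero withdrawals instead of the last value.
atm3_ts %>%
  filter(Cash > 0) %>%
  model(MEAN(Cash)) %>%
  forecast(h = 30)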

atm3_forecast_results <- as.data.frame(atm3_forecast) %>%
  select(DATE, .mean) %>% # Using mean of forecasts to represent the point estimate projections
  rename(Date = DATE, Cash = .mean) %>%
  mutate(Cash=round(Cash,2))

atm3_forecast_results
##          Date Cash
## 1  2010-05-01   85
## 2  2010-05-02   85
## 3  2010-05-03   85
## 4  2010-05-04   85
## 5  2010-05-05   85
## 6  2010-05-06   85
## 7  2010-05-07   85
## 8  2010-05-08   85
## 9  2010-05-09   85
## 10 2010-05-10   85
## 11 2010-05-11   85
## 12 2010-05-12   85
## 13 2010-05-13   85
## 14 2010-05-14   85
## 15 2010-05-15   85
## 16 2010-05-16   85
## 17 2010-05-17   85
## 18 2010-05-18   85
## 19 2010-05-19   85
## 20 2010-05-20   85
## 21 2010-05-21   85
## 22 2010-05-22   85
## 23 2010-05-23   85
## 24 2010-05-24   85
## 25 2010-05-25   85
## 26 2010-05-26   85
## 27 2010-05-27   85
## 28 2010-05-28   85
## 29 2010-05-29   85
## 30 2010-05-30   85
write.csv(atm3_forecast_results, file = "atm3_forecast_results.csv")

ATM4 Forecast

atm4_ts %>%
  model(stl = STL(Cash)) %>%
  components() %>%
  autoplot()

gg_tsdisplay(atm4_ts, Cash)

Similarly to ATMs 1 and 2, we see ACF autocorrelations spiking at lags 7, 14, and 21, with weekly seasonality and not much trend-cycle. We do, however, have changing variability in the time plot above, signaling that a Box-Cox transformation can be used to stabilize the variance.

atm4_lambda <- atm4_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm4_lambda
## [1] 0.4525599

With a suggested lambda of 0.4525599 we can transform the data. One note: fable back-transforms forecasts automatically only when the transformation is written inside the model formula (e.g., ETS(box_cox(Cash, lambda))); because I transform manually with mutate() below, the resulting forecasts remain on the Box-Cox scale and must be back-transformed before reporting.

atm4_ts <- atm4_ts %>%
  mutate(Cash_boxcox = box_cox(Cash, atm4_lambda))

atm4_ts %>%
  model(stl = STL(Cash_boxcox)) %>%
  components() %>%
  autoplot()

gg_tsdisplay(atm4_ts, Cash_boxcox)

As we can see post-transformation, the variance has been smoothed; the ACF spikes, however, remain.

ndiffs(atm4_ts$Cash_boxcox)
## [1] 0

No differencing is suggested by ndiffs() after the transformation.

For ATM4 we will train SNAIVE, auto ARIMA, and auto ETS models. Due to the oscillations in the remainder component, I believe the SNAIVE might perform best; the ETS model may overfit in this case.

atm4_train <- atm4_ts %>%
  filter(DATE < "2010-04-01")

atm4_test <- atm4_ts %>%
  filter(DATE >= "2010-04-01")

atm4_fits <- atm4_train %>%
  model(
    SNAIVE = SNAIVE(Cash_boxcox),
    auto_ETS = ETS(Cash_boxcox),
    arima = ARIMA(Cash_boxcox)
  )

atm4_forecasts <- atm4_fits %>%
  forecast(h = 30)


atm4_forecasts %>%
  autoplot(atm4_test)+
  labs(
    y = "Cash in Hundreds",
    title = "ATM4 Forecasts: April"
  )

atm4_forecasts %>%
  accuracy(atm4_test)
## # A tibble: 3 × 11
##   .model   ATM   .type    ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
##   <chr>    <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 SNAIVE   ATM4  Test  -5.89  12.2  8.87  -83.3  91.5   NaN   NaN  0.0751
## 2 arima    ATM4  Test  -1.88  12.2  9.78 -128.  147.    NaN   NaN -0.0836
## 3 auto_ETS ATM4  Test  -2.18  10.7  8.41 -108.  123.    NaN   NaN  0.0608

For ATM4 our auto_ETS model performed best on both RMSE and MAE (computed on the Box-Cox scale).

atm4_fit_ets <- atm4_ts %>%
  model(ETS = ETS(Cash_boxcox))

atm4_forecasts_ets <- atm4_fit_ets %>%
  forecast(h=30)

atm4_forecasts_ets %>%
  autoplot(atm4_ts) +
    labs(title = "ATM4 ETS Nondiff: May",
       y = "Cash in Hundreds")

atm4_forecast_results <- as.data.frame(atm4_forecasts_ets) %>%
  select(DATE, .mean) %>% # Using mean of forecasts to represent the point estimate projections
  rename(Date = DATE, Cash = .mean) %>%
  mutate(Cash=round(Cash,2))

atm4_forecast_results
##          Date  Cash
## 1  2010-05-01 25.51
## 2  2010-05-02 29.59
## 3  2010-05-03 29.98
## 4  2010-05-04  9.22
## 5  2010-05-05 33.29
## 6  2010-05-06 27.14
## 7  2010-05-07 34.14
## 8  2010-05-08 25.51
## 9  2010-05-09 29.59
## 10 2010-05-10 29.98
## 11 2010-05-11  9.22
## 12 2010-05-12 33.29
## 13 2010-05-13 27.14
## 14 2010-05-14 34.14
## 15 2010-05-15 25.51
## 16 2010-05-16 29.59
## 17 2010-05-17 29.98
## 18 2010-05-18  9.22
## 19 2010-05-19 33.29
## 20 2010-05-20 27.14
## 21 2010-05-21 34.14
## 22 2010-05-22 25.51
## 23 2010-05-23 29.59
## 24 2010-05-24 29.98
## 25 2010-05-25  9.22
## 26 2010-05-26 33.29
## 27 2010-05-27 27.14
## 28 2010-05-28 34.14
## 29 2010-05-29 25.51
## 30 2010-05-30 29.59
write.csv(atm4_forecast_results, file = "atm4_forecast_results.csv")
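A caveat on the output above: because the Box-Cox transformation was applied with mutate() rather than inside the model formula, the Cash values in this table are on the transformed scale, not hundreds of dollars. Two ways to report on the original scale (a sketch; inv_box_cox() is fabletools' inverse transformation, and back-transforming the mean actually yields the forecast median). A separate nit: May 2010 has 31 days, so h = 31 would cover the full month for each ATM.

# Option 1 (sketch): back-transform the point forecasts directly.
atm4_forecast_results <- atm4_forecast_results %>%
  mutate(Cash = round(inv_box_cox(Cash, atm4_lambda), 2))

# Option 2 (sketch): specify the transformation inside the model so fable
# back-transforms the forecasts automatically.
atm4_fit_bt <- atm4_ts %>%
  model(ETS = ETS(box_cox(Cash, atm4_lambda)))
atm4_fit_bt %>%
  forecast(h = 30)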

Part B

Instructions

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Data Load

# Loading in provided excel file of residential power usage

power_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")

Data Exploration

head(power_data)
## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

For the purposes of our time series, we will remove the variable CaseSequence, rename the date column to Date and set it as the index, and convert it from character type to a date type.

summary(power_data)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

We will address the one missing value with imputation.

Data Preprocessing

# Identifying index of NA value
which(is.na(power_data$KWH))
## [1] 129
# Using linear interpolation to estimate missing value based on neighboring data points.
power_data$KWH <- na.approx(power_data$KWH)

# Checking that no NA values are left
summary(power_data)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5434539  
##  Median :828.5   Mode  :character   Median : 6314472  
##  Mean   :828.5                      Mean   : 6502824  
##  3rd Qu.:876.2                      3rd Qu.: 7608792  
##  Max.   :924.0                      Max.   :10655730
# Creating time series with date and KWH, dropping unneeded CaseSequence

power_data <- power_data %>%
  rename(Date = `YYYY-MMM`) %>%
  mutate(Date = yearmonth(Date))

# Create tsibble

power_ts <- power_data %>%
  as_tsibble(index = Date)

# Drop CaseSequence
power_ts <- power_ts %>%
  select(-CaseSequence)

Data Modeling

First visualizing the data and inspecting components

gg_tsdisplay(power_ts)
## Plot variable not specified, automatically selected `y = KWH`

power_ts %>%
  model(stl = STL(KWH)) %>%
  components() %>%
  autoplot()

We can also see that there is an outlier in the data, which I will address via linear interpolation. In the gg_tsdisplay plot above, the outlier records less power usage than even the troughs of the lowest-usage months, so the recording is likely an error.

# Addressing outlier via linear interpolation.
min_kwh <- min(power_ts$KWH, na.rm = TRUE)
print(min_kwh)
## [1] 770523
# Replacing outlier with NA
power_ts <- power_ts %>%
  mutate(KWH = ifelse(KWH == min_kwh, NA, KWH))

# Using interpolation on created NA value
power_ts <- power_ts %>%
  mutate(KWH = na.approx(KWH))

Showing plots after addressing outlier

gg_tsdisplay(power_ts, KWH)

power_ts %>%
  model(stl = STL(KWH)) %>%
  components() %>%
  autoplot()

The two plots above show a recently accelerating positive trend and strong monthly seasonality. The ACF shows a 6-month pattern, with autocorrelation spikes at lags 6, 12, and 18. The monthly seasonality is likely associated with the changing of the seasons. The range of the seasonal swings does not appear to be increasing; the variability is stable.

ndiffs(power_ts$KWH)
## [1] 1

The ndiffs() function, the trend, and the ACF plot above all indicate non-stationary data, so I will apply a seasonal (lag-12) difference before the ARIMA modeling. For the models I will compare an SNAIVE, an auto ARIMA, and an auto ETS.

# Differencing the data for monthly seasonality

power_ts <- power_ts %>%
  mutate(diff_KWH = difference(KWH, lag = 12))

# Checking ndiffs after differencing indicates no additional differencing required
ndiffs(power_ts$diff_KWH)
## [1] 0
fit <- power_ts %>%
  model(ARIMA(diff_KWH))

report(fit)
## Series: diff_KWH 
## Model: ARIMA(1,0,0)(2,0,0)[12] w/ mean 
## 
## Coefficients:
##          ar1     sar1     sar2   constant
##       0.3073  -0.7261  -0.4155  158323.36
## s.e.  0.0752   0.0767   0.0782   49164.21
## 
## sigma^2 estimated as 3.774e+11:  log likelihood=-2662.58
## AIC=5335.16   AICc=5335.48   BIC=5351.45
fit <- power_ts %>%
  model(ETS = ETS(KWH))

report(fit)
## Series: KWH 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1152884 
##     gamma = 0.1983171 
## 
##   Initial states:
##     l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  6188218 0.9083241 0.7515655 0.9314405 1.225896 1.276768 1.233079 1.016933
##      s[-7]    s[-8]     s[-9]   s[-10]  s[-11]
##  0.7623912 0.804406 0.8855059 1.011111 1.19258
## 
##   sigma^2:  0.0094
## 
##      AIC     AICc      BIC 
## 6145.320 6148.047 6194.182

For our models I will compare an SNAIVE, the ARIMA(1,0,0)(2,0,0)[12] with mean (seasonal autoregressive, fit on the differenced series), and ETS(M,N,M). The auto ETS suggests the seasonality is multiplicative, something I did not catch when observing the decomposition.
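In backshift notation, writing $y'_t$ for the lag-12 differenced KWH, the fitted ARIMA is approximately

$$(1 - 0.3073B)(1 + 0.7261B^{12} + 0.4155B^{24})\,y'_t = 158323.4 + \varepsilon_t$$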

power_ts_train <- power_ts %>%
  filter(Date < yearmonth("2013 Jan"))


power_ts_test <- power_ts %>%
  filter(Date >= yearmonth("2013 Jan"))


# Nondifferenced models
power_fit_nondiff <- power_ts_train %>%
  model(
    SNAIVE = SNAIVE(KWH), 
    ETS = ETS(KWH)
  )


# Differenced models
power_fit_diff <- power_ts_train %>%
  model(
    `Auto ARIMA` = ARIMA(diff_KWH, stepwise = FALSE, approx = FALSE)
  )

power_forecast_nondiff <- power_fit_nondiff %>%
  forecast(h = 12)

power_forecast_diff <- power_fit_diff %>%
  forecast(h = 12)

# Plotting nondifferenced models
power_forecast_nondiff %>%
  autoplot(power_ts, level = NULL)+
  facet_wrap( ~ .model, scales = "free_y") +
  guides(colour = guide_legend(title = "Forecast"))+
  labs(title= "Non Diff Model Forecasts: 2013") +
  xlab("Date") +
  ylab("KWH") 

# Plotting differenced models
power_forecast_diff %>%
  autoplot(power_ts, level = NULL)+
  facet_wrap( ~ .model, scales = "free_y") +
  guides(colour = guide_legend(title = "Forecast"))+
  labs(title= "Diff Model Forecasts: 2013") +
  xlab("Date") +
  ylab("KWH") 

power_forecast_nondiff %>%
  accuracy(power_ts_test)
## # A tibble: 2 × 10
##   .model .type      ME     RMSE     MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr>   <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ETS    Test  381167. 1063141. 631440.  3.50  7.29   NaN   NaN  0.0249
## 2 SNAIVE Test  405195. 1035538. 618606.  4.55  7.06   NaN   NaN -0.0313
power_forecast_diff %>%
  accuracy(power_ts_test)
## # A tibble: 1 × 10
##   .model     .type      ME    RMSE     MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>      <chr>   <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Auto ARIMA Test  234994. 936323. 625297. -329.  539.   NaN   NaN -0.0436

Out of our three models, the lowest RMSE belongs to the ARIMA(1,0,0)(2,0,0)[12] with mean, which fits one non-seasonal autoregressive term and two seasonal autoregressive terms to the lag-12 differenced data. Comparing the other two, the SNAIVE was more accurate than the ETS, and neither came close to the ARIMA's RMSE (though note the ARIMA accuracy is computed on the differenced scale, so the comparison is indicative rather than exact). The results are plausible: the ETS can struggle when a trend is present, while the ARIMA adapts better at this horizon.

I will move forward with the ARIMA model.

power_model_ARIMA <- power_ts %>%
  model(ARIMA(diff_KWH))

power_forecast_ARIMA <- power_model_ARIMA %>%
  forecast(h = 12)

power_forecast_ARIMA %>% 
  autoplot(power_ts) +
  labs(title = "ARIMA(1,0,0)(2,0,0)[12] monthly forecast: 2014",
       y = "KWH")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

power_forecast_results <- as.data.frame(power_forecast_ARIMA) %>%
  select(Date, .mean)

power_forecast_results
##        Date        .mean
## 1  2014 Jan  -477705.329
## 2  2014 Feb  1048413.918
## 3  2014 Mar   182796.294
## 4  2014 Apr   -90589.288
## 5  2014 May     6132.105
## 6  2014 Jun   281254.815
## 7  2014 Jul  1072841.129
## 8  2014 Aug   904218.322
## 9  2014 Sep   506776.073
## 10 2014 Oct   107295.038
## 11 2014 Nov   376991.984
## 12 2014 Dec -1256249.701
write.csv(power_forecast_results, file = "power_forecast_results.csv")
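One final caveat: because the ARIMA was fit to diff_KWH, the point forecasts above are year-over-year differences, not KWH levels. A sketch to recover the 2014 levels by undoing the lag-12 difference, i.e., adding back the matching 2013 months:

# Sketch: convert differenced forecasts back to KWH levels using
# y(2014, m) = diff(2014, m) + y(2013, m).
kwh_2013 <- power_ts %>%
  filter(Date >= yearmonth("2013 Jan")) %>%
  pull(KWH)

power_forecast_levels <- power_forecast_results %>%
  mutate(KWH = round(.mean + kwh_2013)) %>%
  select(Date, KWH)

write.csv(power_forecast_levels, file = "power_forecast_results.csv", row.names = FALSE)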