Data624: Project 1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## 
## ✔ tsibble     1.1.4     ✔ fable       0.3.3
## ✔ tsibbledata 0.4.1     ✔ fabletools  0.3.4
## ✔ feasts      0.3.1     
## 
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
## 
## 
## Attaching package: 'naniar'
## 
## 
## The following object is masked from 'package:tsibble':
## 
##     pedestrian
## 
## 
## 
## Attaching package: 'zoo'
## 
## 
## The following object is masked from 'package:tsibble':
## 
##     index
## 
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## Loading required package: lattice
## 
## 
## Attaching package: 'caret'
## 
## 
## The following objects are masked from 'package:fabletools':
## 
##     MAE, RMSE
## 
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## 
## 
## Attaching package: 'seasonal'
## 
## 
## The following object is masked from 'package:tibble':
## 
##     view
## 
## 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade.

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Load Part A Input Files:

input <- getwd()
if(grepl( 'Data624', input, fixed = TRUE)){
    input_fp <- input
} else{
    input_fp <- paste0(getwd(),'/Desktop/Masters_Data_Sci/Data624/')
}

part_a <- 'ATM624Data.xlsx'
parta_i_df <- readxl::read_excel(paste0(input_fp,'/',part_a))
head(parta_i_df)

## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

Review Data & Data Preparation

summary(parta_i_df)

##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10919.8  
##                                     NA's   :19

From a first pass we see that there are 19 null values for cash and it’s not clear if there are any additional missing records in the other two columns. The DATE column is also reading in the data using Excel’s date format which will need to be fixed to create a proper Tsibble.

Let’s graph the withdrawal amounts across each ATM:

parta <- parta_i_df |> mutate(dte=as.Date(DATE, origin = "1899-12-30")) |> 
    arrange(dte,ATM,Cash) |> 
    dplyr::select(dte,ATM,Cash)
    
parta_prep <- parta |>
    filter(!is.na(ATM)) |>
    mutate(day_of_week=wday(dte,label=TRUE)) |>
    group_by(ATM,day_of_week) |>
    mutate(day_num = row_number()) |> ungroup()

ggplot(parta,aes(x=dte,y=Cash,colour=ATM)) +
    geom_line()

Not much is evident from this plot of the data given there is one large outlier value in February or March 2010 that far exceeds all the remaining data points. Perhaps a transformation can account for this record, but it will likely be challenging for any model to appropriately account for this data point. It seems unlikely that there would be evidence to indicate that an error occured in tracking this specific date for ATM4. (i.e. someone incorrectly added a zero on one of the values)

What is the distribution of all of the withdrawals?

ggplot(parta,aes(x=Cash)) +
    geom_histogram() + labs(title='All Cash Withdrawal Distribution',x='Number of Days')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is very strong decay across the dataset and the vast majority of days appear to be near zero. The shape perhaps is similar to a negative binomial curve.

What are the distributions after separating each ATM?

atm1_dist <- ggplot(parta |> filter(ATM =='ATM1') ,aes(x=Cash)) +
    geom_histogram() + labs(title='ATM1 Distribution')
atm2_dist <- ggplot(parta |> filter(ATM =='ATM2') ,aes(x=Cash)) +
    geom_histogram() + labs(title='ATM2 Distribution')
atm3_dist <- ggplot(parta |> filter(ATM =='ATM3') ,aes(x=Cash)) +
    geom_histogram() + labs(title='ATM3 Distribution')
atm4_dist <- ggplot(parta |> filter(ATM =='ATM4') ,aes(x=Cash)) +
    geom_histogram() + labs(title='ATM4 Distribution')
cowplot::plot_grid(atm1_dist,atm2_dist,atm3_dist,atm4_dist,ncol=2,nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The shapes of the distributions are quite different across the ATMs which certainly indicates that the demand in each location is quite different despite the fact that the information in the input dataset is tracked over the exact same time frame. This might warrant testing forecasts for each ATM rather than trying to accommodate one model for all. It is a little difficult to segment these ATMs without additional information, but purely based on their usage it seems like very few models will be able to account for ATM1 and ATM2 while also reliably estimating ATM3 and ATM4. ATM1 has a clear bimodal shape with most of the concentration in a bell pattern around 100 although there are plenty of days with limited cash withdrawn. ATM2 also appears bimodal although there is concentration across many values between 0 and 100. ATM4 is primarily concentrated in low denominations and the shape is decaying at a very fast rate although there is one large outlier that was previously called out. The scale is far different from the rest of the ATMs and despite that one outlier there appear to be more days with larger amounts of withdrawals compared to the other locations.

Let’s review ATM3 further as it seems as though there aren’t many days with withdrawals and a model that always guesses zero may be very hard to beat.

summary(parta |> filter(ATM=='ATM3'))

##       dte                 ATM                 Cash        
##  Min.   :2009-05-01   Length:365         Min.   : 0.0000  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 0.0000  
##  Median :2009-10-30   Mode  :character   Median : 0.0000  
##  Mean   :2009-10-30                      Mean   : 0.7206  
##  3rd Qu.:2010-01-29                      3rd Qu.: 0.0000  
##  Max.   :2010-04-30                      Max.   :96.0000

The 3rd quartile is also zero which would indicate that at least 75% of the data is blank for this ATM.

How many days have withdrawals:

parta |> filter(ATM=='ATM3') |> mutate(is_zero=ifelse(Cash==0,'Y','N')) |> group_by(is_zero) |> summarise(days=n())

## # A tibble: 2 × 2
##   is_zero  days
##   <chr>   <int>
## 1 N           3
## 2 Y         362

## Only 3 days have withdrawals for this ATM  which is only 0.008%! Perhaps we need an anomaly detection algorithm to accurately guess the days where someone goes to pull out cash from this location.

Are the days of withdrawal significant in any way? (i.e. holidays)

parta |> filter(ATM=='ATM3' & Cash!=0)

## # A tibble: 3 × 3
##   dte        ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-28 ATM3     96
## 2 2010-04-29 ATM3     82
## 3 2010-04-30 ATM3     85

Interestingly enough the only values for ATM3 are 3 consecutive days at the end of the time series range provided.

Let’s see what the rates of missingness are across all columns to determine how to treat the different null records that exist.

plot_missing(parta, missing_only = TRUE,
                   ggtheme = theme_classic())

There are a few rows without an ATM value which will need to be reviewed further to determine what method exists to properly impute these records or if they should be excluded from the models.

Let’s see what values exist for those days that are all blank and if they are incorrectly included in the file. Perhaps someone may have added rows for dates using a formula in Excel with the expectation to input them into their file at a later point as there are about two additional weeks of dates.

parta |> filter(dte>='2010-05-01' & dte<='2010-05-15')

## # A tibble: 14 × 3
##    dte        ATM    Cash
##    <date>     <chr> <dbl>
##  1 2010-05-01 <NA>     NA
##  2 2010-05-02 <NA>     NA
##  3 2010-05-03 <NA>     NA
##  4 2010-05-04 <NA>     NA
##  5 2010-05-05 <NA>     NA
##  6 2010-05-06 <NA>     NA
##  7 2010-05-07 <NA>     NA
##  8 2010-05-08 <NA>     NA
##  9 2010-05-09 <NA>     NA
## 10 2010-05-10 <NA>     NA
## 11 2010-05-11 <NA>     NA
## 12 2010-05-12 <NA>     NA
## 13 2010-05-13 <NA>     NA
## 14 2010-05-14 <NA>     NA

It appears that these blank records may actually represent dates that need to be predicted in the forecast.

What is the latest date available for the remaining ATMs?

parta |> group_by(ATM) |> summarise(last_date=max(dte))

## # A tibble: 5 × 2
##   ATM   last_date 
##   <chr> <date>    
## 1 ATM1  2010-04-30
## 2 ATM2  2010-04-30
## 3 ATM3  2010-04-30
## 4 ATM4  2010-04-30
## 5 <NA>  2010-05-14

The remaining ATMs are all cutoff at the end of April 2010 and based on the initial prompt requesting forecasts for May 2010, the decision to exclude these rows from May 2010 makes sense as they will be populated with estimates in the future, but are not going to be used in the model.

The remaining null values more likely need to be imputed to create a valid time series. There are a number of reasonable approaches that can be applied that would fill the remaining 4 missing records. Some of the more advanced methods, such as MICE and KNN, do a better job of capturing the interrelationships that exist between different predictors. In this case the number of data points is more limited and a more simplistic approach might be sufficient to impute reasonable values.

Let’s look to see the values around the time of the null records:

parta |> filter(ATM %in% c('ATM1','ATM2') & dte>='2009-06-01' & dte<='2009-06-30') |> arrange(ATM,dte) |>
    ggplot(aes(x=dte,y=Cash)) +
    geom_line(aes(colour=ATM))

The following includes a couple of potential imputation alternatives that will be considered:

The missingness implies there were no withdrawals for those ATMs and should in fact be zero.

How many other days do ATMs have zero values:

parta |> filter(Cash==0) |> group_by(ATM) |> summarise(num_zero=n())

## # A tibble: 2 × 2
##   ATM   num_zero
##   <chr>    <int>
## 1 ATM2         2
## 2 ATM3       362

ATM2 has other examples of zero, while ATM1 does not have any dates with missing withdrawals which would indicate this approach for treating missingness is less likely to be an accurate method to account for this data.

Using a moving average to estimate the withdrawals

This approach would attempt to estimate the withdrawal amount based on recent leading and lagging dates that are available. Given that there are 2 missing values for each ATM that are both close to one another (within two calendar days) the window in time would likely exclude the other NAs while obtaining the imputed value. What is the most appropriate time frame to calculate the window in time? This is something that will be explored further in a subsequent section.

Calculating an average based on specific day of week

One approach would be to impute the value based on the most similar days that exist in the remaining time series given that there likely are seasonal factors that explain some of the activity that occurs at a given ATM, as well as holidays that impact withdrawal patterns. In this way the other NA values would not impact the calculation of each imputation given the proximity to one another (i.e. they are not going to be the same day)

When initially proposing this option I hadn’t considered that decomposition techniques would be limited by the fact that the ATMs in this series are only tracked over one calendar year. Typically, the algorithms in Fable are reliant on a minimum of two years in order to determine period effects.

Let’s review the days in question to confirm they aren’t around holiday times:

parta |> filter(is.na(Cash) & dte<'2010-05-01') |> mutate(day_of_week=wday(dte,label=TRUE)) |>  arrange(ATM,dte)

## # A tibble: 5 × 4
##   dte        ATM    Cash day_of_week
##   <date>     <chr> <dbl> <ord>      
## 1 2009-06-13 ATM1     NA Sat        
## 2 2009-06-16 ATM1     NA Tue        
## 3 2009-06-22 ATM1     NA Mon        
## 4 2009-06-18 ATM2     NA Thu        
## 5 2009-06-24 ATM2     NA Wed

The only potential holiday that occurred at the time (Juneteenth wasn’t a federal holiday in 2009/2010) is Father’s day and that might prompt some people to withdraw cash to buy gifts. Three out of the Five days are after that time so they likely wouldn’t be impacted, but it is possible that Father’s day might have some effect on ATM1’s first two NA values. It is difficult to speculate if this was statistically impactful, but let’s review other top days of withdrawals for ATM1 to see if holidays or the days preceding them drove the most traffic to this ATM.

parta |> filter(ATM=='ATM1') |> mutate(day_of_week=wday(dte,label=TRUE)) |> arrange(-Cash) |> slice(1:15)

## # A tibble: 15 × 4
##    dte        ATM    Cash day_of_week
##    <date>     <chr> <dbl> <ord>      
##  1 2009-09-22 ATM1    180 Tue        
##  2 2010-02-17 ATM1    179 Wed        
##  3 2009-08-25 ATM1    174 Tue        
##  4 2010-01-29 ATM1    152 Fri        
##  5 2010-02-28 ATM1    152 Sun        
##  6 2009-08-14 ATM1    151 Fri        
##  7 2009-09-04 ATM1    150 Fri        
##  8 2009-06-09 ATM1    145 Tue        
##  9 2009-09-05 ATM1    144 Sat        
## 10 2009-06-12 ATM1    142 Fri        
## 11 2009-06-19 ATM1    140 Fri        
## 12 2009-09-20 ATM1    140 Sun        
## 13 2009-12-01 ATM1    140 Tue        
## 14 2009-12-04 ATM1    140 Fri        
## 15 2010-02-05 ATM1    138 Fri

The expectation would have been that gifts needed around December would potentially be the highest days, but early December has the 13th and 14th highest days for withdrawals. The top 5 days for this ATM don’t appear to correlate with a holiday although the June time frame around where there are missing values has 3 of the highest values in the top 11. The most common day of the week with withdrawals is Friday which the assumption would that is when individuals are paid their wages. What is prompting the top 3 days to be a week day would probably require additional context to explain those values.

Let’s see the same cut of data for ATM2:

parta |> filter(ATM=='ATM2') |> mutate(day_of_week=wday(dte,label=TRUE)) |> arrange(-Cash) |> slice(1:15)

## # A tibble: 15 × 4
##    dte        ATM    Cash day_of_week
##    <date>     <chr> <dbl> <ord>      
##  1 2009-06-06 ATM2    147 Sat        
##  2 2009-07-17 ATM2    147 Fri        
##  3 2009-08-04 ATM2    146 Tue        
##  4 2009-05-15 ATM2    137 Fri        
##  5 2009-05-29 ATM2    136 Fri        
##  6 2009-08-07 ATM2    136 Fri        
##  7 2009-07-11 ATM2    135 Sat        
##  8 2009-06-19 ATM2    134 Fri        
##  9 2009-06-02 ATM2    133 Tue        
## 10 2009-06-08 ATM2    132 Mon        
## 11 2009-11-20 ATM2    132 Fri        
## 12 2009-07-03 ATM2    128 Fri        
## 13 2009-10-23 ATM2    127 Fri        
## 14 2009-07-28 ATM2    126 Tue        
## 15 2009-07-31 ATM2    126 Fri

The top days for this ATM also appear to be mostly Fridays with a few Saturdays and other weekdays sprinkled into the most frequent days. The rationale is the same as ATM1 that paydays would correspond with increased withdrawals.

parta |> mutate(day_of_week=wday(dte,label=TRUE)) |> 
    filter(!is.na(Cash) & Cash !=0 & Cash<9000) |>
    ggplot(aes(x=Cash)) +
    geom_histogram() +
    facet_wrap(vars(day_of_week))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distributions across all of the ATMs appear to be right skewed after excluding zeros and the one massive outlier with Fridays have the highest values of skew across all days. Tuesdays had more skew than expected.

Let’s impute a value for the NAs based on a rolling mean and calculating a the 7 week day moving average including the 4 preceding weeks’ withdrawals and the 3 succeeding weeks withdrawals.

parta_na_fill <- parta |> filter(!is.na(ATM)) |>
    mutate(day_of_week=wday(dte,label=TRUE)) |>
    group_by(ATM,day_of_week) |>
    mutate(day_num = row_number()) |>
    arrange(ATM,day_of_week) |>
    filter(!is.na(Cash)) |> 
    mutate(roll_mean=rollmean(Cash,k=7,fill=NA)) |> 
    filter(lead(day_num)-day_num!=1) |> mutate(new_day=day_num+1)|> dplyr::select(c(dte,ATM,new_day,roll_mean)) |> ungroup()

## Adding missing grouping variables: `day_of_week`

cash <- parta_prep |> left_join(parta_na_fill,join_by(ATM==ATM,day_num==new_day,day_of_week==day_of_week),suffix = c("", ".y")) |> mutate(cash_mod=ifelse(is.na(Cash),roll_mean,Cash)) |> dplyr::select(c(dte,ATM,Cash,day_of_week,day_num,cash_mod)) |> filter(!is.na(ATM))

summary(cash)

##       dte                 ATM                 Cash         day_of_week
##  Min.   :2009-05-01   Length:1460        Min.   :    0.0   Sun:208    
##  1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5   Mon:208    
##  Median :2009-10-30   Mode  :character   Median :   73.0   Tue:208    
##  Mean   :2009-10-30                      Mean   :  155.6   Wed:208    
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0   Thu:208    
##  Max.   :2010-04-30                      Max.   :10919.8   Fri:212    
##                                          NA's   :5         Sat:208    
##     day_num         cash_mod      
##  Min.   : 1.00   Min.   :    0.0  
##  1st Qu.:14.00   1st Qu.:    1.0  
##  Median :27.00   Median :   73.0  
##  Mean   :26.57   Mean   :  155.3  
##  3rd Qu.:40.00   3rd Qu.:  114.0  
##  Max.   :53.00   Max.   :10919.8  
##

This new dataframe confirms that all of the na values are in the modified cash column cash_mod and this time series can now be turned into a tsibble and modeled using techniques covered in the course so far.

One last step before proceeding with modeling is to try to transform the ATM shapes to try to standarize values for improved modeling.

Transformations

Let’s first take a look at trying to transform the full dataset across all of the ATMs although the expectation is this will not better estimate a normal distribution

mod_cash <- preProcess(cash |> dplyr::select(cash_mod),method='BoxCox')

mod_cash$bc$cash_mod$lambda

## NULL

This in fact returns a null parameter so we will attempt to transform each ATM to better approximate a normal distribution.

Let’s see if ATM1 can be improved:

mod_cash1 <- preProcess(cash |> as.data.frame() |> filter(ATM=='ATM1') |> dplyr::select(cash_mod),method='BoxCox')

cash_tf1 <- mod_cash1$bc$cash_mod$lambda

cowplot::plot_grid(cash |> filter(ATM=='ATM1') |>
    ggplot(aes(x=cash_mod)) +
    geom_histogram() + labs(title='ATM1 Cash Distribution'),cash |> filter(ATM=='ATM1') |>
    ggplot(aes(x=dte,y=cash_mod)) +
    geom_line() + labs(title='ATM1 Cash over Time'),nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cowplot::plot_grid(cash |> filter(ATM=='ATM1') |>
    ggplot(aes(x=cash_mod^cash_tf1)) +
    geom_histogram() + labs(title='ATM1 Cash^1.2 Distribution'),cash |> filter(ATM=='ATM1') |>
    ggplot(aes(x=dte,y=cash_mod^cash_tf1)) +
    geom_line() + labs(title='ATM1 Cash^1.2 over Time'),nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After transforming the ATM1 withdrawals there isn’t much improvement with the shape of the data and it’s unclear if the daily seasonal patterns that exist are less variable. Based on the power transform obtained from BoxCox methods it does not appear this would be that useful particularly given decreased interpretability.

Let’s see if ATM2 can be improved after adjusting the zero values:

mod_cash2 <- preProcess(cash |> as.data.frame() |> filter(ATM=='ATM2') |> mutate(cash_mod=ifelse(cash_mod==0,0.01,cash_mod)) |> dplyr::select(cash_mod),method='BoxCox')

cash_tf2 <- mod_cash2$bc$cash_mod$lambda

cowplot::plot_grid(cash |> filter(ATM=='ATM2') |>
    ggplot(aes(x=cash_mod)) +
    geom_histogram() + labs(title='ATM2 Cash Distribution'),cash |> filter(ATM=='ATM2') |>
    ggplot(aes(x=dte,y=cash_mod)) +
    geom_line() + labs(title='ATM2 Cash over Time'),nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cowplot::plot_grid(cash |> filter(ATM=='ATM2') |>
    ggplot(aes(x=cash_mod^cash_tf2)) +
    geom_histogram() + labs(title='ATM2 Cash^0.7 Distribution'),cash |> filter(ATM=='ATM2') |>
    ggplot(aes(x=dte,y=cash_mod^cash_tf2)) +
    geom_line() + labs(title='ATM2 Cash^0.7 over Time'),nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The shape of the distribution does not seem to be that much improved as the lower values still show up fairly heavily although the data appears more leftward skewed. The line graph doesn’t appear to show much of a difference in the patterns and seasons that exist. Therefore, it appears that no transformation will be used for this ATM as well.

ATM3:

Unfortunately, ATM3 remains a rather hard time series to predict/statistically optimize given that almost all of the values are zero. Therefore, no transformation will be attempted as it’s unlikely to improve the data given the lack of information with a boxcox when 99% of the data will be adjusted to review available transformations.

Let’s see if ATM4 can be improved:

mod_cash4 <- preProcess(cash |> as.data.frame() |> filter(ATM=='ATM4') |> mutate(cash_mod=ifelse(cash_mod==0,0.01,cash_mod)) |> dplyr::select(cash_mod),method='BoxCox')

cash_tf4 <- mod_cash4$bc$cash_mod$lambda

cowplot::plot_grid(cash |> filter(ATM=='ATM4') |>
    ggplot(aes(x=cash_mod)) +
    geom_histogram() + labs(title='ATM4 Cash Distribution'),cash |> filter(ATM=='ATM4') |>
    ggplot(aes(x=dte,y=cash_mod)) +
    geom_line() + labs(title='ATM4 Cash over Time'),nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cowplot::plot_grid(cash |> filter(ATM=='ATM4') |>
    ggplot(aes(x=cash_mod^cash_tf4)) +
    geom_histogram() + labs(title='ATM2 Cash^0.3 Distribution'),cash |> filter(ATM=='ATM4') |>
    ggplot(aes(x=dte,y=cash_mod^cash_tf4)) +
    geom_line() + labs(title='ATM4 Cash^0.3 over Time'),nrow=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cash |> filter(ATM=='ATM4' & dte>='2010-02-05')

## # A tibble: 85 × 6
##    dte        ATM      Cash day_of_week day_num cash_mod
##    <date>     <chr>   <dbl> <ord>         <dbl>    <dbl>
##  1 2010-02-05 ATM4    454.  Fri              41    454. 
##  2 2010-02-06 ATM4    458.  Sat              41    458. 
##  3 2010-02-07 ATM4    112.  Sun              41    112. 
##  4 2010-02-08 ATM4    418.  Mon              41    418. 
##  5 2010-02-09 ATM4  10920.  Tue              41  10920. 
##  6 2010-02-10 ATM4     42.4 Wed              41     42.4
##  7 2010-02-11 ATM4    280.  Thu              41    280. 
##  8 2010-02-12 ATM4    412.  Fri              42    412. 
##  9 2010-02-13 ATM4    853.  Sat              42    853. 
## 10 2010-02-14 ATM4    180.  Sun              42    180. 
## # ℹ 75 more rows

The primary concern with ATM4 is the treatment of the outlier and after making this tranformation the goal would be to decrease the differential between that data point and the remainder of the series. The original value is more than 10x the rest of the withdrawals, which makes you wonder if someone accidentally typed a zero at the end. After the transformation, it is only 2x-3x compared to the rest of the values. While there will no doubt be a large remainder there in most model estimates there are limited means of adjusting that value. Therefore, the transformation accomplishes the task at hand with this subset of the data for ATM4. The shape of the data is a bit improved as well so hopefully that would allow for modeling techniques to better capture the other trends/seasonality that exists for this ATM.

summary(cash|>filter(ATM=='ATM4'))

##       dte                 ATM                 Cash           day_of_week
##  Min.   :2009-05-01   Length:365         Min.   :    1.563   Sun:52     
##  1st Qu.:2009-07-31   Class :character   1st Qu.:  124.334   Mon:52     
##  Median :2009-10-30   Mode  :character   Median :  403.839   Tue:52     
##  Mean   :2009-10-30                      Mean   :  474.043   Wed:52     
##  3rd Qu.:2010-01-29                      3rd Qu.:  704.507   Thu:52     
##  Max.   :2010-04-30                      Max.   :10919.762   Fri:53     
##                                                              Sat:52     
##     day_num         cash_mod        
##  Min.   : 1.00   Min.   :    1.563  
##  1st Qu.:14.00   1st Qu.:  124.334  
##  Median :27.00   Median :  403.839  
##  Mean   :26.57   Mean   :  474.043  
##  3rd Qu.:40.00   3rd Qu.:  704.507  
##  Max.   :53.00   Max.   :10919.762  
##

Modeling

cash_ts <- cash |> as_tsibble(index=dte,key=ATM)

cowplot::plot_grid(cash_ts |> filter(ATM=='ATM1') |>ACF(cash_mod) |> autoplot()+labs(title='ATM1 Autocorrelation'),cash_ts |> filter(ATM=='ATM2') |>ACF(cash_mod) |> autoplot()+labs(title='ATM2 Autocorrelation'),cash_ts |> filter(ATM=='ATM3') |>ACF(cash_mod) |> autoplot()+labs(title='ATM3 Autocorrelation'),cash_ts |> filter(ATM=='ATM4') |>ACF(cash_mod^cash_tf4) |> autoplot()+labs(title='ATM4 Autocorrelation'))

The autocorrelation as expected is strongest based on weekly time increments, but there are statistically significant lags for ATM1 and ATM2 between many day time differences. ATM3 is primarily zero so it makes sense that only minimal lagged values are correlated. ATM4 after applying the transformation identified in the last section appears to only have autocorrelation at week increments.

Let’s start working on models:

The approach will use basic forecasting strategies (mean, naive, seasonal naive, and drift) compared against more advanced modeling options (exponential smoothing & ARIMA variants).

There will also be two approaches for sampling train and test data to determine if that contributes to meaningful improvements in comparative metrics, which the thought will be to use RMSE as the common benchmark.

Use an 80% training and 20% holdout set (last two months of data)
Run cross fold validation against the full dataset as the time series is smaller in duration and Kuhn mentions that a test holdout may not be as accurate with a smaller total increment. Hyndman does not seem to give so much guidance on other alternatives, but provides methods in Fable to run on time series.

Let’s review the preparation for an ARIMA model:

Let’s run a unit root test to determine if any level of differencing is needed in the dataset:

cash_ts |> features(cash_mod, unitroot_kpss)

## # A tibble: 4 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM1     0.486       0.0448
## 2 ATM2     2.03        0.01  
## 3 ATM3     0.389       0.0818
## 4 ATM4     0.0797      0.1

By evaluating all the models unit root tests we can determine if differencing is needed in order to meet stationarity assumptions to run ARIMA models on the data. If we are using a standard alpha value of 5%, ATM1 and ATM2 do not meet the theshold to pass this test and will require differencing. Those results aren’t that surprising given the strong seasonal trends that appear to exist for these ATMs. It is a little surprising that ATM4 met that criteria, but we will check the results for first order differencing as well. ARIMA will not be run on ATM3 as the lack of data will make it less likely to perform well and the basic models are probably the only means of making a realistic estimate for the forecast.

cash_ts |> features(difference(cash_mod,7), unitroot_kpss)

## # A tibble: 4 × 3
##   ATM   kpss_stat kpss_pvalue
##   <chr>     <dbl>       <dbl>
## 1 ATM1     0.0219      0.1   
## 2 ATM2     0.0163      0.1   
## 3 ATM3     0.389       0.0819
## 4 ATM4     0.0140      0.1

One order of differencing appears to get every ATM above the 5% threshold for stationarity for ARIMA models.

cash_ts |> filter(ATM=='ATM1') |> gg_tsdisplay(difference(cash_mod,7),plot_type='partial')

The ACF and PACF show some seasonal patterns that exist that will help us build some tentative models. AR(1) will be attempted as well as MA(1) and MA(3).

cash_ts |> filter(ATM=='ATM2') |> gg_tsdisplay(difference(cash_mod,7),plot_type='partial')

ATM2 appears to have some potential models based on signficant autocorrelation that exists in the ACF and PACF plots. Models AR(1) and AR(3) will be attempted along with MA(1) MA(2) and MA(3) all within the seasonal P/Q terms.

cash_ts |> filter(ATM=='ATM3') |> gg_tsdisplay(cash_mod,plot_type='partial')

Given there are only 3 data points available for this ATM it seems unlikely an ARIMA model is appropriate for forecasting this data.

cash_ts |> filter(ATM=='ATM4') |> gg_tsdisplay(difference(cash_mod^cash_tf4,7),plot_type='partial')

In ATM4 autocorrelation plot, we see the data is fairly stationarity although there are a few points that stand out from the rest, but it appears to be one above and below the line. Considering a seasonal model AR(1) might be viable along with MA(1), MA(2), and MA(3).

cash_ts |> filter(ATM=='ATM1') |> autoplot(cash_mod)+labs(title='ATM1 Time Series')

cash_ts |> filter(ATM=='ATM2') |> autoplot(cash_mod)+labs(title='ATM2 Time Series')

cash_ts |> filter(ATM=='ATM3') |> autoplot(cash_mod)+labs(title='ATM3 Time Series')

cash_ts |> filter(ATM=='ATM4') |> autoplot(cash_mod^cash_tf4)+labs(title='ATM4 Time Series')

Let’s train the model for the first 10 months of data and run forecasts for each ATM:

cash_ts_fit <- cash_ts |> dplyr::filter(!ATM %in% c('ATM3','ATM4') & dte<'2010-03-01') |>
    model(mean=MEAN(cash_mod),
          naive=NAIVE(cash_mod),
          snaive=SNAIVE(cash_mod ~ lag("week")),
          drift=RW(cash_mod ~ drift()),
          aaa = ETS(cash_mod ~ error('A')+trend('A') + season('A')),
          aada = ETS(cash_mod ~ error('A')+trend('Ad') + season('A')),
          arima000110 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(1,1,0)),
          arima000310 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(3,1,0)),
          arima000011 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(0,1,1)),
          arima000012 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(0,1,2)),
          arima000013 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(0,1,3)),
          arima_auto = ARIMA(cash_mod,stepwise=FALSE)
          )

cash_ts_fit |>
    forecast(h=60) |>
    autoplot(cash_ts,level=NULL)

It is a bit challenging to review the forecast simultaneously and therefore each one will be plotted separately.

ATM3 will not follow normal training procedures given that such a high amount of the data is a zero value. There aren’t many alternatives to try to estimate this specific time series

atm3_fit <- cash_ts |> filter(ATM=='ATM3') |>
    model(mean=MEAN(cash_mod),
          naive=NAIVE(cash_mod),
          snaive=SNAIVE(cash_mod ~ lag("week")),
          drift=RW(cash_mod ~ drift()),
          aan = ETS(cash_mod ~ error('A')+trend('A') + season('N')),
          aadn = ETS(cash_mod ~ error('A')+trend('Ad') + season('N')),
          ana=ETS(cash_mod ~ error('A')+trend('N') + season('N')))

atm3_fit |>
    forecast(h=30) |>
    autoplot(cash_ts |> filter(ATM=='ATM3' & dte>'2010-04-01'),level=NULL)

Based on the limited available information, the ultimate selection will be an exponential smoothing model with a damped trend. It’s too early to say if there is seasonal data and it’s not as though there are any relevant weekly values to build from. Although the naive model might be more conservative it would seem more probable that there will be some variety in future withdrawal days.

ATM4 will also need to be separately fit as it is the only key that will be transformed.

cash_ts_fit_atm4 <- cash_ts |> dplyr::filter(ATM=='ATM4' & dte<'2010-03-01') |>
    model(mean=MEAN(cash_mod^cash_tf4),
          naive=NAIVE(cash_mod^cash_tf4),
          snaive=SNAIVE(cash_mod^cash_tf4 ~ lag("week")),
          drift=RW(cash_mod^cash_tf4 ~ drift()),
          aaa = ETS(cash_mod^cash_tf4 ~ error('A')+trend('A') + season('A')),
          aada = ETS(cash_mod^cash_tf4 ~ error('A')+trend('Ad') + season('A')),
          arima000110 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(1,1,0)),
          arima000310 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(3,1,0)),
          arima000011 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(0,1,1)),
          arima000012 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(0,1,2)),
          arima000013 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(0,1,3)),
          arima_auto = ARIMA(cash_mod^cash_tf4,stepwise=FALSE)
          )

Which model did the automated algorithm pick?

report(cash_ts_fit |> filter(ATM=='ATM1') |> dplyr::select(arima_auto))

## Series: cash_mod 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.2162  -0.7362  -0.1801
## s.e.  0.0586   0.0613   0.0589
## 
## sigma^2 estimated as 572.9:  log likelihood=-1368.99
## AIC=2745.98   AICc=2746.12   BIC=2760.75

report(cash_ts_fit |> filter(ATM=='ATM2') |> dplyr::select(arima_auto))

## Series: cash_mod 
## Model: ARIMA(5,0,0)(0,1,1)[7] w/ drift 
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ar5     sma1  constant
##       0.0329  -0.1226  -0.0232  0.0929  -0.1858  -0.8522   -0.6462
## s.e.  0.0572   0.0578   0.0577  0.0585   0.0596   0.0426    0.2442
## 
## sigma^2 estimated as 596.1:  log likelihood=-1371.38
## AIC=2758.76   AICc=2759.26   BIC=2788.31

report(cash_ts_fit_atm4 |> filter(ATM=='ATM4') |> dplyr::select(arima_auto))

## Series: cash_mod 
## Model: ARIMA(0,0,0)(1,0,0)[7] w/ mean 
## Transformation: cash_mod^cash_tf4 
## 
## Coefficients:
##         sar1  constant
##       0.2475    4.2841
## s.e.  0.0562    0.1081
## 
## sigma^2 estimated as 3.631:  log likelihood=-626.57
## AIC=1259.15   AICc=1259.23   BIC=1270.3

ATM1 has similar seasonal parameters to what was manually selected although it includes one error term lag. ATM2 is substantially different than expected with 5 p lagged terms as well as a constant although the seasonal component is in line with the ACF/PACF review. ATM4 obtained a very similar result as to what was identified although it excluded a seasonal differencing step.

cash_ts_fit |> filter(ATM=='ATM1') |>
    forecast(h=60) |>
    autoplot(cash_ts |> filter(ATM=='ATM1' & dte>'2009-12-31'),level=NULL) + labs(title='ATM1 Forecasts')

It’s rather challenging given the frequent swings between daily withdrawals to determine if any of these models outperforms one another although the seasonal naive model seems to substantially overestimate certain points while doing a good job of forecasting the days with limited withdrawals.

cash_ts_fit |> filter(ATM=='ATM2') |>
    forecast(h=60) |>
    autoplot(cash_ts |> filter(ATM=='ATM2' & dte>'2009-12-31'),level=NULL) + labs(title='ATM2 Forecasts')

Similar to ATM1 given the high number of models being tested it is hard to truly discern the differences among each one. The mixture of colors could probably be improved a bit although very few palettes can properly account for more than 8 variants.

cash_ts_fit_atm4 |> filter(ATM=='ATM4') |> dplyr::select(-c(naive,drift)) |>
    forecast(h=60) |>
    autoplot(cash_ts |> filter(ATM=='ATM4' & dte>'2009-12-31'),level=NULL) + labs(title='ATM4 Forecasts')

The naive and drift models were eliminated as the outlier event caused substantial inaccuracies where these models mirrored the data (last day of training for the outlier). The forecast without a doubt are unrealistic methods for modeling this ATM’s withdrawal data.

glance(cash_ts_fit) |> arrange(AICc) |> dplyr::select(ATM:BIC)

## # A tibble: 24 × 7
##    ATM   .model      sigma2 log_lik   AIC  AICc   BIC
##    <chr> <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl>
##  1 ATM1  arima_auto    573.  -1369. 2746. 2746. 2761.
##  2 ATM1  arima000012   599.  -1375. 2757. 2757. 2768.
##  3 ATM1  arima000013   599.  -1375. 2758. 2758. 2773.
##  4 ATM2  arima_auto    596.  -1371. 2759. 2759. 2788.
##  5 ATM1  arima000011   621.  -1380. 2763. 2763. 2771.
##  6 ATM2  arima000011   627.  -1381. 2766. 2766. 2773.
##  7 ATM2  arima000012   627.  -1381. 2767. 2767. 2778.
##  8 ATM2  arima000013   629.  -1380. 2769. 2769. 2783.
##  9 ATM1  arima000310   653.  -1384. 2777. 2777. 2791.
## 10 ATM2  arima000310   660.  -1386. 2780. 2780. 2795.
## # ℹ 14 more rows

It can be a deceiving to compare the AICc for all of the models as they potentially have different degrees of freedom making their comparison inaccurate for in sample train data. As expected the ATM3 models did not perform at all and for the other ATMs it appears ARIMA might be best.

glance(cash_ts_fit_atm4) |> arrange(AICc) |> dplyr::select(ATM:BIC)

## # A tibble: 12 × 7
##    ATM   .model      sigma2 log_lik   AIC  AICc   BIC
##    <chr> <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl>
##  1 ATM4  arima000011   3.07   -601. 1205. 1205. 1212.
##  2 ATM4  arima000012   3.08   -600. 1206. 1206. 1217.
##  3 ATM4  arima000013   3.10   -600. 1207. 1207. 1222.
##  4 ATM4  arima000310   3.92   -625. 1258. 1258. 1273.
##  5 ATM4  arima_auto    3.63   -627. 1259. 1259. 1270.
##  6 ATM4  arima000110   4.66   -650. 1305. 1305. 1312.
##  7 ATM4  aaa           3.15  -1038. 2100. 2101. 2145.
##  8 ATM4  aada          3.15  -1037. 2101. 2102. 2149.
##  9 ATM4  mean          3.85     NA    NA    NA    NA 
## 10 ATM4  naive         7.31     NA    NA    NA    NA 
## 11 ATM4  snaive        5.79     NA    NA    NA    NA 
## 12 ATM4  drift         7.31     NA    NA    NA    NA

Let’s calculate on the test set and then use RMSE to compare each model by ATM.

cash_ts_fc <- cash_ts_fit |>
    forecast(h=60)

cash_ts_amt4_fc <- cash_ts_fit_atm4 |>
    forecast(h=60)

cash_ts_fc |> accuracy(cash_ts) |> arrange(RMSE)

## # A tibble: 24 × 11
##    .model     ATM   .type      ME  RMSE   MAE     MPE  MAPE  MASE RMSSE     ACF1
##    <chr>      <chr> <chr>   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1 snaive     ATM2  Test    3.73   19.4  12.8 -Inf    Inf   0.600 0.630 -0.145  
##  2 arima0001… ATM2  Test   -2.24   21.2  15.0 -Inf    Inf   0.706 0.687 -0.174  
##  3 snaive     ATM1  Test  -16.6    24.5  19.6   -9.04  33.0 1.04  0.829  0.0297 
##  4 arima0001… ATM1  Test  -20.9    25.8  22.2 -127.   128.  1.18  0.873 -0.195  
##  5 arima0003… ATM1  Test  -15.0    27.7  20.0 -244.   249.  1.06  0.936 -0.187  
##  6 arima0003… ATM2  Test    0.737  27.9  24.2 -Inf    Inf   1.14  0.904 -0.00914
##  7 mean       ATM1  Test   -7.87   34.8  23.0 -311.   325.  1.22  1.18  -0.126  
##  8 arima0000… ATM1  Test   -9.44   36.0  26.2 -312.   329.  1.39  1.22  -0.154  
##  9 mean       ATM2  Test   -7.62   40.4  35.3 -Inf    Inf   1.66  1.31   0.0409 
## 10 arima0000… ATM1  Test   -6.80   42.0  31.5 -337.   363.  1.67  1.42  -0.0765 
## # ℹ 14 more rows

When evaluating all of the ATMs in one table there is too much going on and therefore let’s take a subset of the top 3 of each ATM forecasts based on RMSE.

cash_ts_fc |> accuracy(cash_ts) |> group_by(ATM) |> 
    arrange(RMSE) |> mutate(row=row_number()) |> filter(row<=3) |> dplyr::select(.model:MAE,row)

## # A tibble: 6 × 7
## # Groups:   ATM [2]
##   .model      ATM   .type      ME  RMSE   MAE   row
##   <chr>       <chr> <chr>   <dbl> <dbl> <dbl> <int>
## 1 snaive      ATM2  Test    3.73   19.4  12.8     1
## 2 arima000110 ATM2  Test   -2.24   21.2  15.0     2
## 3 snaive      ATM1  Test  -16.6    24.5  19.6     1
## 4 arima000110 ATM1  Test  -20.9    25.8  22.2     2
## 5 arima000310 ATM1  Test  -15.0    27.7  20.0     3
## 6 arima000310 ATM2  Test    0.737  27.9  24.2     3

The manually selected model, the automated, and the mean model all performed exteremely similarly on the holdout set. That’s rather surprising that the mean would be comparable to other modeling techniques. Perhaps this speaks to the fact that few of the models are accurately accounting for the full variance in the data. The one outlier case may be contributing to these results. ATM3 due to it’s irregularlity has multiple models with the same results although it’ll be difficult to choose the “best” option for this particular machine. Another surprise result for ATM1 & ATM2 that the seasonal naive method performed best on the test set, which might signify that prior lagged weekly periods were very similar to one another. Cross validation will be helpful as a secondary check to determine if these benchmarks are in fact preferable to more complex methods.

cash_ts_amt4_fc |> accuracy(cash_ts |> filter(ATM=='ATM4')) |> group_by(ATM) |> 
    arrange(RMSE) |> mutate(row=row_number())

## # A tibble: 12 × 12
## # Groups:   ATM [1]
##    .model      ATM   .type       ME   RMSE    MAE     MPE   MAPE   MASE  RMSSE
##    <chr>       <chr> <chr>    <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1 mean        ATM4  Test     -70.7   297.   255.   -901.   925.  0.606  0.307
##  2 arima_auto  ATM4  Test     -73.1   299.   256.   -917.   940.  0.608  0.310
##  3 arima000012 ATM4  Test     -67.5   353.   299.  -1059.  1095.  0.712  0.365
##  4 arima000011 ATM4  Test     -66.9   353.   300.  -1057.  1094.  0.713  0.366
##  5 arima000013 ATM4  Test     -67.7   360.   305.  -1075.  1112.  0.727  0.372
##  6 aada        ATM4  Test    -180.    407.   330.  -1335.  1359.  0.786  0.421
##  7 aaa         ATM4  Test    -192.    410.   331.  -1347.  1369.  0.788  0.424
##  8 arima000310 ATM4  Test    -331.    546.   432.  -1741.  1753.  1.03   0.565
##  9 arima000110 ATM4  Test    -636.    867.   674.  -1320.  1324.  1.60   0.897
## 10 snaive      ATM4  Test   -1308.   1755.  1340.  -2019.  2023.  3.19   1.82 
## 11 naive       ATM4  Test  -15958.  18135. 15958. -38504. 38504. 38.0   18.8  
## 12 drift       ATM4  Test  -18513.  21434. 18513. -44970. 44970. 44.1   22.2  
## # ℹ 2 more variables: ACF1 <dbl>, row <int>

The mean method appears to best track the in sample data, which is rather surprising although the ARIMA(0,0,1)(0,0,0) is close behind. Cross validation may help further separate the models.

Let’s check the residuals for the top model for each ATM:

cash_ts_fit |> dplyr::select(snaive) |> filter(ATM=='ATM1') |> gg_tsresiduals()

The residual distribution shape appears quite normal although there are some large residuals values in both directions which likely counter balance one another in the model. There is probably too much autocorrelation remaining in the residuals which likely means it would not pass a portmanteau test.

augment(cash_ts_fit |> dplyr::select(snaive) |> filter(ATM=='ATM1')) |>
  features(.innov, ljung_box, lag=7)

## # A tibble: 1 × 4
##   ATM   .model lb_stat lb_pvalue
##   <chr> <chr>    <dbl>     <dbl>
## 1 ATM1  snaive    62.1  5.81e-11

The p-value is extremely small which means that the residuals are not resembling white noise and that this model leaves something to be desired.

Does the second best ARIMA model have more acceptable residuals:

cash_ts_fit |> dplyr::select(arima000110) |> filter(ATM=='ATM1') |> gg_tsresiduals()

The residual distribution shape appears fairly normal although there are some large residuals outliers in both directions. There is probably too much autocorrelation remaining in the residuals and would also seem to indicate it would not pass a portmanteau test.

augment(cash_ts_fit |> dplyr::select(arima000110) |> filter(ATM=='ATM1')) |>
  features(.innov, ljung_box, lag = 7,dof=2)

## # A tibble: 1 × 4
##   ATM   .model      lb_stat lb_pvalue
##   <chr> <chr>         <dbl>     <dbl>
## 1 ATM1  arima000110    20.3   0.00112

Yes, both of the top models do not have residuals that resemble white noise. It is not the end of the world although it might speak to the quality of the models in terms of predicting ATM1

cash_ts_fit |> dplyr::select(snaive) |> filter(ATM=='ATM2') |> gg_tsresiduals()

The distribution of the residuals appears to satisfy a normal bell like shape although there are many large outliers that seem to exist. The autocorrelation among lagged values appears to indicate that this model’s residuals resemble white noise.

augment(cash_ts_fit |> dplyr::select(snaive) |> filter(ATM=='ATM2')) |>
  features(.innov, ljung_box, lag = 7)

## # A tibble: 1 × 4
##   ATM   .model lb_stat lb_pvalue
##   <chr> <chr>    <dbl>     <dbl>
## 1 ATM2  snaive    80.6  1.04e-14

The p-value is very small which is rather surprising which means that the residuals are not resembling white noise and that this model does not fully account for the variance within the data for this ATM.

Does the second best ARIMA model have more acceptable residuals:

cash_ts_fit |> dplyr::select(arima000110) |> filter(ATM=='ATM2') |> gg_tsresiduals()

The residuals are centered around zero with a fairly normal distribution although the spikes in the time plot show more spikes in the positive direction which is somewhat evident in the histogram as well. The autocorrelation in the residuals is definitely closer to and would white noise although it may not pass a portmanteau test given there are two significant lagged values.

augment(cash_ts_fit |> dplyr::select(arima000110) |> filter(ATM=='ATM2')) |>
  features(.innov, ljung_box, lag = 7, dof = 2)

## # A tibble: 1 × 4
##   ATM   .model      lb_stat lb_pvalue
##   <chr> <chr>         <dbl>     <dbl>
## 1 ATM2  arima000110    14.9    0.0109

Yes, consistent with the visual observation the residuals are not representing white noise.

The residuals for ATM3 are not large enough to evaluate and will have to proceed more with intuition.

Somewhat as expected there isn’t much that can be done about the residuals for this ATM given all of the training data had zero values.

Let’s review ATM4 residuals:

cash_ts_fit_atm4 |> dplyr::select(mean) |> gg_tsresiduals()

The residuals plot indicates that the residuals are fairly evenly off in both directions for ATM4 with a mostly normal approximation The ACF plot may show potential signs that it is not equivalent to white noise with 3 lagged values being statistically significant. Let’s run Ljung Box to confirm that observation:

augment(cash_ts_fit_atm4 |> dplyr::select(mean)) |>
  features(.innov, ljung_box, lag = 7, dof = 1)

## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 mean      21.7   0.00140

As expected the residuals do not pass this test that they resemble white noise. Let’s evaluate the second best model to see if it meets those criteria:

cash_ts_fit_atm4 |> dplyr::select(arima_auto) |> gg_tsresiduals()

The autocorrelation is substantially less and there isn’t a single lagged value that is significant which should mean that it would pass Ljung Box test. Outside of a large outlier the residuals are mostly homoskedastic and have a somewhat bell shape although it does not appear to be centered around zero.

augment(cash_ts_fit_atm4 |> dplyr::select(arima_auto)) |>
  features(.innov, ljung_box, lag = 7, dof = 1)

## # A tibble: 1 × 3
##   .model     lb_stat lb_pvalue
##   <chr>        <dbl>     <dbl>
## 1 arima_auto    3.50     0.744

No surprise with the results of this pormanteau test that the residuals resemble white noise.

Let’s test running cross validation for each ATM to more thoroughly evaluate the dataset and pick the best model.

ATM1:

cash_cv_fc_atm1 <- cash_ts |>
  stretch_tsibble(.init = 30) |> 
    filter(ATM=='ATM1') |>
    model(mean=MEAN(cash_mod),
          snaive=SNAIVE(cash_mod ~ lag("week")),
          arima000110 = ARIMA(cash_mod ~ 0 + pdq(0,0,0) + PDQ(1,1,0)),
          arima001012 = ARIMA(cash_mod ~ 0 + pdq(0,0,0) + PDQ(0,1,2))) |>
    forecast(h=30) |>
    group_by(.id) |>
  mutate(h = row_number()) |>
  ungroup() |>
  as_fable(response = "cash_mod", distribution = cash_mod)

cash_cv_fc_atm1 |>
  accuracy(cash_ts, by = c("h", ".model")) |>
  ggplot(aes(x = h, y = RMSE)) +
  geom_point(aes(colour=.model))

Although the mean method technically has the lowest average RMSE across the specific time horizons in cross validation, it is also less robust against some of the complexities that exist for the withdrawal data. In combination with the initial test validation data and the performance shown the model that will be selected for forecast is the ARIMA001012.

atm1_final_fc <- cash_ts_fit |> filter(ATM=='ATM1') |> dplyr::select(arima_auto) |> forecast(h=92) |> hilo() |> mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |> dplyr::select(dte,cash_mod,.mean,lower_95,upper_95)
atm1_final <- as_tibble(tail(atm1_final_fc,31)) |> mutate(ATM='ATM1',.model='ARIMA001012')

ATM2:

cash_cv_fc_atm2 <- cash_ts |>
  stretch_tsibble(.init = 30) |> 
    filter(ATM=='ATM2') |>
    model(snaive=SNAIVE(cash_mod ~ lag("week")),
          arima000310 = ARIMA(cash_mod ~ 0 + pdq(0,0,0) + PDQ(3,1,0)),
          arima000110 = ARIMA(cash_mod ~ 0 + pdq(0,0,0) + PDQ(1,1,0))) |>
    forecast(h=30) |>
    group_by(.id) |>
  mutate(h = row_number()) |>
  ungroup() |>
  as_fable(response = "cash_mod", distribution = cash_mod)

cash_cv_fc_atm2 |>
  accuracy(cash_ts, by = c("h", ".model")) |>
  ggplot(aes(x = h, y = RMSE)) +
  geom_point(aes(colour=.model))

After reevaluating the ATM2 data running more tests with cross validation the seasonal naive model does perform really well, but there is fairly comparable results in the ARIMA(0,0,0)(3,1,0) as it has about the same RMSE across multiple forecasted horizons. The ARIMA model will be chosen for the forecast given it can better handle some of the complex patterns that are present from cash withdrawals at this ATM.

atm2_final_fc <- cash_ts_fit |> filter(ATM=='ATM2') |> dplyr::select(arima000310) |> forecast(h=92) |> hilo() |> mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |> dplyr::select(dte,cash_mod,.mean,lower_95,upper_95)
atm2_final <- as_tibble(tail(atm2_final_fc,31)) |> mutate(ATM='ATM2',.model='ARIMA000310')

ATM3 due to the the limited values will not benefit from running cross validation and the submission will include the damped trend data with exponential smoothing model (AADN) given that the withdrawals are likely to differ on multiple days.

atm3_final_fc <- atm3_fit |> dplyr::select(aadn) |> forecast(h=31) |> hilo() |>
    mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |> 
    dplyr::select(dte,cash_mod,.model,.mean,lower_95,upper_95) |> mutate(ATM='ATM3')

ATM4:

cash_cv_fc_atm4 <-  cash_ts |>
  stretch_tsibble(.init = 30) |> 
    filter(ATM=='ATM4') |>
    model(mean=MEAN(cash_mod^cash_tf4),
          arima000100= ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(0,1,2)),
          aaa = ETS(cash_mod^cash_tf4 ~ error('A')+trend('A') + season('A')),
          aada = ETS(cash_mod^cash_tf4 ~ error('A')+trend('Ad') + season('A'))) |>
    forecast(h=31) |>
    group_by(.id) |>
  mutate(h = row_number()) |>
  ungroup() |>
  as_fable(response = "cash_mod", distribution = cash_mod)

cash_cv_fc_atm4 |>
  accuracy(cash_ts, by = c("h", ".model")) |>
  ggplot(aes(x = h, y = RMSE)) +
  geom_point(aes(colour=.model))

The cross validation testing procedure indicates that the mean method performs the best and is quite separated from the next best model in ARIMA(0,0,1)(0,0,0). Although it seems a bit surprising that the data would be best approximated by a baseline method it likely speaks to the limitations/residuals from the other models even though the residuals are still correlated in the mean method.

atm4_final_fc <- cash_ts_fit_atm4 |> 
    dplyr::select(mean) |> forecast(h=92) |> filter(dte>'2010-04-30')|> hilo() |>
    mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |> 
    dplyr::select(dte,cash_mod,.mean,lower_95,upper_95)

atm4_final <- as_tibble(tail(atm4_final_fc,31)) |> mutate(ATM='ATM4',.model='mean',lower_95)

atm_full_final <- dplyr::bind_rows(atm1_final,atm2_final,as_tibble(atm3_final_fc),atm4_final) |> relocate(ATM)

write.csv(atm_full_final,paste0(getwd(),'/','parta_atm_forecasts.csv'))

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Load the data

input <- getwd()
if(grepl( 'Data624', input, fixed = TRUE)){
    input_b_fp <- input
} else{
    input_b_fp <- paste0(getwd(),'/Desktop/Masters_Data_Sci/Data624/')
}

part_b <- 'ResidentialCustomerForecastLoad-624.xlsx'
partb_i_df <- readxl::read_excel(paste0(input_b_fp,'/',part_b))
head(partb_i_df)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

summary(partb_i_df)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

partb_i_df |> dplyr::filter(is.na(KWH) | is.na(`YYYY-MMM`))

## # A tibble: 1 × 3
##   CaseSequence `YYYY-MMM`   KWH
##          <dbl> <chr>      <dbl>
## 1          861 2008-Sep      NA

colnames(partb_i_df) <- c('casesequence','month','KWH')
partb_i_df |> filter(casesequence>=860 & casesequence<=865)

## # A tibble: 6 × 3
##   casesequence month        KWH
##          <dbl> <chr>      <dbl>
## 1          860 2008-Aug 8037137
## 2          861 2008-Sep      NA
## 3          862 2008-Oct 5101803
## 4          863 2008-Nov 4555602
## 5          864 2008-Dec 6442746
## 6          865 2009-Jan 8072330

The NA value is only occurring in one month in question. In order to properly impute this monthly data let’s review the relationship with lagged values to determine if a seasonal naive or other lagged interval more closely correlate with the available data.

partb <- partb_i_df |> mutate(YearMonth = tsibble::yearmonth(as.character(month))) 

partb_na <- partb |> as_tsibble(index=YearMonth)

partb_na |> gg_lag(KWH,geom='point',lags=1:24)

Tracking the shape of the KWH lagged values it appears there are much stronger relationships between half year and full year comparisons then other intermonthly comparisons.

Let’s review an ACF plot to confirm this further, particularly with farther out lagged periods for considering moving average periods.

partb_na |> ACF(KWH,lag_max=60) |> autoplot()

Similar to the lagged plots the autocorrelation is linearly related across many values, but the standardized pattern for a year’s lag is by far the strong relationship for each monthly energy consumption. Based on reviewing the ACF, the one NA value will be imputed using a variant of a moving average of similar year over year data.

Lets impute using a 3 year time frame using two years of preceding data and one following. The reason for taking this time period is that the missing value should likely be most similar to the same month’s consumption a year before and after it and it will be treated like it was the moving average of the preceding September.

kwh <- partb_na |> mutate(kwh_imp=ifelse(is.na(KWH),mean(pull(partb_na |> filter(month(YearMonth,abbr=FALSE)==9 & casesequence>=837 & casesequence<=873),KWH),na.rm=TRUE),KWH))

Let’s confirm if there are any null values left:

kwh |> filter(is.na(kwh_imp))

## # A tsibble: 0 x 5 [?]
## # ℹ 5 variables: casesequence <dbl>, month <chr>, KWH <dbl>, YearMonth <mth>,
## #   kwh_imp <dbl>

Transformations

Now that every value is populated it’s time to evaluate transformations. The unadjusted values are quite high so at a minimum dividing by 10 to some power would probably be appropriate at the very least.

Let’s evaluate if an alternative transformation would be appropriate using BoxCox.

kwh_bc_mod <- preProcess(as.data.frame(kwh),method='BoxCox')
kwh_bc <- kwh_bc_mod$kwh_imp$lambda

The transformation suggested when maximizing the log likelihood returned 1 for the kwh_imp series

kwh |> autoplot(kwh_imp)

There is clearly seasonality that is to be expected with energy demand that is highly correlated with temperature extremes for a monthly basis. Otherwise, the variance does not appear to be changing that dramatically across the full time series range although there is a large outlier that appears to exist in early 2010 that is far below the rest of the data points.

Modeling

Similar to part A, the approach that will be taken is to use two different sampling strategies to evaluate the models.

Use 80% of the time series to train the model and create a forecast using the final 20%
Use cross validation to take more cuts of the data to try to train the model on larger time frames for step forecasts

Arima Preparation

The ARIMA modeling requires some initial preparation to confirm that the data inputs meet critical assumptions to reliably model the time series. The first attempt at stationarity will be evaluated at the first order of differeencing due to the clear seasonality that exists with this energy data.

kwh |> gg_tsdisplay(difference(kwh_imp,lag=12), plot_type='partial')

Aside from two larger outliers the remainder of the points looks to be quite homoskedastic after accounting for the seasonality. The ACF and PACF have very similar looking statistically significant lags at 1 and 12 respectively which means we could propose a fairly simple model with an AR and MA model of 1 for the seasonal term P or Q.This estimated model will be compared against the automated ARIMA model to evaluate what terms it selects.

Let’s confirm with a unit root test:

kwh |>
features(difference(kwh_imp,12), unitroot_kpss)

## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.108         0.1

This statistical test also confirms what was observed in the prior plots that this data can be considered stationary

With the time frame being months for this time series and a much longer time horizon there are some additional model types that are available for this data that were not usable for the daily ATM withdrawals from Part A. Some additional modeling will include STL and X11 to evaluate as many possible models are available.

Let’s quickly run STL decomposition

cmp_kwh<- kwh |> filter(year(YearMonth)<2011) |>
    model(stl=STL(kwh_imp)) |> 
    components()

autoplot(cmp_kwh)

As is expected there isn’t a strong trend/cycle that exists although there was a slight increase towards the end of the time range and ultimately a large plunge in consumption as well. The seasonality looks to be closely tied to seasons with spikes at winter and summer and subsequent troughs in fall and spring.

Let’s train several models to determine what forecasts the residential power usage:

kwh_train_fit <- kwh |> filter(year(YearMonth)<2011) |>
    model(mean=MEAN(kwh_imp),
          naive=NAIVE(kwh_imp),
          snaive=SNAIVE(kwh_imp),
          drift=RW(kwh_imp~ drift()),
          arima000110=ARIMA(kwh_imp ~ pdq(0,0,0)+PDQ(1,1,0)),
          arima000011=ARIMA(kwh_imp ~ pdq(0,0,0)+PDQ(0,1,1)),
          arima_auto=ARIMA(kwh_imp),
          ana =ETS(kwh_imp ~ error('A')+trend('N')+season('A')),
          aada=ETS(kwh_imp ~ error('A')+trend('Ad')+season('A')),
          aaa=ETS(kwh_imp ~ error('A')+trend('A')+season('A')))

The Hyndman textbook recommends using the corrected AICc metric as a means of evaluating the models.

Let’s see what the automated ARIMA algorithm determined was the best combination of pdq PDQ parameters.

report(kwh_train_fit |> dplyr::select(arima_auto))

## Series: kwh_imp 
## Model: ARIMA(0,0,0)(0,1,2)[12] 
## 
## Coefficients:
##          sma1    sma2
##       -0.7843  0.2927
## s.e.   0.1081  0.1445
## 
## sigma^2 estimated as 6.413e+11:  log likelihood=-2164.59
## AIC=4335.18   AICc=4335.35   BIC=4344.09

The optimization process didn’t result in a substantially different model from the visual examination of the ACF & PACF. There is one seasonal differencing term and instead of one Q there are 2 lagged seasonal error terms. Given the strong lagged autocorrelation that was visible in the initial lagged plot and ACF this kind of makes sense.

Let’s review the AICc for in sample evaluation of the fit.

glance(kwh_train_fit) |> arrange(AICc) |> dplyr::select(.model:BIC)

## # A tibble: 10 × 6
##    .model       sigma2 log_lik   AIC  AICc   BIC
##    <chr>         <dbl>   <dbl> <dbl> <dbl> <dbl>
##  1 arima_auto  6.41e11  -2165. 4335. 4335. 4344.
##  2 arima000011 6.64e11  -2167. 4337. 4337. 4343.
##  3 arima000110 7.22e11  -2171. 4347. 4347. 4353.
##  4 ana         6.63e11  -2510. 5049. 5053. 5095.
##  5 aaa         6.65e11  -2509. 5052. 5056. 5103.
##  6 aada        6.67e11  -2509. 5053. 5058. 5108.
##  7 mean        1.72e12     NA    NA    NA    NA 
##  8 naive       2.22e12     NA    NA    NA    NA 
##  9 snaive      8.53e11     NA    NA    NA    NA 
## 10 drift       2.22e12     NA    NA    NA    NA

kwh_train_fit |> forecast(h='3 years') |> autoplot(kwh |> filter(year(YearMonth)>2008),level=NULL) +labs(title='KWH Power Usage Forecast')

With many of the forecasts plotted over the top of one another it is not the easiest to differentiate aside from those large divergences that exist. The basic modeling methods appear to diverge rather substantially from the data set. It appears that the one outlier term also heavily influenced the seasonal naive troughs that were forecasted.

kwh_train_fc <- kwh_train_fit |> forecast(h='3 years')

knitr::kable(kwh_train_fc |> accuracy(kwh) |>arrange(RMSE))

.model	.type	ME	RMSE	MAE	MPE	MAPE	MASE	RMSSE	ACF1
aaa	Test	1095281	1404861	1122418	13.47739	13.93459	1.846162	1.526712	0.1472531
ana	Test	1179309	1466785	1185844	14.66555	14.78124	1.950485	1.594008	0.1409454
aada	Test	1188842	1474169	1194171	14.80145	14.89686	1.964182	1.602032	0.1414678
arima_auto	Test	1048891	1646442	1093123	12.51883	13.17523	1.797977	1.789247	0.2915537
arima000011	Test	1111303	1663266	1139781	13.31875	13.74764	1.874720	1.807530	0.2884737
arima000110	Test	1089373	1987124	1223648	12.65490	14.65295	2.012665	2.159478	0.2477648
mean	Test	1229194	1991876	1570376	12.66579	18.66749	2.582965	2.164643	0.3618090
naive	Test	1353888	2071151	1612027	14.40062	18.96022	2.651473	2.250793	0.3618090
drift	Test	1438630	2126170	1644521	15.58526	19.23620	2.704920	2.310584	0.3590813
snaive	Test	1184005	2668671	1551425	13.42303	18.76234	2.551795	2.900140	0.1844302

The results of the test validation are rather interesting as some of the ARIMA models may have slightly overfit to the training data and performed worse on the test set. Evaluating the data using RMSE shows that the Holt’s Winters additive method seems to do the best. The automated ARIMA does manage to minimize the MAE compared to the rest of the models although the exponential smoothing model is second in that respect.

Let’s review the residuals of the ETS(AAA) model

kwh_train_fit |> 
dplyr::select(aaa) |>
  gg_tsresiduals()

The innovation residuals seem to be fairly homoskedastic aside from the large outlier the remaining points appear to reasonably split above and below zero. The distribution of the residuals are distributed around zero and appear to be approximately normal. There are a few autocorrelated residuals that exist but it doesn’t appear to be overly correlated among lagged values.

Let’s run a Ljung box test to confirm that’s the case:

augment(kwh_train_fit |> dplyr::select(aaa,ana,aada,arima_auto)) |>
  features(.innov, ljung_box, lag = 12, dof = 3)

## # A tibble: 4 × 3
##   .model     lb_stat lb_pvalue
##   <chr>        <dbl>     <dbl>
## 1 aaa          23.0    0.00615
## 2 aada         22.4    0.00775
## 3 ana          23.5    0.00524
## 4 arima_auto    8.51   0.483

The statistical test for the Ljung box indicates that the residuals for all of the exponential models do not resemble white noise given the very low p-values obtained. This doesn’t necessarily disqualify these models, but could imply there are some additional components needed in the model. The best ARIMA model despite having a substantially higher RMSE resembles white noise for the residuals. Given the high levels of autocorrelation in the initial dataset that might cause some of the correlations that remain in the exponential models selected

Let’s run cross validation as the training method to see how that impacts the results for selecting the final model to run future forecasting. For optimization purposes the basic models will be excluded given how poorly they performed on training and test data in the previous section.

kwh_cv_fit <- kwh |>
  stretch_tsibble(.init = 36,.step=3) |>
  model(arima000110=ARIMA(kwh_imp ~ 0 + pdq(0,0,0)+PDQ(1,1,0)),
          arima000011=ARIMA(kwh_imp ~ 0+ pdq(0,0,0)+PDQ(0,1,1)),
          arima_auto=ARIMA(kwh_imp ~ 0+ pdq(0,0,0)+PDQ(0,1,2)),
          ana =ETS(kwh_imp ~ error('A')+trend('N')+season('A')),
          aada=ETS(kwh_imp ~ error('A')+trend('Ad')+season('A')),
          aaa=ETS(kwh_imp ~ error('A')+trend('A')+season('A')))

kwh_cv_fc <- kwh_cv_fit |> forecast(h = 12) |> 
  group_by(.id) |>
  mutate(h = row_number()) |>
  ungroup() |>
  as_fable(response = "kwh_imp", distribution = kwh_imp)

kwh_cv_fc |>
    accuracy(kwh, by = c("h", ".model")) |> arrange(RMSE) |>
    ggplot(aes(x = h, y = RMSE)) +
    geom_point(aes(colour=.model))

It is a somewhat mixed bag in terms of the results from the cross validation as many of the models appear to comparably perform across multiple forecast horizons. If you are taking into account all of the time horitzons it appears that the exponential model without a trend factor and additive error/seasonal model (ANA) is performing the best for predicting energy usage. It was also the best performing exponential model in the training in sample data although worse than the ARIMA options; however, the ARIMA models reviewed all have larger errors at later time horizons which makes the ANA exponential seem preferable.

kwh_final_fc <- kwh_train_fit |> dplyr::select(ana) |> forecast(h=48) |> hilo() |> mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |> dplyr::select(.model,YearMonth,kwh_imp,.mean,lower_95,upper_95)

write.csv(as_tibble(tail(kwh_final_fc,12)),paste0(getwd(),'/','partb_energy_usage.csv'))

Data624: Project 1

jfor: Josh Forster

2024-03-10

Load Part A Input Files:

Review Data & Data Preparation

Transformations

Modeling

Load the data

Transformations

Modeling

Arima Preparation