## Attached packages (startup messages condensed):
## tidyverse 2.0.0: dplyr 1.1.3, forcats 1.0.0, ggplot2 3.4.4, lubridate 1.9.3,
##   purrr 1.0.2, readr 2.1.4, stringr 1.5.0, tibble 3.2.1, tidyr 1.3.0
## fpp3 0.5: tsibble 1.1.4, tsibbledata 0.4.1, feasts 0.3.1, fable 0.3.3, fabletools 0.3.4
## Also attached: naniar, zoo, caret (with lattice), seasonal; quantmod registers
##   an as.zoo.data.frame S3 method
## Notable maskings: dplyr::filter()/lag() over stats, zoo::index and naniar::pedestrian
##   over tsibble, caret's MAE/RMSE over fabletools, seasonal::view over tibble::view
This project consists of three parts, two required and one bonus, and is worth 15% of your grade.
Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
# Resolve the input path: use the working directory if it already points at the
# Data624 folder, otherwise fall back to the expected project location.
input <- getwd()
if(grepl('Data624', input, fixed = TRUE)){
  input_fp <- input
} else{
  input_fp <- paste0(getwd(),'/Desktop/Masters_Data_Sci/Data624/')
}
part_a <- 'ATM624Data.xlsx'
parta_i_df <- readxl::read_excel(paste0(input_fp,'/',part_a))
head(parta_i_df)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
summary(parta_i_df)
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
From a first pass we see that there are 19 null values for Cash, and it's not clear whether there are additional missing records in the other two columns. The DATE column is also being read in as Excel's numeric serial date, which will need to be converted to create a proper tsibble.
Let’s graph the withdrawal amounts across each ATM:
parta <- parta_i_df |> mutate(dte=as.Date(DATE, origin = "1899-12-30")) |>
arrange(dte,ATM,Cash) |>
dplyr::select(dte,ATM,Cash)
parta_prep <- parta |>
filter(!is.na(ATM)) |>
mutate(day_of_week=wday(dte,label=TRUE)) |>
group_by(ATM,day_of_week) |>
mutate(day_num = row_number()) |> ungroup()
ggplot(parta,aes(x=dte,y=Cash,colour=ATM)) +
geom_line()
Not much is evident from this plot given one large outlier in February 2010 that far exceeds all remaining data points. Perhaps a transformation can account for this record, but it will likely be challenging for any model to handle this data point appropriately. Without more context there is no clear evidence that an error occurred in tracking this specific date for ATM4 (e.g. someone accidentally appending a zero to the value).
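To pinpoint that outlier before deciding how to treat it, a quick check (a sketch):
parta |> filter(!is.na(Cash)) |> arrange(desc(Cash)) |> slice(1)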
What is the distribution of all of the withdrawals?
ggplot(parta,aes(x=Cash)) +
  geom_histogram() + labs(title='All Cash Withdrawal Distribution',x='Cash (hundreds of dollars)',y='Number of Days')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There is very strong decay across the dataset and the vast majority of days appear to be near zero. The shape is perhaps similar to a negative binomial curve.
What are the distributions after separating each ATM?
atm1_dist <- ggplot(parta |> filter(ATM =='ATM1') ,aes(x=Cash)) +
geom_histogram() + labs(title='ATM1 Distribution')
atm2_dist <- ggplot(parta |> filter(ATM =='ATM2') ,aes(x=Cash)) +
geom_histogram() + labs(title='ATM2 Distribution')
atm3_dist <- ggplot(parta |> filter(ATM =='ATM3') ,aes(x=Cash)) +
geom_histogram() + labs(title='ATM3 Distribution')
atm4_dist <- ggplot(parta |> filter(ATM =='ATM4') ,aes(x=Cash)) +
geom_histogram() + labs(title='ATM4 Distribution')
cowplot::plot_grid(atm1_dist,atm2_dist,atm3_dist,atm4_dist,ncol=2,nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The shapes of the distributions are quite different across the ATMs, which indicates that demand at each location differs considerably even though the data are tracked over the exact same time frame. This warrants testing forecasts for each ATM rather than trying to fit one model for all. It is difficult to segment these ATMs without additional information, but purely based on usage it seems that very few models will be able to account for ATM1 and ATM2 while also reliably estimating ATM3 and ATM4. ATM1 has a clear bimodal shape, with most of the concentration in a bell pattern around 100, although there are plenty of days with limited cash withdrawn. ATM2 also appears bimodal, although the concentration is spread across many values between 0 and 100. ATM4 is primarily concentrated in low amounts and decays very quickly, apart from the one large outlier previously called out. Its scale is far different from the rest of the ATMs, and even setting aside that outlier there appear to be more days with larger withdrawals compared to the other locations.
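To quantify those differences, a quick per-ATM summary (a sketch):
parta |>
  filter(!is.na(ATM), !is.na(Cash)) |>
  group_by(ATM) |>
  summarise(mean_cash = mean(Cash), sd_cash = sd(Cash), median_cash = median(Cash))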
Let’s review ATM3 further as it seems as though there aren’t many days with withdrawals and a model that always guesses zero may be very hard to beat.
summary(parta |> filter(ATM=='ATM3'))
## dte ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.0000
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.0000
## Median :2009-10-30 Mode :character Median : 0.0000
## Mean :2009-10-30 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :96.0000
The 3rd quartile is also zero, which indicates that at least 75% of the values are zero for this ATM.
How many days have withdrawals:
parta |> filter(ATM=='ATM3') |> mutate(is_zero=ifelse(Cash==0,'Y','N')) |> group_by(is_zero) |> summarise(days=n())
## # A tibble: 2 × 2
## is_zero days
## <chr> <int>
## 1 N 3
## 2 Y 362
Only 3 of the 365 days (about 0.8%) have withdrawals for this ATM! Perhaps an anomaly detection algorithm would be needed to accurately guess the days when someone pulls cash from this location.
Are the days of withdrawal significant in any way? (i.e. holidays)
parta |> filter(ATM=='ATM3' & Cash!=0)
## # A tibble: 3 × 3
## dte ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 96
## 2 2010-04-29 ATM3 82
## 3 2010-04-30 ATM3 85
Interestingly, the only nonzero values for ATM3 are 3 consecutive days at the very end of the provided time range.
Let’s see what the rates of missingness are across all columns to determine how to treat the different null records that exist.
plot_missing(parta, missing_only = TRUE,
ggtheme = theme_classic())
There are a few rows without an ATM value, which will need to be reviewed further to determine whether a method exists to properly impute these records or whether they should be excluded from the models.
Let's see what values exist for those days that are entirely blank and whether they are incorrectly included in the file. Perhaps someone added rows for these dates using a formula in Excel, expecting to populate them later, as there are about two additional weeks of dates.
parta |> filter(dte>='2010-05-01' & dte<='2010-05-15')
## # A tibble: 14 × 3
## dte ATM Cash
## <date> <chr> <dbl>
## 1 2010-05-01 <NA> NA
## 2 2010-05-02 <NA> NA
## 3 2010-05-03 <NA> NA
## 4 2010-05-04 <NA> NA
## 5 2010-05-05 <NA> NA
## 6 2010-05-06 <NA> NA
## 7 2010-05-07 <NA> NA
## 8 2010-05-08 <NA> NA
## 9 2010-05-09 <NA> NA
## 10 2010-05-10 <NA> NA
## 11 2010-05-11 <NA> NA
## 12 2010-05-12 <NA> NA
## 13 2010-05-13 <NA> NA
## 14 2010-05-14 <NA> NA
It appears that these blank records may actually represent dates that need to be predicted in the forecast.
What is the latest date available for the remaining ATMs?
parta |> group_by(ATM) |> summarise(last_date=max(dte))
## # A tibble: 5 × 2
## ATM last_date
## <chr> <date>
## 1 ATM1 2010-04-30
## 2 ATM2 2010-04-30
## 3 ATM3 2010-04-30
## 4 ATM4 2010-04-30
## 5 <NA> 2010-05-14
The remaining ATMs all cut off at the end of April 2010. Given the prompt requests forecasts for May 2010, excluding the May 2010 rows makes sense: they will be populated with estimates later, but they are not going to be used in the model.
The remaining null values most likely need to be imputed to create a valid time series. There are a number of reasonable approaches that could fill the remaining 5 missing records. Some of the more advanced methods, such as MICE and KNN, do a better job of capturing the interrelationships between different predictors; in this case the number of missing points is limited, and a simpler approach should be sufficient to impute reasonable values.
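One simple alternative (a sketch, not ultimately used below) is within-ATM linear interpolation via zoo::na.approx(); cash_interp is a hypothetical helper column:
parta |>
  filter(!is.na(ATM)) |>
  group_by(ATM) |>
  arrange(dte, .by_group = TRUE) |>
  mutate(cash_interp = zoo::na.approx(Cash, na.rm = FALSE)) |>
  ungroup()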
Let’s look to see the values around the time of the null records:
parta |> filter(ATM %in% c('ATM1','ATM2') & dte>='2009-06-01' & dte<='2009-06-30') |> arrange(ATM,dte) |>
ggplot(aes(x=dte,y=Cash)) +
geom_line(aes(colour=ATM))
The following are a few potential imputation alternatives that will be considered. First, imputing zero: how many other days do the ATMs have zero values?
parta |> filter(Cash==0) |> group_by(ATM) |> summarise(num_zero=n())
## # A tibble: 2 × 2
## ATM num_zero
## <chr> <int>
## 1 ATM2 2
## 2 ATM3 362
ATM2 has other examples of zero, while ATM1 does not have any dates with zero withdrawals, which indicates that imputing zero is less likely to be an accurate way to account for this missingness.
A second option would estimate the withdrawal amount from nearby leading and lagging dates that are available. Since the missing values for each ATM are clustered within roughly a two-week span in June 2009, the window would likely exclude the other NAs while computing each imputed value. The most appropriate width for that window is explored further in a subsequent section.
A third option would impute the value based on the most similar days elsewhere in the series, given that there are likely seasonal factors explaining some of the activity at a given ATM, as well as holidays that impact withdrawal patterns. In this way the other NA values would not affect each imputation, since they fall on different days.
When initially proposing this option I hadn't considered that decomposition techniques would be limited by the fact that the ATMs are tracked over only one calendar year (2009-05-01 through 2010-04-30), so only the weekly period repeats often enough; the algorithms in fable typically require a minimum of two full cycles to estimate period effects.
Let’s review the days in question to confirm they aren’t around holiday times:
parta |> filter(is.na(Cash) & dte<'2010-05-01') |> mutate(day_of_week=wday(dte,label=TRUE)) |> arrange(ATM,dte)
## # A tibble: 5 × 4
## dte ATM Cash day_of_week
## <date> <chr> <dbl> <ord>
## 1 2009-06-13 ATM1 NA Sat
## 2 2009-06-16 ATM1 NA Tue
## 3 2009-06-22 ATM1 NA Mon
## 4 2009-06-18 ATM2 NA Thu
## 5 2009-06-24 ATM2 NA Wed
The only potential holiday around that time (Juneteenth wasn't a federal holiday in 2009/2010) is Father's Day, which might prompt some people to withdraw cash to buy gifts. Three of the five days fall after that date, so they likely wouldn't be impacted, but it is possible that Father's Day had some effect on ATM1's first two NA values. It is difficult to speculate whether this was statistically impactful, so let's review the other top withdrawal days for ATM1 to see if holidays, or the days preceding them, drove the most traffic to this ATM.
parta |> filter(ATM=='ATM1') |> mutate(day_of_week=wday(dte,label=TRUE)) |> arrange(-Cash) |> slice(1:15)
## # A tibble: 15 × 4
## dte ATM Cash day_of_week
## <date> <chr> <dbl> <ord>
## 1 2009-09-22 ATM1 180 Tue
## 2 2010-02-17 ATM1 179 Wed
## 3 2009-08-25 ATM1 174 Tue
## 4 2010-01-29 ATM1 152 Fri
## 5 2010-02-28 ATM1 152 Sun
## 6 2009-08-14 ATM1 151 Fri
## 7 2009-09-04 ATM1 150 Fri
## 8 2009-06-09 ATM1 145 Tue
## 9 2009-09-05 ATM1 144 Sat
## 10 2009-06-12 ATM1 142 Fri
## 11 2009-06-19 ATM1 140 Fri
## 12 2009-09-20 ATM1 140 Sun
## 13 2009-12-01 ATM1 140 Tue
## 14 2009-12-04 ATM1 140 Fri
## 15 2010-02-05 ATM1 138 Fri
The expectation would have been that gift buying around December would drive the highest days, but early December only has the 13th and 14th highest withdrawal days. The top 5 days for this ATM don't appear to correlate with a holiday, although the June time frame around the missing values contains 3 of the top 11 values. The most common day of the week among the top withdrawals is Friday, the assumption being that this is when individuals are paid their wages. What is prompting the top 3 days to fall on other weekdays would probably require additional context to explain.
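To check the Friday effect directly, average withdrawals by weekday for ATM1 (a sketch):
parta |>
  filter(ATM == 'ATM1', !is.na(Cash)) |>
  mutate(day_of_week = wday(dte, label = TRUE)) |>
  group_by(day_of_week) |>
  summarise(avg_cash = mean(Cash), days = n())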
Let’s see the same cut of data for ATM2:
parta |> filter(ATM=='ATM2') |> mutate(day_of_week=wday(dte,label=TRUE)) |> arrange(-Cash) |> slice(1:15)
## # A tibble: 15 × 4
## dte ATM Cash day_of_week
## <date> <chr> <dbl> <ord>
## 1 2009-06-06 ATM2 147 Sat
## 2 2009-07-17 ATM2 147 Fri
## 3 2009-08-04 ATM2 146 Tue
## 4 2009-05-15 ATM2 137 Fri
## 5 2009-05-29 ATM2 136 Fri
## 6 2009-08-07 ATM2 136 Fri
## 7 2009-07-11 ATM2 135 Sat
## 8 2009-06-19 ATM2 134 Fri
## 9 2009-06-02 ATM2 133 Tue
## 10 2009-06-08 ATM2 132 Mon
## 11 2009-11-20 ATM2 132 Fri
## 12 2009-07-03 ATM2 128 Fri
## 13 2009-10-23 ATM2 127 Fri
## 14 2009-07-28 ATM2 126 Tue
## 15 2009-07-31 ATM2 126 Fri
The top days for this ATM are also mostly Fridays, with a few Saturdays and other weekdays sprinkled in. The rationale is the same as for ATM1: paydays would correspond with increased withdrawals.
parta |> mutate(day_of_week=wday(dte,label=TRUE)) |>
filter(!is.na(Cash) & Cash !=0 & Cash<9000) |>
ggplot(aes(x=Cash)) +
geom_histogram() +
facet_wrap(vars(day_of_week))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distributions across all of the ATMs appear right skewed after excluding zeros and the one massive outlier, with Fridays having the strongest skew across all days. Tuesdays had more skew than expected.
Let's impute a value for the NAs based on a rolling mean: a 7-term same-weekday moving average built from the 4 preceding and the 3 succeeding weeks' withdrawals.
parta_na_fill <- parta |> filter(!is.na(ATM)) |>
mutate(day_of_week=wday(dte,label=TRUE)) |>
group_by(ATM,day_of_week) |>
mutate(day_num = row_number()) |>
arrange(ATM,day_of_week) |>
filter(!is.na(Cash)) |>
mutate(roll_mean=rollmean(Cash,k=7,fill=NA)) |>
filter(lead(day_num)-day_num!=1) |> mutate(new_day=day_num+1)|> dplyr::select(c(dte,ATM,new_day,roll_mean)) |> ungroup()
## Adding missing grouping variables: `day_of_week`
cash <- parta_prep |> left_join(parta_na_fill,join_by(ATM==ATM,day_num==new_day,day_of_week==day_of_week),suffix = c("", ".y")) |> mutate(cash_mod=ifelse(is.na(Cash),roll_mean,Cash)) |> dplyr::select(c(dte,ATM,Cash,day_of_week,day_num,cash_mod)) |> filter(!is.na(ATM))
summary(cash)
## dte ATM Cash day_of_week
## Min. :2009-05-01 Length:1460 Min. : 0.0 Sun:208
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5 Mon:208
## Median :2009-10-30 Mode :character Median : 73.0 Tue:208
## Mean :2009-10-30 Mean : 155.6 Wed:208
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0 Thu:208
## Max. :2010-04-30 Max. :10919.8 Fri:212
## NA's :5 Sat:208
## day_num cash_mod
## Min. : 1.00 Min. : 0.0
## 1st Qu.:14.00 1st Qu.: 1.0
## Median :27.00 Median : 73.0
## Mean :26.57 Mean : 155.3
## 3rd Qu.:40.00 3rd Qu.: 114.0
## Max. :53.00 Max. :10919.8
##
This new dataframe confirms that none of the NA values remain in the modified cash column cash_mod, so the series can now be turned into a tsibble and modeled using techniques covered in the course so far.
One last step before proceeding with modeling is to transform the ATM series in an attempt to standardize values for improved modeling.
Let's first try transforming the full dataset across all of the ATMs, although the expectation is that this will not better approximate a normal distribution.
mod_cash <- preProcess(cash |> dplyr::select(cash_mod),method='BoxCox')
mod_cash$bc$cash_mod$lambda
## NULL
This in fact returns a null parameter, so we will instead attempt to transform each ATM separately to better approximate a normal distribution.
Let’s see if ATM1 can be improved:
mod_cash1 <- preProcess(cash |> as.data.frame() |> filter(ATM=='ATM1') |> dplyr::select(cash_mod),method='BoxCox')
cash_tf1 <- mod_cash1$bc$cash_mod$lambda
cowplot::plot_grid(cash |> filter(ATM=='ATM1') |>
ggplot(aes(x=cash_mod)) +
geom_histogram() + labs(title='ATM1 Cash Distribution'),cash |> filter(ATM=='ATM1') |>
ggplot(aes(x=dte,y=cash_mod)) +
geom_line() + labs(title='ATM1 Cash over Time'),nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cowplot::plot_grid(cash |> filter(ATM=='ATM1') |>
ggplot(aes(x=cash_mod^cash_tf1)) +
geom_histogram() + labs(title='ATM1 Cash^1.2 Distribution'),cash |> filter(ATM=='ATM1') |>
ggplot(aes(x=dte,y=cash_mod^cash_tf1)) +
geom_line() + labs(title='ATM1 Cash^1.2 over Time'),nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
After transforming the ATM1 withdrawals there isn't much improvement in the shape of the data, and it's unclear whether the daily seasonal patterns are any less variable. Based on the power transform obtained from the Box-Cox method, this does not appear useful, particularly given the decreased interpretability.
Let’s see if ATM2 can be improved after adjusting the zero values:
mod_cash2 <- preProcess(cash |> as.data.frame() |> filter(ATM=='ATM2') |> mutate(cash_mod=ifelse(cash_mod==0,0.01,cash_mod)) |> dplyr::select(cash_mod),method='BoxCox')
cash_tf2 <- mod_cash2$bc$cash_mod$lambda
cowplot::plot_grid(cash |> filter(ATM=='ATM2') |>
ggplot(aes(x=cash_mod)) +
geom_histogram() + labs(title='ATM2 Cash Distribution'),cash |> filter(ATM=='ATM2') |>
ggplot(aes(x=dte,y=cash_mod)) +
geom_line() + labs(title='ATM2 Cash over Time'),nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cowplot::plot_grid(cash |> filter(ATM=='ATM2') |>
ggplot(aes(x=cash_mod^cash_tf2)) +
geom_histogram() + labs(title='ATM2 Cash^0.7 Distribution'),cash |> filter(ATM=='ATM2') |>
ggplot(aes(x=dte,y=cash_mod^cash_tf2)) +
geom_line() + labs(title='ATM2 Cash^0.7 over Time'),nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The shape of the distribution does not seem much improved, as the lower values still show up fairly heavily, although the skew has shifted leftward. The line graph doesn't show much difference in the patterns and seasonality. Therefore, no transformation will be used for this ATM either.
ATM3:
Unfortunately, ATM3 remains a rather hard time series to predict or statistically optimize given that almost all of its values are zero. No transformation will be attempted, as a Box-Cox fit is unlikely to be informative when over 99% of the data points would need to be adjusted just to evaluate candidate transformations.
Let’s see if ATM4 can be improved:
mod_cash4 <- preProcess(cash |> as.data.frame() |> filter(ATM=='ATM4') |> mutate(cash_mod=ifelse(cash_mod==0,0.01,cash_mod)) |> dplyr::select(cash_mod),method='BoxCox')
cash_tf4 <- mod_cash4$bc$cash_mod$lambda
cowplot::plot_grid(cash |> filter(ATM=='ATM4') |>
ggplot(aes(x=cash_mod)) +
geom_histogram() + labs(title='ATM4 Cash Distribution'),cash |> filter(ATM=='ATM4') |>
ggplot(aes(x=dte,y=cash_mod)) +
geom_line() + labs(title='ATM4 Cash over Time'),nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cowplot::plot_grid(cash |> filter(ATM=='ATM4') |>
ggplot(aes(x=cash_mod^cash_tf4)) +
geom_histogram() + labs(title='ATM4 Cash^0.3 Distribution'),cash |> filter(ATM=='ATM4') |>
ggplot(aes(x=dte,y=cash_mod^cash_tf4)) +
geom_line() + labs(title='ATM4 Cash^0.3 over Time'),nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cash |> filter(ATM=='ATM4' & dte>='2010-02-05')
## # A tibble: 85 × 6
## dte ATM Cash day_of_week day_num cash_mod
## <date> <chr> <dbl> <ord> <dbl> <dbl>
## 1 2010-02-05 ATM4 454. Fri 41 454.
## 2 2010-02-06 ATM4 458. Sat 41 458.
## 3 2010-02-07 ATM4 112. Sun 41 112.
## 4 2010-02-08 ATM4 418. Mon 41 418.
## 5 2010-02-09 ATM4 10920. Tue 41 10920.
## 6 2010-02-10 ATM4 42.4 Wed 41 42.4
## 7 2010-02-11 ATM4 280. Thu 41 280.
## 8 2010-02-12 ATM4 412. Fri 42 412.
## 9 2010-02-13 ATM4 853. Sat 42 853.
## 10 2010-02-14 ATM4 180. Sun 42 180.
## # ℹ 75 more rows
The primary concern with ATM4 is the treatment of the outlier, and the goal of this transformation is to decrease the gap between that data point and the remainder of the series. The original value is more than 10x the rest of the withdrawals, which makes you wonder if someone accidentally typed an extra zero. After the transformation it is only 2x-3x the rest of the values. While there will no doubt be a large residual at that point in most model estimates, there are limited means of adjusting that value further, so the transformation accomplishes the task at hand for ATM4. The shape of the data is also a bit improved, which should help modeling techniques better capture the other trends and seasonality for this ATM.
summary(cash|>filter(ATM=='ATM4'))
## dte ATM Cash day_of_week
## Min. :2009-05-01 Length:365 Min. : 1.563 Sun:52
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 124.334 Mon:52
## Median :2009-10-30 Mode :character Median : 403.839 Tue:52
## Mean :2009-10-30 Mean : 474.043 Wed:52
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507 Thu:52
## Max. :2010-04-30 Max. :10919.762 Fri:53
## Sat:52
## day_num cash_mod
## Min. : 1.00 Min. : 1.563
## 1st Qu.:14.00 1st Qu.: 124.334
## Median :27.00 Median : 403.839
## Mean :26.57 Mean : 474.043
## 3rd Qu.:40.00 3rd Qu.: 704.507
## Max. :53.00 Max. :10919.762
##
cash_ts <- cash |> as_tsibble(index=dte,key=ATM)
cowplot::plot_grid(cash_ts |> filter(ATM=='ATM1') |>ACF(cash_mod) |> autoplot()+labs(title='ATM1 Autocorrelation'),cash_ts |> filter(ATM=='ATM2') |>ACF(cash_mod) |> autoplot()+labs(title='ATM2 Autocorrelation'),cash_ts |> filter(ATM=='ATM3') |>ACF(cash_mod) |> autoplot()+labs(title='ATM3 Autocorrelation'),cash_ts |> filter(ATM=='ATM4') |>ACF(cash_mod^cash_tf4) |> autoplot()+labs(title='ATM4 Autocorrelation'))
As expected, the autocorrelation is strongest at weekly increments, but there are statistically significant lags at many other day offsets for ATM1 and ATM2. ATM3 is primarily zero, so it makes sense that only minimal lagged values are correlated. ATM4, after applying the transformation identified in the last section, appears to only have autocorrelation at weekly increments.
Let’s start working on models:
The approach will compare basic forecasting strategies (mean, naive, seasonal naive, and drift) against more advanced options (exponential smoothing and ARIMA variants).
There will also be two approaches for sampling train and test data, to determine whether the sampling strategy contributes to meaningful differences in the comparative metrics; RMSE will serve as the common benchmark.
fable will be used to run the models on each time series. Let's review the preparation for an ARIMA model:
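For reference, RMSE is the square root of the mean squared forecast error, sqrt(mean((y_t - yhat_t)^2)); squaring penalizes large misses more heavily than MAE does, which matters here given the outlier-prone withdrawal data.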
Let’s run a unit root test to determine if any level of differencing is needed in the dataset:
cash_ts |> features(cash_mod, unitroot_kpss)
## # A tibble: 4 × 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM1 0.486 0.0448
## 2 ATM2 2.03 0.01
## 3 ATM3 0.389 0.0818
## 4 ATM4 0.0797 0.1
Evaluating the unit root tests lets us determine whether differencing is needed to meet the stationarity assumptions of ARIMA models. Using a standard alpha of 5%, ATM1 and ATM2 do not meet the threshold to pass this test and will require differencing. Those results aren't surprising given the strong seasonal patterns for these ATMs. It is a little surprising that ATM4 passed, but we will check the results after differencing as well. ARIMA will not be run on ATM3, as the lack of data makes it unlikely to perform well; the basic models are probably the only realistic means of producing a forecast there.
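As a cross-check, the unit-root features can suggest the required number of seasonal and first differences directly (a sketch using the feasts features unitroot_nsdiffs and unitroot_ndiffs):
cash_ts |> features(cash_mod, unitroot_nsdiffs)
cash_ts |> features(cash_mod, unitroot_ndiffs)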
cash_ts |> features(difference(cash_mod,7), unitroot_kpss)
## # A tibble: 4 × 3
## ATM kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 ATM1 0.0219 0.1
## 2 ATM2 0.0163 0.1
## 3 ATM3 0.389 0.0819
## 4 ATM4 0.0140 0.1
One order of seasonal (lag-7) differencing appears to get every ATM above the 5% threshold for stationarity for ARIMA models.
cash_ts |> filter(ATM=='ATM1') |> gg_tsdisplay(difference(cash_mod,7),plot_type='partial')
The ACF and PACF show some seasonal patterns that will help build tentative models. A seasonal AR(1) will be attempted, as well as seasonal MA(1) and MA(3) terms.
cash_ts |> filter(ATM=='ATM2') |> gg_tsdisplay(difference(cash_mod,7),plot_type='partial')
ATM2 appears to have some potential models based on the significant autocorrelation in the ACF and PACF plots. AR(1) and AR(3) will be attempted along with MA(1), MA(2), and MA(3), all within the seasonal P/Q terms.
cash_ts |> filter(ATM=='ATM3') |> gg_tsdisplay(cash_mod,plot_type='partial')
Given there are only 3 nonzero data points available for this ATM, it seems unlikely an ARIMA model is appropriate for forecasting this data.
cash_ts |> filter(ATM=='ATM4') |> gg_tsdisplay(difference(cash_mod^cash_tf4,7),plot_type='partial')
In the ATM4 autocorrelation plots the data looks fairly stationary; a few points stand out from the rest, but only one above and one below the significance bounds. A seasonal AR(1) might be viable along with MA(1), MA(2), and MA(3).
cash_ts |> filter(ATM=='ATM1') |> autoplot(cash_mod)+labs(title='ATM1 Time Series')
cash_ts |> filter(ATM=='ATM2') |> autoplot(cash_mod)+labs(title='ATM2 Time Series')
cash_ts |> filter(ATM=='ATM3') |> autoplot(cash_mod)+labs(title='ATM3 Time Series')
cash_ts |> filter(ATM=='ATM4') |> autoplot(cash_mod^cash_tf4)+labs(title='ATM4 Time Series')
Let’s train the model for the first 10 months of data and run forecasts for each ATM:
cash_ts_fit <- cash_ts |> dplyr::filter(!ATM %in% c('ATM3','ATM4') & dte<'2010-03-01') |>
model(mean=MEAN(cash_mod),
naive=NAIVE(cash_mod),
snaive=SNAIVE(cash_mod ~ lag("week")),
drift=RW(cash_mod ~ drift()),
aaa = ETS(cash_mod ~ error('A')+trend('A') + season('A')),
aada = ETS(cash_mod ~ error('A')+trend('Ad') + season('A')),
arima000110 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(1,1,0)),
arima000310 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(3,1,0)),
arima000011 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(0,1,1)),
arima000012 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(0,1,2)),
arima000013 = ARIMA(cash_mod ~ pdq(0,0,0) + PDQ(0,1,3)),
arima_auto = ARIMA(cash_mod,stepwise=FALSE)
)
cash_ts_fit |>
forecast(h=60) |>
autoplot(cash_ts,level=NULL)
It is a bit challenging to review the forecasts simultaneously, so each one will be plotted separately.
ATM3 will not follow the normal training procedure given that such a high share of its data is zero; there aren't many alternatives for estimating this specific time series.
atm3_fit <- cash_ts |> filter(ATM=='ATM3') |>
model(mean=MEAN(cash_mod),
naive=NAIVE(cash_mod),
snaive=SNAIVE(cash_mod ~ lag("week")),
drift=RW(cash_mod ~ drift()),
aan = ETS(cash_mod ~ error('A')+trend('A') + season('N')),
aadn = ETS(cash_mod ~ error('A')+trend('Ad') + season('N')),
ana=ETS(cash_mod ~ error('A')+trend('N') + season('N')))
atm3_fit |>
forecast(h=30) |>
autoplot(cash_ts |> filter(ATM=='ATM3' & dte>'2010-04-01'),level=NULL)
Based on the limited available information, the ultimate selection will be an exponential smoothing model with a damped trend. It's too early to say whether a seasonal pattern exists, and there aren't enough relevant weekly values to build one from. Although the naive model might be more conservative, it seems more probable that future withdrawal days will show some variety.
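As a quick check on that choice, the fitted damped-trend model's parameters can be inspected (a sketch using fabletools' report()):
atm3_fit |> dplyr::select(aadn) |> report()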
ATM4 will also need to be separately fit as it is the only key that will be transformed.
cash_ts_fit_atm4 <- cash_ts |> dplyr::filter(ATM=='ATM4' & dte<'2010-03-01') |>
model(mean=MEAN(cash_mod^cash_tf4),
naive=NAIVE(cash_mod^cash_tf4),
snaive=SNAIVE(cash_mod^cash_tf4 ~ lag("week")),
drift=RW(cash_mod^cash_tf4 ~ drift()),
aaa = ETS(cash_mod^cash_tf4 ~ error('A')+trend('A') + season('A')),
aada = ETS(cash_mod^cash_tf4 ~ error('A')+trend('Ad') + season('A')),
arima000110 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(1,1,0)),
arima000310 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(3,1,0)),
arima000011 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(0,1,1)),
arima000012 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(0,1,2)),
arima000013 = ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(0,1,3)),
arima_auto = ARIMA(cash_mod^cash_tf4,stepwise=FALSE)
)
Which model did the automated algorithm pick?
report(cash_ts_fit |> filter(ATM=='ATM1') |> dplyr::select(arima_auto))
## Series: cash_mod
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.2162 -0.7362 -0.1801
## s.e. 0.0586 0.0613 0.0589
##
## sigma^2 estimated as 572.9: log likelihood=-1368.99
## AIC=2745.98 AICc=2746.12 BIC=2760.75
report(cash_ts_fit |> filter(ATM=='ATM2') |> dplyr::select(arima_auto))
## Series: cash_mod
## Model: ARIMA(5,0,0)(0,1,1)[7] w/ drift
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 sma1 constant
## 0.0329 -0.1226 -0.0232 0.0929 -0.1858 -0.8522 -0.6462
## s.e. 0.0572 0.0578 0.0577 0.0585 0.0596 0.0426 0.2442
##
## sigma^2 estimated as 596.1: log likelihood=-1371.38
## AIC=2758.76 AICc=2759.26 BIC=2788.31
report(cash_ts_fit_atm4 |> filter(ATM=='ATM4') |> dplyr::select(arima_auto))
## Series: cash_mod
## Model: ARIMA(0,0,0)(1,0,0)[7] w/ mean
## Transformation: cash_mod^cash_tf4
##
## Coefficients:
## sar1 constant
## 0.2475 4.2841
## s.e. 0.0562 0.1081
##
## sigma^2 estimated as 3.631: log likelihood=-626.57
## AIC=1259.15 AICc=1259.23 BIC=1270.3
ATM1's automatic selection has seasonal parameters similar to the manual choices, although it adds one non-seasonal MA term. ATM2 is substantially different than expected, with 5 AR lags as well as a constant, although the seasonal component is in line with the ACF/PACF review. ATM4 obtained a very similar result to what was identified manually, although it excluded the seasonal differencing step.
cash_ts_fit |> filter(ATM=='ATM1') |>
forecast(h=60) |>
autoplot(cash_ts |> filter(ATM=='ATM1' & dte>'2009-12-31'),level=NULL) + labs(title='ATM1 Forecasts')
Given the frequent swings between daily withdrawals, it is rather challenging to determine whether any of these models outperforms the others, although the seasonal naive model seems to substantially overestimate certain points while doing a good job of forecasting the days with limited withdrawals.
cash_ts_fit |> filter(ATM=='ATM2') |>
forecast(h=60) |>
autoplot(cash_ts |> filter(ATM=='ATM2' & dte>'2009-12-31'),level=NULL) + labs(title='ATM2 Forecasts')
Similar to ATM1, with so many models being tested it is hard to truly discern the differences among them. The mixture of colors could probably be improved, although very few palettes can properly distinguish more than 8 series.
cash_ts_fit_atm4 |> filter(ATM=='ATM4') |> dplyr::select(-c(naive,drift)) |>
forecast(h=60) |>
autoplot(cash_ts |> filter(ATM=='ATM4' & dte>'2009-12-31'),level=NULL) + labs(title='ATM4 Forecasts')
The naive and drift models were eliminated, as the outlier event caused substantial inaccuracies in models that simply mirror recent data. Those forecasts are without a doubt unrealistic for modeling this ATM's withdrawal data.
glance(cash_ts_fit) |> arrange(AICc) |> dplyr::select(ATM:BIC)
## # A tibble: 24 × 7
## ATM .model sigma2 log_lik AIC AICc BIC
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 arima_auto 573. -1369. 2746. 2746. 2761.
## 2 ATM1 arima000012 599. -1375. 2757. 2757. 2768.
## 3 ATM1 arima000013 599. -1375. 2758. 2758. 2773.
## 4 ATM2 arima_auto 596. -1371. 2759. 2759. 2788.
## 5 ATM1 arima000011 621. -1380. 2763. 2763. 2771.
## 6 ATM2 arima000011 627. -1381. 2766. 2766. 2773.
## 7 ATM2 arima000012 627. -1381. 2767. 2767. 2778.
## 8 ATM2 arima000013 629. -1380. 2769. 2769. 2783.
## 9 ATM1 arima000310 653. -1384. 2777. 2777. 2791.
## 10 ATM2 arima000310 660. -1386. 2780. 2780. 2795.
## # ℹ 14 more rows
It can be deceiving to compare AICc across all of these models, as they potentially have different degrees of freedom, making in-sample comparisons inaccurate. For ATM1 and ATM2 (the only series in this fit), ARIMA appears to be best.
glance(cash_ts_fit_atm4) |> arrange(AICc) |> dplyr::select(ATM:BIC)
## # A tibble: 12 × 7
## ATM .model sigma2 log_lik AIC AICc BIC
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 arima000011 3.07 -601. 1205. 1205. 1212.
## 2 ATM4 arima000012 3.08 -600. 1206. 1206. 1217.
## 3 ATM4 arima000013 3.10 -600. 1207. 1207. 1222.
## 4 ATM4 arima000310 3.92 -625. 1258. 1258. 1273.
## 5 ATM4 arima_auto 3.63 -627. 1259. 1259. 1270.
## 6 ATM4 arima000110 4.66 -650. 1305. 1305. 1312.
## 7 ATM4 aaa 3.15 -1038. 2100. 2101. 2145.
## 8 ATM4 aada 3.15 -1037. 2101. 2102. 2149.
## 9 ATM4 mean 3.85 NA NA NA NA
## 10 ATM4 naive 7.31 NA NA NA NA
## 11 ATM4 snaive 5.79 NA NA NA NA
## 12 ATM4 drift 7.31 NA NA NA NA
Let’s calculate on the test set and then use RMSE to compare each model by ATM.
cash_ts_fc <- cash_ts_fit |>
forecast(h=60)
cash_ts_amt4_fc <- cash_ts_fit_atm4 |>
forecast(h=60)
cash_ts_fc |> accuracy(cash_ts) |> arrange(RMSE)
## # A tibble: 24 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 snaive ATM2 Test 3.73 19.4 12.8 -Inf Inf 0.600 0.630 -0.145
## 2 arima0001… ATM2 Test -2.24 21.2 15.0 -Inf Inf 0.706 0.687 -0.174
## 3 snaive ATM1 Test -16.6 24.5 19.6 -9.04 33.0 1.04 0.829 0.0297
## 4 arima0001… ATM1 Test -20.9 25.8 22.2 -127. 128. 1.18 0.873 -0.195
## 5 arima0003… ATM1 Test -15.0 27.7 20.0 -244. 249. 1.06 0.936 -0.187
## 6 arima0003… ATM2 Test 0.737 27.9 24.2 -Inf Inf 1.14 0.904 -0.00914
## 7 mean ATM1 Test -7.87 34.8 23.0 -311. 325. 1.22 1.18 -0.126
## 8 arima0000… ATM1 Test -9.44 36.0 26.2 -312. 329. 1.39 1.22 -0.154
## 9 mean ATM2 Test -7.62 40.4 35.3 -Inf Inf 1.66 1.31 0.0409
## 10 arima0000… ATM1 Test -6.80 42.0 31.5 -337. 363. 1.67 1.42 -0.0765
## # ℹ 14 more rows
There is too much going on when evaluating all of the ATMs in one table, so let's take the top 3 forecasts for each ATM based on RMSE.
cash_ts_fc |> accuracy(cash_ts) |> group_by(ATM) |>
arrange(RMSE) |> mutate(row=row_number()) |> filter(row<=3) |> dplyr::select(.model:MAE,row)
## # A tibble: 6 × 7
## # Groups: ATM [2]
## .model ATM .type ME RMSE MAE row
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 snaive ATM2 Test 3.73 19.4 12.8 1
## 2 arima000110 ATM2 Test -2.24 21.2 15.0 2
## 3 snaive ATM1 Test -16.6 24.5 19.6 1
## 4 arima000110 ATM1 Test -20.9 25.8 22.2 2
## 5 arima000310 ATM1 Test -15.0 27.7 20.0 3
## 6 arima000310 ATM2 Test 0.737 27.9 24.2 3
The manually selected model, the automated model, and the mean model all performed fairly similarly on the holdout set. It's rather surprising that the mean would be comparable to other modeling techniques; perhaps this speaks to the fact that few of the models accurately account for the full variance in the data, and the one outlier case may be contributing. ATM3, due to its irregularity, has multiple models with identical results, and it will be difficult to choose the "best" option for that machine. Another surprising result for ATM1 and ATM2 is that the seasonal naive method performed best on the test set, which might signify that successive weekly periods were very similar to one another. Cross validation will be a helpful secondary check to determine whether these benchmarks are in fact preferable to more complex methods.
cash_ts_amt4_fc |> accuracy(cash_ts |> filter(ATM=='ATM4')) |> group_by(ATM) |>
arrange(RMSE) |> mutate(row=row_number())
## # A tibble: 12 × 12
## # Groups: ATM [1]
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mean ATM4 Test -70.7 297. 255. -901. 925. 0.606 0.307
## 2 arima_auto ATM4 Test -73.1 299. 256. -917. 940. 0.608 0.310
## 3 arima000012 ATM4 Test -67.5 353. 299. -1059. 1095. 0.712 0.365
## 4 arima000011 ATM4 Test -66.9 353. 300. -1057. 1094. 0.713 0.366
## 5 arima000013 ATM4 Test -67.7 360. 305. -1075. 1112. 0.727 0.372
## 6 aada ATM4 Test -180. 407. 330. -1335. 1359. 0.786 0.421
## 7 aaa ATM4 Test -192. 410. 331. -1347. 1369. 0.788 0.424
## 8 arima000310 ATM4 Test -331. 546. 432. -1741. 1753. 1.03 0.565
## 9 arima000110 ATM4 Test -636. 867. 674. -1320. 1324. 1.60 0.897
## 10 snaive ATM4 Test -1308. 1755. 1340. -2019. 2023. 3.19 1.82
## 11 naive ATM4 Test -15958. 18135. 15958. -38504. 38504. 38.0 18.8
## 12 drift ATM4 Test -18513. 21434. 18513. -44970. 44970. 44.1 22.2
## # ℹ 2 more variables: ACF1 <dbl>, row <int>
The mean method appears to best track the test data, which is rather surprising, although the automatically selected ARIMA(0,0,0)(1,0,0)[7] is close behind. Cross validation may help further separate the models.
Let’s check the residuals for the top model for each ATM:
cash_ts_fit |> dplyr::select(snaive) |> filter(ATM=='ATM1') |> gg_tsresiduals()
The residual distribution appears quite normal, although there are some large residuals in both directions that likely counterbalance one another. There is probably too much autocorrelation remaining in the residuals, which likely means the model would not pass a portmanteau test.
augment(cash_ts_fit |> dplyr::select(snaive) |> filter(ATM=='ATM1')) |>
features(.innov, ljung_box, lag=7)
## # A tibble: 1 × 4
## ATM .model lb_stat lb_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM1 snaive 62.1 5.81e-11
The p-value is extremely small, which means the residuals do not resemble white noise and that this model leaves something to be desired.
Does the second best ARIMA model have more acceptable residuals?
cash_ts_fit |> dplyr::select(arima000110) |> filter(ATM=='ATM1') |> gg_tsresiduals()
The residual distribution appears fairly normal, although there are some large outlying residuals in both directions. There is probably too much autocorrelation remaining here as well, which suggests this model would also fail a portmanteau test.
augment(cash_ts_fit |> dplyr::select(arima000110) |> filter(ATM=='ATM1')) |>
features(.innov, ljung_box, lag = 7,dof=2)
## # A tibble: 1 × 4
## ATM .model lb_stat lb_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM1 arima000110 20.3 0.00112
Indeed, neither of the top models has residuals resembling white noise. That is not disqualifying, but it might speak to the quality of these models for predicting ATM1.
cash_ts_fit |> dplyr::select(snaive) |> filter(ATM=='ATM2') |> gg_tsresiduals()
The distribution of the residuals appears to follow a normal, bell-like shape, although many large outliers exist. The autocorrelation among lagged values appears to suggest that this model's residuals resemble white noise.
augment(cash_ts_fit |> dplyr::select(snaive) |> filter(ATM=='ATM2')) |>
features(.innov, ljung_box, lag = 7)
## # A tibble: 1 × 4
## ATM .model lb_stat lb_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM2 snaive 80.6 1.04e-14
The p-value is very small, which is rather surprising; the residuals do not resemble white noise, and this model does not fully account for the variance within the data for this ATM.
Does the second best ARIMA model have more acceptable residuals?
cash_ts_fit |> dplyr::select(arima000110) |> filter(ATM=='ATM2') |> gg_tsresiduals()
The residuals are centered around zero with a fairly normal distribution, although the time plot shows more spikes in the positive direction, which is somewhat evident in the histogram as well. The autocorrelation in the residuals is definitely closer to white noise, although the model may not pass a portmanteau test given the two significant lags.
augment(cash_ts_fit |> dplyr::select(arima000110) |> filter(ATM=='ATM2')) |>
features(.innov, ljung_box, lag = 7, dof = 2)
## # A tibble: 1 × 4
## ATM .model lb_stat lb_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM2 arima000110 14.9 0.0109
Yes, consistent with the visual observation, the residuals do not represent white noise.
The residuals for ATM3 cannot be meaningfully evaluated given how few nonzero values exist, so we will have to proceed more on intuition. Somewhat as expected, there isn't much to be done about the residuals for this ATM when nearly all of the training data is zero.
Let’s review ATM4 residuals:
cash_ts_fit_atm4 |> dplyr::select(mean) |> gg_tsresiduals()
The residuals plot indicates that ATM4's residuals deviate fairly evenly in both directions with a mostly normal approximation. The ACF plot may show signs that they are not equivalent to white noise, with 3 lags being statistically significant. Let's run a Ljung-Box test to confirm that observation:
augment(cash_ts_fit_atm4 |> dplyr::select(mean)) |>
features(.innov, ljung_box, lag = 7, dof = 1)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 mean 21.7 0.00140
As expected, the residuals do not pass this test of resembling white noise. Let's evaluate the second best model to see whether it meets the criteria:
cash_ts_fit_atm4 |> dplyr::select(arima_auto) |> gg_tsresiduals()
The autocorrelation is substantially less, and not a single lag is significant, which should mean the model would pass a Ljung-Box test. Outside of one large outlier, the residuals are mostly homoskedastic with a somewhat bell-like shape, although it does not appear to be centered exactly at zero.
augment(cash_ts_fit_atm4 |> dplyr::select(arima_auto)) |>
features(.innov, ljung_box, lag = 7, dof = 1)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima_auto 3.50 0.744
No surprise with the results of this portmanteau test: the residuals resemble white noise.
Let’s test running cross validation for each ATM to more thoroughly evaluate the dataset and pick the best model.
ATM1:
cash_cv_fc_atm1 <- cash_ts |>
stretch_tsibble(.init = 30) |>
filter(ATM=='ATM1') |>
model(mean=MEAN(cash_mod),
snaive=SNAIVE(cash_mod ~ lag("week")),
arima000110 = ARIMA(cash_mod ~ 0 + pdq(0,0,0) + PDQ(1,1,0)),
arima001012 = ARIMA(cash_mod ~ 0 + pdq(0,0,0) + PDQ(0,1,2))) |>
forecast(h=30) |>
group_by(.id) |>
mutate(h = row_number()) |>
ungroup() |>
as_fable(response = "cash_mod", distribution = cash_mod)
cash_cv_fc_atm1 |>
accuracy(cash_ts, by = c("h", ".model")) |>
ggplot(aes(x = h, y = RMSE)) +
geom_point(aes(colour=.model))
Although the mean method technically has the lowest average RMSE across the specific horizons in cross validation, it is also less robust against some of the complexities in the withdrawal data. In combination with the initial test-set performance, the model selected for the forecast is the ARIMA with seasonal (0,1,2) terms (implemented below via the automatically selected model, which shares that seasonal structure).
atm1_final_fc <- cash_ts_fit |> filter(ATM=='ATM1') |> dplyr::select(arima_auto) |> forecast(h=92) |> hilo() |> mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |> dplyr::select(dte,cash_mod,.mean,lower_95,upper_95)
atm1_final <- as_tibble(tail(atm1_final_fc,31)) |> mutate(ATM='ATM1',.model='ARIMA001012')
ATM2:
cash_cv_fc_atm2 <- cash_ts |>
stretch_tsibble(.init = 30) |>
filter(ATM=='ATM2') |>
model(snaive=SNAIVE(cash_mod ~ lag("week")),
arima000310 = ARIMA(cash_mod ~ 0 + pdq(0,0,0) + PDQ(3,1,0)),
arima000110 = ARIMA(cash_mod ~ 0 + pdq(0,0,0) + PDQ(1,1,0))) |>
forecast(h=30) |>
group_by(.id) |>
mutate(h = row_number()) |>
ungroup() |>
as_fable(response = "cash_mod", distribution = cash_mod)
cash_cv_fc_atm2 |>
accuracy(cash_ts, by = c("h", ".model")) |>
ggplot(aes(x = h, y = RMSE)) +
geom_point(aes(colour=.model))
Reevaluating the ATM2 data with cross validation, the seasonal naive model does perform really well, but the ARIMA(0,0,0)(3,1,0) is fairly comparable, with about the same RMSE across multiple forecast horizons. The ARIMA model will be chosen for the forecast given it can better handle some of the complex patterns present in this ATM's withdrawals.
atm2_final_fc <- cash_ts_fit |> filter(ATM=='ATM2') |> dplyr::select(arima000310) |> forecast(h=92) |> hilo() |> mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |> dplyr::select(dte,cash_mod,.mean,lower_95,upper_95)
atm2_final <- as_tibble(tail(atm2_final_fc,31)) |> mutate(ATM='ATM2',.model='ARIMA000310')
ATM3, due to its limited values, will not benefit from cross validation; the submission will use the damped-trend exponential smoothing model (AAdN), given that future withdrawals are likely to differ across days.
atm3_final_fc <- atm3_fit |> dplyr::select(aadn) |> forecast(h=31) |> hilo() |>
mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |>
dplyr::select(dte,cash_mod,.model,.mean,lower_95,upper_95) |> mutate(ATM='ATM3')
ATM4:
cash_cv_fc_atm4 <- cash_ts |>
stretch_tsibble(.init = 30) |>
filter(ATM=='ATM4') |>
model(mean=MEAN(cash_mod^cash_tf4),
arima000100= ARIMA(cash_mod^cash_tf4 ~ 0 + pdq(0,0,0) + PDQ(0,1,2)),
aaa = ETS(cash_mod^cash_tf4 ~ error('A')+trend('A') + season('A')),
aada = ETS(cash_mod^cash_tf4 ~ error('A')+trend('Ad') + season('A'))) |>
forecast(h=31) |>
group_by(.id) |>
mutate(h = row_number()) |>
ungroup() |>
as_fable(response = "cash_mod", distribution = cash_mod)
cash_cv_fc_atm4 |>
accuracy(cash_ts, by = c("h", ".model")) |>
ggplot(aes(x = h, y = RMSE)) +
geom_point(aes(colour=.model))
The cross validation procedure indicates that the mean method performs best and is well separated from the next best ARIMA model. Although it seems a bit surprising that the data would be best approximated by a baseline method, this likely speaks to the limitations of the other models, even though the residuals remain correlated under the mean method.
atm4_final_fc <- cash_ts_fit_atm4 |>
dplyr::select(mean) |> forecast(h=92) |> filter(dte>'2010-04-30')|> hilo() |>
mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |>
dplyr::select(dte,cash_mod,.mean,lower_95,upper_95)
atm4_final <- as_tibble(tail(atm4_final_fc,31)) |> mutate(ATM='ATM4',.model='mean')
atm_full_final <- dplyr::bind_rows(atm1_final,atm2_final,as_tibble(atm3_final_fc),atm4_final) |> relocate(ATM)
write.csv(atm_full_final,paste0(getwd(),'/','parta_atm_forecasts.csv'))
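The prompt asks for an Excel-readable file; the CSV above satisfies that, but a true .xlsx could optionally be written as well. A sketch, assuming the writexl package is installed (it is not loaded above); the fable distribution column is dropped first since it does not coerce cleanly to spreadsheet cells:
atm_full_final |>
  dplyr::select(-cash_mod) |>
  writexl::write_xlsx(paste0(getwd(),'/','parta_atm_forecasts.xlsx'))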
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
input <- getwd()
if(grepl( 'Data624', input, fixed = TRUE)){
input_b_fp <- input
} else{
input_b_fp <- paste0(getwd(),'/Desktop/Masters_Data_Sci/Data624/')
}
part_b <- 'ResidentialCustomerForecastLoad-624.xlsx'
partb_i_df <- readxl::read_excel(paste0(input_b_fp,'/',part_b))
head(partb_i_df)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
summary(partb_i_df)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
partb_i_df |> dplyr::filter(is.na(KWH) | is.na(`YYYY-MMM`))
## # A tibble: 1 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 861 2008-Sep NA
colnames(partb_i_df) <- c('casesequence','month','KWH')
partb_i_df |> filter(casesequence>=860 & casesequence<=865)
## # A tibble: 6 × 3
## casesequence month KWH
## <dbl> <chr> <dbl>
## 1 860 2008-Aug 8037137
## 2 861 2008-Sep NA
## 3 862 2008-Oct 5101803
## 4 863 2008-Nov 4555602
## 5 864 2008-Dec 6442746
## 6 865 2009-Jan 8072330
The NA occurs in only one month. To properly impute this monthly series, let's review the relationship with lagged values to determine whether a seasonal naive or another lagged interval correlates more closely with the available data.
partb <- partb_i_df |> mutate(YearMonth = tsibble::yearmonth(as.character(month)))
partb_na <- partb |> as_tsibble(index=YearMonth)
partb_na |> gg_lag(KWH,geom='point',lags=1:24)
Tracking the shape of the lagged KWH values, the relationships at the half-year and full-year lags are much stronger than other intermonth comparisons.
Let's review an ACF plot to confirm this further, particularly at farther-out lags, with moving-average windows in mind.
partb_na |> ACF(KWH,lag_max=60) |> autoplot()
Similar to the lag plots, the autocorrelation is meaningful across many lags, but the twelve-month lag is by far the strongest relationship for monthly energy consumption. Based on the ACF, the one NA value will be imputed using a variant of a year-over-year moving average.
Let's impute using a 3-year window: the two preceding Septembers and the one following. The missing value should be most similar to the same month's consumption in the years immediately before and after, so it is treated as the moving average of the surrounding Septembers.
kwh <- partb_na |> mutate(kwh_imp=ifelse(is.na(KWH),mean(pull(partb_na |> filter(month(YearMonth,abbr=FALSE)==9 & casesequence>=837 & casesequence<=873),KWH),na.rm=TRUE),KWH))
Let’s confirm if there are any null values left:
kwh |> filter(is.na(kwh_imp))
## # A tsibble: 0 x 5 [?]
## # ℹ 5 variables: casesequence <dbl>, month <chr>, KWH <dbl>, YearMonth <mth>,
## # kwh_imp <dbl>
Now that every value is populated, it's time to evaluate transformations. The unadjusted values are quite high, so at a minimum rescaling by some power of 10 would probably be appropriate.
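For example, a simple rescaling could be applied for readability (a sketch; the modeling below still proceeds on the raw scale, and kwh_scaled is a hypothetical helper object):
kwh_scaled <- kwh |> mutate(kwh_millions = kwh_imp / 1e6)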
Let’s evaluate if an alternative transformation would be appropriate using BoxCox.
kwh_bc_mod <- preProcess(as.data.frame(kwh),method='BoxCox')
kwh_bc <- kwh_bc_mod$bc$kwh_imp$lambda
The transformation suggested when maximizing the log likelihood returned a lambda of 1 for the kwh_imp series, i.e. no power transformation is needed.
kwh |> autoplot(kwh_imp)
There is clearly seasonality, as is to be expected with energy demand, which is highly correlated with monthly temperature extremes. Otherwise, the variance does not appear to change dramatically across the series, although there is a large outlier in early 2010 that falls far below the rest of the data points.
Similar to Part A, the approach will use two different sampling strategies to evaluate the models.
ARIMA modeling requires some initial preparation to confirm the inputs meet key assumptions. Stationarity will first be evaluated after one order of seasonal (lag-12) differencing, given the clear seasonality in this energy data.
kwh |> gg_tsdisplay(difference(kwh_imp,lag=12), plot_type='partial')
Aside from two larger outliers, the remaining points look quite homoskedastic after accounting for the seasonality. The ACF and PACF have similar statistically significant lags at 1 and 12, which means we could propose a fairly simple model with a seasonal AR or MA term of order 1 (P or Q). This estimated model will be compared against the automated ARIMA model to evaluate which terms it selects.
Let’s confirm with a unit root test:
kwh |>
features(difference(kwh_imp,12), unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.108 0.1
This statistical test also confirms what was observed in the prior plots: the seasonally differenced data can be considered stationary.
With this series being monthly and spanning a much longer horizon, some additional techniques are available that were not usable for the daily ATM withdrawals in Part A; STL and X11 decompositions will also be evaluated.
Let's quickly run an STL decomposition:
cmp_kwh<- kwh |> filter(year(YearMonth)<2011) |>
model(stl=STL(kwh_imp)) |>
components()
autoplot(cmp_kwh)
As expected there isn't a strong trend/cycle, although there was a slight increase toward the end of the range and ultimately a large plunge in consumption. The seasonality is closely tied to the calendar, with spikes in winter and summer and troughs in fall and spring.
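For completeness, a sketch of the X11 decomposition mentioned above, assuming the X-13ARIMA-SEATS binary bundled with the seasonal package is available on this machine:
kwh |> filter(year(YearMonth)<2011) |>
  model(x11 = X_13ARIMA_SEATS(kwh_imp ~ x11())) |>
  components() |>
  autoplot()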
Let’s train several models to determine what forecasts the residential power usage:
kwh_train_fit <- kwh |> filter(year(YearMonth)<2011) |>
model(mean=MEAN(kwh_imp),
naive=NAIVE(kwh_imp),
snaive=SNAIVE(kwh_imp),
drift=RW(kwh_imp~ drift()),
arima000110=ARIMA(kwh_imp ~ pdq(0,0,0)+PDQ(1,1,0)),
arima000011=ARIMA(kwh_imp ~ pdq(0,0,0)+PDQ(0,1,1)),
arima_auto=ARIMA(kwh_imp),
ana =ETS(kwh_imp ~ error('A')+trend('N')+season('A')),
aada=ETS(kwh_imp ~ error('A')+trend('Ad')+season('A')),
aaa=ETS(kwh_imp ~ error('A')+trend('A')+season('A')))
The Hyndman textbook recommends using the corrected AICc metric as a means of evaluating the models.
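For reference, the correction adds a penalty that grows with the parameter count relative to the sample size: AICc = AIC + 2k(k+1)/(n - k - 1), where k is the number of estimated parameters and n is the number of observations; as n grows, AICc converges to AIC.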
Let’s see what the automated ARIMA algorithm determined was the best combination of pdq PDQ parameters.
report(kwh_train_fit |> dplyr::select(arima_auto))
## Series: kwh_imp
## Model: ARIMA(0,0,0)(0,1,2)[12]
##
## Coefficients:
## sma1 sma2
## -0.7843 0.2927
## s.e. 0.1081 0.1445
##
## sigma^2 estimated as 6.413e+11: log likelihood=-2164.59
## AIC=4335.18 AICc=4335.35 BIC=4344.09
The optimization process didn't result in a substantially different model from the visual examination of the ACF and PACF. There is one seasonal differencing term, and instead of a single seasonal MA term there are two lagged seasonal error terms. Given the strong lagged autocorrelation visible in the initial lag plot and ACF, this makes sense.
Let’s review the AICc for in sample evaluation of the fit.
glance(kwh_train_fit) |> arrange(AICc) |> dplyr::select(.model:BIC)
## # A tibble: 10 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima_auto 6.41e11 -2165. 4335. 4335. 4344.
## 2 arima000011 6.64e11 -2167. 4337. 4337. 4343.
## 3 arima000110 7.22e11 -2171. 4347. 4347. 4353.
## 4 ana 6.63e11 -2510. 5049. 5053. 5095.
## 5 aaa 6.65e11 -2509. 5052. 5056. 5103.
## 6 aada 6.67e11 -2509. 5053. 5058. 5108.
## 7 mean 1.72e12 NA NA NA NA
## 8 naive 2.22e12 NA NA NA NA
## 9 snaive 8.53e11 NA NA NA NA
## 10 drift 2.22e12 NA NA NA NA
kwh_train_fit |> forecast(h='3 years') |> autoplot(kwh |> filter(year(YearMonth)>2008),level=NULL) +labs(title='KWH Power Usage Forecast')
With many forecasts plotted on top of one another they are not easy to differentiate aside from the large divergences. The basic methods diverge rather substantially from the data, and the one outlier appears to have heavily influenced the troughs forecast by the seasonal naive model.
kwh_train_fc <- kwh_train_fit |> forecast(h='3 years')
knitr::kable(kwh_train_fc |> accuracy(kwh) |>arrange(RMSE))
| .model | .type | ME | RMSE | MAE | MPE | MAPE | MASE | RMSSE | ACF1 |
|---|---|---|---|---|---|---|---|---|---|
| aaa | Test | 1095281 | 1404861 | 1122418 | 13.47739 | 13.93459 | 1.846162 | 1.526712 | 0.1472531 |
| ana | Test | 1179309 | 1466785 | 1185844 | 14.66555 | 14.78124 | 1.950485 | 1.594008 | 0.1409454 |
| aada | Test | 1188842 | 1474169 | 1194171 | 14.80145 | 14.89686 | 1.964182 | 1.602032 | 0.1414678 |
| arima_auto | Test | 1048891 | 1646442 | 1093123 | 12.51883 | 13.17523 | 1.797977 | 1.789247 | 0.2915537 |
| arima000011 | Test | 1111303 | 1663266 | 1139781 | 13.31875 | 13.74764 | 1.874720 | 1.807530 | 0.2884737 |
| arima000110 | Test | 1089373 | 1987124 | 1223648 | 12.65490 | 14.65295 | 2.012665 | 2.159478 | 0.2477648 |
| mean | Test | 1229194 | 1991876 | 1570376 | 12.66579 | 18.66749 | 2.582965 | 2.164643 | 0.3618090 |
| naive | Test | 1353888 | 2071151 | 1612027 | 14.40062 | 18.96022 | 2.651473 | 2.250793 | 0.3618090 |
| drift | Test | 1438630 | 2126170 | 1644521 | 15.58526 | 19.23620 | 2.704920 | 2.310584 | 0.3590813 |
| snaive | Test | 1184005 | 2668671 | 1551425 | 13.42303 | 18.76234 | 2.551795 | 2.900140 | 0.1844302 |
The test validation results are rather interesting, as some of the ARIMA models may have slightly overfit the training data and performed worse on the test set. Evaluated by RMSE, the Holt-Winters additive method does best. The automated ARIMA does minimize MAE compared to the rest of the models, with the exponential smoothing model second in that respect.
Let’s review the residuals of the ETS(AAA) model
kwh_train_fit |>
dplyr::select(aaa) |>
gg_tsresiduals()
The innovation residuals seem fairly homoskedastic; aside from the large outlier, the remaining points split reasonably above and below zero. The residuals are distributed around zero and appear approximately normal. A few autocorrelated lags exist, but the residuals do not appear overly correlated overall.
Let's run a Ljung-Box test to confirm that's the case:
augment(kwh_train_fit |> dplyr::select(aaa,ana,aada,arima_auto)) |>
features(.innov, ljung_box, lag = 12, dof = 3)
## # A tibble: 4 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 aaa 23.0 0.00615
## 2 aada 22.4 0.00775
## 3 ana 23.5 0.00524
## 4 arima_auto 8.51 0.483
The Ljung-Box test indicates that the residuals of all the exponential smoothing models do not resemble white noise, given the very low p-values. This doesn't necessarily disqualify these models, but it could imply some additional components are needed. The best ARIMA model, despite a substantially higher test RMSE, does have residuals resembling white noise. The high autocorrelation in the initial dataset may explain some of the correlation remaining in the selected exponential models.
Let's run cross validation as the training method to see how it impacts the selection of the final forecasting model. For efficiency, the basic models will be excluded given how poorly they performed on the training and test data in the previous section.
kwh_cv_fit <- kwh |>
stretch_tsibble(.init = 36,.step=3) |>
model(arima000110=ARIMA(kwh_imp ~ 0 + pdq(0,0,0)+PDQ(1,1,0)),
arima000011=ARIMA(kwh_imp ~ 0+ pdq(0,0,0)+PDQ(0,1,1)),
arima_auto=ARIMA(kwh_imp ~ 0+ pdq(0,0,0)+PDQ(0,1,2)),
ana =ETS(kwh_imp ~ error('A')+trend('N')+season('A')),
aada=ETS(kwh_imp ~ error('A')+trend('Ad')+season('A')),
aaa=ETS(kwh_imp ~ error('A')+trend('A')+season('A')))
kwh_cv_fc <- kwh_cv_fit |> forecast(h = 12) |>
group_by(.id) |>
mutate(h = row_number()) |>
ungroup() |>
as_fable(response = "kwh_imp", distribution = kwh_imp)
kwh_cv_fc |>
accuracy(kwh, by = c("h", ".model")) |> arrange(RMSE) |>
ggplot(aes(x = h, y = RMSE)) +
geom_point(aes(colour=.model))
The cross validation results are somewhat of a mixed bag, as many of the models perform comparably across multiple forecast horizons. Taking all horizons into account, the exponential model with no trend and additive error/seasonality (ANA) performs best for predicting energy usage. It was also the best performing exponential model on the in-sample training data, although worse than the ARIMA options; however, the ARIMA models all show larger errors at later horizons, which makes the ANA model preferable.
kwh_final_fc <- kwh_train_fit |> dplyr::select(ana) |> forecast(h=48) |> hilo() |> mutate(lower_95=`95%`$lower,upper_95=`95%`$upper) |> dplyr::select(.model,YearMonth,kwh_imp,.mean,lower_95,upper_95)
write.csv(as_tibble(tail(kwh_final_fc,12)),paste0(getwd(),'/','partb_energy_usage.csv'))