DATA 624 - Project 1 Part A

Overview
Import data
Change Data
Convert data into a tsibble object

Overview

In part A of this project, our goal is to forecast how much cash is taken out of 4 different ATM machines for May 2010. For this project, we will conduct the following steps.

We will start by importing the raw data for each of the datasets into a dataframe, and handle any initial cleanup or conversions of the data
Next we will explore the data that is provided to ensure that there are no missing values, and if there are any missing values, to determine the best way to handle them. Additionally, we will look for outliers or other anomalies in the data that we should take into account
Once we’ve made the necessary adjustments to the data, we will further visualize the data to understand how it is behaving over time to help us determine the appropriate time series model to use to model the data. There may be additional adjustments we need to make - depending on the type of model that we select.
We will then split the data into training and test data for building our models.
We will test out several models that we feel are adequate representations of our data, and then evaluate them to determine the model that best explains the relationship between our independent variable (time) and our dependent variable (cash)
Once we’ve decided on a model to use, then we will evaluate the model on our test data to evaluate how effective it is at forecasting
Finally, once we’ve landed on a model that is effective in forecasting the data, then we will use it to forecast the May 2010 data.

Import data

During my initial import of the data, I was running into problems where the date column was coming in as a Julian date and I was unable to figure out how to change it to an actual data. So I decided to coerce the data type at the import stage.

raw_data <- read_xlsx('../data/ATM624Data.xlsx', col_types=c("date","text","numeric"))

Change Data

In this step, I’m changing the DATE field to a date type, and also changing the dataframe to wide type so that each ATM is it’s own column. Finally, there was an NA column in the data; so I’m removing that column from the data.

raw_data <- raw_data %>% mutate(DATE = as.Date(DATE))


data_wide <- raw_data %>% pivot_wider(names_from=ATM, values_from=Cash)

data_wide %>% summary()

##       DATE                 ATM1             ATM2              NA     
##  Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : NA  
##  1st Qu.:2009-08-03   1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: NA  
##  Median :2009-11-06   Median : 91.00   Median : 67.00   Median : NA  
##  Mean   :2009-11-06   Mean   : 83.89   Mean   : 62.58   Mean   :NaN  
##  3rd Qu.:2010-02-08   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: NA  
##  Max.   :2010-05-14   Max.   :180.00   Max.   :147.00   Max.   : NA  
##                       NA's   :17       NA's   :16       NA's   :379  
##       ATM3              ATM4          
##  Min.   : 0.0000   Min.   :    1.563  
##  1st Qu.: 0.0000   1st Qu.:  124.334  
##  Median : 0.0000   Median :  403.839  
##  Mean   : 0.7206   Mean   :  474.043  
##  3rd Qu.: 0.0000   3rd Qu.:  704.507  
##  Max.   :96.0000   Max.   :10919.762  
##  NA's   :14        NA's   :14

data_wide %>% str()

## tibble [379 × 6] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:379], format: "2009-05-01" "2009-05-02" ...
##  $ ATM1: num [1:379] 96 82 85 90 99 88 8 104 87 93 ...
##  $ ATM2: num [1:379] 107 89 90 55 79 19 2 103 107 118 ...
##  $ NA  : num [1:379] NA NA NA NA NA NA NA NA NA NA ...
##  $ ATM3: num [1:379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ATM4: num [1:379] 777 524.4 792.8 908.2 52.8 ...

data_wide <- data_wide %>% select(!`NA`)

When looking at the data in a wide format, we find that there is an NA column that has no valid data. Additionally, within the different ATM columns, we see that there are NA values as well. In ATM 1 there are 17 NAs, in ATM2 there are 16, and in ATM3 and ATM4 there are 14 NA values. Additionally, the DATE field is in numeric form and as such should be changed to an actual date format.

Convert data into a tsibble object

Next we convert the data into a tsibble, using the date field as the index.

atm_ts <-tsibble(data_wide)

## Using `DATE` as index variable.

atm_ts %>%
  autoplot(ATM4)

## Warning: Removed 14 rows containing missing values (`geom_line()`).

a. ATM1 Forecast

For this we are starting with the ATM1 data. Since we will be forecasting the data for May 2010, we will begin by filtering the data to include on those values that occurred prior to May 2010.

atm1 <- atm_ts %>%
  select(ATM1) %>%
  filter(DATE < "2010-05-01")

Next we preview the data to understand the properties of the data

summary(atm1)

##       ATM1             DATE           
##  Min.   :  1.00   Min.   :2009-05-01  
##  1st Qu.: 73.00   1st Qu.:2009-07-31  
##  Median : 91.00   Median :2009-10-30  
##  Mean   : 83.89   Mean   :2009-10-30  
##  3rd Qu.:108.00   3rd Qu.:2010-01-29  
##  Max.   :180.00   Max.   :2010-04-30  
##  NA's   :3

nrow(atm1)

## [1] 365

atm1 %>% autoplot()

## Plot variable not specified, automatically selected `.vars = ATM1`

atm1 %>% filter(is.na(ATM1))

## # A tsibble: 3 x 2 [1D]
##    ATM1 DATE      
##   <dbl> <date>    
## 1    NA 2009-06-13
## 2    NA 2009-06-16
## 3    NA 2009-06-22

Looking at the data, we see that there are 365 values, and that the data is a time-series with daily values. Additionally, we see that there are three missing values on 6/13/2009, 6/16/2009 and 6/22/2009. To deal with these missing values, we will first train a model on the data prior to these values, and then use that to impute the missing values. Once we’ve completed that step, we will then take our dataset and create or test and training data that will be used to build our overall model.

First we will create a dataset that includes just the data prior to 6/13/2009

atm1_early <- atm1 %>% filter(DATE < '2009-06-13')

Next we will explore this data using the STL decomposition to bettern understand the components of the timeseries so we can use this to help us determine how to model the data. I tried using box-cox transformation to stabilize the trend in the data, but the transformation did not have a discernible impact on this aspect of the data.

atm1_early

## # A tsibble: 43 x 2 [1D]
##     ATM1 DATE      
##    <dbl> <date>    
##  1    96 2009-05-01
##  2    82 2009-05-02
##  3    85 2009-05-03
##  4    90 2009-05-04
##  5    99 2009-05-05
##  6    88 2009-05-06
##  7     8 2009-05-07
##  8   104 2009-05-08
##  9    87 2009-05-09
## 10    93 2009-05-10
## # ℹ 33 more rows

atm1_early %>% 
  model(STL(ATM1)) %>%
  components() %>%
  autoplot()

Based on our decomposition, we see that there is a pretty stable trend at the beginning, with a significant bump later on in the dataset. Additionally, we wee that there is weekly seasonality in the data. The remainder is relatively stable and centered around zero, until we get to later in June where there is a significant outlier in the data.

Based on this, we will explore modeling the time series with an ETS model. Based on the component data above, I believe that the seasonal trend is Additive. However, while the overall trend is basically stable throughout the data, the fact that there’s a significant spike in the data, causes me to think it’s possible that that would be best modeled as a Multiplicative trend. However, I will try models of the following types:

(A,A)
(A,M)
(N,M)
(N,A)

fit <- atm1_early %>% 
  model(m1 = ETS(ATM1 ~ error("A")+trend("A")+season("A")),
        m2 = ETS(ATM1 ~ error("A")+trend("A")+season("M")),
        m3 = ETS(ATM1 ~ error("A")+trend("N")+season("M")),
        m4 = ETS(ATM1 ~ error("A")+trend("N")+season("A")))


glance(fit) %>% arrange(AICc)

## # A tibble: 4 × 9
##   .model sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE   MAE
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m3       424.   -206.  432.  439.  449.  335.  341.  10.2
## 2 m4       436.   -206.  433.  440.  451.  345.  351.  11.1
## 3 m2       438.   -205.  435.  445.  456.  326.  333.  10.1
## 4 m1       510.   -209.  441.  452.  462.  380.  386.  12.4

In reviewing the AICc for each of the models, we find that m3 (N,M) has the lower AICc and therefore I will be using that model for forecasting the missing values on 6/13, 6/16 and 6/22.

fit2 <- atm1_early %>%
  model(ETS(ATM1 ~ error("A") + trend("N") + season("M")))

aug <- fit2 %>% augment()

autoplot(aug, .innov)

aug %>%
  ggplot(aes(x=.innov)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

aug %>% ACF(.innov) %>%
  autoplot()

From the residuals we see evidence of the major outlier in the data in June, and the fact that there’s a weekly lag in the autocorrelation plot. We will use this model to impute the missing values.

fit2_fc <- fit2 %>% forecast(h=10)


missing_val1 <- (fit2_fc %>% filter(DATE == '2009-06-13'))$.mean
missing_val2 <- (fit2_fc %>% filter(DATE == '2009-06-16'))$.mean
missing_val3 <- (fit2_fc %>% filter(DATE == '2009-06-22'))$.mean

atm1_fixed <- atm1 %>%
  mutate(ATM1 = ifelse(DATE == '2009-06-13',missing_val1,ATM1))

atm1_fixed <- atm1_fixed %>%
  mutate(ATM1 = ifelse(DATE == '2009-06-16',missing_val2,ATM1))

atm1_fixed <- atm1_fixed %>%
  mutate(ATM1 = ifelse(DATE == '2009-06-22',missing_val3,ATM1))

Now that we’ve imputed the missing values, we will now create a test and training dataset on the entire data with the imputed values included. From there, we will basically go through the same steps that we just completed - but now we will focus on building a model on our training data and then using the test data to evaluate the performance of the data.

We will first look at the autoplot of the new data set and determine the number of rows to include in our training and test data. Consistent with standard practice, we will opt for a 70/30 split.

train_pct <- .7
train_size <- floor(nrow(atm1_fixed)*train_pct)

train_data <- atm1_fixed %>% slice(1:train_size)
test_data <- atm1_fixed %>% slice(train_size:n())


train_data %>% autoplot()

## Plot variable not specified, automatically selected `.vars = ATM1`

We will now build our model on the test data using the same model that we used initially.

train_fit <- train_data %>% 
  model(ETS(ATM1 ~ error("A") + trend("N") + season("M")))

train_fc <- train_fit %>% forecast(h=nrow(test_data)-1)

train_fc %>% autoplot(atm1_fixed)

Finally, prior to using the model to forecast the data for May 2010, we will look at the residuals of the model and evaluate how effective the model is at predicting using our test data.

accuracy(train_fc, test_data)

## # A tibble: 1 × 10
##   .model                  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>                   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 "ETS(ATM1 ~ error(\"A\… Test   1.14  45.3  31.2 -303.  337.   NaN   NaN 0.0602

aug <- train_fit %>% augment()

autoplot(aug, .innov)

aug %>%
  ggplot(aes(x=.innov)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

aug %>% ACF(.innov) %>%
  autoplot()

When looking at the plots for the residuals of the model, we see that the model has done a pretty good job at capturing the variance in the data. There are no signficicant autocorrelations in the data and the residuals appear to be normally distributed. Finally, since we are satisfied with the model selected, we will retrain the model on the full dataset and then use that to generate the forecast for the data for May 2010

final_model <- atm1_fixed %>% 
  model(ETS(ATM1 ~ error("A") + trend("N") + season("M")))

parta_fcst1 <- final_model %>% forecast(h=14)

parta_fcst1 %>% autoplot(atm1_fixed)

parta_fcst1

## # A fable: 14 x 4 [1D]
## # Key:     .model [1]
##    .model                                         DATE               ATM1  .mean
##    <chr>                                          <date>           <dist>  <dbl>
##  1 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-01 sample[5000]  87.4 
##  2 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-02 sample[5000] 101.  
##  3 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-03 sample[5000]  74.1 
##  4 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-04 sample[5000]   5.91
##  5 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-05 sample[5000] 100.  
##  6 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-06 sample[5000]  79.2 
##  7 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-07 sample[5000]  86.0 
##  8 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-08 sample[5000]  87.4 
##  9 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-09 sample[5000] 101.  
## 10 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-10 sample[5000]  73.8 
## 11 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-11 sample[5000]   6.19
## 12 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-12 sample[5000] 100.  
## 13 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-13 sample[5000]  79.6 
## 14 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-14 sample[5000]  86.2

write.xlsx(as_tibble(parta_fcst1), '../data/project1_forecast.xlsx', sheetName='parta_fcst1', append=TRUE)

b. ATM2 Forecast

Next we’ll begin the process of doing the forecast for ATM2. Again, we start by creating a filtered dataset for the time series prior to May 2010.

atm2 <- atm_ts %>%
  select(ATM2) %>%
  filter(DATE < "2010-05-01")

We next view the data to see if there are any missing values that we will need to impute and to see if there are any extreme values or any other anomalies in the data

atm2 %>% autoplot()

## Plot variable not specified, automatically selected `.vars = ATM2`

summary(atm2)

##       ATM2             DATE           
##  Min.   :  0.00   Min.   :2009-05-01  
##  1st Qu.: 25.50   1st Qu.:2009-07-31  
##  Median : 67.00   Median :2009-10-30  
##  Mean   : 62.58   Mean   :2009-10-30  
##  3rd Qu.: 93.00   3rd Qu.:2010-01-29  
##  Max.   :147.00   Max.   :2010-04-30  
##  NA's   :2

atm2 %>% filter(is.na(ATM2))

## # A tsibble: 2 x 2 [1D]
##    ATM2 DATE      
##   <dbl> <date>    
## 1    NA 2009-06-18
## 2    NA 2009-06-24

Once again, we find that there are missing data values and thus we’ll want to impute those values as we do for ATM1. We’ll begin by creating a filtered time series for the values prior to 6/18; then we’ll build a model based on those early values, and then we’ll create the forecasts necessary to impute the missing values. Once we’ve imputed these values, then we can build our model on the new fixed time series.

atm2_early <- atm2 %>% filter(DATE < '2009-06-18')


atm2_early %>%
  model(STL(ATM2)) %>%
  components() %>%
  autoplot()

Based on the STL composition of the data, we will once again go with a exponential smoothing model. Once again, we will try several different models and evaluate which one results in the lowest AICc value. We’ll try the following models:

(N,A)
(N,M)
(A,A)
(A,M)

fit_early <- atm2_early %>%
  model(m1 = ETS(ATM2 ~ error("A") + trend("N") + season("A")),
        m2 = ETS(ATM2 ~ error("A") + trend("N") + season("M")),
        m3 = ETS(ATM2 ~ error("A") + trend("A") + season("A")),
        m4 = ETS(ATM2 ~ error("A") + trend("A") + season("M")))

glance(fit_early) %>% arrange(AICc)

## # A tibble: 4 × 9
##   .model sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE   MAE
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m1       589.   -241.  502.  508.  521.  478.  486.  15.0
## 2 m2       601.   -241.  503.  509.  522.  488.  495.  15.6
## 3 m3       663.   -243.  509.  518.  532.  511.  521.  15.0
## 4 m4       691.   -244.  511.  520.  534.  533.  535.  16.0

Based on the evaluation of the modeling, we find that the (N,A) model provides the lowest AICc. So we will use that to forecast the missing values

fit_early <- atm2_early %>% model(ETS(ATM2 ~ error("A") + trend("N") + season("A")))

fit_fcst <- fit_early %>% forecast(h=10) 

missing_val1 <- (fit_fcst %>% filter(DATE == '2009-06-18'))$.mean
missing_val2 <- (fit_fcst %>% filter(DATE == '2009-06-24'))$.mean

atm2_fixed <- atm2 %>% 
  mutate(ATM2 = ifelse(DATE == '2009-06-18',missing_val1, ATM2))

atm2_fixed <- atm2_fixed %>% 
  mutate(ATM2 = ifelse(DATE == '2009-06-24',missing_val2, ATM2))

atm2_fixed %>% autoplot()

## Plot variable not specified, automatically selected `.vars = ATM2`

summary(atm2_fixed)

##       ATM2             DATE           
##  Min.   :  0.00   Min.   :2009-05-01  
##  1st Qu.: 25.00   1st Qu.:2009-07-31  
##  Median : 66.00   Median :2009-10-30  
##  Mean   : 62.33   Mean   :2009-10-30  
##  3rd Qu.: 93.00   3rd Qu.:2010-01-29  
##  Max.   :147.00   Max.   :2010-04-30

Next we will use the fixed dataset to build a model that can be used to forecast our day for May. Once again, we will do this by creating a training and test dataset comprised of 70/30 percent of the data respectively.

train_pct = .7
train_size = floor(train_pct * nrow(atm2_fixed))
test_size = nrow(atm2_fixed) - train_size

train_data <- atm2_fixed %>% slice(1:train_size)
test_data <- atm2_fixed %>% slice(train_size+1 - n())


train_data %>% 
  model(STL(ATM2)) %>%
  components() %>%
  autoplot()

Based on the decomposition of the model, it doesn’t appear that there is a steady trend. Additionally, the seasonal data appears to be fluctuating over time. Therefore, we expect to be able to apply a multiplicative value to the seasonal portion of our model. But once again, we will explore the time series as an ETS model.

train_fit <- train_data %>%
  model(m1 = ETS(ATM2 ~ error("A") + trend("N") + season("A")),
        m2 = ETS(ATM2 ~ error("A") + trend("N") + season("M")))

Based on the AICc score, we will use a multiplicative variable for the seasonal component for our model. Next we’ll evaluate the model on the test data to see how well it performs.

train_fit <- train_data %>%
  model(ETS(ATM2 ~ error("A") + trend("N") + season("M")))

train_fcst <- train_fit %>% forecast(h=test_size)

train_fcst %>% 
  autoplot(atm2_fixed)

accuracy(train_fcst, test_data)

## # A tibble: 1 × 10
##   .model                  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>                   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 "ETS(ATM2 ~ error(\"A\… Test  0.522  50.1  40.6  -Inf   Inf  1.82  1.55 0.0113

aug <- train_fit %>% augment()

autoplot(aug, .innov)

aug %>%
  ggplot(aes(x=.innov)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

aug %>% ACF(.innov) %>%
  autoplot()

When we look at the accuracy of the model, we see that it does a pretty good job forecasting the data. Additionally a view of the residuals of the model checks all of our tests. Therefore we will go ahead and build a final model using the entire data series and then use this to generate our May 2010 forecasts.

final_fit <- atm2_fixed %>%
  model(ETS(ATM2 ~ error("A") + trend("N") + season("M")))

parta_fcst2 <- final_fit %>% forecast(h=14)

parta_fcst2 %>% autoplot(atm2_fixed)

parta_fcst2

## # A fable: 14 x 4 [1D]
## # Key:     .model [1]
##    .model                                         DATE               ATM2  .mean
##    <chr>                                          <date>           <dist>  <dbl>
##  1 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-01 sample[5000]  68.1 
##  2 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-02 sample[5000]  74.3 
##  3 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-03 sample[5000]  10.9 
##  4 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-04 sample[5000]   2.07
##  5 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-05 sample[5000] 102.  
##  6 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-06 sample[5000]  93.0 
##  7 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-07 sample[5000]  68.7 
##  8 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-08 sample[5000]  68.8 
##  9 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-09 sample[5000]  73.7 
## 10 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-10 sample[5000]  10.4 
## 11 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-11 sample[5000]   2.22
## 12 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-12 sample[5000] 102.  
## 13 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-13 sample[5000]  92.6 
## 14 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-14 sample[5000]  68.4

write.xlsx(as_tibble(parta_fcst2), '../data/project1_forecast.xlsx', sheetName='parta_fcst2', append=TRUE)

c. ATM3 Forecast

We will continue the process for ATM3. We’ll start by creating a dataset of the data prior to May 2010

atm3_early <- atm_ts %>%
  select(ATM3) %>%
  filter(DATE < "2010-05-01")

atm3_early %>% autoplot()

## Plot variable not specified, automatically selected `.vars = ATM3`

summary(atm3_early)

##       ATM3              DATE           
##  Min.   : 0.0000   Min.   :2009-05-01  
##  1st Qu.: 0.0000   1st Qu.:2009-07-31  
##  Median : 0.0000   Median :2009-10-30  
##  Mean   : 0.7206   Mean   :2009-10-30  
##  3rd Qu.: 0.0000   3rd Qu.:2010-01-29  
##  Max.   :96.0000   Max.   :2010-04-30

This timeseries is really interesting. While there are no outliers in the data, the data is mainly zeroes followed by a large deposit. While this may seem off from a data perspective, I know that there have been times where I have accounts that have been opened for years, but I don’t make regular deposits into them. Then, all of a sudden I’ll get a notice from my bank that the account will be closed for inactivity, and I’ll make a single deposit to meet the requirements, but then it’s possible that I don’t make any additional deposits going forward.

Therefore, I believe that we don’t really have a significant approach to be able to effectively predict the future deposits. Therefore, I recommend that we simply use one of the more simplistic approaches.

fit_atm3 <- atm3_early %>% 
  model(Mean = MEAN(ATM3),
        `Naive` = NAIVE(ATM3),
        Drift = RW(ATM3~drift()))


fit_atm3

## # A mable: 1 x 3
##      Mean   Naive         Drift
##   <model> <model>       <model>
## 1  <MEAN> <NAIVE> <RW w/ drift>

glance(fit_atm3)

## # A tibble: 3 × 2
##   .model sigma2
##   <chr>   <dbl>
## 1 Mean     63.1
## 2 Naive    25.9
## 3 Drift    25.9

Based on the model, we see that the Naive and Drift methods have the lowest sigma2. However, in this instance, I believe that the Mean forecast is the best forecast. For this instance, we will generate our forecast as the average of each of the forecast for the three models.

fc <- fit_atm3 %>% forecast(h=14)

colnames(fc)

## [1] ".model" "DATE"   "ATM3"   ".mean"

adjusted_fcst <- as_tibble(fc) %>% 
  select(c(".model", "DATE", ".mean")) %>%
  pivot_wider(names_from=.model, values_from=.mean) %>%
  mutate(mean_forecast = (Mean + Naive + Drift)/3) %>%
  select(c("DATE","mean_forecast")) %>%
  tsibble()

## Using `DATE` as index variable.

adjusted_fcst

## # A tsibble: 14 x 2 [1D]
##    DATE       mean_forecast
##    <date>             <dbl>
##  1 2010-05-01          57.0
##  2 2010-05-02          57.1
##  3 2010-05-03          57.1
##  4 2010-05-04          57.2
##  5 2010-05-05          57.3
##  6 2010-05-06          57.4
##  7 2010-05-07          57.5
##  8 2010-05-08          57.5
##  9 2010-05-09          57.6
## 10 2010-05-10          57.7
## 11 2010-05-11          57.8
## 12 2010-05-12          57.8
## 13 2010-05-13          57.9
## 14 2010-05-14          58.0

write.xlsx(as_tibble(adjusted_fcst), '../data/project1_forecast.xlsx', sheetName='parta_fcst3', append=TRUE)

d. ATM4 Forecast

Finally, we work to build out our forecast model for ATM4. We start once again by filtering out the values for the period prior to May 2010 and then examining the data.

atm4 <- atm_ts %>%
  select(ATM4) %>%
  filter(DATE < "2010-05-01")


summary(atm4)

##       ATM4                DATE           
##  Min.   :    1.563   Min.   :2009-05-01  
##  1st Qu.:  124.334   1st Qu.:2009-07-31  
##  Median :  403.839   Median :2009-10-30  
##  Mean   :  474.043   Mean   :2009-10-30  
##  3rd Qu.:  704.507   3rd Qu.:2010-01-29  
##  Max.   :10919.762   Max.   :2010-04-30

atm4 %>% autoplot()

## Plot variable not specified, automatically selected `.vars = ATM4`

When we view the data, we see that there’s an extreme outlier value in the data that we will need to deal with. We will deal with that by imputing it based on a forecasted value using the values prior to that value. In this instance, it’s most likely that the number is an actual outlier and not just a situation where the individual came across an extreme amount of money. The reason why I think this is the case is because the regular deposits for the prior period are way lower than this one time deposit amount.

atm4_early <- atm4 %>% filter(DATE < '2010-02-09') 

summary(atm4_early)

##       ATM4               DATE           
##  Min.   :   1.563   Min.   :2009-05-01  
##  1st Qu.: 111.536   1st Qu.:2009-07-10  
##  Median : 408.168   Median :2009-09-19  
##  Mean   : 447.897   Mean   :2009-09-19  
##  3rd Qu.: 691.654   3rd Qu.:2009-11-29  
##  Max.   :1712.075   Max.   :2010-02-08

Now we work to build the model of this early data set.

atm4_early %>% 
  model(stl = STL(ATM4)) %>%
  components() %>%
  autoplot()

When looking at the decomposition of the time series, we see that there is no trend and the seasonality in the data appears to be either additive or multiplicative. So we’ll use this to explore different ETS models

 fit_early <- atm4_early %>%
  model(ets1 = ETS(ATM4 ~ error("A")+trend("N")+season("A")),
        ets2 = ETS(ATM4 ~ error("A")+trend("N")+season("M")))


glance(fit_early) %>% arrange(AICc)

## # A tibble: 2 × 9
##   .model  sigma2 log_lik   AIC  AICc   BIC     MSE    AMSE   MAE
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>
## 1 ets1   104995.  -2439. 4899. 4899. 4935. 101668. 101940.  247.
## 2 ets2   105229.  -2440. 4899. 4900. 4936. 101895. 102232.  248.

In this case, the model that uses the additive of the seasonality component has the lowest AICc, so we will use that to build our model for this early part of the data

fit_early <- atm4_early %>%
  model(ETS(ATM4 ~ error("A") + trend("N") + season("A")))

fc_early <- fit_early %>% forecast(h=1) 

fc_early

## # A fable: 1 x 4 [1D]
## # Key:     .model [1]
##   .model                                          DATE                ATM4 .mean
##   <chr>                                           <date>            <dist> <dbl>
## 1 "ETS(ATM4 ~ error(\"A\") + trend(\"N\") + seas… 2010-02-09 N(552, 1e+05)  552.

atm4_fixed <- atm4 %>% mutate(ATM4 = ifelse(DATE == '2010-02-09', fc_early$.mean,ATM4 ))


atm4_fixed %>% autoplot(ATM4)

Now that we have a fixed timeseries, we can move forward with evaluating a model on the entire dataset for forecasting our May data.

train_pct = .7
train_size = floor(train_pct * nrow(atm4_fixed))
test_size = nrow(atm4_fixed) - train_size

train_data <- atm4_fixed %>% slice(1:train_size)
test_data <- atm4_fixed %>% slice(train_size+1 - n())


train_data %>% 
  model(STL(ATM4)) %>%
  components() %>%
  autoplot()

Based on the decomposition of the model, it doesn’t appear that there is a steady trend. Additionally, the seasonal data appears to be fluctuating over time. Therefore, we will go ahead and build the timeseries using the same model that we applied earlier

train_fit <- train_data %>%
  model(ETS(ATM4 ~ error("A") + trend("N") + season("A")))

train_fcst <- train_fit %>% forecast(h=test_size)

train_fcst %>% 
  autoplot(atm4_fixed)

accuracy(train_fcst, test_data)

## # A tibble: 1 × 10
##   .model                  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>                   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 "ETS(ATM4 ~ error(\"A\… Test  -24.8  365.  295. -655.  693. 0.843 0.801 0.0650

aug <- train_fit %>% augment()

autoplot(aug, .innov)

aug %>%
  ggplot(aes(x=.innov)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

aug %>% ACF(.innov) %>%
  autoplot()

When we look at the accuracy of the model, we see that it is not doing a good job forecasting the holdout data and capturing the movement in the data. Additionally a view of the residuals of the model does not look good, and finally we see some remaining correlation in the data. Based on this, we will need to explore a new model.

bc <- BoxCoxTrans(train_data$ATM4)
lambda <- bc$lambda

train_data %>% 
  model(STL(ATM4^lambda-1/lambda)) %>%
  components() %>%
  autoplot()

train_data %>% autoplot()

## Plot variable not specified, automatically selected `.vars = ATM4`

train_data %>%
  gg_tsdisplay(difference(ATM4^lambda-1/lambda,7), plot_type = 'partial')

## Warning: Removed 7 rows containing missing values (`geom_line()`).

## Warning: Removed 7 rows containing missing values (`geom_point()`).

After further evaluation, I’ve evaluated an ARIMA model after applying a box cox transformation with lambda = 0.4. Additionally, I applied a sesonal difference of 7 days (weekly) for the data. Given the fact that there is correlation in the ACF at lag 7, we will therefore use an ARIMA model with seasonality, with the following options p = 0, d=0, q=3 and then for the seasonal component we will use an MA(1), resulting in p=0,1,1

train_fit <- train_data %>% 
  model(m1 = ARIMA((ATM4^lambda-1/lambda)~1+pdq(0,0,3) + PDQ(0,1,1)),
        m2 = ARIMA((ATM4^lambda-1/lambda)~1+pdq(3,0,0) + PDQ(0,1,1)),
        m3 = ARIMA((ATM4^lambda-1/lambda), stepwise=FALSE))


glance(train_fit) %>% arrange(AICc)

## # A tibble: 3 × 8
##   .model sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots  
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>    
## 1 m2       13.7   -687. 1386. 1386. 1407. <cpl [3]>  <cpl [7]> 
## 2 m1       13.8   -687. 1386. 1387. 1407. <cpl [0]>  <cpl [10]>
## 3 m3       17.2   -724. 1455. 1455. 1469. <cpl [14]> <cpl [0]>

Based on this, we see the AR(3) model has the lowest AICc score and therefore we will use that to generate our forecast

train_fit <- train_data %>%
  model(ARIMA((ATM4^lambda-1/lambda)~1+pdq(3,0,0) + PDQ(0,1,1)))

train_fcst <- train_fit %>% forecast(h=test_size)

train_fcst %>% 
  autoplot(test_data)

accuracy(train_fcst, test_data)

## # A tibble: 1 × 10
##   .model                  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>                   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 "ARIMA((ATM4^lambda - … Test  -28.1  355.  290. -628.  663. 0.826 0.778 0.0566

aug <- train_fit %>% augment()

autoplot(aug, .innov)

aug %>%
  ggplot(aes(x=.innov)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

aug %>% ACF(.innov) %>%
  autoplot()

I’m still not pleased with the quality of the model and its forecast ability. However, I’m not sure what else I can do to improve the model. Therefore, I’m going to go ahead and forecast the May 2010 data, using this bad model.

fit_final <- atm4_fixed %>% model(ARIMA((ATM4^lambda-1/lambda)~1+pdq(3,0,0) + PDQ(0,1,1)))
  
parta_fcst4 <- fit_final %>% forecast(h=14)

parta_fcst4 %>% autoplot(atm4_fixed)

parta_fcst4

## # A fable: 14 x 4 [1D]
## # Key:     .model [1]
##    .model                                         DATE                ATM4 .mean
##    <chr>                                          <date>            <dist> <dbl>
##  1 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-01 t(N(7.1, 15))  374.
##  2 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-02 t(N(7.8, 15))  432.
##  3 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-03 t(N(8.3, 15))  481.
##  4 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-04 t(N(2.7, 16))  127.
##  5 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-05   t(N(9, 16))  549.
##  6 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-06 t(N(7.3, 16))  391.
##  7 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-07 t(N(9.5, 16))  605.
##  8 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-08 t(N(7.2, 16))  389.
##  9 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-09 t(N(8.2, 16))  468.
## 10 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-10 t(N(8.4, 16))  488.
## 11 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-11 t(N(2.7, 16))  129.
## 12 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-12   t(N(9, 16))  553.
## 13 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-13 t(N(7.3, 16))  393.
## 14 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-14 t(N(9.5, 16))  607.

write.xlsx(as_tibble(parta_fcst4), '../data/project1_forecast.xlsx', sheetName='parta_fcst4', append=TRUE)