In part A of this project, our goal is to forecast how much cash is taken out of 4 different ATM machines for May 2010. For this project, we will conduct the following steps.
During my initial import of the data, I was running into problems where the date column was coming in as a Julian date and I was unable to figure out how to change it to an actual data. So I decided to coerce the data type at the import stage.
raw_data <- read_xlsx('../data/ATM624Data.xlsx', col_types=c("date","text","numeric"))
In this step, I’m changing the DATE field to a date type, and also changing the dataframe to wide type so that each ATM is it’s own column. Finally, there was an NA column in the data; so I’m removing that column from the data.
raw_data <- raw_data %>% mutate(DATE = as.Date(DATE))
data_wide <- raw_data %>% pivot_wider(names_from=ATM, values_from=Cash)
data_wide %>% summary()
## DATE ATM1 ATM2 NA
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : NA
## 1st Qu.:2009-08-03 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: NA
## Median :2009-11-06 Median : 91.00 Median : 67.00 Median : NA
## Mean :2009-11-06 Mean : 83.89 Mean : 62.58 Mean :NaN
## 3rd Qu.:2010-02-08 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: NA
## Max. :2010-05-14 Max. :180.00 Max. :147.00 Max. : NA
## NA's :17 NA's :16 NA's :379
## ATM3 ATM4
## Min. : 0.0000 Min. : 1.563
## 1st Qu.: 0.0000 1st Qu.: 124.334
## Median : 0.0000 Median : 403.839
## Mean : 0.7206 Mean : 474.043
## 3rd Qu.: 0.0000 3rd Qu.: 704.507
## Max. :96.0000 Max. :10919.762
## NA's :14 NA's :14
data_wide %>% str()
## tibble [379 × 6] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:379], format: "2009-05-01" "2009-05-02" ...
## $ ATM1: num [1:379] 96 82 85 90 99 88 8 104 87 93 ...
## $ ATM2: num [1:379] 107 89 90 55 79 19 2 103 107 118 ...
## $ NA : num [1:379] NA NA NA NA NA NA NA NA NA NA ...
## $ ATM3: num [1:379] 0 0 0 0 0 0 0 0 0 0 ...
## $ ATM4: num [1:379] 777 524.4 792.8 908.2 52.8 ...
data_wide <- data_wide %>% select(!`NA`)
When looking at the data in a wide format, we find that there is an NA column that has no valid data. Additionally, within the different ATM columns, we see that there are NA values as well. In ATM 1 there are 17 NAs, in ATM2 there are 16, and in ATM3 and ATM4 there are 14 NA values. Additionally, the DATE field is in numeric form and as such should be changed to an actual date format.
Next we convert the data into a tsibble, using the date field as the index.
atm_ts <-tsibble(data_wide)
## Using `DATE` as index variable.
atm_ts %>%
autoplot(ATM4)
## Warning: Removed 14 rows containing missing values (`geom_line()`).
For this we are starting with the ATM1 data. Since we will be forecasting the data for May 2010, we will begin by filtering the data to include on those values that occurred prior to May 2010.
atm1 <- atm_ts %>%
select(ATM1) %>%
filter(DATE < "2010-05-01")
Next we preview the data to understand the properties of the data
summary(atm1)
## ATM1 DATE
## Min. : 1.00 Min. :2009-05-01
## 1st Qu.: 73.00 1st Qu.:2009-07-31
## Median : 91.00 Median :2009-10-30
## Mean : 83.89 Mean :2009-10-30
## 3rd Qu.:108.00 3rd Qu.:2010-01-29
## Max. :180.00 Max. :2010-04-30
## NA's :3
nrow(atm1)
## [1] 365
atm1 %>% autoplot()
## Plot variable not specified, automatically selected `.vars = ATM1`
atm1 %>% filter(is.na(ATM1))
## # A tsibble: 3 x 2 [1D]
## ATM1 DATE
## <dbl> <date>
## 1 NA 2009-06-13
## 2 NA 2009-06-16
## 3 NA 2009-06-22
Looking at the data, we see that there are 365 values, and that the data is a time-series with daily values. Additionally, we see that there are three missing values on 6/13/2009, 6/16/2009 and 6/22/2009. To deal with these missing values, we will first train a model on the data prior to these values, and then use that to impute the missing values. Once we’ve completed that step, we will then take our dataset and create or test and training data that will be used to build our overall model.
First we will create a dataset that includes just the data prior to 6/13/2009
atm1_early <- atm1 %>% filter(DATE < '2009-06-13')
Next we will explore this data using the STL decomposition to bettern understand the components of the timeseries so we can use this to help us determine how to model the data. I tried using box-cox transformation to stabilize the trend in the data, but the transformation did not have a discernible impact on this aspect of the data.
atm1_early
## # A tsibble: 43 x 2 [1D]
## ATM1 DATE
## <dbl> <date>
## 1 96 2009-05-01
## 2 82 2009-05-02
## 3 85 2009-05-03
## 4 90 2009-05-04
## 5 99 2009-05-05
## 6 88 2009-05-06
## 7 8 2009-05-07
## 8 104 2009-05-08
## 9 87 2009-05-09
## 10 93 2009-05-10
## # ℹ 33 more rows
atm1_early %>%
model(STL(ATM1)) %>%
components() %>%
autoplot()
Based on our decomposition, we see that there is a pretty stable trend
at the beginning, with a significant bump later on in the dataset.
Additionally, we wee that there is weekly seasonality in the data. The
remainder is relatively stable and centered around zero, until we get to
later in June where there is a significant outlier in the data.
Based on this, we will explore modeling the time series with an ETS model. Based on the component data above, I believe that the seasonal trend is Additive. However, while the overall trend is basically stable throughout the data, the fact that there’s a significant spike in the data, causes me to think it’s possible that that would be best modeled as a Multiplicative trend. However, I will try models of the following types:
fit <- atm1_early %>%
model(m1 = ETS(ATM1 ~ error("A")+trend("A")+season("A")),
m2 = ETS(ATM1 ~ error("A")+trend("A")+season("M")),
m3 = ETS(ATM1 ~ error("A")+trend("N")+season("M")),
m4 = ETS(ATM1 ~ error("A")+trend("N")+season("A")))
glance(fit) %>% arrange(AICc)
## # A tibble: 4 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m3 424. -206. 432. 439. 449. 335. 341. 10.2
## 2 m4 436. -206. 433. 440. 451. 345. 351. 11.1
## 3 m2 438. -205. 435. 445. 456. 326. 333. 10.1
## 4 m1 510. -209. 441. 452. 462. 380. 386. 12.4
In reviewing the AICc for each of the models, we find that m3 (N,M) has the lower AICc and therefore I will be using that model for forecasting the missing values on 6/13, 6/16 and 6/22.
fit2 <- atm1_early %>%
model(ETS(ATM1 ~ error("A") + trend("N") + season("M")))
aug <- fit2 %>% augment()
autoplot(aug, .innov)
aug %>%
ggplot(aes(x=.innov)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
aug %>% ACF(.innov) %>%
autoplot()
From the residuals we see evidence of the major outlier in the data in
June, and the fact that there’s a weekly lag in the autocorrelation
plot. We will use this model to impute the missing values.
fit2_fc <- fit2 %>% forecast(h=10)
missing_val1 <- (fit2_fc %>% filter(DATE == '2009-06-13'))$.mean
missing_val2 <- (fit2_fc %>% filter(DATE == '2009-06-16'))$.mean
missing_val3 <- (fit2_fc %>% filter(DATE == '2009-06-22'))$.mean
atm1_fixed <- atm1 %>%
mutate(ATM1 = ifelse(DATE == '2009-06-13',missing_val1,ATM1))
atm1_fixed <- atm1_fixed %>%
mutate(ATM1 = ifelse(DATE == '2009-06-16',missing_val2,ATM1))
atm1_fixed <- atm1_fixed %>%
mutate(ATM1 = ifelse(DATE == '2009-06-22',missing_val3,ATM1))
Now that we’ve imputed the missing values, we will now create a test and training dataset on the entire data with the imputed values included. From there, we will basically go through the same steps that we just completed - but now we will focus on building a model on our training data and then using the test data to evaluate the performance of the data.
We will first look at the autoplot of the new data set and determine the number of rows to include in our training and test data. Consistent with standard practice, we will opt for a 70/30 split.
train_pct <- .7
train_size <- floor(nrow(atm1_fixed)*train_pct)
train_data <- atm1_fixed %>% slice(1:train_size)
test_data <- atm1_fixed %>% slice(train_size:n())
train_data %>% autoplot()
## Plot variable not specified, automatically selected `.vars = ATM1`
We will now build our model on the test data using the same model that
we used initially.
train_fit <- train_data %>%
model(ETS(ATM1 ~ error("A") + trend("N") + season("M")))
train_fc <- train_fit %>% forecast(h=nrow(test_data)-1)
train_fc %>% autoplot(atm1_fixed)
Finally, prior to using the model to forecast the data for May 2010, we will look at the residuals of the model and evaluate how effective the model is at predicting using our test data.
accuracy(train_fc, test_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "ETS(ATM1 ~ error(\"A\… Test 1.14 45.3 31.2 -303. 337. NaN NaN 0.0602
aug <- train_fit %>% augment()
autoplot(aug, .innov)
aug %>%
ggplot(aes(x=.innov)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
aug %>% ACF(.innov) %>%
autoplot()
When looking at the plots for the residuals of the model, we see that
the model has done a pretty good job at capturing the variance in the
data. There are no signficicant autocorrelations in the data and the
residuals appear to be normally distributed. Finally, since we are
satisfied with the model selected, we will retrain the model on the full
dataset and then use that to generate the forecast for the data for May
2010
final_model <- atm1_fixed %>%
model(ETS(ATM1 ~ error("A") + trend("N") + season("M")))
parta_fcst1 <- final_model %>% forecast(h=14)
parta_fcst1 %>% autoplot(atm1_fixed)
parta_fcst1
## # A fable: 14 x 4 [1D]
## # Key: .model [1]
## .model DATE ATM1 .mean
## <chr> <date> <dist> <dbl>
## 1 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-01 sample[5000] 87.4
## 2 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-02 sample[5000] 101.
## 3 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-03 sample[5000] 74.1
## 4 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-04 sample[5000] 5.91
## 5 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-05 sample[5000] 100.
## 6 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-06 sample[5000] 79.2
## 7 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-07 sample[5000] 86.0
## 8 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-08 sample[5000] 87.4
## 9 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-09 sample[5000] 101.
## 10 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-10 sample[5000] 73.8
## 11 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-11 sample[5000] 6.19
## 12 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-12 sample[5000] 100.
## 13 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-13 sample[5000] 79.6
## 14 "ETS(ATM1 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-14 sample[5000] 86.2
write.xlsx(as_tibble(parta_fcst1), '../data/project1_forecast.xlsx', sheetName='parta_fcst1', append=TRUE)
Next we’ll begin the process of doing the forecast for ATM2. Again, we start by creating a filtered dataset for the time series prior to May 2010.
atm2 <- atm_ts %>%
select(ATM2) %>%
filter(DATE < "2010-05-01")
We next view the data to see if there are any missing values that we will need to impute and to see if there are any extreme values or any other anomalies in the data
atm2 %>% autoplot()
## Plot variable not specified, automatically selected `.vars = ATM2`
summary(atm2)
## ATM2 DATE
## Min. : 0.00 Min. :2009-05-01
## 1st Qu.: 25.50 1st Qu.:2009-07-31
## Median : 67.00 Median :2009-10-30
## Mean : 62.58 Mean :2009-10-30
## 3rd Qu.: 93.00 3rd Qu.:2010-01-29
## Max. :147.00 Max. :2010-04-30
## NA's :2
atm2 %>% filter(is.na(ATM2))
## # A tsibble: 2 x 2 [1D]
## ATM2 DATE
## <dbl> <date>
## 1 NA 2009-06-18
## 2 NA 2009-06-24
Once again, we find that there are missing data values and thus we’ll want to impute those values as we do for ATM1. We’ll begin by creating a filtered time series for the values prior to 6/18; then we’ll build a model based on those early values, and then we’ll create the forecasts necessary to impute the missing values. Once we’ve imputed these values, then we can build our model on the new fixed time series.
atm2_early <- atm2 %>% filter(DATE < '2009-06-18')
atm2_early %>%
model(STL(ATM2)) %>%
components() %>%
autoplot()
Based on the STL composition of the data, we will once again go with a
exponential smoothing model. Once again, we will try several different
models and evaluate which one results in the lowest AICc value. We’ll
try the following models:
fit_early <- atm2_early %>%
model(m1 = ETS(ATM2 ~ error("A") + trend("N") + season("A")),
m2 = ETS(ATM2 ~ error("A") + trend("N") + season("M")),
m3 = ETS(ATM2 ~ error("A") + trend("A") + season("A")),
m4 = ETS(ATM2 ~ error("A") + trend("A") + season("M")))
glance(fit_early) %>% arrange(AICc)
## # A tibble: 4 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m1 589. -241. 502. 508. 521. 478. 486. 15.0
## 2 m2 601. -241. 503. 509. 522. 488. 495. 15.6
## 3 m3 663. -243. 509. 518. 532. 511. 521. 15.0
## 4 m4 691. -244. 511. 520. 534. 533. 535. 16.0
Based on the evaluation of the modeling, we find that the (N,A) model provides the lowest AICc. So we will use that to forecast the missing values
fit_early <- atm2_early %>% model(ETS(ATM2 ~ error("A") + trend("N") + season("A")))
fit_fcst <- fit_early %>% forecast(h=10)
missing_val1 <- (fit_fcst %>% filter(DATE == '2009-06-18'))$.mean
missing_val2 <- (fit_fcst %>% filter(DATE == '2009-06-24'))$.mean
atm2_fixed <- atm2 %>%
mutate(ATM2 = ifelse(DATE == '2009-06-18',missing_val1, ATM2))
atm2_fixed <- atm2_fixed %>%
mutate(ATM2 = ifelse(DATE == '2009-06-24',missing_val2, ATM2))
atm2_fixed %>% autoplot()
## Plot variable not specified, automatically selected `.vars = ATM2`
summary(atm2_fixed)
## ATM2 DATE
## Min. : 0.00 Min. :2009-05-01
## 1st Qu.: 25.00 1st Qu.:2009-07-31
## Median : 66.00 Median :2009-10-30
## Mean : 62.33 Mean :2009-10-30
## 3rd Qu.: 93.00 3rd Qu.:2010-01-29
## Max. :147.00 Max. :2010-04-30
Next we will use the fixed dataset to build a model that can be used to forecast our day for May. Once again, we will do this by creating a training and test dataset comprised of 70/30 percent of the data respectively.
train_pct = .7
train_size = floor(train_pct * nrow(atm2_fixed))
test_size = nrow(atm2_fixed) - train_size
train_data <- atm2_fixed %>% slice(1:train_size)
test_data <- atm2_fixed %>% slice(train_size+1 - n())
train_data %>%
model(STL(ATM2)) %>%
components() %>%
autoplot()
Based on the decomposition of the model, it doesn’t appear that there is
a steady trend. Additionally, the seasonal data appears to be
fluctuating over time. Therefore, we expect to be able to apply a
multiplicative value to the seasonal portion of our model. But once
again, we will explore the time series as an ETS model.
train_fit <- train_data %>%
model(m1 = ETS(ATM2 ~ error("A") + trend("N") + season("A")),
m2 = ETS(ATM2 ~ error("A") + trend("N") + season("M")))
Based on the AICc score, we will use a multiplicative variable for the seasonal component for our model. Next we’ll evaluate the model on the test data to see how well it performs.
train_fit <- train_data %>%
model(ETS(ATM2 ~ error("A") + trend("N") + season("M")))
train_fcst <- train_fit %>% forecast(h=test_size)
train_fcst %>%
autoplot(atm2_fixed)
accuracy(train_fcst, test_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "ETS(ATM2 ~ error(\"A\… Test 0.522 50.1 40.6 -Inf Inf 1.82 1.55 0.0113
aug <- train_fit %>% augment()
autoplot(aug, .innov)
aug %>%
ggplot(aes(x=.innov)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
aug %>% ACF(.innov) %>%
autoplot()
When we look at the accuracy of the model, we see that it does a pretty
good job forecasting the data. Additionally a view of the residuals of
the model checks all of our tests. Therefore we will go ahead and build
a final model using the entire data series and then use this to generate
our May 2010 forecasts.
final_fit <- atm2_fixed %>%
model(ETS(ATM2 ~ error("A") + trend("N") + season("M")))
parta_fcst2 <- final_fit %>% forecast(h=14)
parta_fcst2 %>% autoplot(atm2_fixed)
parta_fcst2
## # A fable: 14 x 4 [1D]
## # Key: .model [1]
## .model DATE ATM2 .mean
## <chr> <date> <dist> <dbl>
## 1 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-01 sample[5000] 68.1
## 2 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-02 sample[5000] 74.3
## 3 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-03 sample[5000] 10.9
## 4 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-04 sample[5000] 2.07
## 5 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-05 sample[5000] 102.
## 6 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-06 sample[5000] 93.0
## 7 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-07 sample[5000] 68.7
## 8 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-08 sample[5000] 68.8
## 9 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-09 sample[5000] 73.7
## 10 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-10 sample[5000] 10.4
## 11 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-11 sample[5000] 2.22
## 12 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-12 sample[5000] 102.
## 13 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-13 sample[5000] 92.6
## 14 "ETS(ATM2 ~ error(\"A\") + trend(\"N\") + sea… 2010-05-14 sample[5000] 68.4
write.xlsx(as_tibble(parta_fcst2), '../data/project1_forecast.xlsx', sheetName='parta_fcst2', append=TRUE)
We will continue the process for ATM3. We’ll start by creating a dataset of the data prior to May 2010
atm3_early <- atm_ts %>%
select(ATM3) %>%
filter(DATE < "2010-05-01")
atm3_early %>% autoplot()
## Plot variable not specified, automatically selected `.vars = ATM3`
summary(atm3_early)
## ATM3 DATE
## Min. : 0.0000 Min. :2009-05-01
## 1st Qu.: 0.0000 1st Qu.:2009-07-31
## Median : 0.0000 Median :2009-10-30
## Mean : 0.7206 Mean :2009-10-30
## 3rd Qu.: 0.0000 3rd Qu.:2010-01-29
## Max. :96.0000 Max. :2010-04-30
This timeseries is really interesting. While there are no outliers in the data, the data is mainly zeroes followed by a large deposit. While this may seem off from a data perspective, I know that there have been times where I have accounts that have been opened for years, but I don’t make regular deposits into them. Then, all of a sudden I’ll get a notice from my bank that the account will be closed for inactivity, and I’ll make a single deposit to meet the requirements, but then it’s possible that I don’t make any additional deposits going forward.
Therefore, I believe that we don’t really have a significant approach to be able to effectively predict the future deposits. Therefore, I recommend that we simply use one of the more simplistic approaches.
fit_atm3 <- atm3_early %>%
model(Mean = MEAN(ATM3),
`Naive` = NAIVE(ATM3),
Drift = RW(ATM3~drift()))
fit_atm3
## # A mable: 1 x 3
## Mean Naive Drift
## <model> <model> <model>
## 1 <MEAN> <NAIVE> <RW w/ drift>
glance(fit_atm3)
## # A tibble: 3 × 2
## .model sigma2
## <chr> <dbl>
## 1 Mean 63.1
## 2 Naive 25.9
## 3 Drift 25.9
Based on the model, we see that the Naive and Drift methods have the lowest sigma2. However, in this instance, I believe that the Mean forecast is the best forecast. For this instance, we will generate our forecast as the average of each of the forecast for the three models.
fc <- fit_atm3 %>% forecast(h=14)
colnames(fc)
## [1] ".model" "DATE" "ATM3" ".mean"
adjusted_fcst <- as_tibble(fc) %>%
select(c(".model", "DATE", ".mean")) %>%
pivot_wider(names_from=.model, values_from=.mean) %>%
mutate(mean_forecast = (Mean + Naive + Drift)/3) %>%
select(c("DATE","mean_forecast")) %>%
tsibble()
## Using `DATE` as index variable.
adjusted_fcst
## # A tsibble: 14 x 2 [1D]
## DATE mean_forecast
## <date> <dbl>
## 1 2010-05-01 57.0
## 2 2010-05-02 57.1
## 3 2010-05-03 57.1
## 4 2010-05-04 57.2
## 5 2010-05-05 57.3
## 6 2010-05-06 57.4
## 7 2010-05-07 57.5
## 8 2010-05-08 57.5
## 9 2010-05-09 57.6
## 10 2010-05-10 57.7
## 11 2010-05-11 57.8
## 12 2010-05-12 57.8
## 13 2010-05-13 57.9
## 14 2010-05-14 58.0
write.xlsx(as_tibble(adjusted_fcst), '../data/project1_forecast.xlsx', sheetName='parta_fcst3', append=TRUE)
Finally, we work to build out our forecast model for ATM4. We start once again by filtering out the values for the period prior to May 2010 and then examining the data.
atm4 <- atm_ts %>%
select(ATM4) %>%
filter(DATE < "2010-05-01")
summary(atm4)
## ATM4 DATE
## Min. : 1.563 Min. :2009-05-01
## 1st Qu.: 124.334 1st Qu.:2009-07-31
## Median : 403.839 Median :2009-10-30
## Mean : 474.043 Mean :2009-10-30
## 3rd Qu.: 704.507 3rd Qu.:2010-01-29
## Max. :10919.762 Max. :2010-04-30
atm4 %>% autoplot()
## Plot variable not specified, automatically selected `.vars = ATM4`
When we view the data, we see that there’s an extreme outlier value in
the data that we will need to deal with. We will deal with that by
imputing it based on a forecasted value using the values prior to that
value. In this instance, it’s most likely that the number is an actual
outlier and not just a situation where the individual came across an
extreme amount of money. The reason why I think this is the case is
because the regular deposits for the prior period are way lower than
this one time deposit amount.
atm4_early <- atm4 %>% filter(DATE < '2010-02-09')
summary(atm4_early)
## ATM4 DATE
## Min. : 1.563 Min. :2009-05-01
## 1st Qu.: 111.536 1st Qu.:2009-07-10
## Median : 408.168 Median :2009-09-19
## Mean : 447.897 Mean :2009-09-19
## 3rd Qu.: 691.654 3rd Qu.:2009-11-29
## Max. :1712.075 Max. :2010-02-08
Now we work to build the model of this early data set.
atm4_early %>%
model(stl = STL(ATM4)) %>%
components() %>%
autoplot()
When looking at the decomposition of the time series, we see that there is no trend and the seasonality in the data appears to be either additive or multiplicative. So we’ll use this to explore different ETS models
fit_early <- atm4_early %>%
model(ets1 = ETS(ATM4 ~ error("A")+trend("N")+season("A")),
ets2 = ETS(ATM4 ~ error("A")+trend("N")+season("M")))
glance(fit_early) %>% arrange(AICc)
## # A tibble: 2 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets1 104995. -2439. 4899. 4899. 4935. 101668. 101940. 247.
## 2 ets2 105229. -2440. 4899. 4900. 4936. 101895. 102232. 248.
In this case, the model that uses the additive of the seasonality component has the lowest AICc, so we will use that to build our model for this early part of the data
fit_early <- atm4_early %>%
model(ETS(ATM4 ~ error("A") + trend("N") + season("A")))
fc_early <- fit_early %>% forecast(h=1)
fc_early
## # A fable: 1 x 4 [1D]
## # Key: .model [1]
## .model DATE ATM4 .mean
## <chr> <date> <dist> <dbl>
## 1 "ETS(ATM4 ~ error(\"A\") + trend(\"N\") + seas… 2010-02-09 N(552, 1e+05) 552.
atm4_fixed <- atm4 %>% mutate(ATM4 = ifelse(DATE == '2010-02-09', fc_early$.mean,ATM4 ))
atm4_fixed %>% autoplot(ATM4)
Now that we have a fixed timeseries, we can move forward with evaluating
a model on the entire dataset for forecasting our May data.
train_pct = .7
train_size = floor(train_pct * nrow(atm4_fixed))
test_size = nrow(atm4_fixed) - train_size
train_data <- atm4_fixed %>% slice(1:train_size)
test_data <- atm4_fixed %>% slice(train_size+1 - n())
train_data %>%
model(STL(ATM4)) %>%
components() %>%
autoplot()
Based on the decomposition of the model, it doesn’t appear that there is a steady trend. Additionally, the seasonal data appears to be fluctuating over time. Therefore, we will go ahead and build the timeseries using the same model that we applied earlier
train_fit <- train_data %>%
model(ETS(ATM4 ~ error("A") + trend("N") + season("A")))
train_fcst <- train_fit %>% forecast(h=test_size)
train_fcst %>%
autoplot(atm4_fixed)
accuracy(train_fcst, test_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "ETS(ATM4 ~ error(\"A\… Test -24.8 365. 295. -655. 693. 0.843 0.801 0.0650
aug <- train_fit %>% augment()
autoplot(aug, .innov)
aug %>%
ggplot(aes(x=.innov)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
aug %>% ACF(.innov) %>%
autoplot()
When we look at the accuracy of the model, we see that it is not doing a
good job forecasting the holdout data and capturing the movement in the
data. Additionally a view of the residuals of the model does not look
good, and finally we see some remaining correlation in the data. Based
on this, we will need to explore a new model.
bc <- BoxCoxTrans(train_data$ATM4)
lambda <- bc$lambda
train_data %>%
model(STL(ATM4^lambda-1/lambda)) %>%
components() %>%
autoplot()
train_data %>% autoplot()
## Plot variable not specified, automatically selected `.vars = ATM4`
train_data %>%
gg_tsdisplay(difference(ATM4^lambda-1/lambda,7), plot_type = 'partial')
## Warning: Removed 7 rows containing missing values (`geom_line()`).
## Warning: Removed 7 rows containing missing values (`geom_point()`).
After further evaluation, I’ve evaluated an ARIMA model after applying a
box cox transformation with lambda = 0.4. Additionally, I applied a
sesonal difference of 7 days (weekly) for the data. Given the fact that
there is correlation in the ACF at lag 7, we will therefore use an ARIMA
model with seasonality, with the following options p = 0, d=0, q=3 and
then for the seasonal component we will use an MA(1), resulting in
p=0,1,1
train_fit <- train_data %>%
model(m1 = ARIMA((ATM4^lambda-1/lambda)~1+pdq(0,0,3) + PDQ(0,1,1)),
m2 = ARIMA((ATM4^lambda-1/lambda)~1+pdq(3,0,0) + PDQ(0,1,1)),
m3 = ARIMA((ATM4^lambda-1/lambda), stepwise=FALSE))
glance(train_fit) %>% arrange(AICc)
## # A tibble: 3 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 m2 13.7 -687. 1386. 1386. 1407. <cpl [3]> <cpl [7]>
## 2 m1 13.8 -687. 1386. 1387. 1407. <cpl [0]> <cpl [10]>
## 3 m3 17.2 -724. 1455. 1455. 1469. <cpl [14]> <cpl [0]>
Based on this, we see the AR(3) model has the lowest AICc score and therefore we will use that to generate our forecast
train_fit <- train_data %>%
model(ARIMA((ATM4^lambda-1/lambda)~1+pdq(3,0,0) + PDQ(0,1,1)))
train_fcst <- train_fit %>% forecast(h=test_size)
train_fcst %>%
autoplot(test_data)
accuracy(train_fcst, test_data)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "ARIMA((ATM4^lambda - … Test -28.1 355. 290. -628. 663. 0.826 0.778 0.0566
aug <- train_fit %>% augment()
autoplot(aug, .innov)
aug %>%
ggplot(aes(x=.innov)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
aug %>% ACF(.innov) %>%
autoplot()
I’m still not pleased with the quality of the model and its forecast
ability. However, I’m not sure what else I can do to improve the model.
Therefore, I’m going to go ahead and forecast the May 2010 data, using
this bad model.
fit_final <- atm4_fixed %>% model(ARIMA((ATM4^lambda-1/lambda)~1+pdq(3,0,0) + PDQ(0,1,1)))
parta_fcst4 <- fit_final %>% forecast(h=14)
parta_fcst4 %>% autoplot(atm4_fixed)
parta_fcst4
## # A fable: 14 x 4 [1D]
## # Key: .model [1]
## .model DATE ATM4 .mean
## <chr> <date> <dist> <dbl>
## 1 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-01 t(N(7.1, 15)) 374.
## 2 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-02 t(N(7.8, 15)) 432.
## 3 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-03 t(N(8.3, 15)) 481.
## 4 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-04 t(N(2.7, 16)) 127.
## 5 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-05 t(N(9, 16)) 549.
## 6 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-06 t(N(7.3, 16)) 391.
## 7 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-07 t(N(9.5, 16)) 605.
## 8 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-08 t(N(7.2, 16)) 389.
## 9 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-09 t(N(8.2, 16)) 468.
## 10 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-10 t(N(8.4, 16)) 488.
## 11 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-11 t(N(2.7, 16)) 129.
## 12 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-12 t(N(9, 16)) 553.
## 13 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-13 t(N(7.3, 16)) 393.
## 14 "ARIMA((ATM4^lambda - 1/lambda) ~ 1 + pdq(3, … 2010-05-14 t(N(9.5, 16)) 607.
write.xlsx(as_tibble(parta_fcst4), '../data/project1_forecast.xlsx', sheetName='parta_fcst4', append=TRUE)