This project consists of three parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, October 31. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.
# Import required R libraries
library(fpp3)
library(tidyverse)
library(readxl)
library(writexl)
library(seasonal)
library(stringr)
Prompt: In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable Cash is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast which you will put in an Excel readable file.
# Read in data
atm_data_raw <- read_excel("data/ATM624Data.xlsx")
# Properly convert the DATE column to match true input
# https://stackoverflow.com/questions/43230470/how-to-convert-excel-date-format-to-proper-date-in-r
atm_data_raw <- atm_data_raw %>% mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
# Initial output to see data
#head(atm_data_raw)
# Output summary for high level assessment
summary(atm_data_raw)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
# Check dimensions to understand breadth of data
dim(atm_data_raw)
## [1] 1474 3
# 1474 3
#Cash in hundreds
The dataset's dates start at 2009-05-01 and end at 2010-05-14, spanning 379 days, which is 14 more than a single year of 365 days.
The dimensions of the initial dataset indicate 1474 observations, which again is more than a single year's worth of data. The 3 columns are DATE, ATM, and Cash. DATE indicates the specific date, ATM indicates which of the 4 ATMs, and Cash represents the total amount withdrawn for the given date and ATM.
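As a quick sanity check (a sketch, not part of the original output), the observations can be counted per ATM label to see how the 1474 rows break down across the four ATMs and any unlabeled rows:
# Count observations per ATM label (rows with a missing ATM form their own group)
atm_data_raw %>%
  count(ATM)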
# Define as tsibble with DATE as index
atm_data_ts <- atm_data_raw %>%
as_tsibble(index = DATE, key = ATM)
# Output tsibble after 2010-04-30
atm_data_ts %>%
filter(DATE > "2010-04-30")
## # A tsibble: 14 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-05-01 <NA> NA
## 2 2010-05-02 <NA> NA
## 3 2010-05-03 <NA> NA
## 4 2010-05-04 <NA> NA
## 5 2010-05-05 <NA> NA
## 6 2010-05-06 <NA> NA
## 7 2010-05-07 <NA> NA
## 8 2010-05-08 <NA> NA
## 9 2010-05-09 <NA> NA
## 10 2010-05-10 <NA> NA
## 11 2010-05-11 <NA> NA
## 12 2010-05-12 <NA> NA
## 13 2010-05-13 <NA> NA
## 14 2010-05-14 <NA> NA
A closer look at the data shows the 14 observations starting on 2010-05-01 and ending on 2010-05-14 indicate neither an ATM nor a Cash amount, so these 14 observations will be ignored in the upcoming evaluation. Removing them also makes for a cleaner dataset, as the total observation count becomes 1460, which is exactly 365 for each of the 4 ATMs.
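For explicitness, the 14 unlabeled rows could be dropped directly; a minimal sketch (the object name atm_data_ts_clean is illustrative, and the per-ATM filters used below achieve the same effect):
# Keep only rows with a labelled ATM; this drops the 14 trailing placeholder rows
atm_data_ts_clean <- atm_data_ts %>%
  filter(!is.na(ATM))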
atm_data_ts %>%
autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM Withdrawls")
An initial plot of the time series shows the ATM4 data falls in a range greater than the other three ATMs. With the extreme outlier in ATM4, the plot provides little value for comparison across the four ATMs.
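One way to make the comparison readable (a sketch; the faceting and free y-scales are illustrative choices) is to plot each ATM in its own panel:
# Facet by ATM with independent y-axis scales so ATM4's larger range does not dominate
atm_data_ts %>%
  autoplot(Cash) +
  facet_wrap(vars(ATM), scales = "free_y") +
  labs(y = "Cash (in hundreds $USD)", title = "ATM Withdrawals by ATM")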
For forecasting the withdrawal amount for each ATM, I will go through each ATM individually, beginning with ATM1.
# Filter to ATM1 data
atm1_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM1')
# Output summary
summary(atm1_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.89
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
## NA's :3
The summary of data for ATM1 confirms the 365 days (single year) of observations and also indicates 3 missing Cash values.
atm1_data_ts$DATE[is.na(atm1_data_ts$Cash)]
## [1] "2009-06-13" "2009-06-16" "2009-06-22"
The 3 dates with missing Cash values are displayed above. I am not sure of the significance of the missing data occurring within such a small window of time, so I will consider imputation.
atm1_data_ts %>%
autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM1 Withdrawals")
Initial plot of the ATM1 data appears to show a weekly seasonal component along with a dip between October 2009 and January 2010. There does not appear to be a clear increasing or decreasing trend nor a cyclic component.
Given the low volume of missing values, three dates, I will simply impute the data with the median value of the full ATM1 dataset.
# Calculate median value for ATM1
median <- median(atm1_data_ts$Cash, na.rm=TRUE)
# Set NAs to median
atm1_data_ts$Cash[is.na(atm1_data_ts$Cash)] <- median
summary(atm1_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.95
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
Summary output confirms no missing data for ATM1 Cash amounts.
Next, the ATM1 tsibble is updated so that DATE is properly defined as the index.
atm1_data_ts <- atm1_data_ts %>%
mutate(DATE = as_date(DATE)) %>%
update_tsibble(index = DATE)
#atm1_data_ts
To better understand the trend and seasonal nature of the ATM1 dataset, decomposition is performed with a period of 7 to represent the weekly pattern.
# From: https://stats.stackexchange.com/questions/494013/control-the-period-for-daily-time-series-in-tsibbles
atm1_dcmp <- atm1_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm1_dcmp %>% components() %>% autoplot()
The STL decomposition shows the remainder component accounts for the most variation, based on the relative scale of the gray bars on the left. The seasonal panel shows a clear weekly pattern that stays the same throughout, while the remainder panel shows greater variance toward the end of the series, beginning in February 2010. The trend panel confirms no meaningful trend is present in the dataset.
components(atm1_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The seasonally adjusted trendline above confirms that, despite a few outliers through December 2009, the seasonal component captures the weekly nature of the data well until February 2010, after which it no longer describes the dataset properly. Based on the plot, a new weekly pattern appears to emerge in February 2010.
components(atm1_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-16" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The above plot shows that the seasonal adjustment does not account for the weekly pattern over the final months of the ATM1 dataset. Clearly, a shift in the weekly seasonal pattern has occurred.
atm1_data_ts %>% ACF() %>% autoplot()
The ACF of the ATM1 data shows a clear correlation at lags 7, 14, and 21, which is expected for seasonal data following a weekly pattern. High negative correlations are present at lags 2 and 5, which appear to follow the same 7-lag pattern. I have a hunch these correlations are due to the change in the weekly pattern.
atm1_data_ts %>% PACF() %>% autoplot()
The PACF plot re-confirms the correlations at lags 7 and 14. Interestingly, correlation also appears at lags 2 and 5.
atm1_data_ts %>% gg_season(Cash, period = "week") +
labs(y="$USD (in hundreds)", title="ATM Withdrawals")
To better understand the weekly seasonal pattern, the gg_season plot above does show some sort of shift on Tuesday and Thursday.
atm1_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$USD (in hundreds)",
title = "ATM Withdrawals"
)
The gg_subseries plot confirms the shift in February occurs on Tuesday, Wednesday, and Thursday. The other days of the week indicate some variance and outliers, but nothing as dramatic as Tuesday through Thursday.
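To isolate the new pattern, the same seasonal plot can be restricted to the post-shift window (a sketch; the 2010-02-16 cutoff mirrors the one used for modelling below):
# Weekly seasonal plot using only the observations after the apparent mid-February shift
atm1_data_ts %>%
  filter_index("2010-02-16" ~ "2010-04-30") %>%
  gg_season(Cash, period = "week") +
  labs(y = "$USD (in hundreds)", title = "ATM1 Withdrawals after mid-February 2010")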
Before forecasting the data for May 2010, I want to train and test the different model functions for proper evaluation. The models are trained on 292 days, or approximately 80% of the year represented.
# Create training set (assume 1 year of data)
# 292 days for training, approx. 80% of data
train_atm1 <- atm1_data_ts %>%
filter_index("2009-05-01" ~ "2010-02-17")
#train_atm1
# Fit the models
fit_atm1 <- train_atm1 %>%
model(
Naive = NAIVE(Cash),
`Seasonal naive` = SNAIVE(Cash),
`Random walk` = RW(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
fit_atm1 %>% accuracy()
## # A tibble: 7 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Naive Training 0.284 49.9 37.4 -85.0 120. 2.07 1.77 -0.356
## 2 ATM1 Seasonal naive Training 0.203 28.1 18.0 -83.3 103. 1 1 0.0957
## 3 ATM1 Random walk Training 0.284 49.9 37.4 -85.0 120. 2.07 1.77 -0.356
## 4 ATM1 Arima Training 0.179 21.9 14.3 -66.5 82.7 0.790 0.780 0.00341
## 5 ATM1 ETS_Add Training 0.175 21.4 13.7 -73.3 89.7 0.761 0.763 0.0940
## 6 ATM1 ETS_Mult Training 1.98 21.6 13.7 -75.8 90.6 0.760 0.769 0.0956
## 7 ATM1 ETS_Damp Training 1.46 21.3 13.3 -76.4 90.4 0.737 0.759 0.0756
Using the different model functions from the Hyndman textbook, the best performing model based on RMSE on the training set is the ETS model with dampening. Overall, the three ETS models and the ARIMA model all perform well in comparison. The seasonal naive model also performs well with an RMSE slightly higher than the ARIMA model and the three ETS models.
# Generate forecasts for 72 days
fc_atm1 <- fit_atm1 %>% forecast(h = "72 days")
fc_atm1 %>% accuracy(atm1_data_ts)
## # A tibble: 7 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM1 Test -2.20 50.8 37.8 -466. 502. 2.09 1.81 -0.0565
## 2 ETS_Add ATM1 Test -3.25 52.4 38.0 -479. 514. 2.11 1.86 0.00348
## 3 ETS_Damp ATM1 Test -10.4 54.2 40.0 -517. 547. 2.22 1.93 -0.0246
## 4 ETS_Mult ATM1 Test -7.03 52.9 39.2 -494. 527. 2.17 1.88 -0.00808
## 5 Naive ATM1 Test -98.2 104. 98.2 -893. 893. 5.44 3.71 -0.0585
## 6 Random walk ATM1 Test -98.2 104. 98.2 -893. 893. 5.44 3.71 -0.0585
## 7 Seasonal naive ATM1 Test -5.21 64.1 53.7 -460. 509. 2.97 2.28 0.00878
The accuracy of the models on the test dataset shows ARIMA performing the best with an RMSE of 50.77. The three ETS models do perform well also.
# Plot forecasts against actual values
fc_atm1 %>%
autoplot(filter_index(train_atm1, "2010-02-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(atm1_data_ts, "2010-02-10" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Plotting the seven forecasts along with the original dataset shows the seasonal pattern of the forecasts does not match the seasonal pattern of the actual dataset. Based on the observations in the decomposition section, the weekly pattern has changed, and thus training the model on the first 80% of the data actually misses the new pattern found in the data. Further evaluation is needed.
fit_atm1_arima <- train_atm1 %>%
model(ARIMA(Cash))
# Check the residuals of Arima model
fit_atm1_arima %>% gg_tsresiduals()
To further confirm the suspicion of a poor model, I've taken the ARIMA model, the best performing based on RMSE, and displayed its residuals above. The ACF plot of the residuals clearly indicates correlation at lags 8 and 20, which suggests the model isn't quite capturing the correlation structure of the data properly.
Given the decomposition and the model evaluations indicate a change in seasonal pattern sometime in February 2010, I will attempt to train and test on just the data beginning on 2010-02-16. Yes, this certainly reduces the amount of data used to build the model, but I believe may better capture the change in the weekly pattern and thus provide better forecasts for the month of May 2010.
# ATM1 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm1 <- atm1_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-30")
train_atm1 <- atm1_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-14")
# Fit the models
fit_atm1 <- train_atm1 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm1)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM1 Seaso… 575. NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM1 Arima 311. -219. 447. 448. 455. <cpl [9… <cpl [0… NA NA
## 3 ATM1 ETS_A… 327. -281. 582. 586. 602. <NULL> <NULL> 276. 330.
## 4 ATM1 ETS_M… 0.155 -308. 636. 641. 657. <NULL> <NULL> 1046. 1437.
## 5 ATM1 ETS_D… 0.405 -329. 683. 692. 710. <NULL> <NULL> 27524. 41523.
## # … with 1 more variable: MAE <dbl>
Now using only the five models that handle the seasonal nature of the data, the ARIMA model performs best of the four that report an AICc, with a value of 447.73. Note: the Seasonal Naive model does not produce an AICc.
fit_atm1 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Seasonal naive Trai… -5.20 24.3 15.9 -214. 230. 1 1 0.444
## 2 ATM1 Arima Trai… 0.138 16.0 11.9 -97.6 143. 0.753 0.660 -0.127
## 3 ATM1 ETS_Add Trai… -1.19 16.6 11.7 -56.5 68.7 0.739 0.684 0.0722
## 4 ATM1 ETS_Mult Trai… -2.40 32.3 20.8 -125. 145. 1.31 1.33 0.480
## 5 ATM1 ETS_Damp Trai… -10.4 166. 61.7 -13.8 68.2 3.89 6.82 0.412
The ARIMA model performs the best on the training set with an RMSE of 16.05, while the ETS Additive performs almost as well as the ARIMA model. The Seasonal Naive model performs third best and the ETS with dampening is clearly not doing well compared to the other four models.
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm1 <- fit_atm1 %>% forecast(h = "16 days")
fc_atm1 %>% accuracy(new_seasonal_atm1)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM1 Test 7.71 12.8 9.50 40.9 43.2 0.599 0.527 0.0154
## 2 ETS_Add ATM1 Test 1.06 11.8 8.88 5.50 15.3 0.560 0.486 -0.337
## 3 ETS_Damp ATM1 Test 46.6 54.0 46.6 57.3 58.4 2.94 2.22 0.0345
## 4 ETS_Mult ATM1 Test 9.17 15.1 11.0 -2.95 26.2 0.694 0.621 -0.161
## 5 Seasonal naive ATM1 Test 0.125 9.64 7.12 1.26 9.83 0.449 0.397 -0.00791
For the accuracy of the five models on the test set, the Seasonal Naive model actually performs the best with an RMSE of 9.64, while the ARIMA and ETS Additive models also perform well.
# Plot forecasts against actual values
fc_atm1 %>%
autoplot(filter_index(train_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The plot of the test forecasts confirms the models are correctly capturing the seasonal nature of the data near the end of the dataset. Unfortunately, the ARIMA model is forecasting negative values, which of course is not possible.
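A quick check (a sketch) of which forecast days, if any, the ARIMA model sends below zero:
# List the ARIMA point forecasts that fall below zero
fc_atm1 %>%
  filter(.model == "Arima", .mean < 0)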
fit_atm1_arima<- train_atm1 %>%
model(ARIMA(Cash))
report(fit_atm1_arima)
## Series: Cash
## Model: ARIMA(2,0,0)(1,1,0)[7]
##
## Coefficients:
## ar1 ar2 sar1
## 0.5923 0.3165 -0.6292
## s.e. 0.1520 0.1665 0.1431
##
## sigma^2 estimated as 311.1: log likelihood=-219.43
## AIC=446.86 AICc=447.73 BIC=454.59
The definition of the ARIMA model is shown above.
# Check the residuals of ARIMA
fit_atm1_arima %>% gg_tsresiduals()
As above, I’ve displayed the residuals of the ARIMA model to assess the appropriateness of the model. The plot of Innovation residuals appears to follow white noise after the first 10 observations or so. The ACF plot indicates no correlation outside of the confidence intervals.
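As a complementary check on the white-noise claim, a Ljung-Box portmanteau test can be applied to the innovation residuals (a sketch; the lag of 14, covering two weekly cycles, is an illustrative choice):
# Ljung-Box test on the ARIMA innovation residuals; a large p-value is consistent with white noise
fit_atm1_arima %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)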
The dataset did not appear to show increasing variance over the time series, but I wanted to try a Box-Cox transformation to see if a better model could be generated.
# Consider Box-Cox
lambda <- new_seasonal_atm1 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
new_seasonal_atm1 %>%
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM1 withdrawals with $\\lambda$ = ",
round(lambda,2))))
I apply the Box-Cox transformation to see if I can squeeze out slightly better performance on the test data. The resulting lambda value is 0.93.
# Forecast for May 2010
fit_atm1_bc <- train_atm1 %>%
model(
`Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
Arima = ARIMA(box_cox(Cash, lambda)),
ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A"))
)
report(fit_atm1_bc)
## # A tibble: 3 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM1 Seasonal… 323. NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM1 Arima 175. -206. 417. 418. 423. <cpl [1… <cpl [7… NA NA
## 3 ATM1 ETS_Add 182. -264. 548. 552. 568. <NULL> <NULL> 154. 185.
## # … with 1 more variable: MAE <dbl>
The AICc of the ARIMA model with Box-Cox transformation has improved from 447.73 to 417.47. A promising sign.
fit_atm1_bc %>% accuracy()
## # A tibble: 3 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 Seasonal naive Training -5.20 24.3 15.9 -214. 230. 1 1 0.444
## 2 ATM1 Arima Training -2.56 16.1 12.3 -105. 127. 0.773 0.661 -0.313
## 3 ATM1 ETS_Add Training -1.11 16.6 12.0 -66.2 77.5 0.755 0.682 0.0836
The accuracy of the models with the Box-Cox transformation shows little to no improvement over the untransformed models on the training set. The RMSE of the ARIMA model with the transformation actually increases slightly, from 16.05 to 16.06, as noted above.
# Generate forecasts for the next 16 days
fc_atm1_bc <- fit_atm1_bc %>% forecast(h = "16 days")
fc_atm1_bc %>% accuracy(new_seasonal_atm1)
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM1 Test -6.25 11.5 9.35 -32.6 36.2 0.589 0.474 -0.360
## 2 ETS_Add ATM1 Test -0.247 11.8 9.24 -24.8 35.4 0.582 0.487 -0.398
## 3 Seasonal naive ATM1 Test -1.05 9.87 7.88 -21.8 29.4 0.497 0.406 -0.0900
The Box-Cox transformation does improve the ARIMA model's RMSE on the test data, from 12.81 to 11.54, while the Seasonal Naive RMSE increases with the transformation, from 9.64 to 9.87. Across all results, with and without transformation, the Seasonal Naive model without transformation performs the best on the test dataset.
# Plot forecasts against actual values
fc_atm1_bc %>%
autoplot(filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm1, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot shows the forecasts of the transformed models along with the actual data. Similar to the plot of the models without transformation, the forecasts follow the seasonal pattern of the data.
Given the assignment is to forecast May 2010 from the provided data, I believe building the models on the final two and a half months is the appropriate approach. A clear shift in the weekly pattern occurs in February 2010, and without any additional context or history, I assume that new pattern will persist through the roughly four weeks of May 2010.
Based on the results of the seasonal models with and without the Box-Cox transformation, I've chosen the Seasonal Naive model without transformation to forecast the May 2010 values. The model performed on par with, if not better than, the more complex models, and did not produce any negative values as the ARIMA model did. In the interest of simplicity and good results, the Seasonal Naive model is selected.
# Generate forecasts for 16 days (two weeks) plus the 31 days of May
fc_atm1_final <- fit_atm1 %>% forecast(h = "47 days")
# Plot forecasts against actual values
fc_atm1_final %>%
autoplot(filter_index(train_atm1, "2010-02-16" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm1, "2010-02-16" ~ "2010-05-31"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Forecasts of May 2010 for ATM1.
Prepare forecast fable for input into Excel file for ATM1.
# dataframe for Excel output
atm1_forecast_may2010 <- fc_atm1_final %>%
filter(.model == "Seasonal naive", DATE > "2010-04-30")
atm1_forecast_may2010 <- data.frame(atm1_forecast_may2010) %>%
select(c(DATE, ATM, .mean))%>%
rename(Cash = .mean)
atm1_forecast_may2010
## DATE ATM Cash
## 1 2010-05-01 ATM1 87
## 2 2010-05-02 ATM1 92
## 3 2010-05-03 ATM1 63
## 4 2010-05-04 ATM1 3
## 5 2010-05-05 ATM1 103
## 6 2010-05-06 ATM1 79
## 7 2010-05-07 ATM1 90
## 8 2010-05-08 ATM1 87
## 9 2010-05-09 ATM1 92
## 10 2010-05-10 ATM1 63
## 11 2010-05-11 ATM1 3
## 12 2010-05-12 ATM1 103
## 13 2010-05-13 ATM1 79
## 14 2010-05-14 ATM1 90
## 15 2010-05-15 ATM1 87
## 16 2010-05-16 ATM1 92
## 17 2010-05-17 ATM1 63
## 18 2010-05-18 ATM1 3
## 19 2010-05-19 ATM1 103
## 20 2010-05-20 ATM1 79
## 21 2010-05-21 ATM1 90
## 22 2010-05-22 ATM1 87
## 23 2010-05-23 ATM1 92
## 24 2010-05-24 ATM1 63
## 25 2010-05-25 ATM1 3
## 26 2010-05-26 ATM1 103
## 27 2010-05-27 ATM1 79
## 28 2010-05-28 ATM1 90
## 29 2010-05-29 ATM1 87
## 30 2010-05-30 ATM1 92
## 31 2010-05-31 ATM1 63
ATM1 final forecast values.
Now, for the evaluation and forecasting of the ATM2 time series.
atm2_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM2')
summary(atm2_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 25.50
## Median :2009-10-30 Mode :character Median : 67.00
## Mean :2009-10-30 Mean : 62.58
## 3rd Qu.:2010-01-29 3rd Qu.: 93.00
## Max. :2010-04-30 Max. :147.00
## NA's :2
The ATM2 time series, just as ATM1, includes the dates 2009-05-01 through 2010-04-30, a full year of 365 days. The Cash column indicates two missing values along with a range from 0 to 147.
atm2_data_ts %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM2 Withdrawals")
The initial plot of the ATM2 dataset indicates a weekly seasonal pattern, as expected. The range identified above seems appropriate with no clear outliers. With two values missing, imputation to replace the values is recommended. A slightly decreasing trend does appear across the time series, but no identifiable cyclic component appears relevant.
atm2_data_ts$DATE[is.na(atm2_data_ts$Cash)]
## [1] "2009-06-18" "2009-06-24"
The two missing values also occur in June 2009, just as the missing values of ATM1 occurred.
# Calculate median value for ATM2
median <- median(atm2_data_ts$Cash, na.rm=TRUE)
# Set NAs to median
atm2_data_ts$Cash[is.na(atm2_data_ts$Cash)] <- median
atm2_data_ts %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM2 Withdrawals with Imputed Data")
Given the small volume of missing data, I’ve chosen to impute the two missing values with the median value of the full ATM2 dataset. The plot of the dataset with imputation still appears to contain a weekly seasonal pattern with very slight decreasing trend.
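As an aside, a model-based alternative (not used here) would let a fitted ARIMA fill the gaps instead of a single global median; a minimal sketch using fabletools' interpolate(), where atm2_raw and atm2_interp are illustrative names:
# Sketch of model-based imputation: fit ARIMA on the series with gaps, then interpolate the NAs
# (illustrative only; the median imputation above is what is used going forward)
atm2_raw <- atm_data_ts %>% filter(ATM == 'ATM2')   # this copy still contains the two missing values
atm2_interp <- atm2_raw %>%
  model(ARIMA(Cash)) %>%
  interpolate(atm2_raw)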
atm2_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic"))
) %>%
components() %>% autoplot()
Similar to ATM1, a shift in the seasonal nature of the data appears in the final few months. As applied on the ATM1 dataset, I plan to train and test the models on the final section of the dataset in which the weekly seasonal nature shifts.
atm2_data_ts %>% ACF() %>% autoplot()
As expected, the ACF plot for the ATM2 dataset shows strong correlation at lags 7, 14, and 20. It is interesting to find high negative correlations as well. I believe these correlations also reflect the weekly pattern in the data, offset by the shift in February 2010.
atm2_data_ts %>% PACF() %>% autoplot()
The PACF plot for ATM2 also displays a high correlation at lags 7 and 14. The high negative correlations occur at lags 2 and 5. Again, I believe these lags are due to the change in the weekly seasonal pattern.
atm2_data_ts %>% gg_season(Cash, period = "week") +
labs(y="$ (in hundreds)", title="ATM2 Withdrawals")
The gg_season plot highlights the shift in the weekly pattern between Tuesday and Thursday, similar to that of ATM1.
atm2_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$ (in hundreds)",
title = "ATM2 Withdrawals"
)
The gg_subseries plot also re-confirms the change in the weekly pattern. The adjustment appears on almost every day of the week except Saturday.
atm2_dcmp <- atm2_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
components(atm2_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
The seasonally adjusted trendline again confirms a shift in the weekly pattern starting in February 2010.
components(atm2_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-16" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
Focusing the seasonally adjusted trendline on the final few months, the blue trendline follows the actual dataset, indicating the seasonal adjustment is not properly accounting for this seasonal pattern.
Given the decomposition and the model evaluations indicate a change in seasonal pattern sometime in February 2010, I will attempt to train and test on just the data beginning on 2010-02-16. As mentioned in the section for ATM1, this approach certainly reduces the amount of data used to build the model, but I believe may better capture the change in the weekly pattern and thus provide better forecasts for the month of May 2010.
# ATM2 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm2 <- atm2_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-30")
train_atm2 <- atm2_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-14")
# Fit the models
fit_atm2 <- train_atm2 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm2)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM2 Seaso… 588. NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM2 Arima 364. -224. 452. 452. 456. <cpl [7… <cpl [0… NA NA
## 3 ATM2 ETS_A… 358. -283. 587. 591. 607. <NULL> <NULL> 302. 276.
## 4 ATM2 ETS_M… 0.408 -297. 614. 619. 634. <NULL> <NULL> 4367. 4182.
## 5 ATM2 ETS_D… 6.22 -368. 762. 770. 789. <NULL> <NULL> 56573. 53446.
## # … with 1 more variable: MAE <dbl>
Focusing on the five models that account for a seasonal pattern, the ARIMA model again performs best of the four that report an AICc, attaining the lowest value (approximately 452 per the table above).
fit_atm2 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 Seasonal naive Trai… -1.25 24.0 16.2 -Inf Inf 1 1 -0.191
## 2 ATM2 Arima Trai… -1.46 17.7 11.5 -Inf Inf 0.712 0.737 -0.0891
## 3 ATM2 ETS_Add Trai… -0.724 17.4 12.0 -Inf Inf 0.741 0.723 -0.130
## 4 ATM2 ETS_Mult Trai… -13.1 66.7 31.0 -65.0 90.3 1.92 2.77 0.467
## 5 ATM2 ETS_Damp Trai… 6.79 240. 121. -32.6 217. 7.48 9.98 0.488
The ETS Additive model performs the best on the training set with an RMSE of 17.39, while the ARIMA performs almost as well as the ETS Additive model. The Seasonal Naive model performs third best and again the ETS with dampening is clearly not doing well compared to the other four models.
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm2 <- fit_atm2 %>% forecast(h = "16 days")
fc_atm2 %>% accuracy(new_seasonal_atm2)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM2 Test 4.57 16.1 12.5 0.971 24.5 0.775 0.670 -0.0923
## 2 ETS_Add ATM2 Test 6.59 17.8 12.7 -35.9 70.8 0.784 0.741 -0.00583
## 3 ETS_Damp ATM2 Test -17.6 32.7 23.0 -29.1 36.5 1.42 1.36 0.478
## 4 ETS_Mult ATM2 Test -1.33 12.4 9.87 -1.36 20.5 0.610 0.517 -0.108
## 5 Seasonal naive ATM2 Test 4.06 19.7 14.3 2.45 23.3 0.884 0.820 -0.244
For the accuracy of the five models on the test set, the ETS Multiplicative model performs the best with an RMSE of 12.44, while ARIMA, ETS Additive and Seasonal Naive, also perform well.
# Plot forecasts against actual values
fc_atm2 %>%
autoplot(filter_index(train_atm2, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm2, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM2 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The plot of the test forecasts confirms the models are correctly capturing the seasonal nature of the data near the end of the dataset.
fit_atm2_ets <- train_atm2 %>%
model(ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")))
report(fit_atm2_ets)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000053
## gamma = 0.000100022
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 56.74485 -52.03051 26.58281 12.06053 -13.65817 31.39567 45.81194 -50.16227
##
## sigma^2: 357.7803
##
## AIC AICc BIC
## 586.7609 591.4417 607.3653
The resulting ETS Additive model specification is shown above.
# Check the residuals of the ETS Additive model
fit_atm2_ets %>% gg_tsresiduals()
As with the ARIMA model for ATM1, I've displayed the residuals of the ETS Additive model to assess its appropriateness. The innovation residuals appear to follow white noise. The ACF plot shows correlation only at lag 13; otherwise, no correlations fall outside the confidence bounds.
After considering the RMSE of the ETS Additive model on both the training and test datasets, along with a visual inspection of the forecast values through May 2010, I've decided to use the forecasts from the ETS Additive model. The ETS Multiplicative model did have the best RMSE on the test dataset, but visual inspection of its forecasts shows a growing variance I don't expect in the series over the next month. The Seasonal Naive, ARIMA, and ETS Additive models each show promising results on the test set and follow the seasonal pattern on visual assessment, so I've selected the ETS Additive model based on its accuracy on both the training and test datasets.
# Generate forecasts for 16 days (two weeks) plus the 31 days of May
fc_atm2_final <- fit_atm2 %>% forecast(h = "47 days")
# Plot forecasts against actual values
fc_atm2_final %>%
autoplot(filter_index(train_atm2, "2010-02-16" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm2, "2010-02-16" ~ "2010-05-31"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM2 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Forecasts of May 2010 for ATM2 models.
# dataframe for Excel output
atm2_forecast_may2010 <- fc_atm2_final %>%
filter(.model == "ETS_Add", DATE > "2010-04-30")
atm2_forecast_may2010 <- data.frame(atm2_forecast_may2010) %>%
select(c(DATE, ATM, .mean))%>%
rename(Cash = .mean)
atm2_forecast_may2010
## DATE ATM Cash
## 1 2010-05-01 ATM2 68.799631
## 2 2010-05-02 ATM2 83.321692
## 3 2010-05-03 ATM2 4.712864
## 4 2010-05-04 ATM2 6.579454
## 5 2010-05-05 ATM2 102.551287
## 6 2010-05-06 ATM2 88.134400
## 7 2010-05-07 ATM2 43.081012
## 8 2010-05-08 ATM2 68.799631
## 9 2010-05-09 ATM2 83.321692
## 10 2010-05-10 ATM2 4.712864
## 11 2010-05-11 ATM2 6.579454
## 12 2010-05-12 ATM2 102.551287
## 13 2010-05-13 ATM2 88.134400
## 14 2010-05-14 ATM2 43.081012
## 15 2010-05-15 ATM2 68.799631
## 16 2010-05-16 ATM2 83.321692
## 17 2010-05-17 ATM2 4.712864
## 18 2010-05-18 ATM2 6.579454
## 19 2010-05-19 ATM2 102.551287
## 20 2010-05-20 ATM2 88.134400
## 21 2010-05-21 ATM2 43.081012
## 22 2010-05-22 ATM2 68.799631
## 23 2010-05-23 ATM2 83.321692
## 24 2010-05-24 ATM2 4.712864
## 25 2010-05-25 ATM2 6.579454
## 26 2010-05-26 ATM2 102.551287
## 27 2010-05-27 ATM2 88.134400
## 28 2010-05-28 ATM2 43.081012
## 29 2010-05-29 ATM2 68.799631
## 30 2010-05-30 ATM2 83.321692
## 31 2010-05-31 ATM2 4.712864
ATM2 final forecast values.
Now, let’s take a look at the dataset for ATM3.
atm3_data_ts_all <- atm_data_ts %>%
filter(ATM == 'ATM3')
summary(atm3_data_ts_all)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.0000
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.0000
## Median :2009-10-30 Mode :character Median : 0.0000
## Mean :2009-10-30 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :96.0000
From the summary output, the date range (2009-05-01 to 2010-04-30) matches expectations given the data from ATM1 and ATM2. The Cash column does not contain any missing data but does raise some curiosities: the range is 0 to 96, the median is 0, and the mean is quite low at less than 1.
atm3_data_ts_all %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")
A simple plot of the data clearly shows unexpected information. Only a few days at the end of the dataset contain anything above zero.
atm3_data_ts <- atm3_data_ts_all %>%
filter(ATM == 'ATM3', Cash > 0)
atm3_data_ts
## # A tsibble: 3 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 96
## 2 2010-04-29 ATM3 82
## 3 2010-04-30 ATM3 85
By removing every date with a value equal to 0, only three dates remain - the final three dates of the dataset.
summary(atm3_data_ts)
## DATE ATM Cash
## Min. :2010-04-28 Length:3 Min. :82.00
## 1st Qu.:2010-04-28 Class :character 1st Qu.:83.50
## Median :2010-04-29 Mode :character Median :85.00
## Mean :2010-04-29 Mean :87.67
## 3rd Qu.:2010-04-29 3rd Qu.:90.50
## Max. :2010-04-30 Max. :96.00
For those three observations, the range is 82 to 96 with a mean of 87.67.
atm3_data_ts %>% autoplot(Cash) +
labs(y = "Cash (in hundreds $USD)", title = "ATM3 Withdrawals")
The plot confirms that only three dates contain a Cash amount greater than zero.
A decision must be made on how to approach this dataset. Of 365 days, 362 contain a Cash amount of zero, and only the final three dates contain Cash amounts meeting expectations. Without history or context for ATM3, I will assume that ATM3 was broken or otherwise not in use for every day with a Cash amount of zero. It follows from this assumption that ATM3 was fixed or put into service on 2010-04-28, and can therefore be assumed working and active for the month of May 2010.
With so little data to build a model, I will choose the MEAN function based on the three dates containing Cash amounts above zero to make the forecast for May 2010.
fit_atm3 <- atm3_data_ts %>% model(MEAN(Cash))
report(fit_atm3)
## Series: Cash
## Model: MEAN
##
## Mean: 87.6667
## sigma^2: 54.3333
The model returns an expected value of 87.67 as the mean, which was indicated in the summary of the data.
# Forecast for 31 days of May 2010
fc_atm3 <- fit_atm3 %>% forecast(h = "31 days")
fc_atm3 %>%
autoplot(filter_index(atm3_data_ts_all, "2009-05-01" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(atm3_data_ts_all, "2010-04-28" ~ "2010-05-31"),
colour = "black"
) +
labs(y="Cash (in hundreds $USD)", title="ATM3 Withdrawals with Forecast") +
guides(colour = guide_legend(title = "Forecast"))
Not a very interesting plot of the forecasted data in relation to the ATM3 dataset, but given the minimal amount of data, the mean forecast is a constant line for the month of May 2010.
With only three values above zero, I'm assuming the zero days reflect no withdrawal activity while ATM3 still existed; otherwise those days would have been recorded as NA. The Mean model is selected as the only viable model for three observations.
# dataframe for Excel output
atm3_forecast_may2010 <- fc_atm3
atm3_forecast_may2010 <- data.frame(atm3_forecast_may2010) %>%
select(c(DATE, ATM, .mean))%>%
rename(Cash = .mean)
atm3_forecast_may2010
## DATE ATM Cash
## 1 2010-05-01 ATM3 87.66667
## 2 2010-05-02 ATM3 87.66667
## 3 2010-05-03 ATM3 87.66667
## 4 2010-05-04 ATM3 87.66667
## 5 2010-05-05 ATM3 87.66667
## 6 2010-05-06 ATM3 87.66667
## 7 2010-05-07 ATM3 87.66667
## 8 2010-05-08 ATM3 87.66667
## 9 2010-05-09 ATM3 87.66667
## 10 2010-05-10 ATM3 87.66667
## 11 2010-05-11 ATM3 87.66667
## 12 2010-05-12 ATM3 87.66667
## 13 2010-05-13 ATM3 87.66667
## 14 2010-05-14 ATM3 87.66667
## 15 2010-05-15 ATM3 87.66667
## 16 2010-05-16 ATM3 87.66667
## 17 2010-05-17 ATM3 87.66667
## 18 2010-05-18 ATM3 87.66667
## 19 2010-05-19 ATM3 87.66667
## 20 2010-05-20 ATM3 87.66667
## 21 2010-05-21 ATM3 87.66667
## 22 2010-05-22 ATM3 87.66667
## 23 2010-05-23 ATM3 87.66667
## 24 2010-05-24 ATM3 87.66667
## 25 2010-05-25 ATM3 87.66667
## 26 2010-05-26 ATM3 87.66667
## 27 2010-05-27 ATM3 87.66667
## 28 2010-05-28 ATM3 87.66667
## 29 2010-05-29 ATM3 87.66667
## 30 2010-05-30 ATM3 87.66667
## 31 2010-05-31 ATM3 87.66667
ATM3 final forecast values.
Finally, the evaluation and forecast model for the ATM4 dataset.
atm4_data_ts <- atm_data_ts %>%
filter(ATM == 'ATM4')
summary(atm4_data_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.563
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 124.334
## Median :2009-10-30 Mode :character Median : 403.839
## Mean :2009-10-30 Mean : 474.043
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :10919.762
As expected, the summary for the ATM4 dataset covers the dates 2009-05-01 through 2010-04-30. The Cash column indicates no missing values, with a range of 1.563 to 10919.762 and a mean of 474.043. The minimum is certainly low, but thankfully not zero as seen in ATM3. The maximum is quite high compared to the mean and the third quartile, so one or more outliers may exist in the ATM4 data.
atm4_data_ts %>% autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")
The plot of the ATM4 data shows a clear outlier in February 2010.
atm4_data_ts %>%
filter(Cash > 3000)
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
The day of the outlier turns out to be 2010-02-09.
atm4_dcmp <- atm4_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
Given the weekly seasonal pattern worked for ATM1 and ATM2, let's try decomposition on ATM4. As seen in the plot above, the outlier makes the components nearly impossible to evaluate.
Let's remove the outlier by imputing the median value in its place.
# Calculate median value for ATM4
# Straightforward approach to impute data
median <- median(atm4_data_ts$Cash, na.rm=TRUE)
# Get rid of outlier from ATM4
atm4_data_ts$Cash[atm4_data_ts$Cash > 3000] <- median
atm4_data_ts %>% autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals with Imputed Value")
By removing the outlier, the data does appear more approachable. Overall, I don’t detect a clear trend, perhaps a slight decrease. Given the data represents a single year, I don’t expect to find a cyclic component, but perhaps a weekly seasonal component exists. Let’s try the decomposition again, now excluding the outlier.
atm4_dcmp <- atm4_data_ts %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
The remainder component plays a larger role than the seasonal component. As identified in ATM1 and ATM2, the trend component has minimal impact. Let's try the seasonally adjusted plot.
components(atm4_dcmp) %>%
as_tsibble() %>%
filter_index("2009-05-01" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The seasonally-adjusted trendline plot above does not convey any clear weekly pattern.
atm4_data_ts %>% ACF() %>% autoplot()
The ACF plot indicates correlation at lags 7, 14, and 21 as expected. Also, correlation appears at lag 10.
atm4_data_ts %>% PACF() %>% autoplot()
The PACF shows correlation at lags 7, 10, and 19. The negative correlations at lags 10 and 19 suggest one or two secondary patterns exist in the dataset.
atm4_data_ts %>% gg_season(Cash, period = "week") +
labs(y="$ (in hundreds)", title="ATM4 Withdrawals")
I’ll admit the above plot of gg_season does not provide much value except perhaps for some nice colors.
atm4_data_ts %>%
gg_subseries(period = "week") +
labs(
y = "$ (in hundreds)",
title = "ATM4 Withdrawals"
)
The gg_subseries plot above does indicate a change in pattern on Tuesday, Wednesday, and Thursday, similar to that found in ATM1 and ATM2, but not quite as drastic and clear.
atm4_data_post0209 <- atm4_data_ts %>%
filter_index("2010-02-10" ~ "2010-04-30")
atm4_data_post0209 %>%
autoplot(Cash) +
labs(y="Cash (in hundreds $USD)", title="ATM4 Withdrawals")
To better understand the final few months of the ATM4 dataset, the above plot shows perhaps a weekly seasonal pattern does exist but just not as clear as found in ATM1 and ATM2.
atm4_dcmp <- atm4_data_post0209 %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm4_dcmp %>% components() %>% autoplot()
The decomposition of the final few months of ATM4 does indicate a semblance of weekly pattern, but the remainder values still provide the largest impact compared to the seasonal or trend components.
components(atm4_dcmp) %>%
as_tsibble() %>%
filter_index("2010-02-10" ~ "2010-04-30") %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds $USD)", title = "Seasonally Adjusted Trendline")
The seasonally adjusted trendline is still too irregular for my comfort. Unfortunately, the assessment doesn’t provide a clear indication of the patterns found in the final few months. Given the goal is to provide a forecast for the month of May 2010, I plan to try the models with seasonality and evaluate the results.
Given I’m struggling to find a viable weekly seasonal pattern through visual inspection, I will apply the different seasonal models through the train and test approach across the entire dataset.
# Create training set (1 year of data)
# 292 days for training, approx. 80% of data
train_atm4 <- atm4_data_ts %>%
filter_index("2009-05-01" ~ "2010-02-17")
# Fit the models
fit_atm4 <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm4)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 ATM4 Seasonal … 2.09e+5 NA NA NA NA <NULL> <NULL> NA
## 2 ATM4 Arima 1.25e+5 -2134. 4274. 4274. 4285. <cpl [7… <cpl [0… NA
## 3 ATM4 ETS_Add 1.05e+5 -2521. 5063. 5063. 5099. <NULL> <NULL> 101739.
## 4 ATM4 ETS_Mult 5.14e-1 -2469. 4958. 4959. 4995. <NULL> <NULL> 103349.
## 5 ATM4 ETS_Damp 4.88e-1 -2478. 4981. 4982. 5029. <NULL> <NULL> 106682.
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
The ARIMA model generates the lowest AICc, approximately 4274, compared to the ETS models.
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal naive Trai… 2.11 456. 347. -366. 422. 1 1 0.0620
## 2 ATM4 Arima Trai… 0.0482 352. 295. -501. 535. 0.850 0.772 0.0819
## 3 ATM4 ETS_Add Trai… -9.31 319. 249. -421. 450. 0.717 0.699 0.110
## 4 ATM4 ETS_Mult Trai… 11.5 321. 251. -412. 444. 0.723 0.705 0.108
## 5 ATM4 ETS_Damp Trai… -18.3 327. 257. -434. 463. 0.741 0.716 0.0931
According to the RMSE, the ETS Additive model produces the most accurate model against the training data at 318.97.
# Generate forecasts for 72 days
fc_atm4 <- fit_atm4 %>% forecast(h = "72 days")
fc_atm4 %>% accuracy(atm4_data_ts)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM4 Test -19.2 319. 266. -777. 808. 0.768 0.699 -0.0357
## 2 ETS_Add ATM4 Test -20.7 386. 319. -918. 960. 0.919 0.847 0.0325
## 3 ETS_Damp ATM4 Test -34.1 400. 330. -1017. 1059. 0.951 0.878 0.00763
## 4 ETS_Mult ATM4 Test -5.37 387. 321. -913. 958. 0.925 0.849 0.00836
## 5 Seasonal naive ATM4 Test -101. 520. 417. -1484. 1530. 1.20 1.14 -0.0989
Now forecasting the models on the test dataset, the ARIMA model performs the best with an RMSE of 318.87.
# Plot forecasts against actual values
fc_atm4 %>%
autoplot(filter_index(train_atm4, "2010-02-15" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(atm4_data_ts, "2010-02-15" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot, focused on the final two and a half months of the data, indicates the models are not following the weekly seasonal pattern. Let me try the truncated training and test approach used for ATM1 and ATM2 to better capture the weekly pattern from February through April 2010.
# ATM4 train on 2010-02-16 to 2010-04-30 (end)
# 15 + 31 + 30 = 76 days, 80% is 61 days
new_seasonal_atm4 <- atm4_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-30")
train_atm4 <- atm4_data_ts %>%
filter_index("2010-02-16" ~ "2010-04-14")
# Fit the models
fit_atm4 <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(Cash),
Arima = ARIMA(Cash),
ETS_Add = ETS(Cash ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")),
ETS_Damp = ETS(Cash ~ error("M") + trend("Ad") + season("M"))
)
report(fit_atm4)
## # A tibble: 5 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 ATM4 Seasonal … 2.09e+5 NA NA NA NA <NULL> <NULL> NA
## 2 ATM4 Arima 1.11e+5 -419. 841. 841. 845. <cpl [0… <cpl [0… NA
## 3 ATM4 ETS_Add 9.95e+4 -447. 913. 918. 934. <NULL> <NULL> 8.41e4
## 4 ATM4 ETS_Mult 5.76e-1 -452. 924. 928. 944. <NULL> <NULL> 4.57e5
## 5 ATM4 ETS_Damp 3.74e+0 -517. 1060. 1068. 1087. <NULL> <NULL> 4.09e6
## # … with 2 more variables: AMSE <dbl>, MAE <dbl>
The ARIMA model attains the best AICc value of 841.49, compared to the ETS models.
fit_atm4 %>% accuracy()
## # A tibble: 5 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal … Train… -1.05e+ 1 453. 349. -524. 578. 1 1 0.0769
## 2 ATM4 Arima Train… 1.94e-10 330. 280. -747. 780. 0.803 0.728 0.0556
## 3 ATM4 ETS_Add Train… -3.45e+ 1 290. 228. -204. 236. 0.652 0.640 0.0543
## 4 ATM4 ETS_Mult Train… -1.75e+ 2 676. 391. -301. 328. 1.12 1.49 0.512
## 5 ATM4 ETS_Damp Train… -1.51e+ 1 2022. 1372. -4.63 506. 3.93 4.46 0.803
The ETS Additive model produces the best RMSE of 289.9164 against the training data.
# Generate forecasts for 16 days (two weeks), so we'll see if it picks up the seasonal nature
fc_atm4 <- fit_atm4 %>% forecast(h = "16 days")
fc_atm4 %>% accuracy(new_seasonal_atm4)
## # A tibble: 5 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM4 Test -135. 278. 229. -924. 937. 0.655 0.613 -0.308
## 2 ETS_Add ATM4 Test -147. 274. 213. -636. 645. 0.610 0.605 0.0327
## 3 ETS_Damp ATM4 Test 616. 690. 616. 670. 670. 1.77 1.52 -0.194
## 4 ETS_Mult ATM4 Test -92.1 242. 189. -618. 632. 0.542 0.534 -0.00529
## 5 Seasonal naive ATM4 Test -244. 449. 370. -1178. 1201. 1.06 0.991 0.219
Forecasting the models for the final 16 days of the dataset, the ETS Multiplicative model achieves the best RMSE at 241.88.
# Plot forecasts against actual values
fc_atm4 %>%
autoplot(filter_index(train_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The plot of the forecasts for the five seasonal models shows the ETS Additive and ETS Multiplicative models following the weekly pattern best. The ARIMA model produces an essentially constant, horizontal line, while the ETS model with dampening immediately starts producing negative values, which is not helpful. Even with the small window shown, the Seasonal Naive forecasts also follow the weekly seasonal pattern.
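To see why the ARIMA forecast is flat, the selected specification can be inspected directly (a sketch); the earlier report table showed empty AR and MA root lists for this fit, which is consistent with a near-constant forecast:
# Inspect the automatically selected ARIMA specification for the truncated ATM4 training set
fit_atm4 %>%
  select(Arima) %>%
  report()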
fit_atm4_ets <- train_atm4 %>%
model(ETS_Mult = ETS(Cash ~ error("M") + trend("N") + season("M")))
report(fit_atm4_ets)
## Series: Cash
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.210569
## gamma = 0.0002234196
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 2222.802 0.8511952 1.166145 1.031802 1.261922 1.229669 1.281946 0.177322
##
## sigma^2: 0.5761
##
## AIC AICc BIC
## 923.6732 928.3540 944.2776
The ETS Multiplicative model specification is shown above.
# Check the residuals of the ETS Multiplicative model
fit_atm4_ets %>% gg_tsresiduals()
Given the results of the test forecasts, the ETS Multiplicative model is chosen. Its residuals appear close to white noise, and the ACF shows no significant correlation, a good sign for the model.
In an attempt to improve the model, the Box-Cox transformation is applied along with the seasonal model functions.
# Consider Box-Cox
lambda <- new_seasonal_atm4 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
new_seasonal_atm4 %>%
autoplot(box_cox(Cash, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM4 withdrawals with $\\lambda$ = ",
round(lambda,2))))
The Box-Cox transformation of the final few months of the dataset helps clarify the weekly seasonal pattern with a lambda value of 0.3.
fit_atm4_bc <- train_atm4 %>%
model(
`Seasonal naive` = SNAIVE(box_cox(Cash, lambda)),
Arima = ARIMA(box_cox(Cash, lambda)),
ETS_Add = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
ETS_Mult = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
)
report(fit_atm4_bc)
## # A tibble: 4 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM4 Seasonal… 53.7 NA NA NA NA <NULL> <NULL> NA NA
## 2 ATM4 Arima 23.6 -176. 360. 361. 368. <cpl [1… <cpl [0… NA NA
## 3 ATM4 ETS_Add 25.0 -206. 432. 437. 453. <NULL> <NULL> 21.1 18.5
## 4 ATM4 ETS_Mult 0.135 -216. 451. 456. 472. <NULL> <NULL> 30.5 31.0
## # … with 1 more variable: MAE <dbl>
The ARIMA model with transformation performs the best on the training data with an AICc of 360.74.
fit_atm4_bc %>% accuracy()
## # A tibble: 4 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 Seasonal naive Training -10.5 453. 349. -524. 578. 1 1 0.0769
## 2 ATM4 Arima Training 60.6 293. 239. -218. 257. 0.684 0.646 -0.0113
## 3 ATM4 ETS_Add Training 51.8 298. 218. -126. 158. 0.626 0.658 0.0641
## 4 ATM4 ETS_Mult Training 11.0 370. 283. -310. 348. 0.810 0.816 0.144
The best accuracy on the training set goes to the ARIMA model with an RMSE of 293.02.
# Generate forecasts for the next 16 days
fc_atm4_bc <- fit_atm4_bc %>% forecast(h = "16 days")
fc_atm4_bc %>% accuracy(new_seasonal_atm4)
## # A tibble: 4 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima ATM4 Test -174. 321. 246. -1011. 1025. 0.705 0.708 0.380
## 2 ETS_Add ATM4 Test -207. 341. 275. -737. 747. 0.788 0.752 0.0637
## 3 ETS_Mult ATM4 Test -317. 459. 363. -922. 930. 1.04 1.01 0.401
## 4 Seasonal naive ATM4 Test -616. 823. 705. -2079. 2094. 2.02 1.82 0.433
The forecast of the models for the final 16 days of the dataset shows the ARIMA model with the best-performing RMSE of 320.73.
# Plot forecasts against actual values
fc_atm4_bc %>%
autoplot(filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm4, "2010-04-01" ~ "2010-04-30"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM4 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot of the seasonal models with the Box-Cox transformation clearly shows the forecasts do not outperform those of the models without the transformation.
Overall, the ETS Multiplicative model without transformation performs the best on the final few months of the ATM4 dataset.
# Generate forecasts for 16 days (two weeks) plus the 31 days of May
fc_atm4_final <- fit_atm4 %>% forecast(h = "47 days")
# Plot forecasts against actual values
fc_atm4_final %>%
autoplot(filter_index(train_atm4, "2010-02-16" ~ "2010-05-31"), level = NULL) +
autolayer(
filter_index(new_seasonal_atm4, "2010-02-16" ~ "2010-05-31"),
colour = "black"
) +
labs(
y = "Cash (in hundreds $USD)",
title = "Forecasts for ATM1 Withdrawals"
) +
guides(colour = guide_legend(title = "Forecast"))
Forecasts for the ATM4 models.
# dataframe for Excel output
atm4_forecast_may2010 <- fc_atm4_final %>%
filter(.model == "ETS_Mult", DATE > "2010-04-30")
atm4_forecast_may2010 <- data.frame(atm4_forecast_may2010) %>%
select(c(DATE, ATM, .mean))%>%
rename(Cash = .mean)
atm4_forecast_may2010
## DATE ATM Cash
## 1 2010-05-01 ATM4 431.30801
## 2 2010-05-02 ATM4 487.36884
## 3 2010-05-03 ATM4 355.88127
## 4 2010-05-04 ATM4 74.05705
## 5 2010-05-05 ATM4 536.09275
## 6 2010-05-06 ATM4 514.16819
## 7 2010-05-07 ATM4 527.55000
## 8 2010-05-08 ATM4 431.31970
## 9 2010-05-09 ATM4 487.38205
## 10 2010-05-10 ATM4 355.89092
## 11 2010-05-11 ATM4 74.05906
## 12 2010-05-12 ATM4 536.10727
## 13 2010-05-13 ATM4 514.18212
## 14 2010-05-14 ATM4 527.56430
## 15 2010-05-15 ATM4 431.33139
## 16 2010-05-16 ATM4 487.39526
## 17 2010-05-17 ATM4 355.90056
## 18 2010-05-18 ATM4 74.06107
## 19 2010-05-19 ATM4 536.12181
## 20 2010-05-20 ATM4 514.19606
## 21 2010-05-21 ATM4 527.57860
## 22 2010-05-22 ATM4 431.34308
## 23 2010-05-23 ATM4 487.40847
## 24 2010-05-24 ATM4 355.91021
## 25 2010-05-25 ATM4 74.06308
## 26 2010-05-26 ATM4 536.13634
## 27 2010-05-27 ATM4 514.20999
## 28 2010-05-28 ATM4 527.59290
## 29 2010-05-29 ATM4 431.35477
## 30 2010-05-30 ATM4 487.42168
## 31 2010-05-31 ATM4 355.91985
ATM4 final forecast values.
I really wanted to find some underlying relationship between the four different ATMs that would better explain some of the outliers and anomalies and perhaps provide clarity to the shift in the weekly pattern in February 2010. I totaled the daily withdrawals across the four ATMs in the hopes of seeing a clear pattern when combined.
atm_data_all <- bind_rows(atm1_data_ts,
atm2_data_ts,
atm3_data_ts_all,
atm4_data_ts)
atm_data_total_by_day <- atm_data_all %>%
index_by(DATE) %>%
summarize(Cash = sum(Cash))
#atm_data_total_by_day
atm_data_total_by_day %>% autoplot(Cash) +
labs(
y = "Cash (in hundreds $USD)",
title = "ATMs Withdrawals Combined",
subtitle = "Summary contains Imputed Data"
)
Overall, the weekly seasonal pattern still emerges, but combining the data does not reveal any additional insights.
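One way to confirm that weekly pattern in the combined series is a gg_season plot by day of week; a quick sketch using the combined daily totals from above:
# Weekly pattern in the combined series (sketch)
atm_data_total_by_day %>%
  gg_season(Cash, period = "week") +
  labs(y = "Cash (in hundreds $USD)",
       title = "Combined ATM Withdrawals by Day of Week")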
# Seasonally adjusted plot
atm_dcmp <- atm_data_total_by_day %>%
model(stl = STL(Cash ~ trend(window=Inf) + season(period=7, window="periodic")))
atm_dcmp %>% components() %>% autoplot()
The full STL decomposition repeats the findings from the individual ATMs: the overall trend is weak, the remainder component contributes the most variation, and the weekly seasonal pattern follows. As noted particularly for ATM1 and ATM2, the remainder plot shows greater variance beginning in February 2010.
components(atm_dcmp) %>%
as_tsibble() %>%
autoplot(Cash, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "Cash (in hundreds)", title = "Seasonally Adjusted Trendline")
The seasonally adjusted series echoes the earlier findings; beginning in February 2010, the seasonal adjustment clearly diverges from the actual data.
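To put a rough number on that divergence, one could flag the days with unusually large STL remainders; a minimal sketch using the atm_dcmp decomposition above (the 3-standard-deviation cutoff is an arbitrary choice):
# Flag days with unusually large STL remainders (sketch)
components(atm_dcmp) %>%
  as_tsibble() %>%
  filter(abs(remainder) > 3 * sd(remainder, na.rm = TRUE)) %>%
  select(DATE, remainder)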
Combine the forecasts for the four ATM datasets and write to an Excel file.
# Write to Excel file
atm_forecast_may2010 <- bind_rows(atm1_forecast_may2010,
atm2_forecast_may2010,
atm3_forecast_may2010,
atm4_forecast_may2010
)
write_xlsx(atm_forecast_may2010, "data/ptanofsky_atm_forecast_may2010.xlsx")
Prompt: Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straightforward. Add this to your existing files above.
# Read in data
power_data_raw <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")
# Change column name to 'Month'
names(power_data_raw)[names(power_data_raw) == 'YYYY-MMM'] <- 'Month'
# Display first 5 rows for visual inspection
#head(power_data_raw)
# Display summary for initial assessment
summary(power_data_raw)
## CaseSequence Month KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
The summary of the residential power data indicates 192 months, which is exactly 16 years. The KWH column shows 1 missing value and a range of 770523 to 10655730. The minimum value being so far below the first quartile may indicate an outlier. The CaseSequence column appears to be an integer unique identifier for each month, so it will be ignored moving forward.
# Display dimensions for assessment
# should be 16 years of data 1998-2013
# Yep, 192/12 = 16
dim(power_data_raw)
## [1] 192 3
The dimensions output confirms the summary output, 192 observations with three columns.
power_data_ts <- power_data_raw %>%
mutate(Month = yearmonth(Month)) %>%
mutate(KWH = KWH/1e3) %>% # In thousands
select(-c(CaseSequence)) %>%
as_tsibble(index = Month)
# Output first 5 rows of tsibble
#head(power_data_ts)
# Initial plot of data
power_data_ts %>%
autoplot(KWH) +
labs(title="Residential Power Consumption",
y="KWH (in thousands)")
After converting the raw data to a tsibble with Month as the index, we can see a missing value somewhere in 2008 and an extremely low outlier in mid-2010.
power_data_ts$Month[is.na(power_data_ts$KWH)]
## <yearmonth[1]>
## [1] "2008 Sep"
Month with missing value is “2008 Sep”.
power_data_ts$Month[power_data_ts$KWH < 3000]
## <yearmonth[2]>
## [1] NA "2010 Jul"
Month with outlier is “2010 Jul”.
Given the data is monthly and appears to follow a clear seasonal pattern, I will impute both the missing value and the outlier using the median value for the corresponding calendar month.
First, I will find the median value for all September entries followed by July.
# I want all the September values
sep_data <- power_data_ts %>%
filter(str_detect(Month, "Sep"))
sep_data <- as_tibble(sep_data)
sep_kwh_med <- sep_data %>%
summarise(Median = median(KWH, na.rm = TRUE))
sep_kwh_med
## # A tibble: 1 × 1
## Median
## <dbl>
## 1 7667.
# Median 7666.97
power_data_ts[129, 2] <- sep_kwh_med
Median value for all other September values is 7666.97.
# I want all the July values
jul_data <- power_data_ts %>%
filter(str_detect(Month, "Jul"))
jul_data <- as_tibble(jul_data)
jul_kwh_med <- jul_data %>%
summarise(Median = median(KWH, na.rm = TRUE))
jul_kwh_med
## # A tibble: 1 × 1
## Median
## <dbl>
## 1 7679.
# Median 7678.623
power_data_ts[151, 2] <- jul_kwh_med
Median value for all other July values is 7678.62.
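As an aside, the same imputation could be done in one pass from month-of-year medians, avoiding the hard-coded row indices above; a rough sketch (the 3000 cutoff mirrors the earlier outlier filter, and the result is not used below):
# Impute NAs and the extreme low value from month-of-year medians (sketch)
power_data_imputed <- power_data_ts %>%
  as_tibble() %>%
  mutate(MonthName = str_sub(as.character(Month), 6, 8)) %>%
  group_by(MonthName) %>%
  mutate(MonthMedian = median(KWH[!is.na(KWH) & KWH >= 3000]),
         KWH = if_else(is.na(KWH) | KWH < 3000, MonthMedian, KWH)) %>%
  ungroup() %>%
  select(Month, KWH) %>%
  as_tsibble(index = Month)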
power_data_ts %>% autoplot(KWH) +
labs(title="Residential Power Consumption",
y="KWH (in thousands)")
The above plot shows the residential customer power data with the imputed values. A seasonal pattern clearly exists, along with increasing variance near the end of the plot, so a Box-Cox transformation could be useful. A slight increasing trend also appears, at least for the second half of the plot.
power_dcmp <- power_data_ts %>%
model(stl = STL(KWH ~ trend(window=Inf) + season(period=12, window="periodic")))
power_dcmp %>% components() %>% autoplot()
The decomposition of the data indicates an increasing trend, along with a seasonal pattern and a remainder component that appear nearly equal in weight.
components(power_dcmp) %>%
as_tsibble() %>%
autoplot(KWH, colour = "gray") +
geom_line(aes(y=season_adjust), colour = "#0072B2") +
labs(y = "KWH (in thousands)", title = "Seasonally Adjusted Trendline")
The plot of the seasonally adjusted data indicates the seasonal pattern does persist through the full dataset with a few exceptions throughout.
power_data_ts %>% ACF() %>% autoplot()
ACF plot shows the highest correlation at lag 12, which is expected for monthly data over the course of several years.
power_data_ts %>% PACF() %>% autoplot()
The PACF plot indicates autocorrelation at lags 1, 2, 5, 11, and 12, an interesting result for monthly data, as I expected the PACF to show its highest spike at lag 12.
power_data_ts %>% gg_season(KWH, period = "year") +
labs(y="KWH (in thousands)", title="Residential Power Consumption")
The gg_season plot certainly shows a yearly pattern with low usage in the spring and fall and higher usage in the summer and winter.
power_data_ts %>%
gg_subseries(period = "year") +
labs(y="KWH (in thousands)", title="Residential Power Consumption")
The gg_subseries plot indicates increasing usage for almost every month over the course of the dataset. This increase is another indicator that a Box-Cox transformation will be necessary.
As performed in Part A, I will use a train and test approach to select the best performing model. The models will be trained on the first 80% of the dataset, in this case 154 months, and evaluated on the final 38 months.
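The split sizes follow directly from the 80% rule; a trivial arithmetic check:
# 80/20 split arithmetic (sketch)
n_total <- nrow(power_data_ts)   # 192 months
n_train <- round(0.8 * n_total)  # 154 months for training
n_test  <- n_total - n_train     # 38 months held out for testing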
# Train on 154 months, which is 12 years and 8 months (80% of the total)
# Total is 192 months (16 years)
train_power_data <- power_data_ts %>%
filter_index("1998-Jan" ~ "2010-Oct")
With the increasing variance in the plots, a Box-Cox transformation is applied to calculate the lambda value.
lambda <- power_data_ts %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
train_power_data %>%
autoplot(box_cox(KWH, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM1 withdrawals with $\\lambda$ = ",
round(lambda,2))))
The Box-Cox transformation does appear to address the increasing variance in the initial plot with a lambda value of -0.2.
# Fit the models
fit_power_bc <- train_power_data %>%
model(
Naive = NAIVE(box_cox(KWH, lambda)),
`Seasonal naive` = SNAIVE(box_cox(KWH, lambda)),
`Random walk` = RW(box_cox(KWH, lambda)),
Arima = ARIMA(box_cox(KWH, lambda)),
ETS_Add = ETS(box_cox(KWH, lambda) ~ error("A") + trend("A") + season("A")),
ETS_Mult = ETS(box_cox(KWH, lambda) ~ error("M") + trend("A") + season("M")),
ETS_Damp = ETS(box_cox(KWH, lambda) ~ error("M") + trend("Ad") + season("M"))
)
report(fit_power_bc)
## # A tibble: 7 × 11
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 Naive 1.21e-3 NA NA NA NA <NULL> <NULL> NA NA
## 2 Seasona… 3.58e-4 NA NA NA NA <NULL> <NULL> NA NA
## 3 Random … 1.21e-3 NA NA NA NA <NULL> <NULL> NA NA
## 4 Arima 2.10e-4 398. -785. -785. -771. <cpl [1… <cpl [1… NA NA
## 5 ETS_Add 2.22e-4 268. -503. -498. -451. <NULL> <NULL> 1.99e-4 2.05e-4
## 6 ETS_Mult 1.33e-5 269. -504. -500. -452. <NULL> <NULL> 1.97e-4 2.04e-4
## 7 ETS_Damp 1.35e-5 268. -501. -496. -446. <NULL> <NULL> 1.99e-4 2.08e-4
## # … with 1 more variable: MAE <dbl>
The best performing model on the training set according to the AICc is the ARIMA model at -785.02.
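For reference, the same ranking can be pulled with glance(); a small sketch (the benchmark models do not report an AICc):
# Rank models by AICc (sketch)
glance(fit_power_bc) %>%
  arrange(AICc) %>%
  select(.model, AICc)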
fit_power_bc %>% accuracy()
## # A tibble: 7 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Naive Training -6.45 1289. 1069. -2.21 17.2 1.92 1.80 0.182
## 2 Seasonal naive Training 69.0 715. 557. 0.478 8.79 1 1 0.267
## 3 Random walk Training -6.45 1289. 1069. -2.21 17.2 1.92 1.80 0.182
## 4 Arima Training 7.10 518. 397. -0.309 6.34 0.713 0.725 0.0885
## 5 ETS_Add Training 14.8 542. 422. -0.447 6.69 0.758 0.758 0.324
## 6 ETS_Mult Training 14.0 539. 414. -0.480 6.55 0.744 0.754 0.311
## 7 ETS_Damp Training 53.2 543. 414. 0.167 6.50 0.744 0.760 0.296
The best RMSE on the training data was attained by the ARIMA model followed by the three ETS models.
# Generate forecasts for 38 months, so we'll see if it picks up the seasonal nature
fc_power_bc <- fit_power_bc %>% forecast(h = "38 months")
fc_power_bc %>% accuracy(power_data_ts)
## # A tibble: 7 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Arima Test 621. 1034. 757. 7.25 9.42 1.36 1.45 0.229
## 2 ETS_Add Test 633. 1084. 798. 7.33 9.99 1.43 1.52 0.175
## 3 ETS_Damp Test 691. 1107. 820. 8.13 10.2 1.47 1.55 0.172
## 4 ETS_Mult Test 590. 1038. 751. 6.77 9.36 1.35 1.45 0.164
## 5 Naive Test -1348. 2580. 2155. -23.4 32.0 3.87 3.61 0.689
## 6 Random walk Test -1348. 2580. 2155. -23.4 32.0 3.87 3.61 0.689
## 7 Seasonal naive Test 303. 1018. 850. 2.77 11.0 1.53 1.42 0.419
The best RMSE on the 38-month test set is achieved by the Seasonal naive model, followed closely by the ARIMA model and the ETS Multiplicative model.
# Plot forecasts against actual values
fc_power_bc %>%
autoplot(filter_index(power_data_ts, "2010-Nov" ~ "2013-Dec"), level = NULL) +
autolayer(
filter_index(power_data_ts, "2010-Nov" ~ "2013-Dec"),
colour = "black"
) +
labs(
y = "KWH (in thousands)",
title = "Power Forecast"
) +
guides(colour = guide_legend(title = "Forecast"))
The above plot of the forecasts overlaid with the actual data indicates several of the models predict the monthly seasonal pattern well.
fit_power_arima <- train_power_data %>%
model(ARIMA(box_cox(KWH, lambda)))
report(fit_power_arima)
## Series: KWH
## Model: ARIMA(1,0,1)(0,1,1)[12] w/ drift
## Transformation: box_cox(KWH, lambda)
##
## Coefficients:
## ar1 ma1 sma1 constant
## -0.4374 0.6839 -0.7247 0.0021
## s.e. 0.1778 0.1394 0.0706 0.0007
##
## sigma^2 estimated as 0.00021: log likelihood=397.73
## AIC=-785.47 AICc=-785.02 BIC=-770.69
The report of the ARIMA model indicates an ARIMA(1,0,1)(0,1,1)[12] with drift. The result indicates a seasonal differencing of 1 along with a seasonal MA term of 1. The non-seasonal (p,d,q) shows an AR term of 1 and an MA term of 1. The seasonal lag is denoted as 12, which is expected for the monthly data.
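The differencing orders chosen by ARIMA can be cross-checked against the unit-root-based features; a quick sketch (results not asserted here):
# Cross-check ARIMA's differencing choices with unit-root features (sketch)
train_power_data %>%
  features(KWH, list(unitroot_ndiffs, unitroot_nsdiffs))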
# Check the residuals of ARIMA
fit_power_arima %>% gg_tsresiduals()
The ARIMA model residuals appear to be white noise and do not show any significant autocorrelation in the ACF plot.
augment(fit_power_arima) %>% features(.innov, ljung_box, lag=24, dof=4)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(box_cox(KWH, lambda)) 25.5 0.185
Based on the Ljung-Box test result, the large p-value indicates the residuals are consistent with white noise.
Let’s forecast for the year 2014 with all the models. The forecast will be for 50 months instead of 38 as above in order to capture the data for 2014.
# Generate forecasts for 50 months, which covers the end of the provided time series and the next 12 months of 2014
fc_power_bc14 <- fit_power_bc %>% forecast(h = "50 months")
fc_power_bc14 %>%
autoplot(filter_index(power_data_ts, "2010-Nov" ~ "2014-Dec"), level = NULL) +
autolayer(
filter_index(power_data_ts, "2010-Nov" ~ "2014-Dec"),
colour = "black"
) +
labs(
y = "KWH (in thousands)",
title = "Power Forecast"
) +
guides(colour = guide_legend(title = "Forecast"))
Based on the strong results of the ARIMA model on both the training and test data, I've selected the ARIMA model for the final 2014 forecast values. The Seasonal Naive model did well, but I believe the ARIMA model will better capture the trend and any shifts over the next year.
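One alternative worth noting: the 2014 values below come from the model fit on the training window, and the chosen specification could instead be refit on the full series before forecasting. A sketch of that approach (not used for the output file):
# Refit the selected specification on the full series, then forecast 2014 (sketch)
fit_power_full <- power_data_ts %>%
  model(Arima = ARIMA(box_cox(KWH, lambda)))
fc_power_2014 <- fit_power_full %>% forecast(h = "12 months")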
# Write to excel
# dataframe for Excel output
Residental_Forecast_2014 <- fc_power_bc14 %>%
filter(.model == "Arima", Month > yearmonth("2013 Dec"))
Residental_Forecast_2014 <- data.frame(Residental_Forecast_2014) %>%
select(c(Month, .mean))%>%
rename(KWH = .mean)
# Multiply by 1000 to return to original form
Residental_Forecast_2014$KWH <- Residental_Forecast_2014$KWH * 1e3
Residental_Forecast_2014$Month <- as.character(Residental_Forecast_2014$Month)
Residental_Forecast_2014 <- Residental_Forecast_2014 %>%
mutate(Month = str_replace(Month, " ", "-")) %>%
rename(`YYYY-MMM` = Month)
Case_Seq <- c(seq(from=925, to=936))
Residental_Forecast_2014 <- bind_cols(CaseSequence=Case_Seq, Residental_Forecast_2014)
# Write to Excel file
write_xlsx(Residental_Forecast_2014, "data/ptanofsky_kwh_forecast_2014.xlsx")
Prompt: Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
library(lubridate)
# Read in data
pipe1_data_raw <- read_excel("data/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
pipe2_data_raw <- read_excel("data/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
# From: https://stackoverflow.com/questions/53635818/convert-datetime-from-excel-to-r
pipe1_data_raw$`Date Time` <- as.POSIXct(pipe1_data_raw$`Date Time`,
origin="1899-12-30",
tz="GMT")
pipe2_data_raw$`Date Time` <- as.POSIXct(pipe2_data_raw$`Date Time`,
origin="1899-12-30",
tz="GMT")
#summary(pipe1_data_raw)
#summary(pipe2_data_raw)
#dim(pipe1_data_raw)
#dim(pipe2_data_raw)
# Calculate the average flow per hour for pipe1
pipe1_data_raw$Date <- as.Date(pipe1_data_raw$`Date Time`)
pipe1_data_raw$Time <- hour(pipe1_data_raw$`Date Time`) + 1
pipe1_data_raw <- pipe1_data_raw %>%
group_by(Date, Time) %>%
summarise(WF_AVG = mean(WaterFlow)) %>% ungroup()
pipe1_data_raw$`Date Time` <- with(pipe1_data_raw, ymd_h(paste(Date, Time)))
pipe1_data_raw <- pipe1_data_raw %>%
select(c(`Date Time`, WF_AVG)) %>%
rename(WaterFlow = WF_AVG)
head(pipe1_data_raw)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 26.1
## 2 2015-10-23 02:00:00 18.9
## 3 2015-10-23 03:00:00 15.2
## 4 2015-10-23 04:00:00 23.1
## 5 2015-10-23 05:00:00 15.5
## 6 2015-10-23 06:00:00 22.7
The above output confirms the hourly average for Pipe 1 has been calculated. Note: I rounded each Pipe 1 timestamp up to the next hour, so the Pipe 1 and Pipe 2 datasets line up and both begin with “2015-10-23 01:00:00”.
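For reference, lubridate's ceiling_date() can do the same rounding in one step; a sketch that re-reads the raw file, since pipe1_data_raw was overwritten above (timestamps falling exactly on the hour are treated slightly differently than with hour() + 1):
# Round each timestamp up to the next hour with ceiling_date(), then average (sketch)
pipe1_hourly <- read_excel("data/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric")) %>%
  mutate(Hour = ceiling_date(`Date Time`, unit = "hour")) %>%
  group_by(Hour) %>%
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop")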
# Define as tsibble
pipe1_data_ts <- pipe1_data_raw %>%
as_tsibble(index = `Date Time`)
pipe1_data_ts %>% autoplot()
The above plot is a simple visualization of the Pipe 1 waterflow data.
pipe2_data_ts <- pipe2_data_raw %>%
as_tsibble(index = `Date Time`)
pipe2_data_ts %>%
autoplot(colour = "gray") +
autolayer(pipe1_data_ts, colour = "#0072B2")
The above plot shows the Pipe 1 hourly mean data in blue overlayed on the Pipe 2 provided hourly data in gray.
pipe2_data_ts %>%
filter_index("2015-10-23" ~ "2015-11-01") %>%
autoplot(colour = "gray") +
autolayer(pipe1_data_ts, colour = "#0072B2")
In an attempt to better compare the Pipe 1 and Pipe 2 data, I’ve truncated the plot to only include the hours in which data is available for both pipe waterflows.
Unfortunately, I did not allot myself the time needed to fully tackle Part C. I performed the initial data wrangling above to compute the hourly mean for the Pipe 1 waterflow. The initial plots for visual inspection did not provide any immediate insights, so I did not move forward with any modeling. The discrepancy in date ranges between the provided Pipe 1 and Pipe 2 data definitely requires additional analysis.
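Had more time been available, the natural next steps would have been the stationarity check the prompt asks for, followed by a one-week-ahead forecast; a rough sketch of what that might look like (not run or validated here, and missing hours may need to be imputed after fill_gaps()):
# Stationarity check and one-week forecast for Pipe 2 (sketch, not validated)
pipe2_data_ts %>%
  fill_gaps() %>%
  features(WaterFlow, unitroot_kpss)
pipe2_data_ts %>%
  fill_gaps() %>%
  model(ARIMA(WaterFlow)) %>%
  forecast(h = "7 days")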