In Part A, I want you to forecast how much cash is taken out of four different ATMs for May 2010.
The data is given in a single file.
The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward.
I am being somewhat ambiguous on purpose to give this a little more of a business feel.
Explain and demonstrate your process, techniques used and not used, and your actual forecast.
I am giving you the data via an Excel file. Please provide your written report, with your findings, visuals, discussion, and R code, via an RPubs link, along with the actual .rmd file.
Also, please submit the forecast in an Excel-readable file.
head(atm)
## # A tibble: 6 × 3
## Date ATM Cash
## <dttm> <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1 96
## 2 2009-05-01 00:00:00 ATM2 107
## 3 2009-05-02 00:00:00 ATM1 82
## 4 2009-05-02 00:00:00 ATM2 89
## 5 2009-05-03 00:00:00 ATM1 85
## 6 2009-05-03 00:00:00 ATM2 90
summary(atm)
## Date ATM Cash
## Min. :2009-05-01 00:00:00.0 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 00:00:00.0 Class :character 1st Qu.: 0.5
## Median :2009-11-01 00:00:00.0 Mode :character Median : 73.0
## Mean :2009-10-31 19:33:27.5 Mean : 155.6
## 3rd Qu.:2010-02-01 00:00:00.0 3rd Qu.: 114.0
## Max. :2010-05-14 00:00:00.0 Max. :10919.8
## NA's :19
I had an initial issue with the dates: Excel stores them as an internal serial number, so we had to convert. None of the serial numbers had a decimal component, which indicates that even though the Excel file displayed a timestamp, each value was really just a date that someone had formatted as a timestamp. The min date is 5-1-2009 and the max is 5-14-2010, so a little more than a year of history. For this project we will create a forecast for all of May 2010, covering the 5-1-2010 through 5-14-2010 rows that are present but blank. First, it does look like we have some missing values that we should check out.
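For reference, a minimal sketch of the kind of conversion involved; the serial values below are illustrative, and Excel on Windows counts days from a 1899-12-30 origin.
# Hypothetical example: Excel stores 2009-05-01 as the serial number 39934.
# No fractional part means no time-of-day component, so the serials convert
# straight to dates using Excel's Windows origin.
library(lubridate)
excel_serial <- c(39934, 39935)  # illustrative values
as_date(excel_serial, origin = "1899-12-30")  # "2009-05-01" "2009-05-02"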
Let’s reshape the dataset so that each ATM is its own column.
atm_s<- atm %>% spread(ATM, Cash)
atm_s <- atm_s %>% select(Date, ATM1, ATM2, ATM3, ATM4)
atm_names <- colnames(atm_s)[2:5]
for (i in atm_names) {
print(i)
print(sum(is.na(atm_s[i])))
}
## [1] "ATM1"
## [1] 17
## [1] "ATM2"
## [1] 16
## [1] "ATM3"
## [1] 14
## [1] "ATM4"
## [1] 14
# Most of the NAs are the blank May 2010 rows, so we drop those dates and count again
atm_s <- atm_s %>%
filter(Date < "2010-05-01")
for (i in atm_names) {
print(i)
print(sum(is.na(atm_s[i])))
}
## [1] "ATM1"
## [1] 3
## [1] "ATM2"
## [1] 2
## [1] "ATM3"
## [1] 0
## [1] "ATM4"
## [1] 0
We can see that the remaining missing values are limited to ATM1 and ATM2, and that the rows after 5-1-2010 were simply blank. For the days that are still missing, let’s impute the mean for that ATM; since there are very few missing days, we should be fine.
atm_s <- atm_s %>% mutate(across(c(ATM1, ATM2), ~replace_na(., mean(., na.rm=TRUE))))
paste0("Now that we added in the values and removed the dates we have: ", sum(is.na(atm_s))," NAs")
## [1] "Now that we added in the values and removed the dates we have: 0 NAs"
Now let’s get the summary stats for each ATM.
par(mfrow=c(4,1))
for (i in atm_names) {
print(summary(atm_s[i]))
boxplot(atm_s[i], horizontal = TRUE)
}
## ATM1
## Min. : 1.00
## 1st Qu.: 73.00
## Median : 90.00
## Mean : 83.89
## 3rd Qu.:108.00
## Max. :180.00
## ATM2
## Min. : 0.00
## 1st Qu.: 26.00
## Median : 66.00
## Mean : 62.58
## 3rd Qu.: 93.00
## Max. :147.00
## ATM3
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.7206
## 3rd Qu.: 0.0000
## Max. :96.0000
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
We can see that ATM4 accounts for the bulk of the money withdrawn, ATM1 and ATM2 see a fair amount, and ATM3 sees almost nothing; it has only three days with any withdrawals at all.
Now let’s plot the data. Before we can do that, we first have to convert it to a tsibble.
atm_ts <- atm_s %>%
mutate(Day = as_date(Date)) %>%
as_tsibble(index = Day)
atm_ts <- atm_ts %>% select(Day, ATM1, ATM2, ATM3, ATM4)
atm_ts %>%
gather("ATM","Cash", 2:5) %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "Money Withdrawn USD",
y= "USD in Hundreds",
x = "Day")
From this plot we can see that ATMs 1, 2, and 4 have consistent data and have been in place a while, while ATM3 has only just come online. For that ATM we may just want a (seasonal) naive forecast, since there is essentially no historical data. In the next part, we will look at each ATM individually and create forecasts for them.
But first, to be good data scientists, we need to split the data into train and test sets.
atm_train <- atm_ts %>%
filter_index(~ "2010-03")
atm_test <- atm_ts %>%
filter_index("2010-04")
atm_train %>%
select(ATM1) %>%
autoplot() +
ggtitle("ATM1")
## Plot variable not specified, automatically selected `.vars = ATM1`
# STL decomposition
atm_train %>%
select(ATM1) %>%
model(STL(ATM1 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM1")
The seasonal component shows a clear weekly cycle with a pronounced low point. Let’s find out which day it falls on.
atm_train %>%
add_column(day_no = as.vector(wday(atm_train$Day, week_start = 7))) %>%
as_tibble() %>%
select(ATM1,day_no) %>%
group_by(day_no) %>%
summarise_all(mean)
## # A tibble: 7 × 2
## day_no ATM1
## <dbl> <dbl>
## 1 1 103.
## 2 2 87.0
## 3 3 96.6
## 4 4 80.6
## 5 5 26.5
## 6 6 99.8
## 7 7 96.9
It looks like Thursdays are the low day; my guess is that is when the machine is restocked. We can also see the series holds a roughly constant amplitude, with no obvious indication of an increasing or decreasing trend. With that in mind, let’s look at several models with and without a Box-Cox transformation. We will run whichever ETS the automatic search picks, plus the alternative multiplicative/additive and damped variants, an SNAIVE model, and, on the Box-Cox scale, the same set with the addition of an ARIMA.
lambda_ATM1 <- atm_train %>%
features(ATM1, features = guerrero) %>%
pull(lambda_guerrero)
fit <- atm_train %>%
model(ETS(ATM1))
report(fit)
## Series: ATM1
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.01333705
## gamma = 0.3243183
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 77.98459 -66.76148 3.941321 16.12065 6.093844 14.60934 8.524811 17.47152
##
## sigma^2: 624.7121
##
## AIC AICc BIC
## 4115.103 4115.782 4153.244
atm1_fit <- atm_train %>%
model(
# additive ETS model
additive = ETS(ATM1 ~ error("A") + trend("N") + season("A")),
# damped additive ETS model
additived = ETS(ATM1 ~ error("A") + trend("Ad") + season("A")),
# multiplicative ETS model
multiplicative = ETS(ATM1 ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(ATM1~ lag("week")),
# transformed additive ETS model
additive_l = ETS(box_cox(ATM1,lambda_ATM1) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_l = ETS(box_cox(ATM1,lambda_ATM1) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_l = SNAIVE(box_cox(ATM1,lambda_ATM1)~ lag("week")),
# arima model
ARIMA = ARIMA(box_cox(ATM1,lambda_ATM1))
)
report(atm1_fit)
## # A tibble: 8 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_ro…¹ ma_ro…²
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 additi… 6.25e+2 -2048. 4115. 4116. 4153. 608. 611. 15.7 <NULL> <NULL>
## 2 additi… 6.32e+2 -2048. 4122. 4123. 4171. 609. 611. 15.8 <NULL> <NULL>
## 3 multip… 1.40e-1 -2083. 4186. 4187. 4224. 748. 760. 0.223 <NULL> <NULL>
## 4 snaive 8.37e+2 NA NA NA NA NA NA NA <NULL> <NULL>
## 5 additi… 6.05e+0 -1271. 2562. 2562. 2600. 5.89 5.90 1.42 <NULL> <NULL>
## 6 multip… 5.22e-2 -1302. 2624. 2625. 2662. 6.08 6.10 0.129 <NULL> <NULL>
## 7 snaive… 8.21e+0 NA NA NA NA NA NA NA <NULL> <NULL>
## 8 ARIMA 5.90e+0 -757. 1522. 1522. 1537. NA NA NA <cpl> <cpl>
## # … with abbreviated variable names ¹ar_roots, ²ma_roots
for (model in colnames(atm1_fit)) {
print(atm1_fit %>% select(all_of(model)) %>% gg_tsresiduals() + labs(title = paste0(model," Residuals")))
}
From the report we can see that the ARIMA model has the best AICc score, though note that AICc is only comparable between models fitted on the same scale, so the Box-Cox models cannot be compared directly with the untransformed ones. Looking at the residuals, the additive model without the Box-Cox transformation has the best-behaved distribution. In any case, we can forecast with all of them and see which ones perform best against our test dataset.
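To complement the visual residual checks, we can also run a portmanteau test across the whole mable; a quick sketch, where lag = 14 (two weekly cycles) is my own choice.
# Ljung-Box test on the innovation residuals of every candidate model;
# small p-values indicate leftover autocorrelation.
atm1_fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)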
atm1_fc <- atm1_fit %>%
fabletools::forecast(h = 30)
atm1_fc %>% autoplot(atm_test %>% select(ATM1))
atm1_fc %>% fabletools::accuracy(atm_test %>% select(ATM1)) %>% arrange(RMSE)
## # A tibble: 8 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additived Test -5.81 11.6 9.12 -61.8 65.5 NaN NaN -0.258
## 2 additive Test -5.81 11.6 9.13 -62.1 65.7 NaN NaN -0.257
## 3 ARIMA Test -8.02 12.7 10.5 -47.4 50.1 NaN NaN -0.0936
## 4 multiplicative_l Test -8.45 12.7 10.6 -49.3 51.6 NaN NaN -0.0914
## 5 additive_l Test -8.40 12.8 10.6 -48.0 50.3 NaN NaN -0.124
## 6 multiplicative Test -10.7 15.3 13.1 -71.2 73.8 NaN NaN -0.151
## 7 snaive Test -7.07 16.0 13.5 -66.2 73.4 NaN NaN 0.0437
## 8 snaive_l Test -19.4 25.2 21.6 -116. 118. NaN NaN 0.237
Comparing against the test data, the best model was actually the damped additive ETS without a Box-Cox transformation, so we will use it for the ATM1 predicted values.
ATM1_forcast <- atm1_fit %>%
select(additived) %>%
fabletools::forecast(h = 61)
ATM1_forcast <- ATM1_forcast %>% as_tibble () %>% select(Day, '.mean') %>% rename('ATM1' = '.mean' )
atm_train %>%
select(ATM2) %>%
autoplot() +
ggtitle("ATM2")
## Plot variable not specified, automatically selected `.vars = ATM2`
# STL decomposition
atm_train %>%
select(ATM2) %>%
model(STL(ATM2 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM2")
For ATM2 we see a slight decreasing trend and another weekly seasonal component; let’s see if it follows a similar weekly pattern.
atm_train %>%
add_column(day_no = as.vector(wday(atm_train$Day, week_start = 7))) %>%
as_tibble() %>%
select(ATM2,day_no) %>%
group_by(day_no) %>%
summarise_all(mean)
## # A tibble: 7 × 2
## day_no ATM2
## <dbl> <dbl>
## 1 1 66.9
## 2 2 62.7
## 3 3 79.2
## 4 4 39.3
## 5 5 19.2
## 6 6 94.6
## 7 7 76.0
We see the same weekly pattern as before. Now let’s see which model would be best.
lambda_ATM2 <- atm_train %>%
features(ATM2, features = guerrero) %>%
pull(lambda_guerrero)
fit <- atm_train %>%
model(ETS(ATM2))
report(fit)
## Series: ATM2
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000102
## gamma = 0.3614287
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 71.93059 -45.58854 -33.32942 27.05156 -2.0296 6.550105 6.425435 40.92046
##
## sigma^2: 672.3146
##
## AIC AICc BIC
## 4139.704 4140.383 4177.845
Again the automatically selected ETS is (A,N,A); now let’s see which of the candidate models is best.
atm2_fit <- atm_train %>%
model(
# additive ETS model
additive = ETS(ATM2 ~ error("A") + trend("N") + season("A")),
# damped additive ETS model
additived = ETS(ATM2 ~ error("A") + trend("Ad") + season("A")),
# multiplicative ETS model
multiplicative = ETS(ATM2 ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(ATM2~ lag("week")),
# transformed additive ETS model
additive_l = ETS(box_cox(ATM2,lambda_ATM2) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_l = ETS(box_cox(ATM2,lambda_ATM2) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_l = SNAIVE(box_cox(ATM2,lambda_ATM2)~ lag("week")),
# arima model
ARIMA = ARIMA(box_cox(ATM2,lambda_ATM2))
)
report(atm2_fit) %>% arrange(AICc)
## # A tibble: 8 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_ro…¹ ma_ro…²
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA 65.7 -1151. 2315. 2315. 2338. NA NA NA <cpl> <cpl>
## 2 additi… 70.3 -1682. 3383. 3384. 3422. 68.4 68.6 5.80 <NULL> <NULL>
## 3 multip… 0.266 -1796. 3612. 3613. 3650. 111. 110. 0.393 <NULL> <NULL>
## 4 additi… 672. -2060. 4140. 4140. 4178. 654. 656. 18.2 <NULL> <NULL>
## 5 additi… 681. -2060. 4147. 4148. 4196. 656. 658. 18.2 <NULL> <NULL>
## 6 multip… 0.370 -2162. 4345. 4345. 4383. 1028. 1015. 0.469 <NULL> <NULL>
## 7 snaive 942. NA NA NA NA NA NA NA <NULL> <NULL>
## 8 snaive… 96.3 NA NA NA NA NA NA NA <NULL> <NULL>
## # … with abbreviated variable names ¹ar_roots, ²ma_roots
for (model in colnames(atm2_fit)) {
print(atm2_fit %>% select(all_of(model)) %>% gg_tsresiduals() + labs(title = paste0(model," Residuals")))
}
Again we see the ARIMA model with the lowest AICc, and this time it also has the most normal-looking residuals, suggesting it will be the best model; the test set will tell.
atm2_fc <- atm2_fit %>%
fabletools::forecast(h = 30)
atm2_fc %>% autoplot(atm_test %>% select(ATM2))
atm2_fc %>% fabletools::accuracy(atm_test %>% select(ATM2)) %>% arrange(RMSE)
## # A tibble: 8 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 additive Test 9.52 19.4 14.3 -25.6 57.6 NaN NaN -0.0949
## 2 additived Test 9.61 19.4 14.4 -23.7 56.1 NaN NaN -0.0971
## 3 additive_l Test 7.71 19.7 14.8 -89.9 117. NaN NaN -0.0507
## 4 ARIMA Test 7.29 21.2 16.5 -90.6 119. NaN NaN 0.0323
## 5 snaive_l Test 5.01 23.0 15.9 3.61 48.0 NaN NaN -0.289
## 6 snaive Test 12.9 26.2 17.8 35.1 45.1 NaN NaN -0.319
## 7 multiplicative Test -0.285 48.0 41.0 -785. 828. NaN NaN 0.0115
## 8 multiplicative_l Test 3.87 48.0 41.2 -744. 792. NaN NaN 0.0234
Against the test data, however, the plain additive ETS model comes out best for ATM2.
ATM2_forcast <- atm2_fit %>%
select(additive) %>%
fabletools::forecast(h = 61)
ATM2_forcast <- ATM2_forcast %>% as_tibble () %>% select(Day, '.mean') %>% rename('ATM2' = '.mean' )
For ATM3 there is no real history to train on or test against. However, since the other ATMs share the same weekly pattern (ATM4 does too), we can take the seasonality from one of them and apply it to ATM3 via a seasonal naive model. The other option would have been the mean method, but since we know these ATMs follow a weekly seasonal cycle, as a business person I’m sure my boss would appreciate a model that uses that cycle rather than one that just returns the mean.
I decided to borrow from ATM1 because its values on 4/28-4/30 exactly match ATM3’s three observed days, so we can splice ATM1’s April data onto the front of the ATM3 series. I used only April to keep the approach as naive as possible while still accounting for the weekly seasonality.
df33 <- atm_test %>% select(ATM3) %>% filter(ATM3 != 0)
df31 <- atm_test %>% select(ATM1) %>% filter(Day < "2010-04-28") %>% rename("ATM3" = "ATM1")
df_atm3 <- bind_rows(df33,df31)
atm3_fit <- df_atm3 %>%
model(SNAIVE(ATM3 ~ lag("week")))
ATM3_forcast <- atm3_fit %>%
fabletools::forecast(h = 31)
ATM3_forcast <- ATM3_forcast %>% as_tibble () %>% select(Day, '.mean') %>% rename('ATM3' = '.mean' )
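For reference, the mean-method alternative mentioned above would have been a one-liner; a sketch only, since we are not using it.
# Flat mean forecast over the spliced ATM3 series, for comparison.
atm3_mean_fc <- df_atm3 %>%
  model(MEAN(ATM3)) %>%
  fabletools::forecast(h = 31)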
Let’s take a look at ATM4, which has much higher transaction volumes than the other three. The first thing to do is deal with the large spike in the data. It occurred on 2-9-2010, and I could not find any news story or other real-world reason for it, so we will treat it as an anomaly and replace it with the series mean.
avg <- atm_ts %>% as_tibble() %>% select(ATM4) %>% summarise_all(mean)
atm_train$ATM4[atm_train$Day == "2010-02-09"] <- avg$ATM4[1]
atm_train %>%
select(ATM4) %>%
autoplot() +
ggtitle("ATM4")
## Plot variable not specified, automatically selected `.vars = ATM4`
# STL decomposition
atm_train %>%
select(ATM4) %>%
model(STL(ATM4 ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM4")
atm_train %>%
add_column(day_no = as.vector(wday(atm_train$Day, week_start = 7))) %>%
as_tibble() %>%
select(ATM4,day_no) %>%
group_by(day_no) %>%
summarise_all(mean)
## # A tibble: 7 × 2
## day_no ATM4
## <dbl> <dbl>
## 1 1 550.
## 2 2 487.
## 3 3 482.
## 4 4 405.
## 5 5 148.
## 6 6 569.
## 7 7 503.
As mentioned above, ATM4 follows the same weekly pattern as the other ATMs, and the automatic ETS again selects (A,N,A), so we will run the same model comparisons. The auto-selected ARIMA’s AICc was not as good as I expected, so I also ran some manual variations to see if a better specification exists.
lambda_ATM4 <- atm_train %>%
features(ATM4, features = guerrero) %>%
pull(lambda_guerrero)
fit <- atm_train %>%
model(ETS(ATM4))
report(fit)
## Series: ATM4
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0002820141
## gamma = 0.0001000387
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 453.818 -289.4372 -49.9708 38.87142 33.06923 94.52715 53.49428 119.4459
##
## sigma^2: 112334.6
##
## AIC AICc BIC
## 5854.405 5855.084 5892.546
fit <- atm_train %>%
model(ARIMA(ATM4))
report(fit)
## Series: ATM4
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.9005 -0.7159 0.1525 0.9844 0.8373 0.2322 851.2132
## s.e. 0.1106 0.1480 0.0611 0.1002 0.1329 0.0594 52.9251
##
## sigma^2 estimated as 122012: log likelihood=-2433.77
## AIC=4883.54 AICc=4883.98 BIC=4914.05
atm4_fit <- atm_train %>%
model(
# additive ETS model
additive = ETS(ATM4 ~ error("A") + trend("N") + season("A")),
# damped additive ETS model
additived = ETS(ATM4 ~ error("A") + trend("Ad") + season("A")),
# multiplicative ETS model
multiplicative = ETS(ATM4 ~ error("M") + trend("N") + season("M")),
# damped multiplicative ETS model
multiplicatived = ETS(ATM4 ~ error("M") + trend("Ad") + season("M")),
# SNAIVE model
snaive = SNAIVE(ATM4~ lag("week")),
# transformed additive ETS model
additive_l = ETS(box_cox(ATM4,lambda_ATM4) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_l = ETS(box_cox(ATM4,lambda_ATM4) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_l = SNAIVE(box_cox(ATM4,lambda_ATM4)~ lag("week")),
# auto-selected arima model (on the Box-Cox scale)
ARIMA_r = ARIMA(box_cox(ATM4,lambda_ATM4)),
# arima model (3,0,2)(1,1,0)
ARIMA_110 = ARIMA(box_cox(ATM4,lambda_ATM4) ~ pdq(3,0,2) + PDQ(1,1,0)),
# arima model (3,0,1)(1,0,0)
ARIMA_301 = ARIMA(box_cox(ATM4,lambda_ATM4) ~ pdq(3,0,1) + PDQ(1,0,0)),
# arima model (3,0,2)(1,0,0) + constant 0
ARIMA_c0 = ARIMA(box_cox(ATM4,lambda_ATM4) ~ 0 + pdq(3,0,2) + PDQ(1,0,0)),
# arima model (3,0,2)(1,0,0) + constant 1
ARIMA_c1 = ARIMA(box_cox(ATM4,lambda_ATM4) ~ 1 + pdq(3,0,2) + PDQ(1,0,0))
)
report(atm4_fit) %>% arrange(AICc)
## # A tibble: 13 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_ro…¹
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
## 1 ARIMA_r 1.03e+2 -1250. 2509. 2509. 2524. NA NA NA <cpl>
## 2 ARIMA_c1 1.04e+2 -1250. 2516. 2517. 2547. NA NA NA <cpl>
## 3 ARIMA_301 1.06e+2 -1254. 2522. 2522. 2548. NA NA NA <cpl>
## 4 ARIMA_110 1.26e+2 -1257. 2528. 2528. 2554. NA NA NA <cpl>
## 5 ARIMA_c0 1.07e+2 -1259. 2531. 2532. 2558. NA NA NA <cpl>
## 6 additive_l 9.46e+1 -1731. 3483. 3484. 3521. 9.21e1 9.23e1 7.69 <NULL>
## 7 multiplica… 1.95e-1 -1733. 3485. 3486. 3523. 9.61e1 9.64e1 0.345 <NULL>
## 8 additive 1.12e+5 -2917. 5854. 5855. 5893. 1.09e5 1.10e5 264. <NULL>
## 9 multiplica… 7.53e-1 -2922. 5864. 5865. 5902. 1.30e5 1.31e5 0.638 <NULL>
## 10 additived 1.15e+5 -2919. 5864. 5865. 5914. 1.11e5 1.11e5 265. <NULL>
## 11 multiplica… 7.35e-1 -2926. 5878. 5879. 5928. 1.34e5 1.35e5 0.627 <NULL>
## 12 snaive 2.13e+5 NA NA NA NA NA NA NA <NULL>
## 13 snaive_l 1.69e+2 NA NA NA NA NA NA NA <NULL>
## # … with 1 more variable: ma_roots <list>, and abbreviated variable name
## # ¹ar_roots
for (model in colnames(atm4_fit)) {
print(atm4_fit %>% select(all_of(model)) %>% gg_tsresiduals() + labs(title = paste0(model," Residuals")))
}
Looking at the AICc, the ARIMA models all perform similarly, and the residual plots do not clearly separate them either, so we will let the test-set accuracy decide.
atm4_fc <- atm4_fit %>%
fabletools::forecast(h = 30)
atm4_fc %>% autoplot(atm_test %>% select(ATM4), level = NULL)
atm4_fc %>% fabletools::accuracy(atm_test %>% select(ATM4)) %>% arrange(RMSE)
## # A tibble: 13 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 multiplicative Test -54.9 244. 208. -636. 656. NaN NaN 0.117
## 2 multiplicatived Test -53.6 244. 208. -635. 656. NaN NaN 0.122
## 3 multiplicative_l Test -99.7 249. 210. -642. 657. NaN NaN 0.165
## 4 ARIMA_r Test -62.3 259. 223. -828. 849. NaN NaN -0.0138
## 5 ARIMA_c0 Test -67.9 276. 238. -959. 981. NaN NaN 0.000708
## 6 ARIMA_301 Test -67.0 278. 240. -950. 972. NaN NaN -0.00588
## 7 ARIMA_c1 Test -67.8 282. 241. -968. 991. NaN NaN -0.0343
## 8 additive_l Test -54.1 301. 258. -1162. 1195. NaN NaN 0.166
## 9 snaive Test -130. 301. 227. -360. 375. NaN NaN 0.0298
## 10 additive Test -52.4 308. 261. -1226. 1259. NaN NaN 0.119
## 11 additived Test -44.8 309. 262. -1213. 1248. NaN NaN 0.130
## 12 ARIMA_110 Test -226. 360. 280. -712. 718. NaN NaN 0.304
## 13 snaive_l Test -589. 729. 603. -1222. 1223. NaN NaN 0.420
In this case, the multiplicative ETS (M,N,M) model performed the best, so we will use that for the forecast.
ATM4_forcast <- atm4_fit %>%
select(multiplicative) %>%
fabletools::forecast(h = 61)
ATM4_forcast <- ATM4_forcast %>% as_tibble () %>% select(Day, '.mean') %>% rename('ATM4' = '.mean' )
Part B – Forecasting Power
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
We will start this analysis the same way, by looking at the data.
head(dpower)
## # A tibble: 6 × 3
## Case Date KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
paste0("The min date is: ",min(dpower$Date)," and the max date is: ",max(dpower$Date))
## [1] "The min date is: 1998-Apr and the max date is: 2013-Sep"
p_names <- colnames(dpower)  # column names, used in the loops below
par(mfrow=c(4,1))
for (i in p_names[c(1,3)]) {
print(summary(dpower[i]))
boxplot(dpower[i], horizontal = TRUE, main=i)
}
## Case
## Min. :733.0
## 1st Qu.:780.8
## Median :828.5
## Mean :828.5
## 3rd Qu.:876.2
## Max. :924.0
## KWH
## Min. : 770523
## 1st Qu.: 5429912
## Median : 6283324
## Mean : 6502475
## 3rd Qu.: 7620524
## Max. :10655730
## NA's :1
Case is straightforward, but it does look like KWH has some outliers. Before we proceed, I want to make sure that the Case and Date values are unique, then check all columns for NAs.
sum(duplicated(dpower$Case))
## [1] 0
sum(duplicated(dpower$Date))
## [1] 0
for (i in p_names) {
print(i)
print(sum(is.na(dpower[i])))
}
## [1] "Case"
## [1] 0
## [1] "Date"
## [1] 0
## [1] "KWH"
## [1] 1
dpower %>% filter_all(any_vars(is.na(.)))
## # A tibble: 1 × 3
## Case Date KWH
## <dbl> <chr> <dbl>
## 1 861 2008-Sep NA
Great: we can see there are no duplicates, so we do not need to worry about cleaning those columns. But there is one missing value in the data (September 2008); we will replace it with the mean, and then we will create a Box-Cox transformed column for the KWH data.
Note: I later realized some months were missing from the data entirely, so we will be adding those in here as well (via fill_gaps()).
dpower <- dpower %>% mutate(across(KWH, ~replace_na(., mean(., na.rm=TRUE))))
power_ts <- dpower %>%
mutate(Date = yearmonth(Date)) %>%
as_tsibble(index = Date) %>%
fill_gaps(.full = TRUE)
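The Box-Cox column mentioned above is not shown in the original chunk; here is a minimal sketch, with lambda chosen by the Guerrero method (the column name KWH_bc is my own, and this assumes the gap months added by fill_gaps() get imputed first).
# Guerrero-selected lambda, then a transformed copy of KWH for reference.
lambda_KWH <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)
power_ts <- power_ts %>%
  mutate(KWH_bc = box_cox(KWH, lambda_KWH))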
Now let’s plot and see what we are dealing with, using an STL decomposition so we can see the series as a whole and each of its components:
# STL decomposition
power_ts %>%
select(KWH) %>%
model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for KWH")
We can see strong seasonality, but also a large one-off drop in KWH somewhere in January. Since it is a one-time event, I don’t think we have to remove it: it should not weigh on the fit too much, and it does not affect the seasonality. As with the ATMs, we will first split the data, then evaluate the candidate models against a held-out test set.
This time, however, we will take the winning model and refit it on the whole dataset before predicting, since we now have only 12 seasonal periods per year instead of 52.
power_train <- power_ts %>% filter(Case < 900 )
min(power_train$Date)
## <yearmonth[1]>
## [1] "1998 Jan"
max(power_train$Date)
## <yearmonth[1]>
## [1] "2011 Nov"
power_test <- power_ts %>% filter(Case >= 900)
min(power_test$Date)
## <yearmonth[1]>
## [1] "2011 Dec"
max(power_test$Date)
## <yearmonth[1]>
## [1] "2013 Dec"
fit <- power_train %>%
model(ETS(KWH))
report(fit)
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1321936
## gamma = 0.0001000993
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6129953 0.917436 0.7497227 0.8826468 1.19224 1.264716 1.208195 0.9891802
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.7679627 0.815219 0.9221161 1.077275 1.21329
##
## sigma^2: 0.0144
##
## AIC AICc BIC
## 5385.722 5388.901 5432.492
fit <- power_train %>%
model(ARIMA(KWH))
report(fit)
## Series: KWH
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.1352 -0.9763 -0.5591 178547.98
## s.e. 0.0811 0.0801 0.0894 79349.56
##
## sigma^2 estimated as 6.774e+11: log likelihood=-2336.62
## AIC=4683.24 AICc=4683.64 BIC=4698.45
We can see the best ETS fit is an (M,N,M) model; we will run that plus its additive (A,N,A) counterpart. For ARIMA, we will run the auto-selected model, which came out as (0,0,1)(2,1,0)[12] with drift, along with several manual (1,0,1) variants with different seasonal terms and constants, to see which one is best.
power_fit <- power_train %>%
model(
# additive ETS model
additive = ETS(KWH ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
multiplicative = ETS(KWH ~ error("M") + trend("N") + season("M")),
# auto-selected arima model
ARIMA_r = ARIMA(KWH),
# arima model (1,0,1)(0,2,1)
ARIMA_021 = ARIMA(KWH ~ pdq(1,0,1) + PDQ(0,2,1)),
# arima model (1,0,1)(0,1,2)
ARIMA_012 = ARIMA(KWH ~ pdq(1,0,1) + PDQ(0,1,2)),
# arima model (1,0,1)(0,1,1) + constant 0
ARIMA_c0 = ARIMA(KWH ~ 0 + pdq(1,0,1) + PDQ(0,1,1)),
# arima model (1,0,1)(0,1,1) + constant 1
ARIMA_c1 = ARIMA(KWH ~ 1 + pdq(1,0,1) + PDQ(0,1,1))
)
report(power_fit) %>% arrange(AICc)
## # A tibble: 7 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_ro…¹
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
## 1 ARIMA_0… 1.49e+12 -2221. 4450. 4450. 4462. NA NA NA <cpl>
## 2 ARIMA_r 6.77e+11 -2337. 4683. 4684. 4698. NA NA NA <cpl>
## 3 ARIMA_0… 7.12e+11 -2339. 4689. 4690. 4708. NA NA NA <cpl>
## 4 ARIMA_c1 7.39e+11 -2343. 4696. 4697. 4711. NA NA NA <cpl>
## 5 ARIMA_c0 7.66e+11 -2345. 4699. 4699. 4711. NA NA NA <cpl>
## 6 multipl… 1.44e- 2 -2678. 5386. 5389. 5432. 6.97e11 7.40e11 7.64e-2 <NULL>
## 7 additive 7.41e+11 -2702. 5434. 5438. 5481. 6.79e11 7.00e11 5.09e+5 <NULL>
## # … with 1 more variable: ma_roots <list>, and abbreviated variable name
## # ¹ar_roots
for (model in colnames(power_fit)) {
print(power_fit %>% select(all_of(model)) %>% gg_tsresiduals() + labs(title = paste0(model," Residuals")))
}
power_fc <- power_fit %>%
fabletools::forecast(h = 24)
power_fc %>% autoplot(power_test %>% select(KWH), level = NULL)
power_fc %>% fabletools::accuracy(power_test %>% select(KWH)) %>% arrange(RMSE)
## # A tibble: 7 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 multiplicative Test 43856. 670564. 4.95e5 -0.0621 6.66 NaN NaN 0.0738
## 2 additive Test 367469. 801350. 6.04e5 3.93 7.74 NaN NaN 0.177
## 3 ARIMA_c1 Test 459176. 806384. 5.92e5 5.41 7.48 NaN NaN 0.166
## 4 ARIMA_c0 Test 747267. 997737. 7.67e5 9.53 9.81 NaN NaN 0.165
## 5 ARIMA_021 Test -347092. 1006604. 7.69e5 -4.54 9.64 NaN NaN 0.0350
## 6 ARIMA_012 Test 530606. 1063905. 7.01e5 6.28 8.78 NaN NaN 0.250
## 7 ARIMA_r Test 490946. 1253331. 8.02e5 5.57 10.1 NaN NaN 0.190
We can see that the multiplicative ETS model is the best, so we will refit it on the full series and output the forecast.
power_fit_m <- power_ts %>%
model(# multiplicative ETS model
ETS(KWH ~ error("M") + trend("N") + season("M")))
report(power_fit_m)
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1113039
## gamma = 0.0001001479
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6134546 0.9570497 0.7499494 0.8695777 1.188998 1.260456 1.197843 0.9963632
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.7655965 0.8115011 0.9123226 1.068165 1.222177
##
## sigma^2: 0.0144
##
## AIC AICc BIC
## 6223.827 6226.555 6272.690
power_m_fc <- power_fit_m %>%
fabletools::forecast(h = 12)
power_m_fc %>% autoplot(power_ts %>% select(KWH))
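Finally, the assignment asks for the forecasts in an Excel-readable file. A minimal sketch of that last step, assuming the writexl package; the file name forecasts.xlsx is my own choice.
# Combine the four ATM forecasts, keep only May 2010, and write everything
# to one workbook alongside the 2014 power forecast.
library(writexl)
atm_out <- ATM1_forcast %>%
  left_join(ATM2_forcast, by = "Day") %>%
  left_join(ATM3_forcast, by = "Day") %>%
  left_join(ATM4_forcast, by = "Day") %>%
  filter(Day >= as.Date("2010-05-01"))
power_out <- power_m_fc %>%
  as_tibble() %>%
  mutate(Date = as.Date(Date)) %>%  # yearmonth -> Date for Excel
  select(Date, KWH = .mean)
write_xlsx(list(ATM = atm_out, Power = power_out), "forecasts.xlsx")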