library(tidyverse)
library(readxl)
library(ggfortify)
library(fpp3)
library(tsibble)
library(plotly)
library(xlsx)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
atm <- readxl::read_xlsx('ATM.xlsx') %>% as_tsibble(index = DATE, key = ATM)
head(atm)
## # A tsibble: 6 x 3 [1]
## # Key: ATM [1]
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39935 ATM1 82
## 3 39936 ATM1 85
## 4 39937 ATM1 90
## 5 39938 ATM1 99
## 6 39939 ATM1 88
This dataset contains three columns: Date, ATM, Cash. ATM represents one of four ATMs in question and Cash represents the amount withdrawn in hundreds of dollars. The DATE data is a series of five digit Excel dates. The first task will be to convert the date column to a more readable format.
atm <- atm %>% mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
The DATE column is converted to type ‘date’ using the as.Date function.
head(atm)
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
unique(atm$ATM)
## [1] "ATM1" "ATM2" "ATM3" "ATM4" NA
The data summaries tells us that the data ranges from May 1st 200 to May 14th 2009. The ATM column has four unique values representing each ATM and some missing values. The Cash column also has some missing values.
atm[!complete.cases(atm), ]
## # A tsibble: 19 x 3 [1D]
## # Key: ATM [3]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
Upon further examination we see that there are no ATM and Cash values for the last 15 rows. All 15 rows are for the month of May which we will be forecasting, so we will go ahead and filter those out of the dataset. The other five data points are parts of the time series and have missing Cash values. We will need to find the best approach for these values. We will now filter the forecast values out of the data set so that the only rows with missing values will be the five missing Cash value points.
atm <- atm %>% filter(!is.na(ATM))
atm[!complete.cases(atm), ]
## # A tsibble: 5 x 3 [1D]
## # Key: ATM [2]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
The most common approaches to dealing with missing data are to either delete it from the training data or to impute it. Since we are working with time series, deleting data would cause more harm than good. This leaves us with the option of imputing the data. Chapter 13 section 9 from Forecasting: Principals and Practice discusses the best practices for outliers and missing data when working with time series. The text states that some models will work and produce forecast with missing values, while some will cause errors. In this case we will be proactive and fix the missing values. Here we will use the second approach and replace the missing values with estimates using the ARIMA model.
atm <- atm %>%
model(ARIMA(Cash)) %>%
interpolate(atm)
Now that we have dealt with missing values, lets go ahead and take a look at the time series for each ATM.
atm %>%
autoplot(Cash)
atm %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4)
If we look at the first plot with all four ATMs plotted together, we can see that ATM4 has significantly higher withdrawals than the other three ATMs and has a major spike around the beginning of February. ATM1 and ATM2 seems to be moving at the same pace, while ATM3 is at a flatline. Looking at each ATM individually, we see that ATM1 and ATM2 show strong seasonality and strong cyclic behavior. They seem to be stationary with no sign of an increasing or decreasing trend. ATM3 shows no active use for the most part of the series with withdrawals starting towards the end of April. Some reasons for this might be due to the ATM being closed or an out of order status. ATM4 also shows a strong cyclic behavior and seasonality. In terms of outliers, we will need to address the spike in ATM4.
Referring back to the the FPP text, we can identify outliers as those that are greater than 3 interquartile ranges (IQRs) from the central 50% of the data. Once we identify the outliers for ATM4, we will replace them with estimates from the ARIMA model.
outliers <- atm |>
filter(ATM == 'ATM4') |>
model(
stl = STL(Cash ~ season(window = "periodic"), robust = TRUE)
) |>
components()
outliers <- outliers |>
filter(
remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
)
outliers
## # A dable: 2 x 8 [1D]
## # Key: ATM, .model [1]
## # : Cash = trend + season_week + remainder
## ATM .model DATE Cash trend season_week remainder season_adjust
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 stl 2009-09-22 1712. 373. -71.5 1411. 1784.
## 2 ATM4 stl 2010-02-09 10920. 279. -71.5 10712. 10991.
atm2 <- atm
atm2$Cash[atm$DATE == '2009-09-22' & atm$ATM == 'ATM4'] <- NA
atm2$Cash[atm$DATE == '2010-02-09' & atm$ATM == 'ATM4'] <- NA
atm <- atm2 %>%
model(ARIMA(Cash)) %>%
interpolate(atm2)
Three of the four series show sign of seasonality. To examine this we will look at the STL decomposition for each respectively.
atm %>% filter(ATM == 'ATM1') %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>% components() %>% autoplot() + ggtitle('ATM1 STL Decomposition')
atm %>% filter(ATM == 'ATM2') %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>% components() %>% autoplot() + ggtitle('ATM2 STL Decomposition')
atm %>% filter(ATM == 'ATM4') %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>% components() %>% autoplot() + ggtitle('ATM4 STL Decomposition')
For the ATM1, ATM2, and ATM3 we will fit the SNAIVE, ETS and ARIMA models. For the ETS model we will be using the Additive variation as the seasonal variation is constant throughout the series. We will be using RMSE as the criteria for model selection.
a1 <- atm %>% filter(ATM == "ATM1") %>% features(Cash,features=guerrero) %>% pull(lambda_guerrero)
atm1_mdl <- atm %>% filter(ATM == "ATM1") %>%
model(
SNAIVE = SNAIVE(Cash),
Additive_ETS = ETS(Cash ~ error("A") + trend("N") + season("A")),
ARIMA = ARIMA(box_cox(Cash, a1))
)
accuracy(atm1_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 3 x 2
## .model RMSE
## <chr> <dbl>
## 1 Additive_ETS 23.7
## 2 ARIMA 24.9
## 3 SNAIVE 27.6
For ATM1 the Additive ETS model is chosen as it has the lowest RMSE.
a2 <- atm %>% filter(ATM == "ATM2") %>% features(Cash,features=guerrero) %>% pull(lambda_guerrero)
atm2_mdl <- atm %>% filter(ATM == "ATM2") %>%
model(
SNAIVE = SNAIVE(Cash),
Additive_ETS = ETS(Cash ~ error("A") + trend("N") + season("A")),
ARIMA = ARIMA(box_cox(Cash, a2))
)
accuracy(atm2_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 3 x 2
## .model RMSE
## <chr> <dbl>
## 1 ARIMA 24.3
## 2 Additive_ETS 24.6
## 3 SNAIVE 29.6
atm2_mdl %>% select(.model = 'ARIMA') %>% report()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, a2)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4180 -0.9012 0.4579 0.7889 -0.7129
## s.e. 0.0592 0.0475 0.0888 0.0619 0.0419
##
## sigma^2 estimated as 44.97: log likelihood=-1188.76
## AIC=2389.51 AICc=2389.75 BIC=2412.8
For ATM2 the ARIMA(2,0,2)(0,1,1)[7] model is chosen as it has the lowest RMSE.
a4 <- atm %>% filter(ATM == "ATM4") %>% features(Cash,features=guerrero) %>% pull(lambda_guerrero)
atm4_mdl <- atm %>% filter(ATM == "ATM4") %>%
model(
SNAIVE = SNAIVE(Cash),
Additive_ETS = ETS(Cash ~ error("A") + trend("N") + season("A")),
ARIMA = ARIMA(box_cox(Cash, a4))
)
accuracy(atm4_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 3 x 2
## .model RMSE
## <chr> <dbl>
## 1 Additive_ETS 322.
## 2 ARIMA 345.
## 3 SNAIVE 451.
For ATM4 the Additive ETS model is chosen as it has the lowest RMSE.
For ATM3 we only have 3 data points with actual entries. Due to the lack of available data we will be using one of the simplier models, the MEAN model where the forecasts will be the average of the historical data. While we can omit forecasting for this ATM, creating a simplier model might be able to provide some insight.
atm3_mdl <- atm %>% filter(ATM == 'ATM3') %>% filter(Cash > 0) %>% model(MEAN(Cash))
accuracy(atm3_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 1 x 2
## .model RMSE
## <chr> <dbl>
## 1 MEAN(Cash) 6.02
fc1 <- atm1_mdl %>%
forecast(h = 31) %>%
filter(.model=='Additive_ETS')
fc1 %>%
autoplot(atm)
fc2 <- atm2_mdl %>%
forecast(h = 31) %>%
filter(.model=='ARIMA')
fc2 %>%
autoplot(atm)
fc3 <- atm3_mdl %>%
forecast(h = 31)
fc3 %>%
autoplot(atm)
fc4 <- atm4_mdl %>%
forecast(h = 31) %>%
filter(.model=='Additive_ETS')
fc4 %>%
autoplot(atm)
Below are the withdrawal forecasts for the May 2010.
may_fc <- bind_rows(fc1, fc2, fc3, fc4) %>%
as.data.frame() %>% select(DATE, ATM, .mean) %>%
rename(Cash = .mean)
may_fc %>%
as_tsibble(index = DATE, key = ATM) %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
ggtitle('May 2010 Withdrawal Forecasts')
#may_fc %>% write.xlsx('May_Withdrawal_Forecasts.xlsx')
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
rpu <- readxl::read_xlsx('Residential.xlsx')
head(rpu)
## # A tibble: 6 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
This dataset contains three columns: CaseSequence, YYYY-MMMM, KWH. YYYY-MMMM represents the year and month of the power usage and KWH represents the amount of power used. We will convert the year and month column to type year month and rename it to MONTH.
rpu <- rpu %>% as.data.frame() %>% rename(MONTH = `YYYY-MMM`) %>% mutate(MONTH = yearmonth(MONTH)) %>% as_tsibble(index = MONTH)
head(rpu)
## # A tsibble: 6 x 3 [1M]
## CaseSequence MONTH KWH
## <dbl> <mth> <dbl>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
summary(rpu$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
We have one missing value for KWh. We will be using the ARIMA method to replace the missing value with an estimate.
rpu <- rpu %>%
model(ARIMA(KWH)) %>%
interpolate(rpu)
rpu %>% autoplot(KWH)
We see that there are some outlier points. We will use the same ARIMA
approach to replace them with an estimate.
outliers <- rpu %>%
model(
stl = STL(KWH ~ season(window = "periodic"), robust = TRUE)
) |>
components()
outliers <- outliers |>
filter(
remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
)
outliers
## # A dable: 2 x 7 [1M]
## # Key: .model [1]
## # : KWH = trend + season_year + remainder
## .model MONTH KWH trend season_year remainder season_adjust
## <chr> <mth> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stl 2010 Jul 770523 6779462. 1266280. -7275219. -495757.
## 2 stl 2013 Dec 9606304 7123483. -402370. 2885192. 10008674.
rpu2 <- rpu
rpu2$KWH[rpu$KWH == 770523] <- NA
rpu2$KWH[rpu$KWH == 9606304] <- NA
rpu2
## # A tsibble: 192 x 2 [1M]
## MONTH KWH
## <mth> <dbl>
## 1 1998 Jan 6862583
## 2 1998 Feb 5838198
## 3 1998 Mar 5420658
## 4 1998 Apr 5010364
## 5 1998 May 4665377
## 6 1998 Jun 6467147
## 7 1998 Jul 8914755
## 8 1998 Aug 8607428
## 9 1998 Sep 6989888
## 10 1998 Oct 6345620
## # ... with 182 more rows
rpu <- rpu2 %>%
model(ARIMA(KWH)) %>%
interpolate(rpu2)
rpu[!complete.cases(rpu),]
## # A tsibble: 0 x 2 [?]
## # ... with 2 variables: MONTH <mth>, KWH <dbl>
We will look at the STL Decomposition of the series
rpu %>% autoplot(KWH)
rpu %>% model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>% components() %>% autoplot() + ggtitle('KWH STL Decomposition')
We see that the series shows strong seasonality and strong cyclic
behavior. There is also an increasing trend.
For this series we will fit the SNAIVE, ETS and ARIMA models. For the ETS model we will be using the Additive variation as the seasonal variation is constant throughout the series. We will be using RMSE as the criteria for model selection.
rpu_lam <- rpu %>% features(KWH,features=guerrero) %>% pull(lambda_guerrero)
rpu_mdl <- rpu %>% model(
SNAIVE = SNAIVE(KWH),
Additive_ETS = ETS(KWH ~ error("A") + trend("N") + season("A")),
ARIMA = ARIMA(box_cox(KWH, rpu_lam))
)
accuracy(rpu_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 3 x 2
## .model RMSE
## <chr> <dbl>
## 1 ARIMA 567025.
## 2 Additive_ETS 591531.
## 3 SNAIVE 779676.
rpu_mdl %>% select(.model = 'ARIMA') %>% report()
## Series: KWH
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, rpu_lam)
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 constant
## 0.2671 0.0587 0.2237 -0.7098 -0.4132 0.0031
## s.e. 0.0736 0.0806 0.0668 0.0725 0.0743 0.0010
##
## sigma^2 estimated as 7.702e-05: log likelihood=599.02
## AIC=-1184.04 AICc=-1183.39 BIC=-1161.69
For this series the ARIMA(0,0,3)(2,1,0) w/ drift model is chosen as it has the lowest RMSE.
rpu_fc <- rpu_mdl %>%
forecast(h = 12) %>%
filter(.model=='ARIMA')
rpu_fc %>%
autoplot(rpu)
Below are the power usage forecasts for 2014.
fc_2014 <- rpu_fc %>%
as.data.frame() %>% select(MONTH, .mean) %>%
rename(KWH = .mean)
fc_2014 %>%
as_tsibble(index = MONTH) %>%
autoplot(KWH) +
ggtitle('Residential Power Usage Forecasts 2014')
#fc_2014 %>% write.xlsx('Residential_Power_Usage_Forecasts_2014.xlsx')