library(tidyverse)
library(readxl)
library(ggfortify)
library(fpp3)
library(tsibble)
library(plotly)
library(xlsx)

Part A ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

ATM Data

atm <- readxl::read_xlsx('ATM.xlsx') %>% as_tsibble(index = DATE, key = ATM)

View Data

head(atm)

## # A tsibble: 6 x 3 [1]
## # Key:       ATM [1]
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39935 ATM1     82
## 3 39936 ATM1     85
## 4 39937 ATM1     90
## 5 39938 ATM1     99
## 6 39939 ATM1     88

This dataset contains three columns: Date, ATM, Cash. ATM represents one of four ATMs in question and Cash represents the amount withdrawn in hundreds of dollars. The DATE data is a series of five digit Excel dates. The first task will be to convert the date column to a more readable format.

atm <- atm %>% mutate(DATE = as.Date(DATE, origin = "1899-12-30"))

The DATE column is converted to type ‘date’ using the as.Date function.

head(atm)

## # A tsibble: 6 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1     96
## 2 2009-05-02 ATM1     82
## 3 2009-05-03 ATM1     85
## 4 2009-05-04 ATM1     90
## 5 2009-05-05 ATM1     99
## 6 2009-05-06 ATM1     88

Summerize Data

summary(atm)

##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

unique(atm$ATM)

## [1] "ATM1" "ATM2" "ATM3" "ATM4" NA

The data summaries tells us that the data ranges from May 1st 200 to May 14th 2009. The ATM column has four unique values representing each ATM and some missing values. The Cash column also has some missing values.

atm[!complete.cases(atm), ]

## # A tsibble: 19 x 3 [1D]
## # Key:       ATM [3]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-06-13 ATM1     NA
##  2 2009-06-16 ATM1     NA
##  3 2009-06-22 ATM1     NA
##  4 2009-06-18 ATM2     NA
##  5 2009-06-24 ATM2     NA
##  6 2010-05-01 <NA>     NA
##  7 2010-05-02 <NA>     NA
##  8 2010-05-03 <NA>     NA
##  9 2010-05-04 <NA>     NA
## 10 2010-05-05 <NA>     NA
## 11 2010-05-06 <NA>     NA
## 12 2010-05-07 <NA>     NA
## 13 2010-05-08 <NA>     NA
## 14 2010-05-09 <NA>     NA
## 15 2010-05-10 <NA>     NA
## 16 2010-05-11 <NA>     NA
## 17 2010-05-12 <NA>     NA
## 18 2010-05-13 <NA>     NA
## 19 2010-05-14 <NA>     NA

Upon further examination we see that there are no ATM and Cash values for the last 15 rows. All 15 rows are for the month of May which we will be forecasting, so we will go ahead and filter those out of the dataset. The other five data points are parts of the time series and have missing Cash values. We will need to find the best approach for these values. We will now filter the forecast values out of the data set so that the only rows with missing values will be the five missing Cash value points.

atm <- atm %>% filter(!is.na(ATM))

atm[!complete.cases(atm), ]

## # A tsibble: 5 x 3 [1D]
## # Key:       ATM [2]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-22 ATM1     NA
## 4 2009-06-18 ATM2     NA
## 5 2009-06-24 ATM2     NA

The most common approaches to dealing with missing data are to either delete it from the training data or to impute it. Since we are working with time series, deleting data would cause more harm than good. This leaves us with the option of imputing the data. Chapter 13 section 9 from Forecasting: Principals and Practice discusses the best practices for outliers and missing data when working with time series. The text states that some models will work and produce forecast with missing values, while some will cause errors. In this case we will be proactive and fix the missing values. Here we will use the second approach and replace the missing values with estimates using the ARIMA model.

atm <- atm %>%
  model(ARIMA(Cash)) %>%
  interpolate(atm)

Now that we have dealt with missing values, lets go ahead and take a look at the time series for each ATM.

atm %>%
  autoplot(Cash)

atm %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4)

If we look at the first plot with all four ATMs plotted together, we can see that ATM4 has significantly higher withdrawals than the other three ATMs and has a major spike around the beginning of February. ATM1 and ATM2 seems to be moving at the same pace, while ATM3 is at a flatline. Looking at each ATM individually, we see that ATM1 and ATM2 show strong seasonality and strong cyclic behavior. They seem to be stationary with no sign of an increasing or decreasing trend. ATM3 shows no active use for the most part of the series with withdrawals starting towards the end of April. Some reasons for this might be due to the ATM being closed or an out of order status. ATM4 also shows a strong cyclic behavior and seasonality. In terms of outliers, we will need to address the spike in ATM4.

Referring back to the the FPP text, we can identify outliers as those that are greater than 3 interquartile ranges (IQRs) from the central 50% of the data. Once we identify the outliers for ATM4, we will replace them with estimates from the ARIMA model.

outliers <- atm |>
  filter(ATM == 'ATM4') |>
  model(
    stl = STL(Cash ~ season(window = "periodic"), robust = TRUE)
  ) |>
  components()
outliers <- outliers |>
  filter(
    remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
    remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
  )

outliers

## # A dable: 2 x 8 [1D]
## # Key:     ATM, .model [1]
## # :        Cash = trend + season_week + remainder
##   ATM   .model DATE         Cash trend season_week remainder season_adjust
##   <chr> <chr>  <date>      <dbl> <dbl>       <dbl>     <dbl>         <dbl>
## 1 ATM4  stl    2009-09-22  1712.  373.       -71.5     1411.         1784.
## 2 ATM4  stl    2010-02-09 10920.  279.       -71.5    10712.        10991.

atm2 <- atm
atm2$Cash[atm$DATE == '2009-09-22' & atm$ATM == 'ATM4'] <- NA
atm2$Cash[atm$DATE == '2010-02-09' & atm$ATM == 'ATM4'] <- NA
atm <- atm2 %>%
  model(ARIMA(Cash)) %>%
  interpolate(atm2)

Plots and STL Decomposition

Three of the four series show sign of seasonality. To examine this we will look at the STL decomposition for each respectively.

atm %>% filter(ATM == 'ATM1') %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%  components() %>% autoplot() + ggtitle('ATM1 STL Decomposition')

atm %>% filter(ATM == 'ATM2') %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%  components() %>% autoplot() + ggtitle('ATM2 STL Decomposition')

atm %>% filter(ATM == 'ATM4') %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%  components() %>% autoplot() + ggtitle('ATM4 STL Decomposition')

Modeling

ATM1, ATM2, ATM4

For the ATM1, ATM2, and ATM3 we will fit the SNAIVE, ETS and ARIMA models. For the ETS model we will be using the Additive variation as the seasonal variation is constant throughout the series. We will be using RMSE as the criteria for model selection.

a1 <- atm %>% filter(ATM == "ATM1") %>% features(Cash,features=guerrero) %>% pull(lambda_guerrero)

atm1_mdl <- atm %>% filter(ATM == "ATM1") %>%
  model(
    SNAIVE = SNAIVE(Cash),
    Additive_ETS = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ARIMA = ARIMA(box_cox(Cash, a1))
  )
accuracy(atm1_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)

## # A tibble: 3 x 2
##   .model        RMSE
##   <chr>        <dbl>
## 1 Additive_ETS  23.7
## 2 ARIMA         24.9
## 3 SNAIVE        27.6

For ATM1 the Additive ETS model is chosen as it has the lowest RMSE.

a2 <- atm %>% filter(ATM == "ATM2") %>% features(Cash,features=guerrero) %>% pull(lambda_guerrero)

atm2_mdl <- atm %>% filter(ATM == "ATM2") %>%
  model(
    SNAIVE = SNAIVE(Cash),
    Additive_ETS = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ARIMA = ARIMA(box_cox(Cash, a2))
  )

accuracy(atm2_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)

## # A tibble: 3 x 2
##   .model        RMSE
##   <chr>        <dbl>
## 1 ARIMA         24.3
## 2 Additive_ETS  24.6
## 3 SNAIVE        29.6

atm2_mdl %>% select(.model = 'ARIMA') %>% report()

## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, a2) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4180  -0.9012  0.4579  0.7889  -0.7129
## s.e.   0.0592   0.0475  0.0888  0.0619   0.0419
## 
## sigma^2 estimated as 44.97:  log likelihood=-1188.76
## AIC=2389.51   AICc=2389.75   BIC=2412.8

For ATM2 the ARIMA(2,0,2)(0,1,1)[7] model is chosen as it has the lowest RMSE.

a4 <- atm %>% filter(ATM == "ATM4") %>% features(Cash,features=guerrero) %>% pull(lambda_guerrero)

atm4_mdl <- atm %>% filter(ATM == "ATM4") %>%
  model(
    SNAIVE = SNAIVE(Cash),
    Additive_ETS = ETS(Cash ~ error("A") + trend("N") + season("A")),
    ARIMA = ARIMA(box_cox(Cash, a4))
  )
accuracy(atm4_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)

## # A tibble: 3 x 2
##   .model        RMSE
##   <chr>        <dbl>
## 1 Additive_ETS  322.
## 2 ARIMA         345.
## 3 SNAIVE        451.

For ATM4 the Additive ETS model is chosen as it has the lowest RMSE.

ATM3

For ATM3 we only have 3 data points with actual entries. Due to the lack of available data we will be using one of the simplier models, the MEAN model where the forecasts will be the average of the historical data. While we can omit forecasting for this ATM, creating a simplier model might be able to provide some insight.

atm3_mdl <- atm %>% filter(ATM == 'ATM3') %>% filter(Cash > 0) %>% model(MEAN(Cash))
accuracy(atm3_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)

## # A tibble: 1 x 2
##   .model      RMSE
##   <chr>      <dbl>
## 1 MEAN(Cash)  6.02

Forecasting

fc1 <- atm1_mdl %>%
  forecast(h = 31) %>%
  filter(.model=='Additive_ETS')

fc1 %>%
  autoplot(atm)

fc2 <- atm2_mdl %>%
  forecast(h = 31) %>%
  filter(.model=='ARIMA')

fc2 %>%
  autoplot(atm)

fc3 <- atm3_mdl %>%
  forecast(h = 31)

fc3 %>%
  autoplot(atm)

fc4 <- atm4_mdl %>%
  forecast(h = 31) %>%
  filter(.model=='Additive_ETS')

fc4 %>%
  autoplot(atm)

Below are the withdrawal forecasts for the May 2010.

may_fc <- bind_rows(fc1, fc2, fc3, fc4) %>%
  as.data.frame() %>% select(DATE, ATM, .mean) %>%
  rename(Cash = .mean)

may_fc %>%
  as_tsibble(index = DATE, key = ATM) %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  ggtitle('May 2010 Withdrawal Forecasts')

#may_fc %>% write.xlsx('May_Withdrawal_Forecasts.xlsx')

Part B Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Residential Power Usage Data

rpu <- readxl::read_xlsx('Residential.xlsx')

View Data

head(rpu)

## # A tibble: 6 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

This dataset contains three columns: CaseSequence, YYYY-MMMM, KWH. YYYY-MMMM represents the year and month of the power usage and KWH represents the amount of power used. We will convert the year and month column to type year month and rename it to MONTH.

rpu <- rpu %>% as.data.frame() %>% rename(MONTH = `YYYY-MMM`) %>%  mutate(MONTH = yearmonth(MONTH)) %>% as_tsibble(index = MONTH)

head(rpu)

## # A tsibble: 6 x 3 [1M]
##   CaseSequence    MONTH     KWH
##          <dbl>    <mth>   <dbl>
## 1          733 1998 Jan 6862583
## 2          734 1998 Feb 5838198
## 3          735 1998 Mar 5420658
## 4          736 1998 Apr 5010364
## 5          737 1998 May 4665377
## 6          738 1998 Jun 6467147

Summerize Data

summary(rpu$KWH)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1

We have one missing value for KWh. We will be using the ARIMA method to replace the missing value with an estimate.

rpu <- rpu %>%
  model(ARIMA(KWH)) %>%
  interpolate(rpu)

rpu %>% autoplot(KWH)

We see that there are some outlier points. We will use the same ARIMA approach to replace them with an estimate.

outliers <- rpu %>%
  model(
    stl = STL(KWH ~ season(window = "periodic"), robust = TRUE)
  ) |>
  components()
outliers <- outliers |>
  filter(
    remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
    remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
  )

outliers

## # A dable: 2 x 7 [1M]
## # Key:     .model [1]
## # :        KWH = trend + season_year + remainder
##   .model    MONTH     KWH    trend season_year remainder season_adjust
##   <chr>     <mth>   <dbl>    <dbl>       <dbl>     <dbl>         <dbl>
## 1 stl    2010 Jul  770523 6779462.    1266280. -7275219.      -495757.
## 2 stl    2013 Dec 9606304 7123483.    -402370.  2885192.     10008674.

rpu2 <- rpu
rpu2$KWH[rpu$KWH == 770523] <- NA
rpu2$KWH[rpu$KWH == 9606304] <- NA
rpu2

## # A tsibble: 192 x 2 [1M]
##       MONTH     KWH
##       <mth>   <dbl>
##  1 1998 Jan 6862583
##  2 1998 Feb 5838198
##  3 1998 Mar 5420658
##  4 1998 Apr 5010364
##  5 1998 May 4665377
##  6 1998 Jun 6467147
##  7 1998 Jul 8914755
##  8 1998 Aug 8607428
##  9 1998 Sep 6989888
## 10 1998 Oct 6345620
## # ... with 182 more rows

rpu <- rpu2 %>%
  model(ARIMA(KWH)) %>%
  interpolate(rpu2)

rpu[!complete.cases(rpu),]

## # A tsibble: 0 x 2 [?]
## # ... with 2 variables: MONTH <mth>, KWH <dbl>

Plots and STL Decomposition

We will look at the STL Decomposition of the series

rpu %>% autoplot(KWH)

rpu %>% model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%  components() %>% autoplot() + ggtitle('KWH STL Decomposition')

We see that the series shows strong seasonality and strong cyclic behavior. There is also an increasing trend.

Modeling

For this series we will fit the SNAIVE, ETS and ARIMA models. For the ETS model we will be using the Additive variation as the seasonal variation is constant throughout the series. We will be using RMSE as the criteria for model selection.

rpu_lam <- rpu %>% features(KWH,features=guerrero) %>% pull(lambda_guerrero)

rpu_mdl <- rpu %>%  model(
    SNAIVE = SNAIVE(KWH),
    Additive_ETS = ETS(KWH ~ error("A") + trend("N") + season("A")),
    ARIMA = ARIMA(box_cox(KWH, rpu_lam))
  )
accuracy(rpu_mdl) %>% select(.model, RMSE) %>% arrange(RMSE)

## # A tibble: 3 x 2
##   .model          RMSE
##   <chr>          <dbl>
## 1 ARIMA        567025.
## 2 Additive_ETS 591531.
## 3 SNAIVE       779676.

rpu_mdl %>% select(.model = 'ARIMA') %>% report()

## Series: KWH 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## Transformation: box_cox(KWH, rpu_lam) 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  constant
##       0.2671  0.0587  0.2237  -0.7098  -0.4132    0.0031
## s.e.  0.0736  0.0806  0.0668   0.0725   0.0743    0.0010
## 
## sigma^2 estimated as 7.702e-05:  log likelihood=599.02
## AIC=-1184.04   AICc=-1183.39   BIC=-1161.69

For this series the ARIMA(0,0,3)(2,1,0) w/ drift model is chosen as it has the lowest RMSE.

Forecasting

rpu_fc <- rpu_mdl %>%
  forecast(h = 12) %>%
  filter(.model=='ARIMA')

rpu_fc %>%
  autoplot(rpu)

Below are the power usage forecasts for 2014.

fc_2014 <- rpu_fc %>%
  as.data.frame() %>% select(MONTH, .mean) %>%
  rename(KWH = .mean)

fc_2014 %>%
  as_tsibble(index = MONTH) %>%
  autoplot(KWH) +
  ggtitle('Residential Power Usage Forecasts 2014')

#fc_2014 %>% write.xlsx('Residential_Power_Usage_Forecasts_2014.xlsx')

Data 624 Project 1

Krutika Patel

2023-03-25

Part A ATM Forecast

ATM Data

View Data

Summerize Data

Plots and STL Decomposition

Modeling

ATM1, ATM2, ATM4

ATM3

Forecasting

Part B Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Residential Power Usage Data

View Data

Summerize Data

Plots and STL Decomposition

Modeling

Forecasting