Project1

Author

Semyon Toybis

Project 1

A - ATM Data

We are tasked with forecasting how much cash is taken out of four different ATM machines for the month of May 2010.

First, I import the data:

atm_data <- read_xlsx('ATM624Data.xlsx', col_types = c('date','text','numeric'))

head(atm_data)
# A tibble: 6 × 3
  DATE                ATM    Cash
  <dttm>              <chr> <dbl>
1 2009-05-01 00:00:00 ATM1     96
2 2009-05-01 00:00:00 ATM2    107
3 2009-05-02 00:00:00 ATM1     82
4 2009-05-02 00:00:00 ATM2     89
5 2009-05-03 00:00:00 ATM1     85
6 2009-05-03 00:00:00 ATM2     90
summary(atm_data)
      DATE                            ATM                 Cash        
 Min.   :2009-05-01 00:00:00.00   Length:1474        Min.   :    0.0  
 1st Qu.:2009-08-01 00:00:00.00   Class :character   1st Qu.:    0.5  
 Median :2009-11-01 00:00:00.00   Mode  :character   Median :   73.0  
 Mean   :2009-10-31 19:11:48.27                      Mean   :  155.6  
 3rd Qu.:2010-02-01 00:00:00.00                      3rd Qu.:  114.0  
 Max.   :2010-05-14 00:00:00.00                      Max.   :10919.8  
                                                     NA's   :19       

I will sort the ATM in time order for ease of analysis:

atm_data <- atm_data |> arrange(DATE)

Next, I take a look to see where the NAs for the Cash variable are coming from:

atm_data |> filter(is.na(Cash))
# A tibble: 19 × 3
   DATE                ATM    Cash
   <dttm>              <chr> <dbl>
 1 2009-06-13 00:00:00 ATM1     NA
 2 2009-06-16 00:00:00 ATM1     NA
 3 2009-06-18 00:00:00 ATM2     NA
 4 2009-06-22 00:00:00 ATM1     NA
 5 2009-06-24 00:00:00 ATM2     NA
 6 2010-05-01 00:00:00 <NA>     NA
 7 2010-05-02 00:00:00 <NA>     NA
 8 2010-05-03 00:00:00 <NA>     NA
 9 2010-05-04 00:00:00 <NA>     NA
10 2010-05-05 00:00:00 <NA>     NA
11 2010-05-06 00:00:00 <NA>     NA
12 2010-05-07 00:00:00 <NA>     NA
13 2010-05-08 00:00:00 <NA>     NA
14 2010-05-09 00:00:00 <NA>     NA
15 2010-05-10 00:00:00 <NA>     NA
16 2010-05-11 00:00:00 <NA>     NA
17 2010-05-12 00:00:00 <NA>     NA
18 2010-05-13 00:00:00 <NA>     NA
19 2010-05-14 00:00:00 <NA>     NA

From this, I can see that there are NAs for a few ATMS in the month of June 2009 and then several NAs for the month of 2010 that do not have cash nor ATM information. Since we are tasked with forecasting withdrawals for the month of May 2010, I will drop the NAs from May 2010 that do not have ATM nor cash information. I will need to determine what to do with the NAs for the observations in June 2009 later.

atm_data <- atm_data |> drop_na(ATM)
atm_data |> filter(is.na(Cash))
# A tibble: 5 × 3
  DATE                ATM    Cash
  <dttm>              <chr> <dbl>
1 2009-06-13 00:00:00 ATM1     NA
2 2009-06-16 00:00:00 ATM1     NA
3 2009-06-18 00:00:00 ATM2     NA
4 2009-06-22 00:00:00 ATM1     NA
5 2009-06-24 00:00:00 ATM2     NA

Next, I convert the data frame to a tsibble to make it more conducive for time series analysis:

atm_data <- atm_data |> mutate(Date = ymd(DATE)) |> as_tsibble(index = Date, key = ATM) |> select(-DATE)

Next, I visualize the data to get a better understanding of what it looks like.

atm_data |> autoplot(Cash)

ATM 4 has higher cash withdrawals, so I separate the charts to get a better understanding of the data.

atm_data |> autoplot(Cash) + facet_wrap(vars(ATM), scales = 'free_y')

The plots reveal interesting findings about the data. ATM1 and ATM2 seem similar in nature while ATM3 is unusual as its withdrawals were consistently at zero until recently. It is possible that this is a new ATM that was not in service until recently and that the data was entered as zeros rather than NAs. Lastly, ATM4 has one massive outlier, without which it might look more similar to ATM1 and ATM2.

Given that the time series has daily data over a period of one year, building a forecast model will be difficult as there will be multiple seasonal periods involved (eg, weekly, monthly, quarterly). Correcting for this would involve advance methods, so for now I will work with the data as is and use ARIMA and ETS models to generate forecasts. An alternative would be to aggregate the data to weekly or monthly values; however, based on the prompt, I suspect that daily values are of more interest.

ATM 1

atm_one <- atm_data |> filter(ATM=='ATM1')
atm_one |> autoplot(Cash)

As seen in the plot above and discussed earlier, ATM1 has a few observations that are NAs:

atm_one |> filter(is.na(Cash))
# A tsibble: 3 x 3 [1D]
# Key:       ATM [1]
  ATM    Cash Date      
  <chr> <dbl> <date>    
1 ATM1     NA 2009-06-13
2 ATM1     NA 2009-06-16
3 ATM1     NA 2009-06-22

There are three missing values for ATM1 and based on the pattern of the time series, I will interpolate these via an average of the surrounding values. I use the na_ma function from the imputeTS package:

atm_one <- na_ma(atm_one, k = 1)
atm_one |> filter(is.na(Cash))
# A tsibble: 0 x 3 [?]
# Key:       ATM [0]
# ℹ 3 variables: ATM <chr>, Cash <dbl>, Date <date>

Next, I will use the Guerrero method to see if I can make the variance of the time series more stable. I will check what an appropriate lambda value for a Box-Cox transformation is via the Guerrero method:

lambda_atm_one <- atm_one |>
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

print(lambda_atm_one)
[1] 0.261571

The lambda is close to 0.25, so for simplicity I will transform the data by raising it to the quarter root.

atm_one <- atm_one |> mutate(quarter_root_cash = Cash ^ 0.25)

In order to create a forecast, I will split the data into a train and test set using the 80/20 method. I will try fitting an ETS model, an ARIMA model, and for comparison purposes a Seasonal NAIVE model to see which one performs best. I suspect that the Holt-Winters method with additive seasonality (the level of seasonality seems constant in this time series) could work, as well as a seasonal ARIMA model. The Seasonal Naive model will be used to evaluate whether the ETS or ARIMA models are an improvement over a simple approach.

atm_one_train_cutoff <- floor(0.8 * nrow(atm_one))
atm_one_train <- atm_one[1:atm_one_train_cutoff, ]
atm_one_test <- atm_one[(atm_one_train_cutoff + 1):nrow(atm_one), ]
atm_one_fit <- atm_one_train |> model(
  ETS = ETS(quarter_root_cash),
  ARIMA = ARIMA(quarter_root_cash),
  SNAIVE = SNAIVE(quarter_root_cash)
)
atm_one_fit
# A mable: 1 x 4
# Key:     ATM [1]
  ATM            ETS                    ARIMA   SNAIVE
  <chr>      <model>                  <model>  <model>
1 ATM1  <ETS(M,N,M)> <ARIMA(0,0,1)(0,1,1)[7]> <SNAIVE>

Below are what the residuals of the models look like

ETS:

atm_one_fit |> select(ETS) |> gg_tsresiduals()

ARIMA:

atm_one_fit |> select(ARIMA) |> gg_tsresiduals()

SNAIVE:

atm_one_fit |> select(SNAIVE) |> gg_tsresiduals()
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_bin()`).

The residuals for the ETS model and ARIMA model seem like white noise.

Below is the accuracy of the training models:

atm_one_fit |> accuracy() |> select(-ME, -MPE, -ACF1)
# A tibble: 3 × 8
  ATM   .model .type     RMSE   MAE  MAPE  MASE RMSSE
  <chr> <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 ATM1  ETS    Training 0.279 0.145  6.46 0.735 0.737
2 ATM1  ARIMA  Training 0.279 0.154  6.65 0.781 0.737
3 ATM1  SNAIVE Training 0.379 0.197  8.30 1     1    

Now that I have the models, I will try forecasting the test set. The test set has 73 observations:

nrow(atm_one_test)
[1] 73
atm_one_test_forecast <- atm_one_fit |>
  forecast(atm_one_test)

atm_one_test_forecast |> autoplot(atm_one_test, level=NULL) + guides(color=guide_legend(title = 'Forecast method'))

The chart is a bit noisy so I also produce an accuracy table:

accuracy(atm_one_test_forecast, atm_one) |> select(-ME, -MPE, -ACF1)
# A tibble: 3 × 8
  .model ATM   .type  RMSE   MAE  MAPE  MASE RMSSE
  <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA  ATM1  Test  0.814 0.540  27.7  2.74  2.15
2 ETS    ATM1  Test  0.816 0.526  27.2  2.67  2.16
3 SNAIVE ATM1  Test  0.865 0.654  31.2  3.32  2.28

The ETS model has the lowest MAPE, so I will use that for forecasting. However, it should be noted that because we have daily data over a long stretch of time, there are multiple seasonal patterns involved (daily, weekly, monthly, quarterly), which creates difficulty in forecasting. This may explain why the MAPE of 27% is somewhat high. Correcting for this would require more advanced methods than we have covered, so for now I will go with the ETS model.

#atm_one_may_forecast <- atm_one_fit |> forecast(atm_one_test)

#atm_one_may_forecast |> autoplot(atm_one, level = NULL) 

atm_one_may_forecast <- atm_one |> model(ETS(quarter_root_cash)) |> forecast(h = 31)
atm_one_may_forecast |> autoplot(atm_one)

As we can see, the confidence intervals are fairly wide, making the point forecast somewhat less dependable.

ATM 2

I will repeat the process from ATM1 for ATM2.

atm_two <- atm_data |> filter(ATM=='ATM2')
atm_two |> autoplot(Cash)

Next, I check for NAs

atm_two |> filter(is.na(Cash))
# A tsibble: 2 x 3 [1D]
# Key:       ATM [1]
  ATM    Cash Date      
  <chr> <dbl> <date>    
1 ATM2     NA 2009-06-18
2 ATM2     NA 2009-06-24

I will use the same method as ATM1 to impute these values:

atm_two <- na_ma(atm_two, k = 1)
atm_two |> filter(is.na(Cash))
# A tsibble: 0 x 3 [?]
# Key:       ATM [0]
# ℹ 3 variables: ATM <chr>, Cash <dbl>, Date <date>

Next, I will use the Guerrero method to see if I can make the variance of the time series more stable. I will check what an appropriate lambda value for a Box-Cox transformation is via the Guerrero method:

lambda_atm_two <- atm_two |>
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

print(lambda_atm_two)
[1] 0.7242767

The lambda is close to 0.75, so for simplicity I will transform the data by raising it to the power of three-fourths.

atm_two <- atm_two |> mutate(transformed_cash = Cash ^ 0.75)
atm_two |> autoplot(transformed_cash)

In order to create a forecast, I will split the data into a train and test set using the 80/20 method. I will try fitting an ETS model, an ARIMA model, and for comparison purposes a Seasonal NAIVE model to see which one performs best.

atm_two_train_cutoff <- floor(0.8 * nrow(atm_two))
atm_two_train <- atm_two[1:atm_two_train_cutoff, ]
atm_two_test <- atm_two[(atm_two_train_cutoff + 1):nrow(atm_two), ]
atm_two_fit <- atm_two_train |> model(
  ETS = ETS(transformed_cash),
  ARIMA = ARIMA(transformed_cash),
  SNAIVE = SNAIVE(transformed_cash)
)
atm_two_fit
# A mable: 1 x 4
# Key:     ATM [1]
  ATM            ETS                             ARIMA   SNAIVE
  <chr>      <model>                           <model>  <model>
1 ATM2  <ETS(A,A,A)> <ARIMA(0,0,1)(0,1,1)[7] w/ drift> <SNAIVE>

Below are what the residuals of the models look like

ETS:

atm_two_fit |> select(ETS) |> gg_tsresiduals()

ARIMA:

atm_two_fit |> select(ARIMA) |> gg_tsresiduals()

SNAIVE:

atm_two_fit |> select(SNAIVE) |> gg_tsresiduals()
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_bin()`).

The residuals for the ETS model and ARIMA model seem like white noise.

Below is the accuracy of the training models:

atm_two_fit |> accuracy() |> select(-ME, -MPE, -ACF1)
# A tibble: 3 × 8
  ATM   .model .type     RMSE   MAE  MAPE  MASE RMSSE
  <chr> <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 ATM2  ETS    Training  5.92  4.09   Inf 0.713 0.713
2 ATM2  ARIMA  Training  5.91  3.99   Inf 0.695 0.712
3 ATM2  SNAIVE Training  8.31  5.74   Inf 1     1    

Now that I have the models, I will try forecasting the test set. The test set has 73 observations:

nrow(atm_two_test)
[1] 73
atm_two_test_forecast <- atm_two_fit |>
  forecast(atm_two_test)

atm_two_test_forecast |> autoplot(atm_two_test, level=NULL) + guides(color=guide_legend(title = 'Forecast method'))

The chart is a bit noisy so I also produce an accuracy table:

accuracy(atm_two_test_forecast, atm_two) |> select(-ME, -MPE, -ACF1)
# A tibble: 3 × 8
  .model ATM   .type  RMSE   MAE  MAPE  MASE RMSSE
  <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA  ATM2  Test   19.1  17.3   Inf  3.02  2.29
2 ETS    ATM2  Test   18.8  16.6   Inf  2.89  2.27
3 SNAIVE ATM2  Test   18.1  16.1   Inf  2.80  2.17

Interestingly, both the ARIMA and ETS model performed worse than the SNAIVE model based on RMSE and MAE. Thus, I will use the SNAIVE model to forecast data for May.

#atm_one_may_forecast <- atm_one_fit |> forecast(atm_one_test)

#atm_one_may_forecast |> autoplot(atm_one, level = NULL) 

atm_two_may_forecast <- atm_two |> model(SNAIVE(transformed_cash)) |> forecast(h = 31)
atm_two_may_forecast |> autoplot(atm_two)

ATM 3

atm_three <- atm_data |> filter(ATM=='ATM3')
atm_three |> autoplot(Cash)

As discussed, ATM3 is unique in that it seems like something about it changed. For example, it is possible that this location does not usually receive much foot traffic, but had an outlier day where a person visited and made a withdrawal. It is also possible that the location of the ATM changed places (went from a low traffic location to a higher traffic one). This makes it difficult to forecast. I will generate a forecast for the purposes of the exercise but more information is needed about the outlier before determining an appropriate forecast method.

There are no missing values in the dataset:

atm_three |> filter(is.na(Cash))
# A tsibble: 0 x 3 [?]
# Key:       ATM [0]
# ℹ 3 variables: ATM <chr>, Cash <dbl>, Date <date>

I will forgo training and test sets since they will mostly contain zeros. Below are the fitted models (I used a NIAVE model as opposed to SNAIVE as the benchmark).

atm_three_fit <- atm_three |> model(
  ETS = ETS(Cash),
  ARIMA = ARIMA(Cash),
  NAIVE = NAIVE(Cash)
)
atm_three_fit
# A mable: 1 x 4
# Key:     ATM [1]
  ATM            ETS          ARIMA   NAIVE
  <chr>      <model>        <model> <model>
1 ATM3  <ETS(A,N,N)> <ARIMA(0,0,2)> <NAIVE>

Below are the forecasts generated by the models:

atm_three_forecast <- atm_three_fit |> forecast(h=31)

atm_three_forecast |> autoplot(atm_three, level = NULL)

Again, this data set requires more investigation on what caused the spike in ATM withdrawals and what withdrawals could look like going forward.

ATM 4

atm_four <- atm_data |> filter(ATM=='ATM4')
atm_four |> autoplot(Cash)

This dataset set has a large outlier, which seems due to some kind of exogenous even that is not normally present. I will keep the outlier as is for modeling purposes, assuming it is a valid measurement, however it would be beneficial to know what may have caused the outlier measurement.

Next, I check for NAs:

atm_four |> filter(is.na(Cash))
# A tsibble: 0 x 3 [?]
# Key:       ATM [0]
# ℹ 3 variables: ATM <chr>, Cash <dbl>, Date <date>

Next, I will use the Guerrero method to see if I can make the variance of the time series more stable. I will check what an appropriate lambda value for a Box-Cox transformation is via the Guerrero method:

lambda_atm_four <- atm_four |>
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

print(lambda_atm_four)
[1] -0.0737252

The lambda is close to 0, so for simplicity I will transform the data by taking a log of the data.

atm_four <- atm_four |> mutate(log_cash = log(Cash))
atm_four |> autoplot(log_cash)

In order to create a forecast, I will split the data into a train and test set using the 80/20 method. I will try fitting an ETS model, an ARIMA model, and for comparison purposes a Seasonal NAIVE model to see which one performs best.

atm_four_train_cutoff <- floor(0.8 * nrow(atm_four))
atm_four_train <- atm_four[1:atm_four_train_cutoff, ]
atm_four_test <- atm_four[(atm_four_train_cutoff + 1):nrow(atm_four), ]
atm_four_fit <- atm_four_train |> model(
  ETS = ETS(log_cash),
  ARIMA = ARIMA(log_cash),
  SNAIVE = SNAIVE(log_cash)
)
atm_four_fit
# A mable: 1 x 4
# Key:     ATM [1]
  ATM            ETS                            ARIMA   SNAIVE
  <chr>      <model>                          <model>  <model>
1 ATM4  <ETS(M,N,A)> <ARIMA(0,0,0)(2,0,0)[7] w/ mean> <SNAIVE>

Below are what the residuals look like:

ETS:

atm_four_fit |> select(ETS) |> gg_tsresiduals()

atm_four_fit |> select(ARIMA) |> gg_tsresiduals()

SNAIVE:

atm_four_fit |> select(SNAIVE) |> gg_tsresiduals()
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_bin()`).

Since the models have a few significant spikes at various lags, I will conduct an Ljung-Box test to see if the residuals resemble white noise.

ETS:

atm_four_fit |> select(ETS) |> augment() |> features(.innov, ljung_box, lag = 14)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 ETS       24.1    0.0443

The low p-value suggests that the residuals do not resemble white noise.

ARIMA:

atm_four_fit |> select(ARIMA) |> augment() |> features(.innov, ljung_box, lag = 14)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 ARIMA     17.2     0.246

The high p-value suggest the residuals resemble white noise.

SNAIVE:

atm_four_fit |> select(SNAIVE) |> augment() |> features(.innov, ljung_box, lag = 14)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 SNAIVE    81.4  1.53e-11

The low p-value suggests that the residuals do not resemble white noise.

Based, on this, I suspect the ARIMA model will perform best on the test set.

Below is the accuracy of the models on the training set:

atm_four_fit |> accuracy() |> select(-ME, -MPE, -ACF1)
# A tibble: 3 × 8
  ATM   .model .type     RMSE   MAE  MAPE  MASE RMSSE
  <chr> <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 ATM4  ETS    Training  1.17 0.893  24.4 0.743 0.714
2 ATM4  ARIMA  Training  1.29 0.999  27.2 0.831 0.788
3 ATM4  SNAIVE Training  1.64 1.20   30.3 1     1    

Interestingly, the ETS model has the lowest RMSE and MAPE.

Now that I have the models, I will try forecasting the test set. The test set has 73 observations:

atm_four_test_forecast <- atm_four_fit |>
  forecast(atm_four_test)

atm_four_test_forecast |> autoplot(atm_four_test, level=NULL) + guides(color=guide_legend(title = 'Forecast method'))

The chart is a bit noisy so I also produce an accuracy table:

accuracy(atm_four_test_forecast, atm_four) |> select(-ME, -MPE, -ACF1)
# A tibble: 3 × 8
  .model ATM   .type  RMSE   MAE  MAPE  MASE RMSSE
  <chr>  <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA  ATM4  Test   1.51  1.15  34.8 0.956 0.923
2 ETS    ATM4  Test   1.75  1.30  38.6 1.08  1.07 
3 SNAIVE ATM4  Test   2.11  1.61  47.1 1.34  1.29 

The ARMIA model has the lowest RMSE and MAPE, so I will use it to forecast the data for May.

atm_four_may_forecast <- atm_four |> model(SNAIVE(log_cash)) |> forecast(h = 31)
atm_four_may_forecast |> autoplot(atm_four)

The forecast has wide confidence intervals, making the point forecast somewhat less reliable.

B - Forecasting Power

We are tasked with creating a monthly forecast of residential power usage for 2014 based on the below data set:

power_data <- read_xlsx('ResidentialCustomerForecastLoad-624.xlsx')

head(power_data)
# A tibble: 6 × 3
  CaseSequence `YYYY-MMM`     KWH
         <dbl> <chr>        <dbl>
1          733 1998-Jan   6862583
2          734 1998-Feb   5838198
3          735 1998-Mar   5420658
4          736 1998-Apr   5010364
5          737 1998-May   4665377
6          738 1998-Jun   6467147

I will drop the CaseSequence column and convert the YYYY-MMM column to date format:

power_data <- power_data |> select(-CaseSequence)
power_data$`YYYY-MMM` <- parse_date_time(power_data$`YYYY-MMM`,'ym')

head(power_data)
# A tibble: 6 × 2
  `YYYY-MMM`              KWH
  <dttm>                <dbl>
1 1998-01-01 00:00:00 6862583
2 1998-02-01 00:00:00 5838198
3 1998-03-01 00:00:00 5420658
4 1998-04-01 00:00:00 5010364
5 1998-05-01 00:00:00 4665377
6 1998-06-01 00:00:00 6467147

Next, I will convert to a tsibble for easier time series analysis

power_data <- power_data |> mutate(Month = yearmonth(`YYYY-MMM`)) |> select(-`YYYY-MMM`) |>  as_tsibble(index = Month)


head(power_data)
# A tsibble: 6 x 2 [1M]
      KWH    Month
    <dbl>    <mth>
1 6862583 1998 Jan
2 5838198 1998 Feb
3 5420658 1998 Mar
4 5010364 1998 Apr
5 4665377 1998 May
6 6467147 1998 Jun

Next, I check to see if there are any missing values:

power_data |> filter(is.na(KWH))
# A tsibble: 1 x 2 [1M]
    KWH    Month
  <dbl>    <mth>
1    NA 2008 Sep

There is one missing value. I next plot the data to see what it looks like:

power_data |> autoplot(KWH)

Based on this graph, I will impute the missing value as the average of its two surrounding values, as this will be consistent with the pattern in the graph.

power_data <- na_ma(power_data, k = 1)
power_data |> filter(is.na(KWH))
# A tsibble: 0 x 2 [?]
# ℹ 2 variables: KWH <dbl>, Month <mth>

The variance of the time series does not seem constant. I will check what an appropriate lambda value for a Box-Cox transformation is via the Guerrero method:

lambda <- power_data |>
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

print(lambda)
[1] 0.1073943

The lambda is close enough to zero that for simplicity, I will go with a log transformation.

power_data <- power_data |> mutate(log_KWH = log(KWH))

power_data |> autoplot(log_KWH)

There is a significant outlier in early 2010. Since there is a recorded value, it does seem like a legitimate measure that was likely due to some kind of exogenous shock (perhaps a blackout or something similar). Assuming that the outlier is a legitimate measurement, I will choose to leave it as is in the data so models properly capture potential shocks in the data.

Because the time series is relatively long, I will create a training and test set on which I will evaluate the models before determining which model to use to create a forecast for 2014. I will use the 80%/20% method.

power_data_train_cutoff <- floor(0.8 * nrow(power_data))
power_data_train <- power_data[1:power_data_train_cutoff, ]
power_data_test <- power_data[(power_data_train_cutoff + 1):nrow(power_data), ]

Next, I will fit three models on the training data: an ETS model, an ARIMA model, and a Seasonal Naive model:

power_fit <- power_data_train |>
  model(
    ETS = ETS(log_KWH),
    ARIMA = ARIMA(log_KWH),
    SNAIVE = SNAIVE(log_KWH)
  )

power_fit
# A mable: 1 x 3
           ETS                             ARIMA   SNAIVE
       <model>                           <model>  <model>
1 <ETS(M,N,M)> <ARIMA(0,0,0)(1,0,0)[12] w/ mean> <SNAIVE>

Next, I check the accuracy of the models on the training data:

power_fit |> accuracy() |> select(-ME, -MPE, -ACF1)
# A tibble: 3 × 7
  .model .type     RMSE    MAE  MAPE  MASE RMSSE
  <chr>  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl>
1 ETS    Training 0.200 0.0897 0.587 0.837 0.890
2 ARIMA  Training 0.215 0.111  0.722 1.03  0.956
3 SNAIVE Training 0.225 0.107  0.701 1     1    

Based on the above, the ETS model seems to be most accurate.

Next, I check the residuals of the models from the training data:

power_fit |> select(ETS) |> gg_tsresiduals()

power_fit |> select(ARIMA) |> gg_tsresiduals()

power_fit |> select(SNAIVE) |> gg_tsresiduals()
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 12 rows containing non-finite outside the scale range
(`stat_bin()`).

The above plots have residuals that look like white noise.

Next, I test the accuracy of the models on the test set

power_fit |> forecast(power_data_test) |> accuracy(power_data) |> select(-ME, -MPE, -ACF1)
# A tibble: 3 × 7
  .model .type  RMSE   MAE  MAPE  MASE RMSSE
  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA  Test  0.437 0.233  1.46  2.17 1.94 
2 ETS    Test  0.215 0.178  1.12  1.66 0.956
3 SNAIVE Test  0.696 0.291  1.82  2.71 3.09 

Indeed, the ETS model has the lowest RMSE and MAPE. I will thus use the ETS method to create a forecast for 2014:

power_data_2014_forecast <- power_data |> model(ETS(log_KWH)) |>
  forecast(h = '1 year')

power_data_2014_forecast |> autoplot(power_data)

Exporting to Excel

We are required to export forecasts to an excel file.

Starting with the ATM data, I create an output data frame for each ATM that has the values for the point forecast and confidence intervals in the original scale.

Formatting

#ATM1
atm_one_output <- atm_one_may_forecast |> mutate(ci = hilo(quarter_root_cash, 95)) |> as_tibble() |> select(-quarter_root_cash)

atm_one_output <- atm_one_output |> 
  mutate(mean_original_scale = (.mean)^4,
         ci_95_lower_original_scale = map_dbl(ci, ~.x$lower^4),
         ci_95_upper_original_scale = map_dbl(ci, ~.x$upper^4))

atm_one_output <- atm_one_output |> select(-.mean, -ci)
#ATM2
atm_two_output <- atm_two_may_forecast |> mutate(ci = hilo(transformed_cash, 95)) |> as_tibble() |> select(-transformed_cash)

atm_two_output <- atm_two_output |> 
  mutate(mean_original_scale = (.mean)^(4/3),
         ci_95_lower_original_scale = map_dbl(ci, ~.x$lower^(4/3)),
         ci_95_upper_original_scale = map_dbl(ci, ~.x$upper^(4/3)))

atm_two_output <- atm_two_output |> select(-.mean, -ci)
#ATM3
atm_three_output <- atm_three_forecast |> mutate(ci = hilo(Cash, 95)) |> as_tibble() |> select(-Cash)

atm_three_output <- atm_three_output |> 
  mutate(mean_original_scale = .mean,
         ci_95_lower_original_scale = map_dbl(ci, ~.x$lower),
         ci_95_upper_original_scale = map_dbl(ci, ~.x$upper))

atm_three_output <- atm_three_output |> select(-.mean, -ci)
#ATM4
atm_four_output <- atm_four_may_forecast |> mutate(ci = hilo(log_cash, 95)) |> as_tibble() |> select(-log_cash)

atm_four_output <- atm_four_output |> 
  mutate(mean_original_scale = exp(.mean),
         ci_95_lower_original_scale = map_dbl(ci, ~exp(.x$lower)),
         ci_95_upper_original_scale = map_dbl(ci, ~exp(.x$upper)))

atm_four_output <- atm_four_output |> select(-.mean, -ci)
#combind ATM dataframe

atm_output <- bind_rows(atm_one_output, atm_two_output,
                        atm_three_output, atm_four_output)
#power forecast
power_output <- power_data_2014_forecast |> mutate(ci = hilo(log_KWH, 95)) |> as_tibble() |> select(-log_KWH)

power_output <- power_output |> 
  mutate(mean_original_scale = exp(.mean),
         ci_95_lower_original_scale = map_dbl(ci, ~exp(.x$lower)),
         ci_95_upper_original_scale = map_dbl(ci, ~exp(.x$upper)))

power_output <- power_output |> select(-.mean, -ci)

Writing to excel

export_dataframes <- list("ATM_Forecasts" = atm_output, "Power_Forecast" = power_output)
write.xlsx(export_dataframes, file = "project1_output.xlsx")