DATA 624 Midterm Project

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Load the Libraries

library(fpp3)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.3     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(lubridate)
library(seasonal)

## 
## Attaching package: 'seasonal'

## The following object is masked from 'package:tibble':
## 
##     view

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ readr   2.1.5
## ✔ purrr   1.0.2     ✔ stringr 1.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ seasonal::view()    masks tibble::view()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read the Data

residential_power_usage <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-624/refs/heads/main/ResidentialCustomerForecastLoad-624(ResidentialCustomerForecastLoad).csv")

## Rows: 192 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): YYYY-MMM
## dbl (2): CaseSequence, KWH
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(residential_power_usage)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

Now let’s clean up the data into an actual time series:

names(residential_power_usage) <- c("CaseSequence", "Date", "KWH")
head(residential_power_usage)

## # A tibble: 6 × 3
##   CaseSequence Date         KWH
##          <dbl> <chr>      <dbl>
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147

residential_power_usage2 <- residential_power_usage |>
  separate(Date, c("Year", "Month"), "-") |>
  unite(Month, Year:Month, sep=" ")
head(residential_power_usage)

## # A tibble: 6 × 3
##   CaseSequence Date         KWH
##          <dbl> <chr>      <dbl>
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147

# final time series
residential_power_usage3 <- residential_power_usage2 |>
  mutate(Month = yearmonth(Month)) |>
  as_tsibble(index = Month)
head(residential_power_usage3)

## # A tsibble: 6 x 3 [1M]
##   CaseSequence    Month     KWH
##          <dbl>    <mth>   <dbl>
## 1          733 1998 Jan 6862583
## 2          734 1998 Feb 5838198
## 3          735 1998 Mar 5420658
## 4          736 1998 Apr 5010364
## 5          737 1998 May 4665377
## 6          738 1998 Jun 6467147

Understand the Original Data

Let’s take a look at the original data:

autoplot(residential_power_usage3, KWH) + geom_point()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

The autoplot does not show a major trend of increasing or decreasing. It looks like the trend is increasing very slightly. There is variance, but the variance seems pretty consistent. There is also an outlier after Jan 2010.

There does seem to be some seasonality which is a bit hard to see in this graph, so let’s take a closer look at seasonality below:

gg_season(residential_power_usage3, KWH)

This graph shows some pretty consistent seasonality which makes perfect sense for power usage. Takeaways from this graph are:

Power usage starts high at the start of the year and then decreases as spring approaches
Power usage starts to increase in May and continues to peak in July or August
Power usage then begins to decrease again hitting a trough in November, and increase again in December

These observations make sense as it is not only cold in January/December so heat is needed and people are inside, but also it’s during the holidays when people have decorations and lights out. In the summer, as the temperature rises, the need for AC increases. Cooler temperatures exist in the spring and fall, which cause less heat and/or AC to be needed so power usage would naturally decrease.

Additionally, recent years have more power usage which make sense due to global warming.

Lastly, we know the outlier exists in July.

Let’s now look at more details via the gg_subseries:

gg_subseries(residential_power_usage3, KWH) + geom_point()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

This graph confirms what we’ve seen above where August and January are peaks, and May and November are troughs. There is also an outlier in July 2010 specifically.

Let’s take a lag plot:

# since data is monthly, use lag=12
gg_lag(residential_power_usage3, KWH, geom = "point", lag=1:12)

## Warning: Removed 1 rows containing missing values (gg_lag).

The lag plot shows pretty strong positive correlation in lag 1, 5, 6, 11, and 12, while the others show negative correlation.

Let’s take a look at ACF:

residential_power_usage3 |>
  ACF(KWH) |>
  autoplot()

The ACF shows:

Lag 1 positive correlation
Lag 2 - 4 negative correlation
Lag 5 - 7 postive correlation
Lag 8 - 10 negative correlation
Lag 11 - 12 postive correlation
Major peak/dips every 3 months

This again supports the strong seasonality.

x11_dcmp <- residential_power_usage3 |>
  model(x11 = X_13ARIMA_SEATS(KWH ~ x11())) |>
  components()
autoplot(x11_dcmp)

Seasonal component shows slight varying seasonal variation towards the beginning, then it begins to even out.

Check for Transformations

Let’s check for any transformations. Calendar, population, and inflation are not really relevant here. So all that’s left is mathematical. Let’s see if the usual Box-Cox transformation improves the original data at all:

lambda_residential_power_usage3 <- residential_power_usage3 |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)
lambda_residential_power_usage3

## [1] 0.1132906

residential_power_usage3 |>
  autoplot(box_cox(KWH, lambda_residential_power_usage3))

This transformation makes the size of the seasonal variation about the same across the whole series which makes forecasting model simpler. Lambda value 0.1132906 works quite well.

To deal with this outlier, let’s make this value NA and interpolate it:

# Make outlier NA and interpolate
residential_power_usage3$KWH[residential_power_usage3$Month == yearmonth("2010 Jul")] <- NA

# Find missing values
residential_power_usage3 |>
  filter(is.na(KWH))

## # A tsibble: 2 x 3 [1M]
##   CaseSequence    Month   KWH
##          <dbl>    <mth> <dbl>
## 1          861 2008 Sep    NA
## 2          883 2010 Jul    NA

residential_power_usage3

## # A tsibble: 192 x 3 [1M]
##    CaseSequence    Month     KWH
##           <dbl>    <mth>   <dbl>
##  1          733 1998 Jan 6862583
##  2          734 1998 Feb 5838198
##  3          735 1998 Mar 5420658
##  4          736 1998 Apr 5010364
##  5          737 1998 May 4665377
##  6          738 1998 Jun 6467147
##  7          739 1998 Jul 8914755
##  8          740 1998 Aug 8607428
##  9          741 1998 Sep 6989888
## 10          742 1998 Oct 6345620
## # ℹ 182 more rows

residential_power_usage3_miss <- residential_power_usage3 |>
  # Replace with missing values
  fill_gaps()
residential_power_usage3_fill <- residential_power_usage3_miss |>
  # Fit ARIMA model to the data containing missing values
  model(ARIMA(KWH)) |>
  # Estimate Trips for all periods
  interpolate(residential_power_usage3_miss)

residential_power_usage3_fill |>
  autoplot(KWH) +
  autolayer(residential_power_usage3_fill |> filter_index("2008 Aug"~"2008 Oct"),
    KWH, colour="#D55E00")

Let’s try the box cox transformation again with the new data:

lambda_residential_power_usage3 <- residential_power_usage3_fill |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)
lambda_residential_power_usage3

## [1] -0.2057366

residential_power_usage3_fill |>
  autoplot(box_cox(KWH, lambda_residential_power_usage3))

I am not seeing a major difference for the box cox transformation so will choose not to do it.

Let’s do a quick STL() since there is seasonality:

dcmp <- residential_power_usage3_fill |>
  model(stl = STL(KWH))

components(dcmp) |> autoplot()

Let’s first try a simple forecasting methods. Since we know there is definitely seasonality, it will most likely be the SNAIVE() that wins out of the simple forecasting methods but let’s plot them all:

# Experiment with all forecast models
residential_power_usage3_fit <- residential_power_usage3_fill |>
  filter(!is.na(KWH)) |> # just want data that contains Bricks data
  model(
    Seasonal_naive = SNAIVE(KWH),
    Naive = NAIVE(KWH),
    Drift = RW(KWH ~ drift()),
    Mean = MEAN(KWH)
  )
residential_power_usage3_fc <- residential_power_usage3_fit |>
  forecast(h = 12)
residential_power_usage3_fc |>
  autoplot(residential_power_usage3_fill, level = NULL) +
  guides(colour = guide_legend(title = "Forecast"))

Seasonal naive is definitely the winner here.

Let’s now try Holt-Winters methods. Since the seasonal variation is pretty constant, the additive method should win here. There isn’t a trend, so let’s put “N” for trend.

fit <- residential_power_usage3_fill |>
  model(
    additive = ETS(KWH ~ error("A") + trend("N") + season("A")),
    multiplicative = ETS(KWH ~ error("M") + trend("N") + season("M"))
  )
fc <- fit |> forecast(h = 12)

fc |>
  autoplot(residential_power_usage3_fill, level = NULL) +
  guides(colour = guide_legend(title = "Forecast"))

# Confirm which is the best fit:
fit <- residential_power_usage3_fill |>
  model(ETS(KWH))
fit

## # A mable: 1 x 1
##     `ETS(KWH)`
##        <model>
## 1 <ETS(M,A,M)>

In the graph, the multiplicative looks better which makes sense because the seasonal variation does change slightly according to the decomposition. It doesn’t change much which is why I think additive is pretty similar forecast. The automatically chosen ETS() is also ETS(M,N,M).

Since we don’t have a ton of data, let’s refrain from doing a training/test model for this dataset.

Now let’s do time series cross validation to check for lowest RMSE:

residential_power_usage3_fill |>
  stretch_tsibble(.init = 10) |>
  model(
    MNM = ETS(KWH ~ error("M") + trend("N") + season("M")),
    MNA = ETS(KWH ~ error("M") + trend("N") + season("A")),
    MAM = ETS(KWH ~ error("M") + trend("A") + season("M")),
    snaive = SNAIVE(KWH)
  ) |>
  forecast(h = 12) |>
  accuracy(residential_power_usage3_fill)

## Warning: 6 errors (2 unique) encountered for MNM
## [3] A seasonal ETS model cannot be used for this data.
## [3] Not enough data to estimate this ETS model.

## Warning: 6 errors (2 unique) encountered for MNA
## [3] A seasonal ETS model cannot be used for this data.
## [3] Not enough data to estimate this ETS model.

## Warning: 8 errors (2 unique) encountered for MAM
## [3] A seasonal ETS model cannot be used for this data.
## [5] Not enough data to estimate this ETS model.

## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 12 observations are missing between 2014 Jan and 2014 Dec

## # A tibble: 4 × 10
##   .model .type      ME    RMSE     MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>   <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MAM    Test   91286. 800878. 587099. 0.680  8.74 0.963 0.989 0.477
## 2 MNA    Test  152085. 738794. 546842. 1.44   8.03 0.897 0.912 0.309
## 3 MNM    Test  170501. 718521. 536502. 1.90   7.88 0.880 0.887 0.330
## 4 snaive Test  101272. 813315. 612149. 0.808  9.11 1.00  1.00  0.257

The lowest RMSE value is 943696.0 for ETS(M,N,M).

Model Evaluation

Now let’s check the residuals:

fit <- residential_power_usage3_fill |>
  model(
    multiplicative = ETS(KWH ~ error("M") + trend("N") + season("M"))
  )

fc <- fit |> forecast(h = 12)

components(fit) |> autoplot() + labs(title = "ETS(M,N,M) components")

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

Now let’s check the residuals for white noise. Since this has seasonality, we pick l = 2m. The seasonal period is 12 months (1 year), so l = 2 * 12 which is 24:

# Box-Pierce test
augment(fit) |> features(.innov, box_pierce, lag = 24)

## # A tibble: 1 × 3
##   .model         bp_stat bp_pvalue
##   <chr>            <dbl>     <dbl>
## 1 multiplicative    30.4     0.171

# Ljung-Box test
augment(fit) |> features(.innov, ljung_box, lag = 24)

## # A tibble: 1 × 3
##   .model         lb_stat lb_pvalue
##   <chr>            <dbl>     <dbl>
## 1 multiplicative    32.5     0.116

For both methods, the p value is greater than 0.05, so it’s likely to occur by chance. These residuals are not easily distinguished from white noise. We accept the white noise hypothesis.

Check ARIMA

Now let’s double check ARIMA in case it’s better:

residential_power_usage3_fill |>
  stretch_tsibble(.init = 10) |>
  model(
    ETS(KWH ~ error("M") + trend("N") + season("M")),
    ARIMA(KWH)) |>
  forecast(h = 24) |>
  accuracy(residential_power_usage3_fill) |>
  select(.model, RMSE:MAPE)

## Warning: 6 errors (2 unique) encountered for ETS(KWH ~ error("M") + trend("N") + season("M"))
## [3] A seasonal ETS model cannot be used for this data.
## [3] Not enough data to estimate this ETS model.

## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 24 observations are missing between 2014 Jan and 2015 Dec

## # A tibble: 2 × 5
##   .model                                                 RMSE    MAE   MPE  MAPE
##   <chr>                                                 <dbl>  <dbl> <dbl> <dbl>
## 1 "ARIMA(KWH)"                                         8.33e5 6.18e5  1.36  9.28
## 2 "ETS(KWH ~ error(\"M\") + trend(\"N\") + season(\"M… 7.45e5 5.56e5  2.67  8.10

RMSE value is smaller for ETS so we shall choose the ETS model.

Write Forecast to CSV

write_csv(fc, "~/Downloads/KWH-forecast.csv")