In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.3 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(lubridate)
library(seasonal)
##
## Attaching package: 'seasonal'
## The following object is masked from 'package:tibble':
##
## view
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ purrr 1.0.2 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ seasonal::view() masks tibble::view()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
residential_power_usage <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-624/refs/heads/main/ResidentialCustomerForecastLoad-624(ResidentialCustomerForecastLoad).csv")
## Rows: 192 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): YYYY-MMM
## dbl (2): CaseSequence, KWH
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(residential_power_usage)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
Now let’s clean up the data into an actual time series:
names(residential_power_usage) <- c("CaseSequence", "Date", "KWH")
head(residential_power_usage)
## # A tibble: 6 × 3
## CaseSequence Date KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
residential_power_usage2 <- residential_power_usage |>
separate(Date, c("Year", "Month"), "-") |>
unite(Month, Year:Month, sep=" ")
head(residential_power_usage)
## # A tibble: 6 × 3
## CaseSequence Date KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
# final time series
residential_power_usage3 <- residential_power_usage2 |>
mutate(Month = yearmonth(Month)) |>
as_tsibble(index = Month)
head(residential_power_usage3)
## # A tsibble: 6 x 3 [1M]
## CaseSequence Month KWH
## <dbl> <mth> <dbl>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
Let’s take a look at the original data:
autoplot(residential_power_usage3, KWH) + geom_point()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
The autoplot does not show a major trend of increasing or decreasing. It looks like the trend is increasing very slightly. There is variance, but the variance seems pretty consistent. There is also an outlier after Jan 2010.
There does seem to be some seasonality which is a bit hard to see in this graph, so let’s take a closer look at seasonality below:
gg_season(residential_power_usage3, KWH)
This graph shows some pretty consistent seasonality which makes perfect sense for power usage. Takeaways from this graph are:
These observations make sense as it is not only cold in January/December so heat is needed and people are inside, but also it’s during the holidays when people have decorations and lights out. In the summer, as the temperature rises, the need for AC increases. Cooler temperatures exist in the spring and fall, which cause less heat and/or AC to be needed so power usage would naturally decrease.
Additionally, recent years have more power usage which make sense due to global warming.
Lastly, we know the outlier exists in July.
Let’s now look at more details via the gg_subseries
:
gg_subseries(residential_power_usage3, KWH) + geom_point()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
This graph confirms what we’ve seen above where August and January are peaks, and May and November are troughs. There is also an outlier in July 2010 specifically.
Let’s take a lag plot:
# since data is monthly, use lag=12
gg_lag(residential_power_usage3, KWH, geom = "point", lag=1:12)
## Warning: Removed 1 rows containing missing values (gg_lag).
The lag plot shows pretty strong positive correlation in lag 1, 5, 6, 11, and 12, while the others show negative correlation.
Let’s take a look at ACF:
residential_power_usage3 |>
ACF(KWH) |>
autoplot()
The ACF shows:
This again supports the strong seasonality.
x11_dcmp <- residential_power_usage3 |>
model(x11 = X_13ARIMA_SEATS(KWH ~ x11())) |>
components()
autoplot(x11_dcmp)
Seasonal component shows slight varying seasonal variation towards the beginning, then it begins to even out.
Let’s check for any transformations. Calendar, population, and inflation are not really relevant here. So all that’s left is mathematical. Let’s see if the usual Box-Cox transformation improves the original data at all:
lambda_residential_power_usage3 <- residential_power_usage3 |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
lambda_residential_power_usage3
## [1] 0.1132906
residential_power_usage3 |>
autoplot(box_cox(KWH, lambda_residential_power_usage3))
This transformation makes the size of the seasonal variation about the same across the whole series which makes forecasting model simpler. Lambda value 0.1132906 works quite well.
To deal with this outlier, let’s make this value NA and interpolate it:
# Make outlier NA and interpolate
residential_power_usage3$KWH[residential_power_usage3$Month == yearmonth("2010 Jul")] <- NA
# Find missing values
residential_power_usage3 |>
filter(is.na(KWH))
## # A tsibble: 2 x 3 [1M]
## CaseSequence Month KWH
## <dbl> <mth> <dbl>
## 1 861 2008 Sep NA
## 2 883 2010 Jul NA
residential_power_usage3
## # A tsibble: 192 x 3 [1M]
## CaseSequence Month KWH
## <dbl> <mth> <dbl>
## 1 733 1998 Jan 6862583
## 2 734 1998 Feb 5838198
## 3 735 1998 Mar 5420658
## 4 736 1998 Apr 5010364
## 5 737 1998 May 4665377
## 6 738 1998 Jun 6467147
## 7 739 1998 Jul 8914755
## 8 740 1998 Aug 8607428
## 9 741 1998 Sep 6989888
## 10 742 1998 Oct 6345620
## # ℹ 182 more rows
residential_power_usage3_miss <- residential_power_usage3 |>
# Replace with missing values
fill_gaps()
residential_power_usage3_fill <- residential_power_usage3_miss |>
# Fit ARIMA model to the data containing missing values
model(ARIMA(KWH)) |>
# Estimate Trips for all periods
interpolate(residential_power_usage3_miss)
residential_power_usage3_fill |>
autoplot(KWH) +
autolayer(residential_power_usage3_fill |> filter_index("2008 Aug"~"2008 Oct"),
KWH, colour="#D55E00")
Let’s try the box cox transformation again with the new data:
lambda_residential_power_usage3 <- residential_power_usage3_fill |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
lambda_residential_power_usage3
## [1] -0.2057366
residential_power_usage3_fill |>
autoplot(box_cox(KWH, lambda_residential_power_usage3))
I am not seeing a major difference for the box cox transformation so will choose not to do it.
Let’s do a quick STL()
since there is seasonality:
dcmp <- residential_power_usage3_fill |>
model(stl = STL(KWH))
components(dcmp) |> autoplot()
Let’s first try a simple forecasting methods. Since we know there is
definitely seasonality, it will most likely be the SNAIVE()
that wins out of the simple forecasting methods but let’s plot them
all:
# Experiment with all forecast models
residential_power_usage3_fit <- residential_power_usage3_fill |>
filter(!is.na(KWH)) |> # just want data that contains Bricks data
model(
Seasonal_naive = SNAIVE(KWH),
Naive = NAIVE(KWH),
Drift = RW(KWH ~ drift()),
Mean = MEAN(KWH)
)
residential_power_usage3_fc <- residential_power_usage3_fit |>
forecast(h = 12)
residential_power_usage3_fc |>
autoplot(residential_power_usage3_fill, level = NULL) +
guides(colour = guide_legend(title = "Forecast"))
Seasonal naive is definitely the winner here.
Let’s now try Holt-Winters methods. Since the seasonal variation is pretty constant, the additive method should win here. There isn’t a trend, so let’s put “N” for trend.
fit <- residential_power_usage3_fill |>
model(
additive = ETS(KWH ~ error("A") + trend("N") + season("A")),
multiplicative = ETS(KWH ~ error("M") + trend("N") + season("M"))
)
fc <- fit |> forecast(h = 12)
fc |>
autoplot(residential_power_usage3_fill, level = NULL) +
guides(colour = guide_legend(title = "Forecast"))
# Confirm which is the best fit:
fit <- residential_power_usage3_fill |>
model(ETS(KWH))
fit
## # A mable: 1 x 1
## `ETS(KWH)`
## <model>
## 1 <ETS(M,A,M)>
In the graph, the multiplicative looks better which makes sense
because the seasonal variation does change slightly according to the
decomposition. It doesn’t change much which is why I think additive is
pretty similar forecast. The automatically chosen ETS()
is
also ETS(M,N,M)
.
Since we don’t have a ton of data, let’s refrain from doing a training/test model for this dataset.
Now let’s do time series cross validation to check for lowest RMSE:
residential_power_usage3_fill |>
stretch_tsibble(.init = 10) |>
model(
MNM = ETS(KWH ~ error("M") + trend("N") + season("M")),
MNA = ETS(KWH ~ error("M") + trend("N") + season("A")),
MAM = ETS(KWH ~ error("M") + trend("A") + season("M")),
snaive = SNAIVE(KWH)
) |>
forecast(h = 12) |>
accuracy(residential_power_usage3_fill)
## Warning: 6 errors (2 unique) encountered for MNM
## [3] A seasonal ETS model cannot be used for this data.
## [3] Not enough data to estimate this ETS model.
## Warning: 6 errors (2 unique) encountered for MNA
## [3] A seasonal ETS model cannot be used for this data.
## [3] Not enough data to estimate this ETS model.
## Warning: 8 errors (2 unique) encountered for MAM
## [3] A seasonal ETS model cannot be used for this data.
## [5] Not enough data to estimate this ETS model.
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 12 observations are missing between 2014 Jan and 2014 Dec
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MAM Test 91286. 800878. 587099. 0.680 8.74 0.963 0.989 0.477
## 2 MNA Test 152085. 738794. 546842. 1.44 8.03 0.897 0.912 0.309
## 3 MNM Test 170501. 718521. 536502. 1.90 7.88 0.880 0.887 0.330
## 4 snaive Test 101272. 813315. 612149. 0.808 9.11 1.00 1.00 0.257
The lowest RMSE value is 943696.0 for ETS(M,N,M)
.
Now let’s check the residuals:
fit <- residential_power_usage3_fill |>
model(
multiplicative = ETS(KWH ~ error("M") + trend("N") + season("M"))
)
fc <- fit |> forecast(h = 12)
components(fit) |> autoplot() + labs(title = "ETS(M,N,M) components")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
Now let’s check the residuals for white noise. Since this has seasonality, we pick l = 2m. The seasonal period is 12 months (1 year), so l = 2 * 12 which is 24:
# Box-Pierce test
augment(fit) |> features(.innov, box_pierce, lag = 24)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 multiplicative 30.4 0.171
# Ljung-Box test
augment(fit) |> features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 multiplicative 32.5 0.116
For both methods, the p value is greater than 0.05, so it’s likely to occur by chance. These residuals are not easily distinguished from white noise. We accept the white noise hypothesis.
Now let’s double check ARIMA in case it’s better:
residential_power_usage3_fill |>
stretch_tsibble(.init = 10) |>
model(
ETS(KWH ~ error("M") + trend("N") + season("M")),
ARIMA(KWH)) |>
forecast(h = 24) |>
accuracy(residential_power_usage3_fill) |>
select(.model, RMSE:MAPE)
## Warning: 6 errors (2 unique) encountered for ETS(KWH ~ error("M") + trend("N") + season("M"))
## [3] A seasonal ETS model cannot be used for this data.
## [3] Not enough data to estimate this ETS model.
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 24 observations are missing between 2014 Jan and 2015 Dec
## # A tibble: 2 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 "ARIMA(KWH)" 8.33e5 6.18e5 1.36 9.28
## 2 "ETS(KWH ~ error(\"M\") + trend(\"N\") + season(\"M… 7.45e5 5.56e5 2.67 8.10
RMSE value is smaller for ETS so we shall choose the ETS model.
write_csv(fc, "~/Downloads/KWH-forecast.csv")