Midterm Project: Part B

Author

Alex Ptacek

Load Packages and Data

library(tidyverse)
library(fpp3)
library(readxl)
library(janitor)
library(psych)
library(mice)
library(ggpubr)
power_data_raw <- read_excel("ResidentialCustomerForecastLoad-624.xlsx") |> 
  clean_names()

Data Tidying

power_data_raw |> 
  print(n = 3)
# A tibble: 192 × 3
  case_sequence yyyy_mmm     kwh
          <dbl> <chr>      <dbl>
1           733 1998-Jan 6862583
2           734 1998-Feb 5838198
3           735 1998-Mar 5420658
# ℹ 189 more rows
power_data <- power_data_raw |> 
  mutate(month = yearmonth(yyyy_mmm), .keep = "unused", .before = 3) |> 
  tsibble(index = month)
power_data |> print(n = 3)
# A tsibble: 192 x 3 [1M]
  case_sequence    month     kwh
          <dbl>    <mth>   <dbl>
1           733 1998 Jan 6862583
2           734 1998 Feb 5838198
3           735 1998 Mar 5420658
# ℹ 189 more rows

We know that energy is a continuous variable that we are trying to produce time series forecasts for. If we check the complete_case variable for unique values, we’ll find that we get the same amount of rows, so all the values are unique. Perhaps this is not a categorical variable that could help forecasts. To investigate, I first created an numeric sequence using the maximum and minimum values from power_data. I found, in fact, that our complete_case variable is a simple ordered sequence, possibly another index, so it won’t be useful for this project.

#Check for groups of case_sequence
power_data |> 
  count(case_sequence) |> 
  print(n = 3)
# A tibble: 192 × 2
  case_sequence     n
          <dbl> <int>
1           733     1
2           734     1
3           735     1
# ℹ 189 more rows
#Find max case_sequence value
power_data |> arrange(desc(case_sequence)) |> head(1)
# A tsibble: 1 x 3 [1M]
  case_sequence    month     kwh
          <dbl>    <mth>   <dbl>
1           924 2013 Dec 9606304
#Create sample sequence for comparison
test_seq <- seq(from = 733, to = 924, by = 1)

identical(power_data$case_sequence, test_seq)
[1] TRUE

Check For Missing Values

There is one missing value for kwh. It may only make up 0.5% of the data, but we still want to check for seasonality before removing or imputing the value.

power_data |> 
  filter(is.na(kwh))
# A tsibble: 1 x 3 [1M]
  case_sequence    month   kwh
          <dbl>    <mth> <dbl>
1           861 2008 Sep    NA
sum(is.na(power_data$month))
[1] 0

Exploratory Data Analysis

Visualizing Seasonality and Solving for NA Values

My first impression of this time series is that it probably has a significant seasonal component, as well as some cyclic and trend behavior. There also appears to be an outlier value. The variance seems fairly constant, so I wouldn’t consider applying a Box-Cox transformation to this time series.

power_data |> 
  autoplot(kwh)

The following seasonal and subseries plots show a consinstent seasonal pattern in the data, and the ACF plot shows highly significant autocorrelation at seasonal lags. Combined, this gives us clear evidence of an important seasonal component, so we will have to impute the missing kwh value. If do not impute the missing value, modeling of the data will articially change the seasonality, by pushing all the dates back, and inducing inaccuracies to our forecasts. We can begin choosing an imputation method by further analyzing the plots below. The seasonal plot shows us clearly that energy consumption consistently peaks in January, decreases until May, has another peak between July to September, then decreases until November. Since we found the month of the missing value to be September, we can look at the September subseries plot and see that the data has somewhat stable variance, so I feel very comfortable using seasonal mean imputation for this missing kwh value. Since it only makes up 0.5% of our data, we only need to make sure it doesn’t effect our model.

power_data |> 
  gg_season(kwh) +
  ggtitle("Seasonal Plot")

subs_power <- power_data |> 
  gg_subseries(kwh) +
  ggtitle("Subseries Plot")
subs_power

power_data |> 
  ACF(kwh, lag_max = 36) |> 
  autoplot() +
  ggtitle("ACF Plot")

# Find the mean kwh value for September
avg_sep_power <- power_data |> 
  tibble() |> 
  filter(month(month) == 9) |> 
  summarise(avg_sep_kwh = mean(kwh, na.rm = TRUE))
avg_sep_power
# A tibble: 1 × 1
  avg_sep_kwh
        <dbl>
1    7702333.
# Impute average September kwh for missing value
power_data <- power_data |> 
  mutate(kwh = ifelse(!is.na(kwh), kwh,
                      avg_sep_power$avg_sep_kwh))

Resolve Outliers

Similar to missing values, extreme outliers can wreak havoc on a prediction model, so we will have to treat it similarly. Observing the distribution below, we see further evidence of an outlier value. It’s an extremely lower energy usage day, so it could be a blackout, a data entry or measurement error, etc. We can’t predict these events with this data, since there is only one. In turn, it also doesn’t give us information about “normal” energy consumption, but it can still have a large affect on the model and hinder our ability to predict normal energy consumption. We can gauge the magnitude of this outlier by seeing how far it is from the IQR of kwh.

power_data |> 
  ggplot(aes(x = kwh)) +
  geom_histogram()

A standard and simple outlier detection method is finding values far away from the upper and lower quartiles of the data, and the typical threshold is 1.5*IQR. Plugging this in below, we see that the kwh value we’ve observed is included. That convinces me that it will be worth it to impute for this value before modeling.

Q1 <- quantile(power_data$kwh, 0.25)
Q3 <- quantile(power_data$kwh, 0.75)
IQR_val <- IQR(power_data$kwh)

# Define outlier bounds
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val

# Identify outliers
outliers <- power_data$kwh[power_data$kwh < lower_bound | power_data$kwh > upper_bound]
outliers
[1] 770523

We already know the data is seasonal, so we can start by getting the month where the outlier occurred and examining the subseries plot. The missing value comes from July, which has a U-shaped pattern in the subseries plot. I think a decision tree model will do a good enough job of imputing in this data.

power_data |> 
  filter(kwh == outliers)
# A tsibble: 1 x 3 [1M]
  case_sequence    month    kwh
          <dbl>    <mth>  <dbl>
1           883 2010 Jul 770523
subs_power

Im using the mice package, which applies a selected model to our data multiple times (5 here) and adjusts the parameters each time to introduce some random error.

set.seed(624)

mice_power_data <- power_data |> 
  mutate(kwh = ifelse(kwh == outliers, NA, kwh)) |> 
  select(-case_sequence)
mice_imp_data <- mice(mice_power_data, method = "cart") |> 
  complete() |> 
  tibble()

 iter imp variable
  1   1  kwh
  1   2  kwh
  1   3  kwh
  1   4  kwh
  1   5  kwh
  2   1  kwh
  2   2  kwh
  2   3  kwh
  2   4  kwh
  2   5  kwh
  3   1  kwh
  3   2  kwh
  3   3  kwh
  3   4  kwh
  3   5  kwh
  4   1  kwh
  4   2  kwh
  4   3  kwh
  4   4  kwh
  4   5  kwh
  5   1  kwh
  5   2  kwh
  5   3  kwh
  5   4  kwh
  5   5  kwh

The plots below show a time series comparison of yearly energy consumption in July before and after imputation. After the imputation, the data looks much more normal. The imputed value looks like a reasonable fit, so I will replace the outlier with the imputed value for purposes of forecasting.

outlier_plot <- power_data |>  
  mutate(year = year(month)) |> 
  filter(month(month) == 7) |> 
  tsibble(index = year) |> 
  autoplot(kwh) +
  labs(title = "Yearly July Data", subtitle = "With Outlier")

imp_plot <- mice_imp_data |> 
  mutate(year = year(month)) |> 
  filter(month(month) == 7) |> 
  tsibble(index = year) |> 
  autoplot(kwh) +
  labs(title = "Yearly July Data", subtitle = "After Imputation")

ggarrange(outlier_plot, imp_plot)

# Pull the date for our outlier value
outlier_date <- power_data |> 
  filter(kwh == outliers) |> 
  pull(month)

# Find imputed value at this data
imp_value <- mice_imp_data |> 
  filter(month == outlier_date) |> 
  pull(kwh)

# Replace outlier value with imputed value
power_data <- power_data |> 
  mutate(kwh = ifelse(kwh == outliers, imp_value, kwh))

Modeling

Since we’ve already determined this data has a strong seasonal component, I will use a seasonal naive model as a benchmark. I will also compare auto-generated ETS and ARIMA models.

power_data |> 
  autoplot(kwh)

Fable has auto-generated and ETS(M,N,A) and ARIMA(1,0,0)(2,1,0) with drift model. Next, we need to which model fits the data better, by using cross-validation.

power_fit <- power_data |> 
  model(snaive = SNAIVE(kwh),
        ets = ETS(kwh),
        arima = ARIMA(kwh))
power_fit
# A mable: 1 x 3
    snaive          ets                              arima
   <model>      <model>                            <model>
1 <SNAIVE> <ETS(M,N,A)> <ARIMA(1,0,0)(2,1,0)[12] w/ drift>

To do cross-validation, we can use stretch_tsibble to create many training and test sets. Then, we can fit our models to all of the training sets and forecast for all of the test sets in one step. Lastly, we check the accuracy measures. The ARIMA model scored the best accuracy for the cross-validation test. This means the ARIMA model provided the best fit for our data. Therefore, I will select the ARIMA model for forecasting

many_train <- power_data |> 
  stretch_tsibble(.init = 100, .step = 1) |> 
  filter(.id != max(.id))

cv_fit <- many_train |> 
  model(snaive = SNAIVE(kwh),
        ets = ETS(kwh ~ error("M") + trend("N") + season("A")),
        arima = ARIMA(kwh ~ 1 + pdq(1,0,0) + PDQ(2,1,0)))

cv_fc <- cv_fit |> 
  forecast(h = 1) |> 
  filter(!is.na(.mean))

accuracy(cv_fc, power_data)
# A tibble: 3 × 10
  .model .type      ME     RMSE     MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>   <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima  Test  190830.  761227. 527741.  1.80  7.43 0.851 0.895 0.146
2 ets    Test  180045.  844760. 592700.  1.28  8.40 0.956 0.993 0.357
3 snaive Test  172944. 1037507. 727427.  1.46 10.3  1.17  1.22  0.341

Forecasting and Final Results

The forecast has very narrow prediction intervals and, overall, looks like a very reasonalable fit for the data.

Point Forecasts and Distributional Forecasts

The table below can be used to deploy our forecasts. For example, if an electical company’s revenue depends on energy consumption and they need to create a budget, they could use the point forecasts as a parameter to model their budget around.

# A tibble: 13 × 4
   month        lower     pt_fc      upper
   <chr>        <dbl>     <dbl>      <dbl>
 1 2014 Jan  8745969. 10033059.  11320150.
 2 2014 Feb  7385586.  8721726.  10057866.
 3 2014 Mar  5341357.  6681231.   8021105.
 4 2014 Apr  4655273.  5995437.   7335600.
 5 2014 May  4597841.  5938028.   7278214.
 6 2014 Jun  6868633.  8208821.   9549009.
 7 2014 Jul  8194686.  9534874.  10875062.
 8 2014 Aug  8683211. 10023399.  11363587.
 9 2014 Sep  7152703.  8492891.   9833079.
10 2014 Oct  4525647.  5865835.   7206023.
11 2014 Nov  4818360.  6158548.   7498737.
12 2014 Dec  6937416.  8277604.   9617792.
13 Total    77906683. 93931453. 109956224.