Midterm Project: Part B

Load Packages and Data

library(tidyverse)
library(fpp3)
library(readxl)
library(janitor)
library(psych)
library(mice)
library(ggpubr)

power_data_raw <- read_excel("ResidentialCustomerForecastLoad-624.xlsx") |>
  clean_names()

Data Tidying
power_data_raw |>
  print(n = 3)

# A tibble: 192 × 3
case_sequence yyyy_mmm kwh
<dbl> <chr> <dbl>
1 733 1998-Jan 6862583
2 734 1998-Feb 5838198
3 735 1998-Mar 5420658
# ℹ 189 more rows
power_data <- power_data_raw |>
  mutate(month = yearmonth(yyyy_mmm), .keep = "unused", .before = 3) |>
  as_tsibble(index = month)
power_data |> print(n = 3)

# A tsibble: 192 x 3 [1M]
case_sequence month kwh
<dbl> <mth> <dbl>
1 733 1998 Jan 6862583
2 734 1998 Feb 5838198
3 735 1998 Mar 5420658
# ℹ 189 more rows
Energy consumption (kwh) is the continuous variable we are trying to forecast. If we count the unique values of case_sequence, we get as many groups as there are rows, so every value is unique and it is unlikely to be a categorical variable that could help the forecasts. To investigate further, I created a numeric sequence running from the minimum to the maximum of case_sequence in power_data and compared it to the column itself. It turns out that case_sequence is a simple ordered sequence, essentially another index, so it won't be useful for this project.
# Check for groups of case_sequence
power_data |>
  count(case_sequence) |>
  print(n = 3)

# A tibble: 192 × 2
case_sequence n
<dbl> <int>
1 733 1
2 734 1
3 735 1
# ℹ 189 more rows
# Find max case_sequence value
power_data |> arrange(desc(case_sequence)) |> head(1)

# A tsibble: 1 x 3 [1M]
case_sequence month kwh
<dbl> <mth> <dbl>
1 924 2013 Dec 9606304
# Create sample sequence for comparison
test_seq <- seq(from = 733, to = 924, by = 1)
identical(power_data$case_sequence, test_seq)

[1] TRUE

Check For Missing Values
There is one missing value for kwh. It may only make up 0.5% of the data, but we still want to check for seasonality before removing or imputing the value.
power_data |>
  filter(is.na(kwh))

# A tsibble: 1 x 3 [1M]
case_sequence month kwh
<dbl> <mth> <dbl>
1 861 2008 Sep NA
sum(is.na(power_data$month))

[1] 0

Exploratory Data Analysis
Visualizing Seasonality and Solving for NA Values
My first impression of this time series is that it probably has a significant seasonal component, as well as some cyclic and trend behavior. There also appears to be an outlier value. The variance seems fairly constant, so I wouldn’t consider applying a Box-Cox transformation to this time series.
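As an optional sanity check on the constant-variance judgment (not part of the original analysis), the Guerrero method can estimate a Box-Cox lambda; a value near 1 would confirm that a transformation adds little. Note this assumes a complete kwh series, so it would be run after the missing value is imputed below.

# Optional check (not in the original analysis): estimate a Box-Cox lambda
# with the Guerrero method. Assumes kwh has no missing values, so run this
# after the imputation step below; lambda near 1 means little need to transform.
power_data |>
  features(kwh, features = guerrero)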
power_data |>
  autoplot(kwh)

The following seasonal and subseries plots show a consistent seasonal pattern in the data, and the ACF plot shows highly significant autocorrelation at the seasonal lags. Combined, this gives us clear evidence of an important seasonal component, so we will have to impute the missing kwh value. If we simply drop the observation instead, the monthly series is left with a gap (or, if re-indexed, all later dates shift back a month), which artificially distorts the seasonality and introduces inaccuracies into our forecasts. We can choose an imputation method by examining the plots below. The seasonal plot shows that energy consumption consistently peaks in January, decreases until May, peaks again between July and September, then decreases until November. Since the missing value falls in September, we can look at the September panel of the subseries plot and see that the data there has fairly stable variance, so I feel comfortable using seasonal mean imputation for this missing kwh value. Because it makes up only about 0.5% of the data, the main concern is simply that it does not distort the model.
power_data |>
  gg_season(kwh) +
  ggtitle("Seasonal Plot")

subs_power <- power_data |>
  gg_subseries(kwh) +
  ggtitle("Subseries Plot")
subs_power

power_data |>
  ACF(kwh, lag_max = 36) |>
  autoplot() +
  ggtitle("ACF Plot")

# Find the mean kwh value for September
avg_sep_power <- power_data |>
  as_tibble() |>
  filter(month(month) == 9) |>
  summarise(avg_sep_kwh = mean(kwh, na.rm = TRUE))
avg_sep_power

# A tibble: 1 × 1
avg_sep_kwh
<dbl>
1 7702333.
# Impute average September kwh for missing value
power_data <- power_data |>
  mutate(kwh = ifelse(!is.na(kwh), kwh,
                      avg_sep_power$avg_sep_kwh))

Resolve Outliers
Similar to missing values, extreme outliers can wreak havoc on a prediction model, so we will treat this one in much the same way. Observing the distribution below, we see further evidence of an outlier: an extremely low energy-usage month, which could reflect a blackout, a data entry or measurement error, etc. We can't predict such events with this data, since there is only one. It also tells us nothing about "normal" energy consumption, yet it can still have a large effect on the model and hinder our ability to predict normal consumption. We can gauge the magnitude of this outlier by seeing how far it falls outside the interquartile range (IQR) of kwh.
power_data |>
  ggplot(aes(x = kwh)) +
  geom_histogram()

A standard, simple outlier detection method is to flag values far from the lower and upper quartiles of the data, with the typical threshold being 1.5 × IQR beyond each quartile. Applying this rule below, we see that the low kwh value we observed is flagged. That convinces me it is worth imputing a replacement for this value before modeling.
Q1 <- quantile(power_data$kwh, 0.25)
Q3 <- quantile(power_data$kwh, 0.75)
IQR_val <- IQR(power_data$kwh)
# Define outlier bounds
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
# Identify outliers
outliers <- power_data$kwh[power_data$kwh < lower_bound | power_data$kwh > upper_bound]
outliers

[1] 770523
We already know the data is seasonal, so we can start by finding the month in which the outlier occurred and examining the subseries plot. The outlier falls in July, which shows a U-shaped pattern in the subseries plot. I think a decision tree model will do a good enough job of imputing this value.
power_data |>
  filter(kwh == outliers)

# A tsibble: 1 x 3 [1M]
case_sequence month kwh
<dbl> <mth> <dbl>
1 883 2010 Jul 770523
subs_power

I'm using the mice package, which applies the selected imputation model to the data multiple times (five imputations here), introducing some random variation on each pass.
set.seed(624)

mice_power_data <- power_data |>
  mutate(kwh = ifelse(kwh == outliers, NA, kwh)) |>
  select(-case_sequence)

mice_imp_data <- mice(mice_power_data, method = "cart") |>
  complete() |>
  as_tibble()
 iter imp variable
  1   1  kwh
  1   2  kwh
  ...
  5   5  kwh
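Because mice draws a different value on each of the five imputations, it can be informative to look at the spread of those draws before accepting one. The sketch below is not part of the original analysis; it stores the mids object (placeholder name mice_fit) instead of piping straight into complete().

# Sketch (not in the original): keep the mids object so the five imputed
# candidates for the outlier month can be inspected. mice_fit is a
# placeholder name; the seed reuses the value set above.
mice_fit <- mice(mice_power_data, method = "cart", seed = 624)
mice_fit$imp$kwh   # one row per missing value, one column per imputation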
The plots below show a time series comparison of yearly July energy consumption before and after imputation. After imputation, the July series no longer shows the extreme dip and is much more consistent with the rest of the data. The imputed value looks like a reasonable fit, so I will replace the outlier with it for forecasting purposes.
outlier_plot <- power_data |>
  as_tibble() |>
  mutate(year = year(month)) |>
  filter(month(month) == 7) |>
  as_tsibble(index = year) |>
  autoplot(kwh) +
  labs(title = "Yearly July Data", subtitle = "With Outlier")

imp_plot <- mice_imp_data |>
  mutate(year = year(month)) |>
  filter(month(month) == 7) |>
  as_tsibble(index = year) |>
  autoplot(kwh) +
  labs(title = "Yearly July Data", subtitle = "After Imputation")

ggarrange(outlier_plot, imp_plot)

# Pull the date for our outlier value
outlier_date <- power_data |>
  filter(kwh == outliers) |>
  pull(month)

# Find imputed value at this date
imp_value <- mice_imp_data |>
  filter(month == outlier_date) |>
  pull(kwh)

# Replace outlier value with imputed value
power_data <- power_data |>
  mutate(kwh = ifelse(kwh == outliers, imp_value, kwh))

Modeling
Since we’ve already determined this data has a strong seasonal component, I will use a seasonal naive model as a benchmark. I will also compare auto-generated ETS and ARIMA models.
power_data |>
  autoplot(kwh)

Below, fable automatically selects an ETS(M,N,A) model and an ARIMA(1,0,0)(2,1,0)[12] model with drift. Next, we need to determine which model fits the data better by using cross-validation.
power_fit <- power_data |>
  model(snaive = SNAIVE(kwh),
        ets = ETS(kwh),
        arima = ARIMA(kwh))
power_fit

# A mable: 1 x 3
snaive ets arima
<model> <model> <model>
1 <SNAIVE> <ETS(M,N,A)> <ARIMA(1,0,0)(2,1,0)[12] w/ drift>
To do cross-validation, we can use stretch_tsibble to create many training and test sets. Then we can fit our models to all of the training sets and forecast all of the test sets in one step. Lastly, we check the accuracy measures. The ARIMA model scored best on the cross-validation test (lowest RMSE, MAE, MAPE, and MASE), meaning it provided the best out-of-sample fit for our data. Therefore, I will select the ARIMA model for forecasting.
many_train <- power_data |>
  stretch_tsibble(.init = 100, .step = 1) |>
  filter(.id != max(.id))

cv_fit <- many_train |>
  model(snaive = SNAIVE(kwh),
        ets = ETS(kwh ~ error("M") + trend("N") + season("A")),
        arima = ARIMA(kwh ~ 1 + pdq(1,0,0) + PDQ(2,1,0)))

cv_fc <- cv_fit |>
  forecast(h = 1) |>
  filter(!is.na(.mean))

accuracy(cv_fc, power_data)

# A tibble: 3 × 10
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima Test 190830. 761227. 527741. 1.80 7.43 0.851 0.895 0.146
2 ets Test 180045. 844760. 592700. 1.28 8.40 0.956 0.993 0.357
3 snaive Test 172944. 1037507. 727427. 1.46 10.3 1.17 1.22 0.341
Forecasting and Final Results
The forecast has narrow prediction intervals and, overall, looks like a very reasonable fit for the data.
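The code that produced the final forecast is not shown above; the sketch below is one way it could be generated from the ARIMA model already fitted on the full series in power_fit. The object name final_fc is mine, not from the original.

# Sketch (not the original code): 12-month forecast from the ARIMA model
# fitted on the full series in power_fit; final_fc is a placeholder name.
final_fc <- power_fit |>
  select(arima) |>
  forecast(h = 12)

final_fc |>
  autoplot(power_data) +
  labs(title = "12-Month ARIMA Forecast", y = "kwh")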
Point Forecasts and Distributional Forecasts
The table below can be used to deploy our forecasts. For example, if an electrical company's revenue depends on energy consumption and they need to create a budget, they could use the point forecasts as a parameter to model their budget around.
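The original code for assembling this table is not shown; the sketch below is one way it could be built from final_fc above. The prediction-interval level behind the lower and upper columns is not stated, so 95% is an assumption here.

# Sketch (assumptions: final_fc from the block above; 95% interval level).
fc_table <- final_fc |>
  hilo(level = 95) |>
  unpack_hilo(`95%`) |>
  as_tibble() |>
  transmute(month = format(month),
            lower = `95%_lower`,
            pt_fc = .mean,
            upper = `95%_upper`)

# The "Total" row sums each column over the 12 forecast months
fc_table |>
  bind_rows(summarise(fc_table, month = "Total",
                      across(c(lower, pt_fc, upper), sum)))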
# A tibble: 13 × 4
month lower pt_fc upper
<chr> <dbl> <dbl> <dbl>
1 2014 Jan 8745969. 10033059. 11320150.
2 2014 Feb 7385586. 8721726. 10057866.
3 2014 Mar 5341357. 6681231. 8021105.
4 2014 Apr 4655273. 5995437. 7335600.
5 2014 May 4597841. 5938028. 7278214.
6 2014 Jun 6868633. 8208821. 9549009.
7 2014 Jul 8194686. 9534874. 10875062.
8 2014 Aug 8683211. 10023399. 11363587.
9 2014 Sep 7152703. 8492891. 9833079.
10 2014 Oct 4525647. 5865835. 7206023.
11 2014 Nov 4818360. 6158548. 7498737.
12 2014 Dec 6937416. 8277604. 9617792.
13 Total 77906683. 93931453. 109956224.