In Part B of this project, we have a dataset of residential power usage from January 1998 through December 2013. The goal is to develop a model for these data and provide a monthly forecast for 2014.
The general process will be the same as outlined in Part A.
We begin by importing the data. Once the data are imported, we will parse the `YYYY-MMM` field into a proper date and then create the necessary tsibble.
# Assumed setup (likely loaded back in Part A): readxl for read_xlsx() and
# the fpp3 meta-package for the tsibble/fable/feasts and lubridate helpers.
library(readxl)
library(fpp3)

raw_data <- read_xlsx('../data/ResidentialCustomerForecastLoad-624.xlsx')
raw_data <- raw_data %>% mutate(date_field = as_date(ym(`YYYY-MMM`)))
raw_data
## # A tibble: 192 × 4
## CaseSequence `YYYY-MMM` KWH date_field
## <dbl> <chr> <dbl> <date>
## 1 733 1998-Jan 6862583 1998-01-01
## 2 734 1998-Feb 5838198 1998-02-01
## 3 735 1998-Mar 5420658 1998-03-01
## 4 736 1998-Apr 5010364 1998-04-01
## 5 737 1998-May 4665377 1998-05-01
## 6 738 1998-Jun 6467147 1998-06-01
## 7 739 1998-Jul 8914755 1998-07-01
## 8 740 1998-Aug 8607428 1998-08-01
## 9 741 1998-Sep 6989888 1998-09-01
## 10 742 1998-Oct 6345620 1998-10-01
## # ℹ 182 more rows
data_ts <- raw_data %>%
  select(-`YYYY-MMM`) %>%
  mutate(Month = yearmonth(date_field)) %>%
  as_tsibble(index = Month)
data_ts %>% autoplot(KWH)
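Before moving on, a quick check (our addition, not in the original report) confirms the monthly index has no implicit gaps:
# has_gaps() reports implicit missing rows in the tsibble index; any missing
# observations in this series should appear as explicit NA values instead.
data_ts %>% has_gaps()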
There is a value missing for CaseSequence 861. As we did in Part A, we will impute it by building a time series model on the data before the gap and forecasting one step ahead.
data_ts %>% filter(is.na(KWH))
## # A tsibble: 1 x 4 [1M]
## CaseSequence KWH date_field Month
## <dbl> <dbl> <date> <mth>
## 1 861 NA 2008-09-01 2008 Sep
early_data <- data_ts %>%
filter(CaseSequence < 861)
early_data %>%
autoplot(KWH)
early_data %>%
model(stl = STL(KWH)) %>%
components() %>%
autoplot()
From the plot of the data, we see what appears to be a slight positive trend along with clear evidence of seasonality. So we will explore exponential smoothing models, trying different combinations of error, trend, and season components.
fit_early <- early_data %>%
model(m1 = ETS(KWH ~ error("A") + trend("A") + season("A")),
m2 = ETS(KWH ~ error("A") + trend("A") + season("M")),
m3 = ETS(KWH ~ error("A") + trend("M") + season("M")),
m4 = ETS(KWH))
glance(fit_early) %>% arrange(AICc)
## # A tibble: 4 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m4 7.50e- 3 -1990. 4010. 4014. 4053. 266724313447. 268119581517. 6.53e-2
## 2 m1 3.06e+11 -1994. 4023. 4029. 4071. 267385294351. 268018613863. 4.10e+5
## 3 m2 3.07e+11 -1995. 4023. 4029. 4072. 268249229600. 269690812211. 4.08e+5
## 4 m3 3.15e+11 -1996. 4027. 4032. 4075. 275309122826. 284114223083. 4.10e+5
fit_early
## # A mable: 1 x 4
## m1 m2 m3 m4
## <model> <model> <model> <model>
## 1 <ETS(A,A,A)> <ETS(A,A,M)> <ETS(A,M,M)> <ETS(M,N,M)>
The AICc for the automatically selected model, ETS(M,N,M), was the lowest, so that's the model we will use.
fit_early <- early_data %>% model(ETS(KWH ~ error("M") + trend("N") + season("M")))
aug <- fit_early %>% augment()
aug %>%
  ggplot(aes(x = .innov)) +
  geom_histogram(bins = 30)
aug %>%
  ACF(.innov, lag_max = 36) %>%
  autoplot()
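A Ljung-Box test makes the visual impression concrete; the lag choice below (two seasonal cycles for monthly data) is ours, not from the original report.
# Portmanteau test on the innovation residuals; a small p-value indicates
# autocorrelation remains. dof could additionally be set to the number of
# estimated ETS parameters for a stricter test.
aug %>% features(.innov, ljung_box, lag = 24)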
Looking at the ACF of the innovation residuals, we see that significant autocorrelation remains, so the ETS model has not captured all of the structure in the data. We are going to try some differencing and focus on using an ARIMA model.
early_data %>%
gg_tsdisplay(KWH, plot_type = 'partial')
We see strong evidence of seasonality in the lags, so we will both transform the response variable and difference the data to remove the seasonality.
# BoxCoxTrans() comes from the caret package.
library(caret)
bc <- BoxCoxTrans(early_data$KWH)
lambda <- bc$lambda

# Note: KWH^lambda - 1/lambda differs from the usual Box-Cox transform,
# (KWH^lambda - 1)/lambda, only by a fixed scale and shift, so it has the
# same variance-stabilizing effect and fable can still invert it when
# producing forecasts on the original scale.
early_data %>%
  autoplot(KWH^lambda - 1/lambda)
early_data %>%
  gg_tsdisplay(difference(KWH^lambda - 1/lambda, 12), plot_type = 'partial')
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
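We can also let unit-root features confirm the differencing choice; these checks are our addition, not part of the original analysis.
# unitroot_nsdiffs estimates how many seasonal differences are needed, and
# unitroot_ndiffs checks whether a further ordinary difference is required
# after one seasonal difference.
early_data %>% features(KWH, unitroot_nsdiffs)
early_data %>% features(difference(KWH, 12), unitroot_ndiffs)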
Based on the ACF and PACF charts, we will try the following seasonal ARIMA models:
fit_early <- early_data %>%
  model(m1 = ARIMA((KWH^lambda - 1/lambda) ~ pdq(4,0,0) + PDQ(1,1,0)),
        m2 = ARIMA((KWH^lambda - 1/lambda) ~ pdq(0,0,4) + PDQ(0,1,1)))
glance(fit_early) %>% arrange(AICc)
## # A tibble: 2 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 m2 0.0000146 566. -1118. -1117. -1099. <cpl [0]> <cpl [16]>
## 2 m1 0.0000151 559. -1107. -1106. -1090. <cpl [16]> <cpl [0]>
Based on the AICc, we will go with the second model to generate our one-step-ahead forecast, and we will then plug that imputed value into our original time series to replace the missing value.
fit_early <- early_data %>%
  model(ARIMA((KWH^lambda - 1/lambda) ~ pdq(0,0,4) + PDQ(0,1,1)))
fc_early <- fit_early %>% forecast(h=1)
data_ts[data_ts$CaseSequence == 861,]$KWH <- fc_early$.mean
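A quick sanity check (our addition) confirms the gap is filled:
# CaseSequence 861 should now carry the one-step-ahead forecast mean
# instead of NA.
data_ts %>% filter(CaseSequence == 861)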
Now that we've updated the time series, we can build models on the complete data and then forecast from the best of them.
data_ts %>% model(STL(KWH)) %>%
components() %>%
autoplot()
fit <- data_ts %>%
  model(m1 = ETS(KWH ~ error("A") + trend("N") + season("A")),
        m2 = ETS(KWH ~ error("A") + trend("N") + season("M")),
        m3 = ARIMA((KWH^lambda - 1/lambda) ~ 1 + pdq(4,0,0) + PDQ(1,1,0)),
        m4 = ARIMA((KWH^lambda - 1/lambda) ~ 1 + pdq(0,0,4) + PDQ(0,1,1)),
        m5 = ARIMA((KWH^lambda - 1/lambda)))
glance(fit) %>% arrange(AICc)
## # A tibble: 5 × 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 m5 2.78e- 5 733. -1456. -1456. -1440. NA NA NA
## 2 m4 2.86e- 5 707. -1399. -1398. -1373. NA NA NA
## 3 m3 3.72e- 5 683. -1352. -1351. -1330. NA NA NA
## 4 m2 7.30e+11 -3120. 6270. 6272. 6319. 676896866919. 7.06e11 516061.
## 5 m1 7.36e+11 -3121. 6271. 6274. 6320. 682203380453. 7.05e11 514104.
## # ℹ 2 more variables: ar_roots <list>, ma_roots <list>
fit
## # A mable: 1 x 5
## m1 m2 m3
## <model> <model> <model>
## 1 <ETS(A,N,A)> <ETS(A,N,M)> <ARIMA(4,0,0)(1,1,0)[12] w/ drift>
## # ℹ 2 more variables: m4 <model>, m5 <model>
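One caveat when reading the table above: AICc is only comparable among models fit to the same response, so the ETS fits (raw KWH) and the ARIMA fits (transformed KWH) should each be ranked within their own family. A holdout comparison on the original scale is a fairer cross-family check; a sketch, with a 24-month test window of our own choosing:
# Illustrative holdout comparison (not in the original report): refit on
# data through 2011, forecast the last 24 months, and score both model
# families on the original KWH scale.
train <- data_ts %>% filter(Month <= yearmonth("2011 Dec"))
train %>%
  model(ets   = ETS(KWH ~ error("A") + trend("N") + season("A")),
        arima = ARIMA((KWH^lambda - 1/lambda))) %>%
  forecast(h = 24) %>%
  accuracy(data_ts)
Because forecast() inverts the response transformation, both families are scored in KWH here.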
Based on the AICc among the ARIMA fits, the automatically selected model m5 is the best: an MA(1) with a seasonal MA(2) after one seasonal difference, i.e. ARIMA(0,0,1)(0,1,2)[12]. So we will use that specification to build our final model and forecast the remaining data.
fit_final <- data_ts %>%
  model(ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1) + PDQ(0, 1, 2)))
partb_fcst <- fit_final %>% forecast(h=12)
partb_fcst
## # A fable: 12 x 4 [1M]
## # Key: .model [1]
## .model Month KWH .mean
## <chr> <mth> <dist> <dbl>
## 1 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Jan t(N(10, 2.8e-05)) 9.55e6
## 2 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Feb t(N(10, 2.8e-05)) 8.19e6
## 3 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Mar t(N(10, 2.8e-05)) 6.85e6
## 4 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Apr t(N(10, 2.8e-05)) 6.02e6
## 5 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 May t(N(10, 2.8e-05)) 5.70e6
## 6 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Jun t(N(10, 2.8e-05)) 7.48e6
## 7 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Jul t(N(10, 2.8e-05)) 6.80e6
## 8 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Aug t(N(10, 2.8e-05)) 9.52e6
## 9 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Sep t(N(10, 2.8e-05)) 8.69e6
## 10 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Oct t(N(10, 2.8e-05)) 6.25e6
## 11 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Nov t(N(10, 2.8e-05)) 5.67e6
## 12 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Dec t(N(10, 2.8e-05)) 7.31e6
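Before exporting, two quick visual checks (our addition): the forecast in the context of the full history, and the residual diagnostics for the final model.
# Plot the 2014 forecast (with prediction intervals) over the history, and
# show the standard residual diagnostics for the final fitted model.
partb_fcst %>% autoplot(data_ts)
fit_final %>% gg_tsresiduals()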
# A fable carries a distribution column that write.xlsx() (from the xlsx
# package, assumed loaded in Part A) cannot store, so keep only atomic
# columns before writing.
partb_fcst %>%
  as_tibble() %>%
  mutate(Month = as.character(Month)) %>%
  select(Month, KWH = .mean) %>%
  write.xlsx('../data/project1_forecast.xlsx', sheetName = 'partb_fcst', append = TRUE)