library(fpp3)
library(tidyverse)
library(ggrepel)
library(seasonal)
library(fabletools)
library(corrplot)
library(GGally)
library(patchwork)
library(caret)
library(readxl)
library(imputeTS)
theme_set(theme_bw())
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
The first step in the analysis is to download the data and explore any missing values within the dataset. The below code loads the data from the provide Excel file and checks for any missing values within both the Date and KWH columns.
# Load in Data from Excel File
energy_data = read_excel("ResidentialCustomerForecastLoad-624.xlsx")
energy_data = energy_data |> mutate(Date = yearmonth(`YYYY-MMM`), key = 1) |> select(-`YYYY-MMM`)
# Convert to tsibble
energy_data = as_tsibble(energy_data, index = "Date", key = "key")
# Check for Missing Values in Date or KWH Columns
energy_data |> filter(is.na(Date) | is.na(KWH))
There is one missing value from the KWH column. Before deciding on how to deal with the missing value, a time plot and seasonal plots are looked at to see what the data looks like, as well as see if there’s any seasonality.
# Time Plot
energy_data |> autoplot(KWH) +
labs(title = "Time Plot of Residential Power Usage", y = "Power Usage (kWh)")
# Seasonal Plots
energy_data |> gg_season(KWH) +
labs(title = "Seasonal Plot of Residential Power Usage", y = "Power Usage (kWh)")
# Filtered years to investigate data better
energy_data |> filter(Date >= yearmonth("2008")) |> gg_season(KWH) +
labs(title = "Time Plot of Residential Power Usage since 2008", y = "Power Usage (kWh)")
Looking at the time plot and seasonal plots above, the data does appear to have seasonality. In the seasonal plot that was filtered to only include dates in 2008 or later, the value of KWH decreases between the months of August and October. Since the missing value is in September, it’s reasonable to use linear interpolation for the missing value. The below code uses linear interpolation for the missing value, and then confirms there are no longer any missing values in the dataset.
energy_data = energy_data |> select(KWH) |> na_interpolation(option = "linear")
energy_data |> filter(is.na(KWH))
The time plot and seasonal plot previously generated shows a significant outlier in July 2010. Looking at the data from the year 2010 below, the outlier in July is about 10x less than any of the other values that year. Additionally, the seasonal plot from before shows that KWH increases from June to July, and then again from July to August. This suggests that the KWH in July 2010 should lie somewhere between that of June and August 2010.
energy_data |> filter(Date >= yearmonth("2010 Jan"), Date <= yearmonth("2010 Dec"))
Based on this information above, it seems likely that the data from July 2010 was an incorrect entry, and that a digit was missed when the data was entered. As a result, the KWH value on July 2010 is multiplied by 10.
energy_data = energy_data |> mutate(KWH = case_when(
Date == yearmonth("2010 Jul") ~ KWH*10,
.default = KWH))
energy_data |> autoplot(KWH) +
labs(title = "Time Plot of Residential Power Usage after Handling Missing Data and Outliers",
y = "Power Usage (kWh)")
energy_data |> filter(Date >= yearmonth("2008")) |> gg_season(KWH) +
labs(title = paste("Seasonal Plot of Residential Power Usage Since 2008 after",
"Handling Missing Data and Outliers", sep = "\n"),
y = "Power Usage (kWh)")
The time and seasonal plots show that after the outlier was multiplied by 10, it now is reasonable when looking at the seasonality of the data, as well as the magnitude of the remaining data.
Now that all missing values and outliers have been handled, I wanted to further investigate the data for potential trends and seasonality. A time plot with the updated dataset was provided in the previous section, whereas the below code generates seasonal plots for the updated dataset to look for seasonality, trend, and cyclicity.
energy_data |> gg_season(KWH) +
labs(title = "Seasonal Plot of Residential Power Usage", y = "Power Usage (kWh)")
energy_data |> ACF(KWH, lag_max = 36) |> autoplot() +
labs(title = "ACF Plot of Residential Power Usage")
energy_data |> gg_lag(KWH, lags = 1:12, geom = "point") +
labs(title = "Lag Plot of Residential Power Usage")
The seasonal plot above shows that there is strong seasonality, where the KWH data follows a sinusoidal pattern with higher energy uses in the winter months and summer months, and lower energy uses in the spring and autumn months. The ACF plot and gg_lag plot confirm this information, as the ACF plot has high positive peaks every 6 and 12 months, and the gg_lag plot has strong positive linear correlation in lags 6 and 12 when compared to other lag values.
After checking the data for possible trends and seasonality, the data was analyzed to see if any data transformations would improve the analysis.
BoxCoxTrans(energy_data$KWH)
## Box-Cox Transformation
##
## 192 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4313019 5443503 6351262 6538942 7649733 10655730
##
## Largest/Smallest: 2.47
## Sample Skewness: 0.485
##
## Estimated Lambda: -0.3
The above function suggests that taking the negative square root or log of the dataset would result in a better dataset for modeling. The log transformation was selected for easier interpretability.
energy_log = energy_data |> mutate(log_kwh = log(KWH))
energy_log |> autoplot(log_kwh) +
labs(title = "Time Plot of Residential Power Usage after Log Transformation",
y = "log(Power Usage) (log[kWh])")
The log transformation of the data has leveled the variance out based on the time plot above.
In order to test different models and their accuracy, a training and test set will be generated from the dataset. The training data will then be used to generate a forecast, which will then be compared to the test set. For the split, about 80% of the data will be used for the training set, and about 20% will be used for the test set.
# Training Set
energy_train = energy_log |> head(round(nrow(energy_log)*0.8))
# Test Set
energy_test = energy_log |> tail(round(nrow(energy_log)*0.2))
Before generating models, I wanted to further explore the data by decomposing the dataset into their seasonal, trend, and remainder components to get a better idea of how the data is behaving. The below code uses various decomposition methods to break these components out.
# STL Decomposition
energy_stl = energy_train |> model(stl = STL(log_kwh))
energy_stl |> components() |> autoplot()
# Classical Decomposition (Additive)
energy_class_dcmp_add = energy_train |> model(classical_decomposition(log_kwh, type = "additive"))
energy_class_dcmp_add |> components() |> autoplot()
# Classical Decomposition (Multiplicative)
energy_class_dcmp_mul = energy_train |> model(classical_decomposition(log_kwh, type = "multiplicative"))
energy_class_dcmp_mul |> components() |> autoplot()
# X-11 Decomposition
energy_x11 = energy_train |> model(X_13ARIMA_SEATS(log_kwh ~ x11()))
energy_x11 |> components() |> autoplot()
# SEATS Decomposition
energy_seats = energy_train |> model(X_13ARIMA_SEATS(log_kwh ~ seats()))
energy_seats |> components() |> autoplot()
All decomposition methods show that there is some level of seasonality within the data. Additionally, although the trend component varies between the decomposition method, there generally appears to be a positive trend as time passes.
The following few subsections create models using a variety of methods. The first section below creates models using simple methods.
# Mean Method
energy_mean_model = energy_train |> model(Mean = MEAN(log_kwh))
# Naive Method
energy_naive_model = energy_train |> model(Naive = NAIVE(log_kwh))
# Seasonal Naive Method
energy_snaive_model = energy_train |> model(SNaive = SNAIVE(log_kwh ~ lag("year")))
# Drift Method
energy_drift_model = energy_train |> model(Drift = RW(log_kwh ~ drift()))
The following models use simple models on seasonally adjusted data. The data has been decomposed using either the STL method, X11 method, or SEATS method. The classic decomposition methods are not used for generating models since they utilize moving averages that doesn’t allow for the trends to be calculated for the first and last few observations.
# MEAN Method on Seasonally Adjusted Data using STL Decomposition
stl_energy_mean_model = energy_train |> model(STL_MEAN = decomposition_model(STL(log_kwh),
MEAN(season_adjust)))
# Naive Method on Seasonally Adjusted Data using STL Decomposition
stl_energy_naive_model = energy_train |> model(STL_NAIVE = decomposition_model(STL(log_kwh),
NAIVE(season_adjust)))
# MEAN Method on Seasonally Adjusted Data using X11 Decomposition
x11_energy_mean_model = energy_train |>
model(x11_mean = decomposition_model(X_13ARIMA_SEATS(log_kwh ~ x11()), MEAN(season_adjust)))
# Naive Method on Seasonally Adjusted Data using X11 Decomposition
x11_energy_naive_model = energy_train |>
model(x11_naive = decomposition_model(X_13ARIMA_SEATS(log_kwh ~ x11()), NAIVE(season_adjust)))
# MEAN Method on Seasonally Adjusted Data using SEATS Decomposition
seats_energy_mean_model = energy_train |>
model(seats_mean = decomposition_model(X_13ARIMA_SEATS(log_kwh ~ seats()), MEAN(season_adjust)))
# Naive Method on Seasonally Adjusted Data using SEATS Decomposition
seats_energy_naive_model = energy_train |>
model(seats_naive = decomposition_model(X_13ARIMA_SEATS(log_kwh ~ seats()), NAIVE(season_adjust)))
The following code generates a model using exponential smoothing. The parameters used for the model are automatically selected from the ETS function.
ets_energy_model = energy_train |> model(ETS = ETS(log_kwh))
report(ets_energy_model)
## Series: log_kwh
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0711126
## gamma = 0.0001026117
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 15.58935 -0.06024831 -0.274027 -0.09901397 0.192135 0.2517406 0.1951448
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## -0.006916807 -0.2419686 -0.1766857 -0.06648136 0.06709745 0.2192238
##
## sigma^2: 0.0078
##
## AIC AICc BIC
## 43.97093 47.44919 89.52522
components(ets_energy_model) |> autoplot()
Based on the output from the model above, the best exponential model is an ETS(A,N,A) model, which has additive errors, no trend component, and additive seasonality.
The following code looks into ARIMA modeling. Before generating the model, the dataset needs to be stationary.
energy_train |> gg_tsdisplay(log_kwh, plot_type = "partial", lag = 36) +
labs(title = "Plots of Residential Power Usage after Log Transformation",
y = "log(Power Usage) (log[kWh])")
Looking at the plot above, the data does not appear to be stationary due to the seasonality within the dataset. As a result, differencing needs to be looked at.
# Check for Number of Differences Based on Unit Root Tests
energy_train |> features(log_kwh, unitroot_kpss)
energy_train |> features(log_kwh, unitroot_ndiffs)
# Check for Number of Seasonal Differences Based on Unit Root Tests
energy_train |> features(log_kwh, feat_stl)
energy_train |> features(log_kwh, unitroot_nsdiffs)
The KPSS root test shows a p-value greater than 0.05, which suggests that differencing is not required. On the other hand, the seasonal strength is greater than 0.64, which suggests that seasonal differencing is required. These tests, combined with the time plot and ACF plot above, there are clear signs that the data is non-stationary. When looking at the KPSS tests, a single seasonal difference is recommended, but no other differencing is required. The below code shows the seasonal differencing, as well as the plots of the data after seasonal differencing.
energy_train |> gg_tsdisplay(difference(log_kwh, 12), plot_type = "partial", lag = 36) +
labs(title = paste("Plots of Residential Power Usage after",
"Log Transformation and Seasonal Differencing", sep = "\n"),
y = "log(Power Usage) (log[kWh])")
energy_train |> features(difference(log_kwh, 12), unitroot_kpss)
energy_train |> features(difference(log_kwh, 12), feat_stl)
After taking the seasonal difference, the data now appears to be stationary, which is confirmed with the KPSS tests that suggest no additional differences are required.
The significant spike at lag = 12 on the ACF plot above suggests a seasonal MA(1) component could be used. However, the significant spike at lag = 24 on the PACF above suggests a seasonal AR(2) component could be used. Additionally, the spike at lag = 1 on the ACF plot suggests a non-seasonal MA(1) component could be used, whereas the spike at lag = 1 on the PACF plot suggests a non-seasonal AR(1) component could be used. Overall, this would suggest 4 different models could be used: \(ARIMA(0,0,1)(0,1,1)_{12}\), \(ARIMA(1,0,0)(0,1,1)_{12}\), \(ARIMA(0,0,1)(2,1,0)_{12}\), and \(ARIMA(1,0,0)(2,1,0)_{12}\). All of these models, along with a model that is automatically selected by the ARIMA function, will be compared.
energy_arima_models = energy_train |>
model(
arima001011 = ARIMA(log_kwh ~ 0+ pdq(0,0,1) + PDQ(0,1,1)),
arima100011 = ARIMA(log_kwh ~ 0 + pdq(1,0,0) + PDQ(0,1,1)),
arima001210 = ARIMA(log_kwh ~ 0 + pdq(0,0,1) + PDQ(2,1,0)),
arima100210 = ARIMA(log_kwh ~ 0 + pdq(1,0,0) + PDQ(2,1,0)),
arima_auto = ARIMA(log_kwh, stepwise = FALSE, approx = FALSE)
)
energy_arima_models |> pivot_longer(everything(), names_to = "Model name",
values_to = "Orders")
report(energy_arima_models |> select(arima_auto))
## Series: log_kwh
## Model: ARIMA(4,0,0)(0,1,2)[12] w/ drift
##
## Coefficients:
## ar1 ar2 ar3 ar4 sma1 sma2 constant
## 0.2448 -0.0403 0.2578 -0.2227 -0.8619 0.1883 0.0067
## s.e. 0.0865 0.0816 0.0846 0.0830 0.0925 0.1044 0.0026
##
## sigma^2 estimated as 0.006951: log likelihood=150.04
## AIC=-284.08 AICc=-283 BIC=-260.44
Looking at the list of models above, the model that was automatically generated from the ARIMA function was an \(ARIMA(4,0,0)(0,1,2)_{12}\ with\ drift\).
Now that all of the models have been generated using the training dataset, all models will generate forecasts and compared to the test dataset. The below code generates these forecasts against the test set.
# Mean Method
energy_mean_forecast = energy_mean_model |> forecast(new_data = energy_test)
# Naive Method
energy_naive_forecast = energy_naive_model |> forecast(new_data = energy_test)
# Seasonal Naive Method
energy_snaive_forecast = energy_snaive_model |> forecast(new_data = energy_test)
# Drift Method
energy_drift_forecast = energy_drift_model |> forecast(new_data = energy_test)
# Naive Method Using Seasonally Adjusted Data from Various Methods
stl_energy_mean_forecast = stl_energy_mean_model |> forecast(new_data = energy_test)
stl_energy_naive_forecast = stl_energy_naive_model |> forecast(new_data = energy_test)
x11_energy_mean_forecast = x11_energy_mean_model |> forecast(new_data = energy_test)
x11_energy_naive_forecast = x11_energy_naive_model |> forecast(new_data = energy_test)
seats_energy_mean_forecast = seats_energy_mean_model |> forecast(new_data = energy_test)
seats_energy_naive_forecast = seats_energy_naive_model |> forecast(new_data = energy_test)
# ETS Method
ets_energy_forecast = ets_energy_model |> forecast(new_data = energy_test)
# ARIMA Methods
energy_arima_forecasts = energy_arima_models |> forecast(new_data = energy_test)
The code below generates 3 different tables using the forecasts above: performance comparisons based on residuals, performance comparisons based on point forecasts, and performance comparisons based on distributional forecasts.
# Compare Performance of Models Based on Residuals
bind_rows(
energy_mean_model |> accuracy(),
energy_naive_model |> accuracy(),
energy_snaive_model |> accuracy(),
energy_drift_model |> accuracy(),
stl_energy_mean_model |> accuracy(),
stl_energy_naive_model |> accuracy(),
x11_energy_mean_model |> accuracy(),
x11_energy_naive_model |> accuracy(),
seats_energy_mean_model |> accuracy(),
seats_energy_naive_model |> accuracy(),
ets_energy_model |> accuracy(),
energy_arima_models |> accuracy()
) |>
select(-ME, -MPE, -ACF1) |>
arrange(RMSE)
# Compare Performance of Models Based on Point Forecasts
bind_rows(
energy_mean_forecast |> accuracy(energy_log),
energy_naive_forecast |> accuracy(energy_log),
energy_snaive_forecast |> accuracy(energy_log),
energy_drift_forecast |> accuracy(energy_log),
stl_energy_mean_forecast |> accuracy(energy_log),
stl_energy_naive_forecast |> accuracy(energy_log),
x11_energy_mean_forecast |> accuracy(energy_log),
x11_energy_naive_forecast |> accuracy(energy_log),
seats_energy_mean_forecast |> accuracy(energy_log),
seats_energy_naive_forecast |> accuracy(energy_log),
ets_energy_forecast |> accuracy(energy_log),
energy_arima_forecasts |> accuracy(energy_log),
) |>
select(-ME, -MPE, -ACF1) |>
arrange(RMSE)
# Compare Performance of Models Based on Distributional Forecasts
bind_rows(
energy_mean_forecast |> accuracy(energy_log, list(crps = CRPS)),
energy_naive_forecast |> accuracy(energy_log, list(crps = CRPS)),
energy_snaive_forecast |> accuracy(energy_log, list(crps = CRPS)),
energy_drift_forecast |> accuracy(energy_log, list(crps = CRPS)),
stl_energy_mean_forecast |> accuracy(energy_log, list(crps = CRPS)),
stl_energy_naive_forecast |> accuracy(energy_log, list(crps = CRPS)),
x11_energy_mean_forecast |> accuracy(energy_log, list(crps = CRPS)),
x11_energy_naive_forecast |> accuracy(energy_log, list(crps = CRPS)),
seats_energy_mean_forecast |> accuracy(energy_log, list(crps = CRPS)),
seats_energy_naive_forecast |> accuracy(energy_log, list(crps = CRPS)),
ets_energy_forecast |> accuracy(energy_log, list(crps = CRPS)),
energy_arima_forecasts |> accuracy(energy_log, list(crps = CRPS)),
) |> arrange(crps)
Based on the tables above, the method that performed the best when modelling over the training set was the MEAN method using the seasonally adjusted data from the STL decomposition, followed by ARIMA model that was automatically selected. This is based on having the lowest Root Mean Squared Error (RMSE) value. When looking at the performance of the models based on the forecasts on the test dataset, different models appear to perform the best, depending on the metric that is observed. For example, when looking at the point forecast, the seasonal naive model performed the best based on the RMSE values, followed by the naive method using STL decomposition and the ARIMA model that was automatically selected. However, other metrics, such as Mean Absolute Percentage Error (MAPE) or Mean Absolute Scaled Error (MASE), show that the ARIMA model automatically selected performed better. This suggests that the ARIMA model is a good model based on the point forecasts. When looking at the forecast distributions, the model that performed the best was the ARIMA model automatically selected, followed by the seaonal naive model, based on the Continuous Ranked Probability Score (CRPS) values.
Based on the information above, the model that appears to perform the best overall is the ARIMA model that was automatically selected because it is the only model that finished in the top 3 of all the metrics depicted above. However, before using the ARIMA method for forecasting, the residuals are evaluated to ensure that everything has been captured in the model.
energy_arima_models |> select(arima_auto) |> gg_tsresiduals(lag_max = 36) +
labs(title = paste("Residual Plots of Residential Power Usage after",
"Log Transformation and Seasonal Differencing", sep = "\n"),
y = "log(Power Usage) (log[kWh])")
energy_arima_models |> select(arima_auto) |> augment() |> features(.innov, ljung_box, lag = 24)
energy_arima_forecasts |> filter(.model == "arima_auto") |> autoplot(energy_log) +
labs(title = paste("Time Plot of Residential Power Usage after Log Transformation with",
"Forecast over Test Dataset", sep = "\n"),
y = "log(Power Usage) (log[kWh])")
The residual plots abvoe show that the residuals behave similarly to white noise, are normally distributed, and have a mean close to 0. As a result, the ARIMA model appears to capture all information from the dataset. This is confirmed with the Ljung-Box test, which has a p-value greater than 0.05. As a result, the \(ARIMA(4,0,0)(0,1,2)_{12}\ with\ drift\) model can be used for forecasting.
The following section generates a 12 month forecast for the residential power usage using the \(ARIMA(4,0,0)(0,1,2)_{12}\ with\ drift\) model. The model utilizes all data from the dataset to ensure all information is taken into consideration.
energy_arima = energy_data |> model(ARIMA(log(KWH) ~ 1 + pdq(4,0,0) + PDQ(0,1,2)))
report(energy_arima)
## Series: KWH
## Model: ARIMA(4,0,0)(0,1,2)[12] w/ drift
## Transformation: log(KWH)
##
## Coefficients:
## ar1 ar2 ar3 ar4 sma1 sma2 constant
## 0.2881 -0.0060 0.2132 -0.1397 -0.7874 0.1583 0.0099
## s.e. 0.0782 0.0772 0.0781 0.0756 0.0824 0.0993 0.0027
##
## sigma^2 estimated as 0.008185: log likelihood=176.7
## AIC=-337.39 AICc=-336.55 BIC=-311.85
energy_arima_forecast = energy_arima |> forecast(h = 12)
energy_arima_forecast |> autoplot(energy_data) +
labs(title = paste("Time Plot of Residential Power Usage with 12 Month Forecast",
"Using ARIMA(4,0,0)(0,1,2)12 with Drift Model", sep = "\n"))
# Convert Forecast Distributions to Intervals
energy_arima_intervals = energy_arima_forecast |> hilo()
# Total KWH Uage of 12 Month Forecast Based on Mean
sum(energy_arima_intervals$.mean)
## [1] 92491851
# Total KWH Uage of 12 Month Forecast Based on 80% Confidence Interval
sum(energy_arima_intervals$`80%`$lower)
## [1] 81446180
sum(energy_arima_intervals$`80%`$upper)
## [1] 104080906
# Total KWH Uage of 12 Month Forecast Based on 95% Confidence Interval
sum(energy_arima_intervals$`95%`$lower)
## [1] 76327994
sum(energy_arima_intervals$`95%`$upper)
## [1] 111061123
The plot above shows a 12 Month Forecast for the residential energy usage in kWh. Based on the forecast intervals, the total energy usage over the next 12 months has 80% chance of being between 81,446,180 kWh and 104,080,906 kWh, and a 95% chance of being between 76,327,994 kWh and 111,061,123 kWh. The average residential energy usage over the next 12 months is 92,491,851 kWh.
The final bit of code below exports the relevant outputs from the forecast to a CSV file.
energy_arima_intervals_full = energy_arima_intervals |> mutate(
lower_interval_95 = energy_arima_intervals$`95%`$lower,
upper_interval_95 = energy_arima_intervals$`95%`$upper,
lower_interval_80 = energy_arima_intervals$`80%`$lower,
upper_interval_80 = energy_arima_intervals$`80%`$upper
)
energy_arima_intervals_full = energy_arima_intervals_full |>
select(-.model, -KWH, -`80%`, -`95%`) |>
rename("kwh_mean" = .mean)
write_excel_csv(energy_arima_intervals_full, "data624_partB_forecast.csv")