Introduction

Energy is vital to the sustainable development of any nation, be it social, economic, or environmental. During the last decade, energy use has grown exponentially worldwide. The normal state of affairs in today’s developed countries is that energy is consumed everywhere at all times. Ten searches on the Internet, for example, require one-kilowatt hour (kWh) of electricity. A similar amount is required for a 30-watt energy saving lamp to burn uninterrupted for 36 hours. About 30 kWh of energy are used in the average home—for heating, hot water, and electrical appliances—every day.

The energy management is crucial to the economic future prosperity and environmental safety. Predicting energy consumption is essential for energy planning, strategy formulation, and recommendation of energy policies. Since energy consumption has always increased because of rapid population growth, the world population from 7.5 billion today to about 9.5 billion over the next 30 years. Such increase in population would produce a dramatic impact on energy demand, making at least double by 2030. Hence, energy consumption prediction is necessary to keep the energy supply in balance with its demand.

In this report, forecasting and time series analysis would be used to provide a good forecast and seasonality explanation. The process will include : Data Preprocess, Seasonality Analysis, Model Fitting, Prediction Performance, and Conclusion. The objective of this report is to produce forecasting result and seasonality explanation for hourly energy consumption provided by JM Interconnection LLC, a regional transmission organization (RTO) in the United States, that would be evaluated on the next 7 days.

Source: Article

Data Preprocess

Data & Library Import

# Library Import
library(lubridate)
library(dplyr)
library(ggplot2)
library(scales)
library(lubridate)
library(tidyverse)
library(xts)
library(forecast)
library(padr)
library(zoo)
library(tseries)
library(RColorBrewer)
# Data Import
aep <- read.csv("AEP_hourly.csv")
head(aep)

The hourly power consumption data comes from PJM’s website and are in megawatts (MW).

PJM Interconnection LLC (PJM) is a regional transmission organization (RTO) in the United States. It is part of the Eastern Interconnection grid operating an electric transmission system serving all or parts of Delaware, Illinois, Indiana, Kentucky, Maryland, Michigan, New Jersey, North Carolina, Ohio, Pennsylvania, Tennessee, Virginia, West Virginia, and the District of Columbia.

This dataset includes information about:

  • Datetime: The timestamp of energy consumption.
  • AEP_MW: Megawatt energy consumption provided by AEP provider.

Time series padding

# Data coercion
aep <- aep %>% 
  mutate(Datetime = ymd_hms(Datetime)) %>% 
  arrange(Datetime)
# Determining the start and the end of the time interval
min(aep$Datetime)
## [1] "2004-10-01 01:00:00 UTC"
max(aep$Datetime)
## [1] "2018-08-03 UTC"

Based from the result, we can see that the start and the end of the time interval in our dataset are:

  • Start: 2004-10-01 01:00:00
  • End: 2018-08-03 00:00:00
# Ensure the data is complete for the time period
aep %>% 
  pad(start_val = min(aep$Datetime), end_val = max(aep$Datetime)) %>% 
  anyNA
## [1] TRUE

Since the data is incomplete for the time period, we have to fill them before converting it into time series object.

# Padding
aep_full <- aep %>% 
  pad(start_val = min(aep$Datetime), end_val = max(aep$Datetime)) %>% 
  mutate(AEP_MW = na.approx(AEP_MW)) 

colnames(aep_full) <- c("Datetime", "EnergyConsumption")

anyNA(aep_full)
## [1] FALSE
# Converting to time series object
aepts <- aep_full$EnergyConsumption %>% ts(freq= 24)

The data has been converted into time series object and ready for further analysis.

Seasonality Analysis

Decomposition

# Decomposition (daily)
aepts %>% 
  tail(24*7*4) %>% 
  ts(frequency = 24) %>% 
  decompose() %>% 
  autoplot()

As can be seen, the daily approach suggest a strong seasonality. However, we can see the trend line is still forming a pattern, indicating this is a multiseasonal timeseries dataset. Hence, we have to reanalyze this dataset with multiseasonal timeseries. The other seasonal we will be approaching is weekly seasonality.

# Time series object (multi-seasonal)
aep_msts <- aep_full$EnergyConsumption %>% 
  msts(seasonal.periods = c(24, 24*7))
# Decomposition (multi-seasonal)
aep_msts %>% 
tail(24*7*4) %>% 
  msts(seasonal.periods = c(24, 24*7)) %>% 
  mstl() %>% 
  autoplot()

As can be seen from the plot, the trend line is now smooth and the second seasonality line plot confirmed our previous hypothesis that there’s a weekly seasonality besides the daily seasonality. As the last time series object showed the better decomposition than before, it will be used for the time series model building. Additionally, note that the data is an additive time series and this is a valuable information for the model building later.

Interpretation

As explained in the previous section, the data set is multiseasonal with two seasonalities, weekly seasonality and daily seasonality. In this section, we will try to interpret both seasonalities.

# Daily seasonality visualization
aep_daily <- aep_full %>% 
  tail(24*7) %>% 
  mutate(Hour = as.factor(hour(Datetime)))

blue_theme <- theme_update(
  plot.background = element_rect(fill = "#62a7c8", colour = NA),
  panel.background = element_rect(fill = "lightblue", colour = NA),
  axis.text = element_text(colour = "linen"),
  axis.title = element_text(colour = "linen"))

ggplot(data = aep_daily, aes(x = Datetime, y = EnergyConsumption)) +
  geom_point(color="linen") +
  geom_point(data = aep_daily %>% 
               filter(Hour %in% c("4", "17", "23")),
             aes(col = Hour)) +
  geom_line(color="linen") +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Daily Seasonality Visualization") +
  theme_set(blue_theme)

From the daily seasonality visualization above, it can be seen that EnergyConsumption will peak around 16.00 - 18.00. Furthermore, on midnight, around 23.00 - 00.00 the energy consumption seems to be moving around its mean. While during dawn, around 03.00 -05.00 they have the lowest amount of energy consumption almost everyday.

# Weekly seasonality visualization
aep_weekly <- aep_full %>% 
  head(24*7*5) %>% 
  mutate(Date = date(Datetime),
           Dayofweek = as.factor(wday(Datetime, label = T))) %>%
  group_by(Date, Dayofweek) %>% 
  summarise(EnergyConsumption = sum(EnergyConsumption)) %>% 
  head( - 1)
## `summarise()` has grouped output by 'Date'. You can override using the `.groups` argument.
blue_theme <- theme_update(
  plot.background = element_rect(fill = "#62a7c8", colour = NA),
  panel.background = element_rect(fill = "lightblue", colour = NA),
  axis.text = element_text(colour = "linen"),
  axis.title = element_text(colour = "linen"))

ggplot(data = aep_weekly, aes(x = Date, y = EnergyConsumption)) +
  geom_point(color="linen") +
  geom_point(data = aep_weekly %>% 
               filter(Dayofweek %in% c("Sun", "Mon", "Tue")),
             aes(col = Dayofweek)) +
  geom_line(color="linen") +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Weekly Seasonality Visualization") +
  theme_set(blue_theme)

From the weekly seasonality visualization above, it can be seen that EnergyConsumption usually peaks on Monday or Tuesday. In contrary, the lowest amount of energy consumption occurs on Sunday.

Model Fitting and Evaluation

Cross Validation

In this report, we will divide this data set into 2 parts, data train and data test. Data train will be used to train the data, meanwhile the data test will be used to predict and evaluate the models.

# Cross validation
msts_train <- head(aep_msts, length(aep_msts) - 24*7)
msts_test <- tail(aep_msts,  24*7)

# Subset to more recent period
msts_train <- tail(msts_train, 24*7*12)
autoplot(msts_train)

Next, we will make prediction for 7 days and plot them.

Model Fitting

Since the dataset is multiseasonal, the models we are going to use in this report will be the models that can handle seasonality. Hence, we will try to use: Holt’s Winters Exponential, SARIMA, STL, and TBATS.

Holt’s Winters Exponential

Holt’s Winters Exponential (Triple Exponential Smoothing) is the right forecasting method used for data that has trend and seasonal effects

# Model building
hw_model <- HoltWinters(x = msts_train, seasonal = "additive") 
hw_forecast <- forecast(object = hw_model, h = 24*7)
plot(hw_forecast)

Based on the forecast plot, it seems to be not good enough. However, to see it clearly, we are going to compare the actual energy consumption and the estimated energy consumption from this model through the visualization below:

# Actual VS estimated energy consumption visualization
aep_msts %>% tail(24*7*4) %>% autoplot(series = "actual") +
  autolayer(msts_test, series = "data test") +
  autolayer(hw_forecast$mean, series = "forecast") +
  scale_color_manual(values = c("black", "blue", "firebrick")) +
    labs(
    title = "Holt's Winters Exponential",
    x = "",
    y = "",
    colour = ""
  ) +
  theme_minimal() +
  theme(legend.position='bottom')

Based on the comparison between the actual energy consumption and the estimated energy consumption from this model through the visualization above, the model is quite good, although it could be better. It can’t seem to imitate the movement of the actual data well. But as we can see, there’s no lagging on this model.

After all the model fitting processes are done, we will evaluate and compare all the models used in the next section.

SARIMA

SARIMA (Seasonal Autoregressive Integrated Moving Average) is an ARIMA method where the existing time series objects have a seasonalilty. Meanwhile, ARIMA is a combination of two methods, namely Auto Regressive (AR) and Moving Average (MA). The I describes Integrated. The main purpose of ARIMA is to perform autocorrelation on data.

For Seasonal ARIMA, data has to be stationary. Hence, Augmented Dickey-Fuller test will be performed.

# Augmented Dickey-Fuller test
adf.test(msts_train)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  msts_train
## Dickey-Fuller = -5.9319, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary

Based on the result, since p-value < 0.05 (alpha), the timeseries are stationary, and no differencing is needed.

# Model building
sarima_model <- msts_train %>%
  stlm(method = "arima", lambda = NULL) %>% 
  forecast(h = 24*7)
plot(sarima_model)

Based on the forecast plot, it seems to be good enough. However, to see it clearly, we are going to compare the actual energy consumption and the estimated energy consumption from this model through the visualization below:

# Actual VS estimated energy consumption visualization
aep_msts %>% tail(24*7*4) %>% autoplot(series = "actual") +
  autolayer(msts_test, series = "data test") +
  autolayer(sarima_model$mean, series = "forecast") +
  scale_color_manual(values = c("black", "blue", "firebrick")) +
    labs(
    title = "SARIMA",
    x = "",
    y = "",
    colour = ""
  ) +
  theme_minimal() +
  theme(legend.position='bottom')

Based on the comparison between the actual energy consumption and the estimated energy consumption from this model through the visualization above, the model is good. It can imitate the movement of the actual data well and there’s no lagging on this model.

After all the model fitting processes are done, we will evaluate and compare all the models used in the next section.

STL

STL (Seasonal and Trend decomposition using Loess) conceptually smooth the nearest data for each observation by giving a bigger proportion to the data that is close to the observed data.

# Model building
stlm_model <- msts_train %>%
  stlm(method = "ets", lambda = NULL) %>% 
  forecast(h = 24*7) 
plot(stlm_model)

Based on the forecast plot, it seems to be good enough. However, to see it clearly, we are going to compare the actual energy consumption and the estimated energy consumption from this model through the visualization below:

# Actual VS estimated energy consumption visualization
aep_msts %>% tail(24*7*4) %>% autoplot(series = "actual") +
  autolayer(msts_test, series = "data test") +
  autolayer(stlm_model$mean, series = "forecast") +
  scale_color_manual(values = c("black", "blue", "firebrick")) +
    labs(
    title = "STL (ETS)",
    x = "",
    y = "",
    colour = ""
  ) +
  theme_minimal() +
  theme(legend.position='bottom')

Based on the comparison between the actual energy consumption and the estimated energy consumption from this model through the visualization above, the model is good. It can imitate the movement of the actual data well and there’s no lagging on this model.

After all the model fitting processes are done, we will evaluate and compare all the models used in the next section.

TBATS

A TBATS (Trigonometric Seasonal, Box-Cox Transformation, ARMA residuals, Trend and Seasonality) model differs from dynamic harmonic regression in that the seasonality is allowed to change slowly over time in a TBATS model, while harmonic regression terms force the seasonal patterns to repeat periodically without changing.

# Model building
tbats_mod <- msts_train %>%
            tbats(use.box.cox = FALSE, 
                  use.trend = TRUE, 
                  use.damped.trend = TRUE)
tbats_model <-  forecast(tbats_mod,h=24*7) 

plot(tbats_model)   

Based on the forecast plot, it seems to be good enough. However, to see it clearly, we are going to compare the actual energy consumption and the estimated energy consumption from this model through the visualization below:

# Actual VS estimated energy consumption visualization
aep_msts %>% tail(24*7*4) %>% autoplot(series = "actual") +
  autolayer(msts_test, series = "data test") +
  autolayer(tbats_model$mean, series = "forecast") +
  scale_color_manual(values = c("black", "blue", "firebrick")) +
    labs(
    title = "TBATS",
    x = "",
    y = "",
    colour = ""
  ) +
  theme_minimal() +
  theme(legend.position='bottom')

Based on the comparison between the actual energy consumption and the estimated energy consumption from this model through the visualization above, the model is quite good, although it could be better. It can’t seem to imitate the movement of the actual data well. But as we can see, there’s no lagging on this model.

After all the model fitting processes are done, we will evaluate and compare all the models used in the next section.

Prediction Performance

Model Evaluation & Comparison

In this section, we are going to evaluate all of the model used in the previous section and compare all 4 of them to decide which model is the best for this data set, and furthermore will be used in the prediction of the data test.

# Model evaluation
result <-  rbind(accuracy(as.vector(hw_forecast$mean), msts_test),
  accuracy(as.vector(sarima_model$mean), msts_test),
  accuracy(as.vector(stlm_model$mean) , msts_test), 
  accuracy(as.vector(tbats_model$mean) , msts_test)) 
rownames(result) <- c("hw_model", "sarima_model", "stlm_model","tbats_model")
result
##                      ME      RMSE       MAE        MPE      MAPE      ACF1
## hw_model     -1518.7452 1757.5218 1585.5841 -10.698516 11.065970 0.9623030
## sarima_model  -802.9875  941.9626  813.9863  -5.435729  5.497683 0.9490408
## stlm_model    -496.0608  690.3370  549.0244  -3.348116  3.653793 0.9505276
## tbats_model  -1087.7270 1251.2361 1089.4345  -7.379769  7.392544 0.9543804
##              Theil's U
## hw_model      3.231671
## sarima_model  1.617628
## stlm_model    1.161109
## tbats_model   2.197623

Based on the model evaluation result, STL (ETS) has the smallest MAPE of all model with only 3.65% and therefore is the best model compared to the others. Next, we will try to make the comparison of actual VS estimated number of visitors visualization of all 4 models.

accuracyData <- data.frame(datetime= aep_full$Datetime %>% tail(24*7),
  actual = as.vector(msts_test),
  hwForecast = as.vector(hw_forecast$mean),
  sarimaForecast = as.vector(sarima_model$mean),
  stlmForecast = as.vector(stlm_model$mean),
  tbatsForecast = as.vector(tbats_model$mean)
)
# Actual VS estimated number of visitors visualization
accuracyData %>% 
 ggplot() +
  geom_line(aes(x = (aep_full$Datetime %>% tail(24*7)), 
                y = (aep_full$EnergyConsumption %>% tail(24*7)), colour = "Actual"))+
  geom_line(aes(x = (aep_full$Datetime %>% tail(24*7)), y = hw_forecast$mean, colour = "HW"))+
  geom_line(aes(x = (aep_full$Datetime %>% tail(24*7)), y = sarima_model$mean, colour = "SARIMA"))+
  geom_line(aes(x = (aep_full$Datetime %>% tail(24*7)), y = stlm_model$mean, colour = "STL(ETS)"))+
  geom_line(aes(x = (aep_full$Datetime %>% tail(24*7)), y = tbats_model$mean,   colour = "TBATS"))+ 
  scale_y_continuous(labels = comma)+
  labs(
    title = "Hourly Energy Consumption of PJM",
    x = "Date",
    y = "",
    colour = ""
  ) +
  theme_minimal() +
  scale_color_manual(values = c("red", "black", "green", "blue", "purple"))

Based on the visualization above, it is clear that STL (ETS) is the closest model to imitate the movement of the actual data, followed by SARIMA model. Based on the MAPE and the visualization, STL (ETS) will be chosen to predict and we will check its assumption in the next section.

Assumption Checking

The assumptions in the time series models are tested to measure whether the residuals obtained from the modeling results are good enough to describe and capture information in the data. A good forecasting model produces the following residual values:

  1. Uncorrelated residuals. If there are correlated residuals, it means that there is still information left that should be used to calculate forecast results.
  2. Residuals have a mean of 0.

To ensure that the resulting residual has the above criteria, there are assumptions to test it.

No-autocorrelation residual

There are several methods to check the autocorrelation in the forecasting time series results, such as ACF plot and Ljung-Box test.

# ACF plot
acf(stlm_model$residuals)

Based on the ACF residual model, there is still a lag that crossed the dotted line, there could be an indication that the model you have has autocorrelation. However, since this method isn’t clear enough, we will check the autocorrelation using Ljung-Box test. And these are the hypothesis:

\(H_0\): residual has no-autocorrelation

\(H_1\): residual has autocorrelation

# Ljung-Box test
Box.test(stlm_model$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  stlm_model$residuals
## X-squared = 0.15899, df = 1, p-value = 0.6901

Based on the Ljung-Box test above, p-value > alpha, meaning that the residual/error in the data does not have autocorrelation. Which means that there is NO information left that should be used to calculate forecast results.

Normality of residual

To check the normality of the residuals in the forecasting time series results, we can perform a normality test (shapiro test) using the shapiro.test(residual model) function. And these are the hypothesis:

\(H_0\): residuals are normally distributed

\(H_1\): residuals are not normally distributed

# Shapiro test
shapiro.test(stlm_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  stlm_model$residuals
## W = 0.99061, p-value = 3.815e-10

Based on the Shapiro test result, unfortunately, because p-value < alpha, then the residuals are not normally distributed. Next we will check the distribution of the residuals using histogram.

# Histogram
hist(stlm_model$residuals)

From the histogram, we can see that the residuals actually looks distributed normally. However, the result of the Shapiro test said otherwise. Hence, we will conclude that the residuals are not normally distributed and the residuals didn’t have a mean of 0.

This could mean that the data train might not be large enough and therefore the model did not have enough information to imitate the movement of the actual data. As The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution. Thus, further data transformation will be needed (log transformation, scaling, etc).

It is important to notice that, the Normality of residual assumption does not need to be fulfilled. Because it is hard to find an actual data set that normally distributed. If the purpose of the time series model is to make predictions, then the Normality of residual assumption is not required to be met.

Conclusion

Based on the data used in this report and the Time Series Forecasting process that has been done, we can conclude that:

  • Based on the Seasonality Analysis on this section, we can see that:

    • From the daily seasonality, EnergyConsumption will peak around 16.00 - 18.00. Furthermore, on midnight, around 23.00 - 00.00 the energy consumption seems to be moving around its mean. While during dawn, around 03.00 -05.00 they have the lowest amount of energy consumption almost everyday.

    • From the weekly seasonality, EnergyConsumption usually peaks on Monday or Tuesday. In contrary, the lowest amount of energy consumption occurs on Sunday.

  • Based on the time series forecasting process that has been done, the chosen model to do the prediction of the data test is STL(ETS) model, with the consideration of:

    • Having the smallest MAPE of all model with only 3.65%.

    • Based on the visualization, STL (ETS) also has the closest model to imitate the movement of the actual data, compared to the other models.

  • The model fulfilled No-autocorrelation residual assumption but does NOT fulfill Normality of residual assumption. his could mean that the data train might not be large enough and therefore the model did not have enough information to imitate the movement of the actual data. Thus, further data transformation will be needed (log transformation, scaling, etc). It is also important to notice that, the Normality of residual assumption does not need to be fulfilled. Because it is hard to find an actual data set that normally distributed.