1 Forecasting Energy Demand

“Some things are easier to forecast than others. The time of the sunrise tomorrow morning can be forecast precisely. On the other hand, tomorrow’s lotto numbers cannot be forecast with any accuracy.” (Hyndman & Athanasopoulos 2018)

This report demostrates the process to forecast energy demand using the AEMO data set for NSW.

2 Data Pipeline

The following reusable functions are created to enapsulate the repetitive tasks in the proecessing pipeline:

2.1 Data splitting to Train and Test timeseries

Data will be split in chronological order with 80/20 train to test ratio. The training data set will be atleast 2 years long to ensure that it captures any existing seasonal behavior that are observed on yearly basis.

The function GetAemoDataAndSplit is used to load a preprocesed ts object. For the process of preparing the ts object refer to Appendix 1.

GetAemoDataAndSplit <-  function(start = c(2013, 1),
                                 region  = "NSW",
                                 freq = "daily", 
                                 shift = FALSE) 
{
                          
                          
    ret <- list()
    
    sfreq <- case_when(freq == "weekly" ~ "weekly",
                       TRUE ~ "daily")
    
    sregion <- case_when(region %in%  c("NSW", "QLD", "SA", "SNOWY", "TAS", "VIC") ~ region,
                         TRUE ~ "NSW")
    
    filename <-
      sprintf("rds/aemo_%s_%s_ts.rds", tolower(sregion), sfreq)
    
    if(shift)
      filename <- sprintf("rds/aemo_%s_%s_l_ts.rds", tolower(sregion), sfreq)
    
    ts1 <- readRDS(filename)
    ts1 <- window(ts1, start = start)
    
    #  split 80/20
    idx <- as.integer(length(ts1) * 0.8)
    
    ret$train <- ts(ts1[1:idx], start = start(ts1), frequency = frequency(ts1))
    ret$test <- ts(ts1[(idx + 1):length(ts1)], start = index(ts1)[idx + 1], frequency = frequency(ts1))
    
    ret$full <- ts1
    
    ret$state <- sregion
    ret$dataset <- sfreq
    
    return(ret)
}

Specifying the parameter freq = "weekly" in the function call will load the weekly aggregated timeseries data. The returned object includes the full range of the timeseries along with the train and test split timeseries.

nswWeeklyts <- GetAemoDataAndSplit(freq = "weekly")

autoplot(nswWeeklyts$train, main = "NSW Weekly Energy Demand", series = "Train set") + 
  autolayer(nswWeeklyts$test, series = "Test set") 

Outliers in this dataset are imputed to the as described in (Appendix 1){#Appendix1}.

2.2 Explore the ts object

This function will print attributes of the data set and perform ur.kpss() to assess stationarity:

ExploreTs <- function(tsobj) {
  ts1 <- tsobj$full
  
  
  cat(sprintf(
    "Start: %s\nEnd: %s\nFrequency %s\n",
    paste(start(ts1), collapse = " "),
    paste(end(ts1), collapse = " "),
    frequency(ts1)
  ))
  
  
  
  
  cat("Plots a time series components, acf and pacf graphs")
  
  plotSubTitle = sprintf("%s - %s", tsobj$state, tsobj$dataset)
  
  
  

  print(mstl(ts1) %>% autoplot(main=paste0(plotSubTitle , " with Trend, Seasonal and lag-1 differnece"))) 

  # 
  # 
  # autoplot(tsobj$full) +
  #   labs(y = "Demand (MWh)",
  #        title = "Energy Demand",
  #        subtitle = plotSubTitle) -> p
  
  # print(p)

  print("Auto correlation plots")
  print(ggAcf(ts1)) 
  print(ggPacf(ts1))
  
  # #### basic ts eda ####
  # 
  # print("Stationarity Test")
  # 
  # print("ur.test on the original data set")
  # print(ts1 %>% ur.kpss() %>% summary())
  # 
  # print("ur.test on the log-transformed data set log(ts) to stablise the variabnce")
  # print(log(ts1) %>% ur.kpss() %>% summary())
  # 
  # print("ur.test on the differeced data set diff(ts) to remove seasonality")
  # print(diff(ts1) %>% ur.kpss() %>% summary()) ## this produced 0.0085 t-stat < 0.01 -> conclude that we could use auto.arima
  # 
  # 
  # print("ur.test on the log transformed and differenced data set diff(log(ts))")
  # print(diff(log(ts1)) %>% ur.kpss() %>% summary())

 
}

Example usage:

ExploreTs(nswWeeklyts)

In the unit root test ur.kpss(), we look for stationarity where the Null hypothesis (Ho) is stationarity.

2.3 Fit and Diagnose time series models and forecasts

This function will display the residuals plot and Ljung-Box output to test the significance of resiudals, high p-values indicate that residuals are white noise and contain no signal, this result is desired from a succesfull model that captures the infomraiton corectly in its coffecients.

FittedModelDiagnostics <- function(tsfit) {
  # print(tsfit %>% summary())
  tsfit %>% checkresiduals() -> p
  
  print(p)
}
ForecastAndAccuracy <- function(tsobj, tsfit) {
  ret <- list()
  
  forecastHorizon = length(tsobj$test)
  ret$forecast <- forecast(tsfit, h = forecastHorizon)
  
  ret$accuracy <- accuracy(ret$forecast, tsobj$test)
  
  
  plotSubTitle = sprintf("%s - %s", tsobj$state, tsobj$dataset)
  autoplot(ret$forecast) +
    labs(y = "Demand (MWh)",
         subtitle = plotSubTitle) + 
    autolayer(tsobj$test, series = "Test")-> p
  print(p)
  
  
  ret$desc <- sprintf("Accuracy measures for %s %s STLF using %s method.", 
                          tsobj$state, 
                          tsobj$dataset, 
                          tsfit$method)
  
  print(ret$desc)
  print(ret$accuracy)
  
  return(ret)
}

3 Fitting models

First produce a train/test data set and check for its poperties and statioarity.

nsw <- GetAemoDataAndSplit(freq = "weekly", start = c(2012,1))
# nsw <- GetAemoDataAndSplit(freq = "weekly")
ts1 <- nsw$train

ExploreTs(nsw)
## Start: 2012 1
## End: 2018 30
## Frequency 53
## Plots a time series components, acf and pacf graphs

## [1] "Auto correlation plots"

Unit root test using the ur.kpss() method stipulates that if the test statitic reuslt is greater than the critical value, then the nyll hypotheisis is rejected and the series is non-stationary (Data Science Central 2010). So we look for test-statistic less than the critical value of 1% significance level.

We can see that the differenced timeseries with 5 lags produces the lowest test-statistics value of .0301, this result is small and within the range for a stationary data set; less that the critival value of 0.739 for a 1% significance level.

3.1 naive model

#### naive ##### 

naiveFit1 <- naive(ts1,h = length(nsw$train))

FittedModelDiagnostics(naiveFit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Naive method
## Q* = 123.35, df = 55.6, p-value = 4.776e-07
## 
## Model df: 0.   Total lags used: 55.6
## 
## NULL

3.2 snaive

#### snaive ####

snaiveFit1 <- snaive(ts1, h =length(nsw$train) )

FittedModelDiagnostics(snaiveFit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 158.67, df = 55.6, p-value = 7.52e-12
## 
## Model df: 0.   Total lags used: 55.6
## 
## NULL

3.3 Arima

Some stationarity tests first using AFC plots and Box.test():

ggAcf(ts1)

ggAcf(diff(ts1))

Box.test(ts1 , type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  ts1
## X-squared = 111.49, df = 1, p-value < 2.2e-16
Box.test(ts1 %>% diff(), type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  ts1 %>% diff()
## X-squared = 12.481, df = 1, p-value = 0.0004111

Fit two models using auto.arima() function

#### Fit a model on training set #### 


autarimaFit1 <- auto.arima(ts1)
FittedModelDiagnostics(autarimaFit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,2)(1,1,1)[53]
## Q* = 90.342, df = 48.6, p-value = 0.000259
## 
## Model df: 7.   Total lags used: 55.6
## 
## NULL
# fit an auto.arima on a log transformed timeseries 


logNsw <- nsw 
logNsw$train <- log(logNsw$train)
logNsw$test <- log(logNsw$test)
logNsw$full <- log(logNsw$full)

autoarimaFit2 <- auto.arima(logNsw$train)
FittedModelDiagnostics(autoarimaFit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,4)(1,1,1)[53]
## Q* = 47.298, df = 46.6, p-value = 0.444
## 
## Model df: 9.   Total lags used: 55.6
## 
## NULL

The second ARIMA model is fitted on a log-transformed time series to stabilise the variability, it return a p-value of .444 for residul lag, while the first one had a lower p-value of .000259. Here a higher residual lag p-value is better as it implies that no significant signals remain in the residuals for this model.

The same conclusion can be made by observing the ACF plot for the residuals of the model. ACF plot that shows correlations within the threshold limit indicates that residuals do not hold significant correlations. They behave as white noise. This is an important assumption to prove for fitting any model, that the residuals are white noise, random in nature and normally distributed.

3.4 tbats

#### fit TBATS #### 

tbatsFit1 <- tbats(ts1, use.box.cox = T,
                   use.damped.trend = NULL, 
                   use.trend = NULL,
                   use.parallel = T)
FittedModelDiagnostics(tbatsFit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from TBATS
## Q* = 50.546, df = 24.6, p-value = 0.00155
## 
## Model df: 31.   Total lags used: 55.6
## 
## NULL

3.5 stlf

#### stlf ####

stlfNSW <- stlf(ts1)
FittedModelDiagnostics(stlfNSW)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 135.56, df = 53.6, p-value = 4.861e-09
## 
## Model df: 2.   Total lags used: 55.6
## 
## NULL

4 Model Selection

performance of models by examining the accuracy

naiveRR <-ForecastAndAccuracy(nsw,naiveFit1)

## [1] "Accuracy measures for NSW weekly STLF using Naive method method."
##                       ME     RMSE      MAE        MPE     MAPE     MASE
## Training set   -671.0026 182576.8 126025.5 -0.2606091 4.700569 0.940192
## Test set     171202.5059 272545.9 217019.1  5.7595978 7.779485 1.619035
##                    ACF1 Theil's U
## Training set -0.2111231        NA
## Test set      0.5848730  1.433904
snaiveRR <-ForecastAndAccuracy(nsw,snaiveFit1)

## [1] "Accuracy measures for NSW weekly STLF using Seasonal naive method method."
##                      ME     RMSE      MAE        MPE     MAPE     MASE
## Training set  -7957.158 182018.8 134042.2 -0.5509262 4.912432 1.000000
## Test set     -57849.838 223548.0 162671.2 -2.3522440 6.126603 1.213581
##                     ACF1 Theil's U
## Training set  0.24163881        NA
## Test set     -0.03101777  1.219258
autoarimaRR <- ForecastAndAccuracy(tsobj = nsw, tsfit = autarimaFit1)

## character(0)
##                      ME     RMSE       MAE       MPE     MAPE      MASE
## Training set   7070.187 131660.3  86442.37  0.082548 3.185307 0.6448891
## Test set     -78990.721 174065.5 127361.82 -3.202499 4.896095 0.9501618
##                     ACF1 Theil's U
## Training set 0.004027041        NA
## Test set     0.062455449 0.9517688
autoarimaRR2 <- ForecastAndAccuracy(tsobj = logNsw, tsfit = autoarimaFit2)

## character(0)
##                        ME       RMSE        MAE         MPE      MAPE
## Training set  0.002395276 0.04736516 0.03181368  0.01540007 0.2148296
## Test set     -0.024121273 0.06374352 0.04484730 -0.16409176 0.3035952
##                   MASE        ACF1 Theil's U
## Training set 0.6472544 -0.01321492        NA
## Test set     0.9124253  0.04616218 0.9222589
tbatsRR <- ForecastAndAccuracy(tsobj = nsw, tsfit = tbatsFit1)

## [1] "Accuracy measures for NSW weekly STLF using TBATS method."
##                       ME     RMSE       MAE        MPE     MAPE      MASE
## Training set    616.3925 125496.7  90925.86 -0.1848008 3.375433 0.6783374
## Test set     -65697.8141 163184.0 112423.69 -2.7407844 4.365448 0.8387184
##                    ACF1 Theil's U
## Training set 0.04022745        NA
## Test set     0.13169736 0.8823472
stlfRR <- ForecastAndAccuracy(nsw, stlfNSW)

## [1] "Accuracy measures for NSW weekly STLF using STL +  ETS(M,N,N) method."
##                       ME     RMSE       MAE       MPE     MAPE      MASE
## Training set   -239.9017 114745.0  84614.88 -0.183280 3.123967 0.6312553
## Test set     -65501.3625 163091.2 115301.92 -2.715755 4.459145 0.8601910
##                    ACF1 Theil's U
## Training set 0.12730301        NA
## Test set     0.08157539 0.8961511

Plotting it all together

autoplot(window(nsw$train, start=2017)) + 
# autolayer(naiveRR$forecast, PI = FALSE, series="naive") + 
  # autolayer(snaiveRR$forecast, PI=FALSE, series = "snaive") + 
  autolayer(autoarimaRR$forecast, PI=FALSE, series = "autoarima") +
  autolayer(exp(autoarimaRR2$forecast$mean), PI=FALSE, series = "arima2") + 
  # autolayer(tbatsRR$forecast, PI=FALSE, series = "tbats") +
  # autolayer(stlfRR$forecast, PI = FALSE, series = "stlf") + 
  autolayer(nsw$test, series = "ACTUAL") +
  theme(legend.position = "bottom") 
## Warning: Ignoring unknown parameters: PI

5 Appendix 1: Energy Energy Demand

Energy demand data, extracted from the Australian Enregy Market Operator (AEMO), provide 20 years of historical demand measured at 30 mins intervals for the states of NSW, QLD, SA, TAS and VIC. The data set is time series of sequential observaions in chonological order.

5.1 The NSW Energy Demand Data

The plot below shows monthly summaries of demand data for NSW over the period from 1998 to 2018.

aemo_month <- read_csv("tidy/aemo_day.csv")

aemo_month %>% ggplot( aes(x=settlementdate, y=totaldemandday)) + 
  geom_line() + 
  theme_bw() +
  labs( title = "Historical Eneregy Demand for NSW", 
        subtitle = "1998 - 2018",
        y = "Total Demand (MWh)", 
        x = "Date") + 
   scale_x_date(
    date_labels = "%Y",
    date_breaks  = "3 year",
    minor_breaks = "1 year"
  ) +
  scale_y_continuous(labels = scales::comma)

There obvious spikes in the timeseries, exploring the distribution of demand shows a number of outliers at both sides of the distribution.

aemo_month %>% ggplot(aes(x = totaldemandday)) +
  geom_histogram(bins = 60) +
  theme_bw() +
  theme(
  axis.title.y = element_blank(),
  axis.text.y = element_blank(),
  axis.ticks.y = element_blank(), 
  axis.title.x = element_blank(),
  axis.text.x = element_blank(),
  axis.ticks.x = element_blank()
  ) +
  labs(title = "Distribution of Eneregy Demand for NSW",
  subtitle = "1998 - 2018",
  x = "") +
  scale_x_continuous(labels = scales::comma) -> p1
  
  aemo_month %>% ggplot(aes(y = totaldemandday, x = 1)) +
  geom_boxplot () + geom_jitter(alpha = .05, color = "#56B4E9") +
  theme_bw() +
  theme(
  axis.title.y = element_blank(),
  axis.text.y = element_blank(),
  axis.ticks.y = element_blank()
  ) +
  labs(y = "Total Demand (MWh)", x = "") +
  scale_y_continuous(labels = scales::comma) +
  coord_flip() -> p2
  
  grid.arrange (p1, p2)

5.2 Weekly Aggregate Timeseries Object

Weekly summaries of energy demand is produced from raw AEMO data and converted to a timeseries object by using the ts function.

df <- read_csv("aemo/nsw_aemo.csv", locale = locale(tz ="Australia/Sydney"))

df <- df %>% # filter(SETTLEMENTDATE >= ymd("2017-08-1")) %>%
  mutate(y = year(SETTLEMENTDATE), 
                    m = month(SETTLEMENTDATE), 
                    d = day(SETTLEMENTDATE),
                    h = hour(SETTLEMENTDATE), 
                    REGION = REGION)%>% 
  mutate(settlementhour = ymd_h(paste( y,m,d,h, sep= " ")))

# group by week 
dfweek <- df %>% mutate(w=week(SETTLEMENTDATE)) %>% group_by(REGION, y,w) %>% 
  summarise(
    totaldemandweek = sum(TOTALDEMAND), 
    minrrp = min(RRP), 
    maxrrp = max(RRP), 
    meanrrp = mean(RRP), 
    medianrrp = median(RRP), 
    firstrrp = first(RRP), 
    lastrrp = last(RRP))  %>% 
  ungroup()

#### construct ts objects ####

# calcuate the start of the ts objects 

startdate <- min(df$SETTLEMENTDATE)
startyear <- year(startdate)
startmonth <- month(startdate)
startday <- yday(startdate) ## need day of year 
startweek <- week(startdate)

#### weekly ts object 
dfweek <- dfweek %>%
  filter(REGION == "NSW1")  %>% 
  select(y,w,totaldemandweek)

5.3 Handling Outliers in the Timeseries Object

The existence of outlier events in the timeseries data could mislead the time series analysis and produce incorrect outcomes and hence need to be handled before any procesin (Tsay 1988).

The outliers in the timeseries are detected using the Tukey’s rule: values more than tht e1.5 times the interquartile range from the quartiles. Outliers are imputed to the quartiles.

#calculating Tukey's outlier limits based IRQ range from 1st and 3rd quartiles 

wIQRup <- quantile(dfweek$totaldemandweek, probs=.75) + 1.5 * IQR(dfweek$totaldemandweek)
wIQRdown <-quantile(dfweek$totaldemandweek, probs=.25) - 1.5 * IQR(dfweek$totaldemandweek)
wMedian <- median(dfweek$totaldemandweek) 
dfweek2 <- dfweek %>% mutate(totaldemandweek = case_when(
  totaldemandweek >= wIQRup ~ wIQRup, 
  totaldemandweek <= wIQRdown ~ wIQRdown, 
  TRUE ~ totaldemandweek))

aemotsweek <- ts(dfweek %>% select(totaldemandweek), 
                 start = c(startyear, startweek),
                 frequency = 53)


aemotsweek2 <- ts(dfweek2 %>% select(totaldemandweek), 
                 start = c(startyear, startweek),
                 frequency = 53)

aemotsweekcombined <- cbind(aemotsweek, aemotsweek2)
autoplot(aemotsweekcombined) +
  scale_y_continuous(labels = scales::comma) +
  labs( title = "Weekly Aggregate of Energy Demand", 
        subtitle = "Before and After Handling of Outlier values", 
    y="Weekly Demand (MWh)")

5.4 Save Procesed ts Object to .rds Files

The result timeseries objects are saved to file in .rds format for reuse.

saveRDS(aemotsdaily, "rds/aemo_nsw_daily_ts.rds") ## daily summary ts object 
saveRDS(aemotsweek2, "rds/aemo_nsw_weekly_ts.rds") ## weekly summary ts object

6 References

References

Fan, S. & Hyndman, R.J. 2012, ‘Short-term load forecasting based on a semi-parametric additive model’, IEEE Transactions on Power Systems, vol. 27, no. 1, pp. 134-41.

Hyndman, R.J. & Athanasopoulos, G. 2018, Forecasting: principles and practice, OTexts, .

Mill, S. 2016, Electric load forecasting: advantages and challenges, viewed Oct 10, 2018, http://engineering.electrical-equipment.org/electrical-distribution/electric-load-forecasting-advantages-challenges.html.

Tsay, R.S. 1988, ‘Outliers, level shifts, and variance changes in time series’, Journal of Forecasting, vol. 7, no. 1, pp. 1-20.