1 Introduction

This notebook provides a reproducible guide for longterm forecasting fitting an ARIMA model on electrical demand data using R. The objective of this analysis is to review time series theories and do some Exploratory Data Analysis (EDA).

1.1 Brief history of Electrical Demand in Australia

The Australian demand of electricity has consistently been on the rise until 2009. Since then, there has been a constant decline in the demand. The focus of this investigation will be on the National Electricity Market (NEM) which covers:

  1. Queensland
  2. New South Wales
  3. Australia Capital Territory
  4. Victoria
  5. Tasmania
  6. South Australia
NEM - Largest geographically interconnected power systems in the world

NEM - Largest geographically interconnected power systems in the world

NEM has 40,000 km of transmission lines to supply 200 terrawatt hours of electricty to Australia each Year. It is the wholesale market where electricity is bought and sold. Typically, if there is high demand for electricity, prices are likely to also increase. Nem is regulated by the Australian Energy Market Operator (AEMO).

How does NEM work?

How does NEM work?

1.2 Some basic Definitions

  1. Wholesale market - where generators sell electricity to retailers. The retailers then on-sell to the consumer. The market is saturated and highly competitive which ensures prices to be competitive.
  2. Financial market - hedging contracts between retailers and generators to buy and sell electricity. This reduces the risk of price volatility through agreed prices.
  3. The physical grid - the network that delivers electricity from the power stations to the consumers. The grid is purely for transport services and cannot store electricity.
  4. Generators - Production of electricity for sale on the NEM which primarily is sourced from fossil fuels and renewable power stations.
  5. Retailers - Some retailers will purchase electricity directly from the generators which on-sell to the consumer.
  6. Distributors & Networks - Distributors

1.3 Overview of the data Used

The data used is scraped using a tryCatch function in R from the AEMO website (https://www.aemo.com.au/aemo/data/nem/priceanddemand/PRICE_AND_DEMAND_). The dataset provides half hourly demand, RRP and Region from 1998 to 2018.

1.4 Objective

The demand of electricity in Australia has been in the spotlight for the general population due to the price increase. Recently, however, forecasts of the electricity demand has been expected to decrease due to a variety of factors.

The objective is to explore ARIMA models in accurately predicting monthly electricity demand for the whole of the NEM.

2 Load libraries and set global parameters

packages = c("ggplot2", "dplyr", "tidyr", "data.table", "corrplot", "gridExtra", "forecast", "tseries", "TSA", "tibble", "TTR", "xts", "dygraphs", "assertthat", "Matrix", "tidyverse", "lubridate", "hms")

my.install <- function(pkg, ...){
  if (!(pkg %in% installed.packages()[,1])){
    install.packages(pkg)
  }
  return(library(pkg, ...))
}

purrr::walk(packages, my.install, character.only = TRUE, warn.conflicts = FALSE)
elec.raw <- read_csv("elec_price_demand.csv")

#assess what format the variables are in.
elec.raw %>%
  summarise_all(typeof) %>%
  gather
## # A tibble: 6 x 2
##   key            value    
##   <chr>          <chr>    
## 1 X1             integer  
## 2 REGION         character
## 3 SETTLEMENTDATE character
## 4 TOTALDEMAND    double   
## 5 RRP            double   
## 6 PERIODTYPE     character

3 Clean the data

#create a function to clean the formats
clean_data <- function(x) {
  output <- x
  #remove first column (it has no value)
  output$X1 <- NULL
  #rename the regions appropriately & change to factor
  output$REGION <- gsub('1', '', output$REGION)
  output$REGION <- as.factor(output$REGION)
  # Period Type should be factor
  output$PERIODTYPE <- as.factor(output$PERIODTYPE)
  #change Settlementdate to posixct
  output$SETTLEMENTDATE <- as.POSIXct(output$SETTLEMENTDATE, format = '%Y/%m/%d %H:%M')
  return(output)
}

aemo_clean <- clean_data(elec.raw)

  #Split date and time

aemo_clean <- aemo_clean %>%
  mutate_at(vars(SETTLEMENTDATE), ymd_hms) %>% 
  mutate_at(vars(SETTLEMENTDATE), funs("DATE" = date(.), "TIME" = as.hms(.))) %>% 
  select(-SETTLEMENTDATE)

#reorder the columns appropriately
aemo_clean <- aemo_clean %>%
  select(REGION, DATE, TIME, TOTALDEMAND, RRP, PERIODTYPE)

#Lets see a summary
aemo_clean %>%
  summary
##  REGION            DATE                TIME           TOTALDEMAND   
##  NSW:348079   Min.   :1998-12-07   Length:1627489    Min.   :    0  
##  QLD:348079   1st Qu.:2004-09-25   Class1:hms        1st Qu.: 1446  
##  SA :348079   Median :2009-07-03   Class2:difftime   Median : 5125  
##  TAS:235173   Mean   :2009-04-28   Mode  :numeric    Mean   : 4641  
##  VIC:348079   3rd Qu.:2014-02-22                     3rd Qu.: 6577  
##               Max.   :2018-10-15                     Max.   :14580  
##               NA's   :188                                           
##       RRP           PERIODTYPE     
##  Min.   :-1000.00   TRADE:1627489  
##  1st Qu.:   22.27                  
##  Median :   32.19                  
##  Mean   :   48.36                  
##  3rd Qu.:   50.39                  
##  Max.   :14166.50                  
## 
#Lets filter NAs in Date using Dplyr 
aemo_clean %>%
  filter(., !complete.cases(DATE))
## # A tibble: 188 x 6
##    REGION DATE       TIME   TOTALDEMAND   RRP PERIODTYPE
##    <fct>  <date>     <time>       <dbl> <dbl> <fct>     
##  1 QLD    NA            NA        3562.  14.0 TRADE     
##  2 QLD    NA            NA        3458.  13.9 TRADE     
##  3 QLD    NA            NA        3834.  19.5 TRADE     
##  4 QLD    NA            NA        3682.  18   TRADE     
##  5 QLD    NA            NA        3967.  17.8 TRADE     
##  6 QLD    NA            NA        3845.  16.1 TRADE     
##  7 QLD    NA            NA        4190.  21.0 TRADE     
##  8 QLD    NA            NA        4105.  17.2 TRADE     
##  9 QLD    NA            NA        4364.  11.0 TRADE     
## 10 QLD    NA            NA        4246.  10.6 TRADE     
## # ... with 178 more rows

4 Missing Information

The date from AEMO consist of 348,079 rows of data. Of this, we have 188 missing dates. This is less than 1% of the total dataset and shouldn’t impact in the overall modelling process.

aemo_clean <- aemo_clean %>%
  na.omit()

aemo_clean %>% summary()
##  REGION            DATE                TIME           TOTALDEMAND   
##  NSW:348039   Min.   :1998-12-07   Length:1627301    Min.   :    0  
##  QLD:348039   1st Qu.:2004-09-25   Class1:hms        1st Qu.: 1446  
##  SA :348039   Median :2009-07-03   Class2:difftime   Median : 5125  
##  TAS:235145   Mean   :2009-04-28   Mode  :numeric    Mean   : 4641  
##  VIC:348039   3rd Qu.:2014-02-22                     3rd Qu.: 6577  
##               Max.   :2018-10-15                     Max.   :14580  
##       RRP           PERIODTYPE     
##  Min.   :-1000.00   TRADE:1627301  
##  1st Qu.:   22.28                  
##  Median :   32.19                  
##  Mean   :   48.36                  
##  3rd Qu.:   50.39                  
##  Max.   :14166.50

5 Univariate distributions

#total Demand
p1 <- ggplot(aemo_clean, aes(TOTALDEMAND)) + 
                      geom_histogram(bins = 50, 
                                     col = "red",
                                     fill = "red",
                                     alpha = 0.3) +
                      geom_density(alpha = 0.1)

p2 <- ggplot(aemo_clean, aes(RRP, stat(count))) + 
                      geom_histogram(bins = 50, 
                                     col = "red",
                                     fill = "red",
                                     alpha = 0.3) +
                      geom_density(alpha = 0.1)

p3 <- ggplot(aemo_clean, aes(TOTALDEMAND, fill = REGION)) +
                      geom_density(position = "stack")

grid.arrange(p1,p2, nrow = 1, ncol = 2)

The graphs above show an odd distribution for both Total Demand and RRP. What we can understand is that:

  1. There is a high count of Total demand <2500 mW
  2. There is a dense count of Total demand between 4500 mW to 7000 mW.
  3. The retail price is predominantly between $0-100 dollars with alot of outliers.
p3

To understand the Total Demand better, the distribution above is seperated by Region which gives us an understanding that:

  1. Tasmania and South Australia has a higher density with demand <2500 mW
  2. Queensland and Tasmania has a distribution of 0.008 with the spread around 4500 mW to 7000 mW
  3. New South Wales is also similiar to Queensland and Tasmandia but stretches just shy of 15000 mW

6 Aggregation of Demand

The focus of this is to accurately predict the upcoming months of the whole network. To proceed, we need to aggregate the Demand from 30 minutes to Maximum Monthly demand. Also, to simplify the metric of Demand, We will convert the unit of measurement from MegaWatts to GigaWatts.

aemo_agg <- aemo_clean

#seperate the month and year in a different column so we can aggregate.
aemo_agg$MONTH <- format(as.Date(aemo_clean$DATE, tz = 'Auastralia/Sydney'), "%Y-%m")

#create monthly total of WHOLE AEMO Network.
monthly_demand <- aemo_agg$TOTALDEMAND %>%
  aggregate(., by = list(aemo_agg$MONTH), sum)

monthly_demand <- monthly_demand %>%
  rename(MONTH = Group.1) %>%
  rename(TOTAL_DEMAND = x) %>%
  filter(MONTH != "2018-10")

#change Demand from Megawatts to Gigawatts so axis on plots are not scientific
monthly_demand$TOTAL_DEMAND <- monthly_demand$TOTAL_DEMAND / 1000

head(monthly_demand)
##     MONTH TOTAL_DEMAND
## 1 1998-12     20284.13
## 2 1999-01     26003.38
## 3 1999-02     24281.17
## 4 1999-03     26288.02
## 5 1999-04     24939.75
## 6 1999-05     27573.49

Lets change the monthly_demand to a time series object ts()

demand.ts <- ts(monthly_demand[2], frequency = 12, start = c(1998, 12))

6 Train Test Split

lets have a 80/20 train/test split.

train.ts <- window(demand.ts, start = c(1998,12), end = c(2014,9))
test.ts <- window(demand.ts, start = c(2014,10), end = c(2018,9))

7 Exploratory Data Analysis

autoplot(train.ts) +
  labs(titles = 'Monthly AEMO Network Demand',
       x = 'Year',
       y = 'Megawatts') 

As we can see, the NEM network from 1998 to 2018 shows consistent seasonality. The trend was on an inclination till 2009 which then decreased till 2015 and now is stagnant.

autoplot(diff(train.ts)) +
  labs(titles = "Differenced Monthly Electricity Demand",
       x = "year")

The difference plot shows the difference between 2nd and 1st monthand then the difference between the 3rd month and the 2nd month and so on. This is plot identifies that the mean/average value has not changed over time.

ggseasonplot(train.ts)

ggsubseriesplot(train.ts)

The seasonality and subseries plot show that during June to August is our peak demand periods where the lowest demand is during February and April.

autoplot(decompose(train.ts)) +
  labs(title = 'Decomposition of Monthly AEMO Demand',
       x = 'year')

autoplot(diff(train.ts, lag = 12)) +
  labs(titles = "Seasonally Differenced Monthly Electricity Demand",
       x = "year")

ggAcf(diff(train.ts), lag.max = 36) +
  labs(title = "Autocorrelation Plot - differenced demand")

ggPacf(diff(train.ts), lag.max = 36) +
  labs(title = "Partial autocorrelation function - differenced demand")

Noise

Box.test(diff(train.ts), lag = 12, fitdf = 0, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  diff(train.ts)
## X-squared = 292, df = 12, p-value < 2.2e-16

The Ljung box test confirms randomness of a time series data set. instead of testing for randomness at lags, it tests the overall randomness based on a number of lags.

The P value is is less that 0.05 which suggests that the data is significantly different from white noise.

https://robjhyndman.com/hyndsight/ljung-box-test/

8 Forecasting

Serveral methods of forecasting will be implemented to measure which model would be the best for forecasting a horizon of 48 (ie. 48 months / 4 years). The accuracy measure will be RMSE.

When creating a predictive model, the goal is to create a model where it captures the signal rather than the noise of the data.

It is to be noted that the RMSE score will return a result for both the “training set” and the “test set”. Generally speaking the score will be lower for the “training set”. RMSE score on the training set is a metric which measures how much signal and noise is explained by the model.

8.1 Naive Forecasting Method

train.naive <- naive(train.ts, h=48)

autoplot(train.naive)

Residuals

checkresiduals(train.naive)

## 
##  Ljung-Box test
## 
## data:  Residuals from Naive method
## Q* = 569.07, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

The residuals from the naive model have a lot of ACF lags where it breaks the threshold (blue dashed line) and the distribution isn’t very uniform.

Evaluate forecast accuracy methods

accuracy(train.naive, test.ts[,"TOTAL_DEMAND"])
##                      ME     RMSE      MAE        MPE     MAPE     MASE
## Training set   48.94329 1816.295 1509.806 0.02753847 4.807119 1.626931
## Test set     2177.93125 2664.251 2218.350 6.64954735 6.791048 2.390441
##                    ACF1 Theil's U
## Training set -0.3680743        NA
## Test set      0.2098428  1.399151

A naive forecast yielded a RMSE score of 2664.251 on the test set.

8.2 Seasonal Naive Forecasting Method

train.snaive <- snaive(train.ts, h=48)

autoplot(train.snaive)

snaive.cv <- tsCV(train.ts, snaive, h=48)
mean(snaive.cv^2, na.rm = TRUE)
## [1] 4294456

Residuals

train.snaive %>%
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 520.63, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

Evaluate forecast accuracy methods

snaive.ac <- accuracy(train.snaive, test.ts)
snaive.ac
##                    ME      RMSE      MAE       MPE     MAPE      MASE
## Training set 345.4302 1233.6199 928.0088 1.1365015 2.925019 1.0000000
## Test set     254.4141  712.9643 555.7368 0.7598895 1.747115 0.5988486
##                     ACF1 Theil's U
## Training set  0.61277617        NA
## Test set     -0.07701604 0.3747468

Seasonal Naive forecast yielded a RMSE score of 712.9643 on the Test set.

8.3 Simple Exponential Smoothing (SES)

train.ses <- ses(train.ts, h = 48)
autoplot(train.ses) +
  autolayer(fitted(train.ses))

Residuals

checkresiduals(train.ses)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 444.88, df = 22, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 24

Evaluate forecast accuracy methods

ses.ac <- accuracy(train.ses, test.ts)
ses.ac
##                    ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 151.1614 1629.031 1397.007 0.2937958 4.433912 1.505382
## Test set     119.0020 1539.159 1350.835 0.1418175 4.254507 1.455627
##                   ACF1 Theil's U
## Training set 0.1870446        NA
## Test set     0.2098428 0.7967513

A simple exponential smoothing technique yielded a RMSE result of 1539.159.

This method obviously didn’t work because our data has both Trend and seasonality.

8.4 Holts winters method

The Holt winters method deals with both trends and seasonlity.

To deal with the trend, Holt winters method allows for smoothing paramters alpha and beta star to minimise the Sum of Squared Errors.

To deal with seasonal changes, Holt winters method changes the average of Gamma to either average of 0 (Additive) and average of 1 (Multiplicative).

train.ho.a <- holt(train.ts, h = 48, season = "additive")
train.ho.m <- holt(train.ts, h = 48, season = "Multiplicative")

autoplot(train.ho.a) +
  autolayer(fitted(train.ho.a))

autoplot(train.ho.m) +
  autolayer(fitted(train.ho.m))

Residuals

checkresiduals(train.ho.a)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt's method
## Q* = 435.65, df = 20, p-value < 2.2e-16
## 
## Model df: 4.   Total lags used: 24
checkresiduals(train.ho.m)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt's method
## Q* = 435.65, df = 20, p-value < 2.2e-16
## 
## Model df: 4.   Total lags used: 24

Evaluate forecast accuracy methods

ho.a.ac <- accuracy(train.ho.a, test.ts)
ho.a.m <- accuracy(train.ho.m, test.ts)

ho.a.ac
##                     ME     RMSE      MAE        MPE     MAPE     MASE
## Training set -149.2952 1610.871 1377.818 -0.6848737 4.401552 1.484703
## Test set     1495.5587 2270.863 1814.211  4.4851390 5.555566 1.954951
##                   ACF1 Theil's U
## Training set 0.2371688        NA
## Test set     0.3565152  1.185158
ho.a.m
##                     ME     RMSE      MAE        MPE     MAPE     MASE
## Training set -149.2952 1610.871 1377.818 -0.6848737 4.401552 1.484703
## Test set     1495.5587 2270.863 1814.211  4.4851390 5.555566 1.954951
##                   ACF1 Theil's U
## Training set 0.2371688        NA
## Test set     0.3565152  1.185158

Holt winters Additive & Multiplicative resulted in a - RMSE - 2270.863

8.5 Error Trend and Seasonal Model (ETS)

ETS models are innovative state space models that include additoinal information such as whether a trend can be “none”, “additive” or damp and whether seasonal information is “None”, “additive” and “multiplicative”.

This gives us an additional 9 possible exponential smoothing methods (3x3). Smoothing will require measuring the errors to offset. The two errors to cater for are “additive errors” or “multiplicative errors”.

This gives us a total of 18 possible state space models.

The difference of ETS and previous models are that they maximise the liklihood rather than minimising the sum of squared.

#function to return ETS forecast
fets <- function(y,h) {
  forecast(ets(y), h = h)
}

train.ets <- fets(train.ts, h=48)

autoplot(train.ets)

checkresiduals(train.ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,A)
## Q* = 16.768, df = 7, p-value = 0.01896
## 
## Model df: 17.   Total lags used: 24
accuracy(train.ets, test.ts)
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -25.94297 642.9169 461.0866 -0.1219831 1.503394 0.4968559
## Test set     692.62867 962.0351 777.8562  2.1362633 2.422624 0.8381992
##                    ACF1 Theil's U
## Training set 0.07378692        NA
## Test set     0.26138199 0.5129682

RMSE - 962.0351


8.6 Autoregressive Integrated Moving Average Models (ARIMA) - Auto

fit.ar <- auto.arima(train.ts)

checkresiduals(fit.ar)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)(1,1,1)[12]
## Q* = 17.418, df = 19, p-value = 0.5616
## 
## Model df: 5.   Total lags used: 24
summary(fit.ar)
## Series: train.ts 
## ARIMA(2,1,1)(1,1,1)[12] 
## 
## Coefficients:
##           ar1     ar2      ma1     sar1     sma1
##       -0.1186  -0.190  -0.3108  -0.1133  -0.6889
## s.e.   0.3040   0.143   0.3169   0.1221   0.1052
## 
## sigma^2 estimated as 485863:  log likelihood=-1412.39
## AIC=2836.78   AICc=2837.27   BIC=2855.84
## 
## Training set error measures:
##                     ME  RMSE      MAE        MPE     MAPE      MASE
## Training set -139.9079 663.2 484.3896 -0.4830665 1.531013 0.5219666
##                    ACF1
## Training set 0.02759502

ARIMA(2,1,1)(1,1,1)12 AICc=2837.27

train.ar <- fit.ar %>%
  forecast(h=48)

train.ar %>%
  autoplot() 

accuracy(train.ar, test.ts)
##                     ME     RMSE       MAE        MPE     MAPE      MASE
## Training set -139.9079  663.200  484.3896 -0.4830665 1.531013 0.5219666
## Test set     1421.0407 1629.185 1425.8468  4.4428671 4.458780 1.5364583
##                    ACF1 Theil's U
## Training set 0.02759502        NA
## Test set     0.45935134 0.8621858

RMSE - 1629.185

train.ts %>% 
  diff(lag=12) %>%
  ggtsdisplay()

Above, we can see the seasonally differenced data. They also appear to be non-stationary as per the plot below.

train.ts %>% 
  diff(lag=12) %>%
  diff() %>%
  ggtsdisplay()

Using the plot above, we can base our ARIMA model off of the ACF and PACF component. The ACF and PACF suggest that there is are significant correlations to be captured by the ARIMA model because they exceed the 5% signifiance bands. We can see that:

  1. ACF & PACF lag 1,2 & 12 suggests seasonal MA(1) component
  2. ACF & PACF lag 9 suggests a significant non-seasonal.
train.ts %>%
  Arima(order = c(0,1,1), seasonal = c(0,1,1),
        include.constant = TRUE) %>%
  residuals() %>%
  ggtsdisplay()

With the second Arima model (0,1,1)(0,1,1)12, The ACF AND PACF are within the significance limits. This model resulted in an AICc value of 2834.15

We can see that there are large spikes at lag 2, 10 and 11 indicating that some additional non-seasonal terms need to be included in the model.

train.ar3 <- Arima(train.ts, order = c(0,1,2), seasonal=c(0,1,1))

checkresiduals(train.ar3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)(0,1,1)[12]
## Q* = 18.436, df = 21, p-value = 0.6212
## 
## Model df: 3.   Total lags used: 24
summary(train.ar3)
## Series: train.ts 
## ARIMA(0,1,2)(0,1,1)[12] 
## 
## Coefficients:
##          ma1      ma2     sma1
##       -0.416  -0.1116  -0.7493
## s.e.   0.098   0.0937   0.0735
## 
## sigma^2 estimated as 484688:  log likelihood=-1413.29
## AIC=2834.59   AICc=2834.82   BIC=2847.29
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE    MAPE      MASE
## Training set -154.1261 666.2371 484.0827 -0.5266763 1.53056 0.5216359
##                    ACF1
## Training set 0.01275241

With the third Arima model (0,1,2)(0,1,1)12, The ACF AND PACF are within the significance limits. This model resulted in an AICc value of 2834.82

The other tested Arima models are as per below.

(0,1,1)(0,1,1)12 - AICc = 2834.15 (0,1,3)(0,1,1)12 - AICc = 2835.67 (0,1,10)(0,1,1)12 - AICc = 2847.57 (0,1,11)(0,1,1)12 - AICc = 2849.69 (0,1,10)(0,1,10)12 - AICc = 2867.11

train.ar3 <- train.ar3 %>%
  forecast(h=48)

train.ar3 %>%
  autoplot() 

accuracy(train.ar3, test.ts)
##                     ME      RMSE       MAE        MPE    MAPE      MASE
## Training set -154.1261  666.2371  484.0827 -0.5266763 1.53056 0.5216359
## Test set     1194.2434 1391.9331 1202.0121  3.7276471 3.75315 1.2952594
##                    ACF1 Theil's U
## Training set 0.01275241        NA
## Test set     0.35962074 0.7369895

This resulted in an RMSE of 1391.9331 which beats the auto.arima() model which had an RMSE of 1629.185.

autoplot(window(train.ts, start = 2014)) +
  autolayer(train.ar3, 
            series = "Seasonal Arima",
            PI = FALSE) +
  autolayer(train.snaive, 
            series = "Seasonal Naive",
            PI = FALSE) +
  autolayer(train.ets,
            series = "ETS",
            PI = FALSE) +
  autolayer(test.ts,
            series = "Actual",
            PI = FALSE) +
  labs(titles = "Top 4 Forecasts vs Actual",
       x = "Year",
       y = "Gigawatts") +
  theme(legend.position = 'bottom')

The plot above shows the the first year of predictions are pretty accurate because the confidence interval are smaller in the short term horizon. Expanding it over the next two years proved that the Seasonal Naive model performed the best

Conclusion

Reflection