This notebook provides a reproducible guide for longterm forecasting fitting an ARIMA model on electrical demand data using R. The objective of this analysis is to review time series theories and do some Exploratory Data Analysis (EDA).
The Australian demand of electricity has consistently been on the rise until 2009. Since then, there has been a constant decline in the demand. The focus of this investigation will be on the National Electricity Market (NEM) which covers:
NEM - Largest geographically interconnected power systems in the world
NEM has 40,000 km of transmission lines to supply 200 terrawatt hours of electricty to Australia each Year. It is the wholesale market where electricity is bought and sold. Typically, if there is high demand for electricity, prices are likely to also increase. Nem is regulated by the Australian Energy Market Operator (AEMO).
How does NEM work?
The data used is scraped using a tryCatch function in R from the AEMO website (https://www.aemo.com.au/aemo/data/nem/priceanddemand/PRICE_AND_DEMAND_). The dataset provides half hourly demand, RRP and Region from 1998 to 2018.
The demand of electricity in Australia has been in the spotlight for the general population due to the price increase. Recently, however, forecasts of the electricity demand has been expected to decrease due to a variety of factors.
The objective is to explore ARIMA models in accurately predicting monthly electricity demand for the whole of the NEM.
packages = c("ggplot2", "dplyr", "tidyr", "data.table", "corrplot", "gridExtra", "forecast", "tseries", "TSA", "tibble", "TTR", "xts", "dygraphs", "assertthat", "Matrix", "tidyverse", "lubridate", "hms")
my.install <- function(pkg, ...){
if (!(pkg %in% installed.packages()[,1])){
install.packages(pkg)
}
return(library(pkg, ...))
}
purrr::walk(packages, my.install, character.only = TRUE, warn.conflicts = FALSE)
elec.raw <- read_csv("elec_price_demand.csv")
#assess what format the variables are in.
elec.raw %>%
summarise_all(typeof) %>%
gather
## # A tibble: 6 x 2
## key value
## <chr> <chr>
## 1 X1 integer
## 2 REGION character
## 3 SETTLEMENTDATE character
## 4 TOTALDEMAND double
## 5 RRP double
## 6 PERIODTYPE character
#create a function to clean the formats
clean_data <- function(x) {
output <- x
#remove first column (it has no value)
output$X1 <- NULL
#rename the regions appropriately & change to factor
output$REGION <- gsub('1', '', output$REGION)
output$REGION <- as.factor(output$REGION)
# Period Type should be factor
output$PERIODTYPE <- as.factor(output$PERIODTYPE)
#change Settlementdate to posixct
output$SETTLEMENTDATE <- as.POSIXct(output$SETTLEMENTDATE, format = '%Y/%m/%d %H:%M')
return(output)
}
aemo_clean <- clean_data(elec.raw)
#Split date and time
aemo_clean <- aemo_clean %>%
mutate_at(vars(SETTLEMENTDATE), ymd_hms) %>%
mutate_at(vars(SETTLEMENTDATE), funs("DATE" = date(.), "TIME" = as.hms(.))) %>%
select(-SETTLEMENTDATE)
#reorder the columns appropriately
aemo_clean <- aemo_clean %>%
select(REGION, DATE, TIME, TOTALDEMAND, RRP, PERIODTYPE)
#Lets see a summary
aemo_clean %>%
summary
## REGION DATE TIME TOTALDEMAND
## NSW:348079 Min. :1998-12-07 Length:1627489 Min. : 0
## QLD:348079 1st Qu.:2004-09-25 Class1:hms 1st Qu.: 1446
## SA :348079 Median :2009-07-03 Class2:difftime Median : 5125
## TAS:235173 Mean :2009-04-28 Mode :numeric Mean : 4641
## VIC:348079 3rd Qu.:2014-02-22 3rd Qu.: 6577
## Max. :2018-10-15 Max. :14580
## NA's :188
## RRP PERIODTYPE
## Min. :-1000.00 TRADE:1627489
## 1st Qu.: 22.27
## Median : 32.19
## Mean : 48.36
## 3rd Qu.: 50.39
## Max. :14166.50
##
#Lets filter NAs in Date using Dplyr
aemo_clean %>%
filter(., !complete.cases(DATE))
## # A tibble: 188 x 6
## REGION DATE TIME TOTALDEMAND RRP PERIODTYPE
## <fct> <date> <time> <dbl> <dbl> <fct>
## 1 QLD NA NA 3562. 14.0 TRADE
## 2 QLD NA NA 3458. 13.9 TRADE
## 3 QLD NA NA 3834. 19.5 TRADE
## 4 QLD NA NA 3682. 18 TRADE
## 5 QLD NA NA 3967. 17.8 TRADE
## 6 QLD NA NA 3845. 16.1 TRADE
## 7 QLD NA NA 4190. 21.0 TRADE
## 8 QLD NA NA 4105. 17.2 TRADE
## 9 QLD NA NA 4364. 11.0 TRADE
## 10 QLD NA NA 4246. 10.6 TRADE
## # ... with 178 more rows
The date from AEMO consist of 348,079 rows of data. Of this, we have 188 missing dates. This is less than 1% of the total dataset and shouldn’t impact in the overall modelling process.
aemo_clean <- aemo_clean %>%
na.omit()
aemo_clean %>% summary()
## REGION DATE TIME TOTALDEMAND
## NSW:348039 Min. :1998-12-07 Length:1627301 Min. : 0
## QLD:348039 1st Qu.:2004-09-25 Class1:hms 1st Qu.: 1446
## SA :348039 Median :2009-07-03 Class2:difftime Median : 5125
## TAS:235145 Mean :2009-04-28 Mode :numeric Mean : 4641
## VIC:348039 3rd Qu.:2014-02-22 3rd Qu.: 6577
## Max. :2018-10-15 Max. :14580
## RRP PERIODTYPE
## Min. :-1000.00 TRADE:1627301
## 1st Qu.: 22.28
## Median : 32.19
## Mean : 48.36
## 3rd Qu.: 50.39
## Max. :14166.50
#total Demand
p1 <- ggplot(aemo_clean, aes(TOTALDEMAND)) +
geom_histogram(bins = 50,
col = "red",
fill = "red",
alpha = 0.3) +
geom_density(alpha = 0.1)
p2 <- ggplot(aemo_clean, aes(RRP, stat(count))) +
geom_histogram(bins = 50,
col = "red",
fill = "red",
alpha = 0.3) +
geom_density(alpha = 0.1)
p3 <- ggplot(aemo_clean, aes(TOTALDEMAND, fill = REGION)) +
geom_density(position = "stack")
grid.arrange(p1,p2, nrow = 1, ncol = 2)
The graphs above show an odd distribution for both Total Demand and RRP. What we can understand is that:
p3
To understand the Total Demand better, the distribution above is seperated by Region which gives us an understanding that:
The focus of this is to accurately predict the upcoming months of the whole network. To proceed, we need to aggregate the Demand from 30 minutes to Maximum Monthly demand. Also, to simplify the metric of Demand, We will convert the unit of measurement from MegaWatts to GigaWatts.
aemo_agg <- aemo_clean
#seperate the month and year in a different column so we can aggregate.
aemo_agg$MONTH <- format(as.Date(aemo_clean$DATE, tz = 'Auastralia/Sydney'), "%Y-%m")
#create monthly total of WHOLE AEMO Network.
monthly_demand <- aemo_agg$TOTALDEMAND %>%
aggregate(., by = list(aemo_agg$MONTH), sum)
monthly_demand <- monthly_demand %>%
rename(MONTH = Group.1) %>%
rename(TOTAL_DEMAND = x) %>%
filter(MONTH != "2018-10")
#change Demand from Megawatts to Gigawatts so axis on plots are not scientific
monthly_demand$TOTAL_DEMAND <- monthly_demand$TOTAL_DEMAND / 1000
head(monthly_demand)
## MONTH TOTAL_DEMAND
## 1 1998-12 20284.13
## 2 1999-01 26003.38
## 3 1999-02 24281.17
## 4 1999-03 26288.02
## 5 1999-04 24939.75
## 6 1999-05 27573.49
Lets change the monthly_demand to a time series object ts()
demand.ts <- ts(monthly_demand[2], frequency = 12, start = c(1998, 12))
lets have a 80/20 train/test split.
train.ts <- window(demand.ts, start = c(1998,12), end = c(2014,9))
test.ts <- window(demand.ts, start = c(2014,10), end = c(2018,9))
autoplot(train.ts) +
labs(titles = 'Monthly AEMO Network Demand',
x = 'Year',
y = 'Megawatts')
As we can see, the NEM network from 1998 to 2018 shows consistent seasonality. The trend was on an inclination till 2009 which then decreased till 2015 and now is stagnant.
autoplot(diff(train.ts)) +
labs(titles = "Differenced Monthly Electricity Demand",
x = "year")
The difference plot shows the difference between 2nd and 1st monthand then the difference between the 3rd month and the 2nd month and so on. This is plot identifies that the mean/average value has not changed over time.
ggseasonplot(train.ts)
ggsubseriesplot(train.ts)
The seasonality and subseries plot show that during June to August is our peak demand periods where the lowest demand is during February and April.
autoplot(decompose(train.ts)) +
labs(title = 'Decomposition of Monthly AEMO Demand',
x = 'year')
autoplot(diff(train.ts, lag = 12)) +
labs(titles = "Seasonally Differenced Monthly Electricity Demand",
x = "year")
ggAcf(diff(train.ts), lag.max = 36) +
labs(title = "Autocorrelation Plot - differenced demand")
ggPacf(diff(train.ts), lag.max = 36) +
labs(title = "Partial autocorrelation function - differenced demand")
Box.test(diff(train.ts), lag = 12, fitdf = 0, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: diff(train.ts)
## X-squared = 292, df = 12, p-value < 2.2e-16
The Ljung box test confirms randomness of a time series data set. instead of testing for randomness at lags, it tests the overall randomness based on a number of lags.
The P value is is less that 0.05 which suggests that the data is significantly different from white noise.
Serveral methods of forecasting will be implemented to measure which model would be the best for forecasting a horizon of 48 (ie. 48 months / 4 years). The accuracy measure will be RMSE.
When creating a predictive model, the goal is to create a model where it captures the signal rather than the noise of the data.
It is to be noted that the RMSE score will return a result for both the “training set” and the “test set”. Generally speaking the score will be lower for the “training set”. RMSE score on the training set is a metric which measures how much signal and noise is explained by the model.
train.naive <- naive(train.ts, h=48)
autoplot(train.naive)
checkresiduals(train.naive)
##
## Ljung-Box test
##
## data: Residuals from Naive method
## Q* = 569.07, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
The residuals from the naive model have a lot of ACF lags where it breaks the threshold (blue dashed line) and the distribution isn’t very uniform.
accuracy(train.naive, test.ts[,"TOTAL_DEMAND"])
## ME RMSE MAE MPE MAPE MASE
## Training set 48.94329 1816.295 1509.806 0.02753847 4.807119 1.626931
## Test set 2177.93125 2664.251 2218.350 6.64954735 6.791048 2.390441
## ACF1 Theil's U
## Training set -0.3680743 NA
## Test set 0.2098428 1.399151
A naive forecast yielded a RMSE score of 2664.251 on the test set.
train.snaive <- snaive(train.ts, h=48)
autoplot(train.snaive)
snaive.cv <- tsCV(train.ts, snaive, h=48)
mean(snaive.cv^2, na.rm = TRUE)
## [1] 4294456
train.snaive %>%
checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 520.63, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
snaive.ac <- accuracy(train.snaive, test.ts)
snaive.ac
## ME RMSE MAE MPE MAPE MASE
## Training set 345.4302 1233.6199 928.0088 1.1365015 2.925019 1.0000000
## Test set 254.4141 712.9643 555.7368 0.7598895 1.747115 0.5988486
## ACF1 Theil's U
## Training set 0.61277617 NA
## Test set -0.07701604 0.3747468
Seasonal Naive forecast yielded a RMSE score of 712.9643 on the Test set.
train.ses <- ses(train.ts, h = 48)
autoplot(train.ses) +
autolayer(fitted(train.ses))
checkresiduals(train.ses)
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 444.88, df = 22, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 24
ses.ac <- accuracy(train.ses, test.ts)
ses.ac
## ME RMSE MAE MPE MAPE MASE
## Training set 151.1614 1629.031 1397.007 0.2937958 4.433912 1.505382
## Test set 119.0020 1539.159 1350.835 0.1418175 4.254507 1.455627
## ACF1 Theil's U
## Training set 0.1870446 NA
## Test set 0.2098428 0.7967513
A simple exponential smoothing technique yielded a RMSE result of 1539.159.
This method obviously didn’t work because our data has both Trend and seasonality.
The Holt winters method deals with both trends and seasonlity.
To deal with the trend, Holt winters method allows for smoothing paramters alpha and beta star to minimise the Sum of Squared Errors.
To deal with seasonal changes, Holt winters method changes the average of Gamma to either average of 0 (Additive) and average of 1 (Multiplicative).
train.ho.a <- holt(train.ts, h = 48, season = "additive")
train.ho.m <- holt(train.ts, h = 48, season = "Multiplicative")
autoplot(train.ho.a) +
autolayer(fitted(train.ho.a))
autoplot(train.ho.m) +
autolayer(fitted(train.ho.m))
checkresiduals(train.ho.a)
##
## Ljung-Box test
##
## data: Residuals from Holt's method
## Q* = 435.65, df = 20, p-value < 2.2e-16
##
## Model df: 4. Total lags used: 24
checkresiduals(train.ho.m)
##
## Ljung-Box test
##
## data: Residuals from Holt's method
## Q* = 435.65, df = 20, p-value < 2.2e-16
##
## Model df: 4. Total lags used: 24
ho.a.ac <- accuracy(train.ho.a, test.ts)
ho.a.m <- accuracy(train.ho.m, test.ts)
ho.a.ac
## ME RMSE MAE MPE MAPE MASE
## Training set -149.2952 1610.871 1377.818 -0.6848737 4.401552 1.484703
## Test set 1495.5587 2270.863 1814.211 4.4851390 5.555566 1.954951
## ACF1 Theil's U
## Training set 0.2371688 NA
## Test set 0.3565152 1.185158
ho.a.m
## ME RMSE MAE MPE MAPE MASE
## Training set -149.2952 1610.871 1377.818 -0.6848737 4.401552 1.484703
## Test set 1495.5587 2270.863 1814.211 4.4851390 5.555566 1.954951
## ACF1 Theil's U
## Training set 0.2371688 NA
## Test set 0.3565152 1.185158
Holt winters Additive & Multiplicative resulted in a - RMSE - 2270.863
ETS models are innovative state space models that include additoinal information such as whether a trend can be “none”, “additive” or damp and whether seasonal information is “None”, “additive” and “multiplicative”.
This gives us an additional 9 possible exponential smoothing methods (3x3). Smoothing will require measuring the errors to offset. The two errors to cater for are “additive errors” or “multiplicative errors”.
This gives us a total of 18 possible state space models.
The difference of ETS and previous models are that they maximise the liklihood rather than minimising the sum of squared.
#function to return ETS forecast
fets <- function(y,h) {
forecast(ets(y), h = h)
}
train.ets <- fets(train.ts, h=48)
autoplot(train.ets)
checkresiduals(train.ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,Ad,A)
## Q* = 16.768, df = 7, p-value = 0.01896
##
## Model df: 17. Total lags used: 24
accuracy(train.ets, test.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -25.94297 642.9169 461.0866 -0.1219831 1.503394 0.4968559
## Test set 692.62867 962.0351 777.8562 2.1362633 2.422624 0.8381992
## ACF1 Theil's U
## Training set 0.07378692 NA
## Test set 0.26138199 0.5129682
RMSE - 962.0351
fit.ar <- auto.arima(train.ts)
checkresiduals(fit.ar)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(1,1,1)[12]
## Q* = 17.418, df = 19, p-value = 0.5616
##
## Model df: 5. Total lags used: 24
summary(fit.ar)
## Series: train.ts
## ARIMA(2,1,1)(1,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sar1 sma1
## -0.1186 -0.190 -0.3108 -0.1133 -0.6889
## s.e. 0.3040 0.143 0.3169 0.1221 0.1052
##
## sigma^2 estimated as 485863: log likelihood=-1412.39
## AIC=2836.78 AICc=2837.27 BIC=2855.84
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -139.9079 663.2 484.3896 -0.4830665 1.531013 0.5219666
## ACF1
## Training set 0.02759502
ARIMA(2,1,1)(1,1,1)12 AICc=2837.27
train.ar <- fit.ar %>%
forecast(h=48)
train.ar %>%
autoplot()
accuracy(train.ar, test.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -139.9079 663.200 484.3896 -0.4830665 1.531013 0.5219666
## Test set 1421.0407 1629.185 1425.8468 4.4428671 4.458780 1.5364583
## ACF1 Theil's U
## Training set 0.02759502 NA
## Test set 0.45935134 0.8621858
RMSE - 1629.185
train.ts %>%
diff(lag=12) %>%
ggtsdisplay()
Above, we can see the seasonally differenced data. They also appear to be non-stationary as per the plot below.
train.ts %>%
diff(lag=12) %>%
diff() %>%
ggtsdisplay()
Using the plot above, we can base our ARIMA model off of the ACF and PACF component. The ACF and PACF suggest that there is are significant correlations to be captured by the ARIMA model because they exceed the 5% signifiance bands. We can see that:
train.ts %>%
Arima(order = c(0,1,1), seasonal = c(0,1,1),
include.constant = TRUE) %>%
residuals() %>%
ggtsdisplay()
With the second Arima model (0,1,1)(0,1,1)12, The ACF AND PACF are within the significance limits. This model resulted in an AICc value of 2834.15
We can see that there are large spikes at lag 2, 10 and 11 indicating that some additional non-seasonal terms need to be included in the model.
train.ar3 <- Arima(train.ts, order = c(0,1,2), seasonal=c(0,1,1))
checkresiduals(train.ar3)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)(0,1,1)[12]
## Q* = 18.436, df = 21, p-value = 0.6212
##
## Model df: 3. Total lags used: 24
summary(train.ar3)
## Series: train.ts
## ARIMA(0,1,2)(0,1,1)[12]
##
## Coefficients:
## ma1 ma2 sma1
## -0.416 -0.1116 -0.7493
## s.e. 0.098 0.0937 0.0735
##
## sigma^2 estimated as 484688: log likelihood=-1413.29
## AIC=2834.59 AICc=2834.82 BIC=2847.29
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -154.1261 666.2371 484.0827 -0.5266763 1.53056 0.5216359
## ACF1
## Training set 0.01275241
With the third Arima model (0,1,2)(0,1,1)12, The ACF AND PACF are within the significance limits. This model resulted in an AICc value of 2834.82
The other tested Arima models are as per below.
(0,1,1)(0,1,1)12 - AICc = 2834.15 (0,1,3)(0,1,1)12 - AICc = 2835.67 (0,1,10)(0,1,1)12 - AICc = 2847.57 (0,1,11)(0,1,1)12 - AICc = 2849.69 (0,1,10)(0,1,10)12 - AICc = 2867.11
train.ar3 <- train.ar3 %>%
forecast(h=48)
train.ar3 %>%
autoplot()
accuracy(train.ar3, test.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set -154.1261 666.2371 484.0827 -0.5266763 1.53056 0.5216359
## Test set 1194.2434 1391.9331 1202.0121 3.7276471 3.75315 1.2952594
## ACF1 Theil's U
## Training set 0.01275241 NA
## Test set 0.35962074 0.7369895
This resulted in an RMSE of 1391.9331 which beats the auto.arima() model which had an RMSE of 1629.185.
autoplot(window(train.ts, start = 2014)) +
autolayer(train.ar3,
series = "Seasonal Arima",
PI = FALSE) +
autolayer(train.snaive,
series = "Seasonal Naive",
PI = FALSE) +
autolayer(train.ets,
series = "ETS",
PI = FALSE) +
autolayer(test.ts,
series = "Actual",
PI = FALSE) +
labs(titles = "Top 4 Forecasts vs Actual",
x = "Year",
y = "Gigawatts") +
theme(legend.position = 'bottom')
The plot above shows the the first year of predictions are pretty accurate because the confidence interval are smaller in the short term horizon. Expanding it over the next two years proved that the Seasonal Naive model performed the best