“Some things are easier to forecast than others. The time of the sunrise tomorrow morning can be forecast precisely. On the other hand, tomorrow’s lotto numbers cannot be forecast with any accuracy.” (Hyndman & Athanasopoulos 2018)
This report demonstrates the process of forecasting energy demand using the AEMO data set for NSW.
The following reusable functions are created to encapsulate the repetitive tasks in the processing pipeline:
ts object exploration
Data will be split in chronological order with an 80/20 train-to-test ratio. The training data set will be at least two years long to ensure that it captures any seasonal behaviour observed on a yearly basis.
The function GetAemoDataAndSplit is used to load a preprocessed ts object. For the process of preparing the ts object, refer to Appendix 1.
GetAemoDataAndSplit <- function(start = c(2013, 1),
                                region = "NSW",
                                freq = "daily",
                                shift = FALSE) {
  ret <- list()
  # validate the frequency and region arguments, falling back to the defaults
  sfreq <- case_when(freq == "weekly" ~ "weekly",
                     TRUE ~ "daily")
  sregion <- case_when(region %in% c("NSW", "QLD", "SA", "SNOWY", "TAS", "VIC") ~ region,
                       TRUE ~ "NSW")
  filename <- sprintf("rds/aemo_%s_%s_ts.rds", tolower(sregion), sfreq)
  if (shift)
    filename <- sprintf("rds/aemo_%s_%s_l_ts.rds", tolower(sregion), sfreq)
  ts1 <- readRDS(filename)
  ts1 <- window(ts1, start = start)
  # split 80/20 in chronological order
  idx <- as.integer(length(ts1) * 0.8)
  ret$train <- ts(ts1[1:idx], start = start(ts1), frequency = frequency(ts1))
  ret$test <- ts(ts1[(idx + 1):length(ts1)], start = index(ts1)[idx + 1], frequency = frequency(ts1))
  ret$full <- ts1
  ret$state <- sregion
  ret$dataset <- sfreq
  return(ret)
}
Specifying the parameter freq = "weekly" in the function call will load the weekly aggregated timeseries data. The returned object includes the full range of the timeseries along with the train and test split timeseries.
nswWeeklyts <- GetAemoDataAndSplit(freq = "weekly")
autoplot(nswWeeklyts$train, main = "NSW Weekly Energy Demand", series = "Train set") +
autolayer(nswWeeklyts$test, series = "Test set")
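As a quick sanity check (a sketch using the object just created), the structure of the returned list and the span of the training window can be inspected; the training set should cover at least the two years required to capture yearly seasonality:

str(nswWeeklyts, max.level = 1)   # list with train, test, full, state and dataset elements
# span of the training window in years (weekly data, frequency = 53)
length(nswWeeklyts$train) / frequency(nswWeeklyts$train)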
Outliers in this dataset are imputed to the Tukey outlier limits, as described in Appendix 1.
The function ExploreTs prints the attributes of the ts object and plots its decomposition, ACF and PACF; the commented-out ur.kpss() calls were used to assess stationarity:
ExploreTs <- function(tsobj) {
  ts1 <- tsobj$full
  cat(sprintf(
    "Start: %s\nEnd: %s\nFrequency %s\n",
    paste(start(ts1), collapse = " "),
    paste(end(ts1), collapse = " "),
    frequency(ts1)
  ))
  cat("Plots the time series components, ACF and PACF graphs")
  plotSubTitle <- sprintf("%s - %s", tsobj$state, tsobj$dataset)
  print(mstl(ts1) %>% autoplot(main = paste0(plotSubTitle, " with Trend, Seasonal and Remainder components")))
  #
  # autoplot(tsobj$full) +
  #   labs(y = "Demand (MWh)",
  #        title = "Energy Demand",
  #        subtitle = plotSubTitle) -> p
  # print(p)
  print("Auto correlation plots")
  print(ggAcf(ts1))
  print(ggPacf(ts1))

  # #### basic ts eda ####
  #
  # print("Stationarity Test")
  #
  # print("ur.kpss on the original data set")
  # print(ts1 %>% ur.kpss() %>% summary())
  #
  # print("ur.kpss on the log-transformed data set log(ts) to stabilise the variance")
  # print(log(ts1) %>% ur.kpss() %>% summary())
  #
  # print("ur.kpss on the differenced data set diff(ts) to remove seasonality")
  # print(diff(ts1) %>% ur.kpss() %>% summary()) ## this produced a 0.0085 test statistic < 0.01 -> conclude that we could use auto.arima
  #
  # print("ur.kpss on the log-transformed and differenced data set diff(log(ts))")
  # print(diff(log(ts1)) %>% ur.kpss() %>% summary())
}
Example usage:
ExploreTs(nswWeeklyts)
In the ur.kpss() unit root test we look for stationarity; the null hypothesis (H0) is that the series is stationary.
This function displays the residual plots and the Ljung-Box test output to check the significance of the residuals. High p-values indicate that the residuals are white noise and contain no remaining signal, which is the desired result for a successful model that has correctly captured the information in its coefficients.
FittedModelDiagnostics <- function(tsfit) {
  # print(tsfit %>% summary())
  tsfit %>% checkresiduals() -> p
  print(p)
}
ForecastAndAccuracy <- function(tsobj, tsfit) {
  ret <- list()
  # forecast over the length of the test set and measure accuracy against it
  forecastHorizon <- length(tsobj$test)
  ret$forecast <- forecast(tsfit, h = forecastHorizon)
  ret$accuracy <- accuracy(ret$forecast, tsobj$test)
  plotSubTitle <- sprintf("%s - %s", tsobj$state, tsobj$dataset)
  autoplot(ret$forecast) +
    labs(y = "Demand (MWh)",
         subtitle = plotSubTitle) +
    autolayer(tsobj$test, series = "Test") -> p
  print(p)
  # note: Arima objects have no $method element, so desc is empty (character(0)) for the auto.arima fits
  ret$desc <- sprintf("Accuracy measures for %s %s STLF using %s method.",
                      tsobj$state,
                      tsobj$dataset,
                      tsfit$method)
  print(ret$desc)
  print(ret$accuracy)
  return(ret)
}
First, produce a train/test data set and check its properties and stationarity.
nsw <- GetAemoDataAndSplit(freq = "weekly", start = c(2012,1))
# nsw <- GetAemoDataAndSplit(freq = "weekly")
ts1 <- nsw$train
ExploreTs(nsw)
## Start: 2012 1
## End: 2018 30
## Frequency 53
## Plots the time series components, ACF and PACF graphs
## [1] "Auto correlation plots"
The ur.kpss() unit root test stipulates that if the test statistic is greater than the critical value, then the null hypothesis is rejected and the series is non-stationary (Data Science Central 2010). So we look for a test statistic below the critical value at the 1% significance level.
We can see that the differenced timeseries with 5 lags produces the lowest test statistic of 0.0301. This value is small and consistent with a stationary data set; it is well below the critical value of 0.739 at the 1% significance level.
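As a rough illustration of that comparison, the test could be re-run on the differenced training series as below (a sketch; the urca package is assumed to be loaded, and mapping the quoted "5 lags" to the use.lag argument is an assumption):

kpssDiff <- ur.kpss(diff(ts1), use.lag = 5)
summary(kpssDiff)
# H0 is stationarity, so a test statistic below the 1% critical value (0.739)
# supports treating the differenced series as stationary
kpssDiff@teststat < kpssDiff@cval[, "1pct"]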
#### naive #####
naiveFit1 <- naive(ts1, h = length(nsw$train))
FittedModelDiagnostics(naiveFit1)
##
## Ljung-Box test
##
## data: Residuals from Naive method
## Q* = 123.35, df = 55.6, p-value = 4.776e-07
##
## Model df: 0. Total lags used: 55.6
##
## NULL
#### snaive ####
snaiveFit1 <- snaive(ts1, h = length(nsw$train))
FittedModelDiagnostics(snaiveFit1)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 158.67, df = 55.6, p-value = 7.52e-12
##
## Model df: 0. Total lags used: 55.6
##
## NULL
Some stationarity checks first, using ACF plots and Box.test():
ggAcf(ts1)
ggAcf(diff(ts1))
Box.test(ts1 , type="Ljung-Box")
##
## Box-Ljung test
##
## data: ts1
## X-squared = 111.49, df = 1, p-value < 2.2e-16
Box.test(ts1 %>% diff(), type="Ljung-Box")
##
## Box-Ljung test
##
## data: ts1 %>% diff()
## X-squared = 12.481, df = 1, p-value = 0.0004111
Fit two models using the auto.arima() function:
#### Fit a model on training set ####
autarimaFit1 <- auto.arima(ts1)
FittedModelDiagnostics(autarimaFit1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,2)(1,1,1)[53]
## Q* = 90.342, df = 48.6, p-value = 0.000259
##
## Model df: 7. Total lags used: 55.6
##
## NULL
# fit an auto.arima on a log transformed timeseries
logNsw <- nsw
logNsw$train <- log(logNsw$train)
logNsw$test <- log(logNsw$test)
logNsw$full <- log(logNsw$full)
autoarimaFit2 <- auto.arima(logNsw$train)
FittedModelDiagnostics(autoarimaFit2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,4)(1,1,1)[53]
## Q* = 47.298, df = 46.6, p-value = 0.444
##
## Model df: 9. Total lags used: 55.6
##
## NULL
The second ARIMA model is fitted on a log-transformed time series to stabilise the variance. It returns a Ljung-Box p-value of 0.444 for the residuals, while the first model had a much lower p-value of 0.000259. A higher residual p-value is better here, as it implies that no significant signal remains in the residuals of the model.
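An equivalent approach, not used in this report, would be to let auto.arima() apply the log transform itself through its Box-Cox lambda argument; a minimal sketch (the name autoarimaFit2b is hypothetical):

# With lambda = 0 the model is fitted on the log scale, but forecasts are
# automatically back-transformed to the original MWh scale (biasadj = TRUE
# applies the bias adjustment to the mean forecast).
autoarimaFit2b <- auto.arima(ts1, lambda = 0, biasadj = TRUE)
checkresiduals(autoarimaFit2b)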
The residual diagnostics can also be judged from the ACF plot of the model's residuals. An ACF plot in which the correlations stay within the threshold limits indicates that the residuals hold no significant correlations and behave as white noise. This is an important assumption to verify for any fitted model: the residuals should be white noise, random in nature and normally distributed.
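For example, the residual ACF of the log-scale model could be plotted on its own (the same plot also appears in the checkresiduals() output above):

ggAcf(residuals(autoarimaFit2))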
#### fit TBATS ####
tbatsFit1 <- tbats(ts1, use.box.cox = T,
use.damped.trend = NULL,
use.trend = NULL,
use.parallel = T)
FittedModelDiagnostics(tbatsFit1)
##
## Ljung-Box test
##
## data: Residuals from TBATS
## Q* = 50.546, df = 24.6, p-value = 0.00155
##
## Model df: 31. Total lags used: 55.6
##
## NULL
#### stlf ####
stlfNSW <- stlf(ts1)
FittedModelDiagnostics(stlfNSW)
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(M,N,N)
## Q* = 135.56, df = 53.6, p-value = 4.861e-09
##
## Model df: 2. Total lags used: 55.6
##
## NULL
Compare the performance of the models by examining their accuracy measures:
naiveRR <- ForecastAndAccuracy(nsw, naiveFit1)
## [1] "Accuracy measures for NSW weekly STLF using Naive method method."
## ME RMSE MAE MPE MAPE MASE
## Training set -671.0026 182576.8 126025.5 -0.2606091 4.700569 0.940192
## Test set 171202.5059 272545.9 217019.1 5.7595978 7.779485 1.619035
## ACF1 Theil's U
## Training set -0.2111231 NA
## Test set 0.5848730 1.433904
snaiveRR <- ForecastAndAccuracy(nsw, snaiveFit1)
## [1] "Accuracy measures for NSW weekly STLF using Seasonal naive method method."
## ME RMSE MAE MPE MAPE MASE
## Training set -7957.158 182018.8 134042.2 -0.5509262 4.912432 1.000000
## Test set -57849.838 223548.0 162671.2 -2.3522440 6.126603 1.213581
## ACF1 Theil's U
## Training set 0.24163881 NA
## Test set -0.03101777 1.219258
autoarimaRR <- ForecastAndAccuracy(tsobj = nsw, tsfit = autarimaFit1)
## character(0)
## ME RMSE MAE MPE MAPE MASE
## Training set 7070.187 131660.3 86442.37 0.082548 3.185307 0.6448891
## Test set -78990.721 174065.5 127361.82 -3.202499 4.896095 0.9501618
## ACF1 Theil's U
## Training set 0.004027041 NA
## Test set 0.062455449 0.9517688
autoarimaRR2 <- ForecastAndAccuracy(tsobj = logNsw, tsfit = autoarimaFit2)
## character(0)
## ME RMSE MAE MPE MAPE
## Training set 0.002395276 0.04736516 0.03181368 0.01540007 0.2148296
## Test set -0.024121273 0.06374352 0.04484730 -0.16409176 0.3035952
## MASE ACF1 Theil's U
## Training set 0.6472544 -0.01321492 NA
## Test set 0.9124253 0.04616218 0.9222589
tbatsRR <- ForecastAndAccuracy(tsobj = nsw, tsfit = tbatsFit1)
## [1] "Accuracy measures for NSW weekly STLF using TBATS method."
## ME RMSE MAE MPE MAPE MASE
## Training set 616.3925 125496.7 90925.86 -0.1848008 3.375433 0.6783374
## Test set -65697.8141 163184.0 112423.69 -2.7407844 4.365448 0.8387184
## ACF1 Theil's U
## Training set 0.04022745 NA
## Test set 0.13169736 0.8823472
stlfRR <- ForecastAndAccuracy(nsw, stlfNSW)
## [1] "Accuracy measures for NSW weekly STLF using STL + ETS(M,N,N) method."
## ME RMSE MAE MPE MAPE MASE
## Training set -239.9017 114745.0 84614.88 -0.183280 3.123967 0.6312553
## Test set -65501.3625 163091.2 115301.92 -2.715755 4.459145 0.8601910
## ACF1 Theil's U
## Training set 0.12730301 NA
## Test set 0.08157539 0.8961511
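To make the comparison easier to read, the test-set rows of the accuracy tables above could be gathered into one table and sorted, for example by MAPE (a sketch; the names results and comparison are illustrative, and the log-scale arima2 fit is excluded because its errors are on the log scale and not directly comparable):

# collect test-set accuracy measures from the objects created above
results <- list(naive = naiveRR, snaive = snaiveRR,
                autoarima = autoarimaRR, tbats = tbatsRR, stlf = stlfRR)
comparison <- t(sapply(results, function(r) r$accuracy["Test set", c("RMSE", "MAE", "MAPE", "MASE")]))
comparison[order(comparison[, "MAPE"]), ]   # best MAPE first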
Plotting it all together
autoplot(window(nsw$train, start = 2017)) +
  # autolayer(naiveRR$forecast, PI = FALSE, series = "naive") +
  # autolayer(snaiveRR$forecast, PI = FALSE, series = "snaive") +
  autolayer(autoarimaRR$forecast, PI = FALSE, series = "autoarima") +
  # exp() back-transforms the log-scale point forecast to the original MWh scale;
  # PI is ignored here because the layer is a plain ts, hence the warning below
  autolayer(exp(autoarimaRR2$forecast$mean), PI = FALSE, series = "arima2") +
  # autolayer(tbatsRR$forecast, PI = FALSE, series = "tbats") +
  # autolayer(stlfRR$forecast, PI = FALSE, series = "stlf") +
  autolayer(nsw$test, series = "ACTUAL") +
  theme(legend.position = "bottom")
## Warning: Ignoring unknown parameters: PI
Appendix 1
Energy demand data, extracted from the Australian Energy Market Operator (AEMO), provides 20 years of historical demand measured at 30-minute intervals for the states of NSW, QLD, SA, TAS and VIC. The data set is a time series of sequential observations in chronological order.
The plot below shows daily totals of demand for NSW over the period from 1998 to 2018.
aemo_month <- read_csv("tidy/aemo_day.csv")
aemo_month %>% ggplot( aes(x=settlementdate, y=totaldemandday)) +
geom_line() +
theme_bw() +
labs( title = "Historical Eneregy Demand for NSW",
subtitle = "1998 - 2018",
y = "Total Demand (MWh)",
x = "Date") +
scale_x_date(
date_labels = "%Y",
date_breaks = "3 year",
date_minor_breaks = "1 year"
) +
scale_y_continuous(labels = scales::comma)
There are obvious spikes in the timeseries; exploring the distribution of demand shows a number of outliers on both sides of the distribution.
aemo_month %>% ggplot(aes(x = totaldemandday)) +
geom_histogram(bins = 60) +
theme_bw() +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
) +
labs(title = "Distribution of Eneregy Demand for NSW",
subtitle = "1998 - 2018",
x = "") +
scale_x_continuous(labels = scales::comma) -> p1
aemo_month %>% ggplot(aes(y = totaldemandday, x = 1)) +
geom_boxplot() + geom_jitter(alpha = .05, color = "#56B4E9") +
theme_bw() +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
labs(y = "Total Demand (MWh)", x = "") +
scale_y_continuous(labels = scales::comma) +
coord_flip() -> p2
grid.arrange (p1, p2)
Weekly summaries of energy demand are produced from the raw AEMO data and converted to a timeseries object using the ts function.
df <- read_csv("aemo/nsw_aemo.csv", locale = locale(tz ="Australia/Sydney"))
df <- df %>% # filter(SETTLEMENTDATE >= ymd("2017-08-1")) %>%
mutate(y = year(SETTLEMENTDATE),
m = month(SETTLEMENTDATE),
d = day(SETTLEMENTDATE),
h = hour(SETTLEMENTDATE),
REGION = REGION)%>%
mutate(settlementhour = ymd_h(paste( y,m,d,h, sep= " ")))
# group by week
dfweek <- df %>% mutate(w=week(SETTLEMENTDATE)) %>% group_by(REGION, y,w) %>%
summarise(
totaldemandweek = sum(TOTALDEMAND),
minrrp = min(RRP),
maxrrp = max(RRP),
meanrrp = mean(RRP),
medianrrp = median(RRP),
firstrrp = first(RRP),
lastrrp = last(RRP)) %>%
ungroup()
#### construct ts objects ####
# calculate the start of the ts objects
startdate <- min(df$SETTLEMENTDATE)
startyear <- year(startdate)
startmonth <- month(startdate)
startday <- yday(startdate) ## need day of year
startweek <- week(startdate)
#### weekly ts object
dfweek <- dfweek %>%
filter(REGION == "NSW1") %>%
select(y,w,totaldemandweek)
The existence of outlier events in the timeseries data could mislead the time series analysis and produce incorrect outcomes, and hence they need to be handled before any processing (Tsay 1988).
The outliers in the timeseries are detected using Tukey's rule: values more than 1.5 times the interquartile range beyond the first and third quartiles. Outliers are imputed to these limits (the Tukey fences).
# calculating Tukey's outlier limits based on the IQR from the 1st and 3rd quartiles
wIQRup <- quantile(dfweek$totaldemandweek, probs = .75) + 1.5 * IQR(dfweek$totaldemandweek)
wIQRdown <- quantile(dfweek$totaldemandweek, probs = .25) - 1.5 * IQR(dfweek$totaldemandweek)
wMedian <- median(dfweek$totaldemandweek)
dfweek2 <- dfweek %>% mutate(totaldemandweek = case_when(
totaldemandweek >= wIQRup ~ wIQRup,
totaldemandweek <= wIQRdown ~ wIQRdown,
TRUE ~ totaldemandweek))
aemotsweek <- ts(dfweek %>% select(totaldemandweek),
start = c(startyear, startweek),
frequency = 53)
aemotsweek2 <- ts(dfweek2 %>% select(totaldemandweek),
start = c(startyear, startweek),
frequency = 53)
aemotsweekcombined <- cbind(aemotsweek, aemotsweek2)
autoplot(aemotsweekcombined) +
scale_y_continuous(labels = scales::comma) +
labs( title = "Weekly Aggregate of Energy Demand",
subtitle = "Before and After Handling of Outlier values",
y="Weekly Demand (MWh)")
ts Object to .rds Files
The resulting timeseries objects are saved to file in .rds format for reuse.
saveRDS(aemotsdaily, "rds/aemo_nsw_daily_ts.rds") ## daily summary ts object
saveRDS(aemotsweek2, "rds/aemo_nsw_weekly_ts.rds") ## weekly summary ts object
References
Fan, S. & Hyndman, R.J. 2012, ‘Short-term load forecasting based on a semi-parametric additive model’, IEEE Transactions on Power Systems, vol. 27, no. 1, pp. 134-41.
Hyndman, R.J. & Athanasopoulos, G. 2018, Forecasting: principles and practice, OTexts, .
Mill, S. 2016, Electric load forecasting: advantages and challenges, viewed Oct 10, 2018, http://engineering.electrical-equipment.org/electrical-distribution/electric-load-forecasting-advantages-challenges.html.
Tsay, R.S. 1988, ‘Outliers, level shifts, and variance changes in time series’, Journal of Forecasting, vol. 7, no. 1, pp. 1-20.