Electricity is something that is vital nowadays. Because we use a lot of electronic device in our life. So, here we want to forecast the demand of daily electricity in Victoria, Australia using time series model. Why we use the data from Victoria, Australia? Because Victoria is the second most populated state in Australia. Using the data that was downloaded from kaggle about Daily Electricity Price and Demand Data.
Some relevant columns in the data :
- date : the date of the recording
- demand : a total daily electricity demand in MWh
- RRP : a recommended retail price in AUD$ / MWh
- demand_pos_RRP : a total daily demand at positive RRP in MWh
- RRP_positive : an averaged positive RRP, weighted by the corresponding intraday demand in AUD$ / MWh
- demand_neg_RRP : an total daily demand at negative RRP in MWh
- RRP_negative : an average negative RRP, weighted by the corresponding intraday demand in AUD$ / MWh
- frac_at_neg_RRP : a fraction of the day when the demand was traded at negative RRP
- min_temperature : minimum temperature during the day in Celsius
- max_temperature : maximum temperature during the day in Celsius
- solar_exposure : total daily sunlight energy in MJ/m^2
- rainfall : daily rainfall in mm
- school_day : if students were at school on that day
- holiday : if the day was a state or national holiday
library(dplyr)
library(forecast)## Warning: package 'forecast' was built under R version 4.0.5
library(ggplot2)
library(lubridate)
library(MLmetrics)## Warning: package 'MLmetrics' was built under R version 4.0.5
library(tseries)## Warning: package 'tseries' was built under R version 4.0.5
library(TSstudio)## Warning: package 'TSstudio' was built under R version 4.0.5
elect <- read.csv("data/complete_dataset.csv")head(elect)## date demand RRP demand_pos_RRP RRP_positive demand_neg_RRP
## 1 2015-01-01 99635.03 25.63370 97319.24 26.41595 2315.790
## 2 2015-01-02 129606.01 33.13899 121082.01 38.83766 8523.995
## 3 2015-01-03 142300.54 34.56485 142300.54 34.56485 0.000
## 4 2015-01-04 104330.71 25.00556 104330.71 25.00556 0.000
## 5 2015-01-05 118132.20 26.72418 118132.20 26.72418 0.000
## 6 2015-01-06 130672.48 31.28231 130672.48 31.28231 0.000
## RRP_negative frac_at_neg_RRP min_temperature max_temperature solar_exposure
## 1 -7.24000 0.02083333 13.3 26.9 23.6
## 2 -47.80978 0.06250000 15.4 38.8 26.8
## 3 0.00000 0.00000000 20.0 38.2 26.5
## 4 0.00000 0.00000000 16.3 21.4 25.2
## 5 0.00000 0.00000000 15.0 22.0 30.7
## 6 0.00000 0.00000000 17.7 26.0 31.6
## rainfall school_day holiday
## 1 0.0 N Y
## 2 0.0 N N
## 3 0.0 N N
## 4 4.2 N N
## 5 0.0 N N
## 6 0.0 N N
Changing character type data into date type data.
elect <- elect %>%
mutate(date = ymd(date)) %>%
arrange(date)
glimpse(elect)## Rows: 2,106
## Columns: 14
## $ date <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-~
## $ demand <dbl> 99635.03, 129606.01, 142300.54, 104330.71, 118132.20, ~
## $ RRP <dbl> 25.63370, 33.13899, 34.56485, 25.00556, 26.72418, 31.2~
## $ demand_pos_RRP <dbl> 97319.24, 121082.01, 142300.54, 104330.71, 118132.20, ~
## $ RRP_positive <dbl> 26.41595, 38.83766, 34.56485, 25.00556, 26.72418, 31.2~
## $ demand_neg_RRP <dbl> 2315.790, 8523.995, 0.000, 0.000, 0.000, 0.000, 4016.1~
## $ RRP_negative <dbl> -7.24000, -47.80978, 0.00000, 0.00000, 0.00000, 0.0000~
## $ frac_at_neg_RRP <dbl> 0.02083333, 0.06250000, 0.00000000, 0.00000000, 0.0000~
## $ min_temperature <dbl> 13.3, 15.4, 20.0, 16.3, 15.0, 17.7, 18.9, 23.1, 16.5, ~
## $ max_temperature <dbl> 26.9, 38.8, 38.2, 21.4, 22.0, 26.0, 37.4, 28.2, 18.0, ~
## $ solar_exposure <dbl> 23.6, 26.8, 26.5, 25.2, 30.7, 31.6, 20.7, 13.5, 3.1, 5~
## $ rainfall <dbl> 0.0, 0.0, 0.0, 4.2, 0.0, 0.0, 0.0, 19.4, 1.2, 5.2, 0.0~
## $ school_day <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",~
## $ holiday <chr> "Y", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",~
To check if there is any missing value from the data.
anyNA(elect)## [1] TRUE
colSums(is.na(elect))## date demand RRP demand_pos_RRP RRP_positive
## 0 0 0 0 0
## demand_neg_RRP RRP_negative frac_at_neg_RRP min_temperature max_temperature
## 0 0 0 0 0
## solar_exposure rainfall school_day holiday
## 1 3 0 0
There are 3 missing data on rainfall and 1 missing data on solar_exposure, but we only use date and demand, so we can ignore the missing data.
First, we need to make a time series object using our data. We will use frequency = 365 because we want to see yearly seasonality.
elect_ts <- ts(data = elect$demand,
start = c(2015,1),
frequency = 365) # yearlyNext, we will check our data’s trend and seasonal.
elect_dc <- decompose(x = elect_ts)
elect_dc %>%
autoplot()As we can see above, our data have a trend and a seasonality. And we can classify our data as an additive time series.
Here we want to split the data to make a train data and a test data.
elect_train <- head(elect_ts, n = length(elect_ts) - (365))
elect_test <- tail(elect_ts, n = (365))Next we will check if our train data need to perform differencing using Augmented Dickey-Fuller Test.
adf.test(elect_train)## Warning in adf.test(elect_train): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: elect_train
## Dickey-Fuller = -4.9692, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
Since our p-value < alpha, then our train data is already stationary.
Because our data have a trend and a seasonality we can use Holt-Winters model. And then I will compare the Holt-Winters model with ARIMA model to compare the result.
model_ets <- stlm(y = elect_train,
s.window = 365,
method = "ets")
model_arima <- stlm(y = elect_train,
s.window = 365,
method = "arima")After we make our model, next we will try to forecast it.
hw_forecast <- forecast(object = model_ets,
h = length(elect_test))
arima_forecast <- forecast(object = model_arima,
h = length(elect_test))After we get the forecast result from the Holt-Winters model and ARIMA model we will evaluate it using Mean Absolute Percentage Error (MAPE).
accuracy(object = hw_forecast, x = elect_test)## ME RMSE MAE MPE MAPE MASE
## Training set 0.4912505 10050.32 7676.896 -0.3521483 6.436513 0.6944569
## Test set -7267.9134876 15231.94 11836.943 -7.4122544 10.806532 1.0707775
## ACF1 Theil's U
## Training set 0.05540237 NA
## Test set 0.51648306 1.479728
accuracy(object = arima_forecast, x = elect_test)## ME RMSE MAE MPE MAPE MASE
## Training set -14.52008 7525.987 5707.736 -0.3323991 4.771328 0.5163255
## Test set -1535.69905 13325.482 10273.420 -2.3721673 9.089797 0.9293401
## ACF1 Theil's U
## Training set -0.01017575 NA
## Test set 0.51348284 1.255835
Here we get 10.80 % MAPE from the Holt-Winters model and 9.08 % MAPE from ARIMA model. The difference between the two model is not that big and is not that significant, but because MAPE of ARIMA model is lower than Holt-Winters model so we can conclude that ARIMA model is better than Holt-Winters model.
plot_forecast(forecast_obj = arima_forecast)shapiro.test(arima_forecast$residuals)##
## Shapiro-Wilk normality test
##
## data: arima_forecast$residuals
## W = 0.98353, p-value = 2.975e-13
Because our p-value < alpha, we can conclude that our residuals are not normally distributed.
hist(arima_forecast$residuals)Box.test(arima_forecast$residuals)##
## Box-Pierce test
##
## data: arima_forecast$residuals
## X-squared = 0.18027, df = 1, p-value = 0.6711
Because our p-value > alpha, we can conclude that there is no autocorrelation on our forecast residuals.
As we already said before, the best model between Holt-Winters model and ARIMA model is ARIMA model. But, even though our ARIMA model’s residuals doesn’t have autocorrelation it is still not normally distributed. It means there are still some or many unpredictable events that we don’t know in our data. All we can do is adjust our model again.