1 Explanation

Electricity is something that is vital nowadays. Because we use a lot of electronic device in our life. So, here we want to forecast the demand of daily electricity in Victoria, Australia using time series model. Why we use the data from Victoria, Australia? Because Victoria is the second most populated state in Australia. Using the data that was downloaded from kaggle about Daily Electricity Price and Demand Data.

Some relevant columns in the data :
- date : the date of the recording
- demand : a total daily electricity demand in MWh
- RRP : a recommended retail price in AUD$ / MWh
- demand_pos_RRP : a total daily demand at positive RRP in MWh
- RRP_positive : an averaged positive RRP, weighted by the corresponding intraday demand in AUD$ / MWh
- demand_neg_RRP : an total daily demand at negative RRP in MWh
- RRP_negative : an average negative RRP, weighted by the corresponding intraday demand in AUD$ / MWh
- frac_at_neg_RRP : a fraction of the day when the demand was traded at negative RRP
- min_temperature : minimum temperature during the day in Celsius
- max_temperature : maximum temperature during the day in Celsius
- solar_exposure : total daily sunlight energy in MJ/m^2
- rainfall : daily rainfall in mm
- school_day : if students were at school on that day
- holiday : if the day was a state or national holiday

2 Data Preparation

2.1 Attaching Packages

library(dplyr)
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
library(ggplot2)
library(lubridate)
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.0.5
library(tseries)
## Warning: package 'tseries' was built under R version 4.0.5
library(TSstudio)
## Warning: package 'TSstudio' was built under R version 4.0.5

2.2 Input Data

elect <- read.csv("data/complete_dataset.csv")

2.3 Data Inspection

head(elect)
##         date    demand      RRP demand_pos_RRP RRP_positive demand_neg_RRP
## 1 2015-01-01  99635.03 25.63370       97319.24     26.41595       2315.790
## 2 2015-01-02 129606.01 33.13899      121082.01     38.83766       8523.995
## 3 2015-01-03 142300.54 34.56485      142300.54     34.56485          0.000
## 4 2015-01-04 104330.71 25.00556      104330.71     25.00556          0.000
## 5 2015-01-05 118132.20 26.72418      118132.20     26.72418          0.000
## 6 2015-01-06 130672.48 31.28231      130672.48     31.28231          0.000
##   RRP_negative frac_at_neg_RRP min_temperature max_temperature solar_exposure
## 1     -7.24000      0.02083333            13.3            26.9           23.6
## 2    -47.80978      0.06250000            15.4            38.8           26.8
## 3      0.00000      0.00000000            20.0            38.2           26.5
## 4      0.00000      0.00000000            16.3            21.4           25.2
## 5      0.00000      0.00000000            15.0            22.0           30.7
## 6      0.00000      0.00000000            17.7            26.0           31.6
##   rainfall school_day holiday
## 1      0.0          N       Y
## 2      0.0          N       N
## 3      0.0          N       N
## 4      4.2          N       N
## 5      0.0          N       N
## 6      0.0          N       N

Changing character type data into date type data.

elect <- elect %>% 
  mutate(date = ymd(date)) %>% 
  arrange(date)

glimpse(elect)
## Rows: 2,106
## Columns: 14
## $ date            <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-~
## $ demand          <dbl> 99635.03, 129606.01, 142300.54, 104330.71, 118132.20, ~
## $ RRP             <dbl> 25.63370, 33.13899, 34.56485, 25.00556, 26.72418, 31.2~
## $ demand_pos_RRP  <dbl> 97319.24, 121082.01, 142300.54, 104330.71, 118132.20, ~
## $ RRP_positive    <dbl> 26.41595, 38.83766, 34.56485, 25.00556, 26.72418, 31.2~
## $ demand_neg_RRP  <dbl> 2315.790, 8523.995, 0.000, 0.000, 0.000, 0.000, 4016.1~
## $ RRP_negative    <dbl> -7.24000, -47.80978, 0.00000, 0.00000, 0.00000, 0.0000~
## $ frac_at_neg_RRP <dbl> 0.02083333, 0.06250000, 0.00000000, 0.00000000, 0.0000~
## $ min_temperature <dbl> 13.3, 15.4, 20.0, 16.3, 15.0, 17.7, 18.9, 23.1, 16.5, ~
## $ max_temperature <dbl> 26.9, 38.8, 38.2, 21.4, 22.0, 26.0, 37.4, 28.2, 18.0, ~
## $ solar_exposure  <dbl> 23.6, 26.8, 26.5, 25.2, 30.7, 31.6, 20.7, 13.5, 3.1, 5~
## $ rainfall        <dbl> 0.0, 0.0, 0.0, 4.2, 0.0, 0.0, 0.0, 19.4, 1.2, 5.2, 0.0~
## $ school_day      <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",~
## $ holiday         <chr> "Y", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N",~

2.4 Data Cleansing

To check if there is any missing value from the data.

anyNA(elect)
## [1] TRUE
colSums(is.na(elect))
##            date          demand             RRP  demand_pos_RRP    RRP_positive 
##               0               0               0               0               0 
##  demand_neg_RRP    RRP_negative frac_at_neg_RRP min_temperature max_temperature 
##               0               0               0               0               0 
##  solar_exposure        rainfall      school_day         holiday 
##               1               3               0               0

There are 3 missing data on rainfall and 1 missing data on solar_exposure, but we only use date and demand, so we can ignore the missing data.

3 Time Series Object

First, we need to make a time series object using our data. We will use frequency = 365 because we want to see yearly seasonality.

elect_ts <- ts(data = elect$demand,
                 start = c(2015,1),
                 frequency = 365) # yearly

Next, we will check our data’s trend and seasonal.

elect_dc <- decompose(x = elect_ts)

elect_dc %>% 
  autoplot()

As we can see above, our data have a trend and a seasonality. And we can classify our data as an additive time series.

4 Cross Validation

Here we want to split the data to make a train data and a test data.

elect_train <- head(elect_ts, n = length(elect_ts) - (365))
elect_test <- tail(elect_ts, n = (365))

Next we will check if our train data need to perform differencing using Augmented Dickey-Fuller Test.

adf.test(elect_train)
## Warning in adf.test(elect_train): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  elect_train
## Dickey-Fuller = -4.9692, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary

Since our p-value < alpha, then our train data is already stationary.

5 Time Series Modeling

Because our data have a trend and a seasonality we can use Holt-Winters model. And then I will compare the Holt-Winters model with ARIMA model to compare the result.

model_ets <- stlm(y = elect_train,
                  s.window = 365,
                  method = "ets")

model_arima <- stlm(y = elect_train,
                  s.window = 365,
                  method = "arima")

6 Forecasting

After we make our model, next we will try to forecast it.

hw_forecast <- forecast(object = model_ets,
                        h = length(elect_test))

arima_forecast <- forecast(object = model_arima,
                           h = length(elect_test))

7 Model Evaluation

After we get the forecast result from the Holt-Winters model and ARIMA model we will evaluate it using Mean Absolute Percentage Error (MAPE).

accuracy(object = hw_forecast, x = elect_test)
##                         ME     RMSE       MAE        MPE      MAPE      MASE
## Training set     0.4912505 10050.32  7676.896 -0.3521483  6.436513 0.6944569
## Test set     -7267.9134876 15231.94 11836.943 -7.4122544 10.806532 1.0707775
##                    ACF1 Theil's U
## Training set 0.05540237        NA
## Test set     0.51648306  1.479728
accuracy(object = arima_forecast, x = elect_test)
##                       ME      RMSE       MAE        MPE     MAPE      MASE
## Training set   -14.52008  7525.987  5707.736 -0.3323991 4.771328 0.5163255
## Test set     -1535.69905 13325.482 10273.420 -2.3721673 9.089797 0.9293401
##                     ACF1 Theil's U
## Training set -0.01017575        NA
## Test set      0.51348284  1.255835

Here we get 10.80 % MAPE from the Holt-Winters model and 9.08 % MAPE from ARIMA model. The difference between the two model is not that big and is not that significant, but because MAPE of ARIMA model is lower than Holt-Winters model so we can conclude that ARIMA model is better than Holt-Winters model.

plot_forecast(forecast_obj = arima_forecast)

8 Assumption Checking

8.1 Normality of Residuals

  • H0 : Residuals are normally distributed
  • H1 : Residuals are not normally distributed
shapiro.test(arima_forecast$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  arima_forecast$residuals
## W = 0.98353, p-value = 2.975e-13

Because our p-value < alpha, we can conclude that our residuals are not normally distributed.

hist(arima_forecast$residuals)

8.2 No-Autocorrelation on Residuals

  • H0 : No autocorrelation in the forecast residuals
  • H1 : There is an autocorrelation in the forecast residuals
Box.test(arima_forecast$residuals)
## 
##  Box-Pierce test
## 
## data:  arima_forecast$residuals
## X-squared = 0.18027, df = 1, p-value = 0.6711

Because our p-value > alpha, we can conclude that there is no autocorrelation on our forecast residuals.

9 Conclusion

As we already said before, the best model between Holt-Winters model and ARIMA model is ARIMA model. But, even though our ARIMA model’s residuals doesn’t have autocorrelation it is still not normally distributed. It means there are still some or many unpredictable events that we don’t know in our data. All we can do is adjust our model again.