Average Energy Consumption

Before you start reading the analysis below, let me first thank you for taking the time to read my work; it means a lot to me.

Everything you see in this analysis is the result of my study in the Time Series and Forecasting class at Algoritma Academy. To see what I have learned in more detail, you can visit the Algoritma Academy learning syllabus.

Everything I have written is entirely my personal opinion, based on my experience and knowledge so far. If something is wrong or missing, please feel free to contact me here; I would love to discuss it with you. Thank you.

Introduction

Background

The dataset accompanies the following publication in Energy and Buildings:

Data driven prediction models of energy use of appliances in a low-energy house. Luis M. Candanedo, Véronique Feldheim, Dominique Deramaix. Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788, http://dx.doi.org/10.1016/j.enbuild.2017.01.083.

The dataset was downloaded from the UCI Machine Learning Repository, where more information about it is available.

Objectives

The objectives of this analysis are to:

  • See whether any pattern forms in the energy usage.
  • Predict future energy usage.

Load Libraries

library(forecast)
library(tidyverse)
library(lubridate)
library(padr)
library(ggplot2)
library(tseries)

Import Dataset

energy <- read_csv("dataset/energydata_complete.csv")
glimpse(energy)
## Rows: 19,735
## Columns: 29
## $ date        <dttm> 2016-01-11 17:00:00, 2016-01-11 17:10:00, 2016-01-11 17:2~
## $ Appliances  <dbl> 60, 60, 50, 50, 60, 50, 60, 60, 60, 70, 230, 580, 430, 250~
## $ lights      <dbl> 30, 30, 30, 40, 40, 40, 50, 50, 40, 40, 70, 60, 50, 40, 10~
## $ T1          <dbl> 19.89000, 19.89000, 19.89000, 19.89000, 19.89000, 19.89000~
## $ RH_1        <dbl> 47.59667, 46.69333, 46.30000, 46.06667, 46.33333, 46.02667~
## $ T2          <dbl> 19.20000, 19.20000, 19.20000, 19.20000, 19.20000, 19.20000~
## $ RH_2        <dbl> 44.79000, 44.72250, 44.62667, 44.59000, 44.53000, 44.50000~
## $ T3          <dbl> 19.79000, 19.79000, 19.79000, 19.79000, 19.79000, 19.79000~
## $ RH_3        <dbl> 44.73000, 44.79000, 44.93333, 45.00000, 45.00000, 44.93333~
## $ T4          <dbl> 19.00000, 19.00000, 18.92667, 18.89000, 18.89000, 18.89000~
## $ RH_4        <dbl> 45.56667, 45.99250, 45.89000, 45.72333, 45.53000, 45.73000~
## $ T5          <dbl> 17.16667, 17.16667, 17.16667, 17.16667, 17.20000, 17.13333~
## $ RH_5        <dbl> 55.20000, 55.20000, 55.09000, 55.09000, 55.09000, 55.03000~
## $ T6          <dbl> 7.026667, 6.833333, 6.560000, 6.433333, 6.366667, 6.300000~
## $ RH_6        <dbl> 84.25667, 84.06333, 83.15667, 83.42333, 84.89333, 85.76667~
## $ T7          <dbl> 17.20000, 17.20000, 17.20000, 17.13333, 17.20000, 17.13333~
## $ RH_7        <dbl> 41.62667, 41.56000, 41.43333, 41.29000, 41.23000, 41.26000~
## $ T8          <dbl> 18.20000, 18.20000, 18.20000, 18.10000, 18.10000, 18.10000~
## $ RH_8        <dbl> 48.90000, 48.86333, 48.73000, 48.59000, 48.59000, 48.59000~
## $ T9          <dbl> 17.03333, 17.06667, 17.00000, 17.00000, 17.00000, 17.00000~
## $ RH_9        <dbl> 45.53000, 45.56000, 45.50000, 45.40000, 45.40000, 45.29000~
## $ T_out       <dbl> 6.600000, 6.483333, 6.366667, 6.250000, 6.133333, 6.016667~
## $ Press_mm_hg <dbl> 733.5000, 733.6000, 733.7000, 733.8000, 733.9000, 734.0000~
## $ RH_out      <dbl> 92.00000, 92.00000, 92.00000, 92.00000, 92.00000, 92.00000~
## $ Windspeed   <dbl> 7.000000, 6.666667, 6.333333, 6.000000, 5.666667, 5.333333~
## $ Visibility  <dbl> 63.00000, 59.16667, 55.33333, 51.50000, 47.66667, 43.83333~
## $ Tdewpoint   <dbl> 5.300000, 5.200000, 5.100000, 5.000000, 4.900000, 4.800000~
## $ rv1         <dbl> 13.275433, 18.606195, 28.642668, 45.410389, 10.084097, 44.~
## $ rv2         <dbl> 13.275433, 18.606195, 28.642668, 45.410389, 10.084097, 44.~

Variable Description :

  • date, time in year-month-day hour:minute:second format
  • Appliances, energy use in Wh
  • lights, energy use of light fixtures in the house in Wh
  • T1, Temperature in kitchen area, in Celsius
  • RH_1, Humidity in kitchen area, in %
  • T2, Temperature in living room area, in Celsius
  • RH_2, Humidity in living room area, in %
  • T3, Temperature in laundry room area, in Celsius
  • RH_3, Humidity in laundry room area, in %
  • T4, Temperature in office room, in Celsius
  • RH_4, Humidity in office room, in %
  • T5, Temperature in bathroom, in Celsius
  • RH_5, Humidity in bathroom, in %
  • T6, Temperature outside the building (north side), in Celsius
  • RH_6, Humidity outside the building (north side), in %
  • T7, Temperature in ironing room, in Celsius
  • RH_7, Humidity in ironing room, in %
  • T8, Temperature in teenager room 2, in Celsius
  • RH_8, Humidity in teenager room 2, in %
  • T9, Temperature in parents room, in Celsius
  • RH_9, Humidity in parents room, in %
  • T_out, Temperature outside (from Chièvres weather station), in Celsius
  • Press_mm_hg, Pressure (from Chièvres weather station), in mm Hg
  • RH_out, Humidity outside (from Chièvres weather station), in %
  • Windspeed (from Chièvres weather station), in m/s
  • Visibility (from Chièvres weather station), in km
  • Tdewpoint (from Chièvres weather station), °C
  • rv1, Random variable 1, nondimensional
  • rv2, Random variable 2, nondimensional

Our dataset contains 19,735 observations with various columns that describe the energy consumption and various conditions inside a house.

Data Preprocessing

Before we analyze the data, we have to make sure the dataset is ready to use. At this stage we check whether there are missing values and whether the data type of each column matches its contents.

# Check for missing values
colSums(is.na(energy))
##        date  Appliances      lights          T1        RH_1          T2 
##           0           0           0           0           0           0 
##        RH_2          T3        RH_3          T4        RH_4          T5 
##           0           0           0           0           0           0 
##        RH_5          T6        RH_6          T7        RH_7          T8 
##           0           0           0           0           0           0 
##        RH_8          T9        RH_9       T_out Press_mm_hg      RH_out 
##           0           0           0           0           0           0 
##   Windspeed  Visibility   Tdewpoint         rv1         rv2 
##           0           0           0           0           0

Fortunately, the dataset contains no missing values, and the date column is already recognized as a datetime, so no further manipulation is required beyond subsetting the dataset to the columns we want. In this experiment we will only use the date and Appliances columns, to look at the energy consumption of the appliances in the house. We also have to pad the dataset to make sure no timestamps are skipped.

# Subset the dataset and do padding
energy_clean <- energy %>% 
  select(date, Appliances) %>% # Select only columns we want to use
  pad() # Apply pad
## Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## pad applied on the interval: 10 min
energy_clean
## # A tibble: 19,735 x 2
##    date                Appliances
##    <dttm>                   <dbl>
##  1 2016-01-11 17:00:00         60
##  2 2016-01-11 17:10:00         60
##  3 2016-01-11 17:20:00         50
##  4 2016-01-11 17:30:00         50
##  5 2016-01-11 17:40:00         60
##  6 2016-01-11 17:50:00         50
##  7 2016-01-11 18:00:00         60
##  8 2016-01-11 18:10:00         60
##  9 2016-01-11 18:20:00         60
## 10 2016-01-11 18:30:00         70
## # ... with 19,725 more rows
# Verify there is no missing values
energy_clean %>% anyNA()
## [1] FALSE

As we can see, there are far too many observations because a reading is taken every 10 minutes. It is rare to discuss energy usage in 10-minute increments, so we will compute the average daily usage to simplify things.

energy_daily <- energy_clean %>% 
  mutate(date = floor_date(date, unit = "days")) %>% 
  group_by(date) %>% 
  summarise(avg_daily_usage = mean(Appliances)) # average daily usage

energy_daily
## # A tibble: 138 x 2
##    date                avg_daily_usage
##    <dttm>                        <dbl>
##  1 2016-01-11 00:00:00           137. 
##  2 2016-01-12 00:00:00            85.7
##  3 2016-01-13 00:00:00            97.0
##  4 2016-01-14 00:00:00           151. 
##  5 2016-01-15 00:00:00           125. 
##  6 2016-01-16 00:00:00           125. 
##  7 2016-01-17 00:00:00           143. 
##  8 2016-01-18 00:00:00            94.0
##  9 2016-01-19 00:00:00            83.3
## 10 2016-01-20 00:00:00           114. 
## # ... with 128 more rows

Perfect, the number of observations is now reduced significantly, to just 138 (one per day).

Exploratory Data Analysis

Before creating a machine learning model, let's explore the data first to see whether there is a pattern in the appliances' energy use. We can use ggplot2 to make some basic plots.

# Check the date range of our data
range(energy_daily$date)
## [1] "2016-01-11 UTC" "2016-05-27 UTC"
# Visualize the data to see if there is any pattern
ggplot(data = energy_daily, aes(x = date, y = avg_daily_usage)) +
  geom_point() +
  geom_line()

We can't clearly see a pattern in the basic daily ggplot2 line plot above. To make it clearer, we can use the decompose() and autoplot() functions to make another plot.

# Make time series object with monthly frequency
energy_ts <- ts(energy_daily$avg_daily_usage,
                frequency = 30) # daily data, monthly pattern

energy_ts %>% decompose() %>% autoplot()

From the result above, we can see that our data has both seasonality and trend. The seasonality is categorized as additive, because the seasonal pattern stays roughly constant from month to month, with no noticeable growth in its amplitude.
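
For contrast, if the seasonal swings grew together with the level of the series, a multiplicative decomposition would be the better description. A minimal sketch of how to check that alternative (not needed for this dataset):

# Hypothetical check: a multiplicative decomposition, for comparison.
# If the seasonal amplitude grew with the trend, this version would fit better.
energy_ts %>% decompose(type = "multiplicative") %>% autoplot()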

Cross-Validation

Before we build machine learning models, we have to split our data into training data and test data, so we can evaluate the performance of our models before using them in real life. In this case, the test data will be the last 30 days, or 1 month.

# Split the Data
energy_train <- head(energy_ts, -30) 
energy_test <- tail(energy_ts, 30) # 30 days

Build Model

Because our data has seasonality, we can use SARIMA (seasonal ARIMA) to build our model. To do this, we have two options: either let R determine the orders automatically (auto SARIMA) or determine them manually.

But to use a SARIMA model, we have to make our data stationary first. We can use diff() to achieve this.

energy_train_diff <- energy_train %>% 
  diff(lag=30) # differencing for seasonal component

energy_train_diff %>% adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -5.3813, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

Now our data is stationary, because the p-value is < alpha (0.05).
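
As an extra sanity check on the differencing orders, the forecast package also provides ndiffs() and nsdiffs(), which estimate how many regular and seasonal differences a series needs; a quick sketch (output not shown here):

# Suggested number of regular (d) and seasonal (D) differences
ndiffs(energy_train)  # trend differencing
nsdiffs(energy_train) # seasonal differencing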

Auto SARIMA

To include seasonality in ARIMA, we can use the auto.arima() function with the seasonal parameter set to TRUE.

energy_auto <- auto.arima(energy_train, seasonal = TRUE)
energy_auto
## Series: energy_train 
## ARIMA(0,0,0)(1,0,0)[30] with non-zero mean 
## 
## Coefficients:
##         sar1     mean
##       0.0572  98.3157
## s.e.  0.1221   3.1061
## 
## sigma^2 estimated as 975.4:  log likelihood=-523.96
## AIC=1053.92   AICc=1054.15   BIC=1061.97

According to auto SARIMA, the best combination for this case is ARIMA(0,0,0)(1,0,0)[30].
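
Note that auto.arima() uses a stepwise search with likelihood approximations by default, so it does not explore every candidate model. If runtime is acceptable, a broader search can be requested; a sketch (its result may or may not differ from the stepwise one):

# Optional: slower but more exhaustive model search
energy_auto_full <- auto.arima(energy_train,
                               seasonal = TRUE,
                               stepwise = FALSE,      # try all order combinations
                               approximation = FALSE) # use exact likelihoods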

# Make the visualization
energy_train %>% 
  autoplot(series = "Actual") +
  autolayer(energy_auto$fitted, series = "Auto SARIMA")

Manual SARIMA

In manual SARIMA we have to figure out p, d, q and P, D, Q ourselves, so we have to take a closer look at the ACF and PACF plots.

# PACF and ACF Visualization
tsdisplay(energy_train_diff)

# Custom function to see PACF and ACF bigger
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.0.5
plot_cf <- function(x, lag.max = NULL){
  p1 <- ggAcf(x, lag.max=lag.max) + labs(title = "Plot ACF and PACF")
  p2 <- ggPacf(x, lag.max=lag.max) + labs(title = NULL)
  p1 / p2
}

plot_cf(energy_train_diff)

ACF

  • non-seasonal: q = 0
  • seasonal (freq = 30): Q = 1

PACF

  • non-seasonal: p = 2
  • seasonal (freq = 30): P = 0

Diff

  • seasonal: once (D = 1)
  • trend: none (d = 0)

Model combination ARIMA(p,d,q)(P,D,Q)[m]

  • ARIMA(2,0,0)(0,1,1)[30]
energy_sarima <- Arima(energy_train,
                      order = c(2,0,0),
                      seasonal = c(0,1,1))

energy_sarima
## Series: energy_train 
## ARIMA(2,0,0)(0,1,1)[30] 
## 
## Coefficients:
##          ar1      ar2     sma1
##       0.1526  -0.1691  -0.7833
## s.e.  0.1131   0.1145   0.6333
## 
## sigma^2 estimated as 1086:  log likelihood=-393.13
## AIC=794.27   AICc=794.81   BIC=803.69
# Make the visualization
energy_train %>% 
  autoplot(series = "Actual") +
  autolayer(energy_sarima$fitted, series = "Manual SARIMA")

Compare Models

Now we have 2 models, auto and manual. Let’s compare these 2 models and see the error score for each model.

# Custom function to compare models
compare_fitting <- function(x){
  lapply(x, function(x) summary(x)["Training set", ]) %>% 
    lapply(., function(x) x %>% t() %>% as.data.frame) %>% 
    bind_rows() %>% 
    mutate(model = names(x)) %>%
    select(model, everything())
}

model_list <- list(
  "Auto SARIMA" = energy_auto,
  "Manual SARIMA" = energy_sarima
)

compare_fitting(model_list)
## Series: energy_train 
## ARIMA(0,0,0)(1,0,0)[30] with non-zero mean 
## 
## Coefficients:
##         sar1     mean
##       0.0572  98.3157
## s.e.  0.1221   3.1061
## 
## sigma^2 estimated as 975.4:  log likelihood=-523.96
## AIC=1053.92   AICc=1054.15   BIC=1061.97
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.04534327 30.94133 24.70089 -10.34396 27.54273 0.7549604
##                   ACF1
## Training set 0.1010689
## Series: energy_train 
## ARIMA(2,0,0)(0,1,1)[30] 
## 
## Coefficients:
##          ar1      ar2     sma1
##       0.1526  -0.1691  -0.7833
## s.e.  0.1131   0.1145   0.6333
## 
## sigma^2 estimated as 1086:  log likelihood=-393.13
## AIC=794.27   AICc=794.81   BIC=803.69
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -1.388583 27.46763 18.32929 -5.517679 19.04382 0.5602183
##                      ACF1
## Training set -0.007040379
##           model          ME     RMSE      MAE        MPE     MAPE      MASE
## 1   Auto SARIMA -0.04534327 30.94133 24.70089 -10.343960 27.54273 0.7549604
## 2 Manual SARIMA -1.38858281 27.46763 18.32929  -5.517679 19.04382 0.5602183
##           ACF1
## 1  0.101068940
## 2 -0.007040379

According to the table above, the manual SARIMA performs better on the training dataset: its MAPE score is lower than the auto SARIMA's, indicating a better fit. Now let's see how the models perform on the test dataset.

Forecasting

# Forecast the next 30 days of usage
energy_auto_f <- forecast(energy_auto, h = 30)
energy_sarima_f <- forecast(energy_sarima, h = 30)
# Custom function to compare forecasts
compare_forecast <- function(x, test){
  lapply(x, function(x) forecast::accuracy(x, test)["Test set", ]) %>%
    lapply(., function(x) x %>% t() %>% as.data.frame) %>% 
    bind_rows() %>% 
    mutate(model = names(x)) %>%
    select(model, everything())
}

forecast_list <- list(
  "Auto SARIMA" = energy_auto_f,
  "Manual SARIMA" = energy_sarima_f)

compare_forecast(forecast_list, energy_test)
##           model        ME     RMSE      MAE       MPE     MAPE      MASE
## 1   Auto SARIMA -1.462188 28.27378 21.48750 -8.936078 22.66054 0.6567461
## 2 Manual SARIMA -1.570116 34.29712 27.08743 -9.127787 28.01858 0.8279028
##        ACF1 Theil's U
## 1 0.1512070 0.8269132
## 2 0.1097558 0.9749539
# Make the visualization
energy_ts %>% 
  autoplot(series = "Actual") +
  autolayer(energy_auto_f$mean, series = "Auto SARIMA") +
  autolayer(energy_sarima_f$mean, series = "Manual SARIMA")

Unfortunately, our manual SARIMA model is beaten on the test dataset: its MAPE is a few points higher than that of the auto SARIMA.
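
To put these scores in context, both models could also be compared against a simple seasonal naive benchmark, which just repeats the last 30-day cycle. A minimal sketch (not run in this analysis):

# Hypothetical benchmark: seasonal naive forecast
energy_snaive_f <- snaive(energy_train, h = 30)
forecast::accuracy(energy_snaive_f, energy_test)["Test set", ]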

Evaluation

To evaluate our models, we have to check two assumptions about the residuals of a time series forecast: no autocorrelation in the residuals, and normally distributed residuals.
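
As a shortcut, the forecast package's checkresiduals() bundles most of these diagnostics (residual plot, ACF, histogram, and a Ljung-Box test) into a single call; a quick sketch:

# One-call residual diagnostics for a fitted model
checkresiduals(energy_auto)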

No Auto-Correlation Residual

To check that our model's residuals are not autocorrelated, we can use the Ljung-Box test.

  • H0: No autocorrelation exists.
  • H1: Autocorrelation exists.
Box.test(energy_auto$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  energy_auto$residuals
## X-squared = 1.1341, df = 1, p-value = 0.2869

The p-value from the test is > alpha (0.05), so we fail to reject H0: our auto SARIMA model shows no autocorrelation between its residuals.
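
The same check can be applied to the manual SARIMA model's residuals (output omitted here):

# Ljung-Box test on the manual SARIMA residuals
Box.test(energy_sarima$residuals, type = "Ljung-Box")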

Normality Residual

To see whether the residuals are normally distributed around 0, we can use the Shapiro-Wilk test.

shapiro.test(energy_auto$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  energy_auto$residuals
## W = 0.9495, p-value = 0.0004431
hist(energy_auto$residuals, breaks = 15)

Because the p-value is < alpha (0.05) and the histogram is skewed, we can say that our model's residuals are not normally distributed.
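
One common remedy worth trying, as a sketch only (I have not verified that it helps here), is a Box-Cox transformation, which Arima() supports through its lambda argument:

# Hypothetical fix: refit the manual SARIMA on a Box-Cox transformed series
lam <- BoxCox.lambda(energy_train) # estimate the transformation parameter
energy_sarima_bc <- Arima(energy_train,
                          order = c(2, 0, 0),
                          seasonal = c(0, 1, 1),
                          lambda = lam)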

Conclusion

From today's experiment, we have successfully created models using ARIMA (auto and manual) to forecast daily energy consumption for the next 30 days. However, the error is still quite high, with test MAPE scores in the 23-28% range. This may be caused by the residuals not being normally distributed, so some errors fall far from 0.

I think this forecasting model is still far from ready for real-life use. It still needs improvement.
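
As one possible direction for those improvements (purely a sketch, not evaluated in this analysis), an STL-decomposition-based forecast from the forecast package could be benchmarked against the SARIMA models:

# Hypothetical follow-up: STL decomposition + ETS forecast of the remainder
energy_stlf_f <- stlf(energy_train, h = 30)
forecast::accuracy(energy_stlf_f, energy_test)["Test set", ]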