Average Energy Consumptions
Before you start reading my analysis below, let me thank you first for being willing to see the results of my writing, and it means a lot to me.
Whatever you will see in this analysis, is the result of my study in Time Series and Forecasting class at Algoritma Academy. To see what I’ve learned in more detail, you can visit the Algoritma Academy learning syllabus.
Everything I have written is entirely my personal opinion based on my experience and knowledge up until now. If something is not right or missing, please feel free to contact me here, I’d love to discuss it with you. Thank you.
Introduction
Background
Data sets and scripts for the publication in Energy and Buildings.
Data driven prediction models of energy use of appliances in a low-energy house. Luis M. Candanedo, Véronique Feldheim, Dominique Deramaix. Energy and Buildings, Volume 140, 1 April 2017, Pages 81-97, ISSN 0378-7788, http://dx.doi.org/10.1016/j.enbuild.2017.01.083.
For more information about the dataset, the dataset is downloaded from UCI Machine Learning Repository
Objectives
Our objectives from this analysis is to :
- See if there is any pattern formed on the energy usage.
- Trying to predict energy usage in the future.
Load Libraries
library(forecast)
library(tidyverse)
library(lubridate)
library(padr)
library(ggplot2)
library(tseries)Import Dataset
energy <- read_csv("dataset/energydata_complete.csv")
glimpse(energy)## Rows: 19,735
## Columns: 29
## $ date <dttm> 2016-01-11 17:00:00, 2016-01-11 17:10:00, 2016-01-11 17:2~
## $ Appliances <dbl> 60, 60, 50, 50, 60, 50, 60, 60, 60, 70, 230, 580, 430, 250~
## $ lights <dbl> 30, 30, 30, 40, 40, 40, 50, 50, 40, 40, 70, 60, 50, 40, 10~
## $ T1 <dbl> 19.89000, 19.89000, 19.89000, 19.89000, 19.89000, 19.89000~
## $ RH_1 <dbl> 47.59667, 46.69333, 46.30000, 46.06667, 46.33333, 46.02667~
## $ T2 <dbl> 19.20000, 19.20000, 19.20000, 19.20000, 19.20000, 19.20000~
## $ RH_2 <dbl> 44.79000, 44.72250, 44.62667, 44.59000, 44.53000, 44.50000~
## $ T3 <dbl> 19.79000, 19.79000, 19.79000, 19.79000, 19.79000, 19.79000~
## $ RH_3 <dbl> 44.73000, 44.79000, 44.93333, 45.00000, 45.00000, 44.93333~
## $ T4 <dbl> 19.00000, 19.00000, 18.92667, 18.89000, 18.89000, 18.89000~
## $ RH_4 <dbl> 45.56667, 45.99250, 45.89000, 45.72333, 45.53000, 45.73000~
## $ T5 <dbl> 17.16667, 17.16667, 17.16667, 17.16667, 17.20000, 17.13333~
## $ RH_5 <dbl> 55.20000, 55.20000, 55.09000, 55.09000, 55.09000, 55.03000~
## $ T6 <dbl> 7.026667, 6.833333, 6.560000, 6.433333, 6.366667, 6.300000~
## $ RH_6 <dbl> 84.25667, 84.06333, 83.15667, 83.42333, 84.89333, 85.76667~
## $ T7 <dbl> 17.20000, 17.20000, 17.20000, 17.13333, 17.20000, 17.13333~
## $ RH_7 <dbl> 41.62667, 41.56000, 41.43333, 41.29000, 41.23000, 41.26000~
## $ T8 <dbl> 18.20000, 18.20000, 18.20000, 18.10000, 18.10000, 18.10000~
## $ RH_8 <dbl> 48.90000, 48.86333, 48.73000, 48.59000, 48.59000, 48.59000~
## $ T9 <dbl> 17.03333, 17.06667, 17.00000, 17.00000, 17.00000, 17.00000~
## $ RH_9 <dbl> 45.53000, 45.56000, 45.50000, 45.40000, 45.40000, 45.29000~
## $ T_out <dbl> 6.600000, 6.483333, 6.366667, 6.250000, 6.133333, 6.016667~
## $ Press_mm_hg <dbl> 733.5000, 733.6000, 733.7000, 733.8000, 733.9000, 734.0000~
## $ RH_out <dbl> 92.00000, 92.00000, 92.00000, 92.00000, 92.00000, 92.00000~
## $ Windspeed <dbl> 7.000000, 6.666667, 6.333333, 6.000000, 5.666667, 5.333333~
## $ Visibility <dbl> 63.00000, 59.16667, 55.33333, 51.50000, 47.66667, 43.83333~
## $ Tdewpoint <dbl> 5.300000, 5.200000, 5.100000, 5.000000, 4.900000, 4.800000~
## $ rv1 <dbl> 13.275433, 18.606195, 28.642668, 45.410389, 10.084097, 44.~
## $ rv2 <dbl> 13.275433, 18.606195, 28.642668, 45.410389, 10.084097, 44.~
Variable Description :
datetime year-month-day hour:minute:secondAppliances, energy use in Whlights, energy use of light fixtures in the house in WhT1, Temperature in kitchen area, in CelsiusRH_1, Humidity in kitchen area, in %T2, Temperature in living room area, in CelsiusRH_2, Humidity in living room area, in %T3, Temperature in laundry room areaRH_3, Humidity in laundry room area, in %T4, Temperature in office room, in CelsiusRH_4, Humidity in office room, in %T5, Temperature in bathroom, in CelsiusRH_5, Humidity in bathroom, in %T6, Temperature outside the building (north side), in CelsiusRH_6, Humidity outside the building (north side), in %T7, Temperature in ironing room , in CelsiusRH_7, Humidity in ironing room, in %T8, Temperature in teenager room 2, in CelsiusRH_8, Humidity in teenager room 2, in %T9, Temperature in parents room, in CelsiusRH_9, Humidity in parents room, in %To, Temperature outside (from Chièvres weather station), in CelsiusPressure(from Chièvres weather station), in mm HgRH_out, Humidity outside (from Chièvres weather station), in %Windspeed(from Chièvres weather station), in m/sVisibility(from Chièvres weather station), in kmTdewpoint(from Chièvres weather station), °Crv1, Random variable 1, nondimensionalrv2, Rnadom variable 2, nondimensional
Our dataset contains 19.735 observations with various columns that explains the energy consumption and various condition inside a house.
Data Preprocessing
Before we analyze the data, we have to make sure that the dataset we are using is ready to use. What we can do at this stage is to see if there is missing data and whether the data type of each column matches the data of the column.
# Check for missing values
colSums(is.na(energy))## date Appliances lights T1 RH_1 T2
## 0 0 0 0 0 0
## RH_2 T3 RH_3 T4 RH_4 T5
## 0 0 0 0 0 0
## RH_5 T6 RH_6 T7 RH_7 T8
## 0 0 0 0 0 0
## RH_8 T9 RH_9 T_out Press_mm_hg RH_out
## 0 0 0 0 0 0
## Windspeed Visibility Tdewpoint rv1 rv2
## 0 0 0 0 0
Fortunately, the data set contains no missing values, and the date column is already recognized as the datetime data type, so no further manipulation is required other than subsetting the dataset for wanted columns only. In this experiment, we will only use the date column and appliances to see the energy consumption for appliances in the house. And also, we have to do padding to our dataset to make sure there is no date skipped.
# Subset the dataset and do padding
energy_clean <- energy %>%
select(date, Appliances) %>% # Select only columns we want to use
pad() # Apply pad## Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## pad applied on the interval: 10 min
energy_clean## # A tibble: 19,735 x 2
## date Appliances
## <dttm> <dbl>
## 1 2016-01-11 17:00:00 60
## 2 2016-01-11 17:10:00 60
## 3 2016-01-11 17:20:00 50
## 4 2016-01-11 17:30:00 50
## 5 2016-01-11 17:40:00 60
## 6 2016-01-11 17:50:00 50
## 7 2016-01-11 18:00:00 60
## 8 2016-01-11 18:10:00 60
## 9 2016-01-11 18:20:00 60
## 10 2016-01-11 18:30:00 70
## # ... with 19,725 more rows
# Verify there is no missing values
energy_clean %>% anyNA()## [1] FALSE
As we can see, the observations is way too many because it’s taken every 10 minutes. I think it’s rare to see someone talk about their energy usage for each 10 minutes, so we will find the average daily usage to make it simpler.
energy_daily <- energy_clean %>%
mutate(date = floor_date(date, unit = "days")) %>%
group_by(date) %>%
summarise(avg_daily_usage = mean(Appliances)) # average daily usage
energy_daily## # A tibble: 138 x 2
## date avg_daily_usage
## <dttm> <dbl>
## 1 2016-01-11 00:00:00 137.
## 2 2016-01-12 00:00:00 85.7
## 3 2016-01-13 00:00:00 97.0
## 4 2016-01-14 00:00:00 151.
## 5 2016-01-15 00:00:00 125.
## 6 2016-01-16 00:00:00 125.
## 7 2016-01-17 00:00:00 143.
## 8 2016-01-18 00:00:00 94.0
## 9 2016-01-19 00:00:00 83.3
## 10 2016-01-20 00:00:00 114.
## # ... with 128 more rows
Perfect, now the observations are reduced significantly only to 138 observations (days).
Exploratory Data Analysis
Before creating a machine learning model, let’s explore our data first to see if there is a pattern in energy use in home appliances. To see this, we can use ggplot2 to make some basic plots.
# Check the date range of our data
range(energy_daily$date)## [1] "2016-01-11 UTC" "2016-05-27 UTC"
# Visualize the data to see if there is any pattern
ggplot(data = energy_daily, aes(x = date, y = avg_daily_usage)) +
geom_point() +
geom_line()We can’t clearly see the pattern yet in our basic daily ggplot2 line plot above. To make it clearer we can use decompose() and autoplot() functions to make another plot.
# Make time series object with monthly frequency
energy_ts <- ts(energy_daily$avg_daily_usage,
frequency = 30, # daily data, monthly pattern
)
energy_ts %>% decompose() %>% autoplot() From the result above, we can see that our data has seasonality and trend. The seasonality of our data categorized as an additive, because the pattern constant for each month without noticeable variation.
Cross-Validation
Before we make machine learning models, we have to split our data into train data and test data, so we can evaluate the performance of our models before testing them in real life. In this case, we will split our data, with the test data 30 days long or 1 month.
# Split the Data
energy_train <- head(energy_ts, -30)
energy_test <- tail(energy_ts, 30) # 30 daysBuild Model
Because our data has seasonality, we can use SARIMA or seasonal ARIMA to build our machine learning model. To do this, we have 2 options. Either let R do this automatically (auto SARIMA) or we can do it manually.
But, to use the SARIMA model, we have to make our data stationary first. We can use diff() to achieve this.
energy_train_diff <- energy_train %>%
diff(lag=30) # differencing for seasonal component
energy_train_diff %>% adf.test()## Warning in adf.test(.): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -5.3813, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
Now, our data is stationary because the p-value is < alpha (0,05).
Auto SARIMA
To use seasonality in ARIMA we can use the auto.arima function but with seasonality parameter set to True.
energy_auto <- auto.arima(energy_train, seasonal = T)
energy_auto## Series: energy_train
## ARIMA(0,0,0)(1,0,0)[30] with non-zero mean
##
## Coefficients:
## sar1 mean
## 0.0572 98.3157
## s.e. 0.1221 3.1061
##
## sigma^2 estimated as 975.4: log likelihood=-523.96
## AIC=1053.92 AICc=1054.15 BIC=1061.97
According to auto SARIMA the best combination for this case is ARIMA(0,0,0)(1,0,0)[30].
# Make the visualization
energy_train %>%
autoplot(series = "Actual") +
autolayer(energy_auto$fitted, series = "Auto SARIMA")Manual SARIMA
In manual SARIMA we have to figure out the p, d, q and P, D, Q by ourselves. So we have to take a closer look at ACF and PACF.
# PACF and ACF Visualization
tsdisplay(energy_train_diff)# Custom function to see PACF and ACF bigger
library(patchwork)## Warning: package 'patchwork' was built under R version 4.0.5
plot_cf <- function(x, lag.max = NULL){
p1 <- ggAcf(x, lag.max=lag.max) + labs(title = "Plot ACF and PACF")
p2 <- ggPacf(x, lag.max=lag.max) + labs(title = NULL)
p1 / p2
}
plot_cf(energy_train_diff)ACF
- non-seasonal: q = 0
- seasonal (freq = 144): Q = 1
PACF
- non-seasonal: p = 2
- seasonal (freq = 144): P = 0
Diff
- seasonal: 1 kali (D = 1)
- trend: 1 kali (d = 0)
Model combination ARIMA(p,d,q)(P,D,Q)[m]
- ARIMA(2,0,0)(0,1,1)[30]
energy_sarima <- Arima(energy_train,
order = c(2,0,0),
seasonal = c(0,1,1))
energy_sarima## Series: energy_train
## ARIMA(2,0,0)(0,1,1)[30]
##
## Coefficients:
## ar1 ar2 sma1
## 0.1526 -0.1691 -0.7833
## s.e. 0.1131 0.1145 0.6333
##
## sigma^2 estimated as 1086: log likelihood=-393.13
## AIC=794.27 AICc=794.81 BIC=803.69
# Make the visualization
energy_train %>%
autoplot(series = "Actual") +
autolayer(energy_sarima$fitted, series = "Manual SARIMA")Compare Models
Now we have 2 models, auto and manual. Let’s compare these 2 models and see the error score for each model.
# Custom function to compare models
compare_fitting <- function(x){
lapply(x, function(x) summary(x)["Training set", ]) %>%
lapply(., function(x) x %>% t() %>% as.data.frame) %>%
bind_rows() %>%
mutate(model = names(x)) %>%
select(model, everything())
}
model_list <- list(
"Auto SARIMA" = energy_auto,
"Manual SARIMA" = energy_sarima
)
compare_fitting(model_list)## Series: energy_train
## ARIMA(0,0,0)(1,0,0)[30] with non-zero mean
##
## Coefficients:
## sar1 mean
## 0.0572 98.3157
## s.e. 0.1221 3.1061
##
## sigma^2 estimated as 975.4: log likelihood=-523.96
## AIC=1053.92 AICc=1054.15 BIC=1061.97
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.04534327 30.94133 24.70089 -10.34396 27.54273 0.7549604
## ACF1
## Training set 0.1010689
## Series: energy_train
## ARIMA(2,0,0)(0,1,1)[30]
##
## Coefficients:
## ar1 ar2 sma1
## 0.1526 -0.1691 -0.7833
## s.e. 0.1131 0.1145 0.6333
##
## sigma^2 estimated as 1086: log likelihood=-393.13
## AIC=794.27 AICc=794.81 BIC=803.69
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.388583 27.46763 18.32929 -5.517679 19.04382 0.5602183
## ACF1
## Training set -0.007040379
## model ME RMSE MAE MPE MAPE MASE
## 1 Auto SARIMA -0.04534327 30.94133 24.70089 -10.343960 27.54273 0.7549604
## 2 Manual SARIMA -1.38858281 27.46763 18.32929 -5.517679 19.04382 0.5602183
## ACF1
## 1 0.101068940
## 2 -0.007040379
According to the table above, our manual SARIMA performs better in the training dataset. We can see this through the MAPE score. Manual SARIMA has a lower MAPE score which indicates that it performs better than auto SARIMA in the train dataset. Now let’s see how it performs in the test dataset.
Forecasting
# Forecast the next 30 days of usage
energy_auto_f <- forecast(energy_auto, h = 30)
energy_sarima_f <- forecast(energy_sarima, h = 30)# Custom function to compare forecasts
compare_forecast <- function(x, test){
lapply(x, function(x) forecast::accuracy(x, test)["Test set", ]) %>%
lapply(., function(x) x %>% t() %>% as.data.frame) %>%
bind_rows() %>%
mutate(model = names(x)) %>%
select(model, everything())
}
forecast_list <- list(
"Auto SARIMA" = energy_auto_f,
"Manual SARIMA" = energy_sarima_f)
compare_forecast(forecast_list, energy_test)## model ME RMSE MAE MPE MAPE MASE
## 1 Auto SARIMA -1.462188 28.27378 21.48750 -8.936078 22.66054 0.6567461
## 2 Manual SARIMA -1.570116 34.29712 27.08743 -9.127787 28.01858 0.8279028
## ACF1 Theil's U
## 1 0.1512070 0.8269132
## 2 0.1097558 0.9749539
# Make the visualization
energy_ts %>%
autoplot(series = "Actual") +
autolayer(energy_auto_f$mean, series = "Auto SARIMA") +
autolayer(energy_sarima_f$mean, series = "Manual SARIMA")Unfortunately, our manual SARIMA model got defeated in the test dataset. The MAPE of manual SARIMA is higher by few points than the auto SARIMA.
Evaluation
To evaluate our models, we have to check 2 assumptions in forecasting time series. Those assumptions are no auto-correlation residual and normality residual.
No Auto-Correlation Residual
To see if our model has no auto-correlation between its residuals, we can use the Ljung-Box test.
- H0: No auto-correlation exists.
- H1: auto-correlation exists.
Box.test(energy_auto$residuals, type = "Ljung-Box")##
## Box-Ljung test
##
## data: energy_auto$residuals
## X-squared = 1.1341, df = 1, p-value = 0.2869
The p-value from the test is > alpha (0,05) so we failed to reject H0 and our auto SARIMA model has no auto-correlation between its residuals.
Normality Residual
To see if the residual distributed normally around 0 we can use the Shapiro-Wilk test.
shapiro.test(energy_auto$residuals)##
## Shapiro-Wilk normality test
##
## data: energy_auto$residuals
## W = 0.9495, p-value = 0.0004431
hist(energy_auto$residuals, breaks = 15) Because the p-value is < alpha (0.05) and the histogram is skewed, we can say that our model residuals are not normally distributed.
Conclusion
From today’s experiment, we have successfully created models using ARIMA (auto and manual) to forecast daily energy assumptions for the next 30 days. But, although we can see the result, the error is still quite high around 30% (MAPE). Maybe, this is caused by residuals that are not normally distributed, so there are errors that are far from 0.
I think this forecasting model is still far away from implementation in real-life cases. It still needs improvements.