Appliances Energy Analysis

Sara Monica

July 19, 2022

Introduction

The purpose of this research is to forecast the electricity consumption of a particular household in Belgium based on the temperature and humidity levels of various rooms in the facility and surrounding weather information over 4.5 months. The data set runs 4.5 months at 10 minutes. A ZigBee wireless sensor network is being used to monitor the home’s temperature and humidity levels. Around 3.3 minutes, each wireless node sent the temperature and humidity data. The wireless data was then averaged across intervals of 10 minutes. Every 10 minutes, m-bus energy meters collected the energy data. The experimental data sets were combined with the weather data from the closest airport weather station (Chievres Airport, Belgium), which was extracted from a public data set from Reliable Prognosis (rp5.ru). The data set has two random variables to test the regression models and exclude non-predictive characteristics (parameters).

Dataset Link : http://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction , https://github.com/LuisM78/Appliances-energy-prediction-data

Repositor : https://www.sciencedirect.com/science/article/abs/pii/S0378778816308970?via%3Dihub

Business Problem

The increasing trend in energy consumption is becoming cause of concern for the entire world, as the energy consumption is increasing year after year so is the carbon and greenhouse gas emission, the majority portion of the electricity generated is consumed by industrial sector but a considerable amount is also consumed by residential sector.

It is important to study the energy consuming behaviour in the residential sector and predict the energy consumption by home appliances as it consume maximum amount of energy in the residence. This project focuses on predicting the energy consumption of home appliances based on humidity and temperature.

This project aims to predict the energy consumption of home appliances. With the advent of smart homes and the rising need for energy management, existing smart home systems can benefit from accurate prediction. If the energy usage can be predicted for every possible state of appliances, then device control can be optimized for energy savings as well. This is a case of Regression analysis which is part of the Supervised Learning problem. Appliance energy usage is the target variable while sensor data and weather data are the features.

Data Processing

Load Libraries

library(dplyr)
library(caret)
library(tidyr)
library(randomForest)
library(ggplot2)
library(lime)
library(GGally)
library(performance) 
library(MLmetrics)
library(lmtest)
library(car)
library(lubridate)
library(psych)
library(plotly)

Read Data

#  read data
training <-  read.csv("data/training.csv")
testing <-  read.csv("data/testing.csv")

data <- bind_rows(training, testing)

The observation data consists of the following variables:

datetime year-month-day hour : minute:second
Appliances: energy use in Wh [TARGETED]
lights: energy use of light fixtures in the house in Wh
T1: Temperature in kitchen area, in Celsius
RH_1: Humidity in kitchen area, in %
T2: Temperature in living room area, in Celsius
RH_2:Humidity in living room area, in %
T3:Temperature in laundry room area
RH_3:Humidity in laundry room area, in %
T4:Temperature in office room, in Celsius
RH_4:Humidity in office room, in %
T5:Temperature in bathroom, in Celsius
RH_5:Humidity in bathroom, in %
T6:Temperature outside the building (north side), in Celsius
RH_6:Humidity outside the building (north side), in %
T7:Temperature in ironing room , in Celsius
RH_7:Humidity in ironing room, in %
T8:Temperature in teenager room 2, in Celsius
RH_8:Humidity in teenager room 2, in %
T9:Temperature in parents room, in Celsius
RH_9:Humidity in parents room, in %
T_out:Temperature outside (from Chièvres weather station), in Celsius
Press_mm_hg: (from Chièvres weather station), in mm Hg
RH_out: Humidity outside (from Chièvres weather station), in %
Windspeed: (from Chièvres weather station), in m/s
Visibility: (from Chièvres weather station), in km
Tdewpoint: (from Chièvres weather station), °C
rv1: Random variable 1, nondimensional
rv2: Rnadom variable 2, nondimensional
Day_of_week: Name of Day, ordered
WeekStatus: Day status of Day_of_week

Number of instances: 19,735 Number of attributes: 32

In order to ensure that the data is “fully prepared,” we demonstrate how to use various data transformations, scaling, handling outliers, or any other statistical strategy. It is best practice to preprocess our data before performing analysis. Data must first be cleaned and transformed before being used for analysis and modeling.

Pre-processing

# data structure
glimpse(data)

## Rows: 19,735
## Columns: 32
## $ date        <chr> "2016-01-11 17:00:00", "2016-01-11 17:10:00", "2016-01-11 …
## $ Appliances  <int> 60, 60, 50, 60, 50, 60, 60, 70, 430, 250, 100, 90, 80, 140…
## $ lights      <int> 30, 30, 30, 40, 40, 50, 40, 40, 50, 40, 10, 10, 30, 40, 20…
## $ T1          <dbl> 19.89000, 19.89000, 19.89000, 19.89000, 19.89000, 19.85667…
## $ RH_1        <dbl> 47.59667, 46.69333, 46.30000, 46.33333, 46.02667, 45.56000…
## $ T2          <dbl> 19.20000, 19.20000, 19.20000, 19.20000, 19.20000, 19.20000…
## $ RH_2        <dbl> 44.79000, 44.72250, 44.62667, 44.53000, 44.50000, 44.50000…
## $ T3          <dbl> 19.79000, 19.79000, 19.79000, 19.79000, 19.79000, 19.73000…
## $ RH_3        <dbl> 44.73000, 44.79000, 44.93333, 45.00000, 44.93333, 44.90000…
## $ T4          <dbl> 19.00000, 19.00000, 18.92667, 18.89000, 18.89000, 18.89000…
## $ RH_4        <dbl> 45.56667, 45.99250, 45.89000, 45.53000, 45.73000, 45.86333…
## $ T5          <dbl> 17.16667, 17.16667, 17.16667, 17.20000, 17.13333, 17.10000…
## $ RH_5        <dbl> 55.20000, 55.20000, 55.09000, 55.09000, 55.03000, 54.90000…
## $ T6          <dbl> 7.026667, 6.833333, 6.560000, 6.366667, 6.300000, 6.190000…
## $ RH_6        <dbl> 84.25667, 84.06333, 83.15667, 84.89333, 85.76667, 86.42333…
## $ T7          <dbl> 17.20000, 17.20000, 17.20000, 17.20000, 17.13333, 17.10000…
## $ RH_7        <dbl> 41.62667, 41.56000, 41.43333, 41.23000, 41.26000, 41.20000…
## $ T8          <dbl> 18.20000, 18.20000, 18.20000, 18.10000, 18.10000, 18.10000…
## $ RH_8        <dbl> 48.90000, 48.86333, 48.73000, 48.59000, 48.59000, 48.59000…
## $ T9          <dbl> 17.03333, 17.06667, 17.00000, 17.00000, 17.00000, 17.00000…
## $ RH_9        <dbl> 45.53000, 45.56000, 45.50000, 45.40000, 45.29000, 45.29000…
## $ T_out       <dbl> 6.600000, 6.483333, 6.366667, 6.133333, 6.016667, 5.916667…
## $ Press_mm_hg <dbl> 733.5000, 733.6000, 733.7000, 733.9000, 734.0000, 734.1667…
## $ RH_out      <dbl> 92.00000, 92.00000, 92.00000, 92.00000, 92.00000, 91.83333…
## $ Windspeed   <dbl> 7.000000, 6.666667, 6.333333, 5.666667, 5.333333, 5.166667…
## $ Visibility  <dbl> 63.00000, 59.16667, 55.33333, 47.66667, 43.83333, 40.00000…
## $ Tdewpoint   <dbl> 5.300000, 5.200000, 5.100000, 4.900000, 4.800000, 4.683333…
## $ rv1         <dbl> 13.2754332, 18.6061950, 28.6426682, 10.0840966, 44.9194842…
## $ rv2         <dbl> 13.2754332, 18.6061950, 28.6426682, 10.0840966, 44.9194842…
## $ NSM         <int> 61200, 61800, 62400, 63600, 64200, 65400, 66000, 66600, 68…
## $ WeekStatus  <chr> "Weekday", "Weekday", "Weekday", "Weekday", "Weekday", "We…
## $ Day_of_week <chr> "Monday", "Monday", "Monday", "Monday", "Monday", "Monday"…

data <- data %>% rename('temp_kitchen' = 'T1', 
                        'temp_living' = 'T2', 
                        'temp_laundry' = 'T3',
                        'temp_office' = 'T4',
                        'temp_bath' = 'T5', 
                        'temp_outside' = 'T6',
                        'temp_iron' ='T7', 
                        'temp_teen' = 'T8', 
                        'temp_parents' = 'T9', 
                        'temp_station' = 'T_out',
                        'humid_kitchen' = 'RH_1', 
                        'humid_living' = 'RH_2', 
                        'humid_laundry' = 'RH_3', 
                        'humid_office'= 'RH_4', 
                        'humid_bath' = 'RH_5', 
                        'humid_outside' = 'RH_6',
                        'humid_iron' = 'RH_7', 
                        'humid_teen' = 'RH_8', 
                        'humid_parents' = 'RH_9', 
                        'humid_station' = 'RH_out',
                        'random_1' = 'rv1',
                        'random_2' = 'rv2')

#  check missing value
colSums(is.na(data))

##          date    Appliances        lights  temp_kitchen humid_kitchen 
##             0             0             0             0             0 
##   temp_living  humid_living  temp_laundry humid_laundry   temp_office 
##             0             0             0             0             0 
##  humid_office     temp_bath    humid_bath  temp_outside humid_outside 
##             0             0             0             0             0 
##     temp_iron    humid_iron     temp_teen    humid_teen  temp_parents 
##             0             0             0             0             0 
## humid_parents  temp_station   Press_mm_hg humid_station     Windspeed 
##             0             0             0             0             0 
##    Visibility     Tdewpoint      random_1      random_2           NSM 
##             0             0             0             0             0 
##    WeekStatus   Day_of_week 
##             0             0

# remove duplicate 
unique(data)
# remove row containing NA value 
data <- data %>% filter(complete.cases(.))

data <- data %>% mutate(date = ymd_hms(date))
# summarise the data to 24h format
data <- data %>%
  mutate(date = floor_date(date, "hour")) %>%
  group_by(date) %>%
  select_if(is.numeric) %>%
  summarise_all("mean") %>%
  ungroup() %>%
  mutate(Day_of_week = wday(date, label = T),
        WeekStatus = ifelse(Day_of_week %in% c("Sat", "Sun"), "weekend", "weekday") , 
        WeekStatus = as.factor(WeekStatus),
        Day_of_week = as.factor(Day_of_week))

Check data distribution of each predictor

data %>% 
   select_if(is.numeric) %>% 
   select(-c(Press_mm_hg, Appliances, NSM)) %>% 
   boxplot(main = 'Distribution of Each Predictor', xlab = 'Predictor', ylab = 'Values')

Our data can be visually examined to identify whether any outliers are present. By requiring our model to accommodate them, outliers impact the dependent variable we’re developing. As their names indicate, outliers lie outside our model’s majority. The resolving capability of our model might be reduced if we include outliers. We can observe from the boxplot that some variables, such humid_laundry, humid_station, and Press_mm_hg have noticeable outliers.

Define Outlier

boxplot.stats(data$humid_laundry)$out
boxplot.stats(data$temp_office)$out
boxplot.stats(data$temp_kitchen)$out
boxplot.stats(data$humid_living)$out
out_hkit <- boxplot.stats(data$humid_kitchen)$out
out_tempout <- boxplot.stats(data$temp_outside)$out
out_humidstat <- boxplot.stats(data$humid_station)$out
out_tempbath <- boxplot.stats(data$temp_bath)$out
out_tempiron <- boxplot.stats(data$temp_iron)$out
out_hiron <- boxplot.stats(data$humid_iron)$out
out_tempstat <- boxplot.stats(data$temp_station)$out
out_tteen <- boxplot.stats(data$temp_teen)$out
out_hteen <- boxplot.stats(data$humid_teen)$out
boxplot.stats(data$humid_parents)$out
out_wind <- boxplot.stats(data$Windspeed)$out
out_dew <- boxplot.stats(data$Tdewpoint)$out
out_temppar <- boxplot.stats(data$temp_parents)$out
out_templiv <- boxplot.stats(data$temp_living)$out
out_press <- boxplot.stats(data$Press_mm_hg)$out

#data_outlier <- data %>% filter(humid_laundry >= 49.34867 | temp_office >= 26.00583 | temp_office <= 15.69000 | temp_kitchen >= 25.45700 | temp_kitchen <= 17.89000 | humid_parents <= 29.22833 | humid_kitchen >= 51.91067 | humid_kitchen <= 28.26944 | temp_bath >= out_tempbath | humid_iron >= out_hiron | temp_teen <= out_tteen | humid_teen >= out_hteen | Windspeed >= out_wind | Tdewpoint >= out_dew | temp_outside >= out_tempout | humid_station <= out_humidstat | temp_station >= out_tempstat | temp_living >= out_templiv | temp_parents >= out_temppar | Press_mm_hg <= out_press)

data_outlier <- data %>% filter(humid_laundry >= 49.47222 | temp_office >= 25.94310 | temp_office <= 15.69000 | temp_kitchen >= 25.47556 | temp_kitchen <= 16.79000 | humid_parents >= 53.14000 | humid_parents <= 29.48750 | humid_living >= 51.33444 | humid_living <= 26.18556 | humid_kitchen >= out_hkit |temp_bath >= out_tempbath | humid_iron >= out_hiron | temp_teen <= out_tteen | humid_teen >= out_hteen | temp_parents >= out_temppar | temp_outside >= out_tempout | humid_station <= out_humidstat | temp_station >= out_tempstat | temp_living >= out_templiv | Windspeed >= out_wind | Tdewpoint >= out_dew)

#data_outlier <- data %>% filter(humid_laundry >= 49.22667 | humid_laundry <= 29.00000 | temp_office >= 25.96333 | temp_office <= 15.66000 | temp_kitchen >= 25.39000 | temp_kitchen <= 17.79000  | humid_parents >= 53.16333 | humid_parents <= 29.16667 | humid_kitchen >= 52.12667 | humid_kitchen <= 27.36000 |temp_bath >= out_tempbath | humid_iron >= out_hiron | temp_teen <= out_tteen | humid_teen >= out_hteen | temp_outside >= out_tempout | humid_station <= out_humidstat | temp_station >= out_tempstat | temp_living >= out_templiv | Windspeed >= out_wind | Tdewpoint >= out_dew)

Outlier on Original Data Distribution

# check if the outliers has an influence to targeted variable
data %>% 
    mutate(date = as.numeric(date)) %>%
    ggplot(aes(x = date, y = Appliances)) +
    geom_point() + 
    geom_point(data = data_outlier, aes(x = date, y = Appliances), col = 'red') + 
    labs(
        title = 'Distribution of Appliance : Original vs outlier (red)',
        x = NULL,
        y = 'Appliance') +
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Check the outlier data

data_no_outlier <- data %>% filter(!date %in% data_outlier$date)
t.test(data$Appliances, data_outlier$Appliances)

## 
##  Welch Two Sample t-test
## 
## data:  data$Appliances and data_outlier$Appliances
## t = -2.0731, df = 610.83, p-value = 0.03858
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.4165796  -0.4442911
## sample estimates:
## mean of x mean of y 
##  97.77913 106.20956

The p-value for the T-test is less than 0.05. This indicates that it is safe to assume that the strengths of the standard and outlier groups differ. Even though they might require a different inspection, let’s eliminate the outliers for this project. We have created data_no_outlier for our anlysis.

Distribution on Each Predictor

data_no_outlier %>% 
    select_if(is.numeric) %>% 
    pivot_longer(cols = -Appliances, names_to = 'predictor') %>% 
    ggplot(aes(x = value)) +
    geom_density() +
    facet_wrap(~predictor, scales = 'free_x')  +
    labs(
        title = 'Density graph of each variable',
        x = 'variable',
        y = 'Frequency'
    )

The graph shows that humid_outside, random_1, random_2 are all fairly uniform in shape. This can imply that these variables are combined freely and without followed by other variables.

Data Transformation

Let’s see the trend of our data for each predictor

data_no_outlier %>% 
    select_if(is.numeric) %>% 
    pivot_longer(cols = -Appliances, names_to = 'predictor') %>% 
    ggplot(aes(x = value, y = Appliances)) +
    geom_point() +
    geom_smooth(method = 'loess', formula = 'y ~ x') +
    facet_wrap(~predictor, scales = 'free_x')  +
    labs(
        title = 'Trends in each variable',
        x = 'Variable',
        y = 'Values'
    )

According to the plots, we’re hardly to suspect the correlation between each predictors to Appliances. lights are has the most noticable correlation line and a minor positive correlation exists between humid_bath, humid_kitchen, temp_laundry``, andtemp_outside. There's also a minor negative correlation betweenNSM,humid_teen, andhumid_parent`. With linear data, regression models perform the best. We can attempt to modify the distribution to become more linear by transforming the non-linear data.

`Appliances` as Log(`Appliances`)

ggplot(data = data_no_outlier, aes(x = (Appliances))) +
  geom_density(fill = "pink", color = "black", bins = 30) +
  labs( title = "Appliances  Distribution", x = "Appliances")

## Warning: Ignoring unknown parameters: bins

> Transform Appliance

ggplot(data = data_no_outlier, aes(x = log(Appliances))) +
  geom_density(fill = "pink", color = "black", bins = 30) +
  labs( title = "Log(Appliances)  Distribution", x = "Log(Appliances)")

## Warning: Ignoring unknown parameters: bins

We see that the two relation is more linear. We’ll persist this change to our dataset.

data_log <-  data_no_outlier %>% mutate(Appliances = log(Appliances))

recap <- data_log %>% 
  mutate(date = floor_date(date, "hour")) %>%
  mutate(hour = hour(data_log$date),
        month = month(data_log$date, label = T),
        day = day(data_log$date))  %>% group_by(date, month, hour, day) %>% 
  summarise(Appliances = mean(Appliances)) %>% 
  ggplot(aes(x = hour , y = day, fill = Appliances)) + 
  geom_tile() +
  facet_grid(~month) +
  scale_y_continuous(breaks = seq(1,31)) +
  scale_x_continuous(breaks = seq(0,24,6)) +
      labs(title = "Appliance Recap",
             x = "hour",
             y = "date") +
        theme(axis.text.x = element_text(hjust = 1, angle = 45),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))

ggplotly(recap) %>% plotly::layout(legend=list(x=0, 
                                 xanchor='left',
                                 yanchor='bottom',
                                 orientation='h'))

If we take a look at the plot there’s a pattern that lies on the data, and it was an outlier that we removed before.

Data Scaling

describe(data_log, fast = T)

##               vars    n     mean       sd     min      max    range     se
## date             1 2816      NaN       NA     Inf     -Inf     -Inf     NA
## Appliances       2 2816     4.35     0.61    3.34     6.41     3.07   0.01
## lights           3 2816     3.72     6.81    0.00    51.67    51.67   0.13
## temp_kitchen     4 2816    21.62     1.45   17.46    25.48     8.01   0.03
## humid_kitchen    5 2816    40.11     3.70   31.50    52.81    21.31   0.07
## temp_living      6 2816    20.18     1.92   16.20    28.28    12.08   0.04
## humid_living     7 2816    40.45     3.79   26.40    51.11    24.71   0.07
## temp_laundry     8 2816    22.17     1.84   17.79    28.52    10.73   0.03
## humid_laundry    9 2816    39.15     3.17   29.70    48.59    18.89   0.06
## temp_office     10 2816    20.77     1.86   15.73    25.84    10.11   0.04
## humid_office    11 2816    38.87     4.23   29.42    50.75    21.32   0.08
## temp_bath       12 2816    19.51     1.69   15.37    24.33     8.97   0.03
## humid_bath      13 2816    50.96     8.62   36.06    94.88    58.82   0.16
## temp_outside    14 2816     7.37     5.53   -5.93    25.38    31.31   0.10
## humid_outside   15 2816    55.72    30.06    1.00    99.90    98.90   0.57
## temp_iron       16 2816    20.18     1.98   15.48    25.19     9.71   0.04
## humid_iron      17 2816    35.21     5.00   23.34    50.25    26.91   0.09
## temp_teen       18 2816    21.98     1.86   16.97    26.92     9.95   0.04
## humid_teen      19 2816    42.78     5.10   29.74    56.67    26.93   0.10
## temp_parents    20 2816    19.40     1.89   14.89    24.35     9.46   0.04
## humid_parents   21 2816    41.44     4.06   29.52    52.78    23.27   0.08
## temp_station    22 2816     6.93     4.83   -4.96    22.87    27.83   0.09
## Press_mm_hg     23 2816   755.60     7.37  729.38   772.26    42.88   0.14
## humid_station   24 2816    80.51    13.82   33.25   100.00    66.75   0.26
## Windspeed       25 2816     4.01     2.38    0.42    11.83    11.42   0.04
## Visibility      26 2816    38.45    11.24    1.00    66.00    65.00   0.21
## Tdewpoint       27 2816     3.50     4.03   -6.46    14.86    21.32   0.08
## random_1        28 2816    24.98     5.94    5.26    43.61    38.35   0.11
## random_2        29 2816    24.98     5.94    5.26    43.61    38.35   0.11
## NSM             30 2816 42171.31 25155.53 1500.00 84300.00 82800.00 474.04
## Day_of_week     31 2816      NaN       NA     Inf     -Inf     -Inf     NA
## WeekStatus      32 2816      NaN       NA     Inf     -Inf     -Inf     NA

Before Scaling

data_log %>%
    select_if(is.numeric) %>% 
    pivot_longer(cols = -Appliances, names_to = 'predictor') %>% 
    group_by(predictor) %>% 
    summarize(value = max(value)) %>% 
    ggplot(aes(x = predictor, y = value)) +
    geom_col(fill = 'pink') + 
    labs(
        title = 'Data Range Before Scaling',
        x = 'Variable',
        y = 'Value') + 
        theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1, angle = 45),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))

After scaling Scaling

Before we scale data_log, we need to remove non-numeric column date and exclude Day_of_week, WeekStatus as categorical variable

# data scaling
data_scale <- data_log %>% select(-date) %>% mutate(Appliances = as.numeric(Appliances))
data_scale[,-c(1,30,31)] <- scale(data_scale[,-c(1,30,31)])
data_scale %>% select_if(is.numeric) %>% 
    pivot_longer(cols = -Appliances, names_to = 'predictor') %>% 
    group_by(predictor) %>% 
    summarize(value = max(value)) %>% 
    ggplot(aes(x = predictor, y = value)) +
    geom_col(fill = 'pink') + 
    labs(
        title = 'Data Range After Scaling',
        x = 'Variable',
        y = 'Values') +
        theme(legend.title = element_blank(),
              axis.text.x = element_text(hjust = 1, angle = 45),
              plot.title = element_text(face = "bold"),
              panel.background = element_rect(fill = "#ffffff"),
              axis.line.y = element_line(colour = "grey"),
              axis.line.x = element_line(colour = "grey"))

Data distribution after scaling

data_scale %>% 
    select_if(is.numeric) %>% 
    pivot_longer(cols = -Appliances, names_to = 'predictor') %>% 
    ggplot() +
    geom_histogram(aes(x = value), bins = 15, color = 'black', fill = 'white') +
    facet_wrap(~predictor, scales = 'free_x')  +
    labs(
        title = 'Density graph of each variable',
        x = 'variable',
        y = 'Frequency'
    )

All humidity values except humid_parent and humid_outside has Normal distribution and for all temperature readings follow a Normal distribution except for temp_parent.
Out of the remaining columns, we can see that Visibility, Windspeed and lights are skewed.
The random variables rv1 and rv2 has normal distribution.

Exploratoty Data Analysis

Correlation

#  cek korelasi
ggcorr(data_scale, hjust =1, label = T)

data_scale %>% select_if(is.numeric) %>% cor() %>% as.data.frame() %>% arrange(-Appliances)

##                 Appliances       lights temp_kitchen humid_kitchen temp_living
## Appliances     1.000000000  0.338810158  0.123073002   0.088989937  0.19961754
## NSM            0.374363865  0.281624927  0.152764480   0.005780070  0.25220072
## lights         0.338810158  1.000000000 -0.022275708   0.141555746  0.01081935
## temp_living    0.199617544  0.010819353  0.811870731   0.295341263  1.00000000
## temp_outside   0.193987993 -0.071498046  0.615034878   0.334121883  0.76418114
## temp_station   0.165003827 -0.069869396  0.646893516   0.370359661  0.75416594
## temp_laundry   0.130799441 -0.115428404  0.872943873   0.241785998  0.68517067
## Windspeed      0.129427459  0.068827632 -0.085919085   0.234827675  0.09091366
## temp_kitchen   0.123073002 -0.022275708  1.000000000   0.159542863  0.81187073
## temp_teen      0.117362573 -0.073567046  0.802327943  -0.068019785  0.52628677
## humid_kitchen  0.088989937  0.141555746  0.159542863   1.000000000  0.29534126
## temp_office    0.086184976 -0.004522683  0.846695564   0.100440692  0.71430258
## temp_iron      0.063357082 -0.152406913  0.805736368  -0.010583495  0.60164107
## temp_bath      0.058984335 -0.089771232  0.861934335   0.203059793  0.67229058
## temp_parents   0.037476740 -0.179815832  0.814566232   0.091586565  0.61722444
## Tdewpoint      0.023441319 -0.024895891  0.575618569   0.628441616  0.57627498
## humid_bath     0.008268389  0.183916803 -0.037004463   0.319135213  0.02128543
## humid_laundry -0.008041878  0.158408793 -0.036570275   0.849218214  0.14054798
## humid_office  -0.011908013  0.145181904  0.100732696   0.887477521  0.26075115
## Visibility    -0.014150338  0.025474189 -0.090203399  -0.014962428 -0.08356921
## random_1      -0.030979918  0.001358579 -0.011687987  -0.007658236 -0.01543808
## random_2      -0.030979918  0.001358579 -0.011687987  -0.007658236 -0.01543808
## Press_mm_hg   -0.070264450 -0.033704651 -0.084687829  -0.306246430 -0.07895460
## humid_living  -0.111080673  0.059884314  0.054498847   0.805236090 -0.11002630
## humid_iron    -0.123515486  0.052737544  0.124889551   0.801498695  0.23195726
## humid_parents -0.131077364 -0.001783853  0.059069414   0.766849573  0.15653425
## humid_outside -0.154685092  0.173000537 -0.564183148   0.280753296 -0.51965963
## humid_teen    -0.202931354  0.020576423 -0.006678625   0.730217552  0.07670443
## humid_station -0.249727435  0.077051026 -0.251152661   0.271540437 -0.42786381
##               humid_living temp_laundry humid_laundry   temp_office
## Appliances    -0.111080673  0.130799441  -0.008041878  0.0861849759
## NSM           -0.199860925 -0.002329452  -0.059030515  0.0370261361
## lights         0.059884314 -0.115428404   0.158408793 -0.0045226826
## temp_living   -0.110026303  0.685170670   0.140547980  0.7143025752
## temp_outside   0.060089208  0.655965692   0.066213077  0.6188555135
## temp_station   0.117203671  0.670828907   0.118945279  0.6312303922
## temp_laundry   0.198367483  1.000000000  -0.037820928  0.8302983471
## Windspeed      0.066138523 -0.106071393   0.278863751 -0.1941049195
## temp_kitchen   0.054498847  0.872943873  -0.036570275  0.8466955643
## temp_teen     -0.020937617  0.774206911  -0.323904208  0.7682776258
## humid_kitchen  0.805236090  0.241785998   0.849218214  0.1004406922
## temp_office    0.010921592  0.830298347  -0.161122899  1.0000000000
## temp_iron     -0.014986664  0.824478982  -0.287228779  0.8588000143
## temp_bath      0.175725944  0.872596692  -0.065368458  0.8520224985
## temp_parents   0.102964976  0.888444971  -0.232070118  0.8747095018
## Tdewpoint      0.533363119  0.643640050   0.390777473  0.5209571473
## humid_bath     0.257108744 -0.102746764   0.395794977 -0.1093678753
## humid_laundry  0.671350802 -0.037820928   1.000000000 -0.1611228989
## humid_office   0.719741199  0.105189118   0.900927825 -0.0638188840
## Visibility    -0.003266467 -0.119225127   0.019899570 -0.1273198653
## random_1       0.002999757 -0.011675317   0.001197525  0.0012824619
## random_2       0.002999757 -0.011675317   0.001197525  0.0012824619
## Press_mm_hg   -0.280621916 -0.132264831  -0.238436319  0.0002931769
## humid_living   1.000000000  0.198367483   0.671350802  0.0109215921
## humid_iron     0.699409857  0.148108438   0.834355461  0.0288630551
## humid_parents  0.683285373  0.113647662   0.834583363 -0.0435234059
## humid_outside  0.374868862 -0.613625239   0.553741740 -0.6658859015
## humid_teen     0.674230049  0.030715427   0.826414870 -0.0979481533
## humid_station  0.555453812 -0.199760031   0.360528031 -0.3036779664
##               humid_office    temp_bath   humid_bath temp_outside humid_outside
## Appliances    -0.011908013  0.058984335  0.008268389   0.19398799   -0.15468509
## NSM           -0.023686704  0.031563779  0.110060913   0.19750114   -0.17578499
## lights         0.145181904 -0.089771232  0.183916803  -0.07149805    0.17300054
## temp_living    0.260751147  0.672290584  0.021285432   0.76418114   -0.51965963
## temp_outside   0.263526893  0.588894475 -0.094249271   1.00000000   -0.64964087
## temp_station   0.307352680  0.613395779 -0.061475539   0.97167655   -0.61258926
## temp_laundry   0.105189118  0.872596692 -0.102746764   0.65596569   -0.61362524
## Windspeed      0.315195558 -0.152672716  0.074749351   0.21747107    0.08054052
## temp_kitchen   0.100732696  0.861934335 -0.037004463   0.61503488   -0.56418315
## temp_teen     -0.200196458  0.801919966 -0.118687866   0.44706048   -0.64070977
## humid_kitchen  0.887477521  0.203059793  0.319135213   0.33412188    0.28075330
## temp_office   -0.063818884  0.852022498 -0.109367875   0.61885551   -0.66588590
## temp_iron     -0.166370183  0.845912962 -0.178980522   0.57645541   -0.72778213
## temp_bath      0.084238750  1.000000000  0.001797893   0.58889447   -0.58942228
## temp_parents  -0.074041595  0.897183397 -0.179263106   0.63588057   -0.71296098
## Tdewpoint      0.599085050  0.584910825  0.067833298   0.77080866   -0.23185420
## humid_bath     0.370271152  0.001797893  1.000000000  -0.09424927    0.30882879
## humid_laundry  0.900927825 -0.065368458  0.395794977   0.06621308    0.55374174
## humid_office   1.000000000  0.084238750  0.370271152   0.26352689    0.42563882
## Visibility     0.005132886 -0.098715244 -0.022269130  -0.09238018    0.12143442
## random_1      -0.003551205 -0.015731681 -0.026978684  -0.03122955    0.02894372
## random_2      -0.003551205 -0.015731681 -0.026978684  -0.03122955    0.02894372
## Press_mm_hg   -0.259407785 -0.118223183 -0.117107432  -0.11579712   -0.11632078
## humid_living   0.719741199  0.175725944  0.257108744   0.06008921    0.37486886
## humid_iron     0.889132160  0.132253430  0.343063528   0.23975900    0.40137249
## humid_parents  0.853357085  0.054633126  0.287252513   0.17112616    0.42903069
## humid_outside  0.425638821 -0.589422283  0.308828788  -0.64964087    1.00000000
## humid_teen     0.835057397  0.013045696  0.372867770   0.05807320    0.52443133
## humid_station  0.342855036 -0.182994600  0.201164997  -0.52149720    0.70230617
##                  temp_iron  humid_iron    temp_teen   humid_teen temp_parents
## Appliances     0.063357082 -0.12351549  0.117362573 -0.202931354  0.037476740
## NSM            0.024079108 -0.17298403  0.091851925 -0.311957345 -0.043350089
## lights        -0.152406913  0.05273754 -0.073567046  0.020576423 -0.179815832
## temp_living    0.601641073  0.23195726  0.526286769  0.076704433  0.617224439
## temp_outside   0.576455407  0.23975900  0.447060481  0.058073202  0.635880569
## temp_station   0.589922644  0.28971852  0.468479587  0.114208192  0.636476717
## temp_laundry   0.824478982  0.14810844  0.774206911  0.030715427  0.888444971
## Windspeed     -0.187360047  0.22698814 -0.228599402  0.198473045 -0.180154695
## temp_kitchen   0.805736368  0.12488955  0.802327943 -0.006678625  0.814566232
## temp_teen      0.868592735 -0.16844947  1.000000000 -0.240286241  0.853508681
## humid_kitchen -0.010583495  0.80149870 -0.068019785  0.730217552  0.091586565
## temp_office    0.858800014  0.02886306  0.768277626 -0.097948153  0.874709502
## temp_iron      1.000000000 -0.07417770  0.868592735 -0.242928697  0.934703693
## temp_bath      0.845912962  0.13225343  0.801919966  0.013045696  0.897183397
## temp_parents   0.934703693 -0.00915467  0.853508681 -0.137904563  1.000000000
## Tdewpoint      0.443356745  0.61853606  0.375033488  0.479380802  0.571007814
## humid_bath    -0.178980522  0.34306353 -0.118687866  0.372867770 -0.179263106
## humid_laundry -0.287228779  0.83435546 -0.323904208  0.826414870 -0.232070118
## humid_office  -0.166370183  0.88913216 -0.200196458  0.835057397 -0.074041595
## Visibility    -0.128428837 -0.01239976 -0.076267131  0.049486694 -0.119556156
## random_1      -0.007373707  0.01578497 -0.006246445  0.020173426 -0.001709659
## random_2      -0.007373707  0.01578497 -0.006246445  0.020173426 -0.001709659
## Press_mm_hg   -0.036652769 -0.26719383 -0.107450351 -0.234625817 -0.100418861
## humid_living  -0.014986664  0.69940986 -0.020937617  0.674230049  0.102964976
## humid_iron    -0.074177704  1.00000000 -0.168449473  0.879577498 -0.009154670
## humid_parents -0.113428104  0.85068433 -0.197048587  0.849937828 -0.040893114
## humid_outside -0.727782134  0.40137249 -0.640709767  0.524431335 -0.712960984
## humid_teen    -0.242928697  0.87957750 -0.240286241  1.000000000 -0.137904563
## humid_station -0.351235460  0.40655442 -0.243785404  0.507496100 -0.246834639
##               humid_parents temp_station   Press_mm_hg humid_station
## Appliances     -0.131077364   0.16500383 -0.0702644501   -0.24972744
## NSM            -0.278727068   0.21026313  0.0037037150   -0.35247376
## lights         -0.001783853  -0.06986940 -0.0337046512    0.07705103
## temp_living     0.156534246   0.75416594 -0.0789546037   -0.42786381
## temp_outside    0.171126156   0.97167655 -0.1157971151   -0.52149720
## temp_station    0.222743195   1.00000000 -0.1149562281   -0.51791385
## temp_laundry    0.113647662   0.67082891 -0.1322648314   -0.19976003
## Windspeed       0.258639612   0.23913140 -0.2307022596   -0.21405190
## temp_kitchen    0.059069414   0.64689352 -0.0846878289   -0.25115266
## temp_teen      -0.197048587   0.46847959 -0.1074503508   -0.24378540
## humid_kitchen   0.766849573   0.37035966 -0.3062464299    0.27154044
## temp_office    -0.043523406   0.63123039  0.0002931769   -0.30367797
## temp_iron      -0.113428104   0.58992264 -0.0366527690   -0.35123546
## temp_bath       0.054633126   0.61339578 -0.1182231833   -0.18299460
## temp_parents   -0.040893114   0.63647672 -0.1004188615   -0.24683464
## Tdewpoint       0.517260212   0.80694036 -0.2328641832    0.08139534
## humid_bath      0.287252513  -0.06147554 -0.1171074319    0.20116500
## humid_laundry   0.834583363   0.11894528 -0.2384363187    0.36052803
## humid_office    0.853357085   0.30735268 -0.2594077848    0.34285504
## Visibility      0.013035001  -0.08726717  0.0500554251    0.09622750
## random_1        0.001197930  -0.03030349 -0.0063237489    0.04994755
## random_2        0.001197930  -0.03030349 -0.0063237489    0.04994755
## Press_mm_hg    -0.178297011  -0.11495623  1.0000000000   -0.14177882
## humid_living    0.683285373   0.11720367 -0.2806219157    0.55545381
## humid_iron      0.850684330   0.28971852 -0.2671938309    0.40655442
## humid_parents   1.000000000   0.22274320 -0.1782970106    0.37400570
## humid_outside   0.429030692  -0.61258926 -0.1163207770    0.70230617
## humid_teen      0.849937828   0.11420819 -0.2346258173    0.50749610
## humid_station   0.374005695  -0.51791385 -0.1417788164    1.00000000
##                  Windspeed   Visibility    Tdewpoint     random_1     random_2
## Appliances     0.129427459 -0.014150338  0.023441319 -0.030979918 -0.030979918
## NSM            0.111275681 -0.018052889  0.010466014 -0.043261213 -0.043261213
## lights         0.068827632  0.025474189 -0.024895891  0.001358579  0.001358579
## temp_living    0.090913659 -0.083569208  0.576274982 -0.015438083 -0.015438083
## temp_outside   0.217471069 -0.092380178  0.770808662 -0.031229545 -0.031229545
## temp_station   0.239131401 -0.087267169  0.806940356 -0.030303485 -0.030303485
## temp_laundry  -0.106071393 -0.119225127  0.643640050 -0.011675317 -0.011675317
## Windspeed      1.000000000 -0.005644861  0.147252915 -0.021270171 -0.021270171
## temp_kitchen  -0.085919085 -0.090203399  0.575618569 -0.011687987 -0.011687987
## temp_teen     -0.228599402 -0.076267131  0.375033488 -0.006246445 -0.006246445
## humid_kitchen  0.234827675 -0.014962428  0.628441616 -0.007658236 -0.007658236
## temp_office   -0.194104920 -0.127319865  0.520957147  0.001282462  0.001282462
## temp_iron     -0.187360047 -0.128428837  0.443356745 -0.007373707 -0.007373707
## temp_bath     -0.152672716 -0.098715244  0.584910825 -0.015731681 -0.015731681
## temp_parents  -0.180154695 -0.119556156  0.571007814 -0.001709659 -0.001709659
## Tdewpoint      0.147252915 -0.042153669  1.000000000 -0.001942969 -0.001942969
## humid_bath     0.074749351 -0.022269130  0.067833298 -0.026978684 -0.026978684
## humid_laundry  0.278863751  0.019899570  0.390777473  0.001197525  0.001197525
## humid_office   0.315195558  0.005132886  0.599085050 -0.003551205 -0.003551205
## Visibility    -0.005644861  1.000000000 -0.042153669 -0.015224406 -0.015224406
## random_1      -0.021270171 -0.015224406 -0.001942969  1.000000000  1.000000000
## random_2      -0.021270171 -0.015224406 -0.001942969  1.000000000  1.000000000
## Press_mm_hg   -0.230702260  0.050055425 -0.232864183 -0.006323749 -0.006323749
## humid_living   0.066138523 -0.003266467  0.533363119  0.002999757  0.002999757
## humid_iron     0.226988142 -0.012399765  0.618536065  0.015784974  0.015784974
## humid_parents  0.258639612  0.013035001  0.517260212  0.001197930  0.001197930
## humid_outside  0.080540519  0.121434424 -0.231854203  0.028943717  0.028943717
## humid_teen     0.198473045  0.049486694  0.479380802  0.020173426  0.020173426
## humid_station -0.214051903  0.096227502  0.081395343  0.049947551  0.049947551
##                        NSM
## Appliances     0.374363865
## NSM            1.000000000
## lights         0.281624927
## temp_living    0.252200715
## temp_outside   0.197501139
## temp_station   0.210263131
## temp_laundry  -0.002329452
## Windspeed      0.111275681
## temp_kitchen   0.152764480
## temp_teen      0.091851925
## humid_kitchen  0.005780070
## temp_office    0.037026136
## temp_iron      0.024079108
## temp_bath      0.031563779
## temp_parents  -0.043350089
## Tdewpoint      0.010466014
## humid_bath     0.110060913
## humid_laundry -0.059030515
## humid_office  -0.023686704
## Visibility    -0.018052889
## random_1      -0.043261213
## random_2      -0.043261213
## Press_mm_hg    0.003703715
## humid_living  -0.199860925
## humid_iron    -0.172984034
## humid_parents -0.278727068
## humid_outside -0.175784992
## humid_teen    -0.311957345
## humid_station -0.352473756

data_scale <-  data_scale %>% select(-c(random_1, random_2, WeekStatus, Day_of_week))

The stronger the correlation, or how near 1 or -1 it is, the more closely related the predictors are. The correlation matrix graphic above shows the correlatiion on each variables. In our dataset, humid_station and Appliances have the highest negative correlations (-0.2) also NSM and Appliances have the highest positive correlations (0.6)

NSM, lights, and temp_living have the most significant positive Appliances relationships. This indicates that the variables positively and substantially contribute to Appliances. On the other hand, the most vital negative link is found with water most negative correlation on the other hand.

Handling Outliers

Find outlier value

# Check the outlier and remove after scaling using zscore threshold point = 3
data_clean <- data_scale %>% 
  mutate( zscore = (Appliances - mean(Appliances)) / sd(Appliances)) %>%
  filter(zscore <=3) %>%
  select(-zscore)

boxplot(data_scale$Appliances)

Remove the outlier after scaling

boxplot(data_clean$Appliances)

Modeling with one predictor

model_scale <- lm(formula = Appliances ~ temp_living, data = data_scale)
model_clean <- lm(formula = Appliances ~ temp_living, data = data_clean)

Plot the difference between two data

plot(formula = Appliances ~ temp_living, data = data_scale)
abline(model_scale, col = "red")
abline(model_clean, col = "green")

High Leverage, Low Influence: Because the graph shows that the outlier of the Appliances variable is at High Leverage, Low influence, then we analyze from R-Squared.

R-squared

summary(model_scale)$r.squared

## [1] 0.03984716

summary(model_clean)$r.squared

## [1] 0.04166081

Since the original data data has the same r-square as the scaled one, we decided to not using data_scale and using data_clean beacause it has better rsqured

Model Fitting and Evaluation

Data Splitting

We now split the data into train and validation sets. The training set is used to train the model, which is checked against the validation set.

data_train <- data_clean[index,] data_validation <- data_clean[-index,]

library(rsample)
RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(123)

index <- sample(nrow(data_clean), nrow(data_clean)*0.8)

data_train <- data_clean[index,]
data_validation <- data_clean[-index,]

Check the Data Split

set.seed(123)
control <- trainControl(method = "cv", number = 10)

se_model <- train(Appliances ~ ., data = data_train, method = "lm", trControl = control)

se_model

## Linear Regression 
## 
## 2243 samples
##   26 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2019, 2020, 2018, 2018, 2018, 2018, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.4791869  0.3575817  0.3597521
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Model Fitting

Model with No Predictor

#  model tanpa prediktor
model_none <- lm(formula = Appliances ~ 1, data = data_train)

Model with All Predictors

#  model dengan semua prediktor
model_all <- lm(Appliances ~ ., data_train)

Variable Selection : Step-Wise Regression Model

We’ve built model.none that uses no predictor and model.all that uses all variables. Stepwise regression is a method to pick out the optimal model using the Akaika Information Criterion (AIC) as is metrics. The method optimizes the model for the least AIC, meaning the least information loss. Let’s try to pick the important variables using stepwise regression. It uses a greedy algorithm to find a local minima. Therefore, it does not guarantee the best model.

1. Backward

#  stepwise regression: backward elimination
model_backward <- step(object = model_all,
                       direction = "backward",
                       trace = FALSE) #  agar proses step-wise tidak ditampilkan

2. Forward

model_forward <- step(
  object = model_none, #  batas bawah
  direction = "forward",
  scope = list(upper = model_all), #  batas atas
  trace = FALSE) #  tidak menampilkan proses step-wise

3. Both

model_both <- step(
  object = model_none, #  bawah batas
  direction = "both",
  scope = list(upper = model_all), #  batas atas
  trace = FALSE
)

Model Evaluation

We developed a model_none that doesn’t employ a model or predictor. All variables are used. The Akaike Information Criterion (AIC) and metrics are used stepwise regression to determine the best model. To minimize information loss, the technique optimizes the model for the lowest AIC. Let’s use stepwise regression to identify the crucial factors. To locate a local minimum, it employs a greedy method. As a result, it cannot assure the best model.

comparison <- compare_performance(model_none, model_all, model_backward, model_forward, model_both)
as.data.frame(comparison)

##             Name Model      AIC        AIC_wt      BIC        BIC_wt        R2
## 1     model_none    lm 4058.833 2.016258e-217 4070.264 3.686885e-197 0.0000000
## 2      model_all    lm 3075.912  5.534254e-04 3235.948  5.445635e-16 0.3695997
## 3 model_backward    lm 3062.871  3.757549e-01 3177.183  3.139463e-03 0.3687675
## 4  model_forward    lm 3063.729  2.446898e-01 3172.325  3.561941e-02 0.3679627
## 5     model_both    lm 3062.854  3.790018e-01 3165.734  9.612411e-01 0.3676457
##   R2_adjusted      RMSE     Sigma
## 1   0.0000000 0.5974693 0.5976025
## 2   0.3622033 0.4743772 0.4772584
## 3   0.3636586 0.4746902 0.4767135
## 4   0.3631337 0.4749927 0.4769101
## 5   0.3631005 0.4751118 0.4769226

Evaluation Function

eval_recap <- function(truth, estimate){
  
  df_new <- data.frame(truth = truth,
                       estimate = estimate)
  
  data.frame(RMSE = RMSE(estimate, truth),
             MAE = MAE(estimate, truth),
             "R-Square" = R2_Score(estimate, truth),
             MAPE = MAPE(estimate, truth),
             check.names = F
             ) %>% 
    mutate(MSE = sqrt(RMSE))
}

Model None - Evaluation

# Model None - Evaluation
pred_none_val <- predict(model_none, data_validation)

eval_recap(truth = data_validation$Appliances,
           estimate = pred_none_val)

##        RMSE       MAE     R-Square      MAPE       MSE
## 1 0.5884615 0.4832447 -0.001756201 0.1096837 0.7671124

Model All - Evaluation

pred_all_val <- predict(object = model_all, newdata = data_validation)

eval_recap(truth = data_validation$Appliances,
           estimate = pred_all_val)

##        RMSE       MAE  R-Square       MAPE       MSE
## 1 0.4399668 0.3276412 0.4400287 0.07398865 0.6632999

Model Step-Wise Backward - Evaluation

pred_backward_val <- predict(object = model_backward, newdata = data_validation)

eval_recap(truth = data_validation$Appliances,
           estimate = pred_backward_val)

##        RMSE       MAE  R-Square       MAPE       MSE
## 1 0.4404716 0.3266219 0.4387429 0.07373417 0.6636804

Model Step-Wise Both - Evaluation

pred_forward_val <- predict(object = model_forward, newdata = data_validation)

eval_recap(truth = data_validation$Appliances,
           estimate = pred_forward_val)

##        RMSE       MAE R-Square       MAPE       MSE
## 1 0.4416871 0.3270429 0.435641 0.07383755 0.6645955

Model Step-Wise Both - Evaluation

pred_both_val <- predict(object = model_both, newdata = data_validation)

eval_recap(truth = data_validation$Appliances,
           estimate = pred_both_val)

##        RMSE      MAE R-Square       MAPE       MSE
## 1 0.4413853 0.326661 0.436412 0.07370904 0.6643684

As shown above, model_all has the best evaluation score. Now, we’re check the linearity assumption

Checking Assumptions

Linear models are made with 4 assumptions. Before we carry on, we have to check whether these assumptions hold for our model.

Assumption of linearity

The assumption of linearity assumes that there exists a linear relationship between the predictors and the targe variable, so that our model can correctly describe it. A visual way to evaluate this is to plot the value of residues between our plot and the model.

Visualization of residual histogram using hist() . function

#  histogram residual
ggplot(data = as.data.frame(model_all$residuals), aes(x = model_all$residuals)) +
  geom_histogram(fill = "#CC0000", color = "orange", bins = 30) +
  labs( title = "Regression Residual Distribution", subtitle = "Log Transformation", x = "residual")

Statistics Test with `shapiro.test()``

Shapiro-Wilk hypothesis test:

H0: Residuals are normal distributed
H1: Residuals are not normally distributed (heteroscedastic)

#  shapiro test dari residual
shapiro.test(model_all$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_all$residuals
## W = 0.93506, p-value < 2.2e-16

check_normality(model_all)

## Warning: Non-normality of residuals detected (p < .001).

Based on the result, the residuals are not normally distributed.

VIF : Independence of Variable

Multicollinearity is a condition with a strong correlation between predictors. This is undesirable because it indicates a redundant predictor in the model, which should be able to choose only one of the variables with a solid relationship. It is hoped that multicollinearity will not occur

Test the VIF (Variance Inflation Factor) with the vif() function from the car package: * VIF value > 10: multicollinearity occurs in the model * VIF value < 10: there is no multicollinearity in the model

vif(model_all)

##        lights  temp_kitchen humid_kitchen   temp_living  humid_living 
##      1.442851     18.213582     20.795985     26.760992     23.336718 
##  temp_laundry humid_laundry   temp_office  humid_office     temp_bath 
##      9.270430     11.910682      9.043777     17.693180      8.991944 
##    humid_bath  temp_outside humid_outside     temp_iron    humid_iron 
##      1.430813     30.144979      9.607118     15.281805     10.546148 
##     temp_teen    humid_teen  temp_parents humid_parents  temp_station 
##      7.480010      9.574910     25.975439      6.748515    233.710780 
##   Press_mm_hg humid_station     Windspeed    Visibility     Tdewpoint 
##      1.422226     74.571765      1.665932      1.063911    156.170709 
##           NSM 
##      2.185402

The test result means there is has multicollinearity in the model

Homoscedasticity

Homoscedasticity assumption states that the error term in the relationship between the predictor and target variables is constant across all values of inputs. This assumption can be checked using the Breusch-Pagan test with hypotheses :

H0: Value of error is the same across all inputs (homoscedastic)
H1: Value of error is not the same across all range of inputs (heteroscedastic)

plot(x = model_all$fitted.values, y = model_all$residuals)
abline(h = 0, col = "#FF0000", ylab = 'Residuals', xlab = 'Prediction')

We can test the homoscedasticity of the model using the Breusch-Pagan test.

bptest(model_all)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_all
## BP = 183.42, df = 26, p-value < 2.2e-16

Based on the result, the error are not same across all range of inputs.

Even though our linear model fails the tests, we can still try to conclude it. Our model’s mean average percentage error is a decent 0.074.

coef_all <- model_all$coefficients[-1]
barplot(coef_all, xlab = names(coef_all), main = 'Influence of `Model_all` Predictor',  ylab = 'Value')

Model Interpretation and Improvement Ideas

We shouldn’t transform the data_train because we already did it before in the beginning such as scaling, tranforming several variable into log, or removing any outliers and we are not tranforming the targeted variabel into a scaled version, because we wont scaled back the Test Result in the end of our research.

One-Way ANOVA

anova_train <- aov(formula = Appliances ~ ., data = data_train)
summary(anova_train)

##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## lights           1   86.8   86.81 381.099  < 2e-16 ***
## temp_kitchen     1   12.9   12.89  56.576 7.81e-14 ***
## humid_kitchen    1    0.5    0.48   2.121  0.14545    
## temp_living      1   18.3   18.35  80.556  < 2e-16 ***
## humid_living     1   59.3   59.29 260.286  < 2e-16 ***
## temp_laundry     1   25.7   25.68 112.723  < 2e-16 ***
## humid_laundry    1   10.0    9.97  43.760 4.64e-11 ***
## temp_office      1   12.1   12.07  52.976 4.67e-13 ***
## humid_office     1    9.7    9.71  42.616 8.23e-11 ***
## temp_bath        1    0.1    0.15   0.641  0.42348    
## humid_bath       1    0.0    0.01   0.052  0.81917    
## temp_outside     1    8.0    8.03  35.234 3.38e-09 ***
## humid_outside    1    0.1    0.13   0.558  0.45510    
## temp_iron        1    0.0    0.04   0.174  0.67628    
## humid_iron       1   11.2   11.16  48.991 3.39e-12 ***
## temp_teen        1    1.8    1.79   7.864  0.00509 ** 
## humid_teen       1   23.9   23.91 104.973  < 2e-16 ***
## temp_parents     1    4.6    4.58  20.105 7.70e-06 ***
## humid_parents    1    0.0    0.05   0.211  0.64590    
## temp_station     1    3.5    3.55  15.584 8.14e-05 ***
## Press_mm_hg      1    0.3    0.27   1.199  0.27361    
## humid_station    1    0.3    0.33   1.455  0.22793    
## Windspeed        1    2.1    2.07   9.079  0.00262 ** 
## Visibility       1    0.1    0.12   0.510  0.47528    
## Tdewpoint        1    0.0    0.00   0.002  0.96260    
## NSM              1    4.5    4.53  19.882 8.65e-06 ***
## Residuals     2216  504.8    0.23                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random Foresttion

Create random forest model as model_rf

set.seed(123)
model_rf <- randomForest(x = data_train %>% select(-Appliances),
                         y = data_train$Appliances, 
                         ntree = 500)

model_rf

## 
## Call:
##  randomForest(x = data_train %>% select(-Appliances), y = data_train$Appliances,      ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##           Mean of squared residuals: 0.1511909
##                     % Var explained: 57.65

Check the summary and Predictor contribution on Targeted Variable

model_rf$finalModel

## NULL

varImp(model_rf)

##                 Overall
## lights         60.63114
## temp_kitchen   18.86222
## humid_kitchen  28.82037
## temp_living    25.80899
## humid_living   20.28900
## temp_laundry   29.35548
## humid_laundry  26.20662
## temp_office    21.15855
## humid_office   17.64511
## temp_bath      26.08773
## humid_bath     18.95060
## temp_outside   17.53418
## humid_outside  21.22572
## temp_iron      18.41549
## humid_iron     19.76505
## temp_teen      37.28446
## humid_teen     28.01687
## temp_parents   27.91468
## humid_parents  21.25766
## temp_station   14.94264
## Press_mm_hg    24.53724
## humid_station  30.06757
## Windspeed      16.03340
## Visibility     11.55638
## Tdewpoint      16.32909
## NSM           182.77964

Model Random Forest - Evaluation

pred_rf_val <- predict(object = model_rf, newdata = data_validation)


eval_recap(truth = data_validation$Appliances,
           estimate = pred_rf_val)

##        RMSE      MAE  R-Square       MAPE       MSE
## 1 0.3763291 0.271212 0.5903041 0.06049703 0.6134567

Random Forest Variable Importance on Targeted Variabel

library("tibble")
model_rf$importance %>% 
  as.data.frame() %>% 
  arrange(-IncNodePurity) %>% 
  rownames_to_column("variable") %>% 
  head(10) %>% 
  ggplot(aes(IncNodePurity, 
             reorder(variable, IncNodePurity))
         ) +
  geom_col(fill = "firebrick") +
  labs(x = "Importance",
       y = NULL,
       title = "Random Forest Variable Importance")

The plot above showing how big the influence of each predictor, top 3 predictor who correlate with Appliances is NSM, lights and temp_teen

Lime Interpretation

library(lime)

set.seed(123)
explainer <- lime(x = data_validation %>% select(-Appliances),
                  model = model_rf)

model_type.randomForest <- function(x){
  return("regression") # for regression problem
}

predict_model.randomForest <- function(x, newdata, type = "response") {

    # return prediction value
    predict(x, newdata) %>% as.data.frame()
    
}

#  Select only the first 4 observations
selected_data <- data_validation %>% 
  select(-c(Appliances)) %>% 
  slice(1:4)

#  Explain the model
set.seed(123)
explanation <- explain(x = selected_data, 
                       explainer = explainer,
                       n_features = 27 #  Number of features to explain the model
                       )

Since we’re using scaled data from the beginning, so to visualize model_rf, we’re still using scaled data.

Random Forest Visualization dan Interpretation

plot_features(explanation = explanation)

Explanation Fit indicate how good LIME explain the model, kind of like the \(R^2\) (R-Squared) value of linear regression. Here we see the Explanation Fit only has values around 0.15-0.61 (15%-61%), which can be interpreted that LIME can only explain a little about our model. Twi of the cases reached the standard which >= 50% (0.5),Case 1 and 2 has explanation fit under 0.50. We also can summarise that Case 3 has the biggest Explanation, but Case 2 has the biggest Prediction.

Support Vector Machine

library(e1071)
model_svm <- svm(Appliances ~ ., data = data_train)
pred_svm_val <- predict(object = model_svm, newdata = data_validation)


eval_recap(truth = data_validation$Appliances,
           estimate = pred_svm_val)

##        RMSE       MAE  R-Square       MAPE       MSE
## 1 0.4085633 0.2748591 0.5171139 0.05969319 0.6391896

The SVR model has higher performance compared to any model that we made before. However, we will still use both model for further analysis both as comparison and as examples.

Lime Interpretation

# create the explanation for the SVR model.
set.seed(123)
explainer_svm <- lime(x = data_train %>% select(-Appliances), 
                  model = model_svm)

# Create SVR model specification for lime.
model_type.svm <- function(x){
  return("regression") # for regression problem
}

predict_model.svm <- function(x, newdata, type = "response") {

    # return prediction value
    predict(x, newdata) %>% as.data.frame()
    
}

Random Forest Visualization dan Interpretation

set.seed(123)
explanation_svm <- explain(x = selected_data, 
                       explainer = explainer_svm,
                       kernel_width = 1,
                       feature_select = "auto", # Method of feature selection for lime
                       n_features = 10 # Number of features to explain the model
                       )

plot_features(explanation_svm)

Explanation Fit indicate how good LIME explain the model, kind of like the \(R^2\) (R-Squared) value of linear regression. Here we see the Explanation Fit only has values around 0.3-0.4 (30-40%), which can be interpreted that LIME can only explain a little about our model. None of the cases reached the standard which >= 50% (0.5). From all of the case, We also can summarise that Case 1 has the biggest Explanation and Prediction.

Conclusion

In this research project, we have examined various concrete formulations with different Appliancess. We developed a model that aligns to the available information. Utilizing model as a framework, we developed a fresh formulation and, being used to predicted the Appliances.

Throughout this project, we have employed a Random Foreat Model. Compared to a standard regression, the model better describes the data. As we have discovered, despite being more complicated, it is a model which could be understood. The prediction model implementing “model_rf” obtained MAE values of 6.0 and R Square of 59% on validation dataset.

Appliances Energy Analysis

Introduction

Business Problem

Data Processing

Load Libraries

Data Transformation

Appliances as Log(Appliances)

Data Scaling

Exploratoty Data Analysis

Correlation

Handling Outliers

Modeling with one predictor

Plot the difference between two data

R-squared

Model Fitting and Evaluation

Data Splitting

Model Fitting

Model with No Predictor

Model with All Predictors

Variable Selection : Step-Wise Regression Model

1. Backward

2. Forward

3. Both

Model Evaluation

Evaluation Function

Model None - Evaluation

Model All - Evaluation

Model Step-Wise Backward - Evaluation

Model Step-Wise Both - Evaluation

Model Step-Wise Both - Evaluation

Checking Assumptions

Assumption of linearity

Statistics Test with `shapiro.test()``

VIF : Independence of Variable

Homoscedasticity

Model Interpretation and Improvement Ideas

One-Way ANOVA

Random Foresttion

Random Forest Variable Importance on Targeted Variabel

Lime Interpretation

Support Vector Machine

Lime Interpretation

Conclusion

`Appliances` as Log(`Appliances`)