Introduction
The purpose of this research is to forecast the electricity consumption of a particular household in Belgium based on the temperature and humidity levels of various rooms in the facility and surrounding weather information over 4.5 months. The data set runs 4.5 months at 10 minutes. A ZigBee wireless sensor network is being used to monitor the home’s temperature and humidity levels. Around 3.3 minutes, each wireless node sent the temperature and humidity data. The wireless data was then averaged across intervals of 10 minutes. Every 10 minutes, m-bus energy meters collected the energy data. The experimental data sets were combined with the weather data from the closest airport weather station (Chievres Airport, Belgium), which was extracted from a public data set from Reliable Prognosis (rp5.ru). The data set has two random variables to test the regression models and exclude non-predictive characteristics (parameters).
Dataset Link : http://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction , https://github.com/LuisM78/Appliances-energy-prediction-data
Repositor : https://www.sciencedirect.com/science/article/abs/pii/S0378778816308970?via%3Dihub
Business Problem
The increasing trend in energy consumption is becoming cause of concern for the entire world, as the energy consumption is increasing year after year so is the carbon and greenhouse gas emission, the majority portion of the electricity generated is consumed by industrial sector but a considerable amount is also consumed by residential sector.
It is important to study the energy consuming behaviour in the residential sector and predict the energy consumption by home appliances as it consume maximum amount of energy in the residence. This project focuses on predicting the energy consumption of home appliances based on humidity and temperature.
This project aims to predict the energy consumption of home appliances. With the advent of smart homes and the rising need for energy management, existing smart home systems can benefit from accurate prediction. If the energy usage can be predicted for every possible state of appliances, then device control can be optimized for energy savings as well. This is a case of Regression analysis which is part of the Supervised Learning problem. Appliance energy usage is the target variable while sensor data and weather data are the features.
Data Processing
Load Libraries
library(dplyr)
library(caret)
library(tidyr)
library(randomForest)
library(ggplot2)
library(lime)
library(GGally)
library(performance)
library(MLmetrics)
library(lmtest)
library(car)
library(lubridate)
library(psych)
library(plotly)
Read Data
# read data
training <- read.csv("data/training.csv")
testing <- read.csv("data/testing.csv")
data <- bind_rows(training, testing)
The observation data consists of the following variables:
datetime year-month-day hour : minute:secondAppliances: energy use in Wh [TARGETED]lights: energy use of light fixtures in the house in WhT1: Temperature in kitchen area, in CelsiusRH_1: Humidity in kitchen area, in %T2: Temperature in living room area, in CelsiusRH_2:Humidity in living room area, in %T3:Temperature in laundry room areaRH_3:Humidity in laundry room area, in %T4:Temperature in office room, in CelsiusRH_4:Humidity in office room, in %T5:Temperature in bathroom, in CelsiusRH_5:Humidity in bathroom, in %T6:Temperature outside the building (north side), in CelsiusRH_6:Humidity outside the building (north side), in %T7:Temperature in ironing room , in CelsiusRH_7:Humidity in ironing room, in %T8:Temperature in teenager room 2, in CelsiusRH_8:Humidity in teenager room 2, in %T9:Temperature in parents room, in CelsiusRH_9:Humidity in parents room, in %T_out:Temperature outside (from Chièvres weather station), in CelsiusPress_mm_hg: (from Chièvres weather station), in mm HgRH_out: Humidity outside (from Chièvres weather station), in %Windspeed: (from Chièvres weather station), in m/sVisibility: (from Chièvres weather station), in kmTdewpoint: (from Chièvres weather station), °Crv1: Random variable 1, nondimensionalrv2: Rnadom variable 2, nondimensionalDay_of_week: Name of Day, orderedWeekStatus: Day status ofDay_of_week
Number of instances: 19,735 Number of attributes: 32
In order to ensure that the data is “fully prepared,” we demonstrate how to use various data transformations, scaling, handling outliers, or any other statistical strategy. It is best practice to preprocess our data before performing analysis. Data must first be cleaned and transformed before being used for analysis and modeling.
Pre-processing
# data structure
glimpse(data)
## Rows: 19,735
## Columns: 32
## $ date <chr> "2016-01-11 17:00:00", "2016-01-11 17:10:00", "2016-01-11 …
## $ Appliances <int> 60, 60, 50, 60, 50, 60, 60, 70, 430, 250, 100, 90, 80, 140…
## $ lights <int> 30, 30, 30, 40, 40, 50, 40, 40, 50, 40, 10, 10, 30, 40, 20…
## $ T1 <dbl> 19.89000, 19.89000, 19.89000, 19.89000, 19.89000, 19.85667…
## $ RH_1 <dbl> 47.59667, 46.69333, 46.30000, 46.33333, 46.02667, 45.56000…
## $ T2 <dbl> 19.20000, 19.20000, 19.20000, 19.20000, 19.20000, 19.20000…
## $ RH_2 <dbl> 44.79000, 44.72250, 44.62667, 44.53000, 44.50000, 44.50000…
## $ T3 <dbl> 19.79000, 19.79000, 19.79000, 19.79000, 19.79000, 19.73000…
## $ RH_3 <dbl> 44.73000, 44.79000, 44.93333, 45.00000, 44.93333, 44.90000…
## $ T4 <dbl> 19.00000, 19.00000, 18.92667, 18.89000, 18.89000, 18.89000…
## $ RH_4 <dbl> 45.56667, 45.99250, 45.89000, 45.53000, 45.73000, 45.86333…
## $ T5 <dbl> 17.16667, 17.16667, 17.16667, 17.20000, 17.13333, 17.10000…
## $ RH_5 <dbl> 55.20000, 55.20000, 55.09000, 55.09000, 55.03000, 54.90000…
## $ T6 <dbl> 7.026667, 6.833333, 6.560000, 6.366667, 6.300000, 6.190000…
## $ RH_6 <dbl> 84.25667, 84.06333, 83.15667, 84.89333, 85.76667, 86.42333…
## $ T7 <dbl> 17.20000, 17.20000, 17.20000, 17.20000, 17.13333, 17.10000…
## $ RH_7 <dbl> 41.62667, 41.56000, 41.43333, 41.23000, 41.26000, 41.20000…
## $ T8 <dbl> 18.20000, 18.20000, 18.20000, 18.10000, 18.10000, 18.10000…
## $ RH_8 <dbl> 48.90000, 48.86333, 48.73000, 48.59000, 48.59000, 48.59000…
## $ T9 <dbl> 17.03333, 17.06667, 17.00000, 17.00000, 17.00000, 17.00000…
## $ RH_9 <dbl> 45.53000, 45.56000, 45.50000, 45.40000, 45.29000, 45.29000…
## $ T_out <dbl> 6.600000, 6.483333, 6.366667, 6.133333, 6.016667, 5.916667…
## $ Press_mm_hg <dbl> 733.5000, 733.6000, 733.7000, 733.9000, 734.0000, 734.1667…
## $ RH_out <dbl> 92.00000, 92.00000, 92.00000, 92.00000, 92.00000, 91.83333…
## $ Windspeed <dbl> 7.000000, 6.666667, 6.333333, 5.666667, 5.333333, 5.166667…
## $ Visibility <dbl> 63.00000, 59.16667, 55.33333, 47.66667, 43.83333, 40.00000…
## $ Tdewpoint <dbl> 5.300000, 5.200000, 5.100000, 4.900000, 4.800000, 4.683333…
## $ rv1 <dbl> 13.2754332, 18.6061950, 28.6426682, 10.0840966, 44.9194842…
## $ rv2 <dbl> 13.2754332, 18.6061950, 28.6426682, 10.0840966, 44.9194842…
## $ NSM <int> 61200, 61800, 62400, 63600, 64200, 65400, 66000, 66600, 68…
## $ WeekStatus <chr> "Weekday", "Weekday", "Weekday", "Weekday", "Weekday", "We…
## $ Day_of_week <chr> "Monday", "Monday", "Monday", "Monday", "Monday", "Monday"…
data <- data %>% rename('temp_kitchen' = 'T1',
'temp_living' = 'T2',
'temp_laundry' = 'T3',
'temp_office' = 'T4',
'temp_bath' = 'T5',
'temp_outside' = 'T6',
'temp_iron' ='T7',
'temp_teen' = 'T8',
'temp_parents' = 'T9',
'temp_station' = 'T_out',
'humid_kitchen' = 'RH_1',
'humid_living' = 'RH_2',
'humid_laundry' = 'RH_3',
'humid_office'= 'RH_4',
'humid_bath' = 'RH_5',
'humid_outside' = 'RH_6',
'humid_iron' = 'RH_7',
'humid_teen' = 'RH_8',
'humid_parents' = 'RH_9',
'humid_station' = 'RH_out',
'random_1' = 'rv1',
'random_2' = 'rv2')
# check missing value
colSums(is.na(data))
## date Appliances lights temp_kitchen humid_kitchen
## 0 0 0 0 0
## temp_living humid_living temp_laundry humid_laundry temp_office
## 0 0 0 0 0
## humid_office temp_bath humid_bath temp_outside humid_outside
## 0 0 0 0 0
## temp_iron humid_iron temp_teen humid_teen temp_parents
## 0 0 0 0 0
## humid_parents temp_station Press_mm_hg humid_station Windspeed
## 0 0 0 0 0
## Visibility Tdewpoint random_1 random_2 NSM
## 0 0 0 0 0
## WeekStatus Day_of_week
## 0 0
# remove duplicate
unique(data)
# remove row containing NA value
data <- data %>% filter(complete.cases(.))
data <- data %>% mutate(date = ymd_hms(date))
# summarise the data to 24h format
data <- data %>%
mutate(date = floor_date(date, "hour")) %>%
group_by(date) %>%
select_if(is.numeric) %>%
summarise_all("mean") %>%
ungroup() %>%
mutate(Day_of_week = wday(date, label = T),
WeekStatus = ifelse(Day_of_week %in% c("Sat", "Sun"), "weekend", "weekday") ,
WeekStatus = as.factor(WeekStatus),
Day_of_week = as.factor(Day_of_week))
Check data distribution of each predictor
data %>%
select_if(is.numeric) %>%
select(-c(Press_mm_hg, Appliances, NSM)) %>%
boxplot(main = 'Distribution of Each Predictor', xlab = 'Predictor', ylab = 'Values')
Our data can be visually examined to identify whether any outliers
are present. By requiring our model to accommodate them, outliers
impact the dependent variable we’re developing. As their names indicate,
outliers lie outside our model’s majority. The resolving capability of
our model might be reduced if we include outliers. We can observe from
the boxplot that some variables, such humid_laundry,
humid_station, and Press_mm_hg have noticeable
outliers.
Define Outlier
boxplot.stats(data$humid_laundry)$out
boxplot.stats(data$temp_office)$out
boxplot.stats(data$temp_kitchen)$out
boxplot.stats(data$humid_living)$out
out_hkit <- boxplot.stats(data$humid_kitchen)$out
out_tempout <- boxplot.stats(data$temp_outside)$out
out_humidstat <- boxplot.stats(data$humid_station)$out
out_tempbath <- boxplot.stats(data$temp_bath)$out
out_tempiron <- boxplot.stats(data$temp_iron)$out
out_hiron <- boxplot.stats(data$humid_iron)$out
out_tempstat <- boxplot.stats(data$temp_station)$out
out_tteen <- boxplot.stats(data$temp_teen)$out
out_hteen <- boxplot.stats(data$humid_teen)$out
boxplot.stats(data$humid_parents)$out
out_wind <- boxplot.stats(data$Windspeed)$out
out_dew <- boxplot.stats(data$Tdewpoint)$out
out_temppar <- boxplot.stats(data$temp_parents)$out
out_templiv <- boxplot.stats(data$temp_living)$out
out_press <- boxplot.stats(data$Press_mm_hg)$out
#data_outlier <- data %>% filter(humid_laundry >= 49.34867 | temp_office >= 26.00583 | temp_office <= 15.69000 | temp_kitchen >= 25.45700 | temp_kitchen <= 17.89000 | humid_parents <= 29.22833 | humid_kitchen >= 51.91067 | humid_kitchen <= 28.26944 | temp_bath >= out_tempbath | humid_iron >= out_hiron | temp_teen <= out_tteen | humid_teen >= out_hteen | Windspeed >= out_wind | Tdewpoint >= out_dew | temp_outside >= out_tempout | humid_station <= out_humidstat | temp_station >= out_tempstat | temp_living >= out_templiv | temp_parents >= out_temppar | Press_mm_hg <= out_press)
data_outlier <- data %>% filter(humid_laundry >= 49.47222 | temp_office >= 25.94310 | temp_office <= 15.69000 | temp_kitchen >= 25.47556 | temp_kitchen <= 16.79000 | humid_parents >= 53.14000 | humid_parents <= 29.48750 | humid_living >= 51.33444 | humid_living <= 26.18556 | humid_kitchen >= out_hkit |temp_bath >= out_tempbath | humid_iron >= out_hiron | temp_teen <= out_tteen | humid_teen >= out_hteen | temp_parents >= out_temppar | temp_outside >= out_tempout | humid_station <= out_humidstat | temp_station >= out_tempstat | temp_living >= out_templiv | Windspeed >= out_wind | Tdewpoint >= out_dew)
#data_outlier <- data %>% filter(humid_laundry >= 49.22667 | humid_laundry <= 29.00000 | temp_office >= 25.96333 | temp_office <= 15.66000 | temp_kitchen >= 25.39000 | temp_kitchen <= 17.79000 | humid_parents >= 53.16333 | humid_parents <= 29.16667 | humid_kitchen >= 52.12667 | humid_kitchen <= 27.36000 |temp_bath >= out_tempbath | humid_iron >= out_hiron | temp_teen <= out_tteen | humid_teen >= out_hteen | temp_outside >= out_tempout | humid_station <= out_humidstat | temp_station >= out_tempstat | temp_living >= out_templiv | Windspeed >= out_wind | Tdewpoint >= out_dew)
Outlier on Original Data Distribution
# check if the outliers has an influence to targeted variable
data %>%
mutate(date = as.numeric(date)) %>%
ggplot(aes(x = date, y = Appliances)) +
geom_point() +
geom_point(data = data_outlier, aes(x = date, y = Appliances), col = 'red') +
labs(
title = 'Distribution of Appliance : Original vs outlier (red)',
x = NULL,
y = 'Appliance') +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Check the outlier data
data_no_outlier <- data %>% filter(!date %in% data_outlier$date)
t.test(data$Appliances, data_outlier$Appliances)
##
## Welch Two Sample t-test
##
## data: data$Appliances and data_outlier$Appliances
## t = -2.0731, df = 610.83, p-value = 0.03858
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.4165796 -0.4442911
## sample estimates:
## mean of x mean of y
## 97.77913 106.20956
The p-value for the T-test is less than 0.05. This indicates that it
is safe to assume that the strengths of the standard and outlier groups
differ. Even though they might require a different inspection, let’s
eliminate the outliers for this project. We have created
data_no_outlier for our anlysis.
Distribution on Each Predictor
data_no_outlier %>%
select_if(is.numeric) %>%
pivot_longer(cols = -Appliances, names_to = 'predictor') %>%
ggplot(aes(x = value)) +
geom_density() +
facet_wrap(~predictor, scales = 'free_x') +
labs(
title = 'Density graph of each variable',
x = 'variable',
y = 'Frequency'
)
The graph shows that
humid_outside,random_1,random_2are all fairly uniform in shape. This can imply that these variables are combined freely and without followed by other variables.
Data Transformation
Let’s see the trend of our data for each predictor
data_no_outlier %>%
select_if(is.numeric) %>%
pivot_longer(cols = -Appliances, names_to = 'predictor') %>%
ggplot(aes(x = value, y = Appliances)) +
geom_point() +
geom_smooth(method = 'loess', formula = 'y ~ x') +
facet_wrap(~predictor, scales = 'free_x') +
labs(
title = 'Trends in each variable',
x = 'Variable',
y = 'Values'
)
According to the plots, we’re hardly to suspect the correlation
between each predictors to Appliances. lights are has the
most noticable correlation line and a minor positive correlation exists
between humid_bath, humid_kitchen,
temp_laundry``, andtemp_outside. There's also a minor negative correlation betweenNSM,humid_teen, andhumid_parent`.
With linear data, regression models perform the best. We can attempt to
modify the distribution to become more linear by transforming the
non-linear data.
Appliances as Log(Appliances)
ggplot(data = data_no_outlier, aes(x = (Appliances))) +
geom_density(fill = "pink", color = "black", bins = 30) +
labs( title = "Appliances Distribution", x = "Appliances")
## Warning: Ignoring unknown parameters: bins
> Transform Appliance
ggplot(data = data_no_outlier, aes(x = log(Appliances))) +
geom_density(fill = "pink", color = "black", bins = 30) +
labs( title = "Log(Appliances) Distribution", x = "Log(Appliances)")
## Warning: Ignoring unknown parameters: bins
We see that the two relation is more linear. We’ll persist this change
to our dataset.
data_log <- data_no_outlier %>% mutate(Appliances = log(Appliances))
recap <- data_log %>%
mutate(date = floor_date(date, "hour")) %>%
mutate(hour = hour(data_log$date),
month = month(data_log$date, label = T),
day = day(data_log$date)) %>% group_by(date, month, hour, day) %>%
summarise(Appliances = mean(Appliances)) %>%
ggplot(aes(x = hour , y = day, fill = Appliances)) +
geom_tile() +
facet_grid(~month) +
scale_y_continuous(breaks = seq(1,31)) +
scale_x_continuous(breaks = seq(0,24,6)) +
labs(title = "Appliance Recap",
x = "hour",
y = "date") +
theme(axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
ggplotly(recap) %>% plotly::layout(legend=list(x=0,
xanchor='left',
yanchor='bottom',
orientation='h'))
If we take a look at the plot there’s a pattern that lies on the data, and it was an outlier that we removed before.
Data Scaling
describe(data_log, fast = T)
## vars n mean sd min max range se
## date 1 2816 NaN NA Inf -Inf -Inf NA
## Appliances 2 2816 4.35 0.61 3.34 6.41 3.07 0.01
## lights 3 2816 3.72 6.81 0.00 51.67 51.67 0.13
## temp_kitchen 4 2816 21.62 1.45 17.46 25.48 8.01 0.03
## humid_kitchen 5 2816 40.11 3.70 31.50 52.81 21.31 0.07
## temp_living 6 2816 20.18 1.92 16.20 28.28 12.08 0.04
## humid_living 7 2816 40.45 3.79 26.40 51.11 24.71 0.07
## temp_laundry 8 2816 22.17 1.84 17.79 28.52 10.73 0.03
## humid_laundry 9 2816 39.15 3.17 29.70 48.59 18.89 0.06
## temp_office 10 2816 20.77 1.86 15.73 25.84 10.11 0.04
## humid_office 11 2816 38.87 4.23 29.42 50.75 21.32 0.08
## temp_bath 12 2816 19.51 1.69 15.37 24.33 8.97 0.03
## humid_bath 13 2816 50.96 8.62 36.06 94.88 58.82 0.16
## temp_outside 14 2816 7.37 5.53 -5.93 25.38 31.31 0.10
## humid_outside 15 2816 55.72 30.06 1.00 99.90 98.90 0.57
## temp_iron 16 2816 20.18 1.98 15.48 25.19 9.71 0.04
## humid_iron 17 2816 35.21 5.00 23.34 50.25 26.91 0.09
## temp_teen 18 2816 21.98 1.86 16.97 26.92 9.95 0.04
## humid_teen 19 2816 42.78 5.10 29.74 56.67 26.93 0.10
## temp_parents 20 2816 19.40 1.89 14.89 24.35 9.46 0.04
## humid_parents 21 2816 41.44 4.06 29.52 52.78 23.27 0.08
## temp_station 22 2816 6.93 4.83 -4.96 22.87 27.83 0.09
## Press_mm_hg 23 2816 755.60 7.37 729.38 772.26 42.88 0.14
## humid_station 24 2816 80.51 13.82 33.25 100.00 66.75 0.26
## Windspeed 25 2816 4.01 2.38 0.42 11.83 11.42 0.04
## Visibility 26 2816 38.45 11.24 1.00 66.00 65.00 0.21
## Tdewpoint 27 2816 3.50 4.03 -6.46 14.86 21.32 0.08
## random_1 28 2816 24.98 5.94 5.26 43.61 38.35 0.11
## random_2 29 2816 24.98 5.94 5.26 43.61 38.35 0.11
## NSM 30 2816 42171.31 25155.53 1500.00 84300.00 82800.00 474.04
## Day_of_week 31 2816 NaN NA Inf -Inf -Inf NA
## WeekStatus 32 2816 NaN NA Inf -Inf -Inf NA
Before Scaling
data_log %>%
select_if(is.numeric) %>%
pivot_longer(cols = -Appliances, names_to = 'predictor') %>%
group_by(predictor) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = predictor, y = value)) +
geom_col(fill = 'pink') +
labs(
title = 'Data Range Before Scaling',
x = 'Variable',
y = 'Value') +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
After scaling Scaling
Before we scale data_log, we need to remove non-numeric
column date and exclude Day_of_week,
WeekStatus as categorical variable
# data scaling
data_scale <- data_log %>% select(-date) %>% mutate(Appliances = as.numeric(Appliances))
data_scale[,-c(1,30,31)] <- scale(data_scale[,-c(1,30,31)])
data_scale %>% select_if(is.numeric) %>%
pivot_longer(cols = -Appliances, names_to = 'predictor') %>%
group_by(predictor) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = predictor, y = value)) +
geom_col(fill = 'pink') +
labs(
title = 'Data Range After Scaling',
x = 'Variable',
y = 'Values') +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
Data distribution after scaling
data_scale %>%
select_if(is.numeric) %>%
pivot_longer(cols = -Appliances, names_to = 'predictor') %>%
ggplot() +
geom_histogram(aes(x = value), bins = 15, color = 'black', fill = 'white') +
facet_wrap(~predictor, scales = 'free_x') +
labs(
title = 'Density graph of each variable',
x = 'variable',
y = 'Frequency'
)
- All humidity values except
humid_parentandhumid_outsidehas Normal distribution and for all temperature readings follow a Normal distribution except fortemp_parent. - Out of the remaining columns, we can see that
Visibility,Windspeedandlightsare skewed. - The random variables rv1 and rv2 has normal distribution.
Exploratoty Data Analysis
Correlation
# cek korelasi
ggcorr(data_scale, hjust =1, label = T)
data_scale %>% select_if(is.numeric) %>% cor() %>% as.data.frame() %>% arrange(-Appliances)
## Appliances lights temp_kitchen humid_kitchen temp_living
## Appliances 1.000000000 0.338810158 0.123073002 0.088989937 0.19961754
## NSM 0.374363865 0.281624927 0.152764480 0.005780070 0.25220072
## lights 0.338810158 1.000000000 -0.022275708 0.141555746 0.01081935
## temp_living 0.199617544 0.010819353 0.811870731 0.295341263 1.00000000
## temp_outside 0.193987993 -0.071498046 0.615034878 0.334121883 0.76418114
## temp_station 0.165003827 -0.069869396 0.646893516 0.370359661 0.75416594
## temp_laundry 0.130799441 -0.115428404 0.872943873 0.241785998 0.68517067
## Windspeed 0.129427459 0.068827632 -0.085919085 0.234827675 0.09091366
## temp_kitchen 0.123073002 -0.022275708 1.000000000 0.159542863 0.81187073
## temp_teen 0.117362573 -0.073567046 0.802327943 -0.068019785 0.52628677
## humid_kitchen 0.088989937 0.141555746 0.159542863 1.000000000 0.29534126
## temp_office 0.086184976 -0.004522683 0.846695564 0.100440692 0.71430258
## temp_iron 0.063357082 -0.152406913 0.805736368 -0.010583495 0.60164107
## temp_bath 0.058984335 -0.089771232 0.861934335 0.203059793 0.67229058
## temp_parents 0.037476740 -0.179815832 0.814566232 0.091586565 0.61722444
## Tdewpoint 0.023441319 -0.024895891 0.575618569 0.628441616 0.57627498
## humid_bath 0.008268389 0.183916803 -0.037004463 0.319135213 0.02128543
## humid_laundry -0.008041878 0.158408793 -0.036570275 0.849218214 0.14054798
## humid_office -0.011908013 0.145181904 0.100732696 0.887477521 0.26075115
## Visibility -0.014150338 0.025474189 -0.090203399 -0.014962428 -0.08356921
## random_1 -0.030979918 0.001358579 -0.011687987 -0.007658236 -0.01543808
## random_2 -0.030979918 0.001358579 -0.011687987 -0.007658236 -0.01543808
## Press_mm_hg -0.070264450 -0.033704651 -0.084687829 -0.306246430 -0.07895460
## humid_living -0.111080673 0.059884314 0.054498847 0.805236090 -0.11002630
## humid_iron -0.123515486 0.052737544 0.124889551 0.801498695 0.23195726
## humid_parents -0.131077364 -0.001783853 0.059069414 0.766849573 0.15653425
## humid_outside -0.154685092 0.173000537 -0.564183148 0.280753296 -0.51965963
## humid_teen -0.202931354 0.020576423 -0.006678625 0.730217552 0.07670443
## humid_station -0.249727435 0.077051026 -0.251152661 0.271540437 -0.42786381
## humid_living temp_laundry humid_laundry temp_office
## Appliances -0.111080673 0.130799441 -0.008041878 0.0861849759
## NSM -0.199860925 -0.002329452 -0.059030515 0.0370261361
## lights 0.059884314 -0.115428404 0.158408793 -0.0045226826
## temp_living -0.110026303 0.685170670 0.140547980 0.7143025752
## temp_outside 0.060089208 0.655965692 0.066213077 0.6188555135
## temp_station 0.117203671 0.670828907 0.118945279 0.6312303922
## temp_laundry 0.198367483 1.000000000 -0.037820928 0.8302983471
## Windspeed 0.066138523 -0.106071393 0.278863751 -0.1941049195
## temp_kitchen 0.054498847 0.872943873 -0.036570275 0.8466955643
## temp_teen -0.020937617 0.774206911 -0.323904208 0.7682776258
## humid_kitchen 0.805236090 0.241785998 0.849218214 0.1004406922
## temp_office 0.010921592 0.830298347 -0.161122899 1.0000000000
## temp_iron -0.014986664 0.824478982 -0.287228779 0.8588000143
## temp_bath 0.175725944 0.872596692 -0.065368458 0.8520224985
## temp_parents 0.102964976 0.888444971 -0.232070118 0.8747095018
## Tdewpoint 0.533363119 0.643640050 0.390777473 0.5209571473
## humid_bath 0.257108744 -0.102746764 0.395794977 -0.1093678753
## humid_laundry 0.671350802 -0.037820928 1.000000000 -0.1611228989
## humid_office 0.719741199 0.105189118 0.900927825 -0.0638188840
## Visibility -0.003266467 -0.119225127 0.019899570 -0.1273198653
## random_1 0.002999757 -0.011675317 0.001197525 0.0012824619
## random_2 0.002999757 -0.011675317 0.001197525 0.0012824619
## Press_mm_hg -0.280621916 -0.132264831 -0.238436319 0.0002931769
## humid_living 1.000000000 0.198367483 0.671350802 0.0109215921
## humid_iron 0.699409857 0.148108438 0.834355461 0.0288630551
## humid_parents 0.683285373 0.113647662 0.834583363 -0.0435234059
## humid_outside 0.374868862 -0.613625239 0.553741740 -0.6658859015
## humid_teen 0.674230049 0.030715427 0.826414870 -0.0979481533
## humid_station 0.555453812 -0.199760031 0.360528031 -0.3036779664
## humid_office temp_bath humid_bath temp_outside humid_outside
## Appliances -0.011908013 0.058984335 0.008268389 0.19398799 -0.15468509
## NSM -0.023686704 0.031563779 0.110060913 0.19750114 -0.17578499
## lights 0.145181904 -0.089771232 0.183916803 -0.07149805 0.17300054
## temp_living 0.260751147 0.672290584 0.021285432 0.76418114 -0.51965963
## temp_outside 0.263526893 0.588894475 -0.094249271 1.00000000 -0.64964087
## temp_station 0.307352680 0.613395779 -0.061475539 0.97167655 -0.61258926
## temp_laundry 0.105189118 0.872596692 -0.102746764 0.65596569 -0.61362524
## Windspeed 0.315195558 -0.152672716 0.074749351 0.21747107 0.08054052
## temp_kitchen 0.100732696 0.861934335 -0.037004463 0.61503488 -0.56418315
## temp_teen -0.200196458 0.801919966 -0.118687866 0.44706048 -0.64070977
## humid_kitchen 0.887477521 0.203059793 0.319135213 0.33412188 0.28075330
## temp_office -0.063818884 0.852022498 -0.109367875 0.61885551 -0.66588590
## temp_iron -0.166370183 0.845912962 -0.178980522 0.57645541 -0.72778213
## temp_bath 0.084238750 1.000000000 0.001797893 0.58889447 -0.58942228
## temp_parents -0.074041595 0.897183397 -0.179263106 0.63588057 -0.71296098
## Tdewpoint 0.599085050 0.584910825 0.067833298 0.77080866 -0.23185420
## humid_bath 0.370271152 0.001797893 1.000000000 -0.09424927 0.30882879
## humid_laundry 0.900927825 -0.065368458 0.395794977 0.06621308 0.55374174
## humid_office 1.000000000 0.084238750 0.370271152 0.26352689 0.42563882
## Visibility 0.005132886 -0.098715244 -0.022269130 -0.09238018 0.12143442
## random_1 -0.003551205 -0.015731681 -0.026978684 -0.03122955 0.02894372
## random_2 -0.003551205 -0.015731681 -0.026978684 -0.03122955 0.02894372
## Press_mm_hg -0.259407785 -0.118223183 -0.117107432 -0.11579712 -0.11632078
## humid_living 0.719741199 0.175725944 0.257108744 0.06008921 0.37486886
## humid_iron 0.889132160 0.132253430 0.343063528 0.23975900 0.40137249
## humid_parents 0.853357085 0.054633126 0.287252513 0.17112616 0.42903069
## humid_outside 0.425638821 -0.589422283 0.308828788 -0.64964087 1.00000000
## humid_teen 0.835057397 0.013045696 0.372867770 0.05807320 0.52443133
## humid_station 0.342855036 -0.182994600 0.201164997 -0.52149720 0.70230617
## temp_iron humid_iron temp_teen humid_teen temp_parents
## Appliances 0.063357082 -0.12351549 0.117362573 -0.202931354 0.037476740
## NSM 0.024079108 -0.17298403 0.091851925 -0.311957345 -0.043350089
## lights -0.152406913 0.05273754 -0.073567046 0.020576423 -0.179815832
## temp_living 0.601641073 0.23195726 0.526286769 0.076704433 0.617224439
## temp_outside 0.576455407 0.23975900 0.447060481 0.058073202 0.635880569
## temp_station 0.589922644 0.28971852 0.468479587 0.114208192 0.636476717
## temp_laundry 0.824478982 0.14810844 0.774206911 0.030715427 0.888444971
## Windspeed -0.187360047 0.22698814 -0.228599402 0.198473045 -0.180154695
## temp_kitchen 0.805736368 0.12488955 0.802327943 -0.006678625 0.814566232
## temp_teen 0.868592735 -0.16844947 1.000000000 -0.240286241 0.853508681
## humid_kitchen -0.010583495 0.80149870 -0.068019785 0.730217552 0.091586565
## temp_office 0.858800014 0.02886306 0.768277626 -0.097948153 0.874709502
## temp_iron 1.000000000 -0.07417770 0.868592735 -0.242928697 0.934703693
## temp_bath 0.845912962 0.13225343 0.801919966 0.013045696 0.897183397
## temp_parents 0.934703693 -0.00915467 0.853508681 -0.137904563 1.000000000
## Tdewpoint 0.443356745 0.61853606 0.375033488 0.479380802 0.571007814
## humid_bath -0.178980522 0.34306353 -0.118687866 0.372867770 -0.179263106
## humid_laundry -0.287228779 0.83435546 -0.323904208 0.826414870 -0.232070118
## humid_office -0.166370183 0.88913216 -0.200196458 0.835057397 -0.074041595
## Visibility -0.128428837 -0.01239976 -0.076267131 0.049486694 -0.119556156
## random_1 -0.007373707 0.01578497 -0.006246445 0.020173426 -0.001709659
## random_2 -0.007373707 0.01578497 -0.006246445 0.020173426 -0.001709659
## Press_mm_hg -0.036652769 -0.26719383 -0.107450351 -0.234625817 -0.100418861
## humid_living -0.014986664 0.69940986 -0.020937617 0.674230049 0.102964976
## humid_iron -0.074177704 1.00000000 -0.168449473 0.879577498 -0.009154670
## humid_parents -0.113428104 0.85068433 -0.197048587 0.849937828 -0.040893114
## humid_outside -0.727782134 0.40137249 -0.640709767 0.524431335 -0.712960984
## humid_teen -0.242928697 0.87957750 -0.240286241 1.000000000 -0.137904563
## humid_station -0.351235460 0.40655442 -0.243785404 0.507496100 -0.246834639
## humid_parents temp_station Press_mm_hg humid_station
## Appliances -0.131077364 0.16500383 -0.0702644501 -0.24972744
## NSM -0.278727068 0.21026313 0.0037037150 -0.35247376
## lights -0.001783853 -0.06986940 -0.0337046512 0.07705103
## temp_living 0.156534246 0.75416594 -0.0789546037 -0.42786381
## temp_outside 0.171126156 0.97167655 -0.1157971151 -0.52149720
## temp_station 0.222743195 1.00000000 -0.1149562281 -0.51791385
## temp_laundry 0.113647662 0.67082891 -0.1322648314 -0.19976003
## Windspeed 0.258639612 0.23913140 -0.2307022596 -0.21405190
## temp_kitchen 0.059069414 0.64689352 -0.0846878289 -0.25115266
## temp_teen -0.197048587 0.46847959 -0.1074503508 -0.24378540
## humid_kitchen 0.766849573 0.37035966 -0.3062464299 0.27154044
## temp_office -0.043523406 0.63123039 0.0002931769 -0.30367797
## temp_iron -0.113428104 0.58992264 -0.0366527690 -0.35123546
## temp_bath 0.054633126 0.61339578 -0.1182231833 -0.18299460
## temp_parents -0.040893114 0.63647672 -0.1004188615 -0.24683464
## Tdewpoint 0.517260212 0.80694036 -0.2328641832 0.08139534
## humid_bath 0.287252513 -0.06147554 -0.1171074319 0.20116500
## humid_laundry 0.834583363 0.11894528 -0.2384363187 0.36052803
## humid_office 0.853357085 0.30735268 -0.2594077848 0.34285504
## Visibility 0.013035001 -0.08726717 0.0500554251 0.09622750
## random_1 0.001197930 -0.03030349 -0.0063237489 0.04994755
## random_2 0.001197930 -0.03030349 -0.0063237489 0.04994755
## Press_mm_hg -0.178297011 -0.11495623 1.0000000000 -0.14177882
## humid_living 0.683285373 0.11720367 -0.2806219157 0.55545381
## humid_iron 0.850684330 0.28971852 -0.2671938309 0.40655442
## humid_parents 1.000000000 0.22274320 -0.1782970106 0.37400570
## humid_outside 0.429030692 -0.61258926 -0.1163207770 0.70230617
## humid_teen 0.849937828 0.11420819 -0.2346258173 0.50749610
## humid_station 0.374005695 -0.51791385 -0.1417788164 1.00000000
## Windspeed Visibility Tdewpoint random_1 random_2
## Appliances 0.129427459 -0.014150338 0.023441319 -0.030979918 -0.030979918
## NSM 0.111275681 -0.018052889 0.010466014 -0.043261213 -0.043261213
## lights 0.068827632 0.025474189 -0.024895891 0.001358579 0.001358579
## temp_living 0.090913659 -0.083569208 0.576274982 -0.015438083 -0.015438083
## temp_outside 0.217471069 -0.092380178 0.770808662 -0.031229545 -0.031229545
## temp_station 0.239131401 -0.087267169 0.806940356 -0.030303485 -0.030303485
## temp_laundry -0.106071393 -0.119225127 0.643640050 -0.011675317 -0.011675317
## Windspeed 1.000000000 -0.005644861 0.147252915 -0.021270171 -0.021270171
## temp_kitchen -0.085919085 -0.090203399 0.575618569 -0.011687987 -0.011687987
## temp_teen -0.228599402 -0.076267131 0.375033488 -0.006246445 -0.006246445
## humid_kitchen 0.234827675 -0.014962428 0.628441616 -0.007658236 -0.007658236
## temp_office -0.194104920 -0.127319865 0.520957147 0.001282462 0.001282462
## temp_iron -0.187360047 -0.128428837 0.443356745 -0.007373707 -0.007373707
## temp_bath -0.152672716 -0.098715244 0.584910825 -0.015731681 -0.015731681
## temp_parents -0.180154695 -0.119556156 0.571007814 -0.001709659 -0.001709659
## Tdewpoint 0.147252915 -0.042153669 1.000000000 -0.001942969 -0.001942969
## humid_bath 0.074749351 -0.022269130 0.067833298 -0.026978684 -0.026978684
## humid_laundry 0.278863751 0.019899570 0.390777473 0.001197525 0.001197525
## humid_office 0.315195558 0.005132886 0.599085050 -0.003551205 -0.003551205
## Visibility -0.005644861 1.000000000 -0.042153669 -0.015224406 -0.015224406
## random_1 -0.021270171 -0.015224406 -0.001942969 1.000000000 1.000000000
## random_2 -0.021270171 -0.015224406 -0.001942969 1.000000000 1.000000000
## Press_mm_hg -0.230702260 0.050055425 -0.232864183 -0.006323749 -0.006323749
## humid_living 0.066138523 -0.003266467 0.533363119 0.002999757 0.002999757
## humid_iron 0.226988142 -0.012399765 0.618536065 0.015784974 0.015784974
## humid_parents 0.258639612 0.013035001 0.517260212 0.001197930 0.001197930
## humid_outside 0.080540519 0.121434424 -0.231854203 0.028943717 0.028943717
## humid_teen 0.198473045 0.049486694 0.479380802 0.020173426 0.020173426
## humid_station -0.214051903 0.096227502 0.081395343 0.049947551 0.049947551
## NSM
## Appliances 0.374363865
## NSM 1.000000000
## lights 0.281624927
## temp_living 0.252200715
## temp_outside 0.197501139
## temp_station 0.210263131
## temp_laundry -0.002329452
## Windspeed 0.111275681
## temp_kitchen 0.152764480
## temp_teen 0.091851925
## humid_kitchen 0.005780070
## temp_office 0.037026136
## temp_iron 0.024079108
## temp_bath 0.031563779
## temp_parents -0.043350089
## Tdewpoint 0.010466014
## humid_bath 0.110060913
## humid_laundry -0.059030515
## humid_office -0.023686704
## Visibility -0.018052889
## random_1 -0.043261213
## random_2 -0.043261213
## Press_mm_hg 0.003703715
## humid_living -0.199860925
## humid_iron -0.172984034
## humid_parents -0.278727068
## humid_outside -0.175784992
## humid_teen -0.311957345
## humid_station -0.352473756
data_scale <- data_scale %>% select(-c(random_1, random_2, WeekStatus, Day_of_week))
The stronger the correlation, or how near 1 or -1 it is, the more
closely related the predictors are. The correlation matrix graphic above
shows the correlatiion on each variables. In our dataset,
humid_station and Appliances have the highest
negative correlations (-0.2) also NSM and
Appliances have the highest positive correlations (0.6)
NSM, lights, and temp_living
have the most significant positive Appliances relationships. This
indicates that the variables positively and substantially contribute to
Appliances. On the other hand, the most vital negative link is found
with water most negative correlation on the other hand.
Handling Outliers
Find outlier value
# Check the outlier and remove after scaling using zscore threshold point = 3
data_clean <- data_scale %>%
mutate( zscore = (Appliances - mean(Appliances)) / sd(Appliances)) %>%
filter(zscore <=3) %>%
select(-zscore)
boxplot(data_scale$Appliances)
Remove the outlier after scaling
boxplot(data_clean$Appliances)
Modeling with one predictor
model_scale <- lm(formula = Appliances ~ temp_living, data = data_scale)
model_clean <- lm(formula = Appliances ~ temp_living, data = data_clean)
Plot the difference between two data
plot(formula = Appliances ~ temp_living, data = data_scale)
abline(model_scale, col = "red")
abline(model_clean, col = "green")
High Leverage, Low Influence: Because the graph shows that the outlier of the Appliances variable is at High Leverage, Low influence, then we analyze from R-Squared.
R-squared
summary(model_scale)$r.squared
## [1] 0.03984716
summary(model_clean)$r.squared
## [1] 0.04166081
Since the original data data has the same r-square as
the scaled one, we decided to not using data_scale and
using data_clean beacause it has better rsqured
Model Fitting and Evaluation
Data Splitting
We now split the data into train and validation sets. The training set is used to train the model, which is checked against the validation set.
data_train <- data_clean[index,] data_validation <- data_clean[-index,]
library(rsample)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
index <- sample(nrow(data_clean), nrow(data_clean)*0.8)
data_train <- data_clean[index,]
data_validation <- data_clean[-index,]
Check the Data Split
set.seed(123)
control <- trainControl(method = "cv", number = 10)
se_model <- train(Appliances ~ ., data = data_train, method = "lm", trControl = control)
se_model
## Linear Regression
##
## 2243 samples
## 26 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2019, 2020, 2018, 2018, 2018, 2018, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4791869 0.3575817 0.3597521
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Model Fitting
Model with No Predictor
# model tanpa prediktor
model_none <- lm(formula = Appliances ~ 1, data = data_train)
Model with All Predictors
# model dengan semua prediktor
model_all <- lm(Appliances ~ ., data_train)
Variable Selection : Step-Wise Regression Model
We’ve built model.none that uses no predictor and model.all that uses all variables. Stepwise regression is a method to pick out the optimal model using the Akaika Information Criterion (AIC) as is metrics. The method optimizes the model for the least AIC, meaning the least information loss. Let’s try to pick the important variables using stepwise regression. It uses a greedy algorithm to find a local minima. Therefore, it does not guarantee the best model.
1. Backward
# stepwise regression: backward elimination
model_backward <- step(object = model_all,
direction = "backward",
trace = FALSE) # agar proses step-wise tidak ditampilkan
2. Forward
model_forward <- step(
object = model_none, # batas bawah
direction = "forward",
scope = list(upper = model_all), # batas atas
trace = FALSE) # tidak menampilkan proses step-wise
3. Both
model_both <- step(
object = model_none, # bawah batas
direction = "both",
scope = list(upper = model_all), # batas atas
trace = FALSE
)
Model Evaluation
We developed a model_none that doesn’t employ a model or predictor. All variables are used. The Akaike Information Criterion (AIC) and metrics are used stepwise regression to determine the best model. To minimize information loss, the technique optimizes the model for the lowest AIC. Let’s use stepwise regression to identify the crucial factors. To locate a local minimum, it employs a greedy method. As a result, it cannot assure the best model.
comparison <- compare_performance(model_none, model_all, model_backward, model_forward, model_both)
as.data.frame(comparison)
## Name Model AIC AIC_wt BIC BIC_wt R2
## 1 model_none lm 4058.833 2.016258e-217 4070.264 3.686885e-197 0.0000000
## 2 model_all lm 3075.912 5.534254e-04 3235.948 5.445635e-16 0.3695997
## 3 model_backward lm 3062.871 3.757549e-01 3177.183 3.139463e-03 0.3687675
## 4 model_forward lm 3063.729 2.446898e-01 3172.325 3.561941e-02 0.3679627
## 5 model_both lm 3062.854 3.790018e-01 3165.734 9.612411e-01 0.3676457
## R2_adjusted RMSE Sigma
## 1 0.0000000 0.5974693 0.5976025
## 2 0.3622033 0.4743772 0.4772584
## 3 0.3636586 0.4746902 0.4767135
## 4 0.3631337 0.4749927 0.4769101
## 5 0.3631005 0.4751118 0.4769226
Evaluation Function
eval_recap <- function(truth, estimate){
df_new <- data.frame(truth = truth,
estimate = estimate)
data.frame(RMSE = RMSE(estimate, truth),
MAE = MAE(estimate, truth),
"R-Square" = R2_Score(estimate, truth),
MAPE = MAPE(estimate, truth),
check.names = F
) %>%
mutate(MSE = sqrt(RMSE))
}
Model None - Evaluation
# Model None - Evaluation
pred_none_val <- predict(model_none, data_validation)
eval_recap(truth = data_validation$Appliances,
estimate = pred_none_val)
## RMSE MAE R-Square MAPE MSE
## 1 0.5884615 0.4832447 -0.001756201 0.1096837 0.7671124
Model All - Evaluation
pred_all_val <- predict(object = model_all, newdata = data_validation)
eval_recap(truth = data_validation$Appliances,
estimate = pred_all_val)
## RMSE MAE R-Square MAPE MSE
## 1 0.4399668 0.3276412 0.4400287 0.07398865 0.6632999
Model Step-Wise Backward - Evaluation
pred_backward_val <- predict(object = model_backward, newdata = data_validation)
eval_recap(truth = data_validation$Appliances,
estimate = pred_backward_val)
## RMSE MAE R-Square MAPE MSE
## 1 0.4404716 0.3266219 0.4387429 0.07373417 0.6636804
Model Step-Wise Both - Evaluation
pred_forward_val <- predict(object = model_forward, newdata = data_validation)
eval_recap(truth = data_validation$Appliances,
estimate = pred_forward_val)
## RMSE MAE R-Square MAPE MSE
## 1 0.4416871 0.3270429 0.435641 0.07383755 0.6645955
Model Step-Wise Both - Evaluation
pred_both_val <- predict(object = model_both, newdata = data_validation)
eval_recap(truth = data_validation$Appliances,
estimate = pred_both_val)
## RMSE MAE R-Square MAPE MSE
## 1 0.4413853 0.326661 0.436412 0.07370904 0.6643684
As shown above, model_all has the best evaluation score. Now, we’re check the linearity assumption
Checking Assumptions
Linear models are made with 4 assumptions. Before we carry on, we have to check whether these assumptions hold for our model.
Assumption of linearity
The assumption of linearity assumes that there exists a linear relationship between the predictors and the targe variable, so that our model can correctly describe it. A visual way to evaluate this is to plot the value of residues between our plot and the model.
Visualization of residual histogram using
hist(). function
# histogram residual
ggplot(data = as.data.frame(model_all$residuals), aes(x = model_all$residuals)) +
geom_histogram(fill = "#CC0000", color = "orange", bins = 30) +
labs( title = "Regression Residual Distribution", subtitle = "Log Transformation", x = "residual")
Statistics Test with `shapiro.test()``
Shapiro-Wilk hypothesis test:
- H0: Residuals are normal distributed
- H1: Residuals are not normally distributed (heteroscedastic)
# shapiro test dari residual
shapiro.test(model_all$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_all$residuals
## W = 0.93506, p-value < 2.2e-16
check_normality(model_all)
## Warning: Non-normality of residuals detected (p < .001).
Based on the result, the residuals are not normally distributed.
VIF : Independence of Variable
Multicollinearity is a condition with a strong correlation between predictors. This is undesirable because it indicates a redundant predictor in the model, which should be able to choose only one of the variables with a solid relationship. It is hoped that multicollinearity will not occur
Test the VIF (Variance Inflation Factor) with the vif()
function from the car package: * VIF value > 10:
multicollinearity occurs in the model * VIF value < 10: there is no
multicollinearity in the model
vif(model_all)
## lights temp_kitchen humid_kitchen temp_living humid_living
## 1.442851 18.213582 20.795985 26.760992 23.336718
## temp_laundry humid_laundry temp_office humid_office temp_bath
## 9.270430 11.910682 9.043777 17.693180 8.991944
## humid_bath temp_outside humid_outside temp_iron humid_iron
## 1.430813 30.144979 9.607118 15.281805 10.546148
## temp_teen humid_teen temp_parents humid_parents temp_station
## 7.480010 9.574910 25.975439 6.748515 233.710780
## Press_mm_hg humid_station Windspeed Visibility Tdewpoint
## 1.422226 74.571765 1.665932 1.063911 156.170709
## NSM
## 2.185402
The test result means there is has multicollinearity in the model
Homoscedasticity
Homoscedasticity assumption states that the error term in the relationship between the predictor and target variables is constant across all values of inputs. This assumption can be checked using the Breusch-Pagan test with hypotheses :
- H0: Value of error is the same across all inputs (homoscedastic)
- H1: Value of error is not the same across all range of inputs (heteroscedastic)
plot(x = model_all$fitted.values, y = model_all$residuals)
abline(h = 0, col = "#FF0000", ylab = 'Residuals', xlab = 'Prediction')
We can test the homoscedasticity of the model using the Breusch-Pagan test.
bptest(model_all)
##
## studentized Breusch-Pagan test
##
## data: model_all
## BP = 183.42, df = 26, p-value < 2.2e-16
Based on the result, the error are not same across all range of inputs.
Even though our linear model fails the tests, we can still try to conclude it. Our model’s mean average percentage error is a decent 0.074.
coef_all <- model_all$coefficients[-1]
barplot(coef_all, xlab = names(coef_all), main = 'Influence of `Model_all` Predictor', ylab = 'Value')
Model Interpretation and Improvement Ideas
We shouldn’t transform the data_train because we already did it before in the beginning such as scaling, tranforming several variable into log, or removing any outliers and we are not tranforming the targeted variabel into a scaled version, because we wont scaled back the Test Result in the end of our research.
One-Way ANOVA
anova_train <- aov(formula = Appliances ~ ., data = data_train)
summary(anova_train)
## Df Sum Sq Mean Sq F value Pr(>F)
## lights 1 86.8 86.81 381.099 < 2e-16 ***
## temp_kitchen 1 12.9 12.89 56.576 7.81e-14 ***
## humid_kitchen 1 0.5 0.48 2.121 0.14545
## temp_living 1 18.3 18.35 80.556 < 2e-16 ***
## humid_living 1 59.3 59.29 260.286 < 2e-16 ***
## temp_laundry 1 25.7 25.68 112.723 < 2e-16 ***
## humid_laundry 1 10.0 9.97 43.760 4.64e-11 ***
## temp_office 1 12.1 12.07 52.976 4.67e-13 ***
## humid_office 1 9.7 9.71 42.616 8.23e-11 ***
## temp_bath 1 0.1 0.15 0.641 0.42348
## humid_bath 1 0.0 0.01 0.052 0.81917
## temp_outside 1 8.0 8.03 35.234 3.38e-09 ***
## humid_outside 1 0.1 0.13 0.558 0.45510
## temp_iron 1 0.0 0.04 0.174 0.67628
## humid_iron 1 11.2 11.16 48.991 3.39e-12 ***
## temp_teen 1 1.8 1.79 7.864 0.00509 **
## humid_teen 1 23.9 23.91 104.973 < 2e-16 ***
## temp_parents 1 4.6 4.58 20.105 7.70e-06 ***
## humid_parents 1 0.0 0.05 0.211 0.64590
## temp_station 1 3.5 3.55 15.584 8.14e-05 ***
## Press_mm_hg 1 0.3 0.27 1.199 0.27361
## humid_station 1 0.3 0.33 1.455 0.22793
## Windspeed 1 2.1 2.07 9.079 0.00262 **
## Visibility 1 0.1 0.12 0.510 0.47528
## Tdewpoint 1 0.0 0.00 0.002 0.96260
## NSM 1 4.5 4.53 19.882 8.65e-06 ***
## Residuals 2216 504.8 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Random Foresttion
Create random forest model as
model_rf
set.seed(123)
model_rf <- randomForest(x = data_train %>% select(-Appliances),
y = data_train$Appliances,
ntree = 500)
model_rf
##
## Call:
## randomForest(x = data_train %>% select(-Appliances), y = data_train$Appliances, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 8
##
## Mean of squared residuals: 0.1511909
## % Var explained: 57.65
Check the summary and Predictor contribution on Targeted Variable
model_rf$finalModel
## NULL
varImp(model_rf)
## Overall
## lights 60.63114
## temp_kitchen 18.86222
## humid_kitchen 28.82037
## temp_living 25.80899
## humid_living 20.28900
## temp_laundry 29.35548
## humid_laundry 26.20662
## temp_office 21.15855
## humid_office 17.64511
## temp_bath 26.08773
## humid_bath 18.95060
## temp_outside 17.53418
## humid_outside 21.22572
## temp_iron 18.41549
## humid_iron 19.76505
## temp_teen 37.28446
## humid_teen 28.01687
## temp_parents 27.91468
## humid_parents 21.25766
## temp_station 14.94264
## Press_mm_hg 24.53724
## humid_station 30.06757
## Windspeed 16.03340
## Visibility 11.55638
## Tdewpoint 16.32909
## NSM 182.77964
Model Random Forest - Evaluation
pred_rf_val <- predict(object = model_rf, newdata = data_validation)
eval_recap(truth = data_validation$Appliances,
estimate = pred_rf_val)
## RMSE MAE R-Square MAPE MSE
## 1 0.3763291 0.271212 0.5903041 0.06049703 0.6134567
Random Forest Variable Importance on Targeted Variabel
library("tibble")
model_rf$importance %>%
as.data.frame() %>%
arrange(-IncNodePurity) %>%
rownames_to_column("variable") %>%
head(10) %>%
ggplot(aes(IncNodePurity,
reorder(variable, IncNodePurity))
) +
geom_col(fill = "firebrick") +
labs(x = "Importance",
y = NULL,
title = "Random Forest Variable Importance")
The plot above showing how big the influence of each predictor, top 3
predictor who correlate with Appliances is
NSM, lights and temp_teen
Lime Interpretation
library(lime)
set.seed(123)
explainer <- lime(x = data_validation %>% select(-Appliances),
model = model_rf)
model_type.randomForest <- function(x){
return("regression") # for regression problem
}
predict_model.randomForest <- function(x, newdata, type = "response") {
# return prediction value
predict(x, newdata) %>% as.data.frame()
}
# Select only the first 4 observations
selected_data <- data_validation %>%
select(-c(Appliances)) %>%
slice(1:4)
# Explain the model
set.seed(123)
explanation <- explain(x = selected_data,
explainer = explainer,
n_features = 27 # Number of features to explain the model
)
Since we’re using scaled data from the beginning, so to visualize
model_rf, we’re still using scaled data.
Random Forest Visualization dan Interpretation
plot_features(explanation = explanation)
Explanation Fit indicate how good LIME explain the model, kind of like the \(R^2\) (R-Squared) value of linear regression. Here we see the Explanation Fit only has values around 0.15-0.61 (15%-61%), which can be interpreted that LIME can only explain a little about our model. Twi of the cases reached the standard which >= 50% (0.5),Case 1 and 2 has explanation fit under 0.50. We also can summarise that Case 3 has the biggest Explanation, but Case 2 has the biggest Prediction.
Support Vector Machine
library(e1071)
model_svm <- svm(Appliances ~ ., data = data_train)
pred_svm_val <- predict(object = model_svm, newdata = data_validation)
eval_recap(truth = data_validation$Appliances,
estimate = pred_svm_val)
## RMSE MAE R-Square MAPE MSE
## 1 0.4085633 0.2748591 0.5171139 0.05969319 0.6391896
The SVR model has higher performance compared to any model that we made before. However, we will still use both model for further analysis both as comparison and as examples.
Lime Interpretation
# create the explanation for the SVR model.
set.seed(123)
explainer_svm <- lime(x = data_train %>% select(-Appliances),
model = model_svm)
# Create SVR model specification for lime.
model_type.svm <- function(x){
return("regression") # for regression problem
}
predict_model.svm <- function(x, newdata, type = "response") {
# return prediction value
predict(x, newdata) %>% as.data.frame()
}
Random Forest Visualization dan Interpretation
set.seed(123)
explanation_svm <- explain(x = selected_data,
explainer = explainer_svm,
kernel_width = 1,
feature_select = "auto", # Method of feature selection for lime
n_features = 10 # Number of features to explain the model
)
plot_features(explanation_svm)
Explanation Fit indicate how good LIME explain the model, kind of like the \(R^2\) (R-Squared) value of linear regression. Here we see the Explanation Fit only has values around 0.3-0.4 (30-40%), which can be interpreted that LIME can only explain a little about our model. None of the cases reached the standard which >= 50% (0.5). From all of the case, We also can summarise that Case 1 has the biggest Explanation and Prediction.
Conclusion
In this research project, we have examined various concrete formulations with different Appliancess. We developed a model that aligns to the available information. Utilizing model as a framework, we developed a fresh formulation and, being used to predicted the Appliances.
Throughout this project, we have employed a
Random Foreat Model. Compared to a standard regression, the
model better describes the data. As we have discovered, despite being
more complicated, it is a model which could be understood. The
prediction model implementing “model_rf” obtained MAE values of 6.0 and
R Square of 59% on validation dataset.